Exploring Neural Text Simplification Models

We present the first attempt at using sequence to sequence neural networks to model text simplification (TS). Unlike the previously proposed automated TS systems, our neural text simplification (NTS) systems are able to simultaneously perform lexical simplification and content reduction. An extensive human evaluation of the output has shown that NTS systems achieve almost perfect grammaticality and meaning preservation of output sentences and a higher level of simplification than the state-of-the-art automated TS systems.


Introduction
Neural sequence to sequence models have been successfully used in many applications (Graves, 2012), from speech and signal processing to text processing or dialogue systems (Serban et al., 2015). Neural machine translation is a particular type of sequence to sequence model that has recently attracted a lot of attention from industry (Wu et al., 2016) and academia, especially due to its capability to obtain state-of-the-art results for various translation tasks (Bojar et al., 2016). Unlike classical statistical machine translation (SMT) systems (Koehn, 2010), neural networks are trained end-to-end, without the need for external decoders, language models or phrase tables. The architectures are simpler and more flexible, making it possible to use character models (Luong and Manning, 2016) or even to train multilingual systems in one go (Firat et al., 2016).
Automated text simplification (ATS) systems are meant to transform original texts into different (simpler) variants which would be understood by wider audiences and more successfully processed by various NLP tools. In the last several years, great attention has been given to addressing ATS as a monolingual machine translation problem, translating from 'original' to 'simple' sentences. So far, attempts have been made with standard phrase-based SMT (PBSMT) models (Specia, 2010; Štajner et al., 2015), PBSMT models with added phrasal deletion rules (Coster and Kauchak, 2011) or reranking of the n-best outputs according to their dissimilarity to the input (Wubben et al., 2012), tree-based translation models (Zhu et al., 2010; Paetzold and Specia, 2013), and syntax-based MT with a specially designed tuning function (Xu et al., 2016). Recently, lexical simplification (LS) was addressed by unsupervised approaches leveraging word embeddings, with reported good success (Glavaš and Štajner, 2015; Paetzold and Specia, 2016).
To the best of our knowledge, our work is the first to address the applicability of neural sequence to sequence models for ATS. We make use of the recent advances in neural machine translation (NMT) and adapt the existing architectures for our specific task. We also perform an extensive human evaluation to directly compare our systems with the current state-of-the-art (supervised) MT-based and unsupervised lexical simplification systems.

Neural Text Simplification (NTS)
We use the OpenNMT framework (Klein et al., 2017) to train and build our architecture with two LSTM layers (Hochreiter and Schmidhuber, 1997), hidden states of size 500 (500 hidden units), and a dropout probability of 0.3 (Srivastava et al., 2014). The vocabulary size is set to 50,000. We train the model for 15 epochs with a plain SGD optimizer, and after epoch 8 we halve the learning rate. At the end of each epoch we save the current state of the model and compute its perplexity on the development set. To avoid over-fitting, we employ early stopping and select the model from the epoch with the best perplexity. The parameters are initialized from a uniform distribution with support [-0.1, 0.1]. Additionally, for the decoder we employ global attention in combination with input feeding, as described by Luong et al. (2015). The architecture is depicted in Figure 1, with the input feeding approach represented only for the last hidden state of the decoder.

For the attention layer, we compute a context vector $c_t$ as a weighted average of the hidden states of the source sentence $\bar{h}_s$, using the alignment weights $a_t$. The new hidden state is obtained using a concatenation of the previous hidden state and the context vector, which in the formulation of Luong et al. (2015) reads:

$$\tilde{h}_t = \tanh(W_c [c_t; h_t])$$

The global alignment weights $a_t$ are computed with a softmax function over the general scoring method for attention:

$$a_t(s) = \frac{\exp(h_t^\top W_a \bar{h}_s)}{\sum_{s'} \exp(h_t^\top W_a \bar{h}_{s'})}$$

Input feeding is a process that sends the previous hidden state, obtained using the alignment method, to the input at the next step, presumably making the model keep track of earlier alignment decisions. Luong et al. (2015) showed that this approach can increase the evaluation scores for neural machine translation, while in our case, for monolingual data, we believe it can help to create better alignments.
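The attention step described above can be illustrated with a short plain-Python sketch. It assumes the "general" score and the tanh combination of Luong et al. (2015); the helper names, the toy dimension d = 2, and all weight values are hypothetical and are not taken from the OpenNMT implementation.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def global_attention(h_t, source_states, W_a, W_c):
    """Return (alignment weights a_t, context c_t, attentional state h~_t).

    h_t: current decoder hidden state (length d)
    source_states: encoder hidden states, one per source token (each length d)
    W_a: d x d matrix of the 'general' score h_t^T W_a h_s
    W_c: d x 2d matrix applied to the concatenation [c_t; h_t]
    """
    # score(h_t, h_s) = h_t^T W_a h_s for every source position s
    scores = [dot(h_t, matvec(W_a, h_s)) for h_s in source_states]
    a_t = softmax(scores)
    # context vector: attention-weighted average of the source states
    dim = len(h_t)
    c_t = [sum(a * h_s[i] for a, h_s in zip(a_t, source_states)) for i in range(dim)]
    # attentional hidden state computed from the concatenation [c_t; h_t]
    h_tilde = [math.tanh(x) for x in matvec(W_c, c_t + h_t)]
    return a_t, c_t, h_tilde

# Toy example with d = 2 and three source tokens (all values are made up).
identity = [[1.0, 0.0], [0.0, 1.0]]
W_c = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context, h_tilde = global_attention([2.0, 0.0], source, identity, W_c)
```

With input feeding, the vector `h_tilde` would additionally be concatenated with the decoder input at the next time step; that part is omitted here to keep the sketch short.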
Our approach does not involve the use of character-based models (Sennrich et al., 2015; Luong and Manning, 2016) to handle out-of-vocabulary words and entities. Instead, we make use of alignment probabilities between the predictions and the original sentences to retrieve the original words.

Word2vec Embeddings
Furthermore, we are interested in exploring whether large-scale pre-trained embeddings can improve text simplification models. Kauchak (2013) indicates that combining normal data with simplified data can increase the performance of ATS systems. Therefore, we construct a secondary model (NTS-w2v) using a combination of pre-trained word2vec embeddings of size 300 from the Google News corpus (Mikolov et al., 2013a) and locally trained embeddings of size 200. To ensure good representations of low-frequency words, we use word2vec (Řehůřek and Sojka, 2010; Mikolov et al., 2013b) to train a skip-gram model with hierarchical softmax, and we set a window of 10 words.
Following Garten et al. (2015), who showed that simple concatenation can improve the word representations, we construct two different sets of embeddings for the encoder and for the decoder. The former are constructed using word2vec trained on the original English texts combined with Google News, and the latter (decoder) embeddings are built from word2vec trained on the simplified version of the training data combined with Google News. To merge the local and global embeddings, we concatenate the representations for each word in the vocabulary, thus obtaining a new representation of size 500. If a word is missing in the global embeddings, we replace it with a sample from a Gaussian distribution with mean 0 and standard deviation 0.9. The remaining parameters are left unchanged from the previous model description.
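The merging step can be sketched as follows. This is only an illustration under stated assumptions: the function name `merge_embeddings` is hypothetical, the global vector is placed before the local one (the text does not specify the order), and every vocabulary word is assumed to have a locally trained vector.

```python
import random

def merge_embeddings(vocab, local_vecs, global_vecs, global_dim=300, std=0.9, seed=0):
    """Concatenate global (pre-trained) and local embeddings for each word.

    Words missing from the global embeddings receive a random global part
    sampled from a Gaussian with mean 0 and the given standard deviation.
    Assumption: the order [global; local] is an arbitrary choice made here.
    """
    rng = random.Random(seed)
    merged = {}
    for word in vocab:
        global_part = global_vecs.get(word)
        if global_part is None:
            global_part = [rng.gauss(0.0, std) for _ in range(global_dim)]
        merged[word] = list(global_part) + list(local_vecs[word])
    return merged

# Toy example with 3-dimensional global and 2-dimensional local vectors
# (the paper uses 300 and 200 dimensions, giving vectors of size 500).
local = {"cat": [0.1, 0.2], "falaj": [0.3, 0.4]}
pretrained = {"cat": [1.0, 2.0, 3.0]}
merged = merge_embeddings(["cat", "falaj"], local, pretrained, global_dim=3)
```

In this toy run, "falaj" is absent from the pre-trained vectors, so its first three components are random samples while its last two are the local vector.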

Prediction Ranking
To ensure the best predictions and the best simplified sentences at each step, we use beam search to sample multiple outputs from the two systems described previously. Beam search works by generating the first k hypotheses at each step, ordered by the log-likelihood of the target sentence given the input sentence. By default, we use a beam size of 5 and take the first hypothesis, but we also observe that higher beam sizes and lower-ranked hypotheses can generate good simplification results. Therefore, we generate the first two candidate hypotheses for each beam size from 5 to 12. We then attempt to find the best beam size and hypothesis based on two metrics: the traditional MT-evaluation metric, BLEU (Papineni et al., 2002; Bird et al., 2009) with NIST smoothing (Bird et al., 2009), and SARI (Xu et al., 2016), a recent text-simplification metric.
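The selection of a beam size and hypothesis rank can be sketched as a search over configurations scored on a tuning set. The sketch below is hypothetical: `select_configuration` and `toy_overlap` are illustrative names, and `toy_overlap` is a deliberately simple stand-in that is neither BLEU nor SARI; in practice a real implementation of those metrics would be passed in as `metric`.

```python
def toy_overlap(outputs, references):
    """Stand-in metric (NOT BLEU or SARI): mean share of output tokens
    that occur in at least one reference for the same sentence."""
    total = 0.0
    for out, refs in zip(outputs, references):
        tokens = out.split()
        ref_vocab = {tok for ref in refs for tok in ref.split()}
        total += sum(tok in ref_vocab for tok in tokens) / max(len(tokens), 1)
    return total / max(len(outputs), 1)

def select_configuration(candidates, references, metric):
    """Pick the (beam_size, hypothesis_rank) whose outputs score highest.

    candidates: {(beam_size, rank): [one output per tuning sentence]}
    references: [[reference simplifications] per tuning sentence]
    """
    return max(candidates, key=lambda cfg: metric(candidates[cfg], references))

# Toy tuning set with a single sentence and two references (made-up data).
references = [["the cat sat", "a cat sat"]]
candidates = {
    (5, 1): ["the feline was seated"],
    (5, 2): ["the cat sat"],
    (12, 1): ["the cat rested"],
}
best = select_configuration(candidates, references, toy_overlap)
```

In the setting described above, `candidates` would hold sixteen entries: hypotheses 1 and 2 for each beam size from 5 to 12.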

Dataset
To train our models, we use the publicly available dataset provided by Hwang et al. (2015), based on manual and automatic alignments between standard English Wikipedia and Simple English Wikipedia (EW-SEW). We discard the uncategorized matches and use only good matches and partial matches which were above the 0.45 threshold (Hwang et al., 2015), totaling 280K aligned sentences (around 150K full matches and 130K partial matches). It is one of the largest freely available resources for text simplification, and unlike the previously used EW-SEW corpus (Kauchak, 2013), which only contains full matches (167K pairs), the newer dataset also contains partial matches. Therefore, it is not only larger, but it also allows for learning sentence shortening (dropping irrelevant parts) transformations (see Table 3). We use the Stanford NER system (Finkel et al., 2005) to get an approximate number of locations, persons, organizations and miscellaneous entities in the corpus. A brief analysis of the vocabulary is given in Table 1.
The dataset we use contains a large number of named entities and consequently a large number of low-frequency words, but the majority of entities are not part of the model's 50,000-word vocabulary due to their low frequency. These words are replaced with 'UNK' symbols during training. At prediction time, we replace each unknown word with the source word that has the highest probability score in the attention layer. We believe it is important to ensure that the models learn good word representations, either during the model training or through word2vec, in order to accurately create alignments between source and target sentences.
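The replacement of unknown words at prediction time can be sketched as a lookup of the most attended source token. The function name `replace_unknowns` and the toy attention values are hypothetical; the sketch only assumes that one row of attention weights over the source tokens is available for each predicted token.

```python
def replace_unknowns(predicted_tokens, source_tokens, attention, unk="UNK"):
    """Replace each UNK in the prediction with the source token that
    received the highest attention weight at that decoding step.

    attention[t][s] is the weight of source position s at target step t.
    """
    output = []
    for token, weights in zip(predicted_tokens, attention):
        if token == unk:
            best_source = max(range(len(source_tokens)), key=weights.__getitem__)
            token = source_tokens[best_source]
        output.append(token)
    return output

# Toy example (made-up attention weights, one row per predicted token).
source = ["Falaj", "irrigation", "is", "ancient"]
predicted = ["UNK", "irrigation", "is", "old"]
attention = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
restored = replace_unknowns(predicted, source, attention)
```

Only the 'UNK' position is changed; the word "old" is kept even though its attention peaks on "ancient", because it was an in-vocabulary prediction.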
Given that in TS there is not only one best simplification, and that the quality of simplifications in Simple English Wikipedia has been disputed before (Amancio and Specia, 2014; Xu et al., 2015), for tuning and testing we use the dataset previously released by Xu et al. (2016), which contains 2000 sentences for tuning and 359 for testing, each with eight simplification variants obtained by eight Amazon Mechanical Turkers. The tune subset is also used as the reference corpus in combination with BLEU and SARI to select the best beam size and hypothesis for prediction reranking.

Evaluation
For the first 70 original sentences of the Xu et al. (2016) test set (https://github.com/cocoxu/simplification/), we perform three types of human evaluation to assess the output of our best systems and of three ATS systems with different architectures: (1) the PBSMT system with reranking of n-best outputs (Wubben et al., 2012), which represents the best PBSMT approach to ATS, trained and tuned over the same datasets as our systems; (2) the state-of-the-art SBMT system (Xu et al., 2016) with a modified tuning function (using SARI) and the PPDB paraphrase database (Ganitkevitch et al., 2013); and (3) one of the state-of-the-art unsupervised lexical simplification (LS) systems that leverages word embeddings (Glavaš and Štajner, 2015). For the first two systems, we use the publicly available output at https://github.com/cocoxu/simplification/tree/master/data/systemoutputs. None of the 359 test sentences was present in the datasets we used for training and tuning.
Correctness and Number of Changes. First, we count the total number of changes made by each system (Total), counting the change of a whole phrase (e.g. "become defunct" → "was dissolved") as one change. Those changes that preserve the original meaning and grammaticality of the sentence (assessed by two native English speakers) and, at the same time, make the sentence easier to understand (assessed by two non-native fluent English speakers) are marked as Correct. In the case of content reduction, we instructed the annotators to count the deletion of each array of consecutive words as one change and consider the meaning unchanged if the main information of the sentence was retained and unchanged. The sentences for which the two annotators did not agree were given to a third annotator to obtain the majority vote.
Grammaticality and Meaning Preservation. Second, three native English speakers rate the grammaticality (G) and meaning preservation (M) of each (whole) sentence with at least one change on a 1-5 Likert scale (1 = very bad; 5 = very good). The obtained inter-annotator agreement (quadratic Cohen's kappa) was 0.78 for G and 0.63 for M.
Simplicity of sentences. Third, three non-native fluent English speakers were shown original (reference) sentences and target (output) sentences, one pair at a time, and asked whether the target sentence is much simpler (+2), somewhat simpler (+1), equally difficult (0), somewhat more difficult (-1), or much more difficult (-2) than the reference sentence. The obtained inter-annotator agreement (quadratic Cohen's kappa) was 0.66.
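For readers unfamiliar with the agreement measure, a minimal sketch of quadratically weighted Cohen's kappa for two raters is given below. The function name and the toy ratings are hypothetical, and the text does not state how agreement was aggregated over the three annotators, so the sketch covers only a single pair of raters.

```python
def quadratic_weighted_kappa(ratings_a, ratings_b, categories):
    """Quadratically weighted Cohen's kappa for two raters.

    categories must list the ordinal scale in order, e.g. [1, 2, 3, 4, 5].
    Undefined (division by zero) if both raters always use one same category.
    """
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(ratings_a)
    # observed[i][j]: share of items rated i by rater A and j by rater B
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement weight
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row[i] * col[j]
    return 1.0 - weighted_observed / weighted_expected

# Toy ratings on the 1-5 scale (made-up data).
scale = [1, 2, 3, 4, 5]
same = quadratic_weighted_kappa([1, 2, 3, 4, 5], [1, 2, 3, 4, 5], scale)
reversed_kappa = quadratic_weighted_kappa([1, 2, 3, 4, 5], [5, 4, 3, 2, 1], scale)
```

The same function applies to the simplicity scale by passing `[-2, -1, 0, 1, 2]` as `categories`.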
While the correctness of changes takes into account the influence of each individual change on grammaticality, meaning and simplicity of a sentence, the Scores (G and M) and Rank (S) take into account the mutual influence of all changes within a sentence.

Results and Discussion
The results of the human evaluation (Table 2) revealed that all NTS models achieve a higher percentage of correct changes and more simplified output than any of the state-of-the-art ATS systems with different architectures (PBSMT-R, SBMT, and LightLS). We also notice that the best models according to BLEU are obtained with hypothesis 1 and the maximum beam size for both models, while the SARI re-ranker prefers hypothesis 2 and beam size 5 for the first NTS and the maximum beam size for the custom word embeddings model.
The NTS with custom word2vec embeddings ranked with the text simplification specific metric (SARI) obtained the highest total number of changes among the neural systems, one of the highest percentages of correct changes, the second highest simplicity score, and solid grammaticality and meaning preservation scores. An example of the output of different systems is presented in Table 4 (Appendix A).
The use of different metrics for ranking the NTS predictions optimizes the output towards different evaluation objectives: SARI leads to the highest number of total changes, BLEU to the highest percentage of correct changes, and the default beam scores to the best grammaticality (G) and meaning preservation (M). In addition, custom composed global and local word embeddings in combination with the SARI metric improve the default translation system, given the joint scores for each evaluation criterion.
It is important to note here that for ATS systems, the precision of the system (correctness of changes, grammaticality, meaning preservation, and simplicity of the output) is more important than the recall (the total number of changes made). Low recall would just leave the sentences similar to their originals, thus not much improving the understanding or reading speed of the target users, or the NLP systems in which ATS is used as a pre-processing step. Low precision, on the other hand, would make texts even more difficult to read and understand, and would worsen the performance of the NLP systems in which ATS is used as a pre-processing step.

Conclusions
We presented a first attempt at modelling sentence simplification with a neural sequence to sequence model. Our extensive human evaluation showed that our NTS systems, if the output is ranked with the right metric, can significantly (Wilcoxon's signed rank test, p < 0.001) outperform the best phrase-based and syntax-based MT approaches, as well as the unsupervised lexical ATS approach, in grammaticality, meaning preservation and simplicity of the output sentences, and in the percentage of correct transformations, while at the same time achieving more than 1.5 changes per sentence, on average. Furthermore, we discovered that NTS systems are capable of correctly performing significant content reduction, thus being the only TS models proposed so far which can jointly perform lexical simplification and content reduction.

Match | Transformation | Sentence pair
Full | syntactic simplification; reordering of sentence constituents | "During the 13th century, gingerbread was brought to Sweden by German immigrants." and "German immigrants brought it to Sweden during the 13th century."
Full | lexical paraphrasing | "During the 13th century, gingerbread was brought to Sweden by German immigrants." and "German immigrants brought it to Sweden during the 13th century."
Partial | strong paraphrasing | "Gingerbread foods vary, ranging from a soft, moist loaf cake to something close to a ginger biscuit." and "Gingerbread is a word which describes different sweet food products from soft cakes to a ginger biscuit."
Partial | adding explanations | "Humidity is the amount of water vapor in the air." and "Humidity (adjective: humid) refers to water vapor in the air, but not to liquid droplets in fog, clouds, or rain."
Partial | sentence compression; dropping irrelevant information | "Falaj irrigation is an ancient system dating back thousands of years and is used widely in Oman, the UAE, China, Iran and other countries." and "The ancient falaj system of irrigation is still in use in some areas."
Table 3: Examples of full and partial matches from the EW-SEW dataset (Hwang et al., 2015).