Sentence Simplification with Memory-Augmented Neural Networks

Sentence simplification aims to simplify the content and structure of complex sentences, and thus make them easier to interpret for human readers, and easier to process for downstream NLP applications. Recent advances in neural machine translation have paved the way for novel approaches to the task. In this paper, we adapt an architecture with augmented memory capacities called Neural Semantic Encoders (Munkhdalai and Yu, 2017) for sentence simplification. Our experiments demonstrate the effectiveness of our approach on different simplification datasets, both in terms of automatic evaluation measures and human judgments.


Introduction
The goal of sentence simplification is to rewrite complex sentences as simpler ones so that they are more comprehensible and accessible, while still retaining the original information content and meaning. Sentence simplification has a number of practical applications. On the one hand, it provides reading aids for people with limited language proficiency (Watanabe et al., 2009; Siddharthan, 2003) or for patients with linguistic and cognitive disabilities (Carroll et al., 1999). On the other hand, it can improve the performance of other NLP tasks (Chandrasekar et al., 1996; Knight and Marcu, 2000; Beigman Klebanov et al., 2004).
Inspired by the success of neural MT (Cho et al., 2014), recent work has started exploring neural simplification with sequence-to-sequence (Seq2seq) models, also referred to as encoder-decoder models. Nisioi et al. (2017) implemented a standard LSTM-based Seq2seq model and found that it outperformed phrase-based MT (PBMT), syntax-based MT (SBMT), and unsupervised lexical simplification approaches. Zhang and Lapata (2017) viewed the encoder-decoder model as an agent and employed a deep reinforcement learning framework in which the reward has three components capturing key aspects of the target output: simplicity, relevance, and fluency.
The common practice for Seq2seq models is to use recurrent neural networks (RNNs) with Long Short-Term Memory (LSTM, Hochreiter and Schmidhuber, 1997) or Gated Recurrent Units (GRU, Cho et al., 2014) for the encoder and decoder (Nisioi et al., 2017; Zhang and Lapata, 2017). These architectures were designed to be capable of memorizing long-term dependencies across sequences. Nevertheless, their memory is typically small and might not be enough for the simplification task, where one is confronted with long and complicated sentences.
In this study, we go beyond the conventional LSTM/GRU-based Seq2seq models and propose to use a memory-augmented RNN architecture called Neural Semantic Encoders (NSE). This architecture has been shown to be effective in a wide range of NLP tasks (Munkhdalai and Yu, 2017). The contribution of this paper is twofold: (1) First, we present a novel simplification model which is, to the best of our knowledge, the first model that uses a memory-augmented RNN for the task. We investigate the effectiveness of neural Seq2seq models when different neural architectures for the encoder are considered. Our experiments reveal that the NSELSTM model, which uses an NSE as the encoder and an LSTM as the decoder, performed the best among these models, improving over strong simplification systems. (2) Second, we perform an extensive evaluation of various approaches proposed in the literature on different datasets. Results of both automatic and human evaluation show that our approach is remarkably effective for the task, significantly reducing the reading difficulty of the input, while preserving grammaticality and the original meaning. We further discuss some advantages and disadvantages of these approaches.

Figure 1: Attention-based encoder-decoder model. The model may attend to relevant source information while decoding the simplification, e.g., to generate the target word "won" the model may attend to the source words "received", "nominated" and "Prize".

2 Neural Sequence to Sequence Models

Attention-based Encoder-Decoder Model
Our approach is based on an attention-based Seq2seq model (Bahdanau et al., 2015) (Figure 1). Given a complex source sentence $X = x_{1:T_x}$, the model learns to generate its simplified version $Y = y_{1:T_y}$. The encoder reads through $X$ and computes a sequence of hidden states $h_{1:T_x}$:

$$h_t = F_{enc}(h_{t-1}, x_t),$$

where $F_{enc}$ is a non-linear activation function (e.g., LSTM) and $h_t$ is the hidden state at time $t$. Each time the model generates a target word $y_t$, the decoder looks at a set of positions in the source sentence where the most relevant information is located. Specifically, another non-linear activation function $F_{dec}$ is used for the decoder, where the hidden state $s_t$ at time $t$ is computed by:

$$s_t = F_{dec}(s_{t-1}, y_{t-1}, c_t).$$

Here, the context vector $c_t$ is computed as a weighted sum of the hidden vectors $h_{1:T_x}$:

$$c_t = \sum_{i=1}^{T_x} \alpha_{ti} h_i, \qquad \alpha_{ti} = \frac{\exp(s_{t-1} \cdot h_i)}{\sum_{j=1}^{T_x} \exp(s_{t-1} \cdot h_j)},$$

where $\cdot$ is the dot product of two vectors. Generation is conditioned on $c_t$ and all the previously generated target words $y_{1:t-1}$:

$$P(Y \mid X) = \prod_{t=1}^{T_y} P(y_t \mid \{y_{1:t-1}\}, c_t), \qquad P(y_t \mid \{y_{1:t-1}\}, c_t) = G(y_{t-1}, s_t, c_t),$$

where $G$ is some non-linear function. The training objective is to minimize the cross-entropy loss of the training source-target pairs.
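
To make the attention step concrete, the following minimal sketch computes the attention weights and the context vector for a single decoding step in plain Python. It assumes that the weights are a softmax over dot-product scores between the previous decoder state and each encoder hidden state. The function names (`dot`, `softmax`, `attention_context`) and the toy vectors are hypothetical and serve only as an illustration.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(s_prev: Vector, encoder_states: List[Vector]) -> Tuple[List[float], Vector]:
    """Return the attention weights and the context vector for one decoding step.

    Assumption: weights are a softmax over dot products between the previous
    decoder state s_prev and every encoder hidden state.
    """
    alpha = softmax([dot(s_prev, h) for h in encoder_states])
    dim = len(encoder_states[0])
    # Context vector: attention-weighted sum of the encoder hidden states.
    context = [sum(a * h[d] for a, h in zip(alpha, encoder_states)) for d in range(dim)]
    return alpha, context


# Toy example with three source positions and 2-dimensional states (made-up values).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_prev = [2.0, 0.0]
alpha, c_t = attention_context(s_prev, H)
```

In this toy example the first and third source positions receive the same score and therefore the same weight, while the second position, which is orthogonal to the decoder state, receives less attention.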

Neural Semantic Encoders
An RNN allows us to compute a hidden state $h_t$ for each word that summarizes the preceding words $x_{1:t}$, but does not consider the following words $x_{t+1:T_x}$, which might also be useful for simplification. An alternative approach is to use a bidirectional RNN (Schuster and Paliwal, 1997). Here, we propose to use Neural Semantic Encoders (NSE, Munkhdalai and Yu, 2017). During each encoding time step $t$, we compute a memory matrix $M_t \in \mathbb{R}^{T_x \times D}$, where $D$ is the dimensionality of the word vectors. This matrix is initialized with the word vectors and is refined over time through NSE's functions to gain a better understanding of the input sequence. Concretely, NSE sequentially reads the tokens $x_{1:T_x}$ with its read function:

$$r_t = \mathrm{read}(r_{t-1}, x_t).$$

Then, a compose function is used to compose $r_t$ with relevant information retrieved from the memory at the previous time step, $M_{t-1}$:

$$c_t = \mathrm{compose}(r_t, m_t),$$

where compose is a multi-layer perceptron with one hidden layer, $c_t \in \mathbb{R}^{2D}$ is the output vector, and $m_t \in \mathbb{R}^{D}$ is a linear combination of the memory slots of $M_{t-1}$, weighted by $\sigma_{ti} \in \mathbb{R}$:

$$m_t = \sum_{i=1}^{T_x} \sigma_{ti} M_{t-1,i}, \qquad \sigma_{ti} = \frac{\exp(r_t \cdot M_{t-1,i})}{\sum_{j=1}^{T_x} \exp(r_t \cdot M_{t-1,j})}.$$

Here, $M_{t-1,i}$ is the $i$-th row of the memory matrix at time $t-1$, $M_{t-1}$. Next, a write function is used to map $c_t$ to the encoder output space:

$$w_t = \mathrm{write}(w_{t-1}, c_t).$$

Finally, the memory is updated accordingly. The retrieved memory content pointed to by $\sigma_{ti}$ is erased and the new content is added:

$$M_{t,i} = (1 - \sigma_{ti}) M_{t-1,i} + \sigma_{ti} w_t.$$

NSE gives us unrestricted access to the entire source sequence stored in the memory. As such, the encoder may attend to relevant words when encoding each word. The sequence $w_{1:T_x}$ is then used as the sequence $h_{1:T_x}$ in Section 2.1.
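
The memory access pattern described above can be sketched as a single encoding step in plain Python. This is an illustration under stated assumptions rather than the trained model: the retrieval weights are assumed to be a softmax over dot products between the read output and each memory slot, the update is assumed to interpolate between the old slot content and the written vector, and the read step is omitted (its output `r_t` is passed in directly). The helpers `toy_compose` and `toy_write` are hypothetical stand-ins for the multi-layer perceptron and the recurrent write function.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def nse_step(
    memory: List[Vector],
    r_t: Vector,
    compose: Callable[[Vector, Vector], Vector],
    write: Callable[[Vector], Vector],
) -> Tuple[Vector, List[Vector], List[float]]:
    """One encoding step: retrieve from memory, compose, write, update memory."""
    # Retrieval weights over memory slots (assumed: softmax of dot products).
    sigma = softmax([dot(r_t, slot) for slot in memory])
    dim = len(r_t)
    # m_t: weighted combination of the memory slots.
    m_t = [sum(s * slot[d] for s, slot in zip(sigma, memory)) for d in range(dim)]
    c_t = compose(r_t, m_t)
    w_t = write(c_t)
    # Erase the retrieved content and add the new content, slot by slot.
    new_memory = [
        [(1.0 - s) * old + s * new for old, new in zip(slot, w_t)]
        for s, slot in zip(sigma, memory)
    ]
    return w_t, new_memory, sigma


def toy_compose(r_t: Vector, m_t: Vector) -> Vector:
    """Stand-in for the MLP: concatenation, giving a vector of size 2D."""
    return r_t + m_t


def toy_write(c_t: Vector) -> Vector:
    """Stateless stand-in for the write function: average both halves (2D -> D)."""
    half = len(c_t) // 2
    return [(a + b) / 2.0 for a, b in zip(c_t[:half], c_t[half:])]


# Memory initialised with two made-up 2-dimensional "word vectors".
memory = [[1.0, 0.0], [0.0, 1.0]]
w_t, new_memory, sigma = nse_step(memory, [1.0, 0.0], toy_compose, toy_write)
```

Because the update returns a new matrix instead of modifying `memory` in place, the previous memory state remains available, which mirrors the distinction between the memory at step t − 1 and at step t in the description above.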

Decoding
We differ from the approach of Zhang and Lapata (2017) in that we implement both a greedy strategy and a beam-search strategy to generate the target sentence. Whereas the greedy decoder always chooses the simplification candidate with the highest log-probability, the beam-search decoder keeps a fixed number (the beam size) of the highest-scoring candidates at each time step. We report the best simplification among the outputs based on automatic evaluation measures.
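
The difference between the two decoding strategies can be illustrated with a small, self-contained sketch. It is a simplified variant (no length normalisation, finished hypotheses compete for beam slots) and makes no claim to match our decoder; the function `beam_search`, the end-of-sentence symbol `</s>`, and the hand-written `toy_model` distribution are all hypothetical. Greedy decoding corresponds to a beam size of 1.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[List[str], float]  # (tokens, cumulative log-probability)


def beam_search(
    next_probs: Callable[[List[str]], Dict[str, float]],
    beam_size: int,
    max_len: int = 10,
    eos: str = "</s>",
) -> Hypothesis:
    """Return the highest-scoring hypothesis found with the given beam size."""
    beams: List[Hypothesis] = [([], 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Expand every open hypothesis by every possible next token.
        candidates: List[Hypothesis] = []
        for tokens, score in beams:
            for token, prob in next_probs(tokens).items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the beam_size best candidates at this time step.
        candidates.sort(key=lambda hyp: hyp[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    return max(finished + beams, key=lambda hyp: hyp[1])


def toy_model(prefix: List[str]) -> Dict[str, float]:
    """Made-up next-token distribution in which the greedy choice is suboptimal."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, "w": 0.1},
    }
    return table.get(tuple(prefix), {eos_token: 1.0})


eos_token = "</s>"
greedy_tokens, greedy_score = beam_search(toy_model, beam_size=1)
beam_tokens, beam_score = beam_search(toy_model, beam_size=2)
```

In this toy distribution the greedy decoder commits to the locally best first token and ends with a sequence of probability 0.30, whereas a beam of size 2 keeps the second-best first token alive and recovers a sequence of probability 0.36.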

Models and Training Details
We implemented two attention-based Seq2seq models, namely: (1) LSTMLSTM: the encoder is implemented by two LSTM layers; (2) NSELSTM: the encoder is implemented by NSE. The decoder in both cases is implemented by two LSTM layers. For all experiments, our models have 300-dimensional hidden states and 300-dimensional word embeddings. Parameters were initialized from a uniform distribution over [-0.1, 0.1). We used the same hyperparameters across all datasets. Word embeddings were initialized either randomly or with GloVe vectors (Pennington et al., 2014) pre-trained on Common Crawl data (840B tokens), and fine-tuned during training. We used a vocabulary size of 20K for Newsela (https://newsela.com), and 30K for WikiSmall and WikiLarge. Our models were trained for a maximum of 40 epochs using the Adam optimizer (Kingma and Ba, 2015) with step size $\alpha = 0.001$ for LSTMLSTM and $0.0003$ for NSELSTM, and exponential decay rates $\beta_1 = 0.9$ and $\beta_2 = 0.999$. The batch size was set to 32. We used dropout (Srivastava et al., 2014) for regularization with a dropout rate of 0.3. For beam search, we experimented with beam sizes of 5 and 10. Following Jean et al. (2015), we replaced each out-of-vocabulary token <unk> generated at time step $t$ with the source word $x_k$ that has the highest alignment score $\alpha_{ti}$, i.e., $k = \arg\max_i \alpha_{ti}$.
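
The out-of-vocabulary replacement step can be sketched as follows. The sketch assumes that the attention weights for every output position are available as a list of rows, one per generated token; the function name `replace_unk`, the `<unk>` symbol, and the example sentence with its weights are hypothetical.

```python
from typing import List


def replace_unk(
    output_tokens: List[str],
    source_tokens: List[str],
    attention: List[List[float]],
    unk: str = "<unk>",
) -> List[str]:
    """Replace each unknown output token with the most strongly aligned source word.

    attention[t][i] is the alignment score between output position t
    and source position i.
    """
    result = []
    for t, token in enumerate(output_tokens):
        if token == unk:
            # k = argmax_i attention[t][i]
            k = max(range(len(source_tokens)), key=lambda i: attention[t][i])
            result.append(source_tokens[k])
        else:
            result.append(token)
    return result


# Made-up example: the first output token is unknown and aligns with "Stowell".
source = ["Stowell", "believes", "documents"]
output = ["<unk>", "thinks"]
weights = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]]
restored = replace_unk(output, source, weights)
```

Copying the aligned source word is most useful for rare tokens such as proper names, which usually should appear unchanged in the simplified sentence.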
Our models were tuned on the development sets, either with BLEU (Papineni et al., 2002) that scores the output by counting n-gram matches with the reference, or SARI (Xu et al., 2016) that compares the output against both the reference and the input sentence. Both measures are commonly used to automatically evaluate the quality of simplification output. We noticed that SARI should be used with caution when tuning neural Seq2seq simplification models. Since SARI depends on the differences between a system's output and the input sentence, large differences may yield very good SARI even though the output is ungrammatical. Thus, when tuning with SARI, we ignored epochs in which the BLEU score of the output is too low, using a threshold ς. We set ς to 22 on Newsela, 33 on WikiSmall, and 77 on WikiLarge.
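
The SARI-based model selection with a BLEU floor can be expressed as a simple filter over per-epoch development scores. The sketch below assumes that an epoch is ignored when its BLEU score falls strictly below the threshold; the function name `select_epoch`, the record layout, and the scores in the example are hypothetical and are not experimental results.

```python
from typing import Dict, List, Optional


def select_epoch(history: List[Dict[str, float]], bleu_threshold: float) -> Optional[int]:
    """Pick the epoch with the best SARI among epochs whose BLEU reaches the threshold.

    Each record is expected to hold the keys "epoch", "bleu" and "sari".
    Returns None when no epoch passes the BLEU threshold.
    """
    eligible = [rec for rec in history if rec["bleu"] >= bleu_threshold]
    if not eligible:
        return None
    best = max(eligible, key=lambda rec: rec["sari"])
    return int(best["epoch"])


# Made-up development scores for three epochs.
dev_history = [
    {"epoch": 1, "bleu": 20.0, "sari": 35.0},  # best SARI, but BLEU is too low
    {"epoch": 2, "bleu": 23.0, "sari": 28.0},
    {"epoch": 3, "bleu": 25.0, "sari": 29.0},
]
chosen = select_epoch(dev_history, bleu_threshold=22.0)
```

In this example the first epoch has the highest SARI but is discarded because its BLEU score is below the threshold, which is exactly the failure case the threshold is meant to guard against.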

Comparing Systems
We compared our models, tuned with either BLEU (-B) or SARI (-S), against systems reported by Zhang and Lapata (2017), namely DRESS, a deep reinforcement learning model; DRESS-LS, a combination of DRESS and a lexical simplification model (Zhang and Lapata, 2017); PBMT-R, a PBMT model with dissimilarity-based reranking (Wubben et al., 2012); HYBRID, a hybrid semantic-based model that combines a simplification model and a monolingual MT model (Narayan and Gardent, 2014); and SBMT-SARI, an SBMT model with simplification-specific components (Xu et al., 2016).

Evaluation
We measured BLEU and SARI at corpus level following Zhang and Lapata (2017). In addition, we evaluated system output by eliciting human judgments. Specifically, we randomly selected 40 sentences from each test set, and included human reference simplifications and corresponding simplifications from the systems above. We then asked three volunteers to rate simplifications with respect to Fluency (the extent to which the output is grammatical English), Adequacy (the extent to which the output has the same meaning as the input sentence), and Simplicity (the extent to which the output is simpler than the input sentence) using a five-point Likert scale.

Automatic Evaluation Measures
The results of the automatic evaluation are displayed in Table 2. We first discuss the results on Newsela, which contains high-quality simplifications composed by professional editors. In terms of BLEU, all neural models achieved much higher scores than PBMT-R and HYBRID. NSELSTM-B scored highest with a BLEU score of 26.31. With regard to SARI, NSELSTM-S scored best among neural models (29.58) and came close to the performance of HYBRID (30.00). This indicates that NSE offers an effective means to better encode complex sentences for sentence simplification.

Human Judgments
The results of human judgments are displayed in Table 3. On Newsela, NSELSTM-B scored highest on Fluency. PBMT-R was significantly better than all other systems on Adequacy while LSTMLSTM-S performed best on Simplicity. NSELSTM-B did very well on both Adequacy and Simplicity, and was best in terms of Average. Example model outputs on Newsela are provided in Table 4. On WikiSmall, NSELSTM-B performed best on both Fluency and Adequacy. On WikiLarge, LSTMLSTM-B achieved the highest Fluency score while NSELSTM-B received the highest Adequacy score. In terms of Simplicity and Average, NSELSTM-S outperformed all other systems on both WikiSmall and WikiLarge.
As shown in Table 3, neural models often outperformed traditional systems (PBMT-R, HYBRID, SBMT-SARI) on Fluency. This is not surprising given the recent success of neural Seq2seq models in language modeling and neural machine translation (Zaremba et al., 2014; Jean et al., 2015). On the downside, our manual inspection reveals that, among the rewrite operations (e.g., copying, deletion, reordering, substitution), neural models learn copying particularly well, often outputting the input sentence unchanged or only parts of it.
Finally, as can be seen in Table 3, REFERENCE scored lower on Adequacy compared to Fluency and Simplicity on Newsela. On Wikipedia-based datasets, REFERENCE obtained high Adequacy scores but much lower Simplicity scores compared to Newsela. This supports the assertion by previous work (Xu et al., 2015) that Simple English Wikipedia (SEW) has a large proportion of inadequate simplifications.

Table 4: Example model outputs on Newsela.
COMPLEX: Stowell believes that even documents about Lincoln 's death will give people a better understanding of the man who was assassinated 150 years ago this April .
REFERENCE: Stowell thinks that even information about Lincoln 's death will help people understand him .
PBMT-R: Stowell thinks that even documents about Lincoln 's death will give people a better understanding of the man who was killed 150 years ago this April .
HYBRID: documents that will give people a understanding the man was assassinated 150 years ago .
DRESS: Stowell thinks that even documents about Lincoln 's death will give people a better understanding of the man .
DRESS-LS: Stowell thinks that even documents about Lincoln 's death will give people a better understanding of the man .
LSTMLSTM-B: Stowell believes that only documents about Lincoln 's death will give people a better understanding .
NSELSTM-B: Stowell believes that the discovery about Lincoln 's death will give people a better understanding of the man .
LSTMLSTM-S: Stowell thinks that even documents about Lincoln 's death will give people a better understanding of the man .
NSELSTM-S: Stowell thinks that even papers about Lincoln 's death will give people a better understanding of the man .

Table 5 shows the correlations between the scores assigned by humans and the automatic evaluation measures. There is a significant positive correlation between Fluency and Adequacy (0.69), but a significant negative correlation between Adequacy and Simplicity (-0.64). BLEU correlates well with Fluency (0.63) and Adequacy (0.90), while SARI correlates well with Simplicity (0.73). BLEU and SARI show a significant negative correlation (-0.54). The results reflect the challenge of managing the trade-off between Fluency, Adequacy and Simplicity in sentence simplification.
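
As an illustration of how such correlations between human ratings and automatic measures can be computed, the sketch below implements Pearson's correlation coefficient in plain Python. Which correlation coefficient underlies Table 5 is not stated here, so the choice of Pearson's r is an assumption; the function name `pearson` and the score lists are hypothetical and do not reproduce any reported value.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson's r between two equally long sequences of scores.

    Raises ValueError for mismatched or too-short inputs; a constant
    sequence has zero variance and leads to a ZeroDivisionError.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up per-system scores: an automatic measure and two human criteria.
metric_scores = [20.0, 22.5, 25.0, 26.0]
adequacy_ratings = [2.8, 3.1, 3.4, 3.6]
simplicity_ratings = [3.9, 3.5, 3.2, 3.0]

r_adequacy = pearson(metric_scores, adequacy_ratings)
r_simplicity = pearson(metric_scores, simplicity_ratings)
```

With these toy numbers the metric rises together with the adequacy ratings and falls as the simplicity ratings rise, giving one positive and one negative coefficient, which is the kind of opposing trend discussed above.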

Conclusions
In this paper, we explore neural Seq2seq models for sentence simplification. We propose to use an architecture with augmented memory capacities which we believe is suitable for the task, where one is confronted with long and complex sentences. Results of both automatic and human evaluation on different datasets show that our model is capable of significantly reducing the reading difficulty of the input, while performing well in terms of grammaticality and meaning preservation.