EditNTS: A Neural Programmer-Interpreter Model for Sentence Simplification through Explicit Editing

We present the first sentence simplification model that learns explicit edit operations (ADD, DELETE, and KEEP) via a neural programmer-interpreter approach. Most current neural sentence simplification systems are variants of sequence-to-sequence models adopted from machine translation. These methods learn to simplify sentences as a byproduct of the fact that they are trained on complex-simple sentence pairs. By contrast, our neural programmer-interpreter is directly trained to predict explicit edit operations on targeted parts of the input sentence, resembling the way that humans perform simplification and revision. Our model outperforms previous state-of-the-art neural sentence simplification models (without external knowledge) by large margins on three benchmark text simplification corpora in terms of SARI (+0.95 WikiLarge, +1.89 WikiSmall, +1.41 Newsela), and is judged by humans to produce overall better and simpler output sentences.


Introduction
Sentence simplification aims to reduce the reading complexity of a sentence while preserving its meaning. Simplification systems can benefit populations with limited literacy skills (Watanabe et al., 2009), such as children, second language speakers and individuals with language impairments including dyslexia (Rello et al., 2013), aphasia (Carroll et al., 1999) and autism (Evans et al., 2014).
Inspired by the success of machine translation, many text simplification (TS) systems treat sentence simplification as a monolingual translation task, in which complex-simple sentence pairs are presented to the models as source-target pairs (Zhang and Lapata, 2017). Two major machine translation (MT) approaches have been adapted into TS systems, each with its advantages: statistical machine translation (SMT)-based models (Zhu et al., 2010; Wubben et al., 2012; Narayan and Gardent, 2014; Xu et al., 2016) can easily integrate human-curated features into the model, while neural machine translation (NMT)-based models (Nisioi et al., 2017; Zhang and Lapata, 2017) can operate in an end-to-end fashion by extracting features automatically. Nevertheless, MT-based models must implicitly learn the simplifying operations that are embedded in the parallel complex-simple sentences. These operations are relatively infrequent, as a large part of the original complex sentence usually remains unchanged in the simplification process. As a result, MT-based models often produce outputs that are identical to the inputs (Zhao et al., 2018), which is also confirmed in our experiments.
We instead propose a novel end-to-end Neural Programmer-Interpreter (Reed and de Freitas, 2016) that learns to explicitly generate edit operations in a sequential fashion, resembling the way that a human editor might perform simplifications on sentences. Our proposed framework consists of a programmer and an interpreter that operate alternately at each time step: the programmer predicts a simplifying edit operation (program) such as ADD, DELETE, or KEEP; the interpreter executes the edit operation while maintaining a context and an edit pointer to assist the programmer for further decisions. Table 1 shows sample runs of our model. Intuitively, our model learns to skip words that do not need to be modified by predicting KEEP, so it can focus on simplifying the parts that actually require changes. An analogy can be drawn to residual connections popular in deep neural architectures for image recognition, which give models the flexibility to directly copy parameters from previous layers if they are not the focus of the visual signal (He et al., 2016). In addition, the edit operations generated by our model are easier to interpret than the black-box MT-based seq2seq systems: by looking at our model's generated programs, we can trace the simplification operations used to transform complex sentences to simple ones. Moreover, our model offers control over the ratio of simplification operations. By simply changing the loss weights on edit operations, our model can prioritize different simplification operations for different sentence simplification tasks (e.g., compression or lexical replacement).
Table 1 (reference row only): clark said that schools do sometimes lower fees for students who do n't have enough money .
The idea of learning sentence simplification through edit operations has been attempted in earlier work, which mainly focused on creating better-aligned simplification edit labels ("silver" labels) and showed that a simple sequence labelling model (BiLSTM) fails to predict these silver simplification labels. We speculate that the limited success of that model is due to the facts that it relies on an external system and assumes that the edit operations are independent of each other. We address these two problems by 1) using variants of Levenshtein distances to create edit labels that do not require external tools to execute; and 2) using an interpreter to execute the programs and summarize the partial output sequence immediately before making the next edit decision. Our interpreter also acts as a language model to regularize the operations that would lead to ungrammatical outputs, as a programmer alone will output edit labels with little consideration of context and grammar. In addition, our model is completely end-to-end and does not require any extra modules.
Our contributions are two-fold: 1) we propose to model the edit operations explicitly for sentence simplification in an end-to-end fashion, rather than relying on MT-based models to learn the simplification mappings implicitly, as such models often generate outputs by blindly repeating the source sentences; 2) we design an NPI-based model that simulates the editing process with a programmer and an interpreter, which outperforms the state-of-the-art neural MT-based TS models by large margins in terms of SARI and is judged by humans as simpler and overall better.

Related Work
MT-based Sentence Simplification. SMT-based models and NMT-based models have been the main approaches for sentence simplification. They rely on learning simplification rewrites implicitly from complex-simple sentence pairs. For SMT-based models, Zhu et al. (2010) adopt a tree-based SMT model for sentence simplification; Woodsend and Lapata (2011) propose a quasi-synchronous grammar and use integer linear programming to score the simplification rules; Wubben et al. (2012) employ a phrase-based MT model to obtain candidates and re-rank them based on the dissimilarity to the complex sentence; Narayan and Gardent (2014) develop a hybrid model that performs sentence splitting and deletion first and then re-ranks the outputs similarly to Wubben et al. (2012); Xu et al. (2016) propose SBMT-SARI, a syntax-based machine translation framework that uses an external knowledge base to encourage simplification. On the other side, many NMT-based models have also been proposed for sentence simplification: Nisioi et al. (2017) employ vanilla recurrent neural networks (RNNs) on text simplification; Zhang and Lapata (2017) propose to use reinforcement learning methods on RNNs to optimize a specifically designed reward based on simplicity, fluency and relevancy; memory-augmented neural networks have also been incorporated for sentence simplification; Zhao et al. (2018) integrate the transformer architecture and PPDB rules to guide the simplification learning; Sulem et al. (2018b) combine neural MT models with sentence splitting modules for sentence simplification.

Edit-based Sentence Simplification
The only previous work on sentence simplification by explicitly predicting simplification operations uses MASSAlign to obtain 'silver' labels for simplification edits and employs a BiLSTM to sequentially predict three of these silver labels: KEEP, REPLACE and DELETE. Essentially, that labelling model is a non-autoregressive classifier with three classes, where a downstream module is required for applying the REPLACE operation and providing the replacement word. We instead propose an end-to-end neural programmer-interpreter model for sentence simplification, which relies on neither external simplification rules nor alignment tools.
Neural Programmer-Interpreter Models. The neural programmer-interpreter (NPI) was first proposed by Reed and de Freitas (2016) as a machine learning model that learns to execute programs given their execution traces. Their experiments demonstrate success on 21 tasks, including performing addition and bubble sort. It was adopted by Ling et al. (2017) to solve algebraic word problems and by Bérard et al. (2017) and Vu and Haffari (2018) to perform automatic post-editing on machine translation outputs. We instead design our NPI model to take monolingual complex input sentences and learn to perform simplification operations on them.

Model
Conventional sequence-to-sequence learning models map a sequence $x = x_1, \ldots, x_{|x|}$ to another one $y = y_1, \ldots, y_{|y|}$, where elements of $x$ and $y$ are drawn from a vocabulary of size $V$, by modeling the conditional distribution $P(y_t \mid y_{1:t-1}, x)$ directly. Our proposed model, EditNTS, tackles sentence simplification in a different paradigm by learning the simplification operations explicitly. An overview of our model is shown in Figure 1.

EditNTS Model
EditNTS frames the simplification process as executing a sequence of edit operations on complex tokens monotonically. We define the edit operations as {ADD(W), KEEP, DELETE, STOP}. Similar to sequence-to-sequence learning models, we assume a fixed-size vocabulary of $V$ words that can be added. Therefore, the number of prediction candidates of the programmer is $V + 3$ after including KEEP, DELETE, and STOP. To solve the out-of-vocabulary (OOV) problem, conventional Seq2Seq models utilize a copy mechanism (Gu et al., 2016) that selects a word from the source (complex) sentence directly with a trainable pointer. In contrast, EditNTS can copy OOV words into the simplified sentences by directly learning to predict KEEP on them in complex sentences. We argue that our method has two advantages over a copy mechanism: 1) our method does not need extra parameters for copying; 2) a copy mechanism may lead to the model copying blindly rather than performing simplifications.
We detail other constraints on the edit operations in Section 3.2. It turns out that the sequence of edit operations $z$ constructed in Section 3.2 is deterministic given $x$ and $y$ (an example of $z$ can be seen in Table 2). Consequently, EditNTS can learn to simplify by modelling the conditional distribution $P(z \mid x)$ with a programmer, an interpreter and an edit pointer:
$$P(z \mid x) = \prod_{t=1}^{|z|} P(z_t \mid y_{1:j_{t-1}}, z_{1:t-1}, x_{k_t}, x) \qquad (1)$$
At time step $t$, the programmer decides an edit operation $z_t$ on the word $x_{k_t}$, which is assigned by the edit pointer, based on the following contexts: 1) the summary of the partially edited text $y_{1:j_{t-1}}$, 2) the previously generated edit operations $z_{1:t-1}$, and 3) the complex input sentence $x$. The interpreter then executes the edit operation $z_t$ into a simplified token $y_{j_t}$ and updates the interpreter context based on $y_{1:j_t}$ to help the programmer at the next time step. The model is trained to maximize Equation 1, where $z$ is the expert edit sequence created in Section 3.2. We detail the components and functions of the programmer and the interpreter hereafter.
Figure 1: Our model contains two parts: the programmer and the interpreter. At time step $t$, the programmer predicts an edit operation $z_t$ on the complex word $x_{k_t}$ by considering the interpreter-generated words $y_{1:j_{t-1}}$, programmer-generated edit labels $z_{1:t-1}$, and a context vector $c_t$ obtained by attending over all words in the complex sentence. The interpreter executes the edit operation $z_t$ to generate the simplified token $y_{j_t}$ and provides the interpreter context $y_{1:j_t}$ to the programmer for the next decision.
Programmer. The programmer employs an encoder-decoder structure to generate programs, i.e., sequences of edit operations $z$. An encoder transforms the input sentence $x = x_1, \ldots, x_{|x|}$ into a sequence of latent representations $h^{enc}_i$. We additionally utilize the part-of-speech (POS) tags $g = g_1, \ldots, g_{|x|}$ to inject the syntactic information of sentences into the latent representations. The specific transformation process is given in Eq. 2, where $e_1(\cdot)$ and $e_2(\cdot)$ are both look-up tables. The decoder is trained to predict the next edit label $z_t$ (Eq. 3), given the vector representation $h^{enc}_{k_t}$ for the word $x_{k_t}$ that currently needs to be edited (Eq. 2), the vector representation $h^{edit}_t$ of the previously generated edit labels $z_{1:t-1}$ (Eq. 4), the source context vector $c_t$ (Eq. 5), and the vector representation of the words $y_{1:j_{t-1}}$ previously generated by the interpreter (Eq. 6).
Note that there are three attentions involved in the computation of the programmer: 1) the soft attention over all complex tokens, which forms the context $c_t$; 2) $k_t$, the hard attention over complex input tokens for the edit pointer, which determines the index position of the current word that needs to be edited at $t$. We force $k_t$ to be the number of KEEP and DELETE operations previously predicted by the programmer up to time $t$; 3) $j_{t-1}$, the hard attention over simple tokens for training (this attention is used to speed up the training), which is the number of KEEP and ADD(W) operations in the reference gold labels up to time $t - 1$. During inference, the model no longer needs this attention and instead incrementally obtains $y_{1:j_{t-1}}$ based on its predictions.
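As a rough illustration of the two hard attentions described above, the sketch below derives both pointer positions from a sequence of edit labels. The function name `pointer_positions`, the string/tuple encoding of the labels, and the zero-based indexing are assumptions made for this example and are not specified in the text.

```python
from typing import List, Tuple, Union

# Hypothetical label encoding: "KEEP", "DELETE", "STOP", or ("ADD", word).
EditOp = Union[str, Tuple[str, str]]


def pointer_positions(ops: List[EditOp]) -> List[Tuple[int, int]]:
    """Return (k_t, j_{t-1}) for every step t of an edit sequence.

    k_t     : number of KEEP and DELETE labels before step t (edit pointer)
    j_{t-1} : number of KEEP and ADD labels before step t (tokens produced)
    """
    positions = []
    k = 0  # source tokens consumed so far
    j = 0  # simplified tokens produced so far
    for op in ops:
        positions.append((k, j))
        if op == "KEEP":
            k += 1
            j += 1
        elif op == "DELETE":
            k += 1
        elif op != "STOP":  # an ("ADD", word) label leaves the edit pointer in place
            j += 1
    return positions
```

During training the label sequence would be the gold one, so both counters can be precomputed for a whole batch; at inference time the same counters would be updated one predicted label at a time.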
Interpreter. The interpreter contains two parts: 1) a parameter-free executor $exec(z_t, x_{k_t})$ that applies the predicted edit operation $z_t$ to the word $x_{k_t}$, resulting in a new word $y_{j_t}$. The specific execution rules for the operations are as follows: execute KEEP/DELETE to keep/delete the word and move the edit pointer to the next word; execute ADD(W) to add a new word W while the edit pointer stays on the same word; and execute STOP to terminate the edit process. 2) an LSTM interpreter (Eq. 6) that summarizes the partial output sequence of words produced by the executor so far. The output of the LSTM interpreter is given to the programmer in order to generate the next edit decision.
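The execution rules above can be sketched as a small function. This is a minimal sketch, not the authors' implementation: the name `execute_edits` and the label encoding are hypothetical, and the final padding step follows the rule stated in the Edit Label Construction section (remaining source words are kept when STOP arrives early).

```python
from typing import List, Tuple, Union

# Hypothetical label encoding: "KEEP", "DELETE", "STOP", or ("ADD", word).
EditOp = Union[str, Tuple[str, str]]


def execute_edits(ops: List[EditOp], source: List[str]) -> List[str]:
    """Apply a sequence of edit operations to the source tokens."""
    output = []
    k = 0  # edit pointer into the source sentence
    for op in ops:
        if op == "STOP":
            break
        if op == "KEEP":
            if k < len(source):
                output.append(source[k])  # copy the current word
                k += 1                    # move the pointer forward
        elif op == "DELETE":
            if k < len(source):
                k += 1                    # skip the current word
        else:
            output.append(op[1])          # ADD(W): pointer stays in place
    # Pad with KEEP if editing stopped before the end of the source.
    output.extend(source[k:])
    return output
```

Because KEEP copies the source token itself, an out-of-vocabulary word passes through unchanged without any separate copy mechanism.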

Edit Label Construction
Unlike neural seq2seq models, our model requires expert programs for training. We construct these expert edit sequences from complex sentences to simple ones by computing the shortest edit paths using a dynamic programming algorithm similar to computing Levenshtein distances without substitutions. When multiple paths with the same edit distance exist, we further prioritize the path that performs ADD before DELETE. By doing so, we can generate a unique edit path from a complex sentence to a simple one, reducing the noise and variance that the model would face (footnote 3). Table 2 demonstrates an example of the created edit label path and Table 3 shows the counts of the created edit labels on the training sets of the three text simplification corpora.
Footnote 3: We tried other ways of labelling, such as 1) preferring DELETE to ADD; 2) deciding randomly when there is a tie; 3) including REPLACE as an operation. However, models trained with these labelling methods did not give good results in our empirical studies.

Table 3: Counts of the edit labels constructed by our edit labelling algorithm on the three datasets (identical complex-simple sentence pairs are removed). Columns: KEEP, DELETE, ADD, STOP.
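The label construction described above can be sketched with a suffix-based dynamic program. This is one possible reading, under stated assumptions: the function name `build_edit_labels` and the label encoding are hypothetical, matching tokens are always kept, and ties between ADD and DELETE are resolved in favour of ADD as the text describes. The authors' exact procedure may differ in details.

```python
from typing import List, Tuple, Union

# Hypothetical label encoding: "KEEP", "DELETE", "STOP", or ("ADD", word).
EditOp = Union[str, Tuple[str, str]]


def build_edit_labels(complex_tokens: List[str], simple_tokens: List[str]) -> List[EditOp]:
    """Shortest insert/delete edit path (no substitutions), ADD before DELETE on ties."""
    n, m = len(complex_tokens), len(simple_tokens)
    # dp[i][j] = minimum number of ADD/DELETE operations needed to turn
    # complex_tokens[i:] into simple_tokens[j:].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][m] = n - i  # only deletions remain
    for j in range(m + 1):
        dp[n][j] = m - j  # only additions remain
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if complex_tokens[i] == simple_tokens[j]:
                dp[i][j] = dp[i + 1][j + 1]
            else:
                dp[i][j] = 1 + min(dp[i + 1][j], dp[i][j + 1])

    ops: List[EditOp] = []
    i = j = 0
    while i < n or j < m:
        if i < n and j < m and complex_tokens[i] == simple_tokens[j]:
            ops.append("KEEP")
            i += 1
            j += 1
        elif j < m and (i == n or dp[i][j] == 1 + dp[i][j + 1]):
            ops.append(("ADD", simple_tokens[j]))  # preferred on ties
            j += 1
        else:
            ops.append("DELETE")
            i += 1
    ops.append("STOP")
    return ops
```

For the toy pair `a b c` and `a d c`, this yields KEEP, ADD(d), DELETE, KEEP, STOP: the replacement of `b` by `d` is expressed as an addition followed by a deletion.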
As can be seen from Table 3, our edit labels are very imbalanced, especially on DELETE. We resolve this with two approaches during training: 1) we use the inverse of the edit label frequencies as the weights to calculate the loss; 2) the model only executes DELETE when there is an explicit DELETE prediction. Thus, if the system outputs STOP before it finishes editing the whole complex sequence, our system will automatically pad KEEP until the end of the sentence, ensuring that the system outputs remain conservative with respect to the complex sequences.
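The inverse-frequency weighting mentioned above can be illustrated as follows. This is a minimal sketch: the function name `inverse_frequency_weights` is hypothetical, all ADD(W) labels are grouped under a single "ADD" class as an assumption (mirroring the columns of Table 3), and the toy label sequences are invented for illustration only.

```python
from collections import Counter
from typing import Dict, Iterable, List


def inverse_frequency_weights(label_sequences: Iterable[List[str]]) -> Dict[str, float]:
    """Map each edit label to the inverse of its count in the training labels."""
    counts = Counter(label for seq in label_sequences for label in seq)
    return {label: 1.0 / count for label, count in counts.items()}


# Toy data (not from the paper): two short edit sequences.
toy_sequences = [
    ["KEEP", "KEEP", "DELETE", "STOP"],
    ["KEEP", "ADD", "STOP"],
]
weights = inverse_frequency_weights(toy_sequences)
```

Rare labels receive larger weights, so their errors contribute more to a class-weighted loss. Multiplying one entry of the resulting dictionary by a constant would be one way to prioritize a particular operation.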

Dataset
Three benchmark text simplification datasets are used in our experiments. WikiSmall contains automatically aligned complex-simple sentence pairs from standard to simple English Wikipedia (Zhu et al., 2010). We use the standard splits of 88,837/205/100 provided by Zhang and Lapata (2017) as train/dev/test sets. WikiLarge (Zhang and Lapata, 2017) is the largest TS corpus with 296,402/2000/359 complex-simple sentence pairs for training/validating/testing, constructed by merging previously created simplification corpora (Zhu et al., 2010; Woodsend and Lapata, 2011; Kauchak, 2013). In addition to the automatically aligned references, Xu et al. (2016) created eight more human-written simplified references for each complex sentence in the development/test set of WikiLarge. The third dataset is Newsela (Xu et al., 2015), which consists of 1130 news articles. Each article is rewritten by professional editors four times for children at different grade levels (0-4 from complex to simple). We use the standard splits provided by Zhang and Lapata (2017), which contain 94,208/1129/1076 sentence pairs for train/dev/test.

Baselines
We compare against three state-of-the-art SMT-based TS systems: 1) PBMT-R (Wubben et al., 2012), where the phrase-based MT system's outputs are re-ranked; 2) Hybrid (Narayan and Gardent, 2014), where syntactic transformations such as sentence splits and deletions are performed before re-ranking; 3) SBMT-SARI (Xu et al., 2016), a syntax-based MT framework with external simplification rules. We also compare against four state-of-the-art NMT-based TS systems: the vanilla RNN-based model NTS (Nisioi et al., 2017), the memory-augmented neural network NSELSTM, the deep reinforcement learning-based neural networks DRESS and DRESS-LS (Zhang and Lapata, 2017), and DMASS+DCSS (Zhao et al., 2018), which integrates the transformer model with external simplification rules. In addition, we compare our NPI-based EditNTS with a BiLSTM sequence labelling model that is trained on our edit labels; we call it the Seq-Label model.

Evaluation
We report two widely used sentence simplification metrics in the literature: SARI (Xu et al., 2016) and FKGL (Kincaid et al., 1975). FKGL measures the readability of the system output (lower FKGL implies simpler output) and SARI evaluates the system output by comparing it against the source and reference sentences. Earlier work also used BLEU as a metric, but recent work has found that it does not reflect simplification (Xu et al., 2016) and is in fact negatively correlated with simplicity (Sulem et al., 2018a). Systems with high BLEU scores are thus biased towards copying the complex sentence as a whole, while SARI avoids this by computing the arithmetic mean of the N-gram (N ∈ {1, 2, 3, 4}) F1-scores of three rewrite operations: add, delete, and keep. We also report the F1-scores of these three operations. In addition, we report the percentage of unchanged sentences that are directly copied from the source sentences. We treat SARI as the most important measurement in our study, as Xu et al. (2016) demonstrated that SARI has the highest correlation with human judgments in sentence simplification tasks. In addition to automatic evaluations, we also report human evaluations (footnote 5) of our system outputs compared to the best MT-based systems, external knowledge-based systems, and Seq-Label by three human judges (footnote 6) with a five-point Likert scale. The volunteers are asked to rate simplifications on three dimensions: 1) fluency (is the output grammatical?), 2) adequacy (how much meaning from the original sentence is preserved?), and 3) simplicity (is the output simpler than the original sentence?).
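For reference, the two automatic metrics can be written out as follows. The FKGL expression is the standard Flesch-Kincaid grade level formula; the SARI expression only restates the description given above (a mean of the three operation scores, each computed over n-grams of order 1 to 4) and is a simplified summary rather than a full definition. The symbols $F_{add}$, $F_{keep}$ and $F_{del}$ are notation introduced here for illustration.

```latex
\mathrm{FKGL} = 0.39 \cdot \frac{\text{total words}}{\text{total sentences}}
              + 11.8 \cdot \frac{\text{total syllables}}{\text{total words}} - 15.59

\mathrm{SARI} = \frac{1}{3}\left(F_{add} + F_{keep} + F_{del}\right)
```

Longer sentences and longer words both raise FKGL, which is why a system that deletes aggressively tends to score lower (simpler) on this metric.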

Training Details
We used the same hyperparameters across the three datasets. We initialized the word and edit operation embeddings with 100-dimensional GloVe vectors (Pennington et al., 2014) and the part-of-speech tag (footnote 7) embeddings with 30 dimensions. The number of hidden units was set to 200 for the encoder, the edit LSTM, and the LSTM interpreter. During training, we regularized the encoder with a dropout rate of 0.3 (Srivastava et al., 2014). For optimization, we used Adam (Kingma and Ba, 2014) with a learning rate of 0.001 and weight decay of $10^{-6}$. The gradient was clipped to 1 (Pascanu et al., 2013). We used a vocabulary size of 30K and the remaining words were replaced with UNK. In our main experiment, we used the inverse of the edit label frequencies as the loss weights, aiming to balance the classes. Batch size across all datasets was 64.
Footnote 5: The outputs of PBMT-R, Hybrid, SBMT-SARI and DRESS are publicly available and we are grateful to Sanqiang Zhao for providing their system's outputs.
Footnote 6: Three volunteers (one native English speaker and two non-native fluent English speakers) participated in our human evaluation, as one of the goals of our system is to make the text easier to understand for non-native English speakers. The volunteers are given complex sentences and different system outputs in random order, and are asked to rate from one to five (the higher the better) in terms of simplicity, fluency, and adequacy.
Footnote 7: We used the NLTK toolkit with the default Penn Treebank tag set to obtain the part-of-speech tags; there are 45 possible POS-tags (36 standard tags and 7 special symbols) in total.
Table 5 (caption fragment): ... (add, keep, delete). In addition, we report the percentage of unchanged sentences (%unc.) in the system outputs when compared to the source sentences.

Results
Table 5 summarizes the results of our automatic evaluations. In terms of readability, our system obtains lower (= better) FKGL compared to other MT-based systems, which indicates that our system's output is easier to understand.
In terms of the percentage of unchanged sentences, one can see that MT-based models have much higher rates of unchanged sentences than the reference. Thus, the models learned a safe but undesirable strategy of copying the source sentences directly. By contrast, our model learns to edit the sentences and has a lower rate of keeping the source sentences unchanged.
In terms of SARI, the edit labelling-based models Seq-Label and EditNTS achieve better or comparable results with respect to state-of-the-art MT-based models, demonstrating the promise of learning edit labels for text simplification. Compared to Seq-Label, our model achieves a large improvement (+1.14, +1.85, +1.88 SARI) on WikiLarge, Newsela, and WikiSmall. We believe this improvement comes mainly from the interpreter in EditNTS, as it provides the proper context to the programmer for making edit decisions (more ablation studies in Section 5.1). On WikiSmall and Newsela, our model outperforms state-of-the-art TS models by a large margin (+1.89, +1.41 SARI), showing that EditNTS learns simplification better on smaller datasets than MT-based simplification models. On WikiLarge, our model outperforms the best NMT-based system DRESS-LS by a large margin of +0.95 SARI and achieves comparable performance to the best SMT-based model PBMT-R. While the overall SARI scores are similar between EditNTS and PBMT-R, the two models prefer different strategies: EditNTS performs extensive DELETE while PBMT-R favours lexical substitution and simplification.
On WikiLarge, two models, SBMT-SARI and DMASS+DCSS, reported higher SARI scores as they employ the external knowledge base PPDB for word replacement. These external rules can provide reliable guidance about which words to modify, resulting in higher add/keep F1 scores (Table 5-a). In contrast, our model is inclined to generate shorter sentences, which leads to high F1 scores on delete operations. Nevertheless, human judges prefer our model over SBMT-SARI and DMASS+DCSS on all the measurements (Table 6), indicating the effectiveness of our model at correctly performing deleting operations while maintaining fluent and adequate outputs. Moreover, our model can be easily integrated with these external PPDB simplification rules for word replacement by adding a new edit label "replacement" for further improvements.
The results of our human evaluations are presented in Table 6. As can be seen, our model outperforms MT-based models on Fluency, Simplicity, and Average overall ratings. Although EditNTS is inclined to perform more delete operations, human judges rate our system as adequate. In addition, our model performs significantly better than Seq-Label in terms of Fluency, indicating the importance of adding an interpreter to 1) summarize the partially edited outputs and 2) regularize the programmer as a language model. Interestingly, similar to the human evaluation results in Zhang and Lapata (2017), judges often prefer system outputs to the gold references.
Controllable Generation: In addition to the state-of-the-art performance, EditNTS has the flexibility to prioritize different edit operations. Note that NMT-based systems do not have this feature at all, as the sentence length of their outputs is not controllable and depends purely on the training data. Table 7 shows that by simply changing the loss weights on different edit labels, we can control the length of the system's outputs, how many words it copies from the original sentences and how many novel words it adds.

Ablation Studies
In the ablation studies, we aim to investigate the effectiveness of each component in our model. We compare the full model with its variants in which the POS tags, the interpreter, or the context is removed. As shown in Table 8, the interpreter is a critical part to guarantee the performance of the sequence-labelling model, while POS tags and attention provide further performance gains.
Table 7: Results on Newsela by controlling the edit label ratios. We increase the loss weight on ADD, KEEP, and DELETE ten times, respectively. The three rows show the systems' output statistics on the average output sentence length (Avg. len), the average percentage of tokens that are copied from the input (% copied), and the average percentage of novel tokens that are added with respect to the input sentence (% novel).

Conclusion
We propose an NPI-based model for sentence simplification, where edit labels are predicted by the programmer and then executed into simplified tokens by the interpreter. Our model outperforms previous state-of-the-art machine translation-based TS models in most of the automatic evaluation metrics and human ratings, demonstrating the effectiveness of learning edit operations explicitly for sentence simplification. Compared to the black-box MT-based systems, our model is more interpretable by providing generated edit operation traces, and more controllable with the ability to prioritize different simplification operations.