Sentence Simplification with Deep Reinforcement Learning

Sentence simplification aims to make sentences easier to read and understand. Most recent approaches draw on insights from machine translation to learn simplification rewrites from monolingual corpora of complex and simple sentences. We address the simplification problem with an encoder-decoder model coupled with a deep reinforcement learning framework. Our model, which we call DRESS (as shorthand for Deep REinforcement Sentence Simplification), explores the space of possible simplifications while learning to optimize a reward function that encourages outputs which are simple, fluent, and preserve the meaning of the input. Experiments on three datasets demonstrate that our model outperforms competitive simplification systems.


Introduction
The main goal of sentence simplification is to reduce the linguistic complexity of text, while still retaining its original information and meaning. The simplification task has been the subject of several modeling efforts in recent years due to its relevance for NLP applications and individuals alike (Siddharthan, 2014; Shardlow, 2014). For instance, a simplification component could be used as a preprocessing step to improve the performance of parsers (Chandrasekar et al., 1996), summarizers (Beigman Klebanov et al., 2004), and semantic role labelers (Vickrey and Koller, 2008; Woodsend and Lapata, 2014). Automatic simplification would also benefit people with low literacy skills (Watanabe et al., 2009), such as children and non-native speakers as well as individuals with autism (Evans et al., 2014), aphasia (Carroll et al., 1999), or dyslexia (Rello et al., 2013).
The most prevalent rewrite operations which give rise to simplified text include substituting rare words with more common words or phrases, rendering syntactically complex structures simpler, and deleting elements of the original text (Siddharthan, 2014). Earlier work focused on individual aspects of the simplification problem. For example, several systems performed syntactic simplification only, using rules aimed at sentence splitting (Carroll et al., 1999; Chandrasekar et al., 1996; Vickrey and Koller, 2008; Siddharthan, 2004) while others turned to lexical simplification by substituting difficult words with more common WordNet synonyms or paraphrases (Devlin, 1999; Inui et al., 2003; Kaji et al., 2002).
Recent approaches view the simplification process more holistically as a monolingual text-to-text generation task, borrowing ideas from statistical machine translation. Simplification rewrites are learned automatically from examples of complex-simple sentences extracted from online resources such as the ordinary and simple English Wikipedia.
For example, Zhu et al. (2010) draw inspiration from syntax-based translation and propose a model similar to Yamada and Knight (2001) which additionally performs simplification-specific rewrite operations (e.g., sentence splitting). Woodsend and Lapata (2011) formulate simplification in the framework of Quasi-synchronous grammar (Smith and Eisner, 2006) and use integer linear programming to score the candidate translations/simplifications. Wubben et al. (2012) propose a two-stage model: initially, a standard phrase-based machine translation (PBMT) model is trained on complex-simple sentence pairs. During inference, the K-best outputs of the PBMT model are reranked according to their dissimilarity to the (complex) input sentence. The hybrid model developed in Narayan and Gardent (2014) also operates in two phases. Initially, a probabilistic model performs sentence splitting and deletion operations over discourse representation structures assigned by Boxer (Curran et al., 2007). The resulting sentences are further simplified by a model similar to Wubben et al. (2012). Xu et al. (2016) train a syntax-based machine translation model on a large-scale paraphrase dataset (Ganitkevitch et al., 2013) using simplification-specific objective functions and features to encourage simpler output.
In this paper we propose a simplification model which draws on insights from neural machine translation (Bahdanau et al., 2015; Sutskever et al., 2014). Central to this approach is an encoder-decoder architecture implemented by recurrent neural networks. The encoder reads the source sequence into a list of continuous-space representations from which the decoder generates the target sequence. Although our model uses the encoder-decoder architecture as its backbone, it must also meet constraints imposed by the simplification task itself, i.e., the predicted output must be simpler, preserve the meaning of the input, and be grammatical. To incorporate this knowledge, the model is trained in a reinforcement learning framework (Williams, 1992): it explores the space of possible simplifications while learning to maximize an expected reward function that encourages outputs which meet simplification-specific constraints. Reinforcement learning has been previously applied to extractive summarization (Ryang and Abekawa, 2012), information extraction (Narasimhan et al., 2016), dialogue generation (Li et al., 2016), machine translation, and image caption generation (Ranzato et al., 2016). We evaluate our system on three publicly available datasets collated automatically from Wikipedia (Zhu et al., 2010; Woodsend and Lapata, 2011) and human-authored news articles (Xu et al., 2015b). We experimentally show that the reinforcement learning framework is the key to successful generation of simplified text, bringing significant improvements over strong simplification models across datasets.

Neural Encoder-Decoder Model
We will first define a basic encoder-decoder model for sentence simplification and then explain how to embed it in a reinforcement learning framework. Given a (complex) source sentence $X = (x_1, x_2, \ldots, x_{|X|})$, our model learns to predict its simplified target $Y = (y_1, y_2, \ldots, y_{|Y|})$. Inferring the target $Y$ given the source $X$ is a typical sequence-to-sequence learning problem, which can be modeled with attention-based encoder-decoder models (Bahdanau et al., 2015; Luong et al., 2015). Sentence simplification is slightly different from related sequence transduction tasks (e.g., compression) in that it can involve splitting operations. For example, a long source sentence (In 1883, Faur married Marie Fremiet, with whom he had two sons.) can be simplified as two sentences (In 1883, Faur married Marie Fremiet. They had two sons.). Nevertheless, we still view the target as a sequence, i.e., two or more sequences concatenated with full stops.
The encoder-decoder model has two parts (see left hand side in Figure 1). The encoder transforms the source sentence $X$ into a sequence of hidden states $(h^S_1, h^S_2, \ldots, h^S_{|X|})$ with a Long Short-Term Memory Network (LSTM; Hochreiter and Schmidhuber 1997), while the decoder uses another LSTM to generate one word $y_{t+1}$ at a time in the simplified target $Y$. Generation is conditioned on all previously generated words $y_{1:t}$ and a dynamically created context vector $c_t$, which encodes the source sentence:
$$P(Y \mid X) = \prod_{t=1}^{|Y|} P(y_t \mid y_{1:t-1}, X) \tag{1}$$
$$P(y_{t+1} \mid y_{1:t}, X) = \mathrm{softmax}\big(g(h^T_t, c_t)\big) \tag{2}$$
where $g(\cdot)$ is a one-hidden-layer neural network with the following parametrization:
$$g(h^T_t, c_t) = W_o \tanh(U_h h^T_t + W_h c_t) \tag{3}$$
where $W_o \in \mathbb{R}^{|V| \times d}$, $U_h \in \mathbb{R}^{d \times d}$, and $W_h \in \mathbb{R}^{d \times d}$; $|V|$ is the output vocabulary size and $d$ the hidden unit size. $h^T_t$ is the hidden state of the decoder LSTM which summarizes $y_{1:t}$, i.e., what has been generated so far:
$$h^T_t = \mathrm{LSTM}(y_t, h^T_{t-1}) \tag{4}$$
The dynamic context vector $c_t$ is the weighted sum of the hidden states of the source sentence:
$$c_t = \sum_{i=1}^{|X|} \alpha_{ti}\, h^S_i \tag{5}$$
whose weights $\alpha_{ti}$ are determined by an attention mechanism:
$$\alpha_{ti} = \frac{\exp(h^T_t \cdot h^S_i)}{\sum_{j=1}^{|X|} \exp(h^T_t \cdot h^S_j)} \tag{6}$$
where $\cdot$ is the dot product between two vectors. We use the dot product here mainly for efficiency reasons; alternative ways to compute attention scores have been proposed in the literature and we refer the interested reader to Luong et al. (2015). The model sketched above is usually trained by minimizing the negative log-likelihood of the training source-target pairs.
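The attention step described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names and the toy vectors are assumptions made for illustration, and the hidden states would come from trained LSTMs in practice.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention_context(decoder_state, encoder_states):
    """Return (alpha, context): dot-product attention weights over the source
    hidden states and the resulting weighted-sum context vector."""
    scores = [dot(decoder_state, h) for h in encoder_states]
    alpha = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(a * h[k] for a, h in zip(alpha, encoder_states))
               for k in range(dim)]
    return alpha, context


# Toy example (hypothetical values): d = 2 and a three-word source sentence.
h_src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_tgt = [2.0, 0.0]
alpha, c_t = attention_context(h_tgt, h_src)
print(alpha, c_t)
```

Source positions whose hidden state has a larger dot product with the decoder state receive more weight, so the context vector is dominated by the source words most relevant to the current decoding step.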

Reinforcement Learning for Sentence Simplification
In this section we present DRESS, our Deep REinforcement Sentence Simplification model. Despite successful application in numerous sequence transduction tasks (Jean et al., 2015; Xu et al., 2015a), a vanilla encoder-decoder model is not ideal for sentence simplification. Although a number of rewrite operations (e.g., copying, deletion, substitution, word reordering) can be used to simplify text, copying is by far the most common. We empirically found that 73% of the target words are copied from the source in the Newsela dataset. This number further increases to 83% when considering Wikipedia-based datasets (we provide details on these datasets in Section 5). As a result, a generic encoder-decoder model learns to copy all too well at the expense of other rewrite operations, often parroting back the source or making only a few trivial changes.
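A copy statistic of this kind could be computed roughly as follows. The text does not say exactly how copied words were counted, so this sketch assumes a target token counts as copied if it appears anywhere in the whitespace-tokenized source; `copy_rate` is a hypothetical helper name.

```python
def copy_rate(source_tokens, target_tokens):
    """Fraction of target tokens that also occur in the source.

    Assumption: a token counts as copied if it appears anywhere in the
    source, regardless of position or frequency.
    """
    if not target_tokens:
        return 0.0
    source_vocab = set(source_tokens)
    copied = sum(1 for token in target_tokens if token in source_vocab)
    return copied / len(target_tokens)


# The splitting example from Section 2, pre-tokenized by whitespace.
source = "In 1883 , Faur married Marie Fremiet , with whom he had two sons .".split()
target = "In 1883 , Faur married Marie Fremiet . They had two sons .".split()
rate = copy_rate(source, target)
print(f"{rate:.3f}")
```

In this example only "They" is new in the target, which shows how heavily even a genuine simplification can overlap with its source.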
To encourage a wider variety of rewrite operations while remaining fluent and faithful to the meaning of the source, we employ a reinforcement learning framework (see Figure 1). We view the encoder-decoder model as an agent which first reads the source sentence $X$; then at each step, it takes an action $\hat{y}_t \in V$ (where $V$ is the output vocabulary) according to a policy $P_{RL}(\hat{y}_t \mid \hat{y}_{1:t-1}, X)$ (see Equation (2)). The agent continues to take actions until it produces an End Of Sentence (EOS) token, yielding the action sequence $\hat{Y} = (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_{|\hat{Y}|})$, which is also the simplified output of our model. A reward $r$ is then received and the REINFORCE algorithm (Williams, 1992) is used to update the agent. In the following, we first introduce our reward and then present the details of the REINFORCE algorithm.

Reward
The reward $r(\hat{Y})$ for system output $\hat{Y}$ is the weighted sum of three components aimed at capturing key aspects of the target output, namely simplicity, relevance, and fluency:
$$r(\hat{Y}) = \lambda^S r^S + \lambda^R r^R + \lambda^F r^F \tag{7}$$
where $\lambda^S$, $\lambda^R$, and $\lambda^F$ are the component weights, $X$ is the source, $Y$ the reference (or target), and $\hat{Y}$ the system output. $r^S$, $r^R$, and $r^F$ are shorthands for simplicity $r^S(X, Y, \hat{Y})$, relevance $r^R(X, \hat{Y})$, and fluency $r^F(\hat{Y})$. We provide details for each reward summand below.
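The weighted sum can be written as a one-line function. In this sketch the default weights are the values reported in the experimental setup (1, 0.25, and 0.5 for simplicity, relevance, and fluency), and the function and argument names are hypothetical.

```python
def total_reward(r_simplicity, r_relevance, r_fluency,
                 lam_s=1.0, lam_r=0.25, lam_f=0.5):
    """Weighted sum of the three reward components.

    Default weights follow the values reported in the experimental setup;
    each component is expected to lie in [0, 1].
    """
    return lam_s * r_simplicity + lam_r * r_relevance + lam_f * r_fluency


# Hypothetical component scores for one sampled simplification.
reward = total_reward(r_simplicity=0.4, r_relevance=0.8, r_fluency=0.6)
print(reward)
```

With these weights the simplicity term dominates, while relevance and fluency act as softer constraints on the sampled output.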
Simplicity To encourage the model to apply a wide range of simplification operations, we use SARI (Xu et al., 2016), a recently proposed metric which compares System output Against References and against the Input sentence. SARI is the arithmetic average of n-gram precision and recall of three rewrite operations: addition, copying, and deletion. It rewards addition operations where system output was not in the input but occurred in the references. Analogously, it rewards words retained/deleted in both the system output and the references. In experimental evaluation Xu et al. (2016) demonstrate that SARI correlates well with human judgments of simplicity, whilst correctly rewarding systems that both make changes and simplify the input.
One caveat with using SARI as a reward is that it relies on the availability of multiple references, which are rare for sentence simplification. Xu et al. (2016) provide eight references for 2,359 sentences, but these are primarily for system tuning and evaluation rather than training. The majority of existing simplification datasets (see Section 5 for details) have a single reference for each source sentence. Moreover, they are unavoidably noisy as they are mostly constructed automatically, e.g., by aligning sentences from the ordinary and simple English Wikipedias. When relying solely on a single reference, SARI will try to reward accidental n-grams that should never have occurred in it. To counteract the effect of noise, we apply $\mathrm{SARI}(X, \hat{Y}, Y)$ in the expected direction, with $X$ as the source, $\hat{Y}$ the system output, and $Y$ the reference, as well as in the reverse direction with $Y$ as the system output and $\hat{Y}$ as the reference. Assuming our system can produce reasonably good simplifications, by swapping the output and the reference, reverse SARI can be used to estimate how good a reference is with respect to the system output. Our first reward is therefore the weighted sum of SARI and reverse SARI:
$$r^S = \beta\, \mathrm{SARI}(X, \hat{Y}, Y) + (1 - \beta)\, \mathrm{SARI}(X, Y, \hat{Y}) \tag{8}$$
Figure 1: Deep reinforcement learning simplification model. $X$ is the complex sentence, $Y$ the reference (simple) sentence, and $\hat{Y}$ the action sequence (simplification) produced by the encoder-decoder model.
Relevance While the simplicity-based reward $r^S$ tries to encourage the model to make changes, the relevance reward $r^R$ ensures that the generated sentences preserve the meaning of the source. We use an LSTM sentence encoder to convert the source $X$ and the predicted target $\hat{Y}$ into two vectors $q_X$ and $q_{\hat{Y}}$. The relevance reward $r^R$ is simply the cosine similarity between these two vectors:
$$r^R = \cos(q_X, q_{\hat{Y}}) = \frac{q_X \cdot q_{\hat{Y}}}{\lVert q_X \rVert\, \lVert q_{\hat{Y}} \rVert} \tag{9}$$
We use a sequence auto-encoder (SAE; Dai and Le 2015) to train the LSTM sentence encoder on both the complex and simple sentences. Specifically, the SAE uses sentence $X = (x_1, \ldots, x_{|X|})$ to infer itself via an encoder-decoder model (without an attention mechanism). Firstly, an encoder LSTM converts $X$ into a sequence of hidden states $(h_1, \ldots, h_{|X|})$. Then, we use $h_{|X|}$ to initialize the hidden state of the decoder LSTM and recover/generate $X$ one word at a time.
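The simplicity and relevance rewards can be sketched as below. Several things here are assumptions rather than details from the text: `sari` is a caller-supplied scoring function with the argument order (source, output, reference), the weight `beta` is attached to the forward direction, and the sentence vectors are taken as given rather than produced by a trained sequence auto-encoder.

```python
import math


def simplicity_reward(sari, source, output, reference, beta=0.1):
    """Weighted sum of SARI and reverse SARI.

    `sari(source, system_output, reference)` is a hypothetical callable
    returning a score in [0, 1]. Attaching `beta` to the forward direction
    is an assumption of this sketch.
    """
    forward = sari(source, output, reference)
    reverse = sari(source, reference, output)  # output and reference swapped
    return beta * forward + (1.0 - beta) * reverse


def relevance_reward(q_source, q_output):
    """Cosine similarity between two sentence vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(q_source, q_output))
    norm = math.sqrt(sum(a * a for a in q_source)) * math.sqrt(sum(b * b for b in q_output))
    return dot / norm if norm > 0 else 0.0


def toy_sari(source, system_output, reference):
    """Stand-in scorer used only to exercise the code; not real SARI."""
    return 1.0 if system_output == "simple" else 0.0


r_s = simplicity_reward(toy_sari, "complex", "simple", "gold")
r_r = relevance_reward([1.0, 2.0], [2.0, 4.0])
print(r_s, r_r)
```

A real setup would replace `toy_sari` with an actual SARI implementation and feed `relevance_reward` with the final hidden states of the sentence encoder.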
Fluency Xu et al. (2016) observe that SARI correlates less with fluency compared to other metrics such as BLEU (Papineni et al., 2002). The fluency reward $r^F$ models the well-formedness of the generated sentences explicitly. It is the normalized sentence probability assigned by an LSTM language model trained on simple sentences:
$$r^F = \exp\left(\frac{1}{|\hat{Y}|} \sum_{i=1}^{|\hat{Y}|} \log P_{LM}(\hat{y}_i \mid \hat{y}_{0:i-1})\right) \tag{10}$$
We exponentiate the length-normalized log-probability of $\hat{Y}$ (i.e., the inverse of its perplexity) to ensure that $r^F \in [0, 1]$, as is the case with $r^S$ and $r^R$.
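As a minimal sketch, the fluency reward can be computed from per-token log-probabilities. The language model itself is not shown; `token_log_probs` is assumed to hold the natural-log probabilities that such a model assigns to each generated token.

```python
import math


def fluency_reward(token_log_probs):
    """Exponential of the average per-token log-probability.

    Equals the inverse perplexity of the sentence, so it lies in (0, 1]
    whenever every log-probability is at most 0.
    """
    if not token_log_probs:
        return 0.0
    return math.exp(sum(token_log_probs) / len(token_log_probs))


# Hypothetical log-probabilities for a four-token output.
r_f = fluency_reward([math.log(0.5)] * 4)
print(r_f)
```

Normalizing by length keeps the reward from favoring short outputs merely because they have fewer terms in the product.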

The REINFORCE Algorithm
The goal of the REINFORCE algorithm is to find an agent that maximizes the expected reward. The training loss for one sequence is its negative expected reward:
$$L(\theta) = -\mathbb{E}_{(\hat{y}_1, \ldots, \hat{y}_{|\hat{Y}|}) \sim P_{RL}(\cdot \mid X)}\big[r(\hat{y}_1, \ldots, \hat{y}_{|\hat{Y}|})\big] \tag{11}$$
where $P_{RL}$ is our policy, i.e., the distribution produced by the encoder-decoder model (see Equation (2)), and $r(\cdot)$ is the reward function of an action sequence $\hat{Y} = (\hat{y}_1, \ldots, \hat{y}_{|\hat{Y}|})$, i.e., a generated simplification. Unfortunately, computing the expectation term is prohibitive, since there is an infinite number of possible action sequences. In practice, we approximate this expectation with a single sample from the distribution $P_{RL}(\cdot \mid X)$. We refer to Williams (1992) for the full derivation of the gradients. The gradient of $L(\theta)$ is:
$$\nabla L(\theta) \approx -\sum_{t=1}^{|\hat{Y}|} \nabla \log P_{RL}(\hat{y}_t \mid \hat{y}_{1:t-1}, X)\, \big[r(\hat{y}_{1:|\hat{Y}|}) - b_t\big] \tag{12}$$
To reduce the variance of gradients, we also introduce a baseline linear regression model $b_t$ to estimate the expected future reward at time $t$ (Ranzato et al., 2016). $b_t$ takes the concatenation of $h^T_t$ and $c_t$ as input and is trained by minimizing mean squared error. We do not back-propagate this error to $h^T_t$ or $c_t$ during training (Ranzato et al., 2016).
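The single-sample gradient estimate with a baseline can be illustrated on a toy policy. This sketch is a simplification under stated assumptions: the per-step logits are fixed numbers rather than outputs of an autoregressive decoder, the reward and baselines are made-up constants, and the gradient is taken with respect to the logits only.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    """Draw one index from a categorical distribution."""
    threshold = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if threshold < cumulative:
            return index
    return len(probs) - 1  # guard against floating-point round-off


def reinforce_logit_grads(step_logits, actions, reward, baselines):
    """Gradient of the single-sample loss with respect to each step's logits.

    For the per-step loss -(r - b_t) * log p(a_t), the gradient with respect
    to logit k is (r - b_t) * (p_k - 1[k == a_t]).
    """
    grads = []
    for logits, action, b_t in zip(step_logits, actions, baselines):
        probs = softmax(logits)
        advantage = reward - b_t
        grads.append([advantage * (p - (1.0 if k == action else 0.0))
                      for k, p in enumerate(probs)])
    return grads


rng = random.Random(0)
step_logits = [[0.2, 0.1, -0.3], [1.0, 0.0, 0.5]]  # two steps, vocabulary of 3
actions = [sample_action(softmax(row), rng) for row in step_logits]
grads = reinforce_logit_grads(step_logits, actions, reward=0.8, baselines=[0.5, 0.5])
print(actions, grads)
```

When the reward exceeds the baseline, the gradient at the sampled action is negative, so a gradient descent step raises that action's logit; when the reward falls below the baseline the sign flips and the action is discouraged.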

Learning
Presented in its original form, the REINFORCE algorithm starts learning with a random policy. This assumption can make model training challenging for generation tasks like ours with large vocabularies (i.e., action spaces). We address this issue by pre-training our agent (i.e., the encoder-decoder model) with a negative log-likelihood objective (see Section 2), making sure it can produce reasonable simplifications, thereby starting off with a policy which is better than random. We follow prior work (Ranzato et al., 2016) in adopting a curriculum learning strategy. In the beginning of training, we give little freedom to our agent, allowing it to predict only the last few words of each target sentence. For every target sequence, we use negative log-likelihood to train the first $L$ (initially, $L = 24$) tokens and apply the reinforcement learning algorithm from the $(L + 1)$th token onwards. Every two epochs, we set $L = L - 3$ and the training terminates when $L$ is 0.
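The curriculum can be made concrete with a short schedule generator. The sketch assumes that training stops as soon as L reaches 0, so no epoch is run with L = 0; the text leaves this boundary case open, and the helper names are hypothetical.

```python
def curriculum_schedule(initial_prefix=24, step=3, epochs_per_stage=2):
    """Return the value of L used in each epoch, in order.

    Assumption: training stops once L reaches 0, so the last epochs use the
    smallest positive L.
    """
    schedule = []
    prefix = initial_prefix
    while prefix > 0:
        schedule.extend([prefix] * epochs_per_stage)
        prefix -= step
    return schedule


def split_objectives(target_length, prefix):
    """Return (tokens trained with NLL, tokens trained with RL) for one target."""
    nll_tokens = min(prefix, target_length)
    return nll_tokens, target_length - nll_tokens


schedule = curriculum_schedule()
print(schedule)
print(split_objectives(target_length=30, prefix=24))
```

Under these assumptions the schedule covers 16 epochs, and targets shorter than the current L are trained with negative log-likelihood only until L shrinks below their length.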

Lexical Simplification
Lexical substitution, the replacement of complex words with simpler alternatives, is an integral part of sentence simplification (Specia et al., 2012). The model presented so far learns lexical substitution and other rewrite operations jointly. In some cases, words are predicted because they seem natural in their context, but are poor substitutes for the content of the complex sentence. To counteract this, we learn lexical simplifications explicitly and integrate them with our reinforcement learning-based model.
We use a pre-trained encoder-decoder model (which is trained on a parallel corpus of complex and simple sentences) to obtain probabilistic word alignments, also known as attention scores (see $\alpha_t$ in Equation (6)). Let $X = (x_1, x_2, \ldots, x_{|X|})$ denote a source sentence and $Y = (y_1, y_2, \ldots, y_{|Y|})$ a target sentence. We convert $X$ into $|X|$ hidden states $(v_1, v_2, \ldots, v_{|X|})$ with an LSTM. Note that $v_t \in \mathbb{R}^{d \times 1}$ corresponds to the context dependent representation of $x_t$. Let $\alpha_t$ denote the alignment scores $\alpha_{t1}, \alpha_{t2}, \ldots, \alpha_{t|X|}$. The lexical simplification probability of $y_t$ given the source sentence and the alignment scores is:
$$P_{LS}(y_t \mid X, \alpha_t) = \mathrm{softmax}(W_l\, s_t) \tag{13}$$
where $W_l \in \mathbb{R}^{|V| \times d}$ and $s_t$ represents the source:
$$s_t = \sum_{i=1}^{|X|} \alpha_{ti}\, v_i \tag{14}$$
The lexical simplification model on its own encourages lexical substitutions, without taking into account what has been generated so far (i.e., $y_{1:t-1}$), and as a result fluency could be compromised. A straightforward solution is to integrate lexical simplification with our reinforcement learning trained model (Section 3) using linear interpolation, where $\eta \in [0, 1]$:
$$P(y_t \mid y_{1:t-1}, X) = (1 - \eta)\, P_{RL}(y_t \mid y_{1:t-1}, X) + \eta\, P_{LS}(y_t \mid X, \alpha_t) \tag{15}$$
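The linear interpolation of the two word distributions amounts to a weighted average over the vocabulary. In this sketch `p_rl` and `p_ls` are hypothetical names for the distributions of the reinforcement learning model and the lexical simplification model at one decoding step, and the default weight is the value reported in the experimental setup.

```python
def interpolate(p_rl, p_ls, eta=0.1):
    """Mix two distributions over the same vocabulary: (1 - eta) * p_rl + eta * p_ls."""
    if len(p_rl) != len(p_ls):
        raise ValueError("both distributions must cover the same vocabulary")
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    return [(1.0 - eta) * a + eta * b for a, b in zip(p_rl, p_ls)]


# Toy three-word vocabulary (hypothetical probabilities).
mixed = interpolate([0.7, 0.2, 0.1], [0.1, 0.8, 0.1])
print(mixed)
```

Because both inputs sum to one and the weights sum to one, the result is again a valid distribution; a small weight nudges the decoder toward aligned lexical substitutions without overriding its fluency-aware predictions.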

Experimental Setup
In this section we present our experimental setup for assessing the performance of the simplification model described above. We give details on our datasets, model training, evaluation protocol, and the systems used for comparison.
We also constructed WikiLarge, a larger Wikipedia corpus, by combining previously created simplification corpora. Specifically, we aggregated the aligned sentence pairs in Kauchak (2013), the aligned and revision sentence pairs in Woodsend and Lapata (2011), and Zhu et al.'s (2010) WikiSmall dataset described above. We used the development and test sets created in Xu et al. (2016). These are complex sentences taken from WikiSmall paired with simplifications provided by Amazon Mechanical Turk workers. The dataset contains 8 (reference) simplifications for 2,359 sentences partitioned into 2,000 for development and 359 for testing. After removing duplicates and sentences in development and test sets, the resulting training set contains 296,402 sentence pairs.
Our third dataset is Newsela, a corpus collated by Xu et al. (2015b) who argue that Wikipedia-based resources are suboptimal due to the automatic sentence alignment which unavoidably introduces errors, and their uniform writing style which leads to systems that generalize poorly. Newsela [2] consists of 1,130 news articles, each rewritten four times by professional editors for children at different grade levels (0 is the most complex level and 4 is simplest). Xu et al. (2015b) provide multiple aligned complex-simple pairs within each article. We removed sentence pairs corresponding to levels 0-1, 1-2, and 2-3, since they were too similar to each other. The first 1,070 documents were used for training (94,208 sentence pairs), the next 30 documents for development (1,129 sentence pairs) and the last 30 documents for testing (1,076 sentence pairs). [3] We are not aware of any published results on this dataset.
Training Details We trained our models on an Nvidia GPU card. We used the same hyperparameters across datasets. We first trained an encoder-decoder model, and then performed reinforcement learning training (Section 3), and trained the lexical simplification model (Section 4). Encoder-decoder parameters were initialized uniformly in [−0.1, 0.1]. We used Adam (Kingma and Ba, 2014) to optimize the model with learning rate 0.001; the first momentum coefficient was set to 0.9 and the second momentum coefficient to 0.999. The gradient was rescaled when its norm exceeded 5 (Pascanu et al., 2013). Both encoder and decoder LSTMs have two layers with 256 hidden neurons in each layer. We regularized all LSTMs with a dropout rate of 0.2 (Zaremba et al., 2014). We initialized the encoder and decoder word embedding matrices with 300-dimensional GloVe vectors (Pennington et al., 2014).
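For reference, the encoder-decoder hyperparameters listed above can be collected in a configuration fragment, together with a sketch of gradient rescaling by norm. The dictionary keys and the helper name are hypothetical and do not correspond to any particular library; only the values come from the text.

```python
import math

# Values are taken from the training details; key names are illustrative.
ENCODER_DECODER_CONFIG = {
    "init_range": (-0.1, 0.1),   # uniform parameter initialization
    "optimizer": "adam",
    "learning_rate": 0.001,
    "adam_beta1": 0.9,           # first momentum coefficient
    "adam_beta2": 0.999,         # second momentum coefficient
    "max_grad_norm": 5.0,        # rescale gradients above this norm
    "num_layers": 2,
    "hidden_size": 256,
    "dropout": 0.2,
    "embedding_dim": 300,        # GloVe initialization
}


def rescale_gradient(grad, max_norm):
    """Scale a flat gradient vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grad]
    return list(grad)


print(rescale_gradient([6.0, 8.0], ENCODER_DECODER_CONFIG["max_grad_norm"]))
```

Rescaling leaves the gradient direction unchanged and only caps its magnitude, which is the usual remedy for exploding gradients in recurrent networks.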
During reinforcement training, we used plain stochastic gradient descent with a learning rate of 0.01. We set $\beta = 0.1$, $\lambda^S = 1$, $\lambda^R = 0.25$, and $\lambda^F = 0.5$. [4] Training details for the lexical simplification model are identical to the encoder-decoder model except that word embedding matrices were randomly initialized. The weight of the lexical simplification model was set to $\eta = 0.1$.
[2] https://newsela.com
[3] If a sentence has multiple references in the development or test set, we use the reference with highest simplicity level.
[4] Weights were tuned on the development set of the Newsela dataset and kept fixed for the other two datasets.
To reduce vocabulary size, named entities were tagged with Stanford CoreNLP and anonymized with a NE@N token, where NE ∈ {PER, LOC, ORG, MISC} and N indicates that NE@N is the N-th distinct entity of type NE. For example, "John and Bob are . . . " becomes "PER@1 and PER@2 are . . . ". At test time, we de-anonymize NE@N tokens in the output by looking them up in their source sentences. Note that the de-anonymization may fail, but the chance is small (around 2% of the time on the Newsela development set). We replaced words occurring three times or less in the training set with UNK. At test time, when our models predict UNK, we adopt the UNK replacement method proposed in Jean et al. (2015).
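The anonymization scheme can be sketched as follows. The sketch assumes the entity tags are already available as (token, tag) pairs from an external tagger, treats every entity as a single token, and uses "O" for non-entities; none of these details are specified in the text, and no tagger is invoked here.

```python
def anonymize(tagged_tokens):
    """Replace entity tokens with TYPE@N placeholders.

    `tagged_tokens` is a list of (token, tag) pairs with tag in
    {"PER", "LOC", "ORG", "MISC", "O"}. Returns the anonymized tokens and a
    placeholder-to-token mapping for later de-anonymization.
    """
    counters = {}   # entity type -> number of distinct entities seen
    assigned = {}   # (type, token) -> placeholder
    mapping = {}    # placeholder -> original token
    output = []
    for token, tag in tagged_tokens:
        if tag == "O":
            output.append(token)
            continue
        key = (tag, token)
        if key not in assigned:
            counters[tag] = counters.get(tag, 0) + 1
            placeholder = f"{tag}@{counters[tag]}"
            assigned[key] = placeholder
            mapping[placeholder] = token
        output.append(assigned[key])
    return output, mapping


def deanonymize(tokens, mapping):
    """Restore placeholders found in the source mapping; leave others as-is."""
    return [mapping.get(token, token) for token in tokens]


source = [("John", "PER"), ("and", "O"), ("Bob", "PER"), ("are", "O")]
anonymized, mapping = anonymize(source)
print(anonymized)
print(deanonymize(["PER@2", "is", "PER@3"], mapping))
```

A placeholder that the model generates but that has no counterpart in the source mapping is left unchanged, which corresponds to the failure case mentioned above.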
Evaluation Following previous work (Woodsend and Lapata, 2011; Xu et al., 2016) we evaluated system output automatically, adopting metrics widely used in the simplification literature. Specifically, we used BLEU (Papineni et al., 2002) to assess the degree to which generated simplifications differed from gold standard references and the Flesch-Kincaid Grade Level index (FKGL; Kincaid et al. 1975) to measure the readability of the output (lower FKGL implies simpler output). In addition, we used SARI (Xu et al., 2016), which evaluates the quality of the output by comparing it against the source and reference simplifications. BLEU, FKGL, and SARI are all measured at corpus-level. We also evaluated system output by eliciting human judgments via Amazon's Mechanical Turk. Specifically, (self-reported) native English speakers were asked to rate simplifications on three dimensions: Fluency (is the output grammatical and well formed?), Adequacy (to what extent is the meaning expressed in the original sentence preserved in the output?) and Simplicity (is the output simpler than the original sentence?). All ratings were obtained using a five point Likert scale.
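For readers unfamiliar with FKGL, the standard Flesch-Kincaid Grade Level formula is shown below as a small function. The word, sentence, and syllable counts are passed in directly, since syllable counting requires a heuristic or a pronunciation dictionary that this sketch does not attempt; the example counts are made up.

```python
def fkgl(num_words, num_sentences, num_syllables):
    """Flesch-Kincaid Grade Level from corpus-level counts."""
    if num_words <= 0 or num_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    words_per_sentence = num_words / num_sentences
    syllables_per_word = num_syllables / num_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Hypothetical counts for a complex text and its simplification.
complex_score = fkgl(num_words=100, num_sentences=4, num_syllables=160)
simple_score = fkgl(num_words=100, num_sentences=8, num_syllables=140)
print(complex_score, simple_score)
```

Shorter sentences and shorter words both lower the score, which is why FKGL rewards deletion and splitting but says nothing about whether meaning or grammaticality is preserved.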
Comparison Systems We compared our model against several systems previously proposed in the literature. These include PBMT-R, a monolingual phrase-based machine translation system with a reranking post-processing step [8] (Wubben et al., 2012) and Hybrid, a model which first performs sentence splitting and deletion operations over discourse representation structures and then further simplifies sentences with PBMT-R (Narayan and Gardent, 2014). Hybrid [9] is state of the art on the WikiSmall dataset. Comparisons with SBMT-SARI, a syntax-based translation model trained on PPDB (Ganitkevitch et al., 2013) and tuned with SARI (Xu et al., 2016), are problematic due to the size of PPDB which is considerably larger than any of the datasets used in this work (it contains 106 million sentence pairs with 2 billion words). Nevertheless, we compare [10] against SBMT-SARI, but only models trained on WikiLarge, our largest dataset.

Results
Since Newsela contains high quality simplifications created by professional editors, we performed the bulk of our experiments on this dataset. Specifically, we set out to answer two questions: (a) which neural model performs best and (b) how do neural models which are resource lean and do not have access to linguistic annotations fare against more traditional systems. We therefore compared the basic attention-based encoder-decoder model (EncDecA) with the deep reinforcement learning model (DRESS; Section 3), and a linear combination of DRESS and the lexical simplification model (DRESS-LS; Section 4). Neural models were further compared against two strong baselines, PBMT-R and Hybrid. Table 3 shows example output of all models on the Newsela dataset. The top block in Table 1 summarizes the results of our automatic evaluation. As can be seen, all neural models obtain higher BLEU, lower FKGL and higher SARI compared to PBMT-R. Hybrid has the lowest FKGL and highest SARI. Compared to EncDecA, DRESS scores lower on FKGL and higher on SARI, which indicates that the model has indeed learned to optimize the reward function which includes SARI. Integrating lexical simplification (DRESS-LS) yields better BLEU, but slightly worse FKGL and SARI.
[8] We made a good-faith effort to re-implement their system following closely the details in Wubben et al. (2012).
[9] We are grateful to Shashi Narayan for running his system on our three datasets.
[10] The output of SBMT-SARI is publicly available.
The results of our human evaluation are presented in the top block of Table 2. We elicited judgments for 100 randomly sampled test sentences.
Aside from comparing system output (PBMT-R, Hybrid, EncDecA, DRESS, and DRESS-LS), we also elicited ratings for the gold standard Reference as an upper bound. We report results for Fluency, Adequacy, and Simplicity individually and in combination (All is the average rating of the three dimensions). As can be seen, DRESS and DRESS-LS outperform PBMT-R and Hybrid on Fluency, Simplicity, and overall. The fact that neural models (EncDecA, DRESS and DRESS-LS) fare well on Fluency is perhaps not surprising given the recent success of LSTMs in language modeling and neural machine translation (Zaremba et al., 2014; Jean et al., 2015).
Table 3: Example output of all models on the Newsela dataset.
Complex: There's just one major hitch: the primary purpose of education is to develop citizens with a wide variety of skills.
Reference: The purpose of education is to develop a wide range of skills.
PBMT-R: It's just one major hitch: the purpose of education is to make people with a wide variety of skills.
Hybrid: one hitch the purpose is to develop citizens.
EncDecA: The key of education is to develop people with a wide variety of skills.
DRESS: There's just one major hitch: the main goal of education is to develop people with lots of skills.
DRESS-LS: There's just one major hitch: the main goal of education is to develop citizens with lots of skills.
Complex: "They were so burdened by the past they couldn't think about the future," said Barnet, 62, who was president of Columbia Records, the No.1 record label in the United States, before joining Capitol.
Reference: Capitol was stuck in the past. It could not think about the future, Barnett said.
PBMT-R: "They were so affected by the past they couldn't think about the future," said Barnett, 62, was president of Columbia Records, before joining Capitol building.
Hybrid: 'They were so burdened by the past they couldn't think about the future," said Barnett, 62, who was Columbia Records, president of the No.1 record label in the united states, before joining Capitol.
EncDecA: "They were so burdened by the past they couldn't think about the future," said Barnett, who was president of Columbia Records, the No.1 record labels in the United States.
DRESS: "They were so sicker by the past they couldn't think about the future," said Barnett, who was president of Columbia Records.
DRESS-LS: "They were so burdened by the past they couldn't think about the future," said Barnett, who was president of Columbia Records.
Neural models obtain worse ratings on Adequacy but are closest to the human references on this dimension. DRESS-LS (and DRESS) are significantly better (p < 0.01) on Simplicity than EncDecA, PBMT-R, and Hybrid which indicates that our reinforcement learning based model is effective at creating simpler output. Combined ratings (All) for DRESS-LS are significantly different compared to the other models but not to DRESS and the Reference. Nevertheless, integration of the lexical simplification model boosts performance as ratings increase almost across the board (Simplicity is slightly worse). Returning to our original questions, we find that neural models are more fluent than comparison systems, while performing non-trivial rewrite operations (see the SARI scores in Table 1) which yield simpler output (see the Simplicity column in Table 2). Based on our judgment elicitation study, neural models trained with reinforcement learning perform best, with DRESS-LS having a slight advantage. We further analyzed model performance by computing various statistics on the simplified output. We measured average sentence length and the degree to which DRESS and comparison systems perform rewriting operations. We approximated the latter with Translation Error Rate (TER; Snover et al. 2006), a measure commonly used to automatically evaluate the quality of machine translation output. We used TER to compute the (average) number of edits required to change an original complex sentence to simpler output. We also report the number of edits by type, i.e., the number of insertions, substitutions, deletions, and shifts needed (on average) to convert complex to simple sentences.
As shown in Table 4, Hybrid obtains the highest TER, followed by our models (DRESS and DRESS-LS), which indicates that they actively perform rewriting. Hybrid is perhaps too aggressive when simplifying a sentence, as it obtains low Fluency and Adequacy scores in human evaluation (Table 2). There is a strong correlation between sentence length and number of deletion operations (i.e., more deletions lead to shorter sentences) and PBMT-R performs very few deletions. Overall, reinforcement learning encourages deletion (see DRESS and DRESS-LS), while performing a reasonable amount of additional operations (e.g., substitutions and shifts) compared to EncDecA and PBMT-R.
The middle blocks in Tables 1 and 2 report results on the WikiSmall dataset. FKGL and SARI follow a similar pattern as on Newsela. BLEU scores for PBMT-R, Hybrid, and EncDecA are much higher compared to DRESS and DRESS-LS. Hybrid obtains best BLEU and SARI scores, while DRESS and DRESS-LS do very well on FKGL. In human evaluation, we elicited judgments on the entire WikiSmall test set (100 sentences). We compared DRESS-LS with PBMT-R, Hybrid, and gold standard Reference simplifications. As human experiments are time-consuming and expensive, we did not include other neural models besides DRESS-LS, based on our Newsela study which showed that EncDecA is inferior to variants trained with reinforcement learning and that DRESS-LS is the better performing model (however, we do compare all models in Table 1). DRESS-LS is significantly better on Simplicity than PBMT-R, Hybrid, and the Reference. It performs on par with PBMT-R on Fluency and worse on Adequacy (but still closer to the human Reference than PBMT-R or Hybrid). When combining all ratings (All in Table 2), DRESS-LS is significantly better than PBMT-R, Hybrid, and the Reference.
The bottom blocks in Tables 1 and 2 report results on WikiLarge. We compared our models with PBMT-R, Hybrid, and SBMT-SARI (Xu et al., 2016). The FKGL follows a similar pattern as in the previous datasets. PBMT-R and our models are best in terms of BLEU while SBMT-SARI outperforms all other systems on SARI. Because there are 8 references for each complex sentence in the test set, BLEU scores are much higher compared to Newsela and WikiSmall. In human evaluation, we again elicited judgments for 100 randomly sampled test sentences. We randomly selected one of the 8 references as the Reference upper bound. On Simplicity, DRESS-LS is significantly better than all comparison systems, except Hybrid. On Adequacy, it is better than Hybrid but significantly worse than other comparison systems. On Fluency, it is on par with PBMT-R but better than Hybrid and SBMT-SARI. On the All dimension, DRESS-LS significantly outperforms all comparison systems.

Conclusions
We developed a reinforcement learning-based text simplification model, which can jointly model simplicity, grammaticality, and semantic fidelity to the input. We also proposed a lexical simplification component that further boosts performance. Overall, we find that reinforcement learning offers an effective means of injecting prior knowledge into the simplification task, achieving good results across three datasets. In the future, we would like to explicitly model sentence splitting and simplify entire documents (rather than individual sentences). Beyond sentence simplification, the reinforcement learning framework presented here is potentially applicable to generation tasks such as sentence compression, generation of programming code (Ling et al., 2016), or poems (Zhang and Lapata, 2014).