Iterative Edit-Based Unsupervised Sentence Simplification

We present a novel iterative, edit-based approach to unsupervised sentence simplification. Our model is guided by a scoring function involving fluency, simplicity, and meaning preservation, and iteratively performs word- and phrase-level edits on the complex sentence. Compared with previous approaches, our model does not require a parallel training set, yet is more controllable and interpretable. Experiments on the Newsela and WikiLarge datasets show that our approach is nearly as effective as state-of-the-art supervised approaches.


Introduction
Sentence simplification is the task of rewriting text to make it easier to read, while preserving its main meaning and important information. Sentence simplification is relevant in various real-world and downstream applications. For instance, it can benefit people with autism (Evans et al., 2014), dyslexia (Rello et al., 2013), and low-literacy skills (Watanabe et al., 2009). It can also serve as a preprocessing step to improve parsers (Chandrasekar et al., 1996) and summarization systems (Klebanov et al., 2004).
Recent efforts in sentence simplification have been influenced by the success of machine translation. In fact, the simplification task is often treated as monolingual translation, where a complex sentence is translated to a simple one. Such simplification systems are typically trained in a supervised way, using either phrase-based machine translation (PBMT; Wubben et al., 2012; Narayan and Gardent, 2014; Xu et al., 2016) or neural machine translation (NMT; Zhang and Lapata, 2017; Guo et al., 2018; Kriz et al., 2019). Recently, sequence-to-sequence (Seq2Seq) NMT systems have been shown to be more successful and serve as the state of the art.
However, supervised Seq2Seq models have two shortcomings. First, they give little insight into the simplification operations, and provide little control or adaptability to different aspects of simplification (e.g., lexical vs. syntactic simplification). Second, they require a large number of complex-simple aligned sentence pairs, which in turn require considerable human effort to obtain.
In previous work, researchers have addressed some of the above issues. For example, Alva-Manchego et al. (2017) and Dong et al. (2019) explicitly model simplification operators such as word insertion and deletion. Although these approaches are more controllable and interpretable than standard Seq2Seq models, they still require large volumes of aligned data to learn these operations. To deal with the second issue, Surya et al. (2019) recently proposed an unsupervised neural text simplification approach based on the paradigm of style transfer. However, their model is hard to interpret and control, like other neural network-based models. Narayan and Gardent (2016) attempted to address both issues using a pipeline of lexical substitution, sentence splitting, and word/phrase deletion. However, these operations can only be executed in a fixed order.
In this paper, we propose an iterative, edit-based unsupervised sentence simplification approach, motivated by the shortcomings of existing work. We first design a scoring function that measures the quality of a candidate sentence based on the key characteristics of the simplification task, namely, fluency, simplicity, and meaning preservation. Then, we generate simplified candidate sentences by iteratively editing the given complex sentence using simplification operations (lexical simplification, phrase extraction, deletion, and reordering). Our model seeks the best simplified candidate sentence according to the scoring function. Compared with Narayan and Gardent (2016), the order of our simplification operations is not fixed and is decided by the model. Figure 1 illustrates an example in which our model first chooses to delete a sentence fragment, followed by reordering the remaining fragments and replacing a word with a simpler synonym.

[Figure 1: An example of three edit operations on a given sentence. Note that dropping clauses or phrases is common in text simplification datasets.]
We evaluate our approach on the Newsela (Xu et al., 2015) and WikiLarge (Zhang and Lapata, 2017) corpora. Experiments show that our approach outperforms previous unsupervised methods and even performs competitively with state-of-the-art supervised ones, in both automatic metrics and human evaluations. We also demonstrate the interpretability and controllability of our approach, even without parallel training data.

Related Work
Early work used handcrafted rules for text simplification, at both the syntactic level (Siddharthan, 2002) and the lexicon level (Carroll et al., 1999). Later, researchers adopted machine learning methods for text simplification, modeling it as monolingual phrase-based machine translation (Wubben et al., 2012; Xu et al., 2016). Further, syntactic information was also considered in the PBMT framework, for example, constituency trees (Zhu et al., 2010) and dependency trees (Bingel and Søgaard, 2016). Narayan and Gardent (2014) performed probabilistic sentence splitting and deletion, followed by MT-based paraphrasing. Nisioi et al. (2017) employed neural machine translation (NMT) for text simplification, using a sequence-to-sequence (Seq2Seq) model (Sutskever et al., 2014). Zhang and Lapata (2017) used reinforcement learning to optimize a reward based on simplicity, fluency, and relevance. Zhao et al. (2018a) integrated the transformer architecture and paraphrasing rules to guide simplification learning. Kriz et al. (2019) produced diverse simplifications by generating and re-ranking candidates by fluency, adequacy, and simplicity. Guo et al. (2018) showed that simplification benefits from multi-task learning with paraphrase and entailment generation. Martin et al. (2019) enhanced the transformer architecture with conditioning parameters such as length, lexical and syntactic complexity.
Recently, edit-based techniques have been developed for text simplification. Alva-Manchego et al. (2017) trained a model to predict three simplification operators (keep, replace, and delete) from aligned pairs. Dong et al. (2019) employed a similar approach but in an end-to-end trainable manner with neural networks. However, these approaches are supervised and require large volumes of parallel training data; also, their edits are only at the word level. By contrast, our method works at both word and phrase levels in an unsupervised manner.
For unsupervised sentence simplification, Surya et al. (2019) adopted style-transfer techniques, using adversarial and denoising auxiliary losses for content reduction and lexical simplification. However, their model is based on a Seq2Seq network, which is less interpretable and controllable. They cannot perform syntactic simplification since syntax typically does not change in style-transfer tasks. Narayan and Gardent (2016) built a pipeline-based unsupervised framework with lexical simplification, sentence splitting, and phrase deletion. However, these operations are separate components in the pipeline, and can only be executed in a fixed order.
Unsupervised edit-based approaches have recently been explored for natural language generation tasks, such as style transfer, paraphrasing, and sentence error correction. Li et al. (2018) proposed edit-based style transfer without parallel supervision. They replaced style-specific phrases with those in the target style, which are retrieved from the training corpus. Miao et al. (2019) used Metropolis-Hastings sampling for constrained sentence generation. In this paper, we model text generation as a search algorithm, and design search objective and search actions specifically for text simplification. Concurrent work further shows the success of search-based unsupervised text generation for paraphrasing (Liu et al., 2020) and summa-rization (Schumann et al., 2020).

Model
In this section, we first provide an overview of our approach, followed by a detailed description of each component, namely, the scoring function, the edit operations, and the stopping criteria.

Overview
We first define a scoring function as our search objective. It allows us to impose both hard and soft constraints, balancing the fluency, simplicity, and adequacy of candidate simplified sentences (Section 3.2).
Our approach iteratively generates multiple candidate sentences by performing a sequence of lexical and syntactic operations. It starts from the input sentence; in each iteration, it performs phrase and word edits to generate simplified candidate sentences (Section 3.3).
Then, a candidate sentence is selected according to certain criteria. This process is repeated until none of the candidates improves the score over the previous iteration by a threshold factor. The last candidate is returned as the simplified sentence (Section 3.4).

Scoring Function
Our scoring function is the product of several individual scores that evaluate various aspects of a candidate simplified sentence. This is also known as the product-of-experts model (Hinton, 2002).
SLOR score from a syntax-aware language model (f_slor). This measures the language fluency and structural simplicity of a candidate sentence. A probabilistic language model (LM) is often used as an estimate of sentence fluency (Miao et al., 2019). In our work, we make two important modifications to a plain LM.
First, we replace an LM's estimated sentence probability with the syntactic log-odds ratio (SLOR, Pauls and Klein, 2012), to better measure fluency and human acceptability. According to Lau et al. (2017), SLOR shows the best correlation to human acceptability of a sentence, among many sentence probability-based scoring functions. SLOR was also shown to be effective in unsupervised text compression (Kann et al., 2018).
Given a trained language model (LM) and a sentence s, SLOR is defined as

SLOR(s) = (1/|s|) (log P_LM(s) − log P_U(s)),

where P_LM(s) is the sentence probability given by the language model, P_U(s) = ∏_{w∈s} P(w) is the product of the unigram probabilities of the words w in the sentence, and |s| is the sentence length. SLOR essentially penalizes a plain LM's probability by the unigram likelihood and the length. It ensures that the fluency score of a sentence is not penalized by the presence of rare words. Consider two sentences, "I went to England for vacation" and "I went to Senegal for vacation." Even though both sentences are equally fluent, a standard LM will give a higher score to the former, since the word "England" is more likely to occur than "Senegal." In simplification, SLOR is preferred for preserving rare words such as named entities.
Second, we use a syntax-aware LM, i.e., in addition to words, we use part-of-speech (POS) and dependency tags as inputs to the LM (Zhao et al., 2018b). For a word w_i, the input to the syntax-aware LM is the concatenation of the embedding of w_i and the embeddings of its POS and dependency tags. Note that our LM is trained on simple sentences. Thus, the syntax-aware LM prefers a syntactically simple sentence. It also helps to identify sentences that are structurally ungrammatical.
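The length-normalized SLOR score above can be sketched in a few lines. The log-probabilities below are made-up illustrative values, not outputs of the actual syntax-aware LM; the point is that a rare word lowers the LM term and the unigram term together, so the penalty roughly cancels:

```python
import math

def slor(logp_lm, unigram_logps):
    """SLOR: length-normalized LM log-probability minus unigram log-probability.

    logp_lm       -- log P_LM(s) from the (syntax-aware) language model
    unigram_logps -- per-token log P(w) unigram probabilities
    """
    n = len(unigram_logps)                 # |s|, the sentence length
    logp_unigram = sum(unigram_logps)      # log of the product of unigram probs
    return (logp_lm - logp_unigram) / n

# Hypothetical values: a 6-token sentence of common words vs. the same
# sentence with one rare word (e.g., a named entity). The rare word lowers
# both P_LM and P_U, so SLOR stays the same while a plain LM score would drop.
common = slor(math.log(1e-6), [math.log(0.01)] * 6)
rare = slor(math.log(1e-8), [math.log(0.01)] * 5 + [math.log(1e-4)])
```

With these toy numbers the two sentences receive identical SLOR scores, which is exactly the behavior that protects rare entities.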
Cosine Similarity (f_cos). Cosine similarity is an important measure of meaning preservation. We compute the cosine value between the sentence embeddings of the original complex sentence (c) and the generated candidate sentence (s), where our sentence embeddings are calculated as the idf-weighted average of individual word embeddings. Our sentence similarity measure acts as a hard filter, i.e., f_cos(s) = 1 if cos(c, s) > τ, and f_cos(s) = 0 otherwise, for some threshold τ.
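A minimal sketch of the hard cosine filter, using plain Python lists as word vectors. In practice the embeddings, idf table, and threshold τ would come from the training corpus; the toy values here are illustrative only:

```python
import math

def sentence_embedding(tokens, embeddings, idf):
    """idf-weighted average of word vectors (tokens without a vector are skipped)."""
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    total = 0.0
    for tok in tokens:
        if tok in embeddings:
            w = idf.get(tok, 1.0)
            vec = [v + w * e for v, e in zip(vec, embeddings[tok])]
            total += w
    return [v / total for v in vec] if total else vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def f_cos(complex_tokens, cand_tokens, embeddings, idf, tau=0.7):
    """Hard filter: 1 if the candidate stays close to the source sentence, else 0."""
    c = sentence_embedding(complex_tokens, embeddings, idf)
    s = sentence_embedding(cand_tokens, embeddings, idf)
    return 1 if cosine(c, s) > tau else 0
```

Because the filter is a 0/1 indicator rather than a weighted term, a candidate that drifts too far in meaning is rejected outright, regardless of its other scores.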
Entity Score (f_entity). Entities help identify the key information of a sentence and therefore are also useful in measuring meaning preservation. Thus, we count the number of entities in the sentence as part of the scoring function, where entities are detected by a third-party tagger.
Length (f_len). This score is proportional to the inverse of the sentence length. It forces the model to generate shorter and simpler sentences. However, we reject sentences shorter than a specified length (≤6 tokens) to prevent over-shortening.
FRE (f_fre). The Flesch Reading Ease (FRE) score (Kincaid et al., 1975) measures the ease of readability of a text. It is based on text features such as the average sentence length and the average number of syllables per word. A higher score indicates that the text is simpler to read.
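The FRE formula itself is straightforward to compute. The sketch below uses a crude vowel-group syllable counter, which is only a rough approximation of the syllable counts a real implementation would use:

```python
def count_syllables(word):
    """Crude syllable estimate: count groups of consecutive vowels (approximation)."""
    vowels = "aeiouy"
    word = word.lower()
    groups = 0
    prev_vowel = False
    for ch in word:
        is_vowel = ch in vowels
        if is_vowel and not prev_vowel:
            groups += 1
        prev_vowel = is_vowel
    return max(groups, 1)

def flesch_reading_ease(sentences):
    """FRE = 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    words = [w for sent in sentences for w in sent]
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * len(words) / len(sentences)
            - 84.6 * syllables / len(words))

simple = flesch_reading_ease([["the", "cat", "sat", "on", "the", "mat"]])
complex_ = flesch_reading_ease(
    [["unquestionably", "extraordinary", "circumstances", "materialized"]])
```

Short sentences of monosyllabic words score far higher than long polysyllabic ones, which is the property the scoring function exploits.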
We compute the overall scoring function as the product of the individual scores:

f(s) = f_cos(s) · f_slor(s)^α · f_fre(s)^β · f_len(s)^γ · f_entity(s)^δ,
where the weights α, β, γ, and δ balance the relative importance of the different scores. Recall that the cosine similarity measure does not require a weight since it is a hard indicator function. In Section 4.5, we will experimentally show that the weights defined for different scores affect different characteristics of simplification and thus provide more adaptability and controllability.
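As a sketch, the product-of-experts combination (with the cosine score applied as a hard 0/1 gate) might look as follows; it assumes the individual component scores have already been computed and are positive:

```python
def overall_score(f_slor, f_fre, f_len, f_entity, f_cos,
                  alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """Product-of-experts scoring function:
    f(s) = f_cos * f_slor^alpha * f_fre^beta * f_len^gamma * f_entity^delta.

    f_cos is the 0/1 hard filter and thus carries no weight; the other
    components are assumed to be positive reals.
    """
    if f_cos == 0:  # hard meaning-preservation filter: reject outright
        return 0.0
    return (f_slor ** alpha) * (f_fre ** beta) * (f_len ** gamma) * (f_entity ** delta)
```

Raising a weight amplifies that component's influence on the product, which is what gives the model its controllability (Section 4.5): e.g., a larger γ rewards shorter sentences more aggressively.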

Generating Candidate Sentences
We generate candidate sentences by editing words and phrases. We use a third-party parser to obtain the constituency tree of a source sentence. Each clause- and phrase-level constituent (e.g., S, VP, and NP) is considered a phrase. Since a constituent can occur at any depth in the parse tree, we can deal with both long and short phrases at different granularities. In Figure 2, for example, both "good" (ADJP) and "tasted good" (VP) are constituents and thus considered phrases, whereas "tasted" is considered a single word. For each phrase, we generate a candidate sentence using the edit operations explained below, with Figure 1 being a running example.
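The traversal that collects constituents at every depth can be sketched over a toy nested-tuple tree, standing in for the parser's output (the tag set and tree encoding here are our own illustrative choices):

```python
PHRASE_TAGS = {"S", "SBAR", "NP", "VP", "PP", "ADJP", "ADVP"}

def phrases(tree):
    """Collect all clause- and phrase-level constituents, at every depth.

    A tree node is (label, child, child, ...); a leaf is a plain string.
    Returns (label, leaf_tokens) pairs.
    """
    found = []

    def visit(node):
        if isinstance(node, str):       # a bare word, e.g. "tasted"
            return [node]
        label, *children = node
        leaves = [leaf for child in children for leaf in visit(child)]
        if label in PHRASE_TAGS:
            found.append((label, leaves))
        return leaves

    visit(tree)
    return found

# "the soup tasted good": both "good" (ADJP) and "tasted good" (VP)
# are constituents and thus phrases; "tasted" is a single word.
tree = ("S", ("NP", "the", "soup"), ("VP", "tasted", ("ADJP", "good")))
```

Running `phrases(tree)` yields phrases at all granularities, from the single-word ADJP up to the full clause.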
Removal. For each phrase detected by the parser, this operation generates a new candidate sentence by removing that phrase from the source sentence. In Figure 1, our algorithm can drop the phrase "according to a Seattle based reporter," which is not the main clause of the sentence. The removal operation allows us to remove peripheral information in a sentence for content reduction.
Extraction. This operation simply extracts a selected phrase (including a clause) as the candidate sentence. This allows us to select the main clause in a sentence and remove remaining peripheral information.
Reordering. For each phrase in a sentence, we generate candidate sentences by moving the phrase before or after another phrase (identified by clause- and phrase-level constituent tags). In the running example, the phrase "In 2016 alone" is moved between the phrases "12 billion dollars" and "on constructing theme parks." As seen, the reordering operation is able to perform syntactic simplification.
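On token lists with phrase spans given as (start, end) indices, the three structural edits reduce to simple list surgery. A minimal sketch, not the paper's implementation:

```python
def removal(tokens, span):
    """Drop the phrase at span = (start, end) from the sentence."""
    i, j = span
    return tokens[:i] + tokens[j:]

def extraction(tokens, span):
    """Keep only the phrase itself as the candidate sentence."""
    i, j = span
    return tokens[i:j]

def reordering(tokens, span, dest):
    """Move the phrase to position dest within the remaining tokens."""
    i, j = span
    phrase = tokens[i:j]
    rest = tokens[:i] + tokens[j:]
    return rest[:dest] + phrase + rest[dest:]
```

Each phrase found in the parse tree yields one removal candidate, one extraction candidate, and several reordering candidates (one per destination position).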
Substitution. In each phrase, we identify the most complex word as the rarest one according to the idf score. For the selected complex word, we generate possible substitutes using a two-step strategy.
First, we obtain candidate synonyms by taking the union of the WordNet synonym set (Miller, 1995) and the closest words from GloVe (Pennington et al., 2014) and Word2Vec (Mikolov et al., 2013) embeddings (where embedding closeness is measured by Euclidean distance). Second, a candidate synonym is determined to be an appropriate simple substitute if it satisfies the following conditions: a) it has a lower idf score than the complex word, where the scores are computed from the target simple sentences; b) it is not a morphological inflection of the complex word; c) the cosine similarity between its word embedding and that of the complex word exceeds a threshold; and d) it has the same part-of-speech and dependency tags in the sentence as the complex word. We then generate candidate sentences by replacing the complex word with all qualified lexical substitutes. Notably, we do not replace entity words identified by entity taggers.
In our example sentence, consider the phrase "constructing theme parks." The word "constructing" is chosen as the word to be simplified, and is replaced with "building." As seen, this operation performs lexical simplification.
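Conditions (a)-(d) above amount to a conjunction of simple checks. The sketch below assumes the idf table, embedding similarity, and POS/dependency tags have already been precomputed by external tools; the dict layout is our own illustrative choice:

```python
def is_valid_substitute(complex_w, cand, idf, sim_threshold=0.6):
    """Check conditions (a)-(d) for a candidate synonym.

    complex_w and cand are dicts with 'word', 'lemma', 'pos', and 'dep';
    cand additionally carries 'cos', its embedding similarity to the
    complex word. All values are assumed precomputed externally.
    """
    return (
        idf.get(cand["word"], float("inf")) < idf[complex_w["word"]]  # (a) simpler (lower idf)
        and cand["lemma"] != complex_w["lemma"]                       # (b) not a mere inflection
        and cand["cos"] > sim_threshold                               # (c) similar meaning
        and cand["pos"] == complex_w["pos"]                           # (d) same POS tag...
        and cand["dep"] == complex_w["dep"]                           #     ...and dependency tag
    )

# Running example: "constructing" -> "building" passes; "constructs" fails (b).
complex_w = {"word": "constructing", "lemma": "construct", "pos": "VERB", "dep": "xcomp"}
good = {"word": "building", "lemma": "build", "pos": "VERB", "dep": "xcomp", "cos": 0.8}
inflection = {"word": "constructs", "lemma": "construct", "pos": "VERB", "dep": "xcomp", "cos": 0.95}
idf = {"constructing": 5.0, "building": 2.0, "constructs": 5.5}
```

The morphological-inflection check (b) is approximated here by lemma equality; the "xcomp" dependency tag is a hypothetical value for the example.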

The Iterative Algorithm
Given an input complex sentence, our algorithm iteratively performs edits to search for a higher-scoring candidate.
In each iteration, we consider all the operations (i.e., removal, extraction, reordering, and substitution). Each operation may generate multiple candidates (e.g., multiple words for substitution); we filter out a candidate sentence if the improvement does not pass an operation-specific threshold. We choose the highest-scoring sentence from those that are not filtered out. Our algorithm terminates if no edit passes the threshold, and the final candidate is our generated simplified sentence.
Our algorithm includes a filtering step for each operation. We only keep a candidate sentence if it is better than the previous one by a multiplicative factor, i.e.,

f(c) > r_op · f(s),    (3)

where s is the sentence given by the previous iteration, c is a candidate generated by operation op from s, and r_op is the operation-specific threshold. Notably, we allow different thresholds for each operation. This provides control over different aspects of simplification, namely, lexical simplification, syntactic simplification, and content reduction. A lower threshold for substitution, for example, encourages the model to perform more lexical simplification.
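The overall procedure can be sketched as a greedy search loop. The toy scorer and substitution table below are hypothetical stand-ins for the real scoring function and edit operations:

```python
def simplify(sentence, operations, score, thresholds):
    """Greedy iterative search: apply every edit operation, keep candidates
    that beat the current sentence by an operation-specific multiplicative
    threshold, move to the best survivor; stop when no candidate passes."""
    current = sentence
    while True:
        survivors = []
        for name, op in operations.items():
            for cand in op(current):
                if score(cand) > thresholds[name] * score(current):
                    survivors.append(cand)
        if not survivors:
            return current
        current = max(survivors, key=score)

# Toy demonstration: the scorer rewards words from a tiny "simple" vocabulary,
# and the only operation substitutes single words via a hypothetical table.
SYNONYMS = {"purchase": "buy", "utilize": "use"}

def substitution_op(tokens):
    for i, w in enumerate(tokens):
        if w in SYNONYMS:
            yield tokens[:i] + [SYNONYMS[w]] + tokens[i + 1:]

score = lambda toks: 1 + sum(w in {"buy", "and", "use"} for w in toks)
result = simplify(["purchase", "and", "utilize"],
                  {"substitution": substitution_op}, score, {"substitution": 1.25})
```

With the threshold of 1.25, each substitution must lift the score by at least 25%, mirroring Equation (3); both complex words are replaced before the loop terminates.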

Data
We use the Newsela (Xu et al., 2015) and WikiLarge (Zhang and Lapata, 2017) datasets for evaluating our model.
Newsela is a collection of 1,840 news articles written by professional editors at 5 reading levels for children. We use the standard split and exclude simple-complex sentence pairs that are one reading level apart, following Zhang and Lapata (2017). This gives 95,208 training, 1,129 validation, and 1,077 test sentences.
The WikiLarge dataset is currently the largest text simplification corpus. It contains 296,402, 2,000, and 359 complex-simple sentence pairs for training, validation, and testing, respectively. The training set of WikiLarge consists of automatically aligned sentence pairs from the normal and simple Wikipedia versions. The validation and test sets contain multiple human-written references, against which we evaluate our algorithm.
For each corpus, we only use its training set to learn a language model of simplified sentences. For the WikiLarge dataset, we also train a Word2Vec embedding model from scratch on its source and target training sentences. These embeddings are used to obtain candidate synonyms in the substitution operation.

Training Details
For the LM, we use a two-layer, 256-dimensional recurrent neural network (RNN) with the gated recurrent unit (GRU, Chung et al., 2014). We initialize word embeddings using 300-dimensional GloVe (Pennington et al., 2014); out-of-vocabulary words are treated as UNK, initialized uniformly in the range of ±0.05. Embeddings for POS tags and dependency tags are 150-dimensional, also initialized randomly. We fine-tune all embeddings during training.
We use the Averaged Stochastic Gradient Descent (ASGD) algorithm (Polyak and Juditsky, 1992) to train the LM, with 0.4 as the dropout rate and 32 as the batch size. For the Newsela dataset, the thresholds r_op are set to 1.25 for all the edit operations. All the weights in our scoring function (α, β, γ, δ) are set to 1. For the WikiLarge dataset, the thresholds are set as 1.25 for the removal and reordering operations, 0.8 for substitution, and 5.0 for extraction. The weights in the scoring function (α, β, γ, δ) are set to 0.5, 1.0, 0.25, and 1.0, respectively.
We use CoreNLP to construct the constituency tree and spaCy to generate part-of-speech and dependency tags.

Competing Methods
We first consider the reference to obtain an upper bound for a given evaluation metric. We also consider the complex sentence itself as a trivial baseline, denoted by Complex.
Next, we develop a simple heuristic that removes rare words occurring ≤ 250 times in the simple sentences of the training corpus, denoted by Reduce-250. As discussed in Section 4.4, this simple heuristic demonstrates the importance of balancing different automatic evaluation metrics.
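The Reduce-250 heuristic is little more than a filter over corpus counts; as a sketch (the cutoff of 250 matches the value stated above):

```python
from collections import Counter

def reduce_rare(sentences, corpus_tokens, cutoff=250):
    """Baseline heuristic: delete any word occurring <= cutoff times
    in the simple side of the training corpus."""
    counts = Counter(corpus_tokens)
    return [[w for w in sent if counts[w] > cutoff] for sent in sentences]

# Toy corpus: "the" is frequent, "cat" is rare, so "cat" gets dropped.
corpus = ["the"] * 300 + ["cat"] * 10
output = reduce_rare([["the", "cat", "the"]], corpus)
```

Indiscriminately deleting rare words inflates deletion-heavy metrics while destroying meaning, which is precisely the imbalance discussed in Section 4.4.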
For unsupervised competing methods, we compare with Surya et al. (2019), whose approach is inspired by unsupervised neural machine translation. They proposed two variants, UNMT and UNTS, but their results are only available for WikiLarge.
We also compare our model with supervised methods. First, we consider non-neural phrase-based machine translation (PBMT) methods: PBMT-R (Wubben et al., 2012), which re-ranks sentences generated by PBMT for diverse simplifications; SBMT-SARI (Xu et al., 2016), which uses an external paraphrasing database; and Hybrid (Narayan and Gardent, 2014), which uses a combination of PBMT and discourse representation structures. Next, we compare our method with neural machine translation (NMT) systems: EncDecA, which is a vanilla Seq2Seq model with attention (Nisioi et al., 2017); Dress and Dress-Ls, which are based on deep reinforcement learning (Zhang and Lapata, 2017); DMass (Zhao et al., 2018a), which is a transformer-based model with external simplification rules; EncDecP, which is an encoder-decoder model with a pointer mechanism; EntPar, which is based on multi-task learning (Guo et al., 2018); S2S-All-FA, which is a reranking-based model focusing on lexical simplification (Kriz et al., 2019); and Access, which is based on the transformer architecture (Martin et al., 2019). Finally, we compare with a supervised edit-based neural model, Edit-NTS (Dong et al., 2019).
We evaluate our model with different subsets of operations, i.e., removal (RM), extraction (EX), reordering (RO), and lexical substitution (LS). In our experiments, we test the following variants: RM+EX, RM+EX+LS, RM+EX+RO, and RM+EX+LS+RO.

Automatic Evaluation
Tables 1 and 2 present the results of the automatic evaluation on the Newsela and WikiLarge datasets, respectively.
We use the SARI metric (Xu et al., 2016) to measure the simplicity of the generated sentences. SARI computes the arithmetic mean of the n-gram F1 scores of three rewrite operations: adding, deleting, and keeping. The individual F1 scores of these operations are reported in the columns "Add," "Delete," and "Keep." We also compute the BLEU score (Papineni et al., 2002) to measure the closeness between a candidate and a reference. Xu et al. (2016) and Sulem et al. (2018) show that BLEU correlates with human judgement on fluency and meaning preservation for text simplification. In addition, we include a few intrinsic measures (without reference) to evaluate the quality of a candidate sentence: the Flesch-Kincaid grade level (FKGL), evaluating the ease of reading, as well as the average length of the sentence.
A few recent text simplification studies (Dong et al., 2019;Kriz et al., 2019) did not use BLEU for evaluation, noticing that the complex sentence itself achieves a high BLEU score (albeit a low SARI score), since the complex sentence is indeed fluent and preserves meaning. This is also shown by our Complex baseline.
For the Newsela dataset, however, we notice that the major contribution to the SARI score is from the deletion operation. By analyzing previous work such as EntPar, we find that it reduces the sentence length to a large extent, and achieves high SARI due to the extremely high F1 score of "Delete." However, its BLEU score is low, showing a lack of fluency and meaning preservation. This is also seen from the high SARI of Reduce-250 in Table 1. Ideally, we want both high SARI and high BLEU, and thus, we calculate the geometric mean (GM) of the two as the main evaluation metric for the Newsela dataset.
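The GM combination is simply the geometric mean of the two metrics (both on a 0-100 scale); as a sketch:

```python
import math

def geometric_mean(sari, bleu):
    """GM of SARI and BLEU: rewards balance. A model that maximizes one
    metric while tanking the other (e.g., aggressive deletion) scores low."""
    return math.sqrt(sari * bleu)
```

For example, a balanced system at SARI 30 / BLEU 30 outscores a deletion-heavy one at SARI 50 / BLEU 10, which is the behavior motivating GM as the main Newsela metric.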
On the other hand, this is not the case for WikiLarge, since none of the models can achieve a high SARI by using only one operation among "Add," "Delete," and "Keep." Moreover, the complex sentence itself yields an almost perfect BLEU score (partially due to the multi-reference nature of WikiLarge). Thus, we do not use GM, and for this dataset, SARI is our main evaluation metric.
Overall results on Newsela. Table 1 shows the results on Newsela. By default (without †), validation is performed using the GM score. Still, our unsupervised text simplification achieves a SARI score around 26-27, outperforming quite a few supervised methods. Further, we experiment with SARI-based validation (denoted by †), following the setting of most previous work (Dong et al., 2019;Guo et al., 2018). We achieve 30.44 SARI, which is competitive with state-of-the-art supervised methods.
Our model also achieves high BLEU scores. As seen, all our variants, if validated by GM (without †), outperform competing methods in BLEU. One of the reasons is that our model performs text simplification by making edits on the original sentence instead of rewriting it from scratch.
In terms of the geometric mean (GM), our unsupervised approach outperforms all previous work, showing a good balance between simplicity and content preservation. The readability of our generated sentences is further confirmed by the intrinsic FKGL score.
Overall results on WikiLarge. For the WikiLarge experiments in Table 2, we perform validation on SARI, which is the main metric in this experiment. Our model outperforms existing unsupervised methods, and is also competitive with state-of-the-art supervised methods.
We observe that lexical simplification (LS) is important in this dataset, as its improvement is large compared with the Newsela experiment in Table 1. Additionally, reordering (RO) does not improve performance, as it is known that WikiLarge does not focus on syntactic simplification (Xu et al., 2016). The best performance for this experiment is obtained by the RM+EX+LS model.

Controllability
We now perform a detailed analysis of the scoring function described in Section 3.2 to understand the effect on different aspects of simplification. We use the RM+EX+LS+RO variant and the Newsela corpus as the testbed.
The SLOR score with syntax-aware LM. We analyze our syntax-aware SLOR score in the search objective. First, we remove the SLOR score and use the standard sentence probability. Since SLOR helps preserve rare words, which may be entities, removing it makes the readability score (FKGL) better (i.e., lower), but the BLEU score decreases. We then evaluate the importance of using a syntax-aware LM instead of a standard LM, and see a decrease in both SARI and BLEU scores. In both cases, the GM score decreases.
Threshold values and relative weights. Table 4 analyzes the effect of the hyperparameters of our model, namely, the threshold in the stopping criteria and the relative weights in the scoring function.
As discussed in Section 3.4, we use a threshold as the stopping criterion for our iterative search algorithm. For each operation, we require that a new candidate be better than the previous iteration by a multiplicative threshold r_op in Equation (3). In this analysis, we set the same threshold for all operations for simplicity. As seen in Table 4, increasing the threshold leads to better meaning preservation since the model is more conservative (making fewer edits). This is shown by the higher BLEU and lower SARI scores.
Regarding the weights for each individual scoring function, we find that increasing the weight β for the FRE readability score makes sentences shorter, more readable, and thus simpler. This is also indicated by higher SARI values. When sentences are rewarded for being short (with large γ), SARI increases but BLEU decreases, showing less meaning preservation. The readability scores initially increase with the reduction in length, but then decrease. Finally, if we increase the weight δ for the entity score, the sentences become longer and more complex since the model is penalized more for deleting entities.
In summary, the above analysis shows the controllability of our approach in terms of different simplification aspects, such as simplicity, meaning preservation, and readability.

Human Evaluation
We conducted a human evaluation on the Newsela dataset since automated metrics may be insufficient for evaluating text generation. We chose 30 sentences from the test set for annotation and considered a subset of baselines. For our model variants, we chose RM+EX+LS+RO, considering both validation settings (GM and SARI).
We followed the evaluation setup in Dong et al. (2019), and measured the adequacy (How much meaning from the original sentence is preserved?), simplicity (Is the output simpler than the original sentence?), and fluency (Is the output grammatical?) on a five-point Likert scale. We recruited three volunteers, one native English speaker and two non-native fluent English speakers. Each volunteer was given 30 sentences from different models (and references) in a randomized order. Additionally, we asked the volunteers to count the instances where models produce incorrect details or generate text that is not implied by the original sentence. We did this because neural models are known to hallucinate information (Rohrbach et al., 2018). We report the average count of false information per sentence, denoted as FI.
We observe that our model RM+EX+LS+RO (when validated by GM) performs better than Hybrid, a combination of PBMT and discourse representation structures, in all aspects. It also performs competitively with remaining supervised NMT models.
For adequacy and fluency, Dress-Ls performs the best since it produces relatively longer sentences. For simplicity, S2S-All-FA performs the best since it produces shorter sentences. Thus, a balance is needed between these three measures. As seen, RM+EX+LS+RO ranks second in terms of the average score in the list (reference excluded). The human evaluation confirms the effectiveness of our unsupervised text simplification, even when compared with supervised methods.
We also compare our model variants RM+EX+LS+RO (validated by GM) and RM+EX+LS+RO † (validated by SARI). As expected, the latter generates shorter sentences, performing better in simplicity but worse in adequacy and fluency.
Regarding false information (FI), we observe that previous neural models tend to generate more false information, possibly due to the vagueness in the continuous space. By contrast, our approach only uses neural networks in the scoring function, but performs discrete edits of words and phrases. Thus, we achieve high fidelity (low FI), similar to the non-neural Hybrid model, which also performs editing on discourse parsing structures with PBMT.

[Table 5: Human evaluation on Newsela, where we measure adequacy (A), simplicity (S), fluency (F), and their average score (Avg), based on a 1-5 Likert scale. We also count average instances of false information per sentence (FI).]
In summary, our model takes advantage of both neural networks (achieving high adequacy, simplicity, and fluency) and traditional phrase-based approaches (achieving high fidelity).
Interestingly, the reference of Newsela has a poor (high) FI score, because the editors wrote simplifications at the document level, rather than the sentence level.

Conclusion
We proposed an iterative, edit-based approach to text simplification. Our approach works in an unsupervised manner that does not require a parallel corpus for training. In future work, we plan to add paraphrase generation to generate diverse simple sentences.