Neural CRF Model for Sentence Alignment in Text Simplification

The success of a text simplification system heavily depends on the quality and quantity of complex-simple sentence pairs in the training corpus, which are extracted by aligning sentences between parallel articles. To evaluate and improve sentence alignment quality, we create two manually annotated sentence-aligned datasets from two commonly used text simplification corpora, Newsela and Wikipedia. We propose a novel neural CRF alignment model which not only leverages the sequential nature of sentences in parallel documents but also utilizes a neural sentence pair model to capture semantic similarity. Experiments demonstrate that our proposed approach outperforms all previous work on the monolingual sentence alignment task by more than 5 points in F1. We apply our CRF aligner to construct two new text simplification datasets, Newsela-Auto and Wiki-Auto, which are much larger and of better quality than the existing datasets. A Transformer-based seq2seq model trained on our datasets establishes a new state-of-the-art for text simplification in both automatic and human evaluation.

Automatic text simplification is primarily addressed by sequence-to-sequence (seq2seq) models whose success largely depends on the quality and quantity of the training corpus, which consists of pairs of complex-simple sentences. Two widely used corpora, NEWSELA (Xu et al., 2015) and WIKILARGE (Zhang and Lapata, 2017), were created by automatically aligning sentences between comparable articles. However, due to the lack of reliable annotated data, sentence pairs are often aligned using surface-level similarity metrics, such as the Jaccard coefficient (Xu et al., 2015) or the cosine distance of TF-IDF vectors (Paetzold et al., 2017), which fail to capture paraphrases and the context of surrounding sentences. A common drawback of text simplification models trained on such datasets is that they behave conservatively, performing mostly deletion and rarely paraphrasing (Alva-Manchego et al., 2017). Moreover, WIKILARGE is the concatenation of three early datasets (Zhu et al., 2010; Woodsend and Lapata, 2011; Coster and Kauchak, 2011) that are extracted from Wikipedia dumps and are known to contain many errors (Xu et al., 2015).
To address these problems, we create the first high-quality manually annotated sentence-aligned datasets: NEWSELA-MANUAL with 50 article sets, and WIKI-MANUAL with 500 article pairs. We design a novel neural CRF alignment model, which utilizes fine-tuned BERT to measure semantic similarity and leverages the similar order of content between parallel documents, combined with an effective paragraph alignment algorithm. Experiments show that our proposed method outperforms all previous monolingual sentence alignment approaches (Štajner et al., 2018; Paetzold et al., 2017; Xu et al., 2015) by more than 5 points in F1.
By applying our alignment model to all 1,882 article sets in Newsela and 138,095 article pairs in the Wikipedia dump, we then construct two new simplification datasets, NEWSELA-AUTO (666,645 sentence pairs) and WIKI-AUTO (488,332 sentence pairs). Our new datasets, with improved quantity and quality, facilitate the training of complex seq2seq models. A BERT-initialized Transformer model trained on our datasets outperforms the state-of-the-art by 3.4% in terms of SARI, the main automatic metric for text simplification. Our simplification model produces 25% more rephrasing than those trained on the existing datasets. Our contributions include:

1. Two manually annotated datasets that enable the first systematic study for training and evaluating monolingual sentence alignment;
2. A neural CRF sentence aligner and a paragraph alignment algorithm that employ fine-tuned BERT to capture semantic similarity and take advantage of the sequential nature of parallel documents;
3. Two automatically constructed text simplification datasets which are of higher quality and 4.7 and 1.6 times larger than the existing datasets in their respective domains;
4. A BERT-initialized Transformer model for automatic text simplification, trained on our datasets, which establishes a new state-of-the-art in both automatic and human evaluation.

Neural CRF Sentence Aligner
We propose a neural CRF sentence alignment model, which leverages the similar order of content presented in parallel documents and captures editing operations across multiple sentences, such as splitting and elaboration (see Figure 1 for an example). To further improve the accuracy, we first align paragraphs based on semantic similarity and vicinity information, and then extract sentence pairs from these aligned paragraphs. In this section, we describe the task setup and our approach.

Problem Formulation
Given a simple article (or paragraph) $S$ of $m$ sentences and a complex article (or paragraph) $C$ of $n$ sentences, for each sentence $s_i$ ($i \in [1, m]$) in the simple article, we aim to find its corresponding sentence $c_{a_i}$ ($a_i \in [0, n]$) in the complex article. We use $a_i$ to denote the index of the aligned sentence, where $a_i = 0$ indicates that sentence $s_i$ is not aligned to any sentence in the complex article. The full alignment $\mathbf{a}$ between the article (or paragraph) pair $S$ and $C$ can then be represented by a sequence of alignment labels $\mathbf{a} = (a_1, a_2, \ldots, a_m)$. Figure 1 shows an example of alignment labels. One specific aspect of our CRF model is that it uses a varied number of labels for each article (or paragraph) pair rather than a fixed set of labels.
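As a toy illustration of this labeling scheme (illustrative code only, not part of the paper's implementation):

```python
# Alignment labels for one simple article: each simple sentence s_i gets a
# label a_i in {0, 1, ..., n}, where n is the number of complex sentences
# and 0 means "not aligned". The label space therefore varies per article pair.

def label_space(n_complex):
    """Candidate labels for one simple sentence."""
    return list(range(n_complex + 1))

# Toy pair: m = 3 simple sentences, n = 4 complex sentences.
m, n = 3, 4
alignment = [2, 0, 3]  # s_1 -> c_2, s_2 unaligned, s_3 -> c_3

assert len(alignment) == m
assert all(a in label_space(n) for a in alignment)
```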

Neural CRF Sentence Alignment Model
We learn $P(\mathbf{a} \mid S, C)$, the conditional probability of alignment $\mathbf{a}$ given an article pair $(S, C)$, using a linear-chain conditional random field:

$$P(\mathbf{a} \mid S, C) = \frac{\exp\big(\sum_{i=1}^{|S|} \psi(a_i, a_{i-1}, S, C)\big)}{\sum_{\mathbf{a}'} \exp\big(\sum_{i=1}^{|S|} \psi(a'_i, a'_{i-1}, S, C)\big)}$$

where $|S| = m$ denotes the number of sentences in article $S$. The score $\sum_{i=1}^{|S|} \psi(a_i, a_{i-1}, S, C)$ sums over the sequence of alignment labels $\mathbf{a} = (a_1, a_2, \ldots, a_m)$ between the simple article $S$ and the complex article $C$, and $\psi$ can be decomposed into two factors as follows:

$$\psi(a_i, a_{i-1}, S, C) = \mathrm{sim}(s_i, c_{a_i}) + T(a_i, a_{i-1})$$

where $\mathrm{sim}(s_i, c_{a_i})$ is the semantic similarity score between the two sentences, and $T(a_i, a_{i-1})$ is a pairwise score for the alignment label transition, namely that $a_i$ follows $a_{i-1}$.
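To make the objective concrete, the probability of an alignment can be computed by brute force on a tiny example; the similarity and transition scores below are made-up stand-ins, not learned values:

```python
import itertools
import math

# Brute-force computation of P(a | S, C) on a tiny example (m = 2 simple
# sentences, n = 2 complex sentences). sim[i][j] and the transition score T
# are toy stand-ins for the learned sim(s_i, c_j) and FFNN-based T.

sim = [[0.0, 2.0, 0.1],   # scores of s_1 against null (0), c_1, c_2
       [0.0, 0.2, 1.5]]   # scores of s_2

def T(cur, prev):
    return -0.1 * abs(cur - prev)  # toy transition: penalize large jumps

def score(a):
    """Sum of psi(a_i, a_{i-1}) over the label sequence, with a_0 = 0."""
    total, prev = 0.0, 0
    for i, ai in enumerate(a):
        total += sim[i][ai] + T(ai, prev)
        prev = ai
    return total

labels = range(3)  # {0, 1, 2}
Z = sum(math.exp(score(a)) for a in itertools.product(labels, repeat=2))
best = max(itertools.product(labels, repeat=2), key=score)
assert best == (1, 2)                        # s_1 -> c_1, s_2 -> c_2
assert 0.0 < math.exp(score(best)) / Z < 1.0
```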
Semantic Similarity A fundamental problem in sentence alignment is to measure the semantic similarity between two sentences $s_i$ and $c_j$. Prior work used lexical similarity measures, such as Jaccard similarity (Xu et al., 2015), TF-IDF (Paetzold et al., 2017), and continuous n-gram features (Štajner et al., 2018). In this paper, we fine-tune BERT (Devlin et al., 2019) on our manually labeled dataset (details in §3) to capture semantic similarity.

Alignment Label Transition
In parallel documents, the contents of the articles are often presented in a similar order. The complex sentence $c_{a_i}$ that is aligned to $s_i$ is often related to the complex sentences $c_{a_{i-1}}$ and $c_{a_{i+1}}$, which are aligned to $s_{i-1}$ and $s_{i+1}$, respectively. To incorporate this intuition, we propose a scoring function to model the transition between alignment labels using the following features:

$$g_1 = |a_i - a_{i-1}|, \quad g_2 = \mathbb{1}(a_i = 0), \quad g_3 = \mathbb{1}(a_{i-1} = 0), \quad g_4 = \mathbb{1}(a_i = 0 \wedge a_{i-1} = 0)$$

where $g_1$ is the absolute distance between $a_i$ and $a_{i-1}$, $g_2$ and $g_3$ denote whether the current or prior sentence is not aligned to any sentence, and $g_4$ indicates whether both $s_i$ and $s_{i-1}$ are not aligned to any sentences. The transition score is computed as follows:

$$T(a_i, a_{i-1}) = \mathrm{FFNN}([g_1, g_2, g_3, g_4])$$

where $[\,,]$ represents the concatenation operation and FFNN is a 2-layer feedforward neural network. We provide more implementation details of the model in Appendix A.1.
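A minimal sketch of these transition features; the learned FFNN is omitted here, and only the feature vector is computed:

```python
def transition_features(a_i, a_prev):
    """Toy version of the transition features g_1..g_4 described above.
    (Illustrative only; the paper feeds these into a 2-layer FFNN.)"""
    g1 = abs(a_i - a_prev)            # distance between consecutive labels
    g2 = 1.0 if a_i == 0 else 0.0     # current sentence unaligned
    g3 = 1.0 if a_prev == 0 else 0.0  # previous sentence unaligned
    g4 = g2 * g3                      # both unaligned
    return [g1, g2, g3, g4]

assert transition_features(3, 2) == [1, 0.0, 0.0, 0.0]
assert transition_features(0, 0) == [0, 1.0, 1.0, 1.0]
```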

Inference and Learning
During inference, we find the optimal alignment $\hat{\mathbf{a}}$:

$$\hat{\mathbf{a}} = \operatorname*{argmax}_{\mathbf{a}} P(\mathbf{a} \mid S, C)$$

using the Viterbi algorithm in $O(mn^2)$ time. During training, we maximize the conditional probability of the gold alignment label $\mathbf{a}^*$:

$$\log P(\mathbf{a}^* \mid S, C) = \sum_{i=1}^{|S|} \psi(a^*_i, a^*_{i-1}, S, C) - \log \sum_{\mathbf{a}'} \exp\Big(\sum_{i=1}^{|S|} \psi(a'_i, a'_{i-1}, S, C)\Big)$$

The second term sums the scores of all possible alignments and can be computed using the forward algorithm in $O(mn^2)$ time as well.
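A minimal Viterbi decoder for this label space might look as follows; the scores are toy stand-ins, and this is a sketch rather than the authors' implementation:

```python
# Viterbi decoding over alignment labels 0..n for m simple sentences.
# psi(i, cur, prev) stands for sim(s_i, c_cur) + T(cur, prev).
# Each step considers (n+1)^2 label pairs, giving the O(m n^2) cost.

def viterbi(m, n, psi):
    """Return the highest-scoring label sequence a_1..a_m (each a_i in 0..n)."""
    best = {a: psi(0, a, 0) for a in range(n + 1)}  # assume a_0 = 0
    back = []
    for i in range(1, m):
        ptr, new = {}, {}
        for cur in range(n + 1):
            prev = max(best, key=lambda p: best[p] + psi(i, cur, p))
            new[cur] = best[prev] + psi(i, cur, prev)
            ptr[cur] = prev
        back.append(ptr)
        best = new
    a = [max(best, key=best.get)]   # backtrace from the best final label
    for ptr in reversed(back):
        a.append(ptr[a[-1]])
    return list(reversed(a))

# Toy similarity table: s_1 matches c_1, s_2 matches c_2.
sim = [[0.0, 2.0, 0.1], [0.0, 0.2, 1.5]]
psi = lambda i, cur, prev: sim[i][cur] - 0.1 * abs(cur - prev)
assert viterbi(2, 2, psi) == [1, 2]
```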

Paragraph Alignment
Both accuracy and computing efficiency can be improved if we align paragraphs before aligning sentences. In fact, our empirical analysis revealed that sentence-level alignments mostly reside within the corresponding aligned paragraphs (details in §4.4 and Table 3). Moreover, aligning paragraphs first provides more training instances and reduces the label space for our neural CRF model.
We propose Algorithms 1 and 2 for paragraph alignment. Given a simple article $S$ with $k$ paragraphs $S = (S_1, S_2, \ldots, S_k)$ and a complex article $C$ with $l$ paragraphs $C = (C_1, C_2, \ldots, C_l)$, we first apply Algorithm 1 to calculate the semantic similarity matrix $simP$ between paragraphs by averaging or maximizing over the sentence-level similarities (§2.2). Then, we use Algorithm 2 to generate the paragraph alignment matrix $alignP$. We align paragraph pairs if they satisfy one of two conditions: (a) they have high semantic similarity and appear in similar positions in the article pair (e.g., both at the beginning), or (b) two continuous paragraphs in the complex article have relatively high semantic similarity with one paragraph on the simple side (e.g., paragraph splitting or fusion). The difference in relative position within the documents is defined as $d(i, j) = \left|\frac{i}{k} - \frac{j}{l}\right|$, and the thresholds $\tau_1$–$\tau_5$ in Algorithm 2 are selected using the dev set. Finally, we merge the neighbouring paragraphs that are aligned to the same paragraph in the simple article before feeding them into our neural CRF aligner. We provide more details in Appendix A.1.
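The paragraph similarity computation (Algorithm 1 in spirit) can be sketched as pooling over sentence-level similarities; the Jaccard stand-in below replaces the fine-tuned BERT similarity used in the paper:

```python
# Paragraph-level similarity pooled from sentence-level similarities.
# sent_sim is assumed given; the paper uses a fine-tuned BERT model,
# while the toy Jaccard function here is purely for illustration.

def paragraph_sim(para_s, para_c, sent_sim, pool=max):
    """Pool sentence-level similarities between two paragraphs (max or avg)."""
    scores = [sent_sim(s, c) for s in para_s for c in para_c]
    return pool(scores) if scores else 0.0

def jaccard(s, c):
    a, b = set(s.lower().split()), set(c.lower().split())
    return len(a & b) / len(a | b)

p_simple = ["The cat sat on the mat."]
p_complex = ["A cat was sitting on a mat.", "It rained all day."]
hi = paragraph_sim(p_simple, [p_complex[0]], jaccard)
lo = paragraph_sim(p_simple, [p_complex[1]], jaccard)
assert hi > lo  # the paraphrase scores higher than the unrelated sentence
```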

Constructing Alignment Datasets
To address the lack of reliable sentence alignment for Newsela (Xu et al., 2015) and Wikipedia (Zhu et al., 2010;Woodsend and Lapata, 2011), we designed an efficient annotation methodology to first manually align sentences between a few complex and simple article pairs. Then, we automatically aligned the rest using our alignment model trained on the human annotated data. We created two sentence-aligned parallel corpora (details in §5), which are the largest to date for text simplification.

Sentence Aligned Newsela Corpus
Newsela corpus (Xu et al., 2015) consists of 1,932 English news articles where each article (level 0) is re-written by professional editors into four simpler versions at different readability levels (levels 1-4). We annotate sentence alignments for article pairs at adjacent readability levels (e.g., 0-1, 1-2), as the alignments between non-adjacent levels (e.g., 0-2) can then be derived automatically. To ensure efficiency and quality, we designed the following three-step annotation procedure:

1. Align paragraphs using the CATS toolkit (Štajner et al., 2018), and then have two in-house annotators correct the automatic paragraph alignment errors. Performing paragraph alignment as the first step significantly reduces the number of sentence pairs to be annotated, from every possible sentence pair to only those within the aligned paragraphs. We designed an efficient visualization toolkit for this step; a screenshot can be found in Appendix E.2.

2. For each sentence pair within the aligned paragraphs, we ask five annotators on the Figure Eight crowdsourcing platform to classify it into one of three categories: aligned, partially-aligned, or not-aligned. We provide the annotation instructions and interface in Appendix E.1. We require annotators to spend at least ten seconds per question and embed one test question in every five questions. Any worker whose accuracy drops below 85% on test questions is removed. The inter-annotator agreement is 0.807, measured by Cohen's kappa (Artstein and Poesio, 2008).

3. We have four in-house annotators (not authors) verify the crowdsourced labels.

[Figure 2: Manual inspection of 100 random sentence pairs from our corpora (NEWSELA-AUTO and WIKI-AUTO) and the existing Newsela (Xu et al., 2015) and Wikipedia (Zhang and Lapata, 2017) corpora. Our corpora contain at least 44% more complex rewrites (Deletion + Paraphrase or Splitting + Paraphrase) and 27% fewer defective pairs (Not Aligned or Not Simpler).]
We manually aligned 50 article groups to create the NEWSELA-MANUAL dataset with a 35/5/10 train/dev/test split. We trained our aligner on this dataset (details in §4), then automatically aligned sentences in the remaining 1,882 article groups in Newsela (Table 1) to create a new sentence-aligned dataset, NEWSELA-AUTO, which consists of 666k sentence pairs predicted as aligned or partially-aligned. NEWSELA-AUTO is considerably larger than the previous NEWSELA (Xu et al., 2015) dataset of 141,582 pairs, and contains 44% more interesting rewrites (i.e., rephrasing and splitting cases), as shown in Figure 2.

Sentence Aligned Wikipedia Corpus
We also create a new version of Wikipedia corpus by aligning sentences between English Wikipedia and Simple English Wikipedia. Previous work (Xu et al., 2015) has shown that Wikipedia is much noisier than the Newsela corpus. We provide this dataset in addition to facilitate future research.
We first extract article pairs from English and Simple English Wikipedia by leveraging Wikidata, a well-maintained database that indexes named entities (and events, etc.) and their Wikipedia pages in different languages. We found this method to be more reliable than using page titles (Coster and Kauchak, 2011) or cross-lingual links (Zhu et al., 2010; Woodsend and Lapata, 2011), as titles can be ambiguous and cross-lingual links may direct to a disambiguation or mismatched page (more details in Appendix B). In total, using an improved version of the WikiExtractor library, we extracted 138,095 article pairs from the 2019/09 Wikipedia dump, which is about two times larger than the previous datasets (Coster and Kauchak, 2011; Zhu et al., 2010) of only 60∼65k article pairs. Then, we crowdsourced the sentence alignment annotations for 500 randomly sampled document pairs (10,123 sentence pairs in total). As document lengths in English and Simple English Wikipedia articles vary greatly, we designed the following annotation strategy, slightly different from the one used for Newsela. For each sentence in the simple article, we select the sentences with the highest similarity scores from the complex article for manual annotation, based on four similarity measures: lexical similarity from CATS (Štajner et al., 2018), cosine similarity using TF-IDF (Paetzold et al., 2017), cosine similarity between BERT sentence embeddings, and the alignment probability from a BERT model fine-tuned on our NEWSELA-MANUAL data (§3.1). As these four metrics may rank the same sentence at the top, on average we collected 2.13 complex sentences for every simple sentence and annotated the alignment label for each sentence pair. Our pilot study showed that this method captured 93.6% of the aligned sentence pairs. We named this manually labeled dataset WIKI-MANUAL, with a train/dev/test split of 350/50/100 article pairs.
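The candidate-selection step above can be sketched as taking, for each simple sentence, the union of the top-scoring complex sentences under several metrics; the two metrics below are simplistic stand-ins for the four measures listed:

```python
# For one simple sentence, collect the argmax complex sentence under each
# similarity metric and annotate the union. The metrics here are toy
# stand-ins, not the CATS/TF-IDF/BERT measures used in the paper.

def candidates(simple_sent, complex_sents, metrics):
    """Union of argmax complex-sentence indices over all metrics."""
    picked = set()
    for metric in metrics:
        scores = [metric(simple_sent, c) for c in complex_sents]
        picked.add(scores.index(max(scores)))
    return sorted(picked)

m1 = lambda s, c: len(set(s.split()) & set(c.split()))  # word overlap
m2 = lambda s, c: -abs(len(s) - len(c))                 # length match

cs = ["a b c d", "a b x y z w q r s t", "p q"]
idx = candidates("a b c d", cs, [m1, m2])
assert idx == [0]  # both metrics agree, so only one candidate is annotated
```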
Finally, we trained our alignment model on this annotated dataset to automatically align sentences for all the 138,095 document pairs (details in Appendix B). In total, we yielded 604k non-identical aligned and partially-aligned sentence pairs to create the WIKI-AUTO dataset. Figure 2 illustrates that WIKI-AUTO contains 75% fewer defective sentence pairs than the old WIKILARGE (Zhang and Lapata, 2017) dataset.

Evaluation of Sentence Alignment
In this section, we present experiments that compare our neural sentence alignment model against state-of-the-art approaches on the NEWSELA-MANUAL (§3.1) and WIKI-MANUAL (§3.2) datasets.

Existing Methods
We compare our neural CRF aligner with the following baselines and state-of-the-art approaches:

Evaluation Metrics
We report Precision, Recall and F1 on two binary classification tasks: aligned + partially-aligned vs. not-aligned (Task 1), and aligned vs. partially-aligned + not-aligned (Task 2). It should be noted that we excluded identical sentence pairs in the evaluation, as they are trivial to classify. Table 2 shows the results on the NEWSELA-MANUAL test set. For similarity-based methods, we choose a threshold based on the maximum F1 on the dev set. Our neural CRF aligner outperforms the state-of-the-art approaches by more than 5 points in F1.
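The two binary settings collapse the three-way annotation differently; a sketch of the scoring on toy label sequences:

```python
# F1 over collapsed labels: Task 1 treats aligned+partial as positive,
# Task 2 treats only aligned as positive. Toy gold/pred labels below.

def f1(gold, pred, positive):
    tp = sum(g in positive and p in positive for g, p in zip(gold, pred))
    fp = sum(g not in positive and p in positive for g, p in zip(gold, pred))
    fn = sum(g in positive and p not in positive for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = ["aligned", "partial", "notaligned", "aligned"]
pred = ["aligned", "notaligned", "notaligned", "partial"]
task1 = f1(gold, pred, {"aligned", "partial"})  # aligned+partial vs. not
task2 = f1(gold, pred, {"aligned"})             # aligned vs. partial+not
assert 0.0 <= task2 <= task1 <= 1.0
```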

Results
In particular, our method performs better than previous work on partial alignments, which contain many interesting simplification operations, such as sentence splitting and paraphrasing with deletion. Similarly, our CRF alignment model achieves 85.1 F1 for Task 1 (aligned + partially-aligned vs. not-aligned) on the WIKI-MANUAL test set. It outperforms one of the previous state-of-the-art approaches, CATS (Štajner et al., 2018), by 15.1 points in F1. We provide more details in Appendix C.

Ablation Study
We analyze the design choices crucial to the good performance of our alignment model, namely the CRF component, the paragraph alignment, and the BERT-based semantic similarity measure. Table 3 shows the importance of each component with a series of ablation experiments on the dev set.

CRF Model Our aligner achieves 93.2 F1 and 88.1 F1 on Task 1 and 2, respectively, which is around 3 points higher than its variant without the CRF component (BERT_finetune + ParaAlign). Modeling alignment label transitions and making sequential predictions helps our neural CRF aligner handle sentence splitting cases better, especially when sentences undergo dramatic rewriting.
Paragraph Alignment Adding paragraph alignment (BERT_finetune + ParaAlign) improves the precision on Task 1 from 93.3 to 98.4 with a negligible decrease in recall, compared to not aligning paragraphs (BERT_finetune). Moreover, paragraph alignments generated by our algorithm (Our Aligner) perform close to the gold alignments (Our Aligner + gold ParaAlign), with only 0.9 and 0.3 difference in F1 on Task 1 and 2, respectively.
Semantic Similarity BERT_finetune performs better than other neural models, including InferSent.

Experiments on Automatic Sentence Simplification
In this section, we compare different automatic text simplification models trained on our new parallel corpora, NEWSELA-AUTO and WIKI-AUTO, with their counterparts trained on the existing datasets. We establish a new state-of-the-art for sentence simplification by training a Transformer model with initialization from pre-trained BERT checkpoints.

Comparison with existing datasets
Existing datasets of complex-simple sentences, NEWSELA (Xu et al., 2015) and WIKILARGE (Zhang and Lapata, 2017), were aligned using lexical similarity metrics. The NEWSELA dataset (Xu et al., 2015) was aligned using JaccardAlign (§4.1). WIKILARGE is a concatenation of three early datasets (Zhu et al., 2010; Woodsend and Lapata, 2011; Coster and Kauchak, 2011) in which sentences in Simple/Normal English Wikipedia and editing history were aligned by TF-IDF cosine similarity. For our new NEWSELA-AUTO, we partitioned the article sets such that there is no overlap between the new train set and the old test set, and vice versa. Following Zhang and Lapata (2017), we also excluded sentence pairs corresponding to levels 0-1, 1-2 and 2-3. For our WIKI-AUTO dataset, we eliminated sentence pairs with high (>0.9) or low (<0.1) lexical overlap based on BLEU scores (Papineni et al., 2002), following Štajner et al. (2015). We observed that sentence pairs with low BLEU are often inaccurate paraphrases with only shared named entities, while pairs with high BLEU are dominated by sentences merely copied without simplification. We used the benchmark TURK corpus (Xu et al., 2016) for evaluation on Wikipedia, which consists of 8 human-written references for sentences in the validation and test sets. We discarded sentences of the TURK corpus from WIKI-AUTO. Table 4 shows the statistics of the existing and our new datasets.
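The BLEU-based filtering can be sketched as follows; for illustration, a simple unigram-precision score stands in for BLEU (a real pipeline would use a standard implementation such as sacrebleu):

```python
# Filter out sentence pairs whose lexical overlap is too high (near-copies)
# or too low (likely misalignments). The overlap score here is a crude
# unigram precision, a stand-in for the BLEU score used in the paper.

def unigram_precision(complex_sent, simple_sent):
    c, s = complex_sent.split(), simple_sent.split()
    return sum(tok in c for tok in s) / len(s) if s else 0.0

def keep_pair(c, s, low=0.1, high=0.9):
    """Keep pairs with moderate overlap, per the 0.1/0.9 thresholds above."""
    return low < unigram_precision(c, s) < high

assert not keep_pair("the cat sat on the mat", "the cat sat on the mat")  # copy
assert not keep_pair("the cat sat on the mat", "stocks fell sharply")     # noise
assert keep_pair("the cat sat on the mat", "the cat sat down")
```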

Baselines and Simplification Models
We compare the following seq2seq models trained using our new datasets versus the existing datasets:

[Table 7: Human evaluation of fluency (F), adequacy (A) and simplicity (S) on the NEWSELA-AUTO test set.]

Results
In this section, we evaluate different simplification models trained on our new datasets versus on the old existing datasets using both automatic and human evaluation.

Automatic Evaluation
We report SARI (Xu et al., 2016), Flesch-Kincaid (FK) grade level readability (Kincaid et al., 1975), and average sentence length (Len). While SARI compares the generated sentence to a set of reference sentences in terms of correctly inserted, kept and deleted n-grams (n ∈ {1, 2, 3, 4}), FK measures the readability of the generated sentence.
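The Flesch-Kincaid grade level has a standard closed form; a minimal sketch follows (the vowel-group syllable counter is a crude heuristic, so treat the output as approximate):

```python
import re

# FK grade = 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59.
# Syllables are approximated by counting contiguous vowel groups.

def count_syllables(word):
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(sentences):
    words = [w for s in sentences for w in s.split()]
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words) - 15.59)

simple = ["The cat sat."]
complex_ = ["Notwithstanding prior considerations, the feline remained seated."]
assert fk_grade(simple) < fk_grade(complex_)  # simpler text, lower grade
```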
We also report the three rewrite operation scores used in SARI: the precision of the delete (del) operation and the F1 scores of the add (add) and keep (keep) operations. Tables 5 and 8 show the results on the Newsela and Wikipedia datasets, respectively. Systems trained on our datasets outperform their equivalents trained on the existing datasets according to SARI. The difference is notable for Transformer_bert, with a 6.4% and 3.7% increase in SARI on the NEWSELA-AUTO test set and the TURK corpus, respectively. The larger size and improved quality of our datasets enable the training of complex Transformer models. In fact, Transformer_bert trained on our new datasets outperforms the existing state-of-the-art systems for automatic text simplification. Although the improvement in SARI is modest for LSTM-based models (LSTM and EditNTS), the increase in F1 scores for the addition and deletion operations indicates that models trained on our datasets make more meaningful changes to the input sentence.

Human Evaluation
We also performed human evaluation by asking five Amazon Mechanical Turk workers to rate the fluency, adequacy and simplicity (detailed instructions in Appendix D.2) of 100 random sentences generated by different simplification models trained on NEWSELA-AUTO and the existing dataset. Each worker evaluated these aspects on a 5-point Likert scale. We averaged the ratings from the five workers. Table 7 demonstrates that Transformer_bert trained on NEWSELA-AUTO greatly outperforms the one trained on the old dataset. Even with shorter sentence outputs, our Transformer_bert retained adequacy similar to the LSTM-based models. Our Transformer_bert model also achieves better fluency, adequacy, and overall ratings compared to the state-of-the-art systems (Table 6). We provide examples of system outputs in Appendix D.3. Our manual inspection (Figure 3) also shows that Transformer_bert trained on NEWSELA-AUTO performs 25% more paraphrasing and deletions than its variant trained on the previous NEWSELA (Xu et al., 2015) dataset.

Related Work
Text simplification is considered a text-to-text generation task where the system learns how to simplify from complex-simple sentence pairs. There is a long line of research using methods based on hand-crafted rules. As the existing datasets were built using lexical similarity metrics, they frequently omit paraphrases and sentence splits. While training on such datasets creates conservative systems that rarely paraphrase, evaluation on these datasets exhibits an unfair preference for deletion-based simplification over paraphrasing. Sentence alignment has been widely used to extract complex-simple sentence pairs from parallel articles for training text simplification systems. Previous work used surface-level similarity metrics, such as TF-IDF cosine similarity (Zhu et al., 2010; Woodsend and Lapata, 2011; Coster and Kauchak, 2011; Paetzold et al., 2017), Jaccard similarity (Xu et al., 2015), and other lexical features (Hwang et al., 2015; Štajner et al., 2018). Then, a greedy (Štajner et al., 2018) or dynamic programming (Barzilay and Elhadad, 2003; Paetzold et al., 2017) algorithm was used to search for the optimal alignment. Another related line of research (Smith et al., 2010; Tufiş et al., 2013; Tsai and Roth, 2016; Gottschalk and Demidova, 2017; Aghaebrahimian, 2018; Thompson and Koehn, 2019) aligns parallel sentences in bilingual corpora for machine translation.

Conclusion
In this paper, we proposed a novel neural CRF model for sentence alignment, which substantially outperforms the existing approaches. We created two high-quality manually annotated datasets (NEWSELA-MANUAL and WIKI-MANUAL) for training and evaluation. Using the neural CRF sentence aligner, we constructed the two largest sentence-aligned datasets to date (NEWSELA-AUTO and WIKI-AUTO) for text simplification. We showed that a BERT-initialized Transformer trained on our new datasets establishes new state-of-the-art performance for automatic sentence simplification.

A.1 Implementation Details
We used PyTorch to implement our neural CRF alignment model. For the sentence encoder, we used the Huggingface implementation (Wolf et al., 2019) of the BERT_base architecture with 12 Transformer layers. When fine-tuning the BERT model, we use the representation of the [CLS] token for classification. We use cross-entropy loss and update the weights in all layers. Table 9 summarizes the hyperparameters of our model. Table 10 provides the thresholds for our paragraph alignment Algorithm 2, which were chosen based on the NEWSELA-MANUAL dev data.
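The [CLS]-based classification described above can be sketched as a linear head over the encoder output; the random vector below is a stand-in for the actual BERT representation, and the untrained weights are purely illustrative:

```python
import math
import random

# Sentence-pair scoring head: the [CLS] vector (stand-in here) is mapped to
# an alignment probability by a linear layer + sigmoid. The hidden size 768
# follows the BERT_base hyperparameters in Table 9.

random.seed(0)
HIDDEN = 768
cls_vec = [random.gauss(0, 1) for _ in range(HIDDEN)]   # stand-in for BERT output
w = [random.gauss(0, 0.01) for _ in range(HIDDEN)]      # untrained weights
b = 0.0

logit = sum(x * wi for x, wi in zip(cls_vec, w)) + b
prob = 1 / (1 + math.exp(-logit))  # plays the role of sim(s_i, c_j)
assert 0.0 < prob < 1.0
```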

Table 9: Hyperparameters of our neural CRF alignment model.

    Parameter            Value
    hidden units         768
    # of layers          12
    # of heads           12
    learning rate        0.00002
    max sequence length  128
    batch size           8

For Wikipedia data, we tailored our paragraph alignment algorithm (Algorithms 3 and 4). Table 11 provides the thresholds for Algorithm 4, which were chosen based on the WIKI-MANUAL dev data.

B Sentence Aligned Wikipedia Corpus
We present more details about our pre-processing steps for creating the WIKI-MANUAL and WIKI-AUTO corpora here. In Wikipedia, Simple English tuned on the dev set. After filtering, we ended up with 970 aligned sentence pairs in total from these 13,036 article pairs.

C Sentence Alignment on Wikipedia
In this section, we compare different approaches for sentence alignment on the WIKI-MANUAL dataset. Tables 12 and 13 report the performance for Task 1 (aligned + partially-aligned vs. not-aligned) on the dev and test sets. To generate predictions for MASSAlign, CATS and the two BERT_finetune methods, we first utilize the method in §3.2 to select candidate sentence pairs, as we found this step helps to improve their accuracy. Then we apply the similarity metric from each model to calculate the similarity of each candidate sentence pair. We tune a threshold for maximum F1 on the dev set and apply it to the test set. Candidate sentence pairs with a similarity larger than the threshold are predicted as aligned, otherwise not-aligned. Sentence pairs that are not selected as candidates are also predicted as not-aligned. Table 14 shows the values of other hyperparameters. For the LSTM baseline, we replicated the LSTM encoder-decoder model used by Zhang and Lapata (2017). We preprocessed the data by replacing the named entities in a sentence using the spaCy toolkit. We also replaced all words with frequency less than three with <UNK>. If our model predicted <UNK>, we replaced it with the aligned source word (Jean et al., 2015). Table 15 summarizes the hyperparameters of the LSTM model. We used 300-dimensional GloVe word embeddings (Pennington et al., 2014) to initialize the embedding layer.

Generated by LSTM baseline

Complex (input): In Seattle, eight activists between ages 10 and 15 petitioned Washington state last year to adopt stricter science-based regulations to protect them against climate change.
Simple (reference): In Seattle, eight youths between 10 to 15 years old petitioned the state of Washington to change the law.
New (this work): in seattle, eight activists between ages 10 and 15 asked washington state last year to keep the environment safe. (Phrasal Paraphrase + Deletion)
Old (Xu et al., 2015): in seattle, eight activists between ages 10 and 15 asked washington state last year to adopt stricter science-based rules to protect them against climate change. (Lexical Paraphrase)

Complex (input): He recognized that another recommendation would be controversial with police groups: independent investigations after police shootings.
Simple (reference): He admitted that police would not like one of the recommendations.
New (this work): he thought another suggestion would be against the police. (Phrasal Paraphrase + Deletion)
Old (Xu et al., 2015): he recognized that another suggestion would be controversial with police groups. (Lexical Paraphrase + Deletion)

Complex (input): The Philadelphia Museum of Art has two famous selfie spots, both from the movie "Rocky."
Simple (reference): The Philadelphia Museum of Art has two big selfie spots.
New (this work): the philadelphia museum of art has two picture spots. (Lexical Paraphrase + Deletion)
Old (Xu et al., 2015): the philadelphia museum of art has two famous spots. (Deletion)

Generated by Transformer_bert

Complex (input): Some Chicago residents got angry about it.
Simple (reference): The plan made some people angry.
New (this work): some people in chicago were angry. (Phrasal Paraphrase)
Old (Xu et al., 2015): some chicago residents got angry. (Deletion)

Complex (input): Emissions standards have been tightened, and the government is investing money in solar, wind and other renewable energy.
Simple (reference): China has also put a great deal of money into solar, wind and other renewable energy.
New (this work): the government is putting aside money for new types of energy. (Phrasal Paraphrase + Deletion)
Old (Xu et al., 2015): the government is investing in money, wind and other equipment. (Lexical Paraphrase + Deletion)

Complex (input): On Feb. 9, 1864, he was sitting for several portraits, including the one used for the $5 bill.
Simple (reference): On Feb. 9, 1864, several artists painted pictures of him.
New (this work): on feb. 9, 1864, he was sitting for several portraits. (Deletion)
Old (Xu et al., 2015): on feb 9, 1864, he was sitting for several, including the $5 bill for the bill. (Deletion)

Table 16: Examples of simplified sentences generated by LSTM and Transformer_bert models trained on our new NEWSELA-AUTO (this work) and the old existing NEWSELA (Xu et al., 2015) datasets. The source sentences are from our new NEWSELA-AUTO test set. Models trained on our new data rephrase the input sentence more often than the models trained on old data. Bold indicates deletions or paraphrases.

Complex (input): Now at age 9, his teachers say Richie reads at the level of a student in high school, and his vocabulary is well above those of his classmates.
Simple (reference): He reads like a high school student.
LSTM: now he is age 9.
EditNTS (Dong et al., 2019): he say his classmates are using a special job.
Rerank (Kriz et al., 2019): but it is well above those of his classmates.
Transformer_bert (this work): now at age 9, his teachers say that richie reads high schoolwork.

Complex (input): He can recall the special feeling when, at age 7, he built his first kite and saw it waft into the air.
Simple (reference): He can remember the special feeling when he built his first kite.
LSTM: he can remember the people when he was age 7.
EditNTS (Dong et al., 2019): , at age 7, he built his first kite.
Rerank (Kriz et al., 2019): he could remember the special feeling when.
Transformer_bert (this work): he can remember the special feeling when he was 7 years old.

Complex (input): Following the action, two middle-aged brothers slid down a hill holding signs.
Simple (reference): For example, two grownup brothers slid down a hill holding signs.
LSTM: <unk> middle-aged brothers slid down a hill holding signs.
EditNTS (Dong et al., 2019): two middle-aged brothers, 14, heard down a hill signs.
Rerank (Kriz et al., 2019): he made a hill holding signs.
Transformer_bert (this work): two middle-aged brothers slid down a hill holding signs.