Controllable Text Simplification with Explicit Paraphrasing

Text Simplification improves the readability of sentences through several rewriting transformations, such as lexical paraphrasing, deletion, and splitting. Current simplification systems are predominantly sequence-to-sequence models that are trained end-to-end to perform all these operations simultaneously. However, such systems limit themselves to mostly deleting words and cannot easily adapt to the requirements of different target audiences. In this paper, we propose a novel hybrid approach that leverages linguistically-motivated rules for splitting and deletion, and couples them with a neural paraphrasing model to produce varied rewriting styles. We introduce a new data augmentation method to improve the paraphrasing capability of our model. Through automatic and manual evaluations, we show that our proposed model establishes a new state-of-the-art for the task, paraphrasing more often than the existing systems, and can control the degree of each simplification operation applied to the input texts.

Since 2016, nearly all text simplification systems have been sequence-to-sequence (seq2seq) models trained end-to-end, which have greatly increased the fluency of the outputs (Zhang and Lapata, 2017; Nisioi et al., 2017; Zhao et al., 2018; Kriz et al., 2019; Dong et al., 2019; Jiang et al., 2020). However, these systems mostly rely on deletion and tend to generate very short outputs at the cost of meaning preservation (Alva-Manchego et al., 2017). Table 1 shows that they neither split sentences nor paraphrase well, as reflected by the low percentage of splits (< 1%) and new words introduced (< 11.2%). While deleting words is a viable (and the simplest) way to reduce the complexity of sentences, it is suboptimal and unsatisfying. Professional editors are known to use a sophisticated combination of deletion, paraphrasing, and sentence splitting to simplify texts (Xu et al., 2015). Another drawback of these end-to-end neural systems is the lack of controllability. Simplification is highly audience dependent, and what constitutes simplified text for one group of users may not be acceptable for other groups (Xu et al., 2015; Lee and Yeung, 2018). An ideal simplification system should be able to generate text with varied characteristics, such as different lengths, readability levels, and numbers of split sentences, which can be difficult to control in end-to-end systems. To address these issues, we propose a novel hybrid approach that combines linguistically-motivated syntactic rules with data-driven neural models to improve the diversity and controllability of the simplifications.

Table 1: Output statistics of 500 random sentences from the Newsela test set. Existing systems rely on deletion and do not paraphrase well. OLen, %new, %eq, and %split denote the average output length, percentage of new words added, percentage of system outputs that are identical to the inputs, and percentage of sentence splits, respectively. † We used the system outputs shared by their authors.
We hypothesize that the seq2seq generation model will learn lexical and structural paraphrases more efficiently from the parallel corpus when we offload some of the burden of sentence splitting (e.g., split at a comma) and deletion (e.g., remove trailing prepositional phrases) decisions to a separate component. Previous hybrid approaches for simplification (Narayan and Gardent, 2014; Siddharthan and Mandya, 2014; Sulem et al., 2018c) used splitting and deletion rules in a deterministic step before applying an MT-based paraphrasing model. In contrast, our approach provides a more flexible and dynamic integration of linguistic rules with the neural models through ranking and data augmentation (Figure 1).
We compare our method to several state-of-the-art systems in both automatic and human evaluations. Our model achieves overall better performance measured by SARI (Xu et al., 2016) and other metrics, showing that the generated outputs are more similar to those written by human editors. We also demonstrate that our model can control the extent of each simplification operation by: (1) imposing a soft constraint on the percentage of words to be copied from the input in the seq2seq model, thus limiting lexical paraphrasing; and (2) selecting candidates that underwent a desired amount of splitting and/or deletion. Finally, we create a new test dataset with multiple human references for Newsela (Xu et al., 2015), the most widely used text simplification corpus, to specifically evaluate lexical paraphrasing.
Our Approach

Figure 1 shows an overview of our hybrid approach. We combine linguistic rules with data-driven neural models to improve the controllability and diversity of the outputs. Given an input complex sentence x, we first generate a set of intermediate simplifications V = {v_1, v_2, . . . , v_n} that have undergone splitting and deletion (§2.1). These intermediate sentences are then used for two purposes: (1) they are selected by a pairwise neural ranking model (§2.2) based on simplification quality and then rewritten by the paraphrasing component; and (2) they are used for data augmentation to improve the diversity of the paraphrasing model (§2.3).

Splitting and Deletion
We leverage the state-of-the-art system for structural simplification, DisSim (Niklaus et al., 2019), to generate candidate simplifications that focus on splitting and deletion. The English version of DisSim applies 35 hand-crafted grammar rules to break down a complex sentence into a set of hierarchically organized sub-sentences (see Figure 1 for an example). We choose a rule-based approach for sentence splitting because it is both accurate and reliable: in our pilot experiments, DisSim successfully split 92% of 100 complex sentences from the training data with more than 20 words, and introduced errors in only 6.8% of these splits. We consider these sub-sentences as candidate simplifications for the later steps, except those that are extremely short or long (compression ratio ∉ [0.5, 1.5]). The compression ratio is calculated as the number of words in a candidate simplification v_i (which may contain one or more sub-sentences) divided by that of the original sentence x.
To further increase the variety of generated candidates, we supplement DisSim with a Neural Deletion and Split module trained on the text simplification corpus ( §3.1). We use a Transformer seq2seq model with the same configuration as the base model for paraphrasing ( §2.3). Given the input sentence x, we constrain the beam search to generate 10 outputs with splitting and another 10 outputs without splitting. Then, we select the outputs that do not deviate substantially from x (i.e., Jaccard similarity > 0.5). We add outputs from the two systems to the candidate pool V .

Candidate Ranking
We design a neural ranking model to score all the candidates that underwent splitting and deletion, V = {v_1, v_2, . . . , v_n}, and then feed the top-ranked one to the lexical paraphrasing model for the final output. We train the model on a standard text simplification corpus consisting of pairs of complex sentences x and manually simplified references y.
Scoring Function. To assess the "goodness" of each candidate v_i during training, we define the gold scoring function g* as a length-penalized BERTScore:

g*(v_i, y) = BERTScore(v_i, y) − λ · |φ_{v_i} − φ_y|    (1)

BERTScore (Zhang et al., 2020b) is a text similarity metric that uses BERT (Devlin et al., 2019) embeddings to find soft matches between word pieces (Wu et al., 2016) instead of exact string matching. We introduce a length penalty to favor candidates that are of similar length to the human reference y and penalize those that deviate from the target compression ratio φ_y. λ defines the extent of the penalization and is set to 1 in our experiments. φ_{v_i} represents the compression ratio of v_i with respect to the input x. In principle, other similarity metrics can also be used for scoring.
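Assuming the absolute-difference form of the length penalty, the gold score can be computed in a few lines. Here `bertscore` is a placeholder for the actual BERTScore F1 between the candidate and the reference:

```python
def gold_score(bertscore, cr_candidate, cr_target=1.0, lam=1.0):
    # Length-penalized BERTScore: subtract a penalty proportional to how
    # far the candidate's compression ratio is from the target ratio.
    return bertscore - lam * abs(cr_candidate - cr_target)
```

With λ = 1, a candidate that is 40% shorter than the target loses 0.4 points of similarity, so near-identical-length candidates are preferred.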
Pairwise Ranking Model. We train the ranking model in a pairwise setup, since BERTScore is sensitive to relative rather than absolute similarity when comparing multiple candidates against the same reference. We transform the gold ranking of V (|V| = n) into n(n−1)/2 pairwise comparisons over every candidate pair, and learn to minimize the pairwise ranking violations using a hinge loss:

L = (1/m) Σ_{k=1}^{m} Σ_{i,j ≤ n_k : g*(v_i^(k), y^(k)) > g*(v_j^(k), y^(k))} max(0, 1 − (g(v_i^(k)) − g(v_j^(k))))

where g(·) is a feedforward neural network, m is the number of training complex-simple sentence pairs, k is the index of the training example, and n_k represents the number of candidates generated for the k-th example (§2.1). On average, n_k is about 14.5 for a sentence of 30 words, and can be larger for longer sentences. We consider 10 randomly sampled candidates for each complex sentence during training.
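A minimal sketch of the pairwise hinge loss for one sentence's candidate set. In the actual model this would be computed over autograd tensors so that the feedforward scorer can be trained; here plain floats are used, and `gold_scores` are only used to order the pairs:

```python
def pairwise_hinge_loss(pred_scores, gold_scores, margin=1.0):
    # pred_scores[i] = g(v_i) from the feedforward scorer;
    # gold_scores[i] = g*(v_i, y), the length-penalized BERTScore.
    total, count = 0.0, 0
    n = len(pred_scores)
    for i in range(n):
        for j in range(n):
            if gold_scores[i] > gold_scores[j]:
                # Penalize when the better candidate is not scored at
                # least `margin` above the worse one.
                total += max(0.0, margin - (pred_scores[i] - pred_scores[j]))
                count += 1
    return total / max(count, 1)
```

When the predicted ordering agrees with the gold ordering by at least the margin, the loss is zero; inverted pairs are penalized linearly.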
Features. For the feedforward network g(·), we use the following features: the number of words in v_i and in x, the compression ratio of v_i with respect to x, the Jaccard similarity between v_i and x, the rules applied to x to obtain v_i, and the number of rule applications. We vectorize all the real-valued features using Gaussian binning (Maddela and Xu, 2018), which has been shown to help neural models trained on numerical features (Liu et al., 2016; Sil et al., 2017; Zhong et al., 2020). We concatenate these vectors before feeding them to the ranking model. We score each candidate v_i separately and rank them in decreasing order of g(v_i). We provide implementation details in Appendix A.
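Gaussian binning turns a scalar feature into a soft membership vector over k bins. The sketch below uses one common parameterization (evenly spaced means, σ equal to the bin spacing, normalized memberships); the paper's exact bin means and σ may differ:

```python
import math

def gaussian_binning(x, low, high, k=10):
    # Vectorize a real-valued feature into k soft bin memberships,
    # in the style of Maddela and Xu (2018): each bin is a Gaussian
    # centered at an evenly spaced mean over [low, high].
    step = (high - low) / (k - 1)
    mus = [low + j * step for j in range(k)]
    sigma = step  # overlap with neighboring bins; a modeling choice
    feats = [math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) for mu in mus]
    total = sum(feats)
    return [f / total for f in feats]  # normalized memberships
```

Unlike hard binning, a value near a bin boundary activates both neighboring bins, which gives the downstream network a smoother signal.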

Paraphrase Generation
We then paraphrase the top-ranked candidate v̂ ∈ V to generate the final simplification output ŷ. Our paraphrase generation model can explicitly control the extent of lexical paraphrasing by specifying the percentage of words to be copied from the input sentence as a soft constraint. We also introduce a data augmentation method to encourage our model to generate more diverse outputs.
Base Model. Our base generation model is a Transformer encoder-decoder initialized with the BERT checkpoint (Rothe et al., 2020), which achieved the best reported performance on text simplification in recent work (Jiang et al., 2020). We enhance this model with an attention-based copy mechanism to encourage lexical paraphrasing while remaining faithful to the input.
Copy Control. Given the input candidate v̂ = (v̂_1, v̂_2, . . . , v̂_l) of l words and the percentage of copying cp ∈ (0, 1], our goal is to paraphrase the remaining (1 − cp) × l words in v̂ into a simpler version. To achieve this, we convert cp into a vector of the same dimension as the BERT embeddings using Gaussian binning (Maddela and Xu, 2018) and add it to the beginning of the input sequence v̂. The Transformer encoder then produces a sequence of context-aware hidden states H = (h_1, h_2, . . . , h_l), where h_i corresponds to the hidden state of v̂_i. Each h_i is fed into the copy network, which predicts the probability p_i that word v̂_i should be copied to the output. We create a new hidden state h̃_i by adding to h_i a vector u scaled according to p_i. In other words, the scaled version of u informs the decoder whether the word should be copied. A single vector u is used across all sentences and hidden states, and is randomly initialized and then updated during training. More formally, the encoding process can be described as:

p_i = σ(CopyNet(h_i)),    h̃_i = h_i + p_i · u

The Transformer decoder generates the output sequence from H̃ = (h̃_1, h̃_2, . . . , h̃_l). Our copy mechanism is incorporated into the encoder, rather than copying input words during the decoding steps (Gu et al., 2016; See et al., 2017). Unless otherwise specified, we use the average copy ratio of the training dataset, 0.7, for our experiments.
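The gating step h̃_i = h_i + p_i · u is a single rank-one update of the hidden-state matrix. A NumPy sketch (in the actual model, H comes from the Transformer encoder and the copy probabilities from a learned feedforward classifier):

```python
import numpy as np

def apply_copy_gate(H, u, copy_probs):
    # H: [l, d] encoder hidden states for the l input word pieces.
    # u: [d] single learned copy vector shared across all sentences.
    # copy_probs: [l] predicted probability p_i that word i is copied.
    # Returns H~ with h~_i = h_i + p_i * u, so the scaled copy vector
    # signals the decoder how strongly each word should be kept.
    return H + np.outer(copy_probs, u)
```

Because u is shared, the decoder only needs to learn one direction in hidden-state space that means "copy this token".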
Multi-task Training. We train the paraphrasing model and the copy network in a multi-task learning setup, where predicting whether a word should be copied serves as an auxiliary task. The gold labels for this task are obtained by checking if each word in the input sentence also appears in the human reference. When a word occurs multiple times in the input, we rely on the monolingual word alignment results from JacanaAlign (Yao et al., 2013) to determine which occurrence is the one that gets copied. We train the Transformer model and the copy network jointly by minimizing the cross-entropy loss for both decoder generation and binary word classification. We provide implementation and training details in Appendix A.
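The gold labels for the auxiliary copy-prediction task can be derived as below. The paper disambiguates repeated words with monolingual word alignment (JacanaAlign); this sketch uses simple set membership instead:

```python
def gold_copy_labels(source_tokens, reference_tokens):
    # Auxiliary-task labels: a source word is labeled 1 ("copy") if it
    # also appears in the human reference, else 0 (paraphrased/deleted).
    ref = {t.lower() for t in reference_tokens}
    return [1 if t.lower() in ref else 0 for t in source_tokens]
```

These binary labels are supervised with a cross-entropy loss alongside the decoder's generation loss.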
Data Augmentation. The sentence pairs in the training corpus often exhibit a variable mix of splitting and deletion operations along with paraphrasing (see Figure 1 for an example), which makes it difficult for encoder-decoder models to learn paraphrases. Utilizing DisSim, we create additional training data that focuses on lexical paraphrasing. For each sentence pair ⟨x, y⟩, we first generate a set of candidates V = {v_1, v_2, . . . , v_n} by applying DisSim to x, as described in §2.1. Then, we select a subset of V, called V′ = {v′_1, v′_2, . . . , v′_{n′}} (V′ ⊆ V), whose members are fairly close to the reference y but have only undergone splitting and deletion. We score each candidate v_i using the length-penalized BERTScore g*(v_i, y) in Eq. (1), and discard those with scores lower than 0.5. While calculating g*, we set φ_y and λ to 1 and 2 respectively to favor candidates of similar length to the reference y. We also discard candidates that have a different number of split sentences than the reference. Finally, we train our model on the filtered candidate-reference sentence pairs ⟨v′_1, y⟩, ⟨v′_2, y⟩, . . . , ⟨v′_{n′}, y⟩, which focus on lexical paraphrasing, in addition to ⟨x, y⟩.
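The augmentation filter can be sketched generically. `score_fn` stands in for the length-penalized BERTScore of Eq. (1) and `count_splits` for a sentence-boundary counter; both are placeholders here, not the paper's actual implementations:

```python
def filter_augmented_pairs(candidates, reference, score_fn, count_splits,
                           threshold=0.5):
    # Keep candidate-reference pairs where the candidate scores at least
    # `threshold` under the similarity function and has the same number
    # of split sentences as the reference.
    ref_splits = count_splits(reference)
    return [(v, reference) for v in candidates
            if score_fn(v, reference) >= threshold
            and count_splits(v) == ref_splits]
```

The surviving pairs differ from the reference almost only in wording, which is exactly the signal the paraphrasing model needs.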

Controllable Generation
We can control our model to concentrate on specific operations. For split- or delete-focused simplification, we select candidates with the desired length or number of splits during the candidate generation step. For paraphrase-focused simplification, we perform only the paraphrase generation step. The paraphrasing model is designed specifically to paraphrase with minimal deletion and without splitting: it retains the length and the number of split sentences of its input, thus preserving the extent of deletion and splitting controlled in the previous steps. We control the degree of paraphrasing by changing the copy ratio.
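Operation-specific candidate selection reduces to simple filters over the candidate pool. A sketch, using a period count as a crude sentence-boundary proxy (the actual system has access to DisSim's sub-sentence structure):

```python
def select_for_operation(candidates, source, want_splits=None, max_cr=None):
    # Keep candidates matching the desired number of split sentences
    # (split-focused) and/or a maximum compression ratio (delete-focused).
    src_len = len(source.split())
    kept = []
    for v in candidates:
        if want_splits is not None and v.count(".") != want_splits:
            continue  # "." count as a crude sentence-boundary proxy
        if max_cr is not None and len(v.split()) / src_len > max_cr:
            continue
        kept.append(v)
    return kept
```

Passing `want_splits=2` steers the pipeline toward split-focused outputs, while a small `max_cr` steers it toward delete-focused outputs.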

Experiments
In this section, we compare our approach to various sentence simplification models using both automatic and manual evaluations. We show that our model achieves a new state-of-the-art and can adapt easily to different simplification styles, such as paraphrasing and splitting without deletion.

Data and Experiment Setup
We train and evaluate our models on the Newsela (Xu et al., 2015) and Wikipedia corpora (Zhu et al., 2010; Woodsend and Lapata, 2011; Coster and Kauchak, 2011). The Newsela corpus consists of news articles, with each article rewritten by professional editors for students in different grades. We used the complex-simple sentence pairs automatically aligned by Jiang et al. (2020).

Table 2: Automatic evaluation results on the NEWSELA-AUTO test set. We report SARI, the main automatic metric for simplification, and its three edit scores, namely precision for delete (del) and F1 scores for the add and keep operations. We also report FKGL (FK), average sentence length (SLen), output length (OLen), compression ratio (CR), self-BLEU (s-BL), percentage of sentence splits (%split), average percentage of new words added to the output (%new), and percentage of sentences identical to the input (%eq). Bold typeface denotes the best performance (i.e., closest to the reference).

To demonstrate that our model can be controlled to generate diverse simplifications, we evaluate under the following settings: (i) standard evaluation on the NEWSELA-AUTO test set, similar to the methodology in the recent literature (Jiang et al., 2020; Dong et al., 2019; Zhang and Lapata, 2017), and (ii) evaluation on different subsets of the NEWSELA-AUTO test set that concentrate on a specific operation. We selected 9,356 sentence pairs with sentence splits for split-focused evaluation. Similarly, we chose 9,511 sentence pairs with a compression ratio < 0.7 and without sentence splits to evaluate delete-focused simplification.

We created a new dataset, called NEWSELA-TURK, to evaluate lexical paraphrasing. Similar to the WIKIPEDIA-TURK benchmark corpus (Xu et al., 2016), NEWSELA-TURK consists of human-written references focused on lexical paraphrasing. We first selected sentence pairs from the NEWSELA-AUTO test set of roughly similar length (compression ratio between 0.8 and 1.2) and with no sentence splits, because they more likely involve paraphrasing. Then, we asked Amazon Mechanical Turk workers to simplify the complex sentence without any loss in meaning. To ensure the quality of the simplifications, we manually selected the workers using the qualification test proposed in Alva-Manchego et al. (2020), during which the workers were asked to simplify three sentences. We selected the top 35% of the 300 workers that participated in the test. We periodically checked the submissions and removed underperforming workers. In the end, we collected 500 sentences with 4 references each.
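The pair-selection heuristic for paraphrase-focused references can be sketched directly. The period count is a crude single-sentence proxy introduced for this illustration, not the paper's exact criterion:

```python
def paraphrase_focused_pairs(pairs):
    # Select complex-simple pairs that likely involve paraphrasing:
    # roughly similar length (compression ratio in [0.8, 1.2]) and
    # no sentence splits (at most one "." in the simple side).
    kept = []
    for complex_s, simple_s in pairs:
        cr = len(simple_s.split()) / len(complex_s.split())
        if 0.8 <= cr <= 1.2 and simple_s.count(".") <= 1:
            kept.append((complex_s, simple_s))
    return kept
```

Pairs passing both filters cannot owe their simplicity to heavy deletion or splitting, so the remaining differences are mostly lexical.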

Existing Methods
We use the following simplification approaches as baselines: (i) BERT-initialized Transformer (Rothe et al., 2020), where the encoder is initialized with the BERT base checkpoint and the decoder is randomly initialized; it is the current state-of-the-art for text simplification (Jiang et al., 2020). (ii) EditNTS (Dong et al., 2019), another state-of-the-art model, which uses a neural programmer-interpreter (Reed and de Freitas, 2016) to predict the edit operation on each word and then generates the simplified sentence. (iii) LSTM, a vanilla encoder-decoder model used in Zhang and Lapata (2017). (iv) Hybrid-NG (Narayan and Gardent, 2014), one of the best existing hybrid systems, which performs splitting and deletion using a probabilistic model and lexical substitution with a phrase-based machine translation system. We retrained all the models on the NEWSELA-AUTO dataset.

Table 3: Automatic evaluation results on NEWSELA-TURK, which focuses on paraphrasing (500 complex sentences with 4 human-written paraphrases each). We control the extent of paraphrasing of our models by specifying the percentage of words to be copied (cp) from the input as a soft constraint.

Automatic Evaluation
Metrics. We report SARI (Xu et al., 2016), which averages the F1/precision of n-grams (n ∈ {1, 2, 3, 4}) inserted, deleted, and kept when compared to human references. More specifically, it computes the F1 score for the n-grams that are added (add), which is an important indicator of whether a model is good at paraphrasing. The model's deletion capability is measured by the F1 score for n-grams that are kept (keep) and the precision for those deleted (del). We slightly improved the SARI implementation by Xu et al. (2016) to exclude spurious n-grams when calculating the F1 score for add: for example, if the input contains the phrase "is very beautiful", the phrase "is beautiful" is treated as a new phrase in the original implementation even though it is caused by the delete operation. Note that the SARI score of a reference with itself may not always be 100, as we treat 0 divided by 0 as 0, instead of 1, when calculating n-gram precision and recall; this avoids inflating del scores when the input is the same as the output. To evaluate a model's paraphrasing capability and diversity, we calculate the BLEU score with respect to the input (s-BL), the percentage of new words added (%new), and the percentage of system outputs identical to the input (%eq). Low s-BL, low %eq, or high %new indicate that the system is less conservative. We also report Flesch-Kincaid (FK) grade level readability (Kincaid and Chissom, 1975), average sentence length (SLen), the percentage of splits (%split), compression ratio (CR), and average output length (OLen). We do not report BLEU because it often does not correlate with simplicity (Sulem et al., 2018a,b; Xu et al., 2016).

Table 6: Human evaluation of 100 random simplifications from the NEWSELA-AUTO test set and the split-focused subset of the same test set. Has Split and Correct Split denote the percentage of the output sentences that have undergone splitting and the percentage of coherent splits, respectively.
* denotes that our model is significantly better than the corresponding baseline (according to a t-test with p < 0.05).

The baseline systems rely heavily on deletion, as they show high self-BLEU (> 66.5) and FK (> 8.8) scores despite having compression ratios similar to other systems. The Transformer model alone is rather conservative and copies 10.2% of the sentences directly to the output. Although Hybrid-NG makes more changes than any other baseline, its SARI and add scores are 3.7 and 1.7 points lower than our model's, indicating that it generates more errors. Our model achieves the lowest self-BLEU (48.7), FK (7.9), and percentage of sentences identical to the input (0.4%), and the highest add score (3.3) and percentage of new words (16.2%). In other words, our system is the least conservative, generates more good paraphrases, and mimics the human references better. We provide examples of system outputs in Table 9 and Appendix C.

Tables 3, 4, and 5 show the results on the NEWSELA-TURK, split-focused, and delete-focused subsets of the NEWSELA-AUTO test set, respectively. For these experiments, we configure our model to focus on specific operations (details in §2.4). Our model again outperforms the existing systems according to SARI, add score, and percentage of new words, which means it performs more meaningful paraphrasing. We show that we can control the extent of paraphrasing by varying the copy ratio (cp). Our model splits 93.5% of the sentences, substantially more than the other models.
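The diversity statistics reported in the tables can be computed with simple token-level heuristics. A sketch, assuming lowercased whitespace tokenization (the paper's exact tokenization may differ):

```python
def output_stats(sources, outputs):
    # Compute the diversity statistics used in the evaluation tables:
    # %new - avg. percentage of output words not present in the input,
    # %eq  - percentage of outputs identical to the input,
    # CR   - average compression ratio.
    new_pct, eq, cr = [], 0, []
    for src, out in zip(sources, outputs):
        s, o = src.lower().split(), out.lower().split()
        new_pct.append(100.0 * sum(w not in set(s) for w in o) / len(o))
        eq += int(" ".join(s) == " ".join(o))
        cr.append(len(o) / len(s))
    n = len(sources)
    return {"%new": sum(new_pct) / n, "%eq": 100.0 * eq / n, "CR": sum(cr) / n}
```

A conservative system scores low on %new and high on %eq; a system that paraphrases aggressively shows the opposite pattern.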

Human Evaluation
We performed two human evaluations: one to measure the overall simplification quality and the other to specifically capture sentence splitting. For the first, we asked five Amazon Mechanical Turk workers to evaluate the fluency, adequacy, and simplicity of 100 random simplifications from the NEWSELA-AUTO test set. For the second evaluation, over another 100 simplifications from the NEWSELA-AUTO split-focused test set, we supplemented the fluency and adequacy ratings with the binary questions described in Zhang et al. (2020a): we asked whether the output sentence exhibits splitting and whether the splitting occurs at the correct place. While fluency measures the grammaticality of the output, adequacy captures the extent of meaning preserved when compared to the input. Simplicity evaluates whether the output is simpler than the input. Each sentence was rated on a 5-point Likert scale, and we averaged the ratings from the five workers. For the binary ratings, we took the majority value. We used the output of our model tailored for sentence splitting in the second evaluation. We provide the annotation instructions in Appendix E. (Simplifications at the 2-3 readability levels in NEWSELA-AUTO contained more lexical overlaps and inflated the scores for EditNTS.)

Table 6 demonstrates that our model achieves the best fluency, simplicity, and overall ratings. The adequacy rating is also very close to that of Transformer_bert and EditNTS, even though our model performs more paraphrasing (Table 2), which verifies that the changes made by our system are meaningful. Our model achieves the highest number of correct sentence splits (90%) and the highest fluency (4.19) for syntactic simplification, showing that it can generate more coherent sentence splits than the other models.

Model Analysis
In this section, we analyze the contribution of each model component and examine the system errors.

System Ablations
We evaluate our key design choices, namely candidate ranking based on the length-penalized BERTScore, and paraphrase generation with data augmentation and copy attention. Table 8 summarizes the results. Our pairwise ranking model (BERTScore_len) achieves an increase of 3.2 points in SARI when compared to choosing a random (Random) candidate. Randomly selecting a candidate also performs fairly well, indicating that the generated candidates are generally of good quality. Compared to our final model (Our Model), its variants without data augmentation (− augmentation) and without the copy mechanism (− copy attn) suffer drops of 1.0 and 2.6 points in SARI respectively, and a decrease of at least 3.0% in new words, which demonstrates that these components encourage the system to paraphrase. Our model trained on only DisSim (− only DisSim) or only Transformer (− only Transformer) candidates performs close to our best model (Our Model) in terms of SARI.

Table 7: Examples of system outputs for each error category.

Good:
Complex: The Seattle kids petitioned Washington state last year to adopt stricter science-based regulations to protect them against climate change.
Simple: The eight kids from Seattle who petitioned Washington state last year are fighting for the state to adopt stricter laws to protect their future against climate change.
Our Model: The Seattle kids asked the Washington state last year to take tougher rules against climate change.

Hallucinations (13%):
Complex: "It's more of a family than living outside," said Jessica Konczal, 33, whose husband is Sgt. Matthew Konczal.
Simple: Jessica Konczal, age 33, says the base is its own type of family.
Our Model: "It's more of a family than living outside," said Jessica Konczal, one of the protesters.

Fluency Errors (14%):
Complex: Parental feedback on the menu additions so far, from some of the early adopter markets, has been "phenomenal," Leverton said.
Simple: Leverton said parents love the new food.
Our Model: So far, parents parents have feedback on the menu changes has been a great deal, he added.

Bad Substitution (15%):
Complex: One of the device's inventors explained to the president that the machine was a prototype.
Simple: One of the girls explained it was just a prototype.
Our Model: One of the inventors told the president that the first design was a design.

Anaphora Resolution / Entailment (12%):
Complex: Sea slugs dubbed sacoglossans are some of the most remarkable biological burglars on the planet.
Simple: The sea slugs known as sacoglossans are some of the most interesting burglars on the planet.
Our Model: Scientists say these are some of the most interesting creatures on the planet.

Human Reference Errors (4%):
Complex: They float in and out of formations that combine the underwater world with the stage.
Simple: In the dance on a round stage on the seabed, the dancers swim together and apart.
Our Model: They float in and out of places that combine stage with the underwater.

Error Analysis
To understand the errors generated by our model, we manually classified 200 simplifications from the NEWSELA-AUTO test set into the following categories: (a) Good, where the model generated meaningful simplifications; (b) Hallucinations, where the model introduced information not in the input; (c) Fluency Errors, where the model generated ungrammatical output; (d) Anaphora Resolution, where it was difficult to resolve pronouns in the output; (e) Bad Substitution, where the model inserted an incorrect simpler phrase; and (f) Human Reference Errors, where the reference does not reflect the source sentence. Note that a simplification can belong to multiple error categories. Table 7 shows examples of each category.

Related Work
Before the advent of neural networks, text simplification approaches performed each operation separately in a pipeline manner using either hand-crafted rules (Carroll et al., 1999; Siddharthan, 2002; Siddharthan et al., 2004) or data-driven methods based on parallel corpora (Zhu et al., 2010; Woodsend and Lapata, 2011; Narayan and Gardent, 2014). Following neural machine translation, the trend changed to performing all the operations together end-to-end (Zhang and Lapata, 2017; Nisioi et al., 2017; Zhao et al., 2018; Alva-Manchego et al., 2017; Vu et al., 2018).

Controllable text simplification has been attempted before, but only with limited capability. Scarton and Specia (2018) conditioned the generation model on the target grade level. Another long body of research focuses on a single simplification operation and can be broadly divided into three categories: (1) Lexical Simplification (Specia et al., 2012; Horn et al., 2014; Glavaš and Štajner, 2015; Paetzold and Specia, 2015, 2017; Maddela and Xu, 2018; Qiang et al., 2020), where complex words are substituted with simpler words.

Complex: Experts say China's air pollution exacts a tremendous toll on human health.
Simple: China's air pollution is very unhealthy.
Hybrid-NG: experts say the government's air pollution exacts a toll on human health.
LSTM: experts say china's air pollution exacts a tremendous toll on human health.
Transformer_bert: experts say china's pollution has a tremendous effect on human health.
EditNTS: experts say china's air pollution can cause human health.
Our Model (cp = 0.6): experts say china's air pollution is a big problem for human health.
Our Model (cp = 0.7): experts say china's air pollution can cause a lot of damage on human health.
Our Model (cp = 0.8): experts say china's air pollution is a huge toll on human health.

Conclusion
We proposed a novel hybrid approach for sentence simplification that performs better and produces more diverse outputs than the existing systems. We designed a new data augmentation method to encourage the model to paraphrase. We created a new dataset, NEWSELA-TURK, to evaluate paraphrasing-focused simplifications. We showed that our model can control various attributes of the simplified text, such as number of sentence splits, length, and number of words copied from the input.

Acknowledgments
We thank the anonymous reviewers for their valuable feedback. We thank Newsela for sharing the data and NVIDIA for providing GPU computing resources. This research is supported in part by the NSF award IIS-1822754, ODNI and IARPA via the BETTER program contract 19051600004. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of NSF, ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

A Implementation and Training Details
We implemented two separate Transformer models for the neural deletion and split component (§2.1) and paraphrase generation (§2.3) using the Fairseq toolkit. Both the encoder and decoder follow the BERT base architecture, and the encoder is also initialized with the BERT base checkpoint. For the neural deletion and split component, we used a beam search of width 10 to generate candidates. The copy attention mechanism is a feedforward network containing 3 hidden layers with 1000 nodes in each layer and tanh activation, and a single linear output node with sigmoid activation. We used the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.0001, a linear learning rate warmup of 40k steps, and 100k training steps. We used a batch size of 64 and the BERT WordPiece tokenizer. During inference, we constrained the beam search to not repeat trigrams and to emit sentences that avoid aggressive deletion (compression ratio ∈ [0.9, 1.2]). We chose the best checkpoint based on the SARI score (Xu et al., 2016) on the dev set, saving a checkpoint after every epoch. We did not perform any hyperparameter search and directly used the hyperparameters of the BERT-initialized Transformer described in Rothe et al. (2020). The model takes 10 hours to train on 1 NVIDIA GeForce GPU. Our pairwise ranking model, implemented using the PyTorch framework, consists of 3 hidden layers with 100 nodes in each layer, tanh activation, and a single linear output node. We used the Adam optimizer with a learning rate of 0.01 for 10 epochs, and applied a dropout of 0.2. For Gaussian binning, we vectorized the numerical features into 10-dimensional vectors. The model takes half an hour to train on 1 NVIDIA GeForce GPU. We did not perform any extensive hyperparameter tuning; we only examined a few values for the learning rate (0.001, 0.01, and 0.1) and chose the best based on the SARI score on the dev set. We used the original code for DisSim.
12 https://github.com/pytorch/fairseq
13 https://github.com/google-research/bert
14 https://github.com/Lambda-3/DiscourseSimplification

B Annotation Interface

C System Outputs

Complex: This year, the FAA has approved dozens of permits for agricultural drone businesses.
Simple: This year, it approved dozens of permits for agricultural drone businesses.
Hybrid-NG: this year, the government has approved dozens of drone permits for agricultural businesses.
LSTM: this year, the faa has approved dozens of permits for agricultural drone businesses.
Transformer_bert: this year, the faa has approved dozens of permits for agricultural businesses.
EditNTS: this year, the government has approved dozens of permits for drone businesses for no permission.
Our Model (cp = 0.6): this year, the faa has allowed many businesses to use drones.
Our Model (cp = 0.7, 0.8): this year, the faa has approved dozens of permits for drones.

Complex: The room echoed with the sounds of song, the beat of drums, the voices of young men.
Simple: As she spoke, the building echoed with music and the beat of drums.
Hybrid-NG: echoed the room.
LSTM: the room echoed with the sounds of song, the voices of young men.
Transformer_bert: the room echoed with the sound of song, the beat of drums, the voices of young men.
EditNTS: the room echoed with the sounds of song, the beat of drums, the voices of young men who are hungry and legs.
Our Model (cp = 0.6): the sound of the room was full of sounds of young men and the voices of cellos.
Our Model (cp = 0.7): the sound of the room sounded like a lot of music, and the voices of young men.
Our Model (cp = 0.8): the sound of the room sounded like a song, the beat of drums, and the voices of young men.

Table 11: Automatic evaluation results on a subset of the Newsela test set that focuses on paraphrasing (8,371 complex-simple sentence pairs with compression ratio > 0.9 and no splits). We control the extent of paraphrasing of our models by specifying the percentage of words to be copied (cp) from the input as a soft constraint.

We use the complex-simple sentence pairs from WIKI-AUTO (Jiang et al., 2020), which contains 138,095 article pairs and 604k non-identical aligned and partially-aligned sentence pairs. To capture sentence splitting, we join the sentences in the simple article that are mapped to the same sentence in the complex article. Similar to Newsela, we remove the sentence pairs with high (> 0.9) and low (< 0.1) BLEU (Papineni et al., 2002) scores. For validation and testing, we use the following two corpora: (i) the TURK corpus (Xu et al., 2016) for lexical paraphrasing and (ii) the ASSET corpus (Alva-Manchego et al., 2020) for multiple rewrite operations. While the former corpus has 8 human-written references for 2,000 validation and 359 test sentences, the latter provides 10 references for the same sentences. We remove the validation and test sentences from the training corpus. Tables 12 and 13 show the results on TURK and ASSET, respectively.