On the Helpfulness of Document Context to Sentence Simplification

Most research on text simplification is currently limited to the sentence level. In this paper, we are the first to investigate the helpfulness of document context for sentence simplification and to incorporate it into a sequence-to-sequence model. We first construct a sentence simplification dataset in which the context of each original sentence is provided by the Wikipedia corpus. The new dataset contains approximately 116K sentence pairs with context. We then propose a new model that makes full use of the context information. Our model uses neural networks to learn the different effects of the preceding sentences and the following sentences on the current sentence and applies them to an improved transformer model. Evaluated on the newly constructed dataset, our model achieves a SARI score of 36.52, which outperforms the best-performing baseline by 2.46 points (7.22%), indicating that context indeed helps improve sentence simplification. In the ablation study, we show that using either the preceding sentences or the following sentences as context can significantly improve simplification.


Introduction
Text simplification is an active topic in natural language generation (NLG) and addresses a critical societal need (Woodsend and Lapata, 2011). Text simplification aims to adapt a complex text into a more readable version with the same meaning (Sulem et al., 2018b), which benefits young children (Kajiwara et al., 2013) and non-native English speakers (Paetzold, 2015; Paetzold and Specia, 2016). It covers many operations on the input text, such as deletion, reordering, paraphrasing, and sentence splitting (Saggion, 2017). Besides, text simplification is closely related to many natural language processing (NLP) tasks, such as machine translation (Štajner and Popović, 2016; Hasler et al., 2017), paraphrase generation (Cao et al., 2017; Zhao et al., 2018), and text summarization (Ma and Sun, 2017; Jin et al., 2020).
In recent years, many researchers have proposed various models to improve the performance of text simplification. However, few have explored the impact of document context (i.e., the preceding and following sentences of the original sentence to be simplified in a document) on text simplification, let alone established a large-scale dataset containing context. Many examples like the one in Table 1 motivated us to explore the influence of context. In the simplified sentence, "played and sang" is the simplification of "in her performances" in the original sentence. The phrase "sing as well as play" in the context may provide additional information that helps this simplification. The phrase "at the bar" is retained because of the presence of "at the Midtown Bar" in the context. Correspondingly, since nothing related to "fan base" is mentioned in the context, adjectives such as "loyal" and "small" are deleted.
In this paper, we are committed to investigating the influence of document context on text simplification and to proposing a neural model that uses context to improve simplification. Using the Wikipedia datasets (Coster and Kauchak, 2011; Kauchak, 2013), we first construct a dataset in which the document context of the original sentence in each sentence pair is provided¹. This dataset is built automatically, and the training set has more than 150K sentence pairs. Then, we propose a new model named simplification using context (SUC) by improving the transformer model. We use two multi-head self-attention modules to learn the representations of the context. We also use neural networks to learn different weights and multiply them with the self-attention layers' outputs. Finally, we conduct experiments to explore the different effects of the preceding sentences and the following sentences on the current sentence.

Table 1: An example sentence pair with its document context.

Context: To fund her private lessons , Simone performed at the Midtown Bar & Grill on Pacific Avenue in Atlantic City , whose owner insisted that she sing as well as play the piano .
Original: Simone 's mixture of jazz , blues , and classical music in her performances at the bar earned her a small , but loyal , fan base .
Simplified: Simone played and sang a mixture of jazz , blues and classical music at the bar . She began to get fans .
The experimental results show that our model achieves remarkable improvements. Compared with the original transformer model, our model improves the SARI score by 4.05 points. The ablation experiments show that using sentence pairs with context information significantly improves the SARI score.
The main contributions of our work are as follows: (1) We are the first to investigate the influence of document context on sentence simplification and build a large dataset for training and testing.
(2) We propose and train a new model named simplification using context (SUC), which makes full use of context information. SUC outperforms the baselines on both automatic evaluation and human evaluation.
(3) We use ablation experiments to illustrate the different effects of the preceding sentences and the following sentences on the current sentence to be simplified.

Related Work
Text simplification has developed rapidly in the past decade. Wubben et al. (2012) proposed the PBMT-R model, while Narayan and Gardent (2014) put forward the Hybrid model; both were based on statistical machine learning. Using a small number of manual simplifications and a large number of paraphrases, Xu et al. (2016) adapted statistical machine translation methods to simplification. Nisioi et al. (2017) were the first to apply a sequence-to-sequence model to automatic text simplification, using the framework of neural machine translation with task-specific improvements. Zhang and Lapata (2017) proposed the DRESS model, which rewards simple, fluent output sentences that preserve meaning. Vu et al. (2018) used memory-augmented neural networks to adapt the existing architecture, and Guo et al. (2018) used multi-task learning to improve entailment and paraphrasing capabilities.
Recently, Kriz et al. (2019) used two techniques to address the problem that models tend to copy words directly, resulting in long and complicated output sentences. Nishihara et al. (2019) proposed a method to simplify original sentences to different target levels. Different from most sequence-to-sequence models, Dong et al. (2019) proposed a neural programmer-interpreter approach that directly predicts explicit edit operations. A neural CRF model has also been proposed to obtain better sentence alignment.
However, up to now, most research on text simplification has been limited to the sentence level, ignoring the influence of document context on sentence simplification. Pitler and Nenkova (2008) were the first to empirically demonstrate that discourse relations are closely related to the perceived quality of a text. So far, only one prior work has focused on discourse-level factors of text simplification; its results show that using discourse-level factors is useful for predicting sentence deletion. Nevertheless, different from our research, which focuses on sentence simplification, that work mainly analyzes and predicts sentence deletion in document simplification. In addition to the deletion operation, sentence simplification also includes keeping, splitting, synonym replacement, and other operations (Xu et al., 2016). Whether the preceding sentences and the following sentences have different effects on the original sentence has not been taken into account.
Even in the related field of machine translation, most research focuses on the preceding sentences. One line of work, based on the transformer model, uses a new encoder to represent the context; Werlen et al. (2018) proposed a hierarchical attention model that captures the context in a structured and dynamic way. Both of these widely noted models focus only on the preceding sentences. In text simplification, however, the following sentences also contain information about keeping, paraphrasing, and deleting words in the original sentence. To the best of our knowledge, we are the first to study the effects of both the preceding sentences and the following sentences and to apply them to a sequence-to-sequence model in the field of text simplification.

Our Model
The SUC model consists of four parts: the transformer model, a context information module, a pointer-generator network, and a coverage mechanism. Among them, the context information module is a new module we propose to obtain the context representation and apply it to the transformer model.

The Transformer Model
The transformer model is based on the attention mechanism and has a straightforward structure (Vaswani et al., 2017). The original transformer model has an encoder and a decoder, both of which use multi-head attention and feed-forward networks. A set of queries is packed into a matrix Q, and the keys and values are packed into matrices K and V, respectively. With d_k denoting the dimension of the queries and keys and d_v the dimension of the values, the output can be computed as:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

Multi-head attention is defined as follows:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
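As a point of reference, the scaled dot-product attention above can be written in a few lines of PyTorch. This is the standard formulation from Vaswani et al. (2017), given here only as a minimal sketch, not the authors' code.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    q: (..., L_q, d_k), k: (..., L_k, d_k), v: (..., L_k, d_v)."""
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    return torch.matmul(F.softmax(scores, dim=-1), v)
```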

Representation of Context Information
First, we give an overview of the module that computes the representation of the context information, as shown in Figure 1. We use two additional encoders to compute the representations of the preceding sentences and the following sentences, respectively. The input text first passes through the embedding layer and the position encoding layer. Define X ∈ R^{D×L} as the representation of the input text after embedding and position encoding, where D is the embedding dimension and L is the length of the input text.
The additional encoder consists of N identical layers, each of which contains two components, where N is equal to the number of layers in the encoder and decoder. The first component is a multi-head attention, the same as the one in the transformer model. The input matrices Q, K, and V of this multi-head attention are all the matrix X:

Attn = MultiHead(X, X, X)

The second component is a fully connected network consisting of two linear transformation layers and a GELU activation layer, the same as the one in the transformer model:

FFN(x) = GELU(x W_1 + b_1) W_2 + b_2

Thus the attention result of the input text is obtained.
Figure 1: An overview of how to obtain the representation of the context information. The inputs are the preceding sentences and the following sentences of the current sentence. The output is the representation R_C of the context information.
It is worth noting that, to prevent overfitting, a dropout mechanism and LayerNorm are added at the end of each component:

LayerNorm(I + Dropout(O))

where I and O denote the input and output of the component, respectively. Given the input matrix X, the additional encoder produces the output R_N ∈ R^{d_f×L}, where d_f is the dimension of the inner layer. We design a new neural network to calculate the weight V corresponding to R_N. The network consists of a linear transformation layer followed by a sigmoid activation layer. The linear transformation layer maps R_N to weights V of dimension 1, and the sigmoid activation layer maps V to values between 0 and 1. When the input of the sigmoid function is near 0, its derivative is large; when the input approaches positive or negative infinity, the output approaches 1 or 0, respectively. We then multiply V by R_N to obtain the weighted output of the additional encoder. There are two additional encoders in our model, which obtain the representations of the preceding sentences and the following sentences, respectively. We combine the two weighted outputs to obtain R_C, the representation of the context information. V_1 differs from V_2, indicating that the weights of the preceding sentences and the following sentences are different.
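To make the data flow of this module concrete, the following is a minimal PyTorch sketch of one possible implementation. It is not the authors' code: the class names, the default hyperparameters, and in particular the choice to combine the two weighted outputs by concatenation along the sequence dimension are our assumptions.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """One additional encoder: self-attention plus feed-forward layers,
    followed by a learned sigmoid gate over positions (a sketch)."""
    def __init__(self, d_model=512, n_heads=4, n_layers=4, d_ff=2048, dropout=0.1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=d_ff,
                                           dropout=dropout, activation="gelu")
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.gate = nn.Linear(d_model, 1)  # maps each position to one scalar weight

    def forward(self, x):                  # x: (seq_len, batch, d_model)
        r_n = self.encoder(x)              # R_N
        v = torch.sigmoid(self.gate(r_n))  # V in (0, 1), shape (seq_len, batch, 1)
        return v * r_n                     # weighted output

class ContextRepresentation(nn.Module):
    """Two additional encoders for the preceding and the following sentences;
    their weighted outputs are combined into R_C. Concatenation along the
    sequence dimension is our assumption about the combination step."""
    def __init__(self, d_model=512, **kw):
        super().__init__()
        self.prev_enc = ContextEncoder(d_model, **kw)
        self.next_enc = ContextEncoder(d_model, **kw)

    def forward(self, x_prev, x_next):
        return torch.cat([self.prev_enc(x_prev), self.next_enc(x_next)], dim=0)  # R_C
```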

Incorporation of Context Information
We also give an overview of how the context information is used, as shown in Figure 2. We apply the representation R_C of the context information to the encoder and the decoder at the same time. The encoder and the decoder each contain N identical improved layers. We also use the dropout mechanism and LayerNorm, as in the additional encoder.

Figure 2: An overview of how to use the context information. The blue modules are the ones we add to process the representation of the context information.
The improved encoder layer consists of three components: two multi-head attentions and a neural network. The first component is a multi-head attention. Define X_origin as the representation of the original sentence after embedding and position encoding. The input matrices Q, K, and V of the first multi-head attention are all X_origin:

Attn_1 = MultiHead(X_origin, X_origin, X_origin)

The second component is also a multi-head attention. We feed the context representation R_C into this multi-head attention as matrices K and V; the matrix Q is the output Attn_1 of the previous multi-head attention:

Attn_2 = MultiHead(Attn_1, R_C, R_C)

The last component is a neural network, the same as the one in the additional encoder. Define R_E as the output of the encoder:

R_E = FFN(Attn_2)

The improved decoder layer consists of four components: three multi-head attentions and a neural network. Following the original transformer model, we offset the word embeddings of the simple sentence by one position. Define X_simple as the representation of the simple sentence after embedding and position encoding. The input matrices Q, K, and V of the first multi-head attention are all X_simple:

Attn_1 = MultiHead(X_simple, X_simple, X_simple)

The second component is the same as the second component in the encoder layer; it takes the context representation R_C and the output of the first component as input:

Attn_2 = MultiHead(Attn_1, R_C, R_C)

The third component is a multi-head attention that takes the encoder output R_E as matrices K and V; the matrix Q is the output Attn_2 of the previous multi-head attention:

Attn_3 = MultiHead(Attn_2, R_E, R_E)

The last component is a neural network, the same as the one in the encoder. Define R_D as the output of the decoder:

R_D = FFN(Attn_3)
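For illustration, below is a minimal PyTorch sketch of one improved encoder layer as described above. The class name, default dimensions, and the exact placement of dropout and LayerNorm in the residual path are our assumptions rather than the authors' implementation; the decoder layer would add a third attention over R_E in the same manner.

```python
import torch
import torch.nn as nn

class ImprovedEncoderLayer(nn.Module):
    """Self-attention over the original sentence, cross-attention into the
    context representation R_C, then a feed-forward network (a sketch)."""
    def __init__(self, d_model=512, n_heads=4, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout)
        self.ctx_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])
        self.drop = nn.Dropout(dropout)

    def forward(self, x, r_c):             # x: (L, B, D), r_c: (L_ctx, B, D)
        # 1) self-attention over the original sentence
        a, _ = self.self_attn(x, x, x)
        x = self.norms[0](x + self.drop(a))
        # 2) cross-attention: Q = previous output, K = V = R_C
        a, _ = self.ctx_attn(x, r_c, r_c)
        x = self.norms[1](x + self.drop(a))
        # 3) position-wise feed-forward network
        return self.norms[2](x + self.drop(self.ffn(x)))
```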

Pointer-Generator Network and Coverage Mechanism
The pointer-generator network copies words from the original sentence to solve the out-of-vocabulary problem (See et al., 2017). Following the implementation², the pointer-generator network used in our model contains a multi-head attention in which R_D is taken as matrix Q and R_E is taken as matrices K and V:

Attn = MultiHead(R_D, R_E, R_E)

The generation probability P_gen is obtained from this attention through a sigmoid over a linear combination whose parameters, the vectors W_1, W_2, W_3 and the scalar b, are all learnable. The final probability distribution can be defined as:

P(w) = P_gen * P_vocab(w) + (1 - P_gen) * sum_{i: w_i = w} D_i

where the attention distribution D is obtained from the multi-head attention and P_vocab is the probability distribution over the vocabulary. The coverage model addresses the problem of repeated text generation in sequence-to-sequence models (Tu et al., 2016). The general coverage model is given by:

C_i = g_update(C_{i-1}, α_i, Φ(h), Ψ)

where C_i is the coverage vector that summarizes the previous attentions up to time step i and helps adjust future attention, g_update updates C_i after the new attention α_i when decoding at time step i, Φ(h) is a word-specific feature, and Ψ denotes different auxiliary inputs.
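The copy/generate mixture can be sketched as follows, in the spirit of See et al. (2017). Since the exact inputs to the P_gen predictor are not recoverable from the text, the three features used here (attention context, decoder state, decoder input) and all names are assumptions; the coverage term is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointerGenerator(nn.Module):
    """Mixes a vocabulary distribution with a copy distribution over source
    tokens, in the spirit of See et al. (2017). A sketch, not the authors' code."""
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.vocab_proj = nn.Linear(d_model, vocab_size)
        # Assumption: P_gen is predicted from three concatenated features
        # (attention context, decoder state, decoder input).
        self.p_gen_proj = nn.Linear(3 * d_model, 1)

    def forward(self, ctx, dec_state, dec_input, attn_dist, src_ids):
        # ctx, dec_state, dec_input: (batch, d_model)
        # attn_dist: (batch, src_len) attention over source positions
        # src_ids: (batch, src_len) source token ids, assumed inside the vocabulary
        p_vocab = F.softmax(self.vocab_proj(dec_state), dim=-1)          # (batch, V)
        p_gen = torch.sigmoid(
            self.p_gen_proj(torch.cat([ctx, dec_state, dec_input], dim=-1)))
        # scatter attention mass onto the vocabulary to form the copy distribution
        p_copy = torch.zeros_like(p_vocab).scatter_add(1, src_ids, attn_dist)
        return p_gen * p_vocab + (1.0 - p_gen) * p_copy                  # final P(w)
```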

Dataset
Since contexts are not provided in commonly used datasets such as Wikismall (Zhu et al., 2010) and Newsela (Xu et al., 2015), we first need to build an appropriate dataset. With the help of the Wikipedia datasets (Coster and Kauchak, 2011; Kauchak, 2013), we construct a dataset for our research³. The Wikipedia datasets contain about 167K aligned sentence pairs. We extract the context sentences of each original sentence from the document-aligned data. We retain the sentences without preceding or following sentences in order to train the sentence-level modules of our model. In the training set, there are around 110K aligned sentence pairs with context information and around 41K aligned sentence pairs without context information. From the remaining sentence pairs with context information, we use 5K as the validation set and 1K as the test set. There is no overlap among the sentences in the test, validation, and training sets.
A previous study on machine translation has shown that using too much context information not only fails to improve the results but also increases the computational complexity. Following this finding, we take two preceding sentences and two following sentences of the current sentence as the context information. If there is only one preceding or following sentence, we keep it.
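As an illustration of this context-window construction, here is a small Python sketch that collects up to two preceding and two following sentences for each aligned original sentence. The function names and data structures are ours and are not part of the released dataset tooling.

```python
from typing import List, Tuple

def extract_context(doc_sentences: List[str], idx: int,
                    window: int = 2) -> Tuple[List[str], List[str]]:
    """Return up to `window` preceding and following sentences for the
    sentence at position `idx` in its document (a sketch)."""
    preceding = doc_sentences[max(0, idx - window):idx]
    following = doc_sentences[idx + 1:idx + 1 + window]
    return preceding, following

def build_records(doc_sentences: List[str], aligned: List[Tuple[int, str]]):
    """Build (context, original, simple) records from an aligned document,
    where `aligned` holds (position of original sentence, simple sentence)."""
    records = []
    for idx, simple in aligned:
        prev_ctx, next_ctx = extract_context(doc_sentences, idx)
        records.append({"preceding": prev_ctx, "following": next_ctx,
                        "original": doc_sentences[idx], "simple": simple})
    return records
```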
Evaluation Metrics

The SARI metric is arguably the most important criterion for measuring the quality of text simplification.⁴ SARI compares the simplified sentence with both the original sentence and the reference. The SARI score is composed of three parts: adding words, deleting words properly, and keeping words properly. The scores of the three parts are also reported in our results.
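To make the three components concrete, the sketch below computes simplified, unigram, single-reference versions of SARI's add, keep, and delete scores. The real metric (Xu et al., 2016) uses n-grams up to length 4 and multiple references, so this is only an illustration, not the official evaluation script.

```python
def sari_components(source: str, output: str, reference: str):
    """Simplified, unigram, single-reference illustration of SARI's three
    components (add / keep / delete); the real metric is more involved."""
    s, o, r = set(source.split()), set(output.split()), set(reference.split())

    def f1(p, q):
        return 2 * p * q / (p + q) if p + q else 0.0

    # ADD: words newly introduced by the system, credited if the reference adds them
    added_sys, added_ref = o - s, r - s
    p_add = len(added_sys & added_ref) / len(added_sys) if added_sys else 0.0
    r_add = len(added_sys & added_ref) / len(added_ref) if added_ref else 0.0

    # KEEP: words retained by the system, credited if the reference keeps them
    kept_sys, kept_ref = o & s, r & s
    p_keep = len(kept_sys & kept_ref) / len(kept_sys) if kept_sys else 0.0
    r_keep = len(kept_sys & kept_ref) / len(kept_ref) if kept_ref else 0.0

    # DELETE: words removed by the system, credited if the reference removes them
    del_sys, del_ref = s - o, s - r
    p_del = len(del_sys & del_ref) / len(del_sys) if del_sys else 0.0

    return f1(p_add, r_add), f1(p_keep, r_keep), p_del
```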
The BLEU metric has been commonly used to measure the similarity of the output to a reference sentence (Zhao et al., 2020), so we also report it.⁵ It is worth noting, however, that BLEU has recently been found to often correlate negatively with simplicity (Sulem et al., 2018a).
Following Dong et al. (2019), we use the FKGL metric to measure the readability of the output sentences.⁶ The lower the FKGL value, the simpler the output sentences.
In this paper, we regard SARI as the most important criterion for judging the effect of simplification. We also employ human judges to conduct a more reliable evaluation of this task.

Training Details
Our model is based on the transformer model (Vaswani et al., 2017). The additional encoders, the encoder, and the decoder in our model each have 4 layers with 4 attention heads. We set the vocabulary size to 45,800, and other uncommon words in the training set are replaced with the out-of-vocabulary token UNK. When predicting, we follow the method proposed by Jean et al. (2015) to replace UNK. We use the Adagrad optimizer (Duchi et al., 2011) and train for 50 epochs, with a learning rate of 0.1 and a training batch size of 16. Following BERT (Devlin et al., 2019), we replace the ReLU activation function with the GELU activation function (Hendrycks and Gimpel, 2016), which performs better with the transformer. During training, we first use the sentence pairs with context information to train the whole model, then fix the parameters of the document-level module, and finally use the sentence pairs without context information to train the sentence-level module.
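For reference, these hyperparameters can be gathered into a configuration object such as the sketch below. Field names, and any values not stated above (e.g., the model dimension and dropout rate), are placeholders we chose rather than the reported settings.

```python
from dataclasses import dataclass
import torch

@dataclass
class SUCConfig:
    # values reported above
    n_layers: int = 4          # layers in additional encoders, encoder, decoder
    n_heads: int = 4
    vocab_size: int = 45800
    learning_rate: float = 0.1
    batch_size: int = 16
    epochs: int = 50
    activation: str = "gelu"   # GELU instead of ReLU, following BERT
    # placeholders not reported in the text
    d_model: int = 512
    dropout: float = 0.1

def make_optimizer(model: torch.nn.Module, cfg: SUCConfig):
    # Adagrad optimizer with the stated learning rate
    return torch.optim.Adagrad(model.parameters(), lr=cfg.learning_rate)
```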

Baselines
The main purpose of our work is to investigate the effect of context on sentence simplification, rather than to propose a model that outperforms all existing models. Thus, we use four representative models as baselines, all of which are trained on the training set we construct and tested on the test set we construct. The four baselines are: (1) A BiLSTM-based encoder-decoder model, which is used in DRESS (Zhang and Lapata, 2017).⁷ (2) The original transformer model (Vaswani et al., 2017).
(3) The transformer model with the pointer-generator network (TP).
(4) The transformer model with the pointer-generator network and coverage mechanism (TPC).
The sentence pairs used by the baselines in the training set and test set are the same as those used by our SUC model, but do not include context.

Table 2: Results of the automatic evaluation on our test set. We report SARI, each part of SARI, FKGL (lower is better), and BLEU, and use Bold to mark the best results. We regard SARI as the most important criterion for automatic evaluation.

Automatic Evaluation
The results of the automatic evaluation are shown in Table 2. Our SUC model achieves a SARI score of 36.52, outperforming all the baselines. Compared with the original transformer model, our model improves the SARI score by 4.05 points. In terms of the individual SARI components, our model also achieves the highest scores for the deleting and adding operations. For the keeping operation, our model's score is slightly lower than that of TPC, which performs fewer deleting and adding operations and thus scores higher on keeping.
As for the FKGL values, although our model does not achieve the best result, its FKGL value is only 0.45 points higher than that of the BiLSTM-based encoder-decoder and much lower than those of the remaining baselines. TP obtains the highest FKGL value, but compared with the transformer it improves the SARI score through better keeping operations. As for the BLEU values, our SUC model and the best-performing TPC model yield similar results and far outperform the other three models. However, it is worth noting that the SARI score of our model is 8.55 points higher than that of the TPC model.
We regard SARI as the most important criterion for automatic evaluation. Therefore, the automatic evaluation results illustrate that our model outperforms the baselines and that context indeed contributes to sentence simplification. Examples of sentences generated by our model and the baseline models are given in Table 3.

Human Evaluation
In addition to the automatic evaluation, we conducted a human evaluation on the outputs of the different models. We randomly selected 50 sentences from the test set for evaluation.⁸ Following previous works, the volunteers rated each output on three aspects: fluency, adequacy, and simplicity.

⁷ The code is available at https://github.com/mounicam/wiki-auto/tree/master/simplification
⁸ One of the goals of text simplification is to provide convenience for non-native speakers. Therefore, we invited two non-native speakers, with English proficiency roughly equivalent to that of 10-12 year olds in English-speaking countries, as volunteers. The volunteers fully understood the evaluation criteria before conducting the evaluation, and they were given the complex sentences and the different system outputs in random order. After the evaluation, they received a sum of money as payment.

Table 4: Results of the human evaluation on 50 sentence pairs randomly selected from our test set. We use Bold to mark the best results other than the reference. Avg represents the average score over the three aspects.
Apart from the outputs of the different models, the volunteers also rated the reference. A five-point Likert scale was used for rating, and the results of the human evaluation are shown in Table 4. Our SUC model outperforms the baselines in all aspects as well as in the average score. Although the fluency of the sentences generated by our model is still far from the reference, it exceeds that of the second-ranked model by 0.24 points. In terms of adequacy, the performance of our model is close to the reference. In terms of simplicity, our model is also the best-performing model, outperforming the second-ranked model by 0.09 points.

Ablation Study
We designed ablation experiments to explore the effects of different modules in our model, especially the effects of the preceding sentences and the following sentences. We conducted five additional experiments: (1) We added an additional encoder to TPC and used only the 114K sentence pairs with the preceding sentences to train the model (TPC-P).
(2) We added an additional encoder to TPC and used 114K sentence pairs with the preceding sentences and 37K sentence pairs without context to train the model (TPC-PF).
(3) We added an additional encoder to TPC. There are a total of 134K sentence pairs with the following sentences in our dataset. To better compare the impact of the position of the added contextual information, we randomly selected 114K sentence pairs with the following sentences to train the model (TPC-F).
(4) We added an additional encoder to TPC and used 134K sentence pairs with the following sentences and 17K sentence pairs without context to train the model (TPC-FF).
(5) We added an additional encoder to TPC. We simply concatenated the preceding sentences and the following sentences and fed them into the additional encoder, which means the preceding sentences and the following sentences receive no extra weight (TPC-S).

The results of the ablation experiments are shown in Table 5. From the results, we can see that adding context increases the SARI score by nearly five points compared to TPC, which means that context is helpful for simplification. In particular, context helps immensely with the deleting and adding operations, which we believe is due to the additional information provided by the context. In experiments 2 and 4, when the sentence pairs without context information are added for training, there is a slight increase in the SARI score and a more significant decrease in the FKGL value, indicating that training with more sentence pairs can improve the readability of the output sentences. The results of experiment 5 show that simply concatenating the preceding and following sentences does not improve simplification very much, and is even less effective than using the full dataset with preceding or following sentences. This also demonstrates the necessity of treating the preceding and following sentences separately and assigning them different weights.

Conclusion
In this paper, we propose a new model named SUC, which makes full use of context information for text simplification. The results of the automatic and human evaluation show that our model outperforms the baselines, which proves that context information is helpful for sentence simplification. In the ablation study, we show that sentence pairs with either preceding or following sentences can significantly improve simplification.