Controllable Text Simplification with Lexical Constraint Loss

We propose a method to control the level of a sentence in a text simplification task. Text simplification is a monolingual translation task that translates a complex sentence into a simpler, easier-to-understand alternative. In this study, we use the grade level of the US education system as the level of the sentence. Our text simplification method succeeds in translating an input into a specific grade level by considering the levels of both sentences and words. The sentence level is considered by adding the target grade level as input, whereas the word level is considered by adding weights to the training loss based on words that frequently appear in sentences of the desired grade level. Although existing models that consider only the sentence level may control the syntactic complexity, they tend to generate words beyond the target level. Our approach can control both the lexical and syntactic complexity and achieves aggressive rewriting. Experimental results indicate that the proposed method improves both BLEU and SARI.


Introduction
Text simplification (Shardlow, 2014) is the task of rewriting a complex text into a simpler form while preserving its meaning. Its applications include reading comprehension assistance and language education support. Because each target user has different reading abilities and/or knowledge, we need a text simplification system that translates an input sentence into a sentence of an appropriate difficulty level for each user. According to the input hypothesis (Krashen, 1985), educational materials slightly beyond the learner's level effectively improve their reading abilities. On the contrary, materials that are too difficult for learners reduce their motivation to learn. In the context of language education, teachers manually simplify sentences for each learner. To reduce the burden on teachers, automatic text simplification systems are desired (Petersen and Ostendorf, 2007).
Table 1: Example sentences at different grade levels.
Grade 12: According to the Pentagon , 152 female troops have been killed while serving in Iraq and Afghanistan .
Grade 7: The Pentagon says 152 female troops have been killed while serving in Iraq and Afghanistan .
Grade 5: The military says 152 female have died .
As mentioned, text simplification translates a complex sentence into a simpler alternative. The transformation allows entailment and omission/replacement of phrases and words. Table 1 shows sentences at different grade levels. The sentence level depends on both the syntactic and lexical complexities. When simplifying a sentence of grade level 12 into grade level 7, paraphrasing "According to ∼ ," to "∼ says" reduces the syntactic complexity. In addition, when simplifying the sentence from grade level 12 to 5, paraphrasing "Pentagon" to "military" reduces the lexical complexity. Assuming an application to language education, we aim at automatically rewriting the input sentence to match the level of difficulty appropriate for each grade level, as shown in Table 1.
Many previous studies (Specia, 2010; Wubben et al., 2012; Xu et al., 2016; Nisioi et al., 2017; Zhang and Lapata, 2017; Vu et al., 2018; Guo et al., 2018; Zhao et al., 2018) in text simplification have trained machine translators on a monolingual parallel corpus consisting of complex-simple sentence pairs without considering the level of each sentence. Therefore, these text simplification models are unaware of the sentence level. Scarton and Specia (2018) developed a pioneering text simplification model that can control the sentence level. They trained a text simplification model on a parallel corpus (Xu et al., 2015) by attaching tags specifying 11 grade levels to each sentence. The trained model can generate a sentence of a desired level specified by a tag attached to the input. This model may control syntactic complexity such as sentence length; however, it often outputs overly difficult words beyond the target grade level. To control the lexical complexity in text simplification, we propose a method that builds on Scarton and Specia (2018) by adding weights to the training loss according to the levels of words, so that the model outputs only words that do not exceed the desired level.
Experimental results indicate that the proposed method improves the BLEU and SARI scores by 1.04 and 0.15, respectively, compared to Scarton and Specia (2018). Moreover, our detailed analysis indicates that our method controls both the lexical and syntactic complexities and promotes aggressive rewriting.

Text Simplification
Text simplification can be regarded as a monolingual machine translation problem.
Previous studies have trained a model to translate complex sentences into simpler sentences on parallel corpora between Wikipedia and Simple Wikipedia (W-SW) (Zhu et al., 2010;Coster and Kauchak, 2011).
As in the field of machine translation, early studies (Specia, 2010; Wubben et al., 2012; Xu et al., 2016) were mainly based on statistical machine translation (Koehn et al., 2007; Post et al., 2013). Inspired by the success of neural machine translation (Bahdanau et al., 2015), recent studies (Nisioi et al., 2017; Zhang and Lapata, 2017; Vu et al., 2018; Guo et al., 2018; Zhao et al., 2018) use an encoder-decoder model with an attention mechanism. These studies do not consider the level of each sentence.

Controllable Text Simplification
In addition to W-SW, Newsela (Xu et al., 2015) is a well-known dataset available for text simplification. Newsela is a parallel corpus with 11 grade levels. Scarton and Specia (2018) trained a level-controllable text simplification model on Newsela. Their model is a standard attentional encoder-decoder model similar to that of Nisioi et al. (2017), except that a special token <grade> indicating the grade level of the target sentence is attached to the beginning of the input sentence. This is a promising approach that has been successful in similar tasks (Johnson et al., 2017; Niu et al., 2018). As expected, in the task of text simplification, this approach has improved both BLEU (Papineni et al., 2002) and SARI (Xu et al., 2016) compared to a baseline model (Nisioi et al., 2017) that does not consider the target level at all. This model allows the syntactic complexity to be controlled; however, it tends to output overly difficult words beyond the target grade level.
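The grade-token idea can be illustrated with a minimal sketch. The function name `add_grade_token` and the concrete token spelling `<7>` are assumptions made for illustration; the source only states that a `<grade>` token is attached to the beginning of the input sentence.

```python
def add_grade_token(source: str, target_grade: int) -> str:
    """Prepend a target-grade token to a tokenized source sentence."""
    if not 2 <= target_grade <= 12:
        raise ValueError("Newsela grade levels range from 2 to 12")
    # Assumption: the token is spelled "<N>"; the actual spelling may differ.
    return f"<{target_grade}> {source}"


src = "According to the Pentagon , 152 female troops have been killed ."
tagged = add_grade_token(src, 7)
print(tagged)
```

At inference time, the same source sentence can be paired with different grade tokens to request outputs at different levels.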

Loss Function with Word Level
To control the lexical complexity, our model weights the training loss of a text simplification model by considering words that frequently appear in the sentences of a specific grade level, as shown in Figure 1. Here, the weight f(w, l) corresponds to the relevance of the word w at grade level l.
A sequence-to-sequence model commonly uses the cross-entropy loss. When a model outputs o = [o_1, ..., o_N] (where N is the size of the vocabulary) at a certain time step, the cross-entropy loss is as follows:
$$L(o, y) = -y \cdot \log o = -\log o_c \tag{1}$$
where y = [y_1, ..., y_N] is a one-hot vector in which only the c-th element, corresponding to the correct word, is 1 and all others are 0. Our model adds weights to the loss function (Equation 1) based on the level of words such that the model learns to output words of the desired level:
$$L'(o, y, w_c, l) = -f(w_c, l) \cdot \log o_c \tag{2}$$
where w_c is the correct word and l is the target grade level. As f(·, ·), we use TFIDF or PPMI, assuming that words that frequently appear in sentences of level l also have level l.
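As a rough illustration of the weighted loss, the sketch below assumes that the weight f(w, l) of the correct word simply multiplies the per-token cross-entropy. The function names and the weight value 2.5 are hypothetical and chosen only for the example.

```python
import math


def cross_entropy(probs, correct_idx):
    """Plain cross-entropy for a one-hot target: -log o_c."""
    return -math.log(probs[correct_idx])


def weighted_cross_entropy(probs, correct_idx, weight):
    """Cross-entropy scaled by the weight f(w_c, l) of the correct word."""
    return weight * cross_entropy(probs, correct_idx)


# Toy output distribution over a 4-word vocabulary (already normalised).
probs = [0.1, 0.6, 0.2, 0.1]
correct_idx = 1
plain = cross_entropy(probs, correct_idx)
# Hypothetical weight for a word that is typical of the target grade.
weighted = weighted_cross_entropy(probs, correct_idx, 2.5)
print(plain, weighted)
```

Under this reading, errors on words that are characteristic of the target grade are penalised more heavily, which pushes the model toward producing them.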
TFIDF We compute TFIDF by regarding the sentences of the same level as a document:
$$\mathrm{TFIDF}(w, l) = P(w \mid l) \cdot \log \frac{D}{\mathrm{DF}(w)} \tag{3}$$
where P(w | l) is the probability that word w appears in the set of sentences of grade level l, D is the number of grade levels (footnote 2), and DF(w) is the number of grade levels in which w appears. In this way, TFIDF gives more weight to words that appear uniquely in the sentences of a specific level.
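A minimal sketch of the grade-level TFIDF follows. It assumes that P(w | l) is estimated as the relative token frequency of w among the sentences of level l and that sentences are whitespace-tokenized; the function name `tfidf_by_grade` and the toy data are illustrative only.

```python
import math
from collections import Counter


def tfidf_by_grade(sentences_by_grade):
    """Return {(word, grade): P(word | grade) * log(D / DF(word))}."""
    num_grades = len(sentences_by_grade)  # D
    counts = {g: Counter(w for s in sents for w in s.split())
              for g, sents in sentences_by_grade.items()}
    # DF(w): number of grade levels whose sentences contain w.
    df = Counter(w for c in counts.values() for w in c)
    scores = {}
    for g, c in counts.items():
        total = sum(c.values())
        for w, n in c.items():
            scores[(w, g)] = (n / total) * math.log(num_grades / df[w])
    return scores


toy = {
    12: ["the pentagon says troops died"],
    5: ["the military says troops died"],
}
scores = tfidf_by_grade(toy)
print(scores[("pentagon", 12)], scores[("the", 12)])
```

In the toy data, "the" occurs at every level and therefore receives a score of zero, whereas "pentagon" occurs at only one level and receives a positive score.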
PPMI Pointwise mutual information (PMI) estimates the strength of the co-occurrence between w and l:
$$\mathrm{PMI}(w, l) = \log \frac{P(w \mid l)}{P(w)} \tag{4}$$
where P(w) is the probability of word w in the entire training corpus, and P(w | l) is the same as in Equation 3. Words with negative PMI scores are negatively correlated with l, which means that w tends to appear across different sentence levels. Hence, we ignore w with a negative PMI using a positive-PMI (PPMI) function:
$$\mathrm{PPMI}(w, l) = \max(\mathrm{PMI}(w, l), 0) \tag{5}$$
Both TFIDF and PPMI have a range of [0, ∞), and thus we apply Laplace smoothing:
$$f(w, l) = \mathrm{Func}(w, l) + 1, \quad \mathrm{Func} \in \{\mathrm{PPMI}, \mathrm{TFIDF}\} \tag{6}$$

Experiment

Dataset
We evaluated whether our method can control the grade levels in text simplification using the Newsela corpus. The Newsela corpus provides news articles of different levels, which have been manually rewritten by human experts. It conforms to the grade levels in the US education system, where the levels range from 2 to 12.
Footnote 2: Here, D = 11 because we use grade levels 2 to 12.
We use the publicly available version of the Newsela corpus that has been sentence-aligned by Xu et al. (2015) and divided into 94k, 1k, and 1k sentences for training, development, and test, respectively, by Zhang and Lapata (2017). As in previous studies, we regard each sentence in an article as sharing the same level as the entire article. Zhang and Lapata (2017) first divided the set of articles and then extracted sentence pairs to avoid the same sentences appearing in both the training and test sets. Note that the Newsela corpus used by Scarton and Specia (2018) is different from the present corpus and is preprocessed differently. Due to these differences, the training, development, and test sets used by Scarton and Specia (2018) cannot be reproduced. Therefore, we reimplemented the model of Scarton and Specia (2018) and compared it to our method using our public corpus. Table 2 shows statistics for the Newsela corpus, which clearly show that lower-grade sentences are significantly shorter than those of higher grades. This indicates that aggressive omission of phrases is required to simplify sentences of grades 8 to 12 into those of grades 2 to 7.
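The idea of dividing articles first and extracting sentence pairs afterwards can be sketched as follows. This is not the actual procedure of Zhang and Lapata (2017); the tuple layout, the split fractions, and the function name `split_by_article` are assumptions for illustration.

```python
import random


def split_by_article(pairs, dev_frac=0.1, test_frac=0.1, seed=0):
    """Split (article_id, complex, simple) pairs so no article spans two sets."""
    article_ids = sorted({a for a, _, _ in pairs})
    random.Random(seed).shuffle(article_ids)
    n_dev = max(1, int(len(article_ids) * dev_frac))
    n_test = max(1, int(len(article_ids) * test_frac))
    dev_ids = set(article_ids[:n_dev])
    test_ids = set(article_ids[n_dev:n_dev + n_test])
    splits = {"train": [], "dev": [], "test": []}
    for pair in pairs:
        if pair[0] in dev_ids:
            splits["dev"].append(pair)
        elif pair[0] in test_ids:
            splits["test"].append(pair)
        else:
            splits["train"].append(pair)
    return splits


# Hypothetical data: 10 articles with 5 sentence pairs each.
pairs = [(f"article{i % 10}", f"complex {i}", f"simple {i}") for i in range(50)]
splits = split_by_article(pairs)
print({name: len(items) for name, items in splits.items()})
```

Because the assignment is made per article, every sentence pair from a given article ends up in exactly one of the three sets.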

Methods for Comparison
During this experiment, the following four methods were compared.
1. s2s is a baseline, plain sequence-to-sequence model based on the attention mechanism.
2. s2s+grade is our re-implementation of Scarton and Specia (2018), which is a state-of-the-art controllable text simplification model.
3. s2s+grade+TFIDF is our model (Sec. 3) implemented on s2s+grade, which adds TFIDF-based word weighting to the loss function. TFIDF scores were pre-computed using the training data.
4. s2s+grade+PPMI is our other model (Sec. 3) implemented on s2s+grade, which adds PPMI-based word weighting to the loss function. PPMI scores were pre-computed using the training data.

Implementation Details
In this study, we implemented our model using Marian (Junczys-Dowmunt et al., 2018). Both the encoder and decoder consist of 2 layers of Bi-LSTM with 1,024-dimensional hidden layers and a 512-dimensional embedding layer shared by the encoder and decoder, including its output layer. Word embeddings were randomly initialized. A dropout rate of 0.2 was applied to the hidden layers, and a dropout rate of 0.1 was applied to the embedding layer. Adam was used as the optimizer. Training was stopped when the perplexity measured on the development set had not improved for 8 epochs. All scores reported in this experiment are the averages of 3 trials with random initialization.
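The stopping criterion can be sketched in a few lines. This is a generic patience-based check and not Marian's implementation; the function name `epochs_until_stop` and the perplexity values are hypothetical.

```python
def epochs_until_stop(dev_perplexities, patience=8):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    stale = 0  # number of consecutive epochs without improvement
    for epoch, ppl in enumerate(dev_perplexities, start=1):
        if ppl < best:
            best, stale = ppl, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None


# Hypothetical development-set perplexities, one value per epoch.
history = [30.0, 25.0, 22.0] + [22.5] * 8
print(epochs_until_stop(history))
```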
In addition, we report BLEU_ST, the mean absolute error of sentence length (MAE_LEN), and the mean PMI (MPMI) for a detailed analysis. BLEU_ST is a BLEU score computed between the source and output sentences, which allows evaluating the degree of rewriting performed by a model. The lower BLEU_ST is, the more actively the model rewrites the source sentence.
In addition, MAE_LEN approximately evaluates the syntactic complexity of the output based on its length:
$$\mathrm{MAE}_{\mathrm{LEN}} = \frac{1}{N} \sum_{i=1}^{N} \left| \mathrm{Len}(\text{reference}_i) - \mathrm{Len}(\text{output}_i) \right|$$
where N is the number of sentences in the test set, and Len(·) provides the number of words in a sentence. The lower the MAE_LEN is, the more appropriate the length of the output.
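A minimal sketch of the length metric follows, assuming that output lengths are compared against the reference lengths and that words are counted by whitespace tokenization. The function name `mae_len` is illustrative.

```python
def mae_len(references, outputs):
    """Mean absolute difference in word count between references and outputs."""
    if len(references) != len(outputs) or not references:
        raise ValueError("need two non-empty lists of equal length")
    diffs = [abs(len(r.split()) - len(o.split()))
             for r, o in zip(references, outputs)]
    return sum(diffs) / len(diffs)


refs = ["The military says 152 female have died .", "He left ."]
outs = ["The Pentagon says 152 female troops have died .", "He left ."]
print(mae_len(refs, outs))
```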
MPMI evaluates to what extent the levels of the output words match the target level:
$$\mathrm{MPMI} = \frac{1}{W} \sum_{s} \sum_{w \in s} \mathrm{PMI}(w, l_s)$$
where s ranges over the output sentences, W is the number of words appearing in the output, and l_s is the grade level of sentence s. PMI scores were pre-computed using the training data.
The higher the MPMI is, the more words of the target level are generated by the model.
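The following sketch shows one way to pre-compute PMI from level-labelled training sentences and then average it over the output words. It assumes relative token frequencies as probability estimates and assigns a PMI of 0 to (word, level) pairs unseen in training; both choices, as well as the function names, are assumptions rather than details given in the text.

```python
import math
from collections import Counter


def pmi_table(train_by_grade):
    """PMI(w, l) = log(P(w | l) / P(w)), estimated from token counts."""
    per_grade = {g: Counter(w for s in sents for w in s.split())
                 for g, sents in train_by_grade.items()}
    overall = Counter()
    for c in per_grade.values():
        overall.update(c)
    n_all = sum(overall.values())
    table = {}
    for g, c in per_grade.items():
        n_g = sum(c.values())
        for w, n in c.items():
            table[(w, g)] = math.log((n / n_g) / (overall[w] / n_all))
    return table


def mpmi(outputs, grades, table):
    """Average PMI(w, l_s) over every word of every output sentence s."""
    total, n_words = 0.0, 0
    for sent, grade in zip(outputs, grades):
        for w in sent.split():
            # Assumption: pairs unseen in training contribute 0.
            total += table.get((w, grade), 0.0)
            n_words += 1
    return total / n_words if n_words else 0.0


train = {12: ["the pentagon says"], 5: ["the military says"]}
table = pmi_table(train)
print(mpmi(["the military"], [5], table))
```

In the toy data, "the" is equally likely at both levels and contributes nothing, so the score is driven entirely by "military", which appears only at level 5.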

Table 4: Examples of model outputs.
Grade 12, Source: In its original incarnation during the ' 60s , African-American " freedom songs " aimed to motivate protesters to march into harm 's way and , on a broader scale , spread news of the struggle to a mainstream audience .
Grade 7, s2s+grade: In the 1960s , African-American " freedom songs are aimed to motivate protesters to march into harm 's way .
Grade 7, s2s+grade+PPMI: In its original people in the 1960s , African-American " freedom songs are aimed to inspire protesters to march into harm 's way .
Grade 4, s2s+grade: In the 1960s , African-American " freedom songs are aimed to motivate protesters to march into harm 's way .
Grade 4, s2s+grade+PPMI: African-American " freedom songs are aimed to inspire protesters to march into harm 's way .
Table 3 shows the experimental results. The first two rows show the performance when the source sentence itself or the reference sentence is regarded as the model output, which sets the standard for interpreting the scores. Our method outperforms the state-of-the-art baseline in both the BLEU and SARI metrics. In particular, s2s+grade+PPMI improved the BLEU and SARI scores by 1.04 and 0.15, respectively, compared to s2s+grade.
An evaluation in BLEU_ST shows that our proposed models perform aggressive rewriting. In addition, s2s+grade+PPMI, which has the best scores in both the BLEU and BLEU_ST metrics, makes many appropriate rewrites that are far from the source and close to the reference. The s2s baseline, which does not consider the target level, applies conservative rewriting, whereas the proposed model, which considers it more properly, rewrites more aggressively.
The evaluations of MAE_LEN and MPMI show that s2s+grade+PPMI can best control both the syntactic and lexical complexities. From these results, we confirmed the effectiveness of the text simplification model that takes the word level into account. Table 4 shows examples of the model outputs. Here, s2s+grade+PPMI paraphrases the complex word "incarnation" into "people" for grade level 7. In addition, the complex word "motivate" is simplified to "inspire" for grade level 4. Although both models can remove the unimportant phrase "and , on ∼", s2s+grade+PPMI successfully produced a shorter sentence for grade level 4.

Analysis for Each Grade Level
To analyze the level control in detail, we simplified each source sentence in the test set to all simpler grade levels. This analysis does not allow a reference-based evaluation such as BLEU because references are given only for some levels of each source sentence. Table 5 shows FKGL (Kincaid et al., 1975) and MPMI for each target grade level for s2s+grade (prev.) and s2s+grade+PPMI (prop.). FKGL is an automatic evaluation metric that estimates textual readability. The FKGL scores correspond to the K-12 grade levels.
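For reference, the Flesch-Kincaid Grade Level is commonly stated as below. The counts refer to the text being scored, and how syllables are counted depends on the tool used, which is not specified here.

```latex
\mathrm{FKGL} = 0.39 \cdot \frac{\#\text{words}}{\#\text{sentences}}
              + 11.8 \cdot \frac{\#\text{syllables}}{\#\text{words}}
              - 15.59
```

Longer sentences and longer words both raise the score, so the metric reflects syntactic as well as lexical complexity.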
An analysis of FKGL revealed that both models oversimplify. However, the MAE against the target grade level shows that the proposed model is superior to the baseline model. Focusing on the FKGL differences, the proposed model generates simpler sentences for the simpler target grade levels than the baseline model, and vice versa. These results show that incorporating word levels into the model contributes to level control in text simplification.
In the evaluation of MPMI, the proposed method consistently outperforms the state-of-the-art baseline at all target levels. As expected, we confirmed that the proposed method for weighting the cross-entropy losses based on PPMI encourages the use of words suitable for the target grade level.

Conclusion
We proposed a text simplification method that controls not only the sentence level but also the word level. Our method controls the word level by weighting, in the loss function, words that frequently appear in text of a specific grade level. The evaluation results confirmed that our method improved both the BLEU and SARI scores and achieved more aggressive rewriting than Scarton and Specia (2018). A detailed analysis indicated that our method achieved accurate control of the level when converting sentences into those of the target level.
In this study, we regard a document and the sentences contained within it as having the same grade level, as in previous studies. In practice, however, this assumption may not hold. Although readability and level have been studied at the document level (Kincaid et al., 1975) and the phrase level (Maddela and Xu, 2018), there have been no previous works focusing on the level of sentences. This direction is an area of our future work.