Learning Simplifications for Specific Target Audiences

Text simplification (TS) is a monolingual text-to-text transformation task where an original (complex) text is transformed into a target (simpler) text. Most recent work is based on sequence-to-sequence neural models similar to those used for machine translation (MT). Unlike in MT, TS data comprises more elaborate transformations, such as sentence splitting. It can also contain multiple simplifications of the same original text targeting different audiences, such as school grade levels. We explore these two features of TS to build models tailored for specific grade levels. Our approach uses a standard sequence-to-sequence architecture where the original sequence is annotated with information about the target audience and/or the (predicted) type of simplification operation. We show that it outperforms state-of-the-art TS approaches (up to 3 and 12 BLEU and SARI points, respectively), including when training data for the specific complex-simple combination of grade levels is not available, i.e. zero-shot learning.


Introduction
Text simplification (TS) is the task of modifying an original text into a simpler version of it. One of the main parameters for defining a suitable simplification is the target audience. Examples include the elderly, children, cognitively impaired users, non-native speakers and low-literacy readers.
Traditionally, work on TS has been divided into lexical simplification (LS) and syntactic simplification (SS). LS (Paetzold, 2016) deals with the identification and replacement of complex words or phrases. SS (Siddharthan, 2011) performs structural transformations such as changing a sentence from passive to active voice. However, most recent approaches learn transformations from corpora, addressing simplification at the lexical and syntactic levels together. These include either learning tree-based transformations (Woodsend and Lapata, 2011; Paetzold and Specia, 2013) or using machine translation (MT)-based techniques (Zhu et al., 2010; Coster and Kauchak, 2011a; Wubben et al., 2012; Narayan and Gardent, 2014; Nisioi et al., 2017; Zhang and Lapata, 2017). This paper uses the latter type of technique, which treats TS as a monolingual MT task, where an original text is "translated" into its simplified version.
In order to build MT-based models, a parallel corpus of original texts with their simplified counterparts is needed. For English, two main such corpora are available: Wikipedia-Simple Wikipedia (W-SW) (Zhu et al., 2010) and the Newsela Article Corpus. The former is a collection of original Wikipedia articles and their simplified versions created by volunteers. The latter consists of news articles professionally simplified for various specific audiences following the US school grade system. To build simplification models, the pairs of articles in these corpora have been aligned at the level of smaller units using standard algorithms (Coster and Kauchak, 2011b; Paetzold and Specia, 2016; Štajner et al., 2017). Based on the number of sentences involved in these alignments, one can categorise alignments into four types of coarse-grained simplification operations:
• Identical: an original sentence is aligned to itself, i.e. no simplification is performed.
• Elaboration: an original sentence is aligned to a single, rewritten simplified sentence.
• One-to-many (splitting): an original sentence is aligned to 2+ simplified sentences.
• Many-to-one (joining): 2+ original sentences are aligned to a single simplified sentence.
We hereafter refer to the unit of simplification, i.e. one or more original or simplified sentences, as instances.
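The four operation types can be derived from the sentence counts on each side of an alignment. The following is a minimal sketch of that rule, not the authors' implementation: the function name `operation_type` is hypothetical, and treating "identical" as exact string equality after whitespace stripping is an assumption.

```python
def operation_type(original_sentences, simplified_sentences):
    """Return the coarse-grained operation for one aligned instance.

    Both arguments are lists of sentence strings (hypothetical input format).
    """
    n_orig = len(original_sentences)
    n_simp = len(simplified_sentences)
    if n_orig == 1 and n_simp == 1:
        # Assumption: "identical" means the two sentences match exactly.
        same = original_sentences[0].strip() == simplified_sentences[0].strip()
        return "identical" if same else "elaboration"
    if n_orig == 1 and n_simp >= 2:
        return "one-to-many"  # splitting
    if n_orig >= 2 and n_simp == 1:
        return "many-to-one"  # joining
    raise ValueError("alignment does not match any of the four operation types")
```

For example, an instance whose original side has one sentence and whose simplified side has two would be labelled `one-to-many`.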
The Newsela corpus is seen as having higher quality than W-SW because its simplifications are created by professionals, following well-defined guidelines (Xu et al., 2015). It is also larger, which is preferable for training corpus-based models. More interestingly, the Newsela corpus has a feature that has been ignored thus far: each instance in the corpus was created for readers with a certain school grade level. Each original article has a label indicating its corresponding grade level (from 12 to 2), and may have various simplified versions, each for a different grade level. For example, a level 12 article may have simplified counterparts for levels 8 and 4. In other words, the corpus contains instances where the same input leads to different outputs. Disregarding this factor may lead to suboptimal models. To avoid this problem, previous work (Alva-Manchego et al., 2017; Zhang and Lapata, 2017; Scarton et al., 2018b) has used subsets of the corpus with only certain combinations of complex-simplified article pairs, e.g. adjacent or non-adjacent pairs. This, however, reduces the amount of data available for training.
We propose a way of making use of this information to build more informed TS models that are aware of different types of target audiences, while still making use of the full dataset for learning. Inspired by the work of Johnson et al. (2017) for MT, we add to each original instance an artificial token that represents the target grade level of that instance in order to guide a sequence-to-sequence attentional encoder-decoder neural approach (Bahdanau et al., 2015) (§2). In a similar vein, we also annotate the coarse-grained type of operation that should be performed to simplify the original instance, under the hypothesis that certain operations are more often used to simplify into certain grade levels. Deciding on the operation is an easier problem than performing the actual operation. We rely on both gold and predicted operation types.
Models built with these artificial tokens outperform state-of-the-art neural models for TS, with the best approach combining grade level and type of operation (§3). Interestingly, such an approach also enables zero-shot TS, where a simplification for a grade level pair unseen at training time can still be generated during testing. We show that our zero-shot learning models perform virtually as well as our grade/operation-informed models (§4). To the best of our knowledge, this is the first work to build TS models for specific target audiences and to explore zero-shot learning for this application.

System architecture
Our approach follows that of Johnson et al. (2017), a multilingual MT approach that adds an artificial token encoding the target language to the beginning of each source sentence in the parallel corpus. With this modified version of the corpus, a single encoder-decoder architecture is used to deal with different language pairs. Based on the tokens, the source sentences are encoded differently according to the target language they have been paired with in the corpus. Such an approach enables zero-shot MT, where a model is able to provide translations for language pairs it has not seen at training time.
We apply three types of data manipulation, where artificial tokens are added to the beginning of the original side of both training and test instances:
• to-grade: the token corresponds to the grade level of the target instance,
• operation: the token is one of the four possible coarse-grained operations that transforms the original into the simplified instance,
• to-grade-operation: concatenation of the two above tokens.
Unlike the grade level, which is available at test time simply by knowing the intended reader of the text, information about the operations to be performed, which we extracted from the parallel corpus, will not be available at test time. We use gold labels extracted from the parallel corpus for an oracle experiment but also use a classifier that predicts the operations for the test set based on those in the training data. We built a simple Naive Bayes classifier using the scikit-learn toolkit (Pedregosa et al., 2011) and nine features (Scarton et al., 2017):
• number of tokens / punctuation / content words / clauses,
• ratio of the number of verbs / nouns / adjectives / adverbs / connectives to the number of content words.
Table 1 shows examples of the tokens used when an original instance is marked to be simplified to grade level 4 or grade level 2.

to-grade
to grade level 4: <4> dusty handprints stood out against the rust of the fence near Sasabe.
to grade level 2: <2> dusty handprints stood out against the rust of the fence near Sasabe.
operation
to grade level 4: <identical> dusty handprints stood out against the rust of the fence near Sasabe.
to grade level 2: <elaboration> dusty handprints stood out against the rust of the fence near Sasabe.
to-grade-operation
to grade level 4: <4-identical> dusty handprints stood out against the rust of the fence near Sasabe.
to grade level 2: <2-elaboration> dusty handprints stood out against the rust of the fence near Sasabe.
reference
to grade level 4: dusty handprints stood out against the rust of the fence near Sasabe.
to grade level 2: dusty handprints could be seen on the fence near Sasabe.

For level 2 the reference is a rewrite and, therefore, the operation token is <elaboration>.
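The three kinds of artificial token can be produced by a small pre-processing step over the original side of the corpus. The sketch below is illustrative only: the helper name `add_artificial_token` and its parameters are hypothetical, and the token format simply mirrors the examples in Table 1.

```python
def add_artificial_token(original, to_grade=None, operation=None):
    """Prefix an original instance with a <to-grade>, <operation> or
    <to-grade-operation> token, following the format shown in Table 1."""
    parts = [str(p) for p in (to_grade, operation) if p is not None]
    if not parts:
        return original  # no token: plain s2s input
    return "<" + "-".join(parts) + "> " + original


sentence = "dusty handprints stood out against the rust of the fence near Sasabe."
tagged_grade = add_artificial_token(sentence, to_grade=2)
tagged_operation = add_artificial_token(sentence, operation="elaboration")
tagged_both = add_artificial_token(sentence, to_grade=2, operation="elaboration")
```

Because only the source side changes, the same encoder-decoder can be trained on the tagged corpus without any architectural modification.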
We use OpenNMT as our encoder-decoder architecture. Both encoder and decoder have two LSTM layers, hidden states of size 500 and dropout = 0.3. Global attention combined with input-feeding is used, as described in Luong et al. (2015). One model is trained for 13 epochs on each dataset constructed with a different type of artificial token. The best model is selected according to perplexity on the development set. Figure 1 shows the architecture of the neural network, including attention and input-feeding. In this example, <token> represents the artificial token added to the pre-processed data.

We evaluate our models with BLEU (Papineni et al., 2002) (a proxy for grammaticality assessment), SARI (Xu et al., 2016) (a proxy for simplicity assessment) and Flesch Reading Ease (a proxy for readability assessment). According to Xu et al. (2016), BLEU shows high correlation with human scores for grammaticality and meaning preservation, whilst SARI shows high correlation with human scores for simplicity. Although previous work has also relied on human judgements of grammaticality, meaning preservation and simplicity, in our case such an evaluation is infeasible: we would need to involve judges with specific grade levels or rely on professionals who are experts in grade level-specific simplification to make such assessments.
As a baseline, we trained a model using OpenNMT and the same hyperparameters as described in §2 on the entire Newsela corpus but without artificial tokens (s2s model). The state-of-the-art model is represented by NTS, which was also trained on the entire corpus using a similar OpenNMT architecture with the same hyperparameters but additional pre-trained word embeddings as described in Nisioi et al. (2017) (see footnote 6). As shown in Table 2, the NTS system performs slightly worse than the baseline system according to BLEU and SARI. Although concatenating global and local embeddings led to improvements for the W-SW corpus in Nisioi et al. (2017), this does not seem to be the case for the Newsela corpus. Our models outperform both the baseline and NTS systems by a large margin. Examples of outputs from all systems can be found in the Supplementary Material. The best model is the one built with the <to-grade+operation> token with gold operation annotations (last row). The second best system uses the gold <operation> token only. Therefore, knowing the operation type to be performed for a given instance provides valuable information. Even though the models with predicted operations ('pred' in Table 2) still outperform the baseline, they lag behind their counterparts built using gold operations. The main reason for that is the very simplistic classifier we used (average accuracy = 0.51, calculated using 10-fold cross-validation). In summary, s2s+to-grade is the best-performing model in a real-world scenario, given the low performance of the 'pred' systems. A more informed classifier should lead to better results, but this is left for future work; our goal was to show the potential of this information.
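For readers who want to reproduce the kind of operation classifier discussed above, the following sketch shows one way to train a Naive Bayes model and estimate its accuracy with 10-fold cross-validation in scikit-learn. It rests on several assumptions: the paper does not say which Naive Bayes variant was used, so `GaussianNB` is a guess; the feature matrix is random placeholder data standing in for the nine features; and the label names are illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

OPERATIONS = ["identical", "elaboration", "one-to-many", "many-to-one"]

# Placeholder data (assumption): 200 instances with 9 numeric features each.
# In practice each row would hold the token/punctuation/content-word/clause
# counts and the five part-of-speech ratios for one original instance.
rng = np.random.default_rng(0)
X = rng.random((200, 9))
y = rng.integers(0, len(OPERATIONS), size=200)

clf = GaussianNB()  # assumption: the exact Naive Bayes variant is not stated

# Average accuracy over 10 cross-validation folds.
scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
mean_accuracy = scores.mean()

# Fit on all training data, then predict operation labels for new instances.
clf.fit(X, y)
predicted = [OPERATIONS[i] for i in clf.predict(X[:3])]
```

The predicted labels would then replace the gold operation in the artificial token at test time. With random placeholder features, the accuracy here carries no meaning.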
The improvements in SARI are substantial: 7 points over the baseline even with the predicted operations. However, SARI aims to measure simplicity in general (not for specific grade levels). Since human evaluation of the targeted simplification performed by our models is not feasible, we can only approximate the usefulness of our models by using readability metrics such as the Flesch-Kincaid Grade Level. This metric maps a text into a US grade level, which is the same grading provided in the Newsela corpus and, therefore, relevant for our study. Table 3 shows the Flesch-Kincaid results for the test set divided into the appropriate grade levels considering the outputs of s2s, s2s+to-grade and s2s+to-grade+operation (gold) models. Simplifications generated by s2s+to-grade and s2s+to-grade+operation are scored consistently closer to the appropriate grade, which does not happen with s2s.

Table 3: Flesch-Kincaid scores for instances of each grade level simplified using s2s, s2s+to-grade and s2s+to-grade+operation (gold) models.

Footnote 6: Equivalent to their best performing "NTS-w2v" version.
The last row of Table 3 shows the Mean Absolute Error (MAE) considering the Flesch-Kincaid Grade Level scores for the system outputs as the hypothesis and the expected grade level as the gold scores.Our s2s+to-grade and s2s+to-grade+operation (gold) models show lower error scores than the baseline system, which supports our hypothesis that such models produce more adequate outputs for targeted grade levels.
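The error computation described above can be sketched as follows. The Flesch-Kincaid Grade Level uses the standard formula 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59; how the counts are obtained (tokenisation, syllable counting) is not specified in the text, so the function below takes them as inputs. Function names and the example values are hypothetical.

```python
def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Standard Flesch-Kincaid Grade Level from pre-computed counts."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59


def mean_absolute_error(predicted_grades, expected_grades):
    """MAE between Flesch-Kincaid scores of system outputs (hypothesis)
    and the grade levels the outputs were meant to target (gold)."""
    if not predicted_grades or len(predicted_grades) != len(expected_grades):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - e) for p, e in zip(predicted_grades, expected_grades))
    return total / len(predicted_grades)


# Illustrative values only, not results from the paper.
example_grade = flesch_kincaid_grade(n_words=100, n_sentences=10, n_syllables=150)
example_mae = mean_absolute_error([5.0, 3.0], [4, 2])
```

A lower MAE means the readability of the generated text sits closer to the requested grade level.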

Usefulness of the s2s+to-grade model
The main advantage of s2s+to-grade is that a user can specify their grade level and retrieve a personalised simplification. Table 4 shows an example with different simplifications for an out-of-domain instance from the SimPA corpus (Scarton et al., 2018a). The same instance was given as input to the s2s+to-grade model with different artificial tokens according to the grade level that we want to achieve. The s2s system (second row) repeats the original instance (first row). Conversely, our s2s+to-grade model is capable of distinguishing among different levels and produces personalised simplifications for each grade level.
original: We want to reassure you that we take fire safety very seriously and we are doing everything we can to make sure our residents are safe.
s2s: We want to reassure you that we take fire safety very seriously and we are doing everything we can to make sure our residents are safe.
<10>: We want to reassure you that we take fire safety very seriously and we are doing everything we can to make sure our residents are safe.
<9>: We want to reassure you that we take fire safety very seriously and we are doing everything we can to make sure our residents are safe.
<8>: We want to reassure you that we take fire safety very seriously and we are doing everything we can to make sure our residents are safe.
<7>: We want to reassure you that we take fire safety very seriously and we are doing everything we can to make sure our residents are safe.
<6>: We want to reassure you that we take fire safety very seriously. We are doing everything we can to make sure our residents are safe.
<5>: We want to reassure you that we take fire safety very seriously. We are doing everything we can to make sure our residents are safe.
<4>: We want to make sure we take fire safety very seriously. We are doing everything we can to make sure our people are safe.
<3>: We want to make sure people take fire safety very seriously. We are doing everything we can to make sure our people are safe.
<2>: We want to make sure people take fire safety very seriously. We are doing everything we can to make sure people are safe.

In Table 6, the s2s and s2s+to-grade models are the same as in Section 3, i.e. trained with the entire dataset without artificial tokens (s2s) or with artificial tokens (s2s+to-grade). The zero-shot models (s2s+to-grade+zs) are trained with <to-grade> data, but after removing instances of the grade level pair <ĝ_o, ĝ_t> under investigation, i.e. on a smaller dataset. For <12, 7> and <12, 4>, the zero-shot models outperform the baseline according to all metrics. In terms of SARI, for <12, 7> the zero-shot model is only marginally worse than the s2s+to-grade model. Conversely, s2s+to-grade+zs outperforms s2s+to-grade for <12, 4>, which is an impressive result. Finally, for <6, 5> all three models perform similarly. This may be explained by the proximity of ĝ_o and ĝ_t, which means that the original and target instances are very close to each other, so simplifications are minor and have little impact on the scores.

Conclusions
We have presented an approach for TS that benefits from corpora built for various target audiences and allows building better models than general-purpose ones. We have also shown that zero-shot learning is possible for TS, where training instances for the original-target audience pair do not exist. As future work we intend to investigate (i) better classifiers to predict operation types and (ii) multi-task learning as an alternative way of building a single TS model for various specific target audiences. We also plan to run experiments with the W-SW corpus and to use an improved classifier to train models with information on operations.

Table 1: Examples of artificial tokens used when an original instance is simplified to grade level 4 or grade level 2. Since the reference for grade level 4 is a copy of the original, the operation token for this case is <identical>.

Table 2: Results on the Newsela test set.

Table 4: Examples of s2s+to-grade outputs when an original instance is simplified into different levels.

To show that zero-shot TS is possible, we build models on training data without instances of a certain grade level pair and test them on instances of that grade level pair. Consider the grade level pair <g_o, g_t>, where g_o is the grade level of an original instance o, g_t is the grade level of a target instance t, and t is aligned to o. We test if our s2s+to-grade model can generalise to instances of <ĝ_o, ĝ_t> that have not been seen at training time. Due to space restrictions, we only show results for three representative grade level pairs. These pairs have a large enough number of training and test instances and cover levels that are closer to or further apart from each other. In addition, after removing them, the training corpus still has enough instances with ĝ_t as the target grade level. Instances of the target but not the original level (or of the target language in MT) must exist for zero-shot to be possible. The distribution of the selected grade level pairs is shown in Table 5.
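The zero-shot data setup described above amounts to filtering the training set by grade level pair. The sketch below illustrates this under stated assumptions: instances are represented as dictionaries with hypothetical keys `from_grade` and `to_grade`, and the function name `zero_shot_split` is not from the paper.

```python
def zero_shot_split(train_instances, test_instances, held_out_pair):
    """Remove one <original grade, target grade> pair from training and
    keep only that pair for testing.

    Each instance is a dict with (hypothetical) keys "from_grade" and
    "to_grade"; held_out_pair is a tuple such as (12, 7).
    """
    def pair(instance):
        return (instance["from_grade"], instance["to_grade"])

    train = [x for x in train_instances if pair(x) != held_out_pair]
    test = [x for x in test_instances if pair(x) == held_out_pair]

    # The target grade must still occur as a target in the remaining
    # training data, otherwise its token is never seen during training.
    if not any(x["to_grade"] == held_out_pair[1] for x in train):
        raise ValueError("target grade no longer appears in the training data")
    return train, test
```

For the pair (12, 7), training would keep instances such as 12 to 4 and 9 to 7, while testing uses only 12 to 7 instances.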

Table 6: Results of zero-shot experiments for TS.