Recursive Context-Aware Lexical Simplification

This paper presents a novel architecture for recursive context-aware lexical simplification, REC-LS, that is capable of (1) making use of the wider context when detecting the words in need of simplification and suggesting alternatives, and (2) taking previous simplification steps into account. We show that our system outputs lexical simplifications that are grammatically correct and semantically appropriate, and outperforms the current state-of-the-art systems in lexical simplification.


Introduction
Text simplification (TS) is aimed at reducing the reading and grammatical complexity of text while retaining the meaning and grammaticality (Chandrasekar and Bangalore, 1997). This is usually achieved by a series of transformations at the lexical and syntactic level. A number of systems in recent years have approached this task in an integral manner (Zhu et al., 2010; Kauchak, 2013; Zhang and Lapata, 2017). Such comprehensive systems can perform a number of simplification operations at once, but the results are sometimes ungrammatical and the meaning can be changed, arguably making the original text less clear and more complex (Siddharthan, 2014).
In this paper, we assume the lexical and syntactic components of a TS system to be independent and complementary to each other, and focus on the lexical simplification (LS) component for a number of reasons. First, it has been shown that lexical simplification techniques positively impact the readability of text and improve reader understanding and information retention (Leroy et al., 2012). Secondly, it has been argued that a large number of people with reading difficulties, including those with disabilities, low literacy, non-native backgrounds or non-expert knowledge, benefit from LS (Xu et al., 2015). For instance, James (1998) shows that the vocabulary of the non-native language plays a central role in second language acquisition. As we aim to make information more accessible to such readers, the quality of the simplified text is of paramount importance.
We stress the importance of three key points of quality assessment in LS: it is vital for the simplified text to be of a lower complexity, while being semantically equivalent to the original, and grammatically correct. Context plays a central role in fulfilling these requirements. For instance, consider the different uses of situation in the complex word dataset collected by Yimam et al. (2017):
(1) the gravity of the economic situation
(2) the situation has remained unchanged
This example demonstrates two types of contextual effects: first of all, in (1) both economic and situation are marked as complex in mutual context, but when situation occurs in a different context it is not annotated as complex. Secondly, this example illustrates the impact of the context on the choice of the appropriate substitution: substituting climate for situation will work for (1), but will result in a semantically different expression in (2).
In addition, we argue that as word complexity depends on context, the order and choice of applied simplifications matter. For instance, consider the following simplifications for the sentence 'This is a problem in the contemporary world':
(3) This is a problem in the modern earth
(4) This is a problem in the modern world
Example (3) shows an output of a system that tries to simplify all words, which results in a nonsensical sentence, while (4) exemplifies the result of recursive word replacement in context. Context effects in LS have not been thoroughly investigated before. In this paper, we introduce a novel approach to LS that addresses these issues. In particular, we make the following contributions:
1. As each simplification step changes the complexity of the output sentence, our LS algorithm applies simplification recursively, taking word complexity in context into account;
2. To ensure that the output is grammatical and equivalent in meaning to the input, our algorithm takes context into account at both the complex word identification and substitution selection steps. We use a novel sequence labelling component for the former step, and assess semantic equivalence at the latter using deep contextualized word representations provided by ELMo (Peters et al., 2018).
3. To facilitate reproducibility, we release the code and the output of our system at github.com/siangooding/lexical_simplification.

Previous work

Approaches
Early approaches to TS have mostly relied on rule-based systems (Carroll et al., 1998; Canning et al., 2000; Siddharthan, 2006), with many of the earlier systems prioritising syntactic operations, such as sentence splitting, deletion or reordering. Some work combined lexical simplification with syntactic operations (Zhu et al., 2010; Coster and Kauchak, 2011a; Kauchak, 2013). The availability of parallel corpora of "normal" and simplified text has inspired a number of approaches that treated TS as a monolingual machine translation problem (Zhu et al., 2010; Coster and Kauchak, 2011a,b) or allowed the researchers to apply language modelling (Kauchak, 2013).
Building on this line of work, Zhang and Lapata (2017) combine a novel sequence-to-sequence encoder-decoder model with a deep reinforcement learning framework that rewards the system for providing simple, fluent output, similar in meaning to the input. However, one of the main challenges for such comprehensive end-to-end systems is the ability to address specific types of errors independently. For instance, the DRESS-LS model (Zhang and Lapata, 2017) sometimes changes the meaning of the input to the opposite, as in "Inspections, she said, rarely cost more than $ 1,400" → "Inspections, she said, often cost more than $ 1,400", or produces nonsensical output, as in "Archaeologists digging on the grounds" → "Archaeologists digging on the zebras".
A number of approaches focused on generation and assessment of lexical simplification (Yatskar et al., 2010; Biran et al., 2011; Horn et al., 2014; Glavaš and Štajner, 2015). Paetzold and Specia (2016a) note the lack of consistent evaluation for text simplification, and in particular lexical simplification, and introduce an evaluation dataset BENCHLS, on which they perform benchmarking of several LS systems. They argue that lexical simplification consists of a number of steps, including substitution generation, substitution selection and ranking. They show that the unsupervised system of Paetzold and Specia (2016b) outperforms a range of other systems in all steps, with the only exception of the feature-based system by Horn et al. (2014), which performs better in terms of precision in a round-trip system evaluation.
Finally, Shardlow (2013) introduced complex word identification (CWI) as the first step in an LS pipeline to detect words within text that require simplification. He showed that systems' performance on this stage is crucial for the overall performance, as low recall of this component might result in an overly difficult text with many missed complex words, while low precision might result in meaning distortions. The recent shared task on CWI shows that most systems rely on classification approaches using features that pertain to individual words, not taking wider context into account (Yimam et al., 2018).

Datasets
Until recently, parallel Wikipedia and Simple Wikipedia datasets have been the most widely used data for training and evaluating TS systems (Zhu et al., 2010; Yatskar et al., 2010; Coster and Kauchak, 2011a,b; Biran et al., 2011; Kauchak, 2013; Horn et al., 2014). Wikipedia allows free access to large quantities of data, and Simple Wikipedia represents a simplified version of original articles that uses simpler vocabulary and syntactic structures (Coster and Kauchak, 2011a). Researchers in the past applied monolingual alignment techniques to construct parallel versions of the two Wikipedias and learn the transformations from these parallel versions using such tools as GIZA++ (Och and Ney, 2000).
Despite the popularity of the Wikipedia-based datasets for TS research, Xu et al. (2015) argue that focusing on Wikipedia limits simplification research and propose using a dataset based on news articles. They use a dataset from Newsela as an example, where the texts are simplified by professional editors at 4 levels of simplicity in accordance with the grade levels defined by the Common Core Standards (Porter et al., 2011).
In their benchmarking study, Paetzold and Specia (2016a) introduce an evaluation dataset BENCHLS that combines two previously released datasets for TS: LexMTurk (Horn et al., 2014) and LSeval (De Belder and Moens, 2012). This dataset contains 929 instances with an original sentence, a target complex word, and several candidate substitutions ranked by English speakers from the U.S. according to their simplicity. Paetzold and Specia (2016a) additionally filter out misspelled candidates and inflect all candidates to the grammatical form of the target word. The dataset contains an average of 7.37 candidate substitutions per complex word.
Finally, the CEFR-LS dataset (Uchida et al., 2018) contains simplifications, which not only represent a semantically good fit and are grammatically correct, but are also at different levels of simplicity annotated with respect to nonnative speakers. Simpler candidates at lower levels of language proficiency (A1-B1) according to the Common European Framework of Reference for Languages (CEFR) (Council of Europe, 2011) are provided for the original words that are at higher levels of language proficiency (B2-C2). The dataset contains 406 target words and 4912 possible substitutions. Unlike BENCHLS, substitute candidates within this dataset may not be correct for the given context, and if a substitute candidate is not appropriate in context it is labelled as such. The dataset contains an average of 2.35 candidate substitutions per complex word.

Data
In this study we implement an LS system that includes complex word identification, substitute generation, filtering and ranking as steps in a simplification pipeline. To train our CWI system, we use the CWI 2018 shared task dataset (Yimam et al., 2017), which contains texts in three genres: professionally written NEWS, amateurishly written WIKINEWS, and WIKIPEDIA articles. The words in the dataset are annotated as complex or not by 10 native and 10 non-native speakers of English.
We evaluate each step of the LS pipeline on two datasets: BENCHLS and CEFR-LS. We select the BENCHLS dataset because it contains multiple simplification alternatives ranked with respect to their simplicity by a number of human annotators. The CEFR-LS dataset is a useful resource for evaluation because, in addition to contextually suitable, grammatically correct, simpler alternatives, it also contains substitution candidates that do not fit the context. Furthermore, this dataset is aimed at non-native speakers of English, which we view as the future target group for our LS system. For further details on dataset collection and annotation, we refer the readers to the original papers.
Finally, we evaluate our LS system in an end-to-end manner and compare its performance to that of the current state-of-the-art systems, including those reported in Paetzold and Specia (2016a) and the DRESS-LS system of Zhang and Lapata (2017). In contrast to Zhang and Lapata (2017), we perform lexical simplification only. For fair comparison of the two systems, we extract only lexical simplifications from the parallel "normal" to simplified versions of the data used in Zhang and Lapata (2017), as well as from the original "normal" text and the DRESS-LS system output.
To extract the lexical transformations from the data, we use GIZA++ (Och and Ney, 2000) similarly to previous research (Coster and Kauchak, 2011b; Xu et al., 2015). Following Horn et al. (2014), we extract the examples that constitute one-to-one word correspondences between the two sides identified by the automatically induced word alignment, where the part-of-speech tag of the two words is the same while the lemmas are different. In addition, we filter out the instances involving modification of stopwords on the original, "normal" side as well as rewrites involving proper nouns.
All preprocessing steps for our algorithm are performed using the RASP parser (Briscoe et al., 2006). The first stage of the algorithm is complex word identification (CWI), which aims to identify which words should be simplified within the text and, thus, allows for a personalisable and targeted approach to LS. Furthermore, it helps to reduce the number of unnecessary and potentially "harmful" simplifications performed by a 'simplify all words' approach. To train and test our CWI component we use the dataset by Yimam et al. (2017).
Word complexity depends on the surrounding context: for instance, between 3% and 10% of lexical items (depending on genre) in the Yimam et al. (2017) dataset receive different annotations in different contexts. However, most CWI systems to date have approached this task on an individual word basis (Yimam et al., 2018). For instance, the 2018 CWI shared task winning system CAMB uses a classification-based approach with 27 word-level features (Gooding and Kochmar, 2018). As context impacts the perceived complexity of text, we argue that CWI should be framed as a sequence labelling task and propose a novel architecture SEQ based on this idea (Gooding and Kochmar, 2019). We extend the implementation of a sequence labeller by Rei (2017), which achieves state-of-the-art results on a number of NLP tasks. The design of this architecture is highly suitable for CWI as: (1) it uses bi-directional long short-term memory units (BiLSTM) (Hochreiter and Schmidhuber, 1997), which allow the system to learn about both the left and right context of a target word; (2) the context is combined with both word and character-level representations (Rei et al., 2016), which helps capture complexity due to rare character sequences as well as morphological structure; (3) this architecture uses a language modelling objective, which enables the model to take word frequency, a highly informative complexity factor, into account.
The model takes word embeddings as input (Pennington et al., 2014). It is trained to predict the binary complexity of words as annotated in the dataset of Yimam et al. (2017). Training is performed over 20 iterations on randomly shuffled sentences from all genres included within the dataset. To test this novel architecture on CWI, we apply it to the CWI 2018 shared task test data (Yimam et al., 2018) and compare the results to the current state-of-the-art (SOTA) CAMB system.
Table 1 shows that the SEQ model outperforms the current SOTA system on all three text genres for binary CWI (statistically significant using McNemar's test, p=0.0016). The proposed SEQ model (Gooding and Kochmar, 2019) has a number of additional advantages: it takes context into account, avoids the need for extensive feature engineering by relying on word embeddings as the only input information at run time, and generalises well across all three datasets. To further assess generalisability of the model, we test it on CEFR-LS, as well as BENCHLS for consistency (see Table 2). However, as the words in the BENCHLS dataset were selected at random rather than according to their complexity (Paetzold and Specia, 2016a), the results are, as expected, lower.

Lexical Complexity Threshold
The SEQ system labels each word with a lexical complexity score. This score represents the likelihood (p) of each word belonging to the complex class. Whether a word is considered as complex is set according to a predefined threshold for p, which allows the 'aggressiveness' of the algorithm to be tailored according to the application. For instance, consider the following example, where the value of p is given in parentheses after each scored word:
(5) I believe that ignoring public opinion discredits (0.89) the authorities (0.79) and destabilises (0.84) the situation.
The SEQ labeller assigns complexity probability p>0.80 to the words discredits and destabilises, whereas for authorities p=0.79. If the complexity threshold in the simplification pipeline is set to 0.80, the word authorities would not be a candidate for lexical simplification. Our simplification pipeline will attempt to simplify words one at a time, starting with the highest p values above this predefined threshold.
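The thresholding and ordering described above can be sketched as follows. This is a minimal illustration rather than the released implementation: the function name `words_to_simplify` and the list-of-pairs input format are assumptions, and the scores are the three values reported in example (5).

```python
from typing import List, Tuple


def words_to_simplify(scored_words: List[Tuple[str, float]],
                      threshold: float = 0.80) -> List[Tuple[str, float]]:
    """Keep words whose complexity probability p exceeds the threshold,
    ordered from most to least complex (hypothetical helper)."""
    above = [(word, p) for word, p in scored_words if p > threshold]
    return sorted(above, key=lambda pair: pair[1], reverse=True)


# Scores from example (5); only the three scored words are listed.
example_5 = [("discredits", 0.89), ("authorities", 0.79), ("destabilises", 0.84)]
queue = words_to_simplify(example_5, threshold=0.80)
print(queue)
```

With a threshold of 0.80, the queue holds discredits first and destabilises second, while authorities (p=0.79) is left untouched.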

Substitute Generation
Substitute generation refers to the process of generating candidates that can be used as simpler alternatives to the target word. The goal of the system at this stage is to generate a diverse set of potential substitutions that can be filtered and ranked according to different criteria at later steps. We note that the use of multiple resources for substitute generation at this step is crucial as each individual resource has only a limited coverage. To this end, we first use traditional approaches, which do not take the context of the target word or the simplicity of substitutes into account. In summary, we rely on the following resources:
• we extract synonyms from WordNet (Fellbaum, 2012), following Devlin (1998)

Substitution Filtering and Ranking
Following substitute generation, the next step in the LS pipeline is to choose one of the generated candidates to replace the original target word. This is done by filtering and ranking the candidates using a set of criteria, and choosing the top candidate. Previous work has framed this task as ranking words according to simplicity (Paetzold and Specia, 2016a). However, a system that is only aimed at selecting the simplest candidate may return a substitution that does not fit the surrounding context, is ungrammatical or not semantically equivalent to the original. Therefore, we consider all three aspects of high-quality substitution selection: contextual simplicity, semantic equivalence and grammaticality, which are outlined below.
Contextual Simplicity: The complexity of each candidate is calculated using our sequential CWI model from Section 4.1. We calculate the complexity for each substitution within context and refer to this as the simplicity score S. Table 3 shows 2 possible substitutions for engulfed, with their respective simplicity scores.
Contextual Semantic Equivalence: To assess whether a substitution is semantically equivalent to the original we use ELMo embeddings (Peters et al., 2018), which to the best of our knowledge have not been used for LS before. ELMo provides deep contextualized word representations, which are learned from the internal states of a deep bidirectional language model. As a result, these embeddings are able to model complex syntactic and semantic characteristics of word use, as well as how these word uses vary across linguistic contexts. In addition, contextualised embeddings help filter out antonyms that might be included in the set of potential substitutes by the previous step.

O: Oak is strong and also gives shade
g.c.: Oak gives shade
Table 5: Original sentence O and the grammatical context (g.c.) used to calculate C_G for the word gives
The similarity of words in a given sentential context is calculated using the ELMo embeddings associated with the target word and each of the substitutes. The contextual similarity score (C_S) is calculated by taking the cosine distance between the ELMo word vectors associated with the original word, v_o, and the substitute, v_s. This allows us to reason about the likelihood that the substitution is semantically equivalent to the original in the given context: the smaller the distance, the more likely this word fits the context. Table 4 provides some examples for this score.
This technique works particularly well when the target word appears in the immediate context of the words grammatically related to it, for example when a verb is surrounded by its subject and object. To counteract the effect of long range dependencies in sentences, we constrain the context to the grammatical dependents and calculate a second contextual score C_G using cosine distance. Table 5 provides an example of a full sentence O and the grammatical context for the word gives.
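As an illustration of the measure underlying both contextual scores, the sketch below computes cosine distance, 1 - cos(u, v), between two vectors in plain Python. The function name and the three-dimensional vectors are invented for illustration; they are not ELMo output, and in practice the vectors for the original word and the substitute would be taken from the full sentence (for C_S) or from the grammatical context (for C_G).

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Toy vectors (assumed values): v_o for the original word, two candidate substitutes.
v_o = [0.2, 0.7, 0.1]
v_close = [0.25, 0.65, 0.05]
v_far = [0.9, -0.1, 0.3]

print(cosine_distance(v_o, v_close))  # small distance: similar direction
print(cosine_distance(v_o, v_far))    # larger distance: dissimilar direction
```

Cosine distance ignores vector magnitude, so two vectors pointing in the same direction have distance 0 regardless of their length.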
Grammaticality: To assess whether a substitution results in a grammatically correct sentence, we calculate bigram frequencies of the candidate substitute and one word to the left and to the right of it using the 520 million word COCA corpus (Davies, 2014). If either of these frequencies equals 0, it is assumed that the substitute is not a valid grammatical fit or is extremely rare, making it a poor candidate for simplification.  Here, we use the bigram frequencies as a proxy for grammaticality: although "able" and "capable" are semantically similar, it is the grammatical constraints, captured by the bigram frequencies, that rule one of the alternatives out.
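A minimal sketch of the bigram check is given below. The bigram counts are invented toy values rather than COCA frequencies, the function name is hypothetical, and treating a missing neighbour at a sentence boundary as passing is an assumption made for this example.

```python
from typing import Dict, List, Tuple

BigramCounts = Dict[Tuple[str, str], int]


def passes_bigram_check(tokens: List[str], index: int, candidate: str,
                        counts: BigramCounts) -> bool:
    """True if the candidate forms attested bigrams with its left and right neighbours."""
    cand = candidate.lower()
    left_ok = True   # assumption: no left neighbour means nothing to check
    right_ok = True  # assumption: no right neighbour means nothing to check
    if index > 0:
        left_ok = counts.get((tokens[index - 1].lower(), cand), 0) > 0
    if index < len(tokens) - 1:
        right_ok = counts.get((cand, tokens[index + 1].lower()), 0) > 0
    return left_ok and right_ok


# Toy counts (invented): "capable to" is unattested, so "capable" is ruled out.
toy_counts = {("is", "able"): 120, ("able", "to"): 300, ("is", "capable"): 40}
sentence = ["She", "is", "able", "to", "swim"]

print(passes_bigram_check(sentence, 2, "able", toy_counts))
print(passes_bigram_check(sentence, 2, "capable", toy_counts))
```

In this toy setting the left bigram of capable is attested but the right one is not, so the candidate fails the check even though it is semantically close to able.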

Threshold-based filtering
Threshold-based filtering is performed by removing all substitutes that are unlikely to be grammatical or do not fit the target context. First, substitutes are removed from consideration if their right or left bigram frequency equals 0. Then, we remove substitutes if either of their contextual ELMo scores is below a given threshold t (C_S < t ∨ C_G < t), as this implies the substitute is not equivalent in this context. We test our filtering approach on the CEFR-LS dataset as it contains annotations for contextually suitable (value 1) as well as unsuitable (value 0) substitutions. We empirically find the optimal threshold value on the CEFR-LS dataset to be 0.175. With this optimal value, we can identify contextually suitable and grammatically correct substitutes on CEFR-LS with a precision of 0.7968, a recall of 0.8081 and F_1 = 0.8014. As we show in Section 5, this filtering technique generalises well to other datasets.
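The two filtering stages can be sketched as below. The `Candidate` record, its field names and all numeric values are assumptions made for illustration. The comparison follows the wording of the rule literally (a candidate is dropped when either contextual score is below t); if the scores are read as distances, where smaller is better, the comparison would need to be inverted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    word: str
    left_bigram: int   # frequency of (left neighbour, candidate)
    right_bigram: int  # frequency of (candidate, right neighbour)
    c_s: float         # contextual score from the full sentence
    c_g: float         # contextual score from the grammatical context


def threshold_filter(candidates: List[Candidate], t: float = 0.175) -> List[Candidate]:
    """Apply the bigram filter first, then the contextual-score filter."""
    kept = []
    for cand in candidates:
        if cand.left_bigram == 0 or cand.right_bigram == 0:
            continue  # unattested bigram: likely ungrammatical or extremely rare
        if cand.c_s < t or cand.c_g < t:
            continue  # either contextual score falls below the threshold
        kept.append(cand)
    return kept


# Invented toy candidates, not taken from CEFR-LS.
toy = [
    Candidate("rich", 50, 80, 0.40, 0.35),
    Candidate("loaded", 0, 12, 0.50, 0.45),    # fails the bigram check
    Candidate("flush", 7, 9, 0.10, 0.30),      # fails the contextual check
]
print([c.word for c in threshold_filter(toy)])
```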

Ranking Algorithm
Once filtering has been performed we then rank substitutes. In order to rank substitutes, the simplicity and contextual semantic equivalence scores are combined to produce an overall suitability score. We evaluate our ranking techniques on the BENCHLS and CEFR-LS datasets.
We perform ranking using the sum of the contextual simplicity score (S) and the average contextual semantic equivalence score C = avg(C_S, C_G), and evaluate the results using the TRank-at-n measure introduced in Paetzold and Specia (2016a), which estimates the proportion of times a candidate with a gold-standard rank r ≤ n is ranked first by the system. For instance, our best performing ranking technique on the full BENCHLS dataset for n=1 is based on the combination of S+C scores and achieves TRank-at-n=0.5602. This means that for approximately 56% of the test instances the top ranked substitute in the gold standard is correctly ranked first.
In addition, we report the mean reciprocal rank (MRR) (Voorhees, 1999), which takes into account the rank of the substitutes proposed by each ranking technique. We report the results on the full BENCHLS dataset in the upper half of Table 7. In the lower half, we compare our results on the test set of 464 instances to those obtained by running the Paetzold and Specia (2016a) system (P&S) on the same test splits. Since the P&S system was trained on half of BENCHLS, we cannot run it on the full dataset. Table 7 shows that ranking with S+C works best according to all measures and across both sets. Table 8 reports the ranking results on the CEFR-LS data. We observe a decrease in performance; however, this is expected: as all substitutes within the BENCHLS dataset are valid, ranking by simplicity is more informative; in contrast, the CEFR-LS dataset contains irrelevant substitutes, so contextual fit has more pronounced effects.
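The two evaluation measures can be sketched as follows. The data layout is an assumption: each gold standard is represented as a list ordered from simplest to least simple, without ties, and MRR is computed here with respect to the single best gold substitute, which is one common reading of the measure rather than a detail confirmed by the text. The toy instances are invented.

```python
from typing import List, Sequence, Tuple


def trank_at_n(instances: Sequence[Tuple[str, List[str]]], n: int) -> float:
    """Proportion of instances whose system-top substitute has gold rank <= n.

    Each instance is (system's top substitute, gold substitutes ordered best first).
    """
    if not instances:
        raise ValueError("no instances to evaluate")
    hits = sum(1 for top, gold in instances if top in gold[:n])
    return hits / len(instances)


def mean_reciprocal_rank(instances: Sequence[Tuple[List[str], str]]) -> float:
    """Average of 1/rank of the best gold substitute in each system ranking.

    Each instance is (system ranking, best gold substitute); a missing item scores 0.
    """
    if not instances:
        raise ValueError("no instances to evaluate")
    total = 0.0
    for ranking, gold_best in instances:
        if gold_best in ranking:
            total += 1.0 / (ranking.index(gold_best) + 1)
    return total / len(instances)


# Invented toy data.
top_vs_gold = [("help", ["help", "aid"]), ("assist", ["help", "aid", "assist"])]
ranking_vs_best = [(["aid", "help"], "help"), (["help", "aid"], "help")]

print(trank_at_n(top_vs_gold, n=1))
print(trank_at_n(top_vs_gold, n=3))
print(mean_reciprocal_rank(ranking_vs_best))
```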

Recursive Step
Following CWI, substitute generation, filtering and substitute ranking steps, the system is able to perform a simplification. As outlined in Section 4.1.1, our system attempts to simplify one word at a time, starting with the word considered most complex. Since word complexity depends on the context, each individual lexical simplification made to a sentence has a subsequent impact on the perceived complexity of the surrounding words. Such sequential effects cannot be modelled by systems that apply several simplification steps at once. For instance, when considering examples (6) and (7), where the complexity score is given in parentheses after each scored word, we see that the word situation is given a high likelihood of being complex in the context of another complex word hazardous. However, once dangerous is substituted for hazardous, the subsequent complexity of situation is reduced as well.
(6) It was a hazardous (0.90) situation (0.87)
(7) It was a dangerous (0.30) situation (0.35)
Our lexical simplification algorithm is applied to a sentence recursively: it starts by identifying and simplifying the most complex word in the sentence. Once the simplification is applied in step n, the algorithm reassesses word complexity in step n+1, which, in light of the simplifications applied in previous steps, might have changed. This prevents the unnecessary simplification of situation in example (7) once hazardous has been replaced with dangerous. The simplification algorithm stops when no words with a complexity score above the predefined threshold, set to 0.5, are left within the sentence.
Table 9 exemplifies the benefit of using the recursive simplification approach REC-LS. We see that both the 'Simplify CW', aimed at individually identified complex words, and 'Simplify all' approaches result in unnecessary simplifications, whilst REC-LS stops when it recognises that the surrounding words are no longer complex.
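The recursive step can be sketched as below. This is an illustration under stated assumptions, not the released code: the complexity scorer and the substitute picker are passed in as stand-in callables, the ignore list is kept over token positions, and `max_steps` is an added safeguard against endless recursion. The toy scorer reproduces the scores of examples (6) and (7); the toy substitute for situation is invented to show that it is never applied.

```python
from typing import Callable, List, Optional, Set

Scorer = Callable[[List[str]], List[float]]         # complexity p for each token
Picker = Callable[[List[str], int], Optional[str]]  # best substitute or None


def rec_ls(tokens: List[str], score: Scorer, pick: Picker,
           threshold: float = 0.5, ignore: Optional[Set[int]] = None,
           max_steps: int = 50) -> List[str]:
    """Simplify the most complex word, then re-score the new sentence and recurse."""
    ignore = set() if ignore is None else ignore
    if max_steps == 0:
        return tokens
    scores = score(tokens)
    complex_words = [(p, i) for i, p in enumerate(scores)
                     if p > threshold and i not in ignore]
    if not complex_words:
        return tokens  # nothing above the threshold: stop
    _, idx = max(complex_words)
    substitute = pick(tokens, idx)
    if substitute is None or substitute == tokens[idx]:
        ignore.add(idx)  # retain the original word and do not revisit it
        return rec_ls(tokens, score, pick, threshold, ignore, max_steps - 1)
    new_tokens = tokens[:idx] + [substitute] + tokens[idx + 1:]
    return rec_ls(new_tokens, score, pick, threshold, ignore, max_steps - 1)


def toy_score(tokens: List[str]) -> List[float]:
    # Scores from examples (6) and (7): situation is complex only next to hazardous.
    situation_p = 0.87 if "hazardous" in tokens else 0.35
    table = {"hazardous": 0.90, "dangerous": 0.30, "situation": situation_p}
    return [table.get(tok, 0.0) for tok in tokens]


def toy_pick(tokens: List[str], idx: int) -> Optional[str]:
    # Invented substitutes for illustration only.
    return {"hazardous": "dangerous", "situation": "position"}.get(tokens[idx])


result = rec_ls("It was a hazardous situation".split(), toy_score, toy_pick)
print(" ".join(result))
```

Because the sentence is re-scored after hazardous is replaced, situation falls below the threshold and is left unchanged, whereas a single-pass approach would have replaced both words.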
We note that the only consequence of performing simplification recursively is that fewer words, or potentially different words, are simplified, and highlight that it does not lead to any error propagation. Table 10 presents the entire REC-LS algorithm including the recursive simplification step.

The CWI step of our recursive LS algorithm identifies 77% of the target complex words in the BENCHLS and 86% of the target complex words in the CEFR-LS dataset. Next, the substitution generation and ranking produce lists of simplification candidates. We evaluate these steps using precision and measuring the percentage of times that our system ranks one of the gold standard candidates as its top choice for substitution.

End-to-end System Performance
The best precision P =0.7945 on BENCHLS is achieved with threshold-based filtering of the unsuitable candidates and ranking of the candidates according to the combination of contextual simplicity (S) and contextual fit (C) scores. We note that this result outperforms the previous SOTA system by Horn et al. (2014), which is reported to have P =0.5460 (Paetzold and Specia, 2016a), by a large margin.
With a similar approach, we achieve P =0.4628 on the CEFR-LS, and to the best of our knowledge, this is the first time an LS system is benchmarked on this dataset. We note that the precision on this dataset is lower, which can be attributed to the smaller set of gold standard substitutes per word (an average of 2.35 in CEFR-LS vs 7.37 in BENCHLS). We also note that in a number of cases our system generates valid substitutes, which are not included in the gold standard. For example, wealthy → rich in "could participate in government just as wealthy men could". We present more examples of such cases in the Appendix.

Table 10 (fragment): steps of the REC-LS algorithm
Rank substitutes according to grammaticality, contextual semantic equivalence and simplicity
(11) Take the top substitute
(13) If no appropriate substitutes found, or best fit is original word, then retain original word and add word to ignore list
(15) Otherwise replace word and call function with new sentence
We also run our entire recursive simplification system REC-LS on the three simplification datasets used in Zhang and Lapata (2017): WikiSmall, WikiLarge and Newsela. We then compare the results to the SOTA simplification system DRESS-LS by Zhang and Lapata (2017), which contains a specialised LS model, and the P&S simplification system (Paetzold and Specia, 2016a). As DRESS-LS is trained using the above datasets, we test using the 649 sentences that are reserved for testing only. Simplifications are assessed using the gold standard for lexical substitutions. For each dataset, the number of lexical simplifications performed by the systems is recorded. We then compare the simplifications performed by the system with the gold standard simplifications and calculate the recall, precision and correct proportion as follows:
• For recall, we estimate the number of simplifications present in the gold standard, |G|, and the number of simplifications produced by the system that are also in the gold standard, |S∩G|. Recall is then calculated as |S∩G|/|G|.
• For precision, we identify the proportion of simplifications out of the total |S|, which are also in the gold standard: |S∩G|/|S|.
• Finally, correct stands for the proportion of instances where the top lexical substitution returned by the system is exactly the same as the gold standard one.
The results show that the recall for REC-LS is higher than that of DRESS-LS across all datasets, while the P&S system performs the best in terms of recall as it applies a 'simplify all' approach. The difference is especially pronounced in the WikiSmall dataset, where the DRESS-LS system does not perform any of the required lexical substitutions. REC-LS outperforms both DRESS-LS and P&S systems across all datasets in terms of precision, which indicates that the CWI stage of the algorithm is able to narrow down simplifications to relevant words.
Finally, the three systems show different results in terms of correct proportion: REC-LS outperforms other systems on Newsela and WikiSmall, and DRESS-LS has a higher correct proportion on WikiLarge.
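The recall and precision definitions used in this evaluation, |S∩G|/|G| and |S∩G|/|S|, can be sketched with set operations as below. Representing each simplification as a (sentence id, original word, substitute) tuple is an assumption made for this example, as are the function name and the toy data; returning 0.0 for an empty set is likewise a convention chosen here.

```python
from typing import Set, Tuple

Simplification = Tuple[int, str, str]  # (sentence id, original word, substitute)


def precision_recall(system: Set[Simplification],
                     gold: Set[Simplification]) -> Tuple[float, float]:
    """Return (precision, recall) = (|S∩G|/|S|, |S∩G|/|G|)."""
    overlap = len(system & gold)
    precision = overlap / len(system) if system else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall


# Invented toy data: one of three system simplifications appears in a gold set of four.
S = {(1, "contemporary", "modern"), (2, "hazardous", "dangerous"), (3, "world", "earth")}
G = {(1, "contemporary", "modern"), (2, "hazardous", "risky"),
     (4, "purchase", "buy"), (5, "commence", "start")}

p, r = precision_recall(S, G)
print(f"precision={p:.3f} recall={r:.3f}")
```

A 'simplify all' system enlarges S, which can only raise |S∩G| and hence recall, while every additional simplification outside the gold standard lowers precision.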
We also note that there are many instances where the REC-LS system performs substitutions with valid alternatives that are not contained within the gold standard. For instance, consider the word "separated" in the following context: The island chain forms part of the Hebrides, separated from the Scottish mainland.