Lexical Simplification with Neural Ranking

We present a new Lexical Simplification approach that exploits Neural Networks to learn substitutions from the Newsela corpus, a large set of professionally produced simplifications. We extract candidate substitutions by combining the Newsela corpus with a retrofitted context-aware word embeddings model and rank them using a new neural regression model that learns rankings from annotated data. This strategy leads to the highest Accuracy, Precision and F1 scores to date on standard datasets for the task.


Introduction
In Lexical Simplification (LS), words and expressions that challenge a target audience are replaced with simpler alternatives. Early lexical simplifiers (Carroll et al., 1998) combine WordNet (Fellbaum, 1998) and frequency information such as Kucera-Francis coefficients (Rudell, 1993). Modern simplifiers are more sophisticated, but most of them still adhere to the following pipeline: Complex Word Identification (CWI) to select words to simplify; Substitution Generation (SG) to produce candidate substitutions for each complex word; Substitution Selection (SS) to filter candidates that do not fit the context of the complex word; and Substitution Ranking (SR) to rank them according to their simplicity.
The most effective LS approaches exploit Machine Learning techniques. In CWI, ensembles that use large corpora and thesauri dominate the top 10 systems in the CWI task of SemEval 2016 (Paetzold and Specia, 2016d). In SG, Horn et al. (2014) extract candidates from a parallel Wikipedia and Simple Wikipedia corpus, yielding major improvements over previous approaches (Devlin, 1999; Biran et al., 2011). Glavaš and Štajner (2015) and Paetzold and Specia (2016f) employ word embedding models to generate candidates, leading to even better results.
In SR, the state-of-the-art performance is achieved by employing supervised approaches: SVMRank (Horn et al., 2014) and Boundary Ranking (Paetzold and Specia, 2015). Supervised approaches have the caveat of requiring annotated data, but as a consequence they can adapt to the needs of a specific target audience.
Recently, Xu et al. (2015) introduced the Newsela corpus, a new resource composed of thousands of news articles simplified by professionals. Their analysis reveals the potential use of this corpus in simplification, but thus far no simplifiers exist that exploit this resource. The scale of this corpus and the fact that it was created by professionals open new avenues for research, including using Neural Network approaches, which have proved promising for many related problems.
Neural Networks for supervised ranking have performed well in Information Retrieval (Burges et al., 2005), Medical Risk Evaluation (Caruana et al., 1995) and Summarization (Cao et al., 2015), among other tasks, which suggests that they could be an interesting approach to SR. In the context of LS, existing work has only exploited word embeddings as features for SG, SS and SR.
In this paper, we introduce an LS approach that uses the Newsela corpus for SG and employs a new regression model for Neural Ranking in SR that addresses the task in three steps: Regression, Ordering and Confidence Check.

Hybrid Substitution Generation
Our approach combines candidate substitutions from two sources: the Newsela corpus and retrofitted context-aware word embedding models.

SG via Parallel Data
The Newsela corpus 1 (version 2016-01-29.1) contains 1,911 news articles in their original form, as well as up to 5 versions simplified by trained professionals to different reading levels. It has a total of 10,787 documents, each with a unique article identifier and a version indicator between 0 and 5, where 0 refers to the article's original form, and 5 to its simplest version.
To employ the Newsela corpus in SG, we first produce sentence alignments for all pairs of versions of a given article. To do so, we use the paragraph and sentence alignment algorithms of Paetzold and Specia (2016g). They align paragraphs with sentences that have high TF-IDF similarity, concatenate aligned paragraphs, and finally align concatenated paragraphs at the sentence level using the TF-IDF similarity between them. Using this algorithm, we produce 550,644 sentence alignments.
We then tag sentences using the Stanford Tagger (Toutanova and Manning, 2000), produce word alignments using Meteor (Denkowski and Lavie, 2011), and extract candidates using a strategy similar to that of Horn et al. (2014). First we consider all aligned complex-to-simple word pairs as candidates. Then we filter them by discarding pairs which: do not share the same POS tag, have at least one non-content word, have at least one proper noun, or share the same stem. After filtering, we inflect all nouns, verbs, adjectives and adverbs to all possible variants. We then complement the candidate substitutions from the Newsela corpus using the following word embeddings model.
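The four filtering criteria above can be illustrated with a short sketch. It is not the implementation used in our experiments: the universal tag names, the function names and the crude suffix-stripping stemmer are assumptions made for illustration, and a real stemmer would take the place of naive_stem.

```python
# Illustrative sketch only: tag names, helper names and the stemmer are assumptions.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}  # assumed tags for content words
PROPER_NOUN_TAGS = {"PROPN"}                   # assumed tag for proper nouns


def naive_stem(word):
    """Crude stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def keep_pair(complex_word, complex_tag, simple_word, simple_tag, stem=naive_stem):
    """Return True if an aligned complex-to-simple pair survives all four filters."""
    if complex_tag != simple_tag:
        return False  # the words do not share the same POS tag
    if complex_tag in PROPER_NOUN_TAGS:
        return False  # at least one proper noun
    if complex_tag not in CONTENT_TAGS:
        return False  # at least one non-content word
    if stem(complex_word.lower()) == stem(simple_word.lower()):
        return False  # the words share the same stem
    return True


pairs = [
    ("purchase", "VERB", "buy", "VERB"),
    ("walked", "VERB", "walking", "VERB"),
    ("the", "DET", "a", "DET"),
]
kept = [p for p in pairs if keep_pair(*p)]
```

In this toy run only the pair (purchase, buy) is kept; the second pair shares a stem and the third is made of non-content words.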

SG via Context-aware Word Embeddings
Paetzold and Specia (2016f) present a state-of-the-art simplifier that generates candidates from a context-aware word embeddings model trained over a corpus composed of words concatenated with universal POS tags. We take this approach a step further by incorporating another enhancement: lexicon retrofitting. Faruqui et al. (2015) introduce an algorithm that allows for typical embeddings to be retrofitted over lexicon relations, such as synonymy, hypernymy, etc. To retrofit the context-aware models of Paetzold and Specia (2016f), we concatenate the words in WordNet (Fellbaum, 1998) with their universal POS tag, create a dictionary containing mappings between word-tag pairs and their synonyms, then use the algorithm described by Faruqui et al. (2015).
1 https://newsela.com/data
We train a continuous bag-of-words (CBOW) model (Mikolov et al., 2013b) of 1,300 dimensions with word2vec (Mikolov et al., 2013a) using a corpus of over 7 billion words that includes the SubIMDB corpus (Paetzold and Specia, 2016b), UMBC webbase 2 , News Crawl 3 , SUBTLEX (Brysbaert and New, 2009), Wikipedia and Simple Wikipedia (Kauchak, 2013). We retrofit the model over WordNet's synonym relations only. We choose this model training configuration because it has been shown to perform best for LS in a recent extensive benchmarking (Paetzold, 2016).
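As a rough illustration of retrofitting over synonym relations, the sketch below applies the iterative update of Faruqui et al. (2015), assuming their usual weighting in which each word's original vector and the average of its synonyms' vectors contribute equally. The "|||" separator between word and tag, the toy vectors and the number of iterations are assumptions made for the example only.

```python
def retrofit(vectors, lexicon, iterations=10):
    """Pull each vector towards the vectors of its lexicon neighbours.

    vectors: dict mapping "word|||TAG" -> list of floats (original embeddings)
    lexicon: dict mapping "word|||TAG" -> list of synonym keys
    Each update sets a word's vector to the mean of (a) its original vector
    and (b) the average of its neighbours' current vectors.
    """
    new_vectors = {key: list(vec) for key, vec in vectors.items()}
    for _ in range(iterations):
        for key, synonyms in lexicon.items():
            if key not in vectors:
                continue  # words without an embedding cannot be retrofitted
            neighbours = [s for s in synonyms if s in new_vectors]
            if not neighbours:
                continue  # no known synonyms: keep the vector as it is
            dim = len(vectors[key])
            mean = [
                sum(new_vectors[n][d] for n in neighbours) / len(neighbours)
                for d in range(dim)
            ]
            new_vectors[key] = [(mean[d] + vectors[key][d]) / 2.0 for d in range(dim)]
    return new_vectors


# Toy example with hypothetical two-dimensional embeddings.
toy_vectors = {
    "happy|||ADJ": [1.0, 0.0],
    "glad|||ADJ": [0.0, 1.0],
    "table|||NOUN": [0.5, 0.5],
}
toy_lexicon = {
    "happy|||ADJ": ["glad|||ADJ"],
    "glad|||ADJ": ["happy|||ADJ"],
}
retrofitted = retrofit(toy_vectors, toy_lexicon)
```

After retrofitting, the two synonyms move closer to each other, while words without lexicon entries are left unchanged.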
For each target word in the Newsela vocabulary we then generate as complementary candidate substitutions the three words in the model with the lowest cosine distances from the target word that have the same POS tag and are not a morphological variant. As demonstrated by Paetzold and Specia (2016a), in SG parallel corpora tend to yield higher Precision, but noticeably lower Recall than embedding models. We add only three candidates in order to increase Recall without compromising the high Precision from the Newsela corpus.
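The selection of complementary candidates can be sketched as a filtered nearest-neighbour search. The "|||" key format, the function names and the externally supplied set of morphological variants are assumptions made for illustration; the sketch also assumes non-zero vectors.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def embedding_candidates(target, tag, vectors, variants, k=3):
    """Return the k closest words with the same tag that are not variants.

    vectors:  dict mapping "word|||TAG" -> list of floats
    variants: set of morphological variants of the target (hypothetical input,
              e.g. produced by an inflection tool)
    """
    key = f"{target}|||{tag}"
    scored = []
    for other, vec in vectors.items():
        word, _, other_tag = other.partition("|||")
        if other == key or other_tag != tag or word in variants:
            continue
        scored.append((cosine_distance(vectors[key], vec), word))
    return [word for _, word in sorted(scored)[:k]]


toy = {
    "big|||ADJ": [1.0, 0.0],
    "large|||ADJ": [0.9, 0.1],
    "bigger|||ADJ": [1.0, 0.05],
    "huge|||ADJ": [0.8, 0.3],
    "size|||NOUN": [1.0, 0.01],
    "tiny|||ADJ": [-1.0, 0.0],
}
top_two = embedding_candidates("big", "ADJ", toy, {"bigger", "biggest"}, k=2)
```

Here "bigger" is skipped as a morphological variant and "size" because of its tag, so the two candidates returned are "large" and "huge".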

Unsupervised Substitution Selection
We pair our generator with the Unsupervised Boundary Ranking SS approach of Paetzold and Specia (2016f), which learns a supervised ranking model over data gathered in an unsupervised fashion. Candidates are ranked according to how well they fit the context of the target word, and a percentage of the worst-ranking candidates is discarded.
For training, the approach requires a set of complex words in context along with candidate substitutions for them. To produce this data, we generate candidates for the complex words in all 929 simplification instances of the BenchLS dataset (Paetzold and Specia, 2016a) using our SG approach. The selector assigns label 1 to the complex words and 0 to all candidates, then trains the model over this data. During SS, we discard the 50% of candidates with the worst rankings. We chose this proportion through experimentation. As features, we use the same ones described in (Paetzold and Specia, 2016f).
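The labelling scheme and the 50% cut-off can be sketched as follows. The function names are hypothetical, the scores are assumed to come from the trained boundary ranker (higher meaning a better fit to the context), and the rounding used for odd-sized candidate sets is an assumption.

```python
def build_ss_training_labels(target, candidates):
    """Label the original complex word 1 and every generated candidate 0."""
    return [(target, 1)] + [(candidate, 0) for candidate in candidates]


def select_candidates(scores, discard_ratio=0.5):
    """Keep the best-fitting candidates and discard the worst-ranked share.

    scores: dict mapping candidate -> fit score (assumed: higher is better)
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_discard = int(len(ranked) * discard_ratio)  # assumed to round down
    return ranked[: len(ranked) - n_discard]


labels = build_ss_training_labels("perched", ["sat", "rested"])
selected = select_candidates({"sat": 0.9, "ate": 0.1, "rested": 0.5, "sang": 0.3})
```

With four scored candidates and a ratio of 0.5, the two best-fitting ones ("sat" and "rested") are kept.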

Neural Substitution Ranking
Our approach performs three steps: Regression, Ordering and Confidence Check.

Regression
In this step, we employ a multi-layer perceptron to determine the ranking between candidate substitutions. The network (Figure 1) takes as input a set of features from two candidates, and produces a single value that represents how much simpler candidate 1 is than candidate 2. If the value is negative, candidate 1 is simpler than candidate 2; if it is positive, candidate 2 is simpler than candidate 1. Our network has three hidden layers with eight nodes each. For training we use the LexMTurk dataset (Horn et al., 2014), which contains 500 instances composed of a sentence, a target complex word and candidate substitutions ranked by simplicity. Let $c_1$ and $c_2$ be a pair of candidates from an instance, $r_1$ and $r_2$ their simplicity ranks, and $\Phi(c_i)$ a function that maps a candidate $c_i$ to a set of feature values. For each possible pair in each instance of the LexMTurk dataset we create two training instances: one with input $[\Phi(c_1), \Phi(c_2)]$ and reference output $r_1 - r_2$, and one with input $[\Phi(c_2), \Phi(c_1)]$ and reference output $r_2 - r_1$. We train our model for 500 epochs. We use the same n-gram probability features from SubIMDB used by Paetzold and Specia (2015). Hidden layers use the tanh activation function, and the output node uses a linear function, with Mean Absolute Error as the loss.
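The construction of training instances from a ranked set of candidates can be sketched as below. The feature function is a placeholder (word length) standing in for the n-gram probability features, and the function names are hypothetical.

```python
from itertools import combinations


def make_training_pairs(candidates, ranks, featurize):
    """Create two regression instances for every pair of ranked candidates.

    candidates: list of candidate words of one instance
    ranks:      simplicity ranks aligned with candidates (lower = simpler)
    featurize:  function mapping a candidate to a list of feature values
    """
    instances = []
    for (c1, r1), (c2, r2) in combinations(list(zip(candidates, ranks)), 2):
        # Input [phi(c1), phi(c2)] with reference output r1 - r2 ...
        instances.append((featurize(c1) + featurize(c2), r1 - r2))
        # ... and the mirrored input [phi(c2), phi(c1)] with output r2 - r1.
        instances.append((featurize(c2) + featurize(c1), r2 - r1))
    return instances


def toy_features(word):
    """Placeholder feature function (assumption): word length only."""
    return [float(len(word))]


pairs = make_training_pairs(["easy", "simple", "facile"], [1, 2, 3], toy_features)
```

A set of three candidates yields three unordered pairs and therefore six training instances; because each pair is mirrored, the reference outputs sum to zero.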

Ordering
Once the model is trained, we rank candidates by simplicity. Let $M(c_i, c_j)$ be the value estimated by our model for a pair of candidates $c_i$ and $c_j$ of a generated set $C$. During the ordering, we calculate the final score $R(c_i)$ of all candidates $c_i$ (Eq. 1).
Then, we simply rank all candidates based on R: the lower the score, the simpler a candidate is.
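The ordering step can be sketched as follows, under the assumption that the final score of a candidate is the sum of the model's estimates against every other candidate in the set, i.e. $R(c_i) = \sum_{c_j \in C,\, c_j \neq c_i} M(c_i, c_j)$. This form is inferred from the definitions above and should be read as an illustration; the function names and the toy model are hypothetical.

```python
def rank_candidates(candidates, model):
    """Order candidates from simplest to most complex.

    model(ci, cj) is assumed to return a negative value when ci is simpler
    than cj, so a lower summed score means a simpler candidate.
    """
    scores = {
        ci: sum(model(ci, cj) for cj in candidates if cj != ci)
        for ci in candidates
    }
    return sorted(candidates, key=scores.get)


def toy_model(ci, cj):
    """Stand-in for the trained network (assumption): shorter words are simpler."""
    return len(ci) - len(cj)


ordered = rank_candidates(["intricate", "hard", "tricky"], toy_model)
```

With the toy model the summed scores are 8 for "intricate", -7 for "hard" and -1 for "tricky", so "hard" is ranked first.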

Confidence Check
Once candidates are ranked, in order to increase the reliability of our simplifier, instead of replacing the target complex word with the simplest candidate, we first compare the use of this candidate against the original word in context, which can be seen as a Confidence Check.
The target $t$ is only replaced by the simplest candidate $c$ if the language model probability of the trigram $S_{j-2}^{j-1}\,t$, in which $S_{j-2}^{j-1}$ is the bigram of words preceding $t$ in position $j$ of sentence $S$, is smaller than that of the trigram $S_{j-2}^{j-1}\,c$. This type of approach has proved to be a reliable alternative to simply adding the target complex word to the candidate pool during ranking (Glavaš and Štajner, 2015).
To calculate probabilities, we train a 5-gram language model over SubIMDB, since its word and n-gram frequencies have been shown to correlate with simplicity better than those from other larger corpora (Paetzold and Specia, 2016b). We henceforth refer to our LS approach (SG+SS+SR) as NNLS.
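A minimal sketch of the Confidence Check is given below. The trigram_prob callable is a hypothetical interface to the language model, and the toy probabilities are invented for the example; neither reflects the actual toolkit or values used.

```python
def confidence_check(sentence, j, candidate, trigram_prob):
    """Return the word to use at position j of a tokenised sentence.

    The candidate replaces the target only if the trigram formed by the two
    preceding words plus the target is less probable than the trigram formed
    by the same two words plus the candidate.
    """
    target = sentence[j]
    context = tuple(sentence[max(0, j - 2):j])  # up to two preceding words
    if trigram_prob(context + (target,)) < trigram_prob(context + (candidate,)):
        return candidate
    return target


# Toy probabilities (assumed values, for illustration only).
toy_probs = {
    ("the", "cat", "perched"): 1e-6,
    ("the", "cat", "sat"): 1e-4,
}


def toy_trigram_prob(trigram):
    return toy_probs.get(trigram, 0.0)


tokens = ["the", "cat", "perched", "on", "the", "mat"]
accepted = confidence_check(tokens, 2, "sat", toy_trigram_prob)
rejected = confidence_check(tokens, 2, "roosted", toy_trigram_prob)
```

In the toy example "sat" replaces "perched", whereas "roosted" is rejected and the original word is kept.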

Substitution Generation Evaluation
Here we assess the performance of our SG approach in isolation (NNLS/SG), and when paired with our SS strategy (NNLS/SG+SS), as described in Sections 2 and 3. We compare them to the generators of all approaches featured in the benchmarks of Paetzold and Specia (2016a): Devlin, Biran (Biran et al., 2011), Yamamoto (Kajiwara et al., 2013), Horn (Horn et al., 2014), Glavas (Glavaš and Štajner, 2015) and Paetzold (Paetzold and Specia, 2015; Paetzold and Specia, 2016f). These SG strategies extract candidates from WordNet, Wikipedia and Simple Wikipedia articles, Merriam dictionary, sentence-aligned Wikipedia and Simple Wikipedia articles, typical word embeddings and context-aware word embeddings, respectively. They are all available in the LEXenstein framework (Paetzold and Specia, 2015).
We use two common evaluation datasets for LS: BenchLS (Paetzold and Specia, 2016a), which contains 929 instances and is annotated by English speakers from the U.S., and NNSEval (Paetzold and Specia, 2016f), which contains 239 instances and is annotated by non-native English speakers. Each instance is composed of a sentence, a target complex word, and a set of gold candidates ranked by simplicity. We use the same metrics featured in (Paetzold and Specia, 2016a), which are the well-known Precision, Recall and F1. Notice that, since these datasets already provide target words deemed complex by human annotators, we do not address CWI in our evaluations.
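For reference, the sketch below shows one common way of computing these metrics for SG. It assumes that Precision is the proportion of generated candidates found in the gold standard and Recall the proportion of gold candidates that were generated, both pooled over all instances; the exact formulation in the benchmark may differ, and the function name is hypothetical.

```python
def sg_scores(generated, gold):
    """Pooled Precision, Recall and F1 for Substitution Generation.

    generated, gold: lists of sets of candidates, one set per instance.
    """
    correct = sum(len(g & r) for g, r in zip(generated, gold))
    n_generated = sum(len(g) for g in generated)
    n_gold = sum(len(r) for r in gold)
    precision = correct / n_generated if n_generated else 0.0
    recall = correct / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


p, r, f1 = sg_scores(
    generated=[{"sat", "rested"}, {"big"}],
    gold=[{"sat", "settled", "landed"}, {"large", "huge"}],
)
```

In this toy case one of three generated candidates is correct and one of five gold candidates is recovered, giving a Precision of 1/3, a Recall of 0.2 and an F1 of 0.25.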
The results in Table 1 reveal that our SG approach outperforms all others in Precision and F1 by a considerable margin, and that our SS approach leads to noticeable increases in Precision at almost no cost in Recall.

Substitution Ranking Evaluation
We also compare our Neural Ranking SR approach (NNLS/SR) to the rankers of all aforementioned lexical simplifiers. The Devlin, Biran, Yamamoto, Horn, Glavas and Paetzold rankers exploit Kucera-Francis coefficients (Rudell, 1993), hand-crafted complexity metrics, a supervised SVM ranker, rank averaging and Boundary Ranking, respectively. In this experiment we disregard the step of Confidence Check, since we aim to analyse the performance of our ranking strategy alone.
The datasets used are those introduced for the English Lexical Simplification task of SemEval 2012 (Specia et al., 2012), to which dozens of systems were submitted. The training and test sets are composed of 300 and 1,710 instances, respectively. Each instance is composed of a sentence, a target complex word, and a series of candidate substitutions ranked by simplicity. We use TRank, the official metric of the SemEval 2012 task, which measures the proportion of instances for which the candidate with the highest gold rank was ranked first, as well as Pearson (p) correlation. While TRank best captures the reliability of rankers in practice, Pearson correlation shows how well the rankers capture simplicity in general. Table 2 reveals that, much like our SG approach, our Neural Ranker performs well in isolation, offering the highest scores among all strategies available.
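Both metrics can be sketched in a few lines. The treatment of ties (a set of best gold candidates per instance) and the function names are assumptions made for illustration; the official scorer of the task may differ in such details.

```python
import math


def trank(predicted_top, gold_best):
    """Proportion of instances whose top-ranked candidate has the best gold rank.

    predicted_top: list with the candidate ranked first by the system
    gold_best:     list of sets with the best-ranked gold candidate(s)
    """
    hits = sum(1 for top, best in zip(predicted_top, gold_best) if top in best)
    return hits / len(predicted_top)


def pearson(x, y):
    """Pearson correlation between two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


score = trank(["easy", "hard"], [{"easy"}, {"simple", "plain"}])
agreement = pearson([1, 2, 3, 4], [1, 2, 3, 4])
```

In the toy example the first instance is a hit and the second is not, so TRank is 0.5; identical predicted and gold ranks give a Pearson correlation of 1.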

Full Pipeline Evaluation
We then evaluate our approach in two settings: with (NNLS) and without (NNLS-C) the Confidence Check (Section 4.3). The evaluation datasets used are the same as those described in Section 5, and the metrics are:
• Accuracy: The proportion of instances in which the target word was replaced by a gold candidate.
• Precision: The proportion of instances in which the target word was either replaced by a gold candidate or not replaced at all.
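The two definitions above translate directly into code. The sketch below is illustrative; the function name and the toy data are hypothetical.

```python
def pipeline_scores(outputs, targets, gold):
    """Accuracy and Precision of a full simplification pipeline.

    outputs: word produced by the simplifier for each instance
    targets: original target complex word of each instance
    gold:    set of gold candidates of each instance
    """
    n = len(outputs)
    replaced_by_gold = sum(
        1 for out, tgt, g in zip(outputs, targets, gold) if out != tgt and out in g
    )
    not_replaced = sum(1 for out, tgt in zip(outputs, targets) if out == tgt)
    accuracy = replaced_by_gold / n
    precision = (replaced_by_gold + not_replaced) / n
    return accuracy, precision


acc, prec = pipeline_scores(
    outputs=["sat", "arduous", "gigantic"],
    targets=["perched", "arduous", "vast"],
    gold=[{"sat", "rested"}, {"hard", "difficult"}, {"big", "large"}],
)
```

In the toy example one target is replaced by a gold candidate, one is left unchanged and one is replaced by a word outside the gold set, so Accuracy is 1/3 and Precision is 2/3.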

Error Analysis
In this section we analyse NNLS to understand the sources of its errors. For that, we use PLUMBErr (Paetzold and Specia, 2016c; Shardlow, 2014), a method that assesses all steps taken by LS systems and identifies five types of errors:
• 1: No error during simplification.
• 5: Replacement does not simplify the word.
Errors of type 2 are made during CWI, 3 during SG/SS, and 4 and 5 during SR. We pair our simplifier, as well as those of Devlin, Horn, Glavas and Paetzold, with two CWI approaches: one that simplifies everything (SE), and the Performance-Oriented Soft Voting approach (PV), which won the CWI task of SemEval 2016 (Paetzold and Specia, 2016e). Table 3 shows the count and proportion (in brackets) of instances in BenchLS in which each error was made. It shows that our approach correctly simplifies the largest number of problems, while making the fewest errors of type 3A and 4. However, NNLS makes many errors of type 5. By analysing the output produced after each step, we found that this is caused by the inherently high Precision of our approach: by producing a smaller number of spurious candidates, our simplifier reduces the occurrences of ungrammatical and/or incoherent substitutions, but also disregards many candidates that are simpler than the target complex word. Nonetheless, this noticeably increases the number of correct simplifications made.

Conclusions
We introduced an LS approach that extracts candidate substitutions from the Newsela corpus and retrofitted context-aware word embedding models, selects them with Unsupervised Boundary Ranking, and ranks them using a new Neural Ranking strategy.
We found that: (i) our generator achieves the highest Precision and F1 scores to date, (ii) our Neural Ranking strategy leads to the top scores on the English Lexical Simplification task of SemEval 2012, and (iii) their combination offers the highest Precision and Accuracy scores in two standard evaluation datasets. An error analysis reveals that our LS approach makes considerably fewer grammaticality/meaning errors than former state-of-the-art simplifiers.
In future work, we aim to investigate new architectures for our Neural Ranking model, as well as to test our approach in other NLP tasks. An implementation of our Substitution Generation, Selection and Ranking approaches can be found in the LEXenstein framework 4 .