A Word-Complexity Lexicon and A Neural Readability Ranking Model for Lexical Simplification

Current lexical simplification approaches rely heavily on heuristics and corpus-level features that do not always align with human judgment. We create a human-rated word-complexity lexicon of 15,000 English words and propose a novel neural readability ranking model with a Gaussian-based feature vectorization layer that utilizes these human ratings to measure the complexity of any given word or phrase. Our model performs better than the state-of-the-art systems for different lexical simplification tasks and evaluation datasets. We also produce SimplePPDB++, a lexical resource of over 10 million simplifying paraphrase rules, by applying our model to the Paraphrase Database (PPDB).


Introduction
Lexical simplification is an important subfield that is concerned with the complexity of words or phrases, and particularly how to measure readability and reduce the complexity using alternative paraphrases. There are three major lexical simplification tasks which effectively resemble a pipeline: (i) Complex Word Identification (Paetzold and Specia, 2016a; Yimam et al., 2017; Shardlow, 2013b), which involves identifying complex words in the sentence; (ii) Substitution Generation (Glavaš and Štajner, 2015; Coster and Kauchak, 2011), which involves finding alternatives to complex words or phrases; and (iii) Substitution Ranking, which involves ranking the paraphrases by simplicity. Lexical simplification also has practical real-world uses, such as displaying alternative expressions of complex words as reading assistance for children (Kajiwara et al., 2013), non-native speakers (Petersen and Ostendorf, 2007; Pellow and Eskenazi, 2014), lay readers (Elhadad and Sutaria, 2007; Siddharthan and Katsos, 2010), or people with reading disabilities (Rello et al., 2013).
Most current approaches to lexical simplification rely heavily on corpus statistics and surface-level features, such as word length and corpus-based word frequencies (read more in §5). Two of the most commonly used assumptions are that simple words are associated with shorter lengths and higher frequencies in a corpus. However, these assumptions are not always accurate and are often the major source of errors in the simplification pipeline (Shardlow, 2014). For instance, the word foolishness is simpler than its meaning-preserving substitution folly even though foolishness is longer and less frequent in the Google 1T Ngram corpus (Brants and Franz, 2006). In fact, we found that 21% of the 2,272 meaning-equivalent word pairs randomly sampled from PPDB (Ganitkevitch et al., 2013) had the simpler word longer than the complex word, while 14% had the simpler word less frequent.
To alleviate these inevitable shortcomings of corpus and surface-based methods, we explore a simple but surprisingly unexplored idea: creating an English lexicon of 15,000 words with word-complexity ratings by humans. We also propose a new neural readability ranking model with a Gaussian-based feature vectorization layer, which can effectively exploit these human ratings as well as other numerical features to measure the complexity of any given word or phrase (including those outside the lexicon and/or with sentential context). Our model significantly outperforms the state-of-the-art on the benchmark SemEval-2012 evaluation for Substitution Ranking (Paetzold and Specia, 2017), with or without using the manually created word-complexity lexicon, achieving a Pearson correlation of 0.714 and 0.702 respectively. We also apply the new ranking model to identify lexical simplifications (e.g., commemorate → celebrate) among the large number of paraphrase rules in PPDB with improved accuracy compared to previous work for Substitution Generation. Finally, by utilizing the word-complexity lexicon, we establish a new state-of-the-art on two common test sets for Complex Word Identification (Paetzold and Specia, 2016a; Yimam et al., 2017). We make our code, the word-complexity lexicon, and a lexical resource of over 10 million paraphrase rules with improved readability scores (namely SimplePPDB++) all publicly available.

Constructing A Word-Complexity Lexicon with Human Judgments
We first constructed a lexicon of 15,000 English words with word-complexity scores assessed by human annotators (available for download at https://github.com/mounicam/lexical_simplification). Although the English vocabulary is much larger, we found that rating the 15,000 most frequent English words in the Google 1T Ngram Corpus (https://catalog.ldc.upenn.edu/ldc2006t13) was effective for simplification purposes (see experiments in §4), as our neural ranking model (§3) can estimate the complexity of any word or phrase, even one that is out-of-vocabulary. We asked 11 non-native but fluent English speakers to rate words on a 6-point Likert scale. In a pilot experiment with two annotators, we found that an even-numbered 6-point scale worked better than a 5-point scale, as the 6-point scheme allowed annotators to take a natural two-step approach: first determine whether a word is simple or complex; then decide whether it is 'very simple' (or 'very complex'), 'simple' (or 'complex'), or 'moderately simple' (or 'moderately complex'). For words with multiple capitalized versions (e.g., nature, Nature, NATURE), we displayed the most frequent form to the annotators. We also asked the annotators to indicate the words whose complexity they had trouble assessing due to ambiguity, lack of context, or any other reason. All the annotators reported little difficulty and offered possible explanations, for example that the word bug is simple regardless of whether it means an insect in biology or an error in computer software. With our hired annotators, we were able to have most annotators complete half or the full list of 15,000 words for better consistency, and we collected between 5 and 7 ratings for each word. It took most annotators about 2 to 2.5 hours to rate 1,000 words. Table 1 shows a few examples from the lexicon along with their human ratings.
In order to assess the annotation quality, we computed the Pearson correlation between each annotator's annotations and the average of the rest of the annotations (Agirre et al., 2014). For our final word-complexity lexicon, we took an average of the human ratings for each word, discarding those (about 3%) that had a difference ≥ 2 from the mean of the rest of the ratings. The overall inter-annotator agreement improved from 0.55 to 0.64 after discarding the outlying ratings. For the majority of the disagreements, the ratings of one annotator and the mean of the rest were fairly close: the difference is ≤ 0.5 for 47% of the annotations; ≤ 1.0 for 78% of the annotations; and ≤ 1.5 for 93% of the annotations on the 6-point scale. We hired annotators of different native languages intentionally, which may have contributed to the variance in the judgments. We leave further investigation and possible crowdsourcing annotation to future work.
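The aggregation procedure described above can be sketched as follows. This is a minimal illustration, not the authors' released code: the function names, the data layout (a dictionary from word to per-annotator ratings), and the fallback used when every rating of a word is flagged as an outlier are all assumptions made for the example.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def annotator_agreement(table, annotator):
    """Correlate one annotator's ratings with the mean of the other annotators.

    `table` maps word -> {annotator_id: rating}; this layout is hypothetical.
    """
    own, others = [], []
    for by_annotator in table.values():
        if annotator not in by_annotator:
            continue
        rest = [r for a, r in by_annotator.items() if a != annotator]
        if rest:
            own.append(by_annotator[annotator])
            others.append(mean(rest))
    return pearson(own, others)

def lexicon_score(ratings, max_diff=2.0):
    """Average the ratings of one word after dropping outlying ratings.

    A rating is dropped when it differs by >= max_diff from the mean of the
    remaining ratings. Falling back to all ratings when nothing survives is
    an assumption of this sketch, not something stated in the text.
    """
    kept = [
        r for i, r in enumerate(ratings)
        if abs(r - mean(ratings[:i] + ratings[i + 1:])) < max_diff
    ]
    return mean(kept) if kept else mean(ratings)
```

For example, with the ratings `[2, 2, 3, 2, 6]` the single rating of 6 lies 3.75 points from the mean of the others and is dropped, so the word receives the average of the remaining four ratings.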

Neural Readability Ranking Model for Words and Phrases
In order to predict the complexity of any given word or phrase, within or outside the lexicon, we propose a Neural Readability Ranking model that can leverage the created word-complexity lexicon and take context (if available) into account to further improve performance. Our model uses a Gaussian-based vectorization layer to exploit numerical features more effectively and can outperform the state-of-the-art approaches on multiple lexical simplification tasks with or without the word-complexity lexicon. We describe the general model framework in this section, and task-specific configurations in the experiment section ( §4).

Neural Readability Ranker (NRR)
Given a pair of words/phrases $\langle w_a, w_b \rangle$ as input, our model aims to output a real number that indicates the relative complexity $P(y \mid \langle w_a, w_b \rangle)$ of $w_a$ and $w_b$. If the output value is negative, then $w_a$ is simpler than $w_b$, and vice versa. Figure 1 shows the general architecture of our ranking model, highlighting the three main components:
1. An input feature extraction layer (§3.2) that creates lexical and corpus-based features for each input, $f(w_a)$ and $f(w_b)$, and pairwise features $f(\langle w_a, w_b \rangle)$. We also inject the word-complexity lexicon into the model as a numerical feature plus a binary indicator.
2. A Gaussian-based feature vectorization layer (§3.3) that converts each numerical feature, such as the lexicon scores and n-gram probabilities, into a vector representation by a series of Gaussian radial basis functions.
3. A feedforward neural network performing regression with one task-specific output node that adapts the model to different lexical simplification tasks (§4).
Our model first processes each input word or phrase in parallel, producing vectorized features. All the features are then fed into a joint feedforward neural network.

Features
We use a combination of rating scores from the word-complexity lexicon, lexical and corpus features, and collocational features (Paetzold and Specia, 2017).
We inject the word-complexity lexicon into the NRR model by adding two features for each input word or phrase: a 0-1 binary feature representing the presence of a word (the longest word in a multi-word phrase) in the lexicon, and the corresponding word-complexity score. For out-of-vocabulary words, both features have the value 0. We back off to the complexity score of the lemmatized word if applicable. We also extract the following features: phrase length in terms of words and characters, number of syllables, frequency with respect to the Google Ngram corpus (Brants and Franz, 2006), the relative frequency in Simple Wikipedia with respect to normal Wikipedia (Pavlick and Nenkova, 2015), and n-gram probabilities from a 5-gram language model trained on the SubIMDB corpus (Paetzold and Specia, 2016c), which has been shown to work well for lexical simplification. For a word w, we take language model probabilities of all the possible n-grams within the context window of 2 to the left and right of w. When w is a multi-word phrase, we break w into possible n-grams and average the probabilities for a specific context window.
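The two lexicon features described above can be sketched as a simple lookup. This is only an illustration under stated assumptions: the lexicon is taken to be a plain dictionary from word to average rating, phrases are split on whitespace, ties for the longest word are broken by position, and `lemmatize` is a hypothetical callback standing in for whatever lemmatizer is actually used.

```python
def lexicon_features(phrase, lexicon, lemmatize=lambda word: word):
    """Return (in_lexicon_indicator, complexity_score) for a word or phrase.

    For a multi-word phrase, the longest word is looked up. If the surface
    form is missing, the lemmatized form is tried as a back-off. Both
    features are 0 for out-of-vocabulary inputs.
    """
    longest = max(phrase.split(), key=len)  # first longest word wins ties
    for candidate in (longest, lemmatize(longest)):
        if candidate in lexicon:
            return 1.0, float(lexicon[candidate])
    return 0.0, 0.0


# Toy usage with made-up scores; real scores come from the released lexicon.
toy_lexicon = {"celebrate": 2.0, "folly": 4.5}
toy_lemmas = {"celebrated": "celebrate"}
features = lexicon_features(
    "celebrated", toy_lexicon, lemmatize=lambda w: toy_lemmas.get(w, w)
)
```

Note that a score of 0 is unambiguous as a missing-value marker only because it is paired with the binary indicator, which is presumably why both features are used.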
For an input pair of words/phrases $\langle w_a, w_b \rangle$, the pairwise features $f(\langle w_a, w_b \rangle)$ are computed between the word2vec (Mikolov et al., 2013) embeddings of the input words. The embedding of a multi-word phrase is obtained by averaging the embeddings of all the words in the phrase. We use the 300-dimensional embeddings pretrained on the Google News corpus, which are released as part of the word2vec package.

Vectorizing Numerical Features via Gaussian Binning
Our model relies primarily on numerical features, as do many previous approaches to lexical simplification. Although these continuous features can be fed directly into the network, it is helpful to fully exploit the nuanced relatedness between different intervals of feature values. We adopt a smooth binning approach and project each numerical feature into a vector representation by applying multiple Gaussian radial basis functions. For each feature $f$, we divide its value range $[f_{min}, f_{max}]$ evenly into $k$ bins and place a Gaussian function for each bin with the mean $\mu_j$ ($j \in \{1, 2, \ldots, k\}$) at the center of the bin and standard deviation $\sigma$. We specify $\sigma$ as a fraction $\gamma$ of bin width:
$$\sigma = \frac{1}{k}\,(f_{max} - f_{min})\,\gamma \tag{1}$$
where $\gamma$ is a tunable hyperparameter in the model. For a given feature value $f(\cdot)$, we then compute the distance to each bin as follows:
$$d_j(f(\cdot)) = e^{-\frac{(f(\cdot) - \mu_j)^2}{2\sigma^2}} \tag{2}$$
and normalize to project into a $k$-dimensional vector:
$$\vec{f}(\cdot) = \left( \frac{d_1}{\sum_{i=1}^{k} d_i}, \frac{d_2}{\sum_{i=1}^{k} d_i}, \ldots, \frac{d_k}{\sum_{i=1}^{k} d_i} \right) \tag{3}$$
We vectorize all the features except the word2vec vectors, then concatenate them as inputs. Figure 2 presents a motivating t-SNE visualization of the word-complexity scores from the lexicon after the vectorization in our NRR model, where different feature value ranges are gathered together with some distances in between.
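The smooth binning described in this section can be sketched in a few lines of plain Python. The function name and signature are hypothetical, and clipping out-of-range values to $[f_{min}, f_{max}]$ is an assumption added here so that the normalization never divides by a vanishing sum; the defaults k = 10 and γ = 0.2 are the values reported in the implementation details.

```python
import math

def gaussian_binning(value, f_min, f_max, k=10, gamma=0.2):
    """Project one numerical feature value into a k-dimensional vector.

    The range [f_min, f_max] is split into k equal bins. A Gaussian with
    standard deviation sigma = gamma * bin_width is centred on each bin,
    and the k responses are normalized to sum to 1.
    """
    width = (f_max - f_min) / k
    sigma = gamma * width
    # Assumption of this sketch: clip values that fall outside the range.
    value = min(max(value, f_min), f_max)
    centers = [f_min + (j + 0.5) * width for j in range(k)]
    responses = [
        math.exp(-((value - mu) ** 2) / (2 * sigma ** 2)) for mu in centers
    ]
    total = sum(responses)
    return [r / total for r in responses]


# A complexity score of 3.0 on the 1-6 scale falls exactly between the
# centres of the 4th and 5th bins, so those two entries share the mass.
vector = gaussian_binning(3.0, f_min=1.0, f_max=6.0)
```

With a small γ, almost all of the mass sits on the one or two bins nearest the value, so the output behaves like a soft one-hot encoding; a larger γ spreads mass over neighbouring bins and makes nearby values look more alike.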

Training and Implementation Details
We use the PyTorch framework to implement the NRR model, which consists of an input layer, three hidden layers with eight nodes in each layer and the tanh activation function, and a single-node linear output layer. The training objective is to minimize the Mean Squared Error (MSE):
$$L(\theta) = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2$$
where $y_i$ and $\hat{y}_i$ are the true and predicted relative complexity scores of $\langle w_a, w_b \rangle$, which can be configured accordingly for different lexical simplification tasks and datasets, $m$ is the number of training examples, and $\theta$ is the set of parameters of the NRR model. We use the Adam algorithm (Kingma and Ba, 2014) for optimization and also apply a dropout of 0.2 to prevent overfitting. We set the learning rate to 0.0005 and 0.001 for the experiments in §4.1 and §4.2 respectively. For the Gaussian binning layer, we set the number of bins $k$ to 10 and $\gamma$ to 0.2 without extensive parameter tuning.
For each experiment, we report results after 100 epochs of training.

Lexical Simplification Applications
As the lexical simplification research field traditionally studies multiple sub-tasks and datasets, we present a series of experiments to demonstrate the effectiveness of our newly created lexicon and neural readability ranking (NRR) model.

Substitution Ranking
Given an instance consisting of a target complex word in a sentence and a set of candidate substitutions, the goal of the Substitution Ranking task is to rank the candidates in the order of their simplicity. In this section, we show that our proposed NRR model outperforms the state-of-the-art neural model on this task, with or without using the word-complexity lexicon.
Data. We use the dataset from the English Lexical Simplification shared task at SemEval 2012 for evaluation. The training and test sets consist of 300 and 1,710 instances, respectively, with a total of 201 target words (all single words, mostly polysemous), each in 10 different sentences. One such instance contains a target complex word in context: When you think about it, that's pretty terrible.
and a set of candidate substitutions {bad, awful, deplorable}. Each instance contains at least 2 and an average of 5 candidates to be ranked. There are a total of 10,034 candidates in the dataset, 88.5% of which are covered by our word-complexity lexicon and 9.9% of which are multi-word phrases (3,438 unique candidates with 81.8% in-vocabulary and 20.2% multi-word).
Task-specific setup of the NRR model. We train the NRR model with every pair of candidates $\langle c_a, c_b \rangle$ in a candidate set as the input, and the difference of their ranks $r_a - r_b$ as the ground-truth label. For each such pair, we also include another training instance with $\langle c_b, c_a \rangle$ as the input and $r_b - r_a$ as the label. Given a test instance with candidate set $C$, we rank the candidates as follows: for every pair of candidates $\langle c_a, c_b \rangle$, the model predicts the relative complexity score $S(c_a, c_b)$; we then compute a single score $R(c_a) = \sum_{c_b \in C,\, c_b \neq c_a} S(c_a, c_b)$ for each candidate by aggregating pairwise scores and rank the candidates in increasing order of these scores.
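The pairwise setup above can be sketched as follows. The function names are hypothetical, the candidates in a set are assumed to be unique strings, and `score_pair` stands in for the trained model: here it is replaced by a toy scorer so that the example runs on its own.

```python
from itertools import permutations

def training_pairs(candidates, ranks):
    """Build ((c_a, c_b), r_a - r_b) for every ordered pair of candidates.

    Using ordered pairs means each unordered pair contributes two training
    instances, one per direction, with labels of opposite sign.
    """
    return [
        ((ca, cb), ranks[ca] - ranks[cb])
        for ca, cb in permutations(candidates, 2)
    ]

def rank_candidates(candidates, score_pair):
    """Rank candidates by R(c_a), the sum of S(c_a, c_b) over all other c_b.

    Lower totals mean simpler candidates, so the list is sorted in
    increasing order of the aggregated score.
    """
    totals = {
        ca: sum(score_pair(ca, cb) for cb in candidates if cb != ca)
        for ca in candidates
    }
    return sorted(candidates, key=lambda c: totals[c])


# Toy stand-in for the trained model: made-up complexity values.
toy_complexity = {"bad": 1.0, "awful": 2.0, "deplorable": 3.0}
toy_scorer = lambda a, b: toy_complexity[a] - toy_complexity[b]

pairs = training_pairs(["bad", "awful", "deplorable"],
                       {"bad": 1, "awful": 2, "deplorable": 3})
ranking = rank_candidates(["deplorable", "bad", "awful"], toy_scorer)
```

Summing over all opponents makes the final ranking less sensitive to a single noisy pairwise prediction than sorting with the model as a comparator would be.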
Comparison to existing methods. We compare with the state-of-the-art neural model (Paetzold and Specia, 2017) and several other existing methods: Biran et al. (2011), Kajiwara et al. (2013), and Glavaš & Štajner (2015), which use carefully designed heuristic scoring functions to combine various information such as corpus statistics and semantic similarity measures from WordNet; Horn et al. (2014) and the Boundary Ranker (Paetzold and Specia, 2015), which respectively use a supervised SVM ranking model and a pairwise linear classification model with various features. All of these methods have been implemented as part of the LEXenstein toolkit (Paetzold and Specia, 2015), which we use for the experimental comparisons here. In addition, we also compare to the best system (Jauhar and Specia, 2012) among participants at SemEval 2012, which used SVM-based ranking.

Table 2 (partial):
Method                       P@1       Pearson
Biran et al. (2011)          51.3      0.505
Jauhar & Specia (2012)       60.2      0.575
Kajiwara et al. (2013)       60.4      0.649
Horn et al. (2014)           63.9      0.673
Glavaš & Štajner (2015)      63[...]   [...]
Results. Table 2 compares the performance of our NRR model to the state-of-the-art results reported by Paetzold and Specia (2017). We use precision of the simplest candidate (P@1) and Pearson correlation to measure performance. P@1 is equivalent to TRank, the official metric for the SemEval 2012 English Lexical Simplification task. While P@1 captures the practical utility of an approach, Pearson correlation indicates how well the system's rankings correlate with human judgment. We train our NRR model with all the features ($\text{NRR}_{all}$) mentioned in §3.2 except the word2vec embedding features to avoid overfitting on the small training set. Our full model ($\text{NRR}_{all+binning+WC}$) exhibits a statistically significant improvement over the state-of-the-art. We use the paired bootstrap test (Efron and Tibshirani, 1993) as it can be applied to any performance metric. We also conducted ablation experiments to show the effectiveness of the Gaussian-based feature vectorization layer (+binning) and the word-complexity lexicon (+WC).

SimplePPDB++
We can also apply our NRR model to rank the lexical and phrasal paraphrase rules in the Paraphrase Database (PPDB), and identify good simplifications (see examples in Table 3). The resulting lexical resource, SimplePPDB++, contains all 13.1 million lexical and phrasal paraphrase rules in the XL version of PPDB 2.0 with readability scores in 'simplifying', 'complicating', or 'nonsense/no-difference' categories, allowing a flexible trade-off between high-quality and high-coverage paraphrases. In this section, we show the effectiveness of the NRR model we used to create SimplePPDB++ by comparing with the previous version of SimplePPDB, which used a three-way logistic regression classifier. In the next section, we demonstrate the utility of SimplePPDB++ for the Substitution Generation task.
Task-specific setup of the NRR model. We use the same manually labeled data of 11,829 paraphrase rules as SimplePPDB for training and testing, of which 26.5% are labeled as 'simplifying', 26.5% as 'complicating', and 47% as 'nonsense/no-difference'. We adapt our NRR model to perform the three-way classification by treating it as a regression problem. During training, we specify the ground-truth label as follows: y = -1 if the paraphrase rule belongs to the 'complicating' class, y = +1 if the rule belongs to the 'simplifying' class, and y = 0 otherwise. For prediction, the network produces a single real-valued output $\hat{y} \in [-1, 1]$, which is then mapped to three-class labels based on the value ranges for evaluation. The thresholds for the value ranges are -0.4 and 0.4, chosen by cross-validation.
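The label encoding and the threshold mapping can be sketched as below. The function and constant names are hypothetical, and whether the boundaries at exactly -0.4 and 0.4 are inclusive is not stated in the text, so the inclusive choice here is an assumption.

```python
# Regression targets used during training, one per class.
TARGETS = {"complicating": -1.0, "simplifying": 1.0, "nonsense/no-difference": 0.0}

def to_label(y_hat, low=-0.4, high=0.4):
    """Map a real-valued prediction in [-1, 1] to one of three classes.

    Assumption of this sketch: predictions exactly at a threshold are
    assigned to the outer class.
    """
    if y_hat <= low:
        return "complicating"
    if y_hat >= high:
        return "simplifying"
    return "nonsense/no-difference"


predicted = [to_label(y) for y in (-0.9, -0.1, 0.25, 0.7)]
```

Moving the two thresholds apart trades coverage for precision: fewer rules are labeled 'simplifying' or 'complicating', but those that are carry more extreme scores.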
Comparison to existing methods. We compare our neural readability ranking (NRR) model used to create SimplePPDB++ against SimplePPDB, which uses a multi-class logistic regression model. We also use several other baselines, including W2V, which uses logistic regression with only word2vec embedding features.
Results. Following the evaluation setup in previous work, we compare accuracy and precision by 10-fold cross-validation. Folds are constructed in such a way that the training and test vocabularies are disjoint. Table 4 shows the performance of our model compared to SimplePPDB and other baselines. We use all the features ($\text{NRR}_{all}$) in §3.2 except for the context features, as we are classifying paraphrase rules in PPDB that come with no context. SimplePPDB used the same features plus additional discrete features, such as POS tags, character unigrams and bigrams. Our neural readability ranking model alone with Gaussian binning ($\text{NRR}_{all+binning}$) achieves better accuracy and precision while using fewer features. Leveraging the lexicon ($\text{NRR}_{all+binning+WC}$) shows statistically significant improvements over SimplePPDB rankings based on the paired bootstrap test. The accuracy increases by 3.2 points, the precision for the 'simplifying' class improves by 7.4 points, and the precision for the 'complicating' class improves by 4.0 points.

Substitution Generation
Substitution Generation is arguably the most challenging research problem in lexical simplification, which involves producing candidate substitutions for each target complex word/phrase, followed by the substitution ranking. The key focus is not only to have better rankings but, more importantly, to have a larger number of simplifying substitutions generated. This is a more realistic evaluation to demonstrate the utility of SimplePPDB++ and the effectiveness of the NRR ranking model we used to create it, and how likely such lexical resources are to benefit the development of end-to-end sentence simplification systems (Narayan and Gardent, 2016; Zhang and Lapata, 2017) in future work.
Data. We use the dataset from previous work, which contains 100 unique target words/phrases sampled from the Newsela Simplification Corpus (Xu et al., 2015) of news articles, and follow the same evaluation procedure. We ask two annotators to evaluate whether the generated substitutions are good simplifications.
Comparison to existing methods. We evaluate the correctness of the substitutions generated by SimplePPDB++ in comparison to several existing methods: Glavaš (Glavaš and Štajner, 2015), Kauchak (Coster and Kauchak, 2011), WordNet Generator (Devlin and Tait, 1998; Carroll et al., 1999), and SimplePPDB. Glavaš obtains candidates with the highest similarity scores in the GloVe (Pennington et al., 2014) word vector space. Kauchak's generator is based on the parallel corpus of Simple Wikipedia and normal Wikipedia and on automatic word alignment. The WordNet-based generator simply uses the synonyms of a word in WordNet (Miller, 1995). For all the existing methods, we report the results based on the implementations in previous work, which used SVM-based ranking. For both SimplePPDB and SimplePPDB++, extracted candidates are high-quality paraphrase rules (quality score ≥ 3.5 for words and ≥ 4.0 for phrases) belonging to the same syntactic category as the target word according to PPDB 2.0.
Table 5: Substitution Generation evaluation with Mean Average Precision, Precision@1 and the average number of paraphrases generated per target for each method. n is the number of target complex words/phrases for which the model generated > 0 candidates. Kauchak† has an advantage on MAP because it generates the least number of candidates. Glavaš is marked as '-' because it can technically generate as many words/phrases as are in the vocabulary.
Results. Table 5 shows the comparison of SimplePPDB and SimplePPDB++ on the number of substitutions generated for each target, and on the mean average precision and precision@1 for the final ranked list of candidate substitutions. This is a fair and direct comparison between SimplePPDB++ and SimplePPDB, as both methods have access to the same paraphrase rules in PPDB as potential candidates. The better NRR model we used in creating SimplePPDB++ allows better selection and ranking of simplifying paraphrase rules than the previous version of SimplePPDB. As an additional reference, we also include the measurements for the other existing methods based on previous work, which, by evaluation design, are focused on the comparison of precision while PPDB has full coverage.
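For readers unfamiliar with the two ranking metrics, the sketch below shows one common way to compute them from human judgments of a ranked candidate list. Average precision has several variants; the one used here (the mean of precision values at each rank holding a good substitution) is an assumption and may differ in detail from the evaluation script behind Table 5. All names are hypothetical.

```python
def average_precision(is_good):
    """Average precision of one ranked list of booleans (True = good).

    Precision is measured at every rank that holds a good substitution and
    averaged over those ranks; a list with no good item scores 0.
    """
    hits, total = 0, 0.0
    for rank, good in enumerate(is_good, start=1):
        if good:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def precision_at_1(is_good):
    """1.0 if the top-ranked candidate is a good substitution, else 0.0."""
    return 1.0 if is_good and is_good[0] else 0.0

def mean_metric(metric, judged_lists):
    """Average a per-target metric over all targets with at least one candidate."""
    scored = [metric(lst) for lst in judged_lists if lst]
    return sum(scored) / len(scored) if scored else 0.0


# Two toy targets with made-up judgments of their ranked candidates.
judgments = [[True, False, True], [False, True]]
toy_map = mean_metric(average_precision, judgments)
toy_p1 = mean_metric(precision_at_1, judgments)
```

Because this variant averages only over the good items that were actually retrieved, a generator that returns very few candidates can score well on MAP, which is consistent with the caveat about Kauchak in the Table 5 caption.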

Complex Word Identification
Complex Word Identification (CWI) identifies the difficult words in a sentence that need to be simplified. According to Shardlow (2014), this step can improve the simplification system by avoiding mistakes such as overlooking challenging words or oversimplifying simple words. In this section, we demonstrate how our word-complexity lexicon helps with the CWI task by injecting human ratings into the state-of-the-art systems.
Data. The task is to predict whether a target word/phrase in a sentence is 'simple' or 'complex', and an example instance is as follows: Nine people were killed in the bombardment.
We conduct experiments on two datasets: (i) the SemEval 2016 CWI shared-task dataset (Paetzold and Specia, 2016a); and (ii) the CWIG3G2 dataset (Yimam et al., 2017). Table 7 shows the coverage of our word-complexity lexicon over the two CWI datasets.
Table 6: Evaluation on two datasets for English complex word identification. Our approaches that utilize the word-complexity lexicon (WC) improve upon the nearest centroid (Yimam et al., 2017) and SV000gg (Paetzold and Specia, 2016b) systems. The best performance figure of each column is denoted in bold typeface and the second best is denoted by an underline.

Comparison to existing methods. We consider two state-of-the-art CWI systems: (i) the nearest centroid classifier proposed by Yimam et al. (2017), which uses phrase length, number of senses, POS tags, word2vec cosine similarities, and n-gram frequency in the Simple Wikipedia corpus and the Google 1T corpus as features; and (ii) SV000gg (Paetzold and Specia, 2016b), which is an ensemble of binary classifiers trained with a combination of lexical, morphological, collocational, and semantic features. The latter is the best performing system on the SemEval 2016 CWI dataset.
We also compare to threshold-based baselines that use word length, number of word senses and frequency in the Simple Wikipedia.
Utilizing the word-complexity lexicon. We enhance the SV000gg and the nearest centroid classifier by incorporating the word-complexity lexicon as additional features as described in §3.2.
We added our modifications to the implementation of SV000gg in the LEXenstein toolkit, and used our own implementation for the nearest centroid classifier. Additionally, to evaluate the word-complexity lexicon in isolation, we train a decision tree classifier with only human ratings as input (WC-only), which is equivalent to learning a threshold over the human ratings.
Results. We compare our enhanced approaches (SV000gg+WC and NC+WC) and the lexicon-only approach (WC-only) with the state-of-the-art and baseline threshold-based methods. For measuring performance, we use F-score and accuracy as well as G-score, the harmonic mean of accuracy and recall. G-score is the official metric of the CWI task of SemEval 2016. Table 6 shows that the word-complexity lexicon improves the performance of SV000gg and the nearest centroid classifier in all three metrics. The improvements are statistically significant according to the paired bootstrap test with p < 0.01. The word-complexity lexicon alone (WC-only) performs satisfactorily on the CWIG3G2 dataset; it is effectively a simple table look-up approach with extreme time and space efficiency. For the CWI SemEval 2016 dataset, the WC-only approach gives the best accuracy and F-score, though this can be attributed to the skewed distribution of the dataset (only 5% of the test instances are 'complex').
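The three metrics can be computed from binary predictions as sketched below, with 1 meaning 'complex' and 0 meaning 'simple'. The function name and the convention of returning 0 when a denominator is zero are assumptions of this sketch.

```python
def cwi_scores(gold, pred):
    """Accuracy, F-score, and G-score for binary complex-word labels.

    G-score is the harmonic mean of accuracy and recall on the 'complex'
    class; F-score is the harmonic mean of precision and recall.
    """
    def safe_div(num, den):
        return num / den if den else 0.0

    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)

    accuracy = safe_div(tp + tn, len(gold))
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": accuracy,
        "f_score": safe_div(2 * precision * recall, precision + recall),
        "g_score": safe_div(2 * accuracy * recall, accuracy + recall),
    }


# With 5% complex instances, always predicting 'simple' yields high
# accuracy but zero recall, and therefore a G-score of zero.
skewed_gold = [1] + [0] * 19
always_simple = cwi_scores(skewed_gold, [0] * 20)
```

This toy case illustrates why accuracy alone is a weak signal on a heavily skewed test set, and why the shared task reports G-score alongside it.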

Related Work
Lexical simplification: Prior work on lexical simplification depends on lexical and corpus-based features to assess word complexity. For complex word identification, there are broadly two lines of research: learning a frequency-based threshold over a large corpus (Shardlow, 2013b) or training an ensemble of classifiers over a combination of lexical and language model features (Shardlow, 2013a; Paetzold and Specia, 2016a; Yimam et al., 2017; Kriz et al., 2018). Substitution ranking also follows a similar trend. Biran et al. (2011) and Bott et al. (2012) [...] Only recently, researchers started to apply neural networks to simplification tasks. To the best of our knowledge, the work by Paetzold and Specia (2017) is the first neural model for lexical simplification which uses a feedforward network with language model probability features. Our NRR model is the first pairwise neural ranking model to vectorize numeric features and to embed human judgments using a word-complexity lexicon of 15,000 English words.
Besides lexical simplification, another line of relevant research is sentence simplification that uses statistical or neural machine translation (MT) approaches (Xu et al., 2016; Nisioi et al., 2017; Zhang and Lapata, 2017; Vu et al., 2018; Guo et al., 2018). It has been shown that paraphrase rules in PPDB can be integrated into statistical MT for sentence simplification (Xu et al., 2016) and bilingual translation (Mehdizadeh Seraj et al., 2015), while how to inject SimplePPDB++ into neural MT remains an open research question.
Lexica for simplification: There have been previous attempts to use manually created lexica for simplification. For example, Elhadad and Sutaria (2007) used the UMLS lexicon (Bodenreider, 2007), a repository of technical medical terms; Ehara et al. (2010) asked non-native speakers to answer multiple-choice questions corresponding to 12,000 English words to study each user's familiarity with vocabulary; Kaji et al. (2012) and Kajiwara et al. (2013) used a dictionary of 5,404 Japanese words based on elementary school textbooks; Xu et al. (2016) used a list of the 3,000 most common English words; Lee and Yeung (2018) used an ensemble of vocabulary lists of different complexity levels. However, to the best of our knowledge, there is no previous study on manually building a large word-complexity lexicon with human judgments that has shown substantial improvements on automatic simplification systems. We were encouraged by the success of the word-emotion lexicon (Mohammad and Turney, 2013) and the word-happiness lexicon (Dodds et al., 2011, 2015).
Vectorizing features: Feature binning is a standard feature engineering and data processing method to discretize continuous values, more commonly used in non-neural machine learning models. Our work is largely inspired by recent works on entity linking that discussed feature quantization for neural models (Sil et al., 2017; Liu et al., 2016) and neural dependency parsing with embeddings of POS tags as features (Chen and Manning, 2014).

Conclusion
We proposed a new neural readability ranking model and showed significant performance improvement over the state-of-the-art on various lexical simplification tasks. We release a manually constructed word-complexity lexicon of 15,000 English words and an automatically constructed lexical resource, SimplePPDB++, of over 10 million paraphrase rules with quality and simplicity ratings. For future work, we would like to extend our lexicon to cover specific domains, different target users and languages.