SV000gg at SemEval-2016 Task 11: Heavy Gauge Complex Word Identification with System Voting

We introduce the SV000gg systems: two Ensemble Methods for the Complex Word Identification task of SemEval 2016. While the SV000gg-Hard system exploits basic Hard Voting, the SV000gg-Soft system employs Performance-Oriented Soft Voting, which weights votes according to the voter’s performance rather than its prediction confidence, allowing for completely heterogeneous systems to be combined. Our performance comparison shows that our voting techniques outperform traditional Soft Voting, as well as other systems submitted to the shared task, ranking first and second overall.


Introduction
In Complex Word Identification (CWI), the goal is to find which words in a given text may challenge the members of a given target audience. It is part of the usual Lexical Simplification pipeline, which is illustrated in Figure 1. As shown by the results obtained by (Paetzold and Specia, 2013) and (Shardlow, 2014), ignoring the step of Complex Word Identification in Lexical Simplification can lead simplifiers to neglect challenging words, as well as to replace simple words with inappropriate alternatives.
Various strategies have been devised to address CWI, and most of them are very simple in nature. For example, to identify complex words, the lexical simplifier for the medical domain in (Elhadad and Sutaria, 2007) uses a Lexicon-Based approach that exploits the UMLS (Bodenreider, 2004) database: if a medical expression is among the technical terms registered in UMLS, then it is complex. The complexity identifier for the Swedish lexical simplifier in (Keskisärkkä, 2012) uses a threshold over word frequencies to distinguish complex from simple words. Recently, however, more sophisticated approaches have been used. (Shardlow, 2013) presents a CWI benchmark that compares the performance of a Threshold-Based strategy, a Support Vector Machine (SVM) model trained over various features, and a "simplify everything" baseline. The SVM model of (Shardlow, 2013) has shown promising results, but CWI approaches rarely explore Machine Learning techniques and, in particular, their combination. As an effort to fill this gap, in this paper we describe our contributions to the Complex Word Identification task of SemEval 2016. We introduce two systems, SV000gg-Hard and SV000gg-Soft, both of which use straightforward Ensemble Methods to combine different predictions for CWI. These come from a variety of models, ranging from simple Lexicon-Based approaches to more elaborate Machine Learning classifiers.

In the CWI task of SemEval 2016, participants were asked to submit predictions on the complexity of words based on the needs of non-native English speakers. The setup of the task is as follows: given a target word in a sentence, predict whether or not a non-native English speaker would be able to understand it. For training, a joint and a decomposed dataset were provided. Both datasets consist of 2,237 instances containing a sentence, a target word, its position in the sentence, and complexity label(s). The decomposed dataset contains 20 binary complexity labels, provided by 20 annotators, while the joint dataset contains only one label: 1 if at least one of the 20 annotators did not understand the word (complex), and 0 otherwise (simple). Participants were allowed to train their systems over either, both or none of the datasets, as well as to use any external resources.
The test set contains 88,221 instances and follows the same format as the joint dataset, but was generated using only one word complexity label. The difference between the training and test sets is that while each instance in the training set was annotated by 20 people, each instance in the test set was annotated by only one person. The goal of this setup was to replicate a realistic scenario in Text Simplification, where systems must predict the individual preferences of a target audience based on the overall needs of a population sample.
For evaluation, common metrics (Accuracy, Precision, Recall and F-score) are used, along with a new metric designed specifically for CWI: the G-score. The G-score is the harmonic mean of Accuracy and Recall, and aims at capturing the performance of a CWI approach to be used within a Lexical Simplification system. The reasoning behind the metric is that an ideal CWI system should avoid both false negatives and false positives, which is measured through Accuracy, and at the same time capture as many complex words as possible, which is measured through Recall. High values on these two metrics would prevent a lexical simplifier from making unnecessary and possibly erroneous word replacements and from neglecting words which should be simplified.
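As an illustration, the G-score can be computed as in the minimal Python sketch below. The function name `g_score` and the convention of returning 0.0 when the score is undefined are assumptions made for this example; they are not taken from the task's official scorer.

```python
def g_score(gold, predicted):
    """Harmonic mean of Accuracy and Recall for binary labels (1 = complex).

    Illustrative sketch only: the name and the zero-division handling are
    assumptions, not the shared task's official implementation.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")

    # Accuracy: fraction of instances labelled correctly.
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    accuracy = correct / len(gold)

    # Recall: fraction of truly complex words that were labelled complex.
    complex_total = sum(1 for g in gold if g == 1)
    true_positives = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    recall = true_positives / complex_total if complex_total else 0.0

    if accuracy + recall == 0:
        return 0.0
    return 2 * accuracy * recall / (accuracy + recall)
```

For gold labels `[1, 1, 0, 0]` and predictions `[1, 0, 0, 0]`, Accuracy is 0.75 and Recall is 0.5, so the G-score is 2 · 0.75 · 0.5 / 1.25 = 0.6.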

System Overview
Our strategy explores the idea behind the popular saying "two heads are better than one" for the CWI problem. We believe that combining the "opinion" of various distinct approaches to a given task can yield better results than any of the individual approaches. This idea is not new for classification tasks like ours, and has been thoroughly explored in several ways. Strategies that combine multiple Machine Learning classifiers are often referred to as Ensemble Methods. Such methods range from very simple solutions, such as Hard Voting, in which labels are determined based on how many times they were predicted by the classifiers, to very elaborate approaches, such as Random Forests (Breiman, 2001) and Gradient Boosting (Friedman, 2001).
The strategy we employ consists of a variant of Soft Voting, in which the class of a given instance is determined as in Equation 1.
In traditional Soft Voting, c_f is the selected class, c is one of the possible classes in a classification problem, S is the collection of systems considered, and T is a confidence estimate, i.e. a function that expresses how confident system s is that c is the correct class. Its goal is to improve on Hard Voting by incorporating the systems' classification confidence in the decision process, hopefully making for a more reliable way of exploiting their strengths and weaknesses.
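From these definitions, Equation 1 can plausibly be written as below. This is a reconstruction from the surrounding description, and the argument order T(s, c) is an assumption.

```latex
c_f = \arg\max_{c} \sum_{s \in S} T(s, c) \tag{1}
```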
Although sensible in principle, Soft Voting might not be able to effectively combine systems if they do not have a reasonably uniform way of determining the confidence of their predictions. The presence of over-optimistic or over-pessimistic systems may skew the results severely, and hence make the resulting classifier perform worse than the best system among those considered in the voting. Another clear limitation of traditional Soft Voting is that it cannot include systems which simply cannot estimate the confidence level of their prediction. Lexicon-Based CWI approaches such as the ones of (Elhadad and Ph, 2006) and (Elhadad and Sutaria, 2007), for example, predict that a word is simple if it is present in a certain vocabulary. These approaches tend to be very effective in certain contexts, but can only produce binary confidence estimates: if the word is in the vocabulary, then the approach is 100% sure the word is simple; if not, it is 100% sure the word is complex.
In order to address these limitations, we exploit Performance-Oriented Soft Voting (Georgiou and Mavroforakis, 2013). Instead of using the systems' summed confidence to predict a label, it uses their performance score over a certain validation dataset. Formally, we decompose function T from Equation 1 into the two functions illustrated in Equation 2.
In Equation 2, P represents the score of system s over a certain dataset d given a certain performance metric, such as Precision, Recall, F1 or Accuracy. Function D, on the other hand, outputs value 1 if system s has predicted c for the classification problem in question, and 0 otherwise. This setup works under the assumption that the systems' performance on a validation dataset is a reliable surrogate for confidence predictions, and allows any type of system to be combined, whether or not the systems are homogeneous in their way of predicting classes.
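Given these definitions of P and D, Equation 2 can plausibly be written as below. As with the previous equation, this is a reconstruction from the text, and the argument orders P(s, d) and D(s, c) are assumptions.

```latex
c_f = \arg\max_{c} \sum_{s \in S} P(s, d) \cdot D(s, c) \tag{2}
```

Since D(s, c) is 1 only for the class that s predicted, each system contributes its validation score to exactly one class.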
In what follows, we describe the features and settings used in the creation of our two CWI systems: SV000gg-Hard and SV000gg-Soft. While SV000gg-Hard uses basic Hard Voting, SV000gg-Soft uses Performance-Oriented Soft Voting. Since both of them combine a series of sub-systems, to avoid confusion, we henceforth refer to these sub-systems as "voters".
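The difference between the two voting schemes can be sketched in a few lines of Python. This is a minimal illustration and not the SV000gg implementation: the function names, the dictionary-based inputs and the tie-breaking rule (the smallest label wins) are assumptions made for the example.

```python
from collections import defaultdict


def hard_vote(predictions):
    """Return the label predicted by the largest number of voters.

    predictions: dict mapping a voter name to its predicted label.
    Ties are broken in favour of the smallest label (an assumption).
    """
    counts = defaultdict(float)
    for label in predictions.values():
        counts[label] += 1.0
    return max(sorted(counts), key=counts.get)


def performance_soft_vote(predictions, performance):
    """Return the label with the highest summed voter performance.

    predictions: dict mapping a voter name to its predicted label.
    performance: dict mapping a voter name to its score on a validation
                 set (for instance its G-score); plays the role of P.
    Ties are broken in favour of the smallest label (an assumption).
    """
    scores = defaultdict(float)
    for voter, label in predictions.items():
        # A voter adds its score only to the class it predicted (D = 1).
        scores[label] += performance[voter]
    return max(sorted(scores), key=scores.get)


# Hypothetical voters and validation scores, for illustration only.
example_predictions = {"lexicon": 1, "threshold": 0, "svm": 0}
example_performance = {"lexicon": 0.9, "threshold": 0.4, "svm": 0.3}
```

With these hypothetical values, `hard_vote(example_predictions)` selects label 0, because two of the three voters predict it, whereas `performance_soft_vote(example_predictions, example_performance)` selects label 1, because the single voter predicting it has a higher validation score (0.9) than the other two combined (0.7).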

Features
Our voters use a total of 69 features. They can be divided into four categories:
• Binary: If a target word is part of a certain vocabulary, then it receives label 1, otherwise, 0. We extract vocabularies from Simple Wikipedia (Kauchak, 2013), Ogden's Basic English (Ogden, 1968) and SubIMDB (Paetzold, 2015).
• Collocational: Language model probabilities of all n-gram combinations with windows w < 3 to the left and right of the target complex word in Wikipedia, SUBTLEX (Brysbaert and New, 2009), Simple Wikipedia and SubIMDB.
• Nominal: Includes the word itself, its POS tag, both word and POS tag n-gram combinations with windows w < 3 to the left and right, and the word's language model backoff behavior (Uhrik and Ward, 1997) according to a 5-gram language model trained over Simple Wikipedia with SRILM (Stolcke and others, 2002).
In order for language model probabilities to be calculated, we train a 5-gram language model for each of the aforementioned corpora using SRILM (Stolcke and others, 2002). Nominal features were obtained with the help of LEXenstein (Paetzold and Specia, 2015).

Voters
We train a total of 21 voters, which we have grouped in three categories:
• Lexicon-Based (LB): If a word is present in a given vocabulary of simple words, then it is simple, otherwise, it is complex. We train one Lexicon-Based voter for each binary feature described in the previous Section.
• Threshold-Based (TB): Given a certain feature, learns the threshold t which best separates complex and simple words. In order to learn t, it first calculates the feature value for all instances in the training data and obtains its minimum and maximum. It then divides the interval into 10,000 equally sized parts, and performs a brute force search over all 10,000 values to find the one which yields the highest G-score over the training data. We train one Threshold-Based voter for each lexical feature described in the previous Section.
• Machine-Learning-Assisted (ML): Learn a binary classification model from the training data using a Machine Learning algorithm. We build models using the following seven algorithms in the scikit-learn toolkit (Pedregosa et al., 2011):
1. Support Vector Machines
2. Passive Aggressive Learning
3. Stochastic Gradient Descent
4. Decision Trees
5. Ada Boosting
6. Gradient Boosting
7. Random Forests
Additionally, we use Keras¹ to train a Multi-Layer Perceptron voter. Its architecture, including number and size of hidden layers, was decided through 5-fold cross-validation over the training set. The aforementioned models use as input all binary, lexical and collocational features. Finally, we also train a Conditional Random Field model using CRFSuite (Okazaki, 2007). It uses as input all nominal features described in the previous Section. The hyper-parameters of all Machine-Learning-Assisted voters are determined through 5-fold cross-validation over the G-score.
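The brute-force search performed by a Threshold-Based voter can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the code used for the submission: the names `learn_threshold` and `complex_if_below` are hypothetical, the direction of the comparison is a parameter because the text does not specify it, and the sketch evaluates the boundary points of the equally sized parts, including both ends of the interval.

```python
def g_score(gold, predicted):
    """Harmonic mean of Accuracy and Recall for binary labels (1 = complex)."""
    accuracy = sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)
    complex_total = sum(1 for g in gold if g == 1)
    true_positives = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    recall = true_positives / complex_total if complex_total else 0.0
    if accuracy + recall == 0:
        return 0.0
    return 2 * accuracy * recall / (accuracy + recall)


def learn_threshold(values, gold, steps=10_000, complex_if_below=True):
    """Search for the threshold that maximises the G-score on the training data.

    values: feature value of each training instance (e.g. word frequency).
    gold:   binary labels (1 = complex, 0 = simple).
    steps:  number of equally sized parts the [min, max] interval is split into.
    complex_if_below: if True, a word is complex when its value is below the
        threshold (an assumption that suits frequency-like features).
    Returns the first best threshold found and its G-score.
    """
    low, high = min(values), max(values)
    best_threshold, best_score = low, -1.0
    for i in range(steps + 1):
        threshold = low + (high - low) * i / steps
        if complex_if_below:
            predicted = [int(v < threshold) for v in values]
        else:
            predicted = [int(v > threshold) for v in values]
        score = g_score(gold, predicted)
        if score > best_score:  # strict comparison keeps the first best value
            best_threshold, best_score = threshold, score
    return best_threshold, best_score
```

At prediction time, such a voter would label a new word by comparing its feature value against the learned threshold, using the same direction as in training.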
We select the number of the top G-score systems to be considered through 5-fold cross-validation over the joint dataset. For completeness, we also include a traditional Soft Voting system that combines Machine Learning approaches only, given that the others do not have well-established ways of calculating prediction probability estimates.
Table 1 shows the performance scores of all individual voters, along with the 25 best performing systems in the CWI task, a standard Soft Voting approach, and our two SV000gg systems. Despite their simplicity, our system voting strategies are the two most effective CWI solutions submitted to SemEval 2016, having both obtained considerably higher G-scores than traditional Soft Voting. These results show the importance of finding clever ways to combine distinct strategies for a task, since, by not considering Lexicon-Based and Threshold-Based voters, the traditional soft voter suffered a considerable loss in G-score.
¹ http://keras.io

Results
The results of the individual voters reveal that Decision Trees and Ensemble Methods achieve noticeably higher performance than the Multi-Layer Perceptron, a type of model that has been used as a state-of-the-art solution to various tasks. Another surprise comes with the scores of Threshold-Based voters, which offer competitive performance in comparison to Machine Learning techniques. The performance of our Conditional Random Field voter suggests that nominal features are not as reliable as numeric features in predicting word complexity.
The effectiveness of Ensemble Methods is further highlighted by the scores of our own and others' solutions for the SemEval task: precisely 50% of the top 10 systems use some type of Ensemble.

Conclusions
We have presented our contributions to the Complex Word Identification task of SemEval 2016: the SV000gg systems, which exploit two types of system Ensemble voting schemes. Along with the typical Hard Voting, we employ Performance-Oriented Soft Voting, which diverges from traditional Soft Voting by weighting votes not by their prediction confidence, but rather by overall system performance.
Our performance comparison shows how effective our voting strategies can be: they top the rankings in the SemEval task, outperforming even elaborate Ensemble strategies. We hope that our approach will serve as a reliable alternative for other problems in Natural Language Processing and beyond.
In the future, we also intend to explore the use of Gaussian Processes and Multi-Task Learning for Complex Word Identification.