USAAR at SemEval-2016 Task 11: Complex Word Identification with Sense Entropy and Sentence Perplexity

This paper describes an information-theoretic approach to complex word identification using a classifier built on two features: an entropy measure derived from word senses and sentence-level perplexity. We describe the motivation behind these features based on information density and demonstrate that they perform modestly well in the complex word identification task in SemEval-2016. We also discuss possible improvements for future work, namely exploring the subjectivity of word complexity and more robust evaluation metrics for the complex word identification task.


Introduction
Complex Word Identification (CWI) is the task of automatically identifying difficult words in a sentence.
It is an important subtask prior to the textual/lexical simplification task that pertains to the substitution of abstruse words with lucid variants which can be apprehended by a wider gamut of readers (Siddharthan, 2006; Specia et al., 2012; Shardlow, 2013).
The aim of the CWI task is to annotate the difficult words as shown in the underlined examples in the previous paragraph, such that a lexical simplification system can produce the following sentence: It is an important subtask before the textual/lexical simplification task that concerns the replacement of difficult words with simpler variants which can be understood by a wider range of readers.
Lexical simplification is a specific case of lexical substitution where the complex words in a sentence are replaced with simpler words.
Historically, lexical substitution was conceived as a means to examine the appropriateness of a fixed word sense inventory in the word sense disambiguation task, in which the "sense" of a polysemous word is to be correctly identified given a context sentence (Kilgarriff, 1997; Palmer, 2000; Hanks, 2000; Ide and Wilks, 2007; McCarthy and Navigli, 2009). By allowing fluidity in the "sense" inventory and by quantifying how well the systems were able to generate good substitutes, these lexical substitutes would build a word sense cluster of words that may not be covered by a set of pre-defined words in a sense inventory, e.g. Princeton WordNet (Miller, 1995) and Open Multilingual WordNet (Bond and Paik, 2012).

Entropy and Perplexity
Entropy is an information-theoretic measure of the degree of indeterminacy of a random variable. In simpler words, entropy measures how unpredictable an event is (Shannon, 1951).
For the case of complex words, we can also assume that the degree of word ambiguity contributes to a word's level of unpredictability, which determines its complexity. We define the degree of word ambiguity as the number of possible senses a word can have, more specifically the number of synsets of a lemma of the target word as recorded in the Princeton WordNet. Formally, we define the sense entropy of a word, H(word), as such:
$$H(\text{word}) = -\sum_{k=1}^{n} p(\text{sense}_k) \log p(\text{sense}_k)$$
where n is the number of possible senses of a word and p(sense_k) is the probability of the k-th sense given the context sentence where the word occurs. We assume a uniform distribution across all senses of a word, thus we assign 1/n to p(sense_k).
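As a minimal sketch of the sense entropy described above, the following Python snippet computes the entropy of a uniform distribution over a given number of senses. The function name `sense_entropy`, the default natural-log base and the handling of a zero sense count are assumptions made for illustration; they are not taken from the released system.

```python
import math

def sense_entropy(n_senses, base=math.e):
    """Entropy of a uniform distribution over `n_senses` senses.

    Each sense gets probability 1/n, so the sum reduces to log(n).
    A non-positive count returns 0.0 (an illustrative assumption).
    """
    if n_senses <= 0:
        return 0.0
    p = 1.0 / n_senses
    # Written as the full sum to mirror the definition term by term.
    return -sum(p * math.log(p, base) for _ in range(n_senses))

# Hypothetical sense counts, for illustration only.
for n in (1, 2, 8):
    print(n, round(sense_entropy(n, base=2), 3))
```

Under the uniform assumption the entropy grows with the logarithm of the sense count, so a monosemous word receives an entropy of 0.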
Perplexity is the exponentiation of entropy and likewise measures how unpredictable an event is. Intuitively, if a complex word appears in a sentence, the sentence would become less common and less predictable, yielding a higher sentence perplexity score. Mathematically, we define the sentence perplexity, 2^{H(sentence)}, as follows:
$$2^{H(\text{sentence})} = 2^{-\frac{1}{N}\sum_{i=1}^{N} \log_2 p(\text{word}_i)}$$
where N is the number of words in the sentence and p(word_i) is the unigram probability of the word generated from a modified Kneser-Ney language model (Chen and Goodman, 1999).
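The sentence perplexity can be sketched in a few lines of Python. This is an illustration only: the toy probability table and the `floor` value for unseen words are hypothetical stand-ins, whereas the system described here obtains its probabilities from a modified Kneser-Ney language model.

```python
import math

def sentence_perplexity(words, unigram_prob, floor=1e-9):
    """Return 2 ** H(sentence), where H is the average negative
    log2 unigram probability of the words in the sentence."""
    if not words:
        raise ValueError("cannot score an empty sentence")
    # Unseen words fall back to a small floor probability (assumption).
    log_sum = sum(math.log2(unigram_prob.get(w, floor)) for w in words)
    entropy = -log_sum / len(words)
    return 2 ** entropy

# Hypothetical unigram probabilities, for illustration only.
toy_probs = {"the": 0.25, "cat": 0.25, "abstruse": 0.0001}

print(sentence_perplexity(["the", "cat"], toy_probs))
print(sentence_perplexity(["the", "abstruse"], toy_probs))
```

Replacing a frequent word with a rare one lowers the average log probability and therefore raises the perplexity of the sentence.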

Experimental Setup
The dataset for the CWI task in SemEval-2016 is annotated at word level with binary labels: 1 for complex and 0 for non-complex. Table 1 presents the corpus statistics of the dataset provided for the CWI task. The organizers decided to emulate the limited human language capacity with a small training set and a large testing set that reflects the relatively larger proportion of text that a human will encounter in reality. However, we do note the stark difference between the percentage of complex words in the training and test data; the test data skews heavily towards the non-complex label.
To compute the sense entropy, we annotated the dataset with lemmas using the PyWSD lemmatizer and looked the lemmas up in the Princeton WordNet. The training and testing sets comprise 2,237 and 88,221 words respectively. Of the annotated words, the training and testing sets have 1,903 and 20,016 unique lemmas, and WordNet covers 84.97% and 64.89% of these lemmas respectively. When a lemma is not covered by WordNet, we assign an entropy of 0, which indicates that the lemma's complexity is easily predictable, so the classifier would assign the majority label to the word.
To compute the sentence perplexity as presented in the previous section, we use the English Wikipedia section of the SeedLing corpus (Emerson et al., 2014) and the news articles from the DSL Corpus Collection to train the language model using the KenLM tool (Heafield et al., 2013). On average, there are 11 annotated words per sentence and every word in the same sentence shares the same sentence perplexity.
Using both the sense entropy and sentence perplexity as features, we train a boosted tree binary classifier (Friedman, 2002) using the GraphLab Create machine learning toolkit to predict word complexity.
Interestingly, when we used the raw number of senses instead of sense entropy as a feature on various machine learning classifiers, the number of senses was uninformative and the classifiers labelled either all words as complex or all words as non-complex.

Results
We submitted 2 systems to the CWI task in SemEval-2016 (Paetzold and Specia, 2016), one using only the sense entropy (sentropy) and another that includes the sentence perplexity feature (entroplexity).
The complex word classification is evaluated on the classic (i) accuracy, (ii) precision, (iii) recall and (iv) F-score. In addition to the F-score, the harmonic mean of precision and recall, the organizers reported the harmonic mean of accuracy and recall, dubbed the G-score.
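To make the relation between these measures concrete, here is a small Python sketch that derives all five scores from gold and predicted binary labels. The function names and the toy labels are assumptions for illustration; this is not the organizers' evaluation script.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two scores; 0.0 when both are zero."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)

def cwi_scores(gold, pred):
    """Accuracy, precision, recall, F-score and G-score for 0/1 labels."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    tn = sum(1 for g, p in pairs if g == 0 and p == 0)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": harmonic_mean(precision, recall),  # precision vs. recall
        "g_score": harmonic_mean(accuracy, recall),   # accuracy vs. recall
    }

# Toy labels, for illustration only (1 = complex, 0 = non-complex).
gold = [1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 0, 1, 0, 0, 0, 0, 0]
scores = cwi_scores(gold, pred)
print(scores)
```

On skewed data the G-score can sit well above the F-score, because accuracy is dominated by the many non-complex words.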
Since the accuracy score computes the percentage of correctly predicted labels globally, it might be more indicative to read the accuracy scores given the highly skewed dataset (<5% of the test set is labelled as complex). Table 2 presents the comparative results between our systems, the 4 systems that ranked top in F-score and G-score, and 2 baseline systems that use the frequency thresholds, learned from the English and Simple English Wikipedias, that best separate complex from simple words.
PLUJAGH-SEWDFF uses frequency thresholding techniques: it considers any word that occurs fewer than 147 times in the Simple English Wikipedia to be complex. LTG-System2 uses a decision tree classifier trained using similar threshold features. SV000gg uses a soft and hard voting ensemble to combine 23 different systems that include threshold-based and lexicon-based techniques and machine learning classifiers based on 69 distinct morphological, lexical, semantic, collocational and nominal features.
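The frequency thresholding idea can be sketched as follows. Only the threshold of 147 comes from the system description above; the function name, the lowercasing step and the word counts are hypothetical and serve purely as an illustration.

```python
THRESHOLD = 147  # occurrence threshold quoted for PLUJAGH-SEWDFF

def is_complex(word, counts, threshold=THRESHOLD):
    """Label a word as complex if it occurs fewer than `threshold`
    times in the reference corpus; unseen words count as 0."""
    return counts.get(word.lower(), 0) < threshold

# Hypothetical corpus counts, for illustration only.
simple_wiki_counts = {"replace": 5000, "substitute": 300, "abstruse": 2}

for w in ("replace", "abstruse", "gamut"):
    print(w, is_complex(w, simple_wiki_counts))
```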
Compared to the top systems, our system has performed modestly, and our Sentropy system outperforms the thresholding baselines. We note that our accuracy and precision scores are relatively competitive compared to the top systems, but our recall is distinctly lower, which affects the F- and G-scores. Possibly, we could improve the system by using a word sense disambiguation tool that provides the sense probabilities instead of assuming uniform probabilities across all senses, especially when word senses are often dictated by the most common sense of the word given the context sentence.
Intuitively, we can expect the Entroplexity system with sentence-level perplexity to underperform on this particular test set because the variance of the perplexity measure is low, since all words within the same sentence attain the same sentence perplexity. For a dataset with more training sentences, the feature could perform better.

Subjectivity of Word Complexity
From the example in the introduction, we see the subjectivity of word complexity and how it may vary from speaker to speaker. Arguably, substitution and replacement could have been of equal word complexity depending on the speaker's level of English proficiency. Although the word variant could easily be considered simple by a native French/German speaker learning English, since the equivalent Variante exists in his/her native language, it might be considered a complex word by other second-language speakers of English.
The annotations for the CWI task training set were collected from 20 annotators over a set of 200 sentences. A word is labelled complex if any one of the annotators deems it to be complex, while the testing set was annotated by 1 annotator.
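The aggregation rule for the training labels can be illustrated with a short Python sketch. The function name and the toy judgements are assumptions for illustration; the real training set has 20 annotators rather than 3.

```python
def aggregate_any(annotations):
    """Combine per-annotator 0/1 judgements into one label per word.

    `annotations` holds one list per annotator, aligned by word.
    A word is complex (1) if at least one annotator marked it so.
    """
    return [int(any(votes)) for votes in zip(*annotations)]

# Three hypothetical annotators judging the same three words.
annotations = [
    [0, 1, 0],
    [0, 0, 0],
    [0, 1, 1],
]
print(aggregate_any(annotations))
```

Under this rule a single annotator is enough to make a word complex, which is one way the training and testing label distributions can come to differ.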
To explore the reader-based subjectivity in word complexity identification, we suggest that future work on CWI explores reader-specific annotations and models that account for them. In this respect, readers' meta-data such as their native and non-native languages, country of residence, etc. could potentially be more telling in predicting their English proficiency and in identifying complex words tailored to specific readers or groups of readers.

Evaluation Metrics
Complex Word Identification is a novel task, and possibly the standard F-score and accuracy measures might not be reflective of the task difficulty or the system efficiency. Given the binary nature of the classification task, we suggest the use of the Matthews correlation coefficient (Matthews, 1975), which measures the correlation between the observed and predicted binary labels and can be viewed as a variant of the chi-square coefficient.
It takes the true and false positives and negatives into account together, so systems need not be optimized for either accuracy or precision alone but for a healthy balance of both. The coefficient ranges from -1 to +1, where +1, 0 and -1 represent perfect, random and inverse predictions respectively.
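For reference, a minimal Python sketch of the Matthews correlation coefficient computed from binary labels is given below. The function name and toy labels are assumptions for illustration, and returning 0.0 when the denominator vanishes is a common convention rather than part of the original definition.

```python
import math

def matthews_corrcoef(gold, pred):
    """Matthews correlation coefficient for 0/1 labels."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    tn = sum(1 for g, p in pairs if g == 0 and p == 0)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: undefined cases (e.g. a single predicted class) give 0.0.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

gold = [1, 0, 1, 0]
print(matthews_corrcoef(gold, [1, 0, 1, 0]))  # perfect predictions
print(matthews_corrcoef(gold, [0, 1, 0, 1]))  # inverse predictions
print(matthews_corrcoef(gold, [0, 0, 0, 0]))  # everything non-complex
```

A system that labels every word as non-complex can still reach high accuracy on skewed data, yet it receives a coefficient of 0 here.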

Related Work
Although the lexical simplification/substitution task is well-studied, the complex word identification task has mostly been discussed as an anterior subtask. Devlin and Tait (1998) implemented a lexical substitution system that considers all words as complex words and generated the simpler variant of the words by referencing the most frequent synonym of the word from the WordNet synsets and SUBTLEX corpus (Brysbaert and New, 2009).
Another method to identify complex words is to exploit the Zipfian nature of language by thresholding frequencies and classifying words that occur below a certain threshold as complex. Zeng et al. (2005) and Elhadad (2006) applied the thresholding method to the medical domain to identify technical terms that non-experts would find difficult to read; the complex terms and varying frequencies correlate with the word difficulty scores elicited from questionnaires (Zeng-Treitler et al., 2008). Similarly, n-gram based frequency thresholds have been used to identify complex words that caused second language Chinese learners to make errors in their essays.
Other than the heuristics described above, previous studies have also used supervised machine learning algorithms and data annotated with binary labels for each word in the training corpus. Malmasi et al. (2016) use Zipfian word ranks and character n-gram features to train a random forest, an SVM and a nearest neighbour classifier to predict word complexity. Shardlow (2013) compared various techniques to identify complex words, viz. (i) treating every word as complex, (ii) thresholding frequency using the mean of the thresholds that achieved the highest accuracies across cross-validation folds of the training set and (iii) an SVM classifier using word-level and character-level (orthographic and phonemic) frequencies and the number of synsets of each word.
While the 'everything is complex' technique achieved the highest recall, the SVM classifier scored the best precision. The coefficients in his SVM classifier showed the sense feature to be the weakest, while the frequency features indicated higher correlations with the binary label distribution. In comparison, our sense entropy system, which is based solely on the number of senses per word, reported modest results in the CWI task.

Conclusion
In this paper, we presented our systems submitted to the complex word identification task in SemEval-2016. We introduced the notion of sense entropy, which measures the unpredictability of a word based on its number of senses, and used it as a feature to identify complex words. The implementation of our system is released as an open source tool available at https://github.com/alvations/entroplexity.