PLUJAGH at SemEval-2016 Task 11: Simple System for Complex Word Identification

This paper presents the description of a system which detects complex words. It solely uses information regarding the presence of a word in a prepared vocabulary list. The system outperforms multiple more advanced systems and is ranked fourth for the shared task, with minimal loss to the best system. F-score optimization guaranteed the first place in this measurement. Different features are considered and evaluated. Maximal bounds are predicted. The rule “the simplest methods give the best results” is confirmed.


Introduction
The goal of Complex Word Identification (CWI) is to detect words in a text that are complex (not easy to understand) for some group of people. CWI is one of the tasks of SemEval-2016 (Paetzold and Specia, 2016).
CWI can be treated as the first step of Lexical Simplification (LS). LS was a task of SemEval-2012. Complex words were identified using n-grams, the length of the word, and the number of syllables (Ligozat et al., 2012; De Belder et al., 2010; Biran et al., 2011). The resources exploited in this task include Wikipedia, WordNet, and the Google Web 1T corpus (Sinha, 2012; Paetzold and Specia, 2015). Additional annotation of input sentences was performed with a part-of-speech tagger and word sense disambiguation (Amoia and Romanelli, 2012). A similar task is the prediction of the readability of a whole text. In CWI, by contrast, each word has to be scored. The applied methods are summarized in (Dębowski et al., 2015).
This paper presents findings regarding the necessary data and the performed experiments. For the final submission, a simple system was chosen, which scored at fourth place.

Task Data Analysis
It is important to notice the difference between training and test data. Each sentence in the training set was annotated by 20 annotators. If at least one of them classified a word in a sentence as complex, it was marked as complex. The training data consists of 2237 classified words. On the other hand, each sentence in the test data (88221 classified words) was annotated by only one annotator.
Complex words represent 31.56% of the words in the training data. Fortunately, the organizers published the unaggregated annotations: every word in a sentence has 20 annotations. In this scenario, only 4.55% of instances are classified as complex.
The a priori probability of a word being complex is important knowledge for the classification task.
What is more, the organizers shared the baseline results for the test data (Table 1). They show that complex words represent 4.7% of instances in the test data, similar to the unaggregated training data.

Resources and Methods
Knowledge bases are essential to this task. Wikipedia is one of the most popular sources of text used in NLP. Using the cycloped.io (Smywiński-Pohl and Wróbel, 2014) framework, the English and Simple English Wikipedia were preprocessed. The text extracted from articles allowed the calculation of term frequency (TF) and document frequency (DF). TF is the total number of times a word appears in a corpus; DF is the number of documents in which the word occurs at least once. The corpora had to be tokenized in the same way as the data from the organizers.
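The TF and DF definitions above can be illustrated with a minimal sketch. It assumes that each article has already been tokenized into a list of strings; the function name `term_and_document_frequency` and the toy documents are hypothetical and are not taken from the cycloped.io framework.

```python
from collections import Counter


def term_and_document_frequency(documents):
    """Return (TF, DF) counters for an iterable of tokenized documents.

    TF counts every occurrence of a token in the whole corpus.
    DF counts the number of documents that contain the token at least once.
    """
    tf = Counter()
    df = Counter()
    for tokens in documents:
        tf.update(tokens)        # every occurrence counts
        df.update(set(tokens))   # each document counts at most once per token
    return tf, df


# Toy corpus (hypothetical data, two "articles").
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat"],
]
tf, df = term_and_document_frequency(docs)
```

In this toy corpus, "the" appears three times but in only two documents, so its TF and DF differ.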
For every word which needed classification, many features were created:
• TF and DF for the word and its lemma, using:
  - English Wikipedia,
  - Simple English Wikipedia,
  - corpora created from training and test sentences,
• length of sentence (number of words),
• length of word (number of characters),
• position of word in sentence,
• GloVe word embedding (Pennington et al., 2014).
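The surface features in this list (sentence length, word length, position) could be computed as in the following sketch. The function name `surface_features` and the zero-based position are assumptions made for illustration; the frequency and embedding features are omitted because they depend on external resources.

```python
def surface_features(tokens, index):
    """Compute the simple surface features for the token at `index`.

    `tokens` is a tokenized sentence (list of strings).
    The position is zero-based here, which is an assumption of this sketch.
    """
    word = tokens[index]
    return {
        "sentence_length": len(tokens),  # number of words in the sentence
        "word_length": len(word),        # number of characters in the word
        "position": index,               # position of the word in the sentence
    }


# Hypothetical example sentence.
example = surface_features(["A", "frugal", "meal"], 1)
```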

Evaluation
All experiments were conducted by employing cross-validation on raw vote data. Training data were aggregated: a word is labeled as complex if at least two annotators marked it accordingly.
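The aggregation rule can be written as a short sketch. It assumes that the raw votes for one word in one sentence are available as a list of 0/1 values, one per annotator; the function name `aggregate_votes` and the parameter `min_votes` are hypothetical.

```python
def aggregate_votes(votes, min_votes=2):
    """Label a word as complex (1) if at least `min_votes` annotators marked it.

    `votes` is a list of 0/1 values, one per annotator.
    """
    return 1 if sum(votes) >= min_votes else 0


# Hypothetical raw votes from 20 annotators.
one_vote = [1] + [0] * 19
two_votes = [1, 1] + [0] * 18
```

With the default of two votes, a word marked by a single annotator stays simple, whereas the organizers' aggregation described earlier (at least one annotator) would correspond to `min_votes=1`.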

Metrics
The results are scored using a harmonic mean of accuracy and recall (marked as G-score). In comparison to F-score (a harmonic mean of precision and recall), it is higher if more instances are predicted as complex.
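Both metrics can be computed from the confusion counts, as in the minimal sketch below. The function names and the toy labels are hypothetical; label 1 stands for "complex".

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative values (0.0 if both are zero)."""
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0


def g_and_f_score(gold, predicted):
    """Return (G-score, F-score) for binary labels, where 1 means complex."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    tn = sum(1 for g, p in pairs if g == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0

    g_score = harmonic_mean(accuracy, recall)   # accuracy and recall
    f_score = harmonic_mean(precision, recall)  # precision and recall
    return g_score, f_score


# Toy example: 2 complex words out of 10, the system over-predicts "complex".
gold = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
g, f = g_and_f_score(gold, predicted)
```

In this toy example the two false positives halve the precision but reduce accuracy only slightly, so the G-score stays well above the F-score.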

Experiments
Tree-based classifiers achieved the best results (except for word embeddings). Table 2 presents the G-scores obtained by training a classifier with each of the features. Combining features gives only a slightly better score.

Upper Bounds
Complex word identification is a subjective task. The understanding of a word depends on the knowledge of a particular person. Therefore, a 100% G-score is impossible to achieve. Because the training data was annotated by multiple annotators, it was possible to measure the inter-annotator agreement. Two theoretical systems were scored on the training data. Both systems have knowledge regarding the annotators' assessment of the words in sentences. The first one has information regarding the context (whole sentence): for each sentence, it knows how many annotators recognized each word as complex. The second one knows how many times each word was assessed as complex (without context).
1. The problem can be treated as simple classification and not sequence labeling. For every word in every sentence, the system predicts the word as complex if at least a fraction X of the annotators marked it as complex. The maximum G-score is 84.54% for X=10% and the maximum F-score is 51.66% for X=25%. This system has information regarding the word and the sentence. However, it is still not sequence classification: it has no information regarding the predictions of the other words in the sentence. Figure 1 presents the results as a function of X.
2. Going further, the input data can be solely words, without the sentence, so that annotations for the same word in different sentences can be aggregated. The system labels a word as complex if at least a fraction X of its annotations marked it as complex (this system has no information regarding the context of the sentence). The maximum G-score is 85.04% for X from 4% to 5%, and the maximum F-score is 51.71% for X from 26% to 27%. This system has information only about the word. Figure 2 presents the results as a function of X.
The results above show that a G-score of 86% cannot be exceeded on this data.
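The threshold sweep behind these bounds can be sketched as follows. This is an illustration under stated assumptions, not the exact procedure used for the reported numbers: it assumes that every individual annotation counts as one gold instance and that the oracle predicts a single label per item from the fraction of annotators who marked it as complex. The function name `oracle_sweep` and the toy votes are hypothetical.

```python
def oracle_sweep(vote_lists, thresholds):
    """Score an oracle that knows the fraction of 'complex' votes per item.

    `vote_lists` holds one list of 0/1 annotations per item (a word in a
    sentence for the first system, or a word type for the second).
    Returns {threshold: (G-score, F-score)}.
    """
    def harmonic_mean(a, b):
        return 2 * a * b / (a + b) if (a + b) > 0 else 0.0

    results = {}
    for x in thresholds:
        tp = fp = fn = tn = 0
        for votes in vote_lists:
            positives = sum(votes)
            negatives = len(votes) - positives
            if positives / len(votes) >= x:
                # Predicted complex: agrees with the positive annotations.
                tp += positives
                fp += negatives
            else:
                # Predicted simple: agrees with the negative annotations.
                fn += positives
                tn += negatives
        total = tp + fp + fn + tn
        accuracy = (tp + tn) / total
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        results[x] = (harmonic_mean(accuracy, recall),
                      harmonic_mean(precision, recall))
    return results


# Toy data: three items, four annotators each (hypothetical votes).
toy_votes = [[1, 1, 1, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
bounds = oracle_sweep(toy_votes, [0.25, 0.5])
```

Even on this toy data, the lower threshold favors the G-score and the higher threshold favors the F-score, which mirrors the pattern in the figures reported above.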

Final Submission
The experiments showed only a minimally higher score for more advanced classifiers using more features in comparison to the simple one-rule algorithm with one feature. Simple models are usually more difficult to overfit. With hashing, the algorithm classifies each word in O(1) time.
The final submission uses DF of Simple English Wikipedia. The scores, as a function of threshold, are presented in Figure 3.
The main submission is optimized for G-score, and its threshold is 147. Words with a DF exceeding this threshold are considered simple, and others are considered complex. A set of simple words contains almost 11 thousand tokens (without sanitization). The size of the model is 78 kilobytes.
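The one-rule classifier can be sketched in a few lines. The threshold of 147 comes from the text; the function names and the document-frequency values in the example are hypothetical. The sketch performs no case normalization, which is an assumption.

```python
def build_simple_vocabulary(document_frequency, threshold=147):
    """Keep the words whose DF exceeds the threshold; these count as simple."""
    return {word for word, df in document_frequency.items() if df > threshold}


def is_complex(word, simple_words):
    """A word is complex if it is absent from the simple vocabulary.

    Set membership is a hash lookup, so each query takes O(1) time on average.
    """
    return word not in simple_words


# Hypothetical DF counts, as if taken from Simple English Wikipedia.
toy_df = {"house": 5200, "water": 9800, "frugal": 12, "exactly": 147}
simple_words = build_simple_vocabulary(toy_df)
```

A word whose DF equals the threshold is treated as complex, because only a DF strictly exceeding 147 makes it simple. Words that never occur in the corpus are also complex by construction.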
The second submission was optimized for F-score, and its threshold was 18. Table 3 shows the top 10 results of the systems on the test data in terms of G-score. The system tied for fourth place with two other systems.

Results and Discussion
The best system, SV000gg, ensembles 23 distinct systems using 69 morphological, lexical, semantic, collocation, and nominal features. The system is much more advanced than the one presented in this paper. Its result is higher by almost one percentage point. The next system in the ranking, TALN-WEI, uses external resources (WordNet and simple/complex word lists) and tools (a part-of-speech tagger and a dependency parser). A random forest classifier is then trained.
JUNLP-NaiveBayes employs word sense disambiguation and features extracted from an ontology. A random forest classifier is also used. Additional word lists are developed, i.e. scientific, geographical, and non-English.
Surprisingly, UWB-ALL is almost the same as the system presented in this article (the English version of Wikipedia is used, not Simple English).
The presented system took first place in terms of F-score. The higher score is probably due to this submission being optimized for F-score with no other teams doing this.
Beating 85% G-score is not possible without more information. Modeling the knowledge of each individual person might improve the results. However, this approach needs historic data annotated by a specific user, and the predictions would be relevant only for that user.