The Whole is Greater than the Sum of its Parts: Towards the Effectiveness of Voting Ensemble Classifiers for Complex Word Identification

In this paper, we present an effective system using voting ensemble classifiers to detect contextually complex words for non-native English speakers. To make the final decision, we combine a set of eight calibrated classifiers based on lexical, size and vocabulary features, and train our model on annotated datasets collected from a mixture of native and non-native speakers. We then test our system on three datasets, namely News, WikiNews, and Wikipedia, and report competitive results, with an F1-Score ranging from 0.777 to 0.855 across the datasets. Our system outperforms multiple other models and falls within 0.026 to 0.042 of the best-performing model's score in the shared task.


Introduction
Complex Word Identification (CWI) is an essential sub-task of Lexical Simplification. Lexical Simplification involves substituting a complicated word in a text with a more straightforward synonym. Figure 1 shows the pipeline for Lexical Simplification systems. It is geared towards target populations such as non-native speakers, second-language learners, young learners, and people with language disabilities (like Aphasia and Alexia), with the aim of allowing them to fully comprehend the presented text.
The goal of the shared task is as follows: given a target word (or phrase) and its context, we are to computationally determine whether the target is complex or not. Unlike in the SemEval 2016 shared task, the target here can consist of more than one word (e.g., teenage girl), and the context can stretch over multiple sentences.
The rest of the paper is organized as follows. In Section 2, we mention related work in the area of Complex Word Identification, in particular the previous shared task at SemEval 2016 (Paetzold and Specia, 2016a). Section 3 describes the dataset of NLP-BEA's CWI shared task at NAACL 2018. In Section 4, we describe our system, the features used, and our classification methodology. We then report our results in Section 5 and discuss them in Section 6. We conclude by recapitulating our paper in Section 7 and identifying future work.

Related Work
In SemEval 2016, 21 teams participated in a shared task on complex word identification (Paetzold and Specia, 2016a). The competition involved finding out whether a given word in a sentence was complex or not for a non-native speaker. The dataset used was completely in English.
In this task, the winning team used a soft voting-based approach over the outputs of 21 predictors (either classifiers, threshold-based, or lexical) (Paetzold and Specia, 2016b). This system was the best according to the G-Score, an evaluation metric designed specifically for this task at SemEval 2016 (Paetzold and Specia, 2016a). The system with the best F1-Score made use of a threshold-based approach that marked a word as complex if its frequency in Simple Wikipedia was below a threshold (Wróbel, 2016).
Other systems at the SemEval 2016 shared task used SVM (Kuru, 2016; Choubey and Pateria, 2016; S P et al., 2016), Random Forest (Davoodi and Kosseim, 2016; Mukherjee et al., 2016; Brooke et al., 2016; Ronzano et al., 2016), Neural Networks (Bingel et al., 2016; Nat, 2016), Decision Trees (Quijada and Medero, 2016), Nearest Centroid classifier (Palakurthi and Mamidi, 2016), Naive Bayes (Mukherjee et al., 2016), threshold bagged classifiers (Kauchak, 2016) and Entropy classifiers (Konkol, 2016; Martínez Martínez and Tan, 2016). Most systems shared a common set of features, such as length-based features (like target word length), presence in a corpus (like presence of the target word in Simple English Wikipedia), PoS features of the target word, and position features (position of the target word in the sentence). However, a few systems used more innovative features. One of them was the MRC Psycholinguistic database (Wilson, 1988), used by Davoodi and Kosseim (2016). Another system, by Konkol (2016), used a single feature, namely the document frequency of the word in Wikipedia, as input to a maximum entropy classifier.

Datasets
For this shared task (Yimam et al., 2018), we used only the English monolingual dataset, which draws on data from a number of sources: News articles, WikiNews and Wikipedia articles. Table 3 shows details such as the total number of sentences and the number of unique sentences that we computed across all three datasets. The WIKIPEDIA dataset consisted of sentences from Wikipedia articles. Likewise, the WIKINEWS dataset and the NEWS dataset contained sentences from news articles. The difference between the two is that the articles in the NEWS dataset were written by professional journalists, while those in the WIKINEWS dataset were written by less experienced writers.
In the majority of instances, the target was a single word. However, a few targets were more than one word long. Similarly, in most cases the context was only one sentence, except for a few instances in which the context was as long as 3-4 sentences. The training datasets were annotated by 10 native and 10 non-native English speakers. If even one of them found the word to be difficult, it was annotated as complex.
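The annotation rule described above (a target is complex as soon as one annotator marks it so) can be illustrated with a minimal sketch. The function name and the 0/1 vote encoding are assumptions made for illustration, not part of the shared-task tooling.

```python
def binary_label(native_votes, non_native_votes):
    """Return 1 (complex) if at least one annotator marked the target as complex.

    Each vote is 1 if that annotator found the target difficult, else 0.
    The encoding is an assumption made for this illustration.
    """
    return int(any(native_votes) or any(non_native_votes))


# 10 native and 10 non-native annotators; a single "difficult" vote suffices.
native = [0] * 10
non_native = [0] * 9 + [1]
label = binary_label(native, non_native)
```

Under this rule the complex label is easy to trigger, since 19 of 20 annotators can disagree and the target is still labelled complex.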

Methodology
In this section, we describe the experimental setup, including the features used, and provide the rationale for their selection. This is followed by a detailed system overview, which explains the system's architecture.

Feature Sets
We investigated several intuitive properties of the target word such as its relevant lexical attributes, length properties and presence in certain word lists.

Lexical Features
The following features were extracted using WordNet (Fellbaum, 1998) for the target word:
• Degree of Polysemy (DP): Number of senses of the target word in WordNet. This is operationalized by counting the number of Synsets of the target word in WordNet. Words with a larger number of Synsets have several senses and were found to be more ambiguous.
• Hyponym (Ho) and Hypernym (He) Tree Depth (TD): These help in finding lexical relations. To locate the word in WordNet's hierarchical tree, we capture its depth. General and simple words tend to be at the top of the tree. We count the Hyponyms and Hypernyms of each Synset of the target word and use the average over all of its Synsets as a feature.
• Holonym Count (HC) and Meronym Count (MC): An alternative way to traverse WordNet's hierarchical tree is to consider the relationship of the target word to its components (Meronyms) or to the things it is contained in (Holonyms). Holonyms tend to be simpler than meronyms: a holonym is a general word for a group of entities, while meronyms refer to specific entities within that group.
• Verb Entailments (VE): Verbs, being action words, often carry entailment relationships. For example, the act of roosting involves the act of sitting, so roosting entails sitting. On average, target words with multiple entailments were found to be relatively complex, since they tend to be visually more vivid and hence harder to comprehend. The number of verb entailments of the target word was therefore also part of our feature set.
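The lexical features listed above can be illustrated with a minimal sketch. To stay self-contained, it uses a tiny hand-made lexicon as a hypothetical stand-in for WordNet; the entries, the feature names and the choice to average every relation count over the senses of the target are assumptions for illustration, not the exact procedure used in our system.

```python
from statistics import mean

RELATIONS = ("hypernyms", "hyponyms", "holonyms", "meronyms", "entailments")

# Hypothetical toy lexicon standing in for WordNet (NOT real WordNet data).
# Each word maps to a list of senses; each sense lists its related words.
TOY_LEXICON = {
    "roost": [
        {"hypernyms": ["sit"], "entailments": ["sit"]},     # verb sense
        {"hypernyms": ["shelter"], "hyponyms": ["perch"]},  # noun sense
    ],
    "sit": [
        {"hyponyms": ["roost", "perch", "squat"]},
    ],
}


def lexical_features(target, lexicon=TOY_LEXICON):
    """Degree of polysemy plus per-sense average counts of each relation."""
    senses = lexicon.get(target, [])
    features = {"degree_of_polysemy": len(senses)}
    for relation in RELATIONS:
        counts = [len(sense.get(relation, [])) for sense in senses]
        # Unknown words have no senses; fall back to 0.0 for every average.
        features["avg_" + relation] = mean(counts) if counts else 0.0
    return features


roost_features = lexical_features("roost")
```

A real feature extractor would query WordNet itself rather than a toy dictionary; the structure of the computation (count senses, then aggregate relation counts over them) would stay the same.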

Other Features
In addition to the lexical features, we also make use of size-based features and vocabulary-based features. These features are defined in Table 3.
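As a rough illustration of size-based and vocabulary-based features, the sketch below computes word length, vowel count, syllable count, word count and word-list membership for a target. The exact definitions are those of Table 3; the ones used here (in particular the vowel-run syllable heuristic and the placeholder word list) are simplifying assumptions.

```python
import re

VOWELS = set("aeiou")


def syllable_count(word):
    """Rough heuristic (an assumption): count runs of consecutive vowels, y included."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def size_and_vocab_features(target, word_list):
    """Size and vocabulary features for a target word or phrase.

    `word_list` is a hypothetical set of lower-cased vocabulary-list entries.
    """
    words = target.lower().split()
    letters = "".join(words)  # characters of the target, spaces excluded
    return {
        "word_length": len(letters),
        "vowel_count": sum(ch in VOWELS for ch in letters),
        "syllable_count": sum(syllable_count(w) for w in words),
        "word_count": len(words),
        "in_word_list": int(any(w in word_list for w in words)),
    }


# Placeholder word list; a real system would load the full vocabulary list.
toy_word_list = {"aberration"}
phrase_features = size_and_vocab_features("teenage girl", toy_word_list)
```

The vowel-run heuristic over-counts silent vowels (for example, it assigns three syllables to "teenage"), so a pronunciation dictionary would be a better choice where accuracy matters.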

System Overview
These input features are fed to eight calibrated classifiers (see Table 2). These eight classifiers were chosen because they gave the best results under 10-fold cross-validation on the training set: each of them achieved an F1-Score above 0.70 on the complex class. Table 2 describes the selected and rejected classifiers, along with their Precision, Recall and F1-Score under 10-fold cross-validation on the training data. Since the majority class was the non-complex class, the ZeroR classifier never predicts the complex class and therefore has a Precision, Recall, and F1-Score of 0 for it.
We use a hard voting approach to predict the class of the target word. If more than four of the eight classifiers agree on a label (complex or simple), we assign that majority label to the word. In case of a 4-4 tie (where 4 classifiers say the target word is complex and 4 say that it is simple), we use a word-embedding based classifier as a tie-breaker. For this classifier, we use the GloVe pre-trained word embeddings (Pennington et al., 2014). We first split the target into its constituent words (in most cases it is a single word, but in a few cases it is a phrase). For each constituent word, we find the most similar word in the training set. If any of these most similar words was tagged as complex, we tag the target as complex as well.
Out of 4252 test instances, a tie occurred 173 times, leaving the ensemble classifiers unable to make a call. This is about 4.07% of the predictions, a non-negligible share, so the tie-breaker meaningfully refines the hard voting.
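The hard-voting rule with a tie-breaker can be sketched as follows. This is a minimal illustration under the assumption that each classifier outputs 1 for complex and 0 for simple; the function name and the callback interface for the tie-breaker are hypothetical.

```python
def hard_vote(votes, tie_breaker):
    """Majority vote over binary predictions (1 = complex, 0 = simple).

    `tie_breaker` is a zero-argument callable that is consulted only when
    the votes are evenly split, e.g. 4-4 with eight classifiers.
    """
    complex_votes = sum(votes)
    simple_votes = len(votes) - complex_votes
    if complex_votes > simple_votes:
        return 1
    if simple_votes > complex_votes:
        return 0
    return tie_breaker()


# Five of eight classifiers say "complex": the majority label wins outright.
clear_case = hard_vote([1, 1, 1, 1, 1, 0, 0, 0], tie_breaker=lambda: 0)

# A 4-4 split: the decision is delegated to the tie-breaker.
tied_case = hard_vote([1, 1, 1, 1, 0, 0, 0, 0], tie_breaker=lambda: 1)
```

Passing the tie-breaker as a callable means the comparatively expensive embedding lookup runs only for the tied instances.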
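For the tied cases, the embedding-based tie-breaker can be sketched as a nearest-neighbour lookup. The two-dimensional vectors below are made-up stand-ins for GloVe embeddings, and the use of cosine similarity and the skipping of out-of-vocabulary words are assumptions made for this illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embedding_tie_breaker(target, embeddings, train_labels):
    """Return 1 (complex) if, for any constituent word of the target, the
    most similar training word was labelled complex; otherwise return 0."""
    candidates = [w for w in train_labels if w in embeddings]
    for word in target.lower().split():
        if word not in embeddings or not candidates:
            continue  # assumption: words without a vector are skipped
        nearest = max(candidates,
                      key=lambda w: cosine(embeddings[word], embeddings[w]))
        if train_labels[nearest] == 1:
            return 1
    return 0


# Made-up 2-d vectors standing in for pre-trained embeddings.
TOY_EMBEDDINGS = {
    "roost": [0.9, 0.1],
    "perch": [0.8, 0.2],
    "sit": [0.1, 0.9],
    "girl": [0.2, 0.8],
}
# Hypothetical training labels: 1 = complex, 0 = simple.
TRAIN_LABELS = {"perch": 1, "sit": 0}

roost_label = embedding_tie_breaker("roost", TOY_EMBEDDINGS, TRAIN_LABELS)
girl_label = embedding_tie_breaker("girl", TOY_EMBEDDINGS, TRAIN_LABELS)
```

Here "roost" lies closest to "perch" (labelled complex) and is tagged complex, whereas "girl" lies closest to "sit" and is tagged simple.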

Results and Analysis
In this section, we discuss the results and reflect on the significance of each of the features for this task. Table 4 gives the results of our experiments on the test set. Our system placed 4th on the WIKINEWS dataset, 5th on the WIKIPEDIA dataset, and 6th on the NEWS dataset. Figure 3 shows the important features, ranked by significance. The size-based features, namely Word Length, Vowels Count, Syllable Count and Word Count, made up the four top-ranked features. Another useful indicator of a complex word is its presence in Barron's GRE Word List, a list of vocabulary at the level expected of a graduate student.

Discussion
As is evident from Tables 2 and 4, individual classifiers do not work as well as their ensemble, which agrees with the expression "The whole is greater than the sum of its parts". Classifier ensembling could likewise prove effective for contextual document-similarity-based binary classification tasks (Kanojia et al., 2017), which rely heavily on lexical features, and could potentially cross-pollinate to probabilistic touch classification problems, where spatial and contextual information has proven to be pivotal.

Conclusion and Future Work
In this paper, we described our participation in NLP-BEA's CWI 2018 Shared Task at NAACL on Complex Word Identification. We presented and evaluated our system on three datasets and showed that ensemble classifiers with hard voting and a GloVe-based tie-breaker, built on lexical, size and vocabulary features, are effective for identifying complex words.
As part of our future work, we plan to incorporate Part-of-Speech (POS) tags, Named Entity Recognition (NER) tags and word position features to improve our existing system.