Pomona at SemEval-2016 Task 11: Predicting Word Complexity Based on Corpus Frequency

We introduce a word frequency-based classifier for the SemEval 2016 complex word identification task (#11). Words with lower frequency are predicted as complex based on a threshold optimized for G-score. We examine three different corpora for calculating frequencies and find English Wikipedia to perform best (ranked 13th on the SemEval task), followed by the Google Web Corpus and lastly Simple English Wikipedia. Bagging is also shown to slightly improve the performance of the classifier. Overall, we find word frequency to be a strong predictor of complexity. On the SemEval "test" set, a frequency classifier that uses the optimal frequency threshold performs on par with the best submitted system, and a system trained using only 500 labeled examples split from the test set achieves results that are only slightly below the best submitted system.


Introduction
Text simplification aims to transform text into more accessible versions while retaining the original meaning. A frequent subproblem of the general simplification problem is complex word identification: identifying words in a text that are difficult for the reader to understand. Complex word identification is critical in lexical simplification algorithms, where simplification is done one word at a time; frequently, simplification is broken into two steps, first identifying the complex words that need simplifying and, second, determining substitutions for these words (Shardlow, 2014). Even for simplification systems that make sentence-level transformations (Siddharthan, 2014), complex word identification can be used as an additional feature function in the model and as a development tool to help measure progress. Additionally, in some domains such as health and medicine, accuracy is critical and semi-automated simplification tools are common (Leroy et al., 2012). In these domains, complex word identification helps guide the simplification process both by identifying which words need to be simplified and by filtering/ranking possible candidate substitutions.
In this paper, we explore the use of word frequency as a predictor of the complexity of that word. Corpus studies have shown that simpler texts contain more frequent words than more complicated texts (Breland, 1996; Pitler and Nenkova, 2008; Leroy et al., 2012). User studies have also shown a correlation between word frequency and whether users know the definition of a word. In semi-automated text simplification approaches, replacing less frequent words with higher-frequency synonyms has been shown to produce text that people view as simpler and find easier to understand.

Bagged Frequency Classifier
Given a sentence S = s_1 s_2 ... s_m and a word in that sentence, s_j, the complex word identification task is to predict whether that word is complex (1) or not (0). Labeled examples are triples consisting of the sentence, the word and the label, i.e. ⟨S, s_j, {0, 1}⟩, and unlabeled examples consist only of the sentence and word, ⟨S, s_j⟩. We view the problem as a supervised classification problem: given a collection of training examples, the goal is to learn a classifier to predict the label of unlabeled examples. See the SemEval Task 11 description for more details (Paetzold and Specia, 2016).
We utilize bagging (bootstrap resampling) to learn and combine multiple basic classifiers, each of which predicts by thresholding the frequency of occurrence of the word in question (s_j) in an external corpus. Classification is then done by majority vote of these classifiers. The subsections below provide more details.

Basic frequency classifier
The basic frequency classifier predicts the word complexity using only a single feature: the frequency of the word in question (s_j) in a corpus. Given an unlabeled example, the classifier predicts based on a learned threshold, α: if freq(s_j) < α, the word is predicted as complex (1); otherwise it is predicted as not complex (0). The underlying assumption is that words that occur less frequently in a corpus are more complex.
To train the basic classifier, we select α exhaustively by considering all word frequencies seen in the training examples as candidate thresholds. Specifically, for each training example ⟨S, s_j, {0, 1}⟩, we consider using α = freq(s_j). We select the α that maximizes the G-score on the training set, where the G-score is defined as:

G-score = (2 · accuracy · recall) / (accuracy + recall)

We chose to optimize the G-score since it was the metric used for evaluation in the SemEval task (Paetzold and Specia, 2016), though other metrics could be used instead.
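The exhaustive threshold search can be sketched as follows. This is a minimal illustration, not the authors' released code; the function names and the representation of examples as (word, label) pairs with a frequency dictionary are assumptions.

```python
def g_score(accuracy, recall):
    """Harmonic mean of accuracy and recall (the SemEval Task 11 metric)."""
    if accuracy + recall == 0:
        return 0.0
    return 2 * accuracy * recall / (accuracy + recall)

def train_threshold(examples, freq):
    """Pick the frequency threshold alpha that maximizes G-score.

    examples: list of (word, label) pairs, label 1 = complex, 0 = not complex.
    freq: dict mapping word -> corpus frequency.
    """
    best_alpha, best_g = 0, -1.0
    # Candidate thresholds are exactly the frequencies seen in training.
    candidates = {freq.get(word, 0) for word, _ in examples}
    for alpha in candidates:
        # Predict complex (1) when the word is rarer than the threshold.
        preds = [1 if freq.get(word, 0) < alpha else 0 for word, _ in examples]
        correct = sum(p == y for p, (_, y) in zip(preds, examples))
        accuracy = correct / len(examples)
        positives = sum(y for _, y in examples)
        tp = sum(p == 1 and y == 1 for p, (_, y) in zip(preds, examples))
        recall = tp / positives if positives else 0.0
        g = g_score(accuracy, recall)
        if g > best_g:
            best_alpha, best_g = alpha, g
    return best_alpha
```

With frequencies that cleanly separate the two classes, the search recovers a threshold between the common and rare words.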

Word frequencies
Word frequencies can be pre-calculated from any corpus. For this paper we examined three corpora: articles from Simple English Wikipedia, articles from English Wikipedia and the Google Web Corpus (Brants and Franz, 2006). For the Wikipedia articles we used the document-aligned data set created by Kauchak (2013) consisting of approximately 60K articles on the same topics from each Wikipedia.
To collect word frequencies for the two Wikipedia variants, tokenization was first performed using the Stanford CoreNLP PTBTokenizer (Manning et al., 2014) and then frequencies were calculated. For the Google Web Corpus, we used the unigram counts. In all corpora, all capitalization variants were aggregated, e.g. occurrences of "natural" and "Natural" would both be counted towards the same word form.
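As a sketch, case-folded frequency counting over pre-tokenized sentences might look like the following; the function name and input format are illustrative assumptions.

```python
from collections import Counter

def word_frequencies(tokenized_sentences):
    """Count token frequencies, aggregating capitalization variants.

    tokenized_sentences: iterable of token lists (e.g. output of a tokenizer).
    """
    counts = Counter()
    for sentence in tokenized_sentences:
        # Lowercase so "Natural" and "natural" count toward the same form.
        counts.update(token.lower() for token in sentence)
    return counts
```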

Bagging
To improve classifier performance and reduce variance, we investigated the use of bagging (Breiman, 1996), also referred to as bootstrap resampling. An individual classifier is trained by first sampling the training data with replacement to create a bootstrap sample of the same size, and then training a basic frequency classifier on that sample. These two steps are repeated b times, resulting in b different classifiers. To classify a new, unlabeled example, each of the b classifiers makes a prediction and the final label is the label with the majority vote, with ties going to not complex (0), since this was the more frequent class.
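The bagging procedure can be sketched as follows. This is an illustration under the same assumptions as before (examples as (word, label) pairs, a frequency dictionary), with a compact copy of the threshold training inlined so the sketch stands alone.

```python
import random

def fit_threshold(examples, freq):
    """Pick alpha maximizing G-score = 2*acc*rec/(acc+rec) on the sample."""
    best_alpha, best_g = 0, -1.0
    for alpha in {freq.get(w, 0) for w, _ in examples}:
        preds = [(1 if freq.get(w, 0) < alpha else 0, y) for w, y in examples]
        acc = sum(p == y for p, y in preds) / len(preds)
        pos = sum(y for _, y in preds)
        rec = sum(p == 1 and y == 1 for p, y in preds) / pos if pos else 0.0
        g = 2 * acc * rec / (acc + rec) if acc + rec else 0.0
        if g > best_g:
            best_alpha, best_g = alpha, g
    return best_alpha

def bagged_thresholds(examples, freq, b=10, seed=0):
    """Train b classifiers, each on a bootstrap resample of the training data."""
    rng = random.Random(seed)
    return [fit_threshold([rng.choice(examples) for _ in examples], freq)
            for _ in range(b)]

def predict(word, freq, thresholds):
    """Majority vote; ties go to not complex (0), the more frequent class."""
    votes = sum(freq.get(word, 0) < alpha for alpha in thresholds)
    return int(votes > len(thresholds) / 2)
```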

Experiments
We submitted two systems to the SemEval Complex Word Identification challenge (Task 11); they used the same parameter settings and differed only in where the corpus frequencies were collected: English Wikipedia (NormalBag) and the Google Web Corpus (GoogleBag). Both systems used b = 10 rounds of bagging, the value that performed best in repeated rounds of 10-fold cross-validation on the training data. We also discuss results here for a system that used Simple English Wikipedia word frequencies, though we did not submit it to the challenge (for consistency, we denote it SimpleBag). The task consisted of a training data set (N = 2,237), which was available during development, and a test data set (N = 88,221) on which the competition was scored; the test labels were only released after the competition (Paetzold and Specia, 2016). We use both data sets here to analyze the performance of the classifiers.

Table 1 shows the results for the classifiers trained using bagging with word frequencies calculated from the three different corpora. English Wikipedia performed the best, followed by the Google Web Corpus and finally Simple English Wikipedia. The top two entries were entered into the SemEval competition and ranked 13th and 16th, respectively, out of 51 systems (42 team-submitted systems and 9 baseline systems). We hypothesize that English Wikipedia represents a good compromise between size/coverage and corpus quality; even though NormalBag had slightly lower recall than the other two, it achieved that recall with significantly higher accuracy.

The impact of corpus choice
To verify that the differences in performance between the three systems were significant, we used bootstrap resampling with a paired sample t-test (Koehn, 2004). Based on 100 random samples, all differences between the systems were significant on all metrics (p < 0.0001, with Bonferroni correction for multiple comparisons).
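The resampling step can be sketched as follows. Note this is a simplified sign-style bootstrap comparison, not the exact bootstrap-plus-paired-t-test procedure of Koehn (2004): it only counts how often one system's per-example scores beat the other's under resampling.

```python
import random

def paired_bootstrap(scores_a, scores_b, samples=100, seed=0):
    """Fraction of bootstrap resamples where system A outscores system B.

    scores_a, scores_b: per-example scores (e.g. 1 if classified correctly,
    0 otherwise) for the two systems on the same test examples.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(samples):
        # Resample example indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / samples
```

A win fraction near 1.0 (or near 0.0) across many resamples indicates a consistent difference between the systems.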
Overall, relative to other systems submitted for the SemEval task, these frequency-based classifiers were biased towards recall; e.g., the Google frequency and English Wikipedia frequency systems ranked 3rd and 5th, respectively, with respect to recall (of the 42 team-submitted systems).

The impact of bagging
To measure the impact of bagging on prediction performance, for each corpus source we compared the basic frequency classifier to the bagged variant. We generated 100 random 10-fold partitions of the training data and performed 10-fold cross-validation on each for each system variant. We averaged the results across each 10-fold partition, resulting in 100 scores per system. Table 2 shows the averages over these 100 scores. For both of the Wikipedia variants (Normal and Simple), bagging provided a small increase in performance (p < 0.0001 based on a paired t-test). For the Google frequencies, the performance actually decreased, though this decrease was not significant.
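Generating one random 10-fold partition can be sketched as follows (a minimal illustration; the experiment above repeats this 100 times with different seeds and averages).

```python
import random

def k_fold_splits(examples, k=10, seed=0):
    """Shuffle the examples and partition them into k disjoint folds."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    # Stride slicing yields k folds whose sizes differ by at most one.
    return [shuffled[i::k] for i in range(k)]
```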
To understand the effect that b (the number of bootstrap samples) has on the performance of the classifier, we compared the performance of the classifier with b = 1, 2, ..., 100. Figure 1 shows a plot of the G-score versus the number of bags used by the NormalBag classifier on the training set. As in the previous experiment, to partially mitigate noise, we generated 100 random 10-fold partitions and averaged the results across all of them.
For small b, increasing the number of classifiers voting does increase the performance of the classifier. However, beyond around b = 10, adding more classifiers degrades the performance of the classifier. Although the difference between b = 1 and b = 10 is small (0.677 vs. 0.680), it is statistically significant.

Limits of frequency-based classification
Using English Wikipedia frequencies and the optimal frequency threshold (i.e. α), the basic threshold classifier achieves a G-score of 0.779 on the test data set. This is slightly higher than the best scoring SemEval system, which achieved 0.774. Clearly frequency provides a strong signal for word complexity.
The previous experiment assumes an unreasonable scenario where we know the labels and can pick the optimal value. To better understand the impact of frequency, we split the test data into 10 folds and performed a 10-fold cross-validation analysis using the basic threshold classifier, training the threshold on 90% of the SemEval "test" data and then testing on the remaining 10%. In this scenario, the threshold classifier still achieves a G-score of 0.764, only slightly less than the score achieved using the optimal threshold. 0.764 is still significantly higher than the score achieved by the system when trained on the SemEval "training" set. Two possible differences exist between the training and testing data. First, the test data is two orders of magnitude larger than the original training data; this additional data could result in a more reliable classifier. Alternatively, the train and test sets were generated in different ways and could have different characteristics.
To investigate this, we held out 10% of the "test" data set as testing data and trained the basic threshold classifier on increasing amounts of the remaining 90%. Figure 2 shows the G-score for training sizes up to 1,000 (the G-score mostly stabilized beyond 1,000 with only minor variation). Even with only 250 training examples, the classifier already achieves a G-score of 0.748, and with 500 training examples it achieves 0.752, only a little less than the score of 0.760 obtained using all of the training data. For the frequency classifier, the domain of the training data, more than its size, accounts for the differences in performance observed.

Conclusion
In this paper, we described our entry for the SemEval 2016 complex word identification task (#11). We utilize word frequency to classify complexity, with less frequent words being classified as complex. Consistent with previous corpus studies, frequency is a very strong predictor of the complexity of a word. However, the corpus from which those frequencies are measured does play a role in performance; we found that English Wikipedia performed best for this particular task. Future research is needed to investigate this phenomenon more broadly.