Complex Word Identification Based on Frequency in a Learner Corpus

We introduce the TMU systems for the Complex Word Identification (CWI) Shared Task 2018. The TMU systems use random forest classifiers and regressors whose features are the number of characters, the number of words, and the frequency of the target words in various corpora. Our simple systems performed best on 5 of the 12 tracks. Our ablation analysis revealed the usefulness of a learner corpus for the CWI task.


Introduction
Lexical simplification is one of the approaches to text simplification (Shardlow, 2014), which facilitates reading comprehension for children and language learners. Lexical simplification comprises the following steps:
1. Complex word identification
2. Substitution generation
3. Substitution selection
4. Substitution ranking
In this study, we work on complex word identification (CWI) (Shardlow, 2013), a subtask of lexical simplification.
Previous studies (Specia et al., 2012; Paetzold and Specia, 2016a) concluded that the most effective way to estimate word difficulty is to count the word frequency in a corpus. However, they counted word frequency in corpora written by native speakers, such as Wikipedia. Language learners tend to use simpler words than native speakers. Therefore, we expect the word frequency in a learner corpus to be a useful feature for CWI.
Our CWI system considers the word frequency in a learner corpus as well as in corpora written by native speakers. We use the Lang-8 corpus 1 (Mizumoto et al., 2011), a large-scale learner corpus available in many languages.
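The intuition above can be sketched with a toy example: a word that native-speaker text uses freely but learner text avoids is a candidate complex word. The corpora here are small hand-made stand-ins for Wikipedia and Lang-8, not the actual data.

```python
from collections import Counter

def corpus_frequencies(sentences):
    """Count lowercased token frequencies in a tokenized corpus."""
    counts = Counter()
    for sentence in sentences:
        counts.update(token.lower() for token in sentence.split())
    return counts

# Toy stand-ins for a native corpus (e.g., Wikipedia) and a learner corpus (e.g., Lang-8).
native_corpus = ["The lieutenant defected during the clashes .",
                 "A neighbor heard shots ."]
learner_corpus = ["I heard shots near my house .",
                  "My neighbor is kind ."]

native_freq = corpus_frequencies(native_corpus)
learner_freq = corpus_frequencies(learner_corpus)

# "defected" appears in the native text but never in the learner text,
# which is the kind of signal the learner-corpus frequency feature captures.
print(native_freq["defected"], learner_freq["defected"])  # 1 0
```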

CWI Shared Task 2018
In CWI shared tasks, systems predict whether words in a given context are complex or non-complex for a non-native speaker. The first CWI shared task (Paetzold and Specia, 2016a; Zampieri et al., 2017) contained only English data designed for non-native English speakers. In total, 20 annotators were assigned to each instance in the training set; however, in the test set, only one annotator was assigned to each instance. By contrast, the CWI Shared Task 2018 (Yimam et al., 2018) used a multilingual dataset (Yimam et al., 2017a,b) in which all instances were annotated by multiple annotators. This shared task was divided into two tasks (binary and probabilistic classification) and four tracks (English, German, Spanish, and French). The English dataset contained a mixture of professionally written news, non-professionally written news (WikiNews), and Wikipedia articles. The datasets for languages other than English were drawn from Wikipedia articles. Tables 1 and 2 display the dataset and the number of instances, respectively.

Sentence | Target | Label | Probability
According to Goodyear, a neighbor heard gun shots. | shots | 0 | 0.00
According to Goodyear, a neighbor heard gun shots. | according to | 1 | 0.05
A lieutenant who had defected was also killed in the clashes. | defected | 1 | 0.45
A bad part of the investigation is we may not get the why. | investigation | 1 | 0.95

Binary Classification Task
Labels in the binary classification task were assigned as follows:
0: simple word (none of the annotators marked the word as difficult)
1: complex word (at least one annotator marked the word as difficult)
We evaluated the systems using the macro-averaged F1-score.
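The binary labeling rule reduces to a one-line predicate over the per-annotator judgments, sketched here with hypothetical boolean annotation lists:

```python
def binary_label(annotations):
    """Label a target 1 (complex) if at least one annotator marked it
    as difficult, and 0 (simple) otherwise."""
    return 1 if any(annotations) else 0

# annotations: one boolean per annotator for a single target
print(binary_label([False, False, False]))  # 0: no annotator found it difficult
print(binary_label([False, True, False]))   # 1: a single annotator suffices
```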

Probabilistic Classification Task
Labels in the probabilistic classification task were assigned as the proportion of annotators who identified the target as complex. Systems were evaluated using the mean absolute error (MAE).
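Both the probabilistic label and the MAE metric are short computations; the sketch below uses hypothetical annotation lists and prediction values for illustration:

```python
def probabilistic_label(annotations):
    """Proportion of annotators who marked the target as complex."""
    return sum(annotations) / len(annotations)

def mean_absolute_error(gold, predicted):
    """Mean absolute error between gold probabilities and system outputs."""
    return sum(abs(g - p) for g, p in zip(gold, predicted)) / len(gold)

# 1 of 20 annotators marks "according to" as complex -> probability 0.05
print(probabilistic_label([True] + [False] * 19))  # 0.05

# Hypothetical system predictions against the gold probabilities
gold = [0.00, 0.05, 0.45, 0.95]
predicted = [0.10, 0.00, 0.50, 0.90]
print(mean_absolute_error(gold, predicted))  # average absolute deviation
```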

TMU Systems
Following previous studies (Specia et al., 2012; Paetzold and Specia, 2016a), we estimated word difficulty by counting word frequency.

Classifiers
We used random forest classifiers and random forest regressors for the binary and probabilistic classification tasks, respectively. We examined all combinations of the following hyperparameters 2 :
• n_estimators: {10, 50, 100, 500, 1000}
Table 3 shows all the features used by our systems. First, we used the heuristic that longer words are harder to understand, encoded as the number of characters. For example, Flesch reading ease (Flesch, 1948), which is frequently used in research on text simplification, relies on this heuristic.
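Exhaustively trying all hyperparameter combinations can be sketched with a Cartesian product over the grid. Note the assumptions: only the n_estimators values appear in the text above; max_depth is a hypothetical second hyperparameter added purely for illustration, and in practice a random forest would be fitted and validated for each setting.

```python
from itertools import product

# Hyperparameter grid: n_estimators values are from the paper;
# max_depth is a hypothetical second hyperparameter for illustration.
grid = {
    "n_estimators": [10, 50, 100, 500, 1000],
    "max_depth": [None, 5, 10],
}

# Enumerate every combination of hyperparameter values.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

print(len(combinations))  # 15 settings to try
print(combinations[0])    # {'n_estimators': 10, 'max_depth': None}
```

In practice this enumeration is what scikit-learn's grid search performs internally when tuning a random forest.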

Features
Second, as shown in Table 1, the targets include both words and phrases. As long phrases tend to be less frequent, we used the number of words as the second feature.
The other features (3-8) are based on the frequency of the targets in a corpus. We counted frequencies in texts written by native speakers and by language learners. Language learners are more likely to use simple words than native speakers; therefore, we expected the word frequency in the learner corpus to be a useful feature for CWI. For text written by native speakers, we counted frequencies in Wikipedia and WikiNews. For text written by language learners, we counted frequencies in the Lang-8 corpus (Mizumoto et al., 2011). The Lang-8 corpus contains texts before and after correction, written by learners and native speakers, respectively; we used the former.
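Putting the features together, one feature vector per target might look like the sketch below. The exact layout (raw plus relative frequency per corpus) is an assumption for illustration, the corpora are toy stand-ins for Wikipedia, WikiNews, and Lang-8, and the unigram counter only handles single-word targets; phrase targets would require n-gram counts.

```python
from collections import Counter

def extract_features(target, corpora):
    """Build a feature vector for one target: character count, word
    count, and the target's raw and relative frequency in each corpus.
    This layout is an assumption for illustration."""
    features = [len(target), len(target.split())]
    for counts, total in corpora:
        freq = counts[target.lower()]  # 0 for unseen targets
        features.append(freq)          # raw frequency
        features.append(freq / total)  # relative frequency
    return features

def build(sentences):
    """Toy corpus: unigram counts plus total token count."""
    tokens = [t.lower() for s in sentences for t in s.split()]
    return Counter(tokens), len(tokens)

# Stand-ins for Wikipedia, WikiNews, and Lang-8.
corpora = [build(["the lieutenant defected"]),
           build(["a neighbor heard shots"]),
           build(["i heard shots"])]

print(extract_features("shots", corpora))
```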

Experimental Settings
We downloaded the Wikipedia and WikiNews dumps of December 01, 2017, and split them into sentences using WikiExtractor 3 and NLTK 4 . All corpora (Train/Dev/Test and Wikipedia/WikiNews/Lang-8) were tokenized and lowercased with the scripts of the statistical machine translation toolkit Moses 5 (Koehn et al., 2007). Table 4 displays the number of sentences in each corpus.
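The tokenize-and-lowercase step can be approximated as follows; a simple regex tokenizer stands in for the actual Moses scripts, which handle many more punctuation and language-specific cases:

```python
import re

def preprocess(sentence):
    """Lowercase and tokenize a sentence. The regex is a rough
    stand-in for the Moses tokenizer scripts used in the paper."""
    sentence = sentence.lower()
    # Separate punctuation from words, roughly like Moses tokenization.
    return re.findall(r"\w+|[^\w\s]", sentence)

print(preprocess("According to Goodyear, a neighbor heard gun shots."))
# ['according', 'to', 'goodyear', ',', 'a', 'neighbor', 'heard', 'gun', 'shots', '.']
```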

Results
Tables 5 and 6 present the official evaluation results. In Table 5, systems are ranked by their macro-averaged F1-score for the binary classification task. TMU systems ranked first in Spanish and German, and second in French. In Table 6, systems are ranked by their MAE score for the probabilistic classification task. TMU systems ranked first in Spanish, German, and English news track and second in English WikiNews track.

Ablation Analysis of Freq. and Proba.
Frequency and probability are similar features. Table 7 indicates that although the probability features are more important than the frequency features, systems can yield better performance by considering both features.

Ablation Analysis of Corpora
We examined which corpus provides the most important features. Table 8 shows that the most important features were those obtained from the Lang-8 corpus. Remarkably, Wikipedia, the largest corpus, does not contribute significantly to performance.

Related Work
Although our systems (random forests with length and frequency features of the target word) are simple, they achieved competitive results.
While previous works counted word frequency in corpora written by native speakers, such as Wikipedia, we also used a corpus written by language learners. As anticipated, the word frequency in the learner corpus proved to be a vital feature for the CWI task.

Conclusion
We described the TMU systems for the CWI Shared Task 2018. Our systems performed best on 5 of the 12 tracks using only simple features.
Previous studies concluded that the most effective way to estimate word difficulty is to count the word frequency in a corpus. However, it was not clear what kind of corpus is useful for counting word frequencies. We discussed the usefulness of a learner corpus for the CWI task for the first time. As anticipated, the word frequency counted from the learner corpus worked better for the CWI task than that from an in-domain corpus written by native speakers.