Complex Word Identification Using Character n-grams

This paper investigates the use of character n-gram frequencies for identifying complex words in English, German and Spanish texts. The approach is based on the assumption that complex words are likely to contain different character sequences than simple words. The multinomial Naive Bayes classifier was used with n-grams of different lengths as features, and the best results were obtained for the combination of 2-grams and 4-grams. This variant was submitted to the Complex Word Identification Shared Task 2018 for all texts and achieved F-scores between 70% and 83%. The system was ranked in the middle range for all English texts, third of fourteen submissions for German, and tenth of seventeen submissions for Spanish. The method is not well suited to the cross-lingual task, achieving only 59% on the French text.


Introduction
Complex Word Identification (CWI) refers to the identification of words which are considered complex by readers from a specific target audience. The CWI task is the first step towards lexical simplification, which aims at improving the readability of texts: a lexical simplification system should replace the identified complex words with simpler synonyms. Some of these systems have a CWI module at the beginning of their pipeline, e.g. (Paetzold and Specia, 2015), whereas others perform the CWI task implicitly, such as (Glavaš and Štajner, 2015).
The first shared task on CWI was organised at SemEval 2016 (Paetzold and Specia, 2016), where 21 teams submitted 42 systems trained to predict whether words in a given context were complex for a non-native English speaker. Following the success of the first CWI shared task and additional findings reported in (Zampieri et al., 2017), the second shared task was organised at the BEA workshop 2018 (Yimam et al., 2018), featuring a multilingual dataset. The dataset consists of training and test sets for three languages (English, German and Spanish) as well as a French test set for cross-lingual CWI. The goal was to predict which words could be difficult for a non-native speaker, based on annotations collected from a mixture of native and non-native speakers. The predictions could be submitted in the form of class labels (complex or simple) and/or in the form of complexity probabilities.
This work proposes the use of character n-grams for complex word identification. The main assumption is that complex words contain different character sequences than simple words, i.e. that the combination of particular characters is related to the complexity of a word. Additional motivation is the successful use of character n-grams for machine translation evaluation metrics in recent years (Stanojević and Sima'an, 2014; Popović, 2015; Wang et al., 2016). The results of the Machine Translation Metrics Shared Tasks (Bojar et al., 2017) have shown that these metrics correlate very well with human judgments for all analysed target languages, which indicates that character sequences carry important information.
We used the multinomial Naive Bayes classifier, although its assumption of independence between different n-grams certainly does not hold. We chose this classifier for our first experiments with character n-grams because it is frequently used as a baseline for text classification (McCallum and Nigam, 1998; Kibriya et al., 2004; Lohar et al., 2017). Since Naive Bayes probabilities are generally not reliable, we participated only in the binary classification task.
Although the relation between character n-grams and word complexity intuitively depends on the language, we still decided to investigate cross-lingual CWI and to participate in this track.

Related work
Several different techniques for identifying complex words were investigated by (Shardlow, 2013), including word frequency, word length and syllable counts among others, but no character sequences.
The first CWI shared task (Paetzold and Specia, 2016) featured 42 systems based on different techniques and using different features such as semantic, morphological, lexical, as well as word frequencies which are reported to be a very important factor for CWI.
One of the submitted systems (Mukherjee et al., 2016) used a Naive Bayes classifier with morphological, semantic and lexical features, but no character n-grams were investigated.
Another system (Zampieri et al., 2016) used probabilities of word character trigrams and sentence character trigrams together with word length and sentence length to measure orthographic difficulty. These features together with the word frequency features are used for three classifiers: Random Forest, Nearest neighbour and SVM. Nevertheless, no results regarding the contribution of character trigram features were reported.
Number of vowels, number of syllables and number of characters (word length) together with word frequencies in corpora were investigated in (Yimam et al., 2017b), but no experiments with character n-grams were conducted.

Character n-grams and multinomial Naive Bayes classifier
For each labelled word, all character n-grams of the given length(s) and their frequencies were extracted, and the word was represented as a "bag of n-grams". Deciding which n-gram length(s) to concentrate on is far from trivial since, to the best of our knowledge, no similar experiments have been conducted before. Therefore, we started with individual n-gram lengths from 2 to 6, following the findings from the machine translation metrics task, where lengths above 6 did not bring any improvements.
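The "bag of n-grams" representation described above can be sketched as follows. This is a minimal illustration rather than the system's actual code: the function name `char_ngrams` is hypothetical, and it is an assumption that words are used as they appear, without lowercasing or word-boundary markers (Table 1 shows none).

```python
from collections import Counter


def char_ngrams(word, lengths):
    """Return the bag of character n-grams of `word` for the given lengths.

    Assumption: no padding or boundary symbols are added, so a word
    shorter than n contributes no n-grams of length n.
    """
    bag = Counter()
    for n in lengths:
        for start in range(len(word) - n + 1):
            bag[word[start:start + n]] += 1
    return bag


# A mixed-length variant such as "24" simply pools 2-grams and 4-grams.
bag = char_ngrams("banana", lengths=(2, 4))
print(bag)
```

Pooling all requested lengths into one counter means that a mixed-length variant needs no special handling in the classifier: it only sees a larger feature vector.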
Our preliminary experiments showed that introducing 6-grams degraded the performance, so we kept the lengths up to 5. As for mixed lengths, the best preliminary results were obtained for combinations of 2-grams, 3-grams and 4-grams, so we concentrated on these variants. Table 1 presents two complex and three simple English words with their 2-grams, 3-grams and 4-grams and the corresponding frequencies.
Under the (very) naive assumption of conditional independence between individual n-grams, these frequencies are then used for estimating the class-conditional probabilities of the multinomial Naive Bayes model:
$$P(c \mid w) \propto P(c) \prod_{i=1}^{N_{ngr}} P(ngr_i \mid c)^{f_i}$$
where $P(ngr_i \mid c)$ is the conditional probability that the n-gram $ngr_i$ occurs in a word with the class value $c$, $f_i$ is the frequency of $ngr_i$ in the word $w$, and $N_{ngr}$ is the total number of distinct n-grams, i.e. the dimension of the feature vector. $P(c)$ is the prior probability that a word has class label $c$. For the multinomial model, these two probabilities can be estimated as relative frequencies in the following way:
$$P(ngr_i \mid c) = \frac{count(ngr_i, c)}{\sum_{j=1}^{N_{ngr}} count(ngr_j, c)}$$
where the numerator represents the number of occurrences of the n-gram $ngr_i$ in words with class label $c$, and the denominator represents the number of occurrences of all n-grams in this class. The smoothing probability for unseen n-grams was set to 0.001. The prior class probability can be estimated as:
$$P(c) = \frac{count(c)}{count(words)}$$
where $count(c)$ represents the number of words with class label $c$ and $count(words)$ represents the total number of labelled words.
If the words in Table 1 and their 4-grams were used for training, the prior class probabilities for simple ("S") and complex ("C") words would be P(S) = 3/5 = 0.6 and P(C) = 2/5 = 0.4. The complex class contains eight 4-gram occurrences (six from "frugality" and two from "reefs") and the simple class contains nine (three from "banana", two from "coral" and four from "reality"). The class-conditional probabilities for the 4-gram "frug" would therefore be P(frug|S) = 0 and P(frug|C) = 1/8 = 0.125, and for the 4-gram "real" P(real|S) = 1/9 ≈ 0.11 and P(real|C) = 0, before the smoothing value for unseen n-grams is applied. The 4-gram "lity" would have similar probabilities for the complex and for the simple class (1/8 and 1/9, respectively).

| word | 2-grams | 3-grams | 4-grams | class |
|---|---|---|---|---|
| frugality | fr:1 ru:1 ug:1 ga:1 al:1 li:1 it:1 ty:1 | fru:1 rug:1 uga:1 gal:1 ali:1 lit:1 ity:1 | frug:1 ruga:1 ugal:1 gali:1 alit:1 lity:1 | C |
| reefs | re:1 ee:1 ef:1 fs:1 | ree:1 eef:1 efs:1 | reef:1 eefs:1 | C |
| banana | ba:1 an:2 na:2 | ban:1 ana:2 nan:1 | bana:1 anan:1 nana:1 | S |
| coral | co:1 or:1 ra:1 al:1 | cor:1 ora:1 ral:1 | cora:1 oral:1 | S |
| reality | re:1 ea:1 al:1 li:1 it:1 ty:1 | rea:1 eal:1 ali:1 lit:1 ity:1 | real:1 eali:1 alit:1 lity:1 | S |

Table 1: Examples of two complex and three simple words with their "bags of n-grams": 2-grams, 3-grams and 4-grams and the corresponding frequencies.
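The estimation described in this section can be illustrated with a short sketch that trains on the five words of Table 1. It follows the description of the denominator as the number of occurrences of all n-grams in a class, so counts are pooled per class. The function names are hypothetical, and the exact way the constant 0.001 enters the model is an assumption: here it simply replaces the probability of any n-gram not seen with a class, and scores are computed in log space.

```python
import math
from collections import Counter

UNSEEN_PROB = 0.001  # constant for unseen n-grams (its exact use is assumed)


def char_ngrams(word, lengths):
    """Bag of character n-grams of the given lengths, without padding."""
    bag = Counter()
    for n in lengths:
        for start in range(len(word) - n + 1):
            bag[word[start:start + n]] += 1
    return bag


def train(labelled_words, lengths):
    """Estimate priors and class-conditional probabilities as relative frequencies."""
    class_counts = Counter(label for _, label in labelled_words)
    ngram_counts = {label: Counter() for label in class_counts}
    for word, label in labelled_words:
        ngram_counts[label].update(char_ngrams(word, lengths))
    n_words = sum(class_counts.values())
    priors = {c: class_counts[c] / n_words for c in class_counts}
    cond = {}
    for c, counts in ngram_counts.items():
        class_total = sum(counts.values())  # all n-gram occurrences in class c
        cond[c] = {ngr: k / class_total for ngr, k in counts.items()}
    return priors, cond


def classify(word, priors, cond, lengths):
    """Return the class with the highest log-score for `word`."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for ngr, freq in char_ngrams(word, lengths).items():
            score += freq * math.log(cond[c].get(ngr, UNSEEN_PROB))
        scores[c] = score
    return max(scores, key=scores.get)


TABLE_1 = [("frugality", "C"), ("reefs", "C"),
           ("banana", "S"), ("coral", "S"), ("reality", "S")]

priors, cond = train(TABLE_1, lengths=(4,))
print(priors)
print(cond["C"]["frug"], cond["S"]["real"])
print(classify("frugal", priors, cond, lengths=(4,)))
```

With a constant value for unseen n-grams the class-conditional distributions no longer sum to one, which is harmless for picking the best class but is one more reason to treat the resulting scores as rankings rather than calibrated probabilities.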

Data
The organisers of the CWI shared task provided all participants with training and test data sets for English, German and Spanish. For French, only a test set was provided since it was intended for the cross-lingual CWI task. The English data set consists of a mixture of professionally written news (News), non-professionally written news (WikiNews), and Wikipedia articles (Wiki). The German, Spanish and French data sets contain data taken from German, Spanish and French Wikipedia pages. Data statistics are presented in Table 2.
Each sentence in the English data set was annotated by 20 people, 10 native and 10 non-native speakers. Each sentence in the German, Spanish and French data sets was annotated by 10 people, a mixture of native and non-native speakers. Annotators were provided with the surrounding context of each sentence, i.e. a paragraph, and then asked to mark words they thought would be difficult to understand for children, non-native speakers, and people with language disabilities. Annotators could mark not only individual words but also sequences of several consecutive words as complex. The details about the data sets can be found in (Yimam et al., 2017b) and (Yimam et al., 2017a).

Results
As mentioned in Section 2, the main part of our experiments was to determine which n-gram lengths to include in the classifier. Preliminary experiments showed that the individual lengths of 2, 3, 4 and 5 should be further investigated, as well as the combinations of 2- and 4-grams, 3- and 4-grams, and 2-, 3- and 4-grams.
All these variants were investigated for three scenarios: (i) standard classification, where each training set corresponds to the development set, (ii) classification with the extended English training corpus, where all English training corpora were concatenated and used for classifying each of the English development sets, and (iii) cross-lingual classification, where the training sets of the other two languages were used for each language.
The comparison of the methods was carried out on the development sets in terms of complex word F-score and overall accuracy.
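For reference, the two evaluation measures can be computed as in the following sketch. The function name and the label "C" for the complex class are illustrative, and it is an assumption that the F-score is the balanced F1 (harmonic mean of precision and recall) on the complex class.

```python
def complex_f_score_and_accuracy(gold, predicted, complex_label="C"):
    """Return (F1 on the complex class, overall accuracy) for two label lists."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    pairs = list(zip(gold, predicted))
    tp = sum(g == complex_label and p == complex_label for g, p in pairs)
    fp = sum(g != complex_label and p == complex_label for g, p in pairs)
    fn = sum(g == complex_label and p != complex_label for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    return f_score, accuracy


gold = ["C", "C", "C", "S", "S", "S"]
predicted = ["C", "C", "S", "S", "S", "C"]
f_score, accuracy = complex_f_score_and_accuracy(gold, predicted)
print(f_score, accuracy)
```

Reporting both measures matters because simple words typically outnumber complex ones, so accuracy alone can look good for a classifier that rarely predicts the complex class.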

Standard set-up
In the standard set-up, each development set was classified using its corresponding training set, both in terms of domain and of language. Table 3 presents the obtained results, with the best F-scores / accuracies in bold.
It can be noted that the combination of 2-grams and 4-grams is the best option for almost all texts. It is ranked second (and very close to the best one) only for the accuracy on English News. As for the individual n-grams, the best performance is obtained by 4-grams. The scores improve as the n-gram length increases up to 4, and then drop for 5-grams (except for the accuracy on English News and German Wikipedia). It can also be seen that, in general, combining different n-gram lengths works better than using individual ones.

Concatenated English training corpus
Since the English data set contained three domains (Wikipedia, News and WikiNews), the question arose whether enlarging the training set would help: will the use of a larger training corpus from different domains lead to better results or not? In order to answer this question, each of the three English development sets was also classified using the concatenated English training corpus containing all three domains, and the results are presented in Table 4. These results show that enlarging the training corpus generally helps. The smallest improvements can be observed for the News text, probably because the News training corpus is the largest one, as can be noted in Table 2. Another finding is that for the larger training set, individual 3-grams, 4-grams and 5-grams can outperform the n-gram combinations. A possible explanation is that the reliability of longer character sequences increases when a larger training corpus with more instances is used. When the three n-gram length combinations are compared on the larger training set, "24" still outperforms the other two except for the Wikipedia set.

Cross-lingual classification
In order to explore cross-lingual classification, each of the Wikipedia development sets was classified using the training corpora of the other two languages. The English News and WikiNews development sets were not used in order to avoid possible effects of domain mixing. The results in Table 5 show that the method is, as mentioned in Section 1, indeed not very appropriate for cross-lingual classification, since the character combinations are generally language dependent: the drop in F-score and accuracy is large, in the range of 10 to 15 absolute points.
As for the n-gram lengths, the combination "24" is useful, although mostly for English. For German and Spanish, 3-grams and 5-grams outperformed the n-gram combinations. As for the usage of different languages, no advantage of one "foreign" language over another was observed: the best results are rather similar for both "external" languages. For example, the F-score for English is slightly better when the German training set is used, and the accuracy is slightly better when the system is trained on the Spanish text. The fact that none of the language pairs is closely related might have an important influence on these results.

Confusion analysis
The results described in previous sections have shown the following:
• the combination of 2-grams and 4-grams is the best option for the standard setting, and also performs decently for the enlarged English training corpus as well as for cross-lingual classification;
• individual 3-grams, 4-grams and 5-grams outperform the combinations when a larger English corpus is used.
Table 5: Cross-lingual classification: F-score for the complex word class / accuracy.
In order to better understand the above findings, confusion analysis was carried out for all n-gram lengths and for all Wikipedia development sets in all three set-ups. Table 6 shows the percentages of (non-)confusions: C-C and S-S represent correctly classified instances, C-S stands for complex words classified as simple, and S-C for simple words classified as complex. The results show the following:
• 5-grams are very good at identifying simple words: less than 10% of them are classified as complex. Nevertheless, they are by far the worst at labelling complex words: for the German and Spanish texts, they even label more complex instances incorrectly than correctly (red numbers).
• the combination "24" is very good at labelling complex words, although it is often outperformed by one of the other two combinations; the percentages in the majority of those cases are very close, though.
• the same combination, "24", is the best of all three combinations for labelling simple words, although it is clearly outperformed by 5-grams and 4-grams.
The described findings indicate that the combination "24", despite not always yielding the best scores, is the most balanced and the most stable one over all set-ups. Therefore, this variant was submitted for all shared task tracks.
It should be noted that the confusions were also analysed for the cross-lingual classification, showing the very same behaviour for 5-grams and for the "24" variant. As for the other n-gram lengths, a number of large confusion percentages were observed, indicating once again that the method is not well suited to cross-lingual CWI.
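The four confusion categories used in this analysis can be computed as in the sketch below. The function name is hypothetical, and it is an assumption that the percentages are normalised per gold class (so that C-C and C-S sum to 100% of the complex words, and S-C and S-S to 100% of the simple words), which matches statements such as "less than 10% of them are classified as complex".

```python
from collections import Counter


def confusion_percentages(gold, predicted, labels=("C", "S")):
    """Return percentages keyed 'gold-predicted', normalised per gold class."""
    pair_counts = Counter(zip(gold, predicted))
    gold_counts = Counter(gold)
    result = {}
    for g in labels:
        if gold_counts[g] == 0:
            continue  # no instances of this class, nothing to normalise by
        for p in labels:
            result[f"{g}-{p}"] = 100.0 * pair_counts[(g, p)] / gold_counts[g]
    return result


gold = ["C", "C", "C", "C", "S", "S", "S", "S", "S", "S"]
predicted = ["C", "C", "C", "S", "S", "S", "S", "S", "S", "C"]
table = confusion_percentages(gold, predicted)
for key in ("C-C", "C-S", "S-C", "S-S"):
    print(key, round(table[key], 1))
```

Normalising per gold class makes the numbers comparable across data sets with different proportions of complex words.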

Official shared task results
Following all the findings described in previous sections, we decided to submit the "24" variant, i.e. the combination of 2-grams and 4-grams, to all shared task tracks. For each of the three English test sets, we sent two submissions: one classified using the corresponding in-domain training corpus, and one classified using the concatenated training corpus. For the French test set, we sent four submissions: one classified using the English Wikipedia training corpus, one classified using the concatenated English training corpus, one classified using the Spanish training corpus, and one using the German training corpus. For the German and Spanish test sets, one submission was sent for each. The official accuracies for the best system, for all our submissions and for the worst system are shown in Table 7 together with the ranks (in parentheses).
Table 6: Confusion analysis for the English, German and Spanish development sets: C-C and S-S are correctly classified complex and simple words, C-S stands for complex words classified as simple, and S-C for simple words classified as complex.
All our monolingual submissions were ranked in the middle, some better than others. The best rank was achieved for German (3 of 14) and the worst for Spanish (10 of 17). The obtained accuracies are all above 70%, with German being the lowest and English News the highest. For the cross-lingual task, our submissions were ranked very low, with one of the submissions being the worst one. However, it should be noted that the use of the Spanish training set yielded the best result: this indicates that the method could potentially be used for closely related languages, although this should be further examined in future work.
All the results indicate that there is potential for using character n-grams for complex word identification; however, more experiments should be carried out and several refinements should be applied.

Summary and outlook
In this paper, we have proposed the use of character n-grams for complex word identification, starting from the assumption that character sequences in complex words are often different from those in simple words. We carried out extensive experiments with the multinomial Naive Bayes classifier with n-grams of different lengths as features, and found that using 2-grams and 4-grams is the most stable option in this configuration. Our system was ranked in a middle-range position for all tracks except for the cross-lingual track, where it was ranked very low. This was not surprising, since frequencies of character sequences in words are intuitively rather language-dependent. Our official accuracy scores range from 70% to 83% for English, German and Spanish texts and from 50% to 59% for the cross-lingually classified French text.
The experiments described in this work, together with the official shared task results, indicate that the use of character n-grams for complex word identification has potential, but the methods should be further investigated and improved. First of all, other classifiers without the independence assumption should be investigated. In addition, using context (surrounding words and their n-grams) should be investigated as well.