Joint learning of frequency and word embeddings for multilingual readability assessment

This paper describes two models that employ word frequency embeddings to address the problem of readability assessment in multiple languages. The task is to determine the difficulty level of a given document, i.e., how hard it is for a reader to fully comprehend the text. The proposed models show how frequency information can be integrated to improve readability assessment. Experimental results on both English and Chinese datasets show that the proposed models notably improve on models that use only traditional word embeddings.


Introduction
Readability assessment is the task of determining how difficult a given document is to understand. It is useful in many applications, such as selecting learning material for children of different grade levels and for language learners, comprehension tests, skills training, text summarisation and simplification systems. Readability assessment has a long research history, and many methods have been developed over the past decades (Dale and Chall, 1948; McLaughlin, 1969; Kincaid et al., 1975; Chall and Dale, 1995; Si and Callan, 2001; Heilman et al., 2007; Jiang et al., 2015; Wang and Andersen, 2016). These approaches, however, rely on hand-crafted features that depend heavily on the language and require adjustment when applied to a new language. Our aim is to develop a universal method that can be used in a multilingual setting and that requires little effort to extend to other languages.
Recent machine learning techniques, such as convolutional neural networks (CNN) (Collobert et al., 2011), typically do not have to be supplied with hand-crafted features. These models often use pre-trained word embeddings for NLP tasks and have been shown to achieve good results on multiple benchmarks (Mikolov et al., 2013b; Pennington et al., 2014; Mikolov et al., 2013a). Pre-trained word embeddings are generally designed to capture word meaning and topics. They are useful, since topics are good indications of whether a document is difficult to comprehend, but they do not directly reflect the frequency levels of words.
In our scenario, it is desirable that the system takes into account the frequency level of words rather than focusing purely on their meanings. This is based on the assumption that more frequent words are easier to understand. We therefore propose two models that jointly represent words by their meanings, using traditional word embeddings, and by their frequency levels, using what we call frequency embeddings. These two embedding layers are employed in a CNN architecture to determine the readability level of a given document. Since the models do not depend on hand-crafted features, they can be easily adapted to multiple languages.

Related Work
Readability assessment methods can be classified into two categories: the traditional approach and the data-driven approach. The traditional approach includes the Dale-Chall formula (Dale and Chall, 1948), the FOG Index (Gunning, 1952), SMOG (McLaughlin, 1969), the Flesch-Kincaid Index (Kincaid et al., 1975) and the New Dale-Chall formula (Chall and Dale, 1995). These early studies evaluated text difficulty based on shallow features such as word difficulty levels, the average sentence length and the average number of syllables. Though considered quick and easy to compute, these traditional metrics and formulae were designed with a specific language in mind, and thus they may not work well when applied to other languages.
The data-driven approach treats readability assessment as a machine learning problem, that is, automatically learning the mapping from documents to difficulty levels based on training examples (Si and Callan, 2001; Heilman et al., 2007; Jiang et al., 2015; Wang and Andersen, 2016). In these studies, documents are represented by different types of features such as bag of words, lexical and grammatical features extracted from parse trees (Heilman et al., 2007), grammatical templates (Wang and Andersen, 2016) and word frequency smoothed by correlation information (Jiang et al., 2015). Most of these studies, however, require hand-crafted, language-dependent features and are not readily applicable to a multilingual setting.

Our method
While traditional methods are simple to implement, they focus mostly on languages written in the Latin script, such as English. These methods are not easily transferred to other languages, especially Asian ones. Motivated by the recent success of Convolutional Neural Network (CNN) models in many text classification tasks, we employ these models to learn to classify a given text into its difficulty level.
Word embeddings transfer well across many general NLP tasks. They learn the representation of a word from the contexts in which it appears. Although they can reflect word meaning and topics, they do not directly take the frequency of a word into account. In the readability assessment scenario, frequency information is important in deciding whether a document is hard to read or not (Jiang et al., 2015).
Based on this observation, we propose a model that takes word frequency information into account in addition to word embeddings. Our hypothesis is that the model can learn better from knowing the difficulty levels of words as well as their meanings. Word embeddings help capture the topics of documents, which are important for assessing readability levels (e.g., some topics are inherently more difficult to understand than others). In addition, frequency information indicates which words are more difficult to understand.
The three common metrics representing word frequency information are raw counts (the number of times a word appears in the whole corpus), ranking (i.e., rank 0 for the most common word) and frequency classes. We take these metrics directly as an embedding vector that represents each word in the corpus. Among these metrics, the word frequency class is the most generalised one.
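As an illustration, the three metrics can be computed from a tokenised corpus as in the minimal sketch below. The function name `frequency_metrics` and the toy corpus are hypothetical. The frequency class formula used here, floor(log2(freq_max / freq(w)) + 0.5), is an assumption based on a common convention for frequency classes and may differ from the exact definition used in our experiments. Ties in rank are broken alphabetically, which is also an assumption.

```python
import math
from collections import Counter


def frequency_metrics(tokens):
    """Return {word: (raw_count, rank, frequency_class)} for a token list."""
    counts = Counter(tokens)
    # Sort by descending count; break ties alphabetically so ranks are stable.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    freq_max = ordered[0][1]
    metrics = {}
    for rank, (word, count) in enumerate(ordered):  # rank 0 = most common word
        # Assumed formula: floor(log2(freq_max / freq(w)) + 0.5).
        freq_class = math.floor(math.log2(freq_max / count) + 0.5)
        metrics[word] = (count, rank, freq_class)
    return metrics


corpus = "the cat sat on the mat and the dog sat by the door".split()
metrics = frequency_metrics(corpus)
print(metrics["the"], metrics["sat"], metrics["cat"])
```

Under this assumed formula, a word in class N is roughly 2^N times less frequent than the most frequent word, so the class compresses raw counts onto a small, corpus-size-independent scale.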
In particular, the frequency class $FC(w)$ of a word $w$ describes the frequency $freq(w)$ of the word in relation to the frequency $freq_{max}$ of the most frequent word, i.e., the word with ranking 0 (Sabine Fiedler and Quasthoff, 2012).
Our architecture is slightly different from the CNN architecture presented in (Kim, 2014). In particular, we propose two models (Figure 1): WFE-COM (left) and WFE-SEP (right).
WFE-COM Model. In this model, the filters are applied to the concatenated word and frequency embeddings. The network learns the filter weights that activate features extracted from these embeddings. Let $x^w_i \in \mathbb{R}^{k_w}$ and $x^f_i \in \mathbb{R}^{k_f}$, where $x_i$ is the $i$-th word in a sentence of length $n$, $k_w$ is the word embedding dimension and $k_f$ is the frequency embedding dimension. $x^w_i$ represents the word embedding of word $x_i$, while $x^f_i$ represents its frequency embedding. Note that in the frequency embeddings, instead of randomly assigning values to unknown words as in word embeddings, we set them to the highest frequency class taken from the training corpus. The sentence of length $n$ is then represented by a matrix:

$$x^E_{1:n} = x^E_1 \oplus x^E_2 \oplus \dots \oplus x^E_n \qquad (2)$$

and

$$x^E_i = x^w_i \oplus x^f_i \qquad (3)$$

represents the final embedding of word $x_i$, which is a concatenation of its word and frequency embeddings. A feature map is generated by applying filters of window size $h$ to the sentence matrix in Eq. 2, where a feature $c_i$ is obtained using a non-linear activation function $f$:

$$c_i = f(\mathbf{w} \cdot x^E_{i:i+h-1} + b) \qquad (4)$$

where $x^E_{i:i+h-1}$ represents the matrix composed of the vectors from $x^E_i$ to $x^E_{i+h-1}$. The convolution operation in Eq. 4 is applied to the window of size $h$ from $x^E_i$ to $x^E_{i+h-1}$, with weights $\mathbf{w} \in \mathbb{R}^{h k_e}$, where $k_e = k_w + k_f$, and bias $b$. We then apply max-over-time pooling to the feature map.
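The WFE-COM computation for a single filter can be sketched in NumPy as follows. This is an illustrative forward pass only, not our implementation: the dimensions, the random stand-in embeddings and all variable names are assumptions, and a rectified linear unit is used as the activation function $f$.

```python
import numpy as np

rng = np.random.default_rng(0)

n, k_w, k_f, h = 7, 5, 3, 3          # sentence length, embedding dims, window size
k_e = k_w + k_f

x_w = rng.normal(size=(n, k_w))      # word embeddings (random stand-ins)
x_f = rng.normal(size=(n, k_f))      # frequency embeddings (random stand-ins)

# Concatenate word and frequency embeddings for every word in the sentence.
x_e = np.concatenate([x_w, x_f], axis=1)     # shape (n, k_e)

w = rng.normal(size=h * k_e)         # one filter spanning a window of h words
b = 0.1                              # bias


def relu(z):
    return np.maximum(z, 0.0)


# Slide the filter over every window of h consecutive words.
feature_map = np.array([
    relu(w @ x_e[i:i + h].reshape(-1) + b)
    for i in range(n - h + 1)
])

# Max-over-time pooling keeps the strongest response of this filter.
pooled = feature_map.max()
print(x_e.shape, feature_map.shape)
```

With several filters and window sizes, each filter contributes one pooled value, and the pooled values together form the input to the fully connected layer.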
WFE-SEP Model. In this model, word embeddings and frequency embeddings are learned separately before being fed into a fully connected layer. Convolutional layers and max pooling are applied to the word embeddings, as these layers help find and represent features of interest, while they are omitted for the frequency embeddings.
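The feature map extracted by applying the filters to the word embeddings is then computed as:

$$c_i = f(\mathbf{w} \cdot x^w_{i:i+h-1} + b) \qquad (5)$$

where $\mathbf{w} \in \mathbb{R}^{h k_w}$, since the filters cover the word embeddings only. Finally, this feature map is concatenated with the frequency embeddings, and dropout is then used for regularisation, similar to the architecture described in (Kim, 2014) (see section 4.2).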
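A minimal NumPy sketch of the WFE-SEP forward pass is given below for illustration. Several details are assumptions rather than a description of our implementation: the sentence is taken to have a fixed (padded) length so that the frequency embeddings can be flattened into a fixed-size vector, a softmax output layer is assumed, and all dimensions, names and random values are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k_w, k_f, h, n_filters, n_levels = 7, 5, 3, 3, 4, 6

x_w = rng.normal(size=(n, k_w))   # word embeddings (random stand-ins)
x_f = rng.normal(size=(n, k_f))   # frequency embeddings (random stand-ins)

filters = rng.normal(size=(n_filters, h * k_w))   # filters see word embeddings only
biases = np.zeros(n_filters)

# Convolution + ReLU over word-embedding windows, then max-over-time pooling.
windows = np.stack([x_w[i:i + h].reshape(-1) for i in range(n - h + 1)])
feature_maps = np.maximum(windows @ filters.T + biases, 0.0)   # (n-h+1, n_filters)
pooled = feature_maps.max(axis=0)                               # (n_filters,)

# Assumption: frequency embeddings skip the convolution and are flattened as-is.
joint = np.concatenate([pooled, x_f.reshape(-1)])   # (n_filters + n * k_f,)

# Inverted dropout with rate 0.5, as it would be applied during training.
keep = rng.random(joint.shape) >= 0.5
dropped = joint * keep / 0.5

# Fully connected layer followed by an assumed softmax over difficulty levels.
W_fc = rng.normal(size=(n_levels, joint.size))
logits = W_fc @ dropped
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(joint.shape, probs.shape)
```

The only structural difference from WFE-COM in this sketch is where the concatenation happens: after pooling rather than before the convolution.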

Dataset
We evaluate our methods for English and Chinese readability assessment on two datasets collected by (Jiang et al., 2015). The first dataset, ENCT, was built with four reading levels from the English New Concept textbook. The second dataset, CPT, was collected from Chinese primary textbooks and contains six difficulty levels. In total, there are 279 documents with 4671 sentences in ENCT and 637 documents with 16145 sentences in CPT. In both datasets, the difficulty levels were assigned by human experts. We randomly split each dataset into 70% for training, 27% for testing and 3% for development.
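The random split can be sketched as follows. This is only an illustration under stated assumptions: the split is taken to be at document level, the seed and the rounding of subset sizes are arbitrary choices, and the helper `split_dataset` is hypothetical. The resulting subset sizes are what this rounding yields, not figures reported for our experiments.

```python
import random


def split_dataset(documents, seed=0):
    """Shuffle and split into 70% train, 27% test and 3% development."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    n_train = round(0.70 * len(docs))
    n_test = round(0.27 * len(docs))
    train = docs[:n_train]
    test = docs[n_train:n_train + n_test]
    dev = docs[n_train + n_test:]   # remainder, roughly 3%
    return train, test, dev


# 279 is the number of ENCT documents; integers stand in for the documents.
train, test, dev = split_dataset(range(279))
print(len(train), len(test), len(dev))
```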

Experiment setup
NDC-Level. The New Dale-Chall Readability level (Chall and Dale, 1995) is a traditional readability test. $P_{DW}$ is the percentage of difficult words in a document, calculated as the number of difficult words divided by the total number of words in the document. The raw score $\Phi$ is calculated as:

$$\Phi = 0.1579 \times P_{DW} + 0.0496 \times \frac{n_w}{n_s}$$

where $n_w$ is the number of words and $n_s$ is the number of sentences in the whole corpus, hence $\frac{n_w}{n_s}$ represents the average sentence length in the corpus. Finally, if $P_{DW}$ is above 5%, 3.6365 is added to the raw score $\Phi$ to obtain the adjusted score. We implemented the New Dale-Chall Readability level (NDC) and converted the raw score $\Phi$ to the corresponding readability levels as follows:

Raw score        Grade level          Level (ENCT)   Level (CPT)
4.9 and below    Grade 4 and below    level 1        level 1
5.0 to 5.9       Grades 5-6           level 1        level 2
6.0 to 6.9       Grades 7-8           level 2        level 3
7.0 to 7.9       Grades 9-10          level 3        level 4
8.0 to 8.9       Grades 11-12         level 3        level 5
9.0 to 9.9       College              level 4        level 6
≥10              College Graduate     level 4        level 6

Word embeddings (WE). For English, we used the pre-trained word2vec embeddings by (Mikolov et al., 2013b) trained on Google News. For Chinese, we collected a dataset consisting of news (≈ 320K documents) and Wikipedia, tokenised it and trained the word embeddings on it.
Frequency embeddings. We used the pre-trained frequency lists for English obtained from (Sabine Fiedler and Quasthoff, 2012), and created our own Chinese frequency lists using the same CNN architecture. We followed the settings suggested in (Kim, 2014). The filter window sizes are 3, 4 and 5, with 100 feature maps each. We used rectified linear units as activation functions for the convolutional layers, a dropout rate of 0.5 and a mini-batch size of 50.
Static and non-static WE. These two settings follow the method in (Kim, 2014), where all word vectors, including those of unknown words, are either kept static (in the static setting) or updated (in the non-static setting) while the other parameters are learned.
Random-WE. All words are randomly initialised and modified during training.
Multichannel-WE. The static and non-static WE are each treated as one channel, and gradients are backpropagated through only one of the channels.
Static-FE. Only frequency embeddings are used in this setting (without word embeddings).
Word Frequency Embeddings (WFE). We concatenate the pre-trained word embeddings and the frequency embeddings as explained in section 3.
In the WFE setting, we use the three frequency metrics: raw counts, ranking and frequency class, while in the WFE-class setting, we use only the frequency class metric. In both settings, the frequency embeddings are kept static during training.
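The NDC baseline described above can be sketched as follows. This is a minimal illustration, not our implementation: the function names are hypothetical, the list of difficult words is assumed to have been applied already (only counts are passed in), and the score bands are treated as half-open intervals so that a score such as 5.95 falls into the 5.0 to 5.9 band. The assignment of the two level columns to the four-level (ENCT) and six-level (CPT) datasets is inferred from the number of levels in each dataset.

```python
def ndc_raw_score(n_difficult, n_words, n_sentences):
    """New Dale-Chall score from difficult-word, word and sentence counts."""
    p_dw = 100.0 * n_difficult / n_words          # percentage of difficult words
    score = 0.1579 * p_dw + 0.0496 * (n_words / n_sentences)
    if p_dw > 5.0:
        score += 3.6365                           # adjustment above 5%
    return score


# (exclusive upper bound of the score band, four-level label, six-level label)
LEVELS = [
    (5.0, 1, 1),
    (6.0, 1, 2),
    (7.0, 2, 3),
    (8.0, 3, 4),
    (9.0, 3, 5),
    (10.0, 4, 6),
]


def ndc_levels(score):
    """Map a score to (four-level, six-level) readability levels."""
    for upper, level4, level6 in LEVELS:
        if score < upper:
            return level4, level6
    return 4, 6                                   # score of 10 or above


score = ndc_raw_score(n_difficult=12, n_words=200, n_sentences=10)
print(round(score, 4), ndc_levels(score))
```

In this example, 6% of the words are difficult and the average sentence length is 20, giving 0.1579 × 6 + 0.0496 × 20 + 3.6365 = 5.5759, which maps to level 1 on the four-level scale and level 2 on the six-level scale.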

Result and discussion
The results show that the traditional NDC method works much better on the English dataset (50%) than on the Chinese one (17%), which is probably explained by the fact that the formula was originally designed for English. Its results are still much lower than those of the CNN methods using pre-trained frequency and word embeddings. The random-WE method works better for English and much better for Chinese compared to NDC, but worse than the methods using pre-trained frequency and word embeddings. This shows that pre-trained embeddings play an important role in determining the difficulty levels. Among the three WE methods using pre-trained word embeddings, the static model achieves the best results. The non-static model is supposed to fine-tune the embeddings for the given task. However, in our case, it does not work as well as keeping the embedding vectors static, for both English and Chinese.
When frequency classes, word ranks and raw counts are used together for the frequency embeddings, the results are better than those of the other models. This model is, however, worse than the one using only frequency class information. Since frequency class information is more representative than word counts and word ranks, it perhaps helps the model learn to classify the difficulty levels better in more general cases.
The results suggest that the WFE-SEP model works better than WFE-COM. This means that it is not necessary to apply filters and max pooling to the frequency embeddings: the frequency and word embeddings can be learned separately and concatenated just before the fully connected layer. Finally, the frequency embeddings improve the results for both English (to 93%) and Chinese (to 49%) when we concatenate the frequency and word embeddings using the frequency class information. This supports our hypothesis that frequency information is useful in judging the difficulty level of a document. The method is extensible and can easily be applied to different languages without prior knowledge of these languages.

Conclusion
In this paper, we have proposed two models that employ both word and frequency embeddings for the readability assessment task. The experimental results show that (1) the frequency class metric represents frequency information better than other common metrics such as raw counts or ranking; (2) the model that feeds the frequency embeddings directly into the fully connected layer performs better than the one that applies filters to the concatenated word and frequency embeddings; and (3) both proposed models outperform the baseline (the traditional NDC method) and the CNN models without frequency information on both the English and Chinese datasets.