A Graph-based Readability Assessment Method using Word Coupling

This paper proposes a graph-based readability assessment method using word coupling. Compared to state-of-the-art methods such as readability formulae, word-based methods and feature-based methods, our method develops a coupled bag-of-words model which combines the merits of word frequencies and text features. Unlike the general bag-of-words model, which assumes words are independent, our model correlates words based on their similarity in readability. By applying TF-IDF (Term Frequency and Inverse Document Frequency), the coupled TF-IDF matrix is built and used in a graph-based classification framework, which involves graph building, merging and label propagation. Experiments are conducted on both English and Chinese datasets. The results demonstrate both the effectiveness and the potential of the method.


Introduction
Readability assessment is a task that aims to evaluate the reading difficulty, or ease of comprehension, of text documents. It helps educationists select texts appropriate to the reading/grade levels of students, and helps web designers organize texts on web pages for users performing personalized searches for information retrieval.
Research on readability assessment dates back to the early 20th century (Dale and Chall, 1948). Many useful readability formulae have been developed since then (Dale and Chall, 1948; McLaughlin, 1969; Kincaid et al., 1975). More recently, owing to developments in natural language processing, methods for readability assessment have made great progress (Zakaluk and Samuels, 1988; Benjamin, 2012; Gonzalez-Dios et al., 2014). The word-based methods compute word frequencies in documents to estimate their readability (Collins-Thompson and Callan, 2004; Kidwell et al., 2009). The feature-based methods extract text features from documents and train classification models to classify the readability (Schwarm and Ostendorf, 2005; Feng et al., 2010; François and Fairon, 2012; Hancke et al., 2012).
In this paper, we propose a graph-based method using word coupling, which combines the merits of both word frequencies and text features for readability assessment. We design a coupled bag-of-words model, which correlates words based on their similarities on sentence-level readability computed using text features. The model is used in a graph-based classification framework, which involves graph building, graph merging/combination, and label propagation. We perform experiments on datasets of both English and Chinese. The results demonstrate both effectiveness and potential of our method.
The rest of this paper is organized as follows. Section 2 introduces the background of our work. Section 3 presents the details of the method. Section 4 describes the experiments and explains the results. Finally, Section 5 concludes the paper with planned future work.

Background
In this section, we briefly introduce three research topics relevant to our work: readability assessment, the bag-of-words model and the graph-based label propagation method.

Readability Assessment
Research on readability assessment has developed three types of methods: the readability formulae, the word-based methods and the feature-based methods (Kincaid et al., 1975; Collins-Thompson and Callan, 2004; Schwarm and Ostendorf, 2005). In the early days, many well-known readability formulae were developed to assess the readability of text documents (Dale and Chall, 1948; McLaughlin, 1969; Kincaid et al., 1975). Surface text features are defined in these formulae to measure both the lexical and the grammatical complexity of a document. The word-based methods focus on words and their frequencies in a document to assess its readability, and mainly include the unigram/bigram/n-gram models (Collins-Thompson and Callan, 2004; Schwarm and Ostendorf, 2005) and the word acquisition model (Kidwell et al., 2009). The feature-based methods focus on extracting text features from a document and training a classification model to classify its readability (Feng et al., 2010; François and Fairon, 2012; Hancke et al., 2012). Suitable text features are usually essential to the success of these methods. The support vector machine and the logistic regression model are two classification models commonly used in these methods.

The Bag-of-Words Model
The bag-of-words model is mostly used for document classification. It constructs a feature space that contains all the distinct words in a language (or the document set). A document is represented by a vector whose components reflect the weight of every distinct word contained in the document. Normally, the model assumes that words are independent. Capturing the relationships among words has since attracted considerable attention (Wong et al., 1985; Cheng et al., 2013). Inspired by these works, this paper adopts the bag-of-words model for readability assessment, and refines the model by computing the similarity among words in terms of reading difficulty.

The Graph-based Label Propagation Method
Graph-based label propagation is applied on a graph to propagate class labels from labeled nodes to unlabeled ones (Kim et al., 2013). It has been successfully applied in various applications, such as dictionary construction (Kim et al., 2013), word segmentation and tagging (Zeng et al., 2013), and sentiment classification (Ponomareva and Thelwall, 2012). Typically, a graph-based label propagation method consists of two main steps: graph construction and label propagation (Zeng et al., 2013). During the first step, a similarity function is required to build edges and compute weights between pairs of the nodes (Daitch et al., 2009). Some form of edge pruning is required to refine the graph (Jebara et al., 2009). After that, effective algorithms have been developed to propagate the label distributions to all the nodes (Subramanya et al., 2010;Kim et al., 2013).

The Proposed Method
In this section, we present GRAW (Graph-based Readability Assessment method using Word coupling), which constructs a coupled bag-of-words model by exploiting the correlation of readability among words. Unlike the general bag-of-words model, which models document relationships on topic, the coupled bag-of-words model extends it to model the relationships among documents on readability. In the following sections, we describe in detail how to build the coupled bag-of-words model. The model is then used in the graph-based classification framework for readability assessment.

TF-IDF (Term Frequency and Inverse Document Frequency) is the most popular scheme of the bag-of-words model. Given the set of documents D, the TF-IDF matrix M can be calculated based on the logarithmically scaled term (i.e. word) frequency (Salton and Buckley, 1988) as follows.
where f(t, d) is the number of times that a term (word) t occurs in a document d ∈ D.
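As an illustration, a TF-IDF matrix with logarithmically scaled term frequency can be sketched as follows. The exact weighting is an assumption: the sketch uses tf = log(1 + f(t, d)) and idf = log(|D| / df(t)), one common variant, and the function name `tfidf_matrix` is hypothetical. Rows are terms and columns are documents.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocab, M) where M[i][j] weights term vocab[i] in document j.

    Assumed weighting (one common variant, not confirmed by the text):
    tf = log(1 + f(t, d)), idf = log(|D| / df(t)).
    """
    counts = [Counter(doc) for doc in docs]        # f(t, d) for each document
    vocab = sorted(set().union(*counts))           # all distinct terms
    n_docs = len(docs)
    matrix = []
    for term in vocab:
        df = sum(1 for c in counts if term in c)   # number of documents containing the term
        idf = math.log(n_docs / df)
        matrix.append([math.log(1 + c[term]) * idf for c in counts])
    return vocab, matrix

docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"], ["the", "bird"]]
vocab, M = tfidf_matrix(docs)
```

Under this weighting, a term that occurs in every document (here "the") receives weight 0 everywhere, which is the usual behavior of the unsmoothed inverse document frequency.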

The Coupled Bag-of-Words Model
As shown in Figure 1, three main stages are required to construct the coupled bag-of-words model: per-sentence readability estimation, word coupling matrix construction and coupled TF-IDF matrix calculation. The following sections describe the details of these stages.

Per-Sentence Readability Estimation
Two steps are required for the per-sentence readability estimation. The first is to compute a reading score of a sentence by heuristic functions. The second is to determine the difficulty level of the sentence by discretizing the score.
Figure 1: The Framework of GRAW
Step 1. Given a sentence s, its reading difficulty can be quantified as a reading score, which is a continuous variable denoted by r(s). The more difficult s is, the greater r(s) will be. Based on text features of s, r(s) can be computed by one of the eight heuristic functions listed in Table 1, which are grouped into three aspects.

anp(s): the average number of (noun, verb, and preposition) phrases in s.
Table 1: Three aspects of estimating reading difficulty of sentences using heuristic functions
Step 2. Let η denote the pre-determined number of difficulty levels, and let r_max and r_min denote the maximum and minimum reading score respectively of all the sentences in D. To determine the difficulty level l*(s) (l*(s) ∈ [1, η]) of a sentence s, the range [r_min, r_max] is divided into η intervals, so that each interval contains the reading scores of 1/η of all the sentences. The assumption is that all the sentences are equally distributed among the difficulty levels. l*(s) will be i if the reading score r(s) resides in the i-th interval.
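The discretization in Step 2 amounts to equal-frequency binning, which can be sketched as below. The function name `difficulty_levels` and the example scores are hypothetical. As a simplification, tied scores are separated by rank, so two sentences with the same score may fall on either side of an interval boundary.

```python
def difficulty_levels(scores, eta):
    """Map each reading score r(s) to a level in 1..eta by equal-frequency binning."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])   # sentence indices by ascending score
    levels = [0] * n
    for rank, idx in enumerate(order):
        # rank * eta // n lies in 0..eta-1, so each level receives about n / eta sentences
        levels[idx] = rank * eta // n + 1
    return levels

scores = [12.0, 3.5, 7.2, 20.1, 5.0, 9.9]
levels = difficulty_levels(scores, eta=3)
```

With six sentences and η = 3, each level receives exactly two sentences: the two lowest scores map to level 1 and the two highest to level 3.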
For each of the three aspects, we compute one l*(s) for a sentence s by combining the heuristic functions using the following equations. The assumption is that the reading difficulty of a sentence may be determined by the maximum measure on the text features.

Word Coupling Matrix Construction
Let V denote the set of all the words. A word coupling matrix is defined as C* ∈ R^{|V|×|V|}, each element of which reflects the correlation between two words (i.e. terms). Two steps are required to construct this matrix. The first is to count the difficulty distributions of words, and the second is to compute the correlation between each pair of words according to the similarity of their difficulty distributions.
Step 1. Let S denote the set of all the sentences, and let p_t denote the difficulty distribution of a word (term) t. p_t is a vector containing η (i.e. the number of difficulty levels) values, the i-th component of which can be calculated by the following formula.
where n_t refers to the number of sentences in which t appears. The indicator function δ(x) returns 1 if x is true and 0 otherwise. l*(s) refers to one of the functions l_sur(s), l_lex(s) or l_syn(s).
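A minimal sketch of Step 1, assuming that the i-th component of p_t is the fraction of the sentences containing t whose difficulty level is i, which is consistent with the definitions of n_t and δ above. The function name `difficulty_distributions` and the toy sentences are hypothetical.

```python
from collections import defaultdict

def difficulty_distributions(sentences, levels, eta):
    """Return {term: [p_t(1), ..., p_t(eta)]}.

    sentences: list of token lists; levels: the level l*(s) of each sentence (1..eta).
    """
    counts = defaultdict(lambda: [0] * eta)
    for tokens, level in zip(sentences, levels):
        for term in set(tokens):               # count each sentence once per term
            counts[term][level - 1] += 1
    # n_t, the number of sentences containing t, equals the sum of its level counts
    return {t: [c / sum(row) for c in row] for t, row in counts.items()}

sentences = [["a", "cat"], ["a", "dog", "a"], ["big", "cat"], ["a", "big", "idea"]]
levels = [1, 1, 2, 3]
p = difficulty_distributions(sentences, levels, eta=3)
```

In this example, "cat" appears in one level-1 sentence and one level-2 sentence, so its distribution is [0.5, 0.5, 0.0].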
Step 2. Given two words (terms) t_1 and t_2, whose level distributions are p_{t_1} and p_{t_2} respectively, we measure the distribution difference c_KL(t_1, t_2) using the Kullback-Leibler divergence (Kullback and Leibler, 1951), computed by the following formula.
After that, the logistic function is applied to the computed difference to get the normalized distribution similarity, i.e.
Given a word t_i, only the λ other words with the highest correlation (similarity) are selected to build the neighbor set of t_i, denoted as N(t_i). If a word t_j is not selected (i.e. t_j ∉ N(t_i)), the corresponding sim(t_i, t_j) is set to 0. After that, the word coupling matrix (i.e. C*) with sim(t_i, t_j) as elements is normalized along the rows so that the sum of each row is 1. Based on the three different l*(s), we construct three word coupling matrices C_sur, C_lex and C_syn.
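The construction in Step 2 can be sketched as follows. Several details are assumptions made for the sketch rather than taken from the text: the divergence is symmetrized as the mean of the two directed KL divergences, a small additive smoothing avoids log(0), the logistic mapping is 2 / (1 + e^c) so that identical distributions get similarity 1, and each word keeps a self-similarity of 1. The names `kl` and `coupling_matrix` are hypothetical.

```python
import math

def kl(p, q, smooth=1e-6):
    """KL divergence KL(p || q), with additive smoothing to avoid log(0)."""
    p = [x + smooth for x in p]
    q = [x + smooth for x in q]
    sp, sq = sum(p), sum(q)
    return sum((a / sp) * math.log((a / sp) / (b / sq)) for a, b in zip(p, q))

def coupling_matrix(dists, lam):
    """Row-normalized coupling matrix from per-word difficulty distributions.

    dists: one distribution per word; lam: number of neighbors kept per word.
    """
    n = len(dists)
    rows = []
    for i in range(n):
        sims = [0.0] * n
        for j in range(n):
            if j == i:
                continue
            # assumed symmetric form of the distribution difference
            c = 0.5 * (kl(dists[i], dists[j]) + kl(dists[j], dists[i]))
            # assumed logistic mapping: c = 0 gives similarity 1
            sims[j] = 2.0 / (1.0 + math.exp(c))
        others = sorted((j for j in range(n) if j != i), key=lambda j: -sims[j])
        keep = set(others[:lam])                 # the lam most similar other words
        row = [sims[j] if j in keep else 0.0 for j in range(n)]
        row[i] = 1.0                             # assumption: a word is fully similar to itself
        total = sum(row)
        rows.append([x / total for x in row])    # each row sums to 1
    return rows

dists = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
C = coupling_matrix(dists, lam=1)
```

With λ = 1, the first word keeps only the second word as its neighbor, because their distributions are much closer to each other than either is to the third.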

Coupled TF-IDF Matrix Calculation
In the general bag-of-words model, the words are treated as independent of each other. However, for readability assessment, words may be correlated according to the similarity of their difficulty distributions. To improve the TF-IDF matrix M described in Section 3.1, we multiply it by the word coupling matrix C*, so that the term frequencies are shared among the highly correlated (coupled) words. We denote the coupled TF-IDF matrix as M*, obtained by the following formula.
Specifically, three homogeneous coupled TF-IDF matrices M_sur, M_lex and M_syn can be built according to the three word coupling matrices C*.
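The effect of the multiplication can be seen in a small sketch. It assumes the product is taken as C* · M with terms as rows and documents as columns, which matches the dimensions used for M* later in the text. The function name `couple` and the toy matrices are hypothetical.

```python
def couple(C, M):
    """Return M* = C · M, with C of size |V| x |V| and M of size |V| x |D|."""
    n_terms, n_docs = len(M), len(M[0])
    return [
        [sum(C[i][k] * M[k][j] for k in range(n_terms)) for j in range(n_docs)]
        for i in range(len(C))
    ]

# Words 0 and 1 are coupled with each other; word 2 is coupled only with itself.
C = [[0.5, 0.5, 0.0],
     [0.5, 0.5, 0.0],
     [0.0, 0.0, 1.0]]
# Word 0 occurs only in document 0, word 1 never occurs, word 2 occurs only in document 1.
M = [[2.0, 0.0],
     [0.0, 0.0],
     [0.0, 3.0]]
M_star = couple(C, M)
```

Here word 1 never occurs in document 0, yet it receives half of the weight of word 0 in M*, because the two words are coupled. The uncoupled word 2 keeps its original weights.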

Graph-based Readability Assessment
We employ the coupled bag-of-words model for readability assessment under the graph-based classification framework described in previous work (Zhu and Ghahramani, 2002). First, we construct a graph representing the readability relationship among documents by using the coupled bag-of-words model to compute the relations among these documents. Second, we estimate the reading levels of documents by applying label propagation on the graph.

Graph Construction
We build a directed graph G* to represent the readability relation among documents, where nodes represent documents, and edges are weighted by the similarities between pairs of documents. Given a similarity function, we link document d_i to d_j with an edge of weight G*_ij, defined as:
where N(d_i) is the set of k-nearest neighbors of d_i determined by the similarity function. Given the coupled matrix M* ∈ R^{m×|D|}, which maps each document into an m-dimensional feature space, the similarity function sim(d_i, d_j) can be defined by the Euclidean distance as follows.
where ε is a small constant to avoid zero denominators.
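A minimal sketch of the graph construction, assuming the similarity is the inverse of the Euclidean distance plus a small constant (named `eps` here), and that an edge from d_i to d_j exists only when d_j is among the k most similar documents to d_i. The function name `knn_graph` is hypothetical.

```python
import math

def knn_graph(M_star, k, eps=1e-6):
    """Directed k-nearest-neighbor graph over documents (the columns of M_star)."""
    n_docs = len(M_star[0])
    cols = [[row[j] for row in M_star] for j in range(n_docs)]   # one vector per document
    G = [[0.0] * n_docs for _ in range(n_docs)]
    for i in range(n_docs):
        sims = {
            j: 1.0 / (math.dist(cols[i], cols[j]) + eps)          # assumed inverse-distance form
            for j in range(n_docs) if j != i
        }
        for j in sorted(sims, key=sims.get, reverse=True)[:k]:    # k most similar documents
            G[i][j] = sims[j]
    return G

# Three documents in a two-term space, located at (0, 0), (1, 0) and (10, 0).
M_star = [[0.0, 1.0, 10.0],
          [0.0, 0.0, 0.0]]
G = knn_graph(M_star, k=1)
```

The resulting graph is directed: the distant third document links to the second, but the second document links only to the first.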
Merge the three graphs. As described in Section 3.2, the three coupled TF-IDF matrices lead to three different document graphs, denoted as G_sur, G_lex and G_syn respectively. To take advantage of the three aspects at once, we merge the three graphs into one, denoted as G_c.
In G_c, each node also keeps k neighbors, so some edges from the three graphs must be filtered out. The basic idea is to remove edges containing redundant information, as shown in Figure 2. For each node v, we first select the neighbors which are common to all three graphs (i.e. N_sur(v) ∩ N_lex(v) ∩ N_syn(v)). Second, from the remaining candidate nodes, which are neighbors of v in at least one graph, we select one by one the node which has the fewest common neighbors (over all three graphs) with the nodes already selected in N_c(v). The objective is to keep the number of triangles in G_c to a minimum. The edge weights of G_c are averaged over the corresponding edges that appear in the three graphs.
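The merging procedure can be sketched as a greedy selection. Two points are interpretations rather than details given in the text: the number of common neighbors between a candidate and an already selected node is counted on the union of their neighbor sets across the graphs, and an edge weight is averaged over only those graphs that contain the edge. The function name `merge_graphs` and the toy graphs are hypothetical.

```python
def merge_graphs(graphs, k):
    """Merge neighbor graphs given as dicts: graph[v] = {neighbor: weight}."""
    nodes = list(graphs[0])
    # union of each node's neighbors across all graphs, used to count shared neighbors
    union = {v: set().union(*(g[v] for g in graphs)) for v in nodes}
    merged = {}
    for v in nodes:
        common = set.intersection(*(set(g[v]) for g in graphs))
        selected = sorted(common)[:k]            # neighbors shared by all graphs come first
        candidates = union[v] - set(selected)
        while len(selected) < k and candidates:
            # choose the candidate sharing the fewest neighbors with those already selected
            best = min(
                sorted(candidates),
                key=lambda u: sum(len(union[u] & union[w]) for w in selected),
            )
            selected.append(best)
            candidates.remove(best)
        merged[v] = {}
        for u in selected:
            weights = [g[v][u] for g in graphs if u in g[v]]
            merged[v][u] = sum(weights) / len(weights)   # average over graphs containing the edge
    return merged

base = {
    "b": {"a": 1.0, "c": 1.0},
    "c": {"a": 1.0, "b": 1.0},
    "d": {"e": 1.0, "b": 1.0},
    "e": {"d": 1.0, "a": 1.0},
}
g_sur = {"a": {"b": 1.0, "c": 0.4}, **base}
g_lex = {"a": {"b": 0.8, "d": 0.6}, **base}
g_syn = {"a": {"b": 0.6, "c": 0.2}, **base}
G_c = merge_graphs([g_sur, g_lex, g_syn], k=2)
```

For node "a", the neighbor "b" is common to all three graphs and is selected first. Of the remaining candidates, "d" shares no neighbors with "b" while "c" shares one, so "d" is preferred even though "c" appears in two of the three graphs.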
Combine with the feature-based graph. Previous studies usually extract text features from documents to assess the readability using classification models. Here, we also take into consideration the feature-based graph, where similarities among documents are computed on text features. We use the features defined in (Jiang et al., 2014), excluding the model-based features since their computation depends on pre-assigned class labels, and represent a document as a vector of the feature values. We compute the similarity between any pair of documents using the Euclidean distance, and build the feature-based graph (denoted as G_f) in the same way as above.
Additionally, to take advantage of both graphs, we combine them into one (denoted as G_cf) using the following formula.

Graph Propagation
Given a graph G* constructed in the previous sections, its nodes are divided into two sets: the labeled set V_l and the unlabeled set V_u. The goal of label propagation is to propagate class labels from the labeled nodes (i.e. documents) to the entire graph.
Here, we use a simplified version of the label propagation method presented in (Subramanya et al., 2010), which has been shown to be effective (Kim et al., 2013). The method iteratively updates the label distribution on a document node using the following equation.
At the left side of Eq.10, p
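To illustrate the general idea of iterative label propagation, the sketch below uses a plain clamped propagation scheme in the spirit of (Zhu and Ghahramani, 2002). It is not the update rule of Eq. 10: labeled nodes simply keep their labels, and each unlabeled node repeatedly takes the weighted average of its neighbors' label distributions. The function name `propagate`, the iteration count and the toy graph are hypothetical.

```python
def propagate(G, labels, n_classes, n_iter=50):
    """Spread label distributions over a weighted directed graph.

    G[i][j]: weight of the edge from node i to its neighbor j.
    labels[i]: class index of node i, or None if the node is unlabeled.
    """
    n = len(G)
    p = []
    for i in range(n):
        if labels[i] is None:
            p.append([1.0 / n_classes] * n_classes)      # start from a uniform distribution
        else:
            p.append([1.0 if c == labels[i] else 0.0 for c in range(n_classes)])
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total = sum(G[i])
            if labels[i] is not None or total == 0:
                new_p.append(p[i])                       # clamp labeled nodes; skip isolated ones
                continue
            new_p.append([
                sum(G[i][j] * p[j][c] for j in range(n)) / total
                for c in range(n_classes)
            ])
        p = new_p
    return [max(range(n_classes), key=lambda c: row[c]) for row in p]

# A chain of four documents: the two ends are labeled, the two middle nodes are not.
G = [[0.0, 1.0, 0.0, 0.0],
     [1.0, 0.0, 0.1, 0.0],
     [0.0, 0.1, 0.0, 1.0],
     [0.0, 0.0, 1.0, 0.0]]
predicted = propagate(G, labels=[0, None, None, 1], n_classes=2)
```

Each unlabeled node ends up with the label of the labeled node it is most strongly connected to.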

Empirical Studies
In this section, we conduct experiments on datasets of both English and Chinese to investigate the following three research questions:
RQ1: Does the proposed method (i.e. GRAW) outperform the state-of-the-art methods for readability assessment?
RQ2: What are the effects of adding the word coupling matrix to the general bag-of-words model?
RQ3: Is the graph merging strategy effective, and can the performance be further improved by combining the feature-based graph?

Corpus and Metrics
To evaluate our proposed method, we collected two datasets. The first is CPT (Chinese primary textbook) (Jiang et al., 2014), which contains Chinese documents of six reading levels. The second is ENCT (English New Concept textbook), which contains English documents of four reading levels. Both datasets are built from well-known textbooks in which documents are labeled with grade levels by credible educationists. The details of the datasets are listed in Table 2.
We conduct experiments on both datasets using cross-validation, which randomly divides a dataset into labeled (training) and unlabeled (test) sets. The labeling proportion is varied to investigate the performance of GRAW under different circumstances. To reduce variability, for a given labeling proportion, 100 rounds of cross-validation are performed, and the validation results are averaged over all the rounds. We choose precision (P), recall (R) and F1-measure (F1) as the performance metrics.
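For reference, the three metrics can be computed per reading level as in the sketch below. The function name `per_level_prf` and the toy labels are hypothetical, and a metric with a zero denominator is set to 0 as a convention of this sketch.

```python
def per_level_prf(gold, pred, levels):
    """Return {level: (precision, recall, f1)} for gold and predicted reading levels."""
    scores = {}
    for lv in levels:
        tp = sum(1 for g, p in zip(gold, pred) if g == lv and p == lv)
        n_pred = sum(1 for p in pred if p == lv)    # documents predicted as this level
        n_gold = sum(1 for g in gold if g == lv)    # documents truly at this level
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_gold if n_gold else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[lv] = (precision, recall, f1)
    return scores

gold = [1, 1, 2, 2, 3, 3]
pred = [1, 2, 2, 2, 3, 1]
scores = per_level_prf(gold, pred, levels=[1, 2, 3])
```

Averaging the per-level values over all levels, and then over the repeated random splits, gives a single figure of the kind reported per method.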

Comparison to the State-of-the-Art Methods
To address RQ1, we implement the following readability assessment methods and compare GRAW to them:
(1) SMOG (McLaughlin, 1969) and FK (Kincaid et al., 1975) are two widely used readability formulae. We retain their core measures (i.e. text features, with the number of strokes used for Chinese instead of the number of syllables), and refit the coefficients on both datasets to match the reading (grade) levels.
(2) SUM (Collins-Thompson and Callan, 2004) is a word-based method, which trains one unigram model for each grade level, and applies model smoothing both across and within the grade levels.
(3) LR and SVM refer to two feature-based methods which incorporate the text features defined in (Jiang et al., 2014).
For GRAW, we implement label propagation on both the merged graph G_c and the final graph G_cf (Section 3.3), denoted as GRAW_c and GRAW_cf respectively. Table 3 gives the average performance per reading level achieved by the implemented methods on both datasets. Unless otherwise specified, we fix η to 3, and λ to 2800 for CPT and 2000 for ENCT. The proportion of the labeled (training) set is set to 0.7.
In Table 3, the precision, recall and F1-measure of all the seven methods are calculated per reading (grade) level on both English and Chinese datasets. The values marked in bold in each row refer to the maximum (best) measure gained by the methods.
From Table 3, the readability formulae (SMOG and FK) perform poorly on either the precision or the recall measure, and their F1-measure values are generally the poorest. Both SMOG and FK are designed for English, and have acceptable performance on the English dataset ENCT. The unigram model (SUM) performs a little better than the readability formulae. On ENCT, it has relatively good performance on grade levels 1 and 4, while on the Chinese dataset CPT, the performance is not satisfactory. The feature-based methods (LR and SVM) perform well on both ENCT and CPT, which means both the text features developed and the classifiers trained are useful. In general, GRAW_c performs better than both LR and SVM, which demonstrates the effectiveness of our method. In addition, by combining the feature-based graph (GRAW_cf), GRAW can be improved, and performs the best on all three metrics over the majority of reading levels on both datasets. The only exception is level 5 in CPT, which suggests that further improvement is needed.
We study the effect of labeling proportion on the performance of these methods on both datasets. The F1-measure averaged over the reading levels is used, since it is a good representative of the three metrics according to Table 3. Figure 3 depicts the performance trends of all the methods.
From Figure 3, neither SMOG nor FK benefits from the increasing size of the labeled set. This suggests that the performance of the readability formulae can hardly be improved by accumulating training data. The other five methods achieve better performance with larger labeled sets, and outperform the two formulae even when the labeling proportion is small. Both LR and SVM perform better than SUM, but their performance is not good when the labeling proportion is less than 0.3, especially on the Chinese dataset. On the Chinese dataset, SVM performs better than LR, while on the English dataset, the situation is reversed. Both versions of GRAW outperform the other methods over the whole range of labeling proportions on both datasets. In addition, GRAW performs well even when the labeling proportion is small. Again, by combining the feature-based graph, the performance of GRAW is consistently improved.
In summary, GRAW can outperform the state-of-the-art methods for readability assessment on both English and Chinese datasets. By combining the feature-based graph, the performance of GRAW can be further improved.

Effects of the Word Coupling Matrix
For RQ2, we first compare the coupled bag-of-words model to the general model in the process of graph construction. Four graphs are built by using each of the three word coupling matrices (i.e. M_sur, M_lex and M_syn) and the TF-IDF matrix respectively. Label propagation is applied on each graph to predict the reading levels of unlabeled documents. The labeling proportion is varied from 0.1 to 0.9 on both the English and Chinese datasets. Figure 4(a) depicts the average F1-measure obtained from the four graphs.
From Figure 4(a), the three word coupling matrices greatly outperform the TF-IDF matrix, especially on the Chinese dataset. This demonstrates that the word coupling matrices are very effective in improving the performance of the general bag-of-words model for readability assessment.
Second, we investigate the performance of the four matrices per reading level. Figure 4(b) depicts the recall rate per reading level of the four corresponding graphs in bar charts. The labeling proportion is set to 0.7. The recall rate is used because it makes evident why the TF-IDF matrix performs poorly. From Figure 4(b), on the Chinese dataset, nearly all the unlabeled documents are classified as level 1 by the TF-IDF matrix, in which the word frequencies are too few to discriminate meaningfully among the reading levels. On the English dataset, the TF-IDF matrix performs better, but still tends to classify documents into lower levels.
As described in Section 3.2.2, η (the number of difficulty levels of sentences) and λ (the number of neighbors retained for each word) are two parameters in building the word coupling matrices. To investigate their effects on the performance of the built matrices, we vary the values of both η and λ, and compute the average F1-measure on the two datasets. Figure 4(c) depicts the results in line charts, where η varies from 2 to 9 in steps of 1, while λ varies from 400 to 4000 in steps of 400 on Chinese and from 200 to 2000 in steps of 200 on English (the difference is due to the dissimilar number of documents between the two datasets). The three word coupling matrices exhibit similar behavior during the experiments; hence, only M_syn is depicted.
From Figure 4(c), a small η (e.g. 2 or 3) is good on the Chinese dataset. However, on the English dataset, η = 2 leads to the poorest performance. It seems that increasing η causes fluctuating performance, and the trend is further complicated when λ is involved. Overall, η = 3 is a preferable option on both datasets. For λ, most of the lines exhibit a similar trend that rises first and then remains stable on both datasets, although some may drop when λ is too large. This suggests that selecting a relatively large number of other words as the neighbors of a word (i.e. λ = 2800 on the Chinese dataset and λ = 2000 on the English dataset) yields an effective word coupling matrix.
The word coupling matrix constructed in GRAW uses the whole corpus of either English or Chinese. To investigate whether the corpus size affects the performance of GRAW, we vary the proportion of the corpus used by randomly removing documents from each reading level. From Figure 4(d), on the Chinese dataset, the performance of GRAW suffers little from removing documents, even if only 20% of the documents are left for building the word coupling matrix. However, on the English dataset, the mean performance drops sharply and the deviation increases markedly. This suggests that accumulating a sufficiently large corpus is required for building a suitable word coupling matrix in GRAW, and that factors other than the number of documents may influence the corpus quality, which deserves further study.
In summary, the word coupling matrix plays an essential role in GRAW. For building a suitable word coupling matrix, the number of difficulty levels of sentences (η) can be set to 3, and a relatively large number of other words should be selected as the neighbors of a word. A sufficient corpus is required for refining the matrix.

Effectiveness of Graph Combination
For RQ3, we compare graphs built on each single word coupling matrix (i.e. M_sur, M_lex and M_syn) to the merged graph (i.e. GRAW_c) and the combined graph (i.e. GRAW_cf). Figure 5 depicts the average F1-measure obtained after applying label propagation on these graphs with the labeling proportion varied from 0.1 to 0.9. The feature-based graph (i.e. G_f) is also depicted for comparison.
From Figure 5, the merged graph GRAW_c outperforms the three basic graphs on both datasets in most cases. Among the three, M_syn performs best, especially on the English dataset, where it can outperform GRAW_c slightly when the labeling proportion is small (0.2 to 0.4). By combining the feature-based graph, GRAW_cf performs even better on both datasets, although G_f performs the poorest among all the graphs. In summary, the graph merging strategy is effective, and by combining the feature-based graph, the performance of GRAW can be improved. This demonstrates the potential of GRAW.

Conclusion
In this paper, we propose a graph-based readability assessment method using word coupling. The coupled bag-of-words model is designed, which exploits the correlation of readability among the words, and by applying TF-IDF, models the relationship among documents on reading levels. The model is employed in the graph-based classification framework for readability assessment, which involves graph building, merging, and label propagation. Experiments are conducted on both Chinese and English datasets. The results show that our method can outperform the commonly used methods for readability assessment. In addition, the evaluation demonstrates the potential of the coupled bag-of-words model and the graph combination/merging strategies.
In our future work, we plan to verify the soundness of the results by applying our method to large-volume corpora of both English and Chinese. In addition, we will investigate other ways of computing the word coupling matrices, such as incorporating word coherency or semantics, and develop efficient merging strategies which can be used for training classification models, as well as for building graphs.