Graph-based Coherence Modeling For Assessing Readability

Readability depends on many factors, ranging from shallow features like word length to semantic ones like coherence. We introduce novel graph-based coherence features based on frequent subgraphs and compare their ability to assess the readability of Wall Street Journal articles. In contrast to Pitler and Nenkova (2008), some of our graph-based features are significantly correlated with human judgments. We outperform Pitler and Nenkova (2008) in the readability ranking task by more than 5% accuracy, thus establishing a new state-of-the-art on this dataset.


Introduction
Readability depends on many factors which enable readers to process a text. These factors can be used by readability assessment methods to quantify the difficulty of text understanding. Possible applications of readability assessment are automatic text summarization and simplification systems. Measuring readability can also be used in question answering and knowledge extraction systems to prune texts with low readability (Kate et al., 2010).
Many different text features have been used to assess readability. They include shallow features (Flesch, 1948; Kincaid et al., 1975), language modeling features (Si and Callan, 2001; Collins-Thompson and Callan, 2004), syntactic features (Schwarm and Ostendorf, 2005) and text flow or coherence (Barzilay and Lapata, 2008; Pitler and Nenkova, 2008). In a coherent text each sentence has some connections with other sentences. Although these local connections make the text more readable, the corresponding coherence features used by Pitler and Nenkova (2008) (Section 2) are not strongly correlated with human judgments.
The main goal of this paper is to introduce novel graph-based coherence features for assessing readability. To achieve this goal, we use the entity graph coherence model by Guinaudeau and Strube (2013) (Section 3.1.1) and follow two ideas. The first main idea is to use a graph representation of the rhetorical relations between the sentences of a text (Section 3.1.2) and to merge the entity graph and the rhetorical graph (Section 3.1.3). Hence we enrich the entity graph and consequently consider the distribution of two aspects of coherence (i.e. entities and discourse relations) simultaneously. The second main idea is to apply subgraph mining algorithms to find frequent subgraphs (i.e. patterns) in texts (Section 3.2). Subgraph mining has been successfully applied to other tasks, e.g. image processing (Nowozin et al., 2007) and language modeling (Biemann et al., 2012). We hypothesize that text coherence correlates with frequent subgraphs (vaguely reminiscent of coherence patterns; Daneš, 1974) and that the mined patterns are good predictors for readability ratings.
Our study is novel in introducing new and informative graph-based coherence features. We examine the predictive power of these features in two experiments: first, readability rating prediction, and second, ranking texts according to their readability (Section 5).

Figure 1: The entity graph representation of the text in Table 1. Dark entities are shared by the sentences.

Readability Assessment
The quality of a text depends on different factors which make the text easier to read. These factors range from shallow features like word length to semantic features like coherence. Readability assessment involves two problems: distinguishing the readability levels of texts, and predicting human readability ratings. Pitler and Nenkova (2008) use all entity transitions of the entity grid model (Barzilay and Lapata, 2008) as coherence features. They compute the correlation between these features and readability ratings and show that none of them is significantly correlated with human readability judgments. Indeed, none of these features on its own is a good predictor of either coherence or readability.

Method
We introduce graph representations of a text and propose to use these graphs to model coherence.

Entity Graph
Guinaudeau and Strube (2013) describe a graph-based version of the entity grid (Barzilay and Lapata, 2008) which models the interaction between entities and sentences as a bipartite graph. This graph contains two sets of nodes: sentences and entities. Sentence and entity nodes are connected if and only if the entity is mentioned in the sentence (Figure 1). Edges are weighted according to the grammatical role of the entity mentioned in the sentence. Guinaudeau and Strube (2013) model entity transitions between sentences via a one-mode projection of the entity graph. The one-mode projection is a graph consisting of sentence nodes that are connected if and only if they have at least one entity in common in the entity graph. One-mode projections are directed as they follow the text order. Hence, backward edges never occur. Guinaudeau and Strube (2013) introduce three kinds of projections. The unweighted projection P_ER_u models the existence of entity connections between sentences. The weighted projection P_ER_w uses the number of entities shared by two sentences as the weight of the corresponding edge (Figure 2). P_ER_acc takes the grammatical function of entities in sentences into account as edge weights. Guinaudeau and Strube (2013) show that P_ER_acc does not perform well for readability assessment. It does not outperform P_ER_w in our experiments either. Thus, we do not explain further details of P_ER_acc here.

Figure 2: P_ER_u: unweighted, and P_ER_w: weighted projection graphs. In the weighted projection all edge weights are equal to one, because all sentences share one entity.
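As a concrete illustration, the weighted one-mode projection can be sketched in a few lines of Python. The input format (one set of stemmed entity mentions per sentence) and the function names are our own illustrative assumptions, not the authors' implementation:

```python
from collections import defaultdict
from itertools import combinations

def entity_graph(sentences):
    """Bipartite side of the model: map each entity to the set of
    sentence indices that mention it. `sentences` is a list of sets
    of (stemmed) entity mentions, one set per sentence."""
    ent2sent = defaultdict(set)
    for i, ents in enumerate(sentences):
        for e in ents:
            ent2sent[e].add(i)
    return ent2sent

def projection_w(sentences):
    """Weighted one-mode projection P_ER_w: a directed edge (i -> j)
    for i < j, weighted by the number of entities the two sentences
    share. Backward edges never occur because i < j follows text order."""
    edges = {}
    for i, j in combinations(range(len(sentences)), 2):
        shared = len(sentences[i] & sentences[j])
        if shared:
            edges[(i, j)] = shared
    return edges

# Toy text: three sentences, each a set of stemmed nouns
sents = [{"court", "judge"}, {"judge", "ruling"}, {"ruling"}]
print(projection_w(sents))  # {(0, 1): 1, (1, 2): 1}
```

Only forward edges are generated, so the projection is a DAG over sentence positions, which is what makes the subgraph patterns of Section 3.2 enumerable.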

Discourse Relation Graph
Lin et al. (2011) and Lin (2011) use Rhetorical Structure Theory (RST) to describe and model coherence by considering the transitions between discourse relations. Inspired by the entity grid, they expand the relation sequence into a two-dimensional matrix whose rows and columns are sentences and entities, respectively. The cell (s_i, e_j) corresponds to the set of discourse relations that entity e_j is involved in in sentence s_i. These methods are based on entity transitions, which, however, are intuitively implausible here, because discourse relations connect sentences (or elementary discourse units), not entities.
Since discourse relations capture interactions between sentences (Table 2), we model these relations with a graph.

Relation              Arg1  Arg2
Implicit Expansion    S1    S2
Explicit Comparison   S2    S2
Implicit Expansion    S2    S3
Implicit Temporal     S3    S4
Implicit Contingency  S4    S5

Table 2: PDTB-style discourse relations (Prasad et al., 2008) of the sample text in Table 1.

A discourse relation graph is P_DR_u = (V, R), where V is the set of sentence nodes and R is the edge set which represents all discourse relations in the text. Two sentence nodes are adjacent if and only if they are connected by at least one discourse relation. Intra-sentential discourse relations are represented as self-edges. We define P_DR_w as a weighted discourse relation graph whose edge weights are the numbers of discourse relations between sentence nodes (Figure 3).
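The discourse relation graph P_DR_w has an equally compact sketch; representing the relations as a list of (Arg1-sentence, Arg2-sentence) index pairs is our own illustrative assumption:

```python
from collections import defaultdict

def discourse_relation_graph(relations):
    """P_DR_w: edge weight = number of discourse relations between two
    sentence nodes; intra-sentential relations become self-edges.
    `relations` is a list of (arg1_sentence, arg2_sentence) index pairs."""
    edges = defaultdict(int)
    for s1, s2 in relations:
        edges[(s1, s2)] += 1
    return dict(edges)

# Relations of the sample text in Table 2, with sentences 0-indexed;
# (1, 1) is the intra-sentential Explicit Comparison in S2
rels = [(0, 1), (1, 1), (1, 2), (2, 3), (3, 4)]
print(discourse_relation_graph(rels))  # each relation contributes weight 1 here
```

Dropping the weights (keeping only the edge set) yields P_DR_u.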

Combined Entity and Discourse Relation Graphs

Projection graphs and discourse relation graphs represent different types of connections. These graphs can be merged by employing basic operators.
We use the ∨ operator (logical OR) to combine the projection graph P_ER_u with the P_DR_u graph. The ∨ operator takes two sentence nodes and creates an edge between them if they are connected by at least one connection, whether an entity transition (P_ER_u) or a discourse relation (P_DR_u). The other basic logical operators (e.g. ∧ or ⊕) lose connections; hence we do not report on their performance. Inspired by linear regression models, we combine the weighted graphs by adding (+) the edge weights in P_ER_w and P_DR_w (Figure 4).
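A minimal sketch of the two combination operators, assuming graphs are represented as edge dictionaries (our own representation, not the authors'):

```python
def combine_or(g1, g2):
    """∨ combination of unweighted graphs: an edge exists in the merged
    graph if it exists in either input graph."""
    return set(g1) | set(g2)

def combine_sum(g1, g2):
    """+ combination of weighted graphs: add the edge weights of
    P_ER_w and P_DR_w; edges present in only one graph keep their weight."""
    edges = dict(g1)
    for e, w in g2.items():
        edges[e] = edges.get(e, 0) + w
    return edges

per_w = {(0, 1): 2, (1, 2): 1}   # entity projection
pdr_w = {(0, 1): 1, (2, 3): 1}   # discourse relation graph
print(combine_sum(per_w, pdr_w))  # {(0, 1): 3, (1, 2): 1, (2, 3): 1}
```

Note how ∧ (intersection) would drop the edges (1, 2) and (2, 3) entirely, which is why we do not use it.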

Coherence Features
We use the proposed graphs to introduce novel coherence features.

Average outdegree. Measures to what extent a sentence is connected with other sentences (Guinaudeau and Strube, 2013):

    AvgOutDegree = (1 / S) * Σ_s OutDegree(s)

where OutDegree(s) is the sum of the weights associated with edges that leave node s and S is the number of sentences in the text.

Number of components. The projection graph can be disconnected. A graph is disconnected if there are at least two nodes which are not reachable from each other (like s_1 and s_2 in Figure 2). A maximal non-empty connected subgraph of a graph is called a component. Each projection graph in Figure 2 contains two components. Intuitively, the projection graphs of a more coherent text should contain fewer components. The outdegree does not capture this type of connectivity: in Figure 5, e.g., the average outdegree of the two graphs is equal, while the left graph contains more components and should therefore be less coherent.

Frequent subgraphs. We hypothesize that particular coherence patterns correlate with readability. These patterns are encoded as subgraphs. An advantage is that coherence can be measured beyond simple sentence or node connectivity. We first define the graph concepts employed.
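Before turning to the subgraph definitions, the first two features can be sketched as follows. The union-find component counter and the convention that isolated sentence nodes count as singleton components are our own implementation choices:

```python
from collections import defaultdict

def average_outdegree(edges, n_sentences):
    """Sum of outgoing edge weights per node, averaged over all sentences."""
    out = defaultdict(float)
    for (i, j), w in edges.items():
        out[i] += w
    return sum(out.values()) / n_sentences

def n_components(edges, n_sentences):
    """Number of connected components of the undirected view of the graph,
    counting isolated sentence nodes as singleton components."""
    parent = list(range(n_sentences))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for i, j in edges:
        parent[find(i)] = find(j)
    return len({find(i) for i in range(n_sentences)})

edges = {(0, 1): 1, (1, 2): 2}      # sentence 3 is isolated
print(average_outdegree(edges, 4))  # 0.75
print(n_components(edges, 4))       # 2
```

The example shows why the two features are complementary: adding an edge (0, 2) would raise the average outdegree but leave the component count unchanged.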
Isomorphic. Two graphs G and G′ are isomorphic if they fulfill two conditions: there is a one-to-one correspondence between the nodes of G and those of G′, and two nodes of G are connected if and only if their associated nodes in G′ are connected.
Subgraph. Graph G′ is a subgraph of graph G if G′ is isomorphic to a graph whose nodes and edges are in G.
k-node subgraph. A subgraph with k nodes is called k-node subgraph.
Induced subgraph. Graph G′ is an induced subgraph of graph G if G′ is a subgraph of G whose nodes are connected by all edges which connect the corresponding nodes in G (Figure 6). We always mean induced subgraphs when using the term subgraph.
Frequent subgraph & minimum support. Let ζ = {G_1, G_2, ..., G_n} be a database of n graphs. For each subgraph sg, support(sg) denotes the number of graphs in ζ which contain sg as a subgraph. A subgraph sg is a frequent subgraph if and only if support(sg) > λ, where λ is called the minimum support.

Graph signature. Given a set of frequent subgraphs {sg_1, ..., sg_m}, the graph signature of a graph G is the vector Φ(G) = (φ(sg_1, G), ..., φ(sg_m, G)) of relative frequencies φ(sg_i, G) = count(sg_i, G) / Σ_j count(sg_j, G), where count(sg_i, G) is the number of occurrences of sg_i in graph G. We use the relative frequency φ(sg_i, G) because it makes graphs with different numbers of nodes and edges comparable.
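A sketch of the graph signature computation; normalizing each count by the total count over all mined subgraphs is our reading of "relative frequency":

```python
def graph_signature(subgraph_counts):
    """Relative frequency of each mined subgraph in one graph G.
    `subgraph_counts` maps a subgraph id to count(sg_i, G); phi divides
    by the total count so that graphs with different numbers of nodes
    and edges become comparable."""
    total = sum(subgraph_counts.values())
    if total == 0:
        return {sg: 0.0 for sg in subgraph_counts}
    return {sg: c / total for sg, c in subgraph_counts.items()}

print(graph_signature({"sg1": 3, "sg2": 1, "sg3": 0, "sg4": 0}))
# {'sg1': 0.75, 'sg2': 0.25, 'sg3': 0.0, 'sg4': 0.0}
```

Each element φ of the resulting signature is used as one coherence feature of the graph.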
Subgraph features are divided into two categories: basic subgraphs and frequent large subgraphs.
Basic subgraphs. Instead of frequent subgraphs, all possible 3-node subgraphs (Figure 7) are used as basic subgraphs, because they are the smallest meaningful subgraphs that can model coherence patterns. Because backward edges never occur in one-mode projections, only four subgraphs are feasible (Figure 8).
We interpret these subgraphs as follows. Node labels indicate the order of sentences: sentence s_t occurs before sentence s_u, and sentence s_u occurs before sentence s_v (i.e. t < u < v).

• sg_1: The connection between a sentence and subsequent ones. In other words, at least two entities are mentioned in one sentence and the subsequent sentences are about these entities.

• sg_2: Indicates that entities in s_t and s_u get connected to each other in s_v.

• sg_3: Each sentence tends to refer to the most prominent entity (focus of attention) in preceding sentences (Sidner, 1983; Grosz et al., 1995). The absence of a connection between s_t and s_v indicates that the entity connecting s_t and s_u is different from the entity connecting s_u and s_v. Therefore this subgraph approximately corresponds to a shift of the focus of attention.

• sg_4: Merges sg_1 and sg_3 and represents all connections of these two subgraphs.
We use these feasible 3-node subgraphs and compute the graph signature Φ of each G ∈ ζ. We propose each φ ∈ Φ (i.e. the relative frequency of each subgraph in G) as a connectivity feature of graph G to measure text coherence.
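A sketch of counting the four feasible 3-node induced subgraphs in a projection; the assignment of edge sets to sg_1 through sg_4 follows our reading of the descriptions above:

```python
from itertools import combinations

def threenode_signature(edges, n):
    """Count each feasible 3-node induced subgraph in a directed
    projection with forward edges only. For node labels t < u < v:
      sg1 = {(t,u), (t,v)}          one sentence linked to two later ones
      sg2 = {(t,v), (u,v)}          two sentences get connected in s_v
      sg3 = {(t,u), (u,v)}          chain, no (t,v) edge (focus shift)
      sg4 = all three edges         merge of sg1 and sg3."""
    E = set(edges)
    patterns = lambda t, u, v: {
        frozenset([(t, u), (t, v)]): "sg1",
        frozenset([(t, v), (u, v)]): "sg2",
        frozenset([(t, u), (u, v)]): "sg3",
        frozenset([(t, u), (t, v), (u, v)]): "sg4",
    }
    counts = {"sg1": 0, "sg2": 0, "sg3": 0, "sg4": 0}
    for t, u, v in combinations(range(n), 3):
        present = frozenset(e for e in [(t, u), (t, v), (u, v)] if e in E)
        name = patterns(t, u, v).get(present)
        if name:                      # 0- or 1-edge triples match no pattern
            counts[name] += 1
    return counts

print(threenode_signature({(0, 1), (1, 2), (0, 2)}, 3))
# {'sg1': 0, 'sg2': 0, 'sg3': 0, 'sg4': 1}
```

Feeding these counts into the signature computation yields the four connectivity features.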
Frequent large subgraphs. Since we observe a strong correlation between basic subgraphs and human readability ratings (Table 4), we mine frequent large subgraphs of projection graphs. Our intuition is that larger subgraphs are more informative coherence patterns. Hence, we extend the coherence features from all feasible 3-node subgraphs to frequent k-node subgraphs. We first use an efficient subgraph mining algorithm to extract all subgraphs of size k and then compute the count of each subgraph as an induced subgraph in each graph G ∈ ζ. We retain a subgraph sg if it is frequent (i.e. support(sg) > λ). The result of these steps is a two-dimensional matrix whose rows represent the graphs in ζ and whose columns represent the frequent subgraphs of size k. The cell (G_i, sg_j) shows the count of sg_j in graph G_i. Given this matrix, we compute the graph signature of each G ∈ ζ and take each element of the graph signature as a coherence feature.
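The minimum-support filtering step (but not the mining algorithm itself) can be sketched as follows, assuming the count matrix is a list of per-graph dictionaries:

```python
def frequent_subgraphs(count_matrix, min_support):
    """Keep the subgraph columns whose support, i.e. the number of graphs
    containing the subgraph at least once, exceeds the minimum support
    lambda. `count_matrix[g][sg]` is the count of subgraph sg in graph g."""
    subgraphs = {sg for row in count_matrix for sg in row}
    keep = []
    for sg in sorted(subgraphs):
        support = sum(1 for row in count_matrix if row.get(sg, 0) > 0)
        if support > min_support:
            keep.append(sg)
    return keep

# Three graphs, two candidate subgraphs "a" and "b"
matrix = [{"a": 2, "b": 0}, {"a": 1, "b": 1}, {"a": 0, "b": 3}]
print(frequent_subgraphs(matrix, 1))  # ['a', 'b']  (both have support 2)
```

With λ = 0, as used in our experiments, every subgraph that occurs at least once in some graph survives the filter.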

Data
We use the dataset created by Pitler and Nenkova (2008) which consists of randomly selected articles from the Wall Street Journal corpus. The articles were rated by three humans on a scale from 1 to 5 for readability based on quality measures that are designed to estimate the coherence of articles. The final readability score of each article is the average of these three ratings.
We exclude three files from this dataset: wsj-0382 does not exist in the Penn Treebank (Marcus et al., 1994), wsj-2090 does not exist in the Penn Discourse Treebank (Prasad et al., 2008), and wsj-1398 is a poem.

Settings
Entity graph. We use the gold parse trees of the Penn Treebank (Marcus et al., 1994) to extract all nouns in a document as mentions. We consider nouns with identical stems as coreferent. We divide the edge weight between two sentence nodes s_i and s_j by their distance j - i to decrease the importance of links between non-adjacent sentences.

Discourse relation graph. We use gold PDTB-style discourse relations (Prasad et al., 2008). We filter out EntRel and NoRel relations.

Number of components. For counting the number of components in each projection graph, the SageMath package is used. This feature is computed on unweighted projections (i.e. P_ER_u).

Frequent subgraphs. Since subgraph mining is an NP-complete problem, different algorithms have been introduced to improve its performance. We use the gSpan algorithm (Yan and Han, 2002) to mine subgraphs of a graph database which contains the P_ER_u projections. An advantage of using efficient subgraph mining algorithms is that we can exhaustively search very large subgraph spaces. A graph with E edges potentially has O(2^E) subgraphs; having sparse graphs and using an efficient subgraph mining algorithm lets us search through this space. We mine subgraphs with k = 4 and λ = 0 (Figure 9).

Figure 9: Frequent subgraphs with four nodes, where t < u < v < w.
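The distance normalization of edge weights described above can be sketched as:

```python
def distance_weighted(edges):
    """Down-weight links between non-adjacent sentences: each edge (i, j)
    keeps its weight divided by the sentence distance j - i, so adjacent
    sentences (distance 1) are unaffected."""
    return {(i, j): w / (j - i) for (i, j), w in edges.items()}

# Edge (0, 3) spans three sentences, so its weight 3 shrinks to 1.0
print(distance_weighted({(0, 1): 2, (0, 3): 3}))  # {(0, 1): 2.0, (0, 3): 1.0}
```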

Evaluation
We evaluate on the following benchmark tasks.

Readability assessment. We use the Pearson correlation coefficient to find features correlated with readability scores. It takes the feature values and readability scores of all articles and returns −1 ≤ ρ ≤ +1. A high value of |ρ| indicates a strong correlation. We report statistical significance at the 0.05 level; significant results are written in bold face in Section 5.

Readability as ranking. We rank texts pairwise with respect to their readability. We define a classification problem with a set of text pairs and a label which indicates whether the first text in a pair is more readable. We use every two texts whose human readability scores differ by at least 0.5. Each text is represented by its graph-based coherence features. We employ WEKA's linear support vector implementation (SMO) to classify the pairs. Performance is evaluated using 10-fold cross-validation.
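The pairwise reduction of ranking to classification can be sketched as follows; representing a pair by the difference of the two texts' feature vectors is a common encoding, not necessarily the exact one used with WEKA's SMO here:

```python
from itertools import combinations

def make_ranking_pairs(features, scores, min_diff=0.5):
    """Build classification instances from every pair of texts whose human
    readability scores differ by at least `min_diff`. Each instance is the
    difference of the two texts' coherence feature vectors, labeled with
    whether the first text is more readable."""
    pairs = []
    for i, j in combinations(range(len(scores)), 2):
        if abs(scores[i] - scores[j]) >= min_diff:
            diff = [a - b for a, b in zip(features[i], features[j])]
            pairs.append((diff, scores[i] > scores[j]))
    return pairs

# Hypothetical coherence features and human readability scores
feats = [[0.9, 0.1], [0.2, 0.6], [0.5, 0.5]]
human = [4.5, 2.0, 4.3]
print(make_ranking_pairs(feats, human))
# two pairs survive the 0.5 threshold: (text0, text1) and (text1, text2)
```

Any binary classifier, such as a linear SVM, can then be trained on these instances under cross-validation.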

Results
Readability assessment. We report the correlation of our coherence models encoded in graph features and compare them with Guinaudeau and Strube's (2013) entity graph as the state-of-the-art coherence model. Pitler and Nenkova (2008) show that the entity transition features extracted from the entity grid model (Barzilay and Lapata, 2008) on their own do not significantly predict human readability ratings, so we do not describe their results here. The results for the outdegree feature are shown in Table 3. The average outdegree of P_ER_w is highly correlated with human readability ratings. This confirms the readability results of Guinaudeau and Strube (2013) on the Encyclopedia Britannica dataset. The outdegrees of discourse relation graphs are more strongly correlated with human readability ratings than the outdegree of the projections in the entity graph, suggesting that an efficient graph-based encoding of discourse relations can measure readability well. The outdegree of the combined graph P_ER_w + P_DR_w is highly correlated, showing that the interaction of entity connections and discourse relations is important for text coherence. However, none of the outdegree measures in this table is significantly correlated with human readability ratings, confirming the intuition that outdegree only measures node connectivity in graphs, which is not enough to measure readability.

Table 4 shows the correlation of two features of projections. (Although the proposed features can be applied to all kinds of graphs presented here, we evaluate them, except outdegree, only on projections of the entity graph model and leave the application to the other graph representations for future work.) The number of components has a strong and significant negative correlation with human readability ratings, supporting Karamanis et al. (2009) and suggesting that simple properties of graphs measure text coherence. The lower part of Table 4 shows the correlation of the relative frequencies of 3-node subgraphs (see Figure 8). More readable articles have many sg_1 and few sg_2 patterns. Pattern sg_3 is significantly and negatively correlated with human readability judgments, confirming the intuition that many shifts in the focus of attention make texts difficult to read.

Table 5 shows the correlation between the relative frequencies of 4-node subgraphs and readability ratings. First, most subgraphs with fewer than four edges are negatively correlated with readability, except sg_20 and sg_24, which are weakly correlated with readability. Few connections between sentences make the text difficult to read.

Second, the highest positive and significant correlation of sg_12 and the most negatively correlated subgraph sg_11 show that different patterns of edges in subgraphs capture readability judgments. Stoddard (1991, p. 29) explains this by the ambiguity node phenomenon: "[...] in some cases, there may be more than one logical, possible node for a given cohesive element in a text, in which case, a reader may see the resulting ambiguity but not be able to decide between the choices". E.g., in sg_11 a reader may have to make a decision about the focus of attention in s_w, while in sg_12 the focus of attention of s_w is the same as the focus of attention of s_t. This phenomenon can also be observed in all positively correlated subgraphs. If readers have to return to one point in the text, they prefer to return to a sentence which is the core of the preceding sentences. However, we should refrain from reading too much into these patterns. Finally, we conclude that all strongly negatively correlated subgraphs suffer either from edge shortage or from the ambiguity node phenomenon, like sg_7.

Considering the correlation of 3-node subgraphs in Table 4 and of 4-node subgraphs in Table 5, two results are noticeable. First, among the large subgraphs there are more strongly correlated subgraphs than among the 3-node subgraphs, confirming our hypothesis that larger subgraphs convey coherence patterns of higher quality. Second, sg_12 among the 4-node subgraphs is more strongly and positively correlated than sg_4 among the 3-node subgraphs, because sg_12 captures more of the context of s_t. The relative frequency of sg_12 is thus more informative than that of sg_4.

Readability as ranking. Results of the readability ranking problem are shown in Table 6. Baseline features are the entity transition features used as coherence features by Pitler and Nenkova (2008). When classifying with graph signatures based on basic subgraphs, accuracy is lower than with the baseline coherence features. This is probably because the entity grid features represent grammatical role transitions of entities, while the basic subgraphs only model the occurrence of entities across sentences. Graph signatures based on large subgraphs improve over basic subgraphs by around 10% accuracy. This high accuracy verifies that larger subgraphs capture coherence patterns of high quality. Combining basic (3-node) and large (4-node) subgraphs does not improve over the large subgraph features alone, probably because basic subgraphs are implicitly included in the larger subgraphs. The combination of the coherence baseline features and frequent large subgraphs improves the accuracy further.

Related Work
There is a research tradition of developing metrics for readability and using these metrics to quantify how difficult it is to understand a document. Shallow features such as word, sentence and text length, which only capture superficial properties of a text, have been used traditionally (Flesch, 1948; Kincaid et al., 1975). De Clercq et al. (2014) use traditional shallow features and apply them to a new corpus annotated with two different methodologies. However, some studies indicate that shallow features do not precisely predict the readability of a text (Feng et al., 2009; Petersen and Ostendorf, 2009). Later studies introduce deeper (more semantic) features such as those obtained from language models (Si and Callan, 2001; Collins-Thompson and Callan, 2004) and syntactic features like the number of NPs in a sentence or the height of the sentence's parse tree (Schwarm and Ostendorf, 2005; Heilman et al., 2007). Barzilay and Lapata (2008) propose an entity-based coherence model which operationalizes some of the intuitions behind the centering model (Grosz et al., 1995). Although this model works well on the sentence ordering and summary coherence rating tasks, it does not work well for readability assessment. Only when combined with features taken from Schwarm and Ostendorf (2005) does the entity grid perform competitively.
While most of these studies predict the readability level of documents, Pitler and Nenkova (2008) present a new readability dataset of Wall Street Journal articles, where each article is assigned human readability ratings. They analyze the correlation between different readability features and human readability scores and show that there is no significant correlation between entity transition features and readability scores. In contrast to them, we are able to report a statistically significant correlation between some entity-based features and human readability ratings.

Conclusions
We proposed graph-based coherence features based on the notion of frequent subgraphs. We analyzed these features on the dataset created by Pitler and Nenkova (2008) which associates human readability ratings with each document. We have shown that frequent subgraphs represent coherence patterns in a text. Larger subgraphs obtain a high and statistically significant correlation with human readability ratings.
Pitler and Nenkova (2008) did not achieve statistically significant (positive or negative) correlations between their features derived from the entity grid and human readability ratings. In contrast, some of our automatically induced subgraphs have a strong, statistically significant correlation. We also outperform Pitler and Nenkova (2008) in the readability ranking task by more than 5% accuracy, thus establishing a new state-of-the-art on this dataset. We conclude that the graph-based representation (Guinaudeau and Strube, 2013) is a better and more informative starting point for assessing readability.
In future work, we plan to induce common subgraphs and apply our method to different datasets (e.g. the dataset created by De Clercq et al. (2014)) combined with other readability features (Schwarm and Ostendorf, 2005).