Evaluating text coherence based on semantic similarity graph

Coherence is a crucial feature of text because it is indispensable for conveying the communication purpose and meaning of the text to its readers. In this paper, we propose an unsupervised text coherence scoring method based on the construction of a graph in which edges are established between semantically similar sentences represented by vertices. The sentence similarity is calculated from the cosine similarity of semantic vectors representing the sentences. We provide three graph construction methods that establish an edge from a given vertex to a preceding adjacent vertex, to a single similar vertex, or to multiple similar vertices. We evaluated our methods on the document discrimination task and the insertion task by comparing them to a supervised baseline (Entity Grid) and an unsupervised baseline (Entity Graph). In the document discrimination task, our method outperformed the unsupervised baseline but not the supervised one, while in the insertion task, our method outperformed both baselines.


Introduction
Coherence plays an important role in a text because it enables the text to convey its communication purpose and meaning to its readers (Bamberg, 1983; Grosz and Sidner, 1986). Coherence also decreases reading time, as a more coherent text is easier to read and imposes less cognitive load on its readers (Todirascu et al., 2016). While there is no single agreed definition of coherence, we can compile several definitions and note its important aspects.
First, a text is coherent if it can convey its communication purpose and meaning to its readers (Wolf and Gibson, 2005; Somasundaran et al., 2014; Feng et al., 2014). Second, a text needs to be integrated as a whole, rather than being a series of independent sentences (Bamberg, 1983; Garing, 2014). That is, the sentences in the text are centralised around a certain theme or topic, and are arranged in a particular order in terms of logical, spatial, and temporal relations. Third, the sentences in a coherent text are related to each other (Halliday and Hasan, 1976; Grosz and Sidner, 1986; Mann and Thompson, 1988; Wolf and Gibson, 2005). This suggests that a text exhibits discourse/rhetorical relations and cohesion. Fourth, text coherence is greatly influenced by the presence of a certain organisation in the text (Persing et al., 2010; Somasundaran et al., 2014). The organisation helps readers anticipate the upcoming textual information. Although a well-organised text is highly likely to be coherent, organisation alone does not constitute coherence: textual organisation concerns the structural formation and logical development of a text, while lexical and semantic continuity is also indispensable for a coherent text (Feng et al., 2014). Fifth, a coherent text is easier to read than its less coherent counterpart (Garing, 2014). Thus, when writing a text, it is not enough to revise it only through careful editing and proofreading of the lexical or grammatical aspects; the coherence aspect should also be taken into account (Bamberg, 1983; Garing, 2014).
There are studies on computational modelling of text coherence based on the supervised learning approach, such as the Entity Grid model (Barzilay and Lapata, 2008), which has been further extended into the Role Matrix model (Lin et al., 2011; Feng et al., 2014). However, these models have a few drawbacks. First, the Entity Grid using co-reference resolution has a bias towards the original ordering of a text when comparing the text with its permuted counterparts: the co-reference resolution module is trained on well-formed texts, and thus it does not perform very well on ill-organised texts. The methods utilising a discourse parser for modelling text coherence (Lin et al., 2011; Feng et al., 2014) have the same problem. Second, supervised models often suffer from data sparsity, domain dependence, and the computational cost of training. To alleviate these problems, Guinaudeau and Strube (2013) proposed an unsupervised coherence model known as the Entity Graph model. The Entity Grid, Role Matrix, and Entity Graph models all assume that coherence is achieved by local cohesion, i.e. that repeated mentions of the same entities constitute cohesion. However, they do not capture the contribution of related-yet-not-identical entities (Petersen et al., 2015). To the best of our knowledge, the closest study addressing this problem is that of Li and Hovy (2014), whose key idea is to learn a distributed sentence representation that captures the underlying semantic relations between consecutive sentences. To tackle these limitations of past research, we present an unsupervised text coherence model that captures the contribution of related-yet-not-identical entities.
The rest of this paper is organised as follows. Section 2 describes related work; Section 3 introduces our proposed unsupervised method to measure text coherence from a semantic similarity perspective; Section 4 describes the experimental results; Section 5 concludes the paper.

Related work
This section provides an overview of existing coherence scoring models, both supervised and unsupervised. Entity Grid (Barzilay and Lapata, 2008) serves as the supervised baseline in this paper, while Entity Graph is selected as the unsupervised baseline.

Entity Grid
The Entity Grid model evaluates local cohesion and is built on top of Centering theory (Barzilay and Lapata, 2008). The key idea of Centering theory is that the distribution of entities in coherent texts exhibits certain regularities (Grosz et al., 1995). A text is said to be less coherent if it exhibits many attention shifts, i.e. frequent changes of the centre of attention (Grosz et al., 1995); conversely, smooth transitions of the centre of attention make it more coherent, e.g. when consecutive sentences in a text mention the same entity. Barzilay and Lapata (2008) proposed a computational model representing a text as a matrix called the Entity Grid, in which columns correspond to entities, rows correspond to sentences, and each cell denotes the role of the entity in the sentence. The role of an entity is one of S (subject), O (object), or X (neither); the cell is filled with "−" if the entity is not mentioned in the sentence. If an entity serves multiple roles in a sentence, the priority order is S, O, and then X. Co-referent noun phrases are treated as the same entity. As an example, the text in Figure 1 is transformed into the Entity Grid in Table 1; the bracketed words in Figure 1 are recognised as the entities in Table 1.
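For concreteness, a minimal Python sketch of this grid construction is given below. It assumes entity mentions and their syntactic roles have already been extracted (e.g. by a parser with co-reference resolution); the toy input is illustrative, not the actual parse of Figure 1.

```python
# A toy Entity Grid builder. Input: one list of (entity, role) pairs per
# sentence, where role is "S", "O", or "X". Output: entity order and the
# grid rows, with "-" for entities absent from a sentence.
ROLE_RANK = {"S": 3, "O": 2, "X": 1, "-": 0}

def build_entity_grid(sentence_mentions):
    entities = sorted({e for sent in sentence_mentions for e, _ in sent})
    grid = []
    for sent in sentence_mentions:
        row = {}
        for entity, role in sent:
            # an entity serving multiple roles keeps the highest: S > O > X
            if ROLE_RANK[role] > ROLE_RANK[row.get(entity, "-")]:
                row[entity] = role
        grid.append([row.get(e, "-") for e in entities])
    return entities, grid

# hypothetical mentions, not the actual entities of Figure 1
entities, grid = build_entity_grid(
    [[("Microsoft", "S"), ("markets", "O")],
     [("Microsoft", "O"), ("products", "S")],
     [("Netscape", "X")]])
```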
Also, they differentiate salient entities: an entity is considered salient if it occurs at least t times in the text. The text is further encoded into a feature vector denoting the probabilities of local entity transitions (Barzilay and Lapata, 2008). Lin et al. (2011) and Feng et al. (2014) addressed the limitation that the grid records only syntactic roles by instead filling each cell with the discourse role of the sentence in which the entity appears.
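The feature encoding can be sketched as follows, operating on the grid built above; the saliency split (separate counts for salient and non-salient entities) is omitted for brevity.

```python
from itertools import product

def transition_probabilities(grid, length=2):
    # count column-wise role n-grams (one column per entity) and normalise
    # by the total number of transitions in the document
    counts = {t: 0 for t in product("SOX-", repeat=length)}
    total = 0
    for column in zip(*grid):
        for k in range(len(column) - length + 1):
            counts[tuple(column[k:k + length])] += 1
            total += 1
    return {t: c / total for t, c in counts.items()} if total else counts
```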

Entity Graph
To tackle the disadvantages of supervised coherence models, Guinaudeau and Strube (2013) proposed a graph model to measure text coherence. The graph data structure allows us to relate non-adjacent sentences, spanning the whole text to reflect global coherence, as opposed to the local coherence of the Entity Grid model. A text is represented as a directed bipartite graph. The first partition is a sentence partition in which each vertex represents a sentence; the second is a discourse partition in which each vertex represents an entity. A weighted edge between a sentence vertex and an entity vertex is established if the entity is mentioned in the sentence, with a weight based on the entity's role in the sentence: 3 for a subject entity, 2 for an object entity, and 1 for others. Figure 2 shows an example of the bipartite graph transformation of the text in Figure 1. This directed bipartite graph is further transformed into a directed projection graph in which a vertex represents a sentence, and a directed weighted edge is established between vertices if they share the same entities. The direction of an edge corresponds to the surface sequential order of the sentences within the text; for example, the vertex representing the second sentence can have outgoing edges to the third, fourth, and later sentences, but not to the first sentence. There are three projection methods, $P_U$, $P_W$, and $P_{Acc}$, depending on the weighting scheme of the edges. $P_U$ assigns a binary weight to each edge: one for an edge connecting two sentences sharing at least one entity, and zero otherwise. $P_W$ assigns to each edge the number of entities shared by the connected sentences. $P_{Acc}$ calculates an edge weight by accumulating the products of the bipartite edge weights over the entities shared by the two connected sentences. The weight of the edge established between sentences $s_i$ and $s_j$ is calculated by

$$w(s_i, s_j) = \sum_{e \in E_{ij}} bw(e, s_i) \cdot bw(e, s_j)$$

where $E_{ij}$ is the set of entities shared by $s_i$ and $s_j$, and $bw(e, s)$ is the weight of the edge between entity $e$ and sentence $s$ in the bipartite graph. Furthermore, the edge weight in the projection graph can be normalised by dividing it by the distance between the sentences, i.e. $|j - i|$. Figure 3 shows examples of the three projection graphs transformed from Figure 2 after normalisation. To measure text coherence with the projection graph, Guinaudeau and Strube (2013) used the average OutDegree of the vertices in the projection graph, where the OutDegree of a vertex is defined as the sum of the weights of the outgoing edges leaving the vertex.
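As an illustration, the sketch below computes the normalised $P_{Acc}$ projection and the resulting average OutDegree; the input format (one entity-to-weight dict per sentence) is our own assumption, not the authors' implementation.

```python
from itertools import combinations

def p_acc_weight(bw_i, bw_j):
    # sum over shared entities of the products of bipartite edge weights
    # (3 = subject, 2 = object, 1 = other)
    return sum(bw_i[e] * bw_j[e] for e in bw_i.keys() & bw_j.keys())

def avg_out_degree(sentences):
    # sentences: one {entity: bipartite_weight} dict per sentence
    n = len(sentences)
    total = 0.0
    for i, j in combinations(range(n), 2):   # forward edges only (i < j)
        w = p_acc_weight(sentences[i], sentences[j])
        if w > 0:
            total += w / (j - i)             # distance normalisation
    return total / n                         # average OutDegree
```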

Constructing semantic similarity graphs
As mentioned in Section 1, a text is coherent if it conveys its communication purpose to readers, is integrated as a whole, cohesive, well organised, and easy to read. We approach coherence from the cohesion perspective: we argue that the coherence of a text is built by cohesion among its sentences. We call our method the Semantic Similarity Graph.
Our proposed method employs an unsupervised learning approach, which suffers less from the data sparsity, domain dependence, and training cost that often arise in the supervised approach. We encode a text into a graph $G(V, E)$, where $V$ is a set of vertices and $E$ is a set of edges. The vertex $v_i \in V$ represents the $i$-th sentence $s_i$ in the text, and the weighted directed edge $e_{i,j} \in E$ represents a semantic relation from the $i$-th to the $j$-th sentence. In what follows, the term "edge" refers to a weighted directed edge.
As stated by Halliday and Hasan (1976), cohesion is a matter of lexicosemantics. Our method projects a sentence into a vector representation using the pre-trained GloVe word vectors¹ of Pennington et al. (2014). A sentence consists of multiple words $\{w_1, w_2, \cdots, w_M\}$, each of which is mapped into a vector space, i.e. $\{\vec{w}_1, \vec{w}_2, \cdots, \vec{w}_M\}$. A sentence $s$ is encoded as a vector $\vec{s}$ by averaging its word vectors:

$$\vec{s} = \frac{1}{M} \sum_{m=1}^{M} \vec{w}_m$$

where $M$ denotes the number of words in the sentence. We propose three methods for constructing a graph from a text based on the semantic similarity between sentence pairs in the text. Given a sentence vertex in the graph, the crucial point is how to decide its counterpart vertices for establishing edges. The following subsections describe each method.

¹ We use word vectors trained on Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocabulary, uncased, 100 dimensions), available at https://nlp.stanford.edu/projects/glove/
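A minimal sketch of this sentence encoding, assuming the GloVe vectors have been loaded into a Python dict; how OOV words are handled is not spelled out in the paper, so skipping them is an assumption here.

```python
import numpy as np

def sentence_vector(words, glove, dim=100):
    # mean of the GloVe vectors of the sentence's words; OOV words are
    # skipped (an assumption, not stated in the paper)
    vecs = [glove[w] for w in words if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```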

Preceding adjacent vertex (PAV)
People read a text from the beginning to the end, understanding a particular part of the text based on the information provided in the preceding part; when they do not understand a particular part, they look backwards for what they have missed. We mimic this reading process in the graph construction algorithm in Figure 4, where $N$ is the number of sentences in the text to be processed.

Figure 4: Graph construction algorithm with the similarity of PAV

    for i ← 2 to N do
        if sim(s_i, s_{i-1}) > 0 then
            create edge e_{i,i-1} with sim(s_i, s_{i-1}) as the weight
        else
            for j ← i − 2 downto 1 do
                if sim(s_i, s_j) > 0 then
                    create edge e_{i,j} with sim(s_i, s_j) as the weight
                    break

First we define the similarity measure $sim(s_i, s_j)$ of a pair of sentences $s_i$ and $s_j$ as

$$sim(s_i, s_j) = \alpha \cdot uot(s_i, s_j) + (1 - \alpha) \cdot \cos(\vec{s}_i, \vec{s}_j) \qquad (1)$$

where $uot(s_i, s_j)$ is the number of unique overlapping terms between the sentences $s_i$ and $s_j$ divided by the number of unique terms in the two sentences; $\cos(\vec{s}_i, \vec{s}_j)$ is the cosine similarity of the sentence vectors; and $\alpha$ is a balancing factor ranging over $[0, 1]$.
The algorithm constructs a graph by establishing a weighted directed edge from each sentence vertex to the preceding adjacent sentence vertex (PAV) if the sim value between the current and the preceding adjacent vertices exceeds zero; otherwise, the algorithm tries to establish an edge to the next closest preceding vertex with non-zero sim value. The established edge is assigned the sim value as its weight.
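A Python rendering of this procedure under the assumptions above; `sim` implements Equation (1) as reconstructed, and sentence vectors are assumed to be non-zero.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def sim(tokens_i, tokens_j, vec_i, vec_j, alpha):
    # Equation (1): blend of unique-term overlap ratio (uot) and cosine
    uniq_i, uniq_j = set(tokens_i), set(tokens_j)
    uot = len(uniq_i & uniq_j) / len(uniq_i | uniq_j)
    return alpha * uot + (1 - alpha) * cosine(vec_i, vec_j)

def build_pav_graph(tokens, vectors, alpha=0.4):
    # tokens: one token list per sentence; vectors: matching sentence vectors
    edges = {}
    for i in range(1, len(tokens)):
        for j in range(i - 1, -1, -1):       # closest preceding vertex first
            w = sim(tokens[i], tokens[j], vectors[i], vectors[j], alpha)
            if w > 0:
                edges[(i, j)] = w            # keep the first positive match
                break
    return edges
```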

Single similar vertex (SSV)
Cohesion between two sentences $s_i$ and $s_j$ means that we need to know $s_i$ in order to understand $s_j$, or vice versa (Halliday and Hasan, 1976). In this sense, we interpret cohesion as a semantic dependency among sentences and simulate this dependency with the semantic similarity between sentences. Since the dependency can hold in both directions, we allow edges to following vertices as well as to preceding ones.
In the previous method, "precedence" and "adjacency" are the important constraints for establishing the edges in graph construction. This   Figure 5: Example of semantic similarity graphs method discards these constraints and establishes edges based on only the semantic similarity between sentences. However, the edges are still directed and weighted. Also, only a single outgoing edge is allowed from every vertex in the graph. We cast semantic dependency task into an information retrieval task. When establishing an edge from a certain sentence vertex, we search for the most similar sentence in the text. The similarity measure between two sentences s i and s j is calculated based on the cosine similarity of their semantic vectors. An edge is established from the sentence vertex in question to the most similar sentence vertex with the weight calculated by This weight calculation takes into account the distance between two sentences, i.e. we prefer a closer counterpart.

Multiple similar vertices (MSV)
In the previous method, we allowed only a single outgoing edge from each sentence vertex. Here we discard this restriction and allow multiple outgoing edges per vertex. Instead of choosing the single most similar sentence in the text, we choose all sentences whose cosine similarity with the sentence in question exceeds a certain threshold (θ). Edges are established for all such vertex pairs, with the edge weight given in Equation (2). Figure 5 shows examples of the semantic similarity graphs constructed by the three proposed methods for the text shown in Figure 6; the parameters for the PAV and MSV-based methods are the optimal values in the evaluation experiments described in the next section, and the insert-sentence (I) was placed at the correct position (B).
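A sketch of the MSV construction under the same assumptions, reusing `cosine` from the PAV sketch.

```python
def build_msv_graph(vectors, theta=0.1):
    # an edge to every other sentence whose cosine similarity exceeds theta,
    # weighted as in Equation (2)
    edges = {}
    n = len(vectors)
    for i in range(n):
        for j in range(n):
            if i != j:
                c = cosine(vectors[i], vectors[j])
                if c > theta:
                    edges[(i, j)] = c / abs(j - i)
    return edges
```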

Text coherence measure
From a graph constructed by one of the three methods explained in the preceding subsections, the text coherence measure $tc$ is calculated by averaging, over all vertices in the graph, the mean weight of the outgoing edges of each vertex:

$$tc = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{L_i} \sum_{j:\, e_{i,j} \in E} w_{i,j}$$

where $N$ is the number of sentences in the text and $L_i$ is the number of outgoing edges from the vertex $v_i$. $L_i$ is always one in the PAV and SSV-based graph constructions, since these methods allow only a single outgoing edge from each vertex. A larger $tc$ value denotes a more coherent text. The proposed models have two significant differences from the Entity Graph model, our direct competitor. First, the Entity Graph model only allows outgoing edges in the forward direction, i.e. from the vertex $v_i$ to the vertex $v_j$ where $i < j$, whereas the proposed models, except for the PAV-based graph construction, allow edges in both directions. Second, the Entity Graph model, like the Entity Grid model, measures coherence based only on the entities shared between sentences and their syntactic roles. The proposed models measure text coherence based on the similarity between the semantic vectors of sentences; hence we can take into account related-yet-not-identical entities.
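A sketch of the coherence measure over the edge dictionaries produced by the builders above; treating a vertex without outgoing edges as contributing zero is our assumption, as the paper does not spell out this case.

```python
from collections import defaultdict

def text_coherence(edges, n):
    # average over all n vertices of the mean outgoing-edge weight;
    # a vertex without outgoing edges contributes zero (our assumption)
    out = defaultdict(list)
    for (i, _), w in edges.items():
        out[i].append(w)
    return sum(sum(ws) / len(ws) for ws in out.values()) / n
```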

Evaluation and results
We evaluate the proposed methods on two experimental tasks: the document discrimination task and the insertion task. All stop words are removed from the texts in these experiments, while lemmatisation is not employed.
The performance of the proposed methods is also compared with our reimplementation of Entity Grid (Barzilay and Lapata, 2008) and Entity Graph (Guinaudeau and Strube, 2013). The experimental settings for each method are described below.
SSV: There is no particular parameter to set.
Entity Grid: The optimal transition length of three (bigrams and trigrams) is used. In the document discrimination task, we implement the Entity Grid model both with and without saliency; an entity is judged as salient if it is mentioned in the text at least twice. Saliency is not employed in the insertion task because the texts there are relatively short and entities are not mentioned many times.
Entity Graph: We implemented the three projection methods with normalisation: $P_U$, $P_W$, and $P_{Acc}$.
Co-reference resolution is not employed, to avoid the bias mentioned by Nahnsen (2009). However, we follow the suggestion of Eisner and Charniak (2011) to consider all nouns (including non-head nouns) as entities in our experiments. The role of each entity is extracted using the dependency parser in the Stanford CoreNLP toolkit (Manning et al., 2014).

Data
In the document discrimination task, the sentences of a text are randomly permuted to generate another text; the task is to identify the original text given a pair consisting of the original and its permuted counterpart. The result is considered successful if the original is identified with a strictly higher coherence value. Performance is measured by accuracy, i.e. the ratio of successfully identified pairs to all pairs in the test set.
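The evaluation protocol can be summarised in a few lines; `score` stands for any of the coherence measures above.

```python
def discrimination_accuracy(score, pairs):
    # pairs: (original, permuted) texts; correct only if the original
    # scores strictly higher than its permuted counterpart
    correct = sum(1 for orig, perm in pairs if score(orig) > score(perm))
    return correct / len(pairs)
```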
Our data come from a part of the English WSJ text in OntoNotes Release 5.0 (LDC2013T19). Half of the data is used for training and the other half for testing. For each instance in both the training and the test data, at most 20 random permutations were created. Details of the data are shown in Table 2, and Table 3 shows the results of the document discrimination task for each method under the various experimental settings.

Result and discussion
Entity Grid without saliency performed best (0.845), followed by Entity Grid with saliency (0.837), PAV (0.774, α = 0.4), MSV (0.741, θ = 0.1), Entity Graph (0.725), and then SSV (0.676). The performance of PAV and MSV increases as the parameter grows, up to a certain point after which it steadily decreases. We performed McNemar's test in R and found that the difference in accuracy between every pair of methods is statistically significant at p < 0.05. Contrary to Barzilay and Lapata (2008), the saliency factor did not work effectively for Entity Grid on our data. The PAV and MSV-based methods performed better than Entity Graph. This result suggests that coherence is not only a matter of the surface overlap of entities and their syntactic roles; the semantic similarity between sentences should also be taken into account. It also confirms that, in the document discrimination task, the semantic relation between adjacent sentences (local coherence) is more important for coherence than relations between long-distance sentences. We also calculated the number of the same judgements between all pairs of methods, i.e. the questions answered correctly or incorrectly by both methods of a pair; Table 4 shows the results. The PAV-MSV pair shares the largest number of the same judgements (11,998, 88.3%). Although the MSV-based method establishes an edge between sentences whenever their similarity exceeds the threshold, it makes relatively many of the same judgements as PAV. This implies that local coherence is sufficient to solve the document discrimination task.

Data
In the insertion task described by Barzilay and Lapata (2008), the coherence measure is evaluated by the extent to which it can recover the original position of a sentence that has been randomly taken out of a text. The coherence value of the text with the taken-out sentence inserted at its original position, i.e. the original text, is expected to be higher than the values of the texts with the sentence inserted at any wrong position.
We argue, however, that adopting the TOEFL® iBT insertion-type question is more suitable for this kind of task than using texts artificially generated by sentence deletion. The TOEFL® insertion-type question aims at measuring test takers' ability to understand text coherence: test takers are given a coherent text and an insert-sentence, and the task is to find the best place to insert the insert-sentence. To the best of our observation, the texts in the TOEFL® iBT insertion-type questions are coherent even before the insert-sentence is inserted. An example of the TOEFL® iBT insertion-type question is shown in Figure 6.

(A) S1[The raising of livestock is a major economic activity in semiarid lands, where grasses are generally the dominant type of natural vegetation.] (B) S2[The consequences of an excessive number of livestock grazing in an area are the reduction of the vegetation cover and trampling and pulverization of the soil.] (C) S3[This is usually followed by the drying of the soil and accelerated erosion.] (D)

Question: Insert the following sentence into one of (A)-(D).
I[This economic reliance on livestock in certain regions makes large tracts of land susceptible to overgrazing.]

Figure 6: Example of the TOEFL® iBT insertion-type question (Educational Testing Service, 2007)
In the following evaluation, a method is judged as successful if it assigns the highest coherence value to the text formed by inserting the insert-sentence at the correct position. We do not allow tie values: a question is judged as a failure even when the correct position is tied for the highest value.
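This judgement can be sketched as follows, where `variants` are the texts produced by inserting the insert-sentence at each candidate position.

```python
def insertion_correct(score, variants, gold):
    # variants: texts with the insert-sentence at each candidate position;
    # ties are judged as failures, even at the correct position `gold`
    scores = [score(t) for t in variants]
    best = max(scores)
    return scores[gold] == best and scores.count(best) == 1
```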
We collected 104 insertion-type questions from various TOEFL® iBT preparation books. The average number of sentences in a text is 7.05 (SD: standard deviation = 1.85); the average number of tokens in a text is 139.8 (SD = 43.7). As the data size is relatively small, we adopted leave-one-out cross-validation for the Entity Grid model, assigning the same rank to all incorrect insertion positions during training. We did not use the Entity Grid model with saliency since each text in this dataset is relatively short, so term frequency (saliency) tends to be low for all terms. Table 5 shows the results of the insertion task for each method under the various experimental settings. Our proposed methods showed good performance; in particular, the PAV-based graph construction method outperformed both baselines, Entity Grid and Entity Graph. The PAV method obtained its best performance at α = 0.0, while the MSV method performed best at θ = 0.8. However, McNemar's test revealed that the differences in accuracy between the pairs of methods were not statistically significant at p < 0.05, probably due to the limited size of the insertion data compared with the document discrimination task.

Result and discussion
Two questions were correctly answered and 31 incorrectly answered by all methods. The two correctly answered questions have similar characteristics: word overlaps and synonyms across adjacent sentences. These questions also tend to contain more common words.
On the other hand, the failed questions tend to contain more uncommon words, technical terms, and named entities. Although the successful questions also contain named entities, those entities are mentioned more frequently in the texts than in the failed questions. We therefore suspected the limited coverage of our GloVe dictionary and investigated the out-of-vocabulary (OOV) ratio of the texts. Among all the questions, 32 out of 104 include OOV words, each containing one to three OOV words by type and by token. All methods failed on 15 of these 32 questions but succeeded on the remaining 17. This suggests that OOV words are not necessarily the main reason for failures in the insertion task.
Comparing the parameters (α of PAV and θ of MSV) in Tables 3 and 5, the values achieving the best performance differ between the two datasets. For the PAV-based method, there is no significant difference between the datasets in the average uot value of adjacent sentence pairs. We also calculated the cosine similarity of every pair of adjacent sentences and found that adjacent sentences are more similar in the insertion task data than in the document discrimination task data: 90% of the adjacent-sentence similarities lie in 0.3 ∼ 0.6 in the document discrimination task, while they range over 0.5 ∼ 0.9 in the insertion task data. This difference suggests that the uot factor helps the PAV-based method relatively more in the document discrimination task, while it has less impact in the insertion task, which explains the different α values of PAV across the two tasks.
To investigate the difference in the parameter θ of the MSV-based model, we calculated the cosine similarity of every sentence pair in each text. In both datasets, more than 90% of the sentence similarities lie in 0.5 ∼ 1.0. When the similarity is transformed into the edge weight by dividing by the sentence distance, the difference becomes apparent: 86.6% of the edge weights in the document discrimination task are less than 0.2, while the edge weights scatter over 0 ∼ 1.0 in the insertion task. This happens because the texts in the document discrimination task are on average longer than those in the insertion task; unless a low threshold (θ) is set, the MSV-based model hardly establishes edges between sentence vertices. In other words, establishing edges between distant sentences contributes to the performance on these tasks. Table 6 shows the number of the same answers between every pair of methods in the insertion task. The SSV-MSV pair shares the largest number of the same answers (84, 80.8%), followed by the PAV-MSV pair (79, 76.0%) and the PAV-SSV pair (75, 72.1%). The PAV-based method performs best without considering the overlapping terms between adjacent sentences (uot), i.e. with α = 0; in this case, the PAV-based method is almost identical to the SSV-based method except that it allows only backward edges. However, Table 6 shows that the PAV-based method answered differently from the SSV-based method on almost 30% of the questions. To further investigate the difference, we focused on the questions answered incorrectly by the PAV-based method but correctly by the SSV-based method. There are 14 such questions, in which the SSV-based method tends to establish edges between distant sentences; the average distance between connected sentence vertices is 2.8 (SD = 0.7). This suggests that the SSV-based method can capture distant sentence relations contributing to text coherence more appropriately than the PAV-based method.
We also investigated the 11 questions answered incorrectly by the PAV-based method but correctly by the MSV-based method. In these questions, the MSV-based method tends to establish more edges than the PAV-based method: the average number of outgoing edges from a sentence vertex in the MSV-constructed graph is 2.5 (SD = 1.8). In addition, like the SSV-based method, the MSV-based method tends to establish edges between distant sentences; the average distance between connected sentence vertices is 2.6 (SD = 0.9). This suggests that the MSV-based method can also capture distant sentence relations contributing to text coherence more appropriately than the PAV-based method.
Although the PAV-based method, which considers only local cohesion between adjacent sentences, performs best on the present data, we need a more refined mechanism for incorporating distant sentence relations than the current SSV and MSV-based methods, as we have shown that long-distance relations can contribute to determining text coherence. The representation of sentences and the calculation of similarity between sentences would be direct targets of such refinement.

Conclusion
This paper presented three novel unsupervised text coherence scoring methods in which the coherence of a text is regarded as being realised by the cohesion of its sentences, and this cohesion is represented in a graph structure corresponding to the text. In the graph, a vertex corresponds to a sentence and an edge represents a semantic relationship between the corresponding sentences. As cohesion is a matter of lexicosemantics, sentences are transformed into semantic vector representations, and their similarity is calculated as the cosine similarity between the vectors. Edges between sentence vertices are established based on the similarity and distance between the sentences. We presented three methods to construct a graph: the PAV, SSV, and MSV-based methods.
We evaluated the proposed methods on the document discrimination task and the insertion task. In the document discrimination task, our best performing method (PAV) outperformed the unsupervised baseline (Entity Graph) but not the supervised baseline (Entity Grid), and the differences were statistically significant at p < 0.05. In the insertion task, our best performing method (PAV) outperformed both the supervised and unsupervised baselines, but the differences were not statistically significant at p < 0.05; further experiments with a larger insertion dataset are necessary.
Our best proposed method (PAV) achieved an accuracy of 0.774 in the document discrimination task but only 0.356 in the insertion task, a large gap in performance between the two tasks. The error analysis revealed the possibility of improving performance by introducing a more refined representation of sentence vectors and calculation of the semantic similarity between sentences to capture distant relations between sentences.