Unsupervised Text Segmentation Using Semantic Relatedness Graphs

Segmenting text into semantically coherent fragments improves readability of text and facilitates tasks like text summarization and passage retrieval. In this paper, we present a novel unsupervised algorithm for linear text segmentation (TS) that exploits word embeddings and a measure of semantic relatedness of short texts to construct a semantic relatedness graph of the document. Semantically coherent segments are then derived from maximal cliques of the relatedness graph. The algorithm performs competitively on a standard synthetic dataset and outperforms the best-performing method on a real-world (i.e., non-artificial) dataset of political manifestos.


Introduction
Despite the fact that in mainstream natural language processing (NLP) and information retrieval (IR) texts are modeled as bags of unordered words, texts are sequences of semantically coherent segments, designed (often very thoughtfully) to ease readability and understanding of the ideas conveyed by the authors. Although authors may explicitly define coherent segments (e.g., as paragraphs), many texts, especially on the web, lack any explicit segmentation.
Linear text segmentation aims to represent texts as sequences of semantically coherent segments. Besides improving readability and understandability of texts for readers, automated text segmentation is beneficial for NLP and IR tasks such as text summarization (Angheluta et al., 2002; Dias et al., 2007) and passage retrieval (Huang et al., 2003; Dias et al., 2007). Whereas early approaches to unsupervised text segmentation measured the coherence of segments via raw term overlaps between sentences (Hearst, 1997; Choi, 2000), more recent methods (Misra et al., 2009; Riedl and Biemann, 2012) addressed the issue of sparsity of term-based representations by replacing term-vectors with vectors of latent topics.
A topical representation of text is, however, merely a vague approximation of its meaning. Considering that the goal of TS is to identify semantically coherent segments, we propose a TS algorithm aiming to directly capture the semantic relatedness between segments, instead of approximating it via topical similarity. We employ word embeddings (Mikolov et al., 2013) and a measure of semantic relatedness of short texts (Šarić et al., 2012) to construct a relatedness graph of the text in which nodes denote sentences and edges are added between semantically related sentences. We then derive segments using the maximal cliques of this relatedness graph.
The proposed algorithm displays competitive performance on the artificially generated benchmark TS dataset (Choi, 2000) and, more importantly, outperforms the best-performing topic modeling-based TS method on a real-world dataset of political manifestos.

Related Work
Automated text segmentation has received a lot of attention in the NLP and IR communities due to its usefulness for text summarization and text indexing. Text segmentation can be performed in two different ways, namely (1) with the goal of obtaining linear segmentations (i.e., detecting the sequence of different segments in a text), or (2) in order to obtain hierarchical segmentations (i.e., defining a structure of subtopics between the detected segments). Like the majority of TS methods (Hearst, 1994; Brants et al., 2002; Misra et al., 2009; Riedl and Biemann, 2012), in this work we focus on linear segmentation of text, but there is also a solid body of work on hierarchical TS, where each top-level segment is further broken down (Yaari, 1997; Eisenstein, 2009). Hearst (1994) introduced TextTiling, one of the first unsupervised algorithms for linear text segmentation. She exploits the fact that words tend to be repeated in coherent segments and measures the similarity between paragraphs by comparing their sparse term-vectors. Choi (2000) introduced a probabilistic algorithm using matrix-based ranking and clustering to determine similarities between segments. Galley et al. (2003) combined content-based information with acoustic cues in order to detect discourse shifts, whereas Utiyama and Isahara (2001) and Fragkou et al. (2004) minimized different segmentation cost functions with dynamic programming.
The first segmentation approach based on topic modeling (Brants et al., 2002) employed probabilistic latent semantic analysis (pLSA) to derive latent representations of segments and determined the segmentation based on similarities of segments' latent vectors. More recent models (Misra et al., 2009; Riedl and Biemann, 2012) employed latent Dirichlet allocation (LDA) (Blei et al., 2003) to compute the latent topics and displayed superior performance to previous models on standard synthetic datasets (Choi, 2000; Galley et al., 2003). Misra et al. (2009) used dynamic programming to find a globally optimal segmentation over the set of LDA-based segment representations, whereas Riedl and Biemann (2012) introduced TopicTiling, an LDA-driven extension of Hearst's TextTiling algorithm where segments are represented as dense vectors of dominant topics of the terms they contain (instead of as sparse term vectors). Riedl and Biemann (2012) show that TopicTiling outperforms the then state-of-the-art methods for unsupervised linear segmentation (Choi, 2000; Utiyama and Isahara, 2001; Galley et al., 2003; Fragkou et al., 2004; Misra et al., 2009) and that it is also faster than other LDA-based methods (Misra et al., 2009).
In the most closely related work to ours, Malioutov and Barzilay (2006) proposed a graph-based TS approach in which they first construct the fully connected graph of sentences, with edges weighted via the cosine similarity between bag-of-words sentence vectors, and then run the minimum normalized multiway cut algorithm to obtain the segments. Similarly, Ferret (2007) builds the similarity graph, only between words instead of between sentences, using sparse co-occurrence vectors as semantic representations for words. He then identifies topics by clustering the word similarity graph via the Shared Nearest Neighbor algorithm (Ertöz et al., 2004). Unlike these works, we use the dense semantic representations of words and sentences (i.e., embeddings), which have been shown to outperform sparse semantic vectors on a range of NLP tasks. Also, instead of looking for minimal cuts in the relatedness graph, we exploit the maximal cliques of the relatedness graph between sentences to obtain the topic segments.

Text Segmentation Algorithm
Our TS algorithm, dubbed GRAPHSEG, builds a semantic relatedness graph in which nodes denote sentences and edges are created for pairs of semantically related sentences. We then determine the coherent segments by finding maximal cliques of the relatedness graph. The novelty of GRAPHSEG is in the fact that it directly exploits the semantics of text instead of approximating the meaning with topicality.

Semantic Relatedness of Sentences
The measure of semantic relatedness between sentences we use is an extension of a salient greedy lemma alignment feature proposed in a supervised model by Šarić et al. (2012). They greedily align content words between sentences by the similarity of their distributional vectors and then sum the similarity scores of aligned word pairs. However, such a greedily obtained alignment is not necessarily optimal. In contrast, we compute the optimal alignment by (1) creating a weighted complete bipartite graph between the sets of content words of the two sentences (i.e., each word from one sentence is connected with a relatedness edge to all of the words in the other sentence) and (2) running a bipartite graph matching algorithm known as the Hungarian method (Kuhn, 1955), which runs in polynomial time. The similarities of content words between sentences (i.e., the weights of the bipartite graph) are computed as the cosine of the angle between their corresponding embedding vectors (Mikolov et al., 2013).
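The difference between a greedy and an optimal alignment can be illustrated with a small sketch. The similarity matrix below is a hypothetical toy example (in the paper the weights are cosines between embedding vectors), and `optimal_alignment` uses exhaustive search over assignments as a simple stand-in for the Hungarian method, which finds the same maximum-weight matching in polynomial time.

```python
from itertools import permutations

# Hypothetical word-to-word similarities: rows are content words of
# sentence 1, columns are content words of sentence 2.
SIM = [
    [0.9, 0.8],
    [0.8, 0.1],
]

def greedy_alignment(sim):
    """Repeatedly align the highest-scoring pair of still-unaligned words."""
    scored = sorted(
        ((s, i, j) for i, row in enumerate(sim) for j, s in enumerate(row)),
        reverse=True,
    )
    used_rows, used_cols, aligned = set(), set(), []
    for _, i, j in scored:
        if i not in used_rows and j not in used_cols:
            aligned.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return sorted(aligned)

def optimal_alignment(sim):
    """Maximum-weight one-to-one alignment by exhaustive search.

    Assumes len(sim) <= len(sim[0]); only practical for short sentences.
    """
    rows, cols = len(sim), len(sim[0])
    best = max(
        permutations(range(cols), rows),
        key=lambda assigned: sum(sim[i][j] for i, j in enumerate(assigned)),
    )
    return list(enumerate(best))

def alignment_score(sim, alignment):
    """Sum of similarities over the aligned word pairs."""
    return sum(sim[i][j] for i, j in alignment)

greedy = greedy_alignment(SIM)    # [(0, 0), (1, 1)], total similarity about 1.0
optimal = optimal_alignment(SIM)  # [(0, 1), (1, 0)], total similarity about 1.6
```

In this toy case the greedy strategy locks in the single best pair (0.9) and is then forced to accept a poor second pair (0.1), whereas the optimal alignment gives up the best pair for two good ones.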
Let $A$ be the set of word pairs in the optimal alignment between the content-word sets of the two sentences $S_1$ and $S_2$, i.e., $A = \{(w_1, w_2) \mid w_1 \in S_1 \wedge w_2 \in S_2\}$. We then compute the semantic relatedness for two given sentences $S_1$ and $S_2$ as follows:

$$sr(S_1, S_2) = \sum_{(w_1, w_2) \in A} \cos(v_1, v_2) \cdot \min(ic(w_1), ic(w_2))$$

where $v_i$ is the embedding vector of the word $w_i$ and $ic(w)$ is the information content (IC) of the word $w$, computed based on the relative frequency of $w$ in some large corpus $C$:

$$ic(w) = -\log \frac{freq(w) + 1}{|C| + \sum_{w' \in C} freq(w')}$$
We utilize the IC weighting of embedding similarity because we assume that matches between less frequent words (e.g., guitar and ukulele) contribute more to sentence relatedness than pairs of similar but frequent words (e.g., do and make). We used Google Books Ngrams (Michel et al., 2011) as a large corpus $C$ for estimating relative frequencies of words in a language. Because there will be more aligned pairs between longer sentences, the relatedness score will be larger for longer sentences merely because of their length (regardless of their actual similarity). Thus, we normalize the $sr(S_1, S_2)$ score first with the length of $S_1$ and then with the length of $S_2$, and we finally average these two normalized scores:

$$rel(S_1, S_2) = \frac{1}{2} \left( \frac{sr(S_1, S_2)}{|S_1|} + \frac{sr(S_1, S_2)}{|S_2|} \right)$$
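As a rough illustration of IC weighting and length normalization, the sketch below scores a given alignment. The frequency counts, the two-dimensional vectors, the add-one smoothing of relative frequencies (with the vocabulary size standing in for the corpus-size term), and the choice to weight each aligned pair by the smaller of the two IC values are all assumptions of this example.

```python
from math import log, sqrt

# Hypothetical corpus counts; the paper estimates them from Google Books Ngrams.
FREQ = {"guitar": 50, "ukulele": 5, "do": 9000, "make": 7000}

# Hypothetical 2-d "embeddings", chosen so that both word pairs have cosine 0.8.
VEC = {
    "guitar": (1.0, 0.0),
    "ukulele": (0.8, 0.6),
    "do": (0.0, 1.0),
    "make": (0.6, 0.8),
}

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def ic(word):
    """Information content from add-one smoothed relative frequency (assumed form)."""
    total = sum(FREQ.values())
    return -log((FREQ.get(word, 0) + 1) / (len(FREQ) + total))

def sr(alignment):
    """Unnormalized relatedness: IC-weighted cosine summed over aligned pairs."""
    return sum(cosine(VEC[a], VEC[b]) * min(ic(a), ic(b)) for a, b in alignment)

def rel(alignment, len1, len2):
    """Average of the score normalized by each sentence length in turn."""
    score = sr(alignment)
    return 0.5 * (score / len1 + score / len2)

rare_pair = sr([("guitar", "ukulele")])
frequent_pair = sr([("do", "make")])
both = [("guitar", "ukulele"), ("do", "make")]
normalized = rel(both, len1=2, len2=4)
```

Both pairs have the same cosine, yet the rare pair contributes far more to the score than the frequent one, which is the behavior the IC weighting is meant to produce.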

Graph-Based Segmentation
All sentences in a text become nodes of the relatedness graph $G$. We then compute the semantic relatedness, as described in the previous subsection, between all pairs of sentences in a given document. For each pair of sentences for which the semantic relatedness is above some threshold value $\tau$ we add an edge between the corresponding nodes of $G$. Next, we employ the Bron-Kerbosch algorithm (Bron and Kerbosch, 1973) to compute the set $\mathcal{Q}$ of all maximal cliques of $G$. We then create the initial set of segments $SG$ by merging adjacent sentences found in at least one maximal clique $Q \in \mathcal{Q}$ of graph $G$. Next, we merge the adjacent segments $sg_i$ and $sg_{i+1}$ for which there is at least one clique $Q \in \mathcal{Q}$ containing at least one sentence from $sg_i$ and one sentence from $sg_{i+1}$. Finally, given the minimal segment size $n$, we merge segments $sg_i$ with fewer than $n$ sentences with the semantically more related of the two adjacent segments, $sg_{i-1}$ or $sg_{i+1}$. The relatedness between two adjacent segments ($sgr(sg_i, sg_{i+1})$) is computed as the average relatedness between their respective sentences:

$$sgr(sg_i, sg_{i+1}) = \frac{1}{|sg_i| \, |sg_{i+1}|} \sum_{S_1 \in sg_i} \sum_{S_2 \in sg_{i+1}} rel(S_1, S_2)$$

where $rel(S_1, S_2)$ is the length-normalized relatedness of two sentences from the previous subsection. We exemplify the creation of segments from maximal cliques in Table 1. The complete segmentation algorithm is fleshed out in Algorithm 1.

Table 1: Creating segments from graph cliques (n = 2). In the third step we merge segments {1, 2, 3} and {4, 5} because the second clique contains sentences 2 (from the left segment) and 4 (from the right segment). In the final step we merge single sentence segments (assuming segs ({1, 2, 3 …
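The three merging steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: the nine-sentence graph and the `toy_rel` scores are hypothetical, the edges are given directly rather than derived by thresholding, segments are merged in a single left-to-right pass, and ties between neighbours are broken in favour of the left segment.

```python
def bron_kerbosch(r, p, x, adj, out):
    """Basic Bron-Kerbosch (no pivoting); collects maximal cliques in out."""
    if not p and not x:
        out.append(frozenset(r))
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, out)
        p = p - {v}
        x = x | {v}

def maximal_cliques(n, edges):
    """All maximal cliques of a graph over sentences 1..n."""
    adj = {v: set() for v in range(1, n + 1)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    out = []
    bron_kerbosch(set(), set(adj), set(), adj, out)
    return out

def initial_segments(n, cliques):
    """Sentences i and i+1 share a segment if some clique contains both."""
    segments = [[1]]
    for s in range(2, n + 1):
        if any(s - 1 in q and s in q for q in cliques):
            segments[-1].append(s)
        else:
            segments.append([s])
    return segments

def merge_by_cliques(segments, cliques):
    """Merge adjacent segments when a clique has a sentence from each."""
    merged = [list(segments[0])]
    for seg in segments[1:]:
        if any(q & set(merged[-1]) and q & set(seg) for q in cliques):
            merged[-1].extend(seg)
        else:
            merged.append(list(seg))
    return merged

def merge_small(segments, n, rel):
    """Merge each segment shorter than n into its more related neighbour."""
    segs = [list(s) for s in segments]

    def sgr(a, b):  # average sentence relatedness between two segments
        return sum(rel(s, t) for s in a for t in b) / (len(a) * len(b))

    while len(segs) > 1:
        idx = next((i for i, s in enumerate(segs) if len(s) < n), None)
        if idx is None:
            break
        left = sgr(segs[idx - 1], segs[idx]) if idx > 0 else float("-inf")
        right = sgr(segs[idx], segs[idx + 1]) if idx + 1 < len(segs) else float("-inf")
        if left >= right:
            segs[idx - 1:idx + 1] = [segs[idx - 1] + segs[idx]]
        else:
            segs[idx:idx + 2] = [segs[idx] + segs[idx + 1]]
    return segs

# Hypothetical graph: edges link sentence pairs whose relatedness exceeds tau.
EDGES = [(1, 2), (2, 3), (1, 3), (2, 4), (2, 7), (4, 7), (4, 5), (8, 9)]
STRONG = {frozenset((5, 6)): 0.9, frozenset((7, 8)): 0.8}

def toy_rel(a, b):
    """Hypothetical sentence relatedness used only for the last step."""
    return STRONG.get(frozenset((a, b)), 0.1)

CLIQUES = maximal_cliques(9, EDGES)
INIT = initial_segments(9, CLIQUES)      # [[1, 2, 3], [4, 5], [6], [7], [8, 9]]
MERGED = merge_by_cliques(INIT, CLIQUES) # [[1, 2, 3, 4, 5], [6], [7], [8, 9]]
FINAL = merge_small(MERGED, 2, toy_rel)  # [[1, 2, 3, 4, 5, 6], [7, 8, 9]]
```

Here the clique {2, 4, 7} bridges the initial segments {1, 2, 3} and {4, 5}, and the two single-sentence segments are then absorbed by whichever neighbour `toy_rel` scores higher on average.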

Evaluation
In this section, we first introduce the two evaluation datasets that we use: the commonly used synthetic dataset and a realistic dataset of political manifestos. We then present the experimental setting and finally describe and discuss the results achieved by our GRAPHSEG algorithm and how it compares to other TS models.

Datasets
Unsupervised methods for text segmentation have most often been evaluated on synthetic datasets in which segments from different sources are concatenated into artificial documents (Choi, 2000; Galley et al., 2003). Segmenting such artificial texts is easier than segmenting real-world documents. This is why, in addition to the artificial Choi dataset, we also evaluate GRAPHSEG on a real-world dataset of political texts from the Manifesto Project, manually labeled by domain experts with segments of seven different topics (e.g., economy and welfare, quality of life, foreign affairs). The selected manifestos contain between 1000 and 2500 sentences, with segments ranging in length from 1 to 78 sentences, which is in sharp contrast to the Choi dataset, where all segments are of similar size.

Experimental Setting
To allow for comparison with previous work, we evaluate GRAPHSEG on four subsets of the Choi dataset, differing in the number of sentences the segments contain. For the evaluation on the Choi dataset, the GRAPHSEG algorithm made use of the publicly available word embeddings built from a Google News dataset. Both LDA-based models (Misra et al., 2009; Riedl and Biemann, 2012) and GRAPHSEG rely on corpus-derived word representations. Thus, on the Manifesto dataset we evaluated both the domain-adapted and domain-unadapted variants of these methods. The domain-adapted variants of the models used the unlabeled domain corpus, a test set of 466 unlabeled political manifestos, to train the domain-specific word representations. This means that we obtain (1) in-domain topics for the LDA-based TopicTiling model of Riedl and Biemann (2012) and (2) domain-specific embeddings for the GRAPHSEG algorithm. On the Manifesto dataset we also evaluate a baseline that randomly (50% chance) starts a new segment at points m sentences apart, with m being set to half of the average length of gold segments.

(Footnote, truncated: … 2008, and 2012 U.S. elections)
We evaluate the performance using two standard TS evaluation metrics: $P_k$ (Beeferman et al., 1999) and WindowDiff (WD) (Pevzner and Hearst, 2002). $P_k$ is the probability that two randomly drawn sentences that are $k$ sentences apart are classified incorrectly, either as belonging to the same segment when they are in different gold segments or as being in different segments when they are in the same gold segment. Following Riedl and Biemann (2012), we set $k$ to half of the document length divided by the number of gold segments. WindowDiff is a stricter version of $P_k$: instead of only checking whether the randomly chosen sentences are in the same predicted segment or not, it compares the exact number of segments between the sentences in the predicted segmentation with the number of segments between the same sentences in the gold standard. Lower scores indicate better performance for both metrics.
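The two metrics can be sketched in a few lines. This is an illustrative implementation under assumptions, not the evaluation code used in the paper: segmentations are given as lists of segment lengths, every window of $k$ sentences is scored once (the usual sliding-window computation), and $k$ is rounded to the nearest integer.

```python
def segment_ids(lengths):
    """Expand segment lengths, e.g. [2, 3], into per-sentence ids [0, 0, 1, 1, 1]."""
    ids = []
    for seg, length in enumerate(lengths):
        ids.extend([seg] * length)
    return ids

def window_size(gold_lengths):
    """Half of the document length divided by the number of gold segments."""
    return max(1, round(sum(gold_lengths) / (2 * len(gold_lengths))))

def pk(gold_lengths, pred_lengths, k):
    """Share of windows where gold and prediction disagree on same/different segment."""
    gold, pred = segment_ids(gold_lengths), segment_ids(pred_lengths)
    windows = len(gold) - k
    errors = sum(
        (gold[i] == gold[i + k]) != (pred[i] == pred[i + k]) for i in range(windows)
    )
    return errors / windows

def window_diff(gold_lengths, pred_lengths, k):
    """Share of windows where the number of boundaries inside the window differs."""
    gold, pred = segment_ids(gold_lengths), segment_ids(pred_lengths)
    windows = len(gold) - k
    # Ids increase by one at each boundary, so the id difference counts boundaries.
    errors = sum(
        (gold[i + k] - gold[i]) != (pred[i + k] - pred[i]) for i in range(windows)
    )
    return errors / windows

GOLD = [4, 4]   # hypothetical gold segmentation of an 8-sentence document
PRED = [2, 6]   # hypothetical prediction with the boundary two sentences early
K = window_size(GOLD)                    # 2
PK_SCORE = pk(GOLD, PRED, K)             # 4 of 6 windows are wrong
WD_SCORE = window_diff(GOLD, PRED, K)    # 4 of 6 windows are wrong
```

Because a window that $P_k$ counts as an error always has a differing boundary count as well, WindowDiff is never lower than $P_k$ for the same pair of segmentations.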
The GRAPHSEG algorithm has two parameters: (1) the sentence similarity threshold $\tau$, which is used when creating edges of the sentence relatedness graph, and (2) the minimal segment size $n$, which we utilize to merge adjacent segments that are too small. In all experiments we use grid search in a folded cross-validation setting to jointly optimize both parameters. For the purposes of comparison with other models, this parameter optimization is justified because other models, e.g., TopicTiling (Riedl and Biemann, 2012), also have parameters (e.g., number of topics for the topic model) which are optimized using cross-validation.

Table 2 (partially recovered; column headers 3-5, 6-8, 9-11, 3-11):
Choi (2000): 12.0 -, 9.0 -, 9.0 -, 12.0 -
Brants et al. (2002): 7. …

Results and Discussion
In Table 2 we report the performance of GRAPHSEG and prominent TS methods on the synthetic Choi dataset. GRAPHSEG performs competitively, outperforming all methods except that of Fragkou et al. (2004) and the domain-adapted versions of the LDA-based models (Misra et al., 2009; Riedl and Biemann, 2012). However, the approach of Fragkou et al. (2004) uses gold standard information, namely the average gold segment size, as input. The LDA-based models, in turn, adapt their topic models on parts of the Choi dataset itself. Although they use different documents for training the topic models than for evaluating segmentation quality, the evaluation is still tainted because snippets from the original documents appear in multiple artificial documents, some of which belong to the training set and others to the test set. Riedl and Biemann (2012) acknowledge this, and it is why the performance reported for these models on this dataset is overestimated.
In Table 3 we report the results on the Manifesto dataset. Results of both TopicTiling and GRAPHSEG indicate that the realistic Manifesto dataset is much more difficult to segment than the artificial Choi dataset. The GRAPHSEG algorithm significantly outperforms the TopicTiling method (p < 0.05, Student's t-test). In-domain training of word representations (topics for TopicTiling and word embeddings for GRAPHSEG) does not significantly improve the performance of either model. This result contrasts with previous findings (Misra et al., 2009; Riedl and Biemann, 2012), in which the performance boost was credited to the in-domain trained topics, and supports our hypothesis that the performance boost of the LDA-based methods with in-domain trained topics originates from information leakage between different portions of the synthetic Choi dataset.

Conclusion
In this work we presented GRAPHSEG, a novel graph-based algorithm for unsupervised text segmentation. GRAPHSEG employs word embeddings and extends a measure of semantic relatedness to construct a relatedness graph with edges established between semantically related sentences. The segmentation is then determined by the maximal cliques of the relatedness graph and improved by semantic comparison of adjacent segments.
GRAPHSEG displays competitive performance compared to the best-performing LDA-based methods on a synthetic dataset. However, we identify and discuss evaluation issues pertaining to LDA-based TS on this dataset. We also performed an evaluation on the real-world dataset of political manifestos and showed that in a realistic setting GRAPHSEG significantly outperforms the state-of-the-art LDA-based TS model.