Learning to Rank Semantic Coherence for Topic Segmentation

Topic segmentation plays an important role in discourse parsing and information retrieval. Due to the absence of training data, previous work mainly adopts unsupervised methods that rank semantic coherence between paragraphs for topic segmentation. In this paper, we present an intuitive and simple idea to automatically create a "quasi" training dataset, which includes a large number of text pairs from the same or different documents with different levels of semantic coherence. With this training corpus, we design a symmetric convolutional neural network (CNN) to model text pairs and rank their semantic coherence within the learning to rank framework. Experiments show that our algorithm achieves competitive performance against strong baselines on several real-world datasets.


Introduction
The goal of topic segmentation is to segment a document into several topically coherent parts, with different parts corresponding to different topics. Topic segmentation enables better understanding of document structure, and makes long documents much easier to navigate. It also provides helpful information for tasks such as information retrieval, topic tracking, etc. (Purver, 2011).
Due to the lack of large scale annotated topic segmentation datasets, previous work has mainly focused on unsupervised models that measure the coherence between two textual segments. The intuition behind unsupervised models is that two adjacent segments from the same topic are more coherent than those from different topics. Under this intuition, one direction of research attempts to measure coherence by computing text similarity. The typical methods include TextTiling (Hearst, 1997) and its variants, such as C99 (Choi, 2000), TopicTiling (Riedl and Biemann, 2012b), etc. The other direction of research develops topic modeling techniques to explore topic representations of text and topic change between textual segments (Yamron et al., 1998; Eisenstein and Barzilay, 2008; Riedl and Biemann, 2012a; Du et al., 2013; Jameel and Lam, 2013). With carefully designed generative processes and efficient inference algorithms, topic models are able to model coherence as latent variables and outperform lexical similarity based models.
Though unsupervised models make progress in modeling text coherence, they mostly suffer from one of the following two limitations. First, measuring coherence with text similarity is imprecise, since similarity is only one of the factors that influence coherence. Second, many assumptions and manually set parameters are usually involved in the complex modeling techniques, due to the absence of supervised information. To overcome the aforementioned limitations, we prefer to directly model text coherence by exploring possible supervised information. Then, we can learn a function f(s1, s2) which takes two textual segments s1 and s2 as input, and directly measures their semantic coherence.
As we know, it is hard to directly compile and collect a large number of samples labeled with coherence scores. Here we propose an intuitive and simple strategy to automatically create a "quasi" training corpus for supervision. It is common sense that original documents written by humans are generally more coherent than a patchwork of sentences or paragraphs randomly extracted from different documents. In such cases, two textual segments from the same document are more coherent than those from different documents, and two segments from the same paragraph are more coherent than those from different paragraphs. We can thus obtain a large set of text pairs with partial ordering relations, which denote that some text pairs are more coherent than others. With this ordering information, we propose to apply the learning to rank framework to model the semantic coherence function f(s1, s2), based on which topic boundaries are identified.
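To make the construction concrete, the following is a minimal sketch of how such a "quasi" corpus could be assembled from unlabeled documents. The helper name build_quasi_pairs and the deterministic choice of a "different document" are our own illustrative assumptions, not the paper's exact procedure; the three coherence levels mirror the ordering relations above.

```python
def build_quasi_pairs(documents, seg_len=3):
    """Build text pairs at three coherence levels from unlabeled documents.

    `documents` is a list of documents; each document is a list of
    paragraphs; each paragraph is a list of sentences. Returns a list of
    ((segment_a, segment_b), level) tuples, where level 2 = same
    paragraph, 1 = same document but different paragraphs, and
    0 = different documents.
    """
    def segment(par, start):
        # A text segment: up to `seg_len` consecutive sentences.
        return " ".join(par[start:start + seg_len])

    pairs = []
    for di, doc in enumerate(documents):
        for pi, par in enumerate(doc):
            # Level 2: two adjacent segments from the same paragraph.
            if len(par) >= 2 * seg_len:
                pairs.append(((segment(par, 0), segment(par, seg_len)), 2))
            # Level 1: segments from two paragraphs of the same document.
            if pi + 1 < len(doc):
                pairs.append(((segment(par, 0), segment(doc[pi + 1], 0)), 1))
            # Level 0: segments from two different documents
            # (hypothetical deterministic pick; a real pipeline would sample).
            if len(documents) > 1:
                other = documents[(di + 1) % len(documents)]
                pairs.append(((segment(par, 0), segment(other[0], 0)), 0))
    return pairs
```

Running this over a Wikipedia-scale dump would yield the millions of ordered pairs the ranking model is trained on.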
The next key problem is how to model and represent text pairs. Fortunately, neural networks have emerged as a powerful tool for modeling text pairs (Lu and Li, 2013; Severyn and Moschitti, 2015; Yin et al., 2015; Hu et al., 2014), freeing us from feature engineering. In this paper, we develop a symmetric convolutional neural network (CNN) framework, whose main idea is to jointly model text representation and interaction between texts. With our acquired large amount of training data, our CNN-based method is capable of reasonably ranking semantic coherence and further conducting topic segmentation.

Coherence Ordering between Text Pairs
In our work, we define f(s1, s2) as a function that returns a real number as the semantic coherence score of the text pair <s1, s2>. To model f(s1, s2) for any text pair, we aim to exploit the partial ordering relations of coherence between different text pairs, since it is hard to obtain a corpus with labeled coherence scores.
Next, we exploit the two types of ordering relations stated in Section 1. To formalize, we denote a collection of documents as D. Each document d_i ∈ D consists of several paragraphs, and each paragraph p_j ∈ d_i consists of several sentences. We use T^{s:(s+k)}_{ij} to represent a text segment covering k sentences starting from the s-th sentence in document d_i's j-th paragraph. To make symbols less cluttered, we omit k and simply write T^s_{ij} for the same meaning. A text pair <T^s_{ij}, T^{s'}_{i'j'}> is a tuple of two text segments.
The first ordering relation is: the coherence score of a text pair from different documents is lower than that of a pair from the same document. Formally,

f(T^{s1}_{i1,·}, T^{s2}_{i2,·}) < f(T^{s3}_{i,·}, T^{s4}_{i,·}),  i1 ≠ i2,  (1)

where the dot · denotes an arbitrary value.
The second one is: the coherence score of a text pair from different paragraphs of the same document is lower than that of a pair from the same paragraph, as represented below:

f(T^{s1}_{i,j1}, T^{s2}_{i,j2}) < f(T^{s3}_{i,j}, T^{s4}_{i,j}),  j1 ≠ j2.  (2)

Since our defined relations are partial orderings, they have the properties of reflexivity, transitivity, and antisymmetry. We can then easily infer that the coherence score of a text pair from different documents is also lower than that of a pair from the same paragraph.

Learning to Rank Semantic Coherence
Learning to rank is a widely used learning framework in the field of information retrieval (Liu et al., 2009). There are generally three formulations (Li, 2011): pointwise ranking, pairwise ranking, and listwise ranking. The goal is to learn a ranking function f(w, tp_i) → y_i, where tp_i denotes a text pair <s1, s2>, f maps tp_i to a real value y_i (the semantic coherence score in this paper), and w is the weight vector. We examine both pointwise and pairwise ranking methods; listwise ranking does not naturally fit our task, so it is not discussed here.

Pointwise Ranking
In the pointwise formulation, y_i = f(w, tp_i) ∈ [0, 1] is computed as the inner product between the weight vector w and the text pair tp_i's representation vector h_i, followed by a sigmoid non-linearity: y_i = σ(w · h_i).
The representation vector h_i of a text pair can be jointly learned through a neural network, which will be introduced in the next subsection.
To conform to the partial ordering relations, we score each training instance tp_i as follows:

ŷ_i = 1 if the two segments come from the same paragraph; ŷ_i = α if they come from different paragraphs of the same document; ŷ_i = 0 if they come from different documents,

where 0 < α < 1 is a hyper-parameter chosen to maximize performance on the validation dataset.
With N training instances, we formulate coherence scoring as a regression problem and use cross entropy as the loss function:

L = −(1/N) Σ_{i=1}^{N} [ŷ_i log y_i + (1 − ŷ_i) log(1 − y_i)].

Generally speaking, pointwise ranking is simple, scalable, and efficient to train.
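The pointwise target assignment and loss can be sketched as follows; this is a minimal NumPy illustration under our reading of the formulation, with the level-to-target mapping and the α value (0.7, as used later in the experiments) treated as assumptions rather than the exact implementation.

```python
import numpy as np

ALPHA = 0.7  # hyper-parameter 0 < alpha < 1, tuned on validation data

def target_score(level):
    """Map a pair's ordering level to a pointwise regression target
    (assumed mapping): 2 = same paragraph -> 1.0,
    1 = same document, different paragraphs -> ALPHA,
    0 = different documents -> 0.0."""
    return {2: 1.0, 1: ALPHA, 0: 0.0}[level]

def pointwise_scores(w, H):
    """Sigmoid of the inner product between weight vector w and each
    pair representation (rows of H); scores lie in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(H @ w)))

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy between target and predicted scores."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```

In training, H would be produced by the CNN encoder and w learned jointly with it; here both are plain arrays for clarity.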

Pairwise Ranking with Sampling
The pairwise formulation explicitly compares each pair of training instances and requires a minimal margin ε between their ranking scores.
Here, the text pair tp_i has a higher ranking score than tp_j, and y_i = f(w, tp_i) ∈ (−∞, +∞). Without loss of generality, we set ε = 1 and use the squared hinge loss as the optimization objective:

L = (1/M) Σ_{(i,j)} max(0, ε − (y_i − y_j))^2,
where M is the number of pairs we need to compare. As we can see, in our problem setting, M ≈ N^2, which makes M an extremely large number when N ≈ 10^5. To make training feasible, we adopt a straightforward sampling mechanism, which randomly samples pairs from different groups to construct a mini-batch on the fly during training.
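A minimal sketch of the squared hinge loss and the on-the-fly sampling, assuming the training instances are grouped by their coherence level (the helper names and the uniform sampling scheme are our own illustrative choices):

```python
import random
import numpy as np

def squared_hinge_loss(scores_hi, scores_lo, margin=1.0):
    """Squared hinge over sampled pairs: penalize whenever the supposedly
    more coherent pair fails to beat the less coherent one by `margin`."""
    diff = margin - (scores_hi - scores_lo)
    return np.mean(np.maximum(0.0, diff) ** 2)

def sample_minibatch(groups, batch_size, rng=None):
    """Sample (higher, lower) index pairs from two distinct ordering groups
    on the fly, instead of enumerating all ~N^2 comparisons.
    `groups` maps a coherence level to a list of instance indices."""
    rng = rng or random.Random(0)
    levels = sorted(groups)
    batch = []
    while len(batch) < batch_size:
        lo, hi = sorted(rng.sample(levels, 2))
        batch.append((rng.choice(groups[hi]), rng.choice(groups[lo])))
    return batch
```

Each mini-batch thus only ever materializes `batch_size` comparisons, keeping memory and compute independent of M.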
Pairwise ranking is reported to achieve better performance than pointwise ranking, but it is less efficient to train.

To model the text pair instances, we develop a symmetric convolutional neural network (CNN) architecture, as shown in Figure 1. Our model consists of two symmetric CNNs that share their network configuration and parameters. Each CNN converts one text into a low-dimensional representation, and the two generated representation vectors are concatenated and fed into the scoring layer to produce a real value as the coherence score.
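The shared-parameter design can be illustrated with a minimal NumPy forward pass: one filter bank encodes both texts via convolution and max-over-time pooling, and the concatenated representations are scored with a sigmoid. All dimensions, the tanh non-linearity, and the function names are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def conv_maxpool(X, filters):
    """1-D convolution over a (seq_len, emb_dim) matrix with a bank of
    (n_filters, win, emb_dim) filters, followed by max-over-time pooling.
    Returns one feature per filter."""
    n_filters, win, _ = filters.shape
    seq_len = X.shape[0]
    feats = np.empty(n_filters)
    for f in range(n_filters):
        acts = [np.sum(X[t:t + win] * filters[f])
                for t in range(seq_len - win + 1)]
        feats[f] = np.tanh(max(acts))
    return feats

def coherence_score(X1, X2, filters, w):
    """Symmetric CNN: the SAME filter bank encodes both texts; the two
    pooled representations are concatenated and scored with a sigmoid."""
    h = np.concatenate([conv_maxpool(X1, filters), conv_maxpool(X2, filters)])
    return 1.0 / (1.0 + np.exp(-h @ w))
```

Because both branches share `filters`, swapping the two inputs simply swaps the two halves of h; with a scoring vector whose halves are equal, the score is invariant to input order.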

Inference
At test time, coherence scores between all pairs of adjacent paragraphs are computed. The T − 1 paragraph boundaries with the lowest semantic coherence scores are chosen as topic boundaries, where T is the ground-truth number of topics.
This inference procedure is computationally efficient; unlike TextTiling, it does not need to calculate a so-called "depth score".
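The inference step reduces to a single sort over boundary scores, as this small sketch shows (the helper name segment_topics and the pluggable score_fn are hypothetical):

```python
def segment_topics(paragraphs, score_fn, T):
    """Score each adjacent paragraph pair with `score_fn`, then cut at the
    T-1 boundaries with the lowest coherence scores. Boundary b denotes a
    cut between paragraphs b and b+1. Returns sorted boundary positions."""
    scores = [score_fn(paragraphs[i], paragraphs[i + 1])
              for i in range(len(paragraphs) - 1)]
    order = sorted(range(len(scores)), key=lambda b: scores[b])
    return sorted(order[:T - 1])
```

In the full system, score_fn would be the trained symmetric CNN; any coherence function with the same signature works here.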

Experimental Setup

Data
In order to train our ranking neural network, we use the full English Wikipedia dump, which consists of more than 5 million documents, to automatically construct text pairs.
For performance evaluation, we use the topic segmentation dataset from Jeong and Titov (2010). This dataset consists of 864 manually labeled documents from four different areas, as shown in Table 1.

Hyperparameters
Our neural network implementation is based on TensorFlow (Abadi et al., 2015). We use pre-trained 50-dimensional GloVe vectors (Pennington et al., 2014) for word embedding initialization. Each text pair consists of 2 text segments, and each text segment consists of no more than 3 sentences. Stop words and digits are removed from the input text, and all words are converted to lowercase. We pad each input sequence to 40 tokens. In order to capture information of different granularity, convolution window sizes of both 2 and 3 are used, with 64 filters for each window size. The L2 regularization coefficient is set to 0.001. The Adam algorithm (Kingma and Ba, 2014) is used for loss minimization. We set α to 0.7 for pointwise ranking.

Evaluation

System performance is evaluated according to three metrics: P_k (Beeferman et al., 1999), WindowDiff (WD) (Pevzner and Hearst, 2002), and F1 score. P_k and WD are calculated based on sliding windows, and can assign partial credit to an incorrect segmentation. Note that P_k and WD are penalty metrics; a smaller value means better performance.
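For readers unfamiliar with the sliding-window metrics, here is a minimal sketch of P_k (WindowDiff differs only in comparing boundary counts inside the window rather than same-segment membership). The implementation below is our own simplified version, not the paper's evaluation code.

```python
def pk(ref, hyp, k=None):
    """P_k penalty metric (Beeferman et al., 1999). `ref` and `hyp` are
    per-unit segment labels, e.g. [0, 0, 1, 1, 2]. Slide a probe of width
    k and count disagreements about whether the two probed positions fall
    in the same segment; lower is better."""
    n = len(ref)
    if k is None:
        # Conventional choice: half the average reference segment length.
        k = max(1, round(n / (len(set(ref)) * 2)))
    errors = sum((ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
                 for i in range(n - k))
    return errors / (n - k)
```

A perfect segmentation scores 0.0; a hypothesis that misses every boundary inside the probe scores 1.0.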

Results and Analysis
Experimental results are shown in Table 2. Our proposed model is examined in 4 different settings: whether the pointwise or pairwise ranking algorithm is used, and whether word embeddings are fine-tuned. The best model, Ours-pointwise-static, achieves better or competitive performance compared to BayesSeg and TopicTiling according to all three metrics, especially on the News dataset. TopicTiling is reported to perform well on heuristically constructed datasets (Riedl and Biemann, 2012b), but behaves mediocrely on the manually labeled dataset in our experiments.
One interesting phenomenon is that fine-tuned word embeddings have a negative impact on overall performance, which is generally not the case in many NLP tasks. The reason may be that our task involves domain adaptation, and word embeddings should generalize well across different domains rather than adapt to Wikipedia text. Though our proposed sampling mechanism enables easier training of the pairwise ranking model, it inevitably loses some ordering information, which makes the pairwise ranking model perform slightly worse than the pointwise one.

To illustrate what the model has learned, we show some typical examples of coherence scores for text pairs <A,B> in Table 3. There is almost no lexical overlap in any of the three text pairs, so cosine similarity between one-hot vectors would surely fail to rank them, even though "canonization" and "commemorations", "respond" and "responses", and "environment" and "environmental" are closely related semantically. As we expect, our proposed model is able to capture such semantic relatedness and assign a reasonable score to each text pair, which is key to topic boundary detection.

Conclusion
This paper proposes a novel approach to topic segmentation by learning to rank semantic coherence. A symmetric convolutional neural network is used for text pair modeling. Training data can be automatically constructed from unlabeled documents, so no labeled data is needed. Experiments show promising performance on datasets from various domains.