Toward Fast and Accurate Neural Discourse Segmentation

Discourse segmentation, which segments texts into Elementary Discourse Units, is a fundamental step in discourse analysis. Previous discourse segmenters rely on complicated hand-crafted features and are not practical for actual use. In this paper, we propose an end-to-end neural segmenter based on the BiLSTM-CRF framework. To improve its accuracy, we address the problem of data insufficiency by transferring a word representation model trained on a large corpus. We also propose a restricted self-attention mechanism to capture useful information within a neighborhood. Experiments on the RST-DT corpus show that our model is significantly faster than previous methods while achieving new state-of-the-art performance.


Introduction
Discourse segmentation, which divides a text into proper discourse units, is one of the fundamental tasks in natural language processing. According to Rhetorical Structure Theory (RST) (Mann and Thompson, 1988), a complex text is composed of non-overlapping Elementary Discourse Units (EDUs), as shown in Table 1. Segmenting text into such discourse units is a key step in discourse analysis (Marcu, 2000) and can benefit many downstream tasks, such as sentence compression (Sporleder and Lapata, 2005) and document summarization.
Since EDUs were initially designed to be determined by lexical and syntactic clues (Carlson et al., 2001), existing methods for discourse segmentation usually design hand-crafted features to capture these clues (Feng and Hirst, 2014). In particular, nearly all previous methods rely on syntactic parse trees to achieve good performance. However, extracting such features usually takes a long time, which contradicts the fundamental role of discourse segmentation and hinders its actual use. Considering the great success of deep learning on many NLP tasks (Lu and Li, 2016), it is natural to design an end-to-end neural model that can segment texts fast and accurately.
The first challenge in applying neural methods to discourse segmentation is data insufficiency. Due to the limited size of labeled data in existing corpora (Carlson et al., 2001), it is quite hard to train a data-hungry neural model without any prior knowledge. In fact, some traditional features, such as POS tags or parse trees, naturally provide strong signals for identifying EDUs; removing them certainly increases the difficulty of learning an accurate model. Secondly, many EDU boundaries are actually not determined locally. For example, to recognize the boundary between $e_3$ and $e_4$ in Table 1, our model has to be aware that $e_3$ is an embedded clause starting from "overlooking"; otherwise, it could regard "San Fernando Valley" as the subject of $e_4$. Such long-distance dependencies can be precisely extracted from parse trees but are difficult for neural models to capture.
To address these challenges, in this paper we propose a neural discourse segmenter based on the BiLSTM-CRF (Huang et al., 2015) framework and further improve it in two aspects. Firstly, since the discourse segmentation corpus is too small to learn precise word representations, we transfer a word representation model trained on a large corpus into our task, and show that this transferred model provides very useful information. Secondly, in order to model long-distance dependencies, we employ the self-attention mechanism (Vaswani et al., 2017) when encoding the text. Different from previous self-attention, we restrict the attention area to a neighborhood of fixed size. The motivation is that effective information for determining boundaries is usually collected from adjacent EDUs, while the whole text may contain many distracting words, which could mislead the model into incorrect decisions. In summary, the contributions of this work are as follows:

• Our neural discourse segmentation model does not rely on any syntactic features, yet it outperforms other state-of-the-art systems and achieves a significant speedup.
• To our knowledge, we are the first to transfer word representations learned from a large corpus into the discourse segmentation task, and we show that they can significantly alleviate the data insufficiency problem.
• Based on the nature of discourse segmentation, we propose a restricted self-attention mechanism, which enables the model to capture useful information within a neighborhood while ignoring unnecessary faraway noise.

Neural Discourse Segmentation Model
We model discourse segmentation as a sequence labeling task, where the first word of each EDU (except the first EDU) is labeled 1 and all other words are labeled 0. Figure 1 gives an overview of our segmentation model. We introduce the BiLSTM-CRF framework in Section 2.1 and describe the two key components of our model in Sections 2.2 and 2.3.
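As a concrete illustration of this labeling scheme, the following sketch (the helper name is ours, for illustration only) converts a list of EDUs into a flat token sequence with 0/1 boundary labels:

```python
def edus_to_labels(edus):
    """Convert a list of EDUs (each a list of tokens) into a flat token
    sequence and 0/1 labels: the first token of every EDU except the
    first is labeled 1, all other tokens are labeled 0."""
    tokens, labels = [], []
    for i, edu in enumerate(edus):
        for j, tok in enumerate(edu):
            tokens.append(tok)
            labels.append(1 if (i > 0 and j == 0) else 0)
    return tokens, labels
```

Recovering the segmentation from predicted labels is simply the inverse operation: start a new EDU at every position labeled 1.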

BiLSTM-CRF for Sequence Labeling
Conditional Random Fields (CRF) (Lafferty et al., 2001) are an effective method for sequence labeling problems and have been widely used in many NLP tasks (Sutton and McCallum, 2012). To approach our discourse segmentation task in a neural way, we adopt the BiLSTM-CRF model (Huang et al., 2015) as the framework of our system. Formally, given an input sentence $x = \{x_t\}_{t=1}^{n}$, we first embed each word into a vector $e_t$. Then these word embeddings are fed into a bi-directional LSTM layer to model the sequential information:

$$h_t = [\overrightarrow{\mathrm{LSTM}}(e_1, \ldots, e_t);\ \overleftarrow{\mathrm{LSTM}}(e_t, \ldots, e_n)] \quad (1)$$

where $h_t$ is the concatenation of the hidden states from both the forward and backward LSTMs.

[Figure 1: Overview of our model for discourse segmentation]

After encoding the sentence, we make a labeling decision for each word. Instead of modeling the decisions independently, the CRF layer computes the conditional probability $p(y \mid h; W, b)$ over all possible label sequences $y$ given $h$ as follows:

$$p(y \mid h; W, b) = \frac{\prod_{t=1}^{n} \psi(y_{t-1}, y_t, h_t)}{\sum_{y' \in Y} \prod_{t=1}^{n} \psi(y'_{t-1}, y'_t, h_t)}$$

where $\psi(y_{t-1}, y_t, h_t) = \exp(W_{y_{t-1}, y_t}^{\top} h_t + b_{y_{t-1}, y_t})$ is the potential function and $Y$ is the set of possible label sequences. The training objective is to maximize the conditional likelihood of the gold label sequence. During testing, we search for the label sequence with the highest conditional probability.
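The test-time search over label sequences is standard Viterbi decoding. The following minimal NumPy sketch illustrates the idea, assuming per-word emission scores (e.g. produced from the BiLSTM states) and a learned label-transition score matrix; variable names are ours, not the authors':

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Find the highest-scoring label sequence under a linear-chain CRF.
    emissions: (n, L) per-word scores for each of L labels;
    transitions: (L, L) score of moving from label i to label j."""
    n, L = emissions.shape
    score = emissions[0].copy()          # best score ending in each label
    back = np.zeros((n, L), dtype=int)   # backpointers
    for t in range(1, n):
        # total[i, j] = best score of a path ending in label i at t-1,
        # then moving to label j at step t
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    # follow backpointers from the best final label
    best = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        best.append(int(back[t][best[-1]]))
    return best[::-1]
```

In the binary boundary-labeling setting of this paper, L = 2 and the decoded sequence directly marks EDU start positions.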

Transferring Representations Learned from Large Corpus
Due to their large parameter space, neural models usually require much more training data to achieve good performance. However, to the best of our knowledge, nearly all existing discourse segmentation corpora are limited in size. After we remove all the syntactic features, which have been proven useful in much previous work (Bach et al., 2012; Feng and Hirst, 2014; Joty et al., 2015), our neural model is not expected to achieve very satisfying results.
To tackle this issue, we propose to leverage a model learned from other large datasets, expecting that this transferred model has been well trained to encode text and capture useful signals. Instead of training the transferred model ourselves, in this paper we adopt the ELMo word representations, which are derived from a bidirectional language model (BiLM) trained on the One Billion Word Benchmark corpus (Chelba et al., 2014). Specifically, this BiLM has one character convolution layer and two biLSTM layers, and correspondingly there are three internal representations for each word $x_t$, denoted as $\{h_{t,l}^{LM}\}_{l=1}^{3}$. Following the original ELMo formulation, we compute the ELMo representation $r_t$ for word $x_t$ as follows:

$$r_t = \gamma^{LM} \sum_{l=1}^{3} s_l^{LM} h_{t,l}^{LM} \quad (2)$$

where $s^{LM}$ are normalized weights and $\gamma^{LM}$ controls the scaling of the entire ELMo vector. Then we concatenate $r_t$ with the word embedding $e_t$ and take them as the input of Equation (1).
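The weighted layer combination can be sketched as follows; the function name and the softmax normalization of the raw layer weights are assumptions for illustration:

```python
import numpy as np

def elmo_combine(layer_reps, s_logits, gamma):
    """Collapse the BiLM layer representations of one word into a single
    ELMo vector: r_t = gamma * sum_l softmax(s)_l * h_{t,l}.
    layer_reps: (3, d) internal representations of the word;
    s_logits: (3,) raw layer weights; gamma: global scaling scalar."""
    s = np.exp(s_logits - s_logits.max())
    s = s / s.sum()                      # softmax-normalized layer weights
    return gamma * (s[:, None] * layer_reps).sum(axis=0)
```

The resulting vector would then be concatenated with the word embedding before the BiLSTM layer.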

Restricted Self-Attention
As introduced in Section 1, some EDU boundaries rely on relatively long-distance signals, which a plain LSTM model is still weak at capturing. Recently, the self-attention mechanism, which relates different positions of a single sequence, has been successfully applied to many NLP tasks (Vaswani et al., 2017; Wang et al., 2017) and has shown its superiority in capturing long-distance dependencies. However, we found that most boundaries are actually influenced only by nearby EDUs, so forcing the model to attend to the whole sequence brings in unnecessary noise. Therefore, we propose a restricted self-attention mechanism, which only collects information from a fixed neighborhood. To do this, we first compute the similarity between the current word $x_i$ and each nearby word $x_j$ within a window:

$$\alpha_{i,j} = \frac{\exp(h_i^{\top} h_j)}{\sum_{k=i-K}^{i+K} \exp(h_i^{\top} h_k)}$$

Then the attention vector $a_i$ is computed as a weighted sum of nearby words:

$$a_i = \sum_{j=i-K}^{i+K} \alpha_{i,j} h_j$$

where the hyper-parameter $K$ is the window size. This attention vector $a_i$ is then fed into another BiLSTM layer together with $h_i$ in order to fuse the information:

$$\tilde{h}_t = \mathrm{BiLSTM}([h_t; a_t])$$

We use $\tilde{h}_t$ as the new input to the CRF layer.
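A minimal NumPy sketch of the restricted attention follows, assuming a simple dot-product similarity between hidden states (the paper's exact scoring function may differ):

```python
import numpy as np

def restricted_attention(H, K):
    """For each position i, attend only to positions in [i-K, i+K].
    H: (n, d) hidden states. Returns (n, d) attention vectors a_i,
    each a softmax-weighted sum of the neighboring hidden states."""
    n, d = H.shape
    A = np.zeros_like(H)
    for i in range(n):
        lo, hi = max(0, i - K), min(n, i + K + 1)
        scores = H[lo:hi] @ H[i]        # dot-product similarity to neighbors
        w = np.exp(scores - scores.max())
        w = w / w.sum()                 # softmax over the window only
        A[i] = w @ H[lo:hi]
    return A
```

Restricting the softmax denominator to the window is what keeps faraway positions from receiving any attention mass, in contrast to full self-attention over the whole sequence.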

Dataset and Metrics
We conduct experiments on the RST Discourse Treebank (RST-DT) (Carlson et al., 2001). The corpus contains 385 Wall Street Journal articles from the Penn Treebank, divided into a training set (347 articles, 6,132 sentences) and a test set (38 articles, 991 sentences). We randomly sample 34 articles (10%) from the training set as a validation set to tune the hyper-parameters, and train our model only on the remaining articles. Following mainstream studies (Soricut and Marcu, 2003; Joty et al., 2015), we measure segmentation accuracy only with respect to the intra-sentential segment boundaries, and we report Precision (P), Recall (R), and F1-score (F1) for segmentation performance.
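Boundary-level precision, recall, and F1 over a sentence can be computed as follows; this helper is illustrative, not the official evaluation script:

```python
def boundary_prf(gold, pred):
    """Precision, recall, and F1 over intra-sentential boundary positions.
    gold, pred: iterables of boundary indices within a sentence."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                      # correctly predicted boundaries
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Corpus-level scores would aggregate the true positives and totals over all test sentences before computing the ratios.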

Implementation Details
We tune all the hyper-parameters according to the model performance on the held-out validation set. The 300-dimensional GloVe embeddings (Pennington et al., 2014) are employed and kept fixed during training. We use the AllenNLP toolkit to compute the ELMo word representations. The hidden size of our model is set to 200 and the batch size is 32. L2 regularization is applied to trainable variables with a weight of 0.0001, and we use dropout between every two layers with a dropout rate of 0.1. For model training, we employ the Adam algorithm (Kingma and Ba, 2014) with an initial learning rate of 0.0001, and we clip the gradients to a maximal norm of 5.0. An exponential moving average with a decay rate of 0.9999 is applied to all trainable variables. The window size K for the restricted attention is set to 5.
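For reference, the hyper-parameters above can be collected in one place; the dictionary and its key names are illustrative, not taken from the authors' code:

```python
# Hyper-parameter settings reported in the paper.
CONFIG = {
    "embeddings": "glove.300d",   # kept fixed during training
    "hidden_size": 200,
    "batch_size": 32,
    "l2_weight": 1e-4,
    "dropout": 0.1,
    "optimizer": "adam",
    "learning_rate": 1e-4,
    "grad_clip_norm": 5.0,
    "ema_decay": 0.9999,
    "attention_window": 5,        # window size K for restricted attention
}
```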

Performance
The results of our model and other competing systems on the test set of RST-DT are shown in Table 2. We compare our results against the following systems: (1) SPADE (Soricut and Marcu, 2003), an early system using simple lexical and syntactic features; (2) later segmenters that rely on parse trees (Prasad et al., 2005), where Stanford denotes trees from the Stanford parser (Klein and Manning, 2003) and BLLIP denotes those from the BLLIP parser (Charniak and Johnson, 2005). From Table 2, we can see that our model achieves state-of-the-art performance without any extra parse trees. In particular, when no gold parse trees are provided, our system outperforms the other methods by more than 1.7 points in F1 score. Since gold parse trees are not available when processing new sentences, this improvement becomes even more valuable when the system is put into use.

In parallel with our work, Li et al. (2018) propose another neural model, with performance P: 91.6, R: 92.8, F1: 92.2. We did not see their paper at the time of submission, but it is worth mentioning here for the readers' reference.

[Table 3: Speed (Sents/s) and speedup of each compared system, including Two-Pass]

To further explore the influence of the different components of our model, we also report the results of ablation experiments in Table 2. We can see that the transferred ELMo representations provide the most significant improvement. This accords with our assumption that the RST-DT corpus itself is not large enough to sufficiently train an expressive neural model. With the help of the transferred representations, we are able to capture more semantic and syntactic signals. Also, comparing the models with and without the restricted self-attention, we find that this attention mechanism further boosts the performance. In particular, without the ELMo vectors, the improvement provided by the attention mechanism is more noticeable.

Speed Comparison
We also measure the speedup of our model over traditional systems in Table 3. The Two-Pass system has the best performance among all existing methods, while SPADE is much simpler with fewer features. We test these systems on the same machine (CPU: Intel Xeon E5-2690, GPU: NVIDIA Tesla P100). The results show that our system is 2.4 to 6.5 times faster than the compared systems when the batch size is 1. Moreover, if we process the test sentences in parallel, we achieve a 20.2 to 54.8 times speedup with a batch size of 32. This makes our system more practical for actual use.

Effect of Restricted Self-Attention
We propose to restrict the self-attention to a neighborhood instead of the whole sequence. Table 4 shows the performance of our model with different window sizes K. We can see that all of these results are better than the performance of our model without the attention mechanism. However, a properly restricted window helps the attention mechanism take better effect.

Conclusion
In this paper, we propose a neural discourse segmenter that segments text fast and accurately. Different from previous methods, our segmenter does not rely on any hand-crafted features, especially syntactic parse trees. To achieve this, we leverage word representations learned from a large corpus and propose a restricted self-attention mechanism. Experimental results on RST-DT show that our system achieves state-of-the-art performance together with a significant speedup.