Transition-Based Disfluency Detection using LSTMs

In this paper, we model the problem of disfluency detection using a transition-based framework, which incrementally constructs and labels the disfluency chunks of input sentences using a new transition system without syntax information. Compared with sequence labeling methods, it can capture non-local chunk-level features; compared with joint parsing and disfluency detection methods, it is free from noise in syntax. Experiments show that our model achieves a state-of-the-art F-score of 87.5% on the commonly used English Switchboard test set, and state-of-the-art performance on a set of in-house annotated Chinese data.


Introduction
Disfluency detection is the task of recognizing non-fluent word sequences in spoken language transcripts (Zayats et al., 2016; Wu et al., 2015). As shown in Figure 1, the standard annotation of disfluency structure (Shriberg, 1994) indicates the reparandum (words that are discarded, or corrected by the following words), the interruption point (+) marking the end of the reparandum, the associated repair, and an optional interregnum after the interruption point (filled pauses, discourse cue words, etc.).
Ignoring the interregnum, disfluencies can be categorized into three types: restarts, repetitions, and corrections, based on whether the repair is empty, the same as the reparandum, or different, respectively. Table 1 gives a few examples. Interregnums are easy to detect as they often consist of fixed phrases (e.g. "uh", "you know"). Reparandums, however, are more difficult to detect because they can take arbitrary forms. Most previous disfluency detection work focuses on detecting reparandums.
The main challenges in detecting reparandums are that they vary in length, may occur in different locations, and are sometimes nested. For example, the longest reparandum in our training set has fifteen words. Hence, it is very important to capture long-range dependencies for disfluency detection. Since there is a high degree of parallelism between the reparandum chunk and the following repair chunk (for example, in Figure 1, the reparandum begins with "to" and ends before another occurrence of "to"), it is also useful to exploit chunk-level representations, which explicitly make use of the resulting disfluent chunks.
Common approaches treat disfluency detection as a sequence labeling problem, where each word in the sentence is assigned a label (Zayats et al., 2016; Hough and Schlangen, 2015; Qian and Liu, 2013; Georgila, 2009). These methods achieve good performance, but are not powerful enough to capture complicated disfluencies with longer spans or distances. Another drawback of these approaches is that they are unable to exploit chunk-level features. Semi-CRF (Ferguson et al., 2015) alleviates this issue to some extent. However, semi-CRF models are still limited, because the Markov assumption restricts them to local chunk information when decoding.
A different line of work (Rasooli and Tetreault, 2013; Honnibal and Johnson, 2014; Wu et al., 2015) adopts transition-based parsing models for disfluency detection. This line of work can be seen as joint disfluency detection and parsing. The main advantage of the joint models is that they can capture long-range dependencies of disfluencies as well as chunk-level information. However, they require additional annotated syntactic structure, which is very expensive to produce, and can introduce noise by significantly enlarging the output search space.
Inspired by the above observations, we investigate a transition-based model without syntactic information. Our model incrementally constructs and labels the disfluency chunks of input sentences using an algorithm similar to transition-based dependency parsing. As shown in Figure 2, the model state consists of four components: (i) O, a conventional sequential LSTM (Hochreiter and Schmidhuber, 1997) that stores the words that have been labeled as fluent; (ii) S, a stack LSTM that represents partial disfluency chunks and captures chunk-level information; (iii) A, a conventional sequential LSTM that represents the history of actions; and (iv) B, a Bi-LSTM that represents the words that have not yet been processed. A sequence of transition actions is used to consume input tokens and construct the output from left to right. To reduce error propagation, we use either beam search (Collins and Roark, 2004) or scheduled sampling (Bengio et al., 2015). We evaluate our model on the commonly used English Switchboard test set and an in-house annotated Chinese data set. Results show that our model outperforms previous state-of-the-art systems. The code is publicly released (see footnote 1).

Background
Figure 2: Model state when processing the sentence "want a flight to boston to denver".
Footnote 1: https://github.com/hitwsl/transition disfluency
As background, we briefly introduce transition-based parsing and its extension to joint disfluency detection. An arc-eager transition-based parsing system consists of a stack σ containing words being processed, a buffer β containing words to be processed, and a memory A storing the dependency arcs that have been generated. There are four types of transition actions (Nivre, 2008):
• Shift : Remove the front of the buffer and push it to the stack.
• Reduce : Pop the top of the stack.
• LeftArc : Pop the top of the stack, and link the popped word to the front of the buffer.
• RightArc : Link the front of the buffer to the top of the stack, remove the front of the buffer and push it to the stack.
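The four actions listed above can be illustrated with a minimal sketch. It assumes words are identified by their position in the sentence, stores arcs as (head, dependent) pairs, and omits the preconditions a full arc-eager parser would check; the class and function names are hypothetical and are not taken from any parser implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ArcEagerState:
    stack: list                       # sigma: word indices being processed
    buffer: list                      # beta: word indices still to be processed (front = index 0)
    arcs: set = field(default_factory=set)  # A: generated (head, dependent) pairs


def shift(state):
    # Remove the front of the buffer and push it to the stack.
    state.stack.append(state.buffer.pop(0))


def reduce_(state):
    # Pop the top of the stack.
    state.stack.pop()


def left_arc(state):
    # Pop the top of the stack and link it to the front of the buffer,
    # so the buffer front becomes the head of the popped word.
    dependent = state.stack.pop()
    state.arcs.add((state.buffer[0], dependent))


def right_arc(state):
    # Link the front of the buffer to the top of the stack (stack top is the head),
    # then move the buffer front onto the stack.
    head = state.stack[-1]
    dependent = state.buffer.pop(0)
    state.arcs.add((head, dependent))
    state.stack.append(dependent)


# Toy run on the two-word fragment "a flight" (indices 0 and 1).
state = ArcEagerState(stack=[], buffer=[0, 1])
shift(state)      # stack: [0], buffer: [1]
left_arc(state)   # arc (1, 0): "flight" heads "a"
shift(state)      # stack: [1], buffer: []
```

In this toy run the only arc produced makes "flight" the head of "a"; choosing which action to apply at each step is the job of the parsing model and is not shown.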
Many neural network parsers have been constructed under this framework, including parsers that use different LSTM structures to represent the information in σ and β.
For disfluency detection, the input is a sentence with disfluencies from automatic speech recognition (ASR). We denote the word sequence as $w_1^n = (w_1, \ldots, w_n)$. The output of the task is a sequence of binary tags denoted as $D_1^n = (d_1, \ldots, d_n)$, where each $d_i$ corresponds to the word $w_i$, indicating whether $w_i$ is a disfluent word or not. Hence the task can be modeled as searching for the best sequence $D^*$ given the stream of words $w_1^n$:
$$D^* = \arg\max_{D} P(D_1^n \mid w_1^n)$$
Wu et al. (2015) propose a statistical transition-based disfluency detection model, which performs disfluency detection and parsing jointly by augmenting the Shift-Reduce algorithm with a binary classifier transition (BCT) action:
• BCT : Classify whether the current word is disfluent or not. If it is, remove it from the buffer, push it onto the stack (as Shift does), and mark it as disfluent. Otherwise, the original parser transition actions are used.
Disfluency detection and parsing are jointly optimized. In the joint objective, $T_1^i$ is the partial tree after word $w_i$ is consumed and $d_i$ is the disfluency tag of $w_i$; $P(T_1^i \mid \cdot)$ is the parsing model and $P(d_1^i \mid \cdot)$ is the disfluency model, which predicts the disfluency tags in the context of the partial trees that have been built.

Our Transition-Based Model
The BCT model serves as a state-of-the-art transition-based baseline. However, it requires that the training data contain both syntax trees and disfluency annotations, which reduces the practicality of the algorithm. Also, BCT does not explicitly make use of the resulting disfluent chunks. Moreover, since it is a discrete model, its performance relies heavily on manual feature engineering.
To address the constraints above, we apply a transition-based neural model for disfluency detection that does not use any syntax information. Our transition-based method incrementally constructs and labels the disfluency chunks of input sentences by performing a sequence of actions, so the task is modeled as a search over action sequences.

Transition-Based Disfluency Detection
Our model incrementally constructs and labels the disfluency chunks of input sentences, where a state is represented by a tuple (O, S, A, B):
• output (O) : represents the words that have been labeled as fluent.
• stack (S) : represents the partially constructed disfluency chunk.
• action (A) : represents the complete history of actions taken by the transition system.
• buffer (B) : represents the words of the sentence that have not yet been processed.
Given an input disfluent sentence, the stack, output and action are initially empty and the buffer contains all words of the sentence. A sequence of transition actions is then used to consume words in the buffer and build the output sentence:
• OUT : moves the first word in the buffer to the output and clears the stack if it is not empty.
• DEL : moves the first word in the buffer to the stack.
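The OUT and DEL actions above can be traced with a small symbolic simulation. This is only a sketch: it tracks plain word lists rather than LSTM representations, the `chunks` list that records each cleared stack is a bookkeeping addition for illustration, and the action sequence in the example assumes that "to boston" is the reparandum of the running example sentence.

```python
def run_transitions(words, actions):
    """Apply OUT/DEL actions to a sentence and return (output, chunks, history)."""
    buffer = list(words)      # B: unprocessed words, front at index 0
    output = []               # O: words labeled as fluent
    stack = []                # S: partially constructed disfluency chunk
    history = []              # A: actions taken so far
    chunks = []               # illustration only: completed disfluency chunks
    for action in actions:
        if not buffer:
            raise ValueError("more actions than input words")
        word = buffer.pop(0)
        if action == "OUT":
            if stack:                 # OUT clears a non-empty stack
                chunks.append(stack)
                stack = []
            output.append(word)
        elif action == "DEL":
            stack.append(word)
        else:
            raise ValueError(f"unknown action: {action}")
        history.append(action)
    if stack:                         # a chunk may still be open when B is empty
        chunks.append(stack)
    return output, chunks, history


words = "want a flight to boston to denver".split()
actions = ["OUT", "OUT", "OUT", "DEL", "DEL", "OUT", "OUT"]  # assumed gold sequence
output, chunks, history = run_transitions(words, actions)
print(" ".join(output))   # want a flight to denver
print(chunks)             # [['to', 'boston']]
```

Every action consumes exactly one word, so the action sequence always has the same length as the input sentence.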

Search Algorithm
Based on the transition system, the decoder searches for an optimal action sequence for a given sentence. The system is initialized by pushing all the input words and their representations (see §3.3) onto B in reverse order, such that the first word is at the top of B, while S, O and A each contain an empty-stack token.
At each step, the system computes a composite representation of the model state (as determined by the current configurations of B, S, O, and A), which is used to predict an action to take. Decoding completes when B is empty (except for the empty-stack symbol), regardless of the state of S. Since each token in B is moved either to O or to S at every step, the total number of actions equals the length of the input sentence. Table 2 shows the sequence of operations required to process the sentence "want a flight to boston to denver".
Table 2: Segmentation process of "a flight to boston to denver".
As shown in Figure 2, the model state representation at time $t$, written $e_t$, is computed from the representations of the four components, $e_t = f(s_t, b_t, a_t, o_t)$, and is passed through a component-wise rectified linear unit (ReLU) for nonlinearity (Glorot et al., 2011).
Finally, the model state $e_t$ is used to compute the probability $p(z_t \mid e_t)$ of the action at time $t$, using $g_z$, a column vector representing the embedding of the transition action $z$, and $q_z$, a bias term for action $z$. The set $A(S, B)$ represents the valid actions that may be taken given the current state. Since $e_t = f(s_t, b_t, a_t, o_t)$ encodes information about all previous decisions made by the transition system, the probability of any valid sequence of transition actions $z$ conditioned on the input can be written as the product of the per-step action probabilities $p(z_t \mid e_t)$. The disfluency detection task is thereby merged into the transition-based system.
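To make the scoring step concrete, the following sketch computes a model state and an action distribution from toy vectors. It rests on assumptions not stated in the text: $f$ is taken to be an affine map of the concatenated component vectors followed by a ReLU, and the action probability is taken to be a softmax over the valid actions of $g_z^\top e_t + q_z$, as is common for stack-LSTM style transition models. The weights `W`, `bias`, `g` and `q` are hypothetical toy values.

```python
import math


def model_state(W, bias, s_t, b_t, a_t, o_t):
    # Assumed form: e_t = ReLU(W [s_t; b_t; a_t; o_t] + bias).
    x = s_t + b_t + a_t + o_t   # list concatenation of the four component vectors
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W, bias)]


def action_distribution(e_t, g, q, valid_actions):
    # Assumed form: softmax over valid actions of (g_z . e_t + q_z).
    scores = {z: sum(gi * ei for gi, ei in zip(g[z], e_t)) + q[z]
              for z in valid_actions}
    top = max(scores.values())                      # subtract max for numerical stability
    exps = {z: math.exp(s - top) for z, s in scores.items()}
    total = sum(exps.values())
    return {z: v / total for z, v in exps.items()}


def sequence_log_prob(step_probs):
    # Log-probability of an action sequence as the sum of per-step log-probabilities.
    return sum(math.log(p) for p in step_probs)


# Toy example with one-dimensional component vectors.
W = [[1.0, 0.0, 0.0, 0.0], [0.0, -1.0, 0.0, 0.0]]
bias = [0.0, 0.0]
e_t = model_state(W, bias, [2.0], [3.0], [0.0], [0.0])
g = {"OUT": [1.0, 0.0], "DEL": [0.0, 0.0]}
q = {"OUT": 0.0, "DEL": 0.0}
probs = action_distribution(e_t, g, q, ["OUT", "DEL"])
```

Restricting the softmax to the valid actions plays the role of $A(S, B)$; in the OUT/DEL system both actions are available whenever the buffer is non-empty.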

Beam Search
The main drawback of greedy search is error propagation. An incorrect action has a negative influence on subsequent actions, leading to an incorrect output sequence. One way to reduce error propagation is beam search. Because the number of actions taken always equals the number of words in the input sentence for every valid path, all hypotheses in the beam have the same length at each step, so it is straightforward to use beam search. We use beam search for both training and testing. The early update strategy of Collins and Roark (2004) is applied for training. In particular, each training sequence is decoded, and we keep track of the location of the gold path in the beam. If the gold path falls out of the beam at step t, the decoding process is stopped and a parameter update is performed using the gold path as a positive example and the beam items as negative examples. We also use the global optimization method (Andor et al., 2016) to train our beam-search model.
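The decoding loop with early-update detection can be sketched as follows. This is an illustration only: `score_fn` is a hypothetical stand-in for the model's log-score of an action given the action prefix, hypothesis scores are assumed to be additive, and the parameter update that would follow an early stop is not shown.

```python
def beam_search(n_words, score_fn, beam_size, gold=None):
    """Decode n_words steps over OUT/DEL actions.

    Returns (beam, step). If `gold` is given and its prefix falls out of the
    beam, decoding stops early and `step` is the step at which that happened.
    """
    beam = [(0.0, ())]                       # (cumulative score, action tuple)
    for t in range(n_words):
        candidates = [
            (score + score_fn(actions, z), actions + (z,))
            for score, actions in beam
            for z in ("OUT", "DEL")
        ]
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_size]
        if gold is not None:
            gold_prefix = tuple(gold[: t + 1])
            if gold_prefix not in {actions for _, actions in beam}:
                return beam, t + 1           # early-update point: gold left the beam
    return beam, n_words


def toy_score(prefix, action):
    # Hypothetical scorer that always prefers OUT; a real model would use e_t.
    return 0.0 if action == "OUT" else -1.0


beam, step = beam_search(3, toy_score, beam_size=2)
early_beam, early_step = beam_search(3, toy_score, beam_size=2,
                                     gold=["DEL", "DEL", "OUT"])
```

With this toy scorer the gold prefix (DEL, DEL) scores -2 and cannot enter a beam of size 2 at the second step, so `early_step` is 2; in training, the items in `early_beam` would serve as negative examples against the gold prefix.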

Scheduled Sampling
Scheduled sampling (Bengio et al., 2015) can also be used to reduce error propagation. The training goal of the greedy baseline is to maximize the likelihood of each action given the current model state, which means that the correct action is taken at each step. During inference, the action predicted by the model itself is taken instead. This discrepancy between training and inference can yield errors that accumulate quickly along the search process. Scheduled sampling addresses the discrepancy by gently changing the training process from a fully guided scheme using the true previous action towards a less guided scheme that mostly uses the predicted action instead. During training, we take the action with the highest $p(z_t \mid e_t)$ with a certain probability $p$, and the correct action with probability $(1 - p)$.
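The per-step choice described above can be sketched in a few lines. The function name and arguments are hypothetical; how $p$ is changed over the course of training (the schedule itself) is not specified here and is left out.

```python
import random


def choose_training_action(model_probs, gold_action, p, rng=random):
    """Pick the action to follow during training.

    With probability p the model's own highest-probability action is taken;
    otherwise (probability 1 - p) the correct action is taken.
    """
    predicted = max(model_probs, key=model_probs.get)
    return predicted if rng.random() < p else gold_action


# Example: the model prefers OUT, but the gold action is DEL.
rng = random.Random(0)
action = choose_training_action({"OUT": 0.7, "DEL": 0.3}, "DEL", p=0.25, rng=rng)
```

Setting `p = 0` recovers the fully guided scheme of the greedy baseline, while `p = 1` always follows the model's own prediction, as at inference time.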

State Representation
To better capture non-local context information, we use LSTM structures to represent the different components of each state: buffer, action, stack, and output. In particular, we use LSTM-Minus (Wang and Chang, 2016) to model the buffer segment, a conventional LSTM to model the action and output segments, and a stack LSTM, which has proven highly effective in parsing, to model the stack segment.

Buffer Representation
To construct a more informative representation, we use a Bi-LSTM to represent the buffer, following the work of Wang and Chang (2016), in which the subtraction between LSTM hidden vectors is utilized to represent a segment's information. We apply a similar method in a Bi-LSTM to obtain the representation of the buffer: a forward subtraction yields $b_f$ and a backward subtraction yields $b_b$. In the running example, "to" is the first word in the buffer and "denver" is the last. $b_f$ and $b_b$ are then concatenated as the representation of the buffer.
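The subtraction idea can be sketched with toy hidden vectors. The index convention below is an assumption, chosen to mirror the usual LSTM-Minus formulation: the forward part subtracts the hidden vector just before the segment from the one at its last word, and the backward part subtracts the hidden vector just after the segment from the one at its first word, with zero vectors outside the sentence. The exact indices used in this paper are not recoverable from the text, and the hidden vectors here are hand-written numbers rather than real Bi-LSTM outputs.

```python
def vec_sub(a, b):
    # Element-wise difference of two equal-length vectors.
    return [x - y for x, y in zip(a, b)]


def buffer_representation(h_fwd, h_bwd, first, last):
    """Represent words first..last (inclusive) by hidden-vector subtraction.

    h_fwd[i] / h_bwd[i] are the forward / backward hidden vectors at word i.
    Returns the concatenation [b_f; b_b].
    """
    zero = [0.0] * len(h_fwd[0])
    before = h_fwd[first - 1] if first > 0 else zero          # state left of the segment
    after = h_bwd[last + 1] if last + 1 < len(h_bwd) else zero  # state right of the segment
    b_f = vec_sub(h_fwd[last], before)
    b_b = vec_sub(h_bwd[first], after)
    return b_f + b_b


# Toy three-word sentence with one-dimensional hidden vectors;
# the buffer holds the last two words (indices 1 and 2).
h_fwd = [[1.0], [2.0], [4.0]]
h_bwd = [[7.0], [5.0], [3.0]]
rep = buffer_representation(h_fwd, h_bwd, first=1, last=2)   # [3.0, 5.0]
```

Because the Bi-LSTM is run once over the whole sentence, the representation of any remaining buffer segment can be obtained from two subtractions without re-encoding.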

Action Representation
We represent an action a with an embedding $e_a(a)$ from a look-up table $E_a$, and apply a conventional LSTM to represent the complete history of actions taken by the transition system. Once an action a is taken, the embedding $e_a(a)$ is added to the right-most position of the LSTM.

Stack Representation
We use a stack LSTM to represent the partial disfluency chunk. The stack LSTM augments the conventional LSTM with a "stack pointer". For a conventional LSTM, new inputs are always added in the right-most position; in a stack LSTM, the current location of the stack pointer determines which cell in the LSTM provides $c_{t-1}$ and $h_{t-1}$ when computing the new memory cell contents. In addition to adding elements to the end of the sequence, the stack LSTM provides a pop operation, which moves the stack pointer to the previous element. Thus, the LSTM can be understood as a stack implemented so that contents are never overwritten. When the action OUT is taken, the stack is cleared by moving the stack pointer to the initial position. When the action DEL is taken, the representation of the buffer will be added directly to the stack LSTM.
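The stack-pointer bookkeeping can be sketched independently of the LSTM equations. In the sketch below, `cell` is a hypothetical stand-in for the LSTM update (any function from a previous state and an input to a new state), and the class and method names are illustrative only.

```python
class StackRNN:
    """Recurrent stack in which states are appended and never overwritten."""

    def __init__(self, cell, initial_state):
        self.cell = cell                   # stand-in for the LSTM cell update
        self.states = [initial_state]      # all states ever computed
        self.parents = [None]              # index of the state each one was built on
        self.pointer = 0                   # stack pointer; 0 is the initial position

    def push(self, x):
        # The state under the pointer supplies the previous state for the update.
        new_state = self.cell(self.states[self.pointer], x)
        self.states.append(new_state)
        self.parents.append(self.pointer)
        self.pointer = len(self.states) - 1

    def pop(self):
        # Move the pointer back to the previous element; nothing is erased.
        if self.parents[self.pointer] is None:
            raise IndexError("pop from empty stack")
        self.pointer = self.parents[self.pointer]

    def clear(self):
        # Used for OUT: move the pointer to the initial position.
        self.pointer = 0

    def summary(self):
        # Representation of the current stack contents.
        return self.states[self.pointer]


# Toy cell: the "state" is simply the list of inputs pushed so far.
stack = StackRNN(cell=lambda state, x: state + [x], initial_state=[])
stack.push("to")
stack.push("boston")      # summary: ['to', 'boston']
stack.clear()             # OUT: summary is [] again, old states are kept
```

Keeping every computed state makes pop and clear constant-time pointer moves, which is what allows the stack to be cleared after each OUT action without recomputation.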

Output Representation
We use a conventional LSTM to represent the output. When the action OUT is taken, the representation of the buffer will be added directly to the right-most position of the LSTM. Because the words in the output are a continuous subsequence of the final output sentence with disfluencies removed, the LSTM representation can be seen as a pseudo language model and thus has the ability to keep the generated sentence grammatical, which is very important for disfluency detection.

Token Embeddings
We use four vectors to represent each input token: a learned word embedding $w$; a fixed word embedding $\tilde{w}$; a learned POS-tag embedding $p$; and a hand-crafted feature representation $d$. The four vectors are concatenated, transformed by a matrix $V$, and fed to a rectified layer to learn a feature combination.
Following previous work, we extract two types of hand-crafted discrete features (as shown in Table 3) for each token in a sentence, and incorporate them into our neural networks by translating them into a 0-1 vector $d$. The dimension of $d$ is 78, which equals the number of discrete features. For a token $x_t$, $d_i$ fires if $x_t$ matches the i-th pattern of the feature templates. The duplicate features indicate whether $x_t$ has a duplicated word/POS-tag within a certain distance. The similarity features indicate whether the surface string of $x_t$ resembles its surrounding words.
Table 3 (discrete feature templates; $w_i$ is the i-th word and $p_i$ its POS tag):
duplicate features
• Duplicate($w_i$, $w_{i+k}$), $-15 \le k \le +15$ and $k \ne 0$: if $w_i$ equals $w_{i+k}$, the value is 1, otherwise 0
• Duplicate($p_i$, $p_{i+k}$), $-15 \le k \le +15$ and $k \ne 0$: if $p_i$ equals $p_{i+k}$, the value is 1, otherwise 0
• Duplicate($w_i w_{i+1}$, $w_{i+k} w_{i+k+1}$), $-4 \le k \le +4$ and $k \ne 0$: if $w_i w_{i+1}$ equals $w_{i+k} w_{i+k+1}$, the value is 1, otherwise 0
• Duplicate($p_i p_{i+1}$, $p_{i+k} p_{i+k+1}$), $-4 \le k \le +4$ and $k \ne 0$: if $p_i p_{i+1}$ equals $p_{i+k} p_{i+k+1}$, the value is 1, otherwise 0
similarity features
• fuzzyMatch($w_i$, $w_{i+k}$), $k \in \{-1, +1\}$: similarity = 2 * num_same_letters / (len($w_i$) + len($w_{i+k}$)); if similarity > 0.8, the value is 1, otherwise 0
To directly compare with transition-based parsing methods (Honnibal and Johnson, 2014; Wu et al., 2015), we also use the subcorpus of PARSED/MRG/SWBD. Following the experiment settings in Charniak and Johnson (2001), the training subcorpus contains directories 2 and 3 in PARSED/MRG/SWBD, and directory 4 is split into test, development sets and others. Following Honnibal and Johnson (2014), we lower-case the text and remove all punctuation and partial words (footnote 2). We also discard the 'um' and 'uh' tokens and merge 'you know' and 'i mean' into single tokens. Automatic POS tags generated by pocket crf (Qian and Liu, 2013) are used as POS tags in our experiments.
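Returning to the feature templates of Table 3, the 78-dimensional vector $d$ can be sketched as follows. Two points are assumptions: positions outside the sentence are treated as non-matching, and "num same letters" is interpreted as the size of the multiset overlap between the two words' characters. The function names and the POS tags in the example are hypothetical.

```python
from collections import Counter


def offsets(max_dist):
    # All k with -max_dist <= k <= +max_dist and k != 0.
    return [k for k in range(-max_dist, max_dist + 1) if k != 0]


def duplicate_features(seq, i, max_dist):
    # 1 if seq[i] equals seq[i+k], else 0 (also 0 when i+k is out of range).
    return [1 if 0 <= i + k < len(seq) and seq[i] == seq[i + k] else 0
            for k in offsets(max_dist)]


def bigram_duplicate_features(seq, i, max_dist):
    # 1 if the bigram starting at i equals the bigram starting at i+k, else 0.
    feats = []
    for k in offsets(max_dist):
        j = i + k
        in_range = i + 1 < len(seq) and 0 <= j and j + 1 < len(seq)
        feats.append(1 if in_range and seq[i:i + 2] == seq[j:j + 2] else 0)
    return feats


def fuzzy_match(a, b, threshold=0.8):
    # similarity = 2 * num_same_letters / (len(a) + len(b)); assumed multiset overlap.
    if not a or not b:
        return 0
    same = sum((Counter(a) & Counter(b)).values())
    return 1 if 2 * same / (len(a) + len(b)) > threshold else 0


def token_features(words, tags, i):
    d = (duplicate_features(words, i, 15) + duplicate_features(tags, i, 15)
         + bigram_duplicate_features(words, i, 4)
         + bigram_duplicate_features(tags, i, 4))
    for k in (-1, 1):
        j = i + k
        d.append(fuzzy_match(words[i], words[j]) if 0 <= j < len(words) else 0)
    return d


words = "want a flight to boston to denver".split()
tags = ["VB", "DT", "NN", "TO", "NNP", "TO", "NNP"]   # hypothetical POS tags
d = token_features(words, tags, 3)                     # features for the first "to"
```

The template counts add up to 30 + 30 + 8 + 8 + 2 = 78, matching the stated dimension of $d$; for the first "to", the word-duplicate feature at offset k = +2 fires because "to" occurs again two positions later.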
For the Chinese experiments, we collect 25k spoken sentences from meeting minutes, which are transcribed using the iflyrec toolkit (footnote 3), and annotate them with disfluency annotations only, according to the guideline proposed by Meteer et al. (1995).

Performance On English Switchboard
We build two baseline systems using CRF and Bi-LSTM, respectively. The hand-crafted discrete features of the CRF follow those in Ferguson et al. (2015). For the Bi-LSTM model, the token embedding is the same as in our transition-based method. Table 4 shows the results of our model on both the development and test sets. Beam search improves the F-score from 87.1% to 87.5%, which is consistent with the finding of Buckman et al. (2016) on an LSTM parser (improvements of about 0.3 point). Scheduled sampling achieves the same improvement as beam search. Because of its high training speed, we conduct subsequent experiments based on scheduled sampling.
We compare our transition-based neural model to five top performing systems. Our model outperforms the state-of-the-art, achieving an 87.5% F-score as shown in Table 5. It achieves a 2.4 point improvement over UBT (Wu et al., 2015), which is the best syntax-based method for disfluency detection. The best performance by linear statistical sequence labeling methods is the semi-CRF method of Ferguson et al. (2015), achieving an 85.4% F-score by leveraging prosodic features. Our model obtains a 2.1 point improvement compared to this. Our model also achieves a 0.8 point improvement over the neural attention-based model, which regards disfluency detection as a sequence-to-sequence problem. We attribute the success to the strong ability to learn global chunk-level features and to the good state representation, such as the stack LSTM.
Table 5 (partial; P / R / F1 in %):
(Zayats et al., 2016): 91.8 / 80.6 / 85.9
semi-CRF (Ferguson et al., 2015): 90.0 / 81.2 / 85.4
UBT (Wu et al., 2015): 90 ...

Result On DPS Corpus
As described in section 3.1, to directly compare with the transition-based parsing methods (Honnibal and Johnson, 2014; Wu et al., 2015), we only use MRG files, which are fewer than the DPS files. In fact, many methods, such as Qian and Liu (2013), have used all the DPS files as training data. We are curious about the performance of our system using all the DPS files. Following the experimental settings of Johnson and Charniak (2004), the corpus is split as follows: main training consisting of all sw[23]*.dps files, development training consisting of all sw4[5-9]*.dps files and test consisting of all sw4[0-1]*.mrg files. Table 6 shows the results on the DPS files. The comparison includes the system of Qian and Liu (2013), which uses the same data set and pre-processing. Our model achieves an 88.1% F-score by using more training data, a 0.6 point improvement over the system trained on MRG files. The performance is far better than that of the sequence labeling methods that use DPS files for training.
Table 7 shows the results of Chinese disfluency detection. Our model obtains a 2.4 point improvement compared with the baseline Bi-LSTM model and a 5.3 point improvement compared with the baseline CRF model. The performance on Chinese is much lower than that on English. Apart from the smaller training set, the main reason is that the proportion of repair-type disfluencies is much higher.

Ablation Tests
As described in section 3.1, the state representation has four components. We explicitly compare the impact of the different parts. As shown in Table 8, the F-score decreases most heavily without the stack, which indicates that capturing chunk-level information is necessary for disfluency detection and that our model captures it effectively. The results also show that the output, which can be seen as a pseudo language model, has an important influence on model performance. The results further show that the history of actions represented in action is also useful for prediction at each step. The F-score decreases ...
Footnote 4: The toolkit is available at https://code.google.com/p/disfluency-detection/downloads.

Repetitions vs Non-repetitions
Repetition disfluencies are easier to detect, and even some simple hand-crafted features can handle them well. Other types of reparandums, such as repairs, are more complex (Zayats et al., 2016; Ostendorf and Hahn, 2013). To better understand model performance, we evaluate our model's ability to detect repetition vs. non-repetition (other) reparandums. The results are shown in Table 9. All three models achieve high scores on repetition reparandums. Our transition-based model is much better at predicting non-repetitions than CRF and Bi-LSTM. We conjecture that our transition-based structure can capture more of the reparandum/repair "rough copy" similarities by learning representations of both chunks and global state.

Related Work
Common approaches treat disfluency detection as a sequence labeling problem, where each word in the sentence is assigned a label (Georgila, 2009; Qian and Liu, 2013). These methods achieve good performance, but are not powerful enough to capture complicated disfluencies with longer spans or distances. Another drawback is that they cannot exploit chunk-level features. There is also work that applies recurrent neural networks (RNNs), which can in theory capture dependencies of any length, to the disfluency detection problem (Zayats et al., 2016; Hough and Schlangen, 2015). The RNN method treats sequence tagging as classification on each input token. Hence, it also cannot exploit chunk-level features. Some work regards disfluency detection as a sequence-to-sequence problem and proposes a neural attention-based model for it. The attention-based model can capture a global representation of the input sentence by using an RNN when encoding. It can capture long-range dependencies well and achieves good performance, but is also not powerful enough to capture chunk-level information. To capture chunk-level information, Ferguson et al. (2015) use semi-CRF for disfluency detection and report improved results. However, semi-CRF models are still limited, because the Markov assumption restricts them to local chunk information when decoding.
Many syntax-based approaches (Lease and Johnson, 2006; Rasooli and Tetreault, 2013; Honnibal and Johnson, 2014; Wu et al., 2015) have been proposed that jointly perform dependency parsing and disfluency detection. The main advantage of joint models is that they can capture long-range dependencies of disfluencies. However, they require that the training data contain both syntax trees and disfluency annotations, which reduces the practicality of the algorithm. Their performance also relies heavily on manual feature engineering.
The transition-based framework has been widely exploited in a number of other NLP tasks, including syntactic parsing (Zhang and Nivre, 2011; Zhu et al., 2013), information extraction (Li and Ji, 2014) and joint syntactic models (Zhang et al., 2014). Recently, deep learning methods have been widely used in many natural language processing tasks, such as named entity recognition (Lample et al., 2016), zero pronoun resolution (Yin et al., 2017) and word segmentation (Zhang et al., 2016). The effectiveness of neural features has also been studied for this framework (Watanabe and Sumita, 2015; Andor et al., 2016). We apply the transition-based neural framework to disfluency detection, which to our knowledge has not been investigated before.

Conclusion
We introduced a transition-based model for disfluency detection that does not use any syntax information and learns representations of both chunks and global contexts. Experiments showed that our model achieves state-of-the-art F-scores on both the commonly used English Switchboard test set and an in-house annotated Chinese data set.