Disfluency Detection using Auto-Correlational Neural Networks

In recent years, the natural language processing community has moved away from task-specific feature engineering, i.e., researchers discovering ad-hoc feature representations for various tasks, in favor of general-purpose methods that learn the input representation by themselves. However, state-of-the-art approaches to disfluency detection in spontaneous speech transcripts currently still depend on an array of hand-crafted features, and other representations derived from the output of pre-existing systems such as language models or dependency parsers. As an alternative, this paper proposes a simple yet effective model for automatic disfluency detection, called an auto-correlational neural network (ACNN). The model uses a convolutional neural network (CNN) and augments it with a new auto-correlation operator at the lowest layer that can capture the kinds of “rough copy” dependencies that are characteristic of repair disfluencies in speech. In experiments, the ACNN model outperforms the baseline CNN on a disfluency detection task with a 5% increase in f-score, which is close to the previous best result on this task.


Introduction
Disfluency informally refers to any interruption in the normal flow of speech, including false starts, corrections, repetitions and filled pauses. Shriberg (1994) defines three distinct parts of a speech disfluency, referred to as the reparandum, interregnum and repair. As illustrated in Example 1, the reparandum to Boston is the part of the utterance that is replaced, the interregnum uh, I mean (which consists of a filled pause uh and a discourse marker I mean) is an optional part of a disfluent structure, and the repair to Denver replaces the reparandum. The fluent version is obtained by removing the reparandum and interregnum words, although disfluency detection models mainly deal with identifying and removing reparanda. The reason is that filled pauses and discourse markers belong to a closed set of words and phrases and are trivial to detect.
(1) to Boston (reparandum) uh, I mean (interregnum) to Denver (repair)
In disfluent structures, the repair (e.g., to Denver) frequently seems to be a "rough copy" of the reparandum (e.g., to Boston). In other words, they incorporate the same or very similar words in roughly the same word order. In the Switchboard training set (Godfrey and Holliman, 1993), over 60% of the words in the reparandum are exact copies of words in the repair. Thus, this similarity is strong evidence of a disfluency that can help the model detect reparanda (Charniak and Johnson, 2001). As a result, models which are able to detect "rough copies" are likely to perform well on this task. Currently, state-of-the-art approaches to disfluency detection depend heavily on hand-crafted pattern match features, specifically designed to find such "rough copies" (Zayats et al., 2016; Jamshid Lou and Johnson, 2017). In contrast to many other sequence tagging tasks (Plank et al., 2016; Yu et al., 2017), "vanilla" convolutional neural networks (CNNs) and long short-term memory (LSTM) models operating only on words or characters are surprisingly poor at disfluency detection (Zayats et al., 2016).
As such, the task of disfluency detection sits in opposition to the ongoing trend in NLP away from task-specific feature engineering (i.e., researchers discovering ad-hoc feature representations for various tasks) in favor of general-purpose methods that learn the input representation by themselves (Collobert and Weston, 2008).
In this paper, we hypothesize that LSTMs and CNNs cannot easily learn "rough copy" dependencies. We address this problem in the context of a CNN by introducing a novel auto-correlation operator. The resulting model, called an auto-correlational neural network (ACNN), is a generalization of a CNN with an auto-correlation operator at the lowest layer. Evaluating the ACNN in the context of disfluency detection, we show that introducing the auto-correlation operator increases f-score by 5% over a baseline CNN. Furthermore, the ACNN, operating only on word inputs, achieves results which are competitive with much more complex approaches relying on hand-crafted features and outputs from pre-existing systems such as language models or dependency parsers. In summary, the main contributions of this paper are:
• We introduce the auto-correlational neural network (ACNN), a generalization of a CNN incorporating auto-correlation operations,
• In the context of disfluency detection, we show that the ACNN captures important properties of speech repairs including "rough copy" dependencies, and
• Using the ACNN, we achieve competitive results for disfluency detection without relying on any hand-crafted features or other representations derived from the output of pre-existing systems.

Related Work
Approaches to the disfluency detection task fall into three main categories: noisy channel models, parsing-based approaches and sequence tagging approaches. Noisy channel models (NCMs) use complex tree adjoining grammar (TAG) (Shieber and Schabes, 1990) based channel models to find the "rough copy" dependencies between words. The channel model uses the similarity between the reparandum and the repair to allocate higher probabilities to exact copy reparandum words. Using the probabilities of the TAG channel model and a bigram language model (LM) derived from training data, the NCM generates n-best disfluency analyses for each sentence at test time. The analyses are then reranked using a language model which is sensitive to the global properties of the sentence, such as a syntactic parser based LM. Some works have shown that the performance of the baseline NCM improves when the n-best analyses are rescored with external n-gram LMs (Zwarts and Johnson, 2011) and deep learning LMs (Jamshid Lou and Johnson, 2017) trained on large speech and non-speech corpora, and the LM scores are fed along with other features (i.e., pattern match and NCM ones) into a MaxEnt reranker, although this creates complex runtime dependencies.
Parsing-based approaches detect disfluencies while simultaneously identifying the syntactic structure of the sentence. Typically, this is achieved by augmenting a transition-based dependency parser with a new action to detect and remove the disfluent parts of the sentence and their dependencies from the stack (Rasooli and Tetreault, 2013; Honnibal and Johnson, 2014; Yoshikawa et al., 2016). Joint parsing and disfluency detection can compare favorably to pipelined approaches, but requires large annotated treebanks containing both disfluent and syntactic structures for training.
Our proposed approach, based on an auto-correlational neural network (ACNN), belongs to the class of sequence tagging approaches. These approaches use classification techniques such as conditional random fields (Liu et al., 2006; Ostendorf and Hahn, 2013; Zayats et al., 2014; Ferguson et al., 2015), hidden Markov models (Liu et al., 2006; Schuler et al., 2010) and deep learning based models (Hough and Schlangen, 2015; Zayats et al., 2016) to label individual words as fluent or disfluent. In much of the previous work on sequence tagging approaches, improved performance has been gained by proposing increasingly complicated labeling schemes. In this case, a model with begin-inside-outside (BIO) style states, which labels words as being inside or outside of the edit region, is usually used as the baseline sequence tagging model. Then, to provide pattern-matching lexical cues for repetition and correction disfluencies, the baseline state space is extended with new explicit repair states that cover the words in the repair region in addition to the edit region (Ostendorf and Hahn, 2013; Zayats et al., 2014, 2016). A model which uses such a labeling scheme may generate illegal label sequences at test time. As a solution, integer linear programming (ILP) constraints are applied to the output of the classifier to avoid inconsistencies between neighboring labels (Georgila, 2009; Georgila et al., 2010; Zayats et al., 2016). This contrasts with our more straightforward approach, which directly labels words as being fluent or disfluent, and does not require any post-processing or annotation modifications.
The most similar work to ours is recent work by Zayats et al. (2016) that investigated the performance of a bidirectional long short-term memory network (BLSTM) for disfluency detection. Zayats et al. (2016) reported that a BLSTM operating only on words underperformed the same model augmented with hand-crafted pattern match features and POS tags by 7% in terms of f-score. In addition to lexically grounded features, some works incorporate prosodic information extracted from speech (Kahn et al., 2005; Ferguson et al., 2015; Tran et al., 2018). In this work, our primary motivation is to rectify the architectural limitations that prevent deep neural networks from automatically learning appropriate features from words alone. Therefore, our proposed model eschews manually engineered features and other representations derived from dependency parsers, language models or tree adjoining grammar transducers that are used to find "rough copy" dependencies. Instead, we aim to capture these kinds of dependencies automatically.

Convolutional and Auto-Correlational Networks
In this section, we introduce our proposed auto-correlation operator and the resulting auto-correlational neural network (ACNN) which is the focus of this work. A convolutional or auto-correlational network computes a series of h feature representations $X^{(0)}, X^{(1)}, \ldots, X^{(h)}$, where $X^{(0)}$ is the input representation, $X^{(h)}$ is the final (output) representation, and each non-input representation $X^{(k)}$ for $k > 0$ is computed from the preceding representation $X^{(k-1)}$ using a convolution or auto-correlation operation followed by an element-wise non-linear function.
Restricting our focus to convolutions in one dimension, as used in the context of language processing, each representation $X^{(k)}$ is a matrix of size $(n, m_k)$, where $n$ is the number of words in the input and $m_k$ is the feature dimension of representation $k$, or equivalently it can be viewed as a sequence of $n$ row vectors $X^{(k)} = (x^{(k)}_1, \ldots, x^{(k)}_n)$, where $x^{(k)}_t$ is the row vector of length $m_k$ that represents the $t$th word at level $k$.
Consistent with the second interpretation, the input representation $X^{(0)}$ is a sequence of word embeddings, where $m_0$ is the length of the embedding vector and $x^{(0)}_t$ is the word embedding for the $t$th word.
Each non-input representation $X^{(k)}$, $k > 0$, is formed by column-wise stacking the output of one or more convolution or auto-correlation operations applied to the preceding representation, and then applying an element-wise non-linear function. Formally, we define:
$$X^{(k)} = N^{(k)}\left(\left[F^{(k,1)}(X^{(k-1)});\; F^{(k,2)}(X^{(k-1)});\; \ldots;\; F^{(k,m_k)}(X^{(k-1)})\right]\right)$$
where $F^{(k,u)}$ is the $u$th operator applied at layer $k$, and $N^{(k)}$ is the non-linear operation applied at layer $k$. Each operator $F^{(k,u)}$ (either convolution or auto-correlation) is a function from $X^{(k-1)}$, which is a matrix of size $(n, m_{k-1})$, to a vector of length $n$. A network that employs only convolution operators is a convolutional neural network (CNN). We call a network that utilizes a mixture of convolution and auto-correlation operators an auto-correlational neural network (ACNN). In our networks, the non-linear operation $N^{(k)}$ is always element-wise ReLU, except for the last layer, which uses a softmax non-linearity.
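As a rough illustration of the layer definition above, the following pure-Python sketch stacks the outputs of several operators column-wise and applies an element-wise ReLU. The names `relu`, `apply_layer`, `row_sum` and `neg_first` are hypothetical and chosen only for this example; the toy operators stand in for learned convolution or auto-correlation operators and are not taken from the paper.

```python
from typing import Callable, List

Matrix = List[List[float]]                   # n rows (words), m columns (features)
Operator = Callable[[Matrix], List[float]]   # maps an (n, m) matrix to a length-n vector


def relu(value: float) -> float:
    """Element-wise ReLU non-linearity."""
    return max(0.0, value)


def apply_layer(x_prev: Matrix, operators: List[Operator]) -> Matrix:
    """Compute one non-input representation from the preceding one.

    Each operator yields one output column of length n. The columns are
    stacked side by side and passed through ReLU, giving a matrix of
    size (n, len(operators)).
    """
    columns = [op(x_prev) for op in operators]
    n = len(x_prev)
    return [[relu(col[t]) for col in columns] for t in range(n)]


# Toy stand-in operators (hypothetical): row sum and negated first feature.
row_sum: Operator = lambda x: [sum(row) for row in x]
neg_first: Operator = lambda x: [-row[0] for row in x]

x0 = [[1.0, -2.0], [0.5, 0.5], [-1.0, 3.0]]   # n = 3 words, m_0 = 2
x1 = apply_layer(x0, [row_sum, neg_first])    # n = 3 words, m_1 = 2
```

The number of operators applied at a layer fixes the feature dimension of that layer's output, which is why each operator only needs to return a vector of length n.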

Convolution Operator
A one-dimensional convolution operation maps an input matrix $X = (x_1, \ldots, x_n)$, where each $x_t$ is a row vector of length $m$, to an output vector $y$ of length $n$. The convolution operation is defined by a convolutional kernel $A$, which is applied to a window of words to produce a new output representation, and kernel width parameters $\ell$ and $r$, which define the number of words to the left and right of the target word included in the convolutional window. For example, assuming appropriate input padding where necessary, element $y_t$ in the output vector $y$ is computed as:
$$y_t = A \cdot X_{i:j} + b_t$$
where $A$ is a learned convolutional kernel of dimension $(\ell + r, m)$, $X_{i:j}$ is the sub-matrix formed by selecting rows $i$ to $j$ from matrix $X$, $\cdot$ is the dot product (a sum over element-wise multiplications), $i, j$ are given by $i = t - \ell$ and $j = t + r$, indicating the left and right extremities of the convolutional window affecting element $y_t$, $\ell > 0$ is the left kernel width, $r > 0$ is the right kernel width, and $b$ is a learned bias vector of dimension $n$.
Figure 1: Cosine similarity between word embedding vectors learned by the ACNN model for the sentence "I know they use that I mean they sell those" (with disfluent words highlighted). In the figure, darker shades denote higher cosine values. "Rough copies" are clearly indicated by darkly shaded diagonals, which can be detected by our proposed auto-correlation operator.
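The convolution operator can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `conv1d` and its arguments are hypothetical, out-of-range positions are treated as zero padding, and the window is taken to be the rows from t − left to t + right inclusive, so the kernel here has left + right + 1 rows (the text states the kernel dimension slightly differently, which would correspond to a half-open window).

```python
from typing import List

Matrix = List[List[float]]


def conv1d(x: Matrix, kernel: Matrix, bias: List[float],
           left: int, right: int) -> List[float]:
    """Compute y_t = A . X_{i:j} + b_t for every position t.

    Assumptions: the window spans rows t - left .. t + right inclusive,
    rows outside the sentence are zero (padding), and the kernel has
    left + right + 1 rows of length m.
    """
    n, m = len(x), len(x[0])
    assert len(kernel) == left + right + 1 and len(bias) == n
    y = []
    for t in range(n):
        total = bias[t]
        for offset in range(-left, right + 1):
            pos = t + offset
            if 0 <= pos < n:  # zero padding: positions outside the input add nothing
                k_row = kernel[offset + left]
                total += sum(x[pos][d] * k_row[d] for d in range(m))
        y.append(total)
    return y


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # n = 3 words, m = 2 features
a = [[1.0, 1.0], [2.0, 0.0], [0.0, 3.0]]   # rows: left neighbour, target, right neighbour
y = conv1d(x, a, bias=[0.0, 0.0, 0.0], left=1, right=1)
```

Each output element is a weighted sum of individual word features, so the operation is linear in the input and does not compare words with one another directly.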

Auto-Correlation Operator
The auto-correlation operator is a generalisation of the convolution operator:
$$y_t = A \cdot X_{i:j} + B \cdot \hat{X}_{i:j,i:j} + b_t \qquad (4)$$
where $y_t$, $A$, $X$, $b$, $i$ and $j$ are as in the convolution operator, and $\hat{X}$ is a tensor of size $(n, n, m)$ such that each vector $\hat{X}_{i,j,:}$ is given by
$$\hat{X}_{i,j,:} = f(x_i, x_j)$$
where $f(u, v)$ is a binary operation on vectors, such as the Hadamard or element-wise product (i.e., $f(u, v) = u \circ v$), $\hat{X}_{i:j,i:j}$ is the sub-tensor formed by selecting indices $i$ to $j$ from the first two dimensions of tensor $\hat{X}$, and $B$ is a learned convolutional kernel of dimension $(\ell + r, \ell + r, m)$, with $\ell$ and $r$ the left and right kernel widths.
Unlike convolution operations, which are linear, the auto-correlation operator introduces second-order interaction terms through the tensor $\hat{X}$ (since it multiplies the vector representations for each pair of input words). This naturally encodes the similarity between input words when applied at level $k = 1$ (or the co-activations of multiple CNN features, if applied at higher levels). As illustrated in Figure 1, blocks of similar words are indicative of "rough copies". We provide an illustration of the auto-correlation operation in Figure 2.
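The second-order term of the auto-correlation operator can be sketched as follows. This is a simplified illustration under assumptions rather than the authors' code: `hadamard`, `autocorrelate` and `autocorr_term` are hypothetical names, the binary function is fixed to the Hadamard product, positions outside the sentence are zero padded, and the window is taken as inclusive, so the kernel has (left + right + 1) entries along each of its first two dimensions. Only the kernel-B term is computed; the full output would add the ordinary convolution term and the bias.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]
Tensor3 = List[Matrix]


def hadamard(u: Vector, v: Vector) -> Vector:
    """Binary function f(u, v): element-wise product of two vectors."""
    return [a * b for a, b in zip(u, v)]


def autocorrelate(x: Matrix) -> Tensor3:
    """Build the (n, n, m) tensor whose entry [i][j] is f(x_i, x_j)."""
    return [[hadamard(x_i, x_j) for x_j in x] for x_i in x]


def autocorr_term(x_hat: Tensor3, kernel_b: Tensor3, t: int,
                  left: int, right: int) -> float:
    """Dot product of kernel B with the sub-tensor around position t.

    Assumption: the window covers indices t - left .. t + right inclusive
    in both of the first two dimensions, with zero padding outside.
    """
    n = len(x_hat)
    total = 0.0
    for p in range(-left, right + 1):
        for q in range(-left, right + 1):
            i, j = t + p, t + q
            if 0 <= i < n and 0 <= j < n:
                weights = kernel_b[p + left][q + left]
                total += sum(w * v for w, v in zip(weights, x_hat[i][j]))
    return total


# Toy input: words 0 and 2 share an embedding, mimicking a repeated word.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
x_hat = autocorrelate(x)
kernel_b = [[[1.0, 1.0] for _ in range(3)] for _ in range(3)]  # all-ones kernel
terms = [autocorr_term(x_hat, kernel_b, t, left=1, right=1) for t in range(3)]
```

In this toy input the off-diagonal entry for the word pair (0, 2) is non-zero while the entry for (0, 1) is zero, which is the kind of pairwise similarity signal a learned kernel B can pick up.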

Switchboard Dataset
We evaluate the proposed ACNN model for disfluency detection on the Switchboard corpus of conversational speech (Godfrey and Holliman, 1993). Switchboard is the largest available corpus ($1.2 \times 10^6$ tokens) where disfluencies are annotated according to Shriberg's (1994) scheme:
[ reparandum + {interregnum} repair ]
where (+) is the interruption point marking the end of the reparandum and {} indicate an optional interregnum. We collapse this annotation to a binary classification scheme in which reparanda are labeled as disfluent and all other words as fluent. We disregard interregnum words as they are trivial to detect, as discussed in Section 1. Following Charniak and Johnson (2001), we split the Switchboard corpus into training, dev and test sets as follows: training data consists of all sw[23]*.dff files, dev data consists of all sw4[5-9]*.dff files and test data consists of all sw4[0-1]*.dff files. We lower-case all text and remove all partial words and punctuation from the training data to make our evaluation both harder and more realistic. Partial words are strong indicators of disfluency; however, speech recognition models never generate them in their outputs.
Figure 2: ACNN overview for labeling the target word "boston". A patch of words is fed into an auto-correlational layer. At inset bottom, the given patch of words is convolved with 2D kernels A of different sizes. At inset top, an auto-correlated tensor of size $(n, n, m_0)$ is constructed by comparing each input vector $u = x_t$ with the input vector $v = x_{t'}$ using a binary function $f(u, v)$. The auto-correlated tensor is convolved with 3D kernels B of different sizes. Each kernel group A and B outputs a matrix of size $(n, m_1)$ (here, we depict only the row vector relating to the target word "boston"). These outputs are added element-wise to produce the feature representation that is passed to further convolutional layers, followed by a softmax layer. "E" = disfluent, " " = fluent and $m_0$ = embedding size.
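The data split and the collapse to binary labels described above can be sketched with the standard library. Everything here is illustrative and rests on assumptions: the file names are made up, the bracketed token stream is a simplified stand-in for the actual Switchboard annotation files, the label "_" for fluent words is a placeholder, and `assign_split` and `binary_labels` are hypothetical helper names.

```python
from fnmatch import fnmatchcase
from typing import List, Optional, Tuple

# Shell-style patterns quoted in the text for each split.
SPLIT_PATTERNS = [
    ("train", "sw[23]*.dff"),
    ("dev", "sw4[5-9]*.dff"),
    ("test", "sw4[0-1]*.dff"),
]


def assign_split(filename: str) -> Optional[str]:
    """Return the split a file belongs to, or None if no pattern matches."""
    for split, pattern in SPLIT_PATTERNS:
        if fnmatchcase(filename, pattern):
            return split
    return None


def binary_labels(tokens: List[str]) -> List[Tuple[str, str]]:
    """Collapse '[ reparandum + { interregnum } repair ]' markup to E / _ labels.

    Assumes well-formed, whitespace-separated markup. Words inside a
    reparandum are labeled "E" (disfluent); all other words, including
    interregnum and repair words, are labeled "_" (fluent).
    """
    open_brackets: List[bool] = []   # True while a bracket is still in its reparandum
    in_interregnum = False
    labelled = []
    for tok in tokens:
        if tok == "[":
            open_brackets.append(True)
        elif tok == "+":
            open_brackets[-1] = False    # interruption point: reparandum ends
        elif tok == "]":
            open_brackets.pop()
        elif tok == "{":
            in_interregnum = True
        elif tok == "}":
            in_interregnum = False
        else:
            disfluent = any(open_brackets) and not in_interregnum
            labelled.append((tok.lower(), "E" if disfluent else "_"))
    return labelled


example = "[ to Boston + { uh I mean } to Denver ]".split()
labels = binary_labels(example)
```

Keeping a stack of open brackets lets the same routine handle nested disfluencies, where a word counts as disfluent if any enclosing structure is still in its reparandum.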

ACNN and CNN Baseline Models
We investigate two neural network models for disfluency detection: our proposed auto-correlational neural network (ACNN) and a convolutional neural network (CNN) baseline. The CNN baseline contains three convolutional operators (layers), followed by a width-1 convolution and a softmax output layer (to label each input word as either fluent or disfluent). The ACNN has the same general architecture as the baseline, except that we have replaced the first convolutional operator with an auto-correlation operator, as illustrated in Figure 2.
To ensure that equal effort was applied to the hyperparameter optimization of both models, we use randomized search (Bergstra and Bengio, 2012) to tune the optimization and architecture parameters separately for each model on the dev set, and to find an optimal stopping point for training. This results in different dimensions for each model. As indicated by Table 1, the resulting ACNN configuration has far fewer kernels at each layer than the CNN. However, as the auto-correlation kernels contain an additional dimension, both models have a similar number of parameters overall. Therefore, both models should have similar learning capacity except for their architectural differences (which is what we wish to investigate). Finally, we note that the resulting maximum right kernel width $r_1$ in the auto-correlational layer is 6. As illustrated in Figure 3, this is sufficient to capture almost all the "rough copies" in the Switchboard dataset (but could be increased for other datasets). For the ACNN, we considered a range of possible binary functions $f(u, v)$ to compare the input vector $u = x_t$ with the input vector $v = x_{t'}$ in the auto-correlational layer. However, in initial experiments we found that the Hadamard or element-wise product (i.e., $f(u, v) = u \circ v$) achieved the best results. We also considered concatenating the outputs of kernels A and B in Equation 4, but we found that element-wise addition produced slightly better results on the dev set.

Implementation Details
In both models, we use ReLU for the non-linear operation, all stride sizes are one word and there are no pooling operations. We randomly initialize the word embeddings and all weights of the model from a uniform distribution. The bias terms are initialized to 1. To reduce overfitting, we apply dropout (Srivastava et al., 2014) to the input word embeddings and $L_2$ regularization to the weights of the width-1 convolutional layer. For parameter optimization, we use the Adam optimizer (Kingma and Ba, 2014) with a mini-batch size of 25 and an initial learning rate of 0.001.
Figure 3: Distribution over the number of words in between the reparandum and the interregnum in the Switchboard training set (indicating the distance between "rough copies").

Results
As in previous work, we evaluate our model using precision, recall and f-score, where true positives are the words in the edit region (i.e., the reparandum words). As Charniak and Johnson (2001) observed, only 6% of words in the Switchboard corpus are disfluent, so accuracy is not a good measure of system performance. F-score, on the other hand, focuses more on detecting "edited" words, so it is more appropriate for highly skewed data.
Table 2 compares the dev set performance of the ACNN model against our baseline CNN, as well as the LSTM and BLSTM models proposed by Zayats et al. (2016) operating only on word inputs (i.e., without any disfluency pattern-match features). Our baseline CNN outperforms both the LSTM and the BLSTM, while the ACNN model clearly outperforms the baseline CNN, with a further 5% increase in f-score. In particular, the ACNN noticeably improves recall without degrading precision.
To further investigate the differences between the two CNN-based models, we randomly select 100 sentences containing disfluencies from the Switchboard dev set and categorize them according to Shriberg's (1994) typology of speech repair disfluencies. Repetitions are repairs where the reparandum and repair portions of the disfluency are identical, while corrections are repairs where the reparandum and repair differ (so corrections are much harder to detect). Restarts are cases where the speaker abandons a sentence prefix and starts a fresh sentence. As Table 3 shows, the ACNN model is better at detecting repetition and correction disfluencies than the CNN, especially for the more challenging correction disfluencies. On the other hand, the ACNN is no better than the baseline at detecting restarts, probably because a restart typically does not involve a rough copy dependency. Fortunately, restarts are much rarer than repetition and correction disfluencies.
We also repeated the analysis of Zayats et al. (2014) on the dev data, so we can compare our models to their extended BLSTM model with a 17-state CRF output and hand-crafted features, including partial-word and POS tag features that enable it to capture some "rough copy" dependencies. As expected, the ACNN outperforms both the CNN and the extended BLSTM model, especially in the "Other" category that involves the non-repetition dependencies.
Finally, we compare the ACNN model to state-of-the-art methods from the literature, evaluated on the Switchboard test set. Table 5 shows that the ACNN model is competitive with recent models from the literature. The three models that score more highly than the ACNN all rely on hand-crafted features, additional information sources such as partial-word features (which would not be available in a realistic ASR application), or external resources such as dependency parsers and language models. The ACNN, on the other hand, only uses whole-word inputs and learns the "rough copy" dependencies between words without requiring any manual feature engineering.
Table 5: Comparison of the ACNN model to the state-of-the-art methods on the Switchboard test set. The other models listed have used richer inputs and/or rely on the output of other systems, as well as pattern match features, as indicated by the following symbols: dependency parser, † hand-crafted constraints/rules, prosodic cues, tree adjoining grammar transducer, 1 refined/external language models and ⊗ partial words. P = precision, R = recall and F = f-score.
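The word-level evaluation described at the start of this section can be sketched as below. This is an illustrative sketch under assumptions: the f-score is taken to be the balanced F1 measure, disfluent words carry the label "E" and fluent words "_" (placeholder labels), and `precision_recall_f` is a hypothetical helper name. The toy labels are invented and are not results from the paper.

```python
from typing import List, Tuple


def precision_recall_f(gold: List[str], predicted: List[str],
                       positive: str = "E") -> Tuple[float, float, float]:
    """Precision, recall and balanced f-score for the disfluent class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def accuracy(gold: List[str], predicted: List[str]) -> float:
    """Fraction of words whose predicted label matches the gold label."""
    return sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)


# Invented toy data: 2 disfluent words out of 10.
gold = ["E", "E"] + ["_"] * 8
all_fluent = ["_"] * 10              # a system that never predicts a disfluency
partial = ["E", "_"] + ["_"] * 8     # a system that finds one of the two

acc_trivial = accuracy(gold, all_fluent)
f_trivial = precision_recall_f(gold, all_fluent)[2]
p_partial, r_partial, f_partial = precision_recall_f(gold, partial)
```

On skewed data a system that labels every word fluent still reaches high accuracy while its f-score is zero, which is why precision, recall and f-score on the disfluent class are the more informative measures.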

Qualitative Analysis
We conduct an error analysis on the Switchboard dev set to characterize the disfluencies that the ACNN model can capture and those which are difficult for the model to detect. In the following examples, the highlighted words indicate ground truth disfluency labels and the underlined ones are the ACNN predictions.
1. But if you let them yeah if you let them in a million at a time it wouldn't make that you know it wouldn't make that big a bulge in the population
2. They're handy uh they they come in handy at the most unusual times
3. My mechanics loved it because it was an old it was a sixty-five buick
4. Well I I I think we did I think we did learn some lessons that we weren't uh we weren't prepared for
5. Uh I have never even I have never even looked at one closely
6. But uh when I was when my kids were young I was teaching at a university
7. She said she'll never put her child in a in a in a in a in a preschool
8. Well I think they're at they're they've come a long way
9. I I like a I saw the the the the tapes that were that were run of marion berry's drug bust
10. But I know that in some I know in a lot of rural areas they're not that good
According to examples 1-10, the ACNN detects repetition (e.g. 1, 5) and correction disfluencies (e.g. 3, 6, 10). It also captures complex structures where there are multiple or nested disfluencies (e.g. 2, 8) or stutter-like repetitions (e.g. 4, 7, 9). In some cases where repetitions are fluent, the model has incorrectly detected the first occurrence of the word as a disfluency (e.g. 13, 14, 15, 19). Moreover, when there is a long distance between reparandum and repair words (e.g. 11, 12), the model usually fails to detect the reparanda. In some sentences, the model is also unable to detect disfluent words, which results in ungrammatical sentences (e.g. 16, 17, 18, 20). In these examples, the undetected disfluencies "the", "did", "at" and "two the" cause the residual sentence to be ungrammatical.
We also discuss the types of disfluency captured by the ACNN model, but not by the baseline CNN. In the following examples, the ACNN predictions (underlined words) are the same as the ground truth disfluency labels (highlighted words). The bolded words indicate the CNN prediction of disfluencies.
21. Uh well I actually my dad's my dad's almost ninety
22. Not a man not a repair man but just a friend
23. we're from a county we're from the county they marched in
24. Now let's now we're done
25. And they've most of them have been pretty good
26. I do as far as uh as far as uh as far as immigration as a whole goes
27. No need to use this to play around with this space stuff anymore
28. We couldn't survive in a in a juror in a trial system without a jury
29. You stay within your uh within your means
30. So we're we're part we're actually part of MIT
The ACNN model generally performs better at detecting "rough copies", which are an important indicator of repetition (e.g. 21, 29), correction (e.g. 22, 23, 24, 25, 27), and stutter-like (e.g. 26, 28, 30) disfluencies.

Conclusion
This paper presents a simple new model for disfluency detection in spontaneous speech transcripts. It relies on a new auto-correlational kernel that is designed to detect the "rough copy" dependencies that are characteristic of speech disfluencies, and combines it with conventional convolutional kernels to form an auto-correlational neural network (ACNN). We show experimentally that the ACNN model improves over a CNN baseline on the disfluency detection task, indicating that the auto-correlational kernel can in fact detect the rough copy dependencies between words in disfluencies. The addition of the auto-correlational kernel permits a fairly conventional architecture to achieve near state-of-the-art results without complex hand-crafted features or external information sources.
We expect that the performance of the ACNN model can be further improved in future by using more complex similarity functions and by incorporating similar kinds of external information (e.g. prosody) used in other disfluency models. In future work, we also intend to investigate other applications of the auto-correlational kernel. The auto-correlational layer is a generic neural network layer, so it can be used as a component of other architectures, such as RNNs. It might also be useful in very different applications such as image processing.