Joint, Incremental Disfluency Detection and Utterance Segmentation from Speech

We present the joint task of incremental disfluency detection and utterance segmentation and a simple deep learning system which performs it on transcripts and ASR results. We show how the constraints of the two tasks interact. Our joint-task system outperforms the equivalent individual task systems, provides competitive results and is suitable for future use in conversation agents in the psychiatric domain.


Introduction
Artificial conversational systems promise to be a valuable addition to the existing set of psychiatric health care delivery solutions. As artificial systems, they can ensure that interview protocols are followed, and, perhaps surprisingly, due to being "just a computer", even seem to increase their interlocutors' willingness to disclose (Lucas et al., 2014). Interactions with such conversational agents have been shown to contain interpretable markers of psychological distress, such as rate of filled pauses, speaking rate, and various temporal, utterance and turn-related interactional features (DeVault et al., 2013). Filled pauses and disfluencies in general have also been shown to predict outcomes to psychiatric treatment (Howes et al., 2012;McCabe et al., 2013).
Currently, these systems are only used to elicit material that is then analysed offline. For offline analysis of transcripts with gold standard utterance segmentation, much work exists on detecting disfluencies (Johnson and Charniak, 2004; Qian and Liu, 2013; Honnibal and Johnson, 2014). To enable more cost-effective analysis, however, and possibly even let the interaction script itself depend on an analysis hypothesis, it would be better to work directly off the speech signal, and online (incrementally). This is what we explore in this paper, presenting and evaluating a model that works with online, incremental speech recognition output to detect disfluencies at various levels of granularity.
As a second contribution, we combine incremental disfluency detection with another lower-level task that is important for responsive conversational systems, namely the detection of turn-taking opportunities through detection of utterance boundaries. (See, for example, Schlangen and Skantze (2011) for arguments for incremental processing and responsive turn-taking in conversational systems, and (Schlangen, 2006; Atterer et al., 2008; Raux, 2008; Manuvinakurike et al., 2016, inter alia) for examples of incremental utterance segmentation.) Besides both being relevant for interactive health assessment systems, these tasks also have an inherent connection: the approach typically used for turn-end detection is simply waiting for a silence of a certain duration, and hence is misled by intra-turn silent disfluencies. Similarly, without gold standard segmentation, disfluent restarts and repairs may be predicted at fluent utterance boundaries. We hence conjecture that the tasks can profitably be done jointly.

Related Work
As a separate task, there has been extensive work on utterance segmentation. Cuendet (2006) reports a NIST-SU utterance segmentation error rate of 48.50 on the Switchboard corpus, using a combination of lexical and acoustic features. Ang et al. (2005) report NIST-SU scores in the region of 34.35-45.92 on the ICSI Meeting Corpus. Martínez-Hinarejos et al. (2015) report state-of-the-art dialogue act segmentation results on Switchboard at 23.0 NIST-SU; however, this is not on the level of full dialogues, but on pre-segmented turn stretches. For the equivalent task of sentence boundary detection, Seeker et al. (2016) report an F-score of 0.7665 on Switchboard data, using a joint dependency parsing framework, and Xu et al. (2014) implement a deep learning architecture and report a 0.810 F-score and a 35.9 NIST-SU error rate on broadcast news speech, using prosodic and lexical features, with a DNN for the prosodic features combined with a CRF classifier. However, whether this scales to spontaneous speech and to the challenges of incrementality explained here is yet to be tested.
Strongly incremental approaches to the task are rare; however, Atterer et al. (2008) achieve a word-by-word F-score of 0.511 on predicting whether the current word is the end of the utterance (dialogue act) on Switchboard, and 0.559 when using ground-truth syntactic information indicating sentence structure.
Disfluency detection on pre-segmented utterances in the Switchboard corpus has also had a lot of attention, and has also reached high performance (Johnson and Charniak, 2004;Georgila, 2009;Qian and Liu, 2013;Honnibal and Johnson, 2014). On detection on Switchboard transcripts, Honnibal and Johnson (2014) achieve 0.841 reparandum word accuracy using a joint dependency parsing approach, and  in a strongly incrementally operating system without look-ahead achieve 0.779, using a pipeline of classifiers and language model features. The potentially live approaches tend to use acoustic information (Moniz et al., 2015) and do not perform on a comparable level to their transcription-based task analogues, nor achieve the same fine-grained analysis of disfluency structure, which is often needed to identify the disfluency type and compute its meaning.
Live incremental approaches to both tasks have not been able to benefit from reliable ASR hypotheses arriving in a timely manner until recently. Now the arrival of improved performance, in terms of low Word Error Rate (WER) and better live performance properties is making this possible (Baumann et al., 2016). In this paper we define a joint task in a live setting. After defining the task we present a simple deep learning system which simultaneously detects disfluencies and predicts up-coming utterance boundaries from incremental word hypotheses and derived information.
The Tasks: Real-time disfluency prediction and utterance segmentation

Incremental disfluency detection
Disfluencies, in their fullest form as speech repairs, are typically assumed to have a tripartite reparandum-interregnum-repair structure (terms originally proposed by Shriberg (1994)), as exhibited by the following example.
If reparandum and repair are absent, the disfluency reduces to an isolated edit term. In the example given here, the interregnum is filled by a marked, lexicalised edit term, but more phrasal terms such as I mean and you know can also occur.
The task of disfluency detection then is to recognise these elements and their structure, and the task of incremental disfluency detection adds the challenge of doing this in real-time, from "left-to-right". In that latter setting, detection runs into the same problem as a human processor of such an utterance: Only by the time the interregnum is encountered, or possibly even only when the repair is seen, does it become clear that earlier material now is to be considered as "to be repaired" (reparandum). Hence, the task cannot be set up as a straightforward sequence labelling task where the tags "reparandum", "interregnum" and "repair" are distributed left-to-right over words as indicated in the example above; in this example, it would unfairly require the prediction that "likes" is going to be repaired, at a point when no evidence is available for making it.
We follow Hough and Schlangen (2015) and use a tag set that encodes the reparandum start only at a time when it can be guessed, namely at the onset of the actual repair. This is illustrated in Figure 1 in the "disfluency (complex)" row. Here, the word at the repair onset, "to", gets tagged as repair onset (rpS) and, at the same time, as repairing material beginning 5 tokens in the past (-5, yielding the complex label rpS-5). Additionally, we annotate all repair words (as rpMid, if the word is neither first nor last word of the repair, and together with the disfluency type, if it is the final word; here, the label is rpESub for substitution), 2 editing terms (e) and fluent material (f) as well. From the complex tag set, we can reconstruct the disfluency structure as in (1) in a strongly incremental fashion. We also define a reduced tag set (shown in Figure 1 as "disfluency (simple)") that only tags fluent words, editing terms, and the repair onset.

Figure 1: An utterance with the traditional repair disfluency and segmentation annotation in-line (Shriberg, 1994; Meteer et al., 1995) and our incrementally-oriented tag schemes
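To make the rpS-N convention concrete, the following minimal Python sketch recovers reparandum positions from an incrementally produced tag sequence. It is an illustration under stated assumptions, not the authors' implementation: the tag strings ("f", "e", "rpS-N"), the function name `reparandum_spans`, and the example utterance are hypothetical, and repair-end labels such as rpESub are left out for brevity.

```python
import re

# Assumed tag format: "rpS-N" marks a repair onset whose reparandum starts
# N tokens in the past; "e" marks edit terms; "f" marks fluent words.
RPS = re.compile(r"^rpS-(\d+)")


def reparandum_spans(tags):
    """Return (onset_index, reparandum_indices) for every repair onset tag."""
    spans = []
    for i, tag in enumerate(tags):
        match = RPS.match(tag)
        if match is None:
            continue
        start = i - int(match.group(1))
        if start < 0:
            raise ValueError(f"tag {tag!r} at position {i} points before the start")
        # Edit terms between reparandum and repair form the interregnum,
        # so they are excluded from the reparandum itself.
        spans.append((i, [j for j in range(start, i) if tags[j] != "e"]))
    return spans


# Hypothetical utterance; only the word "likes" is taken from the text.
words = ["John", "likes", "uh", "loves", "Mary"]
tags = ["f", "f", "e", "rpS-2", "f"]
for onset, reparandum in reparandum_spans(tags):
    print(words[onset], "repairs", [words[j] for j in reparandum])
```

Note that the reparandum word is still tagged as fluent when it is first consumed; it is only reinterpreted once the onset tag arrives, which is the point of the scheme.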

Incremental utterance segmentation
We formulate incremental utterance segmentation as the judgement in real time as to when the current utterance is going to end, and so, like Schlangen (2006) and Atterer et al. (2008), we move from a purely reactive approach, signalled by silence, to prediction. To make prediction possible we use four tags for classifying stretches of acoustic data (which can be the time spans of force-aligned gold standard words, or the word hypothesis timings provided by an ASR), which are equivalent to a BIES (Beginning, Inside, End and Single) scheme for utterances; see Table 1. The tag set allows evidence from the prior context of the word (the acoustic and linguistic information preceding the word) to be used to predict whether this word continues a current utterance (the - prefix) or starts a new one (the . prefix), and also permits the online prediction of whether the next word (or segment) will continue the current utterance (the - suffix) or the current word ends the utterance (the . suffix). Utterance boundary predictions can then be derived whenever -w. or .w. is predicted (i.e. "will end utterance"). The tag set is summarized in Table 1 and an example is in Fig. 1, row "utterance segmentation".
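The prefix/suffix scheme can be sketched in a few lines of Python. This is a minimal illustration that assumes gold utterances are given as lists of words; the function names and the example dialogue are hypothetical.

```python
def segmentation_tags(utterances):
    """Map a list of utterances (lists of words) to one tag per word."""
    tags = []
    for utterance in utterances:
        last = len(utterance) - 1
        for i in range(len(utterance)):
            prefix = "." if i == 0 else "-"      # starts vs. continues an utterance
            suffix = "." if i == last else "-"   # ends vs. will be continued
            tags.append(prefix + "w" + suffix)
    return tags


def boundary_indices(tags):
    """Indices of words predicted to end an utterance (-w. or .w.)."""
    return [i for i, tag in enumerate(tags) if tag.endswith(".")]


dialogue = [["yeah"], ["i", "like", "it"]]
tags = segmentation_tags(dialogue)
print(tags)
print(boundary_indices(tags))
```

A single-word utterance receives .w., which corresponds to the "Single" case of a BIES scheme.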

Defining the joint task
Studying the two phenomena in natural dialogue corpora, for example in terms of rich transcription mark-up in the SWBD annotation manual (Meteer et al., 1995), there are several constraints:
C1 Repair onsets cannot begin an utterance (by definition of first position repairs needing a preceding reparandum).
C2 Repairs must be completed within the utterance in which they begin.
C3 Utterances can be interrupted or abandoned, but these are different to within-dialogue-act repairs.

2 The other repair type is delete rpEDel. Verbatim reparandum-repair repetitions are subsumed by rpESub.
Given these constraints, we can generate a joint tag set as a subset of the cross product of both tag schemes. The utterance segmentation tags in Table 1 are combined with the simple strongly incremental disfluency tags described in §3.1. The joint set for both the simple and complex tasks is in Fig. 2, where 1 indicates the tag is in the set and 0 otherwise. In the simple task, there are 10 tags. The joint set for the full task including disfluency structure detection has 53 possible tags (rather than the full cross product, which would be 92). In reality, in the training corpus, only 43 of these possible combinations were found, so this constituted our tag set in practice. See Fig. 1 (bottom 2 rows) for example sequences.
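As an illustration of how the constraints prune the cross product, the sketch below builds a joint tag set for the simple task. It assumes that the simple disfluency tags are exactly f, e and rpS and that only C1 removes combinations at this level; the "@" separator and the function name are hypothetical. Under these assumptions the result has 10 tags, which agrees with the count given above, although the exact membership shown in Fig. 2 is not reproduced here.

```python
from itertools import product

SEGMENTATION_TAGS = ["-w-", "-w.", ".w-", ".w."]
SIMPLE_DISFLUENCY_TAGS = ["f", "e", "rpS"]  # assumed inventory of the simple task


def joint_simple_tags():
    """Cross product of both tag sets, minus combinations ruled out by C1."""
    joint = []
    for disfluency, segmentation in product(SIMPLE_DISFLUENCY_TAGS, SEGMENTATION_TAGS):
        # C1: a repair onset cannot be the first word of an utterance.
        if disfluency == "rpS" and segmentation.startswith("."):
            continue
        joint.append(f"{disfluency}@{segmentation}")  # hypothetical label format
    return joint


tags = joint_simple_tags()
print(len(tags), tags)
```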

Research questions
Given the formulation of the joint task, we would like to ask the following questions of scalable, automatic approaches to it:
To address these questions we use a combination of a deep learning architecture for sequence labelling and incremental decoding techniques which we will now explain.

Table 1: The utterance segmentation tag set
-w-  a word which continues the current utterance and whose following word will continue it
-w.  a word which continues the current utterance and is the last word of it
.w-  a word which is the beginning of an utterance and whose following word will continue it
.w.  a word constituting an entire utterance

LSTMs and Incremental Decoding for Live Prediction
Our systems consist of deep learning sequence models which consume incoming words and use word embeddings in addition to other features to predict disfluency and utterance segmentation labels for each word, in a strictly left-to-right, word-by-word fashion. We also use word timings as input to a separate classifier whose output is combined with that of the deep learning architecture in an incremental decoder. See Fig. 3 for the overall architecture. We describe the elements of the system below.

Input Features
In our systems we use the following input features:
• Words in a backwards window from the most recent word (transcribed or ASR)
• Durations of words in the current window (from transcription or ASR word timings)
• Part-Of-Speech (POS) tags for words in current window (either reference, or from an incremental CRF tagger)
For incremental ASR, we use the free trial version of IBM's Watson Speech-To-Text service. 3 The service provides good quality ASR on noisy data: on our selected heldout data on Switchboard, the average WER is 26.5%. The Watson service, crucially for our task, does not filter out hesitation markers or disfluencies, which is rare for current web-based services (Baumann et al., 2016). The service also outputs results incrementally, so silence-based end-pointing is not used. The service also returns word timings, which upon manual inspection were close enough to the reference timings to use as features in the live version of our system. In this paper, the durations are not features in the principal RNN but in an orthogonal logistic regression classifier; see §4.3.

3 https://www.ibm.com/watson/developercloud/speech-to-text.html
For POS-tagging, we use the NLTK CRF tagger, which when trained on our training data and tested on our heldout data achieves 0.915 accuracy on all tags, which was sufficiently good for our purposes. Crucially, for the label UH, which is important evidence for an edit term, it achieves an F-score of 0.959.

Architectures
We use two well-studied deep learning architectures for our sequence labelling task: the Elman Recurrent Neural Network (RNN) and the Long Short-Term Memory (LSTM) RNN. Architecturally, the RNNs here approximately reproduce the set-up described in (Mesnil et al., 2013; Hough and Schlangen, 2015).
Input and word embeddings Following Mesnil et al. (2013), we use 1-of-N, or 'one-hot', vectors as our raw input to the network, which provide unique indices to dense vectors in a word embedding matrix. The initial word embeddings were obtained from Switchboard data using the Python implementation of word2vec in gensim, using a skip-gram context model. The training data for the initial embeddings was cleaned of disfluencies, effecting a 'clean' language model (Johnson and Charniak, 2004). These embeddings were then further updated as part of the objective function during the task-specific training itself. Instead of single word/POS inputs we use context windows which, like n-gram language models, are backwards from the current word. The internal representation of context windows of length n in the network is created through the ordered concatenation of the n corresponding word embedding vectors of size 50, resulting in an input to the network in $\mathbb{R}^{50n}$. We use n = 2 in our experiments here.
RNN architecture and activation functions In addition to the embedding layer, we use a (recurrent) hidden layer of 50 nodes and an output layer the size of our training tag sets (43 nodes for the complex task and 10 nodes for the simple task). The standard Elman RNN dynamics in the recurrent hidden layer at time t is as in (3), where the hidden layer h(t) is calculated as the sigmoid function (2) of the sum of the weight matrix U applied via dot product to the current input vector x(t) and the weight matrix V applied via dot product to the stored previous value of the hidden layer at time t−1, i.e. h(t−1).

$$\sigma(z) = \frac{1}{1 + e^{-z}} \qquad (2)$$

$$h(t) = \sigma\big(U x(t) + V h(t-1)\big) \qquad (3)$$
We use the standard softmax function for the node activation function of the output layer.
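The recurrence and output layer described above can be sketched in plain Python as follows. This is a minimal, dependency-free illustration rather than the Theano implementation used in the experiments: the output weight matrix name `W_out` and the toy dimensions are assumptions, and bias terms are omitted because the text does not mention them.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def elman_step(x, h_prev, U, V, W_out):
    """One time step: h(t) = sigmoid(U x(t) + V h(t-1)), y(t) = softmax(W_out h(t))."""
    pre_activation = [a + b for a, b in zip(matvec(U, x), matvec(V, h_prev))]
    h = [sigmoid(z) for z in pre_activation]
    y = softmax(matvec(W_out, h))
    return h, y


# Toy sizes: 3-dimensional input, 2 hidden nodes, 3 output tags.
U = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]
V = [[0.5, 0.0], [-0.3, 0.2]]
W_out = [[0.2, -0.1], [0.0, 0.3], [-0.4, 0.1]]
h = [0.0, 0.0]
for x in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]):
    h, y = elman_step(x, h, U, V, W_out)  # only h needs to be kept between words
    print([round(p, 3) for p in y])
```

Carrying only `h` from one call to the next mirrors the compact decoding state needed for live operation.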
At decoding time, the compression of the context into the hidden layer allows us to save the current state of the decode live compactly from ASR results as they become available to the network. In order to integrate the new incoming words and POS tags with the history, it is only necessary to store the current hidden layer activation h(t) (and the output softmax layer too, if that is being used by another process), and wait for new information to the input layer.
LSTM unit In our LSTM, we include recurrent LSTM units that use the input x(t), the hidden state activation h(t−1), and the memory cell activation c(t−1) to compute the hidden state activation h(t) at time t. Each unit uses a combination of a memory cell c and three types of gates, input gate i, forget gate f, and output gate o, to decide whether the input needs to be remembered (input gate), when the previous memory needs to be retained (forget gate), and when the memory content needs to be output (output gate). For each time step t the cell activations c(t) and h(t) are computed by the below steps, where ⊙ is element-wise multiplication.
While many more weight matrices need to be learned (all the W, U and V subscripted matrices), as with the standard RNN, at decoding time it is efficient to store the current decoding state in a compact way, as it is only necessary to save the activation of the memory cell c(t) and the hidden layer h(t) to save the current state of the network. See Fig. 3 for the schematic overall disfluency detection architecture for the LSTM.
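For orientation, one common formulation of such a unit is given below in LaTeX. It is a sketch under assumptions rather than a reproduction of the original equations: it takes the W matrices to act on the input, the U matrices on the previous hidden state and the V matrices on the memory cell (peephole connections), and it omits bias terms.

```latex
\begin{align}
i(t) &= \sigma\big(W_i\, x(t) + U_i\, h(t-1) + V_i\, c(t-1)\big) \\
f(t) &= \sigma\big(W_f\, x(t) + U_f\, h(t-1) + V_f\, c(t-1)\big) \\
c(t) &= f(t) \odot c(t-1) + i(t) \odot \tanh\big(W_c\, x(t) + U_c\, h(t-1)\big) \\
o(t) &= \sigma\big(W_o\, x(t) + U_o\, h(t-1) + V_o\, c(t)\big) \\
h(t) &= o(t) \odot \tanh\big(c(t)\big)
\end{align}
```

In this formulation only c(t) and h(t) are carried over to the next time step.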
Learning: error function and parameter update As is common for RNNs (De Mulder et al., 2015) we use negative log likelihood loss (NLL) as a cost function and use stochastic gradient descent over the parameters, including the embedding vectors, to minimize it. We use a batch size of 9 words, consistent with our repair tag scheme. Both networks use a learning rate of 0.005 and L2 regularisation on the parameters to be learned with a weight of 0.0001.

Incremental decoding and timing driven classifier
Markov model For decoding optimization we use Viterbi decoding on the sequence of softmax output distributions from the network in the spirit of (Guo et al., 2014). We use a Markov model which is hand-crafted to ensure legal tag sequences are outputted for the given tag set. In our joint task, this permits 'late' detection of an utterance boundary if the probability for a -w. and following .w-or .w. tag on their own are not the arg max, but their combined probability permits the best sequence. Similarly, in the complex task, repairs where evidence of a repair end tag is strong, but the repair onset tag was not the arg max can be detected at the repair end. From an incremental perspective, in Viterbi decoding there is the danger of output 'jitter'. We investigate how different output representations have different effects on output prediction stability in our evaluation.
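The effect of constrained decoding on 'late' boundary detection can be illustrated with a small Viterbi sketch over per-word softmax distributions. The transition constraints encoded in `legal` are an assumption that covers only the segmentation tags (a word after an utterance-final word must start an utterance, and the first word of a dialogue starts one); the actual hand-crafted Markov model is not reproduced here, and all probabilities are invented for the example.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    return math.log(p) if p > 0 else NEG_INF


def legal(prev, cur):
    """Assumed constraints: '.'-suffix must be followed by '.'-prefix and vice versa."""
    if prev is None:
        return cur.startswith(".")
    return prev.endswith(".") == cur.startswith(".")


def viterbi(distributions, tags, is_legal=legal):
    """Best legal tag sequence given one {tag: probability} dict per word."""
    if not distributions:
        return []
    best = {t: (safe_log(distributions[0][t]) if is_legal(None, t) else NEG_INF)
            for t in tags}
    backpointers = []
    for dist in distributions[1:]:
        new_best, pointers = {}, {}
        for cur in tags:
            candidates = [(best[p], p) for p in tags if is_legal(p, cur)]
            score, prev = max(candidates) if candidates else (NEG_INF, None)
            new_best[cur] = score + safe_log(dist[cur])
            pointers[cur] = prev
        best = new_best
        backpointers.append(pointers)
    last = max(tags, key=lambda t: best[t])
    if best[last] == NEG_INF:
        raise ValueError("no legal tag sequence has non-zero probability")
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


TAGS = ["-w-", "-w.", ".w-", ".w."]
softmax_outputs = [
    {".w-": 0.97, ".w.": 0.01, "-w-": 0.01, "-w.": 0.01},
    {"-w-": 0.55, "-w.": 0.43, ".w-": 0.01, ".w.": 0.01},  # boundary not the arg max
    {".w-": 0.90, "-w-": 0.08, "-w.": 0.01, ".w.": 0.01},  # strong new-utterance evidence
]
print(viterbi(softmax_outputs, TAGS))
```

In this example the per-word arg max would give the illegal sequence .w-, -w-, .w-, whereas the decoder revises the second word to -w. once the third word arrives. This revision is also the source of the output jitter mentioned above.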
Timing driven classifier As an addition to the decoding step, we experimented with an independent timing driven classifier which consumes the durations of the last three words and outputs a probability that this is a fluent continuation or the beginning of a new utterance. We train a logistic regression classifier on our training data. Combining this two-class probability with the probability of the relevant utterance segmentation tags in decoding boosted performance considerably.
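How the two probability sources are combined is not specified in detail here, so the following Python sketch shows just one plausible scheme: a logistic regression over the last three word durations yields the probability of a new utterance, which then rescales the tag distribution before decoding. The weights, the bias, the multiplicative combination and all function names are assumptions made for illustration.

```python
import math


def new_utterance_probability(durations, weights, bias):
    """Logistic regression over the durations (in seconds) of the last three words."""
    z = bias + sum(w * d for w, d in zip(weights, durations))
    return 1.0 / (1.0 + math.exp(-z))


def rescale_with_timing(tag_distribution, p_new):
    """Assumed combination: weight '.'-prefixed tags by p_new, others by 1 - p_new."""
    scaled = {
        tag: p * (p_new if tag.startswith(".") else 1.0 - p_new)
        for tag, p in tag_distribution.items()
    }
    total = sum(scaled.values())
    return {tag: value / total for tag, value in scaled.items()}


# Hypothetical parameters; in practice they would be fitted on training data.
weights, bias = [0.2, 0.5, 3.0], -2.0
p_new = new_utterance_probability([0.21, 0.18, 0.95], weights, bias)
network_output = {"-w-": 0.45, "-w.": 0.05, ".w-": 0.40, ".w.": 0.10}
print(round(p_new, 3), rescale_with_timing(network_output, p_new))
```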

Evaluation Criteria
Accuracy On transcripts, we calculate repair onset detection accuracy F rpS, where applicable reparandum word accuracy F rm, and F1 accuracy for edit term words F e, which includes interregna. For utterance segmentation we also use word-level F1 scores for utterance boundaries (end-of-utterance words) F uttSeg. Carrying out the task live, on speech recognition hypotheses which may well not be identical to the annotated gold-standard transcription, requires the use of time-based metrics of local accuracy in a time window (i.e. within this time window, has a disfluency/utterance boundary been detected, even if not on the identical words?); we therefore calculate the F1 score over 10 second windows of each speaker's channel. While this windowing can give higher scores on certain phenomena, it tends to follow the word-level F-score so is a good time-based indicator of accuracy.
For utterance segmentation, for comparison to previous work we also use NIST-SU error rate (Ang et al., 2005). NIST-SU is the ratio of the number of incorrect utterance boundary hypotheses (missed boundaries and false positives) made by a system to the number of reference boundaries.
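Both boundary metrics can be computed from sets of word indices, as in this minimal Python sketch. The function name and the example indices are hypothetical, and NIST-SU is expressed as a percentage, in line with the figures quoted in this paper.

```python
def boundary_scores(reference, hypothesis):
    """Word-level F1 and NIST-SU error rate for utterance-final word indices."""
    reference, hypothesis = set(reference), set(hypothesis)
    hits = len(reference & hypothesis)
    misses = len(reference - hypothesis)
    false_alarms = len(hypothesis - reference)
    precision = hits / len(hypothesis) if hypothesis else 0.0
    recall = hits / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if hits else 0.0
    # NIST-SU: all boundary errors divided by the number of reference boundaries.
    nist_su = 100.0 * (misses + false_alarms) / len(reference) if reference else float("nan")
    return {"precision": precision, "recall": recall, "f1": f1, "nist_su": nist_su}


scores = boundary_scores(reference={3, 7, 12, 15}, hypothesis={3, 7, 10})
print(scores)
```

Because false positives count towards the numerator, NIST-SU can exceed 100 for a system that over-segments heavily.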
For a more coarse-grained metric which includes both tasks, which is useful in our target domain of interactions in a clinical context (Howes et al., 2014), we look at the rpS : UttSeg ratio per speaker correlation (Pearson's R). This gives us the best approximation as to how good the system is at estimating repair rate per utterance.
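The per-speaker correlation can be sketched as follows, assuming that for every speaker we have counts of repair onsets and utterance boundaries from both the reference and the system. The counts below are invented for illustration and the function names are hypothetical; significance testing is not included.

```python
import math


def repair_rates(counts):
    """counts: list of (n_repair_onsets, n_utterance_boundaries), one pair per speaker."""
    return [n_rps / n_utt for n_rps, n_utt in counts]


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / (spread_x * spread_y)


# Hypothetical per-speaker counts (reference vs. system output).
reference_counts = [(12, 100), (30, 120), (5, 80), (18, 90)]
system_counts = [(10, 95), (27, 125), (6, 70), (15, 92)]
print(pearson_r(repair_rates(reference_counts), repair_rates(system_counts)))
```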
Timeliness and diachronic metrics Crucial for the live nature of the system, we measure latency (i.e. how close to the actual time a disfluency or boundary event occurred has one been predicted?) and also stability of output over time (i.e. how much does the output change?). For latency we use Zwarts et al. (2010)'s time-to-detection metric: the average distance (in numbers of words) consumed before first detection of gold standard repairs from the repair onset word, TD rpS . 5 We generalize this measure to the other tags of interest to give TD e and TD uttSeg and also, particularly crucially for the ASR results, report the metrics in terms of time in seconds. 6 For stability, incorporating insights from the evaluation of incremental processors by Baumann et al. (2011), we measure the edit overhead (EO) of the output labels-this is the percentage of unnecessary edits (insertions and deletions) required to get to the final labels outputted by the system.
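The two incremental metrics can be approximated with the Python sketch below. It assumes that the system's complete label sequence is recorded after every consumed word, that a changed label counts as one deletion plus one insertion, and that the necessary edits are those that add each final label exactly once. These conventions, the function names and the example outputs are assumptions and may differ in detail from the evaluation code used for the reported results.

```python
def time_to_detection(outputs, gold_index, tag):
    """Words consumed after the gold word before it is first labelled with `tag`.

    outputs[k] is the full label sequence after consuming word k (0-indexed).
    Returns None if the tag is never assigned to that word.
    """
    for step, labels in enumerate(outputs):
        if gold_index < len(labels) and labels[gold_index] == tag:
            return step - gold_index
    return None


def count_edits(previous, current):
    """Insertions plus deletions needed to turn one label sequence into the next."""
    edits = 0
    for j in range(max(len(previous), len(current))):
        old = previous[j] if j < len(previous) else None
        new = current[j] if j < len(current) else None
        if old != new:
            edits += (old is not None) + (new is not None)
    return edits


def edit_overhead(outputs):
    """Percentage of edits that were not needed to reach the final labels."""
    total, previous = 0, []
    for current in outputs:
        total += count_edits(previous, current)
        previous = current
    necessary = len(outputs[-1]) if outputs else 0
    return 100.0 * (total - necessary) / total if total else 0.0


# The second label is first output as "e" and later revised to "f".
jittery = [["f"], ["f", "e"], ["f", "f", "f"]]
print(edit_overhead(jittery), time_to_detection(jittery, 1, "f"))
```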

Experimental Set-up
We experiment with the 2 joint output representations in Fig. 1 and implement an RNN and LSTM using Theano (Bergstra et al., 2010) as an extension to the code in Mesnil et al. (2013). We also run the 3 individual versions of the tasks with the tag sets shown in Fig. 1 for comparison. We also train a word timings driven classifier which adds information to the decoding step as explained above to try to answer Q2. 7
Data We train on transcripts and test on both transcripts and ASR hypotheses. We use the standard Switchboard training data for disfluency detection (all conversation numbers beginning sw2*,sw3* in the Penn Treebank III release: 100k utterances, 650K words) and use the standard heldout data (PTB III files sw4[5-9]*: 6.4K utterances, 49K words) as our validation set. We test on the standard test data (PTB III files 4[0-1]*) with punctuation removed from all files. 8 For the ASR results evaluation, we only select a subset of the heldout and test data whereby both channels achieved below 40% WER to ensure good separation; this left us with 18 dialogues in the validation data and 17 dialogues for testing. We train all RNNs for a maximum of 50 epochs, halting training earlier if there is no improvement on the best F rm score on the transcript validation set after 10 epochs.

Table 3: Comparison of the joint vs. individual task performances

5 Our measure is in fact one word earlier by default than Zwarts et al. (2010) as we take detection after the end of the repair onset word as the earliest possible detection point.
6 These measures only apply to repairs and utterance boundaries detected correctly.
7 All experiments are reproducible. The code can be downloaded at https://github.com/dsg-bielefeld/deep_disfluency
8 We include partial words as these may in theory become available from the ASR in the live setting.
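The stopping rule used in training (at most 50 epochs, stopping once the validation score has not improved for 10 epochs) can be sketched as a small loop. The callback `run_epoch`, which is assumed to train for one epoch and return the validation score, is hypothetical, as is the exact tie-handling.

```python
def train_with_early_stopping(run_epoch, max_epochs=50, patience=10):
    """Run epochs until max_epochs or until `patience` epochs pass without improvement.

    run_epoch(epoch) is assumed to train one epoch and return a validation score.
    Returns (best_epoch, best_score).
    """
    best_score, best_epoch = float("-inf"), 0
    for epoch in range(1, max_epochs + 1):
        score = run_epoch(epoch)
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement on the best score for `patience` epochs
    return best_epoch, best_score


# Toy validation curve that stops improving after the third epoch.
epochs_run = []


def fake_epoch(epoch):
    epochs_run.append(epoch)
    return min(epoch, 3) / 10.0


print(train_with_early_stopping(fake_epoch), len(epochs_run))
```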

Results and Discussion
Our dialogue-final accuracy results are in Table 2. On transcripts, our best per-word F rpS reaches 0.720 and best F e reaches 0.918. For utterance segmentation, per-word accuracy reaches 0.748 and the lowest NIST-SU error rate is 43.64. This is competitive with the 0.767 F-score of Seeker et al. (2016) and outperforms Cuendet (2006) on the Switchboard data. The best rpS : uttSeg correlation per speaker reaches 0.92 (p<0.0001).
In comparison to incremental approaches, we outperform the 0.511 accuracy on end-of-utterance prediction of Atterer et al. (2008). Their work allows no prediction lag in a strictly incremental setting, so is at a disadvantage; however, our result of 0.748 on transcripts is reported alongside an average time to detection of 0.399 words, which suggests that, on average, an uttSeg that is predicted correctly is predicted with practically no latency.
With the exception of one metric, the LSTM outperforms the RNN on transcripts. The systems using the timing model in general outperform those with lexical information only on the utterance segmentation metrics, whilst not having an impact on disfluency detection.
According to the window-based accuracies, on ASR results there is significant degradation in accuracy for repair onsets (best F rpS = 0.557); however, utterance segmentation did not suffer the same loss, with the best system achieving 0.685 accuracy. The rpS : uttSeg Pearson's R correlation per speaker reaches 0.81 (p<0.0001) in a system with otherwise poor performance; the second best achieved was 0.79 (p<0.0001). For disfluency detection, standard approaches use pre-segmented utterances to evaluate performance, so this result is difficult to compare. However, in the simple task, the accuracy of 0.720 for repair onset prediction is respectable (comparable to (Georgila, 2009)), and is useful enough to allow realistic relative repair rates, in line with our motivation. The complex tagging system performs poorly on repairs compared to the literature; however, the lack of segmentation makes this a considerably harder task, in the same way as dialogue act tagging results are lower on unsegmented transcripts (Martínez-Hinarejos et al., 2015). Edit term detection performs very well at 0.918, approaching the state-of-the-art on Switchboard reported at 0.938.
The utility of a joint task As can be seen in Table 3, the overall best performing systems on the individual tasks do not reach the results in any relevant metric of the best performing combined system. The disfluency-only systems were run ignoring all utterance boundary information, which puts this setting at a disadvantage to previous approaches, however it is clear that on unsegmented data our posing of the task jointly is useful.
Incrementality Incrementally, the differences between the architectures were negligible; results for the LSTM are in Table 4. The latency for repair onset detection is very low, with onsets being detected as little as 0.196 seconds after the onset word is finished (or on transcripts largely directly after the word has been consumed, as TTD rpS (word) = 0.003). Utterance boundaries were detected just over a second after the end of the last word of the previous utterance. However, the fact that TTD uttSeg on the word level reaches 0.283 suggests the time-based average is being weighed down by occasional long silences, which could be thresholded in future work. The EO measure of stability is severely affected by jittering ASR hypotheses, but given its worst result is 21.46% this is still a fairly stable incremental system.
Error Analysis To explore the errors being made by the systems, and how the RNN and LSTM may differ in ability, we performed an error analysis on the simple versions with the timing models-see Fig. 4. One can observe a boost in recall for various repair types in the LSTM, where it is performing better on repairs with longer reparanda. Characterizing repetitions as verbatim repeats, substitutions as the other repairs marked with a repair phase, and deletes as those without one, we see the LSTM outperforming the RNN on the rarer types. Whilst the problem is attenuated by the memory facility of the LSTM, our best system still suffers the vanishing gradient problem for predicting longer repairs with reparanda over 3 words long. Also we show in uttSeg detection all systems falter on long distance projections with coordinating conjunctions, which would potentially be dealt with more easily in a parsing framework, or a hierarchical deep learning framework.
We also investigated the uttSeg detection errors and see that the networks are generally not confusing disfluencies with boundaries. However, our best system incorrectly labelled 3.6% of the reference uttSegs as rpS (hence also affecting the precision of the rpS prediction)-upon inspection these were largely abandoned utterances, which according to the constraint C3 we posited above are not marked as disfluencies in the same way intra-utterance repairs are in the reference. Due to the original annotation instructions of (Meteer et al., 1995), these are segmented and not included in the traditional disfluency detection task. However,  intuitively these can be construed as a disfluency type, and in future we will treat them as a special type of uttSeg/disfluency hybrid. As can be seen in Fig. 4 (c) other main sources of error are on coordinating conjunctions (CC) such as 'and' and 'or', nouns with nominative subject marking case like 'I' and 'we' (subj), other proper nouns, variants of 'it' and grounding utterances like 'yeah' and 'okay'. uttSeg detection in both systems achieved high precision but relatively low recall.

Conclusion
We have presented the joint task of incremental utterance segmentation and disfluency detection and shown a simple deep learning system which performs it on transcripts and ASR results. As regards the research questions posed in §3.4, in answer to Q1, we showed that, all else being equal, a deep learning system that performs both tasks jointly improves over equivalent systems doing the individual tasks. In answer to Q2, we showed that word timing information, both from transcripts and ASR results, helps utterance segmentation and the joint task across all settings whilst not aiding disfluency detection on its own. In response to Q3, we achieve a good online accuracy vs. final accuracy trade-off in a live, incremental system; however, we still experience some time delays for utterance segmentation in our most accurate system.
We conclude that our joint-task system for disfluency detection and utterance segmentation sets a new benchmark for the joint task on Switchboard data and, due to its incremental functioning on unsegmented data, including ASR result streams, is suitable for live systems, such as conversation agents in the psychiatric domain. In future work we intend to optimize the inputs to our networks after this exploration, including using raw acoustic features, and to combine the task with language modelling and dialogue act tagging.