Unediting: Detecting Disfluencies Without Careful Transcripts

Speech transcripts often only capture semantic content, omitting disfluencies that can be useful for analyzing social dynamics of a discussion. This work describes steps in building a model that can recover a large fraction of locations where disfluencies were present, by transforming carefully annotated text to match the standard transcription style, introducing a two-stage model for handling different types of disfluencies, and applying semi-supervised learning. Experiments show improvement in disfluency detection on Supreme Court oral arguments, nearly 23% improvement in F1.


Introduction
Many hearings, lectures, news broadcasts and other spoken proceedings are hand-transcribed and made available online for easier searching and increased accessibility. For speed and cost reasons, standard transcription services aim at representing semantic content only; thus, filled pauses (uh, um) and many, though not all, disfluencies (repetitions and self-corrections) are omitted. Careful transcripts represent all the words (and word fragments) spoken, as shown below with disfluent regions underlined.

Careful: It is it is a we submit
Where there used to be um um um uh the decision
Standard: It is, it is, we submit
Where there used to be the decision

These phenomena are quite common in spontaneous speech, even in formal settings such as Supreme Court oral arguments and congressional hearings (Zayats et al., 2014).
While disfluencies may not be important for analyzing the topic of a discussion, the rate and type of disfluencies provide an indication of other factors of interest in spoken language analysis, including cognitive load, emotion, and social cues (Shriberg, 2001). Further, predicting locations of disfluencies in standard transcripts would help to improve time alignments of transcripts to the audio signal, and to provide more useful text data for training language models for speech recognition. Since careful annotation of transcripts with this information is costly, this paper tackles the problem of recovering the disfluencies from clues in the standard orthographic transcripts, or "unediting" the transcripts. Here, unediting is treated as detection of the reparandum of the disfluencies. Following the structural representation of Shriberg (1994), as in:
[ we would + which we would ]
[ would + [ who + who ] wouldn't ]
the task is to detect the words in the brackets preceding the '+', which marks the self-interruption point. Of course, here, some of the words in those regions may not be in the transcript, so location is more important than extent. In addition, some cues used (i.e., filled pauses and word fragments) are not available in standard transcripts.
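To make the bracket notation concrete, the following minimal Python sketch marks which words fall inside a reparandum. It assumes a whitespace-tokenized string in which '[', '+' and ']' appear as separate tokens; the function name label_reparandum is hypothetical and is not part of the system described in this paper.

```python
def label_reparandum(annotated):
    """Return (word, is_reparandum) pairs for a bracket-annotated string.

    Assumes '[', '+' and ']' are whitespace-separated tokens.
    A word counts as reparandum if any enclosing bracket has not
    yet reached its '+' (the self-interruption point).
    """
    stack = []   # one flag per open bracket: True while still before '+'
    labeled = []
    for tok in annotated.split():
        if tok == "[":
            stack.append(True)
        elif tok == "+":
            if not stack:
                raise ValueError("'+' outside of brackets")
            stack[-1] = False
        elif tok == "]":
            if not stack:
                raise ValueError("unbalanced ']'")
            stack.pop()
        else:
            labeled.append((tok, any(stack)))
    if stack:
        raise ValueError("unbalanced '['")
    return labeled


if __name__ == "__main__":
    for word, flag in label_reparandum("[ would + [ who + who ] wouldn't ]"):
        print(word, "REPARANDUM" if flag else "-")
```

In the nested example, the first "would" and the first "who" are marked as reparandum, while the second "who" and "wouldn't" belong to the repair.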
Three developments are combined to address the problem of unediting with the constraint of limited hand-annotated training data in the target domain: oral arguments from the Supreme Court of the United States (SCOTUS) available from the Oyez Project archive (oyez.org). First, we identify mechanisms for transforming the careful transcripts of the Switchboard corpus (Godfrey et al., 1992) to be more similar to the Oyez transcripts. Second, we introduce a multi-stage model that accounts for differences in the rates of repetitions and self-corrections in standard vs. careful transcripts. Lastly, we apply semi-supervised learning to take advantage of the large amount of original Oyez transcripts. The system combining all these techniques, referred to here as UNEDITOR, leads to an improvement in F1 of nearly 23% compared to a baseline of training from the original disfluency-annotated Switchboard corpus.

Related work
This paper builds on prior work using conditional random field (CRF) models (Georgila, 2009; Ostendorf and Hahn, 2013; Zayats et al., 2014). More recent work has shown a benefit from Markov networks (Qian and Liu, 2013; Wang et al., 2014). Since our focus is on the transcription style mismatch, this work adopts the simpler CRF approach, but it can be easily extended to other classification techniques.
In this work, we use only text features. While prosodic features have been shown to be useful (Shriberg, 1999; Kahn et al., 2005; Wang et al., 2014), the fact that the Oyez transcripts do not capture all the words means that forced time alignments are unreliable and the associated prosodic features are too noisy to be useful. Other studies integrate disfluency detection with parsing, e.g., (Charniak and Johnson, 2001; Johnson and Charniak, 2004; Hale et al., 2006; Zwarts et al., 2010; Rasooli and Tetreault, 2013; Honnibal and Johnson, 2014), but parsers trained on standard treebank data sets are not effective on the very long and complex sentences in SCOTUS; parser adaptation is left for future work.
There are a few studies that have investigated disfluency detection using cross-domain training data (Georgila et al., 2010; Ostendorf and Hahn, 2013; Zayats et al., 2014), and many more that have used multi-domain data for other language processing tasks. What is different about the task addressed here is that both the domain (topic and speaking style) and the transcription protocol differ between the target and source domains. There have been some attempts to transform written text to a more conversational style for training language models, e.g., Bulyko et al. (2007) inserted pause fillers and word repetitions, which led to reductions in perplexity though not word error rate. The work here differs in that the transformation is in the reverse direction (removing fillers from conversational text) and punctuation cues are emphasized.

Transforming training data
Here we describe methods for generating training data for use with standard transcripts: i) transferring labels from a small amount of carefully annotated data to corresponding standard transcripts, and ii) transforming the existing Switchboard training set to make it more similar to the target domain.

SCOTUS corpora
The Oyez Project at Chicago-Kent is a multimedia archive containing audio and transcripts of the Supreme Court hearings since 1955. While OYEZ transcripts are consistent with the audio in general, they are not accurate when it comes to disfluencies. We notice that most simple disfluencies such as repetitions have been omitted by OYEZ annotators, while more complex ones are often present and annotators have used the '...' symbol at locations of filled pauses or repetitions. Having those explicit cues indicating interruption points in disfluencies makes it possible to consider recovering the untranscribed disfluencies.
For CAREFUL SCOTUS annotation, we use the data provided by Zayats et al. (2014), which includes seven cases with carefully transcribed audio and hand-annotated disfluencies, with separately marked repetitions. We develop ANNOTATED OYEZ transcripts by transferring disfluency labels for those seven cases from CAREFUL SCOTUS to the corresponding files in OYEZ and dropping the deletion markers. As a result, those transcripts are identical to the original OYEZ transcripts, but in addition contain disfluency annotation derived from CAREFUL SCOTUS.
In order to align the CAREFUL SCOTUS and ORIGINAL OYEZ transcripts, we use a dynamic programming algorithm for sequence alignment with matching scores as given in Table 1.
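The alignment and label transfer can be sketched as follows. This is a generic global (Needleman-Wunsch style) alignment, not necessarily the exact algorithm used here. The match, mismatch and gap scores are placeholder assumptions and do not reproduce the values in Table 1, and the function names align and transfer_labels are hypothetical. Labels are copied only when an aligned pair of words is identical (case-insensitive), which is also an assumption.

```python
def align(careful, oyez, match=2, mismatch=-1, gap=-1):
    """Globally align two token lists; scores are illustrative placeholders.

    Returns a list of (careful_index, oyez_index) pairs, where None
    marks a token with no counterpart in the other transcript.
    """
    n, m = len(careful), len(oyez)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if careful[i - 1].lower() == oyez[j - 1].lower() else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two words
                score[i - 1][j] + gap,            # careful word missing in OYEZ
                score[i][j - 1] + gap,            # OYEZ token missing in careful
            )

    # Trace back from the bottom-right corner, preferring diagonal moves.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return pairs


def transfer_labels(careful, labels, oyez):
    """Copy per-word disfluency labels onto the OYEZ tokens."""
    transferred = [False] * len(oyez)
    for ci, oj in align(careful, oyez):
        if ci is not None and oj is not None and careful[ci].lower() == oyez[oj].lower():
            transferred[oj] = labels[ci]
    return transferred
```

Words that appear only in the careful transcript, such as an omitted repetition, align to a gap and therefore contribute no label to the OYEZ side.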

Switchboard transformation
The ANNOTATED OYEZ training set is a very small dataset, and other work has shown that Switchboard (SWBD) is useful for cross-domain training for SCOTUS (Zayats et al., 2014). However, prior work has been with careful transcripts. SWBD transcripts do not include '...' symbols, and SWBD has many more commas and other punctuation symbols. In order to make best use of the SWBD data, we transform it to be more similar to the OYEZ transcripts in two steps. First, we add '...' after interruption points in SWBD. Second, we remove all punctuation except '...' in the middle of the sentence in both corpora.
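The two transformation steps can be illustrated with a short sketch. It assumes a sentence is given as a list of tokens with punctuation as separate tokens, and that interruption points are given as the indices of the last reparandum word; both representations, the PUNCT set, and the function name transform_sentence are hypothetical. The insertion probability p is exposed as a parameter because the experiments section reports inserting '...' after only about one third of the interruption points.

```python
import random

# Assumed set of punctuation tokens; '...' is deliberately not included.
PUNCT = {",", ";", ":", ".", "?", "!", "-", "--"}


def transform_sentence(tokens, interruption_points, p=1.0, rng=None):
    """Make a SWBD-style sentence look more like an OYEZ transcript.

    tokens: list of word and punctuation tokens for one sentence.
    interruption_points: set of token indices after which a
        self-interruption occurs (hypothetical representation).
    p: probability of inserting '...' at each interruption point.
    """
    rng = rng or random.Random(0)
    last = len(tokens) - 1
    out = []
    for i, tok in enumerate(tokens):
        # Step 2: drop sentence-internal punctuation, keep the final mark.
        if not (tok in PUNCT and i != last):
            out.append(tok)
        # Step 1: add '...' after (a sampled subset of) interruption points.
        if i in interruption_points and rng.random() < p:
            out.append("...")
    return out


if __name__ == "__main__":
    sent = ["we", "would", ",", "which", "we", "would", "go", "."]
    print(" ".join(transform_sentence(sent, {1})))
```

A full implementation would also need to keep the disfluency labels aligned with the transformed tokens, which this sketch leaves out.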

Detecting disfluencies
In this section we describe the UNEDITOR system, which is a two-stage CRF model trained on transformed training data and takes advantage of a large pool of unlabeled data with a self-training technique.
Baseline: CRF. We use a conditional random field (CRF) model that labels each word in a sentence, following a tagging approach with separate repetition and non-repetition reparandum states, as in (Ostendorf and Hahn, 2013). The feature set includes identity and pattern match features widely used in disfluency detection tasks, as well as distance-based and disfluency language model features from (Zayats et al., 2014).

Two-stage model
Using the same features as in the baseline, we introduce a two-stage CRF model motivated by our observation that many repetitions are omitted from the standard transcriptions. Thus, while 62% of disfluencies in CAREFUL SCOTUS are repetitions, only 22% of all disfluencies in ANNOTATED OYEZ are repetitions. We find that training at two separate stages helps to overcome the difference in distributions of two disfluency types between source and target domains, and hence results in a better model for adaptation. In the first stage, we train a model to detect repetitions by only considering repetition states in the training data. In the second stage, we train a model to detect non-repetitions by removing all repetitions from the training data. Similarly at test time, we use the first-stage model to detect repetitions, then remove all the detected repetitions, and apply the second-stage model to detect non-repetitions. In evaluation, we report the disfluencies detected in both stages.
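The test-time procedure described above can be sketched as a small pipeline. The two taggers are passed in as plain functions that return one boolean per token; in the actual system they would be the trained first-stage and second-stage CRF models, whereas the toy taggers below are hypothetical stand-ins that exist only to make the example runnable.

```python
def two_stage_detect(tokens, rep_tagger, nonrep_tagger):
    """Label tokens as 'REP', 'NONREP' or 'O' using two taggers in sequence."""
    # Stage 1: detect repetitions on the full sentence.
    rep_flags = rep_tagger(tokens)
    labels = ["REP" if flag else "O" for flag in rep_flags]

    # Remove detected repetitions before running the second stage.
    kept_idx = [i for i, flag in enumerate(rep_flags) if not flag]
    kept_tokens = [tokens[i] for i in kept_idx]

    # Stage 2: detect non-repetition disfluencies on the reduced sentence,
    # then map the decisions back to the original token positions.
    nonrep_flags = nonrep_tagger(kept_tokens)
    for i, flag in zip(kept_idx, nonrep_flags):
        if flag:
            labels[i] = "NONREP"
    return labels


# Toy stand-ins for the trained models (assumptions, not the paper's CRFs).
def toy_rep_tagger(tokens):
    # Flag a token that is immediately repeated.
    return [i + 1 < len(tokens) and tokens[i] == tokens[i + 1]
            for i in range(len(tokens))]


def toy_nonrep_tagger(tokens):
    # Flag a token that is directly followed by the '...' cue.
    return [i + 1 < len(tokens) and tokens[i + 1] == "..."
            for i in range(len(tokens))]


if __name__ == "__main__":
    sent = ["who", "who", "would", "...", "wouldn't", "agree"]
    print(two_stage_detect(sent, toy_rep_tagger, toy_nonrep_tagger))
```

Keeping the index map from the reduced sentence back to the original one is what allows the disfluencies from both stages to be reported together in evaluation.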

Self-training
A benefit of OYEZ transcripts is that there is a huge amount of unlabeled data available, which makes it natural to use semi-supervised learning. In this work, we use a simple self-training approach. First we apply a CRF model trained on the labeled data to the unlabeled data. Then we augment the training data with automatically labeled sentences that have been detected to contain a disfluency with a confidence score greater than 0.5, and retrain the model with the new augmented training set.
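A minimal sketch of this self-training loop is given below. The training routine is passed in as a function, and a model is assumed to return per-token labels together with a single sentence-level confidence; how that confidence is computed from the CRF is not specified in the text, so the interface, the names self_train and toy_train, and the toy model are all assumptions made for illustration.

```python
def self_train(train_fn, labeled, unlabeled, threshold=0.5):
    """One round of self-training.

    train_fn: function mapping a list of (tokens, labels) pairs to a model.
    A model maps tokens to (labels, confidence); this interface is assumed.
    Returns the retrained model and the augmented training set.
    """
    model = train_fn(labeled)
    augmented = list(labeled)
    for tokens in unlabeled:
        labels, confidence = model(tokens)
        # Keep only sentences predicted to contain a disfluency
        # with confidence above the threshold.
        if any(labels) and confidence > threshold:
            augmented.append((tokens, labels))
    return train_fn(augmented), augmented


def toy_train(data):
    """Toy stand-in for CRF training: memorizes words labeled as disfluent."""
    cue_words = {w for tokens, labels in data
                 for w, flag in zip(tokens, labels) if flag}

    def model(tokens):
        labels = [w in cue_words for w in tokens]
        confidence = 0.9 if any(labels) else 0.0  # arbitrary toy confidence
        return labels, confidence

    return model


if __name__ == "__main__":
    labeled = [(["i", "i", "think"], [True, False, False])]
    unlabeled = [["i", "agree"], ["we", "submit"]]
    _, augmented = self_train(toy_train, labeled, unlabeled)
    print(len(augmented), "training sentences after self-training")
```

Because the automatically labeled sentences come from the initial model, errors in that model are carried into the augmented set, which is consistent with the later observation that the quality of the initial model matters.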

Experiments and discussion
We evaluate the different sources/transformations of training data, self-training and the two-stage detection model on ANNOTATED OYEZ transcripts from three cases (∼30k words).
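For reference, precision, recall and F1 can be computed from per-word decisions as sketched below. The text does not spell out the scoring unit, so treating each word's reparandum label as one decision is an assumption, and the function name precision_recall_f1 is hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall and F1 over parallel lists of booleans."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    gold = [True, True, False, False]
    predicted = [True, False, True, False]
    print(precision_recall_f1(gold, predicted))
```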

Transforming training data
First, we assess the utility of different training sources and training data transformation using the baseline model. Note that the two SCOTUS sets are quite small (four cases, ∼64k words) compared to Switchboard (1.3M words). Because of the difference in punctuation style between the original Oyez transcripts and the careful transcripts of both corpora, all sentence-internal punctuation is removed in the CAREFUL SCOTUS and ORIG SWBD data. Table 2 reports results on training the CRF model with the different sources and their combinations. As expected, detection with in-domain training data and transformed SWBD (ANNOT OYEZ+TRANSF SWBD) outperforms training on all other dataset combinations. Because its transcription style matches the test data, training on ANNOT OYEZ alone yields significantly better detection (especially precision) than training only on the carefully annotated data. Training with ORIG SWBD outperforms training with ANNOT OYEZ alone, mainly due to the availability of more training data in the SWBD dataset, consistent with results in (Ostendorf and Hahn, 2013). Surprisingly, the CAREFUL SCOTUS data did not provide any benefit when added to the ORIG SWBD.
Next, we study the impact of adding '...' symbols and removing punctuation for transforming the SWBD data. Table 3 reports results for training the CRF model with the combination of ANNOT OYEZ and SWBD with different transformation steps. We observe that roughly 30% of the interruption points in CAREFUL SCOTUS are associated with the '...' symbol in the OYEZ transcripts; therefore, we add '...' symbols after 1/3 of the interruption points in the SWBD. As expected, disfluency detection is improved by adding '...' to SWBD. The largest gain is obtained when we also remove punctuation (the row TRANSF SWBD). All further experiments use this setting for training the models.

Self-training
Here we study the contribution of semi-supervised learning when applied to the baseline model (Table 5). For self-training, we use 1,765 OYEZ transcripts dated 1990-2011 as our unlabeled data (∼17.5M words), with a confidence threshold of 0.5 for augmenting the training data, as described previously. We use each one of the baseline models in Table 2 as an initial model for the self-training for comparison to the results in Table 4. While adding a lot of in-domain data definitely helps, the quality of the initial model plays a major role in the overall performance.

Two-stage model
Finally, we assess the impact of the two-stage model with and without self-training (Table 5). For the two-stage semi-supervised model, self-training was only used for the second stage (non-repetition detection). As expected, both two-stage and self-training models improve the baseline CRF model, and the combination performs the best. The two-stage model helps to adapt to the differences in distribution of repetitions and non-repetitions between the two domains by factoring the different problems to improve the match of the more difficult non-repetition cases. Overall, we obtain nearly 23% improvement using the full UNEDITOR system compared to the model trained on the ORIG SWBD dataset.

Table 5: Baseline, two-stage and self-training methods, comparison: the baseline self-training method is trained on ...., all the other methods are trained on ANNOT OYEZ and TRANSF SWBD. Our method, UNEDITOR, combines self-training and two-stage models.

Conclusion
In this paper we present a framework for disfluency detection in non-careful transcripts. Experiments are based on the OYEZ archive of transcriptions of Supreme Court oral arguments. To address the problem of lack of annotated data, we first transfer disfluency annotations from careful transcripts of a few cases to the less precise OYEZ transcripts. Next, we transform Switchboard transcripts to make them more similar to the target domain. In addition, we introduce a two-stage model and self-training to further improve performance.
Experiments show improvement in disfluency detection on Supreme Court oral arguments. Starting from baselines of training from carefully annotated in-domain data (F1=26.1) or Switchboard data (F1=39.7), we achieve a substantial improvement to F1=62.2 with our best system, UNEDITOR, which corresponds to an absolute improvement of nearly 23 points (62.2 - 39.7 = 22.5) over the stronger baseline.
Possible extensions of this work include exploring graph-based semi-supervised approaches (e.g., Subramanya et al., 2010) and combining the text-based approach with flexible ASR forced alignment allowing optional insertion of filled pauses and words that are common as repetitions. In addition, the availability of the automatically annotated disfluencies makes it possible to study the variation in rates for different cases and speakers over an extended time period.