Combining Self-Training and Self-Supervised Learning for Unsupervised Disfluency Detection

Most existing approaches to disfluency detection rely heavily on human-annotated corpora, which are expensive to obtain in practice. There have been several proposals to alleviate this issue with, for instance, self-supervised learning techniques, but they still require human-annotated corpora. In this work, we explore the unsupervised learning paradigm, which can potentially work with unlabeled text corpora that are cheaper and easier to obtain. Our model builds upon the recent work on Noisy Student Training, a semi-supervised learning approach that extends the idea of self-training. Experimental results on the commonly used English Switchboard test set show that our approach achieves competitive performance compared to the previous state-of-the-art supervised systems using contextualized word embeddings (e.g. BERT and ELECTRA).


Introduction
Automatic speech recognition (ASR) outputs often contain various disfluencies, which are characteristic of spontaneous speech and create barriers to subsequent text processing tasks such as parsing, machine translation, and summarization. Disfluency detection (Zayats et al., 2016; Wang et al., 2016; Wu et al., 2015) focuses on recognizing the disfluencies in ASR outputs. As shown in Figure 1, a standard annotation of the disfluency structure indicates the reparandum (words that the speaker intends to discard), the interruption point (denoted as '+', marking the end of the reparandum), an optional interregnum (filled pauses, discourse cue words, etc.) and the associated repair (Shriberg, 1994).
Ignoring the interregnum, disfluencies are categorized into three types: restarts, repetitions and corrections. Table 1 gives a few examples. Interregnums are relatively easy to detect, as they are often fixed phrases, e.g. "uh", "you know". Reparandums, on the other hand, are more difficult to detect because they are in free form. As a result, most previous disfluency detection work focuses on detecting reparandums. Most work on disfluency detection (Zayats and Ostendorf, 2018; Lou and Johnson, 2017; Wang et al., 2017; Jamshid Lou et al., 2018; Zayats and Ostendorf, 2019) relies heavily on human-annotated corpora, which are scarce and expensive to obtain in practice. There have been several proposals to alleviate this issue with, for instance, self-supervised and semi-supervised learning techniques (Wang et al., 2018), but they still require human-annotated corpora. In this work, we completely remove the need for human-annotated corpora and propose a novel method to train a disfluency detection system in a completely unsupervised manner, relying on nothing but unlabeled text corpora.
Our model builds upon the recent work on Noisy Student Training (Xie et al., 2019), a semi-supervised learning approach based on the idea of self-training. Noisy Student Training first trains a supervised model on labeled corpora and uses it as a teacher to generate pseudo labels for unlabeled corpora. It then trains a larger model as a student on the combination of labeled and pseudo-labeled corpora. This process is iterated by putting the student back as the teacher. Their results showed that unlabeled corpora can be used to significantly advance both the accuracy and the robustness of state-of-the-art supervised models. However, the performance of Noisy Student Training still relies on human-annotated corpora.
In this work, we extend Noisy Student Training to unsupervised disfluency detection by combining self-training and self-supervised learning methods. More concretely, as shown in Figure 2, we use a self-supervised learning method to train a weak disfluency detection model on large-scale pseudo training corpora as a teacher, which completely removes the need for human-annotated corpora. We also use a self-supervised learning method to train a sentence grammaticality judgment model that helps select sentences with high-quality pseudo labels.
Experimental results on the commonly used English Switchboard set show that our approach achieves competitive performance compared to the previous state-of-the-art supervised systems using contextualized word embeddings (e.g. BERT and ELECTRA). We also evaluate our approach on three other speech genres, where it again achieves competitive performance compared to the supervised systems using contextualized word embeddings.
The code is released.
2 Proposed Approach

Unsupervised Training Process
Algorithm 1 and Figure 2 give an overview of our unsupervised training. The inputs to the algorithm are all unlabeled sentences, including news data and ASR outputs. We first construct large-scale pseudo data by randomly adding words to or deleting words from a fluent sentence, and use a self-supervised learning method to train a sentence grammaticality judgment model, which judges whether an input sentence is grammatically correct. We then construct large-scale pseudo data by randomly adding words to a fluent sentence, and use a self-supervised learning method to train a weak disfluency detection model as a teacher. Next, we use the teacher model to generate pseudo labels on unlabeled ASR outputs. If a sentence is given correct pseudo labels, what remains after deleting the words labeled as disfluent is fluent and grammatically correct. Based on this fact, we use the sentence grammaticality judgment model to help select sentences with high-quality pseudo labels. We then train a student model on the selected pseudo-labeled sentences. Finally, we iterate the process until performance stops improving, putting the student back as the teacher to generate new pseudo labels and train a new student. We choose the student model that achieves the best performance on the human-annotated dev set as our final model.
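The iterative procedure described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the released implementation: the function name `self_train`, the callables `infer_and_select`, `train_student` and `dev_f1`, and the `sample_sizes` schedule are all hypothetical stand-ins for the corresponding steps of Algorithm 1, and the stopping rule (stop at the first iteration that does not improve dev F1) is one plausible reading of "until performance stops improving".

```python
import random
from typing import Any, Callable, List, Sequence, Tuple

Sentence = List[str]
Labeled = Tuple[Sentence, List[str]]  # words with per-word "D"/"O" labels


def self_train(
    first_teacher: Any,
    unlabeled_asr: Sequence[Sentence],
    infer_and_select: Callable[[Any, Sequence[Sentence]], List[Labeled]],
    train_student: Callable[[Any, List[Labeled]], Any],
    dev_f1: Callable[[Any], float],
    sample_sizes: Sequence[int],
    seed: int = 0,
) -> Tuple[Any, float]:
    """Iterate teacher -> pseudo labels -> selection -> student.

    All callables are hypothetical placeholders supplied by the caller.
    """
    rng = random.Random(seed)
    teacher = first_teacher
    best_model, best_f1 = first_teacher, dev_f1(first_teacher)
    for size in sample_sizes:  # assumed: a growing sample of ASR outputs
        batch = rng.sample(list(unlabeled_asr), min(size, len(unlabeled_asr)))
        # The teacher labels the batch; the grammaticality filter keeps a subset.
        selected = infer_and_select(teacher, batch)
        # The student is initialised from the first (self-supervised) teacher.
        student = train_student(first_teacher, selected)
        score = dev_f1(student)
        if score <= best_f1:  # performance stopped improving
            break
        best_model, best_f1 = student, score
        teacher = student  # put the student back as the teacher
    return best_model, best_f1
```

In this sketch the dev set is used only to decide when to stop and which student to keep; no human labels enter training.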

System Architecture

Train Teacher Model
Traditional self-training methods train the teacher model on a labeled corpus. In our work, we completely remove the need for human-annotated corpora and use a self-supervised learning method to train a weak disfluency detection model as a teacher.
We first construct large-scale pseudo data for the teacher model, inspired by prior work. Let S be an ordered sequence of words taken from raw unlabeled news data and assumed to be fluent. We start from S and introduce random perturbations to generate a disfluent sentence S_disf. More specifically, we propose two types of perturbations:
• Repetition(k): the m words (m is randomly selected from one to six) starting from position k are repeated.
• Inserting(k): we randomly pick an m-gram (m is randomly selected from one to six) from the news corpus and insert it at position k.
For S, we randomly choose one to three positions, and then randomly apply one of the two perturbations at each selected position to generate the disfluent sentence S_disf = {w_1, w_2, ..., w_n}. The training goal is to detect the added noisy words by assigning a label to each word, where the labels D and O mean that the word is an added word and a fluent word, respectively. We directly fine-tune the ELECTRA model (the discriminator) (Clark et al., 2020) on our pseudo data. Note that the distribution of our pseudo data differs from the distribution of the gold disfluency detection data, which limits the performance of our teacher model on real test data.
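As a rough illustration of the two perturbations, the sketch below generates a pseudo-disfluent sentence with word-level labels. The function names are hypothetical, and two details are assumptions not stated in the text: for Repetition the first copy of the repeated span is labeled D (mirroring a reparandum followed by its repair), and the two perturbations are chosen with equal probability.

```python
import random
from typing import List, Tuple

Words = List[str]
Labels = List[str]


def repetition(words: Words, labels: Labels, k: int, m: int) -> Tuple[Words, Labels]:
    """Repeat the (up to) m words starting at position k.

    Assumption: the first copy is the added part and receives label D.
    """
    span = words[k:k + m]
    return words[:k] + span + words[k:], labels[:k] + ["D"] * len(span) + labels[k:]


def inserting(words: Words, labels: Labels, k: int, ngram: Words) -> Tuple[Words, Labels]:
    """Insert an n-gram at position k; every inserted word receives label D."""
    return words[:k] + ngram + words[k:], labels[:k] + ["D"] * len(ngram) + labels[k:]


def make_disfluent(sentence: Words, news_corpus: List[Words],
                   rng: random.Random) -> Tuple[Words, Labels]:
    """Apply one to three random perturbations to a fluent sentence."""
    words, labels = list(sentence), ["O"] * len(sentence)
    n_pos = min(rng.randint(1, 3), len(sentence))
    # Work from right to left so earlier positions stay valid after insertions.
    for k in sorted(rng.sample(range(len(sentence)), n_pos), reverse=True):
        m = rng.randint(1, 6)
        if rng.random() < 0.5:
            words, labels = repetition(words, labels, k, m)
        else:
            donor = rng.choice(news_corpus)
            start = rng.randrange(max(1, len(donor) - m + 1))
            words, labels = inserting(words, labels, k, donor[start:start + m])
    return words, labels
```

By construction, deleting every word labeled D from the output recovers the original fluent sentence, which is the property the pseudo labels are meant to encode.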

Grammaticality Judgment Model
If a sentence {w_1, w_2, ..., w_n} is given correct pseudo labels {t_1, t_2, ..., t_n} by a teacher model, the remaining part {w'_1, w'_2, ..., w'_m}, obtained by deleting the words with label D, is fluent and grammatically correct. Based on this fact, we train a sentence grammaticality judgment model to help select sentences with high-quality pseudo labels.
We first construct large-scale pseudo data for the sentence grammaticality judgment model. The input contains two kinds of sentences: (i) S_right, which is taken directly from raw unlabeled news data; (ii) S_error, which is generated by applying perturbations to S_right. We introduce three types of perturbations to generate S_error. The first two are Repetition(k) and Inserting(k), as described previously. The third is:
• Delete(k): for the selected position k, m words (m is randomly selected from one to six) starting from this position are deleted.
For an input sentence S, we randomly choose one to three positions, and then randomly apply one of the three perturbations at each selected position to generate the perturbed sentence S_error = {w_1, w_2, ..., w_n}. The training goal is to predict the type of an input sentence, where the labels right and error mean that the sentence is grammatically correct and grammatically incorrect, respectively. We directly fine-tune the ELECTRA model (the discriminator) (Clark et al., 2020) on our pseudo data.
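A minimal sketch of how such sentence-level pseudo data might be built is given below. The names `perturb` and `build_grammaticality_data` are hypothetical, and several choices are assumptions: the three perturbations are picked uniformly, each right sentence is paired with one error sentence derived from it (which yields a balanced set), and perturbed sentences that end up empty or identical to the original are skipped.

```python
import random
from typing import List, Tuple

Words = List[str]


def perturb(words: Words, news_corpus: List[Words], rng: random.Random) -> Words:
    """Apply one to three random Repetition / Inserting / Delete edits."""
    out = list(words)
    n_pos = min(rng.randint(1, 3), len(out))
    # Right-to-left processing keeps the smaller positions valid after each edit.
    for k in sorted(rng.sample(range(len(out)), n_pos), reverse=True):
        m = rng.randint(1, 6)
        op = rng.choice(["repetition", "inserting", "delete"])
        if op == "repetition":
            out[k:k] = out[k:k + m]           # repeat the m words starting at k
        elif op == "inserting":
            donor = rng.choice(news_corpus)   # random m-gram from the news corpus
            start = rng.randrange(max(1, len(donor) - m + 1))
            out[k:k] = donor[start:start + m]
        else:
            del out[k:k + m]                  # delete the m words starting at k
    return out


def build_grammaticality_data(news_corpus: List[Words],
                              rng: random.Random) -> List[Tuple[Words, str]]:
    """Return a shuffled, balanced list of (sentence, 'right' | 'error') pairs."""
    data: List[Tuple[Words, str]] = []
    for sent in news_corpus:
        noisy = perturb(sent, news_corpus, rng)
        if noisy and noisy != sent:  # assumption: skip degenerate perturbations
            data.append((list(sent), "right"))
            data.append((noisy, "error"))
    rng.shuffle(data)
    return data
```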

Infer and Select Sentences
We use the teacher model to generate pseudo labels on unlabeled ASR outputs. The performance of the teacher model starts at a very low level, so directly using the full set of unlabeled ASR outputs would introduce too much noise. We therefore gradually increase the amount of unlabeled ASR outputs used in each iteration by randomly sampling from the full set.
For an input sentence S = {w_1, w_2, ..., w_n}, the teacher model gives pseudo labels T = {t_1, t_2, ..., t_n}, ∀t_i ∈ {O, D}. Because the teacher model is imperfect, training a student model directly on all pseudo-labeled sentences would introduce much noise. We use the sentence grammaticality judgment model to help select sentences with high-quality pseudo labels. Given a sentence S = {w_1, w_2, ..., w_n} and its pseudo labels T = {t_1, t_2, ..., t_n}, we obtain a sub-sentence S_sub = {w'_1, w'_2, ..., w'_m} by deleting the words with the label D. If the sentence grammaticality judgment model predicts the label right for S_sub, we assume that the pseudo labels T are the same as the gold labels and keep (S, T) for student model training.
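The selection rule above can be sketched as a simple filter. The function and parameter names are hypothetical: `predict_labels` stands in for the teacher model and `is_grammatical` for the grammaticality judgment model returning the label right; dropping sentences whose S_sub is empty is an added assumption.

```python
from typing import Callable, List, Sequence, Tuple

Words = List[str]
Labels = List[str]


def select_pseudo_labeled(
    sentences: Sequence[Words],
    predict_labels: Callable[[Words], Labels],
    is_grammatical: Callable[[Words], bool],
) -> List[Tuple[Words, Labels]]:
    """Keep (S, T) only when S with its D-labeled words removed is judged 'right'."""
    selected = []
    for words in sentences:
        labels = predict_labels(words)                       # pseudo labels T
        sub = [w for w, t in zip(words, labels) if t != "D"]  # S_sub
        if sub and is_grammatical(sub):
            selected.append((words, labels))
    return selected
```

The filter never looks at gold labels; it relies only on the assumption that correctly labeled sentences leave a grammatical remainder.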

Train Student Model
In this step, instead of fine-tuning the ELECTRA model, we directly fine-tune the first teacher model (Step 2 of Algorithm 1) on the selected pseudo-labeled ASR outputs. Although the difference in distribution between our pseudo data and the gold disfluency detection data limits the performance of the teacher model, this stage converges faster than fine-tuning the ELECTRA model, as the model only needs to adapt to the idiosyncrasies of the target disfluency detection data.

Settings
Dataset. English Switchboard (SWBD) (Godfrey et al., 1992) is the standard and largest corpus used for disfluency detection (1.73 × 10^5 sentences for training). We use English Switchboard as our main data. Following the experimental settings of Charniak and Johnson (2001), we split the Switchboard corpus into train, dev and test sets as follows: the train data consists of all sw[23]*.dff files, the dev data consists of all sw4[5-9]*.dff files, and the test data consists of all sw4[0-1]*.dff files. Following Honnibal and Johnson (2014), we lower-case the text and remove all punctuation and partial words (see footnote 2). We also discard the 'um' and 'uh' tokens and merge 'you know' and 'i mean' into single tokens.
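For concreteness, the split and the preprocessing steps can be sketched as below. The file patterns are taken verbatim from the text; everything else is an assumption for illustration: the input is taken to be a list of (word, POS tag) pairs, the merged tokens are written as `you_know` and `i_mean`, and the disfluency labels (which a real pipeline would have to keep aligned) are omitted.

```python
import string
from fnmatch import fnmatchcase
from typing import List, Optional, Tuple

SPLITS = {"train": "sw[23]*.dff", "dev": "sw4[5-9]*.dff", "test": "sw4[0-1]*.dff"}
MERGE = {("you", "know"), ("i", "mean")}


def split_of(filename: str) -> Optional[str]:
    """Return 'train', 'dev' or 'test', or None if the file is in no split."""
    for name, pattern in SPLITS.items():
        if fnmatchcase(filename, pattern):
            return name
    return None


def preprocess(tokens: List[Tuple[str, str]]) -> List[str]:
    """Lower-case; drop punctuation, partial words, 'um'/'uh'; merge fixed phrases."""
    words = []
    for word, tag in tokens:
        word = word.lower()
        if tag == "XX" or word.endswith("-"):
            continue  # partial word
        if word in ("um", "uh"):
            continue  # filled pause
        if all(ch in string.punctuation for ch in word):
            continue  # punctuation-only (or empty) token
        words.append(word)
    merged, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in MERGE:
            merged.append(words[i] + "_" + words[i + 1])  # assumed token form
            i += 2
        else:
            merged.append(words[i])
            i += 1
    return merged
```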
In addition to Switchboard, we test our models on three out-of-domain publicly available datasets annotated with disfluencies (Zayats et al., 2014; Zayats and Ostendorf, 2018):
• CallHome: phone conversations between family members and close friends;
• SCOTUS: transcribed Supreme Court oral arguments between justices and advocates;
• FCIC: two transcribed hearings from the Financial Crisis Inquiry Commission.
2 Words are recognized as partial words if they are tagged as 'XX' or end with '-'.

The sizes of the training and test sets for all corpora are given in Table 2. Unlabeled sentences include news data and ASR outputs. News data are randomly extracted from the WMT2017 monolingual language model training data (News Discussions, Version 2). We then use the methods described in Section 2.2 to construct the pre-training datasets for the teacher model and the grammaticality judgment model. The training set of the teacher model contains 2 million sentences. We use 5 million sentences for the grammaticality judgment model, half of which are grammatically incorrect sentences and the other half grammatically correct sentences extracted directly from the news corpus. The unlabeled ASR outputs we use are Fisher Speech Transcripts Part 1 (Cieri et al., 2004) and Part 2 (Christopher Cieri and Walker, 2005), which together contain about 835k sentences.

Metric. Following previous work (Ferguson et al., 2015), token-based precision (P), recall (R), and F1 are used as the evaluation metrics.
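The token-based metric can be computed as in the following sketch, which treats D as the positive class and counts over all tokens in the evaluation set. The function name is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here rather than something specified in the text.

```python
from typing import List, Sequence, Tuple


def token_prf(gold: Sequence[List[str]], pred: Sequence[List[str]]) -> Tuple[float, float, float]:
    """Token-level precision, recall and F1 for the disfluent label 'D'."""
    tp = fp = fn = 0
    for gold_sent, pred_sent in zip(gold, pred):
        for g, p in zip(gold_sent, pred_sent):
            if p == "D" and g == "D":
                tp += 1  # disfluent token correctly detected
            elif p == "D":
                fp += 1  # fluent token wrongly marked disfluent
            elif g == "D":
                fn += 1  # disfluent token missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```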

Training Details
In all experiments involving the ELECTRA model, we use the English ELECTRA-Base model with 110M parameters, 12 attention heads, and 12 hidden layers. For the self-supervised teacher model and the grammaticality judgment model, we use streams of 128 tokens and mini-batches of size 256. We use a learning rate of 1e-4 and train for 30 epochs.
When training the student model on the selected pseudo-labeled ASR outputs, most model hyperparameters are the same as for the grammaticality judgment model, with the exception of the batch size, learning rate, and number of training epochs. We use a batch size of 128, a learning rate of 2e-5, and 10 epochs.

Table 3: Experiment results on the Switchboard dev set. "* fine-tuning" means fine-tuning the * model on the Switchboard train set. The first part (rows 1 to 5) contains the supervised methods, which use complicated hand-crafted features or contextualized word embeddings (e.g. ELMo (Peters et al., 2018) and ELECTRA); the second part (rows 6 and 7) contains the unsupervised methods.

Performance on English Switchboard
As shown in Table 3, we build six baseline systems: (1) Transition-based is a neural transition-based model (Wang et al., 2017), for which we directly use the code released by Wang et al. (2017); (2) BERT-Base fine-tuning means fine-tuning the BERT-Base model on the Switchboard train set; (3) ELECTRA-Small fine-tuning means fine-tuning the ELECTRA-Small (discriminator) model on the Switchboard train set; (4) ELECTRA-Base fine-tuning means fine-tuning the ELECTRA-Base (discriminator) model on the Switchboard train set; (5) Unsupervised teacher is the teacher model from Step 2 of Algorithm 1; (6) Teacher fine-tuning means fine-tuning the unsupervised teacher model on the Switchboard train set.

Table 3 reports the results of each model on the Switchboard dev set. Our unsupervised model achieves an improvement of almost 17 points over the baseline unsupervised teacher model. Even compared with supervised systems that use the full Switchboard training data and contextualized word embeddings, our unsupervised approach achieves competitive performance.

Finally, we compare our unsupervised model to state-of-the-art supervised and semi-supervised methods from the literature on the Switchboard test set. These methods can be divided into two categories: those that do not use contextualized word embeddings and those that do. Table 4 shows that our unsupervised model is competitive with recent models that use the full Switchboard training data. In particular, our unsupervised model even achieves a slight improvement over the supervised methods that do not use contextualized word embeddings, demonstrating the effectiveness of our unsupervised model.

Performance on Cross-domain Data
To assess the robustness of our method, we also test our unsupervised model on three out-of-domain publicly available datasets. As shown in Table 5, we use four baseline systems: (1) Unsupervised teacher is the teacher model from Step 2 of Algorithm 1, trained on the pseudo train set; (2) ELECTRA-Base fine-tuning means fine-tuning the ELECTRA-Base (discriminator) model on the Switchboard train set; (3) Teacher fine-tuning means fine-tuning the unsupervised teacher model on the Switchboard train set; (4) Pattern-match (Zayats and Ostendorf, 2018) is a pattern-match neural network architecture trained on the Switchboard train set, which achieves state-of-the-art performance in cross-domain scenarios.
For both the baseline and our unsupervised systems, we take the model achieving the highest F1 score on the Switchboard dev set and test it directly on the out-of-domain data without retraining. Table 5 shows that our unsupervised model achieves consistent performance on both Switchboard and the three cross-domain datasets. In contrast to the results on the Switchboard dev set shown in Table 3, here our unsupervised model achieves performance similar to the ELECTRA-Base fine-tuning model. This surprising observation shows that our unsupervised model is robust in cross-domain testing. We conjecture that this is because our method uses a large amount of unlabeled news data and ASR outputs, which helps it withstand the domain mismatch problem in cross-domain testing. Even compared with the supervised Pattern-match model (Zayats and Ostendorf, 2018), which achieves state-of-the-art performance in cross-domain scenarios, our model achieves competitive performance.

Table 6: Ablation study of the grammaticality judgment model. "Teacher" means the unsupervised teacher model. "No-select" means our unsupervised self-training method without the grammaticality judgment model. "Select" means our unsupervised self-training method with the grammaticality judgment model. "SWBD" means the Switchboard dev set.

Ablation Studies
In this section, we study the importance of the grammaticality judgment model and of iterative training.

The Importance of Grammaticality Judgment Model
To demonstrate the effect of the grammaticality judgment model, we conduct a further experiment without it. As shown in Table 6, both of our models (with and without selection) achieve significant improvements over the baseline unsupervised teacher model, and introducing the grammaticality judgment model yields higher performance. We conjecture that the grammaticality judgment model helps filter out sentences with incorrect pseudo labels.

A Study of Iterative Training
Here, we show the detailed effects of iterative training. As mentioned in Section 2.1, we first train a weak disfluency detection model on large-scale pseudo data and then use it as the teacher to train a student model. We then iterate this process by putting the new student model back as the teacher model. We plot the F1 score against the number of iterations for the two models, with and without the grammaticality judgment model. As shown in Figure 3 (a), the F1 scores of both models keep increasing until they reach an upper limit, and both achieve significant improvements over the model from the first iteration. These results indicate that iterative training is effective in producing increasingly better models.

Varying Amounts of Pseudo Data for Teacher Model
We study the impact of the pseudo training data size on the teacher model (Step 2 of Algorithm 1). Figure 3 (b) reports the results of training the self-supervised teacher model with varying amounts of pseudo training data. We observe that the F1 score on the Switchboard dev set keeps growing as the amount of pseudo data increases, until it reaches an upper limit. The upper limit is only about 72.3 F1, which is much lower than the supervised methods. We conjecture that this is because the distribution of our pseudo data differs from the distribution of the gold disfluency detection data, which limits the performance of our teacher model on real data. The result also shows that disfluencies in ASR outputs are complex, and that disfluency detection cannot be fully solved by pre-training on pseudo disfluency data.

Quantitative Analysis of Grammaticality Judgment Model
The ablation test demonstrates the effect of the grammaticality judgment model. To test the conjecture that the grammaticality judgment model helps filter out sentences with incorrect pseudo labels, we perform two quantitative analyses. The first gives the classification accuracy of the grammaticality judgment model on the Switchboard dev set, where it achieves an accuracy of 85%. This shows that the grammaticality judgment model is able to judge whether an input sentence is grammatically correct, and can therefore help select sentences with high-quality pseudo labels.
For the second quantitative analysis, we observe the change in F1 score by simulating the infer-and-select process of iterative training on the Switchboard dev set. In each iteration, we first use the teacher model to generate pseudo labels on the Switchboard dev set and compute one F1 score. We then use the grammaticality judgment model to select sentences and compute another F1 score on the selected sentences only. Figure 3 (c) reports the change in F1 score over the iterations. The F1 score on the selected sentences is always significantly higher than that without selection. This shows that the grammaticality judgment model consistently helps select sentences with high-quality pseudo labels.

Repetitions vs Non-repetitions
Repetition disfluencies are much easier to detect than other disfluencies, although not trivial, since some repetitions can be fluent. To better understand model performance, we evaluate our model's ability to detect repetition vs. non-repetition (other) reparandums on the Switchboard dev set. The results are shown in Table 7. All three models achieve high scores on repetition reparandums. Our unsupervised model is much better at predicting non-repetitions than the unsupervised teacher model. Even compared with the supervised ELECTRA-Base model, our model achieves competitive performance on non-repetitions. The result shows that our unsupervised model is able to handle complex disfluencies. We conjecture that our self-supervised tasks can capture more sentence-level structural information.

Related Work

Disfluency Detection
Most work on disfluency detection focuses on supervised learning methods, which fall into three main categories: sequence tagging, noisy-channel, and parsing-based approaches. Sequence tagging approaches label words as fluent or disfluent using a variety of techniques, including conditional random fields (CRF) (Georgila, 2009; Ostendorf and Hahn, 2013; Zayats et al., 2014), Max-Margin Markov Networks (M3N) (Qian and Liu, 2013), Semi-Markov CRF (Ferguson et al., 2015), and recurrent neural networks (Hough and Schlangen, 2015; Zayats et al., 2016; Wang et al., 2016). The main benefit of sequential models is their ability to capture long-term relationships between reparandums and repairs. Noisy channel models (Charniak and Johnson, 2001; Johnson and Charniak, 2004; Zwarts et al., 2010; Lou and Johnson, 2017) use the similarity between reparandum and repair as an indicator of disfluency. Parsing-based approaches (Rasooli and Tetreault, 2013; Honnibal and Johnson, 2014; Wu et al., 2015; Yoshikawa et al., 2016; Jamshid Lou et al., 2019) jointly perform parsing and disfluency detection. The joint models can capture long-range dependencies of disfluencies as well as chunk-level information. However, training a parsing-based model requires large annotated treebanks that contain both disfluencies and syntactic structures.
All of the above work relies heavily on human-annotated data. There have been limited efforts to tackle the training data bottleneck. Wang et al. (2018) and Dong et al. (2019) use an autoencoder method to help disfluency detection by jointly training the autoencoder model and the disfluency detection model. Self-supervised learning has also been used to tackle the training data bottleneck, and can substantially reduce the need for human-annotated training data. Lou and Johnson (2020) show that self-training and ensembling are effective methods for improving disfluency detection. These semi-supervised methods achieve higher performance by introducing pseudo training sentences. However, their performance still relies on human-annotated data. We explore unsupervised disfluency detection, taking inspiration from the success of self-supervised learning and self-training on disfluency detection.

Self-Supervised Representation Learning
Self-supervised learning aims to train a network on an auxiliary task where the ground truth is obtained automatically. Over the last few years, many self-supervised tasks have been introduced in the image processing domain, which make use of non-visual signals intrinsically correlated with the image as a form of supervision for visual feature learning (Agrawal et al., 2015; Wang and Gupta, 2015; Doersch et al., 2015).
In the natural language processing domain, self-supervised research mainly focuses on word embedding (Mikolov et al., 2013a,b) and language model learning (Bengio et al., 2003; Peters et al., 2018; Radford et al., 2018). For word embedding learning, the idea is to train a model that maps each word to a feature vector such that it is easy to predict the words in the context given the vector. This converts an apparently unsupervised problem into a "self-supervised" one: learning a function from a given word to the words surrounding it.
Language model pre-training (Bengio et al., 2003; Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019) is another line of self-supervised learning. A trained language model learns a function that predicts the likelihood of a word occurring given the surrounding sequence of words in the text. There are two main strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning. The feature-based approach, such as ELMo (Peters et al., 2018), uses task-specific architectures that include the pre-trained representations as additional features. The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018) and BERT (Devlin et al., 2019), introduces minimal task-specific parameters and is trained on the downstream tasks by simply fine-tuning the pre-trained parameters.
Motivated by the success of self-supervised learning, we use a self-supervised learning method to train a weak disfluency detection model as the teacher model. We also train a sentence grammaticality judgment model to help select sentences with high-quality pseudo labels.

Self-Training
Self-training (McClosky et al., 2006) first uses labeled data to train a good teacher model, then uses the teacher model to label unlabeled data, and finally uses the labeled data and the pseudo-labeled data to jointly train a student model. Self-training has been shown to work well for a variety of tasks, including leveraging noisy data (Veit et al., 2017), semantic segmentation (Babakhin et al., 2019), and text classification (Li et al., 2019). Xie et al. (2019) present Noisy Student Training, which extends the idea of self-training with the use of equal-or-larger student models and noise added to the student during learning.
Our model builds upon the recent work on Noisy Student Training (Xie et al., 2019) and further extends it to unsupervised disfluency detection by combining self-training and self-supervised learning methods.

Conclusion
In this work, we explore unsupervised disfluency detection by combining self-training and self-supervised learning. We show that it is possible to completely remove the need for human-annotated data and train a high-performance disfluency detection system in a completely unsupervised manner.