Planning and Generating Natural and Diverse Disfluent Texts as Augmentation for Disfluency Detection

Existing approaches to disfluency detection heavily depend on human-annotated data. A number of data augmentation methods have been proposed to alleviate this dependence on labeled data. However, current augmentation approaches, such as random insertion or repetition, fail to resemble the training corpus well and usually result in unnatural and limited types of disfluencies. In this work, we propose a simple Planner-Generator based disfluency generation model to generate natural and diverse disfluent texts as augmented data, where the Planner decides where to insert disfluent segments and the Generator follows this prediction to generate the corresponding disfluent segments. We further utilize this augmented data for pretraining and leverage it for the task of disfluency detection. Experiments demonstrate that our two-stage disfluency generation model outperforms existing baselines; the generated disfluent sentences significantly aided the task of disfluency detection and led to state-of-the-art performance on the Switchboard corpus. We have publicly released our code at https://github.com/GT-SALT/Disfluency-Generation-and-Detection.


Introduction
Disfluency is a para-linguistic concept denoting an interruption to the flow of speech (Kowal, 2009). As shown in Figure 1, a standard annotation of the disfluency structure indicates the reparandum (the region to repair), an optional interregnum (filled pauses, discourse cue words, etc.), and the associated repair (the corrected linguistic material) (Nakatani and Hirschberg, 1994). Disfluency detection focuses mainly on identifying the reparandum, since interregna can be easily detected in that they belong to a closed set of words and phrases, e.g. "uh", "I mean", "you know". The fluent sentences output by disfluency detection can serve as clean inputs for most downstream NLP tasks, such as dialogue systems, question answering, and machine translation (Wang et al., 2010).

Table 1: Types of reparandum.
Type          Example
Repetition    they they learn to share.
Deletion      this is just happened yesterday.
Substitution  it's nothing but wood up here down here.
Reparandum in disfluency can be categorized as repetition, deletion, and substitution (McDougall and Duckworth, 2017), as shown in Table 1. Repetition occurs when linguistic material repeats, usually in the form of partial words, words, or short phrases. Substitution occurs when linguistic material is replaced in order to clarify a concept or idea. Deletion, also known as false restart, refers to abandoned linguistic material.
Neural models have achieved reasonable performance in disfluency detection on the English Switchboard (SWBD) corpus (Godfrey et al., 1992). Such models typically apply pretrained models to disfluency detection framed as a sequence tagging or seq2seq task (Wang et al., 2018). Recently, data augmentation techniques have also been used to generate augmented disfluent sentences for model pretraining. These models are limited in that the augmented data is generated from simple heuristics such as random repetition or insertion of n-grams (Wang et al., 2018). Sentences generated by such methods do not resemble natural disfluent sentences and exhibit a different distribution of disfluency patterns from the original sentences in the SWBD dataset. For example, in Table 2, the random insertion of "of a" in example (2), "of a that's really good", results in a quite unnatural disfluent text, representative neither of our disfluency corpus nor of common speech. As a result, most sentences generated by random insertion are ineffective as augmented samples, and can even lead models to deviate from the SWBD dataset. This also accords with Gontijo-Lopes et al. (2020), who suggested that augmented data should have high affinity to the original dataset. Furthermore, based on our small corpus studies, we observed that current disfluency detection models usually struggle with substitution- and deletion-based disfluencies, largely due to the under-representation of substitutions and deletions in the current training corpus. Disfluent sentences generated by random repetition or insertion rarely contain substitutions, and are thus inadequate for introducing diverse forms of disfluency.

Table 2: Disfluent texts generated by random repetition, insertion, and our disfluency generation model.
To address this gap, we propose a generation based data augmentation method to produce natural and diverse disfluent sentences and further improve the performance of disfluency detection. This method is similar to back-translation (Edunov et al., 2018), which has been shown to be an effective data augmentation method in machine translation. Different from classical neural generation models, our generation model operates in a two-stage manner, motivated by coarse-to-fine decoding (Dong and Lapata, 2018). Specifically, given a fluent sentence as input, in the first stage a Planner selects the positions at which to insert reparanda; in the second stage, a Generator produces disfluent segments for the predicted positions. Compared to generic end-to-end generation, our two-stage model separates the generation task into two steps: where to generate and what to generate. This breakdown enables the model to generate only the disfluent segments rather than whole disfluent sentences, and to yield naturally labeled data as augmentation for disfluency detection. As shown in Table 2, the outputs (3) and (4) from our two-stage Planner-Generator model resemble natural disfluent sentences better than random insertion (2), and also introduce more substitutions (examples (3) and (4)). We then utilize this Planner-Generator disfluency generation to create augmented training data for the task of disfluency detection. As an additional benefit, the disfluent texts generated by our model can be used as inputs to text-to-speech (TTS) systems to generate more natural speech, improving tasks like automatic film dubbing, robotics, dialogue systems, and speech-to-speech translation, as shown by Betz et al. (2015) and Adell et al. (2006).
To sum up, our contributions are as follows:
• We design a simple two-stage Planner-Generator model to generate disfluent texts, and demonstrate its effectiveness over various generation baselines.
• We utilize our generation model to generate natural and diverse augmented disfluent data for the task of disfluency detection, and obtain state-of-the-art performance. We conduct thorough error analysis and discuss specific challenges faced by current approaches.

Related Work
Disfluency Generation Betz et al. (2015) and Adell et al. (2006) used heuristic rules to generate filled pauses and repetitions in disfluent speech generation. Their work demonstrates that disfluency generation enhances the naturalness and intelligibility of speech generated by text-to-speech (TTS) systems. Wang et al. (2018) and follow-up work randomly inserted or repeated n-grams to generate augmented disfluent sentences. Disfluent sentences generated this way have low affinity to the disfluent sentences in the benchmark dataset, and they contain few substitutions, limiting diversity. To achieve better affinity and diversity, we design a generation based data augmentation method that produces natural disfluent sentences which can then be directly used to train the disfluency detection model. We adapt multi-stage coarse-to-fine neural decoders (Dong and Lapata, 2018) to design our disfluency generation model.
Disfluency Detection Disfluency detection models mainly fall into four categories. The first utilizes noisy channel models (Zwarts and Johnson, 2011), which require a Tree Adjoining Grammar (TAG) based transducer in the channel model. The second leverages phrase structure, often via transition-based parsing, yet requires annotated syntactic structure (Rasooli and Tetreault, 2013; Yoshikawa et al., 2016; Wu et al., 2015; Jamshid Lou and Johnson, 2020). The third frames the task as sequence tagging (Ferguson et al., 2015; Hough and Schlangen, 2015; Zayats et al., 2016), and the last employs end-to-end encoder-decoder models (Wang et al., 2016, 2018) to detect disfluent segments automatically. Traditional disfluency detection models often required additional features (e.g. POS tags) (Wang et al., 2017), syntactic annotations, or external tools (e.g. a TAG-based transducer). More recently, pretrained models such as BERT (Devlin et al., 2018) have been applied to the sequence tagging formulation, and Wang et al. (2018) obtained similar performance by using the same data augmentation methods in an encoder-decoder fashion. These data augmentation methods created augmented disfluent sentences only by randomly inserting or repeating n-grams. To address this, we introduce generation based data augmentation: we first generate disfluencies and then use them for sequence-tagging-based disfluency detection. A similar trend exists in grammatical error detection, where Felice and Yuan (2014) and Kasewa et al. (2018) generated sentences with grammatical errors to augment the training data for grammatical error detection.

Disfluency Generation
Our goal is to generate a natural disfluent sentence from a fluent sentence. For this purpose, we introduce a Planner and Generator based model, as shown in Figure 2, described as follows. Let x = (x_1, x_2, ..., x_{|x|}) denote a fluent sentence and y = (y_1, y_2, ..., y_{|y|}) the corresponding disfluent sentence. We estimate p(y|x) via a two-stage generation process:

    p(y|x) = p(a|x) p(y|x, a),

where a = (a_1, a_2, ..., a_{|a|}) is a decision sequence with the same length as x. Each a_i is either 1 or 0, representing whether or not a disfluent segment (reparandum) should be added after x_i. We assume the a_i are independent of each other and further decompose the objective:

    p(a|x) = ∏_{i=1}^{|x|} p(a_i|x).

Planner. At the first (planning) stage, we use an encoder e_1 to obtain representations of x:

    h = (h_1, h_2, ..., h_{|x|}) = f_{e_1}(x_1, x_2, ..., x_{|x|}).

We then use h to compute the decision probability for each a_i:

    p(a_i|x) = softmax(W_a h_i),

with a learned projection W_a.

Generator: Encoder. We use another encoder e_2 to obtain representations of x as the conditioning state of the second stage:

    ĥ = (ĥ_1, ĥ_2, ..., ĥ_{|x|}) = f_{e_2}(x_1, x_2, ..., x_{|x|}).

Encoders e_1 and e_2 can be a bidirectional LSTM or a Transformer (Vaswani et al., 2017).
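The Planner's per-token decision can be sketched as a binary classifier over encoder states. Below is a minimal stand-in with a hand-set weight vector and a sigmoid in place of the trained softmax layer; the hidden states and weights are toy values, not the actual model's parameters.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def plan(hidden_states, w, b=0.0, threshold=0.5):
    """For each token representation h_i, predict a_i in {0, 1}:
    a_i = 1 means 'insert a reparandum after token i'."""
    decisions = []
    for h in hidden_states:
        score = sum(wi * hi for wi, hi in zip(w, h)) + b
        decisions.append(1 if sigmoid(score) >= threshold else 0)
    return decisions

# Toy 2-dimensional "encoder states" for a 4-token sentence.
h = [[0.2, -0.4], [1.5, 0.8], [-0.3, 0.1], [0.9, 1.2]]
w = [1.0, 1.0]
print(plan(h, w))  # -> [0, 1, 0, 1]
```

At inference the real Planner takes the argmax over its two-way softmax per token, which this thresholded sigmoid mimics.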
Generator: Decoder. p(y_j | y_{<j}, x, a) is computed from the output h̄_j of the corresponding decoder step. Figure 2 illustrates the model on the input "when will the flight to Denver take off" with the generated reparandum "to Boston". In our PG model, the input z_j of the j-th decoder step is determined by the value of the corresponding a_i (E is the embedding layer of the decoder):

    z_j = h_i          if a_i = 0, or if a_i = 1 and y_j is the first word of the reparandum;
    z_j = E(y_{j-1})   otherwise.

Alternatively, to make the model rely less on local context for words that are not copied, we can use a decoder with a weaker connection to the Planner (PG-LC).

We can also use a decoder with no connection to the Planner (PG-NC), where we separate the Generator from the Planner and use only the decision sequence to guide generation. This modification is the basis of the models with higher generation diversity:

    z_j = ĥ_i          if a_i = 0, or if a_i = 1 and y_j is the first word of the reparandum;
    z_j = E(y_{j-1})   otherwise.

We use an LSTM as the decoder, whose hidden vector at the j-th time step is computed by

    h̄_j = f_dec(z_j, h̄_{j-1}),

where h̄_0 = ĥ_{|x|} if we use the last hidden state of the encoder to initialize the first state of the decoder, and h̄_0 = 0 if we do not use such initialization (ID), decreasing the decoder's dependence on the encoder for higher diversity of generated disfluent segments. Based on the encoder's hidden vectors ĥ and the decoder's hidden vectors h̄, we use attention and a copying mechanism to compute p(y_j | y_{<j}, x, a), similarly to See et al. (2017). Alternatively, we can compute it without attention (AD) or without the copying mechanism (CD) for higher diversity of the generated reparandum. The decoder can also be replaced with a Transformer or GPT2 (Radford et al., 2019).
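A minimal sketch of this input-switching logic, assuming the aligned encoder state is fed on copy steps and at reparandum starts, and the previous word's embedding elsewhere (the vectors below are toy stand-ins for real hidden states and embeddings):

```python
def decoder_input(a_i, is_repar_start, enc_state, prev_word_emb):
    """Select the decoder input z_j per the piecewise rule:
    use the aligned encoder state when copying a fluent word (a_i == 0)
    or when emitting the first word of a reparandum; otherwise feed the
    embedding of the previously generated word."""
    if a_i == 0 or (a_i == 1 and is_repar_start):
        return enc_state
    return prev_word_emb

# Toy vectors standing in for real hidden states / embeddings.
enc = [0.5, 0.5]
emb = [0.0, 1.0]
print(decoder_input(0, False, enc, emb))  # -> [0.5, 0.5] (copy step)
print(decoder_input(1, False, enc, emb))  # -> [0.0, 1.0] (mid-reparandum)
```

In the PG variant the state comes from the Planner encoder, while in PG-NC it comes from the Generator encoder; the switching rule itself is the same.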

Training and Inference
The training objective is to maximize the log-likelihood of the disfluent sentence given the fluent sentence:

    L = Σ_{(x, a, y) ∈ D} [ log p(a|x) + log p(y|x, a) ],

where D represents all training pairs.
At inference time, the Planner chooses 0 or 1, whichever has the higher probability, at each step to generate the decision sequence a. Alternatively, the Planner can be an oracle Planner, whose predictions are the gold decision sequences, for higher accuracy; or a heuristic Planner, whose predictions are selected according to simple heuristics, for higher diversity in data augmentation. When generating the final disfluent sentence y, suppose y_j is generated based on a_i. If a_i = 0, we directly copy x_{i+1} as y_j; if a_i = 1, the Generator generates a sequence of words as the reparandum before copying x_{i+1}.
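The assembly step at inference can be sketched directly. The toy generator below just repeats the current word, mimicking a repetition-type reparandum; the real Generator is a trained decoder, so this callback is purely illustrative.

```python
def assemble(tokens, decisions, generate_reparandum):
    """Interleave generated reparanda with copied fluent tokens.
    decisions[i] == 1 means a reparandum is inserted after tokens[i]."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if decisions[i] == 1:
            out.extend(generate_reparandum(tokens, i))
    return out

# Toy generator: a repetition-type reparandum that repeats the current word.
repeat = lambda toks, i: [toks[i]]

print(assemble(["they", "learn", "to", "share"], [1, 0, 0, 0], repeat))
# -> ['they', 'they', 'learn', 'to', 'share']
```

Because the insertion positions are known, the output comes with its disfluency labels for free, which is what makes the generated data directly usable for tagging.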

Disfluency Detection
We regard disfluency detection as a sequence tagging task. Denote the i-th sentence with T words as s_i = {w_t | t = 1, ..., T}; the input of our model is {s_1, s_2, ..., s_N}, where N is the number of sentences in the dataset. The corresponding output is {q_1, q_2, ..., q_N}, where q_i = {l_t | t = 1, ..., T} is the label sequence of the i-th sentence. Each l_t ∈ {I, O}, where I (O) indicates that the word is inside (outside) the region of a reparandum.
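As a concrete illustration of this labeling scheme, a disfluent sentence with known reparandum spans maps to I/O tags as follows (a minimal sketch; the span indices are hypothetical token offsets):

```python
def io_tags(tokens, reparandum_spans):
    """Label each token I (inside a reparandum) or O (outside).
    reparandum_spans: list of (start, end) token index pairs, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end in reparandum_spans:
        for t in range(start, end):
            tags[t] = "I"
    return tags

tokens = ["they", "they", "learn", "to", "share"]
print(io_tags(tokens, [(0, 1)]))
# -> ['I', 'O', 'O', 'O', 'O']
```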

Heuristic based Data Augmentation
Pretraining the model on augmented data before training on the SWBD dataset has been shown to be effective (Wang et al., 2018). Note that, in contrast to prior work using multi-task learning with a sentence-pair classification auxiliary task, our study mainly focuses on sequence tagging pretraining. To generate an augmented disfluent sentence from any fluent sentence, we followed the heuristic augmentation method of prior work: we first randomly choose one to three positions in a fluent sentence; then, for each position k:
• Insertion: we randomly pick an m-gram (m randomly selected from one to six) from the news corpus and insert it at position k.
• Repetition: m words (the length of the repeated span, randomly selected from one to six) starting from position k are repeated.
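A minimal sketch of these two heuristics (simplified; the news-corpus m-gram sampler is stubbed with a hypothetical word list rather than the actual WMT data):

```python
import random

def repetition(tokens, k, m):
    """Repeat m words starting at position k."""
    return tokens[:k] + tokens[k:k + m] + tokens[k:]

def insertion(tokens, k, corpus_words, m):
    """Insert a random m-gram drawn from an external corpus at position k."""
    start = random.randrange(max(1, len(corpus_words) - m + 1))
    return tokens[:k] + corpus_words[start:start + m] + tokens[k:]

sent = ["this", "happened", "yesterday"]
print(repetition(sent, 0, 2))
# -> ['this', 'happened', 'this', 'happened', 'yesterday']
```

Repetition at least reuses local context, whereas Insertion splices in unrelated material, which is exactly why its outputs tend to be unnatural.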

Generation based Data Augmentation
The augmented sentences generated by Insertion are often unnatural, since the inserted m-grams are randomly picked from the whole corpus and may be irrelevant to the current sentence. This creates a large discrepancy between the distribution of augmented sentences and the original corpus, which further hinders the effectiveness of the augmented data. To introduce more natural and diverse disfluencies, we add a generation based augmentation mode:
• Generation: we use our PG-based model to generate a reparandum starting at position k.

Sequence Tagging
For the sequence tagging model, instead of using a Transformer or the combination of a trainable Transformer and a frozen BERT as Wang et al. (2019) did, we directly adopted a trainable BERT for both pretraining and fine-tuning. We first compute the probability of each word's label,

    p(l_t | s_i) = softmax(W_l h_t),

where h_t is BERT's representation of w_t and W_l is a learned projection. The goal of the model is to minimize the cross-entropy (CE) loss over all labels.

Dataset

Following standard preprocessing, we lower-cased the text and removed all punctuation and partial words. For disfluency generation, all sentences containing a reparandum were treated as disfluent sentences. Specifically, our training set contains 29k disfluent sentences out of 173k sentences, the development set 2k out of 10k, and the test set 1.6k out of 7.9k.
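As a toy illustration of the per-token cross-entropy objective (plain Python, with made-up probability values rather than actual BERT outputs):

```python
import math

def cross_entropy(probs, gold_labels, label_index):
    """CE loss for one sentence: -sum over tokens of log p(gold label).
    probs: per-token probability distributions over the label set."""
    return -sum(math.log(p[label_index[g]]) for p, g in zip(probs, gold_labels))

# Toy per-token distributions over {I, O} for a 3-token sentence.
probs = [[0.1, 0.9], [0.8, 0.2], [0.2, 0.8]]
loss = cross_entropy(probs, ["O", "I", "O"], {"I": 0, "O": 1})
print(round(loss, 3))  # -> 0.552
```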

Evaluation
To measure whether the generated disfluent sentences are natural, we compared them with reference disfluent sentences using two generation-related metrics: BLEU (Papineni et al., 2002) and Sentence Accuracy (Sent Acc), i.e. the percentage of generated sentences that exactly match the ground-truth disfluent sentences. Furthermore, we evaluated the naturalness of model outputs by human judgment. Due to budget constraints, we only evaluated the model (PG-NC-AD-ID) and the baseline (Insertion & Repetition) with the highest diversity according to automatic measures. For these two systems, we randomly selected 100 generated disfluent sentences each and had them assessed on Amazon Mechanical Turk, eliciting 3 responses per HIT. Each sentence was scored 1 if judged Natural, 0.5 if Unnatural, and 0 if Incomprehensible. The average Human-evaluated Naturalness (HN) score thus ranges from 0 (worst) to 1 (best).
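The HN aggregation just described can be sketched as follows (a minimal sketch; the label strings and ratings are illustrative):

```python
SCORE = {"natural": 1.0, "unnatural": 0.5, "incomprehensible": 0.0}

def hn_score(ratings_per_sentence):
    """Average the worker ratings per sentence, then average over sentences."""
    per_sentence = [
        sum(SCORE[r] for r in ratings) / len(ratings)
        for ratings in ratings_per_sentence
    ]
    return sum(per_sentence) / len(per_sentence)

ratings = [
    ["natural", "natural", "unnatural"],           # 2.5 / 3
    ["unnatural", "incomprehensible", "natural"],  # 1.5 / 3
]
print(round(hn_score(ratings), 3))  # -> 0.667
```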
We also designed metrics to measure the diversity of disfluent segments, similarly to Li et al. (2015). Specifically, we calculated the number of new unigrams and bigrams in the generated disfluent segments, scaled by the total number of generated tokens in those segments (shown as Diverse-1 and Diverse-2 in Table 3). To evaluate disfluency detection, we used the standard metrics: Precision, Recall, and F-score.
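Reading "new" n-grams as distinct n-grams, in the spirit of Li et al.'s distinct-n, a minimal sketch of Diverse-1/Diverse-2 might look like this (the segments are illustrative):

```python
def distinct_n(segments, n):
    """Number of distinct n-grams across generated disfluent segments,
    scaled by the total number of generated tokens (Diverse-n sketch)."""
    ngrams = set()
    total_tokens = 0
    for seg in segments:
        total_tokens += len(seg)
        for i in range(len(seg) - n + 1):
            ngrams.add(tuple(seg[i:i + n]))
    return len(ngrams) / total_tokens if total_tokens else 0.0

segs = [["you", "know"], ["you", "know"], ["i", "mean"]]
print(distinct_n(segs, 1))  # 4 distinct unigrams over 6 tokens
```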

Training Details
For disfluency detection, we used BERT-base-uncased (Wolf et al., 2019). In both pretraining and fine-tuning stages, we used the Adam optimizer with learning rate 1e-5 and batch size 32. For disfluency generation, we trained LSTMs with learning rate 1e-2 and Transformers with learning rate 1e-4.

Models and Baselines
For pretraining, we followed prior work in using the WMT2017 monolingual language-model training data as unlabeled data. The data augmentation methods in Section 3.2 were used to generate augmented disfluent sentences. Prior work used 3 million sentences in the sequence tagging task and 9 million sentence pairs in the sentence classification task; we used 3 million sentences for fair comparison and also experimented with 20 million sentences to examine the effect of data size.
In Table 4, we show the composition of augmented sentences in all of our models. Note that Wang et al. (2018) and Bach and Huang (2019) treated interregnum and reparandum equally as disfluent segments when training and evaluating their models, while the other systems in Table 5 focused only on reparandum, which is more difficult to detect. Bach and Huang (2019) also used a different training/development split with a larger training set. Given these different setups, we did not compare with Wang et al. (2018) and Bach and Huang (2019).

For disfluency generation, we applied various combinations of the model settings described in Section 3.1. In PG-Transformer, the encoders and decoders are all Transformers, while all the other models use LSTMs. Planner-Generator (PG), PG with less Planner-Generator connection (PG-LC), and PG with no Planner-Generator connection (PG-NC) are the models that generate relatively natural disfluent sentences (high BLEU and Sent Acc). For higher diversity, PG-CD is PG without the copying mechanism. Likewise, PG-NC-CD is PG-NC without the copying mechanism, and PG-NC-AD is PG-NC without the attention mechanism. In the extreme case, PG-NC-AD-ID is PG-NC without the attention mechanism or the encoder-initialized decoder, for the highest diversity. We used Simple Copy (directly copying the input as output), random Insertion & Repetition of n-grams, an LSTM- and attention-based Seq2Seq model, CopyNet, and pretrained BART as baselines. Since our models enable control over generating reparanda from any given decision sequence, we examined their performance with and without oracle decision sequences, i.e. where the positions of the reparanda in generated sentences are the same as in the references.
In order to use our model to generate diverse disfluent sentences, we experimented with different variants of PG and found that PG-NC-AD-ID produced the best results. We therefore chose PG-NC-AD-ID together with the heuristic Planner applying the position-choosing heuristic described in Section 3.2.1, since the model-based Planner always chose a few most probable positions and generated less diverse disfluent sentences, which did not work well as augmented data. GPT2 was used to replace the LSTM decoder and was trained on part of the pretraining dataset to alleviate the domain gap.

Table 5: Results of disfluency detection. F-score is the major metric; "ours" denotes our implementations. The mark † denotes that the results are significant at level p < 0.05 (p = 0.0003 comparing BERT-GRI3 and BERT; p = 0.0259 comparing BERT-GRI3 and BERT-RI3).

Table 3 shows the disfluency generation results. Despite its relatively high diversity, the Insertion & Repetition baseline had a low BLEU score and an almost-zero Sent Acc, indicating that disfluent sentences generated in this manner are neither natural nor similar to the real disfluency distribution in the SWBD dataset. The Simple Copy baseline maintained high BLEU yet failed to generate any disfluent sentences, with zero Sent Acc. The other neural baselines achieved reasonable BLEU and Sent Acc; however, their outputs could not serve as augmented data for pretraining sequence tagging models, since they provide no indication of where the disfluent segments are in the output sentences. All of our proposed PG-based models outperformed Insertion & Repetition in terms of BLEU and Sent Acc, which shows that our generated results are closer to natural disfluent sentences than random insertion or repetition of n-grams.
Among our models, PG-Transformer, PG, and PG-LC generated the most natural disfluent sentences, achieving the highest BLEU and Sent Acc. Our LSTM-based models PG and PG-LC outperformed all of the baselines in terms of Sent Acc and BLEU, although PG-Transformer was slightly overshadowed by CopyNet in Sent Acc. The performance boost of our PG-based models mainly comes from the two-stage Planner-Generator process, since the hidden states of the first stage are used as initial input to guide the generation of reparanda in the second stage.

Disfluency Generation Result
We found that removing the copying mechanism (PG-CD) harmed Sent Acc but did not drastically decrease BLEU compared with PG. Removing the state passing between Planner and Generator Decoder (PG-NC) severely harmed both Sent Acc and BLEU, while improving generation diversity. Removing both the copying mechanism and the state passing (PG-NC-CD) boosted diversity significantly. This demonstrates that the copying mechanism and the Planner-to-Decoder state passing together pushed the model toward generating repetitions; removing these two mechanisms increased substitutions and deletions and decreased repetitions in the results, leading to higher diversity of disfluent sentences.
Without the attention between Generator Encoder and Generator Decoder, PG-NC-AD had little improvement in diversity compared to PG-NC-CD. However, when deleting the mechanism of using the last state of Generator Encoder as the initial state of Generator Decoder (PG-NC-AD-ID), diversity increased substantially. This made the Generator Decoder an unconditional language model trained on the dataset. Although the PG-NC-AD-ID model decreased BLEU and Sent Acc, it still generated more natural disfluent sentences than Insertion & Repetition, as demonstrated by higher automatic evaluation metrics (BLEU, Sent Acc) and human evaluation metric (HN). Considering that PG-NC-AD-ID outperformed Insertion & Repetition in all metrics, we used this model to generate diverse and natural augmented disfluent sentences for disfluency detection. As we expected, with oracle decision sequences, nearly all models achieved significantly better BLEU and Sent Acc.

Disfluency Detection Result
We used the above generation based augmented data to further improve disfluency detection; the results are shown in Table 5. We found that BERT without pretraining already achieved competitive results. BERT-G3 performed better than BERT, showing the effectiveness of our generation based data augmentation. Our BERT-RI3 performed similarly to prior work, although we did not use a sentence-pair classification task as an auxiliary task during pretraining. The reason might be that we fine-tuned the BERT model during both pretraining and SWBD training, while prior work trained a Transformer during these stages and combined it with a fixed BERT when training on SWBD. Overall, when using Repetition & Insertion for data augmentation, BERT-RI3 performed better than BERT. After replacing Insertion with Generation, BERT-GR3 outperformed BERT-RI3. When adding Generation on top of Repetition and Insertion, BERT-GRI3 achieved even better, new state-of-the-art performance. We also performed significance tests with bootstrap resampling (Berg-Kirkpatrick et al., 2012): BERT-GRI3 significantly outperformed BERT-RI3 and BERT, with p = 0.0259 and p = 0.0003 respectively. This not only demonstrates the effectiveness of our disfluency generation based data augmentation, but also shows that disfluencies generated by our generation model are complementary to those generated by Insertion & Repetition. A comparison of the precision and recall of BERT-GRI3 and BERT reveals that the improvements from pretraining mainly come from higher recall, indicating that pretraining helps the model detect more disfluencies while maintaining similar precision. When the pretraining data size was increased, BERT-GRI20 did not significantly outperform BERT-GRI3.

Impact of Augmented Disfluency Types:
We summarized the different types of generated disfluencies in Table 6 to show how our model contributed to disfluency detection. Insertion & Repetition generated few substitutions, which caused a lack of natural and diverse disfluencies. Although our PG model achieved state-of-the-art performance in terms of BLEU and Sent Acc, it mainly generated repetitions, leading to low diversity. This was potentially caused by two factors. First, the disfluent segments in the training dataset were dominated by repetitions (65.39%), in comparison to 18.99% substitutions and 15.62% deletions. Second, copying words and phrases during generation proved to be the most convenient and consistent strategy for neural models, even without an explicit copying mechanism. Our PG-NC-AD-ID model generated more substitutions and deletions than PG, leading to the highest diversity. Compared to random Insertion & Repetition, it also generated substantially more substitutions, leading to more effective data augmentation; its decreased number of repetitions can be compensated by combining it with random repetition, as in BERT-GR3. Table 7 presents the proportion of all repetitions, substitutions, and deletions in the reference test set that was not identified by our disfluency detection models. Comparing our generation based augmentation (BERT-GRI3 and BERT-GR3) with the other methods (BERT-RI3 and BERT), we found that pretraining on our generated data reduces substitution errors and improves recall in Table 5, owing to the increased number of natural substitutions produced by our disfluency generation model.

Error Analysis and Challenges
We manually annotated the errors made by our disfluency detection model case by case, and present a thorough error analysis by error type in Table 8. Note that nearly one fourth of the "wrong" predictions were in fact correct; these mismatches were caused by improper annotation. For example, the sentence "the thing is is that's not enough" was annotated as a fluent sentence, although the first "is" should be a reparandum. Such noisy annotations in the SWBD dataset are a major hindrance to achieving higher performance. Among the remaining errors, we saw many more false negatives than false positives. False negatives were dominated by substitutions and deletions, even though repetitions (65.39%) are far more frequent than substitutions (18.99%) and deletions (15.62%) in the original SWBD dataset. This shows that current models do relatively well at identifying repetitions, while detecting substitutions and deletions remains challenging.

Conclusion
This work presents a simple two-stage disfluency generation model that produces natural and diverse disfluent texts. We further used the generated texts as augmented data for pretraining, aiding the task of disfluency detection. Experiments demonstrate that our proposed disfluency generation model outperforms existing baselines, and that the generated disfluent sentences significantly aided disfluency detection, leading to state-of-the-art performance.
A Training Details

For disfluency detection, in both pretraining and fine-tuning stages, we used the Adam optimizer with learning rate 1e-5 (searched over [1e-4, 1e-5]) and batch size 32. Hyperparameters were searched manually according to F-score. We ran 20 epochs for pretraining and 20 epochs for training on the SWBD dataset, decaying the learning rate by 0.985 after each epoch.

For the disfluency generation models, we likewise decayed the learning rate by 0.985 after each epoch and trained them for 30 epochs. When we used GPT2 (Wolf et al., 2019) as the decoder, it was trained on another 3M sentences of WMT2017 monolingual language-model training data for 10 epochs.

B Computational Requirements
We ran our models on GeForce RTX 2080 GPU. Each disfluency generation model required 1 hour to finish training (GPT2 and Transformer required 4 hours). Each disfluency detection model required 2 hours to finish training on SWBD data. Pretraining disfluency detection models on 3M data required 5 days on 1 GPU. Pretraining models on 20M data required 7 days on 4 GPUs.

C Evaluation Metrics
As for metrics, we used NLTK to compute BLEU. Other metrics are computed by our scripts written according to descriptions in the paper. Metrics on validation sets were close to those reported on test sets for all experiments.
For Human-evaluated Naturalness on AMT, we provided descriptions and example sentences for the three levels of disfluent sentences (incomprehensible, unnatural, and natural). For example, a "Natural disfluent sentence" was described as "Perfectly natural speech. Similar to the talk you could probably have with someone in life." To improve annotation quality, annotators were required to have more than 5000 approved HITs, a HIT approval rate above 98%, and a US location. We also required annotators to pass a qualification test, consisting of samples with expected answers, before working on the annotation, to ensure a good understanding of our task. Annotators were paid $0.08 per annotated sentence, and each sentence was rated by three workers.

D Dataset
The SWBD dataset is a part of the PDTB, and the WMT2017 monolingual language-model training data can be downloaded from News Crawl (articles from 2016).