Grounding Conversations with Improvised Dialogues

Effective dialogue involves grounding, the process of establishing mutual knowledge that is essential for communication between people. Modern dialogue systems are not explicitly trained to build common ground, and therefore overlook this important aspect of communication. Improvisational theater (improv) intrinsically contains a high proportion of dialogue focused on building common ground, and makes use of the yes-and principle, a strong grounding speech act, to establish coherence and an actionable objective reality. We collect a corpus of more than 26,000 yes-and turns, transcribing them from improv dialogues and extracting them from larger, but more sparsely populated movie script dialogue corpora, via a bootstrapped classifier. We fine-tune chit-chat dialogue systems with our corpus to encourage more grounded, relevant conversation and confirm these findings with human evaluations.


Introduction
For humans, dialogue is fundamentally a collaborative, cooperative process by which partners coordinate via turns or acts to jointly construct a common world state (Bohm and Nichol, 2004). Without coordination, partners may establish different or conflicting world states, leading to solipsism in the best case and conflict in the worst. Clark and Schaefer (1989) describe five dimensions of grounding, by which partners cooperate to establish common ground, or a shared world state. The dimension of "initiation of next relevant contribution" is the most effective of these in expressing understanding of an ongoing dialogue, and yet is the least observed in dialogue systems.
Improvisational theater (improv) is a form of theater in which most or all of what is performed is unscripted, created spontaneously by the actors in real time. Because the performance is not scripted and there is typically little to no scenery or other established environment, there is no objective reality that can naturally ground the scene. Hence, actors must mainly rely on dialogue in order to build a coherent scene and progressively establish a common world view. This necessitates accelerated use of the "initiation of next relevant contribution," which in improv is known as the yes-and principle. The yes-and principle is a rule-of-thumb that suggests that a participant should accept the reality of what the other participant has said ("yes") and expand or refine that reality with additional information ("and"). Since actors consciously abide by this principle during improv performances, there is a high proportion of these turns embedded in improv dialogue, which helps ensure scenes are coherent and interesting.
Open-domain neural dialogue systems, by contrast, specifically lack coherence and interestingness. They commonly repeat previous utterances (Li et al., 2016c) or generate non-committal, generic statements such as "I don't know" that are logically coherent as a response but preempt further conversation (Li et al., 2016a). Either of these developments leads to a conversational black hole and discourages participation in further dialogue turns. This is a critical shortcoming for open-domain dialogue agents, which, unlike task-oriented dialogue systems, are not guided by specific objectives other than entertainment (Huang et al., 2020). It would behoove such systems to adopt the strategies improvisers include by habit in their dialogues and, consequently, incorporating improv acts should be a key focus for the dialogue community.
Yet, to the best of our knowledge, this has not been previously done. There has been work in applying improv to build believable agents that interact with humans (Bruce et al., 2000; Winston and Magerko, 2017) or generate improvised stories (Martin et al., 2016), but development of improv-capable systems in the neural era is largely absent, stymied, we suspect, by the lack of substantial corpora. This is unsurprising; while improv speech acts such as yes-and are crucial in all dialogues, they are only highly concentrated in improv dialogues. And improv dialogues are quite difficult to collect; research collections (Busso and Narayanan, 2008) have been far too small to be useful in the modern ML era. The art form has historically been mostly ephemeral, performed live in regional venues on shoestring budgets and rarely recorded. [2] Transcripts are all but absent and mainstream media products are rare. [3] However, the liberalization of high quality audio podcasts since 2014 has enabled the availability of a long tail of niche products, improv included (McHugh, 2016).
[2] The art form has long roots, extending to the Italian Commedia dell'arte tradition from the 16th century and farces from the Roman era, but we constrain our focus to the post-20th century form developed and championed by e.g. Keith Johnstone (Johnstone, 2017), Del Close (Halpern et al., 1994), and our corpus' namesake, Viola Spolin (Spolin et al., 1986). Spolin was the originator of Theater Games, acting exercises that encourage the development of specific theatrical skills. As our corpus is similarly designed to elicit specific skills, we backronym it in recognition of her influence.
[3] One exception, the long-running TV show Whose Line Is It Anyway, has, despite a large number of episodes, surprisingly little continuous improvised dialogue, due to the rapid-fire nature of the program.
Therefore we set our objective as collecting yes-and-type dialogue pairs (yes-ands) to enable their modeling by corpus-driven dialogue systems. We mine podcasts and existing movie script corpora for dialogue that abides by the yes-and principle and extract dialogue pairs from these sources to build the Selected Pairs Of Learnable ImprovisatioN (SPOLIN) corpus. SPOLIN is a collection of more than 26,000 English dialogue turn pairs, each consisting of a prompt and subsequent response, which abide by the yes-and principle, though in diverse manners. Examples of yes-and type dialogue pairs collected for SPOLIN are in Figure 1. The corpus is substantial enough to be usable for fine-tuning existing dialogue models to encourage more yes-and behavior, and beyond that may prove a valuable knowledge base for empirical sociolinguistic studies on this dialogue act.
Our contributions are summarized as follows:
• We carefully curate Selected Pairs Of Learnable ImprovisatioN (SPOLIN), the first large-scale corpus of yes-and dialogue acts, sourced from improv and movie dialogues.
• We iteratively build a high-precision yes-and classifier, which we use to mine additional yes-ands from dialogue corpora with high volume but low yes-and density.
• We fine-tune existing open-domain conversational models with our corpus and confirm via human evaluations that this approach improves creative grounding.
• We release our models and data for public use, including a 64,000 turn pair extension of the core SPOLIN, at https://justin-cho.com/spolin.

Data Collection
Our data collection has five stages:
1. Manually extract yes-ands from a rich corpus of improv to obtain an initial set of yes-ands.
2. Construct a yes-and classifier from the corpus of collected yes-and data and negative examples.
3. Use the classifier from step 2 to automatically extract yes-and candidates from a much larger but sparser dialogue corpus.
4. If necessary, manually validate candidates before adding them to the yes-and corpus.
5. Repeat from step 2 as needed.
An overview of this process is shown in Figure 2.
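The five stages above can be sketched as a short loop. This is only an illustrative outline, not the released implementation: the function name `bootstrap_mine` and the `train` and `validate` callables are hypothetical stand-ins for classifier fine-tuning and Turker validation, and the list of thresholds is assumed to be supplied by the caller.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (prompt, response)
Scorer = Callable[[Pair], float]  # returns an estimated P(yes-and)


def bootstrap_mine(
    seed_pos: Sequence[Pair],
    seed_neg: Sequence[Pair],
    pool: Sequence[Pair],
    train: Callable[[List[Pair], List[Pair]], Scorer],  # hypothetical: fits a classifier
    validate: Callable[[Pair], bool],  # hypothetical: stands in for human validation
    thresholds: Sequence[float],  # one confidence threshold per iteration
) -> Tuple[List[Pair], List[Pair]]:
    """Iteratively grow a yes-and corpus from a sparse pool of dialogue pairs."""
    positives, negatives = list(seed_pos), list(seed_neg)
    remaining = list(pool)
    for threshold in thresholds:
        score = train(positives, negatives)  # step 2: (re)build the classifier
        candidates = [p for p in remaining if score(p) >= threshold]  # step 3
        if not candidates:
            break
        for pair in candidates:  # step 4: validate each candidate
            (positives if validate(pair) else negatives).append(pair)
        seen = set(candidates)
        remaining = [p for p in remaining if p not in seen]
        # step 5: the loop repeats with the enlarged training set
    return positives, negatives
```

Rejected candidates are kept as negatives here, mirroring the use of rejected-but-extracted pairs as negative samples in later rounds.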

Core yes-and Collection from Spontaneanation
We select the Spontaneanation podcast as a source of concentrated yes-ands for its relatively noise-free recording quality and its large volume of high-quality, broad-domain improv dialogue. Each episode of this podcast includes an approximately 30-minute-long improv session performed by professional improvisers. Over its 201 episodes, we identified a total of 43K lines of useful spoken dialogue. Given the confluence of a lack of objective reality and uninterrupted multiturn dialogue, the improvisers mostly abide by the yes-and principle, and therefore Spontaneanation is a rich resource for natural, high-quality yes-ands. As it exists only in audio form, and automatic transcription services are too noisy for high quality annotation use, we ask Amazon Mechanical Turk workers (Turkers) to listen to the improv sessions, view Amazon Transcribe preliminary transcriptions, and re-transcribe all of the yes-ands that they hear using our transcription interface, shown in Figure 3. The interface is based on oTranscribe, an open-source transcription service. Although the quality of the automatic transcriptions is poor, we find that including them assists the Turkers in identifying speaker turns and in understanding parts that would otherwise be incomprehensible without supporting context.

Recruiting Quality Crowdworkers for Difficult Annotation Tasks
One of the main challenges for the data collection process is to recruit competent Turkers who are able to develop a good understanding of the yes-and principle. We actively recruit potential annotators to our task by inviting denizens of the subreddit TurkerNation, rather than simply inviting workers through Amazon's native task posting interface based on HIT approval rate and total number of HITs approved. Our approach enables more human-level engagement, making it easier to determine Turkers' English fluency and experience with improv. To ensure their competence, Turkers first read yes-and guidelines (in the appendix) and then demonstrate their level of understanding through qualification Human Intelligence Tasks (HITs), which test whether the candidates can identify if a yes-and exists in a 30 second audio segment and transcribe it if there is one. Even after inviting Turkers for the actual HIT of transcribing yes-ands, we frequently monitor the quality of the data they collect and provide feedback for incorrectly identified yes-ands. Apart from base pay for each episode they work on, we provide incentives for extracting more yes-ands. The pay for our HITs averages well above California minimum wage. From all of the episodes, we extract 10,959 yes-ands, indicating about 25% of the total number of dialogue turns in Spontaneanation are yes-ands.
Table 1: Iterative data collection results over Cornell. + indicates yes-ands and - indicates non-yes-ands. These counts exclude 500 turns collected from each of Spontaneanation and Cornell to form the validation set. The New Extraction Volume row indicates the new number of yes-and candidates identified with the given confidence threshold, and the New Proportion of yes-and row shows as a percentage how many of these candidates were indeed evaluated as yes-ands by Turkers. The proportion of yes-ands increases after each iteration despite the lower confidence threshold used to filter the new predictions with the updated classifier.

Guided Extraction from the Cornell Movie-Dialogs Corpus
Although this collection is larger than any improv corpus, let alone any yes-and corpus, known to date, we seek to increase our corpus volume beyond 10,959 turn pairs. The Cornell Movie-Dialogs Corpus (Danescu-Niculescu-Mizil and Lee, 2011; hereafter Cornell) contains 304,713 turns, nearly an order of magnitude more than Spontaneanation, and it is one of the closest in domain to improv among existing dialogue datasets. However, a sample annotation of 300 randomly selected turn pairs by Turkers reveals that only 11.1% of pairs are yes-ands. We thus use the already-collected yes-ands to probe Cornell for likely candidates, to speed the search process. Recent developments of language models pre-trained on massive text data enable the training of high-accuracy models for downstream tasks even with a small number of samples, by leveraging the contextualized embeddings that these models learn (Devlin et al., 2019; Radford et al., 2019). We thus fine-tune an initial BERT-based sequence classifier, based on the implementation of Wolf et al. (2019a), with the yes-ands from the Spontaneanation episodes to determine if a given dialogue pair is a yes-and, using a high threshold (initially, a 95% probability of being yes-and) to bias for precision. We ask Turkers to validate the turn pairs identified by the classifier and add the validated pairs to our yes-and corpus. This procedure can be iterated.
For the first iteration, we train the classifier with a balanced number of non-yes-ands chosen by random sampling from Cornell; treating random samples as negatives is a reasonable assumption given the relatively low concentration of yes-ands observed. The same Turkers that extracted yes-ands from Spontaneanation are invited to validate the yes-and candidates selected by the classifier using the interface shown in Figure 4. In order to ensure consistent annotation standards among Turkers, they are given a small number of overlapping HITs, on which we measure their agreement. For 90 samples of unfiltered yes-and candidates from Cornell, the two workers yield a reasonably high Cohen's κ value of 0.74. Turkers are paid at rates consistent with their rates on the extraction-from-Spontaneanation task.
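For readers unfamiliar with the agreement statistic reported above, Cohen's κ compares the observed agreement p_o between two annotators with the agreement p_e expected by chance from their label frequencies: κ = (p_o − p_e) / (1 − p_e). The following is a minimal sketch of that computation; the function name and the toy labels are illustrative assumptions and are not drawn from the paper's annotation data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in set(counts_a) | set(counts_b)) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy example: 1 = yes-and, 0 = non-yes-and.
kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

In the toy example the annotators agree on three of four items (p_o = 0.75) while chance agreement is p_e = 0.5, giving κ = 0.5.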
After the set of Cornell yes-and candidates is validated, the yes-ands and non-yes-ands are added to the training set to train a new classifier, and the same process is repeated. We hold out 500 dialogue pairs from each subcategory (e.g., Spontaneanation yes-ands) as the development set for monitoring the classifier's performance after each iteration. We incrementally lower the classification threshold for choosing a yes-and candidate as the classifier improves. On each iteration except the first, we set this threshold by retrospectively evaluating the classifier on the Turker-assigned labels of the yes-and candidates from previous iterations. The threshold with the highest F1 score is chosen to filter new yes-and candidates to be validated.
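The threshold selection described above can be illustrated with a small sweep. This is a sketch under assumptions: the function names, the candidate grid, and the toy probabilities are hypothetical, and the paper does not state which threshold values were searched.

```python
from typing import Sequence


def f1_at_threshold(probs: Sequence[float], labels: Sequence[bool], threshold: float) -> float:
    """F1 of the rule 'predict yes-and when probability >= threshold'."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and not y)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(probs: Sequence[float], labels: Sequence[bool], grid: Sequence[float]) -> float:
    """Return the grid value with the highest F1 (first one wins on ties)."""
    return max(grid, key=lambda t: f1_at_threshold(probs, labels, t))


# Toy data: classifier probabilities and validated labels for earlier candidates.
probs = [0.99, 0.9, 0.8, 0.6, 0.4]
labels = [True, True, False, True, False]
chosen = best_threshold(probs, labels, grid=[0.95, 0.85, 0.7, 0.5])
```

On this toy data the lowest threshold wins because it recovers every validated yes-and at the cost of one false positive; with real data the trade-off depends on how well calibrated the updated classifier is.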
We balance each progressively larger corpus with negative sample turn pairs, which are either randomly selected from Cornell (round 1), selected from the rejected-but-extracted turn pairs from Cornell (round 2 and later), or sampled from non-yes-and turn pairs in Spontaneanation formed by random coupling of prompts and responses of the Spontaneanation yes-ands (round 3 and later). The last of these forces the classifier to make decisions based on semantic features relevant to a yes-and instead of only stylometric features in Spontaneanation yes-ands. We stop this iterative process after four rounds, when fewer than 5,000 new yes-and candidates are identified by the classifier, yielding a total corpus size of 26,435 yes-ands and 23,938 negative samples. An overview of this iterative process is given in Table 1. The negative sampling procedure, while somewhat ad hoc, ultimately provides a mix of turn pairs from both corpora that is sufficient to allow extraction of yes-ands from new corpora at high precision rates, and is sufficient for our goals.
Turkers are asked to correct minor errors in grammar, spelling, and punctuation for qualifying yes-and candidates, which are then categorized as 'Typo/Fix.'
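One way to realize the random coupling of prompts and responses mentioned above is to pair every prompt with the response of a different yes-and. The sketch below is an assumption about how this could be done, not the authors' procedure; the function name and the fixed seed are illustrative.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]  # (prompt, response)


def random_coupling_negatives(yes_ands: Sequence[Pair], seed: int = 0) -> List[Pair]:
    """Build negatives by giving each prompt the response of another pair."""
    n = len(yes_ands)
    if n < 2:
        raise ValueError("need at least two pairs to re-couple")
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    # A non-zero cyclic shift over the shuffled order guarantees that no
    # prompt is matched with the response from its own pair.
    shift = rng.randrange(1, n)
    negatives = []
    for position, index in enumerate(order):
        other = order[(position + shift) % n]
        negatives.append((yes_ands[index][0], yes_ands[other][1]))
    return negatives
```

Because both halves of every negative come from genuine improv turns, such pairs share the surface style of the positives and differ mainly in coherence. A re-coupled pair can occasionally be coherent by accident, so this yields noisy rather than guaranteed negatives.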

Additional Notes on yes-and Criteria
Although the concept of a yes-and is easy to define and understand, there are borderline cases between a yes-and and a non-yes-and that make the validation phase more difficult than originally expected. One of the cases that confused Turkers in the earlier stages of data collection is the case of yes-buts. A yes-but is a yes-and with a response that is coherent with the provided reality, but does not appear to provide an affirmative acceptance of a suggestion given in the prompt. These are different from contradictions that do not align with the previously established reality. In addition, there are instances where the response is a yes-and, but is accepted by a speaker other than the one to whom the prompt is directed. Some yes-and responses initiate a repair of a problem encountered while accepting the prompt, due to a confusion or a possible inconsistency, by asking for clarification (Clark and Schaefer, 1989). While these responses may not strictly establish more detail, they provide information for ultimately establishing new information. We elide these edge cases under the umbrella category yes-and in SPOLIN as they further our top-level goal of providing relevant, actionable turn responses. Examples of some of these subtle differences are shown in Table 2.

Dataset Analysis
In order to provide a better understanding of the characteristics of our corpus, we annotate 200 yes-ands and 200 non-yes-ands in SPOLIN's development set to categorize them into specific yes-and or non-yes-and types.
We classify yes-ands into explicit yes-ands, implicit yes-ands, or yes-buts. Only 15% of all yes-ands are explicit yes-ands, containing phrases such as "Yeah" or "Sure" that reflect agreement. Even with such phrases, identifying explicit yes-ands is not a trivial task because it requires semantic understanding of the relevance of the context established in the prompt and that introduced in the response. In fact, there are non-yes-ands that contain phrases affirming agreement but have no contributions or have new contributions that lack relevance. The majority (78%) of yes-ands are implicit yes-ands, meaning that the agreement is implied, often in a subtle manner. The remaining 7% are yes-buts.
Non-yes-ands are divided into contradictions and others. Most of the non-yes-ands are others, as only 5% of candidates extracted from Cornell are contradictions, which are dialogue pairs with a response that actively negates the reality in the prompt. Others encompass any dialogue pairs with a response that lacks coherence with the prompt or adds no or minimal contributions. The distribution and examples of different types of yes-ands and non-yes-ands are shown in Table 2.
Table 2: Examples and proportions of yes-and and non-yes-and types from annotations of 200 yes-ands and non-yes-ands in SPOLIN's development set. Determining whether a given dialogue pair is a yes-and or not is a non-trivial task, as the agreement or contradiction of the previous dialogue turn's context is usually implicit.

The main focus of our work is on yes-ands, but we provide non-yes-ands as part of SPOLIN for those interested in training their own classifiers. The negative samples are collected using the methods described in Section 2.2. The composition details of SPOLIN are shown in Table 3.

Experiments
To evaluate the effect of SPOLIN on generating yes-and responses and thus improving generated dialogue quality, we train a common architecture with a variety of fine-tuning data configurations, both with and without SPOLIN. Specifically, for each data configuration we fine-tune a double-head GPT-2 model (the 117M-parameter version, based on the implementation by Wolf et al. (2019b)), which achieved state-of-the-art performance on Persona-chat for the ConvAI-2 dialogue competition (Zhang et al., 2018). We fine-tune the models using two learning objectives, which we weight equally in calculating loss:
1. Predicting the next word.
2. Predicting the next correct candidate that best fits the dialogue given the dialogue history.
The language modeling component uses pretrained weights from OpenAI, while the candidate classification head is trained from scratch. For evaluation, we use the language modeling component of the fine-tuned model to generate single-turn responses for the yes-and prompts in the development set. For decoding, we use nucleus sampling (Holtzman et al., 2020): at each step we keep only the smallest set of most probable tokens whose cumulative probability exceeds 0.9, and choose the next token from this set by multinomial sampling.
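The decoding step can be made concrete with a small, framework-free sketch of nucleus (top-p) sampling over a single next-token distribution. The function names and the toy distribution are illustrative assumptions; a real decoder would apply the same truncation to the model's softmax output at every step.

```python
import random
from typing import List, Optional, Sequence


def nucleus_indices(probs: Sequence[float], top_p: float = 0.9) -> List[int]:
    """Smallest set of highest-probability token ids whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for token_id in ranked:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= top_p:
            break
    return kept


def nucleus_sample(
    probs: Sequence[float], top_p: float = 0.9, rng: Optional[random.Random] = None
) -> int:
    """Draw one token id from the nucleus, proportionally to its probability."""
    rng = rng or random.Random()
    kept = nucleus_indices(probs, top_p)
    weights = [probs[i] for i in kept]  # random.choices renormalizes the weights
    return rng.choices(kept, weights=weights, k=1)[0]


# Toy next-token distribution over a four-token vocabulary.
toy_probs = [0.5, 0.3, 0.15, 0.05]
token = nucleus_sample(toy_probs, top_p=0.9, rng=random.Random(0))
```

With `top_p=0.9`, the least likely token (id 3) falls outside the nucleus and can never be sampled, which is how the method trims the unreliable tail of the distribution while still allowing varied responses.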

Data Configurations
For our experiments, we use several established dialogue datasets as baselines, namely Persona-chat (Zhang et al., 2018), Cornell (Danescu-Niculescu-Mizil and Lee, 2011) (the unfiltered corpus out of which we extract yes-ands, as described in Section 2.2), and DailyDialog (Li et al., 2017b). Each of these is an English-language open-domain casual conversation corpus with 100k-300k turns. For each of these datasets, we either simply fine-tune on that dataset, or fine-tune and then further fine-tune with SPOLIN. In another configuration, we also fine-tune directly with SPOLIN on top of GPT-2. The original GPT-2 implementation prepends the personalities given in Persona-chat to the dialogue sequence input before tokenization. For fine-tuning to datasets apart from Persona-chat, we simply do not prepend any auxiliary information to the dialogue sequence input.
Figure 5: Interface used by human evaluators to rank responses based on their quality as a yes-and, where a rank of 1 is most preferred. The correct ranking is shown for this example. The option ranked 1 is a yes-but: it does not reject a reality but rather rejects a suggestion and provides refining information that is most coherent to the prompt.

Human Evaluation
Automatic metrics that rely on n-gram overlap, such as BLEU, ROUGE, and METEOR, are often used for generative models when there is little variability in the target output (Papineni et al., 2002; Lin, 2004; Banerjee and Lavie, 2005). However, there can be a wide variety of responses that qualify as a good yes-and, a problem common to open-domain generation tasks. An adequate evaluation of our models requires assessing the main yes-and criteria: agreement with the context and the quality of the new relevant contribution, neither of which can be assessed with the aforementioned metrics. Therefore, we ask human evaluators to compare the quality of the yes-ands generated by various models and the actual response to the prompt in SPOLIN that is used as the input.
We ask human evaluators to rank a set of four responses given a prompt, comparing the responses of a model trained only with SPOLIN, a model trained with an existing dialogue corpus, a model trained with both, and the actual response pair from the development set, denoted as "Gold." These four responses are randomly ordered for each question, as shown in Figure 5, so that evaluators cannot develop a bias toward a particular position in the list. The evaluators are permitted to provide the same rank for different responses if they are equal in quality. This evaluation set contains 100 such prompts, and each is evaluated twice by different evaluators. The results of the average ranking and some of the examples generated by the models are shown in Table 4.
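The average ranking described above is a plain mean of the ranks each system receives over all prompts and evaluators, with tied ranks counted as given. A minimal sketch follows; the system labels and the two toy judgments are made up for illustration and are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, List


def average_ranks(judgments: List[Dict[str, int]]) -> Dict[str, float]:
    """Mean rank per system; each judgment maps system name -> rank (1 is best)."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for ranking in judgments:
        for system, rank in ranking.items():
            totals[system] += rank
            counts[system] += 1
    return {system: totals[system] / counts[system] for system in totals}


# Two hypothetical judgments of the same prompt; ties are allowed.
toy_judgments = [
    {"Gold": 1, "SPOLIN": 2, "Baseline": 4, "Baseline+SPOLIN": 2},
    {"Gold": 1, "SPOLIN": 1, "Baseline": 3, "Baseline+SPOLIN": 2},
]
toy_average = average_ranks(toy_judgments)
```

A lower average rank means the system's responses were preferred more often.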
Results show that models trained only with SPOLIN or with SPOLIN and another dialogue dataset are preferred to the models trained only with another dialogue dataset, although in the case of DailyDialog, the average ranking improves only by at most 0.06 after fine-tuning with SPOLIN. However, even the responses generated by models trained with SPOLIN are not ranked as well as the actual responses in the development set, indicating our models are still inferior to professional human improviser quality.

Extracting from Other Corpora
The approach to classifier-based mining we describe in Section 2.2 can naturally be applied to other dialogue corpora. We thus next consider mining the gigantic (441M sentence) OpenSubtitles (Lison and Tiedemann, 2016) collection. As OpenSubtitles contains undesirable material, such as subtitles for media with minimal dialogue, we instead mine from the (3.3M sentence) SubTle corpus (Ameixa et al., 2013), a preprocessed subset of OpenSubtitles that heuristically combines subtitle sequences into dialogue form.
By iterating through half of this corpus, we collect more than 40,000 yes-ands from it alone, which, when added to SPOLIN, yields what we call SPOLIN-extended, which contains about 68,000 yes-ands, more than 2.5 times the size of the core SPOLIN. Heuristics for finding alternations mean that SubTle's utterances are shorter than those in Spontaneanation and Cornell, so once the proportion of utterances longer than the average length in Spontaneanation and Cornell (18.5 words) is less than 40%, we stop further collection in the remainder of the dataset. SPOLIN-extended is available in the same public repository as SPOLIN. Details of the iterative process as applied to SubTle are in the appendix.
Dataset | Source | Size*
Persona-chat (Zhang et al., 2018) | Crowdsourced | 162K
The Ubuntu Dialogue Corpus (Lowe et al., 2015) | Ubuntu chat logs | 7M
Twitter Triple Conversations | Social media | 6K
OpenSubtitles (Lison and Tiedemann, 2016) | Subtitles | 441M sentences
SubTle (Eng) (Ameixa et al., 2013) | Subtitles | 3.3M pairs
London-Lund Corpus (Greenbaum and Svartvik, 1990) | Various sources | 500K words
London-Lund Corpus 2 (Põldvere et al., 2017) | Various sources | 500K words
SPOLIN (yes-and only) | Improv, Movie scripts | 26K pairs
SPOLIN-extended (yes-and only) | Improv, Movie scripts, subtitles | 68K pairs
Table 5: A survey of publicly available English language text-based corpora frequently used for open-domain dialogue systems. The last two rows are our contribution. * Size is measured as the number of total utterances (dialogue turns) unless otherwise specified.
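The length-based stopping rule used for SubTle can be sketched as a simple check. The text does not specify exactly which utterances are measured or how words are counted, so the whitespace tokenization, the function names, and the idea of applying the check to a batch of utterances are all assumptions made for illustration.

```python
from typing import Sequence


def proportion_longer(utterances: Sequence[str], min_words: float = 18.5) -> float:
    """Fraction of utterances with more than min_words whitespace-separated words."""
    if not utterances:
        return 0.0
    longer = sum(1 for text in utterances if len(text.split()) > min_words)
    return longer / len(utterances)


def should_stop(
    utterances: Sequence[str], min_words: float = 18.5, min_proportion: float = 0.40
) -> bool:
    """Stop collecting once fewer than min_proportion of utterances are long."""
    return proportion_longer(utterances, min_words) < min_proportion
```

For example, a batch in which only 30% of utterances exceed 18.5 words would trigger the stop, while a batch with 50% long utterances would not.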

Related Work
Many works have identified the same issues of repetitive or non-committal responses generated by neural conversational systems that are at least partially related to the lack of sufficiently high quality yes-ands we deal with in this work; approaches that mitigate these problems vary. The majority of recent works focus on diversifying the responses by modifying the training and decoding objectives (Li et al., 2016a,b, 2017a, 2016c; Xu et al., 2017; Shao et al., 2017). Other methods introduce latent variables to encourage diversity (Serban et al., 2017; Zhao et al., 2017). Some explore methods of re-weighting training instances that encourage diversity (Liu et al., 2018; Lison and Bibauw, 2017; Du and Black, 2019). Our approach is complementary to all the model-based approaches described here, as it simply deals with the production of a particularly useful corpus that can be used for fine-tuning on top of these methods.
We provide a survey of publicly available text-based datasets frequently used for open-domain dialogue systems and discuss their limitations for our purpose of generating grounded responses (see Table 5 for an overview).
DailyDialog is a collection of multi-turn dialogue with manually annotated emotion and intent labels (Li et al., 2017b). Danescu-Niculescu-Mizil and Lee (2011) created the Cornell Movie-Dialogs Corpus, a compilation of dialogue sequences paired with metadata about the movie and characters. Persona-chat provides dialogue sequences coupled with corresponding personas (Zhang et al., 2018).
The Ubuntu Dialogue Corpus contains 1 million dialogue turns extracted from Ubuntu chat logs, which discuss Ubuntu-related technical support (Lowe et al., 2015). The Twitter Triple Corpus is a dataset of 4K dialogue triples extracted from Twitter. OpenSubtitles is a huge collection of subtitles that span various genres, but the absence of speaker turn annotations makes it difficult to modify into dialogue format (Lison and Tiedemann, 2016). Ameixa et al. (2013) use heuristics to reformat OpenSubtitles into dialogues with some limited success. Clark and Schaefer (1989) illustrate grounding in conversations with examples from the London-Lund Corpus (Greenbaum and Svartvik, 1990), a corpus of full conversations annotated with prosodic and paralinguistic features. A second version of the corpus was compiled with the same annotation standards as the first using more recent spoken and text data (Põldvere et al., 2017).
These corpora were not collected with the criteria for yes-ands in mind. Even datasets whose dialogue takes place in a domain similar to improv naturally contain only a small proportion of yes-ands. However, the relatively large sizes of these datasets still make them useful for dialogue systems. They can be used effectively for grounded conversations if the yes-ands or other desirable dialogue acts can be extracted or given higher weights in training to enforce their characteristics in the responses generated.
Our data collection approach is similar to the method of Yarowsky (1995), which formalizes the bootstrapping mechanism of iteratively improving a classifier and labeling unlabeled data. The main difference between the Yarowsky algorithm and our approach is that, rather than using a fully automated process for increasing training data, we use a probability threshold to regulate recall, followed by human judgment to ensure high precision.
Apart from Clark and Schaefer (1989) there have been other taxonomies of grounding. For example, Traum (1999) considers six categories; among these are acknowledge and continue, which, taken together, map nicely to yes-and. Magerko et al. (2009) and Fuller and Magerko (2010) note the importance of establishing common ground in improv.

Conclusion
Inspired by yes-ands in improv, we carefully construct SPOLIN, a collection of dialogue pairs with responses that are not only coherent with dialogue context but also initiate the next relevant contribution. We extract high-quality yes-ands from Spontaneanation and build a classifier with them, which is then used to mine additional yes-ands from the Cornell Movie-Dialogs Corpus. We further use our mining technique to elicit a corpus of more than 68,000 yes-and turn pairs, easily the largest collection of this dialogue act known to exist. From human evaluations of dialogue models trained with various data configurations we find that SPOLIN is useful: when we include it, we are able to build models that can generate yes-ands more consistently than when we leave it out. Nevertheless, our models are still inferior at producing good yes-ands when compared to professional improvisers. We plan to continue our data-driven approach for grounded conversations by expanding our dataset through our iterative data collection process with other larger text-based open-domain dialogue corpora and extend our work to model and collect longer conversations exhibiting more complex improv-backed turns.
Table 6: [...] Table 1 with the extended version of SPOLIN that includes extracted yes-ands from SubTle. SubTle is collected from the fourth iteration onwards. *Statistics for Cornell/SubTle are shown separately. The same classifier is used for extracting candidates from Cornell and SubTle, but they are datasets with significantly different characteristics.

A.1 yes-and Guidelines for Turkers
We provide detailed annotation guidelines, shown in Figures 6-9, to the Turkers as a result of having continuous discussions with them and monitoring their submissions. Contrary to our expectations, it is difficult to make a binary decision on whether a dialogue turn is a yes-and or non-yes-and, and therefore these fine-grained details are crucial for collecting yes-ands in SPOLIN.

A.2 Iterative data collection results for SubTle
Due to SubTle's relatively large size, we split SubTle into 20 equal blocks that each contain 10,486 dialogue turns. Note that each successive iteration on SubTle is performed not on the same block but on the next block. This is different from Cornell, for which every iteration is on the same set of dialogue turns. This difference is not due to any characteristics of the dataset but to practical reasons arising from the size of the SubTle corpus.
The first extraction proportion for SubTle is low because of the prevalence of self-yes-ands in this corpus. Self-yes-ands are prompt and response pairs that evidently originate from the same speaker but otherwise meet the criteria of a yes-and. There are many incorrectly combined dialogue turns that actually come from the same speaker because of the heuristics employed for building SubTle. By providing labeled self-yes-ands as negative samples, the classifier quickly learns to remove these self-yes-ands, leading to a significantly higher proportion of yes-ands in subsequent iterations. This is demonstrated in the specifics of the additional iterations, which are shown in Table 6.