Adding Chit-Chat to Enhance Task-Oriented Dialogues

Existing dialogue corpora and models are typically designed under two disjoint motives: while task-oriented systems focus on achieving functional goals (e.g., booking hotels), open-domain chatbots aim at making socially engaging conversations. In this work, we propose to integrate both types of systems by Adding Chit-Chat to ENhance Task-ORiented dialogues (ACCENTOR), with the goal of making virtual assistant conversations more engaging and interactive. Specifically, we propose a Human <-> AI collaborative data collection approach for generating diverse chit-chat responses to augment task-oriented dialogues with minimal annotation effort. We then present our new chit-chat-based annotations to 23.8K dialogues from two popular task-oriented datasets (Schema-Guided Dialogue and MultiWOZ 2.1) and demonstrate their advantage over the originals via human evaluation. Lastly, we propose three new models for adding chit-chat to task-oriented dialogues, explicitly trained to predict user goals and to generate contextually relevant chit-chat responses. Automatic and human evaluations show that, compared with the state-of-the-art task-oriented baseline, our models can code-switch between task and chit-chat to be more engaging, interesting, knowledgeable, and humanlike, while maintaining competitive task performance.


Introduction
With modeling innovations, increasing computing power, and a growing number of datasets, recent years have witnessed significant improvements in the performance of both task-oriented dialogue systems and chit-chat systems (Adiwardana et al., 2020; Hosseini-Asl et al., 2020; Peng et al., 2020a). Most research on dialogue systems focuses on a particular type of dialogue system. Work on task-oriented dialogue systems typically aims to track user goals with higher accuracy to better achieve functional goals (Rastogi et al., 2020), at the cost of not paying explicit attention to user experience, such as making the conversation more engaging; the latter is usually the target of research on chit-chat systems. In this work, we step forward and propose to integrate both types of systems by Adding Chit-Chat to ENhance Task-ORiented dialogues (ACCENTOR), aiming to have a virtual assistant capable not only of performing various complex tasks such as checking the weather, booking hotels, and finding restaurants, but also of incorporating casual and contextually relevant chit-chat. We hypothesize that the added chit-chat can make the assistant appear more social, personable, and engaging, without being misleading or inappropriate, compared with existing task-oriented dialogue systems.
* Work done as a research intern at Facebook. The code and data are available at https://github.com/facebookresearch/accentor.
To show the feasibility of ACCENTOR and gather supervisory data for follow-up research, we propose a Human↔AI collaborative data construction approach that can effectively add suitable chit-chat to the beginning or end of system responses in existing task-oriented dialogue datasets. Specifically, we first generate chit-chat candidates for augmentation using off-the-shelf pre-trained language models and open-domain chatbots (Section 2.1). Next, we automatically filter out candidates that are unlikely to be of good quality using a filter model (Section 2.2). Finally, human annotators label each of the remaining candidates as good or bad, with justifications (Section 2.3). We augment the Schema-Guided Dialogue (SGD) (Rastogi et al., 2020) and MultiWOZ 2.1  corpora using the proposed approach. (See Figure 1 or Appendix A.4 for examples.) We employ ACUTE-Eval  to compare the augmented versions with the originals along four axes: engagingness, interestingness, knowledge, and humanness. We find that the augmented dialogues are consistently preferred by human judges across the four axes for both datasets (Section 4.1).
In addition, we propose and evaluate three models for adding chit-chat to task-oriented dialogues, including an end-to-end model and two code-switcher models built upon off-the-shelf task-oriented and chit-chat systems (Section 3). Compared with the baseline model trained with the original unaugmented data, our models trained with the augmented version can generate significantly higher-rated responses in terms of human preference while maintaining competitive task performance in goal tracking accuracy and action decision F1 (Section 4.2).
Our main contributions are: we propose (1) a data augmentation approach for generating diverse chit-chat supervisory data for task-oriented dialogues, leveraging pre-trained generative models and a custom filter model to minimize human annotation effort; (2) new versions of the popular task-oriented datasets, SGD and MultiWOZ 2.1, with newly added chit-chat annotations to 23.8K dialogues; and (3) three integrated chit-chat and task-oriented neural dialogue models for the above, substantially outperforming the state-of-the-art approach in terms of human evaluation of engagingness, interestingness, knowledge, and humanness. To our knowledge, we are the first to propose an annotated dataset and models that study explicit code-switching between full-stack task-oriented dialogues and free-form chit-chat responses.

Data Construction
In this section, we describe an approach to gather supervisory data for adding contextually relevant chit-chat to task-oriented dialogues. Our approach needs minimal annotation effort to augment suitable and diverse chit-chat add-ons that are not available in existing task-oriented datasets (Section 5.1). We primarily report results based on dialogues from the SGD dataset in this study, because it is the largest task-oriented dialogue dataset and is generally cleaner compared with most other task-oriented dialogue datasets. However, our approach is flexible and thus not limited to dialogues from a particular task-oriented dataset (Section 4.1). Figure 2 shows the overview of our approach.
Figure 2: Data construction overview: (a) We generate diverse free-form chit-chat candidates using the state-of-the-art pre-trained generative models to augment original task-oriented dialogues, and (b) filter out bad candidates using the custom filter to minimize annotation effort. (c) Crowd workers annotate contextually relevant chit-chat augmentation with justifications.

Candidate Generation
Given a task-oriented dialogue $D = \{u_1, s_1, u_2, s_2, \ldots, u_n, s_n\}$, where $u_{1 \ldots n}$ and $s_{1 \ldots n}$ represent user turns and system turns, respectively, we generate chit-chat candidates for augmenting $s_i$ in two ways: (i) pass $u_1, s_1, \ldots, u_i, s_i$ to an off-the-shelf pre-trained model (a language model or a chit-chat chatbot) and let the model add tokens to the end of $s_i$; (ii) pass $u_1, s_1, \ldots, u_i$ to a pre-trained model and let the model generate a turn. We regard the output of (i) and (ii) as a chit-chat candidate to be appended and prepended to $s_i$, respectively. If a chit-chat candidate consists of multiple sentences, we also regard each individual sentence as a chit-chat candidate. We run differently sized GPT-2 (Radford et al., 2019) and BlenderBot with various decoding parameters as the pre-trained model and generate an average of 175.5 candidates for each of the dialogues from the SGD dataset. See Appendix A.1 for configuration details.
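The two generation modes and the sentence-level splitting described above can be sketched as follows. This is a minimal illustration, not the released implementation: the `Generator` interface, the whitespace joining of turns, and the naive regex sentence splitter are all assumptions introduced here, and any GPT-2 or BlenderBot wrapper that maps a text prompt to generated text could be plugged in.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical interface: any function mapping a text prompt to generated text.
Generator = Callable[[str], str]


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter (an assumption; the paper does not specify one)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def generate_candidates(
    turns: List[str], i: int, generate: Generator
) -> List[Tuple[str, str]]:
    """Return (position, candidate) pairs for the i-th system turn (1-indexed).

    `turns` is the flat list [u_1, s_1, ..., u_n, s_n].
    """
    upto_user = turns[: 2 * i - 1]  # u_1, s_1, ..., u_i
    upto_system = turns[: 2 * i]    # u_1, s_1, ..., u_i, s_i
    outputs = [
        # (i) continue after s_i: candidate to be appended to s_i
        ("append", generate(" ".join(upto_system))),
        # (ii) generate a fresh turn after u_i: candidate to be prepended to s_i
        ("prepend", generate(" ".join(upto_user))),
    ]
    candidates: List[Tuple[str, str]] = []
    for position, text in outputs:
        text = text.strip()
        if not text:
            continue
        candidates.append((position, text))
        sentences = split_sentences(text)
        if len(sentences) > 1:
            # Each individual sentence also counts as a candidate.
            candidates.extend((position, s) for s in sentences)
    return candidates
```

In practice one would call `generate_candidates` once per model and decoding configuration and pool the results, which is how a large candidate pool per dialogue could accumulate.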

Candidate Filtering
Table 1: Who is the virtual assistant? This digital assistant is more than just a bot that spits out facts. It has access to a wide range of information, which it can express not only as factual commentaries but also as opinions and preferences. However, it is not a person and should not pretend to have real experiences or be capable of physical actions. It should be personable and personlike, without appearing counterfeit.
Opinions
  The assistant may: Express general opinions about generic, impersonal, or non-sensitive topics.
    -"I love penguins." -"There's a lot of fun stuff to do."
  The assistant should not: Express strong personal opinions, or opinions on sensitive topics.
    -"I love you." -"The President is an idiot."
Preferences
  The assistant may: Express preferences when making impersonal or non-sensitive recommendations.
    -"Their latest album wasn't as good." -"Their food is good."
  The assistant should not: Express strong dispreferences, or preferences on personal or sensitive subjects.
    -"I hated it, but you might like it." -"Invite her! I like her better."
Physical Actions
  The assistant may: Use epistemic verbs to express uncertainty or opinions, or refer through hearsay to actions that it may not perform.
    -"I hear it's beautiful." -"They say it tastes like chicken."
  The assistant should not: Behave as though it could act physically, or perform tasks outside of its role.
    -"I haven't arrived there yet." -"I can drive you there."
Experiences
  The assistant may: Refer to others' experiences or personify experiences it is capable of (e.g., reading).
    -"That sounds like a great trip!" -"I enjoyed reading that novel."
  The assistant should not: Pretend to have experiences that it is incapable of.
    -"We didn't have that when I was a kid." -"My roommate used to eat there a lot."

We examine the quality of the model-generated candidates from Section 2.1 by performing a pilot annotation ourselves on a small proportion of the candidates. The annotation results show that only about 1/10 of the candidates are suitable. Therefore, instead of directly sending the candidates to crowd workers for annotation, we first build a filter model to automatically filter out candidates that are unlikely to be of good quality, which reduces the annotation workload. The filter is a hybrid model that consists of a RoBERTa-based binary classifier (Liu et al., 2019) and a rule-based ranker. The classifier takes as input an augmented dialogue, in which we explicitly surround the added chit-chat candidate with a pair of special tokens to help the model locate the candidate. We train the classifier with 1.7K candidates that are labeled as good/bad from the pilot annotation. The rule-based ranker ranks each candidate based on (i) the posterior probability output by the binary classifier, (ii) whether the candidate matches a list of bad patterns (e.g., containing a URL), (iii) the frequency of appearances of the candidate among all generated candidates, (iv) the similarity to the other candidates for the dialogue, and (v) the similarity to the system response being augmented. While (i) and (ii) directly help evaluate the quality of the candidate, (iii), (iv), and (v) additionally help create more variety (e.g., punishing high-frequency candidates such as "You're welcome"). We keep the top ten candidates for each of the dialogues. We present more details in Appendix A.2.
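As an illustration of how the classifier input for the filter might be built, the sketch below wraps a candidate in a pair of marker tokens and places it before or after the system response. The token strings `<cand>` and `</cand>` and the whitespace joining of turns are hypothetical choices made here; the source only states that a pair of special tokens surrounds the candidate.

```python
from typing import List

# Hypothetical marker tokens; the source does not name the special tokens.
CAND_OPEN = "<cand>"
CAND_CLOSE = "</cand>"


def mark_candidate(
    history: List[str], system_response: str, candidate: str, position: str
) -> str:
    """Build one classifier input: dialogue history plus the augmented response,
    with the chit-chat candidate surrounded by marker tokens."""
    marked = f"{CAND_OPEN} {candidate} {CAND_CLOSE}"
    if position == "prepend":
        augmented = f"{marked} {system_response}"
    elif position == "append":
        augmented = f"{system_response} {marked}"
    else:
        raise ValueError(f"unknown position: {position!r}")
    return " ".join(history + [augmented])
```

The resulting string would then be tokenized and passed to the binary classifier, with the marker tokens registered as additional special tokens in the tokenizer.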

Annotation
We ask annotators (crowd workers) to label each of the remaining candidates from Section 2.2 as good or bad. Additionally, to guide the annotation process, improve the potential quality, and facilitate the candidate distribution analysis, we also ask annotators to support their annotations by choosing from four justifications that we devised based on our pilot annotation experience. Annotators can choose one, both, or neither of the following justifications for a bad candidate:
• Inappropriate: The candidate does not fit into the context (e.g., repeating, unnatural), or it contradicts the context or the role of the assistant (Table 1). This category comprises most of the commonly found bad cases such as improper switching, providing opinions or comments that are incompatible with the context, and misusing verbal routine.
• Misleading: The candidate provides additional information that is false or cannot be verified immediately. For example, the candidate (the last sentence) in the two-turn dialogue "U: I want to book a hotel room in San Diego with a check in on Thursday. A: There are over 10 hotels in San Diego. I would stay at Arlo NoMad if I were you." should be marked as misleading because "Arlo NoMad" is newly introduced information, which the annotator would have to look up to verify that a hotel by this name exists in San Diego, even though the information may be true.
Annotators can choose one, both, or neither of the following justifications for a good candidate:
• Social: The candidate keeps the conversation flowing smoothly by appropriately switching to relevant topics, asking casual follow-up questions, or engaging in social pleasantries. The design of this subcategory is inspired by the line of research that studies different social and discourse strategies in chit-chat dialogue systems (Yu et al., 2016).
• Useful: The candidate enhances the conversation by appropriately offering opinions, commentaries, or pertinent and truthful information. Truthfulness should be established by conversational context or real world knowledge. To reduce annotation workload, if annotators have to use external resources (e.g., Wikipedia, search engines, maps) to verify information, they are instructed to label the candidate as misleading instead. The design of this subcategory is inspired by the line of work on knowledge-grounded dialogue systems that study contextual knowledge injections (Dinan et al., 2019).
We instruct annotators to evaluate each candidate independently as if it were the only augmentation for its associated dialogue. We discuss the additional dimension of complexity introduced by having multiple augmentations jointly in Section 4.1. Annotation time per dialogue is 243s. The Fleiss' Kappa among crowd workers is 0.52. We view the agreement score as reasonable since whether an added chit-chat candidate leads to improved quality of a conversation can be highly subjective in many scenarios. We denote our augmented version of the SGD dataset as ACCENTOR-SGD and summarize the statistics in Table 2. We observe that the four provided justification categories provide adequate coverage of the justifications for most
Figure 3: A diagram for the proposed code-switching models. Given the dialogue context ($H_t$) and the pre-generated task-oriented and chit-chat response candidates ($T_t$, $\tilde{C}_t$), the Arranger learns the optimal code-switching sequences (discriminative), while the Rewriter outputs free-form paraphrases (generative).

Task Formulations
Since oracle information (i.e., oracle belief states and oracle action decisions) is not available in practical use and the SGD dataset does not have the associated database (i.e., a table of possible entities) released, we focus on exploring the end-to-end setting in which we generate delexicalized task-oriented responses without using oracle information and database search results following Hosseini-Asl et al. (2020). Given dialogue history (i.e., previous turns) as context, the goal of the model for each system turn is to accurately generate belief states (i.e., a list of (domain, slot, value) triplets), action decisions (i.e., a list of (domain, action_type, slot) triplets), and a corresponding system response that is functionally accurate and socially engaging.

Models
We re-implement SimpleTOD (Hosseini-Asl et al., 2020) as our main baseline model, which is a state-of-the-art model in the end-to-end setting we explore. In addition, we propose an extension of SimpleTOD that incorporates chit-chat acts, as well as two new models (Arranger and Rewriter; Figure 3) that code-switch between chit-chat and task-oriented responses more explicitly.
SimpleTOD. It is a causal language model that models the joint probability over the concatenation of dialogue history $H_t$, belief states $B_t$, action decisions $A_t$, and a task-oriented response $T_t$ for each turn $t$. During inference, the model takes as input $H_t$ and generates $B_t$, $A_t$, and $T_t$. We refer readers to Hosseini-Asl et al. (2020) for more details.
SimpleTOD+. We extend SimpleTOD by introducing a special new dialogue action, chit-chat, and good chit-chat candidates into the construction of input sequences during training. Specifically, let $C_t^+$ denote the set of good candidates for system turn $t$. If $C_t^+$ is empty, we construct the same training sequence as SimpleTOD. Otherwise, for each $C_t \in C_t^+$ that is labeled as a candidate to be prepended (resp. appended) to the turn, we use the concatenation of $H_t$, $B_t$, [chit-chat], $A_t$, $C_t$, and $T_t$ (resp. $H_t$, $B_t$, $A_t$, [chit-chat], $T_t$, and $C_t$) as a training sequence.
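The construction of SimpleTOD+ training sequences can be illustrated with a short sketch. It assumes that each component ($H_t$, $B_t$, $A_t$, $T_t$, $C_t$) has already been serialized to a string and that segments are joined with single spaces; the serialization format and any segment delimiter tokens used in the actual implementation are not specified here, so those details are assumptions.

```python
from typing import List, Tuple

CHIT_CHAT_ACT = "[chit-chat]"  # the special dialogue action named in the text


def build_training_sequences(
    history: str,
    belief: str,
    actions: str,
    response: str,
    good_candidates: List[Tuple[str, str]],
) -> List[str]:
    """Return training sequences for one system turn.

    `good_candidates` holds (candidate_text, position) pairs, where position
    is "prepend" or "append". An empty list falls back to the SimpleTOD format.
    """
    if not good_candidates:
        return [" ".join([history, belief, actions, response])]
    sequences = []
    for text, position in good_candidates:
        if position == "prepend":
            # H_t, B_t, [chit-chat], A_t, C_t, T_t
            parts = [history, belief, CHIT_CHAT_ACT, actions, text, response]
        elif position == "append":
            # H_t, B_t, A_t, [chit-chat], T_t, C_t
            parts = [history, belief, actions, CHIT_CHAT_ACT, response, text]
        else:
            raise ValueError(f"unknown position: {position!r}")
        sequences.append(" ".join(parts))
    return sequences
```

Note that a turn with several good candidates yields several training sequences, one per candidate, all sharing the same history, belief states, and actions.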
Arranger. This model arranges the output of an off-the-shelf task-oriented dialogue model and an off-the-shelf chit-chat model without intervening in the task. It outputs the belief states and action decisions generated by the task-oriented model without modification. To generate a response for each system turn $t$, this model takes as input (i) dialogue history $H_t$, (ii) a chit-chat response $\tilde{C}_t$ generated by the chit-chat model based on $H_t$, and (iii) a task-oriented response $T_t$ generated by the task-oriented dialogue model based on $H_t$. The model chooses one of the following as the response: $\tilde{C}_t$ followed by $T_t$, $T_t$ followed by $\tilde{C}_t$, and $T_t$ only. Specifically, the model encodes the concatenation of $H_t$ and each of these three responses by a RoBERTa encoder (Liu et al., 2019) and passes the resulting representations through a linear plus softmax layer to make the choice. To train the model, we form training instances by regarding each chit-chat candidate for turn $t$ from the training set of ACCENTOR-SGD as $\tilde{C}_t$ and the ground-truth task-oriented response as $T_t$ and setting the target choice based on the label (i.e., good/bad) and position (i.e., beginning/end of the response) of the candidate.
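A minimal sketch of how Arranger training instances might be formed is given below. The mapping from annotation to target is an assumption inferred from the description: a good candidate at the beginning maps to "chit-chat then task", a good candidate at the end maps to "task then chit-chat", and a bad candidate maps to "task only". The arrangement names and the whitespace joining are illustrative, and the encoder and softmax layer are omitted.

```python
from typing import List, Tuple

# The three response arrangements the Arranger chooses among.
ARRANGEMENTS = ("chat_then_task", "task_then_chat", "task_only")


def arrange(task_response: str, chat_response: str) -> List[str]:
    """Return the three candidate responses in the order of ARRANGEMENTS."""
    return [
        f"{chat_response} {task_response}",
        f"{task_response} {chat_response}",
        task_response,
    ]


def target_choice(label: str, position: str) -> int:
    """Map an annotation (good/bad, beginning/end) to a target index.

    Assumption: bad candidates map to the task-only arrangement.
    """
    if label == "bad":
        return ARRANGEMENTS.index("task_only")
    if label != "good":
        raise ValueError(f"unknown label: {label!r}")
    if position == "beginning":
        return ARRANGEMENTS.index("chat_then_task")
    if position == "end":
        return ARRANGEMENTS.index("task_then_chat")
    raise ValueError(f"unknown position: {position!r}")


def make_instance(
    history: str, task_response: str, candidate: str, label: str, position: str
) -> Tuple[List[str], int]:
    """Build the three encoder inputs (history + arrangement) and the target."""
    inputs = [f"{history} {r}" for r in arrange(task_response, candidate)]
    return inputs, target_choice(label, position)
```

Each of the three inputs would be encoded separately, and the three resulting scores normalized with a softmax, so training reduces to a three-way classification with the target index as the label.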
Rewriter. This model rewrites the output of an off-the-shelf task-oriented dialogue model and an off-the-shelf chit-chat model. It directly outputs the task-oriented model's belief states without modification and generates action decisions and a system response by a causal language model. The causal language model differs from SimpleTOD+ in that it has two additional components $T_t$ and $\tilde{C}_t$ added between $H_t$ and $B_t$ in each training sequence, where we form $T_t$ and $\tilde{C}_t$ in the same way as we do for Arranger. During the inference stage, it takes as input $H_t$, $T_t$ output by the task-oriented dialogue model, $\tilde{C}_t$ output by the chit-chat model, and $B_t$ output by the task-oriented dialogue model, and generates action decisions and a system response. Note that since 25.4% of the annotated system turns in the training set of ACCENTOR-SGD have both good and bad chit-chat candidates, $C_t^+$ can be non-empty when $\tilde{C}_t$ is a bad candidate, which enables the model to potentially generate a suitable chit-chat augmented response even if the output of the off-the-shelf chit-chat model is not good.

Implementation Details
Unless specified otherwise, for causal language models, we use the 12-layer GPT-2 (117M parameters) as the pre-trained language model (Radford et al., 2019) and fine-tune for ten epochs. We set the batch size to 36 and the learning rate to $1 \times 10^{-3}$. We employ the SimpleTOD baseline as the off-the-shelf task-oriented dialogue model for Arranger and Rewriter. We fine-tune a 90M parameter model  on each of the good chit-chat candidates with the associated dialogue history as the context from the training set of ACCENTOR-SGD following hyperparameters employed by  and employ the resulting model as the off-the-shelf chit-chat model in Arranger and Rewriter. We use RoBERTa$_{\text{BASE}}$ (Liu et al., 2019) as the pre-trained language model for Arranger and fine-tune for three epochs with a learning rate of $2 \times 10^{-5}$ and a batch size of 24.

Data Evaluations
ACCENTOR-SGD. We first evaluate ACCENTOR at the dataset level, aiming to answer two questions: Q1. Are task-oriented dialogues augmented with good chit-chat preferred by human judges over the unaugmented ones? Q2. Does the answer to Q1 depend on how frequently we augment system responses with chit-chat? To answer these questions, we randomly sample 100 dialogues from ACCENTOR-SGD, each having at least 8 turns and enough candidates labeled as good for augmenting over 40% of system responses so that we can compare the same task-oriented dialogue with different chit-chat injection frequencies that fall into each of the following four intervals: (0.  As shown in Figure 4, the chit-chat augmented dialogues from ACCENTOR-SGD are preferred by human judges over the originals on all ACUTE-Eval metrics, regardless of the injection frequency (all p-values < 0.05). Among different injection frequency ranges, (0.2, 0.3] is the best. We offer three hypotheses to explain this finding: (i) (0.2, 0.3] best balances being engaging and not too talkative. (ii) There are inevitable annotation errors, and scenarios where whether a candidate is good or bad is subjective. A higher injection frequency means a higher chance of being affected by these factors. (iii) Since candidates are labeled independently, inter-candidate incompatibility may arise (e.g., expressing contradictory preferences), especially when we have a high injection frequency. Table 4 shows a real example to support our hypotheses. Specifically, [3] is labeled as good but is in fact not a suitable (or is at least a questionable) candidate, supporting hypothesis (ii). While [2] and [4] are good candidates when we evaluate them separately, they may be less preferred if we assess them jointly because they convey the same meaning: "Long Shot is a good comedy." Having them together may appear incompatible (i.e., repetition) or sound verbose to the user, supporting hypotheses (i) and (iii).

Table 4: Example dialogue (U: user; A: assistant). Bracketed numbers mark the chit-chat candidates added to system responses.
U: I like to find some movies directed by Jonathan Levine.
A: What type of movies you interested?
U: I like comic movies which is great.
A: [1] That's cool. There are 1 movies of your interest. What is your opinion about Long Shot?
U: Long Shot is a good one.
A: [2] I agree. Do you like to run the movie?
U: I need to run the movie with subtitles.
A: Please give confirmation details: Running Long Shot with subtitles. [3] Thank you for your help.
U: Yes. Okay fine.
A: The movie started playing. [4] It's a great comedy.
U: I want to make a reservation at a restaurant around there.
A: Do you have any choice of restaurant and city and what is a booking time you need?
U: I need it in San Francisco at 17:30 and the restaurant is Aliment.
A: Please give confirmation details: Table booking at Aliment in San Francisco at 5:30 pm on march 1st and booking is for 2 people. [5] Thank you.
U: Fine. This is okay for me.
A: Your booking is successful and location address is 786 Bush street. [6] I hope you will enjoy the rest of your evening.
U: Thank you so much for your help.
A: [7] You are very welcome. Do you need any other help?
U: No, thanks a lot.
A: Have a great day.

We also augment about 1K randomly sampled dialogues from another task-oriented dataset, MultiWOZ 2.1, following the same steps as described in Section 2. Crowd workers label 30.0% of the candidates as good, which is lower than for ACCENTOR-SGD (41.4% in Table 2). We attribute the difference to (i) the performance degradation of the filter model, since we do not re-train the model for MultiWOZ 2.1, and (ii) a higher chance of a chit-chat augmented response being too verbose to be good, since the average number of tokens per system turn in MultiWOZ 2.1 is larger than that of SGD (17.3 vs. 13.1). Nevertheless, the augmented version (denoted as ACCENTOR-MultiWOZ) is significantly preferred over the original, as shown in Figure 6, where we randomly sample 100 dialogues from ACCENTOR-MultiWOZ, augment all of their system responses that have chit-chat candidates labeled as good, and compare these augmented dialogues with the corresponding original dialogues.

Model Evaluations
Automatic Evaluations. We consider joint goal accuracy (Joint GA) and average goal accuracy (Avg GA) for evaluating belief states, act-slot F1 for evaluating action decisions, and two BLEU-4 scores (BLEU-4$_{\text{orig}}$, BLEU-4$_{\text{aug}}$) for evaluating system responses, where we use original (resp. augmented) system responses as references for BLEU-4$_{\text{orig}}$ (resp. BLEU-4$_{\text{aug}}$). Table 3 summarizes the evaluation results. Since the test set of SGD contains unseen services (i.e., services not seen during training) designed to evaluate the model's generalizability, we report the results on all services (All) and seen services only (Seen) following Rastogi et al. (2020). Our proposed models generally achieve a similar task performance level compared with the SimpleTOD baseline. Unsurprisingly, the proposed models achieve lower BLEU-4$_{\text{orig}}$ and higher BLEU-4$_{\text{aug}}$.
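For readers unfamiliar with the two belief-state metrics, the sketch below shows one simplified way to compute them over (domain, slot, value) triplets. It is an illustration under stated assumptions rather than the official SGD evaluation: it uses exact string matching for all slots (the official script may score some slot types differently), and it computes Avg GA over slots that appear in the gold belief state.

```python
from typing import List, Tuple

Triplet = Tuple[str, str, str]  # (domain, slot, value)


def joint_goal_accuracy(preds: List[List[Triplet]], golds: List[List[Triplet]]) -> float:
    """Fraction of turns whose predicted belief state matches the gold one exactly."""
    correct = sum(1 for p, g in zip(preds, golds) if set(p) == set(g))
    return correct / len(golds)


def average_goal_accuracy(preds: List[List[Triplet]], golds: List[List[Triplet]]) -> float:
    """Fraction of gold (domain, slot) pairs whose value is predicted correctly.

    Simplification: exact match only, and only slots present in the gold state count.
    """
    total = 0
    hits = 0
    for p, g in zip(preds, golds):
        predicted = {(d, s): v for d, s, v in p}
        for d, s, v in g:
            total += 1
            if predicted.get((d, s)) == v:
                hits += 1
    return hits / total if total else 0.0
```

Joint GA is the stricter of the two: a single wrong slot value makes the whole turn count as incorrect, whereas Avg GA gives partial credit per slot.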
Human Evaluations. We turn to human evaluations for a more comprehensive measure of the response generation performance. We employ the same ACUTE-Eval metrics as we do in data evaluations. We randomly sample 100 dialogues from the test set of ACCENTOR-SGD. For each sampled dialogue $D = \{u_1, s_1, u_2, s_2, \ldots, u_n, s_n\}$, we pass $u_1, s_1, \ldots, u_i$ to each model $M \in$ {SimpleTOD, SimpleTOD+, Arranger, Rewriter} to obtain its system response $s_i^M$ for the $i$-th system turn ($1 \le i \le n$). Let $D^M$ represent $\{u_1, s_1^M, \ldots, u_n, s_n^M\}$. We ask evaluators to compare each pair of $D^{M_1}$ and $D^{M_2}$, where $M_1, M_2 \in$ {SimpleTOD, SimpleTOD+, Arranger, Rewriter} and $M_1 \neq M_2$. As shown in Figure 5, all of the chit-chat augmented models outperform the SimpleTOD baseline over the four ACUTE-Eval metrics. Among the chit-chat augmented models, none shows a clear win over the other two on the quantitative level. We show a full dialogue example comparing responses generated by different models along with supplementary discussions in Appendix A.5.
Figure 7: Human evaluation results of the modified Arranger with controlled injection frequency (**: p-value < 0.005, ↑/↓: increased/decreased win % compared with the original Arranger).
Considering that the injection frequency affects human evaluations (Section 4.1) and that none of our models explicitly controls the injection frequency, we experiment with controlling the injection frequency by modifying Arranger to consider including chit-chat into the current turn only when the injection frequency from the first turn to the current turn is less than 0.3. Compared with the original Arranger, the modified Arranger achieves a higher win percentage over SimpleTOD, as shown in Figure 7. We leave further exploration of injection frequency for future work.
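One way to read the injection-frequency control is sketched below. The exact bookkeeping is an assumption made for illustration: the frequency at a given system turn is taken to be the number of already-augmented system turns divided by the number of system turns so far (including the current one), and the Arranger's choice is overridden to task-only whenever that value is not below the threshold. The arrangement names are hypothetical labels for the three Arranger outputs.

```python
from typing import List


def control_injection(choices: List[str], max_frequency: float = 0.3) -> List[str]:
    """Override per-turn Arranger choices so chit-chat is only added while the
    running injection frequency stays below `max_frequency`.

    `choices` holds one label per system turn: "chat_then_task",
    "task_then_chat", or "task_only" (hypothetical names).
    """
    result = []
    injected = 0
    for turn, choice in enumerate(choices, start=1):
        # Assumed definition: augmented turns so far / system turns so far.
        frequency = injected / turn
        if choice != "task_only" and frequency < max_frequency:
            injected += 1
            result.append(choice)
        else:
            result.append("task_only")
    return result
```

With this reading, a model that wants to add chit-chat at every turn is throttled to roughly one augmented turn in every three to four, which keeps the overall frequency near the (0.2, 0.3] range.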

Limitations and Further Discussions
Approach. Our proposed strategy to augment task-oriented dialogue system responses with chit-chat is simple, compared with how it emerges in human conversations, where both functionality and engagingness structurally intertwine with each other in a more complex fashion. Our proposed Rewriter model does have a modeling capability to compose both functions organically but is limited due to the dataset's target arrangement (i.e., concatenation of two separate components). Despite the limitation, our chosen design of "code-separation" has practical merits: we can easily extend the proposed approach to an existing production-level virtual assistant system as a modularized solution, and it has minimal interference with the user-perceived task success rate, a core metric widely adopted in virtual assistant systems. Another limitation of our work is that we only augment responses on the system side in our dataset, and the augmentations are independent of each other, whereas in real-life situations, users are also likely to make chit-chat, and the chit-chat between the user and the system should ideally be related to each other. We leave addressing these limitations to future research.
Evaluation. We follow the previous literature on evaluation and regard the four ACUTE-Eval metrics as the primary measure of the response generation performance in this work. However, there is a large overlap between the desired quality measured by different human judgment categories used in ACUTE-Eval. The four ACUTE-Eval metrics favor the same dialogue 84.4% of the time in our evaluation, indicating high correlations between these metrics. We leave the study of addressing this issue for future work.

Dialogue Datasets
Dialogue system research has been consistently supported by the development of new datasets. The Dialog State Tracking Challenge (DSTC) series (Williams et al., 2013; Henderson et al., 2014a,b; Williams et al., 2014; Kim et al., 2016, 2017; Moon et al., 2020) provide common testbeds for task-oriented dialogues. Following DSTC, researchers have created a variety of publicly available task-oriented dialogue datasets (El Asri et al., 2017; Shah et al., 2018; Budzianowski et al., 2018; Rastogi et al., 2020). Another line of work seeks to facilitate open-domain chatbot development with large amounts of human-created text data generated in a social context (Baumgartner et al., 2020) and supervision for a variety of desirable general qualities such as being engaging, personable, knowledgeable, and empathetic (Zhang et al., 2018; Dinan et al., 2019; Rashkin et al., 2019; Moon et al., 2019; Wang et al., 2019). Our work bridges the two lines. We compare ACCENTOR-SGD and ACCENTOR-MultiWOZ with relevant and representative dialogue datasets in Table 5. Note that very few dialogue corpora contain explicit annotations for both task-oriented and chit-chat utterances. For example, task-oriented dialogue corpora constructed by Rastogi et al. (2020) and Moon et al. (2020) contain annotations for a few chit-chat dialogue acts, but they are limited to light social greetings (e.g., "Thank you!", "Good Bye.") typically at the end of each dialogue session. Zhao et al. (2017) propose to artificially augment task-oriented dialogues with randomly sampled utterances from a chit-chat corpus, mainly to improve the out-of-domain recovery performance. Akasaki and Kaji (2017) annotate user utterances with chat/non-chat binary labels. Still, they do not study the contextual combination of these two to make conversations more engaging, and their corpus does not contain goal labels like typical task-oriented dialogue corpora. In contrast, our work drastically increases the diversity and contextual coverage of chit-chat additions for any task-oriented dialogue corpus (e.g., "It's a great way to kick off the summer!", "I hear it's beautiful.").

Table 5: Comparison with relevant and representative dialogue datasets.
Dataset                                        Construction method    Size
...                                            crowdsourcing          10,438
Schema-Guided Dialogue (Rastogi et al., 2020)  crowdsourcing          22,825
SIMMC (Moon et al., 2020)                      crowdsourcing          12,948
PersonaChat (Zhang et al., 2018)               crowdsourcing          10,907
Wizard of Wikipedia (Dinan et al., 2019)       crowdsourcing          22,311
EmpatheticDialogues (Rashkin et al., 2019)     crowdsourcing          24,850
BlendedSkillTalk                               crowdsourcing          6,808
Pushshift Reddit (Baumgartner et al., 2020)    crawling & scraping    651,778,198 †
ACCENTOR-SGD (this work)                       crowdsourcing          22,825
ACCENTOR-MultiWOZ (this work)                  crowdsourcing          997
Compared with other approaches of creating a high-quality dialogue corpus (e.g., via human-to-human "Wizard-of-Oz" collection, dialogue self-play and paraphrase (Shah et al., 2018)), the annotation cost of the proposed model-based dialogue generation approach combined with the quality control mechanisms is lower, as our work does not involve authoring new sentences by human annotators.

Task-Oriented Dialogue Systems
Over the past few years, neural models have achieved remarkable success in the development of the main components of task-oriented dialogue systems, including understanding user intent, tracking dialogue states, determining system actions, and generating system responses (Henderson et al., 2013; Sun et al., 2014; Wen et al., 2015; Liu and Lane, 2016; Mrkšić et al., 2017; Nouri and Hosseini-Asl, 2018; Heck et al., 2020; Chen et al., 2020). Recently, connecting separate components and building end-to-end task-oriented neural dialogue systems have attracted increasing interest (Peng et al., 2020b). The most recent thread is to unify all components in a single end-to-end neural model by fine-tuning a pre-trained deep language model on multiple tasks, which leads to state-of-the-art performance (Hosseini-Asl et al., 2020; Peng et al., 2020a). We follow this thread and further enhance the ability to generate appropriate non-task-oriented add-ons, on top of the ability to achieve functional goals that existing systems are typically narrowly tailored to. A few works have studied training a dialogue model leveraging multiple chit-chat and task-oriented dialogues (Madotto et al., 2019, 2020), which allows the model to attend to a relevant task for a given user utterance and respond accordingly, thus increasing the skill coverage of the model. Our proposed models are trained on the newly collected ACCENTOR-SGD dataset with the turn-level supervision signals, allowing for contextual and flexible code-switching between chit-chat and functional tasks in a single system turn.

Conclusion
We propose adding chit-chat to enhance task-oriented dialogues (ACCENTOR) in this study. We present a general Human↔AI collaborative data construction approach for ACCENTOR, with which we create a dataset consisting of 23.8K chit-chat augmented task-oriented dialogues. We show via human evaluation that chit-chat augmented dialogues are preferred over the unaugmented ones. In addition, we propose three models for ACCENTOR. Evaluation results show that compared with the baseline trained on the original unaugmented data, our proposed models trained on the chit-chat augmented counterpart achieve a similar task performance level and higher human evaluation scores.

A.2 Details of Candidate Filtering
The ranker initially ranks each candidate according to the posterior probability output by the binary classifier. It then lowers the ranks of candidates that match a list of bad patterns. Most bad patterns are about newly introduced counterfeit information (e.g., containing a URL/email address, a phone number, time, or amount of money). The rest of the bad patterns are mainly about text genre (e.g., containing email sign-offs such as "best regards") and format (e.g., misuse of punctuation marks). Lastly, the ranker raises the ranks of (i) uncommon candidates and (ii) candidates that are dissimilar to the other candidates for the dialogue and the system response being augmented. We measure the similarity by Levenshtein distance. Note that we do not explore the optimal settings for candidate filtering, as it is not the primary focus of this paper. For instance, how much the rule-based ranker lowers or raises the ranks of candidates is set manually based on engineering intuition rather than rigorous analysis; we do not exhaustively investigate how much labeled data is required to obtain a good enough binary classifier; the 1.7K examples from the pilot annotation are randomly sampled. Tuning the procedure (e.g., the number and selection of training examples) may lead to a better resulting candidate set.
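The ranking procedure above can be sketched as a single scoring function. Everything numeric here is an assumption: the paper states that the adjustments were set manually and does not report them, so the penalty weights, the three example bad patterns, and the use of an additive score (rather than explicit rank shifts) are illustrative stand-ins. The Levenshtein distance is normalized to a similarity in [0, 1] for convenience.

```python
import re
from typing import List, Mapping

# Illustrative bad patterns only (URL, email address, email sign-off).
BAD_PATTERNS = [
    re.compile(r"https?://|www\."),
    re.compile(r"\S+@\S+"),
    re.compile(r"best regards", re.IGNORECASE),
]


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalized to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def rank_candidates(
    candidates: List[str],
    probs: List[float],
    system_response: str,
    global_counts: Mapping[str, int],
    top_k: int = 10,
) -> List[str]:
    """Score candidates and keep the top_k. All weights are hypothetical."""
    scored = []
    for idx, (cand, prob) in enumerate(zip(candidates, probs)):
        score = prob  # (i) classifier posterior
        if any(p.search(cand) for p in BAD_PATTERNS):
            score -= 1.0  # (ii) bad-pattern penalty
        score -= 0.1 * min(global_counts.get(cand, 0), 10) / 10  # (iii) frequency
        others = [c for j, c in enumerate(candidates) if j != idx]
        if others:
            score -= 0.1 * max(similarity(cand, o) for o in others)  # (iv)
        score -= 0.1 * similarity(cand, system_response)  # (v)
        scored.append((score, cand))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cand for _, cand in scored[:top_k]]
```

The default `top_k=10` mirrors the ten candidates kept per dialogue; the relative size of the penalties controls the trade-off between classifier confidence and candidate variety.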

A.3 Human Evaluation Questions
• Engaging: Who would you prefer to talk to? Which version is more likely to hold your attention and make you want to hear more?
• Interesting: Who would you say is more interesting? Which version arouses your curiosity or tells you something new or useful?
• Humanlike: Who would you say sounds more human? Which version is more natural and personable?
• Knowledgeable: Who would you say is more knowledgeable? Which version seems more well informed and confident in the information?
Table 7: Example dialogues from ACCENTOR-SGD (U: user; A: assistant). All chit-chat candidates for augmentation, generated with the state-of-the-art pre-trained language models, are annotated by the crowd workers with good and bad labels. Note that while most of the bad chit-chat candidates are fluent, they are often contextually inappropriate or inconsistent with the rest of the dialogue. The annotation guideline is highlighted in Section 2.3.