Effects of Naturalistic Variation in Goal-Oriented Dialog

Existing benchmarks used to evaluate the performance of end-to-end neural dialog systems lack a key component: the natural variation present in human conversations. Most datasets are constructed through crowdsourcing, where crowd workers follow a fixed template of instructions while enacting the role of a user or an agent. This results in straightforward, somewhat routine, and mostly trouble-free conversations, as crowd workers do not think to represent the full range of actions that occur naturally with real users. In this work, we investigate the impact of naturalistic variation on two goal-oriented datasets: the bAbI dialog task and the Stanford Multi-Domain Dataset (SMD). We also propose new and more effective testbeds for both datasets by introducing naturalistic variation by the user. We observe a significant drop in the performance of recent state-of-the-art end-to-end neural methods such as BossNet and GLMP on both datasets: more than 60% in Entity F1 on SMD and 85% in per-dialog accuracy on the bAbI task.


Introduction
End-to-end dialog systems that learn from human-to-human conversations have huge potential for various goal-oriented dialog tasks such as hotel, restaurant, and flight reservations. Recent work (Serban et al., 2016; Bordes et al., 2017) has shown that it is possible to train dialog models in an end-to-end manner and achieve satisfactory results. Several benchmarks (Wen et al., 2017; El Asri et al., 2017; Eric and Manning, 2017; Wei et al., 2018) exist to evaluate the performance of neural models for goal-oriented dialog.
However, these benchmarks assume a world with a "perfect" user who always provides precise, concise, and correct utterances. These goal-oriented datasets are largely collected through crowdsourcing, where a crowd worker enacts the part of a real user by following a set template of instructions provided for the task. This method results in a dataset where most of the user utterances are straightforward, stick to the goal, and tend to leave out the variation commonly found in naturally occurring conversational data. For example, in making a restaurant reservation, a user may perform the following actions: a) check on the customer care agent's welfare, b) comment on the weather in the opening of the conversation, c) ask about business hours or about whether the restaurant accepts reservations as a preliminary question to the reservation request, and d) paraphrase his or her prior request with more details. Each of these actions is a natural variation present in human-to-human conversations.
Although some templates ask the crowd workers to paraphrase their request, they never ask workers to simulate the full range of naturalistic variation (Schegloff et al., 1977; Moore and Arar, 2019). This naturalistic variation has been thoroughly documented in the Conversation Analysis literature (Sacks et al., 1974; Schegloff, 2007), and further adapted for designing automated conversational agents (Moore and Arar, 2019).
The core reason for this omission is that naturalistic variation is often confused with "chit chat" (Dunbar et al., 1997; Zhang et al., 2018). Moore and Arar (2019, p. 121) write: "In common usage, 'chit chat' means inconsequential talk. But much talk that may appear on the surface to be inconsequential in fact serves a variety of functions in managing the conversation itself."
In this work, we focus on the full range of activities observed in naturally occurring conversations, referred to as "natural variation". Our goal in this work is three-fold:
• Highlight the problem of unnatural data generated through crowdsourcing,
• Showcase the impact of natural variation on the performance of state-of-the-art dialog systems, and
• Publicly release improved testbeds for two datasets used extensively in goal-oriented dialog research: the bAbI dialog task and SMD.
Recently, a few approaches have been explored to study the behavior of neural dialog systems in the presence of synthetically introduced perturbations to the dialog history. Eshghi et al. (2017) created the bAbI+ dataset, an extension of bAbI dialog task-1, by introducing variations like hesitations, restarts, and corrections. Zhao and Eskenazi (2018) created SimDial, which simulates spoken language phenomena, e.g., self-repair and hesitation. Sankar et al. (2019) introduce utterance-level and word-level perturbations on various benchmarks. However, such variations have been largely artificial and do not reflect the "natural variation" commonly found in naturally occurring conversational data. Geva et al. (2019) show that models often do not generalize well to test-time examples from new annotators who did not contribute to the training data, which reinforces our choice of introducing natural variation in the test set for evaluation.

Datasets
We study and observe issues in multiple goal-oriented dialog benchmarks. In this work, we focus on two multi-turn goal-oriented datasets, the bAbI dialog task and SMD, for evaluating the impact of natural variation. We provide details on issues in the following datasets in the Appendix: SMD, CamRest676 (Wen et al., 2017), Frames (El Asri et al., 2017), and AirDialogue (Wei et al., 2018).

bAbI dialog task
The bAbI dialog tasks dataset (Bordes et al., 2017) includes five simulated tasks in the restaurant domain, where the dialog system has to retrieve the correct response from a set of given candidate responses. Tasks 1 to 4 are sub-tasks covering issuing and updating API calls, recommending restaurant options, and providing additional information about a restaurant. Task 5 combines all of these tasks. Two KBs are used: one generates the standard training, validation, and test sets, and the other is used only to generate an Out-Of-Vocabulary (OOV) test set. The task is considered simple due to the small number of user and agent responses, but it is used extensively in goal-oriented dialog research.

Stanford Multi-Domain dataset (SMD)
SMD (Eric and Manning, 2017) is a multi-domain, task-oriented dialog dataset with three distinct domains: calendar scheduling, weather information retrieval, and point-of-interest navigation. SMD was collected using a Wizard-of-Oz (WOz) approach inspired by Wen et al. (2017). We provide sample dialogs in Figure 2. Crowd workers had two roles: the Driver (tasked with extracting certain information from the Car Assistant) and the Car Assistant (tasked with answering the Driver's queries using a private KB).
We incorporate the naturalistic variation described below in Section 3 into these datasets, as they are used extensively in goal-oriented dialog research, and use them as benchmarks for our experimental evaluation. Note that we introduce variation only in the test sets and create additional updated test sets to simulate the presence of natural variation during deployment.

Naturalistic variation
To better approximate natural variation in our datasets, we utilize the Natural Conversation Framework (NCF) (Moore and Arar, 2019). The NCF is a framework for designing conversational agents that more closely emulate natural conversation than most of today's chatbots and voice assistants. The NCF is organized into two kinds of patterns: conversational activities and conversation management. The conversational activity patterns (denoted A) handle the main business of the conversation, i.e., the user's request and the services provided by the agent. Conversation management patterns, on the other hand, help the user and agent manage the conversation itself. Conversation management occurs on two levels: the sequence level (denoted B) and the conversation level (denoted C).
After studying the 100 patterns in the NCF, we identified a subset of 32 patterns that are most commonly found in goal-oriented natural conversations and use them in our work. We excluded the remaining 68 patterns, which cover other types of conversations, e.g., quizzes, question-answer jokes, and voice-based assistant settings. We provide details of these 32 patterns in the Appendix. For each pattern p among the 32 NCF patterns, we identify conversations in the test set where p could have been present if the crowd worker had not been limited to a given template. We define rules and heuristics based on the annotations, e.g., dialog acts and slot information, captured by the crowd worker from the user utterance. After introducing additional user utterances and agent responses per NCF pattern, we manually review a random 20% of the updated dialogs to ensure that incorporating the pattern does not make the dialog incoherent (Sankar et al., 2019). After this manual review, we select a subset of 9 patterns and incorporate them into the two datasets.
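To make the injection step concrete, the per-pattern rule can be sketched as follows. This is a minimal sketch, not the released scripts: the function names, the heuristic predicate, and the inserted utterances are illustrative assumptions.

```python
# Minimal sketch of rule-based pattern injection (hypothetical names).
# A dialog is a list of (speaker, utterance) turns; a pattern is applied
# only when its heuristic predicate fires on the dialog's annotations.

def applicable(dialog):
    # Hypothetical heuristic: an Open Request Screening exchange can be
    # prepended whenever the dialog opens with a user request.
    return bool(dialog) and dialog[0][0] == "user"

def inject_screening(dialog):
    """Prepend a screening question and its answer (pattern class A)."""
    if not applicable(dialog):
        return list(dialog)
    expansion = [
        ("user", "do you handle restaurant reservations?"),
        ("agent", "yes, i can book a table for you."),
    ]
    return expansion + list(dialog)

dialog = [
    ("user", "book a table for four at an italian place"),
    ("agent", "which price range are you looking for?"),
]
updated = inject_screening(dialog)
```

Note that the injection only prepends or interleaves turns; the original turns, on which the models are later evaluated, are left untouched.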
Table 1 reports the number of dialogs in the test sets of both datasets updated per pattern, and Table 2 provides the number of dialogs where more than one pattern was added. We provide details with examples for a few patterns below and share details for the rest in the Appendix. Each pattern is denoted by its pattern class (A/B/C) followed by its pattern type.
(A) Open Request Screening: The user asks a preliminary question before a complex request to determine whether the agent may be able to help with it, e.g., dialog D1 in Figure 1.
(B) Misunderstanding Report: The user tells the agent that it misunderstood what he or she said, e.g., line 03 in dialog D2 in Figure 2.
(C) Capability Expansion: The user asks the agent to expand on one of its own capabilities that it previously mentioned, e.g., "Tell me more about restaurant recommendations." Although the naturalistic variation increases the complexity of the dialog, the added utterances do not increase the complexity of the goal; in other words, they do not introduce new topics or courses of action, but merely expand the existing ones.

Experiments
We use two state-of-the-art models, BossNet and GLMP, as the baselines for our experiments. We use the best-performing hyper-parameters reported for both models on each dataset. The test results reported (in Tables 3 and 4) are obtained using the saved model with the highest validation performance across multiple runs. Training settings and hyper-parameter details for both models are in the Appendix. For evaluation on the synthetic bAbI dialog task-5, we use per-response and per-dialog accuracy (Bordes et al., 2017). For SMD, we use a) BLEU (Papineni et al., 2002) and b) Entity F1 (Eric and Manning, 2017) scores. We evaluate the BossNet and GLMP models on both the original and the updated test sets. We do not evaluate the models on their ability to generate the newly added system responses introduced as part of the naturalistic variation, but only on the system responses originally present in the test set.
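For reference, the retrieval and KB metrics can be sketched as follows. This is a simplified sketch: Entity F1 is computed here as micro-averaged F1 over per-response entity sets, which approximates but need not exactly match the official evaluation scripts.

```python
def per_dialog_accuracy(dialogs):
    """Fraction of dialogs whose responses are *all* predicted correctly.

    `dialogs` is a list of dialogs; each dialog is a list of
    (predicted_response, gold_response) pairs.
    """
    correct = sum(all(pred == gold for pred, gold in d) for d in dialogs)
    return correct / len(dialogs)

def entity_f1(gold_sets, pred_sets):
    """Micro-averaged F1 between gold and predicted KB-entity sets."""
    tp = sum(len(g & p) for g, p in zip(gold_sets, pred_sets))
    fp = sum(len(p - g) for g, p in zip(gold_sets, pred_sets))
    fn = sum(len(g - p) for g, p in zip(gold_sets, pred_sets))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Per-dialog accuracy is strict by design: a single wrong response anywhere in a dialog makes the whole dialog count as incorrect, which is why added natural variation can depress it sharply.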

Results
From Tables 3 and 4, we observe that both models perform very poorly on our updated test sets. For SMD, the Entity F1 score drops by 62% for GLMP and 40% for BossNet. We observe similar trends for bAbI dialog task-5, where the per-dialog accuracy decreases by more than 43% for BossNet and 85% for GLMP.
We observe that the drop in performance on bAbI is much smaller than on SMD. This is because bAbI is a synthetic dataset with a small set of fixed agent responses. Since the models are evaluated only on the agent responses present in the original test set, the additional user and agent utterances introduced to incorporate natural variation affect performance less severely.
We perform ablation experiments to study the impact of each pattern (presented in Table 5). We create separate updated test sets for SMD for each pattern, adding only one pattern at a time for the same number of dialogs per pattern as in Table 1. We observe that the (C) Capability Expansion pattern hurts the GLMP model's performance the most in comparison to the other patterns. As mentioned in Section 3, in Capability Expansion the user asks the agent for details about its capabilities. Since SMD has three domains, this adds more user and agent utterances to the dialog history than the other patterns, which results in a larger drop in model performance. In addition to the higher overall dialog length, these new utterances, in which the agent provides details on the available services, also contain new domain entities, which further lowers performance. We provide statistics on the change in average number of utterances per dialog per pattern for SMD in the Appendix.
Our results clearly show that naturalistic variation present during deployment affects model performance and will result in lower than expected performance for a given dialog system in production.

Conclusion
This work studies the dangers of training end-to-end dialog systems on crowdsourced data collected without templates for the natural range of activities in conversation, such as those in the Natural Conversation Framework (Moore and Arar, 2019). We highlight the impact on the performance of state-of-the-art models on our new and effective testbeds for bAbI dialog task-5 and the SMD dataset, which contain naturalistic variation. We believe this opens up a new and promising research direction for devising improved strategies for crowdsourcing goal-oriented datasets, as well as improved models that can better handle interactions with real users.
A Appendix: Natural Conversation Framework (NCF) patterns
At the core of the NCF is a pattern language of 100 interaction patterns that are adapted from conversation science for modeling rule-based dialog. The NCF pattern language is organized into three classes: A) conversational activity, B) sequence-level management, and C) conversation-level management. The conversational activity patterns (A) involve the main business of the interaction and include ways in which the user or the agent can request information from the other (A1, A5), ways in which users can make complex requests in an open-ended way (A2), ways in which agents can tell stories or give instructions interactively (A3), and ways in which agents can quiz users on any topic (A4).
The sequence-level management patterns (B) involve managing particular sequences of utterances and include ways in which the agent and the user can repair troubles in hearing or understanding immediately prior utterances (B1, B2) or earlier utterances (B3), as well as ways of ending sequences either by closing them (B4) or by aborting them (B5). Finally, the conversation-level management patterns (C) involve coordinating entry into and exit from the interaction itself and include ways in which the agent or the user can open the conversation (C1, C2), ways they can talk about the agent's capabilities (C3), and ways they can end the conversation either by closing it (C4) or by disengaging from each other in other ways.
Each pattern consists of an abstract model in the form of a transcript with generic social actions. For example, Pattern A2.3 - Open Request is described below (Listing 1). The line numbers refer to the utterance number in the conversation, U and A refer to user and agent utterances, and generic social actions are listed in capitals. We provide details below on the other NCF patterns that were incorporated in the datasets but omitted from the main paper due to space limitations:
• A: Open Request User Detail Request is a pattern in which the user requests additional information when attempting to answer an agent question, for example, "What are my choices?"
• B: Other Correction is a pattern in which the agent corrects the user's second-to-last utterance based on his or her last utterance, for example, "Oh, you mean a different place."
• B: Sequence Closer Not Helped is a pattern in which the user acknowledges a response from the agent in a negative way when it was not helpful, for example, "too bad" or "oh well."
• B: Sequence Closer Repaired is a pattern in which the user acknowledges the repair of a part of a sequence, for example, an "ok" or "thank you" after the agent provides a repeat, paraphrase, example, etc.
• B: Example Request is a pattern in which the user requests clarification of the agent's prior utterance in the form of an example, for example, "Can you give an example?"
• C: Recipient Correction is a pattern in which the user indicates that he or she is talking to someone other than the agent, for example, "I'm not talking to you."

B Appendix: NCF patterns for goal-oriented dialog
We provide below the list of 32 patterns, out of the 100 NCF patterns, that are most commonly found in goal-oriented natural conversations:

C.2 CamRest676
Wen et al. (2017) collected 676 human-to-human dialogues. There were three informable slots (food, price range, area) that participants in the user role used to constrain the search (similar to bAbI dialog task-1) and six requestable slots (address, phone, postcode, and the three informable slots) that the user could ask about once a restaurant had been offered (similar to bAbI dialog task-4). However, the user utterances in the dataset are straightforward and always stick to the point, without any diversity or novelty in the natural language.

C.3 Frames
El Asri et al. (2017) presented the Frames corpus, also using the Wizard-of-Oz (WOz) approach, where participants in the user role were given task templates during the data collection process. Of the 38 templates used, 14 were generic and the other 24 were written to encourage more role-playing from users. This resulted in some novelty in the collected data and prevented the user utterances from being repetitive. However, to control data collection, the participants were asked to follow a set of instructions, which resulted in user utterances largely focused on the task.

C.4 AirDialogue
Wei et al. (2018) recently presented AirDialogue, a large goal-oriented dataset where human annotators play the role of a customer or an agent and interact with the goal of successfully booking a trip given travel and flight restrictions generated by a context generator. The dataset is currently the largest, with the largest context and state complexity (based on all possible combinations of customer and agent context features, such as the number of flights in the database, the number of airlines, airport codes, and dialog action states), in comparison to the other existing datasets mentioned above. However, the authors do not share details on how the dataset was collected or the instructions provided to the participants.

E Appendix: Training Details
We use the best-performing hyper-parameters reported for both models, BossNet and GLMP, on each dataset. The test results reported are obtained using the saved model with the highest validation performance across multiple runs. Training settings and hyper-parameter details for both models are provided below.

E.1 Baseline method: BossNet
The hyper-parameters used to train BossNet on the different datasets are provided in Table 7.

E.2 Baseline method: GLMP
We use GLMP K3 (hops = 3) for training on the SMD dataset and GLMP K1 (hops = 1) for training on bAbI dialog task-5, as these configurations provide the best results. For both datasets, we used a learning rate of 0.001 with a decay rate of 0.5. The hyper-parameters used to train GLMP on the different datasets are provided in Table 8.