MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling

Even though machine learning has become the major scene in dialogue research community, the real breakthrough has been blocked by the scale of data available.To address this fundamental obstacle, we introduce the Multi-Domain Wizard-of-Oz dataset (MultiWOZ), a fully-labeled collection of human-human written conversations spanning over multiple domains and topics.At a size of 10k dialogues, it is at least one order of magnitude larger than all previous annotated task-oriented corpora.The contribution of this work apart from the open-sourced dataset is two-fold:firstly, a detailed description of the data collection procedure along with a summary of data structure and analysis is provided. The proposed data-collection pipeline is entirely based on crowd-sourcing without the need of hiring professional annotators;secondly, a set of benchmark results of belief tracking, dialogue act and response generation is reported, which shows the usability of the data and sets a baseline for future studies.


Introduction
Conversational Artificial Intelligence (Conversational AI) is one of the long-standing challenges in computer science and artificial intelligence since the Dartmouth Proposal (McCarthy et al., 1955). As human conversation is inherently complex and ambiguous, learning an open-domain conversational AI that can carry on arbitrary tasks is still very far-off (Vinyals and Le, 2015). As a consequence, instead of focusing on creating ambitious conversational agents that can reach human-level intelligence, industrial practice has focused on building task-oriented dialogue systems ) that can help with specific tasks such * The work was done while at the University of Cambridge. as flight reservation (Seneff and Polifroni, 2000) or bus information (Raux et al., 2005). As the need of hands-free use cases continues to grow, building a conversational agent that can handle tasks across different application domains has become more and more prominent (Ram et al., 2018).
Dialogues systems are inherently hard to build because there are several layers of complexity: the noise and uncertainty in speech recognition (Black et al., 2011); the ambiguity when understanding human language ; the need to integrate third-party services and dialogue context in the decision-making (Traum and Larsson, 2003;Paek and Pieraccini, 2008); and finally, the ability to generate natural and engaging responses (Stent et al., 2005). These difficulties have led to the same solution of using statistical framework and machine learning for various system components, such as natural language understanding (Henderson et al., 2013;Mesnil et al., 2015;Mrkšić et al., 2017a), dialogue management (Gašić and Young, 2014;Tegho et al., 2018), language generation (Wen et al., 2015;Kiddon et al., 2016), and even end-to-end dialogue modelling (Zhao and Eskenazi, 2016;Wen et al., 2017;Eric et al., 2017).
To drive the progress of building dialogue systems using data-driven approaches, a number of conversational corpora have been released in the past. Based on whether a structured annotation scheme is used to label the semantics, these corpora can be roughly divided into two categories: corpora with structured semantic labels (Hemphill et al., 1990;Asri et al., 2017;Wen et al., 2017;Eric et al., 2017;Shah et al., 2018); and corpora without semantic labels but with an implicit user goal in mind (Ritter et al., 2010;Lowe et al., 2015). Despite these efforts, aforementioned datasets are usually constrained in one or more dimensions such as missing proper annotations, only available in a limited capacity, lacking multi-domain use cases, or having a negligible linguistic variability. This paper introduces the Multi-Domain Wizard-of-Oz (MultiWOZ) dataset, a large-scale multi-turn conversational corpus with dialogues spanning across several domains and topics. Each dialogue is annotated with a sequence of dialogue states and corresponding system dialogue acts (Traum, 1999). Hence, MultiWOZ can be used to develop individual system modules as separate classification tasks and serve as a benchmark for existing modular-based approaches. On the other hand, MultiWOZ has around 10k dialogues, which is at least one order of magnitude larger than any structured corpus currently available. This significant size of the corpus allows researchers to carry on end-to-end based dialogue modelling experiments, which may facilitate a lot of exciting ongoing research in the area.
This work presents the data collection approach, a summary of the data structure, as well as a series of analyses of the data statistics. To show the potential and usefulness of the proposed Mul-tiWOZ corpus, benchmarking baselines of belief tracking, natural language generation and end-toend response generation have been conducted and reported. The dataset and baseline models will be freely available online. 1

Related Work
Existing datasets can be roughly grouped into three categories: machine-to-machine, human-tomachine, and human-to-human conversations. A detailed review of these categories is presented be-1 http://dialogue.mi.eng.cam.ac.uk/ index.php/corpus/ low.

Machine-to-Machine
Creating an environment with a simulated user enables to exhaustively generate dialogue templates. These templates can be mapped to a natural language by either pre-defined rules (Bordes et al., 2017) or crowd workers (Shah et al., 2018). Such approach ensures a diversity and full coverage of all possible dialogue outcomes within a certain domain. However, the naturalness of the dialogue flows relies entirely on the engineered set-up of the user and system bots. This poses a risk of a mismatch between training data and real interactions harming the interaction quality. Moreover, these datasets do not take into account noisy conditions often experienced in real interactions (Black et al., 2011).
Human-to-Machine Since collecting dialogue corpus for a task-specific application from scratch is difficult, most of the task-oriented dialogue corpora are fostered based on an existing dialogue system. One famous example of this kind is the Let's Go Bus Information System which offers live bus schedule information over the phone (Raux et al., 2005) leading to the first Dialogue State Tracking Challenge . Taking the idea of the Let's Go system forward, the second and third DSTCs (Henderson et al., 2014b,c) have produced bootstrapped human-machine datasets for a restaurant search domain in the Cambridge area, UK. Since then, DSTCs have become one of the central research topics in the dialogue community (Kim et al., 2016(Kim et al., , 2017. While human-to-machine data collection is an obvious solution for dialogue system development, it is only possible with a provision of an existing working system. Therefore, this chicken (system)-and-egg (data) problem limits the use of this type of data collection to existing system improvement instead of developing systems in a completely new domain. What is even worse is that the capability of the initial system introduces additional biases to the collected data, which may result in a mismatch between the training and testing sets (Wen et al., 2016). The limited understanding capability of the initial system may prompt the users to adapt to simpler input examples that the system can understand but are not necessarily natural in conversations.
Human-to-Human Arguably, the best strategy to build a natural conversational system may be to have a system that can directly mimic human behaviors through learning from a large amount of real human-human conversations. With this idea in mind, several large-scale dialogue corpora have been released in the past, such as the Twitter (Ritter et al., 2010) dataset, the Reddit conversations (Schrading et al., 2015), and the Ubuntu technical support corpus (Lowe et al., 2015). Although previous work (Vinyals and Le, 2015) has shown that a large learning system can learn to generate interesting responses from these corpora, the lack of grounding conversations onto an existing knowledge base or APIs limits the usability of developed systems. Due to the lack of an explicit goal in the conversation, recent studies have shown that systems trained with this type of corpus not only struggle in generating consistent and diverse responses (Li et al., 2016) but are also extremely hard to evaluate (Liu et al., 2016).
In this paper, we focus on a particular type of human-to-human data collection. The Wizardof-Oz framework (WOZ) (Kelley, 1984) was first proposed as an iterative approach to improve user experiences when designing a conversational system. The goal of WOZ data collection is to log down the conversation for future system development. One of the earliest dataset collected in this fashion is the ATIS corpus (Hemphill et al., 1990), where conversations between a client and an airline help-desk operator were recorded.
More recently, Wen et al. (2017) have shown that the WOZ approach can be applied to collect high-quality typed conversations where a machine learning-based system can learn from. By modifying the original WOZ framework to make it suitable for crowd-sourcing, a total of 676 dialogues was collected via Amazon Mechanical Turk. The corpus was later extended to additional two languages for cross-lingual research (Mrkšić et al., 2017b). Subsequently, this approach is followed by Asri et al. (2017) to collect the Frame corpus in a more complex travel booking domain, and Eric et al. (2017) to collect a corpus of conversations for in-car navigation. Despite the fact that all these datasets contain highly natural conversations comparing to other human-machine collected datasets, they are usually small in size with only a limited domain coverage.

Data Collection Set-up
Following the Wizard-of-Oz set-up (Kelley, 1984), corpora of annotated dialogues can be gathered at relatively low costs and with a small time effort. This is in contrast to previous approaches (Henderson et al., 2014a) and such WOZ set-up has been successfully validated by Wen et al. (2017) and Asri et al. (2017).
Therefore, we follow the same process to create a large-scale corpus of natural human-human conversations. Our goal was to collect multi-domain dialogues. To overcome the need of relying the data collection to a small set of trusted workers 2 , act type inform * / request * / select 123 / recommend/ 123 / not found 123 request booking info 123 / offer booking 1235 / inform booked 1235 / decline booking 1235 welcome * /greet * / bye * / reqmore * slots address * / postcode * / phone * / name 1234 / no of choices 1235 / area 123 / pricerange 123 / type 123 / internet 2 / parking 2 / stars 2 / open hours 3 / departure 45 destination 45 / leave after 45 / arrive by 45 / no of people 1235 / reference no. 1235 / trainID 5 / ticket price 5 / travel time 5 / department 7 / day 1235 / no of days 123 the collection set-up was designed to provide an easy-to-operate system interface for the Wizards and easy-to-follow goals for the users. This resulted in a bigger diversity and semantical richness of the collected data (see Section 4.3). Moreover, having a large set of workers mitigates the problem of artificial encouragement of a variety of behavior from users. A detailed explanation of the data-gathering process from both sides is provided below. Subsequently, we show how the crowdsourcing scheme can also be employed to annotate the collected dialogues with dialogue acts.

Dialogue Task
The domain of a task-oriented dialogue system is often defined by an ontology, a structured representation of the back-end database. The ontology defines all entity attributes called slots and all possible values for each slot. In general, the slots may be divided into informable slots and requestable slots. Informable slots are attributes that allow the user to constrain the search (e.g., area or price range). Requestable slots represent additional information the users can request about a given entity (e.g., phone number). Based on a given ontology spanning several domains, a task template was created for each task through random sampling. This results in single and multi-domain dialogue scenarios and domain specific constraints were generated. In domains that allowed for that, an additional booking requirement was sampled with some probability.
To model more realistic conversations, goal changes are encouraged. With a certain probability, the initial constraints of a task may be set to values so that no matching database entry exists. Once informed about that situation by the system, the users only needed to follow the goal which provided alternative values.

User Side
To provide information to the users, each task template is mapped to natural language. Using heuristic rules, the task is then gradually introduced to the user to prevent an overflow of information. The goal description presented to the user is dependent on the number of turns already performed. Moreover, if the user is required to perform a sub-task (for example -booking a venue), these sub-goals are shown straight-away along with the main goal in the given domain. This makes the dialogues more similar to spoken conversations. 3 Figure 1 shows a sampled task description spanning over two domains with booking requirement. Natural incorporation of co-referencing and lexical entailment into the dialogue was achieved through implicit mentioning of some slots in the goal.

System Side
The wizard is asked to perform a role of a clerk by providing information required by the user. He is given an easy-to-operate graphical user interface to the back-end database. The wizard conveys the information provided by the current user input through a web form. This information is persistent across turns and is used to query the database. Thus, the annotation of a belief state is performed implicitly while the wizard is allowed to fully focus on providing the required information. Given the result of the query (a list of entities satisfying current constraints), the wizard either requests more details or provides the user with the adequate information. At each system turn, the wizard starts with the results of the query from the previous turn.
To ensure coherence and consistency, the wizard and the user alike first need to go through the dialogue history to establish the respective context. We found that even though multiple workers contributed to one dialogue, only a small margin of dialogues were incoherent.

Annotation of Dialogue Acts
Arguably, the most challenging and timeconsuming part of any dialogue data collection is the process of annotating dialogue acts. One of the major challenges of this task is the definition of a set and structure of dialogue acts (Traum and Hinkelman, 1992;Bunt, 2006). In general, a dialogue act consists of the intent (such as request or inform) and slot-value pairs. For example, the act inform(domain=hotel,price=expensive) has the intent inform, where the user is informing the system to constrain the search to expensive hotels. Expecting a big discrepancy in annotations between annotators, we initially ran three trial tests over a subset of dialogues using Amazon Mechanical Turk. Three annotations per dialogue were gathered resulting in around 750 turns. As this requires a multi-annotator metric over a multi-label task, we used Fleiss' kappa metric (Fleiss, 1971) per single dialogue act. Although the weighted kappa value averaged over dialogue acts was at a high level of 0.704, we have observed many cases of very poor annotations and an unsatisfactory coverage of dialogue acts. Initial errors in annotations and suggestions from crowd workers gradually helped us to expand and improve the final set of dialogue acts from 8 to 13 -see Table  2.
The variation in annotations made us change the initial approach. We ran a two-phase trial to first identify set of workers that perform well. Turkers were asked to annotate an illustrative, long dialogue which covered many problematic examples that we have observed in the initial run described above. All submissions that were of high quality were inspected and corrections were reported to annotators. Workers were asked to re-run a new trial dialogue. Having passed the second test, they were allowed to start annotating real dialogues. This procedure resulted in a restricted set of annotators performing high quality annotations. Appendix A contains a demonstration of a created system.

Data Quality
Data collection was performed in a two-step process. First, all dialogues were collected and then the annotation process was launched. This setup allowed the dialogue act annotators to also report errors (e.g., not following the task or confusing utterances) found in the collected dialogues. As a result, many errors could be corrected. Finally, additional tests were performed to ensure that the provided information in the dialogues match the pre-defined goals.
To estimate the inter-annotator agreement, the averaged weighted kappa value for all dialogue acts was computed over 291 turns. With κ = 0.884, an improvement in agreement between annotators was achieved although the size of action set was significantly larger.

MultiWOZ Dialogue Corpus
The main goal of the data collection was to acquire highly natural conversations between a tourist and a clerk from an information center in a touristic city. We considered various possible dialogue scenarios ranging from requesting basic information about attractions through booking a hotel room or travelling between cities. In total, the presented corpus consists of 7 domains -Attraction, Hospital, Police, Hotel, Restaurant, Taxi, Train. The latter four are extended domains which include the sub-task Booking. Through a task sampling procedure (Section 3.1), the dialogues cover between 1 and 5 domains per dialogue thus greatly varying in length and complexity. This broad range of domains allows to create scenarios where domains are naturally connected. For example, a tourist needs to find a hotel, to get the list of attractions and to book a taxi to travel between both places. Table 2 presents the global ontology with the list of considered dialogue acts.

Data Statistics
Following data collection process from the previous section, a total of 10, 438 dialogues were collected. Figure 2 (left) shows the dialogue length distribution grouped by single and multi domain dialogues. Around 70% of dialogues have more than 10 turns which shows the complexity of the corpus. The average number of turns are 8.93 and 15.39 for single and multi-domain dialogues respectively with 115, 434 turns in total. Figure 2 (right) presents a distribution over the turn lengths.
As expected, the wizard replies are much longerthe average sentence lengths are 11.75 and 15.12 for users and wizards respectively. The responses are also more diverse thus enabling the training of more complex generation models.
Figure 3 (left) shows the distribution of dialogue acts annotated in the corpus. We present here a summarized list where different types of actions like inform are grouped together. The right graph in the Figure 3 presents the distribution of number of acts per turn. Almost 60% of dialogues turns have more than one dialogue act showing again the richness of system utterances. These create a new challenge for reinforcement learning-based models requiring them to operate on concurrent actions.
In total, 1, 249 workers contributed to the corpus creation with only few instances of intentional wrongdoing. Additional restrictions were added to automatically discover instances of very short utterances, short dialogues or missing single turns during annotations. All such cases were corrected or deleted from the corpus.

Data Structure
There are 3, 406 single-domain dialogues that include booking if the domain allows for that and 7, 032 multi-domain dialogues consisting of at least 2 up to 5 domains. To enforce reproducibility of results, the corpus was randomly split into a train, test and development set. The test and development sets contain 1k examples each. Even though all dialogues are coherent, some of them were not finished in terms of task description. Therefore, the validation and test sets only contain fully successful dialogues thus enabling a fair comparison of models.
Each dialogue consists of a goal, multiple user and system utterances as well as a belief state and set of dialogue acts with slots per turn. Additionally, the task description in natural language presented to turkers working from the visitor's side is added.

Comparison to Other Structured Corpora
To illustrate the contribution of the new corpus, we compare it on several important statistics with the DSTC2 corpus (Henderson et al., 2014a), the SFX corpus , the WOZ2.0 corpus (Wen et al., 2017), the FRAMES corpus (Asri et al., 2017), the KVRET corpus (Eric et al., 2017), and the M2M corpus (Shah et al., 2018). Figure 1 clearly shows that our corpus compares favorably to all other data sets on most of the metrics with the number of total dialogues, the average number of tokens per turn and the total number of unique tokens as the most prominent ones. Especially the latter is important as it is directly linked to linguistic richness.

MultiWOZ as a New Benchmark
The complexity and the rich linguistic variation in the collected MultiWOZ dataset makes it a great benchmark for a range of dialogue tasks.
To show the potential usefulness of the Multi-WOZ corpus, we break down the dialogue modelling task into three sub-tasks and report a benchmark result for each of them: dialogue state tracking, dialogue-act-to-text generation, and dialoguecontext-to-text generation. These results illustrate new challenges introduced by the MultiWOZ dataset for different dialogue modelling problems.

Dialogue State Tracking
A robust natural language understanding and dialogue state tracking is the first step towards building a good conversational system. Since multidomain dialogue state tracking is still in its infancy and there are not many comparable approaches available (Rastogi et al., 2017), we instead report our state-of-the-art result on the restaurant subset of the MultiWOZ corpus as the reference baseline. The proposed method (Ramadan et al., 2018) exploits the semantic similarity between dialogue utterances and the ontology terms which allows the information to be shared across domains. Furthermore, the model parameters are independent of the ontology and belief states, therefore the number of the parameters does not increase with the size of   Table 3 shows that the performance of the model is consecutively poorer on the new dataset compared to WOZ2.0. These results demonstrate how demanding is the new dataset as the conversations are richer and much longer.

Dialogue-Context-to-Text Generation
After a robust dialogue state tracking module is built, the next challenge becomes the dialogue management and response generation components. These problems can either be addressed separately , or jointly in an end-to-end fashion (Bordes et al., 2017;Wen et al., 2017;Li et al., 2017). In order to establish a clear benchmark where the performance of the composite of dialogue management and response generation is completely independent of the belief tracking, we experimented with a baseline neural response generation model with an oracle beliefstate obtained from the wizard annotations as discussed in Section 3.3. 5 Following Wen et al. (2017) which frames the dialogue as a context to response mapping problem, a sequence-to-sequence model (Sutskever et al., 2014) is augmented with a belief tracker and a discrete database accessing component as additional features to inform the word decisions in the decoder. Note, in the original paper the belief tracker was pre-trained while in this work the annotations of the dialogue state are used as an oracle tracker. Figure 4 presents the architecture of the system . 4 The model is publicly available at https://github.com/osmanio2/ multi-domain-belief-tracking Training and Evaluation Since often times the evaluation of a dialogue system without a direct interaction with the real users can be misleading (Liu et al., 2016), three different automatic metrics are included to ensure the result is better interpreted. Among them, the first two metrics relate to the dialogue task completion -whether the system has provided an appropriate entity (Inform rate) and then answered all the requested attributes (Success rate); while fluency is measured via BLEU score (Papineni et al., 2002). The best models for both datasets were found through a grid search over a set of hyper-parameters such as the size of embeddings, learning rate and different recurrent architectures.
We trained the same neural architecture (taking into account different number of domains) on both MultiWOZ and Cam676 datasets. The best results on the Cam676 corpus were obtained with bidirectional GRU cell. In the case of MultiWOZ dataset, the LSTM cell serving as a decoder and an encoder achieved the highest score with the global type of attention (Bahdanau et al., 2014). Table 4 presents the results of a various of model architectures and shows several challenges. As expected, the model achieves almost perfect score on the Inform metric on the Cam676 dataset taking the advantage of an oracle belief state signal. However, even with the perfect dialogue state tracking of the user intent, the baseline models obtain almost 30% lower score on the Inform metric on the new corpus. The addition of the attention improves the score on the Success metric on the new dataset by less than 1%. Nevertheless, as expected, the best model on MultiWOZ is still falling behind by a large margin in comparison to the results on the Cam676 corpus taking into account both Inform and Success metrics. As most of dialogues span over at least two domains, the model has to be much more effective in order to execute a successful dialogue. Moreover, the BLEU score on the MultiWOZ is lower than the one reported on the Cam676 dataset. This is mainly caused by the much more diverse linguistic expressions observed in the MultiWOZ dataset.

Dialogue-Act-to-Text Generation
Natural Language Generation from a structured meaning representation (Oh and Rudnicky, 2000;Bohus and Rudnicky, 2005) has been a very popular research topic in the community, and the lack of data has been a long standing block for the field to adopt more machine learning methods. Due to the additional annotation of the system acts, the MultiWOZ dataset serves as a new benchmark for studying natural language generation from a structured meaning representation. In order to verify the difficulty of the collected dataset for the language generation task, we compare it to the SFX dataset (see Table 1), which consists of around 5k dialogue act and natural language sentence pairs. We trained the same Semantically Conditioned Long Short-term Memory network (SC-LSTM) proposed by Wen et al. (2015) on both datasets and used the metrics as a proxy to estimate the difficulty of the two corpora. To make a fair comparison, we constrained our dataset to only the restaurant sub-domain which contains around 25k dia-  logue turns. To give more statistics about the two datasets: the SFX corpus has 9 different act types with 12 slots comparing to 12 acts and 14 slots in our corpus. The best model for both datasets was found through a grid search over a set of hyperparameters such as the size of embeddings, learning rate, and number of LSTM layers. 6 Table 5 presents the results on two metrics: BLEU score (Papineni et al., 2002) and slot error rate (SER) (Wen et al., 2015). The significantly lower metrics on the MultiWOZ corpus showed that it is much more challenging than the SFX restaurant dataset. This is probably due to the fact that more than 60% of the dialogue turns are composed of at least two system acts, which greatly harms the performance of the existing model.

Conclusions
As more and more speech oriented applications are commercially deployed, the necessity of building an entirely data-driven conversational agent becomes more apparent. Various corpora were gathered to enable data-driven approaches to dialogue modelling. To date, however, the available datasets were usually constrained in linguistic variability or lacking multi-domain use cases.
In this paper, we established a data-collection pipeline entirely based on crowd-sourcing enabling to gather a large scale, linguistically rich corpus of human-human conversations. We hope that MultiWOZ offers valuable training data and a new challenging testbed for existing modularbased approaches ranging from belief tracking to 6 The model is publicly available at https://github.com/andy194673/ nlg-sclstm-multiwoz dialogue acts generation. Moreover, the scale of the data should help push forward research in the end-to-end dialogue modelling. Figure A1 presents the user side interface where the worker needs to properly respond given the task description and the dialogue history. Figure  A2 shows the wizard page with the GUI over all domains. Finally, Figure A3 shows the set-up for annotation of the system acts with Restaurant domain being turned on.