RiSAWOZ: A Large-Scale Multi-Domain Wizard-of-Oz Dataset with Rich Semantic Annotations for Task-Oriented Dialogue Modeling

In order to alleviate the shortage of multi-domain data and to capture discourse phenomena for task-oriented dialogue modeling, we propose RiSAWOZ, a large-scale multi-domain Chinese Wizard-of-Oz dataset with Rich Semantic Annotations. RiSAWOZ contains 11.2K human-to-human (H2H) multi-turn semantically annotated dialogues, with more than 150K utterances spanning 12 domains, which is larger than all previous annotated H2H conversational datasets. Both single- and multi-domain dialogues are constructed, accounting for 65% and 35%, respectively. Each dialogue is labeled with comprehensive dialogue annotations, including the dialogue goal in the form of a natural language description, the domain, and dialogue states and acts on both the user and system sides. In addition to traditional dialogue annotations, we provide linguistic annotations on discourse phenomena in dialogues, e.g., ellipsis and coreference, which are useful for dialogue coreference and ellipsis resolution tasks. Apart from the fully annotated dataset, we also present a detailed description of the data collection procedure, together with statistics and analysis of the dataset. A series of benchmark models and results are reported, covering natural language understanding (intent detection and slot filling), dialogue state tracking and dialogue context-to-text generation, as well as coreference and ellipsis resolution, which provide baselines for future research on this corpus.


Introduction
In recent years, a variety of datasets tailored for task-oriented dialogue have been constructed, such as MultiWOZ (Budzianowski et al., 2018), SGD (Rastogi et al., 2019a) and CrossWOZ (Zhu et al., 2020), along with the increasing interest in conversational AI in both academia and industry (Gao et al., 2018). These datasets have triggered extensive research in either end-to-end or traditional modular task-oriented dialogue modeling (Wen et al., 2019; Dai et al., 2020). Despite the substantial progress made on these newly built corpora, more effort in creating challenging datasets in terms of size, multiple domains, semantic annotations, multilinguality, etc., is still in demand (Wen et al., 2019).
* Equal Contributions.
1 The corpus is publicly available at https://github.com/terryqj0107/RiSAWOZ.
Among the existing datasets, the majority are not large in size, e.g., ATIS (Hemphill et al., 1990), WOZ 2.0 (Wen et al., 2017), FRAMES (El Asri et al., 2017) and KVRET (Eric et al., 2017), which might not well support data-hungry neural dialogue models. Very large task-oriented dialogue datasets can be created in a machine-to-machine fashion, such as M2M (Shah et al., 2018) and SGD (Rastogi et al., 2019a). Datasets collected in this way need to simulate both user and system, and they contain unnatural conversations.
MultiWOZ (Budzianowski et al., 2018), probably the most promising and notable dialogue corpus recently collected in a Wizard-of-Oz (i.e., human-to-human) way, is one order of magnitude larger than the aforementioned corpora collected in the same way. However, it contains noisy system-side state annotations and lacks user-side dialogue acts (Eric et al., 2019; Zhu et al., 2020). Another very recent dataset, CrossWOZ (Zhu et al., 2020), the first large-scale Chinese H2H dataset for task-oriented dialogue, provides semantic annotations on both the user and system sides, although it is smaller than MultiWOZ. The number of domains in both MultiWOZ and CrossWOZ is fewer than 10. MultiWOZ dialogues cover 7 domains. However, the distribution of dialogues over these domains is imbalanced. Dialogues from two domains (hospital, police) account for less than 6% of the training set and do not appear in either the development or test set. CrossWOZ involves 5 domains, and the dialogue goal descriptions for the taxi and metro domains are simpler than those for the other domains. Neither MultiWOZ nor CrossWOZ provides linguistic annotations to capture discourse phenomena, which are ubiquitous in multi-turn dialogues and important in dialogue modeling (Quan et al., 2019; Rastogi et al., 2019b; Zhang et al., 2019a).
Table 1: Comparison of our dataset to other task-oriented dialogue datasets (training set). H2H, H2M, M2M represent human-to-human, human-to-machine, machine-to-machine respectively.
Figure 1: A dialogue example spanning multiple domains. We show dialogue annotations and necessary linguistic annotations (green words for ellipsis resolution and green box for coreference clusters) for each user utterance (in yellow callout) and system utterance (in blue callout). Better viewed in color.
In order to alleviate the aforementioned issues, we propose RiSAWOZ, a large-scale Chinese multi-domain Wizard-of-Oz task-oriented dialogue dataset with rich semantic annotations. Compared with existing datasets (particularly MultiWOZ and CrossWOZ), our contributions can be summarized as follows:
• RiSAWOZ is to date the largest fully annotated human-to-human task-oriented dialogue dataset to our knowledge. [...] Detailed comparison of RiSAWOZ to existing task-oriented dialogue datasets is shown in Table 1.
• We provide richer manual semantic annotations on the crowd-sourced dialogues, including both dialogue annotations (i.e., various structured semantic labels for dialogue modeling) and linguistic annotations that are not available in previous Wizard-of-Oz datasets (e.g., MultiWOZ or CrossWOZ). Figure 1 shows a dialogue example that demonstrates semantic annotations in RiSAWOZ. User goal description, domain label, dialogue states and dialogue acts at both user and system side are annotated for each dialogue. In order to facilitate the study of ellipsis and coreference in dialogue, we also provide two kinds of linguistic annotations collected in two different ways. Annotations for unified ellipsis/coreference resolution via utterance rewriting are more comprehensive and at least one order of magnitude larger than existing datasets with such annotations (Quan et al., 2019;Zhang et al., 2019a;Rastogi et al., 2019b). Coreference clusters in each dialogue are also manually annotated, providing a new large-scale coreference dataset on dialogue, which is complementary to previous coreference datasets on documents (Pradhan et al., 2012). In a nutshell, RiSAWOZ integrates human-to-human conversations, dialogue annotations and linguistic annotations on ellipsis/coreference into a single unified dataset.
• We use RiSAWOZ as a new benchmark testbed and report benchmark results on 5 tasks for future comparison study and tracking progress on this dataset. The 5 tasks are NLU, DST, Dialogue Context-to-Text Generation, Coreference Resolution and Unified Generative Ellipsis and Coreference Resolution. We discuss the usability of the dataset for other tasks, e.g., Dialogue Policy Learning, Natural Language Generation, User Simulator, Dialogue Summarization, etc. The dataset and the benchmark models will be publicly available soon.

Related Work
We follow Budzianowski et al. (2018) to roughly categorize existing task-oriented dialogue datasets into three groups: machine-to-machine, human-to-machine, and human-to-human. From the perspective of domain quantity and data scale, most existing datasets cover only a single or a few domains, while large-scale multi-domain datasets are not widely available. As suggested by Wen et al. (2019), task-oriented dialogue datasets in languages other than English are few. To the best of our knowledge, there have been no large-scale dialogue datasets with linguistic annotations targeting ubiquitous discourse phenomena (e.g., ellipsis and coreference) in dialogue. Although some recent works have proposed datasets with utterance completion annotation for ellipsis or coreference in dialogue (Quan et al., 2019), these datasets are small in scale and have simple dialogue goals. No dialogue datasets provide annotations of coreference clusters.
Machine-to-Machine Collecting data of this type requires creating a user and system simulator to generate multi-turn dialogue outlines, which are further transformed into natural language utterances via paraphrasing with predefined rules or crowdsourced workers (Shah et al., 2018; Rastogi et al., 2019a). Although this approach requires less human effort, the diversity and complexity of the created dialogues depend greatly on the quality of the user and system simulators. It is also difficult to avoid a mismatch between machine-created dialogues and real human conversations.
Human-to-Machine In this method, humans converse with an existing dialogue system to collect dialogue data. The Dialogue State Tracking Challenges (DSTC) have provided several datasets created in this way (Williams et al., 2013; Henderson et al., 2014a,b). Generally, the quality of human-to-machine data relies heavily on the performance of the given dialogue system.
Human-to-Human To collect data of this type, crowdsourced workers talk to each other according to given dialogue goals to create diverse and natural dialogues. ATIS (Hemphill et al., 1990), WOZ 2.0 (Wen et al., 2017), FRAMES (El Asri et al., 2017) and KVRET (Eric et al., 2017) are small-scale datasets built in this way. In contrast, MultiWOZ (Budzianowski et al., 2018) and CrossWOZ (Zhu et al., 2020) are two large-scale H2H datasets proposed recently.
Coreference Resolution Coreference is ubiquitous in dialogue. However, there is no available dialogue dataset with labeled coreference clusters. Generally, coreference datasets are created on text paragraphs or documents. The OntoNotes 5.0 dataset (Pradhan et al., 2012), from the CoNLL-2012 shared task, is one of the most widely used document-level datasets for coreference resolution.
Generative Ellipsis and Coreference Resolution In recent years, ellipsis and coreference resolution in dialogue has been treated as an end-to-end generative task. [...] propose to rewrite dialogue utterances to recover all co-referred and omitted information with an annotated Chinese [...] with coreference and ellipsis information and propose an end-to-end generative resolution model for both ellipsis and coreference in a single unified framework. This task is also treated as an auxiliary module to improve dialogue understanding (Zhang et al., 2019a) and dialogue state tracking (Rastogi et al., 2019b). However, the scale of the dialogue datasets built or used in these works is relatively small.

Dataset Creation
The whole process of data collection consists of database and ontology construction, goal generation, dialogue collection and two rounds of annotations.

Database and Ontology Construction
We crawl 3,325 unique entities with their attributes from several Chinese public websites. Statistics on entities and slots across 12 domains are shown in Table 2. An ontology is constructed over these entities and attributes, which defines all possible slots for each domain and all possible values for each slot. Slots in the dataset can be divided into two categories: informable and requestable, as shown in Table 3. Informable slots are attributes that allow the user to constrain the search into the database. Requestable slots represent specific attributes that the user wants to know about an entity.

Dialogue Goal
First of all, we design dialogue goal templates with placeholders representing slots and values. We have designed 80 dialogue goal templates for 12 domains, including 52 single-domain goals and 28 multi-domain goals. Then we randomly sample actual slots and values from the ontology to fill in the placeholders in the templates to generate dialogue goal instances. In total, we generate 5,600 dialogue goal instances. An example of a dialogue goal (i.e., user goal) is shown at the top of Figure 1. A dialogue goal is a natural language description that only the user can see. The user needs to talk with the system step by step according to the given goal description until the task is finished. We assign each dialogue goal to two different pairs of workers. In this way, we can collect two different dialogues from one dialogue goal. The total number of dialogues is therefore 2 * 5,600 = 11,200. This setting ensures the diversity of collected dialogues while keeping the cost of crowdsourced workers within budget.
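The template-filling step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used to build RiSAWOZ: the ontology entries, the template wording, the `{domain.slot}` placeholder syntax and the function names are all hypothetical.

```python
import random
import re

# Hypothetical ontology: domain -> slot -> candidate values (illustrative only).
ONTOLOGY = {
    "train": {"destination": ["Suzhou", "Beijing"], "seat": ["first class", "second class"]},
    "weather": {"city": ["Suzhou", "Beijing"]},
}

# Hypothetical goal templates; placeholders are written as {domain.slot}.
TEMPLATES = [
    "You want a {train.seat} train ticket to {train.destination}.",
    "You want a train ticket to {train.destination}, then ask about the weather in {weather.city}.",
]

PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def fill_template(template, ontology, rng):
    """Replace every {domain.slot} placeholder with a value sampled from the ontology."""
    def sample(match):
        domain, slot = match.group(1).split(".")
        return rng.choice(ontology[domain][slot])
    return PLACEHOLDER.sub(sample, template)


def generate_goals(templates, ontology, n_goals, seed=0):
    """Generate n_goals goal descriptions by sampling a template and filling it."""
    rng = random.Random(seed)
    return [fill_template(rng.choice(templates), ontology, rng) for _ in range(n_goals)]


goals = generate_goals(TEMPLATES, ONTOLOGY, n_goals=5)
# Each goal is given to two different worker pairs, so dialogues = 2 * goals.
n_dialogues = 2 * len(goals)
```

With 5,600 goal instances instead of the 5 used here, the same doubling gives the 11,200 dialogues reported above.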

Dialogue Collection and Annotation
In order to collect high-quality coherent dialogues and annotations, we develop a collecting platform based on the Client-Server architecture, including two versions of client platform for user and system side respectively. Crowdsourced workers can choose to play the role of either user or system. They work in pairs and enter the chat room to construct dialogues. To ensure the quality of dialogue collection, we hire workers via our in-house crowdsourcing platform and train the workers strictly in advance. Finally, we select 92 well-trained workers to participate in our dialogue collection and annotation. At this stage, we collect task-oriented dialogue data with dialogue annotations including domain labels, dialogue states and acts.

User Side
During dialogue collection, a user first reads through the natural language description of a given dialogue goal to understand the task that is required to finish. After that, the user communicates with the system step by step to accomplish the given goal. We encourage the users to follow their own personalized language style in communication and train different workers to play the role of user, which makes created dialogues more complex and diverse, and more similar to the spoken conversations in our daily life. According to the instructions described in the given dialogue goal, the user should provide specific constraints to the system step by step and request the corresponding information. The user can terminate the dialogue when he/she believes the task has been accomplished.

System Side
The worker who plays the role of system (i.e. wizard) provides consulting services in various domains to users. When receiving an utterance from the user side, the wizard needs to first determine which domain the user is talking about and convert the user utterance into structured user acts. A dialogue act for both user and system side consists of the act type (i.e. intent) and slot-value pairs such as inform (area=Gusu District). All the dialogue act types are shown in Table 4. We define 5 different user act types. Inform represents that a user provides specific constraints to information search from the database. Request denotes that a user asks for the values of specific slots. Greeting and bye are to express greeting and farewell. General represents other behaviors that are not covered above. By understanding the goal of a user utterance, the wizard needs to annotate the constraints the user wants to provide and the slots requested by the user. The constraints are called belief states which are a set of slot-value pairs. The belief state is persistent across turns and is used to query the database. The wizard then retrieves the database according to the constraints. Considering both the results of database retrieval and the dialogue context, the wizard should send a natural language response to the user. In addition, the wizard needs to convert the natural language system response into structured system acts.
Similar to the user acts, 7 different system act types are predefined. Inform represents that the system informs the user about the attribute values of specific entities. Recommend denotes that the system recommends required entities to the user. Request represents that the system asks the user whether there are special constraints for the slots in question. No offer means that the system tells the user there is no matched entity. The remaining three act types, greeting, bye and general, are the same as those user act types described above.
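To make the annotation flow on the system side concrete, the sketch below models a dialogue act as an act type plus slot-value pairs, accumulates informed constraints into a belief state that persists across turns, and uses that state to filter a database. The class and function names, the lowercase act labels and the toy database are assumptions for illustration; they are not taken from the released annotation format.

```python
from dataclasses import dataclass, field

# Act types as listed in the text (5 for the user, 7 for the system).
USER_ACT_TYPES = {"inform", "request", "greeting", "bye", "general"}
SYSTEM_ACT_TYPES = {"inform", "recommend", "request", "no-offer", "greeting", "bye", "general"}


@dataclass
class DialogueAct:
    act_type: str
    # slot -> value; a requested slot carries None because its value is unknown.
    slots: dict = field(default_factory=dict)


def update_belief_state(belief_state, user_acts):
    """Return a new belief state with the constraints from 'inform' acts merged in."""
    new_state = dict(belief_state)
    for act in user_acts:
        if act.act_type == "inform":
            new_state.update(act.slots)
    return new_state


def query_database(database, belief_state):
    """Keep the entities whose attributes match every constraint in the belief state."""
    return [e for e in database
            if all(e.get(slot) == value for slot, value in belief_state.items())]


# Hypothetical toy database with three entities.
database = [
    {"name": "A", "area": "Gusu District", "price": "cheap"},
    {"name": "B", "area": "Gusu District", "price": "expensive"},
    {"name": "C", "area": "Wuzhong District", "price": "cheap"},
]

turn1 = [DialogueAct("inform", {"area": "Gusu District"})]
turn2 = [DialogueAct("inform", {"price": "cheap"}), DialogueAct("request", {"name": None})]

state = update_belief_state({}, turn1)
state = update_belief_state(state, turn2)   # constraints persist across turns
matches = query_database(database, state)
```

Only `inform` acts change the belief state in this sketch; requested slots are left to the response step, which matches the description of constraints versus requested slots above.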
Although the tasks for the system side look complex at first glance, we design a simple and easy-to-operate graphical user interface (GUI). The wizard only needs to follow the prompts to perform simple operations, such as ticking check boxes, picking from drop-down boxes, filling in input boxes and clicking specific buttons, which can be easily done by well-trained workers.
In this way, the information on domains, belief states and dialogue acts for both the user and system side can be annotated during the process of collecting dialogues. A dialogue example with these annotations is shown in Figure 1.
Unlike the collection procedure adopted by MultiWOZ (Budzianowski et al., 2018), in which multiple workers contribute to one dialogue, the construction and annotation of each dialogue in our dataset are completed by a single pair of well-trained workers. This is to guarantee the coherence and consistency of each created dialogue and the accuracy of its annotations. Moreover, we train each worker to play different roles to diversify dialogue utterances.

Coreference Clusters Annotation
We develop a toolkit with easy-to-operate GUI for annotating coreference clusters. With this annotation toolkit, well-trained annotators read through a dialogue to locate all entity mentions. They then group each of these mentions into an appropriate cluster. As shown at the bottom of Figure 1, entity mentions in each cluster are co-referential to one another.

Ellipsis and Coreference Annotation via Utterance Rewriting
As shown in Figure 1, both referenced and absent information can be recovered by rewriting an incomplete utterance into a complete version. In this way, we can reformulate ellipsis and coreference resolution as sentence rewriting in a unified framework. The merit of such rewriting is to help the dialogue model better understand the goal of a user utterance in context. In order to facilitate such task reformulation, we provide the second type of linguistic annotation on RiSAWOZ: utterance rewriting for ellipsis and coreference resolution. We train crowdsourced workers to accomplish this annotation task and develop an annotation toolkit for them. Each annotator needs to read an entire dialogue sentence by sentence, detecting ellipsis or coreference phenomena in user utterances. For an utterance with ellipsis or coreference, the annotator rewrites the utterance into its complete version with recovered referenced/absent information according to dialogue context. If neither occurs in the user utterance, the original utterance is kept. Both cases are presented in the example in Figure 1.

Our Dialogue Dataset
Our dataset contains not only single-domain dialogues, but also a great amount of multi-domain dialogues where domains are naturally connected. For example, a user wants to travel from one place to another. After checking the air ticket or train ticket, she wants to ask for the local weather information as well. In this section, we will introduce our dataset from two aspects: data structure and data statistics.

Data Structure
As shown in Figure 1, each dialogue in our dataset consists of a user goal description in natural language, a label of dialogue domain, multiple user and system turns and a set of coreference clusters.
In each user turn, the user act and dialogue state are annotated over the user utterance. We also label whether there are ellipsis or coreference phenomena in each user utterance. If so, a complete version of the user utterance is provided. In each system turn, the system utterance is labeled with the corresponding system acts.
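The structure described above can be pictured as a small set of record types. The sketch below is only one possible in-memory representation: all class and field names are hypothetical and should not be read as the schema of the released files, and the English utterances stand in for the Chinese originals.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UserTurn:
    utterance: str
    user_acts: list                 # e.g. [("inform", "destination", "Suzhou")]
    belief_state: dict              # accumulated slot -> value constraints
    has_ellipsis_or_coref: bool
    rewritten: Optional[str] = None  # complete version, present only when needed

    def complete_utterance(self):
        """Return the rewritten utterance if one exists, else the original."""
        if self.has_ellipsis_or_coref and self.rewritten:
            return self.rewritten
        return self.utterance


@dataclass
class SystemTurn:
    utterance: str
    system_acts: list


@dataclass
class Dialogue:
    goal: str                        # natural language goal description
    domains: list                    # one label for single-domain, several otherwise
    turns: list                      # alternating UserTurn / SystemTurn objects
    coreference_clusters: list = field(default_factory=list)  # lists of mentions


dialogue = Dialogue(
    goal="You want a train ticket to Suzhou and then ask about the weather there.",
    domains=["train", "weather"],
    turns=[
        UserTurn("I need a train ticket to Suzhou.",
                 [("inform", "destination", "Suzhou")],
                 {"destination": "Suzhou"}, False),
        SystemTurn("Sure, which seat class would you like?",
                   [("request", "seat", None)]),
        UserTurn("How is the weather there?",
                 [("request", "weather", None)],
                 {"destination": "Suzhou", "city": "Suzhou"}, True,
                 rewritten="How is the weather in Suzhou?"),
    ],
    coreference_clusters=[["Suzhou", "there"]],
)

complete = [t.complete_utterance() for t in dialogue.turns if isinstance(t, UserTurn)]
```

Keeping the rewritten utterance optional mirrors the annotation rule that the original utterance is kept when no ellipsis or coreference occurs.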

Data Statistics
Dialogue Statistics: We reshuffle all created dialogues and divide them into training/dev/test sets that maintain approximately the same distribution over domains. As shown in Table 5, the training set contains 10,000 dialogues with 134,580 turns. The development and test sets each contain 600 dialogues, with more than 8K and 9K turns respectively. The 5th column of Table 5 shows the statistics on single-domain dialogues. Multi-domain dialogues (the 6th column of Table 5) cover 8 domains, excluding Computer, Car, Hospital and Education. After Chinese word segmentation via Jieba (https://github.com/fxsjy/jieba), there are 1,658,645 tokens in total in RiSAWOZ. On average, there are 10.91 tokens in each turn and 13.57 turns in each dialogue. Multi-domain dialogues have more turns, and their utterances are longer than those in single-domain dialogues.
Annotation Statistics: As shown in Table 5, each dialogue contains an average of 1.46 dialogue acts per turn. User and system turns have an average of 1.79 and 1.13 dialogue acts respectively. The richness of dialogue acts also makes our dataset a new challenge. On average, there are 1.77 coreference clusters in each dialogue. As multi-domain dialogues are more complex, each multi-domain dialogue has an average of 2.45 coreference clusters. Regarding utterance rewriting for ellipsis and coreference resolution, 75,991 user utterances are reformulated, as shown in Table 6. Only 38.47% of the user utterances have neither ellipsis nor coreference phenomena, and the remaining 61.53% have at least one of them.
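The domain-balanced training/dev/test division mentioned under Dialogue Statistics could be produced along the following lines. This is a sketch of a generic stratified split, not the authors' script: the function name, the `(dialogue_id, domain_label)` input format and the fixed seed are assumptions, and because per-domain quotas are rounded, the resulting set sizes are only approximately the requested ones in general.

```python
import random
from collections import defaultdict


def stratified_split(dialogues, n_dev, n_test, seed=0):
    """Split (dialogue_id, domain_label) pairs so each set keeps roughly the
    same domain distribution as the whole collection."""
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for item in dialogues:
        by_domain[item[1]].append(item)

    total = len(dialogues)
    train, dev, test = [], [], []
    for _, items in sorted(by_domain.items()):
        rng.shuffle(items)
        # Each domain contributes to dev/test in proportion to its share.
        k_dev = round(n_dev * len(items) / total)
        k_test = round(n_test * len(items) / total)
        dev.extend(items[:k_dev])
        test.extend(items[k_dev:k_dev + k_test])
        train.extend(items[k_dev + k_test:])
    return train, dev, test


# Toy collection: 60 dialogues in domain "A" and 40 in domain "B".
toy = [(i, "A") for i in range(60)] + [(i, "B") for i in range(60, 100)]
train, dev, test = stratified_split(toy, n_dev=10, n_test=10)
```

In the toy example the 60/40 domain ratio is preserved in the dev and test sets (6 and 4 dialogues each).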

RiSAWOZ as a New Benchmark
The large size and rich semantic annotations of RiSAWOZ make it a suitable testbed for various benchmark tasks. We conduct five different evaluation tasks with the benchmark models and in-depth analyses on RiSAWOZ in this section. We also discuss the applicability of RiSAWOZ for other tasks. Results of the 5 tasks are reported in Table 7.

Natural Language Understanding
Task Definition: In a task-oriented dialogue system, the NLU module aims to convert the user utterance into a representation that a computer can understand, which includes intent and dialogue act (slot & value) detection.
Model: We adapt BERT (Devlin et al., 2019) for the NLU task (intent detection and slot filling). We initialize BERT with the Chinese pre-trained BERT model (Cui et al., 2019) and then fine-tune it on RiSAWOZ. To take dialogue history into account, we employ the same BERT to model previous dialogue context. We also experiment with the setting without context. For fine-tuning BERT on RiSAWOZ, we set the learning rate to 0.00003 and the dropout rate to 0.1.
Results: From Table 7, we can clearly see that the model using dialogue context performs better than the one without it. Also, the model obtains lower F1 scores on multi-domain dialogues than on single-domain dialogues.

Dialogue State Tracking
Task Definition: Dialogue State Tracking (DST) is a core component in task-oriented dialogue systems, which extracts dialogue states (user goals) embedded in dialogue context. It has progressed toward open-vocabulary or generation-based DST, where state-of-the-art models can generate dialogue states from dialogue context directly.
Model: To report the benchmark results of the DST task, we implement the TRADE model (Wu et al., 2019) and the MLCSG model (Quan and Xiong, 2020), which improves long context modeling through a multi-task learning framework based on TRADE and achieves state-of-the-art joint accuracy on the MultiWOZ 2.0 dataset (Budzianowski et al., 2018). We train the models with a learning rate of 0.001 and a weight decay rate of 0.5. Early stopping and dropout are also used to prevent overfitting. The dropout rate is set to 0.2.
Results: As shown in Table 7, we report the joint accuracy of the two models under two different word embedding initialization settings: random and fastText (Grave et al., 2018) initialization. With randomly initialized word embeddings of 100 dimensions, TRADE achieves 65.35%, 50.49% and 58.19% joint accuracy on single-domain, multi-domain and all data respectively. With 300-dimensional pretrained word vectors from fastText, TRADE performs slightly better. Under the same setting, MLCSG achieves higher joint accuracy: 73.04%, 58.77% and 66.16%. In general, the performance of the two DST models drops significantly on multi-domain dialogues.
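For readers unfamiliar with the metric, joint accuracy is usually computed as the fraction of turns whose predicted dialogue state matches the gold state on every slot. The sketch below follows that common definition; it is an assumption that the benchmark scripts use exactly this form, and the function name and example states are hypothetical.

```python
def joint_accuracy(predicted_states, gold_states):
    """Fraction of turns where the whole predicted state equals the gold state.

    Each state is a dict mapping slot names to values; a turn counts as correct
    only if every slot-value pair matches and no extra slot is predicted.
    """
    if len(predicted_states) != len(gold_states):
        raise ValueError("predictions and gold states must be aligned turn by turn")
    if not gold_states:
        return 0.0
    correct = sum(pred == gold for pred, gold in zip(predicted_states, gold_states))
    return correct / len(gold_states)


# Illustrative states only, not taken from the dataset.
gold = [
    {"area": "Gusu District"},
    {"area": "Gusu District", "price": "cheap"},
]
pred = [
    {"area": "Gusu District"},
    {"area": "Gusu District", "price": "moderate"},  # one wrong slot fails the turn
]
score = joint_accuracy(pred, gold)
```

Because a single wrong slot invalidates the whole turn, the metric becomes harder as states grow, which is consistent with the lower scores reported on multi-domain dialogues.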

Dialogue Context-to-Text Generation
Task Definition: We recast dialogue response generation as a sequence-to-sequence problem: encoding the dialogue context to decode the system response.
Model: For this task, we use the Domain-Aware Multi-Decoder (DAMD) model (Zhang et al., 2019b), which achieves state-of-the-art performance on the MultiWOZ 2.0 dataset (Budzianowski et al., 2018). It is an end-to-end model proposed to handle the multi-domain response generation problem, which uses one encoder to encode dialogue context and three decoders to decode the belief span, system action and system response. We set the vocabulary size to 8,000 and randomly initialize 50-dimensional word embeddings. The size of hidden states is set to 100. We train the model with a learning rate of 0.005 and a decay rate of 0.5.
Results: In Table 7, we report inform rate, success rate, BLEU (Papineni et al., 2002) and combined score for this task. The inform rate measures the percentage of dialogues in which the output contains the appropriate entity the user asks for, and the success rate estimates the proportion of dialogues in which all the requested attributes have been answered. The combined score is calculated via (inform + success) * 0.5 + BLEU as a measure of overall quality (Zhang et al., 2019b). Still, multi-domain dialogues exhibit a high difficulty level.
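The combined score is a direct arithmetic combination of the three other metrics. A minimal sketch, assuming all three inputs are expressed on the same 0 to 100 scale; the function name and the input numbers are illustrative and are not results from Table 7.

```python
def combined_score(inform, success, bleu):
    """Combined score as defined in the text: (inform + success) * 0.5 + BLEU."""
    return (inform + success) * 0.5 + bleu


# Illustrative values only.
example = combined_score(inform=80.0, success=70.0, bleu=20.0)
```

Averaging inform and success before adding BLEU keeps task completion and surface quality on comparable footing in the final number.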

Coreference Resolution
Task Definition: We predict coreference clusters where all mentions are referring to the same entity for each dialogue.
Model: We use the e2e-coref model (Lee et al., 2017), which is the first end-to-end coreference resolution model, as the benchmark model for this task. The model predicts coreference clusters from texts end-to-end without using any auxiliary syntactic parser or hand-engineered mention detector. It considers all spans in a text as potential mentions and learns distributions over all possible antecedents for each mention. The whole process contains two steps: scoring potential entity mentions by calculating embedding representations of the corresponding spans, and estimating the score for an antecedent from pairs of span representations. The 300-dimensional word vectors from fastText (Grave et al., 2018) are used for the e2e-coref model. We set the size of hidden states to 200 and the number of layers to 3. The model is trained with a learning rate of 0.001 and a decay rate of 0.999. The dropout rate is set to 0.2.

Results: We report the standard MUC, B³ and CEAFφ4 F1 metrics using the official CoNLL-2012 evaluation scripts, and an average F1 score of the three metrics. As shown in Table 7, the e2e-coref model achieves 84.49%, 80.60% and 82.41% average F1 score on single-domain, multi-domain and all data respectively. The model performs the worst on multi-domain dialogues, where coreference links may cross different domains.
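As an illustration of how one of these cluster-based metrics works, the sketch below computes B³ precision, recall and F1 under the simplifying assumption that predicted and gold clusters cover exactly the same mentions, and then averages three F1 values as in the reported average. It is not a replacement for the official CoNLL-2012 scorer, which also handles mismatched mention sets; the function names and mention labels are hypothetical.

```python
def b_cubed(predicted_clusters, gold_clusters):
    """Simplified B-cubed: per-mention precision/recall averaged over mentions.

    Assumes every mention appears in exactly one predicted and one gold cluster.
    """
    pred_of = {m: frozenset(c) for c in predicted_clusters for m in c}
    gold_of = {m: frozenset(c) for c in gold_clusters for m in c}
    if set(pred_of) != set(gold_of):
        raise ValueError("this simplified B-cubed needs identical mention sets")
    mentions = list(gold_of)
    if not mentions:
        return 0.0, 0.0, 0.0

    precision = sum(len(pred_of[m] & gold_of[m]) / len(pred_of[m]) for m in mentions) / len(mentions)
    recall = sum(len(pred_of[m] & gold_of[m]) / len(gold_of[m]) for m in mentions) / len(mentions)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def average_f1(muc_f1, b3_f1, ceaf_f1):
    """Unweighted mean of the three F1 scores."""
    return (muc_f1 + b3_f1 + ceaf_f1) / 3


# Toy example: mentions are opaque labels; real mentions would be text spans.
gold = [["m1", "m2", "m3"], ["m4", "m5"]]
pred = [["m1", "m2"], ["m3", "m4", "m5"]]
p, r, f = b_cubed(pred, gold)
```

In the toy example, mention m3 is placed in the wrong cluster, which lowers both precision and recall to 11/15.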

Unified Generative Ellipsis and Coreference Resolution
Task Definition: This is a new task reformulated recently (Quan et al., 2019). It usually takes the current user utterance and dialogue context as input. If there exist ellipsis or coreference phenomena in the user utterance, a complete version of the utterance is generated according to the dialogue context. Otherwise, the original user utterance is kept.
Model: We adopt the GECOR model (Quan et al., 2019), which is an end-to-end generative ellipsis and coreference resolution model with two encoders and one decoder that can produce a pragmatically complete user utterance via generation and copying. We set both the size of hidden states and word embeddings to 300. We use 300-dimensional fastText (Grave et al., 2018) word vectors to initialize word embeddings in the embedding layer. We train the model with a learning rate of 0.003 and a decay rate of 0.5. Early stopping is used and the dropout rate is 0.5.
Results: We follow Quan et al. (2019) in using the exact match rate (EM), BLEU (Papineni et al., 2002) and Resolution F1 as the evaluation metrics for this task. EM measures whether the generated utterances exactly match the ground-truth utterances. Resolution F1 is calculated by comparing machine-generated words with ground-truth words only from the ellipsis / coreference part of user utterances. As shown in Table 7, the GECOR model achieves a 58.26% EM score, an 87.50% BLEU score and a 78.14% Resolution F1 score on all data, which are much lower than the results reported on CamRest676 by Quan et al. (2019).
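The two rewriting-specific metrics can be sketched as follows. Exact match is straightforward; the Resolution F1 shown here is only a rough per-utterance approximation, which assumes that the "ellipsis / coreference part" can be identified as the tokens present in a rewritten utterance but absent from the original one. The exact procedure of Quan et al. (2019) may differ, and all function names are hypothetical.

```python
from collections import Counter


def exact_match_rate(generated, references):
    """Fraction of generated utterances identical to their references."""
    if len(generated) != len(references):
        raise ValueError("generated and reference utterances must be aligned")
    if not references:
        return 0.0
    return sum(g == r for g, r in zip(generated, references)) / len(references)


def recovered_tokens(original_tokens, rewritten_tokens):
    """Tokens added by the rewrite (multiset difference); an approximation of
    the resolved ellipsis / coreference part."""
    return Counter(rewritten_tokens) - Counter(original_tokens)


def resolution_f1(original_tokens, generated_tokens, reference_tokens):
    """Token-level F1 restricted to the recovered part of one utterance."""
    pred = recovered_tokens(original_tokens, generated_tokens)
    gold = recovered_tokens(original_tokens, reference_tokens)
    overlap = sum((pred & gold).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


# Illustrative English example; the dataset itself is Chinese.
original = ["how", "much", "is", "it"]
reference = ["how", "much", "is", "the", "train", "ticket"]
generated = ["how", "much", "is", "the", "ticket"]

em = exact_match_rate([" ".join(generated)], [" ".join(reference)])
res_f1 = resolution_f1(original, generated, reference)
```

The example shows why the two metrics diverge: the generated rewrite misses one word, so EM is 0 while the approximate Resolution F1 still credits the two recovered tokens.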

Other Tasks
Apart from the five evaluation tasks introduced above, RiSAWOZ can also facilitate research on many other tasks. For example, the text of dialogues, as well as the annotation of dialogue states and acts, can support the study of dialogue policy learning (DPL), natural language generation (NLG) and user simulation. Dialogue acts, text and goal descriptions can potentially be used for the task of dialogue summarization (Goo and Chen, 2018).
RiSAWOZ is also suitable for domain adaptation and for zero-shot and few-shot learning in multi-domain task-oriented dialogue modeling, due to its wide domain coverage. The rich linguistic annotations of RiSAWOZ would also promote the deep integration of discourse and dialogue. We leave these tasks for our future work.

Conclusion
In this paper, we have presented RiSAWOZ, to date the largest human-to-human multi-domain dataset annotated with rich semantic information for task-oriented dialogue modeling. We manually label each dialogue in RiSAWOZ not only with comprehensive dialogue annotations for various sub-tasks of task-oriented dialogue systems (e.g., NLU, DST, response generation), but also with linguistic annotations of ellipsis and coreference in multi-turn dialogue. In addition, the process of data creation and annotation is described in detail. We also report the benchmark models and results of five evaluation tasks on RiSAWOZ, indicating that the dataset is a challenging testbed for future work. RiSAWOZ features large scale, wide domain coverage, rich semantic annotation and functional diversity, which can facilitate research on task-oriented dialogue modeling from different aspects.