From Machine Reading Comprehension to Dialogue State Tracking: Bridging the Gap

Dialogue state tracking (DST) is at the heart of task-oriented dialogue systems. However, the scarcity of labeled data is an obstacle to building accurate and robust state tracking systems that work across a variety of domains. Existing approaches generally require some dialogue data with state information, and their ability to generalize to unknown domains is limited. In this paper, we propose using machine reading comprehension (RC) in state tracking from two perspectives: model architectures and datasets. We divide the slot types in the dialogue state into categorical and extractive to borrow the advantages of both multiple-choice and span-based reading comprehension models. Our method achieves near the current state-of-the-art joint goal accuracy on MultiWOZ 2.1 given full training data. More importantly, by leveraging machine reading comprehension datasets, our method outperforms the existing approaches by a large margin in few-shot scenarios when the availability of in-domain data is limited. Lastly, even without any state tracking data, i.e., in the zero-shot scenario, our proposed approach achieves greater than 90% average slot accuracy on 12 out of 30 slots in MultiWOZ 2.1.


Introduction
Building a task-oriented dialogue system that can comprehend users' requests and complete tasks on their behalf is a challenging but fascinating problem. Dialogue state tracking (DST) is at the heart of task-oriented dialogue systems. It tracks the state of a dialogue during the conversation between a user and a system. The state is typically defined as a (slot name, slot value) pair that represents, for a given slot, the value that the user provides or the system-provided value that the user accepts.
* Authors contributed equally.
Despite the importance of DST in task-oriented dialogue systems, few large datasets are available. To address this issue, several methods have been proposed for data collection and bootstrapping the DST system. These approaches utilize either a Wizard-of-Oz setup via crowdsourcing (Wen et al., 2017; Budzianowski et al., 2018) or the Machines Talking To Machines (M2M) framework (Shah et al., 2018). Currently the most comprehensive dataset with state annotation is MultiWOZ (Budzianowski et al., 2018), which contains seven domains with around 10,000 dialogues. However, compared to other NLP datasets, MultiWOZ is still relatively small, especially for training data-intensive neural models. In addition, it is non-trivial to obtain a large amount of clean labeled data given the nature of task-oriented dialogues (Eric et al., 2019).
Another thread of approaches has tried to utilize data in a more efficient manner. These approaches (Zhou and Small, 2019) usually train the models on several domains and perform zero-shot or few-shot learning on unseen domains. However, these methods require slot definitions to be similar between the training data and the unseen test data. If such systems are given a completely new slot type, their performance degrades significantly. Therefore, these approaches still rely on a considerable amount of DST data to cover a broad range of slot categories.
We find the machine reading comprehension (RC) task (Rajpurkar et al., 2016; Chen, 2018) to be a source of inspiration for tackling these challenges. The RC task aims to evaluate how well machine models can understand human language, a goal similar to that of DST. Ultimately, DST focuses on the contextual understanding of users' requests and inferring the state from the conversation, whereas RC focuses on the general understanding of text regardless of its format, which can be either passages or conversations. In addition, recent advances have shown tremendous success on RC tasks. Thus, if we could formulate the DST task as an RC task, it would benefit DST in two aspects: first, we could take advantage of the fast-growing RC research advances; second, we could make use of the abundant RC data to overcome the data scarcity issue in the DST task.
Building upon this motivation, we formulate the DST task as an RC task by specially designing a question for each slot in the dialogue state, similar to prior work. Then, we divide the slots into two types, categorical and extractive, based on the number of slot values in the ontology. For instance, in MultiWOZ, slots such as parking take values from {Yes, No, Don't Care} and can thus be treated as categorical. In contrast, slots such as hotel-name may accept an unlimited number of possible values, and these are treated as extractive. Accordingly, we propose two machine reading comprehension models for dialogue state tracking. For categorical slots, we use multiple-choice reading comprehension models, where an answer has to be chosen from a limited number of options. For extractive dialogue state tracking, span-based reading comprehension models are applied, where the answer can be found in the form of a span in the conversation.
To summarize our approach and contributions: • We divide the dialogue state slots into categorical and extractive types and use RC techniques for state tracking. Our approach can leverage the recent advances in the field of machine reading comprehension, including both multiple-choice and span-based reading comprehension models.
• We propose a two-stage training strategy. We first coarse-train the state tracking models on reading comprehension datasets, then finetune them on the target state tracking dataset.
• We show the effectiveness of our method under three scenarios: First, in the full data setting, we show our method achieves close to the current state-of-the-art on MultiWOZ 2.1 in terms of joint goal accuracy. Second, in the few-shot setting, when only 1-10% of the training data is available, we show our method significantly outperforms the previous methods on 5 test domains in MultiWOZ 2.0. In particular, we achieve 45.91% joint goal accuracy with just 1% (around 20-30 dialogues) of hotel domain data, as compared to the previous best result of 19.73%.
Third, in the zero-shot setting, where no state tracking data is used for training, our models still achieve considerable average slot accuracy. More concretely, we show that 13 out of 30 slots in MultiWOZ 2.1 can achieve an average slot accuracy of greater than 90% without any training.
• We demonstrate the impact of canonicalization on extractive dialogue state tracking. We also categorize errors based on None and Not None slot values. We found that the majority of errors for our DST model come from distinguishing None from Not None for slots.

Related Works
Traditionally, dialogue state tracking methods (Liu and Lane, 2017; Mrkšić et al., 2016; Zhong et al., 2018; Nouri and Hosseini-Asl, 2018) assume a fully-known fixed ontology for all slots, where the output space of a slot is constrained by the values in the ontology. However, such approaches cannot handle previously unseen values and do not scale well for slots such as restaurant-name that can take a potentially unbounded set of values. To alleviate these issues, Rastogi et al. (2017) and Goel et al. (2018) generate and score slot-value candidates from the ontology, dialogue context n-grams, slot tagger outputs, or a combination of them. However, these approaches suffer if a reliable slot tagger is not available or if the slot value is longer than the candidate n-grams. Xu and Hu (2018) proposed an attention-based pointing mechanism to find the start and end of the slot value to better tackle the issue of unseen slot values. Gao et al. (2019) utilize a pointer generator network to either copy from the context or generate from the vocabulary. Perhaps the most similar to our work is that of Zhang et al. (2019) and Zhou and Small (2019), who divide slot types into span-based (extractive) slots and pick-list (categorical) slots and use a QA framework to point to or pick values for these slots. A major limitation of these works is that they utilize heuristics to determine which slots should be categorical and which non-categorical. Moreover, in these settings most of the slots are treated as categorical (21/30 and 25/30), even though some of them have a very large number of possible values, e.g., restaurant-name. This is not scalable, especially when the ontology is large, not comprehensive, or when new domains/slots can occur at test time, as in the DSTC8 dataset (Rastogi et al., 2019).
There have been recent efforts to build or adapt dialogue state tracking systems in low-resource data scenarios (Zhou and Small, 2019). The general idea in these approaches is to treat all but one domain as in-domain data and test on the remaining unseen domain, either directly (zero-shot) or after fine-tuning on a small percentage (1%-10%) of the unseen domain data (few-shot). A major drawback of these approaches is that they require several labeled in-domain examples in order to perform well on the unseen domain. This limits these approaches to in-domain slots and slot definitions, and they do not generalize well to new slots or a completely unseen target domain. They also require a large amount of labeled data in the source domain, which may not be available in real-world scenarios. Our proposed approach, on the other hand, utilizes domain-agnostic QA datasets with zero or a small percentage of DST data and significantly outperforms these approaches in low-resource settings.

Dialogue State Tracking as Reading Comprehension
Dialogue as Paragraph For a given dialogue at turn t, let us denote the user utterance tokens and the agent utterance tokens as u_t and a_t respectively. We concatenate the user utterance tokens and the agent utterance tokens at each turn to construct a sequence of tokens D_t = {u_1, a_1, ..., u_t}. D_t can be viewed as the paragraph that we are going to ask questions about at turn t.
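As a concrete sketch of this construction (the utterances and function name below are our own illustrative examples, not from the paper):

```python
# Sketch: build the "paragraph" D_t = {u_1, a_1, ..., u_t} for turn t by
# interleaving user and agent utterances. Utterance texts are made up.
def build_paragraph(user_turns, agent_turns, t):
    """Return D_t as a flat token sequence ending with the user's turn u_t."""
    tokens = []
    for i in range(t):
        tokens.extend(user_turns[i].split())
        if i < t - 1:  # the paragraph at turn t ends with u_t, not a_t
            tokens.extend(agent_turns[i].split())
    return tokens

users = ["i need a hotel with parking", "book it for two nights"]
agents = ["sure , any price range ?"]
d2 = build_paragraph(users, agents, t=2)  # D_2 = u_1 + a_1 + u_2
```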
Slot as Question We can formulate a natural language question q_i for each slot s_i in the dialogue state. Such a question describes the meaning of that slot in the dialogue state. Examples of (slot, question) pairs can be seen in Tables 2 and 3. We formulate questions by considering the characteristics of the domain and slot. In this way, DST becomes finding the answer a_i to the question q_i given the paragraph D_t. Note that prior work formulates the dialogue state tracking problem in a similar way, but their question formulation, "what is the value of a slot?", is more abstract, whereas our questions are more concrete and meaningful to the dialogue. [Figure 1 caption: "Encoder" is a pre-trained sentence encoder such as BERT.]
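For illustration, a minimal slot-to-question mapping might look like the following; the question wordings here are our own examples, not the exact ones from Tables 2 and 3:

```python
# Hypothetical (slot, question) pairs; wordings are illustrative only.
SLOT_QUESTIONS = {
    "hotel-parking": "Does the user want a hotel with parking?",
    "hotel-name": "What is the name of the hotel the user is interested in?",
    "train-day": "On which day does the user want to take the train?",
}

def slot_to_question(slot):
    """Map a dialogue-state slot s_i to its natural language question q_i."""
    return SLOT_QUESTIONS[slot]
```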

Span-based RC To Extractive DST
For many slots in the dialogue state, such as names of attractions, restaurants, and departure times, one can often find their values in the dialogue context with exact matches. Slots with a wide range of values fit this description. Table 1 shows the exact match rate for each slot in the MultiWOZ 2.1 dataset (Budzianowski et al., 2018; Eric et al., 2019), where slots with a large number of possible values tend to have higher exact match rates (≥ 80%). We call tracking such slots extractive dialogue state tracking (EDST).
This problem is similar to span-based RC, where the goal is to find a span in the passage that best answers the question. Therefore, for EDST, we adopt the simple BERT-based question answering model used by Devlin et al. (2019), which has shown strong performance on multiple datasets (Rajpurkar et al., 2016, 2018; Reddy et al., 2019). In this model, as shown in Figure 1, the slot question and the dialogue are represented as a single sequence. The probability of a dialogue token t_i being the start of the slot value span is computed as p_i = e^{s·T_i} / Σ_j e^{s·T_j}, where T_j is the embedding of each token t_j and s is a learnable vector. A similar formula is applied for finding the end of the span.
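The start-pointer distribution above can be sketched numerically; the 3-dimensional toy embeddings T and the vector s below are made up for illustration:

```python
import math

# Sketch of the start-pointer distribution p_i = exp(s.T_i) / sum_j exp(s.T_j),
# with toy 3-d token embeddings and a "learnable" vector s (values made up).
def start_probabilities(token_embeddings, s):
    logits = [sum(si * ti for si, ti in zip(s, T)) for T in token_embeddings]
    m = max(logits)  # subtract the max logit for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

T = [[0.1, 0.0, 0.2],   # token 1
     [1.0, 0.5, 0.0],   # token 2
     [0.2, 0.2, 0.2]]   # token 3
s = [1.0, 1.0, 1.0]
p = start_probabilities(T, s)  # highest probability at token 2
```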
Handling None Values At any given turn in the conversation, there are typically many slots that have not yet been mentioned or accepted by the user. All these slots must be assigned a None value in the dialogue state. We can view such cases as no answer exists in the reading comprehension formulation. Similar to Devlin et al. (2019) for the SQuAD 2.0 task, we assign these slots an answer span with start and end at the beginning token [CLS].
Handling Don't Care Values To handle the don't care value in EDST, a span is also assigned to don't care in the dialogue. We find the dialogue turn at which the slot value first becomes don't care and set the start and end of the don't care span to be the start and end of the user utterance of that turn. See Table 2 for an example.

Multiple-Choice Reading Comprehension to Categorical Dialogue State Tracking
The other type of slots in the dialogue state cannot be filled through exact match in the dialogue context in a large number of cases. For example, a user might express the intent for hotel parking as "oh! and make sure it has parking", but the slot hotel-parking only accepts values from {Yes, No, Don't Care}. In this case, the state tracker needs to infer whether or not the user wants parking based on the user utterance and select the correct value from the list. These kinds of slots may not have exact-match spans in the dialogue context but usually have a limited number of values to choose from. Tracking these types of slots is surprisingly similar to multiple-choice reading comprehension (MCRC) tasks. In comparison to span-based RC tasks, the answers in MCRC datasets (Lai et al., 2017; Sun et al., 2019) are often open, natural language sentences and are not restricted to spans in the text. Following the traditional models of MCRC (Devlin et al., 2019; Jin et al., 2019), we concatenate the slot question, the dialogue context, and one of the answer choices into a long sequence. We then feed this sequence into a sentence encoder to obtain a logit vector. Given a question, we can get m logit vectors assuming there are m answer choices. We then transform these m logit vectors into a probability vector through a fully connected layer and a softmax layer; see Figure 2 for details.

Handling None and Don't Care Values For each question, we simply add two additional choices, "not mentioned" and "do not care", to the answer options, representing None and don't care, as shown in Table 3. It is worth noting that certain slots not only accept a limited number of values but also have values that can be found as exact-match spans in the dialogue context. For these slots, both the extractive and categorical DST models can be applied, as shown in Table 1.
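A minimal sketch of this categorical scoring, with a word-overlap stand-in (`toy_score`) replacing the real sentence encoder; all names and strings here are ours, not the paper's:

```python
import math

# Sketch of categorical (multiple-choice) scoring: each (question, dialogue,
# choice) sequence gets a scalar logit, then a softmax over the m choices.
def toy_score(question, dialogue, choice):
    # Placeholder for the encoder head: word overlap between dialogue and choice.
    return len(set(dialogue.split()) & set(choice.split()))

def pick_choice(question, dialogue, choices):
    logits = [toy_score(question, dialogue, c) for c in choices]
    exps = [math.exp(l) for l in logits]
    probs = [e / sum(exps) for e in exps]
    return choices[probs.index(max(probs))], probs

choices = ["yes", "no", "do not care", "not mentioned"]
best, probs = pick_choice("Does the user want parking?",
                          "yes , it must have free parking", choices)
```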

Canonicalization for Extractive Dialogue State Tracking
For extractive dialogue state tracking, it is common for the model to choose a span that is either a superset of the correct reference or has a similar meaning to the correct value but with different wording. Following this observation, we adopt a simple canonicalization procedure after our span-based model prediction. If the predicted value does not exist in the ontology of the slot, then we match the prediction with the value in the ontology that is closest to the predicted value in terms of edit distance. Note that this procedure is only applied at model inference time. At training time for extractive dialogue state tracking, the ontology is not required.
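The canonicalization step can be sketched with a plain Levenshtein edit distance; the hotel names in the toy ontology below are illustrative:

```python
# Sketch of canonicalization: if a predicted span is not in the slot's
# ontology, snap it to the closest ontology value by edit distance.
def edit_distance(a, b):
    # Classic Levenshtein dynamic program over one rolling row.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def canonicalize(prediction, ontology):
    if prediction in ontology:
        return prediction
    return min(ontology, key=lambda v: edit_distance(prediction, v))

ontology = ["gonville hotel", "acorn guest house", "alexander bed and breakfast"]
value = canonicalize("the gonville hotel", ontology)  # snaps to "gonville hotel"
```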

Two-stage Training
A two-stage training procedure is used to train the extractive and categorical dialogue state tracking models with both types of reading comprehension datasets (DREAM, RACE, and MRQA) and the dialogue state tracking dataset (MultiWOZ).

Dialogue State Tracking Training Stage After being trained on the reading comprehension datasets, we expect our models to be capable of answering (passage, question) pairs. In this stage, we further fine-tune these models on the MultiWOZ dataset.
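The two-stage schedule can be sketched as follows; this only records which corpus would be used at each step, and the epoch counts are arbitrary placeholders, not the paper's settings:

```python
# Sketch of the two-stage schedule: coarse-train on RC corpora (stage 1),
# then fine-tune on the DST corpus (stage 2). Epoch counts are made up.
def two_stage_schedule(rc_datasets, dst_dataset, rc_epochs=2, dst_epochs=3):
    log = []
    for _ in range(rc_epochs):      # Stage 1: reading comprehension
        for name in rc_datasets:
            log.append(("rc", name))
    for _ in range(dst_epochs):     # Stage 2: dialogue state tracking
        log.append(("dst", dst_dataset))
    return log

log = two_stage_schedule(["DREAM", "RACE", "MRQA"], "MultiWOZ")
```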

DST with Full Training Data
Model: Joint Goal Accuracy
SpanPtr (Xu and Hu, 2018): 29.09%
FJST (Eric et al., 2019): 38.00%
HyST (Goel et al., 2019): 39.10%
DSTreader: 36.40%
TRADE: 45.96%
DS-DST (Zhang et al., 2019): 51.21%
DSTQA w/span (Zhou and Small, 2019): 49.67%
DSTQA w/o span (Zhou and Small, 2019): —

We use the full data in MultiWOZ 2.1 to test our models. For the first 15 slots with the lowest number of possible values (from hotel.semi.type to hotel.book.day in Table 1), we use our proposed categorical dialogue state tracking model, whereas for the remaining 15 slots, we use the extractive dialogue state tracking model. We use the pre-trained RoBERTa-Large model (Liu et al., 2019) in our experiments. Table 5 summarizes the results. We can see that our model, STARC (State Tracking As Reading Comprehension), achieves close to the state-of-the-art accuracy on MultiWOZ 2.1 in the full data setting. It is worth noting that the best performing approach, DS-DST (Zhang et al., 2019), cherry-picks 9 slots as span-based slots whereas the remaining 21 slots are treated as categorical. Further, the second best result, DSTQA w/o span (Zhou and Small, 2019), does not use a span-based model for any slot. Unlike these state-of-the-art methods, our method simply categorizes the slots based on the number of values in the ontology. As a result, our approach uses fewer (15 as compared to 21 in DS-DST) and more reasonable (only those with few values in the ontology) categorical slots. Thus, our approach is more practical to apply in a real-world scenario.

Ablation Study
We also run an ablation study to understand which components of our model contribute to its accuracy. Table 6 summarizes the results. For fair comparison, we also report the numbers for DS-DST Threshold-10 (Zhang et al., 2019), which also uses the first 15 slots for the categorical model and the remaining slots for the extractive model. We observe that both the two-stage training strategy using reading comprehension data and canonicalization play important roles in achieving higher accuracy. Without the categorical model (using the extractive model for all slots), STARC is still able to achieve a joint goal accuracy of 47.86%. More interestingly, if we remove the categorical model as well as the canonicalization, the performance drops drastically, but is still slightly better than the purely extractive model of DS-DST.

Handling None Values Through error analysis of our models, we have learned that a model's performance on the None value has a significant impact on the overall accuracy. Table 7 summarizes our findings. We found that the plurality of errors for the extractive model come from cases where the ground truth is not None but the model predicted None. For the categorical model, the opposite was true: the majority of errors were from the model predicting a not None value when the ground truth is actually None. We leave further investigation of this issue as future work.
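The None / Not-None error bucketing behind this analysis can be sketched as follows (the labels are toy examples, not real MultiWOZ annotations):

```python
from collections import Counter

# Sketch: bucket each slot prediction by whether the gold and predicted
# values are None, mirroring the error categories discussed above.
def none_error_breakdown(gold, pred):
    buckets = Counter()
    for g, p in zip(gold, pred):
        if g == p:
            buckets["correct"] += 1
        elif g is None:
            buckets["gold None, predicted not None"] += 1
        elif p is None:
            buckets["gold not None, predicted None"] += 1
        else:
            buckets["wrong value"] += 1
    return buckets

gold = [None, "cheap", "north", None, "2"]
pred = ["cheap", None, "north", None, "3"]
stats = none_error_breakdown(gold, pred)
```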

Few-shot from RC to DST
In the few-shot setting, our models (both extractive and categorical) are pre-trained on reading comprehension datasets, and we randomly select a limited amount of target domain data for fine-tuning.
Unlike previous works, we do not use out-of-domain MultiWOZ data for training in the few-shot experiments. We evaluate our model with 1%, 5%, and 10% of the training data in the target domain. Table 8 shows the results of our model under this setting for five domains in MultiWOZ 2.0. We also report the few-shot results for two other models, TRADE and DSTQA (Zhou and Small, 2019), which perform the same few-shot experiments but are pre-trained with a holdout strategy, i.e., training on the other four domains in MultiWOZ and fine-tuning on the held-out domain. We can see that under all three data settings, our model outperforms the TRADE and DSTQA models (except in the attraction domain for DSTQA) by a large margin. Especially in the 1% data setting for the hotel domain, which contains the largest number of slots (10) among all five domains, the joint goal accuracy dropped to 19.73% for TRADE, while our model can still achieve a relatively high joint goal accuracy of 45.91%. This significant performance difference can be attributed to pre-training our models on reading comprehension datasets, which gives our models the ability to comprehend passages or dialogues (which we have empirically verified in the next section). The formulation of dialogue state tracking as a reading comprehension task helps the model transfer this comprehension capability. We also tried to repeat these experiments with the vanilla pre-trained RoBERTa-Large model (without pre-training on RC datasets), but we could not even get these models to converge in such low-resource data settings. This further highlights the importance of RC pre-training for low-resource dialogue state tracking.
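Selecting a few-shot subset of the target domain can be sketched as a seeded random sample; the dialogue IDs below are placeholders:

```python
import random

# Sketch of few-shot data selection: randomly keep a fixed fraction of the
# target-domain training dialogues (e.g. 1%), as in the few-shot experiments.
def sample_few_shot(dialogue_ids, fraction, seed=0):
    rng = random.Random(seed)  # seeded for reproducibility
    k = max(1, int(len(dialogue_ids) * fraction))
    return rng.sample(dialogue_ids, k)

subset = sample_few_shot([f"dial_{i}" for i in range(2000)], fraction=0.01)
```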

Zero-shot from RC to DST
In the zero-shot experiments, we want to investigate how the reading comprehension models would behave on the MultiWOZ dataset without any training on state tracking data. To do so, we train our models on reading comprehension datasets and test on MultiWOZ 2.1. Note that, in this setting, we only take labels in MultiWOZ 2.1 that are not missing, ignoring the data that is "None" in the dialogue state. For the zero-shot experiments from multiple-choice RC to DST, we take the first fifteen slots in Table 1 that are classified as categorical. For zero-shot from span-based RC to DST, we take the twenty-seven slots that are extractive, i.e., all except the first three slots in Table 1. Figure 3 summarizes the results for the hotel, restaurant, taxi, and train domains in MultiWOZ 2.1. For the attraction domain, please refer to the supplementary section A. We can see that most of the slots have an average accuracy of at least 50% in both the multiple-choice RC and span-based RC approaches, indicating the effectiveness of RC data. For some slots, such as hotel.stay, hotel.people, hotel.day, restaurant.people, restaurant.day, and train.day, we are able to achieve very high zero-shot accuracy (greater than 90%). The zero-shot setting in TRADE, where the transfer is from the four source domains to the held-out target domain, fails completely on certain slot types like hotel.name. In contrast, our zero-shot experiments from RC to DST are able to transfer on almost all the slots. Table 9 illustrates zero-shot examples for the span-based RC model. We can see that although the span-based RC model does not directly point to the state value itself, it usually points to a span that contains the ground truth state value, and the canonicalization procedure then turns the span into the actual slot value.
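The evaluation convention described above (ignore turns whose gold label is None) can be sketched as follows, with toy labels:

```python
# Sketch of the zero-shot evaluation: average slot accuracy computed only
# over examples where the gold label is not None, as described above.
def slot_accuracy_ignoring_none(gold, pred):
    pairs = [(g, p) for g, p in zip(gold, pred) if g is not None]
    if not pairs:
        return 0.0
    return sum(g == p for g, p in pairs) / len(pairs)

gold = ["monday", None, "tuesday", "friday"]
pred = ["monday", "sunday", "tuesday", "saturday"]
acc = slot_accuracy_ignoring_none(gold, pred)  # 2 of 3 non-None labels match
```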
Such predicted spans can be viewed as evidence for arriving at the ground-truth state values.

[Figure residue: per-slot counts for the hotel (day, people, stay, stars, pricerange, area, parking, internet, type), restaurant (food, time, day, people, area, pricerange), taxi (destination, leaveat, arriveby, departure), and train (arriveby, departure, destination, people, day) domains.]

Conclusion
Task-oriented dialogue systems aim to help users achieve a variety of tasks. It is not unusual to have hundreds of different domains in modern task-oriented virtual assistants. How can we ensure that a dialogue system is robust enough to scale to different tasks given a limited amount of data? Some approaches focus on domain expansion by training on several source domains and then adapting to the target domain. While such methods can be successful in certain cases, it is hard for them to generalize to completely different out-of-domain tasks.
Machine reading comprehension provides a clear and general basis for understanding context given a wide variety of questions. By formulating dialogue state tracking as reading comprehension, we can utilize the recent advances in reading comprehension models. More importantly, we can utilize reading comprehension datasets to mitigate some of the resource issues in task-oriented dialogue systems. As a result, we achieve much higher accuracy in dialogue state tracking across different domains given a limited amount of data, compared to existing methods. As the variety of tasks and functionalities in dialogue systems continues to grow, general methods for tracking dialogue state across all tasks will become increasingly necessary. We hope that the developments suggested here will help to address this need.

B Question Formation for Reading Comprehension
The structural construct and the surface form of a question can have an impact on the performance of RC models. In this work, we handcrafted a question for each slot that needs to be tracked. Each question roughly asks: What is the value of the slot that the user is interested in? The exact question was tailored to each specific slot, also taking domains into account. We experimented with two sets of handcrafted questions. The first set was created in a procedural manner, largely following a template. The other was created in a more free-form manner and was more natural. We did not notice any significant difference in model performance between the two sets. However, we did not explore this dimension any further and leave it to future work. An interesting future direction could be to use a decoder to generate questions given the slot description as input.