Dialog State Tracking: A Neural Reading Comprehension Approach

Dialog state tracking is used to estimate the current belief state of a dialog given all the preceding conversation. Machine reading comprehension, on the other hand, focuses on building systems that read passages of text and answer questions that require some understanding of the passages. We formulate dialog state tracking as a reading comprehension task: answering the question "what is the state of the current dialog?" after reading the conversational context. In contrast to traditional state tracking methods, where the dialog state is often predicted as a distribution over a closed set of all possible slot values within an ontology, our method uses a simple attention-based neural network to point to the slot values within the conversation. Experiments on the MultiWOZ-2.0 cross-domain dialog dataset show that our simple system obtains accuracies similar to those of previous, more complex methods. By exploiting recent advances in contextual word embeddings, adding a model that explicitly tracks whether a slot value should be carried over to the next turn, and combining our method with a traditional joint state tracking method that relies on a closed-set vocabulary, we obtain a joint-goal accuracy of 47.33% on the standard test split, exceeding the current state-of-the-art by 11.75%**.


Introduction
A task-oriented spoken dialog system involves continuous interaction between a machine agent and a human who wants to accomplish a predefined task through speech. Broadly speaking, the system has four components: the Automatic Speech Recognition (ASR) module, the Natural Language Understanding (NLU) module, the Natural Language Generation (NLG) module, and the Dialog Manager. The dialog manager has two primary missions: dialog state tracking (DST) and decision making. At each dialog turn, the state tracker updates the belief state based on the information received from the ASR and NLU modules. Subsequently, the dialog manager chooses an action based on the dialog state, the dialog policy, and the backend results produced by previously executed actions. Table 1 shows an example conversation with the associated dialog state. A typical dialog state tracking system combines user speech, NLU output, and context from previous turns to track what has happened in a dialog. More specifically, the dialog state at each turn is defined as a distribution over a set of predefined variables (Williams et al., 2005). The distributions output by a dialog state tracker are sometimes referred to as the tracker's belief or the belief state. Typically, the tracker has complete access to the history of the dialog up to the current turn.
* Authors contributed equally.
** We note that after publication, a new state-of-the-art can now be obtained with a similar attention mechanism followed by an encoder-decoder architecture (Wu et al., 2019).
Traditional machine learning approaches to dialog state tracking take two forms, generative and discriminative. In generative approaches, a dialog is modeled as a dynamic Bayesian network in which the true dialog state and the true user action are unobserved random variables (Williams and Young, 2007), whereas discriminative approaches directly model the distribution over the dialog state given arbitrary input features.
Despite the popularity of these approaches, they often suffer from a common yet overlooked problem: reliance on fixed ontologies. These systems therefore have trouble handling previously unseen mentions. Reading comprehension tasks (Rajpurkar et al., 2016; Chen et al., 2017; Reddy et al., 2019), on the other hand, require finding answer spans within a given passage, so state-of-the-art models are designed in such a way that a fixed answer vocabulary is usually not required. Motivated by the limitations of previous dialog state tracking methods and the recent advances in reading comprehension (Chen, 2018), we propose a reading comprehension based approach to dialog state tracking. In our approach, we view the dialog as a passage and ask the question "what is the state of the current dialog?" We use a simple attention-based neural network model to find answer spans by directly pointing to tokens within the dialog, similar to Chen et al. (2017). In addition to this attentive reading model, we introduce two simple models into our dialog state tracking pipeline: a slot carryover model, which helps the tracker make a binary decision on whether the slot values from the previous turn should be used, and a slot type model, which predicts whether the answer is one of {Yes, No, DontCare, Span}, similar to Zhu et al. (2018). To summarize our contributions:
• We formulate dialog state tracking as a reading comprehension task and propose a simple attention-based neural network to find the state answer as a span over tokens within the dialog. Our approach overcomes the fixed-vocabulary limitation of previous approaches and can generalize to unseen state values.
• We present the task of dialog state tracking as making three sequential decisions: (i) a binary carryover decision by a simple slot carryover model, (ii) a slot type decision by a slot type model, and (iii) a slot span decision by an attentive reading comprehension model. We show the effectiveness of this approach.
• We incorporate recent progress in large pre-trained contextual word embeddings, i.e., BERT (Devlin et al., 2018), into dialog state tracking and obtain considerable improvement.
• We show that our proposed model outperforms more complex previously published methods on the recently released MultiWOZ-2.0 corpus (Ramadan et al., 2018). Our approach achieves a joint-goal accuracy of 42.12%, a 6.5% absolute improvement over the previous state-of-the-art. By combining our method with the joint state tracking method of [...] (2017), we achieve a joint-goal accuracy of 47.33%, further advancing the state-of-the-art by 11.75%.
• We provide an in-depth error analysis of our methods on the MultiWOZ-2.0 dataset and explain to what extent an attention-based reading comprehension model can be effective for dialog state tracking and inspire future improvements on this model.

Related Work
Dialog State Tracking Traditionally, dialog state tracking methods assume a fixed ontology, wherein the output space of a slot is constrained by a predefined set of possible values (Liu and Lane, 2017). However, these approaches are not applicable to unseen values and do not scale to large or potentially unbounded vocabularies (Nouri and Hosseini-Asl, 2018). To address these concerns, a class of methods that employ scoring mechanisms to predict the slot value from an endogenously defined set of candidates has been proposed (Rastogi et al., 2017; Goel et al., 2018). In these methods, the candidates are derived either from a predefined ontology or by extracting words or n-grams from the prior dialog context. Previously, Perez and Liu (2017) also formulated state tracking as a machine reading comprehension problem. However, their model architecture used a memory network, which is relatively complex and still assumes a fixed-set vocabulary. Perhaps the technique most similar to our work is the pointer network proposed by Xu and Hu (2018), wherein an attention-based mechanism is employed to point to the start and end tokens of a slot value. However, their formulation does not incorporate a slot carryover component and outlines an encoder-decoder architecture in which the slot type embeddings are derived from the last state of the RNN.
Reading Comprehension A reading comprehension task is commonly formulated as a supervised learning problem in which, for a given training dataset, the goal is to learn a predictor that takes a passage p and a corresponding question q as inputs and gives the answer a as output. In these tasks, an answer can be cloze-style as in CNN/Daily Mail (Hermann et al., 2015), multiple choice as in MCTest (Richardson et al., 2013), a predicted span as in SQuAD (Rajpurkar et al., 2016), or free-form as in NarrativeQA (Kočiskỳ et al., 2018). In span prediction tasks, most models encode the question into an embedding, generate an embedding for each token in the passage, and then apply a similarity function with an attention mechanism between the question and the words in the passage to decide the starting and ending positions of the answer span (Chen et al., 2017; Chen, 2018). This approach is fairly generic and can be extended to multiple choice questions by employing a bilinear product for different types (Lai et al., 2017), or to free-form text by employing seq-to-seq models (Sutskever et al., 2014).

Deep Contextual Word Embeddings
Recent advancements in the neural representation of words include character embeddings (Seo et al., 2016) and, more recently, contextualized embeddings such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018). These methods are usually trained on a very large corpus using a language model objective and show superior results across a variety of tasks. Given their wide applicability (Liu et al., 2019), we employ these architectures in our dialog state tracking task.
Our Approach

DST as Reading Comprehension
Let us denote a sub-dialog $D_t$ of a dialog $D$ as the prefix of the full dialog ending with the user's $t$-th utterance. The state of the dialog $D_t$ is then defined by the values of its constituent slots $s_j(t)$, i.e., $S_t = \{s_1(t), s_2(t), \ldots, s_j(t), \ldots, s_M(t)\}$.
Using the terminology of reading comprehension tasks, we can treat $D_t$ as a passage, and for each slot $i$ we formulate a question $q_i$: what is the value for slot $i$? The dialog state tracking task then becomes understanding a sub-dialog $D_t$ and answering the question $q_i$ for each slot $i$.
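As a minimal illustration of this formulation, the sketch below treats a sub-dialog as a passage and builds one question per slot. The slot inventory, the function names, and the assumption that turns strictly alternate starting with the user are all hypothetical and chosen only for this example.

```python
from typing import Dict, List, Optional

# Hypothetical slot inventory for illustration (the paper uses 37 slots).
SLOTS: List[str] = [
    "restaurant.semi.food",
    "restaurant.book.people",
    "hotel.book.day",
]


def build_questions(slots: List[str]) -> Dict[str, str]:
    """Create one question q_i per slot i, phrased as in the text."""
    return {slot: f"what is the value for slot {slot}?" for slot in slots}


def sub_dialog(turns: List[str], t: int) -> List[str]:
    """Return D_t, the dialog prefix ending with the user's t-th utterance.

    Assumes `turns` alternates user, agent, user, ... and starts with the
    user, so the t-th user utterance (1-indexed) sits at index 2 * (t - 1).
    """
    last = 2 * (t - 1)
    if t < 1 or last >= len(turns):
        raise ValueError("t is out of range for this dialog")
    return turns[: last + 1]


def empty_state(slots: List[str]) -> Dict[str, Optional[str]]:
    """Represent S_t as a mapping from each slot to its value (None if unset)."""
    return {slot: None for slot in slots}
```

Under this layout, tracking the state at turn t amounts to answering every question in `build_questions(SLOTS)` against the passage `sub_dialog(turns, t)`.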

Encoding
Dialog Encoding For a given dialog $D_t$ at turn $t$, we first concatenate the user utterances and agent utterances $\{u_1, a_1, u_2, a_2, \ldots, u_t\}$. To differentiate between user and agent utterances, we add the symbol [U] before each user utterance and [A] before each agent utterance. Then, we use pre-trained word vectors to form $p_i$ for each token in the dialog sequence and pass them as input into a recurrent neural network, i.e.,
$$\{d_1, d_2, \ldots, d_L\} = \mathrm{RNN}(\{p_1, p_2, \ldots, p_L\})$$
where $L$ is the total length of the concatenated dialog sequence and $d_i$ is the output of the RNN for each token, which is expected to encode context-aware information about the token. In particular, for the pre-trained word vectors $p_i$, we experiment with deep contextualized word embeddings from BERT (Devlin et al., 2018). For the RNN, we use a one-layer bidirectional long short-term memory network (LSTM), and each $d_i$ is the concatenation of the outputs of the two LSTMs from both directions, i.e., $d_i = (\overrightarrow{d_i}; \overleftarrow{d_i})$. Furthermore, we denote $e(t)$ as our dialog embedding at turn $t$ as follows:
$$e(t) = (\overrightarrow{d_L}; \overleftarrow{d_1})$$
Question Encoding In our methodology, we formulate the questions $q_i$ defined earlier as "what is the value for slot $i$?" For each dialog, there are $M$ similar questions corresponding to the $M$ slots; therefore, we represent each question $q_i$ as a fixed-dimension vector $q_i$ to be learned.
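The flattening step described above can be sketched as follows. This is an illustrative sketch only: the function name, the (speaker, utterance) input layout, and whitespace tokenization are assumptions of the example, not details given in the text.

```python
from typing import List, Tuple


def flatten_dialog(turns: List[Tuple[str, str]]) -> List[str]:
    """Concatenate utterances into one token sequence with speaker markers.

    `turns` is a list of (speaker, utterance) pairs, where speaker is either
    "user" or "agent". Whitespace tokenization is a simplification made for
    this sketch.
    """
    markers = {"user": "[U]", "agent": "[A]"}
    tokens: List[str] = []
    for speaker, utterance in turns:
        tokens.append(markers[speaker])    # [U] or [A] precedes each utterance
        tokens.extend(utterance.split())
    return tokens
```

The resulting token list plays the role of the passage whose tokens are embedded and fed to the recurrent encoder.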

Models
Overview In our full model setup, three different model components are used to make a sequence of predictions. First, we use a slot carryover model to decide whether to carry over a slot value from the last turn. If the first model decides not to carry over, a slot type model is executed to predict the type of the answer from the set {Yes, No, DontCare, Span}. If the slot type model predicts Span, the slot span model finally predicts the slot value as a span of tokens within the dialog. The full model architecture is shown in Figure 1.

[Figure 1: model architecture, with a Context Encoding Layer, a Contextual Representation layer, and a Prediction Layer that includes Slot Span Prediction (Attention + Softmax) over Start and End positions.]

Slot Carryover Model To model the dynamic nature of the dialog state, we introduce a model whose purpose is to decide whether to carry over a slot value from the previous turn. For a given slot $s_j$, $C_j(t) = 1$ if $s_j(t) \neq s_j(t-1)$ and $0$ if they are equal. We multiply the dialog embedding $e(t)$ with a fully connected layer $W_i$ to predict the change for slot $i$ as:
$$P(C_i(t) \mid D_t) = \sigma(W_i \cdot e(t))$$
The network architecture is shown in Figure 1. In our implementation, the weights $W_i$ for all slots are trained together, i.e., the neural network predicts the slot carryover change $C_i(t)$ jointly for all $M$ slots.
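The sequential decision procedure and the carryover labels can be sketched as below. This is a hedged illustration, not the authors' implementation: it assumes that $C_j(t) = 1$ marks a slot whose value changed since the previous turn, the three models are stood in for by hypothetical callables, and returning the lowercased type name for non-span answers is a choice made only for this example.

```python
from typing import Callable, Dict, List, Optional

State = Dict[str, Optional[str]]


def change_labels(prev: State, curr: State, slots: List[str]) -> Dict[str, int]:
    """C_j(t) = 1 if slot j changed since the previous turn, else 0."""
    return {s: int(curr.get(s) != prev.get(s)) for s in slots}


def track_slot(
    prev_value: Optional[str],
    carry_over: Callable[[], bool],
    slot_type: Callable[[], str],
    slot_span: Callable[[], str],
) -> Optional[str]:
    """Apply the three decisions in sequence for one slot at one turn.

    The callables are hypothetical stand-ins for the trained models.
    """
    if carry_over():        # decision 1: keep the previous value unchanged
        return prev_value
    kind = slot_type()      # decision 2: Yes / No / DontCare / Span
    if kind == "Span":
        return slot_span()  # decision 3: point to a span in the dialog
    return kind.lower()     # assumption: non-span types map to a literal value
```

Because the later models are only consulted when the carryover model predicts a change, an error at the first step cannot be repaired downstream.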
Slot Type Model A typical dialog state comprises slots whose values can be either categorical or named entities within the context of the conversation. To adopt a flexible approach, and inspired by state-of-the-art reading comprehension approaches, we propose a classifier that predicts the type of the slot value at each turn. In our setting, we prescribe the output space to be {Yes, No, DontCare, Span}, where Span indicates that the slot value is a named entity which can be found within the dialog. As shown in Figure 1, we concatenate the dialog embedding $e(t)$ with the question encoding $q_i$ for slot $i$ as the input to the affine layer $A$ to predict the slot type $T_i(t)$ as:
$$P(T_i(t) \mid D_t) = \mathrm{softmax}(A \cdot (e(t); q_i))$$
Slot Span Model We map our slot values into a span with a start and an end position in our flattened conversation $D_t$. We then use the dialog encoding vectors $\{d_1, d_2, \ldots, d_L\}$ and the question vector $q_i$ to compute the bilinear product and train two classifiers to predict the start position and the end position of the slot value. More specifically, for slot $j$,
$$P^{(start)}_j(x) = \frac{\exp(d_x \Theta^{(start)} q_j)}{\sum_{x'} \exp(d_{x'} \Theta^{(start)} q_j)}$$
Similarly, we define $P^{(end)}_j(x)$ with $\Theta^{(end)}$. During span inference, we choose the best span from word $i$ to word $i'$ such that $i \leq i'$ and $P^{(start)}_j(i) \times P^{(end)}_j(i')$ is maximized.
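A minimal pure-Python sketch of the span model's scoring and inference follows. It assumes the span score is the product of the start and end probabilities, in the style of Chen et al. (2017); the function names and the optional `max_len` cap are hypothetical additions for illustration and are not described in the text.

```python
import math
from typing import List, Optional, Tuple


def bilinear_softmax(
    d: List[List[float]], theta: List[List[float]], q: List[float]
) -> List[float]:
    """Softmax over positions x of the bilinear score d_x^T Theta q."""
    theta_q = [sum(row[b] * q[b] for b in range(len(q))) for row in theta]
    logits = [sum(dx[a] * theta_q[a] for a in range(len(theta_q))) for dx in d]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(
    p_start: List[float], p_end: List[float], max_len: Optional[int] = None
) -> Tuple[int, int]:
    """Return (i, i') with i <= i' maximizing p_start[i] * p_end[i']."""
    if not p_start or len(p_start) != len(p_end):
        raise ValueError("p_start and p_end must be non-empty and equal length")
    best, best_score = (0, 0), -1.0
    for i, ps in enumerate(p_start):
        stop = len(p_end) if max_len is None else min(len(p_end), i + max_len)
        for j in range(i, stop):
            score = ps * p_end[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best
```

The exhaustive search is quadratic in the dialog length; capping the span length with `max_len` is one common way to bound that cost.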

Data
We use the recently released MultiWOZ-2.0 dataset (Ramadan et al., 2018) to test our approach. This dataset consists of multi-domain conversations from seven domains with a total of 37 slots across domains. Many of these slot types, such as day and people, are shared across multiple domains. In our experiments, we process each slot independently by considering the concatenation of slot domain, slot category, and slot name, e.g., {bus.book.people}, {restaurant.semi.food}. An example conversation is shown in Table 1. We use the standard training/development/test splits provided with the dataset.
It is worth noting that the dataset in its current form has certain annotation errors. First, there is a lack of consistency between the slot values in the ontology and the ground truth in the context of the dialog. For example, the ontology has moderate but the dialog context has moderately. Second, there are erroneous delays in the state updates, sometimes extending over multiple turns. This error negatively impacts the performance of the slot carryover model.

Table 2: Joint goal accuracy on the MultiWOZ-2.0 test set.
Method                                      Accuracy
MultiWOZ Benchmark                          25.83%
GLAD (Zhong et al., 2018)                   35.57%
GCE (Nouri and Hosseini-Asl, 2018)          35.58%
Our approach (single)                       39.41%
Our approach (ensemble)                     42.12%
HyST (ensemble) (Goel et al., 2019)         44.22%
Our approach + JST (ensemble)               47.33%

Experimental Setup
We train our three models independently without sharing the dialog context. For all three models, we encode the word tokens with BERT (Devlin et al., 2018) followed by an affine layer with 200 hidden units. This output is then fed into a one-layer bidirectional LSTM with 50 hidden units to obtain the contextual representation, as shown in Figure 1. In all our experiments, we keep the parameters of the BERT embeddings frozen. For the slot carryover model, we predict a binary vector over the 37 slots jointly to obtain the decision of whether to carry over the value for each slot. For the slot type and slot span models, we treat dialog-question pairs $(D_t, q_i)$ as separate prediction tasks for each slot.
We use a learning rate of 0.001 with the Adam optimizer and a batch size of 32 for all three models. We stop training when the loss on the development set has not decreased for ten epochs.
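The hyperparameters and the stopping rule above can be collected in a short sketch. The class and function names are hypothetical, and the stopping rule is one plausible reading of the text (stop once the development loss has failed to improve on its best value for `patience` consecutive epochs).

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters stated in the text; field names are illustrative."""
    learning_rate: float = 1e-3   # Adam learning rate
    batch_size: int = 32
    affine_units: int = 200       # affine layer on top of frozen BERT
    lstm_units: int = 50          # one-layer bidirectional LSTM
    patience: int = 10            # epochs without dev-loss improvement


def stopping_epoch(dev_losses: Iterable[float], patience: int = 10) -> Optional[int]:
    """Return the 1-indexed epoch at which training stops, or None if never."""
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(dev_losses, start=1):
        if loss < best:
            best, stale = loss, 0   # new best: reset the counter
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None
```

For example, with `patience=3` and development losses `[1.0, 0.9, 0.95, 0.97, 0.96]`, training stops at epoch 5, three epochs after the best loss at epoch 2.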

Results
Table 2 presents our results on the MultiWOZ-2.0 test set. We compare our methods with the global-local self-attention model (GLAD) (Zhong et al., 2018), the global-conditioned encoder model (GCE) (Nouri and Hosseini-Asl, 2018), and the hybrid joint state tracking model (OV ST+JST) (Liu and Lane, 2017; Goel et al., 2019). As in previous work, we report joint goal accuracy as our metric. For each user turn, joint goal accuracy checks whether the predicted state exactly matches the ground truth state for all slots. Our system achieves 39.41% joint goal accuracy with a single model and 42.12% with the ensemble model. Table 3 shows the accuracy for each slot type for both our method and the joint state tracking approach with a fixed vocabulary in Goel et al. (2019).
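The joint goal accuracy metric described above can be sketched as follows. The function names are hypothetical, and treating a missing slot as equivalent to a slot with value None is an assumption of this example.

```python
from typing import Dict, List, Optional

State = Dict[str, Optional[str]]


def states_match(predicted: State, gold: State) -> bool:
    """True only if every slot agrees; a missing slot counts as None."""
    slots = set(predicted) | set(gold)
    return all(predicted.get(s) == gold.get(s) for s in slots)


def joint_goal_accuracy(predicted: List[State], gold: List[State]) -> float:
    """Fraction of user turns whose full predicted state matches the gold state."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same turns")
    if not gold:
        return 0.0
    correct = sum(states_match(p, g) for p, g in zip(predicted, gold))
    return correct / len(gold)
```

Because a single wrong slot makes the whole turn incorrect, joint goal accuracy is much stricter than per-slot accuracy.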
Our approach tends to have higher accuracy on some of the slots that have a larger set of possible values, such as attraction.semi.name and taxi.semi.destination. However, it is worth noting that even for slots with smaller vocabulary sizes, such as hotel.book.day and hotel.semi.pricerange, our approach achieves better accuracy than the closed-vocabulary approach. Our hypothesis for this difference is that such information appears more frequently in user utterances, so our model is able to learn it more easily from the dialog context.
We also report the result for a hybrid model that combines our approach with the JST approach of Goel et al. (2019). Our combination strategy is as follows: first, we calculate the accuracy per slot type for each model on the development set; then, for each slot type, we use the predictions from either our model or the JST model, whichever has the higher accuracy on the development set. With this approach, we achieve a joint-goal accuracy of 46.28%. We hypothesize that this is because our method uses an open vocabulary, in which all possible values can only be obtained from the conversation, whereas the joint state tracking method uses a closed ontology; by combining the two methods, we get the best of both worlds.
Table 4 shows the ablation studies for our model on the development set. The contextual embedding BERT (Devlin et al., 2018) gives us gains of around 2%. As for the oracle models, even when using all the oracle results (ground truth), our development set accuracy is only 73.12%. This is because our approach only considers values within the conversation; if a value is not present in the dialog, the oracle models fail. Interestingly, if we replace our slot carryover model with an oracle one, the accuracy improves significantly to 60.18% (+19.08%), compared to replacing the other two models (41.43% and 45.77%). This is because our span-based reading comprehension model already achieves an accuracy as high as 96% per slot on the development data, leaving little room for improvement, whereas our binary slot carryover model achieves an accuracy of only 72% per turn. We hypothesize that the slot carryover problem is imbalanced, i.e., there are significantly more slot carryovers than slot updates, which makes model training and prediction harder. This suggests that further improvements to the slot carryover model are needed to raise the overall state tracking accuracy.
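The per-slot combination strategy for the hybrid model can be sketched as below. The function names are hypothetical, the accuracy values in the usage note are made up for illustration, and breaking ties in favor of our model is an assumption of this example.

```python
from typing import Dict, Optional

State = Dict[str, Optional[str]]


def choose_source(
    dev_acc_ours: Dict[str, float], dev_acc_jst: Dict[str, float]
) -> Dict[str, str]:
    """Pick, per slot, the model with the higher development-set accuracy.

    Ties go to "ours" (an assumption of this sketch).
    """
    return {
        slot: "ours" if dev_acc_ours[slot] >= dev_acc_jst[slot] else "jst"
        for slot in dev_acc_ours
    }


def combine(pred_ours: State, pred_jst: State, source: Dict[str, str]) -> State:
    """Assemble the hybrid state by taking each slot from its chosen model."""
    return {
        slot: (pred_ours if source[slot] == "ours" else pred_jst).get(slot)
        for slot in source
    }
```

For instance, with illustrative development accuracies of 0.9 versus 0.8 on one slot and 0.5 versus 0.7 on another, the first slot is taken from our model and the second from JST. The selection is made once on the development set and then applied unchanged at test time.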

Error Analysis
In Table 5, we conduct an error analysis of our models and investigate their performance for different use cases. Since we formulate the problem as open-vocabulary state tracking, wherein the slot values are extracted from the dialog context, we divide the errors into the following categories:
• Unanswerable Slot Error This category contains two types of errors: (1) the ground truth slot has a non-None value, but our prediction is None; (2) the ground truth slot is None, but our prediction is a non-None value. This type of error can be attributed to incorrect predictions made by our slot carryover model.
• Imprecise Slot Reference In this type of error, multiple potential candidates exist in the context and the model refers to the incorrect entity in the conversation. This error can be largely attributed to the following reasons: (1) the model overfits to the set of tokens that it has seen more frequently in the training set; (2) the model does not generalize well to scenarios where the user corrects a previous entity; (3) the model incorrectly overfits to the order or position of the entity in the context. These reasons motivate future research on incorporating more neural reading comprehension approaches into dialog state tracking.
• Imprecise Slot Resolution In this type of error, we cannot find an exact match of the ground truth value in the dialog context. However, our predicted span is a paraphrase of, or very close in meaning to, the ground truth. This error is inherent in approaches that extract the slot value from the dialog context rather than from an ontology. Along similar lines, we also observe cases where the slot value in the dialog context is resolved (or canonicalized) to a different surface-form entity that is perhaps more amenable to downstream applications.
• Imprecise Slot Boundary In this category of errors, our model chooses a span that is either a superset or a subset of the correct reference. This error is especially frequent for proper nouns, where the model has a weaker signal for outlining the slot boundary precisely.
[...] Table 4, where an oracle slot carryover model gives us the largest boost in joint goal accuracy. Additionally, 12.9% of errors are due to imprecise slot resolution, which suggests resolving context words to the ontology as a future direction.
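One rough way to bucket errors into the four categories above is sketched below. This is a heuristic written for illustration, not the procedure used to produce Table 5: the function name, the substring-based tests, and the order in which the rules are checked are all assumptions.

```python
from typing import Optional


def categorize_error(
    predicted: Optional[str], gold: Optional[str], context: str
) -> str:
    """Assign a prediction to a coarse error bucket (heuristic sketch)."""
    if predicted == gold:
        return "correct"
    if predicted is None or gold is None:
        # One side is None and the other is not.
        return "unanswerable"
    if predicted in gold or gold in predicted:
        # The predicted span is a subset or superset of the reference.
        return "boundary"
    if gold not in context:
        # The exact gold value never appears in the dialog text.
        return "resolution"
    # The gold value is present, but the model pointed to another entity.
    return "reference"
```

String containment is only a proxy here; a paraphrase that happens to contain the gold value as a substring would be counted as a boundary error under this sketch.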

Evaluating Different Context Encoders for Slot Carryover Model
As shown in the oracle ablation studies in Table 4, the slot carryover model plays a significant role in our pipeline. We therefore explore different types of context encoders for the slot carryover model to see whether they improve performance; the results are given in Table 6. In addition to using a flat dialog context of user and agent turns, marked with [U] and [A], to predict carryover for every slot in the state, we explored a hierarchical context encoder with an utterance-level LSTM over each user and agent utterance and a dialog-level LSTM over the whole dialog, with both constrained and unconstrained context windows, similar to Liu and Lane (2017). However, we did not observe any significant performance change across the two variants, as shown in Table 6. Lastly, we employed self-attention over the flattened dialog context, in line with Vaswani et al. (2017). However, Table 6 shows that this strategy slightly hurts model performance. One hypothesis for the subpar performance of the slot carryover model is the inherent noise in the annotated data for state updates. Through a preliminary analysis of the development set, we encountered a few erroneous delays in the state updates, sometimes extending over multiple turns. Nevertheless, these experimental results motivate future research on slot carryover models for multi-domain conversations.

Analyzing Conversation Depth
In Table 7, we explore the relationship between the depth of a conversation and the performance of our models. More precisely, we segment a given set of dialogs into individual turns and measure the state accuracy for each of these segments. We mark a turn as correct only if all the slots in its state are predicted correctly. We observe that model performance degrades as the number of turns increases. The primary reason for this behavior is that an error committed earlier in the conversation can be carried over to later turns. This results in a strictly higher probability for a later turn to be incorrect compared to turns earlier in the conversation. These results motivate future research on state tracking models that are more robust to the depth of the conversation.
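The per-depth measurement can be sketched as a simple grouping of turn-level results by turn index. The function name and the (turn index, is-correct) record layout are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_turn(records: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Group turn-level correctness by turn index and average each group.

    Each record is (turn_index, is_correct), where is_correct is True only
    if every slot in that turn's state was predicted correctly.
    """
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    for turn, correct in records:
        totals[turn] += 1
        hits[turn] += int(correct)
    return {turn: hits[turn] / totals[turn] for turn in sorted(totals)}
```

Plotting or tabulating the returned mapping against the turn index makes the degradation with conversation depth directly visible.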

Conclusion
The problem of tracking a user's belief state in a dialog is a historically significant endeavor. In that context, research on dialog state tracking has been geared towards discriminative methods, which usually estimate the distribution of the user state over a fixed vocabulary. However, modern dialog systems present us with problems that require a large-scale perspective. It is not unusual to have thousands of slot values in the vocabulary, which can lead to millions of dialog variations. We therefore need a vocabulary-free way to pick out the slot values. How can we pick the slot values given an unbounded vocabulary? Some methods adopt a candidate generation mechanism to generate slot values and make a binary decision given the dialog context. An attention-based neural network gives a clear and general basis for selecting the slot values by pointing directly to spans in the context. While methods of this type have been proposed recently, we explored the idea further on the MultiWOZ-2.0 dataset. We introduced a simple attention-based neural network to encode the dialog context and point to the slot values within the conversation. We also introduced an additional slot carryover model and showed its impact on model performance. By incorporating deep contextual word embeddings and combining our method with the traditional fixed-vocabulary approach, we significantly improved the joint goal accuracy on MultiWOZ-2.0.
We also conducted a comprehensive analysis to see how far our proposed model can go. One interesting and significant finding from the ablation studies is the importance of the slot carryover model. We hope this finding can inspire future dialog state tracking research to work in this direction, i.e., predicting whether a slot of the state is None or not.
The field of machine reading comprehension has made significant progress in recent years. We believe human conversation can be viewed as a special type of context and we hope that the developments suggested here can help dialog related tasks benefit from modern reading comprehension models.