Scalable and Accurate Dialogue State Tracking via Hierarchical Sequence Generation

Existing approaches to dialogue state tracking (DST) rely on pre-defined ontologies consisting of a set of all possible slot types and values. Though such approaches exhibit promising performance on single-domain benchmarks, their computational complexity increases proportionally to the number of pre-defined slots that need tracking. This issue becomes more severe for multi-domain dialogues, which include larger numbers of slots. In this paper, we investigate how to approach DST using a generation framework without the pre-defined ontology list. Given each turn of user utterance and system response, we directly generate a sequence of belief states by applying a hierarchical encoder-decoder structure. In this way, the computational complexity of our model is constant regardless of the number of pre-defined slots. Experiments on both multi-domain and single-domain dialogue state tracking datasets show that our model not only scales easily with the increasing number of pre-defined domains and slots but also reaches state-of-the-art performance.


Introduction
A Dialogue State Tracker (DST) is a core component of a modular task-oriented dialogue system (Young et al., 2013). For each dialogue turn, a DST module takes a user utterance and the dialogue history as input, and outputs a belief estimate of the dialogue state. Then a machine action is decided based on the dialogue state according to a dialogue policy module, after which a machine response is generated.
Traditionally, a dialogue state consists of a set of requests and joint goals, both of which are represented by a set of slot-value pairs (e.g. (request, phone), (area, north), (food, Japanese)) (Henderson et al., 2014). In a recently proposed multi-domain dialogue state tracking dataset, MultiWoZ, a representation of the dialogue state consisting of a hierarchical structure of domain, slot, and value is proposed. This is a more practical scenario since dialogues often include multiple domains simultaneously.
Many recently proposed DSTs are based on pre-defined ontology lists that specify all possible slot values in advance. To generate a distribution over the candidate set, previous works often take each of the slot-value pairs as input for scoring. However, in real-world scenarios, it is often not practical to enumerate all possible slot-value pairs and perform scoring from a large, dynamically changing knowledge base (Xu and Hu, 2018). To tackle this problem, a popular direction is to build a fixed-length candidate set that is dynamically updated throughout the dialogue development. Table 1 briefly summarizes the inference time complexity of multiple state-of-the-art DST models following this direction. Since the inference complexity of all previous models is at least proportional to the number of slots, these models will struggle to scale to multi-domain datasets with much larger numbers of pre-defined slots.

Table 1: Inference time complexity (ITC) of DST models, where n is the number of slots and m is the number of values.
DST Models                        ITC
NBT-CNN                           O(mn)
MD-DST (Rastogi et al., 2017)     O(n)
GLAD                              O(mn)
StateNet PSI (Ren et al., 2018)   O(n)
TRADE (Wu et al., 2019)           O(n)
HyST (Goel et al., 2019)          O(n)
DSTRead (Gao et al., 2019)        O(n)
In this work, we formulate the dialogue state tracking task as a sequence generation problem, instead of formulating the task as a pair-wise prediction problem as in existing work. We propose the COnditional MEmory Relation Network (COMER), a scalable and accurate dialogue state tracker that has a constant inference time complexity. Specifically, our model consists of an encoder-decoder network with a hierarchically stacked decoder that first generates the slot sequences in the belief state and then, for each slot, generates the corresponding value sequences. The parameters are shared among all of our decoders so that the model scales with the depth of the hierarchical structure of the belief states. COMER applies BERT contextualized word embeddings (Devlin et al., 2018) and BPE (Sennrich et al., 2016) for sequence encoding to ensure the uniqueness of the representations of unseen words. The word embeddings for sequence generation are initialized and fixed with the static word embeddings generated from BERT, which gives the model the potential to generate unseen words.
Figure 1 shows a multi-domain dialogue in which the user wants the system to first help book a train and then reserve a hotel. For each turn, the DST needs to track the slot-value pairs (e.g. (arrive by, 20:45)) representing the user goals as well as the domain that the slot-value pairs belong to (e.g. train, hotel). Instead of representing the belief state via a hierarchical structure, one can also combine the domain and slot together to form a combined slot-value pair (e.g. (train; arrive by, 20:45) where the combined slot is "train; arrive by"), which ignores the subordination relationship between the domain and the slots.

Motivation
A typical fallacy in dialogue state tracking datasets is the assumption that a slot in a belief state can only be mapped to a single value in a dialogue turn. We call this the single value assumption. Figure 2 shows an example of this fallacy from the WoZ2.0 dataset: based on the belief state label (food, seafood), it will be impossible for the downstream module in the dialogue system to generate sample responses that return information about Chinese restaurants. A correct representation of the belief state could be (food, seafood > chinese). This would tell the system to first search the database for information about seafood and then Chinese restaurants. The logical operator ">" indicates which retrieved information should have a higher priority to be returned to the user. Thus we are interested in building DST modules capable of generating structured sequences, since this kind of sequence representation of the value is critical for accurately capturing the belief states of a dialogue.

Figure 1: An example dialogue from the multi-domain dataset, MultiWOZ. At each turn, the DST needs to output the belief state, a nested tuple of (DOMAIN, (SLOT, VALUE)), immediately after the user utterance ends. The belief state is accumulated as the dialogue proceeds. Turns are separated by black lines.

Figure 2: An example in the WoZ2.0 dataset that invalidates the single value assumption. It is impossible for the system to generate the sample response about the Chinese restaurant with the original belief state (food, seafood). A correction could be made as (food, seafood > chinese) which has multiple values and a logical operator ">".

Hierarchical Sequence Generation for DST
Given a dialogue D which consists of T turns of user utterances and system actions, our target is to predict the state at each turn. Different from previous methods which formulate multi-label state prediction as a collection of binary prediction problems, COMER adapts the task into a sequence generation problem via a Seq2Seq framework. As shown in Figure 3, COMER consists of three encoders and three hierarchically stacked decoders. We propose a novel Conditional Memory Relation Decoder (CMRD) for sequence decoding. Each encoder includes an embedding layer and a BiLSTM. The encoders take in the user utterance, the previous system actions, and the previous belief states at the current turn, and encode them into the embedding space. The user encoder and the system encoder use the fixed BERT model as the embedding layer.
Since the slot-value pairs are unordered set elements of a domain in the belief states, we first order the sequence of domains according to their frequencies as they appear in the training set (Yang et al., 2018), and then order the slot-value pairs in the domain according to the slots' frequencies as they appear in that domain. After sorting the state elements, we represent the belief states following the paradigm: (Domain1-Slot1, Value1; Slot2, Value2; ... Domain2-Slot1, Value1; ...) for a more concise representation compared with the nested tuple representation.
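The ordering and flattening described above can be illustrated with a minimal sketch. The function name, the frequency tables and the tie-breaking rule are hypothetical, and the counts are made up; only the ordering criteria and the "Domain-Slot, Value; ..." layout come from the text.

```python
from collections import Counter


def serialize_belief_state(state, domain_freq, slot_freq):
    """Flatten {domain: {slot: value}} into 'Domain-Slot, Value; Slot, Value; ...'.

    domain_freq: Counter of domain occurrences in the training set.
    slot_freq:   {domain: Counter of slot occurrences within that domain}.
    Both are hypothetical inputs that would be counted from the training labels.
    """
    # Most frequent domains first; ties are broken alphabetically (an assumption).
    domains = sorted(state, key=lambda d: (-domain_freq[d], d))
    parts = []
    for d in domains:
        in_domain = slot_freq.get(d, Counter())
        # Slots are ordered by their frequency inside this domain.
        slots = sorted(state[d], key=lambda s: (-in_domain[s], s))
        pairs = "; ".join(f"{s}, {state[d][s]}" for s in slots)
        parts.append(f"{d}-{pairs}")
    # Assumption: domains are separated by the same '; ' delimiter as slots.
    return "; ".join(parts)


# Made-up counts, for illustration only.
domain_freq = Counter({"train": 120, "hotel": 90})
slot_freq = {
    "train": Counter({"day": 70, "arrive by": 40}),
    "hotel": Counter({"area": 50}),
}
state = {"hotel": {"area": "north"}, "train": {"arrive by": "20:45", "day": "friday"}}
serialized = serialize_belief_state(state, domain_freq, slot_freq)
print(serialized)
```

A fixed ordering of this kind gives the decoder a single canonical target sequence for each belief state, which is why the sort is applied before training.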
All the CMRDs take the same representations from the system encoder, user encoder and the belief encoder as part of the input. In the procedure of hierarchical sequence generation, the first CMRD takes a zero vector for its condition input $c$, and generates a sequence of the domains, $D$, as well as the hidden representation of domains $H_D$. For each $d$ in $D$, the second CMRD then takes the corresponding $h_d$ as the condition input and generates the slot sequence $S_d$, and representations, $H_{S,d}$. Then for each $s$ in $S_d$, the third CMRD generates the value sequence $V_{d,s}$ based on the corresponding $h_{s,d}$. We update the belief state with the new $(d, (s_d, V_{d,s}))$ pairs and perform the procedure iteratively until a dialogue is completed. All the CMR decoders share all of their parameters.
Since our model generates domains and slots instead of taking pre-defined slots as inputs, and the number of domains and slots generated each turn is only related to the complexity of the contents covered in a specific dialogue, the inference time complexity of COMER is O(1) with respect to the number of pre-defined slots and values.
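The three-level generation procedure can be sketched as a pair of nested loops over one shared decoder. Here `decode` is a hypothetical stand-in for a CMR decoder call and `TOY_OUTPUTS` is a made-up lookup table; the sketch only illustrates the control flow, not the network itself.

```python
def hierarchical_decode(decode, zero_condition):
    """Generate a nested belief state {domain: {slot: value}} with one shared decoder.

    decode(condition) is a hypothetical stand-in for a CMR decoder call: it
    returns a list of (token, hidden_representation) pairs generated under
    the given condition.
    """
    belief = {}
    for domain, h_d in decode(zero_condition):        # level 1: domain sequence
        belief[domain] = {}
        for slot, h_s in decode(h_d):                 # level 2: slots of this domain
            value_tokens = [token for token, _ in decode(h_s)]  # level 3: value tokens
            belief[domain][slot] = " ".join(value_tokens)
    return belief


# Toy decoder: a lookup table keyed by the condition (illustration only).
TOY_OUTPUTS = {
    "c0": [("train", "h_train")],
    "h_train": [("arrive by", "h_arrive"), ("day", "h_day")],
    "h_arrive": [("20:45", None)],
    "h_day": [("friday", None)],
}
calls = []


def toy_decode(condition):
    calls.append(condition)
    return TOY_OUTPUTS.get(condition, [])


belief = hierarchical_decode(toy_decode, "c0")
print(belief, len(calls))
```

In this toy run the decoder is called once for the domains, once for the slots of the generated domain, and once per generated slot. The call count follows what the dialogue turn contains, not the size of any ontology.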

Encoding Module
Let $X$ represent a user utterance or system transcript consisting of a sequence of words $\{w_1, \ldots, w_T\}$. The encoder first passes the sequence $\{[CLS], w_1, \ldots, w_T, [SEP]\}$ into a pre-trained BERT model and obtains its contextual embeddings $E_X$. Specifically, we leverage the output of all layers of BERT and take the average to obtain the contextual embeddings.
For each domain/slot that appears in the training set, if it has more than one word, such as 'price range', 'leave at', etc., we feed it into BERT and take the average of the word vectors to form the extra slot embedding $E_s$. In this way, we map each domain/slot to a fixed embedding, which allows us to generate a domain/slot as a whole instead of a token at each time step of domain/slot sequence decoding. We also construct a static vocabulary embedding $E_v$ by feeding each token in the BERT vocabulary into BERT. The final static word embedding $E$ is the concatenation of $E_v$ and $E_s$.
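As a rough illustration of how multi-word domain/slot names receive a single fixed embedding, the sketch below averages per-word vectors and appends the result to a vocabulary table. The two-dimensional vectors are made up; in the model they would come from BERT, and the helper names are hypothetical.

```python
def average_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_static_embedding(vocab_vectors, slot_word_vectors):
    """Combine vocabulary embeddings (E_v) with averaged slot embeddings (E_s).

    vocab_vectors:     {token: vector} for the ordinary vocabulary.
    slot_word_vectors: {multi-word name: [vector per word]}.
    A dict merge stands in for concatenating the two embedding tables.
    """
    table = dict(vocab_vectors)
    for name, word_vectors in slot_word_vectors.items():
        table[name] = average_vectors(word_vectors)
    return table


# Made-up 2-dimensional vectors, for illustration only.
vocab_vectors = {"price": [1.0, 0.0], "range": [0.0, 1.0]}
slot_word_vectors = {"price range": [[1.0, 0.0], [0.0, 1.0]]}
static_embedding = build_static_embedding(vocab_vectors, slot_word_vectors)
print(static_embedding["price range"])
```

With this table a name such as 'price range' occupies one row, so it can be emitted in a single decoding step.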
After we obtain the contextual embeddings for the user utterance, system action, and the static embeddings for the previous belief state, we feed each of them into a bidirectional LSTM (BiLSTM) (Hochreiter and Schmidhuber, 1997), whose hidden state is zero-initialized with $c_0$. The hidden size of the BiLSTM is $d_m/2$. We concatenate the forward and the backward hidden representations of each token from the BiLSTM to obtain the token representation $h_{k,t} \in \mathbb{R}^{d_m}$, $k \in \{a, u, B\}$ (system, user and belief, respectively), at each time step $t$. The hidden states of all time steps are concatenated to obtain the final representation $H_k \in \mathbb{R}^{T \times d_m}$, $k \in \{a, u, B\}$. The parameters are shared between all of the BiLSTMs.

Conditional Memory Relation Decoder
Inspired by Residual Dense Networks (Zhang et al., 2018), End-to-End Memory Networks (Sukhbaatar et al., 2015) and Relation Networks (Santoro et al., 2017), we here propose the Conditional Memory Relation Decoder (CMRD). Given a token embedding, $e_x$, CMRD outputs the next token, $s$, and the hidden representation, $h_s$, with the hierarchical memory access of different encoded information sources, $H_B$, $H_a$, $H_u$, and the relation reasoning under a certain given condition $c$. The final output matrices $S, H_s \in \mathbb{R}^{l_s \times d_m}$ are concatenations of all generated $s$ and $h_s$ (respectively) along the sequence length dimension, where $d_m$ is the model size, and $l_s$ is the generated sequence length. The general structure of the CMR decoder is shown in Figure 4. Note that the CMR decoder can support additional memory sources by adding the residual connection and the attention block, but here we only show the structure with three sources: belief state representation ($H_B$), system transcript representation ($H_a$), and user utterance representation ($H_u$), corresponding to a dialogue state tracking scenario. Since we share the parameters between all of the decoders, CMRD is actually a 2-dimensional auto-regressive model with respect to both the condition generation and the sequence generation task.
At each time step $t$, the CMR decoder first embeds the token $x_t$ with a fixed token embedding $E \in \mathbb{R}^{d_e \times d_v}$, where $d_e$ is the embedding size and $d_v$ is the vocabulary size. The initial token $x_0$ is "[CLS]". The embedded vector $e_{x_t}$ is then encoded with an LSTM, which emits a hidden representation $h_0$, where $q_t$ is the hidden state of the LSTM. $q_0$ is initialized with an average of the hidden states of the belief encoder, the system encoder and the user encoder, which produce $H_B$, $H_a$, $H_u$ respectively. $h_0$ is then summed (element-wise) with the condition representation $c \in \mathbb{R}^{d_m}$ to produce $h_1$, which is (1) fed into the attention module; (2) used for residual connection; and (3) concatenated with the other $h_i$ ($i > 1$) to produce the concatenated working memory, $r_0$, for relation reasoning, where $\mathrm{Attn}_k$ ($k \in \{\text{belief}, \text{sys}, \text{usr}\}$) are the attention modules applied respectively to $H_B$, $H_a$, $H_u$, and $\oplus$ means the concatenation operator. The gradients are blocked for $h_1$, $h_2$, $h_3$ during the backpropagation stage, since we only need them to work as the supplementary memories for the relation reasoning that follows.
The attention module takes a vector, $h \in \mathbb{R}^{d_m}$, and a matrix, $H \in \mathbb{R}^{d_m \times l}$, as input, where $l$ is the sequence length of the representation, and outputs $h_a$, a weighted sum of the column vectors in $H$.
The weights $W_1 \in \mathbb{R}^{d_m \times d_m}$, $W_2 \in \mathbb{R}^{d_m \times d_m}$ and the biases $b_1 \in \mathbb{R}^{d_m}$, $b_2 \in \mathbb{R}^{d_m}$ are the learnable parameters of the attention module. The order of the attention modules, i.e., first attend to the system and the user and then the belief, is decided empirically. We can interpret this hierarchical structure as the internal order for the memory processing, since in daily life people tend to attend to the most contemporary memories (system/user utterance) first and then attend to the older history (belief states). All of the parameters are shared between the attention modules.
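The following sketch shows one plausible reading of the attention module and the residual chain that builds the working memory. The exact attention formula is not reproduced in the text, so the form used here (project the query with $W_1$, score it against the columns of $H$ with a dot product, normalize with a softmax, then project the weighted sum with $W_2$) is an assumption chosen to match the stated parameter shapes. Gradient blocking is omitted, and all function names are hypothetical.

```python
import math


def matvec_T(W, x):
    """Compute W^T x for a matrix W given as a list of rows."""
    return [sum(W[i][j] * x[i] for i in range(len(x))) for j in range(len(W[0]))]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(h, H_cols, W1, b1, W2, b2):
    """Assumed attention form; H_cols holds the l column vectors of H (size d_m each)."""
    query = [q + b for q, b in zip(matvec_T(W1, h), b1)]
    weights = softmax([sum(q * x for q, x in zip(query, col)) for col in H_cols])
    context = [sum(w * col[i] for w, col in zip(weights, H_cols)) for i in range(len(h))]
    h_a = [o + b for o, b in zip(matvec_T(W2, context), b2)]
    return h_a, weights


def working_memory(h0, c, memories, attn):
    """Chain residual attention blocks and concatenate every intermediate state."""
    h = [a + b for a, b in zip(h0, c)]           # h_1 = h_0 + c
    states = [h]
    for H_cols in memories:                      # e.g. system, user, belief
        context = attn(h, H_cols)
        h = [a + b for a, b in zip(h, context)]  # residual connection
        states.append(h)
    return [x for state in states for x in state]  # concatenation of all states


# Toy check with identity projections and zero biases (d_m = 2, l = 2).
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
H_cols = [[1.0, 0.0], [0.0, 1.0]]
h_a, weights = attend([1.0, 0.0], H_cols, identity, zeros, identity, zeros)
r0 = working_memory([1.0, 0.0], zeros, [H_cols] * 3,
                    lambda h, H: attend(h, H, identity, zeros, identity, zeros)[0])
print(weights, len(r0))
```

With three memory sources the concatenated vector has $4 d_m$ entries, which agrees with the input size of the first MLP weight, $W_1 \in \mathbb{R}^{4d_m \times d_m}$.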
The concatenated working memory, $r_0$, is then fed into a Multi-Layer Perceptron (MLP) with four layers, where $\sigma$ is a non-linear activation, and the weights $W_1 \in \mathbb{R}^{4d_m \times d_m}$, $W_i \in \mathbb{R}^{d_m \times d_m}$ and the biases $b_1 \in \mathbb{R}^{d_m}$, $b_i \in \mathbb{R}^{d_m}$ are learnable parameters, with $2 \le i \le 4$. The number of layers for the MLP is decided by grid search. The hidden representation of the next token, $h_s$, is then (1) emitted out of the decoder as a representation; and (2) fed into a dropout layer with drop rate $p$, and a linear layer to generate the next token, where the weight $W_k \in \mathbb{R}^{d_m \times d_e}$ and the bias $b_k \in \mathbb{R}^{d_e}$ are learnable parameters. Since $d_e$ is the embedding size and the model parameters are independent of the vocabulary size, the CMR decoder can make predictions on a dynamic vocabulary and implicitly supports the generation of unseen words. When training the model, we minimize the cross-entropy loss between the output probabilities, $p_s$, and the given labels.
The WoZ2.0 dataset consists of 1,200 dialogues from the restaurant reservation domain with three pre-defined slots: food, price range, and area. Since the name slot rarely occurs in the dataset, it is not included in our experiments, following previous literature (Ren et al., 2018; Liu and Perez, 2017). Our model is also tested on the multi-domain dataset, MultiWoZ. It has a more complex ontology with 7 domains and 25 pre-defined slots. Since the combined slot-value pairs representation of the belief states has to be applied for the model with $O(n)$ ITC, the total number of slots is 35. The statistics of these two datasets are shown in Table 2.
Based on the statistics from these two datasets, we can calculate the theoretical Inference Time Multiplier (ITM), $K$, as a metric of scalability. Given the inference time complexity, ITM measures how many times slower a model will be when being transferred from the WoZ2.0 dataset, $d_1$, to the MultiWoZ dataset, $d_2$, where $O(x)$ means the Inference Time Complexity (ITC) of the variable $x$. For a model having an ITC of $O(1)$ with respect to the number of slots $n$, and values $m$, the ITM will be a multiplier of 2.15, while for an ITC of $O(n)$, it will be a multiplier of 25.1, and 1,143 for $O(mn)$.
As a convention, the metric of joint goal accuracy is used to compare our model to previous work. The joint goal accuracy only regards the model as making a successful belief state prediction if all of the slots and values predicted exactly match the labels provided. This metric gives a strict measurement that tells how often the DST module will not propagate errors to the downstream modules in a dialogue system. In this work, the model with the highest joint accuracy on the validation set is evaluated on the test set for the test joint accuracy measurement.
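To make the strictness of joint goal accuracy concrete, the following minimal sketch scores a turn as correct only when the predicted set of slot-value pairs equals the label set. The data layout (one set of tuples per turn) and the function name are assumptions for illustration, not the evaluation script used in the experiments.

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose predicted belief state matches the label exactly."""
    if len(predicted_states) != len(gold_states):
        raise ValueError("predictions and labels must cover the same turns")
    if not gold_states:
        return 0.0
    exact = sum(1 for pred, gold in zip(predicted_states, gold_states)
                if set(pred) == set(gold))
    return exact / len(gold_states)


# Three toy turns; the second prediction misses one slot-value pair.
gold = [
    {("food", "seafood")},
    {("food", "seafood"), ("area", "north")},
    {("area", "north")},
]
pred = [
    {("food", "seafood")},
    {("food", "seafood")},
    {("area", "north")},
]
accuracy = joint_goal_accuracy(pred, gold)
print(round(accuracy, 4))
```

A turn with one missing or one spurious pair counts as fully wrong, so partial credit never enters the score.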

Implementation Details
We use the BERT large model for both contextual and static embedding generation. All LSTMs in the model are stacked with 2 layers, and only the output of the last layer is taken as a hidden representation. ReLU non-linearity is used for the activation function, σ.
The hyper-parameters of our model are identical for both the WoZ2.0 and the MultiWoZ datasets: dropout rate $p = 0.5$, model size $d_m = 512$, embedding size $d_e = 1024$. For training on WoZ2.0, the model is trained with a batch size of 32 and the ADAM optimizer (Kingma and Ba, 2015) for 150 epochs, while for MultiWoZ, the AMSGrad optimizer (Reddi et al., 2018) and a batch size of 16 are adopted for 15 epochs of training. For both optimizers, we use a learning rate of 0.0005 with a gradient clip of 2.0. We initialize all weights in our model with Kaiming initialization (He et al., 2015) and adopt zero initialization for the bias. All experiments are conducted on a single NVIDIA GTX 1080Ti GPU.

Results
To measure the actual inference time multiplier of our model, we evaluate the runtime of the best-performing models on the validation sets of both the WoZ2.0 and MultiWoZ datasets. During evaluation, we set the batch size to 1 to avoid the influence of data parallelism and sequence padding. On the validation set of WoZ2.0, we obtain a runtime of 65.6 seconds, while on MultiWoZ, the runtime is 835.2 seconds. Results are averaged across 5 runs. Considering that the validation set of MultiWoZ is 5 times larger than that of WoZ2.0, the actual inference time multiplier is 2.54 for our model. Since the actual inference time multiplier is roughly of the same magnitude as the theoretical value of 2.15, we can confirm empirically that our model has O(1) inference time complexity and thus obtains full scalability to the number of slots and values pre-defined in an ontology.

Table (partially recovered): joint goal accuracy on WoZ2.0 and MultiWoZ, with ITC.
Model                                  WoZ2.0    MultiWoZ   ITC
[model name lost]                      84.2%     -          O(mn)
StateNet PSI (Ren et al., 2018)        88.9%     -          O(n)
GLAD (Nouri and Hosseini-Asl, 2018)    88.5%     35.58%     O(mn)
HyST (ensemble) (Goel et al., ...      [remaining rows lost]
Caption fragment: [...], while the baseline for the MultiWoZ dataset is taken from the official website of MultiWoZ.
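The multipliers quoted in the runtime comparison can be checked with a few lines of arithmetic. The measured value follows directly from the reported runtimes and the stated factor of 5 between the validation sets. For the O(n) case, the sketch assumes that the multiplier is the O(1) value scaled by the ratio of slot counts (35 combined slots on MultiWoZ against 3 on WoZ2.0), an assumption that reproduces the stated 25.1.

```python
# Reported validation runtimes in seconds (batch size 1, averaged over 5 runs).
woz_runtime_s = 65.6
multiwoz_runtime_s = 835.2
validation_size_ratio = 5.0   # MultiWoZ validation set is stated to be 5x larger

# Measured multiplier: runtime ratio normalized by the amount of data.
measured_itm = multiwoz_runtime_s / woz_runtime_s / validation_size_ratio

# Theoretical multipliers; the slot-ratio scaling for O(n) is an assumption.
theoretical_itm_o1 = 2.15
woz_slots, multiwoz_slots = 3, 35
theoretical_itm_on = theoretical_itm_o1 * multiwoz_slots / woz_slots

print(f"measured ITM:     {measured_itm:.3f}")
print(f"theoretical O(1): {theoretical_itm_o1:.2f}")
print(f"theoretical O(n): {theoretical_itm_on:.2f}")
```

The measured value is about 2.55 before truncation, close to the reported 2.54 and far below the multiplier of roughly 25 that an O(n) tracker would incur under the same assumption.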

Model                  Joint Acc.
COMER                  88.64%
- Hierarchical-Attn    86.69%
- MLP                  83.24%

Table 4: The ablation study on the WoZ2.0 dataset with the joint goal accuracy on the test set. For "- Hierarchical-Attn", we remove the residual connections between the attention modules in the CMR decoders, and all attention memory accesses are based on the output from the LSTM. For "- MLP", we further replace the MLP with a single linear layer with the non-linear activation.
[...] state-of-the-art, with a marginal drop of 0.3% compared with previous work. Considering the fact that WoZ2.0 is a relatively small dataset, this small difference does not represent a significant performance drop. On the multi-domain dataset, MultiWoZ, our model achieves a joint goal accuracy of 45.72%, which is significantly better than most of the previous models other than TRADE, which applies the copy mechanism and gains better generalization ability on named entity copying.

Ablation Study
To verify the effectiveness of the structure of the Conditional Memory Relation Decoder (CMRD), we conduct ablation experiments on the WoZ2.0 dataset. We observe an accuracy drop of 1.95% after removing residual connections and the hierarchical stack of our attention modules. This supports the effectiveness of our hierarchical attention design. After the MLP is replaced with a linear layer of hidden size 512 and the ReLU activation function, the accuracy further drops by 3.45%. This drop is partly due to the reduction of the number of the model parameters, but it also suggests that stacking more layers in an MLP can improve the relational reasoning performance given a concatenation of multiple representations from different sources.

Table 5: The ablation study on the MultiWoZ dataset with the joint domain accuracy (JD Acc.), joint domain-slot accuracy (JDS Acc.) and joint goal accuracy (JG Acc.) on the test set. For "-ShareParam", we remove the parameter sharing mechanism on the encoders and the attention module. For "-Order", we further arrange the order of the slots according to their global frequencies in the training set instead of the local frequencies given the domain they belong to. For "-Nested", we do not generate domain sequences but generate combined slot sequences which combine the domain and the slot together. For "-BlockGrad", we further remove the gradient blocking mechanism in the CMR decoder.
We also conduct the ablation study on the MultiWoZ dataset for a more precise analysis of the hierarchical generation process. For joint domain accuracy, we calculate the probability that all domains generated in each turn exactly match the labels provided. The joint domain-slot accuracy further calculates the probability that all domains and slots generated are correct, while the joint goal accuracy requires that all the domains, slots and values generated exactly match the labels. From Table 5, we can further calculate that, given the correct slot prediction, COMER has an 83.52% chance of making the correct value prediction. While COMER performs well on domain prediction (95.53%) and value prediction (83.52%), the accuracy of the slot prediction given the correct domain is only 57.30%. We suspect that this is because we only use the previous belief state to represent the dialogue history, so the inter-turn reasoning ability on the slot prediction suffers from the limited context and the accuracy is harmed by the multi-turn mapping problem (Wu et al., 2019). We can also see that the JDS Acc. has an absolute boost of 5.48% when we switch from the combined slot representation to the nested tuple representation. This is because the subordinate relationship between the domains and the slots can be captured by the hierarchical sequence generation, while this relationship is missed when generating the domain and slot together via the combined slot representation.
Figure 5 shows an example of the belief state prediction result in one turn of a dialogue on the MultiWoZ test set. The visualization includes the CMRD attention scores over the belief states, system transcript and user utterance during the decoding stage of the slot sequence.
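The conditional accuracies quoted in the MultiWoZ ablation discussion can be cross-checked by multiplying them along the generation hierarchy. The sketch assumes the chain decomposition JG = JD x P(slots | domains) x P(values | slots); the intermediate joint domain-slot figure it prints is derived here rather than quoted from Table 5.

```python
# Figures quoted in the ablation discussion.
joint_domain_acc = 0.9553      # all domains correct
slot_given_domain = 0.5730     # slots correct, given correct domains
value_given_slot = 0.8352      # values correct, given correct slots

# Chain the conditional accuracies down the domain -> slot -> value hierarchy.
joint_domain_slot_acc = joint_domain_acc * slot_given_domain
joint_goal_acc = joint_domain_slot_acc * value_given_slot

print(f"implied JDS Acc.: {joint_domain_slot_acc:.2%}")
print(f"implied JG Acc.:  {joint_goal_acc:.2%}")
```

The product comes out at about 45.72%, matching the reported joint goal accuracy on MultiWoZ, and it shows that the slot level is where most of the loss occurs.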

Qualitative Analysis
The system attention (top right) is the first attention module and receives no previous context information, so it can only find the information indicating the slot "departure" from the system utterance under the domain condition, and it attends to the evidence "leaving" correctly during the generation step of "departure". From the user attention, we can see that it captures the most helpful keywords that are necessary for correct prediction, such as "after" for "day" and "leave at", and "to" for "destination". Moreover, during the generation step of "departure", the user attention successfully discerns that, based on the context, the word "leave" is not evidence that needs to be accumulated, and chooses to attend to nothing in this step. For the belief attention, we can see that the belief attention module correctly attends to a previous slot for each generation step of a slot that has been presented in the previous state. For the generation step of the new slot "destination", since the previous state does not have the "destination" slot, the belief attention module only attends to the '-' mark after the 'train' domain to indicate that the generated word should belong to this domain.

Related Work
Semi-scalable Belief Tracker. Rastogi et al. (2017) proposed an approach that can generate fixed-length candidate sets for each of the slots from the dialogue history. Although they only need to perform inference for a fixed number of values, they still need to iterate over all slots defined in the ontology to make a prediction for a given dialogue turn. In addition, their method needs an external language understanding module to extract the exact entities from a dialogue to form candidates, which will not work if the label value is an abstraction and does not have an exact match with the words in the dialogue.
StateNet (Ren et al., 2018) achieves state-of-the-art performance with the property that its parameters are independent of the number of slot values in the candidate set, and it also supports online training or inference with dynamically changing slots and values. Given a slot that needs tracking, it only needs to perform inference once to make the prediction for a turn, but this also means that its inference time complexity is proportional to the number of slots.
TRADE (Wu et al., 2019) achieves state-of-the-art performance on the MultiWoZ dataset by applying the copy mechanism for the value sequence generation. Since TRADE takes n combinations of the domains and slots as the input, the inference time complexity of TRADE is O(n). The performance improvement achieved by TRADE is mainly due to the fact that it incorporates the copy mechanism, which can boost the accuracy on the name slot, a slot that mainly requires the ability to copy names from the dialogue history. However, TRADE does not report its performance on the WoZ2.0 dataset, which does not have the name slot.
DSTRead (Gao et al., 2019) formulates the dialogue state tracking task as a reading comprehension problem by asking slot-specific questions to the BERT model and finding the answer span in the dialogue history for each of the pre-defined combined slots. Thus its inference time complexity is still O(n). This method suffers from the fact that its generation vocabulary is limited to the words that occur in the dialogue history, and it has to apply a manual combination strategy with another joint state tracking model on the development set to achieve better performance.
Contextualized Word Embedding (CWE) was first proposed by Peters et al. (2018). Based on the intuition that the meaning of a word is highly correlated with its context, CWE takes the complete context (sentences, passages, etc.) as the input, and outputs the corresponding word vectors that are unique under the given context. Recently, with the success of language models (e.g. Devlin et al. (2018)) that are trained on large-scale data, contextualized word embeddings have been further improved and can achieve performance on par with (less flexible) finely tuned pipelines.
Sequence Generation Models. Recently, sequence generation models have been successfully applied in the realm of multi-label classification (MLC) (Yang et al., 2018). Different from traditional binary relevance methods, Yang et al. (2018) proposed a sequence generation model for MLC tasks which takes into consideration the correlations between labels. Specifically, the model follows the encoder-decoder structure with an attention mechanism (Cho et al., 2014), where the decoder generates a sequence of labels. Similar to language modeling tasks, the decoder output at each time step is conditioned on the previous predictions during generation. Therefore the correlation between generated labels is captured by the decoder.

Conclusion
In this work, we proposed the Conditional Memory Relation Network (COMER), the first dialogue state tracking model that has a constant inference time complexity with respect to the number of domains, slots and values pre-defined in an ontology. Besides its scalability, our model also achieves joint goal accuracy similar to the state of the art on both the MultiWoZ dataset and the WoZ dataset. Due to the flexibility of our hierarchical encoder-decoder framework and the CMR decoder, abundant future research directions remain, such as applying the transformer structure, incorporating open vocabulary and the copy mechanism for explicit unseen-word generation, and devising better dialogue history access mechanisms to accommodate efficient inter-turn reasoning.