Transferable Multi-Domain State Generator for Task-Oriented Dialogue Systems

Over-dependence on domain ontology and lack of knowledge sharing across domains are two practical and yet less studied problems of dialogue state tracking. Existing approaches generally fall short when tracking unknown slot values during inference and often have difficulties in adapting to new domains. In this paper, we propose a Transferable Dialogue State Generator (TRADE) that generates dialogue states from utterances using a copy mechanism, facilitating transfer when predicting (domain, slot, value) triplets not encountered during training. Our model is composed of an utterance encoder, a slot gate, and a state generator, which are shared across domains. Empirical results demonstrate that TRADE achieves state-of-the-art 48.62% joint goal accuracy for the five domains of MultiWOZ, a human-human dialogue dataset. In addition, we show its transfer ability by simulating zero-shot and few-shot dialogue state tracking for unseen domains. TRADE achieves 60.58% joint goal accuracy in one of the zero-shot domains, and is able to adapt to few-shot cases without forgetting already trained domains.


Introduction
Dialogue state tracking (DST) is a core component in task-oriented dialogue systems, such as restaurant reservation or ticket booking. The goal of DST is to extract user goals/intentions expressed during conversation and to encode them as a compact set of dialogue states, i.e., a set of slots and their corresponding values. For example, as shown in Fig. 1, (slot, value) pairs such as (price, cheap) and (area, centre) are extracted from the conversation. Accurate DST performance is crucial for appropriate dialogue management, where user intention determines the next system action and/or the content to query from the databases.
Traditionally, state tracking approaches are based on the assumption that the ontology is defined in advance, where all slots and their values are known. Having a predefined ontology can simplify DST into a classification problem and improve performance (Henderson et al., 2014b; Zhong et al., 2018). However, there are two major drawbacks to this approach: 1) A full ontology is hard to obtain in advance (Xu and Hu, 2018). In industry, databases are usually exposed only through an external API, which is owned and maintained by others, so it is not feasible to enumerate all the possible values for each slot. 2) Even if a full ontology exists, the number of possible slot values could be large and variable. For example, a restaurant name or a train departure time can take a large number of possible values. Therefore, many previous works based on neural classification models may not be applicable in real scenarios.
The recently introduced multi-domain dialogue dataset MultiWOZ adds new challenges in DST due to its mixed-domain conversations. As shown in Fig. 1, a user can start a conversation by asking to reserve a restaurant, then request information regarding an attraction nearby, and finally ask to book a taxi. In this case, the DST model has to determine the corresponding domain, slot, and value at each turn of dialogue, which involves a large number of combinations in the ontology, i.e., 30 (domain, slot) pairs and over 4,500 possible slot values in total. Another challenge in the multi-domain setting comes from the need to perform multi-turn mapping. Single-turn mapping refers to the scenario where the (domain, slot, value) triplet can be inferred from a single turn, while in multi-turn mapping, it should be inferred from multiple turns which happen in different domains. For instance, the (area, centre) pair from the attraction domain in Fig. 1 can be predicted from the area information in the restaurant domain, which is mentioned in the preceding turns.
To tackle these challenges, we emphasize that DST models should share tracking knowledge across domains. There are many slots among different domains that share all or some of their values. For example, the area slot can exist in many domains, e.g., restaurant, attraction, and taxi. Moreover, the name slot in the restaurant domain can share the same value with the departure slot in the taxi domain. Additionally, to enable the DST model to track slots in unseen domains, transferring knowledge across multiple domains is imperative. We expect DST models can learn to track some slots in zero-shot domains by learning to track the same slots in other domains.
In this paper, we propose a Transferable Dialogue State Generator (TRADE) for multi-domain task-oriented dialogue state tracking. The simplicity of our approach and its performance gains are the main advantages of TRADE. The code is released at github.com/jasonwu0731/trade-dst. The contributions of this work are summarized as follows:
• To overcome the multi-turn mapping problem, TRADE leverages its context-enhanced slot gate and copy mechanism to properly track slot values mentioned anywhere in the dialogue history.
• By sharing its parameters across domains, and without requiring a predefined ontology, TRADE can share knowledge between domains to track unseen slot values, achieving state-of-the-art performance on multi-domain DST.
• TRADE enables zero-shot DST by leveraging the domains it has already seen during training. If a few training samples from unseen domains are available, TRADE can adapt to new few-shot domains without forgetting the previous domains.

TRADE Model
The proposed model in Fig. 2 comprises three components: an utterance encoder, a slot gate, and a state generator. Instead of predicting the probability of every predefined ontology term, our model directly generates slot values. Similar to Johnson et al. (2017) for multilingual neural machine translation, we share all the model parameters, and the state generator starts with a different start-of-sentence token for each (domain, slot) pair. The utterance encoder encodes dialogue utterances into a sequence of fixed-length vectors. To determine whether any of the (domain, slot) pairs are mentioned, the context-enhanced slot gate is used with the state generator. The state generator decodes multiple output tokens for all (domain, slot) pairs independently to predict their corresponding values. The context-enhanced slot gate predicts whether each of the pairs is actually triggered by the dialogue via a three-way classifier.
Let us define $X = \{(U_1, R_1), \ldots, (U_T, R_T)\}$ as the set of user utterance and system response pairs in $T$ turns of dialogue, and $B =$

Utterance Encoder
Figure 2: The architecture of the proposed TRADE model, which includes (a) an utterance encoder, (b) a state generator, and (c) a slot gate, all of which are shared among domains. The state generator will decode $J$ times independently for all the possible (domain, slot) pairs. At the first decoding step, the state generator will take the $j$-th (domain, slot) embeddings as input to generate its corresponding slot values and slot gate. The slot gate predicts whether the $j$-th (domain, slot) pair is triggered by the dialogue.
Note that the utterance encoder can be any existing encoding model. We use bi-directional gated recurrent units (GRU) (Chung et al., 2014) to encode the dialogue history. The input to the utterance encoder is the history $X_t = [U_{t-l}, R_{t-l}, \ldots, U_t, R_t] \in \mathbb{R}^{|X_t| \times d_{emb}}$, which is the concatenation of all words in the dialogue history, where $l$ is the number of selected dialogue turns and $d_{emb}$ is the embedding size. The encoded dialogue history is represented as $H_t = [h^{enc}_1, \ldots, h^{enc}_{|X_t|}] \in \mathbb{R}^{|X_t| \times d_{hdd}}$, where $d_{hdd}$ is the hidden size. As mentioned in Section 1, due to the multi-turn mapping problem, the model should infer the states across a sequence of turns. Therefore, we use the recent dialogue history of length $l$ as the utterance encoder input, rather than the current utterance only.
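As an illustration of how the encoder input can be assembled, the following minimal sketch concatenates the words of the last l turns into one token sequence. The function name `build_history`, the whitespace tokenization, and the example dialogue are assumptions made for this sketch and are not taken from the released code.

```python
from typing import List, Tuple


def build_history(turns: List[Tuple[str, str]], t: int, l: int) -> List[str]:
    """Concatenate the words of turns t-l .. t (0-indexed, inclusive).

    Each turn is a (user utterance, system response) pair, so the output
    follows the order [U_{t-l}, R_{t-l}, ..., U_t, R_t].
    Whitespace tokenization is an assumption of this sketch.
    """
    start = max(0, t - l)  # clip when fewer than l previous turns exist
    tokens: List[str] = []
    for user_utt, sys_resp in turns[start : t + 1]:
        tokens.extend(user_utt.split())
        tokens.extend(sys_resp.split())
    return tokens


# Hypothetical three-turn dialogue that moves from restaurant to attraction.
turns = [
    ("i need a cheap restaurant", "which area do you prefer"),
    ("the centre please", "i found one in the centre"),
    ("also find an attraction nearby", "what type of attraction"),
]
history = build_history(turns, t=2, l=1)  # keeps turns 1 and 2 only
```

With l = 1, the area information from the previous turn stays in the encoder input, which is what allows a value such as centre to be carried over to the attraction domain.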

State Generator
To generate slot values using text from the input source, a copy mechanism is required. There are three common ways to perform copying, i.e., index-based copy (Vinyals et al., 2015), hard-gated copy (Gulcehre et al., 2016; Madotto et al., 2018), and soft-gated copy (See et al., 2017; McCann et al., 2018). The index-based mechanism is not suitable for the DST task because the exact word(s) of the true slot value are not always found in the utterance. The hard-gated copy mechanism usually needs additional supervision on the gating function. As such, we employ soft-gated pointer-generator copying to combine a distribution over the vocabulary and a distribution over the dialogue history into a single output distribution.
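We use a GRU as the decoder of the state generator to predict the value for each (domain, slot) pair, as shown in Fig. 2. The state generator decodes the $J$ pairs independently. We simply supply the summed embedding of the domain and slot as the first input to the decoder. At decoding step $k$ for the $j$-th (domain, slot) pair, the generator GRU takes a word embedding $w_{jk}$ as its input and returns a hidden state $h^{dec}_{jk}$. The state generator first maps the hidden state $h^{dec}_{jk}$ into the vocabulary space $P^{vocab}_{jk}$ using the trainable embedding $E \in \mathbb{R}^{|V| \times d_{hdd}}$, where $|V|$ is the vocabulary size. At the same time, $h^{dec}_{jk}$ is used to compute the history attention $P^{history}_{jk}$ over the encoded dialogue history $H_t$:
$$P^{vocab}_{jk} = \mathrm{Softmax}(E \cdot (h^{dec}_{jk})^{\top}) \in \mathbb{R}^{|V|}, \qquad P^{history}_{jk} = \mathrm{Softmax}(H_t \cdot (h^{dec}_{jk})^{\top}) \in \mathbb{R}^{|X_t|}. \tag{1}$$
The final output distribution $P^{final}_{jk}$ is the weighted sum of the two distributions,
$$P^{final}_{jk} = p^{gen}_{jk} \times P^{vocab}_{jk} + (1 - p^{gen}_{jk}) \times P^{history}_{jk} \in \mathbb{R}^{|V|}. \tag{2}$$
The scalar $p^{gen}_{jk}$ is trainable to combine the two distributions, and is computed by
$$p^{gen}_{jk} = \mathrm{Sigmoid}(W_1 \cdot [h^{dec}_{jk}; w_{jk}; c_{jk}]) \in \mathbb{R}^{1}, \qquad c_{jk} = P^{history}_{jk} \cdot H_t \in \mathbb{R}^{d_{hdd}}, \tag{3}$$
where $W_1$ is a trainable matrix and $c_{jk}$ is the context vector.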
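To make the soft-gated copying step concrete, the sketch below computes one decoding step of the final distribution in plain Python. It is an illustration under stated assumptions rather than the released implementation: the history attention is added onto the vocabulary entry of each history token (the usual pointer-generator convention, assuming every history token has a vocabulary id), and all matrices and names (`final_distribution`, `history_ids`, the toy values) are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def final_distribution(
    E: List[List[float]],      # |V| x d embedding matrix
    H: List[List[float]],      # |X_t| x d encoded dialogue history
    history_ids: List[int],    # vocabulary id of each history token
    h_dec: List[float],        # decoder hidden state (length d)
    w_in: List[float],         # decoder input embedding (length d)
    W1: List[float],           # gate weights (length 3d)
) -> Tuple[List[float], float]:
    p_vocab = softmax([dot(e, h_dec) for e in E])
    p_history = softmax([dot(h, h_dec) for h in H])
    # Context vector: attention-weighted sum of encoder states.
    d = len(h_dec)
    c = [sum(p * h[i] for p, h in zip(p_history, H)) for i in range(d)]
    # Soft gate from the concatenation [h_dec; w_in; c].
    p_gen = 1.0 / (1.0 + math.exp(-dot(W1, h_dec + w_in + c)))
    # Weighted sum; copy mass is accumulated on the copied tokens' ids.
    p_final = [p_gen * p for p in p_vocab]
    for pos, tok in enumerate(history_ids):
        p_final[tok] += (1.0 - p_gen) * p_history[pos]
    return p_final, p_gen


# Toy example: vocabulary of 4 words, hidden size 2, history of 3 tokens.
E = [[0.1, 0.2], [0.0, -0.3], [0.5, 0.1], [-0.2, 0.4]]
H = [[0.3, 0.1], [-0.1, 0.6], [0.2, 0.2]]
p_final, p_gen = final_distribution(
    E, H, history_ids=[2, 3, 2], h_dec=[0.4, -0.1], w_in=[0.05, 0.2], W1=[0.1] * 6
)
```

Because both input distributions are normalized and the gate lies in (0, 1), the combined output is again a valid distribution over the vocabulary.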

Slot Gate
Unlike single-domain DST problems, where only a few slots need to be tracked, e.g., four slots in WOZ (Wen et al., 2017) and eight slots in DSTC2 (Henderson et al., 2014a), there are a large number of (domain, slot) pairs in multi-domain DST problems. Therefore, predicting the domain and slot at the current turn $t$ becomes more challenging. Our context-enhanced slot gate $G$ is a simple three-way classifier that maps a context vector taken from the encoder hidden states $H_t$ to a probability distribution over the ptr, none, and dontcare classes. For each (domain, slot) pair, if the slot gate predicts none or dontcare, we ignore the values generated by the decoder and fill the pair as "not-mentioned" or "does not care", respectively. Otherwise, we take the generated words from our state generator as its value.
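With a linear layer parameterized by $W_g \in \mathbb{R}^{3 \times d_{hdd}}$, the slot gate for the $j$-th (domain, slot) pair is defined as
$$G_j = \mathrm{Softmax}(W_g \cdot (c_{j0})^{\top}) \in \mathbb{R}^{3}, \tag{4}$$
where $c_{j0}$ is the context vector computed in Eq (3) using the first decoder hidden state.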
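The gating logic described above can be sketched as follows. This is a minimal illustration, assuming the class order (ptr, none, dontcare) and an argmax decision; the names `slot_gate` and `resolve_value` and the toy weights are hypothetical.

```python
import math
from typing import List

GATE_CLASSES = ("ptr", "none", "dontcare")  # class order is an assumption


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def slot_gate(W_g: List[List[float]], c_j0: List[float]) -> List[float]:
    """Three-way distribution from a linear layer on the first context vector."""
    return softmax([sum(w * c for w, c in zip(row, c_j0)) for row in W_g])


def resolve_value(gate_probs: List[float], generated_words: List[str]) -> str:
    """Keep the generated value only when the gate predicts ptr."""
    label = GATE_CLASSES[max(range(len(gate_probs)), key=gate_probs.__getitem__)]
    if label == "none":
        return "not-mentioned"
    if label == "dontcare":
        return "does not care"
    return " ".join(generated_words)


# Toy 3 x 2 gate weights; each context vector below favours one class.
W_g = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
value_ptr = resolve_value(slot_gate(W_g, [2.0, 0.0]), ["cheap"])
value_none = resolve_value(slot_gate(W_g, [0.0, 2.0]), ["cheap"])
value_dontcare = resolve_value(slot_gate(W_g, [-2.0, -2.0]), ["cheap"])
```

Since the gate is computed from the first decoding step only, the decoder output for a pair can be discarded cheaply whenever the gate is not ptr.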

Optimization
During training, we optimize for both the slot gate and the state generator. For the former, the cross-entropy loss $L_g$ is computed between the predicted slot gate $G_j$ and the true one-hot label $y^{gate}_j$,
$$L_g = \sum_{j=1}^{J} -\log\left(G_j \cdot (y^{gate}_j)^{\top}\right). \tag{5}$$
For the latter, another cross-entropy loss $L_v$ between $P^{final}_{jk}$ and the true words $Y^{label}_j$ is used. We define $L_v$ as
$$L_v = \sum_{j=1}^{J} \sum_{k=1}^{|Y_j|} -\log\left(P^{final}_{jk} \cdot (y^{value}_{jk})^{\top}\right), \tag{6}$$
where $y^{value}_{jk}$ is the one-hot vector of the $k$-th true word, so that $L_v$ is the sum of losses from all the (domain, slot) pairs and their decoding time steps. We optimize the weighted sum of these two loss functions using hyper-parameters $\alpha$ and $\beta$,
$$L = \alpha L_g + \beta L_v. \tag{7}$$

Unseen Domain DST
In this section, we focus on the ability of TRADE to generalize to an unseen domain by considering zero-shot transferring and few-shot domain expanding. In the zero-shot setting, we assume we have no training data in the new domain, while in the few-shot case, we assume just 1% of the original training data in the unseen domain is available (around 20 to 30 dialogues). One of the motivations for unseen domain DST is that collecting a large-scale task-oriented dataset for a new domain is expensive and time-consuming, and there are a large number of domains in realistic scenarios.

Zero-shot DST
Ideally, based on the slots already learned, a DST model is able to directly track those slots that are present in a new domain. For example, if the model is able to track the departure slot in the train domain, then that ability may transfer to the taxi domain, which uses similar slots. Note that generative DST models take the dialogue context/history $X$, the domain $D$, and the slot $S$ as input and then generate the corresponding values $Y^{value}$. Let $(X, D_{source}, S_{source}, Y^{value}_{source})$ be the set of samples seen during the training phase and $(X, D_{target}, S_{target}, Y^{value}_{target})$ the samples which the model was not trained to track. A zero-shot DST model should be able to generate the correct values of $Y^{value}_{target}$ given the context $X$, domain $D_{target}$, and slot $S_{target}$, without using any training samples. The same context $X$ may appear in both source and target domains, but the pairs $(D_{target}, S_{target})$ are unseen. This setting is extremely challenging if no slot in $S_{target}$ appears in $S_{source}$, since the model has never been trained to track such a slot.

Expanding DST for Few-shot Domain
In this section, we assume that only a small number of samples from the new domain $(X, D_{target}, S_{target}, Y^{value}_{target})$ are available, and the purpose is to evaluate the ability of our DST model to transfer its learned knowledge to the new domain without forgetting previously learned domains. There are two advantages to performing few-shot domain expansion: 1) being able to quickly adapt to new domains and obtain decent performance with only a small amount of training data; 2) not requiring retraining with all the data from previously learned domains, since the data may no longer be available and retraining is often very time-consuming.
First, we consider a straightforward naive baseline, i.e., fine-tuning with no constraints. Then, we employ two specific continual learning techniques, elastic weight consolidation (EWC) (Kirkpatrick et al., 2017) and gradient episodic memory (GEM) (Lopez-Paz et al., 2017), to fine-tune our model. We define $\Theta_S$ as the model's parameters trained in the source domain, and $\Theta$ indicates the current optimized parameters according to the target domain data.
EWC uses the diagonal of the Fisher information matrix $F$ as a regularizer for adapting to the target domain data. This matrix is approximated using samples from the source domain. The EWC loss is defined as
$$L_{ewc}(\Theta) = L(\Theta) + \sum_{i} \frac{\lambda}{2} F_i (\Theta_i - \Theta_{S,i})^2, \tag{8}$$
where $\lambda$ is a hyper-parameter. Different from EWC, GEM keeps a small number of samples $K$ from the source domains, and, while the model learns the new target domain, a constraint is applied on the gradient to prevent the loss on the stored samples from increasing. The training process is defined as
$$\mathrm{Minimize}_{\Theta}\; L(\Theta) \quad \text{subject to} \quad L(\Theta, K) \le L(\Theta_S, K), \tag{9}$$
where $L(\Theta, K)$ is the loss value of the $K$ stored samples. Lopez-Paz et al. (2017) show how to solve the optimization problem in Eq (9)
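The two regularization strategies can be illustrated with a small sketch on flat parameter and gradient vectors. The EWC function follows the quadratic penalty described above. The GEM function is a simplification we assume for illustration: with a single gradient computed on the stored samples, the constrained problem reduces to a closed-form projection, whereas the general case with several constraints requires a quadratic program. The function names and toy numbers are hypothetical.

```python
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def ewc_loss(
    task_loss: float,
    theta: List[float],        # current parameters
    theta_s: List[float],      # parameters learned on the source domains
    fisher_diag: List[float],  # diagonal Fisher information estimates
    lam: float,
) -> float:
    """Target-domain loss plus (lambda / 2) * sum_i F_i * (theta_i - theta_S_i)^2."""
    penalty = sum(f * (p - ps) ** 2 for f, p, ps in zip(fisher_diag, theta, theta_s))
    return task_loss + 0.5 * lam * penalty


def gem_project(grad: List[float], grad_mem: List[float]) -> List[float]:
    """Single-constraint gradient projection (simplifying assumption).

    If the target-domain gradient conflicts with the gradient on the stored
    samples (negative inner product), remove the conflicting component so
    that a small step does not increase the loss on the stored samples.
    """
    inner = dot(grad, grad_mem)
    if inner >= 0.0:
        return list(grad)
    scale = inner / dot(grad_mem, grad_mem)
    return [g - scale * m for g, m in zip(grad, grad_mem)]


# Toy values: only the first parameter moved away from its source value.
loss = ewc_loss(1.0, theta=[1.0, 2.0], theta_s=[0.0, 2.0], fisher_diag=[0.5, 3.0], lam=2.0)
projected = gem_project([1.0, -1.0], [0.0, 1.0])
```

In the toy call, the projected gradient keeps its first component and drops the second, which pointed against the memory gradient.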

Training Details
Multi-domain Joint Training The model is trained end-to-end using the Adam optimizer (Kingma and Ba, 2015) with a batch size of 32. The learning rate is annealed in the range [0.001, 0.0001], and a dropout ratio of 0.2 is used. Both $\alpha$ and $\beta$ in Eq (7) are set to one. All the embeddings are initialized by concatenating GloVe embeddings (Pennington et al., 2014) and character embeddings (Hashimoto et al., 2016), where the dimension is 400 for each vocabulary word. A greedy search decoding strategy is used for our state generator since the generated slot values are usually short in length. In addition, to increase model generalization and simulate an out-of-vocabulary setting, word dropout is applied to the utterance encoder by randomly masking a small number of input tokens, similar to Bowman et al. (2016).
Domain Expanding For training, we follow the same procedure as in the joint training section, and we run a small grid search for all the methods using the validation set. For EWC, we set different values of λ for all the domains, and the optimal value is selected using the validation set. Finally, in GEM, we set the memory sizes K to 1% of the source domains.
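The word dropout used during joint training can be sketched as follows. The masking token `UNK`, the dropout rate, and the function name `word_dropout` are assumptions for illustration; the text above only states that a small number of input tokens are randomly masked.

```python
import random
from typing import List, Optional


def word_dropout(
    tokens: List[str],
    rate: float,
    unk: str = "UNK",                      # placeholder symbol, an assumption
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Replace each input token with `unk` independently with probability `rate`."""
    rng = rng or random.Random()
    return [unk if rng.random() < rate else tok for tok in tokens]


tokens = "i need a taxi from the station to the hotel".split()
masked = word_dropout(tokens, rate=0.2, rng=random.Random(0))
```

Masking is applied to the encoder input only at training time, so the copy mechanism has to rely on context rather than on memorized word identities.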

Results
Two evaluation metrics, joint goal accuracy and slot accuracy, are used to evaluate the performance on multi-domain DST. The joint goal accuracy compares the predicted dialogue states to the ground truth $B_t$ at each dialogue turn $t$, and the output is considered correct if and only if all the predicted values exactly match the ground truth values in $B_t$. The slot accuracy, on the other hand, individually compares each (domain, slot, value) triplet to its ground truth label.
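The difference between the two metrics is easiest to see in a small example. The sketch below represents each turn's state as a dictionary from (domain, slot) to value and treats pairs absent from a dictionary as "none"; this representation, the function names, and the toy states are assumptions for illustration.

```python
from typing import Dict, List, Tuple

State = Dict[Tuple[str, str], str]


def joint_goal_accuracy(predictions: List[State], ground_truths: List[State]) -> float:
    """Fraction of turns whose full predicted state matches the ground truth exactly."""
    correct = sum(1 for p, g in zip(predictions, ground_truths) if p == g)
    return correct / len(ground_truths)


def slot_accuracy(
    predictions: List[State],
    ground_truths: List[State],
    all_pairs: List[Tuple[str, str]],
) -> float:
    """Fraction of (turn, domain, slot) entries predicted correctly."""
    total = 0
    correct = 0
    for pred, gold in zip(predictions, ground_truths):
        for pair in all_pairs:
            total += 1
            # Pairs missing from a state are treated as "none" (assumption).
            if pred.get(pair, "none") == gold.get(pair, "none"):
                correct += 1
    return correct / total


all_pairs = [("restaurant", "price"), ("restaurant", "area"), ("taxi", "departure")]
gold = [
    {("restaurant", "price"): "cheap"},
    {("restaurant", "price"): "cheap", ("restaurant", "area"): "centre"},
]
pred = [
    {("restaurant", "price"): "cheap"},
    {("restaurant", "price"): "cheap", ("restaurant", "area"): "north"},
]
joint = joint_goal_accuracy(pred, gold)
slot = slot_accuracy(pred, gold, all_pairs)
```

A single wrong value makes the whole second turn count as incorrect for joint goal accuracy, while slot accuracy still credits the remaining pairs, which is why slot accuracy is typically much higher.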

Multi-domain Training
We compare with the following existing models: MDBT (Ramadan et al., 2018), GLAD (Zhong et al., 2018), GCE (Nouri and Hosseini-Asl, 2018), and SpanPtr (Xu and Hu, 2018). We briefly describe these baseline models below:
• MDBT: Multiple bi-LSTMs are used to encode system and user utterances. The semantic similarity between utterances and every predefined ontology term is computed separately. Each ontology term is triggered if the predicted score is greater than a threshold.
• GLAD: This model uses self-attentive RNNs to learn a global tracker that shares parameters among slots and a local tracker that tracks each slot. The model takes previous system actions and the current user utterance as input, and computes semantic similarity with predefined ontology terms.
• GCE: This is the current state-of-the-art model on the single-domain WOZ dataset. It is a simplified and sped-up version of GLAD without slot-specific RNNs.
• SpanPtr: Most related to our work, this is the first model that applies pointer networks (Vinyals et al., 2015) to the single-domain DST problem, which generates both start and end pointers to perform index-based copying.
To have a fair comparison, we modify the original implementations of the MDBT and GLAD models by 1) adding name, destination, and departure slots for evaluation if they were discarded or replaced by placeholders; 2) removing the hand-crafted rules for tracking the booking slots, such as the stay and people slots, if there are any; and 3) creating a full ontology for their models to cover all (domain, slot, value) pairs that were not in the original ontology generated by the data provider.
As shown in Table 2, TRADE achieves the highest performance on MultiWOZ, 48.62% joint goal accuracy and 96.92% slot accuracy. For comparison with single-domain performance, the results on the restaurant domain of MultiWOZ are reported as well. The performance difference between SpanPtr and our model mainly comes from the limitation of index-based copying. For example, if the true label for the price range slot is cheap, the relevant user utterance describing the restaurant may actually use a different word, such as economical, inexpensive, or cheaply. Note that the MDBT, GLAD, and GCE models each need a predefined domain ontology to perform binary classification for each ontology term, which hinders their DST tracking performance, as mentioned in Section 1.
We visualize the cosine similarity matrix for all possible slot embeddings in Fig. 3. Most of the slot embeddings are not close to each other, which is expected because the model only depends on these features as start-of-sentence embeddings to distinguish different slots. Note that some slots are relatively close because either the values they track may share similar semantic meanings or the slots are correlated. For example, destination and departure track names of cities, while people and stay track numbers. On the other hand, price range and star in the hotel domain are correlated because high-star hotels are usually expensive.
An error analysis is shown in Fig. 4. In general, name slots in the attraction, hotel, and restaurant domains have the highest error rates, 3.22%, 2.96%, and 2.78%, respectively. This is because this slot usually has a large number of possible values that are hard to recognize. On the other hand, time-related slots such as leave at and arrive by in the taxi domain have the lowest error rate.
Zero-shot We run zero-shot experiments by excluding one domain from the training set. As shown in Table 3, the taxi domain achieves the highest zero-shot performance, 60.58% on joint goal accuracy, which is close to the result achieved by training on all the taxi domain data (76.13%). Although performance on the other zero-shot domains is not especially promising, they still achieve around 50 to 65% slot accuracy without using any in-domain samples. The zero-shot performance on the taxi domain is high because all four of its slots share similar values with the corresponding slots in the train domain.
In the domain expansion experiments, we pre-train TRADE on four domains and hold out the remaining domain as the new domain, on which we perform fine-tuning. After fine-tuning on the new domain, we evaluate the performance of TRADE on 1) the four pre-trained domains and 2) the new domain. We experiment with different fine-tuning strategies. The base model row in Table 4 indicates the results evaluated on the four domains using their in-domain training data, and the Training 1% New Domain row indicates the results achieved by training from scratch using 1% of the new domain data. In general, GEM outperforms naive and EWC fine-tuning in terms of overcoming catastrophic forgetting. We also find that pre-training followed by fine-tuning outperforms training from scratch on the single domain. Fine-tuning TRADE with GEM maintains higher performance on the original four domains. Taking the hotel domain as an example, the performance on the four domains after fine-tuning with GEM only drops from 58.98% to 53.54% (-5.44%) on joint accuracy, whereas naive fine-tuning deteriorates the tracking ability more severely. Finally, when considering hotel and attraction as the new domain, fine-tuning with GEM outperforms the naive fine-tuning approach on the new domain. To elaborate, GEM obtains 34.73% joint accuracy on the attraction domain, but naive fine-tuning on that domain can only achieve 29.39%. This implies that in some cases learning to keep the tracking ability (learned parameters) of the learned domains helps to achieve better performance for the new domain.

Related Work
Dialogue State Tracking Traditional dialogue state tracking models combine semantics extracted by language understanding modules to estimate the current dialogue states (Williams and Young, 2007; Thomson and Young, 2010; Wang and Lemon, 2013; Williams, 2014), or jointly learn speech understanding (Henderson et al., 2014b; Zilka and Jurcicek, 2015). One drawback is that they rely on hand-crafted features and complex domain-specific lexicons (besides the ontology), and are difficult to extend and scale to new domains. Distributional representation learning has since been used to leverage semantic information from word embeddings and to resolve lexical/morphological ambiguity. However, parameters are not shared across slots. On the other hand, Nouri and Hosseini-Asl (2018) utilize global modules to share parameters between slots, and Zhong et al. (2018) use slot-specific local modules to learn slot features, which has proved to successfully improve tracking of rare slot values. Lei et al. (2018) use a Seq2Seq model to generate belief spans and the delexicalized response at the same time. Xu and Hu (2018) use an index-based pointer network for different slots, and show the ability to point to unknown values. However, many of these models require a predefined domain ontology, and they were only evaluated on the single-domain setting (DSTC2).
For multi-domain DST, Rastogi et al. (2017) propose a multi-domain approach using a two-layer bi-GRU. Although it does not need an ad-hoc state update mechanism, it relies on delexicalization to extract the features. Ramadan et al. (2018) propose a model to jointly track the domain and the dialogue states using multiple bi-LSTMs. They utilize semantic similarity between utterances and the ontology terms and allow the information to be shared across domains. For a more general overview, readers may refer to the neural dialogue review paper from Gao et al. (2018).
Zero/Few-Shot and Continual Learning Different components of dialogue systems have previously been used for zero-shot application, e.g., intention classifiers (Chen et al., 2016), slot-filling (Bapna et al., 2017), and dialogue policy (Gašić and Young, 2014). For language generation, Johnson et al. (2017) propose single encoder-decoder models for zero-shot machine translation, and Zhao and Eskenazi (2018) propose cross-domain zero-shot dialogue generation using action matching. Moreover, few-shot learning in natural language applications has been applied in semantic parsing (Huang et al., 2018), machine translation (Gu et al., 2018), and text classification (Yu et al., 2018) with meta-learning approaches (Schmidhuber, 1987; Finn et al., 2017). These approaches usually rely on multiple tasks to perform fast adaptation, whereas in our case the number of existing domains is limited. Lastly, several approaches have been proposed for continual learning in the machine learning community (Kirkpatrick et al., 2017; Lopez-Paz et al., 2017; Rusu et al., 2016; Fernando et al., 2017), especially in image recognition tasks (Rannen et al., 2017). The applications within NLP have been comparatively limited, e.g., Shu et al. (2016, 2017b) for opinion mining, Shu et al. (2017a) for document classification, and Lee (2017) for hybrid code networks (Williams et al., 2017).

Conclusion
We introduce a transferable dialogue state generator for multi-domain dialogue state tracking, which learns to track states without any predefined domain ontology. TRADE shares all of its parameters across multiple domains and achieves state-of-the-art joint goal accuracy and slot accuracy on the MultiWOZ dataset for five different domains. Moreover, domain sharing enables TRADE to perform zero-shot DST for unseen domains and to quickly adapt to few-shot domains without forgetting the learned ones. In future work, we hope to collect a dataset with a large number of domains to facilitate the application and study of meta-learning techniques within multi-domain DST.