Domain Adaptive Dialog Generation via Meta Learning

Domain adaptation is an essential task in dialog system building because there are so many new dialog tasks created for different needs every day. Collecting and annotating training data for these new tasks is costly since it involves real user interactions. We propose a domain adaptive dialog generation method based on meta-learning (DAML). DAML is an end-to-end trainable dialog system model that learns from multiple rich-resource tasks and then adapts to new domains with minimal training samples. We train a dialog system model using multiple rich-resource single-domain dialog data by applying the model-agnostic meta-learning algorithm to dialog domain. The model is capable of learning a competitive dialog system on a new domain with only a few training examples in an efficient manner. The two-step gradient updates in DAML enable the model to learn general features across multiple tasks. We evaluate our method on a simulated dialog dataset and achieve state-of-the-art performance, which is generalizable to new tasks.


Introduction
Modern personal assistants, such as Alexa and Siri, are composed of thousands of single-domain task-oriented dialog systems. Every dialog task is different because of its specific domain knowledge. An end-to-end trainable dialog system requires thousands of dialogs for training. However, the availability of training data is usually limited, as real users have to be involved to obtain the training dialogs. Therefore, adapting existing rich-resource data to new domains with limited resources is an essential task in dialog system research. Transfer learning (Caruana, 1997a; Bengio, 2012; Cohn et al., 1994; Mo et al., 2018), few-shot learning (Salakhutdinov et al., 2012; Li et al., 2006; Norouzi et al., 2013; Socher et al., 2013) and meta-learning (Finn et al., 2017) have been introduced to address this data scarcity problem in machine learning. Because dialog domains differ greatly from one another, generalizing information from rich-resource domains to a low-resource domain is difficult. Therefore, only a few studies have tackled domain adaptive end-to-end dialog training methods (Zhao and Eskénazi, 2018). We propose DAML, based on meta-learning, to combine multiple dialog tasks in training in order to learn general and transferable information that is applicable to new domains. Zhao and Eskénazi (2018) introduce action matching, a learning framework that realizes zero-shot dialog generation (ZSDG) based on a domain description in the form of seed responses. With limited knowledge of a new domain, the model trained on several rich-resource domains achieves both an impressive task completion rate and natural generated responses. Rather than action matching, we propose to use the model-agnostic meta-learning (MAML) algorithm (Finn et al., 2017) to perform dialog domain adaptation.
The MAML algorithm tries to build an internal representation of multiple tasks and maximize the sensitivity of the loss function when applied to new tasks, so that a small update of the parameters can lead to a large improvement in the loss on the new task. This allows our dialog system to adapt to a new domain successfully, not only with little target domain data but also more efficiently.
The key idea of this paper is to utilize the abundant data in multiple rich-resource domains and find an initialization that can be accurately and quickly adapted to an unknown new domain with little data. We use the simulated data generated by SimDial (Zhao and Eskénazi, 2018). Specifically, we use three domains (restaurant, weather and bus information search) as source data and test the meta-learned parameter initialization against the target domain, movie information search. By modifying Sequicity (Lei et al., 2018), a seq2seq encoder-decoder network, and improving it with a two-stage CopyNet (Gu et al., 2016), we implement the MAML algorithm to achieve an optimal initialization using dialog data from the source domains. Then, we fine-tune the initialization towards the target domain with a minimal portion of dialog data using normal gradient descent. Finally, we evaluate the adapted model with testing data also from the target domain. We outperform the state-of-the-art zero-shot baseline, ZSDG (Zhao and Eskénazi, 2018), as well as other transfer learning methods (Caruana, 1997b). We publish the code on GitHub. 1

Related Works
Task-oriented dialog systems are developed to assist users in completing specific tasks, such as booking a restaurant or querying weather information. The traditional method of building a dialog system is to train separate modules, such as natural language understanding (NLU) (Deng et al., 2012; Dauphin et al., 2014; Hashemi et al.), dialog state tracking (Henderson et al., 2014), dialog policy learning (Cuayáhuitl et al., 2015) and natural language generation (NLG) (Dhingra et al., 2017). Henderson et al. (2013) introduce the concept of a belief tracker that tracks users' requirements and constraints in the dialog across turns. Recently, more and more works combine all the modules into a seq2seq model because it makes model updates easier. Lei et al. (2018) introduced a new end-to-end dialog system, sequicity, constructed on a two-stage CopyNet (Gu et al., 2016): one stage for the belief tracker and another for response generation. This model has fewer parameters and trains faster than the state-of-the-art baselines while outperforming them on two large-scale datasets.
The traditional paradigm in machine learning research is to train a model for a specific task with plenty of annotated data. It is not reasonable to still require a large amount of data to train a model from scratch if we already have models for similar tasks. Instead, we want to quickly adapt a trained model to a new task with a small amount of new data. Dialog adaptation has been explored in various dimensions. Shi and Yu (2018) introduce an end-to-end dialog system that adapts to user sentiment. Mo et al. (2018) and Genevay and Laroche (2016) also train user-adaptive dialog systems using transfer learning. Recently, effective domain adaptation has been introduced for natural language generation in dialog systems (Tran and Nguyen, 2018; Wen et al., 2016). Some domain adaptation work has been done on dialog state tracking (Mrkšić et al., 2015) and dialog policy learning (Vlasov et al., 2018) as well. However, there is no recent work on domain adaptation for a seq2seq dialog system, except ZSDG (Zhao and Eskénazi, 2018). ZSDG is a zero-shot learning method that uses action matching to adapt models learned from multiple source domains to a new target domain using only its domain description. Different from ZSDG, we propose to adapt meta-learning to achieve similar domain adaptation ability.
1 https://github.com/qbetterk/sequicity.git
Meta-learning aims at learning new tasks with few steps and little data based on well-known tasks. One way to realize meta-learning is to learn an optimal initialization that can be adapted to a new task accurately and quickly with little data (Vinyals et al., 2016; Snell et al., 2017). Another way is to train a meta-learner that learns the learning process itself, optimizing the optimizer that updates the parameters of the original network (Andrychowicz et al., 2016; Grant et al., 2018). Meta-learning has been applied in various settings such as image classification (Santoro et al., 2016; Finn et al., 2017), machine translation (Gu et al., 2018) and robot manipulation (Duan et al., 2016; Wang et al., 2016). We propose to apply a meta-learning algorithm on top of the sequicity model to achieve dialog domain adaptation. Specifically, we choose the recently introduced model-agnostic meta-learning (MAML) algorithm (Finn et al., 2017) because it generalizes across different models. This algorithm is compatible with any model optimized with gradient descent, such as regression, classification and even policy gradient reinforcement learning. Moreover, this algorithm outperforms other state-of-the-art one-shot algorithms for image classification.

Problem Formulation
Seq2Seq-based dialog models take the dialog context c as input and generate a sentence r as the response. Given the abundant data in the K different source domains, we have the training data in each source domain $S_k$, denoted as $D^{S_k}_{train} = \{(c^{(k)}_n, r^{(k)}_n)\}_{n=1}^{N}$ for $k = 1, \dots, K$. We also denote the data in the target domain T as $D^{T}_{train} = \{(c^{T}_n, r^{T}_n)\}_{n=1}^{N'}$, where $N' \ll N$ and $N'$ is only 1% of $N$ in our setting.
During the training process, we generate a model $M_{source}$ that maps C to R, where C is the set of contexts and R is the set of system responses.
For the adaptation, we fine-tune the model $M_{source}$ with the target domain training data $D^{T}_{train}$ and obtain a new model $M_{target}$. Our primary goal is to learn a model $M_{target}$ that performs well in the new target domain.

Proposed Methods
We first introduce how to combine the MAML algorithm and the sequicity model. As illustrated in Figure 1, typical gradient descent starts by (1) combining the training data with the initialized model, and then computes a loss that is used to update the model directly. MAML instead first obtains a temporarily updated model $M'_k$ for each source domain $S_k$ from the initial model $M$. (5) We then again use the data $(c^{(k)}, r^{(k)})$ from each domain and its corresponding temporarily updated domain model $M'_k$ to calculate a new loss $Loss_k$ in each domain, (6) sum all the new domain losses to obtain the final loss, and (7) finally use the final loss to update the original model $M$.
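The two-step update outlined above can be illustrated with a minimal sketch. It assumes a toy one-parameter regression model y = w * x in place of the sequicity network, three synthetic "domains" that differ only in their slope, and illustrative learning rates `alpha` and `beta`; none of these come from the experiments in this paper. Because the toy loss is quadratic, the gradient through the temporary update can be written analytically instead of relying on an automatic differentiation library.

```python
# Toy sketch of the two-step update (assumption: scalar model y = w * x
# with a squared-error loss standing in for the dialog model).

def loss_and_grad(w, data):
    """Return the halved mean squared error and its gradient w.r.t. w."""
    n = len(data)
    loss = sum((w * x - y) ** 2 for x, y in data) / (2 * n)
    grad = sum((w * x - y) * x for x, y in data) / n
    return loss, grad


def curvature(data):
    """Second derivative of the loss w.r.t. w (constant for this model)."""
    return sum(x * x for x, _ in data) / len(data)


def meta_step(w, domains, alpha, beta):
    """One meta-update: temporary step per domain, then update w on the summed loss."""
    meta_loss, meta_grad = 0.0, 0.0
    for data in domains:
        _, grad = loss_and_grad(w, data)
        w_tmp = w - alpha * grad                         # temporary domain model
        loss_tmp, grad_tmp = loss_and_grad(w_tmp, data)  # new loss on the same data
        meta_loss += loss_tmp                            # sum of the new domain losses
        # Chain rule through the temporary step: d(w_tmp)/dw = 1 - alpha * curvature.
        meta_grad += (1.0 - alpha * curvature(data)) * grad_tmp
    return w - beta * meta_grad, meta_loss


def make_domain(slope):
    """Synthetic domain: points on the line y = slope * x (hypothetical data)."""
    return [(x, slope * x) for x in (-1.0, -0.5, 0.5, 1.0)]


def meta_train(domains, w=0.0, alpha=0.4, beta=0.1, steps=200):
    """Run repeated meta-updates and record the meta loss before each update."""
    history = []
    for _ in range(steps):
        w, meta_loss = meta_step(w, domains, alpha, beta)
        history.append(meta_loss)
    return w, history


source_domains = [make_domain(slope) for slope in (1.0, 2.0, 3.0)]
w_meta, meta_losses = meta_train(source_domains)
```

In this toy setting `w_meta` converges to the value whose one-step-adapted versions have the lowest summed loss across the three source slopes. The same loop structure applies when the model is a neural network and the gradients come from backpropagation.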
In the following part, we describe the implementation details of the MAML algorithm and the sequicity model separately. As illustrated in Algorithm 1, the sequicity model is used to combine natural language understanding (NLU), dialog management and response generation in a seq2seq fashion, while meta-learning is a method to adjust the loss function value for better optimization. α and β in the algorithm are the learning rates. As mentioned in Section 3, c denotes the context and is the input to the model at each turn. In order to use the sequicity model, we format c as $\{B_{t-1}, R_{t-1}, U_t\}$ at time t, where $B_{t-1}$ is the previous belief span at time t − 1, $R_{t-1}$ is the last system response and $U_t$ is the current user utterance. The sequicity model introduces belief spans to store the values of all the informable slots and also to record requestable slot names through the history. In this way, rather than feeding all the history utterances into an RNN to extract context features, we directly use the slots stored in the belief span as the representation of all history contexts. The belief span is a more accurate and simpler representation of the history context, and it needs to be updated in every turn. The informable and requestable slots are stored in the same span, but with different labels to avoid ambiguity. The context at time t = 1 contains an empty set as the former belief span $B_0$ and an empty string as the previous system response $R_0$.
Algorithm 1 DAML. Input: dataset on source domain $D^{S}_{train}$; α; β. Output: optimal meta-learned model.
The intuition behind the MAML algorithm is that some internal representations are more transferable than others. This suggests that some internal features can be applied to multiple dialog domains rather than a single domain.
Since MAML is compatible with any gradient descent based model, we denote the current generative dialog model as $M$, which can be randomly initialized. According to the algorithm, for each source domain $S_k$, a certain amount of training data is sampled. We input the training data $(c^{(k)}, r^{(k)})$ into the sequicity model and obtain the generated system response. We adopt cross-entropy as the loss function for all the domains: $\mathcal{L}_{S_k} = \mathcal{L}(M, c^{(k)}, r^{(k)})$. For each source domain $S_k$, we use gradient descent to update the model and get a temporary model: $M'_k \leftarrow M - \alpha \nabla_{M} \mathcal{L}_{S_k}$.
To be consistent with Finn et al. (2017), we only update the model for one step. In this way, we have an updated model in each source domain, one step away from $M$. We may consider multiple steps of gradient update in future work. Then, we compute the loss based on the updated model with the same training data in each source domain: $\mathcal{L}'_{S_k} = \mathcal{L}(M'_k, c^{(k)}, r^{(k)})$. After this step, we have a meta loss value in each domain. We sum up the updated loss values from all source domains as the objective function of meta-learning: $\mathcal{L}_{meta} = \sum_{k} \mathcal{L}'_{S_k}$. Finally, we update the model to minimize the meta objective function: $M \leftarrow M - \beta \nabla_{M} \mathcal{L}_{meta}$. Unlike common gradient descent, in MAML the objective loss we use to update the model is not calculated directly from the current model $M$, but from the temporary models $M'_k$. The idea behind this operation is that the loss calculated from the updated model is more sensitive to changes in the original domains, so that we learn more about the common internal representations of all source domains rather than the distinctive features of each domain. Then, in the adaptation step, since the basic internal representation has already been captured, the model is sensitive to the unique features of the new domain. As a result, only one or a few gradient steps and a minimal amount of data are required to optimize the model for the new domain.
The sequicity model is constructed based on a single seq2seq model incorporating a copying mechanism and belief spans to record dialog states. Given a context c in the form of $\{B_{t-1}, R_{t-1}, U_t\}$, the belief span $B_t$ at time t is extracted based on the previous belief span $B_{t-1}$ at time t − 1, the history response $R_{t-1}$ at time t − 1 and the utterance $U_t$ at time t: $B_t = \mathrm{seq2seq}(B_{t-1}, R_{t-1}, U_t)$. Then, we generate the system response based on both the context and the belief span extracted before: $R_t = \mathrm{seq2seq}(\{B_{t-1}, R_{t-1}, U_t\}, B_t, m_t)$. $m_t$ is a simple label that helps generate the response. It indicates whether the requested information is available in the database under the constraints stored in $B_t$. $m_t$ has three possible values: no match, exact match and multiple matches.
$m_t$ = "no match" denotes that the system cannot find a match in the database given the constraints, so the system restarts the conversation. $m_t$ = "exact match" indicates that the system successfully retrieves the requested information and completes the task, so the system ends the conversation. $m_t$ = "multiple matches" means that multiple items match all the constraints, so more constraints are needed to narrow the search in the backend database; the system then outputs a question to elicit more information.
Figure 2: Structure of dialog system
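The role of the match label can be made concrete with a small sketch. It assumes a hypothetical in-memory database (a list of dictionaries) and hypothetical function names; the backend actually used with the simulated data is not described here.

```python
def query_database(database, constraints):
    """Return all entries that satisfy every constraint in the belief span."""
    return [
        entry for entry in database
        if all(entry.get(slot) == value for slot, value in constraints.items())
    ]


def match_label(num_matches):
    """Map the number of database matches to the label m_t."""
    if num_matches == 0:
        return "no match"
    if num_matches == 1:
        return "exact match"
    return "multiple matches"


# Hypothetical movie entries, used only for illustration.
movies = [
    {"genre": "western", "country": "usa", "years": "70s"},
    {"genre": "western", "country": "italy", "years": "60s"},
    {"genre": "comedy", "country": "usa", "years": "90s"},
]

belief_constraints = {"genre": "western"}
label = match_label(len(query_database(movies, belief_constraints)))
```

With only the genre constraint, two entries match, so the label asks the system to elicit more constraints; adding a country narrows the result to a single entry.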
The structure is illustrated in Figure 2 and it is compatible with any seq2seq model. To keep the architecture simple, we adopt the basic encoder-decoder structure. Both the encoder and the decoder employ GRUs with an attention mechanism. The response is generated using the belief span and the utterance at the current time. To simplify the model, we let the belief extractor and the response generator share the same encoder. So we reformulate the equations into: $h = \mathrm{Encoder}(B_{t-1}, R_{t-1}, U_t)$, $B_t = \mathrm{BspanDecoder}(h)$, $R_t = \mathrm{RespDecoder}(h, B_t, m_t)$. We also need to apply a third attention-based GRU for the response decoding.
Because the response and the utterance usually share some word tokens, the sequicity model also incorporates a copy-attention mechanism. Originally, to decode an encoded vector, the model uses softmax to obtain a probability over the vocabulary, $P_{vocab}(v)$ where $v \in V$. With copy-attention, the decoder considers not only the word generation probability distribution over the vocabulary, but also the likelihood of copying the word from the input sequence, $P_{copy}(v)$ where $v \in V \cup U_t$ and $U_t$ is the current user utterance in the input context c. The total probability of word v at the ith token in the output sequence is then calculated by summing these two probabilities (normalization is performed after the summation): $P(v) = P_{vocab}(v) + P_{copy}(v)$. The copy probability is calculated similarly to Gu et al. (2016) and differs between the belief span decoder and the response decoder.
For the belief span decoder, the copy probability is calculated as: $P_{copy}(v) = \frac{1}{Z} \sum_{j: u_j = v} e^{\psi(u_j)}$, where Z is a normalization factor and $u_j$ is the jth word token in the utterance $U_t$. We only add the component when $u_j$ is the same as the target word v. $\psi(u_j)$ is computed by: $\psi(u_j) = \sigma({h^{enc}_j}^{\top} W) h^{dec}_j$, where $\sigma$ is a non-linear activation function, $h^{enc}_j$ is the hidden state in the encoder for the jth word as input, $h^{dec}_j$ is the hidden state in the belief span decoder and $W \in \mathbb{R}^{d \times d}$ is the copy-attention weight.
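For the response decoder, we apply the copy attention on the recently generated belief span $B_t$ rather than the utterance $U_t$: $P_{copy}(v) = \frac{1}{Z} \sum_{j: b_j = v} e^{\psi(b_j)}$, where $b_j$ is the jth token of $B_t$ and both hidden states used to compute $\psi(b_j)$ come from the belief span decoder.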
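How generation and copy scores combine can be sketched as follows. The sketch assumes that both components are available as unnormalized scores and that a single normalizer is applied after they are summed, in the spirit of CopyNet. The function name, the toy vocabulary and the score values are all hypothetical, and the real model computes the copy scores from encoder and decoder hidden states.

```python
import math


def combine_generate_and_copy(vocab_scores, source_tokens, copy_scores):
    """Combine generation and copy scores into one probability distribution.

    vocab_scores: dict mapping each vocabulary word to a generation score
    source_tokens: tokens of the sequence being copied from
    copy_scores: one copy score (psi) per source position
    """
    mass = {word: math.exp(score) for word, score in vocab_scores.items()}
    for token, psi in zip(source_tokens, copy_scores):
        # Only positions whose token equals the target word contribute,
        # so repeated tokens accumulate their copy mass.
        mass[token] = mass.get(token, 0.0) + math.exp(psi)
    z = sum(mass.values())  # normalization performed after the summation
    return {word: value / z for word, value in mass.items()}


# Hypothetical scores, chosen only to illustrate the mechanism.
vocab_scores = {"i": 0.5, "want": 0.2, "movie": 0.1, "<unk>": -1.0}
utterance = ["i", "want", "a", "western", "movie"]
psi = [0.0, 0.0, -0.5, 2.0, 1.0]
probs = combine_generate_and_copy(vocab_scores, utterance, psi)
```

Here "western" is outside the toy vocabulary but still receives probability mass because it appears in the utterance, which is the behavior that lets the decoder emit slot values it cannot generate from the vocabulary alone.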

Experiment
We first introduce the dataset and the metrics used to evaluate our models. Then, we describe models evaluated in the experiments and their implementation details.

Dataset
For a fair comparison with the state-of-the-art domain adaptation algorithm, ZSDG (Zhao and Eskénazi, 2018), we use SimDial, the dataset that was first introduced to evaluate ZSDG. Please refer to Appendix A for an example dialog. There are six dialog domains in total in SimDial: restaurant, weather, bus, movie, restaurant-slot and restaurant-style. The restaurant-slot data has the same slot types and sentence generation templates as the restaurant task but a different slot vocabulary. Similarly, restaurant-style has the same slots as the restaurant domain but different natural language generation (NLG) templates. We choose restaurant, weather and bus as source domains, following the experiment setting of ZSDG (Zhao and Eskénazi, 2018). For each source domain, we have 900, 100 and 500 conversations for training, validation and testing respectively; each conversation has 9 turns and each utterance has 13 word tokens on average. The remaining three domains are used for evaluation and are considered target domains. The seed response used in ZSDG is a set of system utterances and corresponding labels. To achieve a fair comparison, we use dialog data of the same size for adaptation training. We generate 9 dialogs (1% of the source domain size) for each domain's adaptation training, each containing about 8.4 turns on average. So for each target domain, we assume we have around 76 system responses, which is fewer than the 100 seed responses that ZSDG uses as domain description. For testing, we use 500 dialogs for each target model. Movie is chosen as the new target domain for evaluation because it has completely different NLG templates and dialog structure, sharing very few common traits with the source domains at the surface level.
To avoid any random results in this few-shot learning setting, we report the average of ten random runs for all results. To further explore the properties of the proposed method, we have also generated one dialog for the one-shot experiment, as well as 45 dialogs (5% of the source domain size) and 90 dialogs (10% of the source domain size), to study the adaptation efficiency of our methods.
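The adaptation set sizes follow directly from the 900 training dialogs per source domain, and the short calculation below reproduces them. The variable names are illustrative only, and the response count assumes one system response per turn.

```python
SOURCE_TRAIN_DIALOGS = 900   # training dialogs per source domain
AVG_TURNS_PER_DIALOG = 8.4   # average turns in an adaptation dialog

# Number of adaptation dialogs for each percentage of the source size.
adaptation_sizes = {pct: SOURCE_TRAIN_DIALOGS * pct // 100 for pct in (1, 5, 10)}

# Approximate number of system responses in the 1% setting
# (assumption: one system response per turn).
responses_at_one_percent = round(adaptation_sizes[1] * AVG_TURNS_PER_DIALOG)
```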

Metrics
There are three main metrics in our experiments: BLEU score, entity F1 score and adapting time. The first two are the most important and persuasive metrics. Finn et al. (2017) have exhaustively demonstrated MAML's fast adaptation to new tasks; it achieves strong performance with a single gradient update on the half-cheetah and ant tasks. We therefore also count the number of epochs needed for adaptation, in order to compare the adaptation speed of our method with the transfer learning baseline.
• BLEU We use BLEU score (Papineni et al., 2002) to evaluate the quality of generated response sentences since generating natural language is also part of the task.
• Entity F1 Score For each dialog, we compare the generated belief span with the oracle one. Since the belief span contains all the slots that constrain the response, this score also checks the completeness of tasks.
• Adapting Time We count the number of epochs during the adaptation training. We only compare the adaptation with the data of the same size.
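A minimal sketch of the entity F1 computation is given below. It assumes that belief spans are represented as sets of slot strings and that true positives, false positives and false negatives are pooled over all turns (micro-averaging). The exact averaging used in the experiments is not specified above, so this choice and the example slot values are assumptions.

```python
def entity_f1(generated_spans, oracle_spans):
    """Micro-averaged F1 between generated and oracle belief-span slots."""
    tp = fp = fn = 0
    for generated, oracle in zip(generated_spans, oracle_spans):
        generated, oracle = set(generated), set(oracle)
        tp += len(generated & oracle)   # slots found in both spans
        fp += len(generated - oracle)   # generated slots that are wrong
        fn += len(oracle - generated)   # oracle slots that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical belief spans for two turns.
generated = [{"western", "usa"}, {"western", "70s"}]
oracle = [{"western", "usa"}, {"western", "usa", "70s"}]
score = entity_f1(generated, oracle)
```

In this example the second turn drops one oracle slot, so precision stays at 1.0 while recall falls to 0.8.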

Baseline Models
To evaluate the effectiveness of our model, we compare DAML with the following two baselines:
• ZSDG (Zhao and Eskénazi, 2018) is the state-of-the-art dialog domain adaptation model. This model strengthens the LSTM-based encoder-decoder with an action matching mechanism. The model samples 100 labeled utterances as domain description seeds for domain adaptation.
• Transfer learning is applied on the sequicity model as the second baseline. We train the basic model by simply mixing all the data from the source domains and then following Figure 1 (a) to update the model. We also enlarge the vocabulary with the training data in the target domain. In addition, we implement a one-shot learning version of this model by using only one target domain dialog for adaptation, as a comparison with the one-shot learning case of DAML.

Implementation details
For all experiments, we use the pre-trained GloVe word embedding (Pennington et al., 2014) with a dimension of 50. We choose one-layer GRU networks with a hidden size of 50 to construct the encoder and decoder. The model is optimized using Adam (Kingma and Ba, 2014) with a learning rate of 0.003. We halve the learning rate if the validation loss increases. We set the batch (Ioffe and Szegedy, 2015) size to 32 and the dropout (Zaremba et al., 2014) rate to 0.5.

Results and Analysis
We denote the restaurant, weather and bus domains as "In Domain" data since they are the same domains that we use for training. The data from the movie domain is denoted as "New Domain" as it is unseen in the training data. "Unseen Slot" and "Unseen NLG" represent the restaurant-slot and restaurant-style domains respectively. To keep a fair comparison, both Transfer and DAML use 1% of the source domain data (9 dialogs, 76 system responses in total), which is comparable in size to the seed responses that Zhao and Eskénazi (2018) use. We found that both transfer learning and DAML obtain better results than ZSDG. In particular, for the "New Domain", DAML achieves an entity F1 score of 66.2, a 25.8% relative improvement over ZSDG. For "In Domain" testing, DAML also obtains a 14.4% improvement over ZSDG. However, our method does not obtain a large improvement in the "Unseen Slot" and "Unseen NLG" domains. We notice that these two domains are actually generated from one of the source domains (the restaurant domain). So, even though the slots or templates are changed, they should still share some features with the original domain data. If we could take advantage of the original restaurant domain, the result should improve. Following this intuition, for the "Unseen Slot" and "Unseen NLG" domains, we first fine-tune the model obtained from DAML with the original restaurant training data, and then fine-tune it further with the adaptation data. The results improve further and are presented in parentheses in Table 1.
We see that in most cases, fine-tuning on restaurant data increases both the BLEU score and the entity F1 score on the "Unseen Slot" and "Unseen NLG" domains. Finn et al. (2017) emphasize that meta-learning obtains decent results with extremely small amounts of data, even in one-shot cases. To verify this claim, we perform a one-shot version of DAML training along with one-shot transfer learning, using only one target domain dialog. The results show that even the one-shot case of DAML outperforms the ZSDG baseline in all cases except "Unseen Slot" in entity F1. For the "Unseen NLG" domain, the DAML one-shot case even obtains the highest score. Considering that one-shot DAML also performs outstandingly when adapted to "In Domain", this suggests that the "Unseen NLG" domain is relatively close to the "In Domain" data. Nearly every model achieves a similarly high score by fine-tuning a model that is already adapted to the "In Domain" data. Since the "In Domain" score is already extremely high, we assume the model has learned the common features well. We also mention in Sec 4 that MAML is sensitive to new knowledge. Given that the model has already learned the common features well, in the one-shot setting the model focuses on learning the unique features of the target domain, while the setting with 1% adaptation data still partially focuses on some common features.
Our method also shows an evident advantage not only in better scores but also in far fewer update steps. As shown in Table 1, DAML needs only one epoch to find the optimum when adapting to the "In Domain" data. Even for the "New Domain", DAML uses only 5.8 epochs on average to converge, which is only 40% of the epochs used in transfer learning. The epoch numbers in Table 1 are not integers because all the results in our experiment are averages over ten random runs, as explained in Sec 5.1. Therefore, we conclude that DAML is more efficient than simple transfer learning.
DAML's success mainly comes from three possible reasons. The first is the CopyNet mechanism. The copy model directly copies and outputs word tokens from the context, contributing to the high entity F1 score. The second is the belief span, which also helps to improve the performance. With the belief span, we no longer need to extract slots from all the history utterances in each turn. Instead, we only need the previous slots, stored in the belief span, which the copy model can handle directly. This allows us to simplify our framework and improve the performance. Finally, meta-learning allows our model to learn inner features of the dialog across different domains.
We also vary the tasks used as source and target data to validate the robustness of our model. We use the leave-one-out approach to compare the movie, restaurant, bus and weather domains. When we choose one of them as the target domain, we use the other three as the source domains. The size of the dataset (1% target data for adaptation) and the model hyperparameters are kept the same as in the main experiment described above. As shown in Table 2, the restaurant domain achieves both the highest entity F1 score and the highest BLEU score, which means it is the easiest domain to adapt to. The bus domain receives the lowest entity F1 score, and the movie domain has the second lowest entity F1 score as well as the lowest BLEU score. This demonstrates that the movie domain is indeed a hard domain for adaptation and is worth choosing as the target domain. Among all combinations, DAML outperforms the transfer learning algorithm in both entity F1 and BLEU.
In addition, we investigate the impact of using different amounts of target domain data on system performance. We use the best model trained on restaurant, bus and weather and test on the movie domain. The size of the target data varies from one dialog in one-shot learning to 10% of the data, which is 90 dialogs.
Figure 3 shows that the system performance positively correlates with the amount of training data available in the target domain. We observe that both entity F1 and BLEU scores nearly converge when 4% of the data is used. Although 4% is three times the size of the seed response used in Zhao and Eskénazi (2018), we notice that even the one-shot case of our model outperforms ZSDG in the new domain. This demonstrates our method's capability to achieve good performance with very little target data.
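The leave-one-out protocol used for the robustness study earlier in this section can be enumerated with a few lines. The helper name is hypothetical; the four domain names are those compared in that study.

```python
DOMAINS = ["movie", "restaurant", "bus", "weather"]


def leave_one_out(domains):
    """Yield (source_domains, target_domain) pairs, one per held-out domain."""
    for target in domains:
        sources = [d for d in domains if d != target]
        yield sources, target


splits = list(leave_one_out(DOMAINS))
```

Each of the four splits trains on three source domains and adapts to the held-out one, and the first split corresponds to the main experiment with movie as the target.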
Although DAML has demonstrated outstanding performance in dialog domain adaptation, it still cannot perfectly adapt to a new domain, especially when there are out-of-domain words in the new domain, denoted as unk. If unk appears in the utterance, as in "system: Movie from what country?" "user: Movie from unk.", the system can hardly extract the needed slot since it does not recognize the surface form of the slot, even if we recognize the unk as the entity. If unk appears in the belief span, it is hard for our system to handle the unk token when it uses the copy model to generate the new belief span from the previous one.
The model also has difficulties in handling complex utterances, especially when a sentence contains corrections, such as: "new request. in 2000-2010. oh no, in 70s." In this case, our system successfully adds only 70s to the belief span, mainly because the preposition "in" suggests that 70s is a year. However, the system keeps the original year slot, leading to a no match result. Moreover, in the case "that's wrong. i love western ones.", our system is confused about what the pronoun "ones" refers to, so it does not recognize that "western" is a dialog slot.

Conclusion and Future Work
We propose a domain adaptive dialog generation method based on meta-learning (DAML). We also construct an end-to-end trainable dialog system that utilizes a two-step gradient update to obtain models that are more sensitive to new domains. We evaluate our model on a simulated dataset with multiple independent domains. DAML reaches state-of-the-art performance in entity F1 compared with a zero-shot learning method and a transfer learning method. DAML is an effective and robust method for training dialog systems with limited resources.
DAML also offers promising extensions, such as applying it to reinforcement learning-based dialog systems. We also plan to adapt DAML to multi-domain dialog tasks.