Few-Shot Dialogue Generation Without Annotated Data: A Transfer Learning Approach

Learning with minimal data is one of the key challenges in the development of practical, production-ready goal-oriented dialogue systems. In a real-world enterprise setting where dialogue systems are developed rapidly and are expected to work robustly for an ever-growing variety of domains, products, and scenarios, efficient learning from a limited number of examples becomes indispensable. In this paper, we introduce a technique to achieve state-of-the-art dialogue generation performance in a few-shot setup, without using any annotated data. We do this by leveraging background knowledge from a larger, more highly represented dialogue source, namely the MetaLWOz dataset. We evaluate our model on the Stanford Multi-Domain Dialogue Dataset, consisting of human-human goal-oriented dialogues in in-car navigation, appointment scheduling, and weather information domains. We show that our few-shot approach achieves state-of-the-art results on that dataset by consistently outperforming the previous best model in terms of BLEU and Entity F1 scores, while being more data-efficient because it requires no data annotation.


Introduction
Data-driven dialogue systems are becoming widely adopted in enterprise environments. One of the key properties of a dialogue model in this setting is its data efficiency, i.e. whether it can attain high accuracy and good generalization properties when trained on only minimal data.
Recent deep learning-based approaches to training dialogue systems (Ultes et al., 2018; Wen et al., 2017) put emphasis on collecting large amounts of data in order to account for numerous variations in the user inputs and to cover as many dialogue trajectories as possible. However, in real-world production environments there isn't enough domain-specific data easily available throughout the development process. In addition, it's important to be able to rapidly adjust a system's behavior according to updates in requirements and new product features in the domain. Therefore, data-efficient training is a priority direction in dialogue system research.
In this paper, we build on a technique to train a dialogue model for a new domain in a 'zero-shot' setup (in terms of full dialogues in the target domain) only using annotated 'seed' utterances.
We present an alternative, 'few-shot' approach to data-efficient dialogue system training: we do use complete in-domain dialogues while using approximately the same amount of training data as that technique, with respect to utterances. However, in our method, no annotation is required: we instead use a latent dialogue act annotation learned in an unsupervised way from a larger (multi-domain) data source, broadly following the Latent Action Encoder-Decoder (LAED) model. This approach is potentially more attractive for practical purposes because it is easier to collect unannotated dialogues than to collect utterances across various domains under a consistent annotation scheme.

Related Work
There is a substantial amount of work on learning dialogue with minimal data, starting with the Dialog State Tracking Challenge 3 (Henderson et al., 2014) where the problem was to adjust a pre-trained state tracker to a different domain using a seed dataset.
In dialogue response generation, there has also been work on bootstrapping a goal-oriented dialogue system from a few examples using a linguistically informed model. That work used an incremental semantic parser, DyLan (Eshghi et al., 2011; Eshghi, 2015), to obtain contextual meaning representations, and based the dialogue state on this (Kalatzis et al., 2016). Incremental response generation was learned using Reinforcement Learning, again using the parser to incrementally process the agent's output and thus prune ungrammatical paths for the learner. Compared to a neural model, the End-to-End Memory Network (Sukhbaatar et al., 2015), this linguistically informed model was superior in a 1-shot setting. At the same time, its main linguistic resource, a domain-general dialogue grammar for English, makes the model inflexible unless wide coverage is achieved.
Transfer learning for Natural Language Processing is strongly motivated by recent advances in vision. When training a convolutional neural network (CNN) on a small dataset for a specific problem domain, it often helps to learn low-level convolutional features from a larger, more diverse dataset. For numerous applications in vision, ImageNet (Deng et al., 2009) became the source dataset for pre-training convolutional models. For NLP, the main means of transfer were Word2Vec word embeddings (Mikolov et al., 2013), which have recently been superseded by models that capture context as well (Peters et al., 2018; Devlin et al., 2018). While these tools are widely known to improve performance in various tasks, more specialized models could also be created for specific research areas, e.g. dialogue generation in our case.
The models above are some of the approaches to one of the central issues of efficient knowledge transfer: learning a unified data representation generalizable across datasets, dubbed 'representation learning'. In our approach, we will use one such technique based on variational autoencoding with discrete latent variables. In this paper we present an approach to transfer learning which is more tailored, both model-wise and dataset-wise, to goal-oriented dialogue in underrepresented domains.

The approach

Zero-shot theoretical framework
We first describe the original Zero-Shot Dialogue Generation (ZSDG) theoretical framework on which we base our work. For ZSDG, there is a set of source dialogue domains and one target domain, with the task of training a dialogue response generation model from all the available source data and a significantly reduced subset of the target data (referred to as seed data). The trained system's performance is evaluated exclusively on the target domain.
More specifically, the data in ZSDG is organized as follows. There are unannotated dialogues in the form of {c, x, d}_src/tgt: tuples of dialogue contexts, responses, and domain names respectively, for each of the source and target domains. There are also domain descriptions in the form of {x, a, d}_src/tgt: tuples of utterances, slot-value annotations, and domain names respectively, for the source and target domains.
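The two kinds of tuples above can be pictured as simple records. The following is a minimal sketch only: the class names, field names, example utterances, and annotation format are all hypothetical and are not taken from the ZSDG implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DialogueExample:
    """One unannotated tuple {c, x, d} (hypothetical container)."""
    context: List[str]  # c: the utterances preceding the response
    response: str       # x: the response the model should generate
    domain: str         # d: the domain name


@dataclass(frozen=True)
class DomainDescription:
    """One annotated seed tuple {x, a, d} (hypothetical container)."""
    utterance: str      # x: a single utterance
    annotation: str     # a: slot-value annotation as a token sequence
    domain: str         # d: the domain name


# Illustrative values only; the annotation format is an assumption.
dialogue = DialogueExample(
    context=["what is the weather like today", "which city do you mean"],
    response="it is sunny in new york",
    domain="weather",
)
seed = DomainDescription(
    utterance="it is sunny in new york",
    annotation="inform weather sunny location new_york",
    domain="weather",
)
```

Note that x plays a different role in each record: it is the response to a context in the first, and the utterance being annotated in the second.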
ZSDG is essentially a hierarchical encoder-decoder model which is trained in a multi-task fashion by receiving two types of data: (1) dialogue batches drawn from all the available source-domain data, and (2) seed data batches, a limited number of which are drawn from domain description data for all of the source and target domains.
The ZSDG model optimizes two objectives. With dialogue batches, the model maximizes the probability of generating a response given the context (Eq. 1). Here F^e and F^d are respectively the encoding and decoding components of a hierarchical generative model; R is the shared recurrent utterance encoder (the recognition model); and D is a distance function (L2 norm). In turn, with domain description batches, the model maximizes the probability of generating the utterance given its slot-value annotation, both represented as sequences of tokens (Eq. 2).
In this multi-task setup, the latent space of R is shared between both utterances and domain descriptions across all the domains. Moreover, the distance-based loss terms make sure that (a) utterances with similar annotations are closer together in the latent space (Eq. 2), and (b) utterances are closer to their dialogue contexts (Eq. 1) so that their encodings capture some of the contexts' meaning. These properties of the model make it possible to achieve better cross-domain generalization.
Figure 1: (1a) the discretized LAED dialogue representation is trained on the Transfer dataset; (1b) a zero/few-shot dialogue generation model is then trained on SMD with this representation incorporated.
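As a sketch, the two ZSDG objectives (Eq. 1 and Eq. 2) can be written as losses to be minimized, using the symbols defined above. The weighting coefficient \lambda on the distance term and the exact arguments passed to R and F^e are assumptions, not taken from the text.

```latex
\begin{align}
\mathcal{L}_{\text{dialog}} &= -\log p_{F^d}\big(x \mid F^e(c, d)\big)
  + \lambda\, D\big(R(x, d) \,\|\, F^e(c, d)\big) \tag{1} \\
\mathcal{L}_{\text{dd}} &= -\log p_{F^d}\big(x \mid R(a, d)\big)
  + \lambda\, D\big(R(x, d) \,\|\, R(a, d)\big) \tag{2}
\end{align}
```

In this form, the first term of each loss is the generation likelihood. The second term pulls the encoding of an utterance towards the encoding of its context in (1), and towards the encoding of its annotation in (2).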

Unsupervised representation learning
As was the case with ZSDG, robust representation learning helps achieve better generalization across domains. The most widely adopted way to train better representations has been to leverage a greater data source. In this work, we consider unsupervised, variational autoencoder-based (VAE) representation learning on a large dataset of unannotated dialogues. The specific approach we refer to is the Latent Action Encoder-Decoder (LAED) model. LAED is a variant of VAE with two modifications: (1) an optimization objective augmented with mutual information between the input and the latent variable for better and more stable learning performance, and (2) a discretized latent variable for the interpretability of the resulting latent actions. Just as in ZSDG, LAED is a hierarchical encoder-decoder model, with the key component being a discrete-information (DI) utterance-level VAE. Two versions of this model are introduced, DI-VAE and DI-VST, each with its own optimization objective. In these objectives, R and G are the recognition and generation components respectively, x is the model's input, z is the latent variable, and p(z) and q(z) are respectively the prior and posterior distributions of z.
DI-VAE works in a standard VAE fashion, reconstructing the input x itself, while DI-VST follows the idea of Variational Skip-Thought, reconstructing the input's previous and next contexts, {x_n, x_p}. As reported by the authors, the two models capture different aspects of utterances: DI-VAE reconstructs specific words within an utterance, whereas DI-VST captures the overall intent better (see the visualization in Figure 1a).
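As a sketch, the DI-VAE and DI-VST objectives can be written as follows, to be maximized. This form rests on two assumptions not stated in the text: that q(z) denotes the posterior aggregated over the data, and that DI-VST uses separate generators p_G^n and p_G^p for the next and previous contexts.

```latex
\begin{align}
\mathcal{L}_{\text{DI-VAE}} &= \mathbb{E}_{q_R(z \mid x)\, p(x)}\big[\log p_G(x \mid z)\big]
  - \mathrm{KL}\big(q(z) \,\|\, p(z)\big) \\
\mathcal{L}_{\text{DI-VST}} &= \mathbb{E}_{q_R(z \mid x)\, p(x)}\big[\log p_G^{n}(x_n \mid z)\, p_G^{p}(x_p \mid z)\big]
  - \mathrm{KL}\big(q(z) \,\|\, p(z)\big)
\end{align}
```

Under these assumptions, the two objectives differ only in the reconstruction target: the input x itself for DI-VAE, and its neighbouring utterances x_n and x_p for DI-VST.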

Proposed models
In our approach, we simplify the ZSDG setup by not using any explicit domain descriptions; therefore, we only work with 'dialogue' batches. We also make use of Knowledge Base information without loss of generality (see Section 5); thus we work with data of the form {c, x, k, d} where k is the KB information. We refer to this model as Few-Shot Dialogue Generation, or FSDG.
For learning a reusable dialogue representation, we use an external multi-domain dialogue dataset, the Transfer dataset (see Section 4).
We compare this LAED-augmented model to a similar one, with latent representation trained on the same data but using a regular VAE objective and thus providing regular continuous embeddings (we refer to it as FSDG+VAE).
Finally, in order to explore the original ZSDG setup as much as possible, we also consider its version with automatic Natural Language Understanding (NLU) markup instead of human annotations as domain descriptions. Our NLU annotations come from a Named Entity Recognizer (Finkel et al., 2005), a date/time extraction library (Chang and Manning, 2012), and a Wikidata entity linker (Pappu et al., 2017). We have models with the LAED representation (NLU ZSDG+LAED) and without it (NLU ZSDG). Our entire setup is shown in Figure 1.

Datasets
We use the Stanford Multi-Domain (SMD) human-human goal-oriented dialogue dataset (Eric et al., 2017) in 3 domains: appointment scheduling, city navigation, and weather information. Each dialogue comes with a knowledge base snippet from the underlying domain-specific API.
For LAED training, we use MetaLWOz (Lee et al., 2019), a human-human goal-oriented dialogue corpus specifically designed for various meta-learning and pre-training purposes. It contains conversations in 51 domains with several tasks in each of those. The dialogues were collected using the Wizard-of-Oz method, where human participants were given a problem domain and a specific task. No domain-specific APIs or knowledge bases were available to the participants, and in the actual dialogues they were free to use fictional names and entities in a consistent way. The dataset totals more than 40,000 dialogues, with an average length of 11.9 turns.

Experimental setup and evaluation
Our few-shot setup is as follows. Given the target domain, we first train LAED models (a dialogue-level DI-VST and an utterance-level DI-VAE, both of the size 10 × 5) on the MetaLWOz dataset; here we exclude from training every domain that might overlap with the target one.
Next, using the LAED encoders, we train a Few-Shot Dialogue Generation model on all the SMD source domains. We use a random sample (1% to 10%) of the target domain utterances together with their contexts as seed data.
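The seed-data selection described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `sample_seed_turns` is hypothetical, every non-initial utterance counts as a candidate response, and sampling is uniform over utterances. The sampling granularity used in the actual experiments may differ.

```python
import random
from typing import List, Sequence, Tuple

Turn = Tuple[List[str], str]  # (context utterances, response utterance)


def sample_seed_turns(
    dialogues: Sequence[Sequence[str]], fraction: float, seed: int = 0
) -> List[Turn]:
    """Pick a random fraction of target-domain utterances with their contexts."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Every utterance after the first is paired with all utterances before it.
    turns: List[Turn] = [
        (list(dialogue[:i]), dialogue[i])
        for dialogue in dialogues
        for i in range(1, len(dialogue))
    ]
    if not turns:
        return []
    k = max(1, round(fraction * len(turns)))
    # A seeded generator keeps the sample reproducible across runs.
    return random.Random(seed).sample(turns, k)


# Toy target-domain data: 10 dialogues of 11 utterances each (100 candidate turns).
toy_dialogues = [[f"d{d}_u{i}" for i in range(11)] for d in range(10)]
seed_turns = sample_seed_turns(toy_dialogues, fraction=0.05, seed=1)
```

Keeping the context with each sampled utterance is what separates this few-shot seed data from the isolated seed utterances of the zero-shot setup.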
We incorporate KB information into our model by simply serializing the records and prepending them to the dialogue context, ending up with a setup similar to CopyNet in (Eric et al., 2017).
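As an illustration of serializing KB records and prepending them to the context, consider the sketch below. The record fields, the flattening format, and the function names are assumptions made for the example; the text does not specify the actual serialization scheme.

```python
from typing import Dict, List, Sequence


def serialize_kb(records: Sequence[Dict[str, str]]) -> List[str]:
    """Flatten each KB record into one space-separated 'key value ...' string."""
    return [
        " ".join(f"{key} {value}" for key, value in record.items())
        for record in records
    ]


def build_model_input(
    records: Sequence[Dict[str, str]], context: Sequence[str]
) -> List[str]:
    """Prepend the serialized KB records to the dialogue context."""
    return serialize_kb(records) + list(context)


# Hypothetical navigation-domain record and a one-utterance context.
kb_records = [{"poi": "home", "address": "5_main_st"}]
model_input = build_model_input(kb_records, ["where is home"])
```

With this layout the encoder sees KB entries as if they were extra utterances at the start of the dialogue, so a copy mechanism can point at entity tokens in the same way as it points at context tokens.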
For the NLU ZSDG setup, we use 1000 random seed utterances from each source domain and 200 utterances from the target domain.
For evaluation, we follow the ZSDG evaluation approach and report BLEU and Entity F1 scores (means and variances over 10 runs).
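The Entity F1 metric and the aggregation over runs can be sketched as below. This is one common formulation and carries several assumptions: micro-averaging over whitespace tokens that match a known entity set, and population variance for the run summary. The evaluation scripts used for the reported numbers may differ in tokenization, entity matching, and variance estimator.

```python
from statistics import mean, pvariance
from typing import Sequence, Set, Tuple


def entity_f1(
    predicted: Sequence[str], reference: Sequence[str], entities: Set[str]
) -> float:
    """Micro-averaged F1 over entity tokens in paired responses."""
    tp = fp = fn = 0
    for pred, ref in zip(predicted, reference):
        pred_entities = set(pred.split()) & entities
        ref_entities = set(ref.split()) & entities
        tp += len(pred_entities & ref_entities)   # entities correctly produced
        fp += len(pred_entities - ref_entities)   # entities produced in error
        fn += len(ref_entities - pred_entities)   # entities that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and (population) variance of a metric across repeated runs."""
    return mean(scores), pvariance(scores)


# Toy example with a hypothetical entity vocabulary.
entity_vocab = {"sunny", "rainy", "boston", "main_st"}
f1 = entity_f1(
    ["it is sunny in boston", "go to main_st"],
    ["it is rainy in boston", "go to main_st"],
    entity_vocab,
)
run_mean, run_variance = summarize_runs([0.5, 0.7])
```

In the toy example the prediction gets two of three reference entities right and produces one wrong entity, so precision and recall are both 2/3.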

Results and discussion
Our results are shown in Table 1. Our objective here is maximum accuracy with minimum training data required, and it can be seen that few-shot models with LAED representation are the best performing models for this objective. While the improvements can already be seen with simple FSDG, the use of LAED representation helps to significantly reduce the amount of in-domain training data needed: in most cases, the state-of-the-art results are attained with as little as 3% of in-domain data. At 5%, we see that FSDG+LAED consistently improves upon all other models in every domain, either by increasing the mean accuracy or by decreasing the variation. In SMD, with its average dialogue length of 5.25 turns (see Table 4), 5% of training dialogues amounts to approximately 200 in-domain training utterances. In contrast, the ZSDG setup used approximately 150 annotated training utterances for each of the 3 domains, totalling about 450 annotated utterances. Although in our few-shot approach we use full in-domain dialogues, we end up having a comparable amount of target-domain training data, with the crucial difference that none of it has to be annotated for our approach. Therefore, the method we introduced attains state-of-the-art results in both accuracy and data efficiency.
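The estimate of roughly 200 in-domain utterances at 5% can be checked with a short calculation. The per-domain dialogue count of 800 below is an assumption made for illustration and is not stated in the text. The 5% fraction, the 5.25 average turns, and the ZSDG figures of 150 utterances over 3 domains are taken from the paragraph above.

```python
def approx_seed_utterances(
    num_dialogues: int, fraction: float, avg_turns: float
) -> float:
    """Approximate utterance count in a sampled fraction of dialogues."""
    return num_dialogues * fraction * avg_turns


# ASSUMPTION: about 800 training dialogues per SMD domain (illustrative value).
ASSUMED_DIALOGUES_PER_DOMAIN = 800

fsdg_estimate = approx_seed_utterances(ASSUMED_DIALOGUES_PER_DOMAIN, 0.05, 5.25)

# ZSDG figures from the text: about 150 annotated utterances per domain, 3 domains.
zsdg_annotated_total = 150 * 3
```

Under that assumption the few-shot setup uses on the order of 200 unannotated target-domain utterances, against about 450 annotated utterances for ZSDG.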
The results of the NLU ZSDG setup demonstrate that single utterance annotations, if not domain-specific and produced by human experts, don't provide as much signal as raw dialogues.
The comparison of the setups with different latent representations also gives us some insight: while the VAE-powered FSDG model improves on the baseline in multiple cases, it lacks generalization potential compared to LAED. A possible reason is the inherently more stable training of LAED due to its modified objective function, which in turn results in a more informative, generalizable representation.
Finally, we discuss the evaluation metrics. Since we base this paper on the ZSDG work, we have had to fully conform to the metrics used there to enable direct comparison. However, BLEU as the primary evaluation metric does not necessarily reflect NLG quality in dialogue settings: see the examples in Table 2 of the Appendix (see also Novikova et al. (2017)). This is a general issue in dialogue model evaluation since the variability of possible responses equivalent in meaning is very high in dialogue. In future work, instead of using BLEU, we will put more emphasis on the meaning of utterances, for example by using external dialogue act tagging resources, quality metrics of language generation (e.g. perplexity), and more task-oriented metrics like Entity F1. We expect these to make for more meaningful evaluation criteria.

Conclusion and future work
In this paper, we have introduced a technique to achieve state-of-the-art dialogue generation performance in a few-shot setup, without using any annotated data. By leveraging larger, more highly represented dialogue sources and learning robust latent dialogue representations from them, we obtained a model with superior generalization to an underrepresented domain. Specifically, we showed that our few-shot approach achieves state-of-the-art results on the Stanford Multi-Domain dataset while being more data-efficient than the previous best model, by not requiring any data annotation.
Although state-of-the-art, the accuracy scores themselves still suggest that our technique is not ready for immediate adoption for real-world production purposes, and the task of few-shot generalization to a new dialogue domain remains an area of active research. We expect that such initiatives will be fostered by the release of large dialogue corpora such as MetaLWOz.
In our own future work, we will try to find ways to improve the unsupervised representation in order to increase the transfer potential. Adversarial learning can also be beneficial in the setting of limited data. Apart from improving the model itself, it is necessary to consider an alternative criterion to the BLEU score for adequate evaluation of response generation.