Towards Low-Resource Semi-Supervised Dialogue Generation with Meta-Learning

In this paper, we propose MEDST, a meta-learning extension of the semi-supervised explicit dialogue state tracker (SEDST) for neural dialogue generation. Our main motivation is to further bridge the chasm between the need for a high-accuracy dialogue state tracker and the common reality that only scarce annotated data is available for most real-life dialogue tasks. Specifically, MEDST has two core steps: meta-training with adequate unlabelled data in an automatic way, and meta-testing with a few annotated examples by supervised learning. In particular, we enhance SEDST via entropy regularization, and investigate semi-supervised learning frameworks based on model-agnostic meta-learning (MAML) that reduce the amount of required intermediate state labelling. We find that by leveraging un-annotated data in this meta-learning manner, the amount of dialogue state annotation can be reduced to below 10% while maintaining equivalent system performance. Experimental results show that MEDST outperforms SEDST substantially, by 18.7% joint goal accuracy and 14.3% entity match rate, on the KVRET corpus with 2% labelled data in semi-supervision.


Introduction
Task-oriented dialogue systems (Young et al., 2013) are designed to help users achieve specific goals such as restaurant reservation or navigation inquiry. In recent years, fully neural end-to-end architectures usually take the sequence-to-sequence (Seq2Seq) model (Sutskever et al., 2014) to generate dialogue responses from the user inputs and context history (Madotto et al., 2018; Wen et al., 2018; Qin et al., 2019; Wu et al., 2019). Neural generative models for task-oriented dialogue systems have achieved promising performance on generation tasks when given a huge training dataset and detailed annotations (Zhao et al., 2017; Zhang et al., 2019). Arguably, high-quality intermediate labels play a key role in obtaining satisfactory results in this line of tasks. However, collecting these labels is often the bottleneck in dataset creation: the process is expensive and time-consuming, requiring domain and expert knowledge (Asri et al., 2017), which holds back development in the area of dialogue systems and greatly limits their application in real-world settings.
Various approaches (Kannan et al., 2018; Tseng et al., 2019; Yin et al., 2019; Chang et al., 2019; Peng et al., 2020) have been proposed recently to tackle this challenge; the semi-supervised explicit dialogue state tracker (SEDST) is one of them. SEDST addresses the label-scarcity problem via classical semi-supervision. However, the gain from SEDST is not satisfactory when the annotated data is very scarce, only a small fraction of what is expected.
In this paper, we focus on improving dialogue performance on top of SEDST. Our first contribution is to enhance SEDST with meta-learning. We propose a MAML-based (Finn et al., 2017) semi-supervision architecture for low-resource settings. Our experiments on the KVRET dataset show that this integration of MAML and SEDST reaches comparable dialogue state tracking accuracy with below 10% intermediate annotation.

[Figure 2: (a) MEDST: meta-training with unlabelled data, meta-testing with labelled data D_test. (b) SEDST trains the unlabelled data D_U and labelled data D_A in the same phase. More details can be found in our previous work (Huang et al., 2020).]

Second, we improve SEDST with entropy regularization, which leads to a more robust and more accurate model. Third, to the best of our knowledge, this work is the first attempt to explore meta-based semi-supervised learning for multi-domain task-oriented dialogue tasks, and the method can be easily applied to other new scenarios too.

Proposed Approaches
In this section, we describe the details of MEDST, starting with a brief overview of SEDST.

SEDST
SEDST is a generative model with a copying mechanism and posterior regularization. Dialogue states S_t are represented as text spans that flow along the dialogue turns and finally attend to the generation of the dialogue response via the copying mechanism. Posterior regularization is applied to optimize the training procedure. The normal forward pass of the network calculates the prior probability distribution over the vocabulary space, taking the concatenation of the current utterance U_t and the previous response R_{t-1} as input. By contrast, the posterior distribution is computed with the current actual response R_t added to the inputs, which gives more information. The network is updated via the KL-divergence between the prior and posterior distributions over unlabelled data.

Figure 1 illustrates how the process works. The model first takes the concatenation of the previous response R_{t-1} and the current user utterance U_t as input and encodes them into hidden vectors. The belief span decoder is attention-based and extracts the belief span S_t from the previous response R_{t-1}, the utterance U_t and the previous state S_{t-1}. S_t is then concatenated with R_{t-1} and U_t to generate the response R_t. Denote the context as c = {S_{t-1}, R_{t-1}, U_t}. The forward propagation network calculates the prior probability distribution P_Θ over the vocabulary of S_t. Posterior regularization builds a posterior network, which has the same structure as the prior network, to learn S_t with the posterior probability distribution Q_Φ.

[Algorithm 1: MEDST training procedure, with meta-training steps (pre-update the model with gradient descent on unlabelled data) followed by meta-testing steps.]

Posterior regularization forces the prior distribution to approximate that of the posterior network via KL-divergence during training. At test time, only the prior network is active.
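As a minimal sketch (not the authors' implementation), the posterior-regularization term can be computed as a summed KL-divergence between the posterior network's per-position distributions Q_Φ and the prior network's P_Θ:

```python
import numpy as np

def kl_divergence(q, p, eps=1e-12):
    """KL(q || p) between two categorical distributions over the vocabulary."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sum(q * (np.log(q + eps) - np.log(p + eps))))

def posterior_regularization_loss(prior_dists, posterior_dists):
    """Sum of KL(Q_Phi || P_Theta) over the N positions of the state span.

    prior_dists / posterior_dists: per-position vocabulary distributions,
    i.e. arrays of shape (N, |V|).
    """
    return sum(kl_divergence(q, p) for q, p in zip(posterior_dists, prior_dists))
```

Identical distributions incur zero penalty; the gradient of this term pulls the prior network toward the better-informed posterior.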
In supervised or semi-supervised learning, the learning objective is defined as

L_1 = -Σ_{x∈A} Σ_{i=1}^{N} (log p_i(s_i*) + log q_i(s_i*)) + λ Σ_{x∈U} Σ_{i=1}^{N} D_KL(q_i ‖ p_i),   (1)

where A and U indicate the sets of labelled and unlabelled data, p_i and q_i are the prior and posterior probability distributions of the i-th word in the state, s_i* is the gold token at position i, and N is the length of the state span. When the dataset is entirely unlabelled, the posterior network is extended with an auto-encoder and the learning objective L_2 combines the response generation loss, the reconstruction loss and the KL-divergence loss.
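A rough sketch of this semi-supervised objective (our own illustrative code, assuming per-position vocabulary distributions as plain Python lists) could look like:

```python
import math

def nll(dist, gold_index, eps=1e-12):
    """Negative log-likelihood of the gold token at one state-span position."""
    return -math.log(dist[gold_index] + eps)

def semi_supervised_loss(labelled, unlabelled, lam=1.0, eps=1e-12):
    """Supervised NLL of both networks on labelled state spans, plus a
    lambda-weighted KL(q || p) on unlabelled ones.

    labelled:   list of (p_i, q_i, gold_index) per state-span position
    unlabelled: list of (p_i, q_i) per state-span position
    """
    loss = 0.0
    for p, q, gold in labelled:
        loss += nll(p, gold) + nll(q, gold)          # supervised term over A
    for p, q in unlabelled:
        loss += lam * sum(qi * (math.log(qi + eps) - math.log(pi + eps))
                          for pi, qi in zip(p, q))   # KL term over U
    return loss
```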

MEDST
We present a new perspective on how to effectively use unlabelled examples for better accuracy and domain adaptation under low-resource conditions. Our proposed model MEDST, motivated by the powerful internal representation ability of meta-learning and the positive effect of entropy in semi-supervised learning, approaches the challenges that remain in SEDST as follows: (1) MEDST enhances the original loss with entropy regularization; (2) MEDST adds MAML-based semi-supervision on top of SEDST, as shown in Algorithm 1. MEDST includes two phases: meta-training with unlabelled data and meta-testing with labelled data, as shown in Figure 2.
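The two-phase procedure can be illustrated with a toy first-order MAML sketch (not the authors' code): the model parameter here is a single scalar and the "loss" a quadratic, purely to make the inner/outer update structure concrete. We assume alpha is the inner (pre-update) learning rate and beta the outer (meta) learning rate.

```python
def grad(theta, target):
    # d/dtheta of the toy loss (theta - target)^2
    return 2.0 * (theta - target)

def meta_train(theta, tasks, alpha=0.1, beta=0.04, steps=200):
    """Meta-training: pre-update on each task's support set, then update the
    initialisation from the query-set gradient (first-order approximation)."""
    for _ in range(steps):
        meta_grad = 0.0
        for support_target, query_target in tasks:
            theta_prime = theta - alpha * grad(theta, support_target)  # pre-update
            meta_grad += grad(theta_prime, query_target)               # query gradient
        theta -= beta * meta_grad / len(tasks)
    return theta

def meta_test(theta, labelled_target, lr=0.015, steps=200):
    """Meta-testing: fine-tune the meta-trained initialisation on a small
    amount of labelled data with plain gradient descent."""
    for _ in range(steps):
        theta -= lr * grad(theta, labelled_target)
    return theta
```

In the full model the scalar update is replaced by gradient steps on the SEDST losses, but the support/query split and the two-phase schedule are the same.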
In the meta-training phase, we train our model with an objective similar to SEDST's L_2, further improved with an entropy loss. As shown in Section 2.1, SEDST suffers from data bias when no labelled resource is available. Following Grandvalet and Bengio (2004), entropy minimization applies a simple loss term to unlabelled data so that the network makes high-confidence (low-entropy) predictions. The regularizer keeps the decision boundary from passing through data points, which leads to less class overlap. Therefore, we add the entropy regularization to L_2:

L_2' = L_2 + Σ_{x∈U} Σ_{i=1}^{N} (H(p_i) + H(q_i)), where H(p) = -Σ_v p(v) log p(v).

In the meta-testing phase, a small amount of labelled data is available to optimize the pre-trained model. Different from unsupervised learning, the labelled data can be used to compute the prior and posterior probability distributions of S_t, so the entropy terms over p_i and q_i are replaced with more deterministic supervised terms. The loss function in meta-testing is therefore obtained from L_2' by substituting the entropy terms with the supervised negative log-likelihoods -Σ_{x∈A} Σ_{i=1}^{N} (log p_i(s_i*) + log q_i(s_i*)), where s_i* is the gold token at position i.

We further explore the adaptation ability of MEDST to new domains on the KVRET dataset. Specifically, we choose two domains as source domains with adequate unlabelled data and the remaining domain as the target domain with a small amount of labelled data.
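A minimal sketch of the entropy term described above (function names and the unweighted sum are our own illustration, not the paper's exact formulation):

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a categorical distribution over the vocabulary."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))

def entropy_regularizer(prior_dists, posterior_dists):
    """Entropy term added to the unsupervised loss in meta-training: it pushes
    both networks towards confident (low-entropy) predictions on unlabelled
    state-span positions."""
    return (sum(entropy(p) for p in prior_dists)
            + sum(entropy(q) for q in posterior_dists))
```

A confident, nearly one-hot prediction is penalised far less than a uniform one, which is exactly the low-entropy pressure entropy minimization is meant to apply.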

Experimental setup

Corpus and metrics
The KVRET corpus is a multi-domain task-oriented dialogue dataset covering three distinct domains. There are 284 informable slot values for state tracking. The corpus contains 2425, 302 and 302 dialogues for training, validation and testing, respectively. We use three metrics for evaluation. Joint goal accuracy (Acc) is the proportion of dialogue turns where all the constraints are captured correctly. Entity match rate (EMR) is the proportion of dialogues in which the system captures the correct user goal. We use BLEU (Papineni et al., 2002), a word-overlap based metric, to evaluate the fluency of the generated responses.
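To make the Acc metric concrete, here is an illustrative sketch (our own, assuming a turn's state is a dict of slot-value constraints):

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of dialogue turns whose predicted constraints exactly match
    the gold constraints (states as slot-value dicts)."""
    matches = sum(1 for pred, gold in zip(predicted_states, gold_states)
                  if pred == gold)
    return matches / len(gold_states)

# Toy example: two of three turns have every constraint right; the second
# turn misses the "time" slot, so the whole turn counts as wrong.
pred = [{"poi": "pizza hut"}, {"date": "friday"}, {"room": "100"}]
gold = [{"poi": "pizza hut"}, {"date": "friday", "time": "3pm"}, {"room": "100"}]
```

The all-or-nothing turn-level match is what distinguishes joint goal accuracy from per-slot accuracy.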

Implementation details
We choose single-layer GRU networks with a hidden size of 50 as the encoder and the decoder. All embeddings are initialized with GloVe (Pennington et al., 2014) and the word embedding size is set to 50. The model is optimized using Adam (Kingma and Ba, 2015) with a learning rate of 0.003 in meta-training and 0.0015 in meta-testing. The model is implemented in PyTorch.
In MEDST's semi-supervision, we randomly choose 720 unlabelled dialogues per domain for meta-training, with a support-set batch size of 32 and a query-set batch size of 8. The parameter β is set to 0.04, α to 0.1 and λ to 1. Different amounts of labelled data (2%, 4%, 6% and 8% of the training set) are used for meta-testing, and the trained models are selected on the basis of validation performance.

Main results
In the main experiments, we take the original SEDST as the baseline. Table 1 shows that MEDST achieves substantial improvements with different proportions of labelled data. To verify the effectiveness of our structure, we conduct ablation experiments under different setups. w/o Entropy (removing entropy; Acc increases by 8.98% and EMR by 8.0% on average) uses the same regularization loss function as SEDST in meta-training; the improvement here mainly benefits from the MAML algorithm, which builds an internal representation across multiple tasks and maximizes the sensitivity of the loss function when applied to new tasks. w/o MAML (removing MAML; Acc increases by 1.83% and EMR by 2.15% on average) has the same framework and one-stage training procedure as SEDST; its improvement is due to entropy regularization, which accounts for the uncertainty of unlabelled data. We find that MEDST's advantage mainly comes from MAML, which makes MAML a promising mechanism for semi-supervision in further studies. Figure 3 plots BLEU for MEDST under different amounts of labelled data: MEDST with only 10% labelled data reaches a BLEU similar to that of SEDST, which requires 25%.
In the new-domain adaptation experiments, MEDST performs meta-training on unlabelled data from the source domains and meta-testing on 5% labelled data from the target domain, whereas SEDST takes the source domains' unlabelled data and 5% of the target domain's labelled data together in a one-stage training process. The results in Table 2 show that MEDST improves new-domain adaptation: the three target domains achieve 18.97% higher Acc and 13.93% higher EMR, and also obtain better generated language quality.

Conclusions
In this work, we investigate the MAML algorithm and entropy regularization on top of SEDST for low-resource dialogue tasks. We demonstrate the superiority of our proposed model MEDST with low-resource labelled data and perform a fair amount of ablation studies. MEDST can also be adapted to new domains with much better performance. Future work includes exploring further mechanisms combining semi-supervision and MAML for other tasks.