Data Manipulation: Towards Effective Instance Learning for Neural Dialogue Generation via Learning to Augment and Reweight

Current state-of-the-art neural dialogue models learn from human conversations following the data-driven paradigm. As such, a reliable training corpus is the crux of building a robust and well-behaved dialogue model. However, due to the open-ended nature of human conversations, the quality of user-generated training data varies greatly, and effective training samples are typically insufficient while noisy samples frequently appear. This impedes the learning of those data-driven neural dialogue models. Therefore, effective dialogue learning requires not only more reliable learning samples, but also fewer noisy samples. In this paper, we propose a data manipulation framework to proactively reshape the data distribution towards reliable samples by augmenting and highlighting effective learning samples as well as reducing the effect of inefficient samples simultaneously. In particular, the data manipulation model selectively augments the training samples and assigns an importance weight to each instance to reform the training data. Note that, the proposed data manipulation framework is fully data-driven and learnable. It not only manipulates training samples to optimize the dialogue generation model, but also learns to increase its manipulation skills through gradient descent with validation samples. Extensive experiments show that our framework can improve the dialogue generation performance with respect to various automatic evaluation metrics and human judgments.


Introduction
Open-domain dialogue generation, due to its potential applications, is becoming ubiquitous in the natural language processing community. Current end-to-end neural dialogue generation models (Li et al., 2016a; Serban et al., 2017; Zhao et al., 2017) are primarily built following the data-driven paradigm, that is, these models mimic human conversations by training on large-scale query-response pairs. As such, a reliable training corpus that exhibits high-quality conversations is the crux of building a robust and well-behaved dialogue model.
* Work done at Data Science Lab, JD.com.
Unfortunately, owing to the subjectivity and open-ended nature of human conversations, the quality of the collected human-generated dialogues varies greatly (Shang et al., 2018), which hampers the effectiveness of data-driven dialogue models: 1) Effective conversation samples are quite insufficient. To gain some insight into the data quality of dialogue corpora, we use query-relatedness as a probe. In a dialogue corpus, some conversations are quite coherent, with well-correlated queries and responses, while others are not. Query-relatedness measures the semantic similarity between the query and its corresponding response in the embedding space and ranges from 0 to 1. When reviewing DailyDialog (Li et al., 2017), we find that only 12% of conversation samples have relatively high query-relatedness scores (> 0.6). Without adequate reliable training samples, the neural dialogue model is prone to converge to a sub-optimal point. 2) Meanwhile, noisy and even meaningless conversation samples frequently appear. As Li et al. (2016b) reported, "I don't know" appears in over 113K sentences in the training corpus OpenSubtitles (Lison and Tiedemann, 2016). Such noisy conversation data prevails in neural dialogue model training and severely impedes model learning.
Therefore, effective dialogue learning requires not only more reliable learning samples, but also fewer noisy samples. In this work, as illustrated in Figure 1, we propose a novel learnable data manipulation framework to proactively reshape the data distribution towards reliable samples by augmenting and highlighting effective learning samples as well as reducing the weights of inefficient samples simultaneously. Specifically, to generate more effective data samples, the data manipulation model selectively augments the training samples at both the word level and the sentence level, using masked language models such as BERT (Devlin et al., 2019) and the back-translation technique (Sennrich et al., 2016a). To reduce the weights of inefficient samples among the original training samples and the augmented samples, the data manipulation model assigns an importance weight to each sample to adapt its effect on dialogue model training. It gives higher importance weights to critical learning samples and lower weights to inefficient samples. Furthermore, different from most previous data augmentation or data weighting studies (Li et al., 2019; Shang et al., 2018; Csáky et al., 2019), which are unaware of the target model states during augmentation or weighting, our data manipulation framework not only manipulates training samples to optimize the dialogue generation model, but also learns to increase its manipulation skills through gradient descent with validation samples.
We apply the proposed data manipulation framework to several state-of-the-art generation models with two real-life open-domain conversation datasets and compare with recent data manipulation approaches in terms of 13 automatic evaluation metrics and human judgment. Experiment results show that our data manipulation framework outperforms the baseline models on most of the metrics on both datasets.

Data Manipulation for Neural Dialogue Generation
The proposed data manipulation framework tackles the problem of uneven data quality by inducing the model to learn from more effective dialogue samples while simultaneously reducing the effect of inefficient samples. In particular, as illustrated in Figure 2, it manipulates and reshapes the data distribution for neural dialogue model learning in three main stages: first, each batch of training samples is selectively augmented to generate more varied samples; then, all the samples, including the original samples and the augmented samples, are assigned instance weights indicating their importance with respect to the current learning status; finally, the weighted samples are fed into the neural dialogue model to induce the model to learn from more effective training instances.
Note that, although we describe the framework in three components for ease of understanding, the whole framework can in fact be trained in an end-to-end manner. As a result, the data manipulation network is capable of not only manipulating training samples to optimize the dialogue generation model, but also learning to increase its manipulation skills through gradient descent with validation samples.
We first introduce the augmentation and weighting strategies for data manipulation in § 2.1 and § 2.2, and then describe how the neural dialogue generation model learns from the manipulated samples in § 2.3. Parameter estimation for the data manipulation model is elaborated in § 2.4.

Dialogue Augmentation
To induce the neural dialogue generation model to learn from more effective samples, we develop a gated data augmentation mechanism for the manipulation framework to selectively augment the learning samples.
Specifically, as shown in Figure 3, given a training sample, the manipulation framework first specifies whether to augment it or not through an instance filter, which can be implemented using a sigmoid gating function. Then, two levels of data augmentation are introduced, word-level contextual augmentation and sentence-level data augmentation, to augment the chosen sample accordingly.
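The instance filter can be pictured with a minimal sketch, assuming the gate scores a fixed-size sample representation with a single linear layer followed by a sigmoid. The function names, the toy weights, and the 0.5 decision threshold are illustrative assumptions and are not specified in the paper.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def gate_probability(features, weights, bias):
    """Probability that a sample is selected for augmentation (hypothetical linear gate)."""
    score = sum(f * w for f, w in zip(features, weights)) + bias
    return sigmoid(score)

def should_augment(features, weights, bias, threshold=0.5):
    """Hard decision derived from the gate; the threshold is an assumption."""
    return gate_probability(features, weights, bias) >= threshold

# Toy usage with made-up numbers.
sample_repr = [0.2, -0.1, 0.7]
gate_w = [1.0, 0.5, 2.0]
print(gate_probability(sample_repr, gate_w, bias=-0.5))
print(should_augment(sample_repr, gate_w, bias=-0.5))
```

In an end-to-end setting the hard threshold would typically be replaced by the soft gate value so that gradients can reach the gate parameters.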

Word-level Contextual Augmentation
As the name suggests, word-level augmentation enriches the training samples by substituting words in the original sample (Figure 3 (a)). Here, we employ a masked language model, BERT (Devlin et al., 2019), to implement word-level augmentation. Given an original sentence, the language model first randomly masks out a few words. BERT then takes in the masked sentence and predicts new words for the corresponding masked positions. A fixed pre-trained BERT may not generalize well for our data manipulation framework, because BERT is unaware of the dialogue learning status. To mitigate this defect, we further fine-tune BERT through backpropagation (more details in § 2.4). In particular, BERT is adapted to be differentiable by utilizing a Gumbel-softmax approximation (Jang et al., 2017) when predicting substitution words.
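To illustrate the Gumbel-softmax approximation mentioned above, the following is a minimal pure-Python sketch that turns a vector of vocabulary logits into a relaxed one-hot sample. The function name, the temperature default, and the toy logits are assumptions made for illustration; the paper does not give these details, and a real system would apply this to BERT's output logits inside an autodiff framework.

```python
import math
import random

def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Return a relaxed one-hot sample over the vocabulary (Jang et al., 2017)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)), with u ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed logits.
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

# Toy usage: three candidate substitution words with made-up logits.
soft_choice = gumbel_softmax([2.0, 0.5, -1.0], temperature=1.0, rng=random.Random(0))
print(soft_choice)
```

Lower temperatures push the output closer to a one-hot vector, while higher temperatures give smoother distributions.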

Sentence-level Data Augmentation
Word-level data augmentation is quite straightforward. However, this kind of rewriting is limited to only a few words. In human dialogues, there exist various synonymous conversations with different sentence structures. To further diversify the expressions in conversations, we introduce sentence-level data augmentation through back-translation as in Edunov et al. (2018); Yu et al. (2018). Similar to the fine-tuning strategy in word-level data augmentation, we also fine-tune the sentence-level data augmentation components to encourage the model to generate more effective samples for dialogue training. The gradients are backpropagated into the translation-based augmentation model, where a differentiable Gumbel-softmax is utilized when predicting sentences using the translation model.

Data Weighting
Given the original training samples and the augmented samples, to deal with the problem of noisy instances, the data manipulation model assigns an importance weight to each training sample with respect to the learning status. In particular, the sample importance weights are approximated through a softmax function over the scores of these instances. A multi-layer perceptron is employed to compute the example scores, taking distributional representations of these instances as input. Each sample is converted into its corresponding distributional representation through a transformer-based encoder.
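The weighting step can be sketched as follows, assuming each sample already has a fixed-size representation (standing in for the transformer encoder output) and that the scorer is a one-hidden-layer perceptron. The layer sizes, the tanh activation, and all parameter names are illustrative assumptions.

```python
import math

def mlp_score(rep, w1, b1, w2, b2):
    """One-hidden-layer perceptron mapping a sample representation to a scalar score."""
    hidden = [math.tanh(sum(r * w for r, w in zip(rep, row)) + b)
              for row, b in zip(w1, b1)]
    return sum(h * w for h, w in zip(hidden, w2)) + b2

def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def importance_weights(reps, w1, b1, w2, b2):
    """Softmax over per-sample MLP scores within one batch."""
    return softmax([mlp_score(r, w1, b1, w2, b2) for r in reps])

# Toy usage: three samples with 2-dimensional representations (made-up numbers).
reps = [[1.0, 1.0], [0.0, 0.0], [-1.0, -1.0]]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [1.0, 1.0], 0.0
print(importance_weights(reps, w1, b1, w2, b2))
```

Because the weights come from a softmax, they are positive and sum to one over the batch, so raising one sample's weight necessarily lowers the others.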

Dialogue Generation with Data Manipulation
Conventionally, a neural dialogue generation model is optimized with a vanilla negative log-likelihood loss using the training data $D$ of size $N$: $L_{vanilla} = \sum_{j=1}^{N} -\log p(y_j \mid x_j)$, where each sample is treated equally. In our framework, we assign each sample an importance weight and augment the original training set $D = \{(x_j, y_j)\}_{j=1}^{N}$ to $D' = \{(x_j, y_j)\}_{j=1}^{N'}$ with respect to the learning status. To perform the weighted optimization with the augmented training set $D'$, we utilize a weighted negative log-likelihood loss function:
$$L_{dm} = -\sum_{j=1}^{N'} w_j \log p(y_j \mid x_j), \qquad (1)$$
where $w_j$ is the instance weight produced by the data manipulation network.
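As a small numerical illustration of the difference between the two objectives, the sketch below computes the vanilla loss and a weighted loss taken here to be the sum over samples of w_j times -log p(y_j | x_j). The function names and the toy probabilities are assumptions; in practice p(y_j | x_j) would come from the dialogue model.

```python
import math

def vanilla_nll(probs):
    """Sum of -log p(y_j | x_j); every sample counts equally."""
    return -sum(math.log(p) for p in probs)

def weighted_nll(probs, weights):
    """Sum of -w_j * log p(y_j | x_j) with per-sample importance weights."""
    if len(probs) != len(weights):
        raise ValueError("probs and weights must have the same length")
    return -sum(w * math.log(p) for p, w in zip(probs, weights))

# Toy usage: the second sample is treated as noisy and gets a small weight.
probs = [0.5, 0.01, 0.25]
print(vanilla_nll(probs))
print(weighted_nll(probs, [0.6, 0.05, 0.35]))
```

With all weights equal to one, the weighted loss reduces to the vanilla loss; down-weighting a low-probability noisy sample shrinks its contribution to the gradient.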

Parameter Estimation for Data Manipulation
The data manipulation network not only manipulates training samples to optimize the dialogue learning process, but also learns to increase its manipulation skills through gradient descent with validation samples. We formulate this joint learning process following a novel policy learning paradigm (Hu et al., 2019; Tan et al., 2019), where the manipulation framework is formulated as a learnable data-dependent reward function $R_\phi(d = \{x, y\} \mid D)$, the dialogue model $p_\theta(y \mid x)$ is treated as a policy, the input $x$ as the "state", and the output $y$ as the "action". The reward function $R_\phi(d \mid D)$ is defined as:
$$R_\phi(d \mid D) = \begin{cases} w_i & \text{if } d \text{ is an augmented sample of } d_i^* \text{ or } d = d_i^*,\ d_i^* \in D \\ -\infty & \text{otherwise} \end{cases} \qquad (2)$$
where $\phi$ denotes the parameter of the data manipulation network and $w_i \in \mathbb{R}$ is the importance weight associated with the $i$-th data sample. In this formulation, a sample $d$ receives a real-valued reward when $d$ is an augmented sample, or when $d$ matches an instance in the original training set.
As depicted in Algorithm 1, the parameter $\theta$ of the neural dialogue model and the parameter $\phi$ of the data manipulation network are optimized alternately. Jointly optimizing the dialogue model and the manipulation network can be regarded as reward learning, where the policy $p_\theta(y \mid x)$ receives relatively higher rewards for effective samples and lower rewards for inefficient samples. The dialogue model parameter is updated by one gradient step on the weighted negative log-likelihood loss:
$$\theta' = \theta - \alpha \nabla_\theta L_{dm}(\theta, \phi), \qquad (3)$$
where $\nabla_\theta L_{dm}(\theta, \phi)$ is the gradient of the loss $L_{dm}$ with respect to $\theta$, and $\alpha$ is the step size. The parameter $\phi$ of the data manipulation network is learned by taking a meta gradient descent step on validation samples (Ren et al., 2018). Equation (3) shows that $\theta'$ depends on $\phi$. Therefore, the manipulation model (i.e., the reward function $R_\phi(d \mid D)$) can be optimized by directly backpropagating the gradient through $\theta'$ to $\phi$.
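The alternating update can be illustrated on a deliberately tiny problem. The sketch below is not the authors' implementation: it assumes a scalar regression model y = theta * x with squared loss in place of the dialogue model, one free score phi_j per training sample in place of the manipulation network, and a hand-derived meta-gradient through a single lookahead step. All names, data points, and step sizes are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def grad_sq_loss(theta, x, y):
    """Derivative of (theta * x - y) ** 2 with respect to theta."""
    return 2.0 * (theta * x - y) * x

def joint_step(theta, phi, train, valid, alpha=0.1, beta=0.1):
    """One alternating update of the sample scores phi and the model parameter theta."""
    w = softmax(phi)
    g = [grad_sq_loss(theta, x, y) for x, y in train]
    g_bar = sum(wj * gj for wj, gj in zip(w, g))
    # Lookahead step on the weighted training loss; theta_look depends on phi.
    theta_look = theta - alpha * g_bar
    # Gradient of the mean validation loss at the lookahead parameter.
    v = sum(grad_sq_loss(theta_look, x, y) for x, y in valid) / len(valid)
    # Chain rule through the softmax: d theta_look / d phi_k = -alpha * w_k * (g_k - g_bar).
    dtheta_dphi = [-alpha * wk * (gk - g_bar) for wk, gk in zip(w, g)]
    phi = [p - beta * v * d for p, d in zip(phi, dtheta_dphi)]
    # Update theta on the training loss using the refreshed weights.
    w = softmax(phi)
    theta = theta - alpha * sum(wj * gj for wj, gj in zip(w, g))
    return theta, phi

# Toy data: true slope 2; the third training pair is a noisy sample.
train = [(1.0, 2.0), (2.0, 4.0), (1.0, -3.0)]
valid = [(1.0, 2.0), (3.0, 6.0)]
theta, phi = 0.0, [0.0, 0.0, 0.0]
for _ in range(20):
    theta, phi = joint_step(theta, phi, train, valid)
print(theta, softmax(phi))
```

In this toy setting the validation gradient pushes down the score of the sample whose training gradient disagrees with the validation data, which is the behaviour the meta gradient step is meant to produce.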

Experiment Setup
Data. We conduct experiments on two English conversation datasets: (1) DailyDialog (Li et al., 2017), a collection of real-world dialogues widely used in open-domain dialogue generation. This is a multi-turn dataset, and we treat each turn as a training pair in this work. The overlapping pairs are removed from the dataset. (2) OpenSubtitles (Lison and Tiedemann, 2016), a group of human-human conversations converted from movie transcripts. 80,000 instances are sampled from the original corpus and the data proportion for the train/valid/test sets is set to 8/1/1, respectively. The dataset statistics are listed in Table 1.
Experimental Models. To ascertain the effectiveness and applicability of our method, we implement the proposed data manipulation framework on the following representative models: (i) SEQ2SEQ: an RNN-based sequence-to-sequence model with attention mechanisms (Bahdanau et al., 2015); (ii) CVAE: a latent variable model using a conditional variational auto-encoder, trained with KL-annealing and a BoW loss as in Zhao et al. (2017); (iii) Transformer: an encoder-decoder architecture relying solely on attention mechanisms (Vaswani et al., 2017).
Comparison Models. We also compare our approach with previous data augmentation or instance weighting methods: (i) CVAE-GAN (Li et al., 2019): a model that combines CVAE and GAN to augment the training data with more diversified expressions. (ii) Calibration (Shang et al., 2018): a calibration network that measures the quality of data samples and enables weighted training for dialogue generation. (iii) Clustering (Csáky et al., 2019): a method that clusters high-entropy samples as noise and filters them out.

Evaluation Metrics
We adopt several widely used metrics from existing works (Liu et al., 2016; Li et al., 2016a; Serban et al., 2017; Gu et al., 2019) to measure the performance of dialogue generation models, including BLEU, embedding-based metrics, entropy-based metrics and distinct metrics. In particular, BLEU measures how much a generated response contains n-gram overlaps with the reference. We compute BLEU scores for n<4 using smoothing techniques. Embedding-based metrics compute the cosine similarity of bag-of-words embeddings between the hypothesis and the reference. We employ the following three embedding metrics to assess the response quality: (1) Embedding Average (Avg): cosine similarity between two utterances, in which the sentence embedding is computed by taking the average word embedding weighted by the smooth inverse frequency of words as in Arora et al. (2017), $\mathrm{sent\_emb}(e) = \frac{1}{|e|} \sum_{\nu \in e} \frac{0.001}{0.001 + p(\nu)} \mathrm{emb}(\nu)$, where $\mathrm{emb}(\nu)$ and $p(\nu)$ are the embedding and the probability of word $\nu$, respectively. (2) Embedding Greedy (Gre): greedily matching words in two utterances based on the cosine similarities between their embeddings, and averaging the obtained scores. (3) Embedding Extrema (Ext): cosine similarity between the largest extreme values among the word embeddings in the two utterances. We use GloVe vectors as the word embeddings. Regarding entropy-based metrics, we compute the n-gram entropy $\text{Ent-}n = -\frac{1}{|r|} \sum_{\nu \in r} \log_2 p(\nu)$ of responses to measure their non-genericness, where the probabilities $p(\nu)$ of n-grams (n=1,2,3) are calculated by maximum likelihood estimation on the training data (Serban et al., 2017). Distinct measures the diversity of the generated responses. Dist-n is defined as the ratio of unique n-grams (n=1,2,3) over all n-grams in the generated responses. Following Gu et al. (2019), we also report Intra-{1,2,3} metrics, which are computed as the average of distinct values within each sampled response.
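The diversity and entropy metrics can be sketched in a few lines. This is a minimal illustration, assuming whitespace-tokenized responses, n-gram counts collected from the training data, and that n-grams unseen in training are skipped when computing Ent-n (the paper does not say how zero-probability n-grams are handled). The function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def distinct_n(responses, n):
    """Dist-n: unique n-grams divided by all n-grams over the generated responses."""
    all_grams = [g for r in responses for g in ngrams(r, n)]
    return len(set(all_grams)) / len(all_grams) if all_grams else 0.0

def ngram_entropy(response, n, train_counts):
    """Ent-n: average of -log2 p(nu) over the n-grams of one response."""
    total = sum(train_counts.values())
    # Assumption: n-grams unseen in the training data are ignored.
    seen = [g for g in ngrams(response, n) if train_counts.get(g, 0) > 0]
    if not seen:
        return 0.0
    return -sum(math.log2(train_counts[g] / total) for g in seen) / len(seen)

# Toy usage with a made-up training corpus and two generated responses.
train = [["i", "do", "not", "know"], ["i", "like", "tea"]]
unigram_counts = Counter(g for s in train for g in ngrams(s, 1))
generated = [["i", "like", "tea"], ["i", "do", "not", "know"]]
print(distinct_n(generated, 1), distinct_n(generated, 2))
print(ngram_entropy(generated[0], 1, unigram_counts))
```

Intra-n can be obtained from the same helper by calling `distinct_n` on each response separately and averaging the results.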

Implementation & Reproducibility
For word-level dialogue augmentation, we employ the pre-trained BERT-base language model with the uncased version of the tokenizer. We follow the hyper-parameters and settings suggested in Devlin et al. (2019). The replacement probability is set to 15%. For back-translation in sentence-level dialogue augmentation, we use the Transformer model (Vaswani et al., 2017) trained on the En-De and En-Ru WMT'19 news translation tasks (Ng et al., 2019). German and Russian sentences were tokenized with the Moses tokenizer (Koehn et al., 2007). The same hyper-parameters are used for the translation tasks, i.e., word representations of size 1024, dropout with 0.8 keep probability, feed-forward layers with dimension 4096, and 6 blocks in the encoder and decoder with 16 attention heads. Models are optimized with the Adam optimizer (Kingma and Ba, 2014) using an initial learning rate of 7e-4. Regarding the implementation of the dialogue models, we adopt a 2-layer bidirectional LSTM as the encoder and a unidirectional one as the decoder for both SEQ2SEQ and CVAE. The hidden size is set to 256, and the latent size used in CVAE is set to 64. The transformer model for dialogue generation is configured with a hidden size of 512, 8 attention heads and 6 blocks in both the encoder and decoder. The hyper-parameters of the baseline models are set following the original papers (Li et al., 2019; Shang et al., 2018; Csáky et al., 2019).
Table 3: Performance (%) of our approach instantiated on the naive SEQ2SEQ and the baseline approaches on (a) DailyDialog and (b) OpenSubtitles.

Evaluation Results
To investigate the effectiveness and general applicability of the proposed framework, we instantiate our data manipulation framework on several state-of-the-art models for dialogue generation. The automatic evaluation results of our proposed learning framework and the corresponding vanilla models are listed in Table 2. Compared with the vanilla training procedure, the proposed data manipulation framework brings solid improvements for all three architectures on almost all the evaluation metrics. These improvements are consistent across both conversation datasets, affirming the superiority and general applicability of our proposed framework.
We further compare our model with existing related methods. Not surprisingly, as shown in Table 3, our data manipulation framework outperforms the baseline methods on most of the metrics. In particular, the improvement of our model on the Distinct metrics is much greater, which implies that data manipulation effectively induces the neural dialogue model to generate more diverse responses.

Human Evaluation
We use DailyDialog as the evaluation corpus since it is closer to everyday conversation and easier for annotators to judge. Three graduate students are recruited to conduct manual evaluations. 100 test messages are randomly sampled, and the corresponding responses generated by our model and the comparison model are presented to each annotator, who has no knowledge of which system each response is from. The annotators are required to compare the quality of the two responses (response_1, response_2) and choose among "win" (response_1 is better), "loss" (response_2 is better) and "tie" (they are equally good or bad), considering the following criteria: coherence, logical consistency, fluency and diversity. Note that cases with different rating results are counted as "tie". Table 4 summarizes the human evaluation results. The kappa scores indicate that the annotators came to a fair agreement in the judgement. Compared with the baseline methods, our data manipulation approach brings about more informative and coherent replies.
[Table fragment; column headers and the first row label were not recovered]
1456 5.4386 11.1140 86.399 92.293 94.825 6.8752 10.579 11.837 0.2002 64.937 46.540 67.541
w/o instance filter 1.8627 8.2850 15.9400 88.551 93.445 94.419 7.1440 11.305 12.823 0.2813 65.606 46.912 67.863
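The aggregation rule for the pairwise judgements can be sketched as below, assuming each test message carries one label per annotator and that any disagreement among annotators is counted as a "tie", as described above. The function names and the toy labels are hypothetical, and the sketch does not compute the kappa agreement score.

```python
from collections import Counter

def aggregate(labels):
    """Return the shared label if all annotators agree, otherwise "tie"."""
    return labels[0] if len(set(labels)) == 1 else "tie"

def summarize(all_labels):
    """Fraction of "win", "loss" and "tie" outcomes over all evaluated messages."""
    counts = Counter(aggregate(labels) for labels in all_labels)
    n = len(all_labels)
    return {k: counts.get(k, 0) / n for k in ("win", "loss", "tie")}

# Toy usage: four messages, three annotators each (made-up judgements).
judgements = [
    ["win", "win", "win"],
    ["win", "loss", "win"],
    ["loss", "loss", "loss"],
    ["tie", "tie", "tie"],
]
print(summarize(judgements))
```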

Model Analysis
Learning Efficiency. Figure 4 presents validation results over training iterations when training the SEQ2SEQ model on DailyDialog. We observe that when training SEQ2SEQ using our framework, the initial learning speed is a bit slower than with standard vanilla training. However, our framework surpasses vanilla training in the final stage. One reason is that, at the early stage, the data manipulation model takes some time to improve its manipulation skills, which may slow down the learning of the neural dialogue model. Once the manipulation skills are effective enough, the neural dialogue model may benefit from learning from more effective samples instead of inefficient instances, and achieves better performance.
Examples with Different Augmentation Frequencies. The data manipulation model selectively chooses samples for data augmentation. To gain further insight into which samples are favored by the augmentation model, we list examples with different augmentation frequencies in Figure 5. We notice that samples frequently augmented by the manipulation model are more reliable than those seldom augmented. Therefore, the dialogue model is able to learn from those effective instances and their synonymous variants.
Word-level vs. Sentence-level Augmentation. In our framework, we implement two kinds of augmentation mechanisms. Word-level augmentation enriches the given samples by substituting words, while sentence-level augmentation paraphrases the original samples through back-translation. We evaluate their performance and report the results in Table 5. Both augmentation mechanisms improve the performance over the vanilla SEQ2SEQ baseline, while sentence-level augmentation performs slightly better than word-level augmentation on most evaluation metrics. One possible reason is that sentence-level augmentation captures more paraphrasing phenomena.

Related Work
Existing approaches to improving neural dialogue generation models mainly target building more powerful learning systems, using extra information such as conversation topics (Xing et al., 2017), persona profiles (Song et al., 2019), user emotions (Zhou et al., 2018), or out-sourcing knowledge (Liu et al., 2018). Another popular framework for dialogue generation is the variational autoencoder (Kingma and Welling, 2014; Zhao et al., 2017; Shen et al., 2017), in which a latent variable is introduced to benefit the dialogue model with more diverse response generation. In contrast to previous research, we investigate improving the dialogue model from a different angle, i.e., adapting the training examples using data manipulation techniques.
Data augmentation is an effective way to improve the performance of neural models. To name a few, Kurata et al. (2016) propose to generate more utterances by introducing noise to the decoding process. Kobayashi (2018); Wu et al. (2019) demonstrate that contextual augmentation using label-conditional language models helps to improve neural network classifiers on text classification tasks. Sennrich et al. (2016b) boost neural machine translation models using back-translation. Xie et al. (2017); Andreas (2019) design manually specified strategies for data augmentation. Hou et al. (2018) utilize a sequence-to-sequence model to produce diverse utterances for language understanding. Li et al. (2019); Niu and Bansal (2019) propose to generate sentences for dialogue augmentation. Compared with previous augmentation approaches for dialogue generation, augmented sentences in our framework are selectively generated using pre-trained models, and the augmentation process is additionally fine-tuned jointly with the training of dialogue generation.
Regarding data weighting, past methods (Jiang and Zhai, 2007; Rebbapragada and Brodley, 2007; Wang et al., 2017; Ren et al., 2018; Hu et al., 2019) have been proposed to manage the problem of training set biases or label noise. Lison and Bibauw (2017) propose to enhance a retrieval-based dialogue system with a weighting model. Shang et al. (2018) likewise design a matching network to calibrate dialogue model training through instance weighting. In contrast, our proposed framework learns to reweight not only the original training examples but also the augmented examples. Another difference is that we directly derive data weights based on their gradient directions on a validation set, instead of separately training an external weighting model. Csáky et al. (2019) claim that high-entropy utterances in the training set lead to boring generated responses and thus propose to ameliorate this issue by simply removing training instances with high entropy. Although data filtering is a straightforward approach to alleviating the problem of noisy data, the informative training samples remain untouched and insufficient. Our method, by contrast, holds the promise of generating more valid training data while alleviating the negative effect of noise at the same time.
Note that either data augmentation or instance reweighting alone can be considered a band-aid solution: simply augmenting all training data risks introducing more noisy conversations, as such low-quality examples prevail in human-generated dialogues, whilst adapting the sample effect merely by instance reweighting is also suboptimal, since effective training samples remain insufficient. The proposed learning-to-manipulate framework organically integrates these two schemes, which collectively fulfill the entire goal.

Conclusion
In this work, we consider automated data manipulation for open-domain dialogue systems. To induce the model to learn from effective instances, we propose a learnable data manipulation model to augment effective training samples and reduce the weights of inefficient samples. The resulting data manipulation model is fully end-to-end and can be trained jointly with the dialogue generation model. Experiments conducted on two public conversation datasets show that our proposed framework is able to boost the performance of existing dialogue systems.
Our learning-to-manipulate framework for neural dialogue generation is not limited to the elaborately designed manipulation skills in this paper. Future work will investigate other data manipulation techniques (e.g., data synthesis), which can be further integrated to improve the performance.
Figure 1: Data manipulation helps the dialogue model training by augmenting and highlighting effective learning samples as well as reducing the weights of inefficient samples.

Figure 2: Overview of the proposed automated data manipulation framework for neural dialogue generation. At training step t, the data manipulation model augments and weights the training samples for the dialogue model learning.

Figure 3: Illustration of the data manipulation model. During training, it takes the original batch samples as input, and generates the augmented data samples as well as the importance weights for dialogue model training.
Sentence-level augmentation follows the back-translation approach of Edunov et al. (2018) and Yu et al. (2018), which trains two translation models: one translation model from the source language to the target language and another backward translation model from the target to the source, as shown in Figure 3 (b). By transforming the expression styles across different languages, the augmented training samples are expected to convey similar information with different expressions.

Algorithm 1: Joint Learning of Dialogue Model and Data Manipulation Network
Input: The dialogue model θ, data manipulation network φ, training set D and validation set D^v
1: Initialize dialogue model parameter θ and data manipulation model parameter φ
2: repeat
3:   Optimize θ on D enriched with data manipulation.
4:   Optimize φ by maximizing data log-likelihood on D^v.
5: until convergence
Output: Learned dialogue model θ* and data manipulation model φ*

Under this scheme, inefficient samples receive lower rewards. More concretely, to optimize the neural dialogue model, at each iteration, mini-batch instances are sampled from the training set and are then enriched through augmentation and weighting. The parameter θ of the neural dialogue model is then updated with the weighted negative log-likelihood loss function in Eq. (1).

Figure 4: Comparison of training with data manipulation and vanilla training using SEQ2SEQ on the validation set of DailyDialog. Dist-1, Embedding Extrema and Ent-3 are denoted as "Distinct", "Embedding" and "Entropy", respectively.

Table 1: Data statistics of the experiment corpora.

Table 4: The results of human evaluation on the test set of DailyDialog.

Table 6: Examples with different augmentation frequencies. Instances with higher augmentation frequencies are more effective than those seldom augmented examples.