Multi-turn Dialogue Response Generation in an Adversarial Learning Framework

We propose an adversarial learning approach to the generation of multi-turn dialogue responses. Our proposed framework, hredGAN, is based on conditional generative adversarial networks (GANs). The GAN's generator is a modified hierarchical recurrent encoder-decoder network (HRED) and the discriminator is a word-level bidirectional RNN that shares context and word embeddings with the generator. During inference, noise samples conditioned on the dialogue history are used to perturb the generator's latent space to generate several possible responses. The final response is the one ranked best by the discriminator. The hredGAN shows major advantages over existing methods: (1) it generalizes better than networks trained using only the log-likelihood criterion, and (2) it generates longer, more informative and more diverse responses with high utterance and topic relevance even with limited training data. This superiority is demonstrated on the Movie Triples and Ubuntu Dialogue datasets in terms of perplexity, BLEU, ROUGE and Distinct n-gram scores.


Introduction
Recent advances in deep neural network architectures have enabled tremendous success on a number of difficult machine learning problems. While these results are impressive, producing a deployable neural network-based model that can engage in open-domain conversation remains elusive. A dialogue system needs to generate meaningful and diverse responses that are simultaneously coherent with the input utterance and the overall dialogue topic. Unfortunately, earlier conversation models trained on naturalistic dialogue data suffered greatly from limited contextual information (Sutskever et al., 2014) and lack of diversity (Li et al., 2016a).
These problems often lead to generic and safe responses to a variety of input utterances. Serban et al. (2016) and Xing et al. (2017) proposed the Hierarchical Recurrent Encoder-Decoder (HRED) network, which captures long temporal dependencies in multi-turn conversations to address the limited contextual information, but the diversity problem remained. Some HRED variants, such as the variational (Serban et al., 2017b) and multi-resolution (Serban et al., 2017a) HREDs, attempt to alleviate the diversity problem by injecting noise at the utterance level and by extracting additional context on which to condition the generator. While these approaches achieve a certain measure of success over the basic HRED, the generated responses are still mostly generic because they do not control the generator's output: the output conditional distribution is not calibrated. Li et al. (2016a), on the other hand, consider a diversity-promoting training objective, but their model is for single-turn conversations and cannot be trained end-to-end.
The generative adversarial network (GAN) (Goodfellow et al., 2014) seems to be an appropriate solution to the diversity problem. GAN matches data from two different distributions by introducing an adversarial game between a generator and a discriminator. We explore hredGAN: conditional GANs for multi-turn dialogue models with an HRED generator and discriminator. hredGAN combines ideas from both generative and retrieval-based multi-turn dialogue systems to improve their individual performances. This is achieved by sharing the context and word embeddings between the generator and the discriminator allowing for joint end-to-end training using back-propagation. To the best of our knowledge, no existing work has applied conditional GANs to multi-turn dialogue models and especially not with HRED generators and discriminators. We demonstrate the effectiveness of hredGAN over the VHRED for dialogue modeling with evaluations on the Movie triples and Ubuntu technical support datasets.

Related Work
Our work is related to end-to-end neural network-based open-domain dialogue models.
Most neural dialogue models use transduction frameworks adapted from neural machine translation (Sutskever et al., 2014; Bahdanau et al., 2015). These Seq2Seq networks are trained end-to-end with MLE criteria on large corpora of human-to-human conversation data. Others use a GAN's discriminator as a reward function in a reinforcement learning framework (Yu et al., 2017) or in conjunction with MLE (Che et al., 2017). Zhang et al. (2017) explored the idea of a GAN with a feature-matching criterion. Xu et al. (2017) and Zhang et al. (2018) employed a GAN with an approximate embedding layer and with adversarial information maximization, respectively, to improve Seq2Seq's diversity performance.
Still, Seq2Seq models are limited in their ability to capture long temporal dependencies in multi-turn conversation. Although Li et al. (2016b) attempted to optimize a pair of Seq2Seq models for multi-turn dialogue, their multi-turn objective is only applied at inference and is not used for actual model training. Hence the introduction of HRED models (Serban et al., 2016, 2017a; Xing et al., 2017) for modeling dialogue responses in multi-turn conversations. However, these HRED models suffer from a lack of diversity since they are trained only with MLE criteria. On the other hand, adversarial systems have been used for evaluating open-domain dialogue models (Bruni and Fernández, 2018; Kannan and Vinyals, 2017). Our work, hredGAN, is closest to the combination of HRED generation models and adversarial evaluation (Kannan and Vinyals, 2017).

Adversarial Learning of Dialogue Response
Consider a dialogue consisting of a sequence of $N$ utterances, $x = (x_1, x_2, \cdots, x_N)$, where each utterance $x_i = (x_i^1, x_i^2, \cdots, x_i^{M_i})$ is a variable-length sequence of $M_i$ word tokens such that $x_i^j \in V$ for vocabulary $V$. At any time step $i$, the dialogue history is given by $\mathbf{x}_i = (x_1, x_2, \cdots, x_i)$. The dialogue response generation task can be defined as follows: given a dialogue history $\mathbf{x}_i$, generate a response $y_i = (y_i^1, y_i^2, \cdots, y_i^{T_i})$, where $T_i$ is the number of generated tokens. We also want the distribution of the generated response $P(y_i)$ to be indistinguishable from that of the ground truth $P(x_{i+1})$, with $T_i = M_{i+1}$. A conditional GAN learns a mapping from an observed dialogue history $\mathbf{x}_i$ and a sequence of random noise vectors $z_i$ to a sequence of output tokens $y_i$, $G : \{\mathbf{x}_i, z_i\} \rightarrow y_i$. The generator $G$ is trained to produce output sequences that cannot be distinguished from the ground truth by an adversarially trained discriminator $D$, which in turn is trained to detect the generator's fakes. The distribution of the generator output sequence can be factored by the product rule:

$$P_{\theta_G}(y_i \mid \mathbf{x}_i) = P_{\theta_G}(y_i^1 \mid \mathbf{x}_i) \prod_{j=2}^{T_i} P_{\theta_G}(y_i^j \mid y_i^{1:j-1}, \mathbf{x}_i), \quad (1)$$

$$y_i^j \sim P_{\theta_G}(y_i^j \mid y_i^{1:j-1}, \mathbf{x}_i), \quad (2)$$

where $y_i^{1:j-1} = (y_i^1, \cdots, y_i^{j-1})$ and $\theta_G$ are the parameters of the generator model. $P_{\theta_G}(y_i^j \mid y_i^{1:j-1}, \mathbf{x}_i)$ is an autoregressive generative model in which the probability of the current token depends on the previously generated tokens. Training the generator $G$ with the log-likelihood criterion on (2) is unstable in practice, so the past generated sequence is substituted with the ground truth, a method known as teacher forcing (Williams and Zipser, 1989), i.e.,

$$y_i^j \sim P_{\theta_G}(y_i^j \mid x_{i+1}^{1:j-1}, \mathbf{x}_i, z_i). \quad (3)$$

Using (3) in relation to the GAN, we define our fake sample as the teacher-forcing output $y_i^j$ with some input noise $z_i$, and the corresponding real sample as the ground truth $x_{i+1}^j$.
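To make the teacher-forcing idea concrete, the sketch below scores a response under a toy bigram table. The probabilities are invented for illustration, and a real hredGAN generator also conditions on the full dialogue history and a noise sample; the only point is that the sequence log-probability is a sum of per-token conditionals, with the ground-truth prefix fed in at each step instead of the model's own samples.

```python
import math

# Toy conditional model: P(next_token | prev_token). The probabilities are
# invented; a real generator conditions on the dialogue history and noise too.
COND = {
    ("<s>", "hello"): 0.6, ("<s>", "hi"): 0.4,
    ("hello", "there"): 0.7, ("hello", "world"): 0.3,
    ("hi", "there"): 0.9, ("hi", "world"): 0.1,
}

def teacher_forced_log_prob(tokens):
    """Sum of per-token log-conditionals: the conditioning prefix is
    always the ground truth, never a sampled token."""
    log_p = 0.0
    prev = "<s>"
    for tok in tokens:
        log_p += math.log(COND[(prev, tok)])
        prev = tok  # ground-truth token, not a model sample
    return log_p

print(teacher_forced_log_prob(["hello", "there"]))  # log(0.6) + log(0.7)
```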
With the GAN objective, we can match the noise distribution, P (z i ), to the distribution of the ground truth response, P (x i+1 |x i ). Varying the noise input then allows us to generate diverse responses to the same dialogue history. Furthermore, the discriminator, since it is calibrated, is used during inference to rank the generated responses, providing a means of controlling the generator output.

Objectives
The objective of a conditional GAN can be expressed as

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{\mathbf{x}_i, x_{i+1}}[\log D(\mathbf{x}_i, x_{i+1})] + \mathbb{E}_{\mathbf{x}_i, z_i}[\log(1 - D(\mathbf{x}_i, G(\mathbf{x}_i, z_i)))], \quad (4)$$

where $G$ tries to minimize this objective against an adversarial $D$ that tries to maximize it:

$$G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D). \quad (5)$$

Previous approaches have shown that it is beneficial to mix the GAN objective with a more traditional loss such as cross-entropy (Lamb et al., 2016). The discriminator's job remains unchanged, but the generator is tasked not only with fooling the discriminator but also with staying near the ground truth $x_{i+1}$ in the cross-entropy sense:

$$\mathcal{L}_{MLE}(G) = \mathbb{E}_{\mathbf{x}_i, x_{i+1}}[-\log P_{\theta_G}(x_{i+1} \mid \mathbf{x}_i, z_i)] \quad (6)$$
$$= \mathbb{E}_{\mathbf{x}_i, x_{i+1}}\Big[-\sum_{j=1}^{M_{i+1}} \log P_{\theta_G}(x_{i+1}^j \mid x_{i+1}^{1:j-1}, \mathbf{x}_i, z_i)\Big]. \quad (7)$$

Our final objective is

$$G^* = \arg\min_G \max_D \big(\lambda_G \mathcal{L}_{cGAN}(G, D) + \lambda_M \mathcal{L}_{MLE}(G)\big). \quad (8)$$

It is worth noting that, without $z_i$, the network could still learn a mapping from $\mathbf{x}_i$ to $y_i$, but it would produce deterministic outputs and fail to match any distribution other than a delta function (Isola et al., 2017). This is one key area where our work differs from those of Lamb et al. and Li et al. The schematic of the proposed hredGAN is depicted on the right-hand side of Figure 1.
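A minimal numeric sketch of the mixed objective (Eq. (8)): the generator loss is a weighted sum of an adversarial term (fooling the discriminator) and a teacher-forced cross-entropy term. The scalar scores below stand in for real network outputs, so this illustrates the loss arithmetic, not the authors' implementation.

```python
import math

def generator_loss(d_fake, mle_nll, lambda_g=1.0, lambda_m=1.0):
    """Weighted sum of the adversarial and MLE criteria.
    d_fake:  discriminator probability that the generated response is real
    mle_nll: teacher-forced negative log-likelihood of the ground truth
    """
    adversarial = -math.log(d_fake)  # generator wants D(fake) -> 1
    return lambda_g * adversarial + lambda_m * mle_nll

# When the discriminator is fooled (d_fake near 1), the adversarial term
# vanishes and the MLE term dominates.
print(generator_loss(d_fake=0.9, mle_nll=2.0))
```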

Generator
We adopted an HRED dialogue generator similar to those of Serban et al. (2016, 2017a) and Xing et al. (2017). The HRED contains three recurrent structures: the encoder ($eRNN$), context ($cRNN$), and decoder ($dRNN$) RNNs. The conditional probability modeled by the HRED per output word token is given by

$$P_{\theta_G}(y_i^j \mid y_i^{1:j-1}, \mathbf{x}_i) = dRNN(E(y_i^{j-1}), h_i^{j-1}, h_i), \quad (9)$$

where $E(\cdot)$ is the embedding lookup, $h_i = cRNN(eRNN(E(x_i)), h_{i-1})$, $eRNN(\cdot)$ maps a sequence of input symbols into a fixed-length vector, and $h_i^j$ and $h_i$ are the hidden states of the decoder and context RNNs, respectively. In the multi-resolution HRED (Serban et al., 2017a), high-level tokens are extracted and processed by another RNN to improve performance. We circumvent the need for this extra processing by allowing the decoder to attend to different parts of the input utterance during response generation (Bahdanau et al., 2015; Luong et al., 2015). We introduce a local attention into (9) and encode the attention memory differently from the context through an attention encoder RNN ($aRNN$), yielding

$$P_{\theta_G}(y_i^j \mid y_i^{1:j-1}, \mathbf{x}_i) = dRNN(E(y_i^{j-1}), h_i^{j-1}, c_i^j, h_i), \quad (10)$$

where $c_i^j = \sum_{m=1}^{M_i} \alpha_m \hat{h}_i^m$ is the attention context, $\hat{h}_i^m$ is the hidden state of the attention RNN, and $\alpha_m$ is either a logit projection of $(h_i^{j-1}, \hat{h}_i^m)$, in the case of Bahdanau et al. (2015), or a dot product, in the case of Luong et al. (2015). The modified HRED architecture is shown in Figure 2.
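The two attention variants referenced above differ only in how the pre-softmax score is computed. Below is a small NumPy sketch of both scoring rules, with randomly generated stand-ins for the decoder state and the aRNN memory; the weights W, U, v are toy placeholders, not trained parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def luong_scores(dec_state, memory):
    """Dot-product scores (Luong et al., 2015): one logit per memory slot."""
    return softmax(memory @ dec_state)

def bahdanau_scores(dec_state, memory, W, U, v):
    """Additive scores (Bahdanau et al., 2015): a learned logit projection
    of (decoder state, memory slot). W, U, v are toy stand-in weights."""
    logits = np.tanh(memory @ U + dec_state @ W) @ v
    return softmax(logits)

rng = np.random.default_rng(0)
dec = rng.normal(size=4)       # decoder hidden state h^{j-1}
mem = rng.normal(size=(5, 4))  # aRNN states, one per source word
alpha = luong_scores(dec, mem)
print(alpha)                   # attention weights over the 5 source words
```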
Noise Injection: We inject Gaussian noise at the input of the decoder RNN. Noise samples can be injected at the utterance or the word level. With noise injection, the conditional probability of the decoder output becomes

$$P_{\theta_G}(y_i^j \mid y_i^{1:j-1}, \mathbf{x}_i) = dRNN(E(y_i^{j-1}), h_i^{j-1}, z_i^j, c_i^j, h_i), \quad (11)$$

where $z_i^j \sim \mathcal{N}_i(0, I)$ for utterance-level noise (a single sample shared across the utterance) and $z_i^j \sim \mathcal{N}_i^j(0, I)$ for word-level noise (an independent sample per word).
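The difference between the two injection schemes can be sketched directly: utterance-level noise draws one vector per response and reuses it at every decoding step, while word-level noise redraws it per token. Shapes here are illustrative, not the model's actual dimensions.

```python
import numpy as np

def sample_noise(num_words, dim, level, rng):
    """Return one noise vector per decoding step.
    'utterance': a single z_i broadcast to all words.
    'word':      an independent z_i^j for each word."""
    if level == "utterance":
        z = rng.standard_normal(dim)
        return np.tile(z, (num_words, 1))  # same vector repeated per step
    elif level == "word":
        return rng.standard_normal((num_words, dim))
    raise ValueError(level)

rng = np.random.default_rng(0)
z_utt = sample_noise(6, 8, "utterance", rng)
z_word = sample_noise(6, 8, "word", rng)
print(z_utt.shape, z_word.shape)  # both (6, 8)
```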

Discriminator
The discriminator shares context and word embeddings with the generator and discriminates at the word level (Lamb et al., 2016). Word-level discrimination is achieved through a bidirectional RNN and is able to capture both syntactic and conceptual differences between the generator output and the ground truth. The aggregate classification of an input sequence $\chi$ can be factored over word-level discrimination and expressed as

$$D(\mathbf{x}_i, \chi) = \frac{1}{J} \sum_{j=1}^{J} D_{RNN}(h_i, E(\chi^j)), \quad (12)$$

where $D_{RNN}(\cdot)$ is the word discriminator RNN, $h_i$ is an encoded vector of the dialogue history $\mathbf{x}_i$ obtained from the generator's $cRNN(\cdot)$ output, and $\chi^j$ is the $j$th word or token of the input sequence $\chi$. For the generator's decoder output, $\chi = y_i$ and $J = T_i$; for the ground truth, $\chi = x_{i+1}$ and $J = M_{i+1}$. The discriminator architecture is depicted on the left-hand side of Figure 1.
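A toy sketch of the factored word-level discrimination described above: per-word real/fake probabilities (written by hand here instead of being produced by a bidirectional RNN) are averaged into a single utterance-level score.

```python
import numpy as np

def aggregate_word_scores(word_probs):
    """Average per-word discriminator outputs into one utterance score."""
    return np.asarray(word_probs, dtype=float).mean()

# Stand-in per-word probabilities that each token is from the ground truth.
fake_response = [0.2, 0.3, 0.1, 0.4]    # mostly judged fake
real_response = [0.9, 0.8, 0.95, 0.85]  # mostly judged real
print(aggregate_word_scores(fake_response), aggregate_word_scores(real_response))
```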

Adversarial Generation of Multi-turn Dialogue Response
In this section, we describe the generation process during inference. The generation objective can be mathematically described as

$$y_i^* = \arg\max_{\{y_{i,l}\}_{l=1}^{L}} \big(\lambda_M \log P_{\theta_G}(y_{i,l} \mid \mathbf{x}_i, z_{i,l}) + \lambda_G \log D^*(\mathbf{x}_i, y_{i,l})\big), \quad (13)$$

where $y_{i,l} = G^*(\mathbf{x}_i, z_{i,l})$, $z_{i,l}$ is the $l$th noise sample at dialogue step $i$, and $L$ is the number of response samples. Equation (13) shows that our inference objective is the same as the training objective (8), combining both the MLE and adversarial criteria. This is in contrast to existing work, where the discriminator is usually discarded during inference.
The inference described by (13) is intractable due to the enormous search space over $y_{i,l}$. We therefore turn to an approximate solution in which we use greedy decoding (MLE) on the first part of the objective function to generate $L$ response candidates from noise samples $\{z_{i,l}\}_{l=1}^{L}$. To facilitate exploration of the generator's latent space, we sample from a modified noise distribution, $z_{i,l}^j \sim \mathcal{N}_{i,l}(0, \alpha I)$ or $z_{i,l}^j \sim \mathcal{N}_{i,l}^j(0, \alpha I)$,

Algorithm 1 Adversarial Learning of hredGAN
Require: A generator G with parameters θ_G.
Require: A discriminator D with parameters θ_D.
for number of training iterations do
    Initialize cRNN to the zero state h_0.
    Sample a mini-batch of conversations, x = {x_i}, i = 1..N, with N utterances; utterance mini-batch i contains M_i word tokens.
    for i = 1 to N − 1 do
        Sample a corresponding mini-batch of responses y_i ∼ P_θ_G(y_i | z_i, x_i).
    end for
    Compute the discriminator accuracy D_acc over the N − 1 utterances.
    if D_acc < acc_D_th then update θ_D.
    if D_acc < acc_G_th then update θ_G with the MLE loss only,
    else update θ_G with both adversarial and MLE losses.
end for
where $\alpha > 1.0$ is an exploration factor that increases the noise variance. We then rank the $L$ candidate responses using the discriminator score,

$$y_i^* = \arg\max_{l \in \{1, \cdots, L\}} D^*(\mathbf{x}_i, y_{i,l}). \quad (14)$$

The response with the highest discriminator ranking is taken as the optimum response for the dialogue context.
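Putting the inference procedure together: draw L noise samples with variance scaled by the exploration factor α, decode one candidate per sample, and keep the candidate the discriminator scores highest. The decoder and discriminator below are deliberately trivial stand-ins for the real networks.

```python
import numpy as np

def generate_and_rank(decode, discriminate, num_samples, noise_dim, alpha, rng):
    """Sample-and-rank inference: L candidates from scaled noise N(0, alpha*I);
    the final answer is the discriminator's top-ranked candidate."""
    candidates = []
    for _ in range(num_samples):
        z = rng.standard_normal(noise_dim) * np.sqrt(alpha)  # exploration
        candidates.append(decode(z))
    scores = [discriminate(c) for c in candidates]
    return candidates[int(np.argmax(scores))]

# Stand-ins: 'decode' maps noise to a string, 'discriminate' prefers longer text.
rng = np.random.default_rng(1)
decode = lambda z: "ok" if z[0] < 0 else "that sounds like a plan"
discriminate = lambda text: len(text.split())
print(generate_and_rank(decode, discriminate, 8, 4, alpha=7.0, rng=rng))
```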

Training of hredGAN
We train the generator and the discriminator simultaneously, as highlighted in Algorithm 1, with λ_G = λ_M = 1. GAN training is prone to instability due to the competition between the generator and the discriminator; therefore, parameter updates are conditioned on the discriminator's performance (Lamb et al., 2016).
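The gating just described can be sketched as pure decision logic: the discriminator only trains while it is below a target accuracy, and the generator only receives the adversarial signal once the discriminator is accurate enough to be informative. The thresholds follow the values quoted in the training details (0.99 and 0.75), but the function itself is an illustrative reconstruction, not the authors' code.

```python
def update_decisions(d_acc, acc_d_th=0.99, acc_g_th=0.75):
    """Decide which parameter updates to apply this iteration, conditioned
    on the discriminator's current accuracy (cf. Lamb et al., 2016)."""
    update_d = d_acc < acc_d_th          # stop training an already-strong D
    use_adversarial = d_acc >= acc_g_th  # trust D's signal only when accurate
    g_losses = ("adversarial", "mle") if use_adversarial else ("mle",)
    return update_d, g_losses

print(update_decisions(0.60))   # weak D: train D, generator uses MLE only
print(update_decisions(0.85))   # good D: train both, G gets adversarial signal
print(update_decisions(0.999))  # saturated D: freeze D
```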
The generator consists of four RNNs with different parameters: aRNN, eRNN, cRNN, and dRNN. aRNN and eRNN are both bidirectional, while cRNN and dRNN are unidirectional. Each RNN has 3 layers with a hidden state size of 512. The dRNN and aRNN are connected using an additive attention mechanism (Bahdanau et al., 2015).
The discriminator shares aRNN, eRNN, and cRNN with the generator. D_RNN is a stacked bidirectional RNN with 3 layers and a hidden state size of 512. The cRNN states are used to initialize the states of D_RNN. The outputs of the forward and backward cells for each word are concatenated and passed to a fully-connected layer with binary output. The output is the probability that the word is from the ground truth given the past and future words of the sequence.
Others: All RNNs use gated recurrent unit (GRU) cells (Cho et al., 2014). The word embedding size is 512 and is shared between the generator and the discriminator. The initial learning rate is 0.5 with a decay factor of 0.99, applied when the adversarial loss has increased over two iterations. We use a batch size of 64 and clip gradients at 5.0. As in Lamb et al. (2016), we find acc_D_th = 0.99 and acc_G_th = 0.75 to suffice. All parameters are initialized with Xavier uniform random initialization (Glorot and Bengio, 2010). The vocabulary size V is 50,000. Due to the large vocabulary size, we use a sampled softmax loss (Jean et al., 2015) for the MLE loss to expedite training; however, we use the full softmax for evaluation. The model is trained end-to-end using stochastic gradient descent.
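The learning-rate schedule ("decay by 0.99 when the adversarial loss has increased over two iterations") can be sketched as a small stateful rule. This is one plausible reading of the schedule, not the authors' exact implementation.

```python
def decayed_lr(loss_history, lr0=0.5, decay=0.99):
    """Multiply the learning rate by `decay` each time the adversarial loss
    has increased for two consecutive iterations."""
    lr = lr0
    for i in range(2, len(loss_history)):
        if loss_history[i] > loss_history[i - 1] > loss_history[i - 2]:
            lr *= decay
    return lr

# The loss rises for two consecutive iterations exactly once -> one decay step.
print(decayed_lr([3.0, 2.5, 2.6, 2.7, 2.4]))  # 0.5 * 0.99 = 0.495
```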

Experiments and Results
We consider the task of generating dialogue responses conditioned on the dialogue history and the current input utterance. We compare the proposed hredGAN model against some alternatives on publicly available datasets.

Datasets
Movie Triples Corpus (MTC) dataset (Serban et al., 2016). This dataset was derived from the Movie-DiC dataset of Banchs (2012). Although it spans a wide range of topics with few spelling mistakes, its small size of only about 240,000 dialogue triples makes it difficult to train a dialogue model, as pointed out by Serban et al. (2016). We expect this scenario to particularly benefit from the proposed adversarial generation.
Ubuntu Dialogue Corpus (UDC) dataset (Serban et al., 2017b). This dataset was extracted from the Ubuntu Relay Chat Channel. Although the topics in the dataset are not as diverse as in the MTC, the dataset is very large, containing about 1.85 million conversations with an average of 5 utterances per conversation.
We split both MTC and UDC into training, validation, and test sets, using 90%, 5%, and 5% proportions, respectively. We performed minimal preprocessing of the datasets, replacing all words outside the 50,000 most frequent with an UNK symbol.
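The preprocessing step amounts to frequency-ranked vocabulary truncation. A sketch, with a tiny vocabulary cap standing in for 50,000:

```python
from collections import Counter

def build_vocab(tokens, max_size):
    """Keep the `max_size` most frequent tokens; everything else maps to UNK."""
    return {w for w, _ in Counter(tokens).most_common(max_size)}

def apply_unk(tokens, vocab):
    return [t if t in vocab else "<unk>" for t in tokens]

corpus = "the the the cat cat sat on".split()
vocab = build_vocab(corpus, max_size=2)  # cap of 2 instead of 50,000
print(apply_unk("the dog sat".split(), vocab))  # ['the', '<unk>', '<unk>']
```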

Evaluation Metrics
Accurate evaluation of dialogue models is still an open challenge. In this paper, we employ both automatic and human evaluations.

Automatic Evaluation
We employed some of the automatic evaluation metrics that are used in probabilistic language and dialogue models, and statistical machine translation. Although these metrics may not correlate well with human judgment of dialogue responses (Liu et al., 2016), they provide a good baseline for comparing dialogue model performance.
Perplexity - For a model with parameters $\theta$, we define perplexity as

$$PPL = \exp\Big(-\frac{1}{N_W} \sum_{k=1}^{K} \sum_{i=1}^{N_k} \log P_{\theta}(x_i^{(k)} \mid \mathbf{x}_{i-1}^{(k)})\Big),$$

where $K$ is the number of conversations in the dataset, $N_k$ is the number of utterances in conversation $k$, and $N_W$ is the total number of word tokens in the entire dataset. Lower perplexity is better. Perplexity measures the likelihood of generating the ground truth given the model parameters: while a generative model can generate a diversity of responses, it should still assign a high probability to the ground truth utterance.
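Concretely, corpus-level perplexity exponentiates the average per-word negative log-likelihood. A sketch over toy per-utterance log-probabilities (the values are invented):

```python
import math

def corpus_perplexity(utterance_log_probs, total_word_count):
    """exp of the negative total log-likelihood divided by N_W,
    the total number of word tokens in the dataset."""
    total_ll = sum(utterance_log_probs)
    return math.exp(-total_ll / total_word_count)

# Two utterances with 5 words total; log-probs are illustrative.
print(corpus_perplexity([-3.2, -4.8], total_word_count=5))  # exp(8.0 / 5)
```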
BLEU -The BLEU score (Papineni et al., 2002) provides a measure of overlap between the generated response (candidate) and the ground truth (reference) using a modified n-gram precision. According to Liu et. al. (Liu et al., 2016), BLEU-2 score is fairly correlated with human judgment for non-technical dialogue (such as MTC).
ROUGE -The ROUGE score (Lin, 2014) is similar to BLEU but it is recall-oriented instead. It is used for automatic evaluation of text summarization and machine translation. To compliment the BLEU score, we use ROUGE-N with N = 2 for our evaluation.
Distinct n-gram - This is the fraction of unique n-grams in the generated responses and provides a measure of diversity. Models with a higher number of distinct n-grams tend to produce more diverse responses (Li et al., 2016a). We use 1- and 2-grams for our evaluation.
Normalized Average Sequence Length (NASL) - This measures the average number of words in a model's generated responses, normalized by the average number of words in the ground-truth responses.
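Both diversity-side metrics are straightforward to compute. A sketch of Distinct-n and NASL over toy responses:

```python
def distinct_n(responses, n):
    """Fraction of unique n-grams among all n-grams in the generated responses."""
    ngrams = []
    for r in responses:
        toks = r.split()
        ngrams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def nasl(generated, references):
    """Average generated length normalized by average ground-truth length."""
    avg = lambda rs: sum(len(r.split()) for r in rs) / len(rs)
    return avg(generated) / avg(references)

gen = ["i do not know", "i do not know"]  # repetitive -> low distinct scores
ref = ["well that depends on the question", "ask me something easier please"]
print(distinct_n(gen, 1), nasl(gen, ref))
```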

Human Evaluation
For human evaluation, we follow a setup similar to Li et al. (2016a), employing crowd-sourced judges to evaluate a random selection of 200 samples. We presented both the multi-turn context and the generated responses from the models to 3 judges and asked them to rank the general response quality in terms of relevance and informativeness. For N models, the model with the lowest quality is assigned a score of 0 and the highest a score of N−1; ties are not allowed. The scores are normalized between 0 and 1 and averaged over the total number of samples and judges. For each model, we also estimated the per-sample score variance between judges, summed these variances, and divided by the square of the number of samples (assuming sample independence). The square root of the result is reported as the standard error of the human judgment for the model.
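The aggregation just described (per-sample variance across judges, summed, divided by the squared number of samples, then a square root) can be sketched as follows. The scores are invented, and the sample-variance estimator is one reasonable reading of "variance between judges".

```python
def judge_standard_error(scores_per_sample):
    """scores_per_sample: one list of per-judge scores per sample.
    Returns sqrt(sum of per-sample variances / num_samples**2), i.e. the
    standard error of the mean under sample independence."""
    def variance(xs):  # unbiased sample variance across judges
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    n = len(scores_per_sample)
    total_var = sum(variance(s) for s in scores_per_sample)
    return (total_var / n ** 2) ** 0.5

# Three samples, three judges each (normalized scores, invented).
print(judge_standard_error([[1.0, 0.5, 1.0], [0.0, 0.5, 0.0], [1.0, 1.0, 0.5]]))
```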

Baseline
We compare the performance of our model to (V)HRED (Serban et al., 2016, 2017b), since these are the closest to our approach in implementation and are the current state of the art among open-domain dialogue models. HRED is very similar to our proposed generator, but without the input utterance attention and noise samples. VHRED introduces a latent variable between the cRNN and the dRNN of the HRED and is trained using a variational lower bound on the log-likelihood. VHRED can generate multiple responses per context like hredGAN, but it has no specific criterion for selecting the best response.
The training and validation sets used for the UDC and MTC datasets were obtained directly from the authors of (V)HRED. For model comparison, we use a test set that is disjoint from the training and validation sets.

Results
We have two variants of hredGAN based on the noise injection approach: hredGAN with utterance-level noise (hredGAN_u) and with word-level noise (hredGAN_w). We compare these two variants with the HRED and VHRED models.
Perplexity: The average per-word perplexity of all four models on the MTC and UDC datasets (validation/test) is reported in the first column of Table 1. The table indicates that both hredGAN variants outperform the HRED and VHRED models in terms of perplexity. However, under the adversarial loss criterion (Eq. (8)), the hredGAN_u model performs better on MTC and worse on UDC. Note that for this experiment we run all models in teacher-forcing mode.
Generation Hyperparameter: For adversarial generation, we perform a linear search for α between 1 and 20 in increments of 1 using Eq. (13), with sample size L = 64, on the validation sets with models run autoregressively. The optimum values of α for hredGAN_u and hredGAN_w on UDC are 7.0 and 9.0, respectively. The search on MTC is not convex, probably due to the small size of the dataset, so we reuse the UDC α values. We note, however, that for both datasets any integer value between 3 and 10 (inclusive) works well in practice.
Quantitative Generator Performance: We run autoregressive inference for all models (using the optimum α values for the hredGAN models and selecting the best of L = 64 responses with the discriminator) on dialogue contexts from a held-out test set, and compute the average BLEU-2, ROUGE-2 (F1), Distinct-1/2, and NASL scores. A good dialogue model should find the right balance between precision (BLEU) and diversity, and we believe our adversarial approach is well suited to this problem. As the hredGAN generators explore diverse candidates, the discriminator ranking gives hredGAN an edge over (V)HRED because it helps reject responses that are out of context or ill-formed (Table 2). The ROUGE (F1) performance also indicates that hredGAN_w strikes a better balance between precision (BLEU) and diversity than the other models. This is also apparent from the quality of the generated responses.
Qualitative Generator Performance: The results of the human evaluation are reported in the last column of Table 1 and largely agree with the automatic evaluation. hredGAN_w performs best on both datasets, although the gap is larger on MTC than on UDC. This implies that improving HRED with adversarial generation works better than with variational generation (VHRED). In addition, the samples of generator outputs in Table 6 show that hredGAN, especially hredGAN_w, performs better than (V)HRED. While the other models produce short and generic utterances, hredGAN_w mostly yields informative responses. For example, in the first dialogue in Table 6, when the speaker is sarcastic about "the man upstairs", hredGAN_w responds with the utterance most coherent with the dialogue history. We see similar behavior across other samples. We also note that although hredGAN_u's responses are the longest on Ubuntu (in line with its NASL score), they are less informative than hredGAN_w's, resulting in a lower human evaluation score. We suspect this may be due to a mismatch between utterance-level noise and word-level discrimination, or a lack of capacity to capture the data distribution with a single noise sample per utterance. We plan to investigate this further in future work.
Discriminator Performance: Although only hredGAN uses a discriminator, the observed discriminator behavior is interesting. The discriminator score is generally reasonable, with longer, more informative, and more persona-related responses receiving higher scores, as shown in Table 2. It is worth noting that this behavior, while similar to that of a human judge, is learned without supervision. Moreover, the discriminator seems to have learned to assign average scores to frequent or generic responses such as "I don't know" and "I'm not sure", and high scores to rarer answers. This is why we sample from a modified noise distribution during inference: it allows the generator to produce rarer utterances that the discriminator scores highly.

Conclusion and Future Work
In this paper, we have introduced an adversarial learning approach that addresses response diversity and control of generator outputs, using an HRED-derived generator and discriminator. The proposed system outperforms existing state-of-the-art (V)HRED models for generating responses in multi-turn dialogue with respect to both automatic and human evaluations. The performance improvement of adversarial generation (hredGAN) over variational generation (VHRED) comes from the combination of adversarial training and adversarial inference, which helps address the lack of diversity and contextual relevance in maximum-likelihood-based generative dialogue models. Our analysis also indicates that word-level noise injection performs better in general.

References

Z. Xu, B. Liu, B. Wang, C. Sun, X. Wang, Z. Wang, and C. Qi. 2017. Neural response generation via GAN with an approximate embedding layer. In EMNLP.

L. Yu, W. Zhang, J. Wang, and Y. Yu. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI 2017).

A Ablation Experiments
Before settling on the adversarial learning framework described above for multi-turn dialogue, we carried out several ablation experiments.

A.1 Generator:
We consider two main factors here, i.e., addition of an attention memory and injection of Gaussian noise into the generator input.

A.1.1 Addition of Attention Memory
First, we noted that adding an attention memory to the HRED generator improved the test-set perplexity by more than 12 and 25 points on MTC and UDC, respectively, as shown in Table 4. The added attention also shows strong autoregressive inference performance across multiple metrics, along with an observed improvement in response quality. Hence our decision to use the modified HRED generator.

A.1.2 Injection of Noise
Before injecting noise into the generator, we first trained hredGAN without noise. The result is also reported in Table 4. We observe accelerated generator training but no appreciable improvement in performance. The discrimination task appears to be too easy when there is no stochasticity in the generator output, so the adversarial feedback does not meaningfully impact the generator weight updates. Finally, we also notice that even with noise injection, there is no appreciable improvement in autoregressive performance when we sample with L = 1, even though the perplexity is higher. However, as we increase L, producing L responses per turn, the discriminator's adversarial selection yields better performance, as reported in Table 1.
Therefore, we conclude that the combination of adversarial training and adversarial inference helps to address the lack of diversity and contextual relevance observed in the generated responses.

A.2 Discriminator:
Before deciding on word-level discrimination, we experimented with utterance-level discrimination. The utterance-level discriminator trains very quickly, but it leads to mostly generic responses from the generator. We also note that utterance-level discriminator scores are mostly extreme (either very low or very high). Since we used a convolutional neural network discriminator (Yu et al., 2017) in these experiments, we hope to investigate this further with other architectures.

A.3 Adversarial Training:
Lastly, we also tried a basic policy-gradient approach in which the word-level discriminator score is used as a reward for each generated word token, but this led to training instability, probably due to the instability of Monte Carlo sampling over a large vocabulary. We believe this might improve with other sampling methods, such as importance sampling, and hope to investigate it further in the future.
