Adversarial Learning on the Latent Space for Diverse Dialog Generation

Generating relevant responses in a dialog is challenging, and requires not only proper modeling of context in the conversation, but also being able to generate fluent sentences during inference. In this paper, we propose a two-step framework based on generative adversarial nets for generating conditioned responses. Our model first learns a meaningful representation of sentences by autoencoding, and then learns to map an input query to the response representation, which is in turn decoded as a response sentence. Both quantitative and qualitative evaluations show that our model generates more fluent, relevant, and diverse responses than existing state-of-the-art methods.


Introduction
Dialog generation is a challenging problem because it not only requires us to model the context in a conversation but also to exploit it to generate a relevant and fluent response. A dialog generation system can be divided into two parts: 1) encoding the context of the conversation, and 2) generating a response conditioned on the given context. A generated response is considered to be "good" if it is meaningful, fluent, and most importantly, relevant to the given context.
With the advancement of deep learning, sequence-to-sequence (Seq2Seq) models (Sutskever et al., 2014) have been adopted for dialog systems to encode conversational context and generate a response. However, they suffer from the problem of generic utterance generation, e.g., always generating "I don't know" (Serban et al., 2016; Li et al., 2016). One possible explanation (Wei et al., 2019) is the high uncertainty in dialog generation. A plausible response is analogous to a "mode" of a continuous distribution, and the response distribution is thus multimodal. However, the decoder of a Seq2Seq model is trained by cross-entropy loss, which is equivalent to minimizing the KL divergence between the target and predicted distributions. The asymmetric nature of KL divergence makes the learned distribution wide-spreading, analogous to the mode-averaging problem for continuous variables.
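As a toy illustration of this effect (a minimal sketch, not from the paper), consider a discrete target distribution with two high-probability "modes". A prediction that commits to a single mode incurs a much larger cross-entropy than one that spreads mass over both, so cross-entropy training favors spread-out, generic predictions:

```python
import math

# Toy "response distribution" with two modes (two equally plausible replies)
# and two near-zero-probability alternatives.
target = [0.49, 0.49, 0.01, 0.01]

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i; minimizing it over q minimizes KL(p || q)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# A prediction that commits to one mode pays a heavy penalty on the other,
# so the cross-entropy-optimal prediction spreads mass over both modes.
committed = [0.97, 0.01, 0.01, 0.01]   # picks mode 1 only
averaged  = [0.49, 0.49, 0.01, 0.01]   # spreads over both modes

assert cross_entropy(target, averaged) < cross_entropy(target, committed)
```

A decoder forced to cover every mode in this way tends to place its probability mass on safe, generic token sequences.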
Variational encoder-decoders (Serban et al., 2017; Bahuleyan et al., 2018; Zhao et al., 2017) and Wasserstein encoder-decoders (Bahuleyan et al., 2019) adopt probabilistic modeling to encourage diversity in responses. However, their decoders are also trained by cross-entropy loss against the target sequence, still making the model generate generic utterances.
In this paper, we propose an approach that uses adversarial learning in the latent space for dialog generation. We first train a variational autoencoder (VAE) (Kingma and Welling, 2014) on sentences, and then apply a generative adversarial network (GAN) on the latent space of the VAE. At inference time, we obtain the latent representation of the response from the generator of the GAN and decode it using the VAE's decoder. In this way, we can benefit from the mode-capturing property of GANs (Mao et al., 2019; Thanh-Tung et al., 2019). Also, since our GAN is trained on the latent space, techniques like Gumbel-Softmax and reinforcement learning (RL) are not necessary, which largely simplifies the training procedure. We further introduce a mean squared error (MSE) auxiliary loss to our adversarial module, which mitigates the mode-missing problem in GANs (Che et al., 2017), resulting in more relevant and diverse responses.

[Figure 1: The framework of our proposed two-step training procedure. (a) Step 1: Variational autoencoder. (b) Step 2: Adversarial network (dashed box).]

We evaluate our model on the deduplicated version (Bahuleyan et al., 2018) of the benchmark DailyDialog dataset and also the Switchboard dataset (Godfrey et al., 1992). Results indicate that responses generated by our model are more relevant to the input query/context, and are more diverse and fluent than the existing baselines.
The main contributions of our paper are as follows.
1. We propose a two-step framework of latent-space adversarial learning for generating diverse and relevant responses.
2. We propose a combination of adversarial loss and an auxiliary mean squared loss to help the GAN to converge faster and achieve better performance for dialog generation.

Approach
Figure 1 provides an overview of our proposed two-step approach.
Step 1: We first train an autoencoder, which takes an utterance s (either a query or a response) as input, obtains its latent code z_s from the encoder, and then feeds it to a decoder for reconstruction. The autoencoder thus learns a real-valued vector representation of a generic sentence.
Step 2: We train an adversarial network on the latent z space for learning dialog generation. Given a query-response pair (q, r) in the training set, we use the trained encoder from Step 1 to obtain their latent variables z_q and z_r. The query latent variable z_q is fed to a generator G, which maps it to the corresponding response's latent variable ẑ_r. When training the generator, we aim to match z_r and ẑ_r through the adversarial loss combined with a mean squared error loss. Here, the adversarial loss involves a discriminator that classifies the predicted response representation ẑ_r versus the encoded representation of the true response z_r, conditioned on the query z_q.
The details of our approach will be introduced in the rest of this section.

2.1 Step 1: Learning Sentence Representations

In Step 1, our primary goal is to learn a continuous representation of all utterances in the dialog corpus. The mapping from a sentence to its continuous representation should ideally be invertible so that our adversarial loss (in Step 2) can be applied in the continuous space to generate dialog responses. In particular, we adopt a variational autoencoder (VAE; Kingma and Welling, 2014) for our first step. A VAE encodes an input sentence s to a probabilistic, latent continuous representation z, from which the input sentence s is reconstructed.
We first impose a prior distribution on z, which is typically set to standard normal, p(z) = N(0, I). Given the sentence s, the VAE encodes a posterior distribution q_E(z|s) = N(μ, diag(σ²)), where μ and σ are predicted by the encoder of the VAE. The training objective is to minimize the expected reconstruction loss, penalized by a KL divergence term between the posterior and the prior:

J_VAE = E_{z ~ q_E(z|s)} [ −log p(s|z) ] + λ_KL · KL( q_E(z|s) ‖ p(z) ),

where λ_KL balances the two terms. Compared with a deterministic autoencoder, the VAE learns a smoother latent space by its KL regularization. This is helpful during the second step, where a GAN is trained to predict the latent representation of a response for decoding.
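For concreteness, the KL term above has a closed form for a diagonal Gaussian posterior against the standard normal prior, and the penalized objective is a simple weighted sum. The following is a minimal sketch (not the actual model code); the default λ_KL = 0.15 is taken from the annealing target in Appendix A:

```python
import math

def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions:
    0.5 * sum( mu^2 + sigma^2 - log sigma^2 - 1 )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))

def vae_loss(reconstruction_nll, mu, log_var, lambda_kl=0.15):
    """Expected reconstruction loss plus the lambda_kl-weighted KL penalty."""
    return reconstruction_nll + lambda_kl * gaussian_kl(mu, log_var)

# When the posterior matches the prior (mu = 0, sigma = 1), the KL term vanishes.
assert gaussian_kl([0.0, 0.0], [0.0, 0.0]) == 0.0
```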

2.2 Step 2: Predicting the Representation of the Response

In Step 2, the main objective is to predict the representation of the response given the dialog context (such as the previous utterance). In this way, the predicted latent representation of the response can be fed to the trained decoder from Step 1 to generate the response utterance.
To predict the response representation, we reuse the encoder in Section 2.1 to capture the meaning of the context query as z_q. Then we have a two-layer perceptron (with a ReLU activation function in the hidden layer) to predict the representation of the utterance to be generated, denoted by ẑ_r = G(z_q).
For adversarial training, we also encode the representation of the ground-truth reply r as z_r using the encoder from Step 1. We train an adversarial discriminator D to classify whether a response representation is real or predicted. Such classification should be conditioned on the context, because the model should learn not only whether an utterance is appropriate as a reply in general, but also whether it is appropriate to the specific query. Therefore, we also feed the encoded context representation into the discriminator. The classification is denoted by D(z_r, z_q) or D(ẑ_r, z_q), where we essentially concatenate the representations of the response and the query before feeding them to a logistic regression layer.
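The generator and the conditional discriminator head can be sketched in NumPy as below. This is a hypothetical, untrained illustration: the latent size 128 and hidden size 256 follow Appendix A, but the discriminator is simplified here to a single logistic layer over the concatenated pair (the appendix describes a similar two-layer network):

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT, HIDDEN = 128, 256  # latent size and MLP hidden size (Appendix A)

# Generator G: two-layer perceptron with a ReLU hidden layer, z_q -> z_r_hat.
W1, b1 = rng.normal(0, 0.02, (HIDDEN, LATENT)), np.zeros(HIDDEN)
W2, b2 = rng.normal(0, 0.02, (LATENT, HIDDEN)), np.zeros(LATENT)

def generator(z_q):
    h = np.maximum(0.0, W1 @ z_q + b1)   # ReLU hidden layer
    return W2 @ h + b2                   # predicted response code z_r_hat

# Discriminator head: logistic regression over the concatenation
# [z_response ; z_query] -> probability that z_response is a real encoding.
w_d, b_d = rng.normal(0, 0.02, 2 * LATENT), 0.0

def discriminator(z_response, z_q):
    x = np.concatenate([z_response, z_q])
    return 1.0 / (1.0 + np.exp(-(w_d @ x + b_d)))

z_q = rng.normal(size=LATENT)
z_r_hat = generator(z_q)
assert z_r_hat.shape == (LATENT,)
assert 0.0 < discriminator(z_r_hat, z_q) < 1.0
```

Conditioning the discriminator on z_q is what distinguishes this setup from judging a response representation in isolation.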
The adversarial loss for training the latent space is given by

V(D, G) = E_{(q,r) ~ D_train} [ log D(z_r, z_q) + log(1 − D(G(z_q), z_q)) ],

where D_train is the training data. In other words, the discriminator D is trained by maximizing V(D, G) so as to distinguish the true representation of a response from the predicted response representation given the query, whereas the generator G is trained to fool the discriminator by minimizing V(D, G).
It should be emphasized that our model is different from adversarial autoencoders (Makhzani et al., 2015), because our discriminator takes the query into consideration. Our adversarial loss learns an implicit conditional distribution p(z r |z q ), instead of a marginal distribution p(z r ) as in Zhao et al. (2018).
Additionally, we introduce an auxiliary mean squared error (MSE) loss to the objective function:

L_MSE = E_{(q,r) ~ D_train} ‖ẑ_r − z_r‖².

The MSE loss on the generator helps stabilize GAN training and mitigates the mode-missing problem of GANs (Che et al., 2017). In summary, the overall training objective is given by

min_G max_D V(D, G) + γ · L_MSE,

where γ is a tunable hyperparameter that moderates the effect of the MSE loss.
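The generator's side of this objective combines the adversarial term with the γ-weighted MSE term. A minimal sketch for a single sample (the default γ = 1.0 here is hypothetical, not a value from the paper):

```python
import math

def generator_loss(d_fake, z_r_hat, z_r, gamma=1.0):
    """Per-sample generator loss: log(1 - D(z_r_hat, z_q)) plus gamma * MSE.

    d_fake is the discriminator's probability for the predicted code;
    gamma weights the auxiliary MSE term (hypothetical default).
    """
    adv = math.log(1.0 - d_fake)   # generator minimizes this adversarial term
    mse = sum((a - b) ** 2 for a, b in zip(z_r_hat, z_r)) / len(z_r)
    return adv + gamma * mse

# With a perfect prediction, the MSE term vanishes and only the
# adversarial term remains.
assert generator_loss(0.5, [1.0, 2.0], [1.0, 2.0]) == math.log(0.5)
```

The MSE term anchors ẑ_r to the encoded ground truth, which is what mitigates mode missing: the generator cannot collapse to an arbitrary region of the latent space that merely fools the discriminator.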
For inference, our model first uses the pretrained VAE from Step 1 to encode an unseen query q* as z_q*. This encoded representation is then passed to the generator G to predict the response latent code G(z_q*), which is finally fed to the decoder of the VAE from Step 1 to generate a response sentence.
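The inference pipeline is a straight composition of the three trained components. The sketch below uses untrained stand-ins (random maps and a stubbed decoder) purely to show the data flow; none of these stubs are the paper's actual modules:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8  # toy latent dimension (the paper uses 128)

W_g = rng.normal(0, 0.1, (D, D))  # stand-in generator weights

def encode(tokens):               # VAE encoder: sentence -> z_q (stubbed)
    return rng.normal(size=D)

def generate(z_q):                # generator G: z_q -> z_r_hat
    return np.tanh(W_g @ z_q)

def decode(z):                    # VAE decoder: z -> tokens (stubbed)
    return ["<generated", "response>"]

def respond(query_tokens):
    """Inference: encode the query, predict the response code, decode it."""
    return decode(generate(encode(query_tokens)))

assert respond(["when", "is", "he", "coming", "?"]) == ["<generated", "response>"]
```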
In our experiments, we have two settings for dialog generation: single-turn and multi-turn. In the single-turn setting, we form query-response samples by extracting every pair of consecutive utterances of a conversation in the training data.
In the multi-turn setting, we form query-response pairs by pairing every utterance with its preceding utterances in the conversation. The VAE in Step 1 remains the same, but we introduce another RNN to encode the context. Specifically, it is built upon the VAE's encoded representation of each utterance, and yields a fixed-length vector representation of the entire context. During adversarial training, we concatenate the context vector with the query (immediately preceding utterance) representation before feeding them to the generator. In this way, our generator also takes the context into account when predicting the response latent code. A similar adjustment is applied during inference.
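The multi-turn adjustment can be sketched as follows. This is a toy illustration with a plain tanh RNN and small dimensions; the paper uses a BiLSTM context encoder with a hidden size of 512 (Appendix A):

```python
import numpy as np

rng = np.random.default_rng(2)
D, H = 8, 16  # utterance-code size and context-RNN hidden size (toy values)

W_h = rng.normal(0, 0.1, (H, H))
W_x = rng.normal(0, 0.1, (H, D))

def encode_context(utterance_codes):
    """Run a plain RNN over the per-utterance VAE codes; the last hidden state
    is the fixed-length context vector (the paper uses a BiLSTM instead)."""
    h = np.zeros(H)
    for z in utterance_codes:
        h = np.tanh(W_h @ h + W_x @ z)
    return h

context = [rng.normal(size=D) for _ in range(3)]  # three previous utterances
z_q = rng.normal(size=D)                          # immediately preceding query
generator_input = np.concatenate([encode_context(context), z_q])
assert generator_input.shape == (H + D,)
```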

Experiments
We conduct experiments on the DailyDialog dataset, a manually labeled multi-turn dialog dataset, and the Switchboard dataset (Godfrey et al., 1992), a dialog dataset containing transcripts of telephone conversations. For DailyDialog, we use the original splits after removing duplicates, following Bahuleyan et al. (2019). We use the AllenNLP framework (Gardner et al., 2018) to implement all our models. Appendix A presents more experimental details and hyperparameters. We use the following baseline models for comparison:

• Seq2Seq. The standard sequence-to-sequence model based on LSTMs.

Results and Analysis
The results for the DailyDialog and Switchboard datasets are shown in Tables 1 and 2, respectively. The generated responses are evaluated by the following criteria.

Overall quality. We measure the quality of the generated responses by BLEU scores (Papineni et al., 2002), for which we adopt the smoothing techniques in Gu et al. (2019). For each query, we generate 10 responses and compute the average and maximum BLEU scores. We then also compute the harmonic mean of the average and maximum BLEU scores. Our model is either the best-performing model or highly competitive in terms of BLEU scores. The DialogWAE model also achieves high BLEU scores, while the Seq2Seq model performs worst.
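The aggregation over the 10 samples can be computed as below (the BLEU values are hypothetical placeholders, not numbers from the paper):

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative scores, e.g., average and max BLEU."""
    return 2 * a * b / (a + b) if a + b > 0 else 0.0

bleu_scores = [0.10, 0.20, 0.15, 0.05]   # hypothetical per-sample BLEU values
avg_bleu = sum(bleu_scores) / len(bleu_scores)
max_bleu = max(bleu_scores)

# The harmonic mean rewards models that do well on both aggregate measures.
hm = harmonic_mean(avg_bleu, max_bleu)
assert min(avg_bleu, max_bleu) <= hm <= max(avg_bleu, max_bleu)
```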
Diversity. We measure the diversity of dialog generation in two aspects:

• Intra-diversity. The intra-diversity score measures the proportion of distinct unigrams and bigrams in each response. It is similar for most models.

• Inter-diversity. The inter-diversity scores measure the proportion of distinct unigrams and bigrams across all 10 responses. Our model performs the best across the inter-diversity metrics.

We further use other diversity indicators, such as the average sentence length (ASL) of the responses. Diversity scores for the Seq2Seq model are very high on the Switchboard dataset; however, it also has the lowest ASL. This is expected: the Seq2Seq model does not generate diverse responses overall. DialogWAE generates longer responses on average; however, our model is closer to the ground-truth ASL (14.43 for DailyDialog and 8.49 for Switchboard). We also note that our model achieves good type-token ratio (TTR) scores, indicating diverse word choices compared with other models.
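The distinct-n and TTR measures described above can be sketched as follows (a minimal reimplementation for illustration; the exact evaluation scripts may differ in tokenization details):

```python
def distinct_n(responses, n):
    """Proportion of distinct n-grams: inter-diversity when given all 10
    samples for a query, intra-diversity when given a single response."""
    ngrams = []
    for tokens in responses:
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def type_token_ratio(responses):
    """Number of distinct words divided by the total number of words."""
    tokens = [t for r in responses for t in r]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

generic = [["i", "don't", "know"]] * 3            # a repeated generic reply
diverse = [["next", "week"], ["no", "idea"], ["he", "is", "busy"]]
assert distinct_n(diverse, 1) > distinct_n(generic, 1)
assert type_token_ratio(diverse) > type_token_ratio(generic)
```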
Fluency. We compute perplexity (PPL) scores of generated responses to measure fluency. We notice that our model achieves the best PPL scores, although DialogWAE is quite close. The Seq2Seq model also achieves low PPL, but this is mainly due to its short and generic responses. Interestingly, PPL scores are generally higher in the multi-turn setting, which may be attributed to the increased complexity of the output when more context is given.
Analysis of Losses. Combining the MSE and adversarial losses leads to significant improvements across all metrics, including the BLEU scores, response diversity (Inter-1 and Inter-2), and fluency (PPL). In our experiments, we also notice that the MSE term leads to quicker and more stable convergence of the GAN (within 6 epochs), making training easier.
We present human evaluation in Appendix B and a case study in Appendix C.

Conclusion
We propose an effective two-stage model for dialog generation. We make use of sentence representations learned by a VAE and train an adversarial network on the VAE's latent space to generate diverse responses given a query and context. We observe that our model outperforms existing state-of-the-art approaches by generating more diverse, fluent, and relevant sentences.

A Hyperparameter Settings and Training
Single-turn. In this setting, we first train a VAE on the entire corpus. We use a single-layer bidirectional LSTM (Hochreiter and Schmidhuber, 1997) as the encoder and a unidirectional LSTM layer as the decoder of the VAE. Both use a hidden size of 512. The dimension of our latent vectors is 128, and that of the word embeddings is 300. Further, we adopt KL annealing and word dropout from Bowman et al. (2016) to stabilize the VAE's training. We use a word dropout probability of 0.5 and a sigmoid annealing schedule that anneals the KL weight to 0.15 over 4500 iterations. The performance statistics of the VAE in Step 1 are shown in Table 3.
Table 3: Performance of the VAE in Step 1.

Model | KL   | BLEU | Dist-1 | Dist-2
VAE   | 18.8 | 0.18 | 0.32   | 0.49

For the GAN, we use a 2-layer feed-forward network with a hidden layer of 256 units as the generator, along with batch normalization (Ioffe and Szegedy, 2015) and LeakyReLU activation (Maas et al., 2013). The discriminator shares a similar architecture. We use Adam (Kingma and Ba, 2015) to optimize all our networks.
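The sigmoid KL-annealing schedule used in Step 1 can be sketched as below. The endpoints (target weight 0.15 over 4500 iterations) follow the settings above, while the `steepness` shape parameter is a hypothetical choice for illustration:

```python
import math

def kl_weight(step, total_steps=4500, max_weight=0.15, steepness=10.0):
    """Sigmoid schedule annealing the KL weight from ~0 to max_weight over
    total_steps iterations (steepness is a hypothetical shape parameter)."""
    x = steepness * (step / total_steps - 0.5)
    return max_weight / (1.0 + math.exp(-x))

assert kl_weight(0) < 0.01                 # near zero at the start of training
assert abs(kl_weight(4500) - 0.15) < 0.01  # close to 0.15 at the end
```

Starting with a near-zero KL weight lets the decoder learn to reconstruct before the posterior is pulled toward the prior, a standard remedy for posterior collapse.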
Multi-turn. In this setting, the VAE's architecture remains the same as in the single-turn setting. We introduce another BiLSTM encoder with a hidden size of 512, which is fed with the VAE-encoded representations of the context sentences. Other hyperparameters are kept the same. For implementation, our generator predicts the response representation at each turn, but we use teacher forcing, i.e., the context is taken to be the actual previous utterances.

B Human Evaluation
In addition to automatic metrics, we also evaluate our model against DialogWAE (Gu et al., 2019) using human evaluation. Five human judges rate the responses of each model on a scale of 1-5, according to two criteria: 1) relevance to the query, and 2) fluency of the generated response. The results are shown in Table 4.

C Case Study
We conduct a case study comparing VAE-AM's responses with DialogWAE's in the single-turn and multi-turn settings. For both settings, we show the best response out of 10 generated responses for each input. Table 5 shows that DialogWAE generates generic responses, although its sentences are longer. Our model produces more concrete concepts, such as "leaflets" and "money." Table 6 shows generated responses in the multi-turn setting. We see that both models are more informative when given more context compared with Table 5. However, DialogWAE repeats the same concepts, such as "a lot of" in Examples 2, 3, and 5, and "have to pay" in Examples 2 and 5. This observation is also consistent with the TTR results in Tables 1 and 2.

1. Query: allan, i am really disappointed by your lack of effort on this project .
   VAE-AM: it seems you are not aware of the actual work split . i have done more than anyone else.
   DialogWAE: ok , it's not that bad . but it's not that bad . i don't want to have any other questions about it .

2. Query: when is he coming ?
   VAE-AM: next week . he is enjoying his month long vacation in miami .
   DialogWAE: sorry , he's not very serious . he's got a lot of money to do .

3. Query: i ll try my best to make a good impression on the interviewer .
   VAE-AM: do nt worry pal. only weak candidates should. just dress properly and try to make it alive there haha ...
   DialogWAE: that is ok , we ' ll be able to get a raise in the next few days . we will have to pay you a lot of money .

4. Query: can you tell me where the pots and pans are ?
   VAE-AM: absolutely not . i m not letting you cook today . you messed it real bad yesterday .
   DialogWAE: yes , of course .

5. Query: what kind of personal information ?
   VAE-AM: your name , height , sex , and contact . also tick this check box if you have faced this issue before .
   DialogWAE: not at the moment . but there is a lot of money in the city , but we have to pay a lot of money for the money .

Table 6: Generated responses in the multi-turn setting.