A Discrete CVAE for Response Generation on Short-Text Conversation

Neural conversation models such as encoder-decoder models tend to generate bland and generic responses. Some researchers propose to use the conditional variational autoencoder (CVAE), which maximizes the lower bound on the conditional log-likelihood with a continuous latent variable. With different sampled latent variables, the model is expected to generate diverse responses. Although the CVAE-based models have shown tremendous potential, their improvement in generating high-quality responses remains unsatisfactory. In this paper, we introduce a discrete latent variable with an explicit semantic meaning to improve the CVAE on short-text conversation. A major advantage of our model is that we can exploit the semantic distance between the latent variables to maintain good diversity between the sampled latent variables. Accordingly, we propose a two-stage sampling approach to enable efficient diverse variable selection from a large latent space assumed in the short-text conversation task. Experimental results indicate that our model outperforms various kinds of generation models under both automatic and human evaluations and generates more diverse and informative responses.


Introduction
Open-domain response generation (Perez-Marin, 2011; Sordoni et al., 2015) for single-round short-text conversation (Shang et al., 2015) aims at generating a meaningful and interesting response given a query from human users. Neural generation models are of growing interest in this topic due to their potential to leverage massive conversational datasets on the web. These generation models, such as encoder-decoder models (Vinyals and Le, 2015; Shang et al., 2015; Wen et al., 2015), directly build a mapping from the input query to its output response, which treats all query-response pairs uniformly and optimizes the maximum likelihood estimation (MLE). However, when the models converge, they tend to output bland and generic responses (Li et al., 2016a,c; Serban et al., 2016).
Many enhanced encoder-decoder approaches have been proposed to improve the quality of generated responses. They can be broadly classified into two categories (see Section 2 for details): (1) The first category does not change the encoder-decoder framework itself. These approaches either change only the decoding strategy, such as encouraging diverse tokens to be selected in beam search (Li et al., 2016a,b), or add more components on top of the encoder-decoder framework, such as the Generative Adversarial Network (GAN)-based methods (Xu et al., 2017; Zhang et al., 2018; Li et al., 2017), which add discriminators to perform adversarial training. (2) The second category modifies the encoder-decoder framework directly by incorporating useful information as latent variables in order to generate more specific responses (Yao et al., 2017; Zhou et al., 2017). However, all these enhanced methods still optimize the MLE of the log-likelihood or the complete log-likelihood conditioned on their assumed latent information, and models estimated by the MLE naturally favor outputting frequent patterns in the training data.
Instead of optimizing the MLE, some researchers propose to use the conditional variational autoencoder (CVAE), which maximizes the lower bound on the conditional data log-likelihood with a continuous latent variable (Zhao et al., 2017; Shen et al., 2017). Open-domain response generation is a one-to-many problem, in which a query can be associated with many valid responses. The CVAE-based models generally assume the latent variable follows a multivariate Gaussian distribution with a diagonal covariance matrix, which can capture the latent distribution over all valid responses. With different sampled latent variables, the model is expected to decode diverse responses. Due to the advantage of the CVAE in modeling the response generation process, we focus on improving the performance of the CVAE-based response generation models.
arXiv:1911.09845v1 [cs.CL] 22 Nov 2019
Although the CVAE has achieved impressive results on many generation problems (Yan et al., 2016; Sohn et al., 2015), recent results on response generation show that the CVAE-based generation models still suffer from the low output diversity problem. That is, multiple sampled latent variables result in responses with similar semantic meanings. To address this problem, extra guided signals are often used to improve the basic CVAE. Zhao et al. (2017) use dialogue acts to capture the discourse variations in multi-round dialogues as guided knowledge. However, such discourse information can hardly be extracted for short-text conversation.
In our work, we propose a discrete CVAE (DCVAE), which utilizes a discrete latent variable with an explicit semantic meaning in the CVAE for short-text conversation. Our model mitigates the low output diversity problem in the CVAE by exploiting the semantic distance between the latent variables to maintain good diversity between the sampled latent variables. Accordingly, we propose a two-stage sampling approach to enable efficient selection of diverse variables from a large latent space assumed in the short-text conversation task.
To summarize, this work makes three contributions: (1) We propose a response generation model for short-text conversation based on a DCVAE, which utilizes a discrete latent variable with an explicit semantic meaning and can generate high-quality responses. (2) A two-stage sampling approach is devised to enable efficient selection of diverse variables from a large latent space assumed in the short-text conversation task. (3) Experimental results show that the proposed DCVAE with the two-stage sampling approach outperforms various kinds of generation models under both automatic and human evaluations, and generates more high-quality responses. All our code and datasets are available at https://ai.tencent.com/ailab/nlp/dialogue.

Related Work
In this section, we briefly review recent advances in encoder-decoder models and CVAE-based models for response generation.

Encoder-decoder models
Encoder-decoder models for short-text conversation (Vinyals and Le, 2015; Shang et al., 2015) maximize the likelihood of responses given queries. During testing, a decoder sequentially generates a response using search strategies such as beam search. However, these models frequently generate bland and generic responses.
Some early work improves the quality of generated responses by modifying the decoding strategy. For example, Li et al. (2016a) propose to use the maximum mutual information (MMI) to penalize general responses in beam search during testing. Some later studies alter the data distributions according to different sample weighting schemes, encouraging the model to put more emphasis on learning samples with rare words (Nakamura et al., 2018; Liu et al., 2018). As can be seen, these methods focus on either pre-processing the dataset before training or post-processing the results in testing, with no change to the encoder-decoder models themselves.
Other work uses encoder-decoder models as the basis and adds more components to refine the response generation process. Xu et al. (2017) present a GAN-based model with an approximate embedding layer. Zhang et al. (2018) employ an adversarial learning method to directly optimize the lower bound of the MMI objective (Li et al., 2016a) in model training. These models employ the encoder-decoder models as the generator and focus on how to design the discriminator and optimize the generator and discriminator jointly. Deep reinforcement learning is also applied to model future reward in chatbots after an encoder-decoder model converges (Li et al., 2016c, 2017). The above methods directly integrate the encoder-decoder models as one of their model modules and still do not actually modify the encoder-decoder models.
Much attention has turned to incorporating useful information as latent variables in the encoder-decoder framework to improve the quality of generated responses. Yao et al. (2017) consider that a response is generated by a query and a pre-computed cue word jointly. Zhou et al. (2017) utilize a set of latent embeddings to model diverse responding mechanisms. Xing et al. (2017) introduce pre-defined topics from an external corpus to augment the information used in response generation. Gao et al. (2019) propose a model that infers latent words to generate multiple responses. These studies indicate that many factors in conversation are useful to model the variation of a generated response, but it is nontrivial to extract all of them. Also, these methods still optimize the MLE of the complete log-likelihood conditioned on their assumed latent information, and a model optimized with the MLE naturally favors outputting frequent patterns in the training data. Note that we apply a latent space assumption similar to that used in (Yao et al., 2017; Gao et al., 2019), i.e., the latent variables are words from the vocabulary. However, they use a latent word in a factorized encoder-decoder model, whereas our model uses it to construct a discrete CVAE, and our optimization algorithm is entirely different from theirs.

The CVAE-based models
A few works indicate that it is worth applying the CVAE, which was originally used in image generation (Yan et al., 2016; Sohn et al., 2015) and is optimized with the variational lower bound of the conditional log-likelihood, to dialogue generation. For task-oriented dialogues, Wen et al. (2017) use the latent variable to model intentions in the framework of neural variational inference. For chit-chat multi-round conversations, Serban et al. (2017) model the generative process with multiple levels of variability based on a hierarchical sequence-to-sequence model with a continuous high-dimensional latent variable. Zhao et al. (2017) make use of the CVAE, and the latent variable is used to capture discourse-level variations. Gu et al. (2019) propose to induce the latent variables by transforming context-dependent Gaussian noise. Shen et al. (2017) present a conditional variational framework for generating specific responses based on specific attributes. Yet, it is observed in other tasks such as image captioning (Wang et al., 2017) and question generation (Fan et al., 2018) that the CVAE-based generation models suffer from the low output diversity problem, i.e., multiple sampled variables point to the same generated sequences. In this work, we utilize a discrete latent variable with an interpretable meaning to alleviate this low output diversity problem on short-text conversation.
We find that Zhao et al. (2018) make use of a set of discrete variables that define high-level attributes of a response. Although they interpret the meanings of the learned discrete latent variables by clustering data according to certain classes (e.g., dialog acts), such latent variables still have no exact meanings. In our model, we connect each latent variable with a word in the vocabulary, thus each latent variable has an exact semantic meaning. Besides, they focus on multi-turn dialogue generation and present an unsupervised discrete sentence representation learning method learned from the context, while we concentrate primarily on single-turn dialogue generation with no context information.

Proposed Models

DCVAE and Basic Network Modules
Following previous CVAE-based generation models (Zhao et al., 2017), we introduce a latent variable z for each input sequence and our goal is to maximize the lower bound on the conditional data log-likelihood p(y|x), where x is the input query sequence and y is the target response sequence:

$$\log p(y|x) \ge \mathbb{E}_{z \sim q(z|y,x)}[\log p(y|x,z)] - D_{KL}(q(z|y,x)\,\|\,p(z|x)) \tag{1}$$

Here, p(z|x), q(z|y, x) and p(y|x, z) are parameterized by the prior, posterior and generation network respectively. $D_{KL}(q(z|y,x)\,\|\,p(z|x))$ is the Kullback-Leibler (KL) divergence between the posterior and prior distribution. Generally, z is set to follow a Gaussian distribution in both the prior and posterior networks. As mentioned in the related work, directly using the above CVAE formulation causes the low output diversity problem. This observation is also validated in the short-text conversation task in our experiments.
Now, we introduce our basic discrete CVAE formulation to alleviate the low output diversity problem. We change the continuous latent variable z to a discrete one with an explicit interpretable meaning, which can actively control the generation of the response. An intuitive way is to connect each latent variable with a word in the vocabulary. With a latent z sampled from the prior network (in testing) or the posterior network (in training), the generation network takes the query representation together with the word embedding of this latent variable as the input to decode the response. Here, we assume that a single word is enough to drive the generation network to output diverse responses for short-text conversation, in which the response is generally short and compact.
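For reference, the lower bound on the conditional log-likelihood that the CVAE maximizes follows from Jensen's inequality. The derivation below is the standard one, written with the prior p(z|x), posterior q(z|y, x) and generation distribution p(y|x, z) named in the text; it is a worked sketch rather than a reproduction of any derivation given by the authors.

```latex
\begin{align*}
\log p(y|x)
  &= \log \mathbb{E}_{z \sim q(z|y,x)}\!\left[ \frac{p(y|x,z)\, p(z|x)}{q(z|y,x)} \right] \\
  &\ge \mathbb{E}_{z \sim q(z|y,x)}\!\left[ \log p(y|x,z) + \log \frac{p(z|x)}{q(z|y,x)} \right] \\
  &= \mathbb{E}_{z \sim q(z|y,x)}\!\left[ \log p(y|x,z) \right]
     - D_{KL}\!\left( q(z|y,x) \,\|\, p(z|x) \right)
\end{align*}
```

The inequality becomes tight when q(z|y, x) matches the true posterior p(z|x, y), which is why the KL term acts as a regularizer that pulls the posterior network towards the prior network.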
A major advantage of our DCVAE is that words with very different meanings generally have word embeddings that are far apart (especially since we use a good pre-trained word embedding corpus), which drives the generation network to decode scattered responses and thus improves the output diversity. In the standard CVAE, z's assumed in a continuous space may not maintain the semantic distance as in the embedding space, and diverse z's may point to the same semantic meaning, in which case it is hard to train the generation network well with such confusing information. Moreover, we can make use of the semantic distance between latent variables to perform better sampling to approximate the objective during optimization, which will be introduced in Section 3.2.
The latent variable z is thus set to follow a categorical distribution with each dimension corresponding to a word in the vocabulary. Therefore the prior and posterior networks should output categorical probability distributions:

$$p_\theta(z|x) = \mathrm{softmax}(g_\theta(x)) \tag{2}$$

$$q_\phi(z|y,x) = \mathrm{softmax}(f_\phi(y,x)) \tag{3}$$

where θ and φ are parameters of the two networks respectively. The KL divergence between these two distributions can be calculated in closed form:

$$D_{KL}(q(z|y,x)\,\|\,p(z|x)) = \sum_{z \in Z} q(z|y,x) \log \frac{q(z|y,x)}{p(z|x)}$$

where Z contains all words in the vocabulary. In the following, we present the details of the prior, posterior and generation network.
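As a minimal sketch of this closed-form KL term, the snippet below computes the divergence between two categorical distributions obtained from logits. The logits are toy values standing in for the outputs of the posterior and prior networks, and the helper names `softmax` and `categorical_kl` are illustrative assumptions, not taken from the released code.

```python
import math

def softmax(logits):
    """Turn a list of logits into a categorical distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_kl(q, p, eps=1e-12):
    """D_KL(q || p) = sum_z q(z) * log(q(z) / p(z)) over a finite set Z."""
    return sum(qz * math.log(qz / max(pz, eps))
               for qz, pz in zip(q, p) if qz > 0.0)

# Toy logits over a 4-word vocabulary (assumed values for illustration only).
posterior = softmax([2.0, 0.5, 0.1, -1.0])  # stands in for q(z|y, x)
prior = softmax([1.0, 1.0, 0.0, 0.0])       # stands in for p(z|x)
kl = categorical_kl(posterior, prior)
```

Because Z is finite, the sum is exact and no sampling is needed for this term, in contrast to the expectation over log p(y|x, z).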
Prior network p(z|x): It aims at inferring the latent variable z given the input sequence x. We first obtain an input representation $h^p_x$ by encoding the input query x with a bi-directional GRU and then compute $g_\theta(x)$ in Eq. 2 from $h^p_x$ (Eq. 5), where θ contains parameters in both the bi-directional GRU and Eq. 5.
Posterior network q(z|y, x): It infers a latent variable z given an input query x and its target response y. We construct representations for the input and the target sequence with separate bi-directional GRUs, then add them up to compute $f_\phi(y, x)$ in Eq. 3 to predict the probability of z (Eq. 6), where φ contains parameters in the two encoding functions and Eq. 6. Note that the parameters of the encoding functions are not shared between the prior and posterior networks.
Generation network p(y|x, z): We adopt an encoder-decoder model with attention (Luong et al., 2015) used in the decoder. With a sampled latent variable z, a typical strategy is to combine its representation, which in this case is the word embedding $e_z$ of z, only at the beginning of decoding. However, many previous works observe that the influence of the added information vanishes over time (Yao et al., 2017; Gao et al., 2019). Thus, after obtaining an attentional hidden state at each decoding step, we concatenate the representation $h_z$ of the latent variable and the current hidden state to produce a final output in our generation network.

A Two-Stage Sampling Approach
When the CVAE models are optimized, they tend to converge to a solution with a vanishingly small KL term, thus failing to encode meaningful information in z. To address this problem, we follow the idea in (Zhao et al., 2017), which introduces an auxiliary loss that requires the decoder in the generation network to predict the bag-of-words in the response y. Specifically, the response y is now represented by two sequences simultaneously: $y_o$ with word order and $y_{bow}$ without order. These two sequences are assumed to be conditionally independent given z and x. Then our training objective can be rewritten as:

$$\mathcal{L} = \mathbb{E}_{z \sim q(z|y,x)}[\log p(y_o|x,z)] + \mathbb{E}_{z \sim q(z|y,x)}[\log p(y_{bow}|x,z)] - D_{KL}(q(z|y,x)\,\|\,p(z|x))$$

where $p(y_{bow}|x,z)$ is obtained by a multilayer perceptron $h^b = \mathrm{MLP}(x, z)$:

$$p(y_{bow}|x,z) = \prod_{t=1}^{|y|} \frac{\exp(h^b_{y_t})}{\sum_{j=1}^{V} \exp(h^b_j)}$$

where |y| is the length of y, $y_t$ is the word index of the t-th word in y, and V is the vocabulary size.
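The bag-of-words term can be sketched as a sum of log-softmax scores, one per word in the response. The snippet below is a minimal illustration that assumes the MLP output is already available as a list of V logits; the function name `bow_log_likelihood` and the toy values are hypothetical.

```python
import math

def bow_log_likelihood(h_b, response_ids):
    """log p(y_bow | x, z): sum over response words of log softmax(h_b)[y_t]."""
    m = max(h_b)
    # log of the softmax normalizer, computed stably
    log_norm = m + math.log(sum(math.exp(v - m) for v in h_b))
    return sum(h_b[t] - log_norm for t in response_ids)

# Hypothetical MLP output for a vocabulary of size V = 5.
h_b = [1.0, 0.0, -1.0, 2.0, 0.5]
# Word indices y_1..y_|y| of a toy response; word order is ignored by this term.
response_ids = [3, 0, 3]
ll = bow_log_likelihood(h_b, response_ids)
```

Since the same logits are shared by all positions, the term only rewards putting probability mass on the words that occur in the response, which forces z to carry information about the response content.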
During training, we generally approximate $\mathbb{E}_{z \sim q(z|y,x)}[\log p(y|x,z)]$ by sampling z from the distribution q(z|y, x) N times. In our model, the latent space is discrete but generally large since we set it as the vocabulary in the dataset. The vocabulary contains words that are syntactically or semantically similar. Directly sampling z from the categorical distribution in Eq. 3 cannot make use of such word similarity information.
Hence, we propose to modify our model in Section 3.1 to consider the word similarity for sampling multiple accurate and diverse latent z's. We first cluster z ∈ Z into K clusters $c_1, \ldots, c_K$. Each z belongs to only one of the K clusters and dissimilar words lie in distinct groups. We use the K-means clustering algorithm to group z's using a pre-trained embedding corpus (Song et al., 2018). Then we revise the posterior network to perform a two-stage cluster sampling by decomposing q(z|y, x) as:

$$q(z|y,x) = q(c_{k_z}|y,x)\, q(z|x,y,c_{k_z})$$

That is, we first compute $q(c_{k_z}|y,x)$, which is the probability of the cluster that z belongs to conditioned on both x and y. Next, we compute $q(z|x,y,c_{k_z})$, which is the probability distribution of z conditioned on x, y and the cluster $c_{k_z}$. When we perform sampling from q(z|x, y), we can exploit the following two-stage sampling approach: first sample the cluster based on $q(c_k|x,y)$; next sample a specific z from the z's within the sampled cluster based on $q(z|x,y,c_{k_z})$.
Similarly, we can decompose the prior distribution p(z|x) accordingly for consistency:

$$p(z|x) = p(c_{k_z}|x)\, p(z|x,c_{k_z})$$

In testing, we can perform the two-stage sampling according to $p(c_k|x)$ and $p(z|x,c_{k_z})$. Our full model is illustrated in Figure 1.
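The two-stage procedure can be sketched as two successive categorical draws. The snippet below is a simplified illustration: the cluster assignment, the cluster probabilities and the within-cluster probabilities are hypothetical toy values supplied as plain lists, whereas in the model the within-cluster distribution is computed by the network after a cluster has been sampled.

```python
import random

def sample_categorical(probs, rng):
    """Draw an index from a categorical distribution given as a list of probabilities."""
    r = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return index
    return len(probs) - 1  # guard against floating-point round-off

def two_stage_sample(cluster_probs, word_probs, clusters, rng):
    """Stage 1: sample a cluster. Stage 2: sample a word inside that cluster."""
    k = sample_categorical(cluster_probs, rng)
    j = sample_categorical(word_probs[k], rng)
    return k, clusters[k][j]

def marginal_word_prob(word, cluster_probs, word_probs, clusters):
    """p(z) = p(c_kz) * p(z | c_kz), because each word lies in exactly one cluster."""
    for k, members in enumerate(clusters):
        if word in members:
            return cluster_probs[k] * word_probs[k][members.index(word)]
    raise KeyError(word)

# Hypothetical toy latent space: 5 words grouped into K = 2 clusters.
clusters = [["football", "basketball"], ["rain", "sunny", "snow"]]
cluster_probs = [0.7, 0.3]                   # stands in for p(c_k | x)
word_probs = [[0.5, 0.5], [0.2, 0.5, 0.3]]   # stands in for p(z | x, c_k)

rng = random.Random(0)
samples = [two_stage_sample(cluster_probs, word_probs, clusters, rng) for _ in range(5)]
```

With this factorization each draw touches K cluster probabilities plus the words of one cluster, rather than a single softmax over the whole vocabulary.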
Network structure modification: To modify the network structure for the two-stage sampling method, we first compute the probability of each cluster given x in the prior network (or x and y in the posterior network) with a softmax layer (Eq. 5 or Eq. 6 followed by a softmax function). We then add the input representation and the cluster embedding $e_{c_z}$ of a sampled cluster $c_z$, and use another softmax layer to compute the probability of each z within the sampled cluster. In the generation network, the representation of z is the sum of the cluster embedding $e_{c_z}$ and its word embedding $e_z$.
Network pre-training: To speed up the convergence of our model, we pre-extract keywords from each query using the TF-IDF method. Then we use these keywords to pre-train the prior and posterior networks. The generation network is not pre-trained because in practice it converges fast in only a few epochs.
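How the latent representation enters the decoder can be sketched in a few lines. The snippet assumes, for illustration only, that the cluster and word embeddings share one dimension so that they can be summed, and it represents vectors as plain Python lists; the function names are hypothetical.

```python
def latent_representation(e_z, e_cz=None):
    """h_z = e_cz + e_z when two-stage sampling is used, otherwise h_z = e_z."""
    if e_cz is None:
        return list(e_z)
    if len(e_cz) != len(e_z):
        raise ValueError("cluster and word embeddings must share a dimension to be summed")
    return [c + w for c, w in zip(e_cz, e_z)]

def decoder_output_features(attn_hidden, h_z):
    """Concatenate the attentional hidden state with h_z at one decoding step."""
    return list(attn_hidden) + list(h_z)

# Toy vectors (assumed values, far smaller than real embedding sizes).
e_z = [0.1, -0.2, 0.3]     # word embedding of the sampled latent word
e_cz = [1.0, 0.0, -1.0]    # embedding of the sampled cluster
attn_hidden = [0.5, 0.5]   # attentional hidden state at the current step

h_z = latent_representation(e_z, e_cz)
features = decoder_output_features(attn_hidden, h_z)
```

Feeding `h_z` at every step, instead of only at the start of decoding, is what keeps the influence of the latent word from fading over long outputs.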
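A minimal sketch of TF-IDF keyword extraction over tokenized queries is given below. The exact TF-IDF variant, the number of keywords per query and the tie-breaking rule are not specified in the text, so the choices here (relative term frequency, unsmoothed logarithmic IDF, alphabetical tie-breaking, one keyword per query) are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_keywords(tokenized_queries, top_k=1):
    """Return the top_k highest-scoring TF-IDF words for each tokenized query."""
    n_docs = len(tokenized_queries)
    doc_freq = Counter()
    for tokens in tokenized_queries:
        doc_freq.update(set(tokens))  # count each word once per query

    keywords = []
    for tokens in tokenized_queries:
        term_freq = Counter(tokens)
        scores = {
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        }
        # Sort by descending score, breaking ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords.append(ranked[:top_k])
    return keywords

# Toy English queries standing in for segmented Chinese posts.
queries = [
    ["the", "weather", "is", "nice"],
    ["the", "game", "is", "tonight"],
    ["the", "weather", "forecast"],
]
keywords = tfidf_keywords(queries)
```

Words that occur in every query, such as "the" in this toy corpus, receive a score of zero and are therefore never selected as keywords.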

Experimental Settings
Next, we describe our experimental settings including the dataset, implementation details, all compared methods, and the evaluation metrics.

Dataset
We conduct our experiments on a short-text conversation benchmark dataset (Shang et al., 2015) which contains about 4 million post-response pairs from Sina Weibo, a Chinese social platform. We employ the Jieba Chinese word segmenter to tokenize the queries and responses into sequences of Chinese words. We use a vocabulary of 50,000 words (a mixture of Chinese words and characters), which covers 99.98% of words in the dataset. All other words are replaced with <UNK>.

Implementation Details
We use a single-layer bi-directional GRU for the encoder in the prior/posterior/generation network, and a one-layer GRU for the decoder in the generation network. The dimension of all hidden vectors is 1024. The cluster embedding dimension is 620. Except that the word embeddings are initialized by the word embedding corpus (Song et al., 2018), all other parameters are initialized by sampling from a uniform distribution over [−0.1, 0.1]. The batch size is 128. We use the Adam optimizer with a learning rate of 0.0001. For the number of clusters K in our method, we evaluate four different values (5, 10, 100, 1000) using automatic metrics and set K to 10, which performs best among the four options empirically. It takes about one day for every two epochs of our model on a Tesla P40 GPU, and we train ten epochs in total. During testing, we use beam search with a beam size of 10.

Compared Methods
In our work, we focus on comparing various methods that model p(y|x) differently. We compare our proposed discrete CVAE (DCVAE) with the two-stage sampling approach to three categories of response generation models:
1. Baselines: Seq2seq, the basic encoder-decoder model with soft attention mechanism (Bahdanau et al., 2015) used in decoding and beam search used in testing; MMI-bidi (Li et al., 2016a), which uses the MMI to re-rank results from beam search.
2. CVAE (Zhao et al., 2017): We adjust the original work, which is for multi-round conversation, for our single-round setting. For a fair comparison, we utilize the same keywords used in our network pre-training as the knowledge-guided features in this model.
3. Other enhanced encoder-decoder models: Hierarchical Gated Fusion Unit (HGFU) (Yao et al., 2017), which incorporates a cue word extracted using pointwise mutual information (PMI) into the decoder to generate meaningful responses; Mechanism-Aware Neural Machine (MANM) (Zhou et al., 2017), which introduces latent embeddings to allow for multiple diverse response generation.
Here, we do not compare with RL/GAN-based methods, because every compared method could be extended in the same way, by replacing its objective with a reward function as in the RL-based methods or by adding a discriminator as in the GAN-based methods, to further improve its overall performance. Such extensions are not the contribution of our work; we leave to future work the study of our model, as well as other enhanced generation models, in combination with the RL/GAN-based methods.

Evaluation
To evaluate the responses generated by all compared methods, we compute the following automatic metrics on our test set:
1. BLEU: BLEU-n measures the average n-gram precision on a set of reference responses. We report BLEU-n with n=1,2,3,4.
2. Distinct-1 & distinct-2 (Li et al., 2016a): We count the numbers of distinct uni-grams and bi-grams in the generated responses and divide the numbers by the total number of generated uni-grams and bi-grams in the test set. These metrics can be regarded as an automatic measure of the diversity of the responses.
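The distinct-n ratio described above can be sketched as follows. The function name `distinct_n` and the toy responses are illustrative assumptions; the computation pools n-grams over all generated responses, as the description of the metric suggests.

```python
def distinct_n(responses, n):
    """Number of distinct n-grams divided by the total number of n-grams,
    both counted over all generated responses."""
    total = 0
    unique = set()
    for tokens in responses:
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

# Toy tokenized responses (English stand-ins for segmented Chinese output).
responses = [
    ["i", "do", "not", "know"],
    ["i", "do", "not", "care"],
    ["me", "too"],
]
dist1 = distinct_n(responses, 1)  # 7 distinct uni-grams out of 10
dist2 = distinct_n(responses, 2)  # 5 distinct bi-grams out of 7
```

A model that repeats the same generic reply for many queries drives both ratios towards zero, which is why they serve as a proxy for diversity.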
Three annotators from a commercial annotation company are recruited to conduct our human evaluation. Responses from different models are shuffled for labeling. 300 test queries are randomly selected, and annotators are asked to independently score the results of these queries on a three-point scale in terms of their quality: (1) Good (3 points): The response is grammatical, semantically relevant to the query, and more importantly informative and interesting; (2) Acceptable (2 points): The response is grammatical and semantically relevant to the query, but too trivial or generic (e.g., "我 不 知 道 (I don't know)", "我 也 是 (Me too)", "我喜欢 (I like it)", etc.); (3) Failed (1 point): The response has grammar mistakes or is irrelevant to the query.
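Scores on this three-point scale can be summarized by their mean and standard deviation together with the share of acceptable (2 or 3 points) and good (3 points) responses. The sketch below assumes a flat list of integer scores and uses the population standard deviation; whether the reported figures use the population or the sample estimate is not stated, so that choice is an assumption.

```python
import statistics

def summarize_scores(scores):
    """Aggregate 1-3 point quality scores into mean, std and two ratios."""
    n = len(scores)
    return {
        "mean": statistics.mean(scores),
        "std": statistics.pstdev(scores),            # population std (assumed)
        "acceptable": sum(s >= 2 for s in scores) / n,  # 2 or 3 points
        "good": sum(s == 3 for s in scores) / n,        # 3 points only
    }

# Hypothetical annotations for six responses of one model.
scores = [3, 2, 2, 1, 3, 2]
summary = summarize_scores(scores)
```

Reporting the two ratios alongside the mean separates a model that is merely safe (many 2-point responses) from one that is often informative (many 3-point responses).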

Experimental Results and Analysis
In the following, we will present results of all compared methods and conduct a case study on such results.Then, we will perform further analysis of our proposed method by varying different settings of the components designed in our model.

Results on All Compared Methods
Results on automatic metrics are shown on the left-hand side of Table 1. From the results we can see that our proposed DCVAE achieves the best BLEU scores and the second best distinct ratios. The HGFU has the best dist-2 ratio, but its BLEU scores are the worst. These results indicate that the responses generated by the HGFU are less close to the ground-truth references. Although the automatic evaluation generally indicates the quality of generated responses, it cannot accurately evaluate the generated responses, and the automatic metrics may not be consistent with human perceptions (Liu et al., 2016). Thus, we consider human evaluation results more reliable.
For the human evaluation results on the right-hand side of Table 1, we show the mean and standard deviation of all test results as well as the percentage of acceptable responses (2 or 3 points) and good responses (3 points only). Our proposed DCVAE has the best quality score among all compared methods. Moreover, DCVAE achieves a much higher good ratio, which means it generates more informative and interesting responses. Besides, the HGFU's acceptable and good ratios are much lower than those of our model, indicating that it may not maintain enough response relevance when encouraging diversity. This is consistent with the results of the automatic evaluation in Table 1. We also notice that the CVAE achieves the worst human annotation score. This validates that the original CVAE for open-domain response generation does not work well and that our proposed DCVAE is an effective way to improve the CVAE for better output diversity.

Case Study
Figure 2 shows four example queries with their responses generated by all compared methods. The Seq2seq baseline tends to generate less informative responses. Though MMI-bidi can select different words to be used, its generated responses are still far from informative. MANM can avoid generating generic responses in most cases, but sometimes its generated response is irrelevant to the query, as shown in the bottom-left case. Moreover, the latent responding mechanisms in MANM have no explicit or interpretable meaning. Similar results can be observed for HGFU. If the PMI selects irrelevant cue words, the resulting response may not be relevant. Meanwhile, responses generated by our DCVAE are more informative as well as relevant to the input queries.

Different Sizes of the Latent Space
We vary the size of the latent space (i.e., sampled word space Z) used in our proposed DCVAE. Figure 3 shows the automatic and human evaluation results when the latent space is set to the top 10k words, the top 20k words, and all words in the vocabulary. On the automatic evaluation results, as the sampled latent space gets larger, the BLEU-4 score increases but the distinct ratios drop. We find that though the DCVAE with a small latent space has a higher distinct-1/2 ratio, many generated sentences are grammatically incorrect. This is also why the BLEU-4 score decreases. On human evaluation results, all metrics improve with the use of a larger latent space. This is consistent with our motivation that open-domain short-text conversation covers a wide range of topics and areas, and the top frequent words are not enough to capture the content of most training pairs. Thus a small latent space, i.e., the top frequent words only, cannot model enough latent information, and a large latent space is generally favored in our proposed model.

Analysis on the Two-Stage Sampling
We further look into whether the two-stage sampling method is effective in the proposed DCVAE. Figure 4(a)/(b) compares the DCVAE with and without the two-stage sampling approach under automatic and human evaluation. Without the two-stage sampling approach, the performance drops drastically. This means that the proposed two-stage sampling method is important for the DCVAE to work well.
Besides, to validate the effectiveness of clustering, we implemented a modified DCVAE (DCVAE-CD) that uses a pure categorical distribution in which each variable has no exact meaning. That is, the embedding of each latent variable does not correspond to any word embedding. Automatic evaluation results of this modified model are shown in Figure 4(c). We can see that DCVAE-CD performs worse, which means the distribution on the word vocabulary is important in our model.

Conclusion
In this paper, we have presented a novel response generation model for short-text conversation via a discrete CVAE. We replace the continuous latent variable in the standard CVAE by an interpretable discrete variable, which is set to a word in the vocabulary. The sampled latent word has an explicit semantic meaning, acting as a guide to the generation of informative and diverse responses. We also propose to use a two-stage sampling approach to enable efficient selection of diverse variables from a large latent space, which is essential for our model. Experimental results show that our model outperforms various kinds of generation models under both automatic and human evaluations.

Figure 2: Examples of the generated responses. The sampled latent words (z) are shown in brackets.

Figure 3: Different sizes of the latent space used in the DCVAE: automatic evaluation (left) and human evaluation (right).
Figure 4: (a)/(b): Automatic/human evaluation on the DCVAE with/without the two-stage sampling approach. (c): Automatic evaluation on our proposed DCVAE and the modified DCVAE that uses a pure categorical distribution (DCVAE-CD) in which each variable has no exact meaning.
Figure 1: The architecture of the proposed discrete CVAE. $e_{c_z}$ and $e_z$ are embeddings of a cluster and a word sampled from the estimated discrete distributions. $e_{c_z}$ is only applied when the two-stage sampling approach in Section 3.2 is used. If $e_{c_z}$ is applied, the latent representation $h_z$ is the sum of $e_{c_z}$ and $e_z$; otherwise, $h_z$ is $e_z$. α denotes the attention weight. ⊕ denotes the sum of input vectors.