Are Training Samples Correlated? Learning to Generate Dialogue Responses with Multiple References

Due to its potential applications, open-domain dialogue generation has become popular and achieved remarkable progress in recent years, but sometimes suffers from generic responses. Previous models are generally trained based on 1-to-1 mapping from an input query to its response, which actually ignores the nature of 1-to-n mapping in dialogue that there may exist multiple valid responses corresponding to the same query. In this paper, we propose to utilize the multiple references by considering the correlation of different valid responses and modeling the 1-to-n mapping with a novel two-step generation architecture. The first generation phase extracts the common features of different responses which, combined with distinctive features obtained in the second phase, can generate multiple diverse and appropriate responses. Experimental results show that our proposed model can effectively improve the quality of response and outperform existing neural dialogue models on both automatic and human evaluations.


Introduction
In recent years, open-domain dialogue generation has become a research hotspot in Natural Language Processing due to its broad application prospects, including chatbots and virtual personal assistants. Though plenty of systems have been proposed to improve the quality of generated responses from various aspects such as topic, persona modeling and emotion controlling (Zhou et al., 2018b), most of these recent approaches are primarily built upon the sequence-to-sequence architecture (Shang et al., 2015), which suffers from the "safe" response problem (Li et al., 2016a; Sato et al., 2017). This can be ascribed to modeling the response generation process as a 1-to-1 mapping, which ignores the 1-to-n nature of dialogue: multiple possible responses can correspond to the same query.
Figure 1: An illustration of the two-step generation architecture. Different from the conventional methods (shown in green color), which model each response from scratch every time, our method first builds a common feature of multiple responses and models each response based on it afterward.
To deal with the generic response problem, various methods have been proposed, including a diversity-promoting objective function (Li et al., 2016a), enhanced beam search (Shao et al., 2016), latent dialogue mechanisms (Zhou et al., 2018a), and Variational Autoencoder (VAE) based models (Serban et al., 2017). However, these methods still view multiple responses as independent ones and fail to model them jointly. Recently, Zhang et al. (2018a) introduced a maximum likelihood strategy in which, given an input query, only the most likely response is approximated rather than all possible responses; Rajendran et al. (2018) further implemented this idea with reinforcement learning for task-oriented dialogue. Although capable of generating the most likely response, these methods fail to model other possible responses and ignore the correlation of different responses.
In this paper, we propose a novel response generation model for open-domain conversation, which learns to generate multiple diverse responses with multiple references by considering the correlation of different responses. Our motivation lies in two aspects: 1) multiple responses for a query are likely correlated, which can facilitate building the dialogue system. 2) it is easier to model each response based on other responses than from scratch every time. As shown in Figure 1, given an input query, different responses may share some common features e.g. positive attitudes or something else, but vary in discourses or expressions which we refer to as distinct features. Accordingly, the system can benefit from modeling these features respectively rather than learning each query-response mapping from scratch.
Inspired by this idea, we propose a two-step dialogue generation architecture as follows. We jointly view the multiple possible responses to the same query as a response bag. In the first generation phase, the common feature of different valid responses is extracted, serving as a base from which each specific response in the bag is further approximated. The system then, in the second generation phase, learns to model the distinctive feature of each individual response which, combined with the common feature, can generate multiple diverse responses simultaneously.
Experimental results show that our method can outperform existing competitive neural models under both automatic and human evaluation metrics, which demonstrates the effectiveness of the overall approach. We also provide ablation analyses to validate each component of our model. To summarize, our contributions are threefold:
• We propose to model multiple responses to a query jointly by considering the correlations of responses with multi-reference learning.
• We consider the common and distinctive features of the response bag and propose a novel two-step dialogue generation architecture.
• Experiments show that the proposed method can generate multiple diverse responses and outperform existing competitive models on both automatic and human evaluations.

Related Work
Along with the flourishing development of neural networks, the sequence-to-sequence framework has been widely used for conversation response generation (Shang et al., 2015; Sordoni et al., 2015), where the mapping from a query x to a reply y is learned with the negative log likelihood. However, these models suffer from the "safe" response problem. To address this problem, various methods have been proposed. Li et al. (2016a) propose a diversity-promoting objective function to encourage diverse responses during decoding. Zhou et al. (2018a) introduce a responding mechanism between the encoder and decoder to generate various responses. Other work incorporates topic information to generate informative responses. However, these models are limited by their deterministic structure when generating multiple diverse responses. Besides, during the training of these models, response utterances are only used in the loss function and ignored in the forward computation, which can confuse the model by making it pursue multiple objectives simultaneously.
A few works explore changing the deterministic structure of sequence-to-sequence models by introducing stochastic latent variables. VAE is one of the most popular methods (Bowman et al., 2016; Serban et al., 2017; Cao and Clark, 2017), where the discourse-level diversity is modeled by a Gaussian distribution. However, it is observed that in the CVAE with a fixed Gaussian prior, the learned conditional posteriors tend to collapse to a single mode, resulting in a relatively simple scope (Wang et al., 2017). To tackle this, two models have been proposed: WAE (Gu et al., 2018), which adopts a Gaussian mixture prior network with Wasserstein distance, and VAD (Du et al., 2018), which sequentially introduces a series of latent variables to condition each word in the response sequence. Although these models overcome the deterministic structure of the sequence-to-sequence model, they still ignore the correlation of multiple valid responses, and each case is trained separately.
To consider the multiple responses jointly, the maximum likelihood strategy has been explored. Zhang et al. (2018a) propose the maximum generated likelihood criterion, which models a query with its multiple responses as a bag of instances and optimizes the model towards the most likely answer rather than all possible responses. Similarly, Rajendran et al. (2018) propose to reward the dialogue system if any valid answer is produced in the reinforcement learning phase. Though it considers multiple responses jointly, the maximum likelihood strategy fails to utilize all the references during training, with some cases ignored. In our approach, we consider multiple responses jointly and model each specific response separately by a two-step generation architecture.
Figure 2: The overall architecture of our proposed dialogue system, where the two generation steps and the testing process are illustrated. Given an input query x, the model aims to approximate the multiple responses in a bag {y} simultaneously with the continuous common and distinctive features, i.e., the latent variables c and z obtained from the two generation phases respectively.

Approach
In this paper, we propose a novel response generation model for short-text conversation, which models multiple valid responses for a given query jointly. We posit that a dialogue system can benefit from multi-reference learning by considering the correlation of multiple responses. Figure 2 demonstrates the whole architecture of our model. We now describe the details as follows.

Problem Formulation and Model Overview
The training data consist of N samples, each pairing a query x with the set of its valid responses {y}. A dialogue generation model aims to map the input query x to an output response y ∈ {y}. To achieve this, different from conventional methods which view the multiple responses as independent ones, we propose to consider the correlation of multiple responses with a novel two-step generation architecture, where the response bag {y} and each response y ∈ {y} are modeled by two separate features, one obtained in each generation phase. Specifically, we assume a variable c ∈ R^n representing the common feature of different responses and an unobserved latent variable z ∈ Z corresponding to the distinct feature of each y in the bag. The common feature c is generated in the first stage given x, and the distinctive feature z is sampled from the latent space Z in the second stage given the query x and the common feature c. The final responses are then generated conditioned on both the common feature c and the distinct feature z.

Common Feature of the Response Bag
In the first generation step, we aim to map from the input query x to the common feature c of the response bag {y}. Inspired by multi-instance learning (Zhou, 2004), we start from the simple intuition that it is much easier for the model to fit multiple instances from their mid-point than a random start-point, as illustrated in Figure 1.
To obtain this, we model the common feature of the response bag as the mid-point of the embeddings of multiple responses. In practice, we first encode the input x with a bidirectional gated recurrent unit (GRU) network to obtain an input representation h_x. Then, the common feature c is computed by a mapping network, implemented as a feed-forward neural network whose trainable parameters are denoted as θ. The feature c is then fed into the response decoder to obtain the intermediate response y_c, which is considered to approximate all valid responses. Mathematically, the objective is the log-likelihood of the responses in the bag under the decoder conditioned on c, averaged over the bag, i.e. (1/|{y}|) Σ_{y ∈ {y}} log p_ψ(y | c), where |{y}| is the cardinality of the response bag {y} and p_ψ represents the response decoder.
Besides, to measure how well the intermediate response y_c approximates the mid-point response, we set up an individual discriminator and use it to guide the mapping function toward better results. The discriminator first projects each utterance to an embedding space with fixed dimensionality via convolutional neural networks (CNNs) with different kernels, as shown in Figure 3. Then, the cosine similarity of the query and response embeddings is computed, denoted as D_θ(x, y), where θ represents the trainable parameters of the discriminator. For the response bag {y}, the average response embedding is used to compute the matching score. The objective for the intermediate response y_c is then to minimize the difference between D_θ(x, y_c) and D_θ(x, {y}), where y_c denotes the utterance produced by the decoder conditioned on the variable c.
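The matching score described above can be illustrated with a minimal sketch. It assumes utterance embeddings are already available as plain lists of floats (the CNN encoder is omitted), and the helper names `cosine`, `mean_vector` and `matching_gap` are hypothetical, introduced only for illustration.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def mean_vector(vectors: List[Sequence[float]]) -> List[float]:
    """Element-wise average of a non-empty list of vectors (the bag mid-point)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def matching_gap(
    query_emb: Sequence[float],
    intermediate_emb: Sequence[float],
    bag_embs: List[Sequence[float]],
) -> float:
    """D(x, y_c) - D(x, {y}), using the average response embedding for the bag."""
    score_intermediate = cosine(query_emb, intermediate_emb)
    score_bag = cosine(query_emb, mean_vector(bag_embs))
    return score_intermediate - score_bag
```

A gap of zero means the intermediate response matches the query exactly as well as the averaged bag embedding does; in the model itself this quantity would be computed on learned embeddings rather than fixed vectors.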
To overcome the discreteness and non-differentiability of sampled tokens, which break gradient propagation from the discriminator, we adopt a "soft" continuous approximation (Hu et al., 2017): the token at time-step t is replaced by softmax(o_t / τ), where o_t is the logit vector fed to the softmax function at time-step t and the temperature τ is annealed toward 0 as training proceeds, yielding increasingly peaked distributions. The whole loss for the step-one generation, L_first, then combines the averaged likelihood objective with the discriminator objective, and is optimized as a minimax game with adversarial training (Goodfellow et al., 2014).
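As a small illustration of the soft approximation, the sketch below computes a temperature-scaled softmax over one logit vector and the resulting expected embedding. The function names and the toy embedding table are assumptions made for the example, not part of the described system.

```python
import math
from typing import List, Sequence


def soft_token(logits: Sequence[float], tau: float) -> List[float]:
    """Softmax of logits / tau; as tau shrinks the output approaches one-hot."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [o / tau for o in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_embedding(
    probs: Sequence[float], embedding_table: Sequence[Sequence[float]]
) -> List[float]:
    """Probability-weighted average of word embeddings (one table row per word)."""
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]
```

Because the expected embedding is a smooth function of the logits, a discriminator that consumes it can pass gradients back to the generator, which is the purpose of the approximation.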

Response Specific Generation
The second generation phase aims to model each specific response in a response bag. In practice, we adopt the CVAE architecture with two prominent modifications. First, rather than modeling each response with the latent variable z from scratch, our model approximates each response based on the bag representation c, so that only the distinctive feature of each specific response remains to be captured. Second, the prior common feature c provides extra information for the sampling network, which is expected to shrink the latent search space. Specifically, similar to the CVAE architecture, the overall objective of the second generation phase consists of a reconstruction term for the response and a regularization term D(·||·), which measures the distance between the distribution given by the recognition network q_φ and that given by the prior network p_ϕ, with φ and ϕ as the trainable parameters.
In practice, the recognition network is implemented as a feed-forward network that outputs the parameters µ and σ² from the utterance representations h_x and h_y of the query and response, both obtained by the GRU; the latent variable is then drawn as z ∼ N(µ, σ²I). For the prior network, we consider two implementations. One is the vanilla CVAE model, where the prior p_ϕ(z|x, c) is modeled by another feed-forward network conditioned on the representations h_x and c, and the distance D(·||·) is measured by the KL divergence. The other adopts the WAE model (Gu et al., 2018), in which the prior p_ϕ(z|x, c) is modeled by a mixture of Gaussian distributions GMM({π_k, µ_k, σ_k² I}, k = 1, ..., K), where K is the number of Gaussian components and π_k is the mixture coefficient of the k-th component. To sample an instance, the Gumbel-Softmax reparametrization trick (Kusner and Hernández-Lobato, 2016) is utilized to normalize the coefficients. The distance here is the Wasserstein distance, which is implemented with an adversarial discriminator.
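For the vanilla CVAE variant, the two ingredients named above (sampling z from a diagonal Gaussian and a KL regularizer between recognition and prior distributions) can be sketched as follows. This is an illustration under assumptions: both networks are taken to output a mean and a log-variance per dimension, and the function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def reparameterize(
    mu: Sequence[float], log_var: Sequence[float], rng: random.Random
) -> List[float]:
    """Draw z = mu + sigma * eps with eps ~ N(0, I), so z stays differentiable in mu, sigma."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_diag_gaussians(
    mu_q: Sequence[float],
    log_var_q: Sequence[float],
    mu_p: Sequence[float],
    log_var_p: Sequence[float],
) -> float:
    """Closed-form KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) )."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, log_var_q, mu_p, log_var_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total
```

Here the first pair of arguments of `kl_diag_gaussians` plays the role of the recognition network output and the second pair that of the prior network output; the GMM prior with Wasserstein distance has no such closed form and is not covered by this sketch.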
Recall that in the second generation phase the latent variable z is intended to capture only the distinctive feature of each specific response. Hence, to distinguish the latent variable z of each separate response, we further introduce a multi-reference bag-of-word loss (MBOW), which requires the network to predict the current response y against the response bag. Here the probability is computed by a feed-forward network f as in the vanilla bag-of-word loss; {ȳ} is the complementary response bag of y, and its probability is computed as the average probability of the responses in the bag; and λ is a scaling factor accounting for the difference in magnitude. The MBOW loss thus penalizes the recognition network if other complementary responses can be predicted from the distinctive variable z. Besides, since the probability of the complementary term may approach zero, which makes it difficult to optimize, we adopt its lower bound in practice, which is expressed in terms of the vocabulary size |V|.
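The idea behind the MBOW loss can be made concrete with a toy sketch. The exact formula is not reproduced here, so the combined form below, log p(y_bow) + λ · log(1 − p({ȳ}_bow)), is an assumption chosen to match the verbal description; the function names and the dictionary-based word distribution are likewise hypothetical, and the lower bound used in practice is not modeled.

```python
import math
from typing import Dict, List


def bow_log_prob(word_probs: Dict[str, float], response: List[str]) -> float:
    """Log-probability of a response's words under one vocabulary distribution.

    Word order is ignored, as in a bag-of-word loss.
    """
    return sum(math.log(word_probs[w]) for w in response)


def mbow_objective(
    word_probs: Dict[str, float],
    response: List[str],
    other_responses: List[List[str]],
    lam: float = 1.0,
) -> float:
    """ASSUMED form: log p(y_bow) + lam * log(1 - mean_j p(ybar_j_bow)).

    Higher is better: the current response should be predictable from z,
    while the complementary responses in the bag should not be.
    """
    own = bow_log_prob(word_probs, response)
    if not other_responses:
        return own
    complement = sum(
        math.exp(bow_log_prob(word_probs, r)) for r in other_responses
    ) / len(other_responses)
    # Clamp to avoid log(0) when the complement is predicted with certainty.
    return own + lam * math.log(max(1.0 - complement, 1e-12))
```

Under this assumed form, a word distribution that also explains the complementary responses well receives a lower score, which matches the stated intent of making z specific to its own response.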
In total, the loss for the step-two generation, L_second, combines the CVAE objective with the MBOW loss and can be optimized in an end-to-end way.

Optimization and Testing
Our whole model can be trained in an end-to-end fashion. To train the model, we first pre-train the word embeddings using GloVe (Pennington et al., 2014). Then the modules of the model are jointly trained by optimizing the losses L_first and L_second of the two generation phases. To overcome the vanishing latent variable problem (Wang et al., 2017) of CVAE, we adopt the KL annealing strategy (Bowman et al., 2016), where the weight of the KL term is gradually increased during training. The other technique employed is the MBOW loss, which sharpens the distribution of the latent variable z for each specific response and alleviates the vanishing problem at the same time.
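KL annealing as described above can be sketched with a simple schedule. The linear ramp and the `warmup_steps` parameter are assumptions for illustration; the text only states that the KL weight is gradually increased during training.

```python
def kl_weight(step: int, warmup_steps: int) -> float:
    """Linearly increase the KL weight from 0 to 1 over warmup_steps, then hold at 1."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, max(0.0, step / warmup_steps))


def annealed_loss(
    reconstruction_loss: float, kl_term: float, step: int, warmup_steps: int
) -> float:
    """Reconstruction loss plus the KL term scaled by the current annealing weight."""
    return reconstruction_loss + kl_weight(step, warmup_steps) * kl_term
```

Early in training the KL term contributes little, which gives the decoder an incentive to use z before the regularizer pulls the posterior toward the prior.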
During testing, diverse responses can be obtained by the two generation phases described above, where the distinctive latent variable z corresponding to each specific response is sampled from the prior probability network. This process is illustrated in Figure 2. Capable of capturing the common feature of the response bag, the variable c is obtained from the mapping network and no intermediate utterance is required, which facilitates reducing the complexity of decoding.
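The testing procedure can be summarized in a short sketch. The three callables stand in for the mapping network, the prior network and the response decoder; their names and signatures are hypothetical placeholders rather than the actual implementation.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def generate_bag(
    h_x: Vector,
    mapping_net: Callable[[Vector], Vector],
    prior_sampler: Callable[[Vector, Vector], Vector],
    decoder: Callable[[Vector, Vector], str],
    num_responses: int,
) -> List[str]:
    """Two-step generation at test time.

    Step one computes the common feature c once per query; step two samples
    a distinctive feature z per response and decodes from (c, z). No
    intermediate utterance is decoded.
    """
    c = mapping_net(h_x)
    responses = []
    for _ in range(num_responses):
        z = prior_sampler(h_x, c)
        responses.append(decoder(c, z))
    return responses
```

Computing c a single time and reusing it for every sample reflects the point made above that the common feature is shared by the whole bag, while diversity comes only from the repeated draws of z.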

Dataset
Focusing on open-domain dialogue, we perform experiments on a large-scale single-turn conversation dataset, Weibo (Shang et al., 2015), where each input post is generally associated with multiple response utterances. Concretely, the Weibo dataset consists of short-text online chit-chat dialogues in Chinese, crawled from Sina Weibo. In total, there are 4,423,160 query-response pairs in the training set and 10,000 pairs for validation and testing; the training set contains around 200k unique queries, and each query used in testing is associated with four responses. For preprocessing, we follow the conventional settings (Shang et al., 2015).

Baselines
We compare our model with representative dialogue generation approaches as listed below:

S2S: the vanilla sequence-to-sequence model with attention mechanism, where standard beam search is applied in testing to generate multiple different responses.
S2S+DB: the vanilla sequence-to-sequence model with the modified diversity-promoting beam search method (Li et al., 2016b) where a fixed diversity rate 0.5 is used.
MMS: the modified multiple responding mechanisms enhanced dialogue model proposed by Zhou et al. (2018a) which introduces responding mechanism embeddings  for diverse response generation.
WAE: the conditional Wasserstein autoencoder model for dialogue generation (Gu et al., 2018) which models the distribution of data by training a GAN within the latent variable space.
Ours: we evaluate our full model (Ours) and conduct several ablation studies: the model with only the second-stage generation (Ours-First), the model without the discriminator (Ours-Disc), the model without the multi-reference BOW loss (Ours-MBOW), and the model with GMM prior networks (Ours+GMP).

Evaluation Metrics
To comprehensively evaluate the quality of generated response utterances, we adopt both automatic and human evaluation metrics:
BLEU: In dialogue generation, BLEU is widely used in previous studies (Yao et al., 2017; Shang et al., 2018). Since multiple valid responses exist for each query, we adopt multi-reference BLEU, where the evaluated utterance is compared to the multiple provided references simultaneously.
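To illustrate how multiple references enter the BLEU computation, the sketch below implements only the clipped n-gram precision, where each candidate n-gram is credited up to its maximum count in any single reference. The brevity penalty and the geometric mean over n-gram orders are omitted, and the function names are chosen for this example.

```python
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: List[str], references: List[List[str]], n: int) -> float:
    """Multi-reference modified n-gram precision used inside BLEU."""
    cand = ngram_counts(candidate, n)
    if not cand:
        return 0.0
    # For each n-gram, keep its largest count over all references.
    max_ref: Counter = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())
```

With several references, a candidate n-gram is counted as correct if it appears in any of them, so a response that resembles one valid reply is not penalized for differing from the others.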
Distinctness: To distinguish safe and commonplace responses, the distinctness score (Li et al., 2016a) is designed to measure word-level diversity by counting the ratio of distinctive [1,2]-grams. In our experiments, we adopt both Intra-Dist: the distinctness scores of multiple responses for a given query and Inter-Dist: the distinctness scores of generated responses of the whole testing set.
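The two distinctness variants can be sketched as follows. Averaging the per-query score over queries for Intra-Dist is an assumption of this example, as are the function names; the distinct-n ratio itself is the number of unique n-grams divided by the total number of n-grams.

```python
from typing import List


def distinct_n(responses: List[List[str]], n: int) -> float:
    """Ratio of unique n-grams to total n-grams over a set of tokenized responses."""
    unique = set()
    total = 0
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


def intra_dist(per_query: List[List[List[str]]], n: int) -> float:
    """Distinct-n within each query's responses, averaged over queries (assumed)."""
    if not per_query:
        return 0.0
    return sum(distinct_n(responses, n) for responses in per_query) / len(per_query)


def inter_dist(per_query: List[List[List[str]]], n: int) -> float:
    """Distinct-n over all generated responses of the whole test set."""
    flat = [tokens for responses in per_query for tokens in responses]
    return distinct_n(flat, n)
```

The two scores answer different questions: Intra-Dist drops when the responses to one query repeat each other, while Inter-Dist drops when the same phrases recur across different queries.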
Embedding Similarity: Embedding-based metrics compute the cosine similarity between the sentence embedding of a ground-truth response and that of the generated one. There are various ways to obtain the sentence-level embedding from the constituent word embeddings. In our experiments, we apply the three most commonly used strategies: Greedy matches each word of the reference with the most similar word in the evaluated sentence; Average uses the average of word embeddings; and Extreme takes the most extreme value among all words for each dimension of word embeddings in a sentence. Since multiple references exist, for each utterance to be evaluated, we compute its score with the most similar reference.
Human Evaluation with Case Analysis: As automatic evaluation metrics lose sight of the overall quality of a response (Tao et al., 2018), we also conduct human evaluation on 100 random samples to assess the generation quality along three independent aspects: relevance (whether the reply is relevant to the query), diversity (whether the reply uses diverse words) and readability (whether the utterance is grammatically well formed). Each property is assessed with a score from 1 (worst) to 5 (best) by three annotators. The evaluation is conducted blind: the reviewers do not know which model produced each utterance.

Table 3: Examples of replies to two input queries from the testing set. Utterances are shown in Chinese with English translations.
Query 1: These are some magnificent sights at the moment of the volcanic eruption.
Query 2: There remain ten minutes before we entering the era of win8. I am a geek of system updating.
Gold, Query 1:
• 大自然才是人类的最终boss。 Nature is the final boss of human.
• 真帅，12月份的时候就能亲眼看到了，好开心啊。 So cool! I am so happy to see it by myself in December.
• 被惊艳震撼到了。 I am deeply surprised and shocked.
• 震撼了，小小人类仰视造物主的强大。 Shocked! The imperceptible humanity looks up to the power of the creator.
Gold, Query 2:
• 问个白痴问题必须正版才能升级吧？ May I ask an idiot problem. Does the update require a license?
• 不是给平板电脑用的系统吗？ Isn't this system for PAD?
• 已经用了一个多月了，不过还是不喜欢8 I have used it for a month but I still don't like it 8
• 好久木用电脑了，想念。 Having not used the computer for a long time, I miss it.
Generated and intermediate utterances, Query 1:
• 一天一天就能看到了。 We can see it day after day.
• 天地之间的风景有如此之美。 How could there exist such amazing sights.
• 火山喷发瞬间的萤火虫。 The glowworm at the moment of volcanic eruption.
• 好壮观啊一定要保存下来。 It's so magnificent that it should be preserved.
• 大白天的不能看到。 It can't be seen during the day.
• 如果有机会亲眼所见过。 If you have chance to see it yourself.
• 如此这般这般淼小。 It is so so imperceptible.
Generated and intermediate utterances, Query 2:
• Do you want to use the phone?
• 我是升级了升级版了。 I have updated to the upgrade.
• 我还以为是我的电脑。 I thought it was my computer.
• 升级版的机器人。
• I am wondering what software it is.
• 我觉得微软的ui还不错。 I think the ui of Microsoft is not bad.
• 现在的产品已经不是新产品了。 The current product is not the new.
• 这个是什么应用啊。 What application is this.
• 我觉得这样的界面更像windows8。 I think interface like this looks more like windows8.

Implementation Details
All models are trained with the following hyperparameters: both the encoder and decoder are set to one layer with GRU cells, where the hidden state size of the GRU is 256; the utterance length is limited to 50; the vocabulary size is 50,000 and the word embedding dimension is 256; the word embeddings are shared by the encoder and decoder; all trainable parameters are initialized from a uniform distribution over [-0.08, 0.08]; we employ Adam (Kingma and Ba, 2014) for optimization with a mini-batch size of 128 and an initial learning rate of 0.001; gradient clipping is utilized to avoid gradient explosion, with the clipping value set to 5. For the latent variable, we adopt a dimension of 256, and the number of components of the Gaussian mixture for the prior networks in WAE is set to 5. As to the discriminator, we set the initial learning rate to 0.0002 and use 128 different kernels for each kernel size in {2, 3, 4}. The size of the response bag is limited to 10, where the instances inside are randomly sampled for each mini-batch. All the models are implemented with PyTorch 0.4.1.

Comparison against Baselines
Table 1 shows our main experimental results, with baselines shown at the top and our models at the bottom. The results show that our model (Ours) outperforms competitive baselines on various evaluation metrics. The Seq2seq based models (S2S, S2S+DB and MMS) tend to generate fluent utterances and can share some overlapped words with the references, as the high BLEU-2 scores show. However, the distinctness scores illustrate that these models fail to generate multiple diverse responses in spite of the diversity-promoting objective and responding mechanisms used. We attribute this to the fact that these models fail to consider multiple references for the same query, which may confuse the models and lead to a commonplace utterance. As to the CVAE and WAE models, with the latent variable to control the discourse-level diversity, diverse responses can be obtained.
Compared against these previous methods, our model achieves the best or second best performance on different automatic evaluation metrics, where the improvements are most consistent on BLEU-1 and embedding-based metrics, which demonstrates the overall effectiveness of our proposed architecture.
In order to better study the quality of generated responses, we also report the human evaluation results in Table 2. As the results show, although there remains a huge gap between existing methods and human performance (the Gold), our model achieves promising improvements over previous methods in generating appropriate responses with diverse expressions. The baselines each show both obvious strengths (readability for S2S and diversity for CVAE) and weaknesses (diversity for S2S and relevance for CVAE), which limits their overall performance; in contrast, our method outputs more diverse utterances while maintaining relevance to the input query and achieves a high overall score.

Ablation Study
To better understand the effectiveness of each component in our model, we further conduct ablation studies, with results shown at the bottom of Table 1. First, to validate the effectiveness of the common feature, we remove the first generation stage and obtain the Ours-First model. As the results on BLEU and embedding-based metrics show, the system benefits from the common feature, which yields better relevance to the query.
Moreover, pairwise comparisons of Ours-Disc vs. Ours and Ours-MBOW vs. Ours validate the effects of the discriminator and the modified multi-reference bag-of-word loss (MBOW). As the results show, the discriminator facilitates extracting the common feature and yields more relevant responses to the input query afterward. The MBOW loss, similar to the effects of the BOW loss in the CVAE, can lead to a more unique latent variable for each response and improve the final distinctness scores of generated utterances. In the experiments, we also observed the KL vanishing problem when training our model, and we overcame it with the KL weight annealing strategy and the MBOW loss described above.

Case Study and Discussion
Table 3 illustrates two examples of generated replies to input queries taken from the testing set. Comparing the CVAE and Ours, we find that although the CVAE model can generate diverse utterances, its responses tend to be irrelevant to the query and are sometimes not grammatically well formed, e.g. the words "glowworm" and "robot" in the sentences. In contrast, responses generated by our model show better quality, achieving both high relevance and diversity. This demonstrates the ability of the two-step generation architecture. For better insight into the procedure, we present the intermediately generated utterances, which show that the feature extracted in the first stage can focus on some common and key aspects of the query and its possible responses, such as "amazing" and "software". With the distinctive features sampled in the second generation phase, the model further revises the response and outputs multiple responses with diverse contents and expressions.
Recall that the common feature is expected to capture the correlations of different responses and serve as the base of a response bag from which different responses are further generated, as shown in Figure 1. To investigate the actual performance achieved by our model, we compute the distance between the input query/intermediate utterance and gold references/generated responses and present the results in Figure 4. As shown, intermediate utterances obtained in the first generation phase tend to approximate multiple responses with similar distances at the same time. Comparing the generated responses and the references, we find that generated responses show both high relevant and irrelevant ratios, as the values near 0.00 and 1.00 show. This agrees well with our observation that the model may sometimes rely heavily on, or ignore, the prior common feature information. From a further comparison between the input query and the intermediate utterance, we also observe that the intermediate utterance is more similar to the final responses than the input query is, which correlates well with our original intention shown in Figure 1.

Conclusion and future work
In this paper, we tackle the one-to-many query-response mapping problem in open-domain conversation and propose a novel two-step generation architecture that considers the correlation of multiple valid responses. Jointly viewing the multiple responses as a response bag, the model extracts the common and distinct features of different responses in two generation phases respectively to output multiple diverse responses. Experimental results illustrate the superior performance of the proposed model in generating diverse and appropriate responses compared to previous representative approaches. However, the modeling of the common and distinct features of responses in our method is currently implicit and coarse-grained. Directions for future work include pursuing better-defined features and easier training strategies.