Towards Less Generic Responses in Neural Conversation Models: A Statistical Re-weighting Method

Sequence-to-sequence neural generation models have achieved promising performance on short text conversation tasks. However, they tend to generate generic/dull responses, leading to unsatisfying dialogue experience. We observe that in the conversation tasks, each query could have multiple responses, which forms a 1-to-n or m-to-n relationship in the view of the total corpus. The objective function used in standard sequence-to-sequence models will be dominated by loss terms with generic patterns. Inspired by this observation, we introduce a statistical re-weighting method that assigns different weights for the multiple responses of the same query, and trains the common neural generation model with the weights. Experimental results on a large Chinese dialogue corpus show that our method improves the acceptance rate of generated responses compared with several baseline models and significantly reduces the number of generated generic responses.


Introduction
Many recent works use neural networks to generate responses for open-domain dialogue systems (Shang et al., 2015; Sordoni et al., 2015; Vinyals and Le, 2015; Li et al., 2016a,c; Serban et al., 2017; Shen et al., 2017; Li et al., 2017; Yu et al., 2017; Xu et al., 2017). These methods are inspired by the sequence-to-sequence (Seq2Seq) framework (Sutskever et al., 2014), which was originally applied to Neural Machine Translation (NMT). They aim to maximize the probability of generating a response given an input query, and generally use maximum likelihood estimation (MLE) as their objective function. However, various problems occur when Seq2Seq models are used for dialogue generation. One of the most important is that such models are inclined to generate generic and dull responses (e.g., "I don't know") rather than meaningful and specific answers (Sordoni et al., 2015; Serban et al., 2016; Li et al., 2016a,c; Kannan et al., 2016; Li et al., 2017; Xie, 2017; Wei et al., 2017). A growing number of studies address the issue of generic responses. For example, Li et al. (2016a) used mutual information to reformulate the MLE objective, but this model tends to generate ungrammatical outputs. They further proposed a fast diverse decoding approach (Li et al., 2016b), which modifies beam search to re-rank meaningful responses into higher positions. Similar works explore different ways to encourage response diversity so that less generic responses are picked during decoding (Vijayakumar et al., 2016; Li and Jurafsky, 2016). In the reinforcement learning framework (Li et al., 2016c), the reward function used in decoding considers the ease of answering, which is measured by a distance to a set of 8 generic responses. Thus, it can also alleviate the problem of generating generic responses to some extent.
Lison and Bibauw (2017) proposed adding a weighting model to learn the "quality" of a query-response pair, but it relies heavily on additional inputs. All these works add extra optimization terms to the encoding or decoding modules of Seq2Seq, making training or prediction more complicated.
In this work, we consider why Seq2Seq often generates generic responses by analyzing the MLE objective function directly. We notice that multiple responses are often associated with a single input query. As shown in Figure 1, the relationship between queries and responses is much looser in conversation models than in NMT, since the space of possible responses is much larger than the space of possible translations for a given sentence. On one hand, the information in these responses is only required to be relevant to the input query, and usually differs from it. On the other hand, a query accepts large semantic diversity among its responses. Hence, there is a 1-to-n relationship between a query and its responses (Vinyals and Le, 2015). Meanwhile, there is an m-to-n relationship between all queries and responses in the training corpus. We then find that MLE, which learns a 1-to-1 mapping in response generation, naturally puts more emphasis on optimizing the frequent patterns. Thus, the model at the converged local optimum tends to output these patterns or their combinations, leading to generic responses.
Figure 1: An illustration of the differences between NMT and dialogue generation; panel (b) shows dialogue. Response 4 represents potential responses that are not collected in the corpus.
Inspired by this observation, we propose a statistical re-weighting method which modifies MLE by re-weighting the multiple responses for each query such that MLE will not be dominated by the frequent patterns or their combinations. The proposed method calculates the weights of a response with the consideration of two statistical features: similarity frequency and sentence length. Our model is simple and efficient to optimize without adding additional terms into the original Seq2Seq objective function. We validate the performance of our proposed method on a large Chinese dialogue corpus. Results show that it can improve the acceptance rate of the generated responses and significantly suppress the number of generic responses.

Proposed Method
Standard Seq2Seq models for NMT and dialogue generation aim at estimating the conditional probability p(y|x), where x = (x_1, ..., x_T) is an input sequence and y = (y_1, ..., y_T') is its corresponding output sequence, whose length T' may differ from T. During training, we learn all the model parameters θ by summing the negative log-likelihood of each sample pair (x, y) in the training corpus C:

$$L(C, \theta) = \sum_{(x, y) \in C} \ell(x, y, \theta) \tag{1}$$

$$\ell(x, y, \theta) = -\sum_{t=1}^{T'} \log p(y_t \mid x, y_1, \dots, y_{t-1}; \theta) \tag{2}$$

Recall that generic responses are those that are safe and universal for many queries and thus appear frequently in the training corpus. Hence, if we have two responses to x, one generic and the other containing more meaningful content, using L(C, θ) in Eq. 1 will put the same emphasis on optimizing each of their loss terms. Therefore, L(C, θ) contains a large number of patterns from the generic responses, and it is not surprising that the trained models get stuck in local optima that are inclined to generate these patterns or their combinations.
Based on this observation, we argue that a good loss function of Seq2Seq for dialogue generation should not be dominated by the patterns from generic responses. Here, we propose a re-weighting method for the responses of a query x. Specifically, ℓ(x, y, θ) in Eq. 1 is modified to be:

$$\ell_w(x, y, \theta) = w(y \mid x)\, \ell(x, y, \theta) \tag{3}$$

where w(y|x) ∈ (0, 1] is a soft weight for a response y of a query x. In the implementation, we normalize this loss at the mini-batch level for better computational efficiency. Hence, the resulting loss for a mini-batch B, L(B, θ), takes the form:

$$L(B, \theta) = \frac{\sum_{(x, y) \in B} \ell_w(x, y, \theta)}{\sum_{(x, y) \in B} w(y \mid x)} \tag{4}$$

We summarize two common properties of the responses:
• Responses containing patterns that appear frequently in the training corpus tend to be generic. Here, the patterns refer to either whole sentences or n-grams, and can be described by similarities among responses.
• Very short and very long responses should be avoided.
Owing to the MLE objective function, Seq2Seq frameworks are inclined to generate short responses that are universal replies, while long responses usually contain more specific information that may not generalize to most conversation scenarios. Hence, high-quality responses tend to be of moderate length.
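The mini-batch normalization of the weighted loss described above can be illustrated with a minimal sketch. It assumes that the normalization divides the weighted sum of per-sample losses by the sum of the weights in the batch; the function name `weighted_batch_loss` and the numbers are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def weighted_batch_loss(nll: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mini-batch loss: sum_i w_i * l_i / sum_i w_i.

    nll     -- per-sample negative log-likelihoods l(x, y, theta)
    weights -- per-sample soft weights w(y|x), each assumed to lie in (0, 1]
    """
    if len(nll) != len(weights):
        raise ValueError("nll and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(w * l for w, l in zip(weights, nll))
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Hypothetical batch: a generic response (low loss, low weight)
    # and a specific response (high loss, full weight).
    losses = [2.0, 6.0]
    print(weighted_batch_loss(losses, [1.0, 1.0]))   # uniform weights: plain average
    print(weighted_batch_loss(losses, [0.25, 1.0]))  # generic sample is down-weighted
```

With uniform weights the expression reduces to the ordinary batch average, so the standard objective is recovered as a special case; lowering the weight of the generic sample shifts the batch loss towards the specific one.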
We propose an estimator by considering these two properties:

$$w(y \mid x) = \frac{\Phi(y)}{\max_{r \in R} \Phi(r)} \tag{5}$$

where R denotes all collected responses of x in C. For each response, the estimator gives a score by:

$$\Phi(y) = \alpha E(y) + \beta F(y) \tag{6}$$

Here, E(y) and F(y) correspond to the two properties mentioned above, respectively:
• $E(y) = e^{-a f(y)}$, where f(y) is a function related to the frequency of response y. It could be formulated as

$$f(y) = \max\{0,\ |\{r \in C \mid D(y, r) \ge \tau\}| - b\} \tag{7}$$

where D(·, ·) refers to the similarity between two sentences, a is a scale factor, b is a bias, and τ ∈ [0, 1] is a threshold specifying the similarity above which two responses are considered identical. For instance, D could be simple exact matching, which is what we use in our experiments. Other measures, such as cosine similarity of TF-IDF vectors (over tokens or n-grams), can also be applied, but may encounter computational issues on a large corpus. A response with a higher frequency is assigned a smaller E(y).
• $F(y) = e^{-c\,\lvert\,|y| - |\hat{y}|\,\rvert}$, where |y| denotes the number of tokens in y, $|\hat{y}| = \frac{1}{|C|} \sum_{r \in C} |r|$ is the average length of responses in the whole training corpus, and c is a scale factor. Here, the "moderate length" is set to the average length of responses in the whole training corpus. In practice, we have tried using long responses (longer than the average length) to fine-tune the Seq2Seq model. Though this slightly increases the average length of generated responses, the generated responses suffer from more grammatical and fluency problems. Hence, if a response is too short or too long, it receives a low score F(y).
The hyper-parameters {α, β, a, b, τ, c} are held constant in the following experiments and are set to {0.5, 0.5, 0.33, 3, 1.0, 0.33}. In our experiments, we tried several hyper-parameter settings and found that our method is not sensitive to them and achieves stable results in general. Hence, we do not spend much effort on tuning these hyper-parameters specifically. To validate that the functions designed in Eq. 5 and Eq. 6 weight the responses effectively, Table 1 shows the weights of 8 responses for the query "其实单身也挺好的 (It's pretty good to be single)". As can be seen, the weights are reasonable: the higher-ranked responses are the more informative ones, with low similarity frequency and moderate length.
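The weight computation described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: it assumes exact matching as the similarity (τ = 1.0), a frequency term of the form f(y) = max(0, n(y) − b) where n(y) counts identical responses in the corpus, a score αE(y) + βF(y), and a final weight obtained by dividing by the largest score among the responses of the same query. The function name `response_weights` and the toy corpus are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Response = Tuple[str, ...]  # a tokenized response


def response_weights(
    corpus: List[Tuple[str, Response]],
    alpha: float = 0.5,
    beta: float = 0.5,
    a: float = 0.33,
    b: float = 3,
    c: float = 0.33,
) -> Dict[Tuple[str, Response], float]:
    """Return a weight in (0, 1] for every (query, response) pair."""
    # Exact-match frequency of each response over the whole corpus.
    freq = Counter(resp for _, resp in corpus)
    # Average response length over the whole corpus.
    avg_len = sum(len(resp) for _, resp in corpus) / len(corpus)

    def score(resp: Response) -> float:
        f = max(0, freq[resp] - b)                     # assumed form of f(y)
        e_term = math.exp(-a * f)                      # frequency term E(y)
        f_term = math.exp(-c * abs(len(resp) - avg_len))  # length term F(y)
        return alpha * e_term + beta * f_term

    # Group the distinct responses collected for each query.
    by_query: Dict[str, set] = {}
    for query, resp in corpus:
        by_query.setdefault(query, set()).add(resp)

    weights: Dict[Tuple[str, Response], float] = {}
    for query, responses in by_query.items():
        best = max(score(r) for r in responses)
        for r in responses:
            weights[(query, r)] = score(r) / best      # normalize per query
    return weights


if __name__ == "__main__":
    generic = ("i", "dont", "know")
    specific = ("being", "single", "gives", "you", "freedom")
    # Hypothetical corpus: the generic reply occurs under six queries.
    toy = [(f"q{i}", generic) for i in range(1, 7)] + [("q1", specific)]
    for pair, w in sorted(response_weights(toy).items()):
        print(pair, round(w, 3))
```

In this toy corpus the generic reply exceeds the bias b and is down-weighted for the query that also has a specific reply, while it keeps weight 1 for queries where it is the only collected response, a consequence of the assumed per-query normalization.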

Corpus and Evaluation
We crawl conversation pairs from some popular Chinese social media websites and select 7M high-quality pairs as our training corpus. Conventional metrics such as BLEU (Papineni et al., 2002) and perplexity are not suitable for response generation tasks. Following previous work (Li et al., 2016c), we apply human annotation. We randomly sample 500 queries (not used in training) as our test samples, and recruit 3 annotators to evaluate each generated response from two aspects:
• Fluency: 0 (unreadable), 1 (readable but with some grammar mistakes), 2 (fluent);
• Relevance: 0 (not relevant at all), 1 (relevant at a distant level), 2 (relevant, including the generic responses), 3 (relevant as well as interesting).
Acceptance is then automatically calculated as a metric reflecting whether the response is acceptable to real users. A response is assigned 1 when it gets Fluency ≥ 1 and Relevance ≥ 2; otherwise it is assigned 0.
We implement our baseline Seq2Seq model using the standard objective function in Eq. 1, with two LSTM layers for encoding/decoding and a standard beam search with a beam size of 5 (the best setting); we term this model Seq2Seq. We also compare several Seq2Seq variants:
• Seq2Seq-RS: training with a subset obtained by randomly sampling only one of the multiple responses for each query;
• Seq2Seq-MMI: applying maximum mutual information (Li et al., 2016a) (only the MMI-bidi);
• Seq2Seq-DD: applying the diverse decoding algorithm (Li et al., 2016b);
• Ours-RW: calculating weights via our re-weighting method proposed in Section 2. Without applying any other tricks, we implement three versions of our method, using E(·) only, F(·) only, and a linear combination of E(·) and F(·) as in Eq. 6, termed Ours-RW_E, Ours-RW_F and Ours-RW_EF, respectively.
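The Acceptance rule defined above (Fluency ≥ 1 and Relevance ≥ 2) is simple enough to state in code. The sketch below operates on one (fluency, relevance) rating pair per response; how the ratings of the 3 annotators are aggregated is not specified in the text, so that step is deliberately left out. The function names and the sample ratings are hypothetical.

```python
from typing import Iterable, Tuple


def acceptance(fluency: int, relevance: int) -> int:
    """Return 1 if the response is acceptable (Fluency >= 1 and Relevance >= 2), else 0."""
    return 1 if fluency >= 1 and relevance >= 2 else 0


def acceptance_rate(ratings: Iterable[Tuple[int, int]]) -> float:
    """Fraction of (fluency, relevance) rating pairs that are acceptable."""
    ratings = list(ratings)
    if not ratings:
        raise ValueError("ratings must not be empty")
    return sum(acceptance(f, r) for f, r in ratings) / len(ratings)


if __name__ == "__main__":
    # Hypothetical ratings for four generated responses.
    sample = [(2, 3), (1, 2), (2, 1), (0, 3)]
    print(acceptance_rate(sample))
```

Note that a generic but fluent response (Relevance = 2) still counts as acceptable under this rule, which is why the generic-response analysis is reported separately.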

Results and Discussion
Human annotation results are shown in Table 2. Several observations can be made. First, Seq2Seq-RS performs slightly worse than the baseline model. This means that simply discarding a large amount of training data to construct a 1-to-1 query-response subset for training does not work. Second, Seq2Seq-MMI not only fails to improve on the baseline but also tends to generate generic responses. Third, Seq2Seq-DD obtains higher relevance and acceptance scores than the baseline, which shows the effectiveness of re-ranking more meaningful responses into higher positions in beam search. Fourth, our method achieves the best performance on almost all metrics. When we use the strictly matched frequency of each response, Ours-RW_E does not perform better than the baseline model, because only about 0.5% of the responses in our training corpus have a frequency higher than 3. However, E(·) still enhances the performance of Ours-RW_EF, which performs best and increases the acceptance of the baseline model from 0.42 to 0.55. This validates that both similarity frequency and sentence length play important roles in generating better responses. Specifically, the average percentages of generated responses that are assigned relevance rating 2 (relevant, including the generic responses) and 3 (relevant as well as interesting) are presented in Table 4. They show that our method achieves a higher relevance score by generating more high-quality responses with rating 3.
To validate that our method effectively reduces the number of generated generic responses, we calculate distinct-1 and distinct-2 (Li et al., 2016a) for the compared methods, which are the numbers of distinct unigrams and bigrams, respectively, divided by the total number of generated words. As shown in Table 5, Ours-RW_EF achieves the best performance on both metrics. This indicates that our model outputs more diverse, and hence less generic, responses than the other compared methods.
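The distinct-n computation described above can be sketched as follows, assuming the generated responses are already tokenized. The function name `distinct_n` and the toy responses are hypothetical; following the definition in the text, both distinct-1 and distinct-2 are divided by the total number of generated words.

```python
from typing import List, Sequence


def distinct_n(responses: Sequence[List[str]], n: int) -> float:
    """Number of distinct n-grams divided by the total number of generated tokens."""
    total_tokens = sum(len(tokens) for tokens in responses)
    if total_tokens == 0:
        return 0.0
    ngrams = set()
    for tokens in responses:
        # Collect every contiguous n-gram within a single response.
        for i in range(len(tokens) - n + 1):
            ngrams.add(tuple(tokens[i:i + n]))
    return len(ngrams) / total_tokens


if __name__ == "__main__":
    repetitive = [["i", "dont", "know"], ["i", "dont", "know"]]
    varied = [["i", "dont", "know"], ["me", "too"]]
    print(distinct_n(repetitive, 1), distinct_n(repetitive, 2))
    print(distinct_n(varied, 1), distinct_n(varied, 2))
```

A model that keeps repeating the same reply scores low on both metrics, since repeated n-grams are counted once in the numerator but every token counts in the denominator.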
We further randomly sample another 100K queries (not used in training) and use the various models to generate responses. We compare the frequencies of several common generic responses appearing in the generated results, as shown in Table 6. The results show that our method can significantly reduce the number of generic responses. For instance, it reduces the occurrences of "我也不知道 (I don't know, either.)" by about 75% and those of "我也想知道 (I want to know, too)" by about 77%.

Conclusion
In this paper, we propose a statistical re-weighting method that weights multiple responses differently and optimizes the MLE objective function. The weight of each response is calculated from two terms, based on its similarity frequency and its length. Experiments show that our approach improves performance over the baseline models and significantly reduces the number of generated generic responses. This indicates that the mismatch between the objective function and the task can be alleviated through such re-weighting methods, with which current encoder-decoder architectures can make full use of the m-to-n training corpus and model dialogue generation tasks better.