Why are Sequence-to-Sequence Models So Dull? Understanding the Low-Diversity Problem of Chatbots

Diversity is a long-studied topic in information retrieval that usually refers to the requirement that retrieved results should be non-repetitive and cover different aspects. In a conversational setting, an additional dimension of diversity matters: an engaging response generation system should be able to output responses that are diverse and interesting. Sequence-to-sequence (Seq2Seq) models have been shown to be very effective for response generation. However, dialogue responses generated by Seq2Seq models tend to have low diversity. In this paper, we review known sources and existing approaches to this low-diversity problem. We also identify a source of low diversity that has been little studied so far, namely model over-confidence. We sketch several directions for tackling model over-confidence and, hence, the low-diversity problem, including confidence penalties and label smoothing.


Introduction
Sequence-to-sequence (Seq2Seq) models (Sutskever et al., 2014) have been designed for sequence learning. Generally, a Seq2Seq model consists of two recurrent neural networks (RNNs) that serve as its encoder and decoder, respectively; through them, the model can not only deal with variable-length inputs and outputs separately, but can also be trained end-to-end. Seq2Seq models can use different settings for the encoder and decoder networks, such as the number of input/output units, ways of stacking layers, the dictionary, etc. After showing promising results on machine translation (MT) tasks (Sutskever et al., 2014; Wu et al., 2016), Seq2Seq models have also proved effective for tasks like question answering (Yin et al., 2015), dialogue response generation (Vinyals and Le, 2015), text summarization (Nallapati et al., 2016), constituency parsing (Vinyals et al., 2015a), image captioning (Vinyals et al., 2015b), and so on.
(Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI.)
Seq2Seq models form the cornerstone of modern response generation models (Vinyals and Le, 2015; Serban et al., 2016, 2017; Zhao et al., 2017). Although Seq2Seq models can generate grammatical and fluent responses, it has also been reported that the corpus-level diversity of Seq2Seq models is usually low, as many responses are trivial or non-committal, like "I don't know", "I'm sorry" or "I'm OK" (Vinyals and Le, 2015; Sordoni et al., 2015; Serban et al., 2016). We refer to this problem as the low-diversity problem.
In recent years, there have been several types of approach to diagnosing and addressing the low-diversity problem. The purpose of this paper is to understand the low-diversity problem, to understand what diagnoses and solutions have been proposed so far, and to explore possible new approaches. We first review the theory of Seq2Seq models; then we give an overview of known causes of, and existing solutions to, the low-diversity problem. We then connect the low-diversity problem to the concept of model over-confidence, and propose approaches to address the over-confidence problem and, hence, the low-diversity problem.

Sequence-to-Sequence Response Generation
Consider a dataset of message-response pairs (X, Y), where X = (x_1, x_2, ..., x_{|X|}) and Y = (y_1, y_2, ..., y_{|Y|}) are the input and output sequences, respectively. During training, the goal is to learn the relationship between X and Y, which can be formulated as maximizing the Seq2Seq model probability of Y given X:

    P(Y | X) = \prod_{t=1}^{|Y|} p(y_t | y_{<t}, X),    (1)

where y_{<t} = (y_1, y_2, ..., y_{t-1}) are the ground-truth tokens of the previous steps. Usually, Seq2Seq models employ Long Short-Term Memory (LSTM) networks as their encoder and decoder. The way a Seq2Seq model realizes (1) is to process the training inputs and outputs separately. On the encoder side, the input sequence X is encoded step by step, e.g., at step t:

    h^{enc}_t = LSTM^{enc}(h^{enc}_{t-1}, x_t; θ),    (2)

where h^{enc}_0 = 0 is the initial hidden state of the encoder LSTM, and θ denotes the model parameters. The hidden state of the last step, h^{enc}_{|X|}, is the vector representation of the input sequence X.
Then, the decoder LSTM is initialized with h^{dec}_0 = h^{enc}_{|X|} so that the output tokens can be conditioned on the input:

    h^{dec}_t = LSTM^{dec}(h^{dec}_{t-1}, y_{t-1}; θ),    (3)

where y_0 is a special token (e.g., START) indicating that the decoder should start generating, and y_{t-1} is the ground-truth token of the previous time step. The hidden state h^{dec}_t is then used to predict the output distribution via a multi-layer perceptron (MLP) and a softmax function:

    p(c_i | y_{<t}, X) = softmax(MLP(h^{dec}_t))_i,    (4)

where the c_i are the candidate tokens for y_t, usually represented as word embeddings. After obtaining this distribution, we can calculate the loss against the ground-truth distribution using, e.g., the cross-entropy loss function, and back-propagate it to force the Seq2Seq model to maximize (1). At test time, the step-wise decoder output distribution at step t is conditioned on the actual model outputs ŷ_{<t} and X, and the token with the highest probability is chosen as the output:

    ŷ_t = argmax_{c_i} p(c_i | ŷ_{<t}, X),    (5)

which is known as the maximum a posteriori (MAP) objective function.
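The greedy MAP decoding loop of (5) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `step_logits` is an invented stand-in for the decoder LSTM plus MLP, and the token ids are toy values.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a logit vector."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def greedy_decode(step_logits, start_id, eos_id, max_len=10):
    """Greedy (MAP) decoding: at each step, feed the previous output
    token back in and pick the argmax of the softmax distribution.
    `step_logits(prev_id, t)` stands in for the decoder LSTM + MLP."""
    y = start_id
    out = []
    for t in range(max_len):
        p = softmax(step_logits(y, t))
        y = int(np.argmax(p))  # token with the highest probability, eq. (5)
        if y == eos_id:
            break
        out.append(y)
    return out
```

Note that greedy decoding commits to the single most probable token at every step, which is exactly the behavior the over-confidence discussion below is concerned with.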

Diagnosing the Low-Diversity Problem
In the literature, three dominant diagnoses of the low-diversity problem have been put forward: lack of variability, an improper objective function, and a weak conditional signal. Below, we review these diagnoses of the low-diversity problem, with corresponding solutions, and we add a fourth diagnosis: model over-confidence.

Lack of variability
Serban et al. (2017) and Zhao et al. (2017) trace the cause of the low-diversity problem in Seq2Seq models back to a lack of model variability. The variability of Seq2Seq models differs from that of retrieval-based chatbots (Fedorenko et al., 2017): in this study, we focus on the lack of variability of system responses, while Fedorenko et al. (2017) deal with the low variability between responses and contexts.
To increase variability, Serban et al. (2017) and Zhao et al. (2017) propose to introduce variational autoencoders (VAEs) into Seq2Seq models. At generation time, the latent variable z provided by the VAE is used as a conditional signal for the decoder LSTM (Serban et al., 2017):

    h^{dec}_t = LSTM^{dec}(h^{dec}_{t-1}, [y_{t-1}; z]; θ),    (6)

where we omit the contextual hidden states for simplicity. At test time, z is randomly sampled from a prior distribution. Although this kind of method is effective, the improvement in the diversity of generated responses is actually brought about by the randomness of z; the underlying Seq2Seq model remains sub-optimal in terms of diversity. However, Li et al. (2016) point to a different cause: the MAP objective function may itself be responsible for the low-diversity problem, since maximizing only p(Y|X) can favor certain generic responses. They therefore propose to maximize the mutual information between X, Y pairs.
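The latent-variable conditioning described above amounts to drawing one z per response and concatenating it to every decoder input. The sketch below is illustrative only; the names and dimensions are invented for the example, and a real VAE would learn the prior and the embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_z = 8, 4  # illustrative embedding / latent sizes

def decoder_input(y_prev_emb, z):
    """Concatenate the previous-token embedding with the latent z,
    so that every decoding step is conditioned on the same sample z."""
    return np.concatenate([y_prev_emb, z])

# At test time, z is drawn once per response from the prior,
# e.g., a standard Gaussian:
z = rng.standard_normal(d_z)
x_t = decoder_input(rng.standard_normal(d_emb), z)  # fed to the decoder LSTM
```

Because a fresh z is sampled for each response, two decodings of the same message can differ, which is precisely where the added diversity comes from.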

Improper objective function
With the help of Bayes' theorem, Li et al. (2016) derive two Maximum Mutual Information (MMI) objective functions:

    Ŷ = argmax_Y { log p(Y | X) − λ log p(Y) + γ|Y| },    (MMI-antiLM)
    Ŷ = argmax_Y { (1 − λ) log p(Y | X) + λ log p(X | Y) + γ|Y| },    (MMI-bidi)

where λ and γ are hyper-parameters. Here, log p(Y) and log p(X | Y) come from a language model and a reverse model, respectively, the latter trained on response-message pairs (Y, X). Besides the time needed for training a reverse model, it should be noted that both objective functions need the length |Y| of candidate responses, which are maintained in N-best lists generated by beam search. To obtain N-best lists with enough diversity, Li et al. (2016) use a beam size of 200 during testing, which is much more time-consuming than the basic Seq2Seq model. Influenced by the MMI methods, several beam-search-based approaches (Li et al., 2016; Vijayakumar et al., 2016; Shao et al., 2017) focus on improving the diversity of N-best lists, in the hope of further enhancing one-best response diversity. However, there are faster approaches to the low-diversity problem that do not use beam search, such as the attention-based model that we describe below.
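MMI-bidi reranking of an N-best list can be sketched as follows; the candidate responses and log-probabilities here are invented toy numbers, and in practice log p(Y|X) and log p(X|Y) come from the forward and reverse Seq2Seq models.

```python
def mmi_bidi_score(log_p_y_given_x, log_p_x_given_y, length,
                   lam=0.5, gamma=0.1):
    """MMI-bidi reranking score for one candidate response:
    (1 - lam) * log p(Y|X) + lam * log p(X|Y) + gamma * |Y|."""
    return ((1 - lam) * log_p_y_given_x
            + lam * log_p_x_given_y
            + gamma * length)

def rerank(candidates, lam=0.5, gamma=0.1):
    """candidates: list of (response, log p(Y|X), log p(X|Y), |Y|).
    Returns the response with the best MMI-bidi score."""
    return max(candidates,
               key=lambda c: mmi_bidi_score(c[1], c[2], c[3], lam, gamma))[0]
```

A generic response like "i don't know" typically has a high p(Y|X) but a low p(X|Y) (it fits any message), so the reverse-model term pushes it down the ranking.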

Weak conditional signal
Since attention layers (Bahdanau et al., 2014) were introduced into Seq2Seq models for the MT task, they have become a de facto standard module of Seq2Seq models for response generation. The purpose of Seq2Seq attention layers differs from that of the Transformer model (Vaswani et al., 2017): the Transformer proposes to rely only on self-attention and to avoid using recurrence or convolutions, while the attention layers of Seq2Seq models aim at strengthening the input signal.
Although introducing attention layers brings improvements to the response generation task, Tao et al. (2018) argue that the original attention signal often focuses on particular parts of the input sequence, which is not strong enough for the Seq2Seq model to generate specific responses, thus causing the low-diversity problem. The authors propose to use multiple attention heads to encourage the model to focus on various aspects of the input, by mapping the encoder hidden states to K different semantic spaces:

    h^{enc}_{t,k} = W^k_p h^{enc}_t,  k = 1, ..., K,

where W^k_p ∈ R^{d×d} is a learnable projection matrix. The net effect of this extended attention mechanism is, indeed, an improvement in the diversity of generated responses. Readers are referred to (Tao et al., 2018) for more details.
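The projection into K semantic spaces can be sketched as follows, assuming (as the dimensions above suggest) that each head applies a plain linear map to every encoder hidden state; sizes and initialization are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 6, 3, 5  # hidden size, number of heads, input length

H = rng.standard_normal((T, d))     # encoder hidden states h^enc_1..T
W = rng.standard_normal((K, d, d))  # one learnable projection W_p^k per head

def multi_space_states(H, W):
    """Map the encoder states into K semantic spaces: H_k = H @ W_k^T.
    Each head then attends over its own projected copy of the input."""
    return np.stack([H @ W[k].T for k in range(len(W))])

Hk = multi_space_states(H, W)  # shape (K, T, d)
```

Each of the K projected state sequences feeds its own attention head, so different heads can specialize on different aspects of the input.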

Model over-confidence
As indicated by Hinton et al. (2015), one can think of the knowledge captured in conversation modeling as a mapping from an input sequence X to an output sequence Y, i.e., the distribution P(Y|X). Therefore, if responses have a low degree of diversity, the learned distribution P(Y|X) is questionable. According to (1), the sequence-level distribution P(Y|X) has a direct relationship with the token-level distribution. We therefore hypothesize that the token-level distribution P(y_t | y_{<t}, X), produced at the decoder side, may be the culprit.
The decoder LSTM serves as an RNN language model (RNNLM) conditioned on the input sequence (Sutskever et al., 2014). As the number of time steps increases, the influence of the input sequence X becomes weaker according to (3), and if the token-level distribution P(y_t | y_{<t}, X) is problematic, it will affect subsequent outputs (a "snowball effect"). An attention mechanism (Bahdanau et al., 2014; Tao et al., 2018) can be used to reinforce the influence of the input sequence, but the detrimental effect of P(y_t | y_{<t}, X) may still be stronger than the input signal.
To analyze the problem with P(y_t | y_{<t}, X), we train a Seq2Seq model without an attention layer and plot the token-level distributions of generic responses in Figure 1. Interestingly, we find that the distributions show signs of model over-confidence (Pereyra et al., 2017). When an attention mechanism is used, similar distributions can still be observed, as illustrated in Figure 2. In both figures, we can see a common trend of growing confidence: the highest probability at each step keeps growing, which confirms our conjecture of a snowball effect. Due to this effect, the final several tokens are of low quality; e.g., the no-attention model in Figure 1 starts to repeat itself, and the word "overlapping" produced by the attention model in Figure 2 is irrelevant to the user input.
A prediction is confident if the entropy of the output distribution is low. Over-confidence is often a symptom of over-fitting (Szegedy et al., 2016), which suggests that the inputs or outputs share much similarity in unknown respects.

[Figure 1: Token-level output distributions of a generic response generated by the model without attention. Only the top-10 probabilities are kept at each prediction step for simplicity, and the output is cut before the EOS token is emitted.]

[Figure 2: As Figure 1, but when an attention mechanism is used.]

Although it is hard to figure out what causes the over-fitting, maximizing entropy can usually help to regularize the model and make it generalize better. In (Pereyra et al., 2017), the authors propose to add the negative entropy of the output distribution to the negative log-likelihood loss function during training, which can easily be tailored for conversation modeling:

    L(θ) = − Σ_t [ log p(y_t | y_{<t}, X) + β H(p(· | y_{<t}, X)) ],

where β controls the strength of the confidence penalty, and H(·) is the entropy of the output distribution:

    H(p(· | y_{<t}, X)) = − Σ_i p(c_i | y_{<t}, X) log p(c_i | y_{<t}, X).
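A per-step version of this confidence-penalized loss can be sketched as follows; this is an illustrative NumPy rendering of the idea from Pereyra et al. (2017), not their implementation.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a logit vector."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p, eps=1e-12):
    """Shannon entropy of a probability vector (eps avoids log(0))."""
    return -np.sum(p * np.log(p + eps))

def penalized_nll(logits, target, beta=0.1):
    """Per-step loss with a confidence penalty:
    -log p(y_t | .) - beta * H(p(. | .)).
    Low-entropy (over-confident) distributions get no entropy bonus,
    so the model is pushed toward less peaked output distributions."""
    p = softmax(logits)
    return -np.log(p[target]) - beta * entropy(p)
```

With beta = 0 this reduces to the standard cross-entropy loss; larger beta values trade likelihood for higher-entropy, less over-confident predictions.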
The authors also show that this confidence penalty is closely related to label smoothing regularization (Szegedy et al., 2016); therefore, methods like neighborhood smoothing (Chorowski and Jaitly, 2016) may also be used to address the low-diversity problem. So far, there has been no published work analyzing the effectiveness of correcting for model over-confidence on the low-diversity problem. It is important to note that this fourth diagnosis of the low-diversity problem, i.e., that the problem is due to model over-confidence, is essentially different from the three types of diagnosis described earlier in this section. Among previously published diagnoses and methods, the VAE-based approaches actually bypass the low-diversity problem by introducing randomness; the MMI-based methods have an elegant theoretical basis, yet they end up relying on many extra modules, like reverse models and beam search, and the newly introduced hyper-parameters are not even learned from training data; the attention-based models offer a complementary approach, since strengthening the conditional signal is likely to make responses more specific, which should in turn improve corpus-level diversity. Model over-confidence may offer a simpler alternative: we believe that methods such as the confidence penalty are likely to alleviate the low-diversity problem in ways that differ from previous approaches.
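For comparison, uniform label smoothing (Szegedy et al., 2016) can be sketched as follows; the vocabulary size and epsilon here are illustrative values.

```python
import numpy as np

def smoothed_targets(target, vocab_size, eps=0.1):
    """Uniform label smoothing: move eps of the probability mass from
    the ground-truth token to all tokens uniformly."""
    q = np.full(vocab_size, eps / vocab_size)
    q[target] += 1.0 - eps
    return q

def smoothed_ce(log_probs, target, eps=0.1):
    """Cross-entropy of the model's log-probabilities against the
    smoothed target distribution instead of the one-hot target."""
    q = smoothed_targets(target, log_probs.shape[0], eps)
    return -np.sum(q * log_probs)
```

Because the target distribution itself is no longer one-hot, the model can never drive its output entropy to zero, which discourages the over-confident distributions discussed above.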

Next Steps
In this paper, we described the low-diversity problem for response generation, which is one of the main issues faced by current Seq2Seq-based conversation models. We reviewed existing diagnoses and corresponding approaches to this problem and also added a diagnosis that has not been proposed or used so far, i.e., model over-confidence.
By using entropy-maximizing approaches, such as the confidence penalty (Pereyra et al., 2017) or label smoothing (Szegedy et al., 2016), we believe that the low-diversity problem of Seq2Seq models can be alleviated. Moreover, entropy-maximizing methods may also alleviate the self-repetition problem (Li et al., 2017) of Seq2Seq models, since they can reduce the snowball effect and make later outputs more relevant. We also note that the low-diversity problem resembles the mode collapse problem of GANs (Goodfellow et al., 2014); inspiration may therefore be drawn from solutions like (Salimans et al., 2016; Metz et al., 2016).
In addition, since we now have four types of diagnosis of the low-diversity problem, each of which is likely to address part of the problem but not all of it, it is natural to systematically compare and combine approaches based on the different diagnoses. Understanding how solutions to the low-diversity problem help to improve the effectiveness of conversational agents for search-oriented tasks is another interesting line of future work.