Generating High-Quality and Informative Conversation Responses with Sequence-to-Sequence Models

Sequence-to-sequence models have been applied to the conversation response generation problem where the source sequence is the conversation history and the target sequence is the response. Unlike translation, conversation responding is inherently creative. The generation of long, informative, coherent, and diverse responses remains a hard task. In this work, we focus on the single turn setting. We add self-attention to the decoder to maintain coherence in longer responses, and we propose a practical approach, called the glimpse-model, for scaling to large datasets. We introduce a stochastic beam-search algorithm with segment-by-segment reranking which lets us inject diversity earlier in the generation process. We trained on a combined data set of over 2.3B conversation messages mined from the web. In human evaluation studies, our method produces longer responses overall, with a higher proportion rated as acceptable and excellent as length increases, compared to baseline sequence-to-sequence models with explicit length-promotion. A back-off strategy produces better responses overall, in the full spectrum of lengths.


Introduction
Building computer systems capable of generalpurpose conversation is a challenging problem. However, it is a necessary step toward building intelligent agents that can interact with humans via natural language, and for eventually passing the Turing test. The sequence-to-sequence (seq2seq) model has proven very popular as a purely datadriven approach in domains that can be cast as learning to map to and from variable-length sequences, with state-of-the art results in many domains, including machine translation Sutskever et al., 2014;Wu et al., 2016). Neural conversation models are the latest development in the domain of conversation modeling, with the promise of training computers to converse in an end-to-end fashion (Vinyals and Le, 2015;Shang et al., 2015;Sordoni et al., 2015;Wen et al., 2016). Despite promising results, there are still many challenges with this approach. In particular, these models produce short, generic responses that lack diversity (Sordoni et al., 2015;Li et al., 2015). Even when longer responses are explicitly encouraged (e.g. via length normalization), they tend to be incoherent ("The sun is in the center of the sun."), redundant ("i like cake and cake"), or contradictory ("I don't own a gun, but I do own a gun.").
In this paper, we provide two methods to address these issues with minimal modifications to the standard seq2seq model. First, we present a glimpse model that only trains on fixed-length segments of the target-side at a time, allowing us to scale up training to larger data sets. Second, we introduce a segment-based stochastic decoding technique which injects diversity earlier in the generated responses. Together, we find that these two methods lead to both longer responses and higher ratings, compared to a baseline seq2seq model with explicit length and diversitypromoting heuristics integrated into the generation procedure (see Table 1 for examples generated using our model).
In Section 2, we present a high-level overview of these two techniques. We then discuss each technique in more detail in Sections 3 and 4. Finally, we report small and large-scale experimental evaluations of the proposed techniques in Section 5.

Overview and Motivation
A major difference between translation and responding to conversations is that, in the former, the high-level semantic content to generate in the target sequence y is completely given by the source sequence, i.e., given the source x, there is low conditional entropy in the target distribution P (y|x). In the seq2seq approach, the decoder network therefore only has to keep track of where it is in the output, and the content to generate can be transformed from the relevant parts in the source via the attention mechanism . In contrast, in conversation response generation, the prompt turn may be short and general (e.g., "what do you have planned tonight"), while an appropriate response may be long and informative.
The standard seq2seq model struggles with generating long responses, since the decoder has to keep track of everything output so far in its fixedlength hidden state vector, which leads to incoherent or even contradictory outputs. To combat this, we propose to integrate target-side attention into the decoder network, so it can keep track of what has been output so far. This frees up capacity in the hidden state for modeling the higher-level semantics required during the generation of coherent longer responses. We were able to achieve small perplexity gains using this idea on the small Open-Subtitles 2009 data set (Tiedemann, 2009). However, we found it to be too memory-intensive when scaling up to larger data sets.
As a trade-off, we propose a technique (called the 'glimpse model') which interpolates between source-side-only attention on the encoder, and source and target-side attention on the encoder and decoder, respectively. Our solution simply trains the decoder on fixed-length glimpses from the target side, while having both the source sequence and the part of the target sequence before the glimpse on the encoder, thereby sharing the attention mechanism on the encoder. This can be implemented as a simple data-preprocessing technique with an unmodified standard seq2seq implementation, and allows us to scale training to very large data sets without running into any memory issues. See Figure 1 for a graphical overview, where we illustrate this idea with a glimpse-model of length 3.
Given such a trained model, the next challenge is how to generate long, coherent, and diverse responses with the model. As observed in the previous section and in other work, standard maximum a posteriori (MAP) decoding using beam search often yields short, uninformative, and high-frequency responses. One approach to produce longer outputs is to employ length-promoting heuristics (such as lengthnormalization (Wu et al., 2016)) during decoding. We find this increases the length of the outputs, however often at the expense of coherence. Another approach to explicitly create variation in the generated responses is to rerank the N -best MAP-decoded list of responses from the model using diversity-promoting heuristics  or a backward RNN (Wen et al., 2015). We find this works for shorter responses, but not for long responses, primarily for two reasons: First, the method relies on the MAP-decoding to produce the N -best list, and as mentioned above, MAP-decoding prefers short, generic responses. Second, it is too late to delay reranking in the beam search until the whole sequence has been generated, since beam-search decoding tends to yield beams with low diversity per given prompt, even when the number of beams is high. Instead, our solution is to break up the reranking over shorter segments, and to rerank segment-bysegment, thereby injecting diversity earlier during the decoding process, where it has the most impact on the resulting diversity of the generated beams.
To further improve variation in the generated responses, we replace the deterministic MAPdecoding of the beam search procedure with sampling. If a model successfully captures the distribution of responses given targets, one can expect simple greedy sampling to produce reasonable responses. However, due to model underfitting, the learned distributions are often not sharp enough, causing step-by-step sampling to accumulate errors along the way, manifesting as incoherent outputs. We find that integrating sampling into the beam-search procedure yields responses that are more coherent and with more variation overall.
In summary, the contributions of this work are the following: 1. We propose to integrate target-side attention in neural conversation models, and provide a A: I don't know why, but when I was a kid I used to do this all the time when I was younger. (Good, Bad, Bad) B: I have no idea why, but I feel exactly the same way. I don't know why, but my brain is so sensitive to air that it makes my nose twitch. (Excellent, Mediocre, Good) Yeah. I have this friend, and we get bored just hanging out here sometimes, so we take little trips to places. We went to Seattle a couple of months ago.
A: What kind of things do you like to do? (Excellent, Excellent, Excellent) B: That sounds like a great time. I have a lot of friends that go to Seattle, so I was wondering if you could share any of the locations? (Excellent, Excellent, Excellent) have you seen Suicide Squad?
A: I have not, but I will check it out. (Good, Good, Good) B: I saw it in a theater once and it was great. It was very... I don't know, a little dark. I like the psychological effects and the way it portrays the characters in the movie. (Excellent, Excellent, Excellent) Actually I like the snow too, except for today. It's all slushy and it's all over the sidewalks and it's gross.
A: I hate the snow. I hate the snow. I hate the snow. I hate the snow. I hate the snow. I hate the snow. I hate the snow. I hate the snow. (Acceptable, Good, Good) B: I'm in the same boat as you. I hate it when it's snowy and I don't like to shovel it all day. :( So this is a perfect excuse to not get out of bed and go somewhere else. (Excellent, Good, Acceptable) What do you hope to see in the future?
A: I want to be able to go back in time to when I was born. I want to be (Mediocre, Bad, Mediocre) B: I want to see an actual movie where the characters have the ability to hide the fact that they are actually in the movie. ( Table 1: Some example responses generated from the baseline (A) compared to our model (B), with human ratings in parentheses. These were selected for quality. See the appendix for an unbiased sample.
practical approach, referred to as the glimpse model, which scales well and is easy to implement on top of the standard sequence-tosequence model.
2. We introduce a stochastic beam-search procedure with segment-by-segment reranking which improves the diversity of the generated responses.
3. We present large-scale experiments with human evaluations showing the proposed techniques improve over strong baselines.
4. We release our collection of context-free conversation prompts used in our evaluations as a benchmark for future open-domain conversation response research.

Seq2Seq Model with Attention on Target
We discuss conversation response generation in the sequence-to-sequence problem setting. In this setting, there is a source sequence x = (x 1 , x 2 , ..., x M ), and a target sequence y = (y 0 , y 1 , y 2 , ..., y N ). We assume y 0 is always the start-of-sequence token and y N is the end-ofsequence token. In a typical sequence-to-sequence model, the encoder gets its input from the source sequence x and the decoder models the conditional language model P (y|x) of the target sequence y, given x.
Seq2seq models with attention  parameterize the per-symbol conditional probability as: for 1 ≤ i ≤ N , where DecoderRNN() is a recurrent neural network that map the sequence of decoder symbols into fixed-length vectors, and Attention() is a function that yields a fixed-size vector summary of the encoder symbols x (the 'focus') most relevant to predicting y i , given the previous recurrent state of the network h i−1 (the 'context'). The full conditional probability follows from the product rule, as: We propose to implement target-side attention by augmenting the attention mechanism to include the part of the target sequence already generated, i.e., we include y [0:i−2] in the arguments to the attention function: Attention(h i−1 , y [0:i−2] , x). We implemented this in TensorFlow (Abadi et al., 2015) using 3 LSTM layers on both the encoder and the decoder, with 1024 units per layer. We experimented on the OpenSubtitles 2009 data set, and obtained a small perplexity gain from the target-side attention: 24.6 without versus 24.2 with. However, OpenSubtitles is a small data set, The vanilla sequence-to-sequence model. Figure 1: The vanilla seq2seq with attention on the left, and our proposed target-glimpse model on the right. The symbol ">" and "<" are start-of-sequence and end-of-sequence, respectively. and the majority of its response sequences are shorter than 10 tokens. This may prevent us from seeing bigger gains, since our method is designed to help with longer outputs. In order to train on the much larger Reddit data set, we implemented this method on top of the GNMT model (Wu et al., 2016). Unfortunately, we met with frequent outof-memory issues, as the 8-layer GNMT model is already very memory-intensive, and adding targetside attention made it even more so. Ideally, we would like to retain the model's capacity in order to train a rich response model, and therefore a more efficient approach is necessary.

Encoder Decoder
To this end, we propose the target-glimpse model which has a fixed-length decoder. The target-glimpse model is implemented as a standard sequence-to-sequence model with attention, where the decoder has a fixed length K. During training, we split the target sequence into nonoverlapping, contiguous segments (glimpses) with fixed length K, starting from the beginning. We then train on each of these glimpses, one at a time on the decoder, while putting all target-side symbols before the glimpse on the encoder. For example, if a sequence y is split into two glimpses y 1 and y 2 , each with length K (y 2 may be shorter than K), then we will train the model with two examples, (x → y 1 ), and (x, y 1 → y 2 ). Each time the concatenated sequence on the left of the arrow is put on the encoder and the sequence on the right is put on the decoder. Figure 1(b) illustrates the training of (x, y 1 → y 2 ) when K = 3. In our implementation, we always put the source-side endof-sequence token at the end of the whole encoder sequence, and we split the glimpses according to the decoder time steps. For example, if the sequence y is y 0 , y 1 , y 2 , ..., y 10 , and K = 3, the first example will have y 0 , y 1 , y 2 on the input layer of the decoder, and y 1 , y 2 , y 3 on the output layer of the decoder. The second example has y 3 , y 4 , y 5 as input of the decoder and y 4 , y 5 , y 6 as the output of the decoder, and so on. In our experiments, we use K = 10.
While decoding each glimpse, the decoder therefore attends to both the source sequence and the part of the target sequence that precedes the glimpse, thereby benefiting from the GNMT encoder's bidirectional RNN. Through generalization, the decoder should learn to decode a glimpse of length K in any arbitrary position of the target sequence (which we will exploit in our decoding technique discussed in Section 4). One drawback of this model, however, is that the context inputs to the attention mechanism only include the words that have been generated so far in this glimpse, rather than the words from the full target side. The workaround that we use is to simply connect the last hidden state of the GNMT-encoder to the initial hidden state of the decoder 1 , thereby giving the decoder access to all previous symbols regardless of the starting position of the glimpse.

Stochastic Decoding with Segment-by-Segment Reranking
We now turn our attention from training to inference (decoding). Our strategy is to perform reranking with a normalized score at the segment level, where we generate the candidate segments using a trained glimpse-model and using a stochastic beam search procedure, which we discuss next. The full decoding algorithm proceeds segment by segment. The standard beam search algorithm generates symbols step-by-step by keeping a set of the B highest-scoring beams generated so far at each step 2 . The algorithm adds all possible single-token extensions to every existing beam, and then selects the top B beams. In our stochastic beam search algorithm, we replace this deterministic top-B selection by a stochastic sampling operation in order to encourage variation. Further, to discourage a single beam from dominating the search and decreasing the final response diversity, we perform a two-step sampling procedure: 1) For each singletoken extension of an individual beam we don't enumerate all possibilities, but instead sample a fixed number of D candidate tokens to be added to the beam. This yields a total of B ×D beams, each with one additional symbol. 2) We then compute the accumulated conditional log-probabilities for each beam (normalized across all B × D beams), and treat these as the logits for sub-sampling B beams for the next step. We repeat this procedure until we reach the desired segment-length H, or until a segment ends with the end-of-sequence token.
For a given source sequence, we can use this stochastic beam search algorithm to generate B candidate H-length segments as the beginning of the target sequence. We then perform a reranking step (described below), and keep one of these. The concatenation of the source and the first target segment is then used as the input for generating the next B candidate segments. The algorithm continues until the segment selected ends with an end-of-sequence token.
This algorithm behaves similarly to standard beam search when the categorical distribution used during the process is sharp ('peaked'), since the samples are likely to be the top categories (words) . However, when the distribution is smooth, many of the choices are likely. In conversation response generation we are dealing with a conditional probability model with high entropy, so this is what often happens in practice.
For the reranking, we normalize the scores using random prompts. In particular, suppose y k = y 1 , ..., y k−1 is a candidate segment, and (x, y 1:k−1 ) is the input to the stochastic beam search. The normalized score is then computed as follows: S (y k |x, y 1:k−1 ) = P (y k |x, y 1:k−1 ) x ∈Φ P (y k |x , y 1:k−1 ) (3) In this equation, the set Φ is a collection of randomly sampled source sequences (prompts). In our experiments, we randomly select Q prompts from the context-free evaluation set (introduced in the Experiments section).
It is worth noting that when Φ is an unbiased sample from P (x), the summation in the denominator is a Monte-Carlo approximation of P (y k |y 1:k−1 ). In the case of reranking whole target sequences y, this becomes the marginal P (y), which corresponds to the same diversitypromoting objective used in . However, we found that our approximation works better in terms of N-choose-1 accuracy (see Section 5.2), which suggests that its value may be closer to the true conditional probability.
In our experiments, we set number of random prompts Q to 15, segment length H to 10, number of beams B to 2, and samples per beam D to 10. We select a small value for B, since we find that larger values makes the algorithm behave more like standard beam search.

Experimental Results
In this section we present experimental results for evaluating the target-glimpse model and the stochastic decoding method that we presented. We train the model using the Google neural machine translation model (GNMT, (Wu et al., 2016)), on a data set that combines multiple sources mined from the Web: 1. The full Reddit data 3 that contains 1.7 billion messages (221 million conversations).
4. Dialogue-like texts that we recognized and extracted from the web (17 million conversations).
For all these data sets, we extract pairs of messages where one can be considered as a response to the other. For example, in the Reddit data set, the messages belonging to the same post are organized as a tree. A child node is a message that replies to its parent. This may not necessarily be true as people may be replying to other messages that are also visually close. However, for our current single-turn experiments, we treat these as a single exchange.
In this setting, the GNMT model trained on prompt-to-response pairs works surprisingly well without modification when generating short responses with beam search. Similar to previous work on neural conversation models, we find that the generated responses are almost always grammatical, and sometimes even interesting. They are also usually on topic. In addition, we found that even greedy sampling from the 8-layer GNMT model produces grammatical responses most of the time, although these responses are more likely to be semantically-broken than responses generated using standard beam search. We would like to leverage the benefits of greedy sampling, because the induced variation generates more surprises and may potentially help improve user-engagement, and we found that our proposed segment-based beam sampling procedure accomplishes this to some extent.

Evaluation Metric
It is difficult to come up with an objective evaluation metric for conversation response generation that can be computed automatically. The conditional distribution P (y|x) is supposed to have high entropy in order to be interesting (many possible valid responses to a given prompt). Therefore BLEU scores used in translation are not a good fit (also see (Liu et al., 2016)). Other than looking at the evaluation set perplexity, we use two metrics, the N-choose-1 accuracy and 5-scale sideby-side human evaluation. In the N-choose-K metric, we use the model as a retriever. Given a prompt, we ask the model to rank N candidate responses, where one is the ground truth and the other N − 1 are random responses from the same data set. We then calculate the N-choose-K accuracy as the proportion of trials where the true response is in the top K. The prompts used for evaluation are selected randomly from the same data set. This metric isn't necessarily correlated well with the true response quality, but provides a useful first diagnostic for faster experimental iteration. It takes about a day to train a small model on a single GPU that reaches 2-choose-1 accuracies of around 70% or 80%, but it is much harder to make progress on the 50-choose-1 accuracy. As a reference, human performance on the 10-choose-1 task is around 45% accuracy.
In the 5-scale human evaluation, we use a collection of 200 context-free prompts 4 . These prompts are collected from the following sources, and filtered to prompts that are context-free (i.e. do not depend on previous turns in the conversation), general enough, and by eliminating near duplicates: 1. The questions and statements that users asked an internal testing bot.
These can be either generic or specific. Some example prompts from this collection are shown in Table 1. These prompts are open-domain (not about any specific topic), and include a wide range of topics. Many require some creativity for answering, such as "Tell me a story about a bear." Our evaluation set is therefore not from the same distribution as our training set. However, since our goal is to produce good general conversation responses, we found it to be a good general purpose evaluation set.
The evaluation itself is done by human raters. They are well-trained for the purpose of ensuring rating quality, and they are native English speakers. The A 5-scale rating is produced for each prompt-response pair: Excellent, Good, Acceptable, Mediocre, and Bad. For example, the instructions for rating Excellent is "On topic, interesting, shows understanding, moves the conversation forward. It answers the question." The instruction for Acceptable is "On topic but with flaws that make it seem like it didnt come from a human. It implies an answer." The instruction for Bad is "A completely off-topic statement or question, nonsensical, or grammatically broken. It does not provide an answer." In our experiments, we perform the evaluations side-by-side, each time using responses generated from two methods. Every prompt-response pair is rated by three raters. We rate 200 pairs in total for every method, garnering 600 ratings overall. After the evaluation, we report aggregated results from each method individually.

Motivating Experiments
To see whether generating long responses is indeed a challenging problem, we trained the plain seq2seq with the GNMT model where the encoder holds the source sequence and the decoder holds the target sequence. We experimented with the standard beam search and the beam search with length normalization α = 0.8 similar to (Wu et al., 2016). With this length normalization the generated responses are indeed longer. However, they are more often semantically incoherent. It produces "I have no idea what you are talking about." more often, similarly observed in (Li et al., 2016). The human evaluation results are summarized in Figure 2(b). Methods that generate longer responses have more Bad and less Excellent / Good ratings.
We also performed the N-choose-1 evaluation on the baseline model using different normalization schemes. The results are shown in Table 2(a). No Normalization means that we use P (y|x) for scoring, Normalize by Marginal uses P (y|x)/P (y), as suggested in , and Normalize by Random Prompts is our scoring objective described in Section 4. The significant boost when using both normalization schemes indicates that the conditional log probability predicted by the model may be biased towards the language model probability of P (y). After adding the normalization, the score may be closer to the true conditional log probability.
Overall, this reranking evaluation indicates that our heuristic is preferred to scoring using the marginal. However, it is unfortunately hard to directly make use of this score during beam search decoding (i.e., generation), since the resulting sequences are usually ungrammatical, as also observed by . This is the motivation for using a segment-by-segment reranking procedure, as described in Section 4.

Large-Scale Experiments
For our large-scale experiments, we train our target-glimpse model on the full combined data set. Figure 2(d) shows the training progress curve. In this figure, we also include the curve for K = 1, that is, the glimpse model with decoder-length 1. It is clear enough that this model progresses much slower, so we terminated it early. However, it is surprising that the glimpse model with K = 10 progresses faster than the baseline model with only source-side attention, because the model is trained on examples with decoder-length fixed at 10, while the average response length is 38 in our data set. This means it takes on average 3.8x training steps for the glimpse model to train on the same number of raw training-pairs as the baseline model. Despite this, the faster progress indicates that target-side attention indeed helps the model generalize better.
The human evaluation results shown in Figure 2 compare our proposed method with the baseline seq2seq model. For this, we trained a length-10 target-glimpse model and decoded with stochastic beam-search using segment-by-segment reranking. In our experiments, we were unable to generate better long, coherent responses using the whole-sequence level reranking method from  compared to using standard beam search with length-normalization 6 . We therefore choose the latter as our baseline, because it is the only method which generates responses that are long enough that we can compare to. Figure 2 shows that our proposed method generates more long responses overall. One third of all responses are longer than 100 characters, while the baseline model produces only a negligible fraction. Although we do not employ any length-promoting objectives in our method, length-normalization is used for the baseline. For responses generated by our method, the proportion of Acceptable and Excellent responses remains constant or even increases as the responses grow longer. Conversely, human ratings decline sharply with length for the baseline model.
The percentage of test cases with major agreement is high for both methods. We consider a test to have major agreement if two ratings out of the three are the same. For the baseline method, 80% of the responses have major agreements, and for our method it is 70%.
However, shorter responses have a much smaller search space, and we find that standard beam search tends to generate better ("safer") short responses. To maximize cumulative response quality, we therefore implemented a backoff strategy that combines the strengths of the two methods. We choose to fallback to the baseline model without length normalization when the latter produces a response shorter than 40 characters, otherwise we use the response from our method. This corresponds to the white histogram in Figure 2(b). Compared to the other methods in the fig-ure, the combined strategy results in more ratings of Excellent, Good, Acceptable, and Mediocre, and fewer Bad ratings. With this strategy, among the responses generated for the same 200 prompts, 133 were from the standard beam search and 67 were from our model. Out of the 67 long responses, two thirds were longer than 60 characters and half were longer than 75 characters. To compare the combined model's performance with the baseline, we generated responses from both models using the same 200 prompts. For 20 of the response pairs, human raters had no preference, but for the remaining 180, human raters preferred the combined model's response in 103 cases and the baseline's in only 77, indicating a significant win.

Conclusion
The research of building end-to-end systems that can engage in general-purpose conversation is still in its infancy. More significant progress is expected to be made with more advanced neural architectures. However, our results reported in this paper show that minimal modeling change and a slightly more advanced decoding technique, combined with training over very large data sets, can still lead to noticeable improvements in the quality of responses generated using neural conversation models. Overall, we found using fixed-lengths in the decoder to make it easier to train on large data sets, as well as to allow us to improve the diversity and coherence of the generated responses earlier during generation, when it has most impact. While the focus of this work has been on conversation modeling, we expect some of these results to carry over to other sequence-to-sequence settings, such as machine translation or image-captioning.