ReCoSa: Detecting the Relevant Contexts with Self-Attention for Multi-turn Dialogue Generation

In multi-turn dialogue generation, the response is usually related to only a few contexts. Therefore, an ideal model should be able to detect these relevant contexts and produce a suitable response accordingly. However, the widely used hierarchical recurrent encoder-decoder models treat all the contexts indiscriminately, which may hurt the subsequent response generation process. Some researchers use cosine similarity or the traditional attention mechanism to find the relevant contexts, but these approaches suffer from either an insufficient relevance assumption or the position bias problem. In this paper, we propose a new model, named ReCoSa, to tackle this problem. Firstly, a word-level LSTM encoder is used to obtain the initial representation of each context. Then, the self-attention mechanism is utilized to update both the context and masked response representations. Finally, the attention weights between each context and the response representations are computed and used in the subsequent decoding process. Experimental results on both a Chinese customer service dataset and the English Ubuntu dialogue dataset show that ReCoSa significantly outperforms baseline models in terms of both metric-based and human evaluations. Further analysis on attention shows that the relevant contexts detected by ReCoSa are highly coherent with human understanding, validating the correctness and interpretability of ReCoSa.


Introduction
This paper is concerned with the multi-turn dialogue generation task, which is critical in many natural language processing (NLP) applications, such as customer services, intelligent assistants and chatbots. Recently, the hierarchical recurrent encoder-decoder (HRED) models (Serban et al., 2016; Sordoni et al., 2015) have been widely used in this area. In the encoding phase of these HRED models, a recurrent neural network (RNN) based encoder is first utilized to encode each input context to a vector, and then a hierarchical RNN is used to encode these vectors to one vector. In the decoding phase, another RNN decoder is used to generate the response based on the above vector. The parameters of both encoder and decoder are learned by maximizing the averaged likelihood of the training data. However, for this task, it is clear that the response is usually dependent on some relevant contexts, rather than all the context information. Here we give two examples, as shown in Table 1. In the first example, the response is clearly related to the closest context, i.e. the post. In the second example, the response is related to context1. In these cases, if we use all contexts indiscriminately, as in HRED, it is likely that much noise will be introduced to the model, and the generation performance will be hurt significantly. Therefore, it is critical to detect and use the relevant contexts for multi-turn dialogue generation.
To tackle this problem, some researchers try to define the relevance of a context by using a similarity measure, such as the cosine similarity in Tian et al. (Tian et al., 2017). However, the cosine similarity is computed between each context and the post, with the assumption that the relevance between a context and a response is equivalent to the relevance between the context and the corresponding post, which is clearly insufficient in many cases, e.g. example 2 in Table 1. Some other researchers, e.g. Xing et al. (Xing et al., 2018), make an attempt by introducing the traditional attention mechanism to HRED. However, some related contexts are far from the response in the multi-turn dialogue generation task, and the RNN-based attention model may not perform well because it is usually biased towards the close contexts (Hochreiter et al., 2001), namely the position bias problem. Therefore, how to effectively detect and use the relevant contexts remains a challenging problem in multi-turn dialogue generation.
In this paper, we propose a new model, namely ReCoSa, to tackle this problem. The core idea is to use the self-attention mechanism to measure the relevance between the response and each context. The motivation comes from the fact that self-attention is superior in capturing long-distance dependencies, as shown in (Vaswani et al., 2017). Specifically, we first use a word-level LSTM encoder to obtain the fixed-dimensional representation of each context. Then, we use the self-attention mechanism to get the context and masked response representations. Finally, we calculate the attention weight between the context and response representations as the relevance score, and use a decoder based on the related contexts to generate the corresponding response.
In our experiments, we use two public datasets to evaluate our proposed models, i.e. the Chinese customer service dataset and the English Ubuntu dialogue corpus. The results show that ReCoSa has the ability to produce more diverse and suitable responses than traditional HRED models and their attention variants. Besides, we conduct an analysis on attention, and the results show that ReCoSa obtains higher coherence with the human labels, which indicates that the relevant contexts detected by our model are reasonable.

Related Work
Despite many existing research works on single-turn dialogue generation (Zhang et al., 2018a,b), multi-turn dialogue generation has gained increasing attention. One reason is that it is more in accord with real application scenarios, such as chatbots and customer services. More importantly, the generation process is more difficult since there are more context information and constraints to consider (Zhang et al., 2018c,d; Zhou et al., 2016), which poses great challenges for researchers in this area. Serban et al. (Serban et al., 2016) proposed HRED, which uses the hierarchical encoder-decoder framework to model all the context sentences. Since then, HRED-based models have been widely used in different multi-turn dialogue generation tasks, and many variants have been proposed. For example, Serban et al. (Serban et al., 2017b,a) proposed Variable HRED (VHRED) and MrRNN, which introduce latent variables into the middle state to improve the diversity of generated responses.
However, simply treating all contexts indiscriminately is not proper for multi-turn dialogue generation, since the response is usually related to only a few previous contexts. Therefore some researchers try to define the relevance of the context by a similarity measure. For example, Tian et al. (Tian et al., 2017) proposed a weighted sequence (WSeq) attention model for HRED, using the cosine similarity to measure the degree of relevance. Specifically, they first calculate the cosine similarity between the post embedding and each context sentence embedding, and then use this normalized similarity score as the attention weight. We can see that their results are based on the assumption that the relevance between a context and a response is equivalent to the relevance between the context and the corresponding post. However, in many cases, this assumption does not hold. Recently, Xing et al. (Xing et al., 2018) introduced the traditional attention model to HRED and proposed a new hierarchical recurrent attention network (HRAN), which is similar to the Seq2Seq model with attention (Bahdanau et al., 2015). In this model, the attention weight is computed based on the current state, the sentence-level representation and the word-level representation. However, some relevant contexts in multi-turn dialogue generation are relatively far from the response, so the RNN-based attention model may not perform well because it is usually biased towards the close contexts (Hochreiter et al., 2001). Chen et al. (Chen et al., 2018) introduced the memory network into the VHRED model, so that the model can remember the context information. Theoretically, it can retrieve some relevant information from the memory in the decoding phase; however, it is not clear whether and how the system accurately extracts the relevant contexts.
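The WSeq-style weighting described above can be illustrated with a minimal Python sketch. This is not the implementation of Tian et al.; the function names `cosine` and `wseq_weights` are hypothetical, and softmax normalization is an assumption, since the text only says that the similarity score is normalized. The point of the sketch is that the weights depend only on the post, never on the response.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def wseq_weights(context_vecs, post_vec):
    """Score every context against the post and normalize the scores.

    Softmax normalization is an assumption made for this sketch.
    """
    sims = [cosine(c, post_vec) for c in context_vecs]
    exps = [math.exp(s) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


# Toy sentence embeddings (illustrative values only).
contexts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
post = [1.0, 0.0]
weights = wseq_weights(contexts, post)
```

A context that is relevant to the response but dissimilar to the post (the second toy context here) receives the smallest weight, which is the limitation discussed above.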
The motivation of this paper is how to effectively extract and use the relevant contexts for multi-turn dialogue generation. Different from previous studies, our proposed model can focus on the relevant contexts, with both long-distance and short-distance dependency relations, by using the self-attention mechanism.

Relevant Context Self-Attention Model
In this section, we describe our relevant context with self-attention (ReCoSa) model in detail, with the architecture shown in Figure 1. ReCoSa consists of a context representation encoder, a response representation encoder and a context-response attention decoder. For each part, we use the multi-head self-attention module to obtain the context representation, the response representation and the context-response attention weights. Firstly, the word-level encoder encodes each context as a low-dimensional representation, and a multi-head self-attention component transforms these representations and position embeddings to the context attention representation. Secondly, another multi-head self-attention component transforms the masked response's word embeddings and position embeddings to the response attention representation. Thirdly, the third multi-head attention component takes the context representation as key and value, and the response representation as query, in the context-response attention module. Finally, a softmax layer uses the output of the third multi-head attention component to obtain the word probability for the generation process.

Context Representation Encoder
We will introduce the main components of the context representation encoder in this section. The word-level encoder first encodes each context as a fixed vector. And then the context self-attention module transforms each sentence vector to a context representation.

Figure 1: The architecture of the ReCoSa model.

Word-level Encoder
We first introduce the LSTM-based word-level encoder (Bahdanau et al., 2015) used in our model.
Please note that in our paper the post is treated as the last context sentence $s_N$. Given a sentence $s_i$ as the input, a standard LSTM encodes it to a fixed-dimension vector $h_M$ as follows:
$$i_k = \sigma(W_i [h_{k-1}, w_k]), \quad f_k = \sigma(W_f [h_{k-1}, w_k]), \quad o_k = \sigma(W_o [h_{k-1}, w_k]),$$
$$l_k = \tanh(W_l [h_{k-1}, w_k]), \quad c_k = f_k \odot c_{k-1} + i_k \odot l_k, \quad h_k = o_k \odot \tanh(c_k),$$
where $i_k$, $f_k$ and $o_k$ are the input, memory (forget) and output gates, respectively. $w_k$ is the word embedding for $x_k$, and $h_k$ stands for the vector computed by the LSTM at time $k$ by combining $w_k$ and $h_{k-1}$. $c_k$ is the cell at time $k$, and $\sigma$ denotes the sigmoid function. $W_i$, $W_f$, $W_o$ and $W_l$ are parameters. We use the vector $h_M$ as the sentence representation. Therefore, we obtain the sentence representations $\{h^{s_1}, \dots, h^{s_N}\}$.
It has been widely accepted that the self-attention mechanism itself cannot distinguish between different positions, so it is crucial to encode the position information. There are various ways to encode positions, and the simplest one is to use an additional position embedding. In our work, we use parameterized position embeddings $P_i \in \mathbb{R}^d$, $i = 1, \dots, N$. The position embeddings are simply concatenated to the sentence representations. Finally, we obtain the sentence representations $\{(h^{s_1}, P_1), \dots, (h^{s_N}, P_N)\}$.
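As an illustration of the word-level encoder and the position concatenation, the following Python sketch runs a bias-free LSTM over toy word embeddings and appends a position embedding to each sentence vector. It is a sketch under stated assumptions rather than the authors' code: the gates act on the concatenation of the previous state and the current word embedding, biases are omitted, and all sizes, names (`lstm_encode`, `add_positions`) and random weights are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def lstm_encode(word_embs, W_i, W_f, W_o, W_l, hidden):
    """Run a bias-free LSTM over the word embeddings and return the last state h_M."""
    h = [0.0] * hidden
    c = [0.0] * hidden
    for w in word_embs:
        z = h + w  # concatenation of h_{k-1} and w_k
        i = [sigmoid(v) for v in matvec(W_i, z)]    # input gate
        f = [sigmoid(v) for v in matvec(W_f, z)]    # forget (memory) gate
        o = [sigmoid(v) for v in matvec(W_o, z)]    # output gate
        l = [math.tanh(v) for v in matvec(W_l, z)]  # candidate cell update
        c = [fk * ck + ik * lk for fk, ck, ik, lk in zip(f, c, i, l)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h


def add_positions(sentence_vecs, position_embs):
    """Concatenate each sentence vector with its position embedding."""
    return [h + p for h, p in zip(sentence_vecs, position_embs)]


# Toy setup: all sizes and random weights are illustrative assumptions.
rng = random.Random(0)
hidden, embed, pos_dim = 4, 3, 2


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


W_i, W_f, W_o, W_l = (rand_matrix(hidden, hidden + embed) for _ in range(4))
contexts = [[[rng.uniform(-1, 1) for _ in range(embed)] for _ in range(n_words)]
            for n_words in (5, 2, 7)]
sentence_vecs = [lstm_encode(s, W_i, W_f, W_o, W_l, hidden) for s in contexts]
position_embs = rand_matrix(len(contexts), pos_dim)
encoded = add_positions(sentence_vecs, position_embs)
```

Sentences of different lengths (5, 2 and 7 words here) all map to vectors of the same size, which is what allows the sentence-level self-attention to treat them uniformly.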

Context Self-Attention
Self-attention is a special attention mechanism that computes a sequence's representation using only the sequence itself, and it has been successfully applied to many tasks, such as machine translation, reading comprehension, summarization, and language understanding (Vaswani et al., 2017; Cheng et al., 2016; Parikh et al., 2016; Paulus et al., 2017; Shen et al., 2018). One critical advantage of self-attention is its ability to capture long-distance dependency information well (Vaswani et al., 2017), which is why we use this mechanism in our work.
In this paper, we adopt the multi-head attention mechanism (Vaswani et al., 2017). Given a matrix of $n$ query vectors $Q \in \mathbb{R}^{n \times d}$, keys $K \in \mathbb{R}^{n \times d}$ and values $V \in \mathbb{R}^{n \times d}$, the scaled dot-product attention computes the attention scores based on the following equation:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right) V,$$
where $d$ is the number of hidden units in our network.
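The scaled dot-product attention can be sketched in a few lines of plain Python. This is a minimal illustration rather than an optimized implementation; the function name `scaled_dot_product_attention` and the toy matrices are assumptions, and the scale is taken from the width of the key vectors.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Return (outputs, weights) for query rows Q, key rows K and value rows V."""
    d = len(K[0])
    outputs, weights = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights


# Toy example with two positions and two-dimensional vectors.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to one, and each output row is the corresponding weighted average of the value rows.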
The $H$ parallel heads are used to focus on different parts of the channels of the value vectors. Formally, for the $i$-th head, we denote the learned linear maps by $W_i^Q \in \mathbb{R}^{d \times d/H}$, $W_i^K \in \mathbb{R}^{d \times d/H}$ and $W_i^V \in \mathbb{R}^{d \times d/H}$, which correspond to queries, keys, and values, respectively. Then the scaled dot-product attention is used to calculate the relevance score between queries and keys, to output mixed representations. The mathematical formulation is:
$$M_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V).$$
Finally, all the vectors produced by the parallel heads are concatenated together to form a single vector. Again, a linear map is used to mix different channels from different heads:
$$M = \mathrm{Concat}(M_1, \dots, M_H), \qquad O = MW, \tag{1}$$
where $M \in \mathbb{R}^{n \times d}$ and $W \in \mathbb{R}^{d \times d}$. To obtain the context representation, the multi-head attention mechanism first feeds the matrix of sentence representation vectors $\{(h^{s_1}, P_1), \dots, (h^{s_N}, P_N)\}$ as the query, key and value matrices by using different linear projections. Then the context representation is computed as $O_s$ in equation 1. We use a feedforward network to output the context attention representation $O_s^f$.
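The multi-head computation can be sketched as follows, under stated assumptions: the projection matrices have shape d x d/H, each head scales by its own projected width (as in Vaswani et al., 2017), and the feedforward layer is omitted. The function `multi_head_attention` is a hypothetical helper; it takes separate query and key/value inputs so that the same code covers self-attention over the contexts and attention from response positions to contexts.

```python
import math
import random


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention, scaled by the projected key width."""
    scale = math.sqrt(len(K[0]))
    K_t = [list(col) for col in zip(*K)]
    weights = [softmax([s / scale for s in row]) for row in matmul(Q, K_t)]
    return matmul(weights, V)


def multi_head_attention(queries, keys_values, heads, W):
    """heads: list of (W_q, W_k, W_v), each d x d/H; W: d x d output map."""
    per_head = [attention(matmul(queries, W_q),
                          matmul(keys_values, W_k),
                          matmul(keys_values, W_v))
                for W_q, W_k, W_v in heads]
    # Concatenate the head outputs row by row, then mix channels with W.
    M = [[x for head in per_head for x in head[r]] for r in range(len(queries))]
    return matmul(M, W)


# Toy sizes (assumptions): 3 contexts, 2 response positions, d = 4, H = 2.
rng = random.Random(0)
d, H = 4, 2


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


heads = [tuple(rand_matrix(d, d // H) for _ in range(3)) for _ in range(H)]
W = rand_matrix(d, d)
context_vecs = rand_matrix(3, d)
response_vecs = rand_matrix(2, d)

O_s = multi_head_attention(context_vecs, context_vecs, heads, W)   # self-attention
O_d = multi_head_attention(response_vecs, context_vecs, heads, W)  # response as query
```

In a trained model each attention component would have its own parameters; the sketch reuses one set of random weights only to keep the example short.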

Response Representation Encoder
Given the response $Y = \{y_1, \dots, y_M\}$ as the input, another multi-head self-attention component transforms each word embedding and its position embedding to obtain the response representation. For each word $y_t$, this multi-head attention component feeds the matrix of response vectors $\{(w_1, P_1), \dots, (w_{t-1}, P_{t-1})\}$ as the query, key and value matrices by using different linear projections. Then the response's hidden representation is computed as $O_r$ in equation 1. During training, we apply a mask operator to the response: for each word $y_t$, we mask $\{y_{t+1}, \dots, y_M\}$ and only see $\{y_1, \dots, y_{t-1}\}$. For inference, we loop over the generated response $G$. Take the $t$-th generation step as an example. Given the context $C = \{s_1, \dots, s_N\}$ and the generated response $\{g_1, \dots, g_{t-1}\}$, we feed $\{g_1, \dots, g_{t-1}\}$ as the response representation to obtain the distribution of the $t$-th word in the generated response.
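The effect of the mask can be shown with a small Python sketch of masked self-attention. It is a simplified illustration, not the authors' code: projections and multiple heads are left out, and it assumes the usual convention that the input rows are shifted right, so that row t holds the representation of the previous word and may attend to rows 0 to t only. The function name `masked_self_attention` is hypothetical.

```python
import math


def masked_self_attention(X):
    """Self-attention in which row t attends only to rows 0..t.

    Projections are omitted, so queries, keys and values are all X.
    """
    d = len(X[0])
    out = []
    for t, q in enumerate(X):
        # Scores are computed only for visible positions; later ones are masked.
        scores = [sum(a * b for a, b in zip(q, X[j])) / math.sqrt(d)
                  for j in range(t + 1)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps]
        out.append([sum(w[j] * X[j][c] for j in range(t + 1)) for c in range(d)])
    return out


# Changing a later position must not change the outputs of earlier positions.
a = masked_self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = masked_self_attention([[1.0, 0.0], [0.0, 1.0], [5.0, -3.0]])
```

Because earlier rows never see later rows, the same network can be trained on full responses and then used step by step at inference time.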

Context-Response Attention Decoder
The third multi-head attention component takes the context attention representation $O_s^f$ as key and value, and the response hidden representation $O_r$ as query. The output is denoted as $O_d$. We also use a new feedforward network to obtain the hidden vector $O_d^f$, as in Section 3.1.2. Finally, a softmax layer is utilized to obtain the word probability for the generation process. Formally, given an input context sequence $C = \{s_1, \dots, s_N\}$, the log-likelihood of the corresponding response sequence $Y = \{y_1, \dots, y_M\}$ is:
$$\log P(Y \mid C; \theta) = \sum_{t=1}^{M} \log P(y_t \mid C, y_1, \dots, y_{t-1}; \theta).$$
Our model predicts the word $y_t$ from the hidden representation $O_d^f$ through the topmost softmax layer:
$$P(y_t \mid C, y_1, \dots, y_{t-1}; \theta) = P(y_t \mid O_d^f; \theta) = \mathrm{softmax}(W_o O_d^f),$$
where $W_o$ is the parameter matrix. Our training objective is to maximize the log-likelihood of the ground-truth words given the input contexts over the entire training set. Adam is used for optimization in our experiments.
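The training objective can be illustrated with a short Python sketch that sums the log-probabilities of the ground-truth words. The per-step logits stand in for the output of the softmax layer's linear map; the function names and the toy values are assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Log of the softmax distribution, computed stably."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def sequence_log_likelihood(step_logits, target_ids):
    """Sum over t of log P(y_t | C, y_1..y_{t-1}), one logit vector per step."""
    return sum(log_softmax(logits)[y] for logits, y in zip(step_logits, target_ids))


# Toy case: vocabulary of 4 words, 3 decoding steps, uniform logits.
step_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
target_ids = [2, 0, 3]
ll = sequence_log_likelihood(step_logits, target_ids)
```

With uniform logits every word has probability 1/4, so the log-likelihood of a three-word response is three times log(1/4); training raises this value for the ground-truth responses.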

Experiments
In this section, we conduct experiments on both Chinese customer service and English Ubuntu dialogue datasets to evaluate our proposed method.

Experimental Settings
We first introduce some empirical settings, including datasets, baseline methods, parameter settings, and evaluation measures.

Datasets
We use two public multi-turn dialogue datasets in our experiments. The Chinese customer service dataset, named JDC, consists of 515,686 conversational context-response pairs published by the JD contest. We randomly split the data into training, validation, and testing sets, which contain 500,000, 7,843 and 7,843 pairs, respectively. The English Ubuntu dialogue corpus, named Ubuntu (Lowe et al., 2015), is extracted from the Ubuntu question-answering forum. The original training data consists of 7 million conversations from 2004 to April 27, 2012. The validation data are conversational pairs from April 27, 2012 to August 7, 2012, and the test data are from August 7, 2012 to December 1, 2012. We use the official script to tokenize, stem and lemmatize, and the duplicates and sentences with length less than 5 or longer than 50 are removed. Finally, we obtain 3,980,000, 10,000 and 10,000 pairs for training, validation and testing, respectively.
For JDC, we use Chinese words as input. Specifically, we use the Jieba tool for word segmentation, and set the vocabulary size to 69,644. For Ubuntu, the word vocabulary size is set to 15,000. For a fair comparison among all the baseline methods and our methods, the numbers of hidden nodes are all set to 512, and batch sizes are set to 32. The maximum number of dialogue turns is 15 and the maximum sentence length is 50. The number of heads in the ReCoSa model is set to 6. Adam is utilized for optimization, and the learning rate is set to 0.0001. We run all the models on a Tesla K80 GPU card with TensorFlow.

Evaluation Measures
We use both quantitative metrics and human judgements for evaluation in our experiment. Specifically, we use two kinds of metrics for quantitative comparisons. One kind is traditional metrics, such as perplexity (PPL) and BLEU score, to evaluate the quality of generated responses. They are both widely used in NLP and multi-turn dialogue generation (Tian et al., 2017; Xing et al., 2018). The other kind is the recently proposed distinct metric (Li et al., 2016b), which evaluates the degree of diversity of the generated responses by calculating the number of distinct unigrams and bigrams in the generated responses.
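A distinct-n score can be computed as in the following Python sketch. Normalizing the number of distinct n-grams by the total number of n-grams is an assumption of this sketch, since the text above only mentions counting distinct unigrams and bigrams; the function name `distinct_n` and the toy responses are hypothetical.

```python
def distinct_n(responses, n):
    """Ratio of distinct n-grams to all n-grams over tokenized responses."""
    ngrams = []
    for tokens in responses:
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


# Toy generated responses, already tokenized.
responses = [["i", "am", "fine"], ["i", "am", "happy"]]
distinct_1 = distinct_n(responses, 1)
distinct_2 = distinct_n(responses, 2)
```

Here the two responses share "i am", so 4 of the 6 unigrams and 3 of the 4 bigrams are distinct; a model that repeats the same generic reply would score much lower.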
For human evaluation, given 300 randomly sampled contexts and their generated responses, three annotators (all students majoring in CS) are required to compare the ReCoSa model with each baseline, i.e. to label win, loss or tie, based on the coherence of the generated response with respect to the contexts. For example, the win label means that the response generated by ReCoSa is more proper than that of the baseline model.

Experimental Results
Now we demonstrate our experimental results on the two public datasets.

Metric-based Evaluation
The quantitative evaluation results are shown in Table 2. From the results, we can see that the attention-based models, such as WSeq, HRAN and HVMN, outperform the traditional HRED baselines in terms of BLEU and distinct-2 measures. That is because all these models further consider the relevance of the contexts in the optimization process. HRAN uses a traditional attention mechanism to learn the importance of the context sentences. HVMN uses a memory network to remember the relevant context. But their effects are both quite limited. Our proposed ReCoSa performs the best. Take the BLEU score on the JDC dataset as an example: the BLEU score of the ReCoSa model is 13.797, which is significantly better than that of HRAN and HVMN, i.e., 12.278 and 13.125. The distinct scores of our model are also higher than those of the baseline models, which indicates that our model can generate more diverse responses. We have conducted a significance test, and the result shows that the improvements of our model are significant on both the Chinese and English datasets, i.e., p-value < 0.01. In summary, our ReCoSa model has the ability to produce high-quality and diverse responses, as compared with baseline methods.

Human Evaluation
The human evaluation results are shown in Table 4. The percentages of win, loss and tie, as compared with the baselines, are given to evaluate the quality of the responses generated by ReCoSa.
From the results, we can see that the percentage of win is always larger than that of loss, which shows that our ReCoSa model significantly outperforms the baselines. Take JDC as an example. Compared with HRAN, WSeq and HVMN, ReCoSa achieves preference gains (win minus loss) of 10.35%, 25.86% and 13.8%, respectively. The kappa value (Fleiss, 1971) is presented to demonstrate the consistency of different annotators. We also conducted a significance test, and the result shows that the improvements of our model are significant on both datasets, i.e., p-value < 0.01.
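For reference, Fleiss' kappa for a fixed number of annotators per item can be computed as in the Python sketch below. The input format (one row of category counts per item, with categories such as win, loss and tie) and the function name `fleiss_kappa` are assumptions made for illustration; the sketch does not handle the degenerate case in which every rating falls into a single category.

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of annotators who assigned item i to category j.

    Assumes every item is rated by the same number of annotators.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Overall proportion of ratings per category.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(n_cats)]
    # Observed agreement per item.
    P_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    P_bar = sum(P_i) / n_items
    P_e = sum(p * p for p in p_j)  # agreement expected by chance
    return (P_bar - P_e) / (1 - P_e)


# Three annotators, categories (win, loss, tie); toy counts only.
perfect = fleiss_kappa([[3, 0, 0], [0, 3, 0], [0, 0, 3]])
mixed = fleiss_kappa([[2, 1, 0], [0, 3, 0], [1, 1, 1]])
```

A value of 1 means the annotators always agree, while values near 0 mean the agreement is no better than chance.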

Analysis on Relevant Contexts
To verify whether the performance improvements are owing to the detected relevant contexts, we conduct a further data analysis, including both quantitative evaluation and case study. Specifically, we randomly sample 500 context-response pairs from the JDC dataset, denoted as JDC-RCD. Three annotators (all CS PhD students) are employed to label each context according to human judgement. If a contextual sentence is related to the response, it is labeled as 1; otherwise it is labeled as 0. The kappa value of this labeled dataset is 0.514, indicating consistency among the annotators.
Figure 2: ReCoSa multi-head attention for example1 in Table 5. The x-coordinate shows the context sentences and the y-coordinate shows the generated words.

Quantitative Evaluation
Since HRED considers all the contexts as relevant, we calculate its error rate for evaluation, that is, one minus the proportion of cases in the JDC-RCD data in which all contexts are relevant. The error rate is 98.4%. Therefore, using all contexts indiscriminately is highly inappropriate for multi-turn dialogue generation.
Other models, such as WSeq, HRAN and HVMN, output a relevance score for each context based on the attention weight. Therefore we can treat the task as a ranking problem. Ranking evaluation measures, such as precision, recall and F1 score, are used for quantitative evaluation. We calculate the precision, recall and F1 score at top 1, 3, 5 and 10 for the WSeq model, the HRAN model and our ReCoSa model. The results are shown in Table 3. We can see that WSeq obtains the best score for P@1, R@1 and F1@1. That is because in 80% of the cases the post is labeled as 1, and the cosine similarity can rank the explicitly similar context sentence as top 1. Though WSeq has the best score for F1@1, it does not work well for F1@3, F1@5 and F1@10. That is because WSeq may lose some relevant contexts which are not explicitly similar to the post but are related to the response. Compared with HRAN and WSeq, ReCoSa performs better in most cases. Take P@3 as an example: the P@3 score of ReCoSa-head3 is 26.2, which is significantly better than that of HRAN and WSeq, i.e., 24.13 and 24.27. These results indicate that the relevant contexts detected by our ReCoSa model are highly coherent with human judgements. Furthermore, we calculate the averaged attention distance to the response, which is defined over the index $i$ of each context sentence $s_i$ and the attention weight $w_i$ of the $i$-th context.
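The ranking measures used in this analysis can be sketched in Python as follows. Contexts are ranked by attention weight and the top k are compared against the binary human labels. Dividing the precision by the number of contexts actually returned (which may be smaller than k for short dialogues) is an assumption of this sketch, and the function name `prf_at_k` and the toy values are hypothetical.

```python
def prf_at_k(weights, labels, k):
    """Precision, recall and F1 of the k contexts with the largest weights.

    weights[i] is the attention weight of context i; labels[i] is 1 if the
    annotators marked context i as relevant and 0 otherwise.
    """
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
    hits = sum(labels[i] for i in ranked)
    relevant = sum(labels)
    precision = hits / len(ranked) if ranked else 0.0
    recall = hits / relevant if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy dialogue with four contexts; contexts 0 and 3 are labeled relevant.
weights = [0.10, 0.50, 0.05, 0.35]
labels = [1, 0, 0, 1]
p1, r1, f1_1 = prf_at_k(weights, labels, 1)
p3, r3, f1_3 = prf_at_k(weights, labels, 3)
```

In this toy case the top-ranked context is irrelevant, so all measures at k = 1 are zero, while at k = 3 both relevant contexts are retrieved.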

Case Study
To facilitate a better understanding of our model, we give some cases in Tables 5 and 6, and draw the heatmaps of our ReCoSa model, including the six heads, to analyze the attention weights in Figures 2 and 3. From the results, we can first see that the attention-based models perform better than the models using all contexts indiscriminately. Take example1 of Table 5 as an example. The baselines using all contexts tend to generate common responses, such as 'What can I do for you?' and 'I am very happy to serve you.'. The attention-based models, i.e. HRAN, WSeq and ReCoSa, can generate relevant responses, such as 'Applying' and 'It's already done, and the system can't intercept the site.'. The response generated by our ReCoSa is more specific and relevant, i.e. 'Your servers order has not been updated yet, please wait.'. The reason is that ReCoSa considers the difference between contexts and focuses on the relevant contexts, i.e. context1 and context3. Figure 2 shows the heatmap of example1 in Table 5. The x-coordinate indicates context1, context2 and context3, and the y-coordinate indicates the generated words. The lighter the color is, the larger the attention weight is. We can see that ReCoSa pays more attention to the relevant contexts, i.e. context1 and context3, which is coherent with human understanding.
Figure 3: ReCoSa multi-head attention for example2 in Table 6. Panels (a) to (f) show the individual heads, e.g. (b) ReCoSa-head2 and (f) ReCoSa-head6. The x-coordinate shows the context sentences and the y-coordinate shows the generated words.
Our model also performs well in the case where the post (i.e. the closest context) and the ground-truth response are not on the same topic. For example2 in Table 6, the baselines all produce irrelevant or common responses, such as 'Do you have any other questions?' and 'Ok, I am looking for you! Replying to you is not timely enough, sorry!'. The reason is that the baseline models are weak in detecting long-distance dependency relations. However, by using the self-attention mechanism, our model gives a more relevant response with specific meaning: 'You could apply for sale, return the goods and place an order again'. Figure 3 shows the heatmap of example2 in Table 6. For example2, context2 is the most significant context and context1 is the least useful one. We can see that ReCoSa ignores context1 and pays more attention to context2. In short, our ReCoSa model can detect both long-distance and short-distance dependencies, even in the difficult case when the response is not related to the post.

Conclusion
In this paper, we propose a new multi-turn dialogue generation model, namely ReCoSa. The motivation comes from the fact that the widely used HRED-based models simply treat all contexts indiscriminately, which violates an important characteristic of multi-turn dialogue generation, i.e., the response is usually related to only a few contexts. Though some researchers have considered using a similarity measure such as cosine similarity, or the traditional attention mechanism, to tackle this problem, the detected relevant contexts are not accurate, due to either an insufficient relevance assumption or the position bias problem. Our core idea is to utilize the self-attention mechanism to effectively capture long-distance dependency relations. We conduct extensive experiments on both a Chinese customer service dataset and the English Ubuntu dialogue dataset. The experimental results show that our model significantly outperforms existing HRED models and their attention variants. Furthermore, our analysis shows that the relevant contexts detected by our model are significantly coherent with human judgements. Therefore, we conclude that the relevant contexts can be useful for improving the quality of multi-turn dialogue generation when proper detection methods, such as self-attention, are used.
In future work, we plan to further investigate the proposed ReCoSa model. For example, some topical information can be introduced to make the detected relevant contexts more accurate. In addition, the detailed content information can be considered in the relevant contexts to further improve the quality of the generated responses.