Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network

Humans generate responses relying on semantic and functional dependencies, including coreference relations, among dialogue elements and their context. In this paper, we investigate matching a response with its multi-turn context using dependency information based entirely on attention. Our solution is inspired by the recently proposed Transformer in machine translation (Vaswani et al., 2017), and we extend its attention mechanism in two ways. First, we construct representations of text segments at different granularities solely with stacked self-attention. Second, we extract the truly matched segment pairs with attention across the context and response. We jointly introduce those two kinds of attention in one uniform neural network. Experiments on two large-scale multi-turn response selection tasks show that our proposed model significantly outperforms the state-of-the-art models.


Introduction
Building a chatbot that can naturally and consistently converse with human beings on open-domain topics has drawn increasing research interest in recent years. One important task in chatbots is response selection, which aims to select the best-matched response from a set of candidates given the context of a conversation. Besides playing a critical role in retrieval-based chatbots (Ji et al., 2014), response selection models have been used in the automatic evaluation of dialogue generation.

Figure 1: Example of a human conversation on Ubuntu system troubleshooting. Speaker A is seeking a solution for package management on his/her system and speaker B recommends using the Debian package manager, dpkg. But speaker A does not know dpkg, and asks a backchannel question (Stolcke et al., 2000), i.e., "no clue what do you need it for?", aiming to double-check whether dpkg could solve his/her problem. Text segments underlined in the same color across context and response can be seen as matched pairs.
Early studies on response selection use only the last utterance in the context for matching a reply, which is referred to as single-turn response selection (Wang et al., 2013). Recent works show that considering a multi-turn context can facilitate selecting the next utterance (Wu et al., 2017). The reason why richer contextual information works is that human-generated responses depend heavily on previous dialogue segments at different granularities (words, phrases, sentences, etc.), both semantically and functionally, over multiple turns rather than one turn (Lee et al., 2006; Traum and Heeman, 1996). Figure 1 illustrates semantic connections between segment pairs across context and response. As demonstrated, there are generally two kinds of matched segment pairs at different granularities across context and response: (1) surface text relevance, for example the lexical overlap of the words "packages"-"package" and the phrases "debian package manager"-"debian package manager"; (2) latent dependencies, by which segments are semantically or functionally related to each other, such as the word "it" in the response, which refers to "dpkg" in the context, and the phrase "its just reassurance" in the response, which latently points to "what packages are installed on my system", the question that speaker A wants to double-check.
Previous studies show that capturing those matched segment pairs at different granularities across context and response is the key to multi-turn response selection (Wu et al., 2017). However, existing models only consider textual relevance, and thus struggle to match responses that latently depend on previous turns. Moreover, Recurrent Neural Networks (RNNs) are conventionally used for encoding texts, which is too costly for capturing multi-grained semantic representations (Lowe et al., 2015; Wu et al., 2017). As an alternative, we propose to match a response with its multi-turn context using dependency information based entirely on the attention mechanism. Our solution is inspired by the recently proposed Transformer in machine translation (Vaswani et al., 2017), which addresses sequence-to-sequence generation using only attention, and we extend the key attention mechanism of the Transformer in two ways:

self-attention: By making a sentence attend to itself, we can capture its intra-sentence word-level dependencies. Phrases, such as "debian package manager", can be modeled with word-level self-attention over word embeddings, and sentence-level representations can be constructed in a similar way with phrase-level self-attention. By hierarchically stacking self-attention from word embeddings, we can gradually construct semantic representations at different granularities.
cross-attention: By making context and response attend to each other, we can generally capture dependencies between those latently matched segment pairs, which provides complementary information to textual relevance for matching a response with its multi-turn context.
We jointly introduce self-attention and cross-attention in one uniform neural matching network, namely the Deep Attention Matching Network (DAM), for multi-turn response selection. In practice, DAM takes each single word of an utterance in the context or response as the centric meaning of an abstractive semantic segment, and hierarchically enriches its representation with stacked self-attention, gradually producing more and more sophisticated segment representations surrounding the centric word. Each utterance in the context and the response are matched based on segment pairs at different granularities, considering both textual relevance and dependency information. In this way, DAM captures matching information between the context and the response from word level to sentence level; important matching features are then distilled with convolution and max-pooling operations, and finally fused into one single matching score via a single-layer perceptron.
We test DAM on two large-scale public multi-turn response selection datasets, the Ubuntu Corpus v1 and the Douban Conversation Corpus. Experimental results show that our model significantly outperforms the state-of-the-art models, and the improvement over the best baseline model on R_10@1 is over 4%. What is more, DAM is expected to be convenient to deploy in practice because most of the attention computation can be fully parallelized (Vaswani et al., 2017). Our contributions are twofold: (1) we propose a new matching model for multi-turn response selection with self-attention and cross-attention; (2) empirical results show that our proposed model significantly outperforms state-of-the-art baselines on public datasets, demonstrating the effectiveness of self-attention and cross-attention.

Conversational System
Building an automatic conversational agent is a long-cherished goal in Artificial Intelligence (AI) (Turing, 1950). Previous research includes task-oriented dialogue systems, which focus on completing tasks in vertical domains, and chatbots, which aim to consistently and naturally converse with human beings on open-domain topics. Most modern chatbots are data-driven, either in the fashion of information retrieval (Ji et al., 2014; Banchs and Li, 2012; Nio et al., 2014; Ameixa et al., 2014) or sequence generation (Ritter et al., 2011). Retrieval-based systems enjoy the advantage of informative and fluent responses because they search a large dialogue repository and select the candidate that best matches the current context. Generation-based models, on the other hand, learn patterns of responding from dialogues and can directly generate new responses.

Response Selection
Research on response selection can be generally categorized into single-turn and multi-turn. Most early studies are single-turn, considering only the last utterance for matching a response (Wang et al., 2013, 2015). Recent works extend this to the multi-turn conversation scenario: Lowe et al. (2015) use RNNs to read the context and the response, take the last hidden states as two semantic vectors representing them, and measure their relevance. Instead of only considering the last states of the RNN, Wu et al. (2017) take the hidden state at each time step as a text segment representation, and measure the distance between context and response via segment-segment matching matrices. Nevertheless, matching with dependency information is generally ignored in previous works.

Attention
Attention has been proven to be very effective in Natural Language Processing (NLP) (Bahdanau et al., 2015; Yin et al., 2016; Lin et al., 2017) and other research areas (Xu et al., 2015). Recently, Vaswani et al. (2017) proposed a novel sequence-to-sequence generation network, the Transformer, which is based entirely on attention. Not only can the Transformer achieve better translation results than conventional RNN-based models, but it is also very fast in training and prediction because the computation of attention can be fully parallelized. Previous works on the attention mechanism show its superior ability to capture semantic dependencies, which inspires us to improve multi-turn response selection with attention.

Problem Formalization
Given a dialogue data set D = {(c, r, y)_i}_{i=1}^{N}, where c = {u_0, ..., u_{n-1}} represents a conversation context with {u_i}_{i=0}^{n-1} as utterances and r as a response candidate, y ∈ {0, 1} is a binary label indicating whether r is a proper response for c. Our goal is to learn a matching model g(c, r) with D, which can measure the relevance between any context c and candidate response r.

Model Overview

Figure 2 gives an overview of DAM, which generally follows the representation-matching-aggregation framework to match a response with its multi-turn context. For each utterance u_i = [w_{u_i,t}]_{t=0}^{n_{u_i}-1} in a context and its response candidate r = [w_{r,t}]_{t=0}^{n_r-1}, where n_{u_i} and n_r stand for the numbers of words, DAM first looks up a shared word embedding table and represents u_i and r as sequences of word embeddings, namely U^0_i = [e^0_{u_i,0}, ..., e^0_{u_i,n_{u_i}-1}] and R^0 = [e^0_{r,0}, ..., e^0_{r,n_r-1}] respectively, where e ∈ R^d denotes a d-dimensional word embedding.
A representation module then constructs semantic representations at different granularities for u_i and r. Practically, L identical layers of self-attention are hierarchically stacked; each l-th self-attention layer takes the output of the (l−1)-th layer as its input, and composites the input semantic vectors into more sophisticated representations based on self-attention. In this way, multi-grained representations of u_i and r are gradually constructed, denoted as [U^l_i]_{l=0}^{L} and [R^l]_{l=0}^{L} respectively. Given [U^l_i]_{l=0}^{L} and [R^l]_{l=0}^{L}, utterance u_i and response r are then matched with each other in the manner of segment-segment similarity matrices. Practically, for each granularity l ∈ [0...L], two kinds of matching matrices are constructed, i.e., the self-attention-match M^{u_i,r,l}_self and the cross-attention-match M^{u_i,r,l}_cross, measuring the relevance between utterance and response with textual information and dependency information respectively.
Those matching scores are finally merged into a 3D matching image Q (we refer to it as Q because it is like a cube). The dimensions of Q represent each utterance in the context, each word in an utterance, and each word in the response respectively. Important matching information between segment pairs across the multi-turn context and the candidate response is then extracted via convolution with max-pooling operations, and further fused into one matching score via a single-layer perceptron, representing the matching degree between the response candidate and the whole context.
Specifically, we use a shared component, the Attentive Module, to implement both self-attention in representation and cross-attention in matching. We discuss in detail the implementation of the Attentive Module and how we use it to implement both self-attention and cross-attention in the following sections.

Attentive Module

Figure 3 shows the structure of the Attentive Module, which is similar to that used in the Transformer (Vaswani et al., 2017). The Attentive Module has three input sentences: the query sentence, the key sentence and the value sentence, namely Q = [e_i]_{i=0}^{n_Q-1}, K = [e_i]_{i=0}^{n_K-1} and V = [e_i]_{i=0}^{n_V-1} respectively, where n_Q, n_K and n_V denote the numbers of words in each sentence and e_i stands for a d-dimensional embedding; n_K is equal to n_V. The Attentive Module first takes each word in the query sentence to attend to words in the key sentence via Scaled Dot-Product Attention (Vaswani et al., 2017), then applies those attention results to the value sentence, which is defined as:

Att(Q, K) = [softmax(Q[i] · K^T / √d)]_{i=0}^{n_Q−1}
V_att = Att(Q, K) · V

Each row V_att[i] stores the fused semantic information of words in the value sentence that possibly have dependencies on the i-th word in the query sentence. For each i, V_att[i] and Q[i] are then added up, compositing them into a new representation that contains their joint meanings. A layer normalization operation (Ba et al., 2016) is then applied, which prevents vanishing or exploding gradients. A feed-forward network FFN with ReLU (LeCun et al., 2015) activation is then applied to the normalization result, in order to further process the fused embeddings, defined as:

FFN(x) = max(0, x · W_1 + b_1) · W_2 + b_2

where x is a 2D tensor with the same shape as the query sentence Q, and W_1, b_1, W_2, b_2 are learnt parameters. This kind of activation is empirically useful in other works, and we also adopt it in our model. The result FFN(x) is a 2D tensor with the same shape as x; FFN(x) is then residually added (He et al., 2016) to x, and the fusion result is normalized as the final output. We refer to the whole Attentive Module as AttentiveModule(Q, K, V). As described, the Attentive Module can capture dependencies across the query sentence and the key sentence, and further use the dependency information to composite elements in the query sentence and the value sentence into compositional representations. We exploit this property of the Attentive Module to construct multi-grained semantic representations as well as to match with dependency information.
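As a concrete illustration, the Attentive Module can be sketched in NumPy. This is a minimal, single-example sketch under our own simplifications (no batching, caller-supplied FFN parameters), not the paper's actual TensorFlow implementation:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    # normalize each embedding over its feature dimension
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def attentive_module(Q, K, V, W1, b1, W2, b2):
    d = Q.shape[-1]
    # Scaled Dot-Product Attention: Att(Q, K) = softmax(Q K^T / sqrt(d))
    att = softmax(Q @ K.T / np.sqrt(d))             # (n_Q, n_K)
    V_att = att @ V                                 # fused value info, (n_Q, d)
    x = layer_norm(Q + V_att)                       # residual add + layer norm
    ffn = np.maximum(0.0, x @ W1 + b1) @ W2 + b2    # FFN with ReLU
    return layer_norm(x + ffn)                      # second residual + final norm
```

Self-attention is then `attentive_module(X, X, X, ...)` and cross-attention `attentive_module(U, R, R, ...)`, matching how the module is reused below.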

Representation
Given U^0_i or R^0, the word-level embedding representations of utterance u_i or response r, DAM takes U^0_i or R^0 as input and hierarchically stacks the Attentive Module to construct multi-grained representations of u_i and r, which is formulated as:

U^{l+1}_i = AttentiveModule(U^l_i, U^l_i, U^l_i)
R^{l+1} = AttentiveModule(R^l, R^l, R^l)

where l ranges from 0 to L−1, denoting the different levels of granularity. By this means, words in each utterance or response repeatedly function together to composite more and more holistic representations; we refer to those multi-grained representations as [U^l_i]_{l=0}^{L} and [R^l]_{l=0}^{L}.
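The stacking recipe can be sketched as follows. Here `self_attention_layer` is a deliberately simplified stand-in for AttentiveModule(X, X, X) that keeps only scaled dot-product attention and a residual connection, omitting the FFN and layer normalization of the full module:

```python
import numpy as np

def self_attention_layer(X):
    # simplified AttentiveModule(X, X, X): scaled dot-product self-attention
    # plus a residual connection (FFN and layer norm omitted for brevity)
    d = X.shape[-1]
    logits = X @ X.T / np.sqrt(d)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    att = e / e.sum(axis=-1, keepdims=True)
    return X + att @ X

def multi_grained(X0, L=5):
    # X0: word embeddings (n_words, d); returns [X^0, ..., X^L],
    # one representation per granularity level
    reps = [X0]
    for _ in range(L):
        reps.append(self_attention_layer(reps[-1]))
    return reps
```

The paper reports its best results with L = 5 stacked layers, hence the default here.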

Utterance-Response Matching
Given [U^l_i]_{l=0}^{L} and [R^l]_{l=0}^{L}, two kinds of segment-segment matching matrices are constructed at each level of granularity l, i.e., the self-attention-match M^{u_i,r,l}_self and the cross-attention-match M^{u_i,r,l}_cross. M^{u_i,r,l}_self is defined as:

M^{u_i,r,l}_self = {U^l_i[k]^T · R^l[t]}_{n_{u_i} × n_r}

in which each element of the matrix is the dot product of U^l_i[k] and R^l[t], the k-th embedding in U^l_i and the t-th embedding in R^l, reflecting the textual relevance between the k-th segment in u_i and the t-th segment in r at the l-th granularity. The cross-attention-match matrix is based on cross-attention, which is defined as:

Ũ^l_i = AttentiveModule(U^l_i, R^l, R^l)
R̃^l = AttentiveModule(R^l, U^l_i, U^l_i)
M^{u_i,r,l}_cross = {Ũ^l_i[k]^T · R̃^l[t]}_{n_{u_i} × n_r}

where we use the Attentive Module to make U^l_i and R^l attend to each other, constructing two new representations, written as Ũ^l_i and R̃^l respectively. Both Ũ^l_i and R̃^l implicitly capture semantic structures that cross the utterance and the response. In this way, inter-dependent segment pairs become close to each other in representation, and dot products between latently inter-dependent pairs get increased, providing dependency-aware matching information.
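A simplified sketch of the two matching matrices at one granularity level, again replacing the full Attentive Module with bare scaled dot-product attention (the function name and this simplification are ours):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def match_matrices(U, R):
    # U: utterance segment embeddings (n_u, d); R: response (n_r, d)
    d = U.shape[-1]
    M_self = U @ R.T                              # textual relevance (dot products)
    # cross-attention: each side re-represented by attending to the other
    U_cross = softmax(U @ R.T / np.sqrt(d)) @ R
    R_cross = softmax(R @ U.T / np.sqrt(d)) @ U
    M_cross = U_cross @ R_cross.T                 # dependency-aware relevance
    return M_self, M_cross
```

Both matrices have shape (n_u, n_r); DAM computes one pair per utterance and per granularity level l.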

Aggregation
DAM finally aggregates all the segmental matching degrees across each utterance and response into a 3D matching image Q, which is defined as:

Q = {Q_{i,k,t}}_{n × n_{u_i} × n_r}

where each pixel Q_{i,k,t} is formulated as:

Q_{i,k,t} = [M^{u_i,r,l}_self[k, t]]_{l=0}^{L} ⊕ [M^{u_i,r,l}_cross[k, t]]_{l=0}^{L}

⊕ is the concatenation operation, and each pixel has 2(L + 1) channels, storing the matching degrees between one certain segment pair at different levels of granularity. DAM then leverages two-layered 3D convolution with max-pooling operations to distill important matching features from the whole image. The operation of 3D convolution with max-pooling is the extension of typical 2D convolution, whose filters and strides are 3D cubes. We finally compute the matching score g(c, r) based on the extracted matching features f_match(c, r) via a single-layer perceptron, which is formulated as:

g(c, r) = σ(W_3 · f_match(c, r) + b_3)

where W_3 and b_3 are learnt parameters, and σ is the sigmoid function that gives the probability that r is a proper candidate for c. The loss function of DAM is the negative log likelihood, defined as:

L = −Σ_{(c,r,y)∈D} [y log g(c, r) + (1 − y) log(1 − g(c, r))]
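The assembly of the image Q and the final perceptron can be sketched as follows; the 3D convolution and max-pooling stage between the two functions is omitted, and `build_matching_image` / `matching_score` are illustrative names of ours:

```python
import numpy as np

def build_matching_image(self_mats, cross_mats):
    # self_mats, cross_mats: one list per utterance; each inner list holds
    # L+1 matrices of shape (n_u, n_r), one per granularity level
    # returns Q of shape (n_utterances, n_u, n_r, 2*(L+1))
    pixels = [np.stack(s + c, axis=-1) for s, c in zip(self_mats, cross_mats)]
    return np.stack(pixels)

def matching_score(f_match, W3, b3):
    # single-layer perceptron with sigmoid over extracted features f_match(c, r)
    return 1.0 / (1.0 + np.exp(-(f_match @ W3 + b3)))
```

In the full model, `f_match` would be produced by the two-layered 3D convolution with max-pooling applied to Q.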

Dataset
We test DAM on two public multi-turn response selection datasets, the Ubuntu Corpus v1 (Lowe et al., 2015) and the Douban Conversation Corpus (Wu et al., 2017). The former contains multi-turn dialogues about Ubuntu system troubleshooting in English, and the latter is crawled from a Chinese social network on open-domain topics. The Ubuntu training set contains 0.5 million multi-turn contexts, and each context has one positive response generated by a human and one negative response that is randomly sampled. Both the validation and test sets of the Ubuntu Corpus have 50k contexts, where each context is provided with one positive response and nine negative replies. The Douban corpus is constructed in a similar way to the Ubuntu Corpus, except that its validation set contains 50k instances with a 1:1 positive-negative ratio and its test set consists of 10k instances, where each context has 10 candidate responses, collected via a small inverted-index system (Lucene, https://lucene.apache.org/), and labels are manually annotated.

Evaluation Metric
We use the same evaluation metrics as in previous works (Wu et al., 2017). Each comparison model is asked to select the k best-matched responses from n available candidates for a given conversation context c, and we calculate the recall of the true positive replies among the k selected ones as the main evaluation metric, denoted as:

R_n@k = Σ_{i=1}^{k} y_i / Σ_{i=1}^{n} y_i

where y_i is the binary label for each candidate. In addition to R_n@k, we use MAP (Mean Average Precision) (Baeza-Yates et al., 1999), MRR (Mean Reciprocal Rank) (Voorhees et al., 1999), and precision-at-one P@1, especially for the Douban corpus, following the setting of previous works (Wu et al., 2017).

Ablation: To verify the effects of multi-grained representation, we set up two comparison models, i.e., DAM_first and DAM_last, which dispense with the multi-grained representations in DAM and instead use the representation results from the 0th and the L-th layer of self-attention respectively. Moreover, we set up DAM_self and DAM_cross, which use only the self-attention-match or only the cross-attention-match respectively, in order to examine the effectiveness of each.
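For a single context, R_n@k can be computed as in the following straightforward sketch; the reported metric averages this value over all test contexts:

```python
def recall_at_k(candidates, k):
    # candidates: list of (model_score, binary_label) pairs for one
    # context's n response candidates; returns R_n@k for that context
    ranked = sorted(candidates, key=lambda p: p[0], reverse=True)
    hits = sum(label for _, label in ranked[:k])         # positives in top k
    positives = sum(label for _, label in candidates)    # all positives
    return hits / positives
```

With one positive among ten candidates, as in the Ubuntu test set, this reduces to checking whether the positive reply is ranked within the top k.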

Model Training
We copy the reported evaluation results of all baselines for comparison. DAM is implemented in TensorFlow, and the vocabularies and word-embedding sizes for the Ubuntu corpus and the Douban corpus are all set the same as in SMN (Wu et al., 2017). We consider at most 9 turns and 50 words for each utterance (response) in our experiments; word embeddings are pre-trained on the training sets via word2vec (Mikolov et al., 2013), similar to previous works. We use zero-padding to handle variable-sized input, and the hidden size of the FFN is set to 200, the same as the word-embedding size. We test stacking 1-7 self-attention layers, and report our results with 5 stacks of self-attention because it gains the best scores on the validation set. We minimize the loss function defined in the Aggregation section with a gradient-based optimizer; the learning rate is initialized as 1e-3 and gradually decreased during training, and the batch size is 256. We use the validation sets to select the best models and report their performances on the test sets.

Table 1 shows the evaluation results of DAM as well as all comparison models. As demonstrated, DAM significantly outperforms the other competitors on both the Ubuntu Corpus and the Douban Conversation Corpus, including SMN_dynamic, the state-of-the-art baseline, demonstrating the superior power of the attention mechanism in matching a response with its multi-turn context. Besides, the performances of both DAM_first and DAM_self decrease a lot compared with DAM, which shows the effectiveness of self-attention and cross-attention. Both DAM_first and DAM_last underperform DAM, which demonstrates the benefit of using multi-grained representations. Also, the absence of the self-attention-match brings down the precision, as shown by DAM_cross, exhibiting the necessity of jointly considering textual relevance and dependency information in response selection.
One notable point is that, while DAM_first achieves performance close to SMN_dynamic, it is about 2.3 times faster than SMN_dynamic in our implementation because it is very simple in computation. We believe DAM_first is more suitable for scenarios that have limitations in computation time or memory but require high precision, such as industrial applications or working as a component in other neural networks like GANs.

Analysis
We use the Ubuntu Corpus to analyze how self-attention and cross-attention work in DAM, through both quantitative analysis and visualization.

Quantitative Analysis
We first study how DAM performs with different numbers of utterances in the context. The left part of Figure 4 shows the changes of R_10@1 on the Ubuntu Corpus across contexts with different numbers of utterances. As demonstrated, while being good at matching responses with long contexts that have more than 4 utterances, DAM can still stably deal with short contexts that have only 2 turns. Moreover, the right part of Figure 4 compares performance across contexts with different average utterance lengths and self-attention stack depths. As demonstrated, stacking self-attention consistently improves matching performance for contexts with different average utterance lengths, implying the stable advantage of using multi-grained semantic representations. The performance of matching short utterances, those with fewer than 10 words, is obviously lower than that of longer ones. This is because the shorter an utterance is, the less information it contains, and the more difficult selecting the next utterance becomes, while stacking self-attention can still help in this case. For long utterances containing more than 30 words, stacking self-attention can significantly improve matching performance, which means that the more information an utterance contains, the more stacked self-attention it needs to capture its intra-sentence semantic structures.

Figure 5: Visualization of the self-attention-match, the cross-attention-match, and the distributions of self-attention and cross-attention in matching the response with the first utterance in Figure 1. Each colored grid represents the matching degree or attention score between two words. The deeper the color, the more important the grid.

Visualization
We study the case in Figure 1 to analyze in detail how self-attention and cross-attention work. Practically, we apply a softmax operation over the self-attention-match and the cross-attention-match, to examine how the dominating matching pairs vary while stacking self-attention or applying cross-attention. Due to space limitations, Figure 5 gives the visualization of the 0th, 2nd and 4th self-attention-match matrices, the 4th cross-attention-match matrix, and the distributions of self-attention and cross-attention in the 4th layer in matching the response with the first utterance (turn 0). As demonstrated, important matching pairs in the self-attention-match at stack 0 are nouns and verbs, like "package" and "packages", which are similar in topic. However, matching scores between preposition or pronoun pairs, such as "do" and "what", become more important in the self-attention-match at stack 4. The visualization of self-attention shows why matching between prepositions or pronouns matters: self-attention generally captures the semantic structure of "no clue what do you need package manager" for "do" in the response and "what packages are installed" for "what" in the utterance, making segments surrounding "do" and "what" close to each other in representation, thus increasing their dot-product results.
Also, as shown in Figure 5, the self-attention-match and the cross-attention-match capture complementary information in matching utterance with response. Words like "reassurance" and "its" in the response get significantly larger matching scores in the cross-attention-match compared with the self-attention-match. According to the visualization of cross-attention, "reassurance" generally depends on "system", "don't" and "held" in the utterance, which makes it close to words like "list", "installed" or "held" in the utterance. Scores in the cross-attention-match tend to concentrate on several segments, which probably means that those segments in the response capture structural-semantic information across utterance and response, amplifying their matching scores against the others.

Error Analysis
To understand the limitations of DAM and where future improvements might lie, we analyze 100 strong bad cases from the test set that fail in R_10@5. We find two major kinds of bad cases: (1) fuzzy-candidate, where response candidates are basically proper for the conversation context, except for a few improper details; (2) logical-error, where response candidates are wrong due to logical mismatch. For example, given a conversation context A: "I just want to stay at home tomorrow.", B: "Why not go hiking? I can go with you.", a response candidate like "Sure, I was planning to go out tomorrow." is logically wrong because it contradicts the first utterance of speaker A. We believe that generating adversarial examples during training, rather than randomly sampling negatives, may be a good way to address both fuzzy-candidate and logical-error cases, and that capturing the logic-level information hidden behind conversation text is also worth studying in the future.

Conclusion
In this paper, we investigate matching a response with its multi-turn context using dependency information based entirely on attention. Our solution extends the attention mechanism of the Transformer in two ways: (1) using stacked self-attention to harvest multi-grained semantic representations; (2) utilizing cross-attention to match with dependency information. Empirical results on two large-scale datasets demonstrate the effectiveness of self-attention and cross-attention in multi-turn response selection. We believe that both self-attention and cross-attention could benefit other research areas, including spoken language understanding, dialogue state tracking and seq2seq dialogue generation. We would like to explore in depth how attention can help improve neural dialogue modeling for both chatbots and task-oriented dialogue systems in future work.