Sequential Match Network: A New Architecture for Multi-turn Response Selection in Retrieval-based Chatbots

We study response selection for multi-turn conversation in retrieval-based chatbots. Existing work either ignores relationships among utterances or loses important information in the context by matching a response only with a highly abstract context vector at the final step. We propose a new session-based matching model to address both problems. The model first matches a response with each utterance on multiple granularities, and distills important matching information from each pair into a vector with convolution and pooling operations. The vectors are then accumulated in chronological order through a recurrent neural network (RNN) that models the relationships among the utterances. The final matching score is calculated with the hidden states of the RNN. An empirical study on two public data sets shows that our model significantly outperforms state-of-the-art methods for response selection in multi-turn conversation.


Introduction
Traditional research in human-computer conversation has focused on building task-oriented dialog systems in vertical domains that help people complete specific tasks such as ordering and tutoring (Boden, 2006; Wallace, 2009; Young et al., 2010). Recently, with the large amount of conversation data available on the Internet, there has been a surge of interest in building non-task-oriented chatbots that can naturally and meaningfully converse with humans on open-domain topics (Jafarpour et al., 2010; Ritter et al., 2011). Existing work on building chatbots includes generation-based methods and retrieval-based methods. (This work was done when the first author was an intern at Microsoft Research Asia.) In this work, we study retrieval-based chatbots, because they select responses from an index of existing conversations and thus can leverage existing search technology and always return fluent responses.

While most existing work on retrieval-based chatbots studies response selection for single-turn conversation (Wang et al., 2013), we consider the problem in a multi-turn scenario, which is the nature of conversation but has not been well explored yet. Different from response selection in single-turn conversation, where one only needs to match a response with a single input message, response selection in multi-turn conversation requires matching between a response and a conversation session: one must consider not only the matching between the response and the input message but also the matching between the response and the utterances in previous turns, i.e., the context. The challenges of the task include (1) how to identify important information (words, phrases, and sentences) in the context that is crucial to selecting a proper response for the session, and how to leverage that information in matching; and (2) how to model the relationships among the utterances. Table 1 illustrates the challenges with an example. First, "hold a drum class" and "drum" in the context are very important.
Without them, one may find responses relevant to the message but nonsensical in the session (e.g., "what lessons do you want?"). On the other hand, although "Shanghai", "the Bund", and "coaches" are also keywords in their utterances, they are useless and even noisy for response selection. It is crucial yet non-trivial to extract the important information from the context and leverage it in matching while circumventing the noise. Second, the message highly depends on Context 1, and the order of the utterances matters in response selection: exchanging Context 2 and the message may lead to different responses. Existing work, however, either ignores relationships among utterances (Lowe et al., 2015) or loses important information in the context in the process of converting the whole session to a vector without enough supervision from responses (Lowe et al., 2015).

We propose a new session-based matching model which can tackle both challenges in an end-to-end way. One major problem of the existing models is that the response cannot meet the session until the final step of matching, which results in information loss. To overcome this drawback, our model matches a response with each utterance in the session (message and context) at the very beginning. For each utterance-response pair, the model constructs a word-word similarity matrix and a sequence-sequence similarity matrix from the embeddings of words and the hidden states of a recurrent neural network with gated units (GRU) (Chung et al., 2014), respectively. The two matrices capture important matching information in the pair at a word level and a segment level, respectively, and the information is distilled and fused into a matching vector through an alternation of convolution and pooling operations on the matrices. By this means, important information in the context is recognized under sufficient supervision from the response and carried into matching with minimal loss.
The matching vectors are then fed to a GRU to form a matching score for the session and the response. The GRU accumulates the pair matchings in its hidden states in the chronological order of the utterances in the session. It models the relationships and dependencies among the utterances in a matching fashion and lets the utterance order supervise the accumulation of the pair matchings. The gate mechanism of the GRU helps select important pairs and filter out noise. The matching degree of the session and the response is computed by a logit model on the hidden states of the GRU. Our model extends the powerful "2D" matching paradigm from text pair matching for single-turn conversation to session-based matching for multi-turn conversation, and enjoys the advantage that both important information in utterance-response pairs and relationships among utterances are sufficiently preserved and leveraged in matching.

We test our model on the Ubuntu dialogue corpus (Lowe et al., 2015), a large public English data set for research on multi-turn conversation. The results show that our model significantly outperforms state-of-the-art methods; the improvement over the best baseline model on R10@1 is over 6%. One problem with the Ubuntu data is that negative examples are randomly sampled, which might oversimplify the multi-turn problem relative to a real retrieval-based chatbot. To further verify the efficacy of the proposed model in a realistic setting, we simulate the procedure of a retrieval-based chatbot and create a large-scale Chinese test set. Instead of negative sampling, labels in the data are generated by 3 human judges. On this data, our model improves over the best baseline model by more than 4% on P@1 (equivalent to R10@1). We publish the data at https://github.com/MarkWuNLP/MultiTurnResponseSelection.
Our contributions in this paper are threefold: (1) the proposal of a new session-based matching model for multi-turn response selection in retrieval-based chatbots; (2) empirical verification of the effectiveness of the model on public data sets; and (3) the publication of a large human-labeled data set to the research community.

Related Work
Early work (Weizenbaum, 1966) on chatbots exploits hand-crafted templates to generate responses, which requires huge human effort and does not scale. Recently, data-driven approaches (Ritter et al., 2011; Higashinaka et al., 2014) have drawn a lot of attention. Existing work along this line includes retrieval-based methods and generation-based methods. The former select a proper response from an index based on the matching between the response and an input message, with or without context (Hu et al., 2014; Ji et al., 2014), while the latter employ statistical machine translation techniques (Ritter et al., 2011) or the sequence-to-sequence framework (Shang et al., 2015; Vinyals and Le, 2015; Xing et al., 2016; Serban et al., 2016) to generate responses. Our work belongs to the retrieval-based methods, and we study response selection with context information.
Early studies of retrieval-based chatbots focus on response selection for single-turn conversation (Wang et al., 2013; Ji et al., 2014). Recently, researchers have begun to pay attention to multi-turn conversation. For example, Lowe et al. (2015) match a response with the literal concatenation of context utterances. Yan et al. concatenate context utterances with the input message as reformulated queries and perform matching with a deep neural network architecture. Zhou et al. improve multi-turn response selection with a multi-view model including an utterance view and a word view. The stark difference between our model and the existing models is that our model matches a response with each utterance at the very beginning, and matching information, instead of sentences, is accumulated in a temporal manner through a GRU.

Problem Formalization
Suppose that we have a data set D = {(y_i, s_i, r_i)}_{i=1}^N, where s_i = {u_{i,1}, . . . , u_{i,n_i}} represents a conversation session with {u_{i,1}, . . . , u_{i,n_i−1}} the utterances in the context and u_{i,n_i} the input message. r_i is a response candidate and y_i ∈ {0, 1} denotes a label: y_i = 1 means r_i is a proper response for s_i, otherwise y_i = 0. Our goal is to learn a matching model g(·, ·) with D. For any session-response pair (s, r), g(s, r) measures the matching degree between s and r.

Figure 1 gives the architecture of our model. The model first decomposes session-response matching into several utterance-response pair matchings, and then all pair matchings are accumulated into a session-based matching through a recurrent neural network. Specifically, the model consists of two layers. The first layer matches a response candidate with each utterance (context and message) in the session at a word level and a segment level. An utterance-response pair is transformed to a word-word similarity matrix and a sequence-sequence similarity matrix, and important matching information in the pair is distilled from the two matrices and encoded in a matching vector. The matching vectors are then fed to the second layer, where they are accumulated in the hidden states of a recurrent neural network with gated units (GRU) following the chronological order of the utterances in the session. The matching degree of the session and the response is calculated with the hidden states of the GRU.
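To make the formalization concrete, a training instance (y_i, s_i, r_i) can be laid out as below. This is a minimal sketch with hypothetical field names (the paper does not prescribe a data format); the utterances are adapted from the Ubuntu example discussed later.

```python
# One training instance (y_i, s_i, r_i): the session holds the context
# utterances plus the input message as its last element.
instance = {
    "session": [                                     # s_i = {u_{i,1}, ..., u_{i,n_i}}
        "how can I unzip many rar files at once",    # context utterance u_{i,1}
        "sure you can do that in bash",              # context utterance u_{i,2}
        "are the files all in the same directory",   # input message u_{i,n_i}
    ],
    "response": "then the command should extract them all to that directory",
    "label": 1,                                      # y_i = 1: proper response
}

def is_positive(inst):
    """y_i = 1 means the response fits the whole session, not just the message."""
    return inst["label"] == 1
```

The matching model g(s, r) then scores `instance["session"]` against `instance["response"]`, supervised by `instance["label"]`.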

Model Overview
Our model enjoys several advantages over the existing models. First, a response candidate can meet each utterance in the session at the very beginning of the whole matching procedure, thus matching information in every utterance-response pair can be sufficiently extracted and carried to the final matching score with minimal loss. Second, information extraction from each utterance is conducted on different granularities and under sufficient supervision from the response, thus semantic structures that are useful to response selection in each utterance can be well identified and extracted. Third, matching and utterance relationships are coupled rather than separately modeled, thus utterance relationships (e.g., order), as a kind of knowledge, can supervise the formation of the matching score.
By taking utterance relationships into consideration, our model extends the "2D" matching which has proven effective in text pair matching for single-turn response selection to sequential "2D" matching for session based matching in response selection for multi-turn conversation. We name our model "Sequential Match Network" (SMN). In the following sections, we will describe details of the two layers.

Utterance-Response Matching
At the first layer, given an utterance u in a session s and a response candidate r, the model looks up an embedding table and represents u and r as U = [e_{u,1}, . . . , e_{u,n_u}] and R = [e_{r,1}, . . . , e_{r,n_r}] respectively, where e_{u,i}, e_{r,i} ∈ R^d are the embeddings of the i-th words of u and r respectively. U ∈ R^{d×n_u} and R ∈ R^{d×n_r} are then used to construct a word-word similarity matrix M_1 ∈ R^{n_u×n_r} and a sequence-sequence similarity matrix M_2 ∈ R^{n_u×n_r}, which are two input channels of a convolutional neural network (CNN). The CNN distills important matching information from the matrices and encodes the information into a matching vector v.

Specifically, ∀i, j, the (i, j)-th element of M_1 is defined by

e_{1,i,j} = e_{u,i} · e_{r,j}.    (1)

M_1 models the matching between u and r at a word level.
To construct M_2, we first employ a recurrent neural network with gated units (GRU) (Chung et al., 2014) to transform U and R to hidden vectors. Suppose that H_u = [h_{u,1}, . . . , h_{u,n_u}] are the hidden vectors of U; then ∀i, h_{u,i} ∈ R^m is defined by

z_i = σ(W_z e_{u,i} + U_z h_{u,i−1}),
r_i = σ(W_r e_{u,i} + U_r h_{u,i−1}),
h̃_{u,i} = tanh(W_h e_{u,i} + U_h (r_i ⊙ h_{u,i−1})),
h_{u,i} = z_i ⊙ h̃_{u,i} + (1 − z_i) ⊙ h_{u,i−1},    (2)

where h_{u,0} = 0, z_i and r_i are an update gate and a reset gate respectively, σ(·) is a sigmoid function, ⊙ denotes element-wise multiplication, and W_z, W_r, W_h, U_z, U_r, U_h are parameters. Similarly, H_r = [h_{r,1}, . . . , h_{r,n_r}] are the hidden vectors of R. ∀i, j, the (i, j)-th element of M_2 is defined by

e_{2,i,j} = h_{u,i}^T W_1 h_{r,j},    (3)

where W_1 ∈ R^{m×m} is a linear transformation. ∀i, the GRU models the sequential relationships and dependencies among words up to position i and encodes the text segment up to the i-th word into a hidden vector. Therefore, M_2 models the matching between u and r at a segment level.

M_1 and M_2 are then processed by a CNN to form v. ∀f = 1, 2, the CNN regards M_f as an input channel and alternates convolution and max-pooling operations. Suppose that z^{(l,f)} = [z^{(l,f)}_{i,j}]_{I^{(l,f)}×J^{(l,f)}} denotes the output of the feature map of type f on layer l, where z^{(0,f)} = M_f, ∀f = 1, 2. On the convolution layer, we employ a 2D convolution operation with a window size r^{(l,f)}_w × r^{(l,f)}_h:

z^{(l,f)}_{i,j} = σ(Σ_{f'=1}^{F_{l−1}} W^{(l,f)} · z^{(l−1,f')}_{[i:i+r^{(l,f)}_w, j:j+r^{(l,f)}_h]} + b^{(l,f)}),    (4)

where σ(·) is a ReLU, W^{(l,f)} ∈ R^{r^{(l,f)}_w × r^{(l,f)}_h} and b^{(l,f)} are parameters, and F_{l−1} is the number of feature maps on the (l − 1)-th layer. A max-pooling operation follows a convolution operation and can be formulated as

z^{(l,f)}_{i,j} = max_{0 ≤ s < p^{(l,f)}_w} max_{0 ≤ t < p^{(l,f)}_h} z^{(l−1,f)}_{i+s, j+t},    (5)

where p^{(l,f)}_w and p^{(l,f)}_h are the width and the height of the 2D pooling respectively. The outputs of the final feature maps are concatenated and mapped to a low-dimensional space with a linear transformation as the matching vector v ∈ R^q.
From Equations (1), (3), (4), and (5), we can see that, by learning the word embeddings and the parameters of the GRU from training data, words or segments in an utterance that are useful for recognizing the appropriateness of a response may have high similarity with some words or segments in the response, resulting in high-value areas in the similarity matrices. These areas are transformed and selected by the convolution and pooling operations, and carry the important information in the utterance into the matching vector. This is how our model identifies important information in the context and leverages it in matching under the supervision of the response. We consider multiple channels because we want to capture important matching information at multiple granularities of text.
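The first layer can be sketched end-to-end as follows. This is a toy illustration, not the authors' implementation: the dimensions are small, the learned embeddings, GRU states, convolution kernel, and linear map are replaced by random stand-ins, and a single convolution-and-pooling pass stands in for the full CNN.

```python
import numpy as np

rng = np.random.default_rng(0)
d = m = 8                  # word-embedding / GRU hidden sizes (toy values)
n_u, n_r = 6, 5            # lengths of utterance u and response r
q = 10                     # dimension of the matching vector v

U  = rng.standard_normal((d, n_u))   # stand-in word embeddings of u
R  = rng.standard_normal((d, n_r))   # stand-in word embeddings of r
Hu = rng.standard_normal((m, n_u))   # stand-in GRU hidden states of u
Hr = rng.standard_normal((m, n_r))   # stand-in GRU hidden states of r
W1 = rng.standard_normal((m, m))     # bilinear transform of the segment-level matrix

M1 = U.T @ R                         # word-word similarity: e_{u,i} . e_{r,j}
M2 = Hu.T @ W1 @ Hr                  # sequence-sequence similarity: h_{u,i}^T W1 h_{r,j}

def conv2d_valid(M, K):
    """'Valid' 2-D convolution (cross-correlation) of one channel with kernel K."""
    kh, kw = K.shape
    out = np.empty((M.shape[0] - kh + 1, M.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(M[i:i + kh, j:j + kw] * K)
    return out

def max_pool(M, p=2):
    """Non-overlapping p x p max pooling (edge cells that do not fit are dropped)."""
    h, w = (M.shape[0] // p) * p, (M.shape[1] // p) * p
    return M[:h, :w].reshape(h // p, p, w // p, p).max(axis=(1, 3))

K = rng.standard_normal((3, 3))      # one toy kernel shared by both channels
feats = []
for M in (M1, M2):                   # the two input channels of the CNN
    z = np.maximum(conv2d_valid(M, K), 0.0)   # convolution + ReLU
    feats.append(max_pool(z).ravel())         # max pooling, then flatten
flat = np.concatenate(feats)
Wp = rng.standard_normal((q, flat.size))      # final linear map
v = Wp @ flat                        # matching vector v in R^q for this pair
```

In training, U, Hu, Hr, W1, K, and Wp would all be learned jointly, which is what lets the response supervise which areas of M_1 and M_2 light up.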

Matching Accumulation
Suppose that [v_1, . . . , v_n] is the output of the first layer (corresponding to the n pairs). At the second layer, a GRU takes [v_1, . . . , v_n] as input and encodes the matching sequence into its hidden states H_m = [h_1, . . . , h_n] ∈ R^{q×n}, with a detailed parameterization similar to Equation (2). This layer has two functions: (1) it models the dependency and the temporal relationship of the utterances in the session; (2) it leverages the temporal relationship to supervise the accumulation of the pair matchings into a session-based matching. Moreover, from Equation (2), we can see that the reset gate (i.e., r_i) and the update gate (i.e., z_i) control how much information from the previous hidden state and the current input flows into the current hidden state, so important matching vectors (corresponding to important utterances) can be accumulated while noise in the vectors is filtered out.
With H_m, we define g(s, r) as

g(s, r) = softmax(W_2 L[h_1, . . . , h_n] + b_2),    (6)

where W_2 and b_2 are parameters. We consider three parameterizations of L[h_1, . . . , h_n]: (1) only the last hidden state is used, i.e., L[h_1, . . . , h_n] = h_n; (2) the hidden states are linearly combined, i.e., L[h_1, . . . , h_n] = Σ_{i=1}^n w_i h_i, where the scalar weights w_i are learned from training data; (3) an attention mechanism weights the hidden states, i.e., L[h_1, . . . , h_n] = Σ_{i=1}^n α_i h_i with α_i = softmax(t_s^T tanh(W_3 h_i + b_3)), where W_3 ∈ R^{q×q} and b_3 ∈ R^q are parameters. t_s ∈ R^q is a high-level virtual context vector which is randomly initialized and jointly learned in training.
Both (2) and (3) aim to learn weights for {h_1, . . . , h_n} from training data and dynamically highlight the effect of important matching vectors in the final matching score. The difference is that the weights in (2) are nonparametric and unnormalized, while those in (3) are parametric and normalized. We denote the models with the three parameterizations of L[h_1, . . . , h_n] as SMN_last, SMN_non-para, and SMN_para respectively, and empirically compare them in the experiments.
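The three readouts of the hidden states can be sketched as below. This is a toy illustration with random stand-ins for the learned quantities (H, w, W3, b3, t_s); the score layer on top of L[h_1, . . . , h_n] is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
q, n = 4, 3
H = rng.standard_normal((q, n))     # hidden states [h_1, ..., h_n] of the second GRU

# (1) SMN_last: only the final hidden state.
L_last = H[:, -1]

# (2) SMN_non-para: one learned, unnormalized scalar weight per position.
w = rng.standard_normal(n)          # stand-ins for the learned weights w_i
L_nonpara = H @ w                   # sum_i w_i h_i

# (3) SMN_para: normalized attention weights driven by a virtual context vector t_s.
W3 = rng.standard_normal((q, q))
b3 = rng.standard_normal(q)
ts = rng.standard_normal(q)         # t_s, randomly initialized and learned in training
scores = np.array([ts @ np.tanh(W3 @ H[:, i] + b3) for i in range(n)])
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                # softmax: parametric, normalized weights
L_para = H @ alpha                  # sum_i alpha_i h_i
```

Note how (2) and (3) differ exactly as the text says: `w` is unconstrained, while `alpha` is produced by a parametric scoring function and sums to one.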
We learn g(·, ·) by minimizing cross entropy on D. Let Θ denote the parameters of our model; then the objective function L(D, Θ) of learning can be formulated as

L(D, Θ) = − Σ_{i=1}^N [y_i log(g(s_i, r_i)) + (1 − y_i) log(1 − g(s_i, r_i))],    (7)

where N is the number of instances in D.
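The objective in Equation (7) is standard binary cross entropy over matching scores in (0, 1); a minimal sketch (helper name hypothetical):

```python
import math

def objective(labels, scores, eps=1e-12):
    """- sum_i [ y_i log g_i + (1 - y_i) log(1 - g_i) ] for y_i in {0, 1}, g_i in (0, 1)."""
    return -sum(y * math.log(g + eps) + (1 - y) * math.log(1 - g + eps)
                for y, g in zip(labels, scores))

# Confidently correct scores incur a small loss; confidently wrong ones a large loss.
good = objective([1, 0], [0.9, 0.1])
bad = objective([1, 0], [0.1, 0.9])
```

Minimizing this objective pushes g(s_i, r_i) toward 1 for proper responses and toward 0 for improper ones.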

Response Candidate Retrieval
In a real retrieval-based chatbot, to apply the matching approach to response selection, one needs to retrieve a set of response candidates from an index beforehand. While candidate retrieval is not the focus of this paper, it is an important step in a real system. In this work, we exploit a heuristic method to obtain response candidates from the index. Given a message u_n with {u_1, . . . , u_{n−1}} the utterances in its previous turns, we extract the top 5 keywords from the previous turns based on their tf-idf values and expand u_n with the keywords. Then we send the expanded message to the index and retrieve response candidates using the built-in retrieval algorithm of the index. Finally, we use g(s, r) to re-rank the candidates and return the top one as a response to the session.
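The keyword-expansion heuristic can be sketched as follows. This is an illustrative implementation, not the authors' code: function names are hypothetical, a simple tf · log(N/(1+df)) weighting is assumed, and the document frequencies would come from the index.

```python
import math
from collections import Counter

def top_keywords(context_utterances, doc_freq, num_docs, k=5):
    """Pick the top-k context words by tf-idf (doc_freq maps word -> document frequency)."""
    tf = Counter(w for u in context_utterances for w in u.split())
    def tfidf(w):
        return tf[w] * math.log(num_docs / (1 + doc_freq.get(w, 0)))
    return sorted(tf, key=tfidf, reverse=True)[:k]

def expand_message(message, context_utterances, doc_freq, num_docs, k=5):
    """Append the top-k context keywords to the message before querying the index."""
    return message + " " + " ".join(
        top_keywords(context_utterances, doc_freq, num_docs, k))
```

On the drum-class example from the introduction, a rare context word like "drum" would dominate frequent function words, so the expanded query steers retrieval toward drum-related responses.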

Experiment
We tested our model on a public English data set and a Chinese data set that we publish with this paper.

Experiment setup
The English data set is the Ubuntu Corpus (Lowe et al., 2015), which contains large-scale multi-turn dialogues collected from chat logs of the Ubuntu Forum. The data set consists of 1 million session-response pairs for training, 0.5 million pairs for validation, and 0.5 million pairs for testing. Positive responses are true responses from humans, and negative ones are randomly sampled. The ratio of positives to negatives is 1:1 in training, and 1:9 in validation and testing. We used the shared preprocessed copy in which numbers, urls, and paths are replaced by special placeholders.
One problem with the Ubuntu data is that the negative examples are much easier to identify than those in a real chatbot, because they are randomly sampled and most of them are far from the semantics of the context. A better data set that can simulate the real scenario of a retrieval-based chatbot should have responses generated following the procedure of information retrieval and labels annotated by humans. As far as we know, however, no such data sets are publicly available. To test our model in a setting closer to the real case and to facilitate research on multi-turn response selection, we created a new data set and publish it to the research community with this paper. We crawled 15 million post-reply pairs from Sina Weibo, the largest microblogging service in China, and indexed the pairs with the open source search engine Lucene. We then crawled 1.1 million dyadic dialogues (conversations between two people) longer than 2 turns from Douban group, a popular forum in China. From the data, we randomly sampled 0.5 million dialogues for creating a training set, 25 thousand dialogues for creating a validation set, and 1,000 dialogues for creating a test set, and made sure that there is no overlap among the three sets. For each dialogue in training and validation, we took the last turn as a positive response for the previous turns as a session, and randomly sampled another response from the 1.1 million data as a negative response. In total, there are 1 million session-response pairs in the training set and 50 thousand pairs in the validation set. To create the test set, we took the last turn of each dialogue as a message, retrieved 10 response candidates from the index following the method in Section 4, and formed a test set with 10,000 session-response pairs. We recruited three labelers to judge whether a candidate is a proper response to a session. A proper response means the response can naturally reply to the message given the context.
Each pair received three labels, and the majority of the labels was taken as the final decision. Table 2 gives the statistics of the three sets. Note that the Fleiss' kappa (Fleiss, 1971) of the labeling is 0.41, which indicates that the three labelers reached a relatively high agreement.
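Agreement statistics of this kind can be computed as below; this is a standard Fleiss' kappa implementation (a sketch, not the authors' tooling), taking per-item category counts, e.g. [2, 1] for two "proper" and one "improper" votes.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-item category counts (same raters per item)."""
    N = len(ratings)                     # number of items
    n = sum(ratings[0])                  # raters per item (3 labelers here)
    k = len(ratings[0])                  # number of categories (proper / improper)
    # Marginal proportion of each category over all assignments.
    p = [sum(item[j] for item in ratings) / (N * n) for j in range(k)]
    # Mean per-item agreement.
    P_bar = sum((sum(c * c for c in item) - n) / (n * (n - 1))
                for item in ratings) / N
    # Expected agreement by chance.
    P_e = sum(pj * pj for pj in p)
    return (P_bar - P_e) / (1 - P_e)
```

Perfect agreement yields kappa = 1, chance-level agreement yields 0, and systematic disagreement yields negative values; the 0.41 reported above falls in the moderate-to-substantial range.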
On the Ubuntu data, we followed (Lowe et al., 2015) and employed recall at position k among n candidates (R_n@k) as evaluation metrics. On the human-labeled data, we followed the convention of information retrieval and employed mean average precision (MAP) (Baeza-Yates et al., 1999), mean reciprocal rank (MRR) (Voorhees et al., 1999), and precision at position 1 (P@1) as metrics. Note that, for the labeled set, we removed sessions with all negative responses or all positive responses, as models make no difference on them. After that, 6,670 session-response pairs are left in the test set.
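For reference, the per-session versions of these metrics can be sketched as follows (function names hypothetical); each takes the candidates' binary labels in ranked order, and the corpus-level MAP/MRR are means of the per-session values.

```python
def recall_at_k(ranked_labels, k):
    """R_n@k: fraction of a session's positives that appear in the top k."""
    total = sum(ranked_labels)
    return sum(ranked_labels[:k]) / total if total else 0.0

def precision_at_1(ranked_labels):
    """P@1: 1 if the top-ranked candidate is a proper response, else 0."""
    return float(ranked_labels[0])

def reciprocal_rank(ranked_labels):
    """RR: inverse rank of the first proper response (averaged into MRR)."""
    for rank, y in enumerate(ranked_labels, 1):
        if y:
            return 1.0 / rank
    return 0.0

def average_precision(ranked_labels):
    """AP: mean of precision at each proper response's rank (averaged into MAP)."""
    hits, score = 0, 0.0
    for rank, y in enumerate(ranked_labels, 1):
        if y:
            hits += 1
            score += hits / rank
    return score / hits if hits else 0.0
```

With exactly one positive among n candidates, R_n@1 reduces to an indicator of whether the positive is ranked first, which is why P@1 on the labeled data is comparable to R_10@1 on the Ubuntu data.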
Multi-view: a model that utilizes a hierarchical recurrent neural network to model utterance relationships, with both an utterance view and a word view (see Section 2).
Deep learning to respond (DL2R): a model that reformulates the input message with context utterances using heuristic rules and matches the reformulated query with a response candidate using deep neural networks (see Section 2).
Advanced single-turn matching models: since LSTM and BiLSTM do not represent the state of the art in matching models, we concatenated the utterances in a session and matched the resulting long text with a response candidate using more powerful models, including MV-LSTM (Wan et al., 2016), Match-LSTM (Wang and Jiang, 2015), and Multi-Channel, which is described in Section 3.3. Multi-Channel is a simplified version of our model that does not consider utterance relationships.

Parameter Tuning
For the baseline models, if their results are available in the existing literature (e.g., those on the Ubuntu Corpus), we copied the numbers; otherwise we tuned the models on the validation sets. The number of feature maps is 8. In layer two, we set the dimensions of the matching vectors and the hidden states of the GRU to 50. We optimized the objective function using back-propagation, and the parameters were updated by stochastic gradient descent with the Adam algorithm (Kingma and Ba, 2014) on a single Tesla K80 GPU. The initial learning rate is 0.001, and the parameters of Adam, β1 and β2, which control the exponential decay rates, are 0.9 and 0.999 respectively. We employed early stopping (Lawrence and Giles, 2000) as a regularization strategy. Models were trained in mini-batches with a batch size of 200, and each utterance was truncated or zero-padded to 50 words.

Table 3: Evaluation results on the two data sets.

                              Ubuntu data                   Chinese data
                              R2@1   R10@1  R10@2  R10@5    MAP    MRR    P@1
TF-IDF (Lowe et al., 2015)    0.659  0.410  0.545  0.708    0.331  0.359  0.179
RNN (Lowe et al., 2015)       0.768  0.403  0.547  0.819    0.390  0.422  0.208
CNN (Kadlec et al., 2015)     0.848  0.549  0.684  0.896    0.417  0.440  0.226
LSTM (Kadlec et al., 2015)    0.901  0.638  0.784  0.949    0.485  0.527  0.320
BiLSTM (Kadlec et al., 2015)  0…

Table 3 shows the evaluation results on the two data sets. Our models greatly outperform the baselines in terms of all metrics on both data sets, and the improvements are statistically significant (t-test with p-value ≤ 0.01). Even the state-of-the-art single-turn matching models perform much worse than our models. The results demonstrate that one cannot neglect utterance relationships and simply perform multi-turn response selection by transforming it into a single-turn problem. Our models achieve significant improvements over Multi-View, which justifies our "matching first" strategy. DL2R is also worse than our models, indicating that utterance reformulation with heuristic rules is not a good way to utilize context information.
Numbers on the Ubuntu data are much higher than those on the Chinese data (R10@1 and P@1 are equivalent). The results show the merit of our new data and support our claim that the Ubuntu data oversimplifies the problem of multi-turn response selection. There is no significant difference among our three models. The reason might be that the GRU already selects useful signals from the matching sequence and accumulates them in its final state with the gate mechanism, especially when the sequence is not long, so there is little need for an additional attention mechanism on top of it.

Further Analysis
Visualization: we visualize the similarity matrices and the gates of the GRU in layer two using an example from the Ubuntu Corpus to further clarify how our model identifies important information in the context and how it selects important matching vectors with the gate mechanism of the GRU, as described in Section 3.3 and Section 3.4. The example is {u_1: how can unzip many rar ( number for example ) files at once; u_2: sure you can do that in bash; u_3: okay how? u_4: are the files all in the same directory? u_5: yes they all are; r: then the command glebihan should extract them all from/to that directory}. It is from the test set, and our model successfully ranked the correct response at the top position. Due to space limitations, we only visualize M_1 of u_1 and r in Figure 2(a), M_1 of u_3 and r in Figure 2(b), and the update gate (i.e., z) in Figure 2(c); these are enough to support our analysis. In all pictures, darker areas mean larger values. We can see that in u_1 important words including "unzip", "rar", and "files" are recognized and carried into matching by "command", "extract", and "directory" in r, while u_3 is almost useless and thus little information is extracted from it. u_1 is crucial to response selection, and nearly all information from u_1 and r flows into the hidden state of the GRU, while the other utterances are less informative and the corresponding gates are almost "closed", preserving the information from u_1 and r until the final state.

Model ablation: we then examine the contributions of different components of our model by replacing them with alternatives. Table 4 reports the results. First, replacing the multi-channel "2D" matching with a neural tensor network (NTN) (Socher et al., 2013) (denoted as Replace_M) makes the performance drop dramatically. This is because NTN only matches a pair with an utterance vector and a response vector, and thus misses important information in the pair.
Together with the visualization, we can conclude that "2D" matching plays a key role in the "matching first" strategy, as it captures the important matching information in each pair with minimal loss. Second, the performance drops slightly when we replace the GRU for matching accumulation with a multi-layer perceptron (denoted as Replace_S). This indicates that utterance relationships are also useful. Finally, we kept only one channel in matching and found that M_2 is a little more powerful than M_1, and that the best results are achieved with both of them.
Session length: we finally study how our model (SMN_last) performs with respect to the length of sessions. Figure 3 compares the models in terms of MAP in different length intervals on the Chinese data. We can see that our model consistently performs better than the baselines, and the gap becomes larger as sessions become longer. The results demonstrate that our model can well capture the dependencies, especially long dependencies, among utterances in sessions.