Multi-hop Selector Network for Multi-turn Response Selection in Retrieval-based Chatbots

Multi-turn retrieval-based conversation is an important task for building intelligent dialogue systems. Existing works mainly focus on matching candidate responses with every context utterance on multiple levels of granularity, which ignores the side effect of using excessive context information. Context utterances provide abundant information for extracting more matching features, but they also bring noise signals and unnecessary information. In this paper, we analyze the side effect of using too many context utterances and propose a multi-hop selector network (MSN) to alleviate the problem. Specifically, MSN first utilizes a multi-hop selector to select the relevant utterances as context. Then, the model matches the filtered context with the candidate response and obtains a matching score. Experimental results show that MSN outperforms some state-of-the-art methods on three public multi-turn dialogue datasets.


Introduction
Building a dialogue system that can naturally and consistently converse with humans has drawn increasing research interest in recent years. Existing works on building dialogue systems include generation-based and retrieval-based methods. Compared with generation-based methods, retrieval-based methods have advantages in providing fluent and informative responses. Many industrial products have applied retrieval-based dialogue systems, e.g., the E-commerce assistant AliMe Assist from Alibaba Group and XiaoIce (Shum et al., 2018) from Microsoft.
Early studies focus on response selection for single-turn conversation. Recently, researchers have begun to pay attention to multi-turn conversation, aiming at selecting the most related response from a set of candidates given the context utterances of a conversation. Some effective models, such as Sequential Matching Network (SMN), Deep Attention Matching network (DAM) (Zhou et al., 2018c), and Multi-Representation Fusion Network (MRFN) (Tao et al., 2019), have been proposed to capture the matching features on multiple levels of granularity (words, phrases, sentences, etc.) and short-term and long-term dependencies among words. Previous works have shown that utilizing multi-turn utterances can further improve the matching performance compared with using only a single-turn utterance (i.e., the last utterance). But context utterances are a "double-edged sword": while providing abundant information, they also introduce a lot of noise, which hurts performance because these matching-based methods are sensitive to it.

Table 1: An error case of SMN and DAM (Zhou et al., 2018c) from E-commerce Corpus. The scores in the table are matching scores predicted by the models.

Turns    Dialogue Text                                        SMN     DAM
Turn-1   A: Are there any discounts activities recently?
Turn-2   B: No. Our product have been cheaper than before.
Turn-3   A: Oh.
Turn-4   B: Hum!
Turn-5   A: I'll buy these nuts. Can you sell me cheaper?
Turn-6   B: You can get some coupons on the homepage.
Turn-7   A: Will you give me some nut clips?
Turn-8   B: Of course we will.
Turn-9   A: How many clips will you give?
Resp-1   One clip for every package. (True)                   0.832   0.854
Resp-2   OK, we will give you a coupons worth $1. (False)     0.925   0.947

To illustrate the problem, we show an error case of SMN and DAM (Zhou et al., 2018c) from E-commerce Corpus in Table 1. We can see that although "Resp-1" is the right answer for utterance "Turn-9", the SMN and DAM models still choose "Resp-2", because it has more word overlap with the context utterances and thus accumulates a larger similarity score. We can easily observe that "Resp-2" is relevant to earlier utterances (Turn-1 to Turn-6), but the topic has changed after "Turn-6". Besides, "Turn-3" and "Turn-4" do not provide any useful information for selecting the candidate response. This example shows that irrelevant context utterances may cause the models to make simple mistakes that humans would not make. Furthermore, we conduct several adversarial experiments, and the results show that these matching-based models are very sensitive to adversarial samples.
In this paper, we propose a multi-hop selector network to tackle the above problem. Intuitively, the closer an utterance is to the response, the more it reflects the intention of the last dialogue session. Thus, we first use the last utterance as the key to select context utterances that are relevant to it on the word and sentence levels. However, we find that there are many samples whose last utterance is very short and contains very limited information (such as "good", "ok"), which causes the selectors to lose too much useful context information. Therefore, we propose multi-hop selectors to select more relevant context utterances, yielding k different contexts. Then, we fuse these selected context utterances and match them with the candidate response. During the matching stage, a convolutional neural network (CNN) is applied to extract matching features and a gated recurrent unit (GRU) is applied to learn the temporal relationship of utterances.
The contributions of this paper are summarized as follows:
• We find that noise in context utterances can influence the matching performance, and we design adversarial experiments to verify it.
• We propose a unified network, MSN, to select relevant context utterances at the word and utterance levels and fuse the selected context to generate a better context representation.
• Experimental results on three public datasets show significant improvement, which demonstrates the effectiveness of MSN.
The outline of the paper is as follows. Section 2 introduces related works. Section 3 describes adversarial experiments that check how sensitive previous models are to the context utterances. Section 4 describes every component of the MSN model. Section 5 discusses the experiments and corresponding results. Section 6 discusses some experiments that explore the influence of hyper-parameters on performance. We conclude our work in Section 7.

Related Work
With the development of natural language processing, building intelligent chatbots with data-driven approaches has drawn increasing attention in recent years. Existing works can be generally categorized into retrieval-based methods (Wan et al., 2016; Zhang et al., 2018; Tao et al., 2019) and generation-based methods (Shang et al., 2015; Serban et al., 2016; Wu et al., 2018; Zhou et al., 2018a,b). In this work, we focus on retrieval-based methods and study context-based response selection.
Early retrieval-based chatbots are devoted to response selection for single-turn conversation (Wang et al., 2013; Tan et al., 2015). Recently, researchers have begun to turn to multi-turn conversation. Lowe et al. (2015) use an RNN to read the context and response, and use the last hidden states to represent them as two semantic vectors whose relevance is then measured. Zhou et al. (2016) perform context-response matching with a multi-view model on both word and utterance levels. Considering that concatenating utterances in context may lose relationships among utterances or important contextual information, SMN separately matches the response with each utterance based on a convolutional neural network. This paradigm is applied in many subsequent works. Zhou et al. (2018c) consider the dependency relation among utterances based on the attention mechanism. Tao et al. (2019) fuse word, n-gram, and sub-sequence representations of utterances and capture both short-term and long-term dependencies among words.
Different from previous works, (i) we study the influence of using excessive context utterances, and (ii) we explore how to filter out irrelevant context to improve the robustness of matching-based methods.

Adversarial experiments
To study how sensitive the previous models (Zhang et al., 2018; Zhou et al., 2018c; Tao et al., 2019) are to the context utterances, we conduct several adversarial experiments inspired by Jia and Liang (2017). We keep the training set unchanged and add some noise to the original test set. To be specific, we randomly sample 1∼3 words from the context utterances and append them to every candidate response. In this way, we obtain 3 different adversarial test sets: adversarial set1, adversarial set2, and adversarial set3.
Then, we evaluate the models again to see how much the performance changes. To ensure the fairness of the experiments, we use the results from their papers for the original test set. Moreover, we use their open source code for the adversarial experiments. We employ recall at position k in n candidates ($R_n@k$) as the evaluation metric, which is the same as previous works.

Table 2: The results of SMN, DUA (Zhang et al., 2018), DAM (Zhou et al., 2018c), and MRFN (Tao et al., 2019) on the original test set are cited from their papers. Columns: Models; $R_{10}@1$ and $R_{10}@2$ on the original test set, adversarial set1, adversarial set2, and adversarial set3.

The experimental results are shown in Table 2. From the table, we can observe that one-word noise brings about a 7% ∼ 13% absolute decrease on $R_{10}@1$, and three-word noise brings about a 20% decrease on $R_{10}@1$. Thus, we can see that matching-based models (Zhang et al., 2018; Zhou et al., 2018c; Tao et al., 2019) are very sensitive to small noise in the dataset. Moreover, using too many context utterances greatly increases the probability of introducing noise. The results of MSN also show that filtering irrelevant utterances can effectively alleviate this problem and improve the robustness of matching-based models.
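The noise injection and the $R_n@k$ metric described in this section can be sketched as follows. This is a minimal illustration rather than the authors' script: the function names `make_adversarial` and `recall_at_k` are hypothetical, whitespace tokenization is assumed, and the noise words are assumed to be drawn independently for each candidate (with replacement), which the text does not specify.

```python
import random

def make_adversarial(context, candidates, n_words, rng):
    """Append n_words randomly sampled context words to every candidate."""
    # Pool of words to sample from: every whitespace token in the context.
    pool = [word for utterance in context for word in utterance.split()]
    noisy = []
    for candidate in candidates:
        # Assumption: sample independently per candidate, with replacement.
        noise = [rng.choice(pool) for _ in range(n_words)]
        noisy.append(" ".join([candidate] + noise))
    return noisy

def recall_at_k(scores, labels, k):
    """R_n@k for one context with n = len(scores) candidates.

    Assumes at least one candidate is labeled 1.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(labels[i] for i in ranked[:k])
    return hits / sum(labels)

rng = random.Random(0)
context = ["will you give me some nut clips", "how many clips will you give"]
candidates = ["one clip for every package", "we will give you a coupon"]
# Two noise words per candidate, in the spirit of adversarial set2.
adversarial_set2 = make_adversarial(context, candidates, 2, rng)
```

A model is then scored on the noisy candidates exactly as on the original ones, and `recall_at_k` is averaged over all test contexts.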

Problem Formalization
Suppose that we have a dataset $D = \{(U_i, r_i, y_i)\}_{i=1}^{N}$, where $U_i = \{u_{i1}, \ldots, u_{iL}\}$ represents a conversation context with $L$ utterances and every utterance $u_{ij}$ contains $T$ words. $r_i$ is a response candidate and $y_i \in \{0, 1\}$ denotes a label. $y_i = 1$ means $r_i$ is a proper response for $U_i$, otherwise $y_i = 0$. Our goal is to learn a matching model $g(\cdot, \cdot)$ with $D$. For any context-response pair $(U_i, r_i)$, $g(U_i, r_i)$ measures the matching degree between $U_i$ and $r_i$.
To this end, we need to address two problems: (1) how to select proper context utterances from U i ; and (2) how to fuse these selected utterances together for a better representation.

Model Overview
We propose a multi-hop selector network (MSN) to model $g(\cdot, \cdot)$. Figure 1 gives the architecture, which generally follows the representation-matching-aggregation framework (Zhang et al., 2018; Zhou et al., 2018c; Tao et al., 2019) to match the response with the multi-turn context. Different from previous works, we add a selection process before the above framework. MSN first constructs semantic representations at word level by an Attentive Module. Then, each utterance is packed as context or key and sent to the "Hopk Selector" to calculate relevance scores. The scores of the k different selectors are fused together by a Context Fusion module. Finally, the fused scores are applied to the original context utterances to filter out irrelevant context. The remaining context utterances are used for response matching.

Attentive Module
We use the Attentive Module to learn the context information for word representation. The Attentive Module is proposed in DAM (Zhou et al., 2018c) and is a variant of Multi-head Attention (Vaswani et al., 2017). Figure 2 shows its structure. $\mathrm{AttentiveModule}(Q, K, V)$ has three input sentences: the query sentence, the key sentence, and the value sentence, namely $Q \in \mathbb{R}^{n_q \times d}$, $K \in \mathbb{R}^{n_k \times d}$, and $V \in \mathbb{R}^{n_v \times d}$ respectively, where $n_q$, $n_k$, and $n_v$ denote the number of words in each sentence, and $d$ is the dimension of the embedding.
The Attentive Module first takes each word in the query sentence to attend to words in the key sentence via Scaled Dot-Product Attention (Vaswani et al., 2017), and then applies those attention weights upon the value sentence:

$$V_{att} = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d}}\right) V$$

The attended result is added to the query and layer-normalized, $x = \mathrm{LayerNorm}(V_{att} + Q)$. Then, a feed-forward network (FFN) with ReLU (LeCun et al., 2015) activation is applied upon the normalization result, to further process the fused embeddings:

$$\mathrm{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2$$

where $x$ is a 2D matrix with the same shape as the query sentence $Q$, and $W_1$, $b_1$, $W_2$, $b_2$ are learnt parameters. The result $\mathrm{FFN}(x)$ is a 2D matrix that has the same shape as $x$. $\mathrm{FFN}(x)$ is then residually added to $x$, and the fusion result is normalized to give the final output.
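The data flow of the Attentive Module (attention, residual connection, normalization, feed-forward network, second residual connection and normalization) can be sketched in plain Python on lists of rows. This is an illustrative single-head sketch under stated assumptions, not the DAM or MSN implementation: all helper names are hypothetical, and `layer_norm` omits the learned gain and bias that a real layer normalization would carry.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(x, eps=1e-6):
    """Normalize each row to zero mean and unit variance (no learned gain/bias)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    d = len(q[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)

def ffn(x, w1, b1, w2, b2):
    """Two linear layers with a ReLU in between."""
    hidden = [[max(0.0, h + b) for h, b in zip(row, b1)] for row in matmul(x, w1)]
    return [[o + b for o, b in zip(row, b2)] for row in matmul(hidden, w2)]

def attentive_module(q, k, v, w1, b1, w2, b2):
    x = layer_norm(add(attention(q, k, v), q))         # attend, add query, normalize
    return layer_norm(add(ffn(x, w1, b1, w2, b2), x))  # FFN, residual, normalize
```

Calling `attentive_module(u, u, u, ...)` gives the self-attention use of the module, while passing a different sentence as key and value gives the cross-attention use.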

Context Selector
Given $U_i = [u_{i1}, \ldots, u_{ij}, \ldots, u_{iL}]$, where $u_{ij} \in \mathbb{R}^{T \times d}$ is the word-level embedding representation of the $j$-th utterance and $d$ is the dimension of the word vector, we use the Attentive Module to reconstruct the word representations of each utterance, encoding the context and dependency information into each word, which is formulated as:

$$U_i = \mathrm{AttentiveModule}(U_i, U_i, U_i)$$

We first discuss how to construct the "Hop1 Selector", which consists of a word selector and an utterance selector. To capture matching features at multiple levels of granularity, we leverage word-level and utterance-level matching features to select relevant context.

Word Selector
At word level, we utilize cross attention to obtain a matching feature map for each context utterance $u_{ij}$ and key $K_1 = u_{iL}$, which is formulated as:

$$A = v^\top \tanh(K_1^\top W U_i + b)$$

where $W \in \mathbb{R}^{d \times d \times h}$, $b \in \mathbb{R}^{h}$ and $v \in \mathbb{R}^{h \times 1}$. This gives a word alignment matrix $A \in \mathbb{R}^{L \times T \times T}$. Then, we extract the most prominent matching features from $A$ by max pooling over rows and columns, and concatenate them:

$$m_1(K_1, U_i) = \left[\max_{dim=2} A \,;\ \max_{dim=3} A\right]$$

where $m_1(K_1, U_i) \in \mathbb{R}^{L \times 2T}$ reflects which words have identical or similar meaning between utterance $u_{ij}$ and key $u_{iL}$. The matching features are transformed to the relevance score by a linear layer:

$$s_1 = \mathrm{softmax}(m_1(K_1, U_i)\, c + b)$$

where $c \in \mathbb{R}^{2T \times 1}$ and $b \in \mathbb{R}^{L \times 1}$. The word selector can only capture word-level relevance between the key and the utterances. It cannot reflect whether the key and the context are compatible on the overall semantic level. Thus, we continue to evaluate the relevance on the utterance level.
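The pooling and scoring steps of the word selector can be illustrated with a small sketch that starts from already computed alignment matrices. It is a simplified illustration under assumptions: the names `pool_alignment` and `word_selector_scores` are hypothetical, the order of the two pooled halves is chosen arbitrarily, and the softmax over the $L$ utterances is an assumed normalization of the linear layer's output.

```python
import math

def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def pool_alignment(alignment):
    """Max-pool a T x T alignment matrix over rows and columns, then concatenate."""
    col_max = [max(col) for col in zip(*alignment)]  # strongest match per column
    row_max = [max(row) for row in alignment]        # strongest match per row
    return col_max + row_max                         # 2T features

def word_selector_scores(alignments, c, b):
    """alignments: L matrices of shape T x T; c: 2T weights; b: L biases."""
    features = [pool_alignment(a) for a in alignments]   # L x 2T
    logits = [sum(f * w for f, w in zip(row, c)) + bias  # linear layer
              for row, bias in zip(features, b)]
    return softmax(logits)  # assumption: normalize across the L utterances
```

An utterance whose words align strongly with the key receives large pooled features and therefore a large share of the score mass.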

Utterance Selector
Firstly, the word-level representations $U_i$ are transformed to utterance-level representations by mean pooling over the word dimension:

$$\bar{U}_i = \mathrm{mean}_{dim=2}\, U_i$$

where $\bar{U}_i \in \mathbb{R}^{L \times d}$. We use cosine similarity to measure the relevance between the key $K_2 = \bar{u}_{iL}$ (the last row of $\bar{U}_i$) and the context utterances $\bar{U}_i$, which is formulated as:

$$s_2 = \frac{\bar{U}_i K_2^\top}{\|\bar{U}_i\|_2 \cdot \|K_2\|_2}$$

where the norm of $\bar{U}_i$ is taken row by row and $s_2 \in \mathbb{R}^{L \times 1}$ is the relevance score at utterance level. The scores of both the word selector and the utterance selector are important for measuring the relevance of the last utterance and the context. In order to make full use of both selectors, we design a combined strategy to fuse the two scores. Specifically, we use the weighted sum of the two scores for selection:

$$s^{(1)} = \alpha \cdot s_1 + (1 - \alpha) \cdot s_2$$

where $\alpha$ is a hyper-parameter and $s^{(1)}$ is the final score that the hop1 selector produces. The default value of $\alpha$ is set to 0.5.
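The utterance selector and the weighted combination can be sketched as follows. This is a minimal illustration, assuming that each utterance is a list of $T$ word vectors, that the word-level scores `s1` are supplied by the caller, and that $\alpha$ weights the word-level score; the function names are hypothetical.

```python
import math

def mean_pool(utterance):
    """Average a T x d utterance over its words, giving a d-dimensional vector."""
    return [sum(col) / len(utterance) for col in zip(*utterance)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0  # guard against all-zero (padded) vectors

def hop1_scores(context, s1, alpha=0.5):
    """Combine word-level scores s1 with utterance-level cosine scores."""
    pooled = [mean_pool(u) for u in context]   # L x d
    key = pooled[-1]                           # the last utterance is the key
    s2 = [cosine(p, key) for p in pooled]      # utterance-level relevance
    return [alpha * a + (1 - alpha) * b for a, b in zip(s1, s2)]

context = [
    [[1.0, 0.0], [1.0, 0.0]],  # same direction as the last utterance
    [[0.0, 1.0], [0.0, 1.0]],  # orthogonal to the last utterance
    [[1.0, 0.0], [3.0, 0.0]],  # last utterance (the key)
]
scores = hop1_scores(context, s1=[0.2, 0.4, 0.6])
```

With the default $\alpha = 0.5$, the two selectors contribute equally, so an utterance has to look relevant at both levels to obtain a high score.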

Hopk Selector
Although the "Hop1 Selector" can choose proper context utterances that are related to the last dialogue session, we find that there are many samples whose last utterance contains very little information (such as "good", "ok"), which causes the selector to lose too much useful context information. Thus, we combine it with $u_{i,L-1}, u_{i,L-2}, \ldots, u_{i,L-k}$ by mean pooling. Then, we treat the result as the key and conduct the same process as the "Hop1 Selector" for context selection. In this way, we get k different selectors, yielding k different scores $S = [s^{(1)}, s^{(2)}, \ldots, s^{(k)}] \in \mathbb{R}^{L \times k}$.
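The construction of the multi-hop keys can be sketched as below. The exact indexing is an assumption made for illustration: here the key of hop $k$ is the element-wise mean of the last $k$ utterances, so that hop 1 reduces to the last utterance alone, and all utterances are assumed to be padded to the same length $T$. The names `hop_key` and `hop_keys` are hypothetical.

```python
def hop_key(context, k):
    """Average the last k utterances (each T x d) element-wise into one T x d key."""
    last = context[-k:]
    n_words, dim = len(last[0]), len(last[0][0])
    return [[sum(u[t][j] for u in last) / len(last) for j in range(dim)]
            for t in range(n_words)]

def hop_keys(context, max_hops):
    """Keys for hop 1 .. max_hops; each key feeds one selector."""
    return [hop_key(context, k) for k in range(1, max_hops + 1)]

# Three utterances with T = 1 word and d = 2 dimensions each.
context = [[[1.0, 1.0]], [[2.0, 4.0]], [[4.0, 6.0]]]
keys = hop_keys(context, 3)
```

Each key is then passed through the same word and utterance selectors as in hop 1, which yields one score column per hop.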

Context Fusion
Then we fuse the similarity scores from different selectors and apply them to select relevant context utterances for matching. Firstly, we combine the similarity scores $S \in \mathbb{R}^{L \times k}$ into a final score $s$ for each context utterance and filter out irrelevant context. The combination uses a dynamic weight vector $W \in \mathbb{R}^{1 \times k}$ that is tuned by the gradient, and the filtering uses a threshold $\gamma$ that is tuned according to the dataset. The default value of $\gamma$ can be set to 0.5. The utterances whose scores are below $\gamma$ are allocated lower weights or filtered out. Then, we multiply the mask weight $s$ and the context utterances to filter irrelevant context:

$$\hat{U}_i = s \odot U_i$$

which generates $\hat{U}_i \in \mathbb{R}^{L \times T \times d}$, where $U_i \in \mathbb{R}^{L \times T \times d}$ is the original utterance tensor and $s$ is broadcast over the word and embedding dimensions.
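One way to picture the fusion and filtering step is the sketch below. It is only an illustration under explicit assumptions: the fused score is taken to be a plain weighted sum of the $k$ hop scores, and the filter is taken to be a hard mask that zeroes utterances whose fused score is below $\gamma$ while scaling the others by their score. The actual model may normalize or threshold the scores differently, and the name `fuse_and_filter` is hypothetical.

```python
def fuse_and_filter(scores, weights, context, gamma=0.5):
    """scores: L x k hop scores; weights: k fusion weights; context: L x T x d."""
    # Assumption: fuse the k hop scores of each utterance by a weighted sum.
    fused = [sum(w * s for w, s in zip(weights, row)) for row in scores]
    # Assumption: hard mask; keep the score if it reaches gamma, else zero it.
    mask = [f if f >= gamma else 0.0 for f in fused]
    # Broadcast each utterance's mask weight over its words and dimensions.
    filtered = [[[m * x for x in word] for word in utterance]
                for m, utterance in zip(mask, context)]
    return mask, filtered

scores = [[0.9, 0.7], [0.1, 0.3]]         # L = 2 utterances, k = 2 hops
weights = [0.5, 0.5]                      # learned in the model, fixed here
context = [[[1.0, 2.0]], [[3.0, 4.0]]]    # T = 1 word, d = 2 dimensions
mask, filtered = fuse_and_filter(scores, weights, context)
```

In this toy input the second utterance falls below the threshold and is zeroed, so it contributes nothing to the later matching stage.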

Utterance-Response Matching
Similar to DAM (Zhou et al., 2018c), we utilize the self and cross matching paradigm to construct better matching feature maps.

Origin Matching
Given the filtered utterances $\hat{U}_i = [\hat{u}_{i1}, \ldots, \hat{u}_{ij}, \ldots, \hat{u}_{iL}]$ and the candidate response $r_i \in \mathbb{R}^{T \times d}$, we construct a word-word similarity matrix $M_1 \in \mathbb{R}^{L \times 2 \times T \times T}$ by dot product and cosine similarity, which are stacked along the channel dimension. The process can be formulated as:

$$M_1 = \left[\hat{U}_i A_1 r_i^\top \,;\ \cos(\hat{U}_i, r_i)\right]$$

where $A_1 \in \mathbb{R}^{d \times d}$ is a linear transformation matrix.

Self Matching
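Then, we use the Attentive Module over the word dimension to construct multi-grained representations, which is formulated as:

$$\hat{U}_i^{self} = \mathrm{AttentiveModule}(\hat{U}_i, \hat{U}_i, \hat{U}_i), \qquad r_i^{self} = \mathrm{AttentiveModule}(r_i, r_i, r_i)$$

By this means, the words in each utterance or candidate response are connected together to compose more holistic representations. Different from DAM (Zhou et al., 2018c), we do not stack many Attentive Module layers because doing so drastically increases the computational expense. Then, we use them to construct $M_2 \in \mathbb{R}^{L \times 2 \times T \times T}$, whose element is

$$M_2 = \left[\hat{U}_i^{self} A_2 \left(r_i^{self}\right)^\top \,;\ \cos(\hat{U}_i^{self}, r_i^{self})\right]$$

where $A_2 \in \mathbb{R}^{d \times d}$ is a linear transformation matrix.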

Cross Matching
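Similarly, we build the semantic association between every utterance and the response by the Attentive Module:

$$\hat{U}_i^{cross} = \mathrm{AttentiveModule}(\hat{U}_i, r_i, r_i), \qquad r_i^{cross} = \mathrm{AttentiveModule}(r_i, \hat{U}_i, \hat{U}_i)$$

In this way, we can make the inter-dependent segment pairs close to each other, and the alignment scores between those latently inter-dependent pairs increase, which better encodes the dependency relation into the representation.
Finally, we use $\hat{U}_i^{cross}$ and $r_i^{cross}$ to construct $M_3 \in \mathbb{R}^{L \times 2 \times T \times T}$, whose element is

$$M_3 = \left[\hat{U}_i^{cross} A_3 \left(r_i^{cross}\right)^\top \,;\ \cos(\hat{U}_i^{cross}, r_i^{cross})\right]$$

where $A_3 \in \mathbb{R}^{d \times d}$ is a linear transformation matrix.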

Aggregation
MSN aggregates all the matching matrices together as $M = [M_1; M_2; M_3] \in \mathbb{R}^{L \times 6 \times T \times T}$, applies a 2D CNN and max pooling for matching feature extraction, and uses a GRU to model the temporal relationship of utterances in the context, which is the same as SMN.
Then we compute the matching score $g(U_i, r_i)$ based on the matching features. Specifically, we use the final state of the GRU output $h_L$ as features and apply a single-layer perceptron to obtain the score:

$$g(U_i, r_i) = \sigma(W h_L + b)$$

where $W$ and $b$ are learnt parameters and $\sigma(\cdot)$ is the sigmoid activation function. Finally, the negative log-likelihood is used as the loss function to optimize the training process.
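The training objective can be written out explicitly. A minimal sketch, assuming the standard binary negative log-likelihood over the $N$ context-response pairs of the training set (the symbol $\mathcal{L}$ and the summation range are notation introduced here for illustration):

$$\mathcal{L} = -\sum_{i=1}^{N} \left[ y_i \log g(U_i, r_i) + (1 - y_i) \log\left(1 - g(U_i, r_i)\right) \right]$$

Because the sigmoid keeps $g(U_i, r_i)$ strictly between 0 and 1, both logarithms are finite. A pair with $y_i = 1$ is penalized when its score is low, and a pair with $y_i = 0$ is penalized when its score is high.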

Dataset
We test MSN on three widely used multi-turn response selection datasets: the Ubuntu Corpus (Lowe et al., 2015), the Douban Corpus, and the E-commerce Corpus (Zhang et al., 2018). Data statistics are in Table 3. Ubuntu Corpus consists of English multi-turn conversations about technical support collected from chat logs of the Ubuntu forum.
Douban Corpus contains dyadic dialogs (conversations between two persons) longer than 2 turns from the Douban group, which is a popular social networking service in China.
E-commerce Corpus is collected from real-world conversations between customers and customer service staff from Taobao, the largest e-commerce platform in China. The dataset contains diverse types of conversations (e.g., commodity consultation, logistics express, recommendation, and chitchat) concerning various commodities.

Evaluation Metric
Following previous works (Zhang et al., 2018; Chaudhuri et al., 2018; Tao et al., 2019), we employ recall at position k in n candidates ($R_n@k$) as the evaluation metric. Apart from $R_n@k$, we use MAP (Mean Average Precision), MRR (Mean Reciprocal Rank), and Precision-at-one (P@1) for the Douban corpus, the same as previous works (Tao et al., 2019), because some dialogues in the Douban corpus have more than one true candidate response.
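The ranking metrics used for the Douban corpus can be computed per context as sketched below and then averaged over all contexts to obtain MAP, MRR, and P@1. The function names are hypothetical, and ties between scores are broken by candidate order, which is an assumption of this sketch.

```python
def rank_labels(scores, labels):
    """Return the labels reordered from the highest to the lowest score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]

def average_precision(ranked):
    """Mean of precision values at the rank of each true response."""
    hits, total = 0, 0.0
    for position, relevant in enumerate(ranked, start=1):
        if relevant:
            hits += 1
            total += hits / position
    return total / hits if hits else 0.0

def reciprocal_rank(ranked):
    """Inverse rank of the first true response (0 if there is none)."""
    for position, relevant in enumerate(ranked, start=1):
        if relevant:
            return 1.0 / position
    return 0.0

def precision_at_1(ranked):
    """1 if the top-ranked candidate is a true response, else 0."""
    return float(ranked[0])

# Four candidates, two of which are true responses.
ranked = rank_labels([0.9, 0.2, 0.8, 0.1], [0, 1, 1, 0])
```

Unlike $R_n@k$ with a single positive, average precision rewards a model for ranking every true response high, which matters when a context has several correct candidates.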

Model Training
Our model was implemented in PyTorch (Paszke et al., 2017). Word embeddings were initialized with word2vec (Mikolov et al., 2013) trained on the dataset, and the dimensionality of word vectors is 200. The hyper-parameter k of selectors is set to 3. We use three convolution layers to extract matching features. The model is optimized with Adam (Kingma and Ba, 2014), whose parameters $\beta_1$ and $\beta_2$ are 0.9 and 0.999 respectively. The learning rate is initialized as 1e-3 and gradually decreased during training. As in previous works (Zhang et al., 2018), the maximum utterance length is 50 and the maximum context length (i.e., number of utterances) is 10. Table 4 shows the results of MSN and all baseline models on the datasets. All the experimental results of the baselines are cited from previous works (Zhang et al., 2018; Chaudhuri et al., 2018; Tao et al., 2019).

Experiment Result
Referring to the table, MSN significantly outperforms all other models in terms of most of the metrics on the three datasets, including MRFN, which was the state-of-the-art model at the time of submission. MSN extends SMN and DAM (Zhou et al., 2018c), and it achieves more than 3% absolute improvement on $R_{10}@1$ compared with SMN and DAM. The improvement also shows the importance of filtering irrelevant context before matching.
Further Analysis

Ablation Study
We perform a series of ablation experiments over the different parts of the model to investigate their relative importance. Firstly, we use the complete MSN as the baseline. Then, we gradually remove its modules as follows:
• w/o Word Selector: A model that is trained using the utterance selector but without the word selector.
• w/o Utterance Selector: A model that is trained without the utterance selector.
• w/o Selector: A model with all selector modules removed, using only the attention module for matching.
From Table 5, we can observe that: (1) Compared with MSN$_{base}$, removing the selectors leads to performance degradation, which shows that the multi-hop selectors indeed help to improve the selection performance.
(2) The performance drops by a large margin when the word selector or the utterance selector is removed, which shows that both selectors play an important role in selecting relevant context utterances.
(3) For the E-commerce dataset, the context selected by the Hop1 selector is more important than that of the other selectors. We think the main reason is that the dialogs in the E-commerce corpus happen between buyers and sellers on the Taobao platform. The intent of the dialogue is very clear and the dialogue is mainly in the form of one question and one answer, so the last dialogue session has little dependency on distant context. However, fusing the results of these hop selectors still brings further performance improvement.

Parameter Sensitivity
The choices of k for the selectors and of the threshold γ in formula (10) may influence the performance. Thus, we conduct a series of sensitivity analysis experiments on the development dataset to study how different choices of parameters influence the performance of the model. The value of k decides how many selectors MSN uses to select relevant context utterances. Referring to Figure 3 (a), using only the hop1 selector does not outperform using multiple selectors. However, the performance does not increase when k > 3. When k is too large, the key contains too much noise and cannot reflect the intention of the last dialogue session. Figure 3 (b) shows the performance with different values of the threshold γ. Intuitively, when γ is too large, the selectors filter out too much context, which may hurt performance. However, when γ is too small, the selectors filter out too little context to be effective. We can observe that MSN achieves the best performance when γ = 0.3 or 0.5.

Conclusion and Future Work
In this paper, we analyze the side effect of using unnecessary context utterances and verify that matching-based models are very sensitive to the context. We propose a multi-hop selector network to alleviate this problem. Empirical results on three large-scale datasets demonstrate the effectiveness of the model in multi-turn response selection and yield new state-of-the-art results.
In the future, we will study how to solve the logical consistency problem between utterances and candidate responses to improve selection performance.