One Time of Interaction May Not Be Enough: Go Deep with an Interaction-over-Interaction Network for Response Selection in Dialogues

Currently, researchers have paid great attention to retrieval-based dialogues in the open domain. In particular, people study the problem by investigating context-response matching for multi-turn response selection based on publicly recognized benchmark data sets. State-of-the-art methods require a response to interact with each utterance in a context from the beginning, but the interaction is performed in a shallow way. In this work, we let utterance-response interaction go deep by proposing an interaction-over-interaction network (IoI). The model performs matching by stacking multiple interaction blocks in which residual information from one time of interaction initiates the interaction process again. Thus, matching information within an utterance-response pair is extracted from the interaction of the pair in an iterative fashion, and the information flows along the chain of the blocks via representations. Evaluation results on three benchmark data sets indicate that IoI can significantly outperform state-of-the-art methods in terms of various matching metrics. Through further analysis, we also unveil how the depth of interaction affects the performance of IoI.


Introduction
Building chitchat-style dialogue systems in the open domain for human-machine conversations has attracted increasing attention in the conversational artificial intelligence (AI) community. Generally speaking, there are two approaches to implementing such a conversational system. The first approach leverages techniques of information retrieval (Lowe et al., 2015) and selects a proper response from an index, while the second approach directly synthesizes a response with a natural language generation model estimated from a large-scale conversation corpus (Serban et al., 2016; Li et al., 2017b). In this work, we study the problem of multi-turn response selection for retrieval-based dialogue systems where the input is a conversation context consisting of a sequence of utterances. Compared with generation-based methods, retrieval-based methods are superior in terms of response fluency and diversity, and thus have been widely applied in commercial chatbots such as the social bot XiaoIce (Shum et al., 2018) from Microsoft and the e-commerce assistant AliMe Assist from Alibaba Group (Li et al., 2017a). (Corresponding author: Rui Yan, ruiyan@pku.edu.cn.)
A key step in multi-turn response selection is to measure the matching degree between a conversation context and a response candidate. State-of-the-art methods perform matching within a representation-interaction-aggregation framework, where matching signals in each utterance-response pair are distilled from their interaction based on their representations, and then are aggregated as a matching score. Although utterance-response interaction has proven to be crucial to the performance of matching models, it is executed in a rather shallow manner, where matching between an utterance and a response candidate is determined by only one step of interaction on each type or each layer of representations. In this paper, we attempt to move from shallow interaction to deep interaction, and consider context-response matching with multiple steps of interaction, where residual information from one time of interaction, which is generally ignored by existing methods, is leveraged for additional interactions. The underlying motivation is that if a model extracts some matching information from utterance-response pairs in one step of interaction, then by stacking multiple such steps, the model can gradually accumulate useful signals for matching and finally capture the semantic relationship between a context and a response candidate in a more comprehensive way.
We propose an interaction-over-interaction network (IoI) for context-response matching, through which we aim to investigate: (1) how to make interaction go deep in a matching model; and (2) whether the depth of interaction really matters in terms of matching performance. A key component in IoI is an interaction block. Taking an utterance-response pair as input, the block first lets the utterance and the response attend to themselves, and then measures the interaction of the pair with an attention-based interaction function. The results of the interaction are concatenated with the self-attention representations and then compressed into new representations of the utterance-response pair as the output of the block. Built on top of the interaction block, IoI initializes each utterance-response pair via pre-trained word embeddings, and then passes the initial representations through a chain of interaction blocks which conduct several rounds of representation-interaction-representation operations and let the utterance and the response interact with each other in an iterative way. Different blocks could distill different levels of matching information in an utterance-response pair. To sufficiently leverage the information, a matching score is first calculated in each block through aggregating matching vectors of all utterance-response pairs, and then the block-wise matching scores are combined as the final matching degree of the context and the response candidate.
We conduct experiments on three benchmark data sets: the Ubuntu Dialogue Corpus (Lowe et al., 2015), the Douban Conversation Corpus, and the E-commerce Dialogue Corpus (Zhang et al., 2018b). Evaluation results indicate that IoI with 7 interaction blocks can significantly outperform state-of-the-art methods over all metrics on all three benchmarks. Compared with the deep attention matching network (DAM), the best performing baseline on all three data sets, IoI achieves 2.9% absolute improvement on R10@1 on the Ubuntu data, 2.3% absolute improvement on MAP on the Douban data, and 3.7% absolute improvement on R10@1 on the E-commerce data. Through more quantitative analysis, we also show that depth indeed brings improvement to the performance of IoI, as IoI with 1 interaction block performs worse than DAM on the Douban data and the E-commerce data, and on the Ubuntu data, the gap on R10@1 between IoI and DAM is only 1.1%. Moreover, the improvement brought by depth mainly comes from short contexts.
Our contributions in this paper are threefold: (1) proposal of a novel interaction-over-interaction network which enables deep-level matching with carefully designed interaction block chains; (2) empirical verification of the effectiveness of the model on three benchmarks; and (3) an empirical study of the relationship between interaction depth and model performance.

Related Work
Existing methods for building an open-domain dialogue system can be categorized into two groups. The first group learns response generation models under an encoder-decoder framework. On top of the basic sequence-to-sequence with attention architecture (Vinyals and Le, 2015; Shang et al., 2015; Tao et al., 2018), various extensions have been made to tackle the "safe response" problem (Mou et al., 2016; Zhao et al., 2017; Song et al., 2018); to generate responses with specific personas or emotions (Li et al., 2016a); and to pursue better optimization strategies (Li et al., 2017b, 2016b). The second group learns a matching model of a human input and a response candidate for response selection. Along this line, the focus of research starts from single-turn response selection, where the human input is a single message (Wang et al., 2013; Hu et al., 2014; Wang et al., 2015), and has recently moved to context-response matching for multi-turn response selection. Representative methods include the dual LSTM model (Lowe et al., 2015), the deep learning to respond architecture, the multi-view matching model (Zhou et al., 2016), the sequential matching network, and the deep attention matching network. Besides model design, some attention is also paid to the learning problem of matching models (Wu et al., 2018a). Our work belongs to the second group. The proposed interaction-over-interaction network is unique in that it performs matching by stacking multiple interaction blocks, and thus extends the shallow interaction in state-of-the-art methods to a deep form. As far as we know, this is the first architecture that realizes deep interaction for multi-turn response selection.

Encouraged by the big success of deep neural architectures such as ResNet (He et al., 2016) and Inception (Szegedy et al., 2015) in computer vision, researchers have studied whether they can achieve similar results with deep neural networks on NLP tasks. Although deep models have not yet brought breakthroughs to NLP as they have to computer vision, they have proven effective in a few tasks such as text classification (Conneau et al., 2017), natural language inference (Kim et al., 2018; Tay et al., 2018), and question answering (Tay et al., 2018; Kim et al., 2018). In this work, we attempt to improve the accuracy of multi-turn response selection in retrieval-based dialogue systems by increasing the depth of context-response interaction in matching. Through extensive studies on benchmarks, we show that depth can bring significant improvement to model performance on the task.

Problem Formalization
Suppose that there is a conversation data set D = {(y_i, c_i, r_i)}_{i=1}^{N}, where c_i = {u_{i,1}, . . . , u_{i,l_i}} represents a conversation context with u_{i,k} the k-th turn, r_i is a response candidate, and y_i ∈ {0, 1} denotes a label with y_i = 1 indicating that r_i is a proper response for c_i, and otherwise y_i = 0. The task is to learn a matching model g(·, ·) from D, such that for a new context-response pair (c, r), g(c, r) measures the matching degree between c and r.
In the following sections, we will elaborate how to define g(·, ·) to achieve deep interaction between c and r, and how to learn such a deep model from D.

Interaction-over-Interaction Network
We define g(·, ·) as an interaction-over-interaction network (IoI). Figure 1 illustrates the architecture of IoI. The model pairs each utterance in a context with a response candidate, and then aggregates matching information from all the pairs as a matching score of the context and the response candidate. For each pair, IoI starts from initial representations of the utterance and the response, and then feeds the pair to stacked interaction blocks. Each block represents the utterance and the response by letting them interact with each other based on the interactions before. Matching signals are first accumulated along the sequence of the utterances in each block, and then combined along the chain of blocks as the final matching score. Below we will describe details of components of IoI and how to learn the model with D.

Initial Representations
Given an utterance u in a context c and a response candidate r, u and r are initialized as E_u = [e_{u,1}, · · · , e_{u,m}] and E_r = [e_{r,1}, · · · , e_{r,n}] respectively. ∀i ∈ {1, . . . , m} and ∀j ∈ {1, . . . , n}, e_{u,i} and e_{r,j} are representations of the i-th word of u and the j-th word of r respectively, which are obtained by pre-training Word2vec (Mikolov et al., 2013) on D. E_u and E_r are then processed by stacked interaction blocks that model different levels of interaction between u and r and generate matching signals.
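The initialization step above amounts to an embedding lookup. As a minimal sketch (the vocabulary, random embedding table, and function name below are hypothetical stand-ins for a Word2vec model pre-trained on D):

```python
import numpy as np

# Hypothetical sketch: turning a tokenized utterance into its initial
# representation E_u by looking up rows of a pre-trained embedding table.
# `vocab` and `embedding` stand in for a Word2vec model trained on D.
d = 4                                   # embedding size (200 in the paper)
vocab = {"<pad>": 0, "how": 1, "are": 2, "you": 3}
rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(vocab), d))

def init_repr(tokens, vocab, embedding):
    """Stack word embeddings into a (len(tokens), d) matrix."""
    ids = [vocab.get(t, vocab["<pad>"]) for t in tokens]
    return embedding[ids]

E_u = init_repr(["how", "are", "you"], vocab, embedding)   # shape (3, d)
```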

Interaction Block
The stacked interaction blocks share the same internal structure. In a nutshell, each block is composed of a self-attention module that captures long-term dependencies within an utterance and a response, an interaction module that models the interaction between the utterance and the response, and a compression module that condenses the results of the first two modules into representations of the utterance and the response as output of the block. The output is then utilized as the input of the next block.
Before diving into details of the block, we first describe an attention mechanism that lays a foundation for both the self-attention module and the interaction module. Let Q ∈ R^{n_q×d} and K ∈ R^{n_k×d} be a query and a key respectively, where n_q and n_k denote numbers of words and d is the embedding size; then attention from Q to K is defined as

Q̃ = softmax(S(Q, K)) · K, (1)

where S(·, ·) is a function for attention weight calculation. Here, we exploit the symmetric function in (Huang et al., 2017b) as S(·, ·), which is given by

S(Q, K) = f(Q · W) · D · f(K · W)^T. (2)

In Equation (2), f is a ReLU activation function, D is a diagonal matrix, and both D ∈ R^{d×d} and W ∈ R^{d×d} are parameters to estimate from training data. Intuitively, in Equation (1), each entry of K is weighted by an importance score defined by the similarity of an entry of Q and an entry of K.
The entries of K are then linearly combined with the weights to form a new representation of Q. A residual connection (He et al., 2016) and a layer normalization (Ba et al., 2016) are then applied to Q̃, yielding Q̄ = LayerNorm(Q + Q̃). After that, Q̄ is fed to a feed-forward network, which is formulated as

FFN(Q̄) = f(Q̄ · W_1 + b_1) · W_2 + b_2, (3)

where W_{1,2} ∈ R^{d×d} and b_{1,2} are parameters. The output of the attention mechanism is defined as the result of Equation (3) after another round of residual connection and layer normalization. For ease of presentation, we denote the entire attention mechanism as f_ATT(Q, K).

Let U^{k-1} and R^{k-1} be the input of the k-th block, where U^0 = E_u and R^0 = E_r; then the self-attention module is defined as

Û^k = f_ATT(U^{k-1}, U^{k-1}), (4)
R̂^k = f_ATT(R^{k-1}, R^{k-1}). (5)

The interaction module first lets U^{k-1} and R^{k-1} attend to each other by

Ū^k = f_ATT(U^{k-1}, R^{k-1}), (6)
R̄^k = f_ATT(R^{k-1}, U^{k-1}). (7)

Then U^{k-1} and R^{k-1} further interact with Ū^k and R̄^k respectively, which can be formulated as

Ũ^k = U^{k-1} ⊙ Ū^k, (8)
R̃^k = R^{k-1} ⊙ R̄^k, (9)

where ⊙ denotes element-wise multiplication. Finally, the compression module updates U^{k-1} and R^{k-1} to U^k and R^k as the output of the block.
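The attention mechanism f_ATT described in this section can be sketched in numpy as follows. This is a minimal reconstruction from the prose, not the authors' implementation: the exact form of S(·, ·) and the normalization details are our assumptions, and all parameter names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each row to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def f_att(Q, K, params):
    """Sketch of f_ATT(Q, K): symmetric attention weights, weighted sum
    of K, then two rounds of residual connection + layer normalization
    around a feed-forward network."""
    W, D, W1, b1, W2, b2 = params
    S = np.maximum(Q @ W, 0) @ D @ np.maximum(K @ W, 0).T   # S(Q, K), f = ReLU
    Q_tilde = softmax(S, axis=-1) @ K                       # attend to K
    Q_bar = layer_norm(Q + Q_tilde)                         # residual + LN
    ffn = np.maximum(Q_bar @ W1 + b1, 0) @ W2 + b2          # feed-forward net
    return layer_norm(Q_bar + ffn)                          # residual + LN

d = 4
rng = np.random.default_rng(1)
params = (rng.normal(size=(d, d)), np.diag(rng.random(d)),
          rng.normal(size=(d, d)), np.zeros(d),
          rng.normal(size=(d, d)), np.zeros(d))
U = rng.normal(size=(5, d))       # a 5-word utterance
out = f_att(U, U, params)         # self-attention keeps the input shape
```

Self-attention corresponds to f_att(U, U, ·), and the cross-attention in the interaction module to f_att(U, R, ·).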
Suppose that e^k_{u,i} and e^k_{r,i} are the i-th entries of U^k and R^k respectively; then e^k_{u,i} and e^k_{r,i} are calculated by

e^k_{u,i} = [ê^k_{u,i}; ē^k_{u,i}; ẽ^k_{u,i}; e^{k-1}_{u,i}] · w_p + b_p, (10)
e^k_{r,i} = [ê^k_{r,i}; ē^k_{r,i}; ẽ^k_{r,i}; e^{k-1}_{r,i}] · w_p + b_p, (11)

where w_p ∈ R^{4d×d} and b_p are learnable projection weights and biases, and ê^k_{{u,r},i}, ē^k_{{u,r},i}, ẽ^k_{{u,r},i}, and e^{k-1}_{{u,r},i} are the i-th entries of {Û, R̂}^k, {Ū, R̄}^k, {Ũ, R̃}^k, and {U, R}^{k-1}, respectively. Inspired by previous work, we also introduce direct connections from the initial representations to all their corresponding subsequent blocks.
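The compression module above fuses four representations per word and projects them back to dimension d. A minimal numpy sketch (the fusion details follow our reading of the equations; names are hypothetical):

```python
import numpy as np

def compress(U_hat, U_bar, U_prev, w_p, b_p):
    """Compression module sketch: fuse the self-attention output (U_hat),
    the cross-attention output (U_bar), their element-wise interaction
    with the block input (U_tilde), and the input itself (U_prev) into
    the block output via a single linear projection w_p in R^{4d x d}."""
    U_tilde = U_prev * U_bar                       # element-wise interaction
    fused = np.concatenate([U_hat, U_bar, U_tilde, U_prev], axis=-1)  # (m, 4d)
    return fused @ w_p + b_p                       # (m, d)

d, m = 4, 5
rng = np.random.default_rng(2)
U_prev = rng.normal(size=(m, d))   # block input U^{k-1}
U_hat = rng.normal(size=(m, d))    # self-attention result
U_bar = rng.normal(size=(m, d))    # cross-attention result
w_p = rng.normal(size=(4 * d, d))
b_p = np.zeros(d)
U_k = compress(U_hat, U_bar, U_prev, w_p, b_p)     # block output U^k
```

Because the output has the same shape as the input, blocks can be chained, feeding U_k into the next block as its U_prev.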

Matching Aggregation
Suppose that c = (u_1, . . . , u_l) is a conversation context with u_i the i-th utterance; then in the k-th interaction block, we construct three similarity matrices by

M^k_{i,1} = U^{k-1}_i · (R^{k-1})^T / √d,
M^k_{i,2} = Û^k_i · (R̂^k)^T / √d, (12)
M^k_{i,3} = Ū^k_i · (R̄^k)^T / √d,

where U^{k-1}_i and R^{k-1} are the input of the k-th block, Û^k_i and R̂^k are defined by Equations (4-5), and Ū^k_i and R̄^k are calculated by Equations (6-7).
The three matrices are then concatenated into a 3-D matching tensor

T^k_i = M^k_{i,1} ⊕ M^k_{i,2} ⊕ M^k_{i,3} ∈ R^{m_i×n×3}, (13)

where ⊕ denotes a concatenation operation, and m_i and n refer to the numbers of words in u_i and r respectively. We exploit a convolutional neural network (Krizhevsky et al., 2012) to extract matching features from T^k_i. The final feature maps are flattened and mapped to a d-dimensional matching vector v^k_i with a linear transformation.
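Constructing the matching tensor can be sketched as follows; we assume scaled dot-product similarity between the three pairs of representations (the exact similarity form is our assumption, reconstructed from the surrounding text):

```python
import numpy as np

def matching_tensor(U_prev, R_prev, U_hat, R_hat, U_bar, R_bar, d):
    """Stack three word-by-word similarity matrices into a tensor of
    shape (m, n, 3): one channel per representation type."""
    M1 = U_prev @ R_prev.T / np.sqrt(d)   # block inputs
    M2 = U_hat @ R_hat.T / np.sqrt(d)     # self-attention representations
    M3 = U_bar @ R_bar.T / np.sqrt(d)     # cross-attention representations
    return np.stack([M1, M2, M3], axis=-1)

d, m, n = 4, 5, 6
rng = np.random.default_rng(3)
umats = [rng.normal(size=(m, d)) for _ in range(3)]
rmats = [rng.normal(size=(n, d)) for _ in range(3)]
T = matching_tensor(umats[0], rmats[0], umats[1], rmats[1],
                    umats[2], rmats[2], d)
```

The tensor T is then consumed by a CNN, in the same spirit as an image with three channels.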
The sequence {v^k_1, . . . , v^k_l} is then fed to a GRU (Chung et al., 2014) to capture the temporal relationship among (u_1, . . . , u_l). ∀i ∈ {1, . . . , l}, the i-th hidden state of the GRU is given by

h^k_i = GRU(v^k_i, h^k_{i-1}), (14)

where h^k_0 is randomly initialized. A matching score for context c and response candidate r in the k-th block is defined as

m_k = σ(h^k_l · w_o + b_o), (15)

where w_o and b_o are parameters, and σ(·) is a sigmoid function. Finally, g(c, r) is defined by

g(c, r) = (1/L) · Σ_{k=1}^{L} m_k, (16)

where L is the number of interaction blocks in IoI. Note that we define g(c, r) with all blocks rather than only with the last block. This is motivated by two considerations: (1) only using the last block would make training of IoI difficult due to the gradient vanishing/exploding problem; and (2) different blocks may capture different levels of matching information in (c, r), and thus leveraging all of them could enhance matching accuracy.
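The per-block aggregation can be sketched with a minimal GRU cell. Everything below is a hypothetical reconstruction: the GRU parameterization, the zero initial state (the paper initializes it randomly), and averaging as the way of combining block-wise scores are our assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h, v, p):
    """One step of a minimal GRU cell (hypothetical parameterization,
    biases omitted for brevity)."""
    Wz, Uz, Wr, Ur, Wh, Uh = p
    z = sigmoid(v @ Wz + h @ Uz)                 # update gate
    r = sigmoid(v @ Wr + h @ Ur)                 # reset gate
    h_tilde = np.tanh(v @ Wh + (r * h) @ Uh)     # candidate state
    return (1 - z) * h + z * h_tilde

def block_score(match_vectors, p, w_o, b_o):
    """Run the GRU over per-utterance matching vectors and score the
    block from the last hidden state with a sigmoid."""
    h = np.zeros(w_o.shape[0])                   # zero init (a simplification)
    for v in match_vectors:
        h = gru_step(h, v, p)
    return sigmoid(h @ w_o + b_o)

d, l, L = 4, 3, 2                                # dims, turns, blocks
rng = np.random.default_rng(4)
p = tuple(rng.normal(size=(d, d)) for _ in range(6))
w_o, b_o = rng.normal(size=d), 0.0
# one score m_k per interaction block, averaged into g(c, r)
scores = [block_score(rng.normal(size=(l, d)), p, w_o, b_o) for _ in range(L)]
g = sum(scores) / len(scores)
```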

Learning Methods
We consider two strategies to learn an IoI model from the training data D. The first strategy estimates the parameters of IoI (denoted as Θ) by minimizing a global loss function that is formulated as

L_global(Θ) = − Σ_{i=1}^{N} [ y_i log g(c_i, r_i) + (1 − y_i) log(1 − g(c_i, r_i)) ]. (17)

In the second strategy, we construct a local loss function for each block and minimize the summation of the local loss functions. By this means, each block can be directly supervised by the labels in D during learning. The learning objective is then defined as

L_local(Θ) = − Σ_{i=1}^{N} Σ_{k=1}^{L} [ y_i log m_k(c_i, r_i) + (1 − y_i) log(1 − m_k(c_i, r_i)) ]. (18)

We compare the two learning strategies through empirical studies, as will be reported in the next section. In both strategies, Θ is optimized using back-propagation with the Adam algorithm (Kingma and Ba, 2015).
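The difference between the two strategies is only where the cross-entropy supervision is applied: on the combined score per example, or on every block's score. A small numpy sketch (assuming all scores lie in (0, 1)):

```python
import numpy as np

def cross_entropy(y, p, eps=1e-12):
    """Element-wise binary cross-entropy, clipped for numerical safety."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def global_loss(y, g_scores):
    """First strategy: supervise only the combined score g(c_i, r_i)."""
    return cross_entropy(y, g_scores).sum()

def local_loss(y, block_scores):
    """Second strategy: supervise every block's score m_k(c_i, r_i) and
    sum the per-block losses. `block_scores` has shape (N, L)."""
    return cross_entropy(y[:, None], block_scores).sum()

y = np.array([1.0, 0.0])                  # labels for two examples
g_scores = np.array([0.9, 0.2])           # g(c_i, r_i)
block_scores = np.array([[0.8, 0.9],      # m_k(c_i, r_i), L = 2 blocks
                         [0.3, 0.2]])
loss_g = global_loss(y, g_scores)
loss_l = local_loss(y, block_scores)
```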

Experiments
We test the proposed IoI on three benchmark data sets for multi-turn response selection.

Experimental Setup
The first data set we use is the Ubuntu Dialogue Corpus (Lowe et al., 2015), a multi-turn English conversation data set constructed from chat logs of the Ubuntu forum. We use the version provided by Xu et al. (2017). The data contains 1 million context-response pairs for training, and 0.5 million pairs each for validation and test. In all three sets, positive responses are human responses, while negative ones are randomly sampled. The ratio of positive to negative is 1:1 in the training set, and 1:9 in both the validation set and the test set. Following Lowe et al. (2015), we employ recall at position k among n candidates (Rn@k) as evaluation metrics.
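For a single test context, Rn@k checks whether a positive response appears in the top k of the n ranked candidates; the metric is then averaged over contexts. A minimal sketch:

```python
def recall_at_k(ranked_labels, k):
    """R_n@k for one context: the fraction of positive responses that
    appear in the top k of the n ranked candidates (with one positive
    per context, as in the Ubuntu data, this is simply 1 or 0)."""
    total = sum(ranked_labels)
    return sum(ranked_labels[:k]) / total if total else 0.0

# candidates sorted by model score; labels mark the true response
labels = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]   # the positive is ranked 2nd of 10
r10_at_1 = recall_at_k(labels, 1)          # misses the top-1 cutoff
r10_at_2 = recall_at_k(labels, 2)          # hits the top-2 cutoff
```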
The second data set is the Douban Conversation Corpus, which consists of multi-turn Chinese conversations collected from Douban groups. There are 1 million context-response pairs for training, 50 thousand pairs for validation, and 6,670 pairs for testing. In the training set and the validation set, the last turn of each conversation is taken as a positive response, and a negative response is randomly sampled. For each context in the test set, 10 response candidates are retrieved from an index, and their appropriateness with regard to the context is annotated by human labelers. Following previous work on this data set, we employ Rn@k, mean average precision (MAP), mean reciprocal rank (MRR), and precision at position 1 (P@1) as evaluation metrics.
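Because test contexts in the Douban data can have multiple relevant candidates, ranking metrics are used. Per-context versions of the three metrics can be sketched as follows (MAP and MRR then average these values over all contexts):

```python
def precision_at_1(ranked_labels):
    """P@1 for one context: is the top-ranked candidate relevant?"""
    return float(ranked_labels[0])

def reciprocal_rank(ranked_labels):
    """1 / rank of the first relevant candidate (0 if none)."""
    for i, y in enumerate(ranked_labels, start=1):
        if y:
            return 1.0 / i
    return 0.0

def average_precision(ranked_labels):
    """Mean of precision-at-rank over the ranks of relevant candidates."""
    hits, total = 0, 0.0
    for i, y in enumerate(ranked_labels, start=1):
        if y:
            hits += 1
            total += hits / i
    return total / max(hits, 1)

labels = [0, 1, 0, 1, 0]          # two relevant responses, ranked 2nd and 4th
p1 = precision_at_1(labels)        # 0.0
rr = reciprocal_rank(labels)       # 1/2
ap = average_precision(labels)     # (1/2 + 2/4) / 2 = 0.5
```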
Finally, we choose the E-commerce Dialogue Corpus (Zhang et al., 2018b) as an experimental data set. The data consists of multi-turn real-world conversations between customers and customer service staff on Taobao, the largest e-commerce platform in China. It contains 1 million context-response pairs for training, and 10 thousand pairs each for validation and test. Positive responses in this data are real human responses, and negative candidates are automatically constructed by ranking the response corpus based on conversation-history-augmented messages using Apache Lucene. The ratio of positive to negative is 1:1 in training and validation, and 1:9 in test. Following Zhang et al. (2018b), we employ R10@1, R10@2, and R10@5 as evaluation metrics.

Baselines
We compare IoI with the following models: Single-turn Matching Models: these models, including RNN (Lowe et al., 2015), CNN (Lowe et al., 2015), LSTM (Lowe et al., 2015), BiLSTM (Kadlec et al., 2015), MV-LSTM (Wan et al., 2016), and Match-LSTM (Wang and Jiang, 2016), perform context-response matching by concatenating all utterances in a context into a single long document and calculating a matching score between the document and a response candidate.
Multi-View (Zhou et al., 2016): the model calculates matching degree between a context and a response candidate from both a word sequence view and an utterance sequence view.
DL2R: the model first reformulates the last utterance in a context with previous turns using different approaches. A response candidate and the reformulated message are then represented by a composition of RNN and CNN. Finally, a matching score is computed from the concatenation of the representations.
SMN: the model lets each utterance in a context interact with a response candidate at the beginning, and then transforms the interaction matrices into a matching vector with a CNN. The matching vectors are finally accumulated with an RNN into a matching score.
DUA (Zhang et al., 2018b): the model considers the relationship among utterances within a context by exploiting deep utterance aggregation to form a fine-grained context representation. Each refined utterance then matches with a response candidate, and their matching degree is finally calculated through an aggregation on turns.
DAM: the model lets each utterance in a context interact with a response candidate at different levels of representations obtained by a stacked self-attention module and a cross-attention module.
For the Ubuntu data and the Douban data, since results of all baselines under fine-tuning are available in previous work, we directly copy the numbers from the corresponding paper. For the E-commerce data, Zhang et al. (2018b) report the performance of all baselines except DAM. Thus, we copy all available numbers from that paper and implement DAM with the published code. In order to conduct statistical tests, we also run the code of DAM on the Ubuntu data and the Douban data.

Implementation Details
In IoI, we set the size of word embeddings as 200. For the CNN in matching aggregation, we set the window size of the convolution and pooling kernels as (3, 3), and the strides as (1, 1) and (3, 3) respectively. The number of convolution kernels is 32 in the first layer and 16 in the second layer. The dimension of the hidden states of the GRU is set as 200. Following previous work, we limit the length of a context to 10 turns and the length of an utterance (either from a context or from a response candidate) to 50 words. Truncation or zero-padding is applied to a context or a response candidate when necessary. We gradually increase the number of interaction blocks (i.e., L) in IoI, and finally set L = 7 in comparison with the baseline models. In optimization, we choose 0.2 as the dropout rate and 50 as the size of mini-batches. The learning rate is initialized as 0.0005, and exponentially decayed during training.

Discussions
In this section, we conduct further analysis with IoI-local to understand (1) how the depth of interaction affects the performance of IoI; (2) how context length affects the performance of IoI; and (3) the importance of different components of IoI with respect to matching accuracy.
Impact of interaction depth. Figure 2 illustrates how the performance of IoI changes with respect to the number of interaction blocks on the test sets of the three data sets. From the chart, we observe a consistent trend over the three data sets: there is significant improvement during the first few blocks, and then the performance of the model becomes stable. The results indicate that the depth of interaction indeed matters in terms of matching accuracy. With shallow interaction (L = 1), IoI performs worse than DAM on the Douban data and the E-commerce data. Only after the interaction goes deep (L ≥ 5) does the improvement of IoI over DAM on the two data sets become significant. On the Ubuntu data, the improvement over DAM from the deep model (L = 7) is more than twice as much as that from the shallow model (L = 1). The performance of IoI becomes stable earlier on the Ubuntu data than on the other two data sets. This may stem from the different nature of the test sets of the three data sets: the test set of the Ubuntu data is large and built by random sampling, while the test sets of the other two data sets are smaller and constructed through response retrieval.
Impact of context length. Context length is measured by (1) the number of turns in a context and (2) the average length of utterances in a context. Figure 3 shows how the performance of IoI varies across contexts of different lengths, where we bin test examples of the Ubuntu data into buckets and compare IoI (L = 7) with its shallow version (L = 1) and DAM. We find that (1) IoI, either in a deep form or in a shallow form, is good at dealing with contexts containing long utterances, as the model achieves better performance on longer utterances; (2) overall, IoI performs well on contexts with more turns, although contexts with too many turns (e.g., ≥ 8) remain challenging; (3) the deep form of our model is always better than its shallow form, no matter how we measure context length, and the gap between the two forms is bigger on short contexts than on long contexts, indicating that depth mainly improves matching accuracy on short contexts; and (4) the trends of DAM in both charts are consistent with those reported in the original paper, and on both short contexts and long contexts, IoI is superior to DAM.
Ablation study. Finally, we examine how different components of IoI affect its performance. First, we remove e^{k-1}_{u,i} (e^{k-1}_{r,i}), ê^k_{u,i} (ê^k_{r,i}), ē^k_{u,i} (ē^k_{r,i}), and ẽ^k_{u,i} (ẽ^k_{r,i}) one by one from Equation (10) and Equation (11), and denote the resulting models as IoI-E, IoI-Ê, IoI-Ē, and IoI-Ẽ respectively. Then, we keep all representations in Equation (10) and Equation (11), and remove M^k_{i,1}, M^k_{i,2}, and M^k_{i,3} one by one from Equation (13). The models are named IoI-M1, IoI-M2, and IoI-M3 respectively. Table 3 reports the ablation results. We conclude that (1) all representations are useful in representing the information flow along the chain of interaction blocks and in capturing the matching information of an utterance-response pair within the blocks, as removing any component generally causes a performance drop on all three data sets; and (2) in terms of component importance, Ê > Ē > E > Ẽ and M2 > M1 ≈ M3, meaning that self-attention (i.e., Ê) and cross-attention (i.e., Ē) are more important than the other representations in the information flow, and the self-attention representations (i.e., those used for calculating M2) convey more matching signals. Note that these results are obtained with IoI (L = 7). We also check the ablation results of IoI (L = 1) and do not see much difference in the overall trends and relative gaps among the ablated models.

Conclusions and Future Work
We present an interaction-over-interaction network (IoI) that lets utterance-response interaction in context-response matching go deep. The depth of the model comes from stacking multiple interaction blocks that execute representation-interaction-representation in an iterative manner. Evaluation results on three benchmarks indicate that IoI can significantly outperform baseline methods with moderate depth. In the future, we plan to integrate our IoI model with models like ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018) to study whether the performance of IoI can be further improved.