Intra-/Inter-Interaction Network with Latent Interaction Modeling for Multi-turn Response Selection

Multi-turn response selection has been extensively studied and applied to many real-world applications in recent years. However, current methods typically model the interactions between multi-turn utterances and candidate responses with iterative approaches, which is impractical because the number of conversation turns varies. Besides, some latent features, such as user intent and conversation topic, are under-explored in existing works. In this work, we propose the Intra-/Inter-Interaction Network (I^3) with latent interaction modeling to comprehensively model multi-level interactions between the utterance context and the response. Specifically, we first encode the intra- and inter-utterance interactions with the given response from both the individual utterances and the overall utterance context. Then we develop a latent multi-view subspace clustering module to model the latent interaction between the utterance and response. Experimental results show that the proposed method substantially and consistently outperforms existing state-of-the-art methods on three multi-turn response selection benchmark datasets.


Introduction
Recent years have witnessed many successful real-world applications of chatbots and AI assistants, such as XiaoIce (Shum et al., 2018) from Microsoft and the E-commerce assistant AliMe from Alibaba Group, which owe much to extensive research on dialogue systems. Existing works on building conversational models mainly study generation-based (Wen et al., 2017) or retrieval-based methods (Lowe et al., 2015). In this work, we focus on the problem of multi-turn response selection for retrieval-based dialogue systems, which aims at selecting appropriate responses from a set of candidates as the reply to the given multi-turn utterances. Measuring the matching degree between the utterance context and the candidate response is the core of the multi-turn response selection task. Recent works develop a variety of interaction models to enhance the utterance-response interaction from a broader (Zhou et al., 2018b; Tao et al., 2019a) or deeper perspective (Tao et al., 2019b; Yuan et al., 2019). Empirical evidence shows that iterative architectures achieve state-of-the-art performance on multi-turn response selection, such as interaction-over-interaction (Tao et al., 2019b), iterated attentive matching, and multi-hop selector (Yuan et al., 2019).
Despite the effectiveness of these methods, the multi-turn response selection task still faces several challenges when modeling the interaction between the utterance context and the response: (i) In order to capture the interaction information between a candidate response and multi-turn utterances, most existing iterative architectures may require a deeper or more complex network structure as the number of conversation turns grows, which hinders efficient learning of the multi-turn utterance representations. (ii) Existing methods mainly focus on measuring the semantic relevancy between the response and the given utterance context. Nevertheless, researchers observe that some latent features in conversations, such as user intent (Wen et al., 2017; Perkins and Yang, 2019) or conversation topic (Yoon et al., 2018; Yoon et al., 2019), are also of great importance in dialogue systems, yet they have received little attention in recent multi-turn response selection studies.
In this work, we propose the Intra-/Inter-Interaction Network (I^3) with latent interaction modeling to tackle the aforementioned issues. Specifically, we adopt a hierarchical structure instead of an iterative structure to model the multi-level interactions in the multi-turn conversation, including the intra-utterance interaction between the response and each individual utterance, and the inter-utterance interaction between the response and the overall utterance context. Such comprehensive sentence representation learning enables each utterance to be encoded with rich information for mining the latent features. Besides, subspace clustering (Ji et al., 2017; Zhou et al., 2018a), which aims to cluster the data into multiple subspaces and find a low-dimensional subspace for each class of data in an unsupervised manner, can be an effective approach to learning the latent feature representations without human-annotated labels. In dialogue systems, the utterance context and the response can be regarded as two independent views of the data (Perkins and Yang, 2019), and the latent representations from both views must be learned in a common space to model the coherency of their latent features. Inspired by the latest multi-view subspace clustering studies (Zhang et al., 2020a), we propose two kinds of latent multi-view subspace clustering modules, namely linear and generalized Latent Multi-view Subspace Clustering (lLMSC and gLMSC), to capture the latent features. They first encode the utterance and the response into view-specific latent representations, and then project them into the same subspaces for multi-view clustering. Finally, we aggregate the three-level interaction information, including the intra-/inter-utterance interaction and the latent feature matching information, to comprehensively measure the matching degree between the utterance context and the candidate response.
To summarize, the main contributions of this work are as follows: (1) We propose a novel multi-turn response selection model, the Intra-/Inter-Interaction Network (I^3), to capture multi-level matching information by modeling multi-turn conversations as a hierarchical structure; (2) We develop two kinds of latent multi-view subspace clustering modules to model the latent feature coherency between the utterance and response; (3) Experimental results show that the proposed method substantially and consistently outperforms existing state-of-the-art methods on three multi-turn dialogue benchmark datasets.

Related Works
Existing methods for building intelligent dialogue systems can be categorized into retrieval-based methods (Lowe et al., 2015), generation-based methods (Wen et al., 2017), and hybrid methods (Song et al., 2018). Besides, current studies on conversational systems have evolved from single-turn (Lowe et al., 2015; Kadlec et al., 2015) to multi-turn scenarios. In this work, we focus on retrieval-based methods for multi-turn response selection.
The key to matching the response and the given utterance context is modeling the interaction between them, which is mainly addressed by deep learning models in current studies, such as CNN (Kadlec et al., 2015), RNN (Lowe et al., 2015), and hybrid models. Based on these deep neural networks, some recent works further develop diverse and effective approaches to measure the relevance between the response and the utterances, such as integrating multi-view matching information (Zhou et al., 2016), modeling sequential utterance information, and refinement and aggregation schemes. Inspired by recent progress on the transformer model (Vaswani et al., 2017), the latest studies on multi-turn response selection have moved to a new stage with carefully designed self-attention-based interaction networks, including the deep attention matching network (Zhou et al., 2018b), the multi-representation fusion network (Tao et al., 2019a), the interaction-over-interaction network (Tao et al., 2019b), and the multi-hop selector network (Yuan et al., 2019). In this work, we facilitate the interaction modeling by considering both intra- and inter-utterance interaction with a hierarchical encoder.
Apart from measuring the semantic and contextual relevancy, several efforts have been made to discover latent features in conversations for modeling the intent or topic coherency between the utterance and the response. Yoon et al. (2018) and Yoon et al. (2019) incorporate latent clustering into context-based response/answer selection models to capture latent topic information. Yang et al. (2018), among others, leverage human-annotated conversational intent labels to model user intent in information-seeking conversations to help response selection. In this paper, we study latent multi-view subspace clustering to measure the utterance-response coherency consistently in the latent subspace.

[Figure 1: Overview of the proposed I^3 model. Recoverable labels: Utterance-1 ... Utterance-n; Self-attention Layer; Dual-attention Layer; Intra-utterance Encoder; Inter-utterance Encoder; Origin Interaction; Intra-utterance Interaction.]

Problem Definition
Suppose that there is a conversation dataset $D = \{(U_t, r_t, y_t)\}$, where $U_t = \{u_t^1, u_t^2, \ldots, u_t^n\}$ represents a conversation context with $u_t^i$ as the $i$-th turn utterance in the $t$-th sample. $r_t$ and $y_t$ are the response candidate and the corresponding label, i.e., whether $r_t$ is an appropriate response given $U_t$. The goal is to learn a model $g(\cdot)$ from $D$ to measure the matching degree between $U_t$ and $r_t$. For simplicity, we omit $t$ in the following notations.
We propose an Intra-/Inter-Interaction Network (I^3) with latent interaction modeling to model $g(\cdot)$. The overview of the proposed model is depicted in Figure 1.

Attention Module
Following prior success on multi-turn response selection (Zhou et al., 2018b; Yuan et al., 2019), we employ the Attentive Module proposed by Zhou et al. (2018b), a variant of the original transformer block (Vaswani et al., 2017), as the basic component of the proposed hierarchical transformer encoder.
The Attention Module is denoted as $\text{Attention}(Q, K, V)$, with three inputs: the query vectors $Q \in \mathbb{R}^{l_q \times d}$, the key vectors $K \in \mathbb{R}^{l_k \times d}$, and the value vectors $V \in \mathbb{R}^{l_v \times d}$, where $l_q$, $l_k$, and $l_v$ denote the length of each input and $d$ is the dimension of the embedding. The Attention Module first conducts Scaled Dot-Product Attention to apply attention weights to the value vectors: $V_{att} = \text{softmax}(QK^\top / \sqrt{d})\,V$. Then, $V_{att}$ and $Q$ are added together and passed through a layer normalization operation: $x = \text{LayerNorm}(V_{att} + Q)$. A feed-forward network (FFN) with ReLU activation is applied to the normalization result $x$, and the output of the FFN is residually added to $x$. Finally, another layer normalization is applied to obtain the final output: $\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$ and $\text{Attention}(Q, K, V) = \text{LayerNorm}(\text{FFN}(x) + x)$, where $W_1$, $b_1$, $W_2$, $b_2$ are parameters to be learned.
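To make the data flow of the Attention Module concrete, the following is a minimal pure-Python sketch of the steps described above (scaled dot-product attention, residual connection, layer normalization, ReLU feed-forward network). It is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the layer normalization has no learned gain or bias, and the toy weights are identity matrices.

```python
import math

def matmul(a, b):
    # Plain matrix product of two lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(x, eps=1e-6):
    # Assumption: normalization over the feature dimension, no learned gain/bias.
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def attention_module(Q, K, V, W1, b1, W2, b2):
    d = len(Q[0])
    scores = matmul(Q, transpose(K))                           # (l_q, l_k)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    v_att = matmul(weights, V)                                 # (l_q, d)
    x = layer_norm(add(v_att, Q))                              # residual + norm
    hidden = [[max(0.0, h + b) for h, b in zip(row, b1)] for row in matmul(x, W1)]
    ffn = [[o + b for o, b in zip(row, b2)] for row in matmul(hidden, W2)]
    return layer_norm(add(ffn, x))                             # residual + norm

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

# Toy inputs (hypothetical values): l_q = 2, l_k = l_v = 3, d = 4.
Q = [[1.0, 0.0, 2.0, -1.0], [0.5, 1.5, -0.5, 0.0]]
K = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
V = K
out = attention_module(Q, K, V, identity(4), [0.0] * 4, identity(4), [0.0] * 4)
```

The output keeps the shape of the query, which is what allows the same module to serve as a self-attention layer (Q = K = V) and as a dual-attention layer (Q from one sentence, K and V from the other).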

Intra-utterance Encoder
The intra-utterance encoder encodes the information of each individual utterance and of the response. The intra-utterance encoder layer in I^3 consists of two kinds of attention modules, a Self-attention Layer and a Dual-attention Layer. The Self-attention Layer attends to the important word-level information within each individual utterance and the response sentence; its inputs are the embeddings $E_u$ and $E_r$ of the input sequences, where $L$ denotes the length of a sentence. The Dual-attention Layer captures the relevant information between each utterance and the response sentence.
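One plausible way to write the two layers in terms of the Attention Module is given below. This is a hedged reconstruction, not the paper's own equations: the output symbols $H_{u_i}$, $H_r$, $H_{u_i r}$, $H_{r u_i}$ are assumed names, and feeding the self-attended outputs (rather than the raw embeddings) into the dual-attention layer is likewise an assumption.

```latex
% Self-attention: query, key and value all come from the same sentence.
H_{u_i} = \mathrm{Attention}(E_{u_i}, E_{u_i}, E_{u_i}), \qquad
H_{r}   = \mathrm{Attention}(E_{r}, E_{r}, E_{r})

% Dual-attention: the query comes from one sentence, key/value from the other.
H_{u_i r} = \mathrm{Attention}(H_{u_i}, H_{r}, H_{r}), \qquad
H_{r u_i} = \mathrm{Attention}(H_{r}, H_{u_i}, H_{u_i})
```

Under these assumptions every output lies in $\mathbb{R}^{L \times d}$, since the Attention Module preserves the shape of its query.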

Inter-utterance Encoder
The inter-utterance encoder layer learns the overall contextual information across multiple utterances. A mean pooling layer is applied over the local sentence representation of each sentence to obtain the context sequences $\hat{H}_u$ and $\hat{H}_r$. The same self-attention and dual-attention layers are then applied to $\hat{H}_u$ and $\hat{H}_r$ to capture the inter-interaction among utterances and between the utterance context and the response. This yields the self-attentive sentence representations $O_u = \{o_{u_1}, o_{u_2}, \ldots, o_{u_n}\}$ and $O_r$, and the dual-attentive sentence representations $O_{ur} = \{o_{u_1 r}, o_{u_2 r}, \ldots, o_{u_n r}\}$ and $O_{ru} = \{o_{r u_1}, o_{r u_2}, \ldots, o_{r u_n}\}$.
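The mean pooling step that turns word-level representations into a context sequence can be sketched as follows. This is a minimal illustration with hypothetical function names; it assumes unpadded inputs, whereas a real implementation would have to mask padded positions.

```python
def mean_pool(sentence):
    """Average the word vectors of one sentence (L x d) into a single vector (d)."""
    length = len(sentence)
    return [sum(column) / length for column in zip(*sentence)]

def context_sequence(utterances):
    """Stack one pooled vector per utterance into a context sequence (n x d)."""
    return [mean_pool(u) for u in utterances]

# Toy local representations (hypothetical values), d = 2.
utterances = [
    [[1.0, 2.0], [3.0, 4.0]],              # utterance 1: two words
    [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]],  # utterance 2: three words
]
H_hat_u = context_sequence(utterances)
print(H_hat_u)  # [[2.0, 3.0], [2.0, 2.0]]
```

The resulting sequence has one row per utterance, so the attention layers applied to it operate across turns rather than across words.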

Intra-/Inter-utterance Interaction Matching
We derive the matching features by combining the dot product and the cosine similarity between the utterance and response representations, following Zhou et al. (2018b) and Yuan et al. (2019). The first matching feature matrix $M_1$ is derived from the original word embeddings of the input utterance $U$ and response $r$, using a similarity parameter matrix $A_1 \in \mathbb{R}^{d \times d}$ to be learned. Then, we match the intra-utterance information and the inter-utterance information with the candidate response by using the local sentence representations from the intra-utterance encoder (Section 3.2.2) and the global sentence representations from the inter-utterance encoder (Section 3.2.3), respectively, where $A_2, A_3, A_4, A_5 \in \mathbb{R}^{d \times d}$ are also similarity parameter matrices to be learned.
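As an illustration of such a two-channel matching feature, the sketch below computes a bilinear dot-product similarity $U A R^\top$ and a cosine similarity between every utterance word and every response word. The exact form used in the paper is not recoverable from the text, so the bilinear form, the function names, and the toy values are assumptions.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def bilinear_similarity(U, R, A):
    # Assumed form: U A R^T, with A a learned (d x d) similarity matrix.
    return matmul(matmul(U, A), transpose(R))

def cosine_similarity(U, R, eps=1e-8):
    def norm(v):
        return math.sqrt(sum(x * x for x in v))
    return [[sum(a * b for a, b in zip(u, r)) / (norm(u) * norm(r) + eps) for r in R]
            for u in U]

def matching_feature(U, R, A):
    # Two channels per representation pair: dot-product channel and cosine channel.
    return [bilinear_similarity(U, R, A), cosine_similarity(U, R)]

# Toy word representations (hypothetical values), d = 2.
U = [[1.0, 0.0], [0.0, 2.0]]
R = [[1.0, 0.0], [1.0, 1.0]]
A = [[1.0, 0.0], [0.0, 1.0]]  # identity stands in for the learned matrix
dot_channel, cos_channel = matching_feature(U, R, A)
```

With two channels per matrix, stacking three word-level matching matrices gives six channels, which agrees with the $6$ in the shape $N \times 6 \times L \times L$ reported in the aggregation step.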

Latent Interaction Modeling
In addition to the intra- and inter-utterance interactions, we develop a latent multi-view subspace clustering approach for learning representations of latent features in the dialogue content, in order to capture the latent interaction between the utterance context and the candidate response. Here the utterance context and the response are regarded as two different views of the dialogue content.

Multi-view Latent Representation Learning
Let $X_u$, $X_r$ denote the inputs of the two views, where $X_u = \{o_{u_i}\}$, $X_r = \{o_{r_i}\} \in \mathbb{R}^{n \times d_x}$, and $n$ and $d_x$ are the number of samples and the dimension of the embedding. As shown in Figure 1, we first encode the inputs of each view into the latent representation $C_v$, where $C_v$ is a common notation for the different views, i.e., $C_u$ and $C_r$, by using a view-specific linear encoder with linear projection parameters $W_v^{(1)}$ and $b_v^{(1)}$ to be learned; we call this variant Linear Multi-view Latent Clustering. Then the latent representation is self-represented by a self-attentive weighted sum of a common clustering memory matrix shared across views. $Z \in \mathbb{R}^{n_c \times d_c}$ is this common self-representation matrix, which connects the latent representations $C_u$ and $C_r$, and $n_c$, $d_c$ are the pre-defined number of clusters and the dimension of the self-representation matrix. $C_v^*$, i.e., $C_u^*$ and $C_r^*$, are the clustering representations in the subspace, which are used for measuring the latent feature coherency between the utterance and response. After the self-representation operation, the clustering representations are reconstructed by view-specific decoders with linear projection parameters $W_v^{(2)}$ and $b_v^{(2)}$ to be learned.
The above approach assumes a linear relationship between the latent representation and the features from each view, which also implies a linear relationship among the features from different views. Since the relationship among the features from different views is likely to be non-linear, we also study the non-linear case, namely Generalized Multi-view Latent Clustering. The only difference between linear and generalized multi-view latent representation learning is that the generalized form adopts a non-linear encoder-decoder in the projection and reconstruction of the latent representation. In this work, we adopt a basic Multi-Layer Perceptron (MLP) as the non-linear encoder-decoder.
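The encode, self-represent, and decode steps of the linear variant can be sketched for a single view as follows. This is a minimal illustration, not the authors' code: the softmax dot-product form of the "self-attentive weighted sum" is an assumption, and all function names and toy values are hypothetical.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def linear(X, W, b):
    # Affine projection X W + b applied row by row.
    return [[v + bias for v, bias in zip(row, b)] for row in matmul(X, W)]

def self_represent(C, Z):
    # Assumption: attention of each latent vector over the n_c rows of the
    # shared memory Z, followed by a weighted sum of those rows.
    weights = [softmax(row) for row in matmul(C, transpose(Z))]  # (n, n_c)
    return matmul(weights, Z), weights                           # (n, d_c)

def llmsc_view(X, W_enc, b_enc, Z, W_dec, b_dec):
    C = linear(X, W_enc, b_enc)              # view-specific encoder: (n, d_c)
    C_star, weights = self_represent(C, Z)   # clustering representation
    X_star = linear(C_star, W_dec, b_dec)    # view-specific decoder: (n, d_x)
    return C, C_star, X_star, weights

# Toy view with n = 2 samples, d_x = 3, d_c = 2, n_c = 2 (hypothetical values).
X = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
W_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
W_dec = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
Z = [[1.0, 0.0], [0.0, 1.0]]  # shared across the utterance and response views
C, C_star, X_star, weights = llmsc_view(X, W_enc, [0.0, 0.0], Z, W_dec, [0.0] * 3)
```

Running the same function on the response view with the same `Z` but separate encoder and decoder parameters is what ties the two views to a common set of clusters. For the generalized variant, `linear` would be replaced by an MLP in both the encoder and the decoder.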

Latent Interaction
After the multi-view latent representation learning, we obtain the latent clustering representations, i.e., $C_u^*$ and $C_r^*$, and the reconstructed sentence representations, i.e., $O_u^*$ and $O_r^*$. These are used to match the coherency of latent features between the utterance and response with the same matching formula as in the Intra-/Inter-utterance Interaction Matching section (Section 3.2.4), where $A_6, A_7 \in \mathbb{R}^{d \times d}$ are coherence parameter matrices to be learned.

Loss Function of Multi-view Latent Clustering
The loss function of the multi-view latent representation learning module consists of two parts, an information preservation loss and a self-representation loss, where $\alpha_v$ and $\lambda$ are hyper-parameters that balance the weights of the different views and of the two losses. The information preservation loss ensures that the information from the contextual representation is encoded into the latent representation for each view, while the self-representation loss aims to minimize the differences between the common clustering representation and the view-specific latent representation and to alleviate the bias among different views.
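A form of this loss that is consistent with the description is sketched below. It is an assumption rather than the paper's equation: the squared Frobenius norm, the placement of $\alpha_v$ on the reconstruction term, and the notation $X_v^*$ for the decoder output (written $O_v^*$ elsewhere in the text) are all illustrative choices.

```latex
\mathcal{L}_{\mathrm{LMSC}}
  = \underbrace{\sum_{v \in \{u, r\}} \alpha_v \,\lVert X_v - X_v^{*} \rVert_F^{2}}_{\text{information preservation}}
  \;+\; \lambda \underbrace{\sum_{v \in \{u, r\}} \lVert C_v - C_v^{*} \rVert_F^{2}}_{\text{self-representation}}
```

In this reading, the first term penalizes the decoder for losing information about the input of each view, and the second term pulls each view-specific latent representation towards its representation in the shared cluster memory.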

Aggregation and Training
We concatenate the word-level matching matrices, i.e., $M = [M_1 : M_2 : M_3] \in \mathbb{R}^{N \times 6 \times L \times L}$, and extract the corresponding utterance-level features $F_w \in \mathbb{R}^{N \times d_f}$ via a convolutional layer, where $d_f$ is the feature dimension. Then all the utterance-level matching features, i.e., $F_w$, $M_4$, $M_5$, and the latent matching features, are aggregated by a GRU layer. Finally, the output of the GRU is passed through a single-layer perceptron to obtain the matching score $g(U_t, r_t)$.
The overall model is trained to jointly minimize the cross-entropy loss and the latent multi-view subspace clustering loss.

Experiment

Datasets & Evaluation Metrics
We evaluate the proposed method on three multi-turn response selection benchmark datasets: (1) the Ubuntu Dialogue Corpus (Lowe et al., 2015), which contains multi-turn conversations about technical support issues from the Ubuntu Forum; (2) the Douban Conversation Corpus (Wu et al., 2017), which collects conversations from Douban group, a social networking website; and (3) the E-commerce Dialogue Corpus, a conversation dataset in an E-commerce scenario collected from Taobao. The statistics of these datasets are shown in Table 1. Following previous works (Yuan et al., 2019), we adopt recall at position $k$ in $n$ candidates, i.e., $R_n@k$, as the evaluation metric. For the Douban dataset, we also adopt MAP (Mean Average Precision), MRR (Mean Reciprocal Rank), and P@1 (Precision@1) for evaluation, since a context in the Douban Corpus can have more than one ground-truth response.
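For reference, the metrics above can be computed per context as in the following sketch, and then averaged over all contexts. The function names are hypothetical, and the treatment of $R_n@k$ with several positives (fraction of positives retrieved in the top $k$) is a common convention that is assumed here rather than taken from the paper.

```python
def rank_labels(scores, labels):
    """Return the labels reordered by descending model score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]

def recall_at_k(ranked, k):
    """R_n@k for one context with n = len(ranked) candidates."""
    positives = sum(ranked)
    return sum(ranked[:k]) / positives if positives else 0.0

def average_precision(ranked):
    """Mean of precision values at the rank of each positive candidate."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def reciprocal_rank(ranked):
    """Inverse rank of the first positive candidate."""
    for rank, label in enumerate(ranked, start=1):
        if label:
            return 1.0 / rank
    return 0.0

def precision_at_1(ranked):
    """1.0 if the top-ranked candidate is a positive, else 0.0."""
    return float(ranked[0])

# Toy context with four candidates, two of them positive (hypothetical scores).
scores = [0.2, 0.9, 0.4, 0.7]
labels = [1, 0, 0, 1]
ranked = rank_labels(scores, labels)  # [0, 1, 0, 1]
print(recall_at_k(ranked, 1), recall_at_k(ranked, 2))  # 0.0 0.5
print(average_precision(ranked), reciprocal_rank(ranked), precision_at_1(ranked))  # 0.5 0.5 0.0
```

MAP and MRR are the means of `average_precision` and `reciprocal_rank` over all test contexts.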

Baseline Models
Single-turn Matching Models: Lowe et al. (2015) and Kadlec et al. (2015) employ RNN, CNN, LSTM, and BiLSTM for response selection by treating the given context as a whole when matching the candidate responses.
Multi-turn Matching Models: We further separate existing multi-turn matching models into two groups, Pre-transformer Models and Post-transformer Models. Pre-transformer Models combine or hybridize RNN and CNN models with carefully designed matching strategies, including DL2R, Multi-View (Zhou et al., 2016), SMN, and DUA. Post-transformer Models leverage improved and adaptive self-attention mechanisms to enhance the interaction between the utterance and response during representation learning, including DAM (Zhou et al., 2018b), MRFN (Tao et al., 2019a), IACMN, IoI (Tao et al., 2019b), and MSN (Yuan et al., 2019).

Implementation Details
For a fair comparison, we follow previous works (Zhou et al., 2018b; Yuan et al., 2019) and adopt Word2Vec (Mikolov et al., 2013) word embeddings of dimension 200, which are pre-trained on the training data without any extra material. For the hyper-parameter settings of I^3, the number of attention layers is set to 1 for all attention modules. In the aggregation, three 2-D convolutional layers are used to extract matching features, with 16, 32, and 64 filters of size [3,3], respectively. The dimension of the hidden states in the GRU is set to 300. In the LMSC module, we observe similar performance when varying the number of clusters and the weights of the different views, so the number of clusters is fixed to 10, and $\alpha_v$ and $\lambda$ are both set to 1. Specifically for gLMSC, the encoder-decoder MLPs have two layers with a hidden size of 300. The maximum sentence length and the maximum number of utterance turns are set to 50 and 10. The learning rate and the dropout rate are set to 0.001 and 0.2, and all datasets are trained with a mini-batch size of 200.
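For convenience, the settings reported above can be collected into a single configuration object, as sketched below. The values are taken from the text; the dictionary and its key names are hypothetical and do not correspond to any released code.

```python
# Hyper-parameters as reported in the text; all key names are hypothetical.
I3_CONFIG = {
    "embedding_dim": 200,            # Word2Vec embedding size
    "num_attention_layers": 1,       # per attention module
    "conv_filters": [16, 32, 64],    # three 2-D convolutional layers
    "conv_kernel_size": (3, 3),
    "gru_hidden_size": 300,
    "num_clusters": 10,              # n_c in the LMSC module
    "alpha_v": 1.0,                  # per-view loss weight
    "lambda": 1.0,                   # weight between the two LMSC losses
    "glmsc_mlp_layers": 2,           # gLMSC encoder/decoder depth
    "glmsc_mlp_hidden_size": 300,
    "max_sentence_length": 50,
    "max_turns": 10,
    "learning_rate": 0.001,
    "dropout": 0.2,
    "batch_size": 200,
}
```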

Results
Table 2 presents the evaluation results of different methods on the three datasets. Multi-turn methods outperform single-turn methods by a large margin, which confirms the necessity of multi-turn response selection studies. Compared with pre-transformer methods, post-transformer methods perform better on multi-turn response selection, which demonstrates the effectiveness of the self-attention mechanism in capturing the interaction between texts. As for the proposed models, we observe that the basic I^3 model achieves state-of-the-art performance on 10 out of 12 metrics. More importantly, different from the latest iterative interaction-based models, i.e., IACMN, IoI, and MSN, the depth of the I^3 network is fixed and is not affected by the growth of conversation turns. As reported in their works, IoI achieves its best performance on these datasets with 7 iterative interaction blocks, and MSN with a 3-hop selector. In other words, I^3 achieves competitive or even better performance with a single layer of interaction, regardless of the number of conversation turns. In addition, by adding the latent multi-view subspace clustering modules, I^3-lMVLC and I^3-gMVLC further improve the performance by a noticeable margin. For instance, there is an additional improvement of about 1% on the E-commerce Corpus from adding the lLMSC module. By comparing lLMSC and gLMSC, we observe that these two kinds of LMSC modules perform differently on different datasets. This situation is common in clustering methods (Zhang et al., 2020a), as it is difficult to determine whether the relationship among different samples is linear or non-linear.

Table 3: Ablation study and comparisons of clustering strategies.

Ablation Study
In order to validate the effectiveness of the different modules in the proposed I^3 network, we conduct several ablation studies on the Douban Corpus and the E-commerce Corpus by discarding different components. Apart from the original metrics, we also report the Average Score, which is the mean of all the scores on the two datasets, to give an overall view of the differences. As presented in the first part of Table 3, there are several notable observations: (i) For the hierarchical transformer encoder, both the local and the global transformer contribute to the final performance to a large extent, which validates the effectiveness of encoding multi-level utterance information. (ii) When only self-attention or only dual-attention is kept as the functional module in the hierarchical transformer, the performance drops, which shows that both kinds of attention modules are needed for the best performance. (iii) Under the matching-aggregation framework, we also evaluate the contribution of each matching feature. Note that we omit the "w/o inter-interaction" result, since it would be the same model as "w/o global transformer". From the results, we observe that origin-interaction contributes far less than the other two matching features.

Comparison on Latent Clustering Strategy
We compare the proposed lLMSC and gLMSC modules with two other latent clustering modules proposed for response/answer selection, LTC (Yoon et al., 2018) and LC (Yoon et al., 2019). LTC (Yoon et al., 2018) is a latent topic clustering module that extracts semantic information from target samples and only clusters the information from the view of the response. LC (Yoon et al., 2019) further applies the latent topic clustering module to both the question and the answer separately. Different from these two strategies, LMSC not only projects both the utterances and the response into the same subspace for coherence measurement, but also applies specific loss functions to control information preservation during the clustering process. The results are presented in the second part of Table 3. Although LTC and LC improve some of the metrics, they make little difference to the overall performance. In contrast, lLMSC and gLMSC effectively improve the overall performance.

Case Study of Latent Subspace Clustering
The Latent Multi-view Subspace Clustering module is proposed to extract latent features for measuring the coherency between the utterance and response. To investigate the latent subspace clustering further, we derive the probability of words in each cluster and rank them by frequency. After filtering the stop words, the resulting clusters and words for the E-commerce Corpus are presented in Table 4. Note that the category of each cluster is conjectured from the clustering results, since there is no ground-truth label for the latent clusters. From the clustering results, we observe an obvious tendency for each cluster. For example, in the E-commerce Corpus, the conversation topics are clustered into different groups, such as free shipping, refund, payment, discount, etc. On one hand, latent multi-view subspace clustering can assist in measuring the coherency of the latent representations, leading to better utterance-response matching. On the other hand, such a clustering approach provides an unsupervised way to discover the latent features of the dialogue.

Error Analysis
To better understand the failure modes of the proposed methods, we analyze 100 failure cases and find that they can be classified into the following categories, which point to directions for further improvement. Information Imbalance (≈ 45%): Some conversation samples suffer from a great imbalance between the information provided by the utterance context and by the response, which makes it difficult to match the two. About 70% of these cases have a short and simple response, such as "Sure." or "I see.", while the rest provide so little information in the utterance context that even a human cannot determine the true response. One possible way to address this kind of failure is to introduce background information to balance the information from the utterance and the response.
Mislabeling or Misspelling (≈ 25%): We attribute these failures to data issues. For instance, the ground-truth response is "Yes, we will.", while some negative candidates contain the true response plus some extra information, like "Yes, we will address it as soon as possible.". Such a negative candidate is arguably an equally good or even better response to the given utterances. Besides, some ground-truth responses are misspelled.
Inconsistency of Fact (≈ 20%): Some conversations concern factoid issues, such as the date, the place, the size, etc. However, the proposed method lacks the ability to verify whether the information provided in the response is factual or not. To address the problem of factual inconsistency in the response, it would be better to incorporate some supporting knowledge (Deng et al., 2018) and consider the interrelationship (Zhang et al., 2020b) among all the candidate responses.
Multiple Intents/Topics (≈ 10%): Compared with the error analysis provided in previous work, the issues related to user intent or conversation topic have been alleviated to a great extent. However, there are still some cases involving multiple intents/topics that remain to be tackled by further studying latent clustering representation learning.

Conclusion
In this work, we propose the Intra-/Inter-Interaction Network (I^3) with latent interaction modeling for multi-turn response selection. We propose a hierarchical transformer encoder to capture the intra- and inter-utterance interactions with the candidate response from both the individual utterances and the overall utterance context. Besides, we develop a latent multi-view subspace clustering module to model the latent feature coherency between the utterance and response. Experimental results show that the proposed method substantially and consistently outperforms existing state-of-the-art methods on three benchmark datasets.