Dually Interactive Matching Network for Personalized Response Selection in Retrieval-Based Chatbots

This paper proposes a dually interactive matching network (DIM) for presenting the personalities of dialogue agents in retrieval-based chatbots. This model is developed from the interactive matching network (IMN), which models the matching degree between a context composed of multiple utterances and a response candidate. Compared with the previous persona fusion approach, which enhances the representation of a context by calculating its similarity with a given persona, the DIM model adopts a dual matching architecture, which performs interactive matching between responses and contexts and between responses and personas respectively for ranking response candidates. Experimental results on the PERSONA-CHAT dataset show that the DIM model outperforms its baseline model, i.e., IMN with persona fusion, by a margin of 14.5% and outperforms the present state-of-the-art model by a margin of 27.7% in terms of top-1 accuracy hits@1.


Introduction
Building a conversation system with intelligence is challenging. Response selection, which aims to select a potential response from a set of candidates given the context of a conversation, is an important technique to build retrieval-based chatbots (Zhou et al., 2018). Many previous studies on single-turn (Wang et al., 2013) or multi-turn response selection (Lowe et al., 2015; Zhou et al., 2018; Gu et al., 2019) rank response candidates according to their semantic relevance to the given context.
With the emergence and popular use of personal assistants such as Apple Siri, Google Now and Microsoft Cortana, techniques for building personalized dialogues have attracted much research attention in recent years (Li et al., 2016; Zhang et al., 2018; Mazaré et al., 2018). Zhang et al. (2018) constructed the PERSONA-CHAT dataset for building personalized dialogue agents, where each persona was represented as multiple sentences of profile description. An example dialogue conditioned on given profiles from this dataset is given in Table 1 for illustration.
A persona fusion method for personalized response selection was also proposed by Zhang et al. (2018). In this method, given a context and a persona composed of several profile sentences, the similarities between the context representation and all profile sentences are computed first using attention to get the persona representation. Then, the persona representation is applied to enhance the context representation by a simple concatenation or addition operation. Finally, the enhanced context representation is used to rank response candidates. This method has two main deficiencies. First, the context is treated as a whole for calculating its attention towards profile sentences. However, each context is composed of multiple utterances and these utterances may play different roles when matching different profile sentences. Second, the interactions between the persona and each response candidate are ignored when deriving the persona representation.
In this paper, the interactive matching network (IMN) (Gu et al., 2019) is adopted as the fundamental architecture to build our baseline and improved models for personalized response selection. The baseline model follows the persona fusion method proposed by Zhang et al. (2018) and two improved models are then proposed. First, an IMN-based persona fusion model with fine-grained context-persona interaction is designed. In this model, each utterance in a context, instead of the whole context, is used to calculate its similarity with each profile sentence in a persona. Second, a dually interactive matching network (DIM) is proposed by formulating the task of personalized response selection as a dual matching problem, i.e., finding a response that can properly match the given context and persona simultaneously. The DIM model calculates the interactions between the context and the response, and between the persona and the response in parallel, in order to derive the final matching feature for response selection.
We test our proposed methods on the PERSONA-CHAT dataset (Zhang et al., 2018). Results show that the IMN-based utterance-level persona fusion model and the DIM model can obtain a top-1 accuracy hits@1 improvement of 2.4% and 14.5%, respectively, over the baseline model, i.e., the IMN-based context-level persona fusion model. Finally, our proposed DIM model outperforms the current state-of-the-art model by a margin of 27.7% in terms of top-1 accuracy hits@1 on the PERSONA-CHAT dataset.
In summary, the contributions of this paper are three-fold. (1) An IMN-based fine-grained persona fusion model is designed in order to consider the utterance-level interactions between contexts and personas. (2) A dually interactive matching network (DIM) is proposed by formulating the task of personalized response selection as a dual matching problem, aiming to find a response that can properly match the given context and persona simultaneously. (3) Experimental results on the PERSONA-CHAT dataset demonstrate that our proposed models outperform the baseline and state-of-the-art models by large margins on the accuracy of response selection.

Response Selection
Response selection is an important problem in building retrieval-based chatbots. Existing work on response selection can be categorized into single-turn (Wang et al., 2013) and multi-turn dialogues (Lowe et al., 2015; Zhou et al., 2018; Gu et al., 2019). Early studies focused mostly on single-turn dialogues, considering only the last utterance of a context for response matching. More recently, the research focus has shifted to multi-turn conversations, a more practical setup for real applications. Wu et al. (2017) proposed the sequential matching network (SMN), which first matched the response with each context utterance and then accumulated the matching information by a recurrent neural network (RNN). Zhou et al. (2018) proposed the deep attention matching network (DAM) to construct representations at different granularities with stacked self-attention. Gu et al. (2019) proposed the interactive matching network (IMN) to enhance the representations of the context and response at both the word level and the sentence level, and to perform bidirectional and global interactions between the context and response in order to derive the matching feature vector.

Persona for Chatbots
Chit-chat models suffer from a lack of a consistent personality as they are typically trained over many dialogues, each with different speakers, and a lack of explicit long-term memory as they are typically trained to produce an utterance given only a very recent dialogue history. Li et al. (2016) proposed a persona-based neural conversation model to capture individual characteristics such as background information and speaking style. Miller et al. (2016) proposed the key-value memory network, where the keys were dialogue histories, i.e., contexts, and the values were next dialogue utterances. Zhang et al. (2018) proposed the profile memory network by considering the dialogue history as input and then performing attention over the persona to be combined with the dialogue history. Mazaré et al. (2018) proposed the fine-tuned persona-chat (FT-PC) model, which first pretrained a model using a large-scale corpus with external knowledge and then fine-tuned it on the PERSONA-CHAT dataset.
In general, all these methods adopted a context-level persona fusion strategy, which first obtained the embedding vector of a context and then computed the similarities between the whole context and each profile sentence to acquire the persona representation. However, such persona fusion is too coarse. The utterance-level representations of contexts are not leveraged. The interactions between the persona and each response candidate are also ignored when deriving the persona representation.

Task Definition
Given a dialogue dataset $D$ with personas, an example of the dataset can be represented as $(c, p, r, y)$. Specifically, $c = \{u_1, u_2, ..., u_{n_c}\}$ represents a context with $\{u_m\}_{m=1}^{n_c}$ as its utterances and $n_c$ as the utterance number. $p = \{p_1, p_2, ..., p_{n_p}\}$ represents a persona with $\{p_n\}_{n=1}^{n_p}$ as its profile sentences and $n_p$ as the profile number. $r$ represents a response candidate. $y \in \{0, 1\}$ denotes a label: $y = 1$ indicates that $r$ is a proper response for $(c, p)$; otherwise, $y = 0$.
Our goal is to learn a matching model $g(c, p, r)$ from $D$. For any context-persona-response triple $(c, p, r)$, $g(c, p, r)$ measures the matching degree between $(c, p)$ and $r$. A softmax output layer over all response candidates is adopted in this model. The model parameters are trained by minimizing a multi-class cross-entropy loss function on $D$.
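The training objective described above can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions: the raw scores stand in for the unnormalized outputs of $g(c, p, r)$ for each candidate, and the names `softmax`, `candidate_cross_entropy` and the toy scores are hypothetical, not taken from the paper's implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw matching scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def candidate_cross_entropy(scores, gold_index):
    """Multi-class cross-entropy for one (c, p) pair and its candidates.

    scores[i] is an assumed stand-in for the unnormalized matching score
    of candidate r_i; gold_index marks the candidate labeled y = 1.
    """
    probs = softmax(scores)
    return -math.log(probs[gold_index])

# Hypothetical scores for four response candidates; candidate 2 is correct.
example_scores = [0.1, -1.3, 2.4, 0.7]
example_loss = candidate_cross_entropy(example_scores, gold_index=2)
```

Averaging this per-example loss over the training set gives the quantity that would be minimized.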

IMN-Based Persona Fusion
The model architecture used by previous methods with persona fusion (Zhang et al., 2018; Mazaré et al., 2018) is shown in Figure 1(a). It first obtains the context representation and then computes the similarities between the whole context and each profile sentence in a persona. Attention weights are calculated for all profile sentences to obtain the persona representation. Finally, the persona representation is combined with the context representation through concatenation or addition operations.
Formally, the representations of the whole context (the concatenation of its utterances), the context utterances, and the profile sentences are denoted as $c$, $\{u_m\}_{m=1}^{n_c}$ and $\{p_n\}_{n=1}^{n_p}$ respectively, where $c, u_m, p_n \in \mathbb{R}^d$. In previous context-level persona fusion methods, the enhanced context representation $c^+$ fused with persona information is calculated as in Eq. (1). Then, the similarity between $c^+$ and the response representation is computed to get the matching degree of $(c, p, r)$.
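The context-level fusion described above can be sketched in a few lines of Python. This is a minimal illustration, assuming dot-product similarity, softmax attention weights and additive fusion (the text also allows concatenation); the function name `fuse_context_level` and the toy vectors are hypothetical and are not taken from the paper's code.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_context_level(context_vec, profile_vecs):
    """Enhance the whole-context vector c with an attended persona vector.

    Assumed form: c+ = c + sum_n a_n * p_n, with a = softmax(c . p_n).
    """
    weights = softmax([dot(context_vec, p) for p in profile_vecs])
    persona_vec = [
        sum(w * p[k] for w, p in zip(weights, profile_vecs))
        for k in range(len(context_vec))
    ]
    return [c + q for c, q in zip(context_vec, persona_vec)]

# Toy 2-dimensional example with two profile sentences.
c = [1.0, 0.0]
profiles = [[1.0, 0.0], [0.0, 1.0]]
c_plus = fuse_context_level(c, profiles)
```

Because the whole context is a single vector here, every utterance contributes to the same set of attention weights, which is the coarseness the utterance-level variant is meant to address.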
In this paper, we build our baseline model based on IMN (Gu et al., 2019). After the context and response embeddings are obtained in the IMN model, the context-level persona fusion architecture shown in Figure 1(a) is applied to integrate persona information. All model parameters are estimated in an end-to-end manner. This baseline model is denoted as IMN$_{ctx}$ in this paper.
Considering that each context is composed of multiple utterances and that these utterances may play different roles when matching different profile sentences, we propose to improve the baseline model by fusing the persona information at a fine-grained utterance level, as shown in Figure 1(b). This model is denoted as IMN$_{utr}$ in this paper. First, the similarities between each context utterance and each profile sentence are computed and the enhanced representation $u_m^+$ of each context utterance is calculated as in Eq. (2). Then, these enhanced utterance representations are aggregated into the enhanced context representation, where either RNN or attention-based aggregation (Gu et al., 2019) can be employed.
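A minimal Python sketch of the utterance-level fusion follows. It assumes dot-product similarity and additive fusion per utterance, and it swaps in plain mean pooling for the RNN or attention-based aggregation mentioned in the text, purely to keep the example short. All names (`fuse_utterance`, `fuse_utterance_level`) and the toy vectors are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_utterance(utt_vec, profile_vecs):
    """Enhance one utterance vector with its own attended persona vector."""
    weights = softmax([dot(utt_vec, p) for p in profile_vecs])
    return [
        u + sum(w * p[k] for w, p in zip(weights, profile_vecs))
        for k, u in enumerate(utt_vec)
    ]

def fuse_utterance_level(utt_vecs, profile_vecs):
    """Fuse persona information per utterance, then aggregate.

    Mean pooling is a stand-in (an assumption of this sketch) for the
    RNN or attention-based aggregation described in the text.
    """
    enhanced = [fuse_utterance(u, profile_vecs) for u in utt_vecs]
    dim = len(utt_vecs[0])
    context_plus = [sum(e[k] for e in enhanced) / len(enhanced) for k in range(dim)]
    return enhanced, context_plus

utterances = [[1.0, 0.0], [0.0, 1.0]]
profiles = [[1.0, 0.0], [0.0, 1.0]]
enhanced, c_plus = fuse_utterance_level(utterances, profiles)
```

In this toy case each utterance attends most strongly to a different profile sentence, which a single whole-context attention could not express.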
Dually Interactive Matching Network

Model Overview
Previous studies on personalized response selection treat personas as supplementary information to enhance context representations by attention-based interaction. In this paper, we formulate the task of personalized response selection as a dual matching problem. The selected response is expected to properly match the given context and persona respectively. Here, personas are considered as equally important counterparts to contexts for ranking response candidates. The interactive matching between the context and response, and that between the persona and response, constitute the dually interactive matching network (DIM). The DIM model is composed of five layers. Figure 2 shows an overview of the architecture. Details about each layer are provided in the following subsections.

Word Representation Layer
We follow the setting used in IMN (Gu et al., 2019), which constructs word representations by combining general pre-trained word embeddings, those estimated on the task-specific training set, as well as character-level embeddings, in order to deal with the out-of-vocabulary issue.
Formally, embeddings of the $m$-th utterance in a context, the $n$-th profile sentence in a persona and a response candidate are denoted as $U_m = \{u_{m,i}\}_{i=1}^{l_{u_m}}$, $P_n = \{p_{n,j}\}_{j=1}^{l_{p_n}}$ and $R = \{r_k\}_{k=1}^{l_r}$ respectively, where $l_{u_m}$, $l_{p_n}$ and $l_r$ are the numbers of words in $U_m$, $P_n$ and $R$ respectively. Each $u_{m,i}$, $p_{n,j}$ or $r_k$ is an embedding vector of $d$ dimensions.

Matching Layer
The interactions between the context and the response and those between the persona and the response can provide useful matching information for deciding the matching degree between them.
Here, the DIM model adopts the same strategy as in the IMN model (Gu et al., 2019) which considers the global and bidirectional interactions between two sequences.
Take the context-response matching as an example.
First, the context representation $C = \{c_i\}_{i=1}^{l_c}$ with $l_c = \sum_{m=1}^{n_c} l_{u_m}$ is formed by concatenating the representations of all $n_c$ context utterances.
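The idea of a global, bidirectional interaction between two word sequences can be illustrated with the following Python sketch. It is not the paper's formulation: it assumes dot-product attention scores and softmax normalization, and the names `cross_attend` and `bidirectional_match` are hypothetical. Each context word gathers a response-aware summary and each response word gathers a context-aware summary.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys):
    """For each query vector, return a softmax-weighted sum of the key vectors."""
    dim = len(keys[0])
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) for k in keys])
        attended.append([sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)])
    return attended

def bidirectional_match(context_words, response_words):
    """Attend in both directions over the whole (concatenated) context."""
    context_view = cross_attend(context_words, response_words)   # one vector per context word
    response_view = cross_attend(response_words, context_words)  # one vector per response word
    return context_view, response_view

# Toy sequences: l_c = 3 context words, l_r = 2 response words, d = 2.
context_words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
response_words = [[1.0, 0.0], [0.0, 2.0]]
ctx_view, resp_view = bidirectional_match(context_words, response_words)
```

The same routine could be reused for persona-response matching by passing the concatenated profile words in place of the context words.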

Aggregation Layer
The aggregation layer converts the matching matrices of context utterances, profile sentences and response into a final matching feature vector.
First, each of the matching matrices $U_m$, $R$, $P_n$ and $R^*$ is processed by a BiLSTM, e.g., for the profile sentences,
$p^{utr}_{n,j} = \mathrm{BiLSTM}(P_n, j), \quad j \in \{1, ..., l_{p_n}\}$, (14)
where the four BiLSTMs share the same parameters in our implementation. Then, the aggregated embeddings are calculated by max pooling and last-hidden-state pooling operations. Next, the sequences of $\hat{u}^{agr}_m$ and $p^{agr}_n$ are further aggregated to get the embedding vectors for the context and the persona respectively.
Context aggregation. As the utterances in a context are chronologically ordered, the utterance embeddings $U^{agr} = \{\hat{u}^{agr}_m\}_{m=1}^{n_c}$ are sent into another BiLSTM following the chronological order of utterances in the context. Combined max pooling and last-hidden-state pooling operations are then performed to obtain the context embeddings.
Persona aggregation. As the profile sentences in a persona are independent of each other, an attention-based aggregation is designed to derive the persona embeddings. In the final matching feature vector, the first two features describe the context-response matching, and the last two describe the persona-response matching.
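The two aggregation operations named in this section can be sketched in Python as follows. This is a hedged illustration only: the BiLSTM outputs are replaced by given lists of hidden vectors, "last hidden state" is simplified to the final vector of the sequence, and the attention score is assumed to be a linear function with a weight vector `w` and a scalar bias `b`. The function names are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pool_max_and_last(hidden_states):
    """Concatenate per-dimension max pooling with the last hidden state."""
    dim = len(hidden_states[0])
    max_pooled = [max(h[k] for h in hidden_states) for k in range(dim)]
    return max_pooled + list(hidden_states[-1])

def attend_over_profiles(profile_embs, w, b):
    """Order-independent aggregation of profile sentence embeddings.

    Assumed scoring: score_n = w . p_n + b, normalized with softmax.
    """
    scores = [sum(wk * pk for wk, pk in zip(w, p)) + b for p in profile_embs]
    weights = softmax(scores)
    dim = len(profile_embs[0])
    return [sum(a * p[k] for a, p in zip(weights, profile_embs)) for k in range(dim)]

# Three hypothetical hidden states of dimension 2.
states = [[0.2, 0.9], [0.5, 0.1], [0.3, 0.4]]
sentence_emb = pool_max_and_last(states)

# Two hypothetical profile embeddings; zero weights give uniform attention.
persona_emb = attend_over_profiles([[1.0, 0.0], [0.0, 1.0]], w=[0.0, 0.0], b=0.0)
```

Attention suits the persona side because reordering the profile sentences leaves the result unchanged, whereas the context side keeps a recurrent pass to respect utterance order.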

Prediction Layer
The final matching feature vector is then sent into a multi-layer perceptron (MLP) classifier with softmax output. Here, the MLP is designed to predict whether a $(c, p, r)$ triple matches appropriately based on the derived matching feature vector. Finally, the MLP returns a probability to denote the matching degree.

Dataset
We tested our proposed methods on the PERSONA-CHAT dataset (Zhang et al., 2018). To make this task more challenging, a version of revised persona descriptions is also provided by rephrasing, generalizing, or specializing the original ones. Since the personas of both speakers in a dialogue are available, the response selection task can be conditioned on the speaker's persona ("self persona") or the dialogue partner's persona ("their persona") respectively.

Evaluation Metrics
We used the same evaluation metrics as in the previous work (Zhang et al., 2018). Each model aimed to select the best-matched response from available candidates for the given context $c$ and persona $p$. We calculated the recall of the true positive replies, denoted as hits@1. In addition, the mean reciprocal rank (MRR) (Voorhees, 1999) metric was also adopted to take the rank of the correct response over all candidates into consideration.
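For concreteness, the two metrics can be computed as in the following Python sketch. The helper names and the toy scores are hypothetical; ties are counted against the model, which is an assumption of this sketch rather than something stated in the paper.

```python
def gold_rank(scores, gold_index):
    """1-based rank of the gold candidate among all candidates.

    Ties are resolved pessimistically (an assumption of this sketch).
    """
    gold_score = scores[gold_index]
    higher = sum(1 for i, s in enumerate(scores) if i != gold_index and s >= gold_score)
    return 1 + higher

def hits_at_1(all_scores, gold_indices):
    """Fraction of examples whose gold response is ranked first."""
    ranks = [gold_rank(s, g) for s, g in zip(all_scores, gold_indices)]
    return sum(1 for r in ranks if r == 1) / len(ranks)

def mean_reciprocal_rank(all_scores, gold_indices):
    """Average of 1 / rank of the gold response."""
    ranks = [gold_rank(s, g) for s, g in zip(all_scores, gold_indices)]
    return sum(1.0 / r for r in ranks) / len(ranks)

# Three toy examples with three candidates each; gold ranks are 1, 3 and 2.
toy_scores = [[0.9, 0.1, 0.0], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]
toy_gold = [0, 0, 1]
toy_hits = hits_at_1(toy_scores, toy_gold)
toy_mrr = mean_reciprocal_rank(toy_scores, toy_gold)
```

Unlike hits@1, MRR still gives partial credit when the correct response is ranked second or third.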

Training Details
For building the IMN, IMN$_{ctx}$, IMN$_{utr}$ and DIM models, the Adam method (Kingma and Ba, 2015) was employed for optimization with a batch size of 16. The initial learning rate was 0.001 and was exponentially decayed by 0.96 every 5000 steps. Dropout (Srivastava et al., 2014) with a rate of 0.2 was applied to the word embeddings and all hidden layers. A word representation is a concatenation of a 300-dimensional GloVe embedding (Pennington et al., 2014), a 100-dimensional embedding estimated on the training set using the Word2Vec algorithm (Mikolov et al., 2013) 20, respectively. We padded with zeros if the number of utterances in a context was less than 15; otherwise, we kept the last 15 utterances. For the IMN$_{ctx}$, IMN$_{utr}$ and DIM models, the maximum number of words in a profile sentence and that of profile sentences in a persona were set to be 15 and 5, respectively. Similarly, we padded with zeros if the number of profile sentences in a persona was less than 5. The development set was used to select the best model for testing.
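The learning-rate schedule and the context padding rule can be sketched in Python as follows. Two points are assumptions of this sketch: the decay is applied in discrete steps (staircase) rather than continuously, and padding utterances are appended after the real ones. The function names and the toy word ids are hypothetical.

```python
def learning_rate(step, initial=0.001, decay_rate=0.96, decay_steps=5000):
    """Exponential decay: multiply by decay_rate once every decay_steps steps.

    Staircase decay is assumed; a continuous variant would use
    step / decay_steps as the exponent instead.
    """
    return initial * decay_rate ** (step // decay_steps)

def fit_context(utterances, pad_utterance, max_utterances=15):
    """Keep the last max_utterances utterances, or pad up to that number.

    pad_utterance is the all-zero utterance used for padding; placing the
    padding at the end is an assumption of this sketch.
    """
    kept = list(utterances[-max_utterances:])
    kept += [list(pad_utterance) for _ in range(max_utterances - len(kept))]
    return kept

# Toy contexts whose utterances are lists of word ids.
long_context = [[i] for i in range(1, 18)]   # 17 utterances
short_context = [[7, 8], [9]]                # 2 utterances
fitted_long = fit_context(long_context, pad_utterance=[0])
fitted_short = fit_context(short_context, pad_utterance=[0])
```

Keeping the most recent utterances when truncating preserves the turns closest to the response being selected.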
All code was implemented in the TensorFlow framework (Abadi et al., 2016) and is published to help replicate our results.

Experimental Results
Table 2 presents the evaluation results of our reproduced IMN model (Gu et al., 2019) and previous methods on the PERSONA-CHAT dataset without using personas. It can be seen that the IMN model outperformed other models on this dataset by a margin larger than 28.9% in terms of hits@1. As introduced above, our proposed models for personalized response selection were all built on IMN.
Table 3 presents the evaluation results of our proposed and previous methods on PERSONA-CHAT under various persona configurations. The t-test shows that the differences between our proposed models, i.e., IMN$_{utr}$ and DIM, and the baseline model, i.e., IMN$_{ctx}$, were both statistically significant with p-value < 0.01. Comparing IMN$_{ctx}$ and IMN$_{utr}$ conditioned on original self personas, we can see that the fine-grained persona fusion at the utterance level rendered a hits@1 improvement of 2.4% and an MRR improvement of 1.9%. The DIM model outperformed its baseline IMN$_{ctx}$ by a margin of 14.5% in terms of hits@1 and 10.5% in terms of MRR. Compared with the FT-PC model (Mazaré et al., 2018), which was first pretrained using a large-scale corpus and then fine-tuned on the PERSONA-CHAT dataset, the DIM model achieved a margin of 10.0% in terms of hits@1 conditioned on revised self personas. Another advantage of DIM is that it was trained in an end-to-end mode without pretraining or using any external knowledge. Lastly, the DIM model outperformed previous models by margins larger than 27.7% in terms of hits@1 conditioned on original self personas.

Improvement of Using Personas
Examining the numbers which indicate the gains or losses after adding persona conditions in Table 3, we can see that the context-level persona fusion improves the performance of previous models significantly when original self personas are used. However, the gain achieved by the IMN$_{ctx}$ model is limited. One possible reason is that the IMN model performs attention-based interactions between the context and the response in order to get their matching feature for response selection. Thus, the context embeddings shown in Fig. 1(a) contain information from both the context and the response, which may be inappropriate for the subsequent context-level persona fusion shown in Eq. (1). The improvement achieved by the DIM model is much higher because it adopts a dual matching framework to address this issue.
Original vs. Revised. Compared with using original personas, it is more difficult for the models conditioned on the revised personas to extract useful persona information, as shown by the limited improvement achieved by the previous models in Table 3. One possible reason is that there are fewer shared words between the response and the persona revised by rephrasing, generalizing, or specializing, which increases the difficulty of understanding the persona and its relationships with the response. For example, it is easier for models to judge the matching degree between the original profile "Autumn is my favorite season." and the response "This is my favorite time of the year season wise." than between the revised profile "I love watching the leaves change colors." and the response. In contrast, our proposed DIM model still obtains a hits@1 improvement of 6.9% and an MRR improvement of 5.4% when conditioned on the revised self personas, which can be attributed to the direct and interactive persona-response matching used in this model.
To demonstrate the importance of the dual matching framework followed by our proposed DIM model, ablation tests were performed using the original self personas and the results are shown in Table 4. We can see that both the persona-response matching and the context-response matching contribute to the performance of the DIM model. It is reasonable that the context-response matching is more important because contexts provide the fundamental semantic descriptions for response selection. On the other hand, the persona-response matching alone can also achieve a hits@1 of 48.8% and an MRR of 60.9%, which shows the usefulness of utilizing persona information to select the best-matched response.

Interactive Matching in DIM
In order to investigate the effectiveness of the interactive matching between the context and the response and that between the persona and the response in the DIM model, a case study was conducted by visualizing the response-to-context and response-to-persona attention weights used in Eq. (9). The results are shown in Fig. 3. We can see that some important words in the response, such as "dogs", selected relevant words in the context, such as "animals", to derive the context-response matching features. Some important profile texts such as "I love animals and have two dogs." also obtained large attention weights for getting the persona-response matching features. This experimental result supports our formulation of the task of personalized response selection as a dual matching problem.

Conclusions
In this paper, we formulate the task of personalized response selection as a dual matching problem to search for a response that can properly match the given context and persona simultaneously. A new model named Dually Interactive Matching Network (DIM) is proposed, which performs the interactive matching between the context and response as well as between the persona and response in parallel, in order to derive the final matching features for personalized response selection. Experimental results show that DIM improves over the IMN models with context-level or utterance-level persona fusion, outperforming previous methods and achieving a new state-of-the-art performance on the PERSONA-CHAT dataset. In the future, we will explore models to make better use of dialogue partners' personas for response selection.

Figure 1: Comparison of the model architectures for (a) context-level persona fusion and (b) utterance-level persona fusion.

Figure 2: An overview of our proposed DIM model.
$w$ and $b$ are parameters that need to be estimated during training. Last, the final matching feature vector is the concatenation of context, persona and response embeddings as $m = [\hat{c}^{agr}; r^{agr}; p^{agr}; r^{agr*}]$.

Figure 3: Visualizations of (a) response-to-context and (b) response-to-persona attention weights at the matching layer for a test sample. The darker units correspond to larger values.

Table 1: An example dialogue from the PERSONA-CHAT dataset.

Table 2: Evaluation results of the IMN model and previous methods on the PERSONA-CHAT dataset without using personas. All the results except ours are copied from Zhang et al. (2018).

Table 3: Evaluation results of our proposed and previous methods on PERSONA-CHAT under various persona configurations. The meanings of "Self Persona", "Their Persona", "Original", and "Revised" can be found in the Dataset section. All results except ours are copied from Zhang et al. (2018) and Mazaré et al. (2018). Numbers in parentheses indicate the gains or losses after adding the persona conditions.

Table 4: Ablation tests of removing either persona-response matching or context-response matching in the DIM model conditioned on original self personas.

Table 5: hits@1 results of transfer tests on the DIM model.

Transfer tests were conducted by training and evaluating the DIM model using mismatched types of personas. The results are reported in Table 5. They show that the DIM model achieved better performance when tested on the same type of personas as used in training. Meanwhile, the model trained on the revised personas and tested on the original personas suffered a smaller loss than the one trained on the original personas and tested on the revised personas, which shows that the revised personas can provide a better generalization ability to the DIM model than the original ones.