Incorporating Interlocutor-Aware Context into Response Generation on Multi-Party Chatbots

Conventional chatbots focus on two-party response generation, which simplifies the real dialogue scene. In this paper, we strive toward a novel task of Response Generation on Multi-Party Chatbot (RGMPC), where the generated responses heavily rely on the interlocutors’ roles (e.g., speaker and addressee) and their utterances. Unfortunately, complex interactions among the interlocutors’ roles make it challenging to precisely capture conversational contexts and interlocutors’ information. Facing this challenge, we present a response generation model which incorporates Interlocutor-aware Contexts into Recurrent Encoder-Decoder frameworks (ICRED) for RGMPC. Specifically, we employ interactive representations to capture dialogue contexts for different interlocutors. Moreover, we leverage an addressee memory to enhance contextual interlocutor information for the target addressee. Finally, we construct a corpus for RGMPC based on an existing open-access dataset. Automatic and manual evaluations demonstrate that the ICRED remarkably outperforms strong baselines.


Introduction
Human-computer conversation has been an important and challenging task in NLP and AI since the Turing Test was proposed in 1950 (Turing, 1950). Recently, with the rapid growth of social conversation data available on the Internet, data-driven chatbots are able to learn to generate responses directly and have attracted much more attention than before (Li et al., 2016a; Tian et al., 2017).
Research in this area mostly focuses on dialog with two interlocutors (de Bayser et al., 2017). However, a substantial part of real-life interaction takes the form of Multi-Party Chatbots (MPC, such as Internet forums and chat groups), a form of conversation with multiple interlocutors (Ouchi and Tsuboi, 2016). For example, more than three interlocutors (a_1, a_2, a_3 ... a_m) are involved in the conversation in Figure 1, and their roles (e.g., speaker and addressee) may change across dialog turns.
As shown in Figure 1, at each turn, the core issue of MPC is to capture who (the speaker) talks to whom (the addressee) about what (the utterance). To the best of our knowledge, previous approaches to obtaining responses in MPC usually employ a response selection paradigm, which simply selects one response from a set of existing utterances as the final response according to the context. Obviously, this paradigm cannot generate new responses and is therefore inflexible. In this study, to build a more broadly applicable system, we concentrate on producing new responses word by word, a task we name Response Generation on Multi-Party Chatbots (RGMPC).
RGMPC is a very challenging task. The primary challenge is that the generated response strongly depends on the interlocutors' roles, such as the speaker and the addressee. For example, in the same context of Figure 1, what a_1 says to a_2 differs from what a_1 says to a_3 because different addressees (a_2 and a_3) have different information demands. Similarly, for the same addressee, utterances from different speakers may differ because each speaker has personal background knowledge and a personal style of speaking. Moreover, the roles of the same interlocutor may vary across dialog turns. For instance, in Figure 1, a_3 plays different roles in different dialog turns: speaker in turns 2 and n-1, and addressee in turns n and n+1.
Therefore, it is very important for RGMPC to capture interlocutor information. Currently, most response generation methods consider only contextual utterance information (Serban et al., 2016, 2017) but neglect interlocutor information. Although some studies have exploited interlocutor information for response generation, they still suffer from certain critical limitations. Li et al. (2016b) learn a fixed vector for each person from all conversational texts in the training corpus. However, as a global representation, the fixed person vector needs to be trained from large-scale dialogue turns for each interlocutor, and it may suffer from a sparsity issue since some interlocutors have very few dialogue turns.
To address the aforementioned problems of RGMPC, this paper incorporates Interlocutor-aware Contexts into a Recurrent Encoder-Decoder model (ICRED) for RGMPC, which is an end-to-end framework. Specifically, in order to capture interlocutor information, we exploit interactive interlocutor representations learned from the current dialog context rather than fixed person vectors (Li et al., 2016b) obtained from all dialogs in the training corpus. We expect the learned contextual interlocutor representation to be a good alternative to the fixed person vectors due to its ability to alleviate the sparsity issue. Furthermore, from the view of conversation analysis, responses are usually used for answering the addressee's question or expanding on the addressee's utterances. Therefore, we introduce an addressee memory mechanism to specifically enhance contextual information for the target addressee. Finally, both the interactive interlocutor representations and the addressee memory are utilized for decoding response utterances. In particular, the addressee memory is leveraged to capture the addressee information dynamically for each generated word.
In order to prove the effectiveness of the proposed model, we construct a dataset for RGMPC based on an open dataset 1. Experimental results show that the proposed model is highly competitive with state-of-the-art methods on both automatic and manual evaluations.
In brief, the main contributions of this paper are as follows: (1) We propose an end-to-end response generation model called ICRED, which incorporates Interlocutor-aware Contexts into a Recurrent Encoder-Decoder framework for RGMPC.
(2) We leverage an addressee memory mechanism to enhance contextual interlocutor information for the addressee.
(3) We construct an open-access dataset for RGMPC. Both automatic and manual evaluations demonstrate that our model is remarkably better than strong baselines on this dataset.

Methodology
The overview of the proposed ICRED for RGMPC is shown in Figure 2 along with its caption. The details are as follows.

Utterance Encoder Layer
The utterance encoder layer transforms each input utterance into distributed representations. We leverage bi-directional Gated Recurrent Units (GRU) (Cho et al., 2014) to capture long-term dependencies.
For an utterance u^t = (w^t_1, w^t_2, ..., w^t_{L_u}) at time step t, the concatenation of the hidden states in both directions is denoted as h^t_i, which is considered the contextual word representation of the input word w^t_i. The state of the last word, h^t_{L_u}, is treated as the representation of the utterance at time step t, denoted as h^t, and is sent to the speaker interaction layer for updating contextual representations.
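The bi-directional encoding above can be sketched as follows. This is a minimal numpy sketch, not the authors' TensorFlow implementation: dimensions are illustrative, biases are omitted, and `init_gru` / `encode_utterance` are hypothetical helper names.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h, x, p):
    """One GRU step (Cho et al., 2014); biases omitted for brevity."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)            # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)            # reset gate
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h)) # candidate state
    return (1.0 - z) * h + z * h_cand

def init_gru(d_in, d_h, rng):
    return {k: 0.1 * rng.standard_normal((d_h, d_in if k.startswith("W") else d_h))
            for k in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}

def encode_utterance(words, fwd, bwd, d_h):
    """Bi-GRU encoder: h_i = [fwd_i; bwd_i] is the contextual word
    representation; the last concatenated state is the utterance vector."""
    hf, hb = np.zeros(d_h), np.zeros(d_h)
    H_f, H_b = [], []
    for w in words:                    # forward pass
        hf = gru_step(hf, w, fwd)
        H_f.append(hf)
    for w in reversed(words):          # backward pass
        hb = gru_step(hb, w, bwd)
        H_b.append(hb)
    H_b.reverse()
    H = [np.concatenate(pair) for pair in zip(H_f, H_b)]
    return H, H[-1]                    # per-word states, utterance vector h^t
```

The per-word states `H` feed the addressee memory, while the utterance vector feeds the speaker interaction layer.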

Speaker Interaction Layer
The speaker interaction layer is leveraged to obtain interlocutor information from the context. Similar to the Speaker Interaction RNNs (Zhang et al., 2018), we utilize an interactive speaker encoder for RGMPC.
As shown in Figure 2, an interlocutor embedding matrix A is used to record all interlocutors' representations, and A is initialized as a zero matrix. Each column of A corresponds to an interlocutor's embedding, where A_i is the embedding for the interlocutor a_i. The speaker interaction layer updates all interlocutors' embeddings at each time step based on their roles (speaker, addressee or observer). Embeddings for the speaker, addressee and observer are updated by the role-differentiated GRUs GRU_S, GRU_A and GRU_O, respectively.
Here, A^t_spk (A^t_adr / A^t_obv) is the embedding of the speaker (addressee / observer) at time step t, and h^t is the utterance representation obtained from the utterance encoder layer. Take the first time step "(a_1, a_2, u_1)" in Figure 2 as an example: when a_1 says u_1 to a_2, the speaker's (a_1's) embedding A_1 is updated by the speaker GRU (GRU_S), the addressee's (a_2's) embedding A_2 is updated by the addressee GRU (GRU_A), and the other interlocutors' embeddings are updated by the observer GRU (GRU_O). Note that the addressee may be missing (such as "(a_3, −, u_2)" at time step 2 in Figure 2), in which case the embeddings of all interlocutors except the speaker are updated by the observer GRU. The interlocutor embedding matrix A is updated up to the maximum time step n, and the final matrix is used in decoding.
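The role-differentiated update loop can be sketched as follows. To keep the sketch short, a plain tanh recurrent cell stands in for the full GRU_S / GRU_A / GRU_O cells; the function names and dimensions are assumptions for illustration.

```python
import numpy as np

def make_cell(d_a, d_h, rng):
    """Simplified recurrent cell standing in for one role GRU
    (the actual model uses full GRUs with gates)."""
    W = 0.1 * rng.standard_normal((d_a, d_a))
    U = 0.1 * rng.standard_normal((d_a, d_h))
    return lambda a, h: np.tanh(W @ a + U @ h)

def speaker_interaction(turns, n_interlocutors, d_a, cells):
    """turns: list of (speaker_id, addressee_id_or_None, utterance_vec).
    A starts as a zero matrix; at every turn each interlocutor's row is
    updated by the cell matching its role. When the addressee is missing,
    everyone except the speaker is treated as an observer."""
    upd_s, upd_a, upd_o = cells
    A = np.zeros((n_interlocutors, d_a))
    for spk, adr, h_t in turns:
        new_A = np.empty_like(A)
        for i in range(n_interlocutors):
            if i == spk:
                new_A[i] = upd_s(A[i], h_t)          # speaker update
            elif adr is not None and i == adr:
                new_A[i] = upd_a(A[i], h_t)          # addressee update
            else:
                new_A[i] = upd_o(A[i], h_t)          # observer update
        A = new_A
    return A   # final interlocutor embedding matrix, used in decoding
```

After the last turn, rows of `A` serve as the contextual speaker and addressee vectors in the decoder.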

Addressee Memory Layer
The interlocutor embedding matrix is updated by utterance representations and interlocutors' roles, so it captures interlocutor context at the utterance level. In fact, contextual word representations are important for response generation, too. A context contains consecutive utterances, and each utterance is a word sequence. Therefore, memorizing all contextual word representations in the entire context is complex, and it scales poorly when a context contains many utterances.
Intuitively, from the view of conversation analysis, responses are usually used for answering the addressee's question or expanding on the addressee's utterances. Therefore, we design an addressee memory layer, which only memorizes the contextual word representations (denoted M_tgt) of the last utterance said by the target addressee, where the contextual representation of each word is obtained from the utterance encoder layer. Take "(a_m, a_3, ?)" at time step n+1 in Figure 2 as an example: u_{n-1} is the last utterance said by the target addressee a_3 because of "(a_3, a_m, u_{n-1})" at time step n-1, so the addressee memory layer merely memorizes the contextual word representations M_tgt = [h^{n-1}_1, ..., h^{n-1}_{L_u}] from the utterance u_{n-1}, where h^{n-1}_i is obtained from the utterance encoder layer (Section 3.1).
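The memory-selection rule above is a simple scan backwards through the turns; a minimal sketch (function and variable names are illustrative, not from the paper's code):

```python
def addressee_memory(turns, word_states, a_tgt):
    """turns: chronological list of (speaker_id, addressee_id, utt_id);
    word_states: utt_id -> list of contextual word vectors from the
    utterance encoder. Returns M_tgt, the word states of the last
    utterance said by the target addressee (None if it never spoke)."""
    for spk, _adr, utt_id in reversed(turns):
        if spk == a_tgt:
            return word_states[utt_id]
    return None
```

In the Figure 2 example, scanning backwards from turn n finds "(a_3, a_m, u_{n-1})" first, so u_{n-1}'s word states become the memory for target addressee a_3.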

Decoder Layer
The decoder is responsible for generating the target sequence. Different from the single contextual representation in previous work (Serban et al., 2017), the speaker interaction layer is able to capture different interlocutor information from the context (e.g., personal background knowledge and style of speaking for the responding speaker, and specific information demands for the target addressee). Moreover, the addressee memory layer records the contextual word representations for the target addressee. Therefore, we extract the contextual speaker vector A_res for the responding speaker a_res from the final interlocutor embedding matrix A (e.g., A_m = A[*, a_m] for the responding speaker a_m in Figure 2). Similarly, the contextual addressee vector A_tgt for the target addressee is also extracted from A. However, A_res and A_tgt remain the same for every generated word. In order to capture dynamic information for different generated words, we leverage an attention mechanism that selectively reads different contextual word representations from the addressee memory. For each target word, the decoder attentively reads the contextual word representations to produce the attentional addressee vector c_j, where M_tgt[*, k] is the contextual word representation of the k-th word in the addressee memory, and s_j represents the hidden state of the decoding GRU.
A function ρ is leveraged to compute the attention strength, which is calculated via a projection matrix connecting s_{j-1} and M_tgt[*, k]. Finally, the attentional addressee vector c_j, the contextual speaker vector A_res and the contextual addressee vector A_tgt are concatenated to estimate the probability of the predicted word, where s_j is the hidden state of the decoding GRU (GRU_dec), x_j is the word vector of the predicted target word r_j, and r_j is predicted by a softmax classifier over a fixed vocabulary based on word embedding similarity.
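The attentive read over the addressee memory can be sketched as a bilinear attention; the projection matrix `W_att` is an assumed parameterization consistent with the description of ρ, not the paper's exact formula:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(s_prev, M_tgt, W_att):
    """Bilinear attention over the addressee memory: the score for slot k
    connects the previous decoder state s_{j-1} and the memory column
    M_tgt[:, k] through the projection W_att."""
    scores = s_prev @ W_att @ M_tgt   # one score per memory slot
    alpha = softmax(scores)           # attention weights over words
    c_j = M_tgt @ alpha               # attentional addressee vector
    return c_j, alpha
```

At each decoding step, `c_j` is recomputed from the new decoder state, which is what lets the addressee information vary per generated word while A_res and A_tgt stay fixed.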

Learning
The proposed ICRED for RGMPC is fully differentiable, and it can be optimized in an end-to-end manner using back-propagation. Given the context C, responding speaker a_res, target addressee a_tgt and target word sequence {r_j} (j = 1, ..., L_r), the objective is to minimize a loss function that combines the negative log-likelihood of the generated response with L2 regularization (L_2), where λ is the hyperparameter for L_2.
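The original equation was lost in extraction; a reconstruction consistent with the description (negative log-likelihood over the target words plus a weighted L2 term) would be:

```latex
J(\theta) = -\sum_{j=1}^{L_r} \log p\left(r_j \mid r_{<j}, C, a_{res}, a_{tgt}\right) + \lambda\, L_2
```

Here r_{<j} denotes the previously generated words, and L_2 is the regularizer over the model parameters θ.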

Dataset
Our dataset is constructed based on the Ubuntu multi-party chatbot corpus 2, which has been widely used as the evaluation dataset for the response selection task (Ouchi and Tsuboi, 2016; Zhang et al., 2018). The original data come from the Ubuntu IRC chat log, where each line consists of (Time, Speaker, Utterance). If the addressee is explicitly mentioned in the utterance, it is extracted as the addressee; otherwise, all interlocutors except the speaker are observers. Considering that generating new responses is more complicated than retrieving them, the generative task requires higher-quality data. We require that the responding speaker and target addressee have appeared in the context, where the contextual window is set to 5. Moreover, the words are tokenized by NLTK, and some generic responses are removed by manual rules 3. Finally, we randomly split the dataset into Train/Dev/Test (8:1:1), and it is publicly available 1. The detailed statistics of the dataset are shown in Table 2.
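The addressee-extraction rule above can be sketched as follows. The exact mention pattern ("@name" or a leading "name:"/"name,") is an assumption based on the chat examples in Figure 3, not the authors' released preprocessing code:

```python
import re

def extract_addressee(speaker, utterance, known):
    """If an interlocutor is explicitly mentioned at the start of the
    utterance (assumed forms: "@name", "name:", "name,"), treat it as the
    addressee; otherwise return None, in which case all interlocutors
    except the speaker become observers."""
    m = re.match(r"@?(\w+)[:,]?(\s|$)", utterance)
    if m and m.group(1) in known and m.group(1) != speaker:
        return m.group(1)
    return None
```

Lines whose leading token is not a known interlocutor (or is the speaker itself) fall through to the observer case.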

Implementation Details
In order to keep our model comparable to other typical existing methods, we keep the same parameters and experimental environments for ICRED and the comparative models. We take a maximum of 20 words per utterance. The word vector dimension is 300, initialized with the publicly released fastText 4 vectors pre-trained on Wikipedia. The utterance and interlocutor are encoded by 512-dimensional and 1024-dimensional vectors, respectively. The joint loss function with an L2 weight of 0.0001 is minimized by an Adam optimizer. We implemented all the models with TensorFlow on an NVIDIA TITAN X GPU.

Automatic Evaluation Metrics
Automatic evaluation (AE) for Natural Language Generation (NLG) is a challenging and under-researched problem (Novikova et al., 2017).
Following Liu et al. (2018), we leverage two referenced measurements (BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) 5) for automatic evaluation. Considering that current data-driven approaches tend to generate short and generic (meaningless) responses, two unreferenced ("intrinsic") metrics are also used. The first is the average length of responses, an objective surface metric reflecting the substance of a response (Mou et al., 2016; He et al., 2017a). The other is the number of nouns 6 per response (Liu et al., 2018), which shows the richness of a response, since nouns are usually content words. Note that the unreferenced metrics enrich the evaluation, though they are weak metrics. The detailed results and analyses are shown as follows.

The Effectiveness of ICRED for RGMPC

Comparison Methods. We compared ICRED with the following methods:

(1) Seq2Seq (Sutskever et al., 2014): Seq2Seq is one of the mainstream methods for text generation. In order to capture as much information as possible, the input sequence is the in-order concatenation of all utterances in the context.
(2) Persona Model (Li et al., 2016b): The persona-based model modifies Seq2Seq to encode a global vector for each interlocutor that appears in the training data, which alleviates the issue of speaker consistency in response generation.
(3) VHRED (Serban et al., 2017): VHRED extends a hierarchical recurrent encoder-decoder with a high-dimensional latent variable for utterances, which enhances the contextual information.
Comparative Results. (1) ICRED obtains the highest performance on all metrics (marked in bold), which indicates that incorporating interlocutor-aware contexts into RGMPC contributes to generating better responses.
(2) Although the persona-based model utilizes interlocutor information, it performs poorly. The average number of dialogue turns per interlocutor is more than 5000 in (Li et al., 2016a), while there are fewer than 100 dialogue turns per interlocutor in our dataset. Therefore, it is hard to learn a global vector for each interlocutor from such a sparse corpus. In contrast, our ICRED performs well on this sparse corpus (details in Section 4.5).
(3) VHRED brings slight improvements over Seq2Seq and the persona-based model. Even though VHRED enhances the contextual information with a high-dimensional latent variable, it is still remarkably worse than ICRED because it neglects interlocutor information.

The Effect of Sparse Data on ICRED
Comparison Settings. The persona model (Li et al., 2016b) may have a sparsity issue since some interlocutors have very few dialogue turns. To investigate whether ICRED has the same sparsity issue, we divide the test data into four intervals according to the number of training dialogue turns said by the target addressee (called interlocutor dialogue turns), where small turn counts represent sparse learning data (e.g., "[0, 100]") and large turn counts mean plentiful learning data (e.g., "(5000, +∞)").
Comparative Results. Table 4 reports the performance of the persona model and ICRED under different numbers of interlocutor dialogue turns. We can clearly see that the persona model has a sparsity issue: it performs very poorly on sparse learning data (e.g., a BLEU score of 8.47 on "[0, 100]") while it achieves good performance on plentiful learning data (e.g., a BLEU score of 9.51 on "(5000, +∞)"), which demonstrates that the fixed person vectors in the persona model need to be learned from large-scale training data for each interlocutor. In contrast, ICRED exploits interactive interlocutor representations learned from the current dialog context rather than fixed person vectors obtained from all training dialog utterances. Therefore, ICRED has no sparsity issue, and it performs comparably on sparse and plentiful learning data.

Ablation Study for Model Components
Comparison Settings. In order to validate the effectiveness of the model components, we remove the main components in decoding as follows: (1) w/o Adr Mem: without the addressee memory, i.e., removing c_j in Equations 6-7; (2) w/o Ctx Spk Vec: without the contextual speaker vector, i.e., removing A_res in Equations 6-7; (3) w/o Ctx Adr Vec: without the contextual addressee vector, i.e., removing A_tgt in Equations 6-7.
Comparative Results. Results of the ablation study are shown in Table 5. We can see that removing any component causes an obvious performance degradation. In particular, "w/o Ctx Adr Vec" performs the worst on almost all of the metrics, which demonstrates the importance of contextual information for the target addressee.

The Effectiveness of Addressee Memory
Comparison Settings. In order to demonstrate the effectiveness of the addressee memory, we compare the following memory types: (1) addressee memory: memorizing contextual word representations in the last utterance said by the target addressee (e.g., u_{n-1} in Figure 2); (2) all utterance memory: memorizing contextual word representations in all utterances of the context (e.g., u_1 to u_n in Figure 2); (3) latest memory: memorizing contextual word representations of the latest utterance in the context (e.g., the latest utterance u_n in Figure 2); (4) speaker memory: memorizing contextual word representations in the last utterance said by the responding speaker; (5) w/o memory: without any memory.
Comparative Results. Table 6 reports the results over different memory types. The addressee memory achieves the best or near-best performance on all metrics. Although memorizing all utterances is competitive, the complexity of the all utterance memory is n times that of the addressee memory, where n is the number of utterances in a context. The speaker memory performs close to having no memory at all, which indicates that not all memories can improve performance.

Manual Evaluations
Besides automatic evaluations, we employ manual evaluations (MEs), which are important for response generation. Similar to (He et al., 2017b; Zhou et al., 2018), we select three metrics for MEs: (1) fluency, (2) consistency, and (3) informativeness. We conduct a pair-wise comparison between the response generated by ICRED and the one generated for the same input by each of three typical baselines. We sample 100 responses from each compared method, and two curators judge (win, tie or lose) between the two methods. The Cohen's Kappa inter-annotator statistics are 0.750, 0.658 and 0.580 for fluency, consistency and informativeness, respectively. As shown in Table 7, each score is the percentage of cases in which ICRED wins over a baseline after removing the "tie" pairs; ICRED is significantly (sign test, p-value < 0.005) superior to all baselines on every metric. This demonstrates that our model is able to deliver more fluent, consistent and informative responses.

Case Study
Figure 3 shows an example of the responses of different models for the same dialogue context. It is clearly observed that our model (ICRED) generates more fluent, consistent and knowledgeable (underlined) responses compared to the baselines. In particular, the response given by ICRED, "if you want a new kernel , you can install the kernel from the kernel repo", not only explains the reason for the kernel installation but also suggests a source for it. ICRED fully captures the context and then produces a fluent, consistent and knowledgeable response, which is semantically similar to the gold one.

Discussion
Interlocutor Prediction and RGMPC. The above methods assume that the responding speaker and target addressee are given for RGMPC. Though the speaker and the addressee can be obtained in some situations (e.g., extracted from chat logs), interlocutor prediction is still an open research problem. There have been some studies that predict either the responding speaker or the target addressee based on textual contexts or multimodal information (Akhtiamov et al., 2017a; Meng et al., 2017; Akhtiamov et al., 2017b). To examine the interaction between interlocutor prediction and RGMPC, we further design a joint model for the two tasks, in which both the speaker and the addressee are predicted simultaneously from textual contexts. Firstly, the responding speaker is predicted from the context, where h_C is a summary contextual vector obtained by max-pooling the final interlocutor embedding matrix A, h^n_{L_u} is the hidden state of the last utterance, and W is a projection matrix; a_res and A_res are the ID and the embedding of the responding speaker, respectively. The responding speaker is predicted by a softmax classifier based on embedding similarity, and the target addressee is obtained in the same way. Secondly, the predicted interlocutors replace the gold ones for the addressee memory and for extracting interlocutor embeddings from A. Finally, the interlocutor prediction loss is added to the response generation loss for training. Table 8 shows the response generation performance when the responding interlocutors are given versus predicted. We can observe that: (1) The overall performance with predicted interlocutors ("* / *" in Table 8) is slightly worse than with gold interlocutors (the first line in Table 8). Nevertheless, "* / *" still outperforms the strongest baseline (VHRED in Table 3).
(2) The correctness of interlocutor prediction has a significant impact on response generation performance. The model performs best when both the responding speaker and the target addressee are predicted correctly, while "False / False" (both mispredicted) obtains the worst performance on the referenced metrics. These results demonstrate that both the responding speaker and the target addressee contribute to generating better responses. Table 8: Performance of joint interlocutor prediction and RGMPC. "True" and "False" mean a correctly and an incorrectly predicted interlocutor, respectively; "*" represents both "True" and "False". The correctness of the responding speaker and target addressee is separated by "/". For example, "True / *" means that the responding speaker is correct and the target addressee is either correct or wrong.
(3) Surprisingly, the unreferenced metrics perform well on "False / False". One possible reason is that the wrong interlocutors still capture rich contexts, so the model generates long and meaningful responses that are, however, weakly correlated with the gold interlocutors; this is why it achieves very poor performance on the referenced metrics.
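The speaker prediction step described in this discussion (max-pooled summary vector, projection, similarity softmax over interlocutor embeddings) can be sketched as follows. The shape of `W` and the exact combination of h_C and the last utterance state are assumptions, since the original equation was lost in extraction:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_speaker(A, h_last, W):
    """Predict the responding speaker (the target addressee is handled
    the same way): max-pool the final interlocutor embedding matrix A
    into a summary vector h_C, project [h_C; h_last] with W, then score
    every interlocutor embedding by similarity to the projected query."""
    h_C = A.max(axis=0)                         # summary contextual vector
    query = W @ np.concatenate([h_C, h_last])   # projected context
    probs = softmax(A @ query)                  # embedding-similarity softmax
    return int(np.argmax(probs)), probs
```

The predicted ID then replaces the gold one when building the addressee memory and extracting embeddings from A.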

Related Work
Our work is inspired by the large number of applications of recurrent encoder-decoder frameworks (Cho et al., 2014) to NLP tasks such as machine translation (Bahdanau et al., 2015) and text summarization (Chopra et al., 2016). Recently, many studies have extended the encoder-decoder framework to response generation. HRED (Serban et al., 2016) utilizes a hierarchical encoder to capture the context. VHRED (Serban et al., 2017) extends HRED by adding a high-dimensional latent variable for utterances. These studies demonstrate the importance of contexts in response generation.
Our work is also inspired by research on multi-party chatbots. Dielmann and Renals (2008) automatically recognize dialogue acts in multi-party speech conversations. Recently, some studies have focused on the three elements (speaker, addressee, response) of multi-party chatbots. Meng et al. (2017) introduce speaker classification as a surrogate task. Addressee selection is studied by Akhtiamov et al. (2017b). Other studies address response selection (Ouchi and Tsuboi, 2016; Zhang et al., 2018). However, response selection heavily relies on the candidate set, and it cannot generate new responses in new dialogue contexts; response generation solves this problem. Li et al. (2016b) learn a fixed person vector for response generation. Unfortunately, it needs to be obtained from large-scale dialogue turns, which leads to a sparsity issue: some interlocutors have very little dialog data. In contrast, our model has no such restriction.

Conclusion
In this study, we formalize a novel task of Response Generation for Multi-Party Chatbots (RGMPC) and propose an end-to-end model which incorporates Interlocutor-aware Contexts into a Recurrent Encoder-Decoder framework (ICRED) for RGMPC. Specifically, we employ interactive speaker modeling to capture contextual interlocutor information. Moreover, we leverage an addressee memory mechanism to enrich contextual information. Furthermore, we propose to predict both the speaker and the addressee when generating responses. Finally, we construct a corpus for RGMPC. Experimental results demonstrate that ICRED remarkably outperforms strong baselines on automatic and manual evaluation metrics.

Figure 1: An example of Multi-Party Chatbots (MPC). At each turn, a speaker says one utterance to an addressee. There are many interlocutors (e.g., Alan, Bert and so on) in a conversation, where a_i represents an interlocutor's ID.

Figure 3: An example of different model responses for the same dialogue context. The input dialogue context is on the left. The gold (reference) response and the model responses are on the top right and bottom right, respectively. Each rounded rectangle is a message box, where the italicized text after "@" is the addressee, and the solid-line box next to the message box denotes the speaker or model.

As shown in Table 1, given the context C of the previous n dialog turns, the responding speaker a_res and the target addressee a_tgt at time step n+1, the task of RGMPC aims to automatically generate the next utterance u_{n+1} as the final response.
Figure 2: Overall structure of the proposed ICRED for RGMPC. At each time step t, (a_i, a_j, u_t) means that a speaker a_i said an utterance u_t to an addressee a_j, where the time step t is denoted on the bottom, and the superscript t may be omitted for brevity.

2 https://github.com/hiroki13/response-ranking
Table 3 reports the overall comparison for ICRED. We can clearly make the following observations:

Table 5: Ablation experiments obtained by removing the main components.

Table 6: Performance over different memory types.