Addressee and Response Selection for Multi-Party Conversation

To create conversational systems that work in real-world situations, it is crucial to assume that they interact with multiple agents. In this work, we tackle addressee and response selection for multi-party conversation, in which systems are expected to select whom they address as well as what they say. The key challenge of this task is to jointly model who is talking about what in the previous context. For this joint modeling, we propose two frameworks: 1) static modeling and 2) dynamic modeling. To provide benchmark results for our frameworks, we created a multi-party conversation corpus. Our experiments on the dataset show that the recurrent neural network based models of our frameworks robustly predict addressees and responses in conversations with a large number of agents.


Introduction
Short text conversation (STC) has been gaining popularity: given an input message, the task is to predict an appropriate response in a single-round, two-party conversation (Wang et al., 2013; Shang et al., 2015). Modeling STC is simpler than modeling a complete conversation, but it immediately benefits applications such as chat-bots and automatic short-message replies (Ji et al., 2014).
Beyond two-party conversations, there is also a need for modeling multi-party conversation, a form of conversation with several interlocutors conversing with each other (Traum, 2003; Dignum and Vreeswijk, 2003; Uthus and Aha, 2013). For example, in the Ubuntu Internet Relay Chat (IRC), several users cooperate to find a solution for a technical issue raised by another user. Each agent might have one part of the solution, and these pieces have to be combined through conversation in order to arrive at the whole solution.
A unique issue of such multi-party conversations is addressing, a behavior whereby interlocutors indicate to whom they are speaking (Jovanović and Akker, 2004; Akker and Traum, 2009). In face-to-face communication, the basic cue for specifying addressees is turning one's face toward the addressee. In contrast, in voice-only or text-based communication, the explicit declaration of addressees' names is more common.
In this work, we tackle addressee and response selection for multi-party conversation: given a context, predict an addressee and response. As Figure 1 shows, a system is required to select an addressee from the agents appearing in the previous context and a response from a fixed set of candidate responses (Section 3).
The key challenge for predicting appropriate addressees and responses is to jointly capture who is talking about what at each time step in a context. For jointly modeling this speaker-utterance information, we present two modeling frameworks: 1) static modeling and 2) dynamic modeling (Section 5). While speakers are represented as fixed vectors in the static modeling, they are represented as hidden state vectors that dynamically change over time steps in the dynamic modeling. In practice, our models trained for the task can be applied to retrieval-based conversation systems, which retrieve candidate responses from a large-scale repository with a matching model and return the highest-scoring one with a ranking model (Wang et al., 2013; Ji et al., 2014; Wang et al., 2015). Our trained models work as the ranking model and allow the conversation system to produce addressees as well as responses.
To evaluate the trained models, we provide a corpus and dataset. By exploiting the Ubuntu IRC Logs, we build a large-scale multi-party conversation corpus and create a dataset from it (Section 6). Our experiments on the dataset show that the models instantiated by the static and dynamic modeling outperform a strong baseline. In particular, the model based on the dynamic modeling robustly predicts appropriate addressees and responses even as the number of interlocutors in a conversation increases.
We make three contributions in this work:
1. We formalize the task of addressee and response selection for multi-party conversation.
2. We present modeling frameworks and the performance benchmarks for the task.
3. We build a large-scale multi-party conversation corpus and dataset for the task.

Related Work
This work follows in the footsteps of Ritter et al. (2011), who tackled the response generation problem: given a context, generate an appropriate response. While previous response generation approaches utilize statistical models on top of heuristic rules or templates (Levin et al., 2000; Young et al., 2010; Walker et al., 2003), they apply statistical machine translation (SMT) based techniques without such heuristics, which has led to recent work combining SMT-based techniques with neural networks (Shang et al., 2015; Vinyals and Le, 2015; Sordoni et al., 2015; Serban et al., 2016). As another popular approach, retrieval-based techniques are used to retrieve candidate responses from a repository and return the highest-scoring one with a ranking model (Ji et al., 2014; Wang et al., 2015; Hu et al., 2014; Wang et al., 2013). Stemming from this approach, the next utterance classification (NUC) task has been proposed, in which a system is required to select an appropriate response from a fixed set of candidates (Lowe et al., 2015; Kadlec et al., 2015). NUC can be regarded as focusing on the ranking problem of retrieval-based systems, since it omits the candidate retrieval step. The merit of NUC is that it allows us to easily evaluate model performance on the basis of accuracy.
Our proposed addressee and response selection task is an extension of the NUC. We generalize the task by integrating addressee detection, which has been regarded as a problematic issue in multi-party conversation (Traum, 2003; Jovanović and Akker, 2004; Uthus and Aha, 2013). Addressee detection has mainly been tackled in spoken/multimodal dialog system research, where the models largely rely on acoustic signals or gaze information (Jovanović et al., 2006; Akker and Traum, 2009; Ravuri and Stolcke, 2014). The present work differs from such previous work in that our models predict addressees with only textual information.
For predicting addressees or responses, how the context is encoded is crucial. In single-round conversation, a system is expected to encode only one utterance as a context (Ritter et al., 2011; Wang et al., 2013). In contrast, in multi-turn conversation, a system is expected to encode multiple utterances (Shang et al., 2015; Lowe et al., 2015). Very recently, individual personalities have been encoded as distributed embeddings for response generation in two-party conversation (Li et al., 2016). Our work differs from that work in that our proposed personality-independent representation allows us to handle new agents unseen in the training data.

Addressee and Response Selection
We propose and formalize the task of addressee and response selection (ARS) for multi-party conversation. The ARS task assumes a situation in which a responding agent gives a response to an addressee following a context.

Notation
Table 1 shows the notations for the formalization. We denote vectors with bold lower-case (e.g. x_t, h), matrices with bold upper-case (e.g. W, H_a), scalars with italic lower-case or upper-case (e.g. a_m, Q), and sets with bold italic lower-case or cursive upper-case (e.g. x, C) letters.

Formalization
Given an input conversational situation x, an addressee a and a response r are predicted:

GIVEN: x = (a_res, C, R)
PREDICT: a, r

where a_res is a responding agent, C is a context, and R is a set of candidate responses. The context C is a sequence of previous utterances up to the current time step T:

C = (u_{a_1,1}, ..., u_{a_T,T})

where u_{a_t,t} is an utterance given by an agent a_t at a time step t. Each utterance u_{a_t,t} is a sequence of N_t tokens:

u_{a_t,t} = (w_{a_t,t,1}, ..., w_{a_t,t,N_t})

where w_{a_t,t,n} is a token index in the vocabulary V.
To predict an addressee a as a target output, we select an agent from the set of agents appearing in the context, A(C). Note that a ground-truth addressee is always included in A(C). To predict an appropriate response r, we select a response from the set of candidate responses R, which consists of Q candidates:

R = {r_1, ..., r_Q},  r_q = (w_{q,1}, ..., w_{q,N_q})

where r_q is a candidate response consisting of N_q tokens, and w_{q,n} is a token index in the vocabulary V.

Dual Encoder Models
Our proposed models are extensions of the dual encoder (DE) model of Lowe et al. (2015). The DE model consists of two recurrent neural networks (RNNs) that respectively compute the vector representations of an input context and a candidate response.
A generic RNN, with input x_t ∈ R^{d_w} and recurrent state h_t ∈ R^{d_h}, is defined as:

h_t = π(W_x x_t + W_h h_{t-1})    (1)

where π is a non-linear function, W_x ∈ R^{d_h × d_w} is a parameter matrix for x_t, W_h ∈ R^{d_h × d_h} is a parameter matrix for h_{t-1}, and the recurrence is seeded with the zero vector, i.e. h_0 = 0. The recurrent state h_t acts as a compact summary of the inputs seen up to time step t.
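As a concrete illustration, the recurrence in Eq. 1 can be sketched in a few lines of NumPy. Taking tanh as the non-linear function π is an assumption for this sketch; the paper's models use the GRU introduced later.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h):
    """One step of the generic RNN: h_t = tanh(W_x x_t + W_h h_{t-1})."""
    return np.tanh(W_x @ x_t + W_h @ h_prev)

# Toy dimensions (d_w = 4 input, d_h = 3 hidden); h_0 is the zero vector.
rng = np.random.default_rng(0)
d_w, d_h = 4, 3
W_x = rng.normal(scale=0.1, size=(d_h, d_w))
W_h = rng.normal(scale=0.1, size=(d_h, d_h))

h = np.zeros(d_h)
for x_t in rng.normal(size=(5, d_w)):  # consume a sequence of 5 input vectors
    h = rnn_step(x_t, h, W_x, W_h)
# h now summarizes the whole input sequence.
```

Because tanh is bounded, every coordinate of the state stays in (−1, 1) regardless of sequence length.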
In the DE model, each word vector of the context C and the response r_q is consumed by its own RNN and summarized into the context vector h_c ∈ R^{d_h} and the response vector h_q ∈ R^{d_h}, respectively. Using these vectors, the model calculates the probability that the given candidate response is the ground-truth response for the context:

P(y(r_q) = 1 | C, r_q) = σ(h_c^T W h_q)    (2)

where y is a binary function mapping r_q to {0, 1}, in which 1 represents a ground-truth sample and 0 a false one, σ is the logistic sigmoid function, and W ∈ R^{d_h × d_h} is a parameter matrix. We propose our multi-party encoder models as extensions of this model.
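A minimal sketch of the bilinear scoring function in Eq. 2, with toy random vectors standing in for the RNN-produced summaries h_c and h_q:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def de_score(h_c, h_q, W):
    """P(y = 1 | C, r_q) = sigmoid(h_c^T W h_q): probability that the
    candidate response r_q is the ground truth for context C."""
    return sigmoid(h_c @ W @ h_q)

rng = np.random.default_rng(1)
d_h = 3
W = rng.normal(scale=0.1, size=(d_h, d_h))
h_c, h_q = rng.normal(size=d_h), rng.normal(size=d_h)
p = de_score(h_c, h_q, W)  # a probability in (0, 1)
```

The sigmoid guarantees a valid probability, so candidates can be ranked directly by this score.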

Multi-Party Encoder Models
For capturing multi-party conversational streams, we jointly encode who is speaking what at each time step. Each agent and its utterance are integrated into the hidden states of an RNN. We present two multi-party modeling frameworks: (i) static modeling and (ii) dynamic modeling, both of which jointly utilize agent and utterance representations for encoding multi-party conversation. What distinguishes them is that the agent representation in the static modeling framework is fixed, while the one in the dynamic modeling framework changes with each time step t in a conversation.

Modeling Frameworks
As an instance of the static modeling, we propose a static model that captures the speaking orders of agents in a conversation. As an instance of the dynamic modeling, we propose a dynamic model that uses an RNN to track agent states. Note that the agent representations are independent of personalities (unique users). This personality-independent representation allows us to handle new agents unseen in the training data.
Formally, similar to Eq. 2, both models calculate the probability that the addressee a_p or the response r_q is the ground truth given the input x:

P(y(a_p) = 1 | x, a_p) = σ([a_res; h_c]^T W_a a_p)    (3)
P(y(r_q) = 1 | x, r_q) = σ([a_res; h_c]^T W_r h_q)    (4)

where y is a binary function mapping a_p or r_q to {0, 1}, in which 1 represents a ground-truth sample and 0 a false one, and σ is the logistic sigmoid function. a_res ∈ R^{d_a} is a responding agent vector, a_p ∈ R^{d_a} is a candidate addressee vector, h_c ∈ R^{d_h} is a context vector, and h_q ∈ R^{d_h} is a candidate response vector. These vectors are defined separately in each model. W_a ∈ R^{(d_a+d_h) × d_a} is a parameter matrix for the addressee selection probability, and W_r ∈ R^{(d_a+d_h) × d_h} is a parameter matrix for the response selection probability. These model parameters are learned during training.
On the basis of Eqs. 3 and 4, the resulting addressee and response are selected as follows:

â = argmax_{a_p ∈ A(C)} P(y(a_p) = 1 | x, a_p)
r̂ = argmax_{r_q ∈ R} P(y(r_q) = 1 | x, r_q)

where â is the highest-probability addressee in the set of agents in the context A(C), and r̂ is the highest-probability response in the set of candidate responses R.
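The score-and-argmax selection might be sketched as follows; the bilinear scorer mirrors the form of Eqs. 3 and 4, and all vectors here are toy random stand-ins for the learned representations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def select(a_res, h_c, candidate_vecs, W):
    """Score each candidate vector e with sigmoid([a_res; h_c]^T W e) and
    return the index of the highest-probability candidate, plus all scores."""
    query = np.concatenate([a_res, h_c])          # [a_res; h_c]
    scores = [sigmoid(query @ W @ e) for e in candidate_vecs]
    return int(np.argmax(scores)), scores

rng = np.random.default_rng(3)
d_a, d_h = 4, 3
a_res, h_c = rng.normal(size=d_a), rng.normal(size=d_h)
W_r = rng.normal(scale=0.1, size=(d_a + d_h, d_h))  # as in Eq. 4
candidates = [rng.normal(size=d_h) for _ in range(5)]
best, scores = select(a_res, h_c, candidates, W_r)
```

The same routine serves both selections: addressee vectors with W_a, or response vectors with W_r.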

A Static Model
In the static model, an agent matrix A is defined for the agent vectors in Eqs. 3 and 4. This agent matrix can be defined arbitrarily; we define A on the basis of the agents' speaking orders. Intuitively, agents that spoke at recent time steps are more likely to be the addressee, and the static model captures this property. The static model is shown in Figure 2. First, the agents in the context A(C) and the responding agent a_res are sorted in descending order of their latest speaking times. Then each agent is assigned its rank in this order as an agent index a_m ∈ (1, ..., |A(C)|). In the table shown in Figure 2, the responding agent (represented as SYSTEM) has the agent index 1 because it spoke at the most recent time step t = 6. Similarly, User 1 has the index 2 because it spoke at the second most recent time step t = 5, and User 2 has the index 3 because it spoke at the third most recent time step t = 3.
Each speaking-order index a_m is associated with the a_m-th column of the agent matrix A:

a_m = A[∗, a_m]

Consuming the agent vectors, an RNN updates its hidden state. For example, at the time step t = 1 in Figure 2, the agent vector a_1 of User 1 is extracted from A on the basis of agent index 2 and then consumed by the RNN. Then, the RNN consumes each word vector w of User 1's utterance. By consuming the agent vector before the word vectors, the model can capture which agent speaks which utterance. The last state of the RNN is used as h_c. As the transition function f of the RNN (Eq. 1), we use the Gated Recurrent Unit (GRU) (Chung et al., 2014).
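The speaking-order indexing can be sketched as a small helper; the function name and the toy context below are illustrative, not from the paper:

```python
def speaking_order(context):
    """context: list of (agent, utterance) pairs indexed by time step t = 1..T.
    Returns {agent: index}, where index 1 is the most recent speaker."""
    latest = {}
    for t, (agent, _) in enumerate(context, start=1):
        latest[agent] = t  # later occurrences overwrite earlier ones
    ranked = sorted(latest, key=lambda a: latest[a], reverse=True)
    return {agent: i for i, agent in enumerate(ranked, start=1)}

# Mirrors the Figure 2 example: SYSTEM last spoke at t=6, User1 at t=5, User2 at t=3.
ctx = [("User1", "hi"), ("User2", "hello"), ("User2", "any ideas?"),
       ("User1", "try rebooting"), ("User1", "or reinstall"), ("SYSTEM", "ok")]
order = speaking_order(ctx)
```

Each resulting index then picks out the corresponding column of the agent matrix A.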
For the candidate response vector h_q, the word vectors (w_{q,1}, ..., w_{q,N_q}) in the response r_q are summarized with the RNN. Using the vectors a_res, a_p, h_c, and h_q, we predict the next addressee and response with Eqs. 3 and 4.

A Dynamic Model
In the static model, the agent representation A is a fixed matrix that does not change over a conversational stream. In contrast, in the dynamic model, the agent representation A_t tracks each agent's hidden state, which dynamically changes with the time step t. Figure 3 shows an overview of the dynamic model. Initially, we set a zero matrix as the initial agent state A_0; each column of the agent matrix corresponds to an agent hidden state vector. Then, each agent state is updated by consuming the utterance vector at each time step. Note that the states of agents that are not speaking at the time are updated with zero vectors.
Formally, each column of A_t corresponds to an agent state vector:

a_{m,t} = A_t[∗, a_m]

where the agent state vector a_{m,t} of an agent a_m at a time step t is the a_m-th column of the agent matrix A_t.
Each vector of the matrix is updated at each time step, as shown in Figure 3. The agent state vector a_{m,t} ∈ R^{d_a} for each agent a_m at each time step t is recurrently computed:

a_{m,t} = g(a_{m,t-1}, u_{m,t}),  a_{m,0} = 0

where u_{m,t} ∈ R^{d_w} is a summary vector of the utterance of agent a_m, computed with an RNN. As the transition function g, we use the GRU. For example, at time step t = 2 in Figure 3, the agent state vector a_{1,2} is influenced by its utterance vector u_{1,2} and updated from the previous state a_{1,1}.
The agent matrix updated up to the time step T is denoted as A_T. It is max-pooled over its columns and used as the summarized context vector:

h_c[i] = max_m A_T[i, a_m]

The agent matrix A_T is also used for the responding agent vector a_res and the candidate addressee vector a_p, i.e. a_res = A_T[∗, a_res] and a_p = A_T[∗, a_p]. Each candidate response r_q is summarized into a response vector h_q in the same way as in the static model.
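A rough sketch of the dynamic agent-state updates and the final max-pooling. A simple tanh recurrence stands in for the GRU transition g, and the utterance vectors are random toys; both are assumptions made for brevity:

```python
import numpy as np

def g(a_prev, u):
    """Stand-in for the GRU transition: tanh of state plus input."""
    return np.tanh(a_prev + u)

def step(A_prev, speaker, u):
    """Advance all agent states one time step: the speaker's column is
    updated with its utterance vector u; all other agents receive a zero
    vector, as described for non-speaking agents."""
    d_a, n = A_prev.shape
    A_t = np.empty_like(A_prev)
    for m in range(n):
        u_m = u if m == speaker else np.zeros(d_a)
        A_t[:, m] = g(A_prev[:, m], u_m)
    return A_t

d_a, n_agents = 4, 3
A = np.zeros((d_a, n_agents))            # A_0 is the zero matrix
rng = np.random.default_rng(2)
for speaker in [0, 1, 0, 2]:             # who speaks at t = 1..4
    A = step(A, speaker, rng.normal(size=d_a))
h_c = A.max(axis=1)                      # max-pool over agents -> context vector
```

The pooled h_c keeps, per dimension, the strongest activation across all agent states.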

Learning
We train the model parameters by minimizing the joint loss function:

L = α L_a + (1 − α) L_r + λ ||θ||^2

where L_a is the loss function for the addressee selection, L_r is the loss function for the response selection, α is the hyper-parameter for the interpolation, and λ is the hyper-parameter for the L2 weight decay. For the addressee and response selection, we use cross-entropy loss functions:

L_a = − log P(y(a+) = 1 | x, a+) − log(1 − P(y(a−) = 1 | x, a−))
L_r = − log P(y(r+) = 1 | x, r+) − log(1 − P(y(r−) = 1 | x, r−))

where x is the input set for the task, i.e. x = (a_res, C, R), a+ is a ground-truth addressee, a− is a false addressee, r+ is a ground-truth response, and r− is a false response. As the false addressee a−, we select the addressee with the highest probability from the set of candidate addressees excluding the ground truth (A(C) \ a+). As the false response r−, we randomly select a response from the set of candidate responses excluding the ground truth (R \ r+).
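The joint loss for one training sample might be computed as in this sketch; the L2 weight-decay term is omitted, and the probabilities are assumed to come from the selection models:

```python
import numpy as np

def joint_loss(p_a_pos, p_a_neg, p_r_pos, p_r_neg, alpha=0.5):
    """Interpolated cross-entropy over one ground-truth and one false
    candidate for addressee (L_a) and response (L_r) selection.
    The lambda * ||theta||^2 weight-decay term is omitted here."""
    L_a = -np.log(p_a_pos) - np.log(1.0 - p_a_neg)
    L_r = -np.log(p_r_pos) - np.log(1.0 - p_r_neg)
    return alpha * L_a + (1.0 - alpha) * L_r

# A perfectly confident, correct model (1 for positives, 0 for negatives)
# incurs zero loss; less confident predictions incur positive loss.
loss = joint_loss(0.9, 0.2, 0.8, 0.3)
```

Setting alpha = 0.5, as in the experiments, weights addressee and response selection equally.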

Corpus and Dataset
Our goal is to provide a multi-party conversation corpus/dataset that can be used over a wide range of conversation research, such as turn-taking modeling (Raux and Eskenazi, 2009) and disentanglement modeling (Elsner and Charniak, 2010), as well as for the ARS task. Figure 4 shows the flow of the corpus and dataset creation process. We first crawl the Ubuntu IRC Logs and preprocess the obtained logs.

Table 2: Statistics of the corpus and dataset. "Docs" is documents, "Utters" is utterances, "W. / U." is the number of words per utterance, "A. / D." is the number of agents per document.
Then, from the logs, we extract addressee information and add it to the corpus. In the final step, we set candidate responses and labels to build the dataset. Table 2 shows the statistics of the corpus and dataset.

Ubuntu IRC Logs
The Ubuntu IRC Logs is a collection of logs from Ubuntu-related chat rooms. In each chat room, a number of users chat and discuss various topics, mainly related to technical support for Ubuntu issues. The logs are collected into one file per day for each room, and each file corresponds to a document D. In a document, one line corresponds to one log entry given by a user. Each log entry consists of three items: (Time, UserID, Utterance). Using this information, we create a multi-party conversation corpus.

The Multi-Party Conversation Corpus
To pick out only the documents written in English, we use a language detection library (Nakatani, 2010). Then, we remove the system logs from each document and keep only the user logs. For segmenting the words in each utterance, we use a word tokenizer (TreebankWordTokenizer) from the Natural Language Toolkit. Using the preprocessed documents, we create a corpus in which each row consists of the three items (UserID, Addressee, Utterance).
First, the IDs of the users in a document are collected into a user ID list by referring to the UserID of each log entry. Then, we extract the first word of each utterance as the addressee user ID. In the Ubuntu IRC Logs, users follow the name mention convention (Uthus and Aha, 2013), in which they indicate their addressee by mentioning the addressee's user ID at the beginning of the utterance. Exploiting these name mentions, if the first word of an utterance is identical to a user ID in the user ID list, we extract the addressee ID and create an entry consisting of (UserID, Addressee, Utterance). If no addressee ID is explicitly mentioned at the beginning of the utterance, we do not extract anything.
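The name-mention extraction rule can be sketched as follows. Stripping a trailing ':' or ',' from the first token is an assumption about IRC mention style, not a detail stated in the paper:

```python
def extract_addressee(utterance, user_ids):
    """If the first token matches a known user ID (the name-mention
    convention), return (addressee, remaining utterance); otherwise
    return (None, utterance) and extract nothing."""
    tokens = utterance.split()
    if tokens and tokens[0].rstrip(":,") in user_ids:
        return tokens[0].rstrip(":,"), " ".join(tokens[1:])
    return None, utterance

users = {"alice", "bob"}  # the user ID list collected from the document
addr, rest = extract_addressee("bob: try sudo apt-get update", users)
```

Utterances without a leading mention simply yield no addressee, matching the extraction rule above.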

The ARS Dataset
By exploiting the corpus, we create a dataset for the ARS task. If a line of the corpus includes an addressee ID, we regard it as a sample for the task. As the ground-truth addressees and responses, we directly use the extracted addressee IDs and the preprocessed utterances.
As false responses, we sample utterances from elsewhere within the same document. This within-document sampling method makes the response selection task more difficult than the random sampling method adopted by Lowe et al. (2015). One reason is that common or similar topics are often discussed throughout a document and the words used tend to be similar, which makes word-based features less effective for the task. We partitioned the dataset randomly into a training set (90%), a development set (5%), and a test set (5%).
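The within-document negative sampling might look like this sketch (function and variable names are illustrative):

```python
import random

def sample_false_response(document, t, rng):
    """Sample a false response from utterances elsewhere in the same
    document, excluding the ground-truth utterance at index t."""
    pool = [u for i, u in enumerate(document) if i != t]
    return rng.choice(pool)

rng = random.Random(0)
doc = ["u0", "u1", "u2", "u3"]        # preprocessed utterances of one document
neg = sample_false_response(doc, 2, rng)   # any utterance except "u2"
```

Because negatives come from the same document, they share topic and vocabulary with the ground truth, which is what makes this setup harder than random sampling.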

Experiments
We provide performance benchmarks of our learning architectures on the addressee and response selection (ARS) task for multi-party conversation.

Experimental Setup

Datasets
We use the created dataset for the experiments. The number of candidate responses RES-CAND (|R|) is set to 2 or 10.

Evaluation Metrics
We evaluate performance by accuracy on three measures: addressee-response pair selection (ADR-RES), addressee selection (ADR), and response selection (RES). In the addressee-response pair selection, we regard an answer as correct if both the addressee and the response are correctly selected. In the addressee/response selection, we regard an answer as correct if the addressee/response is correctly selected.

Optimization
The models are trained by backpropagation through time (Werbos, 1990; Graves and Schmidhuber, 2005). For the backpropagation, we use stochastic gradient descent (SGD) with mini-batch training; the mini-batch size is set to 128. The hyper-parameter α for the interpolation between the two loss functions (Section 5.3) is set to 0.5. For the L2 weight decay, the hyper-parameter λ is selected from {0.001, 0.0005, 0.0001}.
Parameters of the models are randomly initialized over a uniform distribution with support [−0.01, 0.01]. To update parameters, we use Adam (Kingma and Ba, 2014) with the default settings suggested by the authors. As word embeddings, we use the 300-dimensional vectors pre-trained by GloVe (Pennington et al., 2014). To avoid overfitting, the word vectors are fixed across all experiments. The hidden dimensions are set to d_w = 300 and d_h = 50 in both models, and d_a is set to 300 in the static model and 50 in the dynamic model.
To identify the best training epoch and model configuration, we use early stopping (Yao et al., 2007): if the best ADR-RES accuracy on the development set has not improved for 5 consecutive epochs, training is stopped and the best-performing model is selected. The maximum number of epochs is set to 30, which is sufficient for convergence.

Implementation Details
For computational efficiency, we limit the length of a context C to C_{T−N_c+1:T} = (u_{T−N_c+1}, ..., u_T), where N_c, called the context window, is the number of utterances prior to the current time step T. We set N_c to {5, 10, 15}. In addition, we truncate utterances and responses at a maximum of 20 words. For batch processing, we zero-pad them so that the number of words is constant. Out-of-vocabulary words are replaced with <unk>, whose vector is the average over all word vectors.
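The truncation and zero-padding step can be sketched as (the helper name and pad index 0 are assumptions for illustration):

```python
def pad_or_truncate(tokens, max_len=20, pad_id=0):
    """Truncate a token-ID sequence to max_len, then right-pad with pad_id
    so that every sequence in a batch has the same length."""
    tokens = tokens[:max_len]
    return tokens + [pad_id] * (max_len - len(tokens))

short = pad_or_truncate([5, 9, 17])        # padded up to 20 IDs
long = pad_or_truncate(list(range(25)))    # truncated down to 20 IDs
```

Fixed-length sequences let the RNNs process a whole mini-batch as one tensor operation.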

Baseline Model
We set a baseline using the term frequency-inverse document frequency (TF-IDF) retrieval model for the response selection (Lowe et al., 2015). We first compute two TF-IDF vectors, one for the context window and one for a candidate response. Then, we compute the cosine similarity of these vectors and select the highest-scoring candidate response. For the addressee selection, we adopt a rule-based method: select the agent that spoke most recently, excluding the responding agent. This captures the tendency of agents to respond to the interlocutor who spoke immediately before them.
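A minimal sketch of the TF-IDF baseline for response selection, with a toy hand-built IDF table; a real implementation would estimate IDF from the training documents:

```python
import math
from collections import Counter

def tfidf_vec(text, idf):
    """Bag-of-words TF-IDF vector as a {word: weight} dict; words missing
    from the idf table get weight 0 (an assumption for this sketch)."""
    tf = Counter(text.split())
    return {w: tf[w] * idf.get(w, 0.0) for w in tf}

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def tfidf_select(context, candidates, idf):
    """Baseline response selection: return the candidate response most
    similar (by cosine) to the context window under TF-IDF weighting."""
    c = tfidf_vec(context, idf)
    return max(candidates, key=lambda r: cosine(c, tfidf_vec(r, idf)))

idf = {"sudo": 1.0, "apt": 1.0, "hello": 1.0}      # toy IDF table
best = tfidf_select("sudo apt update", ["hello there", "run sudo apt"], idf)
```

As the paper notes, within-document negatives share vocabulary with the context, which is exactly what weakens this word-overlap baseline.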

Results
Overall Performance
Table 3 shows the empirical benchmark results. The dynamic model achieves the best results on all the metrics. The static model outperforms the baseline but is inferior to the dynamic model. In addressee selection (ADR), the baseline model achieves around 55% accuracy. This means that if the agent that spoke most recently is always selected as the addressee, about half of the predictions are correct. Compared with the baseline, our proposed models achieve better results, which suggests that they can select correct addressees that spoke at earlier time steps. In particular, the dynamic model achieves 68% accuracy, which is 7 points higher than that of the static model.
In response selection (RES), our models outperform the baseline. Compared with the static model, the dynamic model achieves better results.

Effects of the Context Window
In response selection, a performance boost of our proposed models is observed for the context window N_c = 10 over N_c = 5. Comparing the results with the context windows N_c = 10 and N_c = 15, the performance improves further but only slightly, which suggests that it has almost converged. In addressee selection, the improvement of the static model with a broader context window is limited. In contrast, the dynamic model shows a steady performance boost, yielding an increase of over 5 points between N_c = 15 and N_c = 5.

Effects of the Sample Size
Figure 5 shows the accuracy curves of addressee-response pair selection (ADR-RES) for different training sample sizes. We use 1/2, 1/4, and 1/8 of the whole training samples for training. The results show that as the amount of data increases, the performance of our models improves and gradually approaches convergence. Remarkably, the performance of the dynamic model using 1/8 of the samples is comparable to that of the static model using the whole training set.

Effects of the Number of Participants
To shed light on the relationship between model performance and the number of agents in a multi-party conversation, we investigate the effect of the number of agents participating in each context. Table 4 compares the performance of the models for different numbers of agents in a context. In addressee selection, the performance of all models gradually degrades as the number of agents in the context increases. However, compared with the baseline, our proposed models mitigate this performance degradation. In particular, the dynamic model predicts correct addressees most robustly.
In response selection, unexpectedly, the performance of all models improves as the number of agents increases. A detailed investigation of the interaction between the number of agents and the complexity of response selection is an interesting line of future work.

Conclusion
We proposed the addressee and response selection task for multi-party conversation. We first provided a formal definition of the task and then created a corpus and dataset for it. To present benchmark results, we proposed two modeling frameworks that jointly model speakers and their utterances in a context. Experimental results showed that our models based on these frameworks outperform a baseline.
Our future objective is to tackle the task of predicting whether to respond to a particular utterance. In this work, we assume a situation in which there is a specific addressee that needs an appropriate response and a system is required to respond. In actual multi-party conversation, however, a system sometimes has to wait and listen to a conversation that other participants are engaged in, without needless interruption. Hence, predicting whether to respond in a multi-party conversation is an important next challenge.