Who Is Speaking to Whom? Learning to Identify Utterance Addressee in Multi-Party Conversations

Previous research on dialogue systems generally focuses on the conversation between two participants, yet multi-party conversations which involve more than two participants within one session bring up a more complicated but realistic scenario. In real multi-party conversations, we can observe who is speaking, but the addressee information is not always explicit. In this paper, we aim to tackle the challenge of identifying all the missing addressees in a conversation session. To this end, we introduce a novel who-to-whom (W2W) model which models users and utterances in the session jointly in an interactive way. We conduct experiments on the benchmark Ubuntu Multi-Party Conversation Corpus and the experimental results demonstrate that our model outperforms baselines with consistent improvements.


Introduction
As an essential aspect of artificial intelligence, dialogue systems have attracted extensive attention in recent studies (Vinyals and Le, 2015; Serban et al., 2016). Researchers have devoted great effort to understanding conversations between two participants, either single-turn (Li et al., 2016a; Shang et al., 2015; Vinyals and Le, 2015) or multi-turn (Zhou et al., 2016; Tao et al., 2019a,b), and achieved encouraging results. A more general and challenging scenario is that a conversation may involve more than two interlocutors conversing among each other (Uthus and Aha, 2013; Hu et al., 2019), which is known as multi-party conversation. The Ubuntu Internet Relay Chat (IRC) channel is a multi-party conversation scenario, as shown in Table 1. Generally, each utterance is associated with a speaker and one or more addressees in the conversation. Such a characteristic leads to complex speaker-addressee interactions. As a result, the speaker and addressee roles associated with utterances are constantly changing among multiple users across different turns. Such speaker and addressee information could be essential in many multi-party conversation scenarios, including group meetings, debates and forum discussions. Therefore, compared to two-party conversations, a unique issue of multi-party conversations is to understand who is speaking to whom.
Table 1 (rows flattened; the speaker and addressee alignment is not recoverable): utterances by Users 1 to 4, namely "Good point, tmux is the thing I miss.", "Cool thanks for ur help.", "Ahha, you r using something like cpanel.", "Yeah 1.4.0 exactly." and "my pleasure :)", each with either an explicit addressee mark ("@User 4", "@User 2") or none ("-").
* Equal contribution. † Corresponding author.
In real scenarios of multi-party conversations, an interesting phenomenon is that speakers do not usually designate an addressee explicitly. This phenomenon also accords with our statistical analysis of the IRC dataset: we found that around 66% of the utterances lack explicit addressee information. That means when modeling such multi-party conversations, one may have to guess who is speaking to whom in order to understand the utterance correspondence as well as the stream structure of multi-party conversations.
Given a multi-party conversation where part of the addressees are unknown, previous work mainly focuses on predicting the addressee of only the last utterance. Ouchi and Tsuboi (2016) proposed to scan the conversation session and track the speaker's state based on the utterance content at each step. On this basis, Zhang et al. (2017) introduced a speaker interaction model that tracks all users' states according to their roles in the session. They both fused the representations of the last speaker and utterance as a query, and a matching network is utilized to calculate the matching degree between the query and each listener. The listener with the highest matching score is selected as the predicted addressee.
However, in practice, predicting all the missing addressees, rather than only the last one, is more helpful for understanding the whole conversation. It also benefits both building a group-based chatbot and clustering users based on what they have said. Therefore, we propose a new task of identifying all the missing addressees given a multi-party conversation session where part of the addressees are unspecified. To this end, we propose a novel Who-to-Whom (W2W) model which jointly models users and utterances in the multi-party conversation and predicts all the missing addressees in a uniform framework. Our contributions are as follows:
• We introduce a new task of understanding who speaks to whom given an entire conversation session as well as a benchmark system.
• To capture the correlation within users and utterances in multi-party conversations, we propose an interactive representation learning approach to jointly learn the representations of users and utterances and enhance them mutually.
• The proposed approach (W2W) considers both previous and subsequent information in the session while incorporating the correlation between users and utterances. For conversations with complex structures, W2W models them in a uniform way and can handle any situation, even when all the addressee information is missing.

Related Work
In this section, we briefly review recent work and progress on multi-party conversations.
Multi-party conversations, as a general case of multi-turn conversations (Li et al., 2017, 2016c; Serban et al., 2016), involve more than two participants. In addition to representation learning for utterances, another key issue is to model multiple participants in the conversations. It is intuitive to introduce multiple user embeddings for multi-party conversations, either as persona-dependent embeddings (Li et al., 2016b), or as persona-independent embeddings (Ouchi and Tsuboi, 2016; Zhang et al., 2017; Meng et al., 2017). Recently, some researchers utilized users' information based on different roles in conversations, such as senders and recipients (Luan et al., 2016).
In multi-party conversations, identifying the relationship among users is also an important task. It can be categorized into two topics: 1) predicting who will be the next speaker (Meng et al., 2017) and 2) identifying who is the addressee (Ouchi and Tsuboi, 2016; Zhang et al., 2017). For the first topic, Meng et al. (2017) investigated a temporal-based and a content-based method to jointly model the users and context. For the second topic, which is closely related to ours, Ouchi and Tsuboi (2016) proposed to predict the addressee and utterance given a context with all available information. Later, Zhang et al. (2017) proposed a speaker-interactive model, which takes users' role information into consideration and implements a role-sensitive state tracking process.
In our task, the addressee identification problem is quite different from that of Ouchi and Tsuboi (2016) and Zhang et al. (2017). Both of their studies aimed to predict whom the last speaker addresses, whereas in this paper we focus on the whole session and aim to identify all the missing addressees. Our task is thus a more challenging scenario, since it relies on the correlation among all users and utterances to identify the speaker-addressee structure of the entire session.

Problem Formulation
Given an entire multi-party conversation $S$ with length $T$, the sequence of utterances in it is defined as $\{u_t\}_{t=1}^{T}$. Each utterance is associated with a speaker $a_t^{SPR}$ and an addressee $a_t^{ADR}$. $a_t^{SPR}$ is observable across the entire session while $a_t^{ADR}$ is mostly unspecified, as shown in Table 1. Our task is to identify the addressees for all utterances within the conversation session. The predicted addressee is denoted as $\hat{a}_t^{ADR}$. Formally, the task is formulated as Eq (1). Let $A(S)$ denote the user set in the session $S$; then $A(S) \setminus \{a_t^{SPR}\}$ denotes the listeners at the $t$-th turn ($a_t^{LSR_j}$ denotes the $j$-th listener). The listeners are also referred to as candidate addressees for each turn, and the identified addressee $\hat{a}_t^{ADR}$ should be one of them.
Concretely, the representation learning module is designed to jointly learn the representations of users and utterances in an interactive way after initializing them separately. The representations of users (also denoted as user states) and the utterance embeddings are mutually enhanced. With the representations of users and utterances, a network is utilized to fuse them into a query representation. In this way, we jointly capture who is speaking what at each step.
After the representations of users and utterances are learned, we feed them into a matching module. In this module, a matching network is learned to score the matching degrees between the query and each candidate. According to the matching scores, the model ranks all addressee candidates in $A(S) \setminus \{a_t^{SPR}\}$ and selects the one with the highest matching score as the identified addressee. For each utterance in the multi-party conversation, we repeat the above steps until the addressees of all utterances are identified.
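The per-turn procedure described above (build the candidate set, score each candidate, keep the best one, repeat for every utterance) can be sketched in a few lines. This is only an illustration: the function names, the toy scorer, and the choice to define the user set as everyone who speaks in the session are assumptions, and the learned matching network is replaced by an arbitrary `score` callback.

```python
from typing import Callable, List


def candidate_listeners(speakers: List[str]) -> List[List[str]]:
    """For each turn, return the user set A(S) minus that turn's speaker."""
    # Assumption: A(S) is every user who speaks in the session,
    # kept in order of first appearance.
    users = list(dict.fromkeys(speakers))
    return [[u for u in users if u != spk] for spk in speakers]


def identify_all(speakers: List[str],
                 score: Callable[[int, str], float]) -> List[str]:
    """Pick the highest-scoring candidate addressee at every turn."""
    predictions = []
    for turn, candidates in enumerate(candidate_listeners(speakers)):
        # Ties are resolved in favour of the earliest candidate.
        predictions.append(max(candidates, key=lambda c: score(turn, c)))
    return predictions


speakers = ["User1", "User2", "User3", "User2"]


def toy_score(turn: int, candidate: str) -> float:
    """Hypothetical scorer: prefer the speaker of the preceding turn."""
    return 1.0 if turn > 0 and speakers[turn - 1] == candidate else 0.0


predicted = identify_all(speakers, toy_score)
```

In the model itself, the `score` callback would be the matching network applied to the learned query and listener representations.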

Our W2W Model
In this section, we first describe each part of the W2W model in detail: (1) Initialization of utterance and user representations; (2) Interactive representation learning of users and utterances; (3) Matching procedure for identifying the addressee. We finally describe the training procedure of the W2W model.

Initialization
W2W models utterance and user embeddings separately before interactive representation learning and gets the representation of each utterance and user as initialization.

Utterance Initialization Encoder
Suppose a conversation session $S$ contains $T$ utterances denoted as $\{u_1, u_2, \ldots, u_T\}$. An utterance $u_t$ that contains $n$ tokens is denoted as $\{w_1, w_2, \ldots, w_n\}$, where $w_i$ is the word embedding of the $i$-th token. We first utilize a word-level bi-directional RNN with Gated Recurrent Units (GRUs) (Cho et al., 2014) to encode each utterance and take the concatenation of the hidden states of the last step from both sides as the sentence embedding. Then, a sentence-level bi-directional GRU is applied with each sentence embedding as input to obtain the global context of the session. The utterance representation $u_t$ is the concatenation of the hidden states from both sides at the $t$-th time step.

Position-Based User Initialization
In multi-party conversations, the position information of different participants in the session is crucial to the addressee identification task. For example, a speaker is more likely to address the directly preceding or subsequent speaker. On this basis, we define the initialization user matrix $A^{(0)}$ based on the speaking order of users in the session (Ouchi and Tsuboi, 2016). Concretely, all users in a session are sorted in descending order according to the first time when they speak, and the $i$-th user is assigned the $i$-th row of $A^{(0)}$ as $a_i^{(0)}$. The user matrix $A^{(0)}$ is trained as parameters along with the other weight matrices in the neural network.
Users of the same order in different sessions share the same initialization user embedding. Note that the user representations are independent of each personality (unique user). Such strategy guarantees the initialization user embeddings to carry position information as well as handle new users unseen in training data during addressee identification.
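The position-based assignment can be illustrated with a small helper that maps each user to a row index of the shared initialization matrix. This is a sketch under assumptions: the function name is hypothetical, and it reads "descending order according to the first time when they speak" literally, so the user whose first utterance comes latest gets row 0. Set `descending=False` if the opposite order is intended.

```python
from typing import Dict, List


def position_rows(speakers: List[str], descending: bool = True) -> Dict[str, int]:
    """Map each user to a row index based on the turn of their first utterance."""
    first_turn: Dict[str, int] = {}
    for turn, user in enumerate(speakers):
        # Record only the first time each user speaks.
        first_turn.setdefault(user, turn)
    ordered = sorted(first_turn, key=first_turn.get, reverse=descending)
    return {user: row for row, user in enumerate(ordered)}


rows = position_rows(["alice", "bob", "alice", "carol"])
```

Because the row depends only on speaking order and not on user identity, a user never seen in training still receives a trained initial embedding.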

Interactive Representation Learning
To better capture who speaks what at each turn throughout the whole session, we propose to learn the representations of utterances and users interactively. Prior studies (Ouchi and Tsuboi, 2016; Zhang et al., 2017) only track the users' states and neglect the users' impact on utterances. In contrast, the W2W model learns user and utterance representations interactively by tracking users' states with utterance embeddings as well as fusing users' states into the utterance embeddings.

Users Representation Learning
Role-sensitive User State Tracking. As suggested by Zhang et al. (2017), an utterance could have different degrees of impact on the states of the corresponding speaker and listeners. In order to capture the users' role information, we utilize two kinds of GRU-based cells, Speaker-GRU (SGRU) and Listener-GRU (LGRU), to track the states of the speaker and listeners respectively at each turn of the session. We denote a user embedding tracked until the $t$-th time step as $a^{(t)}$, with $a_{SPR}^{(t)}$ as the representation of the speaker at the $t$-th turn and $a_{LSR_j}^{(t)}$ as the representation of the $j$-th listener at the $t$-th turn. At the $t$-th transition step, the SGRU tracks the speaker representation $a_{SPR}^{(t)}$ from the speaker's former state $a_{SPR}^{(t-1)}$, the utterance representation $u_t$, and a pseudo addressee representation $a_{PADR}^{(t-1)}$, which is calculated via the Person Attention Mechanism (PAM) as a weighted sum of all the listeners' representations. Details on PAM are elaborated in the next part. The state tracking procedure for the $i$-th step is formulated as Eq (2). The main idea of SGRU is to incorporate two reset gates, denoted as $r_i$ and $p_i$, which control the information fusion from the listeners and the speaker respectively. $W$, $U$ and $V$ are learnable parameters.
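For orientation, one GRU-style update that is consistent with the description above can be written as follows. It is an assumed form, not a transcription of Eq (2): the update gate $z_i$, the gate-specific parameter superscripts, and the exact placement of the gates are borrowed from the standard GRU and are hypothetical here. Following the text, $r_i$ gates the listener side (the pseudo addressee) and $p_i$ gates the speaker's previous state.

```latex
\begin{aligned}
r_i &= \sigma\!\left(W^{(r)} u_i + U^{(r)} a_{SPR}^{(i-1)} + V^{(r)} a_{PADR}^{(i-1)}\right) \\
p_i &= \sigma\!\left(W^{(p)} u_i + U^{(p)} a_{SPR}^{(i-1)} + V^{(p)} a_{PADR}^{(i-1)}\right) \\
z_i &= \sigma\!\left(W^{(z)} u_i + U^{(z)} a_{SPR}^{(i-1)} + V^{(z)} a_{PADR}^{(i-1)}\right) \\
\tilde{a}_{SPR}^{(i)} &= \tanh\!\left(W u_i + U \left(p_i \odot a_{SPR}^{(i-1)}\right) + V \left(r_i \odot a_{PADR}^{(i-1)}\right)\right) \\
a_{SPR}^{(i)} &= z_i \odot a_{SPR}^{(i-1)} + (1 - z_i) \odot \tilde{a}_{SPR}^{(i)}
\end{aligned}
```

Here $\sigma$ is the logistic sigmoid and $\odot$ is element-wise multiplication. Under this reading, the cell differs from a plain GRU only in having a second input state, the pseudo addressee, with its own reset gate.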
Symmetrically, LGRU incorporates the embedding of a certain listener as well as a pseudo speaker and a pseudo utterance representation (also calculated via PAM) as inputs and tracks the state of each listener. SGRU and LGRU have symmetric updating functions as in Eq (2), differing only in the pseudo representations incorporated in the cell. The parameters of SGRU and LGRU are not shared, which lets W2W learn role-dependent features in the users' state tracking procedure. The whole structure of SGRU and LGRU is illustrated in Figure 3.
Person Attention Mechanism. We propose a person attention mechanism (PAM) (Eq (3)) [...] the state tracking process exactly. Each element $\beta_i^j$ measures how likely the model estimates the $j$-th listener to be the addressee for the $i$-th turn, based on the user representations tracked until turn $i$. $W_p$ is the parameter.
Then for each turn $i$, a pseudo addressee $a_{PADR}^{(i-1)}$ is generated as the weighted sum of all listener representations tracked until step $i$, as in Eq (4). Intuitively, a listener with a higher matching score is more likely to be the addressee at the current step. The pseudo addressee $a_{PADR}^{(i-1)}$ is incorporated into the state tracking of the speaker as in Eq (2).
Symmetrically, the pseudo speaker $a_{PSPR_j}^{(i-1)}$ and the pseudo utterance $u_i^{P_j}$ are generated through Eq (5) and Eq (6) for each listener $j$ at the $i$-th turn of the conversation.
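The weighted-sum step behind the pseudo addressee can be sketched as follows. This is a minimal illustration under assumptions: the weights are taken to be softmax-normalized, and the scoring function is passed in as a callback (a plain dot product with a hypothetical query vector in the example) rather than the learned form with $W_p$, which is not reproduced here.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_addressee(listeners: List[Vector],
                     score: Callable[[Vector], float]) -> Tuple[List[float], List[float]]:
    """Return attention weights and the weighted sum of listener states."""
    weights = softmax([score(v) for v in listeners])
    dim = len(listeners[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, listeners)) for d in range(dim)]
    return weights, pooled


# Hypothetical query standing in for the fused speaker and utterance information.
query = [2.0, 0.0]
listeners = [[1.0, 0.0], [0.0, 1.0]]
weights, padr = pseudo_addressee(
    listeners, lambda v: sum(q * x for q, x in zip(query, v)))
```

A listener that matches the query better receives a larger weight, so its state dominates the pseudo addressee that is fed into the speaker's state tracking.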

Utterances Representation Learning
We design a UGRU (Utterance-GRU) cell, which has the same structure as SGRU/LGRU, to fuse the utterance embedding, the current speaker embedding and the user-summary vector into an enhanced utterance embedding. Note that although UGRU has the same structure as SGRU/LGRU, it acts on each utterance only once instead of tracking recurrently. The user matrix initialized with $A^{(0)}$ (as described in 4.2.1) and tracked until step $t-1$ is denoted as $A^{(t-1)}$. The user-summary vector is calculated through max-pooling over users on $A^{(t-1)}$ as a summary of all users' current states.
Algorithm 1: Interactive Representation Learning Algorithm (Forward Pass). [...] Fuse users' information into the utterance representation using Eq (7); end; return 1) the user matrix of the last turn [...]
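Max-pooling over users takes, for each embedding dimension, the largest value across all rows of the user matrix. A minimal sketch, with the user matrix represented as a plain list of rows and a hypothetical function name:

```python
from typing import List, Sequence


def user_summary(user_matrix: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise maximum over the rows (users) of the user matrix."""
    return [max(column) for column in zip(*user_matrix)]


# Two users with three-dimensional states.
summary = user_summary([[0.1, -0.5, 0.3],
                        [0.4, 0.2, -0.1]])
```

The result has the same dimensionality as a single user state, regardless of how many users take part in the session.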

Forward-and-Backward Scanning.
Considering that the addressee of an utterance can be the speaker of the preceding utterances or the subsequent ones, it is important to capture the dependency from both sides for users and utterances. We propose a forward-and-backward scanning schema to enhance the interactive representation learning. In the forward pass, the W2W model outputs the forward user matrix of the last time step, denoted as $\overrightarrow{A}^{(T)}$, as well as all the forward-enhanced utterance embeddings $\{\overrightarrow{u}_t\}_{t=1}^{T}$, as illustrated in Algorithm 1. The backward pass initializes users and utterances in the same way as the forward pass and scans the conversation session in reversed order. Representations from both sides are concatenated correspondingly as the final representation, as in Eq (8).
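The scanning schema can be illustrated for the utterance side with a generic helper. The sketch below makes several assumptions: `scan` stands in for the whole interactive forward pass, the toy `running_sum` replaces the learned cells, and only utterance vectors are concatenated (the user matrices from both directions would be combined in the same way).

```python
from typing import Callable, List, Sequence

Vector = List[float]
ScanFn = Callable[[Sequence[Vector]], List[Vector]]


def bidirectional_scan(utterances: Sequence[Vector], scan: ScanFn) -> List[Vector]:
    """Run `scan` in both directions and concatenate the outputs per turn."""
    forward = scan(utterances)
    # Scan the reversed session, then flip the outputs back so that
    # position t again corresponds to turn t.
    backward = scan(list(reversed(utterances)))[::-1]
    return [f + b for f, b in zip(forward, backward)]


def running_sum(utterances: Sequence[Vector]) -> List[Vector]:
    """Toy stand-in for the interactive scan: prefix sums of the inputs."""
    state = [0.0] * len(utterances[0])
    outputs = []
    for u in utterances:
        state = [s + x for s, x in zip(state, u)]
        outputs.append(state)
    return outputs


combined = bidirectional_scan([[1.0], [2.0], [3.0]], running_sum)
```

Each combined vector therefore carries information from the turns before it (first half) and from the turns after it (second half).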

Matching
Matching Network. We first fuse the speaker embedding and the utterance embedding into a query representation $q$, then measure the embedding similarity $s$ between the query and each listener as in Eq (9), where $W_s$, $W_u$ and $W_m$ denote weight matrices. For simplicity, we use the shorthand $Match(\cdot)$ to denote Eq (9) when there is no ambiguity.
Addressee Identification. For each turn in the session, we score the matching degree between the query and each listener and select the best matched $a_i^{ADR}$ as the addressee prediction, as in Eq (10). $s_i^j$ denotes the matching score between the $j$-th listener $a^{LSR_j}$ and the query of the $i$-th turn. For the entire conversation, we repeat the above steps until the addressee for each utterance is identified.
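To make the two steps concrete, the following sketch fuses a speaker and an utterance vector into a query, scores each listener, and ranks the listeners. The functional form is an assumption and not Eq (9) itself: it takes $q = \tanh(W_s a_{SPR} + W_u u)$ and $s = \sigma(q^\top W_m a_{LSR})$, which uses the three named weight matrices in one plausible way. All function names and the toy values are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def matvec(m: Matrix, v: Vector) -> List[float]:
    """Matrix-vector product on plain Python sequences."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def match(speaker: Vector, utterance: Vector, listener: Vector,
          w_s: Matrix, w_u: Matrix, w_m: Matrix) -> float:
    """Assumed form: q = tanh(W_s a + W_u u); s = sigmoid(q . (W_m a_lsr))."""
    q = [math.tanh(a + b)
         for a, b in zip(matvec(w_s, speaker), matvec(w_u, utterance))]
    logit = sum(x * y for x, y in zip(q, matvec(w_m, listener)))
    return 1.0 / (1.0 + math.exp(-logit))


def rank_listeners(speaker: Vector, utterance: Vector,
                   listeners: Dict[str, Vector],
                   w_s: Matrix, w_u: Matrix, w_m: Matrix) -> List[str]:
    """Return listener names sorted by matching score, best first."""
    scores = {name: match(speaker, utterance, vec, w_s, w_u, w_m)
              for name, vec in listeners.items()}
    return sorted(scores, key=scores.get, reverse=True)


identity = [[1.0, 0.0], [0.0, 1.0]]
ranking = rank_listeners([1.0, 0.0], [0.0, 0.0],
                         {"A": [1.0, 0.0], "B": [-1.0, 0.0]},
                         identity, identity, identity)
```

The first element of the ranking is the predicted addressee, and the full list is what ranking metrics such as p@n are computed on.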

Learning
We utilize the cross-entropy loss to train our model (Ouchi and Tsuboi, 2016). The objective is to minimize this loss, in which each subscript $k$ denotes a session and each subscript $i$ ranges over the utterances that have ground-truth addressee information. $s_i^+$ denotes the matching score between the query and the ground-truth addressee, and $s_i^-$ denotes the score of the negative matching, where the candidate addressee is negatively sampled. All parameters in our W2W model are jointly trained via Back-Propagation (BP) (Rumelhart et al., 1986).
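A loss of this kind can be sketched as a binary cross-entropy over one positive and one negatively sampled score per labeled utterance. The exact objective is not reproduced here, so the form below is an assumption: scores are taken to lie strictly between 0 and 1, each labeled utterance contributes $-\log s^+ - \log(1 - s^-)$, and any regularization term is left out.

```python
import math
from typing import Iterable, Tuple


def addressee_loss(pairs: Iterable[Tuple[float, float]]) -> float:
    """Sum of -log(s_pos) - log(1 - s_neg) over labeled utterances.

    Each pair holds the score of the ground-truth addressee and the score
    of one negatively sampled candidate for the same utterance.
    """
    return -sum(math.log(s_pos) + math.log(1.0 - s_neg)
                for s_pos, s_neg in pairs)


confident = addressee_loss([(0.9, 0.1)])
uncertain = addressee_loss([(0.6, 0.4)])
```

The loss falls as the ground-truth addressee is scored higher and the sampled negative is scored lower, which is the behaviour the training objective is meant to reward.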

Experimental Setups
Dataset. We run experiments using the benchmark Ubuntu dataset released by Ouchi and Tsuboi (2016). The corpus consists of a large number of records including response utterances, user IDs and their posting time. We organize the dataset as samples of conversation sessions.
We filter out the sessions without a single addressee ground truth, which means no label is available in these sessions. We also filter out session samples with one or more blank utterances. Moreover, we separate the conversation sessions into three categories according to the session length. Len-5 indicates the sessions with 5 turns, and likewise for Len-10 and Len-15. Such a splitting strategy is also adopted in related studies (Ouchi and Tsuboi, 2016; Zhang et al., 2017). The dataset is split into train-dev-test sets and the statistics are shown in Table 2.
Comparison Methods. We utilize several algorithms, including heuristic and state-of-the-art methods, as baselines. As there is no existing method that can perform the new task, we adapt the baselines below to our scenario.
• Preceding: The addressee is designated as the preceding speaker of the current speaker.
• Subsequent: The addressee is designated as the next speaker.
• Dynamic RNN (DRNN): The model is originally designed to predict the addressee of the last utterance given the whole context available (Ouchi and Tsuboi, 2016). We adapt it to our scenario, which is to identify addressees for all utterances in the conversation session. Concretely, the representations of users and context are learned in the same way as in DRNN, while during the matching procedure the representations of speaker and context are utilized to calculate the matching degree with candidate addressees at each turn.
• Speaker Interactive RNN (SIRNN): SIRNN is an extension of DRNN that is more interaction-sensitive (Zhang et al., 2017). Since all addressee information is unknown in our scenario, we adapt the model to use only the speaker role and the observer role. User states are tracked recurrently according to their roles at each turn, i.e. two distinct networks ($IGRU_S$ and $IGRU_O$) are utilized to track the states of the speaker and observers at each turn. Since there is no addressee role observable through the session, we also adapt the updating cell. At each turn, $IGRU_S$ updates the speaker embedding from the previous speaker embedding and the utterance embedding, and $IGRU_O$ updates the observer embedding from the previous observer embedding and the utterance embedding. During the matching procedure, we make a prediction at each turn instead of only predicting the addressee for the last turn.
Implementation and Parameters. For fair comparison, we choose the hyper-parameters specified in Ouchi and Tsuboi (2016) and Zhang et al. (2017). We represent the words with 300-dimensional GloVe vectors, which are fixed during training. The dimensions of speaker embeddings and hidden states are set to 50. The joint cross-entropy loss function with $L_2$ weight decay of 0.001 is minimized by Adam (Kingma and Ba, 2014) with a batch size of 128.
Evaluation Metrics. To examine the effectiveness of our model on the addressee identification task, we compare it with baselines in terms of precision@n (i.e. p@n) (Yan et al., 2017). For predicting the addressee of an utterance, our model actually provides a ranking list of all candidate addressees. Intuitively, p@1 is the precision at the highest ranked position and should be the most natural way to indicate the performance. We also provide p@2 and p@3 to illustrate the potential of different systems to identify the correct addressee near the top of the lists. We also evaluate the performance on the session level: we mark a session as a positive sample if and only if all ground truth labels are correctly identified at the top 1 of the rankings, and calculate the ratio as accuracy.
As we discussed before, only a part of the utterances in multi-party conversations have an explicit addressee, which limits the completeness of automatic evaluation metrics. In order to evaluate the performance on unlabeled utterances, we leverage human inference results and calculate the consistency between the model predictions and the human results. Due to the labor cost limit, we randomly sample 100 sessions from the test sets of Len-5, Len-10 and Len-15 respectively and recruit three volunteers to annotate the addressee label for unlabeled utterances by reasoning through content and addressee information. We leave an utterance blank when the three annotators give different inference results. Finally, for around 81.4% of the unlabeled utterances, two or more annotators gave the same annotation. With the human inference results and model predictions, we use the overlapping rate as the consistency metric.
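The metrics described above can be sketched as follows. The function names are hypothetical, and the definition of the overlapping rate is an assumption (the share of utterances with an agreed human label for which the model's top-1 prediction equals that label), since its exact definition is not given in the text.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def precision_at_n(rankings: Sequence[Sequence[str]],
                   gold: Sequence[str], n: int) -> float:
    """Share of labeled utterances whose true addressee is in the top n."""
    hits = sum(g in r[:n] for r, g in zip(rankings, gold))
    return hits / len(gold)


def session_accuracy(
        sessions: Sequence[Tuple[Sequence[Sequence[str]], Sequence[str]]]) -> float:
    """Share of sessions in which every labeled utterance is right at top 1."""
    correct = sum(all(r[0] == g for r, g in zip(rankings, gold))
                  for rankings, gold in sessions)
    return correct / len(sessions)


def majority_label(annotations: List[str]) -> Optional[str]:
    """Label chosen by at least two annotators, or None when all differ."""
    label, count = Counter(annotations).most_common(1)[0]
    return label if count >= 2 else None


def overlap_rate(predictions: Sequence[str],
                 human: Sequence[Optional[str]]) -> float:
    """Assumed consistency metric: agreement on utterances with a human label."""
    kept = [(p, h) for p, h in zip(predictions, human) if h is not None]
    return sum(p == h for p, h in kept) / len(kept)


rankings = [["b", "c"], ["c", "a"]]
gold = ["b", "a"]
p1 = precision_at_n(rankings, gold, 1)
p2 = precision_at_n(rankings, gold, 2)
```

Session accuracy is the stricter of the two automatic metrics, since a single wrong top-1 prediction makes the whole session count as incorrect.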

Results and Discussion
We first give an overall comparison between W2W and the baselines, followed by ablation experiments to demonstrate the effectiveness of each part of the W2W model. We then confirm the robustness of W2W with respect to several factors, including the number of users and the position of the utterance. Furthermore, we evaluate how W2W and the baseline models perform on both labeled and unlabeled utterances.
Overall Performance. For the automatic evaluation shown in Table 3, end-to-end deep learning approaches outperform heuristic ones, which indicates that simple intuition is far from satisfactory and further confirms the value of this work. Among the deep-learning based approaches, our W2W model outperforms the state-of-the-art models on all evaluation metrics. Directly adapting approaches designed to identify the last addressee of the session may not work well in our scenario.
As shown in Figure 4, the performance of all methods drops as the context length increases (from Len-5 to Len-15), since the task becomes more difficult with more context information to encode and more candidate addressees to rank. However, the improvement of our W2W model is more obvious with longer context length. In particular, on the Len-15 dataset, W2W improves by 5% on p@1 and 10% on session accuracy over SIRNN as shown in Figure 4, which indicates the robustness of the W2W model in complex scenarios.
Figure 4: The comparison between W2W model and two state-of-the-art baselines on p@1.
Table 4 shows the consistency between human inference and model predictions. W2W also outperforms the baselines by a larger margin in longer conversation scenarios, which is consistent with the automatic evaluation. The advantage of our W2W model on unlabeled data demonstrates its superiority in detecting the latent speaker-addressee structure unspecified in the conversation stream, and suggests that it could help uncover the relationships between and across users in the session.
Ablation Test. Table 5 shows the results of the ablation test. First, we replace the bi-directional scanning schema with forward-only scanning. The result shows that the bi-directional scanning schema captures the information in long conversations more fully. Besides, we investigate the effectiveness of PAM by replacing it with a simple mean-pooling approach, as in Eq (12). The result shows that it is more beneficial to capture the correlation between the user-utterance pair and each listener, and to implement the state tracking correspondingly at each turn, with our PAM mechanism.
To investigate the effectiveness of the interactive representation learning module, we first remove the UGRU cell and fix the utterance representations in the state tracking procedure (referred to as w/o Utterance Interaction in Table 5). Symmetrically, we fix the user representations in the session by removing the SGRU and LGRU cells and maintain only the interaction effect from the users on the utterances (referred to as w/o User Interaction in Table 5). We also conduct an experiment that removes the whole interactive representation module, i.e. UGRU, SGRU and LGRU, so that addressee identification depends entirely on the initial representations of users and utterances. The result demonstrates that each part makes an important contribution to our W2W model, especially the interaction effect on users.

Number of Participants.
The task becomes more difficult as the number of participants involved in the conversation increases, since more participants correspond to more complicated speaker-addressee relationships. We investigate how the performance correlates with the number of speakers in the dataset Len-15. The results in terms of p@1 are shown in Figure 5. In conversations with few participants, all methods perform rather well. As the number of participants increases, W2W consistently outperforms the baselines. The performance gap becomes larger especially when there are 6 or more users, which indicates the capability of our W2W model in handling complex conversation situations.
Position of Addressee-to-Predict. As mentioned above in 4.1.2, position information of utterances is a crucial factor in identifying addressees in multi-party conversations. We investigate how the system performance correlates with the position of the addressee to be predicted. In Figure 6, we show the p@1 performance of W2W and the baselines when predicting the addressee of $u_i$ at the $i$-th turn. Again, W2W shows consistently better performance than the baselines regardless of the position of the turn whose addressee is predicted.
We can observe that all the methods perform relatively poorly at the beginning or the end. Compared with the middle part of a long conversation, the beginning and the ending contain less context information, which makes the addressees of these parts more difficult to predict. The result in Figure 6 shows that the gap between W2W and the other methods is even larger when the addressee-to-predict is at the beginning or the end, which indicates that W2W is better at capturing key information and is more robust in difficult scenarios.
Variance of Matching Scores. In real multi-party conversations, the utterances without addressee information can be divided into two cases. Sometimes an utterance has a specific addressee but the speaker does not specify whom he/she is speaking to. We denote these cases as NP, which refers to Null-Positive. In the other case, the utterances do not address any particular user in the conversation (denoted as NN, which means Null-Negative), such as the utterances 'Hi, everyone!' and 'Can anyone here help me?' In the Ubuntu IRC dataset, unlabeled utterances of the NN case and the NP case are mixed and are difficult to distinguish without manual annotation.
Meanwhile, our W2W model and the baseline approaches predict matching scores on each listener for every utterance, regardless of whether it has addressee information. For each utterance, the variance of the matching scores over all listeners represents how certain the model is about its addressee identification decision. A larger variance corresponds to a more confident prediction. Table 6 shows the variance comparison on labeled and unlabeled cases in the test set Len-15. On utterances with addressee labels, the variance of our W2W model is significantly larger than that of the state-of-the-art baseline, which indicates that W2W has a higher degree of certainty about its own predictions when the conversation content refers to someone explicitly. For utterances without addressee labels, the difference in variance between W2W and SIRNN is significantly reduced. Considering that the unlabeled set consists of NN utterances as well as NP ones, on which W2W has much larger variance than SIRNN just as in the labeled utterance scenario, we can infer that W2W has much lower variance than SIRNN on the NN case. This phenomenon suggests that our W2W model does not make reckless predictions when there is no clear addressee. Therefore, the variance of the matching scores over all listeners in our W2W model, to some extent, provides a signal of whether the utterance has an explicit addressee, even though we do not provide any supervision on this aspect during training.
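The confidence signal discussed above is simply the variance of one utterance's matching scores across its candidate listeners. A minimal sketch, assuming the population variance is used (the text does not say which variant) and using made-up score lists:

```python
from statistics import pvariance
from typing import Sequence


def score_variance(scores: Sequence[float]) -> float:
    """Population variance of the matching scores over all listeners."""
    return pvariance(scores)


# Hypothetical scores over three listeners for two utterances.
peaked = score_variance([0.90, 0.05, 0.05])   # one clear addressee
flat = score_variance([0.34, 0.33, 0.33])     # no listener stands out
```

A peaked score distribution gives a large variance and a flat one gives a variance near zero, which is why the variance can hint at whether an utterance has a specific addressee.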

Conclusion
In this paper, we aim at learning to identify the utterance addressees in multi-party conversations by predicting who is speaking to whom, which is a new task. To perform the new task, we propose the W2W model, which learns the user and utterance representations interactively. To handle the uncertainty in the conversation session, we design PAM, which captures the matching degree between the current query and each candidate addressee. The experimental results show prominent and consistent improvements over heuristic and state-of-the-art baselines.
In the future, we will further investigate better utterance models associated with additional information such as topics or knowledge. With the help of the learned addressee structure, we can build a general structural dialogue system for complex multi-party conversations.