Online Conversation Disentanglement with Pointer Networks

Huge amounts of textual conversation occur online every day, where multiple conversations often take place concurrently. Interleaved conversations make it difficult not only to follow the ongoing discussions but also to extract relevant information from simultaneous messages. Conversation disentanglement aims to separate intermingled messages into detached conversations. However, existing disentanglement methods rely mostly on handcrafted features that are dataset specific, which hinders generalization and adaptability. In this work, we propose an end-to-end online framework for conversation disentanglement that avoids time-consuming domain-specific feature engineering. We design a novel way to embed the whole utterance, comprising timestamp, speaker, and message text, and propose a custom attention mechanism that models disentanglement as a pointing problem while effectively capturing inter-utterance interactions in an end-to-end fashion. We also introduce a joint-learning objective to better capture contextual information. Our experiments on the Ubuntu IRC dataset show that our method achieves state-of-the-art performance in both link and conversation prediction tasks.


Introduction
With the fast growth of the Internet and mobile devices, people now commonly communicate in the virtual world to discuss events, issues, tasks, and personal experiences. Among the various methods of communication, text-based conversational media, such as Internet Relay Chat (IRC), Facebook Messenger, WhatsApp, and Slack, have been and remain among the most popular choices. Multiple ongoing conversations occur naturally in such social and organizational interactions, especially when the conversation involves more than two participants (Elsner and Charniak, 2010). For example, consider the excerpt of a multi-party conversation in Figure 1, taken from the Ubuntu IRC corpus (Lowe et al., 2015). Even in this small excerpt, there are 4 concurrent conversations (distinguished by different colors) among 4 participants.
Identifying or disentangling individual conversations is often considered a prerequisite for downstream dialog tasks such as utterance ranking and generation (Lowe et al., 2017; Kim et al., 2019). It can also help build other applications such as search, summarization, and question answering over conversations, and support users by providing online help (Joty et al., 2019).
However, often there are no explicit structures or metadata to separate out the individual conversations. Naive heuristics for disentanglement often lead to suboptimal results: Kummerfeld et al. (2019) found that only 10.8% of the conversations in the widely used Ubuntu IRC dialog corpus were extracted correctly by the heuristics employed by Lowe et al. (2015, 2017).
Previous studies have therefore investigated traditional machine learning methods with statistical and linguistic features for conversation disentanglement, e.g., (Shen et al., 2006), (Wang and Oard, 2009; Wang et al., 2011b,a), (Elsner and Charniak, 2010, 2011), to name a few. The task is generally solved by first finding links between utterances, and then grouping them into a set of distinct conversations. Recent work by Jiang et al. (2018) and Kummerfeld et al. (2019) adopts deep learning approaches to learn abstract linguistic features and compute message pair similarity. However, these methods rely heavily on hand-engineered features that are often too specific to the particular datasets (or domains) on which the model is trained and evaluated. For example, many of the features used in (Kummerfeld et al., 2019) are only applicable to the Ubuntu IRC dataset. This hinders the model's generalization and adaptability to other domains.
In this work, we propose a more general framework for conversation disentanglement that avoids time-consuming and domain-specific feature engineering. In particular, we cast link prediction as a pointing problem, where the model learns to point to the parent of a given utterance (Figure 2). Each pointing operation is modeled as a multinomial distribution over the set of previous utterances. A neural encoder is used to encode each utterance text along with its speaker and timestamp. The pointing function implements a custom attention mechanism that models different interactions between two utterances. This results in an end-to-end neural framework that can be optimized with a simple cross-entropy loss. During training, we jointly model the reply-to relationship and the pairwise relationship (whether two utterances are in the same conversation) under the same framework, so that more contextual and structural information can be learned by our model to further improve disentanglement performance.
Furthermore, our framework supports online decoding, which comes naturally with the pointer network formulation: it disentangles a conversation as it unfolds and can provide real-time help to participants in contributing to the right conversations.
We performed extensive experiments on the recently released Ubuntu IRC dataset (Kummerfeld et al., 2019) and demonstrate that our approach outperforms previous methods on both link prediction and conversation prediction tasks. Ablation studies reveal the importance of the different components of the model and of special handling of self-links. Our framework is generic and can be applied to chat conversations from other domains.

Background
In this section, we give a brief overview of previous work on conversation disentanglement and the generic pointer network model.

Conversation disentanglement
Most existing approaches treat disentanglement as a two-stage problem. The first stage involves link prediction, which models the "reply-to" relation between two utterances. The second stage is a clustering step, which utilizes the results from link prediction to construct the individual conversation threads.
For link prediction, earlier methods used discourse cues and content features within statistical classifiers. Elsner and Charniak (2008, 2010) combine conversation cues like speaker, mention, and time with content features like the number of shared words to train a linear classifier.
Recent methods use neural models to represent utterances with compositional features. Mehri and Carenini (2017) pre-train an LSTM network to predict the reply probability of an utterance, which is then used in a link prediction classifier along with other handcrafted features. Jiang et al. (2018) model high- and low-level linguistic information using a siamese hierarchical convolutional network that models similarity between pairs of utterances in the same conversation. The interactions between two utterances are captured by taking the element-wise absolute difference of the encoded sentence features, along with other handcrafted features. Kummerfeld et al. (2019) use feed-forward networks with averaged pre-trained word embeddings and many hand-engineered features. Tan et al. (2019) used an utterance-level LSTM network, while Zhu et al. (2019) used a masked transformer to get context-aware utterance representations by considering utterances in the same conversation.
Finding a globally optimal clustering solution for conversation disentanglement has been shown to be NP-hard (McCallum and Wellner, 2005). Previous methods focus mostly on approximating the global optimum by either using greedy decoding (Wang and Oard, 2009; Elsner and Charniak, 2008, 2010, 2011; Jiang et al., 2018; Aumayr et al., 2011) or training multiple link classifiers to vote (Kummerfeld et al., 2019). Mehri and Carenini (2017) trained additional classifiers to decide whether an utterance belongs to a conversation or not. Wang et al. (2020) use a multi-task topic tracking framework for conversation disentanglement, topic prediction, and next utterance ranking.
Our work is fundamentally different from previous studies in that we treat link prediction as a pointing problem modeled by a multinomial distribution over the previous utterances (as opposed to pairwise binary classification). This formulation allows us to model the global conversation flow. Our method does not rely on any handcrafted features. Each utterance in our method is represented by the utterance text, its speaker, and its timestamp, which are generic to any conversation. The interactions between utterances are effectively modeled within the pointer module. Moreover, our framework can work in an end-to-end online setup.

Pointer Networks
Pointer networks (Vinyals et al., 2015) are a class of encoder-decoder models that can tackle problems where the output vocabulary depends on the input sequence. They use attention as pointers to the input elements. An encoder network first transforms the input sequence X = (x_1, ..., x_n) into a sequence of hidden states H = (h_1, ..., h_n). At each time step t, the decoder takes the input from the previous step, generates a decoder state d_t, and uses it to attend over the input elements. The attention gives a softmax (multinomial) distribution over the input elements:

a_t = softmax( σ(d_t, h_1), ..., σ(d_t, h_n) )

where σ(·, ·) is a scoring function for attention, which can be a neural network or simply a dot product. The model uses a_t to infer the output: ŷ_t = argmax(a_t).
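As a minimal sketch (assuming a simple dot-product scorer, not any particular trained model), the pointing attention above can be written as:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax
    e = np.exp(x - np.max(x))
    return e / e.sum()

def pointer_attention(d_t, H):
    """Score decoder state d_t against encoder states H (n x dim rows)
    with a dot product, returning a distribution over input positions
    and the pointed index (argmax of the distribution)."""
    scores = H @ d_t                 # sigma(d_t, h_j) as a dot product
    a_t = softmax(scores)            # multinomial distribution over inputs
    return a_t, int(np.argmax(a_t))
```

For example, with orthogonal encoder states, the decoder state most aligned with position j makes the model point to j.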
Similar to the standard pointer network, each pointing mechanism in our approach is modeled as a multinomial distribution over the indices of the input sequence.However, unlike the original pointer network where a decoder state points to an encoder state, in our approach, the current encoder state points to the previous states.
Pointer networks have recently yielded state-of-the-art results in constituency parsing (Nguyen et al., 2020), dependency parsing (Ma et al., 2018), anaphora resolution (Lee et al., 2017), and discourse segmentation and parsing (Lin et al., 2019). To the best of our knowledge, this is the first work that utilizes a pointer network for conversation disentanglement. It is also a natural fit for online conversation disentanglement.

Our Disentanglement Model
Given a sequence of streaming utterances U = {U_1, U_2, ..., U_i, ...}, our task in link prediction (§3.1) is to find the parent utterances U_{p_i} ⊆ U_{≤i} that the current utterance U_i replies to. Here, U_{≤i} refers to all the utterances up to and including i, that is, U_{≤i} = {U_1, ..., U_i}. An utterance can reply to itself (i.e., p_i = i), for example, the initial message in a conversation or a system message. Besides, one utterance may have multiple parents, and one parent can be replied to by multiple (children) messages. For example, in Figure 1, both pnunn and hannasanarion reply to TuxThePonguin's message. The case of one message replying to multiple parents is very rare in our corpus (see Table 1).
After link prediction, we employ a decoding algorithm ( §3.3) to construct the individual threads.

Link Prediction by Pointing
We propose a joint learning framework for conversation disentanglement based on pointing operations; Figure 2 shows the network architecture. It has three main components: (a) an utterance encoder, (b) a pointer module, and (c) a pairwise classification model. The job of the utterance encoder is to encode each utterance U_i as it comes, while the pointer module implements a custom pointing mechanism to find the parent message U_p ∈ U_{≤i} that U_i replies to. The pairwise classification model determines whether two utterances U_i and U_j are in the same conversation or not.

Utterance Encoder
As shown in Figure 1, each utterance U_i has three components ⟨t_i, s_i, m_i⟩: the timestamp t_i when the utterance was posted, the speaker s_i who posted it, and the message content m_i. We encode these three components separately.
Encoding timestamp.

Encoding speaker. In a multi-party conversation, participants mention each other's names to make disentanglement easier, compensating for the lack of visual cues normally present in a face-to-face conversation (O'Neill and Martin, 2003; Elsner and Charniak, 2010). Our goal is to capture the mention relation between utterances in the way we encode the speaker information. For this, each speaker is placed in the same vocabulary as the words and encoded with a unique identifier (a discrete value).
Encoding message text. We utilize a bidirectional LSTM, or Bi-LSTM (Hochreiter and Schmidhuber, 1997), to encode the raw text message m_i into deep contextual representations of the words. We concatenate the hidden states from the forward and backward LSTM cells. Formally, for a message containing n words m_i = (w_0, w_1, ..., w_n), the Bi-LSTM gives H = (h_0, h_1, ..., h_n) with h_k = h→_k ⊕ h←_k, where ⊕ denotes vector concatenation.

Pointer Module
Given the encoded representation of the current utterance U_i, our pointer module computes a probability distribution over the previous utterances U_{≤i}, which represents the probability that U_i replies to an utterance U_j ∈ U_{≤i}. The module implements an association function between U_i and U_j by incorporating different kinds of interactions.

(a) Time difference: The time difference between U_i and U_j is computed as Δt_{i,j} = t_i − t_j.

(b) Mention: To determine whether U_i mentions s_j (the speaker of U_j) in its message m_i, we compute mention_{i,j} = Σ_k 1(w_k = s_j), where 1 is the indicator function that returns 1 if the index of w_k in m_i matches the speaker id s_j.
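A minimal sketch of the shared word/speaker vocabulary and of these two features (the helper names here are our own, not from the paper):

```python
from datetime import datetime

def build_vocab(tokens):
    """Words and speaker names share one vocabulary, so a speaker's id
    can be compared directly against the message's token ids."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

def time_difference(t_i, t_j):
    """Time difference between two utterances, in seconds."""
    return (t_i - t_j).total_seconds()

def mention(message_ids, speaker_id):
    """Indicator-sum: how many tokens of m_i match the speaker id s_j."""
    return sum(1 for w in message_ids if w == speaker_id)
```

Because speakers live in the word vocabulary, a name mentioned inside a message maps to the same id as the speaker, so `mention` is a plain token comparison.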
(c) Mention history: The pointer module also keeps track of mention histories between two speakers. It not only computes whether s_i mentions s_j in m_i, but also keeps track of whether s_i and s_j mentioned each other in their previous messages and how often. It maintains an external memory M (a matrix) to record this. At each step, we compute both mention_{i,j} and mention_{j,i}, and update the memory M to be used in the next step.
The memory M grows incrementally as the model sees new speakers during training and inference.
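The growable mention-history memory can be sketched as a map over speaker pairs (a hypothetical implementation; the paper describes M only as an incrementally growing matrix):

```python
from collections import defaultdict

class MentionMemory:
    """External memory M recording how often each ordered speaker pair
    has mentioned the other; it grows as new speakers appear."""
    def __init__(self):
        self.M = defaultdict(int)   # (s_i, s_j) -> mention count

    def lookup(self, s_i, s_j):
        # history features in both directions for the pair (s_i, s_j)
        return self.M[(s_i, s_j)], self.M[(s_j, s_i)]

    def update(self, s_i, s_j, mention_ij, mention_ji):
        # record this step's mention counts for use at the next step
        self.M[(s_i, s_j)] += mention_ij
        self.M[(s_j, s_i)] += mention_ji
```

Using a dictionary keyed by speaker pairs lets the memory grow as new speakers appear during training and inference, without pre-sizing a matrix.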
(d) Topic coherence: To model textual similarity between m_i and m_j, we use a method similar to that of Chen et al. (2016). Let H_i = (h_{i,0}, ..., h_{i,p}) and H_j = (h_{j,0}, ..., h_{j,q}) be the Bi-LSTM representations of m_i and m_j, respectively, from the utterance encoder layer. We compute a soft alignment between m_i and m_j as follows:

h̃_{i,k} = Σ_l α_{k,l} h_{j,l},  α_{k,l} = softmax_l( h_{i,k}ᵀ h_{j,l} )   (5)

h̃_{j,l} = Σ_k β_{l,k} h_{i,k},  β_{l,k} = softmax_k( h_{j,l}ᵀ h_{i,k} )   (6)
In Eq. 5, we use the vectors in H_i as the query vectors to compute attention over the key/value vectors in H_j, yielding a set of attended vectors H̃_i = (h̃_{i,0}, ..., h̃_{i,p}), one for each h_{i,k} ∈ H_i.
Eq. 6 does the same thing but uses H_j as the query vectors and H_i as the key/value vectors to compute the attended vectors H̃_j = (h̃_{j,0}, ..., h̃_{j,q}).
Then we enhance the interactions by applying difference and element-wise product between the original representations H and the attended representations H̃:

H'_i = H_i ⊕ H̃_i ⊕ (H_i − H̃_i) ⊕ (H_i ⊙ H̃_i)   (7)

H'_j = H_j ⊕ H̃_j ⊕ (H_j − H̃_j) ⊕ (H_j ⊙ H̃_j)   (8)
The final representation h_{i,j} is computed by combining the enhanced representations, h_{i,j} = pool(H'_i) ⊕ pool(H'_j) (9), where ⊕ denotes concatenation.
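The soft-alignment and enhancement steps can be sketched with plain NumPy (dot-product scoring assumed, in the style of Chen et al. (2016); this is an illustration, not the trained model):

```python
import numpy as np

def soft_align(Hi, Hj):
    """Cross-attention: each vector in Hi attends over Hj (Eq. 5)
    and each vector in Hj attends over Hi (Eq. 6)."""
    S = Hi @ Hj.T                                   # (p x q) similarity scores
    Ai = np.exp(S - S.max(axis=1, keepdims=True))
    Ai /= Ai.sum(axis=1, keepdims=True)             # row-softmax over Hj
    Hi_att = Ai @ Hj                                # attended version of Hi
    Aj = np.exp(S.T - S.T.max(axis=1, keepdims=True))
    Aj /= Aj.sum(axis=1, keepdims=True)             # row-softmax over Hi
    Hj_att = Aj @ Hi                                # attended version of Hj
    return Hi_att, Hj_att

def enhance(H, H_att):
    """Concatenate original, attended, difference, and element-wise
    product representations (Eqs. 7-8)."""
    return np.concatenate([H, H_att, H - H_att, H * H_att], axis=-1)
```

Each attended vector is a convex combination of the other message's word vectors, so the enhancement features measure how far each word is from its best soft match.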
(e) Pointing: After computing the above four types of interactions between each pair of utterances, we concatenate them and feed them into a feed-forward network to compute the pointing distribution over all the previous utterances:
f(U_i, U_j) = FFN( Δt_{i,j} ⊕ mention_{i,j} ⊕ M_{i,j} ⊕ h_{i,j} )   (10)

p_ptr(U_j | U_i) = softmax_j( wᵀ f(U_i, U_j) ),  for j ∈ (0, ..., i)   (11)

where w is a shared linear layer parameter. We use a cross-entropy (CE) loss for the pointer module:

L_ptr(θ) = − Σ_i Σ_{j=0}^{i} y_{i,j} log p_ptr(U_j | U_i)   (12)

where y_{i,j} = 1 if U_i replies to U_j and 0 otherwise, and θ are the model parameters.
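For a single utterance, the pointing distribution and its cross-entropy loss reduce to a softmax over candidate-parent scores; a sketch, with `scores` standing in for the feed-forward network outputs wᵀf(U_i, U_j):

```python
import numpy as np

def pointing_loss(scores, parent_index):
    """Cross-entropy over a multinomial pointing distribution: `scores`
    holds one logit per candidate parent j in {0, ..., i}, with the
    current utterance itself included as a self-link candidate."""
    e = np.exp(scores - scores.max())
    p = e / e.sum()                          # softmax over candidates
    return -np.log(p[parent_index]), p      # CE loss and the distribution
```

Uniform logits over two candidates, for instance, give a loss of log 2, the entropy of a fair coin.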

Pairwise Classification Model
In the above pointer module, we only consider first-order interactions between two utterances in an online fashion (i.e., looking only at previous utterances). However, higher-order information derived from the entire conversation may provide more contextual information for the model to learn useful disentanglement features. We propose a joint learning framework that considers both first- and higher-order information simultaneously in a unified framework; see Figure 2(b)-(c).
For the higher-order information, we train a binary pairwise classifier that decides whether two utterances should be in the same conversation. For any two arbitrary utterances, we use the same feature function from Eq. 10 (i.e., the parameters are shared with the pointer module) and feed it into a binary logistic classifier. The probability that two utterances belong to the same conversation is:

p_pair(U_i, U_j) = sigmoid( wᵀ f(U_i, U_j) )   (13)

where w is the classifier parameter. We use a binary cross-entropy loss for this model:

L_pair(θ) = − Σ_{i,j} [ y_{i,j} log ŷ_{i,j} + (1 − y_{i,j}) log(1 − ŷ_{i,j}) ]   (14)

where ŷ_{i,j} = p_pair(U_i, U_j), y_{i,j} = 1 if U_i and U_j are in the same conversation and 0 otherwise, and θ are the model parameters. Since the pairwise classifier is trained on all possible pairs of utterances in a conversation, it models higher-order information about the conversation clusters.
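For one utterance pair, the classifier and its loss can be sketched as follows (`features` stands in for f(U_i, U_j), and `w` is an illustrative weight vector, not a trained parameter):

```python
import numpy as np

def pairwise_loss(features, w, same_conversation):
    """Binary logistic classifier on the shared pair features: sigmoid
    score, then binary cross-entropy against the same-conversation label."""
    p = 1.0 / (1.0 + np.exp(-features @ w))   # p_pair(U_i, U_j)
    y = 1.0 if same_conversation else 0.0
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return loss, p
```

With a zero score the classifier is maximally uncertain (p = 0.5) and the loss is log 2, mirroring the pointing loss above.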

Training
The final training loss of our model is:

L(θ) = L_ptr(θ) + λ L_pair(θ)   (15)

where λ is a hyper-parameter tuning the importance of the pairwise classification loss. We use 128-dimensional GloVe word embeddings (Pennington et al., 2014), pre-trained by Lowe et al. (2015) on the #Ubuntu corpus. The hidden layers in the Bi-LSTM have 256 dimensions. We optimize our model with the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 1 × 10^-5. For regularization, we set dropout to 0.2 and the L2 penalty weight to 1 × 10^-7.

Decoding
Our framework naturally allows us to disentangle threads in an online fashion. As a new utterance U_i arrives, the utterance encoder encodes it into a vector. The pointer module then computes a multinomial distribution over the previous utterances U_{≤i} (Eq. 11) by modeling pairwise interactions, and finds the parent message as:

p̂_i = argmax_{j ∈ (0, ..., i)} p_ptr(U_j | U_i)
To the best of our knowledge, this is the first work to use an end-to-end framework for online conversation disentanglement. In our analysis of several conversations, we found that self-links (an utterance pointing to itself) play a crucial role in clustering performance: a mistake in identifying a self-link results in two mis-clusterings.
To address this, we make a simple adjustment to our decoding method: we raise the threshold for self-link prediction so as to predict self-links more conservatively. In particular, during decoding we first find the parent with the highest probability; if it turns out to be a self-link, we check whether its probability passes a preset threshold, and otherwise predict the utterance with the second-highest probability as the parent. The threshold for self-links is tuned on the development set.
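The threshold-adjusted decoding step can be sketched as follows (`tau` is the dev-tuned threshold; 0.5 here is only an illustrative default):

```python
import numpy as np

def decode_parent(probs, i, tau=0.5):
    """Pick the parent for utterance i from its pointing distribution,
    but only accept a self-link (index i) if it clears threshold tau;
    otherwise fall back to the best non-self candidate."""
    best = int(np.argmax(probs))
    if best == i and probs[i] < tau:
        probs = probs.copy()
        probs[i] = -1.0                 # mask out the self-link
        best = int(np.argmax(probs))    # second-highest candidate
    return best
```

A borderline self-link is thus demoted to its runner-up parent, trading a little link accuracy for fewer spurious new conversations.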

Experiment
In this section, we present our experiments: the dataset used, the evaluation metrics, the experimental setup, and the results with analysis.

Dataset
The dataset used for training and evaluation is from (Kim et al., 2019); it is the largest available dataset for conversation disentanglement (Kummerfeld et al., 2019). It consists of multi-party conversations extracted from the #Ubuntu IRC channel. A typical conversation starts with a question asked by one participant, after which other participants respond with either an answer or follow-up questions, leading to a back-and-forth conversation between multiple participants. An example of the Ubuntu IRC data is shown in Figure 1. We follow the same train, dev, test split as the Dialog System Technology Challenges 8 (DSTC8) (Kim et al., 2019). Table 1 reports the dataset statistics.

Metrics
We consider two kinds of metrics to evaluate our disentanglement model: link level and conversation level. For the link level, we use precision, recall, and F-1 scores. For the cluster level, we use the same clustering metrics as DSTC8 and (Kummerfeld et al., 2019). These include: (a) Variation of Information (VI). This is a measure of information gain or loss when going from one clustering to another (Meilă, 2007). It is the sum of conditional entropies H(Y|X) + H(X|Y), where X and Y are clusterings of the same set of items. We use the bound VI(X;Y) ≤ log(n) for n items and present 1 − VI, so that larger values are better.
(b) Adjusted Rand Index (ARI). A measure (also referred to as 1V1) (Hubert and Arabie, 1985) of agreement between two clusterings that considers all pairs of samples and counts pairs assigned to the same or different clusters in the predicted and true clusterings. ARI is defined as:

ARI = [ Σ_{ij} C(n_{ij}, 2) − Σ_i C(a_i, 2) Σ_j C(b_j, 2) / C(n, 2) ] / [ ½ ( Σ_i C(a_i, 2) + Σ_j C(b_j, 2) ) − Σ_i C(a_i, 2) Σ_j C(b_j, 2) / C(n, 2) ]

where C(·, 2) is the binomial coefficient, n_{ij} is the number of items shared by predicted cluster i and ground-truth cluster j, and a_i and b_j are the row and column sums of n_{ij}.
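The ARI formula can be computed directly from a contingency table of predicted versus gold clusters:

```python
from math import comb

def adjusted_rand_index(contingency):
    """ARI from a contingency table: contingency[i][j] = n_ij, the number
    of items shared by predicted cluster i and gold cluster j."""
    sum_ij = sum(comb(n, 2) for row in contingency for n in row)
    a = [sum(row) for row in contingency]            # row sums a_i
    b = [sum(col) for col in zip(*contingency)]      # column sums b_j
    sum_a = sum(comb(x, 2) for x in a)
    sum_b = sum(comb(x, 2) for x in b)
    n = sum(a)                                       # total number of items
    expected = sum_a * sum_b / comb(n, 2)            # chance-level agreement
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

A perfect clustering (diagonal contingency table) scores 1; chance-level agreement scores around 0.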
(c) Exact Match F-1. Calculated from the number of perfectly matched conversations, excluding conversations with only one message.

Models Compared
Feed-Forward Model. We use the feed-forward model from (Kummerfeld et al., 2019) as the baseline, which outperforms previously proposed disentanglement models. For the DSTC8 challenge, the author (one of the task organizers) provided a trained model 3 with two feed-forward layers. The input is 77 hand-engineered features combined with 128-dimensional averaged word embeddings from pre-trained GloVe. We denote this model as FF below.
Pointer Network. This is our model. For computational efficiency, we do not compute attention over all previous utterances; instead we set a fixed window size of 50. That is, for the current utterance, we calculate attention over itself and its 50 previous utterances during both training and decoding. In our training data, about 97% of utterances' parents are located within this window.
In the #Ubuntu data, according to the statistics in Table 1, an utterance has only 1.03 parent utterances on average. So, given an utterance, we predict only its most likely parent.
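The 50-utterance candidate window described above amounts to the following candidate set per utterance:

```python
def candidate_parents(i, window=50):
    """Candidate parent indices for utterance i: itself (the self-link
    candidate) plus up to `window` preceding utterances."""
    return list(range(max(0, i - window), i + 1))
```

Early utterances simply have fewer candidates; from index 50 onward each utterance has exactly 51 (itself plus 50 predecessors).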

Results
We present our main results in Table 2. For analysis purposes, the table also shows results for two variants of the models: (i) when the models consider only the utterance texts, denoted by the (T) suffix; and (ii) when the models exclude the utterance text, denoted by the (−T) suffix. In addition, we present how the models perform specifically on self-link predictions, as correctly identifying self-links turns out to be crucial for identifying the conversations, as we explain later.

3 https://github.com/dstc8-track2/NOESIS-II/tree/master/subtask4

We also report the performance of the Siamese hierarchical convolutional neural network (SHCNN) from (Jiang et al., 2018) on the #Ubuntu dataset. However, SHCNN mainly focuses on modeling message content representations and incorporates only a few simple context features: speaker identicality, absolute time difference, and the number of duplicate words. Its performance is therefore not as good as that of the feed-forward model with its many hand-engineered features.
Link Prediction. We can see that our Pointer Network achieves better link prediction accuracy than the baseline when it uses the message texts. The baseline performs slightly better in the absence of message texts because it uses several meta features from the whole thread that capture more structural information, whereas our model has access only to time and speaker information in that setting. Thus, we can say that our model captures textual similarity, or topical coherence, better than the baseline.
Cluster (or Conversation) Prediction. Comparing performance at the cluster level, we see that our model performs much better than the baseline when using only textual information. In the absence of textual information, it performs on par with the baseline. However, comparing the full models, we notice that our results are lower on some cluster-level measures (see the 'Ptr-Net' results in the last block), which we did not expect given the higher link prediction accuracy. We therefore performed a case study, which reveals that one particular kind of link, which we call a "self-link", is crucial to the cluster-level results.
Self-links. Kummerfeld et al. (2019) mention that most self-links are system messages like "===zelot just join the channel". However, according to our statistics (shown in Table 3), only 41% of self-links are system messages, and we have identified two other types of utterances that reply to themselves:

• Start of a conversation: These messages do not reply to any previous message but are replied to afterward.

• Isolated messages: These are non-system messages that reply to no previous message and are never replied to afterward.
Handling Self-links. To see how much our models are affected by inaccurate self-link predictions, we performed an experiment in which we replace the predicted self-links with ground truth (Table 4). From Table 5, the baseline model (Kummerfeld et al., 2019) can predict the system messages with almost 100% accuracy, but for the other kinds of self-links the performance is not as high. Experimental results in Table 2 show that our proposed Pointer Network has worse results on self-links than the FF model (compare "FF" and "Ptr-Net" in the fourth and fifth blocks), which explains why our model does not perform well on the clustering metrics.
When we apply the simple adjustment to our decoding method by raising the threshold for self-link prediction, we see significant improvements in clustering. Note that this adjustment also improves the performance of the baseline (FF+Self-Link).
Joint Learning. The results in Table 2 show that joint learning further improves the clustering results. Combined with joint training, our online decoding algorithm achieves state-of-the-art results.
Ablation Studies. To show the importance of encoding textual and non-textual features like speaker and time, we trained the models with only textual features and without textual features; see the (T) and (−T) variants in Table 2.
Intuitively, it is hard for the models to make good predictions with only textual features, since utterances in the same conversation usually talk about similar topics, making it difficult to identify the right parent. Accordingly, the results in Table 2 show a huge drop in performance when no speaker or time information is used and only textual features are available (see the first block of results with the (T) suffix). This indicates that time and speaker mention information play a crucial role in disentanglement.
Similarly, performance also drops when the models do not consider textual information (see the (−T) variants in Table 2), although the drop is smaller than in the text-only setting.

Conclusion
In this paper, we have proposed a novel online framework for disentangling multi-party conversations. In contrast to previous work, our method reduces the effort of complicated feature engineering by proposing an utterance encoder and a pointer module that models inter-utterance interactions. Moreover, we propose a joint-training framework that enables the pointer network to learn more contextual information. Link prediction in our framework is modeled as a pointing function with a multinomial distribution over previous utterances. We also show that our framework supports online decoding. Extensive experiments on the #Ubuntu dataset show that our method achieves state-of-the-art performance on both link and conversation prediction tasks without using any handcrafted features.
There are several possible future directions for our work. We have shown experimentally that self-link predictions have a significant impact on clustering results. This suggests that neither our method nor most existing methods take good advantage of graph information in disentangling conversations. This can be addressed on two sides: encoding and decoding. On the encoding side, it would be ideal to encode an utterance within its context; one challenge is that conversations are entangled, so sequential encoding methods like that of (Sordoni et al., 2015) would not be appropriate. On the decoding side, a promising direction would be to perform global inference more efficiently.

Figure 1 :
Figure 1: An excerpt of a conversation from the Ubuntu IRC corpus (best viewed in color). The same color reflects the same conversation. Mentions of names are highlighted.

Figure 2 :
Figure 2: (a) Overview of our Pointer Network joint learning framework for online conversation disentanglement. Each utterance U_i consists of three parts: time (t_i), speaker (s_i), and message text (m_i), which are encoded by the utterance encoder. (b) The pointer module implements an effective attention mechanism that captures inter-utterance interactions through several features and models link prediction as a multinomial distribution over the previous utterances. (c) The pairwise classification model aims to capture higher-order contextual information. The whole model is trained end-to-end and can decode/disentangle the conversation in an online fashion.
Figure 2(b) gives a schematic diagram of the module, which has five different components:

Table 1 :
Statistics of the train, dev and test datasets.

Table 2 :
Experimental results on the Ubuntu test set. The "T" suffix means the model uses only utterance text; "−T" indicates the model excludes utterance text. "Joint Train" indicates the model is trained with the joint learning objective (Eq. 15); "Self Link" indicates the model is decoded with self-link threshold re-adjustment.

Table 3 :
Self-link statistics on Ubuntu Dataset

Table 4 :
"+G" indicates replacing self-link prediction with ground truth labels.

Table 5 :
Self-link prediction results for the baseline model