Context-Aware Conversation Thread Detection in Multi-Party Chat

In multi-party chat, it is common for multiple conversations to occur concurrently, leading to intermingled conversation threads in chat logs. In this work, we propose a novel Context-Aware Thread Detection (CATD) model that automatically disentangles these conversation threads. We evaluate our model on four real-world datasets and demonstrate an overall improvement in thread detection accuracy over state-of-the-art benchmarks.


Introduction
In multi-party chat conversations, such as in Slack, multiple topics are often discussed at the same time (Wang et al., 2019; Zhang et al., 2017). For example, in Table 1, Alice and Bob talk about work, while Bob and Chuck chat about lunch, and this results in intermingled messages. Automatic conversational thread detection could be used to disentangle and group the messages into their respective topic threads. The resulting thread information could in turn be used to improve response relevancy for conversational agents (Shamekhi et al., 2018) or improve chat summarization quality (Zhang and Cranshaw, 2018).
Unlike most of today's email and forum systems, which use a threaded structure by default, instant messaging systems (e.g., Slack) often require users to organize messages into threads manually. A recent study (Wang et al., 2019) found that users rarely create threads manually: on average, only 15.3 threads were created per Slack channel of 355 messages when groups discussed projects.
Prior work on conversation thread disentanglement is often based on pairwise message comparison. Some solutions use unsupervised clustering methods with hand-engineered features (Wang and Oard, 2009; Shen et al., 2006), while others use supervised approaches with statistical (Du et al., 2017) or linguistic features (Wang et al., 2008; Wang and Rosé, 2010; Elsner and Charniak, 2008, 2011; Mayfield et al., 2012). Recent work (Jiang et al., 2018; Mayfield et al., 2012) adopts deep learning approaches to compute message-pair similarity, using a combination of message content and simple contextual features (e.g., authorship and timestamps). However, linguistic theory (Biber and Conrad, 2019) differentiates three concepts to describe text varieties: register, genre and style. Register refers to linguistic features such as the choice of words in content; genre and style refer to the conversational structure, such as the sentence sequence and distribution. All the aforementioned thread disentanglement methods fail to take into account the contextual information of the thread, or the conversational flow and genre.
A thread's contextual information is a useful feature for thread detection because the relationship between a single new input message and one existing message alone may not be enough to accurately determine thread membership; using the full thread context history during comparison can therefore improve prediction. Additionally, the conversational flow and genre may also be useful, because Butler et al. (2002) suggest they represent a conversation's signature. For example, we observe that users act distinctively in public Q&A (StackOverflow) and enterprise Q&A (IBM Social Q&A) online communities (Wang et al., 2016), even when answering a similar question. Based on these hypotheses, we propose two context-aware thread detection (CATD) models. The first model (CATD-MATCH) captures the contexts of existing threads and computes the distance between the context and the input message; the second model (CATD-FLOW) captures the conversational flow and computes the language-genre consistency of attaching the input message to a thread. We further combine them with a dynamic gate for additional performance improvement, followed by an efficient beam search mechanism in the inference step. The evaluation shows that our approach improves over existing methods.
The contribution of this work is two-fold: 1) we propose context-aware deep learning models for thread detection that advance the state-of-the-art; 2) based on the dataset in (Jiang et al., 2018), we develop and release a more realistic multi-party, multi-thread conversation dataset for future research.

Methodology
We model thread detection as a topic detection and tracking task, deciding whether an incoming message starts a new thread or belongs to an existing thread (Allan, 2002). The goal is to assign each message m_i in the sequence M = {m_i}_{i=1}^{N} a thread label t_i, such that the complete thread label sequence T = {t_i}_{i=1}^{N} contains multiple threads (T_1, T_2, ...), where each thread T_l contains all messages with the same label. Note that the most recent message of an existing thread always precedes m_i, but may not be the immediately preceding message m_{i-1}.
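As a concrete illustration of this formulation, a thread label sequence can be converted into the threads it induces (a pure-Python sketch; the function name is ours, not the paper's):

```python
from collections import defaultdict

def group_threads(thread_labels):
    """Group message indices by thread label: T = {t_i} -> threads {T_l}."""
    threads = defaultdict(list)
    for i, t in enumerate(thread_labels):
        threads[t].append(i)  # each T_l keeps its messages in time order
    return dict(threads)
```

For example, the label sequence [0, 1, 0, 2, 1] yields three threads, with messages 0 and 2 in thread 0, messages 1 and 4 in thread 1, and message 3 alone in thread 2.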
The training and inference steps are as follows: we train an LSTM-based thread classification model to decide the membership of message m_i to an existing or a new thread, given the preceding messages' thread tags T_{i-1} = {t_j}_{j=1}^{i-1}, which form L threads {T_l^{i-1}}_{l=1}^{L} (Subsection 2.1). During inference, we use this model to sequentially perform message thread labeling (Subsection 2.2).

Context-Aware Thread Detection
We first adopt the Universal Sentence Encoder with a deep averaging network (USE) (Cer et al., 2018) to get a static feature representation of each message in the form of sentence embeddings. We encode each message m_j as enc(m_j) by concatenating the USE output with two 20-dimensional embeddings: (1) the user-identity difference between m_j and m_i, and (2) the time difference, obtained by mapping the time gap between m_j and m_i into 11 ranges (from 1 minute to 72 hours; details in Appendix A). These two features are also used in (Jiang et al., 2018), and the baseline model GTM uses only these features (Elsner and Charniak, 2008).
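The two contextual features can be sketched as discrete indices prior to embedding lookup. The bucket boundaries below are illustrative assumptions (log-spaced cut-offs from 1 minute to 72 hours); the paper's exact 11 ranges are given in its Appendix A:

```python
import bisect

# Hypothetical bucket boundaries in seconds; 10 cut-offs give 11 buckets.
BOUNDARIES_SEC = [60, 300, 900, 1800, 3600, 7200, 14400, 28800, 86400, 259200]

def time_bucket(delta_seconds: float) -> int:
    """Map the time gap between two messages to one of 11 bucket indices."""
    return bisect.bisect_right(BOUNDARIES_SEC, delta_seconds)

def user_diff(author_a: str, author_b: str) -> int:
    """Binary user-identity difference: 0 = same author, 1 = different."""
    return 0 if author_a == author_b else 1
```

Each index would then look up a trainable 20-dimensional embedding that is concatenated to the USE sentence embedding.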
Given a message sequence M_{i-1}, in which L threads {T_l^{i-1}}_{l=1}^{L} have been detected, the label T_{L+1}^{i-1} indicates that m_i starts a new thread. As shown in Fig. 1, we adopt a message-level uni-directional LSTM to encode each thread T_l^{i-1}, whose inputs are the enc(·) of at most the K last messages in the thread (K set to 20), denoted {m_{(l,k)}}_{k=1}^{K}. Messages outside that window are viewed as irrelevant to the prediction of m_i. Fig. 1 shows the two proposed CATD models, CATD-FLOW and CATD-MATCH, each capturing the semantic relationship between the new message and the existing thread contexts.

CATD-FLOW (left in Fig. 1): This model considers each thread as a conversation flow on a particular topic with its own genre, so the current message should belong to the thread with which it is most likely to form a fluent conversation. We therefore concatenate enc(m_i) to the LSTM sequence of each T_l^{i-1}, take the last output e_l^flow, and compute its dot-product with a trainable vector w to obtain the probability of m_i being labeled with T_l^{i-1}:

p(t_i = l | m_i, T_{i-1}) = softmax_l(γ · w · e_l^flow),   (1)

where γ is a scaling hyper-parameter. We differentiate the new-thread candidate T_{L+1}^{i-1} from the existing threads by using a trainable parameter u as the only input to its LSTM.

CATD-MATCH (right in Fig. 1): An alternative view of determining the thread tag of m_i is to find the thread T_l^{i-1} semantically closest to m_i. We thus encode each thread and m_i independently with a parameter-shared LSTM (only one LSTM step for m_i). To dynamically point to the more relevant messages in the thread history, we use one-way attention, which has been successfully adopted in many NLP tasks (Tan et al., 2016; Hermann et al., 2015). Specifically, given the sequence outputs of each thread's LSTM, {h_{(l,k)}}_{k=1}^{K}, we perform weighted mean pooling to get a context embedding e_l^cxt, attended by the one-step LSTM-encoded m_i, denoted ĥ_i.

Next, we compute a matching vector e_l^match = N(e_l^cxt) ⊗ N(ĥ_i), whose dot-product with w is again used for classification; here N(x) normalizes x by its l2 norm and ⊗ is element-wise multiplication. Finally, we linearly combine the two models with a dynamic gate g_l, computed from the difference between N(e_l^cxt) and N(ĥ_i) through a parameter vector w′. Intuitively, if the context and the input message are close, FLOW and MATCH are weighted equally for prediction; otherwise, the model dynamically balances the weights of MATCH and FLOW.
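A minimal NumPy sketch of the gated combination. The sigmoid form of the gate is our assumption; the paper states only that g is computed from the distance between N(e_l^cxt) and N(ĥ_i):

```python
import numpy as np

def l2n(x):
    """Normalize a vector by its l2 norm (small epsilon avoids division by zero)."""
    return x / (np.linalg.norm(x) + 1e-8)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def combined_thread_probs(e_flow, e_cxt, h_i, w, w_prime, gamma=10.0):
    """Gate per-thread FLOW and MATCH scores, then softmax over candidates.

    e_flow: (L, d) last LSTM outputs of CATD-FLOW per candidate thread
    e_cxt:  (L, d) attention-pooled context embeddings of CATD-MATCH
    h_i:    (d,)   one-step LSTM encoding of the incoming message
    w, w_prime: (d,) trainable parameter vectors; gamma is the scaling
    hyper-parameter of Eq. 1. The sigmoid gate is an assumed concrete form.
    """
    hn = l2n(h_i)
    scores = []
    for ef, ec in zip(e_flow, e_cxt):
        e_match = l2n(ec) * hn                       # element-wise match vector
        s_match = float(np.dot(w, e_match))          # MATCH score
        s_flow = float(np.dot(w, ef))                # FLOW score
        g = sigmoid(float(np.dot(w_prime, l2n(ec) - hn)))  # dynamic gate
        scores.append(g * s_match + (1 - g) * s_flow)
    return softmax(gamma * np.asarray(scores))
```

When the context and the input message coincide, the gate evaluates to 0.5 and the two scores are weighted equally, matching the intuition described above.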
Training Procedure: Following (Jiang et al., 2018), apart from the new-thread option, we consider as candidate threads (active threads) in Eq. 1 only those with messages in the one-hour time frame before m_i. During training, we treat the messages of a channel as a single sequence and optimize Eq. 1 with training examples containing m_i and its active threads. Though messages are sorted by time, the training examples are shuffled during training.

Thread Inference
During inference, we want to find the optimal thread labeling by maximizing the accumulated log-probability over the sequence:

T* = argmax_T Σ_{i=1}^{N} log p(t_i | m_i, T_{i-1}),

where each t_i is selected from the active threads and the new thread. However, searching the entire space of T is infeasible, so we resort to beam search, a generalized version of greedy search. It predicts sequentially from m_1 to m_N while keeping B states in the beam. For each m_i, each state in the beam is a candidate T_{i-1}, and each new state T_i is ranked after labeling t_i for m_i:

score(T_i) = score(T_{i-1}) + log p(t_i | m_i, T_{i-1}),

where t_i is selected from the active threads in the previous state plus a new thread tag. New states with scores lower than the top-B candidates are discarded. As in training, the active threads are pruned by the "one-hour" constraint; however, they are extracted not from the ground truth but from the previously detected threads.
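The inference procedure can be sketched in pure Python; `log_prob` stands in for the trained model, and the one-hour active-thread pruning is omitted for brevity:

```python
def beam_search_threads(messages, log_prob, beam_size=5):
    """Assign each message a thread label via beam search.

    log_prob(msg, labels_so_far, candidate) returns log p(candidate | ...),
    where the candidates are the thread ids already used in the state plus
    one fresh id for "start a new thread".
    """
    beam = [([], 0.0)]  # (thread labels so far, accumulated log-score)
    for msg in messages:
        expanded = []
        for labels, score in beam:
            n_threads = max(labels, default=-1) + 1
            for cand in range(n_threads + 1):  # existing threads + a new one
                s = score + log_prob(msg, labels, cand)
                expanded.append((labels + [cand], s))
        # keep only the top-B states; lower-scoring states are discarded
        expanded.sort(key=lambda x: x[1], reverse=True)
        beam = expanded[:beam_size]
    return beam[0][0]
```

With beam_size=1 this degenerates to the greedy "online" detection discussed in the analysis, since each message is labeled without knowledge of future messages.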

Experiments
Datasets: We conduct extensive experiments on three publicly available Reddit datasets. We strictly follow (Jiang et al., 2018) to construct our data: the comments under a post are treated as messages of a single conversational thread, and we merge all comments in a sub-reddit to construct a synthetic dataset of interleaved conversations. We use three sub-reddits to build three datasets: Gadgets, IPhones and Politics. The data statistics and examples are shown in Appendix B.

Table 2: CATD models compared with baselines w.r.t. NMI, ARI and F1 on the three datasets.
Reddit Dataset Improvement: We use the same pre-processing as (Jiang et al., 2018): we discard messages that have fewer than 10 words or more than 100 words, as well as conversations with fewer than 10 messages, and we guarantee that no more than 10 conversations happen at the same time. In their work, different message pairs of the same thread might be included in both the train and test sets. Instead, we split the datasets at the thread level because, in realistic settings, test threads should be completely unseen in the train set.

Experimental Setup: We use Adam (Kingma and Ba, 2015) to optimize the training objective (Eq. 1). During training, we fix the scaling hyper-parameter in Eq. 1 to 10. In inference, this value may influence the search quality; we set it to 20.0 based on the validation accuracy on Politics. We set the LSTM output dimension to 400, the batch size to 10 and the beam size to 5 by default. We train for 50 epochs and select the model with the best validation-set performance.
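The thread-level split can be sketched as follows (an illustrative helper, not the authors' released code):

```python
import random

def split_by_thread(messages, thread_ids, test_frac=0.2, seed=0):
    """Split a corpus at the thread level so no test thread appears in training.

    messages: list of message strings; thread_ids: parallel list of labels.
    Whole threads, not individual messages, are assigned to train or test.
    """
    threads = sorted(set(thread_ids))
    rng = random.Random(seed)
    rng.shuffle(threads)
    n_test = max(1, int(len(threads) * test_frac))
    test_threads = set(threads[:n_test])
    train = [(m, t) for m, t in zip(messages, thread_ids) if t not in test_threads]
    test = [(m, t) for m, t in zip(messages, thread_ids) if t in test_threads]
    return train, test
```

Splitting on message pairs instead, as in the earlier setup, would leak messages of the same thread across the train/test boundary.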
Baselines: (1) CISIR-SHCNN (Jiang et al., 2018): a recently proposed model based on a CNN and ranking of message pairs. (2) CISIR-USE: we replace the CNN encoder in CISIR with USE to test the effect of different sentence encoders. (3) GTM (Elsner and Charniak, 2008): a graph-theoretic model with chat- and content-specific features.

Evaluation Metrics: Normalized mutual information (NMI), adjusted Rand index (ARI) and F1 score, following (Jiang et al., 2018). F1 is computed over all message pairs in a test set. Also following their work, for both our models and the baselines, the candidate threads of each message are those with messages in the previous hour; for example, the CISIR-SHCNN models take pairs only within the one-hour frame.
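The pairwise F1 described above can be sketched directly; NMI and ARI are available in scikit-learn as `normalized_mutual_info_score` and `adjusted_rand_score`:

```python
from itertools import combinations

def pairwise_f1(pred, gold):
    """F1 over all message pairs: a pair is positive when both messages
    share a thread. pred and gold are parallel lists of thread labels."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(gold)), 2):
        p = pred[i] == pred[j]
        g = gold[i] == gold[j]
        tp += p and g          # pair correctly placed in the same thread
        fp += p and not g      # pair wrongly merged
        fn += g and not p      # pair wrongly split
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```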
Main Results: Table 2 compares the CATD models and baselines on NMI, ARI and F1. The CISIR models are generally better than the non-deep-learning GTM. There is a clear gap between CISIR-USE and our proposed models, which shows that our models' improvement comes not from the use of USE but from the new model structures. The CATD models are significantly superior to all baselines, and CATD-COMBINE generally performs best. Notably, all baselines fail on Politics, probably because Politics contains more threads than the other two datasets (see Appendix B), making disentanglement more difficult. The CATD models still achieve better results because they encode active threads in parallel while considering a longer history in each thread.
Analysis: In Table 3, we analyze our models on Politics, the largest dataset. First, we examine the effect of K. For all CATD models, as K grows from 5 to 20 (D-F, G-I, K, L and N), all metrics improve, showing the importance of a longer history in the LSTM. Second, we adopt bidirectional LSTMs (J) for CATD-MATCH without an obvious improvement, probably because most messages in the datasets can be fully comprehended with only the previous history. This assumption is consistent with the mild improvement when we increase the beam size from 1 (M) to 5 (N). We see a lower ARI with a beam size of 10 (O), because of incorrect candidates at lower ranking positions. Finally, the models remain strong with beam=1, enabling "online" detection without knowing future messages, which cannot be directly achieved by most pairwise prior work.

In Table 3, we also share the LSTM parameters between MATCH and FLOW (A), with a 4% drop on ARI; two independent LSTMs are needed to capture the different linguistic features. Next, we combine FLOW and MATCH (B) by concatenating e_l^flow and e_l^match, resulting in a 1.5% drop on ARI, which demonstrates the benefit of the gate in CATD-COMBINE. Also, we break the links between LSTM nodes and perform a one-step LSTM on all history messages (C), leading to an over 4% drop on ARI. This reflects the necessity of an RNN for encoding inter-message information.
In Tables 4 and 5, we show analyses for the Gadgets and IPhones datasets similar to the Politics analysis in Table 3. Compared to Politics, we observe that for Gadgets and IPhones the CATD-FLOW models fluctuate in performance as K increases from 5 to 20, which may be due to the limited capability of LSTMs to memorize long-term history. This issue is more prevalent when the training data is small.

Conclusion
We propose context-aware thread detection models for multi-party chat conversations that take threads' contextual information into account, integrated with an efficient beam search mechanism for inference. Our proposed method advances the state of the art.