Microblog Conversation Recommendation via Joint Modeling of Topics and Discourse

Millions of conversations are generated every day on social media platforms. With limited attention, it is challenging for users to select which discussions they would like to participate in. Here we propose a new method for microblog conversation recommendation. While much prior work has focused on post-level recommendation, we exploit both the conversational context, and user content and behavior preferences. We propose a statistical model that jointly captures: (1) topics for representing user interests and conversation content, and (2) discourse modes for describing user replying behavior and conversation dynamics. Experimental results on two Twitter datasets demonstrate that our system outperforms methods that only model content without considering discourse.


Introduction
Online platforms have revolutionized the way individuals collect and share information (O'Connor et al., 2010;Lee and Ma, 2012;Bakshy et al., 2015), but the vast bulk of online content is irrelevant or unpalatable to any given individual. A user interested in political discussion, for instance, might prefer content concerning a specific candidate or issue, and only then if discussed in a positive light without controversy (Adamic and Glance, 2005;Bakshy et al., 2015).
How do individuals facing such large quantities of superfluous material select which conversations to engage in, and how might we better algorithmically recommend conversations suited to individual users? We approach this problem from a microblog conversation recommendation framework. Where prior work has focused on the content of individual posts for recommendation (Chen et al., 2012; Yan et al., 2012; Vosecky et al., 2014; He and Tan, 2015), we examine the entire history and context of a conversation, including both topical content and discourse modes such as agreement, question-asking, argument, and other dialogue acts (Ritter et al., 2010). And where Backstrom et al. (2013) leveraged conversation reply structure (such as previous user engagement), their model is unable to predict first entry into new conversations, while ours is able to predict both new and repeated entry into conversations based on a combination of topical and discourse features.

[Figure 1: Snippets of two sample Twitter conversations. [Ui] marks a message posted by user Ui; "-" divides the training history from the test portion. U1 did not reengage in Conversation 1 but reengaged in Conversation 2.]
To illustrate the interplay between topics and discourse, Figure 1 displays two snippets of conversations on Twitter collected during the 2016 United States presidential election. User U1 participates in both conversations. The first conversation centers on Clinton, and U1, who is more typically involved in conversations about candidate Sanders, does not return. In the second conversation, however, U1 is involved in a heated back-and-forth debate, and is thus drawn back to a conversation that they might otherwise have abandoned but for their enjoyment of adversarial discourse.
Effective conversation prediction and recommendation requires an understanding of both user interests and discourse behaviors, such as agreement, disagreement, inquiry, backchanneling, and emotional reactions. However, acquiring manual labels for both is a time-consuming process that is hard to scale to new datasets. We instead propose a unified statistical learning framework for conversation recommendation, which jointly learns (1) hidden factors that reflect user interests based on conversation history, and (2) topics and discourse modes in ongoing conversations, as discovered by a novel probabilistic latent variable model. Our model builds on the success of collaborative filtering (CF) in recommendation systems, where latent dimensions of product ratings or movie reviews are extracted to better capture user preferences (Linden et al., 2003; Salakhutdinov and Mnih, 2008; Wang and Blei, 2011; McAuley and Leskovec, 2013). To the best of our knowledge, we are the first to model both topics and discourse modes as part of a CF framework and apply it to microblog conversation recommendation. Experimental results on two Twitter conversation datasets show that our proposed model yields significantly better performance than state-of-the-art post-level recommendation systems. For example, by leveraging both topical content and discourse structure, our model achieves a mean average precision (MAP) of 0.76 on conversations about the U.S. presidential election, compared with 0.70 by McAuley and Leskovec (2013), which only considers topics. We further conducted detailed analysis of the latent topics and discourse modes and found that our model discovers reasonable topic and discourse representations, which play an important role in characterizing reply behaviors. Finally, we provide a pilot study on recommendation for first-time replies, which shows that our model outperforms comparable recommendation systems.
The rest of this paper is structured as follows. The related work is discussed in Section 2. We then present our microblog conversation recommendation model in Section 3. The experimental setup and results are described in Sections 4 and 5. Finally, we conclude in Section 6.

Related Work
Social media has attracted increasing attention in digital communication research (Agichtein et al., 2008;Kwak et al., 2010;Wu et al., 2011). The problem studied here is closely related to work on recommendation and response prediction in microblogs (Artzi et al., 2012;Hong et al., 2013), where the goal is to predict whether a user will share or reply to a given post. Existing methods focus on measuring features that reflect personalized user interests, including topics (Hong et al., 2013) and network structures (Pan et al., 2013;He and Tan, 2015). These features have been investigated under a learning to rank framework (Duan et al., 2010;Artzi et al., 2012), graph ranking models (Yan et al., 2012;Feng and Wang, 2013;Alawad et al., 2016), and neural network-based representation learning methods (Yu et al., 2016).
Distinguishing from prior work that focuses on post-level recommendation, we tackle the challenges of predicting user reply behaviors at the conversation-level. In addition, our model not only captures latent factors such as the topical interests of users, but also leverages the automatically learned discourse structure. Much of the previous work on discourse structure and dialogue acts has relied on labeled data (Jurafsky et al., 1997;Stolcke et al., 2000), while unsupervised approaches have not been applied to the problem of conversation recommendation (Woszczyna and Waibel, 1994;Crook et al., 2009;Ritter et al., 2010;Joty et al., 2011).
Our work is also in line with conversation modeling for social media discussions (Ritter et al., 2010;Budak and Agrawal, 2013;Louis and Cohen, 2015;Cheng et al., 2017). Topic modeling has been employed to identify conversation content on Twitter (Ritter et al., 2010). In this work, we propose a probabilistic model to capture both topics and discourse modes as latent variables. A further line of work studies the reposting and reply structure of conversations (Gómez et al., 2011;Laniado et al., 2011;Backstrom et al., 2013;Budak and Agrawal, 2013). But none of this work distinguishes the rich discourse functions of replies, which is modeled and exploited in our work.

The Joint Model of Topic and Discourse for Recommendation
Our proposed microblog conversation recommendation framework is based on collaborative filtering and a novel probabilistic graphical model. Concretely, our objective function takes the form:

min L + µ · NLL(C | Θ)    (1)

This function encodes two types of information. First, L models user reply preferences in a fashion similar to collaborative filtering (CF) (Hu et al., 2008; Pan et al., 2008). It captures the topics of interest and the discourse structures users are commonly involved in (e.g., argumentation), and takes the form of a mean square error (MSE) over user reply history; this part is detailed in Section 3.1. The second term, NLL(C | Θ), denotes the negative log-likelihood of a set of conversations C, with Θ containing all parameters. The probabilistic model described in Section 3.2 shows how the topical content and discourse structures of conversations are captured by these latent variables.
The hyperparameter µ controls the trade-off between the two effects. ℓ2 regularization is also added to the parameters to avoid overfitting.
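For illustration, the combined cost can be sketched as follows. This is a minimal Python sketch, not the paper's implementation: the `mu` default follows the tuned value reported later, while `reg` is an illustrative regularization weight.

```python
import numpy as np

def objective(pref_loss, corpus_nll, params, mu=0.1, reg=0.01):
    """Combined cost of Eq. 1: reply-preference loss L plus the mu-weighted
    corpus negative log-likelihood plus an l2 regularization term over all
    parameter arrays (reg is an illustrative weight)."""
    l2 = sum(np.sum(p ** 2) for p in params)
    return pref_loss + mu * corpus_nll + reg * l2
```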
For the rest of this section, we first present the construction of L and N LL(C | Θ) in Sections 3.1 and 3.2. We then discuss how these two components can be mutually informed by each other in Section 3.3. Finally, the generative process and parameter learning are described in Section 3.4.

Reply Preference (L)
Our user reply preference modeling builds on the success of collaborative filtering (CF) for product ratings. However, classic CF problems, such as product recommendation, generally rely on explicit user feedback. Unlike user ratings on products, our input lacks explicit feedback from users about negative preferences and nonresponse. We therefore follow one-class collaborative filtering (Hu et al., 2008; Pan et al., 2008), which weights positive instances higher during training and is thus suited to our data. Formally, for user u and conversation c, we measure reply preference by the MSE between a predicted preference score p_{u,c} and the reply history r_{u,c}, where r_{u,c} equals 1 if u appears in the conversation history and 0 otherwise. The first term of the objective (Eq. 1) takes the following form:

L = Σ_{u∈U} Σ_{c∈C} f_{u,c} · (p_{u,c} − r_{u,c})²    (2)

where U is the set of users {u} and C is the set of conversations {c} in a dataset. f_{u,c} is the confidence weight for conversation c and target user u. Intuitively, it takes a large value when positive feedback (a user reply) is observed. We therefore adapt the formulation from Pan et al. (2008):

f_{u,c} = s if r_{u,c} = 1, and 1 otherwise

where s > 1 is an integer hyperparameter to be tuned. Inspired by prior latent factor models (Koren et al., 2009; McAuley and Leskovec, 2013), we describe p_{u,c} as:

p_{u,c} = a + b_u + b_c + λ · γ^U_u · γ^C_c + (1 − λ) · δ^U_u · δ^C_c

γ^U_u and γ^C_c are K-dimensional latent vectors that encode topic-specific information (where K is the number of latent topics) for users and conversations. Specifically, γ^U_u reflects the topical interests of u, with a higher value γ^U_{u,k} indicating greater interest by u in topic k, while γ^C_c captures the extent to which each topic is discussed in conversation c.
Similarly, D-dimensional vectors δ^U_u and δ^C_c capture the role of discourse structure in shaping reply behaviors (where D is the number of discourse clusters). δ^U_u reflects the discourse behaviors u prefers (e.g., U1 often enjoys arguments, as in the second conversation of Figure 1), while δ^C_c captures the discourse modes used throughout conversation c. By multiplying user and conversation factors, we measure the corresponding similarity. The predicted score p_{u,c} thereby reflects the tendency of user u to become involved in conversation c.
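The prediction and the one-class weighted loss above can be sketched as follows. This is an illustrative sketch under the stated formulation, with hypothetical argument names; the weighting scheme assumes f_{u,c} = s for observed replies and 1 otherwise.

```python
import numpy as np

def predict_score(a, b_u, b_c, gamma_u, gamma_c, delta_u, delta_c, lam=0.5):
    """p_{u,c}: global offset plus user/conversation biases plus a
    lambda-weighted mix of topic similarity (gamma dot product) and
    discourse similarity (delta dot product)."""
    return (a + b_u + b_c
            + lam * np.dot(gamma_u, gamma_c)
            + (1.0 - lam) * np.dot(delta_u, delta_c))

def reply_preference_loss(P, R, s=200):
    """One-class weighted MSE (Eq. 2): pairs with an observed reply
    (r_{u,c} = 1) receive confidence weight s, all others weight 1."""
    F = np.where(R == 1, float(s), 1.0)
    return float(np.sum(F * (P - R) ** 2))
```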
As pointed out by McAuley and Leskovec (2013), these latent vectors often encode hidden factors that are hard to interpret under a CF framework. Therefore, in Section 3.2 we present a novel probabilistic model which extracts interpretable topics and discourse modes as word distributions. We then describe how they are aligned with the latent vectors γ^C and δ^U.
Parameter a is an offset, b_u and b_c are user and conversation biases, and λ ∈ [0, 1] weights the trade-off between topic and discourse factors in reply preference modeling.

Corpus Likelihood NLL(C | Θ)
Here we present a novel probabilistic model that learns coherent word distributions for latent topics and discourse modes of conversations. Formally, we assume that each conversation c ∈ C contains M_c messages, and each message m has N_{c,m} words. We distinguish three latent components underlying conversations -- discourse, topic, and background -- each with its own type of word distribution. At the corpus level, there are K topics represented by word distributions φ^T_k, and D discourse modes represented by φ^D_d. In addition, we add a background word distribution φ^B to capture general information (e.g., common words) that indicates neither discourse nor topic. φ^D_d, φ^T_k, and φ^B are all multinomial word distributions over a vocabulary of size V. More details are given below.
Message-level Modeling. Our model assigns two message-level multinomial variables to each message: z_{c,m} reflects its latent topic and d_{c,m} represents its discourse mode.
Topic assignments. Given the short nature of microblog posts, we assume each message m in conversation c contains only one topic, indexed as z_{c,m}. This strategy has proven useful for alleviating data sparsity in topic inference (Quan et al., 2015). We further assume messages in the same conversation focus on similar topics, and thus draw the topic z_{c,m} ∼ Multi(θ_c), where θ_c denotes the proportions of topics discussed in conversation c.
Discourse assignments. To capture the discourse behaviors of user u, a distribution π_u represents the discourse modes in messages posted by u. The discourse mode d_{c,m} for message m is then generated from π_{u_{c,m}}, where u_{c,m} is the author of message m in conversation c.
Word-level Modeling. We aim to separate discourse, topic, and background information in conversations. Therefore, for each word w_{c,m,n} of message m, a ternary switcher x_{c,m,n} ∈ {DISC, TOPIC, BACK} assigns the word to one of three types: discourse, topic, or background.
Discourse words (DISC) are indicative of the discourse mode of a message. When x_{c,m,n} = DISC (i.e., w_{c,m,n} is assigned as a discourse word), the word is generated from the discourse word distribution φ^D_{d_{c,m}}, where d_{c,m} is the discourse assignment of message m.

Topic words (TOPIC) describe the topical focus of a conversation. When x_{c,m,n} = TOPIC, w_{c,m,n} is assigned as a topic word and generated from φ^T_{z_{c,m}}, the word distribution for the topic of m.

Background words (BACK) capture general information unrelated to discourse or topic. When w_{c,m,n} is assigned as a background word (x_{c,m,n} = BACK), it is drawn from the background distribution φ^B.
Switching among Topic, Discourse, and Background. We further assume the word type switcher x_{c,m,n} is sampled from a multinomial distribution that depends on the current discourse mode d_{c,m}. The intuition is that messages of different discourse modes may show different distributions over the three word types; for instance, a statement message may contain more content words than a rhetorical question. Specifically, x_{c,m,n} ∼ Multi(τ_{d_{c,m}}), where τ_d is a 3-dimensional stochastic vector expressing the probabilities of the three word types (DISC, TOPIC, BACK) when the discourse assignment is d. Stop words and punctuation are forced to be labeled as discourse or background words. By explicitly distinguishing word types with the switcher x_{c,m,n}, we can separate the word distributions that reflect discourse, topic, and background information.
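The message- and word-level generative steps above can be sketched as forward sampling. This is an illustrative sketch, not the paper's code; distribution shapes and the 0/1/2 encoding of DISC/TOPIC/BACK are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
DISC, TOPIC, BACK = 0, 1, 2  # assumed encoding of the ternary switcher

def generate_message(theta_c, pi_u, tau, phi_T, phi_D, phi_B, n_words):
    """Sample one message: a topic z from the conversation mixture theta_c,
    a discourse mode d from the author's distribution pi_u, then per word a
    type switcher x ~ Multi(tau[d]) choosing which distribution emits it."""
    z = rng.choice(len(theta_c), p=theta_c)   # message-level topic
    d = rng.choice(len(pi_u), p=pi_u)         # message-level discourse mode
    words = []
    for _ in range(n_words):
        x = rng.choice(3, p=tau[d])           # word-type switcher
        emit = (phi_D[d], phi_T[z], phi_B)[x]
        words.append(int(rng.choice(len(emit), p=emit)))
    return z, d, words
```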
Likelihood. Based on the message-level and word-level generative process, the probability of observing the words in the given corpus is:

Pr(C | Θ) = Π_c Π_m θ_{c,z_{c,m}} · π_{u_{c,m},d_{c,m}} · Π_n τ_{d_{c,m},x_{c,m,n}} · Pr(w_{c,m,n} | x_{c,m,n}, z_{c,m}, d_{c,m})

where the emission probability Pr(w | x, z, d) is given by φ^D_d, φ^T_z, or φ^B according to whether x is DISC, TOPIC, or BACK. We model the corpus likelihood effect in Eq. 1 with the negative log-likelihood NLL(C | Θ) = −log Pr(C | Θ), where the parameter set is Θ = {θ, π, φ, τ, z, d, x}.
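Given fixed assignments, the negative log-likelihood factorizes into a topic-prior term, a discourse-prior term, and per-word switcher and emission terms. A minimal sketch, with an assumed data layout (the dict keys and the (theta_c, messages) pairing are illustrative):

```python
import numpy as np

def corpus_nll(convs, pi, tau, phi_T, phi_D, phi_B):
    """Negative log-likelihood of the corpus given current assignments.
    convs: list of (theta_c, messages); each message is a dict with author
    index "u", topic "z", discourse mode "d", and "words" as (word_id, x)
    pairs with x in {0: DISC, 1: TOPIC, 2: BACK}."""
    nll = 0.0
    for theta_c, messages in convs:
        for m in messages:
            nll -= np.log(theta_c[m["z"]]) + np.log(pi[m["u"]][m["d"]])
            for w, x in m["words"]:
                emit = (phi_D[m["d"]], phi_T[m["z"]], phi_B)[x]
                nll -= np.log(tau[m["d"]][x]) + np.log(emit[w])
    return float(nll)
```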

Mutually Informed User Preferences and Latent Variables

As mentioned above, the hidden factors discovered in Section 3.1 lack interpretability, which can be boosted by the latent topics and discourse modes learned in Section 3.2. However, it is nontrivial to link the topic-related parameters γ^C_c to the conversation topic distributions θ_c, since the former take real values in (−∞, +∞) while the latter is a stochastic vector. We therefore follow the strategy of McAuley and Leskovec (2013) and apply a softmax function over γ^C_c:

θ_{c,k} = exp(κ^T · γ^C_{c,k}) / Σ_{k'} exp(κ^T · γ^C_{c,k'})    (6)

We further assume that users' discourse mode preferences δ^U_u can likewise be informed by the discourse mode distributions π_u (i.e., a user who enjoys one argument may be willing to participate in another). We similarly define:

π_{u,d} = exp(κ^D · δ^U_{u,d}) / Σ_{d'} exp(κ^D · δ^U_{u,d'})

where κ^T and κ^D are learnable parameters that control the "peakiness" of the transformation. For example, a larger κ^T indicates a more focused conversation, while a smaller κ^T means users discuss diverse topics. Finally, softmax transformations are also applied to φ^T_k, φ^D_d, φ^B, and τ_d, as in McAuley and Leskovec (2013), with additional parameters ψ^T_k, ψ^D_d, ψ^B, and χ_d (as shown in Figure 2). This ensures that the distributions φ and τ_d are stochastic vectors. In doing so, these distributions can be learned by optimizing ψ and χ_d, which may take any real value, so the cost function in Eq. 1 can be optimized without parameter constraints.
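The softmax transformation with a "peakiness" parameter can be sketched as follows (a minimal, numerically stabilized sketch; the function name is illustrative):

```python
import numpy as np

def to_simplex(v, kappa=1.0):
    """Map an unconstrained latent vector (e.g. gamma_c or delta_u) onto a
    stochastic vector (e.g. theta_c or pi_u) via a softmax whose learnable
    'peakiness' kappa concentrates or flattens the distribution."""
    v = np.asarray(v, dtype=float)
    e = np.exp(kappa * (v - np.max(v)))  # shift by max for numerical stability
    return e / e.sum()
```

A larger `kappa` pushes the resulting distribution toward its largest component, matching the "more focused conversation" interpretation above.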

Generative Process and Model Learning
Our word generation process is displayed in Figure 2 and summarized as follows:
• Compute the topic distribution θ_c by Eq. 6.
• For each message m in conversation c, draw its topic z_{c,m} ∼ Multi(θ_c) and its discourse mode d_{c,m} ∼ Multi(π_{u_{c,m}}).
• For each word w_{c,m,n}, draw the word type x_{c,m,n} ∼ Multi(τ_{d_{c,m}}), then draw the word from φ^D_{d_{c,m}}, φ^T_{z_{c,m}}, or φ^B according to x_{c,m,n}.

Parameter Learning. For learning, we randomly initialize all learnable parameters and then alternate between the following two steps:

Step 1. Fix the topic and discourse assignments z and d and the word type switchers x, then optimize the remaining parameters in Eq. 1 by L-BFGS (Nocedal, 1980).

Step 2. Sample the topic and discourse assignments z and d at the message level and the word type switchers x at the word level from their distributions, computed according to the parameters optimized in Step 1.

Step 2 is analogous to Gibbs sampling (Griffiths, 2002) in probabilistic graphical models such as LDA (Blei et al., 2003). However, unlike previous models, the multinomial distributions in our model are not drawn from a Dirichlet prior; instead, they are computed from the parameters learned in Step 1.
Our learning process stops when the change in parameters is small (i.e., below a pre-specified threshold).
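Step 2's resampling of a message topic can be sketched as drawing z in proportion to the conversation's topic mixture times the likelihood of the message's topic words. This is an illustrative sketch in log space (the exact conditional in the paper may include further terms):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_topic(theta_c, phi_T, topic_words):
    """Resample a message's topic z with probability proportional to the
    conversation topic mixture theta_c times the likelihood of the message's
    topic words under each topic-word distribution phi_T[k]."""
    log_post = np.log(theta_c) + np.array(
        [np.sum(np.log(phi_T[k][topic_words])) for k in range(len(theta_c))])
    post = np.exp(log_post - log_post.max())  # normalize in log space
    post /= post.sum()
    return int(rng.choice(len(theta_c), p=post))
```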

Experimental Setup
Datasets. We collected two microblog conversation datasets from Twitter for our experiments: one contains discussions about the U.S. presidential election (henceforth US Election); the other gathers conversations on diverse topics based on the tweets released by the TREC 2011 microblog track (henceforth TREC). US Election was collected from January to June of 2016 using Twitter's Streaming API with a small set of political keywords. To recover conversations, the Tweet Search API was used to retrieve messages with "in-reply-to" relations, collecting tweets recursively until full conversations were recovered. Statistics of the datasets are shown in Table 1. Figure 3 displays the number of conversations individual users participated in. As can be seen, most users are involved in only a few conversations, so simply leveraging personal chat history will not produce good conversation recommendations.
In our experiments, we predict whether a user will engage in a conversation given the previous messages in that conversation and the past conversations the user has been involved in. For model training and testing, we divide conversations into three ordered segments, corresponding to training, development, and test sets at 75%, 12.5%, and 12.5%.

Preprocessing and Hyperparameter Tuning. For preprocessing, links, mentions (i.e., @username), and hashtags in tweets were replaced with the generic tags "URL", "MENTION", and "HASHTAG". We then used the Twitter NLP tool (Gimpel et al., 2011; Owoputi et al., 2013) for tokenization and non-alphabetic token removal. We removed stop words and punctuation for all comparisons to ensure comparable performance, and we maintain a vocabulary of the 5,000 most frequent words.
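The tag-replacement step can be sketched with simple regular expressions (the patterns here are a simplified sketch; real Twitter text needs more careful handling of entities):

```python
import re

def preprocess(tweet):
    """Replace links, @-mentions, and hashtags with the generic tags used in
    the paper before tokenization."""
    tweet = re.sub(r"https?://\S+", "URL", tweet)   # links
    tweet = re.sub(r"@\w+", "MENTION", tweet)       # @username mentions
    tweet = re.sub(r"#\w+", "HASHTAG", tweet)       # hashtags
    return tweet
```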
Our model parameters are tuned on the development set via grid search, i.e., we select the parameters that give the lowest value of our objective. Specifically, the number of discourse modes (D) and topics (K) are both tuned to 10. The trade-off parameter µ between user preference and corpus negative log-likelihood takes a value of 0.1, and λ, the parameter balancing topic and discourse, is set to 0.5. Finally, the confidence parameter s takes a value of 200 to give higher weight to positive instances (i.e., a user replied to a conversation).
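A grid search of this kind can be sketched generically. The grid values in the test are illustrative, not the paper's exact search space; `train_fn` and `dev_score` are hypothetical callables.

```python
import itertools

def grid_search(train_fn, dev_score, grid):
    """Exhaustive grid search: train one model per hyperparameter combination
    and keep the combination with the lowest development-set objective."""
    best_params, best_val = None, float("inf")
    for combo in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        model = train_fn(**params)
        val = dev_score(model)
        if val < best_val:
            best_params, best_val = params, val
    return best_params, best_val
```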
Evaluation Metrics. Following prior work on social media post recommendation (Chen et al., 2012; Yan et al., 2012), we treat conversation recommendation as a ranking problem. We therefore report standard information retrieval metrics: precision at K (P@K), mean average precision (MAP) (Manning et al., 2008), and normalized discounted cumulative gain at K (nDCG@K) (Järvelin and Kekäläinen, 2002). The metrics are computed per user and then averaged over all users; values range from 0.0 to 1.0, with higher values indicating better performance.
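These per-user metrics can be sketched as follows (standard binary-relevance formulations; a minimal sketch, not the evaluation code used in the paper):

```python
import numpy as np

def precision_at_k(ranked, relevant, k=1):
    """Fraction of the top-k recommended conversations the user joined."""
    return len(set(ranked[:k]) & set(relevant)) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank holding a relevant item;
    MAP averages this quantity over users."""
    hits, precisions = 0, []
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / i)
    return float(np.mean(precisions)) if precisions else 0.0

def ndcg_at_k(ranked, relevant, k=5):
    """Binary-relevance nDCG@k with log2 position discounting."""
    dcg = sum(1.0 / np.log2(i + 1)
              for i, item in enumerate(ranked[:k], start=1) if item in relevant)
    idcg = sum(1.0 / np.log2(i + 1)
               for i in range(1, min(k, len(relevant)) + 1))
    return dcg / idcg if idcg > 0 else 0.0
```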
We further compare results with three established recommendation models:
• OCCF: one-class collaborative filtering (Pan et al., 2008), which only considers users' reply history without modeling conversation content.
• RSVM: ranking SVM (Joachims, 2002), which ranks conversations for each user with the content and Twitter features as in Duan et al. (2010).
• CTR: messages in one conversation are aggregated into one post, and a state-of-the-art collaborative filtering-based post recommendation model is applied (Chen et al., 2012).
Finally, we also adapt the "hidden factors as topics" (HFT) model proposed in McAuley and Leskovec (2013) (henceforth ADAPTED HFT). Because the original model leverages the ratings for all product reviews and does not handle implicit user feedback well, we replace their user preference objective function with ours (Eq. 2).

Experimental Results
In this section, we first discuss our main evaluation (Section 5.1). A case study in Section 5.2 provides further insight, followed by an analysis of the topics and discourse modes discovered by our model (Section 5.3). We also examine performance on first-time replies (Section 5.4).

Conversation Recommendation Results
Experimental results are displayed in Table 2, where our model yields statistically significantly better results than the baselines and comparison systems (paired t-tests, p < 0.01). For P@K, we report only P@1, because a significant number of users participate in only 1 or 2 conversations. For nDCG@K, we experimented with different values of K, which showed similar trends, so only nDCG@5 is reported. We find that baselines that rank conversations with simple features (e.g., length or popularity) perform poorly. This implies that generic algorithms that consider neither conversation content nor user preference cannot produce reasonable recommendations.
Although some non-baseline systems capture content in one way or another, only ADAPTED HFT and our model exploit latent topic models to better represent content in tweets, and outperform other methods.
Compared to ADAPTED HFT, which only considers latent topics under a collaborative filtering framework, our model extracts both topics and discourse modes as latent variables, and shows superior performance on both datasets. Our discourse variables go beyond topical content to capture social behaviors that affect user engagement, such as arguments, question-asking, agreement, and other discourse modes.

Training with Varying Conversation History.
To test model performance under different levels of user engagement history, we further experiment with varying the length of conversations used for training. Specifically, in addition to using 75% of conversation history, we also extract the first 25% and 50% of history for training; the rest of each conversation is split equally into development and test. Figure 4 shows the MAP scores for the US Election and TREC datasets. The increasing MAP for all methods as the training history grows indicates that conversation history is generally essential for recommendation. Our model performs consistently better across different lengths of conversation history.
Results for Varying Degrees of Data Sparsity. From Table 1 and Figure 3, we observe that most users in our datasets are involved in only a few conversations. To study the effects of data sparsity on recommendation models, Figure 5 reports MAP scores for users engaged in varying numbers of conversations, as measured on the TREC dataset; results on the US Election dataset show similar distributions. As we see, prediction results become worse for users involved in fewer conversations, indicating that data sparsity is a challenge for all recommendation models. We also observe that our model performs consistently better than the other models across degrees of sparsity, implying that effectively capturing discourse structure in conversation context helps mitigate the effects of data sparsity.

Table 3: Predicted recommendation scores by different models of U1 for conversations c1 and c2 in Figure 1. U1 later replies to c2 but not c1; our model predicts a score of 0.961 for c2 (higher than 0.924 for c1).

Case Study and Discussion
Here we present a case study based on the sample conversations in Figure 1. Recall that user U1 is interested in conversations about Sanders, prefers more argumentative discourse, and thus returns in conversation c2 but not c1. Table 3 shows the predicted scores for the two conversations from OCCF, ADAPTED HFT, and our model (as in Eq. 2). Both ADAPTED HFT and our model more accurately recommend c2 over c1, with our model producing a slightly higher recommendation score for c2. Table 4 shows the latent dimension values of the learned topics and discourse modes for this user and these two conversations. Based on human inspection, topic 1 appears to contain words about Sanders, the main topic of conversation c2, while topic 2 is about Clinton, the dominant topic of conversation c1. Our model also picks up the user's interest in topic 1 (Sanders), and thus assigns γ^U_{u1,1} a high value. For discourse modes, our model likewise generates a high score for the "argument" mode (labeled via human inspection) for both the user and c2.

Further Analysis of Topic and Discourse
Ablation Study. We have shown that jointly modeling topical content and discourse modes produces superior performance for our model.
Here we provide an ablation study to examine the relative contributions of the two aspects by setting the trade-off parameter λ to 1.0 (topic only) or 0.0 (discourse only). Table 5 shows that topics and discourse individually improve slightly upon ADAPTED HFT, but only jointly do they improve upon it significantly.

Topic Coherence. To examine the quality of the topics found by our model, we use the C_V topic coherence score measured via the open-source toolkit Palmetto, which has been shown to produce evaluations comparable to human judgment (Röder et al., 2015). Our model achieves topic coherence scores of 0.343 and 0.376 on the TREC and US Election datasets, compared to 0.338 and 0.371 for the topics from ADAPTED HFT.

Sample Discourse Modes. While our topic word distributions are relatively unsurprising, the discourse mode word distributions are of greater interest. Table 6 shows a sample of discourse modes as labeled by humans. Although this is merely a qualitative judgment at this point, there appears to be notable overlap in discourse modes between the two datasets, even though they were learned separately.

First Time Reply Results
From a recommendation perspective, users may be interested in joining new conversations. We thus compare each recommendation system on first-time replies: for each user, we evaluate only conversations in which they are newcomers. Table 7 shows that, unsurprisingly, all systems perform poorly on this task, though our model performs slightly better. This suggests that other features, e.g., network structures or other discussion-thread features, could usefully be included in future studies targeting new conversations.

Conclusion
This paper has presented a framework for microblog conversation recommendation via jointly modeling topics and discourse modes. Experimental results show that our method can outperform competitive approaches that omit user discourse behaviors. Qualitative analysis shows that our joint model yields meaningful topics and discourse representations.