Topic Extraction from Microblog Posts Using Conversation Structures

Introduction
The increasing popularity of microblog platforms results in a huge volume of user-generated short posts. Automatically modeling topics out of such massive microblog posts can uncover the hidden semantic structures of the underlying collection and can be useful to downstream applications such as microblog summarization (Harabagiu and Hickl, 2011), user profiling (Weng et al., 2010), event tracking (Lin et al., 2010) and so on.
Popular topic models, like Probabilistic Latent Semantic Analysis (pLSA) (Hofmann, 1999) and Latent Dirichlet Allocation (LDA) (Blei et al., 2003b), model the semantic relationships between words based on their co-occurrences in documents. They have demonstrated their success in conventional documents such as news reports and scientific articles, but perform poorly when directly applied to short and colloquial microblog content due to severe sparsity in microblog messages (Wang and McCallum, 2006; Hong and Davison, 2010).
* Part of this work was conducted when the first author was visiting Aston University.
A common way to deal with short text sparsity is to aggregate short messages into long pseudo-documents. Most studies heuristically aggregate messages based on authorship (Zhao et al., 2011; Hong and Davison, 2010), shared words (Weng et al., 2010), or hashtags (Ramage et al., 2010; Mehrotra et al., 2013). Some works directly take word relations into account to alleviate document-level word sparseness (Yan et al., 2013; Sridhar, 2015). More recently, a self-aggregation-based topic model called SATM (Quan et al., 2015) was proposed to aggregate texts jointly with topic inference.
However, we argue that the existing aggregation strategies are suboptimal for modeling topics in short texts. Microblogs allow users to share and comment on messages with friends through reposting or replying, similar to our everyday conversations. Intuitively, the conversation structures can not only enrich context, but also provide useful clues for identifying relevant topics. This is nonetheless ignored in previous approaches. Moreover, non-topic words such as emotional, sentimental, functional and even meaningless words are very common in microblog posts, which may distract the models from recognizing topic-related key words and thus cause them to fail to produce coherent and meaningful topics.
We propose a novel topic model by utilizing the structures of conversations in microblogs. We link microblog posts using reposting and replying relations to build conversation trees. Particularly, the root of a conversation tree refers to the original post and its edges represent the reposting/replying relations.
Figure 1: An example of a conversation tree (excerpt). [O] is the original post and [R1] to [R3] are reposts or replies.
[O] Just an hour ago, a series of coordinated terrorist attacks occurred in Paris !!!
[R1] OMG! I can't believe it's real. Paris?! I've just been there last month.
[R2] Gunmen and suicide bombers hit a concert hall. More than 100 are killed already.
[R3] Oh no! @BonjourMarc r u OK? please reply me for god's sake!!!
Within a conversation tree, some messages initiate key aspects of previously discussed topics or signal a new topic. These messages are named leaders, which contain salient content in topic description, e.g., the italic and underlined words in Figure 1. The remaining messages, named followers, do not raise new issues but simply respond to their reposted or replied messages following what has been raised by the leaders, and often contain non-topic words, e.g., OMG, OK, agree, etc.
Conversation tree structures from microblogs have previously been shown to help microblog summarization (Li et al., 2015), but have never been explored for topic modeling. We follow Li et al. (2015) to detect leaders and followers across paths of conversation trees using Conditional Random Fields (CRF) trained on annotated data. The detected leader/follower information is then incorporated as prior knowledge into our proposed topic model.
Our experimental results show that our model, which captures parent-child topic correlations in conversation trees and generates topics by considering messages being leaders or followers separately, is able to induce high-quality topics and outperforms a number of competitive baselines. In summary, our contributions are three-fold:
• We propose a novel topic model, which explicitly exploits the topic dependencies contained in conversation structures to enhance topic assignments.
• Our model differentiates the generative process of topical and non-topic words according to whether the message from which a word is drawn is a leader or a follower. This helps the model distinguish topic-specific information from background noise.
• Our model outperforms state-of-the-art topic models when evaluated on a large real-world microblog dataset containing over 60K conversation trees, which is publicly available 1.

Related Works
Topic models aim to discover the latent semantic information, i.e., topics, from texts and have been extensively studied. One of the most popular and well-known topic models is LDA (Blei et al., 2003b). It utilizes Dirichlet priors to generate document-topic and topic-word distributions, and has been shown effective in extracting topics from conventional documents.
Nevertheless, prior research has demonstrated that standard topic models, essentially focusing on document-level word co-occurrences, are not suitable for short and informal microblog messages due to the severe data sparsity exhibited in short texts (Wang and McCallum, 2006; Hong and Davison, 2010). Therefore, how to enrich and exploit context information becomes a main concern. Weng et al. (2010), Hong and Davison (2010) and Zhao et al. (2011) first heuristically aggregated messages posted by the same user or sharing the same words before applying classic topic models to extract topics. However, such a simple strategy poses some problems. For example, it is common that a user has various interests and posts messages covering a wide range of topics. Ramage et al. (2010) and Mehrotra et al. (2013) used hashtags as labels to train supervised topic models. But these models depend on large-scale hashtag-labeled data for model training, and their performance is inevitably compromised when facing unseen topics irrelevant to any hashtag in the training data, given the rapid change and wide variety of topics in social media.
SATM (Quan et al., 2015) combined short text aggregation and topic induction into a unified model. But in their work, no prior knowledge was given to ensure the quality of text aggregation, which can therefore affect the performance of topic inference. In this work, we organize microblog messages as conversation trees based on reposting/replying relations, which is a more advantageous message aggregation strategy.
Another line of research tackled word sparseness by modeling word relations instead of word occurrences in documents. For example, the Gaussian Mixture Topic Model (GMTM) (Sridhar, 2015) utilized word embeddings to model the distributional similarities of words and then inferred clusters of words represented by word distributions using a Gaussian Mixture Model (GMM), which captures the notion of latent topics. However, GMTM heavily relies on meaningful word embeddings that require a large volume of high-quality external resources for training.
The Biterm Topic Model (BTM) (Yan et al., 2013) directly explores unordered word-pair co-occurrence patterns in each individual message. Our model learns topics from aggregated messages based on conversation trees, which naturally provide richer context since word co-occurrence patterns can be captured from multiple relevant messages.

LeadLDA Topic Model
In this section, we describe how to extract topics from a microblog collection utilizing conversation tree structures, where the trees are organized based on reposting and replying relations among the messages 2.
To identify key topic-related content from colloquial texts, we differentiate the messages as leaders and followers. Following Li et al. (2015), we extract all root-to-leaf paths on conversation trees and utilize the state-of-the-art sequence learning model CRF (Lafferty et al., 2001) to detect the leaders 3. As a result, the posterior probability of each node being a leader or follower is obtained by averaging the different marginal probabilities of the same node over all the tree paths that contain the node. Then, the obtained probability distribution is considered as the observed prior variable input into our model.
2 Reposting/replying relations are straightforward to obtain by using microblog APIs from Twitter and Sina Weibo.
3 The CRF model for leader detection was trained on a public corpus with all the messages annotated on the tree paths. Details are described in Section 4.
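The averaging step described above can be illustrated with a minimal sketch. It assumes that the CRF marginals are already available as (node, probability) pairs per root-to-leaf path; the function name `average_leader_probability` and the toy probabilities are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def average_leader_probability(path_marginals):
    """Average per-path leader marginals into one probability per node.

    path_marginals: list of paths, each a list of (node_id, p_leader) pairs,
    where p_leader is the marginal leader probability on that path.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for path in path_marginals:
        for node_id, p_leader in path:
            sums[node_id] += p_leader
            counts[node_id] += 1
    # A node shared by several paths (e.g. the root) gets the mean marginal.
    return {node: sums[node] / counts[node] for node in sums}


# Toy input: the root "O" lies on both root-to-leaf paths.
paths = [
    [("O", 0.9), ("R1", 0.2)],
    [("O", 0.7), ("R2", 0.6), ("R3", 0.1)],
]
leader_prob = average_leader_probability(paths)
```

The resulting dictionary plays the role of the observed leader probability of each message that is fed into the topic model.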

Topics and Conversation Trees
Previous works (Zhao et al., 2011; Yan et al., 2013; Quan et al., 2015) have shown that assuming each short post contains a single topic helps to alleviate the data sparsity problem. Thus, given a corpus of microblog posts organized as conversation trees and the estimated leader probabilities of tree nodes, we assume that each message only contains a single topic and a tree covers a mixture of multiple topics. Since leader messages subsume the content of their followers, the topic of a leader can be generated from the topic distribution of the entire tree. Consequently, the topic mixture of a conversation tree is determined by the topic assignments to the leader messages on it. The topics of followers, however, exhibit strong and explicit dependencies on the topics of their ancestors. So, their topics need to be generated in consideration of local constraints. Here, we mainly address how to model the topic dependencies of followers.
Inspired by the general Structural Topic Model (strTM) (Wang et al., 2011), which incorporates document structures into topic models by explicitly modeling topic dependencies between adjacent sentences, we exploit the topical transitions between parents and children in the trees for guiding topic assignments.
Intuitively, the emergence of a leader results in a potential topic shift. It tends to weaken the topic similarities between the emerging leaders and their predecessors. For example, [R7] in Figure 1 transfers the topic to a new focus, and thus weakens the tie with its parent. We can simplify our case by assuming that followers are topically responsive just up to (hence not further than) their nearest ancestor leaders. Thus, we can dismantle each conversation tree into a forest by removing the links between leaders and their parents, hence producing a set of subgraphs like [R2]-[R6] and [R7]-[R9] in Figure 1. Then, we model the internal topic dependencies within each subgraph by inferring the parent-child topic transition probabilities satisfying the first-order Markov property, in a similar way to estimating the transition distribution of adjacent sentences in strTM (Wang et al., 2011). At the topic assignment stage, the topic of a follower will be assigned by referring to its parent's topic and the transition distribution that captures topic similarities of followers to their parents (see Section 3.2).
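As an illustration of dismantling a tree into a forest, the sketch below groups every message under its nearest ancestor leader, which is equivalent to cutting the link between each leader and its parent. It assumes hard leader labels for simplicity (the model itself uses leader probabilities); the tree shape and the name `split_into_subgraphs` are hypothetical.

```python
def split_into_subgraphs(parent, is_leader):
    """Group each message with its nearest ancestor leader (or itself).

    parent: dict mapping node -> parent node (the root maps to None).
    is_leader: dict mapping node -> bool.
    Returns a dict mapping each subgraph's leader -> list of member nodes.
    """
    def nearest_leader(node):
        # Walk upwards until a leader is met; the root always acts as a leader.
        while not is_leader[node] and parent[node] is not None:
            node = parent[node]
        return node

    groups = {}
    for node in parent:
        groups.setdefault(nearest_leader(node), []).append(node)
    return groups


# Hypothetical tree: O -> R1, O -> R2 -> R3 -> R7 -> R8.
parent = {"O": None, "R1": "O", "R2": "O", "R3": "R2", "R7": "R3", "R8": "R7"}
is_leader = {"O": True, "R1": False, "R2": True, "R3": False, "R7": True, "R8": False}
groups = split_into_subgraphs(parent, is_leader)
```

Within each returned group, every follower depends only on its parent inside the same group, which matches the first-order Markov assumption stated above.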
In addition, every word in the corpus is either a topical or non-topic (i.e., background) word, which highly depends on whether it occurs in a leader or a follower message. Figure 2 illustrates the graphical model of our generative process, which is named as LeadLDA.

Topic Modeling
Formally, we assume that the microblog posts are organized as T conversation trees. Each tree t contains M_t message nodes and each message m contains N_{t,m} words in the vocabulary. The vocabulary size is V and there are K topics embedded in the corpus, represented by word distributions φ_k ∼ Dir(β) (k = 1, 2, ..., K). Also, a background word distribution φ_B ∼ Dir(β) is included to capture the general information, which is not topic specific. φ_k and φ_B are multinomial distributions over the vocabulary. A tree t is modeled as a mixture of topics θ_t ∼ Dir(α) and any message m on t is assumed to contain a single topic z_{t,m} ∈ {1, 2, ..., K}.
(1) Topic assignments: The topic assignments of LeadLDA are inspired by prior work that combines syntactic and semantic dependencies between words. For each message m on the tree t, LeadLDA integrates the outcomes of leader detection with a binomial switcher y_{t,m} ∈ {0, 1} indicating whether m is a leader (y_{t,m} = 1) or a follower (y_{t,m} = 0). y_{t,m} is parameterized by its leader probability l_{t,m}, which is the posterior probability output from the leader detection model and serves as an observed prior variable.
According to the notion of leaders, they initiate key aspects of previously discussed topics or signal a new topic, shifting the focus of their descendant followers. So, the topics of leaders on tree t are directly sampled from the topic mixture θ_t.
To model the internal topic correlations within the subgraph of a conversation tree consisting of a leader and all its followers, we capture parent-child topic transitions π_k ∼ Dir(γ), which is a distribution over K topics, and use π_{k,j} to denote the probability of a follower being assigned topic j when the topic of its parent is k. Specifically, if message m is sampled as a follower and the topic assignment to its parent message is z_{t,p(m)}, where p(m) indexes the parent of m, then z_{t,m} (i.e., the topic of m) is generated from the topic transition distribution π_{z_{t,p(m)}}. In particular, since the root of a conversation tree has no parent and can only be a leader, we set the leader probability l_{t,root} = 1 to force its topic to be generated only from the topic distribution of tree t.
(2) Topical and non-topic words: We separately model the distributions of follower and leader messages emitting topical or non-topic words with τ_0 (followers, y_{t,m} = 0) and τ_1 (leaders, y_{t,m} = 1), respectively, both of which are drawn from a symmetric Beta prior parameterized by δ. Specifically, for each word n in message m on tree t, we add a binomial background switcher x_{t,m,n} controlled by whether m is a leader or a follower, i.e., x_{t,m,n} ∼ Bi(τ_{y_{t,m}}), which indicates that n is a topical word if x_{t,m,n} = 0 or a background word if x_{t,m,n} = 1. Accordingly, x_{t,m,n} controls whether n is generated from the topic-word distribution φ_{z_{t,m}}, where z_{t,m} is the topic of m, or from the background word distribution φ_B modeling non-topic information.
(3) Generation process: To sum up, conditioned on the hyper-parameters Θ = (α, β, γ, δ), a conversation tree t is generated by first drawing the leader switcher and the topic of each message as described in (1), and then drawing the background switcher and the identity of each word as described in (2).
Table 1: Notations of the symbols used in sampling formulas (1) and (2). (t, m): message m on conversation tree t.
C^{LB}_{s,(r)}: # of words with background switchers assigned as r and occurring in messages with leader switchers s.
C^{LB}_{s,(·)}: # of words occurring in messages whose leader switchers are s, i.e., Σ_{r∈{0,1}} C^{LB}_{s,(r)}.
N^{B}_{(r)}: # of words occurring in message (t, m) and with background switchers assigned as r.
[...]: # of words indexing v in vocabulary, sampled as topic (non-background) words, and occurring in messages assigned topic k.
[...]: # of words assigned as topic (non-background) words and occurring in messages assigned topic k, i.e., the sum of the preceding count over all v.
N^{W}_{(v)}: # of words indexing v in vocabulary that occur in message (t, m) and are assigned as topic (non-background) words.
N^{W}_{(·)}: # of words assigned as topic (non-background) words and occurring in message (t, m), i.e., Σ_v N^{W}_{(v)}.
[...]: # of messages sampled as followers and assigned topic j, whose parents are assigned topic i.
[...]: # of messages sampled as followers whose parents are assigned topic i, i.e., the sum of the preceding count over all j.
I(·): An indicator function, whose value is 1 when its argument inside () is true, and 0 otherwise.
[...]: # of messages that are children of message (t, m), sampled as followers and assigned topic j.
[...]: # of message (t, m)'s children sampled as followers, i.e., the sum of the preceding count over all j.
C^{TT}_{t,(k)}: # of messages on conversation tree t sampled as leaders and assigned topic k.
C^{TT}_{t,(·)}: # of messages on conversation tree t sampled as leaders, i.e., Σ_k C^{TT}_{t,(k)}.
C^{BW}_{(v)}: # of words indexing v in vocabulary and assigned as background (non-topic) words.
C^{BW}_{(·)}: # of words assigned as background (non-topic) words, i.e., Σ_v C^{BW}_{(v)}.
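The generative story in (1) to (3) can be sketched in code as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: τ_y is read as the probability that a word in a message with leader switcher y is a background word, messages are listed so that parents precede their children, and all function names and toy values are hypothetical. Only the Python standard library is used.

```python
import random


def dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def draw_corpus_parameters(rng, K, V, beta, gamma, delta):
    """Corpus-level draws: phi_k, phi_B, pi_k and tau_0, tau_1."""
    return {
        "phi": [dirichlet(rng, beta, V) for _ in range(K)],
        "phi_b": dirichlet(rng, beta, V),
        "pi": [dirichlet(rng, gamma, K) for _ in range(K)],
        # Assumption: tau[y] = P(background word | leader switcher y).
        "tau": [rng.betavariate(delta, delta) for _ in range(2)],
    }


def generate_tree(rng, parents, leader_prob, n_words, alpha, params):
    """Generate switchers, topics and words for one conversation tree.

    parents[m] is the index of m's parent (None for the root); parents
    must appear before their children in the list.
    """
    K, V = len(params["phi"]), len(params["phi_b"])
    theta = dirichlet(rng, alpha, K)  # topic mixture of the tree
    y, z, words = [], [], []
    for m, parent in enumerate(parents):
        # The root has no parent and is forced to be a leader.
        is_leader = 1 if parent is None else int(rng.random() < leader_prob[m])
        # Leaders draw from theta_t, followers from pi indexed by parent topic.
        topic_dist = theta if is_leader else params["pi"][z[parent]]
        topic = rng.choices(range(K), weights=topic_dist)[0]
        message = []
        for _ in range(n_words[m]):
            background = rng.random() < params["tau"][is_leader]
            word_dist = params["phi_b"] if background else params["phi"][topic]
            message.append(rng.choices(range(V), weights=word_dist)[0])
        y.append(is_leader)
        z.append(topic)
        words.append(message)
    return y, z, words


rng = random.Random(0)
params = draw_corpus_parameters(rng, K=3, V=20, beta=0.1, gamma=1.0, delta=0.5)
parents = [None, 0, 0, 1]
y, z, words = generate_tree(rng, parents, [1.0, 0.2, 0.8, 0.1], [5, 3, 4, 2],
                            alpha=1.0, params=params)
```

Words are represented by vocabulary indices; a real corpus would map them back to word strings.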

Inference for Parameters
We use collapsed Gibbs sampling (Griffiths, 2002) to carry out posterior inference for parameter learning. The hidden multinomial variables, i.e., message-level variables (y and z) and word-level variables (x), are sampled in turn, conditioned on a complete assignment of all other hidden variables. Due to space limitations, we leave out the details of the derivation but give the core formulas in the sampling steps.
We first define the notations of all variables needed by the formulation of Gibbs sampling, which are described in Table 1. In particular, the various C variables refer to counts excluding the message m on conversation tree t.
For each message m on a tree t, we sample the leader switcher y_{t,m} and topic assignment z_{t,m} according to the conditional probability distribution
p(y_{t,m} = s, z_{t,m} = k | y_{¬(t,m)}, z_{¬(t,m)}, w, x, l, Θ)    (1)
which includes a factor g(s, k, t, m) that takes different forms depending on the value of s.
For each word n in m on t, the sampling formula of its background switcher is given as follows:
p(x_{t,m,n} = r | x_{¬(t,m,n)}, y, z, w, l, Θ) ∝ (C^{LB}_{y_{t,m},(r)} + δ) / (C^{LB}_{y_{t,m},(·)} + 2δ) · h(r, t, m, n)    (2)
where h(r, t, m, n) takes different forms depending on the value of r.
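To make the background-switcher update concrete, the following sketch evaluates formula (2) for one word. The first factor follows the formula as given. The form of h used here is an assumption, not taken from the text: it is the usual collapsed word likelihood, i.e., (count of the word under the message's topic + β) / (total topic-word count + Vβ) for r = 0, and the same ratio with background-word counts for r = 1. All argument names are hypothetical, and all counts are assumed to exclude the word being resampled.

```python
def background_switcher_probs(c_lb, c_topic_word, c_topic_total,
                              c_bg_word, c_bg_total, V, beta, delta):
    """Return normalized [P(x = 0), P(x = 1)] for one word.

    c_lb: [C_LB(y, r=0), C_LB(y, r=1)] for the leader switcher y of the
          message that contains the word.
    c_topic_word / c_topic_total: counts of this word / all topic words
          under the message's topic (assumed form of h for r = 0).
    c_bg_word / c_bg_total: counts of this word / all words assigned to
          the background distribution (assumed form of h for r = 1).
    """
    weights = []
    for r in (0, 1):
        # First factor of formula (2).
        switch = (c_lb[r] + delta) / (c_lb[0] + c_lb[1] + 2 * delta)
        # Assumed h(r, t, m, n): smoothed word likelihood.
        if r == 0:
            h = (c_topic_word + beta) / (c_topic_total + V * beta)
        else:
            h = (c_bg_word + beta) / (c_bg_total + V * beta)
        weights.append(switch * h)
    total = sum(weights)
    return [w / total for w in weights]


# Toy counts: the word is frequent under the topic and rare in the background.
probs = background_switcher_probs(c_lb=[30, 10], c_topic_word=5, c_topic_total=100,
                                  c_bg_word=1, c_bg_total=100,
                                  V=1000, beta=0.1, delta=0.5)
```

A sampler would then draw x from these two probabilities and update the counts accordingly.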

Data Collection and Experiment Setup
To evaluate our LeadLDA model, we conducted experiments on a real-world microblog dataset collected from Sina Weibo, which has the same 140-character limitation as Twitter and a similar market penetration (Rapoza, 2011). Because the content of posts is often incomplete and informal, it is difficult to manually annotate topics on a large scale. Therefore, we follow Yan et al. (2013) in utilizing hashtags led by '#', which are manual topic labels provided by users, as ground-truth categories of microblog messages. We collected the real-time trending hashtags on Sina Weibo and utilized the hashtag-search API 4 to crawl the posts matching the given hashtag queries. In the end, we built a corpus containing 596,318 posts during May 1 - July 31, 2014.
To examine the performance of models on various topic distributions, we split the corpus into 3 datasets, each containing messages of one month. Similar to Yan et al. (2013), for each dataset, we manually selected 50 frequent hashtags as topics, e.g., #mh17, #worldcup, etc. The experiments were conducted on the subsets of posts with the selected hashtags. Table 2 shows the statistics of the three subsets used in our experiments.
We preprocessed the datasets before topic extraction in the following steps:
1) Use the FudanNLP toolkit (Qiu et al., 2013) for word segmentation, stop word removal and POS tagging for Chinese Weibo messages;
2) Generate a vocabulary for each dataset and remove words occurring fewer than 5 times;
3) Remove all hashtags in texts before inputting them to models, since the models are expected to extract topics without knowing the hashtags, which are ground-truth topics;
4) For LeadLDA, use the CRF-based leader detection model (Li et al., 2015) to classify messages as leaders and followers. The leader detection model was implemented by using CRF++ 5, was trained on the public dataset composed of 1,300 conversation paths, and achieved a state-of-the-art F1-score of 73.7% (Li et al., 2015).
For the hyper-parameters of LeadLDA, we fixed α = 50/K and β = 0.1, following the common practice in previous works (Quan et al., 2015). Since there is no analogue of γ and δ in prior works, where γ controls the topic dependencies of follower messages on their ancestors and δ controls the different tendencies of leaders and followers to cover topical and non-topic words, we tuned γ and δ by grid search on a large development set containing around 120K posts and obtained γ = 50/K, δ = 0.5.
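Preprocessing steps 2) and 3) above can be sketched as follows, assuming the posts are already segmented into tokens and that a hashtag token is any token starting with '#'. The function name `preprocess` and the toy posts are hypothetical. Hashtags are dropped before counting here, which yields the same vocabulary of ordinary words as counting first.

```python
from collections import Counter


def preprocess(tokenized_posts, min_count=5):
    """Drop hashtag tokens and words occurring fewer than min_count times."""
    # Step 3: remove hashtags so that models never see the ground-truth labels.
    no_tags = [[tok for tok in post if not tok.startswith("#")]
               for post in tokenized_posts]
    # Step 2: build the vocabulary and filter rare words.
    freq = Counter(tok for post in no_tags for tok in post)
    vocab = {tok for tok, count in freq.items() if count >= min_count}
    cleaned = [[tok for tok in post if tok in vocab] for post in no_tags]
    return cleaned, vocab


# Toy corpus: five identical posts plus one post with a rare word.
posts = [["#mh17", "crash", "plane"] for _ in range(5)] + [["rare"]]
cleaned, vocab = preprocess(posts)
```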

Experimental Results
We evaluated topic models with two settings of K, i.e., the number of topics. One is K = 50, to match the count of hashtags following Yan et al. (2013), and the other is K = 100, much larger than the "real" number of topics. We compared LeadLDA with the following five state-of-the-art baselines.
TreeLDA: Analogous to Zhao et al. (2011), who aggregated messages posted by the same author, TreeLDA aggregates messages from one conversation tree as a pseudo-document. Additionally, it includes a background word distribution to capture non-topic words controlled by a general Beta prior without differentiating leaders and followers. TreeLDA can be considered a degenerate case of LeadLDA, where topics assigned to all messages are generated from the topic distributions of the conversation trees they are on.
StructLDA: It is another variant of LeadLDA, where topics assigned to all messages are generated based on topic transitions from their parents. The strTM (Wang et al., 2011) utilized a similar model to capture the topic dependencies of adjacent sentences in a document. Following strTM, we add a dummy topic T_start emitting no word to the "pseudo parents" of root messages. Also, we add the same background word distribution to capture non-topic words as TreeLDA does.
The hyper-parameters of BTM, SATM and GMTM were set according to the best hyper-parameters reported in their original papers. For TreeLDA and StructLDA, the parameter settings were kept the same as LeadLDA since they are its variants, and the background switchers were parameterized by a symmetric Beta prior with parameter 0.5, following Chemudugunta et al. (2006). We ran Gibbs sampling (in LeadLDA, TreeLDA, StructLDA, BTM and SATM) and the EM algorithm (in GMTM) for 1,000 iterations to ensure convergence.
Topic model evaluation is inherently difficult. In previous works, perplexity is a popular metric to evaluate the predictive abilities of topic models given a held-out dataset with unseen words (Blei et al., 2003b). However, Chang et al. (2009) have demonstrated that models that score well on perplexity do not necessarily generate topics that are semantically coherent in human perception. Therefore, we conducted objective and subjective analyses of the coherence of the produced topics.

Objective Analysis
The quality of topics is commonly measured by coherence scores (Mimno et al., 2011), assuming that words representing a coherent topic are likely to co-occur within the same document. However, due to the severe sparsity of short text posts, we modify the calculation of commonly-used topic coherence measure based on word co-occurrences in messages tagged with the same hashtag, named as hashtag-document, assuming that those messages discuss related topics 8 .
Specifically, we calculate the coherence score C of a topic k from its top N words ranked by likelihood (Equation (3)), where w^k_i represents the i-th word in topic k ranked by p(w|k), D(w^k_i, w^k_j) refers to the count of hashtag-documents where words w^k_i and w^k_j co-occur, and D(w^k_i) denotes the number of hashtag-documents that contain word w^k_i. Table 3 shows the absolute values of C scores for topics produced on the three evaluation datasets (May, June and July), where the top 10, 15 and 20 words of topics were selected for evaluation. Lower scores indicate better coherence in the induced topic.
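A coherence computation over hashtag-documents can be sketched as below. The exact form is an assumption: it uses the commonly cited formulation of Mimno et al. (2011), i.e., the sum over ordered word pairs of log((D(w_i, w_j) + 1) / D(w_j)); any normalization applied in the paper's own equation is not reproduced. The function name `coherence` and the toy documents are hypothetical.

```python
import math


def coherence(top_words, hashtag_docs):
    """Mimno-style coherence of one topic (assumed form, unnormalized).

    top_words: words of the topic ranked by p(w|k), most likely first.
    hashtag_docs: list of sets, each holding the words of all messages
                  tagged with one hashtag (a "hashtag-document").
    """
    def doc_count(*words):
        # Number of hashtag-documents containing all the given words.
        return sum(1 for doc in hashtag_docs if all(w in doc for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            denom = doc_count(top_words[j])
            if denom == 0:
                continue  # skip words that never occur, to avoid division by zero
            joint = doc_count(top_words[i], top_words[j])
            score += math.log((joint + 1) / denom)
    return score


docs = [{"a", "b"}, {"a", "b"}, {"a"}]
score_related = coherence(["a", "b"], docs)    # "b" co-occurs with "a"
score_unrelated = coherence(["a", "c"], docs)  # "c" never co-occurs with "a"
```

Scores under this form are at most slightly above zero and typically negative, so taking absolute values, as done for Table 3, makes smaller numbers correspond to more coherent topics.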
8 We sampled posts and their corresponding hashtags in our evaluation set and found only 1% mismatch.
We have the following observations:
• GMTM gave the worst coherence scores, which may be ascribed to its heavy reliance on relevant large-scale high-quality external data, without which the trained word embedding model failed to capture meaningful semantic features for words, and hence could not yield coherent topics.
• TreeLDA and StructLDA produced competitive results compared to the state-of-the-art baseline models, which indicates the effectiveness of using conversation structures to enrich context and thus generate topics of reasonably good quality.
• The topics generated by LeadLDA were more coherent than those of all the baselines on the three datasets, most of the time by large margins; the only exception was BTM on the May dataset when K = 50 and N = 10. The generally higher performance of LeadLDA is due to three reasons: 1) It effectively identifies topics using the conversation tree structures, which provide richer context information; 2) It jointly models the topics of leaders and the topic dependencies of other messages on a tree. TreeLDA and StructLDA, each only considering one of the factors, performed worse than LeadLDA; 3) LeadLDA separately models the probabilities of leaders and followers containing topical or non-topic words, while the baselines only model the general background information regardless of the different types of messages. This implies that leaders and followers do have different capacities in covering key topical words or background noise, which is useful to identify key words for topic representation.

Subjective Analysis
To evaluate the coherence of induced topics from a human perspective, we invited two annotators to subjectively rate the quality of every topic (by displaying the top 20 words) generated by different models on a 1-5 Likert scale. A higher rating indicates better quality of topics. The Fleiss' Kappa of annotators' ratings measured for various models on different datasets given K = 50 and 100 ranges from 0.62 to 0.70, indicating substantial agreement (Landis and Koch, 1977). Table 4 shows the overall subjective ratings. We noticed that humans preferred topics produced given K = 100 to K = 50, but coherence scores gave generally better grades to models for K = 50, which matched the number of topics in the ground truth. This is because models tended to mix in more common words when K is larger. Coherence score calculation (Equation (3)) penalizes common words that occur in many documents, whereas humans could still "guess" the meaning of topics based on the rest of the words and thus gave relatively good ratings. Nevertheless, annotators gave remarkably higher ratings to LeadLDA than to the baselines on all datasets regardless of K being 50 or 100, which confirmed that LeadLDA effectively yielded good-quality topics.
For a detailed analysis, Figure 3 lists the top 20 words about "MH17 crash" induced by different models 9 when K = 50.
9 As shown in Tables 3 and 4, the topic coherence scores of GMTM were the worst. Hence, the topic generated by GMTM is not shown due to space limitation.
Table 4: Subjective ratings of topics. The meanings of K50, K100, TREE, STR and LEAD are the same as in Table 3.
We have the following observations:
• BTM, based on word-pair co-occurrences, mistakenly grouped "Fok's family" (a tycoon family in Hong Kong), which co-occurred frequently with "Hong Kong" in other topics, into the topic of "MH17 crash". "Hong Kong" is relevant here as a Hong Kong passenger died in the MH17 crash.
• The topical words generated by SATM were mixed with words relevant to the bus explosion in Guangzhou, since it aggregated messages according to topic affinities based on the topics learned in the previous step. Thus the posts about the bus explosion and the MH17 crash, both pertaining to disasters, were mistakenly aggregated together, which generated spurious topic results.
• Both TreeLDA and StructLDA generated topics containing non-topic words like "microblog" and "dear". This means that without distinguishing leaders and followers, it is difficult to filter out non-topic words. The topic quality of StructLDA nevertheless seems better than that of TreeLDA, which implies the usefulness of exploiting topic dependencies of posts in conversation structures.
• LeadLDA not only produced more semantically coherent words describing the topic, but also revealed some important details, e.g., MH17 was shot down by a missile.

Conclusion and Future Works
This paper has proposed a novel topic model by considering the conversation tree structures of microblog posts. By rigorously comparing our proposed model with a number of competitive baselines on real-world microblog datasets, we have demonstrated the effectiveness of using conversation structures to help model topics embedded in short and colloquial microblog messages.
This work has shown that detecting leaders and followers, which are coarse-grained discourse roles derived from conversation structures, is useful for modeling microblogging topics. In the next step, we plan to exploit fine-grained discourse structures, e.g., dialogue acts (Ritter et al., 2010), and propose a unified model that jointly infers discourse roles and topics of posts in the context of conversation tree structures. Another extension is to extract topic hierarchies by integrating the conversation structures into hierarchical topic models like HLDA (Blei et al., 2003a) to extract fine-grained topics from microblog posts.