User Based Aggregation for Biterm Topic Model

Biterm Topic Model (BTM) is designed to model the generative process of word co-occurrence patterns in short texts such as tweets. However, two aspects of BTM may restrict its performance: 1) user individualities are ignored in order to obtain corpus-level word co-occurrence patterns; and 2) the strong assumption that two co-occurring words are assigned the same topic label cannot distinguish background words from topical words. In this paper, we propose the Twitter-BTM model to address these issues by considering user-level personalization in BTM. First, we use user-based biterm aggregation to learn user-specific topic distributions. Second, each user's preference between background words and topical words is estimated by incorporating a background topic. Experiments on a large-scale real-world Twitter dataset show that Twitter-BTM outperforms several state-of-the-art baselines.


Introduction
In recent years, short texts have become increasingly prevalent due to the explosive growth of online social media. For example, about 500 million tweets are published per day on Twitter¹, one of the most popular online social networking services. Probabilistic topic models (Blei et al., 2003) are broadly used to uncover the hidden topics of tweets, since a low-dimensional semantic representation is crucial for many applications, such as product recommendation (Zhao et al., 2014), hashtag recommendation (Ma et al., 2014), user interest tracking (Sasaki et al., 2014), and sentiment analysis (Si et al., 2013). However, the scarcity of context and the noisy words restrict LDA and its variations in topic modeling over short texts.
Previous works model the topic distribution of tweets at three different levels: 1) document: standard LDA assumes each document is associated with a topic distribution (Godin et al., 2013; Huang, 2012), so LDA and its variations suffer from context sparsity in each tweet; 2) user: user-based aggregation is utilized to alleviate the sparsity problem in short texts (Weng et al., 2010; Hong and Davison, 2010). In these models, all the tweets of the same user are aggregated into a pseudo document, based on the observation that tweets written by the same user are more similar; 3) corpus: BTM (Yan et al., 2013) assumes that all biterms (co-occurring word pairs) are generated by a corpus-level topic distribution, to benefit from the rich global word co-occurrence patterns.
As far as we know, how to incorporate the user factor into BTM has not been studied yet. User-based aggregation has proven effective for LDA, but our preliminary experiments indicate that simple user-based aggregation for BTM generates incoherent topics. To distinguish between commonly used words (e.g., good, people) and topical words (e.g., food, travel), a background topic is often incorporated into topic models. Zhao et al. (2011) use a background topic in Twitter-LDA to distill discriminative words in tweets. Sasaki et al. (2014) reduce the perplexity of Twitter-LDA by estimating, for each user, the ratio between choosing background words and topical words. Both make the very strong assumption that one tweet covers only one topic. Yan et al. (2015) use a background topic to distinguish between common biterms and bursty biterms, which requires external data to evaluate the burstiness of each biterm as prior knowledge. Unlike the above, we incorporate a background topic to absorb the non-discriminative common words in each biterm, and we also estimate each user's preference between common words and topical words. Our new model, named Twitter-BTM, combines user-based aggregation and the background topic in BTM. Finally, experiments on a Twitter dataset show that Twitter-BTM not only discovers more coherent topics but also gives a more accurate topic representation of tweets than several state-of-the-art baselines.
We organize the rest of the paper as follows. Section 2 gives a brief review of BTM. Section 3 introduces our Twitter-BTM model and its implementation. Section 4 describes experimental results on a large-scale Twitter dataset. Finally, Section 5 presents the conclusion and future work.

¹ See https://about.Twitter.com/company

BTM
There are two major differences between BTM and LDA (Yan et al., 2013). First, since a topic is a mixture of highly correlated words, which implies that these words often occur together in the same document, BTM models the generative process of word co-occurrence patterns directly. Thus a document made up of $n$ words is converted to $C_n^2 = n(n-1)/2$ biterms. Second, LDA and its variants suffer from severe data sparsity in short documents, so BTM uses global co-occurrence patterns to model the topic distribution at the corpus level instead of the document level.
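The conversion of a document into biterms can be illustrated with a minimal sketch. It assumes the document is already tokenized and treats a biterm as an unordered word pair; the function name `extract_biterms` is hypothetical and not taken from the paper.

```python
from itertools import combinations
from math import comb


def extract_biterms(tokens):
    """Return all unordered word pairs (biterms) of a tokenized short text."""
    # Sorting each pair makes (a, b) and (b, a) the same biterm.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


tweet = ["apple", "banana", "juice", "recipe"]
biterms = extract_biterms(tweet)
# A document with n words yields C(n, 2) = n(n-1)/2 biterms.
assert len(biterms) == comb(len(tweet), 2)
```

Because every pair of words in the text forms a biterm, even a very short tweet contributes several co-occurrence observations to the corpus-level statistics.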
The graphical representation of BTM (Yan et al., 2013) is shown in Figure 1(a). It assumes that the whole corpus is associated with a distribution $\theta$ over $K$ topics drawn from a Dirichlet prior $\mathrm{Dir}(\alpha)$, and that each topic $t$ is associated with a multinomial distribution $\phi_t$ over a vocabulary of $V$ unique words drawn from a Dirichlet prior $\mathrm{Dir}(\beta)$. The generative process for a corpus of biterms is as follows:
1. Draw $\theta \sim \mathrm{Dir}(\alpha)$.
2. For each topic $t = 1, \dots, K$, draw $\phi_t \sim \mathrm{Dir}(\beta)$.
3. For each biterm $b = (w_{b,1}, w_{b,2})$ in the corpus, draw $z_b \sim \mathrm{Multi}(\theta)$, then draw $w_{b,1}, w_{b,2} \sim \mathrm{Multi}(\phi_{z_b})$.
In the above process, $z_b$ is the topic assignment latent variable of biterm $b$. To infer the parameters $\phi$ and $\theta$, collapsed Gibbs sampling is used (Yan et al., 2013).
Compared with the strong assumption that a short document covers only a single topic (Diao et al., 2012; Ding et al., 2013), BTM makes the looser assumption that two words are assigned the same topic label if they co-occur. Thus a short document can cover more than one topic, which is closer to reality. But this assumption causes another issue: commonly used words and topical words are treated equally, and it is inappropriate to assign the same topic label to both kinds of words.

Twitter-BTM
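In this section, we introduce our Twitter-BTM model. Figure 1(b) shows the graphical representation of Twitter-BTM. The generative process of Twitter-BTM is as follows:
1. Draw the background word distribution $\phi_B \sim \mathrm{Dir}(\beta)$.
2. For each topic $t = 1, \dots, K$, draw $\phi_t \sim \mathrm{Dir}(\beta)$.
3. For each user $u$, draw $\theta_u \sim \mathrm{Dir}(\alpha)$ and $\pi_u \sim \mathrm{Beta}(\gamma)$; then, for each biterm $b = 1, \dots, N_u$ of user $u$:
   (a) draw $z_{u,b} \sim \mathrm{Multi}(\theta_u)$;
   (b) for $n = 1, 2$, draw $y_{u,b,n} \sim \mathrm{Bernoulli}(\pi_u)$, and draw $w_{b,n} \sim \mathrm{Multi}(\phi_{z_{u,b}})$ if $y_{u,b,n} = 1$ or $w_{b,n} \sim \mathrm{Multi}(\phi_B)$ if $y_{u,b,n} = 0$.
In the above process, user $u$'s topic interest $\theta_u$ is a multinomial distribution over $K$ topics drawn from a Dirichlet prior $\mathrm{Dir}(\alpha)$. The background topic $B$ is associated with a multinomial distribution $\phi_B$ drawn from a Dirichlet prior $\mathrm{Dir}(\beta)$. The assumption that each user has a different preference between topical words and background words was shown to be effective by Sasaki et al. (2014), and we adopt it in Twitter-BTM. User $u$'s preference is represented as a Bernoulli distribution with parameter $\pi_u$ drawn from a beta prior $\mathrm{Beta}(\gamma)$. $N_u$ is the number of biterms of user $u$, and $z_{u,b}$ is the topic assignment latent variable of user $u$'s biterm $b$. For user $u$, biterm $b$ and word position $n = 1$ or $2$, the latent variable $y_{u,b,n}$ indicates the word type of the word $w_{b,n}$. When $y_{u,b,n} = 1$, $w_{b,n}$ is generated from topic $z_{u,b}$. When $y_{u,b,n} = 0$, $w_{b,n}$ is generated from the background topic $B$.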
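To make the roles of the latent variables concrete, the generative story for a single user can be simulated with the standard library. This is only a sketch under stated assumptions: $\mathrm{Beta}(\gamma)$ is read as the symmetric $\mathrm{Beta}(\gamma, \gamma)$, $\pi_u$ is taken to be the probability that $y = 1$ (a topical word), words are integer ids, and the helper names are hypothetical. The toy hyperparameters are chosen for numerical convenience and are not the values used in the experiments.

```python
import random


def dirichlet(alpha, size, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_user_biterms(n_biterms, phi, phi_bg, alpha, gamma, rng):
    """Generate biterms for one user; phi[t] and phi_bg are word distributions."""
    n_topics = len(phi)
    vocab = range(len(phi_bg))
    theta_u = dirichlet(alpha, n_topics, rng)  # user's topic interest
    pi_u = rng.betavariate(gamma, gamma)       # assumed: P(y = 1), i.e. topical
    biterms = []
    for _ in range(n_biterms):
        # One topic label is drawn per biterm.
        z = rng.choices(range(n_topics), weights=theta_u)[0]
        words = []
        for _position in (1, 2):
            # y = 1: word comes from topic z; y = 0: from the background topic.
            y = 1 if rng.random() < pi_u else 0
            dist = phi[z] if y == 1 else phi_bg
            words.append(rng.choices(vocab, weights=dist)[0])
        biterms.append((z, tuple(words)))
    return theta_u, pi_u, biterms


rng = random.Random(0)
K, V = 3, 8  # toy sizes
phi = [dirichlet(0.5, V, rng) for _ in range(K)]
phi_bg = dirichlet(0.5, V, rng)
theta_u, pi_u, user_biterms = generate_user_biterms(5, phi, phi_bg, 50 / K, 0.5, rng)
```

Note that the two words of a biterm share one topic label but have separate word-type indicators, so one word can be topical while the other is absorbed by the background topic.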
We adopt collapsed Gibbs sampling to estimate the parameters. Because of space limitations, we leave out the details of the sampling algorithm. Since a document's distribution over topics cannot be obtained directly from the parameters estimated by Twitter-BTM, we utilize the following formula (Yan et al., 2013) to infer the topic distribution of document $d$. Given a document $d$ whose author is user $u$:
$$P(z = t \mid d) = \sum_{i} P(z = t \mid b_i)\, P(b_i \mid d)$$
where the sum runs over the distinct biterms $b_i$ in $d$. The problem is thus reduced to estimating $P(b_i \mid d)$ and $P(z = t \mid b_i)$. $P(b_i \mid d)$ is estimated by the empirical distribution in $d$:
$$P(b_i \mid d) = \frac{N_{b_i}}{N_b}$$
where $N_{b_i}$ is the number of occurrences of biterm $b_i$ in $d$ and $N_b$ is the total number of biterms in $d$. $P(z = t \mid b_i)$ is then computed by applying Bayes' rule under the parameters estimated for user $u$.
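The document-level inference can be sketched as follows. The decomposition over biterms and the empirical estimate of $P(b_i \mid d)$ follow the text; the Bayes' rule step is an assumption, not the paper's stated expression: each word is modeled as a mixture $\pi_u \phi_{t,w} + (1 - \pi_u) \phi_{B,w}$, with $\pi_u$ taken as the probability that a word is topical. Words are integer ids and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def word_prob(w, t, pi_u, phi, phi_bg):
    """P(w | z = t) as a mixture of topic t and the background topic."""
    # Assumption: pi_u is the probability that a word is topical (y = 1).
    return pi_u * phi[t][w] + (1.0 - pi_u) * phi_bg[w]


def topic_given_biterm(biterm, theta_u, pi_u, phi, phi_bg):
    """P(z = t | b) for every topic t, by Bayes' rule with prior theta_u."""
    w1, w2 = biterm
    scores = [
        theta_u[t]
        * word_prob(w1, t, pi_u, phi, phi_bg)
        * word_prob(w2, t, pi_u, phi, phi_bg)
        for t in range(len(theta_u))
    ]
    total = sum(scores)
    return [s / total for s in scores]


def infer_document_topics(tokens, theta_u, pi_u, phi, phi_bg):
    """P(z = t | d) = sum_i P(z = t | b_i) * P(b_i | d)."""
    counts = Counter(tuple(sorted(p)) for p in combinations(tokens, 2))
    n_biterms = sum(counts.values())
    doc_topics = [0.0] * len(theta_u)
    for biterm, count in counts.items():
        posterior = topic_given_biterm(biterm, theta_u, pi_u, phi, phi_bg)
        for t, p in enumerate(posterior):
            doc_topics[t] += p * count / n_biterms  # P(b_i | d) = N_{b_i} / N_b
    return doc_topics


# Toy parameters: two topics over a four-word vocabulary.
toy_phi = [[0.7, 0.2, 0.05, 0.05], [0.05, 0.05, 0.2, 0.7]]
toy_phi_bg = [0.25, 0.25, 0.25, 0.25]
doc_topics = infer_document_topics([0, 1, 0], [0.5, 0.5], 0.8, toy_phi, toy_phi_bg)
```

In this toy case the document contains only words favored by the first topic, so most of the probability mass in `doc_topics` falls on that topic.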

Experiments
In this section, we describe our experiments carried out on a Twitter dataset collected from 10th Jun, 2009 to 31st Dec, 2009. Stop words and words that occur fewer than 5 times are removed. We also filter out tweets which have only one or two words. All letters are converted into lower case. The dataset is divided into two parts. The first part, whose statistics are shown in Table 1, is used for training. The second part, which consists of 22,496,107 tweets, is used as the external dataset in the topic coherence evaluation task in Section 4.1.
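The preprocessing steps above can be sketched in a few lines. Several details are assumptions because the text does not specify them: whitespace tokenization, the stop-word list (the tiny `STOP_WORDS` set here is purely illustrative), and the order of operations (short tweets are filtered after word removal in this sketch).

```python
from collections import Counter

# Hypothetical stop-word list; the paper does not say which list was used.
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "in"}


def preprocess(tweets, min_count=5, min_length=3):
    """Lowercase, drop stop words and rare words, then drop very short tweets."""
    tokenized = [
        [w for w in tweet.lower().split() if w not in STOP_WORDS]
        for tweet in tweets
    ]
    # Count word occurrences over the whole corpus.
    counts = Counter(w for doc in tokenized for w in doc)
    # Remove words that occur fewer than min_count times.
    kept = [[w for w in doc if counts[w] >= min_count] for doc in tokenized]
    # Remove tweets that have only one or two words left (or none).
    return [doc for doc in kept if len(doc) >= min_length]


raw_tweets = ["The cat sat mat"] * 5 + ["cat dog"]
cleaned = preprocess(raw_tweets)
```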
We compare the performance of Twitter-BTM with five baselines:
• LDA-U, in which user-based aggregation is applied before training LDA.
• Twitter-LDA (Zhao et al., 2011), which makes the strong assumption that a tweet covers only one topic.
• TwitterUB-LDA (Sasaki et al., 2014), an improved version of Twitter-LDA, which models the user-level preference between topical words and background words.
• BTM (Yan et al., 2013), which models biterms with a corpus-level topic distribution.
• BTM-U, a simplified version of Twitter-BTM without the background topic.
For all the above models, we use symmetric Dirichlet priors. The hyperparameters are set as follows: for all the models, we set α = 50/K and β = 0.01; for Twitter-LDA, TwitterUB-LDA and Twitter-BTM, we set γ = 0.5. We run Gibbs sampling for 400 iterations. The perplexity metric is not used in our experiments since it is not a suitable evaluation metric for BTM (Cheng et al., 2014). The first reason is that BTM and LDA optimize different likelihoods. The second reason is that topic models with better perplexity may infer less semantically meaningful topics (Chang et al., 2009).

Topic Coherence
We use PMI-Score (Newman et al., 2010) to quantitatively evaluate the quality of topic components. The PMI of two words $w_i$ and $w_j$ is
$$\mathrm{PMI}(w_i, w_j) = \log \frac{P(w_i, w_j) + \epsilon}{P(w_i)\, P(w_j)}$$
where $\epsilon$ is an extremely small constant (Stevens et al., 2012), which is equal to $10^{-12}$ in this paper. The word probabilities and the co-occurrence probabilities are computed empirically on a large-scale external dataset; here we use the second part of the Twitter dataset. Then, for a topic $t$ and its top $T$ words ranked by topic-word probability $\phi_{t,w}$, the PMI-Score of topic $t$ is computed from the PMI values of the word pairs among these $T$ words. The model's PMI-Score is defined as the mean of all the topics' PMI-Scores.
Table 2 shows the average results over 10 runs of different models. When K = 50, Twitter-BTM outperforms all other models significantly. When K = 100, the PMI-Scores of BTM and Twitter-BTM are very close. BTM-U is worse than BTM; the reason may be that each user's biterm set provides extremely limited word co-occurrence information. Table 3 shows the top 10 words of the topic "food" learned by BTM, BTM-U and Twitter-BTM when K = 50. We use italic fonts to indicate background words labeled by human judgement. Compared with BTM and BTM-U, Twitter-BTM ranks those background words lower, which demonstrates that the representative words learned by Twitter-BTM are more coherent and meaningful.
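A minimal sketch of the coherence computation is given below. It assumes the smoothed form $\log\big((P(w_i, w_j) + \epsilon) / (P(w_i) P(w_j))\big)$ for pairwise PMI and, as a further assumption, normalizes a topic's score as the mean over all pairs of its top words; the paper's exact normalization may differ. The probability tables `word_prob` and `pair_prob` stand in for statistics estimated on the external dataset, and all names are hypothetical.

```python
from itertools import combinations
from math import log

EPSILON = 1e-12  # smoothing constant, as stated in the text


def pmi(w_i, w_j, word_prob, pair_prob):
    """PMI of two words from empirical probabilities on an external corpus."""
    # Pairs never seen together get joint probability 0 before smoothing.
    joint = pair_prob.get(frozenset((w_i, w_j)), 0.0)
    return log((joint + EPSILON) / (word_prob[w_i] * word_prob[w_j]))


def topic_pmi_score(top_words, word_prob, pair_prob):
    """Mean PMI over all pairs of a topic's top words (assumed normalization)."""
    pairs = list(combinations(top_words, 2))
    return sum(pmi(a, b, word_prob, pair_prob) for a, b in pairs) / len(pairs)


def model_pmi_score(topics_top_words, word_prob, pair_prob):
    """Mean of the per-topic PMI-Scores."""
    scores = [topic_pmi_score(t, word_prob, pair_prob) for t in topics_top_words]
    return sum(scores) / len(scores)
```

With $\epsilon = 10^{-12}$, a pair that never co-occurs in the external corpus receives a large negative PMI, so topics whose top words rarely appear together are penalized heavily.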

Document Representation
Topic models are powerful dimension reduction methods for texts. Given a tweet d, we can infer its probability distribution over K topics with the inference formula given in Section 3. We use a document classification task (Cheng et al., 2014) and a document clustering task (Duan et al., 2012) to measure the quality of the documents' topic proportions. Tweets have no explicit label information, but some tweets are manually labeled by their authors with one or more hashtags (a type of label whose form is "#keyword") to indicate the topics the tweets involve. We follow previous works (Cheng et al., 2014; Wang et al., 2014) and use hashtags as the tweets' labels. Table 4 lists 38 manually selected frequent hashtags (each appearing in at least 100 tweets) in our dataset that relate to a certain topic or event. In the following experiments, we use the tweets from our original data that contain exactly one of the hashtags in Table 4. When we infer a tweet's topic distribution, the hashtag is ignored, because it does not make sense to use the label information to construct the feature vector directly.
Table 4: aaliyah, afghanistan, beatcancer, birding, blogtalkradio, digguser, dmv, dontyouhate, fact, giladshalit, gno, gov, green, haiku, healthcare, honduras, india, iranelection, jazz, jesus, krp, lgbt, mindsetshift, nfl, nn, oink, rhoa, slaughterhouse, socialmedia, tech, travel, trueblood, vegan, vegas, voss, weeklyfitnesschallenge, wordpress, yyj

Figure 2: Performance of classification

We classify these selected tweets with the Random Forest classifier (Breiman, 2001) implemented in the sklearn² Python module, using 10-fold cross validation. Using accuracy as the evaluation metric, we report the classification performance of different topic models in Figure 2. As the topic number K increases, the accuracies of all the models tend to increase. BTM is worse than all other models, which confirms the effectiveness of user-based aggregation. Twitter-BTM and BTM-U always outperform LDA-U, Twitter-LDA and TwitterUB-LDA. Twitter-BTM's accuracy is a little higher than that of BTM-U, which demonstrates that the background topic helps to capture a more accurate topic representation of documents.
We adopt the k-means algorithm implemented in the sklearn Python module as our clustering method. The number of clusters is set to 38. Since the ground-truth class assignment of each tweet is known, Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) are used as cluster validation indices in our experiments. As shown in Figure 3 and Figure 4, Twitter-BTM achieves higher ARI and NMI values than the other models, and BTM performs worse than all other models.
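The evaluation protocol can be sketched with scikit-learn as follows. The data here is synthetic: `X` is a stand-in for the inferred topic distributions of tweets and `y` for their hashtag labels, with toy sizes rather than the 38 hashtags of the experiments. Classifier and clustering settings other than 10-fold cross validation and the number of clusters are assumptions, since the text does not report them.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_classes, n_topics, n_per_class = 4, 10, 50  # toy sizes

# Synthetic topic distributions: each class scatters around its own center.
centers = rng.dirichlet(np.ones(n_topics) * 0.3, size=n_classes)
X = np.vstack([rng.dirichlet(c * 50 + 0.1, size=n_per_class) for c in centers])
y = np.repeat(np.arange(n_classes), n_per_class)

# Classification: Random Forest accuracy with 10-fold cross validation.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
accuracy = cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()

# Clustering: k-means with one cluster per label, scored against the labels.
km = KMeans(n_clusters=n_classes, n_init=10, random_state=0)
clusters = km.fit_predict(X)
ari = adjusted_rand_score(y, clusters)
nmi = normalized_mutual_info_score(y, clusters)
```

Both ARI and NMI compare the cluster assignment with the known labels and do not depend on how the cluster ids are numbered, which is why they can be applied to k-means output directly.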

Conclusion
In this paper, we investigate the problem of topic modeling over short texts with the user factor. User individualities are sacrificed to obtain the corpus-level word co-occurrence patterns in BTM. However, unlike for LDA, simple user-based aggregation reduces topic coherence for BTM. To address this problem, we propose Twitter-BTM, which loosens BTM's inappropriate assumption that two co-occurring words must have the same topic label by leveraging user-based aggregation and incorporating a background topic. The experimental results show that Twitter-BTM substantially outperforms BTM.
In the future, we plan to study the influence of other factors, such as temporal information, on BTM and its variants.

² See http://scikit-learn.org/stable/