Hashtag Recommendation Using Dirichlet Process Mixture Models Incorporating Types of Hashtags

In recent years, the task of recommending hashtags for microblogs has received increasing attention, and various methods have been proposed to study the problem from different aspects. However, most recent studies have not considered the differences in the types or uses of hashtags. In this paper, we introduce a novel nonparametric Bayesian method for this task: based on Dirichlet Process Mixture Models (DPMM), we incorporate the type of hashtag as a hidden variable. The results of experiments on data collected from a real-world microblogging service demonstrate that, by taking these aspects into consideration, the proposed method outperforms state-of-the-art methods that do not, with a relative improvement of around 12.2% in F1-score.


Introduction
Hashtags are used to mark keywords or topics in a microblog. Over the past few years, social media services have become some of the most important communication channels for people. According to statistics reported by the Pew Research Center's Internet & American Life Project on August 5, 2013, about 72% of adult internet users are members of at least one social networking site. Hence, microblogs have been widely used as data sources for public opinion analysis (Bermingham and Smeaton, 2010; Jiang et al., 2011), prediction (Asur and Huberman, 2010; Bollen et al., 2011), reputation management (Pang and Lee, 2008; Otsuka et al., 2012), and many other applications (Sakaki et al., 2010; Becker et al., 2010; Guy et al., 2010; Guy et al., 2013).
In addition to their limited content length, microblogs also contain a form of metadata tag, the hashtag, which is a string of characters preceded by the hash symbol (#). Hashtags are used to mark the keywords or topics of a microblog and can occur anywhere in it: at the beginning, middle, or end. Hashtags have proven useful for many applications, including microblog retrieval (Efron, 2010), query expansion (Bandyopadhyay et al., 2011), and sentiment analysis (Davidov et al., 2010; Wang et al., 2011). However, only a small percentage of microblogs contain hashtags provided by their authors. Hence, the task of recommending hashtags for microblogs has become an important research topic and has received considerable attention in recent years. Existing works have studied discriminative models (Ohkura et al., 2006; Heymann et al., 2008) and generative models (Blei and Jordan, 2003; Krestel et al., 2009; Ding et al., 2013; Godin et al., 2013) based on the textual information of a single microblog.
Since microblog users are free to create and use their own hashtags, they may select hashtags for different purposes. Based on an analysis of hashtags crawled from a real online service, we observe that hashtags are used for events, conferences, conversations, disasters, memes, recall, quotes, and so on. To illustrate this, consider the following examples:
Example 1: #Apple iOS 9 includes music feature, new security and support for older iPhones.
Example 2:#BREAKING: Missing cyclist Natalie Donoghue has been found alive after she went missing in the Hunter Valley.
We can see that the hashtag #Apple used in Example 1 summarizes the main topic of the corresponding microblog, while the hashtag #BREAKING in Example 2 is used as a label of the microblog. These different uses greatly impact the strategy of hashtag recommendation; however, there have been relatively few studies that take this issue into consideration.
In this paper, we propose a novel nonparametric Bayesian method for this problem. Inspired by the methods proposed by Liu et al. (2012), we assume that the hashtags and the textual content of the corresponding microblog are parallel descriptions of the same thing in different languages, and we adapt a translation model with topic distributions to this task. Because of the ability of Dirichlet Process Mixture Models (DPMM) (Antoniak, 1974; Ferguson, 1983) to handle an unbounded number of topics, the proposed method extends them. Based on the different uses of hashtags, we incorporate the type of hashtag into the DPMM as a hidden variable.
The main contributions of this work can be summarized as follows: • Through analyzing microblogs, we identify the problem of the influence of hashtag types on recommendation.
• We adopt a nonparametric Bayesian method for the hashtag recommendation task that also takes the types of hashtags into consideration.
• Experimental results on the dataset we construct from a real microblogging service show that the proposed method achieves significantly better performance than state-of-the-art methods.

The Proposed Method
In this section, we first briefly describe the Dirichlet process (DP) and Dirichlet Process Mixture Models (DPMM). Then, we detail the proposed hashtag recommendation method.

Dirichlet Process
The Dirichlet process (DP) is a distribution over distributions. A DP, denoted by G ∼ DP(α, H), is parameterized by a base measure H and a concentration parameter α. After these basic definitions, we present two different perspectives on the Dirichlet process.
One perspective on the Dirichlet process is the stick-breaking construction, which builds a probability mass function {β_k}, k = 1, 2, ..., on a countably infinite set, where the discrete probabilities are defined as follows:

    v_k ∼ Beta(1, α),    β_k = v_k · ∏_{l=1}^{k−1} (1 − v_l).    (1)

The k-th weight is a random proportion v_k of the stick remaining after the previous (k − 1) weights have been defined. This stick-breaking construction is generally denoted by β ∼ GEM(α) (GEM stands for Griffiths, Engen and McCloskey). A random draw G ∼ DP(α, H) can be expressed as:

    G = ∑_{k=1}^{∞} β_k · δ_{θ_k},    θ_k ∼ H,    (2)

where δ_θ is a probability measure concentrated at θ.

A second perspective on the Dirichlet process is provided by the Pólya urn scheme (Blackwell and MacQueen, 1973), which characterizes draws from G. Let θ_1, θ_2, ... be a sequence of independent and identically distributed (i.i.d.) random variables distributed according to G. Blackwell and MacQueen (1973) showed that the conditional distribution of θ_i given θ_1, ..., θ_{i−1} has the following form:

    θ_i | θ_1, ..., θ_{i−1} ∼ (1 / (i − 1 + α)) · ∑_{j=1}^{i−1} δ_{θ_j} + (α / (i − 1 + α)) · H.    (3)

Eq. (3) shows that θ_i has positive probability of being equal to one of the previous draws. Using φ_1, ..., φ_K to represent the distinct values taken on by θ_1, ..., θ_{i−1}, Eq. (3) can be re-expressed as:

    θ_i | θ_1, ..., θ_{i−1} ∼ ∑_{k=1}^{K} (m_k / (i − 1 + α)) · δ_{φ_k} + (α / (i − 1 + α)) · H,    (4)

where m_k is the number of previous draws with θ_{i′} = φ_k for 1 ≤ i′ < i.
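The re-expressed conditional in Eq. (4) is exactly the Chinese restaurant process: a new draw joins an existing value φ_k with probability proportional to m_k, or takes a fresh draw from H with probability proportional to α. A minimal simulation sketch of this seating process (the function and variable names are ours, for illustration only):

```python
import random

def crp_assignments(n, alpha, seed=0):
    """Sample component assignments for n draws under a Chinese
    restaurant process with concentration parameter alpha."""
    rng = random.Random(seed)
    counts = []        # m_k: number of previous draws at each table
    assignments = []
    for i in range(n):
        r = rng.uniform(0.0, i + alpha)   # i draws seated so far
        acc = 0.0
        for k, m in enumerate(counts):    # join table k w.p. m_k / (i + alpha)
            acc += m
            if r < acc:
                counts[k] += 1
                assignments.append(k)
                break
        else:                             # open a new table w.p. alpha / (i + alpha)
            counts.append(1)
            assignments.append(len(counts) - 1)
    return assignments

tables = crp_assignments(100, alpha=1.0)
```

Larger α yields more distinct tables on average, mirroring the role of the concentration parameter in Eq. (4).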

Dirichlet Process Mixture Models
In nonparametric Bayesian statistics, DPs are commonly used as prior distributions for mixture models with an unknown number of components. Let F(θ_i) denote the distribution of the observation x_i given θ_i. Given G ∼ DP(α, H), each observation x_i in an exchangeable data set x is generated by first choosing a parameter and then sampling the observation:

    θ_i ∼ G,    x_i ∼ F(θ_i).

This model is referred to as a Dirichlet process mixture model. The process is often described by a set z of independently sampled variables z_i ∼ Mult(β) indicating the component of the mixture G(θ) associated with each data point x_i ∼ F(θ_{z_i}). Then we can write the generative process as:

    β ∼ GEM(α),    z_i ∼ Mult(β),    φ_k ∼ H,    x_i ∼ F(φ_{z_i}).
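Combining the urn scheme with a likelihood F yields a direct simulation of a DP mixture. The sketch below uses a Gaussian base measure H and a Gaussian F purely for illustration (in the paper, F is a distribution over words); all names are our own:

```python
import random

def dpmm_sample(n, alpha, seed=1):
    """Generate n observations from a DP mixture: theta_i via the
    Polya urn over DP(alpha, H), then x_i ~ F(theta_{z_i}).
    Here H = Normal(0, 3^2) and F(theta) = Normal(theta, 1)."""
    rng = random.Random(seed)
    params = []       # phi_k: distinct parameters drawn from H
    z, x = [], []
    for i in range(n):
        r = rng.uniform(0.0, i + alpha)
        if r < i:                      # reuse a previous draw (urn scheme)
            k = z[int(r)]
        else:                          # new component parameter from H
            params.append(rng.gauss(0.0, 3.0))
            k = len(params) - 1
        z.append(k)
        x.append(rng.gauss(params[k], 1.0))   # observation from F
    return z, x

z, x = dpmm_sample(50, alpha=2.0)
```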

The Generation Process
Let D represent the number of microblogs in the given corpus. A microblog contains a bag of words denoted by w_d = {w_{d1}, w_{d2}, ..., w_{dN_d}}, where N_d is the total number of terms in the microblog. A word is defined as an item from a vocabulary with W distinct words indexed by w = {w_1, w_2, ..., w_W}. Each microblog may have a number of hashtags, denoted by h_d = {h_{d1}, h_{d2}, ..., h_{dM_d}}, where M_d is the number of hashtags in microblog d. Given an unlabeled data set, the task of hashtag recommendation is to discover a list of hashtags for each microblog.
In standard LDA, each document is viewed as a mixture of topics, and each topic has a probability of generating each word. LDA is a generalization of a finite mixture model. Since the DP is the extension of finite mixture models to the nonparametric setting, the corresponding tool for nonparametric topic models is the hierarchical Dirichlet process (HDP). However, both LDA and HDP are normally suited to long documents. For microblogs, which have a limited number of words, a single microblog is most likely to discuss a single topic. Hence, in this work, we assume that each microblog is associated with only one topic. The set of documents is viewed as a mixture of infinitely many topics, and we use a DP as the prior distribution over this mixture.
The main assumptions of our model are as follows. When user u publishes a microblog, he first generates the content and then the hashtags. When constructing the content, he selects a topic based on the topic distribution, then chooses a bag of words one by one, either from the word distribution of that topic or from the background word distribution that captures white noise. Hashtags are chosen according to one of two situations. In the first situation, hashtags summarize the corresponding microblog: the hashtags of a microblog can be generated from its content through the topic-specific alignment probability between words and hashtags. In the second situation, a hashtag is used as a label of the microblog: we recommend hashtags using the words in the microblog, based on the frequency with which words are used as this type of hashtag.
Let π be the probability of choosing a topic word or a background word, and let y_d = {y_dn}, n = 1, ..., N_d, indicate whether each word is a topic word or a background word. θ denotes the topic distribution, φ_k represents the word distribution for topic k, and φ_B represents the word distribution for background words. We use x_dm to represent the type of hashtag h_dm, and z_d to represent the topic of document d. Each hashtag h_dm is then generated according to the translation probability, where ϕ_{x_dm} is the probability alignment table between words and hashtags. The generation process is given in Algorithm 1. The model that does not take the types of hashtags into consideration (NHR* in Figure 1) uses a single table ϕ* ∈ {ϕ1, ϕ2}: if ϕ* = ϕ1, the model considers only the first situation; if ϕ* = ϕ2, only the second type of hashtag is considered.

Learning
We use collapsed Gibbs sampling (Griffiths and Steyvers, 2004) to obtain samples of the hidden-variable assignments and to estimate the model parameters from these samples.
The sampling probability of being a topic/background word for the n-th word in microblog d can be calculated as follows:

    p(y_dn = p | y_¬n, ·) ∝ (N_{¬n,p} + η) · (N^{w_dn}_{¬n,l} + β_w) / (N^{(·)}_{¬n,l} + W · β_w),

[Figure 1: The graphical representation of the proposed model. Shaded circles are observations or constants; unshaded circles are hidden variables. CNHR represents the proposed hashtag recommendation method. NHR* represents the model which does not take the types of hashtags into consideration.]

Algorithm 1 The generation process of CNHR
where l = B when p = 0 and l = z_d when p = 1; N_{¬n,p} is the count of words assigned to background words (p = 0) or to any topic (p = 1), N^{w_dn}_{¬n,B} is the number of times word w_dn is assigned to background words, and N^{w_dn}_{¬n,z_d} is the number of times word w_dn is assigned to topic z_d. All counters are calculated with the current word w_dn excluded.
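Concretely, this conditional multiplies a count-based prior on choosing a background versus topic word by the word's smoothed probability under the corresponding word distribution. A simplified sketch of one such draw, assuming a symmetric Beta(η) prior on π and a symmetric Dirichlet β_w prior on the word distributions (the function name and argument layout are our own):

```python
import random

def sample_y(n_bg, n_topic, cnt_w_bg, cnt_bg_total,
             cnt_w_topic, cnt_topic_total,
             eta, beta_w, vocab_size, rng):
    """Sample y in {0: background word, 1: topic word} for one token.
    All counts are assumed to exclude the current token (the "not-n" terms)."""
    p0 = (n_bg + eta) * (cnt_w_bg + beta_w) / (cnt_bg_total + vocab_size * beta_w)
    p1 = (n_topic + eta) * (cnt_w_topic + beta_w) / (cnt_topic_total + vocab_size * beta_w)
    r = rng.uniform(0.0, p0 + p1)
    return 0 if r < p0 else 1

rng = random.Random(42)
y = sample_y(n_bg=30, n_topic=70, cnt_w_bg=1, cnt_bg_total=300,
             cnt_w_topic=9, cnt_topic_total=700,
             eta=0.01, beta_w=0.1, vocab_size=5000, rng=rng)
```

In the full sampler, the count arguments are maintained incrementally as tokens are reassigned.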
We sample z_d for microblog d using the following equation:

    p(z_d | z_¬d, w, y, α, β_w) ∝ p(z_d | z_¬d, α) · p(w_d | z, w_¬d, y, β_w).    (6)

We can represent p(z_d | z_¬d, α) with the CRP described in the previous section. Since z_1, z_2, ... are exchangeable, we can treat the d-th variable z_d as the last observation and obtain:

    p(z_d = k | z_¬d, α) = N_k^{¬d} / (N^{(·)}_{¬d} + α),    p(z_d = k̄ | z_¬d, α) = α / (N^{(·)}_{¬d} + α),

where k is an existing topic and k̄ is a new topic, N_k^{¬d} is the number of microblogs assigned to topic k, N^{(·)}_{¬d} is the total number of microblogs, and α is the concentration parameter. All counters are calculated with the current microblog d excluded.
If z_d equals an existing topic, z_d = k, then we can calculate p(w_d | z, w_¬d, y, β_w) by:

    p(w_d | z_d = k, w_¬d, y, β_w) = ∫ ( ∏_{1≤n≤N_d, y_dn=1} p(w_dn | φ_k) ) · p(φ_k | w_¬d, β_w) dφ_k,

where p(w_dn | φ_k) is the density of word w_dn given topic k, w_d are the words in microblog d, and h(φ_k) is the density of the base measure H.
If z_d is a new topic, z_d = k̄, then we can calculate p(w_d | z_d = k̄, w_¬d, y, β_w) by:

    p(w_d | z_d = k̄, w_¬d, y, β_w) = ∫ p(w_d | φ_k̄) · h(φ_k̄) dφ_k̄,

where p(w_d | φ_k̄) = ∏_{1≤n≤N_d, y_dn=1} p(w_dn | φ_k̄).
We can calculate the probabilities of generating hashtags in the two situations as follows: where M^{k,¬d}_{w_dn} is the total number of occurrences of word w_dn under topic k, and M^{k,¬d}_{w_dn,2} is the number of times word w_dn is recommended as the second type of hashtag given topic k. All counters with ¬d are calculated with the current microblog w_d excluded.
We sample the indicator variable x_dm for the m-th hashtag in microblog d as follows: where N^{¬dm}_{x_dm} is the number of hashtags generated by type x_dm, N^{¬dm} is the total number of hashtags, and the counters with ¬dm are calculated with the current hashtag excluded.
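The draw of x_dm is a categorical sample whose weights multiply the count-based prior (N^{¬dm}_{x} + σ) by the type-specific probability of producing the hashtag, supplied by the two generation formulas above. A hedged sketch (the names are ours; the likelihood values are precomputed by the caller):

```python
import random

def sample_type(prior_counts, likelihoods, sigma, rng):
    """Draw a type index x with probability proportional to
    (N_x + sigma) * p(hashtag | type x); counts exclude the current hashtag."""
    weights = [(n + sigma) * lik for n, lik in zip(prior_counts, likelihoods)]
    r = rng.uniform(0.0, sum(weights))
    acc = 0.0
    for x, w in enumerate(weights):
        acc += w
        if r < acc:
            return x
    return len(weights) - 1

rng = random.Random(7)
x = sample_type([120, 80], [0.004, 0.001], sigma=0.01, rng=rng)
```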
After enough sampling iterations to burn in the Markov chain, ϕ1 and ϕ2 are estimated from the samples. The potential size of the probability alignment table ϕ1 between hashtags and words is W · V · K, so data sparsity poses a more serious problem in estimating ϕ1 than in the topic-free word alignment case. We therefore apply an interpolation smoothing technique to ϕ1:

    ϕ1*_{h,k,w} = γ · ϕ1_{h,k,w} + (1 − γ) · P(h|w),    (13)

where ϕ1*_{h,k,w} is the smoothed topical alignment probability, ϕ1_{h,k,w} is the original topical alignment probability, and P(h|w) is the topic-free word alignment probability, which we obtain with IBM model-1 (Brown et al., 1993). γ is a trade-off between the two probabilities, ranging from 0 to 1. When γ = 0, ϕ1*_{h,k,w} reduces to the topic-free word alignment probability; when γ = 1, there is no smoothing in ϕ1*_{h,k,w}.
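The smoothing step is a per-cell linear interpolation. A sketch over nested dictionaries, assuming ϕ1 is keyed as hashtag → topic → word and the topic-free table P(h|w) as hashtag → word (the data layout is our assumption):

```python
def smooth_alignment(phi1, p_topic_free, gamma):
    """phi1*[h][k][w] = gamma * phi1[h][k][w] + (1 - gamma) * P(h|w).
    gamma = 1 keeps the topical table unchanged; gamma = 0 falls back
    entirely to the topic-free IBM-1 probabilities."""
    return {h: {k: {w: gamma * p + (1.0 - gamma) * p_topic_free[h][w]
                    for w, p in by_word.items()}
                for k, by_word in by_topic.items()}
            for h, by_topic in phi1.items()}

phi1 = {"#tech": {0: {"iphone": 0.8}}}
ibm1 = {"#tech": {"iphone": 0.2}}
smoothed = smooth_alignment(phi1, ibm1, gamma=0.8)
```

With γ = 0.8 the smoothed cell above is 0.8 · 0.8 + 0.2 · 0.2 = 0.68, pulling rare topical alignments toward their topic-free estimates.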

Hashtag Recommendation
Given an unlabeled dataset, we first discover the topic and determine topic/background words for each microblog. Collapsed Gibbs sampling is again applied for inference. The process is almost the same as the model learning described in the previous section; the difference is that there are no hashtags in the unlabeled dataset. Hence, when sampling z_d for microblog d, we use the following equation: Since there are no differences between the word alignments with each hashtag for a new topic in the unlabeled dataset, after the hidden variables of topic/background words and the topic of each microblog become stable, we only need to estimate the distribution over the topics that exist in the training dataset. We can then estimate the distribution of topics for microblog d in the unlabeled data by: where p(w_dn | k) is computed from the count of word w_dn assigned to topic k in the corpus, p(k) = N_k / (N^{(·)} + α) is regarded as a prior for the topic distribution, and Z is the normalization factor. With the topic distribution χ and the topic-specific word alignment table ϕ*, we can rank hashtags for microblog d in the unlabeled data through the following equation: where C is the number of hashtag types, p(w_dn | w_d) is the weight of the word w_dn in the microblog content w_d, which can be estimated by the IDF score of the word, and p(x_dm) is the probability of the hashtag belonging to type x_dm, which we estimate with Eq. (11). Based on the ranking scores, we suggest the top-ranked hashtags for each microblog.
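The ranking step can be sketched as the nested sum described above: for each candidate hashtag, accumulate type prior × topic weight × alignment probability × word weight. The sketch below stores both types in one topic-indexed table for brevity (all names are ours, not the paper's notation):

```python
def rank_hashtags(words, word_weight, topic_dist, type_prior, align, topk=3):
    """Return the topk hashtags by score(h) =
    sum over types x, topics k, words w of
    p(x) * chi(k) * align[x][k][w][h] * weight(w)."""
    scores = {}
    for x, p_x in type_prior.items():
        for k, chi_k in topic_dist.items():
            for w in words:
                for h, p in align.get(x, {}).get(k, {}).get(w, {}).items():
                    scores[h] = scores.get(h, 0.0) + p_x * chi_k * p * word_weight.get(w, 1.0)
    return sorted(scores, key=scores.get, reverse=True)[:topk]

# Toy example: one type, one topic, one content word.
align = {1: {0: {"iphone": {"#apple": 0.6, "#tech": 0.3}}}}
top = rank_hashtags(["iphone"], {"iphone": 1.2}, {0: 1.0}, {1: 1.0}, align, topk=2)
```

Here `top` holds the highest-scoring hashtags for the microblog.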

Data Collection
We use a dataset collected from Sina Weibo 1, which provides a Twitter-like service and is one of the most popular microblogging services in China, to evaluate the proposed approach and the alternative methods. The original data set contains 282.2 million microblogs posted by around 1.1 million users. These microblogs were obtained by starting from a set of seed users and following their follower/followee relations. We extract the microblogs posted with hashtags between Jan. 2012 and July 2013; 1,118,792 microblogs are selected for this work. The number of unique hashtags in the corpus is 305,227. We randomly select 100K microblogs as training data, 10K as development data, and 10K as the test set. The hashtags marked in the original microblogs are treated as the gold standard.

Experiment Configurations
We use precision (P), recall (R), and F1-score (F1) to evaluate the performance. Precision is the percentage of "hashtags truly assigned" among "hashtags assigned by the system". Recall is the percentage of "hashtags truly assigned" among "hashtags manually assigned". F1 is the harmonic mean of precision and recall. We run 500 iterations of Gibbs sampling to train the model. The hyperparameters of the proposed method and the alternative methods are optimized on the development data set. In this work, the concentration parameter α is given a Gamma(5, 0.5) prior. The other hyperparameter settings are: β_w = 0.1, β_h = 0.1, η = 0.01, and σ = 0.01. The smoothing factor γ in Eq. (13) is set to 0.8. For estimating the translation probability without topical information, we use GIZA++ 1.07 (Och and Ney, 2003).
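For reference, precision, recall, and F1 at top-k can be computed as below. We micro-average counts over microblogs; whether the paper micro- or macro-averages is not stated, so this is one plausible reading:

```python
def prf_at_k(recommended, gold, k):
    """Precision/recall/F1 when suggesting the top-k hashtags per microblog
    against the author-provided gold hashtags (micro-averaged)."""
    tp = fp = fn = 0
    for rec, g in zip(recommended, gold):
        rec_k, g = set(rec[:k]), set(g)
        tp += len(rec_k & g)      # hashtags truly assigned
        fp += len(rec_k - g)      # assigned by system but wrong
        fn += len(g - rec_k)      # gold hashtags that were missed
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

p, r, f1 = prf_at_k([["a", "b"]], [["a", "c"]], k=2)
```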
Since the hashtag recommendation task can also be modeled as a classification problem, we compare the proposed model with the following alternative methods: • Naive Bayes (NB): We formulate hashtag recommendation as a binary classification task and apply NB to model the posterior probability of each hashtag given a microblog.
• Support Vector Machine (SVM): Similar to Naive Bayes, each hashtag can be regarded as one label and we use SVM to classify these microblogs.
• Translation model (IBM-1): IBM model 1 is directly applied to obtain the alignment probability between the words and the hashtags.
• Topical translation model (TTM): Ding et al. (2013) proposed TTM for hashtag extraction. We implemented and extended their method for evaluation on the corpus constructed in this work. The number of topics in TTM is set to 20, and α is set to 50/K. The hyperparameters used in TTM are also selected based on the development data set.

Table 1 shows the comparison of the proposed method with the state-of-the-art methods on the constructed evaluation dataset. "CNHR" denotes the method proposed in this paper. "NHR1" is a degenerate variant of CNHR in which we consider all hashtags to be generated from distribution ϕ1; "NHR2" is a variant in which we consider all hashtags to be generated from distribution ϕ2. From the results, we observe that the discriminative methods achieve worse results than the generative methods; we believe the large number of hashtags is one of the main reasons for their low performance. From the results in Table 1, we also observe that the proposed method achieves significantly better performance than the existing methods. The relative improvement of CNHR over TTM is around 12.7% in F1. The performance of TTM is similar to that of NHR1: the two methods are similar to each other except that TTM is based on LDA while NHR1 is adapted from DPMM. The results demonstrate the advantage of using DPMM over LDA, as it does not need prior knowledge about the number of topics. Comparing the results of CNHR with those of NHR1 and NHR2, which do not take the types of hashtags into consideration, we can see that the proposed method benefits substantially from incorporating hashtag types. Figure 2 shows the Precision, Recall, and F1 curves of NB, IBM-1, SVM, TTM, NHR1, NHR2 and CNHR on the test data. Each point of a curve represents the recommendation of a different number of hashtags, ranging from 1 to 5.
The highest curve indicates the best performance. Based on the results, CNHR is the highest in all of the curves, which indicates that the proposed method performs significantly better than the other methods.

Experimental Results
In TTM, the number of topics K is also a crucial factor. Table 2 shows the impact of the number of topics. From the table, we observe that TTM obtains its best performance when K is set to 20, and performance decreases as the number of topics grows. We believe data sparsity is one of the main reasons: with many more topics, the sparsity problem becomes more serious when estimating topic-specific translation probabilities. We compare our method with the best performance of TTM.
The proposed method CNHR includes a smoothing parameter γ. To evaluate its impact, Figure 3 shows the influence of the translation-probability smoothing parameter γ. When γ is set to 0.0, the topical information is omitted. Comparing the results of γ = 0.0 with other values, we observe that the topical information benefits this task. When γ is set to 1.0, the method runs without smoothing; the results indicate that it is necessary to address the sparsity problem through smoothing.

Related Works
Due to the usefulness of tag recommendation, many methods have been proposed from different perspectives (Heymann et al., 2008; Krestel et al., 2009; Rendle et al., 2009; Liu et al., 2012; Ding et al., 2013). Heymann et al. (2008) investigated the tag recommendation problem using data collected from a social bookmarking system; they introduced an entropy-based metric to capture the generality of a particular tag. In (Song et al., 2008), a method based on Poisson Mixture Models was introduced for the tag recommendation task. Krestel et al. (2009) used Latent Dirichlet Allocation to elicit a shared topical structure from the collaborative tagging of multiple users for recommending tags. Ding et al. (2013) proposed to model this task as a translation process; they extended the translation-based method and introduced a topic-specific translation model to handle the various meanings of words in different topics. Based on the observation that similar web pages tend to have the same tags, Lu et al. (2009) proposed a method taking both tag information and page content into account. In (Tariq et al., 2013), discriminative term weights were used to establish topic-term relationships, from which users' perceptions were learned to suggest suitable hashtags. To handle the vocabulary problem in the keyphrase extraction task, Liu et al. (2012) proposed a topical word trigger model, which treats keyphrase extraction as a translation process with latent topics.
Most of the works mentioned above are based on textual information. Besides these methods, personalized methods for different recommendation tasks have also received considerable attention (Liang et al., 2007; Shepitsen et al., 2008; Garg and Weber, 2008; Li et al., 2010; Liang et al., 2010; Rendle and Schmidt-Thieme, 2010; Huang et al., 2012). Shepitsen et al. (2008) proposed to use hierarchical agglomerative clustering to take personalized navigation context into account in cluster selection. In (Garg and Weber, 2008), the problem of personalized, interactive tag recommendation was studied based on the statistics of tag co-occurrence. Liang et al. (2010) proposed to use the multiple relationships among users, items, and tags to find the semantic meaning of each tag for each user individually, and used this information for personalized item recommendation.
From the brief descriptions given above, we can observe that most of the previous works on hashtag suggestion did not take the types of hashtags into consideration. In this work, we propose to incorporate it into the generative methods.

Conclusions
In this paper, we study the problem of hashtag recommendation for microblogs. Since existing translation-model-based methods for this task regard all hashtags as generated from the same distribution, we propose a novel method that incorporates different distributions for different types of hashtags into the topical translation model. To evaluate the proposed method, we construct a dataset from a real-world microblogging service. The results of experiments on the constructed dataset demonstrate that the proposed method outperforms state-of-the-art methods that do not consider these aspects.

This work was partially funded by the National Natural Science Foundation of China (No. 61473092 and 61472088), the National High Technology Research and Development Program of China (No. 2015AA011802), and Shanghai Science and Technology Development Funds (13dz226020013511504300).