Leveraging Meta Information in Short Text Aggregation

Short texts such as tweets often contain insufficient word co-occurrence information for training conventional topic models. To deal with this insufficiency, we propose a generative model that aggregates short texts into clusters by leveraging the associated meta information. Our model can generate more interpretable topics as well as document clusters. We develop an effective Gibbs sampling algorithm, facilitated by the full local conjugacy in the model. Extensive experiments demonstrate that our model achieves better performance in terms of document clustering and topic coherence.


Introduction
Texts generated on the internet (e.g., tweets, news headlines and product reviews) are usually short, which means that each individual document contains insufficient word co-occurrence information. Many existing topic models like Latent Dirichlet Allocation (LDA) (Blei et al., 2003) and its variants infer topics purely based on the word occurrence information, which often results in degraded performance and makes those models incapable of learning from short texts.
Recently, many research efforts have been devoted to analysing short texts. A common strategy is to aggregate short texts into clusters and then apply topic models to those clusters. The clusters are expected to pool the word co-occurrence information of their assigned documents. One widely-used option is known as self-aggregation, where short texts are aggregated according to their contextual information. For example, the contextual information of a document can be encoded by its topics, so that the topic assignments can be used for aggregation. This line of research includes models such as SATM (Quan et al., 2015), LTM (Li et al., 2018a), and PTM (Zuo et al., 2016a). On the other hand, many short texts, like tweets, often come with meta information (meta-info for short, also known as meta-data or side information), such as authors, categories, hashtags, and timestamps. Therefore, another popular option is to aggregate short texts according to their meta-info. For example, we can assume that tweets published by the same users (Hong and Davison, 2010; Zhao et al., 2011) or with the same hashtags (Mehrotra et al., 2013) are likely to discuss similar topics, and aggregate those tweets into the same clusters.
Although the above two aggregation schemes have yielded prominent results on short text analysis, there is still room for improvement. For example, in tweet analysis: if we ignore the associated meta-info, as in the self-aggregation scheme, we may lose important information; on the other hand, it may not be ideal to simply aggregate tweets according to one kind of meta-info such as hashtags, because the numbers of tweets under different hashtags may differ greatly and the diversity of the tweets labelled with a single hashtag can be dramatic. In this paper, we are interested in developing a principled way of incorporating meta-info directly into the generative process of a self-aggregation model, so that we can take advantage of both aggregation schemes in one integrated model. We present the Meta-Info Guided Aggregation (MIGA) model, a new self-aggregation model whose aggregation process is guided by the meta-info associated with each individual short text. Specifically, MIGA aggregates short texts according to two factors: whether those texts have similar content and whether they share similar meta-info. The proposed model assumes that the more short texts share meta-info and discuss similar topics, the more likely they are assigned to the same cluster. Moreover, MIGA automatically balances the two factors in a principled way. The flexibility of the MIGA framework also allows us to leverage hierarchical meta-info and/or pre-trained word embeddings to further improve performance.

Related Work
In addition to the aforementioned aggregation or pooling based models, another popular research direction for short text topic modelling is to use word correlations or word embeddings to enhance topic models. For example, Biterm Topic Model (BTM) (Yan et al., 2013) and Relational BTM (Li et al., 2018b) directly model the generation of word co-occurrence pairs; Fu et al. (2016) combine the ideas of LFLDA and the Topical Word Embedding (TWE) model (Liu et al., 2015); Gaussian LDA (GLDA) (Das et al., 2015) directly generates word embeddings from Gaussian distributions; Xun et al. (2016) use an alternative background model to complement the Gaussian topics in GLDA. GPUDMM, GPUPDMM (Li et al., 2017), and SeaNMF (Shi et al., 2018) utilise word semantic relations computed from pre-trained word embeddings. MetaLDA (Zhao et al., 2017c, 2018a), WEIFTM (Zhao et al., 2017b), and WEDTM (Zhao et al., 2018c) leverage binary or real-valued word embeddings in the topic-word distributions. Without using external word embeddings, DirBN (Zhao et al., 2018b) can be viewed as a self-aggregation model which aggregates the word co-occurrence information with a multi-layer structure.
The proposed model, MIGA, falls into the category of aggregation-based models. Compared with others in this line, the major novelty of MIGA is that it considers both the meta-info and the content of short texts in the aggregation process, while existing models take only one of the two factors into account. Short text models with word embeddings may face problems when the contextual information of the external word embeddings is not consistent with that of the target corpus; for example, word embeddings trained on large general corpora may not be suitable for a specialised target corpus. Compared with those models, MIGA does not rely on external information about words. Moreover, where applicable, MIGA can also be flexibly extended with hierarchical meta-info and word embeddings.

The Proposed Model
Given a set of D short documents, the existing self-aggregation methods such as PTM (Zuo et al., 2016a) assume that each document d ∈ {1, · · · , D} belongs to one of M latent clusters. Each cluster accumulates the word counts of its assigned documents and thus contains more sufficient word co-occurrence information than an individual document. To generate the i-th (i ∈ {1, · · · , N_d}) word w_{d,i} in document d with N_d words, we first sample d's cluster assignment c_d = m ∈ {1, · · · , M} according to its doc-cluster distribution ψ_d ∈ R^M_+; we then sample a topic z_{d,i} ∈ {1, · · · , K} for word w_{d,i} from the cluster's topic distribution θ_m ∈ R^K_+, and finally draw w_{d,i} from the topic-word distribution φ_{z_{d,i}} ∈ R^V_+, where V is the size of the vocabulary of the target corpus.
Different from the existing self-aggregation methods, which impose an uninformative prior on ψ_d, our model draws it from a document-specific prior π_d ∈ R^M_+, constructed from d's meta-info. Assume that there are L unique labels in a corpus and the labels of document d are encoded in a binary vector f_d ∈ {0, 1}^L, where f_{d,l} = 1 indicates that d has label l. This encoding allows a document to have multiple labels. Figure 1 shows the full generative process of MIGA, which can be described as follows:
1. For each latent cluster m ∈ {1, · · · , M}: draw the cluster-topic distribution θ_m ∼ Dir_K(α), and for each label l ∈ {1, · · · , L}, draw λ_{l,m} ∼ Ga(μ_0, ν_0), where Ga(·, ·) is the gamma distribution with shape and rate parameters;
2. For each topic k ∈ {1, · · · , K}: draw the topic-word distribution φ_k ∼ Dir_V(β);
3. For each document d:
(a) compute π_{d,m} = ∏_{l=1}^{L} (λ_{l,m})^{f_{d,l}} for each cluster m;
(b) draw the doc-cluster distribution ψ_d ∼ Dir_M(π_d) and the cluster assignment c_d ∼ Cat(ψ_d);
(c) for each word i ∈ {1, · · · , N_d}: draw the topic z_{d,i} ∼ Cat(θ_{c_d}) and the word w_{d,i} ∼ Cat(φ_{z_{d,i}}).
The main idea of MIGA is the meta-info guided aggregation: instead of putting an uninformative prior on ψ_d, MIGA constructs an informative, document-specific Dirichlet prior with parameter π_d computed from the document's labels. Specifically, in Step 3a above, λ_{l,m} captures the correlation between label l and cluster m. If document d has label l, i.e., f_{d,l} = 1, λ_{l,m} contributes to π_{d,m}, the prior of ψ_{d,m}. This is how the meta-info influences the probability of assigning a document to a cluster. Moreover, in our model, the meta-info only contributes to the prior; the actual value of ψ_{d,m} is eventually determined by both the prior and the evidence (i.e., the content of d), according to Bayes' theorem. The incorporation of meta-info in the Dirichlet prior of our model is related to the ones in Zhao et al. (2017a, 2018d), but those work in different domains.
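As a concrete illustration, the generative process above can be sketched in a few lines of NumPy. All sizes and hyperparameter values below are toy assumptions for illustration, not the paper's actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical): D docs, M clusters, K topics, V words, L labels
D, M, K, V, L = 4, 3, 5, 50, 2
alpha, beta, mu0, nu0 = 0.1, 0.01, 1.0, 1.0

f = rng.integers(0, 2, size=(D, L))               # binary label matrix f_{d,l}
lam = rng.gamma(mu0, 1.0 / nu0, size=(L, M))      # label-cluster weights λ_{l,m} ~ Ga(μ0, ν0)
theta = rng.dirichlet(np.full(K, alpha), size=M)  # cluster-topic distributions θ_m
phi = rng.dirichlet(np.full(V, beta), size=K)     # topic-word distributions φ_k

docs = []
for d in range(D):
    # Step 3a: meta-info guided Dirichlet prior π_{d,m} = ∏_l λ_{l,m}^{f_{d,l}}
    pi_d = np.prod(lam ** f[d][:, None], axis=0)
    psi_d = rng.dirichlet(pi_d)                 # doc-cluster distribution ψ_d
    c_d = rng.choice(M, p=psi_d)                # cluster assignment c_d
    z_d = rng.choice(K, size=10, p=theta[c_d])  # a topic per word from θ_{c_d}
    w_d = np.array([rng.choice(V, p=phi[k]) for k in z_d])
    docs.append((c_d, w_d))
```

Note how a label shared by two documents inflates the same entries of their priors π_d, making them more likely to land in the same cluster, while ψ_d itself is still free to move with the document's content.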
Leveraging hierarchical meta-info: MIGA can be extended to accommodate hierarchical meta-info (e.g., an academic paper labelled with tags "computer science→machine learning→deep learning"). Let us consider a two-layer hierarchy, where the L document labels (i.e., the first-layer labels) are further categorised into a set of L′ super classes (i.e., the second-layer labels). Note that one document label is allowed to belong to multiple super classes, and f_{l,l′} ∈ {0, 1} is used to denote whether or not a first-layer label l belongs to a second-layer label l′. The general idea here is that instead of drawing λ_{l,m} from an uninformative gamma prior as in our original model, we draw it from a prior distribution informed by the second-layer labels, as follows:

λ_{l,m} ∼ Ga(∏_{l′=1}^{L′} (λ_{l′,m})^{f_{l,l′}}, ν_0),

where λ_{l′,m} captures the correlation between the labels at the two layers. Thus, the information of the second-layer labels is propagated down to the assignment process of the documents.

Table 1: Dataset statistics. D: number of documents; V: vocabulary size; avg. N_d: average document length; L: number of labels (L′: number of second-layer labels, for hierarchical datasets).
Tweets (Mehrotra et al., 2013): D = 87,638, V = 24,884, avg. N_d = 11, L = 6
Patents: D = 13,588, V = 3,745, avg. N_d = 9, L′ = 3, L = 10
Web Snippets: D = 12,237, V = 10,052, avg. N_d = 15, L = 8
Stackoverflow (Xu et al., 2015): D = 18,287, V = 2,458, avg. N_d = 5, L = 20
20Newsgroups: D = 10,020, V = 2,000, avg. N_d = 28, L′ = 6, L = 20
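Sampling the first-layer weights from a second-layer-informed prior can be sketched as follows. This is a toy illustration: the product-form gamma shape mirrors the construction of π_d, and all sizes and variable names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

L1, L2, M = 4, 2, 3   # first-layer labels, super classes, clusters (toy sizes)
nu0 = 1.0
f2 = rng.integers(0, 2, size=(L1, L2))    # f_{l,l'}: label l belongs to super class l'
lam2 = rng.gamma(1.0, 1.0, size=(L2, M))  # second-layer weights λ_{l',m}

# Shape of the gamma prior on λ_{l,m}, informed by the super classes of label l:
# shape_{l,m} = ∏_{l'} (λ_{l',m})^{f_{l,l'}}
shape = np.prod(lam2[None, :, :] ** f2[:, :, None], axis=1)  # (L1, M)
lam1 = rng.gamma(shape, 1.0 / nu0)  # first-layer weights λ_{l,m}
```

Labels sharing super classes thus receive correlated priors, so documents tagged with sibling labels are nudged towards the same clusters.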
Leveraging word embeddings: MIGA can be extended to incorporate word embeddings to guide the generation of latent topics. Following the approach introduced in Zhao et al. (2017c), we draw φ_k ∼ Dir_V(β_k), where β_k ∈ R^V_+ is computed with a log-linear model of word embeddings, similar to Step 3a in the generative process.
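A minimal sketch of this extension, assuming binarised word embeddings and a hypothetical gamma-distributed weight matrix `delta` for the log-linear model (the names and sizes are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(2)

K, V, E = 5, 50, 16                       # topics, vocab size, embedding dims (toy)
emb = rng.integers(0, 2, size=(V, E))     # binarised word embeddings e_{v,j}
delta = rng.gamma(1.0, 1.0, size=(E, K))  # embedding-topic weights (hypothetical)

# Log-linear model: β_{k,v} = ∏_j (δ_{j,k})^{e_{v,j}}, i.e. log β linear in embeddings
beta = np.prod(delta[None, :, :] ** emb[:, :, None], axis=1).T  # (K, V)
phi = np.vstack([rng.dirichlet(beta[k]) for k in range(K)])     # φ_k ~ Dir_V(β_k)
```

Words with similar embeddings share similar prior mass β_{k,v} across topics, which is what steers the learnt topics towards semantic coherence.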

Experiments
We evaluate the performance of MIGA on document clustering and topic coherence, comparing against several recent advances in short text topic modelling. We also provide a set of qualitative analyses to demonstrate the interpretability of our model.
The details of the datasets used in the experiments are shown in Table 1. Table 2 shows the purity and NMI scores of document clustering. Note that the original MetaLDA is able to use both document meta-info and word embeddings; here we used its variant with document meta-info only. The scores of KMeans + TFIDF on Tweets are not reported because that model exceeds the memory of our machine.
MIGA outperforms the other models on Tweets, Patents, and Stackoverflow, which are relatively shorter than the other datasets. This demonstrates our model's effectiveness on clustering short texts.
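For reference, the two clustering metrics reported above can be computed as follows; `purity` and `nmi` are hypothetical helper names, and natural logarithms are used throughout:

```python
from collections import Counter
from math import log

def purity(true, pred):
    # Fraction of documents covered by each cluster's majority true class
    clusters = {}
    for t, p in zip(true, pred):
        clusters.setdefault(p, []).append(t)
    return sum(Counter(v).most_common(1)[0][1] for v in clusters.values()) / len(true)

def nmi(true, pred):
    # Normalised mutual information between class and cluster assignments
    n = len(true)
    ct, cp = Counter(true), Counter(pred)
    joint = Counter(zip(true, pred))
    mi = sum(c / n * log(n * c / (ct[t] * cp[p])) for (t, p), c in joint.items())
    ht = -sum(c / n * log(c / n) for c in ct.values())
    hp = -sum(c / n * log(c / n) for c in cp.values())
    return mi / ((ht * hp) ** 0.5) if ht > 0 and hp > 0 else 0.0
```

Both metrics are invariant to the cluster labelling itself, e.g. `purity([0,0,1,1], [1,1,0,0])` is 1.0.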
Topic Coherence: Topic coherence measures the semantic coherence of the most significant (top) words in a topic, and is another commonly-used metric for topic models. Here we used the Normalised Pointwise Mutual Information (NPMI) (Aletras and Stevenson, 2013; Lau et al., 2014) to calculate a topic coherence score over the top 10 words of each topic. Following Yang et al. (2015b), to eliminate rare topics, we report the scores over the top 50% of topics with the largest number of assigned words (i.e., for a topic k, we count the number of words assigned to it: Σ_{d=1}^{D} Σ_{i=1}^{N_d} 𝟙(z_{d,i} = k)). It is known that word embeddings are able to significantly improve topic coherence. Therefore, in this experiment, following Zhao et al. (2017c), we used word embeddings binarised from the pre-trained 50-dimensional GloVe word embeddings (Pennington et al., 2014) in MIGA and MetaLDA, denoted as MIGA-eb and MetaLDA-eb, respectively. For all the models, we set K = 50. For PTM, MIGA, and MIGA-eb, we report the best scores with M varying from 100 to 3000. Table 4 shows the NPMI scores: in general, among the models without word embeddings, MIGA outperforms the others on most datasets. Moreover, word embeddings further help improve the topic coherence of MIGA-eb. It is noteworthy that MIGA and MIGA-eb do not improve NPMI over PTM on Tweets.

Table 3: Left: Topics related to the hierarchical labels in 20Newsgroups. We started with a second-layer label "comp" and found its most related topics by first selecting the most related clusters (by ranking λ_{l′,m}) and then selecting the most related topics (by ranking θ_{m,k}). Next, we looked at the first-layer labels (shown with their full names below) associated with l′, i.e., f_{l,l′} = 1, and found their most related topics in a similar way. Right: Clusters, documents, and topics for the label "Computers" on Web Snippets, discovered by MIGA with K = 100, M = 500. We first selected the clusters most related to "Computers" by ranking λ_{l,m}, and then selected the most related documents and the most related topic in each cluster by ranking π_{d,m} and θ_{m,k}, respectively.

Left sub-table:
comp: 1. use problem using running program; 2. fax university internet mail phone; 3. thanks help edu advance appreciated
comp.graphics: 1. graphics color screen bit mode; 2. card drivers driver video windows; 3. software support looking products product
comp.sys.ibm.pc.hardware: 1. drive disk hard drives floppy; 2. mb card controller ide scsi; 3. mhz board speed port problem
comp.sys.mac.hardware: 1. monitor pc mouse systems box; 2. card modem software internal meg; 3. like buy looking price new good power want low need
comp.windows.x: 1. windows motif application server widget; 2. ftp file program package format; 3. windows printer font version text

Right sub-table:
Cluster 1 (documents and topic words): purchase store cat outstanding shipping crucial memory upgrades ram sdram ddr service pricing com computer memory virtual cache apple manufacture flash graphics crucial memory upgrades cards usb storage ram media computer selection online memory upgrades ram ddr upgrade specialist
Cluster 2 (documents and topic words): reviews box deal processors shipping computers pricegrabber tax customers prices retail com digital camera electronics reviews apple home store shop manufacturer accessories downloads product online apple ipods reviews forums discussion photography camera articles news digital review
Cluster 3 (documents and topic words): country experience altavista languages search comprehensive web programming java code language source browser documentation hall resources lists faqs apl jhu programming compiler tutorials edu sites java books download introduction programming math textbook downloading on-line edu java
There are two possible reasons: the labels of the tweets may not be informative enough for MIGA to learn better clusters, and the vocabulary of this dataset contains many slang terms and abbreviations, which are not included in the corpus used for calculating NPMI.
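The NPMI score used above can be sketched as follows, treating each reference document as a set of words and averaging over word pairs in a topic's top words; `npmi_topic_coherence` is a hypothetical helper name, and zero co-occurrence pairs are assigned the minimum score of −1 by convention:

```python
from math import log

def npmi_topic_coherence(top_words, doc_sets):
    # doc_sets: for each reference document, the set of words it contains
    n = len(doc_sets)

    def p(*ws):
        # Empirical probability that all words in ws co-occur in a document
        return sum(all(w in s for w in ws) for s in doc_sets) / n

    pairs, score = 0, 0.0
    for i in range(len(top_words)):
        for j in range(i + 1, len(top_words)):
            pij = p(top_words[i], top_words[j])
            pi, pj = p(top_words[i]), p(top_words[j])
            if pij == 0:
                score += -1.0  # never co-occur: minimum NPMI
            elif pij == 1.0:
                score += 1.0   # degenerate case: words appear in every document
            else:
                score += log(pij / (pi * pj)) / -log(pij)  # PMI / -log p(i,j)
            pairs += 1
    return score / pairs
```

This also makes the Tweets caveat concrete: slang words absent from the reference corpus yield zero co-occurrence counts and drag the average towards −1.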
Qualitative Analysis: The left sub-table of Table 3 shows the relations between the hierarchical labels and the topics discovered by MIGA in 20Newsgroups. One can see that the topics associated with the second-layer document label, "comp", are more general, describing several broad aspects of computers, while the topics associated with the first-layer labels are relatively more specific. For example, the topics associated with "comp.sys.ibm.pc.hardware" are specific ones describing different aspects of computer hardware.
The right sub-table shows the relations between clusters, documents, and topics discovered by MIGA in Web Snippets. It can be observed that the documents in Web Snippets labelled with "Computers" are quite diverse and can be further clustered into ones related to "hardware", "digital products", "programming languages", and so on. Therefore, simply aggregating those documents into one cluster, as in previous meta-info aggregation models, may not be appropriate. MIGA can discover fine-grained latent clusters, each of which focuses on a different aspect of "Computers" and can intuitively be interpreted by its top topic.

Conclusion
We have presented a new aggregation framework, MIGA, for short text topic analysis. MIGA aggregates short text documents into latent clusters by leveraging meta-info, taking advantage of previous models which perform aggregation according to either the content or the meta-info of short texts. The proposed framework can be easily extended with hierarchical meta-info and word embeddings. The experimental results have shown that MIGA achieves improved performance on document clustering and topic coherence, as well as appealing interpretability. In future work, we would like to investigate how to automatically learn the number of latent clusters with nonparametric Bayesian methods.
A.1 Sampling cluster assignment c_d: Extracting the terms related to the cluster assignments C from the joint distribution in Eq. (3), and integrating out ψ_d and θ_m via the Dirichlet-categorical conjugacy, the conditional probability for Gibbs sampling of c_d can be derived as:

Pr(c_d = m | −) ∝ π_{d,m} · [∏_{k=1}^{K} ∏_{j=1}^{n^{doc}_{d,k}} (n^{cluster,¬d}_{m,k} + α + j − 1)] / [∏_{i=1}^{N_d} (n^{cluster,¬d}_{m,·} + Kα + i − 1)],

where n^{doc}_{d,k} = Σ_{i=1}^{N_d} 𝟙(z_{d,i} = k), 𝟙(·) is the indicator function, and n^{cluster,¬d}_{m,k} is the number of words assigned to topic k in cluster m, excluding the words of document d.

A.2 Sampling topic assignment z_{d,i}: The sampling of z_{d,i} is similar to that of the LDA model:

Pr(z_{d,i} = k | −) ∝ (n^{cluster,¬di}_{c_d,k} + α) · (n^{topic,¬di}_{k,w_{d,i}} + β) / (n^{topic,¬di}_{k,·} + Vβ),

where n^{topic}_{k,v} is the number of times word v is assigned to topic k.

A.3 Sampling λ_{l,m}: As λ_{l,m} is used to construct π_d, which is the prior of ψ_d, according to Eq. (3), we have:

Pr(c_d | π_d) ∝ [Γ(π_{d,·}) / Γ(π_{d,·} + 1)] ∏_{m=1}^{M} π_{d,m}^{𝟙(c_d = m)}.   (9)

According to Zhao et al. (2017c), if we introduce an auxiliary variable q_d ∼ Beta(π_{d,·}, 1), Eq. (9) can be augmented as:

Pr(c_d, q_d | π_d) ∝ q_d^{π_{d,·} − 1} ∏_{m=1}^{M} π_{d,m}^{𝟙(c_d = m)}.

Recalling that π_{d,m} = ∏_{l=1}^{L} (λ_{l,m})^{f_{d,l}}, we can extract the terms related to λ_{l,m} to get:

Pr(λ_{l,m} | −) ∝ λ_{l,m}^{g_{l,m}} exp(λ_{l,m} Σ_{d: f_{d,l}=1} (π_{d,m}/λ_{l,m}) ln q_d) Ga(λ_{l,m}; μ_0, ν_0),

where g_{l,m} = Σ_{d=1}^{D} 𝟙(f_{d,l} = 1 ∧ c_d = m). Given the above equation, we can sample λ_{l,m} from its gamma posterior λ_{l,m} ∼ Ga(μ, ν), with:

μ = μ_0 + g_{l,m},   ν = ν_0 − Σ_{d: f_{d,l}=1} (π_{d,m}/λ_{l,m}) ln q_d.

A.4 Sampling λ_{l′,m} for MIGA with hierarchical meta-info: To incorporate the second-layer labels, λ_{l′,m} can be sampled similarly to λ_{l,m}, with g_{l,m} replaced by the corresponding auxiliary count y_{l,m} aggregated over the first-layer labels belonging to l′.
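The sampler in A.3 can be sketched as follows. Variable names are hypothetical, and for clarity π and the auxiliary variables q_d are computed once per sweep rather than refreshed after every single λ update, as a full implementation would do:

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_lambda(lam, f, c, mu0=1.0, nu0=1.0, rng=rng):
    """One Gibbs sweep over λ_{l,m}, given binary labels f (D, L)
    and cluster assignments c (D,)."""
    D, L = f.shape
    M = lam.shape[1]
    # π_{d,m} = ∏_l λ_{l,m}^{f_{d,l}}
    pi = np.array([np.prod(lam ** f[d][:, None], axis=0) for d in range(D)])
    # Auxiliary variables q_d ~ Beta(π_{d,·}, 1)
    q = rng.beta(pi.sum(axis=1), 1.0)
    for l in range(L):
        docs = f[:, l] == 1
        for m in range(M):
            g = np.sum(docs & (c == m))  # g_{l,m}
            # Posterior rate: ν0 − Σ_{d: f_{d,l}=1} (π_{d,m}/λ_{l,m}) ln q_d
            # (ln q_d ≤ 0, so the rate stays ≥ ν0 > 0)
            rate = nu0 - np.sum((pi[docs, m] / lam[l, m]) * np.log(q[docs]))
            lam[l, m] = rng.gamma(mu0 + g, 1.0 / rate)
    return lam
```

Because the Beta augmentation restores gamma-Dirichlet conjugacy, every update is a closed-form gamma draw, which is the "local conjugacy" that keeps the sampler simple.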