A Latent Concept Topic Model for Robust Topic Inference Using Word Embeddings

Uncovering thematic structures of SNS and blog posts is a crucial yet challenging task, because of the severe data sparsity induced by the short length of texts and diverse use of vocabulary. This hinders effective topic inference of traditional LDA because it infers topics based on document-level co-occurrence of words. To robustly infer topics in such contexts, we propose a latent concept topic model (LCTM). Unlike LDA, LCTM reveals topics via co-occurrence of latent concepts, which we introduce as latent variables to capture conceptual similarity of words. More specifically, LCTM models each topic as a distribution over the latent concepts, where each latent concept is a localized Gaussian distribution over the word embedding space. Since the number of unique concepts in a corpus is often much smaller than the number of unique words, LCTM is less susceptible to data sparsity. Experiments on the 20Newsgroups corpus show the effectiveness of LCTM in dealing with short texts as well as the capability of the model in handling held-out documents with a high degree of OOV words.


Introduction
Probabilistic topic models, such as latent Dirichlet allocation (LDA) (Blei et al., 2003), are widely used to uncover hidden topics within a text corpus. LDA models each document as a mixture of topics, where each topic is a distribution over words. In essence, LDA reveals latent topics in a corpus by implicitly capturing document-level word co-occurrence patterns (Wang and McCallum, 2006).
In recent years, Social Networking Services and blogs have become increasingly prevalent due to the explosive growth of the Internet. Uncovering the thematic structures of these posts is crucial for tasks such as market review and trend estimation (Asur and Huberman, 2010). However, compared to more conventional documents, such as news articles and academic papers, analyzing the thematic content of blog posts can be challenging, because of their typically short length and the use of diverse vocabulary by various authors. These factors can substantially decrease the chance of topically related words co-occurring in the same post, which in turn hinders effective topic inference in conventional topic models. Additionally, a small corpus size can make topic inference even harder, since word co-occurrence statistics become sparser as the number of documents decreases.
Recently, word embedding models, such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), have gained much attention for their ability to form clusters of conceptually similar words in the embedding space. Inspired by this, we propose a latent concept topic model (LCTM) that infers topics based on document-level co-occurrence of references to the same concept. More specifically, we introduce a new latent variable, termed a latent concept, to capture conceptual similarity of words, and redefine each topic as a distribution over the latent concepts. Each latent concept is then modeled as a localized Gaussian distribution over the embedding space. This is illustrated in Figure 1, where we denote the centers of the Gaussian distributions as concept vectors. We see that each concept vector captures a representative concept of surrounding words, and the Gaussian distributions model the small variation between the latent concepts and the actual use of words. Since the number of unique concepts that are referenced in a corpus is often much smaller than the number of unique words, we expect topically related latent concepts to co-occur many times, even in short texts with diverse usage of words. This in turn promotes topic inference in LCTM.
LCTM has the further advantage of operating on continuous word embeddings. Traditional LDA assumes a fixed vocabulary of word types. This modeling assumption prevents LDA from handling out-of-vocabulary (OOV) words in held-out documents. On the other hand, since our topic model operates on the continuous vector space, it can naturally handle OOV words once their vector representations are provided.
The main contributions of our paper are as follows: We propose LCTM that infers topics via document-level co-occurrence patterns of latent concepts, and derive a collapsed Gibbs sampler for approximate inference. We show that LCTM can accurately represent short texts by outperforming conventional topic models in a clustering task. By means of a classification task, we furthermore demonstrate that LCTM achieves superior performance to other state-of-the-art topic models in handling documents with a high degree of OOV words.
The remainder of the paper is organized as follows: related work is summarized in Section 2, while LCTM and its inference algorithm are presented in Section 3. Experiments on the 20Newsgroups are presented in Section 4, and a conclusion is presented in Section 5.

Related Work
There have been a number of previous studies on topic models that incorporate word embeddings. The closest model to LCTM is Gaussian LDA (Das et al., 2015), which models each topic as a Gaussian distribution over the word embedding space. However, the assumption that topics are unimodal in the embedding space is not appropriate, since topically related words such as 'neural' and 'networks' can lie far from each other in the embedding space. Nguyen et al. (2015) proposed topic models that incorporate information from word vectors in modeling topic-word distributions. Similarly, Petterson et al. (2010) exploit external word features to improve the Dirichlet prior of the topic-word distributions. However, neither of these models can handle OOV words, because they assume fixed word types.
Latent concepts in LCTM are closely related to 'constraints' in interactive topic models (ITM) (Hu et al., 2014). Both latent concepts and constraints are designed to group conceptually similar words using external knowledge in an attempt to aid topic inference. The difference lies in their modeling assumptions: latent concepts in LCTM are modeled as Gaussian distributions over the embedding space, while constraints in ITM are sets of conceptually similar words that are interactively identified by humans for each topic. Each constraint for each topic is then modeled as a multinomial distribution over the constrained set of words that were identified as mutually related by humans. In Section 4, we consider a variant of ITM, whose constraints are instead inferred using external word embeddings.
As regards short texts, a well-known topic model is the Biterm Topic Model (BTM) (Yan et al., 2013). BTM directly models the generation of biterms (pairs of words) in the whole corpus. However, the assumption that pairs of co-occurring words should be assigned to the same topic might be too strong (Chen et al., 2015).

Generative Model
The primary difference between LCTM and the conventional topic models is that LCTM describes the generative process of word vectors in documents, rather than words themselves.
Suppose α and β are parameters for the Dirichlet priors and let v_{d,i} denote the word embedding for a word type w_{d,i}. The generative model for LCTM is as follows.
1. For each topic k
   (a) Draw a topic-concept distribution ϕ_k ∼ Dirichlet(β).
The graphical models for LDA and LCTM are shown in Figure 2. Compared to LDA, LCTM adds another layer of latent variables to indicate the conceptual similarity of words.
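The generative view can be made concrete with a small simulation. The sketch below is illustrative only: step 1 follows the text, while the remaining steps (a Gaussian prior N(µ, σ_0^2 I) on each concept vector, a Dirichlet(α) document-topic distribution, and word vectors drawn from an isotropic Gaussian around the chosen concept vector) are assumptions inferred from the model description and the hyperparameters µ, σ_0^2 and σ^2 mentioned elsewhere in the paper. The function names `sample_dirichlet` and `generate_corpus` are hypothetical.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    if total == 0.0:
        # Guard against numerical underflow for very small concentrations.
        return [1.0 / size] * size
    return [x / total for x in draws]


def generate_corpus(num_docs, doc_len, K, S, alpha, beta,
                    mu, sigma0_sq, sigma_sq, seed=0):
    """Simulate word vectors under an LCTM-style generative process (sketch)."""
    rng = random.Random(seed)
    D = len(mu)
    # Step 1 (from the text): phi_k ~ Dirichlet(beta) for each topic k.
    phi = [sample_dirichlet(beta, S, rng) for _ in range(K)]
    # Assumed: concept vectors mu_c ~ N(mu, sigma0_sq * I).
    concept_vecs = [[rng.gauss(mu[j], sigma0_sq ** 0.5) for j in range(D)]
                    for _ in range(S)]
    docs = []
    for _ in range(num_docs):
        # Assumed: theta_d ~ Dirichlet(alpha) for each document d.
        theta = sample_dirichlet(alpha, K, rng)
        words = []
        for _ in range(doc_len):
            z = rng.choices(range(K), weights=theta)[0]    # topic assignment
            c = rng.choices(range(S), weights=phi[z])[0]   # concept assignment
            # Assumed: word vector v ~ N(mu_c, sigma_sq * I).
            v = [rng.gauss(concept_vecs[c][j], sigma_sq ** 0.5)
                 for j in range(D)]
            words.append((z, c, v))
        docs.append(words)
    return phi, concept_vecs, docs


phi, concept_vecs, docs = generate_corpus(
    num_docs=3, doc_len=5, K=2, S=4, alpha=0.5, beta=0.5,
    mu=[0.0, 0.0, 0.0], sigma0_sq=1.0, sigma_sq=0.25, seed=1)
```

Each generated word is a triple (topic, concept, vector), which makes the extra latent layer relative to LDA explicit: the word vector depends on the topic only through the concept assignment.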

Posterior Inference
In our application, we observe documents consisting of word vectors and wish to infer posterior distributions over all the hidden variables. Since there is no analytical solution to the posterior, we derive a collapsed Gibbs sampler to perform approximate inference. During the inference, we sample a latent concept assignment as well as a topic assignment for each word in each document according to Eqs. (1) and (2), where n_{d,k} is the number of words assigned to topic k in document d, and n_{k,c} is the number of words assigned to both topic k and latent concept c. When an index is replaced by '·', the number is obtained by summing over the index. The superscript −d,i indicates that the current assignments of z_{d,i} and c_{d,i} are ignored. N(·|µ, Σ) is a multivariate Gaussian density function with mean µ and covariance matrix Σ.
µ_c and σ_c^2 in Eq. (2) are parameters associated with the latent concept c, defined in Eqs. (3) and (4). Eq. (1), for sampling topic assignments, is concerned with topic-concept distributions. Eq. (2), for sampling latent concepts, has an intuitive interpretation: the first term encourages concept assignments that are consistent with the current topic assignment, while the second term encourages concept assignments that are consistent with the observed word. The Gaussian variance parameter σ^2 acts as a trade-off parameter between the two terms via σ_c^2. In Section 4.2, we study the effect of σ^2 on document representation.
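For reference, the conditionals referred to as Eqs. (1) to (4) can be sketched from the definitions in this section. The forms below are a reconstruction under stated assumptions rather than a verbatim statement: symmetric Dirichlet priors α and β, S latent concepts, a prior N(µ, σ_0^2 I) on each concept vector (µ and σ_0^2 being the hyperparameters listed in Appendix C), and isotropic covariances. The index set A_c^{-d,i}, the set of word positions currently assigned to concept c excluding (d,i), is notation introduced here for illustration.

```latex
\begin{align}
p(z_{d,i}=k \mid c_{d,i}=c, \mathbf{z}^{-d,i}, \mathbf{c}^{-d,i})
  &\propto \left(n_{d,k}^{-d,i} + \alpha\right)
     \frac{n_{k,c}^{-d,i} + \beta}{n_{k,\cdot}^{-d,i} + S\beta} \tag{1} \\
p(c_{d,i}=c \mid z_{d,i}=k, v_{d,i}, \mathbf{z}^{-d,i}, \mathbf{c}^{-d,i}, \mathbf{v}^{-d,i})
  &\propto \left(n_{k,c}^{-d,i} + \beta\right)
     \mathcal{N}\!\left(v_{d,i} \mid \mu_c, \sigma_c^2 I\right) \tag{2} \\
\mu_c &= \frac{\sigma^2 \mu + \sigma_0^2 \sum_{(d',i') \in A_c^{-d,i}} v_{d',i'}}
              {\sigma^2 + n_{\cdot,c}^{-d,i}\, \sigma_0^2} \tag{3} \\
\sigma_c^2 &= \left(1 + \frac{\sigma_0^2}{n_{\cdot,c}^{-d,i}\, \sigma_0^2 + \sigma^2}\right) \sigma^2 \tag{4}
\end{align}
```

Under these assumptions, Eqs. (3) and (4) are the usual posterior-predictive mean and variance of a Gaussian with known variance σ^2 and a Gaussian prior on its mean. When no words are assigned to concept c, they reduce to µ and σ^2 + σ_0^2, respectively.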

Prediction of Topic Proportions
After the posterior inference, the posterior means of {θ_d} and {ϕ_k} are straightforward to calculate. The posterior means for {µ_c} are given by Eq. (3). We can then use these values to predict the topic proportion θ_{d_new} of an unseen document d_new using collapsed Gibbs sampling, as given by Eq. (6). The second term of Eq. (6) is a weighted average of ϕ_{k,c} with respect to latent concepts. We see that more weight is given to the concepts whose corresponding vectors µ_c are closer to the word vector v_{d_new,i}. This is to be expected because statistics of nearby concepts should give more information about the word. We also see from Eq. (6) that the topic assignment of a word is determined by its embedding, instead of its word type. Therefore, LCTM can naturally handle OOV words once their embeddings are provided.
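The weighting described for the second term of Eq. (6) can be illustrated with a short sketch. It assumes that the topic probability for a held-out word is proportional to (n_{d_new,k} + α) times a Gaussian-weighted average of ϕ_{k,c} over concepts, with isotropic Gaussians N(v | µ_c, σ_c^2 I). This form is inferred from the description above, and the function names are hypothetical.

```python
import math


def log_gauss_iso(v, mean, var):
    """Log density of an isotropic Gaussian N(v | mean, var * I)."""
    d = len(v)
    sq_dist = sum((a - b) ** 2 for a, b in zip(v, mean))
    return -0.5 * d * math.log(2.0 * math.pi * var) - sq_dist / (2.0 * var)


def concept_weights(v, concept_means, concept_vars):
    """Normalized Gaussian weights of all concepts for word vector v."""
    logs = [log_gauss_iso(v, m, s) for m, s in zip(concept_means, concept_vars)]
    top = max(logs)  # subtract the maximum for numerical stability
    unnorm = [math.exp(x - top) for x in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def heldout_topic_probs(v, doc_topic_counts, phi,
                        concept_means, concept_vars, alpha):
    """Topic probabilities for one held-out word vector (assumed Eq. (6) form)."""
    weights = concept_weights(v, concept_means, concept_vars)
    scores = []
    for k, n_dk in enumerate(doc_topic_counts):
        # Weighted average of phi[k][c]; nearby concepts dominate.
        concept_term = sum(p * w for p, w in zip(phi[k], weights))
        scores.append((n_dk + alpha) * concept_term)
    total = sum(scores)
    return [s / total for s in scores]


# Toy example: the word vector lies next to concept 0, which topic 0 favors.
probs = heldout_topic_probs(
    v=[0.1, -0.1],
    doc_topic_counts=[0, 0],
    phi=[[0.9, 0.1], [0.1, 0.9]],
    concept_means=[[0.0, 0.0], [5.0, 5.0]],
    concept_vars=[0.5, 0.5],
    alpha=0.1)
```

Because the computation uses only the vector `v`, the same function applies to an OOV word as long as an embedding is available for it.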

Reducing the Computational Complexity
From Eqs. (1) and (2), we see that the computational complexity of sampling per word is O(K + SD), where K, S and D are the numbers of topics, latent concepts and embedding dimensions, respectively. Since K ≪ S holds in usual settings, the dominant computation is the sampling of latent concepts, which costs O(SD) per word. However, since LCTM assumes that the Gaussian variance σ^2 is relatively small, the chance of a word being assigned to distant concepts is negligible. Thus, we can reasonably assume that each word is assigned to one of its M ≪ S nearest concepts. Hence, the computational complexity is reduced to O(MD). Since concept vectors can move slightly in the embedding space during the inference, we periodically update the nearest concepts for each word type.
To further reduce the computational complexity, we can apply dimensionality reduction algorithms such as PCA and t-SNE (Van der Maaten and Hinton, 2008) to the word embeddings to make D smaller. We leave this to future work.
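The pruning step can be sketched as follows. The example assumes that "nearest" means smallest Euclidean distance between a word vector and the concept vectors (the text does not name the metric), and that the concept conditional has the form (n_{k,c} + β) · N(v | µ_c, σ_c^2 I) described for Eq. (2). The helper names are hypothetical.

```python
import heapq
import math


def nearest_concepts(word_vec, concept_vecs, m):
    """Indices of the m concept vectors closest to word_vec (squared Euclidean)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(word_vec, concept_vecs[c]))
    return heapq.nsmallest(m, range(len(concept_vecs)), key=sq_dist)


def restricted_concept_probs(word_vec, candidates, topic_concept_counts,
                             concept_means, concept_vars, beta):
    """Concept probabilities evaluated over the candidate concepts only."""
    d = len(word_vec)
    log_scores = []
    for c in candidates:
        sq = sum((a - b) ** 2 for a, b in zip(word_vec, concept_means[c]))
        log_gauss = (-0.5 * d * math.log(2.0 * math.pi * concept_vars[c])
                     - sq / (2.0 * concept_vars[c]))
        # Count term for the current topic times the Gaussian term, in log space.
        log_scores.append(math.log(topic_concept_counts[c] + beta) + log_gauss)
    top = max(log_scores)
    unnorm = [math.exp(s - top) for s in log_scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]


concept_vecs = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0], [-9.0, -9.0]]
word_vec = [0.2, 0.1]
candidates = nearest_concepts(word_vec, concept_vecs, m=2)
probs = restricted_concept_probs(
    word_vec, candidates,
    topic_concept_counts=[3, 3, 0, 0],  # n_{k,c} for the word's current topic k
    concept_means=concept_vecs,
    concept_vars=[0.5, 0.5, 0.5, 0.5],
    beta=0.01)
```

Only the M candidates are scored, so the per-word cost is O(MD) instead of O(SD). In a full sampler the candidate lists would be cached per word type and refreshed periodically, as the text describes.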

Datasets and Models Description
In this section, we study the empirical performance of LCTM on short texts. We used the 20Newsgroups corpus, which consists of discussion posts about various news subjects authored by diverse readers. Each document in the corpus is tagged with one of twenty newsgroups. Only posts with fewer than 50 words are extracted for the training datasets. For external word embeddings, we used 50-dimensional GloVe vectors that were pre-trained on Wikipedia. The datasets are summarized in Table 1. See Appendix A for details of the dataset preprocessing.
We compare the performance of LCTM to the following six baselines:
• LFLDA (Nguyen et al., 2015), an extension of Latent Dirichlet Allocation that incorporates word embedding information.
• LFDMM (Nguyen et al., 2015), an extension of Dirichlet Multinomial Mixtures that incorporates word embedding information.
• nI-cLDA, non-interactive constrained Latent Dirichlet Allocation, a variant of ITM (Hu et al., 2014), where constraints are inferred by applying k-means to external word embeddings. Each resulting word cluster is then regarded as a constraint. See Appendix B for details of the model.
• Gaussian LDA (GLDA) (Das et al., 2015), which models each topic as a Gaussian distribution over the word embedding space.
• BTM (Yan et al., 2013), the Biterm Topic Model for short texts.
• LDA (Blei et al., 2003).
In all the models, we set the number of topics to 20. For LCTM (resp. nI-cLDA), we set the number of latent concepts (resp. constraints) to 1000. See Appendix C for details of the hyperparameter settings.

Document Clustering
To demonstrate that LCTM results in a superior representation of short documents compared to the baselines, we evaluated the performance of each model on a document clustering task. We used the learned topic proportion as a feature for each document and applied k-means to cluster the documents. We then compared the resulting clusters to the actual newsgroup labels. Clustering performance is measured by Adjusted Mutual Information (AMI) (Manning et al., 2008). Higher AMI indicates better clustering performance. Figure 3 illustrates the quality of clustering as a function of the Gaussian variance parameter σ^2. We see that setting σ^2 = 0.5 consistently obtains good clustering performance for all the datasets with varying sizes. We therefore set σ^2 = 0.5 in the later evaluation. Figure 4 compares AMI on four topic models. We see that LCTM outperforms the topic models without word embeddings. We also see that LCTM performs comparably to LFLDA and nI-cLDA, both of which incorporate word embedding information to aid topic inference. However, as we will see in the next section, LCTM handles OOV words in held-out documents better than LFLDA and nI-cLDA do.
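The evaluation protocol can be sketched with scikit-learn. The snippet below is a minimal illustration on synthetic topic proportions, not the experimental code: the toy data, the helper name `clustering_ami`, and the k-means settings (`n_init=10`, fixed seed) are assumptions made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score


def clustering_ami(topic_proportions, labels, n_clusters, seed=0):
    """Cluster documents by topic proportions and score against true labels."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    predicted = km.fit_predict(topic_proportions)
    return adjusted_mutual_info_score(labels, predicted)


# Synthetic "topic proportions": each document is dominated by one topic.
toy_docs, toy_labels = [], []
for label in range(3):
    for j in range(10):
        major = 0.70 + 0.02 * j
        minor = (1.0 - major) / 2.0
        row = [minor] * 3
        row[label] = major
        toy_docs.append(row)
        toy_labels.append(label)

ami = clustering_ami(toy_docs, toy_labels, n_clusters=3)
```

AMI is invariant to how the cluster indices are permuted, so the k-means cluster ids do not need to be aligned with the newsgroup labels before scoring.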

Representation of Held-out Documents with OOV words
To show that our model can better predict topic proportions of documents containing OOV words than other topic models, we conducted an experiment on a classification task. In particular, we inferred topics from the training dataset and predicted topic proportions of held-out documents using a collapsed Gibbs sampler. With the inferred topic proportions on both the training dataset and the held-out documents, we then trained a multi-class classifier (multi-class logistic regression implemented in the sklearn Python module) on the training dataset and predicted newsgroup labels of the held-out documents. We compared classification accuracy using LFLDA, nI-cLDA, LDA, GLDA, LCTM and a variant of LCTM (LCTM-UNK) that ignores OOV words in the held-out documents. A higher classification accuracy indicates a better representation of unseen documents.
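The classification step can be sketched as follows. This is an illustrative example on synthetic topic proportions: the toy data, the helper names, and the `max_iter` setting are assumptions, and only the use of logistic regression from sklearn comes from the text.

```python
from sklearn.linear_model import LogisticRegression


def heldout_accuracy(train_props, train_labels, test_props, test_labels):
    """Train on training-set topic proportions, score on held-out ones."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_props, train_labels)
    return clf.score(test_props, test_labels)


def toy_proportions(major_values, n_topics=3):
    """Synthetic topic proportions in which topic `label` dominates."""
    docs, labels = [], []
    for label in range(n_topics):
        for major in major_values:
            minor = (1.0 - major) / (n_topics - 1)
            row = [minor] * n_topics
            row[label] = major
            docs.append(row)
            labels.append(label)
    return docs, labels


train_docs, train_labels = toy_proportions([0.70 + 0.02 * j for j in range(10)])
test_docs, test_labels = toy_proportions([0.75, 0.85])
accuracy = heldout_accuracy(train_docs, train_labels, test_docs, test_labels)
```

In the setting described above, `test_props` would hold the topic proportions predicted for held-out documents by the collapsed Gibbs sampler, so the accuracy reflects how well those predicted proportions represent unseen documents.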

Conclusion
In this paper, we have proposed LCTM, a topic model that is well suited to short texts with diverse vocabulary. LCTM infers topics according to document-level co-occurrence patterns of latent concepts, and thus is robust to diverse vocabulary usage and data sparsity in short texts. We showed experimentally that LCTM can produce a superior representation of short documents, compared to conventional topic models. We additionally demonstrated that LCTM can exploit OOV words to improve the representation of unseen documents. Although our paper has focused on improving the performance of LDA by introducing a latent concept for each word, the same idea can be readily applied to other topic models that extend LDA.
ii. Draw its constraint l_{d,i}
Let V be the vocabulary. We note that π_{k,s} is a multinomial distribution over W_{k,s}, which is a subset of V, defined as W_{k,s} ≡ {w ∈ V | r_{k,w} = s}. W_{k,s} represents a constrained set of words that are conceptually related to each other under topic k.
In our application, we observe documents and constraints for each topic, and wish to infer posterior distributions over all the hidden variables. We apply collapsed Gibbs sampling for the approximate inference. For details of the inference, see Hu et al. (2014).

C Hyperparameter Settings
For all the topic models, we used symmetric Dirichlet priors. The hyperparameters were set as follows: for our model (LCTM and LCTM-UNK), nI-cLDA and LDA, we set α = 0.1 and β = 0.01. For nI-cLDA, we set the parameter of the Dirichlet prior for the constraint-word distribution (γ in Appendix B) to 0.1. For our model, we also set σ_0^2 = 1.0 and µ to be the average of the word vectors. We randomly initialized the topic assignments in all the models, and initialized the latent concept assignments using k-means clustering on the word embeddings. The k-means clustering was implemented using the sklearn Python module. We set M (the number of nearest concepts to sample from) to 300, and updated the nearest concepts every 5 iterations. For LFLDA, LFDMM, BTM and Gaussian LDA, we used the original implementations available online and retained the default hyperparameters.
We ran all the topic models for 1500 iterations for training, and 500 iterations for predicting held-out documents.