Efficient Methods for Incorporating Knowledge into Topic Models

Latent Dirichlet allocation (LDA) is a popular topic modeling technique for exploring hidden topics in text corpora. Increasingly, topic modeling needs to scale to larger topic spaces and use richer forms of prior knowledge, such as word correlations or document labels. However, inference is cumbersome for LDA models with prior knowledge. As a result, LDA models that use prior knowledge only work in small-scale scenarios. In this work, we propose a factor graph framework, Sparse Constrained LDA (SC-LDA), for efficiently incorporating prior knowledge into LDA. We evaluate SC-LDA's ability to incorporate word correlation knowledge and document label knowledge on three benchmark datasets. Compared to several baseline methods, SC-LDA achieves comparable performance but is significantly faster.


Challenge: Leveraging Prior Knowledge in Large-scale Topic Models

Topic models, such as Latent Dirichlet Allocation (Blei et al., 2003, LDA), have been successfully used for discovering hidden topics in text collections. LDA is an unsupervised model: it requires no annotation and discovers, without any supervision, the thematic trends in a text collection. However, LDA's lack of supervision can lead to disappointing results. Often, the hidden topics learned by LDA fail to make sense to end users. Part of the problem is that the objective function of topic models does not always correlate with human judgments of topic quality. Therefore, it is often necessary to incorporate prior knowledge into topic models to improve performance. Recent work has also shown that interactive human feedback can improve the quality and stability of topics (Hu and Boyd-Graber, 2012). Information about documents (Ramage et al., 2009) or words (Boyd-Graber et al., 2007) can improve LDA's topics.
In addition to its occasional inscrutability, scalability can also hamper LDA's adoption. Conventional Gibbs sampling, the most widely used inference method for LDA, scales linearly with the number of topics. Moreover, accurate training usually takes many sampling passes over the dataset. Therefore, for large datasets with millions or even billions of tokens, conventional Gibbs sampling takes too long to finish. For standard LDA, recently introduced fast sampling methods (Yao et al., 2009; Li et al., 2014; Yuan et al., 2015) enable industrial applications of topic modeling to search engines and online advertising, where capturing the "long tail" of infrequently used topics requires large topic spaces. For example, while typical LDA models in academic papers have up to 10^3 topics, industrial applications with 10^5 to 10^6 topics are common (Wang et al., 2014). Moreover, scaling topic models to many topics can also reveal the hierarchical structure of topics (Downey et al., 2015).
Thus, there is a need for topic models that can both benefit from rich prior information and scale to large datasets. However, existing methods for improving scalability focus on topic models without prior information. To rectify this, we propose a factor graph model that encodes a potential function over the hidden topic variables, encouraging topics consistent with prior knowledge. The factor model representation admits an efficient sampling algorithm that takes advantage of the model's sparsity. We show that our method achieves comparable performance but runs significantly faster than baseline methods, enabling the discovery of models with many topics enriched by prior knowledge.

Efficient Algorithm for Incorporating Knowledge into LDA

In this section, we introduce the factor model for incorporating prior knowledge and show how to efficiently use Gibbs sampling for inference.

Background: LDA and SparseLDA
A statistical topic model represents words in documents in a collection D as mixtures of T topics, which are multinomials over a vocabulary of size V. In LDA, each document d is associated with a multinomial distribution over topics, θ_d. The probability of a word type w given topic z is φ_{w|z}. The multinomial distributions θ_d and φ_z are drawn from Dirichlet distributions; α and β are the hyperparameters for θ and φ. We represent the document collection D as a sequence of words w, and topic assignments as z. We use symmetric priors α and β in the model and experiments, but asymmetric priors are easily encoded in the models (Wallach et al., 2009).

Discovering the latent topic assignments z from observed words w requires inferring the posterior distribution P(z|w). Griffiths and Steyvers (2004) propose using collapsed Gibbs sampling. The probability of a topic assignment z = t in document d given an observed word type w and the other topic assignments z_- is

P(z = t | z_-, w) ∝ (n_{d,t} + α) (n_{w,t} + β) / (n_t + Vβ),    (1)

where z_- are the topic assignments of all other tokens. This conditional probability is based on cumulative counts of topic assignments: n_{d,t} is the number of times topic t is used in document d, n_{w,t} is the number of times word type w is used in topic t, and n_t is the marginal count of tokens assigned to topic t.

Unfortunately, explicitly computing this conditional probability is expensive for models with many topics: the time complexity of drawing a sample from Equation 1 is linear in the number of topics. Yao et al. (2009) propose a clever factorization of Equation 1 so that the complexity is typically sublinear, breaking the conditional probability into three "buckets":

P(z = t | z_-, w) ∝ s_t + r_t + q_t,    (2)

where

s_t = αβ / (n_t + Vβ),   r_t = n_{d,t} β / (n_t + Vβ),   q_t = (α + n_{d,t}) n_{w,t} / (n_t + Vβ).    (3)

The first term s is the "smoothing only" bucket, constant for all documents. The second term r is the "document only" bucket, shared by a document's tokens. Both s and r have simple constant-time updates.
The last term q must be computed specifically for each token, but only for the few word types with nonzero counts in a topic, thanks to the sparsity of the word-topic counts. Since q often has the largest mass and few nonzero terms, we start sampling from bucket q.
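To make the bucket decomposition concrete, here is a minimal sketch (not from the paper; the array names and shapes are our own assumptions) showing that the three buckets sum back to the standard LDA conditional:

```python
import numpy as np

def sparse_lda_buckets(n_dt, n_wt, n_t, alpha, beta, V):
    """Decompose the LDA conditional into SparseLDA's three buckets.

    n_dt: document-topic counts for document d, shape (T,)
    n_wt: word-topic counts for word type w, shape (T,)
    n_t:  global topic counts, shape (T,)
    """
    denom = n_t + V * beta
    s = alpha * beta / denom            # "smoothing only": same for every token
    r = n_dt * beta / denom             # "document only": nonzero only where n_dt > 0
    q = (alpha + n_dt) * n_wt / denom   # token-specific: nonzero only where n_wt > 0
    return s, r, q

# The full unnormalized conditional of Equation 1 is recovered as s + r + q:
# (n_dt + alpha) * (n_wt + beta) / (n_t + V * beta)
```

Because r is nonzero only for topics appearing in the document and q only for topics where w appears, a sampler rarely needs to touch all T entries.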

A Factor Model for Incorporating Prior Knowledge
With SparseLDA, inferring LDA models over large topic spaces becomes tractable. However, existing methods for incorporating prior knowledge use conventional Gibbs sampling, which hinders inference at scale. In this section, we address this limitation by adding a factor graph to encode prior knowledge. LDA assumes that the hidden topic assignment of a word is independent of the other hidden topics, given the document's topic distribution θ. While this assumption aids computational efficiency, it discards the rich correlations between words. In many scenarios, users have external knowledge about word correlations, document labels, or document relations, which can reshape topic models and improve coherence.
Prior knowledge can constrain what models discover. A correlation between two words v and w indicates that they have a similar topic distribution, i.e., p(z|v) ≈ p(z|w). Therefore, the posterior topic assignments of v and w will be correlated. In contrast, if v and w are uncorrelated, nothing (other than the Dirichlet's rich-get-richer effect) prevents the topics from diverging. Similarly, if two documents share a label, then it is reasonable to assume that they are more likely than two random documents to share topics.
We denote the set of prior knowledge as M. Each piece of prior knowledge m ∈ M defines a potential function f_m(z, w, d) of the hidden topic z of the word type w in document d with which m is associated. Therefore, the complete prior knowledge M defines a score on the current topic assignments z:

ψ(z, M) = ∏_{m ∈ M} exp f_m(z, w, d).    (4)

If m is knowledge about word type w, then f_m(z, w, d) applies to all hidden topics of word w. If m is knowledge about document d, then f_m(z, w, d) applies to all topics in document d. The potential function assigns large values to topics that accord with the prior knowledge and penalizes topic assignments that disagree with it. In an extreme case, if a piece of prior knowledge m says word type w in document d is Topic 3, then the potential exp f_m(z, w, d) is zero for all topics but Topic 3.
Since the potential function ψ is a function of z only (it is a real-valued score of the current topic assignments), the potential can be factored out of the marginalized joint:

P(z, w | α, β, M) ∝ P(z, w | α, β) · ψ(z, M).

Given the joint likelihood and observed data, the goal is to evaluate the posterior P(z|w). Computing P(z|w) involves evaluating a probability distribution on a large discrete state space: P(z|w) = P(z, w) / Σ_z P(z, w). Griffiths and Steyvers (2004), mirroring the original inspirations for Gibbs sampling (Geman and Geman, 1984), draw an analogy to statistical physics, viewing standard LDA as a system that favors configurations z that compromise between having few topics per document and having few words per topic, with the terms of this compromise being set by the hyperparameters α and β. Our factor model representation of prior knowledge adds a further constraint that asks the model to also consider ensembles of topic assignments z that are compatible with both a standard LDA model and the given prior knowledge.
The collapsed Gibbs sampling update for inferring the topic assignment z of word w in document d is:

P(z = t | z_-, w) ∝ (n_{d,t} + α) (n_{w,t} + β) / (n_t + Vβ) · ∏_{m ∈ M} exp f_m(z = t, w, d).    (5)

The first term is identical to standard LDA and admits efficient computation using SparseLDA. However, if the second term, ∏_m exp f_m(z, w, d), is dense, we still must compute it explicitly T times (once for each topic), because sampling requires the summation of P(z = t). Therefore, the critical part of speeding up the sampler is finding a sparse representation of the second term.
In the following sections, we show that natural, sparse prior knowledge representations are possible. We first present an efficient sparse representation of word correlation prior knowledge and then one for document-label knowledge.

Word Correlation Prior Knowledge
We now illustrate how we can encode word correlation knowledge as a set of sparse constraints f_m(z, w, d) in our model. In previous work (Andrzejewski et al., 2009; Hu et al., 2011; Xie et al., 2015), word correlation prior knowledge is represented as word must-link constraints and cannot-link constraints. A must-link relation between two words indicates that the two words tend to be related to the same topics, i.e., their topic probabilities are correlated. In contrast, a cannot-link relation between two words indicates that the two words are not topically similar, and they should not both be prominent within the same topic. For example, "quarterback" and "fumble" are both related to American football, so they can share a must-link relation. But "fumble" and "bank" imply two different topics, so they share a cannot-link. Suppose word w is associated with a set of prior-knowledge correlations M_w. Each piece of knowledge m ∈ M_w is a word pair (w, w′), expressing a topic preference for w given its correlated word w′. The must-link set of w is M^m_w, and the cannot-link set of w is M^c_w, i.e., M_w = M^m_w ∪ M^c_w. In the example above, M^m_{fumble} = {quarterback} and M^c_{fumble} = {bank}, so M_{fumble} = {quarterback, bank}. The topic assignment of the word "fumble" has higher conditional probability for the same topics as "quarterback" but lower probability for topics containing "bank".
The potential score of sampling topic t for word type w, if M_w is not empty, is

exp f_m(z = t, w, d) = max(n_{w′,t}, λ) / λ   if m = (w, w′) ∈ M^m_w,
exp f_m(z = t, w, d) = λ / max(n_{w′,t}, λ)   if m = (w, w′) ∈ M^c_w,    (6)

where λ is a hyperparameter, which we call the correlation strength, and n_{w′,t} is the number of times the correlated word w′ is assigned to topic t. The intuitive explanation of Equation 6 is that prior knowledge about the word type w shapes the conditional probability of sampling its hidden topic z.
Unlike standard LDA, where every word's hidden topic is independent of other words given θ, Equation 6 increases the probability that a word w is drawn from the same topics as those of w's must-link word set and decreases its probability of being drawn from the same topics as those of w's cannot-link word set. The hyperparameter λ controls the strength of each piece of prior knowledge. The smaller λ is, the stronger the correlation. For large λ, the constraint is inactive for all topics except those with large counts. As λ decreases, the constraint becomes active for topics with smaller counts. We can adjust the value of λ for each piece of prior knowledge based on our confidence in it. In our experiments, for simplicity, we use the same value for all knowledge and set λ = 1.
From Equation 6 and Equation 5, the conditional probability of topic z = t in document d given an observed word type w is:

P(z = t | z_-, w) ∝ (n_{d,t} + α) (n_{w,t} + β) / (n_t + Vβ) · ∏_{u ∈ M^m_w} max(n_{u,t}, λ)/λ · ∏_{v ∈ M^c_w} λ/max(n_{v,t}, λ).    (7)

As explained above, λ controls the "strength" of the prior knowledge term. If λ is large, the prior knowledge has little impact on the conditional probability of topic assignments.
Let us return to the question of whether Equation 6 is sparse, allowing efficient computation of Equation 7. Fortunately, n_{u,t} and n_{v,t}, the topic counts for must-link word u and cannot-link word v, are often sparse. For example, in a 100-topic model trained on the NIPS dataset, 87.2% of word types have fewer than ten topics with nonzero counts. In a 500-topic model trained on a larger dataset like the New York Times News corpus (Sandhaus, 2008), 81.9% of word types have fewer than 50 topics with nonzero counts. Moreover, the model becomes increasingly sparse with additional Gibbs iterations. Figure 1 shows the histogram of nonzero topic counts per word type for the NYT-News dataset.

Figure 1: Histogram of nonzero topic counts for word types in the NYT-News dataset after inference. 81.9% of word types have fewer than 50 topics with nonzero counts. This sparsity allows our sparse constraints to speed inference.
Therefore, the computational cost of Equation 7 can be reduced. SparseLDA efficiently computes the s, r, and q buckets as in Equation 3. Then, for words associated with prior knowledge, we update s, r, and q with the additional potential term. We only need to compute the potential term for topics whose counts exceed λ. The collapsed Gibbs sampling procedure is summarized in Algorithm 1.
Algorithm 1 Gibbs sampling for word type w in document d, given w's correlation set M_w
1: compute s_t, r_t, q_t with SparseLDA (see Equation 3)
2: for t ← 0 to T do
3:     update s_t, r_t, q_t for all u ∈ M_w if n_{u,t} > λ
4: end for
5: p(t) = s_t + r_t + q_t
6: sample a new topic assignment for w from p(t)
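Algorithm 1 can be sketched as follows. This is a simplified stand-in, not the paper's implementation: the potential factors max(n, λ)/λ for must-links and λ/max(n, λ) for cannot-links are our reading of Equation 6 as reconstructed from the surrounding text, and the buckets are collapsed into one dense vector for brevity rather than updated incrementally.

```python
import numpy as np

def sample_topic_with_correlations(n_dt, n_wt, n_t, word_topic_counts,
                                   must, cannot, alpha, beta, V, lam, rng):
    """One collapsed Gibbs draw for a token of word type w.

    n_dt, n_wt, n_t: topic count vectors of length T, as in Equation 1
    word_topic_counts: full word-topic count matrix, shape (V, T)
    must / cannot: vocabulary indices of w's must-link / cannot-link words
    """
    denom = n_t + V * beta
    # s + r + q from SparseLDA, folded into a single dense vector here
    p = (n_dt + alpha) * (n_wt + beta) / denom
    # Assumed potential: it equals 1 unless a correlated word's count in a
    # topic exceeds lam, matching the sparsity argument in the text.
    for u in must:        # boost topics where must-link word u is frequent
        p = p * np.maximum(word_topic_counts[u], lam) / lam
    for v in cannot:      # penalize topics where cannot-link word v is frequent
        p = p * lam / np.maximum(word_topic_counts[v], lam)
    p = p / p.sum()
    return rng.choice(len(p), p=p)
```

In a real sampler one would update only the buckets affected by the correlated words' nonzero counts, instead of rescoring all T topics as above.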

Other Types of Prior Knowledge
The factor model framework can also handle other types of prior knowledge, such as document labels, sentence labels, and document link relations. We briefly describe document labels here. Ramage et al. (2009) propose Labeled-LDA, which improves LDA with document labels. It assumes a one-to-one mapping between topics and labels, and it restricts each document's topics to be sampled only from those allowed by the document's label set. Therefore, Labeled-LDA can be expressed in our model. We define

f_m(z = t, w, d) = 0 if t ∈ m_d, and −∞ otherwise (equivalently, exp f_m(z = t, w, d) = 1 if t ∈ m_d and 0 otherwise),

where m_d specifies document d's label set converted to the corresponding topic labels. Since f_m(z, w, d) is sparse, we can speed up training as well. Sentence-level prior knowledge (e.g., for sentiment or aspect models (Paul and Girju, 2010)) can be defined similarly.
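Viewed as code, the document-label potential is just a 0/1 indicator vector over topics that multiplies into the conditional and zeroes out disallowed topics. A hypothetical sketch (names are ours, not the paper's):

```python
import numpy as np

def label_potential(label_topics, T):
    """exp f_m(z = t, w, d) for document-label knowledge: 1 for topics in the
    document's topic-mapped label set m_d, and 0 for every other topic."""
    mask = np.zeros(T)
    mask[list(label_topics)] = 1.0
    return mask

# Multiplying the standard LDA conditional by this mask restricts a
# document's tokens to its label topics, recovering Labeled-LDA's constraint.
p = np.array([0.2, 0.3, 0.1, 0.4]) * label_potential({0, 3}, 4)
```

Because the mask has as few nonzero entries as the document has labels, the potential term stays sparse.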
Documents can be associated with other useful metadata. For example, a scientific paper and the prior work it cites might have similar topics (Dietz et al., 2007), or friends in a social network might talk about the same topics. To model link relations, we can use Equation 6 and replace the word-topic counts n_{v,z} with document-topic counts n_{d,z}. By doing so, we encourage related documents to have similar topic structures. Moreover, the document-topic counts are also sparse, which fits into the efficient learning framework.
Therefore, for different types of prior knowledge, as long as we can define ψ(z, M) appropriately so that f_m(z, w, d) is sparse, we are able to speed up learning.

Experiments
In this section, we demonstrate the effectiveness of SC-LDA by comparing it with several baseline methods on three benchmark datasets. We first evaluate the convergence rate of each method and then evaluate the learned model parameter φ, the topic-word distribution, in terms of topic coherence. We show that SC-LDA achieves results comparable to the baseline models but is significantly faster. We ran all experiments on an 8-core 2.8GHz CPU, 16GB RAM machine. Our implementation of SC-LDA is available at https://github.com/yya518/sparse-constrained-lda.

Dataset
We use the NIPS and NYT-News datasets from the UCI bag-of-words data collections. These two datasets have no document labels, and we use them for the word correlation experiments. We also use the 20Newsgroups (20NG) dataset, which has document labels, for the document label experiments. Table 1 shows the characteristics of each dataset. Since NIPS and NYT-News have already been preprocessed, to ensure repeatability, we use the data "as they are" from the sources. For 20NG, we perform tokenization and stopword removal using Mallet (McCallum, 2002) and remove words that appear fewer than 10 times.

Prior Knowledge Generation
Word Correlation Prior Knowledge Previous work proposes two methods to automatically generate prior word correlation knowledge from external sources. Hu and Boyd-Graber (2012) use WordNet 3.0 to obtain synsets for word types, and then if a synset is also in the vocabulary, they add a must-link correlation between the word type and the synset. Xie et al. (2015) use a different method that takes advantage of an existing pretrained word embedding. Each word embedding is a real-valued vector capturing the word's semantic meaning based on distributional similarity. If the similarity between the embeddings of two word types in the vocabulary exceeds a threshold, they generate a must-link between the two words.
In our experiments, we adopt a hybrid method that combines the two methods above. For a noun word type, we first obtain its synsets from WordNet 3.0. We also obtain the embedding of each word from word2vec (Mikolov et al., 2013). If a synset word is also in the vocabulary, and the similarity between the synset word and the word is higher than a threshold (0.2 in our experiments), we generate a must-link between the two words.
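The hybrid generation step can be sketched as follows. This is a simplified stand-in: synonym candidates and embedding vectors are passed in as plain dictionaries (in the paper they come from WordNet 3.0 and word2vec), and the 0.2 threshold is the one stated above.

```python
import numpy as np

def generate_must_links(vocab, synonyms, vectors, threshold=0.2):
    """Generate must-link word pairs.

    synonyms: word -> candidate synonym words (e.g., from WordNet synsets)
    vectors:  word -> embedding vector (e.g., from word2vec)
    A must-link (w, s) is added when s is in the vocabulary and the cosine
    similarity of the two embeddings exceeds the threshold.
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    links = set()
    for w in vocab:
        for s in synonyms.get(w, []):
            if s in vocab and s != w and w in vectors and s in vectors:
                if cos(vectors[w], vectors[s]) > threshold:
                    links.add(tuple(sorted((w, s))))
    return links
```

Storing each pair in sorted order deduplicates symmetric links such as (car, auto) and (auto, car).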

Document Label Prior Knowledge
Since documents in the 20NG dataset are associated with labels, we use the labels directly as prior knowledge.

Baselines
The baseline methods for incorporating word correlation prior knowledge in our experiments are as follows. DF-LDA incorporates word must-links and cannot-links using a Dirichlet Forest prior in LDA (Andrzejewski et al., 2009); we use the implementation of Hu and Boyd-Graber (2012). Logic-LDA encodes prior knowledge as first-order logic constraints on topic assignments. MRF-LDA encodes word correlations with a Markov random field (Xie et al., 2015). We also use Mallet's SparseLDA implementation for vanilla LDA in the topic coherence experiment. We use a symmetric Dirichlet prior for all models, with α = 1.0 and β = 0.01. For DF-LDA, η = 100. For Logic-LDA, we use the default parameter settings in the package: a sample rate of 1.0 and a step rate of 10.0. For MRF-LDA, we use the default setting with γ = 1.0. (Parameter semantics can be found in the original papers.)

Convergence
The main advantage of our method over existing methods is efficiency. In this experiment, we show the change of each model's log likelihood over time. In topic models, the change in log likelihood is a good indicator of whether a model has converged. Figure 2 shows the log likelihood over time for SC-LDA and three baseline methods on the NIPS and NYT-News datasets. SC-LDA converges faster than all the other methods.
We also conduct experiments on SC-LDA with varying numbers of word correlations. Table 2 shows the Gibbs sampling iteration time at the 1st, 50th, 100th, and 200th iterations, with different numbers of word correlations incorporated in SC-LDA. SC-LDA runs faster as sampling proceeds and sparsity increases, but additional correlations slow the model.

Figure 2: Models' log likelihood convergence on the NIPS dataset (above) and the NYT-News dataset (below). For NIPS, a 100-topic model with 100 must-links is trained. For NYT-News, a 500-topic model with 100 must-links is trained. SC-LDA reaches likelihood convergence much more rapidly than the other methods.

Topic Coherence
Topic models are often evaluated using perplexity on held-out test data, but this evaluation is often at odds with human evaluations. Following Mimno et al. (2011), we employ topic coherence, a metric that is consistent with human judgment, to measure a topic model's quality. Topic t's coherence is defined as

C(t; V^(t)) = Σ_{m=2}^{M} Σ_{l=1}^{m-1} log [ (D(v_m^(t), v_l^(t)) + ε) / D(v_l^(t)) ],

where D(v, v′) is the co-document frequency of word types v and v′, D(v) is the document frequency of word type v, and V^(t) = (v_1^(t), ..., v_M^(t)) is a list of the M most probable words in topic t.
In our experiments, we choose the ten words with the highest probability in each topic to compute topic coherence, i.e., M = 10. Mimno et al. (2011) use ε = 1, but Röder et al. (2015) show that a smaller ε (such as 10^-12) improves coherence stability, so we set ε = 10^-12. Larger topic coherence scores imply more coherent topics.
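A direct implementation of the coherence score with ε = 10^-12, under the assumption that documents are available as token sets (function and argument names are ours):

```python
import math

def topic_coherence(top_words, docs, eps=1e-12):
    """Topic coherence of Mimno et al. (2011) for one topic.

    top_words: the M most probable words of the topic, most probable first
    docs: iterable of per-document token collections
    """
    docs = [set(d) for d in docs]

    def df(*words):
        # (co-)document frequency: number of docs containing all the words
        return sum(all(w in d for w in words) for d in docs)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            score += math.log((df(top_words[m], top_words[l]) + eps)
                              / df(top_words[l]))
    return score
```

Since coherence is a sum of log ratios no larger than log(1), scores are non-positive, and values closer to zero indicate more coherent topics.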
We train a 500-topic model on the NIPS dataset with different methods and compare the average topic coherence score and the average of the top twenty topic coherence scores. Since the topics learned by topic models often contain "bad" topics (Mimno et al., 2011) that do not make sense to end users, evaluating the top twenty topics better reflects the model's practical performance. We let each model train for one hour. Figure 3 shows the topic coherence of each method. SC-LDA has about the same average topic coherence as LDA but a higher coherence score (-36.6) for the top twenty topics than LDA (-39.1). This is because incorporating word correlation knowledge encourages correlated words to have high probability under the same topic, thus improving the coherence score. The other methods, however, cannot converge within an hour, so their topic coherence scores are much worse than those of SC-LDA and LDA. This again demonstrates the efficiency of SC-LDA over the other baselines.

Document Label Prior Knowledge
SC-LDA can also handle other types of prior knowledge. We compare it with Labeled-LDA (Ramage et al., 2009), which also uses Gibbs sampling for inference, allowing direct comparisons of computation time. Table 3 shows the average running time per iteration for Labeled-LDA and SC-LDA. Because document labels impose sparsity on the document-topic counts, the average running time per iteration decreases as the number of labeled documents increases. SC-LDA exhibits greater speedup with more topics; when T = 500, SC-LDA runs more than ten times faster than Labeled-LDA.

Related Work
This work brings together two lines of research: incorporating rich knowledge into probabilistic models and efficient inference for probabilistic models on large datasets. Both are common areas of interest across many machine learning formalisms: probabilistic logic (Bach et al., 2015), graph algorithms (Low et al., 2012), and probabilistic grammars (Cohen et al., 2008). However, our focus in this paper is the intersection of these lines of research with topic models.
Adding knowledge and metadata to topic models makes the models richer, more understandable, and more domain-specific. A common distinction is between upstream models (conditioning on metadata) and downstream models (conditioning on variables already present in a topic model to predict metadata) (Mimno et al., 2008). Downstream models are typically better at prediction tasks such as predicting sentiment (Blei and McAuliffe, 2007), ideology (Nguyen et al., 2014a), or links in a social network. In contrast, our approach, an upstream model, is often easier to implement and leads to more interpretable topics. Upstream models at the document level have been used to understand the labels in large document collections (Ramage et al., 2009; Nguyen et al., 2014b) and to capture relationships in document networks using Markov random fields (Daumé III, 2009). At the word level, Xie et al. (2015) incorporate word correlations into LDA by building a Markov random field regularization, similar to Newman et al. (2011), who use regularization to improve topic coherence. However, despite these exciting applications, the experiments in the above work are typically on small datasets.
In contrast, there is huge interest in improving the scalability of topic models to large numbers of documents, topics, and vocabularies. Attempts to scale inference for topic models have started from both variational inference and Gibbs sampling, the two most popular inference techniques for topic modeling. Gibbs sampling is popular because of its simplicity and low latency. However, for large numbers of topics, Gibbs sampling can become unwieldy. Porteous et al. (2008) address this issue by creating an upper-bound approximation that produces accurate results, while SparseLDA (Yao et al., 2009) offers an effective factorization that speeds inference without sacrificing accuracy. Just as our model builds on SparseLDA's insights, SparseLDA has been incorporated into commercial deployments (Wang et al., 2014) and improved using alias tables (Li et al., 2014). Yuan et al. (2015) also present an efficient constant-time sampling algorithm for building big topic models. Variational inference can easily be parallelized (Nallapati et al., 2007; Zhai et al., 2012) but has high latency, which has been addressed by performing online updates (Hoffman et al., 2010) and taking stochastic gradients estimated by MCMC inference (Mimno et al., 2012). In this paper, we focus only on single-processor learning, but existing parallelization techniques (Newman et al., 2009) are applicable to our model.
At the intersection lie models that improve the scalability of upstream topic model inference. In addition to our SC-LDA, Hu and Boyd-Graber (2012) speed Gibbs sampling in tree-based topic models using SparseLDA's factorization strategy, and Hu et al. (2014) extend this approach by parallelizing global parameter updates using variational inference. Our work is more general (also encompassing document-based constraints) and faster. In contrast to these upstream models, Zhu et al. (2013) and Nguyen et al. (2015) improve inference for downstream models.

Conclusion
We present a factor graph framework for incorporating prior knowledge into topic models. By expressing the prior knowledge as sparse constraints on the hidden topic variables, we are able to take advantage of sparsity to speed up training. We demonstrate in experiments that our model runs significantly faster than alternative models and achieves comparable performance in terms of topic coherence. Efficient algorithms for incorporating prior knowledge into large topic models will benefit several downstream applications. For example, interactive topic modeling becomes feasible because fast model updates reduce the user's waiting time and thus improve the user experience. Personalized topic modeling is also an interesting future direction, in which the model would generate a personalized topic structure based on a user's preferences or interests. For all these applications, an efficient learning algorithm is a crucial prerequisite.