TSDPMM: Incorporating Prior Topic Knowledge into Dirichlet Process Mixture Models for Text Clustering

The Dirichlet process mixture model (DPMM) has great potential for detecting the underlying structure of data. Extensive studies have applied it to text clustering in terms of topics. However, due to its unsupervised nature, the resulting topic clusters are often unsatisfactory. Considering that people often have some prior knowledge about which potential topics should exist in given data, we aim to incorporate such knowledge into the DPMM to improve text clustering. We propose a novel model, TSDPMM, based on a new seeded Pólya urn scheme. Experimental results on document clustering across three datasets demonstrate that our proposed TSDPMM significantly outperforms the state-of-the-art DPMM and can be applied in a lifelong learning framework.


Introduction
The Dirichlet process mixture model (DPMM) (Neal, 2000) has been used to detect the underlying structure in data. For example, Vlachos et al. (2008; 2009) applied it to lexical-semantic verb clustering, and Wang et al. (2011), Huang et al. (2013), and Yin and Wang (2014) applied it to text clustering in terms of topics. While the DPMM has achieved some promising results, it can still produce unsatisfactory topic clusters due to its unsupervised nature.
On the other hand, people often have prior knowledge about what potential topics should exist in a given text corpus. Take an earthquake event corpus as an example. Topics such as "casualties and damages", "rescue" and "government reaction", called prior topics, are expected to occur in the corpus according to our common knowledge (e.g., the topics automatically learned from previous events using topic modeling (Ahmed and Xing, 2008)) or external resources (e.g., the tables of contents of Wikipedia event pages). Similarly, in academic fields, the "call for papers (CFP)" of a conference lists the main topics that the conference organizers would like to focus on. Clearly, these prior topics can be represented as sets of words, which are available in many real-world applications. They can serve as weakly supervised information to enhance the unsupervised DPMM for text clustering.
The standard DPMM (Neal, 2000; Ranganathan, 2006) lacks a mechanism for incorporating prior knowledge. Some existing work (Vlachos et al., 2008; Vlachos et al., 2009) added knowledge of observed instance-level constraints (must-links and cannot-links between documents) to the DPMM. Ahmed and Xing (2008) proposed the recurrent Chinese Restaurant Process to incorporate previous documents with known topic clusters. We focus on incorporating topic-level knowledge, which is more challenging, as seed/prior topics could be latent rather than observable.
In particular, we construct our novel TSDPMM (Topic Seeded DPMM) based on a principled seeded Pólya urn (sPU) scheme. Our model inherits the nonparametric property of the DPMM and has additional technical merits. Importantly, our model is encouraged but not forced to find evidence of seed topics. Therefore, it is free to discover new topics beyond the prior topics, as well as to detect which prior topics are not covered by the current data. It is thus convenient to observe topic variations between prior topics and newly mined topics. Experimental results on document clustering across three corpora demonstrate that our model effectively incorporates prior topics and significantly outperforms the state-of-the-art DPMM. Moreover, our TSDPMM can be applied in a lifelong learning framework, which enables the prior topic knowledge to evolve as more and more data are observed.

Topic Seeded DPMM
In this section, we first introduce the standard DPMM model for document clustering in terms of topics. Then we describe how to incorporate seed/prior topics into the model using a seeded Pólya urn (sPU) scheme, which gives us our novel TSDPMM model (Topic Seeded DPMM). Finally, we present the model inference.

DPMM
The DPMM (Antoniak, 1974), as a non-parametric model, assumes the given data is governed by an infinite number of components, of which only a fraction are activated by the data. Figure 1 illustrates the DPMM graphical model and its generative process for a document $x_i$. First, we sample a topic $\theta_i = \{\theta_{ij}\}_{j=1}^{|V|}$ (a multinomial distribution over the words of the vocabulary $V$) for the document $x_i$ according to a Dirichlet process (DP) $G \sim DP(\alpha, G_0)$, where $\alpha > 0$ is a concentration parameter and the base measure $G_0 = \mathrm{Dir}(\vec{\beta})$ can be considered a prior distribution for $\theta$. Treating the document $x_i$ as a bag of words, the generative distribution $F$ is a likelihood function parameterized by $\theta_i$. We define $F$ as $p(x_i \mid \theta_i) = \prod_{j=1}^{|x_i|} p(x_{ij} \mid \theta_i)$, where $x_{ij}$ is the $j$-th word in $x_i$. Note that the DPMM assumes each document can be assigned to one topic cluster only.
The Dirichlet process of the DPMM, according to which the topic $\theta_i$ for a document $x_i$ is drawn, can be explained by the popular metaphor of the Pólya urn (PU) scheme (Blackwell and MacQueen, 1973), equivalent to the Chinese Restaurant Process (Ahmed and Xing, 2008). The PU scheme works on balls (documents) and colors (topics). It starts with an empty urn. With probability proportional to $\alpha$, we draw $\theta_i \sim G_0$ and add a ball of this color to the urn. With probability proportional to $i - 1$ (i.e., the current number of balls in the urn), we draw a ball at random from the urn, observe its color $\theta_i$, and replace the ball with two balls of the same color. In this way, we draw the topic $\theta_i$ for document $x_i$. As the process shows, the prior probability of assigning a document to a topic is proportional to the number of documents already assigned to that topic. As a result, the DPMM exhibits the "rich get richer" property.
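The Pólya urn scheme above can be illustrated with a short simulation. This is only a sketch of the prior over topic assignments: colors are represented by integer indices instead of draws from $G_0$, and the function name `polya_urn` and its parameters are illustrative assumptions, not part of the released code.

```python
import random


def polya_urn(n_docs, alpha, seed=0):
    """Simulate topic labels for n_docs documents under the Polya urn scheme.

    Colours (topics) are plain integer ids here; a full DPMM would also
    draw a word distribution from G_0 for every new colour.
    """
    rng = random.Random(seed)
    counts = []   # counts[c] = number of balls (documents) of colour c
    labels = []   # labels[i] = colour assigned to the i-th document
    for i in range(1, n_docs + 1):
        # A new colour is chosen with probability alpha / (alpha + i - 1).
        if rng.random() < alpha / (alpha + i - 1):
            counts.append(1)
            labels.append(len(counts) - 1)
        else:
            # Otherwise pick an existing colour in proportion to its count.
            c = rng.choices(range(len(counts)), weights=counts)[0]
            counts[c] += 1
            labels.append(c)
    return labels, counts


if __name__ == "__main__":
    labels, counts = polya_urn(n_docs=1000, alpha=1.0)
    # Sorted cluster sizes make the "rich get richer" skew visible.
    print(sorted(counts, reverse=True))
```

Running the sketch with a small `alpha` typically yields a few large clusters and a tail of small ones, which is the "rich get richer" behavior described above.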

TSDPMM: Incorporating Seed Topics
In this section, we describe our proposed algorithm to incorporate prior seed topics into the DPMM. A prior/seed topic $k$ is represented by a vector $\vec{N}^{(0)}_k$ (word frequencies under the topic). We can obtain the prior topics represented by $\vec{N}^{(0)}_k$ from past learning of topic models or from external resources such as Wikipedia and "CFP". Assuming we have $K^{(0)}$ prior topics, we use the parameter $\vec{\alpha}^{(0)} = \{\alpha^{(0)}_k\}_{k=1}^{K^{(0)}}$ to control our confidence about how likely each prior topic is to exist. Let us go back to the Pólya urn (PU) scheme, where a prior topic can be taken as a known color. We extend the PU scheme to incorporate prior topics, which gives the sPU (seeded Pólya urn) scheme. The sPU scheme can be described as follows:
• We start with an urn with $\alpha^{(0)}_k$ balls of each known color $k \in \{1, ..., K^{(0)}\}$.
• With a probability proportional to $\alpha$, we draw $\theta_i \sim G_0$ and add a ball of this color to the urn.
• With a probability proportional to $i - 1 + \sum_{k=1}^{K^{(0)}} \alpha^{(0)}_k$, we draw a random ball from the urn, and replace the ball with two balls of the same color.
As shown in the above process, instead of starting with an empty urn as in the DPMM, we assume that the urn already contains a certain number of balls of known colors. In this way, we incorporate the prior seed topics. The number of initial balls (documents) $\alpha^{(0)}_k$ controls how likely the topic $k$ is to exist. We can use different values of $\alpha^{(0)}_k$ for prior topics with different confidence levels. This sPU scheme gives our novel model TSDPMM (Topic Seeded DPMM), which incorporates prior topics. The TSDPMM has a graphical representation similar to that of the DPMM (Figure 1), except for the additional hyper-parameter $\vec{\alpha}^{(0)}$. We then present a collapsed Gibbs sampling algorithm for model inference as follows.
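The seeded urn can be sketched in the same way as a plain Pólya urn, the only change being that the urn starts with pseudo-balls for the known colors. The function below is a minimal illustration under that reading; the name `seeded_polya_urn`, the list `alpha0` (one initial weight per prior topic) and the integer color ids are assumptions made for the example.

```python
import random


def seeded_polya_urn(n_docs, alpha, alpha0, seed=0):
    """Simulate topic labels under a seeded Polya urn.

    alpha0[k] is the initial number of balls of known colour k, so
    colours 0 .. len(alpha0) - 1 are the prior topics. Colours are
    integer ids; word distributions are not modelled in this sketch.
    """
    rng = random.Random(seed)
    weights = [float(a) for a in alpha0]  # balls per colour, incl. pseudo-balls
    doc_counts = [0] * len(alpha0)        # real documents per colour
    labels = []
    for _ in range(n_docs):
        total = sum(weights)
        # A new colour is chosen with probability alpha / (alpha + balls in urn).
        if rng.random() < alpha / (alpha + total):
            weights.append(1.0)
            doc_counts.append(1)
            labels.append(len(weights) - 1)
        else:
            # Draw a ball at random: prior colours benefit from their pseudo-balls.
            c = rng.choices(range(len(weights)), weights=weights)[0]
            weights[c] += 1.0
            doc_counts[c] += 1
            labels.append(c)
    return labels, doc_counts


if __name__ == "__main__":
    labels, doc_counts = seeded_polya_urn(500, alpha=1.0, alpha0=[5.0, 5.0, 0.5])
    print("documents per prior topic:", doc_counts[:3])
    print("number of new topics:", len(doc_counts) - 3)
```

A prior topic that attracts no documents still keeps its slot in `doc_counts`, which mirrors the idea that the model can report which prior topics are not covered by the data. With an empty `alpha0` the sketch reduces to the standard urn.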
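TSDPMM Inference. The model inference is described in detail in Algorithm 1. It first initializes all documents with random topic clusters. Then it iteratively updates the topic cluster assignments of the documents according to the conditional probabilities (Eq. 1) until convergence. Eq. 1 can be derived as:
$$p(z_i = k \mid \vec{Z}_{-i}, \vec{X}, \alpha, \vec{\alpha}^{(0)}, \vec{\beta}) \propto p(z_i = k \mid \vec{Z}_{-i}, \alpha, \vec{\alpha}^{(0)}) \, p(x_i \mid z_i = k, \vec{Z}_{-i}, \vec{X}_{-i}, \vec{\beta}) \qquad (1)$$
where $z_i$ is the topic assignment of observation $x_i$, $\vec{X}$ is the given document corpus, and $\vec{Z}_{-i}$ and $\vec{X}_{-i}$ are the set of topic assignments and the corpus excluding the $i$-th observation $x_i$, respectively.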

Algorithm 1: Collapsed Gibbs Sampling
Initialize the topic assignments $\vec{Z}$ randomly based on the prior topics;
repeat
    Select a document $x_i \in \vec{X}$ randomly;
    Fix the other topic assignments $\vec{Z}_{-i}$;
    Assign a new value to $z_i$ according to Eq. 1;
until convergence;

In Eq. 1, the first term $p(z_i = k \mid \vec{Z}_{-i}, \alpha, \vec{\alpha}^{(0)})$ denotes the prior probability of $z_i = k$, which is proportional to the number of documents already assigned to topic $k$. If $k$ is a prior topic, it is proportional to $n_{k,-i} + \alpha^{(0)}_k$, where $n_{k,-i}$ is the number of documents of topic $k$ excluding the current document $x_i$. If $k$ is an existing (not prior) topic, it is proportional to $n_{k,-i}$. If $k$ is a new topic, the probability is proportional to $\alpha$.
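As a worked illustration of the prior term, the three cases can be collected into a single normalized expression. The normalizer follows from summing the three kinds of weights over all topics; the symbol $N$ (total number of documents) and the numbers in the example are illustrative assumptions, not values from the experiments.

```latex
p(z_i = k \mid \vec{Z}_{-i}, \alpha, \vec{\alpha}^{(0)}) =
\frac{1}{N - 1 + \alpha + \sum_{k'=1}^{K^{(0)}} \alpha^{(0)}_{k'}}
\begin{cases}
n_{k,-i} + \alpha^{(0)}_k & \text{if } k \text{ is a prior topic} \\
n_{k,-i} & \text{if } k \text{ is an existing (not prior) topic} \\
\alpha & \text{if } k \text{ is a new topic}
\end{cases}

% Example: N = 8 documents and alpha = 1, with one prior topic
% (alpha^{(0)} = 2, n = 3) and one other existing topic (n = 4).
% Unnormalized weights: 3 + 2 = 5, 4, and 1; normalizer: 7 + 1 + 2 = 10.
% Prior probabilities: 0.5 (prior topic), 0.4 (existing topic), 0.1 (new topic).
```

In the example, the prior topic has fewer real documents than the other existing topic but still receives the larger prior probability because of its pseudo-count.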
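The second term of Eq. 1 is the likelihood of $x_i$ under topic $k$, obtained by integrating out $\theta_k$. It is expressed through word-count vectors of the form $\vec{N}_k = \{N_{k,w}\}_{w=1}^{|V|}$, where $N_{k,w}$ is the number of occurrences of word $w$ in the $k$-th topic. Here, we adopt the function $\Delta$ in (Heinrich, 2009), $\Delta(\vec{x}) = \frac{\prod_{w} \Gamma(x_w)}{\Gamma(\sum_{w} x_w)}$. Finally, we can derive:
$$p(z_i = k \mid \vec{Z}_{-i}, \vec{X}, \alpha, \vec{\alpha}^{(0)}, \vec{\beta}) \propto p(z_i = k \mid \vec{Z}_{-i}, \alpha, \vec{\alpha}^{(0)}) \cdot \frac{\Delta(\vec{N}_{k,-i} + \vec{N}_{\cdot,i} + \vec{N}^{(0)}_k + \vec{\beta})}{\Delta(\vec{N}_{k,-i} + \vec{N}^{(0)}_k + \vec{\beta})}$$
where $\vec{N}_{k,-i}$ is a vector with the word counts for all the documents assigned to topic $k$ excluding $x_i$, and $\vec{N}_{\cdot,i}$ and $\vec{N}^{(0)}_k$ are vectors with the word counts in document $x_i$ and in all the documents assigned to $k$ in the prior knowledge, respectively ($\vec{N}^{(0)}_k$ is zero when $k$ is not a prior topic). According to this equation, documents are likely to go into clusters that are bigger and that give a higher likelihood to the documents. When the Gibbs sampler converges, we obtain the topic cluster assignments of all the documents. Unlike the DPMM inference process, in which topics are removed when no documents are assigned to them, TSDPMM inference retains the prior topics at all times due to the initial numbers of documents $\vec{\alpha}^{(0)}$, making it able to track prior topics as well as to detect new topics.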
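The conditional used by the sampler can be sketched in code. The example assumes the standard Dirichlet-multinomial form for the collapsed likelihood (a ratio of $\Delta$ functions), a symmetric Dirichlet parameter `beta`, and prior word counts that are simply added to the topic's counts; all function and parameter names are illustrative and are not taken from the released implementation.

```python
import math
import random
from collections import Counter


def log_delta(counts, beta, vocab_size):
    """log Delta(N + beta) for sparse word counts N and symmetric beta."""
    total = sum(counts.values()) + beta * vocab_size
    seen = sum(math.lgamma(c + beta) for c in counts.values())
    unseen = (vocab_size - len(counts)) * math.lgamma(beta)
    return seen + unseen - math.lgamma(total)


def log_conditional(doc, n_docs_k, topic_counts, extra_weight, beta, vocab_size):
    """Unnormalised log probability of assigning one document to topic k.

    doc          : Counter of word counts in the document x_i
    n_docs_k     : number of other documents currently in topic k
    topic_counts : Counter of word counts of topic k without x_i,
                   including prior word counts if k is a prior topic
    extra_weight : alpha0_k for a prior topic, 0 for another existing
                   topic, alpha for a new (empty) topic
    """
    weight = n_docs_k + extra_weight
    if weight <= 0:
        return float("-inf")
    with_doc = topic_counts + doc
    return (math.log(weight)
            + log_delta(with_doc, beta, vocab_size)
            - log_delta(topic_counts, beta, vocab_size))


def sample_topic(log_weights, rng):
    """Sample an index in proportion to exp(log_weights), stably."""
    top = max(log_weights)
    probs = [math.exp(lw - top) for lw in log_weights]
    return rng.choices(range(len(log_weights)), weights=probs)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    doc = Counter({"earthquake": 2, "rescue": 1})
    seeded = Counter({"rescue": 10, "team": 5})   # hypothetical prior topic
    candidates = [
        log_conditional(doc, 0, seeded, 2.0, 0.1, 1000),     # prior topic
        log_conditional(doc, 0, Counter(), 1.0, 0.1, 1000),  # new topic
    ]
    print(sample_topic(candidates, rng))
```

Working in log space with `math.lgamma` avoids overflow of the Gamma function for realistic word counts; one Gibbs step would evaluate `log_conditional` for every existing topic plus one new topic and then call `sample_topic`.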

Experiments
We evaluate our proposed TSDPMM model for document clustering on 3 datasets where each cluster corresponds to a topic. We implement both the DPMM and TSDPMM models; their source code is available at https://github.com/newsminer/DPMM_and_TSDPMM.

Datasets
We collect datasets from the machine learning conference NIPS, composed of paper titles and abstracts from 2012 to 2014; the three years include 342, 360 and 411 documents respectively. They are named NIPS-12, NIPS-13 and NIPS-14.
For all the datasets, we conduct the following preprocessing: (1) Convert letters into lowercase; (2) Remove non-Latin characters and stop words; (3) Remove words with document frequency < 2.
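The three preprocessing steps can be sketched as follows. The stop-word list is a tiny hypothetical placeholder, and keeping only the letters a to z is one possible reading of "remove non-Latin characters"; neither is taken from the original pipeline.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "on", "to", "with"}


def preprocess(docs, min_df=2, stop_words=STOP_WORDS):
    """Lowercase, keep a-z tokens, drop stop words and rare words."""
    tokenized = []
    for text in docs:
        # Steps (1) and (2): lowercase, then keep only runs of a-z letters.
        tokens = re.findall(r"[a-z]+", text.lower())
        tokenized.append([t for t in tokens if t not in stop_words])
    # Step (3): document frequency = number of documents containing the word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    return [[t for t in tokens if df[t] >= min_df] for tokens in tokenized]


if __name__ == "__main__":
    docs = ["The Bayesian model", "A Bayesian Lasso model!", "Lasso for graphs"]
    print(preprocess(docs))
```

Document frequency is computed after stop-word removal, so the threshold only applies to words that survive the first two steps.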

Experimental Setup
We take the standard DPMM as our baseline method and compare it with our proposed TSDPMM model using prior knowledge obtained in different ways.
For the NIPS datasets, we use two kinds of prior knowledge: one is the topics learned by DPMM from the previous year's dataset; the other is from an external resource, "CFP" (see footnote 4; 10 topics, the same for each year). We name the corresponding models TSDPMM-P and TSDPMM-E respectively. As the topic descriptions in "CFP" are sparse, we repeat each topic description ten times and then represent each topic by the word frequencies of its description text.
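Turning a short topic description into a seed word-frequency vector can be sketched as below. The function name, the tokenization and the example description are hypothetical; the example text is not an actual "CFP" entry.

```python
import re
from collections import Counter


def seed_topic_counts(description, repeats=10):
    """Word frequencies of a topic description repeated `repeats` times."""
    tokens = re.findall(r"[a-z]+", description.lower())
    # Repeating the text `repeats` times multiplies every count by `repeats`.
    return Counter({word: count * repeats
                    for word, count in Counter(tokens).items()})


if __name__ == "__main__":
    # Hypothetical topic description used only for illustration.
    print(seed_topic_counts("Graphical models and graphical inference"))
```

Scaling the counts is equivalent to concatenating the description with itself, and it gives sparse descriptions more weight relative to the Dirichlet prior.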
For both the 20 Newsgroups and Reuters datasets, we use prior knowledge learned by DPMM from the previous day's dataset. Furthermore, to test whether we can improve the results continuously by applying TSDPMM, every time we model a new dataset we incorporate the prior topics learned by TSDPMM from the previous day's dataset, similar to lifelong learning (Chen and Liu, 2014; Thrun, 1998). We call this model TSDPMM-L.
Evaluation. The widely used NMI (normalized mutual information) measure (Dom, 2002) is employed to evaluate the document clustering results. The higher the NMI value, the better the clustering result. However, NMI needs true class labels for documents, and can only be applied to our benchmark news datasets. For the NIPS datasets, which have no true labels, we use perplexity, as defined in (Blei et al., 2003), to test the per-word likelihood of the datasets. The lower the perplexity, the better a model fits the data.
4 https://nips.cc/Conferences/2014/CallForPapers
Table 1 shows the average perplexity values over five runs of the 3 models on the NIPS datasets. It shows that both TSDPMM-P and TSDPMM-E, which leverage prior topics from previous learning and from "CFP" respectively, significantly outperform DPMM. In addition, TSDPMM-E performs worse than TSDPMM-P because its prior topics, obtained directly from "CFP", are of lower quality than the topics from past learning. In future work, we may improve the "CFP" knowledge by extending it with related texts retrieved from search engines or Wikipedia using keywords in "CFP".
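For reference, the NMI measure used in the evaluation can be computed from two label lists as sketched below. Several normalizations of mutual information exist; this sketch assumes the geometric mean of the two entropies, which may differ from the exact variant used in the experiments.

```python
import math
from collections import Counter


def nmi(labels_true, labels_pred):
    """Normalised mutual information between two flat clusterings.

    Normalisation by sqrt(H(true) * H(pred)) is an assumption of this
    sketch; other variants divide by the mean or maximum entropy.
    """
    n = len(labels_true)
    true_sizes = Counter(labels_true)
    pred_sizes = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))

    mutual_info = sum(
        (c / n) * math.log(c * n / (true_sizes[t] * pred_sizes[p]))
        for (t, p), c in joint.items()
    )
    h_true = -sum((c / n) * math.log(c / n) for c in true_sizes.values())
    h_pred = -sum((c / n) * math.log(c / n) for c in pred_sizes.values())

    if h_true == 0.0 or h_pred == 0.0:
        # Degenerate case: at least one side puts everything in one cluster.
        return 1.0 if h_true == h_pred else 0.0
    return mutual_info / math.sqrt(h_true * h_pred)


if __name__ == "__main__":
    print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, renamed labels
    print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # statistically independent partitions
```

The score depends only on how documents are grouped, not on the cluster ids, so a clustering with permuted labels receives the same value.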

Results
A closer look at our clustering results on the NIPS-14 dataset suggests that most prior topics from 2013 are covered again in 2014 (consistent topics), except for a few missing topics such as "lasso for Bayesian networks". Additionally, some newly evolved topics in 2014, e.g. "monte carlo particle filtering" and "nash games", are successfully discovered by our proposed model.
Table 2 reports the average NMI values over five runs of DPMM, TSDPMM and TSDPMM-L on the news datasets. The results show that TSDPMM using prior topics learnt by DPMM outperforms DPMM (on average +5.8%; p < 0.025 with t-test). Additionally, TSDPMM-L, which continuously uses prior topics learnt by TSDPMM from the previous dataset, further outperforms TSDPMM (on average +3.2%; p < 0.025 with t-test). Note that TSDPMM-L first uses the TSDPMM results of 20N-1 and Reu-1 as prior knowledge, so Table 2 contains no TSDPMM-L results for the first days, 20N-1 and Reu-1.

Discussion
The experimental results across 3 datasets demonstrate that our proposed models can improve the DPMM by incorporating prior topic knowledge, and that higher-quality knowledge leads to better results. Applying our TSDPMM in a lifelong continuous learning framework, namely TSDPMM-L, can further improve the clustering results.

Related Work
Our work is related to (Vlachos et al., 2008; Vlachos et al., 2009), which added supervision (instance-level must-links or cannot-links between documents) to the DPMM. Ahmed and Xing (2008) proposed the recurrent Chinese Restaurant Process to incorporate previous documents with known topic clusters. However, our work is very different, as we focus on how to incorporate latent topic-level prior knowledge. We model prior topics as known colors that have a certain probability, proportional to $\alpha^{(0)}_k$, of being assigned to a document. In addition, our inference mechanism takes the prior knowledge into consideration when automatically assigning topics to documents.
Some existing studies, such as (Ramage et al., 2009; Andrzejewski et al., 2009; Jagarlamudi et al., 2012; Andrzejewski et al., 2011), worked on incorporating prior lexical or domain knowledge into LDA. Unlike these works, we focus on the nonparametric DPMM and propose to incorporate prior topic knowledge obtained in multiple ways.

Conclusion
In this paper, we propose a novel problem of incorporating prior topics into DPMM model and address it through a simple yet principled seeded Pólya urn scheme. We show that the topic knowledge can be obtained in multiple ways. Experiments on document clustering across 3 datasets demonstrate our proposed model can effectively incorporate the prior topic knowledge and significantly enhance the standard DPMM for text clustering. In future work, we will study how to discover overlapping clusters, i.e., allowing one document to be grouped into multiple topic clusters. We will also explore how to incorporate prior knowledge about topic relations (such as causation and correlation) into topic modeling.