Extended Topic Model for Word Dependency

Topic models such as Latent Dirichlet Allocation (LDA) assume that the topic assignments of different words are conditionally independent. In this paper, we propose a new model, Extended Global Topic Random Field (EGTRF), to model non-linear dependencies between words. Specifically, we parse sentences into dependency trees, represent them as a graph, and assume that the topic assignment of a word is influenced by its adjacent words and its distance-2 words. Word similarity information learned from a large corpus is incorporated to improve word topic assignment. Parameters are estimated efficiently by variational inference, and experimental results on two datasets show that EGTRF achieves lower perplexity and higher log predictive probability than LDA and GTRF.


Introduction
Probabilistic topic models such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003) have been widely used to discover latent topics in document collections by capturing word co-occurrence relations. However, most existing topic models employ the "bag of words" assumption: the order of words can be ignored, and the topic assignment of each word is conditionally independent given the topic mixture of the document.
To relax the "bag of words" assumption, many extended topic models have been proposed to address the limitation of conditional independence. Wallach (2006) explores a hierarchical generative probabilistic model that incorporates both n-gram statistics and latent topic variables. Gruber et al. (2007) model the topics of words in a document as a Markov chain and assume that all words in the same sentence are more likely to have the same topic. Zhu et al. (2010) incorporate a Markov dependency between the topic assignments of neighboring words and employ the general structure of the GLM to define a conditional distribution of latent topic assignments over words. Most of these models are limited to linear topical dependencies between words, but word topical dependencies can also be modeled non-linearly. In syntactic topic models (Boyd-Graber et al., 2009), each word of a sentence is generated by a distribution that combines document-specific topic weights and parse-tree-specific syntactic transitions.
In the Global Topic Random Field (GTRF) model (Li et al., 2014), the sentences of a document are parsed into dependency trees (Marneffe et al., 2006; Marneffe et al., 2008; Manning et al., 2014). The authors show that the topics of semantically or syntactically dependent words have the highest similarity and provide more useful information for topic modeling, which is also the basic assumption of our model. They then propose GTRF to model non-linear topical dependencies: word topics are sampled based on the graph structure instead of the "bag of words" representation, so the conditional independence of word topic assignments is relaxed. However, GTRF assumes that the topic assignment of a word vertex depends only on the topic mixture of the document and on its neighboring word vertices, ignoring the fact that a word vertex can also be influenced by word vertices at distance 2 or further.
In this paper, we extend the GTRF model and present a novel model, Extended Global Topic Random Field (EGTRF), to exploit topical dependencies between words. In EGTRF, the topic assignment of a word is assumed to depend on both its distance-1 and its distance-2 word vertices. Figure 1 shows an example of a simple document with two sentences. The two sentences are parsed into dependency trees separately, and the trees are then merged into a graph. Merging the dependency trees also exposes some hidden dependency relations. For example, the word "allocation" gains a new distance-2 word, "topics", after merging. EGTRF can therefore exploit more semantic and syntactic word dependencies. In theory, we could also model distances greater than 2; however, this leads to more complicated computation for only a small gain in performance. Another advantage of EGTRF is that it incorporates word features. Word vector representations are attractive because the learned vectors explicitly encode many linguistic regularities and patterns.
We use the model pre-trained on the Google News dataset (about 100 billion words) with the word2vec tool (see footnote 1) to represent each word as a 300-dimensional vector, and we apply the normalized word similarity as a confidence score that indicates how likely two word vertices are to share the same topic.
The paper is organized as follows: EGTRF is presented in Section 2, variational inference and parameter estimation are derived in Section 3, experiments on two datasets are reported in Section 4, and Section 5 concludes the paper.

Extended Global Topic Random Field
In this section, we first present the Extended Global Random Field (EGRF) in Section 2.1 and then show how to model topical dependencies using EGRF in Section 2.2. We incorporate word similarity information into the model in Section 2.3.
Footnote 1: https://code.google.com/p/word2vec/

Extended Global Random Field
After representing a document as an undirected graph, as described in the previous section, we extend the Global Random Field and define the Extended Global Random Field to model the graph as follows. Given an undirected graph G, the word vertex set is denoted W = {w_i | i = 1, 2, ..., n}, where w_i is a word vertex and n is the number of unique words in the document.
The state (topic assignment) of a word vertex w is drawn from Z = {z_i | i = 1, 2, ..., k}, where k is the number of topics.
In Equation (1), f(z) is the function defined on a word vertex; it is a probability measure because of constraints 1 and 2. f^(1)(z', z'') and f^(2)(z', z'') are the functions defined on the edge sets E_1 (distance-1 word pairs) and E_2 (distance-2 word pairs). f^(1) and f^(2) are not necessarily probability measures; however, by constraints 3 and 4, the product of an edge function and the vertex functions of its linked word pair must sum to 1 over all possible states. Hence f(z')f(z'')f^(1)(z', z'') and f(z')f(z'')f^(2)(z', z'') are probability measures. g is one sample of word topic assignments from graph G. If Equation (1) satisfies all four constraints, it is easy to verify that P(G) is also a probability measure, since it sums to 1 over all possible samples g.
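As a sketch only: assuming EGRF extends the GTRF definition of Li et al. (2014) by adding a second edge set, one form of Equation (1) that is consistent with the four constraints described above is shown below, followed by the normalization check. The exact wording of constraint 1, in particular whether it also bounds the edge functions, is an assumption here.

```latex
% Assumed form of Equation (1), inferred from the surrounding description.
P(G) = f_G(g) = \frac{1}{|E_1| + |E_2|} \prod_{w \in W} f(z_w)
  \left( \sum_{(w', w'') \in E_1} f^{(1)}(z_{w'}, z_{w''})
       + \sum_{(w', w'') \in E_2} f^{(2)}(z_{w'}, z_{w''}) \right)

\text{s.t.}\quad
\begin{aligned}
 &1.\ f(z) > 0 \\
 &2.\ \textstyle\sum_{z \in Z} f(z) = 1 \\
 &3.\ \textstyle\sum_{z', z'' \in Z} f(z')\, f(z'')\, f^{(1)}(z', z'') = 1 \\
 &4.\ \textstyle\sum_{z', z'' \in Z} f(z')\, f(z'')\, f^{(2)}(z', z'') = 1
\end{aligned}

% Normalization: for a fixed edge (w', w'') in E_i, every other vertex sums out
% to 1 by constraint 2, and the edge with its two endpoints sums to 1 by
% constraint 3 or 4, so each edge contributes exactly 1.
\sum_{g} P(g)
  = \frac{1}{|E_1| + |E_2|} \sum_{i=1}^{2} \sum_{(w', w'') \in E_i}
    \underbrace{\sum_{z_{w'}, z_{w''}} f(z_{w'})\, f(z_{w''})\, f^{(i)}(z_{w'}, z_{w''})}_{=\,1}
    \prod_{w \neq w', w''} \underbrace{\sum_{z_w} f(z_w)}_{=\,1}
  = \frac{|E_1| + |E_2|}{|E_1| + |E_2|} = 1
```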
We call the random field defined in Equation (1) an Extended Global Random Field (EGRF). EGRF has no normalization factor to compute, which makes it much simpler than models with an intractable normalizing factor.
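The claim that no normalization factor is needed can be checked numerically on a toy graph. The sketch below is illustrative only: it assumes P(g) is the product of vertex functions times the sum of edge functions over E_1 and E_2, divided by |E_1| + |E_2|, and the helper names (`normalize_edge_fn`, `egrf_prob`), the raw edge tables, and the four-vertex graph are all hypothetical.

```python
from itertools import product


def normalize_edge_fn(raw, f):
    """Rescale a raw edge table so that sum_{a,b} f[a] * f[b] * edge[a][b] == 1
    (constraints 3 and 4 as described in the text)."""
    k = len(f)
    mass = sum(f[a] * f[b] * raw[a][b] for a in range(k) for b in range(k))
    return [[raw[a][b] / mass for b in range(k)] for a in range(k)]


def egrf_prob(z, f, e1, e2, f1, f2):
    """Probability of one joint assignment z under the assumed EGRF form."""
    vertex_term = 1.0
    for state in z:
        vertex_term *= f[state]
    edge_term = sum(f1[z[u]][z[v]] for u, v in e1)
    edge_term += sum(f2[z[u]][z[v]] for u, v in e2)
    return vertex_term * edge_term / (len(e1) + len(e2))


# Hypothetical toy setting: 3 states, a 4-vertex chain, and its distance-2 pairs.
f = [0.5, 0.3, 0.2]                      # vertex function, sums to 1
e1 = [(0, 1), (1, 2), (2, 3)]            # distance-1 edges
e2 = [(0, 2), (1, 3)]                    # distance-2 edges
f1 = normalize_edge_fn([[3.0, 1.0, 1.0], [1.0, 3.0, 1.0], [1.0, 1.0, 3.0]], f)
f2 = normalize_edge_fn([[2.0, 1.0, 1.0], [1.0, 2.0, 1.0], [1.0, 1.0, 2.0]], f)

# Sum P(g) over every joint assignment of the 4 vertices.
total = sum(egrf_prob(z, f, e1, e2, f1, f2)
            for z in product(range(len(f)), repeat=4))
print(total)
```

Under these assumptions the printed total should equal 1 up to floating-point rounding, without dividing by any partition function.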

Topic Model Using EGRF
We define the Extended Global Topic Random Field based on EGRF. EGTRF is a generative probabilistic model. The basic idea is that documents are represented as mixtures of topics, and words are generated depending on the topic mixture and the graph structure of the current document. The generative process for the word sequence of a document is as follows:
For each document d in corpus D:
  Transform document d into a graph.
  Choose θ ∼ Dir(α).
  For each of the n words w_n in d:
    Choose a topic z_n ∼ P_egrf(z | θ).
    Choose a word w_n ∼ Multi(β_{z_n, w_n}).
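The step that transforms a document into a graph can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each parsed sentence is already available as a list of (head, dependent) word pairs, that vertices are unique words (so trees merge wherever they share a word), and that a distance-2 pair is any pair of non-adjacent words with a common neighbor. The function name `build_edge_sets` and the example parses are hypothetical.

```python
from itertools import combinations


def build_edge_sets(trees):
    """Merge dependency trees into one undirected graph over unique words.

    trees: list of sentences, each a list of (head, dependent) word pairs.
    Returns (e1, e2): sets of alphabetically sorted word pairs at distance 1
    and at distance exactly 2.
    """
    neighbors = {}
    for tree in trees:
        for head, dep in tree:
            if head == dep:
                continue  # ignore self-loops
            neighbors.setdefault(head, set()).add(dep)
            neighbors.setdefault(dep, set()).add(head)

    e1 = {tuple(sorted((u, v))) for u in neighbors for v in neighbors[u]}

    e2 = set()
    for middle in neighbors:
        # Any two neighbors of the same word are at most 2 apart.
        for u, v in combinations(sorted(neighbors[middle]), 2):
            if (u, v) not in e1:
                e2.add((u, v))
    return e1, e2


# Hypothetical parses of two sentences that share the word "model".
sentence_1 = [("model", "topic"), ("model", "documents")]
sentence_2 = [("assigns", "model"), ("assigns", "topics")]
e1, e2 = build_edge_sets([sentence_1, sentence_2])
print(sorted(e1))
print(sorted(e2))
```

In this toy input, the pair ("assigns", "documents") only becomes a distance-2 pair because the two trees are merged through the shared vertex "model", which is the kind of hidden relation the text describes.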
Given the Dirichlet prior α, the word distributions of topics β, the topic mixture of a document θ, the topic assignments z, and the words w, we obtain the marginal distribution of a document (Equation (2)). The marginal distribution is similar to that of LDA, except that the topic assignment of a word is sampled from the Extended Global Random Field instead of a multinomial, so word topic assignments are no longer conditionally independent. Following the EGRF described in Section 2.1, we define the probability of the topic sequence z in Equation (3), where the vertex function is f(z_w) = Multi(z_w | θ) (Equation (4)) and σ, used in the edge functions of Equations (5) and (6), is an indicator function that equals 1 if the topic assignments of the two words on an edge are the same. For Equation (3) to be an EGRF, it must satisfy all four constraints in Equation (1). Equation (4) defines the word vertex function as a multinomial distribution, and we assign λ_1, λ_2, λ_3, and λ_4 nonzero values, so constraints 1 and 2 are clearly satisfied. To satisfy constraints 3 and 4, we combine them with Equations (5) and (6) and obtain the relation between λ_1 and λ_2 (Equation (7)) and between λ_3 and λ_4 (Equation (8)).
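To make the relation between the λ parameters concrete, the following is a hedged sketch of how it can be derived. It assumes, following GTRF, that each edge function pays λ_1 (or λ_3) when the two topics agree and λ_2 (or λ_4) when they differ; the symbols σ_{=} and σ_{≠} and the exact layout of the equations are assumptions, not a reproduction of the original displays.

```latex
% Assumed edge functions (cf. Equations (5) and (6)):
f^{(1)}(z', z'') = \sigma_{z' = z''}\,\lambda_1 + \sigma_{z' \neq z''}\,\lambda_2,
\qquad
f^{(2)}(z', z'') = \sigma_{z' = z''}\,\lambda_3 + \sigma_{z' \neq z''}\,\lambda_4

% Assumed topic-sequence probability (cf. Equation (3)):
p(z \mid \theta) = \frac{1}{|E_1| + |E_2|} \prod_{w \in W} \mathrm{Multi}(z_w \mid \theta)
  \left( \sum_{(w', w'') \in E_1} f^{(1)}(z_{w'}, z_{w''})
       + \sum_{(w', w'') \in E_2} f^{(2)}(z_{w'}, z_{w''}) \right)

% Constraint 3 with the multinomial vertex function:
\sum_{z', z''} \theta_{z'}\,\theta_{z''}\, f^{(1)}(z', z'')
  = \lambda_1\,\theta^{T}\theta + \lambda_2\,(1 - \theta^{T}\theta) = 1
\;\Longrightarrow\;
\lambda_1 = \frac{1 - \lambda_2\,(1 - \theta^{T}\theta)}{\theta^{T}\theta}

% Constraint 4, by the same argument:
\lambda_3 = \frac{1 - \lambda_4\,(1 - \theta^{T}\theta)}{\theta^{T}\theta}
```

Under this form, choosing λ_2 < 1 forces λ_1 > 1, so agreeing topics on a distance-1 edge are rewarded, while choosing λ_4 > 1 forces λ_3 < 1.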

Word Similarity Information
A coherent edge is an edge whose two linked words have the same topic. For the distance-i edge set, i = 1, 2, E_i^C contains all coherent edges and E_i^NC contains all non-coherent edges. Equation (3) can then be rewritten as Equation (9), where from the second line to the third line we express λ_1 and λ_3 as functions of λ_2 and λ_4 based on Equations (7) and (8). The expected number of edges in E_i^C is computed in Equation (10), where φ denotes the K-dimensional variational multinomial parameters, which can be thought of as the approximate posterior probability of each topic assignment for a word, and S_{w1,w2} is the similarity measure between words w_1 and w_2.
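The expected number of coherent edges can be illustrated with a short sketch. Under a mean-field distribution in which each word has its own topic distribution φ, the probability that two words share a topic is the dot product of their φ vectors. How the similarity S enters Equation (10) is not recoverable from this text, so the multiplicative weighting below is an assumption, and the function name `expected_coherent_edges` is hypothetical.

```python
def expected_coherent_edges(phi, edges, sim=None):
    """Expected number of edges whose endpoints share a topic.

    phi:   dict mapping word vertex -> list of K topic probabilities.
    edges: iterable of (u, v) vertex pairs.
    sim:   optional dict mapping (u, v) -> similarity weight; weighting the
           agreement probability by it is an assumption, not the paper's
           confirmed Equation (10).
    """
    total = 0.0
    for u, v in edges:
        # Probability that u and v are assigned the same topic under q.
        agreement = sum(pu * pv for pu, pv in zip(phi[u], phi[v]))
        weight = 1.0 if sim is None else sim[(u, v)]
        total += weight * agreement
    return total


# Hypothetical variational parameters for three words and two topics.
phi = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.5, 0.5]}
edges = [(0, 1), (1, 2)]
unweighted = expected_coherent_edges(phi, edges)
weighted = expected_coherent_edges(phi, edges, sim={(0, 1): 0.5, (1, 2): 1.0})
print(unweighted, weighted)
```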
As discussed in Section 1, the word similarity S_{w1,w2} serves as a confidence score for how likely two words on an edge are to have the same topic. We assume that two words are more likely to have the same topic if they have a higher similarity score. To obtain the similarity score between words, we use the word2vec tool to look up the representation of each word in the pre-trained model. The word representations are computed using neural networks, and the learned representations explicitly encode many linguistic regularities and patterns from the corpus. The normalized similarity between word vectors can be regarded as a confidence score for how likely two words are to have the same topic. In this way, knowledge from a large corpus beyond the current document collection is incorporated to guide topic modeling.
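A minimal sketch of such a confidence score is given below. The text does not specify how the similarity is normalized, so mapping cosine similarity from [-1, 1] to [0, 1] is an assumption made here for illustration; the function names and the toy vectors are hypothetical, and real vectors would come from the pre-trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def normalized_similarity(u, v):
    """Map cosine similarity from [-1, 1] to [0, 1].

    This particular normalization is an assumption; the paper only says the
    similarity is normalized.
    """
    return (cosine_similarity(u, v) + 1.0) / 2.0


# Toy 3-dimensional vectors standing in for 300-dimensional word vectors.
same = normalized_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
orthogonal = normalized_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = normalized_similarity([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0])
print(same, orthogonal, opposite)
```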

Posterior Inference and Parameter Estimation
We derive variational inference for posterior inference. The variational distribution q is the same as in the original LDA paper (Blei et al., 2003). All terms in the likelihood function except P(z | θ) are also the same as in LDA. Based on Equation (9), we obtain Equation (11), whose approximation comes from a Taylor series, where ζ_1 and ζ_2 are the Taylor approximation parameters. E_q(|E_i^C|) is obtained directly from Equation (10), and E_q(θ^T θ) follows from the properties of the Dirichlet distribution. The update rules for α and β are the same as in LDA. γ is updated using Newton's method, since no closed-form update rule for γ is available, and φ is updated using an approximate rule. The EM algorithm is applied using the above update rules. In the E-step, we estimate the best γ and φ given the current α and β. In the M-step, we update α and β based on the obtained γ and φ. We iterate these steps until convergence.
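The Dirichlet property used for E_q(θ^T θ) can be written out explicitly. For θ ∼ Dir(γ) with γ_0 = Σ_k γ_k, the second moment of each component is E[θ_k^2] = γ_k(γ_k + 1) / (γ_0(γ_0 + 1)), so E[θ^T θ] is the sum of these terms. The sketch below only illustrates that identity; the function name is hypothetical, and it is not claimed to match the authors' code.

```python
def expected_theta_dot_theta(gamma):
    """E[theta^T theta] for theta ~ Dirichlet(gamma).

    Uses E[theta_k^2] = gamma_k * (gamma_k + 1) / (gamma_0 * (gamma_0 + 1)),
    where gamma_0 is the sum of all gamma_k.
    """
    gamma_0 = sum(gamma)
    numerator = sum(g * (g + 1.0) for g in gamma)
    return numerator / (gamma_0 * (gamma_0 + 1.0))


# With gamma = [1, 1], theta_1 is uniform on [0, 1], so
# E[theta_1^2 + (1 - theta_1)^2] = 1/3 + 1/3 = 2/3.
flat = expected_theta_dot_theta([1.0, 1.0])
# A sharply peaked variational posterior gives a value close to 1.
peaked = expected_theta_dot_theta([100.0, 0.01])
print(flat, peaked)
```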

Experiment
In this section, we study the empirical performance of EGTRF on two datasets. For each dataset, we remove very short documents and build the vocabulary by removing stop words, rare words, and very frequent words. Eighty percent of the data is used for training and the rest for testing.
• 20 News Groups: After processing, it contains 13706 documents with a vocabulary of 5164 terms.
We evaluate how well a model fits the data using held-out perplexity (Blei et al., 2003) and the predictive distribution (Hoffman et al., 2013). Lower perplexity and higher log predictive probability indicate better generalization performance. We implement GTRF without the self-defined edges added in the original paper, and set λ_2 = 0.2 to give a higher reward to edges in E_1 whose two word vertices have the same topic. In EGTRF, we set λ_4 = 1.2 to give a lower (even negative) reward to edges in E_2 whose two word vertices have the same topic, since distance-1 words are expected to have a greater topical effect than distance-2 words. Each word is represented by its vector from the model pre-trained on the Google News dataset; when a word does not appear in that model, we use the word vector learned from the original corpus.
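The effect of these settings can be illustrated numerically. The sketch below assumes the relation λ_same · θ^T θ + λ_diff · (1 − θ^T θ) = 1 between the reward for agreeing topics and the reward for differing topics, which is an inferred form of the constraint rather than a quoted equation; the function name `coherent_reward` and the two example topic mixtures are hypothetical.

```python
def coherent_reward(lam_diff, theta):
    """Reward for an edge whose endpoints share a topic, given the reward
    lam_diff for differing topics and the document topic mixture theta.

    Assumes lam_same * t + lam_diff * (1 - t) = 1 with t = theta^T theta.
    """
    t = sum(p * p for p in theta)
    return (1.0 - lam_diff * (1.0 - t)) / t


uniform_theta = [0.1] * 10       # flat mixture over 10 topics
peaked_theta = [0.9, 0.1]        # mixture dominated by one topic

# lambda_2 = 0.2 on E_1 and lambda_4 = 1.2 on E_2, as in the experiments.
lam1_uniform = coherent_reward(0.2, uniform_theta)
lam3_uniform = coherent_reward(1.2, uniform_theta)
lam1_peaked = coherent_reward(0.2, peaked_theta)
lam3_peaked = coherent_reward(1.2, peaked_theta)
print(lam1_uniform, lam3_uniform, lam1_peaked, lam3_peaked)
```

Under this assumed relation, the distance-1 reward stays above 1 and the distance-2 reward stays below 1 for both mixtures, and the distance-2 reward turns negative for the flat mixture, which matches the "lower (even negative)" description.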
We use 10, 20, 30, and 50 topics for the 20 News Groups dataset and 10, 15, 20, and 25 topics for the NIPS dataset. Figure 2 shows the experimental results of four models on the two datasets: lda, gtrf, egtrf (EGTRF without word similarity information), and egtrf+s (EGTRF with word similarity information). The results show that EGTRF generally outperforms LDA and GTRF, and that EGTRF with word similarity information achieves the best performance.
We believe that modeling distance-2 word vertices exploits more semantic and syntactic word dependencies in a document, and that word similarity information obtained from a large corpus can compensate for insufficient information in the original corpus. Therefore, adding the influence of distance-2 word vertices and word similarity information can improve topic modeling performance.
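For reference, the held-out perplexity used in this section follows the definition in Blei et al. (2003): the exponential of the negative total log-likelihood of the test documents divided by their total number of words. The sketch below assumes per-document log-likelihoods are already available; in practice they are typically replaced by a variational lower bound, and the function name is hypothetical.

```python
import math


def held_out_perplexity(doc_log_likelihoods, doc_lengths):
    """exp(-sum_d log p(w_d) / sum_d N_d) over the held-out documents."""
    total_words = sum(doc_lengths)
    if total_words == 0:
        raise ValueError("held-out set contains no words")
    return math.exp(-sum(doc_log_likelihoods) / total_words)


# Sanity check: a model that assigns probability 1/4 to every word of a
# 4-term vocabulary should have perplexity 4, whatever the document lengths.
lengths = [3, 5]
log_liks = [n * math.log(0.25) for n in lengths]
ppl = held_out_perplexity(log_liks, lengths)
print(ppl)
```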

Conclusion
In this paper, we extended the Global Topic Random Field (GTRF) and proposed a novel topic model, Extended Global Topic Random Field (EGTRF), which models dependency relations between adjacent words and distance-2 words. Word topics are drawn from the Extended Global Random Field (EGRF) instead of a multinomial, so the conditional independence of word topic assignments is relaxed. Word similarity information learned from a large corpus is incorporated into the model. Experiments on two datasets show that EGTRF achieves better performance than GTRF and LDA, which supports our assumption that adding topical dependencies of distance-2 words and incorporating word similarity information can improve model performance.