Reinforcing the Topic of Embeddings with Theta Pure Dependence for Text Classification

For sentiment classification, it is often recognized that embeddings based on the distributional hypothesis are weak in capturing sentiment contrast: contrasting words may have similar local contexts. Based on broader context, we propose to incorporate Theta Pure Dependence (TPD) into the Paragraph Vector method to reinforce topical and sentimental information. TPD has a theoretical guarantee that the word dependency is pure, i.e., the dependence pattern has an integral meaning whose underlying distribution cannot be conditionally factorized. Our method outperforms the state of the art on text classification tasks.


Introduction
Word embeddings can be learned by training a neural probabilistic language model or a unified neural network architecture for various NLP tasks (Bengio et al., 2003; Collobert and Weston, 2008; Collobert et al., 2011). In the global context-aware neural language model (Huang et al., 2012), the global context vector is a weighted average of all word embeddings of a single document/paragraph. After training with all word embeddings belonging to the current paragraph, a resulting Paragraph Vector can be obtained. In fact, Le and Mikolov's Paragraph Vector (Le and Mikolov, 2014) is trained based on the log-linear neural language model (Mikolov et al., 2013a).
For text classification, a straightforward extension of a language model (e.g., Le and Mikolov's Paragraph Vector) is not considered sensible. Embeddings learned for text classification should be very different from those learned for language modeling. For example, language models often calculate the probability of a sentence, so "this is a good movie" and "this is a bad movie" may not be discriminated from each other. In the sentiment analysis task, the semantic representation of words needs to tell good from bad, even if the two words have the same local context. For this reason, local dependency is insufficient to model topical or sentiment information. Fortunately, if we have a global context for good, such as interesting or amazing, the sentiment meaning of the embedding becomes explicit. However, the training of the log-linear neural language model is based on local word dependencies (e.g., the co-occurrence of words in a local window). Thus, Paragraph Vector cannot explicitly model the dependencies between words that do not frequently appear together in a local window but are actually closely dependent on each other.
* Corresponding authors: Yuexian Hou and Peng Zhang.
In this paper, our aim is to extend the Paragraph Vector with global context that can capture topical or sentiment information effectively. However, if one explicitly considers dependency patterns beyond the local window level, noisy dependency patterns may be included and modeled by the distributed representation method. Moreover, each pattern should carry a unique and explicit topical meaning to guarantee that the global context is unambiguous. Therefore, we need a dependency mining method that not only models long-range dependency patterns, but also provides a theoretical guarantee that the dependency patterns are pure. Here, a "pure" dependency pattern is an integral semantic meaning/concept that cannot be factorized into sub-dependency patterns.
In the language of statistics, Conditional Pure Dependence (CPD) means that the underlying distribution of the dependency patterns cannot be factorized under certain conditions (e.g., priors, observed words, etc.). It has been proved in Hou et al. (2013) that CPD is the high-level pure dependence. However, judging CPD is NP-hard (Chickering et al., 2004). Fortunately, Theta Pure Dependence (TPD) is a sufficient criterion for CPD and can be identified in O(N) time, where N is the number of words (Hou et al., 2013). This finding motivates us to adopt TPD as the global context. Moreover, compared with conventional co-occurrence-based methods, such as the Apriori algorithm (Agrawal et al., 1993), TPD, which is based on the Information Geometry (IG) framework, has a solid theoretical interpretation in statistics that guarantees the dependence is pure.

Modeling Topic with TPD
Compared with local context, global context can usually capture the topic of a text more precisely. Local context is easy to obtain with a sliding window: we define the centered word as the current word and the other words in the window as local context words. Global context words are extracted from all the documents in the corpus and can be divided into two parts: a) words in the current document but outside the local context window; b) words that never appear in the document but do appear in the corpus. The following example shows the words mentioned above, and the topic (the scene of filming) is easily captured by TPD:
• TPD: scene camera acting movie
  Text: there [is great atmosphere in the scene from the location , the] lighting , the fog and such , but the camera should be slowly following the killer. . .
The brackets mark the local context window. The window size is 5, i.e., there are five local context words on each side of the current word (scene). The global context words come from the TPD pattern (e.g., camera, which lies outside the window). In order to model the topic explicitly, the dependence pattern should report one and only one topical meaning. TPD has a theoretical guarantee that the dependency has an integral meaning whose underlying distribution cannot be conditionally factorized. Formally, let $X = \{X_1, \dots, X_n\}$ be a set of binary random variables, where $X_i$ denotes the occurrence ($X_i = 1$) or absence ($X_i = 0$) of the i-th word. Then the n-order TPD over $X$ can be defined as follows.
DEFINITION 1. (TPD): $X = \{X_1, \dots, X_n\}$ is of n-order Theta Pure Dependence (TPD), iff the n-order θ coordinate $\theta_{12 \dots n}$ is significantly different from zero. (Hou et al., 2013)
TPD can be effectively identified by an explicit statistical test procedure: the Log Likelihood Ratio Test (LLRT) (Nakahara and Amari, 2002) for the θ-coordinate of IG (Hou et al., 2013).
Here, we introduce two negative examples to further emphasize the importance of utilizing TPD. Example 1: can, with, of. The joint distribution of this word combination can be unconditionally factorized directly, since the occurrence of any one word does not necessarily imply the occurrence of the others. Example 2: London, Chelsea, Sherlock Holmes. Both Chelsea and Sherlock Holmes are closely related to London, but Chelsea and Sherlock Holmes are two relatively independent topics, i.e., they are conditionally independent given London. Although the three phrases are unconditionally dependent, their joint distribution can be conditionally factorized. Thus the dependency in neither example is pure.
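To make the n-order θ coordinate of Definition 1 concrete, the following minimal sketch computes it from a joint distribution over word-occurrence indicators, using the standard log-linear expansion in which the highest-order coefficient is an alternating sum of log-probabilities. The function names (`joint_from_docs`, `top_theta`, `is_tpd_candidate`), the additive smoothing `alpha`, and the plain threshold `theta_0` are assumptions made for illustration. In particular, comparing |θ| against a threshold is only a crude stand-in for the Log Likelihood Ratio Test, which is not reproduced here.

```python
import math
from itertools import product


def joint_from_docs(docs, words, alpha=0.5):
    """Estimate the joint distribution of word-occurrence indicators.

    docs  : list of token collections, one per document
    words : tuple of the n words forming the candidate pattern
    alpha : additive smoothing (an assumption; keeps every cell positive)
    """
    n = len(words)
    counts = {x: alpha for x in product((0, 1), repeat=n)}
    for doc in docs:
        present = set(doc)
        counts[tuple(int(w in present) for w in words)] += 1
    total = sum(counts.values())
    return {x: c / total for x, c in counts.items()}


def top_theta(p, n):
    """n-order theta coordinate of the log-linear expansion of p.

    theta_{12...n} = sum over x in {0,1}^n of (-1)^(n - |x|) * log p(x),
    where |x| is the number of ones in x.
    """
    return sum(
        (-1) ** (n - sum(x)) * math.log(p[x])
        for x in product((0, 1), repeat=n)
    )


def is_tpd_candidate(p, n, theta_0):
    """Crude stand-in for the LLRT: flag patterns with |theta| > theta_0."""
    return abs(top_theta(p, n)) > theta_0
```

For two words, `top_theta` reduces to the log odds ratio log(p11 p00 / (p10 p01)), which is zero exactly when the two occurrence indicators are independent.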
To explain TPD and the characteristic "pure" intuitively, let us look at a typical example of TPD: climate, conference, Copenhagen. Compared with the two negative examples introduced above, the co-occurrence of these three words implies an inseparable high-level semantic entity.
In the negative examples, the high frequency of word co-occurrence can be explained as some kind of "coincidence", because each of the words, or their pairwise combinations, independently has a high frequency. However, the co-occurrence of TPD words cannot be fully explained as a random coincidence of, e.g., the co-occurrence of Copenhagen and conference (which could be any other conference in Copenhagen) and the occurrence of climate.
The word "pure" in Hou et al. (2013) means that the joint probability distribution of these words is significantly different from the product of lower-order joint distributions or marginal distributions, w.r.t. all possible decompositions. More formally, it requires that the joint distribution cannot be factorized unconditionally (UPD) or conditionally (CPD), in the language of graphical models. Let $\mathbf{x} = [x_1, \dots, x_n]^T$ and let $p(\mathbf{x})$ be the joint probability distribution over $X$. Then the definitions of UPD and CPD are as follows:
DEFINITION 2. (UPD): $X = \{X_1, \dots, X_n\}$ is of n-order Unconditional Pure Dependence (UPD), iff it can NOT be unconditionally factorized, i.e., there does NOT exist a k-partition $\{C_1, \dots, C_k\}$ of $X$, $k > 1$, such that $p(\mathbf{x}) = p(c_1) \cdot p(c_2) \cdots p(c_k)$, where $p(c_i)$, $i = 1, 2, \dots, k$, is the joint distribution over $C_i$. (Hou et al., 2013)
DEFINITION 3. (CPD): $X = \{X_1, \dots, X_n\}$ is of n-order Conditional Pure Dependence (CPD), iff it can NOT be conditionally factorized, i.e., there does NOT exist $C_0 \subset X$ and a k-partition $\{C_1, \dots, C_k\}$ of $V = X - C_0$, $k > 1$, such that $p(v \mid c_0) = p(c_1 \mid c_0) \cdot p(c_2 \mid c_0) \cdots p(c_k \mid c_0)$, where $p(v \mid c_0)$ is the conditional joint distribution over $V$ given $C_0$, and $p(c_i \mid c_0)$, $i = 1, 2, \dots, k$, is the conditional joint distribution over $C_i$ given $C_0$. In case that $C_0$ is an empty set, we define $p(c_0) = 1$. (Hou et al., 2013)
CPD is stricter than UPD, and a dependence that satisfies only UPD is not pure enough to model the global context. Therefore, "pure" in our paper refers to the characteristic of CPD. However, judging CPD is NP-hard. It is proved in Hou et al. (2013) that a significant nonzero n-order θ parameter (TPD) entails the n-order CPD/UPD. The highest-order coordinate parameter in IG is a proper metric for the purity (i.e., the unique semantics) of high-order dependence. A pattern is TPD iff the n-order θ coordinate $\theta_{12 \dots n}$ is significantly different from zero. Moreover, the Log Likelihood Ratio Test implemented in the mixed coordinates can test whether $\theta_{12 \dots n}$ is significantly different from zero.
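The relation between conditional factorization and the θ coordinate can be checked numerically on a toy version of the London, Chelsea, Sherlock Holmes example. The sketch below builds a three-variable joint distribution of the form p(a) p(b|a) p(c|a), so that B and C are conditionally independent given A, and evaluates the third-order θ coordinate with the usual alternating-sum formula. All probabilities are hypothetical numbers chosen for illustration, and the helper names are assumptions. The sketch covers only this one instance, with C_0 = {A}, and is not a proof of the general result in Hou et al. (2013).

```python
import math
from itertools import product


def top_theta(p, n):
    """Highest-order theta coordinate: alternating sum of log-probabilities."""
    return sum(
        (-1) ** (n - sum(x)) * math.log(p[x])
        for x in product((0, 1), repeat=n)
    )


def bernoulli(q, x):
    """Probability of outcome x (0 or 1) when P(x = 1) = q."""
    return q if x == 1 else 1.0 - q


def conditionally_factorized_joint(p_a, p_b_given_a, p_c_given_a):
    """Build p(a, b, c) = p(a) * p(b | a) * p(c | a)."""
    return {
        (a, b, c): bernoulli(p_a, a)
        * bernoulli(p_b_given_a[a], b)
        * bernoulli(p_c_given_a[a], c)
        for a, b, c in product((0, 1), repeat=3)
    }


def marginal_bc(p):
    """Marginal distribution over (b, c), summing out a."""
    out = {}
    for (a, b, c), v in p.items():
        out[(b, c)] = out.get((b, c), 0.0) + v
    return out


# Hypothetical numbers: A = London, B = Chelsea, C = Sherlock Holmes.
joint = conditionally_factorized_joint(0.2, {0: 0.01, 1: 0.4}, {0: 0.02, 1: 0.3})
theta_123 = top_theta(joint, 3)            # vanishes up to rounding error
theta_bc = top_theta(marginal_bc(joint), 2)  # clearly positive
```

Here B and C are unconditionally dependent (`theta_bc` is well above zero), yet the third-order coordinate vanishes because every term of log p(a, b, c) ignores at least one of the three variables. This is the situation the purity requirement is meant to exclude.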
In contrast to TPD, the semantic coupling among the words in the two negative examples is much weaker. In summary, can, with, of cannot give an explicit topic, and London, Chelsea, Sherlock Holmes includes at least two topics, whereas the co-occurrence of words in a TPD pattern (e.g., climate, conference, Copenhagen) implies an inseparable (pure) high-level semantic entity. A sufficient and unbroken meaning of dependence can not only supply the context but also avoid ambiguity (or noise) in the global context. Therefore, purity is important in such a global context modeling method.
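The distinction between local and global context drawn in this section can be sketched in a few lines. The example reuses the movie review excerpt and the TPD pattern scene, camera, acting, movie from above. The function names and the representation of TPD patterns as word tuples are assumptions for illustration, not the authors' implementation.

```python
def local_context(tokens, t, window=5):
    """Words within `window` positions on each side of the current word tokens[t]."""
    left = tokens[max(0, t - window):t]
    right = tokens[t + 1:t + 1 + window]
    return left + right


def global_context(current_word, tpd_patterns):
    """Other words of every TPD pattern that contains the current word.

    tpd_patterns : iterable of word tuples mined from the whole corpus
                   beforehand, so they may include words that never occur
                   in the current document.
    """
    return [
        [w for w in pattern if w != current_word]
        for pattern in tpd_patterns
        if current_word in pattern
    ]


text = (
    "there is great atmosphere in the scene from the location , the "
    "lighting , the fog and such , but the camera should be slowly "
    "following the killer"
).split()
t = text.index("scene")
print(local_context(text, t))
print(global_context("scene", [("scene", "camera", "acting", "movie")]))
```

With a window of 5, the local context of scene is exactly the bracketed span of the example, while the global context contributes camera, acting and movie regardless of their distance from the current word.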

Global PV-DBOW and Dependence Vectors
PV-DBOW, a version of Paragraph Vector (Le and Mikolov, 2014), is extended with TPD into a new model: Global PV-DBOW (Glo-PV-DBOW). TPD patterns are extracted from the corpus before training. Given a sequence of training words $w_1, w_2, w_3, \dots, w_T$ and the global context $glo_t$ of $w_t$, the objective of Glo-PV-DBOW is to maximize the average log probability:
[equation (1) not recovered]
where $c$ is the local context window size. The indicator of the document that the current word $w_t$ belongs to is denoted by $doc_t$. Further, we define $p(w_t \mid glo_t)$ in equation (2):
[equation (2) not recovered]
The indicator of the a-th TPD pattern of $w_t$ is denoted as $dep^a_t$ and can be trained to be a distributed representation of TPD: the dependence vector $v_{dep^a_t}$. This (N+1)-order TPD consists of N+1 words: $w^a_1, w^a_2, \dots, w^a_N$ and $w_t$. The energy function of $w_t$ and $w_i = (w_{t+j}, doc_t, dep^a_t)$ takes a uniform form:
[equation not recovered]
We define the energy function of TPD words:
[equation not recovered]
The resulting predictive distributions are given by $p(w_t \mid w^a_1, w^a_2, \dots, w^a_N)$:
[equation not recovered]
Hierarchical softmax (Morin and Bengio, 2005) is adopted to reduce the cost of computation. The binary tree is specified with a Huffman tree, and the Huffman code of the pseudo words $m_i$ in the Huffman path of $w_t$ is denoted as $x_{m_i}$. For more about the hierarchical softmax we used, please refer to Mikolov et al. (2013b). Using stochastic gradient descent (SGD), distributed representations of the word, dependence and document are trained. The update procedure of $v_{w_i} = (v_{w_{t+j}}, v_{doc_t}, v_{dep^a_t})$ is the same as the procedure described in Mikolov et al. (2013b). Thus, only the pseudo code for training TPD words is listed:
... do $v_{w^a_n}$ += err
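Because only a fragment of the training pseudo code survives, the following is a hedged sketch of what one SGD step for the TPD words could look like, not the authors' listing. It follows the hierarchical softmax update convention of word2vec (Mikolov et al., 2013b), in which the label at each inner node is one minus the Huffman code bit. Representing the pattern by the sum of its word vectors is an assumption, as are the function names and the learning rate. The final loop corresponds to the surviving fragment, in which every TPD word vector is incremented by the accumulated error.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_tpd_pattern(tpd_vecs, path_nodes, path_code, node_vecs, lr=0.025):
    """One SGD step predicting the current word w_t from its TPD words.

    tpd_vecs   : vectors of the TPD words w_1^a ... w_N^a (updated in place)
    path_nodes : indices of the inner (pseudo-word) nodes on w_t's Huffman path
    path_code  : Huffman code bits (0 or 1) for those nodes
    node_vecs  : vectors of the inner nodes (updated in place)
    """
    dim = len(tpd_vecs[0])
    # Assumption: the pattern is represented by the sum of its word vectors.
    h = [sum(v[k] for v in tpd_vecs) for k in range(dim)]
    err = [0.0] * dim
    for m, bit in zip(path_nodes, path_code):
        node = node_vecs[m]
        f = sigmoid(sum(h[k] * node[k] for k in range(dim)))
        g = lr * (1 - bit - f)      # word2vec convention: label = 1 - code bit
        for k in range(dim):
            err[k] += g * node[k]   # accumulate the gradient for the input side
            node[k] += g * h[k]     # update the inner-node vector
    for v in tpd_vecs:              # every TPD word vector receives += err
        for k in range(dim):
            v[k] += err[k]


def path_log_prob(tpd_vecs, path_nodes, path_code, node_vecs):
    """Log probability of w_t's Huffman path given the summed TPD vectors."""
    dim = len(tpd_vecs[0])
    h = [sum(v[k] for v in tpd_vecs) for k in range(dim)]
    lp = 0.0
    for m, bit in zip(path_nodes, path_code):
        f = sigmoid(sum(h[k] * node_vecs[m][k] for k in range(dim)))
        lp += math.log(f if bit == 0 else 1.0 - f)
    return lp
```

Repeated calls on the same pattern should raise `path_log_prob`, which gives a quick sanity check that the sign conventions of the update are consistent.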

Experiments
For contrast, Apriori (not a pure dependency method) is also used to implement Glo-PV-DBOW. Glo-PV-DBOW-TPD and Glo-PV-DBOW-Apri are both evaluated on two text classification tasks: sentiment analysis and topic discovery. The suffix (e.g., -2, -5) of each global method name denotes the order of dependency (the number of words in a dependence pattern). We vary the order of dependency to show the superiority of high-order TPD. High-order TPD provides a richer and more explicit global context than lower-order TPD, since high-order TPD cannot be reduced to the random coincidence of lower-order dependencies. We cross-validate the hyperparameters and set the local context window size to 10 and the dimension of embeddings to 100. In the sentiment analysis task, Apriori's minimum support and TPD's $\theta_0$ are set to 0.004 and 1.4, respectively. In the topic discovery task, Apriori's minimum support and TPD's $\theta_0$ are around 0.020 and 2.0, respectively. Since the classification accuracy of each compared approach is a single result, we do not include significance tests for our method and only report the average accuracy.
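For readers unfamiliar with the contrast method, the following is a minimal sketch of level-wise Apriori mining over documents treated as word sets. It selects patterns purely by support, with no test of whether the joint distribution factorizes, which is the property the comparison with TPD is meant to expose. The function name, the `max_order` argument and the data layout are assumptions for illustration and do not describe the implementation used in the experiments.

```python
from itertools import combinations


def apriori(transactions, min_support, max_order):
    """Frequent itemsets of up to `max_order` words, mined level by level.

    transactions : list of sets of words (one set per document)
    min_support  : minimum fraction of documents containing the itemset
    Returns {frozenset(itemset): support}.
    """
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single words.
    level = {}
    for w in {w for t in transactions for w in t}:
        s = support(frozenset([w]))
        if s >= min_support:
            level[frozenset([w])] = s
    frequent = dict(level)

    k = 2
    while level and k <= max_order:
        prev = list(level)
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        level = {}
        for c in candidates:
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1)):
                s = support(c)
                if s >= min_support:
                    level[c] = s
        frequent.update(level)
        k += 1
    return frequent
```

A pattern of very common words such as can, with, of would pass a support threshold easily, even though its joint distribution factorizes.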

Sentiment Analysis on Movie Reviews
We conduct binary sentiment classification on the IMDB dataset proposed by Maas et al. (2011). Results in Fig. 1 show that the performance of the global methods is more stable than that of PV-DBOW. Moreover, TPD works much better than Apriori, especially for high-order dependence. Note that TPD-5 works better than TPD-2, while Apri-5 works worse than Apri-2. This can be explained by the fact that the Apriori algorithm lacks an explicit statistical test procedure to guarantee pure dependence. Therefore, the Apriori algorithm is not suitable for generating high-order dependence. In contrast, high-order TPD can provide a rich and explicit global context for the model. Meanwhile, the results verify that our method is good at capturing sentiment contrast. Table 1 shows that Glo-PV-DBOW with 5-order TPD achieves state-of-the-art performance, an improvement of more than 2% over the result published in Le and Mikolov (2014). Note that the algorithmic process of Paragraph Vector (Le and Mikolov, 2014) is much more complex than that of PV-DBOW: Paragraph Vector includes an extra inference stage. In addition, Paragraph Vector's document vector is a combination of two vectors: one learned by PV-DBOW and the other learned by the Distributed Memory Model of Paragraph Vectors (PV-DM) (Le and Mikolov, 2014). The combined document vector has 800 dimensions, while all vectors in our experiments have only 100 dimensions.

Topic Discovery on News
The 20 Newsgroups dataset is a collection of approximately 20,000 newsgroup documents, partitioned across 20 different newsgroups. Following Crammer et al. (2012), we create binary decision problems from the dataset, each consisting of choosing between two similar groups. The dataset is therefore split into two sub-datasets as follows: comp: comp.sys.ibm.pc.hardware vs. comp.sys.mac.hardware, and sci: sci.electronics vs. sci.med. Similarly, 1800 examples balanced between the two labels were selected for each problem.
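The construction of a balanced binary problem can be sketched as follows. The sketch assumes the corpus is already available as a list of (newsgroup name, text) pairs. The function name, the random seed, and the choice of uniform sampling without replacement are assumptions, since the exact selection procedure is not described here.

```python
import random


def balanced_binary_problem(docs, group_a, group_b, n_total=1800, seed=0):
    """Select n_total examples, half from each of two newsgroups.

    docs : list of (newsgroup_name, text) pairs
    Returns a shuffled list of (text, label), label 0 for group_a, 1 for group_b.
    """
    rng = random.Random(seed)
    per_label = n_total // 2
    examples = []
    for label, group in enumerate((group_a, group_b)):
        pool = [text for name, text in docs if name == group]
        if len(pool) < per_label:
            raise ValueError(f"not enough documents in {group}")
        examples += [(text, label) for text in rng.sample(pool, per_label)]
    rng.shuffle(examples)
    return examples
```

For the comp problem, `group_a` and `group_b` would be comp.sys.ibm.pc.hardware and comp.sys.mac.hardware, giving 900 examples per label.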

Analysis on Word Embeddings
The cosine similarity of each word pair in 20 Newsgroups is computed. We list four center words and their nearest neighbors in PV-DBOW and Glo-PV-DBOW groups respectively. The rankings are labeled in front of neighbor words, and some notable neighbor words are in bold.
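The neighbor rankings described above can be reproduced with a plain cosine similarity search. The sketch below is illustrative only: the function names and the dictionary layout of the embeddings are assumptions, and the toy vectors in the usage note are made up rather than taken from the trained models.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(center, embeddings, k=10):
    """Rank every other word by cosine similarity to `center`.

    embeddings : dict mapping word -> vector
    Returns a list of (rank, word, similarity), best first.
    """
    target = embeddings[center]
    scored = [
        (cosine(target, vec), word)
        for word, vec in embeddings.items()
        if word != center
    ]
    scored.sort(key=lambda pair: -pair[0])
    return [(rank + 1, word, sim) for rank, (sim, word) in enumerate(scored[:k])]
```

For instance, with made-up vectors `{"ibm": [1.0, 0.0], "pc": [0.9, 0.1], "mac": [0.0, 1.0]}`, calling `nearest_neighbors("ibm", ...)` ranks pc first and mac second.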
From Table 3, we can see that statistical information about the corpus, such as word co-occurrence, can be mined by TPD. Therefore, the embeddings of Glo-PV-DBOW are context-aware, which helps considerably in classification tasks. We investigate the top 40 nearest neighbors of ibm and find that macintosh and mac appear in the PV-DBOW group but not in the Glo-PV-DBOW group. In the corpus, the topic of each document is either ibm or mac. If we perform a classification task on "ibm versus mac", it will be hard to classify with the PV-DBOW embeddings, because PV-DBOW tends to regard both ibm and mac simply as computers. In Glo-PV-DBOW, however, the two computer brands are distinguished. Further, ibm and mac rarely co-occur in one document, and this statistical information is captured by TPD.

Conclusion
This paper proposes to incorporate Theta Pure Dependence into Paragraph Vector to capture more topical and sentimental information in the context. The extended model is applied to a sentiment classification task and a topic discovery task. Our method outperforms the state-of-the-art results on the movie and news datasets. The approach can be improved further to fully leverage the un-factorized sense of high-order Theta Pure Dependence. In the future, we will explore applications of the distributed representation of dependence.