Efficient Learning for Undirected Topic Models

The Replicated Softmax model, a well-known undirected topic model, is powerful in extracting semantic representations of documents. Traditional learning strategies such as Contrastive Divergence are very inefficient. This paper provides a novel estimator based on Noise Contrastive Estimate that speeds up learning and is extended to documents of varying lengths and weighted inputs. Experiments on two benchmarks show that the new estimator achieves high learning efficiency and high accuracy on document retrieval and classification.


Introduction
Topic models are powerful probabilistic graphical approaches for analyzing document semantics in applications such as document categorization and information retrieval. Most are built on directed structures such as pLSA (Hofmann, 2000) and LDA (Blei et al., 2003). Accompanied by the vast developments in deep learning, several undirected topic models, such as (Srivastava et al., 2013), have recently been reported to achieve great improvements in efficiency and accuracy.
The Replicated Softmax model (RSM), a typical undirected topic model, is composed of a family of Restricted Boltzmann Machines (RBMs). Commonly, RSM is learned like standard RBMs using approximate methods such as Contrastive Divergence (CD). However, CD is not really designed for RSM. Different from RBMs with binary input, RSM adopts softmax units to represent words, which makes the sampling inside CD very inefficient, especially for a large vocabulary. Yet NLP systems usually require vocabulary sizes of tens to hundreds of thousands, which seriously limits the application of RSM.
Dealing with the large vocabulary size of the inputs is a serious problem in deep-learning-based NLP systems. Bengio et al. (2003) pointed this problem out when normalizing the softmax probability in the neural language model (NNLM), and Morin and Bengio (2005) solved it with a hierarchical binary tree. A similar architecture was used for word representations (Mnih and Hinton, 2009; Mikolov et al., 2013a). Directed tree structures cannot be applied to undirected models like RSM, but stochastic approaches can work well. For instance, Dahl et al. (2012) found that several Metropolis-Hastings (MH) sampling approaches approximate the softmax distribution in CD well, although MH adds computational complexity. Hyvärinen (2007) proposed Ratio Matching (RM) to train unnormalized models, and Dauphin and Bengio (2013) added stochastic approaches to RM to accommodate high-dimensional inputs. More recently, a new estimator, Noise Contrastive Estimate (NCE) (Gutmann and Hyvärinen, 2010), was proposed for unnormalized models, and has shown great efficiency in learning word representations (Mnih and Teh, 2012; Mikolov et al., 2013b).
In this paper, we propose an efficient learning strategy for RSM named α-NCE, applying NCE as the basic estimator. Different from most related efforts, which use NCE to predict a single word, our method extends NCE to generate noise for documents of varying lengths. It also enables RSM to use weighted inputs to improve the modelling ability. As RSM is usually used as the first layer in many deeper undirected models like Deep Boltzmann Machines (Srivastava et al., 2013), α-NCE can be readily extended to learn them efficiently.

Replicated Softmax Model
RSM is a typical undirected topic model that represents documents as bags of words (BoW). In general, it consists of a series of RBMs, each of which contains a varying number of softmax visible units but the same binary hidden units.
Suppose K is the vocabulary size. For a document with D words, if the i-th word in the document equals the k-th word of the dictionary, a vector $v_i \in \{0, 1\}^K$ is assigned, with only the k-th element $v_{ik} = 1$. An RBM is formed by assigning a hidden state $h \in \{0, 1\}^H$ to this document $V = \{v_1, ..., v_D\}$, where the energy function is:
$$E_\theta(V, h) = -h^\top W \hat{v} - b^\top \hat{v} - D \cdot a^\top h \quad (1)$$
where θ = {W, b, a} are parameters shared by all the RBMs, and $\hat{v} = \sum_{i=1}^{D} v_i$ is commonly referred to as the word count vector of a document. The probability for the document V is given by:
$$P_\theta(V) = \frac{1}{Z_D} \sum_h e^{-E_\theta(V, h)} = \frac{1}{Z_D} e^{F_\theta(V)} \quad (2)$$
where $F_\theta(V) = \log \sum_h e^{-E_\theta(V, h)}$ is the "free energy", which can be analytically integrated easily, and $Z_D = \sum_V e^{F_\theta(V)}$ is the "partition function" for normalization, only associated with the document length D. As the hidden state and document are conditionally independent, the conditional distributions are derived:
$$P_\theta(v_{ik} = 1 \mid h) = \frac{\exp(W_{\cdot k}^\top h + b_k)}{\sum_{k'=1}^{K} \exp(W_{\cdot k'}^\top h + b_{k'})} \quad (3)$$
$$P_\theta(h_j = 1 \mid V) = \sigma(W_{j \cdot} \hat{v} + D \cdot a_j) \quad (4)$$
where $\sigma(x) = \frac{1}{1 + e^{-x}}$, and $W_{\cdot k}$ and $W_{j \cdot}$ denote the k-th column and the j-th row of W. Equation (3) is the softmax units describing the multinomial distribution of the words, and Equation (4) serves as an efficient inference from words to semantic meanings, where we adopt the probabilities of each hidden unit being "activated" as the topic features.
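The free energy and the hidden posteriors can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: it assumes the standard RSM energy $E_\theta(V, h) = -h^\top W \hat{v} - b^\top \hat{v} - D \cdot a^\top h$ with the sign convention $P_\theta(V) \propto e^{F_\theta(V)}$, stores W as H rows of K entries, and uses the hypothetical names `free_energy` and `topic_features`.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def hidden_input(W, a, counts):
    """Pre-activation of each hidden unit: W_j . v_hat + D * a_j."""
    D = sum(counts)
    return [sum(w * c for w, c in zip(W[j], counts)) + D * a[j]
            for j in range(len(W))]

def free_energy(W, b, a, counts):
    """F(V) = b . v_hat + sum_j log(1 + exp(W_j . v_hat + D * a_j)).

    Assumed convention: P(V) is proportional to exp(F(V)).
    """
    visible = sum(bk * c for bk, c in zip(b, counts))
    return visible + sum(softplus(x) for x in hidden_input(W, a, counts))

def topic_features(W, a, counts):
    """P(h_j = 1 | V) for every hidden unit, used as topic features."""
    return [sigmoid(x) for x in hidden_input(W, a, counts)]
```

Summing out the binary hidden units in closed form is what makes the free energy cheap to evaluate: the cost is O(HK) per document, with no sampling involved.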

Learning Strategies for RSM
RSM is naturally learned by minimizing the negative log-likelihood function (ML) as follows:
$$L(\theta) = -E_{V \sim P_{data}}[\log P_\theta(V)]$$
However, the gradient is intractable because of the combinatorial normalization term $Z_D$. Common strategies to overcome this intractability are MCMC-based approaches such as Contrastive Divergence (CD) (Hinton, 2002) and Persistent CD (PCD) (Tieleman, 2008), both of which require repeating Gibbs steps of $h^{(i)} \sim P_\theta(h \mid V^{(i)})$ and $V^{(i+1)} \sim P_\theta(V \mid h^{(i)})$ to generate model samples that approximate the gradient. Typically, the performance and consistency improve when more steps are adopted. However, even one Gibbs step is time consuming for RSM, since multinomial sampling normally requires linear-time computation. The "alias method" (Kronmal and Peterson Jr, 1979) speeds up multinomial sampling to constant time per draw, but it requires linear time to preprocess the distribution. Since $P_\theta(V \mid h)$ changes at every iteration in CD, such methods cannot be used.
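To make the cost of one Gibbs step concrete, the sketch below samples the hidden units given a document and then resamples all D words from the softmax over the vocabulary. It is an illustrative sketch under the standard RSM conditionals, not the implementation used in the experiments; the function name `gibbs_step` and the layout of W as H rows of K entries are assumptions.

```python
import math
import random

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def gibbs_step(W, b, a, counts, rng):
    """One Gibbs step for RSM: sample h given V, then V given h."""
    H, K, D = len(W), len(b), sum(counts)
    # h_j ~ Bernoulli(sigmoid(W_j . v_hat + D * a_j))
    h = []
    for j in range(H):
        x = sum(w * c for w, c in zip(W[j], counts)) + D * a[j]
        h.append(1 if rng.random() < sigmoid(x) else 0)
    # Unnormalized softmax over all K words. It depends on h, so it has to be
    # rebuilt at every step and cannot be preprocessed once.
    logits = [b[k] + sum(W[j][k] * h[j] for j in range(H)) for k in range(K)]
    top = max(logits)
    weights = [math.exp(x - top) for x in logits]
    # Draw D words and rebuild the word count vector.
    new_counts = [0] * K
    for k in rng.choices(range(K), weights=weights, k=D):
        new_counts[k] += 1
    return h, new_counts
```

Because the word distribution changes with every new h, the O(K) work of building it is paid at each step and for each document, which is the bottleneck described above.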

Efficient Learning for RSM
Unlike Dahl et al. (2012), who retained CD, we adopted NCE as the basic learning strategy. Considering that RSM is designed for documents, we further modified NCE with two novel heuristics, developing the approach "Partial Noise Uniform Contrastive Estimate" (α-NCE for short).

Noise Contrastive Estimate
Noise Contrastive Estimate (NCE), similar to CD, is another estimator for training models with intractable partition functions. NCE resolves the intractability by treating the partition function $Z_D$ as an additional parameter $Z^c_D$ added to θ, which makes the likelihood computable. Yet the model cannot be trained through ML, since the likelihood can be made arbitrarily large by shrinking $Z^c_D$ toward zero. Instead, NCE learns the model in a proxy classification problem with noise samples.
Given a document collection (data) $\{V_d\}_{T_d}$ and another collection (noise) $\{V_n\}_{T_n}$ with $T_n = kT_d$, NCE distinguishes these $(1+k)T_d$ documents simply based on Bayes' Theorem:
$$P(\text{data} \mid V, \theta) = \frac{P_\theta(V)}{P_\theta(V) + kP_n(V)}, \qquad P(\text{noise} \mid V, \theta) = \frac{kP_n(V)}{P_\theta(V) + kP_n(V)}$$
where we assumed that data samples are matched by our model, indicating $P_\theta \approx P_{data}$, and that noise samples are generated from an artificial distribution $P_n$. Parameters are learned by minimizing the cross-entropy function:
$$J(\theta) = -E_{V_d \sim P_{data}}[\log \sigma_k(X(V_d))] - k E_{V_n \sim P_n}[\log(1 - \sigma_k(X(V_n)))]$$
and the gradient is derived as follows,
$$-\nabla_\theta J(\theta) = E_{V_d \sim P_{data}}[(1 - \sigma_k(X(V_d))) \nabla_\theta X(V_d)] - k E_{V_n \sim P_n}[\sigma_k(X(V_n)) \nabla_\theta X(V_n)]$$
where $\sigma_k(x) = \frac{1}{1 + ke^{-x}}$, and the "log-ratio" is:
$$X(V) = \log \frac{P_\theta(V)}{P_n(V)}$$
J(θ) can be optimized efficiently with stochastic gradient descent (SGD). Gutmann and Hyvärinen (2010) showed that the NCE gradient $\nabla_\theta J(\theta)$ approaches the ML gradient as k → ∞. In practice, a larger k tends to train the model better.
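The NCE objective can be sketched directly from per-document log-ratios. The snippet below is a hedged illustration, assuming the cross-entropy form $J = -E_{data}[\log \sigma_k(X)] - k E_{noise}[\log(1 - \sigma_k(X))]$ with $\sigma_k(x) = 1/(1 + ke^{-x})$; the function names are hypothetical and the expectations are replaced by minibatch means.

```python
import math

def log_sigma_k(x, k):
    """log sigma_k(x) = -log(1 + k * exp(-x)), computed stably."""
    z = math.log(k) - x
    return -(max(z, 0.0) + math.log1p(math.exp(-abs(z))))

def log_one_minus_sigma_k(x, k):
    """log(1 - sigma_k(x)) = log(k) - x + log sigma_k(x)."""
    return math.log(k) - x + log_sigma_k(x, k)

def nce_loss(data_ratios, noise_ratios, k):
    """Minibatch estimate of J(theta) from log-ratios X(V)."""
    data_term = sum(log_sigma_k(x, k) for x in data_ratios) / len(data_ratios)
    noise_term = (sum(log_one_minus_sigma_k(x, k) for x in noise_ratios)
                  / len(noise_ratios))
    # The noise expectation is weighted by k, the noise-to-data ratio.
    return -data_term - k * noise_term

def nce_grad_weights(data_ratios, noise_ratios, k):
    """Per-sample factors that multiply grad X(V) in the gradient of -J."""
    data_w = [1.0 - math.exp(log_sigma_k(x, k)) for x in data_ratios]
    noise_w = [-math.exp(log_sigma_k(x, k)) for x in noise_ratios]
    return data_w, noise_w
```

The weights show the classification view of the estimator: a data document that the model already rates as likely (large X) contributes little, while a noise document with a large X is pushed down strongly.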

Partial Noise Sampling
Different from Mnih and Teh (2012), who generate noise per word, RSM requires the estimator to sample the noise at the document level. An intuitive approach is to sample from the empirical distribution $\tilde{p}$ for D times, where the log probability is computed:
$$\log P_n(V) = \sum_{v \in V} v^\top \log \tilde{p}$$
For a fixed k, Gutmann and Hyvärinen (2010) suggested choosing the noise close to the data for a sufficient learning result, indicating that full noise might not be satisfactory. We proposed an alternative, "Partial Noise Sampling (PNS)", which generates noise by replacing part of the data with sampled words. In Algorithm 1 (Partial Noise Sampling), we fixed the proportion of remaining words at α, named the "noise level" of PNS. However, traversing all the conditions to guess the remaining words requires O(D!) computations. To avoid this, we simply bound the remaining words with the data and noise in advance, and the noise log-probability $\log P_n(V)$ is derived readily:
$$\log P_n(V) = \log P_\theta(V_r) + \sum_{v \in V \setminus V_r} v^\top \log \tilde{p}$$
where the remaining words $V_r$ are still assumed to be described by RSM with a smaller document length. In this way, PNS also strengthens the robustness of RSM towards incomplete data. Sampling the noise normally requires additional computational load. Fortunately, since $\tilde{p}$ is fixed, sampling is efficient using the "alias method". It also allows storing the noise for subsequent use, yielding much faster computation than CD.
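A possible reading of Partial Noise Sampling is sketched below. The full listing of Algorithm 1 is not reproduced here, so several details are assumptions of this sketch: the number of kept words is `round(alpha * D)`, the kept words are redrawn for every noise sample, and `random.choices` stands in for the alias method. The names `partial_noise_sample` and `log_noise_prob` are hypothetical.

```python
import math
import random

def partial_noise_sample(words, alpha, p_tilde, k, rng):
    """Draw k noise documents that keep a fraction alpha of the data words.

    words   : list of word ids of one data document (length D)
    alpha   : proportion of words kept from the data (the noise level)
    p_tilde : empirical unigram distribution over the vocabulary
    Returns a list of k pairs (kept, fresh): the kept words V_r and the
    words resampled from p_tilde.
    """
    D = len(words)
    n_keep = int(round(alpha * D))          # assumption: rounding rule
    vocab = range(len(p_tilde))
    samples = []
    for _ in range(k):
        kept = rng.sample(words, n_keep)    # remaining words V_r
        fresh = rng.choices(vocab, weights=p_tilde, k=D - n_keep)
        samples.append((kept, fresh))
    return samples

def log_noise_prob(kept_log_prob, fresh, p_tilde):
    """log P_n(V) = log P_theta(V_r) + sum of log p_tilde over resampled words.

    kept_log_prob is log P_theta(V_r), supplied by the caller from the model.
    """
    return kept_log_prob + sum(math.log(p_tilde[w]) for w in fresh)
```

Since `p_tilde` never changes during training, the noise documents can be drawn once and cached, which is where the speed advantage over CD comes from.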

Uniform Contrastive Estimate
When we initially implemented NCE for RSM, we found that the document lengths terribly biased the log-ratio, resulting in bad parameters. Therefore "Uniform Contrastive Estimate (UCE)" was proposed to accommodate varying document lengths by adding the uniform assumption:
$$\bar{X}(V) = \log \frac{\sqrt[D]{P_\theta(V)}}{\sqrt[D]{P_n(V)}} = \frac{1}{D} \log \frac{P_\theta(V)}{P_n(V)}$$
where UCE adopts the uniform probabilities $\sqrt[D]{P_\theta}$ and $\sqrt[D]{P_n}$ for classification to average the modelling ability at the word level. Note that D is not necessarily an integer in UCE, which allows choosing real-valued weights on the document such as idf-weighting (Salton and McGill, 1983). Typically, it is defined as a weighting vector w, where $w_k = \log \frac{T_d}{|\{V \in \{V_d\} : v_{ik} = 1, v_i \in V\}|}$ is multiplied to the k-th word in the dictionary. Thus for a weighted input $V_w$ and corresponding length $D_w$, we derive:
$$\bar{X}(V_w) = \frac{1}{D_w} \log \frac{P_\theta(V_w)}{P_n(V_w)}$$
A specific $Z^c_{D_w}$ will be assigned to $P_\theta(V_w)$. Combining PNS and UCE yields a new estimator for RSM, which we simply call α-NCE.
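The length normalization and the idf weights are simple to compute. The sketch below assumes the log-ratio expands as $F_\theta(V) - \log Z^c_D - \log P_n(V)$ and that the weighted length $D_w$ is the sum of the weighted counts; the function names and the zero weight given to unseen words are assumptions made for illustration.

```python
import math

def idf_weights(docs, vocab_size):
    """w_k = log(T_d / number of documents that contain word k)."""
    T = len(docs)
    df = [0] * vocab_size
    for doc in docs:
        for k in set(doc):
            df[k] += 1
    # Words that never occur get weight 0 (an assumption of this sketch).
    return [math.log(T / n) if n else 0.0 for n in df]

def weighted_counts(doc, w, vocab_size):
    """Word count vector in which each occurrence of word k adds w_k."""
    counts = [0.0] * vocab_size
    for k in doc:
        counts[k] += w[k]
    return counts

def uce_log_ratio(free_energy, log_z, log_noise, length):
    """Length-normalized log-ratio: (F - log Z - log P_n) / D.

    length may be real-valued, e.g. the sum of idf-weighted counts.
    """
    return (free_energy - log_z - log_noise) / length
```

Dividing by the length puts short and long documents on the same per-word scale, so that a long document no longer dominates the classification loss simply because its log-probabilities are larger in magnitude.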

Datasets and Details of Learning
We evaluated the new estimator to train RSMs on two text datasets: 20 Newsgroups and IMDB.
The 20 Newsgroups dataset is a collection of Usenet posts, which contains 11,345 training and 7,531 testing instances. Both the training and testing sets are labeled into 20 classes. Stop-word removal and stemming were performed.
The IMDB dataset is a benchmark for sentiment analysis, which consists of 100,000 movie reviews taken from IMDB. The dataset is divided into 75,000 training instances (1/3 labeled and 2/3 unlabeled) and 25,000 testing instances. Two types of labels, positive and negative, are given to show sentiment. Following (Maas et al., 2011), no stop words are removed from this dataset.
For each dataset, we randomly selected 10% of the training set for validation, and the idf-weight vector is computed in advance. In addition, replacing the word count $\hat{v}$ by $\log(1 + \hat{v})$ slightly improved the modelling performance for all models.
We implemented α-NCE according to the parameter settings in (Hinton, 2010), using SGD in minibatches of size 128 and an initial learning rate of 0.1. The number of hidden units was fixed at 128 for all models. Learning the partition function $Z^c_D$ separately for every length D is nearly impossible. However, as in (Mnih and Teh, 2012), we surprisingly found that freezing $Z^c_D$ as a constant function of D, without updating it, never harmed and actually enhanced the performance. This is probably because the large number of free parameters in RSM are forced to learn better when $Z^c_D$ is held constant. In practice, we set this constant function as $Z^c_D = 2^H \cdot \left(\sum_k e^{b_k}\right)^D$. It readily extends to learning RSM for a real-valued weighted length $D_w$.
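In log space the frozen partition function is a single expression, $\log Z^c_D = H \log 2 + D \cdot \log \sum_k e^{b_k}$. The short sketch below computes it with a stable log-sum-exp; it reads the constant as $2^H \cdot (\sum_k e^{b_k})^D$, and the function name is hypothetical.

```python
import math

def frozen_log_partition(b, num_hidden, length):
    """log Z^c_D = H * log(2) + D * logsumexp(b).

    length may be an integer D or a real-valued weighted length D_w.
    """
    top = max(b)
    lse = top + math.log(sum(math.exp(x - top) for x in b))
    return num_hidden * math.log(2.0) + length * lse
```

This value equals the exact log partition function of an RSM whose weights W and hidden biases a are all zero, which may explain why it works as a reasonable fixed reference.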
We also implemented CD with the same settings. All the experiments were run on a single GPU GTX970 using the library Theano (Bergstra et al., 2010). To make the comparison fair, both α-NCE and CD share the same implementation.

Evaluation of Efficiency
To evaluate the efficiency in learning, we used the most frequent words as dictionaries with sizes ranging from 100 to 20,000 for both datasets, and tested the computation time for CD with varying numbers of Gibbs steps and for α-NCE with varying noise sample sizes. The comparison of the mean running time is shown in Figure 1, averaged over both datasets. Typically, α-NCE achieves a 10 to 500 times speed-up compared to CD. Although both CD and α-NCE run slower when the input dimension increases, CD tends to take much more time due to the multinomial sampling at each iteration, especially when more Gibbs steps are used. In contrast, the running time of α-NCE stays reasonable even with a larger noise size or a larger dimension.

Evaluation of Performance
One direct measure to evaluate the modelling performance is to assess RSM as a generative model to estimate the log-probability per word as perplexity. However, as α-NCE learns RSM by distinguishing the data and noise from their respective features, parameters are trained more like a feature extractor than a generative model. It is not fair to use perplexity to evaluate the performance. For this reason, we evaluated the modelling performance with some indirect measures. For 20 Newsgroups, we trained RSMs on the training set, and reported the results on document retrieval and document classification. For retrieval, we treated the testing set as queries, and retrieved documents with the same labels in the training set by cosine-similarity. Precision-recall (P-R) curves and mean average precision (MAP) are two metrics we used for evaluation. For classification, we trained a softmax regression on the training set, and checked the accuracy on the testing set. We use this dataset to show the modelling ability of RSM with different estimators.
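The retrieval metric can be illustrated with a small sketch that ranks training documents by cosine similarity to a query and computes average precision against the class labels. This is a generic MAP computation written for illustration, not the evaluation script used in the experiments; all function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def average_precision(query, query_label, docs, labels):
    """Average precision of the cosine ranking of docs for one query."""
    order = sorted(range(len(docs)),
                   key=lambda i: cosine(query, docs[i]), reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == query_label:
            hits += 1
            total += hits / rank   # precision at this relevant position
    return total / hits if hits else 0.0

def mean_average_precision(queries, query_labels, docs, labels):
    """Mean of the per-query average precisions."""
    aps = [average_precision(q, y, docs, labels)
           for q, y in zip(queries, query_labels)]
    return sum(aps) / len(aps)
```

Here the feature vectors would be the hidden-unit posteriors of each document, with the test documents as queries and the training documents as the retrieval pool.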
For IMDB, the whole training set is used for learning RSMs, and an L2-regularized logistic regression is trained on the labeled training set. The error rate of sentiment classification on the testing set is reported, compared with several BoW-based baselines. We use this dataset to show the general modelling ability of RSM compared with others.
We trained both α-NCE and CD, as well as NCE without UCE, at a fixed vocabulary size (2000 for 20 Newsgroups, and 5000 for IMDB). Posteriors of the hidden units were used as topic features. For α-NCE, we fixed the noise level at 0.5 for 20 Newsgroups and 0.3 for IMDB. In comparison, we trained CD with 1 to 5 Gibbs steps. Figure 2 and Table 1 show that a larger noise size in α-NCE achieves better modelling performance, and that α-NCE greatly outperforms CD on retrieval tasks, especially at large recall values.
The classification results of α-NCE are also comparable to or slightly better than those of CD. It is also gratifying to find that the idf-weighted inputs achieve the best results in both retrieval and classification tasks, as idf-weighting is known to extract information better than word counts. In addition, NCE without UCE performs poorly compared to the others in Figure 2, indicating that varying document lengths do bias the learning greatly.
On the other hand, Table 2 shows the performance of RSM in sentiment classification, where model combinations reported in previous efforts are not considered. It is clear that α-NCE learns RSM better than CD, and outperforms BoW and other BoW-based models such as LDA (strictly speaking, WRRBM uses a "bag of n-grams" assumption). The idf-weighted inputs also achieve the best performance. Note that RSM is also based on BoW, indicating that α-NCE has arguably reached the limits of learning BoW-based models. In future work, RSM can be extended to more powerful undirected topic models by considering more syntactic information, such as word order or dependency relations, in the representation. α-NCE can be used to learn them efficiently and achieve better performance.

Choice of Noise Level-α
In order to decide the best noise level (α) for PNS, we learned RSMs using α-NCE with different noise levels for both word count and idf-weighted inputs on the two datasets. Figure 3 shows that α-NCE learning with partial noise (α > 0) outperforms full noise (α = 0) in most situations, and achieves better results than CD in retrieval and classification on both datasets. However, learning tends to become extremely difficult if the noise becomes too close to the data, which explains why the performance drops rapidly when α → 1. Furthermore, the curves in Figure 3 imply that the choice of α might be problem-dependent, with larger sets like IMDB requiring a relatively smaller α. A systematic strategy for choosing the optimal α will be explored in future work. In practice, a value between 0.3 and 0.5 is recommended.

Conclusions
We propose a novel approach, α-NCE, for learning undirected topic models such as RSM efficiently, allowing large vocabulary sizes. It is a new estimator based on NCE, adapted to documents with varying lengths and weighted inputs. We learn RSMs with α-NCE on two classic benchmarks, where it achieves both efficiency in learning and accuracy in retrieval and classification tasks.