Embedding : a Continuous Representation of Documents

Word embedding maps words into a low-dimensional continuous embedding space by exploiting the local word collocation patterns in a small context window. On the other hand, topic modeling maps documents onto a low-dimensional topic space, by utilizing the global word collocation patterns in the same document. These two types of patterns are complementary. In this paper, we propose a generative topic embedding model to combine the two types of patterns. In our model, topics are represented by embedding vectors, and are shared across documents. The probability of each word is influenced by both its local context and its topic. A variational inference method yields the topic embeddings as well as the topic mixing proportions for each document. Jointly they represent the document in a low-dimensional continuous space. In two document classification tasks, our method performs better than eight existing methods, with fewer features. In addition, we illustrate with an example that our method can generate coherent topics even based on only one document.


Introduction
Representing documents as fixed-length feature vectors is important for many document processing algorithms. Traditionally documents are represented as bag-of-words (BOW) vectors. However, this simple representation suffers from being high-dimensional and highly sparse, and loses semantic relatedness across the vector dimensions.
Word embedding methods have been demonstrated to be an effective way to represent words as continuous vectors in a low-dimensional embedding space (Bengio et al., 2003; Mikolov et al., 2013; Pennington et al., 2014; Levy et al., 2015). The learned embedding for a word encodes its semantic/syntactic relatedness with other words, by utilizing local word collocation patterns. In each method, one core component is the embedding link function, which predicts a word's distribution given its context words, parameterized by their embeddings.
When it comes to documents, we wish to find a method to encode their overall semantics. Given the embeddings of each word in a document, we can imagine the document as a "bag-of-vectors". Related words in the document point in similar directions, forming semantic clusters. The centroid of a semantic cluster corresponds to the most representative embedding of this cluster of words, and is referred to as a semantic centroid. We could use these semantic centroids and the number of words around them to represent a document.
In addition, for a set of documents in a particular domain, some semantic clusters may appear in many documents. By learning collocation patterns across the documents, the derived semantic centroids could be more topical and less noisy.
Topic Models, represented by Latent Dirichlet Allocation (LDA) (Blei et al., 2003), are able to group words into topics according to their collocation patterns across documents. When the corpus is large enough, such patterns reflect their semantic relatedness, hence topic models can discover coherent topics. The probability of a word is governed by its latent topic, which is modeled as a categorical distribution in LDA. Typically, only a small number of topics are present in each document, and only a small number of words have high probability in each topic. This intuition motivated Blei et al. (2003) to regularize the topic distributions with Dirichlet priors.
Semantic centroids have the same nature as topics in LDA, except that the former exist in the embedding space. This similarity drives us to seek the common semantic centroids with a model similar to LDA. We extend a generative word embedding model PSDVec (Li et al., 2015), by incorporating topics into it. The new model is named TopicVec. In TopicVec, an embedding link function models the word distribution in a topic, in place of the categorical distribution in LDA. The advantage of the link function is that the semantic relatedness is already encoded as the cosine distance in the embedding space. Similar to LDA, we regularize the topic distributions with Dirichlet priors. A variational inference algorithm is derived. The learning process derives topic embeddings in the same embedding space of words. These topic embeddings aim to approximate the underlying semantic centroids.
To evaluate how well TopicVec represents documents, we performed two document classification tasks against eight existing topic modeling or document representation methods. Two setups of TopicVec outperformed all other methods on two tasks, respectively, with fewer features. In addition, we demonstrate that TopicVec can derive coherent topics based only on one document, which is not possible for topic models.
The source code of our implementation is available at https://github.com/askerlee/topicvec.

Related Work
Li et al. (2015) proposed a generative word embedding method PSDVec, which is the precursor of TopicVec. PSDVec assumes that the conditional distribution of a word given its context words can be factorized approximately into independent log-bilinear terms. In addition, the word embeddings and regression residuals are regularized by Gaussian priors, reducing their chance of overfitting. The model inference is approached by an efficient eigendecomposition and blockwise-regression method (Li et al., 2016b). TopicVec differs from PSDVec in that the conditional distribution of a word is influenced not only by its context words, but also by a topic, which is an embedding vector indexed by a latent variable drawn from a Dirichlet-Multinomial distribution.
Hinton and Salakhutdinov (2009) proposed to model topics as a certain number of binary hidden variables, which interact with all words in the document through weighted connections. Larochelle and Lauly (2012) assigned each word a unique topic vector, which is a summarization of the context of the current word.
Huang et al. (2012) proposed to incorporate global (document-level) semantic information to help the learning of word embeddings. The global embedding is simply a weighted average of the embeddings of words in the document.
Le and Mikolov (2014) proposed Paragraph Vector. It assumes each piece of text has a latent paragraph vector, which influences the distributions of all words in this text, in the same way as a latent word. It can be viewed as a special case of TopicVec, with the topic number set to 1. Typically, however, a document consists of multiple semantic centroids, and the limitation of only one topic may lead to underfitting.
Nguyen et al. (2015) proposed Latent Feature Topic Modeling (LFTM), which extends LDA to incorporate word embeddings as latent features. The topic is modeled as a mixture of the conventional categorical distribution and an embedding link function. The coupling between these two components makes the inference difficult. They designed a Gibbs sampler for model inference. Their implementation is slow and infeasible when applied to a large corpus.
Liu et al. (2015) proposed Topical Word Embedding (TWE), which combines word embedding with LDA in a simple and effective way. They train word embeddings and a topic model separately on the same corpus, and then average the embeddings of words in the same topic to get the embedding of this topic. The topic embedding is concatenated with the word embedding to form the topical word embedding of a word. In the end, the topical word embeddings of all words in a document are averaged to be the embedding of the document. This method performs well on our two classification tasks. Weaknesses of TWE include: 1) the way to combine the results of word embedding and LDA lacks statistical foundations; 2) the LDA module requires a large corpus to derive semantically coherent topics.
Das et al. (2015) proposed Gaussian LDA. It uses pre-trained word embeddings. It assumes that words in a topic are random samples from a multivariate Gaussian distribution with the topic embedding as the mean. Hence the probability that a

Notations and Definitions
Throughout this paper, we use uppercase bold letters such as $\boldsymbol{S}, \boldsymbol{V}$ to denote a matrix or set, lowercase bold letters such as $\boldsymbol{v}_{w_i}$ to denote a vector, a normal uppercase letter such as $N, W$ to denote a scalar constant, and a normal lowercase letter such as $s_i, w_i$ to denote a scalar variable. Table 1 lists the notations in this paper.
In a document, a sequence of words is referred to as a text window, denoted by $w_i, \cdots, w_{i+l}$, or $w_i : w_{i+l}$. A text window of chosen size $c$ before a word $w_i$ defines the context of $w_i$ as $w_{i-c}, \cdots, w_{i-1}$. Here $w_i$ is referred to as the focus word. Each context word $w_{i-j}$ and the focus word $w_i$ comprise a bigram $w_{i-j}, w_i$.
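The definitions of context, focus word and bigram can be made concrete with a few lines of Python. This is only an illustrative sketch: the function name `contexts_and_bigrams` and the toy document are hypothetical, and positions near the start of the document simply receive a shorter context.

```python
def contexts_and_bigrams(tokens, c):
    """Return (context, focus, bigrams) for every position in a token list."""
    out = []
    for i, focus in enumerate(tokens):
        context = tokens[max(0, i - c):i]          # up to c words before the focus word
        bigrams = [(ctx, focus) for ctx in context]
        out.append((context, focus, bigrams))
    return out


doc = ["topic", "models", "group", "words"]
windows = contexts_and_bigrams(doc, c=2)
```

Each bigram pairs one context word with the focus word, which is the unit that the log-bilinear terms of the link function operate on.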
We assume each word in a document is semantically similar to a topic embedding. Topic embeddings reside in the same $N$-dimensional space as word embeddings. When it is clear from context, topic embeddings are often referred to as topics. Each document has $K$ candidate topics, arranged in the matrix form $\boldsymbol{T}_i = (\boldsymbol{t}_{i1} \cdots \boldsymbol{t}_{iK})$, referred to as the topic matrix. Specifically, we fix $\boldsymbol{t}_{i1} = \boldsymbol{0}$, referring to it as the null topic.
In a document $d_i$, each word $w_{ij}$ is assigned to a topic indexed by $z_{ij} \in \{1, \cdots, K\}$. Geometrically this means the embedding $\boldsymbol{v}_{w_{ij}}$ tends to align with the direction of $\boldsymbol{t}_{i,z_{ij}}$. Each topic $\boldsymbol{t}_{ik}$ has a document-specific prior probability to be assigned to a word, denoted as $\phi_{ik} = P(k \mid d_i)$. The vector $\boldsymbol{\phi}_i = (\phi_{i1}, \cdots, \phi_{iK})$ is referred to as the mixing proportions of these topics in document $d_i$.
Footnote 2: Almost all modern word embedding methods adopt the exponentiated cosine similarity as the link function, hence the cosine similarity may be assumed to be a better estimate of the semantic relatedness between embeddings derived from these methods.

Link Function of Topic Embedding
In this section, we formulate the distribution of a word given its context words and topic, in the form of a link function.
The core of most word embedding methods is a link function that connects the embeddings of a focus word and its context words, to define the distribution of the focus word. Li et al. (2015) proposed the following link function:
$$P(w_c \mid w_0 : w_{c-1}) \approx P(w_c) \exp\Big\{ \boldsymbol{v}_{w_c}^\top \sum_{l=0}^{c-1} \boldsymbol{v}_{w_l} + \sum_{l=0}^{c-1} a_{w_l w_c} \Big\}. \tag{1}$$
Here $a_{w_l w_c}$ is referred to as the bigram residual, indicating the non-linear part not captured by $\boldsymbol{v}_{w_c}^\top \boldsymbol{v}_{w_l}$. It is essentially the logarithm of the normalizing constant of a softmax term. Some literature, e.g. (Pennington et al., 2014), refers to such a term as a bias term.
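(1) is based on the assumption that the conditional distribution $P(w_c \mid w_0 : w_{c-1})$ can be factorized approximately into independent log-bilinear terms, each corresponding to a context word. This approximation leads to an efficient and effective word embedding algorithm PSDVec (Li et al., 2015). We follow this assumption, and propose to incorporate the topic of $w_c$ in a way like a latent word. In particular, in addition to the context words, the corresponding embedding $\boldsymbol{t}_{ik}$ is included as a new log-bilinear term that influences the distribution of $w_c$. Hence we obtain the following extended link function:
$$P(w_c \mid w_0 : w_{c-1}, z_c, d_i) \approx P(w_c) \exp\Big\{ \boldsymbol{v}_{w_c}^\top \Big( \sum_{l=0}^{c-1} \boldsymbol{v}_{w_l} + \boldsymbol{t}_{z_c} \Big) + \sum_{l=0}^{c-1} a_{w_l w_c} + r_{z_c} \Big\}, \tag{2}$$
where $d_i$ is the current document, $\boldsymbol{t}_{z_c}$ is the embedding of the topic $z_c$ assigned to $w_c$ (the document index is dropped for brevity), and $r_{z_c}$ is the topic residual, i.e. the logarithm of the normalizing constant.
It is infeasible to compute the exact value of the topic residual $r_k$. We approximate it by the context size $c = 0$. Then (2) becomes:
$$P(w_c \mid k) = P(w_c) \exp\{ \boldsymbol{v}_{w_c}^\top \boldsymbol{t}_k + r_k \}. \tag{3}$$
It is required that $\sum_{w_c \in \boldsymbol{S}} P(w_c \mid k) = 1$ to make (3) a distribution. It follows that
$$r_k = -\log\Big( \sum_{s_j \in \boldsymbol{S}} P(s_j) \exp\{ \boldsymbol{v}_{s_j}^\top \boldsymbol{t}_k \} \Big). \tag{4}$$
(4) can be expressed in the matrix form:
$$\boldsymbol{r} = -\log\big( \boldsymbol{u} \exp\{ \boldsymbol{V}^\top \boldsymbol{T} \} \big), \tag{5}$$
where $\boldsymbol{u}$ is the row vector of unigram probabilities.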
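As a numerical illustration of the topic residual, the following NumPy sketch computes $r_k = -\log \sum_{s} P(s) \exp\{\boldsymbol{v}_s^\top \boldsymbol{t}_k\}$ for all topics at once and checks that the resulting $P(w \mid k)$ is normalized. It assumes word and topic embeddings are stored as matrix columns; the function name `topic_residuals` and the random toy data are hypothetical and not taken from the released implementation.

```python
import numpy as np


def topic_residuals(V, T, u):
    """r_k = -log sum_s P(s) exp(v_s^T t_k), for every column t_k of T.

    V: (N, W) word embeddings as columns; T: (N, K) topic embeddings as columns;
    u: (W,) unigram probabilities summing to 1.
    """
    scores = V.T @ T                      # (W, K): v_s^T t_k
    m = scores.max(axis=0)                # subtract the column max for numerical stability
    return -(m + np.log(u @ np.exp(scores - m)))


rng = np.random.default_rng(0)
N, W, K = 5, 8, 3
V = rng.normal(size=(N, W))
T = rng.normal(size=(N, K))
T[:, 0] = 0.0                             # null topic
u = rng.dirichlet(np.ones(W))
r = topic_residuals(V, T, u)

# P(w | k) = P(w) exp(v_w^T t_k + r_k); each column should sum to 1
P = u[:, None] * np.exp(V.T @ T + r)
```

For the null topic the residual is zero and $P(w \mid k)$ reduces to the unigram distribution, which is why a null topic can absorb words that carry little topical content.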

Generative Process and Likelihood
The generative process of words in documents can be regarded as a hybrid of LDA and PSDVec. Analogous to PSDVec, the word embedding $\boldsymbol{v}_{s_i}$ and residual $a_{s_i s_j}$ are drawn from respective Gaussians. For the sake of clarity, we ignore their generation steps, and focus on the topic embeddings.
The remaining generative process is as follows:
1. For the $k$-th topic, draw a topic embedding uniformly from a hyperball of radius $\gamma$, i.e. $\boldsymbol{t}_k \sim \mathrm{Unif}(B_\gamma)$;
2. For each document $d_i$:
   (a) Draw the mixing proportions $\boldsymbol{\phi}_i$ from the Dirichlet prior $\mathrm{Dir}(\boldsymbol{\alpha})$;
   (b) For the $j$-th word:
      i. Draw topic assignment $z_{ij}$ from the categorical distribution $\mathrm{Cat}(\boldsymbol{\phi}_i)$;
      ii. Draw word $w_{ij}$ from $\boldsymbol{S}$ according to $P(w_{ij} \mid w_{i,j-c} : w_{i,j-1}, z_{ij}, d_i)$.
The above generative process is presented in plate notation in Figure 1.
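The generative steps above can be simulated with a short NumPy sketch. It makes two simplifying assumptions that are not part of the model: context words are ignored when drawing a word (the $c = 0$ form of the link function), and the word embeddings are random stand-ins rather than draws from the PSDVec priors. The helper names `sample_ball` and `generate_document` are hypothetical.

```python
import numpy as np


def sample_ball(rng, n_dim, radius):
    """Draw one point uniformly from the n_dim-dimensional ball of the given radius."""
    direction = rng.normal(size=n_dim)
    direction /= np.linalg.norm(direction)
    return radius * rng.uniform() ** (1.0 / n_dim) * direction


def generate_document(rng, V, u, T, alpha, length):
    """Sample mixing proportions, topic assignments and word ids (context ignored)."""
    phi = rng.dirichlet(alpha)                          # mixing proportions of the document
    logits = V.T @ T + np.log(u)[:, None]               # (W, K): log P(w) + v_w^T t_k
    probs = np.exp(logits - logits.max(axis=0))
    probs /= probs.sum(axis=0)                          # normalizing plays the role of exp(r_k)
    z = rng.choice(len(alpha), size=length, p=phi)      # one topic assignment per word
    words = np.array([rng.choice(V.shape[1], p=probs[:, k]) for k in z])
    return phi, z, words


rng = np.random.default_rng(0)
N, W, K, gamma = 4, 10, 3, 5.0
V = rng.normal(size=(N, W))
u = rng.dirichlet(np.ones(W))
T = np.column_stack([np.zeros(N)] + [sample_ball(rng, N, gamma) for _ in range(K - 1)])
phi, z, words = generate_document(rng, V, u, T, alpha=np.full(K, 0.1), length=20)
```

With a small Dirichlet parameter such as 0.1, most of the probability mass in `phi` usually falls on one or two topics, which mirrors the sparsity assumption discussed in the Introduction.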

Likelihood Function
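Given the embeddings $\boldsymbol{V}$, the bigram residuals $\boldsymbol{A}$, the topics $\boldsymbol{T}_i$ and the hyperparameter $\boldsymbol{\alpha}$, the complete-data likelihood of a single document $d_i$ is:
$$
\begin{aligned}
p(d_i, \boldsymbol{Z}_i, \boldsymbol{\phi}_i \mid \boldsymbol{\alpha}, \boldsymbol{V}, \boldsymbol{A}, \boldsymbol{T}_i)
&= p(\boldsymbol{\phi}_i \mid \boldsymbol{\alpha})\, p(\boldsymbol{Z}_i \mid \boldsymbol{\phi}_i)\, p(d_i \mid \boldsymbol{V}, \boldsymbol{A}, \boldsymbol{T}_i, \boldsymbol{Z}_i) \\
&= \frac{\Gamma(\sum_{k=1}^{K} \alpha_k)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \phi_{ik}^{\alpha_k - 1} \cdot \prod_{j=1}^{L_i} \Big( \phi_{i,z_{ij}} P(w_{ij}) \exp\Big\{ \boldsymbol{v}_{w_{ij}}^\top \Big( \sum_{l=j-c}^{j-1} \boldsymbol{v}_{w_{il}} + \boldsymbol{t}_{i,z_{ij}} \Big) + \sum_{l=j-c}^{j-1} a_{w_{il} w_{ij}} + r_{i,z_{ij}} \Big\} \Big),
\end{aligned}
\tag{6}
$$
where $\boldsymbol{Z}_i = (z_{i1}, \cdots, z_{iL_i})$, $L_i$ is the length of $d_i$, and $\Gamma(\cdot)$ is the Gamma function. Let $\boldsymbol{Z}, \boldsymbol{T}, \boldsymbol{\phi}$ denote the collections of all the document-specific $\boldsymbol{Z}_i$, $\boldsymbol{T}_i$, $\boldsymbol{\phi}_i$, respectively. Then the complete-data likelihood of the whole corpus is:
$$
\begin{aligned}
p(\boldsymbol{D}, \boldsymbol{A}, \boldsymbol{V}, \boldsymbol{Z}, \boldsymbol{T}, \boldsymbol{\phi} \mid \boldsymbol{\alpha}, \gamma, \boldsymbol{\mu})
&= \prod_{s_i \in \boldsymbol{S}} P(\boldsymbol{v}_{s_i}; \mu_i) \prod_{s_i, s_j \in \boldsymbol{S}} P(a_{s_i s_j}; f(h_{ij})) \prod_{k=1}^{K} \mathrm{Unif}(B_\gamma) \prod_{d_i \in \boldsymbol{D}} p(d_i, \boldsymbol{Z}_i, \boldsymbol{\phi}_i \mid \boldsymbol{\alpha}, \boldsymbol{V}, \boldsymbol{A}, \boldsymbol{T}_i) \\
&= \frac{1}{Z(\boldsymbol{H}, \boldsymbol{\mu})\, U_\gamma^{K}} \exp\Big\{ -\sum_{s_i, s_j \in \boldsymbol{S}} f(h_{ij})\, a_{s_i s_j}^2 - \sum_{s_i \in \boldsymbol{S}} \mu_i \|\boldsymbol{v}_{s_i}\|^2 \Big\} \prod_{d_i \in \boldsymbol{D}} p(d_i, \boldsymbol{Z}_i, \boldsymbol{\phi}_i \mid \boldsymbol{\alpha}, \boldsymbol{V}, \boldsymbol{A}, \boldsymbol{T}_i),
\end{aligned}
\tag{7}
$$
where $P(\boldsymbol{v}_{s_i}; \mu_i)$ and $P(a_{s_i s_j}; f(h_{ij}))$ are the two Gaussian priors as defined in (Li et al., 2015).
Following the convention in (Li et al., 2015), $h_{ij}$, $\boldsymbol{H}$ are empirical bigram probabilities, $\boldsymbol{\mu}$ are the embedding magnitude penalty coefficients, and $Z(\boldsymbol{H}, \boldsymbol{\mu})$ is the normalizing constant for word embeddings. $U_\gamma$ is the volume of the hyperball of radius $\gamma$.
Taking the logarithm of both sides, we obtain
$$
\begin{aligned}
\log p(\boldsymbol{D}, \boldsymbol{A}, \boldsymbol{V}, \boldsymbol{Z}, \boldsymbol{T}, \boldsymbol{\phi} \mid \boldsymbol{\alpha}, \gamma, \boldsymbol{\mu})
&= C_0 - \log Z(\boldsymbol{H}, \boldsymbol{\mu}) - \sum_{s_i, s_j \in \boldsymbol{S}} f(h_{ij})\, a_{s_i s_j}^2 - \sum_{s_i \in \boldsymbol{S}} \mu_i \|\boldsymbol{v}_{s_i}\|^2 \\
&\quad + \sum_{d_i \in \boldsymbol{D}} \Big\{ \sum_{k=1}^{K} (m_{ik} + \alpha_k - 1) \log \phi_{ik} + \sum_{j=1}^{L_i} \Big( r_{i,z_{ij}} + \boldsymbol{v}_{w_{ij}}^\top \Big( \sum_{l=j-c}^{j-1} \boldsymbol{v}_{w_{il}} + \boldsymbol{t}_{i,z_{ij}} \Big) + \sum_{l=j-c}^{j-1} a_{w_{il} w_{ij}} \Big) \Big\},
\end{aligned}
\tag{8}
$$
where $m_{ik} = \sum_{j=1}^{L_i} \delta(z_{ij} = k)$ counts the number of words assigned with the $k$-th topic in $d_i$, and $C_0$ collects the terms that depend only on the hyperparameters and the unigram probabilities.

Variational Inference Algorithm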

Learning Objective and Process
Given the hyperparameters $\boldsymbol{\alpha}, \gamma, \boldsymbol{\mu}$, the learning objective is to find the embeddings $\boldsymbol{V}$, the topics $\boldsymbol{T}$, and the word-topic and document-topic distributions $p(\boldsymbol{Z}_i, \boldsymbol{\phi}_i \mid d_i, \boldsymbol{A}, \boldsymbol{V}, \boldsymbol{T})$. Here the hyperparameters $\boldsymbol{\alpha}, \gamma, \boldsymbol{\mu}$ are kept constant, and we make them implicit in the distribution notations.
However, the coupling between $\boldsymbol{A}, \boldsymbol{V}$ and $\boldsymbol{T}, \boldsymbol{Z}, \boldsymbol{\phi}$ makes it inefficient to optimize them simultaneously. To get around this difficulty, we learn word embeddings and topic embeddings separately. Specifically, the learning process is divided into two stages:
1. In the first stage, considering that the topics have a relatively small impact on word distributions and the impact might be "averaged out" across different documents, we simplify the model by ignoring topics temporarily. Then the model falls back to the original PSDVec. The optimal solution $\boldsymbol{V}^*, \boldsymbol{A}^*$ is obtained accordingly;
2. In the second stage, we treat $\boldsymbol{V}^*, \boldsymbol{A}^*$ as constant, plug them into the likelihood function, and find the corresponding optimal $\boldsymbol{T}^*$, $p(\boldsymbol{Z}, \boldsymbol{\phi} \mid \boldsymbol{D}, \boldsymbol{A}^*, \boldsymbol{V}^*, \boldsymbol{T}^*)$ of the full model.
As in LDA, this posterior is analytically intractable, and we use a simpler variational distribution $q(\boldsymbol{Z}, \boldsymbol{\phi})$ to approximate it.

Mean-Field Approximation and Variational GEM Algorithm
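In this stage, we fix $\boldsymbol{V} = \boldsymbol{V}^*$, $\boldsymbol{A} = \boldsymbol{A}^*$, and seek the optimal $\boldsymbol{T}^*$, $p(\boldsymbol{Z}, \boldsymbol{\phi} \mid \boldsymbol{D}, \boldsymbol{A}^*, \boldsymbol{V}^*, \boldsymbol{T}^*)$. As $\boldsymbol{V}^*, \boldsymbol{A}^*$ are constant, we also make them implicit in the following expressions.
For an arbitrary variational distribution $q(\boldsymbol{Z}, \boldsymbol{\phi})$, the following equalities hold
$$E_q\Big[ \log \frac{p(\boldsymbol{D}, \boldsymbol{Z}, \boldsymbol{\phi} \mid \boldsymbol{T})}{q(\boldsymbol{Z}, \boldsymbol{\phi})} \Big] = E_q[\log p(\boldsymbol{D}, \boldsymbol{Z}, \boldsymbol{\phi} \mid \boldsymbol{T})] + H(q) = \log p(\boldsymbol{D} \mid \boldsymbol{T}) - \mathrm{KL}(q \| p), \tag{9}$$
where $p = p(\boldsymbol{Z}, \boldsymbol{\phi} \mid \boldsymbol{D}, \boldsymbol{T})$, and $H(q)$ is the entropy of $q$. This implies
$$\mathrm{KL}(q \| p) = \log p(\boldsymbol{D} \mid \boldsymbol{T}) - \big( E_q[\log p(\boldsymbol{D}, \boldsymbol{Z}, \boldsymbol{\phi} \mid \boldsymbol{T})] + H(q) \big) = \log p(\boldsymbol{D} \mid \boldsymbol{T}) - \mathcal{L}(q, \boldsymbol{T}). \tag{10}$$
In (10), $E_q[\log p(\boldsymbol{D}, \boldsymbol{Z}, \boldsymbol{\phi} \mid \boldsymbol{T})] + H(q)$ is usually referred to as the variational free energy $\mathcal{L}(q, \boldsymbol{T})$, which is a lower bound of $\log p(\boldsymbol{D} \mid \boldsymbol{T})$ because the KL-divergence is non-negative. Directly maximizing $\log p(\boldsymbol{D} \mid \boldsymbol{T})$ w.r.t. $\boldsymbol{T}$ is intractable due to the hidden variables $\boldsymbol{Z}, \boldsymbol{\phi}$, so we maximize its lower bound $\mathcal{L}(q, \boldsymbol{T})$ instead. We adopt a mean-field approximation of the true posterior as the variational distribution, and use a variational algorithm to find $q^*, \boldsymbol{T}^*$ maximizing $\mathcal{L}(q, \boldsymbol{T})$.
The following variational distribution is used:
$$q(\boldsymbol{Z}, \boldsymbol{\phi}; \boldsymbol{\pi}, \boldsymbol{\theta}) = q(\boldsymbol{\phi}; \boldsymbol{\theta})\, q(\boldsymbol{Z}; \boldsymbol{\pi}) = \prod_{d_i \in \boldsymbol{D}} \Big\{ \mathrm{Dir}(\boldsymbol{\phi}_i; \boldsymbol{\theta}_i) \prod_{j=1}^{L_i} \mathrm{Cat}(z_{ij}; \boldsymbol{\pi}_{ij}) \Big\}. \tag{11}$$
We can obtain (Li et al., 2016a)
$$\mathcal{L}(q, \boldsymbol{T}) = \sum_{d_i \in \boldsymbol{D}} \Big\{ \sum_{k=1}^{K} \Big( \sum_{j=1}^{L_i} \pi_{ij}^k + \alpha_k - 1 \Big) \big( \psi(\theta_{ik}) - \psi(\theta_{i0}) \big) + \mathrm{Tr}\Big( \boldsymbol{T}_i^\top \sum_{j=1}^{L_i} \boldsymbol{v}_{w_{ij}} \boldsymbol{\pi}_{ij}^\top \Big) + \boldsymbol{r}_i^\top \sum_{j=1}^{L_i} \boldsymbol{\pi}_{ij} \Big\} + H(q) + C_1, \tag{12}$$
where $\boldsymbol{T}_i$ is the topic matrix of the $i$-th document, $\boldsymbol{r}_i$ is the vector constructed by concatenating all the topic residuals $r_{ik}$, $\psi(\cdot)$ is the digamma function, $\theta_{i0} = \sum_{k=1}^{K} \theta_{ik}$, and $C_1$ is constant w.r.t. $q$ and $\boldsymbol{T}$.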
We proceed to optimize (12) with a Generalized Expectation-Maximization (GEM) algorithm w.r.t. $q$ and $\boldsymbol{T}$ as follows:
1. Initialize all the topics $\boldsymbol{T}_i = \boldsymbol{0}$, and correspondingly their residuals $\boldsymbol{r}_i = \boldsymbol{0}$;
2. Iterate over the following two steps until convergence. In the $l$-th step:
   (a) Let the topics and residuals be $\boldsymbol{T} = \boldsymbol{T}^{(l-1)}$, $\boldsymbol{r} = \boldsymbol{r}^{(l-1)}$, and find $q^{(l)}(\boldsymbol{Z}, \boldsymbol{\phi})$ that maximizes $\mathcal{L}(q, \boldsymbol{T}^{(l-1)})$. This is the Expectation step (E-step). In this step, $\log p(\boldsymbol{D} \mid \boldsymbol{T})$ is constant. Then the $q$ that maximizes $\mathcal{L}(q, \boldsymbol{T}^{(l-1)})$ will minimize $\mathrm{KL}(q \| p)$, i.e. such a $q$ is the closest variational distribution to $p$ measured by KL-divergence;
   (b) Given the variational distribution $q^{(l)}(\boldsymbol{Z}, \boldsymbol{\phi})$, find $\boldsymbol{T}^{(l)}, \boldsymbol{r}^{(l)}$ that improve $\mathcal{L}(q^{(l)}, \boldsymbol{T})$, using the gradient descent method. This is the generalized Maximization step (M-step). In this step, $\boldsymbol{\pi}, \boldsymbol{\theta}, H(q)$ are constant.

Update Equations of π, θ in E-Step
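In the E-step, $\boldsymbol{T} = \boldsymbol{T}^{(l-1)}$, $\boldsymbol{r} = \boldsymbol{r}^{(l-1)}$ are constant. Taking the derivative of $\mathcal{L}(q, \boldsymbol{T}^{(l-1)})$ w.r.t. $\pi_{ij}^k$ and $\theta_{ik}$, respectively, we can obtain the optimal solutions (Li et al., 2016a) at:
$$\pi_{ij}^k \propto \exp\{ \psi(\theta_{ik}) + \boldsymbol{v}_{w_{ij}}^\top \boldsymbol{t}_{ik} + r_{ik} \}, \tag{13}$$
$$\theta_{ik} = \sum_{j=1}^{L_i} \pi_{ij}^k + \alpha_k. \tag{14}$$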
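A minimal sketch of the E-step for a single document is given below. It assumes LDA-style mean-field updates of the form $\pi_{ij}^k \propto \exp\{\psi(\theta_{ik}) + \boldsymbol{v}_{w_{ij}}^\top \boldsymbol{t}_{ik} + r_{ik}\}$ and $\theta_{ik} = \alpha_k + \sum_j \pi_{ij}^k$, alternated for a fixed number of rounds. The function name `e_step`, the fixed iteration count, the uniform initialization and the zero placeholder residuals are illustrative assumptions, not details of the released code.

```python
import numpy as np
from scipy.special import digamma


def e_step(Vd, T, r, alpha, n_iter=20):
    """Alternate the updates of pi (L x K) and theta (K,) for one document.

    Vd: (N, L) embeddings of the document's words; T: (N, K) topic embeddings;
    r: (K,) topic residuals; alpha: (K,) Dirichlet hyperparameter.
    """
    L, K = Vd.shape[1], T.shape[1]
    word_topic = Vd.T @ T + r                        # (L, K): v_w^T t_k + r_k
    theta = alpha + L / K                            # start from uniform assignments
    for _ in range(n_iter):
        log_pi = digamma(theta) + word_topic         # unnormalized log pi
        log_pi -= log_pi.max(axis=1, keepdims=True)  # stabilize before exponentiating
        pi = np.exp(log_pi)
        pi /= pi.sum(axis=1, keepdims=True)          # each word's topic probabilities sum to 1
        theta = alpha + pi.sum(axis=0)               # theta_k = alpha_k + sum_j pi_jk
    return pi, theta


rng = np.random.default_rng(0)
N, L, K = 4, 12, 3
Vd = rng.normal(size=(N, L))
T = rng.normal(size=(N, K))
T[:, 0] = 0.0                                        # null topic
r = np.zeros(K)                                      # placeholder residuals for illustration
alpha = np.full(K, 0.1)
pi, theta = e_step(Vd, T, r, alpha)
```

The normalized `theta` can serve as the posterior topic proportions of the document, which is the representation used as classification features later in the paper.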

Update Equation of T i in M-Step
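In the Generalized M-step, $\boldsymbol{\pi} = \boldsymbol{\pi}^{(l)}$, $\boldsymbol{\theta} = \boldsymbol{\theta}^{(l)}$ are constant. For notational simplicity, we drop their superscripts $(l)$.
To update $\boldsymbol{T}_i$, we first take the derivative of (12) w.r.t. $\boldsymbol{T}_i$, and then apply the gradient descent method.
The derivative is obtained as (Li et al., 2016a):
$$\frac{\partial \mathcal{L}(q^{(l)}, \boldsymbol{T})}{\partial \boldsymbol{T}_i} = \sum_{j=1}^{L_i} \boldsymbol{v}_{w_{ij}} \boldsymbol{\pi}_{ij}^\top + \sum_{k=1}^{K} \bar{m}_{ik} \frac{\partial r_{ik}}{\partial \boldsymbol{T}_i}, \tag{15}$$
where $\bar{m}_{ik} = \sum_{j=1}^{L_i} \pi_{ij}^k = E[m_{ik}]$, the sum of the variational probabilities of each word being assigned to the $k$-th topic in the $i$-th document. $\frac{\partial r_{ik}}{\partial \boldsymbol{T}_i}$ is a gradient matrix, whose $j$-th column is $\frac{\partial r_{ik}}{\partial \boldsymbol{t}_{ij}}$. Recall that $r_{ik} = -\log E_{P(s)}[\exp\{\boldsymbol{v}_s^\top \boldsymbol{t}_{ik}\}]$.
When $j \neq k$, it is easy to verify that $\frac{\partial r_{ik}}{\partial \boldsymbol{t}_{ij}} = \boldsymbol{0}$. When $j = k$, we have
$$\frac{\partial r_{ik}}{\partial \boldsymbol{t}_{ik}} = -e^{r_{ik}} \cdot E_{P(s)}[\exp\{\boldsymbol{v}_s^\top \boldsymbol{t}_{ik}\}\, \boldsymbol{v}_s] = -e^{r_{ik}} \cdot (\boldsymbol{u} \circ \boldsymbol{V}) \exp\{\boldsymbol{V}^\top \boldsymbol{t}_{ik}\}, \tag{16}$$
where $\boldsymbol{u} \circ \boldsymbol{V}$ scales the embedding of each word in $\boldsymbol{V}$ by the corresponding unigram probability in $\boldsymbol{u}$. Therefore $\frac{\partial r_{ik}}{\partial \boldsymbol{T}_i} = (\boldsymbol{0}, \cdots, \frac{\partial r_{ik}}{\partial \boldsymbol{t}_{ik}}, \cdots, \boldsymbol{0})$. Plugging it into (15), we obtain
$$\frac{\partial \mathcal{L}(q^{(l)}, \boldsymbol{T})}{\partial \boldsymbol{T}_i} = \sum_{j=1}^{L_i} \boldsymbol{v}_{w_{ij}} \boldsymbol{\pi}_{ij}^\top + \Big( \bar{m}_{i1} \frac{\partial r_{i1}}{\partial \boldsymbol{t}_{i1}}, \cdots, \bar{m}_{iK} \frac{\partial r_{iK}}{\partial \boldsymbol{t}_{iK}} \Big). \tag{17}$$
We proceed to optimize $\boldsymbol{T}_i$ with a gradient descent method (the step follows the gradient, since $\mathcal{L}$ is to be maximized):
$$\boldsymbol{T}_i^{(l)} = \boldsymbol{T}_i^{(l-1)} + \lambda(l, L_i) \frac{\partial \mathcal{L}(q^{(l)}, \boldsymbol{T})}{\partial \boldsymbol{T}_i}, \tag{18}$$
where $\lambda(l, L_i) = \frac{L_0 \lambda_0}{l \cdot \max\{L_i, L_0\}}$ is the learning rate function, $L_0$ is a pre-specified document length threshold, and $\lambda_0$ is the initial learning rate. As the magnitude of $\frac{\partial \mathcal{L}(q^{(l)}, \boldsymbol{T})}{\partial \boldsymbol{T}_i}$ is approximately proportional to the document length $L_i$, to avoid the step size becoming too big on a long document, if $L_i > L_0$, we normalize it by $L_i$.
To satisfy the constraint that $\|\boldsymbol{t}_{ik}^{(l)}\| \le \gamma$, when $\|\boldsymbol{t}_{ik}^{(l)}\| > \gamma$, we normalize it by $\gamma / \|\boldsymbol{t}_{ik}^{(l)}\|$. After we obtain the new $\boldsymbol{T}$, we update $\boldsymbol{r}_i^{(l)}$ using (5).
Sometimes, especially in the initial few iterations, due to the excessively big step size of the gradient descent, $\mathcal{L}(q, \boldsymbol{T})$ may decrease after the update of $\boldsymbol{T}$. Nonetheless the general direction of $\mathcal{L}(q, \boldsymbol{T})$ is increasing.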
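One generalized M-step for a single document could be sketched as follows. The gradient used here is obtained by differentiating $r_{ik} = -\log E_{P(s)}[\exp\{\boldsymbol{v}_s^\top \boldsymbol{t}_{ik}\}]$, which gives $\partial r_{ik} / \partial \boldsymbol{t}_{ik} = -\sum_s P(s \mid k)\, \boldsymbol{v}_s$. The function names, the threshold `L0=100` and the explicit reset of the null topic are assumptions made for illustration; only the initial learning rate 0.1 is taken from the experimental settings.

```python
import numpy as np


def residuals(V, T, u):
    """r_k = -log sum_s P(s) exp(v_s^T t_k) for every topic column of T."""
    return -np.log(u @ np.exp(V.T @ T))


def m_step(V, u, Vd, pi, T, step, gamma, lam0=0.1, L0=100):
    """One gradient update of the topic matrix of a document; returns (T, r)."""
    L = Vd.shape[1]
    r = residuals(V, T, u)
    P = u[:, None] * np.exp(V.T @ T + r)             # (W, K): P(s | k)
    m_bar = pi.sum(axis=0)                           # expected number of words per topic
    grad = Vd @ pi - (V @ P) * m_bar                 # (N, K): one gradient column per topic
    lam = L0 * lam0 / (step * max(L, L0))            # learning rate lambda(l, L_i)
    T = T + lam * grad                               # move along the gradient to increase L
    T[:, 0] = 0.0                                    # keep the null topic fixed at zero
    norms = np.linalg.norm(T, axis=0)
    T = T * np.minimum(1.0, gamma / np.maximum(norms, 1e-12))  # rescale into the gamma-ball
    return T, residuals(V, T, u)


rng = np.random.default_rng(0)
N, W, K, L = 4, 10, 3, 6
V = rng.normal(size=(N, W))
u = rng.dirichlet(np.ones(W))
Vd = V[:, rng.integers(0, W, size=L)]                # embeddings of the document's words
pi = rng.dirichlet(np.ones(K), size=L)               # stand-in variational assignments
T0 = np.zeros((N, K))
T1, r1 = m_step(V, u, Vd, pi, T0, step=1, gamma=0.05)
```

Because the step size decays as $1/l$, early iterations can overshoot, which matches the observation that the objective may occasionally decrease in the first few updates.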

Sharing of Topics across Documents
In principle we could use one set of topics across the whole corpus, or choose different topics for different subsets of documents. One could choose a way to best utilize cross-document information.
For instance, when the document category information is available, we could make the documents in each category share their respective set of topics, so that $M$ categories correspond to $M$ sets of topics. In the learning algorithm, only the update of $\pi_{ij}^k$ needs to be changed to cater for this situation: when the $k$-th topic is relevant to the document $i$, we update $\pi_{ij}^k$ using (13); otherwise $\pi_{ij}^k = 0$.
An identifiability problem may arise when we split topic embeddings according to document subsets. In different topic groups, some highly similar redundant topics may be learned. If we project documents into the topic space, portions of different documents that belong to the same topic may be projected onto different dimensions of the topic space, and similar documents may eventually be projected into very different topic proportion vectors. In this situation, directly using the projected topic proportion vectors could cause problems in unsupervised tasks such as clustering. A simple solution to this problem would be to compute the pairwise similarities between topic embeddings, and consider these similarities when computing the similarity between two projected topic proportion vectors. Two similar documents will then still receive a high similarity score.
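One possible way to take topic similarities into account is a soft cosine between topic proportion vectors, weighted by the cosine similarities of the topic embeddings. The text does not specify a formula, so the sketch below is only one hypothetical realization; it assumes all topic columns are non-zero (the null topic would need separate handling), and the name `soft_similarity` is illustrative.

```python
import numpy as np


def soft_similarity(phi_a, phi_b, T):
    """Similarity of two topic-proportion vectors, weighted by topic-topic cosine similarity."""
    Tn = T / np.linalg.norm(T, axis=0, keepdims=True)   # unit-length topic columns
    S = Tn.T @ Tn                                       # (K, K) cosine similarities
    num = phi_a @ S @ phi_b
    return num / np.sqrt((phi_a @ S @ phi_a) * (phi_b @ S @ phi_b))


# Topics 0 and 1 are near-duplicates learned in different groups; topic 2 is unrelated.
T = np.array([[1.0, 0.99, 0.0],
              [0.0, 0.05, 1.0]])
doc_a = np.array([1.0, 0.0, 0.0])     # all mass on topic 0
doc_b = np.array([0.0, 1.0, 0.0])     # all mass on the redundant topic 1
doc_c = np.array([0.0, 0.0, 1.0])     # all mass on topic 2
plain = doc_a @ doc_b                 # ignores that topics 0 and 1 are redundant
soft_ab = soft_similarity(doc_a, doc_b, T)
soft_ac = soft_similarity(doc_a, doc_c, T)
```

In this toy case the plain dot product treats `doc_a` and `doc_b` as unrelated, while the similarity-weighted score rates them as close and still separates both from `doc_c`.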

Experimental Results
To investigate the quality of document representation of our TopicVec model, we compared its performance against eight topic modeling or document representation methods in two document classification tasks. Moreover, to show the topic coherence of TopicVec on a single document, we present the top words in top topics learned on a news article.
Document Classification Evaluation

Experimental Setup
Compared Methods. Two setups of TopicVec were evaluated:
• TopicVec: the topic proportions learned by TopicVec;
• TV+WV: the topic proportions, concatenated with the mean word embedding of the document (same as the MeanWV below).
We compare the performance of our methods against eight methods, including three topic modeling methods, three continuous document representation methods, and the conventional bag-of-words (BOW) method. The count vector of BOW is unweighted.
The topic modeling methods include:
• LDA: the vanilla LDA (Blei et al., 2003) in the gensim library;
• sLDA: Supervised Topic Model (McAuliffe and Blei, 2008), which improves the predictive performance of LDA by modeling class labels;
• LFTM: Latent Feature Topic Modeling (Nguyen et al., 2015).
The document-topic proportions of topic modeling methods were used as their document representation.
The document representation methods include:
• TWE: Topical Word Embedding (Liu et al., 2015), which represents a document by concatenating average topic embedding and average word embedding, similar to our TV+WV;
• GaussianLDA: Gaussian LDA (Das et al., 2015), which assumes that words in a topic are random samples from a multivariate Gaussian distribution with the mean as the topic embedding. Similar to TopicVec, we derived the posterior topic proportions as the features of each document;
• MeanWV: The mean word embedding of the document.
Datasets. We used two standard document classification corpora: the 20 Newsgroups and the ApteMod version of the Reuters-21578 corpus. The two corpora are referred to as the 20News and Reuters in the following.
20News contains about 20,000 newsgroup documents evenly partitioned into 20 different categories. Reuters contains 10,788 documents, where each document is assigned to one or more categories. For the evaluation of document classification, documents appearing in two or more categories were removed. The numbers of documents in the categories of Reuters are highly imbalanced, and we only selected the largest 10 categories, leaving us with 8,025 documents in total.
The same preprocessing steps were applied to all methods: words were lowercased; stop words and words out of the word embedding vocabulary (which means that they are extremely rare) were removed.
Experimental Settings. TopicVec used the word embeddings trained using PSDVec on a March 2015 Wikipedia snapshot. The embedding vocabulary contains the most frequent 180,000 words. The dimensionality of word embeddings and topic embeddings was 500. The hyperparameters were $\boldsymbol{\alpha} = (0.1, \cdots, 0.1)$, $\gamma = 5$. For 20News and Reuters, we specified 15 and 12 topics in each category on the training set, respectively. The first topic in each category was always set to null. The learned topic embeddings were combined to form the whole topic set, where redundant null topics in different categories were removed, leaving us with 281 topics for 20News and 111 topics for Reuters. The initial learning rate was set to 0.1. After 100 GEM iterations on each dataset, the topic embeddings were obtained. Then the posterior document-topic distributions of the test sets were derived by performing one E-step given the topic embeddings trained on the training set.
LFTM includes two models: LF-LDA and LF-DMM. We chose the better performing LF-LDA to evaluate. TWE includes three models, and we chose the best performing TWE-1 to compare.
LDA, sLDA, LFTM and TWE were run with 50 topics on Reuters, as this is the optimal topic number according to (Lu et al., 2011). On the larger 20News dataset, they were run with 100 topics. Other hyperparameters of all compared methods were left at their default values.
GaussianLDA was run with 100 topics on 20News and 70 topics on Reuters. As each sampling iteration took over 2 hours, we only had time for 100 sampling iterations.
For each method, after obtaining the document representations of the training and test sets, we trained an $\ell_1$-regularized linear SVM one-vs-all classifier on the training set using the scikit-learn library (footnote 11). We then evaluated its predictive performance on the test set.
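The classification protocol can be sketched with scikit-learn as follows. The synthetic features from `make_classification` stand in for real document representations, and the regularization strength `C=1.0` and `max_iter=5000` are illustrative assumptions rather than the settings used in the experiments. `LinearSVC` handles multi-class problems with a one-vs-rest scheme, and `average="macro"` gives every category equal weight in the scores.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for document feature vectors and category labels
X, y = make_classification(n_samples=600, n_features=40, n_informative=10,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# The L1 penalty requires the primal formulation (dual=False) in LinearSVC
clf = LinearSVC(penalty="l1", dual=False, C=1.0, max_iter=5000)
clf.fit(X_train, y_train)

# Macro-averaging weights every category equally, regardless of its size
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, clf.predict(X_test), average="macro", zero_division=0)
```

The L1 penalty tends to drive many feature weights to exactly zero, which is convenient when comparing representations of very different dimensionality.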
Evaluation metrics. Considering that the largest few categories dominate Reuters, we adopted macro-averaged precision, recall and F1 measures as the evaluation metrics, to avoid the average results being dominated by the performance of the top categories.
Footnote 11: http://scikit-learn.org/stable/modules/svm.html

Evaluation Results
Table 2 presents the performance of the different methods on the two classification tasks. The highest scores are highlighted in boldface. It can be seen that TV+WV and TopicVec obtained the best performance on the two tasks, respectively. With only topic proportions as features, TopicVec performed slightly better than BOW, MeanWV and TWE, and significantly outperformed four other methods. The number of features it used was much lower than BOW, MeanWV and TWE (Table 3).
GaussianLDA performed considerably worse than all other methods. After checking the generated topic embeddings manually, we found that the embeddings for different topics were highly similar to each other. Hence the posterior topic proportions were almost uniform and non-discriminative. In addition, on the two datasets, even the fastest Alias sampling in (Das et al., 2015) took over 2 hours for one iteration and 10 days for the whole 100 iterations. In contrast, our method finished the 100 GEM iterations in 2 hours.

Qualitative Assessment of Topics Derived from a Single Document
Topic models need a large set of documents to extract coherent topics. Hence, methods depending on topic models, such as TWE, are subject to this limitation. In contrast, TopicVec can extract coherent topics and obtain document representations even when only one document is provided as input.
To illustrate this feature, we ran TopicVec on a New York Times news article about a pharmaceutical company acquisition, and obtained 20 topics. Figure 2 presents the most relevant words in the top-6 topics as a topic cloud. We first calculated the relevance between a word and a topic as the frequency-weighted cosine similarity of their embeddings. Then the most relevant words were selected to represent each topic. The sizes of the topic slices are proportional to the topic proportions, and the font sizes of individual words are proportional to their relevance to the topics. Among these top-6 topics, the largest and smallest topic proportions are 26.7% and 9.9%, respectively.
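The word-topic relevance score can be sketched as below. The exact weighting scheme is not spelled out in the text, so this sketch assumes the simplest reading: the number of occurrences of a word in the document multiplied by the cosine similarity between the word embedding and the topic embedding. The function name `top_words`, the toy vocabulary and the two-dimensional embeddings are hypothetical.

```python
from collections import Counter

import numpy as np


def top_words(tokens, vocab, V, t, n=3):
    """Rank a document's words by (frequency in the document) x cos(v_w, t)."""
    counts = Counter(tok for tok in tokens if tok in vocab)   # skip out-of-vocabulary words
    scores = {}
    for word, freq in counts.items():
        v = V[:, vocab[word]]
        cos = v @ t / (np.linalg.norm(v) * np.linalg.norm(t))
        scores[word] = freq * cos
    return sorted(scores, key=scores.get, reverse=True)[:n]


vocab = {"drug": 0, "merger": 1, "weather": 2}
V = np.array([[1.0, 0.8, 0.0],
              [0.0, 0.6, 1.0]])            # word embeddings as columns
tokens = ["drug", "merger", "drug", "weather", "unknown"]
ranking = top_words(tokens, vocab, V, t=np.array([1.0, 0.0]))
```

Weighting by frequency favors words that both point in the direction of the topic and actually occur often in the document, so rare but well-aligned words do not dominate the topic cloud.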
As shown in Figure 2, words in obtained topics were generally coherent, although the topics were only derived from a single document. The reason is that TopicVec takes advantage of the rich semantic information encoded in word embeddings, which were pretrained on a large corpus.
The topic coherence suggests that the derived topic embeddings were approximately the semantic centroids of the document. This capacity may aid applications such as document retrieval, where a "compressed representation" of the query document is helpful.

Conclusions and Future Work
In this paper, we proposed TopicVec, a generative model combining word embedding and LDA, with the aim of exploiting the word collocation patterns both at the level of the local context and the global document. Experiments show that TopicVec can learn high-quality document representations, even given only one document.
In our classification tasks we only explored the use of topic proportions of a document as its representation. However, jointly representing a document by topic proportions and topic embeddings would be more accurate. Efficient algorithms for this task have been proposed (Kusner et al., 2015).
Our method has potential applications in various scenarios, such as document retrieval, classification, clustering and summarization.