A Generative Word Embedding Model and its Low Rank Positive Semidefinite Solution

Most existing word embedding methods can be categorized into Neural Embedding Models and Matrix Factorization (MF)-based methods. However some models are opaque to probabilistic interpretation, and MF-based methods, typically solved using Singular Value Decomposition (SVD), may incur loss of corpus information. In addition, it is desirable to incorporate global latent factors, such as topics, sentiments or writing styles, into the word embedding model. Since generative models provide a principled way to incorporate latent factors, we propose a generative word embedding model, which is easy to interpret, and can serve as a basis of more sophisticated latent factor models. The model inference reduces to a low rank weighted positive semidefinite approximation problem. Its optimization is approached by eigendecomposition on a submatrix, followed by online blockwise regression, which is scalable and avoids the information loss in SVD. In experiments on 7 common benchmark datasets, our vectors are competitive to word2vec, and better than other MF-based methods.


Introduction
The task of word embedding is to model the distribution of a word and its context words using their corresponding vectors in a Euclidean space. Then by doing regression on the relevant statistics derived from a corpus, a set of vectors are recovered which best fit these statistics. These vectors, commonly referred to as the embeddings, capture semantic/syntactic regularities between the words.
The core of a word embedding method is the link function that connects the input -the embeddings, with the output -certain corpus statistics.
Based on the link function, the objective function is developed. The reasonableness of the link function impacts the quality of the obtained embeddings, and different link functions are amenable to different optimization algorithms, with different scalability. Based on the forms of the link function and the optimization techniques, most methods can be divided into two classes: the traditional neural embedding models, and more recent low rank matrix factorization methods.
The neural embedding models use the softmax link function to model the conditional distribution of a word given its context (or vice versa) as a function of the embeddings. The normalizer in the softmax function brings intricacy to the optimization, which is usually tackled by gradient-based methods. The pioneering work was (Bengio et al., 2003). Later Mnih and Hinton (2007) propose three different link functions. However there are interaction matrices between the embeddings in all these models, which complicate and slow down the training, hindering them from being trained on huge corpora. Mikolov et al. (2013a) and Mikolov et al. (2013b) greatly simplify the conditional distribution, where the two embeddings interact directly. They implemented the well-known "word2vec", which can be trained efficiently on huge corpora. The obtained embeddings show excellent performance on various tasks.
Low-Rank Matrix Factorization (MF in short) methods include various link functions and optimization methods. The link functions are usually not softmax functions. MF methods aim to reconstruct certain corpus statistics matrix by the product of two low rank factor matrices. The objective is usually to minimize the reconstruction error, optionally with other constraints. In this line of research, Levy and Goldberg (2014b) find that "word2vec" is essentially doing stochastic weighted factorization of the word-context pointwise mutual information (PMI) matrix. They then factorize this matrix directly as a new method. Pennington et al. (2014) propose a bilinear regression function of the conditional distribution, from which a weighted MF problem on the bigram logfrequency matrix is formulated. Gradient Descent is used to find the embeddings. Recently, based on the intuition that words can be organized in semantic hierarchies,  add hierarchical sparse regularizers to the matrix reconstruction error. With similar techniques,  reconstruct a set of pretrained embeddings using sparse vectors of greater dimensionality. Dhillon et al. (2015) apply Canonical Correlation Analysis (CCA) to the word matrix and the context matrix, and use the canonical correlation vectors between the two matrices as word embeddings. Stratos et al. (2014) and Stratos et al. (2015) assume a Brown language model, and prove that doing CCA on the bigram occurrences is equivalent to finding a transformed solution of the language model. Arora et al. (2015) assume there is a hidden discourse vector on a random walk, which determines the distribution of the current word. The slowly evolving discourse vector puts a constraint on the embeddings in a small text window. The maximum likelihood estimate of the embeddings within this text window approximately reduces to a squared norm objective.
There are two limitations in current word embedding methods. The first limitation is, all MFbased methods map words and their context words to two different sets of embeddings, and then employ Singular Value Decomposition (SVD) to obtain a low rank approximation of the word-context matrix M . As SVD factorizes M M , some information in M is lost, and the learned embeddings may not capture the most significant regularities in M . Appendix A gives a toy example on which SVD does not work properly.
The second limitation is, a generative model for documents parametered by embeddings is absent in recent development. Although (Stratos et al., 2014;Stratos et al., 2015;Arora et al., 2015) are based on generative processes, the generative processes are only for deriving the local relationship between embeddings within a small text window, leaving the likelihood of a document undefined. In addition, the learning objectives of some models, e.g. (Mikolov et al., 2013b, Eq.1), even have no clear probabilistic interpretation. A generative word embedding model for documents is not only easier to interpret and analyze, but more importantly, provides a basis upon which documentlevel global latent factors, such as document topics (Wallach, 2006), sentiments (Lin and He, 2009), writing styles (Zhao et al., 2011b), can be incorporated in a principled manner, to better model the text distribution and extract relevant information.
Based on the above considerations, we propose to unify the embeddings of words and context words. Our link function factorizes into three parts: the interaction of two embeddings capturing linear correlations of two words, a residual capturing nonlinear or noisy correlations, and the unigram priors. To reduce overfitting, we put Gaussian priors on embeddings and residuals, and apply Jelinek-Mercer Smoothing to bigrams. Furthermore, to model the probability of a sequence of words, we assume that the contributions of more than one context word approximately add up. Thereby a generative model of documents is constructed, parameterized by embeddings and residuals. The learning objective is to maximize the corpus likelihood, which reduces to a weighted low-rank positive semidefinite (PSD) approximation problem of the PMI matrix. A Block Coordinate Descent algorithm is adopted to find an approximate solution. This algorithm is based on Eigendecomposition, which avoids information loss in SVD, but brings challenges to scalability. We then exploit the sparsity of the weight matrix and implement an efficient online blockwise regression algorithm. On seven benchmark datasets covering similarity and analogy tasks, our method achieves competitive and stable performance.
The source code of this method is provided at https://github.com/askerlee/topicvec.

Notations and Definitions
Throughout the paper, we always use a uppercase bold letter as S, V to denote a matrix or set, a lowercase bold letter as v w i to denote a vector, a normal uppercase letter as N, W to denote a scalar constant, and a normal lowercase letter as s i , w i to denote a scalar variable. Suppose a vocabulary S = {s 1 , · · · , s W } consists of all the words, where W is the vocabulary size. We further suppose s 1 , · · · , s W are sorted in decending order of the frequency, i.e. s 1 is most frequent, and s W is least frequent. A document d i is a sequence of words d i = (w i1 , · · · , w iL i ), w ij ∈ S. A corpus is a collec-  In a document, a sequence of words is referred to as a text window, denoted by w i , · · · , w i+l , or w i :w i+l in shorthand. A text window of chosen size c before a word w i defines the context of w i as w i−c , · · · , w i−1 . Here w i is referred to as the focus word. Each context word w i−j and the focus word w i comprise a bigram w i−j , w i .
The Pointwise Mutual Information between two words s i , s j is defined as PMI(s i , s j ) = log P (s i , s j ) P (s i )P (s j ) .

Link Function of Text
In this section, we formulate the probability of a sequence of words as a function of their embeddings. We start from the link function of bigrams, which is the building blocks of a long sequence. Then this link function is extended to a text window with c context words, as a first-order approximation of the actual probability.

Link Function of Bigrams
We generalize the link function of "word2vec" and "GloVe" to the following: The rationale for (1) originates from the idea of the Product of Experts in (Hinton, 2002). Suppose different types of semantic/syntactic regularities between s i and s j are encoded in different dimensions of v s i , v s j . As exp{v s j v s i } = l exp{v s i ,l · v s j ,l }, this means the effects of different regularities on the probability are combined by multiplying together. If s i and s j are independent, their joint probability should be P (s i )P (s j ).
In the presence of correlations, the actual joint probability P (s i , s j ) would be a scaling of it. The scale factor reflects how much s i and s j are positively or negatively correlated. Within the scale factor, v s j v s i captures linear interactions between s i and s j , the residual a s i s j captures nonlinear or noisy interactions. In applications, only v s j v s i is of interest. Hence the bigger magnitude v s j v s i is of relative to a s i s j , the better.
Note that we do not assume a s i s j = a s j s i . This provides the flexibility P (s i , s j ) = P (s j , s i ), agreeing with the asymmetry of bigrams in natural languages. At the same time, v s j v s i imposes a symmetric part between P (s i , s j ) and P (s j , s i ).
(1) is equivalent to (3) of all bigrams is represented in matrix form: where G is the PMI matrix.

Gaussian Priors on Embeddings
When (1) is employed on the regression of empirical bigram probabilities, a practical issue arises: more and more bigrams have zero frequency as the constituting words become less frequent. A zero-frequency bigram does not necessarily imply negative correlation between the two constituting words; it could simply result from missing data. But in this case, even after smoothing, (1) will force v s j v s i + a s i s j to be a big negative number, making v s i overly long. The increased magnitude of embeddings is a sign of overfitting.
To reduce overfitting of embeddings of infrequent words, we assign a Spherical Gaussian prior where the hyperparameter µ i increases as the frequency of s i decreases.

Gaussian Priors on Residuals
We wish v s j v s i in (1) captures as much correlations between s i and s j as possible. Thus the smaller a s i s j is, the better. In addition, the more frequent s i , s j is in the corpus, the less noise there is in their empirical distribution, and thus the residual a s i s j should be more heavily penalized.
To this end, we penalize the residual a s i s j by f (P (s i , s j ))a 2 s i s j , where f (·) is a nonnegative monotonic transformation, referred to as the weighting function. Let h ij denoteP (s i , s j ), then the total penalty of all residuals are the square of the weighted Frobenius norm of A: By referring to "GloVe", we use the following weighting function, and find it performs well: where C cut is chosen to cut the most frequent 0.02% of the bigrams off at 1. When s i = s j , two identical words usually have much smaller probability to collocate. HenceP (s i , s i ) does not reflect the true correlation of a word to itself, and should not put constraints to the embeddings. We eliminate their effects by setting f (h ii ) to 0.
If the domain of A is the whole space R W ×W , then this penalty is equivalent to a Gaussian prior N 0, 1 2f (h ij ) on each a s i s j . The variances of the Gaussians are determined by the bigram empirical probability matrix H.

Jelinek-Mercer Smoothing of Bigrams
As another measure to reduce the impact of missing data, we apply the commonly used Jelinek-Mercer Smoothing (Zhai and Lafferty, 2004) to smooth the empirical conditional probabilitỹ P (s j |s i ) by the unigram probabilityP (s j ) as: Accordingly, the smoothed bigram empirical joint probability is defined as In practice, we find κ = 0.02 yields good results. When κ ≥ 0.04, the obtained embeddings begin to degrade with κ, indicating that smoothing distorts the true bigram distributions.

Link Function of a Text Window
In the previous subsection, a regression link function of bigram probabilities is established. In this section, we adopt a first-order approximation based on Information Theory, and extend the link function to a longer sequence w 0 , · · · , w c−1 , w c .
Decomposing a distribution conditioned on n random variables as the conditional distributions on its subsets roots deeply in Information Theory. This is an intricate problem because there could be both (pointwise) redundant information and (pointwise) synergistic information among the conditioning variables (Williams and Beer, 2010). They are both functions of the PMI. Based on an analysis of the complementing roles of these two types of pointwise information, we assume they are approximately equal and cancel each other when computing the pointwise interaction information. See Appendix B for a detailed discussion.
Following the above assumption, we have PMI(w 2 ; w 0 , w 1 ) ≈ PMI(w 2 ; w 0 )+PMI(w 2 ; w 1 ): Plugging (1) and (3) into the above, we obtain We extend the above assumption to that the pointwise interaction information is still close to 0 within a longer text window. Accordingly the above equation extends to a context of size c > 2: From it derives the conditional distribution of w c , given its context w 0 , · · · , w c−1 :

Generative Process and Likelihood
We proceed to assume the text is generated from a Markov chain of order c, i.e., a word only depends on words within its context of size c. Given the hyperparameter µ = (µ 1 , · · ·, µ W ), the generative process of the whole corpus is: 1. For each word s i , draw the embedding v s i from N (0, 1 2µ i I); 2. For each bigram s i , s j , draw the residual a s i s j from N 0, 1 2f (h ij ) ; 3. For each document d i , for the j-th word, draw word w ij from S with probability P (w ij | w i,j−c : w i,j−1 ) defined by (8). Figure 1: The Graphical Model of PSDVec The above generative process for a document d is presented as a graphical model in Figure 1. Based on this generative process, the probability of a document d i can be derived as follows, given the embeddings and residuals V , A: The complete-data likelihood of the corpus is: where Z(H, µ) is the normalizing constant.
Taking the logarithm of both sides of p(D, A, V ) yields where C 0 = M,L i i,j=1 log P (w ij ) is constant.

Learning Objective
The learning objective is to find the embeddings V that maximize the corpus log-likelihood (9). Let x ij denote the (smoothed) frequency of bigram s i , s j in the corpus. Then (9) is sorted as: As the corpus size increases, will dominate the parameter prior terms. Then we can ignore the prior terms when maximizing (10).
As both {P smoothed (s i , s j )} and {P (s i , s j )} sum to 1, the above sum is maximized when The maximum likelihood estimator is then: Writing (11) in matrix form: where "⊗" is the outer product. Now we fix the values of v s i v s j + a s i s j at the above optimal. The corpus likelihood becomes where

Learning V as Low Rank PSD Approximation
Once G * has been estimated from the corpus using (12), we seek V that maximizes (13). This is to find the maximum a posteriori (MAP) estimates of V , A that satisfy V V + A = G * . Applying this constraint to (13), we obtain Algorithm 1 BCD algorithm for finding a unregularized rank-N weighted PSD approximant.
Let X = V V . Then X is positive semidefinite of rank N . Finding V that minimizes (14) is equivalent to finding a rank-N weighted positive semidefinite approximant X of G * , subject to Tikhonov regularization. This problem does not admit an analytic solution, and can only be solved using local optimization methods.
First we consider a simpler case where all the words in the vocabulary are enough frequent, and thus Tikhonov regularization is unnecessary. In this case, we set ∀µ i = 0, and (14) becomes an unregularized optimization problem. We adopt the Block Coordinate Descent (BCD) algorithm 1 in (Srebro et al., 2003) to approach this problem. The original algorithm is to find a generic rank-N matrix for a weighted approximation problem, and we tailor it by constraining the matrix within the positive semidefinite manifold.
We summarize our learning algorithm in Algorithm 1. Here "•" is the entry-wise product. We suppose the eigenvalues λ returned by Eigen Decomposition(X) are in descending order. Q [1:N ] extracts the 1 to N rows from Q .
One key issue is how to initialize X. Srebro et al. (2003) suggest to set X (0) =G * , and point out that X (0) = 0 is far from a local optimum, thus requires more iterations. However we find G * is also far from a local optimum, and this setting converges slowly too. Setting X (0) = G * /2 usually yields a satisfactory solution in a few iterations.

Online Blockwise Regression of V
In Algorithm 1, the essential subroutine PSD Approximate() does eigendecomposition on G t , which is dense due to the logarithm transformation. Eigendecomposition on a W × W dense matrix requires O(W 2 ) space and O(W 3 ) time, difficult to scale up to a large vocabulary. In addition, the majority of words in the vocabulary are infrequent, and Tikhonov regularization is necessary for them.
It is observed that, as words become less frequent, fewer and fewer words appear around them to form bigrams. Remind that the vocabulary S = {s 1 , · · · , s W } are sorted in decending order of the frequency, hence the lower-right blocks of H and f (H) are very sparse, and cause these blocks in (14) to contribute much less penalty relative to other regions. Therefore these blocks could be ignored when doing regression, without sacrificing too much accuracy. This intuition leads to the following online blockwise regression.
The basic idea is to select a small set (e.g. 30,000) of the most frequent words as the core words, and partition the remaining noncore words into sets of moderate sizes. Bigrams consisting of two core words are referred to as core bigrams, which correspond to the top-left blocks of G and f (H). The embeddings of core words are learned approximately using Algorithm 1, on the top-left blocks of G and f (H). Then we fix the embeddings of core words, and find the embeddings of each set of noncore words in turn. After ignoring the lower-right regions of G and f (H) which correspond to bigrams of two noncore words, the quadratic terms of noncore embeddings are ignored. Consequently, finding these embeddings becomes a weighted ridge regression problem, which can be solved efficiently in closedform. Finally we combine all embeddings to get the embeddings of the whole vocabulary. The details are as follows: 1. Partition S into K consecutive groups S 1 , · · · , S k . Take K = 3 as an example. The first group is core words; 2. Accordingly partition G into K × K blocks, in this example as Partition f (H),A in the same way.
G 11 , f (H) 11 , A 11 correspond to core bi- 3. Solve V 1 V 1 + A 11 = G 11 using Algorithm 1, and obtain core embeddings V * 1 ; 4. Set V 1 = V * 1 , and find V * 2 that minimizes the total penalty of the 12-th and 21-th blocks of residuals (the 22-th block is ignored due to its high sparsity): is the weighted average of G 12 and G 21 , "•" and "/" are elementwise product and division, respectively. The columns in V 2 are independent, thus for each v s i , it is a separate weighted ridge regression problem, whose solution is (Holland, 1973): wheref i andḡ i are columns corresponding to s i inf (H) 12 and G 12 , respectively; 5. For any other set of noncore words S k , find V * k that minimizes the total penalty of the 1kth and k1-th blocks, ignoring all other kj-th and jk-th blocks; 6. Combine all subsets of embeddings to form

Experimental Results
We trained our model along with a few state-ofthe-art competitors on Wikipedia, and evaluated the embeddings on 7 common benchmark sets.
All models were trained on the English Wikipedia snapshot in March 2015. After removing nontextual elements and non-English words, 2.04 billion words were left. We used the default hyperparameters in Hyperwords when training PPMI and SVD. Word2vec, GloVe and Singular were trained with their own default hyperparameters. The embedding sets PSD-Reg-180K and PSD-Unreg-180K were trained using our online blockwise regression. Both sets contain the embeddings of the most frequent 180,000 words, based on 25,000 core words. PSD-Unreg-180K was traind with all µ i = 0, i.e. disabling Tikhonov regularization. PSD-Reg-180K was trained with 130001,180000] , i.e. increased regularization as the sparsity increases. To contrast with the batch learning performance, the performance of PSD-25K is listed, which contains the core embeddings only. PSD-25K took advantages that it contains much less false candidate words, and some test tuples (generally harder ones) were not evaluated due to missing words, thus its scores are not comparable to others.
Sparse was trained with PSD-180K-reg as the input embeddings, with default hyperparameters.
The benchmark sets are almost identical to those in (Levy et al., 2015), except that (Luong et al., 2013)'s Rare Words is not included, as many rare words are cut off at the frequency 100, making more than 1/3 of test pairs invalid.
Word Similarity There are 5 datasets: Word-Sim Similarity (WS Sim) and WordSim Relatedness (WS Rel) (Zesch et al., 2008;Agirre et al., 2009), partitioned from WordSim353 (Finkelstein et al., 2002); Bruni et al. (2012) (Mikolov et al., 2013c), with 8000 questions, and Google's analogy dataset (Mikolov et al., 2013a), with 19544 questions. After filtering questions involving out-of-vocabulary words, i.e. words that appear less than 100 times in the corpus, 7054 instances in MSR and 19364 instances in Google were left. The analogy questions were answered using 3CosAdd as well as 3CosMul proposed by Levy and Goldberg (2014a). Table 2 shows the results on all tasks. Word2vec significantly outperformed other methods on analogy tasks. PPMI and SVD performed much worse on analogy tasks than reported in (Levy et al., 2015), probably due to sub-optimal hyperparameters. This suggests their performance is unstable. The new embeddings yielded by Sparse systematically degraded compared to the old embeddings, contradicting the claim in .

Results
Our method PSD-Reg-180K performed well consistently, and is best in 4 similarity tasks. It performed worse than word2vec on analogy tasks, but still better than other MF-based methods. By comparing to PSD-Unreg-180K, we see Tikhonov regularization brings 1-4% performance boost across tasks. In addition, on similarity tasks, online blockwise regression only degrades slightly compared to batch factorization. Their performance gaps on analogy tasks were wider, but this might be explained by the fact that some hard cases were not counted in PSD-25K's evaluation, due to its limited vocabulary.

Conclusions and Future Work
In this paper, inspired by the link functions in previous works, with the support from Information Theory, we propose a new link function of a text window, parameterized by the embeddings of words and the residuals of bigrams. Based on the link function, we establish a generative model of documents. The learning objective is to find a set of embeddings maximizing their posterior likelihood given the corpus. This objective is reduced to weighted low-rank positive-semidefinite approximation, subject to Tikhonov regularization. Then we adopt a Block Coordinate Descent algorithm, jointly with an online blockwise regression algorithm to find an approximate solution. On seven benchmark sets, the learned embeddings show competitive and stable performance.
In the future work, we will incorporate global latent factors into this generative model, such as topics, sentiments, or writing styles, and develop more elaborate models of documents. Through learning such latent factors, important summary information of documents would be acquired, which are useful in various applications.

Appendix A Possible Trap in SVD
Suppose M is the bigram matrix of interest. SVD embeddings are derived from the low rank approximation of M M , by keeping the largest singular values/vectors. When some of these singular values correspond to negative eigenvalues, undesirable correlations might be captured. The following is an example of approximating a PMI matrix.
In a rank-2 approximation, the largest two singular values/vectors are kept, and M (1) and M (2) yield identical SVD embeddings V = ( 0.45 0.89 0 0 0 1 ) (the rows may be scaled depending on the algorithm, without affecting the validity of the following conclusion). The embeddings of s 1 and s 2 (columns 1 and 2 of V ) point at the same direction, suggesting they are positively correlated. However as M (2) 1,2 = M (2) 2,1 = −1.6 < 0, they are actually negatively correlated in the second corpus. This inconsistency is because the principal eigenvalue of M (2) is negative, and yet the corresponding singular value/vector is kept.

Appendix B Information Theory
Redundant information refers to the reduced uncertainty by knowing the value of any one of the conditioning variables (hence redundant). Synergistic information is the reduced uncertainty ascribed to knowing all the values of conditioning variables, that cannot be reduced by knowing the value of any variable alone (hence synergistic).
The mutual information I(y; x i ) and the redundant information Rdn(y; x 1 , x 2 ) are defined as: Rdn(y; x 1 , x 2 ) = E P (y) min The synergistic information Syn(y; x 1 , x 2 ) is defined as the PI-function in (Williams and Beer, 2010), skipped here. I ( y ; x 1 ) I ( y ; x 2 ) S y n ( y ; x 1 , x 2 ) I ( y ; x 1 , x 2 ) R d n ( y ; x 1 , x 2 ) Figure 2: Different types of information among 3 random variables y, x 1 , x 2 . I(y; x 1 , x 2 ) is the mutual information between y and (x 1 , x 2 ). Rdn(y; x 1 , x 2 ) and Syn(y; x 1 , x 2 ) are the redundant information and synergistic information between x 1 , x 2 , conditioning y, respectively.
PMI is the pointwise counterpart of mutual information I. Similarly, all the above concepts have their pointwise counterparts, obtained by dropping the expectation operator. Specifically, the pointwise interaction information is defined as PInt(x 1 , x 2 , y) = PMI(y; x 1 , x 2 ) − PMI(y; x 1 ) − PMI(y; x 2 ) = log P (x 1 )P (x 2 )P (y)P (x 1 ,x 2 ,y) P (x 1 ,x 2 )P (x 1 ,y)P (x 2 ,y) . If we know PInt(x 1 , x 2 , y), we can recover PMI(y; x 1 , x 2 ) from the mutual information over the variable subsets, and then recover the joint distribution P (x 1 , x 2 , y).
As the pointwise redundant information PRdn(y; x 1 , x 2 ) and the pointwise synergistic information PSyn(y; x 1 , x 2 ) are both higherorder interaction terms, their magnitudes are usually much smaller than the PMI terms. We assume they are approximately equal, and thus cancel each other when computing PInt. Given this, PInt is always 0. In the case of three words w 0 , w 1 , w 2 , PInt(w 0 , w 1 , w 2 ) = 0 leads to PMI(w 2 ; w 0 , w 1 ) = PMI(w 2 ; w 0 )+PMI(w 2 ; w 1 ).