Explaining and Generalizing Skip-Gram through Exponential Family Principal Component Analysis

The popular skip-gram model induces word embeddings by exploiting the signal from word-context cooccurrence. We offer a new interpretation of skip-gram based on exponential family PCA, a form of matrix factorization, and use it to generalize the skip-gram model to tensor factorization. In turn, this lets us train embeddings through richer higher-order cooccurrences, e.g., triples that include positional information (to incorporate syntax) or morphological information (to share parameters across related words). We experiment on 40 languages and show that our model improves upon skip-gram.


Introduction
Over the past few years, NLP has witnessed a veritable frenzy on the topic of word embeddings: low-dimensional representations of distributional information. These embeddings, trained on extremely large text corpora such as Wikipedia and Common Crawl, are claimed to encode semantic knowledge extracted from those corpora.
Numerous methods have been proposed for learning these low-dimensional embeddings from a bag of contexts associated with each word type, the most popular being skip-gram (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). Natural language text, however, contains richer structure than simple context-word pairs. In this work, we embed n-tuples rather than pairs, allowing us to escape the bag-of-words assumption and encode richer linguistic structures.
As a first step, we offer a novel interpretation of the skip-gram model (Mikolov et al., 2013). We show how skip-gram can be viewed as an application of exponential-family principal components analysis (EPCA) (Collins et al., 2001) to an integer matrix of cooccurrence counts. Previous work has related the negative sampling estimator for skip-gram model parameters to the factorization of a matrix of (shifted) positive pointwise mutual information (Levy and Goldberg, 2014b). We show that the skip-gram objective itself is exactly EPCA factorization.
By extending EPCA factorization from matrices to tensors, we can consider higher-order cooccurrence statistics. Here we explore incorporating positional and morphological content into the model by factorizing a positional tensor and a morphology tensor. The positional tensor directly incorporates word order into the model, while the morphology tensor adds word-internal information. We validate our models experimentally on 40 languages and show large gains under standard metrics.[1]

Matrix Factorization
In this section, we briefly explain how skip-gram is an example of EPCA. We are given data in the form of a matrix X ∈ R^{n1×n2}, where X_{ij} is the number of times that word j appears in context i under some user-specified definition of "context." Principal components analysis (Pearson, 1901) approximates X as the product CᵀW of two matrices C ∈ R^{d×n1} and W ∈ R^{d×n2}, whose columns are d-dimensional vectors that embed the contexts and the words, respectively, for some user-specified d < min(n1, n2). Specifically, PCA minimizes

    ‖X − CᵀW‖²_F = Σ_{i,j} (X_{ij} − c_i · w_j)²    (1)

where c_i, w_j, x_j denote the i-th column of C and the j-th columns of W and X, and c_i · w_j denotes an inner product of vectors (sometimes called "cosine similarity"). Note that rank(CᵀW) ≤ d, whereas rank(X) ≤ min(n1, n2). Globally optimizing equation (1) means finding the best approximation to X with rank ≤ d (Eckart and Young, 1936), and can be done by SVD (Golub and Van Loan, 2012).

Figure 1: Comparison of the graphical model for matrix factorization (either PCA or EPCA) and 3-dimensional tensor factorization. Priors are omitted from the drawing.
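Concretely, the global minimizer of equation (1) can be read off a truncated SVD. The following NumPy sketch (toy data, not the paper's code) builds the two factors with the shapes used in the text, C ∈ R^{d×n1} and W ∈ R^{d×n2}:

```python
import numpy as np

def pca_factorize(X, d):
    """Best rank-d approximation of X in Frobenius norm (Eckart-Young),
    returned as factors C (d x n1) and W (d x n2) with X ~= C.T @ W."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Split each retained singular value evenly between the two factors.
    C = (U[:, :d] * np.sqrt(s[:d])).T
    W = np.sqrt(s[:d])[:, None] * Vt[:d]
    return C, W

rng = np.random.default_rng(0)
X = rng.poisson(3.0, size=(30, 50)).astype(float)  # toy count matrix
C, W = pca_factorize(X, d=5)
# Residual equals the root-sum-of-squares of the discarded singular values.
err = np.linalg.norm(X - C.T @ W)
```

Splitting each singular value as √s between the factors is one arbitrary choice; any rescaling of C compensated by W gives the same product CᵀW.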
By rewriting equation (1) as

    Σ_j ‖x_j − Cᵀw_j‖²    (2)

both Roweis (1997) and Tipping and Bishop (1999) observed that the optimal values of C and W can be regarded as the maximum-likelihood parameter estimates for the Gaussian graphical model drawn in Figure 1a. This model supposes that the observed column vector x_j equals Cᵀw_j plus Gaussian noise, specifically x_j ∼ N(Cᵀw_j, I). Equation (2) is this model's negated log-likelihood (plus a constant).[3] However, recall that in our application, x_j is a vector of observed counts of the various contexts in which word j appeared. Its elements are always non-negative integers, so as Hofmann (1999) saw, it is peculiar to model x_j as having been drawn from a Gaussian. EPCA is a generalization of PCA in which the observation x_j can be drawn according to any exponential-family distribution (log-linear distribution) over vectors.[4] The canonical parameter vector for this distribution is given by the j-th column of CᵀW, that is, Cᵀw_j.[5]

EPCA allows us to suppose that each x_j was drawn from a multinomial, a more appropriate family for drawing a count vector. Our observation is that skip-gram is precisely multinomial EPCA with the canonical link function (Mohamed, 2011), which generates x_j from a multinomial with log-linear parameterization. That is, skip-gram chooses embeddings C, W to maximize

    Σ_{i,j} X_{ij} log [ exp(c_i · w_j) / Σ_{i′} exp(c_{i′} · w_j) ]    (3)

This is the log-likelihood (plus a constant) if we assume that for each word j, the context vector x_j was drawn from a multinomial with natural parameter vector Cᵀw_j and count parameter N_j = Σ_i X_{ij}. This is the same model as in Figure 1a, but with a different conditional distribution for x_j, and with x_j taking an additional observed parent N_j (which is the token count of word j).

[3] The graphical model further suggests that the c_i and w_j vectors are themselves drawn from some prior. Specifying this prior defines a MAP estimate of C and W. If we take the prior to be a spherical Gaussian with mean 0 ∈ R^d, the MAP estimate corresponds to minimizing (2) plus an L2 regularizer, that is, a multiple of ‖C‖²_F + ‖W‖²_F. We do indeed regularize in this way throughout all our experiments, tuning the multiplier on a held-out development set. However, regularization has only minor effects with large training corpora, and is not in the original word2vec implementation of skip-gram.

[4] EPCA extends PCA in the same way that generalized linear models (GLMs) extend linear regression. The maximum-likelihood interpretation of linear regression supposes that the dependent variable x_j is a linear function C of the independent variable w_j plus Gaussian noise. The GLM, like EPCA, is an extension that allows other exponential-family distributions for the dependent variable x_j. The difference is that in EPCA, the representations w_j are learned jointly with C.

[5] In the general form of EPCA, that column is passed through some "inverse link" function to obtain the expected feature values under the distribution, which in turn determines the canonical parameters of the distribution. We use the so-called canonical link, meaning that these two steps are inverses of each other and thus the canonical parameters are themselves a linear function of w_j.
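The skip-gram objective, i.e., the multinomial EPCA log-likelihood, can be computed directly with a column-wise log-softmax. A minimal NumPy sketch on toy data (an illustration, not the word2vec implementation):

```python
import numpy as np

def skipgram_loglik(X, C, W):
    """sum_{i,j} X_ij * log p(i | j), where p(. | j) is a softmax over
    contexts i of the log-linear scores c_i . w_j."""
    scores = C.T @ W                                      # entry (i, j) = c_i . w_j
    log_p = scores - np.logaddexp.reduce(scores, axis=0)  # log-softmax down each column
    return float(np.sum(X * log_p))

rng = np.random.default_rng(1)
X = rng.poisson(2.0, size=(20, 15)).astype(float)  # toy context-word counts
C = rng.normal(scale=0.1, size=(8, 20))
W = rng.normal(scale=0.1, size=(8, 15))
ll = skipgram_loglik(X, C, W)
```

The `logaddexp.reduce` keeps the normalizer numerically stable; the count parameter N_j never appears because it only contributes a constant to the log-likelihood.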

Related work
Levy and Goldberg (2014b) also interpreted skip-gram as matrix factorization. They argued that skip-gram estimation by negative sampling implicitly factorizes a shifted matrix of positive empirical pointwise mutual information values. We instead regard the skip-gram objective itself as demanding EPCA-style factorization of the count matrix X: i.e., X arose stochastically from some unknown matrix of log-linear parameters (column j of X generated from parameter column j), and we seek a rank-d estimate CᵀW of that matrix. pLSI (Hofmann, 1999) similarly factors an unknown matrix of multinomial probabilities, which is multinomial EPCA with the identity link function. In contrast, our unknown matrix holds log-linear parameters: arbitrarily shifted log-probabilities, not probabilities.
Our EPCA interpretation applies equally well to the component distributions that are used in hierarchical softmax (Morin and Bengio, 2005), which is an alternative to negative sampling. Additionally, it suggests avenues of future research using Bayesian (Mohamed et al., 2008) and maximum-margin (Srebro et al., 2004) extensions to EPCA.

Tensor Factorization
Having seen that skip-gram is a form of matrix factorization, we can generalize it to tensors. In contrast to the matrix case, there are several distinct definitions of tensor factorization (Kolda and Bader, 2009). We focus on the polyadic decomposition (Hitchcock, 1927), which yields a satisfying generalization. Given a tensor X ∈ R^{n1×n2×n3}, the tensor analogue to PCA is rank-d tensor approximation, which minimizes

    Σ_{i,j,k} ( X_{ijk} − Σ_{ℓ=1}^{d} C_{ℓi} W_{ℓj} R_{ℓk} )²    (6)

This objective tries to predict each entry as the three-way dot product of the columns c_i, w_j, r_k ∈ R^d, thus finding an approximation to X that factorizes into C, W, R. This polyadic decomposition of the approximating tensor can be viewed as a Tucker decomposition (Tucker, 1966) that enforces a diagonal core. In our setting, the new matrix R ∈ R^{d×n3} embeds types of context-word relations. The tensor X can be regarded as a collection of n2·n3 count vectors X_{·jk} ∈ N^{n1}: the fibers of the tensor, each of which provides the context counts for some (word j, relation k) pair. Typically, X_{·jk} counts which context words i are related to word j by relation k.
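The three-way dot product in objective (6) is a single einsum. A small NumPy sketch of the polyadic reconstruction and the squared-error objective (toy shapes, not the experimental code):

```python
import numpy as np

def polyadic(C, W, R):
    """Rank-d polyadic reconstruction: entry (i, j, k) is the three-way
    dot product sum_l C[l, i] * W[l, j] * R[l, k]."""
    return np.einsum('li,lj,lk->ijk', C, W, R)

def tensor_pca_loss(X, C, W, R):
    """Objective (6): total squared reconstruction error."""
    return float(np.sum((X - polyadic(C, W, R)) ** 2))

rng = np.random.default_rng(2)
C, W, R = (rng.normal(size=(3, n)) for n in (4, 5, 6))
X = polyadic(C, W, R)               # an exactly rank-3 tensor
loss = tensor_pca_loss(X, C, W, R)  # exact factors give zero error
```

Unlike the matrix case, minimizing (6) has no closed-form SVD solution; iterative methods such as alternating least squares are standard (Kolda and Bader, 2009).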
We now move from third-order PCA to third-order EPCA. Minimizing equation (6) corresponds to maximum-likelihood estimation of the graphical model in Figure 1b, in which each fiber of X is viewed as being generated from a Gaussian all at once. Our higher-order skip-gram (HOSG) replaces this Gaussian with a multinomial. Thus, HOSG attempts to maximize the log-likelihood

    Σ_{i,j,k} X_{ijk} log p(i | j, k)    (7)

where

    p(i | j, k) = exp(Σ_ℓ C_{ℓi} W_{ℓj} R_{ℓk}) / Σ_{i′} exp(Σ_ℓ C_{ℓi′} W_{ℓj} R_{ℓk})    (8)

Note that as before, we are taking the total count N_{jk} = Σ_i X_{ijk} to be observed. So while our embedding matrices must predict which words are related to word j by relation k, we are not probabilistically modeling how often word j participates in relation k in the first place (nor how often word j occurs overall). A simple and natural move in future work would be to extend the generative model to predict these facts also from w_j and r_k, although this would weaken the pedagogical connection to EPCA.
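Equations (7) and (8) amount to a log-softmax over the context axis of the three-way score tensor. A NumPy sketch on toy data (an illustration of the objective, not the training code):

```python
import numpy as np

def hosg_loglik(X, C, W, R):
    """HOSG objective (7): sum_{ijk} X_ijk * log p(i | j, k), where (8)
    defines p(. | j, k) as a softmax over contexts i of the scores
    sum_l C[l, i] * W[l, j] * R[l, k]."""
    scores = np.einsum('li,lj,lk->ijk', C, W, R)          # n1 x n2 x n3
    log_p = scores - np.logaddexp.reduce(scores, axis=0)  # log-softmax over axis i
    return float(np.sum(X * log_p))

rng = np.random.default_rng(3)
X = rng.poisson(1.0, size=(10, 6, 4)).astype(float)  # toy cooccurrence tensor
C = rng.normal(scale=0.1, size=(5, 10))
W = rng.normal(scale=0.1, size=(5, 6))
R = rng.normal(scale=0.1, size=(5, 4))
ll = hosg_loglik(X, C, W, R)
```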
We locally optimize the parameters of our probability model (the word, context, and relation embeddings) through stochastic gradient ascent on (7). Each stochastic gradient step computes the gradient of a single summand X_{ijk} log p(i | j, k). Unfortunately, this requires summing over n1 contexts in the denominator of (8), which is problematic as n1 is often very large, e.g., 10^7. Mikolov et al. (2013) offer two speedup schemes: negative sampling and hierarchical softmax. Here we apply the negative sampling approximation to HOSG; hierarchical softmax is also applicable. See Goldberg and Levy (2014) for an in-depth discussion.
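One possible form of such a stochastic step on the negative-sampling surrogate, adapted to the three-way scores of (8), is sketched below. The update rules are our own illustration under the usual SGNS objective, not the authors' implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ns_objective(C, W, R, i, j, k, neg_ids):
    """Negative-sampling surrogate for one observed (i, j, k) triple:
    log sigma(s_ijk) + sum over sampled i' of log sigma(-s_i'jk),
    where s_ijk = sum_l C[l, i] W[l, j] R[l, k]."""
    h = W[:, j] * R[:, k]  # Hadamard "query" vector for the (j, k) pair
    obj = np.log(sigmoid(C[:, i] @ h))
    for n in neg_ids:
        obj += np.log(sigmoid(-C[:, n] @ h))
    return float(obj)

def ns_step(C, W, R, i, j, k, neg_ids, lr=0.01):
    """One gradient-ascent step on the surrogate above, in place."""
    h = W[:, j] * R[:, k]
    grad_h = np.zeros_like(h)
    for idx, label in [(i, 1.0)] + [(n, 0.0) for n in neg_ids]:
        g = label - sigmoid(C[:, idx] @ h)  # d(obj)/d(score)
        grad_h += g * C[:, idx]
        C[:, idx] += lr * g * h
    w_old = W[:, j].copy()                  # chain rule through h = w_j * r_k
    W[:, j] += lr * grad_h * R[:, k]
    R[:, k] += lr * grad_h * w_old

rng = np.random.default_rng(4)
C = rng.normal(scale=0.1, size=(8, 50))
W = rng.normal(scale=0.1, size=(8, 20))
R = rng.normal(scale=0.1, size=(8, 5))
before = ns_objective(C, W, R, i=3, j=7, k=2, neg_ids=[11, 29, 40])
ns_step(C, W, R, i=3, j=7, k=2, neg_ids=[11, 29, 40])
after = ns_objective(C, W, R, i=3, j=7, k=2, neg_ids=[11, 29, 40])
```

For a small learning rate, one ascent step increases the surrogate objective for this triple; a real trainer would loop over all observed triples with resampled negatives.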
HOSG is a bit slower to train than skip-gram, since the tensor X yields up to n3 times as many summands as the corresponding matrix (but far fewer than n3 times in practice, as X is often sparse).

Two Tensors for Word Embedding
As examples of useful tensors to factorize, we offer two third-order generalizations of Mikolov et al. (2013)'s context-word matrix. We are still predicting the distribution of contexts of a given word type.
Our first version increases the number of parameters (giving more expressivity) by conditioning on additional information. Our second version decreases the number of parameters (giving better smoothing) by factoring the word type.

Positional Tensor
When predicting the context words in a window around a given word token, Mikolov et al. (2013) use the same distribution to predict each of them. We propose to use different distributions at different positions in the window, via a "positional tensor": X_{dog,ran,−2} is the number of times the context word dog was seen two positions to the left of ran. We will predict this count using p(dog | ran, −2), defined from the embeddings of the word ran, the position −2, and the context word dog and its competitors. For a 10-word window, we have X ∈ R^{|V|×|V|×10}. Considering word position should improve syntactic awareness.
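The positional tensor's counts can be collected in one pass over the corpus. A sketch with a sparse dictionary-of-counts representation (the actual pipeline is built on the HYPERWORDS package):

```python
from collections import Counter

def positional_counts(tokens, window=2):
    """Sparse positional tensor: maps (context_word, word, offset) to its
    count, for offsets -window..window excluding 0."""
    X = Counter()
    for t, word in enumerate(tokens):
        for off in range(-window, window + 1):
            s = t + off
            if off != 0 and 0 <= s < len(tokens):
                # tokens[s] was seen at offset `off` relative to `word`
                X[(tokens[s], word, off)] += 1
    return X

X = positional_counts("the dog ran back to the dog".split(), window=2)
```

Summing X over the offset axis recovers the ordinary bag-of-words context-word matrix, which makes the positional tensor a strict refinement of the skip-gram input.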

Compositional Morphology Tensor
For Mikolov et al. (2013), related words such as ran and running are monolithic objects that do not share parameters. We decompose each word into a lemma j and a morphological tag k; contexts i are still full words.[6] Thus, we predict the count X_{dog,RUN,t} using p(dog | RUN, t), where t is a morphological tag such as [pos=V,tense=PAST].
Our model is essentially a version of the skip-gram method (Mikolov et al., 2013) that parameterizes the embedding of the word ran as a Hadamard product w_j ⊙ r_k, where w_j embeds RUN and r_k embeds the tag t. This is similar to the work of Cotterell et al. (2016), who parameterized word embeddings as a sum w_j + r_k of embeddings of the component morphemes.[7] Our Hadamard product embedding is in fact more general, since the additive embedding w_j + r_k can be recovered as a special case: it is equal to (w_j; 1) ⊙ (1; r_k), which uses twice as many dimensions to embed each object.
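This reduction can be checked numerically: (w_j; 1) ⊙ (1; r_k) concatenates to (w_j; r_k), so pairing it with a duplicated context vector (c; c) recovers the additive score c · (w_j + r_k). A quick check on toy vectors (the duplicated context is our reading of how the scores line up):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 6
w = rng.normal(size=d)  # lemma embedding, e.g. for RUN
r = rng.normal(size=d)  # tag embedding, e.g. for [pos=V,tense=PAST]
c = rng.normal(size=d)  # some context embedding

additive = c @ (w + r)  # score under the additive parameterization

# Hadamard parameterization in 2d dimensions:
# (w; 1) * (1; r) = (w; r), so (c; c) . (w; r) = c.w + c.r = c.(w + r).
w_pad = np.concatenate([w, np.ones(d)])
r_pad = np.concatenate([np.ones(d), r])
c_pad = np.concatenate([c, c])
hadamard = c_pad @ (w_pad * r_pad)
```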

Experiments
We build HOSG on top of the HYPERWORDS package. All models (both skip-gram and higher-order skip-gram) are trained for 10 epochs and use 5 negative samples. All models for §5.1 are trained on the Sept. 2016 dump of the full Wikipedia. All models for §5.2 are trained on the lemmatized and POS-tagged WaCky corpora (Baroni et al., 2009) for French, Italian, German and English (Joubarne and Inkpen, 2011; Leviant and Reichart, 2015). To ensure controlled and fair experiments, we follow Levy et al. (2015) for all preprocessing.

Experiment 1: Positional Tensor
We postulate that the positional tensor should encode richer notions of syntax than standard bag-of-words vectors. Why? Positional information allows us to differentiate the geometry of the cooccurrence: e.g., the is found to the left of the noun it modifies and is, more often than not, close to it. Our tensor factorization model explicitly encodes this information during training.

[6] If one wanted to extend the model to decompose the context words i as well, we see at least four approaches.

[7] Cotterell et al. (2016) made two further moves that could be applied to extend the present paper. First, they allowed a word to consist of any number of (unordered) morphemes, not necessarily two, whose embeddings were combined (by summation) to get the word embedding. Second, this sum also included word-specific random noise, allowing them to learn word embeddings that deviated from compositionality.
To evaluate the vectors, we use QVEC, which measures Pearson's correlation between human-annotated judgements and the vectors using CCA. The QVEC metric will be higher if the vectors better correlate with the human-annotated resource. To measure the syntactic content of the vectors, we compute the correlation between our learned vector w_i for each word and its empirical distribution g_i over universal POS tags (Petrov et al., 2012) in the UD treebank (Nivre et al., 2016). g_i can be regarded as a point on the (|T| − 1)-dimensional simplex, where T is the tag set. We report results on 40 languages from the UD treebanks in Table 1, using 4-word or 10-word symmetric context windows (i.e., c ∈ {2, 5}). We find that for 77.5% of the languages, our positional tensor embeddings outperform the standard skip-gram approach on the QVEC metric.
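The empirical tag distribution g_i is simply a normalized count vector per word type. A sketch of its computation (the real evaluation reads tags from the UD treebanks):

```python
from collections import Counter, defaultdict

def pos_distributions(tagged_tokens, tagset):
    """Empirical distribution g_i over POS tags for each word type,
    from a corpus given as (word, tag) pairs."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    # Normalize each word's tag counts into a point on the simplex.
    return {w: [c[t] / sum(c.values()) for t in tagset]
            for w, c in counts.items()}

g = pos_distributions(
    [("run", "VERB"), ("run", "NOUN"), ("run", "VERB"), ("dog", "NOUN")],
    tagset=["NOUN", "VERB"],
)
```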
We highlight again that the positional tensor exploits no additional annotation, but better exploits the signal found in the raw text. Of course, our HOSG method could also be used to exploit annotations if available: e.g., one would get different embeddings by defining the relations of word j to be the labeled syntactic dependency relations in which it participates (Lin and Pantel, 2001;Levy and Goldberg, 2014a).

Experiment 2: Morphology Tensor
Since the compositional morphology tensor allows us to share parameters among related word forms, we get a single embedding for each lemma, i.e., the words ran, run and running now all contribute signal to the embedding of run. We expect these lemma embeddings to be predictive of human judgments of lemma similarity. We evaluate using standard datasets in four languages (French, Italian, German and English). Given a list of pairs of words (always lemmata), multiple native speakers judged (on a scale of 1-10) how "similar" those words are conceptually. Our model produces a similarity judgment for each pair using the cosine similarity of their lemma embeddings w_j. Table 2 shows how well this learned judgment correlates with the average human judgment. Our model achieves higher correlation than skip-gram word embeddings. Note that we did not compare to a baseline that simply embeds lemmata rather than words (equivalent to fixing r_k = 1).
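The per-pair similarity score is the usual cosine of the two lemma embeddings; a minimal helper (not the HYPERWORDS evaluation code):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

v = np.array([1.0, 2.0, 3.0])
```

Because cosine is scale-invariant, any global rescaling of the learned lemma embeddings leaves the ranking of pairs, and hence the reported correlation, unchanged.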

Related Work
Tensor factorization has already found uses in a few corners of NLP research. Van de Cruys et al. (2013) applied tensor factorization to model the compositionality of subject-verb-object triples. Similarly, Hashimoto and Tsuruoka (2015) use an implicit tensor factorization method to learn embeddings for transitive verb phrases. Tensor factorization also appears in semantic-based NLP tasks. Lei et al. (2015) explicitly factorize a tensor based on feature vectors for predicting semantic roles. Chang et al. (2014) use tensor factorization to create knowledge base embeddings optimized for relation extraction. See Bouchard et al. (2015) for a large bibliography.
Other researchers have likewise attempted to escape the bag-of-words assumption in word embeddings, e.g., Yatbaz et al. (2012) incorporate morphological and orthographic features into continuous vectors; Cotterell and Schütze (2015) consider a multi-task set-up to force morphological information into embeddings; Cotterell and Schütze (2017) jointly morphologically segment and embed words; Levy and Goldberg (2014a) derive contexts based on dependency relations; PPDB (Ganitkevitch et al., 2013) employs a mixed bag of words, parts of speech, and syntax; Rastogi et al. (2015) represent word contexts, morphology, semantic frame relations, syntactic dependency relations, and multilingual bitext counts each as separate matrices, combined via GCCA; and, finally, Schwartz et al. (2016) derived embeddings based on Hearst patterns (Hearst, 1992). Ling et al. (2015) learn position-specific word embeddings (§4.1), but do not factor them as w_j ⊙ r_k to share parameters (we did not compare empirically to this). As demonstrated in the experiments, our tensor factorization method enables us to include other syntactic properties besides word order, e.g., morphology. Poliak et al. (2017) also create positional word embeddings. Our research direction is orthogonal to these efforts in that we provide a general-purpose procedure for all sorts of higher-order cooccurrence.

Conclusion
We have presented an interpretation of the skip-gram model as exponential family principal components analysis, a form of matrix factorization, and thus related it to an older strain of work. Building on this connection, we generalized the model to the tensor case. Such higher-order skip-gram methods can incorporate more linguistic structure without sacrificing scalability, as we illustrated by making our embeddings consider word order or morphology. These methods achieved better word embeddings as evaluated by standard metrics on 40 languages.