Skip-Gram − Zipf + Uniform = Vector Additivity

In recent years word-embedding models have gained great popularity due to their remarkable performance on several tasks, including word analogy questions and caption generation. An unexpected “side-effect” of such models is that their vectors often exhibit compositionality, i.e., adding two word-vectors results in a vector that is only a small angle away from the vector of a word representing the semantic composite of the original words, e.g., “man” + “royal” = “king”. This work provides a theoretical justification for the presence of additive compositionality in word vectors learned using the Skip-Gram model. In particular, it shows that additive compositionality holds in an even stricter sense (small distance rather than small angle) under certain assumptions on the process generating the corpus. As a corollary, it explains the success of vector calculus in solving word analogies. When these assumptions do not hold, this work describes the correct non-linear composition operator. Finally, this work establishes a connection between the Skip-Gram model and the Sufficient Dimensionality Reduction (SDR) framework of Globerson and Tishby: the parameters of SDR models can be obtained from those of Skip-Gram models simply by adding information on symbol frequencies. This shows that Skip-Gram embeddings are optimal in the sense of Globerson and Tishby and, further, implies that the heuristics commonly used to approximately fit Skip-Gram models can be used to fit SDR models.


Abstract
In recent years word-embedding models have gained great popularity due to their remarkable performance on several tasks, including word analogy questions and caption generation. An unexpected "sideeffect" of such models is that their vectors often exhibit compositionality, i.e., adding two word-vectors results in a vector that is only a small angle away from the vector of a word representing the semantic composite of the original words, e.g., "man" + "royal" = "king".
This work provides a theoretical justification for the presence of additive compositionality in word vectors learned using the Skip-Gram model. In particular, it shows that additive compositionality holds in an even stricter sense (small distance rather than small angle) under certain assumptions on the process generating the corpus. As a corollary, it explains the success of vector calculus in solving word analogies. When these assumptions do not hold, this work describes the correct non-linear composition operator.
Finally, this work establishes a connection between the Skip-Gram model and the Sufficient Dimensionality Reduction (SDR) framework of Globerson and Tishby: the parameters of SDR models can be obtained from those of Skip-Gram models simply by adding information on symbol frequencies. This shows that Skip-Gram embeddings are optimal in the sense of Globerson and Tishby and, further, im-plies that the heuristics commonly used to approximately fit Skip-Gram models can be used to fit SDR models.

Introduction
The strategy of representing words as vectors has a long history in computational linguistics and machine learning. The general idea is to find a map from words to vectors such that wordsimilarity and vector-similarity are in correspondence. Whilst vector-similarity can be readily quantified in terms of distances and angles, quantifying word-similarity is a more ambiguous task.
A key insight in that regard is to posit that the meaning of a word is captured by "the company it keeps" (Firth, 1957) and, therefore, that two words that keep company with similar words are likely to be similar themselves.
In the simplest case, one seeks vectors whose inner products approximate the co-occurrence frequencies. In more sophisticated methods cooccurrences are reweighed to suppress the effect of more frequent words (Rohde et al., 2006) and/or to emphasize pairs of words whose co-occurrence frequency maximally deviates from the independence assumption (Church and Hanks, 1990).
An alternative to seeking word-embeddings that reflect co-occurrence statistics is to extract the vectorial representation of words from non-linear statistical language models, specifically neural networks. (Bengio et al., 2003) already proposed (i) associating with each vocabulary word a feature vector, (ii) expressing the probability function of word sequences in terms of the feature vectors of the words in the sequence, and (iii) learning simultaneously the vectors and the parameters of the probability function. This approach came into prominence recently through works of Mikolov et al. (see below) whose main departure from (Bengio et al., 2003) was to follow the suggestion of (Mnih and Hinton, 2007) and tradeaway the expressive capacity of general neuralnetwork models for the scalability (to very large corpora) afforded by (the more restricted class of) log-linear models.
An unexpected side effect of deriving wordembeddings via neural networks is that the wordvectors produced appear to enjoy (approximate) additive compositionality: adding two wordvectors often results in a vector whose nearest word-vector belongs to the word capturing the composition of the added words, e.g., "man" + "royal" = "king" (Mikolov et al., 2013c). This unexpected property allows one to use these vectors to answer word-analogy questions algebraically, e.g., answering the question "Man is to king as woman is to " by returning the word whose word-vector is nearest to the vector v(king)v(man) + v(woman).
In this work we focus on explaining the source of this phenomenon for the most prominent such model, namely the Skip-Gram model introduced in (Mikolov et al., 2013a). The Skip-Gram model learns vector representations of words based on their patterns of co-occurrence in the training corpus as follows: it assigns to each word c in the vocabulary V , a "context" and a "target" vector, respectively u c and v c , which are to be used in order to predict the words that appear around each occurrence of c within a window of ∆ tokens. Specifically, the log probability of any target word w to occur at any position within distance ∆ of a context word c is taken to be proportional to the inner product between u c and v w , i.e., letting n = |V |, (1) Further, Skip-Gram assumes that the conditional probability of each possible set of words in a window around a context word c factorizes as the product of the respective conditional probabilities: p(w δ |c).
(2) (Mikolov et al., 2013a) proposed learning the Skip-Gram parameters on a training corpus by using maximum likelihood estimation under (1) and (2). Thus, if w i denotes the i-th word in the training corpus and T the length of the corpus, we seek the word vectors that maximize As mentioned, the normalized context vectors obtained from maximizing (3) under (1) and (2) exhibit additive compositionality. For example, the cosine distance between the sum of the context vectors of the words "Vietnam" and "capital" and the context vector of the word "Hanoi" is small.
While there has been much interest in using algebraic operations on word vectors to carry out semantic operations like composition, and mathematically-flavored explanations have been offered (e.g., in the recent work (Paperno and Baroni, 2016)), the only published work which attempts a rigorous theoretical understanding of this phenomenon is (Arora et al., 2016). This work guarantees that word vectors can be recovered by factorizing the so-called PMI matrix, and that algebraic operations on these word vectors can be used to solve analogies, under certain conditions on the process that generated the training corpus. Specifically, the word vectors must be known a priori, before their recovery, and to have been generated by randomly scaling uniformly sampled vectors from the unit sphere 1 . Further, the ith word in the corpus must have been selected with probability proportional to e u T w c i , where the "discourse" vector c i governs the topic of the corpus at the ith word. Finally, the discourse vector is assumed to evolve according to a random walk on the unit sphere that has a uniform stationary distribution.
By way of contrast, our results assume nothing a priori about the properties of the word vectors. In fact, the connection we establish between the Skip-Gram and the Sufficient Dimensionality Reduction model of (Globerson and Tishby, 2003) shows that the word vectors learned by Skip-Gram are information-theoretically optimal. Further, the context word c in the Skip-Gram model essentially serves the role that the discourse vector does in the PMI model of (Arora et al., 2016): the words neighboring c are selected with probability proportional to e u T c vw . We find the exact non-linear composition operator when no assumptions are made on the context word. When an analogous assumption to that of (Arora et al., 2016) is made, that the context words are uniformly distributed, we prove that the composition operator reduces to vector addition.
While our primary motivation has been to provide a better theoretical understanding of word compositionality in the popular Skip-Gram model, our connection with the SDR method illuminates a much more general point about the practical applicability of the Skip-Gram model. In particular, it addresses the question of whether, for a given corpus, fitting a Skip-Gram model will give good embeddings. Even if we are making reasonable linguistic assumptions about how to model words and the interdependencies of words in a corpus, it's not clear that these have to hold universally on all corpuses to which we apply Skip-Gram. However, the fact that when we fit a Skip-Gram model we are fitting an SDR model (up to frequency information), and the fact that SDR models are information-theoretically optimal in a certain sense, argues that regardless of whether the Skip-Gram assumptions hold, Skip-Gram always gives us optimal features in the following sense: the learned context embeddings and target embeddings preserve the maximal amount of mutual information between any pair of random variables X and Y consistent with the observed co-occurence matrix, where Y is the target word and X is the predictor word (in a min-max sense, since there are many ways of coupling X and Y , each of which may have different amounts of mutual information). Importantly, this statement requires no assumptions on the distribution P (X, Y ).

Compositionality of Skip-Gram
In this section, we first give a mathematical formulation of the intuitive notion of compositionality of words. We then prove that the composition operator for the Skip-Gram model in full generality is a non-linear function of the vectors of the words being composed. Under a single simplifying assumption, the operator linearizes and reduces to the addition of the word vectors. Finally, we explain how linear compositionality allows for solving word analogies with vector algebra.
A natural way of capturing the compositionality of words is to say that the set of context words c 1 , . . . , c m has the same meaning as the single word c if for every other word w, p(w|c 1 , . . . , c m ) = p(w|c) .
Although this is an intuitively satisfying definition, we never expect it to hold exactly; instead, we replace exact equality with the minimization of KL-divergence. That is, we state that the best candidate for having the same meaning as the set of context words C is the word arg min c∈V D KL (p(·|C) | p(·|c)) . (4) We refer to any vector that minimizes (4) as a paraphrase of the set of words C.
There are two natural concerns with (4). The first is that, in general, it is not clear how to define p(·|C). The second is that KL-divergence minimization is a hard problem, as it involves optimization over many high dimensional probability distributions. Our main result shows that both of these problems go away for any language model that satisfies the following two assumptions: A1. For every word c, there exists Z c such that for every word w, A2. For every set of words C = {c 1 , c 2 , . . . , c m }, there exists Z C such that for every word w, Clearly, the Skip-Gram model satisfies A1 by definition. We prove that it also satisfies A2 when m ≤ ∆ (Lemma 1). Next, we state a theorem that holds for any model satisfying assumptions A1 and A2, including the Skip-Gram model when m ≤ ∆. Theorem 1. In every word model that satisfies A1 and A2, for every set of words C = {c 1 , . . . , c m }, any paraphase c of C satisfies Theorem 1 characterizes the composition operator for any language model which satisfies our two assumptions; in general, this operator is not addition. Instead, a paraphrase c is a vector such that the average word vector under p(·|c) matches that under p(·|C). When the expectations in (7) can be computed, the composition operator can be implemented by solving a non-linear system of equations to find a vector u for which the left-hand side of (7) equals the right-hand side.
Our next result proves that although the composition operator is nontrivial in the general case, to recover vector addition as the composition operator, it suffices to assume that the word frequency is uniform.
As word frequencies are typically much closer to a Zipf distribution (Piantadosi, 2014), the uniformity assumption of Theorem 2 is not realistic. That said, we feel it is important to point out that, as reported in (Mikolov et al., 2013b), additivity captures compositionality more accurately when the training set is manipulated so that the prior distribution of the words is made closer to uniform.
Using composition to solve analogies. It has been observed that word vectors trained using nonlinear models like Skip-Gram tend to encode semantic relationships between words as linear relationships between the word vectors (Mikolov et al., 2013b;Pennington et al., 2014;Levy and Goldberg, 2014). In particular, analogies of the form "man:woman::king:?" can often be solved by taking ? to be the word in the vocabulary whose context vector has the smallest angle with u woman + (u king − u man ). Theorems 1 and 2 offer insight into the solution such analogy questions.
We first consider solving an analogy of the form "m:w::k:?"" in the case where the composition operator is nonlinear. The fact that m and w share a relationship means m is a paraphrase of the set of words {w, R}, where R is a set of words encoding the relationship between m and w. Similarly, the fact that k and ? share the same relationship means k is a paraphrase of the set of words {?, R}. By Theorem 1, we have that R and ? must satisfy We see that solving analogies when the composition operator is nonlinear requires the solution of two highly nonlinear systems of equations. In sharp contrast, when the composition operator is linear, the solution of analogies delightfully reduces to elementary vector algebra. To see this, we again begin with the assertion that the fact that m and w share a relationship means m is a paraphrase of the set of words {w, R}; Similarly, k is a paraphrase of {?, R}. By Theorem 2, u m = u w + u r and which gives the expected relationship Note that because this expression for u ? is in terms of k, w, and m, there is actually no need to assume that R is a set of actual words in V .

Proofs
Proof of Theorem 1. Note that p(w|C) equals as a function of c is equivalent to maximizing the negative cross-entropy as a function of u c , i.e., as maximizing Since Q is concave, the maximizers occur where its gradient vanishes. As ∇ uc Q equals we see that (7) follows.
Proof of Theorem 2. Recall that u C = m i=1 u i . When p(w) = 1/|V | for all w ∈ V , the negative cross-entropy simplifies to and its gradient ∇ uc Q to Thus, ∇Q(u C ) = 0 and since Q is concave, u C is its unique maximizer.
Proof of Lemma 1. First, assume that m = ∆. In the Skip-Gram model target words are conditionally independent given a context word, i.e., Applying Baye's rule, where Z C = 1/ ( m i=1 p(c i )). This establishes the result when m = ∆. The cases m < ∆ follow by marginalizing out ∆ − m context words in the equality (8).

Projection of paraphrases onto the vocabulary
Theorem 2 states that if there is a word c in the vocabulary V whose context vector equals the sum of the context vectors of the words c 1 , . . . , c m , then c has the same "meaning", in the sense of (4), as the composition of the words c 1 , . . . , c m . For any given set of words C = {c 1 , . . . , c m }, it is unlikely that there exists a word c ∈ V whose context vector is exactly equal to the sum of the context vectors of the words c 1 , . . . , c m . Similarly, in Theorem 1, the solution(s) to (7) will most likely not equal the context vector of any word in V . In both cases, we thus need to project the vector(s) onto words in our vocabulary in some manner.
Since Theorem 1 holds for any prior over V , in theory, we could enumerate all words in V and find the word(s) that minimize the difference of the left hand side of (7) from the right hand side. In practice, it turns out that the angle between the context vector of a word w ∈ V and solutionvector(s) is a good proxy and one gets very good experimental results by selecting as the paraphrase of a collection of words, the word that minimizes the angle to the paraphrase vector.
Minimizing the angle has been empirically successful at capturing composition in multiple loglinear word models. One way to understand the success of this approach is to recall that each word c is characterized by a categorical distribution over all other words w, as stated in (1). The peaks of this categorical distribution are precisely the words with which c co-occurs most often. These words characterize c more than all the other words in the vocabulary, so it is reasonable to expect that a word c whose categorical distribution has similar peaks as the categorical distribution of c is similar in meaning to c. Note that the location of the peaks of p(·|c) are immune to the scaling of u c (athough the values of p(·|c) may change); thus, the words w which best characterize c are those for which v w has a high inner product with u c / u c 2 . Since it is clear that if the angle between the context representations of c and c is small, the distributions p(w|c) and p(w|c ) will tend to have similar peaks.

Skip-Gram learns a Sufficient Dimensionality Reduction Model
The Skip-Gram model assumes that the distribution of the neighbors of a word follows a specific exponential parametrization of a categorical distribution. There is empirical evidence that this model generates features that are useful for NLP tasks, but there is no a priori guarantee that the training corpus was generated in this manner. In this section, we provide theoretical support for the usefulness of the features learned even when the Skip-Gram model is misspecified.
To do so, we draw a connection between Skip-Gram and the Sufficient Dimensionality Reduction (SDR) factorization of Globerson and Tishby (Globerson and Tishby, 2003). The SDR model learns optimal 2 embeddings for discrete random variables X and Y without assuming any parametric form on the distributions of X and Y , and it is useful in a variety of applications, including information retrieval, document classification, and association analysis (Globerson and Tishby, 2003). As it turns out, these embeddings, like Skip-Gram, are obtained by learning the parameters of an exponentially parameterized distribution. In Theorem 3 below, we show that if a Skip-Gram model is fit to the cooccurence statistics of X and Y , then the output can be trivially modified (by adding readily-available information on word frequencies) to obtain the parameters of an SDR model.
This connection is significant for two reasons: first, the original algorithm of (Globerson and Tishby, 2003) for learning SDR embeddings is expensive, as it involves information projections. Theorem 3 shows that if one can efficiently fit a Skip-Gram model, then one can efficiently fit an SDR model. This implies that Skip-Gram specific approximation heuristics like negativesampling, hierarchical softmax, and Glove, which are believed to return high-quality approximations to Skip-Gram parameters (Mikolov et al., 2013b;Pennington et al., 2014), can be used to efficiently approximate SDR model parameters. Second, (Globerson and Tishby, 2003) argues for the optimality of the SDR embedding in any domain where the training information on X and Y consists of their coocurrence statistics; this optimality and the Skip-Gram/SDR connection argues for the use of Skip-Gram approximations in such domains, and supports the positive experimental results that have been observed in applications in network science (Grover and Leskovec, 2016), proteinomics (Asgari and Mofrad, 2015), and other fields.
As stated above, the SDR factorization solves the problem of finding information-theoretically optimal features, given co-occurrence statistics for a pair of discrete random variables X and Y . Associate a vector w i to the ith state of X, a vector h j to the jth state of Y , and let W = [w T 1 · · · w T |X| ] T and H be defined similarly. Globerson and Tishby show that such optimal features can be obtained from a low-rank factoriza-tion of the matrix G of co-occurence measurements: G ij counts the number of times state i of X has been observed to co-occur with state j of Y. The loss of this factorization is measured using the KL-divergence, and so the optimal features are obtained from solving the problem arg min Here, Z G = ij G ij normalizes G into an estimate of the joint pmf of X and Y , and similarly Z W,H is the constant that normalizes e WH T into a joint pmf. The expression e WH T denotes entrywise exponentiation of WH T . Now we revisit the Skip-Gram training objective, and show that it differs from the SDR objective only slightly. Whereas the SDR objective measures the distance between the pmfs given by (normalized versions of) G and e WH T , the Skip-Gram objective measures the distance between the pmfs given by (normalized versions of) the rows of G and e WH T . That is, SDR emphasizes fitting the entire pmfs, while Skip-Gram emphasizes fitting conditional distributions.
Before presenting our main result, we state and prove the following lemma, which is of independent interest and is used in the proof of our main theorem. Recall that Skip-Gram represents each word c as a multinomial distribution over all other words w, and it learns the parameters for these distributions by a maximum likelihood estimation. It is known that learning model parameters by maximum likelihood estimation is equivalent to minimizing the KL-divergence of the learned model from the empirical distribution; the following lemma establishes the KL-divergence that Skip-Gram minimizes.
Lemma 2. Let G be the word co-occurrence matrix constructed from the corpus on which a Skip-Gram model is trained, in which case G cw is the number of times word w occurs as a neighboring word of c in the corpus. For each word c, let g c denote the empirical frequency of the word in the corpus, so that Given a positive vector x, letx = x/ x 1 . Then, the Skip-Gram model parameters U = u 1 · · · u |V | T and V = v 1 · · · u |V | T