Towards Understanding Linear Word Analogies

A surprising property of word vectors is that vector algebra can often be used to solve word analogies. However, it is unclear why, and when, linear operators can compose embeddings generated by non-linear models such as skip-gram with negative sampling (SGNS). We provide a rigorous explanation of this phenomenon without making the strong assumptions that past theories have made about the vector space and word distribution. Our theory has several implications. Past work has conjectured that linear structures exist in vector spaces because relations can be represented as ratios; we prove that this holds for SGNS. We provide novel justification for the addition of SGNS word vectors by showing that it automatically down-weights the more frequent word, as weighting schemes do ad hoc. Lastly, we offer an information-theoretic interpretation of Euclidean distance in vector spaces, justifying its use in capturing word dissimilarity.


Introduction
Distributed representations of words are a cornerstone of current methods in natural language processing. Word embeddings, also known as word vectors, can be generated by a variety of models, all of which share Firth's philosophy (1957) that the meaning of a word is defined by "the company it keeps". The simplest such models obtain word vectors by constructing a low-rank approximation of a matrix containing a co-occurrence statistic (Landauer and Dumais, 1997; Rohde et al., 2006). In contrast, neural network models (Bengio et al., 2003; Mikolov et al., 2013b) learn word embeddings by trying to predict words using the contexts they appear in, or vice-versa.
A surprising property of word vectors derived via neural networks is that word analogies can often be solved with vector algebra. For example, 'king is to ? as man is to woman' can be solved by finding the closest vector to king + woman − man, which should be queen. It is unclear why linear operators can effectively compose embeddings generated by non-linear models like skip-gram with negative sampling (SGNS). There have been two attempts to rigorously explain this phenomenon, but both have made strong assumptions about either the embedding space or the word distribution. The paraphrase model (Gittens et al., 2017) hinges on words having a uniform distribution rather than the typical Zipfian distribution, which the authors themselves acknowledge is unrealistic. The latent variable model (Arora et al., 2016) makes many a priori assumptions about the word vectors, such as the assumption that word vectors are generated by randomly scaling vectors sampled uniformly from the unit sphere.
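The nearest-neighbour procedure described above can be sketched as follows. This is a minimal illustration with hand-crafted toy vectors, not learned embeddings; the vocabulary and values are invented for the example.

```python
import math

# Toy 3-d vectors chosen by hand so that the "gender" offset is consistent.
# Real embeddings are learned from a corpus; these are illustrative only.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.2, 0.9, 0.1],
    "woman": [0.2, 0.2, 0.8],
    "apple": [0.8, 0.5, 0.5],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(a, b, x, vectors):
    """Solve (a, b)::(x, ?) by returning argmax_w cos(w, x + b - a),
    excluding the three query words themselves."""
    target = [bi - ai + xi for ai, bi, xi in zip(vectors[a], vectors[b], vectors[x])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, x)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(solve_analogy("man", "woman", "king", vectors))  # -> queen
```

Excluding the query words from the candidate set mirrors standard practice on analogy benchmarks; without it, the nearest neighbour of x + b − a is often x itself.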
In this paper, we explain why, and under what conditions, word analogies in GloVe and SGNS embedding spaces can be solved with vector algebra, without making the strong assumptions of past work. We begin by formalizing word analogies as functions that transform one word vector into another. When this transformation is simply the addition of a displacement vector, as is the case when using vector algebra, we call the analogy a linear analogy. Using the expression PMI(x, y) + log p(x, y), which we call the co-occurrence shifted pointwise mutual information (csPMI) of a word pair (x, y), we prove that in both SGNS and GloVe spaces without reconstruction error, a linear analogy holds over a set of ordered word pairs iff csPMI(x, y) is the same for every word pair and csPMI(x1, x2) = csPMI(y1, y2) for any two word pairs. By then framing vector addition as a kind of word analogy, we offer several new insights into the compositionality of words:
1. We prove the conjecture that Pennington et al. (2014) proposed as an intuitive explanation of why vector algebra works for analogy solving. The conjecture is that an analogy of the form a is to b as x is to y holds iff p(w|a)/p(w|b) ≈ p(w|x)/p(w|y) for every other word w in the vocabulary. While this is sensible, it was previously supported by neither theoretical derivation nor empirical evidence; we provide a rigorous proof that it is indeed true for SGNS.
2. Consider two words x, y and their sum z = x + y in an SGNS embedding space with no reconstruction error. If z were in the vocabulary, the similarity between z and x (as measured by the csPMI) would be the log probability of y shifted by a model-specific constant. This implies that the addition of two words automatically down-weights the more frequent word. Since many weighting schemes are based on the idea that more frequent words should be down-weighted ad hoc (Arora et al., 2017), the fact that this is done automatically provides novel justification for using addition to compose words.
3. Consider any two words x, y in an SGNS embedding space with no reconstruction error. The squared Euclidean distance between x and y has a perfect negative correlation with csPMI(x, y). In other words, the more similar two words are (as measured by csPMI) the smaller the distance between their vectors in the embedding space. Although this is intuitive, it is also the first rigorous explanation of why the Euclidean distance in embedding space is a good proxy for word dissimilarity.
Although our main theorem only concerns embedding spaces with no reconstruction error, we also explain why, in practice, linear word analogies hold in embedding spaces with some noise. We conduct experiments that support the few assumptions we make and show that the transformations represented by various word analogies correspond to different csPMI values. Without making the strong assumptions of past theories, we thus offer a rigorous explanation of why, and when, word analogies can be solved with vector algebra.
2 Related Work

PMI Pointwise mutual information (PMI) is a common measure of word similarity. For two words x, y, it captures how much more frequently they co-occur than by chance: PMI(x, y) = log[p(x, y)/(p(x)p(y))] (Church and Hanks, 1990).
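As a toy illustration of this definition, PMI can be estimated directly from co-occurrence counts; the counts below are invented for the example.

```python
import math
from collections import Counter

# Invented co-occurrence counts for a toy corpus: each entry is the number
# of times the two words were observed in the same context window.
pair_counts = Counter({("new", "york"): 8, ("new", "car"): 2,
                       ("old", "car"): 6, ("old", "york"): 1})
total = sum(pair_counts.values())

# Marginal counts: how often each word participates in any observed pair.
word_counts = Counter()
for (x, y), n in pair_counts.items():
    word_counts[x] += n
    word_counts[y] += n

def pmi(x, y):
    """PMI(x, y) = log[p(x, y) / (p(x) p(y))]."""
    p_xy = pair_counts[(x, y)] / total
    p_x, p_y = word_counts[x] / total, word_counts[y] / total
    return math.log(p_xy / (p_x * p_y))

print(pmi("new", "york"))  # positive: co-occur more often than chance
print(pmi("old", "york"))  # negative: co-occur less often than chance
```

In practice PMI is computed over a full corpus with smoothing for unseen pairs, since the log of a zero joint probability is undefined.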
Word Embeddings Word embeddings are distributed representations of words in a low-dimensional continuous space. Also called word vectors, they capture semantic and grammatical properties of words, even allowing relationships to be expressed algebraically (Mikolov et al., 2013b). Word vectors are generally obtained in two ways: (a) from neural networks that learn representations by predicting co-occurrence patterns in the training corpus (Bengio et al., 2003; Mikolov et al., 2013b; Collobert and Weston, 2008); (b) from low-rank approximations of word-context matrices containing a co-occurrence statistic (Landauer and Dumais, 1997; Levy and Goldberg, 2014).
SGNS The objective of skip-gram with negative sampling (SGNS) is to maximize the probability of observed word-context pairs and to minimize the probability of k randomly sampled negative examples. For an observed word-context pair (w, c), the local objective is

log σ(⟨w, c⟩) + k · E_{c_N ∼ P_n}[log σ(−⟨w, c_N⟩)]

where σ is the sigmoid function and c_N is a negative context, randomly sampled from a scaled distribution P_n. Words that appear in similar contexts will therefore have similar embeddings. Though no co-occurrence statistics are explicitly calculated, Levy and Goldberg (2014) proved that SGNS is in fact implicitly factorizing a word-context PMI matrix shifted by − log k.
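The shape of this objective can be sketched numerically. The vectors and the single negative sample below are invented for the illustration; a real implementation would sample c_N from P_n and update the vectors by gradient ascent.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_local_objective(w, c, negative_contexts):
    """log sigma(<w, c>) plus, for each negative context c_N,
    log sigma(-<w, c_N>): reward the observed pair, penalize the negatives."""
    obj = math.log(sigmoid(dot(w, c)))
    for c_n in negative_contexts:
        obj += math.log(sigmoid(-dot(w, c_n)))
    return obj

w = [1.0, 0.0]
good_context, bad_context = [1.0, 0.0], [0.0, 1.0]

# The objective is higher when w aligns with its observed context and is
# orthogonal to the negative sample than in the reversed situation.
aligned = sgns_local_objective(w, good_context, [bad_context])
misaligned = sgns_local_objective(w, bad_context, [good_context])
print(aligned > misaligned)  # True
```

Maximizing this quantity over all observed pairs is what pushes ⟨w, c⟩ toward PMI(w, c) − log k in the Levy and Goldberg (2014) analysis.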

Latent Variable Model
The latent variable model (Arora et al., 2016) was the first attempt to rigorously explain why word analogies can be solved algebraically. It is a generative model that assumes that word vectors are generated by the random walk of a "discourse" vector on the unit sphere. Gittens et al.'s (2017) criticism of this proof is that it assumes that word vectors are known a priori and are generated by randomly scaling vectors uniformly sampled from the unit sphere (or having properties consistent with this sampling procedure). The proof also relies on a conjecture by Pennington et al. (2014) that linear relations can be expressed as a ratio of probabilities.
Paraphrase Model The paraphrase model (Gittens et al., 2017) was the only other attempt to rigorously explain why word analogies can be solved algebraically. It proposes that any set of context words C = {c1, ..., cm} is semantically equivalent to a single word c if p(w|c1, ..., cm) = p(w|c). One problem with this is that the number of possible context sets far exceeds the vocabulary size, precluding a one-to-one mapping; the authors circumvent this problem by replacing exact equality with the minimization of KL divergence. Assuming that the words have a uniform distribution, the paraphrase of C can then be written as an unweighted sum of its word vectors. However, this uniformity assumption is unrealistic: word frequencies obey a Zipfian distribution, which is a power law (Piantadosi, 2014).
3 The Structure of Word Analogies

Formalizing Analogies
A word analogy is a statement of the form "a is to b as x is to y", which we will write as (a,b)::(x,y). It asserts that a and x can be transformed in the same way to get b and y respectively, and that b and y can be inversely transformed to get a and x.
A word analogy can hold over an arbitrary number of ordered pairs: e.g., "Berlin is to Germany as Paris is to France as Ottawa is to Canada ...". The elements in each pair are not necessarily in the same space -for example, the transformation for (king,roi)::(queen,reine) is English-to-French translation. For (king,queen)::(man,woman), the canonical analogy in the literature, the transformation corresponds to changing the gender. Therefore, to formalize the definition of an analogy, we will refer to it as a transformation.
Definition 1 An analogy f is an invertible transformation that holds over a set of ordered pairs S iff ∀ (x, y) ∈ S, f(x) = y and f⁻¹(y) = x.

The word embedding literature (Mikolov et al., 2013b; Pennington et al., 2014) has focused on a very specific type of transformation, the addition of a displacement vector. For example, for (king,queen)::(man,woman), the transformation would be king + (woman − man) = queen, where the displacement vector is expressed as the difference (woman − man). To distinguish these from the general class of analogies in Definition 1, we will refer to them as linear analogies.
Definition 2 A linear analogy f is an invertible transformation of the form x → x + r. f holds over a set of ordered pairs S iff ∀ (x, y) ∈ S, x + r = y.
Co-occurrence Shifted PMI Theorem Let W be an SGNS or GloVe word embedding space with no reconstruction error and S be a set of ordered word pairs such that ∀ (x, y) ∈ S, x, y ∈ W. A linear analogy f holds over S iff ∃ γ ∈ R, ∀ (x, y) ∈ S, PMI(x, y) + log p(x, y) = γ, and for any two pairs (x1, y1), (x2, y2) ∈ S, PMI(x1, x2) + log p(x1, x2) = PMI(y1, y2) + log p(y1, y2).

Throughout the rest of this paper, we will refer to PMI(x, y) + log p(x, y) as the co-occurrence shifted PMI (csPMI) of x and y. In sections 3.2 to 3.4, we prove the csPMI Theorem. In section 3.5, we explain why, in practice, perfect reconstruction is not needed to solve word analogies using vector algebra. In section 4, we explore what the csPMI Theorem implies about vector addition and Euclidean distance in SGNS embedding spaces.
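For concreteness, the csPMI of a pair can be computed directly from its joint and marginal probabilities; the probability values below are invented for the example.

```python
import math

def cspmi(p_xy, p_x, p_y):
    """csPMI(x, y) = PMI(x, y) + log p(x, y) = log[p(x, y)^2 / (p(x) p(y))]."""
    return math.log(p_xy / (p_x * p_y)) + math.log(p_xy)

# Since p(x, y) <= min(p(x), p(y)), we have p(x, y)^2 <= p(x) p(y),
# so the csPMI of any pair is non-positive.
print(cspmi(0.1, 0.2, 0.25))  # log(0.2) ~ -1.609
```

Note that unlike PMI, which rewards any above-chance co-occurrence, csPMI also rewards the pair being frequent in absolute terms, which is what ties it to distances in the embedding space later in the paper.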

Analogies as Parallelograms
Lemma 1 Where ⟨·,·⟩ denotes the inner product, a linear analogy f holds over a set of ordered pairs S iff ∃ γ ∈ R, ∀ (x, y) ∈ S, 2⟨x, y⟩ − ‖x‖²₂ − ‖y‖²₂ = γ, and for any two pairs (x1, y1), (x2, y2) ∈ S, ‖x2 − x1‖²₂ = ‖y2 − y1‖²₂.

When S is empty, Lemma 1 is vacuously true. For the remaining cases, let γ = 2⟨x1, y1⟩ − ‖x1‖²₂ − ‖y1‖²₂. When S = {(x1, y1)}, Lemma 1 holds. When |S| ≥ 2, consider the |S| − 1 subsets of the form {(x1, y1), (x2, y2)} ⊂ S. f holds over every subset {(x1, y1), (x2, y2)} iff it holds over S. We start by noting that by Definition 2, f holds over {(x1, y1), (x2, y2)} iff:

x1 + r = y1 and x2 + r = y2    (1)

By rearranging (1), we know that x2 − y2 = x1 − y1 and x2 − x1 = y2 − y1. Put another way, x1, y1, x2, y2 form a quadrilateral in vector space whose opposite sides are parallel and equal in length. By definition, this quadrilateral is then a parallelogram. In fact, this is often how word analogies are visualized in the literature (see Figure 1). To prove the first part of Lemma 1, we let γ = −‖r‖²₂. For every possible subset, r = (y1 − x1) = (y2 − x2). This implies that ∀ (x, y) ∈ S, 2⟨x, y⟩ − ‖x‖²₂ − ‖y‖²₂ = −‖y − x‖²₂ = −‖r‖²₂ = γ. A quadrilateral is a parallelogram iff each pair of opposite sides is equal in length, so the pair of opposite sides that do not correspond to r must also be equal in length: ‖x2 − x1‖²₂ = ‖y2 − y1‖²₂, as stated in Lemma 1. Note that the sides that do not equal r do not necessarily have a fixed length across different subsets of S.
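Both identities in Lemma 1 can be verified numerically on synthetic vectors that satisfy y = x + r by construction; the dimensionality and random seed are arbitrary.

```python
import random

random.seed(0)
dim = 5

# A shared displacement r and four word vectors; setting y = x + r makes
# the linear analogy hold over all four pairs by construction.
r = [random.gauss(0, 1) for _ in range(dim)]
xs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(4)]
ys = [[xi + ri for xi, ri in zip(x, r)] for x in xs]

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

# First identity: 2<x, y> - ||x||^2 - ||y||^2 equals -||r||^2 for every pair.
gammas = [2 * inner(x, y) - inner(x, x) - inner(y, y) for x, y in zip(xs, ys)]

# Second identity: the parallelogram sides not equal to r have the same
# length across pairs, i.e. ||x2 - x1||^2 = ||y2 - y1||^2.
side_x = sq_dist(xs[0], xs[1])
side_y = sq_dist(ys[0], ys[1])

print(gammas, side_x, side_y)
```

The first identity is just the algebraic expansion of −‖y − x‖²₂, so the check passes up to floating-point error regardless of the seed.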

Analogies in the Context Space
Definition 3 Let W be an SGNS or GloVe word embedding space and C its corresponding context space. Let k denote the number of negative samples, X_{x,y} the co-occurrence frequency, and b_x, b_y the learned biases for GloVe. If there is no reconstruction error, for any words x, y with x, y ∈ W and x_c, y_c ∈ C:

SGNS: ⟨x, y_c⟩ = PMI(x, y) − log k
GloVe: ⟨x, y_c⟩ = log X_{x,y} − b_x − b_y    (3)

SGNS and GloVe generate two vectors for each word in the vocabulary: a context vector, for when it is a context word, and a word vector, for when it is a target word. Context vectors are generally discarded after training. The SGNS identity in (3) is from Levy and Goldberg (2014), who showed that SGNS is implicitly factorizing the k-shifted word-context PMI matrix. The GloVe identity is simply the local objective for a word pair (Pennington et al., 2014). Since the matrix being factorized in both models is symmetric, ⟨x, y_c⟩ = ⟨x_c, y⟩.
Lemma 2 A linear analogy f : x → x + r holds over a set of ordered pairs S in an SGNS or GloVe word embedding space W with no reconstruction error iff ∃ λ ∈ R, g : x c → x c + λ r holds over S in the corresponding context space C.
In other words, an analogy f that holds over S in the word space has a corresponding analogy g that holds over S in the context space, whose displacement vector is simply the displacement vector of f scaled by some λ ∈ R. To prove this, we begin with (1) and any word w in the vocabulary:

x2 − x1 = y2 − y1
⟨w_c, x2⟩ − ⟨w_c, x1⟩ = ⟨w_c, y2⟩ − ⟨w_c, y1⟩
⟨w, x2_c⟩ − ⟨w, x1_c⟩ = ⟨w, y2_c⟩ − ⟨w, y1_c⟩
⟨w, (x2_c − x1_c) − (y2_c − y1_c)⟩ = 0

Note that we can rewrite the second equation as the third because the matrices being factorized in (3) are symmetric and there is no reconstruction error. Because the last equality holds for every word w, and not all word vectors lie in the same plane in W, it implies that (x2_c − x1_c) − (y2_c − y1_c) = 0. Thus a linear analogy with displacement vector (y1 − x1) holds over S in the word embedding space iff an analogy with displacement vector (y1_c − x1_c) holds over S in the context space. This is supported by empirical findings that word and context spaces perform equally well on word analogy tasks (Pennington et al., 2014). Since x1_c, y1_c, x2_c, y2_c form an analogous parallelogram structure in the context space, there is some linear map w → w_c for each word w ∈ S. The real matrix A describing this linear map is symmetric: ⟨x, y_c⟩ = xᵀA y = (Aᵀx)ᵀy = ⟨x_c, y⟩ for any (x, y) ∈ S. This implies that C = AW, since ⟨w, x_c⟩ = ⟨w_c, x⟩ for any word w.
Since A is a real symmetric matrix, by the finite-dimensional spectral theorem, there is an orthonormal basis of W consisting of eigenvectors of A. If A had distinct eigenvalues, opposite sides of the parallelogram formed by x1, y1, x2, y2 in the word space could be stretched by different factors. This would imply that the quadrilateral formed by x1, y1, x2, y2 in the context space is not a parallelogram, which is a contradiction. Therefore A can only have non-distinct eigenvalues. Because A's eigenvectors are a basis for W and all have the same eigenvalue λ, all word vectors lie in the same eigenspace (i.e., C = λW). Experiments done by Mimno and Thompson (2017) provide empirical support for this result.

Proof of the csPMI Theorem
From Lemma 1, we know that if a linear analogy f holds over a set of ordered pairs S, then ∃ γ ∈ R, ∀ (x, y) ∈ S, 2⟨x, y⟩ − ‖x‖²₂ − ‖y‖²₂ = γ. Because there is no reconstruction error, by Lemma 2 we can rewrite the inner product of two word vectors in terms of the inner product of a word and context vector. Then we can simplify using the SGNS identity in (3):

λγ = 2⟨x, y_c⟩ − ⟨x, x_c⟩ − ⟨y, y_c⟩
= 2 PMI(x, y) − PMI(x, x) − PMI(y, y)
= 2 log p(x, y) − log p(x, x) − log p(y, y)
= PMI(x, y) + log p(x, y)    (5)

The − log k shifts cancel in the second step, and the last step uses p(w, w) = p(w) p(w|w) = p(w), since a word always co-occurs with itself. We get the same result by expanding the GloVe identity in (3), regardless of what the learned biases b_x, b_y are. The second identity in Lemma 1 can be expanded in the same way, implying that a linear analogy f holds over a set of ordered pairs S iff (5) holds for every pair (x, y) ∈ S and PMI(x1, x2) + log p(x1, x2) = PMI(y1, y2) + log p(y1, y2) for any two pairs (x1, y1), (x2, y2) ∈ S.

Robustness to Noise
The csPMI Theorem assumes no reconstruction error, so it does not, by itself, explain why linear word analogies hold in practice, in embedding spaces that have some reconstruction error. There are two reasons why they do: the looser definition of vector equality in practice and the lower variance in reconstruction error associated with more frequent word pairs. For one, in practice, a word analogy task (a,?)::(x,y) is solved by finding the vector most similar to a + (y − x), where dissimilarity is defined in terms of Euclidean or cosine distance. The correct solution to a word analogy can therefore be found even when that solution is not exact. The second reason is that the variance of the noise ε_{x,y} for a word pair (x, y) (i.e., ⟨x, y_c⟩ − (PMI(x, y) − log k)) is a strictly decreasing function of the frequency X_{x,y}: more frequent word pairs are associated with less reconstruction error in both SGNS and GloVe. This is because the cost of deviating from the optimal value is higher for more frequent word pairs; this is implicit in the SGNS objective (Levy and Goldberg, 2014) and explicit in the GloVe objective (Pennington et al., 2014). We also show empirically that this is true in section 5. Assuming ε_{x,y} ∼ N(0, h(X_{x,y})) for some strictly decreasing function h, and letting δ denote the Dirac delta distribution:

lim_{X_{x,y} → ∞} N(0, h(X_{x,y})) = δ

As the frequency of a word pair increases, the probability that the noise is negligible increases; in the limit of infinite frequency, the noise is sampled from the Dirac delta distribution and is therefore 0. Even without the assumption of zero reconstruction error, an analogy that satisfies the identity in the csPMI Theorem will hold over a set of ordered pairs in practice as long as the frequency of each pair is sufficiently large.
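This noise model is easy to illustrate with a simulation. The functional form h(X) = 1/X below is an illustrative assumption, not the variance function implied by either model; the point is only that any strictly decreasing h makes noise vanish with frequency.

```python
import random
import statistics

random.seed(0)

def h(frequency):
    """Assumed variance of the reconstruction noise for a pair with
    co-occurrence frequency X: strictly decreasing (illustrative choice)."""
    return 1.0 / frequency

def sample_noise(frequency, n=20000):
    sd = h(frequency) ** 0.5
    return [random.gauss(0.0, sd) for _ in range(n)]

# Frequent pairs are reconstructed with far less noise than rare ones; as
# frequency grows without bound, the noise distribution collapses to a
# point mass at zero (the Dirac delta).
rare, frequent = sample_noise(10), sample_noise(1000)
print(statistics.pvariance(rare), statistics.pvariance(frequent))
```

The empirical variances track h(10) = 0.1 and h(1000) = 0.001, so the ordering in the assertion below holds with overwhelming probability at this sample size.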
A possible benefit of h mapping lower frequencies to larger variances is that it reduces the probability that a linear analogy f will hold over rare word pairs. One way of interpreting this is that h essentially filters out the word pairs for which there is insufficient evidence, even if the identities in the csPMI Theorem are satisfied. This would explain why reducing the dimensionality of word vectors -up to a point -actually improves performance on word analogy tasks (Yin and Shen, 2018). Representations with the optimal dimensionality have enough noise to preclude spurious analogies that satisfy the csPMI Theorem, but not so much noise that non-spurious analogies (e.g., (king,queen)::(man,woman)) are also precluded.

Formalizing Addition
Corollary 1 Let z = x + y be the sum of words x, y in an SGNS word embedding space W with no reconstruction error. If z were a word in the vocabulary, then PMI(x, z) + log p(x, z) = log p(y) + δ, where δ is a model-specific constant.
To frame the addition of two words x, y as an analogy, we need to define a set of ordered pairs S such that a linear analogy holds over S iff x + y = z. To this end, consider the set {(x, z), (∅, y)}, where z is a placeholder for the composition of x and y and the null word ∅ maps to 0 for a given embedding space. From Definition 2, a linear analogy holds over {(x, z), (∅, y)} iff

x + r = z and 0 + r = y

i.e., iff r = y and thus z = x + y. Even though ∅ is not in the vocabulary, we can map it to 0 because its presence does not affect any other word vector. To understand why, consider the k-shifted word-context PMI matrix M that does not have ∅, and the matrix M′ that does, of which M is a submatrix. Where W and C are the word and context matrices factorizing M as CᵀW, M′ can be factorized by simply appending the zero vector to both W and C, since every entry of M′ involving ∅ corresponds to an inner product with 0. Even if the null word does not exist for a given corpus, the embeddings we would get by training on a corpus that did have the null word would otherwise be identical.
An inner product with the zero vector is always 0, so we can infer from the SGNS identity in (3) that PMI(∅, w) − log k = 0 for every word w in the vocabulary. From the csPMI Theorem, we know that if a linear analogy holds over {(x, z), (∅, y)}, then:

csPMI(x, z) = csPMI(∅, y)
= PMI(∅, y) + log p(∅, y)
= log p(y) + δ    (8)

where δ is a model-specific constant, since PMI(∅, y) = log k implies log p(∅, y) = log k + log p(∅) + log p(y). Thus the csPMI of the sum and one word is equal to the log probability of the other word shifted by a model-specific constant. In embedding spaces with some reconstruction error, there are also two noise terms ε_{x,z}, ε_{∅,y} to consider. However, if we assume, as in section 3.5, that the noise has a zero-centered Gaussian distribution, then E[PMI(x, z) + log p(x, z)] = E[log p(y) + δ]. Even without the assumption of zero reconstruction error, on average, the csPMI of the sum and one word is equal to the log probability of the other word shifted by a constant. We cannot repeat this derivation with GloVe because it is unclear what the optimal values of the biases would be, even with perfect reconstruction.

Automatically Weighting Words
Corollary 2 In an SGNS word embedding space, on average, the sum of two words has more in common with the rarer word, where commonality is measured by the csPMI.
For two words x, y, assume without loss of generality that p(x) > p(y). By (8):

csPMI(z, y) − csPMI(z, x) = log p(x) − log p(y) > 0    (9)

Therefore addition automatically down-weights the more frequent word. For example, if the vectors for x = 'the' and y = 'apple' were added to create a vector for z = 'the apple', we would expect csPMI('the apple', 'apple') > csPMI('the apple', 'the'); being a stopword, 'the' would on average be heavily down-weighted. Even with reconstruction error, if we assume that the noise follows a zero-centered Gaussian distribution, (9) holds true on average. While the rarer word is not always the more informative one, weighting schemes like inverse document frequency (IDF) (Robertson, 2004) and unsupervised smoothed inverse frequency (uSIF) (Ethayarajh, 2018) are all based on the principle that more frequent words should be down-weighted because they are typically less informative. The fact that addition automatically down-weights the more frequent word thus provides novel justification for using addition to compose words.
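The down-weighting is a one-line consequence of (8), as the following arithmetic sketch shows; the unigram probabilities and the constant delta are invented for the illustration.

```python
import math

delta = -2.0                     # model-specific constant from (8); arbitrary here
p_the, p_apple = 0.05, 0.0001    # invented unigram probabilities, p('the') > p('apple')

# By Corollary 1, csPMI(z, w) = log p(other word) + delta for z = the + apple.
cspmi_z_apple = math.log(p_the) + delta     # commonality of z with the rarer word
cspmi_z_the = math.log(p_apple) + delta     # commonality of z with the frequent word

# The gap depends only on the frequency ratio, so the more frequent word
# is always the one that is down-weighted, whatever delta is.
print(cspmi_z_apple - cspmi_z_the)  # log(p_the / p_apple) > 0
```

Because delta cancels in the difference, the conclusion of Corollary 2 does not depend on the value of the model-specific constant.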

Interpreting Euclidean Distance
Corollary 3 In an SGNS word embedding space with no reconstruction error, ∃ λ ∈ R+ and a model-specific constant δ such that for any two words x, y, λ‖x − y‖²₂ = −csPMI(x, y) + δ.

We derive this corollary by framing the difference between two words x, y as a word analogy. Where z is a placeholder for x − y and ∅ is the null word defined in section 4.1, a linear analogy holds over the set {(x, y), (z, ∅)} iff x − y = z. Using the SGNS identity in (3), Lemma 2, and the result from (8):

λ‖x − y‖²₂ = −λ(2⟨x, y⟩ − ‖x‖²₂ − ‖y‖²₂)
= −(2 PMI(x, y) − PMI(x, x) − PMI(y, y))
= −csPMI(x, y) + δ

Thus in an SGNS embedding space with no reconstruction error, the squared Euclidean distance between two word vectors is simply a linear function of the negative csPMI. Since csPMI(x, y) ∈ (−∞, 0] and ‖x − y‖²₂ is non-negative, λ must be positive. This identity is intuitive: the more similar two words are (as measured by csPMI), the smaller the distance between their vectors. In section 5, we provide empirical evidence of this.

Are Relations Ratios?
Pennington et al. (2014) conjectured that linear relationships in the embedding space, which we call displacements, correspond to ratios of the form p(w|x)/p(w|y), where (x, y) is a pair of words such that y − x is the displacement and w is any other word in the vocabulary. This claim has since been repeated in other work (Arora et al., 2016). For example, according to this conjecture, the analogy (king,queen)::(man,woman) holds iff for every word w in the vocabulary

p(w|king)/p(w|queen) ≈ p(w|man)/p(w|woman)

However, as noted earlier, this idea was neither derived from empirical results nor rigorous theory, and there has been no work to suggest that it would hold for models other than GloVe, which was designed around it. We now prove this conjecture for SGNS using the csPMI Theorem.
Pennington et al. Conjecture Let S be a set of ordered pairs (x, y) with vectors in an SGNS word embedding space with zero reconstruction error. A linear analogy holds over S iff ∀ (x 1 , y 1 ), (x 2 , y 2 ) ∈ S, p(w|x 1 )/p(w|y 1 ) = p(w|x 2 )/p(w|y 2 ) for every word w in the vocabulary.
As with the corollaries, we prove this by reframing it as an analogy. A linear analogy holds over S iff for any word w in the vocabulary, a linear analogy holds over S_w = {(w, r_{x,y}) | (x, y) ∈ S}, where r_{x,y} is the relation defined by the word pair (x, y). In S, x is transformed into y in each word pair; in S_w, w is transformed into the null word ∅ and then into the relation r_{x,y}, which can be composed into a single linear transformation. From Lemma 1, we know that a linear analogy holds over S_w iff for any (x1, y1), (x2, y2) ∈ S:

⟨w, y1 − x1⟩ = ⟨w, y2 − x2⟩

Using Lemma 2 and the SGNS identity in (3), we can write this in terms of conditional probabilities:

(1/λ)[PMI(w, y1) − PMI(w, x1)] = (1/λ)[PMI(w, y2) − PMI(w, x2)]
log [p(w|y1)/p(w|x1)] = log [p(w|y2)/p(w|x2)]

We do not need to consider the other identity in Lemma 1, since 2⟨w, w⟩ − ‖w‖²₂ − ‖w‖²₂ = 0 for every w. Thus an analogy holds over S_w for any w iff p(w|x1)/p(w|y1) = p(w|x2)/p(w|y2) for any (x1, y1), (x2, y2) ∈ S. Since a linear analogy holds over S_w iff it holds over S, the Pennington et al. Conjecture is true.

Experiments
Measuring Noise We uniformly sample word pairs in Wikipedia and estimate the noise (i.e., ⟨x, y_c⟩ − [PMI(x, y) − log k]) using SGNS vectors trained on the same corpus. As seen in Figure 2, the noise has an approximately zero-centered Gaussian distribution and the variance of the noise is lower at higher frequencies, supporting our assumptions in section 3.5. As previously mentioned, this is one reason why linear word analogies are robust to noise: the amount of noise is simply negligible at high frequencies.

Estimating csPMI According to the csPMI Theorem, if an analogy holds over a set of word pairs, then each pair (x, y) has the same csPMI value. In Table 1, we provide the mean csPMI values for various analogies in Mikolov et al. (2013a) over the set of word pairs for which they should hold (e.g., (Paris, France), (Berlin, Germany), and others for capital-world). We also provide the accuracy of the vector algebraic solutions for each analogy, found by minimizing cosine distance over a restricted vocabulary, namely all the words in the analogy task. As expected, when solutions to word analogies are more accurate, the analogies have lower csPMI variances. This is because an analogy is more likely to hold over a set of word pairs when the displacement vectors are the same, and thus when the csPMI values are the same. Similar analogies (e.g., capital-world and capital-common-countries) also have similar mean csPMI values; our theory implies this, since similar analogies have similar displacement vectors. As the csPMI increases, the type of analogy gradually changes from geography (capital-world, city-in-state) to verb tense (gram5-present-participle, gram7-past-tense) to adjectives (gram3-comparative, gram4-superlative). We do not witness the same gradation with the mean PMI, implying that the transformation represented by an analogy corresponds to the csPMI but not the PMI.
Euclidean Distance Because the sum of two word vectors is not in the vocabulary, we cannot calculate co-occurrence statistics involving the sum, precluding us from testing Corollaries 1 and 2. We test Corollary 3 by uniformly sampling word pairs and plotting, in Figure 3, the negative csPMI against the squared Euclidean distance between the SGNS word vectors. As we would expect, there is a moderately strong and positive correlation (Pearson's r = 0.437): the more similar two words are (as measured by csPMI) the smaller the Euclidean distance between their vectors.
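The correlation reported above can be reproduced in miniature: generate pairs that satisfy Corollary 3 up to Gaussian noise and compute Pearson's r between the negative csPMI and the squared distance. The values of λ, δ, and the noise level below are invented; only the positive correlation is the point.

```python
import math
import random

random.seed(0)

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

lam, delta = 0.5, 0.0   # invented model constants from Corollary 3

# Corollary 3 predicts lam * ||x - y||^2 = -csPMI + delta; the Gaussian
# term mimics imperfect reconstruction, which weakens but does not
# destroy the correlation.
neg_cspmi = [random.uniform(0.0, 10.0) for _ in range(500)]
sq_dist = [(v + delta) / lam + random.gauss(0, 2.0) for v in neg_cspmi]

print(pearson_r(neg_cspmi, sq_dist))  # strongly positive
```

With more noise the coefficient would shrink toward the moderate value observed on real Wikipedia-trained vectors.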
Unsolvability The csPMI Theorem reveals two reasons why a linear analogy may be unsolvable in a given space: polysemy and corpus bias. Consider the senses {x1, ..., xM} of a polysemous word x. Assuming perfect reconstruction, a linear analogy f whose displacement has csPMI γ does not hold over (x, y) if γ ≠ PMI(x, y) + log p(x, y) = log([p(x1|y) + ... + p(xM|y)] · p(y|x)). While only one sense may be relevant to the analogy, the theorem applies over all the senses. Even if (a,b)::(x,y) makes intuitive sense, there is also no guarantee that csPMI(a, b) ≈ csPMI(x, y) for a given corpus. The less frequent a word pair is, the more pronounced this issue: even small changes in frequency can have a large impact on the csPMI. This is why the accuracy for the currency analogy is so low (see Table 1): currencies and their countries co-occur in Wikipedia with a median frequency of only 19.

Conclusion
In this paper, we rigorously explained why word analogies can be solved using vector algebra. Specifically, we proved that an analogy holds in an SGNS or GloVe embedding space with no reconstruction error iff the co-occurrence shifted PMI is the same for every word pair and the same across any two word pairs. This has three implications. First, we provided a rigorous proof of the Pennington et al. (2014) conjecture, the intuitive explanation of this phenomenon. Second, we provided novel justification for the addition of word vectors by showing that it automatically down-weights the more frequent word, as weighting schemes do ad hoc. Third, we provided the first rigorous explanation of why the Euclidean distance between word vectors is a good proxy for word dissimilarity. Most importantly, our theory does not make the unrealistic assumptions of past theories, making it a much more tenable explanation.