The (Too Many) Problems of Analogical Reasoning with Word Vectors

This paper explores the possibilities of analogical reasoning with vector space models. Given two pairs of words with the same relation (e.g. man:woman :: king:queen), it was proposed that the offset between one pair of the corresponding word vectors can be used to identify the unknown member of the other pair (king - man + woman = queen). We argue against such “linguistic regularities” as a model for linguistic relations in vector space models and as a benchmark, and we show that the vector offset (as well as two other, better-performing methods) suffers from dependence on vector similarity.


Introduction
This paper considers the phenomenon of "vector-oriented reasoning" via linear vector offset in vector space models (VSMs) (Mikolov et al., 2013c,a). Given two pairs of words with the same linguistic relation (man:woman :: king:queen), it has been proposed that the offset between one pair of word vectors can be used to identify the unknown member of a different pair of words by solving proportional analogy problems (king − man + woman = ? queen), as shown in Fig. 1. We will refer to this method as 3CosAdd.
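As a minimal sketch of 3CosAdd, consider the following toy implementation with hand-crafted 4-dimensional vectors (made up for illustration, not real embeddings):

```python
import numpy as np

# Toy vocabulary with hand-crafted 4-d vectors (illustrative only,
# not real embeddings).
vocab = ["king", "queen", "man", "woman", "apple"]
V = np.array([
    [0.9, 0.8, 0.1, 0.0],   # king
    [0.9, 0.1, 0.8, 0.0],   # queen
    [0.1, 0.9, 0.1, 0.1],   # man
    [0.1, 0.1, 0.9, 0.1],   # woman
    [0.0, 0.1, 0.1, 0.9],   # apple
])
V = V / np.linalg.norm(V, axis=1, keepdims=True)  # unit-normalize rows

def cos_add(a, a2, b, exclude_sources=True):
    """3CosAdd: argmax over b' of cos(b', a' - a + b).

    The standard formulation excludes the source words a, a', b from
    the answer pool; exclude_sources=False gives the "honest" variant
    discussed later in the paper."""
    idx = {w: i for i, w in enumerate(vocab)}
    target = V[idx[a2]] - V[idx[a]] + V[idx[b]]
    sims = V @ (target / np.linalg.norm(target))   # cosine to every word
    if exclude_sources:
        for w in (a, a2, b):
            sims[idx[w]] = -np.inf
    return vocab[int(np.argmax(sims))]

print(cos_add("man", "woman", "king"))  # the classic example
```

On these toy vectors the method recovers "queen"; with real embeddings, as the rest of the paper shows, its success depends heavily on where the target vector sits.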
This approach attracted a lot of attention, both as the "poster child" of word embeddings, and for its potential practical utility. Given the vital role that analogical reasoning plays in human cognition for discovering new knowledge and understanding new concepts, automated analogical reasoning could become a game-changer in many fields, providing a universal mechanism for detecting linguistic relations (Turney, 2008) and word sense disambiguation (Federici et al., 1997). It is already used in many downstream NLP tasks, such as splitting compounds (Daiber et al., 2015), semantic search (Cohen et al., 2015), and cross-language relational search (Duc et al., 2012), to name a few.

The idea that linguistic relations are mirrored in neat geometrical relations (as shown in Fig. 1) is also intuitively appealing, and 3CosAdd has become a popular benchmark. Roughly, the current VSMs score between 40% (Lai et al., 2016) and 75% (Pennington et al., 2014) on the Google test set (Mikolov et al., 2013a). However, in fact performance varies widely for different types of relations (Levy and Goldberg, 2014; Köper et al., 2015; Gladkova et al., 2016).
One way to explain the current limitations is to attribute them to the imperfections of the current models and/or corpora with which they are built: with this view, in a perfect VSM, any linguistic relation should be recoverable via vector offset.
The alternative to be explored in this paper is that perhaps natural language semantics is more complex than suggested by Fig. 1, and there may be both theoretical and mathematical issues with analogical reasoning with word vectors and its 3CosAdd implementation.
We present a series of experiments with two popular VSMs (GloVe and Word2Vec) to show that the accuracy of 3CosAdd depends on the proximity of the target vector to its source (i.e. queen should be quite similar to king). Since not all linguistic relations can be expected to result in high word vector proximity, the method is limited to those that happen to be so in a given VSM. Furthermore, its accuracy also varies because the "linguistic regularities" are actually not so regular, and should not be expected to be so. We also compare 3CosAdd to two alternative methods to investigate whether better algorithms can overcome this limitation.
2 Background: "Relational Similarity" vs "Word Analogies"

The most fundamental term for what 3CosAdd is supposed to capture is actually not analogy, but rather relational similarity, i.e. the idea that pairs of words may hold similar relations to those between other pairs of words. For example, the relation between cat and feline is similar to the relation between dog and canine. Notably, this is similarity rather than identity: "instances of a single relation may still have significant variability in how characteristic they are of that class" (Jurgens et al., 2012). Analogy as it is known in philosophy and logic is something quite different. "Classical" analogical reasoning follows roughly this template: objects X and Y share properties a, b, and c; therefore, they may also share property d. For example, both Earth and Mars orbit the Sun, have at least one moon, revolve on an axis, and are subject to gravity; therefore, if Earth supports life, so could Mars (Bartha, 2016).
The NLP move from relational similarity to analogy follows the use of the term by P. Turney, who distinguishes between attributional similarity between two words and relational similarity between two pairs of words. On this interpretation, two word pairs that have a high degree of relational similarity are analogous (Turney, 2006).
In terms of practical NLP tasks, Turney et al. (2003) introduced the task of solving SAT (Scholastic Aptitude Test) analogy problems by choosing from several provided options. These problems were formulated as proportional analogies, written in the form a : a′ :: b : b′. It is this use of the term "analogy" that Mikolov et al. (2013c) followed in proposing the 3CosAdd method. They formulated the task as selecting a single best-fitting vector out of the whole vocabulary of the VSM. It became known as the word analogy task, but at its core it is still basically estimation of relational similarity, and could be formulated as such: given a pair of words a and a′, find how they are related, and then find the word b′ that has a similar relation with the word b. A crucial difference is that the graded, non-binary nature of relational similarity is no longer in focus: the goal is to find a single correct answer.
The dataset that came to be known as the Google analogy test set (Mikolov et al., 2013a) included 14 linguistic relations with 19,544 questions in total. It has become one of the most popular benchmarks for VSMs. This evaluation paradigm assumes that: (1) words in similar linguistic relations should in principle be recoverable via relational similarity to known word pairs.
(2) 3CosAdd score reflects the extent to which a given VSM encodes linguistic relations.
Assumption (1) became dubious when it was shown that the accuracy of 3CosAdd varies widely between categories (Levy and Goldberg, 2014), and even the best-performing GloVe model scores under 30% on the more challenging Bigger Analogy Test Set (BATS) (Gladkova et al., 2016). It appears that not all relations can be identified in this way, with lexical semantic relations such as synonymy and antonymy being particularly difficult (Köper et al., 2015; Vylomova et al., 2016). The assumption of a single best-fitting candidate answer has also been questioned (Newman-Griffis et al., 2017).
Assumption (2) was refuted when Drozd et al. (2016) demonstrated that some relations missed by 3CosAdd could be recovered with a supervised method, and therefore the information was present in the VSM, just not recoverable with 3CosAdd.
3 What Does 3CosAdd Really Do?

Methodology
We present a series of experiments performed with the BATS dataset. Although more results on the analogy task have been published with the Google test set than with BATS, the Google set only contains 15 types of linguistic relations, and these happen to be the easier ones (Gladkova et al., 2016). BATS covers most relations in the Google set, but it adds many new and more difficult relations, balanced across derivational and inflectional morphology, lexicographic and encyclopedic semantics (10 relations of each type). Thus BATS provides a less flattering, but more accurate estimate of the capacity for analogical reasoning in the current VSMs. We use pre-trained GloVe vectors by Pennington et al. (2014), released by the authors and trained on Gigaword 5 + Wikipedia 2014 (300 dimensions, window size 10). We also experiment with Word2Vec vectors (Mikolov et al., 2013b) released by the authors, trained on a subcorpus of Google News (also with 300 dimensions).
The evaluation with the 3CosAdd and LRCos methods was conducted with the Python script that accompanies BATS. We also added an implementation of 3CosMul, a multiplicative objective proposed by Levy and Goldberg (2014), now available in the same script. Since 3CosMul requires normalization, we used normalized GloVe and Word2Vec vectors in all experiments.
Questions with words not in the model vocabulary were excluded (0.01% of BATS questions for GloVe and 0.016% for Word2Vec).

The "Honest" 3CosAdd
Let us remember that 3CosAdd as initially formulated by Mikolov et al. (2013c) excludes the three source vectors a, a′, and b from the pool of possible answers. Linzen (2016) showed that if this is not done, the accuracy drops dramatically, hitting zero for 9 out of 15 Google test categories.
Let us investigate what happens on BATS data, split by the 4 relation types. The rows of Fig. 2 represent all questions of a given category, with darker color indicating a higher percentage of the predicted vectors being the closest to a, a′, b, b′, or any other vector. Fig. 2 shows that if we do not exclude the source vectors, b is the most likely to be predicted; in derivational and encyclopedic categories a′ is also possible, in under 30% of cases. b′ is as unlikely to be predicted as a, or any other vector.
This experiment suggests that adding the offset between a and a′ typically has a very small effect on the b vector, not sufficient to induce a shift to a different vector on its own. In effect, this limits the search space of 3CosAdd to the close neighborhood of the b vector.
It explains another phenomenon pointed out by Linzen (2016): in the plural noun category of the Google test set, 70% accuracy was achieved by simply taking the closest neighbor of the vector b, while 3CosAdd improved the accuracy by only 10%. That would indeed be expected if most singular (a) and plural (a′) forms of the same noun were so similar that subtracting them would result in a nearly-null vector, which would not change much when added to b.
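The near-null-offset effect can be demonstrated directly. In the sketch below the vectors are random and the 0.01 noise scale is an assumption chosen to mimic a highly similar word pair (such as a singular and its plural):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a and a' (e.g. singular and plural of the same noun) are
# almost identical, so their offset is nearly null.
a  = rng.normal(size=50)
a2 = a + 0.01 * rng.normal(size=50)   # tiny distributional difference

b = rng.normal(size=50)
offset = a2 - a                        # nearly-null offset
predicted = b + offset                 # 3CosAdd's predicted vector

# Cosine between the predicted vector and b itself:
sim = predicted @ b / (np.linalg.norm(predicted) * np.linalg.norm(b))
print(round(float(sim), 4))            # stays almost exactly at b
```

The predicted vector is nearly indistinguishable from b, so whatever word happens to be b's nearest neighbor dominates the answer.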

Distance to the Target Vector
Levy and Goldberg (2014, p. 173) suggested that the 3CosAdd method is "mathematically equivalent to seeking a word (b′) which is similar to b and a′ but is different from a." We examined the similarity between all source vector pairs, looking not only at the actual top-1 accuracy of 3CosAdd (i.e. the vector closest to the hypothetical vector), but also at whether the correct answer was found among the top-3 and top-5 neighbors of the predicted vector. For each similarity bin we also estimated how many questions of the whole BATS dataset fell into it. The results are presented in Fig. 3. Our data indicates that, indeed, for all combinations of source vectors, the accuracy of 3CosAdd decreases as their distance in vector space increases. It is the most successful when all three source vectors are relatively close to each other and to the target vector. This is in line with the above evidence from the "honest" 3CosAdd: if the offset is typically small, for it to lead to the target vector, that target vector should be close.
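The binning analysis described above can be sketched as follows; the similarity values and correctness flags are made-up stand-ins for real evaluation results:

```python
import numpy as np

# Group analogy questions by the similarity of two source vectors
# (e.g. cos(b, b')) and compute per-bin accuracy. The data below is a
# toy stand-in for the real 3CosAdd evaluation output.
sims    = np.array([0.15, 0.35, 0.45, 0.65, 0.80, 0.90])  # pairwise cosines
correct = np.array([0,    0,    1,    1,    1,    1])     # 1 = 3CosAdd hit

edges = np.arange(0.0, 1.01, 0.25)        # bins [0, 0.25), [0.25, 0.5), ...
bin_of = np.digitize(sims, edges) - 1     # bin index for each question

accuracy = {}
for k in range(len(edges) - 1):
    mask = bin_of == k
    if mask.any():                         # skip empty bins
        accuracy[k] = float(correct[mask].mean())
        print(f"{edges[k]:.2f}-{edges[k+1]:.2f}: "
              f"acc={accuracy[k]:.2f} (n={int(mask.sum())})")
```

On real data, each bin would also be weighted by how many BATS questions fall into it, as in Fig. 3.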
Consider also the ranks of the b vectors in the neighborhood of b′, shown in Fig. 3f. For nearly 40% of the successful questions b was within 10 neighbors of b′, and over 40% of low-accuracy questions were over 90 neighbors away.
As predicted by Levy et al., the b′ and a vectors do not exhibit the same clear trend of higher accuracy with higher similarity that is observed in all other cases (Fig. 3f). However, in experiments with only the 20 morphological categories we did observe the same trend for b′ and a as for the other vector pairs (see Fig. 4). This is counter-intuitive, and requires further examination. The observed correlation between the accuracy of 3CosAdd and the distance to the target vector could explain in particular the overall lower performance on BATS derivational morphology questions (only 0.08 top-1 accuracy) as opposed to inflectional morphology (0.59) or encyclopedic semantics (0.26). man and woman could be expected to be reasonably similar distributionally, as they combine with many of the same verbs: both men and women sit, sleep, drink, etc. However, the same could not be said of words derived with prefixes that change part of speech: going from a base word to such a derivative is likely to take us further in the vector space.
To make sure that the above trend is not specific to GloVe, we repeated these experiments with Word2Vec, which exhibited the same trends. All data is presented in Appendix A.1.

Uniqueness of a Relation
Note that the dependence of 3CosAdd on similarity is not entirely straightforward: Fig. 3b shows that for the highest similarity (0.9 and above) there is actually a drop in accuracy. The same trend was observed with Word2Vec (Fig. 10 in Appendix A.1). Theoretically, it could be attributed to there not being much data in the highest similarity range; but BATS has 98,000 questions, and even 0.1% of that is considerable.
The culprit is the "dishonesty" of 3CosAdd: as discussed above, it excludes the source vectors a, a′, and b from the pool of possible answers. Not only does this mask the real extent of the difference between a and a′, but it also creates a fundamental difficulty with categories where the source vectors may be the correct answers. This is what explains the unexpected drops in accuracy at the highest similarity between vectors b and a′. Consider a question in which the correct answer coincides with one of the source vectors: that answer would a priori be excluded. In BATS data, this factor affects several semantic categories, including country:language, thing:color, animal:young, and animal:shelter.

Density of Vector Neighborhoods
If solving proportional analogies with word vectors is like shooting, the farther away the target vector is, the more difficult it should be to hit. Also, we can hypothesize that the more crowded a particular region is, the more difficult it should be to hit a particular target.
However, density of vector neighborhoods is not as straightforward to measure as vector similarity. We could look at average similarity between, e.g., top-10 ranking neighbors, but that could misrepresent the situation if some neighbors were very close and some were very far.
In this experiment we estimate density as the similarity to the 5th neighbor: the higher it is, the more highly similar neighbors a word vector has. The results are shown in Fig. 5, and they seem counter-intuitive: denser neighborhoods actually yield higher accuracy (although there are virtually no cases of very tight neighborhoods). One explanation could be an inverse correlation between density and distance: if the neighborhood of b is sparse, the closest word is likely to be relatively far away, and, given the above finding that closer source vectors improve the accuracy of 3CosAdd, we could then expect lower accuracy in sparser neighborhoods.
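The density proxy itself is simple to compute; a sketch with a random 6-word "vocabulary" (purely for illustration):

```python
import numpy as np

# Density proxy: similarity to a word's 5th-nearest neighbor. The higher
# the value, the more crowded the neighborhood.
rng = np.random.default_rng(1)
V = rng.normal(size=(6, 20))                    # 6 toy "word" vectors
V = V / np.linalg.norm(V, axis=1, keepdims=True)

def kth_neighbor_sim(i, k=5):
    """Cosine similarity from word i to its k-th nearest neighbor."""
    sims = V @ V[i]
    sims[i] = -np.inf            # a word is not its own neighbor
    return float(np.sort(sims)[-k])

density = kth_neighbor_sim(0, k=5)
print(density)
```

Note that this sidesteps the averaging problem mentioned above: a single far-off 5th neighbor immediately lowers the estimate, regardless of how close the top few neighbors are.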

Comparison with Other Methods
We repeat the above experiments on GloVe with 3CosMul, a multiplication-based alternative to 3CosAdd proposed by Levy and Goldberg (2014):

b′ = argmax cos(b′, b) · cos(b′, a′) / (cos(b′, a) + ε)

(ε = 0.001 is used to prevent division by zero). As 3CosMul does not explicitly calculate the predicted vector, we did not plot the similarity of b′ to the predicted vector. But for the other vector pairs shown in Fig. 6 we can see that 3CosMul, like 3CosAdd, has much higher chances of success where target vectors are close to the source. The numerical values for all data can be found in the Appendix.
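The 3CosMul objective can be sketched on the same kind of toy vectors (made up, not real embeddings); here the cosines are shifted into [0, 1], a common fix since the multiplicative formula assumes non-negative similarities:

```python
import numpy as np

# Toy vocabulary and vectors (illustrative only).
vocab = ["king", "queen", "man", "woman", "apple"]
V = np.array([
    [0.9, 0.8, 0.1, 0.0],   # king
    [0.9, 0.1, 0.8, 0.0],   # queen
    [0.1, 0.9, 0.1, 0.1],   # man
    [0.1, 0.1, 0.9, 0.1],   # woman
    [0.0, 0.1, 0.1, 0.9],   # apple
])
V = V / np.linalg.norm(V, axis=1, keepdims=True)
idx = {w: i for i, w in enumerate(vocab)}

def cos_mul(a, a2, b, eps=0.001):
    """3CosMul: argmax over b' of cos(b',b) * cos(b',a') / (cos(b',a) + eps)."""
    s = lambda w: (V @ V[idx[w]] + 1) / 2       # cosine shifted to [0, 1]
    score = s(b) * s(a2) / (s(a) + eps)
    for w in (a, a2, b):                        # exclude source words
        score[idx[w]] = -np.inf
    return vocab[int(np.argmax(score))]

print(cos_mul("man", "woman", "king"))
```

The multiplicative form amplifies small differences among similar candidates, which is why it tends to outperform the additive objective.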
We also consider LRCos, a method based on supervised learning from a set of word pairs (Drozd et al., 2016). LRCos reinterprets the analogy task as follows: given a set of word pairs (e.g. brother:sister, husband:wife, man:woman, etc.), the available examples of the class of the target b′ vector (sister, wife, woman, etc.) and randomly selected negative examples are used to learn a representation of the target class with a supervised classifier. The question then becomes: what word is the closest to king, but belongs to the "women" class?
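A minimal sketch of the LRCos idea follows, with synthetic vectors standing in for embeddings; the `targets`, `negatives`, and `candidates` sets are hypothetical placeholders for real word lists:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# LRCos sketch: score(w) = P(w is in the target class) * cos(w, b).
rng = np.random.default_rng(2)
dim = 10
targets   = rng.normal(loc=+1.0, size=(20, dim))   # e.g. known "women" words
negatives = rng.normal(loc=-1.0, size=(20, dim))   # random other words

X = np.vstack([targets, negatives])
y = np.array([1] * 20 + [0] * 20)
clf = LogisticRegression().fit(X, y)               # learns the target class

b = rng.normal(size=dim)                           # e.g. the "king" vector
candidates = rng.normal(loc=+1.0, size=(5, dim))   # hypothetical answer pool

def lrcos_score(w):
    p = clf.predict_proba(w[None, :])[0, 1]        # class-membership prob.
    sim = w @ b / (np.linalg.norm(w) * np.linalg.norm(b))
    return p * sim

best = max(range(len(candidates)), key=lambda i: lrcos_score(candidates[i]))
print(best)                                        # index of best candidate
```

Because class membership is learned from a whole set of examples, the method does not assume a single constant offset, which is the design choice the rest of this section credits for its edge.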
With LRCos it is only meaningful to look at the similarity of b′ to b (Fig. 7). Once again, we see the same trend: closer targets are easier to hit. However, if we look at overall accuracy, there is a big difference between the three methods. Fig. 8b shows that the accuracy of LRCos is much higher than the top-1 accuracy of 3CosAdd or 3CosMul. Moreover, its "honest" version (Fig. 8a) performs just as well as the "dishonest" one. These results are consistent with those reported by Drozd et al. (2016). As for 3CosMul, Levy et al. (2015) show that 3CosMul outperforms 3CosAdd in PPMI, SGNS, GloVe and SVD models on the Google dataset, sometimes yielding a 10-25% improvement. Our BATS experiment confirms the overall superiority of 3CosMul to 3CosAdd, although the difference is less dramatic.
Thus LRCos considerably outdoes its competitors, although it does not manage to avoid the similarity problem. We attribute its edge to the set-based, supervised nature of LRCos, which helps with a different problem that affects both 3CosAdd and 3CosMul: the assumption of "linguistic regularities" from which we started. There are unresolved questions about the underlying assumption that the offset between vectors a and a′ provides access to certain features combinable with vector b to detect b′, and that such an offset should be more or less constant for all words in a given linguistic relation. Table 2 shows that this does not happen in a reliable way (data: BATS category D06 "re+verb"). Both correct and incorrect answers lie in about the same similarity range, so we cannot attribute the failures to the reliance of 3CosAdd on close neighborhoods. The distance from marry to remarry is about the same as for the other verb pairs; thus it must be the case that the offset between different a and a′ pairs is not the same, and leads to different answers, with a frustratingly small margin of error.

Can We Just Blame the Corpus?
Source corpora are noisy, and it is tempting to blame almost anything on that. It could be literal text-processing noise (e.g. not-quite-cleaned HTML data and ad texts) or, more broadly, any kind of information in the VSM that is irrelevant to the question at hand. This includes polysemy: for a word-level VSM, the difference between king and queen is not exactly the same as the difference between man and woman, if only because of the existence of the band Queen (although that factor should not affect the "re-" prefix verbs in Table 2).
In addition to irrelevant information, there is also missing information. Corpora of written texts are a priori not the same source of input as what children get when they learn their language. Natural language semantics relies on much data that the current VSMs do not have, including multimodal data and frequencies of events too commonplace to be mentioned in writing (Erk, 2016, p.18).
This means that the distributional difference between tell and retell (or marry and remarry, or both pairs) does not necessarily reflect the full range of the relevant difference, which could perhaps have helped to bring the vector offset calculation closer to the desired outcome. On this view, in an ideal world all word vectors with the "re-" feature would be nearly aligned. Some blame could also be passed to condensed vectors such as SVD or neural word embeddings, which blend distributional features in a non-transparent way, potentially obscuring the relevant ones.
The current source corpora and VSMs could certainly be improved. But both linguistics and philosophy suggest that there are also issues with the idea of linguistic relations being so regular.

Semantics is Messy
In theory, according to the distributional hypothesis, we would expect the relatively straightforward "repeated action" paradigm of verbs with and without the prefix "re-" in Table 2 to surface distributionally in the use of adverbs like "again". However, we have no reason to expect this to happen in quantitatively exactly the same way for all the verbs, even in an "ideal" corpus. And such variation would lead to the irregularities that we observe.
In fact, such variation would make VSMs more like the human mental lexicon, not less. A well-known problem in psychology is the asymmetry of the similarity judgments upon which relational similarity and analogical reasoning are based. Logically, "a is like b" is equivalent to "b is like a", but humans do not necessarily agree with both statements to the same degree (Tversky, 1977).
Consider the "re-" prefix examples above. We could expect 100% success by native English speakers on a "complete the verb paradigm" task, because they would inevitably become aware of the "add re-" rule during its completion. Even so, processing time would vary due to such factors as frequency and prototypicality. The psychological evidence is piling up for a certain gradedness in the mental representation of morphological rules: people rate the same structure differently on complexity ("settlement" is reported as more affixed than "government"), similarity judgments for semantically transparent and non-transparent bases are continuous, and there are graded priming effects for orthographic, semantic, and phonological similarity between derived words and their roots (Hay and Baayen, 2005).
There are several connectionist proposals to simulate asymmetry through biases, saliency features, or structural alignment (Thomas and Mareschal, 1997, p.758). The irregularities we observe in the VSMs could perhaps even be welcomed as another way to model this phenomenon -although it remains to be seen to what extent the parallel we draw here is appropriate.
As a side note, let us remember that equations such as king − man + woman = queen should only be interpreted distributionally, although it is tempting to suppose that they reflect something like semantic features. That would be misleading on several accounts. First of all, the 3CosAdd math is commutative, which would be dubious for semantic features. Secondly, it would bring us to the wall that componential analysis in linguistic semantics hit a long time ago: semantic features defy definitions, they only apply to a portion of the vocabulary, and they impose binary oppositions that are psycholinguistically unrealistic (Leech, 1981, pp. 117-119). Is the offset between man and woman certainly "femaleness", or perhaps "maleness", or some mysterious "male-female gender change" semantic feature?

Analogy Is Not an Inference Rule
Let us now come back to the fact that the "linguistic regularities" in fact rely on relational similarity (Section 2), and relational similarity is not something binary. That takes us straight to the most fundamental difficulty with analogy as it is known in philosophy and logic. Analogy is undeniably fundamental to human reasoning, as an instrument for discovery and for understanding the unknown through the known, but it is not, and has never been, an inference rule.
Consider the example above, where Mars is similar to Earth in several ways, and therefore could be supporting life. This analogy does not guarantee the existence of Martians, and similar reasoning could be applied to even less suitable planets.
Basically, the problem with analogy is that not all similarities warrant all conclusions, and establishing valid analogies requires much case-by-case consideration. For this and some other reasons, analogy was long rejected in generative linguistics as a mechanism for language acquisition through discovery, although it is now making a comeback (Itkonen, 2005, pp. 67-75).
This general difficulty with analogical reasoning (it does work in humans, but selectively, so to speak) is inherited by the so-called proportional analogies of the a : a′ :: b : b′ kind. A case in point is their use in schools as verbal reasoning tests. In 2005, analogies were removed from the SAT, with criticisms including their ambiguity, guesswork, and puzzle-like nature (Pringle, 2003). It is also telling that SAT analogy problems came with a set of potential answers to choose from, because otherwise students would supply a range of answers with varying degrees of incorrectness.
In the case of the "re-" prefix above, once again, we could expect a 100% success rate from humans who could see the "add re-" pattern; but semantic BATS questions would yield more variation. Consider the question "trout is to river as lion is to ___". Some would say den, thinking of the river as the trout's "home", but some could say savanna in broader habitat terms; cage or zoo or safari park or even circus would all be valid to various degrees. BATS accepts several answer options, but it is hardly feasible to list them all for all cases.
Given the above, the question is: if analogical reasoning requires much case-by-case consideration in humans, what should we expect from VSMs with a single linear algebra operation?

Implications for Evaluation of VSMs
The analogy task continues to enjoy immense popularity in the NLP community as a standard evaluation task for VSMs. We have already mentioned two problems with the task: the Google test scores flatter the VSMs (Gladkova et al., 2016), while 3CosAdd disadvantages them, because the required semantic information may be encoded in more complex ways (Drozd et al., 2016).
What the present work adds to the discussion is a demonstration of how strongly the accuracy on the analogy task depends on the target vector being relatively close to the source in the vector space model, not only for 3CosAdd, but also for 3CosMul and LRCos. This is in fact a fundamental problem that is encountered in many other NLP tasks.
That problem brings about the following question: what have we been evaluating with 3CosAdd all this time?
The answer seems to be this: analogy task scores indicate to what extent the semantic space of a given VSM was structured in a way that, for each word category, favored the linguistic relation that happened to be picked by the creators of the particular test dataset. BATS makes this clearer, because it is well balanced across different types of relations. Most models score well on morphological inflections, because morphological forms of the same word are highly distributionally similar and likely to be close. But we do not see equal success for synonyms, suffixes, colors and other categories, because it is hard to expect any one model to "guess" which words should have synonyms as their closest neighbors and which words should be close to their antonyms.
As a matter of fact, for a general-purpose VSM we would not want that: every word can participate in hundreds of linguistic relations that we may be interested in, but we cannot expect them all to be close neighbors. We would want a VSM whose vector neighborhoods simply reflect whatever distributional properties were observed in a corpus. The challenge is to find reasoning methods that could reliably identify linguistic relations from vectors at any distance.
Given the irregularities discussed in section 5, these methods would also have to rely on a more linguistically and cognitively realistic model of how meanings are reflected in the distributional properties of words. In taxonomy construction, for example, it was found helpful to narrow the semantic space with domains or clusters, essentially "zooming in" on certain relations (Fu et al., 2014; Espinosa Anke et al., 2016).
LRCos made a step in the right direction, as it does not rely on unique and neatly aligned word pairs, but it can only work for relations between coherent word classes. That excludes many lexicographic relations such as synonymy (car is to automobile as snake is to serpent) and frame-semantic or encyclopedic relations (white is to snow as red is to rose).

Conclusion
While it would be highly desirable to have automated reasoning about linguistic relations with VSMs as a powerful, all-purpose tool, it is so far a remote goal. We investigated the potential of the vector offset method in solving the so-called proportional analogies, which rely on one pair of words with a known linguistic relation to identify the missing member of another pair of words.
We have presented a series of experiments showing that the success of the linear vector offset (as well as two better-performing methods) depends on the structure of the VSM: targets that are further away in the vector space have worse chances of being recovered. This is a crucial limitation: no model could possibly hold all related words close in the vector space, as there are many thousands of linguistic relations, and many are context-dependent. Furthermore, the offsets of different word vector pairs appear not to be regular, even for relatively straightforward linguistic relations. We argue that the observed irregularities should not simply be blamed on the corpus. There are a number of theoretical issues with the very approach to linguistic relations as something neat and binary. We hope to draw attention to the graded nature of the relational similarity that underlies analogical reasoning, and to the need for automated reasoning algorithms to become more psychologically plausible in order to become more successful.