What Analogies Reveal about Word Vectors and their Compositionality

Analogy completion via vector arithmetic has become a common means of demonstrating the compositionality of word embeddings. Previous work has shown that this strategy works more reliably for certain types of analogical word relationships than for others, but these studies have not offered a convincing account of why this is the case. We arrive at such an account through an experiment that targets a wide variety of analogy questions and defines a baseline condition to more accurately measure the efficacy of our system. We find that the most reliably solvable analogy categories involve either 1) the application of a morpheme with clear syntactic effects, 2) male–female alternations, or 3) named entities. These broader types do not pattern cleanly along a syntactic–semantic divide. We suggest instead that their commonality is distributional, in that the difference between the distributions of two words in any given pair encompasses a relatively small number of word types. Our study offers a needed explanation for why analogy tests succeed and fail where they do and provides nuanced insight into the relationship between word distributions and the theoretical linguistic domains of syntax and semantics.


Introduction
In recent years, low-dimensional vectors have proven an efficient and fruitful means of representing words for numerous computational applications, from calculating semantic similarity to serving as an early layer in deep learning architectures (Baroni et al., 2014; Schnabel et al., 2015; LeCun et al., 2015). Despite these advances, however, strategies for representing meaning compositionally with a vector model remain limited. Given the difficulties in training representations of composed meaning (for example, most possible phrases will be rare or unattested in training data), achieving an accurate means of building complex lexical or phrasal representations from lower-order ones would be a decisive coup in computational semantics.

* This work was done while the first author was a postdoctoral research associate at the University of Minnesota.
Another promising avenue of compositional semantics is the representation of concepts that do not map easily to lexemes. A simple averaging of two vectors may yield a concept that is semantically akin to both, and the arithmetic difference between word vectors has been said to represent the relationship between two terms. The ability to model knowledge unbounded by linguistic labels is an exciting prospect for natural language processing and artificial intelligence more broadly.
A common test of the compositional properties of word vectors is complete-the-analogy questions. Word vector arithmetic has achieved surprisingly high accuracy on this type of task. A flurry of recent studies have applied this test under various conditions, but there has been limited focus on defining precisely what types of relations vectors can capture, and less still on explaining these differences. As such, there remains a major gap in our understanding of distributional semantics. Our original experimental work improves upon prior methods by 1) targeting a wide variety of analogy questions drawn from several available resources and 2) defining a baseline condition to control for differences in "difficulty" between questions. These considerations enable an analysis that constitutes a major step towards a comprehensive, theoretically grounded account for the observed phenomena. To begin, however, we present a brief review of the analogy problem as usually posed.

Background
Several computational approaches have been proposed for representing the meaning of words (and holistic phrases) in terms of their co-occurrence with other words in large text corpora. Some of these, such as latent semantic analysis (Landauer and Dumais, 1997), focus on developing semantic representations based on theories of human cognition, whereas others, such as random indexing (Kanerva, 2009) and word embeddings (Bengio et al., 2003; Mikolov et al., 2013a), focus more on computational efficiency. Despite differences in purpose and implementation, all current distributional semantic approaches rely on the same basic principle of using similarity between co-occurrence frequency distributions as a way to infer the strength of association between words. For many practical purposes, such as information indexing and retrieval and semantic clustering, these approaches work remarkably well.
There is no obvious best way to compose these types of representations into larger arbitrary linguistic units, although it does seem that certain regularities exist between terms that surface through vector subtraction (Mikolov et al., 2013c; Levy and Goldberg, 2014). Why should this be the case? Consider the relationships between a difference vector w_b − w_a and other words in the vocabulary: w_b − w_a will be orthogonal to words that co-occur equally frequently with w_a and w_b, highly similar to words that co-occur only with w_b, and dissimilar (negative similarity) to words that co-occur only with w_a.¹ If a word's context is a fair representation of its meaning, as is the key tenet of the distributional hypothesis, then this vector difference should isolate crucial differences in meaning.
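These orthogonality and similarity claims rest on the distributivity of the dot product over addition, which is simple to verify numerically. The sketch below uses random toy vectors rather than trained embeddings; all names are illustrative.

```python
import numpy as np

# Toy check (random vectors, not trained embeddings): the similarity of a
# difference vector w_b - w_a to any word vector w decomposes as
# (w_b - w_a) . w  ==  (w_b . w) - (w_a . w).
rng = np.random.default_rng(0)
w_a, w_b, w = rng.normal(size=(3, 50))

lhs = np.dot(w_b - w_a, w)
rhs = np.dot(w_b, w) - np.dot(w_a, w)
assert np.isclose(lhs, rhs)
```

So if w co-occurs equally with w_a and w_b, the two terms on the right cancel and the difference vector is orthogonal to w.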
Analogy tasks have been used to test how well vector differences capture consistent semantic differences. Four-word proportional analogies, typically written as w_1:w_2::w_3:w_4, feature two pairs of words such that the relationship between w_1 and w_2 is the same as that between w_3 and w_4. If these words are represented with vectors, then, it is assumed that the differences between each pair are roughly equal:

w_2 − w_1 ≈ w_4 − w_3 (1)

In the most popular version of this task, a system is given the first three words in the analogy and asked to guess the best candidate for w_4. Solving for w_4,

w_4 ≈ w_3 + w_2 − w_1 (2)

and thus a system selects its hypothesis w_hyp from the vocabulary V (typically excluding w_1, w_2, and w_3) by finding the word with maximum angular (cosine) similarity to the hypothesis vector (expressed as a vector dot product, assuming all word vectors are unit length):

w_hyp = argmax_{w ∈ V} w · (w_3 + w_2 − w_1) (3)

We call this algorithm 3COSADD following Levy and Goldberg (2014). Levy and Goldberg note that this strategy is equivalent to finding the word in the lexicon that is the best match for w_3 and w_2 while also being most distant from w_1. This reframing suggests that it may not be necessary at all to represent ineffable concepts through intermediate stages of vector composition; 3COSADD could be solving analogies simply through term similarity. Indeed, words in a pair sharing some relation tend to be similar to each other; when they are extremely similar, the difference between w_2 and w_1 is negligible, and the task becomes trivial. Linzen (2016) makes this observation as well and goes on to demonstrate that accuracy falls to near zero across the board when w_1, w_2, and w_3 are not excluded from contention in the hypothesis space, which shows how strongly dependent 3COSADD is upon vector similarity.

1. These assertions are supported by the distributivity of the dot product, which is the standard calculation for similarity, over addition: (w_b − w_a) · w = w_b · w − w_a · w.
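For concreteness, 3COSADD as described above can be sketched as follows. This is not the paper's released implementation; `vectors` (a matrix of unit-length row vectors) and `words` (a parallel list of strings) are hypothetical stand-ins for a trained embedding table.

```python
import numpy as np

def three_cos_add(vectors, words, w1, w2, w3):
    """Return the 3COSADD hypothesis for the analogy w1:w2::w3:?.

    vectors: (|V|, d) array of unit-length word vectors.
    words:   list of |V| word strings, parallel to `vectors`.
    The three given words are excluded from the hypothesis space,
    as is typical for this task.
    """
    idx = {w: i for i, w in enumerate(words)}
    target = vectors[idx[w3]] + vectors[idx[w2]] - vectors[idx[w1]]
    sims = vectors @ target              # dot product = cosine similarity here
    for w in (w1, w2, w3):
        sims[idx[w]] = -np.inf           # exclude the given words
    return words[int(np.argmax(sims))]
```

On a toy space where king ≈ man + royal and queen ≈ woman + royal, `three_cos_add(V, words, "man", "king", "woman")` returns "queen".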
We agree wholeheartedly with that paper's claim that it is important to measure the consistency of vector differences in a way that is mindful of the typically high similarity between paired terms.

Analogy Test Sets
Several categorized sets of semantic and syntactic analogies are publicly available. One of the earliest was published by Microsoft Research (Mikolov et al., 2013c) and consists of 16 categories of inflectional morphological relations for English nouns, verbs, and adjectives. The most commonly reported test set, which we refer to as the Google set, is included with the distribution of the word2vec tool (Mikolov et al., 2013a). The Google set comprises 14 categories, mostly involving inflectional or geographical relationships between terms. Categories are grouped into a "semantic" and a "syntactic" subset, and results are often reported averaged over each rather than by category. This practice is rather problematic in our view, as the syntactic/semantic division is quite coarse and even questionable in some cases. We explore the relationship between syntax, semantics, and morphology in detail later on.
The "Better Analogy Test Set" (BATS) is a large set developed to contain a balanced sampling of a wide range of categories (Gladkova et al., 2016). BATS features 40 categories of 50 word pairs each, covering inflectional and derivational morphology as well as several semantic relations.
The relational similarity task in SemEval-2012 featured relations between word pairs targeting a massive range of lexical semantic relationships (Jurgens et al., 2012). By drawing on the aggregated results of the task's participants, we have extracted highly representative pairs for each relation to build an analogy set.

Accounting for Analogy Performance
In addition to those already cited, numerous other recent papers have evaluated word embeddings by benchmarking on analogy questions (Mikolov et al., 2013b; Garten et al., 2015; Lofi et al., 2016). There is some consensus regarding performance across question types: systems do well on questions of inflectional morphology (especially so for English (Nicolai et al., 2015)), but far less reliably on various non-geographical semantic questions, although some gains in performance are possible by adjusting the embedding algorithms used or their hyperparameters (Levy et al., 2015), or by training further on subproblems (Drozd et al., 2016).
Amongst all of these findings, however, we found no cohesive, thorough, and satisfying account of why vector arithmetic works where it does to solve analogies. To that end, we conducted an experiment to arrive at such an explanation, with some notable departures from previously used methods. We included a wide range of available test data, which is key because individual sets usually feature a bias towards one or a few types of question, and benchmarkers often report nothing more than accuracy averaged over an entire set (Schnabel et al., 2015). Additionally, we define a baseline, which is critical not only for gauging effectiveness but also for understanding the mechanism behind solving analogies using compositional methods.
In the following sections we present the design of the experiment, baseline condition, and question sets; a discussion of how performance on analogy questions breaks down by broad category; and finally, a theoretical accounting for the observed patterns and the implications for distributional semantics.

Word Embeddings
We used word embeddings trained on the plain text of all articles from Wikipedia as of September 2015, processed to remove all punctuation and case distinctions. We tested the word2vec and GloVe (Pennington et al., 2014) training algorithms. Results were qualitatively very similar between the two, although word2vec scored slightly higher on our metrics. Due to space considerations, we discuss only the word2vec results.
Hyperparameters were set as recommended for analogy tasks by the developers: 200-dimensional vectors, continuous bag-of-words sampling, 8-word window size. (We also tested a skip-gram model in word2vec and saw only slight and occasional differences, more subtle even than those seen between word2vec and GloVe.)

Test Set
We used a pooled set of analogy questions comprising the Google, Microsoft, SemEval 2012, and BATS test sets. At test time, any analogies that featured a word absent from our lexicon were discarded. (Note that the Microsoft categories testing the English possessive enclitic 's were not tested, as preprocessing for our vector training corpus removed all punctuation.) The sizes of each set following the removal of out-of-vocabulary analogies are given in Table 1.
Note that the BATS and SemEval data sets feature a number of word pairs in each category but not four-word analogy questions. We simply took every possible pair of pairs from the same category, so long as this did not result in an analogy in which w_1 and w_2 were the same word or in which w_4 was not unique. Some pairs in BATS have more than one correct answer; for uniformity with the other test sets, we use only the first answer provided for each of these pairs. For SemEval, we used the "platinum standard" data distribution, which includes rankings of word pairs in each category based on how well they represent the relationship as defined. We took only the better half of pairs from that ranking to generate the test set. This was necessary because pairs lower down the list tend to represent the relationship poorly, or even to represent its opposite.
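The pair-of-pairs expansion can be sketched as below. The helper name is ours, and the second filter reflects one plausible reading of the uniqueness condition (a question is unusable when its prompt and answer words coincide, since the given words are excluded from the hypothesis space).

```python
from itertools import permutations

def pairs_to_analogies(pairs):
    """Expand a category's word pairs into four-word analogy questions:
    every ordered pair of distinct pairs, skipping questions in which
    w1 == w2 (a trivial prompt) or w3 == w4 (the answer would be excluded
    along with the given words). Illustrative helper, not released code.
    `pairs` is a list of (w_a, w_b) tuples."""
    questions = []
    for (w1, w2), (w3, w4) in permutations(pairs, 2):
        if w1 == w2 or w3 == w4:
            continue
        questions.append((w1, w2, w3, w4))
    return questions
```

For example, the pairs ("dog", "dogs") and ("cat", "cats") yield the two questions dog:dogs::cat:cats and cat:cats::dog:dogs, while a degenerate pair like ("fish", "fish") is filtered out on both sides.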

Measures
Virtually all existing studies of automated analogy solving report accuracy as the main measure. Accuracy is indeed a relevant measure when the goal is to simulate human performance on a particular task. Our purpose, however, is to understand the nature of semantic representations and account for when vector arithmetic does and does not function well as a model of relationships.
For every analogy question, we calculate the ranking of the correct w_4 in the hypothesis space, that is, the ordering of all words in the lexicon in descending order of the result of the 3COSADD hypothesis function (3). A "correct" answer corresponds to a ranking of 1.
Accuracy is a coarse measure in that it is insensitive to any ranking other than 1. Rather than accuracy, we borrow a measure from information retrieval (Voorhees, 1999): the reciprocal of rank (RR) averaged across analogy questions in each category, which is always a positive fraction in the range (0, 1]:

RR = 1 / rank (4)

Numerically, RR acts as a "softer" version of accuracy, with rankings other than 1 contributing somewhat to the average. Besides being coarse, accuracy is also an uncontrolled measure in that it is insensitive to differences in analogy "difficulty," by which we mean the prior degree of similarity between single word vectors. An example: nominal plural analogies, such as dog:dogs::horse:horses, often achieve high accuracy, but this may follow naturally from the high similarity between most singular nouns and their plural forms; indeed, for both of these pairs, the singular and plural forms are the closest terms to each other in our trained vector space.
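The rank-based measure can be computed as in this sketch (our own helper, assuming a vector of similarity scores over the lexicon and the index of the correct answer):

```python
import numpy as np

def reciprocal_rank(scores, correct_idx, exclude=()):
    """RR of the correct word: order the lexicon by descending score
    (with the given analogy words excluded) and return 1 / rank.
    A rank of 1, i.e., a "correct" answer under accuracy, gives RR = 1."""
    scores = scores.copy()
    scores[list(exclude)] = -np.inf
    # rank = 1 + number of candidates scored strictly higher than the answer
    rank = 1 + int(np.sum(scores > scores[correct_idx]))
    return 1.0 / rank
```

A near miss (rank 2) thus contributes 0.5 to the category average rather than the 0 it would contribute under accuracy.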
To measure the efficacy of vector arithmetic in a manner controlled for variances in prior vector similarity, we propose a baseline, defined for each analogy as the better of the rankings produced by the word most similar to w_2 and the word most similar to w_3:

rank_base = min( rank(argmax_{w ∈ V} (w · w_2)), rank(argmax_{w ∈ V} (w · w_3)) ) (5)

For the above example, as dog is the most similar word to dogs, there is no improvement to be made upon baseline. Likewise, for the analogy banana:yellow::sky:blue, baseline would likely be high because yellow and blue are very similar.
Consistent with reporting RR for 3COSADD, we report baseline reciprocal rank (BRR). We suspect that using RR will be especially illustrative for baseline, where there may be many "near misses" that are informative but would all be reduced to zero if measuring only accuracy.
Our baseline is similar to the so-called ONLY-B baseline tested by Linzen (2016), except that the latter considers only w_3. We include w_2 because this term has just as much effect on the 3COSADD hypothesis as w_3. Note that our baseline would not itself be implementable as a solving strategy, because it presumes access to w_4 in order to select between w_2 and w_3; nevertheless, we contend that it is helpful to define the baseline as we have done to account for those categories in the test data where all w_2 and w_4 are drawn from a small semantic cluster, most notably the color example in the previous paragraph. (Overall, 16-18% of analogies across our test sets show similarity to w_2 as a better baseline than similarity to w_3.)

Improvement is defined as the difference between 3COSADD RR and baseline RR, a measure we will refer to as reciprocal rank gain (RRG). RRG is more sensitive than accuracy to shifts in rank that do not result in a correct answer: analogies that improve from a very poor rank to first place show a gain of nearly 1, whereas moving from second to first place gains only 0.5 (and moving from a poor rank to second gains nearly 0.5). If 3COSADD yields a worse hypothesis than baseline, this is reflected as a negative RRG.
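Putting the pieces together, RRG for a single analogy can be sketched as follows (indices i1..i4 point at w_1..w_4 in a hypothetical matrix of unit-length word vectors; this is illustrative, not the released code):

```python
import numpy as np

def rr_gain(vectors, i1, i2, i3, i4):
    """Reciprocal rank gain: RR of the correct w4 under 3COSADD minus the
    baseline RR, where the baseline takes the better of ranking the lexicon
    by similarity to w2 and by similarity to w3 (the min over ranks in (5)
    corresponds to the max over the two RRs)."""
    def rr(scores):
        scores = scores.copy()
        scores[[i1, i2, i3]] = -np.inf       # exclude the given words
        return 1.0 / (1 + int(np.sum(scores > scores[i4])))

    hyp_rr = rr(vectors @ (vectors[i3] + vectors[i2] - vectors[i1]))
    base_rr = max(rr(vectors @ vectors[i2]), rr(vectors @ vectors[i3]))
    return hyp_rr - base_rr
```

On a toy space where the correct answer is both the 3COSADD winner and the nearest neighbor of w_3, RRG comes out to exactly 0, which is precisely the high-baseline situation the measure is designed to expose.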
We also tested the other solving methods suggested by Levy and Goldberg (2014), 3COSMUL and PAIRDIRECTION, although we do not report them here; results with the former were virtually indistinguishable from 3COSADD, and poorer overall with the latter.
The raw results of our similarity experiments, as well as source code to replicate all steps of the experiments and analysis, can be downloaded at https://github.com/gpfinley/analogies.

Results
Most broadly, we confirm prior findings that vector arithmetic can be used to solve analogy questions, with a mean RRG of .165 across all questions in all categories (t = 187, p < .01). For a more nuanced analysis, we sorted analogy tests into four broad supercategories of analogical relationship: 30 categories of inflectional morphology, 12 of derivational morphology, 10 of named entities, and 95 of semantics of non-named entities (79 of which are from SemEval).
The gain in RR from baseline for all categories is presented visually in Figure 1, where they are grouped into our four supercategories for ease of interpretation. (See the appendix for the names of the top-performing categories.) Each individual category is represented by a line between its BRR and 3COSADD RR. Within each supercategory, we also consider intermediate groupings of categories, and these are visualized by differences in line stroke in the figure. Some patterns are evident between and within supercategories:

• Inflectional: Although all inflectional categories show positive RRG, adjectival and verbal inflection shows reliably higher RRG than nominal inflection.

• Named entities: All categories, and particularly those dealing with country capitals, show high RRG.

• Lexical: Analogy relationships based on gender difference exhibit high RRG, while most other categories have low or even negative RRG.
We performed a linear regression analysis to predict RRG as a function of supercategory (F = 24600, p < .01, R² = .39). The model is summarized in Table 2. (Note that the model contains no intercept term, so the coefficient for each supercategory is equivalent to its mean RRG.) A positive RRG can be demonstrated with high statistical significance for all supercategories except lexical semantics.
We also investigated possible effects of word frequency on analogy performance.
Multicollinearity poses a major challenge here: the frequencies of all four words in an analogy are highly correlated, and frequency can change dramatically across categories. A comprehensive analysis of this complex problem is beyond the scope of this study, but we did find that the difference between an analogy's w_4 frequency and the mean w_4 frequency in its category correlates positively with RRG, though the effect is subtle (r = .016, t = 6.28, p < .01).

Discussion
It is clear from our results that vector arithmetic is a better approach for certain types of analogy questions than for others. Almost as clear is the hierarchy of the four broad types of questions that we have defined: excellent performance for inflection and named entities, with decidedly mixed results for derivational morphology and poorer still for lexical semantics-with the notable exception of male-female analogies. Below, we account for these patterns in the context of two domains of linguistic theory: the interaction between morphology and syntax, and the type-theoretic difference between individuals and sets.

Morphology and Syntax
Verbal and adjectival inflection show much more improvement over baseline than nominal inflection. It may simply be that the nominal categories have too high a baseline value to show much evidence of improvement by 3COSADD. It is also possible, however, that the nominal plural has fewer syntactic implications than verbal and adjectival morphology: nouns in non-subject position do not participate in number agreement in English, so the plurality of many nouns in a text has little syntactic consequence.

Figure 1: Mean reciprocal rank shifts between baseline and 3COSADD for four supercategories. Each line is a single category of analogy questions ("country - capital" or "male - female," for example). Some lines are differentiated by stroke type (dotted, solid, or dashed), the meaning of which is idiosyncratic to each supercategory: for inflectional, dashed lines are for nouns, dotted lines for adjectives, and solid lines for verbs; for derivational, dotted lines are for morphemes that change syntactic class with minimal semantic impact (e.g., -ly, as opposed to re-); for named entities, dotted lines are for country capitals; for lexical semantics, dotted lines are for gender relationships. Within each supercategory, the difference in RRG between categories of different stroke types is significant in every case (|t| between 14.5 and 58.7, p < .01).
Derivational morphology might be expected to perform worse than inflectional morphology for a number of reasons. Even for highly productive morphemes, derivation tends to have more idiosyncratic meaning (Haspelmath and Sims, 2010, p. 100). For example, although 'recruitment' refers to the act of recruiting, 'government' refers to a governing body rather than the act of governing; similarly, the adverb 'sadly' can be used as a sentential adverb (expressing the speaker's attitude about the statement) as well as a manner adverb, whereas 'angrily' cannot. These semantic characteristics introduce lexically dependent variance that is far less pronounced for inflection. From our results with derivational sets, there is evidence of a trend in which morphemes with predominantly syntactic consequences are better handled than those with stronger semantic consequences (see dotted/solid lines in Figure 1). Significant further experimental work is needed to quantify the syntactic versus semantic effects of derivational morphemes.

Table 2: Summary of regression model for reciprocal rank gain as a function of analogy supercategory. All starred levels are highly significant (p < .01).
We predict that such work would support the notion of a continuum between morphemes with only syntactic effects and those with only (lexically) semantic effects. Those towards the syntactic end of the continuum will tend to be better captured by vector offsets in distributional representations. There would be a partial overlap between this continuum and the inflectional-derivational continuum in that derivational morphology tends to have more idiosyncratic meanings and is less relevant to syntax. There would be differences as well, especially as regards the property that word class-changing morphology is more derivational: the repetitive re-in English, for example, may be considered less derivational than the deverbal nominalizer -ment because it does not change word class, but re-has virtually no syntactic consequences for the verb to which it affixes.

Semantics: Named Entities as Individuals
Our results show that analogy sets containing named entities are more readily solvable than those that contain other lexical categories (common nouns, verbs, etc.).
A possible explanation for this is that named entities have a single real-world referent (there is, for instance, only one Amsterdam), while there is a large set of real-world referents that correspond to a common noun like 'dog'. We would expect the co-occurrences of 'dog', then, to be more diverse than those of a named entity like 'Amsterdam'.
The distinction drawn here between named entities and other parts of speech is analogous to the distinction between words of type e ("individuals") and words of type ⟨e,t⟩ in Montagovian set-theoretic semantics (Montague, 1973). According to Montague, proper names (arguments of type e) denote individuals, while verbs and common nouns (predicates of type ⟨e,t⟩) denote sets of individuals. Thus, 'Amsterdam' denotes an individual, while 'dog' denotes the set of dogs.
To better appreciate how this distinction might lead to "fuzzier" representations for some words, consider that training a vector on separate references to numerous members of a set of individuals is akin to a massive case of pseudo-polysemy: the vector can only capture an average over all referents rather than a single, clear referent. Polysemy is a well-known problem in training word vectors (Reisinger and Mooney, 2010), although this case of multiple referents has not been considered before to our knowledge.
Overall, named entity categories show very good RRG results, especially when both terms in a pair are named entities (as opposed to 'name - occupation', say). Country capitals show excellent performance in particular. In the broader history of this line of research, it is worth noting that the composition of the Google test set plays to this strength: country capital questions constitute over a quarter of its analogies (and over half of those in the "semantic" subset, as noted by Gladkova et al. (2016)). As our experiments and others have demonstrated, however, the vector arithmetic approach struggles for most semantic questions.
Given the enormous influence of word2vec, it is worth asking whether prevailing knowledge in this field has been influenced by a selective focus on easier tasks. As further illustration of this point, note that the classic go-to example, king:queen::man:woman, is drawn from the sole category in lexical semantics with any clear positive result in our experiments.
Finally, we should address the exceptional performance on analogies in male-female categories: why, of all lexical semantic sets, do we see such high performance here? We suspect these categories do well for the same reason that inflectional analogies do well: English features gender agreement with some personal pronouns (and, of course, with coreferential gendered terms), so there are concrete and regular distributional consequences of a noun's semantic gender.

A Unified Account
A recurrent thread in our account of all categories is that 3COSADD does well with relationships that have predictable effects on distribution, i.e., on nearby terms and their morphology and syntax (although all morphology is effectively suppletive from the perspective of these embeddings, which treat each word form as an atom). This is especially evident with inflectional morphology, and true as well for certain types of derivational morphology and for classes that participate in agreement, such as gender.
Relations between named entities are not governed by syntactic differences as inflectional relationships are, but there is a certain distributional parallel between the two: terms with a single referent will generally exhibit a less blurred co-occurrence profile than those with multiple referents; similarly, the difference between two realizations of the same root (e.g., 'hot' and 'hotter') will be highly non-orthogonal primarily with words of syntactic relevance, which is also a small set. The common theme is clear: the smaller the set of unique word types that co-occur with one of the two words but not both (i.e., the symmetric difference of their context sets), the more cleanly the relationship between the two words can be captured.
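The quantity at the heart of this account, the symmetric difference between two words' sets of co-occurring word types, is straightforward to compute from context inventories. The sketch below uses made-up context sets purely for illustration.

```python
def context_symmetric_difference(contexts_a, contexts_b):
    """Number of word types that co-occur with one word but not the other.
    On our account, the smaller this set, the more cleanly vector
    arithmetic captures the relationship between the two words."""
    return len(set(contexts_a) ^ set(contexts_b))

# Illustrative only: an inflectional pair whose contexts differ mainly in
# a few syntactically relevant types.
run_contexts = {"the", "to", "fast", "he", "will"}
runs_contexts = {"the", "fast", "he", "she", "often"}
```

Here `context_symmetric_difference(run_contexts, runs_contexts)` is 4; by hypothesis, pairs whose true corpus-derived value is small should be the ones 3COSADD solves reliably.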
Recall that our results also suggest that analogy questions containing frequent words are easier to solve with vector arithmetic than those containing less frequent words. We suspect that this is because the distributional representations of frequent words are more robust and less noisy. We believe, however, that more targeted investigation into the effects of frequency might qualify this generalization. For instance, it is reasonable to assume that a word's frequency correlates with the diversity of its co-occurrence, and that this diversity could signal distinct word senses, which are notoriously tricky for distributional representations. This is a ripe topic for further study.

Challenges
One challenge in interpreting our results is that categories with seemingly identical relations can show marked discrepancies in performance: note the differences between Google 'comparative' and Microsoft 'JJ JJR', which examine the same inflectional relationship but show rather different levels of performance. Similarly, note the extreme difference in baseline rank for Google 'gender' (called 'family' in the original set) and BATS 'male -female' categories. Clearly, lexical choices make a significant difference and can even overshadow the inter-category differences that we are trying to measure. Note that in both of the above examples, the version of the category featuring more unique word types showed lower baseline and lower gain.
The explanations we put forward here may need to be extended to address other types of relationships that we did not evaluate. One particularly interesting example might be Linzen et al.'s (2016) tests of analogies between quantifiers across domains, e.g., everything:nothing::always:never, which show intriguingly mixed results.

Conclusion
We evaluated syntactic and semantic analogy questions from a large and highly diverse test set using metrics more controlled and more sensitive than accuracy. Inspecting the results across categories, we were able to account for the differences in performance we observed across types of word relationships in terms that are consistent with the distributional training objectives of word embeddings.
Vector arithmetic with word embeddings is most effective when the co-occurrence differences between related words are limited to a small number of word types, whether through syntactic regularities or through ease of semantic representation. It is possible to account for both of these by considering distributional phenomena directly.
Still, questions remain: do our negative results reflect a failure of word vectors to model semantic nuances, a failure of vector arithmetic to capture them, or is the semantic data simply too noisy for current methods? Further experiments with special attention paid to smoothing lexical semantic representations will be key to answering this question.