Traversal-Free Word Vector Evaluation in Analogy Space

In this paper, we propose an alternative evaluation metric for word analogy questions ("A to B is as C to D") in word vector evaluation. Different from the traditional method, which predicts the fourth word from the given three, we measure similarity directly on the "relations" of the two given word pairs, effectively shifting the relation vectors into a new analogy space. Cosine and Euclidean distances are then calculated as measurements. Observations and experiments show that the proposed analogy space evaluation offers a more comprehensive evaluation of word vectors on word analogy questions. Meanwhile, computational complexity is remarkably reduced by avoiding traversal of the vocabulary.


Introduction
In recent years, the word vector, also addressed as word embedding or distributed vector representation of words, has achieved high popularity in NLP (Natural Language Processing) applications. A word vector is a real-valued vector which is quite low-dimensional compared with the traditional one-hot representation of words. The theory behind it is believed to be the early concept of distributed representation (Hinton, 1986), and the modern word vector derives from the training process of neural language models (Bengio et al., 2003).
However, the discussion about how to evaluate the quality of word vectors remains open. Apart from actual applications, the most frequently used evaluation tasks are word similarity and word analogy. A word similarity task is to find the nearest word to a given word in the vector space, based on the theory that words with similar meanings should gather together. Although it is widely used, arguments have been made questioning its capability (Batchkarov et al., 2016; Faruqui et al., 2016).
In a word analogy test, three words A, B and C are given, and the goal is to find a fourth word D which logically conforms to "A to B is as C to D". The word analogy test has a long history of use in examinations and IQ tests for humans (McClelland, 1973; Sternberg, 1985) and was introduced into word vector evaluation by Mikolov et al. (2013b). Since then, it has been widely applied.
Efforts have been made to improve the original analogy metric, such as using PAIRDIRECTION instead of 3COSADD in the calculation (Levy and Goldberg, 2014) or taking multiple word pairs into consideration (Drozd et al., 2016), but the goal is still to find word D in the vocabulary. Besides, Linzen (2016) made a thorough assessment of the word analogy test, and the most prominent finding is that, if the three given words are not excluded, the prediction of D would almost always be C (91%) or B (5%), especially when the offset between words is small. This phenomenon raises the doubt of whether we are searching for a word D which holds the same logic to C as B does to A, or actually just searching for the nearest word to C. Furthermore, the general accuracy decline in reversed analogies also suggests the uncertainty of the current analogy evaluation metric.
In this paper, we dig deeper into the limitations of the current analogy evaluation metric in Section 2 and propose a simple alternative in Section 3, which we call "Analogy Space Evaluation". A significant difference of our approach is that it avoids traversing the vocabulary over and over. Experiments are presented in Section 4, followed by the conclusion and discussion.

Limitations of Traditional Metric
In traditional word analogy evaluation, given word pairs (A, B) and (C, D) with the same syntactic or semantic relation, the goal is to find the nearest word to "C + B − A" in the vector space by Cosine similarity and to check whether the word obtained is D. In practice, some implementations, such as the widely used Word2Vec, use the unit vectors of A, B and C in "C + B − A". Either way, the return value of such a word analogy question is Boolean. Generally, evaluating word vectors requires thousands of word analogy questions, which return thousands of Boolean values from which an accuracy is calculated from a macro perspective: how many of the expected D words have been successfully predicted. However, if we treat each question as an independent target from a micro perspective, a Boolean result suffers an unneglectable information loss: true or false cannot quantitatively manifest the extent of how true or how false. For instance, it does not matter whether D is the 2nd nearest word to "C + B − A" or the 100th.
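The traditional traversal-based metric described above can be sketched as follows. This is a minimal illustration, not any particular toolkit's implementation: the function and variable names are ours, and we assume the vectors have been unit-normalized in advance so that the dot product equals cosine similarity.

```python
import numpy as np

def cosadd_predict(vectors, vocab, a, b, c, exclude_given=True):
    """Traditional analogy metric: return the word nearest to c + b - a
    by cosine similarity, traversing the whole vocabulary.

    `vectors` is a (V, dim) matrix of unit-normalized word vectors and
    `vocab` maps each word to its row index (illustrative names)."""
    target = vectors[vocab[c]] + vectors[vocab[b]] - vectors[vocab[a]]
    target /= np.linalg.norm(target)
    sims = vectors @ target          # cosine similarity with EVERY word
    if exclude_given:                # otherwise B or C almost always wins
        for w in (a, b, c):
            sims[vocab[w]] = -np.inf
    inv = {i: w for w, i in vocab.items()}
    return inv[int(np.argmax(sims))]
```

Note that the single matrix-vector product `vectors @ target` touches every word in the vocabulary, which is exactly the cost the proposed method avoids.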
Another limitation of the traditional metric is its deficiency in comprehensiveness. In a typical "A to B is as C to D" analogy, there are in fact 4 prediction choices, although in some analogies like "Nation-Currency" or "Nation-Language" the available choices drop to 2, since in the reverse logic the answer is not unique. A single prediction of D is not enough to represent the quality of all 4 trained word vectors.
For better illustration, we run the widely used "GoogleNews-vectors-negative300.bin" on the default Word2Vec English analogy test and extract two examples into Table 1. All 4 words in the example analogy questions are predicted, and the top 4 results are presented accordingly. From Table 1 it is clear that, whether by the absolute value or by the average ranking of the desired word in the predictions, the situation in Grammar-2 is apparently better than in Grammar-1. However, because only word D is predicted by the traditional metric, Grammar-1 returns a positive result while Grammar-2 returns a negative one, which obviously fails to correctly represent the quality of the corresponding trained word vectors.
In the default Word2Vec analogy test, there is always another analogy question which in fact predicts word B of the original question. But there is no reverse-logic prediction for A and C. So in the final accuracy calculation, the two sets of words in Table 1 contribute the same precision of 0.5, which still cannot reflect the quality difference between these two sets of trained word vectors. Perhaps 4 analogy questions are needed, but that would lead to another issue: higher complexity. Every time we search for a nearest word, cosine similarity must be calculated against each word in the vocabulary. When the test set is large, this may take quite a long time, and the time would be doubled if all 4 possible questions were included. Moreover, the majority of words in the vocabulary are actually unrelated to the prediction target; computing over these words simply wastes time.
For all the above reasons, we aim to offer an alternative metric for word analogy evaluation, by constructing a new analogy space based on the relation vectors obtained from analogy questions, in order to solve the existing limitations in quantification, comprehensiveness and complexity.

Analogy Space Evaluation
The proposed analogy space shares the dimensionality of the original word vector space. For each analogy question, two relation vectors can be found in the original word vector space, just as in the definition of PAIRDIRECTION by Levy and Goldberg (2014). Mathematically, the value of such a relation vector equals the position of its ending point if we take the starting point as the space origin. Figure 1 illustrates this process with several example words from "Nation-Capital" and "Nation-Language" analogies (extracted from the same test as Table 1, visualized by PCA). Naturally, we expect relations with the same or similar logic to gather together in the analogy space. In order to quantitatively evaluate the similarity, we prepare four different measurements, based on Cosine similarity and Euclidean distance respectively. If we denote the vectors of words A, B, C and D as a, b, c and d, the relation vectors are b − a and d − c, with Cos. ∈ [−1, 1] and Euc. normalized into [0, 1]. N-Cos. and N-Euc. have similar definitions, but use unit word vectors in the calculation.

Table 2 shows the results for the examples mentioned in Table 1 and Figure 1. Among them, "NC:DE-CN" and "NL:DE-CN" succeed 2/2 in the traditional nearest-word evaluation, while all others achieve 1/2. It is clear that the proposed measurements can better represent the quality of the involved words or relations in a quantitative way. As already mentioned, the words in Grammar-2 are considered better trained than those in Grammar-1, and this difference is captured only by the proposed measurements. And for the NC and NL questions, the traditional metric reports exactly the same accuracy, even though the detailed similarities differ considerably. We believe these phenomena could help word analogy evaluation in the micro aspect.
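The four measurements can be sketched as below. Function names are ours, and the normalization of the Euclidean distance into [0, 1] by the sum of the two relation norms is one plausible choice consistent with the stated range, not necessarily the exact formula used here.

```python
import numpy as np

def analogy_space_scores(a, b, c, d):
    """Traversal-free measurements on the relation vectors
    r1 = b - a and r2 = d - c.  Cos. lies in [-1, 1]; Euc. is the
    Euclidean distance divided by the sum of the two relation norms,
    which the triangle inequality bounds into [0, 1].  N-Cos. and
    N-Euc. repeat the computation with unit word vectors."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    def euc(u, v):
        return float(np.linalg.norm(u - v) /
                     (np.linalg.norm(u) + np.linalg.norm(v)))
    unit = lambda v: v / np.linalg.norm(v)
    r1, r2 = b - a, d - c
    n1, n2 = unit(b) - unit(a), unit(d) - unit(c)
    return {"Cos": cos(r1, r2), "Euc": euc(r1, r2),
            "N-Cos": cos(n1, n2), "N-Euc": euc(n1, n2)}
```

Only the four word vectors of one question are touched, so the cost per question is constant in the vocabulary size.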

Macro Experiments
In this section, we run experiments on complete analogy question sets and discuss complexity. For English word vectors, we trained two sets on a Wikipedia dump with different window sizes (w) and iterations (i) using the Skip-Gram model, with the same dimensionality of 300. They are further compared with the public GoogleNews set. We evaluate these sets with the proposed measurements, alongside the traditional analogy evaluation result and the performance of a downstream application: Sentence Boundary Detection (SBD). Details of the SBD implementation can be found in the references (Che et al., 2016a,b). Besides the English tests, we also conducted several tests in German. The Leipzig dataset (Goldhahn et al., 2012) is used to train German word vectors with the Word2Vec toolkit. The vectors trained under different configurations are then evaluated with a set of analogy questions, which contains 2834 semantic questions in 18 categories (including some reverse logics) and 77886 syntactic questions in 9 categories. We have uploaded these German analogy questions for public access 1 .

Table 3 shows the results and time expenditures of these experiments. It is clear that the proposed measurements follow the same trend as the traditional metric: once set X achieves a better result than set Y in the traditional test, it also does better under the proposed alternatives. Performance in the downstream application SBD also fits this trend in general. Meanwhile, the proposed evaluation saves a significant amount of time, approximately 95%. These facts show that we can achieve the same evaluative power in far less time.
However, we also found some limitations. The absolute difference between different vector sets

Conclusion & Discussion
In this paper, we discuss some limitations of the traditional word analogy evaluation metric in word vector evaluation, and then propose a simple alternative called "Analogy Space Evaluation", which directly measures the relation vectors between the given pairs of words instead of traversing the vocabulary to seek the nearest word to the target. Experiments show that the proposed approach performs as well as the traditional metric, while reducing the computational complexity significantly.
This effort can be simply applied to any existing word analogy task. Frankly speaking, we cannot claim that our method outperforms the original, except in terms of complexity. But complexity does matter. Current analogy tasks generally contain tens of thousands of questions, so traditional traversal-based evaluation can still manage. However, we would certainly want to test a higher portion of the words in the vocabulary, and with the efforts of the whole community, we may someday have a "nearly optimized" test set with up to a million words involved. At that point, being traversal-free would be a highly desirable quality.
As far as we know, there is no widely acknowledged benchmark which can be used to test new evaluation methods, so our effort remains an estimation. In the future, we will attempt to implement more real applications, like the SBD mentioned in this paper, and take their performance as feedback, in order to contribute to this dilemma of the "Evaluation of Evaluation".