Problems With Evaluation of Word Embeddings Using Word Similarity Tasks

Lacking standardized extrinsic evaluation methods for vector representations of words, the NLP community has relied heavily on word similarity tasks as a proxy for intrinsic evaluation of word vectors. Word similarity evaluation, which correlates the distance between vectors with human judgments of semantic similarity, is attractive because it is computationally inexpensive and fast. In this paper, we present several problems associated with the evaluation of word vectors on word similarity datasets, and summarize existing solutions. Our study suggests that the use of word similarity tasks for evaluation of word vectors is not sustainable and calls for further research on evaluation methods.


Introduction
Despite the ubiquity of word vector representations in NLP, there is no consensus in the community on the best way to evaluate word vectors. The most popular intrinsic evaluation task is word similarity evaluation. In word similarity evaluation, a list of word pairs is provided along with their similarity ratings (as judged by human annotators). The task is to measure how well the notion of word similarity according to humans is captured by the word vector representations. Table 1 shows some word pairs along with their similarity judgments from WS-353 (Finkelstein et al., 2002), a popular word similarity dataset.
Let a, b be two words, and $\mathbf{a}, \mathbf{b} \in \mathbb{R}^D$ be their corresponding word vectors in a D-dimensional vector space. Word similarity in the vector-space can be obtained by computing the cosine similarity between the word vectors of a pair of words:
$$\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|} \qquad (1)$$
where $\|\mathbf{a}\|$ is the $\ell_2$-norm of the vector, and $\mathbf{a} \cdot \mathbf{b}$ is the dot product of the two vectors. Once the vector-space similarity between the words is computed, we obtain two lists of word pairs, one sorted by vector-space similarity and one sorted by human similarity ratings. Computing Spearman's correlation (Myers and Well, 1995) between these ranked lists provides some insight into how well the learned word vectors capture intuitive notions of word similarity. Word similarity evaluation is attractive because it is computationally inexpensive and fast, leading to faster prototyping and development of word vector models.
The origin of word similarity tasks can be traced back to Rubenstein and Goodenough (1965), who constructed a list of 65 word pairs with annotations of human similarity judgment. They created this dataset to validate the distributional hypothesis (Harris, 1954), according to which the meaning of words is evidenced by the contexts they occur in. They found a positive correlation between contextual similarity and human-annotated similarity of word pairs. Since then, the lack of a standard evaluation method for word vectors has led to the creation of several ad hoc word similarity datasets. Table 2 provides a list of such benchmarks obtained from wordvectors.org (Faruqui and Dyer, 2014a).
In this paper, we give a comprehensive analysis of the problems that are associated with the evaluation of word vector representations using word similarity tasks (see footnote 1). We survey existing literature to construct a list of such problems and also summarize existing solutions to some of the problems.
Our findings suggest that word similarity tasks are not appropriate for evaluating word vector representations, and call for further research on better evaluation methods.
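The evaluation procedure described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a reference implementation: the function names, the toy vectors, and the toy ratings are all invented for the example and are not taken from WS-353 or any other dataset. Skipping pairs with out-of-vocabulary words is also an assumption of this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity (eq. 1): dot product over the product of l2-norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def average_ranks(values):
    """Rank values from 1..n; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman's correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

def evaluate(vectors, rated_pairs):
    """Correlate model cosine similarities with human ratings.

    Pairs containing a word without a vector are skipped (an assumption).
    """
    model_scores, human_scores = [], []
    for w1, w2, rating in rated_pairs:
        if w1 in vectors and w2 in vectors:
            model_scores.append(cosine(vectors[w1], vectors[w2]))
            human_scores.append(rating)
    return spearman(model_scores, human_scores)

# Toy data, invented for illustration only.
toy_vectors = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.8, 0.6, 0.0],
    "car": [0.0, 1.0, 0.0],
}
toy_pairs = [("cat", "dog", 9.0), ("dog", "car", 3.0), ("cat", "car", 1.0)]

print(evaluate(toy_vectors, toy_pairs))
```

Because only the rank order matters, any monotone rescaling of the human ratings or of the model scores leaves the resulting correlation unchanged.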

Problems
We now discuss the major issues with evaluation of word vectors using word similarity tasks, and present existing solutions (if available) to address them.

Subjectivity of the task
The notion of word similarity is subjective and is often confused with relatedness. For example, cup and coffee are related to each other, but not similar. Coffee refers to a plant (a living organism) or a hot brown drink, whereas cup is a manmade object that contains liquids, often coffee. Nevertheless, cup and coffee are rated more similar than pairs such as car and train in WS-353 (Finkelstein et al., 2002). Such anomalies are also found in recently constructed datasets like MEN (Bruni et al., 2012). Thus, such datasets unfairly penalize word vector models that capture the fact that cup and coffee are dissimilar.
Footnote 1: An alternative to correlation-based word similarity evaluation is the word analogy task, where the task is to find the missing word $b^*$ in the relation: a is to $a^*$ as b is to $b^*$, where b, $b^*$ are related by the same relation as a, $a^*$. For example, king : man :: queen : woman. Mikolov et al. (2013b) showed that this problem can be solved using the vector offset method: $b^* \approx b - a + a^*$. Levy and Goldberg (2014a) show that solving this equation is equivalent to computing a linear combination of word similarities between the query word $b^*$ and the given words a, $a^*$, and b. Thus, the results we present in this paper naturally extend to the word analogy task.
In an attempt to address this limitation, Agirre et al. (2009) divided WS-353 into two sets containing word pairs exhibiting only either similarity or relatedness. Recently, Hill et al. (2014) constructed a new word similarity dataset (SimLex), which captures the degree of similarity between words and in which related words are considered dissimilar. Even though it is useful to separate the concepts of similarity and relatedness, it is not clear which one word vector models should be expected to capture.

Semantic or task-specific similarity?
Distributional word vector models capture some aspect of the co-occurrence statistics of the words in a language (Levy and Goldberg, 2014b; Levy et al., 2015). Therefore, to the extent that these models produce semantically coherent representations, this can be seen as evidence for the distributional hypothesis of Harris (1954). Thus, word embeddings like Skip-gram, CBOW, GloVe, and LSA (Turney and Pantel, 2010; Mikolov et al., 2013a; Pennington et al., 2014), which are trained on word co-occurrence counts, can be expected to capture semantic word similarity, and hence can be evaluated on word similarity tasks.
Word vector representations that are trained as part of a neural network to solve a particular task (apart from word co-occurrence prediction) are called distributed word embeddings (Collobert and Weston, 2008), and they are task-specific in nature. These embeddings capture task-specific word similarity. For example, if the task is POS tagging, two nouns such as cat and man might be considered similar by the model, even though they are not semantically similar. Thus, evaluating such task-specific word embeddings on word similarity can unfairly penalize them. This raises the question: what kind of word similarity should be captured by the model?

No standardized splits & overfitting
To obtain generalizable machine learning models, it is necessary to make sure that they do not overfit to a given dataset. Thus, datasets are usually partitioned into training, development, and test sets on which the model is trained, tuned, and finally evaluated, respectively (Manning and Schütze, 1999). Existing word similarity datasets are not partitioned into training, development, and test sets. Therefore, optimizing the word vectors to perform better at a word similarity task implicitly tunes on the test set and overfits the vectors to the task. On the other hand, if researchers decide to perform their own splits of the data, the results obtained across different studies may not be comparable. Furthermore, the average number of word pairs in the word similarity datasets is small (≈ 781, cf. Table 2), and partitioning them further into smaller subsets may produce unstable results.
We now present some of the solutions suggested by previous work to avoid overfitting of word vectors to word similarity tasks. Faruqui and Dyer (2014b) and Lu et al. (2015) evaluate word embeddings exclusively on word similarity and word analogy tasks. Faruqui and Dyer (2014b) tune their embeddings on one word similarity task and evaluate them on all other tasks. This ensures that their vectors are being evaluated on held-out datasets. Lu et al. (2015) propose to directly evaluate the generalization of a model by measuring the performance of a single model on a large gamut of tasks. This evaluation can be performed in two different ways: (1) choose the hyperparameters with the best average performance across all tasks, or (2) choose the hyperparameters that beat the baseline vectors on most tasks. By selecting the hyperparameters that perform well across a range of tasks, these methods ensure that the obtained vectors are generalizable. Stratos et al. (2015) divided each word similarity dataset individually into tuning and test sets and reported results on the test set.
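The two selection strategies described above can be sketched as follows. This is an illustrative sketch only: the function names, configuration labels, task names, and all scores are hypothetical and chosen so that the two strategies disagree; they are not results from Lu et al. (2015).

```python
def best_by_average(scores):
    """Strategy (1): the configuration with the highest mean score over tasks."""
    return max(scores, key=lambda cfg: sum(scores[cfg].values()) / len(scores[cfg]))

def best_by_wins(scores, baseline):
    """Strategy (2): the configuration that beats the baseline on most tasks."""
    def wins(cfg):
        return sum(1 for task, s in scores[cfg].items() if s > baseline[task])
    return max(scores, key=wins)

# Hypothetical correlation scores per configuration and task.
baseline = {"task_a": 0.60, "task_b": 0.60, "task_c": 0.60}
scores = {
    "config_x": {"task_a": 0.90, "task_b": 0.55, "task_c": 0.55},
    "config_y": {"task_a": 0.62, "task_b": 0.62, "task_c": 0.62},
}

print(best_by_average(scores))         # favors one large gain on a single task
print(best_by_wins(scores, baseline))  # favors small gains on every task
```

The toy numbers show why the choice of strategy matters: averaging rewards a configuration that excels on one task, whereas counting wins rewards consistent, if modest, improvements over the baseline.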

Low correlation with extrinsic evaluation
Word similarity evaluation measures how well the notion of word similarity according to humans is captured in the vector-space word representations. Word vectors that can capture word similarity might be expected to perform well on tasks that require an explicit notion of semantic similarity between words, such as paraphrasing and entailment. However, it has been shown that there is no strong correlation between the performance of word vectors on word similarity and on extrinsic NLP evaluation tasks such as text classification, parsing, and sentiment analysis (Tsvetkov et al., 2015; Schnabel et al., 2015). This absence of strong correlation between word similarity evaluation and downstream tasks calls for alternative approaches to evaluation.

Absence of statistical significance
There has been a consistent omission of statistical significance testing when measuring the difference in performance of two vector models on word similarity tasks. Statistical significance testing is important for validating metric gains in NLP (Berg-Kirkpatrick et al., 2012; Søgaard et al., 2014), specifically while solving non-convex objectives, where results obtained due to optimizer instability can often lead to incorrect inferences (Clark et al., 2011). The problem of statistical significance in word similarity evaluation was first systematically addressed by Shalaby and Zadrozny (2015), who used Steiger's test (Steiger, 1980) to compute how significant the difference between the rankings produced by two different models is against the gold ranking. However, their method needs the explicit ranked lists of words produced by the models and cannot work when provided only with the correlation ratio of each model with the gold ranking. This problem was solved by Rastogi et al. (2015), who also observed that the improvements shown on small word similarity datasets by previous work were insignificant. We now briefly describe their method for computing statistical significance for word similarity evaluation.
Let A and B be the rankings produced by two word vector models over a list of word pairs, and T be the human-annotated ranking. Let $r_{AT}$, $r_{BT}$ and $r_{AB}$ denote the Spearman's correlation between A : T, B : T and A : B, respectively, and $\hat{r}_{AT}$, $\hat{r}_{BT}$ and $\hat{r}_{AB}$ be their empirical estimates. Rastogi et al. (2015) introduce $\sigma^{r}_{p_0}$ as the minimum required difference for significance (MRDS), which satisfies the following:
$$(r_{AB} < r) \wedge (|\hat{r}_{BT} - \hat{r}_{AT}| < \sigma^{r}_{p_0}) \implies pval > p_0 \qquad (2)$$
Here pval is the probability of the test statistic under the null hypothesis that $r_{AT} = r_{BT}$, found using Steiger's test.
The above conditional ensures that if the empirical difference between the rank correlations of the scores of the competing methods to the gold ratings is less than $\sigma^{r}_{p_0}$, then either the true correlation between the competing methods is greater than r, or the null hypothesis of no difference has a p-value greater than $p_0$. $\sigma^{r}_{p_0}$ depends on the size of the dataset, $p_0$ and r, and Rastogi et al. (2015) present its values for common word similarity datasets. Reporting statistical significance in this way would help estimate the differences between word vector models.
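As a rough illustration of how the MRDS conditional might be applied when comparing two models, consider the sketch below. It is an assumption-laden simplification: the function name and return labels are invented, the empirical estimate of the correlation between the two models is used as a stand-in for the true correlation, and the threshold passed as `mrds` is a hypothetical placeholder rather than a value published by Rastogi et al. (2015).

```python
def mrds_verdict(r_at_hat, r_bt_hat, r_ab_hat, r, mrds):
    """Apply the MRDS conditional to two models A and B.

    r_at_hat, r_bt_hat: empirical correlations of A and B with the gold ranking.
    r_ab_hat: empirical correlation between A and B (stand-in for the true r_AB).
    r, mrds: the correlation bound and the MRDS threshold for the dataset
             (hypothetical values here; real ones must be looked up).
    """
    if r_ab_hat >= r:
        # The precondition r_AB < r is not met, so the conditional says nothing.
        return "inconclusive"
    if abs(r_at_hat - r_bt_hat) < mrds:
        # The conditional guarantees pval > p0: the gain is not significant.
        return "not significant"
    # The difference clears the minimum required for significance.
    return "possibly significant"

# Hypothetical numbers for illustration only.
print(mrds_verdict(0.70, 0.68, r_ab_hat=0.60, r=0.90, mrds=0.05))
print(mrds_verdict(0.75, 0.65, r_ab_hat=0.60, r=0.90, mrds=0.05))
```

The appeal of this check is that it needs only the reported correlations, not the full ranked lists produced by each model.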

Frequency effects in cosine similarity
The most common method of measuring the similarity between two words in the vector-space is to compute the cosine similarity between the corresponding word vectors. Cosine similarity implicitly measures the similarity between two unit-length vectors (eq. 1). This prevents any bias in favor of frequent words, whose vectors are longer because they are updated more often during training (Turian et al., 2010).
Ideally, if the geometry of the embedding space is primarily driven by semantics, the relatively small number of frequent words should be evenly distributed through the space, while the large number of rare words should cluster around related, but more frequent words. However, it has been shown that vector-spaces contain hubs, which are vectors that are close to a large number of other vectors in the space (Radovanović et al., 2010). This problem manifests in word vector-spaces in the form of words that have high cosine similarity with a large number of other words (Dinu et al., 2014). Schnabel et al. (2015) further refine this hubness problem to show that there exists a power-law relationship between the frequency-rank of a word and the frequency-rank of its neighbors. Specifically, they showed that the average rank of the 1000 nearest neighbors of a word follows:
$$\text{nn-rank} \approx 1000 \cdot \text{word-rank}^{0.17}$$
This shows that pairs of words which have similar frequency will be closer in the vector-space, thus showing higher word similarity than they should according to their word meaning. Even though newer word similarity datasets sample words from different frequency bins (Luong et al., 2013; Hill et al., 2014), this still does not solve the problem that cosine similarity in the vector-space gets polluted by frequency-based effects. Different distance normalization schemes have been proposed to downplay the frequency/hubness effect when computing nearest neighbors in the vector space (Dinu et al., 2014; Tomašev et al., 2011), but their applicability as an absolute measure of distance for word similarity tasks still needs to be investigated.
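To get a feel for the power-law relationship reported by Schnabel et al. (2015), the short sketch below evaluates the formula for a few frequency-ranks. The function name is invented for this example, and the sketch assumes that rank 1 denotes the most frequent word in the vocabulary.

```python
def expected_neighbor_rank(word_rank, scale=1000.0, exponent=0.17):
    """Average frequency-rank of a word's 1000 nearest neighbors,
    following nn-rank ~ 1000 * word-rank^0.17."""
    return scale * word_rank ** exponent

# Assumed convention: rank 1 is the most frequent word.
for rank in (1, 1000, 100000):
    print(rank, round(expected_neighbor_rank(rank)))
```

Under this relationship, the neighbors of a very frequent word are themselves frequent on average, while the neighbors of a rare word tend to be rarer, which is the frequency effect described above.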

Inability to account for polysemy
Many words have more than one meaning in a language. For example, the word bank can either correspond to a financial institution or to the land near a river. However, in WS-353, bank is given a similarity score of 8.5/10 to money, signifying that bank is a financial institution. Such an assumption of one sense per word is prevalent in many of the existing word similarity tasks, and it can incorrectly penalize a word vector model for capturing a specific sense of the word that is absent from the word similarity task.
To account for sense-specific word similarity, Huang et al. (2012) introduced the Stanford contextual word similarity dataset (SCWS), in which the task is to compute similarity between two words given the contexts they occur in. For example, the words bank and money should have a low similarity score given the contexts: "along the east bank of the river", and "the basis of all money laundering". Using cues from the word's context, the correct word-sense can be identified and the appropriate word vector can be used. Unfortunately, word senses are also ignored by the majority of the frequently used word vector models like Skip-gram and GloVe. However, there has been progress on obtaining multiple vectors per word-type to account for different word-senses (Reisinger and Mooney, 2010; Huang et al., 2012; Neelakantan et al., 2014; Jauhar et al., 2015; Rothe and Schütze, 2015).
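One simple way to use context cues for sense selection can be sketched as follows. This is a deliberately simplified illustration and not the method of any of the cited papers: it assumes each word has a small set of sense vectors, represents a context by the mean of its word vectors, and picks the sense with the highest cosine similarity to that mean. All names and vectors are invented toy values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def pick_sense(sense_vectors, context_vectors):
    """Return the sense label whose vector is closest to the mean context vector."""
    context = mean_vector(context_vectors)
    return max(sense_vectors, key=lambda s: cosine(sense_vectors[s], context))

# Toy two-dimensional vectors, invented for illustration.
bank_senses = {"bank_finance": [1.0, 0.0], "bank_river": [0.0, 1.0]}
word_vectors = {"east": [0.2, 0.8], "river": [0.1, 0.9], "money": [0.9, 0.1]}

# Context: "along the east bank of the river"
sense = pick_sense(bank_senses, [word_vectors["east"], word_vectors["river"]])
print(sense)
print(cosine(bank_senses[sense], word_vectors["money"]))
```

With the river sense selected, the similarity between bank and money comes out low, which matches the judgment a contextual dataset such as SCWS is designed to elicit.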

Conclusion
In this paper we have identified problems associated with word similarity evaluation of word vector models, and reviewed existing solutions wherever possible. Our study suggests that the use of word similarity tasks for evaluation of word vectors can lead to incorrect inferences and calls for further research on evaluation methods.
Until a better solution is found for intrinsic evaluation of word vectors, we suggest task-specific evaluation: word vector models should be compared on how well they can perform on a downstream NLP task. Although task-specific evaluation produces different rankings of word vector models for different tasks (Schnabel et al., 2015), this is not necessarily a problem because different vector models capture different types of information which can be more or less useful for a particular task.