Specializing Word Embeddings for Similarity or Relatedness

We demonstrate the advantage of specializing semantic word embeddings for either similarity or relatedness. We compare two variants of retrofitting and a joint-learning approach, and find that all three yield specialized semantic spaces that capture human intuitions regarding similarity and relatedness better than unspecialized spaces. We also show that using specialized spaces leads to clear improvements in document classification and synonym selection, two NLP tasks that rely on either similarity or relatedness but not both.


Introduction
Most current models of semantic word representation exploit the distributional hypothesis: the idea that words occurring in similar contexts have similar meanings (Harris, 1954; Turney and Pantel, 2010; Clark, 2015). Such representations (or embeddings) can reflect human intuitions about similarity and relatedness (Turney, 2006; Agirre et al., 2009), and have been applied to a wide variety of NLP tasks, including bilingual lexicon induction (Mikolov et al., 2013b), sentiment analysis (Socher et al., 2013) and named entity recognition (Turian et al., 2010; Guo et al., 2014).
Arguably, one of the reasons behind the popularity of word embeddings is that they are "general purpose": they can be used in a variety of tasks without modification. Although this behavior is sometimes desirable, it may in other cases be detrimental to downstream performance. For example, when classifying documents by topic, we are particularly interested in related words rather than similar ones: knowing that dog is associated with cat is much more informative of the topic than knowing that it is a synonym of canine. Conversely, if our embeddings indicate that table is closely related to chair, that does not mean we should translate table into French as chaise.
This distinction between "genuine" similarity and associative similarity (i.e., relatedness) is well-known in cognitive science (Tversky, 1977). In NLP, however, semantic spaces are generally evaluated on how well they capture both similarity and relatedness, even though, for many word combinations (such as car and petrol), these two objectives are mutually incompatible (Hill et al., 2014b). In part, this oversight stems from the distributional hypothesis itself: car and petrol do not have the same, or even very similar, meanings, but these two words may well occur in similar contexts. Corpus-driven approaches based on the distributional hypothesis therefore generally learn embeddings that capture both similarity and relatedness reasonably well, but neither perfectly.
In this work we demonstrate the advantage of specializing semantic spaces for either similarity or relatedness. Specializing for similarity is achieved by learning from both a corpus and a thesaurus, and for relatedness by learning from both a corpus and a collection of psychological association norms. We also compare the recently introduced technique of graph-based retrofitting (Faruqui et al., 2015) with a skip-gram retrofitting and a skip-gram joint-learning approach. All three methods yield specialized semantic spaces that capture human intuitions regarding similarity and relatedness significantly better than unspecialized spaces, in one case yielding state-of-the-art results for word similarity. More importantly, we show clear improvements in downstream tasks and applications: specialized similarity spaces improve synonym detection, while association spaces work better than both general-purpose and similarity-specialized spaces for document classification.

Approach
The underlying assumption of our approach is that, during training, word embeddings can be "nudged" in a particular direction by including information from an additional semantic data source. For directing embeddings towards genuine similarity, we use the MyThes thesaurus developed by the OpenOffice.org project. It contains synonyms for almost 80,000 words in English. For directing embeddings towards relatedness, we use the University of South Florida (USF) free association norms (Nelson et al., 2004). This dataset contains scores for free association (an experimental measure of cognitive association) of over 10,000 concept words. For raw text data we use a dump of the English Wikipedia plus newswire text (8 billion words in total).

Evaluations (Intrinsic and Extrinsic)
For intrinsic comparisons with human judgements, we evaluate on SimLex-999 (Hill et al., 2014b) (999 pairwise comparisons), which explicitly measures similarity, and MEN (Bruni et al., 2014) (3000 comparisons), which explicitly measures relatedness. We also consider two downstream tasks and applications. In the TOEFL synonym selection task (Landauer and Dumais, 1997), the objective is to select the correct synonym for a target word from a multiple-choice set of possible answers. For a more extrinsic evaluation, we use a document classification task based on the Reuters Corpus Volume 1 (RCV1) (Lewis et al., 2004). This dataset consists of over 800,000 manually categorized news articles.
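The intrinsic evaluation can be illustrated with a minimal sketch: compute the cosine similarity of each rated word pair and correlate those scores with the human ratings using Spearman's rank correlation. The function names, the toy vectors and ratings, and the choice to skip out-of-vocabulary pairs are assumptions made for illustration; they are not taken from the evaluation code used in this work.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank vectors.

    Assumes neither list consists of identical values only.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def evaluate(embeddings, rated_pairs):
    """Correlate model similarities with human scores.

    rated_pairs holds (word1, word2, human_score) triples; pairs with an
    out-of-vocabulary word are skipped (an assumption of this sketch).
    """
    model, human = [], []
    for w1, w2, score in rated_pairs:
        if w1 in embeddings and w2 in embeddings:
            model.append(cosine(embeddings[w1], embeddings[w2]))
            human.append(score)
    return spearman(model, human)


# Toy vectors and ratings, invented purely for illustration.
toy_embeddings = {
    "car": [1.0, 0.0],
    "automobile": [0.9, 0.1],
    "petrol": [0.5, 0.5],
    "banana": [0.0, 1.0],
}
toy_ratings = [
    ("car", "automobile", 9.0),
    ("car", "petrol", 4.0),
    ("car", "banana", 1.0),
]
rho = evaluate(toy_embeddings, toy_ratings)
```

In this toy case the model ranks the three pairs in the same order as the invented ratings, so the correlation is perfect; real datasets will of course produce lower values.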

Joint Learning
The standard skip-gram training objective for a sequence of training words $w_1, w_2, \ldots, w_T$ and a context size $c$ is the log-likelihood criterion:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)$$

where $p(w_{t+j} \mid w_t)$ is obtained via the softmax:

$$p(w_{t+j} \mid w_t) = \frac{\exp\left(u_{w_{t+j}}^{\top} v_{w_t}\right)}{\sum_{w'} \exp\left(u_{w'}^{\top} v_{w_t}\right)}$$

where $u_w$ and $v_w$ are the context and target vector representations for word $w$, respectively, and $w'$ ranges over the full vocabulary (Mikolov et al.).

In the all condition, all additional contexts for a target word are added at each occurrence. The set of additional contexts $A_{w_t}$ contains the relevant contexts for a word $w_t$; e.g., for the word dog, $A_{\text{dog}}$ for the thesaurus is the set of all synonyms of dog in the thesaurus.
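To make the role of the additional contexts concrete, the sketch below generates the (target, context) training pairs a skip-gram model would see when corpus contexts are supplemented with contexts from an external resource. Each pair contributes one log p(context | target) term to the objective. The function name `training_pairs`, the `mode` argument, and the use of uniform sampling in the sampling condition are assumptions of this illustration rather than details confirmed by the text.

```python
import random


def training_pairs(tokens, extra_contexts, window=2, mode="all", rng=None):
    """Return (target, context) pairs for skip-gram with additional contexts.

    tokens: the corpus as a list of words.
    extra_contexts: dict mapping a word w to its additional contexts A_w
        (e.g. thesaurus synonyms or free associates).
    mode: "all" adds every member of A_w at each occurrence of w;
        "sample" adds a single member, drawn uniformly (an assumption).
    """
    rng = rng or random.Random(0)
    pairs = []
    for t, target in enumerate(tokens):
        # Ordinary corpus contexts within the window, excluding the target.
        lo, hi = max(0, t - window), min(len(tokens), t + window + 1)
        for j in range(lo, hi):
            if j != t:
                pairs.append((target, tokens[j]))
        # Additional contexts from the external semantic resource.
        extras = extra_contexts.get(target, [])
        if not extras:
            continue
        if mode == "all":
            pairs.extend((target, w) for w in extras)
        elif mode == "sample":
            pairs.append((target, rng.choice(extras)))
        else:
            raise ValueError("mode must be 'all' or 'sample'")
    return pairs


# Toy corpus and resource, invented for illustration.
toy_tokens = ["the", "dog", "barked"]
toy_resource = {"dog": ["hound", "canine"]}

pairs_all = training_pairs(toy_tokens, toy_resource, window=1, mode="all")
pairs_sample = training_pairs(toy_tokens, toy_resource, window=1, mode="sample")
```

With a window of one, the toy corpus yields four ordinary pairs; the all condition adds both synonyms of dog, while the sampling condition adds only one of them.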

Retrofitting
Faruqui et al. (2015) introduced retrofitting as a post-hoc graph-based learning objective that improves learned word embeddings. We experiment with their method, calling it graph-based retrofitting. In addition, we introduce a similar approach that instead uses the same objective function that was used to learn the original skip-gram embeddings. In other words, we first train a standard skip-gram model, and then learn from the additional contexts in a second training stage, as if they formed a separate corpus. We call this approach skip-gram retrofitting. In all cases, our embeddings have 300 dimensions, which has been found to work well (Mikolov et al., 2013a).
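For orientation, graph-based retrofitting can be sketched as an iterative update that pulls each word vector towards the vectors of its neighbours in the semantic resource while keeping it close to its original value. The sketch below assumes equal weighting between the original vector and the mean of the neighbours, which is our reading of the default setting described by Faruqui et al. (2015); the function name, the number of iterations and the toy data are hypothetical.

```python
def retrofit(embeddings, neighbours, iterations=10):
    """Graph-based retrofitting sketch.

    embeddings: dict mapping words to their original vectors (lists of floats).
    neighbours: dict mapping words to related words in the semantic resource.
    Each update sets a vector to the average of its original value and the
    mean of its neighbours' current vectors (weights are an assumption).
    """
    new = {w: list(v) for w, v in embeddings.items()}
    for _ in range(iterations):
        for word, original in embeddings.items():
            nbrs = [n for n in neighbours.get(word, []) if n in new]
            if not nbrs:
                continue  # words without in-vocabulary neighbours stay fixed
            dim = len(original)
            neighbour_mean = [
                sum(new[n][d] for n in nbrs) / len(nbrs) for d in range(dim)
            ]
            new[word] = [
                (original[d] + neighbour_mean[d]) / 2.0 for d in range(dim)
            ]
    return new


def squared_distance(u, v):
    """Squared Euclidean distance, used here only to inspect the result."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Toy vectors and synonym graph, invented for illustration.
toy_embeddings = {
    "happy": [1.0, 0.0],
    "glad": [0.0, 1.0],
    "table": [5.0, 5.0],
}
toy_graph = {"happy": ["glad"], "glad": ["happy"]}

retrofitted = retrofit(toy_embeddings, toy_graph, iterations=1)
```

After a single pass the two synonyms move closer together, while the word without neighbours keeps its original vector. Skip-gram retrofitting differs in that it reuses the skip-gram objective on the additional contexts instead of this averaging update.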

Results for Intrinsic Evaluation
We compare standard skip-gram embeddings with retrofitted and jointly learned specialized embeddings, as well as with "fitted" embeddings that were randomly initialized and learned only from the additional semantic resource. In each case, the training algorithm was run for a single iteration (results from more iterations are presented later). As shown in Table 1, embeddings that were specialized for similarity using a thesaurus perform better on SimLex-999, and those specialized for relatedness using association data perform better on MEN. Fitting, or learning only from the additional semantic resource without access to raw text, does not perform well. Skip-gram retrofitting with the thesaurus performs best on SimLex-999; joint learning with sampling from the USF norms performs best on MEN, although the two retrofitting approaches are close. There is an interesting difference between the two joint learning approaches: while sampling a single free associate as additional context works best for relatedness, presenting all additional contexts (synonyms) works best for similarity. In both cases, skip-gram retrofitting matches or outperforms graph-based retrofitting.
More training iterations. All the results above were obtained using a single training iteration. When retrofitting, however, it is easy to learn from multiple iterations of the thesaurus or the USF norms. The results are shown in Figure 1, where the dashed lines are the joint learning and standard skip-gram results for comparison with retrofitting scores. As would be expected, too many iterations lead to overfitting on the semantic resource, with performance eventually decreasing after the initial increase. The results show that retrofitting is particularly useful for similarity, as indicated by the large increase in performance on SimLex-999. The highest performance obtained, at 5 iterations, is a Spearman $\rho_s$ correlation of 0.53, which, as far as we know, matches the current state of the art. For relatedness-specific embeddings, the effect is less clear: joint learning performs comparatively much better. Retrofitting does outperform it, at around 2-10 iterations on the USF norms, but the improvement is marginal. The highest retrofitting score is 0.74; the highest joint learning score is 0.72. Both are highly competitive results on MEN, and outperform e.g. GloVe at 0.71 (Pennington et al., 2014). Joint learning with a thesaurus, however, leads to poor performance on MEN, as expected: the embeddings get dragged away from relatedness and towards similarity.

Curriculum learning?
The fact that joint learning works better when supplementing raw text input with free associates, but skip-gram retrofitting works better with additional thesaurus information, could be due to curriculum learning effects (Bengio et al., 2009). Unlike the USF norms, many of the words from the thesaurus are unusual and have low frequency. This suggests that the thesaurus is more 'advanced' (from the perspective of the learning model) than the USF norms as an information source. Its information may be detrimental to model optimization when encountered early in training (in the joint learning condition) because the model has not acquired the basic concepts on which it builds. However, with retrofitting the model first acquires good representations for frequent words from the raw text, after which it can better understand, and learn from, the information in the thesaurus.

TOEFL Synonym Task
Unsupervised synonym selection has many applications including the generation of thesauri and other lexical resources from raw text (Kageura et al., 2000). In the well-known TOEFL evaluation (Freitag et al., 2005) models are required to identify true synonyms to question words from a selection of possible answers. To test our models on this task, for each question in the dataset, we rank the multiple-choice answers according to the cosine similarity between their word embeddings and that of the target word, and choose the highest-ranked option.

Table 2: TOEFL synonym selection and document classification accuracy (percentage of correctly answered questions/correctly classified documents).
As Table 2 shows, similarity-specialized embeddings perform much better than standard embeddings and relatedness-specialized embeddings. Retrofitting outperforms joint learning, and skip-gram retrofitting matches or outperforms graph-based retrofitting.
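The ranking procedure used for this task can be sketched in a few lines: score each candidate answer by its cosine similarity to the target word and pick the highest-scoring one. The function names, the question format and the toy vectors are assumptions made for illustration, as is the decision to ignore candidates that have no embedding.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_synonym(embeddings, target, candidates):
    """Return the candidate most similar to the target, or None.

    Candidates without an embedding are ignored (an assumption here).
    """
    known = [c for c in candidates if c in embeddings]
    if target not in embeddings or not known:
        return None
    return max(known, key=lambda c: cosine(embeddings[target], embeddings[c]))


def accuracy(embeddings, questions):
    """Percentage of questions answered correctly.

    questions holds (target, candidates, correct_answer) triples.
    """
    correct = sum(
        1
        for target, candidates, gold in questions
        if select_synonym(embeddings, target, candidates) == gold
    )
    return 100.0 * correct / len(questions)


# Toy vectors and a single invented question, for illustration only.
toy_embeddings = {
    "big": [1.0, 0.1],
    "large": [0.9, 0.2],
    "small": [-0.8, 0.3],
    "red": [0.1, 1.0],
    "table": [0.3, -0.9],
}
toy_questions = [("big", ["large", "small", "red", "table"], "large")]

choice = select_synonym(toy_embeddings, "big", ["large", "small", "red", "table"])
score = accuracy(toy_embeddings, toy_questions)
```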

Document Classification
To investigate how well the various semantic spaces perform on document classification, we first construct document-level representations by summing the vector representations for all words in a given document. After setting aside a small development set for tuning the hyperparameters of the supervised algorithm, we train a support vector machine (SVM) classifier with a linear kernel and evaluate document topic classification accuracy using ten-fold cross-validation.
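The document representation described above can be sketched as follows: each document vector is the element-wise sum of the embeddings of its words. The function name, the toy vocabulary and the decision to skip words without an embedding are assumptions of this illustration. The resulting vectors would then be passed to any linear-kernel SVM implementation, which is not shown here.

```python
def document_vector(tokens, embeddings, dim):
    """Sum the embeddings of all tokens in a document.

    Tokens without an embedding are skipped (an assumption of this sketch);
    repeated tokens contribute once per occurrence.
    """
    total = [0.0] * dim
    for token in tokens:
        vector = embeddings.get(token)
        if vector is None:
            continue
        for d in range(dim):
            total[d] += vector[d]
    return total


# Toy two-dimensional embeddings and document, invented for illustration.
toy_embeddings = {"dog": [1.0, 0.0], "cat": [0.0, 2.0]}
toy_document = ["dog", "cat", "dog", "zebra"]

doc_vec = document_vector(toy_document, toy_embeddings, dim=2)
```

Because the representation is a plain sum, longer documents produce vectors with larger norms; whether to normalize them before classification is a design choice that the text does not specify.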
The results are reported in the rightmost column of Table 2. Relatedness-specialized embeddings perform better on document topic classification than similarity embeddings, except with graph-based retrofitting, which in fact performs below the standard skip-gram model. The joint-learning model with all relevant free association norms presented as context for each target word is the best performing model. The differences in the table appear small, but the dataset contains more than 10,000 documents, so every percentage point is worth more than 100 documents.

Conclusion
We have demonstrated the advantage of specializing embeddings for the tasks of genuine similarity and relatedness. In doing so, we compared two retrofitting methods and a joint learning approach. Specialized embeddings outperform standard embeddings by a large margin on intrinsic similarity and relatedness evaluations. We showed that the difference in how embeddings are specialized carries over to downstream NLP tasks, demonstrating that similarity embeddings are better at the TOEFL synonym selection task and relatedness embeddings at a document topic classification task. Lastly, we varied the number of iterations that we use for retrofitting, showing that performance could be improved even further by going over several iterations of the semantic resource.
(306920) and EPSRC grant EP/I037512/1. We thank Yoshua Bengio, Kyunghyun Cho and Ivan Vulić for useful discussions and the anonymous reviewers for their helpful comments.