Take and Took, Gaggle and Goose, Book and Read: Evaluating the Utility of Vector Differences for Lexical Relation Learning

Recent work on word embeddings has shown that simple vector subtraction over pre-trained embeddings is surprisingly effective at capturing different lexical relations, despite lacking explicit supervision. Prior work has evaluated this intriguing result using a word analogy prediction formulation and hand-selected relations, but the generality of the finding over a broader range of lexical relation types and different learning settings has not been evaluated. In this paper, we carry out such an evaluation in two learning settings: (1) spectral clustering to induce word relations, and (2) supervised learning to classify vector differences into relation types. We find that word embeddings capture a surprising amount of information, and that, under suitable supervised training, vector subtraction generalises well to a broad range of relations, including over unseen lexical items.


Introduction
Learning to identify lexical relations is a fundamental task in natural language processing (NLP), and can contribute to many NLP applications including paraphrasing and generation, machine translation, and ontology building (Banko et al., 2007; Hendrickx et al., 2010).
Recently, attention has been focused on identifying lexical relations using word embeddings, which are dense, low-dimensional vectors obtained either from a "predict-based" neural network trained to predict word contexts, or a "count-based" traditional distributional similarity method combined with dimensionality reduction. The skipgram model of Mikolov et al. (2013a) and other similar language models have been shown to perform well on an analogy completion task (Mikolov et al., 2013b; Mikolov et al., 2013c; Levy and Goldberg, 2014a), in the space of relational similarity prediction (Turney, 2006), where the task is to predict the missing word in analogies such as A:B :: C: -?-. A well-known example involves predicting the vector queen from the vector combination king − man + woman, where linear operations on word vectors appear to capture the lexical relation governing the analogy, in this case OPPOSITE-GENDER. The results extend to several semantic relations such as CAPITAL-OF (paris − france + poland ≈ warsaw) and morphosyntactic relations such as PLURALISATION (cars − car + apple ≈ apples). Remarkably, since the model is not trained for this task, the relational structure of the vector space appears to be an emergent property.
The key operation in these models is vector difference, or vector offset. For example, the paris − france vector appears to encode CAPITAL-OF, presumably by cancelling out the features of paris that are France-specific, and retaining the features that distinguish a capital city (Levy and Goldberg, 2014a). The success of the simple offset method on analogy completion suggests that the difference vectors ("DIFFVEC" hereafter) must themselves be meaningful: their direction and/or magnitude encodes a lexical relation.
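The offset method can be illustrated with a toy example. The sketch below uses hand-made 3-dimensional vectors (invented for illustration, not real embeddings) and a hypothetical helper, `complete_analogy`, which returns the vocabulary word closest in cosine similarity to b − a + c, excluding the three query words.

```python
import math

# Toy 3-d "embeddings", invented purely for illustration (not real w2v vectors).
EMB = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.2, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def complete_analogy(a, b, c, emb=EMB):
    """Return the word d maximising cos(d, b - a + c), i.e. a:b :: c:d."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

# man:king :: woman:?  (king - man + woman)
answer = complete_analogy("man", "king", "woman")
```

With these toy vectors the offset king − man + woman lands exactly on queen; with real embeddings the match is only approximate, which is why the nearest neighbour is taken.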
Previous analogy completion tasks used with word embeddings have limited coverage of lexical relation types. Moreover, the task does not explore the full implications of DIFFVECs as meaningful vector space objects in their own right, because it only looks for a one-best answer to the particular lexical analogies in the test set. In this paper, we introduce a new, larger dataset covering many well-known lexical relation types from the linguistics and cognitive science literature. We then apply DIFFVECs to two new tasks: unsupervised and supervised relation extraction. First, we cluster the DIFFVECs to test whether the clusters map onto true lexical relations. We find that the clustering works remarkably well, although syntactic relations are captured better than semantic ones.
Second, we perform classification over the DIFFVECs and obtain remarkably high accuracy in a closed-world setting (over a predefined set of word pairs, each of which corresponds to a lexical relation in the training data). When we move to an open-world setting including random word pairs, many of which do not correspond to any lexical relation in the training data, the results are poor. We then investigate methods for better attuning the learned class representation to the lexical relations, focusing on methods for automatically synthesising negative instances. We find that this improves the model performance substantially.
We also find that hyper-parameter optimised count-based methods are competitive with predict-based methods under both clustering and supervised relation classification, in line with the findings of Levy et al. (2015a).

Background and Related Work
A lexical relation is a binary relation $r$ holding between a word pair $(w_i, w_j)$; for example, the pair (cart, wheel) stands in the WHOLE-PART relation. Relation learning in NLP includes relation extraction, relation classification, and relational similarity prediction. In relation extraction, related word pairs in a corpus and the relevant relation are identified. Given a word pair, the relation classification task involves assigning it to the correct relation from a pre-defined set. In the Open Information Extraction paradigm (Banko et al., 2007; Weikum and Theobald, 2010), also known as unsupervised relation extraction, the relations themselves are also learned from the text (e.g. in the form of text labels). On the other hand, relational similarity prediction involves assessing the degree to which a word pair (A, B) stands in the same relation as another pair (C, D), or completing an analogy A:B :: C: -?-. Relation learning is an important and long-standing task in NLP and has been the focus of a number of shared tasks (Girju et al., 2007; Hendrickx et al., 2010; Jurgens et al., 2012).
An exciting development, and the inspiration for this paper, has been the demonstration that vector difference over word embeddings (Mikolov et al., 2013c) can be used to model word analogy tasks. This has given rise to a series of papers exploring the DIFFVEC idea in different contexts. The original analogy dataset has been used to evaluate predict-based language models by Mnih and Kavukcuoglu (2013) and also Zhila et al. (2013), who combine a neural language model with a pattern-based classifier. Kim and de Marneffe (2013) use word embeddings to derive representations of adjective scales, e.g. hot-warm-cool-cold. Fu et al. (2014) similarly use embeddings to predict hypernym relations, in this case clustering words by topic to show that hypernym DIFFVECs can be broken down into more fine-grained relations. Neural networks have also been developed for joint learning of lexical and relational similarity, making use of the WordNet relation hierarchy (Bordes et al., 2013; Socher et al., 2013; Xu et al., 2014; Yu and Dredze, 2014; Faruqui et al., 2015; Fried and Duh, 2015).
Another strand of work responding to the vector difference approach has analysed the structure of predict-based embedding models in order to help explain their success on the analogy and other tasks (Levy and Goldberg, 2014a; Levy and Goldberg, 2014b; Arora et al., 2015). However, there has been no systematic investigation of the range of relations for which the vector difference method is most effective, although there have been some smaller-scale investigations in this direction. Makrai et al. (2013) divide antonym pairs into semantic classes such as quality, time, gender, and distance, finding that for about two-thirds of antonym classes, DIFFVECs are significantly more correlated than random. Necşulescu et al. (2015) train a classifier on word pairs, using word embeddings to predict coordinates, hypernyms, and meronyms. Roller and Erk (2016) analyse the performance of vector concatenation and difference on the task of predicting lexical entailment and show that vector concatenation overwhelmingly learns to detect Hearst patterns (e.g., including, such as). Köper et al. (2015) undertake a systematic study of morphosyntactic and semantic relations on word embeddings produced with word2vec ("w2v" hereafter; see §3.1) for English and German. They test a variety of relations including word similarity, antonyms, synonyms, hypernyms, and meronyms, in a novel analogy task. Although the set of relations tested by Köper et al. (2015) is somewhat more constrained than the set we use, there is a good deal of overlap. However, their evaluation is performed in the context of relational similarity, and they do not perform clustering or classification on the DIFFVECs.

General Approach and Resources
We define the task of lexical relation learning to take a set of (ordered) word pairs $\{(w_i, w_j)\}$ and a set of binary lexical relations $R = \{r_k\}$, and map each word pair $(w_i, w_j)$ as follows: (a) $(w_i, w_j) \to r_k \in R$, i.e. the "closed-world" setting, where we assume that all word pairs can be uniquely classified according to a relation in $R$; or (b) $(w_i, w_j) \to r_k \in R \cup \{\phi\}$, where $\phi$ signifies the fact that none of the relations in $R$ applies to the word pair in question, i.e. the "open-world" setting.
Our starting point for lexical relation learning is the assumption that important information about various types of relations is implicitly embedded in the offset vectors. While a range of methods have been proposed for composing word vectors (Baroni et al., 2012; Weeds et al., 2014; Roller et al., 2014), in this research we focus exclusively on DIFFVEC (i.e. $w_2 - w_1$). A second assumption is that there exist dimensions, or directions, in the embedding vector spaces responsible for a particular lexical relation. Such dimensions could be identified and exploited as part of a clustering or classification method, in the context of identifying relations between word pairs or classes of DIFFVECs.
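As a concrete illustration of the two assumptions, the following sketch computes DIFFVECs for ordered word pairs and compares their directions. The 2-dimensional vectors and the helper names `diffvec` and `cosine` are invented for this example; real embeddings have hundreds of dimensions and the alignment is far noisier.

```python
import math

# Toy 2-d vectors, invented for illustration only.
EMB = {
    "car": [1.0, 0.0], "cars": [1.0, 1.0],
    "apple": [3.0, 0.5], "apples": [3.0, 1.5],
    "france": [0.0, 2.0], "paris": [2.0, 2.5],
}

def diffvec(w1, w2, emb=EMB):
    """DIFFVEC for the ordered pair (w1, w2): the offset w2 - w1."""
    return [b - a for a, b in zip(emb[w1], emb[w2])]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

plural_a = diffvec("car", "cars")
plural_b = diffvec("apple", "apples")
capital = diffvec("france", "paris")

# Offsets for the same relation point the same way; a different relation does not.
same_relation = cosine(plural_a, plural_b)
different_relation = cosine(plural_a, capital)
```

In this toy space the two pluralisation offsets are identical, while the capital-of offset points in a largely different direction; clustering and classification over DIFFVECs both rely on this kind of directional regularity.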
In order to test the generalisability of the DIFFVEC method, we require: (1) word embeddings, and (2) a set of lexical relations to evaluate against. As the focus of this paper is not the word embedding pre-training approaches so much as the utility of the DIFFVECs for lexical relation learning, we take a selection of four pre-trained word embeddings with strong currency in the literature, as detailed in §3.1. We also include the state-of-the-art count-based approach of Levy et al. (2015a), to test the generalisability of DIFFVECs to count-based word embeddings.
For the lexical relations, we want a range of relations that is representative of the types of relational learning tasks targeted in the literature, and where there is availability of annotated data.To this end, we construct a dataset from a variety of sources, focusing on lexical semantic relations (which are less well represented in the analogy dataset of Mikolov et al. (2013c)), but also including morphosyntactic and morphosemantic relations (see §3.2).


Word Embeddings
We consider four highly successful word embedding models in our experiments: w2v (Mikolov et al., 2013a; Mikolov et al., 2013b), GloVe (Pennington et al., 2014), SENNA (Collobert and Weston, 2008), and HLBL (Mnih and Hinton, 2009), as detailed below. We also include SVD (Levy et al., 2015a), a count-based model which factorises a positive PMI (PPMI) matrix. For consistency of comparison, we train SVD as well as a version of w2v and GloVe (which we call w2v wiki and GloVe wiki, respectively) on the English Wikipedia corpus (comparable in size to the training data of SENNA and HLBL), and apply the preprocessing of Levy et al. (2015a). We additionally normalise the w2v wiki and SVD wiki vectors to unit length; GloVe wiki is natively normalised by column.

w2v CBOW (Continuous Bag-Of-Words; Mikolov et al. (2013a)) predicts a word from its context using a model with the objective:

$$J = \frac{1}{T} \sum_{i=1}^{T} \log \frac{\exp\left(w_i^\top \sum_{j \in [-c,+c],\, j \neq 0} \tilde{w}_{i+j}\right)}{\sum_{k=1}^{V} \exp\left(w_k^\top \sum_{j \in [-c,+c],\, j \neq 0} \tilde{w}_{i+j}\right)}$$

where $w_i$ and $\tilde{w}_i$ are the vector representations for the $i$th word (as a focus or context word, respectively), $V$ is the vocabulary size, $T$ is the number of tokens in the corpus, and $c$ is the context window size. Google News data was used to train the model. We use the focus word vectors, $W = \{w_k\}_{k=1}^{V}$, normalised such that each $\|w_k\| = 1$.

The GloVe model (Pennington et al., 2014) is based on a similar bilinear formulation, framed as a low-rank decomposition of the matrix of corpus co-occurrence frequencies:

$$J = \frac{1}{2} \sum_{i,j=1}^{V} f(P_{ij}) \left(w_i^\top w_j - \log P_{ij}\right)^2$$

where $w_i$ is a vector for the left context, $w_j$ is a vector for the right context, $P_{ij}$ is the relative frequency of word $j$ in the context of word $i$, and $f$ is a heuristic weighting function to balance the influence of high versus low term frequencies. The model was trained on English Wikipedia and the English Gigaword corpus version 5.
The SVD model (Levy et al., 2015a) uses a positive pointwise mutual information (PPMI) matrix defined as:

$$\mathrm{PPMI}(w, c) = \max\left(\log \frac{P(w, c)}{P(w)\,P(c)},\; 0\right)$$

where $P(w, c)$ is the joint probability of word $w$ and context $c$, and $P(w)$ and $P(c)$ are their marginal probabilities. The matrix is factorised by singular value decomposition.

HLBL (Mnih and Hinton, 2009) is a log-bilinear formulation of an n-gram language model, which predicts the $i$th word based on context words $(i - n, \ldots, i - 2, i - 1)$. This leads to the following training objective:

$$J = \frac{1}{T} \sum_{i=1}^{T} \log \frac{\exp\left(\tilde{w}_i^\top w_i + b_i\right)}{\sum_{k=1}^{V} \exp\left(\tilde{w}_i^\top w_k + b_k\right)}$$

where $\tilde{w}_i = \sum_{j=1}^{n-1} C_j w_{i-j}$ is the context embedding, $C_j$ is a scaling matrix, and $b_*$ is a bias term. (The subscripts of $w$ do double duty, denoting either the embedding for the $i$th token, $w_i$, or the $k$th word type, $w_k$.)
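The count-based pipeline behind the SVD model can be sketched in a few lines. The example below builds a PPMI matrix from a small co-occurrence count matrix and factorises it with a truncated SVD, weighting the singular values by an exponent (the "eigenvalue weighting" of Levy et al. (2015a)). The count matrix and the function names are hypothetical, NumPy is assumed to be available, and context distribution smoothing is deliberately omitted, so this is an illustration rather than the configuration used in the experiments.

```python
import numpy as np

def ppmi_matrix(counts):
    """PPMI(w, c) = max(log(P(w, c) / (P(w) P(c))), 0) from raw co-occurrence counts."""
    counts = np.asarray(counts, dtype=float)
    p_wc = counts / counts.sum()
    p_w = p_wc.sum(axis=1, keepdims=True)   # marginal probability of each word
    p_c = p_wc.sum(axis=0, keepdims=True)   # marginal probability of each context
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0            # unseen pairs get no association score
    return np.maximum(pmi, 0.0)

def svd_embeddings(ppmi, dim, eig=0.5):
    """Truncated SVD; word vectors are U_d * S_d**eig."""
    u, s, _ = np.linalg.svd(ppmi, full_matrices=False)
    return u[:, :dim] * (s[:dim] ** eig)

# Hypothetical 3-word x 3-context count matrix.
counts = [[4, 0, 1],
          [0, 3, 1],
          [1, 1, 2]]
ppmi = ppmi_matrix(counts)
vectors = svd_embeddings(ppmi, dim=2)
```

Each row of `vectors` is a dense word vector that can be used for DIFFVECs in the same way as the predict-based embeddings.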
The final model, SENNA (Collobert and Weston, 2008), was initially proposed for multi-task training of several language processing tasks, from language modelling through to semantic role labelling. Here we focus on the statistical language modelling component, which has a pairwise ranking objective to maximise the relative score of each word in its local context:

$$J = \frac{1}{T} \sum_{i=1}^{T} \sum_{k=1}^{V} \max\left[0,\; 1 - f(w_{i-c+1}, \ldots, w_{i-1}, w_i) + f(w_{i-c+1}, \ldots, w_{i-1}, w_k)\right]$$

where the last $c - 1$ words are used as context, and $f(x)$ is a non-linear function of the input, defined as a multi-layer perceptron.
For HLBL and SENNA, we use the pre-trained embeddings from Turian et al. (2010), trained on the Reuters English newswire corpus. In both cases, the embeddings were scaled by the global standard deviation over the word-embedding matrix, $W_{\text{scaled}} = 0.1 \times \frac{W}{\sigma(W)}$. For w2v wiki, GloVe wiki and SVD wiki we used English Wikipedia. We followed the same preprocessing procedure described in Levy et al. (2015a), i.e., lower-cased all words and removed non-textual elements. During the training phase, for each model we set a word frequency threshold of 5. For the SVD model, we followed the recommendations of Levy et al. (2015a) in setting the context window size to 2, negative sampling parameter to 1, eigenvalue weighting to 0.5, and context distribution smoothing to 0.75; other parameters were assigned their default values. For the other models we used the following parameter values: for w2v, context window = 8, negative samples = 25, hs = 0, sample = 1e-4, and iterations = 15; and for GloVe, context window = 15, $x_{\max}$ = 10, and iterations = 15.
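The two normalisation steps mentioned in this section, scaling by the global standard deviation and normalising vectors to unit length, are simple enough to sketch directly. The helper names and the tiny matrix are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import math
import statistics

def scale_by_global_std(matrix, factor=0.1):
    """W_scaled = factor * W / sigma(W), with sigma taken over all entries of W."""
    sigma = statistics.pstdev([x for row in matrix for x in row])  # assumed: population std
    return [[factor * x / sigma for x in row] for row in matrix]

def unit_normalise(matrix):
    """Rescale each row (one word vector per row) to unit L2 norm."""
    out = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / norm for x in row] if norm else list(row))
    return out

W = [[3.0, 4.0], [0.0, -2.0]]  # hypothetical 2-word embedding matrix
scaled = scale_by_global_std(W)
unit = unit_normalise(W)
```

After scaling, the entries of the matrix have a global standard deviation of 0.1; after unit normalisation, every row has length 1.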

Lexical Relations
In order to evaluate the applicability of the DIFFVEC approach to relations of different types, we assembled a set of lexical relations in three broad categories: lexical semantic relations, morphosyntactic paradigm relations, and morphosemantic relations. We constrained the relations to be binary and to have fixed directionality. Consequently we excluded symmetric lexical relations such as synonymy. We additionally constrained the dataset to the words occurring in all embedding sets. There is some overlap between our relations and those included in the analogy task of Mikolov et al. (2013c), but we include a much wider range of lexical semantic relations, especially those standardly evaluated in the relation classification literature. We manually filtered the data to remove duplicates (e.g., as part of merging the two sources of LEXSEM Hyper instances), and normalised directionality.
The final dataset consists of 12,458 triples $\langle \mathit{relation}, \mathit{word}_1, \mathit{word}_2 \rangle$, comprising 15 relation types, extracted from sources including SemEval'12 (Jurgens et al., 2012), BLESS (Baroni and Lenci, 2011), the MSR analogy dataset (Mikolov et al., 2013c), the light verb dataset of Tan et al. (2006a), and Princeton WordNet (Fellbaum, 1998).

Clustering
If DIFFVECs capture lexical relations, we would expect clustering to identify sets of word pairs with high relational similarity, or equivalently clusters of similar offset vectors. Under the additional assumption that a given word pair corresponds to a unique lexical relation (in line with our definition of the lexical relation learning task in §3), a hard clustering approach is appropriate. In order to test these assumptions, we cluster our 15-relation closed-world dataset in the first instance, and evaluate against the lexical resources in §3.2.
As further motivation, we projected the DIFFVEC space for a small number of samples of each class using t-SNE (Van der Maaten and Hinton, 2008), and found that many of the morphosyntactic relations (VERB 3, VERB Past, VERB 3Past, NOUN SP) form tight clusters (Figure 1).
We cluster the DIFFVECs between all word pairs in our dataset using spectral clustering (Von Luxburg, 2007). Spectral clustering has two hyperparameters: the number of clusters, and the pairwise similarity measure for comparing DIFFVECs. We tune the hyperparameters over development data, in the form of 15% of the data obtained by random sampling, selecting the configuration that maximises the V-Measure (Rosenberg and Hirschberg, 2007). Figure 2 presents V-Measure values over the test data for each of the word embedding models. We show results for different numbers of clusters, from N = 10 in steps of 10, up to N = 80 (beyond which the clustering quality diminishes). Observe that w2v achieves the best results, with a V-Measure value of around 0.36, which is relatively constant over varying numbers of clusters. GloVe and SVD mirror this result, but are consistently below w2v at a V-Measure of around 0.31. HLBL and SENNA performed very similarly, at a substantially lower V-Measure than w2v or GloVe, closer to 0.21. As a crude calibration for these results, over the related clustering task of word sense induction, the best-performing systems in SemEval-2010 Task 14 (Manandhar et al., 2010) achieved a V-Measure of under 0.2.
The lower V-Measure for w2v wiki and GloVe wiki (as compared to w2v and GloVe, respectively) indicates that the volume of training data plays a role in the clustering results. However, both methods still perform well above SENNA and HLBL, and w2v has a clear empirical advantage over GloVe. We note that SVD wiki performs almost as well as w2v wiki, consistent with the results of Levy et al. (2015a).
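For readers unfamiliar with the evaluation metric, V-Measure is the harmonic mean of homogeneity (each cluster contains only one gold relation) and completeness (each gold relation falls into one cluster), both defined via conditional entropy. The following is a minimal pure-Python sketch of the metric only; the spectral clustering step itself is not shown, and the function names are illustrative.

```python
import math
from collections import Counter

def _entropy(labels):
    """Entropy of a label sequence under the empirical distribution."""
    n = len(labels)
    return -sum((k / n) * math.log(k / n) for k in Counter(labels).values())

def _conditional_entropy(target, given):
    """H(target | given) computed from two parallel label sequences."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((k / n) * math.log(k / given_counts[g]) for (g, _), k in joint.items())

def v_measure(gold, clusters):
    """Harmonic mean of homogeneity and completeness (equal weighting)."""
    h_gold, h_clusters = _entropy(gold), _entropy(clusters)
    homogeneity = 1.0 if h_gold == 0 else 1.0 - _conditional_entropy(gold, clusters) / h_gold
    completeness = 1.0 if h_clusters == 0 else 1.0 - _conditional_entropy(clusters, gold) / h_clusters
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)

# Gold relation labels versus induced cluster ids for four word pairs.
perfect = v_measure(["a", "a", "b", "b"], [0, 0, 1, 1])
collapsed = v_measure(["a", "a", "b", "b"], [0, 0, 0, 0])
```

A clustering that reproduces the gold relations scores 1, while collapsing everything into a single cluster is perfectly complete but not homogeneous, and scores 0.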
We additionally calculated the entropy for each lexical relation, based on the distribution of instances belonging to a given relation across the different clusters (and simple MLE). For each embedding method, we present the entropy for the cluster size where V-Measure was maximised over the development data. Since the samples are distributed nonuniformly, we normalise entropy results for each method by log(n), where n is the number of samples in a particular relation. (V-Measure itself returns a value in the range [0, 1], with 1 indicating perfect homogeneity and completeness.)

Considering w2v embeddings, for VERB 3 there was a single cluster consisting of around 90% of VERB 3 word pairs. Most errors resulted from POS ambiguity, leading to confusion with VERB-NOUN in particular. Example VERB 3 pairs incorrectly clustered are: (study, studies), (run, runs), and (like, likes). This polysemy results in the distance represented in the DIFFVEC for such pairs being above average for VERB 3, and consequently in their being clustered with other cross-POS relations.
For VERB Past, a single relatively pure cluster was generated, with minor contamination due to pairs such as (hurt, saw), (utensil, saw), and (wipe, saw). Here, the noun saw is ambiguous with a high-frequency past-tense verb; hurt and wipe also have ambiguous POS.
A related phenomenon was observed for NOUN Coll, where the instances were assigned to a large mixed cluster containing word pairs where the second word referred to an animal, reflecting the fact that most of the collective nouns in our dataset relate to animals, e.g. (stand, horse), (ambush, tigers), (antibiotics, bacteria). This is interesting from a DIFFVEC point of view, since it shows that the lexical semantics of one word in the pair can overwhelm the semantic content of the DIFFVEC (something that we return to investigate in §5.4). LEXSEM Mero was also split into multiple clusters along topical lines, with separate clusters for weapons, dwellings, vehicles, etc.
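The per-relation entropy used in this analysis can be sketched as follows: for each relation, estimate the distribution of its instances over clusters by relative frequency (MLE), compute the entropy, and divide by log(n) for the n instances of that relation. The function name and the toy cluster assignments are hypothetical.

```python
import math
from collections import Counter, defaultdict

def relation_entropy(relations, clusters):
    """Per-relation entropy of the cluster distribution (MLE), normalised by log(n)."""
    by_relation = defaultdict(list)
    for rel, cl in zip(relations, clusters):
        by_relation[rel].append(cl)
    result = {}
    for rel, assigned in by_relation.items():
        n = len(assigned)
        ent = -sum((k / n) * math.log(k / n) for k in Counter(assigned).values())
        # A relation with a single instance has zero entropy by definition.
        result[rel] = ent / math.log(n) if n > 1 else 0.0
    return result

# Toy example: one relation kept in a single cluster, one scattered over four.
relations = ["VERB"] * 4 + ["MERO"] * 4
clusters = [0, 0, 0, 0, 1, 2, 3, 4]
ent = relation_entropy(relations, clusters)
```

A relation whose instances all land in one cluster has normalised entropy 0 (the purest case), while one whose instances each land in a different cluster has normalised entropy 1.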
Given the encouraging results from our clustering experiment, we next evaluate DIFFVECs in a supervised relation classification setting.

Classification
A natural question is whether we can accurately characterise lexical relations through supervised learning over the DIFFVECs. For these experiments we use the w2v, w2v wiki, and SVD wiki embeddings exclusively (based on their superior performance in the clustering experiment), and a subset of the relations which is both representative of the breadth of the full relation set, and for which we have sufficient data for supervised training and evaluation, namely: NOUN Coll, LEXSEM Event, LEXSEM Hyper, LEXSEM Mero, NOUN SP, PREFIX, VERB 3, VERB 3Past, and VERB Past (see Table 2).
We consider two applications: (1) a CLOSED-WORLD setting similar to the unsupervised evaluation, in which the classifier only encounters word pairs which correspond to one of the nine relations; and (2) a more challenging OPEN-WORLD setting where random word pairs, which may or may not correspond to one of our relations, are included in the evaluation. For both settings, we further investigate whether there is a lexical memorisation effect for a broad range of relation types of the sort identified by Weeds et al. (2014) and Levy et al. (2015b) for hypernyms, by experimenting with disjoint training and test vocabulary.

CLOSED-WORLD Classification
For the CLOSED-WORLD setting, we train and test a multiclass classifier on datasets comprising $\langle \text{DIFFVEC}, r \rangle$ pairs, where $r$ is one of our nine relation types, and DIFFVEC is based on one of w2v, w2v wiki and SVD. As a baseline, we cluster the data as described in §4, running the clusterer several times over the 9-relation data to select the optimal V-Measure value based on the development data, resulting in 50 clusters. We label each cluster with the majority class based on the training instances, and evaluate the resultant labelling for the test instances.
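The majority-class labelling used for the clustering baseline can be sketched as below. Cluster ids are assumed to come from a separate clustering step that is not shown, and the function names and toy labels are hypothetical.

```python
from collections import Counter, defaultdict

def majority_labels(train_clusters, train_relations):
    """Map each cluster id to the most frequent relation among its training instances."""
    votes = defaultdict(Counter)
    for cl, rel in zip(train_clusters, train_relations):
        votes[cl][rel] += 1
    return {cl: counts.most_common(1)[0][0] for cl, counts in votes.items()}

def predict(test_clusters, cluster_to_relation, default=None):
    """Label each test instance with the relation of the cluster it was assigned to."""
    return [cluster_to_relation.get(cl, default) for cl in test_clusters]

train_clusters = [0, 0, 0, 1, 1]
train_relations = ["VERB_3", "VERB_3", "NOUN_SP", "PREFIX", "PREFIX"]
mapping = majority_labels(train_clusters, train_relations)
predicted = predict([1, 0, 2], mapping)
```

Test instances that fall into a cluster with no training instances receive the `default` label, which is one possible design choice; the text does not specify how such cases were handled.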
We use an SVM with a linear kernel, and report results from 10-fold cross-validation in Table 4.
The SVM achieves a higher F-score than the baseline on almost every relation, particularly on LEXSEM Hyper, and the lower-frequency NOUN SP, NOUN Coll, and PREFIX. Most of the relations, even the most difficult ones from our clustering experiment, are classified with very high F-score. That is, with a simple linear transformation of the embedding dimensions, we are able to achieve near-perfect results. The PREFIX relation achieved markedly lower recall, resulting in a lower F-score, due to large differences in the predominant usages associated with the respective words (e.g., (union, reunion), where the vector for union is heavily biased by contexts associated with trade unions, but reunion is heavily biased by contexts relating to social get-togethers; and (entry, reentry), where entry is associated with competitions and entrance to schools, while reentry is associated with space travel). Somewhat surprisingly, given the small dimensionality of the input (vectors of size 300 for all three methods), we found that the linear SVM slightly outperformed a non-linear SVM using an RBF kernel. We observe no real difference between w2v wiki and SVD wiki, supporting the hypothesis of Levy et al. (2015a) that under appropriate parameter settings, count-based methods achieve high results. The impact of the training data volume for pre-training of the embeddings is also less pronounced than in the case of our clustering experiment.

OPEN-WORLD Classification
We now turn to a more challenging evaluation setting: a test set including word pairs drawn at random. This setting aims to illustrate whether a DIFFVEC-based classifier is capable of differentiating related word pairs from noise, and can be applied to open data to learn new related word pairs.

Table 5: Precision (P) and recall (R) for OPEN-WORLD classification, using the binary classifier without ("Orig") and with ("+neg") negative samples.

For these experiments, we train a binary classifier for each relation type, using 2/3 of our relation data for training and 1/3 for testing. The test data is augmented with an equal quantity of random pairs, generated as follows: (1) sample a seed lexicon by drawing words proportional to their frequency in Wikipedia; (2) take the Cartesian product over pairs of words from the seed lexicon; (3) sample word pairs uniformly from this set. This procedure generates word pairs that are representative of the frequency profile of our corpus.
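The three-step procedure for generating random word pairs can be sketched as follows. The frequency table, the function name, and the seed are hypothetical; drawing the lexicon with replacement and then collapsing duplicates, and excluding pairs of identical words, are assumptions made for this illustration.

```python
import itertools
import random

def sample_random_pairs(freqs, lexicon_size, n_pairs, seed=0):
    """Generate random word pairs that reflect corpus word frequencies."""
    rng = random.Random(seed)
    words = list(freqs)
    weights = [freqs[w] for w in words]
    # (1) Seed lexicon: draw words proportional to frequency, then collapse duplicates.
    lexicon = sorted(set(rng.choices(words, weights=weights, k=lexicon_size)))
    # (2) Cartesian product over the seed lexicon, dropping identical-word pairs.
    candidates = [(a, b) for a, b in itertools.product(lexicon, repeat=2) if a != b]
    # (3) Uniform sample (without replacement) from the candidate pairs.
    return rng.sample(candidates, min(n_pairs, len(candidates)))

# Hypothetical word frequencies.
freqs = {"the": 100, "cat": 10, "sat": 5, "mat": 1}
pairs = sample_random_pairs(freqs, lexicon_size=10, n_pairs=3)
```

Because frequent words are more likely to enter the seed lexicon, the resulting pairs are skewed towards high-frequency vocabulary, mirroring the frequency profile of the corpus.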
We train 9 binary RBF-kernel SVM classifiers on the training partition, and evaluate on our randomly augmented test set. Fully annotating our random word pairs is prohibitively expensive, so instead, we manually annotated only the word pairs which were positively classified by one of our models. The results of our experiments are presented in the left half of Table 5, in which we report on results over the combination of the original test data from §5.1 and the random word pairs, noting that recall (R) for OPEN-WORLD takes the form of relative recall (Pantel et al., 2004) over the positively-classified word pairs. The results are much lower than for the closed-world setting (Table 4), most notably in terms of precision (P). For instance, the random pairs (have, works), (turn, took), and (works, started) were incorrectly classified as VERB 3, VERB Past and VERB 3Past, respectively. That is, the model captures syntax, but lacks the ability to capture lexical paradigms, and tends to overgenerate.

OPEN-WORLD Training with Negative Sampling
To address the problem of incorrectly classifying random word pairs as valid relations, we retrain the classifier on a dataset comprising both valid and automatically-generated negative distractor samples. The basic intuition behind this approach is to construct samples which will force the model to learn decision boundaries that more tightly capture the true scope of a given relation. To this end, we automatically generated two types of negative distractors:

opposite pairs: generated by switching the order of word pairs, $\text{Oppos}_{w_1,w_2} = \text{word}_1 - \text{word}_2$. This ensures the classifier adequately captures the asymmetry in the relations.

shuffled pairs: generated by replacing $w_2$ with a random word $w_2'$ from the same relation, $\text{Shuff}_{w_1,w_2} = \text{word}_2' - \text{word}_1$. This is targeted at relations that take specific word classes in particular positions, e.g., (VB, VBD) word pairs, so that the model learns to encode the relation rather than simply learning the properties of the word classes.

Both types of distractors are added to the training set, such that there are equal numbers of valid relations, opposite pairs and shuffled pairs.
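The two distractor types can be generated mechanically from the positive pairs of a relation, as in the sketch below. The toy embeddings, the function names, and the choice to exclude the original second word when shuffling are assumptions for illustration; the relation needs at least two distinct second words for the shuffle to be possible.

```python
import random

def subtract(u, v):
    """Element-wise vector difference u - v."""
    return [a - b for a, b in zip(u, v)]

def negative_diffvecs(pairs, emb, seed=0):
    """Build one opposite and one shuffled distractor DIFFVEC per positive pair."""
    rng = random.Random(seed)
    second_words = [w2 for _, w2 in pairs]
    opposite, shuffled = [], []
    for w1, w2 in pairs:
        # Opposite pair: reverse the direction of the offset (word1 - word2).
        opposite.append(subtract(emb[w1], emb[w2]))
        # Shuffled pair: swap in a different second word from the same relation.
        w2_prime = rng.choice([w for w in second_words if w != w2])
        shuffled.append(subtract(emb[w2_prime], emb[w1]))
    return opposite, shuffled

# Toy vectors for three (verb, past tense) pairs, invented for illustration.
emb = {
    "take": [1.0, 0.0], "took": [1.0, 1.0],
    "go": [2.0, 0.0], "went": [2.0, 2.0],
    "eat": [3.0, 0.0], "ate": [3.0, 3.0],
}
pairs = [("take", "took"), ("go", "went"), ("eat", "ate")]
opposite, shuffled = negative_diffvecs(pairs, emb)
```

Producing one distractor of each type per positive pair yields the equal class sizes described above.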
After training our classifier, we evaluate its predictions in the same way as in §5.2, using the same test set combining related and random word pairs. The results are shown in the right half of Table 5 (as "+neg"). Observe that the precision is much higher and recall somewhat lower compared to the classifier trained with only positive samples. This follows from the adversarial training scenario: using negative distractors results in a more conservative classifier, that correctly classifies the vast majority of the random word pairs as not corresponding to a given relation, resulting in higher precision at the expense of a small drop in recall. Overall this leads to higher F-scores, as shown in Figure 3, other than for hypernyms (LEXSEM Hyper) and prefixes (PREFIX). For example, the standard classifier for NOUN Coll learned to match word pairs including an animal name (e.g., (plague, rats)), while training with negative samples resulted in much more conservative predictions and consequently much lower recall. The classifier was able to capture (herd, horses) but not (run, salmon), (party, jays) or (singular, boar) as instances of NOUN Coll, possibly because of polysemy. The most striking difference in performance was for LEXSEM Mero, where the standard classifier generated many false positive noun pairs (e.g. (series, radio)), but the false positive rate was considerably reduced with negative sampling.

Lexical Memorisation
Weeds et al. (2014) and Levy et al. (2015b) recently showed that supervised methods using DIFFVECs achieve artificially high results as a result of "lexical memorisation" over frequent words associated with the hypernym relation. For example, (animal, cat), (animal, dog), and (animal, pig) all share the superclass animal, and the model thus learns to classify as positive any word pair with animal as the first word.
To address this effect, we follow Levy et al. (2015b) in splitting our vocabulary into training and test partitions, to ensure there is no overlap between training and test vocabulary. We then train classifiers with and without negative sampling (§5.3), incrementally adding the random word pairs from §5.2 to the test data (from no random word pairs to five times the original size of the test data) to investigate the interaction of negative sampling with greater diversity in the test set when there is a split vocabulary. The results are shown in Figure 4.
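A vocabulary split of this kind can be sketched as follows: partition the vocabulary at random, then keep only those word pairs whose two words fall in the same partition. The function name, the 50/50 split, and the decision to discard pairs that straddle the partitions are assumptions for illustration, as the text does not give these details.

```python
import random

def lexical_split(pairs, test_fraction=0.5, seed=0):
    """Split word pairs so that training and test vocabularies are disjoint."""
    rng = random.Random(seed)
    vocab = sorted({w for pair in pairs for w in pair})
    rng.shuffle(vocab)
    test_vocab = set(vocab[:int(len(vocab) * test_fraction)])
    train, test = [], []
    for w1, w2 in pairs:
        in_test = (w1 in test_vocab, w2 in test_vocab)
        if all(in_test):
            test.append((w1, w2))
        elif not any(in_test):
            train.append((w1, w2))
        # Pairs with one word in each partition are discarded.
    return train, test

pairs = [("animal", "cat"), ("animal", "dog"), ("animal", "pig"),
         ("vehicle", "car"), ("vehicle", "bus"), ("tool", "saw")]
train, test = lexical_split(pairs)
```

With this split, a frequent first word such as animal can appear in training or in test but never in both, so a classifier cannot score well simply by memorising it.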
Observe that the precision for the standard classifier decreases rapidly as more random word pairs are added to the test data.In comparison, the precision when negative sampling is used shows only a small drop-off, indicating that negative sampling is effective at maintaining precision in an OPEN-WORLD setting even when the training and test vocabulary are disjoint.This benefit comes at the expense of recall, which is much lower when negative sampling is used (note that recall stays relatively constant as random word pairs are added, as the vast majority of them do not correspond to any relation).At the maximum level of random word pairs in the test data, the F-score for the negative sampling classifier is higher than for the standard classifier.

Conclusions
This paper is the first to test the generalisability of the vector difference approach across a broad range of lexical relations (in raw number and also variety).Using clustering we showed that many types of morphosyntactic and morphosemantic differences are captured by DIFFVECs, but that lexical semantic relations are captured less well, a finding which is consistent with previous work (Köper et al., 2015).In contrast, classification over the DIFFVECs works extremely well in a closed-world setting, showing that dimensions of DIFFVECs encode lexical relations.Classification performs less well over open data, although with the introduction of automatically-generated negative samples, the results improve substantially.Negative sampling also improves classification when the training and test vocabulary are split to minimise lexical memorisation.Overall, we conclude that the DIFFVEC approach has impressive utility over a broad range of lexical relations, especially under supervised classification.

Figure 1: t-SNE projection (Van der Maaten and Hinton, 2008) of DIFFVECs for 10 sample word pairs of each relation type, based on w2v. The intersection of the two axes identifies the projection of the zero vector. Best viewed in colour.

Figure 2: Spectral clustering results, comparing cluster quality (V-Measure) and the number of clusters.DIFFVECs are clustered and compared to the known relation types.Each line shows a different source of word embeddings.

Figure 3: F-score for OPEN-WORLD classification, comparing models trained with and without negative samples.

Figure 4: Evaluation of the OPEN-WORLD model when trained on split vocabulary, for varying numbers of random word pairs in the test dataset (expressed as a multiplier relative to the number of CLOSED-WORLD test instances).

Table 1: The pre-trained word embeddings used in our experiments, with the number of dimensions and size of the training data (in word tokens). The models trained on English Wikipedia ("wiki") are in the lower half of the table.

Table 2: Description of the 15 lexical relations.

The dataset sources additionally include Wiktionary and a web lexicon of collective nouns, as listed in Table 2.

Table 3: The entropy for each lexical relation over the clustering output for each set of pre-trained word embeddings.

The entropy results are in Table 3, with the lowest entropy (purest clustering) for each relation indicated in bold. Looking across the different lexical relation types, the morphosyntactic paradigm relations (NOUN SP and the three VERB relations) are by far the easiest to capture. The lexical semantic relations, on the other hand, are the hardest to capture for all embeddings.