Hypernyms under Siege: Linguistically-motivated Artillery for Hypernymy Detection

The fundamental role of hypernymy in NLP has motivated the development of many methods for the automatic identification of this relation, most of which rely on word distribution. We investigate an extensive number of such unsupervised measures, using several distributional semantic models that differ by context type and feature weighting. We analyze the performance of the different methods based on their linguistic motivation. Comparison to the state-of-the-art supervised methods shows that while supervised methods generally outperform the unsupervised ones, the former are sensitive to the distribution of training instances, hurting their reliability. Being based on general linguistic hypotheses and independent from training data, unsupervised measures are more robust, and therefore are still useful artillery for hypernymy detection.


Introduction
In the last two decades, the NLP community has invested a consistent effort in developing automated methods to recognize hypernymy. Such effort is motivated by the role this semantic relation plays in a large number of tasks, such as taxonomy creation (Snow et al., 2006;Navigli et al., 2011) and recognizing textual entailment (Dagan et al., 2013). The task has appeared to be, however, a challenging one, and the numerous approaches proposed to tackle it have often shown limitations.
Early corpus-based methods have exploited patterns that may indicate hypernymy (e.g. "animals such as dogs") (Hearst, 1992;Snow et al., 2005), but the recall limitation of this approach, requiring both words to co-occur in a sentence, motivated the development of methods that rely on adaptations of the distributional hypothesis (Harris, 1954).
The first distributional approaches were unsupervised, assigning a score to each (x, y) word pair, which is expected to be higher for hypernym pairs than for negative instances. Evaluation is performed using ranking metrics inherited from information retrieval, such as Average Precision (AP) and Mean Average Precision (MAP). Each measure exploits a certain linguistic hypothesis, such as the distributional inclusion hypothesis (Weeds and Weir, 2003; Kotlerman et al., 2010) or the distributional informativeness hypothesis (Santus et al., 2014; Rimell, 2014).
In the last couple of years, the focus of the research community shifted to supervised distributional methods, in which each (x, y) word-pair is represented by a combination of x and y's word vectors (e.g. concatenation or difference), and a classifier is trained on these resulting vectors to predict hypernymy (Baroni et al., 2012;Roller et al., 2014;Weeds et al., 2014). While the original methods were based on count-based vectors, in recent years they have been used with word embeddings (Mikolov et al., 2013;Pennington et al., 2014), and have gained popularity thanks to their ease of use and their high performance on several common datasets. However, there have been doubts on whether they can actually learn to recognize hypernymy (Levy et al., 2015b).
Additional recent hypernymy detection methods include a multimodal perspective (Kiela et al., 2015), a supervised method using unsupervised measure scores as features (Santus et al., 2016a), and a neural method integrating path-based and distributional information (Shwartz et al., 2016).
In this paper we perform an extensive evaluation of various unsupervised distributional measures for hypernymy detection, using several distributional semantic models that differ by context type and feature weighting. Some measure variants and context types are tested for the first time. We demonstrate that since each of these measures captures a different aspect of the hypernymy relation, there is no single measure that consistently performs well in discriminating hypernymy from different semantic relations. We analyze the performance of the measures in different settings and suggest a principled way to select the suitable measure, context type and feature weighting according to the task setting, yielding consistent performance across datasets.
We also compare the unsupervised measures to the state-of-the-art supervised methods. We show that supervised methods outperform the unsupervised ones, while also being more efficient, computed on top of low-dimensional vectors. At the same time, however, our analysis reassesses previous findings suggesting that supervised methods do not actually learn the relation between the words, but only characteristics of a single word in the pair (Levy et al., 2015b). Moreover, since the features in embedding-based classifiers are latent, it is difficult to tell what the classifier has learned. We demonstrate that unsupervised methods, on the other hand, do account for the relation between words in a pair, and are easily interpretable, being based on general linguistic hypotheses.

Distributional Semantic Spaces
We created multiple distributional semantic spaces that differ in their context type and feature weighting. As an underlying corpus we used a concatenation of the following two corpora: ukWaC (Ferraresi, 2007), a 2-billion word corpus constructed by crawling the .uk domain, and WaCkypedia EN (Baroni et al., 2009), a 2009 dump of the English Wikipedia. Both corpora include POS, lemma and dependency parse annotations. Our vocabulary (of target and context words) includes only nouns, verbs and adjectives that occurred at least 100 times in the corpus.
Context Type We use several context types:
• Window-based contexts: the contexts of a target word $w_i$ are the words surrounding it in a $k$-sized window: $w_{i-k}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+k}$. If the context type is directional, words occurring before and after $w_i$ are marked differently, i.e.: $w_{i-k}/l, \ldots, w_{i-1}/l, w_{i+1}/r, \ldots, w_{i+k}/r$.
• Dependency-based contexts: rather than adjacent words in a window, we consider neighbors in a dependency parse tree (Padó and Lapata, 2007; Baroni and Lenci, 2010). The contexts of a target word $w_i$ are its parent and daughter nodes in the dependency tree (dep). We also experimented with a joint dependency context inspired by Chersoni et al. (2016), in which the contexts of a target word are the parent-sister pairs in the dependency tree (joint). See Figure 1 for an illustration.
Figure 1: An example dependency tree of the sentence cute cats drink milk, with the target word cats. The dependency-based contexts are drink-v:nsubj and cute-a:amod$^{-1}$. The joint-dependency context is drink-v#milk-n. Differently from Chersoni et al. (2016), we exclude the dependency tags to mitigate the sparsity of contexts.
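The context types above can be illustrated with a minimal sketch on the example sentence from Figure 1. The function names, the `(head, relation, dependent)` edge format and the `dobj` label for milk are assumptions made for illustration; they are not taken from the authors' implementation.

```python
def window_contexts(tokens, i, k, directional=False):
    """Contexts of tokens[i] in a k-sized window, optionally marked by side."""
    left = tokens[max(0, i - k):i]
    right = tokens[i + 1:i + 1 + k]
    if directional:
        # Words before the target get "/l", words after it get "/r".
        return [w + "/l" for w in left] + [w + "/r" for w in right]
    return left + right


def dep_contexts(edges, target):
    """Parent and daughter nodes of the target in a dependency tree (dep)."""
    contexts = []
    for head, rel, dep in edges:
        if dep == target:
            contexts.append(f"{head}:{rel}")       # parent of the target
        elif head == target:
            contexts.append(f"{dep}:{rel}-1")      # daughter, inverse relation
    return contexts


def joint_contexts(edges, target):
    """Parent-sister pairs of the target, without dependency tags (joint)."""
    contexts = []
    for head, _, dep in edges:
        if dep != target:
            continue
        for other_head, _, sister in edges:
            if other_head == head and sister != target:
                contexts.append(f"{head}#{sister}")
    return contexts


SENTENCE = ["cute", "cats", "drink", "milk"]
# Hypothetical parse of the Figure 1 sentence; the "dobj" label is assumed.
EDGES = [
    ("drink-v", "nsubj", "cats-n"),
    ("drink-v", "dobj", "milk-n"),
    ("cats-n", "amod", "cute-a"),
]

print(window_contexts(SENTENCE, 1, 2, directional=True))
print(dep_contexts(EDGES, "cats-n"))
print(joint_contexts(EDGES, "cats-n"))
```

For the target cats, the sketch yields the same dependency-based and joint contexts as those listed in the Figure 1 caption.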
Feature Weighting Each distributional semantic space is spanned by a matrix $M$ in which each row corresponds to a target word while each column corresponds to a context. The value of each cell $M_{i,j}$ represents the association between the target word $w_i$ and the context $c_j$. We experimented with two feature weightings:
• Frequency: raw frequency (no weighting). $M_{i,j}$ is the number of co-occurrences of $w_i$ and $c_j$ in the corpus.
• Positive PMI (PPMI): pointwise mutual information (PMI) (Church and Hanks, 1990) is defined as the log ratio between the joint probability of $w$ and $c$ and the product of their marginal probabilities: $PMI(w, c) = \log \frac{\hat{P}(w, c)}{\hat{P}(w) \hat{P}(c)}$, where $\hat{P}(w)$, $\hat{P}(c)$, and $\hat{P}(w, c)$ are estimated by the relative frequencies of a word $w$, a context $c$ and a word-context pair $(w, c)$, respectively. To handle unseen pairs $(w, c)$, yielding $PMI(w, c) = \log(0) = -\infty$, PPMI (Bullinaria and Levy, 2007) assigns zero to negative PMI scores: $PPMI(w, c) = \max(PMI(w, c), 0)$.
In addition, one of the measures we used (Santus et al., 2014) required a third feature weighting:
• Positive LMI (PLMI): positive local mutual information (Evert, 2005; Evert, 2008). PPMI was found to have a bias towards rare events. PLMI simply balances PPMI by multiplying it by the co-occurrence frequency of $w$ and $c$: $PLMI(w, c) = freq(w, c) \cdot PPMI(w, c)$.
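As a minimal sketch of the three weightings, the following computes PPMI and PLMI from raw co-occurrence counts, estimating all probabilities by relative frequency as described above. The function name, the dictionary-based sparse representation and the toy counts are assumptions made for illustration.

```python
import math
from collections import Counter


def ppmi_plmi(pair_counts):
    """Map {(word, context): frequency} to PPMI and PLMI dictionaries."""
    total = sum(pair_counts.values())
    word_counts, context_counts = Counter(), Counter()
    for (w, c), n in pair_counts.items():
        word_counts[w] += n
        context_counts[c] += n

    ppmi, plmi = {}, {}
    for (w, c), n in pair_counts.items():
        p_wc = n / total
        p_w = word_counts[w] / total
        p_c = context_counts[c] / total
        pmi = math.log(p_wc / (p_w * p_c))
        # Unseen pairs never appear in pair_counts, so they are implicitly 0.
        ppmi[(w, c)] = max(pmi, 0.0)
        plmi[(w, c)] = n * ppmi[(w, c)]
    return ppmi, plmi


# Toy counts, invented for illustration only.
counts = {("cat", "purr"): 2, ("cat", "eat"): 2, ("dog", "eat"): 4}
ppmi, plmi = ppmi_plmi(counts)
print(ppmi[("cat", "purr")], plmi[("cat", "purr")])
```

In this toy example the pair (cat, eat) has a negative PMI and is therefore clipped to zero by PPMI, while PLMI doubles the PPMI of (cat, purr) because the pair occurs twice.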

Unsupervised Hypernymy Detection Measures
We experiment with a large number of unsupervised measures proposed in the literature for distributional hypernymy detection, with some new variants. In the following section, $\vec{v}_x$ and $\vec{v}_y$ denote x and y's word vectors (rows in the matrix $M$). We consider the scores as measuring to what extent y is a hypernym of x (x → y).

Similarity Measures
Following the distributional hypothesis (Harris, 1954), similar words share many contexts, thus have a high similarity score. Although the hypernymy relation is asymmetric, similarity is one of its properties (Santus et al., 2014).
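• Cosine Similarity (Salton and McGill, 1986) A symmetric similarity measure: $\cos(x, y) = \frac{\vec{v}_x \cdot \vec{v}_y}{\|\vec{v}_x\| \cdot \|\vec{v}_y\|}$
• Lin Similarity (Lin, 1998) A symmetric similarity measure that quantifies the ratio of shared contexts to the contexts of each word: $Lin(x, y) = \frac{\sum_{c \in \vec{v}_x \cap \vec{v}_y} [\vec{v}_x[c] + \vec{v}_y[c]]}{\sum_{c \in \vec{v}_x} \vec{v}_x[c] + \sum_{c \in \vec{v}_y} \vec{v}_y[c]}$, where $\vec{v}_x[c]$ is the weight of context $c$ in x's vector and $c \in \vec{v}_x$ ranges over the non-zero contexts of x.
• APSyn (Santus et al., 2016b) A symmetric measure that computes the extent of intersection among the N most related contexts of two words, weighted according to the rank of the shared contexts (with N as a hyper-parameter): $APSyn(x, y) = \sum_{c \in N(\vec{v}_x) \cap N(\vec{v}_y)} \frac{1}{(rank_x(c) + rank_y(c)) / 2}$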
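The three similarity measures can be sketched as follows, assuming their standard definitions from the cited papers: cosine as the normalized dot product, Lin as the summed weights of shared contexts over the summed weights of all contexts of both words, and APSyn as the sum of inverse average ranks over shared top-N contexts. Word vectors are represented as dictionaries holding only non-zero contexts; this representation, the function names and the alphabetical tie-breaking in the ranking are assumptions for illustration.

```python
import math


def cosine(vx, vy):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * vy[c] for c, w in vx.items() if c in vy)
    norm_x = math.sqrt(sum(w * w for w in vx.values()))
    norm_y = math.sqrt(sum(w * w for w in vy.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


def lin(vx, vy):
    """Weights of shared contexts relative to all context weights."""
    shared = sum(w + vy[c] for c, w in vx.items() if c in vy)
    total = sum(vx.values()) + sum(vy.values())
    return shared / total if total else 0.0


def top_ranks(v, n):
    """Rank (1 = most related) of each of the n highest-weighted contexts."""
    ordered = sorted(v, key=lambda c: (-v[c], c))[:n]
    return {c: rank for rank, c in enumerate(ordered, start=1)}


def apsyn(vx, vy, n):
    """Sum of inverse average ranks over the shared top-n contexts."""
    rx, ry = top_ranks(vx, n), top_ranks(vy, n)
    return sum(1.0 / ((rx[c] + ry[c]) / 2.0) for c in rx if c in ry)


# Toy vectors, invented for illustration only.
vx = {"a": 3.0, "b": 2.0, "c": 1.0}
vy = {"a": 1.0, "b": 3.0, "d": 2.0}
print(cosine(vx, vy), lin(vx, vy), apsyn(vx, vy, n=2))
```

All three functions return the same value when the arguments are swapped, which reflects the symmetry noted above.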

Inclusion Measures
According to the distributional inclusion hypothesis, the prominent contexts of a hyponym (x) are expected to be included in those of its hypernym (y).
• Weeds Precision (Weeds and Weir, 2003) A directional precision-based similarity measure. This measure quantifies the weighted inclusion of x's contexts by y's contexts: $WeedsPrec(x \to y) = \frac{\sum_{c \in \vec{v}_x \cap \vec{v}_y} \vec{v}_x[c]}{\sum_{c \in \vec{v}_x} \vec{v}_x[c]}$
• cosWeeds (Lenci and Benotto, 2012) Geometric mean of cosine similarity and Weeds precision: $cosWeeds(x \to y) = \sqrt{\cos(x, y) \cdot WeedsPrec(x \to y)}$
• ClarkeDE (Clarke, 2009) Computes degree of inclusion, by quantifying weighted coverage of the hyponym's contexts by those of the hypernym: $ClarkeDE(x \to y) = \frac{\sum_{c \in \vec{v}_x \cap \vec{v}_y} \min(\vec{v}_x[c], \vec{v}_y[c])}{\sum_{c \in \vec{v}_x} \vec{v}_x[c]}$
• balAPinc (Kotlerman et al., 2010) Balanced average precision inclusion. $APinc(x \to y) = \frac{\sum_{r=1}^{N_y} [P(r) \cdot rel(c_r)]}{N_y}$ is an adaptation of the average precision measure from information retrieval for the inclusion hypothesis. $N_y$ is the number of non-zero contexts of y and $P(r)$ is the precision at rank $r$, defined as the ratio of shared contexts with y among the top $r$ contexts of x. $rel(c)$ is the relevance of a context $c$, set to 0 if $c$ is not a context of y, and to $1 - \frac{rank_y(c)}{N_y + 1}$ otherwise, where $rank_y(c)$ is the rank of the context $c$ in y's sorted vector. Finally, $balAPinc(x \to y) = \sqrt{Lin(x, y) \cdot APinc(x \to y)}$ is the geometric mean of APinc and Lin similarity.
• invCL (Lenci and Benotto, 2012) Measures both distributional inclusion of x in y and distributional non-inclusion of y in x: $invCL(x \to y) = \sqrt{ClarkeDE(x \to y) \cdot (1 - ClarkeDE(y \to x))}$
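A minimal sketch of three of the inclusion measures follows, assuming their standard definitions from the cited papers: Weeds precision as the share of x's context weight that falls on contexts also seen with y, ClarkeDE as the same ratio with each shared weight capped by y's weight, and invCL as the geometric mean of ClarkeDE(x → y) and 1 − ClarkeDE(y → x). The dictionary representation, function names and toy vectors are assumptions for illustration.

```python
import math


def weeds_prec(vx, vy):
    """Weighted inclusion of x's contexts in y's contexts."""
    total = sum(vx.values())
    shared = sum(w for c, w in vx.items() if c in vy)
    return shared / total if total else 0.0


def clarke_de(vx, vy):
    """Weighted coverage of x's contexts by y's, capped by y's weights."""
    total = sum(vx.values())
    covered = sum(min(w, vy[c]) for c, w in vx.items() if c in vy)
    return covered / total if total else 0.0


def inv_cl(vx, vy):
    """Inclusion of x in y combined with non-inclusion of y in x."""
    return math.sqrt(clarke_de(vx, vy) * (1.0 - clarke_de(vy, vx)))


# Toy vectors, invented for illustration only.
dog = {"bark": 2.0, "eat": 2.0}
animal = {"eat": 4.0, "live": 4.0, "bark": 1.0}

print(weeds_prec(dog, animal), weeds_prec(animal, dog))
print(inv_cl(dog, animal), inv_cl(animal, dog))
```

Unlike the similarity measures, these scores are directional: in the toy example the score for (dog → animal) is higher than for (animal → dog), as the inclusion hypothesis would predict for a hyponym-hypernym pair.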

Informativeness Measures
According to the distributional informativeness hypothesis, hypernyms tend to be less informative than hyponyms, as they are likely to occur in more general contexts than their hyponyms.
• SLQS (Santus et al., 2014) $SLQS(x \to y) = 1 - \frac{E_x}{E_y}$. The informativeness of a word x is evaluated as the median entropy of its top N contexts: $E_x = median_{i=1}^{N}(H(c_i))$, where $H(c)$ is the entropy of context $c$.
• SLQS Sub A new variant of SLQS based on the assumption that if y is judged to be a hypernym of x to a certain extent, then x should be judged to be a hyponym of y to the same extent (which is not the case for regular SLQS). This is achieved by subtraction: $SLQS_{sub}(x \to y) = E_y - E_x$. SLQS and SLQS Sub have 3 hyper-parameters: i) the number of contexts N; ii) whether to use median or average entropy among the top N contexts; and iii) the feature weighting used to sort the contexts by relevance (i.e., PPMI or PLMI).
• SLQS Row Differently from SLQS, SLQS Row computes the entropy of the target rather than the average/median entropy of the contexts, as an alternative way to compute the generality of a word. In addition, parallel to SLQS we tested SLQS Row with subtraction, SLQS Row Sub.
• RCTC (Rimell, 2014) Ratio of change in topic coherence. Here $t_x$ are the top N contexts of x, considered as x's topic, and $t_{x \setminus y}$ are the top N contexts of x which are not contexts of y. $TC(A)$ is the topic coherence of a set of words A, defined as the median pairwise PMI scores between words in A. N is a hyper-parameter. The measure is based on the assumptions that excluding y's contexts from x's increases the coherence of the topic, while excluding x's contexts from y's decreases the coherence of the topic. We include this measure under the informativeness measures, as it is based on a similar hypothesis.
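The SLQS family can be sketched as follows, assuming that SLQS is one minus the ratio of the two words' informativeness scores and that SLQS Sub is their difference. The sketch also assumes that the entropy of a context is computed over its co-occurrence counts with target words, using a base-2 logarithm, and that each word vector is already weighted (e.g., by PPMI or PLMI) so that its top N contexts can be selected by weight. These choices, and all names and toy counts, are assumptions for illustration rather than the authors' exact setup.

```python
import math
from statistics import median


def entropy(counts):
    """Shannon entropy (base 2) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)


def informativeness(word_vec, context_entropy, n):
    """Median entropy of the n highest-weighted contexts of a word."""
    top = sorted(word_vec, key=lambda c: (-word_vec[c], c))[:n]
    return median(context_entropy[c] for c in top)


def slqs(e_x, e_y):
    return 1.0 - e_x / e_y


def slqs_sub(e_x, e_y):
    return e_y - e_x


# Toy data: how often each context co-occurs with its different targets.
context_target_counts = {"bark": [8], "eat": [4, 4], "live": [2, 2, 2, 2]}
context_entropy = {c: entropy(n) for c, n in context_target_counts.items()}

dog = {"bark": 5.0, "eat": 3.0}
animal = {"eat": 5.0, "live": 4.0}
e_dog = informativeness(dog, context_entropy, n=2)
e_animal = informativeness(animal, context_entropy, n=2)
print(slqs(e_dog, e_animal), slqs_sub(e_dog, e_animal))
```

In the toy data the contexts of animal have higher entropy than those of dog, so both scores are positive for (dog → animal). Swapping the arguments negates SLQS Sub exactly, which is the symmetry property that motivates the variant.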

Reversed Inclusion Measures
These measures are motivated by the fact that, even though hypernyms, being more general, are expected to occur in a larger set of contexts, sentences like "the vertebrate barks" or "the mammal arrested the thieves" are not common, since hyponyms are more specialized and are hence more appropriate in such contexts. On the other hand, hyponyms are likely to occur in broad contexts (e.g. eat, live), where hypernyms are also appropriate. In this sense, we can define the reversed inclusion hypothesis: "hypernym's contexts are likely to be included in the hyponym's contexts". The following variants are tested for the first time.

Datasets
We use four common semantic relation datasets: BLESS (Baroni and Lenci, 2011), EVALution (Santus et al., 2015), Lenci/Benotto (Benotto, 2015), and Weeds (Weeds et al., 2014). The datasets were constructed either using knowledge resources (e.g. WordNet, Wikipedia), crowdsourcing or both. The semantic relations and the size of each dataset are detailed in Table 1. We split each dataset randomly into 90% test and 10% validation. The validation sets are used to tune the hyper-parameters of several measures: SLQS (Sub), APSyn and RCTC.

Comparing Unsupervised Measures
In order to evaluate the unsupervised measures described in Section 3, we compute the measure scores for each (x, y) pair in each dataset. We first measure the method's ability to discriminate hypernymy from all other relations in the dataset, i.e. by considering hypernyms as positive instances, and other word pairs as negative instances. In addition, we measure the method's ability to discriminate hypernymy from every other relation in the dataset by considering one relation at a time. For a relation R we consider only (x, y) pairs that are annotated as either hypernyms (positive instances) or R (negative instances). We rank the pairs according to the measure score and compute average precision (AP) at k = 100 and k = all. [5]
[footnote, beginning missing] ...lated differently in each sense. We consider y as a hypernym of x if hypernymy holds in some of the words' senses. Therefore, when a pair is assigned both hypernymy and another relation, we only keep it as hypernymy.
[5] We tried several cut-offs and chose the one that seemed to be more informative in distinguishing between the unsupervised measures.
Table 2 reports the best performing measure(s), with respect to AP@100, for each relation in each dataset. The first observation is that there is no single combination of measure, context type and feature weighting that performs best in discriminating hypernymy from all other relations. In order to better understand the results, we focus on the second type of evaluation, in which we discriminate hypernyms from each of the other relations.
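The ranking evaluation described above can be sketched as follows. Several normalizations of AP at a cut-off exist; this sketch assumes the variant that averages the precision values at the ranks of the positive pairs found within the top k, which may differ from the exact variant used in the experiments. The function name and the toy scores are illustrative.

```python
def average_precision(scored_pairs, k=None):
    """AP over (score, is_hypernym) pairs; k=None corresponds to k = all."""
    ranked = sorted(scored_pairs, key=lambda pair: pair[0], reverse=True)
    if k is not None:
        ranked = ranked[:k]

    hits = 0
    precisions = []
    for rank, (_, is_hypernym) in enumerate(ranked, start=1):
        if is_hypernym:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0


# Toy measure scores for four pairs: hypernyms are the positive instances,
# pairs in the contrasted relation R are the negative instances.
pairs = [(0.9, True), (0.8, False), (0.7, True), (0.1, False)]
print(average_precision(pairs))        # k = all
print(average_precision(pairs, k=2))   # AP@2
```

For the one-relation-at-a-time setting, the same function would simply be called on the subset of pairs annotated as either hypernymy or R.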
The results show preference to the syntactic context-types (dep and joint), which might be explained by the fact that these contexts are richer (as they contain both proximity and syntactic information) and therefore more discriminative. In feature weighting there is no consistency, but interestingly, raw frequency appears to be successful in hypernymy detection, contrary to previously reported results for word similarity tasks, where PPMI was shown to outperform it (Bullinaria and Levy, 2007;Levy et al., 2015a).
The new SLQS variants are on top of the list in many settings. In particular they perform well in discriminating hypernyms from symmetric relations (antonymy, synonymy, coordination).
The measures based on the reversed inclusion hypothesis performed inconsistently, achieving a perfect score in the discrimination of hypernyms from unrelated words, and performing well in a few other cases, always in combination with syntactic contexts.
Finally, the results show that there is no single combination of measure and parameters that performs consistently well for all datasets and classification tasks. In the following section we analyze the best combination of measure, context type and feature weighting to distinguish hypernymy from any other relation.

Best Measure Per Classification Task
We considered all relations that occurred in two datasets. For each such relation, and for each dataset, we ranked the measures by their AP@100 score, selecting those with score ≥ 0.8. Table 3 displays the intersection of the datasets' best measures.

Hypernym vs. Meronym
The inclusion hypothesis seems to be most effective in discriminating between hypernyms and meronyms under syntactic contexts. We conjecture that the window-based contexts are less effective since they capture topical context words, which might also be shared among holonyms and their meronyms (e.g. car will occur with many of the neighbors of wheel). However, since meronyms and holonyms often have different functions, their functional contexts, which are expressed in the syntactic context types, are less shared. This is where they mostly differ from hyponym-hypernym pairs, which are of the same function (e.g. cat is a type of animal). Table 2 shows that SLQS performs well in this task on BLESS. This is contrary to previous findings that suggested that SLQS is weak in discriminating between hypernyms and meronyms, as in many cases the holonym is more general than the meronym (Shwartz et al., 2016). [7] The surprising result could be explained by the nature of meronymy in this dataset: most holonyms in BLESS are rather specific words.
BLESS was built starting from 200 basic level concepts (e.g. goldfish) used as the x words, to which y words in different relations were associated (e.g. eye, for meronymy; animal, for hypernymy). x words represent hyponyms in the hyponym-hypernym pairs, and should therefore not be too general. Indeed, SLQS assigns high scores to hyponym-hypernym pairs. At the same time, in the meronymy relation in BLESS, x is the holonym and y is the meronym. For consistency with EVALution, we switched those pairs in BLESS, placing the meronym in the x slot and the holonym in the y slot. As a consequence, after the switching, holonyms in BLESS are usually rather specific words (e.g., there are no holonyms like animal and vehicle, as these words were originally in the y slot). In most cases, they are not more general than their meronyms (e.g., (eye, goldfish)), yielding low SLQS scores which are easy to separate from hypernyms. We note that this is a weakness of the BLESS dataset, rather than a strength of the measure. For instance, on EVALution, SLQS performs worse (ranked only as high as 13th), as this dataset has no such restriction on the basic level concepts, and may contain pairs like (eye, animal).

Hypernym vs. Attribute
Symmetric similarity measures computed on syntactic contexts succeed in discriminating between hypernyms and attributes. Since attributes are syntactically different from hypernyms (in attributes, y is an adjective), it is unsurprising that they occur in different syntactic contexts, yielding low similarity scores.
[7] ... nearly 50% of the SLQS false positive pairs were meronym-holonym pairs, in many of which the holonym is more general than the meronym by definition, e.g. (mauritius, africa).
Table 4: Best performance on the validation set (10%) of each dataset for the supervised and unsupervised measures, in terms of Average Precision (AP) at k = 100, for hypernym vs. each single relation.

Hypernym vs. Antonym
In all our experiments, antonyms were the hardest to distinguish from hypernyms, yielding the lowest performance. We found that SLQS performed reasonably well in this setting. However, the measure variations, context types and feature weightings were not consistent across datasets. SLQS relies on the assumption that y is a more general word than x, which is not true for antonyms, making it the most suitable measure for this setting.

Hypernym vs. Synonym
SLQS also performs well in discriminating between hypernyms and synonyms, in which y is also not more general than x. We observed that in the joint context type, the difference in SLQS scores between synonyms and hypernyms was the largest. This may stem from the restrictiveness of this context type. For instance, among the most salient contexts we would expect to find informative contexts like drinks milk for cat and less informative ones like drinks water for animal, whereas the non-restrictive single dependency context drinks would probably be present for both.
Another measure that works well is invCL: interestingly, other inclusion-based measures assign high scores to (x, y) when y includes many of x's contexts, which might also be true for synonyms (e.g. elevator and lift share many contexts). invCL, on the other hand, decreases with the ratio of y's contexts included in x, yielding lower scores for synonyms.

Hypernym vs. Coordination
We found no consistency between BLESS and Weeds. On Weeds, inclusion-based measures (ClarkeDE, invCL and Weeds) showed the best results. The best performing measures on BLESS, however, were variants of SLQS, which were shown to perform well in cases where the negative relation is symmetric (antonym, synonym and coordination). The difference could be explained by the nature of the datasets: the BLESS test set contains 1,185 hypernymy pairs, with only 129 distinct ys, many of which are general words like animal and object. The Weeds test set, on the other hand, was intentionally constructed to contain an overall unique y in each pair, and therefore contains far more specific ys (e.g. (quirk, strangeness)). For this reason, generality-based measures perform well on BLESS, and struggle with Weeds, which is handled better using inclusion-based measures.

Comparison to State-of-the-art Supervised Methods
For comparison with the state-of-the-art, we evaluated several supervised hypernymy detection methods, based on the word embeddings of x and y: concatenation $\vec{v}_x \oplus \vec{v}_y$ (Baroni et al., 2012), difference $\vec{v}_y - \vec{v}_x$ (Weeds et al., 2014), and ASYM (Roller et al., 2014). We downloaded several pretrained embeddings (Mikolov et al., 2013; Pennington et al., 2014; Levy and Goldberg, 2014), and trained a logistic regression classifier to predict hypernymy. We used the 90% portion (originally the test set) as the train set, and the other 10% (originally the validation set) as a test set, reporting the best results among different vectors, method and regularization factor. [8] Table 4 displays the performance of the best classifier on each dataset, in a hypernym vs. a single relation setting. We also re-evaluated the unsupervised measures, this time reporting the results on the validation set (10%) for comparison.
Table 5: Average Precision (AP) at k = 100 of the best supervised and unsupervised methods for hypernym vs. random-n, on the original BLESS validation set and the validation set with the artificially added switched hypernym pairs.
The overall performance of the embedding-based classifiers is almost perfect, and in particular the best performance is achieved using the concatenation method (Baroni et al., 2012) with either GloVe (Pennington et al., 2014) or the dependency-based embeddings (Levy and Goldberg, 2014). As expected, the unsupervised measures perform worse than the embedding-based classifiers, though generally not bad on their own.
These results may suggest that unsupervised methods should be preferred only when no training data is available, leaving all the other cases to supervised methods. This is, however, not completely true. As others previously noticed, supervised methods do not actually learn the relation between x and y, but rather separate properties of either x or y. Levy et al. (2015b) named this the "lexical memorization" effect, i.e. memorizing that certain ys tend to appear in many positive pairs (prototypical hypernyms).
On that account, the Weeds dataset has been designed to avoid such memorization, with every word occurring once in each slot of the relation. While the performance of the supervised methods on this dataset is substantially lower than their performance on other datasets, it is still well above the random baseline which we might expect from a method that can only memorize words it has seen during training. [9] This is an indication that supervised methods can abstract away from the words.
Indeed, when we repeated the experiment with a lexical split of each dataset, i.e., such that the train and test set consist of distinct vocabularies, we found that the supervised methods' performance did not decrease dramatically, in contrast to the findings of Levy et al. (2015b). The large performance gaps reported by Levy et al. (2015b) might be attributed to the size of their training sets. Their lexical split discarded around half of the pairs in the dataset and split the rest of the pairs equally to train and test, resulting in a relatively small train set. We performed the split such that only around 30% of the pairs in each dataset were discarded, and split the train and test sets with a ratio of roughly 90/10%, obtaining large enough train sets.
[8] In our preliminary experiments we also trained other classifiers used in the distributional hypernymy detection literature (SVM and SVM+RBF kernel), that performed similarly. We report the results for logistic regression, since we use the prediction probabilities to measure average precision.
[9] The dataset is balanced between its two classes.
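One possible way to produce a lexical split is sketched below. The text does not spell out the exact procedure, so this is an assumption: a fraction of the vocabulary is reserved for the test set, pairs whose words both fall on the same side are kept, and mixed pairs are discarded. The function name, the `test_word_ratio` parameter and the toy pairs are hypothetical.

```python
import random


def lexical_split(pairs, test_word_ratio=0.1, seed=0):
    """Split (x, y, label) pairs so that train and test share no words."""
    rng = random.Random(seed)
    vocab = sorted({word for x, y, _ in pairs for word in (x, y)})
    rng.shuffle(vocab)
    n_test = max(1, int(len(vocab) * test_word_ratio))
    test_vocab = set(vocab[:n_test])

    train, test, discarded = [], [], []
    for x, y, label in pairs:
        in_test = (x in test_vocab, y in test_vocab)
        if all(in_test):
            test.append((x, y, label))
        elif not any(in_test):
            train.append((x, y, label))
        else:
            # One word on each side: using the pair would leak vocabulary.
            discarded.append((x, y, label))
    return train, test, discarded


# Toy pairs, invented for illustration only.
pairs = [
    ("cat", "animal", True), ("dog", "animal", True),
    ("rifle", "weapon", True), ("gun", "weapon", True),
    ("cat", "dog", False), ("rifle", "gun", False),
]
train, test, discarded = lexical_split(pairs, test_word_ratio=0.5)
print(len(train), len(test), len(discarded))
```

The share of discarded pairs depends on how the vocabulary is partitioned, which is the trade-off discussed above: discarding fewer pairs leaves a larger train set.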
Our experiment suggests that rather than memorizing the verbatim prototypical hypernyms, the supervised models might learn that certain regions in the vector space pertain to prototypical hypernyms. For example, device (from the BLESS train set) and appliance (from the BLESS test set) are two similar words, which are both prototypical hypernyms. Another interesting observation was recently made by Roller and Erk (2016): they showed that when dependency-based embeddings are used, supervised distributional methods trace x and y's separate occurrences in different slots of Hearst patterns (Hearst, 1992).
Whether supervised methods only memorize or also learn, it is more consensual that they lack the ability to capture the relation between x and y, and that they rather indicate how likely y (x) is to be a hypernym (hyponym) (Levy et al., 2015b;Santus et al., 2016a;Shwartz et al., 2016;Roller and Erk, 2016). While this information is valuable, it cannot be solely relied upon for classification.
To better understand the extent of this limitation, we conducted an experiment in a similar manner to the switched hypernym pairs in Santus et al. (2016a). We used BLESS, which is the only dataset with random pairs. For each hypernym pair $(x_1, y_1)$, we sampled a word $y_2$ that participates in another hypernym pair $(x_2, y_2)$, such that $(x_1, y_2)$ is not in the dataset, and added $(x_1, y_2)$ as a random pair. We added 139 new pairs to the validation set, such as (rifle, animal) and (salmon, weapon). We then used the best supervised and unsupervised methods for hypernym vs. random-n on BLESS to re-classify the revised validation set. Table 5 displays the experiment results.
The switched hypernym experiment paints a much less optimistic picture of the embeddings' actual performance, with a drop of 42 points in average precision. 121 out of the 139 switched hypernym pairs were falsely classified as hypernyms. Examining the y words of these pairs reveals general words that appear in many hypernym pairs (e.g. animal, object, vehicle). The unsupervised measure was not similarly affected by the switched pairs, and the performance even slightly increased. This result is not surprising, since most unsupervised measures aim to capture aspects of the relation between x and y, while not relying on information about one of the words in the pair. [10]
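The construction of the switched pairs can be sketched as follows. This is a minimal illustration of the sampling step described above, not the authors' script: the function name and the toy data are hypothetical, and sampling uniformly among the eligible hypernyms is an assumption.

```python
import random


def switched_pairs(hypernym_pairs, dataset_pairs, seed=0):
    """For each (x1, y1), pair x1 with a hypernym y2 taken from another pair."""
    rng = random.Random(seed)
    existing = set(dataset_pairs)
    hypernyms = sorted({y for _, y in hypernym_pairs})

    switched = []
    for x1, y1 in hypernym_pairs:
        # y2 must come from another hypernym pair, and (x1, y2) must be new.
        candidates = [
            y2 for y2 in hypernyms
            if y2 != y1 and (x1, y2) not in existing
        ]
        if candidates:
            y2 = rng.choice(candidates)
            switched.append((x1, y2))
            existing.add((x1, y2))
    return switched


# Toy data, invented for illustration only.
hypernym_pairs = [("rifle", "weapon"), ("salmon", "animal")]
dataset_pairs = hypernym_pairs + [("rifle", "gun")]
print(switched_pairs(hypernym_pairs, dataset_pairs))
```

The resulting pairs would be labeled as negative (random) instances, so a classifier that relies only on the y word being a prototypical hypernym is expected to misclassify them.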

Discussion
The results in Section 5 suggest that a supervised method using the unsupervised measures as features could possibly be the best of both worlds. We would expect it to be more robust than embedding-based methods on the one hand, while being more informative than any single unsupervised measure on the other hand.
Such a method was developed by Santus et al. (2016a), however using mostly features that describe a single word, e.g. frequency and entropy. It was shown to be competitive with the state-of-the-art supervised methods. With that said, it was also shown to be sensitive to the distribution of training examples in a specific dataset, like the embedding-based methods.
We conducted a similar experiment, with a much larger number of unsupervised features, namely the various measure scores, and encountered the same issue. While the performance was good, it dropped dramatically when the model was tested on a different test set.
We conjecture that the problem stems from the currently available datasets, which are all somewhat artificial and biased. Supervised methods which are strongly based on the relation between the words, e.g. those that rely on path-based information (Shwartz et al., 2016), manage to overcome the bias. Distributional methods, on the other hand, are based on a weaker notion of the relation between words, hence are more prone to overfit the distribution of training instances in a specific dataset. In the future, we hope that new datasets will be available for the task, which would be drawn from corpora and will reflect more realistic distributions of words and semantic relations.
[10] Turney and Mohammad (2015) have also shown that unsupervised methods are more robust than supervised ones in a transfer-learning experiment, when the "training data" was used to tune their parameters.

Conclusion
We performed an extensive evaluation of unsupervised methods for discriminating hypernyms from other semantic relations. We found that there is no single combination of measure and parameters which is always preferred; however, we suggested a principled linguistic-based analysis of the most suitable measure for each task that yields consistent performance across different datasets.
We investigated several new variants of existing methods, and found that some variants of SLQS turned out to be superior on certain tasks. In addition, we have tested for the first time the joint context type (Chersoni et al., 2016), which was found to be very discriminative, and might hopefully benefit other semantic tasks.
For comparison, we evaluated the state-of-the-art supervised methods on the datasets, and they were shown to outperform the unsupervised ones, while also being efficient and easier to use. However, a deeper analysis of their performance demonstrated that, as previously suggested, these methods do not capture the relation between x and y, but rather indicate the "prior probability" of either word to be a hyponym or a hypernym. As a consequence, supervised methods are sensitive to the distribution of examples in a particular dataset, making them less reliable for real-world applications. Being motivated by linguistic hypotheses, and independent from training data, unsupervised measures were shown to be more robust. In this sense, unsupervised methods can still play a relevant role, especially if combined with supervised methods, in the decision whether the relation holds or not.