How Well Can We Predict Hypernyms from Word Embeddings? A Dataset-Centric Analysis

One key property of word embeddings currently under study is their capacity to encode hypernymy. Previous works have used supervised models to recover hypernymy structures from embeddings. However, the overall results do not clearly show how well we can recover such structures. We conduct the first dataset-centric analysis, which shows that only the Baroni dataset provides consistent results. We empirically show that a possible reason for its good performance is its alignment to dimensions specific to hypernymy: generality and similarity.


Introduction
Word embeddings have been widely used as features in NLP tasks like parsing and textual entailment. One key aspect that has been investigated is their capacity to encode hypernymy; this semantic relation denotes a taxonomical order of objects in the world; for example, a dog is a canine, which is a vertebrate. To test the ability of embeddings to encode hypernymy, previous work has proposed supervised models to learn whether a given pair of embeddings (w_i, w_j) is in the hypernymy relation (Roller et al., 2014; Necsulescu et al., 2015; Fu et al., 2014).
Results from previous work suggest that word embeddings indeed capture hypernymy information. This observation is relatively general and robust across several choices of datasets, models and embeddings. For example, Levy et al. (2015) achieve up to 0.85 F1, while Roller and Erk (2016) achieve up to 0.90 F1. Both of these results are achieved on the Baroni dataset (Baroni et al., 2012). For most other datasets, models achieve promising scores above 0.60 F1 points; e.g. Roller and Erk (2016) report 0.66 F1 points for a linear model on the balanced Turney dataset (Turney and Mohammad, 2015).
On closer inspection, however, we find that the current F1-based results may be somewhat misleading. In particular, several papers report F1 scores in the high 60% range on balanced datasets; on such datasets, a baseline that predicts every pair to be in the hypernym relation already achieves 66% F1. When we calculate accuracy instead of F1, we observe accuracies of around 50%-60% for state-of-the-art models, often barely above chance level (Table 3).
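The arithmetic behind the trivial baseline can be checked with a short sketch. The toy label list below is a made-up balanced test set, not data from any of the datasets discussed here. On a balanced set, the all-positive baseline has precision 0.5 and recall 1.0, so its F1 is roughly two thirds while its accuracy is exactly chance.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of pairs whose predicted label matches the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical balanced test set: 200 positive and 200 negative pairs.
y_true = [1] * 200 + [0] * 200
# The trivial baseline labels every pair as a hypernym pair.
y_all_positive = [1] * len(y_true)

_, _, baseline_f1 = precision_recall_f1(y_true, y_all_positive)
baseline_accuracy = accuracy(y_true, y_all_positive)
print(f"F1 = {baseline_f1:.3f}, accuracy = {baseline_accuracy:.3f}")
```

This is why an F1 score in the 60% range on a balanced dataset says little on its own: it has to be read against this baseline rather than against 50%.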
There is one striking exception when it comes to accuracy results. On the Baroni dataset, accuracy is as high as 81%. These observations lead us to the following questions regarding the datasets and overall results: Are the scores on the Baroni dataset high because it is an easy dataset? Or are they high because it is easier to learn hypernymy from the Baroni training set due to its design? To what extent can the Baroni dataset help us to predict hypernyms from word embeddings?
In this work we conduct the first dataset-centric analysis across 6 datasets to empirically answer the questions above. We take inspiration from the work of Torralba and Efros (2011) in the computer vision domain, where a set of datasets is compared and biases are exposed. In the same spirit, we compare a set of datasets by evaluating the ability of models trained on such datasets to generalize to different test distributions.
We show that the Baroni dataset outperforms the other datasets. In particular, we find that models trained on Baroni's data can outperform other models even on their home turf. For example, a model trained on Baroni's data can do better on the Kotlerman (Kotlerman et al., 2010) test set than models trained on a Kotlerman training set of the same size.
Furthermore, we show that the Baroni dataset seems to exhibit a pronounced behaviour along two dimensions known to be relevant for hypernymy: generality and similarity. This behaviour appears to be important for the success of Baroni's dataset: if we filter and resample other training datasets with respect to this behaviour, we generally achieve better results.

Background
We first give a brief overview of hypernymy detection, important findings in this domain, and then relevant work on dataset analysis.

Supervised Hypernym Detection
The task is posed as a binary classification problem. An instance is a labelled pair of word embeddings, e.g. (w_cat, w_animal, positive). A vector operation such as concatenation (concat) or difference (diff) is then applied to the two embeddings, and the result is the input to the classifier. Vylomova et al. (2016) learned a range of semantic relations, including hypernymy, using the diff operator and achieved positive results. Roller and Erk (2016) showed that concat with a logistic regression classifier learns to extract Hearst patterns (such as, including, etc.) from distributional vectors. Weeds et al. (2014) and Vylomova et al. (2016) described the lexical memorization phenomenon: a classifier learns that a word w_i is a hyponym of a word w_j based on the frequency with which w_j appears in the hypernym slot in positive pairs. In order to avoid high scores at test time due to this effect, Weeds et al. (2014) suggest having disjoint vocabularies between training and test sets.
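As an illustration of the two vector operations, the following minimal sketch builds classifier inputs from labelled word pairs. The three-dimensional embeddings and the pairs are invented for the example, and the direction of the difference (hypernym minus hyponym) is an assumption, since the text does not fix it.

```python
def concat(w_hypo, w_hyper):
    """Concatenate the two embeddings into one feature vector."""
    return list(w_hypo) + list(w_hyper)


def diff(w_hypo, w_hyper):
    """Element-wise difference; the hypernym-minus-hyponym order is assumed."""
    return [b - a for a, b in zip(w_hypo, w_hyper)]


def build_instances(pairs, embeddings, operator):
    """Turn (word_1, word_2, label) triples into (feature_vector, label)."""
    return [
        (operator(embeddings[w1], embeddings[w2]), label)
        for w1, w2, label in pairs
    ]


# Hypothetical toy embeddings and labelled pairs (1 = hypernymy, 0 = not).
toy_embeddings = {
    "cat": [0.9, 0.1, 0.3],
    "animal": [0.5, 0.4, 0.2],
    "car": [0.1, 0.8, 0.7],
}
toy_pairs = [("cat", "animal", 1), ("car", "animal", 0)]

concat_instances = build_instances(toy_pairs, toy_embeddings, concat)
diff_instances = build_instances(toy_pairs, toy_embeddings, diff)
print(concat_instances[0])
print(diff_instances[0])
```

Note that concat doubles the input dimensionality while diff keeps it unchanged, which matters when the training samples are small.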

Dataset Analysis
Torralba and Efros (2011) compared a set of object recognition datasets by testing each of them across different test distributions. In order to fairly compare these datasets, Torralba and Efros (2011) first eliminated some visible biases such as sample size by normalizing the datasets. In this way, other biases in the datasets were exposed such as the photographer's shooting position, or the labellers' perception, that may not be easily observable and may harm the classifier performance. Torralba and Efros (2011) concluded that some datasets are a better representation of the problem domain.

Materials
We describe both the datasets that we compare and the word embedding model that we use as features.

Datasets
We pick the datasets used by Levy et al. (2015) and Weeds et al. (2014).
Baroni: Baroni et al. (2012) drew instance pairs from WordNet and manually checked them to discard noisy ones.
Bless: The original dataset (Baroni and Lenci, 2011) contains several semantic relations. Levy et al. (2015) used the hypernymy pairs as positive instances and the pairs in all the other semantic relations as negative instances.
Levy: From a set of entailing propositions of the form (subject, verb, object) in Levy et al. (2014), Levy et al. (2015) extracted entailing nouns that shared two arguments to create instance pairs.
Weeds: Weeds et al. (2014) drew instance pairs from WordNet under the constraint that no word in a pair may appear in any other pair in the same role (hyponym or hypernym).

Word Embeddings
We pick what we believe to be one of the most representative word embedding models.
GloVe: Pennington et al. (2014) designed a vector space model using a log-bilinear regression function. They learned unsupervised word embeddings from a matrix of word co-occurrences while maintaining linear sub-structures in the resulting space.
We do not show results for the also widely used Word2Vec model, since we obtain similar results with it.

Cross-test Evaluation
We evaluate the robustness of the six datasets for generalising to different test distributions. In order to fairly compare the datasets, we follow Torralba and Efros (2011) and remove biases such as sample size and imbalance by sub-sampling the training sets uniformly at random with replacement. We obtain 20 subsets, i.e. samples, from each of the training sets. Each sample is normalized and balanced to 400 instances. We learn a model for each sample using the Scikit-learn (Pedregosa et al., 2011) package and test it on all six test sets. We try all combinations of vector operator (diff, concat) and classifier (logistic regression, SVM). Hyperparameter tuning and model selection are performed using self-validation sets. We report AUC and accuracy scores solely for the GloVe embeddings of dimensionality 50, given that the results on other embedding models are quite comparable.
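The sampling step can be sketched as follows. This is only an illustration under stated assumptions: "balanced" is taken to mean an equal number of positive and negative instances, the seeds 0 to 19 are arbitrary, and the toy training set is invented.

```python
import random


def balanced_sample(pairs, size=400, seed=0):
    """Draw `size` instances with replacement: half positive, half negative."""
    rng = random.Random(seed)
    positives = [p for p in pairs if p[2] == 1]
    negatives = [p for p in pairs if p[2] == 0]
    half = size // 2
    # random.choices samples uniformly at random with replacement.
    sample = rng.choices(positives, k=half) + rng.choices(negatives, k=half)
    rng.shuffle(sample)
    return sample


def draw_samples(pairs, n_samples=20, size=400):
    """One balanced sample per seed, so every run is reproducible."""
    return [balanced_sample(pairs, size=size, seed=s) for s in range(n_samples)]


# Hypothetical imbalanced training set of (word_1, word_2, label) triples.
toy_training_set = (
    [(f"hypo{i}", f"hyper{i}", 1) for i in range(150)]
    + [(f"left{i}", f"right{i}", 0) for i in range(900)]
)

samples = draw_samples(toy_training_set)
print(len(samples), len(samples[0]), sum(label for _, _, label in samples[0]))
```

Fixing the sample size and class balance in this way means that differences between training sets cannot be attributed to one of them simply being larger or less skewed.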

Ranking Pairs: AUC ROC
The Area Under the ROC Curve measures the ability of a classifier to rank positive instances above negative ones, independently of any threshold value. Unfortunately, this metric may yield an overoptimistic value on highly imbalanced data: a disproportionate number of negative instances will push the positive ones higher in the ranking, while false positives will only slightly affect the overall score (Zou et al., 2016). Therefore we balance the test sets using an under-sampling scheme. In Table 2 we can see that, remarkably, the Baroni dataset surpasses all datasets on their own self-test sets, except for the Bless test set. Interestingly, all the training sets performed better on the Baroni test set than on their self-test set (except for the Bless dataset). This indicates both the robust generalization and the superior performance of the Baroni dataset.
We note that no training sample overlaps with any self-test or cross-test set, except for the Weeds dataset. On the one hand, the Weeds training samples slightly overlap with the cross-test sets. On the other hand, the Weeds test set overlaps in at least 10% of its pairs with the cross-training samples. This may influence the cross-test scores (Vylomova et al., 2016).
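To make the evaluation concrete, here is a small sketch of under-sampling followed by a rank-based AUC computation. It is written with the standard library only, rather than with the Scikit-learn package used for the experiments, and the scored toy test set is invented. The AUC is computed from its probabilistic definition: the chance that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair.

```python
import random


def undersample(scored, seed=0):
    """Balance (label, score) items by discarding majority-class items."""
    rng = random.Random(seed)
    positives = [x for x in scored if x[0] == 1]
    negatives = [x for x in scored if x[0] == 0]
    n = min(len(positives), len(negatives))
    return rng.sample(positives, n) + rng.sample(negatives, n)


def roc_auc(scored):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos_scores = [s for y, s in scored if y == 1]
    neg_scores = [s for y, s in scored if y == 0]
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical imbalanced test set of (gold label, classifier score).
toy_scored = [(1, 0.9), (1, 0.6), (0, 0.7), (0, 0.2), (0, 0.1), (0, 0.05)]

full_auc = roc_auc(toy_scored)
balanced = undersample(toy_scored)
balanced_auc = roc_auc(balanced)
print(full_auc, len(balanced), balanced_auc)
```

The quadratic loop is fine for test sets of a few hundred pairs; a sort-based implementation would be preferable for larger ones.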

Detecting Hypernyms: Accuracy
We optimize a threshold, on self-validation sets, for each model in Section 4.1. In Table 3 we can see again the superior performance of the Baroni dataset. While the mean of all the self-test scores (main diagonal) is 0.606 points, Baroni achieves a mean of 0.655 points.
Interestingly, on average all the datasets perform close to random, with the exception of the Baroni and Weeds datasets. Furthermore, this poor behavior is also observed on the self-test sets of 3 datasets (Kotlerman, Levy, and Turney). This contrasts with the AUC scores obtained before. One possible cause may be a sensitivity problem in the threshold optimization.
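The threshold step can be illustrated with a minimal sketch. The selection procedure below (try every observed validation score as a cut-off and keep the one with the highest validation accuracy) is an assumption, as are the toy scores; the text only states that a threshold is optimized on self-validation sets.

```python
def accuracy_at(threshold, scored):
    """Accuracy when pairs scoring at or above `threshold` are labelled positive."""
    correct = sum((score >= threshold) == (label == 1) for label, score in scored)
    return correct / len(scored)


def best_threshold(validation):
    """Pick the candidate cut-off with the highest validation accuracy."""
    candidates = sorted({score for _, score in validation})
    # Also allow a cut-off above every score (predict everything negative).
    candidates.append(candidates[-1] + 1.0)
    # max() keeps the first maximum, i.e. the lowest of the tied cut-offs.
    return max(candidates, key=lambda t: accuracy_at(t, validation))


# Hypothetical (gold label, classifier score) items.
toy_validation = [(1, 0.8), (1, 0.55), (0, 0.6), (0, 0.3), (0, 0.2)]
toy_test = [(1, 0.7), (0, 0.5), (1, 0.4), (0, 0.1)]

threshold = best_threshold(toy_validation)
validation_accuracy = accuracy_at(threshold, toy_validation)
test_accuracy = accuracy_at(threshold, toy_test)
print(threshold, validation_accuracy, test_accuracy)
```

In this toy example two cut-offs (0.55 and 0.8) tie on the validation set, which hints at how a threshold chosen on a small validation set can be unstable even when the ranking quality, as measured by AUC, is good.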

Dataset Analysis
We provide an empirical rationale behind the good performance of the Baroni dataset: we believe it aligns to two dimensions specific to hypernymy, generality and similarity; that is, the instances in the dataset form what we believe to be patterns denoting hypernymy. We explain these patterns below.
We use WordNet (Fellbaum, 1998) to compute both generality and similarity levels. We define generality levels as the absolute difference, in number of edges, between the distances of two words to the root of the taxonomy: g = |distance(word_1, root) − distance(word_2, root)|. We define similarity levels as the similarity score between two words; we use the Wu-Palmer function.
We now explain the patterns mentioned above. At generality level g = 0, where co-hyponyms exist, we expect only negative pairs to populate the dataset. At the remaining levels, we would expect a distribution where the number of instance pairs is inversely proportional to the generality level, because the branching factor at the bottom levels is greater by a factor α than at the top levels; this means that we are more likely to sample pairs of words connected by fewer edges than pairs connected by more edges.
Table 3: Cross-test performance: Mean accuracy scores over 20 samples. Self-test score in bold.
For the similarity distribution, in contrast, we expect positive instances to dominate at high similarity values, because the number of edges between the words in a true hypernym pair is generally smaller than between the words in a non-hypernym pair. In addition, as we argued for the generality distribution, we are more likely to sample shorter hypernym pairs than longer ones.
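The two measures can be sketched on a toy taxonomy. Everything in this example is an assumption made for illustration: the small single-inheritance tree stands in for WordNet, each word has exactly one sense, and Wu-Palmer similarity is computed as 2 * depth(lcs) / (depth(word_1) + depth(word_2)) with the root at depth 1 and lcs the deepest common ancestor.

```python
# Hypothetical toy taxonomy as child -> parent; "entity" is the root.
PARENT = {
    "animal": "entity",
    "vertebrate": "animal",
    "invertebrate": "animal",
    "canine": "vertebrate",
    "feline": "vertebrate",
    "dog": "canine",
    "cat": "feline",
}


def path_to_root(word):
    """List of nodes from `word` up to the root, inclusive."""
    path = [word]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def distance_to_root(word):
    """Number of edges between `word` and the root."""
    return len(path_to_root(word)) - 1


def generality(word_1, word_2):
    """g = |distance(word_1, root) - distance(word_2, root)|."""
    return abs(distance_to_root(word_1) - distance_to_root(word_2))


def depth(word):
    """Depth with the root counted as 1 (an assumed convention)."""
    return distance_to_root(word) + 1


def wu_palmer(word_1, word_2):
    """Wu-Palmer similarity based on the deepest common ancestor."""
    ancestors_2 = set(path_to_root(word_2))
    # Walking upwards from word_1, the first shared node is the deepest one.
    lcs = next(node for node in path_to_root(word_1) if node in ancestors_2)
    return 2 * depth(lcs) / (depth(word_1) + depth(word_2))


print(generality("dog", "cat"), wu_palmer("dog", "cat"))
print(generality("cat", "animal"), wu_palmer("cat", "animal"))
print(generality("cat", "invertebrate"), wu_palmer("cat", "invertebrate"))
```

In this toy tree the co-hyponym-like pair (dog, cat) sits at g = 0, while the true hypernym pair (cat, animal) sits at g = 3, which is the kind of separation the expected generality distribution relies on.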

Exploring the Baroni dataset
In Fig. 1 we see that at level g = 0 only negative pairs are found in the Baroni dataset. We also observe that the distribution matches the expected distribution along generality levels. In Fig. 2 we see that from the level s = 0.2 towards the highest levels, there is a clear dominance of positive pairs, though we also find negative pairs at these levels. These negative pairs may be reversed positive pairs, e.g. (w_animal, w_cat, negative), or pairs with related words, e.g. (w_cat, w_invertebrate, negative). We also see that from the level s = 0.1 towards the lowest levels, the negative pairs dominate.
We compare the Baroni distribution with the Turney distribution. In Fig. 3 we observe that the shape of the generality distribution roughly fits our expected distribution; however, we see that positive pairs populate level g = 0. This seems to show that around 10% of the positive pairs in the Turney dataset are spurious pairs.
In Fig. 4 we observe that the similarity distribution from the Turney dataset does not fit the expected distribution. Even though at high levels the dominance is mainly of positive pairs, at low levels we also see a strong presence of positive pairs along with negative pairs. This may imply that a high number of positive pairs are noisy or inconsistent, which may explain the low performance of the Turney dataset.
Figure 1: Distribution of instance pairs on the Baroni dataset along generality levels.

Mimicking the Baroni Distribution
We believe that the patterns found in the Baroni training set may be part of the cause of its good performance. To corroborate our hypothesis, we draw a new training set from the union of all the training sets such that we mimic the Baroni distributions in Fig. 1 and Fig. 2. More specifically, we allow a pair to populate our new training set if it fulfils constraints regarding the number of instances along generality and similarity levels.
One example constraint that needs to be fulfilled for positive pairs is: IF generality level g > 0 AND the positive vs. negative pairs ratio is fulfilled according to ratio r_g AND similarity level s >= 0.1 AND the positive vs. negative pairs ratio is fulfilled according to ratio r_s THEN accept pair.
We obtain 20 balanced and normalized samples populated with 400 instances in each of them. We compare against a dataset baseline where we allow any pair, chosen uniformly at random, to populate the baseline. For building the dataset baseline, we use the same random seeds as those used for building the samples that mimic the Baroni distribution. In Table 4 we see how the new training set robustly outperforms the baseline. These results support our hypothesis for why the Baroni dataset is able to outperform all the datasets.
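The example acceptance rule for positive pairs can be sketched as below. How the ratios r_g and r_s are enforced is not spelled out in the text, so this sketch swaps them for fixed per-level quotas on the number of accepted positive pairs; the quotas, the similarity bins of width 0.1 and the candidate pairs are all hypothetical.

```python
from collections import Counter


def similarity_level(s):
    """Bin a similarity score into tenths: 0.62 -> 6, 0.5 -> 5 (assumed binning)."""
    return int(s * 10 + 1e-9)


def accept_positive(g, s, accepted, quota_g, quota_s):
    """Quota-based stand-in for the ratio checks in the example rule."""
    if g <= 0:      # level g = 0 is reserved for negative pairs
        return False
    if s < 0.1:     # the lowest similarity levels are left to negative pairs
        return False
    level = similarity_level(s)
    if accepted["g", g] >= quota_g.get(g, 0):
        return False    # enough positives at this generality level
    if accepted["s", level] >= quota_s.get(level, 0):
        return False    # enough positives at this similarity level
    accepted["g", g] += 1
    accepted["s", level] += 1
    return True


# Hypothetical quotas of positive pairs per generality / similarity level.
quota_g = {1: 2, 2: 1}
quota_s = {5: 2, 6: 1}

# Hypothetical positive candidates as (generality level, similarity score).
candidates = [(0, 0.6), (1, 0.05), (1, 0.6), (1, 0.62), (2, 0.5), (1, 0.5)]

accepted = Counter()
decisions = [accept_positive(g, s, accepted, quota_g, quota_s) for g, s in candidates]
print(decisions)
```

Because the decision depends on how many pairs have already been accepted, the outcome depends on the order in which candidates are visited; shuffling the candidate pool with a fixed seed would keep such a procedure reproducible.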

Conclusions
We performed the first dataset-centric analysis for investigating how well we can predict hypernym pairs from word embeddings. We showed in cross-test evaluations that, in contrast to what results from previous work suggest, the Baroni dataset is the only one that consistently enables us to predict hypernym pairs. We empirically showed that the superior performance of the Baroni dataset may be in part due to its alignment to two dimensions relevant to hypernymy: generality and similarity. We empirically corroborated this hypothesis by building a new training set that mimics the Baroni distribution and outperforms, on average, a dataset baseline.