Hierarchical Embeddings for Hypernymy Detection and Directionality

We present a novel neural model HyperVec to learn hierarchical embeddings for hypernymy detection and directionality. While previous embeddings have shown limitations on prototypical hypernyms, HyperVec represents an unsupervised measure where embeddings are learned in a specific order and capture the hypernym–hyponym distributional hierarchy. Moreover, our model is able to generalize over unseen hypernymy pairs, when using only small sets of training data, and by mapping to other languages. Results on benchmark datasets show that HyperVec outperforms both state-of-the-art unsupervised measures and embedding models on hypernymy detection and directionality, and on predicting graded lexical entailment.


Introduction
Hypernymy represents a major semantic relation and a key organization principle of semantic memory (Miller and Fellbaum, 1991; Murphy, 2002). It is an asymmetric relation between two terms, a hypernym (superordinate) and a hyponym (subordinate), as in animal-bird and flower-rose, where the hyponym necessarily implies the hypernym, but not vice versa. From a computational point of view, automatic hypernymy detection is useful for NLP tasks such as taxonomy creation (Snow et al., 2006; Navigli et al., 2011), recognizing textual entailment (Dagan et al., 2013), and text generation (Biran and McKeown, 2013), among many others.
Two families of approaches to identify and discriminate hypernyms are predominant in NLP, both of them relying on word vector representations. Distributional count approaches make use of either unsupervised directional measures or supervised classification methods. Unsupervised measures exploit the distributional inclusion hypothesis (Geffet and Dagan, 2005; Zhitomirsky-Geffet and Dagan, 2009), or the distributional informativeness hypothesis (Santus et al., 2014; Rimell, 2014). These measures assign scores to semantic relation pairs, and hypernymy scores are expected to be higher than those of other relation pairs. Typically, Average Precision (AP) (Kotlerman et al., 2010) is applied to rank and distinguish between the predicted relations. Supervised classification methods represent each pair of words as a single vector, by using the concatenation or the element-wise difference of their vectors (Baroni et al., 2012; Roller et al., 2014; Weeds et al., 2014). The resulting vector is fed into a Support Vector Machine (SVM) or into Logistic Regression (LR) to predict hypernymy. Across approaches, Shwartz et al. (2017) demonstrated that there is no single unsupervised measure which consistently deals well with discriminating hypernymy from other semantic relations. Furthermore, Levy et al. (2015) showed that supervised methods memorize prototypical hypernyms instead of learning a relation between two words.
Approaches of hypernymy-specific embeddings utilize neural models to learn vector representations for hypernymy. Yu et al. (2015) proposed a supervised method to learn term embeddings for hypernymy identification, based on pre-extracted hypernymy pairs. Recently, Tuan et al. (2016) proposed a dynamic weighting neural model to learn term embeddings in which the model encodes not only the information of hypernyms vs. hyponyms, but also their contextual information. The performance of this family of models is typically evaluated by using an SVM to discriminate hypernymy from other relations.
In this paper, we propose a novel neural model HyperVec to learn hierarchical embeddings that (i) discriminate hypernymy from other relations (detection task), and (ii) distinguish between the hypernym and the hyponym in a given hypernymy relation pair (directionality task). Our model learns to strengthen the distributional similarity of hypernym pairs in comparison to other relation pairs, by moving hyponym and hypernym vectors close to each other. In addition, we generate a distributional hierarchy between hyponyms and hypernyms. Relying on these two new aspects of hypernymy distributions, the similarity of hypernym pairs receives higher scores than the similarity of other relation pairs; and the distributional hierarchy of hyponyms and hypernyms indicates the directionality of hypernymy.
Our model is inspired by the distributional inclusion hypothesis, that prominent context words of hyponyms are expected to appear in a subset of the hypernym contexts. We assume that each context word which appears with both a hyponym and its hypernym can be used as an indicator to determine which of the two words is semantically more general: Common context word vectors which represent distinctive characteristics of a hyponym are expected to be closer to the hyponym vector than to its hypernym vector. For example, the context word flap is more characteristic for a bird than for its hypernym animal; hence, the vector of flap should be closer to the vector of bird than to the vector of animal.
We evaluate our HyperVec model on both unsupervised and supervised hypernymy detection and directionality tasks. In addition, we apply the model to the task of graded lexical entailment (Vulić et al., 2016), and we assess the capability of HyperVec on generalizing hypernymy by mapping to German and Italian. Results on benchmark datasets of hypernymy show that the hierarchical embeddings outperform state-of-the-art measures and previous embedding models. Furthermore, the implementation of our models is made publicly available (www.ims.uni-stuttgart.de/data/hypervec).

Related Work
Unsupervised hypernymy measures: A variety of directional measures for unsupervised hypernymy detection (Weeds and Weir, 2003; Weeds et al., 2004; Clarke, 2009; Kotlerman et al., 2010; Lenci and Benotto, 2012) all rely on some variation of the distributional inclusion hypothesis: If u is a semantically narrower term than v, then a significant number of salient distributional features of u is expected to be included in the feature vector of v as well. In addition, Santus et al. (2014) proposed the distributional informativeness hypothesis, that hypernyms tend to be less informative than hyponyms, and that they occur in more general contexts than their hyponyms. All of these approaches represent words as vectors in distributional semantic models (Turney and Pantel, 2010), relying on the distributional hypothesis (Harris, 1954; Firth, 1957). For evaluation, these directional models use the AP measure to assess the proportion of hypernyms at the top of a score-sorted list. In a different vein, Kiela et al. (2015) introduced three unsupervised methods drawn from visual properties of images to determine a concept's generality in hypernymy tasks.

Supervised hypernymy methods:
The studies in this area are based on word embeddings which represent words as low-dimensional and real-valued vectors (Mikolov et al., 2013b; Pennington et al., 2014). Each hypernymy pair is encoded by some combination of the two word vectors, such as concatenation (Baroni et al., 2012) or difference (Roller et al., 2014; Weeds et al., 2014). Hypernymy is distinguished from other relations by using a classification approach, such as SVM or LR. Because word embeddings are trained for similar and symmetric vectors, it is however unclear whether the supervised methods actually learn the asymmetry in hypernymy (Levy et al., 2015).
Hypernymy-specific embeddings: These approaches are closest to our work. Yu et al. (2015) proposed a dynamic distance-margin model to learn term embeddings that capture properties of hypernymy. The neural model is trained on pre-extracted taxonomic relation data. The resulting term embeddings are fed to an SVM classifier to predict hypernymy. However, this model only learns term pairs without considering their contexts, leading to a lack of generalization for term embeddings. Tuan et al. (2016) introduced a dynamic weighting neural network to learn term embeddings that encode information about hypernymy and also about their contexts, considering all words between a hypernym and its hyponym in a sentence. The proposed model is trained on a set of hypernym relations extracted from WordNet (Miller, 1995). The embeddings are applied as features to detect hypernymy, using an SVM classifier. Tuan et al. (2016) handle the drawback of the approach by Yu et al. (2015) by considering the contextual information between two terms; however, the method is still not able to determine the directionality of a hypernym pair. Vendrov et al. (2016) proposed a method to encode order into learned distributed representations, to explicitly model the partial order structure of the visual-semantic hierarchy or the hierarchy of hypernymy in WordNet. The resulting vectors are used to predict the transitive hypernym relations in WordNet.

Hierarchical Embeddings
In this section, we present our model of hierarchical embeddings HyperVec. Section 3.1 describes how we learn the embeddings for hypernymy, and Section 3.2 introduces the unsupervised measure HyperScore that is applied to the hypernymy tasks.

Learning Hierarchical Embeddings
To learn hierarchical embeddings, our approach makes use of a set of hypernyms, which could be obtained either by exploiting the transitivity of the hypernymy relation (Fallucchi and Zanzotto, 2011) or from lexical databases. We rely on WordNet, a large lexical database of English (Fellbaum, 1998), and extract all hypernym-hyponym pairs for nouns and for verbs, including both direct and indirect hypernymy, e.g., animal-bird, bird-robin, animal-robin. Before training our model, we exclude all hypernym pairs which appear in any of the datasets used for evaluation.
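The extraction of direct and indirect hypernymy can be illustrated with a minimal sketch. It assumes the direct hypernym-hyponym links are already available as plain string pairs (the function names and the toy input are hypothetical, and no WordNet API is used); indirect pairs then follow from transitivity, and evaluation pairs are filtered out afterwards.

```python
from collections import defaultdict


def hypernym_closure(direct_pairs):
    """Return all (hypernym, hyponym) pairs implied by transitivity.

    direct_pairs: iterable of (hypernym, hyponym) links, e.g. ("animal", "bird").
    """
    children = defaultdict(set)
    for hypernym, hyponym in direct_pairs:
        children[hypernym].add(hyponym)

    closure = set()
    for root in list(children):
        # Depth-first walk over everything reachable below `root`.
        stack = list(children[root])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            closure.add((root, node))
            stack.extend(children.get(node, ()))
    return closure


def remove_evaluation_pairs(pairs, evaluation_pairs):
    """Drop every training pair that also occurs in an evaluation dataset."""
    return set(pairs) - set(evaluation_pairs)


toy_links = [("animal", "bird"), ("bird", "robin")]
training_pairs = remove_evaluation_pairs(
    hypernym_closure(toy_links), [("animal", "bird")]
)
```

With the toy links above, the closure adds the indirect pair animal-robin, and the filter removes animal-bird because it is assumed to occur in an evaluation set.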
In the following, Section 3.1.1 first describes the Skip-gram model which is integrated into our model for optimization. Section 3.1.2 then describes the objective functions to train the hierarchical embeddings for hypernymy.

Skip-gram Model
The Skip-gram model is a word embedding method suggested by Mikolov et al. (2013b). Levy and Goldberg (2014) introduced a variant of the Skip-gram model with negative sampling (SGNS), in which the objective function is defined as follows:

$$J_{SGNS} = \sum_{w \in V_W} \sum_{c \in V_C} J_{(w,c)} \tag{1}$$

$$J_{(w,c)} = \#(w,c) \log \sigma(\vec{w} \cdot \vec{c}) + k \cdot \mathbb{E}_{c_N \sim P}\left[\log \sigma(-\vec{w} \cdot \vec{c}_N)\right] \tag{2}$$

where the skip-gram with negative sampling is trained on a corpus of words w ∈ V_W and their contexts c ∈ V_C, with V_W and V_C the word and context vocabularies, respectively. The collection of observed word and context pairs is denoted as D; the term #(w, c) refers to the number of times the pair (w, c) appeared in D; the term σ(x) is the sigmoid function; the term k is the number of negative samples; and the term c_N is the sampled context, drawn according to the empirical unigram distribution P.
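To make the role of each term concrete, the following sketch evaluates the local SGNS objective for a single word-context pair, in the standard form given by Levy and Goldberg (2014): #(w, c) · log σ(w · c) plus k times the expected value of log σ(−w · c_N). Approximating the expectation by an average over a handful of sampled negative contexts is an assumption of this sketch, as are all function and variable names.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sgns_local_objective(w_vec, c_vec, count_wc, negative_vecs, k):
    """Local SGNS objective for one (w, c) pair.

    count_wc      -- #(w, c), how often the pair was observed in D
    negative_vecs -- vectors of contexts c_N sampled from the unigram distribution
    k             -- number of negative samples
    """
    positive = count_wc * math.log(sigmoid(dot(w_vec, c_vec)))
    # Monte Carlo estimate of E[log sigma(-w . c_N)] over the sampled contexts.
    expected_negative = sum(
        math.log(sigmoid(-dot(w_vec, n))) for n in negative_vecs
    ) / len(negative_vecs)
    return positive + k * expected_negative
```

The objective is maximized during training: observed pairs are pushed towards a large dot product, sampled contexts towards a negative one.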

Hierarchical Hypernymy Model
Vector representations for detecting hypernymy are usually encoded by standard first-order distributional co-occurrences. As such, they are insufficient to differentiate hypernymy from other paradigmatic relations such as synonymy, meronymy, antonymy, etc. Incorporating directional measures of hypernymy that exploit the common contexts of hypernym and hyponym improves this relation distinction, but such measures still have difficulty distinguishing between hypernymy and meronymy. Our novel approach presents two solutions to deal with these challenges. First of all, the embeddings are learned in a specific order, such that the similarity score for hypernymy is higher than the similarity score for other relations. For example, the hypernym pair animal-frog will be assigned a higher cosine score than the co-hyponymy pair eagle-frog. Secondly, the embeddings are learned to capture the distributional hierarchy between hyponym and hypernym, as an indicator to differentiate between hypernym and hyponym. For example, given a hyponym-hypernym pair (p, q), we can exploit the Euclidean norms of q and p to differentiate between the two words, such that the Euclidean norm of the hypernym q is larger than the Euclidean norm of the hyponym p.
Inspired by the distributional lexical contrast model in Nguyen et al. (2016) for distinguishing antonymy from synonymy, this paper proposes two objective functions to learn hierarchical embeddings for hypernymy.
Before moving to the details of the two objective functions, we first define the terms as follows: W(c) refers to the set of words co-occurring with the context c within a certain window size; H(w) denotes the set of hypernyms of the word w; the two terms H+(w, c) and H−(w, c) are drawn from H(w), and are defined as follows:

$$H^{+}(w,c) = \{u \in W(c) \cap H(w) : \cos(\vec{w},\vec{c}) - \cos(\vec{u},\vec{c}) \geq \theta\}$$

$$H^{-}(w,c) = \{v \in W(c) \cap H(w) : \cos(\vec{w},\vec{c}) - \cos(\vec{v},\vec{c}) < \theta\}$$

where cos(x, y) stands for the cosine similarity of the two vectors x and y; θ is the margin.
The set H+(w, c) contains all hypernyms u of the word w that share the context c and satisfy the constraint that the cosine similarity of the pair (w, c) is higher than the cosine similarity of the pair (u, c) by at least the margin θ. Similarly, the set H−(w, c) contains all hypernyms v of the word w that share the context c and for which the cosine similarity difference between the pair (w, c) and the pair (v, c) stays below the margin θ. The two objective functions are defined as follows:

$$L_{(w,c)} = \frac{1}{\#(w,u)} \sum_{u \in H^{+}(w,c)} \partial(\vec{w},\vec{u}) \tag{3}$$

$$L_{(v,w,c)} = \sum_{v \in H^{-}(w,c)} \partial(\vec{v},\vec{w}) \tag{4}$$

where the term ∂(x, y) stands for the cosine derivative of (x, y); ∂ is then optimized by the negative sampling procedure.
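The partition of a word's hypernyms into the two sets can be sketched as follows. The sketch assumes that the margin condition reads cos(w, c) − cos(u, c) ≥ θ for H+ and the complementary condition for H−, that word and context vectors are stored in plain dictionaries, and that no vector is all zeros; the names and the toy vectors are hypothetical.

```python
import math


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def split_hypernyms(w, c, hypernyms_of_w, words_with_c, vectors, context_vectors, theta):
    """Split the hypernyms of w that co-occur with context c into (H+, H-)."""
    h_plus, h_minus = [], []
    cos_wc = cosine(vectors[w], context_vectors[c])
    for h in hypernyms_of_w:
        if h not in words_with_c:
            continue  # only hypernyms sharing the context c are considered
        margin = cos_wc - cosine(vectors[h], context_vectors[c])
        if margin >= theta:
            h_plus.append(h)   # c is distinctive for the hyponym w
        else:
            h_minus.append(h)  # c is not distinctive for w by the margin
    return h_plus, h_minus


vectors = {"bird": [1.0, 0.0], "animal": [0.0, 1.0]}
contexts = {"flap": [1.0, 0.0], "rights": [0.0, 1.0]}
```

In this toy space, the context flap puts animal into H+ of bird, while the context rights puts it into H−, mirroring the two cases the objective functions handle.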
The objective function in Equation 3 minimizes the distributional difference between the hyponym w and the hypernym u by exploiting the common context c. More specifically, if the common context c is a distinctive characteristic of the hyponym w (i.e., the common context c is closer to the hyponym w than to the hypernym u), the objective function L_(w,c) tries to decrease the distributional generality of the hypernym u by moving w closer to u. For example, given the hypernym-hyponym pair animal-bird, the context flap is a distinctive characteristic of bird, because almost every bird can flap, but not every animal can flap. Therefore, the context flap is closer to the hyponym bird than to the hypernym animal. The model then tries to move bird closer to animal in order to enforce the similarity between bird and animal, and to decrease the distributional generality of animal.
In contrast to Equation 3, the objective function in Equation 4 minimizes the distributional difference between the hyponym w and the hypernym v by exploiting the common context c, which is a distinctive characteristic of the hypernym v. In this case, the objective function L_(v,w,c) tries to reduce the distributional generality of the hyponym w by moving v closer to w. For example, the context word rights, a distinctive characteristic of the hypernym animal, should be closer to animal than to bird. Hence, the model tries to move the hypernym animal closer to the hyponym bird. Given that hypernymy is an asymmetric and also a hierarchical relation, where each hypernym may contain several hyponyms, our objective functions simultaneously update both the hypernym and all of its hyponyms; therefore, our objective functions are able to capture the hierarchical relations between the hypernym and its hyponyms. Moreover, in our model, the margin θ plays a role in learning the hierarchy of hypernymy, and in preventing the model from minimizing the distance of synonymy or antonymy, because synonymy and antonymy share many contexts.
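In the final step, the objective function which is used to learn the hierarchical embeddings for hypernymy combines Equations 1, 2, 3, and 4 by the objective function in Equations 5 and 6:

$$L_{(w,v,c)} = J_{(w,c)} + L_{(w,c)} + L_{(v,w,c)} \tag{5}$$

$$L = \sum_{w \in V_W} \sum_{c \in V_C} L_{(w,v,c)} \tag{6}$$

where J_(w,c) is the local SGNS objective for the pair (w, c) from Equation 2, and L_(w,c) and L_(v,w,c) are the objective functions from Equations 3 and 4.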

Unsupervised Hypernymy Measure
HyperVec is expected to show the following two properties: (i) the hyponym and the hypernym are close to each other, and (ii) there exists a distributional hierarchy between hypernyms and their hyponyms. Given a hypernymy pair (u, v) in which u is the hyponym and v is the hypernym, we propose a measure to detect hypernymy and to determine the directionality of hypernymy by using the hierarchical embeddings as follows:

$$\mathrm{HyperScore}(u,v) = \cos(\vec{u},\vec{v}) \cdot \frac{\lVert\vec{v}\rVert}{\lVert\vec{u}\rVert} \tag{7}$$

where cos(u, v) is the cosine similarity between u and v, and ‖·‖ is the magnitude of the vector (or the Euclidean norm). The cosine similarity is applied to distinguish hypernymy from other relations, due to the first property of the hierarchical embeddings, while the second property is used to decide about the directionality of hypernymy, assuming that the magnitude of the hypernym is larger than the magnitude of the hyponym. Note that the proposed hypernymy measure is unsupervised when the resource is only used to learn hierarchical embeddings.
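The measure described above can be sketched in a few lines. The sketch assumes that HyperScore is the cosine similarity of the pair multiplied by the ratio of the hypernym magnitude to the hyponym magnitude, which is how the two properties are combined in the description; the toy vectors are hypothetical and only illustrate the asymmetry.

```python
import math


def norm(x):
    return math.sqrt(sum(a * a for a in x))


def cosine(x, y):
    return sum(a * b for a, b in zip(x, y)) / (norm(x) * norm(y))


def hyperscore(u, v):
    """Score the ordered pair (u, v), with u the candidate hyponym and v the candidate hypernym."""
    return cosine(u, v) * norm(v) / norm(u)


def predict_hypernym(word_a, vec_a, word_b, vec_b):
    """Directionality: the word with the larger vector magnitude is taken as the hypernym."""
    return word_a if norm(vec_a) > norm(vec_b) else word_b


duck, animal = [1.0, 0.0], [2.0, 0.0]  # toy vectors, not learned embeddings
forward = hyperscore(duck, animal)     # hyponym first: magnitude ratio above 1
backward = hyperscore(animal, duck)    # reversed pair: magnitude ratio below 1
```

Because the cosine term is symmetric, the difference between the two orders comes entirely from the magnitude ratio, which is what allows one score to serve both detection and directionality.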

Experiments
In this section, we first describe the experimental settings in our experiments (Section 4.1). We then evaluate the performance of HyperVec on three different tasks: i) unsupervised hypernymy detection and directionality (Section 4.2), where we assess HyperVec on ranking and classifying hypernymy; ii) supervised hypernymy detection (Section 4.3), where we apply supervised classification to detect hypernymy; iii) graded lexical entailment (Section 4.4), where we predict the strength of hypernymy pairs.

Unsupervised Hypernymy Detection and Directionality
In this section, we assess our model on two experimental setups: i) a ranking retrieval setup that expects hypernymy pairs to have a higher similarity score than instances from other semantic relations; ii) a classification setup that requires both hypernymy detection and directionality.
[...] (Santus et al., 2015), and LENCI&BENOTTO (Benotto, 2015). Table 1 describes the details of these datasets in terms of the semantic relations and the number of instances. The Average Precision (AP) ranking measure is used to evaluate the performance of the measures.
In comparison to the state-of-the-art unsupervised measures compared by Shwartz et al. (2017) (henceforth, baseline models), we apply our unsupervised measure HyperScore (Equation 7) to rank hypernymy against other relations. Table 2 presents the results of using HyperScore vs. the best baseline models, across datasets. When detecting hypernymy among all other relations (which is the most challenging task), HyperScore significantly outperforms all baseline variants on all datasets. The strongest difference is reached on the BLESS dataset, where HyperScore achieves an improvement of 40% AP score over the best baseline model. When ranking hypernymy in comparison to a single other relation, HyperScore also improves over the baseline models, except for the event relation in the BLESS dataset. We assume that this is due to the different parts-of-speech (adjective and noun) involved in the relation, where HyperVec fails to establish a hierarchy.
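For reference, the AP ranking measure used in this setup can be computed as in the following sketch: pairs are sorted by score, and precision is averaged over the ranks at which a hypernymy pair occurs. The input format (a list of score and label tuples) and the toy scores are assumptions made for illustration.

```python
def average_precision(scored_pairs):
    """Average Precision over a score-sorted list.

    scored_pairs: iterable of (score, is_hypernymy) tuples.
    """
    ranked = sorted(scored_pairs, key=lambda item: item[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


toy_scores = [(0.9, True), (0.8, False), (0.7, True)]
toy_ap = average_precision(toy_scores)
```

A measure that places all hypernymy pairs above all other pairs obtains an AP of 1.0; every non-hypernymy pair ranked above a hypernymy pair lowers the value.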

Classification
In this setup, we rely on three datasets of semantic relations, which were all used in various state-of-the-art approaches before, and brought together for hypernymy evaluation by Kiela et al. (2015).
(i) A subset of BLESS contains 1,337 hyponym-hypernym pairs. The task is to predict the directionality of hypernymy within a binary classification. Our approach requires no threshold; we only need to compare the magnitudes of the two words and to assign the hypernym label to the word with the larger magnitude.
(ii) WBLESS [...] reversed hypernym-hyponym pairs, plus additional holonym-meronym pairs, co-hyponyms and randomly matched nouns. For this classification we make use of our HyperScore measure that ranks hypernymy pairs higher than other relation pairs. A threshold decides about the splitting point between the two classes: hyper vs. other. Instead of using a manually defined threshold as done by Kiela et al. (2015), we decided to run 1,000 iterations which randomly sampled only 2% of the available pairs for learning a threshold, using the remaining 98% for test purposes. We present average accuracy results across all iterations. Figure 1b compares the default cosine similarities between the relation pairs (as applied by SGNS) and HyperScore (as applied by HyperVec) on this task. Using HyperScore, the class "hyper" can clearly be distinguished from the class "other".
(iii) BIBLESS represents the most challenging dataset; the relation pairs from WBLESS are split into three classes instead of two: hypernymy pairs, reversed hypernymy pairs, and other relation pairs. In this case, we perform a three-way classification. We apply the same technique as used for the WBLESS classification, but in cases where we classify hyper we additionally classify the hypernymy direction, to decide between hyponym-hypernym pairs and reversed hypernym-hyponym pairs. Table 3 compares our results against related work. HyperVec outperforms all other methods on all three tasks. In addition, we see again that an unmodified SGNS model cannot solve any of the three tasks.
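The repeated threshold-learning protocol (a small random sample for learning the threshold, the rest for testing, accuracy averaged over iterations) can be sketched as follows. How the threshold is chosen on the sample is not specified above; picking the sampled score that maximizes accuracy on the sample is an assumption of this sketch, as are the function names and the fixed random seed.

```python
import random


def learn_threshold(train):
    """Pick the score threshold that maximizes accuracy on the training sample.

    train: list of (score, label) tuples with label "hyper" or "other".
    A pair is predicted as "hyper" when its score is >= the threshold.
    """
    best_threshold, best_accuracy = None, -1.0
    for candidate in sorted({score for score, _ in train}):
        correct = sum((s >= candidate) == (lab == "hyper") for s, lab in train)
        accuracy = correct / len(train)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = candidate, accuracy
    return best_threshold


def average_accuracy(data, iterations=1000, train_fraction=0.02, seed=0):
    """Average test accuracy over repeated random train/test splits."""
    rng = random.Random(seed)
    n_train = max(1, int(len(data) * train_fraction))
    total = 0.0
    for _ in range(iterations):
        shuffled = data[:]
        rng.shuffle(shuffled)
        train, test = shuffled[:n_train], shuffled[n_train:]
        threshold = learn_threshold(train)
        correct = sum((s >= threshold) == (lab == "hyper") for s, lab in test)
        total += correct / len(test)
    return total / iterations
```

For the three-way setting, pairs predicted as hyper would additionally be split by comparing the magnitudes of the two word vectors.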

Supervised Hypernymy Detection
For supervised hypernymy detection, we make use of two datasets: the full BLESS dataset, and ENTAILMENT (Baroni et al., 2012), containing 2,770 relation pairs in total, including 1,385 hypernym pairs and 1,385 other relation pairs. We follow the same procedure as Yu et al. (2015) and Tuan et al. (2016) to assess HyperVec on the two datasets. Regarding BLESS, we extract pairs for four types of relations: hypernymy, meronymy, co-hyponymy (or coordination), and the random relation for nouns. For the evaluation, we randomly select one concept and its relata for testing, and train the supervised model on the 199 remaining concepts and their relata. We then report the average accuracy across all concepts. For the ENTAILMENT dataset, we randomly select one hypernym pair for testing and train on all remaining hypernym pairs. Again, we report the average accuracy across all hypernyms.
We apply an SVM classifier to detect hypernymy based on HyperVec. Given a hyponym-hypernym pair (u, v), we concatenate four components to construct the vector for the pair: the vector difference between hypernym and hyponym (v − u); the cosine similarity between the hypernym and hyponym vectors (cos(u, v)); the magnitude of the hyponym (‖u‖); and the magnitude of the hypernym (‖v‖). The resulting vector is fed into the SVM classifier to detect hypernymy. Similar to the two previous works, we train the SVM classifier with the RBF kernel, λ = 0.03125, and the penalty C = 8.0. Table 4 shows the performance of HyperVec and the two baseline models reported by Tuan et al. (2016). HyperVec slightly outperforms the method of Tuan et al. (2016) on the BLESS dataset, and is equivalent to the performance of their method on the ENTAILMENT dataset. In comparison to the method of Yu et al. (2015), HyperVec achieves significant improvements.
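The construction of the pair representation can be sketched as follows: the difference vector is followed by three scalar features, so a pair of d-dimensional embeddings yields a (d + 3)-dimensional input for the classifier. The function names and toy vectors are hypothetical, and the classifier itself is left out of the sketch.

```python
import math


def norm(x):
    return math.sqrt(sum(a * a for a in x))


def cosine(x, y):
    return sum(a * b for a, b in zip(x, y)) / (norm(x) * norm(y))


def pair_features(u, v):
    """Feature vector for a (hyponym u, hypernym v) candidate pair.

    Layout: [v - u (d values), cos(u, v), ||u||, ||v||]
    """
    difference = [b - a for a, b in zip(u, v)]
    return difference + [cosine(u, v), norm(u), norm(v)]


features = pair_features([3.0, 4.0], [6.0, 8.0])
```

Such vectors could then be passed to any SVM implementation with an RBF kernel; for instance, under the assumption that λ denotes the kernel width, the reported settings would correspond to `SVC(kernel="rbf", gamma=0.03125, C=8.0)` in scikit-learn.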

Graded Lexical Entailment
In this experiment, we apply HyperVec to the dataset of graded lexical entailment, HyperLex, as introduced by Vulić et al. (2016). The HyperLex dataset provides soft lexical entailment on a continuous scale, rather than simplifying into a binary decision. HyperLex contains 2,616 word pairs across seven semantic relations and two word classes (nouns and verbs). Each word pair is rated by a score that indicates the strength of the semantic relation between the two words. For example, the score of the hypernym pair duck-animal is 5.9 out of 6.0, while the score of the reversed pair animal-duck is only 1.0. We compared HyperScore against the most prominent state-of-the-art hypernymy and lexical entailment models from previous work:
• Directional entailment measures (DEM) (Weeds and Weir, 2003; Weeds et al., 2004; Clarke, 2009; Kotlerman et al., 2010; Lenci and Benotto, 2012)
• Generality measures (SLQS) (Santus et al., 2014)
• Visual generality measures (VIS) (Kiela et al., 2015)
• Consideration of concept frequency ratio (FR) (Vulić et al., 2016)
• WordNet-based similarity measures (WN) (Wu and Palmer, 1994; Pedersen et al., 2004)
• Order embeddings (OrderEmb) (Vendrov et al., 2016)
• Skip-gram embeddings (SGNS) (Mikolov et al., 2013b; Levy and Goldberg, 2014)
• Embeddings fine-tuned to a paraphrase database with linguistic constraints (PARAGRAM) (Mrkšić et al., 2016)
• Gaussian embeddings (Word2Gauss) (Vilnis and McCallum, 2015)
The performance of the models is assessed through Spearman's rank-order correlation coefficient ρ (Siegel and Castellan, 1988), comparing the ranks of the models' scores and the human judgments for the given word pairs.
Table 5: Results (ρ) of HyperScore and state-of-the-art measures and word embedding models on graded lexical entailment.
Table 5 shows that HyperScore significantly outperforms both state-of-the-art measures and word embedding models.

HyperScore outperforms even the previously best word embedding model PARAGRAM by .22, and the previously best measure FR by .27.
The reason that HyperVec outperforms all other models is that the hierarchy between hypernym and hyponym within HyperVec differentiates hyponym-hypernym pairs from hypernym-hyponym pairs. For example, the HyperScore values for the pairs duck-animal and animal-duck are 3.02 and 0.30, respectively. Thus, the magnitude proportion of the hyponym-hypernym pair duck-animal is larger than that of the reversed pair animal-duck.
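The evaluation by Spearman's ρ can be sketched as the Pearson correlation of the two rank vectors, with tied values receiving their average rank. The function names and the toy numbers are assumptions for illustration; the sketch also assumes that neither list is constant, since the correlation is undefined in that case.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(model_scores, human_ratings):
    rx, ry = average_ranks(model_scores), average_ranks(human_ratings)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


toy_rho = spearman_rho([3.0, 0.3, 1.5], [5.9, 1.0, 3.0])  # identical orderings
```

Only the ordering of the scores matters for ρ, so a model does not need to reproduce the 0 to 6 rating scale to be evaluated against the human judgments.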

Generalizing Hypernymy
Having demonstrated the general abilities of HyperVec, this final section explores its potential for generalization in two different ways, (i) by relying on a small seed set only, rather than using a large set of training data; and (ii) by projecting HyperVec to other languages.
Hypernymy Seed Generalization: We utilize only a small hypernym set from the hypernymy resource to train HyperVec, relying on 200 concepts from the BLESS dataset. The motivation behind using these concepts is threefold: i) these concepts are distinct and unambiguous noun concepts; ii) the concepts are equally divided between living and non-living entities; iii) the concepts have been grouped into 17 broader classes. Based on the seed set, we collected the hyponyms of each concept from WordNet, and then re-trained HyperVec. On the hypernymy ranking retrieval task (Section 4.2.1), HyperScore outperforms the baselines across all datasets (cf. Table 1) with AP values of 0.39, 0.448, and 0.585 for EVALution, LENCI&BENOTTO, and WEEDS, respectively. For the graded lexical entailment task (Section 4.4), HyperScore obtains a correlation of ρ = 0.30, outperforming all models except for PARAGRAM with ρ = 0.32. Overall, the results show that HyperVec is indeed able to generalize hypernymy from small seeds of training data.

Generalizing Hypernymy across Languages:
We assume that hypernymy detection can be improved across languages by projecting representations from any arbitrary language into our modified English HyperVec space. We conduct experiments for German and Italian, where the language-specific representations are obtained using the same hyper-parameter settings as for our English SGNS model (cf. Section 4.1). As corpus resource we relied on Wikipedia dumps. Note that we do not use any additional resource, such as the German or Italian WordNet, to tune the embeddings for hypernymy detection. Based on the representations, a mapping function between a source language (German, Italian) and our English HyperVec space is learned, by relying on the least-squares error method from previous work using cross-lingual data (Mikolov et al., 2013a) and different modalities (Lazaridou et al., 2015).
To learn a mapping function between two languages, a one-to-one correspondence (word translations) between two sets of vectors is required. We obtained these translations by using the parallel Europarl v7 corpus for German-English and Italian-English. Word alignment counts were extracted using fast_align (Dyer et al., 2013). We then assigned each source word to the English word with the maximum number of alignments in the parallel corpus. We could match 25,547 pairs for DE→EN and 47,475 pairs for IT→EN.
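The assignment of each source word to its most frequently aligned English word can be sketched as follows. The sketch assumes that the word alignments have already been extracted and flattened into (source word, English word) pairs, one pair per aligned token occurrence; the function name and the toy input are hypothetical, and no aligner output format is parsed here.

```python
from collections import Counter, defaultdict


def best_translations(aligned_pairs):
    """Map each source word to the English word it is aligned with most often.

    aligned_pairs: iterable of (source_word, english_word), one per aligned token pair.
    Ties are broken in favor of the translation seen first.
    """
    counts = defaultdict(Counter)
    for source, english in aligned_pairs:
        counts[source][english] += 1
    return {source: c.most_common(1)[0][0] for source, c in counts.items()}


toy_alignments = [
    ("Hund", "dog"), ("Hund", "dog"), ("Hund", "hound"), ("Vogel", "bird"),
]
translations = best_translations(toy_alignments)
```

The resulting dictionary gives the one-to-one correspondence between source and English vectors needed for learning the mapping.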
Taking the aligned subset of both spaces, we assume that X is the matrix obtained by concatenating all source vectors, and likewise Y is the matrix obtained by concatenating all corresponding English elements. Applying the ℓ2-regularized least-squares error objective can be described using the following equation:

$$\hat{W} = \arg\min_{W} \lVert XW - Y \rVert^{2} + \lambda \lVert W \rVert^{2}$$

Although we learn the mapping only on a subset of aligned words, it allows us to project every word in a source vocabulary to its English HyperVec position by using W.
Finally, we compare the original representations and the mapped representations on the hypernymy ranking retrieval task (similar to Section 4.2.1). As gold resources we relied on German and Italian noun pairs. For German we used the 282 German pairs collected via Amazon Mechanical Turk by Scheible and Schulte im Walde (2014). The 1,350 Italian pairs were collected via CrowdFlower by Sucameli (2015) in the same way. Both collections contain hypernymy, antonymy and synonymy pairs. As before, we evaluate the ranking by AP, and we compare the cosine of the unmodified default representations against the HyperScore of the projected representations. The results are shown in Table 6. We clearly see that for both languages the default SGNS embeddings do not provide higher similarity scores for hypernymy pairs (except for Italian Hyp/Ant), but both languages provide higher scores when we map the embeddings into the English HyperVec space.
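For illustration, the mapping can be obtained in closed form if the regularized least-squares objective is taken to be the standard ridge formulation, i.e., the squared Frobenius norm of XW − Y plus λ times the squared Frobenius norm of W. This exact form, the symbol λ for the regularization weight, and the convention that each row of X and Y holds one aligned vector pair are assumptions of the derivation below. Setting the gradient with respect to W to zero gives:

```latex
\begin{align}
\nabla_W \left( \lVert XW - Y \rVert_F^2 + \lambda \lVert W \rVert_F^2 \right)
  &= 2 X^{\top} (XW - Y) + 2 \lambda W = 0 \\
\Rightarrow \quad (X^{\top} X + \lambda I)\, W &= X^{\top} Y \\
\Rightarrow \quad \hat{W} &= (X^{\top} X + \lambda I)^{-1} X^{\top} Y
\end{align}
```

Under these assumptions, any source-language row vector x, including words outside the aligned subset, is projected into the English space as x multiplied by the learned matrix, and for λ > 0 the matrix being inverted is always invertible.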

Conclusion
This paper proposed a novel neural model HyperVec to learn hierarchical embeddings for hypernymy. HyperVec has been shown to strengthen hypernymy similarity, and to capture the distributional hierarchy of hypernymy. Together with a newly proposed unsupervised measure HyperScore, our experiments demonstrated (i) significant improvements against state-of-the-art measures, and (ii) the capability to generalize hypernymy and learn the relation instead of memorizing prototypical hypernyms.