Improving Hypernymy Detection with an Integrated Path-based and Distributional Method

Detecting hypernymy relations is a key task in NLP, which is addressed in the literature using two complementary approaches: distributional methods, whose supervised variants are the current best performers, and path-based methods, which have received less research attention. We suggest an improved path-based algorithm, in which the dependency paths are encoded using a recurrent neural network, that achieves results comparable to distributional methods. We then extend the approach to integrate both path-based and distributional signals, significantly improving upon the state-of-the-art on this task.


Introduction
Hypernymy is an important lexical-semantic relation for NLP tasks. For instance, knowing that Tom Cruise is an actor can help a question answering system answer the question "which actors are involved in Scientology?". While semantic taxonomies, like WordNet (Fellbaum, 1998), define hypernymy relations between word types, they are limited in scope and domain. Therefore, automated methods have been developed to determine, for a given term-pair (x, y), whether y is a hypernym of x, based on their occurrences in a large corpus.
For a couple of decades, this task has been addressed by two types of approaches: distributional, and path-based. In distributional methods, the decision whether y is a hypernym of x is based on the distributional representations of these terms. Lately, with the popularity of word embeddings (Mikolov et al., 2013), most focus has shifted towards supervised distributional methods, in which each (x, y) term-pair is represented using some combination of the terms' embedding vectors.
In contrast to distributional methods, in which the decision is based on the separate contexts of x and y, path-based methods base the decision on the lexico-syntactic paths connecting the joint occurrences of x and y in a corpus. Hearst (1992) identified a small set of frequent paths that indicate hypernymy, e.g. Y such as X. Snow et al. (2004) represented each (x, y) term-pair as the multiset of dependency paths connecting their co-occurrences in a corpus, and trained a classifier to predict hypernymy, based on these features.
Using individual paths as features results in a huge, sparse feature space. While some paths are rare, they often consist of certain unimportant components. For instance, "Spelt is a species of wheat" and "Fantasy is a genre of fiction" yield two different paths, X be species of Y and X be genre of Y, although both indicate that X is-a Y. A possible solution is to generalize paths by replacing words along the path with their part-of-speech tags or with wild cards, as done in the PATTY system (Nakashole et al., 2012).
Overall, the state-of-the-art path-based methods perform worse than the distributional ones. This stems from a major limitation of path-based methods: they require that the terms of the pair occur together in the corpus, limiting the recall of these methods. While distributional methods have no such requirement, they are usually less precise in detecting a specific semantic relation like hypernymy, and perform best on detecting broad semantic similarity between terms. Though these approaches seem complementary, there has been rather little work on integrating them (Mirkin et al., 2006; Kaji and Kitsuregawa, 2008).
In this paper, we present HypeNET, an integrated path-based and distributional method for hypernymy detection. Inspired by recent progress in relation classification, we use a long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997) to encode dependency paths. In order to create enough training data for our network, we followed previous methodology of constructing a dataset based on knowledge resources.
We first show that our path-based approach, on its own, substantially improves performance over prior path-based methods, yielding performance comparable to state-of-the-art distributional methods. Our analysis suggests that the neural path representation enables better generalizations. While coarse-grained generalizations, such as replacing a word by its POS tag, capture mostly syntactic similarities between paths, HypeNET also captures semantic similarities.
We then show that we can easily integrate distributional signals in the network. The integration results confirm that the distributional and path-based signals indeed provide complementary information, with the combined model yielding an improvement of up to 14 F1 points over each individual model.

Background
We introduce the two main approaches for hypernymy detection: distributional (Section 2.1), and path-based (Section 2.2). We then discuss the recent use of recurrent neural networks in the related task of relation classification (Section 2.3).

Distributional Methods
Hypernymy detection is commonly addressed using distributional methods. In these methods, the decision whether y is a hypernym of x is based on the distributional representations of the two terms, i.e., the contexts with which each term occurs separately in the corpus.
Earlier methods developed unsupervised measures for hypernymy, starting with symmetric similarity measures (Lin, 1998), and followed by directional measures based on the distributional inclusion hypothesis (Weeds and Weir, 2003; Kotlerman et al., 2010). This hypothesis states that the contexts of a hyponym are expected to be largely included in those of its hypernym. More recent work (Santus et al., 2014; Rimell, 2014) introduces new measures, based on the assumption that the most typical linguistic contexts of a hypernym are less informative than those of its hyponyms.
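To make the distributional inclusion hypothesis concrete, the following minimal sketch scores how much of one term's context weight falls on contexts that the other term also occurs with. It is only one simple way to turn the hypothesis into a directional score, not the exact measure of any cited paper; the function name `inclusion_score` and the toy context weights are assumptions made for illustration.

```python
from typing import Dict


def inclusion_score(contexts_x: Dict[str, float], contexts_y: Dict[str, float]) -> float:
    """Share of x's total context weight that lies on contexts also seen with y."""
    total = sum(contexts_x.values())
    if total == 0:
        return 0.0
    shared = sum(weight for ctx, weight in contexts_x.items() if ctx in contexts_y)
    return shared / total


# Toy context weights, made up for illustration only.
dog = {"bark": 1.0, "pet": 2.0, "eat": 3.0}
animal = {"pet": 1.0, "eat": 5.0, "wild": 2.0, "hunt": 2.0}

forward = inclusion_score(dog, animal)   # how well the contexts of "animal" cover those of "dog"
backward = inclusion_score(animal, dog)  # how well the contexts of "dog" cover those of "animal"
```

Under the hypothesis, a higher score in the (dog, animal) direction than in the reverse direction is weak evidence that animal is the more general term.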
More recently, the focus of the distributional approach shifted to supervised methods. In these methods, the (x, y) term-pair is represented by a feature vector, and a classifier is trained on these vectors to predict hypernymy. Several methods are used to represent term-pairs as a combination of the terms' embedding vectors: concatenation x ⊕ y (Baroni et al., 2012), difference y − x (Roller et al., 2014; Weeds et al., 2014), and dot-product x · y. Using neural word embeddings (Mikolov et al., 2013; Pennington et al., 2014), these methods are easy to apply, and show good results (Baroni et al., 2012; Roller et al., 2014).
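The three term-pair representations listed above can be sketched in a few lines. The example uses plain Python lists and made-up 3-dimensional vectors; in practice the inputs would be pre-trained word embeddings, and the helper names are assumptions chosen for this illustration.

```python
from typing import List

Vector = List[float]


def concat(x: Vector, y: Vector) -> Vector:
    """Concatenation x ⊕ y: both embeddings are kept side by side."""
    return x + y


def difference(x: Vector, y: Vector) -> Vector:
    """Difference y - x, computed element-wise."""
    return [yi - xi for xi, yi in zip(x, y)]


def dot_product(x: Vector, y: Vector) -> float:
    """Dot-product x · y: a single similarity-like feature."""
    return sum(xi * yi for xi, yi in zip(x, y))


# Toy embeddings, made up for illustration only.
x_vec = [1.0, 0.0, 2.0]
y_vec = [0.5, 1.0, 2.0]

pair_features = {
    "concat": concat(x_vec, y_vec),
    "diff": difference(x_vec, y_vec),
    "dot": dot_product(x_vec, y_vec),
}
```

Each representation is then passed to a standard classifier; note that concatenation and difference are order-sensitive, while the dot-product is symmetric in x and y.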

Path-based Methods
A different approach to detecting hypernymy between a pair of terms (x, y) considers the lexico-syntactic paths that connect the joint occurrences of x and y in a large corpus. Automatic acquisition of hypernyms from free text, based on such paths, was first proposed by Hearst (1992), who identified a small set of lexico-syntactic paths that indicate hypernymy relations (e.g. Y such as X, X and other Y).
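As a rough illustration of pattern-based acquisition, the sketch below matches the two example patterns (Y such as X, X and other Y) with regular expressions over raw text. It is a deliberate simplification: it assumes single-word terms and works on surface strings rather than parsed, lemmatized text, and the names `PATTERNS` and `extract_candidates` are hypothetical.

```python
import re
from typing import List, Tuple

# Each entry: (compiled regex, group index of the hyponym X, group index of the hypernym Y).
PATTERNS = [
    (re.compile(r"\b(\w+) such as (\w+)\b"), 2, 1),    # Y such as X
    (re.compile(r"\b(\w+) and other (\w+)\b"), 1, 2),  # X and other Y
]


def extract_candidates(sentence: str) -> List[Tuple[str, str]]:
    """Return candidate (hyponym, hypernym) pairs found by the surface patterns."""
    pairs = []
    lowered = sentence.lower()
    for regex, hypo_group, hyper_group in PATTERNS:
        for match in regex.finditer(lowered):
            pairs.append((match.group(hypo_group), match.group(hyper_group)))
    return pairs


found = extract_candidates(
    "She studies birds such as parrots, and keeps parrots and other pets."
)
```

The extracted pairs are only candidates: a small pattern set tends to be precise but misses most hypernymy mentions, which is what motivates learning over all dependency paths.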
In a later work, Snow et al. (2004) learned to detect hypernymy. Rather than searching for specific paths that indicate hypernymy, they represent each (x, y) term-pair as the multiset of all dependency paths that connect x and y in the corpus, and train a logistic regression classifier to predict whether y is a hypernym of x, based on these paths.
Paths that indicate hypernymy are those that were assigned high weights by the classifier. The paths identified by this method were shown to subsume those found by Hearst (1992), yielding improved performance. Variations of Snow et al.'s (2004) method were later used in tasks such as taxonomy construction (Snow et al., 2006; Kozareva and Hovy, 2010; Carlson et al., 2010; Riedel et al., 2013), analogy identification (Turney, 2006), and definition extraction (Borg et al., 2009; Navigli and Velardi, 2010).
A major limitation in relying on lexico-syntactic paths is the sparsity of the feature space. Since similar paths may somewhat vary at the lexical level, generalizing such variations into more abstract paths can increase recall. The PATTY algorithm (Nakashole et al., 2012) used generalized dependency paths to acquire a taxonomy of term relations from free text. For each path, they added generalized versions in which a subset of words along the path were replaced by either their POS tags, their ontological types or wild-cards. This generalization increased recall while maintaining the same level of precision.
Figure 1: An example dependency tree of the sentence "parrot is a bird", with x=parrot and y=bird, represented in our notation as X/NOUN/nsubj/< be/VERB/ROOT/-

RNNs for Relation Classification
Relation classification is a related task whose goal is to classify the relation that is expressed between two target terms in a given sentence into one of a set of predefined relation classes. To illustrate, consider the following sentence, from the SemEval-2010 relation classification task dataset (Hendrickx et al., 2009): "The [apples]$_{e_1}$ are in the [basket]$_{e_2}$". Here, the relation expressed between the target entities is Content-Container($e_1$, $e_2$). The shortest dependency paths between the target entities were shown to be informative for this task (Fundel et al., 2007). Recently, deep learning techniques showed good performance in capturing the indicative information in such paths.
In particular, several papers show improved performance using recurrent neural networks (RNN) that process a dependency path edge-by-edge. Xu et al. (2015; 2016) apply a separate long short-term memory (LSTM) network to each sequence of words, POS tags, dependency labels and WordNet hypernyms along the path. A max-pooling layer on the LSTM outputs is used as the input of a network that predicts the classification. Other papers suggest incorporating additional network architectures to further improve performance (Nguyen and Grishman, 2015; Liu et al., 2015).
While relation classification and hypernymy detection are both concerned with identifying semantic relations that hold for pairs of terms, they differ in a major respect. In relation classification the relation should be expressed in the given text, while in hypernymy detection, the goal is to recognize a generic lexical-semantic relation between terms that holds in many contexts. Accordingly, in relation classification a term-pair is represented by a single dependency path, while in hypernymy detection it is represented by the multiset of all dependency paths in which they co-occur in the corpus.

LSTM-based Hypernymy Detection
We present HypeNET, an LSTM-based method for hypernymy detection. We first focus on improving path representation (Section 3.1), and then integrate distributional signals into our network, resulting in a combined method (Section 3.2).

Path-based Network
Similarly to prior work, we represent each dependency path as a sequence of edges that leads from x to y in the dependency tree. Each edge contains the lemma and part-of-speech tag of the source node, the dependency label, and the edge direction between two subsequent nodes. We denote each edge as lemma/POS/dep/dir. See Figure 1 for an illustration.
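The lemma/POS/dep/dir notation maps naturally onto a small record type. The sketch below parses the path fragment shown in the Figure 1 caption into edges and formats it back; the `Edge` type and the helper names are assumptions for illustration, and the parser assumes that lemmas contain neither spaces nor slashes.

```python
from typing import List, NamedTuple


class Edge(NamedTuple):
    lemma: str
    pos: str
    dep: str
    direction: str  # "<", ">" or "-" in the notation used here


def parse_path(path: str) -> List[Edge]:
    """Split a whitespace-separated path into lemma/POS/dep/dir edges."""
    edges = []
    for token in path.split():
        lemma, pos, dep, direction = token.split("/")
        edges.append(Edge(lemma, pos, dep, direction))
    return edges


def format_path(edges: List[Edge]) -> str:
    """Inverse of parse_path."""
    return " ".join("/".join(edge) for edge in edges)


# The fragment below is copied from the Figure 1 caption as it appears in the text.
path = "X/NOUN/nsubj/< be/VERB/ROOT/-"
edges = parse_path(path)
```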
Rather than treating an entire dependency path as a single feature, we encode the sequence of edges using a long short-term memory (LSTM) network. The vectors obtained for the different paths of a given (x, y) pair are pooled, and the resulting vector is used for classification. Figure 2 depicts the overall network structure, which is described below.
Edge Representation We represent each edge by the concatenation of its components' vectors:
$$v_e = [v_l, v_{pos}, v_{dep}, v_{dir}]$$
where $v_l$, $v_{pos}$, $v_{dep}$, $v_{dir}$ represent the embedding vectors of the lemma, part-of-speech, dependency label and dependency direction (along the path from x to y), respectively.
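A minimal sketch of the edge representation, assuming one lookup table per component: each table maps a symbol to a small vector, and the edge vector is their concatenation. The dimensions, the random initialization and the table names are all placeholders chosen for this illustration; in the described system the lemma vectors come from pre-trained embeddings and every table is updated during training.

```python
import random
from typing import Dict, List

rng = random.Random(0)


def make_table(symbols: List[str], dim: int) -> Dict[str, List[float]]:
    """Randomly initialized embedding table (illustrative stand-in for learned vectors)."""
    return {s: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for s in symbols}


# Hypothetical, tiny embedding tables for the four edge components.
lemma_emb = make_table(["X", "be", "Y"], dim=4)
pos_emb = make_table(["NOUN", "VERB"], dim=2)
dep_emb = make_table(["nsubj", "ROOT"], dim=2)
dir_emb = make_table(["<", ">", "-"], dim=1)


def edge_vector(lemma: str, pos: str, dep: str, direction: str) -> List[float]:
    """Concatenate the four component embeddings into one edge vector."""
    return lemma_emb[lemma] + pos_emb[pos] + dep_emb[dep] + dir_emb[direction]


v_e = edge_vector("be", "VERB", "ROOT", "-")
```

The resulting vector has dimension 4 + 2 + 2 + 1 = 9 here; a sequence of such vectors is what the LSTM encoder consumes.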
Path Representation For a path p composed of edges $e_1, ..., e_k$, the edge vectors $v_{e_1}, ..., v_{e_k}$ are fed in order to an LSTM encoder, resulting in a vector $o_p$ representing the entire path p. The LSTM architecture is effective at capturing temporal patterns in sequences. We expect the training procedure to drive the LSTM encoder to focus on parts of the path that are more informative for the classification task while ignoring others.
Term-Pair Classification Each (x, y) term-pair is represented by the multiset of lexico-syntactic paths that connected x and y in the corpus, denoted as paths(x, y), while the supervision is given for the term pairs. We represent each (x, y) term-pair as the weighted average of its path vectors, by applying average pooling on its path vectors, as follows:
$$v_{xy} = v_{paths(x,y)} = \frac{\sum_{p \in paths(x,y)} f_{p,(x,y)} \cdot o_p}{\sum_{p \in paths(x,y)} f_{p,(x,y)}} \qquad (1)$$
where $f_{p,(x,y)}$ is the frequency of p in paths(x, y). We then feed this path vector to a single-layer network that performs binary classification to decide whether y is a hypernym of x:
$$c = \mathrm{softmax}(W \cdot v_{xy})$$
c is a 2-dimensional vector whose components sum to 1, and we classify a pair as positive if c[1] > 0.5.
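The pooling and classification steps can be sketched as follows, taking the path vectors as given (the LSTM encoder that would produce them is omitted). The path strings, their 2-dimensional vectors, the frequencies and the weight matrix `W` are toy values invented for this example, not learned parameters.

```python
import math
from typing import Dict, List, Tuple


def average_paths(path_vectors: Dict[str, List[float]], freq: Dict[str, int]) -> List[float]:
    """Frequency-weighted average of the path vectors of one term-pair."""
    total = sum(freq[p] for p in path_vectors)
    dim = len(next(iter(path_vectors.values())))
    pooled = [0.0] * dim
    for p, o_p in path_vectors.items():
        for i, value in enumerate(o_p):
            pooled[i] += freq[p] * value
    return [value / total for value in pooled]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def classify(W: List[List[float]], v_xy: List[float]) -> Tuple[List[float], bool]:
    """Single-layer classifier: c = softmax(W · v_xy); positive if c[1] > 0.5."""
    scores = [sum(w * v for w, v in zip(row, v_xy)) for row in W]
    c = softmax(scores)
    return c, c[1] > 0.5


# Toy inputs: two paths for one (x, y) pair, with made-up encodings and counts.
path_vectors = {"X be Y": [1.0, 0.0], "X and other Y": [0.0, 1.0]}
freq = {"X be Y": 3, "X and other Y": 1}
W = [[0.0, 1.0], [1.0, 0.0]]  # hypothetical weights, 2 classes x 2 dimensions

v_xy = average_paths(path_vectors, freq)
c, is_hypernym = classify(W, v_xy)
```

Because the pooled vector is a weighted average, a path that occurs often for a pair dominates the representation, while a single rare path has little influence.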

Implementation Details
To train the network, we used PyCNN (https://github.com/clab/cnn). We minimize the cross entropy loss using gradient-based optimization, with mini-batches of size 10 and the Adam update rule (Kingma and Ba, 2014). Regularization is applied by a dropout on each of the components' embeddings. We tuned the hyper-parameters (learning rate and dropout rate) on the validation set (see the appendix for the hyper-parameter values).
We initialized the lemma embeddings with the pre-trained GloVe word embeddings (Pennington et al., 2014), trained on Wikipedia. We tried both the 50-dimensional and 100-dimensional embedding vectors and selected the ones that yield better performance on the validation set. The other embeddings, as well as out-of-vocabulary lemmas, are initialized randomly. We update all embedding vectors during training.

Integrated Network
The network presented in Section 3.1 classifies each (x, y) term-pair based on the paths that connect x and y in the corpus. Our goal was to improve upon previous path-based methods for hypernymy detection, and we show in Section 6 that our network indeed outperforms them. Yet, as path-based and distributional methods are considered complementary, we present a simple way to integrate distributional features in the network, yielding improved performance.
We extended the network to take into account distributional information on each term. Inspired by the supervised distributional concatenation method (Baroni et al., 2012), we simply concatenate x and y word embeddings to the (x, y) feature vector, redefining $v_{xy}$:
$$v_{xy} = [v_{w_x}, v_{paths(x,y)}, v_{w_y}]$$
where $v_{w_x}$ and $v_{w_y}$ are x and y's word embeddings, respectively, and $v_{paths(x,y)}$ is the averaged path vector defined in equation 1. This way, each (x, y) pair is represented using both the distributional features of x and y, and their path-based features.

Dataset

Creating Instances
Neural networks typically require a large amount of training data, whereas the existing hypernymy datasets, like BLESS (Baroni and Lenci, 2011), are relatively small. Therefore, we followed the common methodology of creating a dataset using distant supervision from knowledge resources (Snow et al., 2004; Riedel et al., 2013). Following Snow et al. (2004), who constructed their dataset based on WordNet hypernymy, and aiming to create a larger dataset, we extract hypernymy relations from several resources: WordNet (Fellbaum, 1998), DBPedia (Auer et al., 2007), Wikidata (Vrandečić, 2012) and Yago (Suchanek et al., 2007).
All instances in our dataset, both positive and negative, are pairs of terms that are directly related in at least one of the resources. These resources contain thousands of relations, some of which indicate hypernymy with varying degrees of certainty. To avoid including questionable relation types, we consider as denoting positive examples only indisputable hypernymy relations (Table 1), which we manually selected from the set of hypernymy indicating relations in Shwartz et al. (2015).
Term-pairs related by other relations (including hyponymy) are considered as negative instances. Using related rather than random term-pairs as negative instances tests our method's ability to distinguish between hypernymy and other kinds of semantic relatedness. We maintain a ratio of 1:4 positive to negative pairs in the dataset.
Like Snow et al. (2004), we include only term-pairs that have joint occurrences in the corpus, requiring at least two different dependency paths for each pair.
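The labeling scheme described above can be sketched as a small function over (x, relation, y) triples: pairs linked by a whitelisted relation become positives, pairs linked by any other relation become negatives, and negatives are downsampled to keep the 1:4 ratio. The relation names in `POSITIVE_RELATIONS` are placeholders rather than the relations actually selected in Table 1, and the sketch ignores the corpus co-occurrence filter and the case of a pair appearing under several relations.

```python
import random
from typing import List, Tuple

Triple = Tuple[str, str, str]          # (x, relation, y)
Instance = Tuple[str, str, bool]       # (x, y, is_hypernym)

# Hypothetical whitelist; the real one is the manually selected set of relations.
POSITIVE_RELATIONS = {"hypernym", "instance_of"}


def build_dataset(triples: List[Triple], neg_per_pos: int = 4, seed: int = 0) -> List[Instance]:
    """Label pairs by relation type and keep at most neg_per_pos negatives per positive."""
    positives = [(x, y) for x, rel, y in triples if rel in POSITIVE_RELATIONS]
    negatives = [(x, y) for x, rel, y in triples if rel not in POSITIVE_RELATIONS]
    rng = random.Random(seed)
    rng.shuffle(negatives)
    negatives = negatives[: neg_per_pos * len(positives)]
    return [(x, y, True) for x, y in positives] + [(x, y, False) for x, y in negatives]


# Toy triples, made up for illustration only.
toy_triples = [("parrot", "hypernym", "bird"), ("tom_cruise", "instance_of", "actor")]
toy_triples += [(f"part{i}", "part_of", f"whole{i}") for i in range(10)]

dataset = build_dataset(toy_triples)
```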

Random and Lexical Dataset Splits
As our primary dataset, we perform standard random splitting, with 70% train, 25% test and 5% validation sets.
As pointed out by , supervised distributional lexical inference methods tend  to perform "lexical memorization", i.e., instead of learning a relation between the two terms, they mostly learn an independent property of a single term in the pair: whether it is a "prototypical hypernym" or not. For instance, if the training set contains term-pairs such as (dog, animal), (cat, animal), and (cow, animal), all annotated as positive examples, the algorithm may learn that animal is a prototypical hypernym, classifying any new (x, animal) pair as positive, regardless of the relation between x and animal.  suggested to split the train and test sets such that each will contain a distinct vocabulary ("lexical split"), in order to prevent the model from overfitting by lexical memorization.
To investigate such behaviors, we also present results for a lexical split of our dataset. In this case, we split the train, test and validation sets such that each contains a distinct vocabulary. We note that this differs from , who split only the train and the test sets, and dedicated a subset of the train for validation. We chose to deviate from  because we noticed that when the validation set contains terms from the train set, the model is rewarded for lexical memorization when tuning the hyper-parameters, consequently yielding suboptimal performance on the lexically-distinct test set. When each set has a distinct vocabulary, the hyper-parameters are tuned to avoid lexical memorization and are likely to perform better on the test set. We tried to keep roughly the same 70/25/5 ratio in our lexical split. The sizes of the two datasets are shown in Table 2.
Indeed, training a model on a lexically split dataset may result in a more general model that can better handle pairs consisting of two unseen terms during inference. However, we argue that in the common applied scenario, the inference involves an unseen pair (x, y), in which x and/or y have already been observed separately. Training on a random split may expose the model to a term's "prior probability" of being a hypernym or a hyponym, and this information can be exploited beneficially at inference time.
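One straightforward way to obtain a lexical split, sketched below, is to partition the vocabulary first and then keep only the pairs whose two terms fall in the same partition. The text does not spell out the exact procedure used, so this is an assumption for illustration; in particular, discarding pairs that straddle two partitions shrinks the dataset and makes the resulting pair ratios only roughly match the vocabulary ratios.

```python
import random
from typing import Dict, List, Tuple

Instance = Tuple[str, str, bool]  # (x, y, is_hypernym)


def lexical_split(
    pairs: List[Instance],
    ratios: Tuple[float, float, float] = (0.7, 0.25, 0.05),
    seed: int = 0,
) -> Dict[str, List[Instance]]:
    """Assign every term to exactly one of train/test/val, then keep same-part pairs."""
    vocab = sorted({term for x, y, _ in pairs for term in (x, y)})
    rng = random.Random(seed)
    rng.shuffle(vocab)
    cut_train = int(ratios[0] * len(vocab))
    cut_test = cut_train + int(ratios[1] * len(vocab))
    part_of = {}
    for i, term in enumerate(vocab):
        part_of[term] = "train" if i < cut_train else "test" if i < cut_test else "val"
    splits: Dict[str, List[Instance]] = {"train": [], "test": [], "val": []}
    for x, y, label in pairs:
        if part_of[x] == part_of[y]:  # pairs that straddle two parts are dropped
            splits[part_of[x]].append((x, y, label))
    return splits


# Toy data: all ordered pairs over 20 made-up terms, with arbitrary labels.
toy_pairs = [
    (f"t{i}", f"t{j}", (i + j) % 2 == 0) for i in range(20) for j in range(20) if i < j
]
splits = lexical_split(toy_pairs)
```

By construction no term appears in more than one set, so a classifier cannot score well on the test set merely by remembering that a particular y is a frequent hypernym.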

Baselines
We compare HypeNET with several state-of-the-art methods for hypernymy detection, as described in Section 2: path-based methods (Section 5.1), and distributional methods (Section 5.2). Due to different works using different datasets and corpora, we replicated the baselines rather than comparing to the reported results. We use the Wikipedia dump from May 2015 as the underlying corpus of all the methods, and parse it using spaCy. We perform model selection on the validation set to tune the hyper-parameters of each method. The best hyper-parameters are reported in the appendix.

Path-based Methods
Snow We follow the original paper, and extract all shortest paths of four edges or fewer between terms in a dependency tree. Like Snow et al. (2004), we add paths with "satellite edges", i.e., single words not already contained in the dependency path, which are connected to either X or Y, allowing paths like such Y as X. The number of distinct paths was 324,578. We apply $\chi^2$ feature selection to keep only the 100,000 most informative paths and train a logistic regression classifier.
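The path extraction step can be illustrated on a dependency tree stored as a list of head indices. In a tree the path between two nodes is unique, so the shortest path is found by walking from each node up to their lowest common ancestor. The sketch below returns the node indices along that path and applies the four-edge limit; it leaves out satellite edges and edge labels, and the head-index encoding (with -1 for the root) is an assumption of this example.

```python
from typing import List, Optional


def ancestors(node: int, heads: List[int]) -> List[int]:
    """Chain of nodes from `node` up to the root (inclusive)."""
    chain = [node]
    while heads[chain[-1]] != -1:
        chain.append(heads[chain[-1]])
    return chain


def shortest_path(x: int, y: int, heads: List[int], max_edges: int = 4) -> Optional[List[int]]:
    """Node indices on the tree path from x to y, or None if it exceeds max_edges."""
    up_x = ancestors(x, heads)
    up_y = ancestors(y, heads)
    position_in_y = {node: j for j, node in enumerate(up_y)}
    for i, node in enumerate(up_x):
        if node in position_in_y:  # first shared ancestor = lowest common ancestor
            path = up_x[: i + 1] + up_y[: position_in_y[node]][::-1]
            return path if len(path) - 1 <= max_edges else None
    return None


# "parrot is a bird": parrot -> is, is -> ROOT, a -> bird, bird -> is
tokens = ["parrot", "is", "a", "bird"]
heads = [1, -1, 3, 1]

path = shortest_path(0, 3, heads)
words = [tokens[i] for i in path]
```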
Generalization We also compare our method to a baseline that uses generalized dependency paths. Following PATTY's approach to generalizing paths (Nakashole et al., 2012), we replace edges with their part-of-speech tags as well as with wild cards. We generate the powerset of all possible generalizations, including the original paths. See Table 3 for examples. The number of features after generalization went up to 2,093,220. Similarly to the first baseline, we apply feature selection, this time keeping the 1,000,000 most informative paths, and train a logistic regression classifier over the generalized paths.
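The growth in the number of features follows directly from how generalizations are enumerated: if each word on a path may be kept, replaced by its POS tag, or replaced by a wild card, a path with n words yields 3^n variants. The sketch below enumerates them for a simplified path given as a word sequence between the X and Y slots; the real baseline operates on dependency edges, and the function name and the POS tags used are assumptions for illustration.

```python
from itertools import product
from typing import List, Set, Tuple


def generalize(words: List[Tuple[str, str]]) -> Set[str]:
    """All variants where each (lemma, POS) word is kept, POS-tagged, or wild-carded."""
    options = [(lemma, pos, "*") for lemma, pos in words]
    return {" ".join(("X",) + choice + ("Y",)) for choice in product(*options)}


# "X be species of Y", with hypothetical POS tags for the three inner words.
variants = generalize([("be", "VERB"), ("species", "NOUN"), ("of", "ADP")])
```

Among the variants, `X be NOUN of Y` also covers `X be genre of Y`, which is the kind of sharing that raises recall, while `X * * * Y` shows how quickly the variants become too general to be indicative.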

Distributional Methods
Unsupervised SLQS (Santus et al., 2014) is an entropy-based measure for hypernymy detection, reported to outperform previous state-of-the-art unsupervised methods (Weeds and Weir, 2003; Kotlerman et al., 2010). The original paper was evaluated on the BLESS dataset (Baroni and Lenci, 2011), which consists of mostly frequent words. Applying the vanilla settings of SLQS on our dataset, which also contains rare terms, resulted in low performance. Therefore, we received assistance from Enrico Santus, who kindly provided the results of SLQS on our dataset after tuning the system as follows.
The validation set was used to tune the threshold for classifying a pair as positive, as well as the maximum number of each term's most associated contexts (N). In contrast to the original paper, in which the number of each term's contexts is fixed to N, in this adaptation it was set to the minimum between the number of contexts with LMI score above zero and N. In addition, the SLQS scores were not multiplied by the cosine similarity scores between terms, and terms were lemmatized prior to computing the SLQS scores, significantly improving recall.
As our results suggest, while this method is state-of-the-art for unsupervised hypernymy detection, it is basically designed for classifying the specificity level of related terms, rather than hypernymy in particular.
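The context-selection adaptation described above (taking the minimum between N and the number of contexts with a positive LMI score) amounts to a filter followed by a truncation. The sketch below shows only that step, not the SLQS measure itself; the function name and the toy LMI scores are assumptions for illustration.

```python
from typing import Dict, List


def select_contexts(lmi: Dict[str, float], n: int) -> List[str]:
    """Top contexts by LMI, keeping at most n and only those with LMI above zero."""
    positive = [ctx for ctx, score in lmi.items() if score > 0]
    positive.sort(key=lambda ctx: lmi[ctx], reverse=True)
    return positive[:n]


# Toy LMI scores for one rare term, made up for illustration only.
lmi_scores = {"sail": 3.0, "the": -1.0, "ship": 0.5, "of": 0.0}

top_five = select_contexts(lmi_scores, n=5)
top_one = select_contexts(lmi_scores, n=1)
```

For a rare term with few informative contexts, this avoids padding the context list with weakly associated contexts just to reach N.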
Supervised To represent term-pairs with distributional features, we tried several state-of-the-art methods: concatenation x ⊕ y (Baroni et al., 2012), difference y − x (Roller et al., 2014; Weeds et al., 2014), and dot-product x · y. We downloaded several pre-trained embeddings (Mikolov et al., 2013; Pennington et al., 2014) of different sizes, and trained a number of classifiers: logistic regression, SVM, and SVM with RBF kernel, which was reported by  to perform best in this setting. We perform model selection on the validation set to select the best vectors, method and regularization factor (see the appendix).
Table 4: Performance scores of our method compared to the path-based baselines and the state-of-the-art distributional methods for hypernymy detection, on both variations of the dataset, with lexical and random split to train / test / validation.

Results
Table 4 displays performance scores of HypeNET and the baselines. HypeNET Path-based is our path-based recurrent neural network model (Section 3.1) and HypeNET Integrated is our combined method (Section 3.2). Comparing the path-based methods shows that generalizing paths improves recall while maintaining similar levels of precision, reaffirming the behavior found in Nakashole et al. (2012). HypeNET Path-based outperforms both path-based baselines, with a significant improvement in recall and slightly lower precision. The recall boost is due to better path generalization, as demonstrated in Section 7.1.
Regarding distributional methods, the unsupervised SLQS baseline performed slightly worse on our dataset. The low precision stems from its inability to distinguish between hypernyms and meronyms, which are common in our dataset, causing many false positive pairs such as (zabrze, poland) and (kibbutz, israel). We sampled 50 false positive pairs of each dataset split, and found that 38% of the false positive pairs in the random split and 48% of those in the lexical split were holonym-meronym pairs.
In accordance with previously reported results, the supervised embedding-based method is the best performing baseline on our dataset as well.
HypeNET Path-based performs slightly better, achieving state-of-the-art results. Adding distributional features to our method shows that these two approaches are indeed complementary. On both dataset splits, the performance differences between HypeNET Integrated and HypeNET Path-based, as well as the supervised distributional method, are substantial, and statistically significant with a p-value of 1% (paired t-test).
We also reconfirm that supervised distributional methods indeed perform worse on a lexical split . We further observe a similar reduction when using HypeNET, which is not a result of lexical memorization, but rather stems from over-generalization (Section 7.1).
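Since the comparison above is phrased in terms of precision, recall and F1, a minimal reference computation may be useful. The sketch uses the standard definitions over binary gold labels and predictions; the toy label lists are invented and unrelated to the reported results.

```python
from typing import List, Tuple


def precision_recall_f1(gold: List[bool], predicted: List[bool]) -> Tuple[float, float, float]:
    """Standard binary precision, recall and F1, with 0.0 for undefined ratios."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels, made up for illustration only.
gold = [True, True, True, False, False]
pred = [True, True, False, True, False]

precision, recall, f1 = precision_recall_f1(gold, pred)
```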

Qualitative Analysis of Learned Paths
We analyze HypeNET's ability to generalize over path structures, by comparing prominent indicative paths which were learned by each of the path-based methods. We do so by finding high-scoring paths that contributed to the classification of true-positive pairs in the dataset. In the path-based baselines, these are the highest-weighted features as learned by the logistic regression classifier. In the LSTM-based method, it is less straightforward to identify the most indicative paths. We assess the contribution of a certain path p to classification by regarding it as the only path that appeared for the term-pair, and compute its TRUE label score from the class distribution: $\mathrm{softmax}(W \cdot v_{xy})[1]$, with $v_{xy}$ computed from p alone.
A notable pattern is that Snow's method learns specific paths, like X is Y from (e.g. Megadeth is an American thrash metal band from Los Angeles). While Snow's method can only rely on verbatim paths, limiting its recall, the generalized version of Snow often makes coarse generalizations, such as X VERB Y from. Clearly, such a path is too general, and almost any verb assigned to it results in a non-indicative path (e.g. X take Y from). Efforts by the learning method to avoid such generalization, again, lower the recall. HypeNET provides a better midpoint, making fine-grained generalizations by learning additional semantically similar paths such as X become Y from and X remain Y from. See Table 5 for additional example paths which illustrate these behaviors.
We also noticed that while on the random split our model learns a range of specific paths such as X is Y published (learned for e.g. Y=magazine) and X is Y produced (Y=film), in the lexical split it only learns the general X is Y path for these relations. We note that X is Y is a rather "noisy" path, which may occur in ad-hoc contexts without indicating generic hypernymy relations (e.g. chocolate is a big problem in the context of children's health). While such a model may identify hypernymy relations between unseen terms, based on general paths, it is prone to over-generalization, hurting its performance, as seen in Table 4. As discussed in Section 4.2, we suspect that this scenario, in which both terms are unseen, is usually not common enough to justify this limiting training setup.

Error Analysis
False Positives We categorized the false positive pairs on the random split according to the relation holding between each pair of terms in the resources used to construct the dataset. We grouped several semantic relations from different resources into broad categories, e.g. synonym also includes alias and Wikipedia redirection. Table 6 displays the distribution of semantic relations among false positive pairs. More than 20% of the errors stem from confusing synonymy with hypernymy, which are known to be difficult to distinguish.
An additional 30% of the term-pairs are reversed hypernym-hyponym pairs (y is a hyponym of x). Examining a sample of these pairs suggests that they are usually near-synonyms, i.e., it is not that clear whether one term is truly more general than the other or not. For instance, fiction is annotated in WordNet as a hypernym of story, while our method classified fiction as its hyponym. A possible future research direction might be to extend our network to classify term-pairs simultaneously into multiple semantic relations, as in Pavlick et al. (2015). Such a multiclass model can hopefully better distinguish between these similar semantic relations.
Another notable category is hypernymy-like relations: these are other relations in the resources that could also be considered as hypernymy, but were annotated as negative due to our restrictive selection of only indisputable hypernymy relations from the resources (see Section 4.1). These include instances like (Goethe, occupation, novelist) and (Homo, subdivisionRanks, species).
Lastly, other errors made by the model often correspond to term-pairs that co-occur very few times in the corpus, e.g. xebec, a studio producing Anime, was falsely classified as a hyponym of anime.
False Negatives We sampled 50 term-pairs that were falsely annotated as negative, and analyzed the major (overlapping) types of errors (Table 7).
Most of these pairs had only a few co-occurrences in the corpus. This is often either due to infrequent terms (e.g. cbc.ca), or a rare sense of x in which the hypernymy relation holds (e.g. (night, play) holding for "Night", a dramatic sketch by Harold Pinter). Such a term-pair may have too few hypernymy-indicating paths, leading to classifying it as negative.
Table 7 error types: (1) x and y co-occurred fewer than 25 times (average co-occurrences for true positive pairs is 99.7). (2) Either x or y is infrequent. (3) The hypernymy relation holds for a rare sense of x. (4) (x, y) was incorrectly annotated as positive.

Conclusion
We presented HypeNET: a neural-network-based method for hypernymy detection. First, we focused on improving path representation using LSTM, resulting in a path-based model that performs significantly better than prior path-based methods, and matches the performance of the previously superior distributional methods. In particular, we demonstrated that the increase in recall is a result of generalizing semantically-similar paths, in contrast to prior methods, which either make no generalizations or over-generalize paths. We then extended our network by integrating distributional signals, yielding an improvement of an additional 14 F1 points, and demonstrating that the path-based and the distributional approaches are indeed complementary.
Finally, our architecture seems straightforwardly applicable for multi-class classification, which, in future work, could be used to classify term-pairs into multiple semantic relations.