When Hearst Is Not Enough: Improving Hypernymy Detection from Corpus with Distributional Models

We address hypernymy detection, i.e., deciding whether an is-a relationship exists between a pair of words (x, y), with the help of large textual corpora. Conventional approaches to this task are typically categorized as either pattern-based or distributional. Recent studies suggest that pattern-based approaches are superior when large-scale Hearst pairs are extracted and fed to them, relieving the sparsity of unseen (x, y) pairs. However, they still fail in a specific sparsity case, where x or y is not involved in any pattern. This paper is the first to quantify the non-negligible existence of such cases. We also demonstrate that distributional methods are ideal for making up for pattern-based ones in these cases. We devise a complementary framework under which a pattern-based and a distributional model collaborate seamlessly, each handling the cases it prefers. On several benchmark datasets, our framework achieves competitive improvements, and a case study shows its better interpretability.


Introduction
A taxonomy is a semantic hierarchy of words or concepts organized w.r.t. their hypernymy (a.k.a. is-a) relationships. Being a well-structured resource of lexical knowledge, taxonomies are vital to various tasks such as question answering (Gupta et al., 2018), textual entailment (Dagan et al., 2013; Bowman et al., 2015; Yu et al., 2020b), and text generation (Biran and McKeown, 2013). When automatically building taxonomies from scratch or populating manually crafted ones, the hypernymy detection task plays a central role. For a pair of queried words (x_q, y_q), hypernymy detection requires inferring the existence of a hyponym-hypernym relationship between x_q and y_q. Due to their good coverage and availability, free-text corpora are widely used to facilitate hypernymy detection, resulting in two lines of approaches: pattern-based and distributional.

Figure 1: The overall framework of complementary methods for hypernymy detection from corpus. Different sparsity types of queried pairs are handled with pattern-based and distributional models, respectively.
Pattern-based approaches employ pattern pairs (x, y) extracted via Hearst-like patterns (Hearst, 1992), e.g., "y such as x" and "x and other y". An example of pattern pairs extracted from a corpus is shown in Figure 1. Despite their high precision, the extracted pairs suffer from sparsity, which is twofold: Type-I, where x_q and y_q separately appear in some extracted pairs but the pair (x_q, y_q) itself is absent, e.g., (dog, animal); and Type-II, where either x_q or y_q is not involved in any extracted pair, e.g., (crocodile, animal).
Although matrix factorization (Roller et al., 2018) and embedding techniques (Vendrov et al., 2016; Nickel and Kiela, 2017; Le et al., 2019) are widely adopted to implement pattern-based approaches, they only relieve the Type-I sparsity and cannot generalize to unseen words appearing in Type-II pairs. On the other hand, distributional approaches follow, or are inspired by, the Distributional Inclusion Hypothesis (DIH; Geffet and Dagan 2005), i.e., the set of a hyponym's contexts should be roughly contained in its hypernym's. Although applicable to any word in a corpus, they are suggested to be inferior to pattern-based approaches fed with sufficient extracted pairs (Roller et al., 2018; Le et al., 2019).
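As an illustration of the DIH, the sketch below computes Weeds Precision, one classic asymmetric inclusion metric from the line of work cited above. The feature vectors and the `weeds_precision` helper are toy constructions for exposition, not the actual features used in these papers.

```python
# Distributional Inclusion Hypothesis (DIH) sketch: a hyponym's context
# features should be (weighted-)included in the hypernym's. WeedsPrec is
# one classic asymmetric inclusion metric over such feature vectors.
# Feature vectors here are toy dicts {context_feature: weight}.

def weeds_precision(hypo_vec, hyper_vec):
    """Fraction of the hyponym's feature mass covered by the hypernym."""
    covered = sum(w for f, w in hypo_vec.items() if f in hyper_vec)
    total = sum(hypo_vec.values())
    return covered / total if total else 0.0

dog = {"bark": 2.0, "leash": 1.0, "fur": 1.0}
animal = {"bark": 1.0, "fur": 3.0, "wild": 2.0, "zoo": 1.0}

# Under the DIH, dog -> animal should score higher than animal -> dog.
print(weeds_precision(dog, animal))   # 3.0 / 4.0 = 0.75
print(weeds_precision(animal, dog))   # 4.0 / 7.0 ≈ 0.57
```

Note the asymmetry: the metric is directional by construction, which is what lets such scores suggest not just relatedness but a hyponym-hypernym direction.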
Since pattern-based methods have unresolved sparsity issues, while distributional ones are more broadly applicable but globally inferior, neither can dominate the other in every aspect. In this light, we are interested in two questions:

• Is the Type-II sparsity severe in practice?
• If so, how to complement pattern-based approaches with distributional ones where the former is invalid?
To answer the first question, we conduct analyses involving estimations on real-world corpora as well as statistics of common hypernymy detection datasets. Results from both sources indicate that the likelihood of encountering the Type-II sparsity in practice can exceed 50%, and is thus non-negligible.
For the second question, we present ComHyper, a complementary framework (Sec. 4.1) which takes advantage of both pattern-based models' superior performance on Type-I cases and the broad coverage of distributional models on Type-II ones. Specifically, to deal with the Type-II sparsity, instead of directly using unsupervised distributional models, ComHyper uses a training stage (Sec. 4.3) that samples from the output space of a pattern-based model to train a supervised distributional model implemented by different context encoders (Sec. 4.2). In the inference stage, ComHyper uses the two models to separately handle the type of sparsity each is good at, as illustrated in Figure 1. In this manner, ComHyper relies on the partial use of pattern-based models on Type-I sparsity to secure performance no lower than distributional models', and further attempts to lift performance by fixing the former's blind spots (Type-II sparsity) with the latter. On several benchmarks and evaluation settings, the distributional model in ComHyper proves effective on its targeted cases, making our complementary approach outperform a competitive class of pattern-based baselines (Roller et al., 2018). Further analysis also suggests that ComHyper is robust when facing different mixtures of Type-I and Type-II sparsity.
Our contributions are summarized as follows: 1) We confirm that a specific type of sparsity issue of current pattern-based approaches is non-negligible. 2) We propose a framework that complements pattern-based approaches with distributional models where the former are invalid. 3) We systematically conduct comparisons on several common datasets, validating the superiority of our framework.

Related Work
Pattern-Based Approaches. Taxonomies from experts (e.g., WordNet (Miller, 1995)) have proved effective in various reasoning applications (Song et al., 2011; Zhang et al., 2020). Meanwhile, Hearst patterns (Hearst, 1992) make large corpora a good resource of explicit is-a pairs, resulting in automatically built hypernymy knowledge bases of large scales (Wu et al., 2012; Seitner et al., 2016). However, the coverage of both words and hypernymy pairs in those resources is far from complete.
To infer unknown hypernymies between known words, e.g., implicit is-a pairs in transitive closures, pattern-based models have been proposed. Roller et al. (2018) and Le et al. (2019) show that, on a broad range of benchmarks, simple matrix decomposition or embeddings over pattern-based word co-occurrence statistics provide robust performance. On Probase (Wu et al., 2012), a Hearst-pattern-based taxonomy, Yu et al. (2015) use embeddings to address the same sparsity problem. Some methods (Vendrov et al., 2016; Athiwaratkun and Wilson, 2018; Nickel and Kiela, 2017, 2018; Ganea et al., 2018) embed WordNet in low-dimensional spaces. Since they depend on vectors of words learnt from known is-a pairs, the above pattern-based methods cannot induce hypernymy pairs whose words do not appear in any pattern.

Distributional Approaches. Distributional models are inspired by the DIH (Geffet and Dagan, 2005). They work only on word contexts rather than extracted pairs, and are thus applicable to any word in a corpus. Early unsupervised models typically propose asymmetric similarity metrics over manual word feature vectors for entailment (Weeds et al., 2004; Clarke, 2009; Santus et al., 2014). Chang et al. (2018) and Nguyen et al. (2017) inject the DIH into unsupervised embedding models to yield latent feature vectors with hypernymy information. Those feature vectors, manual or latent, may serve in unsupervised asymmetric metrics or to train supervised hypernymy classifiers. Shwartz et al. (2017) explore combinations of manual features and (un)supervised predictors, and suggest that unsupervised metrics are more robust w.r.t. the distribution change of training instances. Projection learning (Fu et al., 2014; Ustalov et al., 2017; Wang and He, 2020) has also been used for supervised hypernymy detection.

Other Improved Methods. Due to the weak generalization ability of Hearst patterns, Anh et al. (2016) and Shwartz et al.
(2016) relax the constraints from strict Hearst patterns to co-occurring contexts or lexico-syntactic paths between two words. They encode the co-occurring contexts or paths using word vectors to train hypernymy embeddings or classifiers. Although this leads to better recall than Hearst patterns (Washio and Kato, 2018), the trained embeddings or models still cannot generalize to every word in a corpus; in particular, they have no ability to cope with the Type-II sparsity, which is the main focus of our work.
Another line of work, retrofitting methods, i.e., adjusting distributional vectors to satisfy external linguistic constraints, has also been applied to hypernymy detection. However, these methods strictly require additional resources, e.g., synonyms and antonyms, to achieve better performance (Kamath et al., 2019). To the best of our knowledge, we are the first to propose complementing the two lines of approaches to cover every word in a simple yet efficient way, with extensive analysis of the framework's potential and evaluation of its performance.

Preliminaries
We formally define the aforementioned two types of sparsity, and provide some statistical insights about their impacts on pattern-based methods.

Notations and Definitions
Let V be the vocabulary of a corpus C. By applying Hearst patterns to C, a set of extracted pairs P is obtained. As in Section 2, pattern-based approaches usually use P to perform matrix factorization or embedding learning. Due to their nature, only words "seen" in P, i.e., V_P = {x | (x, y) ∈ P ∨ (y, x) ∈ P}, will have respective columns/rows or embeddings. We refer to them as in-pattern (or IP for short) words. We refer to words without columns/rows or embeddings, i.e., V \ V_P, as out-of-pattern (or OOP) words. Suppose a pair of words q = (x_q, y_q) is queried for potential hypernymy. We say q is an IP pair if both x_q and y_q are IP words, or an OOP pair if either of them is OOP. Since explicit columns/rows or embeddings are needed for both x_q and y_q, pattern-based approaches can only make inferences on IP pairs, and are infeasible on OOP ones.
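The IP/OOP distinction above can be sketched in a few lines; the helper names and the toy pair set P are illustrative only.

```python
# Sketch of the IP/OOP definitions: V_P is the set of words seen on
# either side of some extracted Hearst pair, and a queried pair is IP
# only if both words are in V_P.

def in_pattern_vocab(P):
    """V_P: words appearing on either side of an extracted pair."""
    return {w for pair in P for w in pair}

def pair_type(q, V_P):
    x_q, y_q = q
    return "IP" if x_q in V_P and y_q in V_P else "OOP"

P = {("dog", "animal"), ("cat", "animal"), ("oak", "tree")}
V_P = in_pattern_vocab(P)

print(pair_type(("cat", "tree"), V_P))          # IP: both words occur in P
print(pair_type(("crocodile", "animal"), V_P))  # OOP: 'crocodile' never occurs in a pattern
```

Note that ("cat", "tree") is an IP pair even though the pair itself was never extracted; that is exactly the Type-I case, while ("crocodile", "animal") is Type-II.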

Observations and Motivation
Given the infeasibility of pattern-based methods on OOP pairs, we are interested in to what extent pattern-based methods are limited, i.e., the rough likelihood of encountering OOP pairs in practice. At first sight, Hearst patterns may have very sparse occurrences in a corpus. Nevertheless, words with higher frequencies tend to be covered by Hearst patterns and thus be IP words. Therefore, the likelihood of encountering OOP pairs is not obvious to assess.
To shed light on the OOP issue of pattern-based methods, we conduct an analysis on the corpora and extracted pairs in Roller et al. (2018). Considering that nouns tend to be queried for potential hypernymy more often than, say, verbs, we focus only on nouns. In Figure 2, we show the corpus frequency of all nouns and of in-pattern nouns, and draw the following observations. 1) V_P covers the most frequent nouns in V well. For the top-10^4 frequent nouns, the two lines of dots overlap well, indicating that common nouns are very likely to be involved in Hearst patterns.
2) Due to the limited size of V_P, it cannot cover the tail of V. Beyond frequency rank 10^4, the two lines begin to separate. Comparing their intersections with the x-axis, it is understandable that a limited number of IP nouns cannot cover both frequent and tail nouns in a vocabulary whose size is several orders of magnitude larger.
3) The likelihood of a noun being OOP is non-negligible. The two lines enclose a triangular region, corresponding to the likelihood of a randomly drawn noun being OOP. According to our statistics, this region accounts for a non-negligible proportion of 19.9% of the total area. With the likelihood of OOP nouns at hand, we are ready to roughly estimate the likelihood of encountering OOP pairs in practice. Suppose the two words in q are nouns independently sampled from the corpus distribution. Then the probability of q being OOP, i.e., infeasible for pattern-based methods, is 1 − (1 − 0.199)^2 ≈ 35.8%. Even if y_q tends to be biased towards more common words, the optimistic estimate is still above 19.9%. Table 1 lists the actual portions of OOP pairs in several commonly used datasets w.r.t. the P in Roller et al. (2018). Note that neither the datasets nor P is created in favor of the other. These actual rates may be above or below the estimated interval of 19.9%-35.8%, but all are at considerable levels. Considering the above analyses, we confirm that OOP pairs are non-negligible in practice, giving a positive answer to the first question in Section 1.

Motivation of the Study. OOP pairs are problematic for pattern-based methods. Despite their non-negligible existence, former pattern-based methods (Roller et al., 2018; Le et al., 2019) boldly classify them as non-hypernymy in prediction. However, distributional methods are applicable as long as the two queried words have contexts. Thus, they are ideal to complement pattern-based methods on the non-negligible minority of OOP pairs.
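The back-of-the-envelope estimate above can be reproduced directly:

```python
# If a randomly drawn noun is OOP with probability p = 0.199, and the
# two query words are drawn independently, then a queried pair is OOP
# whenever at least one word is OOP.

p_oop_word = 0.199
p_oop_pair = 1 - (1 - p_oop_word) ** 2
print(round(p_oop_pair, 3))  # 0.358, i.e., ~35.8%
```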

Framework
Our framework is illustrated in Figure 1. It consists of a pattern-based model and a distributional model cooperating on the data resource to answer an arbitrarily queried pair of words q ∈ V × V.

Data Resource. To train a pattern-based model using prior solutions, our data resource includes the extracted pairs P from some text corpus C. Unlike pattern-based approaches that depend solely on P, our data resource also involves the corpus C itself for the sake of the distributional model.

Pattern-Based Model. The pattern-based model works on the extracted pairs P and serves in two roles. On the one hand, it is responsible for generalizing from statistics on P to score any in-pattern pair q ∈ V_P × V_P, reflecting the plausibility of a hypernymy relationship. To this end, it is sufficient to adopt matrix-factorization-based (Roller et al., 2018) or embedding models (Le et al., 2019). On the other hand, the pattern-based model also provides supervision signals, via a sampler, for training the distributional model; we specify this role later. Formally, we denote the pattern-based model by f : V_P × V_P → R.

Distributional Model. Different from the pattern-based model defined on IP pairs V_P × V_P, the distributional model has the form g : V × V → R, i.e., it should be capable of predicting on any word pair in V × V. This invalidates any dependency of the model on extracted pairs involving x_q or y_q. Instead, the separate contexts of x_q and y_q in the corpus C serve as the basis and input of the distributional model. Given the superior performance of pattern-based models on IP pairs (Roller et al., 2018), the distributional model g is only responsible for answering OOP pairs.
Various choices exist to implement the distributional model. We may apply unsupervised metrics (Weeds et al., 2004; Clarke, 2009; Santus et al., 2014) on manual features extracted from the contexts of x_q and y_q, which are robust to the distribution change of training data (Shwartz et al., 2017). However, the scores of those metrics are not necessarily on the same scale as those output by the pattern-based model f for IP pairs. Such inconsistency will harm downstream systems that involve the scores for ranking or calculation.
Given sufficient supervision signals from f and the inherent noise of natural language, we implement the distributional model g by a supervised neural-network-based approach. Specifically, the network encodes the contexts of x and y in C, i.e., C(x) and C(y), into x_h and y_H, respectively, and makes predictions by a dot product, i.e., g(x, y) = x_h · y_H. Note that hypernymy is essentially asymmetric, so we distinguish x_h and y_H by their subscripts to reflect the asymmetry. In practice, we adopt networks with separate parameters for C(x) and C(y), as detailed in the next section.
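The asymmetric scoring head described here can be sketched as follows; the placeholder linear "encoders" stand in for the actual context-encoding networks of Sec. 4.2, and all names and weights are illustrative.

```python
# Sketch of the scoring head: separate encoders produce x_h (hyponym
# role) and y_H (hypernym role), and g is their dot product. Because
# the two encoders do not share parameters, g(x, y) != g(y, x) in general.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def encode_hypo(x_vec, W):   # stand-in for the hyponym-side network
    return [dot(row, x_vec) for row in W]

def encode_hyper(y_vec, U):  # separate parameters reflect the asymmetry
    return [dot(row, y_vec) for row in U]

def g(x_vec, y_vec, W, U):
    return dot(encode_hypo(x_vec, W), encode_hyper(y_vec, U))

W = [[1.0, 0.0], [0.0, 2.0]]
U = [[0.5, 0.5], [1.0, -1.0]]
x, y = [1.0, 2.0], [2.0, 1.0]
print(g(x, y, W, U))  # 5.5
print(g(y, x, W, U))  # 1.0 -- swapping roles changes the score
```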

Encoding Queried Words
To implement the distributional model, we encode C(x) and C(y) into hypernymy-specific representations x_h and y_H, respectively. There are various off-the-shelf models to encode sentential contexts. We take the following four approaches.

Transformed Word Vector. Instead of working directly on the original contexts C(x) and C(y), this approach takes as input the pre-trained word vectors (Mikolov et al., 2013; Pennington et al., 2014) x and y of x and y, and applies two Multi-Layer Perceptrons (MLPs), respectively: x_h = MLP_h(x) and y_H = MLP_H(y). The intuition is that word vectors roughly depend on the contexts and encode the distributional semantics. To make the MLPs generalize to V rather than V_P, the word vectors are fixed during training. Inspired by work on post specialization, this approach similarly generalizes task-specific word vector transformations to unseen words, though the evaluation task there is not hypernymy detection.

NBOW with MEAN-Pooling. Given the words {c_j}_{j=1}^n in a context c ∈ C(x), the Neural Bag-of-Words (NBOW for short) encoder looks up and averages their pre-trained vectors c_j as c̄, transforms c̄ through an MLP, and averages the resulting vectors through a MEAN-pooling layer to obtain x_h: c̄ = (1/n) Σ_{j=1}^n c_j and x_h = MEAN_{c ∈ C(x)}(MLP(c̄)). To obtain y_H, a similar network is applied, though the two MLPs do not share parameters, to reflect the asymmetry of hypernymy. We fix the embeddings of context words during training because satisfactory performance is observed. Due to its simplicity, NBOW is efficient to train. However, it ignores the order of context words and may not preserve semantics well.

CONTEXT2VEC with MEAN-Pooling. To study the impact of positional information within the context, we also substitute the NBOW encoder with the CONTEXT2VEC encoder (Melamud et al., 2016). In CONTEXT2VEC, two LSTMs are used to encode the left and right contexts of an occurrence of x, respectively.
The two output vectors are concatenated as the final context representation c, which then undergoes the same transformation and averaging as in NBOW. Formally, c = [LSTM_l(c_1, …, c_{j−1}); LSTM_r(c_n, …, c_{j+1})], where j is the position of x in the context. Note that the encoder for y still has parameters separate from those for x.

Hierarchical Attention Networks. NBOW and CONTEXT2VEC with MEAN-Pooling both aggregate every context word's information into x_h and y_H. Given several long contexts and the fixed output dimension, it is vital for encoders to capture the most useful information. Inspired by Yang et al. (2016), we incorporate attention over different words and contexts. We use a feed-forward network to estimate the importance of each context word and combine their information to obtain c: α_j ∝ exp(FFN(c_j)) and c = Σ_j α_j c_j. Then, another similar network is applied to all c^(i) ∈ C(x) to obtain the representation x_h: β_i ∝ exp(FFN′(c^(i))) and x_h = Σ_i β_i c^(i). For the word y, the encoder is similar but still has parameters separate from those for x.
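The NBOW-with-MEAN-pooling encoder above can be sketched in a few lines; the toy linear layer stands in for the MLP, and the two-dimensional embeddings are hand-made for illustration.

```python
# NBOW-with-MEAN-pooling sketch: average the (frozen) word vectors
# inside each context, transform with an MLP (here a toy linear map),
# then MEAN-pool over all contexts of the target word.

def vec_mean(vs):
    n = len(vs)
    return [sum(v[i] for v in vs) / n for i in range(len(vs[0]))]

def linear(W, v):  # stand-in for the MLP transformation
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def nbow_encode(contexts, emb, W):
    per_context = []
    for c in contexts:                         # c is a list of context tokens
        c_bar = vec_mean([emb[t] for t in c])  # average of token vectors
        per_context.append(linear(W, c_bar))   # transform each context
    return vec_mean(per_context)               # MEAN-pool over contexts

emb = {"big": [1.0, 0.0], "dog": [0.0, 1.0], "barks": [1.0, 1.0]}
W = [[1.0, 1.0], [0.0, 1.0]]
x_h = nbow_encode([["big", "dog"], ["dog", "barks"]], emb, W)
print(x_h)  # [1.25, 0.75]
```

The hypernym-side encoder would use the same structure with its own weight matrix, reflecting the asymmetry discussed above.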

Training the Distributional Model
We train the distributional model g's parameters Φ with supervision signals from the pattern-based model f. To make the output scores of f and g comparable, we adopt the squared error between the two scores as the loss on a pair (x, y), i.e., l(x, y; Φ) = (f(x, y) − g(x, y))^2. Compared with the potentially large size of the output space, a set of random samples from it suffices to train the parameters Φ. For each IP word x ∈ V_P, we uniformly sample k entries from Δ_x, the column and row involving x in the output space V_P × V_P, i.e., Δ_x = {(x, y) | y ∈ V_P} ∪ {(y, x) | y ∈ V_P}. The sampling for x is done over P_x, a uniform distribution over Δ_x. Finally, our objective is min_Φ Σ_{x ∈ V_P} L(x; Φ), where L(x; Φ) is the expected loss related to x: L(x; Φ) = E_{(x^(i), y^(i)) ∼ P_x} [l(x^(i), y^(i); Φ)].
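The sampling-based objective can be sketched as follows; `f`, `g`, and the toy vocabulary are stand-ins for the actual pattern-based teacher and neural student, not the paper's code.

```python
# Training sketch: for each in-pattern word x, sample k pairs uniformly
# from Delta_x (its row and column in the output space) and regress the
# student g onto the teacher f with a squared-error loss.

import random

def delta(x, V_P):
    """Delta_x: all IP pairs involving x, on either side."""
    return [(x, y) for y in V_P if y != x] + [(y, x) for y in V_P if y != x]

def sampled_loss(x, V_P, f, g, k, rng):
    pairs = [rng.choice(delta(x, V_P)) for _ in range(k)]  # uniform over Delta_x
    return sum((f(a, b) - g(a, b)) ** 2 for a, b in pairs) / k

V_P = ["dog", "cat", "animal"]
f = lambda a, b: 1.0 if b == "animal" else 0.0  # teacher: pattern-based scores
g = lambda a, b: 0.5                            # untrained student

rng = random.Random(0)
print(sampled_loss("dog", V_P, f, g, k=100, rng=rng))  # 0.25 for this toy f, g
```

In training, this Monte-Carlo estimate of L(x; Φ) would be minimized over Φ by gradient descent; here g is a constant purely for illustration.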

Experimental Setup
We adopt the widely-used comprehensive evaluation framework provided by Roller et al. (2018) and Le et al. (2019). To make experimental results comparable, we align our settings with theirs as much as possible.

Corpora and Evaluation
Corpora. We use the 431k is-a pairs (243k unique) released by Roller et al. (2018). We substitute the Gigaword corpus they used with ukWaC (Ferraresi, 2007) because the former is not freely available. This decision does not affect reproducing the pattern-based approaches in Roller et al. (2018).

Evaluation Tasks. The three sub-tasks are: 1) ranked hypernymy detection: given (x_q, y_q), decide whether y_q is a hypernym of x_q. Five datasets are used, i.e., BLESS (Baroni and Lenci, 2011), EVAL (Santus et al., 2015), LEDS (Baroni et al., 2012), SHWARTZ (Shwartz et al., 2016), and WBLESS (Weeds et al., 2014). Positive predictions should be ranked higher than negative ones, and Average Precision (AP) is used for evaluation. 2) hypernymy direction classification: determine which word in a pair has the broader meaning. Besides BLESS and WBLESS, we also use BIBLESS (Kiela et al., 2015), and Accuracy (Acc.) is reported for this binary classification. 3) graded entailment: predict scalar scores on HYPERLEX (Vulić et al., 2017). Spearman's correlation ρ between the labels and the predicted scores is reported.
The statistics of the datasets are shown in Table 1. The three tasks require algorithms to output scores unsupervisedly, indicating the strength of hypernymy relationships. Note that no external training data is available in the evaluation; only extracted Hearst pattern pairs may be used for supervision.
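The ranking metric used for the detection task can be sketched as follows; the scores and labels are toy values, not drawn from the benchmarks.

```python
# Average Precision (AP) sketch: rank pairs by model score and average
# the precision at each position where a true hypernymy pair appears.
# Positives ranked above negatives yield AP close to 1.

def average_precision(scores, labels):
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, precisions = 0, []
    for i, (_, is_pos) in enumerate(ranked, start=1):
        if is_pos:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(precisions) if precisions else 0.0

scores = [0.9, 0.8, 0.4, 0.2]   # model scores for four queried pairs
labels = [1, 0, 1, 0]           # 1 = true hypernymy pair
print(average_precision(scores, labels))  # (1/1 + 2/3) / 2 ≈ 0.833
```

This makes concrete why giving all OOP pairs the minimum score hurts AP: every true-hypernymy OOP pair sits at the bottom of the ranking, dragging its precision contribution down.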

Compared Methods
Pattern-Based Approaches. We reproduce four pattern-based methods, i.e., Count, PPMI, SVD-Count, and SVD-PPMI. As in Roller et al. (2018), SVD-PPMI is generally the most competitive.

Distributional Approaches. We compare with the unsupervised distributional baselines in Roller et al. (2018), i.e., Cosine, Weeds Precision (WP), invCL, and SLQS. For the supervised distributional baseline, we adopt the strongest model, SDSN, from Rei et al. (2018) and take the probability scores of the binary classifier as hypernymy predictions. All 431k extracted pairs serve as true hypernymy pairs, and false ones are generated by replacing one of the terms in a true pair with a random term.
Complementary Approaches. We adopt SVD-PPMI as the pattern-based model in our framework. We pre-train 300-dimensional word embeddings with Skip-Gram (Mikolov et al., 2013) on our corpus for use by the distributional model. Specifically, we compare the transformed word vector (W2V), NBOW/CONTEXT2VEC with MEAN-Pooling (NBOW/C2V), and Hierarchical Attention Networks (HAN). The output dimension of our four encoders is set to 300. The batch size is set to 128 and the learning rate to 10^-3. We tuned the sampling size k in {1, 3, 5, 10, 100, 200, 400, 800} on the validation set. We did not tune other hyperparameters since the default settings work well. Our code is available at https://github.com/ccclyu/ComHyper.

Experimental Results
We aim to answer: 1) Are our distributional models supervised well by the pattern-based model? 2) Do they improve our complementary methods over the pattern-based ones? 3) Are complementary methods robust w.r.t. fewer extracted pairs?

Performance on OOP Pairs
To ensure that our supervised distributional models work effectively on OOP pairs, we evaluate on OOP pairs only under the aforementioned settings. Because pattern-based approaches trivially give the lowest scores to OOP pairs, we only compare with distributional approaches.

Table 3: Experimental results on all queried pairs. Best ones are marked bold while second-best ones underlined.

As shown in Table 2, except on LEDS, our distributional models generally achieve higher scores than the unsupervised approaches. Notably, on the BLESS dataset, Cosine even gets a zero Accuracy score because it is symmetric and cannot suggest the right direction. The higher AP and Accuracy scores suggest that, supervised by the pattern-based model, our distributional models can generate better relative rankings within the scope of OOP pairs.

Main Results and Case Study
When facing both IP and OOP pairs, it is not enough to rank the two types of pairs separately, since downstream systems usually require comparable scores or a unified ranking. We evaluate on the entire datasets under the aforementioned settings. We only compare with pattern-based methods and supervised distributional models because they generally outperform unsupervised ones. Table 3 provides the main results. Best results are marked bold, and second-best ones are underlined. To better interpret the results, we also provide "Oracle" scores, i.e., the upper bounds that complementary methods can achieve. For the Detection task, Oracle scores are obtained by assigning the maximum score to OOP pairs that have hypernymy relationships (see Table 1) and the minimum score to the others. For BLESS in the Direction task, the Oracle score is computed by assuming perfect predictions for OOP pairs. The Oracle scores for WBLESS/BIBLESS in the Direction task and HYPERLEX in Graded Entailment are not straightforward to estimate and are thus omitted.
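The Oracle construction for the Detection task can be sketched as follows; the helper name `oracle_scores` and the toy data are illustrative, not the paper's evaluation code.

```python
# Oracle upper-bound sketch for Detection: keep the model's scores on
# IP pairs, but give every true-hypernymy OOP pair the maximum score
# and every other OOP pair the minimum score.

def oracle_scores(pairs, scores, oop, gold):
    hi, lo = max(scores), min(scores)
    return [
        (hi if gold[p] else lo) if p in oop else s
        for p, s in zip(pairs, scores)
    ]

pairs = ["p1", "p2", "p3"]
scores = [0.7, 0.2, 0.1]          # p3 is OOP, so the model scored it lowest
oop = {"p3"}
gold = {"p1": True, "p2": False, "p3": True}

print(oracle_scores(pairs, scores, oop, gold))  # [0.7, 0.2, 0.7]
```

Feeding these adjusted scores to the AP metric gives the upper bound a complementary method could reach if its distributional half were perfect on OOP pairs.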
In Table 3, complementary methods lead to superior results on the Detection and Direction tasks. In eight out of nine columns, the best and second-best scores are both achieved by complementary methods. In particular, large improvements (up to 25.9%) are observed on SHWARTZ, which has a higher OOP rate and thus a higher Oracle. In general, the HAN encoder achieves better performance. By attending to the most informative contexts and words, the HAN encoder potentially captures distributional semantics relevant to hypernymy relationships between queried words. Note that the relative performance of the different context encoders is not necessarily consistent with that in Table 2. This is because the overall performance is sensitive not only to the relative ranking of OOP pairs, but also to their absolute scores.
In addition, with the same extracted P as supervision signals, our proposed methods show great superiority over the supervised method (SDSN in Table 3). Both SDSN and our complementary approaches can be regarded as combining pattern-based and distributional models. The key difference is that complementary methods solve Type-I sparsity with a pattern-based model, which has proved better than distributional ones in this case, while SDSN uses a distributional model (though supervised) uniformly in both cases.

Case Study. To explain the superiority of the HAN encoder, we exemplify with two true-hypernymy OOP pairs, one from each of two Detection datasets. Here, the two hyponyms are both uncommon, OOP words. Therefore, pattern-based models such as SVD-PPMI simply assign the pairs minimum scores and rank them at the bottom. But by examining their contexts in the textual corpus, the hypernymy relationships could have been inferred, and they could have been scored higher.

Figure 3: Case study of two queried pairs from two datasets, with OOP rates and actual ranks. LEDS: (vicarage, building), rank 1289/2770, OOP rate 7.55%; SHWARTZ: (kinetoscope, device), rank 4341/52577, OOP rate 67.07%.
In Figure 3, we show the two OOP pairs, as well as their ranks according to HAN and the OOP rates of the corresponding datasets. We also show the top-3 contexts scored by HAN and visualize the context- and word-level attention weights. We observe that HAN can attend to informative contexts and words that help capture the semantics of the OOP word. For example, in LEDS, vicarage is OOP. HAN suggests three contexts that imply its meaning well. By reading the context words and phrases highlighted by HAN, e.g., commodious residence and collegiate church, even people who do not know the word may guess that it is a type of building. With our HAN-based distributional model, the pair is successfully promoted to the top 50% of the ranking, well out of and above the bottom 7.55% of OOP pairs. Similar observations hold for the other pair, i.e., (kinetoscope, device), with contexts moving picture viewer and movie projector system.
We also observe that wrong predictions may be caused by extremely sparse contexts in the corpus, such as for famicom in the SHWARTZ dataset.

Impacts of Reduced Pairs
To analyze our complementary framework's robustness w.r.t. sparser extracted pairs P, we randomly sample {95%, 75%, 55%, 35%, 15%} of all 243k unique is-a pairs and rerun SVD-PPMI, the best pattern-based approach, as well as our complementary approaches. In Figure 4, we only illustrate the results on LEDS for Detection and BLESS for Direction. Observations on the other datasets are similar and thus omitted. We have the following observations. First, with fewer extracted pairs, the OOP rates increase quickly, and all models generally perform worse. This is not surprising, since a sparser P leads to a less informative SVD-PPMI matrix and less supervision for the distributional models. Second, despite the increased OOP rates, our complementary methods consistently outperform SVD-PPMI and suffer less from increasing OOP rates, especially on BLESS. Finally, among the four context encoders, HAN performs better than the others when the sampling rate is above 75%. However, with lower sampling rates, W2V is more robust than the others on BLESS but fails to exceed HAN on EVAL.

Conclusion and Future Work
We propose complementing pattern-based and distributional methods for hypernymy detection. To the best of our knowledge, this is the first work along this line. We formally describe the two types of sparsity that extracted pairs face, and show that pattern-based methods are invalid on the Type-II, i.e., out-of-pattern, pairs. By analyzing common corpora and datasets, we confirm that OOP pairs are non-negligible for the task. To this end, we devise a complementary framework, in which a pattern-based and a distributional model handle IP and OOP pairs separately while collaborating seamlessly to give unified scores. Oracle performance analysis shows that our framework has high potential on several datasets. Supervised by the pattern-based model, the distributional model shows a robust capability of scoring OOP pairs and pushing the overall performance towards the oracle bounds.
In the future, we will extend similar approaches to multilingual (Yu et al., 2020a) and cross-lingual (Upadhyay et al., 2018) lexical entailment tasks. Moreover, one interesting direction is to use hyperbolic embeddings (Le et al., 2019; Balazevic et al., 2019) for pattern-based models, due to their inherent ability to model hierarchies.