Distributional Inclusion Vector Embedding for Unsupervised Hypernymy Detection

Modeling hypernymy, such as poodle is-a dog, is an important generalization aid to many NLP tasks, such as entailment, relation extraction, and question answering. Supervised learning from labeled hypernym sources, such as WordNet, limits the coverage of these models, which can be addressed by learning hypernyms from unlabeled text. Existing unsupervised methods either do not scale to large vocabularies or yield unacceptably poor accuracy. This paper introduces distributional inclusion vector embedding (DIVE), a simple-to-implement unsupervised method of hypernym discovery via per-word non-negative vector embeddings which preserve the inclusion property of word contexts. In experimental evaluations more comprehensive than any previous literature of which we are aware—evaluating on 11 datasets using multiple existing as well as newly proposed scoring functions—we find that our method provides up to double the precision of previous unsupervised methods, and the highest average performance, using a much more compact word representation, and yielding many new state-of-the-art results.


Introduction
Numerous applications benefit from compactly representing context distributions, which assign meaning to objects under the rubric of distributional semantics. In natural language processing, distributional semantics has long been used to assign meanings to words (that is, to lexemes in the dictionary, not individual instances of word tokens). The meaning of a word in the distributional sense is often taken to be the set of textual contexts (nearby tokens) in which that word appears, represented as a large sparse bag of words (SBOW). Without any supervision, Word2Vec (Mikolov et al., 2013), among other approaches based on matrix factorization (Levy et al., 2015a), successfully compress the SBOW into a much lower dimensional embedding space, increasing the scalability and applicability of the embeddings while preserving (or even improving) the correlation of geometric embedding similarities with human word similarity judgments.
While embedding models have achieved impressive results, context distributions capture more semantic information than just word similarity. The distributional inclusion hypothesis (DIH) (Weeds and Weir, 2003; Geffet and Dagan, 2005; Cimiano et al., 2005) posits that the context set of a word tends to be a subset of the contexts of its hypernyms. For a concrete example, most adjectives that can be applied to poodle can also be applied to dog, because dog is a hypernym of poodle (e.g. both can be obedient). However, the converse is not necessarily true: a dog can be straight-haired but a poodle cannot. Therefore, dog tends to have a broader context set than poodle. Many asymmetric scoring functions comparing SBOW features based on DIH have been developed for hypernymy detection (Weeds and Weir, 2003; Geffet and Dagan, 2005; Shwartz et al., 2017).
Hypernymy detection plays a key role in many challenging NLP tasks, such as textual entailment (Sammons et al., 2011), coreference (Ponzetto and Strube, 2006), relation extraction (Demeester et al., 2016) and question answering (Huang et al., 2008). Leveraging the variety of contexts and inclusion properties in context distributions can greatly increase the ability to discover taxonomic structure among words (Shwartz et al., 2017). The inability to preserve these features limits the semantic representation power and downstream applicability of some popular unsupervised learning approaches such as Word2Vec.
Recent studies (Levy et al., 2015b; Shwartz et al., 2017) have underscored the difficulty of generalizing supervised hypernymy annotations to unseen pairs: classifiers often effectively memorize prototypical hypernyms ('general' words) and ignore relations between words. These findings motivate us to develop more accurate and scalable unsupervised embeddings to detect hypernymy and propose several scoring functions to analyze the embeddings from different perspectives.

Contributions
• A novel unsupervised low-dimensional embedding method that performs non-negative matrix factorization (NMF) on a weighted PMI matrix and can be efficiently optimized using modified skip-grams.
• Theoretical and qualitative analyses illustrating that the proposed embedding can intuitively and interpretably preserve inclusion relations among word contexts.
• Extensive experiments on 11 hypernym detection datasets demonstrate that the learned embeddings dominate previous low-dimensional unsupervised embedding approaches, achieving similar or better performance than SBOW, on both existing and newly proposed asymmetric scoring functions, while requiring much less memory and compute.

Method
The distributional inclusion hypothesis (DIH) suggests that the context set of a hypernym tends to contain the context set of its hyponyms. When representing a word as the counts of contextual co-occurrences, the count in every dimension of hypernym y tends to be larger than or equal to the corresponding count of its hyponym x:
$$x \preceq y \iff \#(x, c) \le \#(y, c), \ \forall c \in V, \qquad (1)$$
where x ⪯ y means y is a hypernym of x, V is the vocabulary, and #(x, c) indicates the number of times that word x and its context word c co-occur in a small window of size |W| in the corpus of interest D. Notice that the concept of DIH could be applied to different context word representations. For example, Geffet and Dagan (2005) represent each word by the set of its co-occurring context words while discarding their counts. In this study, we define the inclusion property based on counts of context words in Equation (1) because the counts are an effective and noise-robust feature for hypernymy detection using only the context distribution of words (Clarke, 2009; Vulić et al., 2016; Shwartz et al., 2017).
Our goal is to produce lower-dimensional embeddings preserving the inclusion property that the embedding of hypernym y is larger than or equal to the embedding of its hyponym x in every dimension. Formally, the desired property can be written as
$$x \preceq y \iff \mathbf{x}[i] \le \mathbf{y}[i], \ \forall i \in \{1, \ldots, L\}, \qquad (2)$$
where L is the number of dimensions in the embedding space. We add non-negativity constraints, i.e. x[i] ≥ 0, y[i] ≥ 0, ∀i, in order to increase the interpretability of the embeddings (the reason will be explained later in this section). This is a challenging task. In reality, a lot of noise and systematic biases cause violations of DIH in Equation (1) (i.e. #(x, c) > #(y, c) for some neighboring word c), but the general trend can be discovered by processing thousands of neighboring words in SBOW together (Shwartz et al., 2017). After the compression, the same trend has to be estimated in a much smaller embedding space which discards most of the information in SBOW, so it is not surprising that most unsupervised hypernymy detection studies focus on SBOW (Shwartz et al., 2017) and that existing unsupervised embedding methods such as Gaussian embedding have degraded accuracy (Vulić et al., 2016).
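As a minimal illustration of the inclusion property, the following Python sketch checks whether one non-negative vector is dominated by another in every dimension. The function name `satisfies_inclusion` and the toy count vectors are hypothetical and are not taken from our experiments.

```python
from typing import Sequence


def satisfies_inclusion(hypo: Sequence[float], hyper: Sequence[float]) -> bool:
    """Return True if 0 <= hypo[i] <= hyper[i] holds in every dimension."""
    if len(hypo) != len(hyper):
        raise ValueError("vectors must have the same length")
    return all(0 <= x_i <= y_i for x_i, y_i in zip(hypo, hyper))


# Hypothetical co-occurrence counts over four context words.
poodle = [3, 0, 1, 0]
dog = [5, 2, 1, 4]

print(satisfies_inclusion(poodle, dog))  # dog dominates poodle
print(satisfies_inclusion(dog, poodle))  # the converse does not hold
```

The same check applies unchanged to raw count vectors and to non-negative embeddings, since both properties compare vectors dimension by dimension.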

Inclusion Preserving Matrix Factorization
Popular methods of unsupervised word embedding are usually based on matrix factorization (Levy et al., 2015a). These approaches first compute a co-occurrence statistic between the wth word and the cth context word as the (w, c)th element of the matrix M[w, c]. Next, the matrix M is factorized such that M[w, c] ≈ w^T c, where w is the low-dimensional embedding of the wth word and c is the cth context embedding.
The statistic in M[w, c] is usually related to pointwise mutual information (Levy et al., 2015a):
$$\mathrm{PMI}(w, c) = \log \frac{P(w, c)}{P(w) \cdot P(c)}.$$
Intuitively, since M[w, c] ≈ w^T c, larger embedding values of w at every dimension seem to imply larger w^T c, larger M[w, c], larger PMI(w, c), and thus larger co-occurrence count #(w, c). However, the derivation has two flaws: (1) c could contain negative values and (2) lower #(w, c) could still lead to larger PMI(w, c) as long as #(w) is small enough.
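To preserve DIH, we propose a novel word embedding method, distributional inclusion vector embedding (DIVE), which fixes the two flaws by performing non-negative matrix factorization (NMF) (Lee and Seung, 2001) on the matrix M, where
$$M[w, c] = \log\left(\frac{P(w, c)}{P(w) \cdot P(c)} \cdot \frac{\#(w)}{k_I \cdot Z}\right), \qquad (3)$$
where k_I is a constant which shifts the PMI value like SGNS, Z = |D|/|V| is the average word frequency, and |V| is the vocabulary size. We call the weighting term #(w)/Z the inclusion shift. After applying the non-negativity constraint and inclusion shift, the inclusion property in DIVE (i.e. Equation (2)) implies that Equation (1) (DIH) holds if the matrix is reconstructed perfectly. The derivation is simple: if the embedding of hypernym y is greater than or equal to the embedding of its hyponym x in every dimension (x[i] ≤ y[i], ∀i), then x^T c ≤ y^T c because the context embedding c is non-negative. Since w^T c = M[w, c] under perfect reconstruction, we get M[x, c] ≤ M[y, c]. With P(w, c) = #(w, c)/|D|, P(w) = #(w)/|D|, and P(c) = #(c)/|D|, Equation (3) simplifies to M[w, c] = log(#(w, c)·|V| / (#(c)·k_I)), which leads to #(x, c) ≤ #(y, c) because the logarithm is monotonic and only #(w, c) changes with w.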
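The derivation can be checked numerically on a toy co-occurrence matrix. The sketch below assumes P(w, c) = #(w, c)/|D|, P(w) = #(w)/|D|, and P(c) = #(c)/|D|, with |D| the total number of co-occurrence pairs. The function name `inclusion_shifted_pmi`, the toy counts, and the choice k_I = 1 are illustrative assumptions, not the settings used in our experiments.

```python
import math


def inclusion_shifted_pmi(counts, k_i=1.0):
    """Return M[w][c] = log(PMI ratio * #(w) / (k_i * Z)) for a square count matrix."""
    n_words = len(counts)                            # |V|
    total = sum(sum(row) for row in counts)          # |D|
    z = total / n_words                              # Z = |D| / |V|
    row_sums = [sum(row) for row in counts]          # #(w)
    col_sums = [sum(col) for col in zip(*counts)]    # #(c)
    matrix = []
    for w, row in enumerate(counts):
        shifted_row = []
        for c, n_wc in enumerate(row):
            if n_wc == 0:
                shifted_row.append(float("-inf"))    # log(0): the pair never co-occurs
                continue
            pmi = math.log(n_wc * total / (row_sums[w] * col_sums[c]))
            shifted_row.append(pmi + math.log(row_sums[w] / (k_i * z)))
        matrix.append(shifted_row)
    return matrix


# Hypothetical counts: row 0 (hyponym) is dominated by row 1 (hypernym).
counts = [[4, 2, 1],
          [8, 3, 3],
          [1, 1, 6]]
M = inclusion_shifted_pmi(counts)
col_sums = [sum(col) for col in zip(*counts)]

# Simplified form: M[w][c] = log(#(w, c) * |V| / (#(c) * k_I)) with k_I = 1.
closed_form = [[math.log(n * len(counts) / col_sums[c]) for c, n in enumerate(row)]
               for row in counts]
max_gap = max(abs(M[w][c] - closed_form[w][c]) for w in range(3) for c in range(3))
agrees = max_gap < 1e-9                                   # both forms coincide
inclusion_kept = all(a <= b for a, b in zip(M[0], M[1]))  # counts order carries over to M
print(agrees, inclusion_kept)
```

Because #(w) cancels in the simplified form, each column of M depends on w only through #(w, c), which is why the ordering of counts is preserved.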

Optimization
Due to its appealing scalability properties during training time (Levy et al., 2015a), we optimize our embedding based on the skip-gram with negative sampling (SGNS) (Mikolov et al., 2013). The objective function of SGNS is
$$l_{SGNS} = \sum_{w \in V} \sum_{c \in V} \#(w, c) \log \sigma(\mathbf{w}^T \mathbf{c}) + \sum_{w \in V} k \sum_{c \in V} \#(w, c) \, \mathbb{E}_{c_N \sim P_D} [\log \sigma(-\mathbf{w}^T \mathbf{c}_N)], \qquad (4)$$
where w ∈ R^L, c ∈ R^L, c_N ∈ R^L, σ is the logistic sigmoid function, and k is a constant hyper-parameter indicating the ratio between positive and negative samples. Levy and Goldberg (2014) demonstrate that SGNS is equivalent to factorizing a shifted PMI matrix whose (w, c)th element is PMI(w, c) − log(k). By setting k = k_I·Z/#(w) and applying non-negativity constraints to the embeddings, DIVE can be optimized using a similar objective function:
$$l_{DIVE} = \sum_{w \in V} \sum_{c \in V} \#(w, c) \log \sigma(\mathbf{w}^T \mathbf{c}) + k_I \sum_{w \in V} \frac{Z}{\#(w)} \sum_{c \in V} \#(w, c) \, \mathbb{E}_{c_N \sim P_D} [\log \sigma(-\mathbf{w}^T \mathbf{c}_N)], \qquad (5)$$
where w ≥ 0, c ≥ 0, c_N ≥ 0, and k_I is a constant hyper-parameter. P_D is the distribution of negative samples, which we set to be the corpus word frequency distribution (not reducing the probability of drawing frequent words like SGNS) in this paper. Equation (5) is optimized by ADAM (Kingma and Ba, 2015), a variant of stochastic gradient descent (SGD). The non-negativity constraint is implemented by projection (Polyak, 1969) (i.e. clipping any embedding which crosses the zero boundary after an update).
The optimization process provides an alternative angle to explain how DIVE preserves DIH. Using Σ_c #(w, c) = #(w) and P_D(c_N) = #(c_N)/|D|, the gradient of the objective in Equation (5) with respect to the word embedding w is
$$\frac{\partial l_{DIVE}}{\partial \mathbf{w}} = \sum_{c \in V} \#(w, c) \, (1 - \sigma(\mathbf{w}^T \mathbf{c})) \, \mathbf{c} - k_I \sum_{c_N \in V} \frac{\#(c_N)}{|V|} \, \sigma(\mathbf{w}^T \mathbf{c}_N) \, \mathbf{c}_N.$$
Assume hyponym x and hypernym y satisfy DIH in Equation (1) and the embeddings x and y are the same at some point during the gradient ascent. At this point, the gradients coming from negative sampling (the second term) decrease the embedding values of x and y by the same amount. However, the embedding of hypernym y gets positive gradients from the first term that are higher than or equal to those of x in every dimension because #(x, c) ≤ #(y, c). This means Equation (1) tends to imply Equation (2) because the hypernym has larger gradients everywhere in the embedding space. Combined with the analysis from the matrix factorization viewpoint, DIH in Equation (1) is approximately equivalent to the inclusion property in DIVE (i.e. Equation (2)).
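The gradient argument can be illustrated with a small sketch. It assumes the gradient has the form Σ_c #(w, c)(1 − σ(w^T c)) c − k_I Σ_c (#(c)/|V|) σ(w^T c) c, and it swaps in a plain projected gradient ascent step instead of ADAM to keep the example short. All names and toy values (`dive_word_gradient`, `projected_step`, the counts, the learning rate, k_I = 1) are hypothetical.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dive_word_gradient(w_vec, w_counts, context_vecs, context_freq, k_i):
    """Gradient for one word embedding.

    w_counts[c] is #(w, c), context_freq[c] is #(c), and
    context_vecs[c] is the (non-negative) embedding of context c.
    """
    n_vocab = len(context_vecs)
    grad = [0.0] * len(w_vec)
    for c, c_vec in enumerate(context_vecs):
        s = sigmoid(dot(w_vec, c_vec))
        # Positive-sample pull minus negative-sample push for this context.
        coeff = w_counts[c] * (1.0 - s) - k_i * context_freq[c] / n_vocab * s
        for i, c_i in enumerate(c_vec):
            grad[i] += coeff * c_i
    return grad


def projected_step(w_vec, grad, lr):
    """One gradient ascent step followed by clipping at zero (projection)."""
    return [max(0.0, w_i + lr * g_i) for w_i, g_i in zip(w_vec, grad)]


# Hypothetical setup: x (hyponym) and y (hypernym) start from the same embedding.
context_vecs = [[0.5, 0.1], [0.2, 0.4], [0.0, 0.3]]
context_freq = [13, 6, 10]
x_counts, y_counts = [4, 2, 1], [8, 3, 3]   # #(x, c) <= #(y, c) for every c
start = [0.1, 0.1]

x_new = projected_step(start, dive_word_gradient(start, x_counts, context_vecs, context_freq, 1.0), 0.01)
y_new = projected_step(start, dive_word_gradient(start, y_counts, context_vecs, context_freq, 1.0), 0.01)
print(all(a <= b for a, b in zip(x_new, y_new)))  # y stays at or above x in every dimension
```

The negative-sampling part of `coeff` is identical for x and y here, so the only difference between the two updates comes from the co-occurrence counts.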

PMI Filtering
For a frequent target word, there must be many neighboring words that incidentally appear near the target word without being semantically meaningful, especially when a large context window size is used. The unrelated context words cause noise in both the word vector and the context vector of DIVE. We address this issue by filtering out context words c for each target word w when the PMI of the co-occurring words is too small (i.e. log(P(w, c) / (P(w)·P(c))) < log(k_f)). That is, we set #(w, c) = 0 in the objective function. This preprocessing step is similar to computing PPMI in SBOW (Bullinaria and Levy, 2007), where low PMI co-occurrences are removed from SBOW.
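A minimal sketch of this filtering step is shown below. It assumes that the probabilities are estimated from the unfiltered counts as P(w, c) = #(w, c)/|D|, P(w) = #(w)/|D|, and P(c) = #(c)/|D|. The function name `pmi_filter` and the toy counts are hypothetical.

```python
import math


def pmi_filter(counts, k_f=1.0):
    """Set #(w, c) to 0 whenever PMI(w, c) < log(k_f)."""
    total = sum(sum(row) for row in counts)          # |D|
    row_sums = [sum(row) for row in counts]          # #(w)
    col_sums = [sum(col) for col in zip(*counts)]    # #(c)
    threshold = math.log(k_f)
    filtered = []
    for w, row in enumerate(counts):
        new_row = []
        for c, n_wc in enumerate(row):
            keep = (n_wc > 0 and
                    math.log(n_wc * total / (row_sums[w] * col_sums[c])) >= threshold)
            new_row.append(n_wc if keep else 0)
        filtered.append(new_row)
    return filtered


# Hypothetical co-occurrence counts (rows: target words, columns: context words).
counts = [[4, 2, 1],
          [8, 3, 3],
          [1, 1, 6]]
filtered = pmi_filter(counts, k_f=1.0)
print(filtered)
```

With k_f = 1 the threshold is log(1) = 0, so only pairs that co-occur at least as often as expected under independence are kept.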

Interpretability
After applying the non-negativity constraint, we observe that each latent factor in the embedding is interpretable as previous findings suggest (Pauca et al., 2004;Murphy et al., 2012) (i.e. each dimension roughly corresponds to a topic). Furthermore, DIH suggests that a general word appears in more diverse contexts/topics. By preserving DIH using inclusion shift, the embedding of a general word (i.e. hypernym of many other words) tends to have larger values in these dimensions (topics). This gives rise to a natural and intuitive interpretation of our word embeddings: the word embeddings can be seen as unnormalized probability distributions over topics. In Figure 1, we visualize the unnormalized topical distribution of two words, rodent and mammal, as an example. Since rodent is a kind of mammal, the embedding (i.e. unnormalized topical distribution) of mammal includes the embedding of rodent when DIH holds. More examples are illustrated in our supplementary materials.

Unsupervised Embedding Comparison
In this section, we compare DIVE with other unsupervised hypernym detection methods. In this paper, unsupervised approaches refer to the methods that only train on plaintext corpus without using any hypernymy or lexicon annotation.

Experiment Setup
The embeddings are tested on 11 datasets. The first 4 datasets come from the recent review of Shwartz et al. (2017) 1: BLESS (Baroni and Lenci, 2011), EVALution (Santus et al., 2015), Lenci/Benotto (Benotto, 2015), and Weeds (Weeds et al., 2014). The next 4 datasets are downloaded from the code repository of the H-feature detector (Roller and Erk, 2016) (Kotlerman et al., 2010). In addition, the performance on the test set of HypeNet (Shwartz et al., 2016) (using the random train/test split), the test set of WordNet (Vendrov et al., 2016), and all pairs in HyperLex (Vulić et al., 2016) are also evaluated. The F1 and accuracy measurements are sometimes very similar even though the quality of prediction varies, so we adopted average precision, AP@all (Zhu, 2004) (equivalent to the area under the precision-recall curve when the constant interpolation is used), as the main evaluation metric. The HyperLex dataset has a continuous score on each candidate word pair, so we adopt the Spearman rank coefficient ρ (Fieller et al., 1957) as suggested by the review study of Vulić et al. (2016). Any OOV (out-of-vocabulary) word encountered in the testing data is pushed to the bottom of the prediction list (effectively assuming the word pair does not have a hypernym relation).
1 https://github.com/vered1986/UnsupervisedHypernymy
2 https://github.com/stephenroller/emnlp2016/
We trained all methods on the first 51.2 million tokens of the WaCkypedia corpus (Baroni et al., 2009) because DIH holds more often in this subset (i.e. SBOW works better) compared with the whole WaCkypedia corpus. The window size |W| of DIVE and Gaussian embedding is set to 20 (left 10 words and right 10 words). The number of embedding dimensions L in DIVE is set to 100. The other hyper-parameters of DIVE and Gaussian embedding are determined by the training set of HypeNet. Other experimental details are described in our supplementary materials.
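The evaluation protocol above can be sketched as follows, assuming the usual definition of average precision (the mean of the precision values measured at the rank of each true hypernym pair). Representing an OOV pair by a score of `None` is a convention chosen for this example only, and the function name and toy data are hypothetical.

```python
def average_precision(scores, labels):
    """AP over the full ranked list; pairs with score None (OOV) rank last."""
    ranked = sorted(
        zip(scores, labels),
        key=lambda pair: float("-inf") if pair[0] is None else pair[0],
        reverse=True,
    )
    hits = 0
    precision_sum = 0.0
    for rank, (_, is_hypernym) in enumerate(ranked, start=1):
        if is_hypernym:
            hits += 1
            precision_sum += hits / rank   # precision at this positive's rank
    return precision_sum / hits if hits else 0.0


# Hypothetical predictions: the second pair contains an OOV word.
scores = [0.9, None, 0.4, 0.7]
labels = [True, True, False, True]
ap = average_precision(scores, labels)
print(round(ap, 4))
```

In this toy case the OOV pair is a true hypernym pair, so pushing it to the bottom lowers AP, which mirrors the conservative treatment described above.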

Results
If a pair of words has a hypernym relation, the words tend to be similar (sharing some context words) and the hypernym should be more general than the hyponym. Section 2.4 has shown that the embedding could be viewed as an unnormalized topic distribution of its context, so the embedding of a hypernym should be similar to the embedding of its hyponym but have a larger magnitude. As in HyperVec (Nguyen et al., 2017), we score the hypernym candidates by multiplying two factors corresponding to these properties. The C·∆S (i.e. the cosine similarity multiplied by the difference of summation) scoring function is defined as
$$C \cdot \Delta S(\mathbf{w}_q \to \mathbf{w}_p) = \frac{\mathbf{w}_q^T \mathbf{w}_p}{\|\mathbf{w}_q\|_2 \cdot \|\mathbf{w}_p\|_2} \cdot (\|\mathbf{w}_p\|_1 - \|\mathbf{w}_q\|_1),$$
where w_p is the embedding of the hypernym and w_q is the embedding of the hyponym. As far as we know, Gaussian embedding (GE) (Vilnis and McCallum, 2015) is the state-of-the-art unsupervised embedding method which can capture the asymmetric relations between a hypernym and its hyponyms. Gaussian embedding encodes the context distribution of each word as a multivariate Gaussian distribution, where the embeddings of hypernyms tend to have higher variance and overlap with the embedding of their hyponyms. In Table 1, we compare DIVE with Gaussian embedding 3 using the code implemented by Athiwaratkun and Wilson (2017) 4 and with word cosine similarity using skip-grams. The performance of random scores is also presented for reference. As we can see, DIVE is usually significantly better than the other unsupervised embeddings.
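A minimal sketch of the C·∆S score, assuming non-negative vectors so that the summation of an embedding equals its L1 norm. The function name `c_delta_s` and the toy embeddings are hypothetical and do not come from the trained model.

```python
import math


def c_delta_s(hypo, hyper):
    """Cosine similarity times (sum of hypernym vector - sum of hyponym vector)."""
    dot = sum(a * b for a, b in zip(hypo, hyper))
    norm_q = math.sqrt(sum(a * a for a in hypo))
    norm_p = math.sqrt(sum(b * b for b in hyper))
    if norm_q == 0.0 or norm_p == 0.0:
        return 0.0                      # no similarity signal for an all-zero vector
    cosine = dot / (norm_q * norm_p)
    return cosine * (sum(hyper) - sum(hypo))


# Hypothetical 3-dimensional embeddings.
rodent = [1.0, 0.0, 2.0]
mammal = [2.0, 1.0, 3.0]

forward = c_delta_s(rodent, mammal)    # candidate: mammal is a hypernym of rodent
backward = c_delta_s(mammal, rodent)   # candidate: rodent is a hypernym of mammal
print(forward > 0 > backward)
```

The cosine factor is symmetric, so the direction of the predicted relation is decided entirely by the sign of the summation difference.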

SBOW Comparison
Unlike Word2Vec, which only tries to preserve the similarity signal, DIVE aims to preserve the ability to measure not only similarity but also whether one context distribution includes the other (inclusion signal) or is more general than the other (generality signal).
In this experiment, we perform a comprehensive comparison between SBOW and DIVE using multiple scoring functions to detect the hypernym relation between words based on different types of signal. The window size |W | of SBOW is also set as 20, and experiment setups are the same as that described in Section 3.1. Notice that the comparison is inherently unfair because most of the information would be lost during the aggressive compression process of DIVE, and we would like to evaluate how well DIVE can preserve signals of interest using the number of dimensions which is several orders of magnitude less than that of SBOW.

Unsupervised Scoring Functions
3 Note that higher AP is reported for some models in previous literature: 80 (Vilnis and McCallum, 2015) in LEDS, 74.2 (Athiwaratkun and Wilson, 2017) in LEDS, and 20.6 (Vulić et al., 2016)
After trying many existing and newly proposed functions which score a pair of words to detect hypernym relation between them, we find that good scoring functions for SBOW are also good scoring functions for DIVE. Thus, in addition to C·∆S used in Section 3.2, we also present 4 other best performing or representative scoring functions in the experiment (see our supplementary materials for more details):
• CDE measures the degree of violation of Equation (1). Equation (1) holds if and only if CDE is 1. Due to noise in SBOW, CDE is rarely exactly 1, but hypernym pairs usually have higher CDE. Despite its effectiveness, the good performance could mostly come from the magnitude of embeddings/features instead of inclusion properties among context distributions. To measure the inclusion properties between context distributions d_p and d_q (w_p and w_q after normalization, respectively), we use negative asymmetric L1 distance (−AL_1) 5 as one of our scoring functions; its definition involves a constant hyper-parameter w_0.
• Similarity plus generality: Computing cosine similarity on skip-grams (i.e. Word2Vec + C in Table 1) is a popular way to measure the similarity of two words, so we multiply the Word2Vec similarity with summation difference of DIVE or SBOW (W·∆S) as an alternative of C·∆S.

Baselines
• SBOW Freq: A word is represented by the frequency of its neighboring words. Applying the PMI filter (setting a context feature to 0 if its value is lower than log(k_f)) to SBOW Freq only makes its performance closer to (but still much worse than) SBOW PPMI, so we omit the baseline.
• SBOW PPMI: SBOW which uses PPMI of its neighboring words as the features (Bullinaria and Levy, 2007). Applying the PMI filter to SBOW PPMI usually makes the performance worse, especially when k_f is large. Similarly, a constant log(k) shift applied to SBOW PPMI (i.e. max(PMI − log(k), 0)) is not helpful, so we set both k_f and k to be 1.
• SBOW all wiki: SBOW using PPMI features trained on the whole WaCkypedia.
• DIVE without the PMI filter (DIVE w/o PMI)
• NMF on shifted PMI: Non-negative matrix factorization (NMF) on the shifted PMI without inclusion shift for DIVE (DIVE w/o IS). This is the same as applying the non-negative constraint on the skip-gram model.
• K-means (Freq NMF): The method first uses Mini-batch k-means (Sculley, 2010) to cluster words in skip-gram embedding space into 100 topics, and hashes each frequency count in SBOW into the corresponding topic. If running k-means on skip-grams is viewed as an approximation of clustering the SBOW context vectors, the method can be viewed as a kind of NMF (Ding et al., 2005).
Table 2: AP@all (%) of 10 datasets. The box at lower right corner compares the micro average AP across all 10 datasets. Numbers in different rows come from different feature or embedding spaces. Numbers in different columns come from different datasets and unsupervised scoring functions. We also present the micro average AP across the first 4 datasets (BLESS, EVALution, Lenci/Benotto and Weeds), which are used as a benchmark for unsupervised hypernym detection (Shwartz et al., 2017). IS refers to inclusion shift on the shifted PMI matrix.
DIVE performs non-negative matrix factorization on PMI matrix after applying inclusion shift and PMI filtering. To demonstrate the effectiveness of each step, we show the performances of DIVE after removing PMI filtering (DIVE w/o PMI), removing inclusion shift (DIVE w/o IS), and removing matrix factorization (SBOW PPMI w/ IS, SBOW PPMI, and SBOW all wiki). The methods based on frequency matrix are also tested (SBOW Freq and Freq NMF).

Results and Discussions
In Table 2, we first confirm the finding of the previous review study of Shwartz et al. (2017): there is no single hypernymy scoring function which always outperforms others. One of the main reasons is that different datasets collect negative samples differently. For example, if negative samples come from random word pairs (e.g. WordNet dataset), a symmetric similarity measure is a good scoring function. On the other hand, negative samples come from related or similar words in HypeNet, EVALution, Lenci/Benotto, and Weeds, so only estimating generality difference leads to the best (or close to the best) performance. The negative samples in many datasets are composed of both random samples and similar words (such as BLESS), so the combination of similarity and generality difference yields the most stable results.
DIVE performs similarly to or better than SBOW on most of the scoring functions consistently across all datasets in Table 2 and Table 3, while using many fewer dimensions (see Table 4). This leads to savings of 2-3 orders of magnitude in both memory consumption and testing time. Furthermore, the low-dimensional embedding makes the computational complexity independent of the vocabulary size, which drastically boosts the scalability of unsupervised hypernym detection, especially with the help of GPUs. It is surprising that we can achieve such aggressive compression while preserving the similarity, generality, and inclusion signals in various datasets with different types of negative samples. Its results on C·∆S and W·∆S outperform SBOW Freq. Meanwhile, its results on AL_1 outperform SBOW PPMI. The fact that W·∆S or C·∆S usually outperforms generality functions suggests that only memorizing general words is not sufficient. The best average performances on 4 and 10 datasets are both produced by W·∆S on DIVE.
SBOW PPMI improves the W·∆S and C·∆S from SBOW Freq but sacrifices AP on the inclusion functions. It generally hurts performance to directly include inclusion shift in PPMI (PPMI w/ IS) or compute SBOW PPMI on the whole WaCkypedia (all wiki) instead of the first 51.2 million tokens. A similar trend can also be seen in Table 3. Note that AL_1 completely fails in the HyperLex dataset using SBOW PPMI, which suggests that PPMI might not necessarily preserve the distributional inclusion property, even though it can have good performance on scoring functions combining similarity and generality signals.
Removing the PMI filter from DIVE slightly lowers the overall precision, while removing inclusion shift on shifted PMI (w/o IS) leads to poor performance. K-means (Freq NMF) produces similar AP compared with SBOW Freq but has worse AL_1 scores. Its best AP scores on different datasets are also significantly worse than the best AP of DIVE. This means that only making Word2Vec (skip-grams) non-negative or naively accumulating topic distribution in contexts cannot lead to satisfactory embeddings.

Related Work
Most previous unsupervised approaches focus on designing better hypernymy scoring functions for sparse bag of word (SBOW) features. They are well summarized in the recent study (Shwartz et al., 2017). Shwartz et al. (2017) also evaluate the influence of different contexts, such as changing the window size of contexts or incorporating dependency parsing information, but neglect scalability issues inherent to SBOW methods.
A notable exception is the Gaussian embedding model (Vilnis and McCallum, 2015), which represents each word as a Gaussian distribution. However, since a Gaussian distribution is normalized, it is difficult to retain frequency information during the embedding process, and experiments on HyperLex (Vulić et al., 2016) demonstrate that a simple baseline only relying on word frequency can achieve good results. Follow-up work models contexts by a mixture of Gaussians (Athiwaratkun and Wilson, 2017), relaxing the unimodality assumption, but achieves little improvement on hypernym detection tasks. Kiela et al. (2015) show that images retrieved by a search engine can be a useful source of information to determine the generality of lexicons, but the resources (e.g. pre-trained image classifier for the words of interest) might not be available in many domains.
Order embedding (Vendrov et al., 2016) is a supervised approach to encode many annotated hypernym pairs (e.g. all of the whole WordNet (Miller, 1995)) into a compact embedding space, where the embedding of a hypernym should be smaller than the embedding of its hyponym in every dimension. Our method learns embeddings from raw text, where a hypernym embedding should be larger than the embedding of its hyponym in every dimension. Thus, DIVE can be viewed as an unsupervised and reversed form of order embedding.
Non-negative matrix factorization (NMF) has a long history in NLP, for example in the construction of topic models (Pauca et al., 2004). Non-negative sparse embedding (NNSE) (Murphy et al., 2012) and Faruqui et al. (2015) indicate that non-negativity can make embeddings more interpretable and improve word similarity evaluations. The sparse NMF is also shown to be effective in cross-lingual lexical entailment tasks but does not necessarily improve monolingual hypernymy detection (Vyas and Carpuat, 2016). In our study, we show that performing NMF on PMI matrix with inclusion shift can preserve DIH in SBOW, and the comprehensive experimental analysis demonstrates its state-of-the-art performances on unsupervised hypernymy detection.

Conclusions
Although large SBOW vectors consistently show the best all-around performance in unsupervised hypernym detection, it is challenging to compress them into a compact representation which preserves inclusion, generality, and similarity signals for this task. Our experiments suggest that the existing approaches and simple baselines such as Gaussian embedding, accumulating K-means clusters, and non-negative skip-grams do not lead to satisfactory performance.
To achieve this goal, we propose an interpretable and scalable embedding method called distributional inclusion vector embedding (DIVE) by performing non-negative matrix factorization (NMF) on a weighted PMI matrix. We demonstrate that scoring functions which measure inclusion and generality properties in SBOW can also be applied to DIVE to detect hypernymy, and DIVE performs the best on average, slightly better than SBOW while using many fewer dimensions.
Our experiments also indicate that unsupervised scoring functions which combine similarity and generality measurements work the best in general, but no one scoring function dominates across all datasets. A combination of unsupervised DIVE with the proposed scoring functions produces new state-of-the-art performances on many datasets in the unsupervised regime.

Acknowledgement
This work was supported in part by the Center for Data Science and the Center for Intelligent Information Retrieval, in part by DARPA under agreement number FA8750-13-2-0020, in part by Defense Advanced Research Projects Agency (DARPA) contract number HR0011-15-2-0036, in part by the National Science Foundation (NSF) grant numbers DMR-1534431 and IIS-1514053 and in part by the Chan Zuckerberg Initiative under the project Scientific Knowledge Base Construction. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA, or the U.S. Government, or the other sponsors.