Negative Sampling Improves Hypernymy Extraction Based on Projection Learning

We present a new approach to extraction of hypernyms based on projection learning and word embeddings. In contrast to classification-based approaches, projection-based methods require no candidate hyponym-hypernym pairs. While it is natural to use both positive and negative training examples in supervised relation extraction, the impact of negative examples on hypernym prediction has not been studied so far. In this paper, we show that explicit negative examples used for regularization of the model significantly improve performance compared to the state-of-the-art approach of Fu et al. (2014) on three datasets from different languages.


Introduction
Hypernyms are useful in many natural language processing tasks ranging from construction of taxonomies (Snow et al., 2006;Panchenko et al., 2016a) to query expansion (Gong et al., 2005) and question answering (Zhou et al., 2013). Automatic extraction of hypernyms from text has been an active area of research since manually constructed high-quality resources featuring hypernyms, such as WordNet (Miller, 1995), are not available for many domain-language pairs. The drawback of pattern-based approaches to hypernymy extraction (Hearst, 1992) is their sparsity. Approaches that rely on the classification of pairs of word embeddings (Levy et al., 2015) aim to tackle this shortcoming, but they require candidate hyponym-hypernym pairs. We explore a hypernymy extraction approach that requires no candidate pairs. Instead, the method performs prediction of a hypernym embedding on the basis of a hyponym embedding.
The contribution of this paper is a novel approach for hypernymy extraction based on projection learning. Namely, we present an improved version of the model proposed by Fu et al. (2014), which makes use of both positive and negative training instances and enforces the asymmetry of the projection. The proposed model is generic and could be straightforwardly used in other relation extraction tasks where both positive and negative training samples are available. Finally, we are the first to successfully apply projection learning to hypernymy extraction in a morphologically rich language. An implementation of our approach and the pre-trained models are available online.

Related Work
Path-based methods for hypernymy extraction rely on sentences where both hyponym and hypernym co-occur in characteristic contexts, e.g., "such cars as Mercedes and Audi". Hearst (1992) proposed to use hand-crafted lexical-syntactic patterns to extract hypernyms from such contexts. Snow et al. (2004) introduced a method for learning patterns automatically based on a set of seed hyponym-hypernym pairs. Further examples of path-based approaches include (Tjong Kim Sang and Hofmann, 2009) and (Navigli and Velardi, 2010). The inherent limitation of the path-based methods leading to sparsity issues is that hyponym and hypernym have to co-occur in the same sentence.
Methods based on distributional vectors, such as those generated using the word2vec toolkit (Mikolov et al., 2013b), aim to overcome this sparsity issue as they require no hyponym-hypernym co-occurrence in a sentence. Such methods take representations of individual words as an input to predict relations between them.
Two branches of methods relying on distributional representations have emerged so far.
Methods based on word pair classification take an ordered pair of word embeddings (a candidate hyponym-hypernym pair) as an input and output a binary label indicating the presence of the hypernymy relation between the words. Typically, a binary classifier is trained on the concatenation or subtraction of the input embeddings, cf. (Roller et al., 2014). Further examples of such methods include (Lenci and Benotto, 2012; Weeds et al., 2014; Levy et al., 2015; Vylomova et al., 2016).
HypeNET (Shwartz et al., 2016) is a hybrid approach which is also based on a classifier, but in addition to two word embeddings a third vector is used. It represents path-based syntactic information encoded using an LSTM model (Hochreiter and Schmidhuber, 1997). Their results significantly outperform the ones from the previous path-based work of Snow et al. (2004).
An inherent limitation of classification-based approaches is that they require a list of candidate word pairs. While these are given in evaluation datasets such as BLESS (Baroni and Lenci, 2011), a corpus-wide classification of relations would need to classify all possible word pairs, which is computationally expensive for large vocabularies. Besides, Levy et al. (2015) discovered that such approaches tend toward lexical memorization, which hampers generalization.
Methods based on projection learning take one hyponym word vector as an input and output a word vector in a topological vicinity of hypernym word vectors. Scaled to the whole vocabulary, this amounts to only one such operation per word. Mikolov et al. (2013a) used projection learning for bilingual word translation. Vulić and Korhonen (2016) presented a systematic study of four classes of methods for learning bilingual embeddings, including those based on projection learning.
Fu et al. (2014) were the first to apply projection learning to hypernym extraction. Their approach is to learn an affine transformation of a hyponym into a hypernym word vector. Their model is trained with stochastic gradient descent. The k-means clustering algorithm is used to split the training relations into several groups. One transformation is learned for each group, which can account for the possibility that the projection of the relation depends on a subspace. This state-of-the-art approach serves as the baseline in our experiments.
Nayak (2015) performed evaluations of distributional hypernym extractors based on classification and projection methods (yet on different datasets, so these approaches are not directly comparable). The best performing projection-based architecture proposed in this experiment is a four-layered feed-forward neural network. No clustering of relations was used. The author used negative samples in the model by adding a regularization term in the loss function. However, drawing negative examples uniformly from the vocabulary turned out to hamper performance. In contrast, our approach shows significant improvements using manually created synonyms and hyponyms as negative samples.
Yamane et al. (2016) introduced several improvements of the model of Fu et al. (2014). Their model jointly learns projections and clusters by dynamically adding new clusters during training. They also used automatically generated negative instances via a regularization term in the loss function. In contrast to Nayak (2015), negative samples are selected not randomly, but among nearest neighbors of the predicted hypernym. Their approach compares favorably to (Fu et al., 2014), yet the contribution of the negative samples was not studied. Key differences of our approach from (Yamane et al., 2016) are (1) the use of explicit as opposed to automatically generated negative samples, and (2) the enforcement of asymmetry of the projection matrix via re-projection. While our experiments are based on the model of Fu et al. (2014), our regularizers can be straightforwardly integrated into the model of Yamane et al. (2016).

Hypernymy Extraction via Regularized Projection Learning

Baseline Approach
In our experiments, we use the model of Fu et al. (2014) as the baseline. In this approach, the projection matrix Φ* is obtained similarly to the linear regression problem, i.e., for the given row word vectors x and y representing the hyponym and the hypernym, respectively, the square matrix Φ* is fit on the training set of positive pairs P:
$$\Phi^* = \arg\min_{\Phi} \frac{1}{|P|} \sum_{(x, y) \in P} \lVert x\Phi - y \rVert^2,$$
where |P| is the number of training examples and ‖xΦ − y‖ is the distance between a pair of row vectors xΦ and y. In the original method, the L2 distance is used. To improve performance, k projection matrices Φ are learned, one for each cluster of relations in the training set. Each example is represented by a hyponym-hypernym offset. Clustering is performed using the k-means algorithm (MacQueen, 1967).
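As an illustration of the baseline objective, the following is a minimal NumPy sketch under stated assumptions. It solves the least-squares problem in closed form with `np.linalg.lstsq` rather than with the gradient-based training used in the paper, it takes the offset to be y − x, and the helper names (`fit_projection`, `kmeans`, `fit_clustered`) as well as the synthetic data are hypothetical.

```python
import numpy as np

def fit_projection(X, Y):
    """Least-squares Phi minimizing the mean of ||x Phi - y||^2 over row-vector pairs."""
    Phi, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return Phi  # shape (d, d)

def kmeans(points, k, n_iter=20, seed=0):
    """Minimal Lloyd's k-means; returns one cluster label per row of `points`."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):  # keep the old center if a cluster is empty
                centers[c] = points[labels == c].mean(axis=0)
    return labels

def fit_clustered(X, Y, k):
    """Cluster the offsets (assumed here to be y - x) and fit one Phi per cluster."""
    labels = kmeans(Y - X, k)
    matrices = {int(c): fit_projection(X[labels == c], Y[labels == c])
                for c in np.unique(labels)}
    return labels, matrices

# Synthetic demonstration: hypernym vectors are an exact linear map of hyponyms.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # hyponym row vectors
Phi_true = rng.normal(size=(5, 5))
Y = X @ Phi_true                       # hypernym row vectors
Phi_hat = fit_projection(X, Y)
labels, per_cluster = fit_clustered(X, Y, k=3)
```

On noise-free synthetic data the single-matrix fit recovers `Phi_true`; real embeddings would of course leave a non-zero residual, which is what motivates learning one matrix per cluster.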

Linguistic Constraints via Regularization
The nearest neighbors generated using distributional word vectors tend to contain a mixture of synonyms, hypernyms, co-hyponyms and other related words (Wandmacher, 2005; Heylen et al., 2008; Panchenko, 2011). To explicitly penalize such undesired predictions, we add a regularization term to the loss function of the baseline:
$$\Phi^* = \arg\min_{\Phi} \frac{1}{|P|} \sum_{(x, y) \in P} \lVert x\Phi - y \rVert^2 + \lambda R,$$
where λ is the constant controlling the importance of the regularization term R.
Asymmetric Regularization. As hypernymy is an asymmetric relation, our first method enforces the asymmetry of the projection matrix. Applying the same transformation to the predicted hypernym vector xΦ should not provide a vector similar (in terms of the dot product ·) to the initial hyponym vector x. Note that this regularizer requires only positive examples P:
$$R = \frac{1}{|P|} \sum_{(x, \_) \in P} (x\Phi\Phi \cdot x)^2.$$
Neighbor Regularization. This approach relies on negative sampling: it explicitly provides examples of words z semantically related to the hyponym x and penalizes the matrix for producing vectors similar to them:
$$R = \frac{1}{|N|} \sum_{(x, z) \in N} (x\Phi\Phi \cdot z)^2.$$
Note that this regularizer requires negative samples N. In our experiments, we use synonyms of hyponyms as N, but other types of relations can also be used, such as antonyms, meronyms or co-hyponyms. Certain words might have no synonyms in the training set. In such cases, we substitute z with x, gracefully reducing to the previous variation. Otherwise, on each training epoch, we sample a random synonym of the given word.
Regularizers without Re-Projection. In addition to the two regularizers described above, which rely on re-projection of the hyponym vector (xΦΦ), we also tested two regularizers without re-projection, denoted as xΦ. The neighbor regularizer in this variation is defined as follows:
$$R = \frac{1}{|N|} \sum_{(x, z) \in N} (x\Phi \cdot z)^2.$$
In our case, this regularizer penalizes relatedness of the predicted hypernym xΦ to the synonym z.
The asymmetric regularizer without re-projection is defined in a similar way.
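The regularizers described in this section can be sketched in NumPy as follows. This is an illustrative sketch, not the released implementation: it assumes that similarity is measured by a squared dot product averaged over the batch, and the function names, the `reproject` flag, and the toy vectors are hypothetical.

```python
import numpy as np

def neighbor_reg(Phi, X, Z, reproject=True):
    """Mean squared dot product between the (re-)projected hyponyms and negatives Z."""
    proj = X @ Phi @ Phi if reproject else X @ Phi
    return float(np.mean(np.sum(proj * Z, axis=1) ** 2))

def asymmetric_reg(Phi, X, reproject=True):
    """Same penalty with the hyponym itself as the negative sample (z = x)."""
    return neighbor_reg(Phi, X, X, reproject=reproject)

def regularized_loss(Phi, X, Y, reg_value, lam):
    """Baseline squared error plus lambda times the chosen regularizer value."""
    return float(np.mean(np.sum((X @ Phi - Y) ** 2, axis=1))) + lam * reg_value

def sample_negatives(words, synonyms, vectors, rng):
    """One random synonym vector per hyponym; fall back to the hyponym if it has none."""
    rows = []
    for w in words:
        candidates = synonyms.get(w, [])
        z = candidates[rng.integers(len(candidates))] if candidates else w
        rows.append(vectors[z])
    return np.stack(rows)

# Toy check: a 90-degree rotation maps x to an orthogonal vector in one step,
# but applying it twice gives -x, which the re-projected penalty detects.
X_demo = np.array([[1.0, 0.0], [0.0, 2.0]])
R90 = np.array([[0.0, 1.0], [-1.0, 0.0]])
one_step = asymmetric_reg(R90, X_demo, reproject=False)
two_step = asymmetric_reg(R90, X_demo, reproject=True)

vectors = {"cat": np.array([1.0, 0.0]), "kitty": np.array([0.0, 1.0]),
           "dog": np.array([1.0, 1.0])}
Z_demo = sample_negatives(["cat", "dog"], {"cat": ["kitty"]}, vectors,
                          np.random.default_rng(0))
```

Defining the asymmetric penalty through the neighbor penalty mirrors the fallback described above, where a word without synonyms uses z = x.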

Training of the Models
To learn parameters of the considered models we used the Adam method (Kingma and Ba, 2014) with the default meta-parameters as implemented in the TensorFlow framework (Abadi et al., 2016). We ran 700 training epochs passing a batch of 1024 examples to the optimizer. We initialized elements of each projection matrix using the normal distribution N(0, 0.1).
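For readers who want to see the training loop spelled out, here is a hedged NumPy sketch of Adam applied to the non-regularized baseline loss. Several details are assumptions rather than facts from the text: that 0.1 in N(0, 0.1) is the standard deviation, that each epoch iterates over the shuffled training set in mini-batches, and that the default meta-parameters are a learning rate of 0.001, β1 = 0.9, β2 = 0.999 and ε = 1e-8. The function names and the synthetic data are hypothetical, and the original work used TensorFlow rather than hand-written updates.

```python
import numpy as np

def mse(Phi, X, Y):
    """Mean squared L2 distance between projected hyponyms and hypernyms."""
    return float(np.mean(np.sum((X @ Phi - Y) ** 2, axis=1)))

def train_baseline(X, Y, epochs=700, batch_size=1024, lr=0.001,
                   beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    Phi = rng.normal(0.0, 0.1, size=(d, d))  # assumption: 0.1 is the std. deviation
    m = np.zeros_like(Phi)                   # first-moment estimate
    v = np.zeros_like(Phi)                   # second-moment estimate
    t = 0
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            xb, yb = X[idx], Y[idx]
            # Gradient of the batch mean of ||x Phi - y||^2 with respect to Phi.
            grad = 2.0 * xb.T @ (xb @ Phi - yb) / len(idx)
            t += 1
            m = beta1 * m + (1 - beta1) * grad
            v = beta2 * v + (1 - beta2) * grad ** 2
            m_hat = m / (1 - beta1 ** t)     # bias-corrected moments
            v_hat = v / (1 - beta2 ** t)
            Phi -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return Phi

# Synthetic demonstration with fewer epochs than in the paper.
rng_demo = np.random.default_rng(1)
X_train = rng_demo.normal(size=(256, 4))
Y_train = X_train @ rng_demo.normal(size=(4, 4))
Phi_init = train_baseline(X_train, Y_train, epochs=0)   # initialization only
Phi_fit = train_baseline(X_train, Y_train, epochs=200)
```

Adding either regularizer would only change the gradient computation; an automatic differentiation framework makes that step less error-prone than deriving it by hand.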

Evaluation Metrics
In order to assess the quality of the model, we adopted the hit@l measure proposed by Frome et al. (2013), which was originally used for image tagging. For each subsumption pair (x, y) composed of the hyponym x and the hypernym y in the test set P, we compute l nearest neighbors for the projected hypernym xΦ*. The pair is considered matched if the gold hypernym y appears in the computed list of the l nearest neighbors NN_l(xΦ*). To obtain the quality score, we average the matches in the test set P:
$$\mathrm{hit@}l = \frac{1}{|P|} \sum_{(x, y) \in P} \mathbb{1}\big(y \in \mathrm{NN}_l(x\Phi^*)\big),$$
where 1(·) is the indicator function. To consider also the rank of the correct answer, we compute the area under curve measure as the area under the l − 1 trapezoids:
$$\mathrm{AUC} = \frac{1}{2} \sum_{i=1}^{l-1} \big(\mathrm{hit@}(i) + \mathrm{hit@}(i + 1)\big).$$
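The two measures can be illustrated with a short pure-Python sketch. It assumes that the ranked nearest-neighbor lists have already been computed (the similarity used for ranking is left open here); the function names and the toy words are hypothetical.

```python
def hit_at(ranked, gold, l):
    """Fraction of test pairs whose gold hypernym is among the top-l neighbors."""
    return sum(g in r[:l] for r, g in zip(ranked, gold)) / len(gold)

def auc(ranked, gold, l):
    """Area under the hit@i curve for i = 1..l, summed over l - 1 unit-width trapezoids."""
    hits = [hit_at(ranked, gold, i) for i in range(1, l + 1)]
    return 0.5 * sum(hits[i] + hits[i + 1] for i in range(l - 1))

# Toy example: three test hyponyms with their ranked neighbor lists.
ranked = [["animal", "pet", "cat"],
          ["fruit", "food", "tree"],
          ["car", "road", "vehicle"]]
gold = ["animal", "food", "machine"]

h1 = hit_at(ranked, gold, 1)   # only the first pair matches at rank 1
h2 = hit_at(ranked, gold, 2)   # the second pair matches at rank 2
area = auc(ranked, gold, 3)
```

In this toy case hit@1 is 1/3, hit@2 and hit@3 are 2/3, and the area over the two trapezoids is 7/6; note that the measure is not normalized, so its maximum grows with l.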

Experiment 1: The Russian Language
Dataset. In this experiment, we use word embeddings published as a part of the Russian Distributional Thesaurus (Panchenko et al., 2016b) trained on a 12.9 billion token collection of Russian books. The embeddings were trained using the skip-gram model (Mikolov et al., 2013b) with 500 dimensions and a context window of 10 words. The dataset used in our experiments has been composed of two sources. We extracted synonyms and hypernyms from the Wiktionary using the Wikokit toolkit (Krizhanovsky and Smirnov, 2013). To enrich the lexical coverage of the dataset, we extracted additional hypernyms from the same corpus with Hearst patterns for Russian, using the PatternSim toolkit (Panchenko et al., 2012). To filter noisy extractions, we used only relations extracted more than 100 times.
Figure 1: Performance of our models with re-projection as compared to the baseline approach of (Fu et al., 2014) according to the hit@10 measure for Russian (left) and English (right) on the validation set.
As suggested by Levy et al. (2015), we split the train and test sets such that each contains a distinct vocabulary to avoid lexical overfitting. This results in 25 067 training, 8 192 validation, and 8 310 test examples. The validation and test sets contain hypernyms from Wiktionary, while the training set is composed of hypernyms and synonyms coming from both sources.
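One simple way to obtain such a vocabulary-disjoint split is sketched below. This is an assumption about the procedure, not a description of the one actually used: the vocabulary is partitioned at random and pairs that mix the two partitions are dropped. The function name, the `test_fraction` parameter, and the toy pairs are hypothetical.

```python
import random

def lexical_split(pairs, test_fraction=0.2, seed=0):
    """Split (hyponym, hypernym) pairs so that train and test share no words."""
    vocab = sorted({w for pair in pairs for w in pair})
    rng = random.Random(seed)
    rng.shuffle(vocab)
    test_vocab = set(vocab[:int(len(vocab) * test_fraction)])
    # Keep a pair only if both of its words fall on the same side of the split.
    train = [p for p in pairs if p[0] not in test_vocab and p[1] not in test_vocab]
    test = [p for p in pairs if p[0] in test_vocab and p[1] in test_vocab]
    return train, test

pairs = [("cat", "animal"), ("dog", "animal"), ("apple", "fruit"),
         ("pear", "fruit"), ("oak", "tree"), ("pine", "tree")]
train_pairs, test_pairs = lexical_split(pairs, test_fraction=0.4)
train_words = {w for p in train_pairs for w in p}
test_words = {w for p in test_pairs for w in p}
```

The cost of this scheme is that mixed pairs are discarded, which can remove many examples when a few frequent hypernyms appear in most pairs.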
Discussion of Results. Figure 1 (left) shows performance of the three projection learning setups on the validation set: the baseline approach, the asymmetric regularization approach, and the neighbor regularization approach. Both regularization strategies lead to consistent improvements over the non-regularized baseline of (Fu et al., 2014) across various numbers of clusters. The method reaches optimal performance for k = 20 clusters. Table 1 provides a detailed comparison of the performance metrics for this setting. Our approach based on the regularization using synonyms as negative samples outperforms the baseline (all differences between the baseline and our models are significant with respect to the t-test). According to all metrics except hit@1, for which the results are comparable to those of xΦ, the re-projection (xΦΦ) improves the results.

Experiment 2: The English Language
We performed the evaluation on two datasets.
EVALution Dataset. In this evaluation, word embeddings were trained on a 6.3 billion token text collection composed of Wikipedia, ukWaC (Ferraresi et al., 2008), Gigaword (Graff, 2003), and news corpora from the Leipzig Collection (Goldhahn et al., 2012). We used the skip-gram model with the context window size of 8 tokens and 300-dimensional vectors.
We use the EVALution dataset (Santus et al., 2015) for training and testing the model. It is composed of 1 449 hypernyms and 520 synonyms, where hypernyms are split into 944 training, 65 validation and 440 test pairs. Similarly to the first experiment, we extracted extra training hypernyms using the Hearst patterns, but in contrast to Russian, they did not improve the results significantly, so we left them out for English. A reason for this difference could be the more complex morphological system of Russian, where each word has more morphological variants compared to English. Therefore, extra training samples are needed for Russian (embeddings of Russian were trained on a non-lemmatized corpus).
Table 2: Performance of our approach for English without clustering (k = 1) and with the optimal number of clusters on the EVALution dataset (k = 30) and on the combined dataset (k = 25).
Combined Dataset. To show the robustness of our approach across configurations, we built a second dataset with more training instances, different embeddings, and both synonyms and co-hyponyms as negative samples. We used hypernyms, synonyms and co-hyponyms from the four commonly used datasets: EVALution, BLESS (Baroni and Lenci, 2011), ROOT09 (Santus et al., 2016) and K&H+N (Necsulescu et al., 2015). The obtained 14 528 relations were split into 9 959 training, 1 631 validation and 1 625 test hypernyms; 1 313 synonyms and co-hyponyms were used as negative samples. We used the standard 300-dimensional embeddings trained on the 100 billion token Google News corpus (Mikolov et al., 2013b).
Discussion of Results. Figure 1 (right) shows that, similarly to Russian, both regularization strategies lead to consistent improvements over the non-regularized baseline. Table 2 presents detailed results for both English datasets. Similarly to the first experiment, our approach improves results consistently across various configurations. As we change the number of clusters, types of embeddings, the size of the training data and type of relations used for negative sampling, results using our method stay superior to those of the baseline. The regularizers without re-projection (xΦ) obtain lower results in most configurations as compared to re-projected versions (xΦΦ). Overall, the neighbor regularization yields slightly better results in comparison to the asymmetric regularization. We attribute this to the fact that some synonyms z are close to the original hyponym x, while others can be distant. Thus, neighbor regularization is able to safeguard the model during training from more errors. This is also a likely reason why the performance of both regularizers is similar: the asymmetric regularization makes sure that a re-projected vector does not belong to a semantic neighborhood of the hyponym. Yet, this is exactly what neighbor regularization achieves. Note, however, that neighbor regularization requires explicit negative examples, while asymmetric regularization does not.

Conclusion
In this study, we presented a new model for extraction of hypernymy relations based on the projection of distributional word vectors. The model incorporates information about explicit negative training instances represented by relations of other types, such as synonyms and co-hyponyms, and enforces asymmetry of the projection operation. Our experiments in the context of the hypernymy prediction task for English and Russian languages show significant improvements of the proposed approach over the state-of-the-art model without negative sampling.