BiRRE: Learning Bidirectional Residual Relation Embeddings for Supervised Hypernymy Detection

The hypernymy detection task has been addressed under various frameworks. Previously, the design of unsupervised hypernymy scores has been extensively studied. In contrast, supervised classifiers, especially distributional models, leverage the global contexts of terms to make predictions, but are more likely to suffer from "lexical memorization". In this work, we revisit supervised distributional models for hypernymy detection. Rather than taking the embeddings of two terms as classification inputs, we introduce a representation learning framework named Bidirectional Residual Relation Embeddings (BiRRE). In this model, a term pair is represented by a BiRRE vector, used as features for hypernymy classification, which models the possibility of one term being mapped to the other in the embedding space by hypernymy relations. A Latent Projection Model with Negative Regularization (LPMNR) is proposed to simulate how hypernyms and hyponyms are generated by neural language models, and to generate BiRRE vectors based on bidirectional residuals of projections. Experiments verify that BiRRE outperforms strong baselines over various evaluation frameworks.


Introduction
As a type of linguistic resource, hypernymy relations refer to "is-a" relations between terms. Such relations are frequently exploited in a wide range of NLP tasks, including taxonomy induction (Mao et al., 2018), lexical entailment (Vulic et al., 2017) and Web query understanding.
In the NLP community, the task of hypernymy detection has been studied under various frameworks, e.g., unsupervised hypernym discovery (Roller et al., 2018; Chen et al., 2018; Chang et al., 2018), supervised hypernymy classification (Shwartz et al., 2016; Nguyen et al., 2017) and graded lexical entailment (Vulic et al., 2017). To address unsupervised hypernym discovery, pattern-based and distributional approaches are the two mainstream types of methods. Pattern-based approaches use Hearst patterns (Hearst, 1992) and their variants to extract hypernymy relations from texts (Kozareva and Hovy, 2010; Roller and Erk, 2016). Distributional methods employ hypernymy measures (also called scores) to predict hypernymy based on distributional vectors (Santus et al., 2014, 2017), alleviating the pattern sparsity issue. More recent work combines Hearst patterns and hyperbolic embeddings for unsupervised hypernym detection. Compared to unsupervised tasks, the supervised hypernymy detection task is formulated more directly, classifying a term pair as hypernymy or non-hypernymy based on the two terms' representations (Yu et al., 2015; Anke et al., 2016; Nguyen et al., 2017). Although this task definition is more straightforward, the corresponding methods receive criticism because they may suffer from "lexical memorization" (Levy et al., 2015), referring to the phenomenon that they only learn whether a term is a "prototypical hypernym", rather than the actual relation between two terms. To address the problem, several methods combine other signals as inputs for hypernymy classifiers, such as dependency paths (Shwartz et al., 2016) and the WordNet concept hierarchy (Nguyen et al., 2017). Nonetheless, it is worth studying whether supervised classifiers can learn hypernymy relations purely based on distributional representations.
In this paper, we revisit supervised distributional models for hypernymy detection, and propose a representation learning framework named Bidirectional Residual Relation Embeddings (BiRRE). To handle "lexical memorization" (Levy et al., 2015), we learn a BiRRE vector for each term pair as features for the classifier, avoiding using the two terms' embeddings directly. The BiRRE vector models the possibility of one term being mapped to the other in the embedding space by hypernymy relations, learned via existing neural language models and the supervised signals of the training set. Specifically, we introduce the Latent Projection Model with Negative Regularization (LPMNR) to simulate how hypernyms and hyponyms are generated in the embedding space. The BiRRE vectors are generated based on bidirectional residuals of the projection results of LPMNR. Experiments over multiple public datasets and various evaluation frameworks show that BiRRE outperforms strong baselines.
The rest of this paper is organized as follows. Section 2 summarizes the related work. The BiRRE framework is elaborated in Section 3, with experiments shown in Section 4. Finally, we conclude our paper and discuss the future work in Section 5.

Related Work
In this section, we review related work on various tasks related to hypernymy detection. Due to space limitations, we focus on recent advances and refer readers to Wang et al. (2017a) for earlier work.
Pattern-based approaches date back to Hearst (1992), utilizing handcrafted patterns in English for text matching. An example of Hearst patterns is "[...] such as [...]". They have been employed to build large-scale taxonomies (Wu et al., 2012; Faralli et al., 2019). Although Hearst patterns are fairly simple, recent studies show they are highly useful for designing hypernymy measures (Roller et al., 2018). Other approaches aim at improving the coverage of generalized Hearst patterns by automatic pattern expansion (Kozareva and Hovy, 2010; Roller and Erk, 2016), or by considering other context-rich representations such as Heterogeneous Information Networks (Shi et al., 2019). A potential drawback of pattern-based methods is that the recall of extraction results over specific domains is limited (Alfarone and Davis, 2015), as textual patterns are naturally sparse in the corpus.
To overcome the sparsity issue, distributional hypernymy measures model the degree of hypernymy within a term pair. A majority of these hypernymy measures are based on the Distributional Inclusion Hypothesis (DIH) (Weeds et al., 2004), meaning that a hypernym covers a broader spectrum of contexts compared to its hyponyms. Improvements and variants of DIH include (Santus et al., 2014; Chen et al., 2018; Chang et al., 2018) and many others. A comprehensive overview of distributional hypernymy measures can be found in Santus et al. (2017). Recently, Hearst patterns and distributional vectors have also been combined for hypernym detection. Additionally, the work on graded lexical entailment (Vulic et al., 2017) and cross-lingual graded lexical entailment (Vulic et al., 2019) aims at computing numerical scores indicating the degree of hypernymy of a term pair.
For supervised hypernymy classification, traditional approaches employ the distributional vectors of two terms as features, such as the Concat, Diff and SimDiff models (Turney and Mohammad, 2015). Recently, several approaches have been proposed to learn hypernymy embeddings, considering the semantic hierarchies of concepts (Yu et al., 2015; Luu et al., 2016; Nguyen et al., 2017; Chang et al., 2018; Ganea et al., 2018; Rei et al., 2018; Chen et al., 2018). For example, Yu et al. (2015) learn hypernym and hyponym embeddings for a term via a max-margin neural network. Nguyen et al. (2017) propose hierarchical embeddings for hypernymy classification, jointly trained over texts and the WordNet concept hierarchy. Rei et al. (2018) propose a directional similarity neural network based on word embeddings to predict the degree of hypernymy between two terms. Yet another group of models encodes terms in the hyperbolic space, such as the hyperbolic Lorentz model, Hyperbolic Entailment Cones (Ganea et al., 2018) and others (Aly et al., 2019). The hyperbolic geometry is more capable of modeling the transitivity property of hypernymy. Additionally, patterns and distributional vectors can also be combined for supervised hypernymy prediction, as in Shwartz et al. (2016), Held and Habash (2019) and several systems submitted to SemEval 2018 Task 9 (Camacho-Collados et al., 2018).
Another type of supervised models can be categorized as projection-based approaches, which model how to map the embeddings of a term to those of its hypernyms. The work of Fu et al. (2014) is the most influential, followed by a number of variants. Biemann et al. (2017) and Wang et al. (2017b, 2019b) improve projection learning by considering explicit negative samples. The usage of orthogonal matrices is exploited in Wang et al. (2019a). One advantage of these approaches is that they do not perform classification on the two terms' embeddings directly, alleviating "lexical memorization" (Levy et al., 2015). Compared to previous work, BiRRE is supervised, but does not minimize the classification error first. Instead, it uses LPMNR to learn the hypernym/hyponym generation process via projection learning. Hence, it takes advantage of both traditional classification and projection-based approaches.

The BiRRE Framework
In this section, we first introduce the task description and the BiRRE framework. The detailed steps and justifications are elaborated subsequently.

Task Description
Given two sets of term pairs, the training set of hypernymy relations D^(+) = {(x_i, y_i)} and the training set of non-hypernymy relations D^(−) = {(x_i, y_i)}, the task is to learn a classifier f to distinguish hypernymy from non-hypernymy relations. In particular, for each pair in D^(+), y_i is a hypernym of x_i. For non-hypernymy relations, the relation between two terms x_i and y_i in D^(−) can be reverse hypernymy, synonymy, antonymy, or unrelated, depending on the respective task and dataset settings.

General Framework
The BiRRE framework is shown in Figure 1, consisting of pre-processing and three major modules.
Pre-processing: The pre-processing step of the BiRRE framework requires minimal computation. For each term pair (x_i, y_i) ∈ D^(+) ∪ D^(−), we retrieve the corresponding embedding vectors from any neural language model (e.g., Word2Vec, GloVe), without fine-tuning. With a slight abuse of notation, we also use x_i and y_i to denote the normalized embeddings of the two terms.
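For illustration, this step can be sketched as follows. The gensim fastText loader, the model file name and the example terms are assumptions for illustration only, not the exact setup used in our experiments.

```python
import numpy as np
from gensim.models.fasttext import load_facebook_vectors

# Any pre-trained embedding model can be plugged in here; the file name is hypothetical.
vectors = load_facebook_vectors("wiki.en.bin")

def normalized_embedding(term):
    """Look up a fixed (not fine-tuned) word vector and L2-normalize it."""
    v = vectors[term]
    return v / np.linalg.norm(v)

# Hypothetical term pair (x_i, y_i): "apple" is-a "fruit".
x_i = normalized_embedding("apple")
y_i = normalized_embedding("fruit")
```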
M1: The hyponym projection module learns how to map the embeddings of a hypernym to those of its hyponyms. Consider the example in Figure 2. There are usually one-to-many mappings (in semantics) from hypernyms to hyponyms. Hence, we map a hypernym to its N semantically diverse hyponyms by LPMNR. We denote the N hyponym embeddings w.r.t. y_i as hypo^(1)(y_i), ..., hypo^(N)(y_i). Based on the difference between the true hyponym embedding x_i and the N predicted hyponym embeddings, we compute the hyponym residual vector res_hypo(x_i, y_i) to measure the "goodness" of the mapping from y_i to x_i. As shown in Biemann et al. (2017), the explicit usage of negative samples (i.e., non-hypernymy relations) improves the performance of projection learning. In this module, we take D^(+) as the training set and use D^(−) for regularization purposes.
M2: The hypernym projection module learns how to map the embeddings of a hyponym to those of its hypernyms. Based on Figure 2, such mappings tend to be simpler. Hence, we only learn one mapping model from a hyponym to the embeddings of its hypernym. We denote the predicted hypernym embedding as hyper(x_i). This step is learned by a simplified version of LPMNR. Similarly, we denote the hypernym residual vector as res_hyper(x_i, y_i), measuring the "goodness" of the mapping from x_i to y_i. In this module, we also take D^(+) as the training set and use D^(−) for regularization.

M3: Finally, the BiRRE vector (denoted as r_i) is computed by concatenating res_hypo(x_i, y_i) and res_hyper(x_i, y_i). A feedforward neural network is trained over D^(+) and D^(−) for hypernymy relation classification. The parameters of M3 are learned by backpropagation, with the parameters of M1 and M2 fixed in this step.

Hyponym Projection (M1)
Previously, several approaches (Fu et al., 2014; Yamane et al., 2016) learn projection matrices that map the embedding of a term to that of its hypernym. According to Wang et al. (2019a), the usage of orthogonal matrices has better performance for hypernymy prediction, as the cosine similarity of Mx_i and y_i can be maximized when Mx_i and y_i are normalized.
Let M = {M^(1), ..., M^(N)} be the parameter collection of our hyponym projection model (i.e., N d × d orthogonal projection matrices). For each hypernym y_i, these N projection matrices map y_i to the embeddings of N semantically diverse hyponyms M^(1) y_i, ..., M^(N) y_i. The major challenge is that the explicit semantics of the N projections are unknown, and may vary across different datasets. To derive a unified solution for all scenarios, we introduce a latent variable θ_i^(p) that weights how well the p-th projection fits the pair (x_i, y_i), and minimize:

$$\min_{\mathcal{M},\Theta}\ \sum_{(x_i,y_i)\in D^{(+)}}\sum_{p=1}^{N}\theta_i^{(p)}\left\|M^{(p)}\mathbf{y}_i-\mathbf{x}_i\right\|^2\quad\text{s.t. each }M^{(p)}\text{ orthogonal}\qquad(1)$$

A limitation of Eq. (1) is that it only considers the hypernymy relations D^(+); the relation classification objective is not optimized. As Biemann et al. (2017) suggest, negative samples can be of help for learning projection regularizers. The regularizers push the projected hyponym embeddings of a term further away from its non-hyponyms, making hypernymy and non-hypernymy relations more separable. Hence, we reformulate Eq. (1) as:

$$\min_{\mathcal{M},\Theta,\Phi}\ \sum_{(x_i,y_i)\in D^{(+)}}\sum_{p=1}^{N}\theta_i^{(p)}\left\|M^{(p)}\mathbf{y}_i-\mathbf{x}_i\right\|^2-\lambda\sum_{(x_i,y_i)\in D^{(-)}}\sum_{p=1}^{N}\phi_i^{(p)}\left\|M^{(p)}\mathbf{y}_i-\mathbf{x}_i\right\|^2\qquad(2)$$

where λ > 0 is the regularization balancing factor and the latent variable φ_i^(p) plays the same role as θ_i^(p) for negative pairs. To the best of our knowledge, there is no standard off-the-shelf solution to Eq. (2). We therefore slightly change the regularization term of Eq. (2), and refer to the resulting objective, Eq. (3), as the Latent Projection Model with Negative Regularization (LPMNR).

Optimizing Eq. (3) is non-trivial due to the existence of the unknown weights θ_i^(p) and φ_i^(p). In this paper, we present a dual-iterative algorithm to solve the problem. All values of θ_i^(p) and φ_i^(p) are initialized uniformly. In each iteration, we first update the projection matrices with the latent weights fixed. It is trivial to see that the optimal value of each matrix is independent of the others, so each M^(p) can be optimized separately subject to its orthogonality constraint. The per-matrix problem can be transformed into maximizing a trace term; re-writing the objective as J(M) = 1 − tr(MB^T), we have transformed the problem into the Multi-Wahba problem (Wang et al., 2019a), whose optimal M is available in closed form.

After the optimal values of M^(p) are computed, the values of all ||M^(p) y_i − x_i||^2 are known. In this condition, we fix the values of M^(p) and update all θ_i^(p) and φ_i^(p) by constrained gradient descent, where η > 0 is the learning rate (a small decimal) and the updated weights are used in the next iteration. After the update of all weights, we normalize them to satisfy the constraints on the latent variables. The iterative procedure continues until convergence, with the algorithm summarized in Algorithm 2.
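To make the procedure concrete, the following NumPy sketch illustrates one possible implementation of the dual-iterative optimization. The simplex normalization of the latent weights, the gradient step sizes and the loop bounds are illustrative assumptions; the sketch conveys the alternating structure (closed-form SVD update of the matrices, then a weight update), not the exact reference implementation.

```python
import numpy as np

def solve_orthogonal(B):
    """Orthogonal (rotation) matrix M maximizing tr(M B^T), via SVD."""
    U, _, Vt = np.linalg.svd(B)
    D = np.eye(B.shape[0])
    D[-1, -1] = np.sign(np.linalg.det(U @ Vt))
    return U @ D @ Vt

def lpmnr(X_pos, Y_pos, X_neg, Y_neg, n_proj, lam=0.5, lr=1e-3, n_iter=50):
    """X_*, Y_*: (num_pairs, d) arrays of normalized hyponym / hypernym embeddings."""
    d = X_pos.shape[1]
    theta = np.full((len(X_pos), n_proj), 1.0 / n_proj)  # latent weights for positives
    phi = np.full((len(X_neg), n_proj), 1.0 / n_proj)    # latent weights for negatives
    Ms = [np.eye(d) for _ in range(n_proj)]
    outer_pos = np.einsum('id,ie->ide', X_pos, Y_pos)    # x_i y_i^T for positive pairs
    outer_neg = np.einsum('id,ie->ide', X_neg, Y_neg)    # x_i y_i^T for negative pairs
    for _ in range(n_iter):
        # Step 1: fix the weights, solve each M^(p) in closed form.
        for p in range(n_proj):
            B = (theta[:, p, None, None] * outer_pos).sum(0) \
                - lam * (phi[:, p, None, None] * outer_neg).sum(0)
            Ms[p] = solve_orthogonal(B)
        # Step 2: fix the matrices, take a gradient step on the weights, re-normalize.
        res_pos = np.stack([np.linalg.norm(X_pos - Y_pos @ M.T, axis=1) ** 2 for M in Ms], 1)
        res_neg = np.stack([np.linalg.norm(X_neg - Y_neg @ M.T, axis=1) ** 2 for M in Ms], 1)
        theta = np.clip(theta - lr * res_pos, 1e-8, None)
        phi = np.clip(phi + lr * lam * res_neg, 1e-8, None)
        theta /= theta.sum(1, keepdims=True)
        phi /= phi.sum(1, keepdims=True)
    return Ms, theta, phi
```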
After training, given y_i, M1 outputs N hyponym embeddings: hypo^(1)(y_i) = M^(1) y_i, ..., hypo^(N)(y_i) = M^(N) y_i. We define the hyponym residual vector as res_hypo(x_i, y_i) = x_i − M^(p̂) y_i, where p̂ is the index of the selected projection matrix that best fits the pair (x_i, y_i). We set p̂ empirically as: p̂ = argmin_p ||x_i − M^(p) y_i||^2.
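For illustration, a short sketch of this computation follows; it assumes the residual is the element-wise difference between the true hyponym embedding and the best-fitting projected embedding, and the function name is hypothetical.

```python
import numpy as np

def res_hypo(x, y, Ms):
    """x, y: normalized embeddings of the pair; Ms: learned orthogonal matrices M^(p)."""
    projections = [M @ y for M in Ms]            # N predicted hyponym embeddings of y
    p_hat = int(np.argmin([np.linalg.norm(x - proj) for proj in projections]))
    return x - projections[p_hat]                # d-dimensional residual vector
```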
Based on the objective in Eq. (3), if (x_i, y_i) ∈ D^(+), the norm ||res_hypo(x_i, y_i)|| tends to be small. Otherwise, ||res_hypo(x_i, y_i)|| would be large. Hence, it is discriminative for hypernymy classification.

Hypernym Projection (M2)
The hypernym projection module can be regarded as a simplified version of the previous module. Denote Q as the d × d projection matrix. The objective of hypernym projection is formulated as follows:

$$\min_{Q}\ \sum_{(x_i,y_i)\in D^{(+)}}\left\|Q\mathbf{x}_i-\mathbf{y}_i\right\|^2-\lambda\sum_{(x_i,y_i)\in D^{(-)}}\left\|Q\mathbf{x}_i-\mathbf{y}_i\right\|^2\quad\text{s.t. }Q\text{ orthogonal}$$

It can be solved by Algorithm 1 with the weights reduced and N = 1. Similar to hyponym projection, we compute the hypernym residual vector as res_hyper(x_i, y_i) = y_i − Q x_i.
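A minimal sketch of this module is shown below, under the assumption that Q is obtained with the same closed-form orthogonal solver as in M1 (a single matrix, no latent weights) and that the residual is the difference between the true and projected hypernym embeddings.

```python
import numpy as np

def train_hypernym_projection(X_pos, Y_pos, X_neg, Y_neg, lam=0.5):
    """Fit one orthogonal matrix Q mapping hyponym embeddings to hypernym embeddings."""
    B = Y_pos.T @ X_pos - lam * (Y_neg.T @ X_neg)   # weighted sum of y_i x_i^T terms
    U, _, Vt = np.linalg.svd(B)
    D = np.eye(B.shape[0])
    D[-1, -1] = np.sign(np.linalg.det(U @ Vt))
    return U @ D @ Vt

def res_hyper(x, y, Q):
    """Residual between the true hypernym embedding and the projected one."""
    return y - Q @ x
```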

Hypernymy Relation Classification (M3)
For each pair (x_i, y_i) ∈ D^(+) ∪ D^(−), we generate the BiRRE vector r_i via the concatenation of the two residual vectors: r_i = res_hypo(x_i, y_i) ⊕ res_hyper(x_i, y_i). A feedforward neural network is then trained for hypernymy vs. non-hypernymy classification over D^(+) and D^(−), using r_i as features. We summarize the high-level training process of BiRRE in Algorithm 3. There can be zero, one or multiple hidden layers in the neural network; the detailed study of network structures is discussed in the experiments.
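For illustration, a minimal PyTorch sketch of such a classifier is given below. The single hidden layer of width d, the ReLU activation and the training-loop details are assumptions; only the dropout rate of 0.1 and the Adam optimizer follow the experimental settings.

```python
import torch
import torch.nn as nn

d = 300                                  # word embedding dimension

# BiRRE vector r_i = res_hypo ⊕ res_hyper has dimension 2d.
classifier = nn.Sequential(
    nn.Linear(2 * d, d),                 # one hidden layer of width d (illustrative choice)
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(d, 2),                     # hypernymy vs. non-hypernymy
)
optimizer = torch.optim.Adam(classifier.parameters())
loss_fn = nn.CrossEntropyLoss()

def train_step(birre_vectors, labels):
    """birre_vectors: (batch, 2d) float tensor; labels: (batch,) long tensor of {0, 1}."""
    optimizer.zero_grad()
    loss = loss_fn(classifier(birre_vectors), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```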

Discussion
Orthogonal projections have been applied to predict various types of word relations (Ethayarajh, 2019). However, the mechanisms behind orthogonal projections in the embedding space for predicting such relations cannot yet be fully explained. In BiRRE, we use different numbers of matrices in M1 and M2, in order to capture the mappings between hypernyms and hyponyms. Due to the complicated nature of language, such projections are not 100% correct. Hence, we learn the residual vectors and train a classifier (in M3) to decide which dimensions learned by M1 and M2 are the best predictors for hypernymy relations, which further improves the performance of BiRRE.

Experiments
In this section, we conduct extensive experiments to evaluate the BiRRE model over various benchmarks. We also compare it with state-of-the-art methods to show its effectiveness.

Experimental Settings
The default word embeddings used by our model are pre-trained by the fastText model (Bojanowski et al., 2017) over the English Wikipedia corpus (December 2019 version). We train the embeddings ourselves using the original code. The embedding size is set to d = 300, following the original paper. In the implementation, the parameters η and N are set to 10^{-3} and max{1, lg|D^(+)|} (an empirical formula), respectively. We also tune the model parameters in subsequent experiments. The neural network in M3 is fully connected and trained via the Adam algorithm with a dropout rate of 0.1.

Experiment 1: Effectiveness of BiRRE
We use the largest hypernymy relation dataset (to our knowledge), from Shwartz et al. (2016), to test the effectiveness of BiRRE. It is created from various resources (WordNet, DBPedia, Wikidata and YAGO), and is divided into a random split and a lexical split. In particular, the lexical split forces the training, testing and validation sets to contain distinct vocabularies, disabling "lexical memorization" (Levy et al., 2015). We follow the same evaluation steps as Shwartz et al. (2016) and Rei et al. (2018) and report the results in Table 1 (performance of different approaches over the dataset of Shwartz et al. (2016)). The network structure and parameters are tuned over the validation set. Based on the results, BiRRE consistently outperforms the state-of-the-art by 3.1% and 5.6% in terms of F1. Additionally, the performance gap between the lexical and random splits has been narrowed from 6.5% (Rei et al., 2018) to 4.0% (BiRRE). This shows that BiRRE alleviates "lexical memorization" compared to other distributional models. We also conduct pairwise statistical tests between the outputs of Rei et al. (2018) and ours, which show that BiRRE outperforms that approach significantly.
We tune the value of λ from 0.0 to 1.0 using the development set. The results over the lexical split of the dataset (Shwartz et al., 2016) are shown in Figure 3(a). A bigger λ means a larger effect of negative regularization. As seen, the usage of negative regularization improves the prediction performance by a large margin. A suitable choice of λ is generally around 0.4 to 0.6. As for the neural network structure, the number of hidden nodes does not have a large impact on model performance. Hence, we only report the results when the number of nodes in the hidden layers equals the dimension of the word embeddings d, shown in Figure 3(b). Our results are consistent with previous research, which shows that adding more hidden layers can decrease the prediction accuracy, leading to model overfitting.

Experiment 2: Supervised Hypernymy Classification
We evaluate BiRRE over two benchmark datasets: BLESS (Baroni and Lenci, 2011) and ENTAILMENT (Baroni et al., 2012), consisting of 14,547 and 2,770 labeled term pairs, respectively. For evaluation, we follow the same "leave-one-out" evaluation protocols as used in previous research (Yu et al., 2015; Luu et al., 2016; Nguyen et al., 2017). All experimental results are reported in terms of averaged accuracy. Because the two datasets do not have separate validation sets, we use the dataset of Shwartz et al. (2016) to tune the parameters of BiRRE. To prevent "data leakage", we exclude from the validation set all data that also appear in the test set before parameter tuning. We compare BiRRE against several previous supervised models (Mikolov et al., 2013; Yu et al., 2015; Luu et al., 2016; Nguyen et al., 2017; Wang et al., 2019a). The averaged accuracy scores of all these methods are shown in Table 2. From the results, we can see that our model outperforms all previous baseline approaches, achieving averaged accuracies of 98% and 93%, respectively. We also conduct the paired t-test, which shows that the improvements of BiRRE are statistically significant.

Experiment 3: Ablation Study of BiRRE
We further study the effectiveness of individual residual vectors for hypernymy classification and conduct the following ablation study. Each time, we only use a unidirectional residual vector as features (i.e., res_hypo(x_i, y_i) or res_hyper(x_i, y_i)). Additionally, we follow several previous papers (Yu et al., 2015; Luu et al., 2016; Nguyen et al., 2017), using the addition, offset and concatenation of embedding vectors as features (i.e., x_i + y_i, x_i − y_i and x_i ⊕ y_i) to train the neural networks for hypernymy classification; these three models are treated as naive baselines. The experimental settings are the same as in Experiments 1 and 2. The experimental results over BLESS (Baroni and Lenci, 2011), ENTAILMENT (Baroni et al., 2012) and the lexical split of the dataset (Shwartz et al., 2016) are shown in Table 3. We have the following three observations. i) Traditional models using x_i + y_i, x_i − y_i and x_i ⊕ y_i as features do not yield satisfactory results. The most likely cause is that they suffer from the "lexical memorization" problem. ii) The hyponym residual vector res_hypo(x_i, y_i) is slightly more effective than the hypernym residual vector res_hyper(x_i, y_i). This suggests that the more complicated hyponym generation process is modeled more precisely and is better suited to our task. iii) By combining res_hypo(x_i, y_i) and res_hyper(x_i, y_i), the proposed approach is more effective and outperforms the previous methods.
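For clarity, the three naive feature constructions can be sketched as follows; the function and mode names are illustrative only.

```python
import numpy as np

def baseline_features(x, y, mode):
    """Build naive pair features from two term embeddings x and y."""
    if mode == "addition":        # x_i + y_i
        return x + y
    if mode == "offset":          # x_i - y_i
        return x - y
    if mode == "concat":          # x_i ⊕ y_i
        return np.concatenate([x, y])
    raise ValueError(f"unknown mode: {mode}")
```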

Experiment 4: Hypernym Discovery
Yet another widely used evaluation framework is hypernym discovery, including three subtasks: i) ranked hypernym detection, ii) hypernymy direction classification, and iii) graded lexical entailment. For ranked hypernym detection, we use five test sets: BLESS (Baroni and Lenci, 2011), EVAL (Santus et al., 2015), LEDS (Baroni et al., 2012), SHWARTZ (Shwartz et al., 2016) and WBLESS (Weeds et al., 2014). For each test set, we use the remaining four datasets (excluding all term pairs in the current test set) to train and tune the BiRRE model. For each term in the test set, we create a ranked list of candidate hypernyms by placing positive predictions over negative ones. Next, for candidate hypernyms with the same predicted relation label, we rank them by the norms of their BiRRE vectors to produce the final ranked list.
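A sketch of this ranking protocol is given below; the function names and the direction of the norm-based tie-breaking are assumptions for illustration.

```python
import numpy as np

def rank_candidates(query, candidates, predict_label, birre_vector):
    """Rank candidate hypernyms of a query term.

    predict_label(query, c) -> 1 for predicted hypernymy, 0 otherwise.
    birre_vector(query, c)  -> the pair's BiRRE feature vector.
    """
    scored = []
    for c in candidates:
        label = predict_label(query, c)
        norm = np.linalg.norm(birre_vector(query, c))
        scored.append((c, label, norm))
    # Positive predictions first; within the same label, smaller residual norm ranks
    # higher (assumed ordering, since small residuals indicate a better hypernymy fit).
    return [c for c, _, _ in sorted(scored, key=lambda t: (-t[1], t[2]))]
```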
For the hypernymy direction classification subtask, we use three test sets: BLESS (Baroni and Lenci, 2011), WBLESS (Weeds et al., 2014) and BIBLESS (Kiela et al., 2015). Because this subtask is directly evaluated in terms of accuracy, we train the supervised BiRRE model using the external dataset of Shwartz et al. (2016) (again excluding term overlaps) and report the performance. Another subtask evaluated in Roller et al. (2018) and related work is graded lexical entailment (Vulic et al., 2017). Because BiRRE only produces discrete outputs, how BiRRE can be adapted for graded lexical entailment is left as future work.
The experimental results are summarized in Table 4. For comparison, we take three recent models as baselines, including Nguyen et al. (2017) and Roller et al. (2018); for the latter, we use the scores generated by "spmi(x, y)" due to its superiority. We can see that BiRRE consistently outperforms the baselines over most of the datasets. As for LEDS and WBLESS, the results of BiRRE and the state-of-the-art are comparable. Hence, our supervised distributional model BiRRE can also address hypernym discovery, previously addressed by unsupervised hypernymy scores. We note that the models in Table 4 use different knowledge sources (either patterns or distributional vectors) for parameter learning. Strictly speaking, the score gaps in this set of tasks do not necessarily reflect which method is better in all situations. How to evaluate all types of methods related to hypernymy detection in a unified framework remains an open question.

Experiment 5: Choice of Different Word Embeddings
We also test our model using other types of word embeddings. We consider two other types of traditional word embeddings, Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), as well as BERT (Devlin et al., 2019) representations without contexts. (The dimensions of Word2Vec and GloVe are the same as fastText. The pre-trained BERT model we use is Google's base model, released at https://github.com/google-research/bert.) Experiments are conducted over the same datasets as used in Experiment 3. The results are shown in Table 5, in terms of accuracy. As shown, the effect of fastText (Bojanowski et al., 2017) is slightly better than Word2Vec and GloVe. The representations of BERT do not yield satisfactory performance, probably because the dimensionality of BERT is higher than that of the other models, making the number of parameters in BiRRE too large to be learned well. Note that the study of deep neural language models is beyond the scope of this paper and can be explored in the future.

Conclusion and Future Work
In this paper, we present the BiRRE model for supervised hypernymy detection. It employs two projection-based hypernym and hyponym generation modules based on word embeddings to learn BiRRE vectors for hypernymy classification. Experimental results show that BiRRE outperforms the state-of-the-art over various benchmark datasets. Future work includes i) improving projection learning to model more complicated linguistic properties of hypernymy; ii) extending our model to address other tasks, such as graded lexical entailment (Vulic et al., 2017) and cross-lingual graded lexical entailment (Vulic et al., 2019); and iii) exploring how deep neural language models (such as BERT (Devlin et al., 2019), Transformer-XL and XLNet) can improve the performance of hypernymy detection.