Within-Between Lexical Relation Classification Using Path-based and Distributional Data

We propose the novel Within-Between Relation model for recognizing lexical-semantic relations between words. Our model integrates relational and distributional signals, forming an effective sub-space representation for each relation. We show that the proposed model is competitive and outperforms other baselines across various benchmarks.


Introduction and Related Work
Recognizing lexical-semantic relations between words is beneficial for a variety of NLP tasks such as machine translation (Thompson et al., 2019), relation extraction (Shen et al., 2018), natural language inference (Chen et al., 2018), and question answering (Yang et al., 2017). The lexical relation classification task assigns a word-pair (pair of words) to its corresponding relation out of a finite set of relations. This set contains lexical relations, including the random relation (indicating that the words are unrelated). Two main lexical relation classification techniques are studied in the literature: path-based methods (Hearst, 1992; Snow et al., 2005; Nakashole et al., 2012; Riedel et al., 2013) and distributional methods (Mikolov et al., 2013a; Pennington et al., 2014; Bojanowski et al., 2017; Glavaš and Vulić, 2018a), with some effort for integrating the two (Shwartz et al., 2016a).
In this work we follow the distributional approach, which was shown to improve upon path-based methods. This approach considers static word embeddings such as word2vec (Mikolov et al., 2013b), GloVe (Pennington et al., 2014), and FastText (Bojanowski et al., 2017), which produce an out-of-context vector representation for each word. Note that while contextualized embeddings (Devlin et al., 2019; Peters et al., 2018) have replaced the use of non-contextualized embeddings in many settings, static word embeddings remain the standard choice for lexical relation classification, since in this task the input word-pair is given out of context. Taking the word embeddings as input, a classifier is trained over the representations of the two words in the pair. The recent SphereRE method (Wang et al., 2019), a purely distributional method that learns hyperspherical relation representations, presented state-of-the-art lexical relation classification results.

* Equal contribution, order determined randomly.
While presenting state-of-the-art performance, prior distributional methods suffer from the "lexical memorization" problem (Levy et al., 2015). This problem arises when a test word-pair includes a word that is frequent in the training set and is labeled there with a dominant category. In such cases, the supervised model often ignores the second word in the input pair and resorts to the dominant training label associated with the frequent word. Notably, lexical memorization is common for prototypical hypernyms - "category" words that are frequently labeled as hypernyms. For example, the vast majority of training examples that include the word fruit are labeled as hypernymy (fruit is the hypernym of apple, banana, etc.). Therefore, at inference time, the classifier is likely to predict the hypernymy relation even for unrelated word-pairs that contain fruit, e.g., (fruit, chair).
Another relevant line of research, which inspired our work, pertains to the integration of external lexical information to improve static word embeddings (Faruqui et al., 2015; Mrkšić et al., 2016; Glavaš and Vulić, 2018b; Arora et al., 2020; Barkan et al., 2020). Most of these methods aim to modify the distributional vector space, originally learned from corpus co-occurrence data, by using additional relational constraints. To that end, these techniques rely on lexical databases, e.g., WordNet (Miller, 1995). Notably, Arora et al. (2020) present the LEXSUB model and suggest training static word embeddings by integrating lexical-relation and distributional data, through the combination of two corresponding loss terms. When modeling lexical relation data, each relation is projected to a separate subspace. Some of these ideas are adapted in parts of our work, while addressing the concrete goal of lexical relation classification rather than improving generic static word embeddings.
In this work, we present the novel Within-Between Relation (WBR) model, which is inspired both by previous relation classification models and by generic word embedding models that consider lexical relation constraints. This is performed through the combination of two objectives, both computed over the same projected sub-spaces, for each of the individual relations. Specifically, our Between objective aims to yield optimal classification of relation instances, while the Within objective aims to bring pairs of words sharing a relation close to each other in the corresponding relation sub-space. These objectives allow the incorporation of both relation and negative sampling signals, altogether addressing the lexical memorization problem much more effectively.

The Within-Between Relation Model
In this section, we present the WBR model. Given a word-pair sharing a relation k, WBR is optimized to classify the word-pair to its correct relation (the between-relation objective) and, at the same time, to separate it, within the relation-k space, from word-pairs that do not share the relation k (the within-relation objective). Let I and K be vocabularies of words and relations, respectively. K contains lexical relations such as hypernym, antonym, etc., including the random relation (words are unrelated) and the co (stands for co-occurrence) relation that is shared by words that co-occur in the corpus. We further denote P = I × I.
Let u_i, v_i ∈ R^d be random variables that form the context and target base representations for the word i. We further denote U = {u_i}_{i∈I} and V = {v_i}_{i∈I}. The entries of u_i and v_i have normal priors whose means a_i, b_i ∈ R^d are either zero or set to a pretrained embedding (that can be retrieved from any word embedding method such as FastText, word2vec, GloVe, etc.), and whose precision is the hyperparameter τ. We further denote A = {a_i}_{i∈I} and B = {b_i}_{i∈I}.
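Under a normal prior with mean a_i and precision τ, the MAP objective contributes an L2 penalty that pulls each learned vector toward its pretrained initialization. The following is a minimal sketch of that prior term (the function name and shapes are our own, not the authors' code):

```python
import numpy as np

def prior_penalty(U, A, tau=1e-4):
    """Negative log of the normal prior N(a_i, tau^-1 I) over the rows of U,
    up to an additive constant: (tau / 2) * sum_i ||u_i - a_i||^2."""
    return 0.5 * tau * np.sum((U - A) ** 2)

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 300))    # pretrained (e.g., FastText-like) vectors
U = A.copy()                       # initialize the learned vectors at the prior mean
assert prior_penalty(U, A) == 0.0  # no penalty at initialization
```

As U drifts away from A during training, the penalty grows quadratically, so τ controls how strongly the learned embeddings are anchored to the pretrained ones.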
For each relation k ∈ K, let I_k ⊆ P denote the set of word-pairs that share the relation k. In the case of an undirected relation, it holds that (i, j) ∈ I_k if and only if (j, i) ∈ I_k. Note that in the specific case of the random relation, I_random consists of all word-pairs that do not share any other relation. This assumption guarantees that each word-pair (i, j) ∈ P is associated with a relation k ∈ K.
Let f_k : P → R be a parametric function. Our goal is to learn parameters for f_k s.t. the score f^k_ij is high if and only if (i, j) ∈ I_k. In this work, we set

f^k_ij = α cos(Ψ_k u_i, Φ_k v_j),

where α is a hyperparameter. This choice, a cosine similarity metric with temperature, is motivated in Sec. 4. Ψ_k ∈ R^(d_k × d) and Φ_k ∈ R^(d_k × d) are matrices whose entries have normal priors with zero mean and precision λ (a hyperparameter). Therefore, f_k learns Ψ_k and Φ_k that enable the projection to a new relation space k. In this space, word-pairs that share the relation k are separated, in terms of the angle between their respective vectors, from word-pairs that do not. Yet, in the general case, f_k can be a deep neural network. An exception is k = co, for which Ψ_k and Φ_k are predetermined to Ψ_k = Φ_k = I (not learned). We further denote Ψ = {Ψ_k}_{k∈K} and Φ = {Φ_k}_{k∈K}. Finally, we denote the set of unobserved variables and the set of hyperparameters by Θ = {U, V, Ψ, Φ} and H = {A, B, τ, λ}, respectively.
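As a concrete sketch of the temperature-scaled cosine score over the projected relation sub-space (NumPy, with illustrative names and random parameters; the dimensions d = 300 and d_k = 15 follow the hyperparameters reported later):

```python
import numpy as np

def relation_score(u_i, v_j, Psi_k, Phi_k, alpha=5.0):
    """Temperature-scaled cosine similarity between the projections of a
    word-pair (i, j) into the sub-space of relation k, i.e., a sketch of
    f^k_ij = alpha * cos(Psi_k u_i, Phi_k v_j)."""
    p = Psi_k @ u_i  # (d_k,) projection of the context vector
    q = Phi_k @ v_j  # (d_k,) projection of the target vector
    cos = p @ q / (np.linalg.norm(p) * np.linalg.norm(q) + 1e-12)
    return alpha * cos

d, d_k = 300, 15
rng = np.random.default_rng(0)
u_i, v_j = rng.normal(size=d), rng.normal(size=d)
Psi_k, Phi_k = rng.normal(size=(d_k, d)), rng.normal(size=(d_k, d))
score = relation_score(u_i, v_j, Psi_k, Phi_k)
assert -5.0 <= score <= 5.0  # the score is bounded by the temperature alpha
```

Because the cosine is bounded in [-1, 1], the temperature α controls how sharp the resulting likelihoods can become.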

The Within-Relation Likelihood
Let y^k_ij ∈ {-1, 1} be an observed variable indicating whether the word-pair (i, j) shares the relation k (these labels are produced by the sampling procedure in Algorithm 1), and let Y denote the set of observed labels. Then, the within-relation likelihood is given by:

p(Y | Θ) = ∏_{k∈K} ∏_{(i,j)} σ(y^k_ij f^k_ij),

where σ is the sigmoid function and the inner product runs over the labeled word-pairs of relation k.
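Assuming the sigmoid negative-sampling formulation suggested by the ±1 labels in Algorithm 1 (our reading, not a verbatim reproduction of the paper's equation), the within-relation term for a single labeled pair can be sketched as:

```python
import math

def within_nll(score, y):
    """Negative log-likelihood -log sigma(y * f^k_ij) of one labeled pair,
    where y is +1 for a positive pair and -1 for a sampled negative."""
    return -math.log(1.0 / (1.0 + math.exp(-y * score)))

# A high score should be cheap for a positive pair and costly for a negative one.
assert within_nll(4.0, +1) < within_nll(4.0, -1)
```

Minimizing this term pushes positive word-pairs toward a small angle in the relation sub-space and negatives toward a large one.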

The Between-Relation Likelihood
Denote K̄ = K \ {co} and let R = {r_ij | (i, j) ∈ P}, where r_ij is an observed categorical random variable taking values in K̄ s.t. r_ij = k if (i, j) ∈ I_k. Then, the between-relation likelihood is given by:

p(R | Θ) = ∏_{(i,j)∈P} p(r_ij | Θ), with p(r_ij = k | Θ) = exp(f^k_ij) / Σ_{k'∈K̄} exp(f^{k'}_ij).

Figure 1: The WBR graphical model.
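A categorical (softmax) likelihood over the per-relation scores, which matches the inference rule p(r_ij = k | Θ*) described later, can be sketched as follows (the scores and relation names are illustrative):

```python
import numpy as np

def relation_probs(scores):
    """Softmax over the per-relation scores f^k_ij, k in K-bar: a sketch of
    the categorical between-relation likelihood for one word-pair."""
    z = scores - np.max(scores)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, -1.0, 0.5])  # e.g., hypernym, antonym, random
p = relation_probs(scores)
assert np.isclose(p.sum(), 1.0)
assert p.argmax() == 0  # the highest-scoring relation gets the highest probability
```

At inference time, the predicted relation is simply the argmax of these probabilities.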

WBR Training and Inference
The vanilla WBR loss is derived by taking the negative log of the joint distribution:

L_vanilla(Θ) = -log p(Y | Θ) - log p(R | Θ) - log p(Θ | H).    (1)

A graphical model of WBR is presented in Fig. 1. The minimization of L_vanilla(Θ) is equivalent to the Maximum A-Posteriori (MAP) estimation of Θ. However, the negative log likelihood terms in Eq. 1 contain a summation which is quadratic in the vocabulary size |I|, implying a prohibitive computation. Therefore, we turn to a stochastic optimization. Let C be a text corpus (a sequence of words), and Q^k_i = {j | (i, j) ∈ I_k}. We define s_k : I → I × I as a sampler s.t. s_k(i) retrieves a random word-pair (i, j) ∈ Q^k_i if Q^k_i ≠ ∅, and otherwise a random word-pair (a, b) ∈ I_k. The WBR stochastic optimization algorithm is described in Algorithm 1, together with the L_wbr loss function in Eq. 2. In sketch, Algorithm 1 proceeds as follows: for each word i in C, sample a positive word j (within the window around i) and a negative word n ∈ I, and set y^co_{i,j} ← 1, y^co_{i,n} ← -1; for each relation k ∈ K̄ and word-pair (a, b) ∈ I_k, sample n s.t. (a, n) ∉ I_k, and set y^k_{a,b} ← 1, y^k_{a,n} ← -1; finally, update Θ with the optimizer OPT over the stochastic L_wbr loss, which accumulates the within- and between-relation negative log-likelihood terms over the sampled pairs together with the prior (regularization) terms.
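The negative-sampling scheme recoverable from Algorithm 1 (one random negative per positive pair, labeled +1 / -1) can be sketched as follows; the function name, vocabulary, and relation inventory are illustrative:

```python
import random

def sample_minibatch(relation_pairs, vocab, rng):
    """For every positive pair (a, b) of relation k, draw a negative (a, n)
    with (a, n) not in I_k, labeling them +1 / -1 as in Algorithm 1."""
    batch = []
    for k, pairs in relation_pairs.items():
        positives = set(pairs)
        for a, b in pairs:
            batch.append((k, a, b, +1))
            n = rng.choice(vocab)
            while (a, n) in positives:  # resample until (a, n) is a true negative
                n = rng.choice(vocab)
            batch.append((k, a, n, -1))
    return batch

rng = random.Random(0)
vocab = ["apple", "fruit", "chair", "animal", "dog"]
relation_pairs = {"hypernym": [("apple", "fruit"), ("dog", "animal")]}
batch = sample_minibatch(relation_pairs, vocab, rng)
assert len(batch) == 4  # exactly one negative per positive
```

Each optimizer step then evaluates the sampled labels under the within- and between-relation terms, so every frequent word appears with both positive and negative labels.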
Finally, in the inference phase, the probability that i and j share the relation k is computed by p(r_ij = k | Θ*), where Θ* is the MAP estimate produced by Algorithm 1.

Experimental Setup and Results
In this section, we present the datasets, hyperparameters, and experiments that we conducted to evaluate the WBR model.

Benchmarks and Co-Occurrence Data
In order to evaluate our model, we adopted the same experimental setup as Wang et al. (2019). The lexical relation classification datasets that were considered are K&H+N (Necşulescu et al., 2015), BLESS (Baroni and Lenci, 2011), ROOT09 (Santus et al.) and EVALution (Santus et al., 2015). Since the EVALution benchmark does not contain the random relation, we add it artificially for the negative sampling purpose. Due to space limitations, we do not provide the datasets' statistics; the reader may refer to (Wang et al., 2019) for the full details of the datasets. For co-occurrence data, we extracted co-occurring word-pairs from the English Wikipedia corpus. We sampled co-occurrence data that correspond to the vocabulary of the relation classification datasets, by picking the sentences from the corpus that contain these words.
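The co-occurrence extraction step can be sketched as a simple windowed scan over the corpus; the window size and whitespace tokenization are our assumptions, as the text does not specify them:

```python
def cooccurring_pairs(sentences, vocab, window=5):
    """Collect word-pairs from the dataset vocabulary that co-occur within
    a fixed window in the corpus (a sketch of the extraction described above)."""
    pairs = set()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, w in enumerate(tokens):
            if w not in vocab:
                continue
            for c in tokens[i + 1 : i + 1 + window]:
                if c in vocab and c != w:
                    pairs.add((w, c))
    return pairs

vocab = {"fruit", "apple", "chair"}
sents = ["an apple is a sweet fruit", "the chair stood in the corner"]
assert ("apple", "fruit") in cooccurring_pairs(sents, vocab)
```

Only pairs whose both words appear in the relation-classification vocabulary are kept, which keeps the co-occurrence data aligned with each benchmark.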

Evaluated Models
For baselines, we considered traditional distributional models: Concat (Baroni et al., 2012) and Diff (Weeds et al., 2014); path-based models: NPB (Shwartz et al., 2016b) and LexNET (Shwartz and Dagan, 2016) (which integrates both distributional and pure path-based data); NPB+Aug and LexNET+Aug (Washio and Kato, 2018), in which the base models are trained on augmented dependency paths, used to improve coverage; and the recent state-of-the-art model SphereRE (Wang et al., 2019). Note that SphereRE performs a pre-training phase for generating initial pseudo labels, and the (unlabeled) test data is used for both this phase and the training. Our method does not require the test data and does not perform an initial classification before training. We refer readers to the previous works for detailed descriptions of these baselines. Note that (Washio and Kato, 2018) reported only the F1 scores over the models that were trained using augmented dependency paths.

Ablation Study
In order to assess the contribution of each component in our model, we perform an ablation study. First, we denote WBR as the full model that is described in Section 2. In addition, we consider the following ablated versions of WBR. BR: In this version, we omit the within loss and do not learn U and V. Instead, we set U = A and V = B and keep them fixed for the entire optimization procedure. This yields a loss that contains only the between-relation likelihood and the prior terms. BR+co: In this version, we omit all the relations from the within-relation loss, except for the co (co-occurrence) relation. In other words, we change the WBR loss to include only the between-relation likelihood, the co-occurrence likelihood, and the priors.

Hyperparameters Configuration
We set the projection dimension to d_k = 15, the precisions to λ = τ = 10^-4, and the temperature to α = 5. Increasing d_k or changing the precisions or the temperature did not improve the performance of WBR over the validation sets. We used the Adam optimizer (Kingma and Ba, 2015) (as OPT from Algorithm 1) with a minibatch size of 32. Similar to Wang et al. (2019), we initialized the word-level representations to the pretrained 300-dimensional FastText word embeddings (Bojanowski et al., 2017). The SphereRE model uses fixed FastText word embeddings and learns only relation embeddings, whereas we train the relation projections and continue training the word embeddings. For the rest of the baselines, the hyperparameters are adopted from the corresponding papers. As a stopping criterion for training, we used the F1 score computed on the validation set within each dataset. For each test set, we report the averaged precision, recall, and F1 score for each lexical relation.

Results
The results of WBR and all of the baselines over the datasets are summarized in Table 1. Overall, our WBR model provides competitive performance compared to the tested baseline models. The recent SphereRE model outperforms the basic BR ablation on all the datasets. Adding the within-relation objective with the co-occurrence relation only (BR+co) improved the performance, bringing the results on most benchmarks close to SphereRE. Adding the random relation together with the within-relation mechanism increased the performance gain on EVALution; this improvement is reasonable, since this dataset originally does not contain the random relation and is extremely imbalanced compared to the rest of the datasets (Wang et al., 2019), and the effect can be explained by the mitigation of the lexical memorization problem. Finally, adding the relations data to the within-relation objective (the full WBR model) yielded an additional performance gain, causing the model to slightly outperform SphereRE over three datasets.

Mitigating the Lexical Memorization Problem
The lexical memorization problem is alleviated by the introduction of the random relation. This feature plays a key role in both the between and within loss terms: given a word-pair (a, b) and its corresponding ground-truth relation r, we randomly sample a word n and associate the word-pair (a, n) with the random relation, unless r happens to be the random relation to begin with (see Algorithm 1). This mechanism is designed to balance each positive word-pair with a negative one, neutralizing the effect of multiple instances of the prototypical terms (e.g., animal, fruit, etc.) on the training objective, acting as a kind of regularization and data augmentation. For example, consider the positive data sample (animal, b). It will be balanced with a negative sample (animal, n). Therefore, during the training phase, the between classifier learns to consider the random label whenever it encounters a word that frequently carries the hypernymy label, and the corresponding within (hypernymy) classifier encounters a negative sample for each positive sample. As a result, in the inference phase, the relation classifier does not always predict the hypernym relation for (animal, x); it considers the features of x as well, and thus mitigates the lexical memorization problem.

Another way to ensure that each relation classifier exploits both words in the given pair is splitting them across two different sets of linear projections: Ψ for the first word and Φ for the second. Further, using the cosine similarity measure for computing f^k_ij, rather than a dot product, provides a normalization effect which neutralizes frequency biases, caused by the typically larger norms of frequent words.
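The normalization effect of cosine over dot product can be seen in a toy example with synthetic vectors (our own illustration, not the paper's data): a frequent word's larger norm inflates its dot-product scores but leaves the cosine unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
direction = rng.normal(size=50)
rare = direction / np.linalg.norm(direction)  # unit-norm "rare" word vector
frequent = 10.0 * rare                        # same direction, larger norm (as for frequent words)
query = rng.normal(size=50)

def cos(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# The dot product is inflated by the norm; the cosine is not.
assert abs(frequent @ query) > abs(rare @ query)
assert np.isclose(cos(frequent, query), cos(rare, query))
```

This is why a dot-product scorer can systematically favor frequent prototypical words, while the cosine-based f^k_ij scores depend only on the angle in the relation sub-space.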

Conclusions
We presented WBR - a novel model for lexical relation classification. WBR facilitates the novel between-within relation loss, enabling the exploitation of both relational and distributional information. WBR is evaluated on four different datasets, where it is shown to be competitive with, and in several cases outperform, various strong baselines.
Acknowledgments

Labs, the Israel Science Foundation grant 1951/17 and the German Research Foundation through the German-Israeli Project Cooperation (DIP, grant DA 1600/1-1).