SphereRE: Distinguishing Lexical Relations with Hyperspherical Relation Embeddings

Lexical relations describe how the meanings of terms relate to each other. Typical examples include hypernymy, synonymy, meronymy, etc. Automatic distinction of lexical relations is vital for NLP applications, and also challenging due to the lack of contextual signals to discriminate between such relations. In this work, we present a neural representation learning model to distinguish lexical relations among term pairs based on Hyperspherical Relation Embeddings (SphereRE). Rather than learning embeddings for individual terms, the model learns representations of relation triples by mapping them to the hyperspherical embedding space, where relation triples of different lexical relations are well separated. Experiments over several benchmarks confirm that SphereRE outperforms state-of-the-art methods.

Due to its importance, automatic acquisition of lexical relations has long been a research focus in NLP. In early years, lexical relations in WordNet were manually compiled by linguists (Miller, 1995). More recently, path-based and distributional approaches have become the two major paradigms for classifying a term pair into a fixed inventory of lexical relations, or for predicting it as random, meaning the two terms are unrelated (Wang et al., 2017a). Path-based approaches use dependency paths connecting two terms to infer lexical relations (Washio and Kato, 2018a; Roller et al., 2018). These paths usually describe relations between terms explicitly, but require the two terms to co-occur in a sentence, leading to the "low coverage" problem. Apart from Hearst patterns (Hearst, 1992), there are few high-quality textual patterns for recognizing lexical relations other than hypernymy. Distributional approaches consider the global contexts of terms to predict lexical relations using word embeddings (Baroni et al., 2012; Glavas and Vulic, 2018). They are reported to outperform several path-based approaches, but can suffer from the "lexical memorization" problem (Levy et al., 2015). This is because some supervised distributional approaches learn properties of the two terms separately, instead of how the two terms relate to each other in the embedding space. In this paper, we aim at improving distributional approaches by learning lexical relation representations in a hyperspherical embedding space, named Hyperspherical Relation Embeddings (SphereRE). Consider the example w.r.t. car in Figure 1. Word embeddings of these terms are similar to each other due to their contextual similarity. Hence, embedding offsets of term pairs cannot distinguish the three types of lexical relations well (i.e., hypernymy, synonymy and meronymy).
Instead of learning individual term embeddings, we directly map all the relation triples to the hyperspherical embedding space such that different types of lexical relations have diverse embeddings in terms of angles. For example, the angle between embeddings of (car, hypernymy, vehicle) and (car, synonymy, auto) is large. In contrast, that of (car, synonymy, automobile) and (car, synonymy, auto) is small. As a result, different types of lexical relations can be distinguished. Moreover, by learning representations of lexical relation triples explicitly, our work addresses "lexical memorization" (Levy et al., 2015) from a distributional aspect.
To learn SphereRE vectors for lexical relation triples, we minimize the embedding distances of term pairs that are likely to share the same lexical relation in both labeled and unlabeled data, and maximize the embedding distances of different lexical relations. The distances in the hyperspherical space are defined based on the angles between embeddings. In this work, we first propose a relation-aware semantic projection model to estimate probabilistic distributions of lexical relations over unlabeled data. The SphereRE vectors are then efficiently learned by Monte-Carlo techniques via transductive learning. Finally, a neural network classifier is trained over all the features to make the final predictions of lexical relations over all unlabeled data.
We evaluate SphereRE over four benchmark datasets and the CogALex-V shared task (Santus et al., 2016a), and confirm that SphereRE is highly effective, outperforming state-of-the-art methods. We also evaluate the embedding quality of SphereRE.
The rest of this paper is organized as follows. Section 2 summarizes related work. We present SphereRE in Section 3. Experiments are presented in Section 4, and we conclude in Section 5.

Related Work
We briefly overview related work on lexical relation classification and hyperspherical learning.

Lexical Relation Classification
Among all methods, path-based and distributional approaches are the two major paradigms. For hypernymy relations, Hearst patterns (Hearst, 1992) are frequently employed lexical patterns, summarized in Wang et al. (2017a). Shwartz et al. (2016) employ an LSTM-based neural network to learn representations of dependency paths. Roller et al. (2018) use Hearst pattern based statistics derived from a large text corpus to detect hypernymy relations. For other lexical relations, LexNET (Shwartz and Dagan, 2016) extends this approach to classify multiple types of lexical relations based on an integrated neural network. This type of method requires that the two terms co-occur in a sentence. Washio and Kato (2018a) address the "low coverage" issue by augmenting dependency paths.
Distributional approaches employ term representations to predict lexical relations, exploiting the global contexts of terms. Traditional methods use a combination of the two terms' embeddings as the representation, such as vector concatenation (Baroni et al., 2012; Roller and Erk, 2016), vector difference (Roller et al., 2014; Weeds et al., 2014; Vylomova et al., 2016), etc. A classifier is then trained over these features to predict lexical relations. Although distributional methods do not require the co-occurrence of the two terms, they suffer from "lexical memorization" (Levy et al., 2015): the algorithms only learn the properties of the two terms individually, rather than the relations between them. Recently, more sophisticated neural networks have been proposed. Glavas and Vulic (2018) propose a Specialization Tensor Model to discriminate between four lexical relations. The model learns different specializations of the input distributional embeddings w.r.t. term pairs in order to predict different types of lexical relations. Attia et al. (2016) employ a convolutional neural network in a multitask setting. Nguyen et al. (2016, 2017b) distinguish antonymy and synonymy via word embeddings and path-based neural networks. Similar research is presented in Hashimoto et al. (2015) and Washio and Kato (2018b). A few works learn relation embeddings for other NLP applications (Joshi et al., 2018).
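As an illustration, the Concat and Diff feature schemes mentioned above can be sketched with hypothetical 4-dimensional embeddings (the values are made up for illustration):

```python
import numpy as np

# Toy pre-trained embeddings (hypothetical values for illustration).
emb = {
    "car":     np.array([0.9, 0.1, 0.3, 0.5]),
    "vehicle": np.array([0.8, 0.2, 0.4, 0.6]),
}

def concat_features(x, y):
    # Concat (Baroni et al., 2012): [x; y], dimension 2d.
    return np.concatenate([emb[x], emb[y]])

def diff_features(x, y):
    # Diff (Weeds et al., 2014): y - x, dimension d.
    return emb[y] - emb[x]

f_cat = concat_features("car", "vehicle")
f_diff = diff_features("car", "vehicle")
print(f_cat.shape, f_diff.shape)  # (8,) (4,)
```

A classifier trained on such features sees the two terms' properties directly, which is precisely what makes "lexical memorization" possible.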
Another research direction is to learn specialized embeddings. Yu et al. (2015), Luu et al. (2016), Nguyen et al. (2017a) and Vulic and Mrksic (2018) (among others) learn hypernymy embeddings that consider the hierarchical structure of hypernymy relations. For other lexical relations, Mrksic et al. (2017) present the Attract-Repel model to improve the quality of word embeddings for synonymy recognition. However, these methods focus on one particular lexical relation and are not capable of distinguishing multiple types of lexical relations.

Hyperspherical Learning
Work on hyperspherical learning is mostly found in computer vision. Liu et al. (2017) propose a hyperspherical network (SphereNet) for image classification. It learns angular representations on hyperspheres using hyperspherical convolution units. Wang et al. (2017c) apply the L2 hypersphere embedding technique to face verification, optimizing cosine similarity for feature normalization. In NLP, hyperspherical learning has not been extensively used. Masumura et al. (2017) introduce hyperspherical query likelihood models for information retrieval. Mei and Wang (2016) leverage hyperspherical clustering for document categorization. Lv et al. (2018) consider sphere representations as knowledge graph embeddings. To our knowledge, few methods employ hyperspherical learning to learn representations for NLP applications. In our work, we focus on lexical relation classification and present the SphereRE model to address this problem.

The SphereRE Model
We introduce the Hyperspherical Relation Embedding (SphereRE) model in detail.

Learning Objective
We start with some basic notations. Let D and U be the labeled and unlabeled sets, consisting of term pairs (x_i, y_i). Each pair (x_i, y_i) corresponds to a pre-defined lexical relation type r_i ∈ R. (If a dataset contains term pairs of several lexical relation types together with random, unrelated term pairs, we treat "random" as a special lexical relation type.) The task of our work is to predict the lexical relation type r_i for each pair (x_i, y_i) ∈ U based on D.

Denote by x_i (resp. y_i, in vector notation) the embedding of the word x_i (resp. y_i), pre-trained using any neural language model. For each lexical relation type r_m ∈ R, we learn a mapping function f_m(x_i) that maps the relation subject x_i to the relation object y_i in the embedding space if x_i and y_i have the lexical relation type r_m. Hence, we aim at minimizing the following objective function J_f, with I(·) as the indicator function:

$$J_f = \sum_{(x_i, y_i) \in D} \sum_{r_m \in R} I(r_i = r_m) \cdot \left\| f_m(\vec{x}_i) - \vec{y}_i \right\|_2^2$$
To represent lexical relation triples in the (original) embedding space, we utilize the vector difference model (Roller et al., 2014; Weeds et al., 2014; Vylomova et al., 2016). Combined with the model J_f, given a term pair (x_i, y_i) with lexical relation type r_i, the representation of the triple is

$$\vec{v}(x_i, y_i) = f_{r_i}(\vec{x}_i) - \vec{x}_i$$

Next, we consider the hyperspherical learning objective. Based on the assumption illustrated in Figure 1, we define a symmetric function g(·, ·) to quantify the distance between two representations of lexical relation triples in the SphereRE space. Because we aim at learning representations for both labeled and unlabeled data in order to make predictions, for two pairs (x_i, y_i) and (x_j, y_j) with lexical relation types r_i and r_j ((x_i, y_i), (x_j, y_j) ∈ D ∪ U), we minimize the following function:

$$\delta(r_i, r_j) \cdot g\left(\vec{v}(x_i, y_i), \vec{v}(x_j, y_j)\right)$$

where δ(r_i, r_j) is the sign function that returns 1 if the two pairs share the same lexical relation type (i.e., r_i = r_j) and -1 otherwise. Hence, embedding distances of term pairs that share the same lexical relation type are minimized, while embedding distances of term pairs with different lexical relation types are maximized. Refer to Figure 2 for a geometric interpretation of the objective: v(car, auto) is the embedding vector w.r.t. the term pair (car, auto) (i.e., f_{r_i}(x_i) − x_i) based on the vector difference model and J_f, characterizing the lexical relation between the two terms in the original embedding space; for simplicity, each embedding is drawn as an arrow. In summary, the objective function J_g of lexical relation representation learning in the hyperspherical embedding space is defined as follows:

$$J_g = \sum_{(x_i, y_i) \in D \cup U} \sum_{(x_j, y_j) \in D \cup U} \delta(r_i, r_j) \cdot g\left(\vec{v}(x_i, y_i), \vec{v}(x_j, y_j)\right)$$

Let Θ be all parameters in the model. The general objective of SphereRE is then defined, with λ_1 and λ_2 as balancing hyperparameters, as:

$$J(\Theta) = \lambda_1 J_f + \lambda_2 J_g$$

It is computationally intractable to minimize J(Θ) directly.
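As a geometric intuition, the hyperspherical distance g(·, ·) can be instantiated with the angle between two relation-triple vectors. The following minimal numpy sketch (with hypothetical offset vectors; the paper does not prescribe this exact instantiation) illustrates why triples of the same relation type should end up with a small angular distance:

```python
import numpy as np

def angular_distance(u, v):
    # Distance based on the angle between two relation embeddings:
    # small for triples of the same relation type, large otherwise.
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))  # angle in [0, pi]

# Hypothetical offset vectors for two synonymy triples and one hypernymy triple.
v_syn1 = np.array([0.9, 0.1, 0.0])   # e.g., v(car, auto)
v_syn2 = np.array([0.8, 0.2, 0.1])   # e.g., v(car, automobile)
v_hyp  = np.array([-0.1, 0.9, 0.3])  # e.g., v(car, vehicle)

assert angular_distance(v_syn1, v_syn2) < angular_distance(v_syn1, v_hyp)
```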
The reasons are twofold: i) the lexical relation types r_i of all pairs (x_i, y_i) ∈ U must be predicted before we can minimize J(Θ); and ii) the definition of J_g does not directly specify how to generate the representations of lexical relation triples. Additionally, minimizing J(Θ) requires a traversal of D and U in quadratic time, leading to high computational cost.
In the following, we present a relation-aware semantic projection model as the function f m (·). It is employed to approximate r i (for all (x i , y i ) ∈ U ). Next, the representation learning process of lexical relation triples and the lexical relation classification algorithms are introduced in detail.

Relation-aware Semantic Projection
For each pair (x_i, y_i) ∈ U, we approximate r_i from a probabilistic perspective, as an initial prediction step. Following Wang and He (2016), Yamane et al. (2016) and Wang et al. (2017b), for each lexical relation type r_m ∈ R, we utilize a mapping matrix M_m ∈ R^{d×d} as f_m(x_i), where d is the dimension of the pre-trained word embeddings. After adding a Tikhonov regularizer on M_m (with regularization weight μ), the learning objective function J_m w.r.t. one specific lexical relation type r_m ∈ R over D can be re-written as follows:

$$J_m = \sum_{(x_i, y_i) \in D:\, r_i = r_m} \left\| M_m \vec{x}_i - \vec{y}_i \right\|_2^2 + \mu \left\| M_m \right\|_F^2$$

The minimization of J_m has a closed-form solution. The optimal solution M*_m satisfies:

$$M_m^{*\top} = \left(X_m^\top X_m + \mu E\right)^{-1} X_m^\top Y_m \quad (1)$$

where X_m and Y_m are two n_m × d data matrices, with n_m being the number of term pairs that have the lexical relation type r_m ∈ R in D. The i-th rows of X_m and Y_m are the embedding vectors of the i-th sample (x_i, y_i) ∈ D that has the lexical relation type r_m. E is the d × d identity matrix.
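A minimal numpy sketch of the closed-form solution in Eq. (1), on synthetic data with hypothetical sizes (d = 4, n_m = 50) and μ = 0.001 as in the experimental settings:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_m, mu = 4, 50, 0.001   # toy dimension/sample size; mu from the paper's settings

# Synthetic training pairs for one relation type r_m: rows of X_m are subject
# embeddings, rows of Y_m are object embeddings generated by a hidden map M_true.
M_true = rng.normal(size=(d, d))
X_m = rng.normal(size=(n_m, d))
Y_m = X_m @ M_true.T + 0.01 * rng.normal(size=(n_m, d))

# Closed-form ridge solution: M_m^T = (X^T X + mu * E)^{-1} X^T Y.
E = np.eye(d)
M_m = (np.linalg.inv(X_m.T @ X_m + mu * E) @ X_m.T @ Y_m).T

# The learned projection should approximately map x_i to y_i.
residual = np.linalg.norm(X_m @ M_m.T - Y_m) / np.linalg.norm(Y_m)
assert residual < 0.05
```

Because the solution is closed-form, training one projection per relation type is fast even for large D.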
For each lexical relation type r_m ∈ R, we train a semantic projection model based on Eq. (1). After that, a simple lexical relation prediction classifier is trained over D based on the following feature representation:

$$F(x_i, y_i) = \left(M_1 \vec{x}_i - \vec{y}_i\right) \oplus \cdots \oplus \left(M_{|R|} \vec{x}_i - \vec{y}_i\right)$$

where ⊕ is the vector concatenation operator, and M_1, ..., M_{|R|} are the projection matrices w.r.t. the |R| lexical relation types r_1, ..., r_{|R|}.
Based on J m , if (x i , y i ) has the lexical relation type r m , the norm of M m x i − y i is likely to be small. On the contrary, the norms of M n x i − y i (1 ≤ n ≤ |R|, n = m) are likely to be large. Therefore, the features are highly discriminative for lexical relation classification.
For each pair (x i , y i ) ∈ U , the classifier outputs an |R|-dimensional probabilistic distribution over all lexical relation types R. In this work, we denote p i,m as the probability of (x i , y i ) ∈ U having the lexical relation type r m ∈ R.

Relation Representation Learning
After we have computed the probability p_{i,m} for all (x_i, y_i) ∈ U and all r_m ∈ R, we focus on the objective J_g. The goal is to learn a d_r-dimensional vector r_i for each (x_i, y_i) ∈ D ∪ U, regarded as the representation of the lexical relation triple (named the SphereRE vector).
To avoid the high complexity and the propagation of prediction errors, inspired by Perozzi et al. (2014) and Grover and Leskovec (2016), we reformulate J_g and the function g(·, ·) via the Skip-gram model (Mikolov et al., 2013a) over neighborhood graphs. Let Nb(x_i, y_i) be the neighbors of a term pair (x_i, y_i) in the SphereRE space, where each term pair (x_j, y_j) ∈ Nb(x_i, y_i) is likely to share the same lexical relation type as (x_i, y_i). To ensure that term pairs with the same lexical relation type have similar SphereRE vectors, the problem of optimizing J_g can be reformulated as maximizing the probability of predicting the neighbors of (x_i, y_i) given its SphereRE vector r_i. Therefore, we define a new objective function, still denoted J_g, to replace the original one, based on the negative log likelihood:

$$J_g = -\sum_{(x_i, y_i) \in D \cup U} \; \sum_{(x_j, y_j) \in Nb(x_i, y_i)} \log \Pr\left((x_j, y_j) \mid \vec{r}_i\right) \quad (2)$$

A remaining problem is to define the neighborhood Nb(x_i, y_i) properly, so as to preserve the hyperspherical similarity property of the distance function g(·, ·). In this work, we introduce a weight factor w_{i,j} ∈ [0, 1] w.r.t. two pairs (x_i, y_i) and (x_j, y_j) in D ∪ U that quantifies the similarity between the two pairs in the SphereRE space. If (x_i, y_i) ∈ D and (x_j, y_j) ∈ D, because the true lexical relation types are known, we simply have w_{i,j} = I(r_i = r_j).

Table 1: The choices of the weight factor w_{i,j}.

Condition | Value of w_{i,j}
(x_i, y_i) ∈ D, (x_j, y_j) ∈ D, r_i = r_j | 1
(x_i, y_i) ∈ D, (x_j, y_j) ∈ D, r_i ≠ r_j | 0
(x_i, y_i) ∈ D with r_i = r_m, (x_j, y_j) ∈ U | (1/2) p_{j,m} (cos(M_m x_i − x_i, M_m x_j − x_j) + 1)
(x_i, y_i) ∈ U, (x_j, y_j) ∈ D with r_j = r_m | (1/2) p_{i,m} (cos(M_m x_i − x_i, M_m x_j − x_j) + 1)
(x_i, y_i) ∈ U, (x_j, y_j) ∈ U | (1/2) Σ_{r_m ∈ R} p_{i,m} p_{j,m} (cos(M_m x_i − x_i, M_m x_j − x_j) + 1)
We continue with the other conditions. If i) (x_i, y_i) ∈ D has the lexical relation type r_m, and ii) the lexical relation type of (x_j, y_j) ∈ U is unknown but is predicted to be r_m with probability p_{j,m}, the similarity between (x_i, y_i) and (x_j, y_j) in terms of angles is defined using a weighted cosine similarity in the range [0, 1]:

$$w_{i,j} = \frac{1}{2}\, p_{j,m} \left( \cos\left(M_m \vec{x}_i - \vec{x}_i,\; M_m \vec{x}_j - \vec{x}_j\right) + 1 \right)$$

A symmetric case holds for (x_i, y_i) ∈ U and (x_j, y_j) ∈ D. If (x_i, y_i) ∈ U and (x_j, y_j) ∈ U, because the lexical relation types of both pairs are unknown, we compute the weight w_{i,j} by summing the weighted cosine similarities over all possible lexical relation types in R:

$$w_{i,j} = \frac{1}{2} \sum_{r_m \in R} p_{i,m}\, p_{j,m} \left( \cos\left(M_m \vec{x}_i - \vec{x}_i,\; M_m \vec{x}_j - \vec{x}_j\right) + 1 \right)$$

Readers can also refer to Table 1 for a summary of the choices of w_{i,j}.
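The weight computations in Table 1 can be sketched as follows (hypothetical projection matrices and probabilities; `rel_vec` stands for the relation representation M_m x − x):

```python
import numpy as np

def rel_vec(M, x):
    # Relation representation of a pair under projection M: M x - x.
    return M @ x - x

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def weight_labeled_unlabeled(M_m, x_i, x_j, p_jm):
    # (x_i, y_i) labeled with type r_m; (x_j, y_j) unlabeled,
    # predicted to have type r_m with probability p_jm.
    return 0.5 * p_jm * (cos(rel_vec(M_m, x_i), rel_vec(M_m, x_j)) + 1.0)

def weight_unlabeled_unlabeled(Ms, x_i, x_j, p_i, p_j):
    # Both pairs unlabeled: sum weighted similarities over all relation types.
    return 0.5 * sum(
        p_i[m] * p_j[m] * (cos(rel_vec(Ms[m], x_i), rel_vec(Ms[m], x_j)) + 1.0)
        for m in range(len(Ms))
    )

rng = np.random.default_rng(1)
Ms = [rng.normal(size=(3, 3)) for _ in range(2)]   # toy projections for |R| = 2
x_i, x_j = rng.normal(size=3), rng.normal(size=3)
w = weight_labeled_unlabeled(Ms[0], x_i, x_j, p_jm=0.8)
assert 0.0 <= w <= 1.0
```

Since cos(·, ·) + 1 lies in [0, 2] and the probabilities lie in [0, 1], every weight stays within [0, 1], matching the definition of w_{i,j}.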
To reduce the computational complexity, we propose a Monte-Carlo based sampling and learning method to learn SphereRE vectors based on the values of w_{i,j}; the procedure is illustrated in Algorithm 1. It starts with the random initialization of the SphereRE vector r_i for each (x_i, y_i) ∈ D ∪ U. An iterative process randomly selects one pair (x_i, y_i) as the starting point. The next pair (x_j, y_j) is then selected with the following probability:

$$\Pr\left((x_j, y_j) \mid (x_i, y_i)\right) = \frac{w_{i,j}}{\sum_{(x_k, y_k) \in D_{mini}} w_{i,k}} \quad (3)$$

where D_mini is a mini-batch of term pairs randomly sampled from D ∪ U. In this way, the algorithm only needs to traverse |D_mini| pairs instead of |D| + |U| pairs. This process continues, resulting in a sequence of pairs S = {(x_1, y_1), (x_2, y_2), ..., (x_{|S|}, y_{|S|})}. Denote by l the window size. We approximate J_g in Eq. (2) by

$$-\sum_{(x_i, y_i) \in S} \; \sum_{j=i-l,\, j \neq i}^{i+l} \log \Pr\left((x_j, y_j) \mid \vec{r}_i\right)$$

using the negative sampling training technique of the Skip-gram model (Mikolov et al., 2013a,b).
The values of the SphereRE vectors r_i are continuously updated until all the iterations stop. The resulting vectors are low-dimensional representations of lexical relation triples, encoded in the hyperspherical space. The process is shown in Algorithm 1.
Algorithm 1 SphereRE Learning
1: for each (x_i, y_i) ∈ D ∪ U do
2:     Randomly initialize SphereRE vector r_i;
3: end for
4: for i = 1 to max_iteration do
5:     Sample a sequence S based on Eq. (3);
6:     Update all SphereRE vectors r_i by minimizing $-\sum_{(x_i, y_i) \in S} \sum_{j=i-l,\, j \neq i}^{i+l} \log \Pr((x_j, y_j) \mid \vec{r}_i)$;
7: end for

In practice, we find that the sampling process has a drawback. Because the predictions for all (x_i, y_i) ∈ U are probabilistic, the algorithm prefers to choose term pairs in D to form the sequence S. The resulting low sampling rate of U leads to poor representation learning quality for these pairs. Hence, we employ a boosting approach based on stratified sampling to increase the chance of (x_i, y_i) ∈ U being selected: the values of all probabilities p_{i,m} are multiplied by a factor γ > 1, i.e., p_{i,m} ← γ p_{i,m}.
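The sampling step of Algorithm 1 can be sketched in plain Python as follows (the `weights` function is a hypothetical stand-in for the w_{i,j} values of Table 1):

```python
import random

def sample_sequence(pairs, weights, mini_batch_size, seq_len, rng):
    """Monte-Carlo sampling of a pair sequence (the sampling step of Algorithm 1).

    `weights(i, j)` returns the similarity weight w_{i,j}; here it is a
    hypothetical stand-in for the values defined in Table 1.
    """
    seq = [rng.randrange(len(pairs))]           # random starting pair
    while len(seq) < seq_len:
        i = seq[-1]
        batch = rng.sample(range(len(pairs)), mini_batch_size)  # D_mini
        ws = [weights(i, j) for j in batch]
        total = sum(ws)
        if total == 0:                          # degenerate mini-batch: restart
            seq.append(rng.randrange(len(pairs)))
            continue
        # Select the next pair j with probability w_{i,j} / sum over D_mini (Eq. 3).
        r, acc = rng.random() * total, 0.0
        for j, w in zip(batch, ws):
            acc += w
            if acc >= r:
                seq.append(j)
                break
        else:
            seq.append(batch[-1])
    return [pairs[i] for i in seq]

rng = random.Random(42)
pairs = [("car", "vehicle"), ("car", "auto"), ("wheel", "car"), ("dog", "animal")]
seq = sample_sequence(pairs, lambda i, j: 1.0 if i != j else 0.0, 2, 10, rng)
assert len(seq) == 10 and all(p in pairs for p in seq)
```

The sampled sequences then play the role of "sentences" for Skip-gram training with negative sampling, yielding the SphereRE vectors.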

Lexical Relation Classification
Finally, we train a lexical relation classifier. For each pair (x_i, y_i) ∈ D, we train a classifier over the projection-based features F(x_i, y_i) together with the SphereRE vector r_i, which encodes the relation triple in the SphereRE space; we denote the combined feature vector by F*(x_i, y_i).
We follow Shwartz and Dagan (2016) in using a fully-connected feed-forward neural network, shown in Figure 3. The input layer has |R| × d + d_r nodes. We add only one hidden layer, followed by an |R|-dimensional output layer with softmax as the prediction function. The neural network is trained using stochastic gradient descent, and is employed to predict the lexical relations of all (x_i, y_i) ∈ U. The high-level procedure is summarized in Algorithm 2.

Algorithm 2 Lexical Relation Classification
1: for each lexical relation type r_m ∈ R do
2:     Compute M*_m by Eq. (1);
3: end for
4: Train a classifier over D on the features F(x_i, y_i);
5: for each pair (x_i, y_i) ∈ U do
6:     Predict the distribution p_{i,m} by the classifier;
7: end for
8: Learn r_i for all (x_i, y_i) ∈ D ∪ U by Algorithm 1;
9: Train a neural network over D on the features F*(x_i, y_i);
10: for each pair (x_i, y_i) ∈ U do
11:     Predict the lexical relation r_i by the neural network;
12: end for
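A minimal numpy sketch of the forward pass of the network in Figure 3, with hypothetical sizes (|R| = 4, d = 50, d_r = 30, 64 hidden nodes); the training procedure (stochastic gradient descent) is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical sizes: |R| = 4 relation types, d = 50, d_r = 30.
R, d, d_r, hidden = 4, 50, 30, 64
in_dim = R * d + d_r                       # |R| x d projection residuals + SphereRE vector

# One hidden layer followed by an |R|-way softmax output, as in Figure 3.
W1 = rng.normal(0, 0.1, size=(in_dim, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(0, 0.1, size=(hidden, R));      b2 = np.zeros(R)

def predict(features):
    h = np.tanh(features @ W1 + b1)        # hidden layer
    return softmax(h @ W2 + b2)            # distribution over |R| relation types

probs = predict(rng.normal(size=(in_dim,)))
assert probs.shape == (R,) and abs(probs.sum() - 1.0) < 1e-9
```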

Experiments
In this section, we conduct extensive experiments to evaluate SphereRE and compare it with state-of-the-art methods.

Datasets and Experimental Settings
In the experiments, we train a fastText model (Bojanowski et al., 2017) over the English Wikipedia corpus to generate term embeddings. The dimensionality d is set to 300. To evaluate the effectiveness of SphereRE, we use four public datasets for multi-way classification of lexical relations: K&H+N (Necsulescu et al., 2015), BLESS (Baroni and Lenci, 2011), ROOT09 (Santus et al., 2016b) and EVALution (Santus et al., 2015). We also evaluate SphereRE over the subtask 2 of the CogALex-V shared task (Santus et al., 2016a). The statistics are summarized in Table 2.
We follow the same experimental settings as prior work to partition the four public datasets into training, validation and testing sets. The partition of the CogALex dataset follows the default settings of the CogALex-V shared task (Santus et al., 2016a). The default settings for SphereRE are as follows: μ = 0.001, d_r = 300, |D_mini| = 20, |S| = 100, γ = 2 and l = 3. We run Algorithm 1 for 500 iterations. We also report afterwards how changes to the neural network architecture and parameters affect the performance over the validation sets. It should be further noted that we do not set values for λ_1 and λ_2 in the implementation, because we employ sampling-based techniques to learn r_i instead of directly optimizing J(Θ).

Experiments over Four Public Datasets
We report the results of SphereRE and compare it with state-of-the-art over four public datasets.

General Performance
To compare SphereRE with others, we consider the following baselines:

• Concat (Baroni et al., 2012) and Diff (Weeds et al., 2014): classical distributional methods using vector concatenation and vector difference as features, respectively. A neural network without hidden layers is trained over the features.

• NPB: a path-based LSTM neural network for classifying lexical relations, implemented by Shwartz and Dagan (2016) and considering dependency paths only.

Table 2: Dataset statistics (number of term pairs per relation; "–" means the relation is absent).

Relation | K&H+N | BLESS | ROOT09 | EVALution | CogALex
Antonym | – | – | – | 1,600 | 601
Attribute | – | 2,731 | – | 1,297 | –
Co-hyponym | 25,796 | 3,565 | 3,200 | – | –
Event | – | 3,824 | – | – | –
Holonym | – | – | – | 544 | –
Hypernym | 4,292 | 1,337 | 3,190 | 1,880 | 637
Meronym | 1,043 | 2,943 | – | 654 | 387
Random | 26,378 | 12,146 | 6,372 | – | 5,287
Substance meronym | – | – | – | 317 | –
Synonym | – | – | – | 1,… | …

The results of SphereRE and all the baselines are summarized in Table 3. We compute the Precision, Recall and F1 score for each lexical relation, and report the average scores over all relations, weighted by the support. We can see that classical distributional approaches perform worse than integrated neural networks, because they are not capable of learning the true relations between terms. The proposed approach SphereRE consistently outperforms all the baselines over the four datasets in terms of F1 scores. When the number of lexical relation types becomes larger (e.g., EVALution), the improvement of SphereRE is less significant than on the other datasets (e.g., BLESS, ROOT09). The most probable cause is that errors induced by relation-aware semantic projection are more likely to propagate to subsequent steps.

Study on Neural Network Architectures
We adjust the neural network architecture (shown in Figure 3) and report the performance over the validation sets in Figure 4. As shown, adding more hidden layers does not improve the performance of lexical relation classification. In some datasets (e.g., EVALution), the performance even drops, indicating a sign of overfitting. We change the number of hidden nodes when we use one hidden layer in the network. The results show that the setting does not affect the performance greatly.

Study on Monte-Carlo Sampling
We continue to study how the settings of Monte-Carlo sampling affect the quality of the SphereRE vectors. We adjust the number of iterations and the parameter γ. The performance is shown in Figure 5. As seen, more iterations contribute to higher-quality embeddings. After a sufficient number of iterations (> 500), the performance becomes stable. As for the choice of γ, smaller values lead to low sampling rates of unlabeled data, hence lowering the prediction performance. In contrast, an overly large γ introduces too many errors from relation-aware semantic projection into the sampling process. Hence, a balanced setting of γ is required.

Feature Analysis
We further study whether adding the SphereRE vectors contributes to lexical relation classification. We remove all these embeddings and use the rest of the features to make predictions with the same neural architecture and parameter settings. The results are shown in Table 4. By learning the SphereRE vectors and adding them to the classifier, the performance improves on all four datasets.

Experiments over the CogALex-V Shared Task
We evaluate SphereRE over the CogALex-V shared task (Santus et al., 2016a), where participants are asked to classify 4,260 term pairs into five lexical relations: synonymy, antonymy, hypernymy, meronymy and random. The training set contains 3,054 pairs. This task is especially challenging because i) it considers random relations as noise, discarding them from the averaged F1 score; ii) the training set is small; and iii) it enforces a lexical split of the training and testing sets, disabling "lexical memorization" (Levy et al., 2015). In this shared task, GHHH (Attia et al., 2016) and LexNET (Shwartz and Dagan, 2016) are the two top-performing systems. The most recent work on CogALex-V is STM (Glavas and Vulic, 2018). SphereRE achieves an averaged F1 score of 47.1% (excluding the random relations), outperforming the state-of-the-art. Additionally, as reported in previous studies, the "lexical memorization" effect (Levy et al., 2015) is rather severe for hypernymy relations. Although SphereRE is fully distributional, it achieves the highest F1 score of 53.8% on hypernymy.

Analysis of SphereRE Vector Qualities
We conduct additional experiments to evaluate the quality of the SphereRE vectors. The first set of experiments evaluates whether the top-k most similar relation triples of a given relation triple share the same lexical relation type. This task is called top-k similar lexical relation retrieval. In this task, the similarity between two relation triples is quantified by the cosine similarity of the two corresponding SphereRE vectors. The score is reported as Precision@k. Higher Precision@k scores indicate SphereRE vectors of better quality, because lexical relation triples with the same lexical relation type should have similar SphereRE vectors. In the experiments, we compute Precision@k over all the labeled (training) and unlabeled (testing) sets of all five datasets. The results are shown in Table 6 in terms of Average Precision@k (AP@k) (with k = 1, 5, 10).
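The top-k similar lexical relation retrieval metric can be sketched as follows (toy two-cluster vectors standing in for learned SphereRE vectors):

```python
import numpy as np

def precision_at_k(vecs, labels, k):
    """Average Precision@k: for each triple, retrieve the k nearest triples by
    cosine similarity of SphereRE vectors and count label matches."""
    X = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = X @ X.T
    np.fill_diagonal(sims, -np.inf)            # exclude the query itself
    scores = []
    for i in range(len(labels)):
        topk = np.argsort(-sims[i])[:k]
        scores.append(sum(labels[j] == labels[i] for j in topk) / k)
    return float(np.mean(scores))

# Toy vectors: two tight clusters standing in for two relation types.
rng = np.random.default_rng(0)
vecs = np.vstack([rng.normal([5.0, 0.0], 0.1, (10, 2)),
                  rng.normal([0.0, 5.0], 0.1, (10, 2))])
labels = ["hypernym"] * 10 + ["synonym"] * 10
assert precision_at_k(vecs, labels, k=5) > 0.9
```

Well-separated clusters in the SphereRE space yield Precision@k near 1, matching the behavior reported on the training sets.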
As seen, SphereRE shows near-perfect performance (over 95% for AP@1, over 90% for AP@5 and AP@10) over the training sets of all five datasets. This is because in representation learning, all the labels (i.e., lexical relation types) of these term pairs are already known. Hence, SphereRE preserves the distributional characteristics of these labeled datasets well. As for the unlabeled datasets, the performance drops slightly over K&H+N, BLESS and ROOT09. The performance is not very satisfactory over EVALution and CogALex, due to the internal challenges of lexical relation classification over these two datasets: they contain a relatively large number of lexical relation types and random, unrelated term pairs. To gain a more intuitive understanding of the learned SphereRE vectors, we plot the embeddings in Figure 6 by t-SNE (Maaten and Hinton, 2008). Due to space limitation, we only plot SphereRE vectors for part of the training and testing sets of ROOT09 and EVALution. For the training data, we can see a clear separation of different lexical relation types. The slight "messiness" w.r.t. the testing data indicates learning errors.

Error Analysis
For error analysis, we randomly sample 300 cases of prediction errors and ask human annotators to analyze the most frequent causes. We present several cases in Table 7. The largest number of errors (approximately 42%) occur due to the random relations in K&H+N, BLESS, ROOT09 and CogALex. These relations are large in quantity and blurry in semantics, misleading the classifier to predict other lexical relations as random.
Another large proportion of errors (about 31%) is related to the unbalanced ratio of relations (apart from random). The number of training triples for some lexical relation types is small (e.g., Meronym in EVALution, Synonym in CogALex). As a result, the representation learning w.r.t. these relation triples is of relatively lower quality.

Conclusion and Future Work
In this paper, we present a representation learning model to distinguish lexical relations based on Hyperspherical Relation Embeddings (SphereRE). It learns representations of lexical relation triples by mapping them to the hyperspherical embedding space. The lexical relations between term pairs are predicted using neural networks over the learned embeddings. Experiments over four benchmark datasets and CogALex-V show SphereRE outperforms state-of-the-art methods.
In the future, we will improve our model to deal with datasets containing a relatively large number of lexical relation types and random term pairs. Additionally, the mapping technique used for relation-aware semantic projection can be further improved to model different linguistic properties of lexical relations (e.g., the "one-to-many" mappings for meronymy).