Specialising Word Vectors for Lexical Entailment

We present LEAR (Lexical Entailment Attract-Repel), a novel post-processing method that transforms any input word vector space to emphasise the asymmetric relation of lexical entailment (LE), also known as the IS-A or hyponymy-hypernymy relation. By injecting external linguistic constraints (e.g., WordNet links) into the initial vector space, the LE specialisation procedure brings true hyponymy-hypernymy pairs closer together in the transformed Euclidean space. The proposed asymmetric distance measure adjusts the norms of word vectors to reflect the actual WordNet-style hierarchy of concepts. Simultaneously, a joint objective enforces semantic similarity using the symmetric cosine distance, yielding a vector space specialised for both lexical relations at once. LEAR specialisation achieves state-of-the-art performance in the tasks of hypernymy directionality, hypernymy detection, and graded lexical entailment, demonstrating the effectiveness and robustness of the proposed asymmetric specialisation model.


Introduction
Word representation learning has become a research area of central importance in NLP, with its usefulness demonstrated across application areas such as parsing (Chen and Manning, 2014), machine translation (Zou et al., 2013), and many others (Turian et al., 2010; Collobert et al., 2011). Standard techniques for inducing word embeddings rely on the distributional hypothesis (Harris, 1954), using co-occurrence information from large textual corpora to learn meaningful word representations (Mikolov et al., 2013; Levy and Goldberg, 2014; Pennington et al., 2014; Bojanowski et al., 2017).
A major drawback of the distributional hypothesis is that it coalesces different relationships between words, such as synonymy and topical relatedness, into a single vector space. A popular solution is to go beyond stand-alone unsupervised learning and fine-tune distributional vector spaces by using external knowledge from human- or automatically-constructed knowledge bases. This is often done as a post-processing step, where distributional vectors are gradually refined to satisfy linguistic constraints extracted from lexical resources such as WordNet (Faruqui et al., 2015; Mrkšić et al., 2016), the Paraphrase Database (PPDB) (Wieting et al., 2015), or BabelNet (Vulić et al., 2017a). One advantage of post-processing methods is that they treat the input vector space as a black box, making them applicable to any input space.
A key property of these methods is their ability to transform the vector space by specialising it for a particular relationship between words. Prior work has predominantly focused on distinguishing between semantic similarity and conceptual relatedness (Faruqui et al., 2015; Vulić et al., 2017b). In this paper, we introduce a novel post-processing model which specialises vector spaces for the lexical entailment (LE) relation.
Our novel LE specialisation model, termed LEAR (Lexical Entailment Attract-Repel), is inspired by ATTRACT-REPEL, a state-of-the-art general specialisation framework (https://github.com/nmrksic/attract-repel). The key idea of LEAR, illustrated by Figure 1, is to pull desirable (ATTRACT) examples described by the constraints closer together, while at the same time pushing undesirable (REPEL) word pairs away from each other. Concurrently, LEAR (re-)arranges vector norms so that norm values in the Euclidean space reflect the hierarchical organisation of concepts according to the given LE constraints: put simply, higher-level concepts are assigned larger norms. Therefore, LEAR simultaneously captures the hierarchy of concepts (through vector norms) and their similarity (through their cosine distance).
[Figure 1, caption fragment: ... and 2) by imposing an LE ordering using vector norms, adjusting them so that higher-level concepts have larger norms.]
The two pivotal pieces of information are combined into an asymmetric distance measure which quantifies the LE strength in the specialised space. After specialising four well-known input vector spaces with LEAR, we test them in three standard word-level LE tasks (Kiela et al., 2015b): 1) hypernymy directionality; 2) hypernymy detection; and 3) combined hypernymy detection/directionality. Our specialised vectors yield notable improvements over the strongest baselines for each task, with each input space, demonstrating the effectiveness and robustness of LEAR specialisation.
The employed asymmetric distance allows one to make graded assertions about hierarchical relationships between concepts in the specialised space. This property is evaluated using HyperLex, a recent graded LE dataset. The LEAR-specialised vectors push state-of-the-art Spearman's correlation from 0.540 to 0.686 on the full dataset (2,616 word pairs), and from 0.512 to 0.705 on its noun subset (2,163 word pairs).
The code for the LEAR model is available from: github.com/nmrksic/lear.

The ATTRACT-REPEL Framework
Let $V$ be the vocabulary, $A$ the set of ATTRACT word pairs (e.g., intelligent and brilliant), and $R$ the set of REPEL word pairs (e.g., vacant and occupied). The ATTRACT-REPEL procedure operates over mini-batches of such pairs, $B_A$ and $B_R$. For ease of notation, let each word pair $(x_l, x_r)$ in these two sets correspond to a vector pair $(\mathbf{x}_l, \mathbf{x}_r)$, so that a mini-batch of $k_1$ word pairs is given by $B_A = [(\mathbf{x}_l^1, \mathbf{x}_r^1), \ldots, (\mathbf{x}_l^{k_1}, \mathbf{x}_r^{k_1})]$ (similarly for $B_R$, which consists of $k_2$ example pairs).
Next, the sets of pseudo-negative examples $T_A = [(\mathbf{t}_l^1, \mathbf{t}_r^1), \ldots, (\mathbf{t}_l^{k_1}, \mathbf{t}_r^{k_1})]$ and $T_R = [(\mathbf{t}_l^1, \mathbf{t}_r^1), \ldots, (\mathbf{t}_l^{k_2}, \mathbf{t}_r^{k_2})]$ are defined as pairs of negative examples for each ATTRACT and REPEL example pair in mini-batches $B_A$ and $B_R$. These negative examples are chosen from the word vectors present in $B_A$ or $B_R$ so that, for each ATTRACT pair $(\mathbf{x}_l, \mathbf{x}_r)$, the negative example pair $(\mathbf{t}_l, \mathbf{t}_r)$ is chosen so that $\mathbf{t}_l$ is the vector closest (in terms of cosine distance) to $\mathbf{x}_l$ and $\mathbf{t}_r$ is closest to $\mathbf{x}_r$. Similarly, for each REPEL pair $(\mathbf{x}_l, \mathbf{x}_r)$, the negative example pair $(\mathbf{t}_l, \mathbf{t}_r)$ is chosen from the remaining in-batch vectors so that $\mathbf{t}_l$ is the vector furthest away from $\mathbf{x}_l$ and $\mathbf{t}_r$ is furthest from $\mathbf{x}_r$.
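The in-batch selection of negative examples can be sketched as follows. This is a minimal illustration rather than the reference implementation: the function names are hypothetical, the candidate pool is simplified to the single batch passed in, and excluding a pair's own two vectors from its candidates is an assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pick_negatives(batch, closest=True):
    """Return one negative pair (t_l, t_r) per example pair in the batch.

    closest=True  -> ATTRACT rule: most similar in-batch vectors.
    closest=False -> REPEL rule: least similar in-batch vectors.
    Assumption: candidates are the vectors of all *other* pairs in the batch.
    """
    choose = max if closest else min
    negatives = []
    for i, (x_l, x_r) in enumerate(batch):
        candidates = [v for j, pair in enumerate(batch) if j != i for v in pair]
        t_l = choose(candidates, key=lambda v: cosine(x_l, v))
        t_r = choose(candidates, key=lambda v: cosine(x_r, v))
        negatives.append((t_l, t_r))
    return negatives
```

Because the negatives are recomputed from the current vectors in every mini-batch, they track the space as it is being specialised.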
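The negative examples are used to: a) force ATTRACT pairs to be closer to each other than to their respective negative examples; and b) force REPEL pairs to be further away from each other than from their negative examples. The first term of the cost function pulls ATTRACT pairs together:

$$Att(B_A, T_A) = \sum_{i=1}^{k_1} \left[ \tau\left(\delta_{att} + \cos(\mathbf{x}_l^i, \mathbf{t}_l^i) - \cos(\mathbf{x}_l^i, \mathbf{x}_r^i)\right) + \tau\left(\delta_{att} + \cos(\mathbf{x}_r^i, \mathbf{t}_r^i) - \cos(\mathbf{x}_l^i, \mathbf{x}_r^i)\right) \right]$$

where $\cos$ denotes cosine similarity, $\tau(x) = \max(0, x)$ is the hinge loss function and $\delta_{att}$ is the attract margin which determines how much closer these vectors should be to each other than to their respective negative examples. The second part of the cost function pushes REPEL word pairs away from each other:

$$Rep(B_R, T_R) = \sum_{i=1}^{k_2} \left[ \tau\left(\delta_{rep} + \cos(\mathbf{x}_l^i, \mathbf{x}_r^i) - \cos(\mathbf{x}_l^i, \mathbf{t}_l^i)\right) + \tau\left(\delta_{rep} + \cos(\mathbf{x}_l^i, \mathbf{x}_r^i) - \cos(\mathbf{x}_r^i, \mathbf{t}_r^i)\right) \right]$$

In addition to these two terms, a regularisation term is used to preserve the abundance of high-quality semantic content present in the distributional vector space, as long as this information does not contradict the injected linguistic constraints. If $V(B)$ is the set of all word vectors present in the given mini-batch, then:

$$Reg(B_A, B_R) = \sum_{\mathbf{x}_i \in V(B_A \cup B_R)} \lambda_{reg} \left\| \widehat{\mathbf{x}}_i - \mathbf{x}_i \right\|_2$$

where $\lambda_{reg}$ is the L2 regularisation constant and $\widehat{\mathbf{x}}_i$ denotes the original (distributional) word vector for word $x_i$. The full ATTRACT-REPEL cost function is given by the sum of all three terms.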
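The three cost terms described above can be written down directly. The sketch below is illustrative only: the function names are hypothetical, vectors are plain lists of floats, and the default margins and regularisation constant are the values reported later in the training setup.

```python
import math

def cos(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def hinge(x):
    """tau(x) = max(0, x)."""
    return max(0.0, x)

def attract_cost(pairs, negatives, delta_att=0.6):
    """Penalise ATTRACT pairs that are not closer to each other than to their negatives."""
    total = 0.0
    for (x_l, x_r), (t_l, t_r) in zip(pairs, negatives):
        total += hinge(delta_att + cos(x_l, t_l) - cos(x_l, x_r))
        total += hinge(delta_att + cos(x_r, t_r) - cos(x_l, x_r))
    return total

def repel_cost(pairs, negatives, delta_rep=0.0):
    """Penalise REPEL pairs that are closer to each other than to their negatives."""
    total = 0.0
    for (x_l, x_r), (t_l, t_r) in zip(pairs, negatives):
        total += hinge(delta_rep + cos(x_l, x_r) - cos(x_l, t_l))
        total += hinge(delta_rep + cos(x_l, x_r) - cos(x_r, t_r))
    return total

def reg_cost(current, original, lam=1e-9):
    """L2 distance between each updated vector and its original distributional vector."""
    return sum(
        lam * math.sqrt(sum((c - o) ** 2 for c, o in zip(cur, orig)))
        for cur, orig in zip(current, original)
    )
```

A pair contributes nothing to `attract_cost` once it is at least `delta_att` more similar to its partner than to its negative example, so already-satisfied constraints stop moving the vectors.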

LEAR: Encoding Lexical Entailment
In this section, the ATTRACT-REPEL framework is extended to model lexical entailment jointly with (symmetric) semantic similarity. To do this, the method uses an additional source of external lexical knowledge: let $L$ be the set of directed lexical entailment constraints such as (corgi, dog), (dog, animal), or (corgi, animal), with lower-level concepts on the left and higher-level ones on the right (the source of these constraints will be discussed in Section 3). The optimisation proceeds in the same way as before, considering a mini-batch of LE pairs $B_L$ consisting of $k_3$ word pairs standing in the (directed) lexical entailment relation. Unlike symmetric similarity, lexical entailment is an asymmetric relation which encodes a hierarchical ordering between concepts. Inferring the direction of the entailment relation between word vectors requires the use of an asymmetric distance function. We define three different ones, all of which use the word vectors' norms to impose an ordering between high- and low-level concepts:

$$D_1(\mathbf{x}, \mathbf{y}) = \|\mathbf{x}\| - \|\mathbf{y}\|$$
$$D_2(\mathbf{x}, \mathbf{y}) = \frac{\|\mathbf{x}\| - \|\mathbf{y}\|}{\|\mathbf{x}\| + \|\mathbf{y}\|}$$
$$D_3(\mathbf{x}, \mathbf{y}) = \frac{\|\mathbf{x}\| - \|\mathbf{y}\|}{\max(\|\mathbf{x}\|, \|\mathbf{y}\|)}$$

The lexical entailment term (for the $j$-th asymmetric distance, $j \in \{1, 2, 3\}$) is defined as:

$$LE_j(B_L) = \sum_{i=1}^{k_3} D_j(\mathbf{x}_i, \mathbf{y}_i)$$

The first distance serves as the baseline: it uses the word vectors' norms to order the concepts, that is, to decide which of the words is likely to be the higher-level concept. In this case, the magnitude of the difference between the two norms determines the 'intensity' of the LE relation. This is potentially problematic, as this distance does not impose a limit on the vectors' norms. The second and third metrics take a more sophisticated approach, using the ratio of the difference between the two norms to either: a) the sum of the two norms; or b) the larger of the two norms. In doing so, these metrics ensure that the cost function only considers the norms' ratios. This means that the cost function no longer has the incentive to increase word vectors' norms past a certain point, as the magnitudes of norm ratios grow in size much faster than the linear relation defined by the first distance function.
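A small sketch of the three norm-based distances and the resulting LE term is given below. The function names are hypothetical, and the sign convention (hyponym norm minus hypernym norm, so that minimising the term gives higher-level concepts larger norms) is an assumption consistent with the description above.

```python
import math

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(a * a for a in v))

def d1(x, y):
    """Unbounded baseline: raw difference of norms."""
    return norm(x) - norm(y)

def d2(x, y):
    """Norm difference relative to the sum of the two norms (bounded by -1 and 1)."""
    return (norm(x) - norm(y)) / (norm(x) + norm(y))

def d3(x, y):
    """Norm difference relative to the larger of the two norms (bounded by -1 and 1)."""
    return (norm(x) - norm(y)) / max(norm(x), norm(y))

def le_cost(pairs, dist=d2):
    """Sum of asymmetric distances over (hyponym, hypernym) vector pairs."""
    return sum(dist(x, y) for x, y in pairs)
```

Scaling both vectors of a pair by the same factor changes `d1` but leaves `d2` and `d3` untouched, which illustrates why only the first distance rewards ever-larger norms.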
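To model the semantic and the LE relations jointly, the LEAR cost function jointly optimises the four terms of the expanded cost function, namely the two symmetric ATTRACT-REPEL terms and the two terms computed over the LE mini-batch:

$$C = Att(B_A, T_A) + Rep(B_R, T_R) + Att(B_L, T_L) + LE_j(B_L)$$

where $T_L$ denotes the negative examples for the pairs in $B_L$, chosen in the same way as for ATTRACT pairs. The regularisation term (with constant $\lambda_{reg}$) is retained from the ATTRACT-REPEL framework.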

LE Pairs as ATTRACT Constraints
The combined cost function makes use of the batch of lexical constraints B L twice: once in the defined asymmetric cost function LE j , and once in the symmetric ATTRACT term Att(B L , T L ). This means that words standing in the lexical entailment relation are forced to be similar both in terms of cosine distance (via the symmetric ATTRACT term) and in terms of the asymmetric LE distance from Eq. (6).
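The double use of the LE mini-batch can be illustrated with a short sketch. Everything here is hypothetical naming; the asymmetric part uses the norm-ratio distance (difference over sum) with an assumed sign convention in which the hyponym norm is subtracted from by the hypernym norm, and the attract margin is the value from the training setup.

```python
import math

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def cos(u, v):
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def hinge(x):
    return max(0.0, x)

def attract_cost(pairs, negatives, delta_att=0.6):
    """Symmetric term: pull each pair closer together than its negatives."""
    total = 0.0
    for (x_l, x_r), (t_l, t_r) in zip(pairs, negatives):
        total += hinge(delta_att + cos(x_l, t_l) - cos(x_l, x_r))
        total += hinge(delta_att + cos(x_r, t_r) - cos(x_l, x_r))
    return total

def le_cost(pairs):
    """Asymmetric term: norm difference over norm sum for (hyponym, hypernym) pairs."""
    return sum((norm(x) - norm(y)) / (norm(x) + norm(y)) for x, y in pairs)

def le_batch_cost(le_pairs, le_negatives):
    """The same LE mini-batch feeds both the symmetric and the asymmetric term."""
    return attract_cost(le_pairs, le_negatives) + le_cost(le_pairs)
```

With two collinear vectors the symmetric part is already satisfied, so the remaining cost depends only on which of the two has the larger norm.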

Decoding Lexical Entailment
The defined cost function serves to encode semantic similarity and LE relations in the same vector space. Whereas the similarity can be inferred from the standard cosine distance, the LEAR optimisation embeds lexical entailment as a combination of the symmetric ATTRACT term and the newly defined asymmetric $LE_j$ cost function. Consequently, the metric used to determine whether two words stand in the LE relation must combine the two cost terms as well. We define the LE decoding metric as:

$$I_{LE}(\mathbf{x}, \mathbf{y}) = dcos(\mathbf{x}, \mathbf{y}) + D_j(\mathbf{x}, \mathbf{y})$$

where $dcos(\mathbf{x}, \mathbf{y})$ denotes the cosine distance. This decoding function combines the symmetric and the asymmetric cost term, in line with the combination of the two used to perform LEAR specialisation. In the evaluation, we show that combining the two cost terms has a synergistic effect, with both terms contributing to stronger performance across all LE tasks used for evaluation.
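As a rough sketch, the decoding metric adds the cosine distance to one of the asymmetric norm-based distances. The names below are hypothetical; taking the cosine distance as one minus cosine similarity, using the norm-ratio distance (difference over sum), and the sign convention under which lower values indicate stronger evidence that the first word entails the second are all assumptions.

```python
import math

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def cosine_distance(u, v):
    """Assumed definition: 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (norm(u) * norm(v))

def norm_ratio(x, y):
    """Asymmetric part: norm difference over norm sum."""
    return (norm(x) - norm(y)) / (norm(x) + norm(y))

def le_distance(x, y):
    """Combined decoding distance for the candidate pair 'x is a (type of) y'."""
    return cosine_distance(x, y) + norm_ratio(x, y)
```

For two vectors pointing in the same direction, `le_distance` is negative when the second vector has the larger norm and positive when the order is swapped, so the metric is asymmetric even though its cosine component is not.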

Experimental Setup
Starting Distributional Vectors To test the robustness of LEAR specialisation, we experiment with a variety of well-known, publicly available English word vectors: 1) Skip-Gram with Negative Sampling (SGNS) (Mikolov et al., 2013) trained on the Polyglot Wikipedia (Al-Rfou et al., 2013) by Levy and Goldberg (2014); 2) GLOVE Common Crawl (Pennington et al., 2014); 3) CONTEXT2VEC (Melamud et al., 2016), which replaces CBOW contexts with contexts based on bidirectional LSTMs (Hochreiter and Schmidhuber, 1997); and 4) FASTTEXT (Bojanowski et al., 2017), a SGNS variant which builds word vectors as the sum of their constituent character n-gram vectors.
Linguistic Constraints We use three groups of linguistic constraints in the LEAR specialisation model, covering three different relation types which are all beneficial to the specialisation process: 1) directed lexical entailment (LE) pairs; 2) synonymy pairs; and 3) antonymy pairs. Synonyms are included as symmetric ATTRACT pairs (i.e., the $B_A$ pairs) since they can be seen as defining a trivial symmetric IS-A relation (Rei and Briscoe, 2014). For a similar reason, antonyms are clear REPEL constraints as they anti-correlate with the LE relation. Synonymy and antonymy constraints are taken from prior work (Zhang et al., 2014; Ono et al., 2015): they are extracted from WordNet (Fellbaum, 1998) and Roget (Kipfer, 2009). In total, we work with 1,023,082 synonymy pairs (11.7 synonyms per word on average) and 380,873 antonymy pairs (6.5 per word). As in prior work (Nguyen et al., 2017; Nickel and Kiela, 2017), LE constraints are extracted from the WordNet hierarchy, relying on the transitivity of the LE relation. This means that we include both direct and indirect LE pairs in our set of constraints (e.g., (pangasius, fish), (fish, animal), and (pangasius, animal)). We retained only noun-noun and verb-verb pairs, while the rest were discarded: the final number of LE constraints is 1,545,630.
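Including indirect LE pairs amounts to taking the transitive closure of the direct hyponym-hypernym links. The sketch below shows one way this could be done on a toy set of direct pairs; the function name is hypothetical and no claim is made that the constraints were extracted with this exact procedure.

```python
def transitive_le_pairs(direct_pairs):
    """Expand direct (hyponym, hypernym) links into all direct and indirect LE pairs."""
    parents = {}
    for hypo, hyper in direct_pairs:
        parents.setdefault(hypo, set()).add(hyper)

    closure = set()
    for start in parents:
        stack = list(parents[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue  # already reached via another path
            seen.add(node)
            closure.add((start, node))
            stack.extend(parents.get(node, ()))
    return closure
```

For the example from the text, the two direct links (pangasius, fish) and (fish, animal) expand to three constraints, adding the indirect pair (pangasius, animal).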
Training Setup We adopt the original ATTRACT-REPEL model setup without any fine-tuning. Hyperparameter values are set to: $\delta_{att} = 0.6$, $\delta_{rep} = 0.0$, $\lambda_{reg} = 10^{-9}$. The models are trained for 5 epochs with the AdaGrad algorithm (Duchi et al., 2011), with batch sizes set to $k_1 = k_2 = k_3 = 128$ for faster convergence.
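For readers unfamiliar with AdaGrad, a single parameter update can be sketched as below. This is the generic textbook update, not code from the LEAR release; the learning rate and epsilon values are hypothetical, since neither is reported here.

```python
import math

def adagrad_step(params, grads, accum, learning_rate=0.05, eps=1e-8):
    """Update params in place using per-coordinate accumulated squared gradients.

    learning_rate and eps are illustrative placeholders, not values from the paper.
    """
    for i, g in enumerate(grads):
        accum[i] += g * g
        params[i] -= learning_rate * g / (math.sqrt(accum[i]) + eps)
    return params
```

Coordinates that keep receiving large gradients accumulate a large denominator and therefore take progressively smaller steps.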

Results and Discussion
We test and analyse LEAR-specialised vector spaces in two standard word-level LE tasks used in prior work: hypernymy directionality and detection (Section 4.1) and graded LE (Section 4.2).

LE Directionality and Detection
The first evaluation uses three classification-style tasks with increased levels of difficulty. The tasks are evaluated on three datasets used extensively in the LE literature (Roller et al., 2014; Santus et al., 2014; Weeds et al., 2014; Shwartz et al., 2017; Nguyen et al., 2017), compiled into an integrated evaluation set by Kiela et al. (2015b). The first task, LE directionality, is conducted on 1,337 LE pairs originating from the BLESS evaluation set (Baroni and Lenci, 2011). Given a true LE pair, the task is to predict the correct hypernym. With LEAR-specialised vectors this is achieved by simply comparing the vector norms of each concept in a pair: the one with the larger norm is the hypernym (see Figure 1).
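The norm comparison used for the directionality task is simple enough to state in a few lines. The function name and the toy vectors are hypothetical; ties are resolved in favour of the second word, which is an arbitrary choice.

```python
import math

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def predict_hypernym(word_a, word_b, vectors):
    """Return the word whose vector has the larger norm (second word on ties)."""
    return word_a if norm(vectors[word_a]) > norm(vectors[word_b]) else word_b

# Toy vectors for illustration only, not values from a specialised space.
toy_vectors = {"corgi": [0.3, 0.4], "dog": [0.6, 0.8], "animal": [1.2, 1.6]}
```

With these toy vectors, `predict_hypernym("corgi", "dog", toy_vectors)` selects "dog" and `predict_hypernym("animal", "dog", toy_vectors)` selects "animal".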
The second task, LE detection, involves a binary classification on the WBLESS dataset (Weeds et al., 2014) which comprises 1,668 word pairs standing in a variety of relations (LE, meronymy-holonymy, co-hyponymy, reversed LE, no relation). The model has to detect a true LE pair, that is, to distinguish the pairs for which the statement X is a (type of) Y is true from all other pairs. With LEAR vectors, this classification is based on the asymmetric distance score: if the score is above a certain threshold, we classify the pair as "true LE", otherwise as "other". While Kiela et al. (2015b) manually define the threshold value, we follow the approach of Nguyen et al. (2017) and cross-validate: in each of the 1,000 iterations, 2% of the pairs are sampled for threshold tuning, and the remaining 98% are used for testing. The reported numbers are therefore average accuracy scores.
Footnote 8: We have conducted more LE directionality and detection experiments on other datasets such as EVALution (Santus et al., 2015), the N1 N2 dataset of Baroni et al. (2012), and the dataset of Lenci and Benotto (2012), with similar performance and findings. We do not report all these results for brevity and clarity of presentation.
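The threshold cross-validation protocol for LE detection can be sketched as follows. The function names, the candidate-threshold search, and the random seed are hypothetical; the sketch also assumes that scores are oriented so that larger values indicate a more likely LE pair.

```python
import random

def accuracy(scored, threshold):
    """Fraction of pairs classified correctly when 'score > threshold' means true LE."""
    correct = sum((score > threshold) == is_le for score, is_le in scored)
    return correct / len(scored)

def tune_threshold(scored):
    """Pick the candidate threshold with the best accuracy on the tuning pairs."""
    scores = sorted(score for score, _ in scored)
    candidates = [scores[0] - 1.0] + scores
    return max(candidates, key=lambda t: accuracy(scored, t))

def cross_validated_accuracy(scored, iterations=1000, tune_fraction=0.02, seed=0):
    """Average test accuracy over repeated random tune/test splits.

    scored: list of (score, is_le) tuples with boolean gold labels.
    """
    rng = random.Random(seed)
    n_tune = max(1, round(len(scored) * tune_fraction))
    accuracies = []
    for _ in range(iterations):
        shuffled = list(scored)
        rng.shuffle(shuffled)
        tune, test = shuffled[:n_tune], shuffled[n_tune:]
        accuracies.append(accuracy(test, tune_threshold(tune)))
    return sum(accuracies) / len(accuracies)
```

Tuning on only 2% of the pairs keeps almost all of the data for testing, at the price of a noisier threshold in each iteration; averaging over many iterations smooths this out.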
The final task, LE detection and directionality, concerns a three-way classification on BIBLESS, a relabeled version of WBLESS. The task is now to distinguish both LE pairs (→ 1) and reversed LE pairs (→ −1) from other relations (→ 0), and then additionally select the correct hypernym in each detected LE pair. We apply the same test protocol as in the LE detection task.
The performance of the four LEAR-specialised word vector collections is shown in Figure 2 (together with the strongest baseline scores for each of the three tasks). The comparative analysis confirms the increased complexity of subsequent tasks. LEAR specialisation of each of the starting vector spaces consistently outperformed all baseline scores across all three tasks. The extent of the improvements is correlated with task difficulty: it is lowest for the easiest directionality task (0.92 → 0.96), and highest for the most difficult detection plus directionality task (0.81 → 0.88).
The results show that the two LEAR variants which do not rely on absolute norm values and perform a normalisation step in the asymmetric distance (D2 and D3) have an edge over the D1 variant which operates with unbounded norms. The difference in performance between D2/D3 and D1 is even more pronounced in the graded LE task (see Section 4.2). This suggests that the use of unbounded vector norms diminishes the importance of the symmetric cosine distance in the combined asymmetric distance. Conversely, the synergistic combination used in D2/D3 does not suffer from this issue. The high scores achieved with each of the four word vector collections show that LEAR is not dependent on any particular word representation architecture. Moreover, the extent of the performance improvements in each task suggests that LEAR is able to reconstruct the concept hierarchy coded in the input linguistic constraints.
Moreover, we have conducted a small experiment to verify that the LEAR method can generalise beyond what is directly coded in pairwise external constraints. A simple WordNet lookup baseline yields accuracy scores of 0.82 and 0.80 on the directionality and detection tasks, respectively. This baseline is outperformed by LEAR: its scores are 0.96 and 0.92 on the two tasks when relying on the same set of WordNet constraints.

Importance of Vector Norms
To verify that the knowledge concerning the position in the semantic hierarchy actually arises from vector norms, we also manually inspect the norms after LEAR specialisation. A few examples are provided in Table 1. They indicate a desirable pattern in the norm values which imposes a hierarchical ordering on the concepts. Note that the original distributional SGNS model (Mikolov et al., 2013) does not normalise vectors to unit length after training. However, these norms are not at all correlated with the desired hierarchical ordering, and are therefore useless for LE-related applications: the non-specialised distributional SGNS model scores 0.44, 0.48, and 0.34 on the three tasks, respectively.

Graded Lexical Entailment
Asymmetric distances in the LEAR-specialised space quantify the degree of lexical entailment between any two concepts. This means that they can be used to make fine-grained assertions regarding the hierarchical relationships between concepts. We test this property on HyperLex, a gold standard dataset for evaluating how well word representation models capture graded LE, grounded in the notions of concept (proto)typicality (Rosch, 1973; Medin et al., 1984) and category vagueness (Kamp and Partee, 1995; Hampton, 2007) from cognitive science. HyperLex contains 2,616 word pairs (2,163 noun pairs and 453 verb pairs) scored by human raters in the [0, 6] interval following the question "To what degree is X a (type of) Y?" As shown by the high inter-annotator agreement on HyperLex (0.85), humans are able to consistently reason about graded LE. However, current state-of-the-art representation architectures are far from this ceiling. For instance, an earlier evaluation of a plethora of architectures reports a high score of only 0.320 (see the summary table in Figure 3). Two recent representation models (Nickel and Kiela, 2017; Nguyen et al., 2017) focused on the LE relation in particular (and employing the same set of WordNet-based constraints as LEAR) report the highest scores of 0.540 (on the entire dataset) and 0.512 (on the noun subset).

Results and Analysis
We scored all HyperLex pairs using the combined asymmetric distance described by Equation (7), and then computed Spearman's rank correlation with the ground-truth ranking. Our results, together with the strongest baseline scores, are summarised in Figure 3.
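Spearman's rank correlation is the Pearson correlation of the two rank vectors. A dependency-free sketch is shown below; the function names are hypothetical, ties receive average ranks, and whether model distances need to be negated before being compared with human ratings (so that both increase with LE strength) is an assumption left to the caller.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman's rho between two equally long score lists."""
    ra, rb = average_ranks(a), average_ranks(b)
    mean_a, mean_b = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / math.sqrt(var_a * var_b)
```

Only the ordering of the scores matters for this measure, so any monotone rescaling of the model scores leaves the correlation unchanged.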
The summary table in Figure 3(c) shows the HyperLex performance of several prominent LE models. We provide only a quick outline of these models here; further details can be found in the original papers. FREQ-RATIO exploits the fact that more general concepts tend to occur more frequently in textual corpora. SGNS (COS) uses non-specialised SGNS vectors and quantifies the LE strength using the symmetric cosine distance between vectors. A comparison of these models to the best-performing LEAR vectors shows the extent of the improvements achieved using the specialisation approach.
[Figure 3, caption fragment: ... we use Spearman's rank correlation scores on: a) the entire dataset (2,616 noun and verb pairs); and b) its noun subset (2,163 pairs). The summary table shows the performance of other well-known architectures on the full HyperLex dataset, compared to the best results achieved using LEAR specialisation.]
LEAR-specialised vectors also outperform SLQS-SIM (Santus et al., 2014) and VISUAL (Kiela et al., 2015b), two LE detection models similar in spirit to LEAR. These models combine symmetric semantic similarity (through cosine distance) with an asymmetric measure of lexical generality obtained either from text (SLQS-SIM) or visual data (VISUAL). The results on HyperLex indicate that the two generality-based measures are too coarse-grained for graded LE judgements. These models were originally constructed to tackle LE directionality and detection tasks (see Section 4.1), but their performance is surpassed by LEAR on those tasks as well. The VISUAL model outperforms SLQS-SIM. However, its numbers on BLESS (0.88), WBLESS (0.75), and BIBLESS (0.57) are far from the top-performing LEAR vectors (0.96, 0.92, 0.88).
WN-BEST denotes the best result with asymmetric similarity measures which use the WordNet structure as their starting point (Wu and Palmer, 1994; Pedersen et al., 2004). This model can be seen as one that directly looks up the full WordNet structure to reason about graded lexical entailment. The reported results from Figure 3(c) suggest it is more effective to quantify the LE relation strength by using WordNet as the source of constraints for specialisation models such as HYPERVEC or LEAR.
WORD2GAUSS (Vilnis and McCallum, 2015) represents words as multivariate K-dimensional Gaussians rather than points in the embedding space: it is therefore naturally asymmetric and was used in LE tasks before, but its performance on HyperLex indicates that it cannot effectively capture the subtleties required to model graded LE. However, note that the comparison is not strictly fair as WORD2GAUSS does not leverage any external knowledge. An interesting line for future research is to embed external knowledge within this representation framework.
Most importantly, LEAR outperforms three recent (and conceptually different) architectures: ORDER-EMB (Vendrov et al., 2016), POINCARÉ (Nickel and Kiela, 2017), and HYPERVEC (Nguyen et al., 2017). Like LEAR, all of these models complement distributional knowledge with external linguistic constraints extracted from WordNet. Each model uses a different strategy to exploit the hierarchical relationships encoded in these constraints (their approaches are discussed in Section 5). However, LEAR, as the first LE-oriented post-processor, is able to utilise the constraints more effectively than its competitors. Another advantage of LEAR is its applicability to any input vector space.
Figures 3(a) and 3(b) indicate that the two LEAR variants which rely on norm ratios (D2 and D3), rather than on absolute (unbounded) norm differences (D1), achieve stronger performance on HyperLex. The highest correlation scores are again achieved by D2 with all input vector spaces.

Further Discussion
Why Symmetric + Asymmetric? In another experiment, we analyse the contributions of both LE-related terms in the LEAR combined objective function (see Section 2.2). We compare three variants of LEAR: 1) a symmetric variant which does not arrange vector norms using the $LE_j(B_L)$ term (SYM-ONLY); 2) a variant which arranges norms, but does not use LE constraints as additional symmetric ATTRACT constraints (ASYM-ONLY); and 3) the full LEAR model, which uses both cost terms (FULL). The results with one input space (similar results are achieved with others) are shown in Table 2. This table shows that, while the stand-alone ASYM-ONLY term seems more beneficial than the SYM-ONLY one, using the two terms jointly yields the strongest performance across all LE tasks.

LE and Semantic Similarity
We also test whether the asymmetric LE term harms the (norm-independent) cosine distances used to represent semantic similarity. The LEAR model is compared to the original ATTRACT-REPEL model making use of the same set of linguistic constraints. Two true semantic similarity datasets are used for evaluation: SimLex-999 (Hill et al., 2015) and SimVerb-3500 (Gerz et al., 2016). There is no significant difference in performance between the two models, both of which yield similar results on SimLex (Spearman's rank correlation of ≈ 0.71) and SimVerb (≈ 0.70). This indicates that cosine distances are preserved during the optimisation of the asymmetric objective performed by the joint LEAR model.

Related Work
Vector Space Specialisation A standard approach to incorporating external information into vector spaces is to pull the representations of similar words closer together. Some models integrate such constraints into the training procedure: they modify the prior or the regularisation (Yu and Dredze, 2014; Xu et al., 2014; Kiela et al., 2015a), or use a variant of the SGNS-style objective (Liu et al., 2015; Osborne et al., 2016; Nguyen et al., 2017). Another class of models, popularly termed retrofitting, fine-tune distributional vector spaces by injecting lexical knowledge from semantic databases such as WordNet or the Paraphrase Database (Faruqui et al., 2015; Wieting et al., 2015; Nguyen et al., 2016; Mrkšić et al., 2016). LEAR falls into the latter category. However, while previous post-processing methods have focused almost exclusively on specialising vector spaces to emphasise semantic similarity (i.e., to distinguish between similarity and relatedness by explicitly pulling synonyms closer and pushing antonyms further apart), this paper proposed a principled methodology for specialising vector spaces for asymmetric hierarchical relations (of which lexical entailment is an instance). Its starting point is the state-of-the-art similarity specialisation framework ATTRACT-REPEL, which we extend to support the inclusion of hierarchical asymmetric relationships between words.

Word Vectors and Lexical Entailment
Since the hierarchical LE relation is one of the fundamental building blocks of semantic taxonomies and hierarchical concept categorisations (Beckwith et al., 1991;Fellbaum, 1998), a significant amount of research in semantics has been invested into its automatic detection and classification. Early work relied on asymmetric directional measures (Weeds et al., 2004;Clarke, 2009;Kotlerman et al., 2010;Lenci and Benotto, 2012, i.a.) which were based on the distributional inclusion hypothesis (Geffet and Dagan, 2005) or the distributional informativeness or generality hypothesis (Herbelot and Ganesalingam, 2013;Santus et al., 2014). However, these approaches have recently been superseded by methods based on word embeddings. These methods build dense real-valued vectors for capturing the LE relation, either directly in the LE-focused space (Vilnis and McCallum, 2015;Vendrov et al., 2016;Henderson and Popa, 2016;Nickel and Kiela, 2017;Nguyen et al., 2017) or by using the vectors as features for supervised LE detection models (Tuan et al., 2016;Shwartz et al., 2016;Nguyen et al., 2017;Glavaš and Ponzetto, 2017).
Several LE models embed useful hierarchical relations from external resources such as WordNet into LE-focused vector spaces, with solutions coming in different flavours. One such model is a dynamic distance-margin model optimised for the LE detection task using hierarchical WordNet constraints. This model was extended by Tuan et al. (2016) to make use of contextual sentential information. A major drawback of both models is their inability to make directionality judgements. Further, their performance has recently been surpassed by the HYPERVEC model of Nguyen et al. (2017). This model combines WordNet constraints with the SGNS distributional objective into a joint model. As such, the model is tied to the SGNS objective and any change of the distributional modelling paradigm implies a change of the entire HYPERVEC model. This makes their model less versatile than the proposed LEAR framework. Moreover, LEAR specialisation achieves substantially better performance across all LE tasks used for evaluation.
Another model similar in spirit to LEAR is the ORDER-EMB model of Vendrov et al. (2016), which encodes hierarchical structure by imposing a partial order in the embedding space: higher-level concepts get assigned higher per-coordinate values in a d-dimensional vector space. The model minimises the violation of the per-coordinate orderings during training by relying on hierarchical WordNet constraints between word pairs. Finally, the POINCARÉ model of Nickel and Kiela (2017) makes use of hyperbolic spaces to learn general-purpose LE embeddings based on n-dimensional Poincaré balls which encode both hierarchy and semantic similarity, again using the WordNet constraints. A similar model in hyperbolic spaces was proposed by Chamberlain et al. (2017). In this paper, we demonstrate that LE-specialised word embeddings with stronger performance can be induced using a simpler model operating in more intuitively interpretable Euclidean vector spaces.

Conclusion and Future Work
This paper proposed LEAR, a vector space specialisation procedure which simultaneously injects symmetric and asymmetric constraints into existing vector spaces, performing joint specialisation for two properties: lexical entailment and semantic similarity. Since the former is not symmetric, LEAR uses an asymmetric cost function which encodes the hierarchy between concepts by manipulating the norms of word vectors, assigning higher norms to higher-level concepts. Specialising the vector space for both relations has a synergistic effect: LEAR-specialised vectors attain state-of-the-art performance in judging semantic similarity and set new high scores across four different lexical entailment tasks. The code for the LEAR model is available from: github.com/nmrksic/lear.
In future work, we plan to apply a similar methodology to other asymmetric relations (e.g., meronymy), as well as to investigate fine-grained models which can account for differing path lengths from the WordNet hierarchy. We will also extend the model to reason over words unseen in input lexical resources, similar to the recent post-specialisation model oriented towards specialisation of unseen words for similarity (Vulić et al., 2018). We also plan to test the usefulness of LE-specialised vectors in downstream natural language understanding tasks. Porting the model to other languages and enabling cross-lingual applications such as cross-lingual lexical entailment (Upadhyay et al., 2018) is another future research direction.