Specializing Distributional Vectors of All Words for Lexical Entailment

Semantic specialization methods fine-tune distributional word vectors using lexical knowledge from external resources (e.g. WordNet) to accentuate a particular relation between words. However, such post-processing methods suffer from limited coverage as they affect only vectors of words seen in the external resources. We present the first post-processing method that specializes vectors of all vocabulary words – including those unseen in the resources – for the asymmetric relation of lexical entailment (LE) (i.e., hyponymy-hypernymy relation). Leveraging a partially LE-specialized distributional space, our POSTLE (i.e., post-specialization for LE) model learns an explicit global specialization function, allowing for specialization of vectors of unseen words, as well as word vectors from other languages via cross-lingual transfer. We capture the function as a deep feed-forward neural network: its objective re-scales vector norms to reflect the concept hierarchy while simultaneously attracting hyponymy-hypernymy pairs to better reflect semantic similarity. An extended model variant augments the basic architecture with an adversarial discriminator. We demonstrate the usefulness and versatility of POSTLE models with different input distributional spaces in different scenarios (monolingual LE and zero-shot cross-lingual LE transfer) and tasks (binary and graded LE). We report consistent gains over state-of-the-art LE-specialization methods, and successfully LE-specialize word vectors for languages without any external lexical knowledge.


Introduction
Word-level lexical entailment (LE), also known as the TYPE-OF or hyponymy-hypernymy relation, is a fundamental asymmetric lexico-semantic relation (Collins and Quillian, 1972; Beckwith et al., 1991). The set of these relations constitutes a hierarchical structure that forms the backbone of semantic networks such as WordNet (Fellbaum, 1998). Automatic reasoning about word-level LE benefits a plethora of tasks such as natural language inference (Dagan et al., 2013; Bowman et al., 2015; Williams et al., 2018), text generation (Biran and McKeown, 2013), metaphor detection (Mohler et al., 2013), and automatic taxonomy creation (Snow et al., 2006; Navigli et al., 2011; Gupta et al., 2017).
* Both authors contributed equally to this work.
To mitigate the inability of purely distributional vectors to capture LE, a standard solution is a post-processing step: distributional vectors are gradually refined to satisfy linguistic constraints extracted from external resources such as WordNet (Fellbaum, 1998) or BabelNet (Navigli and Ponzetto, 2012). This process, termed retrofitting or semantic specialization, is beneficial to language understanding tasks (Faruqui, 2016) and is extremely versatile, as it can be applied on top of any input distributional space. Retrofitting methods, however, have a major weakness: they only locally update vectors of words seen in the external resources, while leaving vectors of all other unseen words unchanged, as illustrated in Figure 1. Recent work (Ponti et al., 2018) has demonstrated how to specialize the full distributional space for the symmetric relation of semantic (dis)similarity. The so-called post-specialization model learns a global and explicit specialization function that imitates the transformation from the distributional space to the retrofitted space, and applies it to the large subspace of unseen words' vectors.
In this work, we present POSTLE, an all-words post-specialization model for the asymmetric LE relation. This model propagates the signal on the hierarchical organization of concepts to the ones unseen in external resources, resulting in a word vector space which is fully specialized for the LE relation. Previous LE specialization methods simply integrated available LE knowledge into the input distributional space, or provided means to learn dense word embeddings of the external resource only (Nickel and Kiela, 2017, 2018; Ganea et al., 2018; Sala et al., 2018). In contrast, we show that our POSTLE method can combine distributional and external lexical knowledge and generalize over unseen concepts.
The main contribution of POSTLE is a novel global transformation function that re-scales vector norms to reflect the concept hierarchy while simultaneously attracting hyponymy-hypernymy word pairs to reflect their semantic similarity in the specialized space. We propose and evaluate two variants of this idea. The first variant learns the global function through a deep non-linear feed-forward network. The extended variant leverages the deep feed-forward net as the generator component of an adversarial model. The role of the accompanying discriminator is then to distinguish original LE-specialized vectors (produced by any initial post-processor) from vectors produced by transforming distributional vectors with the generator.
We demonstrate that the proposed POSTLE methods yield considerable gains over state-of-the-art LE-specialization models (Nickel and Kiela, 2017), with the adversarial variant having an edge over the other. The gains are observed with different input distributional spaces in several LE-related tasks such as hypernymy detection and directionality, and graded lexical entailment. What is more, the highest gains are reported for resource-lean data scenarios where a high percentage of words in the datasets is unseen. Finally, we show how to LE-specialize distributional spaces for target languages that lack external lexical knowledge. POSTLE can be coupled with any model for inducing cross-lingual embedding spaces (Conneau et al., 2018; Artetxe et al., 2018; Smith et al., 2017). If this model is unsupervised, the procedure effectively yields a zero-shot LE specialization transfer, and holds promise to support the construction of hierarchical semantic networks for resource-lean languages in future work.

Post-Specialization for LE
Our post-specialization starts with the Lexical Entailment Attract-Repel (LEAR) model, a state-of-the-art retrofitting model for LE, summarized in §2.1. While we opt for LEAR because of its strong performance and ease of use, it is important to note that our POSTLE models (§2.2 and §2.3) are not in any way bound to LEAR and can be applied on top of any LE retrofitting model.

Initial LE Specialization: LEAR
LEAR fine-tunes the vectors of words observed in a set of external linguistic constraints C = S ∪ A ∪ L, consisting of synonymy pairs S such as (clever, intelligent), antonymy pairs A such as (war, peace), and lexical entailment (i.e., hyponymy-hypernymy) pairs L such as (dog, animal). For the L pairs, the order of words is important: we assume that the left word always refers to the hyponym.
Extending the ATTRACT-REPEL model for symmetric similarity specialization (Mrkšić et al., 2017), LEAR defines two types of objectives: 1) the ATTRACT (Att) objective aims to bring closer together in the vector space words that are semantically similar (i.e., synonyms and hyponym-hypernym pairs); 2) the REPEL (Rep) objective pushes further apart vectors of dissimilar words (i.e., antonyms).
Let $B = \{(x_l^{(k)}, x_r^{(k)})\}_{k=1}^{K}$ be the set of $K$ word pairs for which the Att or Rep score is to be computed; these are the positive examples. The set of corresponding negative examples $T$ is created by coupling each positive ATTRACT example $(x_l, x_r)$ with a negative example pair $(t_l, t_r)$, where $t_l$ is the vector closest (in terms of cosine similarity, within the batch) to $x_l$ and $t_r$ is the vector closest to $x_r$. The Att objective for a batch of ATTRACT constraints $B_A$ is then given as:
$$Att(B_A, T_A) = \sum_{k=1}^{K} \Big[ \tau\big(\delta_{att} + \cos(x_l^{(k)}, t_l^{(k)}) - \cos(x_l^{(k)}, x_r^{(k)})\big) + \tau\big(\delta_{att} + \cos(x_r^{(k)}, t_r^{(k)}) - \cos(x_l^{(k)}, x_r^{(k)})\big) \Big] \tag{1}$$
Here, $\tau(x) = \max(0, x)$ is the hinge loss and $\delta_{att}$ is the similarity margin imposed between the negative and positive vector pairs. In contrast, for each positive REPEL example, the negative example $(t_l, t_r)$ couples the vector $t_l$ that is most distant from $x_l$ and the vector $t_r$ that is most distant from $x_r$. The Rep objective for a batch of REPEL word pairs $B_R$ is then:
$$Rep(B_R, T_R) = \sum_{k=1}^{K} \Big[ \tau\big(\delta_{rep} + \cos(x_l^{(k)}, x_r^{(k)}) - \cos(x_l^{(k)}, t_l^{(k)})\big) + \tau\big(\delta_{rep} + \cos(x_l^{(k)}, x_r^{(k)}) - \cos(x_r^{(k)}, t_r^{(k)})\big) \Big] \tag{2}$$
LEAR additionally defines a regularization term in order to preserve the useful semantic information from the original distributional space. With $V(B)$ as the set of distinct words in a constraint batch $B$, the regularization term is $Reg(B) = \lambda_{reg} \sum_{x \in V(B)} \|y - x\|_2$, where $y$ is the LEAR specialization of the distributional vector $x$, and $\lambda_{reg}$ is the regularization factor.
Crucially, LEAR forces specialized vectors to reflect the asymmetry of the LE relation with an asymmetric distance-based objective. The goal is to preserve the cosine distances in the specialized space while steering vectors of more general concepts (those found higher in the concept hierarchy) to take larger norms. The original LEAR work tests several asymmetric objectives, and we adopt the one reported to be the most robust:
$$LE(B_L) = \sum_{k=1}^{K} \frac{\|x_l^{(k)}\| - \|x_r^{(k)}\|}{\|x_l^{(k)}\| + \|x_r^{(k)}\|} \tag{3}$$
$B_L$ denotes a batch of LE constraints. The full LEAR objective is then defined as:
$$J = Att(B_S, T_S) + Rep(B_A, T_A) + Att(B_L, T_L) + LE(B_L) + Reg(B_S, B_A, B_L) \tag{4}$$
In summary, LEAR pulls words from synonymy and LE pairs closer together ($Att(B_S, T_S)$ and $Att(B_L, T_L)$), while simultaneously pushing vectors of antonyms further apart ($Rep(B_A, T_A)$) and enforcing asymmetric distances for hyponymy-hypernymy pairs ($LE(B_L)$).
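To make the three batch-level terms concrete, here is a minimal pure-Python sketch. It assumes the standard ATTRACT-REPEL hinge form for Att and Rep and a normalized norm difference for the LE term; the function names (`attract`, `repel`, `le_term`) and the toy vectors are illustrative and are not taken from the LEAR code base.

```python
import math

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(a * a for a in v))

def cos(u, v):
    """Cosine similarity between two non-zero vectors."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def hinge(x):
    """tau(x) = max(0, x)."""
    return max(0.0, x)

def attract(pairs, negatives, delta_att=0.6):
    """Assumed Att form: each positive pair should be more similar
    than either of its negative examples by at least delta_att."""
    total = 0.0
    for (xl, xr), (tl, tr) in zip(pairs, negatives):
        pos = cos(xl, xr)
        total += hinge(delta_att + cos(xl, tl) - pos)
        total += hinge(delta_att + cos(xr, tr) - pos)
    return total

def repel(pairs, negatives, delta_rep=0.0):
    """Assumed Rep form: each positive (antonym) pair should be less
    similar than either of its negative examples by at least delta_rep."""
    total = 0.0
    for (xl, xr), (tl, tr) in zip(pairs, negatives):
        pos = cos(xl, xr)
        total += hinge(delta_rep + pos - cos(xl, tl))
        total += hinge(delta_rep + pos - cos(xr, tr))
    return total

def le_term(pairs):
    """Assumed LE term: normalized norm difference, negative when the
    hyponym (left) has a smaller norm than the hypernym (right)."""
    return sum((norm(xl) - norm(xr)) / (norm(xl) + norm(xr))
               for xl, xr in pairs)

# Toy example: "dog" (hyponym) has a smaller norm than "animal" (hypernym).
dog, animal = [1.0, 0.0], [2.0, 0.0]
le_value = le_term([(dog, animal)])  # negative, so minimizing it is satisfied
```

Minimizing `le_term` pushes hypernym norms above hyponym norms, while `attract` and `repel` act only on directions through cosine similarity.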

Post-Specialization Model
The retrofitting model (LEAR) specializes vectors only for a subset of the full vocabulary: the words it has seen in the external lexical resource. Such resources are still fairly incomplete, even for major languages (e.g., WordNet for English), and fail to cover a large portion of the distributional vocabulary (referred to as unseen words). The transformation of the seen subspace, however, provides evidence on the desired effects of LE-specialization. We seek a post-specialization procedure for LE (termed POSTLE) that propagates this useful signal to the subspace of unseen words and LE-specializes the entire distributional space (see Figure 1).
Let $X_s$ be the subset of the distributional space containing vectors of words seen in lexical constraints and let $Y_s$ denote LE-specialized vectors of those words produced by the initial LE specialization model. For seen words, we pair their original distributional vectors $x_s \in X_s$ with corresponding LEAR-specialized vectors $y_s$: post-specialization then directly uses pairs $(x_s, y_s)$ as training instances for learning a global specialization function, which is then applied to LE-specialize the remainder of the distributional space, i.e., the specialization function learned from $(X_s, Y_s)$ is applied to the subspace of unseen words' vectors $X_u$.
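The data preparation described above can be sketched as follows. This is an illustrative helper, assuming both spaces are stored as word-to-vector dictionaries; the name `build_post_specialization_data` is hypothetical.

```python
def build_post_specialization_data(distributional, specialized):
    """Split the vocabulary into training pairs and unseen words.

    distributional: dict word -> vector for the full vocabulary.
    specialized:    dict word -> vector for words seen in the constraints
                    (output of the initial LE specialization model).
    Returns (train_pairs, unseen), where train_pairs holds (x_s, y_s)
    tuples and unseen maps each unseen word to its distributional vector.
    """
    train_pairs = [(distributional[w], specialized[w])
                   for w in specialized if w in distributional]
    unseen = {w: v for w, v in distributional.items()
              if w not in specialized}
    return train_pairs, unseen

# Toy vocabulary: only "dog" was covered by the lexical constraints.
space = {"dog": [1.0], "cat": [2.0], "zebra": [3.0]}
lear_space = {"dog": [1.5]}
pairs, unseen_words = build_post_specialization_data(space, lear_space)
```

The learned function is trained on `pairs` and later applied to every vector in `unseen_words`.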
Let $G(x_i; \theta_G): \mathbb{R}^d \to \mathbb{R}^d$ (with $d$ as the dimensionality of the vector space) be the specialization function we are trying to learn using pairs of distributional and LEAR-specialized vectors as training instances. We first instantiate the post-specialization model $G(x_i; \theta_G)$ as a deep fully-connected feed-forward network (DFFN) with $H$ hidden layers and $m$ units per layer. The mapping of the $j$-th hidden layer is given as:
$$x^{(j)} = activ\big(x^{(j-1)} W^{(j)} + b^{(j)}\big) \tag{5}$$
Here, $activ$ refers to a non-linear activation function, $x^{(j-1)}$ is the output of the previous layer ($x^{(0)}$ is the input distributional vector), and $(W^{(j)}, b^{(j)})$, $j \in \{1, \ldots, H\}$, are the model's parameters $\theta_G$. The aim is to obtain predictions $G(x_s; \theta_G)$ that are as close as possible to the corresponding LEAR-specializations $y_s$. For symmetric similarity-based post-specialization, prior work relied on cosine distance to measure the discrepancy between the predicted and expected specialization. Since we are specializing vectors for the asymmetric LE relation, the predicted vector $G(x_s; \theta_G)$ has to match $y_s$ not only in direction (as captured by cosine distance) but also in size (i.e., the vector norm). Therefore, the POSTLE objective augments cosine distance $d_{cos}$ with the absolute difference of the norms of $G(x_s; \theta_G)$ and $y_s$ (see footnote 3):
$$L_S = d_{cos}\big(G(x_s; \theta_G), y_s\big) + \delta_n \Big| \|G(x_s; \theta_G)\| - \|y_s\| \Big| \tag{6}$$
The hyperparameter $\delta_n$ determines the contribution of the norm difference to the overall loss.
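A minimal pure-Python sketch of the forward pass and the per-example loss follows. It assumes the loss is cosine distance plus $\delta_n$ times the absolute norm difference, uses Swish as the hidden activation, and adds a final linear output layer mapping back to $d$ dimensions; that output layer and all function names are assumptions made for illustration, not details confirmed by the text.

```python
import math

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def cosine_distance(u, v):
    """d_cos(u, v) = 1 - cos(u, v) for non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (norm(u) * norm(v))

def swish(z):
    """Swish activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def layer(x, W, b, activ=swish):
    """One layer: activ(x W + b), with W of shape (len(x), len(b))."""
    return [activ(sum(x[i] * W[i][j] for i in range(len(x))) + b[j])
            for j in range(len(b))]

def dffn(x, hidden, out):
    """Apply the hidden layers, then an (assumed) linear output layer."""
    for W, b in hidden:
        x = layer(x, W, b)
    W_out, b_out = out
    return layer(x, W_out, b_out, activ=lambda z: z)

def postle_loss(pred, target, delta_n=0.1):
    """Direction mismatch (cosine distance) plus weighted norm mismatch."""
    return (cosine_distance(pred, target)
            + delta_n * abs(norm(pred) - norm(target)))

# Same direction, norm off by 1: only the norm term contributes.
loss_norm_only = postle_loss([2.0, 0.0], [1.0, 0.0])
```

With `delta_n = 0` the loss reduces to plain cosine distance, which ignores exactly the norm information that encodes the concept hierarchy.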

Adversarial LE Post-Specialization
We next extend the DFFN post-specialization model with an adversarial architecture (ADV), following Ponti et al. (2018) who demonstrated its usefulness for similarity-based specialization. The intuition behind the adversarial extension is as follows: the specialization function $G(x_s; \theta_G)$ should not only produce vectors that have high cosine similarity and similar norms with corresponding LEAR-specialized vectors $y_s$, but should also ensure that these vectors seem "natural", that is, as if they were indeed sampled from $Y_s$. We can force the post-specialized vectors $G(x_s; \theta_G)$ to be legitimate samples from the $Y_s$ distribution by introducing an adversary that learns to discriminate whether a given vector has been generated by the specialization function or directly sampled from $Y_s$. Such adversaries prevent the generation of unrealistic outputs, as demonstrated in computer vision (Pathak et al., 2016; Ledig et al., 2017; Odena et al., 2017). The DFFN function $G(x; \theta_G)$ from §2.2 can be seen as the generator component. We couple the generator with the discriminator $D(x; \theta_D)$, also instantiated as a DFFN. The discriminator performs binary classification: presented with a word vector, it predicts whether it has been produced by $G$ or sampled from the LEAR-specialized subspace $Y_s$. On the other hand, the generator tries to produce vectors which the discriminator would misclassify as sampled from $Y_s$.
3 Simply minimizing Euclidean distance also aligns vectors in terms of both direction and size. However, we consistently obtained better results by the objective function from Eq. (6).
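The discriminator's loss is defined via negative log-likelihood over two sets of inputs, generator-produced vectors $G(x_s; \theta_G)$ and LEAR specializations $y_s$:
$$L_D = -\sum_{i=1}^{N} \log P\big(spec = 0 \mid G(x_s^{(i)}; \theta_G); \theta_D\big) - \sum_{i=1}^{M} \log P\big(spec = 1 \mid y_s^{(i)}; \theta_D\big) \tag{7}$$
Here, $spec = 1$ labels a vector sampled from $Y_s$ and $spec = 0$ a vector produced by the generator, with $N$ and $M$ the numbers of examples of each kind. Besides minimizing the similarity-based loss $L_S$, the generator has the additional task of confusing the discriminator: it thus perceives the discriminator's correct predictions as its additional loss $L_G$:
$$L_G = -\sum_{i=1}^{N} \log P\big(spec = 1 \mid G(x_s^{(i)}; \theta_G); \theta_D\big) \tag{8}$$
We learn the parameters of $G$ and $D$ with stochastic gradient descent. To reduce the covariate shift and make training more robust, each batch contains examples of the same class (either only predicted vectors or only LEAR vectors). Moreover, for each update step of $L_G$ we alternate between $s_D$ update steps for $L_D$ and $s_S$ update steps for $L_S$.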
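The two adversarial losses and the alternating update schedule can be sketched as below. This assumes the discriminator outputs the probability that its input was sampled from the LEAR-specialized space and that both losses are plain negative log-likelihoods; the ordering of steps within one cycle and all names are assumptions for illustration.

```python
import math

def _clip(p, eps=1e-12):
    """Keep probabilities away from 0 and 1 so that log stays finite."""
    return min(max(p, eps), 1.0 - eps)

def discriminator_loss(p_on_generated, p_on_lear):
    """NLL of the correct labels: generated -> 0, LEAR-specialized -> 1.
    Inputs are the discriminator's probabilities of 'sampled from Y_s'."""
    loss = -sum(math.log(1.0 - _clip(p)) for p in p_on_generated)
    loss -= sum(math.log(_clip(p)) for p in p_on_lear)
    return loss

def generator_adversarial_loss(p_on_generated):
    """The generator is penalized whenever D recognizes its outputs."""
    return -sum(math.log(_clip(p)) for p in p_on_generated)

def update_schedule(num_generator_steps, s_d=5, s_s=3):
    """For each L_G update, run s_d updates of L_D and s_s updates of L_S
    (the order inside a cycle is an assumption)."""
    steps = []
    for _ in range(num_generator_steps):
        steps += ["L_D"] * s_d + ["L_S"] * s_s + ["L_G"]
    return steps

# An undecided discriminator (p = 0.5 everywhere) pays log 2 per example.
undecided = discriminator_loss([0.5], [0.5])
```

A fully fooled discriminator (probabilities near 1 on generated vectors) drives `generator_adversarial_loss` toward zero while increasing `discriminator_loss`.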

Cross-Lingual LE Specialization Transfer
The POSTLE models enable LE specialization of vectors of words unseen in lexical constraints. Conceptually, this also allows for an LE-specialization of a distributional space of another language (possibly without any external constraints), provided a shared bilingual distributional word vector space. To this end, we can resort to any of the methods for inducing shared cross-lingual vector spaces (Ruder et al., 2018). What is more, most recent methods successfully learn the shared space without any bilingual signal (Conneau et al., 2018; Artetxe et al., 2018; Chen and Cardie, 2018; Hoshen and Wolf, 2018). Let $X_t$ be the distributional space of some target language for which we have no external lexical constraints and let $P(x; \theta_P): \mathbb{R}^{d_t} \to \mathbb{R}^{d_s}$ be the (linear) function projecting vectors $x_t \in X_t$ to the distributional space $X_s$ of the source language with available lexical constraints for which we trained the post-specialization model. We then simply obtain the LE-specialized space $Y_t$ of the target language by composing the projection $P$ with the post-specialization $G$ (see Figure 1):
$$Y_t = G\big(P(X_t; \theta_P); \theta_G\big) \tag{9}$$
In §4.3 we report on language transfer experiments with three different linear projection models $P$ in order to verify the robustness of the cross-lingual LE-specialization transfer.
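The transfer step is a plain function composition, which the following sketch illustrates. It assumes the projection is a matrix applied as a row-vector product and treats the trained specialization function as an opaque callable; `project`, `transfer`, and the toy inputs are hypothetical.

```python
def project(x, P):
    """Linear projection x P, with P of shape (d_t, d_s)."""
    return [sum(x[i] * P[i][j] for i in range(len(x)))
            for j in range(len(P[0]))]

def transfer(target_space, P, G):
    """Specialize every target-language vector as G(P(x))."""
    return {word: G(project(x, P)) for word, x in target_space.items()}

# Toy setup: P swaps the two dimensions, G doubles every coordinate
# (a stand-in for a trained specialization function).
swap = [[0.0, 1.0],
        [1.0, 0.0]]
double = lambda v: [2.0 * a for a in v]
specialized_es = transfer({"perro": [1.0, 3.0]}, swap, double)
```

No target-language constraints enter the computation: all lexical knowledge is carried by the source-language function `G`.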

Experimental Setup
Distributional Vectors. To test the robustness of the POSTLE approach, we experiment with two pre-trained English word vector spaces: (1) vectors trained by Levy and Goldberg (2014) and (2) FASTTEXT vectors (the input space also used in §4.3).
Linguistic Constraints. We use the same set of constraints as LEAR in prior work: synonymy and antonymy constraints from (Zhang et al., 2014; Ono et al., 2015) are extracted from WordNet and Roget's Thesaurus (Kipfer, 2009). As in other work on LE specialization (Nguyen et al., 2017; Nickel and Kiela, 2017), asymmetric LE constraints are extracted from WordNet, and we collect both direct and indirect LE pairs (i.e., (parrot, bird), (bird, animal), and (parrot, animal) are in the LE set). In total, we work with 1,023,082 pairs of synonyms, 380,873 pairs of antonyms, and 1,545,630 LE pairs.
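Collecting both direct and indirect LE pairs amounts to taking the transitive closure of the direct hyponym-hypernym links. A small sketch, assuming the direct links are given as (hyponym, hypernym) tuples; the function name `le_closure` is illustrative.

```python
def le_closure(direct_pairs):
    """Return all (hyponym, hypernym) pairs reachable through chains
    of direct links, including the direct links themselves."""
    hypernyms = {}
    for hypo, hyper in direct_pairs:
        hypernyms.setdefault(hypo, set()).add(hyper)

    closure = set()
    for word in hypernyms:
        stack = list(hypernyms[word])
        visited = set()
        while stack:
            ancestor = stack.pop()
            if ancestor in visited:
                continue  # guards against revisiting shared ancestors
            visited.add(ancestor)
            closure.add((word, ancestor))
            stack.extend(hypernyms.get(ancestor, ()))
    return closure

all_pairs = le_closure([("parrot", "bird"), ("bird", "animal")])
```

In the example, the indirect pair (parrot, animal) is added alongside the two direct ones.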
Training Configurations. For LEAR, we adopt the hyperparameter setting reported in the original paper: $\delta_{att} = 0.6$, $\delta_{rep} = 0$, $\lambda_{reg} = 10^{-9}$. For POSTLE, we fine-tune the hyperparameters via random search on the validation set: 1) DFFN uses $H = 4$ hidden layers, each with 1,536 units and Swish as the activation function (Ramachandran et al., 2018); 2) ADV relies on $H = 4$ hidden layers, each with $m = 2,048$ units and Leaky ReLU (slope 0.2) (Maas et al., 2014) for the generator. The discriminator uses $H = 2$ layers with 1,024 units and Leaky ReLU. For each update based on the generator loss ($L_G$), we perform $s_S = 3$ updates based on the similarity loss ($L_S$) and $s_D = 5$ updates based on the discriminator loss ($L_D$). The value for the norm difference contribution in $L_S$ is set to $\delta_n = 0.1$ (see Eq. (6)) for both POSTLE variants. We train POSTLE models using SGD with the batch size 32, the initial learning rate 0.1, and a decay rate of 0.98 applied every 1M examples.
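For reference, the reported settings can be gathered in one place, together with one possible reading of the learning-rate schedule. The staircase interpretation (multiply by 0.98 after each full block of 1M examples) is an assumption, as are the dictionary keys and the function name.

```python
# Hyperparameters as reported in the text; key names are illustrative.
POSTLE_CONFIG = {
    "dffn": {"hidden_layers": 4, "units": 1536, "activation": "swish"},
    "adv_generator": {"hidden_layers": 4, "units": 2048,
                      "activation": "leaky_relu", "slope": 0.2},
    "adv_discriminator": {"hidden_layers": 2, "units": 1024,
                          "activation": "leaky_relu"},
    "s_S": 3,          # L_S updates per L_G update
    "s_D": 5,          # L_D updates per L_G update
    "delta_n": 0.1,    # weight of the norm difference in L_S
    "batch_size": 32,
}

def learning_rate(examples_seen, initial=0.1, decay=0.98, every=1_000_000):
    """Staircase decay (assumed): one decay step per `every` examples."""
    return initial * decay ** (examples_seen // every)

lr_after_2m = learning_rate(2_000_000)
```

Under this reading the rate stays at 0.1 for the first million examples and then shrinks by 2% per additional million.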
Asymmetric LE Distance. The distance that measures the strength of the LE relation in the specialized space reflects both the cosine distance between the vectors and the asymmetric difference between their norms:
$$I_{LE}(x, y) = d_{cos}(x, y) + \frac{\|x\| - \|y\|}{\|x\| + \|y\|} \tag{10}$$
LE-specialized vectors of general concepts obtain larger norms than vectors of specific concepts. True LE pairs should display both a small cosine distance and a negative norm difference. Therefore, in different LE tasks we can rank the candidate pairs in the ascending order of their asymmetric LE distance $I_{LE}$. The LE distances are trivially transformed into binary LE predictions, using a binarization threshold $t$: if $I_{LE}(x, y) < t$, we predict that LE holds between words $x$ and $y$ with vectors $x$ and $y$.
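A short sketch of the asymmetric distance and its binarization, assuming the distance is the sum of cosine distance and the normalized norm difference; the function names and the toy vectors are illustrative.

```python
import math

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (norm(u) * norm(v))

def le_distance(x, y):
    """Assumed I_LE: small (possibly negative) when x is a hyponym of y."""
    return cosine_distance(x, y) + (norm(x) - norm(y)) / (norm(x) + norm(y))

def predict_le(x, y, threshold):
    """Binary prediction: LE holds if the distance is below the threshold."""
    return le_distance(x, y) < threshold

# Same direction, but "animal" has the larger norm.
dog, animal = [1.0, 0.0], [2.0, 0.0]
forward = le_distance(dog, animal)   # negative: dog is-a animal
backward = le_distance(animal, dog)  # positive: animal is not a dog
```

The asymmetry comes entirely from the norm term: swapping the arguments flips its sign while leaving the cosine distance unchanged.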

Evaluation and Results
We extensively evaluate the proposed POSTLE models on two fundamental LE tasks: 1) predicting graded LE and 2) LE detection (and directionality), in monolingual and cross-lingual transfer settings.

Predicting Graded LE
The asymmetric distance $I_{LE}$ can be directly used to make fine-grained graded assertions about the hierarchical relationships between concepts. Following previous work (Nickel and Kiela, 2017), we evaluate graded LE on the standard HyperLex dataset (see footnote 5). HyperLex contains 2,616 word pairs (2,163 noun pairs, the rest are verb pairs) rated by humans by estimating on a [0, 6] scale the degree to which the first concept is a type of the second concept.
5 Graded LE is a phenomenon deeply rooted in cognitive science and linguistics: it captures the notions of concept prototypicality (Rosch, 1973; Medin et al., 1984) and category vagueness (Kamp and Partee, 1995; Hampton, 2007). We refer the reader to the original paper for a more detailed discussion.
Results and Discussion. We evaluate the performance of LE specialization models in a deliberately controlled setup: we (randomly) select a percentage of HyperLex words (0%, 30%, 50%, 70%, 90% and 100%) which are allowed to be seen in the external constraints, and discard the constraints containing other HyperLex words, making them effectively unseen by the initial LEAR model. In the 0% setting all constraints containing any of the HyperLex words have been removed, whereas in the 100% setting, all available constraints are used. The scores are summarized in Figure 2. The 0% setting is especially indicative of POSTLE performance: we notice large gains in performance without seeing a single word from HyperLex in the external resource. This result verifies that the POSTLE models can generalize well to words unseen in the resources. Intuitively, the gap between POSTLE and LEAR is reduced in the settings where LEAR "sees" more words. In the 100% setting we report the same scores for LEAR and POSTLE: this is an artefact of the HyperLex dataset construction as all HyperLex word pairs were sampled from WordNet (i.e., the coverage of test words is 100%). Another finding is that in the resource-leaner 0% and 30% settings POSTLE outperforms two other baselines (Nguyen et al., 2017; Nickel and Kiela, 2017), despite the fact that the two baselines have "seen" all HyperLex words. The results further indicate that POSTLE yields gains on top of different initial distributional spaces. As expected, the scores are higher with the more sophisticated ADV variant.
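The controlled setup can be sketched as a filter over the constraint set. This is an illustrative reconstruction, assuming constraints are word pairs and that the allowed subset of test words is drawn uniformly at random; the function name and the seed handling are hypothetical.

```python
import random

def filter_constraints(constraints, test_words, seen_fraction, seed=0):
    """Keep only constraints that avoid the hidden test words.

    seen_fraction: share of test words allowed to appear in constraints
                   (0.0 hides all of them, 1.0 keeps every constraint).
    Returns (kept_constraints, hidden_words).
    """
    rng = random.Random(seed)
    words = sorted(test_words)
    allowed = set(rng.sample(words, round(seen_fraction * len(words))))
    hidden = set(words) - allowed
    kept = [(a, b) for a, b in constraints
            if a not in hidden and b not in hidden]
    return kept, hidden

constraints = [("dog", "animal"), ("cat", "animal"), ("oak", "tree")]
kept_0, hidden_0 = filter_constraints(constraints, {"dog", "cat"}, 0.0)
kept_100, hidden_100 = filter_constraints(constraints, {"dog", "cat"}, 1.0)
```

With `seen_fraction=0.0` every constraint touching a test word is dropped, so the initial specialization model never sees those words.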

LE Detection
Detection and Directionality Tasks. We now evaluate POSTLE models on three binary classification datasets commonly used for evaluating LE models (Roller et al., 2014; Shwartz et al., 2017; Nguyen et al., 2017), compiled into an integrated benchmark by Kiela et al. (2015). The first task, LE directionality, is evaluated on 1,337 true LE pairs (DBLESS) extracted from BLESS (Baroni and Lenci, 2011). The task tests the models' ability to predict which word in the LE pair is the hypernym. This is simply achieved by taking the word with a larger word vector norm as the hypernym. The second task, LE detection, is evaluated on the WBLESS dataset (Weeds et al., 2014), comprising 1,668 word pairs standing in one of several lexical relations (LE, meronymy-holonymy, co-hyponymy, reverse LE, and no relation). The models have to distinguish true LE pairs from pairs that stand in other relations (including the reverse LE). We score all pairs using the $I_{LE}$ distance. Following Nguyen et al. (2017), we find the threshold $t$ via cross-validation. Finally, we evaluate LE detection and directionality simultaneously on BIBLESS, a relabeled variant of WBLESS. The task is to detect true LE pairs (including the reverse LE pairs), and also to determine the relation directionality. We again use $I_{LE}$ to detect LE pairs, and then compare the vector norms to select the hypernym.
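Two of the decision rules above are simple enough to sketch directly: picking the hypernym by norm, and tuning the detection threshold on held-out scores. The threshold search below covers a single tuning fold only (the surrounding cross-validation loop is omitted), and all names are illustrative.

```python
import math

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def predict_hypernym(word_a, vec_a, word_b, vec_b):
    """Directionality: the word with the larger vector norm is the hypernym."""
    return word_a if norm(vec_a) > norm(vec_b) else word_b

def best_threshold(distances, labels):
    """Pick the threshold t that maximizes accuracy of the rule
    'LE holds if distance < t' on one tuning fold."""
    candidates = sorted(set(distances)) + [max(distances) + 1.0]

    def accuracy(t):
        correct = sum((d < t) == y for d, y in zip(distances, labels))
        return correct / len(labels)

    return max(candidates, key=accuracy)

# Toy fold: the two smallest distances belong to true LE pairs.
tuned = best_threshold([-0.3, -0.1, 0.4, 0.8], [True, True, False, False])
```

In the toy fold, the chosen threshold separates the true LE pairs from the rest with no errors.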
For all three tasks, we consider two evaluation settings: 1) in the FULL setting we use all available lexical constraints (see §3) for the initial LEAR specialization; 2) in the DISJOINT setting, we remove all constraints that contain any of the test words, making all test words effectively unseen by LEAR.
Results and Discussion. The accuracy scores on *BLESS test sets are provided in Table 1. Our POSTLE models display exactly the same performance as LEAR in the FULL setting: this is simply because all words found in *BLESS datasets are covered by the lexical constraints, so POSTLE never has to generalize the initial LEAR transformation to unseen test words. In the DISJOINT setting, however, LEAR is left "blind" as it has not seen a single test word in the constraints: it leaves distributional vectors of *BLESS test words unchanged. In this setting, LEAR performance is equivalent to that of the original distributional space. In contrast, learning to generalize the LE specialization function from LEAR-specializations of other words, POSTLE models are able to successfully LE-specialize vectors of test *BLESS words. As in the graded LE, the adversarial POSTLE architecture outperforms the simpler DFFN model.

Cross-Lingual Transfer
Finally, we evaluate cross-lingual transfer of LE specialization. We train POSTLE models using distributional (FASTTEXT) English (EN) vectors as input. Afterwards, we apply those models to the distributional vector spaces of two other languages, French (FR) and Spanish (ES), after mapping them into the same space as English as described in §2.4. We experiment with several methods to induce cross-lingual word embeddings: 1) MUSE, an adversarial unsupervised model fine-tuned with the closed-form Procrustes solution (Conneau et al., 2018); 2) an unsupervised self-learning algorithm that iteratively bootstraps new bilingual seeds, initialized according to structural similarities of the monolingual spaces (Artetxe et al., 2018); 3) an orthogonal linear mapping with inverse softmax, supervised by 5K bilingual seeds (Smith et al., 2017).
We test POSTLE-specialized Spanish and French word vectors on WN-Hy-ES and WN-Hy-FR, two equally sized datasets (148K word pairs) created by Glavaš and Ponzetto (2017) using the ES WordNet (Gonzalez-Agirre et al., 2012) and the FR WordNet (Sagot and Fišer, 2008). We perform a ranking evaluation: the aim is to rank LE pairs above pairs standing in other relations (meronyms, synonyms, antonyms, and reverse LE). We rank word pairs in the ascending order based on $I_{LE}$ (see Eq. (10)).
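The ranking evaluation can be illustrated with a standard average precision computation over pairs sorted by ascending distance. This is a generic sketch of the metric, not the evaluation script used for the reported scores; the function name is illustrative.

```python
def average_precision(distances, labels):
    """AP for a ranking in ascending order of distance.

    distances: one score per candidate pair (lower = more LE-like).
    labels:    True for true LE pairs, False for other relations.
    """
    ranked = sorted(zip(distances, labels), key=lambda item: item[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, is_le) in enumerate(ranked, start=1):
        if is_le:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / hits if hits else 0.0

# True LE pairs at ranks 1 and 3: AP = (1/1 + 2/3) / 2.
ap = average_precision([0.1, 0.2, 0.3], [True, False, True])
```

A perfect ranking, with all true LE pairs ahead of every other pair, yields an AP of 1.0.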
Results and Discussion. The average precision (AP) ranking scores achieved via cross-lingual transfer of POSTLE are shown in Table 2. We report AP scores using three methods for cross-lingual word embedding induction, and compare their performance to two baselines: 1) random word pair scoring, and 2) the original (FASTTEXT) vectors.
The results uncover the inability of distributional vectors to capture LE: they yield lower performance than the random baseline, which strongly emphasizes the need for the LE-specialization. The transferred POSTLE yields an immense improvement over the distributional baselines (up to +0.428, i.e. +118%). Again, the adversarial architecture surpasses DFFN across the board, with the single exception of EN-ES transfer coupled with Artetxe et al. (2018)'s cross-lingual model. Furthermore, transfers with unsupervised (Ar, Co) and supervised bilingual mapping (Sm) yield comparable performance. This implies that a robust LE-specialization of distributional vectors for languages with no lexico-semantic resources is possible even without any bilingual signal or translation effort.

Related Work
Vector Space Specialization. In general, lexical specialization models fall into two categories: 1) joint optimization models and 2) post-processing or retrofitting models. Joint models integrate external constraints directly into the distributional objective of embedding algorithms such as Skip-Gram and CBOW (Mikolov et al., 2013), or Canonical Correlation Analysis (Dhillon et al., 2015). They either modify the prior or regularization of the objective (Yu and Dredze, 2014;Xu et al., 2014;Kiela et al., 2015) or augment it with factors reflecting external lexical knowledge (Liu et al., 2015;Ono et al., 2015;Osborne et al., 2016;Nguyen et al., 2017). Each joint model is tightly coupled to a specific distributional objective: any change to the underlying distributional model requires a modification of the whole joint model and expensive retraining.
In contrast, retrofitting models (Faruqui et al., 2015; Rothe and Schütze, 2015; Wieting et al., 2015; Nguyen et al., 2016; Mrkšić et al., 2016; Mrkšić et al., 2017) use external constraints to fine-tune distributional spaces post hoc. Effectively, this makes them applicable to any input distributional space, but they modify only vectors of words seen in the external resource. Nonetheless, retrofitting models tend to outperform joint models in the context of both similarity-based (Mrkšić et al., 2016) and LE specialization.
The recent post-specialization paradigm has been so far applied only to the symmetric semantic similarity relation. Prior work generalizes over the retrofitting ATTRACT-REPEL (AR) model (Mrkšić et al., 2017) by learning a global similarity-focused specialization function implemented as a DFFN. Ponti et al. (2018) further propose an adversarial post-specialization architecture. In this work, we show that post-specialization represents a viable methodology for specializing all distributional word vectors for the LE relation as well.
Modeling Lexical Entailment. Extensive research effort in lexical semantics has been dedicated to automatic detection of the fundamental taxonomic LE relation. Early approaches (Weeds et al., 2004;Clarke, 2009;Kotlerman et al., 2010;Lenci and Benotto, 2012, inter alia) detected LE word pairs by means of asymmetric direction-aware mechanisms such as distributional inclusion hypothesis (Geffet and Dagan, 2005), and concept informativeness and generality (Herbelot and Ganesalingam, 2013;Santus et al., 2014;Shwartz et al., 2017), but were surpassed by more recent methods that leverage word embeddings.
Embedding-based methods either 1) induce LE-oriented vector spaces using text (Vilnis and McCallum, 2015; Vendrov et al., 2016; Henderson and Popa, 2016; Nguyen et al., 2017; Chang et al., 2018) and/or external hierarchies (Nickel and Kiela, 2017, 2018; Sala et al., 2018) or 2) use distributional vectors as features for supervised LE detection models (Baroni et al., 2012; Tuan et al., 2016; Shwartz et al., 2016; Glavaš and Ponzetto, 2017; Rei et al., 2018). Our POSTLE method belongs to the first group. Prior work proposed LEAR, a retrofitting LE model which displays performance gains on a spectrum of graded and ungraded LE evaluations compared to joint specialization models (Nguyen et al., 2017). However, LEAR still specializes only the vectors of words seen in external resources. The same limitation holds for a family of recent models that embed concept hierarchies (i.e., trees or directed acyclic graphs) in hyperbolic spaces (Nickel and Kiela, 2017; Chamberlain et al., 2017; Nickel and Kiela, 2018; Sala et al., 2018; Ganea et al., 2018). Although hyperbolic spaces are arguably more suitable for embedding hierarchies than the Euclidean space, the "Euclidean-based" LEAR has been shown to outperform the hyperbolic embedding of the WordNet hierarchy across a range of LE tasks.
The proposed POSTLE framework 1) mitigates the limited coverage issue of retrofitting LE-specialization models, and 2) removes the problem of dependence on the distributional objective in joint models. Unlike retrofitting models, POSTLE LE-specializes vectors of all vocabulary words, and unlike joint models, it is computationally inexpensive and applicable to any distributional vector space.

Conclusion
We have presented POSTLE, a novel neural post-specialization framework that specializes distributional vectors of all words, including the ones unseen in external lexical resources, to accentuate the hierarchical asymmetric lexical entailment (LE or hyponymy-hypernymy) relation. The benefits of our two all-words POSTLE model variants have been shown across a range of graded and binary LE detection tasks on standard benchmarks. What is more, we have indicated the usefulness of the POSTLE paradigm for zero-shot cross-lingual LE specialization of word vectors in target languages, even without having any external lexical knowledge in the target. In future work, we will experiment with more sophisticated neural architectures, other resource-lean languages, and bootstrapping approaches to LE specialization. Code and POSTLE-specialized vectors are available at: [https://github.com/ashkamath/POSTLE].