Generalized Tuning of Distributional Word Vectors for Monolingual and Cross-Lingual Lexical Entailment

Lexical entailment (LE; also known as hyponymy-hypernymy or is-a relation) is a core asymmetric lexical relation that supports tasks like taxonomy induction and text generation. In this work, we propose a simple and effective method for fine-tuning distributional word vectors for LE. Our Generalized Lexical ENtailment model (GLEN) is decoupled from the word embedding model and applicable to any distributional vector space. Yet – unlike existing retrofitting models – it captures a general specialization function allowing for LE-tuning of the entire distributional space and not only the vectors of words seen in lexical constraints. Coupled with a multilingual embedding space, GLEN seamlessly enables cross-lingual LE detection. We demonstrate the effectiveness of GLEN in graded LE and report large improvements (over 20% in accuracy) over state-of-the-art in cross-lingual LE detection.

Embedding specialization methods remedy the semantic vagueness of distributional spaces by forcing the vectors to conform to external linguistic constraints (e.g., synonymy or LE word pairs), in order to emphasize the lexico-semantic relation of interest (e.g., semantic similarity or LE) and diminish the contributions of other types of semantic association. Lexical specialization models generally belong to one of two families: (1) joint optimization models and (2) retrofitting (also known as fine-tuning or post-processing) models. Joint models incorporate linguistic constraints directly into the objective of an embedding model, e.g., Skip-Gram (Mikolov et al., 2013), by modifying the prior or regularization of the objective (Yu and Dredze, 2014; Xu et al., 2014; Kiela et al., 2015) or by augmenting the objective with additional factors reflecting linguistic constraints (Ono et al., 2015; Osborne et al., 2016; Nguyen et al., 2017). Joint models are tightly coupled to a concrete embedding model: any modification to the underlying embedding model warrants a modification of the whole joint model, along with expensive retraining. Conversely, retrofitting models (Faruqui et al., 2015; Wieting et al., 2015; Nguyen et al., 2016; Mrkšić et al., 2017; Vulić and Mrkšić, 2018, inter alia) change the distributional spaces post-hoc, by fine-tuning word vectors so that they conform to external linguistic constraints. This makes retrofitting models more flexible, as they can be applied to any pre-trained distributional space. On the downside, retrofitting models specialize only the vectors of words seen in constraints, leaving vectors of unseen words unchanged.
In this work, we propose an LE-specialization framework that combines the strengths of both model families: unlike joint models, our generalized LE specialization (dubbed GLEN) is easily applicable to any embedding space. Yet, unlike the retrofitting models, it LE-specializes the entire distributional space and not just the vectors of words from external constraints. GLEN utilizes linguistic constraints as training examples in order to learn a general LE-specialization function (instantiated simply as a feed-forward neural net), which can then be applied to the entire distributional space. The difference between LE-retrofitting and GLEN is illustrated in Figure 1. Moreover, with GLEN's ability to LE-specialize unseen words, we can seamlessly LE-specialize word vectors of another language (L2), provided we first project them to the distributional space of L1 for which we have learned the specialization function. To this end, we can leverage any of the many resource-lean methods for learning the cross-lingual projection (function g in Figure 1) between monolingual distributional vector spaces (Smith et al., 2017; Conneau et al., 2018; Artetxe et al., 2018, inter alia). Conceptually, GLEN is similar to the explicit retrofitting model of , who focus on the symmetric semantic similarity relation. In contrast, GLEN has to account for the asymmetric nature of the LE relation.
Figure 1: High-level illustration of GLEN. Row #1: LE-retrofitting specializes only vectors of constraint words (from language L1); Row #2: GLEN learns the specialization function f using constraints (from L1) as supervision; Row #3: cross-lingual GLEN, i.e., LE-tuning of vectors from language L2, with f applied to L2 vectors projected (function g) to the L1 embedding space.
Besides joint (Nguyen et al., 2017) and retrofitting models for LE, there are a number of supervised LE detection models that employ distributional vectors as input features (Tuan et al., 2016; Shwartz et al., 2016; Glavaš and Ponzetto, 2017; Rei et al., 2018). These models, however, only predict LE for pairs of words; they do not produce LE-specialized word vectors that can be plugged directly into downstream models.

Generalized Lexical Entailment
Following LEAR, the state-of-the-art LE-retrofitting model, we use three types of linguistic constraints to learn the general specialization function $f$: synonyms, antonyms, and LE (i.e., hyponym-hypernym) pairs. Similarity-focused specialization models tune only the direction of distributional vectors (Mrkšić et al., 2017; Ponti et al., 2018). In LE-specialization we need to emphasize similarities but also reflect the hierarchy of concepts offered by LE relations (e.g., car should be similar to both Ferrari and vehicle but is a hyponym only of vehicle). GLEN learns a specialization function $f$ that rescales vector norms in order to reflect the hierarchical LE relation. To this end, we use the following asymmetric distance between vectors, defined in terms of their Euclidean norms:
$$d_N(x_1, x_2) = \frac{\|x_1\| - \|x_2\|}{\|x_1\| + \|x_2\|}$$
Simultaneously, GLEN aims to bring the vectors of synonyms and LE pairs closer together in direction and to push the vectors of antonyms further apart. We use the cosine distance $d_C$ as a symmetric measure of direction (dis)similarity between vectors. We combine the asymmetric distance $d_N$ and the symmetric distance $d_C$ in different objective functions that we optimize to learn the LE-specialization function $f$.
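The two distances can be illustrated with a minimal NumPy sketch. It assumes that $d_N$ is the normalized difference of Euclidean norms, $(\|x_1\| - \|x_2\|) / (\|x_1\| + \|x_2\|)$, and that $d_C$ is one minus the cosine similarity; the function names `d_norm` and `d_cos` and the toy vectors are hypothetical and chosen only for illustration.

```python
import numpy as np

def d_norm(x1, x2):
    """Asymmetric norm distance (assumed form): (||x1|| - ||x2||) / (||x1|| + ||x2||)."""
    n1, n2 = np.linalg.norm(x1), np.linalg.norm(x2)
    return (n1 - n2) / (n1 + n2)

def d_cos(x1, x2):
    """Symmetric cosine distance: 1 - cos(x1, x2)."""
    return 1.0 - np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))

# Toy vectors: same direction, but the (hypothetical) hypernym has the larger norm.
car = np.array([1.0, 0.0])
vehicle = np.array([3.0, 0.0])

print(d_norm(car, vehicle))  # negative: the smaller-norm vector comes first
print(d_norm(vehicle, car))  # same magnitude, opposite sign
print(d_cos(car, vehicle))   # zero: identical direction
```

Under this assumed form, swapping the arguments of `d_norm` flips the sign, which is what lets the norm encode the direction of the hierarchy while `d_cos` stays symmetric.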

Lexical Constraints as Training Instances.
For each constraint type (synonyms, antonyms, and LE pairs) we create separate batches of training instances. Let $\{(x_1^E, x_2^E)\}_K$, $\{(x_1^S, x_2^S)\}_K$, and $\{(x_1^A, x_2^A)\}_K$ be the batches of $K$ LE, synonymy, and antonymy pairs, respectively. For each constraint $(x_1, x_2)$ we create a pair of negative vectors $(y_1, y_2)$ such that $y_1$ is the vector within the batch (except $x_2$) closest to $x_1$, and $y_2$ is the vector closest to $x_2$ (but not $x_1$), in terms of some distance or similarity metric. For LE constraints, we find $y_1$ and $y_2$ that minimize $d_N(x_1, y_1) + d_C(x_1, y_1)$ and $d_N(y_2, x_2) + d_C(x_2, y_2)$, respectively. Intuitively, we want our model to predict a smaller LE distance $d_N + d_C$ for a positive LE pair $(x_1, x_2)$ than for the negative pairs $(x_1, y_1)$ and $(y_2, x_2)$ in the specialized space. By choosing the most challenging negative pairs, i.e., $y_1$ and $y_2$ that are respectively closest to $x_1$ and $x_2$ in terms of LE distance in the distributional space, we force our model to learn a more robust LE-specialization function (this is further elaborated in the description of the objective function). Analogously, for positive synonym pairs, $y_1$ and $y_2$ are the vectors closest to $x_1$ and $x_2$, respectively, but in terms of only the (symmetric) cosine distance $d_C$. Finally, for antonyms, $y_1$ is the vector maximizing $d_C(x_1, y_1)$ and $y_2$ the vector that maximizes $d_C(x_2, y_2)$. In this case, we want the vectors of antonyms $x_1$ and $x_2$ after specialization to be further apart from one another (according to $d_C$) than from, respectively, the vectors $y_1$ and $y_2$ that are most distant from them in the original distributional space. A training batch with $K$ entailment (E), synonymy (S), or antonymy (A) instances is obtained by coupling constraints $(x_1, x_2)$ with their negative vectors $(y_1, y_2)$:
$$B_{\{E,S,A\}} = \{(x_1^k, x_2^k, y_1^k, y_2^k)\}_{k=1}^{K}$$
Specialization Function. The parametrized specialization function $f(x; \theta): \mathbb{R}^d \to \mathbb{R}^d$ (with $d$ being the embedding size) transforms the distributional space into a space that better captures the LE relation.
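The in-batch selection of hard negatives for LE constraints can be sketched as follows. This is an illustrative sketch rather than the released implementation: it assumes the normalized norm-difference form of $d_N$, treats all 2K vectors of a batch as the candidate pool, and excludes both members of the current constraint from that pool. The names `le_negatives`, `d_norm`, and `d_cos` are hypothetical.

```python
import numpy as np

def d_norm(a, b):
    """Assumed asymmetric norm distance: (||a|| - ||b||) / (||a|| + ||b||)."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return (na - nb) / (na + nb)

def d_cos(a, b):
    """Cosine distance: 1 - cos(a, b)."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def le_negatives(X1, X2):
    """Pick hard negatives for K LE constraints.

    X1, X2: (K, d) arrays; row k holds the two vectors of constraint k.
    Returns Y1, Y2 of shape (K, d): the in-batch vectors with the smallest
    LE distance to x1 (as second argument) and to x2 (as first argument).
    """
    K = X1.shape[0]
    pool = np.vstack([X1, X2])  # all 2K vectors in the batch
    Y1, Y2 = [], []
    for k in range(K):
        # Candidates: every batch vector except x1_k and x2_k themselves.
        cand = [i for i in range(2 * K) if i not in (k, K + k)]
        s1 = [d_norm(X1[k], pool[i]) + d_cos(X1[k], pool[i]) for i in cand]
        s2 = [d_norm(pool[i], X2[k]) + d_cos(X2[k], pool[i]) for i in cand]
        Y1.append(pool[cand[int(np.argmin(s1))]])
        Y2.append(pool[cand[int(np.argmin(s2))]])
    return np.array(Y1), np.array(Y2)

rng = np.random.default_rng(0)
X1 = rng.normal(size=(4, 5))
X2 = rng.normal(size=(4, 5))
Y1, Y2 = le_negatives(X1, X2)
print(Y1.shape, Y2.shape)
```

For synonym batches the same loop would use only `d_cos` with `argmin`, and for antonym batches `d_cos` with `argmax`, following the description above.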
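Once we learn the specialization function $f$ (i.e., we tune the parameters $\theta$), we can LE-specialize the entire distributional embedding space $X$ (i.e., the vectors of all vocabulary words): $X' = f(X; \theta)$. For simplicity, we define $f$ to be a (fully-connected) feed-forward net with $H$ hidden layers of size $d_h$ and non-linear activation $\psi$. The $i$-th hidden layer ($i \in \{1, \ldots, H\}$) is parametrized by the weight matrix $W^i$ and the bias vector $b^i$:
$$h^i(x; \theta_i) = \psi\left(h^{i-1}(x; \theta_{i-1})\, W^i + b^i\right) \qquad (2)$$
The 0-th "hidden layer" is the input distributional vector: $h^0(x; \theta_0) = x$ and $\theta_0 = \emptyset$, following the notation of Eq. (2).
Objectives and Training. We define four losses which we combine into training objectives for different constraint types (E, S, and A). In what follows, $x' = f(x; \theta)$ denotes a specialized vector. The asymmetric loss $l_a$ forces the asymmetric margin-based distance $d_N$ to be larger for negative pairs $(x_1, y_1)$ and $(y_2, x_2)$ than for the positive (true LE) pair $(x_1, x_2)$ by at least the margin $\delta_a$:
$$l_a = \sum_{k=1}^{K} \tau\left(\delta_a - d_N(x_1'^k, y_1'^k) + d_N(x_1'^k, x_2'^k)\right) + \tau\left(\delta_a - d_N(y_2'^k, x_2'^k) + d_N(x_1'^k, x_2'^k)\right)$$
where $\tau(x) = \max(0, x)$ is the ramp function. The similarity loss $l_s$ pushes the vectors $x_1$ and $x_2$ to be direction-wise closer to each other than to the negative vectors $y_1$ and $y_2$, by the margin $\delta_s$:
$$l_s = \sum_{k=1}^{K} \tau\left(\delta_s - d_C(x_1'^k, y_1'^k) + d_C(x_1'^k, x_2'^k)\right) + \tau\left(\delta_s - d_C(x_2'^k, y_2'^k) + d_C(x_1'^k, x_2'^k)\right)$$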
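The forward pass of the specialization function can be sketched in a few lines of NumPy. This is only an illustration of the layer recursion $h^i = \psi(h^{i-1} W^i + b^i)$: the parameters are randomly initialized rather than trained, dropout is omitted, and the helper names `init_params` and `specialize` are hypothetical.

```python
import numpy as np

def init_params(d, d_h, H, rng):
    """Create H (weight, bias) pairs; the first layer maps d -> d_h, the rest d_h -> d_h."""
    sizes = [d] + [d_h] * H
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def specialize(X, params, psi=np.tanh):
    """Apply f row-wise: h^0 = X, then h^i = psi(h^{i-1} W^i + b^i) for each layer."""
    h = X
    for W, b in params:
        h = psi(h @ W + b)
    return h

rng = np.random.default_rng(0)
params = init_params(d=8, d_h=8, H=5, rng=rng)  # toy sizes for the sketch
X = rng.normal(size=(3, 8))                     # three toy "word vectors"
X_spec = specialize(X, params)
print(X_spec.shape)
```

Because `specialize` takes an arbitrary matrix of row vectors, the same call covers constraint words and words never seen in any constraint.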
The dissimilarity loss $l_d$ pushes vectors $x_1$ and $x_2$ further away from each other than from the respective negative vectors $y_1$ and $y_2$, by the margin $\delta_d$:
$$l_d = \sum_{k=1}^{K} \tau\left(\delta_d - d_C(x_1'^k, x_2'^k) + d_C(x_1'^k, y_1'^k)\right) + \tau\left(\delta_d - d_C(x_1'^k, x_2'^k) + d_C(x_2'^k, y_2'^k)\right)$$
We also define the regularization loss $l_r$, which prevents $f$ from destroying the useful semantic information contained in distributional vectors:
$$l_r = \sum_{k=1}^{K} \|x_1^k - x_1'^k\|_2 + \|x_2^k - x_2'^k\|_2$$
Finally, we define different objectives for different constraint types (E, S, and A):
$$J_E = l_s(B_E) + \lambda_a\, l_a(B_E) + \lambda_r\, l_r(B_E)$$
$$J_S = l_s(B_S) + \lambda_r\, l_r(B_S)$$
$$J_A = l_d(B_A) + \lambda_r\, l_r(B_A)$$
where $\lambda_a$ and $\lambda_r$ scale the contributions of the asymmetric and regularization losses, respectively. $J_E$ pushes LE vectors to be similar in direction (loss $l_s$) and different in norm (loss $l_a$) after specialization. $J_S$ forces vectors of synonyms to be closer together (loss $l_s$) and $J_A$ forces vectors of antonyms to be further apart (loss $l_d$) in direction after specialization, both without affecting vector norms. We tune the hyperparameters ($\delta_a$, $\delta_s$, $\delta_d$, $\lambda_a$, and $\lambda_r$) via cross-validation, with train and validation portions containing randomly shuffled E, S, and A batches.
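The four losses and three objectives can be written down compactly as a sketch. The hinge forms below are reconstructed from the verbal description (margin-based, with the ramp function $\tau$), and $d_N$ is assumed to be the normalized norm difference; function names are hypothetical, and the default margins and weights simply mirror the hyperparameter values reported later in the paper. The sketch only evaluates the objectives on already-specialized vectors and does not perform gradient-based training.

```python
import numpy as np

def d_norm(A, B):
    """Row-wise assumed norm distance: (||a|| - ||b||) / (||a|| + ||b||)."""
    na, nb = np.linalg.norm(A, axis=1), np.linalg.norm(B, axis=1)
    return (na - nb) / (na + nb)

def d_cos(A, B):
    """Row-wise cosine distance."""
    return 1.0 - np.sum(A * B, axis=1) / (np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1))

def ramp(z):
    return np.maximum(0.0, z)

def loss_asym(x1s, x2s, y1s, y2s, delta_a):
    pos = d_norm(x1s, x2s)
    return np.sum(ramp(delta_a - d_norm(x1s, y1s) + pos) + ramp(delta_a - d_norm(y2s, x2s) + pos))

def loss_sim(x1s, x2s, y1s, y2s, delta_s):
    pos = d_cos(x1s, x2s)
    return np.sum(ramp(delta_s - d_cos(x1s, y1s) + pos) + ramp(delta_s - d_cos(x2s, y2s) + pos))

def loss_dissim(x1s, x2s, y1s, y2s, delta_d):
    pos = d_cos(x1s, x2s)
    return np.sum(ramp(delta_d - pos + d_cos(x1s, y1s)) + ramp(delta_d - pos + d_cos(x2s, y2s)))

def loss_reg(x1, x2, x1s, x2s):
    """Distance between original (x) and specialized (xs) vectors."""
    return np.sum(np.linalg.norm(x1 - x1s, axis=1) + np.linalg.norm(x2 - x2s, axis=1))

def objective(kind, x1, x2, x1s, x2s, y1s, y2s,
              delta_a=1.0, delta_s=0.5, delta_d=0.5, lam_a=2.0, lam_r=1.0):
    """J_E, J_S, or J_A for one batch; kind is 'E', 'S', or 'A'."""
    reg = lam_r * loss_reg(x1, x2, x1s, x2s)
    if kind == "E":
        return loss_sim(x1s, x2s, y1s, y2s, delta_s) + lam_a * loss_asym(x1s, x2s, y1s, y2s, delta_a) + reg
    if kind == "S":
        return loss_sim(x1s, x2s, y1s, y2s, delta_s) + reg
    return loss_dissim(x1s, x2s, y1s, y2s, delta_d) + reg

# One toy constraint; specialized vectors equal the originals, so l_r = 0.
x1 = np.array([[1.0, 0.0]]); x2 = np.array([[3.0, 0.0]])
y1 = np.array([[0.0, 1.0]]); y2 = np.array([[0.0, 3.0]])
print(objective("E", x1, x2, x1, x2, y1, y2))
print(objective("S", x1, x2, x1, x2, y1, y2))
print(objective("A", x1, x2, x1, x2, y1, y2))
```

In the toy batch the positive pair already points in the same direction, so the similarity loss vanishes, while the asymmetric loss stays positive until the norm gap exceeds the margin.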
Inference. We infer the strength of the LE relation between vectors $x_1' = f(x_1)$ and $x_2' = f(x_2)$ with an asymmetric LE distance combining $d_C$ and $d_N$:
$$I_{LE}(x_1, x_2) = d_C(x_1', x_2') + d_N(x_1', x_2')$$
True LE pairs should have a small $d_C$ and a negative $d_N$. We thus rank LE candidate word pairs according to their $I_{LE}$ scores, from smallest to largest. For binary LE detection, $I_{LE}$ is binarized via a threshold $t$: if $I_{LE} < t$, we predict that LE holds.
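Scoring and thresholding can be sketched as below, assuming $I_{LE}$ is the plain sum of the cosine distance and the (assumed normalized) norm distance between already-specialized vectors. The function names, the toy vectors, and the threshold value `t=0.0` are hypothetical; in practice the threshold would be tuned on held-out data.

```python
import numpy as np

def le_score(X1s, X2s):
    """Row-wise I_LE = d_C + d_N on specialized vectors (assumed forms of both distances)."""
    n1, n2 = np.linalg.norm(X1s, axis=1), np.linalg.norm(X2s, axis=1)
    d_c = 1.0 - np.sum(X1s * X2s, axis=1) / (n1 * n2)
    d_n = (n1 - n2) / (n1 + n2)
    return d_c + d_n

def detect_le(X1s, X2s, t):
    """Binary LE detection: predict that LE holds when I_LE < t."""
    return le_score(X1s, X2s) < t

# Toy specialized vectors: pair 0 is (hyponym, hypernym), pair 1 is the reverse order.
X1s = np.array([[1.0, 0.0], [3.0, 0.0]])
X2s = np.array([[3.0, 0.0], [1.0, 0.0]])

scores = le_score(X1s, X2s)
ranking = np.argsort(scores)          # smallest score first = strongest LE candidate
decisions = detect_le(X1s, X2s, t=0.0)
print(scores, ranking, decisions)
```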
Cross-Lingual (CL) LE Specialization. After learning the generalized LE-specialization function $f$, we can apply it to specialize any vector that comes from the same distributional vector space that we used in training. Let $L_1$ be the language for which we have the linguistic constraints and let $X_{L1}$ be its corresponding distributional space. Let $X_{L2}$ be the distributional space of another language $L_2$. Assuming a function $g: \mathbb{R}^{d_{L2}} \to \mathbb{R}^{d_{L1}}$ that projects vectors from $X_{L2}$ to $X_{L1}$, we can straightforwardly LE-specialize the distributional space of $L_2$ by composing functions $f$ and $g$:
$$X'_{L2} = f(g(X_{L2}); \theta)$$
Recently, a large number of projection-based models have been proposed for inducing bilingual word embedding spaces (Smith et al., 2017; Conneau et al., 2018; Artetxe et al., 2018; Ruder et al., 2018a; Joulin et al., 2018, inter alia), most of them requiring limited (word-level) or no bilingual supervision. Based on a few thousand (manually created or automatically induced) word-translation pairs, these models learn a linear mapping $W_g$ that projects the vectors from $X_{L2}$ to the space $X_{L1}$: $g(X_{L2}) = X_{L2} W_g$. The cross-lingual space is then given as $X_{L1} \cup X_{L2} W_g$. Due to its simplicity and robust downstream performance, we opt for the simple supervised learning of the cross-lingual projection matrix $W_g$ (Smith et al., 2017) based on the (closed-form) solution of the Procrustes problem (Schönemann, 1966). Let $X_S \subset X_{L2}$ and $X_T \subset X_{L1}$ be the subsets of the two monolingual embedding spaces containing (row-aligned) vectors of word translations. We then obtain the projection matrix as $W_g = UV^\top$, where $U \Sigma V^\top$ is the singular value decomposition of the product matrix $X_S^\top X_T$.
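The closed-form Procrustes solution and its composition with a specialization function can be sketched as follows. The sketch assumes the row-vector convention used above (row-aligned translation matrices, projection by right-multiplication), under which the orthogonal map minimizing $\|X_S W - X_T\|_F$ is obtained from the SVD of $X_S^\top X_T$. The names `procrustes` and `specialize_l2` are hypothetical, and the data are synthetic.

```python
import numpy as np

def procrustes(X_S, X_T):
    """Orthogonal W minimizing ||X_S W - X_T||_F for row-aligned X_S, X_T of shape (n, d)."""
    U, _, Vt = np.linalg.svd(X_S.T @ X_T)
    return U @ Vt

def specialize_l2(X_L2, W_g, f):
    """Cross-lingual specialization: project L2 vectors into the L1 space, then apply f."""
    return f(X_L2 @ W_g)

# Synthetic check: the "target" space is an exact rotation of the "source" space.
rng = np.random.default_rng(0)
X_S = rng.normal(size=(50, 5))
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # random orthogonal matrix
X_T = X_S @ Q

W_g = procrustes(X_S, X_T)
print(np.allclose(W_g, Q))

# Placeholder f (element-wise tanh) standing in for a trained specialization function.
X_L2_spec = specialize_l2(X_S, W_g, np.tanh)
print(X_L2_spec.shape)
```

Restricting $W_g$ to be orthogonal preserves vector norms and angles within the projected space, so the projection itself does not distort the quantities that $d_N$ and $d_C$ measure.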

Evaluation
Experimental Setup. We work with Wikipedia-trained FASTTEXT embeddings (Bojanowski et al., 2017). We take English constraints from previous work: synonyms and antonyms were created from WordNet and Roget's Thesaurus (Zhang et al., 2014; Ono et al., 2015); LE constraints were collected from WordNet and contain both direct and transitively obtained LE pairs. We retain the constraints for which both words exist in the trimmed (200K) FASTTEXT vocabulary, resulting in a total of 1,493,686 LE, 521,037 synonym, and 141,311 antonym pairs. We reserve 4,000 constraints (E: 2k, S: 1k, A: 1k) for validation and use the rest for training. We identify the following best hyperparameter configuration via grid search: $H = 5$, $d_h = 300$, $\psi = \tanh$, $\delta_a = 1$, $\delta_s = \delta_d = 0.5$, $\lambda_a = 2$, and $\lambda_r = 1$. We apply dropout (keep rate 0.5) to each hidden layer of $f$. We train in mini-batches of $K = 50$ constraints and optimize with the Adam algorithm (Kingma and Ba, 2015), with an initial learning rate of $10^{-4}$.

Graded Lexical Entailment
We use $I_{LE}$ to predict the strength of LE between words. We evaluate GLEN against the state-of-the-art LE-retrofitting model LEAR on the HyperLex dataset, which contains 2,616 word pairs (83% nouns, 17% verbs) judged (0-6 scale) by human annotators for the degree to which the LE relation holds. We evaluate the models in a deliberately controlled setup: we (randomly) select a subset of HyperLex words (0%, 10%, 30%, 50%, 70%, 90%, and 100%) that we allow models to "see" in the constraints, removing constraints with any other HyperLex word.
Results and Discussion. The graded LE performance is shown in Table 1 for all seven setups. Graded LE results suggest that GLEN is robust and generalizes well to unseen words: the drop in performance between the 0% and 100% setups is a mere 4% for GLEN (compared to a 50% drop for LEAR). Results in the 0% setting, in which GLEN improves over the distributional space by more than 30 points, most clearly demonstrate its effectiveness. GLEN, however, lags behind LEAR in setups where LEAR has seen 70% or more of the test words. This is intuitive: LEAR specializes the vector of each particular word using only the constraints containing that word; this gives LEAR higher specialization flexibility at the expense of generalization ability. In contrast, GLEN's specialization function is affected by all constraints and has to work for all words; GLEN trades the effectiveness of LEAR's word-specific updates for seen words for the ability to generalize over unseen words. In other words, there is a trade-off between the ability to generalize the LE-specialization over unseen words and the performance for seen words. Put differently, by learning a general specialization function, i.e., by using linguistic constraints merely as training instances, GLEN is prevented from "overfitting" to seen words.
Evaluation settings like our 90% or 100% settings, in which GLEN is outperformed by a pure retrofitting model, are however unrealistic in view of downstream tasks: for any concrete downstream task (e.g., textual entailment or taxonomy induction), it is highly unlikely that the LE-specialization model will have seen almost all of the test words (words for which LE inference is required) in its training linguistic constraints; this is why GLEN's ability to generalize LE-specialization to unseen words (as indicated by 0%-50% settings) is particularly important.
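The controlled setup used in this section (allowing the models to see only a chosen fraction of test words in the constraints) can be sketched as a simple filtering step. This is an illustrative sketch only: the function name `filter_constraints`, the seeding, and the toy word pairs are hypothetical and are not taken from the released code or from HyperLex.

```python
import random

def filter_constraints(constraints, test_words, seen_fraction, seed=0):
    """Keep only constraints that contain no 'hidden' test word.

    constraints: list of (word1, word2) pairs.
    test_words: set of words occurring in the evaluation dataset.
    seen_fraction: share of test words the model is allowed to see (0.0 to 1.0).
    Returns the retained constraints and the set of seen test words.
    """
    rng = random.Random(seed)
    words = sorted(test_words)  # sort for reproducible sampling
    seen = set(rng.sample(words, round(seen_fraction * len(words))))
    hidden = set(words) - seen
    kept = [(a, b) for a, b in constraints if a not in hidden and b not in hidden]
    return kept, seen

# Toy data, for illustration only.
constraints = [("cat", "animal"), ("dog", "animal"), ("car", "vehicle"), ("tree", "plant")]
test_words = {"cat", "dog", "car"}

kept_0, seen_0 = filter_constraints(constraints, test_words, 0.0)
kept_100, seen_100 = filter_constraints(constraints, test_words, 1.0)
print(kept_0)
print(len(kept_100))
```

In the 0% setting every constraint touching a test word is removed, so any gain on the test pairs must come from the learned function generalizing to unseen words.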

Cross-Lingual LE Detection
Neither joint (Nguyen et al., 2017) nor retrofitting models
We induce the CL embeddings (i.e., learn the projections $W_g$, see Section 2) by projecting AR, FR, and RU embeddings to the EN space in a supervised fashion, by finding the optimal solution to the Procrustes problem for 5K given word translation pairs (for each language pair). We obtained the translation pairs by automatically translating the 5K most frequent EN words to AR, FR, and RU with Google Translate. We compare GLEN with the more complex models of Upadhyay et al. (2018): they couple two methods for inducing syntactic CL embeddings, CL-DEP (Vulić, 2017) and BI-SPARSE (Vyas and Carpuat, 2016), with an LE scorer based on the distributional inclusion hypothesis (Geffet and Dagan, 2005).
Results. GLEN's cross-lingual LE detection performance is shown in Table 2. GLEN dramatically outperforms the CL LE detection models of Upadhyay et al. (2018), with an average edge of 24% on the HYPO datasets and 16% on the COHYP datasets. This accentuates GLEN's generalization ability: it robustly predicts CL LE, although trained only on EN constraints. GLEN performs better for EN-AR and EN-RU than for EN-FR; we believe this to be merely an artifact of the (rather small) test sets. We find GLEN's CL performance for the more distant language pairs (EN-AR, EN-RU) especially encouraging, as it holds promise of successful transfer of LE-specialization to resource-lean languages lacking external linguistic resources.

Conclusion
We presented GLEN, a general framework for specializing word embeddings for lexical entailment. Unlike existing LE-specialization models (Nguyen et al., 2017), GLEN learns an explicit specialization function using linguistic constraints as training examples. The learned LE-specialization function is then applied to vectors of words (1) unseen in constraints and (2) from different languages. GLEN displays robust graded LE performance and yields massive improvements over the state of the art in cross-lingual LE detection. We next plan to evaluate GLEN on multilingual and cross-lingual graded LE datasets and release a large multilingual repository of LE-specialized embeddings. We make GLEN (code and resources) available at: https://github.com/codogogo/glen.