Cross-lingual Semantic Specialization via Lexical Relation Induction

Semantic specialization integrates structured linguistic knowledge from external resources (such as lexical relations in WordNet) into pretrained distributional vectors in the form of constraints. However, this technique cannot be leveraged in many languages, because their structured external resources are typically incomplete or non-existent. To bridge this gap, we propose a novel method that transfers specialization from a resource-rich source language (English) to virtually any target language. Our specialization transfer comprises two crucial steps: 1) inducing noisy constraints in the target language through automatic word translation; and 2) filtering the noisy constraints via a state-of-the-art relation prediction model trained on the source language constraints. This allows us to specialize any set of distributional vectors in the target language with the refined constraints. We demonstrate the effectiveness of our method through intrinsic word similarity evaluation in 8 languages, and with 3 downstream tasks in 5 languages: lexical simplification, dialog state tracking, and semantic textual similarity. The gains over the previous state-of-the-art specialization methods are substantial and consistent across languages. Our results also suggest that the transfer method is effective even for lexically distant source-target language pairs. Finally, as a by-product, our method produces lists of WordNet-style lexical relations in resource-poor languages.

Semantic specialization techniques are therefore leveraged to stress a relation of interest such as semantic similarity (Wieting et al., 2015; Ponti et al., 2018) or lexical entailment (Nguyen et al., 2017) over other types of semantic association in the word vector space. The best-performing specialization models (cf. Ponti et al. 2018) are executed as vector space post-processors. In short, these techniques force the distributional vectors to conform to external linguistic constraints (e.g., synonymy, meronymy, lexical entailment) extracted from structured external resources (e.g., WordNet, BabelNet) to emphasize the particular relation. As post-processors, they are applicable to any input distributional space.
A critical requirement for all specialization techniques is the set of linguistic constraints drawn from the curated external semantic resource. Such resources contain incomplete information even in resource-rich languages (e.g., English WordNet), while the resources are scarcer or even non-existent for many other languages. A solution was proposed recently to deal with incomplete information in a resource-rich language: the specialization function learned on the subset of words observed in the external resource gets propagated to the entire vocabulary in a step called post-specialization. Yet another fundamental question concerning specialization techniques is still unresolved: how to enable specialization in virtually any language, even when the language completely lacks external lexical resources?
In this work, we therefore propose a novel approach for cross-lingual specialization transfer based on Lexical Relation Induction (CLSRI). CLSRI leverages lexical information from a resource-rich language to enable specialization in any target language, without observing a single lexical constraint in the target language. The transfer method consists of two main steps: 1) We induce a noisy set of constraints in the target language through automatic word translation via a shared cross-lingual word vector space (Joulin et al., 2018). 2) To mitigate the noise from the translation process, the initial set of noisy constraints is then refined in a relation prediction phase: we adjust a state-of-the-art neural method for lexical relation classification (Glavaš and Vulić, 2018a) and use it to predict the validity of each noisy constraint obtained in the first step. Finally, a standard specialization technique (including the post-specialization step) can then be used monolingually in the target language, starting from the set of refined target language constraints.
We verify the usefulness of our specialization transfer method in the intrinsic word similarity task for 8 target languages, followed by 3 downstream tasks in 5 languages: lexical simplification, dialog state tracking, and semantic textual similarity. We observe large improvements over purely distributional word vectors for all target languages and in all tasks. Moreover, we show that the proposed specialization transfer method consistently outperforms the direct specialization transfer based on the composition of the cross-lingual projection and the post-specialization function (Ponti et al., 2018), with substantial gains across all experimental setups. In order to boost the integration of external lexical knowledge into distributional models beyond English, we will release our code and lists of WordNet-style lexical relations generated by our transfer method for all target languages at: https://github.com/cambridgeltl/xling-postspec.

Related Work
Conflating distinct (both paradigmatic and syntagmatic) lexico-semantic relations is a well-known property of distributional word vectors; semantic specialization of such spaces for a particular lexico-semantic relation (e.g., semantic similarity or lexical entailment) benefits a number of tasks, e.g., dialog state tracking (Ponti et al., 2018), spoken language understanding (Kim et al., 2016b,a), text simplification (Glavaš and Vulić, 2018b; Ponti et al., 2018), and cross-lingual transfer of resources (Vulić et al., 2017a).
In contrast, retrofitting (also known as post-processing) methods tune the pre-trained distributional vectors post-hoc based on the provided external constraints. Although joint models specialize the entire space, whereas the first generation of retrofitting models specializes only the vectors of words seen in lexical constraints, the latter yield better downstream performance (Mrkšić et al., 2016). Moreover, while the joint models are tightly coupled to a concrete word embedding objective, retrofitting models can be applied on top of any distributional vector space.
Post-specialization (Ponti et al., 2018; Kamath et al., 2019) is a generalization of retrofitting that specializes the entire distributional space: 1) it learns a global specialization function using before- and after-retrofitting vectors of words from lexical constraints as training examples and 2) it applies the global specialization function to vectors of words unseen in lexical constraints. Similar to retrofitting, post-specialization can be applied to any vector space, but it also (like joint specialization models) specializes the full distributional space.
Since it learns a global and explicit specialization function, post-specialization can be used for cross-lingual specialization transfer. Assuming a shared cross-lingual embedding space, a post-specialization function induced on the source language subspace can be directly applied to the target language subspace (Glavaš and Vulić, 2018b; Ponti et al., 2018). In this work, we propose a different approach: we use a shared cross-lingual space to (noisily) translate lexical constraints from source to target language, and then use a relation-prediction model (trained on the source language constraints) to filter out the invalid target language constraints. This allows for monolingual application of retrofitting or post-specialization in the target language. Our experiments show that the proposed specialization transfer via lexical relation induction (CLSRI) outperforms the previous state-of-the-art specialization transfer method of Ponti et al. (2018).

Figure 1: High-level illustration of our CLSRI framework for semantic specialization.
Step 1: a network of lexical relations in a source language (red dots, left) is translated into a target language (blue dots, right) through a shared vector space (center).
Step 2: a lexical relation classifier (center) trained on vector pairs sampled from the source language (left) prunes the constraints in the noisy target network (right).
Step 3: the refined constraints are used to attract or repel the corresponding vectors (golden edges, left); this transformation is learned by a deep feed-forward network (center) and applied to the full target vector space.

Methodology
CLSRI in a Nutshell. In cross-lingual semantic specialization our goal is to fine-tune the distributional vectors of a target language $L_t$ leveraging structured knowledge in the form of lexical constraints, available only for a resource-rich source language $L_s$. To this end, we propose a two-step translate-and-refine procedure for the induction of target language constraints, described in § 3.1. We first translate words in each $L_s$ constraint by retrieving their nearest neighbour in $L_t$ from a shared cross-lingual $L_s$-$L_t$ embedding space. Such a translation procedure will generate noisy constraints in the target language due to (1) imperfect word translation via the cross-lingual embedding space and (2) polysemy in $L_s$ and translation of incorrect senses of $L_s$ words. We thus subsequently refine the noisy set of target constraints by having a state-of-the-art neural model for lexico-semantic relation prediction (Glavaš and Vulić, 2018a), trained on the $L_s$ constraints, discern valid from invalid $L_t$ constraints.
Following that, we perform monolingual retrofitting and post-specialization in the target language $L_t$, as outlined in § 3.2. The $L_t$ distributional vectors can be specialized with the cleaned $L_t$ constraints using any off-the-shelf retrofitting model (Faruqui et al., 2015; Mrkšić et al., 2016; Lengerich et al., 2018, inter alia). In this work we opt for the best-performing retrofitting model ATTRACT-REPEL (AR) (Vulić et al., 2017b). AR specializes only the words seen in the cleaned $L_t$ constraints. As the final step, we generalize AR's specialization to the entire target vocabulary with a post-specialization model (Ponti et al., 2018) that learns the global specialization function from pairs of distributional and AR-specialized vectors of words from $L_t$ constraints. A visual summary of our transfer model is presented in Figure 1.
Our proposed CLSRI specialization conceptually differs from an existing cross-lingual specialization transfer methodology (Ponti et al., 2018; Glavaš and Vulić, 2018b), in which the global specialization function is learned in the source language $L_s$ and then transferred directly to the target language $L_t$ via a shared cross-lingual embedding space.

Induction and Refinement of Constraints
Step 1: Constraint Translation. Following established methodology, constraints drawn from external resources are usually split into two broad sets: 1) ATTRACT constraints couple words that should have similar representations (e.g., synonyms like complicated and complex or direct hyponym-hypernym pairs like parrot and bird); and 2) REPEL constraints indicate which word pairs should lie far apart in the space (e.g., antonyms like ancient and recent).
Given a set $A_s$ of ATTRACT word pairs and a set $R_s$ of REPEL word pairs, each word pair $(w_s^l, w_s^r)$ from the vocabulary of the source language $V_s$ is automatically translated into the target language with vocabulary $V_t$ using a shared cross-lingual $L_s$-$L_t$ word embedding space. We create the cross-lingual space $X_{CL}$ by learning a linear map $W_{CL}$ that projects the distributional space of the target language $X_t$ to the distributional space $X_s$ of the source language, i.e., $X_{CL} = X_s \cup X_t W_{CL}$. We translate each word $w_s$ from each linguistic constraint in $L_s$ by looking for the nearest neighbour of its vector $x_s$ in the projected target space $X_t W_{CL}$. We employ the recently proposed Relaxed Cross-domain Similarity Local Scaling (RCSLS) model of Joulin et al. (2018) to learn the projection matrix $W_{CL}$ and induce the bilingual space $X_{CL}$. [1]
[1] RCSLS substantially outperforms competing models on the task of bilingual lexicon induction, as shown in a recent comparative study, and has been designed to optimize performance exactly on the word translation task.
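The translation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the projection matrix has already been learned, it uses plain cosine nearest-neighbour retrieval, and the function names, toy vocabularies, and vectors are all hypothetical.

```python
import numpy as np

def unit_rows(matrix):
    """L2-normalise each row so that dot products equal cosine similarities."""
    return matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

def translate_constraints(pairs, src_vocab, X_s, tgt_vocab, X_t, W_cl):
    """Map every source word in `pairs` to its nearest projected target word."""
    src_index = {word: i for i, word in enumerate(src_vocab)}
    src = unit_rows(X_s)
    projected = unit_rows(X_t @ W_cl)  # target vectors mapped into the source space

    def nearest_target(word):
        sims = projected @ src[src_index[word]]
        return tgt_vocab[int(np.argmax(sims))]

    # Pairs containing out-of-vocabulary words are skipped.
    return [(nearest_target(l), nearest_target(r))
            for l, r in pairs
            if l in src_index and r in src_index]

# Toy data (hypothetical): three words per language, 3-dimensional vectors.
src_vocab = ["complex", "complicated", "ancient"]
tgt_vocab = ["antico", "complesso", "complicato"]
X_s = np.eye(3)
W_cl = np.array([[0.0, -1.0, 0.0],
                 [1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0]])  # an orthogonal map standing in for the learned one
X_t = np.eye(3)[[2, 0, 1]] @ W_cl.T  # built so that X_t @ W_cl matches the source vectors

translated = translate_constraints([("complex", "complicated")],
                                   src_vocab, X_s, tgt_vocab, X_t, W_cl)
print(translated)
```

With real embeddings the retrieved neighbour is often wrong, which is exactly the noise the refinement step is meant to remove.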
Step 2: Cleaning Noisy Constraints. The $L_t$ constraints we obtain by translating $L_s$ constraints via a cross-lingual $L_s$-$L_t$ embedding space are expected to be noisy (as validated later in § 5), because a shared cross-lingual space obtained via a linear projection matrix is far from ideal. The translations are going to be particularly noisy for pairs of distant languages, for which the projection-based methods for inducing cross-lingual embedding spaces (including RCSLS) generally yield lower bilingual lexicon induction (BLI) performance (Søgaard et al., 2018; Joulin et al., 2018). In the next step, we therefore clean the noisy $L_t$ constraints obtained via this imperfect translation procedure. To this end, we leverage the state-of-the-art model for lexical relation prediction: the Specialization Tensor Model (STM) (Glavaš and Vulić, 2018a). STM is a neural model that predicts lexical relations for pairs of input distributional vectors based on multi-view projections of those vectors. Each slice of the STM's central specialization tensor specifies a different projection. We modify the original N-ary STM classifier to perform binary classification, and train two instances of the model: one that predicts whether a pair of words represents a valid ATTRACT constraint (A-STM), and another that predicts valid REPEL constraints (R-STM). We train both models with the training instances created from the clean $L_s$ constraints.
Given a pair of vectors $(x_l, x_r)$ that corresponds to a clean linguistic constraint $(w_s^l, w_s^r)$ from $A_s$ (or $R_s$), each vector is transformed with $k$ feed-forward networks (FFNs) of the STM model. The paired projections of the two vectors resulting from each FFN are scored with a parameterized biaffine product, producing $k$ latent scores describing the nature of the relation between the input vectors. The $k$-dimensional latent feature vector is finally passed to an FFN, which performs binary classification. The complete objective is summarized in Equation (1), where $\oplus$ stands for concatenation, and the output layer activations are denoted as $\sigma$ for sigmoid and $\tau$ for tanh.
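As a rough sketch of the architecture just described, the forward pass of a binary STM-style classifier might look like the code below. The parameter names, shapes, the sharing of each projection between the left and right vector, and the bilinear-plus-bias form of the biaffine score are assumptions made for illustration, not the notation or exact design of Glavaš and Vulić (2018a).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(d, h, k, rng):
    """Random parameters; all names and shapes are illustrative assumptions."""
    return {
        "proj_W": rng.normal(scale=0.1, size=(k, h, d)),      # k projection FFNs
        "proj_b": np.zeros((k, h)),
        "biaffine_W": rng.normal(scale=0.1, size=(k, h, h)),  # one scoring matrix per slice
        "biaffine_b": np.zeros(k),
        "clf_W": rng.normal(scale=0.1, size=(h, k)),          # classifier hidden layer
        "clf_b": np.zeros(h),
        "out_w": rng.normal(scale=0.1, size=h),
        "out_b": 0.0,
    }

def stm_binary_score(x_l, x_r, p):
    """Return the predicted probability that (x_l, x_r) is a valid constraint."""
    k = p["proj_W"].shape[0]
    latent = np.empty(k)
    for i in range(k):
        # Project both vectors with the i-th FFN (tanh activation).
        h_l = np.tanh(p["proj_W"][i] @ x_l + p["proj_b"][i])
        h_r = np.tanh(p["proj_W"][i] @ x_r + p["proj_b"][i])
        # Score the paired projections; the k scores form the latent feature vector.
        latent[i] = h_l @ p["biaffine_W"][i] @ h_r + p["biaffine_b"][i]
    hidden = np.tanh(p["clf_W"] @ latent + p["clf_b"])
    return float(sigmoid(p["out_w"] @ hidden + p["out_b"]))

rng = np.random.default_rng(0)
params = init_params(d=8, h=6, k=5, rng=rng)
probability = stm_binary_score(rng.normal(size=8), rng.normal(size=8), params)
print(probability)
```

Two such models with separate parameters would play the roles of A-STM and R-STM; training (e.g., with a binary cross-entropy loss) is omitted here.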
The pairs $(x_l, x_r)$ created from $A_s$ and $R_s$ constitute positive training instances for A-STM and R-STM, respectively. For each classifier we couple each positive training instance with two types of negative training instances: (1) we create a negative instance by substituting a member of the pair ($x_l$ or $x_r$) with a randomly sampled vector from one of the other pairs in the same training batch; (2) we create a negative instance by randomly sampling a constraint from the opposing set of constraints, that is, we turn a constraint from $A_s$ into a negative example for R-STM, and, conversely, a constraint from $R_s$ into a negative training instance for A-STM. We train the A-STM and R-STM models with training instances created from $L_s$ constraints and then use the trained models to predict the validity of the translated $L_t$ constraints. We retain only the subsets of $L_t$ constraints $A_t$ and $R_t$ deemed valid by A-STM and R-STM, respectively. Vectors of $L_s$ words (during training) and vectors of $L_t$ words (at inference) are taken from the induced bilingual $L_s$-$L_t$ space $X_{CL} = X_s \cup X_t W_{CL}$.
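The two kinds of negative instances described above can be illustrated with a short sketch. It works on words rather than vectors for readability, treats the whole list of positives as a single batch, and does not check whether a corrupted pair happens to be a valid constraint; the function name and toy constraints are hypothetical.

```python
import random

def build_training_instances(positives, opposing, rng):
    """Pair every positive constraint with one negative of each type.

    positives: constraints for the classifier being trained (e.g., A_s for A-STM)
    opposing:  constraints from the other set (e.g., R_s for A-STM)
    """
    instances = []
    for idx, (w_l, w_r) in enumerate(positives):
        instances.append(((w_l, w_r), 1))

        # Type 1: replace one member with a word from another pair in the batch.
        others = [pair for j, pair in enumerate(positives) if j != idx]
        if others:
            replacement = rng.choice(rng.choice(others))
            if rng.random() < 0.5:
                corrupted = (replacement, w_r)
            else:
                corrupted = (w_l, replacement)
            instances.append((corrupted, 0))

        # Type 2: a constraint from the opposing set is a negative example.
        if opposing:
            instances.append((rng.choice(opposing), 0))
    return instances

attract = [("complicated", "complex"), ("parrot", "bird"), ("big", "large")]
repel = [("ancient", "recent"), ("hot", "cold")]
a_stm_data = build_training_instances(attract, repel, random.Random(0))
for pair, label in a_stm_data:
    print(label, pair)
```

Swapping the two arguments (`repel` as positives, `attract` as the opposing set) yields the corresponding training data for R-STM.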

Semantic Specialization
We can now directly feed $A_t$ and $R_t$ to any retrofitting model and (monolingually) specialize any distributional space in the target language. We first run the state-of-the-art retrofitting model ATTRACT-REPEL (AR) with $A_t$ and $R_t$ constraints. AR, however, specializes only the words present in $A_t$ and $R_t$. In the next step, we generalize AR's specialization to the full vocabulary $V_t$ with the state-of-the-art post-specialization model. For completeness, we briefly summarize AR and the post-specialization model.
Retrofitting with ATTRACT-REPEL. Each constraint from $A_t$ and $R_t$ is used to fine-tune the distance between the corresponding vectors $(x_l, x_r)$ in the target $L_t$ distributional space. Let $B_A$ be a batch of vector pairs created from ATTRACT constraints $A_t$ and $B_R$ the batch of vector pairs created from REPEL constraints $R_t$. For each batch $B_A$ and each batch $B_R$, we construct batches of corresponding negative pairs $T_A(B_A)$ and $T_R(B_R)$, containing new pairs of words sampled among those present in the batch of positive pairs. In particular, half of the negative examples $t_l$ and $t_r$ for ATTRACT (or REPEL) pairs are chosen by retrieving the nearest (or farthest) neighbours to $x_l$ and $x_r$, respectively, in terms of cosine similarity. The other half are random negative examples.
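The neighbour-based half of the negative examples can be sketched as below. This is an illustrative reading of the description above, assuming negatives are drawn from the other pairs in the same batch; the function name and array layout are hypothetical, and the random half of the negatives is omitted.

```python
import numpy as np

def hardest_in_batch_negatives(batch, attract=True):
    """Pick, for every vector in the batch, a negative from a different pair.

    batch: array of shape (n, 2, d) holding n pairs (x_l, x_r), with n >= 2.
    For ATTRACT batches the nearest neighbour (by cosine) is chosen, for
    REPEL batches the farthest one. Returns an array of shape (n, 2, d)
    holding (t_l, t_r) for each pair, as unit-length vectors.
    """
    n, _, d = batch.shape
    flat = batch.reshape(2 * n, d)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sims = flat @ flat.T

    # Exclude both members of a vector's own pair from the candidates.
    excluded = -np.inf if attract else np.inf
    for i in range(n):
        sims[2 * i:2 * i + 2, 2 * i:2 * i + 2] = excluded

    chosen = sims.argmax(axis=1) if attract else sims.argmin(axis=1)
    return flat[chosen].reshape(n, 2, d)

toy_batch = np.array([[[1.0, 0.0], [0.6, 0.8]],
                      [[0.0, 1.0], [0.8, 0.6]]])
attract_negatives = hardest_in_batch_negatives(toy_batch, attract=True)
repel_negatives = hardest_in_batch_negatives(toy_batch, attract=False)
print(attract_negatives[0, 0], repel_negatives[0, 0])
```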
AR minimizes an objective based on max-margin loss between positive pairs and their corresponding negative pairs. More precisely, its objective has three loss components: $L_{AR} = Att(B_A, T_A) + Rep(B_R, T_R) + Pre(B_A, B_R)$. The first component ensures that word pairs from each $B_A$ are drawn closer together than those in the corresponding $T_A$ up to a certain "attract" margin $\delta_A$:

$$Att(B_A, T_A) = \sum_{i=1}^{|B_A|} \left[ \tau\left(\delta_A + \cos(x_l^i, t_l^i) - \cos(x_l^i, x_r^i)\right) + \tau\left(\delta_A + \cos(x_r^i, t_r^i) - \cos(x_l^i, x_r^i)\right) \right] \quad (2)$$

where $\tau(z) = \max(0, z)$ is the ramp function. Analogously, $Rep(B_R, T_R)$ forces the vectors of words in $B_R$ pairs to be further away than the vectors of their corresponding $T_R$ pairs by a margin $\delta_R$. Finally, $Pre(B_A, B_R)$ is the regularization objective that preserves the useful semantic information from the distributional space by minimizing the Euclidean distance between original and changed vectors.

Post-Specialization. By virtue of AR retrofitting, only the subset of vectors of $L_t$ words observed in the refined $L_t$ constraints are specialized. The specialized subspace, however, contains useful information for propagating the specialization to the rest of the vocabulary $V_t$ (i.e., to the vectors of $L_t$ words unseen in $A_t$ and $R_t$). Post-specialization aims to learn a global specialization function $G : X_t \in \mathbb{R}^d \rightarrow X_t \in \mathbb{R}^d$ that approximates the perturbation patterns of AR as captured by changes in vectors of seen words from $A_t$ and $R_t$. $G$ is learned as a non-linear mapping between pairs $(x_i, y_i)$, where $x_i \in X_t$ is the distributional vector of some constraint word (from $A_t$ or $R_t$) and $y_i$ is its corresponding AR-specialized vector. In line with Ponti et al. (2018), we implement this function as a deep feed-forward neural network with $l$ hidden layers of size $h$ and a final linear layer with weight $W \in \mathbb{R}^{h \times d}$. We optimize model parameters $\theta_G$ by minimizing a contrastive margin ranking loss with random confounders (Weston et al., 2011, inter alia). The cosine similarity between a distributional vector transformed with $G$ and the corresponding "gold" vector (i.e., AR-specialized vector) is forced to be larger than that between the former and randomly sampled confounders ($k$ of them) by a margin $\delta_{MM}$:

$$L_{MM} = \sum_{i=1}^{N} \sum_{j=1}^{k} \tau\left(\delta_{MM} - \cos(G(x_i), y_i) + \cos(G(x_i), y_j)\right) \quad (3)$$

where $N$ is the number of training pairs and $y_j$ ranges over the $k$ confounders sampled for $x_i$. Once the global specialization transformation $G$ is learned, it is applied to the whole distributional space of our target language: $Y_t = G_{\theta_G}(X_t)$.
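The contrastive margin ranking loss with random confounders described above can be sketched in a few lines. This is a minimal, non-differentiable illustration that assumes the outputs of $G$ have already been computed; the function name, the sampling of confounders from the other gold vectors, and the margin values in the toy example are assumptions.

```python
import numpy as np

def margin_ranking_loss(predicted, gold, k, margin, rng):
    """Contrastive margin ranking loss with k random confounders per example.

    predicted: array (n, d), rows are G(x_i)
    gold:      array (n, d), rows are the AR-specialized vectors y_i
    """
    pred = predicted / np.linalg.norm(predicted, axis=1, keepdims=True)
    ref = gold / np.linalg.norm(gold, axis=1, keepdims=True)
    sims = pred @ ref.T          # sims[i, j] = cos(G(x_i), y_j)
    n = len(pred)
    loss = 0.0
    for i in range(n):
        candidates = [j for j in range(n) if j != i]
        confounders = rng.choice(candidates, size=min(k, len(candidates)),
                                 replace=False)
        # Hinge on: margin - cos(G(x_i), y_i) + cos(G(x_i), y_j)
        loss += np.maximum(0.0, margin - sims[i, i] + sims[i, confounders]).sum()
    return float(loss)

rng = np.random.default_rng(0)
perfect = np.eye(3)  # predictions identical to gold, mutually orthogonal
loss_small_margin = margin_ranking_loss(perfect, perfect, k=2, margin=0.5, rng=rng)
loss_large_margin = margin_ranking_loss(perfect, perfect, k=2, margin=1.5, rng=rng)
print(loss_small_margin, loss_large_margin)
```

In the toy case the loss vanishes once the gold vector beats every confounder by at least the margin, and grows linearly with the shortfall otherwise.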
Note that with our proposed specialization approach CLSRI, we execute the retrofitting and post-specialization completely monolingually in the target language $L_t$ on the automatically induced constraints in the target language. In contrast, existing work (Glavaš and Vulić, 2018b; Ponti et al., 2018) transfers the post-specialization function learned for the source language $L_s$ to the target language $L_t$ via a cross-lingual vector space. This fundamental design difference is illustrated in Figure 1 and empirically validated in § 5.

Experimental Setup
Lexical Constraints. The assortment of English constraints for specialization is the same as in prior work (Zhang et al., 2014; Ono et al., 2015; Ponti et al., 2018). These constraints concern the lexical relations documented in WordNet (Fellbaum, 1998) and Roget's Thesaurus (Kipfer, 2009). Initially, they amount to 1,023,082 synonymy/ATTRACT word pairs and 380,873 antonymy/REPEL pairs, which cover 14.6% of the 200K most frequent English words, as found in the vocabulary of FASTTEXT vectors (Bojanowski et al., 2017). The number of constraints is substantially reduced in the target languages after the induction process from § 3.1, both after the rough translation and after the refinement via relation prediction. The actual numbers are reported in Figure 2.
Early stopping for the STM classifiers was implemented based on the $F_1$ score on a development set comprising 5% of the source language constraints. The hidden layer dimensionality is 300, and we use $k = 5$ specialization sub-tensors. Regarding the quality of the STM predictions, the best models achieve an $F_1$ score of 81.4 on ATTRACT constraints, and an $F_1$ score of 66.9 on REPEL constraints.
AR and Post-Specialization. We retain the exact hyper-parameter configuration for ATTRACT-REPEL from the original work: $\delta_A = 0.6$, $\delta_R = 0.0$, $\lambda_P = 10^{-9}$. Adagrad (Duchi et al., 2011) is employed to optimize the model parameters for 5 epochs, feeding batches of size $|B_A| = |B_R| = 50$, again as in prior work.
[5] FASTTEXT is an extension of Skip-Gram with Negative Sampling (SGNS) that builds representations for each word's constituent character n-grams and sums them up to obtain the entire word's representation.
[6] https://github.com/facebookresearch/fastText/tree/master/alignment
Owing to the difference in the amount of supervision, the post-specialization model has partially non-overlapping configurations for the baseline model of Ponti et al. (2018) and our CLSRI model. For both models, each of the $l = 3$ hidden layers of the feed-forward network is composed of $h = 2{,}048$ hidden units, and is non-linearly activated by LeakyReLU (Maas et al., 2013). We apply a dropout of 0.2 both in input and between hidden layers. In Eq. (3)

Results and Discussion
We evaluate different specialization models across several target languages on the intrinsic word similarity task and three downstream language understanding tasks where distinguishing between true semantic similarity and conceptual relatedness is crucial: dialog state tracking, lexical simplification, and semantic textual similarity. The choice of tasks has also been driven by the availability of standardized evaluation data in different languages.

Word Similarity
Evaluation Setup. The intrinsic evaluation is based on a set of (true) word similarity benchmarks manually translated from (subsets of) the English SimLex and re-scored in the target languages. [9] In particular, the benchmarks are collected from the work of Leviant and Reichart (2015) for German, Italian, and Russian (999 pairs), from prior work for Hebrew and Croatian (999 pairs), from Venekoski and Vankka (2017) for Finnish (300), from Mykowiecka et al. (2018) for Polish (999), and from Ercan and Yıldız (2018) for Turkish (500). We measure Spearman's ρ rank correlation between the gold human-elicited word pair similarity scores and the cosine similarity of the corresponding word vectors retrieved from each vector space.
[8] For both post-specialization models, the hyper-parameters are chosen with a grid search over the values h={1024, 2048, 4096}, l={2, 3}, lr={0.1, 0.01, 0.001}, and optimizers in {Adam, SGD}, using a held-out dev set (10% of the constraints).
[9] In contrast to other datasets like WordSim-353 (Finkelstein et al., 2002) or MEN (Bruni et al., 2014), SimLex encourages scores to distinguish between pure semantic similarity (actual synonyms) and broad topical relatedness.
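The evaluation protocol above amounts to ranking word pairs by cosine similarity and correlating that ranking with the human scores. A dependency-free sketch follows; the function names and the toy vectors and gold scores are hypothetical, and ties receive average ranks.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for position in range(start, end + 1):
            ranks[order[position]] = (start + end) / 2 + 1
        start = end + 1
    return ranks

def pearson(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(a, b):
    """Spearman's rho is the Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(a), average_ranks(b))

def evaluate_word_similarity(pairs, gold_scores, vectors):
    predicted = [cosine(vectors[w1], vectors[w2]) for w1, w2 in pairs]
    return spearman(gold_scores, predicted)

toy_vectors = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
toy_pairs = [("a", "b"), ("a", "d"), ("a", "c")]
toy_gold = [9.0, 5.0, 1.0]
rho = evaluate_word_similarity(toy_pairs, toy_gold, toy_vectors)
print(rho)
```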
Results and Analysis. We summarize the results for word similarity in Table 1. The full CLSRI-PS model outperforms both the distributional vectors and the baseline method for cross-lingual specialization (Ponti et al., 2018). In all languages but two (DE and RU), even the CLSRI-AR model without post-specialization is superior to both baselines, and the post-specialization step additionally improves the results, supporting the findings from prior work. Crucially, the performance of CLSRI-PS remains strong even for distant language pairs (e.g., for EN-HE, EN-TR or EN-FI), whereas the X-PS baseline shows a drop in performance for such cases. We suspect that this is because the success of our CLSRI-PS method depends less on the quality of the underlying shared cross-lingual vector space, which is known to deteriorate for more distant language pairs (Søgaard et al., 2018).

Dialog State Tracking
A standard language understanding evaluation task used in prior work on semantic specialization (Ponti et al., 2018, inter alia) is dialog state tracking (DST) (Henderson et al., 2014). A DST model is a fundamental building block of statistical modular dialogue systems (Young, 2010). Its task is to maintain the information of the user's goals during a multi-turn conversation by updating the dialog belief state at each turn. Distinguishing true similarity as captured in specialized word vectors from broader relatedness is crucial for DST to succeed: e.g., a dialog system for restaurant bookings should not confuse the western and the eastern part of town, or Thai and Japanese cuisine.
Evaluation Setup. To be directly comparable to prior work when evaluating the effects of specialized word embeddings on DST, we rely on the Neural Belief Tracker (NBT) v2: it is a fully statistical DST model that operates solely on the basis of pretrained word vectors, which are pivotal to its performance. [15] Again following prior work, our evaluation data come from the multilingual Wizard-of-Oz (WOZ) dataset, which is available in two target languages: German and Italian. It contains 1,200 dialogues split into training (600 dialogues), development (200), and test data (400). We report the standard DST metric of joint goal accuracy: it refers to the proportion of dialog turns where all the user's goals were correctly identified.
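Joint goal accuracy, as defined above, counts a turn as correct only if every goal slot matches. A small sketch, with hypothetical slot names and a hypothetical dictionary-per-turn representation of the belief state:

```python
def joint_goal_accuracy(predicted_turns, gold_turns):
    """Proportion of turns whose predicted goals match the gold goals exactly.

    Each turn is a dict mapping a goal slot to its value.
    """
    if len(predicted_turns) != len(gold_turns):
        raise ValueError("predictions and gold annotations must be aligned")
    correct = sum(1 for predicted, gold in zip(predicted_turns, gold_turns)
                  if predicted == gold)
    return correct / len(gold_turns)

gold = [
    {"food": "thai", "area": "west"},
    {"food": "thai", "area": "west", "price": "cheap"},
    {"food": "japanese", "area": "east"},
]
predicted = [
    {"food": "thai", "area": "west"},
    {"food": "thai", "area": "east", "price": "cheap"},  # one wrong slot fails the turn
    {"food": "japanese", "area": "east"},
]
accuracy = joint_goal_accuracy(predicted, gold)
print(accuracy)
```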

Results and Analysis. The results on the German and Italian DST task are summarized in Table 2. Several findings emerge from the results. First, as already confirmed in prior work (Ponti et al., 2018), vectors specialized for semantic similarity are indeed important for DST: we observe improvements with all specialized vectors. The highest gains are observed with the full CLSRI-PS model. This confirms two main intuitions: 1) our proposed specialization transfer via lexical induction in the target language is more robust than the previous X-PS method of Ponti et al. (2018), and 2) the full-vocabulary post-specialization step is again useful, as the initial CLSRI-AR model cannot match the performance of CLSRI-PS.
[15] Note that the original NBT framework in the English DST task has been recently surpassed by more intricate task-specific architectures (Zhong et al., 2018; Ren et al., 2018), but its lightweight design coupled with its strong dependence on input word vectors still makes it a convenient means to evaluate the effects of different specialization methods.

Lexical Simplification
Lexical simplification (LS) aims to automatically replace complex words (i.e., specialized terms, words used less frequently and known to fewer speakers) with their simpler in-context synonyms: the simplified text must be grammatical and retain the meaning of the original text. Lexical simplification critically depends on discerning semantic similarity from other types of semantic relatedness, as the meaning of the original text might not be preserved otherwise (e.g., "The orange automobile crashed." vs. "The orange wheel crashed.").
Evaluation Setup. To evaluate the effects of similarity-based specialization on LS, we employ Light-LS (Glavaš and Štajner, 2015), a language-agnostic LS tool that makes simplifications based on word similarities in a given vector space. The quality of similarity-based information encoded in the vector space is thus expected to directly correlate with the performance of Light-LS. We use LS datasets for Italian (IT) (Tonelli et al., 2016), Spanish (ES) (Saggion et al., 2015; Saggion, 2017), and Portuguese (PT) (Hartmann et al., 2018) to evaluate the specialized spaces in those languages. We rely on the standard LS evaluation metric of Accuracy (Horn et al., 2014; Glavaš and Štajner, 2015): it quantifies both the quality and frequency of replacements as the number of correct simplifications divided by the total number of complex words.
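The Accuracy metric just defined can be computed as in the sketch below. The data layout (an identifier per complex word, a set of acceptable simpler substitutes, and `None` when the system makes no replacement) and the toy examples are assumptions made for illustration.

```python
def ls_accuracy(system_replacements, gold_substitutes):
    """Correct simplifications divided by the total number of complex words.

    system_replacements: dict mapping a complex-word id to the proposed
                         replacement, or None if the system left it unchanged
    gold_substitutes:    dict mapping every complex-word id to the set of
                         acceptable simpler substitutes
    """
    correct = sum(
        1
        for word_id, accepted in gold_substitutes.items()
        if system_replacements.get(word_id) in accepted
    )
    return correct / len(gold_substitutes)

gold_substitutes = {
    "s1:automobile": {"car"},
    "s2:commence": {"begin", "start"},
    "s3:perplexed": {"confused"},
    "s4:abode": {"home", "house"},
}
system_replacements = {
    "s1:automobile": "wheel",   # related but not similar: wrong
    "s2:commence": "start",     # correct
    "s3:perplexed": None,       # no replacement made: counts against accuracy
    "s4:abode": "home",         # correct
}
ls_score = ls_accuracy(system_replacements, gold_substitutes)
print(ls_score)
```

Because unchanged complex words count in the denominator, the metric penalizes both wrong replacements and missing ones.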
Results and Analysis. The results are reported in Table 3. As shown in previous work (Ponti et al., 2018), retrofitting (CLSRI-AR) and the cross-lingual post-specialization transfer (X-PS) are substantially better in the LS task than the original distributional space. However, our full CLSRI-PS model results in substantial boosts in the LS task (13-17%) over the previous best reported scores of X-PS as well as over CLSRI-AR.

Semantic Text Similarity
Evaluation Setup. Finally, we also carry out downstream evaluation in the semantic textual similarity (STS) task. The Arabic dataset constructed for SemEval 2017 track 1 [16] (Cer et al., 2017) consists of sentence pairs scored from 0 (semantic independence) to 5 (semantic equivalence). We augment the training set with all the data for English (translated with Google Translate) from previous editions of the shared task. To classify sentence pairs, we employ the CNN-HTCI model (Shao, 2017). Each sentence is encoded with a convolutional network into a hidden representation. Then, the interaction between the pair of representations is evaluated as their element-wise multiplication and absolute difference. A fully connected network takes this interaction as input, and infers the similarity score.
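The interaction step between the two sentence representations can be illustrated as follows. The sketch assumes the two features are concatenated before being passed to the fully connected network; the function name and toy vectors are hypothetical, and the convolutional encoder and the regressor are omitted.

```python
import numpy as np

def interaction_features(h_a, h_b):
    """Combine two sentence encodings into one interaction vector.

    Returns the element-wise product followed by the absolute difference.
    """
    return np.concatenate([h_a * h_b, np.abs(h_a - h_b)])

h_a = np.array([1.0, 2.0])
h_b = np.array([3.0, -1.0])
features = interaction_features(h_a, h_b)
print(features)
```

Both features are symmetric in their arguments, so the predicted similarity does not depend on the order of the two sentences.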
Results and Analysis. We report the accuracy scores for the Arabic STS in Table 3. Interestingly, for STS both X-PS and CLSRI-AR damage the performance of the distributional baseline. However, the full CLSRI-PS model still shows a substantial improvement over all baselines. This again suggests its wide stability and effectiveness.
To empirically validate the importance of noisy constraints refinement (see § 3.1), we have also evaluated an ablated variant of CLSRI-PS without the refinement step: this model variant relies only on noisy translations of $L_s$ lexical constraints. While this variant leads to improvements over the X-PS baseline across the board, it is consistently outperformed by the full CLSRI-PS model in downstream tasks: e.g., the gains with the full model are 2-3% in the LS task, and 2% in the Arabic STS task. Since the full CLSRI-PS model does not require any additional input for the lexical prediction step (i.e., it operates with the same set of $L_s$ constraints as the translation step), these results suggest that both steps should be applied for improved specialization in the target languages.
[16] http://alt.qcri.org/semeval2017/task1/

Future Work
As a supplemental benefit of CLSRI, the constraints induced by translation and pruning hold promise for creating WordNet-style resources for languages that lack structured linguistic knowledge. While the relations extracted in this proof-of-concept paper do not cover the rich and expressive set of WordNet relations in its entirety, they are nonetheless sufficient to create parts of the core WordNet structure with synsets (synonyms) and lexical relations across synsets (antonyms) from scratch.
Furthermore, our method is amenable to extension to contextualized embeddings (Lauscher et al., 2019) and/or other WordNet lexical relations such as hypernyms and hyponyms. In recent works, procedures of retrofitting and post-specialization (Kamath et al., 2019) have been developed for lexical entailment. These procedures can be easily adapted to the semantic specialization step presented in § 3.2, whereas constraint translation and refinement (§ 3.1) are relation-agnostic. We will explore these directions in future work.

Conclusion
We have proposed a new method for cross-lingual transfer of semantic specialization via induction of lexical constraints in a resource-poor target language. We have verified its usefulness in intrinsic and extrinsic language understanding tasks and across a spectrum of target languages. We report consistent improvements over previous state-of-the-art specialization methods. Crucially, our method is robust to target languages that are distant from source languages, as its performance is consistent across all considered language pairs.