Explicit Retrofitting of Distributional Word Vectors

Semantic specialization of distributional word vectors, referred to as retrofitting, is a process of fine-tuning word vectors using external lexical knowledge in order to better embed some semantic relation. Existing retrofitting models integrate linguistic constraints directly into learning objectives and, consequently, specialize only the vectors of words from the constraints. In this work, in contrast, we transform external lexico-semantic relations into training examples which we use to learn an explicit retrofitting model (ER). The ER model allows us to learn a global specialization function and specialize the vectors of words unobserved in the training data as well. We report large gains over original distributional vector spaces in (1) intrinsic word similarity evaluation and on (2) two downstream tasks − lexical simplification and dialog state tracking. Finally, we also successfully specialize vector spaces of new languages (i.e., unseen in the training data) by coupling ER with shared multilingual distributional vector spaces.


Introduction
Algebraic modeling of word vector spaces is one of the core research areas in modern Natural Language Processing (NLP) and its usefulness has been shown across a wide variety of NLP tasks (Collobert et al., 2011;Chen and Manning, 2014;Melamud et al., 2016). Commonly employed distributional models for word vector induction are based on the distributional hypothesis (Harris, 1954), i.e., they rely on word co-occurrences obtained from large text corpora (Mikolov et al., 2013b;Pennington et al., 2014;Levy and Goldberg, 2014a;Levy et al., 2015;Bojanowski et al., 2017).
The dependence on purely distributional knowledge results in a well-known tendency of fusing semantic similarity with other types of semantic relatedness (Schwartz et al., 2015) in the induced vector spaces. Consequently, the similarity between distributional vectors indicates just an abstract semantic association and not a precise semantic relation (Yih et al., 2012; Mohammad et al., 2013). For example, it is difficult to discern synonyms from antonyms in distributional spaces. This property has a particularly negative effect on NLP applications like text simplification and statistical dialog modeling, in which discerning semantic similarity from other types of semantic relatedness is pivotal to the system performance (Glavaš and Štajner, 2015; Faruqui et al., 2015; Mrkšić et al., 2016; Kim et al., 2016b).
A standard solution is to move beyond purely unsupervised learning of word representations, in a process referred to as word vector space specialization or retrofitting. Specialization models leverage external lexical knowledge from lexical resources, such as WordNet (Fellbaum, 1998), the Paraphrase Database (Ganitkevitch et al., 2013), or BabelNet (Navigli and Ponzetto, 2012), to specialize distributional spaces for a particular lexical relation, e.g., synonymy (Faruqui et al., 2015) or hypernymy (Glavaš and Ponzetto, 2017). External constraints are commonly pairs of words between which a particular relation holds.
Existing specialization methods exploit the external linguistic constraints in two prominent ways: (1) joint specialization models modify the learning objective of the original distributional model by integrating the constraints into it (Yu and Dredze, 2014; Kiela et al., 2015; Nguyen et al., 2016, inter alia); (2) post-processing models fine-tune distributional vectors retroactively after training to satisfy the external constraints (Faruqui et al., 2015; Mrkšić et al., 2017, inter alia). The latter, in general, outperform the former (Mrkšić et al., 2016). Retrofitting models can be applied to arbitrary distributional spaces, but they suffer from a major limitation: they locally update only the vectors of words present in the external constraints, whereas the vectors of all other (unseen) words remain intact. In contrast, joint specialization models propagate the external signal to all words via the joint objective.
In this paper, we propose a new approach for specializing word vectors that unifies the strengths of both prior strategies while mitigating their limitations. Like retrofitting models, our novel framework, termed explicit retrofitting (ER), is applicable to arbitrary distributional spaces. At the same time, the method learns an explicit global specialization function that can specialize vectors for all vocabulary words, as joint models do. Yet, unlike the joint models, ER does not require expensive re-training on large text corpora, but is directly applied on top of any pre-trained vector space. The key idea of ER is to directly learn a specialization function in a supervised setting, using lexical constraints as training instances. In other words, our model, implemented as a deep feedforward neural architecture, learns a (non-linear) function which "translates" word vectors from the distributional space into the specialized space.
We show that the proposed ER approach yields considerable gains over distributional spaces in word similarity evaluation on standard benchmarks (Gerz et al., 2016), as well as in two downstream tasks: lexical simplification and dialog state tracking. Furthermore, we show that, by coupling the ER model with shared multilingual embedding spaces (Mikolov et al., 2013a; Smith et al., 2017), we can also specialize distributional spaces for languages unseen in the training data in a zero-shot language transfer setup. In other words, we show that an explicit retrofitting model trained with external constraints from one language can be successfully used to specialize the distributional space of another language.
Joint Specialization Models. These models integrate external constraints into the distributional training procedure of general word embedding algorithms such as CBOW, Skip-Gram (Mikolov et al., 2013b), or Canonical Correlation Analysis (Dhillon et al., 2015). They modify the prior or the regularization of the original objective (Yu and Dredze, 2014; Xu et al., 2014; Kiela et al., 2015) or integrate the constraints directly into, e.g., an SGNS- or CBOW-style objective (Liu et al., 2015; Ono et al., 2015; Bollegala et al., 2016; Osborne et al., 2016; Nguyen et al., 2016, 2017). Besides generally displaying lower performance compared to retrofitting methods (Mrkšić et al., 2016), these models are also tied to the distributional objective, and any change of the underlying distributional model induces a change of the entire joint model. This makes them less versatile than the retrofitting methods.
Post-Processing Models. Models from the popularly termed retrofitting family inject lexical knowledge from external resources into arbitrary pretrained word vectors (Faruqui et al., 2015;Rothe and Schütze, 2015;Wieting et al., 2015;Nguyen et al., 2016;Mrkšić et al., 2016). These models fine-tune the vectors of words present in the linguistic constraints to reflect the ground-truth lexical knowledge. While the large majority of specialization models from both classes operate only with similarity constraints, a line of recent work (Mrkšić et al., 2016;Vulić et al., 2017b) demonstrates that knowledge about both similar and dissimilar words leads to improved performance in downstream tasks. The main shortcoming of the existing retrofitting models is their inability to specialize vectors of words unseen in external lexical resources.
Our explicit retrofitting framework brings together desirable properties of both model classes: (1) unlike joint models, it does not require adaptation to the underlying distributional model and expensive re-training, i.e., it is applicable to any pre-trained distributional space; (2) it allows for easy integration of both similarity and dissimilarity constraints into the specialization process; and (3) unlike post-processors, it specializes the full vocabulary of the original distributional space and not only vectors of words from external constraints.

Explicit Retrofitting
Our explicit retrofitting (ER) approach, illustrated by Figure 1a, consists of two major components: (1) an algorithm for preparing training instances from external lexical constraints, and (2) a supervised specialization model, based on a deep feedforward neural network. This network, shown in Figure 1b, learns a non-linear global specialization function from the training instances.

From Constraints to Training Instances
Let $X = \{x_i\}_{i=1}^{N}$ be the distributional vector space that we want to specialize (with $\{w_i\}_{i=1}^{N}$ referring to the associated vocabulary) and let $X' = \{x'_i\}_{i=1}^{N}$ be the corresponding specialized vector space that we seek to obtain through explicit retrofitting. Let $C = \{(w_i, w_j, r)_l\}_{l=1}^{L}$ be the set of $L$ linguistic constraints from an external lexical resource, each consisting of a pair of vocabulary words $w_i$ and $w_j$ and a semantic relation $r$ that holds between them. The most recent state-of-the-art retrofitting work (Vulić et al., 2017b) suggests that using both similarity and dissimilarity constraints leads to better performance compared to using only similarity constraints. Therefore, we use synonymy and antonymy relations from external resources, i.e., $r_l \in \{ant, syn\}$. Let $g$ be the function measuring the distance between words $w_i$ and $w_j$ based on their vector representations. The algorithm for preparing training instances from constraints is guided by the following assumptions:
1. All synonymy pairs $(w_i, w_j, syn)$ should have a minimal possible distance score in the specialized space, i.e., $g(x'_i, x'_j) = g_{min}$;
2. All antonymy pairs $(w_i, w_j, ant)$ should have a maximal distance in the specialized space, i.e., $g(x'_i, x'_j) = g_{max}$;
3. The distances $g(x'_i, x'_k)$ in the specialized space between some word $w_i$ and all other words $w_k$ that are not synonyms or antonyms of $w_i$ should be in the interval $(g_{min}, g_{max})$.
Our goal is to discern semantic similarity from semantic relatedness by comparing, in the specialized space, the distances between word pairs $(w_i, w_j, r) \in C$ with the distances that words $w_i$ and $w_j$ from those pairs have with other vocabulary words $w_m$. It is intuitive to enforce that synonyms are as close as possible and antonyms as far as possible. However, we do not know what the distances $g(x'_i, x'_m)$ between non-synonymous and non-antonymous words should look like in the specialized space. This is why, for all other words, similar to Faruqui et al. (2016), we assume that the distances in the specialized space for all word pairs not found in $C$ should stay the same as in the distributional space: $g(x'_i, x'_m) = g(x_i, x_m)$. This way we preserve the useful semantic content available in the original distributional space. In downstream tasks, most errors stem from vectors of semantically related words (e.g., car and driver) being as similar as vectors of semantically similar words (e.g., car and automobile).
To address this, we compare the distances of pairs $(w_i, w_j, r) \in C$ with the distances for pairs $(w_i, w_m)$ and $(w_j, w_n)$, where $w_m$ and $w_n$ are negative examples: the vocabulary words that are most similar to $w_i$ and $w_j$, respectively, in the original distributional space $X$. Concretely, for each constraint $(w_i, w_j, r) \in C$ we retrieve (1) $K$ vocabulary words $\{w_m^k\}_{k=1}^{K}$ that are closest in the input distributional space (according to the distance function $g$) to the word $w_i$ and (2) $K$ vocabulary words $\{w_n^k\}_{k=1}^{K}$ that are closest to the word $w_j$. We then create, for each constraint $(w_i, w_j, r) \in C$, a corresponding set $M$ (termed micro-batch) of $2K + 1$ embedding pairs coupled with a corresponding distance in the input distributional space:

$$M(w_i, w_j, r) = \{(x_i, x_j, g_r)\} \cup \{(x_i, x_m^k, g(x_i, x_m^k))\}_{k=1}^{K} \cup \{(x_j, x_n^k, g(x_j, x_n^k))\}_{k=1}^{K} \quad (1)$$

where $g_r = g_{min}$ if $r = syn$ and $g_r = g_{max}$ if $r = ant$.
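The construction of a micro-batch can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, cosine distance is used for g, the values g_min = 0 and g_max = 2 are assumed from the range of cosine distance, and excluding the constraint partner from the negative examples is an additional assumption.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance 1 - cos(a, b) between two vectors."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_neighbors(X, idx, k, exclude=()):
    """Indices of the k words closest to word idx by cosine distance."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    dist = 1.0 - unit @ unit[idx]
    dist[[idx, *exclude]] = np.inf  # skip the word itself and excluded words
    return np.argsort(dist)[:k]

def build_micro_batch(X, i, j, relation, k, g_min=0.0, g_max=2.0):
    """Return 2k + 1 triples (index_a, index_b, target_distance)."""
    target = g_min if relation == "syn" else g_max
    batch = [(i, j, target)]  # the constraint pair comes first
    for m in nearest_neighbors(X, i, k, exclude=(j,)):  # negatives for w_i
        batch.append((i, int(m), cosine_distance(X[i], X[m])))
    for n in nearest_neighbors(X, j, k, exclude=(i,)):  # negatives for w_j
        batch.append((j, int(n), cosine_distance(X[j], X[n])))
    return batch

# Toy distributional space: 50 random "words" with 8 dimensions.
X = np.random.default_rng(0).normal(size=(50, 8))
batch = build_micro_batch(X, 3, 7, "syn", k=4)
```

With K = 4, the returned list holds nine triples: the synonym pair with target distance g_min, followed by four negative examples for each of the two words with their original distances as targets.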

Figure 1: (a) High-level illustration of the explicit retrofitting approach: lexical constraints, i.e., pairs of synonyms and antonyms, are transformed into respective micro-batches, which are then used to train the supervised specialization model. (b) The low-level implementation of the specialization model, combining the non-linear embedding specialization function f, defined as the deep fully-connected feed-forward network, with the distance metric g, measuring the distance between word vectors after their specialization.

Non-Linear Specialization Function
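Our retrofitting framework learns a global explicit specialization function which, when applied on a distributional vector space, transforms it into a space that better captures semantic similarity, i.e., discerns similarity from all other types of semantic relatedness. We seek the optimal parameters $\theta$ of the parametrized function $f(x; \theta): \mathbb{R}^d \to \mathbb{R}^d$, where $d$ is the dimensionality of the vector space. The specialized embedding $x'_i$ of the word $w_i$ is then obtained as $x'_i = f(x_i; \theta)$. The specialized space $X'$ is obtained by transforming the distributional vectors of all vocabulary words, $X' = f(X; \theta)$. We define the specialization function $f$ to be a multi-layer fully-connected feed-forward network with $H$ hidden layers and non-linear activations $\phi$. The illustration of this network is given in Figure 1b. The $i$-th hidden layer is defined with a weight matrix $W^i$ and a bias vector $b^i$:

$$h^i(x; \theta_i) = \phi\left(h^{i-1}(x; \theta_{i-1}) W^i + b^i\right)$$

where $\theta_i$ is the subset of the network's parameters up to the $i$-th layer. Note that in this notation, $x = h^0(x; \emptyset)$ and $x' = f(x; \theta) = h^H(x; \theta)$. Let $d_h$ be the size of the hidden layers. The network's parameters are then as follows:

$$W^1 \in \mathbb{R}^{d \times d_h}; \quad W^i \in \mathbb{R}^{d_h \times d_h},\ i \in \{2, \dots, H-1\}; \quad W^H \in \mathbb{R}^{d_h \times d}; \quad b^i \in \mathbb{R}^{d_h},\ i \in \{1, \dots, H-1\}; \quad b^H \in \mathbb{R}^{d}.$$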
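To make the shape of the specialization function concrete, the following NumPy sketch runs a forward pass through a network of this kind. It is an illustration under assumptions, not the released implementation: the helper names are hypothetical, vectors are treated as rows, the last layer maps back to the input dimensionality d, the activation is applied in every layer, and the random initialization is arbitrary.

```python
import numpy as np

def init_params(d, d_h, H, seed=0):
    """Create H layers with sizes d -> d_h -> ... -> d_h -> d."""
    rng = np.random.default_rng(seed)
    sizes = [d] + [d_h] * (H - 1) + [d]
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def specialize(X, params, phi=np.tanh):
    """Apply f row-wise: h^i = phi(h^{i-1} W^i + b^i), starting from h^0 = x."""
    h = X
    for W, b in params:
        h = phi(h @ W + b)
    return h

# Ten 300-dimensional input vectors pushed through five layers of width 1000.
params = init_params(d=300, d_h=1000, H=5)
X = np.random.default_rng(1).normal(size=(10, 300))
X_spec = specialize(X, params)
```

Because the function depends only on the input vector and the learned parameters, it can be applied to any row of the distributional space, including words that never appeared in a constraint.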

Optimization Objectives
We feed the micro-batches consisting of $2K + 1$ training instances to the specialization model (see Section 3.1). Each training instance consists of a pair of distributional (i.e., unspecialized) embedding vectors $x_i$ and $x_j$ and a score $g$ denoting the desired distance between the specialized vectors $x'_i$ and $x'_j$ of the corresponding words $w_i$ and $w_j$.
Mean Square Distance Objective (ER-MSD). Let our training batch consist of $N$ training instances, $\{(x_1^i, x_2^i, g^i)\}_{i=1}^{N}$. The simplest objective function is then the squared difference between the desired and obtained distances of specialized vectors:

$$J_{MSD} = \sum_{i=1}^{N} \left(g({x'}_1^i, {x'}_2^i) - g^i\right)^2$$

By minimizing the MSD objective we simply force the specialization model to produce a specialized embedding space $X'$ in which distances between all synonyms amount to $g_{min}$, distances between all antonyms amount to $g_{max}$, and distances between all other word pairs remain the same as in the original space. The MSD objective does not leverage negative examples: it only indirectly enforces that synonym (or antonym) pairs $(w_i, w_j)$ have smaller (or larger) distances than the corresponding non-constraint word pairs $(w_i, w_k)$ and $(w_j, w_k)$.
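A minimal NumPy sketch of the MSD objective follows. It assumes cosine distance for g and a plain sum over instances (a mean would have the same minimizer); the function names are illustrative only.

```python
import numpy as np

def cosine_distance_rows(A, B):
    """Row-wise cosine distance between two equally shaped matrices."""
    num = np.sum(A * B, axis=1)
    den = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    return 1.0 - num / den

def msd_loss(X1_spec, X2_spec, targets):
    """Sum of squared differences between obtained and desired distances."""
    obtained = cosine_distance_rows(X1_spec, X2_spec)
    return float(np.sum((obtained - np.asarray(targets, dtype=float)) ** 2))

# Two specialized pairs: identical vectors (distance 0) and orthogonal vectors (distance 1).
X1_spec = np.array([[1.0, 0.0], [1.0, 0.0]])
X2_spec = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = msd_loss(X1_spec, X2_spec, targets=[0.0, 2.0])
```

In this toy case the first pair already sits at its target distance of 0, while the second pair has distance 1 against a target of 2, so the only contribution to the loss comes from the second pair.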
Contrastive Objective (ER-CNT). An alternative to MSD is to directly contrast the distances of constraint pairs (i.e., antonyms and synonyms) with the distances of their corresponding negative examples, i.e., the pairs from their respective micro-batch (cf. Eq. (1) in Section 3.1). Such an objective should directly enforce that the distances for synonyms (or antonyms) $(w_i, w_j)$ are smaller (or larger, for antonyms) than for pairs $(w_i, w_k)$ and $(w_j, w_k)$ involving the same words $w_i$ and $w_j$, respectively. Let $S$ and $A$ be the sets of micro-batches created from synonymy and antonymy constraints, respectively. Let $M_s = \{(x_1^i, x_2^i, g^i)\}_{i=1}^{2K+1}$ be one micro-batch created from one synonymy constraint and let $M_a$ be the analogous micro-batch created from one antonymy constraint. Let us then assume that the first triple (i.e., for $i = 1$) in every micro-batch corresponds to the constraint pair and the remaining $2K$ triples (i.e., for $i \in \{2, \dots, 2K + 1\}$) to the respective non-constraint word pairs. We then define the contrastive objective as follows:

$$J_{CNT} = \sum_{M_s \in S} \sum_{i=2}^{2K+1} \left((g^i - g_{min}) - (g'^i - g'^1)\right)^2 + \sum_{M_a \in A} \sum_{i=2}^{2K+1} \left((g_{max} - g^i) - (g'^1 - g'^i)\right)^2$$

where $g'$ is a short-hand notation for the distance between vectors in the specialized space, i.e., $g'(x_1, x_2) = g(x'_1, x'_2) = g(f(x_1), f(x_2))$, and $g'^i = g'(x_1^i, x_2^i)$.
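One way to read the contrastive idea is as a penalty on the gap between a desired margin and an obtained margin within each micro-batch. The sketch below implements that reading with a squared penalty; the squared form, the function name, and the default values g_min = 0 and g_max = 2 are assumptions made for illustration and should not be taken as the authors' exact formulation.

```python
import numpy as np

def contrastive_loss(g_orig, g_spec, relation, g_min=0.0, g_max=2.0):
    """Loss for one micro-batch of distances.

    Index 0 holds the constraint pair, indices 1..2K hold the negative examples.
    g_orig are distances in the distributional space (g_orig[0] is unused),
    g_spec are distances between the specialized vectors.
    """
    g_orig = np.asarray(g_orig, dtype=float)
    g_spec = np.asarray(g_spec, dtype=float)
    if relation == "syn":
        desired = g_orig[1:] - g_min        # negatives should be this much farther
        obtained = g_spec[1:] - g_spec[0]
    else:  # "ant"
        desired = g_max - g_orig[1:]        # negatives should be this much closer
        obtained = g_spec[0] - g_spec[1:]
    return float(np.sum((desired - obtained) ** 2))

# Synonym micro-batch whose specialized margins already match the desired ones.
loss_syn = contrastive_loss([0.0, 0.4, 0.6], [0.1, 0.5, 0.7], "syn")
# Antonym micro-batch: desired margin 1.6, obtained margin 1.1.
loss_ant = contrastive_loss([2.0, 0.4], [1.5, 0.4], "ant")
```

Unlike the MSD sketch, this loss never compares a single distance to a fixed target; it only looks at how far the constraint pair is from its negative examples, which is the property the contrastive objective is meant to enforce.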
Topological Regularization. Because the distributional space $X$ already contains useful semantic information, we want our specialized space $X'$ to move similar words closer together and dissimilar words further apart, but without disrupting the overall topology of $X$. To this end, we define an additional regularization objective that measures the distance between the original vectors $x_1$ and $x_2$ and their specialized counterparts $x'_1 = f(x_1)$ and $x'_2 = f(x_2)$, for all examples in the training set:

$$J_{REG} = \sum_{i=1}^{N} g(x_1^i, {x'}_1^i) + g(x_2^i, {x'}_2^i)$$

We minimize the final objective function $J = J' + \lambda J_{REG}$. $J'$ is either $J_{MSD}$ or $J_{CNT}$, and $\lambda$ is the regularization factor which determines how strictly we retain the topology of the original space.

Linguistic Constraints. We experiment with the sets of linguistic constraints used in prior work (Zhang et al., 2014; Ono et al., 2015). These constraints, extracted from WordNet (Fellbaum, 1998) and Roget's Thesaurus (Kipfer, 2009), comprise a total of 1,023,082 synonymy word pairs and 380,873 antonymy word pairs. Although this seems like a large number of linguistic constraints, there are only 57,320 unique words in all synonymy and antonymy constraints combined, and not all of these words are found in the dictionary of the pre-trained distributional vector space. For example, only 15.3% of the words from constraints are found in the whole vocabulary of SGNS-W2 embeddings. Similarly, we find only 13.3% and 14.6% constraint words among the 200K most frequent words from the GLOVE-CC and FASTTEXT vocabularies, respectively. This low coverage emphasizes the core limitation of current retrofitting methods, which can specialize only the vectors of words seen in the external constraints, and the need for our global ER method, which can specialize all word vectors from the distributional space.
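The coverage statistic discussed above can be computed with a few set operations. The sketch below reads the statistic literally, as the share of unique constraint words that also occur in the embedding vocabulary; the function name and the toy data are hypothetical.

```python
def constraint_coverage(constraints, vocabulary):
    """Fraction of unique constraint words that occur in the embedding vocabulary."""
    words = {w for w1, w2, _ in constraints for w in (w1, w2)}
    return len(words & set(vocabulary)) / len(words)

# Toy constraints (word, word, relation) and a toy embedding vocabulary.
constraints = [
    ("cheap", "inexpensive", "syn"),
    ("cheap", "expensive", "ant"),
    ("east", "west", "ant"),
]
vocabulary = ["cheap", "expensive", "east", "bar", "pub"]
coverage = constraint_coverage(constraints, vocabulary)
```

Here three of the five unique constraint words appear in the vocabulary. Local retrofitting can only move the vectors of those shared words, whereas a learned global function can be applied to every vocabulary entry.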

ER Model Configuration.
In all experiments, we set the distance function $g$ to cosine distance, $g(x_1, x_2) = 1 - \frac{x_1 \cdot x_2}{\|x_1\| \|x_2\|}$, and use the hyperbolic tangent as activation, $\phi = \tanh$. For each constraint $(w_i, w_j)$, we create $K = 4$ corresponding negative examples for both $w_i$ and $w_j$, resulting in micro-batches with $2K + 1 = 9$ training instances. We separate 10% of the created micro-batches as the validation set. We then tune the hyper-parameter values, the number of hidden layers $H = 5$ and their size $d_h = 1000$, and the topological regularization factor $\lambda = 0.3$, by minimizing the model's objective $J$ on the validation set. We train the model in mini-batches, each containing $N_b = 100$ constraints (i.e., 900 training instances, see above), using the Adam optimizer (Kingma and Ba, 2015) with the initial learning rate set to $10^{-4}$. We use the loss on the validation set as the early stopping criterion.
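The configuration above can be collected in one place and combined with the topological regularizer, as in the following sketch. The dictionary keys and function names are hypothetical, and the regularizer is written as a plain sum of cosine distances between original and specialized vectors, which is an assumed reading of the description in Section 3.3.

```python
import numpy as np

# Hyper-parameter values reported in the text; the key names are illustrative.
CONFIG = {"K": 4, "H": 5, "d_h": 1000, "lambda": 0.3,
          "constraints_per_batch": 100, "learning_rate": 1e-4,
          "validation_fraction": 0.1}

def cosine_distance_rows(A, B):
    """Row-wise cosine distance between two equally shaped matrices."""
    num = np.sum(A * B, axis=1)
    den = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    return 1.0 - num / den

def regularizer(X1, X2, X1_spec, X2_spec):
    """Distance between original vectors and their specialized counterparts."""
    return float(np.sum(cosine_distance_rows(X1, X1_spec)
                        + cosine_distance_rows(X2, X2_spec)))

def total_objective(j_prime, j_reg, lam=CONFIG["lambda"]):
    """Final objective J = J' + lambda * J_REG."""
    return j_prime + lam * j_reg

# Each constraint yields 2K + 1 instances, so 100 constraints give 900 instances.
instances_per_batch = CONFIG["constraints_per_batch"] * (2 * CONFIG["K"] + 1)

# Rescaling a vector does not change its direction, so the regularizer is zero here.
X1 = np.array([[1.0, 2.0], [0.5, -1.0]])
X2 = np.array([[3.0, 1.0], [2.0, 2.0]])
j_reg = regularizer(X1, X2, 2.0 * X1, 0.5 * X2)
j_total = total_objective(1.0, 2.0)
```

A larger lambda weights the regularizer more heavily and keeps the specialized vectors closer in direction to the originals; a smaller lambda lets the constraints reshape the space more freely.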

Word Similarity
Evaluation Setup. We first evaluate the quality of the explicitly retrofitted embedding spaces intrinsically, on two word similarity benchmarks: the SimLex-999 dataset and SimVerb-3500 (Gerz et al., 2016), a recent dataset containing human similarity ratings for 3,500 verb pairs. We use Spearman's ρ rank correlation between gold and predicted word pair scores as the evaluation metric. We evaluate the specialized embedding spaces in two settings. In the first setting, termed lexically disjoint, we remove from our training set all linguistic constraints that contain any of the words found in SimLex or SimVerb. This way, we effectively evaluate the model's ability to generalize the specialization function to unseen words. In the second setting (lexical overlap) we retain the constraints containing SimLex or SimVerb words in the training set. For comparison, we also report the performance of the state-of-the-art local retrofitting model ATTRACT-REPEL, which is able to specialize only the words from the linguistic constraints.
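The evaluation metric can be illustrated with a small, dependency-free sketch of Spearman's ρ, computed as the Pearson correlation of tie-averaged ranks. The function names and the toy scores are hypothetical; in practice a library routine would normally be used instead.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for t in range(pos, end + 1):
            ranks[order[t]] = mean_rank
        pos = end + 1
    return ranks

def spearman_rho(gold, predicted):
    """Pearson correlation between the ranks of gold and predicted scores."""
    rg, rp = average_ranks(gold), average_ranks(predicted)
    n = len(rg)
    mg, mp = sum(rg) / n, sum(rp) / n
    cov = sum((a - mg) * (b - mp) for a, b in zip(rg, rp))
    var_g = sum((a - mg) ** 2 for a in rg)
    var_p = sum((b - mp) ** 2 for b in rp)
    return cov / (var_g * var_p) ** 0.5

# Human ratings for four word pairs and model similarities (e.g., 1 - cosine distance).
gold = [9.2, 1.5, 6.0, 3.3]
predicted = [0.8, 0.1, 0.7, 0.2]
rho = spearman_rho(gold, predicted)
```

Because only the ordering of the pairs matters, the predicted similarities do not need to be on the same scale as the human ratings.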
Results. The results with our ER model applied to three distributional spaces are shown in Table 1. The scores suggest that the proposed ER model is universally useful and robust. The ER-specialized spaces outperform original distributional spaces across the board, for both objective functions. The results in the lexically disjoint setting are especially indicative of the improvements achieved by the ER. For example, we achieve a correlation gain of 18% for the GLOVE-CC vectors on SimLex using a specialization function learned without seeing a single constraint with any SimLex word.
In the lexical overlap setting, we observe substantial gains only for GLOVE-CC. The modest gains in this setting with FASTTEXT and SGNS-W2 in fact strengthen the impression that the ER model learns a general specialization function, i.e., it does not "overfit" to words from linguistic constraints. The ER model with the contrastive objective (ER-CNT) yields better performance on average than the one using the simpler square distance objective (ER-MSD). This is expected, given that the contrastive objective forces the model to distinguish pairs of semantically (dis)similar words from pairs of semantically related words.
Finally, the post-processing ATTRACT-REPEL model based on local vector updates seems to substantially outperform the ER method in this task. The gap is especially visible for FASTTEXT and SGNS-W2 vectors. However, since ATTRACT-REPEL specializes only words seen in linguistic constraints, its performance crucially depends on the coverage of test set words in the constraints. ATTRACT-REPEL excels on the intrinsic evaluation as the constraints cover 99.2% of SimLex words and 99.9% of SimVerb words. However, its usefulness is less pronounced in real-life downstream scenarios in which such high coverage cannot be guaranteed, as demonstrated in Section 5.3.

Analysis.
We examine in more detail the performance of the ER model with respect to (1) the type of constraints used for training the model: synonyms and antonyms, only synonyms, or only antonyms and (2) the extent to which we retain the topology of the original distributional space (i.e., with respect to the value of the topological regularization factor λ). All reported results were obtained by specializing the GLOVE-CC distributional space in the lexically disjoint setting (i.e., employed constraints did not contain any of the SimLex or SimVerb words).
In Table 2 we show the specialization performance of the ER-CNT models (H = 5, λ = 0.3), using different types of constraints on SimLex-999 (SL) and SimVerb-3500 (SV). We compare the standard model, which exploits both synonym and antonym pairs for creating training instances, with the models employing only synonym and only antonym constraints, respectively. Clearly, we obtain the best specialization when combining synonyms and antonyms. Note, however, that using only synonyms or only antonyms also improves over the original distributional space. Next, in Figure 2 we depict the specialization performance (on SimLex and SimVerb) of the ER models with different values of the topology regularization factor λ (H fixed to 5). The best performance is obtained for λ = 0.3. Smaller λ values overly distort the original distributional space, whereas larger λ values dampen the specialization effects of linguistic constraints.

Language Transfer
Readily available large collections of synonymy and antonymy word pairs do not exist for many languages. This is why we also investigate zero-shot specialization: we test if it is possible, with the help of cross-lingual word embeddings, to transfer the specialization knowledge learned from English constraints to languages without any training data.
To this end, we use a shared multilingual vector space containing word vectors of three other languages (German, Italian, and Croatian) along with the English vectors. Concretely, we map the Italian CBOW vectors (Dinu et al., 2015), German FastText vectors trained on German Wikipedia (Bojanowski et al., 2017), and Croatian Skip-Gram vectors trained on the HrWaC corpus (Ljubešić and Erjavec, 2011) to the GLOVE-CC English space. We create the translation pairs needed to learn the projections by automatically translating the 4,000 most frequent English words to all three other languages with Google Translate. We then employ the ER model trained to specialize the GLOVE-CC space using the full set of English constraints to specialize the distributional spaces of the other languages. We evaluate the quality of the specialized spaces on the respective SimLex-999 dataset for each language (Leviant and Reichart, 2015).
Results. The results are provided in Table 3. They indicate that the ER models can substantially improve over distributional spaces (e.g., by 13% for the German vector space) also in the language transfer setup, without seeing a single constraint in the target language. These transfer results hold promise to support vector space specialization even for resource-lean languages. The more sophisticated contrastive ER-CNT model variant again outperforms the simpler ER-MSD variant, and it does so for all three languages, which is consistent with the findings from the monolingual English experiments (see Table 1).
Table 3: Spearman's ρ correlation scores for German, Italian, and Croatian embeddings in the transfer setup: the vectors are specialized using the models trained on English constraints and evaluated on respective language-specific SimLex-999 variants.
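The transfer pipeline can be sketched as two steps: learn a projection from translation pairs, then apply the English-trained specialization function to the projected vectors. The sketch below uses an unconstrained least-squares linear map, which is an assumption made for illustration; this section does not state which projection method was used, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def learn_projection(src, tgt):
    """Least-squares matrix W minimizing ||src @ W - tgt|| over translation pairs."""
    W, *_ = np.linalg.lstsq(src, tgt, rcond=None)
    return W

def transfer_specialize(X_foreign, W, specialize_fn):
    """Project foreign vectors into the English space, then apply the English-trained f."""
    return specialize_fn(X_foreign @ W)

# Synthetic check: 40 "translation pairs" related by a known linear map.
rng = np.random.default_rng(0)
W_true = rng.normal(size=(5, 5))
src = rng.normal(size=(40, 5))   # foreign-language vectors of the translated words
tgt = src @ W_true               # corresponding English vectors
W = learn_projection(src, tgt)

# np.tanh stands in for a trained specialization function f.
X_foreign = rng.normal(size=(3, 5))
X_spec = transfer_specialize(X_foreign, W, np.tanh)
```

Nothing in the second step depends on the language of the input: once a vector lives in the shared space, the same function is applied to it, which is what makes the zero-shot setup possible.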

Downstream Tasks
We now evaluate the impact of our global ER method on two downstream tasks in which differentiating semantic similarity from semantic relatedness is particularly important: lexical text simplification (LS) and dialog state tracking (DST).

Lexical Text Simplification
Lexical simplification aims to replace complex words (used less frequently and known to fewer speakers) with their simpler synonyms that fit into the context, that is, without changing the meaning of the original text. Because retaining the meaning of the original text is a strict requirement, complex words need to be replaced with semantically similar words, whereas replacements with semantically related words (e.g., replacing "pilot" with "airplane" in "Ferrari's pilot won the race") produce incorrect text which is more difficult to comprehend.

Simplification Using Distributional Vectors.
We use the LIGHT-LS lexical simplification algorithm of Glavaš and Štajner (2015), which makes the word replacement decisions primarily based on semantic similarities between words in a distributional vector space. For each word in the input text, LIGHT-LS retrieves the most similar replacement candidates from the vector space. The candidates are then ranked according to several measures of simplicity and fitness for the context. Finally, the replacement is made if the top-ranked candidate is estimated to be simpler than the original word. By plugging vector spaces specialized by the ER model into LIGHT-LS, we hope to generate true synonymous candidates more frequently than with the unspecialized distributional space.
Evaluation Setup. We evaluate LIGHT-LS on the LS dataset crowdsourced by Horn et al. (2014). For each indicated complex word, Horn et al. (2014) collected 50 manual simplifications. We use two evaluation metrics from prior work (Horn et al., 2014; Glavaš and Štajner, 2015) to quantify the quality and frequency of word replacements: (1) accuracy (A) is the number of correct simplifications made (i.e., when the replacement made by the system is found in the list of manual replacements) divided by the total number of indicated complex words; and (2) change (C) is the percentage of indicated complex words that were replaced by the system (regardless of whether the replacement was correct). We plug into LIGHT-LS both unspecialized and specialized variants of three previously used English embedding spaces: GLOVE-CC, FASTTEXT, and SGNS-W2. Additionally, we again evaluate specializations of the same spaces produced by the state-of-the-art local retrofitting model ATTRACT-REPEL.
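The two metrics defined above reduce to simple counts. In the sketch below, a system decision is either a replacement string or None when the word is left unchanged; this data layout, the function name, and the toy examples are assumptions for illustration.

```python
def ls_metrics(system_replacements, gold_simplifications):
    """Return (accuracy, change) over the indicated complex words.

    system_replacements[i] is the substitute chosen by the system, or None
    if the word was left unchanged; gold_simplifications[i] is the set of
    manual simplifications collected for the same word.
    """
    total = len(system_replacements)
    changed = sum(r is not None for r in system_replacements)
    correct = sum(r is not None and r in gold
                  for r, gold in zip(system_replacements, gold_simplifications))
    return correct / total, changed / total

# Four indicated complex words: two correct replacements, one wrong, one unchanged.
system = ["help", None, "plane", "big"]
gold = [{"help", "aid"}, {"use"}, {"driver"}, {"big", "large"}]
accuracy, change = ls_metrics(system, gold)
```

Accuracy can never exceed change, since a replacement has to be made before it can be counted as correct.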
Results and Analysis. The results with LIGHT-LS are summarized in Table 4. The ER-CNT model yields considerable gains over unspecialized spaces for both metrics. This suggests that the ER-specialized embedding spaces allow LIGHT-LS to generate true synonymous candidate replacements more often than with unspecialized spaces, and also verifies the importance of specialization for the LS task. In this real-world downstream task, our ER-CNT model also yields better results than ATTRACT-REPEL. Here, only 59.6% of all indicated complex words and manual replacement candidates from the LS dataset are covered by the linguistic constraints. This accentuates the need to specialize the full distributional space in downstream applications, as done by the ER model, while ATTRACT-REPEL is limited to local vector updates only of words seen in the constraints. By learning a global specialization function, the proposed ER models seem more resilient to the observed drop in coverage of test words by linguistic constraints. Table 5 shows example substitutions of LIGHT-LS when using different embedding spaces: the original GLOVE-CC space and its specializations obtained with ER-CNT and ATTRACT-REPEL.

Dialog State Tracking
Finally, we also evaluate the importance of explicit retrofitting in a downstream language understanding task, namely dialog state tracking (DST) (Henderson et al., 2014; Williams et al., 2016). A DST model is typically the first component of a dialog system pipeline (Young, 2010), tasked with capturing the user's goals and updating the dialog state at each dialog turn. As in lexical simplification, discerning similarity from relatedness is crucial in DST (e.g., a dialog system should not recommend an "expensive pub in the south" when asked for a "cheap bar in the east").
Table 6: DST performance of GLOVE-CC embeddings specialized using explicit retrofitting.
Evaluation Setup. To evaluate the impact of specialized word vectors on DST, we employ the Neural Belief Tracker (NBT), a DST model that makes inferences purely based on pre-trained word vectors. NBT composes word embeddings into intermediate utterance and context representations. For full model details, we refer the reader to the original paper. Following prior work, our DST evaluation is based on the Wizard-of-Oz (WOZ) v2.0 dataset, which contains 1,200 dialogs (600 training, 200 validation, and 400 test dialogs). We evaluate the performance of the distributional and specialized GLOVE-CC embeddings and report it in terms of joint goal accuracy (JGA), a standard DST evaluation metric. All reported results are averages over 5 runs of the NBT model.

Results. We show DST performance in Table 6. The DST results tell a similar story to the word similarity and lexical simplification results: the ER model substantially improves over the distributional space. With linguistic specialization constraints covering 57% of words from the WOZ dataset, the ER model's performance is on a par with the ATTRACT-REPEL specialization. This further confirms our hypothesis that the importance of learning a global specialization for the full vocabulary in downstream tasks grows with the drop of the test word coverage by specialization constraints.
Neural Belief Tracker code: https://github.com/nmrksic/neural-belief-tracker

Conclusion
We presented a novel method for specializing word embeddings to better discern similarity from other types of semantic relatedness. Unlike existing retrofitting models, which directly update vectors of words from external constraints, we use the constraints as training examples to learn an explicit specialization function, implemented as a deep feedforward neural network. Our global specialization approach resolves the well-known inability of retrofitting models to specialize vectors of words unseen in the constraints. We demonstrated the effectiveness of the proposed model on word similarity benchmarks, and in two downstream tasks: lexical simplification and dialog state tracking. We also showed that it is possible to transfer the specialization to languages without linguistic constraints.
In future work, we will investigate explicit retrofitting methods for asymmetric relations like hypernymy and meronymy. We also intend to apply the method to other downstream tasks and to investigate the zero-shot language transfer of the specialization function for more language pairs. ER code is publicly available at: https://github.com/codogogo/explirefit.