Post-Specialisation: Retrofitting Vectors of Words Unseen in Lexical Resources

Word vector specialisation (also known as retrofitting) is a portable, light-weight approach to fine-tuning arbitrary distributional word vector spaces by injecting external knowledge from rich lexical resources such as WordNet. By design, these post-processing methods only update the vectors of words occurring in external lexicons, leaving the representations of all unseen words intact. In this paper, we show that constraint-driven vector space specialisation can be extended to unseen words. We propose a novel post-specialisation method that: a) preserves the useful linguistic knowledge for seen words; while b) propagates this external signal to unseen words in order to improve their vector representations as well. Our post-specialisation approach makes the specialisation function explicit, in the form of a deep non-linear neural network, learning to predict specialised vectors from their original distributional counterparts. The learned function is then used to specialise vectors of unseen words. This approach, applicable to any post-processing model, yields considerable gains over the initial specialisation models both in intrinsic word similarity tasks and in two downstream tasks: dialogue state tracking and lexical text simplification. The positive effects persist across three languages, demonstrating the importance of specialising the full vocabulary of distributional word vector spaces.


Introduction
Word representation learning is a key research area in current Natural Language Processing (NLP), with its usefulness demonstrated across a range of tasks (Collobert et al., 2011; Chen and Manning, 2014; Melamud et al., 2016b). The standard techniques for inducing distributed word representations are grounded in the distributional hypothesis (Harris, 1954): they rely on co-occurrence information in large textual corpora (Mikolov et al., 2013b; Pennington et al., 2014; Levy and Goldberg, 2014; Levy et al., 2015; Bojanowski et al., 2017). As a result, these models tend to coalesce the notions of semantic similarity and (broader) conceptual relatedness, and cannot accurately distinguish antonyms from synonyms (Schwartz et al., 2015).

Recently, we have witnessed a rise of interest in representation models that move beyond stand-alone unsupervised learning: they leverage external knowledge in human- and automatically-constructed lexical resources to enrich the semantic content of distributional word vectors, in a process termed semantic specialisation. This is often done as a post-processing (sometimes referred to as retrofitting) step: input word vectors are fine-tuned to satisfy linguistic constraints extracted from lexical resources such as WordNet or BabelNet (Faruqui et al., 2015). The use of external curated knowledge yields improved word vectors for the benefit of downstream applications (Faruqui, 2016). At the same time, this specialisation of the distributional space distinguishes between true similarity and relatedness, and supports language understanding tasks (Kiela et al., 2015).

While there is consensus regarding their benefits and ease of use, one property of the post-processing specialisation methods slips under the radar: most existing post-processors update word embeddings only for words which are present (i.e., seen) in the external constraints, while vectors of all other (i.e., unseen) words remain unaffected.

In this work, we propose a new approach that extends the specialisation framework to unseen words, relying on the transformation of the vector (sub)space of seen words. Our intuition is that the process of fine-tuning seen words provides implicit information on how to propagate the external knowledge to unseen words. The method should preserve the already injected knowledge for seen words, while simultaneously propagating the external signal to unseen words in order to improve their vectors.
The proposed post-specialisation method can be seen as a two-step process, illustrated in Fig. 1a: 1) We use a state-of-the-art specialisation model to transform the subspace of seen words from the input distributional space into the specialised subspace; 2) We learn a mapping function based on the transformation of the "seen subspace", and then apply it to the distributional subspace of unseen words. We allow the proposed post-specialisation model to learn from large external linguistic resources by implementing the mapping as a deep feed-forward neural network with non-linear activations. This enables the model to generalise the fine-tuning steps taken by the initial specialisation model, itself based on a very large number (e.g., hundreds of thousands) of external linguistic constraints.
As indicated by the results on word similarity and two downstream tasks (dialogue state tracking and lexical text simplification), our post-specialisation method consistently outperforms state-of-the-art methods which specialise seen words only. We report improvements using three distinct input vector spaces for English and for three test languages (English, German, Italian), verifying the robustness of our approach.

Related Work and Motivation
Vector Space Specialisation A standard approach to incorporating external and background knowledge into word vector spaces is to pull the representations of similar words closer together and to push words in undesirable relations (e.g., antonyms) away from each other. Some models integrate such constraints into the training procedure and jointly optimise distributional and non-distributional objectives: they modify the prior or the regularisation (Yu and Dredze, 2014; Xu et al., 2014; Kiela et al., 2015), or use a variant of the SGNS-style objective (Liu et al., 2015; Ono et al., 2015; Osborne et al., 2016; Nguyen et al., 2017). In theory, word embeddings obtained by these joint models could be as good as representations produced by models which fine-tune the input vector space. However, their performance falls behind that of fine-tuning methods (Wieting et al., 2015). Another disadvantage is that their architecture is tied to a specific underlying model (typically word2vec models).
In contrast, fine-tuning models inject external knowledge from available lexical resources (e.g., WordNet, PPDB) into pre-trained word vectors as a post-processing step (Faruqui et al., 2015; Rothe and Schütze, 2015; Wieting et al., 2015; Nguyen et al., 2016; Cotterell et al., 2016). Such post-processing models are popular because they offer a portable, flexible, and light-weight approach to incorporating external knowledge into arbitrary vector spaces, yielding state-of-the-art results on language understanding tasks (Faruqui et al., 2015; Kim et al., 2016; Vulić et al., 2017b).
Existing post-processing models, however, suffer from a major limitation. Their modus operandi is to enrich the distributional information with external knowledge only if such knowledge is present in a lexical resource. This means that they update and improve only representations of words actually seen in external resources. Because such words constitute only a fraction of the whole vocabulary (see Sect. 4), most words, unseen in the constraints, retain their original vectors. The main goal of this work is to address this shortcoming by specialising all words from the initial distributional space.

Methodology: Post-Specialisation
Our starting point is the state-of-the-art specialisation model ATTRACT-REPEL (AR), outlined in Sect. 3.1. We opt for the AR model due to its strong performance and ease of use, but we note that the proposed post-specialisation approach for specialising unseen words, described in Sect. 3.2, is applicable to any post-processor, as empirically validated in Sect. 5.

Initial Specialisation Model: AR
Let V_s be the vocabulary, A the set of synonymous ATTRACT word pairs (e.g., rich and wealthy), and R the set of antonymous REPEL word pairs (e.g., increase and decrease). The ATTRACT-REPEL procedure operates over mini-batches of such pairs, B_A and B_R. Let each word pair (x_l, x_r) in these sets correspond to a vector pair (x_l, x_r). A mini-batch of b_att attract word pairs is given by:

$$B_A = \left[(\mathbf{x}_l^1, \mathbf{x}_r^1), \ldots, (\mathbf{x}_l^{k_1}, \mathbf{x}_r^{k_1})\right], \quad k_1 = b_{att}$$

(analogously for the mini-batch B_R of b_rep repel pairs, with k_2 = b_rep). Next, the sets of negative examples T_A = [(t_l^1, t_r^1), ..., (t_l^{k_1}, t_r^{k_1})] and T_R = [(t_l^1, t_r^1), ..., (t_l^{k_2}, t_r^{k_2})] are defined as pairs of negative examples for each A and R pair in mini-batches B_A and B_R. These negative examples are chosen from the word vectors present in B_A or B_R so that, for each A pair (x_l, x_r), the negative example pair (t_l, t_r) is chosen so that t_l is the vector closest (in terms of cosine distance) to x_l and t_r is closest to x_r. The negatives are used 1) to force A pairs to be closer to each other than to their respective negative examples; and 2) to force R pairs to be further away from each other than from their negative examples. The first term of the cost function pulls A pairs together:

$$Att(B_A, T_A) = \sum_{i=1}^{k_1} \left[ \tau\!\left(\delta_{att} + \cos(\mathbf{x}_l^i, \mathbf{t}_l^i) - \cos(\mathbf{x}_l^i, \mathbf{x}_r^i)\right) + \tau\!\left(\delta_{att} + \cos(\mathbf{x}_r^i, \mathbf{t}_r^i) - \cos(\mathbf{x}_l^i, \mathbf{x}_r^i)\right) \right]$$

where τ(z) = max(0, z) is the standard rectifier function (Nair and Hinton, 2010) and δ_att is the attract margin: it determines how much closer these vectors should be to each other than to their respective negative examples. The second, REPEL term in the cost function is analogous: it pushes R word pairs away from each other by the margin δ_rep. Finally, in addition to the A and R terms, a regularisation term is used to preserve the semantic content originally present in the distributional vector space, as long as this information does not contradict the injected external knowledge. Let V(B) be the set of all word vectors present in a mini-batch; the distributional regularisation term is then:

$$Reg(B_A, B_R) = \sum_{\mathbf{x}_i \in V(B_A \cup B_R)} \lambda_{reg} \, \lVert \widehat{\mathbf{x}}_i - \mathbf{x}_i \rVert_2$$

where λ_reg is the L_2-regularisation constant and x_i denotes the original (distributional) word vector for word x_i. The full ATTRACT-REPEL cost function is finally constructed as the sum of all three terms.
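To make the cost terms concrete, the following is a minimal NumPy sketch of the attract term and the distributional regularisation term, assuming unit-normalised vectors (so cosine similarity reduces to a dot product); function and variable names are illustrative and not taken from the original implementation.

```python
import numpy as np

def cos(a, b):
    # vectors are assumed unit-normalised, so cosine similarity is a dot product
    return float(np.dot(a, b))

def attract_cost(batch, negatives, delta_att=0.6):
    """One ATTRACT mini-batch term (sketch). `batch` is a list of (x_l, x_r)
    vector pairs; `negatives` holds the matching (t_l, t_r) pairs, i.e. the
    in-batch vectors closest to x_l and x_r respectively."""
    tau = lambda z: max(0.0, z)  # standard rectifier
    cost = 0.0
    for (x_l, x_r), (t_l, t_r) in zip(batch, negatives):
        # each word must end up closer to its pair partner than to its
        # negative example, by the margin delta_att
        cost += tau(delta_att + cos(x_l, t_l) - cos(x_l, x_r))
        cost += tau(delta_att + cos(x_r, t_r) - cos(x_l, x_r))
    return cost

def distributional_regularisation(updated, originals, lambda_reg=1e-9):
    # pulls every updated vector in the mini-batch back towards its
    # original distributional vector
    return sum(lambda_reg * np.linalg.norm(x_hat - x)
               for x_hat, x in zip(updated, originals))
```

The REPEL term would mirror `attract_cost` with the sign of the pair/negative similarities swapped and the margin δ_rep.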

Specialisation of Unseen Words
Problem Formulation The goal is to learn a global transformation function that generalises the perturbations of the initial vector space made by ATTRACT-REPEL (or any other specialisation procedure), as conditioned on the external constraints. The learned function propagates the signal coded in the input constraints to all the words unseen during the specialisation process. We seek a regression function f : R^dim → R^dim, where dim is the vector space dimensionality. It maps word vectors from the initial vector space X to the specialised target space X'. Let $\widehat{X}' = f(X)$ refer to the predicted mapping of the vector space, while the mapping of a single word vector is denoted $\widehat{x}'_i = f(x_i)$. An input distributional vector space X_d represents words from a vocabulary V_d. V_d may be divided into two vocabulary subsets, V_d = V_s ∪ V_u, with the accompanying vector subspaces X_d = X_s ∪ X_u. V_s refers to the vocabulary of seen words: those that appear in the external linguistic constraints and have their embeddings changed in the specialisation process. V_u denotes the vocabulary of unseen words: those not present in the constraints and whose embeddings are unaffected by the specialisation procedure.
The AR specialisation process transforms only the subspace X_s into the specialised subspace X'_s. All words x_i ∈ V_s may now be used as training examples for learning the explicit mapping function f from X_s into X'_s. If N = |V_s|, we in fact rely on N training pairs:

$$\{(\mathbf{x}_1, \mathbf{x}'_1), (\mathbf{x}_2, \mathbf{x}'_2), \ldots, (\mathbf{x}_N, \mathbf{x}'_N)\}$$

Function f can then be applied to unseen words x ∈ V_u to yield the specialised subspace X'_u = f(X_u). The specialised space containing all words is then X_f = X'_s ∪ X'_u. The complete high-level post-specialisation procedure is outlined in Fig. 1a.
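To make the overall procedure concrete, here is a minimal Python sketch of this high-level pipeline under assumed dictionary-style inputs; `train_mapping` stands in for whichever objective from the next subsection is used to fit f, and all names are illustrative.

```python
import numpy as np

def post_specialise(distributional, specialised_seen, train_mapping):
    """Sketch of the high-level post-specialisation procedure.
    `distributional`: dict word -> original vector (the full space X_d);
    `specialised_seen`: dict word -> AR-specialised vector (X'_s, seen words only);
    `train_mapping`: any routine that fits f from (X_s, X'_s) training pairs
    and returns a callable applied to single vectors."""
    seen = [w for w in distributional if w in specialised_seen]
    unseen = [w for w in distributional if w not in specialised_seen]

    # N training pairs (x_i, x'_i) taken from the seen subspace
    X_s = np.vstack([distributional[w] for w in seen])
    X_s_prime = np.vstack([specialised_seen[w] for w in seen])
    f = train_mapping(X_s, X_s_prime)

    # apply the learned function only to unseen words; keep the AR output
    # for seen words, so X_f = X'_s ∪ X'_u
    X_f = dict(specialised_seen)
    for w in unseen:
        X_f[w] = f(distributional[w])
    return X_f
```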
Note that another variant of the approach could obtain X_f as X_f = f(X_d), that is, the entire distributional space is transformed by f. However, this variant seems counter-intuitive as it forgets the actual output of the initial specialisation procedure and replaces word vectors from X'_s with their approximations, i.e., f-mapped vectors.

Objective Functions As mentioned, the N seen words x_i ∈ V_s in fact serve as our "pseudo-translation" pairs supporting the learning of a cross-space mapping function. In practice, in its high-level formulation, our mapping problem is equivalent to those encountered in the literature on cross-lingual word embeddings, where the goal is to learn a shared cross-lingual space given monolingual vector spaces in two languages and N translation pairs (Mikolov et al., 2013a; Vulić and Korhonen, 2016b; Artetxe et al., 2016, 2017; Conneau et al., 2017; Ruder et al., 2017).

Figure 1: (a) High-level illustration of the post-specialisation approach: the subspace X_s of the initial distributional vector space X_d = X_s ∪ X_u is first specialised/fine-tuned by the ATTRACT-REPEL specialisation model (or any other post-processing model) to obtain the transformed subspace X'_s. The words present (i.e., seen) in the input set of linguistic constraints are now assigned different representations in X_s (the original distributional vector) and X'_s (the specialised vector): they are therefore used as training examples to learn a non-linear cross-space mapping function. This function is then applied to all word vectors x_i ∈ X_u representing words unseen in the constraints to yield a specialised subspace X'_u. The final space is X_f = X'_s ∪ X'_u, and it contains transformed representations for all words from the initial space. (b) Low-level implementation of the non-linear regression function which maps from X_u to X'_u: a deep feed-forward fully-connected neural network with H hidden layers and non-linear activations.
In our setup, the standard objective based on L_2-penalised least squares may be formulated as follows:

$$\min_{f} \; \lVert f(X_s) - X'_s \rVert_F^2$$

where $\lVert \cdot \rVert_F^2$ denotes the squared Frobenius norm. In the most common form, f(X_s) is simply a linear map/matrix W_f ∈ R^{dim×dim} (Mikolov et al., 2013a): f(X) = W_f X.
After learning f based on the X_s → X'_s transformation, one can simply apply f to unseen words: X'_u = f(X_u). This linear mapping model, termed LINEAR-MSE, has an analytical solution (Artetxe et al., 2016), and has been proven to work well with cross-lingual embeddings. However, given that the specialisation model injects hundreds of thousands (or even millions) of linguistic constraints into the distributional space (see later in Sect. 4), we suspect that the assumption of linearity is too limiting and does not fully hold in this particular setup.
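As a concrete illustration, a minimal NumPy sketch of the LINEAR-MSE variant follows, assuming word vectors are stored as matrix rows (so the map is applied as X W rather than W X); the closed-form solution is obtained with an ordinary least-squares solver.

```python
import numpy as np

def fit_linear_mse(X_s, X_s_prime):
    """LINEAR-MSE sketch: find W minimising ||X_s W - X'_s||_F^2, where the
    rows of X_s / X'_s are the distributional / specialised vectors of the
    seen words. Returns the learned map as a callable."""
    W, *_ = np.linalg.lstsq(X_s, X_s_prime, rcond=None)
    return lambda X: X @ W

# usage sketch: X_u_prime = fit_linear_mse(X_s, X_s_prime)(X_u)
```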
Using the same L_2-penalised least squares objective, we can thus replace the linear map with a non-linear function f : R^dim → R^dim. The non-linear mapping, illustrated by Fig. 1b, is implemented as a deep feed-forward fully-connected neural network (DFFN) with H hidden layers and non-linear activations. This variant is called NONLINEAR-MSE.

Another variant objective is the contrastive margin-based ranking loss with negative sampling (MM), similar to the original ATTRACT-REPEL objective and used in other applications in prior work, e.g., for cross-modal mapping (Frome et al., 2013; Kummerfeld et al., 2015). Let $\widehat{x}'_i = f(x_i)$ denote the predicted vector for the word x_i ∈ V_s, and let x'_i refer to the "true" vector of x_i in the specialised space X'_s after the AR specialisation procedure. The MM loss is then defined as follows:

$$J_{MM} = \sum_{i=1}^{N} \sum_{j=1}^{k} \tau\!\left(\delta_{mm} + \cos(\widehat{\mathbf{x}}'_i, \mathbf{x}'_{n_j}) - \cos(\widehat{\mathbf{x}}'_i, \mathbf{x}'_i)\right)$$

where cos is the cosine similarity measure, δ_mm is the margin, k is the number of negative samples, and x'_{n_j} denotes the j-th negative sample for word x_i. The objective tries to learn the mapping f so that each predicted vector $\widehat{x}'_i$ is by the specified margin δ_mm closer to the correct target vector x'_i than to any of its k negative samples. (A simpler variant without negative examples, $\sum_{i=1}^{N} \tau(\delta_{mm} - \cos(\widehat{x}'_i, x'_i))$, which for δ_mm = 1.0 simply enforces maximum cosine similarity between each predicted vector and its correct target vector, outscores the MSE-style objective but is consistently outperformed by the MM objective, so we do not report its results.)
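For illustration, here is a minimal PyTorch sketch of the MM objective. It assumes negatives for each word are taken from the other target vectors in the batch; the paper uses negative sampling, whereas this sketch simply picks the k most similar in-batch vectors, so the exact sampling strategy is an assumption.

```python
import torch
import torch.nn.functional as F

def mm_loss(pred, target, delta_mm=0.6, k=25):
    """Max-margin (MM) loss sketch. `pred` holds predicted vectors f(x_i),
    `target` the corresponding specialised vectors x'_i (one row per word).
    Assumes the batch size is larger than k."""
    pred = F.normalize(pred, dim=1)
    target = F.normalize(target, dim=1)
    pos = (pred * target).sum(dim=1, keepdim=True)   # cos(f(x_i), x'_i)
    sims = pred @ target.t()                         # cos(f(x_i), x'_j) for all j
    sims.fill_diagonal_(-1.0)                        # exclude the positive pair
    neg, _ = sims.topk(k, dim=1)                     # k in-batch negatives per word
    # hinge: each prediction should be closer to its target than to any
    # negative, by at least delta_mm
    return torch.clamp(delta_mm + neg - pos, min=0.0).sum()
```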

Experimental Setup
Starting Word Embeddings (X_d = X_s ∪ X_u) To test the robustness of our approach, we experiment with three well-known, publicly available collections of English word vectors: 1) Skip-Gram with Negative Sampling (SGNS-BOW2) (Mikolov et al., 2013b) trained on the Polyglot Wikipedia (Al-Rfou et al., 2013) by Levy and Goldberg (2014) using bag-of-words windows of size 2; 2) GLOVE Common Crawl (Pennington et al., 2014); and 3) FASTTEXT (Bojanowski et al., 2017), a SGNS variant which builds word vectors as the sum of their constituent character n-gram vectors. All word embeddings are 300-dimensional. (For further details regarding the architectures and training setup of the used vector collections, we refer the reader to the original papers. Additional experiments with other word vectors, e.g., with CONTEXT2VEC (Melamud et al., 2016a), which uses bidirectional LSTMs (Hochreiter and Schmidhuber, 1997) for context modeling, and with dependency-based word embeddings (Bansal et al., 2014; Melamud et al., 2016b), lead to similar results and the same conclusions.)

AR Specialisation and Constraints (X_s → X'_s) We experiment with linguistic constraints used before by Vulić et al. (2017a): they extracted monolingual synonymy/ATTRACT pairs from the Paraphrase Database (PPDB) (Ganitkevitch et al., 2013; Pavlick et al., 2015) (640,435 synonymy pairs in total), while their antonymy/REPEL constraints came from BabelNet (Navigli and Ponzetto, 2012) (11,939 pairs). (We have also experimented with another set of constraints used in prior work (Zhang et al., 2014; Ono et al., 2015), reaching similar conclusions: these were extracted from WordNet (Fellbaum, 1998) and Roget (Kipfer, 2009), and comprise 1,023,082 synonymy pairs and 380,873 antonymy pairs.) The coverage of V_d vocabulary words in the constraints illustrates well the problem of unseen words with the fine-tuning specialisation models. For instance, the constraints cover only a small subset of the entire vocabulary V_d for SGNS-BOW2: 16.6%. They also cover only 15.3% of the top 200K most frequent V_d words from FASTTEXT.
Network Design and Parameters (X_u → X'_u) The non-linear regression function f : R^dim → R^dim is a DFFN with H hidden layers, each of dimensionality d_1 = d_2 = ... = d_H = 512 (see Fig. 1b). Non-linear activations are used in each layer and omitted only before the final output layer to enable full-range predictions (see Fig. 1b again).
The choices of non-linear activation and initialisation are guided by recent recommendations from the literature. First, we use swish (Ramachandran et al., 2017; Elfwing et al., 2017) as the non-linearity, defined as swish(x) = x · sigmoid(βx). We fix β = 1 as suggested by Ramachandran et al. (2017). Second, we use the HE normal initialisation (He et al., 2015), which is preferred over the XAVIER initialisation (Glorot and Bengio, 2010) for deep models (Mishkin and Matas, 2016; Li et al., 2016), although in our experiments we do not observe a significant difference in performance between the two alternatives. We set H = 5 in all experiments without any fine-tuning; we also analyse the impact of the network depth in Sect. 5.
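A minimal PyTorch sketch of this architecture is given below; the layer widths, depth, swish activation, and He normal initialisation follow the description above, while everything else (module names, the absence of dropout, bias initialisation) is an illustrative assumption.

```python
import torch
import torch.nn as nn

class Swish(nn.Module):
    def forward(self, x):
        return x * torch.sigmoid(x)   # swish with beta = 1

def build_dffn(dim=300, hidden=512, depth=5):
    """DFFN sketch: `depth` hidden layers of width 512 with swish activations,
    a linear output layer (no activation, for full-range predictions), and
    He normal initialisation of the weights."""
    layers, d_in = [], dim
    for _ in range(depth):
        layers += [nn.Linear(d_in, hidden), Swish()]
        d_in = hidden
    layers.append(nn.Linear(d_in, dim))   # back to the embedding dimensionality
    net = nn.Sequential(*layers)
    for m in net.modules():
        if isinstance(m, nn.Linear):
            nn.init.kaiming_normal_(m.weight)  # He normal initialisation
            nn.init.zeros_(m.bias)
    return net
```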
Optimisation For the AR specialisation step, we adopt the originally suggested model setup. Hyperparameter values are set to: δ_att = 0.6, δ_rep = 0.0, λ_reg = 10^{-9}. The models are trained for 5 epochs with Adagrad (Duchi et al., 2011), with batch sizes set to b_att = b_rep = 50, again as in the original work.
For training the non-linear mapping with DFFN ( Fig. 1b), we use the Adam algorithm (Kingma and Ba, 2015) with default settings. The model is trained for 100 epochs with early stopping on a validation set. We reserve 10% of all available seen data (i.e., the words from V s represented in X s and X s ) for validation, the rest are used for training. For the MM objective, we set δ mm = 0.6 and k = 25 in all experiments without any fine-tuning.
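The following training-loop sketch is consistent with this setup (Adam with default settings, a 90/10 train/validation split over the seen words, early stopping on validation loss); the patience value and the full-batch updates are assumptions rather than details from the paper.

```python
import torch

def fit_dffn(net, X_s, X_s_prime, loss_fn, epochs=100, patience=5):
    """Training sketch for the DFFN mapping. `X_s` and `X_s_prime` are
    tensors of seen-word vectors (rows aligned); `loss_fn` is either an
    MSE-style objective or the MM objective."""
    n_val = max(1, int(0.1 * len(X_s)))
    perm = torch.randperm(len(X_s))
    val, train = perm[:n_val], perm[n_val:]
    opt = torch.optim.Adam(net.parameters())
    best, best_state, bad = float("inf"), None, 0
    for _ in range(epochs):
        net.train()
        opt.zero_grad()
        loss_fn(net(X_s[train]), X_s_prime[train]).backward()
        opt.step()
        net.eval()
        with torch.no_grad():
            val_loss = loss_fn(net(X_s[val]), X_s_prime[val]).item()
        if val_loss < best:   # keep the best model seen so far
            best = val_loss
            best_state = {k: v.clone() for k, v in net.state_dict().items()}
            bad = 0
        else:                 # early stopping on the validation set
            bad += 1
            if bad >= patience:
                break
    if best_state is not None:
        net.load_state_dict(best_state)
    return net
```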

Intrinsic Evaluation: Word Similarity
Evaluation Protocol The first set of experiments evaluates vector spaces with different specialisation procedures intrinsically on word similarity benchmarks: we use the SimLex-999 dataset and SimVerb-3500 (Gerz et al., 2016), a recent verb pair similarity dataset providing similarity ratings for 3,500 verb pairs. Both datasets instruct annotators to discern between true semantic similarity and (broader) relatedness, so that related but non-similar words (e.g., tiger and jungle) receive a low rating. Spearman's ρ rank correlation is used as the evaluation metric.

Table 1: Spearman's ρ correlation scores for three word vector collections on two English word similarity datasets, SimLex-999 (SL) and SimVerb-3500 (SV), using different mapping variants, evaluation protocols, and word vector spaces: from the initial distributional space X_d to the fully specialised space X_f. H = 5.

We evaluate word vectors in two settings. First, in a synthetic hold-out setting, we remove all linguistic constraints which contain words from the SimLex and SimVerb evaluation data, effectively forcing all SimLex and SimVerb words to be unseen by the AR specialisation model. The specialised vectors for these words are estimated by the learned non-linear DFFN mapping model. Second, the all setting is a standard "real-life" scenario where some test (SimLex/SimVerb) words do occur in the constraints, while the mapping is learned for the remaining words.
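As a sketch of this protocol, the snippet below computes Spearman's ρ between human ratings and cosine similarities, and builds the hold-out constraint set by discarding any constraint mentioning a test word; the data-structure choices are assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(vectors, pairs):
    """Word-similarity sketch: `pairs` is a list of (word1, word2, human_rating);
    returns Spearman's rho between human ratings and cosine similarities."""
    gold, predicted = [], []
    for w1, w2, rating in pairs:
        v1, v2 = vectors[w1], vectors[w2]
        cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
        gold.append(rating)
        predicted.append(cos)
    return spearmanr(gold, predicted).correlation

def holdout_constraints(constraints, test_words):
    # hold-out setting: discard every constraint that mentions a test word,
    # so all SimLex/SimVerb words remain unseen by the specialisation model
    return [(w1, w2) for (w1, w2) in constraints
            if w1 not in test_words and w2 not in test_words]
```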

Results and Analysis
The results with the three word vector collections are provided in Tab. 1. In addition, Fig. 2 plots the influence of the network depth H on the model's performance.
The results suggest that the mapping of unseen words is universally useful, as the highest correlation scores are obtained with the final fully specialised vector space X_f for all three input spaces. The results in the hold-out setup are particularly indicative of the improvement achieved by our post-specialisation method. For instance, it achieves a +0.2 correlation gain with GLOVE on both SimLex and SimVerb by specialising vector representations for words present in these datasets without seeing a single external constraint which contains any of these words. This suggests that the perturbation of the seen subspace X_s by ATTRACT-REPEL contains implicit knowledge that can be propagated to X_u, learning better representations for unseen words. We observe small but consistent improvements across the board in the all setup. The smaller gains can be explained by the fact that a majority of SimLex and SimVerb words are present in the external constraints (93.7% and 87.2%, respectively).
The scores also indicate that both non-linearity and the chosen objective function contribute to the quality of the learned mapping: the largest gains are reported with the NONLINEAR-MM variant, which a) employs non-linear activations and b) replaces the basic mean-squared-error objective with max-margin. The usefulness of the latter has been established in prior work on cross-space mapping learning. The former indicates that the initial AR transformation is non-linear. It is guided by a large number of constraints; their effect cannot be captured by a simple linear map, as used in prior work on, e.g., cross-lingual word embeddings (Mikolov et al., 2013a; Ruder et al., 2017).
Finally, the analysis of the network depth H indicates that going deeper helps only to a certain extent. Adding more layers allows for a richer parametrisation of the network (which is beneficial given the number of linguistic constraints used by AR). This makes the model more expressive, but performance seems to saturate with larger H values.

Post-Specialisation with Other Post-Processors
We also verify that our post-specialisation approach is not tied to the ATTRACT-REPEL method, and is indeed applicable on top of any post-processing specialisation method. We analyse the impact of post-specialisation in the hold-out setting using the original retrofitting (RFit) model (Faruqui et al., 2015) and counter-fitting (CFit) in lieu of ATTRACT-REPEL. The results on word similarity with the best-performing NONLINEAR-MM variant are summarised in Tab. 2.
The scores again indicate the usefulness of post-specialisation. As expected, the gains are lower than with ATTRACT-REPEL. RFit falls short of CFit as by design it can leverage only synonymy (i.e., ATTRACT) external constraints.

Downstream Task I: DST
Next, we evaluate the usefulness of post-specialisation for two downstream tasks - dialogue state tracking and lexical text simplification - in which discerning semantic similarity from other types of semantic relatedness is crucial. We first evaluate the importance of post-specialisation for the downstream language understanding task of dialogue state tracking (DST) (Henderson et al., 2014; Williams et al., 2016), adopting the evaluation protocol and data from prior work.

DST: Model and Evaluation
The DST model is the first component of modern dialogue pipelines (Young, 2010): it captures the users' goals at each dialogue turn and then updates the dialogue state. Goals are represented as sets of constraints expressed as slot-value pairs (e.g., food=Chinese). The set of slots and the set of values for each slot constitute the ontology of a dialogue domain. The probability distribution over the possible states is the system's estimate of the user's goals, and it is used by the dialogue manager module to select the subsequent system response. An example in Fig. 3 illustrates the DST pipeline.

Figure 3: DST labels (user goals given by slot-value pairs) in a multi-turn dialogue (Mrkšić et al., 2015).

For evaluation, we use the Neural Belief Tracker (NBT), a state-of-the-art DST model which was the first to reason purely over pre-trained word vectors. The NBT uses no hand-crafted semantic lexicons, instead composing word vectors into intermediate utterance and context representations. For full model details, we refer the reader to the original paper. The importance of word vector specialisation for the DST task (e.g., distinguishing between synonyms and antonyms by pulling northern and north closer in the vector space while pushing north and south away) has been established in prior work. Again, as in prior work, the DST evaluation is based on the Wizard-of-Oz (WOZ) v2.0 dataset, comprising 1,200 dialogues split into training (600 dialogues), development (200), and test data (400). In all experiments, we report the standard DST performance measure, joint goal accuracy, with scores averaged over 5 NBT training runs.
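For clarity, a small sketch of joint goal accuracy as it is commonly computed is shown below (a turn counts as correct only if every slot-value pair of the user goal is predicted correctly); this is an illustrative definition rather than the exact NBT evaluation code.

```python
def joint_goal_accuracy(predicted_goals, gold_goals):
    """Joint goal accuracy sketch. Both arguments are lists with one entry
    per dialogue turn; each entry is a dict of slot-value pairs such as
    {'food': 'chinese', 'area': 'north'}. A turn is correct only if the
    predicted goal exactly matches the gold goal."""
    correct = sum(1 for pred, gold in zip(predicted_goals, gold_goals)
                  if pred == gold)
    return correct / len(gold_goals)
```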

Results and Analysis
We again evaluate word vectors in two settings: 1) hold-out, where linguistic constraints with words appearing in the WOZ data are removed, making all WOZ words unseen by ATTRACT-REPEL; and 2) all. The results for the English DST task with different GLOVE word vector variants are summarised in Tab. 3; similar trends in results are observed with the two other word vector collections. The scores confirm the conclusions established in the word similarity task. First, semantic specialisation with ATTRACT-REPEL is again beneficial, and discerning between synonyms and antonyms improves DST performance. However, specialising unseen words (yielding the final specialised space X_f) brings further improvements in both evaluation settings, supporting our claim that the specialisation signal can be propagated to unseen words.
This downstream evaluation again demonstrates the importance of non-linearity, as the peak scores are reported with the NONLINEAR-MM variant. More substantial gains in the all setup are observed in the DST task compared to the word similarity task. This stems from a lower coverage of the WOZ data in the AR constraints: 36.3% of all WOZ words are unseen words. Finally, the scores are higher on average in the all setup, since this setup uses more external constraints for AR, and consequently uses more training examples to learn the mapping.
Other Languages We test the portability of our framework to two other languages for which we have similar evaluation data: German (DE) and Italian (IT). SimLex-999 has been translated and rescored in the two languages by Leviant and Reichart (2015), and the WOZ data were translated and adapted to both languages in prior work. Exactly the same setup is used as in our English experiments, without any additional language-specific fine-tuning. Linguistic constraints were extracted from the same sources: synonyms from the PPDB (135,868 in DE, 362,452 in IT), antonyms from BabelNet (4,124 in DE, and 16,854 in IT). Our starting distributional vector spaces are taken from prior work; the DE vectors are from Vulić and Korhonen (2016a). The results are summarised in Tab. 4.
Our post-specialisation approach yields consistent improvements over the initial distributional space and the AR specialisation model in both tasks and for both languages. We do not observe any gain on IT SimLex in the all setup since IT constraints have almost complete coverage of all IT SimLex words (99.3%; the coverage is 64.8% in German). As expected, the DST scores in the all setup are higher than in the hold-out setup due to a larger number of constraints and training examples.
Lower absolute scores for Italian and German compared to the ones reported for English are due to multiple factors, as discussed in prior work: 1) the AR model uses fewer linguistic constraints for DE and IT; 2) distributional word vectors are induced from smaller corpora; 3) linguistic phenomena (e.g., cases and compounding in DE) contribute to data sparsity and also make the DST task more challenging. However, it is important to stress the consistent gains over the vector space specialised by the state-of-the-art ATTRACT-REPEL model across all three test languages. This indicates that the proposed approach is language-agnostic and portable to multiple languages.

Downstream Task II: Lexical Simplification
In our second downstream task, we examine the effects of post-specialisation on lexical simplification (LS) in English. LS aims to substitute complex words (i.e., less commonly used words) with their simpler synonyms in context. Simplified text must preserve the meaning of the original text, which is why discerning similarity from relatedness is important (e.g., in "The automobile was set on fire" the word "automobile" should be replaced with "car" or "vehicle", but not with "wheel" or "driver").

Table 5: Lexical simplification performance with post-specialisation applied on three input spaces.
We employ LIGHT-LS (Glavaš and Štajner, 2015) (https://github.com/codogogo/lightls), a lexical simplification algorithm that: 1) makes substitutions based on word similarities in a semantic vector space, and 2) can be provided an arbitrary embedding space as input. For a complex word, LIGHT-LS considers the most similar words from the vector space as simplification candidates. Candidates are ranked according to several features indicating simplicity and fitness for the context (semantic relatedness to the context of the complex word). The substitution is made if the best candidate is simpler than the original word. By providing vector spaces post-specialised for semantic similarity to LIGHT-LS, we expect to more often replace complex words with their true synonyms.
We evaluate LIGHT-LS performance in the all setup on the LS benchmark compiled by Horn et al. (2014), who crowdsourced 50 manual simplifications for each complex word. As in prior work, we evaluate performance with the following metrics: 1) Accuracy (Acc.) is the number of correct simplifications made (i.e., the system made the simplification and its substitution is found in the list of crowdsourced substitutions), divided by the total number of indicated complex words; 2) Changed (Ch.) is the percentage of indicated complex words that were replaced by the system (whether or not the replacement was correct).
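As an illustration of these two metrics, the following is a small sketch under assumed data structures (one entry per indicated complex word occurrence).

```python
def lexical_simplification_scores(system_outputs, gold_substitutions):
    """Sketch of the two LS metrics. `system_outputs` maps each indicated
    complex word occurrence to the system's substitution (or None if no
    replacement was made); `gold_substitutions` maps the same keys to the
    set of crowdsourced acceptable substitutions."""
    total = len(gold_substitutions)
    changed = sum(1 for sub in system_outputs.values() if sub is not None)
    accurate = sum(1 for key, sub in system_outputs.items()
                   if sub is not None and sub in gold_substitutions[key])
    return {"Accuracy": accurate / total, "Changed": changed / total}
```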
LS results are summarised in Tab. 5. Post-specialised vector spaces consistently yield a 5-6% gain in Accuracy compared to the respective distributional vectors and to embeddings specialised with the state-of-the-art ATTRACT-REPEL model. Similar to the DST evaluation, the improvements over ATTRACT-REPEL demonstrate the importance of specialising the vectors of the entire vocabulary and not only the vectors of words from the external constraints.

Conclusion and Future Work
We have presented a novel post-processing model, termed post-specialisation, that specialises word vectors for the full vocabulary of the input vector space. Previous post-processing specialisation models fine-tune word vectors only for words occurring in external lexical resources. In this work, we have demonstrated that the specialisation of the subspace of seen words can be leveraged to learn a mapping function which specialises vectors for all other words, unseen in the external resources. Our results across word similarity and downstream language understanding tasks show consistent improvements over the state-of-the-art specialisation method for all three test languages.
In future work, we plan to extend our approach to specialisation for asymmetric relations such as hypernymy or meronymy (Glavaš and Ponzetto, 2017; Nickel and Kiela, 2017; Vulić and Mrkšić, 2018). We will also investigate more sophisticated non-linear functions. The code is available at: https://github.com/cambridgeltl/post-specialisation/.