Adversarial Propagation and Zero-Shot Cross-Lingual Transfer of Word Vector Specialization

Semantic specialization is a process of fine-tuning pre-trained distributional word vectors using external lexical knowledge (e.g., WordNet) to accentuate a particular semantic relation in the specialized vector space. While post-processing specialization methods are applicable to arbitrary distributional vectors, they are limited to updating only the vectors of words occurring in external lexicons (i.e., seen words), leaving the vectors of all other words unchanged. We propose a novel approach to specializing the full distributional vocabulary. Our adversarial post-specialization method propagates the external lexical knowledge to the full distributional space. We exploit words seen in the resources as training examples for learning a global specialization function. This function is learned by combining a standard L2-distance loss with an adversarial loss: the adversarial component produces more realistic output vectors. We show the effectiveness and robustness of the proposed method across three languages and on three tasks: word similarity, dialog state tracking, and lexical simplification. We report consistent improvements over distributional word vectors and vectors specialized by other state-of-the-art specialization frameworks. Finally, we also propose a cross-lingual transfer method for zero-shot specialization which successfully specializes a full target distributional space without any lexical knowledge in the target language and without any bilingual data.


Introduction
Word representation learning is a mainstay of modern Natural Language Processing (NLP), and its usefulness has been proven across a wide spectrum of NLP applications (Collobert et al., 2011; Chen and Manning, 2014; Melamud et al., 2016b, inter alia). Standard distributional word vector models are grounded in the distributional hypothesis (Harris, 1954), that is, they leverage information about word co-occurrences in large text corpora (Mikolov et al., 2013; Pennington et al., 2014; Levy and Goldberg, 2014). This dependence on contextual signal results in a well-known tendency to conflate semantic similarity with other types of semantic association (Schwartz et al., 2015) in the induced word vector spaces. A common remedy is to move beyond purely unsupervised word representation learning, in a process referred to as semantic specialization or retrofitting. Specialization methods exploit lexical knowledge from external resources, such as WordNet (Fellbaum, 1998) or the Paraphrase Database (Ganitkevitch et al., 2013), to refine the semantic properties of pre-trained vectors and specialize the distributional spaces for a particular relation, e.g., synonymy (i.e., true similarity) (Faruqui et al., 2015) or hypernymy (Nickel and Kiela, 2017; Nguyen et al., 2017). The best-performing specialization models are deployed as post-processors of the vector space: distributional vectors are fine-tuned to satisfy linguistic constraints extracted from external resources to offer improved support to downstream NLP applications (Faruqui, 2016). Such models are versatile as they can be applied to arbitrary distributional spaces, but they have a major drawback: they locally update only vectors of words present in linguistic constraints (i.e., seen words), whereas vectors of all other (i.e., unseen) words remain intact (see Figure 1).

* Both authors equally contributed to this work.
Figure 1: High-level illustration of the adversarial post-specialization process and cross-lingual zero-shot specialization, described in detail in §2.

Prior work has recently proposed a model which, based on the updates of vectors of seen words, learns a global specialization function that can be applied to the large subspace of unseen words. This global method, termed post-specialization and implemented as a deep feed-forward network, effectively specializes all distributional vectors.
In this paper, we propose a new approach to post-specialization which addresses the following two research questions: a) Is it possible to use a more sophisticated learning approach to yield more realistic specialized vectors for the full vocabulary? b) Given that specialization methods inherently require a large number of constraints, is it possible to specialize distributional word vectors where such resources are scarce or non-existent? Our novel model learns the global specialization function by casting the feed-forward specialization network as a generator component of an adversarial architecture, see Figure 2. The corresponding discriminator component learns to discern original specialized vectors (produced by any local specialization model) from vectors produced by transforming distributional vectors with the feed-forward post-specialization network (i.e., the generator).
We show that the proposed adversarial model yields state-of-the-art performance on standard word similarity benchmarks, outperforming the earlier feed-forward post-specialization model. We further demonstrate the effectiveness of the proposed model in two downstream tasks: lexical text simplification and dialog state tracking. Finally, we demonstrate that, by coupling our adversarial specialization model with any unsupervised model for inducing bilingual vector spaces, such as the algorithm proposed by Conneau et al. (2018), we can successfully perform zero-shot language transfer of the specialization, that is, we can specialize distributional spaces of languages without any linguistic constraints in those languages, and without any bilingual data.

Methodology
The post-specialization procedure is a two-step process. First, a subspace of vectors for words observed in external resources is fine-tuned using any off-the-shelf specialization model, such as the original retrofitting model (Faruqui et al., 2015), counter-fitting (Mrkšić et al., 2016), dLCE (Nguyen et al., 2016), or state-of-the-art ATTRACT-REPEL (AR) specialization. We outline the initial specialization algorithms in §2.1. In the second step, the initial specialization is propagated to the entire vocabulary, including words not observed in the resources, relying on an adversarial architecture augmented with a distance loss. This adversarial post-specialization model, compatible with any specialization model, is described in §2.2.
Finally, in §2.3, we introduce a cross-lingual zero-shot specialization model which transfers the specialization to a target language without any lexical resources. An overview of the proposed methodology from this section is provided in Figure 1.

Initial Specialization
Linguistic Constraints. Adopting the nomenclature of prior work, post-processing models are generally guided by two broad sets of constraints: 1) ATTRACT constraints specify which words should be close to each other in the fine-tuned vector space (e.g., synonyms like graceful and amiable); 2) REPEL constraints describe which words should be pulled away from each other (e.g., antonyms like innocent and sinful). Earlier post-processors (Faruqui et al., 2015; Wieting et al., 2015) operate only with ATTRACT constraints, and are thus not suited to model both aspects contributing to the specialization process.
We first outline the state-of-the-art ATTRACT-REPEL specialization model, which leverages both sets of constraints. Here, we again stress two important aspects relevant to our post-specialization model: a) all initial specialization models fine-tune only representations for the subspace of words seen in the external constraints, while all other words remain unaffected by specialization; b) post-specialization is not tied to ATTRACT-REPEL in particular; it is applicable on top of any other post-processor.

Specialization of Seen Words. The key idea is to inject the knowledge from linguistic constraints into pre-trained distributional word vectors. Given a set $A$ of ATTRACT word pairs and a set $R$ of REPEL word pairs, each word pair $(v_l, v_r)$ from the vocabulary $V_s$ of seen words present in these sets can be represented as a vector pair $(x_l, x_r)$.
The optimization is driven by mini-batches of ATTRACT pairs $B_A$ (batch size $k_A$) and of REPEL pairs $B_R$ (batch size $k_R$). For both of these, two sets of negative example pairs of equal size are drawn from the $2(k_A + k_R)$ vectors occurring in $B_A$ and $B_R$. This defines the mini-batches of negative examples $T_A(B_A) = [(t_l^1, t_r^1), \dots, (t_l^{k_A}, t_r^{k_A})]$ and $T_R(B_R) = [(t_l^1, t_r^1), \dots, (t_l^{k_R}, t_r^{k_R})]$. The negative examples $t_l$ and $t_r$ for ATTRACT (or REPEL) pairs are the nearest (or farthest) neighbours by cosine similarity to $x_l$ and $x_r$, respectively. They ensure that the paired vectors for words in the constraints are closer to each other (or more distant for antonyms) than to their respective negative examples.
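The negative-example selection described above can be illustrated with a minimal NumPy sketch. It is a simplification under stated assumptions rather than the ATTRACT-REPEL implementation: negatives are drawn from a single mini-batch only (the text draws them from the vectors of both $B_A$ and $B_R$), a vector's own pair partner is excluded from the candidates, and the function names `unit_rows` and `mine_negatives` are hypothetical.

```python
import numpy as np

def unit_rows(m):
    """L2-normalise each row so that dot products equal cosine similarities."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def mine_negatives(left, right, attract=True):
    """Pick one in-batch negative example per vector.

    left, right: (k, d) arrays holding the paired vectors x_l and x_r (k >= 2).
    For ATTRACT pairs the negative is the most similar candidate,
    for REPEL pairs the least similar one. A vector itself and its
    pair partner are never selected (an assumption of this sketch).
    """
    k = left.shape[0]
    pool = unit_rows(np.vstack([left, right]))   # 2k candidate vectors
    sims = pool @ pool.T                         # pairwise cosine similarities
    masked = -np.inf if attract else np.inf
    idx = np.arange(2 * k)
    sims[idx, idx] = masked                      # exclude the vector itself
    sims[idx, (idx + k) % (2 * k)] = masked      # exclude its pair partner
    chosen = sims.argmax(axis=1) if attract else sims.argmin(axis=1)
    return pool[chosen[:k]], pool[chosen[k:]]    # t_l, t_r (unit-normalised)
```

Calling `mine_negatives(left, right, attract=False)` on a REPEL batch returns the farthest in-batch neighbours instead of the nearest ones.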
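The overall objective function consists of three terms. The first term pulls ATTRACT pairs together:

$$Att(B_A, T_A) = \sum_{i=1}^{k_A} \left[ \tau\left(\delta_A + x_l^i t_l^i - x_l^i x_r^i\right) + \tau\left(\delta_A + x_r^i t_r^i - x_l^i x_r^i\right) \right] \quad (1)$$

where the product of two vectors denotes their dot product, $\tau(x) = \max(0, x)$ is the standard rectifier (Nair and Hinton, 2010), and $\delta_A$ is the ATTRACT margin: it specifies the tolerance for the difference between the two distances (with the other pair member and with the negative example). The second term, $Rep(B_R, T_R)$, is similar but now pushes REPEL pairs away from each other, relying on the REPEL margin $\delta_R$:

$$Rep(B_R, T_R) = \sum_{i=1}^{k_R} \left[ \tau\left(\delta_R + x_l^i x_r^i - x_l^i t_l^i\right) + \tau\left(\delta_R + x_l^i x_r^i - x_r^i t_r^i\right) \right] \quad (2)$$

The final term is tasked to preserve the quality of the original vectors through $L_2$-regularization:

$$Reg(B_A, B_R) = \sum_{x_i \in V(B_A \cup B_R)} \lambda_P \left\lVert y_i - x_i \right\rVert_2 \quad (3)$$

where $y_i$ is the vector specialized from the original distributional vector $x_i$, and $\lambda_P$ is a regularization hyper-parameter. The optimizer finally minimizes the following objective:

$$L = Att(B_A, T_A) + Rep(B_R, T_R) + Reg(B_A, B_R)$$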
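The three terms of the objective can be sketched in NumPy as follows. This is an illustrative reading of the description above, not the released ATTRACT-REPEL code: it assumes row-wise aligned arrays of unit-length vectors (so dot products act as cosine similarities), and the helper names are hypothetical. The default margins and regularization weight are the values reported in the experimental setup.

```python
import numpy as np

def relu(z):
    """Standard rectifier tau(z) = max(0, z)."""
    return np.maximum(0.0, z)

def rowdot(a, b):
    """Row-wise dot product of two (k, d) arrays."""
    return np.sum(a * b, axis=1)

def attract_term(x_l, x_r, t_l, t_r, delta_a=0.6):
    """Hinge loss: each pair member should be closer to its partner
    than to its negative example by at least the margin delta_a."""
    pair = rowdot(x_l, x_r)
    return float(np.sum(relu(delta_a + rowdot(x_l, t_l) - pair)
                        + relu(delta_a + rowdot(x_r, t_r) - pair)))

def repel_term(x_l, x_r, t_l, t_r, delta_r=0.0):
    """Hinge loss: each pair member should be farther from its partner
    than from its negative example by at least the margin delta_r."""
    pair = rowdot(x_l, x_r)
    return float(np.sum(relu(delta_r + pair - rowdot(x_l, t_l))
                        + relu(delta_r + pair - rowdot(x_r, t_r))))

def regularization_term(x_orig, y_spec, lambda_p=1e-9):
    """L2 penalty keeping specialized vectors close to the original ones."""
    return float(lambda_p * np.sum(np.linalg.norm(y_spec - x_orig, axis=1)))

def ar_objective(attract_batch, repel_batch, x_orig, y_spec):
    """Sum of the three terms; each batch is a tuple (x_l, x_r, t_l, t_r)."""
    return (attract_term(*attract_batch)
            + repel_term(*repel_batch)
            + regularization_term(x_orig, y_spec))
```

In an actual optimizer these terms would be differentiated with respect to the vectors being specialized; the sketch only evaluates the loss.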

Adversarial Post-Specialization
Motivation. The AR method affects only a subset of the full vocabulary $V$, and consequently only a (small) subspace of the original space $X$ (see Figure 1). In particular, it specializes the embeddings $X_s$ corresponding to $V_s$, the vocabulary of words observed in the constraints, and leaves the embeddings $X_u$ corresponding to all other (unseen) words $V_u$ unchanged. Nevertheless, the perturbation undergone by the original observed embeddings can provide evidence about the general effects of specialization. In particular, it allows us to learn a global mapping function $f: X \in \mathbb{R}^d \to Y \in \mathbb{R}^d$ for $d$-dimensional vectors. The parameters of this function can be trained in a supervised fashion from pairs of original and initially specialized word embeddings $(x_s, y_s)$ of seen words, as illustrated in Figure 2. Subsequently, the mapping can be applied to distributional word vectors $x_u$ from the vocabulary of unseen words $V_u$ to predict $\hat{y}_u$, their specialized counterparts. This procedure, called post-specialization, effectively propagates the information stored in the external constraints to the entire word vector space. However, this mapping should not just model the inherent transformation, but also ensure that the resulting vector is 'natural'. In particular, assuming that word representations lie on a manifold, the mapping should return a point on that manifold. The intuition behind our formulation of the training objective is that: a) an $L_2$-distance loss can retrieve a faithful mapping, whereas b) an adversarial loss can prevent unrealistic outputs, as already proven in the visual domain (Pathak et al., 2016; Ledig et al., 2017; Odena et al., 2017).

Objective Function. The pairs of original and specialized embeddings for seen words allow us to train the global mapping function. In principle, this can be any differentiable parametrized function $G(x; \theta_G)$.
Prior work showed that non-linear functions ensure a better mapping than linear transformations, which seem inadequate to mimic the complex perturbations of the specialization process, guided by possibly millions of pairwise constraints. Our preliminary experiments corroborate this intuition. Thus, in this work we also opt for implementing $G(x; \theta_G)$ as a deep neural network. Each of the $l$ hidden layers of size $h$ non-linearly transforms its input. The output layer is a linear transformation into the prediction $\hat{y} \in \mathbb{R}^d$. The parameters $\theta_G$ are learned by minimizing the $L_2$ distance between the training pairs. In particular, the loss is a contrastive margin-based ranking loss with negative sampling (MM) as proposed by Weston et al. (2011, inter alia). The gist of this loss is that the first component increases the cosine similarity $\cos$ of predicted and initially specialized vectors of the same word up to a margin $\delta_{MM}$. On the other hand, the second component encourages the predicted vectors to distance themselves from $k$ random confounders. These are negative examples sampled uniformly from the batch $B$ excluding the current vector:

$$L_{MM} = \sum_{i=1}^{|B|} \sum_{j=1, j \neq i}^{k} \tau\left(\delta_{MM} - \cos(\hat{y}_i, y_i) + \cos(\hat{y}_i, y_j)\right) \quad (4)$$

where $\hat{y}_i = G(x_i; \theta_G)$. One of the original contributions of this work is combining the $L_2$ distance with an adversarial loss, resulting in an auxiliary-loss Generative Adversarial Network (AuxGAN) as shown in Figure 2. The role of the adversarial component, as mentioned above, is to 'soften' the mapping and guarantee realistic outputs from the target distribution.
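The max-margin loss with random confounders can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: `y_pred` stands for the generator outputs, `y_spec` for the AR-specialized targets of the same words, confounders are sampled with replacement, and the function names are hypothetical.

```python
import numpy as np

def cosine_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def max_margin_loss(y_pred, y_spec, k, delta_mm, rng):
    """Contrastive ranking loss over a batch of predicted/target pairs.

    For each item i, the similarity to its own target should exceed the
    similarity to each of k sampled confounders by at least delta_mm.
    """
    n = y_pred.shape[0]
    sims = cosine_matrix(y_pred, y_spec)
    loss = 0.0
    for i in range(n):
        candidates = np.delete(np.arange(n), i)          # exclude current vector
        confounders = rng.choice(candidates, size=k, replace=True)
        hinge = delta_mm - sims[i, i] + sims[i, confounders]
        loss += float(np.sum(np.maximum(0.0, hinge)))
    return loss
```

A call such as `max_margin_loss(y_pred, y_spec, k=25, delta_mm=0.6, rng=np.random.default_rng(0))` evaluates the loss for one batch; the margin value here is only a placeholder, since the text tunes $k$ and $\delta_{MM}$ on the validation set.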
The mapping can be considered a generator $G(x|\theta_G)$. On top of this, a discriminator $D(x|\theta_D)$, also implemented as a multi-layer neural net, tries to distinguish whether a vector is sampled from the predicted vectors or the AR-specialized vectors. Its output layer performs binary classification through softmax. The objective minimizes the loss $L_D$:

$$L_D = -\sum_{i} \log P(\text{specialized} = 0 \mid G(x_i; \theta_G); \theta_D) - \sum_{i} \log P(\text{specialized} = 1 \mid y_i; \theta_D) \quad (5)$$

In a two-player game (Goodfellow et al., 2014), the generator is trained to fool the discriminator by maximizing $\log(1 - P(0 \mid G(x_i; \theta_G); \theta_D))$. However, to avoid vanishing gradients of $G$ early on, the loss $L_G$ is reformulated by swapping the labels of Eq. (5) as follows:

$$L_G = -\sum_{i} \log P(\text{specialized} = 1 \mid G(x_i; \theta_G); \theta_D) - \sum_{i} \log P(\text{specialized} = 0 \mid y_i; \theta_D) \quad (6)$$

During the optimization procedure through stochastic gradient descent, we alternate among $s$ steps for $L_D$, one step for $L_G$, and one step for $L_{MM}$ to avoid the overfitting of $D$. The reason why $s \geq 1$ is that $D$ can be kept close to a minimum of its loss function by updating $G$ less frequently.
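To make the label swap and the alternating updates concrete, the sketch below evaluates the two cross-entropy losses from discriminator probabilities and lists the order of updates. It is a hedged illustration only: the inputs are assumed to be the discriminator's probabilities of the label `specialized = 1`, the small `eps` is added for numerical safety, and the names `discriminator_loss`, `generator_loss` and `update_schedule` are hypothetical.

```python
import numpy as np

def discriminator_loss(p_predicted, p_specialized, eps=1e-12):
    """L_D: label 0 for generator outputs, label 1 for AR-specialized vectors.

    p_predicted:   P(specialized = 1) assigned to generator outputs G(x_i).
    p_specialized: P(specialized = 1) assigned to AR-specialized vectors y_i.
    """
    p_predicted = np.asarray(p_predicted, dtype=float)
    p_specialized = np.asarray(p_specialized, dtype=float)
    return float(-np.sum(np.log(1.0 - p_predicted + eps))
                 - np.sum(np.log(p_specialized + eps)))

def generator_loss(p_predicted, p_specialized, eps=1e-12):
    """L_G: the same cross-entropy with the two labels swapped."""
    p_predicted = np.asarray(p_predicted, dtype=float)
    p_specialized = np.asarray(p_specialized, dtype=float)
    return float(-np.sum(np.log(p_predicted + eps))
                 - np.sum(np.log(1.0 - p_specialized + eps)))

def update_schedule(rounds, s):
    """Order of parameter updates: s steps on L_D, then one on L_G and one on L_MM."""
    order = []
    for _ in range(rounds):
        order.extend(["D"] * s)
        order.extend(["G", "MM"])
    return order
```

When the discriminator separates the two sources well (for example probabilities 0.1 and 0.9), `discriminator_loss` is small while `generator_loss` is large, which is the signal the generator uses to move its outputs towards the distribution of specialized vectors.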

Zero-shot Transfer to Other Languages
Once the AuxGAN has learned a global mapping function $G(x; \theta_G)$ in a resource-rich language, it can be directly applied to unseen words. In this work, we propose a method to additionally post-specialize the whole vocabulary $V_t$ of a resource-poor target language. We assume a real-world scenario where no target language constraints are available to specialize it directly.
What is more, we assume that no bilingual data or dictionaries are available either. Hence, we rely on unsupervised cross-lingual word embedding induction, and in particular on the method of Conneau et al. (2018). By virtue of these assumptions, there is no limitation to the range of potential target languages that can be specialized. Note that the proposed transfer method is equally applicable on top of other cross-lingual word embedding induction methods, although these may require more bilingual supervision to learn the cross-lingual vector space.

After learning the shared cross-lingual word embedding space in an unsupervised fashion (Conneau et al., 2018), the global post-specialization function learnt on the seen source language vectors is applied to the target language vectors, since they lie in the same shared space (see Figure 1 again). By virtue of the transfer, linguistic constraints in the source language can enhance the distributional vectors of target language vocabularies.

Conneau et al. (2018) learn a shared cross-lingual vector space as follows. They first learn a coarse initial mapping between two monolingual embedding spaces in two different languages through a GAN where the generator is a linear transformation with an orthogonal matrix $\hat{W}$. Its loss is identical to Eq. (5) and Eq. (6), but unlike our AuxGAN model it discriminates between embeddings drawn from the source language and the target language distributions. Using the shared space, they extract for each source vector the closest target vector according to Cross-Domain Similarity Local Scaling (CSLS), a distance metric designed to mitigate the hubness problem (Radovanović et al., 2010).
This creates a synthetic bilingual dictionary that allows the coarse initial mapping to be refined further. In particular, the optimal parameters for the linear mapping minimizing the $L_2$-distance between source-target pairs are provided by the closed-form Procrustes solution (Schönemann, 1966) based on singular value decomposition (SVD):

$$\hat{W} = \arg\min_{W} \lVert W A - B \rVert_F = U V^\top, \quad U \Sigma V^\top = \mathrm{SVD}(B A^\top)$$

where $\lVert \cdot \rVert_F$ is the Frobenius norm, and the columns of $A$ and $B$ hold the embeddings of the aligned dictionary pairs (the vectors to be mapped and their counterparts, respectively). After mapping the original target embeddings into the shared space with this method, we post-specialize them with the function outlined in §2.2, learnt on the source language. This yields the specialized target vectors $\hat{Y}_t = G(\hat{W} X_t; \theta_G)$.
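The two refinement ingredients mentioned above, CSLS-based dictionary extraction and the Procrustes solution, can be sketched in NumPy. This follows the commonly cited formulation of CSLS, $2\cos(x, y)$ minus the mean similarity of each vector to its $K$ nearest neighbours in the other language, and is an illustration under assumptions rather than the code of Conneau et al. (2018): embeddings are stored as rows, `k=10` is an assumed default, and the function names are hypothetical.

```python
import numpy as np

def csls_scores(src, tgt, k=10):
    """CSLS score for every (source, target) pair of row vectors."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    cos = src @ tgt.T                                    # (n_src, n_tgt)
    k_t = min(k, cos.shape[1])
    k_s = min(k, cos.shape[0])
    # Mean similarity of each source vector to its k nearest target vectors.
    r_src = np.sort(cos, axis=1)[:, -k_t:].mean(axis=1)
    # Mean similarity of each target vector to its k nearest source vectors.
    r_tgt = np.sort(cos, axis=0)[-k_s:, :].mean(axis=0)
    return 2.0 * cos - r_src[:, None] - r_tgt[None, :]

def induce_dictionary(src, tgt, k=10):
    """Index of the best-scoring target vector for each source vector."""
    return csls_scores(src, tgt, k).argmax(axis=1)

def procrustes(a, b):
    """Orthogonal W minimising ||a @ W.T - b||_F for row-aligned pairs.

    a: (n, d) vectors to be mapped; b: (n, d) their dictionary counterparts.
    """
    u, _, vt = np.linalg.svd(b.T @ a)
    return u @ vt
```

With a matrix `w = procrustes(a, b)` estimated on the dictionary pairs, all vectors of the mapped language can be moved into the shared space as `x @ w.T` before the learned specialization function is applied to them.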

Experimental Setup
Distributional Vectors. We estimate the robustness of adversarial post-specialization by experimenting with three widely used collections of distributional English vectors. 1) SGNS-W2 vectors are trained on the cleaned and tokenized Polyglot Wikipedia (Al-Rfou et al., 2013) using Skip-Gram with Negative Sampling (SGNS) (Mikolov et al., 2013) by Levy and Goldberg (2014) with bag-of-words contexts (window size is 2). 2) GLOVE-CC are GloVe vectors trained on the Common Crawl (Pennington et al., 2014). 3) FASTTEXT are vectors trained on Wikipedia with a SGNS variant that builds word vectors by summing the vectors of their constituent character n-grams. All vectors are 300-dimensional.

Constraints and Initial Specialization. We experiment with the sets of linguistic constraints used in prior work (Zhang et al., 2014; Ono et al., 2015). These constraints, extracted from WordNet (Fellbaum, 1998) and Roget's Thesaurus (Kipfer, 2009), comprise a total of 1,023,082 synonymy/ATTRACT word pairs and 380,873 antonymy/REPEL pairs.
Note that the sets of constraints cover only a fraction of the full distributional vocabulary, providing direct motivation for post-specialization methods which are able to specialize the full vocabulary. For instance, only 15.3% of the SGNS-W2 vocabulary words are seen words present in the constraints. [5] The constraints are initially injected into the distributional vector space (see Figure 1 again) using ATTRACT-REPEL, a state-of-the-art specialization model, for which we adopt the original suggested model setup. [6] Hyper-parameter values are set to: $\delta_A = 0.6$, $\delta_R = 0.0$, $\lambda_P = 10^{-9}$. The models are trained for 5 epochs with Adagrad (Duchi et al., 2011), with batch sizes set to $k_A = k_R = 50$, again as in the original work.
AuxGAN Setup and Hyper-Parameters. Both the generator and the discriminator are feed-forward nets with $l = 2$ hidden layers, each of size $h = 2048$, and LeakyReLU as non-linear activation (Maas et al., 2013). The dropout for the input and hidden layers of the generator is 0.2, and for the input layer of the discriminator it is 0.1. At evaluation time, the dropout noise is switched off in order to ensure a deterministic mapping. Moreover, we smooth the gold labels for prediction by a factor of 0.1 to make the model less vulnerable to adversarial examples (Szegedy et al., 2016). We train our model with SGD for 10 epochs of 1 million iterations each, feeding mini-batches of size 32. For each pair in a batch we generate 25 negative examples, and we set $s = 5$ (see §2.2). As a way to normalize the mini-batches (Salimans et al., 2016), these are constructed to contain exclusively either original or specialized vectors. At each epoch, the initial learning rate of 0.1 is decayed by a factor of 0.98, or by 0.5 if the score on the validation set (computed as the average cosine similarity between the predicted and AR-specialized embeddings) [7] has not increased. The hyper-parameters $k$ and $\delta_{MM}$ are tuned via grid search on the validation set.
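The learning-rate schedule described above can be written as a small helper. This is one possible reading, stated as an assumption: the validation score of the current epoch is compared with that of the previous epoch, and the factor 0.5 replaces (rather than compounds with) the factor 0.98 when the score has not increased. The function names are hypothetical.

```python
def next_learning_rate(lr, val_score, prev_score, decay=0.98, shrink=0.5):
    """Return the learning rate for the next epoch.

    The rate is multiplied by `decay` when the validation score improved
    over the previous epoch, and by `shrink` otherwise.
    """
    factor = decay if val_score > prev_score else shrink
    return lr * factor

def learning_rate_trace(val_scores, initial_lr=0.1):
    """Learning rates used in each epoch, given the per-epoch validation scores."""
    rates = [initial_lr]
    prev = float("-inf")
    for score in val_scores[:-1]:
        rates.append(next_learning_rate(rates[-1], score, prev))
        prev = score
    return rates
```

For example, `learning_rate_trace([0.70, 0.72, 0.72, 0.75])` starts at 0.1, applies the mild decay after the first two epochs, and halves the rate after the third epoch because the score stalled.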
Zero-Shot Specialization Setup. The GAN discriminator for learning a shared cross-lingual vector space (see §2.3) has hyper-parameters identical to the AuxGAN. The generator instead is a linear layer initialized as an identity matrix and enforced to lie on the manifold of orthogonal matrices during training (Cisse et al., 2017). No dropout is used. The unsupervised validation metric for early stopping is the cosine distance between dictionary pairs extracted with the CSLS similarity metric.

[5] The respective coverage for the 200K most frequent GLOVE-CC and FASTTEXT words is only 13.3% and 14.6%.
[6] https://github.com/nmrksic/attract-repel
[7] The score is computed as the average cosine similarity between the original and specialized embeddings.

Word Similarity
Evaluation Setup. We first evaluate adversarial post-specialization intrinsically, using two standard word similarity benchmarks for English: SimLex-999 and SimVerb-3500 (Gerz et al., 2016), a dataset containing human similarity ratings for 3,500 verb pairs. The evaluation measure is Spearman's ρ rank correlation between gold and predicted word pair similarity scores.
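The evaluation measure can be sketched as follows: predicted scores are cosine similarities between word vectors, and Spearman's ρ is computed as the Pearson correlation of average ranks. This is a minimal illustration with assumptions of its own: the vector space is a plain dictionary from words to NumPy arrays, pairs with a missing word are skipped, and the function names are hypothetical.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def spearman_rho(gold, predicted):
    """Spearman's rank correlation as Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(gold), average_ranks(predicted))[0, 1])

def evaluate_similarity(vectors, pairs):
    """Correlate gold ratings with cosine similarities of the word vectors.

    vectors: dict mapping a word to a 1-D NumPy array.
    pairs:   iterable of (word1, word2, gold_score) tuples.
    """
    gold, predicted = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:      # skipping is an assumption
            a, b = vectors[w1], vectors[w2]
            predicted.append(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
            gold.append(score)
    return spearman_rho(gold, predicted)
```

The same function can be applied unchanged to distributional, AR-specialized and post-specialized spaces, which is what makes the scores comparable across models.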
We evaluate word vectors in two settings, similar to prior work. a) In the synthetic DISJOINT setting, we discard all linguistic constraints that contain any of the words found in SimLex or SimVerb. This means that all test words from SimLex and SimVerb are effectively unseen words, and through this setting we are able to evaluate in vitro the model's ability to generalize the specialization function to unseen words. b) In the FULL setting we leverage all constraints. This is a standard "real-life" scenario where some test words do occur in the constraints, while the mapping is learned for the remaining words. We use the FULL setting in all subsequent downstream applications (§4.2).

We compare our model to ATTRACT-REPEL (AR), which specializes only the vectors of words occurring in the constraints. We also provide comparisons to the earlier post-specialization model, which specializes the full vocabulary but substitutes the AuxGAN architecture from §2.2 with a deep 5-layer feed-forward neural net, also based on the max-margin loss (see Eq. (4)), to learn the mapping function (POST-DFFN).
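The DISJOINT setting amounts to a simple filter over the constraint sets. The sketch below assumes constraints are stored as pairs of word strings and that matching is exact; both the representation and the function name are hypothetical.

```python
def disjoint_constraints(constraints, test_words):
    """Keep only the constraints that contain no word from the test sets.

    constraints: iterable of (word1, word2) pairs (ATTRACT or REPEL).
    test_words:  iterable of words occurring in the evaluation benchmarks.
    """
    held_out = set(test_words)
    return [(w1, w2) for w1, w2 in constraints
            if w1 not in held_out and w2 not in held_out]
```

Applying the filter to both the ATTRACT and the REPEL sets before the initial specialization guarantees that every benchmark word is an unseen word for the post-specialization model.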
Results and Analysis. The results are summarized in Table 1. The scores suggest that the proposed adversarial post-specialization model is universally useful and robust: we observe gains over input distributional word vectors for all three vector collections. The results in the DISJOINT setting illustrate the core limitation of the initial specialization/post-processing models and indicate the extent of improvement achieved when generalizing the specialization function to unseen words through adversarial post-specialization. Moreover, the scores suggest that the more sophisticated adversarial post-specialization method (AUXGAN) outperforms POST-DFFN across a large number of experimental runs, verifying its effectiveness. We observe only modest and inconsistent gains over ATTRACT-REPEL and POST-DFFN in the FULL setting. However, the explanation of this finding is straightforward: 99.2% of SimLex words and 99.9% of SimVerb words are present in the external constraints, making this an unrealistic evaluation scenario. The usefulness of the initial ATTRACT-REPEL specialization is less pronounced in real-life downstream applications in which such high coverage cannot be guaranteed, as shown in §4.2.

Downstream Tasks
We next evaluate the embedding spaces specialized with the AuxGAN method in two tasks in which discerning semantic similarity from semantic relatedness is crucial: lexical text simplification (LS) and dialog state tracking (DST).

Lexical Text Simplification
The goal of lexical simplification is to replace complex words (typically words that are used less often in language and are therefore less familiar to readers) with their simpler synonyms, without infringing the grammaticality and changing the meaning of the text. Replacing complex words with related words instead of true synonyms affects the original meaning (e.g., Ferrari pilot Vettel vs Ferrari airplane Vettel) and often yields ungrammatical text (e.g., they drink all pizzas).
LS Using Word Vectors. We use Light-LS, a publicly available LS tool based on word embeddings (Glavaš and Štajner, 2015). Light-LS generates and then ranks substitution candidates based on similarity in the input word vector space. The  quality of the space thus directly affects LS performance: by plugging any word vector space into Light-LS, we extrinsically evaluate that embedding space for LS. Furthermore, the better the embedding space captures true semantic similarity, the better the substitutions made by Light-LS.
Evaluation Setup. We use the standard LS dataset of Horn et al. (2014). It contains 500 sentences with indicated complex words (one word per sentence) that have to be substituted with simpler synonyms. For each word, simplifications were crowdsourced from 50 human annotators. Following prior work (Horn et al., 2014; Glavaš and Štajner, 2015), we evaluate the performance of Light-LS using a metric that quantifies both the quality and the frequency of word replacements: Accuracy (Acc), the number of correct simplifications made divided by the total number of complex words.
Results and Analysis. Scores for all three pre-trained vector spaces are shown in Table 2. Similar to the word similarity task, embedding spaces produced with post-specialization models outperform the vectors produced with AR and original distributional vectors. The gains are now more pronounced in the real-life FULL setup, as only 59.6% of all indicated complex words and substitution candidates from the LS dataset are covered in the external constraints. Adversarial post-specialization (AUXGAN) has a slight edge over the post-specialization with a simple feed-forward network (POST-DFFN) for FASTTEXT and SGNS-W2 embeddings, but not for GLOVE-CC vectors. In general, the fact that both post-specialization methods outperform ATTRACT-REPEL by a wide margin shows the importance of specializing the full word vector space for downstream NLP applications.

Dialog State Tracking
Finally, we evaluate the importance of full-vocabulary (adversarial) post-specialization in another language understanding task: dialog state tracking (DST) (Henderson et al., 2014; Williams et al., 2016), which is a standard task to measure the impact of specialization in prior work. A DST model is typically the first component of a dialog system pipeline (Young, 2010), tasked with capturing the user's goals and updating the dialog belief state at each dialog turn. Distinguishing similarity from relatedness is crucial for DST (e.g., a dialog system should not recommend an "expensive restaurant in the west" when asked for an "affordable pub in the north").
Evaluation Setup. To evaluate the effects of specialized word vectors on DST, following prior work we utilize the Neural Belief Tracker (NBT), a statistical DST model that makes inferences purely based on pre-trained word vectors. Again, as in prior work, the DST evaluation is based on the Wizard-of-Oz (WOZ) v2.0 dataset, comprising 1,200 dialogues split into training (600 dialogues), development (200), and test data (400). We report the standard DST metric: joint goal accuracy (JGA), the proportion of dialog turns where all the user's search goal constraints were correctly identified, computed as average over 5 NBT runs.

Results and Analysis. We show English DST performance in the FULL setting in Table 3. Only NBT performance with GLOVE-CC vectors is reported for brevity, as similar performance gains are observed with the other two pre-trained vector collections. The results confirm our findings established in the other two tasks: a) initial AR specialization of distributional vectors is useful, but b) it is crucial to specialize the full vocabulary for improved performance (e.g., 57% of all WOZ words are present in the constraints), and c) the more sophisticated AUXGAN model yields additional gains.
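Joint goal accuracy, as defined in the evaluation setup, can be computed with a few lines of Python. The sketch assumes that the predicted and gold search goals of each turn are represented as dictionaries from slot names to values; this representation and the function names are hypothetical and are not taken from the NBT code.

```python
def joint_goal_accuracy(predicted_turns, gold_turns):
    """Proportion of turns whose full set of goal constraints is predicted correctly.

    predicted_turns, gold_turns: equally long lists of dicts (slot -> value).
    A turn counts as correct only if every slot-value pair matches.
    """
    if len(predicted_turns) != len(gold_turns):
        raise ValueError("predictions and gold annotations must be aligned")
    correct = sum(1 for p, g in zip(predicted_turns, gold_turns) if p == g)
    return correct / len(gold_turns)

def mean_over_runs(scores):
    """Average of a metric over several training runs."""
    return sum(scores) / len(scores)
```

Because a single wrong slot makes the whole turn count as incorrect, the metric is strict, and confusing a related value (such as a different price range) with the requested one is penalized fully.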

Cross-Lingual Zero-Shot Specialization
Evaluation Setup. Large collections of linguistic constraints do not exist for many languages. Therefore, we test if the specialization knowledge from a resource-rich language (i.e., English) can be transferred to resource-lean target languages (see §2.3). We simulate resource-lean scenarios using two target languages: Italian (IT) and German (DE). We evaluate zero-shot specialized IT and DE FASTTEXT vectors, using English FASTTEXT vectors as the source, on the same three tasks as before. We report the same evaluation measures, using the following evaluation data: 1) IT and DE SimLex-999 datasets (Leviant and Reichart, 2015) for word similarity; 2) IT lexical simplification data (SIMPITIKI) (Tonelli et al., 2016); 3) IT and DE WOZ data for DST.
Results and Analysis. The results are summarized in Table 4. The gains over the original distributional vectors are substantial across all three tasks and for both languages. This finding indicates that the semantic content of distributional vectors can be enriched even for languages without any readily available lexical resources. The gap between performances of language transfer and the monolingual setting is explained by the noise introduced by the bilingual vector alignment and the different ways concepts are lexicalized across languages, as studied by semantic typology (Ponti et al., 2018). Nonetheless, in the long run, these transfer results hold promise to support the specialization of vector spaces even for resource-lean languages, and their applications.

Related Work
Vector Space Specialization. Specialization methods embed external information into vector spaces. Some of them integrate external linguistic constraints into distributional training and jointly optimize distributional and non-distributional objectives: they modify the prior or the regularization (Yu and Dredze, 2014; Kiela et al., 2015), or use a variant of the SGNS-style objective (Liu et al., 2015; Ono et al., 2015; Osborne et al., 2016).
Other models inject external knowledge from available lexical resources (e.g., WordNet, PPDB) into pre-trained word vectors as a post-processing step (Faruqui et al., 2015; Rothe and Schütze, 2015; Wieting et al., 2015; Nguyen et al., 2016; Mrkšić et al., 2016; Cotterell et al., 2016). They offer a portable, flexible, and lightweight approach to incorporating external knowledge into arbitrary vector spaces, outperforming less versatile joint models and yielding state-of-the-art results on language understanding tasks (Mrkšić et al., 2016; Kim et al., 2016). By design, these methods fine-tune only vectors of words seen in external resources. Prior work on post-specialization suggests that specializing the full vocabulary is beneficial for downstream applications. Compared to that work, we show that a more sophisticated adversarial post-specialization can yield further gains across different tasks and boost full-vocabulary specialization in resource-lean settings through cross-lingual transfer.

Generative Adversarial Networks. GANs were originally devised to generate images from input noise variables (Goodfellow et al., 2014). The generation process is typically conditioned on discrete labels or data from other modalities, such as text (Mirza and Osindero, 2014). Otherwise, the condition can take the form of real data in input rather than (or in addition to) noise: in this case, the generator parameters are better conceived as a mapping function. For instance, it can bridge between pixel-to-pixel or character-to-pixel (Reed et al., 2016) transformations.
The GAN objective can be mixed with more traditional loss functions: in these cases, apart from trying to fool the discriminator, the generator also minimizes the distance between input and target data (Pathak et al., 2016; Li and Wand, 2016; Ledig et al., 2017). The distance can be formulated as the mean squared error between the input and the target (Pathak et al., 2016), their feature maps (Li and Wand, 2016), both (Zhu et al., 2016), or a loss calculated on feature maps of a deep convolutional network (Ledig et al., 2017).
In the textual domain, adversarial models have been proven to support domain adaptation (Ganin et al., 2016) and language transfer by learning domain/language-invariant latent features. Adversarial training also powers unsupervised mapping between monolingual vector spaces to learn cross-lingual word embeddings (Zhang et al., 2017; Conneau et al., 2018). In this work, we show how to apply adversarial techniques to the problem of vector specialization, which has a substantial impact on language understanding tasks.

Conclusion and Future Work
We have presented adversarial post-specialization, a novel model supported by adversarial training which specializes word vectors for the full vocabulary of the input distributional vector space, including words unseen in external lexical resources. We have also introduced a method for zero-shot specialization of word vectors in languages without any external resources. The benefits of adversarial post-specialization and its zero-shot transfer have been demonstrated across three tasks (word similarity, lexical text simplification, and dialog state tracking) and for three languages.
In future work, we will explore more sophisticated adversarial models such as Cycle-GAN. Moreover, we will experiment with bootstrapping approaches to extract new lexical constraints from post-specialized embeddings. We also plan to extend the method to asymmetric relations (e.g., hypernymy) and to more target (resource-lean) languages. The code is available at https://github.com/cambridgeltl/adversarial-postspec.