Unsupervised Cross-lingual Transfer of Word Embedding Spaces

Cross-lingual transfer of word embeddings aims to establish semantic mappings between words in different languages by learning transformation functions over the corresponding word embedding spaces. Successfully solving this problem would benefit many downstream tasks, such as transferring text classification models from resource-rich languages (e.g., English) to low-resource languages. Supervised methods for this problem rely on the availability of cross-lingual supervision, either parallel corpora or bilingual lexicons, as the labeled data for training, which may not be available for many low-resource languages. This paper proposes an unsupervised learning approach that does not require any cross-lingual labeled data. Given two monolingual word embedding spaces for any language pair, our algorithm optimizes the transformation functions in both directions simultaneously based on distributional matching as well as minimizing the back-translation losses. We use a neural network implementation to calculate the Sinkhorn distance, a well-defined distributional similarity measure, and optimize our objective through back-propagation. Our evaluation on benchmark datasets for bilingual lexicon induction and cross-lingual word similarity prediction shows stronger or competitive performance of the proposed method compared to state-of-the-art supervised and unsupervised baselines over many language pairs.


Introduction
Word embeddings are well known to capture meaningful representations of words based on large text corpora (Mikolov et al., 2013; Pennington et al., 2014). Training word vectors on monolingual corpora is a common practice in various NLP tasks. However, how to establish cross-lingual semantic mappings among monolingual embeddings remains an open challenge, as the availability of resources and benchmarks is highly imbalanced across languages.
Recently, increasing research effort has been devoted to this challenge. Successful cross-lingual word mapping will benefit many cross-lingual learning tasks, such as transferring text classification models trained in resource-rich languages to low-resource languages. Downstream applications include word alignment, text classification, named entity recognition, dependency parsing, POS-tagging, and more (Søgaard et al., 2015). Most methods for cross-lingual transfer of word embeddings are based on supervised or semi-supervised learning, i.e., they require cross-lingual supervision such as human-annotated bilingual lexicons or parallel corpora (Lu et al., 2015; Smith et al., 2017; Artetxe et al., 2016). Such a requirement may not be met for many language pairs in the real world.
This paper proposes an unsupervised approach to the cross-lingual transfer of monolingual word embeddings, which requires zero cross-lingual supervision. The key idea is to optimize the mapping in both directions for each language pair (say A and B), such that word embeddings translated from language A to language B match the distribution of word embeddings in language B, and when translated back from B to A, the word embedding after the two transfer steps is maximally close to the original word embedding. A similar property holds for the other direction of the loop (from B to A and then from A back to B). Specifically, we use the Sinkhorn distance (Cuturi, 2013) to capture the distributional similarity between two sets of embeddings after transformation, which we found empirically superior to the KL-divergence (Zhang et al., 2017a) and the distance to the nearest neighbor (Artetxe et al., 2017; Conneau et al., 2017), with regard to both the quality of the learned transformation and the robustness under different training conditions.
Our novel contributions in the proposed work include:
• We propose an unsupervised learning framework which incorporates the Sinkhorn distance as a distributional similarity measure in the back-translation loss function.
• We use a neural network to optimize our model, in particular to implement the Sinkhorn distance, whose calculation is itself an optimization problem.
• Unlike previous models which only consider cross-lingual transformation in a single direction, our model jointly learns the word embedding transfer in both directions for each language pair.
• We present an intensive comparative evaluation where our model achieves state-of-the-art performance for many language pairs in cross-lingual tasks.

Related Work
We divide the related work into supervised and unsupervised categories. Representative methods in both categories are included in our comparative evaluation (Section 3.4). We also discuss related work on unsupervised domain transfer.
Supervised Methods: There is a rich body of supervised methods for learning cross-lingual transfer of word embeddings based on bilingual dictionaries (Mikolov et al., 2013; Faruqui and Dyer, 2014; Artetxe et al., 2016; Xing et al., 2015; Duong et al., 2016; Gouws and Søgaard, 2015), sentence-aligned corpora (Kočiskỳ et al., 2014; Hermann and Blunsom, 2014; Gouws et al., 2015), and document-aligned corpora (Vulić and Moens, 2016; Søgaard et al., 2015). The most relevant line of work is that by Mikolov et al. (2013), who showed that monolingual word embeddings are likely to share similar geometric properties across languages even though they are trained separately, and hence cross-lingual mappings can be captured by a linear transformation across embedding spaces. Several follow-up studies tried to improve the cross-lingual transformation in various ways (Faruqui and Dyer, 2014; Artetxe et al., 2016; Xing et al., 2015; Duong et al., 2016; Ammar et al., 2016; Zhang et al., 2016; Shigeto et al., 2015). Nevertheless, all these methods require bilingual lexicons for supervised learning. Vulić and Korhonen (2016) showed that 5,000 high-quality bilingual lexicon entries are sufficient for learning a reasonable cross-lingual mapping.
Unsupervised Methods have been studied to establish cross-lingual mappings without any human-annotated supervision. Earlier work relied on word occurrence information only (Rapp, 1995; Fung, 1996), while later efforts have considered more sophisticated statistics in addition (Haghighi et al., 2008). The main difficulty in unsupervised learning of cross-lingual mappings is the formulation of the objective function: how to measure the goodness of an induced mapping without any supervision is a non-trivial question. Cao et al. (2016) tried to match the mean and standard deviation of the embedded word vectors in two different languages after mapping the words in the source language to the target language. However, such an approach has been shown to be sub-optimal because the objective function only carries the first- and second-order statistics of the mapping. Artetxe et al. (2017) imposed an orthogonality constraint on their linear transformation model and minimized the distance between the transferred source-word embedding and its nearest neighbor in the target embedding space. Their method, however, requires a seed bilingual dictionary as labeled training data and hence is not fully unsupervised. Zhang et al. (2017a) and Barone (2016) adapted generative adversarial networks (GANs) (Goodfellow et al., 2014) to make the transferred embedding of each source-language word indistinguishable from its true translation in the target embedding space; related adversarial ideas have been explored for unsupervised image-to-image translation in computer vision (Zhu et al., 2017; Taigman et al., 2016; Yi et al., 2017). Among those, our work is mostly inspired by CycleGAN (Zhu et al., 2017), and we adapt their cycle-consistency loss over images into our back-translation loss. One key difference of our method from CycleGAN is that they used the training loss of an adversarial classifier as an indicator of the distributional distance; instead, we introduce the Sinkhorn distance in our objective function and demonstrate its superiority over a representative method using the adversarial loss (Zhang et al., 2017a).

Proposed Method
Our system takes two sets of monolingual word embeddings of dimension $d$ as input, which are trained separately on two languages. We denote them as $X = \{x_i\}_{i=1}^{n}$ and $Y = \{y_j\}_{j=1}^{m}$. During the training of the monolingual word embeddings for $X$ and $Y$, we also have access to the word frequencies, represented by vectors $r \in \mathbb{N}^n$ and $c \in \mathbb{N}^m$ for $X$ and $Y$, respectively. Specifically, $r_i$ is the frequency of word (embedding) $x_i$, and similarly $c_j$ of $y_j$. As illustrated in Figure 1, our model has two mappings: $G : X \rightarrow Y$ and $F : Y \rightarrow X$. We further denote the transferred embeddings from $X$ as $G(X) := \{G(x_i)\}_{i=1}^{n}$, and correspondingly for $F(Y)$.
In the unsupervised setting, the goal is to learn the mappings $G$ and $F$ without any paired word translations. To achieve this, our loss function consists of two parts: the Sinkhorn distance (Cuturi, 2013), for matching the distribution of the transferred embeddings to the target embedding distribution; and a back-translation loss, for preventing degenerate transformations.

Definition
The Sinkhorn distance is a recently proposed distance between probability distributions. We use it to measure the closeness between the distribution of the transferred embeddings $G(X)$ and that of the target embeddings $Y$. Although the vocabulary sizes of the two languages could be different, we are able to sample mini-batches of equal size from $G(X)$ and $Y$; therefore we assume $n = m$ in the following derivation.
To compute the Sinkhorn distance, we first compute a distance matrix $M^{(G)} \in \mathbb{R}^{n \times m}$ between $G(X)$ and $Y$, where $M^{(G)}_{ij}$ is the distance between $G(x_i)$ and $y_j$. The superscript on $M^{(G)}$ indicates that the distance depends on the parameterized transformation $G$. For instance, if we choose the Euclidean distance as the measure (see Section 3.1.3 for more discussion), we will have $M^{(G)}_{ij} = \|G(x_i) - y_j\|_2$. Given the distance matrix, the Sinkhorn distance between $P_{G(X)}$ and $P_Y$ is defined as:

$$d_{sh}(G) = \min_{P \in U_\alpha(r, c)} \langle P, M^{(G)} \rangle \quad (1)$$

where $\langle \cdot, \cdot \rangle$ is the Frobenius dot-product and $U_\alpha(r, c)$ is an entropy-constrained transport polytope, defined as

$$U_\alpha(r, c) = \left\{ P \in \mathbb{R}_+^{n \times m} \;\middle|\; P \mathbf{1}_m = r,\; P^\top \mathbf{1}_n = c,\; h(P) \ge h(r) + h(c) - \alpha \right\} \quad (2)$$

Note that $P$ is non-negative and the first two constraints make its element-wise sum be 1; therefore, $P$ can be seen as a probability distribution. The same applies to $r$ and $c$ once the frequencies are normalized. Here $h$ is the entropy function defined on any probability distribution, and $\alpha$ is a hyper-parameter to choose. Any probabilistic matrix $P \in U_\alpha(r, c)$ can be viewed as a joint probability of $(G(X), Y)$: the first two constraints ensure that $P$ has marginal distribution $P_{G(X)}$ on $G(X)$ and $P_Y$ on $Y$. We can also view $P_{ij}$ as the evidence for establishing a translation between word vector $x_i$ and word vector $y_j$.
An intuitive interpretation of Equation (1) is that we are trying to find the optimal transport probability $P$ under the entropy constraint such that the total cost of transporting from $G(X)$ to $Y$ is minimized.
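To make the transport interpretation concrete, the following NumPy sketch (a hypothetical toy setup with two-word vocabularies, not from the paper) compares two feasible transport plans: the independent joint distribution and a "hard" diagonal matching. Both satisfy the marginal constraints, but they incur very different total costs under a cost matrix that is zero on the diagonal.

```python
import numpy as np

# Toy marginals (uniform word frequencies) and a cost matrix that is
# zero when a word is matched to its own translation.
r = np.array([0.5, 0.5])
c = np.array([0.5, 0.5])
M = np.array([[0.0, 1.0],
              [1.0, 0.0]])

P_ind = np.outer(r, c)   # independent joint: maximal entropy plan
P_diag = np.diag(r)      # diagonal plan: matches word i to word i

# Both are valid transport plans: rows sum to r, columns sum to c.
assert np.allclose(P_ind.sum(axis=1), r) and np.allclose(P_ind.sum(axis=0), c)
assert np.allclose(P_diag.sum(axis=1), r) and np.allclose(P_diag.sum(axis=0), c)

cost_ind = np.sum(P_ind * M)    # Frobenius dot-product <P, M> = 0.5
cost_diag = np.sum(P_diag * M)  # = 0.0, the cheaper plan
```

The optimization in Equation (1) searches over all such plans (subject to the entropy constraint) for the one with minimal total cost.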

Computing the Sinkhorn Distance $d_{sh}(G)$
Cuturi (2013) showed that the optimal solution of Equation (1) has the form $P^* = \mathrm{diag}(u) K^{(G)} \mathrm{diag}(v)$, where $u$ and $v$ are non-negative vectors and $K^{(G)} := e^{-\lambda M^{(G)}}$; $\lambda$ is the Lagrange multiplier for the entropic constraint in (2), and each $\alpha$ in Equation (1) has one corresponding $\lambda$. The Sinkhorn distance can be efficiently computed by a matrix scaling algorithm; we present the pseudo-code in Algorithm 1. Note that the computation of $d_{sh}(G)$ only requires matrix-vector multiplications. Therefore, we can compute and back-propagate the gradient of $d_{sh}(G)$ with respect to the parameters of $G$ using standard deep learning libraries. We show our implementation details in Section 3.4 and the supplementary material.
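The matrix scaling iteration can be sketched in a few lines of NumPy. This is a minimal standalone version following Cuturi's (2013) algorithm, not the paper's released code; the function name and the fixed iteration count are our own choices (the paper uses $\lambda = 10$ and 20 iterations, per Section 3.4).

```python
import numpy as np

def sinkhorn_distance(M, r, c, lam=10.0, n_iter=20):
    """Approximate the Sinkhorn distance between two discrete
    distributions with marginals r and c, given cost matrix M,
    via the matrix scaling iteration of Cuturi (2013)."""
    K = np.exp(-lam * M)            # element-wise kernel e^{-lam * M}
    u = np.ones_like(r) / len(r)    # initial row scaling vector
    for _ in range(n_iter):
        v = c / (K.T @ u)           # scale columns to match marginal c
        u = r / (K @ v)             # scale rows to match marginal r
    P = np.diag(u) @ K @ np.diag(v) # near-optimal transport plan P*
    return np.sum(P * M)            # Frobenius dot-product <P, M>
```

Since each step is a matrix-vector product, the same loop written with autograd tensors (e.g. in PyTorch) is differentiable with respect to the cost matrix, and hence with respect to the parameters of $G$.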

Choice of the Distance Metric
In Section 3.1.1, we used the Euclidean distance between vector pairs to define $M^{(G)}$ and the Sinkhorn distance $d_{sh}(G)$. However, in our preliminary experiments, we found that the Euclidean distance between unnormalized vectors gave poor performance. Therefore, following common practice, we normalize all word embedding vectors to have unit L2 norm in the construction of $M^{(G)}$.
As pointed out in Theorem 1 of Cuturi (2013), $M^{(G)}$ must be a valid metric in order for $d_{sh}(G)$ to be a valid metric. For example, the commonly used cosine distance, defined as $\mathrm{CosDist}(a, b) = 1 - \cos(a, b)$, is not a valid metric because it does not satisfy the triangle inequality. Thus, for constructing $M^{(G)}$, we propose the square-root cosine distance (SqrtCosDist) below:

$$\mathrm{SqrtCosDist}(a, b) = \sqrt{2 - 2\cos(a, b)} = \|\hat{a} - \hat{b}\|_2$$

Obviously, the last term is the Euclidean distance between the normalized input vectors $\hat{a}$ and $\hat{b}$. Since the Euclidean distance is a valid metric, it follows that SqrtCosDist satisfies all the axioms of a valid metric.
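The identity above is easy to verify numerically. The sketch below (our own illustration; function names are hypothetical) computes SqrtCosDist directly from the cosine and checks it against the Euclidean distance of the L2-normalized vectors.

```python
import numpy as np

def sqrt_cos_dist(a, b):
    """Square-root cosine distance: sqrt(2 - 2*cos(a, b))."""
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    cos = np.clip(cos, -1.0, 1.0)  # guard against floating-point overshoot
    return np.sqrt(2.0 - 2.0 * cos)

def euclid_normalized(a, b):
    """Euclidean distance between the L2-normalized inputs."""
    a_hat = a / np.linalg.norm(a)
    b_hat = b / np.linalg.norm(b)
    return np.linalg.norm(a_hat - b_hat)
```

Because the two functions agree on every input pair, SqrtCosDist inherits the triangle inequality (and the other metric axioms) from the Euclidean distance.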

Objective Function
Given enough capacity, $G$ is capable of transferring $X$ to $Y$ under arbitrary word-to-word mappings. To ensure that we learn a meaningful translation, and also to regularize the search space of possible transformations, we enforce that a word embedding after the forward and the backward transformations should not diverge much from its original direction. We simply choose a back-translation loss based on cosine similarity:

$$L_{bt}(G, F) = \sum_{i=1}^{n} \left(1 - \cos(F(G(x_i)), x_i)\right) + \sum_{j=1}^{m} \left(1 - \cos(G(F(y_j)), y_j)\right)$$

where $\cos$ is the cosine similarity.
Putting everything together, we minimize the following objective function:

$$\min_{G, F}\; d_{sh}(G) + d_{sh}(F) + \beta\, L_{bt}(G, F) \quad (6)$$

where the hyper-parameter $\beta$ controls the relative weight of the last term against the first two terms. By definition, the computation of $d_{sh}(G)$ or $d_{sh}(F)$ involves another minimization problem, as shown in Equation (1). We solve it using the matrix scaling algorithm of Section 3.1.2, and treat $d_{sh}(G)$ as a deterministic and differentiable function of the parameters of $G$. The same holds for $d_{sh}(F)$ and $F$.

Wasserstein GAN Training for Good Initial Point
In preliminary experiments, we found that our objective (6) is sensitive to the initialization of the weights in $G$ and $F$ in the purely unsupervised setting: it requires a good initial setting of the parameters to avoid getting stuck in poor local minima. To address this sensitivity issue, we employ an approach similar to (Zhang et al., 2017b; Aldarmaki et al., 2018): we first use adversarial training to learn $G$ and $F$, and then use them as the initial point for training our full objective (6). More specifically, we minimize the optimal transport distance below:

$$d_{W}(G) = \min_{P \in U(r, c)} \langle P, M^{(G)} \rangle$$

where $U(r, c)$ is the transport polytope without the entropy constraint, defined as

$$U(r, c) = \left\{ P \in \mathbb{R}_+^{n \times m} \;\middle|\; P \mathbf{1}_m = r,\; P^\top \mathbf{1}_n = c \right\}$$
We optimize the distance above in its dual form through adversarial training, which is also known as Wasserstein GAN (WGAN) training (Arjovsky et al., 2017). We apply the optimization technique proposed by Gulrajani et al. (2017).
Although this first phase of adversarial training can be unstable, and its performance is lower than that obtained with the Sinkhorn distance, the adversarial training narrows down the search space of model parameters and boosts the training of our proposed model.

Implementation
We implement the transformations $G$ and $F$ as linear transformations whose input and output dimensions equal the word embedding dimension $d$. For all the experiments in the subsequent sections, the $\beta$ in (6) was set to 0.1. For the hyper-parameters of the Sinkhorn distance computation, we choose $\lambda = 10$ and run the matrix scaling algorithm for 20 iterations. Due to space constraints, a detailed implementation description is presented in the supplementary material. The code of our implementation is publicly available.

Experiments
We conducted an evaluation of our approach in comparison with state-of-the-art supervised and unsupervised methods on several evaluation benchmarks for bilingual lexicon induction (Task 1) and word similarity prediction (Task 2). We include our main results in this section and report the ablation study in the supplementary material.

Monolingual Word Embedding Data
All the methods evaluated in both tasks take monolingual word embeddings in each language as input. We use publicly available pre-trained word embeddings trained on Wikipedia articles: (1) a smaller set of word embeddings of dimension 50 trained on a comparable Wikipedia dump in five languages (Zhang et al., 2017a), and (2) a larger set of word embeddings of dimension 300 trained on Wikipedia dumps in 294 languages (Bojanowski et al., 2016). For convenience, we name the two sets WE-Z and WE-C, respectively.

Bilingual Lexicon Data
We need true translation pairs of words for evaluating methods on bilingual lexicon induction (Task 1). We followed previous studies and prepared the two datasets below.
LEX-Z: Zhang et al. (2017a) constructed bilingual lexicons from various resources. Since their ground-truth word pairs are not released, we followed their procedure, crawled bilingual dictionaries, and randomly separated them into training and testing sets of equal size. Note that our proposed method did not utilize the training set; it was only used by the supervised baseline methods described in Section 4.2. There are eight language pairs (order counted); the corresponding dataset statistics are summarized in Table 1. We use the WE-Z embeddings for this dataset.
LEX-C: This lexicon was constructed by Conneau et al. (2017) and contains more translation pairs than LEX-Z. They divided the pairs into training and testing sets. We ran our model and the baseline methods on 16 language pairs. For each language pair, the training set contains 5,000 unique query words and the testing set has 1,500 query words. We followed Conneau et al. (2017) and set the search space of candidate translations to be the 200,000 most frequent words in each target language. We use the WE-C embeddings for this dataset.

Bilingual Word Similarity Data
For bilingual word similarity prediction (Task 2), we need true labels for evaluation. Following Conneau et al. (2017), we used the SemEval 2017 competition dataset, where human annotators measured the cross-lingual similarity of nominal word pairs on a five-point Likert scale. This dataset contains word pairs across five languages: English (en), German (de), Spanish (es), Italian (it), and Farsi (fa). Each language pair has about 1,000 word pairs annotated with a real similarity score ranging from 0 to 4.

Baseline Methods
We evaluated the same set of supervised and unsupervised baselines for comparative evaluation. We fed all the supervised methods with the bilingual dictionaries in the training portions of the LEX-Z and LEX-C datasets, respectively.
For unsupervised baselines, we include the methods of Zhang et al. (2017a) and Conneau et al. (2017), whose source code is publicly available as provided by the authors.

Results in Bilingual Lexicon Induction (Task 1)
Bilingual lexicon induction is the task of inducing a translation in the target language for each query word in the source language. After the query word and the target-language words are represented in the same embedding space (or after our system maps the query word from the source embedding space to the target embedding space), the k nearest target words are retrieved based on their cosine similarity scores with respect to the query vector. If the k retrieved target words contain any valid translation according to the gold bilingual lexicon, the translation (retrieval) is considered successful. The fraction of correctly translated source words in the test set is defined as accuracy@k, which is the conventional metric in benchmark evaluations.
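The retrieval and scoring procedure just described can be sketched as follows (our own minimal NumPy version; the function name and the `gold` dictionary format are illustrative assumptions, not the paper's code).

```python
import numpy as np

def accuracy_at_k(query_vecs, target_vecs, gold, k=1):
    """Fraction of queries whose k nearest target words (by cosine
    similarity) contain a gold translation. `gold` maps each query
    index to a set of valid target-word indices."""
    # Normalize rows so the dot product equals cosine similarity.
    Q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    T = target_vecs / np.linalg.norm(target_vecs, axis=1, keepdims=True)
    sims = Q @ T.T                            # (n_queries, n_targets)
    topk = np.argsort(-sims, axis=1)[:, :k]   # indices of k nearest targets
    hits = sum(1 for i, row in enumerate(topk) if gold[i] & set(row))
    return hits / len(query_vecs)
```

In our setting, `query_vecs` would be the source embeddings after applying the learned mapping $G$, and `target_vecs` the monolingual embeddings of the target language.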
Table 2 shows the accuracy@1 for all the methods on LEX-Z in our evaluation. We can see that our method outperformed the other unsupervised baselines by a large margin on all eight language pairs. Compared with the supervised methods, our method is still competitive (the best or second-best scores on four out of eight language pairs), even though ours does not require cross-lingual supervision. We also notice the performance variance over different language pairs. Our method outperforms all the methods (supervised and unsupervised combined) on the English-Spanish (en-es) pair, perhaps because these two languages are most similar to each other, and because the monolingual word embeddings for this pair in the comparable corpus are better aligned than for the other language pairs. On the other hand, all the methods, including ours, perform worst on the English-Turkish (en-tr) pair. Another observation is the performance difference between the two directions of a language pair. For example, the performance of it-en is better than en-it for all methods in Table 2.
Part of the reason is that there are more unique English words than non-English words in the evaluation set. This would make the "xx-en" direction easier than "en-xx", because there are often multiple valid ground-truth English translations for each query in "xx", while the same may not hold for the opposite direction. Nevertheless, the relative performance of our method compared to others is quite robust over different language pairs and directions.

Results in Cross-lingual Word Similarity Prediction (Task 2)
In this task, systems predict cross-lingual similarity scores for word pairs, which are evaluated against the ground truth annotated by humans. Following the convention in benchmark evaluations for this task, we compute the Pearson correlation between the model-induced similarity scores and the human-annotated similarity scores over the testing word pairs for each language pair. A higher correlation with the ground truth indicates a better quality of the induced embeddings. All systems use the cosine similarity between the transformed embedding of each query and the word embedding of its paired translation as the predicted similarity score.
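The evaluation statistic is the standard Pearson correlation; for completeness, a minimal self-contained version (our own sketch, equivalent to `scipy.stats.pearsonr` on the same inputs) is:

```python
import numpy as np

def pearson(pred, gold):
    """Pearson correlation between predicted and human-annotated scores."""
    p = np.asarray(pred, dtype=float)
    g = np.asarray(gold, dtype=float)
    p = p - p.mean()  # center both score vectors
    g = g - g.mean()
    return float(p @ g / (np.linalg.norm(p) * np.linalg.norm(g)))
```

Here `pred` would hold the cosine similarities between each transformed query embedding and its paired translation's embedding, and `gold` the human similarity scores (0 to 4) for the same word pairs.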
Table 5 summarizes the performance of all the methods in cross-lingual word similarity prediction. We can see that the unsupervised methods, including ours, perform as well as the supervised methods, which is highly encouraging.

Conclusion
In this paper, we presented a novel method for cross-lingual transformation of monolingual embeddings in an unsupervised manner. By simultaneously optimizing the bi-directional mappings with respect to the Sinkhorn distances and back-translation losses on both ends, our model enjoys strong prediction power as well as robustness, with impressive performance on multiple evaluation benchmarks. For future work, we would like to extend this work to the semi-supervised setting where only small bilingual dictionaries are available.

Figure 1 :
Figure 1: The model takes monolingual word embeddings X and Y as input. G and F are embedding transfer functions parameterized by a neural network, represented by solid arrows. The dashed lines indicate the inputs to our objective losses, namely the Sinkhorn distance and the back-translation loss.
Thus, various work in the fields of unsupervised domain adaptation and unsupervised transfer learning can shed light on our problem. For example, He et al. (2016) proposed a semi-supervised method for machine translation that utilizes large monolingual corpora. Shen et al. (2017) used unsupervised learning to transfer sentences between different sentiments. Recent work in computer vision addresses the problem of image style transfer without any annotated training data.

Table 2 :
The accuracy@k scores of all methods in bilingual lexicon induction on LEX-Z. The best score for each language pair is bold-faced for the supervised and unsupervised categories, respectively. Language pair "A-B" means query words are in language A and the search space of word translations is in language B. Languages are paired among English (en), Turkish (tr), Spanish (es), Chinese (zh), and Italian (it).

Table 3 :
The accuracy@k scores of all methods in bilingual lexicon induction on LEX-C. The best score for each language pair is bold-faced for the supervised and unsupervised categories, respectively. Languages are paired among English (en), Bulgarian (bg), Catalan (ca), Swedish (sv), and Latvian (lv). "-" means that the model failed to converge to a reasonable local minimum during training, and hence the result is omitted from the table.