Earth Mover’s Distance Minimization for Unsupervised Bilingual Lexicon Induction

Cross-lingual natural language processing hinges on the premise that there exists invariance across languages. At the word level, researchers have identified such invariance in the word embedding semantic spaces of different languages. However, in order to connect the separate spaces, cross-lingual supervision encoded in parallel data is typically required. In this paper, we attempt to establish the cross-lingual connection without relying on any cross-lingual supervision. By viewing word embedding spaces as distributions, we propose to minimize their earth mover’s distance, a measure of divergence between distributions. We demonstrate the success of this approach on the unsupervised bilingual lexicon induction task. In addition, we reveal an interesting finding that the earth mover’s distance shows potential as a measure of language difference.


Introduction
Despite tremendous variation and diversity, languages are believed to share something in common. Indeed, this belief forms the underlying basis of computational approaches to cross-lingual transfer (Täckström et al., 2013, inter alia); otherwise it would be inconceivable for the transfer to generalize successfully.
Linguistic universals manifest themselves at various levels of linguistic units. At the word level, there is evidence that different languages represent concepts with similar structure (Youn et al., 2016). Interestingly, as computational models of word semantics, monolingual word embeddings also exhibit isomorphism across languages (Mikolov et al., 2013a). This finding opens up the possibility of using a simple transformation, e.g. a linear map, to connect separately trained word embeddings cross-lingually. Learning such a transformation typically calls for cross-lingual supervision from parallel data (Faruqui and Dyer, 2014; Lu et al., 2015; Smith et al., 2017).
In this paper, we ask the question: Can we uncover the transformation without any cross-lingual supervision? At first sight, this task appears formidable, as it would imply that a bilingual semantic space can be constructed by using monolingual corpora only. On the other hand, the existence of structural isomorphism across monolingual embedding spaces points to the feasibility of this task: The transformation exists right there only to be discovered by the right tool.
We propose such a tool to answer the above question in the affirmative. The key insight is to view embedding spaces as distributions, and the desired transformation should make the two distributions close. This naturally calls for a measure of distribution closeness, for which we introduce the earth mover's distance. Therefore, our task can be formulated as the minimization of the earth mover's distance between the transformed source embedding distribution and the target one with respect to the transformation. Importantly, the minimization is performed at the distribution level, and hence no word-level supervision is required.
We demonstrate that the earth mover's distance minimization successfully uncovers the transformation for cross-lingual connection, as evidenced by experiments on the bilingual lexicon induction task. In fact, as an unsupervised approach, its performance turns out to be highly competitive with supervised methods. Moreover, as an interesting byproduct, the earth mover's distance provides a distance measure that may quantify a facet of language difference.
Figure 1: An illustration of our earth mover's distance minimization formulation. The subplots on the left schematically visualize Chinese and English embeddings. Due to isomorphism, there exists a simple transformation G that aligns the two embedding spaces well, as shown on the right. We expect to find the transformation G by minimizing the earth mover's distance without the need for cross-lingual word-level supervision, because the earth mover's distance holistically measures the closeness between two sets of weighted points. It computes the minimal cost of transporting one set of points to the other, whose weights are indicated by the sizes of squares and dots. We show the transport scheme in the right subplot with arrows, which can be interpreted as word translations.

Aligning Isomorphic Embeddings
As discovered by previous work (Mikolov et al., 2013a), monolingual word embeddings exhibit isomorphism across languages, i.e., they appear similar in structure. However, as they are trained independently, the specific "orientation" of each embedding space is arbitrary, as illustrated in the left part of Figure 1. In order to connect the separate embedding spaces, we can try to transform the source embeddings so that they align well with target ones. Naturally, we need a measure for the quality of the alignment to guide our search for the transformation.
As we aim to eliminate the need for cross-lingual supervision from word translation pairs, the measure cannot be defined at the word level as in previous work (Mikolov et al., 2013a). Rather, it should quantify the difference between the entire distributions of embeddings. With this in mind, we find the earth mover's distance to be a suitable choice (Zhang et al., 2016b). Its workings are illustrated in the right part of Figure 1. We can think of target embeddings as piles of earth, and transformed source embeddings as holes to be filled. Then the earth mover's distance computes the minimal cost of moving the earth to fill the holes. Clearly, if the two sets of embeddings align well, the earth mover's distance will be small. Therefore, we can try to find the transformation that minimizes the earth mover's distance.
Another desirable feature of the earth mover's distance is that the computed transport scheme can be readily interpreted as translations. Moreover, this interpretation naturally handles multiple alternative translations. For example, the Chinese word "mao" can be translated to "cat" or "kitten", as shown in Figure 1.

The Form of the Transformation
The approximate isomorphism across embedding spaces has inspired researchers to use a simple form of transformation. For example, Mikolov et al. (2013a) chose to use a linear transformation, i.e. the transformation $G$ is parametrized by a matrix. Later, proposals for using an orthogonal transformation were supported empirically (Xing et al., 2015; Zhang et al., 2016c; Artetxe et al., 2016) and theoretically (Smith et al., 2017). Indeed, an orthogonal transformation has desirable properties in this setting. If $G$ is an orthogonal matrix that transforms the source embeddings into the target space, then its transpose (also its inverse) $G^\top$ performs the transformation in the reverse direction. In that case, any word embedding $a$ can be recovered by transforming back and forth because $G^\top G a = a$. Moreover, computing the cosine similarity between a source embedding $a$ and a target embedding $b$ will be independent of the semantic space in which the similarity is measured, because $b^\top G a / (\|G a\| \|b\|) = a^\top G^\top b / (\|a\| \|G^\top b\|)$. Therefore we are inclined to use an orthogonal transformation for our task.
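The two properties of an orthogonal map stated above can be checked numerically. The following is a minimal sketch, assuming NumPy and random stand-in vectors; the QR-based construction of the orthogonal matrix and the variable names are illustrative choices, not part of the original method.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50  # embedding dimension (the value reported in Appendix C.3)

# Draw a random orthogonal matrix via QR decomposition (illustrative choice).
G, _ = np.linalg.qr(rng.standard_normal((d, d)))

a = rng.standard_normal(d)  # stand-in for a source embedding
b = rng.standard_normal(d)  # stand-in for a target embedding


def cosine(x, y):
    """Cosine similarity between two vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


# Round trip: mapping to the target space and back recovers a.
round_trip = G.T @ (G @ a)

# Cosine similarity measured in the target space vs. in the source space.
cos_target_space = cosine(G @ a, b)
cos_source_space = cosine(a, G.T @ b)
```

Both similarities agree up to floating-point error because an orthogonal matrix preserves inner products and norms.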

The Earth Mover's Distance
The earth mover's distance (EMD) is a powerful tool widely used in computer vision and natural language processing (Rubner et al., 1998; Kusner et al., 2015; Huang et al., 2016; Zhang et al., 2016b,a). Mathematically speaking, the EMD defines a distance between probability distributions. In the discrete case, a probability distribution can be represented by a sum of Dirac delta functions. For a pair of discrete distributions $P_1 = \sum_i u_i \delta_{x_i}$ and $P_2 = \sum_j v_j \delta_{y_j}$, the EMD is defined as
$$\mathrm{EMD}(P_1, P_2) = \min_{T \in U(u, v)} \sum_i \sum_j T_{ij} \, c(x_i, y_j), \tag{1}$$
where $c(x_i, y_j)$ gives the ground distance between $x_i$ and $y_j$, and $U(u, v)$ is known as the transport polytope, defined as
$$U(u, v) = \Big\{ T \;\Big|\; T_{ij} \ge 0, \; \sum_j T_{ij} = u_i, \; \sum_i T_{ij} = v_j, \; \forall i, j \Big\}. \tag{2}$$
After solving the minimization program (1), the transport matrix $T$ stores information of the transport scheme: A non-zero $T_{ij}$ indicates the amount of probability mass transported from $y_j$ to $x_i$. For our task, this can be interpreted as evidence for word translation (Zhang et al., 2016b), as indicated by arrows in the right part of Figure 1.
The EMD is closely related to the Wasserstein distance in mathematics, defined as
$$W(P_1, P_2) = \inf_{\gamma \in \Gamma(P_1, P_2)} \mathbb{E}_{(x, y) \sim \gamma} \left[ c(x, y) \right], \tag{3}$$
where $\Gamma(P_1, P_2)$ denotes the set of all joint distributions $\gamma(x, y)$ with marginals $P_1$ and $P_2$ on the first and second factors respectively. As we can see, the Wasserstein distance generalizes the EMD to allow continuous distributions. In our context, we will use both terms interchangeably.
Figure 2: The Wasserstein GAN for unsupervised bilingual lexicon induction. The generator G transforms the source word embeddings into the target space. The critic D takes both sets of embeddings and tries to estimate their Wasserstein distance, and this information will be passed to the generator G during training to guide it towards minimizing the Wasserstein estimate.
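To make the EMD program concrete, here is a minimal sketch that solves it as a linear program over the transport polytope. It assumes SciPy's `linprog` with the HiGHS backend; the function name `emd` and the toy numbers are hypothetical, and a general-purpose LP solver is used purely for illustration rather than as the solver employed in the paper.

```python
import numpy as np
from scipy.optimize import linprog


def emd(u, v, C):
    """Minimize sum_ij T_ij * C_ij subject to T >= 0, row sums u, column sums v."""
    n, m = C.shape
    # T is flattened row-major, so entry (i, j) sits at index i * m + j.
    row_sums = np.kron(np.eye(n), np.ones((1, m)))   # sum_j T_ij = u_i
    col_sums = np.kron(np.ones((1, n)), np.eye(m))   # sum_i T_ij = v_j
    res = linprog(
        C.ravel(),
        A_eq=np.vstack([row_sums, col_sums]),
        b_eq=np.concatenate([u, v]),
        bounds=(0, None),
        method="highs",
    )
    if not res.success:
        raise RuntimeError(res.message)
    return res.fun, res.x.reshape(n, m)


# Toy usage with hypothetical numbers: one source point, two target points.
# The only feasible plan splits the mass evenly, so the cost is 0.5*1 + 0.5*3.
u = np.array([1.0])
v = np.array([0.5, 0.5])
C = np.array([[1.0, 3.0]])
cost, T = emd(u, v, C)
```

The non-zero entries of the returned matrix are what the text reads as translation evidence; a single source word with two non-zero entries corresponds to two alternative translations.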

Approaches
In our task, we are interested in a pair of distributions of word embeddings, one for the source language and the other for the target language. A source word embedding $w^S_s$ is a $d$-dimensional column vector that represents the $s$-th source word in the $V^S$-sized source language vocabulary. Its distribution is characterized by a positive vector of frequencies $f^S$ satisfying $\sum_{s=1}^{V^S} f^S_s = 1$, i.e., $P(w^S_s) = f^S_s$. Notations are similar for the target side. We assume the embeddings are normalized to have unit $L_2$ norm, which makes no difference to the result as we use cosine to measure semantic similarity.
Under this setting, we develop two approaches to our EMD minimization idea, called WGAN (Section 3.1) and EMDOT (Section 3.2) respectively.

Wasserstein GAN (WGAN)
Generative adversarial nets (GANs) were originally proposed to generate natural images (Goodfellow et al., 2014). They can generate sharp images if trained well, but they are notoriously difficult to train. Therefore, considerable research effort has been dedicated to more stable training (Radford et al., 2015; Salimans et al., 2016; Nowozin et al., 2016; Metz et al., 2016; Poole et al., 2016; Arjovsky and Bottou, 2017), and the recently proposed Wasserstein GAN (Arjovsky et al., 2017) is a promising technique along this line of research.
While the original GAN is formulated as an adversarial game (hence its name), the Wasserstein GAN can be directly understood as minimizing the Wasserstein distance (3). Figure 2 illustrates the concept in the context of our unsupervised bilingual lexicon induction task. The generator $G$ takes source word embeddings and transforms them, with the goal that the transformed source distribution $P_{G(S)}$ and the target distribution $P_T$ should be close as measured by the Wasserstein distance. The critic $D$ takes both transformed source word embeddings and target word embeddings and attempts to accurately estimate their Wasserstein distance, which will guide the generator during training. The overall objective is
$$\min_{G \in \mathbb{R}^{d \times d}} W(P_{G(S)}, P_T), \tag{4}$$
where $P_{G(S)}$ and $P_T$ are the distributions of transformed source word embeddings and target word embeddings. Here we do not impose the orthogonal constraint on $G$ to facilitate the use of a gradient-based optimizer. With the ground distance $c$ being the Euclidean distance $L_2$, the Kantorovich-Rubinstein duality (Villani, 2009) gives
$$W(P_{G(S)}, P_T) = \frac{1}{K} \sup_{\|f\|_L \le K} \left( \mathbb{E}_{y \sim P_T}[f(y)] - \mathbb{E}_{y \sim P_{G(S)}}[f(y)] \right), \tag{5}$$
where the supremum is over all $K$-Lipschitz functions $f$. As neural networks are universal function approximators (Hornik, 1991), we can attempt to approximate $f$ with a neural network, called the critic $D$, with weight clipping to ensure the function family is $K$-Lipschitz. Therefore the objective of the critic is
$$\max_D \; \mathbb{E}_{y \sim P_T}[f_D(y)] - \mathbb{E}_{x \sim P_S}[f_D(G x)]. \tag{6}$$
Conceptually, the critic $D$ assigns scores $f_D$ to real target embeddings and fake ones generated by the generator $G$. When the objective (6) is trained until optimality, the difference of the scores will approximate the Wasserstein distance up to a multiplicative constant. The generator $G$ then aims to minimize the approximate distance, which leads to
$$\min_{G \in \mathbb{R}^{d \times d}} \; -\mathbb{E}_{x \sim P_S}[f_D(G x)]. \tag{7}$$
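The alternating critic and generator updates described above can be sketched in plain NumPy as follows. This is an illustrative sketch under several assumptions that are not taken from the paper: plain SGD replaces RMSProp, minibatches are drawn uniformly rather than by word frequency, and the values of `n_critic`, `clip`, `hidden`, and `batch` as well as all function names are hypothetical. The critic is a one-hidden-layer ReLU network with weight clipping.

```python
import numpy as np


def init_critic(d, hidden, rng, clip):
    # Hypothetical one-hidden-layer ReLU critic; weights start inside the clipping box.
    return {
        "W1": rng.uniform(-clip, clip, size=(hidden, d)),
        "b1": np.zeros(hidden),
        "w2": rng.uniform(-clip, clip, size=hidden),
    }


def critic_mean_and_grads(p, Y):
    """Return mean_i f_D(y_i), its gradient w.r.t. the critic parameters,
    and the per-row input gradients d f_D(y_i) / d y_i."""
    Z = Y @ p["W1"].T + p["b1"]      # (n, hidden) pre-activations
    H = np.maximum(Z, 0.0)           # ReLU
    mean_score = float((H @ p["w2"]).mean())
    dZ = (Z > 0) * p["w2"]           # (n, hidden): d f_D(y_i) / d Z_i
    n = Y.shape[0]
    grads = {"W1": dZ.T @ Y / n, "b1": dZ.mean(axis=0), "w2": H.mean(axis=0)}
    return mean_score, grads, dZ @ p["W1"]


def train_wgan(X_src, Y_tgt, steps=200, n_critic=5, batch=64, hidden=32,
               lr_g=0.05, lr_d=0.0005, clip=0.01, seed=0):
    rng = np.random.default_rng(seed)
    d = X_src.shape[1]
    G, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal init
    p = init_critic(d, hidden, rng, clip)
    for _ in range(steps):
        for _ in range(n_critic):
            x = X_src[rng.integers(len(X_src), size=batch)]
            y = Y_tgt[rng.integers(len(Y_tgt), size=batch)]
            _, g_real, _ = critic_mean_and_grads(p, y)
            _, g_fake, _ = critic_mean_and_grads(p, x @ G.T)
            for k in p:
                # Gradient ascent on the critic objective, then weight clipping.
                p[k] = np.clip(p[k] + lr_d * (g_real[k] - g_fake[k]), -clip, clip)
        x = X_src[rng.integers(len(X_src), size=batch)]
        _, _, g_in = critic_mean_and_grads(p, x @ G.T)
        # The generator minimizes -E[f_D(Gx)], so G moves along +dE[f_D(Gx)]/dG.
        G = G + lr_g * (g_in.T @ x) / batch
    return G, p
```

Rows of `X_src` and `Y_tgt` are embeddings, so `x @ G.T` applies the generator to a minibatch. Because the sketch leaves `G` unconstrained, its distance from orthogonality can drift during training, which matches the unconstrained formulation in the text.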

EMD Minimization Under Orthogonal Transformation (EMDOT)
As an alternative to minimizing the Wasserstein distance by duality, the primal program with the orthogonal constraint can be formalized as
$$\min_{G \in O(d)} \mathrm{EMD}(P_{G(S)}, P_T), \tag{8}$$
where $O(d)$ is the orthogonal group in dimension $d$. Solving this minimization program exactly is NP-hard (Ding and Xu, 2016). Fortunately, an alternating minimization procedure is guaranteed to converge to a local minimum (Cohen and Guibas, 1999). Starting from an initial matrix $G^{(0)}$, we alternate between the following subprograms repeatedly:
$$T^{(k)} = \arg\min_{T \in U(f^S, f^T)} \sum_{s=1}^{V^S} \sum_{t=1}^{V^T} T_{st} \, c\big(G^{(k)} w^S_s, w^T_t\big), \tag{9}$$
$$G^{(k+1)} = \arg\min_{G \in O(d)} \sum_{s=1}^{V^S} \sum_{t=1}^{V^T} T^{(k)}_{st} \, c\big(G w^S_s, w^T_t\big), \tag{10}$$
where $f^S$ and $f^T$ are the source and target word frequency vectors. The minimization in (9) is the EMD program (1), with existing solvers available. For better scalability, we choose an approximate solver (Cuturi, 2013).
The minimization in (10) aims to find the transformation $G^{(k+1)}$ with the cross-lingual connection provided in $T^{(k)}$. This is exactly the supervised scenario, and previous works typically resort to gradient-based solvers (Mikolov et al., 2013a). But they can be cumbersome, especially as we impose the orthogonal constraint on $G$. Fortunately, if we choose the ground distance $c$ to be the squared Euclidean distance $L_2^2$, the program (10) is an extension of the orthogonal Procrustes problem (Schönemann, 1966), which admits a closed-form solution:
$$G^{(k+1)} = U V^\top, \tag{11}$$
where $U$ and $V$ are obtained from a singular value decomposition (SVD):
$$\sum_{s=1}^{V^S} \sum_{t=1}^{V^T} T^{(k)}_{st} \, w^T_t \, (w^S_s)^\top = U S V^\top. \tag{12}$$
Note that the SVD is efficient because it is performed on a $d \times d$ matrix, which is typically low-dimensional. Choosing $c = L_2^2$ is also motivated by its equivalence to the cosine dissimilarity, as proved in Appendix A.
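The alternating procedure can be sketched as below: an entropy-regularized (Sinkhorn) solver in the style of Cuturi (2013) stands in for subprogram (9), and an SVD-based orthogonal Procrustes update handles subprogram (10) with the squared Euclidean ground distance. This is a minimal sketch under assumptions: the regularization strength `reg`, the iteration counts, and all function names are hypothetical, and the dense cost computation is only suitable for small vocabularies.

```python
import numpy as np


def sinkhorn(u, v, C, reg=0.05, n_iter=200):
    """Entropy-regularized approximation to the EMD transport matrix."""
    K = np.exp(-C / reg)
    b = np.ones_like(v)
    for _ in range(n_iter):
        a = u / (K @ b)
        b = v / (K.T @ a)
    return a[:, None] * K * b[None, :]


def procrustes_step(T, W_src, W_tgt):
    """Closed-form orthogonal update: G = U V^T, where U and V come from the
    SVD of sum_st T_st * w_t^T (w_s^S)^T, a d x d matrix."""
    M = W_tgt.T @ T.T @ W_src
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt


def emdot(W_src, W_tgt, f_src, f_tgt, G0, n_rounds=10):
    """Alternate between the transport step (9) and the transformation step (10)."""
    G = G0
    for _ in range(n_rounds):
        X = W_src @ G.T  # rows are transformed source embeddings
        # Squared Euclidean ground distance between every source/target pair.
        C = ((X[:, None, :] - W_tgt[None, :, :]) ** 2).sum(axis=-1)
        T = sinkhorn(f_src, f_tgt, C)
        G = procrustes_step(T, W_src, W_tgt)
    return G, T
```

Rows of `W_src` and `W_tgt` are unit-norm embeddings and `f_src`, `f_tgt` are their weights. Every iterate returned by `procrustes_step` is orthogonal by construction, so the constraint never has to be enforced separately.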

Discussion
Starting from the idea of earth mover's distance minimization, we have developed two approaches towards the goal. They employ different optimization techniques, which in turn lead to different practical choices. For example, we choose $c = L_2^2$ for the EMDOT approach to obtain a closed-form solution to the subprogram (10); otherwise we would have to use gradient-based solvers. In contrast, the WGAN approach calls for $c = L_2$ because the Kantorovich-Rubinstein duality takes a simple form only in this case.
The EMDOT approach is attractive for several reasons: It is consistent for training and testing (the equivalence between the ground distance $c = L_2^2$ and cosine dissimilarity), compatible with the orthogonal constraint, mathematically sound (without much assumption and approximation), guaranteed to converge, almost hyperparameter free, and fast (the alternating subprograms have either effective approximate solvers or closed-form solutions). However, it suffers from a serious limitation: The alternating minimization procedure only converges to local minima, and they often turn out to be rather poor in practice.
Although the WGAN approach employs a stochastic-gradient-based optimizer (RMSProp) and does not guarantee global optima either, it works reasonably well in practice. It seems better at exploring the parameter space and finally landing in a neighborhood of a good optimum. Like other success stories of using stochastic-gradient-based optimizers to train neural networks, theoretical understanding of the behavior remains elusive.
We can enjoy the best of both worlds by incorporating the merits of both approaches: First the WGAN approach locates a good neighborhood of the parameter space, and then, starting from a reasonable initialization, the EMDOT approach efficiently explores the neighborhood to achieve enhanced performance.

Experiments
We first investigate the learning behavior of our WGAN approach, and then present experiments on the bilingual lexicon induction task, followed by a showcase of the earth mover's distance as a language distance measure. Details of the data sets and hyperparameters are described in Appendices B and C.

Learning Behavior of WGAN
We analyze the learning behavior of WGAN by looking at a typical training trajectory on Chinese-English. During training, we save 100 models, translate based on the nearest neighbor, and record their accuracy as the bilingual lexicon induction performance indicator at these training checkpoints. In theory, the critic objective (6) provides an estimate of the Wasserstein distance up to a multiplicative constant, and a smaller Wasserstein distance should mean the transformed source embedding space and the target embedding space align better, which should in turn result in a better bilingual lexicon. This is validated in Figure 3 by the correlation between Wasserstein estimate and accuracy. Therefore, the Wasserstein estimate can serve as an indicator for the bilingual lexicon induction performance, and we can save the model with the lowest value during training as the final model.
Figure 3 (caption fragment): The three curves all correlate well. The Wasserstein estimate is rescaled because its magnitude is irrelevant.
In Figure 3, we also plot the value of $\|G^\top G - I\|_F$, which indicates the degree of orthogonality of the transformation matrix $G$. Interestingly, this also correlates nicely with the other curves, even though our WGAN formulation does not encourage $G$ towards orthogonality. This finding confirms that a good transformation matrix is indeed close to orthogonality, and empirically justifies the orthogonal constraint for the EMDOT formulation.
Finally, we observe that the curves in Figure 3 are not very smooth. This means that although WGAN does well in exploring the parameter space and locating a reasonable transformation ...
Table 1 columns: method | # seeds | zh-en | es-en | it-en | ja-zh | tr-en
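The orthogonality indicator, the Frobenius norm of $G^\top G - I$, is straightforward to compute at each checkpoint. A minimal sketch, assuming NumPy; the function name is hypothetical.

```python
import numpy as np


def orthogonality_deviation(G):
    """Frobenius norm of G^T G - I; it is zero exactly when G is orthogonal."""
    d = G.shape[1]
    return float(np.linalg.norm(G.T @ G - np.eye(d), ord="fro"))


# An orthogonal matrix gives 0; scaling it by 2 in dimension 2 gives ||3I||_F = 3*sqrt(2).
dev_identity = orthogonality_deviation(np.eye(2))
dev_scaled = orthogonality_deviation(2.0 * np.eye(2))
```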

Bilingual Lexicon Induction Performance
We test the quality of the cross-lingual transformation by evaluating on the bilingual lexicon induction task for five language pairs: Chinese-English, Spanish-English, Italian-English, Japanese-Chinese, and Turkish-English.
As the EMD automatically handles multiple alternative translations, we follow Zhang et al. (2016b,a) in using the $F_1$ score as the preferred evaluation metric.
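When a source word may have several correct translations, an $F_1$ score can be computed over sets of translation pairs. The sketch below assumes a set-based definition of precision and recall over (source, target) pairs; the exact evaluation protocol is not spelled out in this section, so this definition, the function name, and the example word "gou" are assumptions for illustration.

```python
def f1_score(predicted, gold):
    """F1 over sets of (source_word, target_word) translation pairs."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: two of three predicted pairs are in the gold lexicon.
predicted = {("mao", "cat"), ("mao", "kitten"), ("gou", "cat")}
gold = {("mao", "cat"), ("mao", "kitten"), ("gou", "dog")}
score = f1_score(predicted, gold)  # precision = recall = 2/3, so F1 = 2/3
```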

Baselines
Our formulation is based on the isomorphism found across monolingual word embeddings. This idea has led to previous supervised methods:
• Translation matrix (TM) (Mikolov et al., 2013a): the pioneer of this type of method, using a linear transformation. We use a publicly available implementation.
• Isometric alignment (IA) (Zhang et al., 2016c): an extension of TM that augments its learning objective with the isometric (orthogonal) constraint. Although Zhang et al. (2016c) had subsequent steps for their POS tagging task, the method can be used for bilingual lexicon induction as well.
Although these methods need seed word translation pairs to train and are thus not directly comparable to our system, we nonetheless report their results using {50, 100, 200, 500} seeds for a ballpark range of expected performance on this task, and skip the set of 500 seeds when testing all systems. We ensure the same input embeddings for these methods and ours. Their seeds are obtained through Google Translate (details in Appendix B.2). We apply the EMD as a postprocessing step (Zhang et al., 2016b) to allow them to handle multiple alternative translations. This is also done for our WGAN approach, as it does not produce the transport scheme to interpret as translation due to its duality formulation.

Results
Table 1 shows the $F_1$ scores on the five language pairs. As we can see, WGAN successfully finds a transformation that produces reasonable word translations. On top of that, EMDOT considerably improves the performance, which indicates that EMDOT refines the transformation found by WGAN. Similar behavior across language pairs demonstrates the generality of our approaches, as they build on embeddings learned from monolingual corpora without language-specific engineering. The quality of the embeddings, thus, will have an important effect on the performance, which may explain the lower scores on Turkish-English, as this low-resource setting may lack sufficient data to produce reliable embeddings. Higher noise levels in the preprocessing and ground truth for this language pair (cf. the supplemental material), as well as the morphological richness of Turkish, may also be contributing factors to the relatively low scores.

                              zh-en   es-en   it-en   ja-zh   tr-en
EMD                           0.650   0.445   0.559   0.599   0.788
typology dissimilarity        0.467   0.342   0.259   0.433   0.541
geographical distance (km)    8161    1246    1464    2095    2854

Table 2: The earth mover's distance (EMD), typology dissimilarity, and geographical distance for Chinese-English, Spanish-English, Italian-English, Japanese-Chinese, and Turkish-English. The EMD shows correlation with both factors of linguistic difference.
Concerning the supervised methods TM and IA, they attain better performance with more supervision from seeds, as expected. For TM in particular, hundreds of seeds are needed for generalization, in line with the finding of Vulić and Korhonen (2016). Below that threshold, its performance drops dramatically, and this is when IA fares better with the orthogonal constraint. This indicates the importance of orthogonality when the seeds are few, or even zero as faced by our system. As the number of seeds increases, the performance of the supervised methods converges to a level comparable to our system.

The EMD as Language Distance
As our system minimizes the earth mover's distance between embeddings of two languages, we show here the final EMD can indicate the degree of difference between languages, serving as a proxy for language distance. Table 2 lists the EMD for the five language pairs considered in this paper, as well as their typology dissimilarity and geographical distance. The typology dissimilarity is computed from features in the WALS database (Dryer and Haspelmath, 2013). It is defined as one minus relative Hamming similarity, which is in turn defined as the number of agreeing features divided by the number of total features available for the language pair (Albu, 2006;Cysouw, 2013b). As a rough approximation, the geographical distance is measured by the distance between the capital cities of the countries where the considered languages are spoken (Eger et al., 2016).
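The typology dissimilarity defined above (one minus the share of agreeing features among those available for both languages) can be sketched as follows. The dictionary representation, the function name, and the feature names and values in the example are hypothetical placeholders, not actual WALS entries.

```python
def typology_dissimilarity(features_a, features_b):
    """One minus relative Hamming similarity over features available for both languages."""
    shared = [
        name for name in features_a
        if name in features_b
        and features_a[name] is not None
        and features_b[name] is not None
    ]
    if not shared:
        raise ValueError("no features available for both languages")
    agreeing = sum(features_a[name] == features_b[name] for name in shared)
    return 1.0 - agreeing / len(shared)


# Placeholder feature tables: three features are shared, two of them agree.
lang_a = {"feature_1": "x", "feature_2": "y", "feature_3": "z", "feature_4": None}
lang_b = {"feature_1": "x", "feature_2": "q", "feature_3": "z", "feature_5": "w"}
dissimilarity = typology_dissimilarity(lang_a, lang_b)  # 1 - 2/3
```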
The typology dissimilarity reflects genealogical influence on the divergence between languages, while the geographical distance indicates the effect of language contact. Both play important roles in shaping the languages we perceive today, and they also correlate with each other (Cysouw, 2013a). As we analyze Table 2, we find the EMD may be explained by both factors. Spanish-English and Italian-English are close both genealogically and geographically, and their EMD values are the lowest. English, Chinese, and Japanese belong to different language families, but the geographical proximity of the latter two enables intensive language contact, especially for the vocabularies, causing a relatively smaller EMD. Finally, Turkish and English are distant in both aspects, and the EMD between them is large. Note, however, that the large EMD may also be caused by the relatively poor quality of monolingual embeddings in this low-resource setting, and this should be a caveat of using the EMD to measure language distance.

Bilingual Lexicon Induction
Bilingual lexicon induction is a long-standing research task in cross-lingual natural language processing. Traditional methods build statistical models for monolingual word co-occurrence, and combine cross-lingual supervision to solve the task. As word alignment for parallel sentences can produce fairly good bilingual lexica (Och and Ney, 2003), these methods focus on non-parallel data with a seed lexicon as cross-lingual supervision (Rapp, 1999;Gaussier et al., 2004).
An exception that does not rely on cross-lingual supervision is the decipherment approach (Dou and Knight, 2012, 2013; Dou et al., 2015). It views the source language as a cipher for the target language, and solves a statistical model that attempts to decipher the source language.
There is a recent work that aims to remove the need for cross-lingual supervision (Cao et al., 2016). Similar to ours, the underlying idea is to match cross-lingually at the level of distribution rather than word. However, the distributions considered in that work are the hidden states of neural embedding models during the course of training. They are assumed to be Gaussian, so that the matching of distributions reduces to matching their means and variances, but this assumption is hard to justify and interpret. In contrast, our proposal does not make any assumption on the distributions, and directly matches the transformed source embedding distribution with the target distribution by minimizing their earth mover's distance.
Another attempt to learn a cross-lingual embedding transformation without supervision is that of Barone (2016). Architectures of generative adversarial nets and adversarial autoencoders (Makhzani et al., 2015) are experimented with, but the reported results are not positive. We tried the publicly available code on our data and obtained negative results as well. This outcome is likely caused by the training difficulty pointed out by Arjovsky and Bottou (2017), as traditional GAN training minimizes the Jensen-Shannon divergence between distributions, which can provide pathological gradients to the generator and hamper its learning. The use of Wasserstein GAN addresses this problem and allows our simple architecture to be trained successfully.

Language Distance
Quantifying language difference is an open question with on-going efforts that put forward better measures based on manually compiled data (Albu, 2006; Hammarström and O'Connor, 2013). Researchers in computational linguistics also try to contribute corpus-based approaches to this question. Parallel data is typically exploited, and ideas range from information-theoretic (Juola, 1998) and statistical (Mayer and Cysouw, 2012) to graph-based (Eger et al., 2016; Asgari and Mofrad, 2016). To our knowledge, the earth mover's distance is proposed for language distance for the first time, with the distinctive feature of relying on non-parallel data only.

The Earth Mover's Distance
First introduced into computer vision (Rubner et al., 1998), the earth mover's distance also finds application in natural language processing (Kusner et al., 2015; Huang et al., 2016), including bilingual lexicon induction (Zhang et al., 2016b,a). Zhang et al. (2016b) build upon bilingual word embeddings and apply the EMD program as a postprocessing step to automatically produce multiple alternative translations. Later, Zhang et al. (2016a) introduce the EMD into the training objective of bilingual word embeddings as a regularizer. These previous works rely on cross-lingual supervision, and do not approach the task from the view of embedding transformation, while our work formulates the task as EMD minimization to allow zero supervision.
Apart from the usage as a regularizer (Zhang et al., 2016a), the EMD can also play other roles in optimization programs designed for various applications (Cuturi and Doucet, 2014;Frogner et al., 2015;Montavon et al., 2016).

Conclusion and Future Work
In this work, we attack the problem of finding cross-lingual transformation between monolingual word embeddings in a purely unsupervised setting. We introduce earth mover's distance minimization to tackle this task by exploiting its distribution-level matching to sidestep the requirement for word-level cross-lingual supervision. Even though zero supervision poses a clear challenge, our system attains competitive performance with supervised methods for bilingual lexicon induction. In addition, the earth mover's distance provides a natural measure that may prove helpful for quantifying language difference.
We have implemented the earth mover's distance minimization framework from two paths, and their combination has worked well, but both can be potentially improved by recent advances in optimization techniques (Gulrajani et al., 2017;Ding and Xu, 2016). Future work should also evaluate the earth mover's distance between more languages to assess its quality as language distance.

A Proof
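The following proof shows that using squared Euclidean distance as the ground distance ($c = L_2^2$) is equivalent to using cosine dissimilarity when minimizing Equation (10). Since the embeddings have unit $L_2$ norm and $G$ is orthogonal, $\|G w^S_s\| = \|w^T_t\| = 1$, so
$$\|G w^S_s - w^T_t\|^2 = \|G w^S_s\|^2 + \|w^T_t\|^2 - 2 (w^T_t)^\top G w^S_s = 2 \left( 1 - \cos(G w^S_s, w^T_t) \right).$$
Hence the objective in (10) equals twice the transport-weighted cosine dissimilarity, and the two share the same minimizer.

... (Strassel and Tracey, 2016), and its English side is preprocessed by NLTK. The statistics of the preprocessed corpora are given in Table 3.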

B.2 Seed Word Translation Pairs
The seed word translation pairs for the translation matrix (TM) and isometric alignment (IA) approaches are obtained as follows. First, we ask Google Translate to translate the source language vocabulary. Then the target translations are queried again and translated back to the source language, and those that do not match the original source words are discarded. This helps to ensure the translation quality. Finally, the translations are discarded if they fall out of our target language vocabulary.
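The round-trip filtering steps above can be sketched as follows. The two translation callables are hypothetical stand-ins for queries to a translation service (no real API is assumed), and the example words other than "mao" are invented for illustration.

```python
def round_trip_seed_pairs(source_vocab, target_vocab, translate_fwd, translate_bwd):
    """Keep (source, target) pairs that survive back-translation and whose
    target word is in the target vocabulary."""
    target_vocab = set(target_vocab)
    pairs = []
    for src in source_vocab:
        tgt = translate_fwd(src)
        # Discard translations that do not map back to the original source word.
        if tgt is None or translate_bwd(tgt) != src:
            continue
        # Discard translations outside the target language vocabulary.
        if tgt not in target_vocab:
            continue
        pairs.append((src, tgt))
    return pairs


# Hypothetical dictionary-backed translators standing in for a translation service.
forward = {"mao": "cat", "gou": "dog", "yu": "fish"}.get
backward = {"cat": "mao", "dog": "quan", "fish": "yu"}.get
seeds = round_trip_seed_pairs(["mao", "gou", "yu"], ["cat", "dog"], forward, backward)
```

In this toy run, "gou" fails the round trip and "yu" passes it but translates to a word outside the target vocabulary, so only one pair remains.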

B.3 Ground Truth
As the ground truth bilingual lexicon for evaluation, we use Chinese-English Translation Lexicon Version 3.0 (LDC2002L27) for the Chinese-English pair. For Spanish-English and Italian-English, we access Open Multilingual WordNet through NLTK. For Japanese-Chinese, we use an in-house lexicon. For Turkish-English, we build a set of ground truth translation pairs in the same way as we obtain seed word translation pairs from Google Translate, described above.

C.1 WGAN
We parametrize the critic $D$ as a feed-forward neural network with one hidden layer of 500 neurons. The generator $G$ is initialized with a random orthogonal matrix. The expectations in the critic and generator objectives (6) and (7) are approximated by minibatches of 1024 samples. We train for $10^7$ minibatches. Most other hyperparameters follow from (Arjovsky et al., 2017) except the learning rates, for which larger values of 0.05 and 0.0005 are used for the generator and the critic respectively for faster convergence.

C.2 EMDOT
The approximate EMD solver (Cuturi, 2013) gives a fairly accurate approximation with orders of magnitude speedup. However, it makes the transport matrix $T$ no longer sparse. This is problematic, as we rely on interpreting a non-zero $T_{st}$ as evidence to translate the $s$-th source word to the $t$-th target word (Zhang et al., 2016b). We therefore retain the largest $pV^S$ elements of $T$, where $p$ encodes our belief of the expected number of translations a source word can have. We set $p = 1.3$. The alternating minimization procedure converges very fast. We run 10 iterations.
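The sparsification step can be sketched as below: keep the largest entries of the dense transport matrix and zero out the rest, so that the remaining non-zero entries can be read as translations. Rounding $pV^S$ to the nearest integer and the function name are assumptions made for this sketch.

```python
import numpy as np


def sparsify_transport(T, p=1.3):
    """Keep the largest round(p * V_S) entries of T and set all others to zero."""
    n_keep = min(int(round(p * T.shape[0])), T.size)
    sparse = np.zeros(T.size, dtype=T.dtype)
    if n_keep <= 0:
        return sparse.reshape(T.shape)
    flat = T.ravel()
    keep = np.argpartition(flat, -n_keep)[-n_keep:]  # indices of the largest entries
    sparse[keep] = flat[keep]
    return sparse.reshape(T.shape)


# Toy transport matrix with 2 source words: round(1.3 * 2) = 3 entries are kept.
T = np.array([[0.40, 0.05, 0.05],
              [0.02, 0.30, 0.18]])
T_sparse = sparsify_transport(T)
translations = np.argwhere(T_sparse > 0)  # (source index, target index) pairs
```

In the toy example, the second source word keeps two target entries, which corresponds to two alternative translations.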

C.3 Monolingual Word Embeddings
As input monolingual word embeddings to the tested systems, we train the CBOW model (Mikolov et al., 2013b) with default hyperparameters in word2vec. The embedding dimension $d$ is 50.