Gromov-Wasserstein Alignment of Word Embedding Spaces

Cross-lingual or cross-domain correspondences play key roles in tasks ranging from machine translation to transfer learning. Recently, purely unsupervised methods operating on monolingual embeddings have become effective alignment tools. Current state-of-the-art methods, however, involve multiple steps, including heuristic post-hoc refinement strategies. In this paper, we cast the correspondence problem directly as an optimal transport (OT) problem, building on the idea that word embeddings arise from metric recovery algorithms. Indeed, we exploit the Gromov-Wasserstein distance that measures how similarities between pairs of words relate across languages. We show that our OT objective can be estimated efficiently, requires little or no tuning, and results in performance comparable with the state-of-the-art in various unsupervised word translation tasks.


Introduction
Many key linguistic tasks, within and across languages or domains, including machine translation, rely on learning cross-lingual correspondences between words or other semantic units. While the associated alignment problem could be solved with access to large amounts of parallel data, broader applicability relies on the ability to do so with largely mono-lingual data, from Part-of-Speech (POS) tagging (Zhang et al., 2016) and dependency parsing (Guo et al., 2015) to machine translation. The key subtask of bilingual lexical induction, for example, while long standing as a problem (Fung, 1995; Rapp, 1995, 1999), has been actively pursued recently (Artetxe et al., 2016; Zhang et al., 2017a; Conneau et al., 2018).
Current methods for learning cross-domain correspondences at the word level rely on distributed representations of words, building on the observation that mono-lingual word embeddings exhibit similar geometric properties across languages (Mikolov et al., 2013). While most early work assumed some, albeit minimal, amount of parallel data (Mikolov et al., 2013; Dinu et al., 2014; Zhang et al., 2016), fully-unsupervised methods have recently been shown to perform on par with their supervised counterparts (Conneau et al., 2018; Artetxe et al., 2018). While successful, the mappings arise from multiple steps of processing, requiring either careful initial guesses or post-mapping refinements, including mitigating the effect of frequent words on neighborhoods. The associated adversarial training schemes can also be challenging to tune properly (Artetxe et al., 2018).
In this paper, we propose a direct optimization approach to solving correspondences based on recent generalizations of optimal transport (OT). OT is a general mathematical toolbox used to evaluate correspondence-based distances and establish mappings between probability distributions, including discrete distributions such as point-sets. However, the nature of mono-lingual word embeddings renders the classic formulation of OT inapplicable to our setting. Indeed, word embeddings are estimated primarily in a relational manner to the extent that the algorithms are naturally interpreted as metric recovery methods (Hashimoto et al., 2016). In such settings, previous work has sought to bypass this lack of registration by jointly optimizing over a matching and an orthogonal mapping (Rangarajan et al., 1997; Zhang et al., 2017b). Due to the focus on distances rather than points, we instead adopt a relational OT formulation based on the Gromov-Wasserstein distance that measures how distances between pairs of words are mapped across languages. We show that the resulting mapping admits an efficient solution and requires little or no tuning.
In summary, we make the following contributions:
• We propose the use of the Gromov-Wasserstein distance to learn correspondences between word embedding spaces in a fully-unsupervised manner, leading to a theoretically-motivated optimization problem that can be solved efficiently, robustly, in a single step, and requires no post-processing or heuristic adjustments.
• To scale up to large vocabularies we realize an extended mapping to words not part of the original optimization problem.
• We show that the proposed approach performs on par with state-of-the-art neural network based methods on benchmark word translation tasks, while requiring a fraction of the computational cost and/or hyperparameter tuning.

Problem Formulation
In the unsupervised bilingual lexical induction problem we consider two languages with vocabularies $V_x$ and $V_y$, represented by word embeddings $X = \{x^{(i)}\}_{i=1}^{n} \subset \mathbb{R}^{d_x}$ and $Y = \{y^{(j)}\}_{j=1}^{m} \subset \mathbb{R}^{d_y}$, respectively, where $x^{(i)}$ is the embedding of word $w^x_i$ and $y^{(j)}$ that of word $w^y_j$. For simplicity, we let $m = n$ and $d_x = d_y$, although our methods carry over to the general case with little or no modification. Our goal is to learn an alignment between these two sets of words without any parallel data, i.e., we learn to relate $x^{(i)} \leftrightarrow y^{(j)}$ with the implication that $w^x_i$ translates to $w^y_j$. As background, we begin by discussing the problem of learning an explicit map between embeddings in the supervised scenario. The associated training procedure will later be used for extending unsupervised alignments (Section 3.2).

Supervised Maps: Procrustes
In the supervised setting, we learn a map $T : X \to Y$ such that $T(x^{(i)}) \approx y^{(j)}$ whenever $w^y_j$ is a translation of $w^x_i$. Let $X$ and $Y$ be the matrices whose columns are vectors $x^{(i)}$ and $y^{(j)}$, respectively. Then we can find $T$ by solving

$$\min_{T \in \mathcal{F}} \|X - T(Y)\|_F^2 \tag{1}$$

where $\|\cdot\|_F$ is the Frobenius norm, $\|A\|_F = \sqrt{\sum_{i,j} |a_{ij}|^2}$. Naturally, both the difficulty of finding $T$ and the quality of the resulting alignment depend on the choice of space $\mathcal{F}$. A classic approach constrains $T$ to be an orthonormal matrix, i.e., a rotation or reflection, resulting in the orthogonal Procrustes problem

$$\min_{P \in O(n)} \|X - PY\|_F^2 \tag{2}$$

where $O(n) = \{P \in \mathbb{R}^{n \times n} \mid P^\top P = I\}$.
One key advantage of this formulation is that it has a closed-form solution in terms of a singular value decomposition (SVD), whereas for most other choices of constraint set $\mathcal{F}$ it does not. Given an SVD $U \Sigma V^\top$ of $XY^\top$, the solution to problem (2) is $P^* = UV^\top$ (Schönemann, 1966). Besides the obvious computational advantage, constraining the mapping between spaces to be orthonormal is justified in the context of word embedding alignment because orthogonal maps preserve angles (and thus distances), which is often the only information used by downstream tasks (e.g., for nearest neighbor search) that rely on word embeddings. Smith et al. (2017) further show that orthogonality is required for self-consistency of linear transformations between vector spaces.
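As an illustration of the closed-form solution, the following is a minimal NumPy sketch, not the authors' implementation. It assumes embeddings are stored column-wise and already paired (column i of X matches column i of Y); the function name `orthogonal_procrustes` and the toy data are hypothetical.

```python
import numpy as np

def orthogonal_procrustes(X, Y):
    """Return the orthogonal P minimizing ||X - P Y||_F.

    X, Y: (d, n) arrays whose columns are matched embedding pairs.
    """
    # SVD of the d x d matrix X Y^T = U S V^T; the minimizer is U V^T.
    U, _, Vt = np.linalg.svd(X @ Y.T)
    return U @ Vt

# Toy check: hide a random orthogonal map and recover it.
rng = np.random.default_rng(0)
d, n = 5, 50
Y = rng.standard_normal((d, n))
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix
X = Q @ Y
P = orthogonal_procrustes(X, Y)
recovery_error = np.linalg.norm(P - Q)
```

In this noiseless toy setting the hidden map is recovered up to floating-point error; with real embeddings the residual is of course non-zero.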
Clearly, the Procrustes approach only solves the supervised version of the problem as it requires a known correspondence between the columns of X and Y. Steps beyond this constraint include using small amounts of parallel data (Zhang et al., 2016) or an unsupervised technique as the initial step to generate pseudo-parallel data (Conneau et al., 2018) before solving for P.

Unsupervised Maps: Optimal Transport
Optimal transport formalizes the problem of finding a minimum cost mapping between two point sets, viewed as discrete distributions. Specifically, we assume two empirical distributions over embeddings, e.g.,

$$\mu = \sum_{i=1}^{n} p_i \delta_{x^{(i)}}, \qquad \nu = \sum_{j=1}^{m} q_j \delta_{y^{(j)}},$$

where $p$ and $q$ are vectors of probability weights associated with each point set. In our case, we usually consider uniform weights, e.g., $p_i = 1/n$ and $q_j = 1/m$, although if additional information were provided (such as in the form of word frequencies), those could be naturally incorporated via $p$ and $q$ (see discussion at the end of Section 3). We find a transportation map $T$ realizing

$$\inf_{T} \left\{ \int c(x, T(x)) \, d\mu(x) \;\middle|\; T_{\#}\mu = \nu \right\},$$

where the cost $c(x, T(x))$ is typically just $\|x - T(x)\|$ and $T_{\#}\mu = \nu$ implies that the source points must exactly map to the targets. However, such a map need not exist in general and we instead follow Kantorovich's relaxed formulation. In this case, the set of transportation plans is a polytope:

$$\Pi(p, q) = \{\Gamma \in \mathbb{R}_{+}^{n \times m} \mid \Gamma \mathbf{1}_m = p, \; \Gamma^\top \mathbf{1}_n = q\}.$$

The cost function is given as a matrix $C \in \mathbb{R}^{n \times m}$, e.g., $C_{ij} = \|x^{(i)} - y^{(j)}\|$. The total cost incurred by $\Gamma$ is $\langle \Gamma, C \rangle := \sum_{ij} \Gamma_{ij} C_{ij}$. Thus, the discrete optimal transport (DOT) problem consists of finding a plan $\Gamma$ that solves

$$\min_{\Gamma \in \Pi(p, q)} \langle \Gamma, C \rangle. \tag{5}$$

Problem (5) is a linear program, and thus can be solved exactly in $O(n^3 \log n)$ with interior point methods. However, regularizing the objective leads to more efficient optimization and often better empirical results. The most common such regularization, popularized by Cuturi (2013), involves adding an entropy penalization:

$$\min_{\Gamma \in \Pi(p, q)} \langle \Gamma, C \rangle - \lambda H(\Gamma),$$

where $H(\Gamma) = -\sum_{ij} \Gamma_{ij} \log \Gamma_{ij}$. The solution of this strictly convex optimization problem has the form $\Gamma^* = \mathrm{diag}(a) \, K \, \mathrm{diag}(b)$, with $K = e^{-C/\lambda}$ (element-wise), and can be obtained efficiently via the Sinkhorn-Knopp algorithm, a matrix-scaling procedure which iteratively computes

$$a \leftarrow p \oslash Kb, \qquad b \leftarrow q \oslash K^\top a,$$

where $\oslash$ denotes entry-wise division. The derivation of these updates is immediate from the form of $\Gamma^*$ above, combined with the marginal constraints $\Gamma \mathbf{1}_m = p$, $\Gamma^\top \mathbf{1}_n = q$ (Peyré and Cuturi, 2018).
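The Sinkhorn-Knopp scaling described above can be sketched in a few lines of NumPy. This is a minimal illustration under assumed toy inputs (random 3-dimensional points, λ = 1, a fixed iteration count), not a tuned or numerically stabilized solver; the name `sinkhorn` and its parameters are hypothetical.

```python
import numpy as np

def sinkhorn(C, p, q, lam, n_iter=1000):
    """Entropy-regularized OT plan via Sinkhorn-Knopp matrix scaling."""
    K = np.exp(-C / lam)              # Gibbs kernel, element-wise
    b = np.ones_like(q)
    for _ in range(n_iter):
        a = p / (K @ b)               # rescale rows to match p
        b = q / (K.T @ a)             # rescale columns to match q
    return a[:, None] * K * b[None, :]    # diag(a) K diag(b)

# Toy problem: two small point sets with uniform weights.
rng = np.random.default_rng(0)
n, m = 6, 8
x = rng.standard_normal((n, 3))
y = rng.standard_normal((m, 3))
C = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)  # C_ij = ||x_i - y_j||
p = np.full(n, 1.0 / n)
q = np.full(m, 1.0 / m)
G = sinkhorn(C, p, q, lam=1.0)
```

For much smaller values of λ the kernel underflows, so practical implementations typically work in the log domain or rescale the cost first.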
Although simple, efficient and theoretically motivated, a direct application of discrete OT for unsupervised word translation is not appropriate. One reason is that the mono-lingual embeddings are estimated in a relative manner, leaving, e.g., an overall rotation unspecified. Such degrees of freedom can dramatically change the entries of the cost matrix $C_{ij} = \|x^{(i)} - y^{(j)}\|$ and the resulting transport map. One possible solution is to simultaneously learn an optimal coupling and an orthogonal transformation (Zhang et al., 2017b). The transport problem is then solved iteratively, using $C_{ij} = \|x^{(i)} - P y^{(j)}\|$, where $P$ is in turn chosen to minimize the transport cost (via Procrustes). While promising, the resulting iterative approach is sensitive to initialization, perhaps explaining why Zhang et al. (2017b) used an adversarially learned mapping as the initial step. The computational cost can also be prohibitive (Artetxe et al., 2018), though this could be remedied with additional development.
We adopt a theoretically well-founded generalization of optimal transport for pairs of points (their distances), thus in line with how the embeddings are estimated in the first place. We explain the approach in detail in the next Section.

Transporting across unaligned spaces
In this section we introduce the Gromov-Wasserstein distance, describe an optimization algorithm for it, and discuss how to extend the approach to out-of-sample vectors.

The Gromov-Wasserstein Distance
The classic optimal transport requires a distance between vectors across the two domains. Such a metric may not be available, for example, when the sample sets to be matched do not belong to the same metric space (e.g., different dimension). The Gromov-Wasserstein distance (Mémoli, 2011) generalizes optimal transport by comparing the metric spaces directly instead of samples across the spaces. In other words, this framework operates on distances between pairs of points calculated within each domain and measures how these distances compare to those in the other domain. Thus, it requires a weaker but easy to define notion of distance between distances, and operates on pairs of points, turning the problem from a linear to a quadratic one.
Formally, in its discrete version, this framework considers two measure spaces expressed in terms of within-domain similarity matrices $(C, p)$ and $(C', q)$ and a loss function defined between similarity pairs: $L : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_{+}$. In this framework, $L(C_{ik}, C'_{jl})$ can also be understood as the cost of "matching" $i$ to $j$ and $k$ to $l$.
All the relevant values of $L(\cdot, \cdot)$ can be put in a 4-th order tensor $\mathbf{L} \in \mathbb{R}^{N_1 \times N_1 \times N_2 \times N_2}$, where $\mathbf{L}_{ijkl} = L(C_{ik}, C'_{jl})$. As before, we seek a coupling $\Gamma$ specifying how much mass to transfer between each pair of points from the two spaces. The Gromov-Wasserstein problem is then defined as solving

$$\mathrm{GW}(C, C', p, q) = \min_{\Gamma \in \Pi(p, q)} \sum_{i,j,k,l} \mathbf{L}_{ijkl} \, \Gamma_{ij} \Gamma_{kl}. \tag{8}$$

Compared to problem (5), this version is substantially harder since the objective is now not only non-linear, but non-convex too.¹ In addition, it requires operating on a fourth-order tensor, which would be prohibitive in most settings. Surprisingly, this problem can be optimized efficiently with first-order methods, whereby each iteration involves solving a traditional optimal transport problem (Peyré et al., 2016). Furthermore, for suitable choices of loss function $L$, Peyré et al. (2016) show that instead of the $O(N_1^2 N_2^2)$ complexity implied by a naive fourth-order tensor product, this computation reduces to $O(N_1^2 N_2 + N_1 N_2^2)$ cost. Their approach consists of solving (8) by projected gradient descent, which yields iterations that involve projecting onto $\Pi(p, q)$ a pseudo-cost matrix of the form

$$\hat{C}_{\Gamma}(C, C') = C_{xy} - h_1(C) \, \Gamma \, h_2(C')^\top,$$

where

$$C_{xy} = f_1(C) \, p \, \mathbf{1}_{N_2}^\top + \mathbf{1}_{N_1} \, q^\top f_2(C')^\top$$

and $f_1, f_2, h_1, h_2$ are functions that depend on the loss $L$. We provide an explicit algorithm for the case $L = L_2$ at the end of this section.

Figure 1: The Gromov-Wasserstein distance is well suited for the task of cross-lingual alignment because it relies on relational rather than positional similarities to infer correspondences across domains. Computing it requires two intra-domain similarity (or equivalently cost) matrices (left & center), and it produces an optimal coupling of source and target points with minimal discrepancy cost (right).
1 In fact, the discrete (Monge-type) formulation of the problem is essentially an instance of the well-known (and NP-hard) quadratic assignment problem (QAP).
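To make the complexity reduction concrete, the sketch below compares the naive fourth-order tensor contraction with the factored pseudo-cost for the squared loss. It assumes $L(a, b) = (a - b)^2$, for which the decomposition of Peyré et al. (2016) takes $f_1(a) = a^2$, $f_2(b) = b^2$, $h_1(a) = a$ and $h_2(b) = 2b$; the function names and the random toy matrices are hypothetical, and the identity relies on the coupling having marginals p and q.

```python
import numpy as np

def pseudo_cost_l2(C1, C2, G, p, q):
    """Factored pseudo-cost for L(a, b) = (a - b)^2; never forms the 4th-order tensor."""
    const = np.outer((C1 ** 2) @ p, np.ones_like(q)) \
          + np.outer(np.ones_like(p), (C2 ** 2) @ q)
    return const - 2.0 * C1 @ G @ C2.T

def pseudo_cost_naive(C1, C2, G):
    """Reference: sum_{k,l} (C1[i,k] - C2[j,l])^2 * G[k,l] via the full tensor."""
    L = (C1[:, None, :, None] - C2[None, :, None, :]) ** 2   # L[i, j, k, l]
    return np.einsum("ijkl,kl->ij", L, G)

rng = np.random.default_rng(0)
n1, n2 = 5, 7
A = rng.random((n1, n1)); C1 = (A + A.T) / 2      # symmetric toy cost matrices
B = rng.random((n2, n2)); C2 = (B + B.T) / 2
p = rng.random(n1); p /= p.sum()
q = rng.random(n2); q /= q.sum()
G = np.outer(p, q)                                # a valid coupling of p and q

fast = pseudo_cost_l2(C1, C2, G, p, q)
slow = pseudo_cost_naive(C1, C2, G)
max_gap = np.abs(fast - slow).max()
```

The factored form costs two matrix products per evaluation, which is what makes the projected-gradient iterations affordable for vocabularies of a few thousand words.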
Once we have solved (8), the optimal transport coupling $\Gamma^*$ provides an explicit (soft) matching between source and target samples, which for the problem of interest can be interpreted as a probabilistic translation: for every pair of words $(w^x_i, w^y_j)$, the value $\Gamma^*_{ij}$ provides a likelihood that these two words are translations of each other. This itself is enough to translate, and we show in the experiments section that $\Gamma^*$ by itself, without any further post-processing, provides high-quality translations. This stands in sharp contrast to mapping-based methods, which rely on nearest-neighbor computation to infer translations, and thus become prone to hub-word effects which have to be mitigated with heuristic post-processing techniques such as Inverted Softmax (Smith et al., 2017) and Cross-Domain Similarity Local Scaling (CSLS) (Conneau et al., 2018). The transportation coupling $\Gamma$, being normalized by construction, requires no such artifacts.
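Reading translations off a coupling can be illustrated as follows. This is a small sketch with a hand-made 3 × 3 coupling and hypothetical word lists; taking the row-wise argmax is one simple decoding rule assumed here for illustration.

```python
import numpy as np

def translate_from_coupling(G, src_words, tgt_words):
    """Map each source word to the target word receiving most of its mass."""
    best = G.argmax(axis=1)
    return {src_words[i]: tgt_words[j] for i, j in enumerate(best)}

# Hand-made coupling: rows are source words, columns are target words.
G = np.array([[0.30, 0.02, 0.01],
              [0.03, 0.05, 0.26],
              [0.00, 0.27, 0.06]])
src_words = ["cat", "dog", "house"]
tgt_words = ["gato", "casa", "perro"]
translations = translate_from_coupling(G, src_words, tgt_words)

# Row-normalizing gives, for each source word, a distribution over targets.
row_probs = G / G.sum(axis=1, keepdims=True)
```

Because every row and column of a coupling carries a fixed amount of mass, no single target word can absorb most of the source vocabulary, which is the hubness issue that nearest-neighbor retrieval has to correct for.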
The Gromov-Wasserstein problem (8) possesses various desirable theoretical properties, including the fact that for a suitable choice of the loss function it is indeed a distance:

Theorem 3.1 (Mémoli, 2011). With the choice $L = L_2$, $\mathrm{GW}^{1/2}$ is a distance on the space of metric measure spaces.

Solving problem (8) therefore yields a fascinating accompanying notion: the Gromov-Wasserstein distance between languages, a measure of semantic discrepancy purely based on the relational characterization of their word embeddings. Owing to Theorem 3.1, such values can be interpreted as distances, so that, e.g., the triangle inequality holds among them. In Section 4.4 we compare various languages in terms of their GW distance.
Finally, we note that whenever word frequency counts are available, they can be used for $p$ and $q$. If they are not, but words are sorted according to occurrence (as they often are in popular off-the-shelf embedding formats), one can estimate rank-probabilities such as Zipf power laws, which are known to accurately model multiple languages (Piantadosi, 2014). In order to provide a fair comparison to previous work, throughout our experiments we use uniform distributions so as to avoid providing our method with additional information not available to others.
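As a rough illustration of the rank-based alternative, the sketch below builds a probability vector from word ranks alone. The exponent `s = 1.0` and the function name `zipf_weights` are assumptions made for the example; the experiments in this paper use uniform weights instead.

```python
import numpy as np

def zipf_weights(n, s=1.0):
    """Probability vector with p_i proportional to rank_i^(-s), ranks starting at 1."""
    ranks = np.arange(1, n + 1, dtype=float)
    w = ranks ** (-s)
    return w / w.sum()

p = zipf_weights(5)      # weights for the 5 most frequent words
uniform = np.full(5, 1.0 / 5)
```

Either vector can be passed as the marginal of the coupling; the Zipf variant simply assigns more mass to frequent words.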

Scaling Up
While the pure Gromov-Wasserstein approach leads to high quality solutions, it is best suited to small-to-moderate vocabulary sizes, since its optimization becomes prohibitive for very large problems. For such settings, we propose a two-step approach in which we first match a subset of the vocabulary via the optimal coupling, after which we learn an orthogonal mapping through a modified Procrustes problem. Formally, suppose we solve problem (8) for reduced matrices $X_{1:k}$ and $Y_{1:k}$ consisting of the first $k$ columns of $X$ and $Y$, respectively, and let $\Gamma^*$ be the optimal coupling. We seek an orthogonal matrix that best recovers the barycentric mapping implied by $\Gamma^*$. Namely, we seek to find $P$ which solves:

$$\min_{P \in O(n)} \|X_{1:k} \Gamma^* - P \, Y_{1:k}\|_F^2. \tag{10}$$

Just as for problem (2), it is easy to show that this Procrustes-type problem has a closed-form solution in terms of a singular value decomposition. Namely, the solution to (10) is $P^* = UV^\top$, where $U \Sigma V^\top = X_{1:k} \Gamma^* Y_{1:k}^\top$. After obtaining this projection, we can immediately map the rest of the embeddings via $\hat{y}^{(j)} = P^* y^{(j)}$.
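The second step can be sketched as below, assuming column-wise embeddings and a coupling that is already available. To keep the example checkable, the coupling here is a hand-built scaled permutation matrix rather than the output of a Gromov-Wasserstein solver; `procrustes_from_coupling` and the toy data are hypothetical.

```python
import numpy as np

def procrustes_from_coupling(X, Y, G):
    """Orthogonal P minimizing ||X G - P Y||_F, from the SVD of X G Y^T."""
    U, _, Vt = np.linalg.svd(X @ G @ Y.T)
    return U @ Vt

rng = np.random.default_rng(0)
d, k = 4, 30
Y = rng.standard_normal((d, k))
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # hidden orthogonal map
sigma = rng.permutation(k)                        # hidden word correspondence
X = Q @ Y[:, sigma]                               # x_i = Q y_sigma(i)

# Idealized coupling: all mass of source word i goes to target word sigma(i).
G = np.zeros((k, k))
G[np.arange(k), sigma] = 1.0 / k

P = procrustes_from_coupling(X, Y, G)
Y_mapped = P @ Y                                  # works for out-of-sample columns too
```

Once P is estimated from the k matched words, the same matrix product maps any remaining target embeddings into the source space, where translations can be retrieved by nearest-neighbor search.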
We point out that this two-step procedure resembles that of Conneau et al. (2018). Both ultimately produce an orthogonal mapping obtained by solving a Procrustes problem, but they differ in the way they produce pseudo-matches to allow for such a second step: while their approach relies on an adversarially-learned transformation, we use an explicit optimization problem.

Algorithm 1 Gromov-Wasserstein Computation for Word Embedding Alignment
Input: Source and target embeddings X, Y. Regularization λ. Probability vectors p, q.
// Compute intra-language similarities
C_s ← cos(X, X), …
We end this section by discussing parameter and configuration choices. To leverage the fast algorithm of Peyré et al. (2016), we always use the $L_2$ distance as the loss function $L$ between cost matrices. On the other hand, we observed throughout our experiments that the choice of cosine distance as the metric in both spaces consistently leads to better results, which agrees with common wisdom on computing distances between word embeddings. This leaves us with a single hyperparameter to control: the entropy regularization term $\lambda$. By applying any sensible normalization on the cost matrices (e.g., dividing by the mean or median value), we are able to almost entirely eliminate sensitivity to that parameter. In practice, we use a simple scheme in all experiments: we first try the same fixed value ($\lambda = 5 \times 10^{-5}$), and if the regularization proves too small (by leading to floating point errors), we instead use $\lambda = 1 \times 10^{-4}$. We never had to go beyond these two values in all our experiments. We emphasize that at no point do we use train (let alone test) supervision available with many datasets; model selection is done solely in terms of the unsupervised objective. Pseudocode for the full method (with $L = L_2$ and cosine similarity) is shown here as Algorithm 1.
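The pieces described in this section can be assembled into the following end-to-end sketch. It is an illustrative reading of the method under several assumptions, not the authors' code: embeddings are stored one word per row, cost matrices are cosine distances divided by their mean, the squared loss is used, the inner Sinkhorn solver shifts the cost by its minimum purely for numerical safety, and the toy value `lam=0.1` and the iteration counts are arbitrary.

```python
import numpy as np

def cosine_distances(E):
    """Pairwise cosine distances between the rows of E (one row per word)."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    return 1.0 - E @ E.T

def sinkhorn(C, p, q, lam, n_iter=200):
    """Entropy-regularized projection of cost C onto the couplings of p and q."""
    K = np.exp(-(C - C.min()) / lam)   # a constant shift of C leaves the plan unchanged
    b = np.ones_like(q)
    for _ in range(n_iter):
        a = p / (K @ b)
        b = q / (K.T @ a)
    return a[:, None] * K * b[None, :]

def gromov_wasserstein(Cs, Ct, p, q, lam, n_outer=30):
    """Entropic GW coupling for the squared loss L(a, b) = (a - b)^2."""
    Cs, Ct = Cs / Cs.mean(), Ct / Ct.mean()          # normalize both cost matrices
    const = np.outer((Cs ** 2) @ p, np.ones_like(q)) \
          + np.outer(np.ones_like(p), (Ct ** 2) @ q)
    G = np.outer(p, q)                               # start from the product coupling
    for _ in range(n_outer):
        pseudo_cost = const - 2.0 * Cs @ G @ Ct.T    # factored tensor contraction
        G = sinkhorn(pseudo_cost, p, q, lam)         # one regularized OT problem
    return G

# Toy data: the target space is a rotated, shuffled copy of the source space.
rng = np.random.default_rng(0)
n, d = 20, 10
X = rng.standard_normal((n, d))
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
Y = X[rng.permutation(n)] @ Q
p = np.full(n, 1.0 / n)
q = np.full(n, 1.0 / n)
G = gromov_wasserstein(cosine_distances(X), cosine_distances(Y), p, q, lam=0.1)
```

Since the objective is non-convex, the returned coupling is a local solution that depends on the initialization and on λ; the sketch makes no claim about recovering the hidden permutation in general.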

Experiments
Through this experimental evaluation we seek to: (i) understand the optimization dynamics of the proposed approach (§4.2), (ii) evaluate its performance on benchmark cross-lingual word embedding tasks (§4.3), and (iii) qualitatively investigate the notion of distance-between-languages it computes (§4.4). Rather than focusing solely on prediction accuracy, we seek to demonstrate that the proposed approach offers a fast, principled, and robust alternative to state-of-the-art multi-step methods, delivering comparable performance.

Evaluation Tasks and Methods
Datasets We evaluate our method on two standard benchmark tasks for cross-lingual embeddings. First, we consider the dataset of Conneau et al. (2018), which consists of word embeddings trained with FASTTEXT (Bojanowski et al., 2017) on Wikipedia and parallel dictionaries for 110 language pairs. Here, we focus on the language pairs for which they report results: English (EN) from/to Spanish (ES), French (FR), German (DE), Russian (RU) and simplified Chinese (ZH). We do not report results on Esperanto (EO) as dictionaries for that language were not provided with the original dataset release. For our second set of experiments, we consider the substantially harder³ dataset of Dinu et al. (2014), which has been extensively compared against in previous work. It consists of embeddings and dictionaries in four pairs of languages: EN from/to ES, IT, DE, and FI (Finnish).
3 We discuss the difference in hardness of these two benchmark datasets in Section 4.3.
Methods To see how our fully-unsupervised method compares with methods that require (some) cross-lingual supervision, we follow Conneau et al. (2018) and consider a simple but strong baseline consisting of solving a Procrustes problem directly using the available cross-lingual embedding pairs. We refer to this method simply as PROCRUSTES. In addition, we compare against the fully-unsupervised methods of Zhang et al. (2017a), Artetxe et al. (2018) and Conneau et al. (2018). As proposed by the latter, we use CSLS whenever nearest neighbor search is required, which has been shown to improve upon naive nearest-neighbor retrieval in multiple works.
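For readers unfamiliar with CSLS, the sketch below shows one way to compute it, following the definition given by Conneau et al. (2018): twice the cosine similarity, minus the mean similarity of each word to its k nearest neighbors in the other language. The function name, the row-wise layout and the tiny identity-matrix example are assumptions made for illustration.

```python
import numpy as np

def csls_scores(S, T, k=10):
    """CSLS scores between mapped source rows S and target rows T."""
    S = S / np.linalg.norm(S, axis=1, keepdims=True)
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    cos = S @ T.T
    # Mean cosine of each source word to its k most similar target words.
    r_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)
    # Mean cosine of each target word to its k most similar source words.
    r_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)
    return 2.0 * cos - r_src[:, None] - r_tgt[None, :]

# Tiny sanity example: identical, mutually orthogonal "embeddings".
demo = np.eye(5)
pred = csls_scores(demo, demo, k=3).argmax(axis=1)   # predicted target index per source word
```

Subtracting the two neighborhood terms penalizes target words that are close to many source words, which is how CSLS counteracts hubness in plain nearest-neighbor retrieval.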

Training Dynamics of G-W
As previously mentioned, our approach involves only two optimization choices, one of which is required only for very large settings. When running Algorithm 1 for the full set of embeddings is infeasible (due to memory limitations), one must decide what fraction of the embeddings to use during optimization. In our experiments, we use the largest possible size allowed by memory constraints, which was found to be K = 20,000 for the personal computer we used.
The other, more interesting, optimization choice involves the entropy regularization parameter λ used within the Sinkhorn iterations. Large regularization values lead to denser optimal couplings Γ*, while less regularization leads to sparser solutions,⁵ at the cost of a harder optimization problem.

Table 1: Performance (P@1) of unsupervised and minimally-supervised methods on the dataset of Conneau et al. (2018). The time column shows the average runtime in minutes of an instance (i.e., one language pair) of the method in this task on the same quad-core CPU machine.
In Figure 2 we show the training dynamics of our method when learning correspondences between word embeddings from the dataset of Conneau et al. (2018). As expected, larger values of λ lead to smoother improvements with faster runtime-per-iteration, at a price of some drop in performance. In addition, we found that computing GW distances between closer languages (such as EN and FR) leads to faster convergence than for more distant ones (such as EN and RU, in Fig. 2c).
Worth emphasizing are three desirable optimization properties that set apart the Gromov-Wasserstein distance from other unsupervised alignment approaches, particularly adversarial-training ones: (i) the objective decreases monotonically, (ii) its value closely follows the true metric of interest (translation, which naturally is not available during training), and (iii) there is no risk of degradation due to overtraining, as is the case for adversarial-based methods trained with stochastic gradient descent (Conneau et al., 2018).

Benchmark Results
We report the results on the dataset of Conneau et al. (2018) in Table 1. The strikingly high performance of all methods on this task belies the hardness of the general problem of unsupervised cross-lingual alignment. Indeed, as pointed out by Artetxe et al. (2018), the FASTTEXT embeddings provided in this task are trained on very large and highly comparable (across languages) corpora (Wikipedia), and the task focuses on closely related pairs of languages. Nevertheless, we carry out experiments here to have a broad evaluation of our approach in both easier and harder settings.
Next, we present results on the more challenging dataset of Dinu et al. (2014) in Table 2. Here, we rely on the results reported by Artetxe et al. (2018), since at the time of writing their implementation was not yet available. Part of what makes this dataset hard is the wide discrepancy between word distances across languages, which translates into uneven distance matrices (Figure 3), and in turn leads to poor results for G-W. To account for this, previous work has relied on an initial whitening step on the embeddings. In our case, it suffices to normalize the pairwise similarity matrices to the same range to obtain substantially better results. While we have observed that careful choice of the regularization parameter λ can obviate the need for this step, we opt for the normalization approach since it allows us to optimize without having to tune λ.
We compare our method (with and without normalization) against alternative approaches in Table 2. Note that we report the runtimes of Artetxe et al. (2018) as-is, which are obtained by running on a Titan XP GPU, while our runtimes are, as before, obtained purely by CPU computation.

5 […] to a permutation matrix, which gives a hard-matching solution to the transportation problem (Peyré and Cuturi, 2018).

Figure 3: […] (Dinu et al., 2014). Bottom: Normalizing the cost matrices leads to better optimization and improved performance.

Table 2: […] Dinu et al. (2014) with runtimes in minutes. Those marked with † are from Artetxe et al. (2018). Note that their runtimes correspond to GPU computation, while ours are CPU-minutes, so the numbers are not directly comparable.

Qualitative Results
As mentioned earlier, Theorem 3.1 implies that the optimal value of the Gromov-Wasserstein problem can be legitimately interpreted as a distance between languages, or more explicitly, between their word embedding spaces. This distributional notion of distance is completely determined by pairwise geometric relations between these vectors. In Figure 4 we show the values $\mathrm{GW}(C_s, C_t, p, q)$ computed on the FASTTEXT word embeddings of Conneau et al. (2018) corresponding to the 2,000 most frequent words in each language. Overall, these distances conform to our intuitions: the cluster of Romance languages exhibits some of the shortest distances, while Chinese (ZH) has the overall largest discrepancy with all other languages. But somewhat surprisingly, Russian is relatively close to the Romance languages in this metric. We conjecture that this could be due to Russian's rich morphology (a trait shared by Romance languages but not English). Furthermore, both Russian and Spanish are pro-drop languages (Haspelmath, 2001) and share syntactic phenomena, such as dative subjects (Moore and Perlmutter, 2000; Melis et al., 2013) and differential object marking (Bossong, 1991), which might explain why ES is closest to RU overall.
On the other hand, English appears remarkably isolated from all languages, equally distant from its Germanic (DE) and Romance (FR) cousins. Other aspects of the data (such as corpus size) might also underlie these observations.

Related Work
Study of the problem of bilingual lexical induction goes back to Rapp (1995) and Fung (1995). While the literature on this topic is extensive, we focus here on recent fully-unsupervised and minimally-supervised approaches, and refer the reader to one of various existing surveys for a broader panorama (Upadhyay et al., 2016; Ruder et al., 2017).
Methods with coarse or limited parallel data. Most of these fall in one of two categories: methods that learn a mapping from one space to the other, e.g., as a least-squares objective (Mikolov et al., 2013) or via orthogonal transformations (Zhang et al., 2016; Smith et al., 2017; Artetxe et al., 2016), and methods that find a common space on which to project both sets of embeddings (Faruqui and Dyer, 2014; Lu et al., 2015).
Fully unsupervised methods. Conneau et al. (2018) and Zhang et al. (2017a) rely on adversarial training to produce an initial alignment between the spaces. The former use pseudo-matches derived from this initial alignment to solve a Procrustes alignment problem (2). Our Gromov-Wasserstein framework can be thought of as providing an alternative to these adversarial training steps, albeit with a concise optimization formulation and producing explicit matches (via the optimal coupling) instead of depending on nearest neighbor search, as the adversarially-learnt mappings do. Zhang et al. (2017b) also leverage optimal transport distances for the cross-lingual embedding task. However, to address the issue of non-alignment of embedding spaces, their approach follows the joint optimization of the transportation and Procrustes problems as outlined in Section 2.2. This formulation makes an explicit modeling assumption (invariance to unitary transformations), and requires repeated solution of Procrustes problems during alternating minimization. Gromov-Wasserstein, on the other hand, is more flexible and makes no such assumption, since it directly deals with similarities rather than vectors. In the case where it is required, such an orthogonal mapping can be obtained by solving a single Procrustes problem, as discussed in Section 3.2.

Discussion and future work
In this work we provided a direct optimization approach to cross-lingual word alignment. The Gromov-Wasserstein distance is well-suited for this task as it performs a relational comparison of word vectors across languages rather than comparing word vectors directly. The resulting objective is concise, and can be optimized efficiently. The experimental results show that the resulting alignment framework is fast, stable and robust, yielding near state-of-the-art performance at a computational cost orders of magnitude lower than that of alternative fully unsupervised methods.
While directly solving Gromov-Wasserstein problems of reasonable size is feasible, scaling up to large vocabularies made it necessary to learn an explicit mapping via Procrustes. GPU computations or stochastic optimization could help avoid this secondary step.