Geometry-aware domain adaptation for unsupervised alignment of word embeddings

We propose a novel manifold-based geometric approach for learning unsupervised alignment of word embeddings between source and target languages. Our approach formulates the alignment learning problem as a domain adaptation problem over the manifold of doubly stochastic matrices. This viewpoint arises from the aim of aligning the second-order information of the two language spaces. The rich geometry of the doubly stochastic manifold allows us to employ an efficient Riemannian conjugate gradient algorithm for the proposed formulation. Empirically, the proposed approach outperforms a state-of-the-art optimal-transport-based approach on the bilingual lexicon induction task across several language pairs. The performance improvement is more significant for distant language pairs.


Introduction
Learning bilingual word embeddings is an important problem in natural language processing (Mikolov et al., 2013; Faruqui and Dyer, 2014; Artetxe et al., 2016; Conneau et al., 2018), with usage in cross-lingual information retrieval (Vulić and Moens, 2015), text classification (Wan et al., 2011; Klementiev et al., 2012), machine translation (Artetxe et al., 2018c), etc. Given a source-target language pair, the aim is to represent the words of both languages in a common embedding space. This is usually achieved by learning a linear function that maps word embeddings of one language to the embedding space of the other language (Mikolov et al., 2013).
Learning unsupervised cross-lingual mapping may be viewed as an instance of the more general unsupervised domain adaptation problem (Ben-David et al., 2007; Gopalan et al., 2011; Sun et al., 2016; Mahadevan et al., 2018). The latter fundamentally aims at aligning the input feature (embedding) distributions of the source and target domains (languages). In this paper, we take this point of view and learn a cross-lingual word alignment by aligning the second-order statistics of the source and the target language embedding spaces.
We formulate a novel optimization problem on the set of doubly stochastic matrices. The objective function matches the covariances of words from the source and target languages in a least-squares sense. For optimization, we exploit the fact that the set of doubly stochastic matrices has a rich geometry and forms a Riemannian manifold (Douik and Hassibi, 2019). The Riemannian optimization framework (Absil et al., 2008; Edelman et al., 1998; Smith, 1994) allows us to propose a computationally efficient conjugate gradient algorithm (Douik and Hassibi, 2019). Experiments show the efficacy of the proposed approach on the bilingual lexicon induction benchmark, especially on pairs involving distant languages.

Motivation and Related Work
We introduce the bilingual word alignment setup, followed by a discussion on domain adaptation approaches. Bilingual alignment. Let X ∈ R^{n×d} and Z ∈ R^{n×d} be the d-dimensional word embeddings of n words of the source and the target languages, respectively. The aim is to learn a linear operator W : R^d → R^d that best approximates the source embeddings in the target language space.
In the supervised setup, a list of source words and their translations in the target language is provided. This is represented by an alignment matrix Y of size n × n, where Y_ij = 1 if the j-th word in the target language is a translation of the i-th word in the source language and Y_ij = 0 otherwise. A standard way to learn an orthogonal W is by solving the orthogonal Procrustes problem (Artetxe et al., 2016; Smith et al., 2017), i.e.,

min_{W : W⊤W = I} ‖XW − YZ‖²_Fro,   (1)

where ‖·‖_Fro is the Frobenius norm and I is the identity matrix. Problem (1) has the closed-form solution W = UV⊤, where U and V are the respective left and right orthogonal factors of the singular value decomposition of X⊤YZ (Schönemann, 1966).
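To make the closed-form solution of (1) concrete, the following NumPy sketch computes W = UV⊤ from the SVD of X⊤YZ on toy data. The variable names and the synthetic setup (target embeddings that are an exact rotation of the source, identity alignment) are our illustrative assumptions, not data from the paper.

```python
import numpy as np

def procrustes(X, Y, Z):
    # Closed-form solution of min ||XW - YZ||_Fro over orthogonal W:
    # W = U V^T, where U, S, V^T is the SVD of X^T Y Z.
    M = X.T @ Y @ Z
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt

rng = np.random.default_rng(0)
n, d = 50, 10
X = rng.standard_normal((n, d))
W_true, _ = np.linalg.qr(rng.standard_normal((d, d)))  # a random orthogonal map
Z = X @ W_true          # target embeddings: an exact rotation of the source
Y = np.eye(n)           # identity alignment: i-th source word <-> i-th target word
W = procrustes(X, Y, Z)
```

In this idealized noise-free case, the recovered W coincides with the rotation used to generate Z; with real embeddings the solution is only a least-squares fit.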
In the unsupervised setting, Y is unknown in addition to W. Most unsupervised works (Zhang et al., 2017b; Artetxe et al., 2018b; Conneau et al., 2018) tackle this challenge by learning Y and W jointly. However, their performance relies on finding a good initialization candidate for the alignment matrix Y (Zhang et al., 2017b; Alaux et al., 2019; Jawanpuria et al., 2020).
Performing optimization over the set of binary matrices Y ∈ {0, 1}^{n×n} to learn the bilingual alignment matrix is computationally hard. Hence, some works (Zhang et al., 2017b; Xu et al., 2018) view the source and the target word embedding spaces as two distributions and learn Y as the transformation that brings the two distributions close. This viewpoint is based on the theory of optimal transport (Villani, 2009; Peyré and Cuturi, 2019). Y is thus modeled as a doubly stochastic matrix: its entries lie in [0, 1] and each row and column sums to 1. Permutation matrices are extreme points of the set of doubly stochastic matrices.
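For intuition about the doubly stochastic constraint: any entry-wise positive matrix can be scaled into an (approximately) doubly stochastic one by the classical Sinkhorn-Knopp iteration, which alternately normalizes rows and columns. The sketch below is our illustration of that classical iteration, not part of the proposed method.

```python
import numpy as np

def sinkhorn_knopp(K, iters=500):
    """Scale a positive matrix K until it is approximately doubly stochastic."""
    Y = K.astype(float).copy()
    for _ in range(iters):
        Y /= Y.sum(axis=1, keepdims=True)  # make every row sum to 1
        Y /= Y.sum(axis=0, keepdims=True)  # make every column sum to 1
    return Y

rng = np.random.default_rng(1)
Y = sinkhorn_knopp(rng.random((6, 6)) + 0.1)
# Y now lies (numerically) in DS_6: non-negative, rows and columns sum to 1.
```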
Alvarez-Melis and Jaakkola (2018) propose learning the doubly stochastic Y as a transport map between the metric spaces of the words in the source and the target languages. They optimize the Gromov-Wasserstein (GW) distance, which measures how distances between pairs of words are mapped across languages. For learning Y, they propose to solve

max_{Y ∈ DS_n} ⟨Y⊤C_X Y, C_Z⟩,   (2)

where DS_n := {Y ∈ R^{n×n} : Y ≥ 0, Y⊤1 = 1 and Y1 = 1} is the set of n×n doubly stochastic matrices, Y ≥ 0 implies entry-wise non-negativity, 1 is a column vector of ones, and C_X = XX⊤ and C_Z = ZZ⊤ are the n × n word covariance matrices of the source and the target languages, respectively. An iterative scheme is proposed for solving (2), where each iteration involves solving an optimal transport problem with entropic regularization (Peyré et al., 2016; Peyré and Cuturi, 2019). The optimal transport problem is solved with the popular Sinkhorn algorithm (Cuturi, 2013). It should be noted that the GW approach (2) only learns Y. The linear operator mapping source language word embeddings to the target language embedding space can then be learned by solving (1).

Domain adaptation. Domain adaptation refers to the transfer of information across domains and has been an independent research interest in many fields, including natural language processing (Daumé III, 2007; Borgwardt et al., 2006; Adel et al., 2017; Baktashmotlagh et al., 2013; Fukumizu et al., 2007; Wang et al., 2015; Prettenhofer and Stein, 2011; Wan et al., 2011; Sun et al., 2016; Mahadevan et al., 2018; Ruder, 2019). One modeling approach of interest is by Sun et al. (2016), who motivate a linear transformation between the features of the source and target domains. In (Sun et al., 2016), the linear map A ∈ R^{d×d} is obtained by solving

min_{A ∈ R^{d×d}} ‖A⊤D_1 A − D_2‖²_Fro,   (3)

where D_1 and D_2 are the d × d feature covariance matrices of the source and target domains (e.g., D_X = X⊤X and D_Z = Z⊤Z), respectively. Interestingly, (3) has a closed-form solution and shows good performance on standard benchmark domain adaptation tasks (Sun et al., 2016).
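To make (3) concrete: one closed-form minimizer is the whiten-then-recolor map A = D_1^{-1/2} D_2^{1/2}, which drives the objective to zero when D_1 is full rank. The NumPy sketch below is our reconstruction of that solution for illustration (it is not code from Sun et al. (2016)); the matrix names and toy data are ours.

```python
import numpy as np

def sqrtm_psd(M):
    # Symmetric PSD matrix square root via eigendecomposition.
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

def coral_map(D1, D2):
    # Closed-form minimizer of ||A^T D1 A - D2||_Fro^2 for full-rank D1:
    # whiten with D1^{-1/2}, then re-color with D2^{1/2}.
    w, V = np.linalg.eigh(D1)
    D1_inv_half = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    return D1_inv_half @ sqrtm_psd(D2)

rng = np.random.default_rng(2)
d = 5
B1, B2 = rng.standard_normal((30, d)), rng.standard_normal((30, d))
D1, D2 = B1.T @ B1, B2.T @ B2   # full-rank feature covariances
A = coral_map(D1, D2)
# A^T D1 A equals D2 up to numerical error, so the objective (3) is ~0.
```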

Domain Adaptation Based Cross-lingual Alignment
The domain adaptation solution strategies of (Sun et al., 2016; Mahadevan et al., 2018) can be motivated directly for the cross-lingual alignment problem by dealing with word covariances instead of feature covariances. However, the cross-lingual word alignment problem additionally has a bi-directional symmetry: if Y⊤ aligns X to Z, then Y aligns Z to X. We exploit this to propose a bi-directional domain adaptation scheme based on (3). The key idea is to adapt the second-order information of the source and the target languages into each other's domain. We formulate the above as follows:

min_{Y ∈ DS_n} ‖Y⊤C_X Y − C_Z‖²_Fro + ‖YC_Z Y⊤ − C_X‖²_Fro.   (4)

The first term in the objective function, ‖Y⊤C_X Y − C_Z‖²_Fro, adapts the domain of X (source) into Z (target). Equivalently, minimizing only the first term of (4) leads to the row indices of Y⊤X aligning closely with the row indices of Z. Similarly, minimizing only the second term, ‖YC_Z Y⊤ − C_X‖²_Fro, adapts Z (now treated as the source domain) into X (now treated as the target domain), which means that the row indices of YZ and X are closely aligned. Overall, minimizing both terms of the objective function allows us to learn the alignment Y⊤ from X to Z and Y from Z to X simultaneously. Empirically, we observe that bi-directionality acts as a self-regularization, leading to optimization stability and better generalization.
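For concreteness, the objective (4) and its Euclidean gradient can be sketched as follows. The gradient expression 4 C_X Y (Y⊤C_X Y − C_Z) + 4 (Y C_Z Y⊤ − C_X) Y C_Z is our own derivation (a Riemannian solver would convert it internally to a Riemannian gradient), so the sketch checks it against a central finite difference on one entry.

```python
import numpy as np

def objective(Y, C_X, C_Z):
    R = Y.T @ C_X @ Y - C_Z
    S = Y @ C_Z @ Y.T - C_X
    return np.sum(R**2) + np.sum(S**2)

def euclidean_grad(Y, C_X, C_Z):
    # Derived using symmetry of C_X, C_Z (and hence of R and S).
    R = Y.T @ C_X @ Y - C_Z
    S = Y @ C_Z @ Y.T - C_X
    return 4 * C_X @ Y @ R + 4 * S @ Y @ C_Z

# Finite-difference check of the gradient at a random point.
rng = np.random.default_rng(3)
n = 4
X, Z = rng.standard_normal((n, 3)), rng.standard_normal((n, 3))
C_X, C_Z = X @ X.T, Z @ Z.T
Y = rng.random((n, n))
G = euclidean_grad(Y, C_X, C_Z)
eps = 1e-6
E = np.zeros((n, n)); E[1, 2] = 1.0
fd = (objective(Y + eps * E, C_X, C_Z)
      - objective(Y - eps * E, C_X, C_Z)) / (2 * eps)
# fd and G[1, 2] agree up to finite-difference error if the gradient is correct.
```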
The differences between the proposed formulation (4) and the GW formulation (2) are twofold. First, the formulation (2) maximizes the inner product between Y⊤C_X Y and C_Z. This inner product is sensitive to differences in the norms of Y⊤C_X Y and C_Z. The proposed approach circumvents this issue since (4) explicitly penalizes the entry-wise mismatch between Y⊤C_X Y and C_Z. Second, the GW algorithm for (2) is sensitive to the choice of the entropic regularization parameter (Alvarez-Melis and Jaakkola, 2018; Peyré and Cuturi, 2019). In our case, no such regularization is required.
Most recent works that solve the optimal transport problem by optimizing over doubly stochastic matrices employ the Sinkhorn algorithm with entropic regularization (Cuturi, 2013; Peyré et al., 2016; Peyré and Cuturi, 2019). In contrast, we exploit the Riemannian manifold structure of the set of doubly stochastic matrices (DS_n), recently studied in (Douik and Hassibi, 2019). DS_n is endowed with the Fisher information metric (an inner product), which makes it a smooth Riemannian manifold (Douik and Hassibi, 2019; Sun et al., 2015; Lebanon and Lafferty, 2004). In differential geometric terms, DS_n has the structure of a Riemannian submanifold. This makes the computation of optimization-related ingredients, e.g., the gradient and Hessian of a function, projection operators, and the retraction operator, straightforward. Leveraging the versatile Riemannian optimization framework (Absil et al., 2008; Edelman et al., 1998; Smith, 1994), the constrained problem (4) is conceptually transformed into an unconstrained problem over the nonlinear manifold. Consequently, most unconstrained optimization algorithms generalize well to manifolds. We solve (4) using the Riemannian conjugate gradient algorithm (Absil et al., 2008; Douik and Hassibi, 2019).
There exist several manifold optimization toolboxes, such as Manopt (Boumal et al., 2014), Pymanopt (Townsend et al., 2016), Manopt.jl (Bergmann, 2019), McTorch (Meghwanshi et al., 2018), and ROPTLIB (Huang et al., 2016), which provide scalable off-the-shelf implementations of generic Riemannian algorithms. We use Manopt for our experiments, where we only need to provide the objective function (4) and its derivative with respect to Y. The manifold-optimization-related ingredients are handled internally by Manopt. The computational cost per iteration of the algorithm is O(n²), which is similar to that of GW (Alvarez-Melis and Jaakkola, 2018).
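Purely for illustration of the overall loop, and emphatically not the Riemannian conjugate gradient algorithm used in the paper, one can minimize (4) with a crude stand-in: a multiplicative (entropic mirror-descent-style) gradient step followed by Sinkhorn-style re-normalization back onto DS_n. Every name, step size, and the permuted-source toy setup below are our assumptions.

```python
import numpy as np

def objective(Y, C_X, C_Z):
    return (np.sum((Y.T @ C_X @ Y - C_Z) ** 2)
            + np.sum((Y @ C_Z @ Y.T - C_X) ** 2))

def grad(Y, C_X, C_Z):
    R = Y.T @ C_X @ Y - C_Z
    S = Y @ C_Z @ Y.T - C_X
    return 4 * C_X @ Y @ R + 4 * S @ Y @ C_Z

def project_ds(Y, iters=50):
    # Sinkhorn-style re-normalization onto (approximately) DS_n.
    for _ in range(iters):
        Y = Y / Y.sum(axis=1, keepdims=True)
        Y = Y / Y.sum(axis=0, keepdims=True)
    return Y

rng = np.random.default_rng(0)
n, d = 8, 4
X = rng.standard_normal((n, d))
P = np.eye(n)[rng.permutation(n)]   # ground-truth permutation
Z = P @ X                           # target words = permuted source words
C_X, C_Z = X @ X.T, Z @ Z.T

Y = project_ds(np.full((n, n), 1.0 / n) + 0.01 * rng.random((n, n)))
hist = [objective(Y, C_X, C_Z)]
for _ in range(300):
    G = grad(Y, C_X, C_Z)
    # tiny multiplicative step keeps Y entry-wise positive
    Y = project_ds(Y * np.exp(-1e-3 * G / (1.0 + np.abs(G).max())))
    hist.append(objective(Y, C_X, C_Z))
# hist decreases over iterations while Y stays doubly stochastic.
```

A proper Riemannian solver additionally exploits the Fisher metric for the gradient, a retraction, and conjugate directions, which is what Manopt provides off the shelf.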
We term our algorithm Manifold Based Alignment (MBA).

Experiments
We compare the proposed algorithm MBA with the state-of-the-art GW alignment algorithm (Alvarez-Melis and Jaakkola, 2018) on the bilingual lexicon induction (BLI) task. Both algorithms use second-order statistics (word covariance matrices) to learn the word alignment between two languages. In our experimental setup, we first learn the word alignment between the source and the target languages and then compute the cross-lingual mapping by solving the Procrustes problem (1). For nearest-neighbor inference, we employ the cross-domain similarity local scaling (CSLS) similarity score (Conneau et al., 2018). We report Precision@1 (P@1) for the BLI task, as in (Alvarez-Melis and Jaakkola, 2018; Artetxe et al., 2018b).
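For reference, CSLS rescales the cosine similarity between a mapped source word and a target word by the average similarity of each to its k nearest cross-lingual neighbors, which mitigates the hubness problem of plain nearest-neighbor retrieval. The NumPy sketch below is our implementation of the published formula on hypothetical toy data (the names and k are ours; Conneau et al. (2018) use k = 10).

```python
import numpy as np

def csls_scores(XW, Z, k=10):
    """CSLS matrix between mapped source rows XW and target rows Z:
    score[i, j] = 2*cos(XW_i, Z_j) - r_src[i] - r_tgt[j],
    where r_* is the mean cosine to the k nearest cross-lingual neighbors."""
    Xn = XW / np.linalg.norm(XW, axis=1, keepdims=True)
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    cos = Xn @ Zn.T
    r_src = np.mean(np.sort(cos, axis=1)[:, -k:], axis=1)   # per source word
    r_tgt = np.mean(np.sort(cos, axis=0)[-k:, :], axis=0)   # per target word
    return 2 * cos - r_src[:, None] - r_tgt[None, :]

rng = np.random.default_rng(4)
S = csls_scores(rng.standard_normal((20, 5)), rng.standard_normal((30, 5)), k=5)
pred = S.argmax(axis=1)   # CSLS nearest neighbor for each source word
```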
We show results on the MUSE dataset (Conneau et al., 2018), which consists of fastText monolingual embeddings for different languages (Bojanowski et al., 2017) and dictionaries between several languages (but mostly with English).

[Table 1 appears here: P@1 for the BLI task among six European languages (de, en, es, fr, it, pt), in all directions xx-yy and on average.]

For GW, we use the original code shared by Alvarez-Melis and Jaakkola (2018) and follow their recommendations on tuning the entropic regularization parameter and scaling the covariance matrices C_X and C_Z. As a practical implementation of MBA, we incrementally increase n from 1 000 to 20 000 every fixed number of iterations.
We begin by discussing the results on six closely related European languages in Table 1. We observe that both MBA and GW perform similarly when the languages are related. Hence, in the second set of experiments, we consider other European languages that are distant from English. We observe from Table 2 that MBA outperforms GW by an average BLI score of 6 points in this challenging setting. Table 3 reports results on language pairs involving English and three non-European languages. We again observe that the proposed algorithm MBA performs significantly better than GW. Overall, the experiments show the benefit of a geometric optimization framework.

Conclusion
Aligning the metric spaces of languages has wide usage in cross-lingual applications. A popular approach in the literature is the Gromov-Wasserstein (GW) alignment approach (Mémoli, 2011; Peyré et al., 2016; Alvarez-Melis and Jaakkola, 2018), which constructs a transport map by viewing the two embedding spaces as distributions. In contrast, we have viewed unsupervised bilingual word alignment as an instance of the more general unsupervised domain adaptation problem. In particular, our formulation searches over the space of doubly stochastic matrices and induces a bi-directional mapping between the source and target words. Both are motivated solely from the language perspective. The Riemannian framework allows us to exploit the geometry of the doubly stochastic manifold. Empirically, we observe that the proposed algorithm MBA outperforms the GW algorithm for learning bilingual mappings (Alvarez-Melis and Jaakkola, 2018), demonstrating the benefit of geometric optimization modeling.