A Relaxed Matching Procedure for Unsupervised BLI

Recently, unsupervised Bilingual Lexicon Induction (BLI) without any parallel corpus has attracted much research interest. One of the crucial parts of methods for the BLI task is the matching procedure. Previous works impose too strong a constraint on the matching, which leads to many counterintuitive translation pairings. We therefore propose a relaxed matching procedure to find a more precise matching between two languages. We also find that aligning the source and target embedding spaces bidirectionally brings significant improvement. We follow the previous iterative framework to conduct experiments. Results on a standard benchmark demonstrate the effectiveness of our proposed method, which substantially outperforms previous unsupervised methods.


Introduction
Pretrained word embeddings (Mikolov et al., 2013b) are the basis of many natural language processing and machine learning systems. Word embeddings of a specific language contain rich syntactic and semantic information. Mikolov et al. (2013a) stated that continuous embedding spaces exhibit similar structures across different languages, and that we can exploit this similarity by a linear transformation from the source embedding space to the target embedding space. This similarity gives rise to the Bilingual Lexicon Induction (BLI) task. The goal of BLI is to align two languages' embedding spaces and generate a word translation lexicon automatically. This fundamental problem in natural language processing benefits much other research, such as sentence translation (Rapp, 1995; Fung, 1995), unsupervised machine translation (Lample et al., 2017), and cross-lingual information retrieval (Lavrenko et al., 2002).
Recent endeavors (Lample et al., 2018; Alvarez-Melis and Jaakkola, 2018; Artetxe et al., 2017) have proven that unsupervised BLI's performance is even on par with supervised methods. A crucial part of these approaches is the matching procedure, i.e., how to generate the translation plan. Alvarez-Melis and Jaakkola (2018) used the Gromov-Wasserstein distance to approximate the matching between languages. The Wasserstein Procrustes approach (W.Proc.) regarded it as a classic optimal transport problem and used the Sinkhorn algorithm (Cuturi, 2013) to compute the translation plan.

† Yong Zhang is the corresponding author.
In this work, we follow the previous iterative framework but use a different matching procedure. Previous iterative algorithms have to compute an approximate one-to-one matching at every step, and this one-to-one constraint produces many redundant matchings. To avoid this problem, we relax the constraint and control the relaxation degree by adding two KL-divergence regularization terms to the original loss function. This relaxation yields a more precise matching and significantly improves performance. We then propose a bidirectional optimization framework that optimizes the mapping from source to target and from target to source simultaneously. In the experiments section, we verify the effectiveness of our method, and the results show that it outperforms many state-of-the-art methods on the BLI task.

Background
The early works for the BLI task require a parallel lexicon between languages. Given two embedding matrices X and Y with shape n × d (n: word number, d: vector dimension) of two languages, where word x_i in X is the translation of word y_i in Y, i.e., we have a parallel lexicon X → Y, Mikolov et al. (2013a) pointed out that we could exploit the similarities of monolingual embedding spaces by learning a linear transformation W such that

    min_W ||XW − Y||_F^2.    (1)

Xing et al. (2015) stated that enforcing an orthogonal constraint on W would improve performance. Under this constraint, Problem (1) has a closed-form solution known as Procrustes: W* = UV^T, where UΣV^T is the singular value decomposition of X^T Y.

Under the unsupervised condition without a parallel lexicon, i.e., when the vectors in X and Y are totally out of order, Lample et al. (2018) proposed a domain-adversarial approach for learning W. Based on the observation that monolingual embedding spaces of different languages keep similar spatial structures, Alvarez-Melis and Jaakkola (2018) applied the Gromov-Wasserstein distance to find the corresponding translation pairings between X and Y and further derived the orthogonal mapping Q. The Wasserstein Procrustes approach (W.Proc.) formulated the unsupervised BLI task as

    min_{Q ∈ O_d, P ∈ P_n} ||XQ − PY||_F^2,    (2)

where O_d is the set of orthogonal matrices and P_n is the set of permutation matrices. Given Q, estimating P in Problem (2) is equivalent to minimizing the 2-Wasserstein distance between the two sets of points XQ and Y:

    min_{P ∈ P_n} ⟨D, P⟩,    (3)

where D_ij = ||x_i Q − y_j||_2^2 and ⟨D, P⟩ = Σ_{i,j} P_ij D_ij denotes the matrix inner product. W.Proc. proposed a stochastic algorithm to estimate Q and P jointly. Problem (3) is the standard optimal transport problem, solvable as an Earth Mover's Distance linear program with O(n^3) time complexity. Considering the computational cost, Zhang et al. (2017) and W.Proc. instead used the Sinkhorn algorithm (Cuturi, 2013) to estimate P by solving the entropy-regularized optimal transport problem (Peyré et al., 2019).
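As a concrete illustration of the supervised case, the closed-form Procrustes solution above can be sketched in a few lines of NumPy (a minimal sketch; the function name and test setup are ours, not from the paper):

```python
import numpy as np

def procrustes(X, Y):
    """Closed-form orthogonal Procrustes: argmin_{W orthogonal} ||XW - Y||_F.

    The solution is W* = U V^T, where U S V^T is the SVD of X^T Y.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt
```

When the rows of X and Y are a known parallel lexicon and Y = XW for some orthogonal W, this recovers W exactly.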
We also take Problem (2) as our loss function, and our model shares a similar alternating framework with W.Proc. However, we argue that the permutation-matrix constraint on P is too strong, leading to many inaccurate and redundant matchings between X and Y, so we relax it via unbalanced optimal transport. Alaux et al. (2019) extended this line of BLI work to the problem of aligning multiple languages to a common space. Another line of work estimated Q by a density matching method called normalizing flow. Artetxe et al. (2018) proposed a multi-step framework of linear transformations that generalizes a substantial body of previous work. Garneau et al. (2019) further investigated the robustness of Artetxe et al. (2018)'s model by introducing four new languages that are less similar to English than the ones proposed in the original paper. Artetxe et al. (2019) proposed an alternative approach that builds on recent work on unsupervised machine translation.

Proposed Method
In this section, we propose our method for the BLI task. As mentioned in the background, we take Problem (2) as our loss function and use an optimization framework similar to W.Proc. to estimate P and Q alternately. Our method focuses on the estimation of P and tries to find a more precise matching P between XQ and Y; Q is estimated by stochastic gradient descent. We also propose a bidirectional optimization framework in Section 3.2.

Relaxed Matching Procedure
Regarding the embedding sets X and Y as two discrete distributions µ and ν, standard optimal transport enforces the optimal transport plan to be a joint distribution P ∈ P_n. This setting requires every mass in µ to be matched to the same amount of mass in ν. Recent applications of unbalanced optimal transport show that relaxing these marginal conditions leads to a more flexible and local matching, which avoids some counterintuitive matchings of source-target mass pairs with high transportation cost.
The formulation of unbalanced optimal transport (Chizat et al., 2018a) differs from balanced optimal transport in two ways. First, the set of transport plans to be optimized is generalized to R_+^{n×n}. Second, the marginal conditions of Problem (3) are relaxed by two KL-divergence terms:

    min_{P ∈ R_+^{n×n}} ⟨D, P⟩ + λ1 KL(P 1_n ‖ µ) + λ2 KL(P^T 1_n ‖ ν),    (4)

where 1_n is the all-ones vector and λ1, λ2 control the degree of relaxation.
We estimate P by solving the relaxed Problem (4) instead of the original Problem (3) used in W.Proc. Problem (4) can also be solved by entropy regularization with the generalized Sinkhorn algorithm (Chizat et al., 2018b; Peyré et al., 2019).
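To make the relaxation concrete, here is a minimal NumPy sketch of the generalized Sinkhorn scaling iterations for a KL-relaxed problem of the form of Problem (4), with a single relaxation weight `rho` playing the role of λ1 = λ2 (the function name and hyperparameter values are illustrative, not the paper's implementation):

```python
import numpy as np

def sinkhorn_unbalanced(a, b, C, eps=0.2, rho=1.0, n_iters=500):
    """Generalized Sinkhorn for KL-relaxed (unbalanced) optimal transport.

    Approximately solves
        min_P <C, P> - eps*H(P) + rho*KL(P 1 || a) + rho*KL(P^T 1 || b)
    via the scaling iterations of Chizat et al. (2018).
    C is assumed scaled to O(1) so exp(-C/eps) does not underflow.
    """
    K = np.exp(-C / eps)           # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    theta = rho / (rho + eps)      # relaxation exponent; theta -> 1 recovers balanced OT
    for _ in range(n_iters):
        u = (a / (K @ v)) ** theta
        v = (b / (K.T @ u)) ** theta
    return u[:, None] * K * v[None, :]
```

With a large `rho`, the marginals of the returned plan stay close to a and b; with a small `rho`, mass at high-cost pairs is simply dropped instead of being forced into a counterintuitive matching.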
In short, we already have an algorithm to obtain the minimum of Problem (4). To alleviate the hubness phenomenon, we replace the ℓ2 distance between embeddings with the RCSLS distance proposed by Joulin et al. (2018), formalized as D_ij = RCSLS(x_i Q, y_j). RCSLS does not provide significantly better results than the Euclidean distance in our evaluation; however, previous studies suggest that RCSLS is a better metric between words than the Euclidean distance, so we adopt it in our approach. The relaxed matching procedure and the bi-directional optimization we propose bring most of the improvement.
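An RCSLS-style cost matrix can be sketched as follows (a simplified dense version; the function name, the choice of `k`, and the unit-normalization assumption are ours):

```python
import numpy as np

def rcsls_cost(XQ, Y, k=10):
    """Cost matrix D_ij = -CSLS(x_i Q, y_j), in the spirit of the relaxed
    CSLS criterion of Joulin et al. (2018). Rows of XQ and Y are assumed
    unit-normalized, so dot products are cosine similarities.
    """
    S = XQ @ Y.T                               # cosine similarity matrix
    k = min(k, Y.shape[0])
    r_x = np.sort(S, axis=1)[:, -k:].mean(1)   # mean sim. to k nearest targets
    r_y = np.sort(S, axis=0)[-k:, :].mean(0)   # mean sim. to k nearest sources
    return -2 * S + r_x[:, None] + r_y[None, :]
```

The penalty terms r_x and r_y discount similarities to hub words that are close to everything, which is what mitigates hubness relative to the plain ℓ2 cost.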
We call this relaxed estimation of P the Relaxed Matching Procedure (RMP). With RMP, two points may be matched together only when they are less than some radius apart from each other. Thus we can avoid some counterintuitive matchings and obtain a more precise matching P. In the experiments section, we verify the effectiveness of RMP.

Bidirectional Optimization
Previous research solved the mapping from X to Y and the mapping from Y to X as two independent problems, i.e., learning two orthogonal matrices Q1 and Q2 to match XQ1 with Y and YQ2 with X, respectively. Intuitively, from the viewpoint of point cloud matching, these two problems in opposite directions are symmetric. Thus we propose an optimization framework that solves a single Q for both directions.
In our approach, we match XQ with Y and YQ^T with X simultaneously. Following the stochastic optimization framework of W.Proc., we randomly choose one direction to optimize at each iteration.

The entire process of our method is summarized in Algorithm 2. At each iteration, we start by sampling batches X_b, Y_b of shape b × d. We then generate a random integer rand and choose to map X_b Q to Y_b or Y_b Q^T to X_b according to rand's parity. Given the mapping direction, we run the RMP procedure to solve Problem (4) by the generalized Sinkhorn algorithm and obtain a matching matrix P* between X_b Q and Y_b (or Y_b Q^T and X_b). Finally, we use gradient descent and Procrustes to update Q given P*, following the update procedure of W.Proc.
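The loop described above can be sketched as follows. This is a minimal self-contained version under simplifying assumptions: it uses a plain balanced entropic plan as a stand-in for RMP, updates Q by a direct Procrustes step rather than gradient descent, and all function names and hyperparameters are illustrative:

```python
import numpy as np

def sinkhorn_plan(C, eps=0.05, n_iters=200):
    """Entropy-regularized OT plan with uniform marginals (stand-in for RMP)."""
    n, m = C.shape
    K = np.exp(-C / (eps * C.max()))
    u, v = np.ones(n) / n, np.ones(m) / m
    for _ in range(n_iters):
        u = (1.0 / n) / (K @ v)
        v = (1.0 / m) / (K.T @ u)
    return u[:, None] * K * v[None, :]

def bidirectional_align(X, Y, iters=20, batch=64, seed=0):
    """Sketch of the bi-directional framework: a single orthogonal Q is
    updated from the X->Y or the Y->X direction, chosen at random each
    iteration by the parity of a random integer."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    Q = np.eye(d)
    for _ in range(iters):
        Xb = X[rng.choice(len(X), size=min(batch, len(X)), replace=False)]
        Yb = Y[rng.choice(len(Y), size=min(batch, len(Y)), replace=False)]
        if rng.integers(2) == 1:
            # map X_b Q onto Y_b: min ||X_b Q - P Y_b||_F
            C = (((Xb @ Q)[:, None] - Yb[None]) ** 2).sum(-1)
            P = sinkhorn_plan(C)
            M = Xb.T @ P @ Yb
        else:
            # map Y_b Q^T onto X_b: min ||Y_b Q^T - P X_b||_F
            C = (((Yb @ Q.T)[:, None] - Xb[None]) ** 2).sum(-1)
            P = sinkhorn_plan(C)
            M = Xb.T @ P.T @ Yb
        U, _, Vt = np.linalg.svd(M)  # Procrustes update keeps Q orthogonal
        Q = U @ Vt
    return Q
```

Note that both branches reduce to the same Procrustes update on X^T P Y (with P oriented appropriately), which is why a single Q can serve both directions.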

Experiments
In this section, we evaluate our method in two settings. First, we conduct ablation experiments to verify the effectiveness of RMP and bidirectional optimization. Then we compare our full method, consisting of both RMP and bi-directional optimization, with various state-of-the-art methods on the BLI task.
Datasets* We conduct word translation experiments on 6 pairs of languages, using pretrained word embeddings from fastText. We use the bilingual dictionaries open-sourced by Lample et al. (2018) as our evaluation set. We use the CSLS retrieval method for evaluation, as in Lample et al. (2018), in both settings. All translation accuracies reported are precision at 1 with the CSLS criterion. We open-source the code on GitHub†.

Main Results
Through the experimental evaluation, we seek to demonstrate the effectiveness of our method compared to other state-of-the-art methods. The word embeddings are normalized and centered before entering the model. We start with a batch size of 500 and 2000 iterations per epoch, then double the batch size and quarter the iteration number after each epoch. The first 2.5K words are taken for initialization, and samples are drawn only from the first 20K words of the frequency-ranked vocabulary. The coefficients λ1 and λ2 of the relaxed terms in Problem (4) control the relaxation degree.

Table 1 shows that, leading by an average of 2 percentage points, our approach outperforms other unsupervised methods in most instances and is on par with the supervised method on some language pairs. Surprisingly, we find that our method achieves significant progress on some tough cases, such as English-Russian and English-Italian, which contain a lot of noise. Our method guarantees the precision of the mapping computed at every step, which achieves the effect of noise reduction.
However, there still exists a noticeable gap between our method and the supervised RCSLS method, which indicates that further research could bring the advantages of this metric to unsupervised methods.
We also compare our method with W.Proc. on two non-English pairs, FR-DE and FR-ES, to show how bidirectional relaxed matching improves performance; the results are presented in Table 2. Most recent studies did not report results on non-English pairs, which makes a fair comparison difficult. However, the results in Table 2 show that our method keeps an advantage over W.Proc. Note that the W.Proc. results here are from our implementation rather than those reported in the original paper.

Ablation Study
The algorithms for BLI can be roughly divided into three parts: 1. initialization, 2. iterative optimization, and 3. a refinement procedure, as in Lample et al. (2017). W.Proc. only covers the first two parts. Our contributions, i.e., relaxed matching and bi-directional optimization, belong to the second part. To ensure a fair comparison, W.Proc.-Refine is compared with Ours-Refine, which is discussed in the next section. To verify the effectiveness of RMP and bidirectional optimization directly, we apply them to W.Proc. one by one. We take the same implementation and hyperparameters reported in their paper and code‡, but use RMP to solve for P instead of the ordinary 2-Wasserstein matching. On four language pairs, we applied RMP, bidirectional optimization, and the refinement procedure to the original W.Proc. gradually and evaluated the change in performance. Figure 1 clearly shows that after applying bidirectional RMP, the translation accuracy improves by 3 percentage points on average. The results of 'WP-RMP' are worse than 'WP-RMP-bidirection' but better than the original 'WP'. Moreover, we find that by applying RMP, a more precise P not only eliminates many unnecessary matchings but also leads to faster convergence of the optimization procedure. Furthermore, the effectiveness of the refinement procedure is quite significant.

‡ https://github.com/facebookresearch/fastText/alignment
To summarize, we consider the average of the scores (from en-es to ru-en). By mitigating the counterintuitive pairs caused by polysemy and obscure words, the relaxed matching procedure improves the average score by about 2 points, and the bi-directional optimization improves it by about 0.6 points. These results suggest that our ideas of relaxed matching and bidirectional optimization can also be applied to other frameworks, such as the adversarial training of Lample et al. (2017) and the Gromov-Wasserstein matching of Alvarez-Melis and Jaakkola (2018).

Conclusion
This paper focuses on the matching procedure of the BLI task. Our key insight is that relaxed matching mitigates the counterintuitive pairs caused by polysemy and obscure words, which is supported by comparing W.Proc.-RMP with W.Proc. in Table 1; the strict optimal transport constraint used by W.Proc. is not appropriate for BLI. Moreover, our approach optimizes the translation mapping Q in a bi-directional way and, with refinement, has been shown to outperform all other unsupervised state-of-the-art models in Table 1.