Loss in Translation: Learning Bilingual Word Mapping with a Retrieval Criterion

Continuous word representations learned separately on distinct languages can be aligned so that their words become comparable in a common space. Existing works typically solve a quadratic problem to learn an orthogonal matrix aligning a bilingual lexicon, and use a retrieval criterion for inference. In this paper, we propose a unified formulation that directly optimizes a retrieval criterion in an end-to-end fashion. Our experiments on standard benchmarks show that our approach outperforms the state of the art on word translation, with the biggest improvements observed for distant language pairs such as English-Chinese.


Introduction
Previous work has proposed to learn a linear mapping between continuous representations of words by employing a small bilingual lexicon as supervision. The transformation generalizes well to words that are not observed during training, making it possible to extend the lexicon. Another application is to transfer predictive models between languages (Klementiev et al., 2012).
The first simple method proposed by Mikolov et al. (2013b) has been subsequently improved by changing the problem parametrization. One successful suggestion is to $\ell_2$-normalize the word vectors and to constrain the linear mapping to be orthogonal (Xing et al., 2015). An alignment is then efficiently found using orthogonal Procrustes (Artetxe et al., 2016; Smith et al., 2017), improving the accuracy on standard benchmarks.
Yet, the resulting models suffer from the so-called "hubness problem": some word vectors tend to be the nearest neighbors of an abnormally high number of other words. This limitation is now addressed by applying a corrective metric at inference time, such as the inverted softmax (ISF) (Smith et al., 2017) or the cross-domain similarity local scaling (CSLS) (Conneau et al., 2017). This is not fully satisfactory because the loss used for inference is not consistent with that employed for training. This observation suggests that the square loss is suboptimal and could advantageously be replaced by a loss adapted to retrieval.
In this paper, we propose a training objective inspired by the CSLS retrieval criterion. We introduce convex relaxations of the corresponding objective function, which are efficiently optimized with projected subgradient descent. This loss can advantageously include unsupervised information and therefore leverage the representations of words not occurring in the training lexicon.
Our contributions are as follows. First, we introduce our approach and empirically evaluate it on standard benchmarks for word translation. We obtain state-of-the-art bilingual mappings for more than 25 language pairs. Second, we specifically show the benefit of our alternative loss function and of leveraging unsupervised information. Finally, we show that with our end-to-end formulation, a non-orthogonal mapping achieves better results. The code for our approach is a part of the fastText library and the aligned vectors are available on https://fasttext.cc/.

Preliminaries on bilingual mappings
This section introduces pre-requisites and prior works to learn a mapping between two languages, using a small bilingual lexicon as supervision.
We start from two sets of continuous representations in two languages, each learned on monolingual data. Let us introduce some notation. Each word $i \in \{1, \dots, N\}$ in the source language (respectively target language) is associated with a vector $x_i \in \mathbb{R}^d$ (respectively $y_i \in \mathbb{R}^d$). For simplicity, we assume that our initial lexicon, or seeds, corresponds to the first $n$ pairs $(x_i, y_i)_{i \in \{1, \dots, n\}}$. The goal is to extend the lexicon to all source words $i \in \{n+1, \dots, N\}$ that are not seeds. Mikolov et al. (2013b) learn a linear mapping $W \in \mathbb{R}^{d \times d}$ between the word vectors of the seed lexicon that minimizes a measure of discrepancy between mapped word vectors of the source language and word vectors of the target language:

$$\min_{W \in \mathbb{R}^{d \times d}} \frac{1}{n} \sum_{i=1}^{n} \ell(W x_i, y_i), \tag{1}$$

where $\ell$ is a loss function, typically the square loss $\ell_2(x, y) = \|x - y\|_2^2$. This leads to a least squares problem, which is solved in closed form.
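As an illustration of the least squares problem in Eq. (1), the following is a minimal NumPy sketch. It assumes, for this example only, that the seed vectors are stored as the rows of two arrays `X` and `Y` of shape (n, d); the function name `least_squares_mapping` and the synthetic data are hypothetical and not part of the original work.

```python
import numpy as np


def least_squares_mapping(X, Y):
    """Return W minimizing (1/n) * sum_i ||W x_i - y_i||^2.

    X and Y are (n, d) arrays whose rows are the seed vectors x_i and y_i
    (a storage layout assumed for this sketch).
    """
    # Row-wise, "W x_i = y_i" reads X @ W.T = Y, so lstsq solves for W.T.
    W_T, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W_T.T


# Synthetic check: targets generated by a known linear map are recovered.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
W_true = rng.standard_normal((4, 4))
Y = X @ W_true.T
W = least_squares_mapping(X, Y)
```

On noiseless synthetic data such as this, the recovered `W` should match `W_true` up to numerical precision; on real word vectors the fit is only approximate.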
Orthogonality. The linear mapping $W$ is constrained to be orthogonal, i.e., such that $W^\top W = I_d$, where $I_d$ is the $d$-dimensional identity matrix. This choice preserves distances between word vectors, and likewise word similarities. Previous works (Xing et al., 2015; Artetxe et al., 2016; Smith et al., 2017) experimentally observed that constraining the mapping in such a way improves the quality of the inferred lexicon. With the square loss and by enforcing an orthogonal mapping $W$, Eq. (1) admits a closed form solution (Gower and Dijksterhuis, 2004): $W^* = U V^\top$, where $U D V^\top$ is the singular value decomposition of the matrix $Y^\top X$.
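The closed-form Procrustes solution can be sketched in a few lines of NumPy. This example assumes that $X$ and $Y$ are the (n, d) matrices whose rows are the seed vectors $x_i$ and $y_i$, which is the convention under which $Y^\top X$ is a $d \times d$ matrix; the function name `procrustes` and the synthetic data are hypothetical.

```python
import numpy as np


def procrustes(X, Y):
    """Orthogonal W minimizing sum_i ||W x_i - y_i||^2.

    X and Y are (n, d) arrays whose rows are the paired seed vectors
    (a layout assumed for this sketch).
    """
    # SVD of the d x d matrix Y^T X = U D V^T; the solution is W = U V^T.
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt


# Synthetic check: targets are an exact rotation of the sources.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))  # random orthogonal matrix
Y = X @ Q.T                                       # y_i = Q x_i
W = procrustes(X, Y)
```

When the targets are an exact orthogonal transform of the sources, as in this toy setup, the hidden matrix `Q` is recovered; the singular values `D` are discarded, which is what forces the result to be orthogonal.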
Inference. Once a mapping $W$ is learned, one can infer word correspondences for words that are not in the initial lexicon. The translation $t(i)$ of a source word $i$ is obtained as

$$t(i) \in \arg\min_{j \in \{1, \dots, N\}} \ell(W x_i, y_j). \tag{2}$$

When the squared loss is used, this amounts to computing $W x_i$ and to performing a nearest neighbor search with respect to the Euclidean distance:

$$t(i) \in \arg\min_{j \in \{1, \dots, N\}} \|W x_i - y_j\|_2^2. \tag{3}$$

Hubness. A common observation is that nearest neighbor search for bilingual lexicon inference suffers from the "hubness problem" (Doddington et al., 1998; Dinu et al., 2014). Hubs are words that appear too frequently in the neighborhoods of other words. To mitigate this effect, a simple solution is to replace, at inference time, the squared $\ell_2$-norm in Eq. (3) by another criterion, such as ISF (Smith et al., 2017) or CSLS (Conneau et al., 2017).
This solution, both with ISF and CSLS criteria, is applied with a transformation W learned using the square loss. However, replacing the loss in Eq. (3) creates a discrepancy between the learning of the translation model and the inference.
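The nearest neighbor inference of Eq. (3), together with a simple way to observe hubs, can be sketched as follows. This is an illustrative NumPy example under the assumption that source and target vectors are stored as rows; the helper names `nearest_neighbor_translation` and `hub_counts` are hypothetical.

```python
import numpy as np


def nearest_neighbor_translation(W, X, Y):
    """Index t(i) of the target row closest to W x_i, for every source row."""
    mapped = X @ W.T                                   # rows are W x_i
    # Squared Euclidean distance between each mapped source and each target.
    d2 = (
        (mapped ** 2).sum(axis=1, keepdims=True)
        - 2.0 * mapped @ Y.T
        + (Y ** 2).sum(axis=1)
    )
    return d2.argmin(axis=1)


def hub_counts(translations, num_targets):
    """Number of source words that retrieve each target word."""
    return np.bincount(translations, minlength=num_targets)


# Toy data: the targets are a shuffled copy of the sources, W is the identity.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))
perm = rng.permutation(6)
Y = X[perm]
t = nearest_neighbor_translation(np.eye(3), X, Y)
counts = hub_counts(t, num_targets=6)
```

In this toy case every target is retrieved exactly once. On real embeddings, a few entries of `counts` are typically much larger than the rest, which is the hubness effect described above.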

Word translation as a retrieval task
In this section, we propose to directly include the CSLS criterion in the model in order to make learning and inference consistent. We also show how to incorporate unsupervised information.
The CSLS criterion is a similarity measure between the vectors $x$ and $y$ defined as:

$$\mathrm{CSLS}(x, y) = -2\cos(x, y) + \frac{1}{k} \sum_{y' \in \mathcal{N}_Y(x)} \cos(x, y') + \frac{1}{k} \sum_{x' \in \mathcal{N}_X(y)} \cos(x', y),$$

where $\mathcal{N}_Y(x)$ is the set of $k$ nearest neighbors of the point $x$ in the set of target word vectors $Y = \{y_1, \dots, y_N\}$, $\mathcal{N}_X(y)$ is defined symmetrically over the source word vectors, and $\cos$ is the cosine similarity. Note that the second term in the expression of the CSLS loss does not change the neighbors of $x$. However, it gives a loss function that is symmetrical with respect to its two arguments, which is a desirable property for training.
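To make the definition concrete, the sketch below computes the CSLS loss between every pair of rows of two matrices and uses it for retrieval. It is a minimal NumPy illustration, not the fastText implementation; the function names and the random data are hypothetical, and the loss convention (lower is better) follows the expression given above.

```python
import numpy as np


def csls_loss_matrix(X_mapped, Y, k=10):
    """CSLS loss between every row of X_mapped and every row of Y."""
    Xn = X_mapped / np.linalg.norm(X_mapped, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    cos = Xn @ Yn.T                                   # cos[i, j] = cos(x_i, y_j)
    # Mean cosine of each x_i to its k nearest targets (second term) ...
    r_x = np.sort(cos, axis=1)[:, -k:].mean(axis=1)
    # ... and of each y_j to its k nearest sources (third term).
    r_y = np.sort(cos, axis=0)[-k:, :].mean(axis=0)
    return -2.0 * cos + r_x[:, None] + r_y[None, :]


def csls_translation(X_mapped, Y, k=10):
    """Retrieve, for each source row, the target row with the lowest CSLS loss."""
    return csls_loss_matrix(X_mapped, Y, k).argmin(axis=1)


rng = np.random.default_rng(0)
A = rng.standard_normal((7, 4))
B = rng.standard_normal((9, 4))
L_ab = csls_loss_matrix(A, B, k=3)
L_ba = csls_loss_matrix(B, A, k=3)
t = csls_translation(A, B, k=3)
```

The term `r_x` is constant along each row, so it does not affect the `argmin` over targets; it is kept because it makes the loss symmetric, which can be checked by comparing `L_ab` with the transpose of `L_ba`.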
Objective function. Let us now write the optimization problem for learning the bilingual mapping with CSLS. At this stage, we follow previous work and constrain the linear mapping $W$ to belong to the set of orthogonal matrices $\mathcal{O}_d$.
Here, we also assume that word vectors are $\ell_2$-normalized. Under these assumptions, we have

$$\cos(W x_i, y_j) = x_i^\top W^\top y_j \quad \text{and} \quad \|y_j - W x_i\|_2^2 = 2 - 2\, x_i^\top W^\top y_j.$$

Therefore, finding the $k$ nearest neighbors of $W x_i$ among the elements of $Y$ is equivalent to finding the $k$ elements of $Y$ which have the largest dot product with $W x_i$. We adopt this equivalent formulation because it leads to a convex formulation when relaxing the orthogonality constraint on $W$. In summary, our optimization problem with the Relaxed CSLS loss (RCSLS) is written as:

$$\min_{W \in \mathcal{O}_d} \frac{1}{n} \sum_{i=1}^{n} \Big( -2\, x_i^\top W^\top y_i + \frac{1}{k} \sum_{y_j \in \mathcal{N}_Y(W x_i)} x_i^\top W^\top y_j + \frac{1}{k} \sum_{W x_j \in \mathcal{N}_X(y_i)} x_j^\top W^\top y_i \Big). \tag{4}$$

Convex relaxation. Eq. (4) involves the minimization of a non-smooth cost function over the manifold of orthogonal matrices $\mathcal{O}_d$. As such, it can be solved using manifold optimization tools (Boumal et al., 2014). In this work, we consider as an alternative to the set $\mathcal{O}_d$ its convex hull $\mathcal{C}_d$, i.e., the unit ball of the spectral norm. We refer to this projection as the "Spectral" model. We also consider the case where these constraints on the alignment matrix are simply removed.
Having a convex domain allows us to reason about the convexity of the cost function. We observe that the second and third terms in the CSLS loss can be rewritten as follows:

$$\sum_{y_j \in \mathcal{N}_Y(W x_i)} x_i^\top W^\top y_j = \max_{S \in \mathcal{S}_k(n)} \sum_{j \in S} x_i^\top W^\top y_j,$$

where $\mathcal{S}_k(n)$ denotes the set of all subsets of $\{1, \dots, n\}$ of size $k$. This term, seen as a function of $W$, is a maximum of linear functions of $W$, which is convex (Boyd and Vandenberghe, 2004). This shows that our objective function is convex with respect to the mapping $W$ and piecewise linear (hence non-smooth). Note that our approach could be generalized to other loss functions by replacing the term $x_i^\top W^\top y_j$ by any function convex in $W$. We minimize this objective function over the convex set $\mathcal{C}_d$ by using the projected subgradient descent algorithm. The projection onto the set $\mathcal{C}_d$ is computed by taking the singular value decomposition (SVD) of the matrix and clipping the singular values at one.
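The projected subgradient scheme described above can be sketched as follows. This is a simplified NumPy illustration under stated assumptions, not the released fastText code: the seed vectors are the $\ell_2$-normalized rows of `X` and `Y`, the neighbors are searched among the seed vectors only, the step size and iteration count are arbitrary, and the function names are hypothetical.

```python
import numpy as np


def project_spectral_ball(W):
    """Project W onto C_d by clipping its singular values at one."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.minimum(s, 1.0)) @ Vt


def rcsls_loss_and_subgradient(W, X, Y, k=10):
    """RCSLS loss of Eq. (4) and one subgradient with respect to W.

    Rows of X and Y are the l2-normalized seed pairs (x_i, y_i).
    """
    n = X.shape[0]
    S = X @ W.T @ Y.T                       # S[i, j] = x_i^T W^T y_j
    loss = -2.0 * np.trace(S)
    grad = -2.0 * Y.T @ X                   # gradient of -2 sum_i y_i^T W x_i
    # Second term: for each x_i, the k targets with the largest dot product.
    idx_y = np.argpartition(-S, k - 1, axis=1)[:, :k]       # shape (n, k)
    loss += np.take_along_axis(S, idx_y, axis=1).sum() / k
    grad += Y[idx_y].sum(axis=1).T @ X / k
    # Third term: for each y_i, the k sources with the largest dot product.
    idx_x = np.argpartition(-S, k - 1, axis=0)[:k, :]       # shape (k, n)
    loss += np.take_along_axis(S, idx_x, axis=0).sum() / k
    grad += Y.T @ X[idx_x].sum(axis=0) / k
    return loss / n, grad / n


# Toy problem: the targets are an exact rotation of the sources.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))   # hidden orthogonal map
Y = X @ Q.T                                        # y_i = Q x_i, unit norm

W = np.eye(8)
for _ in range(100):
    loss, grad = rcsls_loss_and_subgradient(W, X, Y, k=10)
    W = project_spectral_ball(W - 1.0 * grad)      # step size chosen arbitrarily
```

Dropping the call to `project_spectral_ball` gives the unconstrained variant. Because the objective is piecewise linear, the loss is not guaranteed to decrease at every step, so in practice one would monitor a validation criterion rather than the training loss alone.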
Extended Normalization. Usually, the number of word pairs in the seed lexicon $n$ is small with respect to the size of the dictionaries $N$. To benefit from unlabeled data, it is common to add an iterative "refinement procedure" (Artetxe et al., 2017) when learning the translation model $W$. Given a model $W_t$, this procedure iterates over two steps. First, it augments the training lexicon by keeping the best-inferred translation in Eq. (3). Second, it learns a new mapping $W_{t+1}$ by solving the problem in Eq. (1). This strategy is similar to standard semi-supervised approaches where the training set is augmented over time. In this work, we propose to use the unpaired words in the dictionaries as "negatives" in the RCSLS loss: instead of computing the $k$ nearest neighbors $\mathcal{N}_Y(W x_i)$ amongst the annotated words $\{y_1, \dots, y_n\}$, we do it over the whole dictionary $\{y_1, \dots, y_N\}$.
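The difference between the restricted and the extended neighborhood can be illustrated with a short NumPy sketch. It only covers the target-side term, assumes the first `n` rows of the full target matrix are the seed words, and uses a hypothetical helper name `mean_topk_dot` with random data.

```python
import numpy as np


def mean_topk_dot(W, X_seed, Y_candidates, k=10):
    """(1/k) * sum of the k largest x_i^T W^T y_j over the candidate targets."""
    S = X_seed @ W.T @ Y_candidates.T
    topk = np.partition(S, -k, axis=1)[:, -k:]   # k largest entries per row
    return topk.mean(axis=1)


rng = np.random.default_rng(0)
n, N, d = 20, 100, 6                  # seed pairs, dictionary size, dimension
X_seed = rng.standard_normal((n, d))
Y_all = rng.standard_normal((N, d))   # first n rows play the role of seed targets
W = np.eye(d)

restricted = mean_topk_dot(W, X_seed, Y_all[:n], k=10)   # annotated words only
extended = mean_topk_dot(W, X_seed, Y_all, k=10)         # whole dictionary
```

Since the extended candidate set contains the restricted one, `extended` is never smaller than `restricted`: the unpaired words can only supply harder negatives for each seed word.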

Experiments
This section reports the main results obtained with our method. We provide complementary results and an ablation study in the appendix. We refer to our method without constraints as RCSLS and as RCSLS+spectral if the spectral constraints are used.

Implementation details
We choose a learning rate in {1, 10, 25, 50} and a number of epochs in {10, 20} on the validation set. For the unconstrained RCSLS, a small $\ell_2$ regularization can be added to prevent the norm of $W$ from diverging. In practice, we do not use any regularization. For the English-Chinese pairs (en-zh), we center the word vectors. The number of nearest neighbors in the CSLS loss is 10. We use the $\ell_2$-normalized fastText word vectors by Bojanowski et al. (2017) trained on Wikipedia.

The MUSE benchmark
Table 1 reports the comparison of RCSLS with standard supervised and unsupervised approaches on 5 language pairs (in both directions) of the MUSE benchmark (Conneau et al., 2017). Every approach uses the Wikipedia fastText vectors, and supervision comes in the form of a lexicon composed of 5k words and their translations. Regardless of the relaxation, RCSLS outperforms the state of the art by, on average, 3 to 4% in accuracy. This shows the importance of using the same criterion during training and inference. Note that the refinement step ("refine") also uses CSLS to finetune the alignments but leads to a marginal gain for supervised methods.
Interestingly, RCSLS achieves a better performance without constraints (+0.8%) for all pairs. Contrary to observations made in previous works, this result suggests that preserving the distance between word vectors is not essential for word translation. Previous works used an $\ell_2$ loss, for which orthogonal constraints do lead to an improvement of +5.3% (Procrustes versus Least Square Error). This suggests that a linear mapping $W$ with no constraints works well only if it is learned with a proper criterion.
Impact of extended normalization. Table 2 reports the gain brought by including words not in the lexicon (unannotated words) to the performance of RCSLS. Extending the dictionary significantly improves the performance on all language pairs.

[Table; columns: Method, en-es, es-en, en-fr, fr-en, en-de, de-en, en-ru, ru-en, en-zh, zh-en, avg.; values not recovered. Surviving caption text:] Conneau et al. (2017). All the methods use the CSLS criterion for retrieval. "Refine" is the refinement step of Conneau et al. (2017). Adversarial, ICP and Wasserstein Proc. are unsupervised (Conneau et al., 2017; Hoshen and Wolf, 2018).

[Table; columns: en-es, en-fr, en-de, en-ru, avg.; values and caption not recovered.]

[Table; values not recovered. Surviving caption text:] Dinu et al. (2014). "Adversarial" is an unsupervised technique. The adversarial and Procrustes results are from Conneau et al. (2017). We use a CSLS criterion for retrieval.

The WaCky dataset
Dinu et al. (2014) introduce a setting where word vectors are learned on the WaCky datasets (Baroni et al., 2009) and aligned with a noisy bilingual lexicon. We select the number of epochs within {1, 2, 5, 10} on a validation set. Table 3 shows that RCSLS is on par with the state of the art. RCSLS is thus robust to relatively poor word vectors and noisy lexicons.

Comparison with existing aligned vectors
Recently, word vectors based on fastText have been aligned and released by Smith et al. (2017, BabylonPartners, BP) and Conneau et al. (2017, MUSE). Both use a variation of Procrustes to align word vectors in the same space. We compare these methods to RCSLS and report results in Table 5. The gap between RCSLS and the other methods is higher with a NN criterion, suggesting that RCSLS imports some of the properties of CSLS to the dot product between aligned vectors.

Impact on word vectors
Non-orthogonal mapping of word vectors changes their dot products. We evaluate the impact of this mapping on word analogy tasks (Mikolov et al., 2013a). In Table 4, we report the accuracy on analogies for raw word vectors and our vectors mapped to English with an alignment trained on the full MUSE training set. Regardless of the source language, the mapping does not negatively impact the word vectors. Similarly, our alignment also has little impact on word similarity, as shown in Table 6. We confirm this observation by running the reverse mapping, i.e., by mapping the English word vectors of Mikolov et al. (2018) to Spanish. It leads to an improvement of 1% both for vectors trained on Common Crawl (85% to 86%) and Wikipedia + News (87% to 88%).

Conclusion
This paper shows that minimizing a convex relaxation of the CSLS loss significantly improves the quality of bilingual word vector alignment. We use a reformulation of CSLS that generalizes to convex functions beyond dot products and provides a single end-to-end training that is consistent with the inference stage. Finally, we show that removing the orthogonality constraint does not degrade the quality of the aligned vectors.