A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings

Recent work has managed to learn cross-lingual word embeddings without parallel data by mapping monolingual embeddings to a shared space through adversarial training. However, their evaluation has focused on favorable conditions, using comparable corpora or closely-related languages, and we show that they often fail in more realistic scenarios. This work proposes an alternative approach based on a fully unsupervised initialization that explicitly exploits the structural similarity of the embeddings, and a robust self-learning algorithm that iteratively improves this solution. Our method succeeds in all tested scenarios and obtains the best published results in standard datasets, even surpassing previous supervised systems. Our implementation is released as an open source project at https://github.com/artetxem/vecmap.


Introduction
Cross-lingual embedding mappings have been shown to be an effective way to learn bilingual word embeddings (Mikolov et al., 2013). The underlying idea is to independently train the embeddings in different languages using monolingual corpora, and then map them to a shared space through a linear transformation. This makes it possible to learn high-quality cross-lingual representations without expensive supervision, opening new research avenues like unsupervised neural machine translation (Artetxe et al., 2018b). While most embedding mapping methods rely on a small seed dictionary, adversarial training has recently produced exciting results in fully unsupervised settings (Zhang et al., 2017a,b). However, their evaluation has focused on particularly favorable conditions, limited to closely-related languages or comparable Wikipedia corpora. When tested on more realistic scenarios, we find that they often fail to produce meaningful results. For instance, none of the existing methods works in the standard English-Finnish dataset from Artetxe et al. (2017), obtaining translation accuracies below 2% in all cases (see Section 5).
On another strand of work, Artetxe et al. (2017) showed that an iterative self-learning method is able to bootstrap a high quality mapping from very small seed dictionaries (as little as 25 pairs of words). However, their analysis reveals that the self-learning method gets stuck in poor local optima when the initial solution is not good enough, thus failing for smaller training dictionaries.
In this paper, we follow this second approach and propose a new unsupervised method to build an initial solution without the need for a seed dictionary, based on the observation that, given the similarity matrix of all words in the vocabulary, each word has a different distribution of similarity values. Two equivalent words in different languages should have a similar distribution, and we can use this fact to induce the initial set of word pairings (see Figure 1). We combine this initialization with a more robust self-learning method, which is able to start from the weak initial solution and iteratively improve the mapping. Coupled together, we provide a fully unsupervised cross-lingual mapping method that is effective in realistic settings, converges to a good solution in all cases tested, and sets a new state-of-the-art in bilingual lexicon extraction, even surpassing previous supervised methods.
Figure 1: Similarity distributions of three words (corresponding to the smoothed density estimates from the normalized square root of the similarity matrices as defined in Section 3.2). Equivalent translations (two and due) have more similar distributions than non-related words (two and cane, meaning dog). This observation is used to build an initial solution that is later improved through self-learning.

Related work
Cross-lingual embedding mapping methods work by independently training word embeddings in two languages, and then mapping them to a shared space using a linear transformation.
Most of these methods are supervised, and use a bilingual dictionary of a few thousand entries to learn the mapping. Existing approaches can be classified into regression methods, which map the embeddings in one language using a least-squares objective (Mikolov et al., 2013; Shigeto et al., 2015), canonical methods, which map the embeddings in both languages to a shared space using canonical correlation analysis and extensions of it (Faruqui and Dyer, 2014; Lu et al., 2015), orthogonal methods, which map the embeddings in one or both languages under the constraint of the transformation being orthogonal (Xing et al., 2015; Artetxe et al., 2016; Zhang et al., 2016; Smith et al., 2017), and margin methods, which map the embeddings in one language to maximize the margin between the correct translations and the rest of the candidates. Artetxe et al. (2018a) showed that many of them could be generalized as part of a multi-step framework of linear transformations.
A related research line is to adapt these methods to the semi-supervised scenario, where the training dictionary is much smaller and used as part of a bootstrapping process. While similar ideas were already explored for traditional count-based vector space models (Peirsman and Padó, 2010; Vulić and Moens, 2013), Artetxe et al. (2017) brought this approach to pre-trained low-dimensional word embeddings, which are more widely used nowadays. More concretely, they proposed a self-learning approach that alternates the mapping and dictionary induction steps iteratively, obtaining results that are comparable to those of supervised methods when starting with only 25 word pairs.
A practical approach for reducing the need for bilingual supervision is to design heuristics to build the seed dictionary. The role of the seed lexicon in learning cross-lingual embedding mappings is analyzed in depth by Vulić and Korhonen (2016), who propose using document-aligned corpora to extract the training dictionary. A more common approach is to rely on shared words and cognates (Peirsman and Padó, 2010; Smith et al., 2017), while Artetxe et al. (2017) go further and restrict themselves to shared numerals. However, while these approaches are meant to eliminate the need for bilingual data in practice, they also make strong assumptions on the writing systems of languages (e.g. that they all use a common alphabet or Arabic numerals).
Closer to our work, a recent line of fully unsupervised approaches drops these assumptions completely, and attempts to learn cross-lingual embedding mappings based on distributional information alone. For that purpose, existing methods rely on adversarial training.
This was first proposed by Miceli Barone (2016), who combines an encoder that maps source language embeddings into the target language, a decoder that reconstructs the source language embeddings from the mapped embeddings, and a discriminator that discriminates between the mapped embeddings and the true target language embeddings. Although the approach is promising, the author concludes that the model "is not competitive with other cross-lingual representation approaches". Zhang et al. (2017a) use a very similar architecture, but incorporate additional techniques like noise injection to aid training and report competitive results on bilingual lexicon extraction.  drop the reconstruction component, regularize the mapping to be orthogonal, and incorporate an iterative refinement process akin to self-learning, reporting very strong results on a large bilingual lexicon extraction dataset. Finally, Zhang et al. (2017b) adopt the earth mover's distance for training, optimized through a Wasserstein generative adversarial network followed by an alternating optimization procedure. However, all this previous work used comparable Wikipedia corpora in most experiments and, as shown in Section 5, faces difficulties in more challenging settings.

Proposed method
Let $X$ and $Z$ be the word embedding matrices in two languages, so that their $i$th rows $X_{i*}$ and $Z_{i*}$ denote the embeddings of the $i$th word in their respective vocabularies. Our goal is to learn the linear transformation matrices $W_X$ and $W_Z$ so that the mapped embeddings $XW_X$ and $ZW_Z$ are in the same cross-lingual space. At the same time, we aim to build a dictionary between both languages, encoded as a sparse matrix $D$ where $D_{ij} = 1$ if the $j$th word in the target language is a translation of the $i$th word in the source language.
Our proposed method consists of four sequential steps: a pre-processing that normalizes the embeddings (§3.1), a fully unsupervised initialization scheme that creates an initial solution (§3.2), a robust self-learning procedure that iteratively improves this solution (§3.3), and a final refinement step that further improves the resulting mapping through symmetric re-weighting (§3.4).

Embedding normalization
Our method starts with a pre-processing step that length normalizes the embeddings, then mean centers each dimension, and then length normalizes them again. The first two steps have been shown to be beneficial in previous work (Artetxe et al., 2016), while the second length normalization guarantees that the final embeddings have unit length. As a result, the dot product of any two embeddings is equivalent to their cosine similarity and directly related to their Euclidean distance (for unit vectors $u$ and $v$, $\|u - v\|^2 = 2 - 2\,u \cdot v$), and can be taken as a measure of their similarity.
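The three normalization steps can be sketched in NumPy as follows. This is a minimal illustration rather than the released implementation; the function names and the random toy matrix are assumptions made for the example.

```python
import numpy as np

def length_normalize(emb):
    """Scale each row (word vector) to unit Euclidean length."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows untouched
    return emb / norms

def mean_center(emb):
    """Subtract the per-dimension mean computed over the vocabulary."""
    return emb - emb.mean(axis=0, keepdims=True)

def normalize_embeddings(emb):
    """Length normalize, mean center each dimension, length normalize again."""
    return length_normalize(mean_center(length_normalize(emb)))

# Toy data (hypothetical): 100 words with 8 dimensions.
rng = np.random.default_rng(0)
X = normalize_embeddings(rng.normal(size=(100, 8)))
u, v = X[0], X[1]
dot = u @ v                              # equals the cosine, since rows have unit length
sq_dist = np.sum((u - v) ** 2)           # equals 2 - 2 * dot for unit vectors
```

After this step every row has unit norm, so a plain matrix product between two embedding matrices directly yields cosine similarities.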

Fully unsupervised initialization
The underlying difficulty of the mapping problem in its unsupervised variant is that the word embedding matrices $X$ and $Z$ are unaligned across both axes: neither the $i$th vocabulary items $X_{i*}$ and $Z_{i*}$ nor the $j$th dimensions of the embeddings $X_{*j}$ and $Z_{*j}$ are aligned, so there is no direct correspondence between both languages. In order to overcome this challenge and build an initial solution, we propose to first construct two alternative representations $X'$ and $Z'$ that are aligned across their $j$th dimensions $X'_{*j}$ and $Z'_{*j}$, which can later be used to build an initial dictionary that aligns their respective vocabularies.
Our approach is based on a simple idea: while the axes of the original embeddings $X$ and $Z$ are different in nature, both axes of their corresponding similarity matrices $M_X = XX^T$ and $M_Z = ZZ^T$ correspond to words, which can be exploited to reduce the mismatch to a single axis. More concretely, assuming that the embedding spaces are perfectly isometric, the similarity matrices $M_X$ and $M_Z$ would be equivalent up to a permutation of their rows and columns, where the permutation in question defines the dictionary across both languages. In practice, the isometry requirement will not hold exactly, but it can be assumed to hold approximately, as the very same problem of mapping two embedding spaces without supervision would otherwise be hopeless. Based on that, one could try every possible permutation of row and column indices to find the best match between $M_X$ and $M_Z$, but the resulting combinatorial explosion makes this approach intractable.
In order to overcome this problem, we propose to first sort the values in each row of $M_X$ and $M_Z$, resulting in matrices $\mathrm{sorted}(M_X)$ and $\mathrm{sorted}(M_Z)$. Under the strict isometry condition, equivalent words would get the exact same vector across languages, and thus, given a word and its row in $\mathrm{sorted}(M_X)$, one could apply nearest neighbor retrieval over the rows of $\mathrm{sorted}(M_Z)$ to find its corresponding translation.
On a final note, given the singular value decomposition $X = USV^T$, the similarity matrix is $M_X = US^2U^T$. As such, its square root $\sqrt{M_X} = USU^T$ is closer in nature to the original embeddings, and we also find it to work better in practice. We thus compute $\mathrm{sorted}(\sqrt{M_X})$ and $\mathrm{sorted}(\sqrt{M_Z})$ and normalize them as described in Section 3.1, yielding the two matrices $X'$ and $Z'$ that are later used to build the initial solution for self-learning (see Section 3.3).
In practice, the isometry assumption is strong enough so the above procedure captures some cross-lingual signal. In our English-Italian experiments, the average cosine similarity across the gold standard translation pairs is 0.009 for a random solution, 0.582 for the optimal supervised solution, and 0.112 for the mapping resulting from this initialization. While the latter is far from being useful on its own (the accuracy of the resulting dictionary is only 0.52%), it is substantially better than chance, and it works well as an initial solution for the self-learning method described next.
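The construction of $X'$ and $Z'$ described in this section can be illustrated with a small NumPy sketch. The function names and the toy data are assumptions made for the example: the second language is generated as a shuffled, rotated copy of the first, so the strict isometry condition holds exactly, which real embeddings only satisfy approximately.

```python
import numpy as np

def normalize(emb):
    """Length normalize, mean center each dimension, length normalize again (Section 3.1)."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    emb = emb - emb.mean(axis=0, keepdims=True)
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

def sorted_sqrt_similarity(emb):
    """Build the row-sorted, normalized square root of the similarity matrix."""
    # With emb = U S V^T, the square root of emb @ emb.T is U S U^T.
    u, s, _ = np.linalg.svd(emb, full_matrices=False)
    sim_sqrt = (u * s) @ u.T
    sim_sqrt.sort(axis=1)  # sort the values within each row, in place
    return normalize(sim_sqrt)

# Toy languages (hypothetical): Z is a shuffled and rotated copy of X.
rng = np.random.default_rng(0)
n, d = 30, 5
X = normalize(rng.normal(size=(n, d)))
perm = rng.permutation(n)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix
Z = X[perm] @ Q

X_prime = sorted_sqrt_similarity(X)
Z_prime = sorted_sqrt_similarity(Z)
# Nearest neighbor retrieval over the aligned representations.
induced = (Z_prime @ X_prime.T).argmax(axis=1)
```

In this idealized setting the retrieval recovers the hidden permutation exactly. With real embeddings the signal is much weaker, as the accuracy figures above indicate.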

Robust self-learning
Previous work has shown that self-learning can learn high-quality bilingual embedding mappings starting with as few as 25 word pairs (Artetxe et al., 2017). In this method, training iterates through the following two steps until convergence:
1. Compute the optimal orthogonal mapping maximizing the similarities for the current dictionary $D$:
$$\underset{W_X, W_Z}{\arg\max} \sum_i \sum_j D_{ij} \left( (X_{i*} W_X) \cdot (Z_{j*} W_Z) \right)$$
An optimal solution is given by $W_X = U$ and $W_Z = V$, where $USV^T = X^T D Z$ is the singular value decomposition of $X^T D Z$.
2. Compute the optimal dictionary over the similarity matrix of the mapped embeddings $X W_X W_Z^T Z^T$. This typically uses nearest neighbor retrieval from the source language into the target language, so $D_{ij} = 1$ if $j = \arg\max_k (X_{i*} W_X) \cdot (Z_{k*} W_Z)$ and $D_{ij} = 0$ otherwise.
The underlying optimization objective is independent from the initial dictionary, and the algorithm is guaranteed to converge to a local optimum of it. However, the method does not work if starting from a completely random solution, as it tends to get stuck in poor local optima in that case.
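The two alternating steps above can be sketched as follows. This is a simplified illustration, not the released code: the dictionary is stored as two index arrays rather than a sparse matrix, the stopping rule (objective improvement below a threshold) is an assumption, and the toy data is exactly isometric so that a 10-pair seed is enough.

```python
import numpy as np

def orthogonal_mapping(X, Z, src_idx, trg_idx):
    """Step 1: W_X = U and W_Z = V, where U S V^T is the SVD of X^T D Z.

    D holds a 1 at every (src_idx[k], trg_idx[k]), so X^T D Z
    equals X[src_idx].T @ Z[trg_idx].
    """
    u, _, vt = np.linalg.svd(X[src_idx].T @ Z[trg_idx])
    return u, vt.T

def induce_dictionary(X, Z, W_X, W_Z):
    """Step 2: nearest neighbor retrieval from source into target."""
    sim = (X @ W_X) @ (Z @ W_Z).T
    return np.arange(X.shape[0]), sim.argmax(axis=1)

def objective(X, Z, W_X, W_Z, src_idx, trg_idx):
    """Sum of similarities over the current dictionary entries."""
    return float(np.sum((X[src_idx] @ W_X) * (Z[trg_idx] @ W_Z)))

def self_learning(X, Z, src_idx, trg_idx, max_iter=100, eps=1e-6):
    best = -np.inf
    for _ in range(max_iter):
        W_X, W_Z = orthogonal_mapping(X, Z, src_idx, trg_idx)
        obj = objective(X, Z, W_X, W_Z, src_idx, trg_idx)
        if obj - best <= eps:  # assumed stopping rule for this sketch
            break
        best = obj
        src_idx, trg_idx = induce_dictionary(X, Z, W_X, W_Z)
    return W_X, W_Z, src_idx, trg_idx

# Toy data (hypothetical): Z is a shuffled and rotated copy of X.
rng = np.random.default_rng(0)
n, d = 40, 5
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
perm = rng.permutation(n)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
Z = X[perm] @ Q
gold_trg = np.argsort(perm)  # Z[gold_trg[i]] is the translation of X[i]
W_X, W_Z, src_idx, trg_idx = self_learning(X, Z, np.arange(10), gold_trg[:10])
```

On real embeddings the mapping is only approximately orthogonal, so the loop needs a reasonable initial dictionary to avoid poor local optima.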
For that reason, we use the unsupervised initialization procedure from Section 3.2 to build an initial solution. However, simply combining both methods did not work in our preliminary experiments, as the quality of this initial solution is not good enough to avoid poor local optima. For that reason, we next propose some key improvements in the dictionary induction step to make self-learning more robust and learn better mappings:
• Stochastic dictionary induction. In order to encourage a wider exploration of the search space, we make the dictionary induction stochastic by randomly keeping some elements in the similarity matrix with probability $p$ and setting the remaining ones to 0. As a consequence, the smaller the value of $p$ is, the more the induced dictionary will vary from iteration to iteration, thus making it possible to escape poor local optima. So as to find a fine-grained solution once the algorithm gets into a good region, we increase this value during training akin to simulated annealing, starting with $p = 0.1$ and doubling this value every time the objective function at step 1 above does not improve more than $\epsilon = 10^{-6}$ for 50 iterations.
• Frequency-based vocabulary cutoff. The size of the similarity matrix grows quadratically with respect to that of the vocabularies. This not only increases the cost of computing it, but also makes the number of possible solutions grow exponentially, presumably making the optimization problem harder. Given that less frequent words can be expected to be noisier, we propose to restrict the dictionary induction process to the $k$ most frequent words in each language, where we find $k = 20{,}000$ to work well in practice.
• CSLS retrieval.  showed that nearest neighbor suffers from the hubness problem. This phenomenon is known to occur as an effect of the curse of dimensionality, and causes a few points (known as hubs) to be nearest neighbors of many other points (Radovanović et al., 2010a,b). Among the existing solutions to penalize the similarity score of hubs, we adopt the Cross-domain Similarity Local Scaling (CSLS) from . Given two mapped embeddings $x$ and $y$, the idea of CSLS is to compute $r_T(x)$ and $r_S(y)$, the average cosine similarity of $x$ and $y$ to their $k$ nearest neighbors in the other language, respectively. Having done that, the corrected score is $\mathrm{CSLS}(x, y) = 2\cos(x, y) - r_T(x) - r_S(y)$. Following the authors, we set $k = 10$.
• Bidirectional dictionary induction. When the dictionary is induced from the source into the target language, not all target language words will be present in it, and some will occur multiple times. We argue that this might accentuate the problem of local optima, as repeated words might act as strong attractors from which it is difficult to escape. In order to mitigate this issue and encourage diversity, we propose inducing the dictionary in both directions and taking their corresponding concatenation, so $D = D_{X \to Z} + D_{Z \to X}$.
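The dictionary induction improvements listed above (stochastic zeroing, CSLS retrieval and bidirectional induction) can be combined in a short NumPy sketch. The function names are hypothetical, the inputs are assumed to be already mapped, unit-length and restricted to the most frequent words, and applying the random mask to the CSLS scores rather than to the raw similarities is an assumption of this example.

```python
import numpy as np

def csls_scores(sim, k=10):
    """CSLS(x, y) = 2 cos(x, y) - r_T(x) - r_S(y), with sim[i, j] = cos(x_i, y_j)."""
    # r_T(x): mean similarity of each source word to its k nearest target words.
    r_t = np.sort(sim, axis=1)[:, -k:].mean(axis=1)
    # r_S(y): mean similarity of each target word to its k nearest source words.
    r_s = np.sort(sim, axis=0)[-k:, :].mean(axis=0)
    return 2 * sim - r_t[:, None] - r_s[None, :]

def induce_direction(scores, keep_prob, rng):
    """Keep each score with probability keep_prob, zero the rest, take the row-wise best."""
    mask = rng.random(scores.shape) < keep_prob
    return np.where(mask, scores, 0.0).argmax(axis=1)

def bidirectional_dictionary(xw, zw, keep_prob=1.0, k=10, rng=None):
    """Return index arrays (src_idx, trg_idx) encoding D = D_{X->Z} + D_{Z->X}."""
    rng = np.random.default_rng() if rng is None else rng
    scores = csls_scores(xw @ zw.T, k)
    fwd = induce_direction(scores, keep_prob, rng)    # source -> target
    bwd = induce_direction(scores.T, keep_prob, rng)  # target -> source
    n_src, n_trg = scores.shape
    src_idx = np.concatenate([np.arange(n_src), bwd])
    trg_idx = np.concatenate([fwd, np.arange(n_trg)])
    return src_idx, trg_idx
```

Because the CSLS score is symmetric in its two arguments, the backward direction reuses the transposed score matrix. A pair found in both directions simply appears twice in the concatenated dictionary, which matches the sum of the two directional dictionaries.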
In order to build the initial dictionary, we compute $X'$ and $Z'$ as detailed in Section 3.2 and apply the above procedure over them. As the only difference, this first solution does not use the stochastic zeroing in the similarity matrix, as there is no need to encourage diversity ($X'$ and $Z'$ are only used once), and the threshold for vocabulary cutoff is set to $k = 4{,}000$, so $X'$ and $Z'$ can fit in memory.
Having computed the initial dictionary, X ′ and Z ′ are discarded, and the remaining iterations are performed over the original embeddings X and Z.
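The annealing schedule for the keep probability $p$ used in the remaining iterations can be sketched as a small state update. The function name is hypothetical, and two details are assumptions not stated in the text: $p$ is capped at 1, and training is taken to have converged once the objective stalls with $p$ already at 1.

```python
def update_keep_prob(keep_prob, objective, best_objective, stale_iters,
                     eps=1e-6, patience=50):
    """Return (keep_prob, best_objective, stale_iters, converged) after one iteration."""
    if objective - best_objective > eps:
        # The objective improved: record it and reset the stall counter.
        return keep_prob, objective, 0, False
    stale_iters += 1
    if stale_iters < patience:
        return keep_prob, best_objective, stale_iters, False
    if keep_prob >= 1.0:
        # Assumed stopping rule: nothing left to anneal.
        return keep_prob, best_objective, stale_iters, True
    # Double p (capped at 1, an assumption) and restart the stall counter.
    return min(1.0, 2 * keep_prob), best_objective, 0, False
```

Starting from $p = 0.1$, this schedule visits 0.2, 0.4, 0.8 and finally 1.0, at which point the dictionary induction becomes deterministic.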

Symmetric re-weighting
As part of their multi-step framework, Artetxe et al. (2018a) showed that re-weighting the target language embeddings according to the cross-correlation in each component greatly improved the quality of the induced dictionary. Given the singular value decomposition $USV^T = X^T D Z$, this is equivalent to taking $W_X = U$ and $W_Z = VS$, where $X$ and $Z$ are previously whitened applying the linear transformations $(X^T X)^{-1/2}$ and $(Z^T Z)^{-1/2}$, and later de-whitened applying $U^T (X^T X)^{1/2} U$ and $V^T (Z^T Z)^{1/2} V$. However, re-weighting also accentuates the problem of local optima when incorporated into self-learning as, by increasing the relevance of dimensions that best match for the current solution, it discourages exploring other regions of the search space. For that reason, we propose using it as a final step once self-learning has converged to a good solution. Unlike Artetxe et al. (2018a), we apply re-weighting symmetrically in both languages, taking $W_X = U S^{1/2}$ and $W_Z = V S^{1/2}$. This approach is neutral in the direction of the mapping, and gives good results as shown in our experiments.
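The whitening, symmetric re-weighting and de-whitening steps can be sketched as follows. This follows one reading of the description above (whiten, map and re-weight, then de-whiten), with the whitening matrices computed over the full embedding matrices as written in the text. The function names are hypothetical, and the sketch assumes $X^T X$ and $Z^T Z$ are full rank.

```python
import numpy as np

def sym_power(mat, power):
    """Raise a symmetric positive-definite matrix to a real power via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * vals ** power) @ vecs.T

def symmetric_reweighting(X, Z, src_idx, trg_idx):
    """Map X and Z with W_X = U S^(1/2) and W_Z = V S^(1/2) between whitening and de-whitening."""
    # Whitening with (X^T X)^(-1/2) and (Z^T Z)^(-1/2).
    xw = X @ sym_power(X.T @ X, -0.5)
    zw = Z @ sym_power(Z.T @ Z, -0.5)
    # SVD of X^T D Z over the whitened embeddings, with D given by index pairs.
    u, s, vt = np.linalg.svd(xw[src_idx].T @ zw[trg_idx])
    v = vt.T
    # Orthogonal mapping followed by symmetric re-weighting.
    xw = (xw @ u) * np.sqrt(s)
    zw = (zw @ v) * np.sqrt(s)
    # De-whitening with U^T (X^T X)^(1/2) U and V^T (Z^T Z)^(1/2) V.
    xw = xw @ (u.T @ sym_power(X.T @ X, 0.5) @ u)
    zw = zw @ (v.T @ sym_power(Z.T @ Z, 0.5) @ v)
    return xw, zw
```

Both languages receive the same $S^{1/2}$ scaling, so swapping the roles of source and target yields the same shared space, which is the sense in which the step is neutral in the direction of the mapping.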

Experimental settings
Following common practice, we evaluate our method on bilingual lexicon extraction, which measures the accuracy of the induced dictionary in comparison to a gold standard.
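As a rough illustration of this metric, the sketch below scores an induced dictionary against a gold standard. The function name and the toy entries are hypothetical, and two details are assumptions of this example rather than statements from the paper: a prediction counts as correct if it matches any gold translation of the source word, and source words without a prediction are skipped.

```python
def lexicon_accuracy(predictions, gold):
    """Percentage of gold source words whose predicted translation is acceptable.

    predictions: dict mapping a source word to its single predicted translation.
    gold: dict mapping a source word to the set of acceptable translations.
    """
    covered = [word for word in gold if word in predictions]
    if not covered:
        return 0.0
    correct = sum(predictions[word] in gold[word] for word in covered)
    return 100.0 * correct / len(covered)

# Hypothetical toy entries for an English-Italian dictionary.
gold = {"two": {"due"}, "dog": {"cane"}}
predictions = {"two": "due", "dog": "gatto"}
accuracy = lexicon_accuracy(predictions, gold)
```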
As discussed before, previous evaluation has focused on favorable conditions. In particular, existing unsupervised methods have almost exclusively been tested on Wikipedia corpora, which are comparable rather than monolingual, exposing a strong cross-lingual signal that is not available in strictly unsupervised settings. In addition to that, some datasets comprise unusually small embeddings, with only 50 dimensions and around 5,000-10,000 vocabulary items (Zhang et al., 2017a,b). As the only exception,  report positive results on the English-Italian dataset of  in addition to their main experiments, which are carried out on Wikipedia. While this dataset does use strictly monolingual corpora, it still corresponds to a pair of two relatively close Indo-European languages.
In order to get a wider picture of how our method compares to previous work in different conditions, including more challenging settings, we carry out our experiments in the widely used dataset of  and the subsequent extensions of Artetxe et al. (2017, 2018a), which together comprise English-Italian, English-German, English-Finnish and English-Spanish. More concretely, the dataset consists of 300-dimensional CBOW embeddings trained on WacKy crawling corpora (English, Italian, German), Common Crawl (Finnish) and WMT News Crawl (Spanish). The gold standards were derived from dictionaries built from Europarl word alignments and available at OPUS (Tiedemann, 2012), split into a test set of 1,500 entries and a training set of 5,000 that we do not use in our experiments. The datasets are freely available. As a non-European agglutinative language, the English-Finnish pair is particularly challenging due to the linguistic distance between them. For completeness, we also test our method in the Spanish-English, Italian-English and Turkish-English datasets of Zhang et al. (2017a), which consist of 50-dimensional CBOW embeddings trained on Wikipedia, as well as gold standard dictionaries 4 from Open Multilingual WordNet (Spanish-English and Italian-English) and Google Translate (Turkish-English). The lower dimensionality and comparable corpora make for an easier scenario, although it also contains a challenging pair of distant languages (Turkish-English).
Our method is implemented in Python using NumPy and CuPy. Together with it, we also test the methods of Zhang et al. (2017a) and  using the publicly available implementations from the authors 5. Given that Zhang et al. (2017a) report using a different value of their hyperparameter λ for different language pairs (λ = 10 for English-Turkish and λ = 1 for the rest), we test both values in all our experiments to better understand its effect. In the case of , we test both the default hyperparameters in the source code as well as those reported in the paper, with iterative refinement activated in both cases. Given the instability of these methods, we perform 10 runs for each, and report the best and average accuracies, the number of successful runs (those with >5% accuracy) and the average runtime. All the experiments were run on a single Nvidia Titan Xp.
Table 1: Results on the dataset of Zhang et al. (2017a). We perform 10 runs for each method and report the best and average accuracies (%), the number of successful runs (those with >5% accuracy) and the average runtime (minutes).
Table 2: Results on the dataset of  and the extensions of Artetxe et al. (2017, 2018a), with columns best, avg, s and t for each of EN-IT, EN-DE, EN-FI and EN-ES. We perform 10 runs for each method and report the best and average accuracies (%), the number of successful runs (those with >5% accuracy) and the average runtime (minutes).
4 The test dictionaries were obtained through personal communication with the authors. The rest of the language pairs were left out due to licensing issues.
5 Despite our efforts, Zhang et al. (2017b) was left out because: 1) it does not create a one-to-one dictionary, thus hindering direct comparison, 2) it depends on expensive proprietary software, and 3) its computational cost is orders of magnitude higher (running the experiments would have taken several months).

Results and discussion
We first present the main results ( §5.1), then the comparison to the state-of-the-art ( §5.2), and finally ablation tests to measure the contribution of each component ( §5.3).

Main results
We report the results in the dataset of Zhang et al. (2017a) in Table 1. As can be seen, the proposed method performs on par with that of  both in Spanish-English and Italian-English, but gets substantially better results in the more challenging Turkish-English pair. While we are able to reproduce the results reported by Zhang et al. (2017a), their method gets the worst results of all by a large margin. Another disadvantage of that model is that different language pairs require different hyperparameters: λ = 1 works substantially better for Spanish-English and Italian-English, but only λ = 10 works for Turkish-English.
Table 3: Accuracy (%) of the proposed method in comparison with previous work. * Results obtained with the official implementation from the authors. † Results obtained with the framework from Artetxe et al. (2018a). The remaining results were reported in the original papers. For methods that do not require supervision, we report the average accuracy across 10 runs. ‡ For meaningful comparison, runs with <5% accuracy are excluded when computing the average, but note that, unlike ours, their method often gives a degenerate solution (see Table 2).
The results for the more challenging dataset from  and the extensions of Artetxe et al. (2017, 2018a) are given in Table 2. In this case, our proposed method obtains the best results in all metrics for all the four language pairs tested. The method of Zhang et al. (2017a) does not work at all in this more challenging scenario, which is in line with the negative results reported by the authors themselves for similar conditions (only 2.53% accuracy in their large Gigaword dataset). The method of  also fails for English-Finnish (only 1.62% in the best run), although it is able to get positive results in some runs for the rest of language pairs. Between the two configurations tested, the default hyperparameters in the code show a more stable behavior.
These results confirm the robustness of the proposed method. While the other systems succeed in some runs and fail in others, our method converges to a good solution in all runs without exception and, in fact, it is the only one getting positive results for English-Finnish. In addition to being more robust, our method also obtains substantially better accuracies, surpassing previous methods by at least 1-3 points in all but the easiest pairs. Moreover, our method is not sensitive to hyperparameters that are difficult to tune without a development set, which is critical in realistic unsupervised conditions.
At the same time, our method is significantly faster than the rest. In relation to that, it is interesting that, while previous methods perform a fixed number of iterations and take practically the same time for all the different language pairs, the runtime of our method adapts to the difficulty of the task thanks to the dynamic convergence criterion of our stochastic approach. This way, our method tends to take longer for more challenging language pairs (1.7 vs 0.6 minutes for es-en and tr-en in one dataset, and 12.9 vs 7.3 minutes for en-fi and en-de in the other) and, in fact, our (relative) execution times correlate surprisingly well with the linguistic distance with English (closest/fastest is German, followed by Italian/Spanish, followed by Turkish/Finnish).
Table 4: Ablation test on the dataset of  and the extensions of Artetxe et al. (2017, 2018a). We perform 10 runs for each method and report the best and average accuracies (%), the number of successful runs (those with >5% accuracy) and the average runtime (minutes).

Comparison with the state-of-the-art
Table 3 shows the results of the proposed method in comparison to previous systems, including those with different degrees of supervision. We focus on the widely used English-Italian dataset of  and its extensions. Despite being fully unsupervised, our method achieves the best results in all language pairs but one, even surpassing previous supervised approaches. The only exception is English-Finnish, where Artetxe et al. (2018a) get marginally better results with a difference of 0.3 points, yet ours is the only unsupervised system that works for this pair. At the same time, it is remarkable that the proposed system gets substantially better results than Artetxe et al. (2017), the only other system based on self-learning, with the additional advantage of being fully unsupervised.

Ablation test
In order to better understand the role of different aspects in the proposed system, we perform an ablation test, where we separately analyze the effect of initialization, the different components of our robust self-learning algorithm, and the final symmetric re-weighting. The obtained results are reported in Table 4. In concordance with previous work, our results show that self-learning does not work with random initialization. However, the proposed unsupervised initialization is able to overcome this issue without the need for any additional information, performing on par with other character-level heuristics that we tested (e.g. shared numerals).
As for the different self-learning components, we observe that the stochastic dictionary induction is necessary to overcome the problem of poor local optima for English-Finnish, although it does not make any difference for the rest of the easier language pairs. The frequency-based vocabulary cutoff also has a positive effect, yielding slightly better accuracies and much faster runtimes. At the same time, CSLS plays a critical role in the system, as hubness severely accentuates the problem of local optima in its absence. The bidirectional dictionary induction is also beneficial, contributing to the robustness of the system as shown by English-Finnish and yielding better accuracies in all cases.
Finally, these results also show that symmetric re-weighting contributes positively, bringing an improvement of around 1-2 points without any cost in the execution time.

Conclusions
In this paper, we show that previous unsupervised mapping methods (Zhang et al., 2017a) often fail in realistic scenarios involving non-comparable corpora and/or distant languages. In contrast to adversarial methods, we propose to use an initial weak mapping that exploits the structure of the embedding spaces in combination with a robust self-learning approach. The results show that our method succeeds in all cases, providing the best results with respect to all previous work on unsupervised and supervised mappings.
The ablation analysis shows that our initial solution is instrumental for making self-learning work without supervision. In order to make self-learning robust, we also added stochasticity to dictionary induction, used CSLS instead of nearest neighbor, and produced bidirectional dictionaries. Results also improved using smaller intermediate vocabularies and re-weighting the final solution. Our implementation is available as an open source project at https://github.com/artetxem/vecmap.
In the future, we would like to extend the method from the bilingual to the multilingual scenario, and go beyond the word level by incorporating embeddings of longer phrases.