A Simple and Effective Approach to Robust Unsupervised Bilingual Dictionary Induction

Unsupervised Bilingual Dictionary Induction methods based on initialization and self-learning have achieved great success on similar language pairs, e.g., English-Spanish. But they still fail, with an accuracy of 0%, on many distant language pairs, e.g., English-Japanese. In this work, we show that this failure results from the gap between the actual initialization performance and the minimum initialization performance required for the self-learning to succeed. We propose Iterative Dimension Reduction to bridge this gap. Our experiments show that this simple method does not hamper the performance on similar language pairs and achieves an accuracy of 13.64∼55.53% between English and four distant languages, i.e., Chinese, Japanese, Vietnamese and Thai.


Introduction
Unsupervised Bilingual Dictionary Induction (UBDI) is a task that aims to find word translations given the monolingual word embeddings of two languages. Recent UBDI methods have shown promising results on similar language pairs such as English-Spanish (Artetxe et al., 2017; Lample et al., 2018; Zhou et al., 2019; Ren et al., 2020). These methods are mostly based on initialization and self-learning. The initialization first constructs a dictionary from the word embeddings, then the self-learning starts with this dictionary and alternates between refining the source-target word embedding mapping and inducing a new dictionary with this mapping.
Despite the success of UBDI, recent work has questioned the robustness of UBDI methods on distant language pairs (Søgaard et al., 2018; Vulić et al., 2019; Glavas et al., 2019), e.g., English-Japanese. They show that even the most robust system, VecMap (Artetxe et al., 2018), still fails with an accuracy of 0% on 87 out of 210 distant language pairs (Vulić et al., 2019). To be consistent with Artetxe et al. (2018), we say that a system 'succeeds' when its accuracy is above 5% and 'fails' otherwise.
Previous work has investigated how different properties of languages affect UBDI performance (Søgaard et al., 2018). In this paper, we go a step further and inspect which part of VecMap breaks down. Using a novel similarity metric to evaluate the initialization performance, we observe, in distant language pairs, a gap between the actual initialization performance and the minimum initialization performance required for the self-learning to succeed. We find that dimension reduction is very effective in bridging this gap. Therefore, we propose Iterative Dimension Reduction (IDR) to improve the robustness of VecMap while avoiding the performance loss caused by dimension reduction. IDR first reduces the dimension of the word embeddings and performs unsupervised learning on them. Then it initializes the self-learning on larger-dimension embeddings using this learned system. This simple dimension reduction removes unimportant or noisy features, making it easier for the algorithm to find a proper solution for distant language pairs.

Entry   En-Zh  Zh-En  En-Ja  Ja-En  En-Vi  Vi-En  En-Th  Th-En
        150    74     74     97     137    110    147    154
Table 1: VecMap accuracy, the maximum accuracy of using a seed dictionary to initialize the self-learning, and the minimum seed dictionary size to obtain that accuracy (we start with 10 pairs to estimate this size and add 10 pairs if the maximum accuracy is not achieved; results are averaged over 3 runs).
We evaluate our approach on four similar European language pairs, English-{Spanish, French, Italian, German} (En-{Es, Fr, It, De}), and four distant language pairs, English-{Chinese, Japanese, Vietnamese, Thai} (En-{Zh, Ja, Vi, Th}). Our method not only performs close to the VecMap baseline on similar language pairs but also succeeds on all distant language pairs. On the four distant language pairs, our method has an accuracy of 13.64∼55.53%, whereas the VecMap baseline has an accuracy of 0% in most cases, as shown in the first row of Table 1.

The VecMap Method
VecMap (Artetxe et al., 2018) learns weights $W_X$ and $W_Y$ for the source and target word embeddings $X$ and $Y$ and maps them to the same space for inducing the dictionary. It consists of two components:
• Initialization. The initialization first computes $M_X = XX^T$ and $M_Y = YY^T$. Then each row of $M_X$ and $M_Y$ is sorted and an initial dictionary $D$ is induced by searching for nearest neighbors between the rows of $\sqrt{M_X}$ and $\sqrt{M_Y}$.
• Self-learning. With the initial dictionary $D$, the self-learning iterates the following two steps:
  - It finds $W_X$ and $W_Y$ that maximize $\sum_i \sum_j D_{ij} ((X_{i*} W_X) \cdot (Y_{j*} W_Y))$ for the current dictionary $D$, where $D_{ij} = 1$ if the $j$-th target word is the translation of the $i$-th source word and 0 otherwise. An optimal solution is given by $W_X = U$ and $W_Y = V$, where $USV^T = X^T D Y$ is the singular value decomposition of $X^T D Y$;
  - A new dictionary $D$ is induced by using the CSLS retrieval (Lample et al., 2018) to extract nearest neighbors in the similarity matrix $P = X W_X W_Y^T Y^T$. To avoid being trapped in a poor local optimum and to encourage the exploration of possible word translations, similarity scores in $P$ are kept with a probability $p$ and set to 0 otherwise.
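The two self-learning steps above can be sketched in NumPy as follows. This is a minimal illustration rather than the VecMap implementation: plain nearest-neighbor retrieval stands in for CSLS, and the function names and the `keep_prob` parameter (playing the role of $p$) are assumptions made for the example.

```python
import numpy as np

def learn_mapping(X, Y, D):
    """Orthogonal W_X, W_Y maximizing sum_ij D_ij (X_i W_X) . (Y_j W_Y)."""
    U, _, Vt = np.linalg.svd(X.T @ D @ Y)  # U S V^T = X^T D Y
    return U, Vt.T

def induce_dictionary(X, Y, W_X, W_Y, keep_prob=1.0, rng=None):
    """Build a new dictionary from P = X W_X W_Y^T Y^T.

    Assumption: plain argmax retrieval is used here in place of CSLS.
    """
    P = (X @ W_X) @ (Y @ W_Y).T
    if keep_prob < 1.0:
        if rng is None:
            rng = np.random.default_rng(0)
        # Keep each score with probability keep_prob, set it to 0 otherwise.
        P = P * (rng.random(P.shape) < keep_prob)
    D = np.zeros_like(P)
    D[np.arange(P.shape[0]), P.argmax(axis=1)] = 1.0
    return D

def self_learning(X, Y, D, n_iter=5):
    """Alternate between fitting the mapping and re-inducing the dictionary."""
    for _ in range(n_iter):
        W_X, W_Y = learn_mapping(X, Y, D)
        D = induce_dictionary(X, Y, W_X, W_Y)
    return W_X, W_Y, D
```

When the target embeddings are an exact rotation of the source embeddings and the starting dictionary is correct, the correct dictionary is a fixed point of this loop, which makes a convenient sanity check.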
There are some pre-processing and post-processing steps that are crucial to VecMap:
• Normalization and mean centering are applied before the initialization and the self-learning. All vectors in $X$ and $Y$ are normalized to have unit Euclidean norm. Then these vectors are mean-centered dimension-wise and length-normalized again.
• Whitening (Bell and Sejnowski, 1997) is applied before the last iteration of the self-learning. It transforms $X$ and $Y$ such that each dimension has unit variance and the dimensions are uncorrelated.
• Symmetric re-weighting is applied to the mapped embeddings $XW_X$ and $YW_Y$ at the last iteration. It further improves $XW_X$ and $YW_Y$ by weighting dimensions by their cross-correlation $\sqrt{S}$, where $S$ is a diagonal matrix with the singular values on its diagonal.
• Dewhitening is the reverse of whitening and is applied after symmetric re-weighting if whitening is applied. It restores the variance information.

Figure 1: The accuracy obtained by the self-learning starting from a given dictionary (Accuracy) vs. the dictionary similarity of that starting dictionary (Dictionary Similarity). Panels: En-Zh, En-Ja, En-Vi, En-Th.
Figure 2: The minimum initial dictionary similarity to succeed (Threshold) vs. the actual initial dictionary similarity (Actual). Language pairs: En-Es, En-Fr, En-It, En-De, En-Zh, En-Ja, En-Vi, En-Th; vertical axis: Dictionary Similarity +1.
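The normalization, mean-centering and whitening steps listed in this section can be sketched as below. This is an illustrative NumPy version under stated assumptions: the function names are hypothetical, and whitening is realized through an SVD so that the transformed matrix $Z$ satisfies $Z^T Z = I$ (uncorrelated dimensions with equal scale), which is one common way to implement it.

```python
import numpy as np

def length_normalize(E):
    """Scale every row to unit Euclidean norm (all-zero rows are left as is)."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return E / norms

def preprocess(E):
    """Length-normalize, mean-center each dimension, then length-normalize again."""
    E = length_normalize(E)
    E = E - E.mean(axis=0, keepdims=True)
    return length_normalize(E)

def whitening_matrix(E):
    """Return W such that (E W)^T (E W) = I, assuming E has full column rank."""
    _, s, vt = np.linalg.svd(E, full_matrices=False)
    return vt.T @ np.diag(1.0 / s) @ vt
```

Dewhitening would then multiply by the inverse of this matrix after the final re-weighting.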

When does Unsupervised Learning Fail?
Since VecMap is a pipeline of the initialization and the self-learning, we can assume that the failure of unsupervised learning comes from either or both of these two components. Two hypotheses arise:
1. The self-learning cannot succeed even if the initialization is perfect;
2. The initialization is too poor to kick off the self-learning, even though the self-learning itself is able to succeed.
It is easy to test the first hypothesis: we start the self-learning with a human-annotated seed dictionary. This setup assumes a perfect initialization and thus eliminates the impact of the initialization. Table 1 shows that a small seed dictionary is enough for the self-learning to reach a good result. This observation reveals that the self-learning is able to succeed, which rules out the first hypothesis. This leaves us with the second hypothesis, that unsupervised learning fails at the initialization.
Two natural questions follow from the second hypothesis:
1. How can we quantify the initialization performance, i.e., the quality of the dictionary generated by the initialization (initial dictionary for short)?
2. How good does the initial dictionary need to be for the self-learning to succeed?
One might expect accuracy to be a good proxy for the quality of a dictionary. But accuracy only takes the correct translations into account. Intuitively, even when the system fails to find the correct translation, its output can still be useful if it is close to the correct answer. Thus we evaluate the average cosine similarity between the word embeddings of the system translations and those of the correct answers, dubbed dictionary similarity. It scores high if the translations are close to the correct answers. Figure 1 shows how the self-learning performs when starting from dictionaries with different dictionary similarities. These dictionaries are constructed by randomly replacing the translations in the initial dictionary. We can see that the self-learning only succeeds when the dictionary similarity of the starting dictionary is above some threshold. These thresholds represent the minimum similarity that the initial dictionary should have for the self-learning to succeed.
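A minimal sketch of the dictionary similarity, together with the random-replacement procedure used to build starting dictionaries of varying quality, might look like this. The representation of a dictionary as an array of target-word indices (one per source word) and both function names are assumptions made for illustration.

```python
import numpy as np

def dictionary_similarity(Y, predicted, gold):
    """Average cosine similarity between predicted and gold target embeddings.

    Y         : (vocab, dim) target embedding matrix
    predicted : target indices chosen by the system, one per source word
    gold      : target indices of the correct translations, same order
    """
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return float(np.mean(np.sum(Yn[predicted] * Yn[gold], axis=1)))

def corrupt_dictionary(translations, frac, vocab_size, rng):
    """Replace a fraction of the translations with random target indices."""
    out = np.array(translations).copy()
    n_replace = int(round(frac * len(out)))
    if n_replace == 0:
        return out
    idx = rng.choice(len(out), size=n_replace, replace=False)
    out[idx] = rng.integers(0, vocab_size, size=n_replace)
    return out
```

A fully correct dictionary scores 1.0, while a wrong translation that lies near the correct answer still contributes a high cosine value, which is exactly what accuracy cannot capture.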
We measure the dictionary similarity of the initial dictionary. As shown in Figure 2, the actual initial dictionary similarities are above the thresholds in similar language pairs, but below them in distant language pairs. This gap between the actual initial dictionary similarity and the minimum similarity required for the self-learning to succeed explains the failure of VecMap in distant language pairs.

Table 2: The percentage of variance explained by the highest eigenvalue and the average cosine similarity between any two embeddings before and after applying the dropmax trick.

Proposed Method
As the gap is determined by both the embeddings and the algorithm, one could improve the algorithm to bridge this gap. In this work, we take a different angle and simplify the embeddings to make it easier for the algorithm to succeed. Concretely, we apply dimension reduction to the word embeddings. In principle, a dimension reduction algorithm drops the features that are less important in explaining the data. It can be considered as a way to remove noise and clean up the data. Here we choose Principal Component Analysis (PCA) (Pearson, 1901) for study.

Figure 3: An example of embeddings with eigenvectors from PCA (black arrows are eigenvectors/axes, coloured arrows are projected embedding vectors). Panels: Spanish, Japanese.
PCA first computes the covariance matrix of the features in the embeddings. This matrix represents the correlation among features. Then PCA performs an eigenvalue decomposition of this covariance matrix to obtain a set of eigenvectors. Finally, PCA uses the eigenvectors with the n highest eigenvalues to project the raw embeddings to a space with a lower dimension n.
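The three PCA steps described above can be written directly with NumPy. This is a small sketch, assuming the embeddings are mean-centered before projection (the standard PCA convention); the function name is hypothetical.

```python
import numpy as np

def pca_project(E, n):
    """Project embeddings E (words x features) onto the top-n principal axes.

    Returns the projected embeddings and all eigenvalues in descending order.
    """
    centered = E - E.mean(axis=0)
    cov = np.cov(centered, rowvar=False)        # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]           # re-sort to descending
    top = eigvecs[:, order[:n]]
    return centered @ top, eigvals[order]
```

The share of variance explained by the first axis is then `eigvals[0] / eigvals.sum()`, which is the quantity reported in the first row of Table 2.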
However, this simple application of PCA only works for a few languages. We find that the highest eigenvalue is much larger than the others in most of the failed languages, e.g., it explains 96% of the variance in Japanese but only 5.6% in Spanish, as shown in the first row of Table 2 (the larger the highest eigenvalue is, the more variance it explains). This implies that the embeddings are stretched the most along the direction of the eigenvector with the highest eigenvalue, so that most embeddings point in a similar direction and cluster together. Figure 3 shows a simple example of this. The average cosine similarity between every two embeddings in Table 2 also supports our hypothesis that the more variance it explains, the closer the embeddings are to each other (higher similarity). Such a phenomenon makes embeddings indistinguishable. The 'hubness' problem (Radovanovic et al., 2010) will occur, where one embedding is the nearest neighbor of many embeddings, even if they have different semantics. The simplest solution is that when projecting the embeddings from dimension m to n, we drop not only eigenvectors with the lowest m − n + 1 eigenvalues but also the one with the highest eigenvalue. We refer to this as the dropmax trick. This makes embeddings distinguishable (lower similarity), as shown in the last row of Table 2, and helps VecMap to succeed in the remaining distant language pairs.
Since dimension reduction incurs a loss of information and hinders further improvement of our system, we propose Iterative Dimension Reduction (IDR), as shown in Algorithm 1. The algorithm first runs VecMap on embeddings with the smallest dimension n (default set to 50). Then it uses the trained system to translate the K most frequent words (default set to 4K). The resulting dictionary serves as the initial dictionary for the self-learning in the next step, which runs on embeddings with a larger dimension (default set to 2× larger than in the previous step).

Algorithm 1 Iterative Dimension Reduction
1: procedure IDR(E, n)    ▷ E is the raw embeddings, n is the initial dimension
2:   D ← ∅    ▷ Set the dictionary to empty
3:   while n ≤ 300 do    ▷ 300 is the dimension of the raw embeddings
4:     Reduce E to Ē with dimension min(n, 300) using PCA and dropmax
5:     ...
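The overall IDR loop with the dropmax trick can be sketched as follows. Several details are assumptions made for this illustration and are not taken from the paper: `train_and_translate` is a hypothetical callable standing in for a full VecMap run followed by translation of the K most frequent words; the reduced space keeps exactly n axes after skipping the top one; the dimension doubles until it reaches the raw dimension; and the raw embeddings are used unreduced in the final round.

```python
import numpy as np

def pca_dropmax(E, n):
    """PCA projection to n dimensions that skips the highest-eigenvalue axis.

    Assumption: the axes ranked 2 .. n+1 by eigenvalue are kept.
    """
    centered = E - E.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    order = np.argsort(eigvals)[::-1]   # descending eigenvalues
    keep = order[1:n + 1]               # drop the top axis, keep the next n
    return centered @ eigvecs[:, keep]

def idr(X, Y, train_and_translate, n=50, k=4000):
    """Iterative Dimension Reduction driver (illustrative schedule).

    train_and_translate(Xr, Yr, init_dict, k) is a hypothetical stand-in:
    it should run the unsupervised system on (Xr, Yr), starting from
    init_dict (None means fully unsupervised), and return the dictionary
    induced for the k most frequent words.
    """
    full_dim = X.shape[1]
    dictionary = None
    while True:
        dim = min(n, full_dim)
        if dim < full_dim:
            Xr, Yr = pca_dropmax(X, dim), pca_dropmax(Y, dim)
        else:
            Xr, Yr = X, Y  # assumption: final round uses the raw embeddings
        dictionary = train_and_translate(Xr, Yr, dictionary, k)
        if dim == full_dim:
            return dictionary
        n *= 2
```

With the defaults and 300-dimensional embeddings, this schedule visits 50, 100, 200 and 300 dimensions, each round warm-started by the dictionary from the previous one.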

[...] reproduce the C-MUSE and POSTPROC systems using Python. All these systems are run with the default hyper-parameter settings. Our method is based on the open-source VecMap implementation. We evaluate the baselines and our method on 4 similar language pairs, En-{Es, Fr, It, De}, and 4 distant language pairs, En-{Zh, Ja, Vi, Th}. We use the pretrained 300-dimensional fastText embeddings (Bojanowski et al., 2017). The evaluation dictionaries are from Lample et al. (2018). We trim all vocabularies to the 20K most frequent words for training. Specifically, VecMap retains the top-4K words for the initialization, while the others use the whole vocabulary. All experiments are done on a single Nvidia GTX 1080Ti. We run each experiment 3 times with different random seeds, then pick the run with the highest cosine similarity of induced nearest neighbors as the final result. This unsupervised model selection criterion has been shown to correlate well with UBDI performance.
Table 3 shows the results of various systems on similar language pairs, En-{Es, Fr, It, De} and their reverse {Es, Fr, It, De}-En. We can see that all baseline systems perform well on these language pairs. VecMap and C-MUSE outperform MUSE in most cases. This is because both systems employ the dropout trick in their self-learning processes, which has proven to be effective in escaping local optima (Artetxe et al., 2018). However, all these baseline systems perform poorly on distant language pairs, En-{Zh, Ja, Vi, Th} and their reverse {Zh, Ja, Vi, Th}-En, as shown in Table 4. C-MUSE is better than the others, obtaining positive results on the En-Zh, Zh-En, Ja-En and Vi-En tasks, but it still fails on the other tasks, i.e., it has an accuracy below 5%.

Results
Our method is based on VecMap and thus performs well on similar language pairs, as shown in the last row of Table 3. At the same time, our method is robust on distant language pairs, as shown in Table 4. In all four distant language pairs and in both directions, our method obtains much better results than the baselines. For example, our method has an accuracy of 21.6% in En-Th and 13.64% in Th-En, where none of the baseline systems has an accuracy above 1%.
We also observe that the performance of our method in low-dimensional space is much worse than in high-dimensional space. For instance, our method has an accuracy of 21.6% in En-Th when the dimension is 300, but only 10.4% when the dimension is 100. This observation supports our previous claim in Section 4 that dimension reduction incurs a loss of information and thus hinders further improvement of our method. But running directly on high-dimensional embeddings does not succeed, as none of the baselines consistently has an accuracy above 0% on the raw 300-dimensional embeddings. Therefore, it is necessary to run in the low-dimensional space first, as a warm start for the high-dimensional counterpart, to obtain better performance.

Analysis

Ablation Study
Table 5 shows the results of using IDR alone and dropmax alone on distant language pairs. We can see that IDR is crucial to En-Th and Th-En. In Section 6.2, we show that dimension reduction helps to obtain isomorphic embeddings. These embeddings match the isomorphic assumption made by VecMap.
On the other hand, the dropmax trick is crucial to En-{Zh, Ja} and their reverse. This fact relates well to the observation in Table 2, where these two languages suffer from the hubness problem due to the highest eigenvalue. The dropmax trick avoids this issue by removing the eigenvector with this highest eigenvalue, as shown in Section 6.3. We also see that both IDR and dropmax are crucial to En-Vi and Vi-En, which implies that hubness and isomorphism are both central problems for this pair.

Figure 5: The hubness level (Hubness H) before and after applying the dropmax trick.

Isomorphism
Many UBDI methods, including VecMap, make the isomorphic assumption that the underlying nearest neighbor graphs of two language embedding spaces are connected in the same way. Søgaard et al. (2018) propose the eigenvector similarity to measure how well this assumption holds. Here we are interested in how the isomorphism of the underlying graphs changes when the dimension is different. We first length-normalize, mean-center and length-normalize the embeddings again as in the pre-processing step, calculate the nearest neighbor graphs of the 10K most frequent words in each language, and compute their Laplacian matrices $L_1$ and $L_2$. We then find the smallest $k_1$ such that the sum of the largest $k_1$ eigenvalues of $L_1$ is at least 90% of the sum of all its eigenvalues, and analogously for $k_2$ and $L_2$. Finally we set $k = \min(k_1, k_2)$, and define the eigenvector similarity of the two graphs as the sum of the squared differences between the $k$ largest eigenvalues $\lambda$ of $L_1$ and $L_2$, $\Delta = \sum_{i=1}^{k} (\lambda_{1i} - \lambda_{2i})^2$. The higher $\Delta$ is, the less similar the graphs are. As shown in Figure 4, the eigenvector similarity drops significantly when the dimension is reduced. This implies that the underlying nearest neighbor graphs of the two languages become more similar in low-dimensional space. This helps the algorithm to succeed in low-dimensional space, as the assumption it makes holds. This phenomenon might result from the fact that many language pairs share some principal axes of variation, especially the ones with high eigenvalues (Hoshen and Wolf, 2018).
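The eigenvector similarity computation described above can be sketched as follows. The graph construction is an assumption made for this example: each word is linked to its `n_neighbors` nearest neighbors by cosine similarity and the graph is symmetrized, and the embeddings are assumed to be pre-processed already. The function names are hypothetical.

```python
import numpy as np

def knn_laplacian(E, n_neighbors=10):
    """Laplacian L = Deg - A of a symmetrized cosine nearest-neighbor graph."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    sim = En @ En.T
    np.fill_diagonal(sim, -np.inf)                    # exclude self-matches
    nn = np.argsort(-sim, axis=1)[:, :n_neighbors]
    A = np.zeros((len(E), len(E)))
    rows = np.repeat(np.arange(len(E)), n_neighbors)
    A[rows, nn.ravel()] = 1.0
    A = np.maximum(A, A.T)                            # make the graph undirected
    return np.diag(A.sum(axis=1)) - A

def smallest_k(eigvals_desc, mass=0.9):
    """Smallest k whose top-k eigenvalues hold at least `mass` of the total."""
    cum = np.cumsum(eigvals_desc)
    return int(np.searchsorted(cum, mass * cum[-1]) + 1)

def eigenvector_similarity(E1, E2, n_neighbors=10):
    """Sum of squared differences between the k largest Laplacian eigenvalues."""
    l1 = np.sort(np.linalg.eigvalsh(knn_laplacian(E1, n_neighbors)))[::-1]
    l2 = np.sort(np.linalg.eigvalsh(knn_laplacian(E2, n_neighbors)))[::-1]
    k = min(smallest_k(l1), smallest_k(l2))
    return float(np.sum((l1[:k] - l2[:k]) ** 2))
```

Two identical embedding spaces give a value of 0, and larger values indicate less similar graphs. A dense eigendecomposition on 10K words is feasible but slow, so a smaller vocabulary is convenient for quick experiments.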

Hubness
Cross-lingual word embeddings are known to suffer from the hubness problem (Lample et al., 2018), where a few points (known as hubs) are the nearest neighbors of many other points in high-dimensional spaces. As suggested in Section 4, distant language pairs might suffer more from this problem and the dropmax trick helps to alleviate it. Thus we would like to know to what extent the dropmax trick helps with the hubness problem. Here we adopt the hubness metric proposed by Ormazabal et al. (2019) for evaluation. This metric measures the percentage of target words H that are the nearest neighbor of all the source words. For instance, a hubness value of H = 60% would indicate that 60% of the target words are the nearest neighbors of all the source words. This way, lower values of H are indicative of a higher level of hubness. Figure 5 shows the hubness level of different languages before and after applying the dropmax trick. We can see that the dropmax trick generally alleviates the hubness problem, even when the hubness level is low. For example, En-It has H = 56.39% before applying the dropmax trick and H = 56.73% after it, a 0.34-point improvement. For language pairs with a high hubness level, such as En-Zh and En-Ja, the improvement is obvious, e.g., more than 9 points on En-Zh.
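One way to compute such a hubness value is sketched below. It rests on an interpretation that should be treated as an assumption: H is taken to be the share of target words that appear as the nearest neighbor of at least one source word, i.e., the set of target words needed to cover the nearest neighbors of all source words. The function name and the use of cosine nearest neighbors in an already mapped space are likewise illustrative choices.

```python
import numpy as np

def hubness_h(X_mapped, Y_mapped):
    """Percentage of target words that are the nearest neighbor of some source word.

    A low value means that a few target words (hubs) absorb the nearest
    neighbors of all source words, i.e., a high level of hubness.
    """
    Xn = X_mapped / np.linalg.norm(X_mapped, axis=1, keepdims=True)
    Yn = Y_mapped / np.linalg.norm(Y_mapped, axis=1, keepdims=True)
    nearest = (Xn @ Yn.T).argmax(axis=1)   # nearest target for each source word
    return 100.0 * len(np.unique(nearest)) / Y_mapped.shape[0]
```

If every source word has its own distinct nearest target, the value is 100; if all source words share a single nearest target among three candidates, it drops to about 33.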

Dictionary Similarity
As suggested in Section 4, dimension reduction helps to bridge the performance gap, and we examine this here. The gap can be measured by the dictionary similarity proposed in Section 3. We choose En-{Vi, Th} for study, since dimension reduction is crucial to their success, as shown in Table 5.
In Figure 6, we can see that the gap between the initialization and the self-learning varies across dimensions. In the dimensions where VecMap succeeds, the actual initial dictionary similarity is much higher than the minimum initial dictionary similarity to succeed, e.g., 0.58 vs. 0.15 in 300 dimensions for En-Vi and 0.51 vs. 0.15 in 100 dimensions for En-Th. This means that VecMap in the previous dimension reduction step generates a good initial dictionary. When looking at the results in the previous step, we find that the actual initial dictionary similarity is close to but does not surpass the minimum initial dictionary similarity to succeed, e.g., 0.15 vs. 0.17 in 200 dimensions for En-Vi and 0.15 vs. 0.2 in 50 dimensions for En-Th. This implies that although VecMap does not succeed in that dimension, narrowing the gap already allows it to translate the frequent words well, i.e., the words used to construct the initial dictionary. This enables unsupervised learning in the next dimension reduction step.

Understanding the Initialization
In Figure 2, the initialization is poor on distant language pairs, as indicated by the gap. We investigate which factors might have an impact on the accuracy of the initial dictionary (initial accuracy for short). This allows us to identify the obstacles in the initialization of distant language pairs. Here we measure the initial accuracy instead of the proposed dictionary similarity, because the value of the dictionary similarity depends on the embeddings and thus cannot be compared across languages.
Here we study two factors: the oracle structural similarity and the maximum accuracy. Structural similarity is the cosine similarity between rows of $\sqrt{M_X}$ and $\sqrt{M_Y}$, where a row in $\sqrt{M_X}$ is associated with a source word and represents the similarities between this source word and the other source words, and analogously for $\sqrt{M_Y}$. This similarity is used in the initialization to select word translations to construct the initial dictionary. The oracle structural similarity is the average of the structural similarity of all possible and correct word translations in the initialization. This oracle structural similarity measures how well the assumption made in the initialization holds, namely that aligned source and target words should have high structural similarity. The higher the oracle structural similarity is, the easier it is for the initialization to find correct translations. As shown in the left part of Figure 7, the oracle structural similarity is positively related to the initial accuracy. Distant language pairs have a low similarity, which means that the assumption made in the initialization is unlikely to hold.
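The oracle structural similarity can be sketched as follows. This example makes a few assumptions: $\sqrt{M_X}$ is computed as the matrix square root of $XX^T$ through an SVD of $X$, each row is sorted before comparison (as in the initialization), both vocabularies have the same size so that rows are comparable, and the embeddings are already pre-processed. The function names and the `gold_pairs` format are hypothetical.

```python
import numpy as np

def structural_rows(E):
    """Sorted, length-normalized rows of sqrt(E E^T)."""
    u, s, _ = np.linalg.svd(E, full_matrices=False)
    sqrt_m = (u * s) @ u.T       # matrix square root of E E^T
    sqrt_m.sort(axis=1)          # sorting removes the dependence on word order
    return sqrt_m / np.linalg.norm(sqrt_m, axis=1, keepdims=True)

def oracle_structural_similarity(X, Y, gold_pairs):
    """Mean structural similarity over correct (source, target) index pairs."""
    Rx, Ry = structural_rows(X), structural_rows(Y)
    src = [s for s, _ in gold_pairs]
    tgt = [t for _, t in gold_pairs]
    return float(np.mean(np.sum(Rx[src] * Ry[tgt], axis=1)))
```

When the target space is a rotated and reordered copy of the source space, every correct pair has structural similarity 1, which is the ideal case the initialization implicitly assumes.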
The maximum accuracy is the accuracy that is obtained by a perfect initialization strategy. It is lower than 100% as the vocabulary in the initialization might not contain the translations of some source and target words. We see that in the right part of Figure 7, the maximum accuracy is also positively related to the initial accuracy. Distant language pairs have a low maximum accuracy, which means that the initialization is less likely to find the correct translation for a source word.

Error Analysis
In Table 4, we can see that even our best system still has low accuracy in distant language pairs. To identify the main source of errors, we perform an error analysis of the system output on a 5K-entry En-Zh dictionary from Lample et al. (2018). We randomly sample 200 error examples and ask a human expert to classify them into four main categories: the answer and the translation are both correct (CC), the answer is correct and the translation is wrong (CW), the answer is wrong and the translation is correct (WC), and the answer and the translation are both wrong (WW).
In our analysis, 25.5% of the errors are CC. This is due to the polysemy of words: the dictionary does not cover all possible translations. A few cases are WC (0.5%) and WW (3%), which means that there are some minor issues with the dictionary quality. Within the main category CW (71%), 17% result from proper nouns, which have been shown to be meaningless in the evaluation. 18.5% have a meaning close to the answer. 10.5% are untranslated errors, where the translation is identical to the source word. The remaining 25% are true errors, e.g., antonyms.

Related Work
In recent years a number of methods have been proposed to learn bilingual dictionaries from monolingual word embeddings. Early work (Mikolov et al., 2013) relies on a seed dictionary to learn the source-target word embedding mapping. Xing et al. (2015) enforce the word embeddings to be of unit length and impose an orthogonality constraint on the linear mapping. Faruqui and Dyer (2014), on the other hand, use Canonical Correlation Analysis (CCA) to project both source and target embeddings to a common low-dimensional space. Artetxe et al. (2016) show that the above methods are variants of the same objective. Smith et al. (2017) further show that this objective is closely related to the orthogonal Procrustes problem. Artetxe et al. (2017) obtain competitive results using the self-learning with a seed dictionary of only 25 word pairs.
Adversarial methods. Zhang et al. (2017a) attempt the unsupervised bilingual dictionary induction task using an adversarial network. They use a generator to transform the source word embeddings to the target word embeddings and a discriminator to classify whether a given embedding is sampled from the true target word embeddings or generated by the generator. The generator is trained to fool the discriminator and the discriminator is trained to identify the generated word embeddings. In the end, the generator is used to induce the bilingual dictionary. Their follow-up work (Zhang et al., 2017b) minimizes the Earth-Mover's distance between the distributions of the transformed source embeddings and the target embeddings. Lample et al. (2018) improve the results by treating the dictionary produced by the adversarial network as the seed dictionary of the self-learning. To mitigate the hubness problem (Radovanovic et al., 2010), they propose an effective nearest neighbor retrieval method, CSLS, for dictionary induction. Xu et al. (2018) minimize the Sinkhorn distance instead and introduce a cycle consistency such that a source word embedding can be translated back after translating it to a target word. Mohiuddin and Joty (2019) extract latent codes from word embeddings and align words according to their latent codes.
Non-adversarial methods. There is another line of research that focuses on non-adversarial approaches. Artetxe et al. (2018) propose a heuristic to induce an initial dictionary by exploiting the structural similarity of embeddings. They also propose the stochastic dictionary induction method, which significantly improves the robustness as well as the performance of self-learning. Hoshen and Wolf (2018) assume that many language pairs share some principal axes of variation. Therefore they first use PCA to project the word embeddings to a lower-dimensional space. Then they apply a variant of the Iterative Closest Point method to find the mapping between the source and target word embeddings. Zhou et al. (2019) use normalizing flows to match the distributions of source and target word embeddings. But they rely on a numeral seed dictionary and additional word frequency information. More recently, more robust results have been obtained by using the adversarial method to produce the initial dictionary for the advanced self-learning (with the stochastic dictionary induction). Another approach first generates a pseudo parallel corpus with an unsupervised machine translation system and then extracts a bilingual dictionary from the word alignment learned on that corpus. This simple process shows much better results than previous methods. Vulic et al. (2020) introduce a simple post-processing step to improve UBDI performance on distant language pairs.

Conclusion
In this work, we pinpoint the part in which the representative UBDI system, VecMap, fails on distant language pairs. We identify a gap between the actual initialization performance and the minimum initialization performance required for the self-learning to succeed, which is responsible for its failure. We propose Iterative Dimension Reduction to bridge this gap. Our method obtains substantial gains on distant language pairs without sacrificing the performance on similar language pairs. It has been shown to be robust on the four distant language pairs we experiment with.