Unsupervised Joint Training of Bilingual Word Embeddings

State-of-the-art methods for unsupervised bilingual word embeddings (BWE) train a mapping function that maps pre-trained monolingual word embeddings into a bilingual space. Despite their remarkable results, unsupervised mapping methods are also well-known to be limited by the original dissimilarity between the word embedding spaces to be mapped. In this work, we propose a new approach that trains unsupervised BWE jointly on synthetic parallel data generated through unsupervised machine translation. We demonstrate that existing algorithms for jointly training BWE are very robust to noisy training data and show that jointly trained unsupervised BWE significantly outperform unsupervised mapped BWE in several cross-lingual NLP tasks.


Introduction
Bilingual word embeddings (BWE) represent the vocabulary of two languages in one common continuous vector space. They are known to be useful in a wide range of cross-lingual NLP tasks.
The most prevalent methods for training BWE are so-called mapping methods (Mikolov et al., 2013a): word embeddings for two languages are separately trained on respective monolingual data and then mapped into one common embedding space. The mapping function is usually trained using a small bilingual lexicon for supervision. Recently, unsupervised mapping for BWE (Artetxe et al., 2018a; Lample et al., 2018a), i.e., trained without using any manually created bilingual resources, has been shown to reach a performance comparable to supervised BWE in several cross-lingual NLP tasks. Unsupervised BWE are trained with a three-step approach. First, word embeddings are roughly mapped into an initial BWE space, for instance using adversarial training or a heuristic mapping. Then, using the initial BWE, a small synthetic bilingual lexicon is induced. Finally, a new BWE, which is expected to be better than the initial BWE, is learned from the induced lexicon through pseudo-supervision with some supervised mapping method. The last two steps can be repeated to refine the BWE. In spite of their success, unsupervised mapping methods are inherently limited by the dissimilarity between the original word embedding spaces to be mapped. The feasibility of aligning two embedding spaces relies on the assumption that they are isomorphic. However, Søgaard et al. (2018) showed that these spaces are, in general, far from being isomorphic, and that this results in suboptimal or degenerate unsupervised mappings.
On the other hand, supervised methods that jointly train BWE from scratch (Upadhyay et al., 2016), on parallel or comparable corpora, do not have such limits since no pre-existing embedding spaces and no mapping function are involved. These methods jointly train BWE by exploiting bilingual and monolingual contexts of words, materialized by sentence or document pairs, to learn a single BWE space. However, they require large bilingual resources for training. To the best of our knowledge, joint training of BWE has never been explored for unsupervised scenarios.
In this paper, we propose unsupervised joint training of BWE. Our method is an extension of previous work on unsupervised BWE: we propose to generate, without supervision, synthetic parallel sentences that can be directly exploited to jointly train BWE with existing algorithms. We empirically show that this method learns better BWE for several cross-lingual NLP tasks.

Pseudo-supervised joint training
On the strong assumption that existing algorithms for joint training of BWE are robust enough even with very noisy parallel training data, we formulate the following research question: Do synthetic sentence pairs supply useful bilingual contextual information for learning better BWE?

Bilingual skipgram
Previous work on joint training of BWE hypothesizes that exploiting both monolingual and bilingual contextual information yields better word embeddings, monolingually and bilingually.
Among several existing algorithms for joint training of BWE, in this work, we use bilingual skipgram (BIVEC) (Luong et al., 2015), which has been shown to outperform other methods in several NLP tasks (Upadhyay et al., 2016). BIVEC uses the skipgram algorithm (Mikolov et al., 2013b) to learn the word embeddings for each language and exploits word alignments obtained for parallel data in order to make the embeddings cross-lingual. Given a pair of sentences, S1 in some language L1 and S2 in another language L2, a word w_i in S1 is replaced with its aligned word a(w_i) in S2, so that the L1 context can also be used for learning the embedding of the L2 word. BIVEC has been shown to be robust to noisy word alignments (Luong et al., 2015), which is a significant advantage of this method in our scenario using synthetic parallel data.
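The word-substitution idea behind BIVEC can be sketched as follows. This is a simplified illustration, not the actual implementation: the alignment format (a dict from source to target positions) and the replacement probability are our assumptions.

```python
import random

def bivec_training_pairs(src_sent, tgt_sent, alignment, window=5, replace_prob=0.5):
    """Generate skipgram (center, context) pairs for the source sentence.
    Aligned source words are stochastically replaced by their target-side
    translations, so target words are also trained on source contexts."""
    pairs = []
    for i, center in enumerate(src_sent):
        # Cross-lingual substitution: swap in the aligned target word.
        if i in alignment and random.random() < replace_prob:
            center = tgt_sent[alignment[i]]
        lo, hi = max(0, i - window), min(len(src_sent), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, src_sent[j]))
    return pairs
```

Feeding such mixed-language pairs to a standard skipgram trainer is what places the two vocabularies in one space.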

Training on synthetic parallel data
For unsupervised training of BWE, the training data must also be generated in an unsupervised way. To this end, we chose unsupervised machine translation (MT). Recent work has shown significant progress in unsupervised MT (Artetxe et al., 2018b; Lample et al., 2018b), with generated translations of a reasonable quality. Both statistical (SMT) and neural MT (NMT) have been adapted to the unsupervised scenario. We chose unsupervised SMT (USMT) to generate synthetic parallel data since it generates better translations than unsupervised NMT (Lample et al., 2018b).
Figure 1: Our joint training framework is built on top of existing unsupervised mapping methods (Artetxe et al., 2018; Lample et al., 2018).

Given an initial BWE, for instance learned with unsupervised mapping methods, our method works as follows (see also Figure 1). First, a USMT system is trained from monolingual data. We collect a set of phrases made of up to L tokens, using word2phrase, for each of the source and target languages. As phrases, we also consider all the token types in each corpus. In our phrase table, each L1 phrase is paired with its k most probable translations in L2, determined based on a score computed from the given BWE. The phrase table and a language model trained on the L2 monolingual data compose the initial USMT system. Then, the USMT system is iteratively refined in the following manner.
• Synthetic parallel data are generated by translating monolingual data using the USMT. Both L1-to-L2 and L2-to-L1 translations can be considered (Artetxe et al., 2019).
• A new phrase table is trained on the synthetic parallel data to form a new USMT.
Finally, on the synthetic parallel data generated by our USMT after N refinement steps, we jointly train new BWE as described in Section 2.1.
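The refinement loop described above can be sketched as follows. This is a minimal illustration: `translate` and `retrain` are hypothetical stand-ins for a full SMT decoding and phrase-extraction pipeline.

```python
def refine_usmt(initial_table, mono_l1, translate, retrain, n_steps=4):
    """Sketch of USMT refinement: translate monolingual L1 data with the
    current system, then re-estimate a new phrase table from the resulting
    synthetic parallel data."""
    table = initial_table
    for _ in range(n_steps):
        # Synthetic parallel data: (L1 sentence, L2 translation) pairs.
        synthetic = [(sent, translate(table, sent)) for sent in mono_l1]
        table = retrain(synthetic)
    return table
```

The synthetic corpus produced at the final step is what we feed to the joint BWE training of Section 2.1.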
Although this approach can efficiently generate parallel data of a reasonable quality, as shown in Figure 1, it heavily relies on the feasibility of mapping the word embeddings learned for L1 and L2 into the same space, which is then used for the initial USMT. If the mapping fails, we cannot expect USMT to generate useful data for jointly training BWE. Conversely, if the mapping succeeds, we can generate data with bilingual contexts that may be useful to jointly train BWE.
More importantly, we use USMT assuming that BIVEC is robust enough to learn from very noisy parallel data. Our intuition comes from the fact that SMT generates less diverse translations, with a significantly different word frequency distribution than in translations naturally produced by humans. SMT is limited by the vocabulary of its phrase table and will favor the generation of frequent n-grams thanks to its language model. The same words thus appear more frequently in similar contexts, facilitating the training of word embeddings and compensating, to some extent, for the noisiness of the translations. In Appendix A, we provide results of our preliminary experiments supporting this assumption.

Experiments
Are BWE jointly trained, without supervision, on noisy synthetic data better than unsupervised mapped BWE?
To answer this question, we conducted experiments in three different tasks with three language pairs: English-German (en-de), English-French (en-fr), and English-Indonesian (en-id).

Settings for training BWE
We trained monolingual word embeddings with fastText (Bojanowski et al., 2017) separately on English (239M lines), German (237M lines), and French (38M lines) News Crawl corpora provided by WMT for en-de and en-fr. For en-id, we used English (100M lines) and Indonesian (77M lines) Common Crawl corpora. We then mapped the word embeddings into a BWE space using VECMAP, one of the best and most robust methods for unsupervised mapping (Glavas et al., 2019). The resulting BWE were used as baselines in our evaluation tasks and also to bootstrap our USMT system.
Our initial USMT systems were induced with the following configuration. The maximum phrase length was set to six (L = 6). To make our experiments reasonably fast, we selected the 300k most frequent phrases in each monolingual corpus, and retained the 300-best target phrases for each source phrase (k = 300). 4-gram language models were trained with lmplz (Heafield et al., 2013). Then, USMT systems were refined four times (N = 4) and used to generate synthetic parallel data by translating 10M sentences randomly sampled from the monolingual data. Finally, on the synthetic parallel data, we trained new BWE using BIVEC (https://github.com/lmthang/bivec) with the parameters used in Upadhyay et al. (2016) and with word alignments determined by fast_align (Dyer et al., 2013; https://github.com/clab/fast_align). We performed contrastive experiments for some of our tasks with a simple method proposed by Levy et al. (2017), denoted SENTID (https://bitbucket.org/omerlevy/xling_embeddings), with its default parameters for training BWE. SENTID does not optimize a joint objective but, as with BIVEC, we trained it on the synthetic parallel data to learn a single BWE space directly from scratch. SENTID does not require word alignments; instead, it simply exploits sentence pair IDs as a bilingual signal associated with each word and trains BWE by applying skipgram to a word/sentence-ID matrix.
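The SENTID bilingual signal can be illustrated with a small sketch. This is our simplification for exposition: the actual implementation builds a sparse word/sentence-ID co-occurrence matrix and runs skipgram-style training on it.

```python
def sentid_pairs(parallel_corpus):
    """Sketch of the SENTID bilingual signal: every word of a sentence pair,
    in both languages, co-occurs with the pair's ID, so words appearing in
    aligned sentences (e.g. translation equivalents) share contexts."""
    pairs = []
    for sid, (src, tgt) in enumerate(parallel_corpus):
        for word in src.split() + tgt.split():
            pairs.append((word, sid))
    return pairs
```

Because translation equivalents co-occur with the same sentence IDs across the corpus, factorizing these co-occurrences places them near each other without any word alignment.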
All the methods for training word embeddings were trained with 512 dimensions and their -min-count parameter set to 5.
Note that in all our experiments, we filtered the vocabulary so that all BWE spaces have the same vocabulary when compared.

Task 1: Bilingual lexicon induction
Bilingual lexicon induction (BLI) is by far the most popular evaluation task for BWE used by previous work, in spite of its limits (Glavas et al., 2019). In contrast to previous work, we used much larger test sets (https://github.com/facebookresearch/MUSE) for each language pair. Table 1 reports the accuracy in retrieving a correct translation with CSLS (Lample et al., 2018a) for each source word of the test sets. For all the tasks, BIVEC and SENTID achieved better accuracy than VECMAP. This supports our assumption that even noisy synthetic parallel data can provide useful bilingual contexts for training BWE. The largest improvements were observed for en-id, with a gain of more than 10 points. Interestingly, BIVEC and SENTID performed similarly, pointing out that word alignments are not necessary in our scenario. The accuracy was higher when the synthetic parallel data did not contain synthetic English ("10M-0" for "en→ * " and "0-10M" for " * →en"). Using the concatenation of the synthetic data generated by L1→L2 and L2→L1 (last two rows of the table) slightly underperformed the best configuration despite the use of twice as much training data. This is presumably due to the presence of sentences of two very different natures, synthetic and original, in the same language.
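CSLS retrieval can be sketched in a few lines of NumPy. This is a minimal re-implementation for illustration; the neighbourhood size k and the normalization details follow common practice rather than any particular codebase.

```python
import numpy as np

def csls_retrieve(src_vecs, tgt_vecs, k=10):
    """CSLS retrieval (Lample et al., 2018a): penalize the cosine similarity
    by the mean similarity of each word to its k nearest cross-lingual
    neighbours, which corrects for hubs in the embedding space."""
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = src @ tgt.T                                    # cosine similarities
    k_eff = min(k, tgt.shape[0])
    r_src = np.sort(sims, axis=1)[:, -k_eff:].mean(axis=1)  # r_T(x)
    r_tgt = np.sort(sims, axis=0)[-k_eff:, :].mean(axis=0)  # r_S(y)
    csls = 2 * sims - r_src[:, None] - r_tgt[None, :]
    return csls.argmax(axis=1)  # index of the best translation per source word
```

Accuracy is then simply the fraction of source words whose retrieved index matches a gold translation.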
To evaluate the robustness of BIVEC, we compared its performance to that obtained with noisier synthetic data generated by the initial USMT (without refinement). As shown in Table 2, we observed comparable results, especially for en→de and en→fr, confirming that this approach is very robust to noisy training data.
Although BIVEC and SENTID used a sub-part of the monolingual data used by VECMAP, their vocabulary size can be larger. This unintuitive observation comes from the use of USMT to generate synthetic data: L1 words not covered by the phrase table are directly copied into the translations. As a result, such L1 words are introduced into the L2 vocabulary even if they do not appear in the L2 monolingual data used to train VECMAP, artificially increasing the coverage ratio of the lexicon. This side-effect of our method is especially useful, for instance, for named entities that should be kept as is. Since such words in L1 and their copies in L2 co-occur frequently in the synthetic data, their embeddings are similar. Obviously, this side-effect is interesting only for close languages and may introduce numerous unwanted L1 words in the L2 space. See Appendix B for some more analyses.

Task 2: Machine translation
In the phrase table induction for USMT, both the geometry of the space (when retrieving the k-closest translations for a given source phrase) and the embeddings themselves (when computing cosine similarity for the translation probability) play an important role. Better BWE should lead to better phrase tables and, consequently, translations of better quality. We thus regard USMT as an extrinsic evaluation task for BWE. Table 3 shows BLEU scores for our USMT at step 0 on the en-de Newstest2016 and en-fr Newstest2014 test sets of WMT, and on the en-id ALT (Riza et al., 2016) test set. We observed from 0.3 (BIVEC, en→fr) to 2.5 (BIVEC, en→id) BLEU points of improvement over USMT using VECMAP. Again, BIVEC and SENTID performed similarly. However, note that here USMT is merely an evaluation task: the improvements observed at step 0 are of little practical use for USMT itself, since we can often gain much larger improvements through refinement as described in Section 2.2. Consequently, we assume that performing more iterations, i.e., retraining BWE on synthetic parallel data generated by a USMT system initialized from unsupervised joint BWE, will not improve either translation quality or BWE quality.
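The BWE-based translation probability mentioned above can be sketched as a softmax over cosine similarities. This is an illustrative formulation only: the temperature parameter, and the softmax itself, are assumptions for the sketch, not necessarily the exact scoring used in our system.

```python
import numpy as np

def phrase_translation_probs(src_vec, cand_vecs, temperature=0.1):
    """Sketch of a BWE-based translation probability: a softmax over the
    cosine similarities between a source phrase embedding and its k
    candidate target phrase embeddings."""
    src = src_vec / np.linalg.norm(src_vec)
    cands = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    sims = cands @ src                    # cosine similarity per candidate
    exps = np.exp(sims / temperature)     # temperature sharpens the scores
    return exps / exps.sum()              # normalized translation probabilities
```

The resulting distribution over the k candidates plays the role of the phrase translation probability in the induced phrase table.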

Task 3: Monolingual word analogy
In the literature, VECMAP and BIVEC BWE have been shown to perform as well as, or better than, word embeddings trained exclusively on monolingual data in monolingual tasks. Since we use significantly less and noisier data for training BIVEC than VECMAP, we assume that this observation may not hold in our configuration.
We tested our assumption with the English word analogy task of Mikolov et al. (2013b) by comparing VECMAP and BIVEC English word embeddings, with several different sets of en-fr synthetic parallel data for training BIVEC. As shown in Table 4, BIVEC led to significantly lower accuracy than VECMAP, especially for the configuration trained on synthetic English (generated from French), with a gap of 32.2 points. We also observed a lower accuracy when using original English, presumably due to the use of much smaller data than for training VECMAP. However, when training monolingual word embeddings using fastText on the same English data used for training BIVEC, we observed that fastText underperforms BIVEC. This confirms that BIVEC can take advantage of noisy but bilingual contexts to improve word embeddings monolingually.
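The analogy evaluation used here follows the standard 3CosAdd formulation, which can be sketched as follows over a toy `{word: vector}` dict (a hypothetical data structure for illustration; the real evaluation runs over the full vocabulary and excludes the query words, as done here on a small scale).

```python
import numpy as np

def analogy(emb, a, b, c):
    """3CosAdd analogy (Mikolov et al., 2013b): answer 'a is to b as c is
    to ?' with the word whose vector is closest in cosine similarity to
    b - a + c, excluding the three query words."""
    query = emb[b] - emb[a] + emb[c]
    query = query / np.linalg.norm(query)
    best, best_sim = None, -np.inf
    for word, vec in emb.items():
        if word in (a, b, c):
            continue
        sim = query @ (vec / np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best
```

Accuracy on the task is the fraction of analogy questions for which the returned word matches the gold answer.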

Conclusion and future work
We show in several cross-lingual NLP tasks that unsupervised joint BWE achieve better results than unsupervised mapped BWE. Our experiments also highlight the robustness of joint training, which can take advantage of bilingual contexts even from very noisy synthetic parallel data. Since our approach works on top of unsupervised mapping for BWE and uses synthetic data generated by unsupervised MT, it will directly benefit from any future advances in these two types of techniques. Our approach has, however, a higher computational cost due to the need to generate synthetic parallel data, although generating more data would also improve vocabulary coverage. As future work, we would like to study the impact, for training BWE, of using synthetic parallel data generated by unsupervised NMT, or of a different nature, such as translation pairs extracted from monolingual corpora without supervision. Such translation pairs are, in general, more fluent but potentially much less accurate.

Table 6: Results in BLI of VECMAP, BIVEC, and SENTID BWE, on the "full" MUSE bilingual lexicons, without filtering the vocabulary. In other words, the compared BWE do not have the same vocabulary. The coverage is given by VECMAP's evaluation script.

A Preliminary experiment
To empirically test our assumption on the robustness of BIVEC to the noisiness of training data, we performed a preliminary experiment. First, we trained low-quality SMT systems for en→de and en→fr on small parallel corpora. Then, a synthetic version of Europarl was compiled by coupling the English side of the Europarl parallel corpora with its German and French translations generated by the SMT systems. Finally, with BIVEC, we obtained two types of BWE, respectively from the original and the synthetic Europarl, and evaluated them in bilingual lexicon induction (BLI) tasks on the test sets used in Section 3.2.
Results are presented in Table 5. Despite the poor performance of our SMT systems, BWE learned from the synthetic Europarl were only slightly less accurate for BLI than the BWE learned from the original Europarl. This result supports our assumption that BIVEC can exploit noisy synthetic data produced by SMT.

B Bilingual lexicon induction: coverage statistics
To show how the vocabulary coverage varies between BWE spaces, and to evaluate their impact on the accuracy in BLI, we report in Table 6 the coverage and the accuracy in BLI for all the BWE evaluated without restricting their vocabulary to be the same. Note that, because of the differences in coverage, accuracy of joint BWE cannot directly be compared with VECMAP BWE.