Unsupervised Bilingual Word Embedding Agreement for Unsupervised Neural Machine Translation

Unsupervised bilingual word embedding (UBWE), together with other technologies such as back-translation and denoising, has helped unsupervised neural machine translation (UNMT) achieve remarkable results for several language pairs. In previous methods, UBWE is first trained using non-parallel monolingual corpora, and the pre-trained UBWE is then used to initialize the word embeddings in the encoder and decoder of UNMT. That is, UBWE and UNMT are trained separately. In this paper, we first empirically investigate the relationship between UBWE and UNMT. The empirical findings show that the performance of UNMT is significantly affected by the performance of UBWE. We therefore propose two methods that train UNMT with UBWE agreement. Empirical results on several language pairs show that the proposed methods significantly outperform conventional UNMT.

Several unsupervised BWE (UBWE) methods (Conneau et al., 2018; Artetxe et al., 2018a) have been proposed, achieving impressive performance in word-translation tasks. The success of UBWE makes unsupervised neural machine translation (UNMT) possible.
The combination of UBWE with the denoising auto-encoder and back-translation has led to UNMT that relies solely on monolingual corpora, with remarkable results reported for several language pairs such as English-French and English-German (Artetxe et al., 2018c; Lample et al., 2018a).
In previous methods, UBWE is first trained using non-parallel monolingual corpora. This pre-trained UBWE is then used to initialize the word embeddings in the encoder and decoder of UNMT. That is, the training of UBWE and UNMT take place in separate steps. In this paper, we first empirically investigate the relationship between UBWE and UNMT. Our empirical results show that:
• 1) There is a positive correlation between the quality of the pre-trained UBWE and the performance of UNMT.
• 2) The quality of UBWE degrades significantly during UNMT training.
Based on these two findings, we hypothesize that training UNMT with UBWE agreement would enhance UNMT performance. Specifically, we propose two approaches, UBWE agreement regularization and UBWE adversarial training, to maintain the quality of UBWE during NMT training. Empirical results on several language pairs show that the proposed methods significantly outperform the original UNMT. The remainder of this paper is organized as follows. In Section 2, we introduce the background of UNMT. The results of preliminary experiments are presented and analyzed in Section 3. In Section 4, we propose methods to jointly train UNMT with UBWE agreement. In Sections 5 and 6, we describe experiments to evaluate the performance of our approach and analyze the results. Section 7 introduces some related work and Section 8 concludes the paper.

Background of UNMT
There are three primary components of UNMT: UBWE initialization, denoising auto-encoder, and back-translation.
Consider a sentence X in language L1 and a sentence Y in another language L2. The data spaces of the L1 sentence X and the L2 sentence Y are denoted by φ_L1 and φ_L2, respectively.
After initialization by UBWE, the encoders and decoders of L1 and L2 are trained through denoising and back-translation. The objective function L_all of the entire UNMT model is optimized as

L_all = L_auto + L_bt,

where L_auto is the objective function for denoising auto-encoding and L_bt is the objective function for back-translation.

Bilingual Word Embedding Initialization
Unlike supervised NMT (Bahdanau et al., 2015; Chen et al., 2017a,b, 2018a; Vaswani et al., 2017), there are no bilingual supervised signals in UNMT. Fortunately, UBWE (Zhang et al., 2017; Artetxe et al., 2018a; Conneau et al., 2018) successfully learns translation equivalences between word pairs from two monolingual corpora. Typically, UBWE initializes the embedding of the vocabulary for the encoder and decoder of UNMT. The pre-trained UBWE provides naive translation knowledge that enables back-translation to generate pseudo-supervised bilingual signals (Artetxe et al., 2018c; Lample et al., 2018a). The embeddings of the encoder and decoder then change independently during the UNMT training process.

Denoising Auto-encoder
Without constraints, the auto-encoder struggles to learn knowledge useful for UNMT; it degenerates into a copying task that learns to copy the input words one by one (Lample et al., 2018a). To alleviate this problem, we adopt the denoising auto-encoder strategy (Vincent et al., 2010), in which noise in the form of random token swaps is introduced into the input sentence to improve the model's learning ability (Hill et al., 2016; He et al., 2016). The denoising auto-encoder, which encodes a noisy version of a sentence and reconstructs it with the decoder in the same language, is optimized by minimizing the objective function

L_auto = E_{X~φ_L1}[−log P_{L1→L1}(X|C(X))] + E_{Y~φ_L2}[−log P_{L2→L2}(Y|C(Y))],

where C(X) and C(Y) are noisy versions of sentences X and Y, and P_{L1→L1} (P_{L2→L2}) denotes the reconstruction probability in language L1 (L2).
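As a concrete illustration, the corruption function C(·) can be sketched as word dropout plus a bounded local shuffle. This is a minimal stand-in in the spirit of Lample et al. (2018a); the parameter names and exact noise scheme are illustrative, not taken from this paper:

```python
import random

def add_noise(tokens, drop_prob=0.1, k=3, seed=None):
    """Return a noisy copy of a token list: drop each token with
    probability drop_prob, then apply a local shuffle in which every
    surviving token moves at most k positions from its original slot."""
    rng = random.Random(seed)
    # Word dropout (keep at least one token so the sentence is non-empty).
    kept = [t for t in tokens if rng.random() > drop_prob] or tokens[:1]
    # Local shuffle: sorting by (index + uniform noise in [0, k]) can
    # only displace a token by at most k positions.
    keys = [i + rng.uniform(0, k) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept), key=lambda p: p[0])]
```

The denoising auto-encoder then learns to map add_noise(X) back to the original X.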

Back-translation
The denoising auto-encoder acts as a language model trained within one language and does not consider the final goal of translating between two languages. Therefore, back-translation (Sennrich et al., 2016) is adopted to train translation systems in a true translation setting using only monolingual corpora. Formally, given the sentences X and Y, the model at the previous iteration produces the translations Y_P(X) and X_P(Y).
The pseudo-parallel sentence pairs (Y_P(X), X) and (X_P(Y), Y) are then used to train the new translation model. Finally, the back-translation process is optimized by minimizing the objective function

L_bt = E_{X~φ_L1}[−log P_{L2→L1}(X|Y_P(X))] + E_{Y~φ_L2}[−log P_{L1→L2}(Y|X_P(Y))],

where P_{L1→L2} (P_{L2→L1}) denotes the translation probability across the two languages.
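The pair construction above can be sketched as follows. The two "models" are placeholder callables standing in for the previous iteration's translation model; everything except the pairing logic is illustrative:

```python
def back_translation_pairs(mono_l1, mono_l2, model_l1_to_l2, model_l2_to_l1):
    """Build pseudo-parallel training pairs from monolingual data:
    (Y_P(X), X) is used to train the L2->L1 direction and
    (X_P(Y), Y) to train the L1->L2 direction."""
    pairs_for_l2_to_l1 = [(model_l1_to_l2(x), x) for x in mono_l1]
    pairs_for_l1_to_l2 = [(model_l2_to_l1(y), y) for y in mono_l2]
    return pairs_for_l2_to_l1, pairs_for_l1_to_l2

# Toy usage with stand-in "translators" (upper/lower-casing).
p21, p12 = back_translation_pairs(["un chat"], ["A DOG"], str.upper, str.lower)
```

Because the source side of each pair is synthetic while the target side is genuine text, the decoder is always trained toward real sentences.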

Preliminary Experiments
To investigate the relationship between UBWE and UNMT, we choose one similar language pair (English-French, from the same language family) and one distant language pair (English-Japanese, from different language families) as the corpora. The detailed experimental settings for UBWE and UNMT are given in Section 5. Precision@1 indicates the accuracy of word translation using the top-1 predicted candidate in the MUSE test set. As the UBWE accuracy increased, the UNMT performance on both language pairs increased. This indicates that the quality of the pre-trained UBWE is important for UNMT. Figure 2 shows the trend in UBWE accuracy and BLEU score as UNMT proceeds through the training stage. VecMap was used to pre-train the word embeddings for the encoder and decoder of UNMT. We used the source embedding of the encoder and the target embedding of the decoder to calculate the word translation accuracy on the MUSE test set during UNMT training. Regardless of the language pair, the UBWE performance decreased significantly over the course of UNMT training, as shown in Figure 2.
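The Precision@1 criterion can be sketched as nearest-neighbor retrieval in the shared embedding space. The dict-based representation and function name below are illustrative; MUSE's actual evaluation operates on large embedding matrices:

```python
def precision_at_1(src_emb, tgt_emb, test_dict):
    """Fraction of test word pairs whose nearest target embedding
    (by cosine similarity) is the reference translation. Embeddings
    are plain dicts mapping a word to its vector."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv)
    hits = 0
    for src, ref in test_dict:
        # Retrieve the target word closest to the source embedding.
        best = max(tgt_emb, key=lambda w: cos(src_emb[src], tgt_emb[w]))
        hits += best == ref
    return hits / len(test_dict)
```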

Analysis
The empirical results in this section show that the quality of pre-trained UBWE is important to UNMT. However, the quality of UBWE decreases significantly during UNMT training. We hypothesize that maintaining the quality of UBWE may enhance the performance of UNMT.
In this subsection, we analyze some possible solutions to this issue.

Use fixed embedding? As Figure 2 shows, the UBWE performance decreases significantly during the UNMT training process. We therefore tried fixing the embeddings of the encoder and decoder in the original baseline system (Baseline-fix). Table 1 shows that the performance of the Baseline-fix system is quite similar to that of the original baseline system. In other words, Baseline-fix prevents the degradation of UBWE accuracy, but the fixed embeddings also prevent UBWE from further contributing to UNMT training. Therefore, fixed UBWE does not enhance the performance of UNMT.

For English-French and English-German UNMT, Lample et al. (2018b) concatenated the two monolingual corpora into a single corpus. They adopted BPE to enlarge the number of shared subwords between the two languages, and the pre-trained monolingual subword embedding was used as the initialization for UNMT. Because there are many shared subwords in these similar language pairs, this method achieves better performance than other UBWE methods. However, this initialization does not work for distant language pairs such as English-Japanese and English-Chinese, where there are few shared subwords. Using word-based embeddings in UNMT is more universal. In addition, word-based embeddings are easy to combine with UBWE technology. Therefore, we do not adopt BPE in the proposed method.

Train UNMT with UBWE Agreement
Based on previous empirical findings and analyses, we propose two joint agreement mechanisms, i.e., UBWE agreement regularization and UBWE adversarial training, that enable UBWE and UNMT to interact during the training process, resulting in improved translation performance. Figure 3 illustrates the architecture of UNMT and the proposed agreement mechanisms.
Generally, during UNMT training, an objective function L_BWE is added to ensure UBWE agreement. The general UNMT objective function can be reformulated as follows:

L_all = L_auto + L_bt + λ L_BWE,    (4)

where λ is a hyper-parameter that weights the UBWE agreement term against the denoising and back-translation terms.

UBWE Agreement Regularization
On the basis of the existing architecture of UNMT, we introduce UBWE agreement regularization during back-translation to maintain the UBWE accuracy of the encoder and decoder during UNMT training. A similarity function Similarity(·, ·) between encoder and decoder embeddings is used to measure the UBWE accuracy, and the objective function L_BWE is

L_BWE = L_agreement = −(Similarity(enc_L1, dec_L2) + Similarity(enc_L2, dec_L1)),

where enc_L1 and enc_L2 denote all word embeddings of the L1 and L2 encoders, respectively, and dec_L1 and dec_L2 denote all word embeddings of the L1 and L2 decoders, respectively. Minimizing L_agreement thus maximizes the cross-lingual similarity between encoder and decoder embeddings.
As there is no test or development set that can serve as a bilingual dictionary in UNMT, before computing Similarity(L1, L2) we need to generate a synthetic word-pair dictionary with which to measure the UBWE accuracy during NMT training. Motivated by Conneau et al. (2018), we use the cross-domain similarity local scaling (CSLS) to measure the UBWE accuracy; it can also be viewed as the similarity between a source word embedding and a target word embedding:

CSLS(enc_{x_i}, dec_{y_i}) = 2 cos(enc_{x_i}, dec_{y_i}) − (1/K) Σ_{y∈N(x_i)} cos(enc_{x_i}, dec_y) − (1/K) Σ_{x∈N(y_i)} cos(enc_x, dec_{y_i}),

where y ∈ N(x_i) denotes the K nearest neighbors of the source word x_i, and similarly for x ∈ N(y_i); enc_{x_i} denotes the embedding of word x_i in encoder L1 and dec_{y_i} denotes the embedding of word y_i in decoder L2. As the entire vocabulary is large, we select a subset as the synthetic word-pair dictionary: ranking by CSLS, we select the most accurate word pairs {x_i, y_i} as the synthetic dictionary Dict_{x→y}. The opposite word pairs Dict_{y→x} = {y_j, x_j} are obtained by the same method, where enc_{y_j} denotes the embedding of word y_j in encoder L2 and dec_{x_j} denotes the embedding of word x_j in decoder L1. Both dictionary sizes are set to |Dict|. The similarity between the word embeddings in the encoder and decoder is then measured as the mean CSLS over the dictionary:

Similarity(enc_L1, dec_L2) = (1/|Dict|) Σ_{(x_i, y_i)∈Dict_{x→y}} CSLS(enc_{x_i}, dec_{y_i}),

and Similarity(enc_L2, dec_L1) is computed analogously over Dict_{y→x}. This similarity between the word pairs in Dict is used for UBWE agreement regularization during back-translation. Note that the synthetic word-pair dictionary is dynamically re-selected in each epoch of UNMT training.
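A brute-force sketch of CSLS scoring and synthetic-dictionary selection follows (pure Python over toy dicts; a real implementation would use batched matrix operations over the full vocabularies, and the function and variable names here are our own):

```python
from math import sqrt

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def csls(x_vec, y_vec, tgt_vecs, src_vecs, k=2):
    """CSLS(x, y) = 2 cos(x, y) - r_T(x) - r_S(y), where r_T(x) is the
    mean cosine of x to its k nearest target vectors and r_S(y) the
    mean cosine of y to its k nearest source vectors."""
    r_t = sum(sorted((cos(x_vec, t) for t in tgt_vecs), reverse=True)[:k]) / k
    r_s = sum(sorted((cos(y_vec, s) for s in src_vecs), reverse=True)[:k]) / k
    return 2 * cos(x_vec, y_vec) - r_t - r_s

def build_synthetic_dict(src_emb, tgt_emb, size, k=2):
    """Rank all (source, target) word pairs by CSLS and keep the top
    `size` pairs as the synthetic dictionary."""
    tgt_vecs, src_vecs = list(tgt_emb.values()), list(src_emb.values())
    scored = [(csls(sv, tv, tgt_vecs, src_vecs, k), s, t)
              for s, sv in src_emb.items() for t, tv in tgt_emb.items()]
    scored.sort(reverse=True)
    return [(s, t) for _, s, t in scored[:size]]
```

Ranking by CSLS rather than raw cosine penalizes "hub" words that are close to everything, which matters when the dictionary is selected without any supervision.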

UBWE Adversarial Training
In UBWE, there is a transformation matrix that projects the source word embeddings onto the target word embeddings. Motivated by Conneau et al. (2018), we induce this transformation matrix using an adversarial approach.
The generator is estimated as

G_1(enc_x) = W_1 enc_x,

where enc_x is the L1 encoder word embedding, dec_y is the corresponding L2 decoder word embedding, and W_1 is the transformation matrix that projects the embedding space of enc_x onto that of dec_y. The discriminator D_1 is a multilayer perceptron that outputs the probability that a word embedding comes from a given language; it is trained to discriminate the language to which an embedding drawn from W_1 enc_x or dec_y belongs. W_1 is trained to confuse the discriminator D_1 by making W_1 enc_x and dec_y increasingly similar. In other words, we train D_1 to maximize the probability of assigning the correct language to the original word embeddings and to samples from G_1, while the generator G_1 is trained to minimize log(1 − D_1(G_1(enc_x))). Thus, the two-player minimax game (Goodfellow et al., 2014) with value function V(G_1, D_1) is optimized as

min_{G_1} max_{D_1} V(G_1, D_1) = E_{dec_y}[log D_1(dec_y)] + E_{enc_x}[log(1 − D_1(G_1(enc_x)))].    (12)

D_2 and G_2 are defined analogously for the opposite direction. The objective functions for the discriminator D_1 and the generator G_1 can be written as

L_{D_1} = −E_{dec_y}[log D_1(dec_y)] − E_{enc_x}[log(1 − D_1(W_1 enc_x))],
L_{G_1} = E_{enc_x}[log(1 − D_1(W_1 enc_x))],

and L_{D_2} and L_{G_2} are defined analogously. After introducing UBWE adversarial training into UNMT, the L_BWE objective function is minimized as

L_BWE = L_adv = L_adv1 + L_adv2,

where L_adv1 = L_{G_1} + L_{D_1} and L_adv2 = L_{G_2} + L_{D_2}. The proposed L_BWE (L_agreement or L_adv) is added to L_all in Eq. 4 during back-translation in UNMT training, as shown in Figure 3.
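The discriminator and generator objectives can be sketched numerically. The two functions below compute losses of the L_D1/L_G1 form from pre-sigmoid discriminator logits; the logit parameterization and batch averaging are our choices, and in a real system the logits would come from the MLP discriminator applied to dec_y and W_1 enc_x:

```python
from math import log, exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def discriminator_loss(logits_real, logits_fake):
    """L_D1: push D1 toward 1 on real decoder embeddings dec_y and
    toward 0 on projected encoder embeddings W1*enc_x."""
    real = [-log(sigmoid(z)) for z in logits_real]
    fake = [-log(1.0 - sigmoid(z)) for z in logits_fake]
    return (sum(real) + sum(fake)) / (len(real) + len(fake))

def generator_loss(logits_fake):
    """L_G1: the generator W1 minimizes log(1 - D1(W1*enc_x)), i.e. it
    is rewarded when D1 mistakes projected source embeddings for
    target embeddings."""
    return sum(log(1.0 - sigmoid(z)) for z in logits_fake) / len(logits_fake)
```

A confident, correct discriminator (high logits on real samples, low on fake) gives a near-zero discriminator loss, while fooling it drives the generator loss down.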

Datasets
The proposed methods were evaluated on three language pairs: French-English (Fr-En), German-English (De-En), and Japanese-English (Ja-En). Fr-En and De-En are similar European language pairs; for these, we used 30 million sentences from the WMT monolingual News Crawl datasets from 2007 to 2013. Ja-En is a distant language pair, so UBWE training is much more difficult than for similar European language pairs (Søgaard et al., 2018). In addition, Japanese and English belong to different language families and their word orderings differ substantially. As a result, the performance of Ja-En UNMT is too poor to support further empirical study if only pure monolingual data are used. Therefore, we constructed simulated experiments using shuffled parallel sentences, i.e., 3.0M sentence pairs from the ASPEC corpus for Ja-En. We report results on WMT newstest2014 for Fr-En, WMT newstest2016 for De-En, and the WAT-2018 ASPEC test set for Ja-En.

UBWE Settings
For UBWE training, we first used the monolingual corpora described above to train the embeddings for each language independently with fastText (Bojanowski et al., 2017) (default settings). The word embeddings were normalized by length and mean-centered before bilingual projection. We then used VecMap (Artetxe et al., 2018a) (default settings) to project the two monolingual word embeddings into one space.

Table 2: Performance (BLEU score) of UNMT. "++" after a score indicates that the proposed method was significantly better than the UNMT baseline at significance level p < 0.01.
To evaluate the quality of UBWE, we selected the accuracy of word translation using the top-1 predicted candidate in the MUSE test set as the criterion.

UNMT Settings
In the training process for UNMT, we used the transformer-based UNMT toolkit and the settings of Lample et al. (2018b). That is, we used four layers in both the encoder and the decoder; three of the four encoder and decoder layers were shared between the source and target languages. The dimension of the hidden layers was set to 512. Training used a batch size of 32 and the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 0.0001 and β1 = 0.5. The vocabulary size was set to 60k by concatenating the source and target corpora. We performed 140 epochs (approximately 500K iterations) to train every model; note that the definition of an epoch in UNMT differs from that in supervised NMT, and we followed the settings in Lample et al. (2018b)'s toolkit, i.e., 3,500 iterations per epoch. The case-sensitive BLEU score computed with the multi-bleu.perl script from Moses (https://github.com/moses-smt/mosesdecoder) was used as the evaluation metric. For model selection, we followed the strategy described by Lample et al. (2018a): the BLEU score computed between the original source sentences and their reconstructions was used as the criterion, and we selected the model with the highest average BLEU score over the two translation directions.
For the proposed methods, both UBWE agreement regularization and UBWE adversarial training were added as objective functions at the beginning of UNMT training. The detailed parameter settings are discussed in Section 6.

Performance

Figure 4 shows the trend in UBWE quality and BLEU score during UNMT training on Fr-En and Ja-En. Our observations are as follows:
1) For all systems, the UBWE accuracy decreases during UNMT training. This is consistent with our finding in the preliminary experiments.
2) For the system with UBWE agreement regularization and UBWE adversarial training, UBWE accuracy decreased much more slowly than in the original baseline system. This indicates that the proposed methods effectively mitigated the degradation of UBWE accuracy.
3) Of the two proposed methods, UBWE agreement regularization was better at mitigating the degradation of UBWE accuracy than UBWE adversarial training.

Table 2 presents the detailed BLEU scores of the UNMT systems on the De-En, Fr-En, and Ja-En test sets. Our observations are as follows:

1) Our re-implemented baseline performed comparably to the state-of-the-art method of Lample et al. (2018b), indicating that the baseline is a strong system.
2) The proposed methods significantly outperformed the corresponding baseline on all language pairs, by 1∼3 BLEU points.
3) Of the two proposed methods, UBWE adversarial training achieved slightly higher BLEU scores than UBWE agreement regularization, even though agreement regularization was better at maintaining UBWE accuracy. The reason may be that agreement regularization is simply added to the training objective of UNMT, whereas UBWE adversarial training is jointly trained with UNMT and thus interacts more with the UNMT model.

Discussion
We now analyze the effect of the hyperparameters.
There are two primary factors that affect the performance of the proposed methods: the synthetic word-pair dictionary size for UBWE agreement regularization and the hyper-parameter λ for UBWE adversarial training.

Effect of Dictionary Size
We first evaluated the impact of the synthetic word-pair dictionary size |Dict| during UBWE agreement regularization training on the Fr-En task. As indicated by Table 3, almost all models with different dictionary sizes outperformed the baseline system, indicating that the proposed method is robust. We also investigated the relationship between dictionary size and UBWE accuracy. As shown in Fig. 5, a larger dictionary size results in a slower decrease in UBWE accuracy, which suggests that a larger dictionary helps estimate a better UBWE agreement. However, a larger dictionary size did not always yield a higher BLEU score, as shown in Table 3. The model with a dictionary size of 3000 achieved the best performance.

Effect of Hyper-parameter λ
In Figure 6, we empirically investigate how the hyper-parameter λ in Eq. (4) affects UNMT performance on the Fr-En task. The selection of λ determines the role of L_BWE across the entire UNMT training process: larger values of λ cause L_BWE to play a more important role than the back-translation and denoising loss terms, while smaller values make L_BWE less important. As Figure 6 shows, values of λ ranging from 0.01 to 10 nearly all enhanced UNMT performance, and a balanced λ = 1 achieved the best performance.

Efficiency
We now discuss the efficiency of our proposed methods. Table 4 indicates that UBWE agreement regularization does not increase the number of parameters, and UBWE adversarial training adds very few. The training speed of these methods is almost the same as the baseline, and the proposed methods do not affect UNMT decoding. Thus, our proposed methods have a negligible effect on the overall speed of the model.

Related Work
Supervised BWE (Mikolov et al., 2013), which exploits similarities between the source language and the target language through a linear transformation matrix, serves as the basis for many NLP tasks, such as machine translation (Bahdanau et al., 2015; Vaswani et al., 2017; Chen et al., 2018b), dependency parsing (Zhang et al., 2016), and semantic role labeling (Li et al., 2019). However, the lack of a large word-pair dictionary poses a major practical problem for many language pairs, and UBWE has therefore attracted considerable attention. For example, Artetxe et al. (2017) proposed a self-learning framework to learn BWE from a 25-word seed dictionary, and Artetxe et al. (2018a) extended this work to dispense with the word dictionary entirely via fully unsupervised initialization. Zhang et al. (2017) and Conneau et al. (2018) proposed UBWE methods based on generative adversarial network training.
Recently, several UBWE methods (Conneau et al., 2018; Artetxe et al., 2018a) have been applied to UNMT (Artetxe et al., 2018c; Lample et al., 2018a). These systems rely solely on monolingual corpora in each language via UBWE initialization, a denoising auto-encoder, and back-translation. A shared encoder was used to encode the source sentences, which were then decoded from a shared latent space (Artetxe et al., 2018c; Lample et al., 2018a); the difference is that Lample et al. (2018a) used a single shared decoder whereas Artetxe et al. (2018c) leveraged two independent decoders, one per language. Yang et al. (2018) used two independent encoders with a weight-sharing mechanism to retain the uniqueness and internal characteristics of each language. Lample et al. (2018b) achieved remarkable results on several similar language pairs, such as English-French, by concatenating the two monolingual corpora into one corpus and using monolingual embedding pre-training in the initialization step. This initialization achieves better performance than other UBWE methods. However, it does not work for distant language pairs such as English-Japanese, which is why we did not use this initialization process for UBWE in our method.
In addition, alternative unsupervised methods based on statistical machine translation (SMT) have been proposed (Lample et al., 2018b; Artetxe et al., 2018b), and unsupervised machine translation performance has been further improved by combining UNMT with unsupervised SMT (Ren et al., 2019; Artetxe et al., 2019). More recently, Lample and Conneau (2019) achieved better UNMT performance by introducing a pre-trained language model. Neural network based language models have also been shown to be helpful in supervised machine translation (Wang et al., 2014). We believe that the proposed agreement mechanism can work together with such pre-trained language models.

Conclusion
UBWE is a fundamental component of UNMT. In previous methods, the pre-trained UBWE is used only to initialize the word embeddings of UNMT. In this study, we found that the performance of UNMT is significantly affected by the quality of UBWE, not only in the initialization stage but also during UNMT training. Based on this finding, we proposed two joint learning methods to train UNMT with UBWE agreement. Empirical results on several language pairs show that the proposed methods mitigate the decrease in UBWE accuracy and significantly improve the performance of UNMT.