Self-Training for Unsupervised Neural Machine Translation in Unbalanced Training Data Scenarios

Unsupervised neural machine translation (UNMT) that relies solely on massive monolingual corpora has achieved remarkable results in several translation tasks. However, in real-world scenarios, massive monolingual corpora do not exist for some extremely low-resource languages such as Estonian, and UNMT systems usually perform poorly when there is no adequate training corpus for one language. In this paper, we first define and analyze the unbalanced training data scenario for UNMT. Based on this scenario, we propose UNMT self-training mechanisms to train a robust UNMT system and improve its performance in this scenario. Experimental results on several language pairs show that the proposed methods substantially outperform conventional UNMT systems.

However, in real-world scenarios, in contrast to the many large corpora available for high-resource languages such as English and French, massive monolingual corpora do not exist for some extremely low-resource languages such as Estonian. A UNMT system usually performs poorly in a low-resource scenario when there is no adequate training corpus for one language.

* Part of this work was done when Haipeng Sun and Rui Wang were an internship research fellow and a researcher at NICT, respectively. † Corresponding author.
In this paper, we first define and analyze the unbalanced training data scenario for UNMT. Based on this scenario, we propose a self-training mechanism for UNMT. Specifically, we propose self-training with unsupervised training (ST-UT) and self-training with pseudo-supervised training (ST-PT) strategies to train a robust UNMT system that performs better in this scenario. To the best of our knowledge, this paper is the first work to explore the unbalanced training data scenario in UNMT. Experimental results on several language pairs show that the proposed strategies substantially outperform conventional UNMT systems.

Unbalanced Training Data Scenario
In this section, we first define the unbalanced training data scenario according to training data size. Consider one monolingual corpus {X} in a high-resource language L1 and another monolingual corpus {Y} in a low-resource language L2. The data sizes of {X} and {Y} are denoted by |X| and |Y|, respectively. In an unbalanced training data scenario, |X| is generally much larger than |Y|, so that the training data {X} are not fully utilized.
To investigate UNMT performance in an unbalanced training data scenario, we empirically chose English (En)-French (Fr) as the language pair. The detailed experimental settings for UNMT are given in Section 5. We used a transformer-based XLM toolkit and followed the settings of Lample and Conneau (2019). To create small corpora and simulate unbalanced training data scenarios, we randomly extracted 2 million sentences for each language from the 50 million sentences in the En and Fr training corpora. Table 1 shows the UNMT performance for different training data sizes. The performance of the configuration with 25M training sentences for both French and English is similar to that of the baseline (50M training sentences for both languages). However, UNMT performance decreased substantially (4-5 BLEU points) when the size of the training data decreased rapidly. In the unbalanced training data scenario, additional training data for one language were not fully utilized and only slightly improved the UNMT BLEU score: the performance of the 2M/50M configuration is similar to that of a UNMT system trained with 2M sentences for both French and English. In short, Table 1 demonstrates that UNMT performance is bounded by the smaller monolingual corpus. The UNMT model converges, and even over-fits, on the low-resource language while the model on the high-resource language has not yet converged. This observation motivates us to make better use of the larger monolingual corpus in the unbalanced training data scenario.
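The subsampling used to build these simulated scenarios amounts to uniform random sampling from the full corpus. The sketch below illustrates this with toy corpus sizes standing in for the 2M/50M setting (the corpus contents and the `subsample` helper are illustrative, not the paper's code):

```python
import random

def subsample(sentences, k, seed=0):
    """Randomly draw k sentences to simulate a smaller monolingual corpus."""
    rng = random.Random(seed)
    return rng.sample(sentences, k)

# Illustrative stand-ins: a "50M-line" high-resource corpus next to a
# much smaller subsample gives an unbalanced (e.g. 2M/50M) configuration.
corpus_fr = [f"fr sentence {i}" for i in range(1000)]
corpus_fr_small = subsample(corpus_fr, 40)
```

With a fixed seed the subsample is reproducible across runs, which keeps the simulated scenarios comparable.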

Background
We first briefly describe the three components of the UNMT model (Lample and Conneau, 2019): cross-lingual language model pre-training, the denoising auto-encoder (Vincent et al., 2010), and back-translation (Sennrich et al., 2016a). Cross-lingual language model pre-training provides a naive bilingual signal that enables back-translation to generate pseudo-parallel corpora at the beginning of training. The denoising auto-encoder acts as a language model to improve translation quality by randomly performing local substitutions and word reorderings.
Generally, back-translation plays an important role in achieving unsupervised translation between two languages. The pseudo-parallel sentence pairs produced by the model at the previous iteration are used to train the new translation model. The back-translation objective is optimized by maximizing

$$\mathbb{E}_{X\sim P(X)}\log P_{M_U}\big(X \mid M_{U^*}(X)\big) + \mathbb{E}_{Y\sim P(Y)}\log P_{M_U}\big(Y \mid M_{U^*}(Y)\big),$$

where P(X) and P(Y) are the empirical data distributions of the monolingual corpora {X} and {Y}, and P_{M_U}(Y|X) and P_{M_U}(X|Y) are the conditional distributions generated by the UNMT model M_U. In addition, M_{U^*} denotes the model at the previous iteration, which generates new pseudo-parallel sentence pairs to update the UNMT model.

Self-training, proposed by Scudder (1965), is a semi-supervised approach that utilizes unannotated data to build better models. Self-training has been successfully applied to many natural language processing tasks (Yarowsky, 1995; McClosky et al., 2006; Zhang and Zong, 2016; He et al., 2020). Recently, He et al. (2020) empirically found that noisy self-training could improve the performance of supervised machine translation and that synthetic data could play a positive role, even on the target side.
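The back-translation loop described above can be sketched as follows. The `translate`/`train_step` interface and the `StubModel` are illustrative assumptions, not the XLM toolkit's actual API:

```python
class StubModel:
    """Toy stand-in for a UNMT model: 'translates' by reversing word order
    and records the pseudo-parallel pairs it is trained on."""
    def __init__(self):
        self.pairs = []
    def translate(self, sentence, direction):
        return " ".join(reversed(sentence.split()))
    def train_step(self, src, tgt):
        self.pairs.append((src, tgt))

def back_translation_epoch(model_prev, model, mono_x, mono_y):
    """One back-translation round: the previous-iteration model (M_U*)
    generates a synthetic translation of each monolingual sentence, and
    the current model (M_U) is trained to reconstruct the original
    sentence from its synthetic translation."""
    for x in mono_x:
        y_syn = model_prev.translate(x, direction="l1->l2")
        model.train_step(src=y_syn, tgt=x)  # maximize log P(X | M_U*(X))
    for y in mono_y:
        x_syn = model_prev.translate(y, direction="l2->l1")
        model.train_step(src=x_syn, tgt=y)  # maximize log P(Y | M_U*(Y))
```

Note that the real monolingual sentence always sits on the target side, which is what keeps the training signal clean even when the synthetic source is noisy.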

Self-training Mechanism for UNMT
Based on these previous empirical findings and analyses, we propose a self-training mechanism that generates synthetic training data for UNMT to alleviate its poor performance in the unbalanced training data scenario. The synthetic data increase the diversity of the low-resource language data and further enhance translation performance, even though the synthetic data may be noisy. As the UNMT model is trained, the quality of the synthetic data improves, introducing less and less noise. In contrast to the original UNMT model, in which synthetic data are used only on the source side, our proposed methods also use synthetic data on the target side. Newly generated synthetic data, together with the original monolingual data, are fully utilized to train a robust UNMT system in this scenario. According to how the generated synthetic training data are used, our approach is divided into two strategies: ST-UT (Algorithm 1) and ST-PT (Algorithm 2).
ST-UT: In this strategy, we first train a UNMT model on the existing monolingual training data. The final UNMT system is then trained with the ST-UT strategy for k1 epochs. In each epoch l of the ST-UT strategy, a subset {X_sub} is selected randomly from the monolingual training data {X}; the size of {X_sub} is a fixed fraction of |X|, controlled by a quantity ratio hyperparameter. The synthetic data generated from {X_sub}, together with the monolingual data, are used to train a new UNMT model M_U^l. The translation probability for the ST-UT strategy is therefore optimized by maximizing

$$\mathbb{E}_{X\sim P(X_{sub})}\log P_{M_U^l}\big(M_{U^*}(X)\mid X\big),$$

where M_{U^*} denotes the model at the previous epoch, whose output M_{U^*}(X) is used as the target side of the synthetic pair.

ST-PT: In this strategy, we first train a UNMT system on the existing monolingual training data and then switch from the UNMT system to a standard neural machine translation system trained on synthetic parallel data for both translation directions. The final translation system is trained with the ST-PT strategy for k2 epochs. In each epoch q of the ST-PT strategy, a subset {X_sub} is selected randomly from the monolingual training data {X}, with its size again a quantity-ratio fraction of |X|, and all monolingual data {Y} are selected. The last trained pseudo-supervised neural machine translation (PNMT) model is applied to generate new synthetic parallel data, and the objective is optimized by maximizing

$$\mathbb{E}_{X\sim P(X_{sub})}\log P_{M_P^q}\big(M_{P^*}(X)\mid X\big) + \mathbb{E}_{Y\sim P(Y)}\log P_{M_P^q}\big(M_{P^*}(Y)\mid Y\big),$$

where P_{M_P^q}(Y|X) and P_{M_P^q}(X|Y) are the conditional distributions generated by the PNMT model at epoch q of the ST-PT strategy, and M_{P^*} denotes the PNMT model at the previous epoch that generates the new synthetic parallel data.
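A minimal sketch of how the two strategies assemble their synthetic training pairs, assuming a generic `translate_fn` in place of the previous-epoch model (the function names and interface are hypothetical, not the paper's implementation):

```python
import random

def st_ut_pairs(translate_fn, mono_x, ratio=0.1, seed=0):
    """ST-UT sketch: sample a quantity-ratio subset of the high-resource
    corpus {X}, translate it with the previous-epoch model, and return
    (source, synthetic) pairs in which the synthetic sentence is the
    *target* side."""
    rng = random.Random(seed)
    x_sub = rng.sample(mono_x, max(1, int(ratio * len(mono_x))))
    return [(x, translate_fn(x)) for x in x_sub]

def st_pt_pairs(fwd_fn, bwd_fn, x_sub, mono_y):
    """ST-PT sketch: build synthetic parallel data for *both* directions,
    which is then used to train a standard (pseudo-supervised) NMT model."""
    pairs = [(x, fwd_fn(x)) for x in x_sub]    # real L1 -> synthetic L2
    pairs += [(bwd_fn(y), y) for y in mono_y]  # synthetic L1 <- real L2
    return pairs
```

The key contrast is visible in the pair construction: ST-UT places synthetic text only on the target side of new pairs, while ST-PT uses synthetic text on both sides across the two directions.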

Datasets
We considered three language pairs in our simulation experiments: Fr-En, Romanian (Ro)-En, and Estonian (Et)-En translation tasks. The statistics of the data are presented in the corresponding data table. For preprocessing, we used the Moses tokenizer (Koehn et al., 2007).

Method                     En-Fr  Fr-En  En-Ro  Ro-En  En-Et  Et-En
Lample et al. (2018a)      15.05  14.31  n/a    n/a    n/a    n/a
Artetxe et al. (2018)      15.13  15.56  n/a    n/a    n/a    n/a
Lample et al. (2018b)      27.60  27.68  25.13  23.90  n/a    n/a
Lample and Conneau (2019)

Table 3: Performance (BLEU score) of UNMT in the unbalanced training data scenario. Note that only 2 million Fr monolingual training sentences were used for En-Fr. The quantity ratio was set to 10%. The number of epochs was set to two for both proposed strategies. "++" after a score indicates that the strategy was significantly better than the baseline at significance level p < 0.01.
To clean the data, we only applied the Moses script clean-corpus-n.perl to remove lines from the monolingual data containing more than 50 words. We used a shared vocabulary for all language pairs, with 60,000 subword tokens based on BPE (Sennrich et al., 2016b).
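The length filter applied here can be mimicked in a few lines. This is a simplified sketch of the 50-word cutoff only; the actual clean-corpus-n.perl script also handles parallel files, minimum lengths, and length ratios:

```python
def clean_corpus(lines, max_words=50):
    """Length filter in the spirit of Moses clean-corpus-n.perl:
    drop any line containing more than max_words whitespace-separated tokens."""
    return [line for line in lines if len(line.split()) <= max_words]
```

Applying it to a monolingual file is a one-liner over its lines, before BPE is learned on the cleaned text.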

UNMT Settings
We used a transformer-based XLM toolkit and followed the settings of Lample and Conneau (2019) for UNMT: six layers for the encoder and the decoder. The dimensions of the hidden layers were set to 1024. The batch size was set to 2000 tokens. The Adam optimizer (Kingma and Ba, 2015) was used to optimize the model parameters. The initial learning rate was 0.0001, β1 = 0.9, and β2 = 0.98. We trained a specific cross-lingual language model for each training dataset, which was used to initialize the full parameters of the UNMT system. Eight V100 GPUs were used to train all UNMT models. We used the case-sensitive 4-gram BLEU score computed by the multi-bleu.perl script from Moses (Koehn et al., 2007) to evaluate the test sets.

Table 3 presents the detailed BLEU scores of the UNMT systems on the En-Fr, En-Ro, and En-Et test sets. Our re-implemented baseline performed similarly to the state-of-the-art method of Lample and Conneau (2019) on the En-Ro language pair. Because we used only 2 million Fr monolingual training sentences on the En-Fr language pair, the re-implemented baseline performed slightly worse than Lample and Conneau (2019).

3 http://data.statmt.org/news-crawl/
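For reference, a simplified single-reference version of 4-gram BLEU can be sketched as below. This is only an illustration: multi-bleu.perl additionally performs its own tokenization, supports multiple references, and aggregates counts at the corpus level rather than per sentence:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified single-reference, sentence-level BLEU:
    geometric mean of 1..max_n-gram precisions times a brevity penalty."""
    c, r = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(c, n), ngrams(r, n)
        overlap = sum(min(cnt, ref[g]) for g, cnt in cand.items())
        if overlap == 0:
            return 0.0  # no smoothing in this sketch
        precisions.append(overlap / sum(cand.values()))
    bp = 1.0 if len(c) > len(r) else math.exp(1 - len(r) / max(len(c), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A perfect match scores 1.0, and any sentence with no n-gram overlap scores 0.0, matching the intuition behind the reported BLEU differences.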

Main Results
Our proposed self-training mechanism substantially outperformed the corresponding baseline in all language pairs by 2-4 BLEU points. Regarding the two proposed strategies, the ST-PT strategy performed better than the ST-UT strategy by 1 BLEU point because the synthetic data are more directly integrated into the training: for ST-UT, the synthetic data were used only as the target side, whereas for ST-PT, the synthetic data were used on both the source and target sides. These results demonstrate that synthetic data improve translation performance in our proposed self-training mechanism. Detailed analyses of the hyperparameters, such as the quantity ratio and the epoch numbers k1 and k2, are provided in the Appendix.

Table 4: Two translation examples on the Et-En test set.

Example 1
Input:     Ma pole oma loomingust kunagi kaugel ja tööd ma ei karda .
Reference: I 'm never far from my work and I 'm not afraid of work .
Baseline:  I am never far from my work and work I 'm not afraid of .
+ST-PT:    I 'm never far from my work and I 'm not afraid of the job .

Example 2
Input:     Salvador Adame kadus läänepoolses Michoacani osariigis kolm päeva pärast Valdezi tapmist .
Reference: Salvador Adame disappeared in the western state of Michoacan three days after Valdez was killed .
Baseline:  Salvador Adame disappeared in west Michoacan , Mexico , three days after Valdezi was killed .
+ST-PT:    Salvador Adame disappeared in the western state of Michoacan three days after Valdezi was killed .

Case Study
Moreover, we analyzed translation examples to further assess the effectiveness of our proposed self-training mechanism. Table 4 shows two translation examples generated by the UNMT baseline system and the +ST-PT system on the Et-En dataset. In the first example, the +ST-PT method makes the translation more fluent than the baseline system does. In the second example, the +ST-PT method makes the translation more accurate. These examples indicate that our proposed self-training mechanism can be widely applied in the unbalanced training data scenario.

Conclusion
UNMT has achieved remarkable results given massive monolingual corpora. However, a UNMT system usually does not perform well in a scenario where there is no adequate training corpus for one language. Based on this unbalanced training data scenario, we proposed two self-training strategies for UNMT. Experimental results on several language pairs show that our proposed strategies substantially outperform the UNMT baseline.

A.1 Quantity Ratio Analysis
We investigated the effect of the quantity ratio on UNMT performance for the En-Fr translation task during the first epoch of our proposed self-training methods. As shown in Fig. 1, quantity ratios ranging from 1% to 100% all enhanced UNMT performance, and the performance was similar when the quantity ratio was greater than 10%. The UNMT model also converged faster with less data. Therefore, we selected 10% as the quantity ratio for our proposed self-training methods.
Figure 1: UNMT performance for quantity ratios of 1%, 5%, 10%, 30%, 50%, and 100%.

A.2 Epoch Number Analysis

In Figure 2, we empirically demonstrate how the number of epochs affects UNMT performance on the En-Fr and En-Et translation tasks. We found that additional epochs have little influence on the baseline system. In contrast, increasing the number of epochs for our proposed strategies improves performance because the quality of the synthetic data used by the UNMT model is better after more epochs; however, the improvement diminishes as more epochs are added. Considering the computational cost of synthetic data generation, we trained the UNMT model for only two epochs.