Harnessing Multilinguality in Unsupervised Machine Translation for Rare Languages

Unsupervised translation has reached impressive performance on resource-rich language pairs such as English-French and English-German. However, early studies have shown that in more realistic settings involving low-resource, rare languages, unsupervised translation performs poorly, achieving less than 3.0 BLEU. In this work, we show that multilinguality is critical to making unsupervised systems practical for low-resource settings. In particular, we present a single model for 5 low-resource languages (Gujarati, Kazakh, Nepali, Sinhala, and Turkish), translating to and from English, that leverages monolingual and auxiliary parallel data from other high-resource language pairs via a three-stage training scheme. We outperform all current state-of-the-art unsupervised baselines for these languages, achieving gains of up to 14.4 BLEU. Additionally, we outperform strong supervised baselines for various language pairs and match the performance of the current state-of-the-art supervised model for Nepali-English. We conduct a series of ablation studies to establish the robustness of our model under different degrees of data quality, as well as to analyze the factors which led to the superior performance of the proposed approach over traditional unsupervised models.


Introduction
Neural machine translation systems (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015; Wu et al., 2016) have demonstrated state-of-the-art results for a diverse set of language pairs when given large amounts of relevant parallel data. However, given the prohibitive nature of such a requirement for low-resource language pairs, there has been a growing interest in unsupervised machine translation (Ravi and Knight, 2011) and its neural counterpart, unsupervised neural machine translation (UNMT) (Lample et al., 2018a; Artetxe et al., 2018), which leverage only monolingual source and target corpora for learning. Bilingual unsupervised systems (Lample and Conneau, 2019; Artetxe et al., 2019; Ren et al., 2019; Li et al., 2020a) have achieved surprisingly strong results on high-resource language pairs such as English-French and English-German.
However, these works only evaluate on high-resource language pairs with high-quality data, which are not realistic scenarios where UNMT would be utilized. Rather, the practical potential of UNMT lies in low-resource, rare languages that may not only lack parallel data but also have a shortage of high-quality monolingual data. For instance, Romanian (a typical evaluation language for unsupervised methods) has 21 million lines of high-quality, in-domain monolingual data provided by WMT. In contrast, for an actual low-resource language, Gujarati, WMT only provides 500 thousand lines of monolingual data (in the news domain) and an additional 3.7 million lines of monolingual data from Common Crawl (noisy, general-domain).
Given the comparably sterile setups UNMT has been studied in, recent works have questioned the usefulness of UNMT when applied to more realistic low-resource settings. Kim et al. (2020) report BLEU scores of less than 3.0 on low-resource pairs, and Marchisio et al. (2020) report dramatic degradation under domain shift. However, these negative results only study bilingual unsupervised systems and do not consider multilinguality, which has been well explored in supervised, zero-resource, and zero-shot settings (Johnson et al., 2017; Firat et al., 2016a,b; Neubig and Hu, 2018; Gu et al., 2018; Liu et al., 2020; Ren et al., 2018; Zoph et al., 2016) to improve performance for low-resource languages. The goal of this work is to study whether multilinguality can help UNMT be more robust in the low-resource, rare language setting.
In our setup (Figure 1), we have a single model for 5 target low-resource unsupervised directions (that are not associated with any parallel data): Gujarati, Kazakh, Nepali, Sinhala, and Turkish. These languages are chosen for a variety of reasons (discussed in §3) and have been of particular challenge to unsupervised systems. In our approach, as shown in Figure 1, we also leverage auxiliary data from a set of higher-resource languages: Russian, Chinese, Hindi, Arabic, Tamil, and Telugu. These higher-resource languages not only possess significant amounts of monolingual data but also auxiliary parallel data with English, which we leverage to improve the performance of the target unsupervised directions. Existing work on multilingual unsupervised translation (Liu et al., 2020; Garcia et al., 2020; Li et al., 2020b; Bai et al., 2020), which also uses auxiliary parallel data, employs a two-stage training scheme consisting of pre-training with noisy reconstruction objectives and fine-tuning with on-the-fly (iterative) back-translation and cross-translation terms (§4). We show this leads to sub-optimal performance for low-resource pairs and propose an additional intermediate training stage in our approach.
Our key insight is that pre-training typically results in high X→En (to English) performance but poor En→X (from English) results, which makes fine-tuning unstable. Thus, after pre-training, we propose an intermediate training stage that leverages offline back-translation (Sennrich et al., 2016) to generate synthetic data from the X→En direction to boost En→X accuracy.
Our final results show that our approach outperforms a variety of supervised and unsupervised baselines, including the current state-of-the-art supervised model for the Ne→En language pair. Additionally, we perform a series of experimental studies to analyze the factors that affect the performance of the proposed approach, as well as the performance in data-starved settings and settings where we only have access to noisy, multi-domain monolingual data.


Related work

Zero-shot translation (Al-Shedivat and Parikh, 2019) concerns the case where direct (source, target) parallel data is lacking but there is parallel data via a common pivot language to both the source and the target. For example, in Figure 1, Ru↔Zh and Hi↔Te would be zero-shot directions.
In contrast, a defining characteristic of the multilingual UNMT setup is that the source and target are disconnected in the graph, and one of the languages is not associated with any parallel data, with English or otherwise. En↔Gu and En↔Kk are such example pairs, as shown in Figure 1.
Recent work, including Liu et al. (2020), showed some initial results on multilingual unsupervised translation in the low-resource setting. These approaches tune language-specific models and employ a standard two-stage training scheme (Lample and Conneau, 2019), or, in the case of Liu et al. (2020), directly fine-tune on a related language pair (e.g. Hi→En) and then test on the target X→En pair (e.g. Gu→En). In contrast, our approach trains one model for all the targeted language pairs and employs a three-stage training scheme that leverages synthetic parallel data via offline back-translation.

Terminology
There is some disagreement on the definition of multilingual unsupervised machine translation, which we believe arises from extrapolating unsupervised translation to multiple languages. In the case of only two languages, the definition is clear: unsupervised machine translation consists of the case where there is no parallel data between the source and target languages. However, in a setting with multiple languages, there are multiple scenarios which satisfy this condition. More explicitly, suppose that we want to translate between languages X and Y and we have access to data from another language Z. Then, we have three possible scenarios:

• We possess parallel data for (X, Z) and (Z, Y), which would permit a 2-step supervised baseline via the pivot. Existing literature (Johnson et al., 2017; Firat et al., 2016b) has used the terms "zero-shot" and "zero-resource" to refer specifically to this setup.
• We have parallel data for (X, Z) but only monolingual data in Y, as considered in (Li et al., 2020b; Liu et al., 2020; Garcia et al., 2020; Bai et al., 2020; Artetxe et al., 2020). Note that the pivot-based baseline above is not possible in this setup.
• We do not have any parallel data among any of the language pairs, as considered in (Liu et al., 2020; Sun et al., 2020).
We believe the first setting is not well suited to the case where either X or Y is a true low-resource language (or an extremely low-resource language), since it is unlikely that such languages possess any parallel data with any other language. On the other hand, we usually assume that one of these languages is English, and we can commonly find large amounts of parallel data for English with other high-resource auxiliary languages. For these reasons, we focus on the second setting for the rest of this work.
Arguably, the existence of the auxiliary parallel data provides some notion of indirect supervision that is not present when only utilizing monolingual data. However, this signal is weaker than the one encountered in the zero-shot setting, since it precludes the 2-step supervised baseline. As a result, recent work (Artetxe et al., 2020; Garcia et al., 2020; Liu et al., 2020) has also opted to use the term "unsupervised". We follow this convention as well, but we emphasize that, independent of terminology, our goal is to study the setting where only the (extremely) low-resource languages of interest possess no parallel data, whether with English or otherwise.

Choice of languages
The vast majority of works in UNMT (multilingual or otherwise) have focused on traditionally high-resource languages, such as French and German. While certain works simulate the low-resource setting by using only a smaller subset of the available monolingual data, such setups neglect common properties of true low-resource, rare languages: little-to-no lexical overlap with English and noisy data sources spanning multiple domains. Given the multifaceted nature of what it means to be a low-resource language, we have chosen a set of languages with many of these characteristics. We give a detailed account of the available data in Table 1.
Target unsupervised directions: We select Turkish (Tr), Gujarati (Gu), and Kazakh (Kk) from WMT. The latter two possess much smaller amounts of data than most language pairs considered for UNMT, e.g. French or German. In order to vary the domain of our test sets, we additionally include Nepali (Ne) and Sinhala (Si) from the recently-introduced FLoRes dataset, as the test sets for these languages are drawn from Wikipedia instead of news. Not only do these languages possess monolingual data in amounts comparable to the low-resource languages from WMT, but the subset of in-domain monolingual data for each of them makes up less than 5% of that language's available monolingual data.
Auxiliary languages: To choose the auxiliary languages that contribute both monolingual data and parallel data with English, we took into account linguistic diversity, size, and relatedness to the target directions. Russian shares the same alphabet as Kazakh, and Hindi, Telugu, and Tamil are related to Gujarati, Nepali, and Sinhala. Chinese, while not specifically related to any of the target languages, is high-resource and considerably different in structure from the other languages.

Background
For a given language pair (X, Y) of languages X and Y, we possess monolingual datasets D_X and D_Y, consisting of unpaired sentences of each language.
Neural machine translation. In supervised neural machine translation, we have access to a parallel dataset D_{X×Z} consisting of translation pairs (x, z). We then train a model by minimizing the cross-entropy objective

L_cross-entropy(x, z) = −log p_θ(z | x),

where p_θ is our translation model. We further assume p_θ follows the encoder-decoder paradigm: there exists an encoder Enc_θ which converts x into a variable-length representation that is passed to a decoder, i.e. p_θ(y | x) := p_θ(y | Enc_θ(x)).
Unsupervised machine translation. In this setup, we no longer possess D_{X×Y}. Nevertheless, we may possess auxiliary parallel datasets such as D_{X×Z} for some language Z, but we enforce the constraint that we do not have access to the analogous dataset D_{Y×Z}. Current state-of-the-art UNMT models divide their training procedure into two phases: i) the pre-training phase, in which an initial translation model is learned through a combination of language modeling or noisy reconstruction objectives (Song et al., 2019; Lewis et al., 2019; Lample and Conneau, 2019) applied to the monolingual data; ii) the fine-tuning phase, which resumes training the translation model built from the pre-training phase with a new set of objectives, typically centered around iterative back-translation, i.e. penalizing a model's error in round-trip translations. We outline the objectives below.

Pre-training objectives. We use the MASS objective (Song et al., 2019), which consists of masking² a contiguous segment of the input and penalizing errors in the reconstruction of the masked segment. If we denote the masking operation by MASK, then we write the objective as follows:

L_MASS(x, l_x) = −log p_θ(x_masked | MASK(x), l_x),

where x_masked denotes the masked segment and l_x denotes the language indicator of example x. We also use cross-entropy on the available auxiliary parallel data.

²We choose a starting index of less than half the length l of the input and replace the next l/2 tokens with a [MASK] token. The starting index is randomly chosen to be 0 or l/2 with 20% chance for either scenario; otherwise it is sampled uniformly at random.
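The masking scheme described in the footnote can be sketched as follows (our own helper, not the authors' code; we operate on token lists and use `[MASK]` as a placeholder piece):

```python
import random

MASK = "[MASK]"

def mass_mask(tokens, rng=None):
    """Mask a contiguous span of l/2 tokens, as in the MASS objective.

    The start index is 0 with probability 0.2, l/2 with probability 0.2,
    and otherwise sampled uniformly from indices below l/2. Returns the
    masked input and the target (the original masked span).
    """
    rng = rng or random.Random()
    l = len(tokens)
    span = l // 2
    r = rng.random()
    if r < 0.2:
        start = 0
    elif r < 0.4:
        start = l // 2
    else:
        start = rng.randrange(0, max(1, l // 2))
    masked = tokens[:start] + [MASK] * span + tokens[start + span:]
    return masked, tokens[start:start + span]
```

The model is then trained to reconstruct the returned target span given the masked input and the language indicator.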
Fine-tuning objectives. We use on-the-fly back-translation, which we write explicitly as:

L_back-translation(x, l_y) = −log p_θ(x | ỹ(x), l_x), where ỹ(x) = argmax_y p_θ(y | x, l_y),

and we apply a stop-gradient to ỹ(x). Computing the mode ỹ(x) of p_θ(· | x, l_y) is intractable, so we approximate this quantity with a greedy decoding procedure. We also utilize cross-entropy, coupled with cross-translation (Garcia et al., 2020; Li et al., 2020b; Bai et al., 2020), which ensures cross-lingual consistency:

L_cross-translation(x, y, l_z) = −log p_θ(y | z̃(x), l_y), where z̃(x) = argmax_z p_θ(z | x, l_z).

Method
For the rest of this work, we assume that we want to translate between English (En) and some low-resource languages, which we denote by X. In our early experiments, we found that proceeding to the fine-tuning stage immediately after pre-training with MASS provided sub-optimal results (see §7.2), so we introduce an intermediate stage which leverages synthetic data to improve performance. This yields a total of three stages, which we describe below.

First stage of training
In the first stage, we leverage monolingual and auxiliary parallel data, using the MASS and cross-entropy objectives on each type of dataset respectively. We describe the full procedure in Algorithm 1.

Second stage of training
Once we have completed the first stage, we will have produced an initial model capable of generating high-quality X→En (to English) translations for all of the low-resource pairs we consider, also known as the many-to-one setup in multilingual NMT (Johnson et al., 2017). Unfortunately, the model does not reach that level of performance for the En→X translation directions, generating very low-quality translations into these low-resource languages. This phenomenon is ubiquitously observed in multilingual models (Firat et al., 2016a; Johnson et al., 2017; Aharoni et al., 2019). Such poor performance could have dire consequences in the fine-tuning stage, since both on-the-fly back-translation and cross-translation rely heavily on intermediate translations. We verify that this is in fact the case in §7.2.
Instead, we exploit the strong X→En performance by translating subsets of the monolingual data of the low-resource languages using our initial model and treating the result as pseudo-parallel datasets for the language pairs En→X. More explicitly, given a sentence x from a low-resource language, we generate an English translation ỹ_En with our initial model and create a synthetic translation pair (ỹ_En, x). We refer to this procedure as offline back-translation (Sennrich et al., 2016). We add these datasets to our collection of auxiliary parallel corpora and repeat the training procedure from the first stage (Algorithm 1), starting from the last checkpoint. While offline back-translated (synthetic) data is commonly used for zero-resource translation (Firat et al., 2016b), it is worth emphasizing the difference here again: in the configuration studied in this paper, we do not assume the existence of any parallel data between En↔X, which is exploited by such methods.
Upon completion, we run the procedure a second time, with a new subset of synthetic data of twice the size for the En→X pairs. Furthermore, since the translations from English have improved, we take disjoint subsets⁴ of the English monolingual data and generate corpora of synthetic X→En translation pairs that we also include in the second run of our procedure.
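The offline back-translation step above amounts to a simple data-generation loop; in this sketch, `translate_to_en` is a hypothetical stand-in for greedy decoding with the current model:

```python
def make_synthetic_pairs(translate_to_en, mono_x, n):
    """Offline back-translation: pair each of the first n monolingual
    sentences x of a low-resource language with its synthetic English
    translation, yielding pseudo-parallel (y_en, x) examples that train
    the En->X direction with a clean target side."""
    return [(translate_to_en(x), x) for x in mono_x[:n]]
```

Note the asymmetry: the synthetic side is the source, so the model always learns to generate real, human-written text in the low-resource language.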

Third stage of training
For the third and final stage of training, we use back-translation of the monolingual data and cross-translation⁵ on the auxiliary parallel data. We also leverage the synthetic data through the cross-entropy objective. We present the procedure in detail in Algorithm 2.

Main experiment
In this section, we describe the details of our main experiment. As indicated in Figure 1, we consider five languages (Nepali, Sinhala, Gujarati, Kazakh, Turkish) as the target unsupervised language pairs with English. We leverage auxiliary parallel data with English from six higher-resource languages (Chinese, Russian, Arabic, Hindi, Telugu, Tamil). The domains and counts for the datasets considered can be found in Table 1, and a more detailed discussion of the data sources and preprocessing steps can be found in the Appendix. In the following subsections, we provide detailed descriptions of the model configurations, training parameters, and evaluation, and discuss the results of our main experiment.

Datasets and preprocessing
We draw most of our data from WMT. The monolingual data comes from News Crawl⁶ when available. For all the unsupervised pairs except Turkish, we supplement the News Crawl datasets with monolingual data from Common Crawl and Wikipedia⁷. The parallel data we use came from a variety of sources, all available through WMT. We drew our English-Hindi parallel data from IITB (Kunchukuttan et al., 2017); English-Russian, English-Arabic, and English-Chinese parallel data from the UN Corpus (Ziemski et al., 2016); and English-Tamil and English-Telugu from WikiMatrix (Schwenk et al., 2019). We used the scripts from Moses (Koehn, 2009) to normalize punctuation, remove non-printing characters, and replace unicode characters with their non-unicode equivalents. We additionally use the normalizing script from Indic NLP (Kunchukuttan, 2020) for Gujarati, Nepali, Telugu, and Sinhala. We concatenate two million lines of monolingual data for each language and use the result to build a vocabulary of 64,000 pieces with SentencePiece (Kudo and Richardson, 2018). We then tokenize our data into SentencePiece pieces and remove all training samples that are over 88 pieces long.

⁴ 1 million lines of English per low-resource language.
⁵ For Nepali, Sinhala, and Gujarati, we use Hindi as the pivot language. For Turkish, we use Arabic, and for Kazakh, we use Russian.
⁶ http://data.statmt.org/news-crawl/
⁷ We used the monolingual data available from https://github.com/facebookresearch/flores for Nepali and Sinhala in order to avoid any data leakage from the test sets.
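A minimal sketch of the vocabulary and filtering pipeline, assuming the `sentencepiece` Python package; the file paths, model prefix, and helper names are placeholders, not the authors' scripts:

```python
def train_joint_vocab(mono_files, model_prefix, vocab_size=64000):
    """Train one joint SentencePiece model on concatenated monolingual data.

    Assumes the `sentencepiece` package is installed; inputs are
    comma-joined per the SentencePiece trainer API.
    """
    import sentencepiece as spm
    spm.SentencePieceTrainer.train(
        input=",".join(mono_files),
        model_prefix=model_prefix,
        vocab_size=vocab_size,
    )
    return spm.SentencePieceProcessor(model_file=model_prefix + ".model")

def keep_example(encode, line, max_pieces=88):
    """Filter rule from the paper: drop training samples over 88 pieces.

    `encode` maps a string to a list of pieces, e.g. a trained processor's
    `sp.encode(line, out_type=str)`.
    """
    return len(encode(line)) <= max_pieces
```

The length filter is applied after tokenization, so the 88-piece budget is measured in subword pieces rather than whitespace tokens.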

Model architecture
All of our models were implemented and tested in TensorFlow (Abadi et al., 2016). We use the Transformer architecture (Vaswani et al., 2017) as the basis of our translation models, with a 6-layer encoder and decoder, a hidden size of 1024, and a feed-forward filter size of 8192. We share the same encoder for all languages. To differentiate between the different possible output languages, we add (learned) language embeddings to each token's embedding before passing them to the decoder. Following Song et al. (2019), we also modify the output transformation of each attention head in each Transformer block of the decoder to be distinct for each language. Besides these modifications, we share decoder parameters across all languages.
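The decoder-side language conditioning can be sketched with NumPy (variable names are ours; in the actual model the embedding table is a learned parameter):

```python
import numpy as np

def add_language_embedding(token_embs, lang_id, lang_emb_table):
    """Add a learned per-language vector to every token embedding.

    token_embs: (seq_len, d_model) array of token embeddings.
    lang_emb_table: (n_langs, d_model) table of language embeddings.
    Broadcasting replicates the language vector across all positions.
    """
    return token_embs + lang_emb_table[lang_id]
```

This is the standard trick for signaling the target language to a shared decoder without changing its architecture.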

Training parameters
We use three different settings, corresponding to each stage of training. For the first stage, we use the Adam optimizer (Kingma and Ba, 2015) with a weight decay of 0.2 and a batch size of 2048 examples, with a learning rate schedule consisting of a linear warmup over 4000 steps to a value of 0.0002 followed by a linear decay for 1.2 million steps. At every step, we choose a single dataset from which to draw a whole batch using the following process: with equal probability, we choose either monolingual or parallel data. If the choice is monolingual, we select one of the monolingual datasets uniformly at random. If the choice is parallel, we use a temperature-based sampling scheme based on the numbers of samples, with a temperature of 5 (Arivazhagan et al., 2019b). In the second stage, we retain the same settings for both rounds of leveraging synthetic data, except for the learning rate and number of steps. In the first round, we use the same number of steps, while in the second round we use only 240 thousand steps, a fifth of the original.
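The temperature-based sampling over parallel corpora can be sketched as follows (our own helper; with a temperature of 5, each corpus is sampled proportionally to n^(1/5), which up-samples the smaller pairs relative to proportional sampling):

```python
def sampling_probs(sizes, temperature=5.0):
    """Temperature-based sampling: p_i proportional to n_i ** (1 / T).

    T = 1 recovers proportional-to-size sampling; larger T flattens the
    distribution toward uniform, boosting low-resource corpora.
    """
    weights = [n ** (1.0 / temperature) for n in sizes]
    total = sum(weights)
    return [w / total for w in weights]
```

For example, a 10k-line corpus paired with a 1M-line corpus receives far more than its proportional ~1% share of batches at T = 5.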
For the final phase, we bucket sequences by length and group them into batches of at most 2000 tokens. We train the model on 8 NVIDIA V100 GPUs, assigning a batch to each one and training synchronously. We also switch to the Adamax optimizer and cut the learning rate by a factor of four once more.

Baselines
We compare with the state-of-the-art unsupervised and supervised baselines from the literature. Note that all the baselines build language-specific models, whereas we have a single model for all the target unsupervised directions.
Unsupervised baselines: For the bilingual unsupervised baselines, we include the results of Kim et al. (2020) for En↔Gu and En↔Kk, as well as the corresponding bilingual results for En↔Si. We also report other multilingual unsupervised baselines: mBART (Liu et al., 2020) leverages auxiliary parallel data (e.g. En↔Hi parallel data for Gu→En) after pre-training on a large dataset covering 25 languages, and the FLoRes dataset benchmark leverages Hi↔En data for the En↔Ne language pair. All the unsupervised baselines that use auxiliary parallel data perform considerably better than those that do not.
Supervised baselines: In addition to the unsupervised numbers above, mBART and the FLoRes dataset benchmarks report supervised results that we compare with. We additionally include one more baseline, where we followed the training scheme proposed in stage 1 but also included the missing parallel data. We label this model "Mult. MT Baseline", though we emphasize that we also leverage the monolingual data in this baseline, as in recent work (Siddhant et al., 2020a; Garcia et al., 2020).

Evaluation
We evaluate the performance of our models using BLEU scores (Papineni et al., 2002). BLEU scores are known to be dependent on the data preprocessing (Post, 2018), and thus proper care is required to ensure the scores between our models and the baselines are comparable. We therefore only considered baselines which report detokenized BLEU scores with sacreBLEU (Post, 2018) or report explicit pre-processing steps. In the case of the Indic languages (Gujarati, Nepali, and Sinhala), the baselines we consider (Liu et al., 2020, among others) report tokenized BLEU using the tokenizer provided by the Indic-NLP library (Kunchukuttan, 2020). For these languages, we follow this convention as well so that the BLEU scores remain comparable. Otherwise, we follow suit with the rest of the literature and report detokenized BLEU scores through sacreBLEU.

Results & discussion
We list the results of our experiments for the WMT datasets in Table 2 and for the FLoRes datasets in Table 3. After the first stage of training, we obtain competitive BLEU scores for the X→En translation directions, outperforming all unsupervised models as well as mBART for the language pairs Kk→En and Gu→En. Upon completion of the second stage of training, we see that the En→X language pairs observe large gains, while the X→En directions also improve. The final round of training further improves results in some language pairs, yielding an increase of +0.44 BLEU on average.
Note that in addition to considerably outperforming all the unsupervised baselines, our approach outperforms the supervised baselines on many of the language pairs, even matching the state-of-the-art on Ne→En. Specifically, it outperforms the supervised mBART on six out of ten translation directions despite being a smaller model, and it outperforms the FLoRes supervised benchmarks on all pairs. Critically, we outperform our own multilingual MT baseline, trained in the same fashion and on the same data as Stage 1, which further reinforces our assertion that unsupervised MT can provide competitive results with supervised MT in low-resource settings.

Further analysis
Given the substantial quality gains delivered by our proposed method, we set out to investigate which design choices can improve the performance of unsupervised models. To ease the computational burden, we further filter the training data to remove any samples which are longer than 64 SentencePiece pieces and cut the batch size in half for the first two stages. Additionally, we only do one round of training with synthetic data, as opposed to the two rounds performed for the benchmark models. While these choices negatively impact performance, the resulting models still provide competitive results with our baselines and hence are more than sufficient for the purposes of experimental studies.
Increasing multilinguality of the auxiliary parallel data improves performance

It was shown in Garcia et al. (2020) and Bai et al. (2020) that adding more multilingual data improved performance, and that the inclusion of auxiliary parallel data further improved BLEU scores (Siddhant et al., 2020b). In this experiment, we examine whether further increasing multilinguality under a fixed data budget improves performance. For all configurations in this subsection, we utilize all the available English and Kazakh monolingual data. We fix the amount of auxiliary monolingual data to 40 million lines and the auxiliary parallel data to 12 million lines, and vary the number of languages which manifest in this auxiliary data. We report the results in Table 4. We observe that increasing the multilinguality of the parallel data is crucial, but the matter is less clear for the monolingual data: using more languages for the monolingual data can potentially harm performance, but in the presence of multiple auxiliary language pairs with supervised data this degradation vanishes.

Synthetic data is critical for both stage 2 and stage 3 of training
In the following experiments, we evaluate the role of synthetic parallel data in the improved performance found at the end of stages 2 and 3 of our training procedure. We first evaluate whether the improved performance at the end of stage 2 comes from the synthetic data or from the continued training. We consider the alternative where we repeat the same training steps as in stage 2 but without the synthetic data. We then additionally fine-tune these models with the same procedure as stage 3, but without any of the terms involving synthetic data. We report the BLEU scores for all these configurations in Table 5. The results suggest that the baseline without synthetic parallel data shows inferior performance across all language pairs compared to our approach leveraging synthetic parallel data.

Table 6: Total BLEU increase for X→En over the baseline fine-tuning strategy consisting of on-the-fly back-translation (BT) and no synthetic data. We refer to cross-translation as "CT".
Finally, we inspect whether the synthetic parallel data is still necessary in stage 3 or if it suffices to only leverage it during the second stage. We consider three fine-tuning strategies, where we either (1) only utilize on-the-fly back-translation; (2) additionally include cross-translation terms for Gujarati, Nepali, and Sinhala using Hindi; or (3) additionally include cross-translation terms for Turkish and Kazakh involving Arabic and Russian, respectively. We compare all of these approaches to the vanilla strategy that only leverages on-the-fly back-translation and report the aggregate improvements in BLEU on the X→En directions over this baseline in Table 6. We see two trends: the configurations that do not leverage synthetic data perform worse than those that do, and increasing multilinguality through the inclusion of cross-translation further improves performance.

Our approach is robust under multiple domains
We investigate the impact of data quantity and quality on the performance of our models. In this experiment, we focus on En↔Gu and use all available monolingual and auxiliary parallel data for all languages except Gujarati. We consider three configurations: (1) 500,000 lines from News Crawl (in-domain, high-quality data); (2) 500,000 lines from Common Crawl (multi-domain data); (3) 100,000 lines from News Crawl. We present the results on both newstest2019 and newsdev2019 for En↔Gu in Table 7. We see that the Common Crawl and News Crawl configurations produce similar results at this scale, with the Common Crawl configuration having a small edge on average. Notice that even in this data-starved setting, we still outperform the competing unsupervised models. Once we drop to only 100,000 lines, performance degrades below mBART but still outperforms the bilingual UNMT approach of Kim et al. (2020), revealing the power of multilinguality in low-resource settings.

Conclusion
In this work, we studied how multilinguality can make unsupervised translation viable for low-resource languages in a realistic setting. Our results show that utilizing auxiliary parallel data in combination with synthetic data through our three-stage training procedure not only yields large gains over unsupervised baselines but also outperforms several modern supervised approaches.