Reusing a Pretrained Language Model on Languages with Limited Corpora for Unsupervised NMT

Using a language model (LM) pretrained on two languages with large monolingual corpora in order to initialize an unsupervised neural machine translation (UNMT) system yields state-of-the-art results. When only limited data is available for one language, however, this method leads to poor translations. We present an effective approach that reuses an LM pretrained only on the high-resource language. The monolingual LM is fine-tuned on both languages and is then used to initialize a UNMT model. To reuse the pretrained LM, we have to modify its predefined vocabulary to account for the new language; we therefore propose a novel vocabulary extension method. Our approach, RE-LM, outperforms a competitive cross-lingual pretraining model (XLM) on English-Macedonian (En-Mk) and English-Albanian (En-Sq), yielding more than +8.3 BLEU points for all four translation directions.


Introduction
Neural machine translation (NMT) has recently achieved remarkable results (Bahdanau et al., 2015; Vaswani et al., 2017), based on the exploitation of large parallel training corpora. Such corpora are only available for a limited number of languages. UNMT attempts to address this limitation by training NMT systems using monolingual data only (Artetxe et al., 2018; Lample et al., 2018). Top performance is achieved by using a bilingual masked language model (Devlin et al., 2019) to initialize a UNMT encoder-decoder system (Lample and Conneau, 2019). The model is then trained using denoising auto-encoding (Vincent et al., 2008) and back-translation (Sennrich et al., 2016a). This approach has mainly been evaluated on translation between high-resource languages.
Translating between a high-resource and a low-resource language is a more challenging task. In this setting, the UNMT model can be initialized with a pretrained cross-lingual LM. However, training such a UNMT model has been shown to be ineffective when the two languages are not related (Guzmán et al., 2019). Moreover, in order to use a pretrained cross-lingual LM to initialize a UNMT model, the two models must share a vocabulary. Thus, a bilingual LM needs to be trained from scratch for each language pair before being transferred to the UNMT model (e.g., an En-De LM for En-De UNMT).
Motivated by these issues, we focus on the question: how can we accurately translate between a high-monolingual-resource (HMR) and a low-monolingual-resource (LMR) language? To address this question, we adapt a monolingual LM, pretrained on an HMR language, to an LMR language, and use it to initialize a UNMT system.
We make the following contributions: (1) We propose REused-LM (RE-LM), an effective transfer learning method for UNMT. Our method reuses an LM pretrained on an HMR language by fine-tuning it on both the LMR and the HMR language. The fine-tuned LM is used to initialize a UNMT system that translates from the LMR to the HMR language (and vice versa). (2) We introduce a novel vocabulary extension method, which allows fine-tuning a pretrained LM on an unseen language. (3) We show that RE-LM outperforms a competitive transfer learning method (XLM) for UNMT on three language pairs: English-German (En-De) in a synthetic setup, En-Mk and En-Sq. (4) We show that RE-LM is effective in low-resource supervised NMT. (5) We conduct an analysis of fine-tuning schemes for RE-LM and find that including adapters (Houlsby et al., 2019) in the training procedure yields almost the same UNMT results as RE-LM at a lower computational cost.

Related Work
Transfer learning for UNMT. The field of UNMT has recently experienced tremendous progress, and adapters (Houlsby et al., 2019) have emerged as a parameter-efficient way to transfer pretrained models to new tasks and languages. Motivated by this, we study the use of adapters during LM fine-tuning in our analysis.

Proposed Approach
In this section, we describe our method for translation between an HMR and an LMR language using only monolingual data.

RE-LM
Our proposed approach consists of three steps, as shown in Figure 1: (A) We train a monolingual masked LM on the HMR language, using all available HMR corpora. This step needs to be performed only once for the HMR language; note that a publicly available pretrained model could also be used. (B) To fine-tune the pretrained LM on the LMR language, we first need to overcome the vocabulary mismatch problem. We therefore extend the vocabulary of the pretrained model using our proposed method, described in §3.2. (C) Finally, we initialize an encoder-decoder UNMT system with the fine-tuned LM. The UNMT model is trained using denoising auto-encoding and online back-translation for the HMR-LMR language pair.

Figure 2: Example segmentations of Sq. Splitting Sq using En BPEs (BPE_HMR) results in heavily segmented tokens. This problem is alleviated using BPE_joint tokens, learned on both languages.
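As a concrete illustration of the denoising auto-encoding objective in step (C), the sketch below implements a noise function in the spirit of Lample et al. (2018), which corrupts a sentence by randomly dropping words and locally shuffling the rest; the hyperparameter values are illustrative assumptions, not taken from the paper.

```python
import random

def add_noise(tokens, drop_prob=0.1, shuffle_window=3):
    """Corrupt a token sequence for denoising auto-encoding.

    Follows the spirit of Lample et al. (2018): each word is dropped with
    probability drop_prob, and the remaining words are locally shuffled so
    that no word moves more than shuffle_window positions. The exact values
    are placeholders, not the paper's settings.
    """
    kept = [t for t in tokens if random.random() > drop_prob] or tokens[:1]
    # Local shuffle: sort by (original index + random offset within the window).
    keys = [i + random.uniform(0, shuffle_window) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept), key=lambda p: p[0])]
```

The auto-encoding loss then trains the model to reconstruct the original sentence from its noised version.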

Vocabulary Extension
We propose a novel method that enables adapting a pretrained monolingual LM to an unseen language. We consider the case of an LM pretrained on an HMR language, whose training data is split using Byte-Pair Encoding (BPE) (Sennrich et al., 2016b). We denote these BPE tokens as BPE_HMR and the resulting vocabulary as V_HMR. We aim to fine-tune the trained LM on an unseen LMR language. Splitting the LMR language with BPE_HMR tokens would result in heavy segmentation of LMR words (Figure 2). To counter this, we learn BPEs on the joint LMR and HMR corpus (BPE_joint). We then use BPE_joint tokens to split the LMR data, resulting in a vocabulary V_LMR. This technique increases the number of shared tokens and enables cross-lingual transfer of the pretrained LM. The final vocabulary is the union of V_HMR and V_LMR, and the embedding layer of the pretrained LM is extended accordingly with entries for the newly added tokens.
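A minimal PyTorch sketch of how this vocabulary extension could be realized, assuming the vocabularies are plain token lists; the helper name and the random initialization of the new rows are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

def extend_embedding(pretrained_emb: nn.Embedding, v_hmr: list, v_lmr: list):
    """Build the union vocabulary of V_HMR and V_LMR and a matching embedding.

    Rows for V_HMR keep their pretrained vectors; rows for tokens that only
    appear in V_LMR keep the fresh random initialization of the new layer
    (an assumption, not a detail confirmed by the paper).
    """
    seen = set(v_hmr)
    new_tokens = [t for t in v_lmr if t not in seen]
    joint_vocab = v_hmr + new_tokens
    extended = nn.Embedding(len(joint_vocab), pretrained_emb.embedding_dim)
    with torch.no_grad():
        extended.weight[: pretrained_emb.num_embeddings] = pretrained_emb.weight
    return extended, joint_vocab
```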
Experimental Setup

For unsupervised translation, UNMT training is performed using only monolingual data. For supervised translation, NMT training is performed using only parallel corpora, without back-translation of monolingual data. The first baseline is a randomly initialized NMT system. The second baseline is an NMT model initialized with XLM. We compare them to our proposed approach, RE-LM. Both XLM and RE-LM are pretrained on the monolingual corpora of the two languages of interest.
In the analysis, we investigate adding adapters (Rebuffi et al., 2018) after each self-attention and each feed-forward layer of the pretrained monolingual LM. We use adapters with a hidden size of 256. We freeze the parameters of the pretrained LM and fine-tune only the adapters and the embedding layer.

Unsupervised Translation

Table 1 presents our UNMT results, comparing random initialization, XLM and RE-LM. Synthetic setup. We observe that RE-LM consistently outperforms XLM. Using 50K De sentences, RE-LM has small gains over XLM (+1.1 BLEU in En→De). However, when we scale to slightly more data (500K), the performance of RE-LM is clearly better than that of XLM, with gains of +3 BLEU in En→De. With 1M De sentences, our model surpasses XLM by more than 2.6 BLEU in both directions. Real-world setup. Our approach surpasses XLM in both language pairs. We observe that RE-LM achieves at least +8.3 BLEU over XLM for En-Mk.

Our model was first pretrained on En and then fine-tuned on both En and Mk. It has therefore processed all the En and Mk data, obtaining good cross-lingual representations. XLM, by contrast, is jointly trained on En and Mk, and as a result it overfits Mk before processing all the En data. RE-LM is similarly effective for En-Sq, achieving an improvement of at least +9.3 BLEU over XLM. Synthetic vs real-world setup. The effectiveness of RE-LM is more pronounced in the real-world setup. We identify two potential reasons. First, for En-De, 8M En sentences are used for LM pretraining, while for En-Mk and En-Sq, 68M En sentences are used. When XLM is trained on imbalanced HMR-LMR data, it overfits the LMR language. This is more evident for the En-Mk (or En-Sq) XLM than for the En-De XLM, perhaps due to the larger data imbalance. Second, in En-De, we use high-quality corpora for both languages (NewsCrawl), whereas Mk and Sq are trained on lower-quality CommonCrawl data. The fact that RE-LM outperforms XLM for Mk and Sq shows that it is more robust to noisy data than XLM.

Low-Resource Supervised Translation
We compare XLM, RE-LM and random, an NMT model trained from scratch. We observe (Table 2) that RE-LM outperforms both baselines across the different amounts of parallel data.

Table 2: BLEU scores on the dev set using increasing amounts of parallel data. We show in bold the models that achieve at least +1 BLEU compared to XLM.

Analysis
We experiment with different fine-tuning schemes and show the results in Table 3. RE-LM. We compare fine-tuning an MLM only on the LMR language to fine-tuning it on both the HMR and the LMR language (rows 1 and 2). Fine-tuning only on the LMR language yields worse BLEU scores, which we attribute to catastrophic forgetting. This is observed for most language pairs. The negative effect is clearest for Mk and Sq, where fine-tuning only on the LMR language results in worse BLEU scores than random initialization (Table 1).
Adapters. We insert adapters into the pretrained LM and fine-tune only the adapters and the embedding layer. We use the fine-tuned LM to initialize a UNMT system; adapters are used for both translation directions during UNMT training. Fine-tuning the LM only on the LMR language yields at least +3.9 BLEU for En-Sq compared to fine-tuning on both (rows 3, 4). En and Sq are not similar languages and their embeddings also differ, so fine-tuning on both is not helpful. By contrast, fine-tuning only on Sq preserves the pretrained model's knowledge, while the adapters are trained to encode Sq. For En-De and En-Mk, both approaches provide similar results. En and Mk do not share an alphabet, so their embeddings do not overlap and both fine-tuning methods are equally effective. In En-De, fine-tuning only on De is marginally better than fine-tuning on both. We highlight that adapters allow parameter-efficient fine-tuning: LM + adapters (Table 3) reaches almost the same results as RE-LM, using just a fraction of RE-LM's trainable parameters during fine-tuning. Details can be found in the appendix.
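For illustration, a minimal PyTorch sketch of the bottleneck adapter described above and of the freezing scheme (train only adapters and embeddings); the module and parameter names are our own, and the near-identity initialization is a common convention rather than a detail confirmed by the paper.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual.

    Inserted after each self-attention and each feed-forward layer of the
    pretrained LM, with a hidden (bottleneck) size of 256 as in the paper.
    """
    def __init__(self, model_dim: int, bottleneck_dim: int = 256):
        super().__init__()
        self.down = nn.Linear(model_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, model_dim)
        self.act = nn.GELU()
        # Start near the identity so the adapted LM initially behaves like
        # the pretrained one (a common convention, assumed here).
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return hidden + self.up(self.act(self.down(hidden)))

def freeze_for_adapter_tuning(model: nn.Module) -> None:
    """Freeze the pretrained LM; train only adapter and embedding parameters.

    Relies on parameter names containing "adapter" or "embedding", which is
    an assumption about how the model is defined.
    """
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name or "embedding" in name
```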

Conclusions
Training competitive unsupervised NMT models for HMR-LMR scenarios is important for many real low-resource languages. We proposed RE-LM, a novel approach that fine-tunes a high-resource LM on a low-resource language and uses it to initialize an NMT model. RE-LM outperformed a strong baseline in UNMT, while also improving translations in a low-resource supervised setup. In future work, we will apply our method to languages with corpora from diverse domains and also to other languages.

A.1 Preprocessing and Datasets
Vocabulary Extension. We provide more examples of different segmentations of Sq, De and Mk using either the BPE_HMR or the BPE_joint tokens in Figure 3. We observe that, as expected, the Mk sentence is split to the character level, as it uses a different alphabet (Cyrillic) than the one the BPE_HMR tokens were learned on (Latin).
Preprocessing. For the En-De XLM baseline, we learn 60K BPE splits on the concatenation of sentences sampled randomly from the monolingual corpora, using the sampling method proposed by Lample and Conneau (2019) with α = 0.5. For the En-Mk and En-Sq XLM baselines, we learn 32K BPE splits on the concatenation of sentences sampled randomly from the monolingual corpora, with α = 0.5. For RE-LM, we extend the initial HMR (En) vocabulary used for LM pretraining. The LMR data is split using BPE_joint (§3.2), while the HMR (En) data remains split according to BPE_HMR, exactly as it was during pretraining. The same data is then used for NMT training. Finally, the UNMT model is initialized with RE-LM.

Datasets. We use all deduplicated corpora available from OSCAR (Ortiz Suárez et al., 2019) for Macedonian and Albanian, available at https://oscar-corpus.com/. We remove sentences longer than 100 words after BPE splitting. We split the data using the fastBPE codebase (https://github.com/glample/fastBPE). We use the official WMT 2016 dev/test sets for En-De.
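As a reference for the sampling step, a short sketch of the multinomial sampling of Lample and Conneau (2019), where a language's sampling probability is proportional to its corpus share raised to the power α; the function name and example sizes are ours.

```python
import numpy as np

def sampling_probabilities(corpus_sizes, alpha=0.5):
    """Probability of sampling sentences from each corpus, following
    Lample and Conneau (2019): p_i is proportional to (n_i / sum_j n_j)^alpha.

    alpha < 1 up-weights smaller corpora relative to their raw share,
    which mitigates the HMR-LMR data imbalance when learning joint BPE.
    """
    shares = np.asarray(corpus_sizes, dtype=float)
    shares /= shares.sum()
    weighted = shares ** alpha
    return weighted / weighted.sum()

# Example: an 8M-sentence corpus and a 1M-sentence corpus (illustrative sizes).
print(sampling_probabilities([8_000_000, 1_000_000]))  # ~ [0.739, 0.261]
```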

A.2 Model Configuration
We tie the embedding and output (projection) layers of both the LM and the NMT models (Press and Wolf, 2017). We use a dropout rate of 0.1 and GELU activations (Hendrycks and Gimpel, 2017). We use the default parameters of Lample and Conneau (2019) to train our models unless otherwise specified; we do not tune the hyperparameters. The code was built with PyTorch (Paszke et al., 2019) on top of the XLM implementation (https://github.com/facebookresearch/XLM). This code was used for LM pretraining, LM fine-tuning, UNMT training, and NMT training.
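In PyTorch, tying the embedding and output projection amounts to sharing one weight matrix; a minimal sketch with illustrative sizes (not the paper's):

```python
import torch.nn as nn

vocab_size, model_dim = 64000, 1024  # illustrative sizes, not the paper's
embedding = nn.Embedding(vocab_size, model_dim)
projection = nn.Linear(model_dim, vocab_size, bias=False)
# Weight tying (Press and Wolf, 2017): the output projection reuses the
# embedding matrix, so gradients update a single shared parameter.
projection.weight = embedding.weight
```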
LM configuration and training details. For LM training, we use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 10^-4. The RE-LM approach pretrains a monolingual language model, whereas the XLM approach pretrains a bilingual language model. We obtain a checkpoint every 200K sentences processed by the model. We train each LM using the validation perplexity on the LMR language as stopping criterion, with a patience of 10.
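A small sketch of this patience-based stopping rule, assuming one perplexity value per checkpoint; the function is our illustration of the criterion, not code from the implementation.

```python
def should_stop(val_perplexities, patience=10):
    """Stop when the best validation perplexity has not improved for
    `patience` consecutive checkpoints (lower perplexity is better)."""
    best_index = val_perplexities.index(min(val_perplexities))
    return len(val_perplexities) - 1 - best_index >= patience
```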
The training details of the two pretraining methods are as follows:
• The monolingual LM pretraining required 1 week of training on 8 GPUs; the model has 137M parameters.
• The XLM pretraining required 1 week of training on 8 GPUs; the total number of trainable parameters is 138M.
Our approach additionally requires an LM fine-tuning step. The runtimes, parameters and GPU details are shown in Table 4 under the RE-LM ft column; the runtimes mentioned refer to the En-Mk language pair. We note that the LM fine-tuning step is much faster than performing XLM pretraining for each language pair.
NMT configuration and training details. For UNMT and supervised NMT training, we use Adam with inverse square root learning rate scheduling and a learning rate of 10^-4 (Vaswani et al., 2017). We evaluate NMT models on the dev set every 3000 updates using greedy decoding. The parameters and runtimes of the UNMT models are shown in Table 4 under the UNMT columns; likewise, the details of the supervised NMT models are shown under the sup NMT columns. We obtain a checkpoint every 50K sentences processed by the model.

Table 4: Parameters, training runtime and number of GPUs used for each experiment. All the GPUs were of the same type. ft refers to the fine-tuning of the pretrained monolingual LM; adap refers to the addition of adapters to the LM and the UNMT model.
params: 337M, 337M, 156M, 337M, 337M, 88M, 359M, 337M, 337M
runtime: 48h, 10h, 60h, 72h, 10h, 44h, 20h, 18h, 15h
# GPUs: 1 for each experiment
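For reference, a sketch of the inverse square root schedule in the usual formulation of Vaswani et al. (2017), rescaled to peak at the stated learning rate of 10^-4; the warmup length is an assumption, since it is not stated here.

```python
def inverse_sqrt_lr(step, peak_lr=1e-4, warmup_steps=4000):
    """Linear warmup to peak_lr over warmup_steps, then 1/sqrt(step) decay.

    warmup_steps = 4000 is a placeholder value, not taken from the paper.
    """
    step = max(step, 1)
    scale = min(step ** -0.5, step * warmup_steps ** -1.5) * warmup_steps ** 0.5
    return peak_lr * scale
```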
Regarding the RE-LM + adapters training procedure, we note that, different from Houlsby et al. (2019), we fine-tune the embedding layer together with the adapters (see the Analysis section).

A.3 Validation Scores of Results
In Tables 5 and 6 we show the dev scores of the main results of our proposed approach (RE-LM) compared to the baselines. These tables extend Table 1 of the main paper.
In Tables 7 and 8, we show the dev scores of the additional fine-tuning experiments from the analysis. These tables correspond to Table 3 of the main paper.
We note that the dev scores are obtained using greedy decoding, while the test scores are obtained with beam search of size 5. We train each NMT model using the validation BLEU score of the LMR→HMR direction as stopping criterion, with a patience of 10. We use the multi-bleu.perl script from Moses.

Table 8: Comparison of UNMT BLEU scores obtained using different fine-tuning schemes of the pretrained monolingual LM, with corresponding dev scores for En-Mk and En-Sq. pretr. LM refers to the pretrained LM, trained on HMR data, while ft refers to fine-tuning; ft both means fine-tuning on both the LMR and the HMR language.