Generalized Data Augmentation for Low-Resource Translation

Low-resource language pairs with a paucity of parallel data pose challenges for machine translation in terms of both adequacy and fluency. Data augmentation utilizing a large amount of monolingual data is regarded as an effective way to alleviate the problem. In this paper, we propose a general framework of data augmentation for low-resource machine translation that not only uses target-side monolingual data, but also pivots through a related high-resource language. Specifically, we experiment with a two-step pivoting method to convert high-resource data to the low-resource language, making best use of available resources to better approximate the true distribution of the low-resource language. First, we inject low-resource words into high-resource sentences through an induced bilingual dictionary. Second, we further edit the high-resource data injected with low-resource words using a modified unsupervised machine translation framework. Extensive experiments on four low-resource datasets show that under extreme low-resource settings, our data augmentation techniques improve translation quality by 1.5 to 8 BLEU points compared to supervised back-translation baselines.


Introduction
The task of Machine Translation (MT) for low-resource languages (LRLs) is notoriously hard due to the lack of the large parallel corpora needed to achieve adequate performance with current Neural Machine Translation (NMT) systems (Koehn and Knowles, 2017). A standard practice to improve training of models for an LRL of interest (e.g. Azerbaijani) is utilizing data from a related high-resource language (HRL, e.g. Turkish). Both transferring from HRL to LRL (Zoph et al., 2016; Nguyen and Chiang, 2017; Gu et al., 2018) and joint training on HRL and LRL parallel data (Johnson et al., 2017; Neubig and Hu, 2018) have been shown to be effective techniques for low-resource NMT. Incorporating data from other languages can be viewed as one form of data augmentation, and particularly large improvements can be expected when the HRL shares vocabulary or is syntactically similar with the LRL (Lin et al., 2019). Simple joint training is still not ideal, though, considering that there will still be many words and possibly even syntactic structures that are not shared between even the most highly related languages. There are model-based methods that ameliorate the problem through more expressive source-side representations conducive to sharing (Gu et al., 2018; Wang et al., 2019), but they add significant computational and implementation complexity.
In this paper, we examine how to better share information between related LRL and HRLs through a framework of generalized data augmentation for low-resource MT. In our basic setting, we have access to parallel or monolingual data of an LRL of interest, its HRL, and the target language, which we will assume is English. We propose methods to create pseudo-parallel LRL data in this setting. As illustrated in Figure 1, we augment parallel data via two main methods: 1) back-translating from ENG to LRL or HRL; 2) converting the HRL-ENG dataset to a pseudo LRL-ENG dataset.
In the first thread, we focus on creating new parallel sentences through back-translation. Back-translating from the target language to the source (Sennrich et al., 2016) is a common practice in data augmentation, but it has been shown to be less effective in low-resource settings, where it is hard to train a good back-translation model (Currey et al., 2017). As a way to ameliorate this problem, we examine methods to instead translate from the target language to a highly related HRL, a direction that remains unexplored in the context of low-resource NMT. This pseudo-HRL-ENG dataset can then be used for joint training with the LRL-ENG dataset.
In the second thread, we focus on converting an HRL-ENG dataset into a pseudo-LRL-ENG dataset that better approximates the true LRL data. Converting between HRLs and LRLs also suffers from a lack of resources, but because the LRL and HRL are related, this is an easier task that we argue can be done to some extent by simple (or unsupervised) methods. In our proposed method, as a first step, we substitute HRL words on the source side of HRL parallel datasets with corresponding LRL words from an induced bilingual dictionary, generated by mapping word embedding spaces (Xing et al., 2015; Lample et al., 2018b). In the second step, we further attempt to translate the pseudo-LRL sentences to be closer to real LRL ones using an unsupervised machine translation framework.
In sum, our contributions are fourfold:
1. We conduct a thorough empirical evaluation of data augmentation methods for low-resource translation that take advantage of all accessible data, across four language pairs.
2. We explore two methods for translating between related languages: word-by-word substitution using an induced dictionary, and unsupervised machine translation that further uses this word-by-word substituted data as input. These methods improve over simple unsupervised translation from HRL to LRL by 2 to 10 BLEU points.
3. Our proposed data augmentation methods improve over standard supervised back-translation by 1.5 to 8 BLEU points across all datasets, with an additional improvement of up to 1.1 BLEU points when augmenting from both ENG monolingual data and HRL-ENG parallel data.

A Generalized Framework for Data Augmentation
In this section, we outline a generalized data augmentation framework for low-resource NMT.

Datasets and Notations
Given an LRL of interest and its corresponding HRL, with the goal of translating the LRL to English, we usually have access to: 1) a limited-sized LRL-ENG parallel dataset {S_LE, T_LE}; 2) a relatively high-resource HRL-ENG parallel dataset {S_HE, T_HE}; 3) a limited-sized LRL-HRL parallel dataset {S_HL, T_HL}; 4) large monolingual datasets in the LRL (M_L), the HRL (M_H), and English (M_E).
To clarify notation, we use S and T to denote the source and target sides of parallel datasets, and M for monolingual data. Created data will be referred to as Ŝ^m_{A→B}. The superscript m denotes a particular augmentation approach (specified in Section 3). The subscript denotes the translation direction used to create the data, with the LRL, HRL, and ENG denoted by 'L', 'H', and 'E' respectively.

Augmentation from English
The first two options for data augmentation that we explore are typical back-translation approaches:
1. ENG→LRL: We train an ENG→LRL system and back-translate English monolingual data to LRL, denoted by {Ŝ_{E→L}, M_E}.
2. ENG→HRL: We train an ENG→HRL system and back-translate English monolingual data to HRL, denoted by {Ŝ_{E→H}, M_E}.
Since we have access to LRL-ENG and HRL-ENG parallel datasets, we can train these back-translation systems (Sennrich et al., 2016) in a supervised fashion. The first option is the common practice for data augmentation. However, in a low-resource scenario, the created LRL data can be of very low quality due to the limited size of training data, which in turn could deteriorate LRL→ENG translation performance. As we show in Section 5, this is indeed the case.
The second direction, using HRL back-translated data for LRL→ENG translation, has not been explored in previous work. However, we suggest that in low-resource scenarios it has the potential to be more effective than the first option, because the quality of the generated HRL data will be higher, and the HRL is close enough to the LRL that joint training of a model on both languages will likely have a positive effect.

Augmentation via Pivoting
Using HRL-ENG data improves LRL-ENG translation for three reasons: (1) adding extra ENG data improves the target-side language model; (2) it is possible to share vocabulary (or subwords) between the languages; and (3) the syntactically similar HRL and LRL can jointly learn parameters of the encoder. However, regardless of how close the related languages might be, there is still a mismatch between the vocabulary, and perhaps the syntax, of the HRL and LRL. On the other hand, translating between the HRL and LRL should be an easier task than translating from English, and we argue that it can be accomplished by simple methods.
Hence, we propose "augmentation via pivoting", where we create an LRL-ENG dataset by translating the source side of HRL-ENG data into the LRL. There are again two ways in which we can construct a new LRL-ENG dataset:
3. HRL→LRL: We assume access to an HRL-ENG dataset. We then train an HRL→LRL system and convert the HRL side of S_HE to LRL, creating a {Ŝ_{H→L}, T_HE} dataset.
4. ENG→HRL→LRL: Exactly as before, except that the HRL-ENG dataset is itself the result of back-translation. That is, we first convert English monolingual data M_E to Ŝ_{E→H}, and then convert those sentences to the LRL, creating a dataset {Ŝ_{E→H→L}, M_E}.
Given an LRL-HRL dataset {S_LH, T_LH} one could also train supervised back-translation systems. But we still face the same problem of data scarcity, leading to poor quality of the augmented datasets. Based on the fact that an LRL and its corresponding HRL can be similar in morphology and word order, in the following sections we propose methods to convert HRL to LRL for data augmentation in a more reliable way.
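The two pivoting constructions above can be sketched as simple function compositions. In this minimal Python sketch, `hrl_to_lrl` and `eng_to_hrl` are placeholders for whatever conversion systems are used (supervised MT, word substitution, or unsupervised MT):

```python
def pivot_augment(hrl_eng_pairs, hrl_to_lrl):
    """Construct a pseudo LRL-ENG dataset from an HRL-ENG one:
    translate the HRL source side, keep the English targets unchanged."""
    return [(hrl_to_lrl(src), tgt) for src, tgt in hrl_eng_pairs]

def two_step_pivot(eng_mono, eng_to_hrl, hrl_to_lrl):
    """Two-step pivot back-translation: English monolingual data is first
    back-translated into the HRL, then converted into pseudo-LRL."""
    return [(hrl_to_lrl(eng_to_hrl(e)), e) for e in eng_mono]
```

Both functions return (source, target) pairs suitable for concatenation with the original LRL-ENG training data.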

LRL-HRL Translation Methods
In this section, we introduce two methods for converting HRL to LRL for data augmentation. Mikolov et al. (2013) show that word embedding spaces share a similar innate structure across languages, making it possible to induce bilingual dictionaries with a limited amount of, or even without, parallel data (Xing et al., 2015; Zhang et al., 2017; Lample et al., 2018b). Although the capacity of these methods is naturally constrained by the intrinsic properties of the two mapped languages, it is more likely that a high-quality bilingual dictionary can be created for two highly related languages. Given the induced dictionary, we can substitute HRL words with LRL ones and construct a word-by-word translated pseudo-LRL corpus.

Dictionary Induction
We use a supervised method to obtain a bilingual dictionary between the two highly related languages. Following Xing et al. (2015), we formulate the task of finding the optimal mapping between the source and target word embedding spaces as the Procrustes problem (Schönemann, 1966), which can be solved by singular value decomposition (SVD):

W* = argmin_{W ∈ O_d(ℝ)} ‖WX − Y‖_F = UV^T, with UΣV^T = SVD(YX^T),

where X and Y are the source and target word embedding spaces respectively. As a seed dictionary to provide supervision, we simply exploit identical words from the two languages. With the learned mapping W, we compute the distance between mapped source and target words with the CSLS similarity measure (Lample et al., 2018b). Moreover, to ensure the quality of the dictionary, a word pair is only added if the two words are each other's closest neighbors. As shown in Section 5.3, instead adding an LRL word to the dictionary for every HRL word results in relatively poor performance due to noise.
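The SVD solution and the mutual-nearest-neighbor filter can be sketched in a few lines of NumPy. This is a simplification: plain cosine similarity stands in for the full CSLS measure, and embeddings are stored as columns:

```python
import numpy as np

def procrustes_map(X, Y):
    """Solve min_W ||WX - Y||_F over orthogonal W: with U S V^T = SVD(Y X^T),
    the optimal mapping is W = U V^T (Schonemann, 1966)."""
    U, _, Vt = np.linalg.svd(Y @ X.T)
    return U @ Vt

def mutual_nearest_pairs(WX, Y):
    """Keep only (source, target) index pairs that are each other's
    nearest neighbor -- the bidirectional filter described above."""
    A = WX / np.linalg.norm(WX, axis=0, keepdims=True)
    B = Y / np.linalg.norm(Y, axis=0, keepdims=True)
    sim = A.T @ B                    # pairwise cosine similarities
    src_best = sim.argmax(axis=1)    # best target for each source word
    tgt_best = sim.argmax(axis=0)    # best source for each target word
    return [(i, j) for i, j in enumerate(src_best) if tgt_best[j] == i]
```

With a perfect seed dictionary of n identical words, X and Y are (d, n) matrices of the corresponding embedding columns, and the returned pairs index into the full vocabularies.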

Corpus Construction
Given an HRL-ENG dataset, we substitute the words in S_HE with the corresponding LRL ones using our induced dictionary. Words not in the dictionary are left untouched. By injecting LRL words, we convert the original or augmented HRL data into pseudo-LRL, which explicitly increases lexical overlap between the concatenated LRL and HRL data.
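The substitution step itself is trivial; a sketch in Python, where `hrl2lrl` is the induced HRL-to-LRL dictionary:

```python
def word_substitute(hrl_sentences, hrl2lrl):
    """Convert tokenized HRL sentences into pseudo-LRL word by word.
    Words absent from the induced dictionary are left untouched."""
    return [" ".join(hrl2lrl.get(tok, tok) for tok in sent.split())
            for sent in hrl_sentences]
```

For example, with the Turkish/Azerbaijani pair from Section 4, `word_substitute(["düşüncelerim ..."], {"düşüncelerim": "düşüncәlәrim"})` replaces only the dictionary entry and passes everything else through.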

Augmentation with Unsupervised MT
Although we assume the LRL and HRL to be similar in terms of morphology and word order, the simple word-by-word augmentation process will almost certainly be insufficient to completely replicate actual LRL data. A natural next step is to further convert the pseudo-LRL data into a version closer to the real LRL. To achieve this in our limited-resource setting, we propose to use unsupervised machine translation (UMT).
UMT Unsupervised Neural Machine Translation (Artetxe et al., 2018;Lample et al., 2018a,c) makes it possible to translate between languages without parallel data. This is done by coupling denoising auto-encoding, iterative back-translation, and shared representations of both encoders and decoders, making it possible for the model to extend the initial naive word-to-word mapping into learning to translate longer sentences.
Initial studies of UMT have focused on data-rich, morphologically simple languages like English and French. Applying the UMT framework to low-resource and morphologically rich languages is largely unexplored, with the exceptions of Neubig and Hu (2018) and Guzmán et al. (2019), who show that UMT performs exceptionally poorly between dissimilar language pairs, with BLEU scores lower than 1. The problem is naturally harder for morphologically rich LRLs for two reasons. First, morphologically rich languages have a higher proportion of infrequent words (Chahuneau et al., 2013). Second, even though still larger than the respective parallel datasets, the monolingual datasets in these languages are much smaller than those of HRLs.
Modified Initialization: As pointed out in Lample et al. (2018c), a good initialization plays a critical role in training NMT in an unsupervised fashion. Previously explored initialization methods include: 1) word-for-word translation with an induced dictionary to create synthetic sentence pairs for initial training (Lample et al., 2018a; Artetxe et al., 2018); 2) joint Byte-Pair-Encoding (BPE) over both the source and target corpora as a pre-processing step. While the first method aims to provide a reasonable prior for parameter search, the second simply forces the source and target languages to share the same subword vocabulary, which has been shown to be effective for translation between highly related languages.
Inspired by these two methods, we propose a new initialization method that uses our word substitution strategy (§3.1). Our initialization comprises three steps:
1. Use the induced dictionary to substitute HRL words in M_H with LRL ones, producing a pseudo-LRL monolingual dataset M̂_L.
2. Learn a joint word segmentation model on both M_L and M̂_L and apply it to both datasets.
3. Train an NMT model in an unsupervised fashion between M_L and M̂_L. The training objective L is a weighted sum of two loss terms, for denoising auto-encoding and iterative back-translation:

L = λ1 E_{x∼M}[−log P(x | C(x))] + λ2 E_{x∼M}[−log P(x | u*(x))],

where u* denotes translations obtained with greedy decoding, C denotes a noisy manipulation of the input, including randomly dropping and swapping words, and λ1 and λ2 denote the weights of the language modeling and back-translation terms respectively.
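As an illustration, the noise function C and the weighted objective might look as follows. This is a toy sketch on token lists, not the actual training code, which operates on batched tensors:

```python
import random

def corrupt(tokens, drop_p=0.1, k=3, rng=random):
    """Noise model C: drop each token with probability drop_p, then
    locally shuffle the rest so tokens only move a few positions."""
    kept = [t for t in tokens if rng.random() >= drop_p]
    # each token gets a jittered sort key, bounding its displacement by ~k
    keys = [i + rng.uniform(0, k) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept), key=lambda p: p[0])]

def umt_objective(loss_denoise, loss_backtrans, lam1=1.0, lam2=1.0):
    """L = lam1 * L_denoising + lam2 * L_back-translation."""
    return lam1 * loss_denoise + lam2 * loss_backtrans
```

Setting `lam1 = lam2 = 1` recovers the configuration used in the experiments below.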
In our method, we do not use any synthetic parallel data for initialization, expecting the model to learn the mappings between a true LRL distribution and a pseudo-LRL distribution. This takes advantage of the fact that the pseudo-LRL is naturally closer to the true LRL than the HRL is, as the injected LRL words increase vocabulary overlap.

Why Pivot for Back-Translation?
Pivoting through an HRL in order to convert English to LRL is a better option than directly translating ENG to LRL under three conditions: 1) the HRL and LRL are related enough to allow the induction of a high-quality bilingual dictionary; 2) there exists a relatively high-resource HRL-ENG dataset; 3) a high-quality LRL-ENG dictionary is hard to acquire due to data scarcity or morphological distance. Essentially, direct ENG→LRL back-translation may suffer from both data scarcity and morphological differences between the two languages. Our proposal breaks the process into two easier steps: ENG→HRL translation is easier due to the availability of data, and HRL→LRL translation is easier because the two languages are related.
A good example is the agglutinative language Azerbaijani, where a word may consist of several morphemes, each of which could map to an English word by itself. Correspondences to (also agglutinative) Turkish, however, are easier to uncover. To give a concrete example, the Azerbaijani word "düşüncәlәrim" can be fairly easily aligned to the Turkish word "düşüncelerim", while in English it corresponds to the phrase "my thoughts", which is unlikely to be perfectly aligned.

Data
We use the multilingual TED corpus (Qi et al., 2018) as a test-bed for evaluating the efficacy of each augmentation method. We conduct extensive experiments over four low-resource languages: Azerbaijani (AZE), Belarusian (BEL), Galician (GLG), and Slovak (SLK), along with their highly related languages Turkish (TUR), Russian (RUS), Portuguese (POR), and Czech (CES) respectively. We also have small-sized LRL-HRL parallel datasets, and we download Wikipedia dumps to acquire monolingual datasets for all languages.
The statistics of the parallel datasets are shown in Table 1. For AZE, BEL and GLG, we use all available Wikipedia data, while for the rest of the languages we sample a similar-sized corpus. We sample 2M/200K English sentences from Wikipedia data, which are used for baseline UMT training and augmentation from English respectively.

Pre-processing
We train a joint sentencepiece model for each LRL-HRL pair by concatenating the monolingual corpora of the two languages. The segmentation model for English is trained on English monolingual data only. We set the vocabulary size of each model to 20K. All data are then segmented by their respective segmentation model. We use FastText to train word embeddings on M_L and M_H with a dimension of 256 (used for the dictionary induction step). We also pre-train subword-level embeddings on the segmented M_L, M̂_L, and M_H with the same dimension.

Model Architecture
Supervised NMT: We use the self-attention Transformer model (Vaswani et al., 2017), adapting the implementation from the open-source translation toolkit OpenNMT (Klein et al., 2017). Both the encoder and decoder consist of 4 layers, with the word embedding and hidden unit dimensions set to 256. We use a batch size of 8096 tokens.
Unsupervised NMT: We train unsupervised Transformer models with the UnsupervisedMT toolkit. Layer sizes and dimensions are the same as in the supervised NMT model. The parameters of the first three layers of the encoder and the decoder are shared. The embedding layers are initialized with the pre-trained subword embeddings from monolingual data. We set the weights for the denoising auto-encoding and iterative back-translation terms to λ1 = 1 and λ2 = 1.

Training and Model Selection
After data augmentation, we follow the pre-train-and-fine-tune paradigm for learning (Zoph et al., 2016; Nguyen and Chiang, 2017), using the mixed fine-tuning strategy of Chu et al. (2017): we fine-tune the base model on the concatenation of the base and augmented datasets. For each setting, we perform a sufficient number of updates to reach convergence in terms of development perplexity.

Table 2: Evaluation of translation performance over four language pairs. Rows 1 and 2 show pre-training BLEU scores. Rows 3-13 show scores after fine-tuning. Statistically significantly best scores are highlighted (p < 0.05).
We use the performance on the development sets (as provided by the TED corpus) as our criterion for selecting the best model, both for augmentation and final model training.

Results and Analysis
A collection of our results with the baseline and our proposed methods is shown in Table 2.

Baselines
The performance of the base supervised model (row 1) varies from 11.8 to 29.5 BLEU points. Generally, the more distant the source language is from English, the worse the performance. A standard unsupervised MT model (row 2) achieves extremely low scores, confirming the results of Guzmán et al. (2019) and indicating the difficulty of directly translating between an LRL and ENG in an unsupervised fashion. Rows 3 and 4 show that standard supervised back-translation from English at best yields very modest improvements, the notable exception being SLK-ENG, which has more parallel data for training than the other settings. In the case of BEL and GLG, it even leads to worse performance. Across all four languages, supervised back-translation into the HRL helps more than into the LRL; the data is insufficient for training a good LRL-ENG MT model.

Back-translation from HRL
HRL→LRL: Rows 5-9 show the results when we create data using the HRL side of an HRL-ENG dataset. Both the low-resource supervised (row 5) and vanilla unsupervised (row 6) HRL→LRL translation do not lead to significant improvements. On the other hand, our simple word substitution approach (row 7) and the modified UMT approach (row 8) lead to improvements across the board: +3.0 BLEU points for AZE, +7.8 for BEL, +2.3 for GLG, and +1.1 for SLK. These results are significant, demonstrating that the quality of the back-translated data is indeed important.
In addition, we find that combining the datasets produced by our word substitution and UMT models provides an additional improvement in all cases (row 9). Interestingly, this happens despite the fact that the ENG data are exactly the same across rows 5-9.
ENG→HRL→LRL: We also show that even in the absence of parallel HRL-LRL data, our pivoting method is still valuable. Rows 10 and 11 in Table 2 show the translation accuracy when the augmented data are the result of our two-step pivot back-translation. In both cases, monolingual ENG is first translated into HRL and then into LRL with either just word substitution (row 10) or modified UMT (row 11). Although these results are slightly worse than our one-step augmentation of a parallel HRL-ENG dataset, they still outperform the standard back-translation baselines (rows 3 and 4). An interesting note is that in this setting, word substitution is clearly preferable to UMT for the second pivoting step, which we explain in §5.3.
Combinations: We obtain our best results by combining the two sources of data augmentation. Row 12 shows the result of using our simple word substitution technique on the HRL side of both a parallel and an artificially created (back-translated) HRL-ENG dataset. In this setting, we not only further improve the encoder side of our model, as before, but also aid the decoder's language modeling capabilities by providing ENG data from two distinct sources. This leads to improvements of 3.6 to 8.2 BLEU points over the base model and 0.3 to 2.1 over our best results from HRL-ENG augmentation.
Finally, row 13 shows our attempt to obtain further gains by combining the datasets from both word substitution and UMT, as we did in setting 7. This leads to a small improvement of 0.2 BLEU points in AZE, but also to a slight degradation on the other three datasets.
We also compare the results of our augmentation methods with other state-of-the-art methods that either perform improvements to modeling to improve the ability to do parameter sharing (Wang et al., 2019), or train on many different target languages simultaneously (Aharoni et al., 2019). The results demonstrate that the simple data augmentation strategies presented here improve significantly over these previous methods.

Analysis
In this section we focus on the quality of HRL→LRL translation, showing that our better M-UMT initialization method leads to significant improvements compared to standard UMT. We use the dev sets of the HRL-LRL datasets to examine the performance of M-UMT between related languages. We calculate the pivot BLEU score on the LRL side of each created dataset (S_HL, Ŝ^w_{H→L}, Ŝ^u_{H→L}, Ŝ^m_{H→L}). In Figure 2 we plot pivot HRL-LRL BLEU scores against the LRL-ENG translation BLEU scores. First, we observe that across all datasets, the pivot BLEU of our M-UMT method is higher than standard UMT's (the squares are all further right than their corresponding stars): vanilla UMT's scores are 2 to 10 BLEU points worse than the M-UMT ones. This means that UMT across related languages significantly benefits from initialization with our simple word substitution method. Second, as illustrated in Figure 2, the pivot BLEU score and the translation BLEU are imperfectly correlated; even though M-UMT reaches the highest pivot BLEU, the resulting translation BLEU is comparable to that of the simple word substitution method (rows 7 and 8 in Table 2). The reason is that the quality of {Ŝ^m_{H→L}, T_HE} is naturally restricted by that of {Ŝ^w_{H→L}, T_HE}, whose quality is in turn restricted by the induced dictionary. However, by combining the augmented datasets from these two methods, we consistently improve translation performance over using only word substitution augmentation (compare rows 7 and 9 of Table 2). This suggests that the two augmented sets improve LRL-ENG translation in an orthogonal way.

Table 3: Example conversion of an HRL (POR) sentence to pseudo-LRL (GLG) via word substitution, with pivot BLEU against the reference.
S_LE (GLG): Pero con todo, veste obrigado a agardar nas mans dunha serie de estraños moi profesionais.
S_HE (POR) [pivot BLEU 0.09]: Em vez disso, somos obrigados a esperar nas mãos de uma série de estranhos muito profissionais.
Ŝ^w_{H→L}: En vez disso, somos obrigados a esperar nas mans de unha serie de estraños moito profesionais.
Additionally, we observe that augmentation from back-translated HRL data leads to generally worse results than augmentation from original HRL data (compare rows 7 and 8 with rows 10 and 11 in Table 2). We believe this to be the result of noise in the back-translated HRL, which is then compounded by further errors from the induced dictionary. Therefore, we suggest that the simple word substitution method should be preferred for the second pivoting step when augmenting back-translated HRL data. Table 3 provides an example conversion of an HRL sentence to pseudo-LRL with the word substitution strategy, and its translation with M-UMT. From S_HE to Ŝ^w_{H→L}, the word substitution strategy achieves very high unigram scores (0.50 in this case), largely narrowing the gap between the two languages. The M-UMT model then edits the pseudo-LRL sentence to convert all its words to LRL. We also examine how well our augmentation methods address rare words, and how this correlates with LRL-ENG translation quality. For each word in the test set, we define a word as "rare" if it is in the training set's lowest 10th frequency percentile. This is particularly common for LRL test-set words when using concatenated HRL-LRL training data, as the LRL data will be smaller. We further define a rare word as "addressed" if, after adding the augmented data, it is no longer in the lowest 10th frequency percentile. Then, we define the "address rate" of a test dataset as the ratio of the number of addressed words to the number of rare words. The address rate of each method, along with the corresponding translation BLEU score, is shown in Figure 3. As indicated by the Pearson correlation coefficients, the two metrics are highly correlated, indicating that our augmentation methods significantly mitigate problems caused by rare words, improving MT quality as a result.
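The address-rate computation can be sketched as follows. This uses a simplified rank-based reading of "lowest 10th frequency percentile"; the paper's exact binning may differ:

```python
from collections import Counter

def rare_set(train_tokens, pct=10):
    """Word types in the lowest `pct` percent of training frequencies
    (one rank-based reading of the percentile definition above)."""
    counts = Counter(train_tokens)
    ordered = sorted(counts, key=counts.get)   # least frequent first
    k = max(1, len(ordered) * pct // 100)
    return set(ordered[:k])

def address_rate(base_train, aug_train, test_vocab, pct=10):
    """Fraction of rare test-set words that stop being rare once the
    augmented data is added to the training set."""
    rare_before = test_vocab & rare_set(base_train, pct)
    rare_after = test_vocab & rare_set(base_train + aug_train, pct)
    if not rare_before:
        return 0.0
    return len(rare_before - rare_after) / len(rare_before)
```

Here `base_train` and `aug_train` are token lists and `test_vocab` is the set of test-set word types; a higher return value means more rare words were addressed by augmentation.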

Dictionary Induction
We conduct experiments to compare two methods of inducing a dictionary from the mapped word embedding spaces: 1) Unidirectional: for each HRL word, we add its closest LRL word to the dictionary; 2) Bidirectional: we add a word pair only if the two words are each other's closest neighbors.
To quantify how many LRL words are injected into the HRL corpus, Table 4 shows the number of injected unique word types, the number of injected tokens, and the corresponding BLEU score of models trained with bidirectional and unidirectional dictionary induction. The ratio of injected tokens between the bidirectional and unidirectional methods is higher than the corresponding ratio of word types, indicating that the words injected by the bidirectional method are of relatively high frequency. The BLEU scores show that bidirectional induction performs better than unidirectional induction in most cases (the exception being BEL). One explanation is that adding every word's closest neighbor to the dictionary introduces additional noise that can harm low-resource translation.

Related Work
Our work is related to multilingual and unsupervised translation, bilingual dictionary induction, as well as approaches for triangulation (pivoting).
In a low-resource MT scenario, multilingual training that aims to share parameters by leveraging parallel datasets of multiple languages is a common practice. Some works target learning a universal representation for all languages, either by leveraging semantic sharing between mapped word embeddings (Gu et al., 2018) or by using character n-gram embeddings (Wang et al., 2019) to optimize subword sharing. More closely related to data augmentation, Nishimura et al. (2018) fill in missing data in a multi-source setting to boost multilingual translation.
Unsupervised machine translation enables training NMT models without parallel data (Artetxe et al., 2018;Lample et al., 2018a,c). Recently, multiple methods have been proposed to further improve the framework. By incorporating a statistical MT system as posterior regularization, Ren et al. (2019) achieved state-of-the-art for en-fr and en-de MT. Besides MT, the framework has also been applied to other unsupervised tasks like nonparallel style transfer (Subramanian et al., 2019;Zhang et al., 2018).
Bilingual dictionaries learned in both supervised and unsupervised ways have been used in low-resource settings for tasks such as named entity recognition (Xie et al., 2018) or information retrieval (Litschko et al., 2018). Hassan et al. (2017) synthesized data with word embeddings for spoken dialect translation, with a process that requires an LRL-ENG as well as an HRL-LRL dictionary, while our work only uses an HRL-LRL dictionary.
Bridging source and target languages through a pivot language was originally proposed for phrase-based MT (De Gispert and Marino, 2006; Cohn and Lapata, 2007) and later adapted for neural MT (Levinboim and Chiang, 2015). Subsequent work proposed joint training for pivot-based NMT, as well as using an existing pivot-target NMT model to guide the training of a source-target model. Lakew et al. (2018) proposed an iterative procedure to realize zero-shot translation by pivoting on a third language.

Conclusion
We propose a generalized data augmentation framework for low-resource translation that makes the best use of all available resources, including an effective two-step pivoting method that converts HRL parallel data into LRL data. In future work, we will explore methods for controlling the quality of the induced dictionary to improve both word substitution and M-UMT. We will also attempt to create an end-to-end framework by jointly training the M-UMT pivoting system and the low-resource translation system in an iterative fashion, in order to leverage more versions of augmented data.