Neural Machine Translation of Low-Resource and Similar Languages with Backtranslation

We present our contribution to the WMT19 Similar Language Translation shared task. We investigate the utility of neural machine translation on three low-resource, similar language pairs: Spanish – Portuguese, Czech – Polish, and Hindi – Nepali. Since state-of-the-art neural machine translation systems still require large amounts of bitext, which we do not have for the pairs we consider, we focus primarily on incorporating monolingual data into our models with backtranslation. In our analysis, we found Transformer models to work best on Spanish – Portuguese and Czech – Polish translation, whereas LSTMs with global attention worked best on Hindi – Nepali translation.


Introduction
We present our contribution to the WMT 2019 Similar Language Translation shared task, which focused on translation between similar language pairs in low-resource settings (Barrault et al., 2019). Similar languages have advantages that can be exploited when building machine translation systems. In particular, languages that come from the same language family (or from related language families) may share a great deal of information, such as lexical items or syntactic structures. This commonality has been exploited in a number of previous works on similar language translation (Hajič et al., 2003; Goyal and Lehal, 2009, 2011; Pourdamghani and Knight, 2017).
In this work, we are primarily concerned with neural machine translation (NMT). NMT is a language-agnostic framework in which language similarities could be exploited to build scalable, state-of-the-art (SOTA) machine translation systems. For example, NMT systems have been used on a number of WMT translation tasks, where they enabled highly successful modeling (Bahdanau et al., 2014; Luong et al., 2015; Koehn, 2017; Vaswani et al., 2017; Edunov et al., 2018). A weakness of NMT is its dependence on large bitext corpora. For this reason, researchers have considered ways to mitigate this specific issue.
A prominent approach meant to alleviate the need for large parallel data is backtranslation. This technique generates synthetic bitext by translating monolingual sentences of the target language into the source language with a pre-existing target-to-source translation system. These noisy source translations are then incorporated to train a new source-to-target MT system (Sennrich et al., 2015a). This approach is instrumental in unsupervised machine translation, where it has been shown that, up to a certain amount of bitext, unsupervised approaches can train better translation systems than supervised methods (Artetxe et al., 2017; Lample et al., 2017, 2018). Backtranslation research has also extended to scenarios of training supervised systems with just synthetic data (Edunov et al., 2018; Marie and Fujita, 2018). Given the success of this approach, it offers a promising avenue for leveraging monolingual data to improve translation between similar languages.
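The core of backtranslation can be sketched as follows; `translate_tgt2src` stands in for any pre-existing target-to-source system (here a toy placeholder, not a real model):

```python
def backtranslate(monolingual_tgt, translate_tgt2src):
    """Pair each monolingual target sentence with a synthetic source
    produced by a target-to-source translation system."""
    synthetic_bitext = []
    for tgt in monolingual_tgt:
        src = translate_tgt2src(tgt)  # noisy synthetic source sentence
        synthetic_bitext.append((src, tgt))
    return synthetic_bitext

# Toy stand-in "model": reverses word order.
toy_model = lambda s: " ".join(reversed(s.split()))
pairs = backtranslate(["el gato negro"], toy_model)
```

The resulting (synthetic source, real target) pairs are then mixed with the real bitext to train the source-to-target system.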
Motivated by the success of backtranslation, we focus on leveraging monolingual data to improve NMT systems for similar language pairs. Hence, for our submissions to the shared task, we focus on investigating the effectiveness of synthetic bitext produced with backtranslation.
The rest of the paper is organized as follows: We discuss our methods in Section 2, including our NMT models and our decisions for backtranslation. In Section 3, we describe our analysis of the shared task data. In Section 4, we present our experimental findings, discussing the effectiveness of backtranslation in terms of BLEU score performance. We conclude in Section 5.

Methodology
Here, we outline our approach to improving translation quality for similar languages. This includes a description of the two NMT models we considered in our analysis and our procedure for backtranslating data.

Model Architectures
Sequence-to-sequence (seq2seq) models (Vinyals et al., 2015) have emerged as the most prominent architecture in the NMT literature. In seq2seq models, source sentences X are encoded as a series of latent representations capturing contextual word information. A decoder utilizes these hidden states, for example for initialization, to help inform the decoding process for target sentences Y. For our experiments, we consider both a recurrent neural network (RNN) with attention and the Transformer. We briefly introduce each of these next.

Recurrent Neural Network Architecture
There are a number of variations of RNN architectures previously considered for NMT. The one we chose is the default model available in the OpenNMT-py toolkit (Klein et al., 2017). It is an implementation of one of several variations studied by Luong et al. (2015), whose work focused on understanding attention in depth. It follows the typical seq2seq architecture but includes an attention mechanism which combines the encoder hidden states into a context vector that is added as an additional input to the decoder. We include additional details of this particular model in the supplementary material, and otherwise only mention that both the encoder and decoder are Long Short-Term Memory cells (Hochreiter and Schmidhuber, 1997). For the rest of the paper, we refer to this model as LSTM+Attn.

Transformer
The Transformer is a model that uses intra-attention (self-attention) instead of sequential hidden states. For translation, it has been shown to train faster than RNN-based seq2seq architectures (Vaswani et al., 2017). For brevity, we do not discuss this model in detail, and instead refer readers to the original paper (Vaswani et al., 2017) or, alternatively, the tutorial by Rush (2018), which provides a step-by-step guide to the implementation.

Backtranslation Decisions
Applying backtranslation in practice generally requires a number of decisions, such as the amount of synthetic text to add and the choice of decoding scheme. Both of these considerations have previously been studied by Edunov et al. (2018), whose findings can be applied as general backtranslation guidelines. We largely based our choices on their findings, with one discrepancy: in their work, the number of available training sentence pairs was treated as the key factor when making backtranslation choices.
However, Edunov et al. (2018) do not discuss other aspects of bitext such as sentence length variation, number of words, or even initial bitext quality. This makes it difficult to apply their findings to other bitext corpora based solely on the number of sentences. Our assumption when applying their findings is that the translation system's BLEU score is more reflective of the expected synthetic sentence quality than the number of sentences used. Our final results suggest this assumption is fairly reasonable: our Hindi – Nepali translation models, despite having the smallest bitext corpus, performed better on the test sets than our Polish – Czech systems following this choice.
Before backtranslating any data, we trained both the Transformer and LSTM+Attn NMT systems with only the provided bitext corpora and calculated the BLEU score on the validation set. Based on these bitext-only model performances, we then chose the appropriate backtranslation scheme for each language pair. For the Spanish – Portuguese systems, we sampled the synthetic source sentences because Edunov et al. (2018) found that, for resource-rich language pairs, this can provide a better training signal. In our work, this corresponded to randomly picking each word x_i from the probability distribution for the current position, x_i ∼ p(x_i | y, x_<i). For both Czech – Polish and Hindi – Nepali, the synthetic source sentences were deterministically produced with greedy decoding, as their validation BLEU scores were much lower. This again was in line with the backtranslation behavior found by Edunov et al. (2018).
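These two decoding schemes can be illustrated at the level of a single position; the sketch below assumes the model exposes a distribution over candidate tokens (names are illustrative, not our actual decoder):

```python
import random

def choose_token(probs, scheme="greedy", rng=random):
    """Pick the next token from probs, a dict token -> p(token | y, x_<i).
    Greedy decoding takes the argmax; sampling draws from the full
    distribution, as recommended for resource-rich pairs by
    Edunov et al. (2018)."""
    if scheme == "greedy":
        return max(probs, key=probs.get)
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]
```

In a full decoder this choice is made position by position until an end-of-sentence token is produced.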
We used these decoding schemes to backtranslate the available monolingual data with the best corresponding bitext-only NMT system (either the Transformer or LSTM+Attn model) for each language direction. The two exceptions were Spanish and Hindi, for each of which we had significantly more monolingual data. For Spanish, we used at most ∼3.3M sentences, and for Hindi we only used ∼2.4M sentences.
For our experiments, the best performing bitext-only systems produced 2 sets of backtranslated text. The first set (which we refer to as Synth 1) included only parts of the considered monolingual data for a subset of the translation directions. The second set (henceforth referred to as Synth 2) consisted of backtranslating all Czech, Polish, Hindi, and Nepali monolingual data and larger portions of the Portuguese and Spanish data. As part of the Synth 2 data set, we upsampled the real bitext relative to the synthetic bitext, meaning that for every synthetic sentence our models trained on, they trained on several sentences of real bitext. This decision was motivated by the performances on our Synth 1 datasets, where several language pairs did not perform as well. In most cases, with the exception of a few of our Spanish – Portuguese models, systems trained with these synthetic datasets outperformed our bitext-only models.
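The upsampling just described can be sketched as simple corpus repetition (the default factor shown is illustrative):

```python
def mix_corpora(bitext, synthetic, bitext_upsample=2):
    """Assemble a training corpus in which the model sees the real
    bitext `bitext_upsample` times for every pass over the synthetic
    (backtranslated) bitext."""
    return bitext * bitext_upsample + synthetic
```

In practice, toolkits often achieve the same effect with per-corpus sampling weights rather than literal duplication.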
At this point, we had produced 24 models trained on synthetic and real bitext. 1 From these 24 models, we again chose the best performing ones to perform a third round of backtranslation. This third set of backtranslated data (which we refer to as Synth 3) followed the same decoding schemes for each language pair as previously discussed. The amount of backtranslation was mostly the same, except for the synthetic Portuguese to Spanish data, where we backtranslated the largest amount of the available Spanish monolingual data. Exact counts are available in Tables 2, 3, and 4. In the work we report here, we only followed this procedure once. In the future, our goal will be to follow the iterative backtranslation approach proposed by Hoang et al. (2018).

Dataset Analysis
In this section, we present an analysis of the shared task data. For additional information, such as our pre-processing of the data, refer to the supplementary material.
To get an understanding of the provided data, we collected statistics including word and sentence counts, sentence length variation, and token overlap. Table 1 contains information on the approximate sentence and word counts after cleaning the data. Based on the size of the datasets, we hypothesized that our most successful NMT system would be for Spanish – Portuguese (∼3.5M sentences), followed by Czech – Polish (∼1.7M sentences), with Hindi – Nepali (∼68K sentences) being the most difficult.
In addition, the sentence length variations in the box-plots of Figure 1 highlight how, for Spanish – Portuguese and Czech – Polish, the sentences in the bitext are generally longer than for Hindi – Nepali. In our experimental results, we reason that part of the success of the LSTM+Attn models on Hindi – Nepali is due to the short sentence lengths. A cited advantage of the Transformer (Vaswani et al., 2017) is its ability to encode longer dependencies (though see also Tang et al., 2018), which would be less of a requirement on the Hindi – Nepali corpus due to its shorter sentences.
We also wanted to understand from which perspective each of the language pairs might be considered similar, so we analyzed the overlap between tokens in each language pair's bitext. We tokenized our cleaned data with the Tok-Tok tokenizer available through the NLTK toolkit. 2 We then calculated the percentage of shared tokens compared to the total tokens at increasingly higher thresholds of token frequency. Figure 2 shows our findings for the percentage of shared tokens at different thresholds of token frequency. These plots suggest that, although Spanish – Portuguese and Czech – Polish have larger overall token overlap, much of the language discrepancy lies in the most frequent tokens. Czech and Polish in particular seem to have significantly fewer shared tokens, which could suggest a smaller lexical overlap. This could be partially due to differences in alphabets between Czech and Polish. By contrast, Hindi and Nepali seem to share much more in common, as we see an increase in overlap for more frequent tokens, though we note this could be an artefact of the small size of the Hindi and Nepali data. We now present our experimental findings.
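The overlap computation can be approximated with a sketch like the following (the exact thresholding used for our plots may differ):

```python
from collections import Counter

def shared_token_fraction(tokens_a, tokens_b, min_freq=1):
    """Fraction of token types occurring at least `min_freq` times in
    the combined corpora that appear in both corpora."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    vocab = {t for t, c in (ca + cb).items() if c >= min_freq}
    if not vocab:
        return 0.0
    return sum(1 for t in vocab if t in ca and t in cb) / len(vocab)
```

Sweeping `min_freq` upward traces out a curve like those in Figure 2: restricting to more frequent tokens changes how much of the vocabulary the two languages share.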

Experiments
For all of our experiments, we use OpenNMT-py (Klein et al., 2017) to handle training and build our models. For our LSTM+Attn model, we used the default parameters provided in the OpenNMT-py toolkit. For the Transformer, we used the recommended settings provided by the OpenNMT-py toolkit, with the exception of using 2 layers in the Transformer encoder and decoder instead of 6. We changed the number of Transformer layers because we found, in our preliminary results on the bitext-only systems, that this worked well for each language direction. We did not investigate model architecture and hyperparameter tuning further, and we note that additional work in this context could lead to better performance. In addition, we also perform ensemble decoding using different checkpoints from the optimization process; further details can be found in the supplementary material. We represented the vocabulary for each language with a joint byte-pair encoding (BPE) model (Sennrich et al., 2015b) trained on all available bitext and monolingual data shared between the languages, motivated by the work of Lample et al. (2018). Our BPE models were trained with the SentencePiece API and consisted of 20,000 merge operations. 3 The reader may notice that, based on our discussion in Section 3, Czech and Polish may not have necessarily benefited from a joint vocabulary. This indeed may be the case, especially as our final results for Czech – Polish translation were the lowest-performing among all our final systems.
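Our actual BPE models were trained with the SentencePiece API; as a toy illustration of what a single BPE merge operation does, consider the following sketch (the corpus and symbols are illustrative):

```python
from collections import Counter

def most_frequent_pair(words):
    """One BPE step: given words as tuples of symbols, return the most
    frequent adjacent symbol pair, which would be merged next."""
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Apply one merge: replace every occurrence of `pair` with the
    concatenated symbol."""
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                out.append(w[i] + w[i + 1])
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(tuple(out))
    return merged
```

Repeating this step 20,000 times on the joint corpus yields a shared subword vocabulary of the kind we used.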
We present our findings for each respective language pair on the validation data provided by the task organizers. 4 We measure performance on the validation data with the BLEU score computed on the BPE representations of sentences, using the script that comes with the OpenNMT-py toolkit. Note that for our test data, BLEU score is measured on the detokenized input sequences (i.e., word tokens rather than BPE). Tables 2, 3, and 4 contain information on the size of the training data used for each model. Note that we did not evaluate the Synth 3 dataset on the LSTM+Attn model, due to our earlier findings and compute resource limitations.
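For reference, detokenizing SentencePiece output back to word tokens amounts to concatenating the pieces and restoring the word-boundary marker (U+2581) as whitespace; a small sketch (not the exact script we used):

```python
def sp_detokenize(pieces):
    """Join SentencePiece pieces and turn the word-boundary marker
    (U+2581) back into spaces."""
    return "".join(pieces).replace("\u2581", " ").strip()
```

Word-level BLEU on the test data is then computed on strings of this detokenized form rather than on the BPE pieces themselves.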

Spanish ↔ Portuguese Results
We found that too much sampled backtranslated text did not necessarily improve translation quality. Between the Synth 1 and Synth 2 synthetic sets, we see a small drop in performance, particularly for Spanish to Portuguese translation, where we had much more available monolingual data to backtranslate. In our best performing model, part of the improvement is likely due to doubling the number of times the bitext was seen relative to the synthetic sentences. This is in line with previous research findings on the importance of bitext over synthetic sentence pairs (Sennrich et al., 2015a; Edunov et al., 2018).

Czech ↔ Polish Results
Table 3 shows our Czech – Polish validation BLEU scores and, like our Spanish – Portuguese systems, excludes results of the LSTM+Attn model on the Synth 3 dataset. Similar to our Spanish – Portuguese models, we found that the most useful change was doubling the number of times the bitext was trained on. One difference with our Czech – Polish data is that we upsampled the bitext sooner, trying it on the Synth 2 dataset instead of waiting until Synth 3. This discrepancy allowed us to isolate improvements on the Synth 3 dataset to the quality of the synthetic sentences, instead of having the result confounded with upsampling as with Spanish – Portuguese. As we see in our results from Synth 2 to Synth 3, where the only difference is synthetic sentence quality, we again achieve an improvement in BLEU score.

Hindi ↔ Nepali Translation
Table 4 shows our results for Hindi – Nepali translation. As our initial models for this pair performed relatively poorly, we decided to train even more frequently on the bitext compared to the amounts considered for the previous language pairs. This decision was in part motivated by the results of Edunov et al. (2018), where upsampling bitext alongside deterministically backtranslated data seemed most effective in low-resource language pairs.
Initially, we believed that maintaining a close to 1-to-1 ratio of synthetic to real bitext would always be necessary to achieve better results. For the Synth 1 dataset, we upsampled the training corpus 5x for Hindi to Nepali translation and 10x for Nepali to Hindi translation. This led to large improvements for both models when translating from Nepali to Hindi, although the improvements were not as noticeable when translating Hindi to Nepali. The most likely explanation is the noticeable difference in the amount of synthetic sentences. At least for Nepali to Hindi, the choice to maintain the 1-to-1 ratio seemed to work best, as we achieved our best performance for this translation direction on Synth 1.
Although maintaining close to a 1-to-1 ratio generally seems to be important, we note one discrepancy in the Hindi to Nepali results. Between the Synth 1 and Synth 2 Hindi to Nepali datasets, we kept the upsampled bitext fixed while increasing the amount of synthetic sentences to closer to a 2-to-3 ratio of real to synthetic bitext. This increase in data seemed beneficial in the Transformer case, as its BLEU score improved, but seemed to negatively impact the LSTM+Attn model. This raises the question of whether backtranslation considerations could be model dependent; we leave investigating this question as future work. We further found that there is a limit to the benefit of upsampling the bitext, despite having even more synthetic bitext. For the Synth 3 datasets, we returned to maintaining a 1-to-1 ratio of real to synthetic bitext. This led to upsampling the data 10x for translating Hindi to Nepali, and 20x for Nepali to Hindi. This upsampling, along with higher quality synthetic data, did seem to benefit both the Transformer and LSTM+Attn models for Hindi to Nepali translation, which achieved our best performances. In contrast, as the amount of synthetic data increased for Nepali to Hindi translation, we observed a negative impact on performance compared to the Synth 1 results. Even though the synthetic sentences were produced with a better translation system, the Synth 3 dataset performance was still worse.

Shared Task Evaluation
Official shared task results for our primary submissions are presented in Table 5, along with a number of important choices we made as to which models to submit. There are a number of interesting behaviors in terms of performance from our validation to test sets. In the Spanish – Portuguese translation systems, we can see that the relative BLEU scores between the two directions are fairly stable. This is likely due in part to the sampling-based backtranslation we used for this pair, in contrast to the other language pairs, which used greedily decoded sentences. As for the other language pairs, although we originally hypothesized that Czech – Polish would produce better systems than Hindi – Nepali, our results suggest the opposite, and that we may have overfit the Czech – Polish validation set compared to Hindi – Nepali translation.

Conclusion
Our findings are congruent with previous work showing the efficacy of backtranslation as a strategy for improving NMT systems. However, we couch this conclusion with caution, as tuning the correct amount of included synthetic data still depends heavily on the size of the data at hand (which can be limited). Further work is needed before we can reach a more definitive recommendation as to how to perform backtranslation in different contexts with varying degrees of resource availability.

A Data Sources
Submissions to the shared task were asked to use only the data provided by the organizers. This included bitext from a number of different sources of varying utility for training translation systems. The Spanish – Portuguese and Czech – Polish bitext corpora included the latest JRC-Acquis (Steinberger et al., 2006), Europarl (Koehn, 2005), and News Commentary data sets, as well as the Wiki Titles corpus (Bojar et al., 2018). The Hindi – Nepali corpus consists of the KDE, Ubuntu, and Gnome data sets available through Tiedemann (2012). 5 There was also a bilingual dictionary included for the Hindi – Nepali language pair, but we did not include it in our analysis because it consisted largely of word-to-word translations. By the same argument, we likely should not have included the Wiki Titles data set either, as this corpus also consisted largely of word-to-word translations. An interesting observation from our results is that our Czech – Polish systems ended up doing much worse than our Hindi – Nepali systems, suggesting that fewer, longer sentences are perhaps more valuable than shorter, near word-to-word translations.

Additionally, the organizers provided monolingual datasets for Spanish, Portuguese, Czech, and Polish. These largely came from the same sources, including the Europarl, JRC-Acquis, News Crawl, and News Commentary datasets. For Hindi and Nepali, we were allowed to use any monolingual data we found. For Hindi, we only used the corpora collected by Bojar et al. (2014), which consist of several million sentences collected from the internet. For Nepali, we largely used corpora provided in the WMT19 Parallel Corpus Filtering shared task, which included a filtered Wikipedia dump of Nepali sentences, the Global Voices corpus (Koehn, 2018), the Nepali tagged corpus (nep), and a Bible corpus (Christos-C, 2017).
Externally, we found 3 additional Nepali corpora: the Nepali News corpus (Bhatta, 2017), the TED Multilingual corpus (Kulkarni, 2016), and an additional Wikipedia dump corpus (Rosa, 2018).

A.1 Data Set Cleaning Information
To clean the datasets, we removed white spaces and re-tabulated the sentence pairs because of formatting errors. Additionally, we removed any pairs that were fewer than 4 characters long, excluding leading and trailing white spaces. Tables 6 and 8 contain the word counts per data set considered in this work. Tables 7 and 9 contain the sentence counts per data set after the cleaning process.

5 The actual Hi – Ne sources were never disclosed but were confirmed by the organizers.

B.1 RNN Architecture Details

As mentioned in the paper, our RNN architecture is one of several studied in the work of Luong et al. (2015). The particular model we use can be described with the following equations.
p(y_j | y_<j, x) = Generator(s̃_j)    (1)
score(s_j, z_i) = s_j^T W_g z_i    (2)
c_j = Σ_i softmax(score(s_j, z_i)) z_i    (3)
s̃_j = tanh(W_s [c_j ; s_j])    (4)

The encoder and decoder are Long Short-Term Memory (LSTM) RNNs (Hochreiter and Schmidhuber, 1997), where the encoder produces latent representations z_i for each word embedding x_i in the source sentence of length T. Equation 2 is the general attention score proposed by Luong et al. (2015), where W_g is learned, and Equations 3 and 4 show the application of this global attention mechanism. The decoder LSTM produces hidden states s_j using as input the word embedding y_j, the previous context hidden state s̃_{j−1}, and the previous hidden state s_{j−1}. The context hidden states s̃_j determine the log-probabilities of the target words and are calculated from the concatenation of the context vector c_j and the hidden state s_j with learned parameters W_s.
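The attention step described in Equations 2 through 4 can be sketched as follows (a minimal pure-Python illustration with illustrative shapes, not the OpenNMT-py implementation):

```python
import math

def global_attention(s_j, Z, W_g, W_s):
    """One Luong-style general-attention step on plain lists.
    s_j: decoder state (length d); Z: encoder states (T rows of length d);
    W_g: d x d score matrix; W_s: d x 2d output matrix."""
    d = len(s_j)
    # Eq. 2: score(s_j, z_i) = s_j^T W_g z_i
    sTW = [sum(s_j[r] * W_g[r][c] for r in range(d)) for c in range(d)]
    scores = [sum(sTW[c] * z[c] for c in range(d)) for z in Z]
    # Softmax over source positions (numerically stabilized).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Eq. 3: context vector c_j as the weighted sum of encoder states.
    c_j = [sum(w * z[r] for w, z in zip(weights, Z)) for r in range(d)]
    # Eq. 4: attentional state tanh(W_s [c_j ; s_j]).
    cat = c_j + list(s_j)
    return [math.tanh(sum(W_s[r][k] * cat[k] for k in range(2 * d)))
            for r in range(d)]
```

The Generator in Equation 1 is then a learned projection plus softmax over the target vocabulary applied to this attentional state.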

B.2 Ensemble Decoding
As a way to further improve translation quality, previous research has shown that an ensemble of models can improve translation performance (Koehn, 2017). For our work, this meant using a window around the best performing single models that we found on the evaluation set. By window, we mean that we translated the test and evaluation sets with the single best model along with the n checkpoint models before and the n checkpoint models after it. For our final evaluations, this involved either n = 1 or n = 2 windows around the best performing models. We did not find much difference between the two choices of n, as both generally gave only minute improvements to performance. Our checkpoints were saved after every 10,000 mini-batch updates. As an example, we generally found the Transformer worked well with around 50,000 or 60,000 updates. Supposing we found the checkpoint at 50,000 steps the best and picked n = 1, we would then also include the checkpoints at 40,000 and 60,000 updates to produce the final translations.
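The checkpoint selection described above can be sketched as follows (the default interval follows the 10,000-update checkpoint frequency described above):

```python
def checkpoint_window(best_step, n=1, interval=10000):
    """Return the training steps of the checkpoints used for ensembling:
    the best checkpoint plus the n checkpoints saved before and after it."""
    return [best_step + k * interval for k in range(-n, n + 1)]
```

For example, `checkpoint_window(50000, n=1)` selects the checkpoints at 40,000, 50,000, and 60,000 updates, matching the worked example in the text.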

B.3 Hyperparameter Information
Table 10 contains the specific parameters for the models used in our analysis. One parameter left out of the table is the number of updates, which in OpenNMT-py is counted per mini-batch update. For the RNN model, we found 150,000 steps generally sufficient for our best performances on the Hindi – Nepali data, and at most 50,000 or 60,000 steps sufficient for the Transformer on Spanish – Portuguese and Czech – Polish, even with the backtranslated data.

B.4 Tuning results
Table 11 shows the full results of tuning our models. As a reminder, the BLEU scores were calculated on the byte-pair encoding representations of the sentences instead of the detokenized translations. This is in part why the scores are, in some cases, much higher than the final validation scores reported in the paper.