Data Augmentation for Low-Resource Neural Machine Translation

The quality of a Neural Machine Translation system depends substantially on the availability of sizable parallel corpora. For low-resource language pairs this is not the case, resulting in poor translation quality. Inspired by work in computer vision, we propose a novel data augmentation approach that targets low-frequency words by generating new sentence pairs containing rare words in new, synthetically created contexts. Experimental results on simulated low-resource settings show that our method improves translation quality by up to 2.9 BLEU points over the baseline and up to 3.2 BLEU over back-translation.


Introduction
In computer vision, data augmentation techniques are widely used to increase robustness and improve learning of objects with a limited number of training examples. In image processing the training data is augmented by, for instance, horizontal flipping, random cropping, tilting, and altering the RGB channels of the original images (Krizhevsky et al., 2012; Chatfield et al., 2014). Since the content of the new image is still the same, the label of the original image is preserved (see top of Figure 1). While data augmentation has become a standard technique for training deep networks for image processing, it is not common practice when training networks for NLP tasks such as Machine Translation.
Neural Machine Translation (NMT) (Bahdanau et al., 2015; Sutskever et al., 2014; Cho et al., 2014) is a sequence-to-sequence architecture where an encoder builds up a representation of the source sentence and a decoder, using the previous LSTM hidden states and an attention mechanism, generates the target translation. To train a model with reliable parameter estimations, these networks require numerous instances of sentence translation pairs with words occurring in diverse contexts, which is typically not available in low-resource language pairs. As a result NMT falls short of reaching state-of-the-art performances for these language pairs (Zoph et al., 2016). The solution is to either manually annotate more data or perform unsupervised data augmentation. Since manual annotation of data is time-consuming, data augmentation for low-resource language pairs is a more viable approach. Recently Sennrich et al. (2016a) proposed a method to back-translate sentences from monolingual data and augment the bitext with the resulting pseudo parallel corpora.

[Figure 1. Example sentence pairs shown in the figure: "A boy is holding a bat." / "Ein Junge hält einen Schläger." and "A boy is holding a backpack." / "Ein Junge hält einen Rucksack."]
In this paper, we propose a simple yet effective approach, translation data augmentation (TDA), that augments the training data by altering existing sentences in the parallel corpus, similar in spirit to the data augmentation approaches in computer vision (see Figure 1). In order for the augmentation process in this scenario to be label-preserving, any change to a sentence in one language must preserve the meaning of the sentence, requiring sentential paraphrasing systems which are not available for many language pairs. Instead, we propose a weaker notion of label preservation that allows us to alter both source and target sentences at the same time as long as they remain translations of each other.
While our approach allows us to augment data in numerous ways, we focus on augmenting instances involving low-frequency words, because the parameter estimation of rare words is challenging, and further exacerbated in a low-resource setting. We simulate a low-resource setting as done in the literature (Marton et al., 2009; Duong et al., 2015) and obtain substantial improvements for translating English→German and German→English.

Translation Data Augmentation
Given a source and target sentence pair $(S, T)$, we want to alter it in a way that preserves the semantic equivalence between $S$ and $T$ while diversifying the training examples as much as possible. A number of ways to do this can be envisaged, for example paraphrasing (parts of) $S$ or $T$. Paraphrasing, however, is by itself a difficult task and is not guaranteed to bring useful new information into the training data. We choose instead to focus on a subset of the vocabulary that we know to be poorly modeled by our baseline NMT system, namely words that occur rarely in the parallel corpus. Thus, the goal of our data augmentation technique is to provide novel contexts for rare words. To achieve this we search for contexts where a common word can be replaced by a rare word and consequently replace its corresponding word in the other language by that rare word's translation:

$$
\begin{array}{ll}
\text{original pair} & \text{augmented pair} \\
S : s_1, \ldots, s_i, \ldots, s_n & S' : s_1, \ldots, s'_i, \ldots, s_n \\
T : t_1, \ldots, t_j, \ldots, t_m & T' : t_1, \ldots, t'_j, \ldots, t_m
\end{array}
$$

where $t_j$ is a translation of $s_i$ and word-aligned to $s_i$. Plausible substitutions are those that result in a fluent and grammatical sentence but do not necessarily maintain its semantic content. As an example, the rare word motorbike can be substituted in different contexts, but not every such substitution is plausible. Implausible substitutions need to be ruled out during data augmentation. To this end, rather than relying on linguistic resources which are not available for many languages, we rely on LSTM language models (LM) (Hochreiter and Schmidhuber, 1997; Jozefowicz et al., 2015) trained on large amounts of monolingual data in both forward and backward directions.

Our data augmentation method involves the following steps:

Targeted words selection: Following common practice, our NMT system limits its vocabulary $V$ to the $v$ most common words observed in the training corpus. We select the words in $V$ that have fewer than $R$ occurrences and use this as our targeted rare word list $V_R$.
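The targeted word selection step can be illustrated with a minimal sketch. It assumes whitespace tokenization and a tiny toy corpus; the function name `rare_word_list` and its parameter names are hypothetical and stand in for the vocabulary size v and the threshold R described above.

```python
from collections import Counter


def rare_word_list(sentences, vocab_size, rare_threshold):
    """Return V_R: words in the top-`vocab_size` vocabulary V that occur
    fewer than `rare_threshold` times in the corpus."""
    # Count whitespace-separated tokens (an assumption made for this sketch).
    counts = Counter(tok for sent in sentences for tok in sent.split())
    # V: the `vocab_size` most frequent words.
    vocab = counts.most_common(vocab_size)
    # V_R: the members of V below the rare-word threshold R.
    return {word for word, count in vocab if count < rare_threshold}


# Toy corpus, invented for illustration only.
toy_corpus = [
    "a boy is holding a bat",
    "a boy is holding a ball",
    "a girl is holding a ball",
]
print(sorted(rare_word_list(toy_corpus, vocab_size=7, rare_threshold=2)))
```

Words that fall outside the vocabulary are never targeted, which is consistent with the statement that augmentation does not introduce new words.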
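Rare word substitution: If the LM suggests a rare substitution in a particular context, we replace that word and add the new sentence to the training data. Formally, given a sentence pair $(S, T)$ and a position $i$ in $S$, we compute the probability distribution over $V$ by the forward and backward LMs and select rare word substitutions $C$ as follows:

$$
\begin{aligned}
\overrightarrow{C} &= \{ s'_i \in V_R : \mathrm{topK}\; P_{\mathrm{ForwardLM}}(s'_i \mid s_1^{i-1}) \} \\
\overleftarrow{C} &= \{ s'_i \in V_R : \mathrm{topK}\; P_{\mathrm{BackwardLM}}(s'_i \mid s_n^{i+1}) \} \\
C &= \{ s'_i \mid s'_i \in \overrightarrow{C} \wedge s'_i \in \overleftarrow{C} \}
\end{aligned}
$$

where topK returns the $K$ words with highest conditional probability according to the context. The selected substitutions $s'_i$ are used to replace the original word and generate a new sentence.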
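As a rough illustration of the substitution step, the sketch below takes the forward and backward LM distributions at one position as plain dictionaries (hypothetical stand-ins for the LSTM LM outputs) and keeps the rare words that appear in the top K of both. Requiring agreement between the two directions is an assumption of this sketch, and all probabilities are invented.

```python
def top_k(dist, k):
    """Return the k words with the highest probability in `dist` (word -> prob)."""
    return set(sorted(dist, key=dist.get, reverse=True)[:k])


def rare_substitutions(forward_dist, backward_dist, rare_words, k):
    """Rare words ranked in the top k by both the forward and the backward LM."""
    forward_c = top_k(forward_dist, k) & rare_words
    backward_c = top_k(backward_dist, k) & rare_words
    return forward_c & backward_c


# Invented distributions for the slot in "a boy is holding a ___ ."
forward = {"bat": 0.4, "backpack": 0.3, "ball": 0.2, "the": 0.1}
backward = {"ball": 0.5, "backpack": 0.3, "bat": 0.15, "the": 0.05}
rare = {"backpack", "bat"}

print(sorted(rare_substitutions(forward, backward, rare, k=2)))
print(sorted(rare_substitutions(forward, backward, rare, k=3)))
```

A larger K admits more candidates at the cost of less probable, and therefore potentially less fluent, substitutions.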
Translation selection: Using automatic word alignments trained over the bitext, we replace the translation of word $s_i$ in $T$ by the translation of its substitution $s'_i$. Following a common practice in statistical MT, the optimal translation $t'_j$ is chosen by multiplying direct and inverse lexical translation probabilities with the LM probability of the translation in context:

$$
t'_j = \arg\max_{t \in \mathrm{trans}(s'_i)} P(s'_i \mid t)\, P(t \mid s'_i)\, P_{\mathrm{LM}}(t \mid t_1^{j-1})
$$

If no translation candidate is found because the word is unaligned or because the LM probability is less than a certain threshold, the augmented sentence is discarded. This reduces the risk of generating sentence pairs that are semantically or syntactically incorrect.
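The translation selection rule can be sketched as follows. The lexical translation table and the LM scores are passed in as plain dictionaries, which are hypothetical stand-ins for the alignment-derived lexicon and the target-side LM; the probabilities and the threshold values are invented for illustration.

```python
def select_translation(candidates, lm_prob, lm_threshold):
    """Pick the candidate maximizing direct * inverse * LM probability.

    candidates: translation -> (direct lexical prob, inverse lexical prob)
    lm_prob:    translation -> LM probability of the translation in context
    Returns None when there is no candidate or every candidate falls below
    the LM threshold, in which case the augmented pair would be discarded.
    """
    best, best_score = None, 0.0
    for translation, (p_direct, p_inverse) in candidates.items():
        p_lm = lm_prob.get(translation, 0.0)
        if p_lm < lm_threshold:
            continue  # too unlikely in this target context
        score = p_direct * p_inverse * p_lm
        if score > best_score:
            best, best_score = translation, score
    return best


# Invented lexical and LM probabilities for the substitution "backpack".
lexicon = {"Rucksack": (0.7, 0.8), "Ranzen": (0.2, 0.5)}
lm_scores = {"Rucksack": 0.05, "Ranzen": 0.01}

print(select_translation(lexicon, lm_scores, lm_threshold=0.001))
print(select_translation(lexicon, lm_scores, lm_threshold=0.1))
```

Returning `None` makes the discard decision explicit for the caller, mirroring the case where the word is unaligned or the LM probability is too low.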
Sampling: We loop over the original parallel corpus multiple times, sampling substitution positions, i, in each sentence and making sure that each rare word gets augmented at most N times so that a large number of rare words can be affected. We stop when no new sentences are generated in one pass of the training data. Table 1 provides some examples resulting from our augmentation procedure. While using a large LM to substitute words with rare words mostly results in grammatical sentences, this does not mean that the meaning of the original sentence is preserved. Note that meaning preservation is not an objective of our approach.
Two translation data augmentation (TDA) setups are considered: only one word per sentence can be replaced (TDA r=1), or multiple words per sentence can be replaced, with the condition that any two replaced words are at least five positions apart (TDA r≥1). The latter incurs a higher risk of introducing noisy sentences but has the potential to positively affect more rare words within the same amount of augmented data. We evaluate both setups in the following section.
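The difference between the two setups comes down to how substitution positions are chosen within a sentence. The sketch below is one possible way to sample positions under the five-position spacing constraint; the greedy random order, the function name `pick_positions`, and its parameters are assumptions of this illustration rather than details given in the text.

```python
import random


def pick_positions(candidate_positions, min_gap=5, single=False, rng=None):
    """Sample substitution positions so that any two are >= `min_gap` apart.

    single=True corresponds to replacing one word per sentence; single=False
    allows several replacements under the spacing constraint.
    """
    rng = rng or random.Random()
    shuffled = list(candidate_positions)
    rng.shuffle(shuffled)  # visit candidate positions in random order
    chosen = []
    for pos in shuffled:
        if all(abs(pos - other) >= min_gap for other in chosen):
            chosen.append(pos)
            if single:
                break
    return sorted(chosen)


# Example: a 30-token sentence in which every position is a candidate.
print(pick_positions(range(30), rng=random.Random(0)))
print(pick_positions(range(30), single=True, rng=random.Random(0)))
```

In a full pipeline this would be combined with a per-word counter so that no rare word is augmented more than N times.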

Evaluation
In this section we evaluate the utility of our approach in a simulated low-resource NMT scenario.

Data and experimental setup
To simulate a low-resource setting we randomly sample 10% of the English↔German WMT15 training data and report results on newstest 2014, 2015, and 2016 (Bojar et al., 2016). For reference we also provide the result of our baseline system on the full data.
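Subsampling a parallel corpus has to keep source and target lines paired. A minimal sketch of how such a 10% sample could be drawn is shown below; the function name, the fixed seed, and the in-memory lists are assumptions made for illustration and are not taken from the experimental setup.

```python
import random


def sample_parallel(src_lines, tgt_lines, fraction, seed=0):
    """Randomly keep `fraction` of the sentence pairs, preserving alignment."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    n_keep = round(len(src_lines) * fraction)
    # Sample indices once and apply them to both sides.
    kept = sorted(rng.sample(range(len(src_lines)), n_keep))
    return [src_lines[i] for i in kept], [tgt_lines[i] for i in kept]


# Placeholder corpus of 100 numbered sentence pairs.
src = [f"english sentence {i}" for i in range(100)]
tgt = [f"german sentence {i}" for i in range(100)]
small_src, small_tgt = sample_parallel(src, tgt, fraction=0.1)
print(len(small_src), len(small_tgt))
```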
As our NMT system we use a 4-layer attention-based encoder-decoder model as described by Luong et al. (2015), trained with hidden dimension 1000 and batch size 80 for 20 epochs. In all experiments the NMT vocabulary is limited to the most common 30K words in both languages. Note that data augmentation does not introduce new words to the vocabulary. In all experiments we preprocess source and target language data with byte-pair encoding (BPE) (Sennrich et al., 2016b) using 30K merge operations. In the augmentation experiments BPE is performed after data augmentation.
For the LMs needed for data augmentation, we train 2-layer LSTM networks in forward and backward directions on the monolingual data provided for the same task (3.5B and 0.9B tokens in English and German respectively) with embedding size 64 and hidden size 128. We set the rare word threshold R to 100, top K words to 1000 and maximum number N of augmentations per rare word to 500. In all experiments we use the English LM for the rare word substitutions, and the German LM to choose the optimal word translation in context. Since our approach is not label preserving we only perform augmentation during training and do not alter source sentences during testing.
We also compare our approach to Sennrich et al. (2016a) by back-translating monolingual data and adding it to the parallel training data. Specifically, we back-translate sentences from the target side of WMT'15 that are not included in our low-resource baseline with two settings: keeping a one-to-one ratio of back-translated versus original data (1 : 1) following the authors' suggestion, or using three times more back-translated data (3 : 1).
We measure translation quality by single-reference case-insensitive BLEU (Papineni et al., 2002) computed with the multi-bleu.perl script from Moses.
[...] the importance of sizable training data for NMT. Next we observe that both back-translation and our proposed TDA method significantly improve translation quality. However, TDA obtains the best results overall and significantly outperforms back-translation in all test sets. This is an important finding considering that our method involves only minor modifications to the original training sentences and does not involve any costly translation process. Improvements are consistent across both translation directions, regardless of whether rare word substitutions are first applied to the source or to the target side. We also observe that altering multiple words in a sentence performs slightly better than altering only one. This indicates that addressing more rare words is preferable even though the augmented sentences are likely to be noisier.
To verify that the gains are actually due to the rare word substitutions and not just to the repetition of part of the training data, we perform a final experiment where each sentence pair selected for augmentation is added to the training data unchanged (Oversampling in Table 2). Surprisingly, we find that this simple form of sampled data replication outperforms both baseline and back-translation systems, while TDA r≥1 remains the best performing system overall. (Note that this effect cannot be achieved by simply continuing the baseline training for up to 50 epochs.)
We also observe that the system trained on augmented data tends to generate longer translations. Averaging on all test sets, the length of translations generated by the baseline is 0.88 of the average reference length, while for TDA r=1 and TDA r≥1 it is 0.95 and 0.94, respectively. We attribute this effect to the ability of the TDA-trained system to generate translations for rare words that were left untranslated by the baseline system.
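The length ratio reported above can be computed as the total number of output tokens divided by the total number of reference tokens. The sketch below assumes whitespace-tokenized text and one reference per hypothesis; the function name and the toy sentences are invented for illustration.

```python
def length_ratio(hypotheses, references):
    """Ratio of total hypothesis length to total reference length, in tokens."""
    if len(hypotheses) != len(references):
        raise ValueError("need exactly one reference per hypothesis")
    hyp_len = sum(len(h.split()) for h in hypotheses)
    ref_len = sum(len(r.split()) for r in references)
    return hyp_len / ref_len


# Toy example: 6 output tokens against 8 reference tokens.
hyps = ["a b c", "d e f"]
refs = ["a b c d", "d e f g"]
print(length_ratio(hyps, refs))
```

With equal numbers of hypotheses and references, this ratio of totals equals the ratio of average sentence lengths.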

Analysis of the Results
A desired effect of our method is to increase the number of correct rare words generated by the NMT system at test time.
To examine the impact of augmenting the training data by creating contexts for rare words on the target side, Table 3 provides an example for German→English translation. We see that the baseline model is not able to generate the rare word centimetres as a correct translation of the German word zentimeter. However, this word is not rare in the training data of the TDA r≥1 model after augmentation and is generated during translation. Table 3 also provides several instances of augmented training sentences targeting the word centimetres. Note that even though some augmented sentences are nonsensical (e.g. the speed limit is five centimetres per hour), the NMT system still benefits from the new context for the rare word and is able to generate it during testing. Figure 2 demonstrates that this is indeed the case for many words: the number of rare words occurring in the reference translation ($V_R \cap V_{ref}$) is three times larger in the TDA system output than in the baseline output. One can also see that this increase is a direct effect of TDA as most of the rare words are not 'rare' anymore in the augmented data, i.e., they were augmented sufficiently many times to occur more than 100 times (see hatched pattern in Figure 2). Note that during the experiments we did not use any information from the evaluation sets.
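The kind of count behind this comparison can be sketched as a set intersection: rare words that occur in the reference and are also produced by the system. The function name is hypothetical, whitespace tokenization is assumed, and the sentences are the Table 3 example with centimetres treated as the only targeted rare word.

```python
def rare_words_generated(rare_words, reference_tokens, output_tokens):
    """Rare words that occur in the reference and also in the system output."""
    in_reference = rare_words & set(reference_tokens)
    return in_reference & set(output_tokens)


rare = {"centimetres"}
reference = ("the tunnel has a cross -section measuring 1.20 metres high "
             "and 90 centimetres across .").split()
baseline = "the wine consists of about 1,20 m and 90 of the canal .".split()
tda = ("the tunnel has a UNK measuring meters 1.20 metres high "
       "and 90 centimetres wide .").split()

print(rare_words_generated(rare, reference, baseline))
print(rare_words_generated(rare, reference, tda))
```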
To gauge the impact of augmenting the contexts for rare words on the source side, we examine normalized attention scores of these words before and after augmentation. When translating [...]

Table 3:
Source: der tunnel hat einen querschnitt von 1,20 meter höhe und 90 zentimeter breite .
Baseline translation: the wine consists of about 1,20 m and 90 of the canal .
TDA r≥1 translation: the tunnel has a UNK measuring meters 1.20 metres high and 90 centimetres wide .
Reference: the tunnel has a cross -section measuring 1.20 metres high and 90 centimetres across .

Finally, Table 4 provides examples of cases where augmentation results in incorrect sentences. In the first example, the sentence is ungrammatical after substitution (of / yearly), which can be the result of choosing substitutions with low probabilities from the English LM topK suggestions.
Errors can also occur during translation selection, as in the second example where betraut is an acceptable translation of entrusted but would require a rephrasing of the German sentence to be grammatically correct. Problems of this kind can be attributed to the German LM, but also to the lack of a more suitable translation in the lexicon extracted from the bitext. Interestingly, this noise seems to affect NMT only to a limited extent.

Conclusion
We have proposed a simple but effective approach to augment the training data of Neural Machine Translation for low-resource language pairs. By leveraging language models trained on large amounts of monolingual data, we generate new sentence pairs containing rare words in new, synthetically created contexts. We show that this approach leads to generating more rare words during translation and, consequently, to higher translation quality. In particular we report substantial improvements in simulated low-resource English→German and German→English settings, outperforming another recently proposed data augmentation technique.