Revisiting Low-Resource Neural Machine Translation: A Case Study

It has been shown that the performance of neural machine translation (NMT) drops starkly in low-resource conditions, underperforming phrase-based statistical machine translation (PBSMT) and requiring large amounts of auxiliary data to achieve competitive results. In this paper, we re-assess the validity of these results, arguing that they result from a lack of system adaptation to low-resource settings. We discuss some pitfalls to be aware of when training low-resource NMT systems, and recent techniques that have been shown to be especially helpful in low-resource settings, resulting in a set of best practices for low-resource NMT. In our experiments on German–English with different amounts of IWSLT14 training data, we show that, without the use of any auxiliary monolingual or multilingual data, an optimized NMT system can outperform PBSMT with far less data than previously claimed. We also apply these techniques to a low-resource Korean–English dataset, surpassing previously reported results by 4 BLEU.


Introduction
While neural machine translation (NMT) has achieved impressive performance in high-resource data conditions, becoming dominant in the field (Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017), recent research has argued that these models are highly data-inefficient, and underperform phrase-based statistical machine translation (PBSMT) or unsupervised methods in low-data conditions (Koehn and Knowles, 2017; Lample et al., 2018b). In this paper, we re-assess the validity of these results, arguing that they are the result of lack of system adaptation to low-resource settings. Our main contributions are as follows:
• we explore best practices for low-resource NMT, evaluating their importance with ablation studies.
• we reproduce a comparison of NMT and PBSMT in different data conditions, showing that when following our best practices, NMT outperforms PBSMT with as little as 100 000 words of parallel training data.

Figure 1 (excerpt reproduced from Koehn and Knowles (2017), including their Figure 3):
Figure 3: BLEU scores for English-Spanish systems trained on 0.4 million to 385.7 million words of parallel data. Quality for NMT starts much lower, outperforms SMT at about 15 million words, and even beats a SMT system with a big 2 billion word in-domain language model under high-resource conditions.
How do the data needs of SMT and NMT compare? NMT promises both to generalize better (exploiting word similarity in embeddings) and condition on larger context (entire input and all prior output words).
We built English-Spanish systems on WMT data, about 385.7 million English words paired with Spanish. To obtain a learning curve, we used 1/1024, 1/512, ..., 1/2, and all of the data. For SMT, the language model was trained on the Spanish part of each subset, respectively. In addition to a NMT and SMT system trained on each subset, we also used all additionally provided monolingual data for a big language model in contrastive SMT systems.
Results are shown in Figure 3. NMT exhibits a much steeper learning curve, starting with abysmal results (BLEU score of 1.6 vs. 16.4 for 1/1024 of the data), outperforming SMT 25.7 vs. 24.7 with 1/16 of the data (24.1 million words), and even beating the SMT system with a big language model with the full data set (31.1 for NMT, 28.4 for SMT, 30.4 for SMT+BigLM).
2 Related Work

2.1 Low-Resource Translation Quality Compared Across Systems
Figure 1 reproduces a plot by Koehn and Knowles (2017) which shows that their NMT system only outperforms their PBSMT system when more than 100 million words (approx. 5 million sentences) of parallel training data are available. Results shown by Lample et al. (2018b) are similar, showing that unsupervised NMT outperforms supervised systems if few parallel resources are available. In both papers, NMT systems are trained with hyperparameters that are typical for high-resource settings, and the authors did not tune hyperparameters, or change network architectures, to optimize NMT for low-resource conditions.

Improving Low-Resource Neural Machine Translation
The bulk of research on low-resource NMT has focused on exploiting monolingual data, or parallel data involving other language pairs. Methods to improve NMT with monolingual data range from the integration of a separately trained language model (Gülçehre et al., 2015) to the training of parts of the NMT model with additional objectives, including a language modelling objective (Gülçehre et al., 2015; Sennrich et al., 2016b; Ramachandran et al., 2017), an autoencoding objective (Luong et al., 2016; Currey et al., 2017), or a round-trip objective, where the model is trained to predict monolingual (target-side) training data that has been back-translated into the source language (Sennrich et al., 2016b; He et al., 2016; Cheng et al., 2016). As an extreme case, models that rely exclusively on monolingual data have been shown to work (Artetxe et al., 2018b; Lample et al., 2018a; Artetxe et al., 2018a; Lample et al., 2018b). Similarly, parallel data from other language pairs can be used to pre-train the network or jointly learn representations (Zoph et al., 2016; Chen et al., 2017; Nguyen and Chiang, 2017; Neubig and Hu, 2018; Gu et al., 2018a,b; Kocmi and Bojar, 2018).
While semi-supervised and unsupervised approaches have been shown to be very effective for some language pairs, their effectiveness depends on the availability of large amounts of suitable auxiliary data, and other conditions being met. For example, the effectiveness of unsupervised methods is impaired when languages are morphologically different, or when training domains do not match (Søgaard et al., 2018). More broadly, this line of research still accepts the premise that NMT models are data-inefficient and require large amounts of auxiliary data to train. In this work, we want to re-visit this point, and will focus on techniques to make more efficient use of small amounts of parallel training data. Low-resource NMT without auxiliary data has received less attention; work in this direction includes (Östling and Tiedemann, 2017; Nguyen and Chiang, 2018).
3 Methods for Low-Resource Neural Machine Translation

Mainstream Improvements
We consider the hyperparameters used by Koehn and Knowles (2017) to be our baseline. This baseline does not make use of various advances in NMT architectures and training tricks. In contrast to the baseline, we use a BiDeep RNN architecture (Miceli Barone et al., 2017), label smoothing (Szegedy et al., 2016), dropout (Srivastava et al., 2014), word dropout (Sennrich et al., 2016a), layer normalization (Ba et al., 2016) and tied embeddings (Press and Wolf, 2017).
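Of the techniques listed above, label smoothing is the simplest to illustrate. The following is a minimal sketch and not the implementation used in the experiments: the function names and the value epsilon=0.1 are assumptions chosen for illustration. A share epsilon of the probability mass is moved from the reference word to a uniform distribution over the vocabulary before the cross-entropy is computed.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def smoothed_cross_entropy(log_probs, target, epsilon=0.1):
    """Cross-entropy against a label-smoothed target distribution.

    log_probs: log-probabilities over the vocabulary (one per word).
    target:    index of the reference word.
    epsilon:   mass moved from the reference word to a uniform
               distribution (0.1 is an assumed, illustrative value).
    """
    vocab_size = len(log_probs)
    uniform = epsilon / vocab_size
    loss = 0.0
    for index, log_p in enumerate(log_probs):
        # Every word receives the uniform share; the reference word
        # additionally keeps the remaining 1 - epsilon.
        weight = uniform + (1.0 - epsilon if index == target else 0.0)
        loss -= weight * log_p
    return loss
```

With epsilon set to 0 the function reduces to the standard negative log-likelihood of the reference word; with epsilon above 0 the model is penalized for putting all of its probability mass on a single word.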

Language Representation
Subword representations such as BPE (Sennrich et al., 2016c) have become a popular choice to achieve open-vocabulary translation. BPE has one hyperparameter, the number of merge operations, which determines the size of the final vocabulary.
For high-resource settings, the effect of vocabulary size on translation quality is relatively small; Haddow et al. (2018) report mixed results when comparing vocabularies of 30k and 90k subwords.
In low-resource settings, large vocabularies result in low-frequency (sub)words being represented as atomic units at training time, and the ability to learn good high-dimensional representations of these is doubtful. Sennrich et al. (2017a) propose a minimum frequency threshold for subword units, and splitting any less frequent subword into smaller units or characters. We expect that such a threshold reduces the need to carefully tune the vocabulary size to the dataset, leading to more aggressive segmentation on smaller datasets.
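The effect of such a frequency threshold can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of Sennrich et al. (2017a): the function name is hypothetical, word-boundary markers are ignored, and rare units are split directly into characters rather than first being backed off to smaller subword units.

```python
from collections import Counter


def apply_frequency_threshold(segmented_sentences, threshold):
    """Split subword units seen fewer than `threshold` times into characters.

    segmented_sentences: list of sentences, each a list of subword strings.
    Simplification (assumption): rare units go straight to single
    characters instead of being backed off to smaller subwords first.
    """
    counts = Counter(
        unit for sentence in segmented_sentences for unit in sentence
    )
    result = []
    for sentence in segmented_sentences:
        new_sentence = []
        for unit in sentence:
            if counts[unit] >= threshold or len(unit) == 1:
                new_sentence.append(unit)
            else:
                # Iterating over a string yields its characters.
                new_sentence.extend(unit)
        result.append(new_sentence)
    return result
```

Because the counts are taken from the corpus at hand, the same threshold yields a more aggressive segmentation on a smaller corpus, which is the behaviour described above.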

Hyperparameter Tuning
Due to long training times, hyperparameters are hard to optimize by grid search, and are often re-used across experiments. However, best practices differ between high-resource and low-resource settings. While the trend in high-resource settings is towards using larger and deeper models, Nguyen and Chiang (2018) use smaller and fewer layers for smaller datasets. Previous work has argued for larger batch sizes in NMT (Morishita et al., 2017; Neishi et al., 2017), but we find that using smaller batches is beneficial in low-resource settings. More aggressive dropout, including dropping whole words at random (Gal and Ghahramani, 2016), is also likely to be more important. We report results on a narrow hyperparameter search guided by previous work and our own intuition.

Lexical Model
Finally, we implement and test the lexical model by Nguyen and Chiang (2018), which has been shown to be beneficial in low-data conditions. The core idea is to train a simple feed-forward network, the lexical model, jointly with the original attentional NMT model. The input of the lexical model at time step t is the weighted average of source embeddings f (the attention weights a are shared with the main model). After a feedforward layer (with skip connection), the lexical model's output $h^l_t$ is combined with the original model's hidden state $h^o_t$ before softmax computation.
Our implementation adds dropout and layer normalization to the lexical model.
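The forward pass described above can be sketched in plain Python. This is an illustrative sketch under assumptions, not the implementation used here: all function and parameter names are hypothetical, the tanh nonlinearities and the additive combination of the two projected states are assumptions, and bias terms, dropout and layer normalization are omitted.

```python
import math


def weighted_average(attention, source_embeddings):
    """Attention-weighted average of the source word embeddings."""
    dim = len(source_embeddings[0])
    return [
        sum(a * emb[i] for a, emb in zip(attention, source_embeddings))
        for i in range(dim)
    ]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lexical_hidden(attention, source_embeddings, w_ff):
    """Lexical model state: feed-forward layer plus skip connection."""
    averaged = weighted_average(attention, source_embeddings)
    f_t = [math.tanh(x) for x in averaged]  # assumed nonlinearity
    ff = [math.tanh(x) for x in matvec(w_ff, f_t)]
    return [a + b for a, b in zip(ff, f_t)]  # skip connection


def output_distribution(h_o, h_l, w_o, w_l):
    """Softmax over the summed projections of both hidden states."""
    logits = [a + b for a, b in zip(matvec(w_o, h_o), matvec(w_l, h_l))]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

The attention weights are passed in rather than recomputed, reflecting that they are shared with the main model; the lexical path gives the output layer direct access to the attended source embeddings.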

Data and Preprocessing
We use the TED data from the IWSLT 2014 German→English shared translation task (Cettolo et al., 2014). We use the same data cleanup and train/dev split as Ranzato et al. (2016), resulting in 159 000 parallel sentences of training data, and 7584 for development.
As a second language pair, we evaluate our systems on a Korean-English dataset with around 90 000 parallel sentences of training data, 1000 for development, and 2000 for testing.
For both PBSMT and NMT, we apply the same tokenization and truecasing using Moses scripts. For NMT, we also learn BPE subword segmentation with 30 000 merge operations, shared between German and English, and independently for Korean→English. To simulate different amounts of training resources, we randomly subsample the IWSLT training corpus 5 times, discarding half of the data at each step. Truecaser and BPE segmentation are learned on the full training corpus; as one of our experiments, we set the frequency threshold for subword units to 10 in each subcorpus (see Section 3.2, Language Representation). Table 1 shows statistics for each subcorpus, including the subword vocabulary.
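The repeated halving described above can be sketched as follows. The function name and the seed are hypothetical, and it is an assumption that each subsample is drawn from the previous one, so that the subcorpora are nested.

```python
import random


def halving_subsamples(corpus, steps=5, seed=1):
    """Return the full corpus followed by `steps` nested random subsamples.

    Each subsample keeps a random half of the previous one, so every
    smaller corpus is a subset of all larger ones (an assumption about
    how the subcorpora were drawn).
    """
    rng = random.Random(seed)
    subsets = [list(corpus)]
    for _ in range(steps):
        previous = subsets[-1]
        subsets.append(rng.sample(previous, len(previous) // 2))
    return subsets
```

Starting from 159 000 sentence pairs, five halvings leave 1/32 of the data, which is consistent with the step from the full corpus of 3.2M words to the ultra-low condition of 100k words reported in the results.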

PBSMT Baseline
We use Moses (Koehn et al., 2007) to train a PBSMT system. We use MGIZA (Gao and Vogel, 2008) to train word alignments, and lmplz (Heafield et al., 2013) for a 5-gram LM. Feature weights are optimized on the dev set to maximize BLEU with batch MIRA (Cherry and Foster, 2012); we perform multiple runs where indicated. Unlike Koehn and Knowles (2017), we do not use extra data for the LM. Both PBSMT and NMT can benefit from monolingual data, so the availability of monolingual data is no longer an exclusive advantage of PBSMT (see Section 2.2).

NMT Systems
We train neural systems with Nematus (Sennrich et al., 2017b). Our baseline mostly follows the settings in (Koehn and Knowles, 2017); we use Adam (Kingma and Ba, 2015) and perform early stopping based on dev set BLEU. We express our batch size in number of tokens, and set it to 4000 in the baseline (comparable to a batch size of 80 sentences used in previous work).
We subsequently add the methods described in Section 3, namely the BiDeep RNN, label smoothing, dropout, tied embeddings, layer normalization, changes to the BPE vocabulary size, batch size, model depth, regularization parameters and learning rate. Detailed hyperparameters are reported in Appendix A.
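Expressing the batch size in tokens rather than sentences, as described above, can be illustrated with a short sketch. It does not reproduce how Nematus builds batches: the function name is hypothetical, and counting a sentence pair by the length of its longer side is an assumption.

```python
def batches_by_tokens(sentence_pairs, max_tokens=4000):
    """Group sentence pairs into batches of at most `max_tokens` tokens.

    A pair's cost is the length of its longer side (an assumption;
    toolkits differ in how they count tokens, e.g. with padding).
    """
    batch, batch_tokens = [], 0
    for source, target in sentence_pairs:
        cost = max(len(source), len(target))
        # Start a new batch when adding this pair would exceed the budget.
        if batch and batch_tokens + cost > max_tokens:
            yield batch
            batch, batch_tokens = [], 0
        batch.append((source, target))
        batch_tokens += cost
    if batch:
        yield batch
```

With a budget of 4000 tokens, sentence pairs of about 50 tokens fill a batch after 80 pairs, which matches the comparison to a batch size of 80 sentences; shorter sentences lead to batches with more pairs.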

Results
Table 2 shows the effect of adding different methods to the baseline NMT system, on the ultra-low data condition (100k words of training data) and the full IWSLT 14 training corpus (3.2M words). Our "mainstream improvements" add around 6-7 BLEU in both data conditions.
For a comparison with PBSMT, and across different data settings, consider Figure 2, which shows the result of PBSMT, our NMT baseline, and our optimized NMT system. Our NMT baseline still performs worse than the PBSMT system for 3.2M words of training data, which is consistent with the results by Koehn and Knowles (2017). However, our optimized NMT system shows strong improvements, and outperforms the PBSMT system across all data settings. Some sample translations are shown in Appendix B.
For comparison to previous work, we report lowercased and tokenized results on the full IWSLT 14 training set in Table 3. Our results far outperform the RNN-based results reported by Wiseman and Rush (2016), and are on par with the best reported results on this dataset.
Table 4 shows results for Korean→English, using the same configurations (1, 2 and 8) as for German-English. Our results confirm that the techniques we apply are successful across datasets, and result in stronger systems than previously reported on this dataset, achieving 10.37 BLEU as compared to 5.97 BLEU reported by Gu et al. (2018b).

Conclusions
Our results demonstrate that NMT is in fact a suitable choice in low-data settings, and can outperform PBSMT with far less parallel training data than previously claimed. Recently, the main trend in low-resource MT research has been the better exploitation of monolingual and multilingual resources. Our results show that low-resource NMT is very sensitive to hyperparameters such as BPE vocabulary size, word dropout, and others, and by following a set of best practices, we can train competitive NMT systems without relying on auxiliary resources. This has practical relevance for languages where large amounts of monolingual data, or multilingual data involving related languages, are not available. Even though we focused on only using parallel data, our results are also relevant for work on using auxiliary data to improve low-resource MT. Supervised systems serve as an important baseline to judge the effectiveness of semi-supervised or unsupervised approaches, and the quality of supervised systems trained on little data can directly impact semi-supervised workflows, for instance for the back-translation of monolingual data.

B Sample Translations
Table 6 shows some sample translations that represent typical errors of our PBSMT and NMT systems, trained with ultra-low (100k words) and low (3.2M words) amounts of data. For unknown words such as blutbefleckten ('bloodstained') or Spaniern ('Spaniards', 'Spanish'), PBSMT systems default to copying, while NMT systems produce translations on a subword-level, with varying success (blue-flect, bleed; spaniers, Spanians). NMT systems learn some syntactic disambiguation even with very little data, for example the translation of das and die as relative pronouns ('that', 'which', 'who'), while PBSMT produces less grammatical translations. On the flip side, the ultra low-resource NMT system ignores some unknown words in favour of a more-or-less fluent, but semantically inadequate translation: erobert ('conquered') is translated into doing, and richtig aufgezeichnet ('registered correctly', 'recorded correctly') into really the first thing.

source: In einem blutbefleckten Kontinent, waren diese Menschen die einzigen, die nie von den Spaniern erobert wurden.
reference: In a bloodstained continent, these people alone were never conquered by the Spanish.
PBSMT 100k: In a blutbefleckten continent, were these people the only, the never of the Spaniern erobert were.
PBSMT 3.2M: In a blutbefleckten continent, these people were the only ones that were never of the Spaniern conquered.
NMT 3.2M (baseline): In a blinging tree continent, these people were the only ones that never had been conquered by the Spanians.
NMT 100k (optimized): In a blue-flect continent, these people were the only one that has never been doing by the spaniers.
NMT 3.2M (optimized): In a bleed continent, these people were the only ones who had never been conquered by the Spanians.

source: Dies ist tatsächlich ein Poster von Notre Dame, das richtig aufgezeichnet wurde.
reference: This is actually a poster of Notre Dame that registered correctly.
PBSMT 100k: This is actually poster of Notre lady, the right aufgezeichnet was.
PBSMT 3.2M: This is actually a poster of Notre Dame, the right recorded.
NMT 3.2M (baseline): This is actually a poster of emergency lady who was just recorded properly.
NMT 100k (optimized): This is actually a poster of Notre Dame, that was really the first thing.
NMT 3.2M (optimized): This is actually a poster from Notre Dame, which has been recorded right.

Figure 2: German→English learning curve, showing BLEU as a function of the amount of parallel training data, for PBSMT and NMT.

Table 2: German→English IWSLT results for training corpus size of 100k words and 3.2M words (full corpus). Mean and standard deviation of three training runs reported.

Table 3: Results on full IWSLT14 German→English data on tokenized and lowercased test set with multi-bleu.perl.

Table 4: Korean→English results. Mean and standard deviation of three training runs reported.

Table 5 lists hyperparameters used for the different experiments in the ablation study (Table 2). Hyperparameters were kept constant across different data settings, except for the validation interval and subword vocabulary size (see Table 1).

Table 5: Configurations of NMT systems reported in Table 2. Empty fields indicate that the hyperparameter was unchanged compared to previous systems.

source: In einem blutbefleckten Kontinent, waren diese Menschen die einzigen, die nie von den Spaniern erobert wurden.

Table 6: German→English translation examples with phrase-based SMT and NMT systems trained on 100k/3.2M words of parallel data.