Dynamic Curriculum Learning for Low-Resource Neural Machine Translation

Large amounts of data have made neural machine translation (NMT) a great success in recent years, but training these models on small-scale corpora remains a challenge. In this case, how the data is used becomes all the more important. Here, we investigate the effective use of training data for low-resource NMT. In particular, we propose a dynamic curriculum learning (DCL) method to reorder training samples during training. Unlike previous work, we do not use a static scoring function for reordering. Instead, the order of training samples is determined dynamically by two factors: loss decline and model competence. This eases training by highlighting easy samples that the current model has enough competence to learn. We test our DCL method in a Transformer-based system. Experimental results show that DCL outperforms several strong baselines on three low-resource machine translation benchmarks and on different-sized subsets of WMT'16 En-De.


Introduction
Recently, neural machine translation (NMT) has demonstrated impressive performance improvements and become the de facto standard (Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017). However, like other neural methods, NMT is data-hungry, which makes training such models challenging in low-resource scenarios (Koehn and Knowles, 2017). Researchers have developed promising approaches to low-resource NMT, among them data augmentation (Sennrich et al., 2016a; Fadaee et al., 2017), transfer learning (Zoph et al., 2016; Kocmi and Bojar, 2018), and pre-trained models (Peters et al., 2018; Devlin et al., 2019). But these approaches rely on external data beyond the bi-text. To date, little work has addressed the effective use of bilingual data for low-resource NMT.
In general, the way samples are fed to a neural model plays an important role in training. A familiar instance is the common practice of shuffling the input data for robust training in state-of-the-art systems. More systematic studies of this issue can be found in recent papers (Bengio et al., 2009; Kumar et al., 2010; Shrivastava et al., 2016). For example, Arpit et al. (2017) point out that deep neural networks tend to prioritize learning "easy" samples first. This agrees with the idea of curriculum learning (Bengio et al., 2009), in which an easy-to-hard learning strategy can yield better training convergence.
In NMT, curriculum learning is not new. Several research groups have applied it to large-scale translation tasks, although few of them discuss the issue in a low-resource setup (Zhang et al., 2018; Platanios et al., 2019). The first question here is how to define the "difficulty" of a training sample. Previous work resorts to functions that produce a difficulty score for each training sample; this score is then used to reorder samples before training. But methods of this type enforce a static scoring strategy and thus disagree with the fact that sample difficulty may change as the model is updated during training. Another assumption behind curriculum learning is that the difficulty of a sample should fit the competence of the model being trained. Researchers have modeled this issue implicitly through hand-crafted curriculum schedules (Zhang et al., 2018) or simple functions (Platanios et al., 2019), but there has been no in-depth discussion of it yet.
In this paper, we continue this line of research on curriculum learning for low-resource NMT. We propose a dynamic curriculum learning (DCL) method to address the problems discussed above. The novelty of DCL is two-fold. First, we define the difficulty of a sample as the decline of its loss (i.e., negative log-likelihood). In this way, we measure how hard a sentence is to translate using the actual objective optimized in training. In addition, the DCL method explicitly re-estimates the model competence whenever the model is updated, so that one can select samples that the newly-updated model has enough competence to learn.
DCL is general and applicable to any NMT system. In this work, we test it in a Transformer-based system on three low-resource MT benchmarks and different sized data selected from the WMT'16 En-De task. Experimental results show that our system outperforms the strong baselines and several curriculum learning-based counterparts.
Related Work

Low-Resource NMT

Koehn and Knowles (2017) show that NMT systems yield worse translation performance in low-resource scenarios. Researchers have developed promising approaches to this problem, mainly focused on introducing external knowledge to improve low-resource NMT. Data augmentation (Sennrich et al., 2016a; Fadaee et al., 2017) alleviates the problem by generating pseudo parallel data. Large auxiliary parallel corpora from related language pairs can be used to pre-train model parameters that are then transferred to the target language pair (Zoph et al., 2016; Chen et al., 2017; Gu et al., 2018a; Gu et al., 2018b; Kocmi and Bojar, 2018). Pre-trained language models trained on large amounts of monolingual data (Peters et al., 2018; Devlin et al., 2019) significantly improve the quality of NMT models (Clinchant et al., 2019; Zhu et al., 2020).
However, these approaches rely on substantial external resources or conditions, e.g., auxiliary parallel corpora related to the source or target language, or large amounts of monolingual data. Sennrich and Zhang (2019) demonstrate that a competitive NMT model can be trained with appropriate hyperparameters in low-resource scenarios without any external resources. This is consistent with our motivation. The difference is that they focus on model settings, while we explore a training strategy that uses bilingual data effectively for low-resource NMT.

Curriculum Learning
Curriculum learning (Bengio et al., 2009) is motivated by the learning strategy of biological organisms, which orders training samples in an easy-to-hard manner (Elman, 1993). Benefiting from such organized training, a neural network can explore harder samples effectively by building on the knowledge learned from easier samples. Weinshall et al. (2018) demonstrate that curriculum learning speeds up the learning process, especially at the beginning of training. Curriculum learning has been applied to several tasks, including language modeling (Bengio et al., 2009), image classification (Weinshall et al., 2018), and human attribute analysis (Wang et al., 2019).
Curriculum learning has recently been shown to train large-scale translation tasks efficiently and effectively by controlling the way samples are fed. Kocmi and Bojar (2017) construct mini-batches containing sentences similar in length and linguistic phenomena, then order them by increasing complexity within an epoch. Zhang et al. (2018) group the training samples into shards based on model-based and linguistic difficulty criteria, then train with hand-crafted curriculum schedules. Platanios et al. (2019) propose competence-based curriculum learning, which selects training samples based on sample difficulty and model competence. Kumar et al. (2019) use reinforcement learning to learn the curriculum automatically. Later work proposes a norm-based curriculum learning method built on the norm of word embeddings to improve the efficiency of NMT training, as well as uncertainty-aware curriculum learning. To the best of our knowledge, ours is the first comprehensive discussion of curriculum learning in a low-resource setup.
On the other hand, curriculum learning is related to data selection and data sampling methods. Most similar to our work is a dynamic sampling method that calculates the decline of loss during training to improve NMT training efficiency; it starts training from the full training set and then gradually shrinks it, which is contrary to the idea of curriculum learning.

Problem Definition
Let D_train be the training corpus and |D_train| its size. A training sample s = (x, y) ∈ D_train consists of a source sentence x and a target sentence y. NMT systems learn a conditional probability

P(y|x; θ) = ∏_{i=1}^{|y|} P(y_i | y_{<i}, x; θ),

where |y| is the length of y and θ is the set of model parameters. The training objective is to seek the optimal parameters θ̂ by minimizing the negative log-likelihood (NLL) over the training set:

θ̂ = argmin_θ ∑_{(x,y)∈D_train} −log P(y|x; θ).

Our objective is to learn better model parameters through curriculum learning in low-resource NMT. We decompose the whole training process into multiple phases T = (t_0, t_1, t_2, . . . , t_k). For every phase t, the sub-optimization can be viewed as

θ_t = argmin_θ ∑_{(x,y)∈D^t_train} −log P(y|x; θ),

where θ_t is the model parameter at phase t and optimization is initialized from θ_{t−1}. Following Platanios et al. (2019), two sub-questions are separated to determine the training data D^t_train:

• Sample Difficulty. How can the difficulty of a training sample be measured with a quantified value?
• Model Competence. How can the competence of the model be estimated, so that it is given training data it can learn from effectively?
Previous work enforces a static scoring strategy to measure sample difficulty and uses simple functions to estimate model competence. In this way, the training data of each phase is pre-determined before training. In fact, however, sample difficulty and model competence are not independent of the current model parameters.
A natural idea is to re-arrange the curriculum once the model is updated, so that we can select training data of appropriate difficulty for the current training phase. We discuss these two questions in depth in the following section.
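To make the objective above concrete, the sentence-level NLL can be computed from per-token model probabilities. A minimal sketch with toy probabilities; `token_probs` is a hypothetical stand-in for real model outputs:

```python
import math

def sentence_nll(token_probs):
    """Negative log-likelihood of one sentence, summed over target tokens.

    token_probs: the model probabilities P(y_i | y_<i, x; theta) assigned
    to each gold target token y_i.
    """
    return -sum(math.log(p) for p in token_probs)

# A confidently predicted sentence yields a lower loss than an uncertain one.
easy = sentence_nll([0.9, 0.8, 0.9])  # model is confident at every position
hard = sentence_nll([0.3, 0.2, 0.4])  # model is unsure at every position
```

Summing per-token NLL over all samples in D^t_train gives the phase objective minimized above.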

Dynamic Curriculum Learning
We propose a dynamic curriculum learning method that reorders training samples during training: the order is determined dynamically rather than by a static scoring function. In addition, we propose a batching method to reduce gradient noise.

Sample Difficulty
The training objective of NMT is to minimize the loss (i.e., negative log-likelihood) of the training set. For a training sample s = (x, y) at phase t, the loss is

ℓ(s; θ_t) = −log P(y|x; θ_t).

The NMT model generally translates sentences with lower loss better. Based on this idea, Zhang et al. (2018) define difficulty as the probability of the top-1 translation candidate generated by a pre-trained NMT model. While a translation probability tied to the training objective represents sample difficulty more accurately than heuristic metrics, it still suffers from static scoring. A natural remedy is to compute the loss of training samples dynamically to measure their difficulty. To this end, we evaluate the loss of all training samples with the model parameters fixed before each training phase:

d(s; θ_t) = ℓ(s; θ_t),

where d(s; θ_t) is the difficulty of sample s at phase t. While this loss reflects how well the current model handles the sample, it has two drawbacks. First, the loss of a sample may be large in the initial phase yet decrease rapidly after a few phases. Second, a sentence with a small loss may have no room to decrease further, and training on such sentences repeatedly may lead to overfitting. Therefore, we define the difficulty of a sample as the decline of its loss, which takes into account the model change between the previous phase and the current phase:

d(s; θ_t) = ℓ(s; θ_t) − ℓ(s; θ_{t−a}),

where a ≥ 1 means we compare the loss at phase t − a with the loss at phase t. Under this metric, a sentence with low difficulty (a large loss decline) is one whose predicted translation the NMT model has improved significantly, so it is likely to be learned well in the next phase.
On the contrary, a sentence with high difficulty indicates that the current NMT model does not yet have enough competence to handle it, and it can wait to be learned at a later phase.
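The decline-based difficulty can be computed directly from per-sample losses recorded at phases t − a and t. A minimal sketch (the function names are ours, not from the paper's implementation):

```python
def decline_difficulty(loss_prev, loss_curr):
    """Decline-based difficulty d(s; theta_t) = loss_t(s) - loss_{t-a}(s).

    A large loss decline (loss_curr much smaller than loss_prev) gives a
    low, i.e. very negative, difficulty, so the sample ranks as easy.
    """
    return loss_curr - loss_prev

def rank_by_difficulty(losses_prev, losses_curr):
    """Return sample indices ordered from easiest to hardest."""
    diffs = [decline_difficulty(p, c) for p, c in zip(losses_prev, losses_curr)]
    return sorted(range(len(diffs)), key=lambda i: diffs[i])
```

A sample whose loss barely moved, or increased, ends up at the hard end of the ranking and waits for a later phase.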

Model Competence
Platanios et al. (2019) propose a competence-based curriculum learning framework which defines the model competence c(t) ∈ (0, 1] at training step t by simple functional forms:

c(t) = min(1, (t · (1 − c_0^p) / T + c_0^p)^{1/p}),

where c_0 ≥ 0 is the initial competence, T is the curriculum length, and p controls the curriculum schedule. With p = 1 the competence is a linear function and harder samples are added at a constant rate; with p = 2 it is a square-root function and harder samples are added faster in the early phases and more slowly later. However, these intuitive functions might not be universal models of competence: there is a gap between high-resource and low-resource tasks. Limited by the small amount of data, the performance of an NMT model improves slowly in the early phases (see Section 7.1).
In this paper, we propose a dynamic estimation method which measures model competence at every phase based on performance on the development set. While the loss on the development set is one option, the sentence-level evaluation metric BLEU (Papineni et al., 2002) has demonstrated advantages (Shen et al., 2016). In this way, the model competence avoids prior hypotheses about the training process and is tied to real performance on unseen data.
Specifically, we pre-train a vanilla NMT model and record its best BLEU on the development set as the curriculum target BLEU_T. The model competence is then estimated as

c(t) = max(c_0, min(1, BLEU_t / (β · BLEU_T))),

where BLEU_t is the BLEU at phase t, c_0 is the initial competence, and β ∈ (0, 1] is a coefficient controlling the curriculum speed. With a smaller β, the curriculum progresses faster and the model is trained on the entire training set earlier. We suppose that the model initially has only enough competence to learn well from the easiest training samples and gradually becomes able to handle the entire training set D_train. We measure sample difficulty and model competence before every phase, then select the |D_train| · c(t) easiest training samples to train the NMT model. Thanks to this dynamic measurement, the newly-updated model has enough competence to learn samples of appropriate difficulty. Goyal et al. (2017) address the optimization difficulty of training neural networks with large batches while retaining good generalization. Large batches yield better performance in high-resource NMT tasks due to a lower gradient noise scale (Ott et al., 2018; Popel and Bojar, 2018). However, large batches degrade performance in low-resource tasks due to poorer generalization (Keskar et al., 2017).
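The dynamic competence estimate described above can be sketched in a few lines. The default values below follow the experimental setup (c_0 = 0.2, β = 0.9); the function names are illustrative, not the paper's implementation:

```python
def dynamic_competence(bleu_t, bleu_T, c0=0.2, beta=0.9):
    """Competence c(t) grows with dev-set BLEU at phase t.

    c0:   initial competence (fraction of easiest samples used at the start).
    beta: curriculum-speed coefficient; once BLEU_t reaches beta * BLEU_T,
          the model trains on the entire training set (c(t) = 1).
    """
    return max(c0, min(1.0, bleu_t / (beta * bleu_T)))

def num_selected(corpus_size, bleu_t, bleu_T, c0=0.2, beta=0.9):
    """Number of easiest samples |D_train| * c(t) selected at phase t."""
    return int(corpus_size * dynamic_competence(bleu_t, bleu_T, c0, beta))
```

With these defaults, training starts from the 20% easiest sentences and reaches the full training set once dev-set BLEU hits 90% of BLEU_T.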

Batching
Dominant NMT systems batch samples of similar lengths to speed up training (Khomenko et al., 2017). To reduce gradient noise, our curriculum learning method instead batches samples of similar difficulty. Samples with similar difficulty are those whose losses fall at a similar rate; batching them together may yield a more stable gradient direction and better performance.
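A minimal sketch of this batching scheme, grouping by sentence count rather than the token budget a real NMT system would use (names and the simplified batch-size unit are our assumptions):

```python
def difficulty_batches(samples, difficulties, batch_size):
    """Group samples of similar difficulty into batches.

    Sorting by difficulty before slicing puts samples whose losses fall at
    a similar rate into the same batch, which we expect to stabilize the
    gradient direction within each update.
    """
    order = sorted(range(len(samples)), key=lambda i: difficulties[i])
    ordered = [samples[i] for i in order]
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]
```

Since similar difficulty no longer implies similar length, this trades some padding efficiency for the (hypothesized) gradient stability.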

Training Strategy
Zhang et al. (2018) define two general types of curriculum learning strategy. The deterministic curriculum (Kocmi and Bojar, 2017) arranges training samples in a fixed order and performs worse due to the lack of randomization. The probabilistic curriculum (Platanios et al., 2019) uniformly samples each batch from the portion of the training set determined by sample difficulty and model competence, and generally works well in previous curriculum learning methods. However, in our preliminary experiments we find that a vanilla model trained by sampling performs slightly worse, or converges more slowly, than one trained on the whole training set in each epoch. A possible reason is that sampling leads to unbalanced training: some samples are never fully trained because they are omitted by sampling.
Our method dynamically measures sample difficulty and model competence at each phase, then selects a proportion of the easier samples to train on, as determined by the model competence. This ensures that training samples are not missed by sampling while retaining enough randomization to avoid overfitting. Algorithm 1 shows the overall training procedure of our method.
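The overall procedure of Algorithm 1 can be sketched as follows. `train_phase`, `eval_loss`, and `eval_bleu` are hypothetical stand-ins for the real training and evaluation routines, and the stopping criterion is a placeholder for the early stopping actually used:

```python
def dcl_train(train_set, train_phase, eval_loss, eval_bleu,
              bleu_T, c0=0.2, beta=0.9):
    """Dynamic curriculum learning loop (sketch of Algorithm 1).

    Before each phase: score every sample by its loss decline, estimate
    competence from dev-set BLEU, and train on the easiest fraction.
    """
    prev_losses = eval_loss(train_set)   # per-sample losses, initial model
    bleu_t = 0.0
    while bleu_t < bleu_T:               # placeholder stopping criterion
        c = max(c0, min(1.0, bleu_t / (beta * bleu_T)))
        curr_losses = eval_loss(train_set)
        decline = [lc - lp for lp, lc in zip(prev_losses, curr_losses)]
        order = sorted(range(len(train_set)), key=lambda i: decline[i])
        selected = [train_set[i] for i in order[:int(len(train_set) * c)]]
        train_phase(selected)            # one phase of NMT training
        prev_losses = curr_losses
        bleu_t = eval_bleu()             # dev-set BLEU after this phase
    return bleu_t
```

Everything that depends on the current parameters (losses, BLEU, the selected subset) is recomputed at each phase, which is what makes the curriculum dynamic.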

Datasets
We consider three low-resource machine translation benchmarks from the IWSLT TED talk tasks, running experiments on IWSLT'15 Chinese-English (Zh-En), IWSLT'15 Thai-English (Th-En), and IWSLT'17 English-Japanese (En-Ja). We concatenate dev2010, tst2010, tst2011, tst2012, and tst2013 as the development set. We use tst2015 as the test set for Zh-En and Th-En, and tst2017 for En-Ja. To simulate different amounts of training resources, we randomly subsample 50K/100K/300K sentence pairs from the WMT'16 English-German dataset and denote these subsets as 50K/100K/300K. Furthermore, we also verify the effect of our method in a high-resource scenario with the full WMT'16 English-German training set (4.5M). We concatenate newstest2012 and newstest2013 as the development set and use newstest2016 as the test set. Data statistics are shown in Table 1.
We tokenize the Chinese sentences with the NiuTrans (Xiao et al., 2012) word segmentation toolkit and the Japanese sentences with MeCab. For the other languages, we apply the same tokenization using the Moses scripts (Koehn et al., 2007). We learn Byte-Pair Encoding (Sennrich et al., 2016b) subword segmentation on the training data.

Model Settings
In all experiments, we use the fairseq (Ott et al., 2019) implementation of the Transformer. Inspired by Sennrich and Zhang (2019), we select different hyperparameters for each dataset to build strong baselines for low-resource NMT. The model hyperparameters are shown in Table 2. We use the Adam optimizer (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.98, and ε = 10^−9. We increase the learning rate from 10^−7 to 0.0005 for the IWSLT datasets and to 0.0007 for the WMT datasets, with linear warmup over the first 4000 steps, and then decay it with the inverse square root schedule. For all experiments, we use dropout of 0.1 and label smoothing of 0.1 for regularization. We share the embedding and output layers of the decoder in the IWSLT experiments, and share all embeddings in the WMT experiments.

Training and Inference
We train with batches of 4096 tokens, except for the 4.5M setting, which uses 4096 * 8 tokens to match a standard setting. We evaluate on the development set at every epoch during training, decoding with greedy search and computing the BLEU (Papineni et al., 2002) score. We stop training early when BLEU does not improve for 10 consecutive checkpoints. During inference, we use beam search with beam size 5 and length penalty 1 for the IWSLT datasets, and beam size 4 with length penalty 0.6 for the WMT datasets.

Curriculum Learning Setup
We implement our method on top of fairseq with minor modifications. Our DCL method measures sample difficulty and model competence dynamically before every phase. While this adds extra training time (about 30%), the cost is acceptable for low-resource tasks.
In all experiments, we set a = 1 uniformly in the decline-based difficulty metric, meaning that sample difficulty is computed over two adjacent phases. For the model competence, we record the best BLEU of the baseline model on the development set as BLEU_T. Although careful hyperparameter selection could bring further improvement, we set c_0 = 0.2 and β = 0.9 uniformly for all experiments: training starts with the 20% easiest sentences, and the model is trained on the whole training set once performance reaches 90% of BLEU_T.
We use the following notations for the different curriculum learning strategies. For sample difficulty, we compare our method with previous heuristic metrics and two other dynamic metrics:
• Heuristic metrics: source sentence length (Length) and source word rarity (Rarity) (Platanios et al., 2019).
• Dynamic metrics: a random difficulty value (Random) and the loss at the current phase (Loss).
• Our method: the loss decline between the previous phase and the current phase (Decline).
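As an illustration of the Rarity baseline, one common formulation scores a sentence by the negative sum of log unigram probabilities of its source words, so sentences containing rarer words score as harder (a sketch under that assumption; the exact formulation in the compared work may differ):

```python
import math

def rarity(sentence, unigram_counts, total):
    """Word-rarity difficulty: -sum of log unigram probabilities of the
    source words. Sentences with rarer words receive higher scores."""
    return -sum(math.log(unigram_counts[w] / total)
                for w in sentence.split())
```

Unlike Loss or Decline, this score depends only on corpus statistics, so it never changes as the model trains, which is exactly the static-scoring limitation discussed above.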
For the model competence, we experiment with the following methods:
• Functional forms: linear (Linear) and square-root (Sqrt) model competence (Platanios et al., 2019).
• Our method: dynamic model competence (DMC) based on development-set performance.


Results

Table 3 summarizes the experimental results of the different curriculum learning methods. The existing curriculum learning methods (rows 2 and 3) cannot improve performance consistently and sometimes even degrade it, which suggests that heuristic metrics are not helpful in low-resource scenarios.
With the dynamic model competence, the BLEU scores of Random (row 4) fluctuate around the baseline. Although the difficulty is measured dynamically during training, a meaningless difficulty value does not benefit curriculum learning. Measuring difficulty with the current loss (row 5) leads to a slight improvement on the larger datasets, but degrades performance on the smaller ones (Th-En, 50K, and 100K). This agrees with the analysis in Section 4.1: repeated training on samples with low loss easily leads to overfitting, especially on extremely scarce datasets. We then test our proposed difficulty metric, Decline. Even with the simple competence functions (rows 6 and 7), it significantly outperforms the strong baselines and previous methods, which suggests that our metric is more relevant to sample difficulty for NMT than the other four. However, the Sqrt competence function does not outperform the Linear function on all datasets, which is inconsistent with the previous conclusion (Platanios et al., 2019); a possible reason is that the competence of models on some low-resource datasets improves slowly in the early phases. DMC (row 8) avoids any prior hypothesis about how performance changes and achieves better or similar performance compared with the above methods.
Finally, we batch samples with similar difficulties (row 9). While the model takes more training steps due to padding, it achieves further improvement over our curriculum learning method on all tasks. This supports our hypothesis that the gradients of samples with similar difficulty are more stable. This is an interesting result and we will explore it in future work.
Overall, the experimental results show that our proposed method achieves better performance than strong baselines and several curriculum learning-based counterparts on low-resource NMT tasks.

Analysis
We analyze our method on the En-De 50K dataset, which shows the largest performance improvement. Although it is sampled from the WMT dataset, we believe it demonstrates the advantages of our method clearly.

Learning Curve
We visualize the learning curves in Figure 1 to compare the convergence of our method on WMT'16 En-De datasets of different sizes. One obvious difference is that the performance curves of the larger datasets (300K/4.5M) resemble a square-root function, while those of the low-resource datasets (50K/100K) are closer to linear. This phenomenon is consistent with the experimental results above and shows the necessity of estimating model competence self-adaptively.
We also observe that our proposed method converges faster, and to a better result, than the baseline model on all datasets, especially in the early phases. On the extremely scarce dataset (50K), DCL improves performance significantly by learning the bilingual data effectively.
On the other hand, our method helps only slightly on the high-resource task with 4.5M parallel sentence pairs. Although DCL improves convergence speed in early training, the baseline model catches up given enough training steps. A possible reason is that a large-scale parallel corpus contains sufficient knowledge and reduces the benefit of the method. A weak model trained on a small-scale corpus easily underfits or overfits, and the DCL method improves its performance by highlighting easy samples; a strong model trained in a high-resource scenario should instead pay attention to difficult samples. We will explore this in future work.

Average Loss
As described in Section 4, the loss of a training sample indicates how well the NMT model predicts it. Figure 2 shows the average loss over all training samples as a function of how many times they have been trained on the En-De 50K dataset. Our method selects the samples with the fastest loss decline at each training phase, achieving a lower loss than the baseline for the same number of training passes. This suggests that training on more data is not always beneficial; the better strategy is to select data dynamically according to the current model state.

BLEU with Different Sentence Lengths
Curriculum learning over-samples easy samples, and one potential drawback is reduced performance on hard samples due to less training. Although sentence length does not represent sample difficulty exactly, it is widely accepted that longer sentences are harder to translate. We visualize the training counts of samples of different sentence lengths in Figure 3. The training counts in our method are significantly lower than in the baseline, especially for long sentences. This also shows that long sentences are rated as more difficult under our metric.
We divide the test set into groups by sentence length and show the BLEU scores of the baseline and our method in Figure 4. We observe that our method significantly outperforms the baseline across sentence lengths, especially on the shortest and longest sentences. This demonstrates that the organized training process can achieve better translation performance on hard samples even with fewer training passes.

Conclusion
In this paper, we propose a dynamic curriculum learning (DCL) method to explore the effective use of bilingual data for low-resource NMT. We define the difficulty of a training sample by the decline of loss and estimate the model competence self-adaptively based on the performance of the development set. Different from previous work, we re-arrange the curriculum once the model is updated, so that the training data with appropriate difficulty is learned by the current model effectively. Experimental results show that our method outperforms the strong baselines and several curriculum learning-based counterparts on several low-resource translation tasks.
DCL only modifies the training strategy and requires no external data, which makes it practical for real low-resource scenarios. With a stronger baseline model, we can also improve the effectiveness of other semi-supervised methods, such as generating higher-quality back-translation data. In the future, we will rethink existing training strategies and explore the application of DCL to more difficult tasks, such as unsupervised learning and training with pseudo data.