Self-Paced Learning for Neural Machine Translation

Recent studies have proven that the training of neural machine translation (NMT) can be facilitated by mimicking the learning process of humans. Nevertheless, the achievements of such curriculum learning rely on the quality of an artificial schedule drawn up with handcrafted features, e.g., sentence length or word rarity. We ameliorate this procedure in a more flexible manner by proposing self-paced learning, where the NMT model is allowed to 1) automatically quantify its learning confidence over training examples; and 2) flexibly govern its learning by regulating the loss at each iteration step. Experimental results over multiple translation tasks demonstrate that the proposed model yields better performance than strong baselines and models trained with human-designed curricula, in terms of both translation quality and convergence speed.


Introduction
Neural machine translation (NMT) has achieved promising results with the use of various optimization tricks (Hassan et al., 2018; Xu et al., 2019; Li et al., 2020). In spite of that, these techniques lead to increased training time and massive hyper-parameters, making the development of a well-performing system expensive (Popel and Bojar, 2018). As an alternative mitigation, curriculum learning (CL; Elman, 1993; Bengio et al., 2009) has shown its effectiveness in speeding up convergence and stabilizing NMT model training (Platanios et al., 2019). CL teaches the NMT model from easy examples to complex ones rather than considering all samples equally, where the keys lie in the definition of "difficulty" and the strategy of curriculum design (Krueger and Dayan, 2009; Kocmi and Bojar, 2017). Existing studies artificially determine data difficulty according to prior linguistic knowledge such as sentence length (SL) and word rarity (WR) (Platanios et al., 2019; Zhou et al., 2020), and manually tune the learning schedule (Liu et al., 2020; Fomicheva et al., 2020). However, there neither exists a clear distinction between easy and hard examples (Kumar et al., 2010), nor do these human intuitions exactly conform to effective model training.
Instead, we resolve this problem by introducing self-paced learning (Kumar et al., 2010), where the emphasis of learning can be dynamically determined by the model itself rather than by human intuitions. Specifically, our model measures its level of confidence on each training example (Gal and Ghahramani, 2016; Xiao and Wang, 2019), where an easy sample is in effect one on which the currently trained model is highly confident. The confidence score then serves as a factor to weight the loss of its corresponding example. In this way, the training process can be dynamically guided by the model itself, free from human-predefined patterns.
We evaluate our proposed method on the IWSLT15 En⇒Vi, WMT14 En⇒De, and WMT17 Zh⇒En translation tasks. Experimental results reveal that our approach consistently yields better translation quality and faster convergence than the TRANSFORMER (Vaswani et al., 2017) baseline and recent models that exploit CL (Platanios et al., 2019). Quantitative analyses further confirm that the curriculum schedule that is intuitive for a human does not fully coincide with that for model learning.
2 Self-Paced Learning for NMT

Even if these artificial supervisions are feasible, long sequences or rare tokens are not always "difficult" as the model competence increases. From this view, we design a self-paced learning algorithm that offers NMT the abilities to 1) estimate the confidence over samples appropriate to the current training state; and 2) automatically control the focus of learning by regulating the training loss, as illustrated in Fig. 1.

Confidence Estimation
We propose to determine the learning emphasis according to the model confidence (Ueffing and Ney, 2005; Soricut and Echihabi, 2010), which quantifies whether the current model is confident or hesitant about translating the training samples. Model confidence can be quantified by Bayesian neural networks (Buntine and Weigend, 1991; Neal, 1996), which place distributions over the weights of the network. For efficiency, we adopt the widely used Monte Carlo dropout sampling (Gal and Ghahramani, 2016) to approximate Bayesian inference. Given the current NMT model parameterized by θ and a mini-batch consisting of N sentence pairs {(x^1, y^1), ..., (x^N, y^N)}, we first perform M passes through the network, where the m-th pass θ̂_m randomly deactivates part of the neurons. Thus, each example yields M sets of conditional probabilities. A lower variance of the translation probabilities reflects a higher confidence of the model with respect to the instance (Dong et al., 2018). We propose multi-granularity strategies for confidence estimation:

Sentence-Level Confidence (SLC) A natural choice for measuring the confidence of a sentence pair (x^n, y^n) is to assess the variance of the translation probability Var{P(y^n | x^n, θ̂_m)}_{m=1}^M. Accordingly, the confidence score α̂^n can be formally expressed as:

α̂^n = -k · Var{P(y^n | x^n, θ̂_m)}_{m=1}^M .  (1)

Here, we assign a hyper-parameter k to scale the gap between the scores of confident and unconfident examples. A larger absolute value of k represents a more discriminative manner and vice versa. In some extreme cases, all the confidence scores in a mini-batch may tend toward a small or large value, e.g., during estimation at the early stage of training. In order to stabilize the training process and maintain the same loss scale as the conventional model, we normalize the confidence scores by softmax:

α^n = N · exp(α̂^n) / Σ_{i=1}^N exp(α̂^i) .  (2)

Token-Level Confidence (TLC) Intuitively, confidence scores can be evaluated at a more fine-grained level. We extend our model to the token level so as to estimate the confidence on translating each element in the target sentence y^n. The confidence β̂^n_j of the j-th token y^n_j is:

β̂^n_j = -k · Var{P(y^n_j | x^n, y^n_{<j}, θ̂_m)}_{m=1}^M ,  (3)

where Var{P(y^n_j | x^n, y^n_{<j}, θ̂_m)}_{m=1}^M denotes the variance of the translation probability with respect to y^n_j. Similar to the sentence-level strategy, the confidence scores of tokens are normalized as:

β^n_j = J · exp(β̂^n_j) / Σ_{i=1}^J exp(β̂^n_i) ,  (4)

where J indicates the length of the target sentence y^n.
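The sentence-level estimation can be sketched in plain Python. This is our own minimal illustration, not code from the paper: `variance` and `confidence_weights` are hypothetical names, and lists of raw probabilities stand in for real Monte Carlo dropout forward passes of an NMT model.

```python
import math

def variance(samples):
    """Population variance of M Monte Carlo probability samples."""
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)

def confidence_weights(prob_samples, k=2.0):
    """Map per-example MC-dropout samples to normalized confidence scores.

    prob_samples: list of N lists, each holding M sampled translation
    probabilities P(y^n | x^n, theta_m) for one sentence pair.
    Returns N weights that sum to N (preserving the loss scale), where
    lower-variance (more confident) examples receive larger weights.
    """
    raw = [-k * variance(s) for s in prob_samples]  # unnormalized scores
    exps = [math.exp(a) for a in raw]
    z = sum(exps)
    return [len(raw) * e / z for e in exps]         # scaled softmax, Eq. (2)
```

Note that a negative `k` flips the ordering, so high-variance (hesitant) examples receive the larger weights instead.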

Training Strategy
A larger confidence score indicates that the current model is confident about the corresponding example. Therefore, the model should learn more from the predicted loss. In order to govern the learning schedule automatically, we leverage the confidence scores as factors to weight the loss, thus controlling the update at each time step. To this end, the sentence log-likelihood can be defined as:

L(x^n, y^n) = Σ_{j=1}^J β^n_j · log P(y^n_j | x^n, y^n_{<j}, θ) .  (5)

Finally, the loss of a batch is calculated as:

L(θ) = -1/N · Σ_{n=1}^N α^n · L(x^n, y^n) .  (6)

At the early stage of training, the model learns more from confident samples, thus accelerating the training. Hesitant samples are not completely ignored, but relatively little is learned from them. As training proceeds, the loss of high-confidence samples gradually reduces, and the model pays more attention to "complex" samples with low prediction accuracy, thus raising their confidence. In this way, the losses of different samples are dynamically revised, eventually balancing the learning. In contrast to related studies (Platanios et al., 2019) which adopt CL for NMT with predefined patterns, the superiority of our model lies in its flexibility in both learning emphasis and strategy. Some researchers may be concerned about the processing speed when integrating Monte Carlo dropout sampling. Contrary to prior studies which estimate confidence during inference (Dong et al., 2018), we only perform forward propagation M = 5 times at training time, which avoids auto-regressive decoding for efficiency.
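The weighted objective can be illustrated as follows. This is a minimal sketch under our reading of the loss above; the function names are ours, and hypothetical log-probabilities and weights stand in for model outputs and estimated confidences.

```python
import math

def sentence_loss(token_logprobs, token_weights, sent_weight):
    """Confidence-weighted negative log-likelihood of one sentence:
    the sentence-level score alpha^n scales the sum of token-level
    scores beta^n_j times log P(y^n_j | x^n, y^n_<j)."""
    return -sent_weight * sum(b * lp for b, lp in zip(token_weights, token_logprobs))

def batch_loss(examples):
    """Average the weighted sentence losses over a mini-batch.
    examples: list of (token_logprobs, token_weights, sent_weight)."""
    return sum(sentence_loss(*ex) for ex in examples) / len(examples)

# With all weights fixed to 1 this reduces to the standard NLL;
# down-weighting a hesitant sentence shrinks its contribution.
plain = batch_loss([([math.log(0.5), math.log(0.25)], [1.0, 1.0], 1.0)])
damped = batch_loss([([math.log(0.5), math.log(0.25)], [1.0, 1.0], 0.5)])
```

In a real system the weights would come from the confidence estimation of Sec. 2.1 and the log-probabilities from the decoder, but the reduction to plain NLL when all weights are 1 is the key property.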

Experiments
We evaluate our method upon TRANSFORMER-Base/Big model (Vaswani et al., 2017) and conduct experiments on IWSLT15 English-to-Vietnamese (En⇒Vi), WMT14 English-to-German (En⇒De) and WMT17 Chinese-to-English (Zh⇒En) tasks. For fair comparison, we use the same experimental setting as Platanios et al. (2019) for En⇒Vi and follow the common configuration in Vaswani et al. (2017) for En⇒De and Zh⇒En.
During training, we apply a dropout ratio of 0.3 and a batch size of 4,096 for the En⇒Vi task, and conduct the experiments on one Nvidia GTX 1080Ti GPU. For the En⇒De and Zh⇒En tasks, we use a batch size of 32,768 and four Nvidia V100 GPUs. We use beam sizes of 4, 5, and 10, and decoding alpha of 1.5, 0.6, and 1.35 for each task, respectively (Vaswani et al., 2017). We compare our models with two baselines: Base and Big, the vanilla TRANSFORMER models (Vaswani et al., 2017).

Figure 2: Effect of k on the best performance after certain training steps on the En⇒De dev set. At the early stage of training, a higher k yields better translation quality, denoting a faster convergence speed.

Confidence/Unconfidence Balancing
As mentioned in Sec. 2.1, we assign k to balance the extent of discrimination between confident and unconfident examples. We first conduct experiments on En⇒De to evaluate the impact of k. Fig. 2 shows that the larger k is, the faster the convergence. However, the final performance slightly decreases when k > 2. We believe that an overlarge k leads to overfitting on confident samples while ignoring initially hesitant samples. This demonstrates that an appropriate balance in the discriminative manner contributes to both convergence acceleration and final performance. Besides, when k is negative, the model pays more attention to unconfident examples. This circumstance is identical to reverse-CL, where training is advised to offer examples in a hard-to-easy order. Our results confirm that the unconfidence-first strategy (k < 0) performs worse than the baseline, which is similar to previous findings on CL. We attribute this to the fact that the heuristic design forces the NMT model to unceasingly learn more from unconfident examples, finally leading to catastrophic forgetting (Goodfellow et al., 2014). Therefore, we set k = 2 for subsequent experiments.
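The balancing role of k can be illustrated numerically. The sketch below is our own (the helper name `scaled_softmax_weights` is hypothetical) and uses two made-up variance values for a confident and a hesitant example.

```python
import math

def scaled_softmax_weights(variances, k):
    """Turn per-example MC-dropout variances into confidence weights
    via a softmax over -k * variance, scaled by the batch size so the
    weights sum to the number of examples."""
    raw = [-k * v for v in variances]
    exps = [math.exp(r) for r in raw]
    z = sum(exps)
    return [len(raw) * e / z for e in exps]

vars_ = [0.0, 0.3]                        # a confident and a hesitant example
low = scaled_softmax_weights(vars_, k=1)  # mild discrimination
high = scaled_softmax_weights(vars_, k=4) # sharp discrimination
rev = scaled_softmax_weights(vars_, k=-2) # unconfidence-first (reverse-CL)
```

A larger k widens the gap between the two weights (faster early convergence, but more risk of overfitting on confident samples), while a negative k hands the larger weight to the hesitant example.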

Main Results
As shown in Tab. 1, our baseline models outperform the reported results (Vaswani et al., 2017; Platanios et al., 2019) on the same data, making the evaluation convincing. The proposed self-paced learning method (SPL) achieves better results than existing CL approaches that artificially determine the difficulty (SL or WR), demonstrating the effectiveness of our method. Specifically, removing either SLC or TLC decreases the translation quality, indicating that the two confidence estimations are complementary to each other. TLC outperforms its SLC counterpart, which confirms that more fine-grained information benefits the training. Moreover, our method consistently improves the translation quality by around 1 BLEU point across all involved tasks and multiple model settings. This shows the universality and effectiveness of SPL on different scales of training data and model sizes.

Table 1: Overall experimental results of all approaches on three translation tasks. Each cell contains the mean and standard deviation of BLEU scores derived from 5 independent runs. "SPL": the proposed self-paced learning model. "Acc.": acceleration ratio of the training time required to achieve the best performance of the baseline. "↑": the improvement is significant in contrast to the TRANSFORMER-Base/Big baseline (p < 0.01).

Analysis
In this section, we further investigate how the proposed method exactly affects the NMT model training by conducting experimental analyses on 1) convergence speed, 2) self-paced adjustment and 3) sequential bucketing.

Convergence Speed
As aforementioned, one motivation for exploiting self-paced learning is to accelerate the convergence of model training. We visualize the learning curves of the examined models on the En⇒De dev set in Fig. 3. As seen, the vanilla NMT model reaches its convergence at step 120k, while the proposed one reaches the same performance at step 47k, which is 2.43 times faster. Although Monte Carlo dropout sampling requires extra time to forward-pass the network at each iteration step, our method can still reach a comparable result on the dev set with a shorter training time, achieving a 1.46x faster training speed (column "Acc." in Tab. 1). Besides, we also observe that the two methods proposed by Platanios et al. (2019) reveal a tendency comparable to the baseline. We explain this with the view that Platanios et al. (2019) examined these approaches with batches of 5,120 tokens, much smaller than those used in our experiments (32,768). Since a larger batch size can considerably facilitate the training (Popel and Bojar, 2018), the benefits of their models may be marginal with this change.

Self-Paced Adjustment
It is interesting to investigate how our model adjusts its learning. We randomly extract 300 En⇒De training examples, which are then categorized into 3 subsets according to their sentence lengths. Fig. 4 shows the ratios of averaged SLC scores between our method and the vanilla NMT system at different checkpoints. As seen, at the beginning of training, the ratio of the confidence score with respect to short sentences is greater than 1, indicating that our model pays more attention to shorter examples than the baseline. This is consistent with the human intuition that short sentences seem easier and should be learned earlier. However, as training continues, our model focuses on short and long sentences simultaneously and hesitates on sentences of medium length, which goes against human intuition and indicates that long sentences may be easier than their medium-length counterparts for the current model. From then on, the curves fluctuate and interlace continuously, revealing that SPL automatically regulates its learning emphasis. These phenomena show the flexibility of our model, and confirm that predefined data difficulty and learning schedules are insufficient to fully match the model learning.

Sequential Bucketing
Conventional model training sorts examples of similar lengths into buckets to keep training efficient. This may introduce bias when estimating confidence scores, because longer sequences may receive far lower scores due to the multiplication of token probabilities in SLC estimation. Generally, a larger window size for bucketing increases the diversity of lengths within each batch, but reduces the efficiency of training due to extra padding tokens.
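As a sketch of what the bucketing window means here, consider the following simplified illustration (our own; real NMT toolkits additionally cap the token count per batch and shuffle within buckets):

```python
def bucket_by_length(lengths, window):
    """Group example indices so that lengths within one bucket differ by
    less than `window` tokens. A small window yields batches of
    near-identical lengths (little padding, low length diversity);
    a large window mixes lengths at the cost of padding tokens."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    buckets, current, base = [], [], None
    for i in order:
        if base is not None and lengths[i] - base >= window:
            buckets.append(current)      # close the bucket and start a new one
            current, base = [], None
        if base is None:
            base = lengths[i]            # shortest length in this bucket
        current.append(i)
    if current:
        buckets.append(current)
    return buckets
```

For example, with lengths [5, 6, 20, 21] a window of 5 produces two homogeneous buckets, while a window of 100 mixes all four examples into one diverse bucket.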
To investigate whether the diversity of sequence lengths within each batch may introduce bias into the SLC score computation, we conduct a series of experiments with different settings of sequence bucketing. As shown in Fig. 5, we explore the effect of this on the En⇒De task, revealing that both the baseline and our approach gain improvement from a larger bucket range. Nevertheless, the performance of the baseline model decreases along with lower diversity of sequence lengths, whereas that of our model does not diminish. Our model gives better performance with a smaller window size compared to the baseline. We can conclude that the performance of the TRANSFORMER baseline suffers from close sequence lengths within each batch, whereas our model shows its flexibility in adjusting its learning to avoid such an effect.
For fair comparison as well as keeping the training efficiency, we follow the default setting of Vaswani et al. (2017).

Figure 5: Performance on the WMT14 En⇒De dev set with different bucketing strategies. As the window size for sequence bucketing becomes smaller, the number of buckets accordingly increases, and our model maintains its performance whereas the baseline drops.

Conclusion
In this paper, we propose a novel self-paced learning method for NMT in which the learning schedule is determined by the model itself rather than being intuitively predefined by humans. Experimental results on three translation tasks verify the universal effectiveness of our approach. Quantitative analyses confirm that exploiting the self-paced strategy presents a more flexible way to facilitate model convergence than its CL counterparts. It would be interesting to combine our method with other techniques (Hao et al., 2019) to further improve NMT. Besides, as this idea is not limited to machine translation, it is also interesting to validate our model on other NLP tasks, such as low-resource NMT training (Lample et al., 2018) and neural architecture search (Guo et al., 2020).