Collective Wisdom: Improving Low-resource Neural Machine Translation using Adaptive Knowledge Distillation

Scarcity of parallel sentence-pairs poses a significant hurdle for training high-quality Neural Machine Translation (NMT) models in bilingually low-resource scenarios. A standard approach is transfer learning, which involves taking a model trained on a high-resource language-pair and fine-tuning it on the data of the low-resource MT condition of interest. However, it is generally not clear which high-resource language-pair offers the best transfer for the target MT setting. Furthermore, different transferred models may have complementary semantic and/or syntactic strengths, hence using only one model may be sub-optimal. In this paper, we tackle this problem using knowledge distillation, where we propose to distill the knowledge of an ensemble of teacher models into a single student model. As the quality of these teacher models varies, we propose an effective adaptive knowledge distillation approach to dynamically adjust the contribution of the teacher models during the distillation process. Experiments on transferring from a collection of six language-pairs from IWSLT to five low-resource language-pairs from TED Talks demonstrate the effectiveness of our approach, achieving up to +0.9 BLEU score improvements over strong baselines.


Introduction
Neural models have been revolutionising machine translation (MT), and have achieved state-of-the-art performance for many high-resource language pairs (Chen et al., 2018; Stahlberg, 2019; Maruf et al., 2019). However, the scarcity of bilingual parallel corpora is still a major challenge for training high-quality NMT models (Koehn and Knowles, 2017). Transfer learning by fine-tuning, from a model trained for a high-resource language-pair, is a standard approach to tackle the scarcity of data in the target low-resource language-pair (Dabre et al., 2017; Kocmi and Bojar, 2018; Kim et al., 2019). However, this is a one-to-one approach, which cannot exploit models trained for multiple high-resource language-pairs for the target language-pair of interest. Furthermore, models transferred from different high-resource language-pairs may have complementary syntactic and/or semantic strengths, hence using a single model may be sub-optimal.
Another appealing approach is multilingual NMT, whereby a single NMT model is trained by combining data from multiple high-resource and low-resource language-pairs (Johnson et al., 2017; Ha et al., 2016; Neubig and Hu, 2018). However, the performance of a multilingual NMT model is highly dependent on the types of languages used to train the model. Indeed, if the languages are from very distant language families, they lead to negative transfer, causing lower translation quality in the multilingual system compared to counterparts trained on the individual language-pairs (Tan et al., 2019a; Oncevay et al., 2020). To address this problem, Tan et al. (2019b) proposed a knowledge distillation approach to effectively train a multilingual model, by selectively distilling the knowledge from individual teacher models to the multilingual student model. However, all the language pairs are still trained in a single model, with each contributing blindly during training.

In this paper, we propose a many-to-one transfer learning approach which can effectively transfer models from multiple high-resource language-pairs to a target low-resource language-pair of interest. As the models fine-tuned from different high-resource language pairs can have complementary syntactic and/or semantic strengths in the target language-pair, our idea is to distill their knowledge into a single student model to make the best use of these teacher models. We further propose an effective adaptive knowledge distillation (AKD) approach to dynamically adjust the contribution of the teacher models during the distillation process, enabling the student to make the best use of the teachers in the ensemble. Each teacher model provides dense supervision to the student via dark knowledge (Hinton et al., 2015), using a mechanism similar to label smoothing (Szegedy et al., 2016; Müller et al., 2019), where the amount of smoothing is regulated by the teacher. In our AKD approach, the label smoothing coming from different teachers is combined and regulated based on the loss incurred by the teacher models during the distillation process. Experiments on transferring from a collection of six language-pairs from IWSLT to five low-resource language-pairs from TED Talks demonstrate the effectiveness of our approach, achieving up to +0.9 BLEU score improvements over strong baselines.

Adaptive Knowledge Distillation
We address the problem of low-resource NMT, assuming that we have access to models trained for high-resource language-pairs and to a small amount of bilingual data for the low-resource language-pair. Our approach relies on two main steps: (i) transferring from the high-resource to the low-resource language-pair by fine-tuning the high-resource models on the small amount of bilingual data, and (ii) adaptive distillation of knowledge from the teacher models to the student model.
More specifically, given a training dataset for a low-resource language-pair, $\mathcal{D}_{LR} := \{(x_1, y_1), \ldots, (x_n, y_n)\}$, and multiple individual high-resource NMT models $\{\theta_l\}_{l=1}^{L}$ fine-tuned on $\mathcal{D}_{LR}$ (teachers), we are interested in training a single NMT model (student) by adaptively distilling knowledge from all teachers based on their effectiveness, in order to improve the accuracy of the student.

Knowledge distillation (KD) is a process of improving the performance of a simple student model by using a distribution over soft labels obtained from an expert teacher model instead of hard ground-truth labels (Hinton et al., 2015). The training objective to distill the knowledge from a single teacher to the student is

$\mathcal{L}_{KD}(\theta_{LR}; \theta_l) = -\sum_{(x,y)\in\mathcal{D}_{LR}} \sum_{t=1}^{|y|} \sum_{w \in V} Q(w \mid y_{<t}, x; \theta_l) \log P(w \mid y_{<t}, x; \theta_{LR}),$   (1)

where $\theta_l$ and $\theta_{LR}$ are the parameters of the teacher and student models, respectively, $P(\cdot \mid \cdot)$ is the conditional probability under the student model, and $Q(\cdot \mid \cdot)$ denotes the output distribution of the teacher model. According to Equation 1, knowledge distillation provides a dense training signal, as each word in the vocabulary $V$ contributes to the training objective, regulated by a weight coming from the teacher. This is in contrast to the negative log-likelihood training objective, which only provides a supervision signal based on the correct target words according to the bilingual training data:

$\mathcal{L}_{NLL}(\theta_{LR}) = -\sum_{(x,y)\in\mathcal{D}_{LR}} \sum_{t=1}^{|y|} \log P(y_t \mid y_{<t}, x; \theta_{LR}).$   (2)

Given a collection of teacher models $\{\theta_l\}_{l=1}^{L}$, we pose the following training objective:

$\mathcal{L}_{AKD}(\theta_{LR}) = \sum_{l=1}^{L} \alpha_l \, \mathcal{L}_{KD}(\theta_{LR}; \theta_l),$   (3)

where $\alpha_l$ regulates the contribution of the $l$-th teacher. We dynamically adjust the contribution weights over the course of the distillation process, in order to effectively address the knowledge gap of the student during training. This is achieved based on the rewards (negative perplexity) attained by the teachers on the data, where these values are passed through a softmax transformation to turn them into a distribution. To stabilize these contribution weights over the course of the training process, we smooth them using a running geometric average. The student model is trained end-to-end with a weighted combination of the losses coming from the ensemble of teachers and the data:

$\mathcal{L}(\theta_{LR}) = \lambda_1 \mathcal{L}_{NLL}(\theta_{LR}) + \lambda_2 \mathcal{L}_{AKD}(\theta_{LR}),$   (4)

where $\lambda_1 = 0.5$ and $\lambda_2$ starts from 0.5 and is gradually increased to 3 following the annealing function of Bowman et al. (2015) in our experiments. Our approach is summarized in Algorithm 1 and Figure 1.
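To make the objective concrete, the following is a minimal PyTorch-style sketch of how the combined loss in Equation 4 can be computed for one mini-batch. The function names, tensor shapes, and the way the contribution weights and $\lambda_2$ are passed in are illustrative assumptions, not the exact implementation used in our experiments.

```python
import torch
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, pad_mask):
    """Word-level KD (Eq. 1): cross-entropy between the teacher's soft
    distribution Q and the student's distribution P, summed over the vocabulary."""
    q = F.softmax(teacher_logits, dim=-1)          # teacher soft labels Q(. | .)
    log_p = F.log_softmax(student_logits, dim=-1)  # student log-probs log P(. | .)
    tok_loss = -(q * log_p).sum(dim=-1)            # dense signal over the vocabulary V
    return (tok_loss * pad_mask).sum() / pad_mask.sum()


def nll_loss(student_logits, targets, pad_mask):
    """Standard negative log-likelihood on the ground-truth target words (Eq. 2)."""
    log_p = F.log_softmax(student_logits, dim=-1)
    tok_loss = -log_p.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return (tok_loss * pad_mask).sum() / pad_mask.sum()


def adaptive_kd_objective(student_logits, teacher_logits_list, alphas,
                          targets, pad_mask, lambda1=0.5, lambda2=0.5):
    """Combined objective (Eqs. 3-4): data loss plus the adaptively weighted
    ensemble of teacher KD losses. `alphas` are the per-teacher contribution
    weights; `lambda2` is annealed from 0.5 towards 3 by the caller."""
    kd = sum(a * kd_loss(student_logits, t_logits, pad_mask)
             for a, t_logits in zip(alphas, teacher_logits_list))
    return lambda1 * nll_loss(student_logits, targets, pad_mask) + lambda2 * kd
```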

Settings
Data. We conduct our experiments on the European languages of the IWSLT and TED datasets. Language pairs with more than 100K training sentence-pairs are considered high-resource, and those with fewer than 15K are considered low-resource. The high-resource models are trained on IWSLT 2014 ({ru, de, it, pl, nl, es}-en). The IWSLT 2014 MT task data (sl-en) (Cettolo et al., 2014) and the TED Talks data (gl/et/nb/eu-en) (Qi et al., 2018) are used as low-resource language-pairs. Details about the preprocessing steps, the statistics of the data, and the language codes (ISO 639-1 standard) are listed in Section 1.1 of Appendix A.
Training configuration. Individual low-resource and transferred high-resource NMT models are both trained on the low-resource data with the vanilla Transformer architecture: the former from scratch, and the latter by fine-tuning the high-resource models. For multilingual NMT, we train a single model on all high-resource and the up-sampled low-resource language-pairs, using a decoder language-embedding layer to identify the target language at inference time. Multilingual selective knowledge distillation (Tan et al., 2019b) is trained with all language-pairs while matching the outputs of each low-resource model simultaneously through knowledge distillation. For training our approach, we fine-tune the high-resource models on the low-resource languages and treat them as teachers. When training on the low-resource language, we load the teacher models into memory and train a single low-resource model (the student) from scratch, using the weighted average of the teachers' probabilities based on their contribution weights. To make clear how different teachers contribute during the training of the student, we illustrate the contribution weights of all teachers for the first 30 iterations over different mini-batches in Figure 2.
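The sketch below illustrates how the per-teacher contribution weights can be recomputed for each mini-batch from the teachers' perplexities and smoothed with a running geometric average. The tensor layout, the smoothing rate, and the function name are illustrative assumptions rather than the exact implementation.

```python
import torch


def update_contribution_weights(teacher_token_nll, pad_mask, prev_alphas=None,
                                smoothing=0.9, temperature=1.0):
    """Recompute teacher contribution weights alpha for the current mini-batch.

    teacher_token_nll: [L, batch, len] per-token NLL of each of the L teachers;
    pad_mask:          [batch, len] 1.0 for real tokens, 0.0 for padding.
    The reward of a teacher is its negative perplexity on the mini-batch.
    """
    # per-teacher mean NLL, then perplexity, on the mini-batch
    mean_nll = (teacher_token_nll * pad_mask).sum(dim=(1, 2)) / pad_mask.sum()
    rewards = -torch.exp(mean_nll)                      # negative perplexity
    alphas = torch.softmax(rewards / temperature, dim=0)
    if prev_alphas is not None:
        # running geometric average to stabilise the weights across batches
        alphas = (prev_alphas ** smoothing) * (alphas ** (1.0 - smoothing))
        alphas = alphas / alphas.sum()
    return alphas
```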
Model configuration. All models are trained with the Transformer architecture (Vaswani et al., 2017), with a model hidden size of 256, a feed-forward hidden size of 1024, and 2 layers, implemented in the Fairseq framework (Ott et al., 2019). We use the Adam optimizer (Kingma and Ba, 2015) and an inverse square root learning-rate schedule with warm-up (maximum learning rate 0.0005). We apply dropout and label smoothing with rates of 0.3 and 0.1, respectively. The source and target embeddings are shared and tied with the last layer. We train with half-precision floats on one V100 GPU, with at most 4028 tokens per batch.

Results
In Table 1, we compare our approach with individual NMT models, models transferred from high-resource language pairs, multilingual NMT, and multilingual selective knowledge distillation (Tan et al., 2019b). We select the best models according to the SacreBLEU score on the validation set. In our experiments, bold numbers indicate the best results and underlined numbers the second best. The transfer learning results are in line with the language family relationships (Littell et al., 2017): the high-resource languages which are linguistically close to the low-resource languages have the largest impact on the low-resource model's improvement. Likewise, the contribution weights of different teachers are consistent with the performance of the teachers, as hypothesized (see the results in Table 1 and Figure 2). According to Table 1, the multilingual models (with and without knowledge distillation) are less accurate than at least one of the models transferred from the high-resource languages. This suggests that only a weak link may exist between the impact of each high-resource language and its contribution during multilingual training. Adaptive knowledge distillation compensates for this blind collaboration between teachers by weighting the teachers' contributions, particularly in cases where the majority of the teachers and the student are linguistically close, such as "nb-en". Qualitative examples are presented in Section 1.4 of Appendix A. It is worth noting that we empirically observed that when there is more diversity among the teachers (e.g., in the case of "gl-en" in Table 1), adaptive KD underperforms compared to the best teacher; we hypothesise this happens because there is an empirically dominant teacher ("es"). This observation suggests that a prior effort in choosing the proper teacher languages (e.g., based on language family information) will directly impact the performance of the low-resource NMT model.

Contribution Weight Analysis
To analyse the effect of the teachers' contribution weights, we compare two settings: (i) adaptive contribution, which assigns contribution weights to all the teachers based on their performance per mini-batch, as explained in Section 2; and (ii) equal contribution, which gives all the teachers the same contribution weight. According to Table 2, the equal contribution setting is not as effective as the adaptive contribution, especially for the languages with more inconsistent teachers (in terms of BLEU score), e.g., "gl-en".

Contribution Temperature Scaling
Figure 2: Teachers' contribution weights during the training of low-resource NMT models for "sl-en", "gl-en", and "nb-en" language pairs, over the first 30 iterations for different mini-batches.

Through the experiments, we observed that when most of the teachers do not agree (in terms of perplexity), a constant temperature is not an ideal option. An alternative is to adaptively change the value of the temperature according to the agreement among the teachers, measured by the gap between the maximum and minimum of $S$, where $S$ is the output of the softmax operation on the negative perplexities of all $N$ teachers: the temperature is set inversely proportional to $(\max(S) - \min(S))$, which itself is inversely proportional to the extent of the agreement between the teachers. Such temperature scaling encourages the contribution of the better teachers in case of a disagreement, while it allows similar contributions when all teachers agree on a mini-batch. Table 3 shows the effect of the adaptive temperature for two languages.
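One plausible instantiation of this adaptive temperature is sketched below, assuming the temperature is taken as the reciprocal of the gap $\max(S) - \min(S)$; the exact functional form and the epsilon constant are assumptions for illustration.

```python
import torch


def adaptive_temperature_weights(rewards, eps=1e-6):
    """Rescale the contribution softmax by a temperature derived from teacher
    disagreement: a large gap between the best and worst teacher sharpens the
    weights, while near-agreement keeps them close to uniform.

    rewards: [N] negative perplexities of the N teachers on the mini-batch.
    """
    s = torch.softmax(rewards, dim=0)           # S: softmax over negative perplexities
    tau = 1.0 / (s.max() - s.min() + eps)       # assumed form: tau inversely prop. to the gap
    return torch.softmax(rewards / tau, dim=0)  # temperature-scaled contribution weights
```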

Conclusion
In this paper, we present an adaptive knowledge distillation approach to improve NMT for low-resource languages. We address the inefficiency of plain transfer learning and multilingual learning by making wiser use of all high-resource languages and models in an effective collaborative learning manner. Our approach shows its effectiveness in the translation of low-resource languages, especially when there is complementary knowledge in multiple high-resource languages from the same linguistic family and it is not explicitly clear which language has more impact on every mini-batch of low-resource training data. Experiments on the translation of five extremely low-resource languages to English show improvements compared to strong baselines.

Appendix A

Data and preprocessing. We use the IWSLT and TED datasets in our experiments, as listed in Table 1. We filter the parallel corpora with langid.py (Lui and Baldwin, 2012) and remove sentence-pairs with a length ratio greater than 1.5. All sentences are first tokenized with the Moses tokenizer and then segmented with BPE (Sennrich et al., 2016), using a BPE model learned with 32K merge operations on all languages. We keep the output vocabularies of the teacher and student models identical to make knowledge distillation feasible.

Qualitative example. Table 2 showcases the English translations generated by the individual student, all the teachers, and the student trained through adaptive knowledge distillation, for a Norwegian source sentence. This example shows that while there is diversity between the different teachers' translations, e.g., for the verb "provoke", the student is guided by the agreement of the majority of the teachers. Moreover, it shows that our adaptive KD model captures the best of all teachers, resulting in a higher quality translation.

Ref: And great creativity is needed to do what it does so well : to provoke us to think differently with dramatic creative statements .
Individual: kepler great mission mission to do it as well : to grow us to think with dramatic creativity .
Teacher (ru-en): and the first creativity needed to do what it does : to promote us to think about the dramatic creativity .
Teacher (de-en): now , the future creativity needs to do it as it does : to provoke us to think differently with dramatic creative expression .
Teacher (it-en): now , the future creativity is needed to do what it does so well : to provocate us to think differently about dramatic reactive .
Teacher (es-en): the future of creativity to do that as it's doing so good : to provocate us to think differently about dramatic creativity .
Teacher (pl-en): the future of creativity to do what it does so good : to promise others with dramatic creativity .
Teacher (nl-en): now , the frequent creativity is to make it that it makes so good : to provoke us with dramatic creative .
Proposed Adapt. KD: now , they need great creativity to do what it does so well : provoke us to think differently with dramatic creativity.

Table 2: The outputs generated by the individual student, all the teachers, and the student trained with multiple teachers (Proposed Adapt. KD) for the "nb-en" MT task. Some of the correct keyword translations are indicated in green, while hallucinations are shown in red. The bold green text shows the best of the teachers' outputs, which is also captured by the student.