Active Learning Approaches to Enhancing Neural Machine Translation: An Empirical Study

Active learning is an efficient approach for mitigating data dependency when training neural machine translation (NMT) models. In this paper, we explore new training frameworks by incorporating active learning into various techniques such as transfer learning and iterative back-translation (IBT) under a limited human translation budget. We design a word frequency based acquisition function and combine it with a strong uncertainty based method. The combined method steadily outperforms all other acquisition functions in various scenarios. As far as we know, we are the first to do a large-scale study on actively training Transformer for NMT. Specifically, with a human translation budget of only 20% of the original parallel corpus, we manage to surpass Transformer trained on the entire parallel corpus in three language pairs.


Introduction
Impressive progress has been made in neural machine translation (NMT) in the past few years (Luong et al., 2015; Gehring et al., 2017; Vaswani et al., 2017). However, the general training procedure requires a tremendous amount of high-quality parallel data to realize a deep model's full potential. Scarcity of training data is a common problem for many language pairs and can lead to poor NMT performance. Constructing a parallel corpus, moreover, is slow and laborious: it requires professional human translators and well-trained proofreaders. Although several dual learning (He et al., 2016; Bi et al., 2019) and unsupervised learning (Artetxe et al., 2018; Lample et al., 2017; Lample and Conneau, 2019) approaches have been used successfully, they are often inferior to supervised models. In such cases, active learning might be a good choice. The goal of active learning in NMT is to train a well-performing model under a limited human translation budget. We achieve this goal by using specially designed acquisition functions to select informative sentences for constructing a training corpus.
Acquisition functions can be categorized into two types: model related and model agnostic. For the former, the methods we use are all based on the idea of uncertainty. For the latter, we devise a word frequency based method which takes linguistic features into consideration. Both types of acquisition functions have been proven to be beneficial in active NMT training, especially when they are appropriately combined.
Data augmentation techniques that consume no human translation budget are worth exploring in active NMT training. If the parallel corpus of a related language pair is available, transfer learning (Zoph et al., 2016; Kim et al., 2019) might be a good choice. Otherwise, we propose a new training framework that integrates active learning with iterative back-translation (IBT) (Hoang et al., 2018). We obtain improvements in both settings, especially when active learning is combined with IBT.
The main contributions of this work are listed as follows: 1) To the best of our knowledge, we are the first to give a comprehensive study of active learning in NMT under various settings. 2) We propose a word frequency based acquisition function which is model agnostic and effective. This acquisition function can further enhance existing uncertainty based methods, achieving even better results in all settings. 3) We design a new training framework for active iterative back-translation as well as a simple data augmentation technique. With a human translation budget of only 20% of the original parallel corpus, we can achieve better BLEU scores than the fully supervised Transformer does (Vaswani et al., 2017).

Related Work
Active learning In natural language processing, active learning is well studied in text classification (Zhang et al., 2017; Ru et al., 2020) and named entity recognition (Shen et al., 2017; Siddhant and Lipton, 2018; Prabhu et al., 2019). Peris and Casacuberta (2018) applied attention based acquisition functions for NMT. Liu et al. (2018) introduced reinforcement learning to actively train an NMT model.
Data selection in NMT Although active learning has not been thoroughly studied in NMT, the related data selection problem has attracted some attention.
Interactive NMT Interactive NMT exploits user feedback to help improve translation systems. Real-world (Kreutzer et al., 2018) or simulated user feedback includes highlighting accurate translation chunks (Petrushkov et al., 2018) or correcting errors made by the machine (Peris and Casacuberta, 2018; Domingo et al., 2019). Kreutzer and Riezler (2019) took the cost of different types of supervision (feedback) into account, which resembles the idea of active learning.

Methodology
We give a detailed description of active neural machine translation (NMT) in this section. Basic settings and terminology are introduced in Section 3.1. In Section 3.2 and Section 3.3, various acquisition functions are presented and explained. Section 3.4 deals with combining active learning with transfer learning and iterative back-translation. Figure 1 illustrates the different training frameworks in NMT.

Active NMT
Three terms need to be clarified before introducing the active NMT training loop, namely, acquisition function, oracle and budget.
Acquisition Function An acquisition function gives a score to each untranslated sentence in the monolingual corpus. Sentences with higher scores are more likely to be selected for the training corpus. Acquisition functions fall into two types, model related and model agnostic. A model related acquisition function takes a sentence as the model input and gives a score depending on the model output. A model agnostic acquisition function is concerned with the informativeness of the sentence itself, so it can score each sentence before the model is trained.
Oracle An oracle is a gold standard for a machine learning task. For NMT, an oracle can output the ground truth translation given a source sentence (specifically an expert human translator). A parallel corpus is gradually constructed by employing an oracle to translate the selected sentences.
Budget The budget is the total cost one can afford for employing an oracle. For NMT, we need to hire human experts to translate sentences. In order to simulate active NMT training, throughout all our experiments, the cost is the number of words translated.
In the beginning, we have a large-scale monolingual corpus of the source language. We do several rounds of active training until the total budget is used up. In each round, five steps are taken:
• Use an acquisition function to score each untranslated sentence.
• Sort the untranslated sentences according to the scores in descending order.
• Select high score untranslated sentences until the token budget in this round is used up.
• Remove the selected sentences from the monolingual corpus and employ an oracle to translate them.
• Add these new sentence pairs to the parallel corpus and retrain the NMT model.
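The five steps above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: `score_fn`, `oracle`, `train_fn`, `round_budget` and `total_budget` are hypothetical names, cost is counted as whitespace-separated words, and a round stops selecting at the first sentence that no longer fits the round budget.

```python
def active_nmt_loop(monolingual, score_fn, oracle, train_fn,
                    round_budget, total_budget):
    """Run active training rounds until the word budget is spent.

    score_fn(model, sentence) -> float   (hypothetical acquisition function)
    oracle(sentence) -> translation      (stands in for a human translator)
    train_fn(parallel) -> model          (retrains on the parallel corpus)
    """
    pool = list(monolingual)          # untranslated source sentences
    parallel, model, spent = [], None, 0
    while pool and spent < total_budget:
        budget = min(round_budget, total_budget - spent)
        # Steps 1-2: score every untranslated sentence, sort descending.
        order = sorted(range(len(pool)),
                       key=lambda i: score_fn(model, pool[i]),
                       reverse=True)
        # Step 3: take high-scoring sentences until the round budget is used.
        picked, used = [], 0
        for i in order:
            cost = len(pool[i].split())
            if used + cost > budget:
                break
            picked.append(i)
            used += cost
        if not picked:
            break
        # Step 4: have the oracle translate them and remove them from the pool.
        parallel += [(pool[i], oracle(pool[i])) for i in picked]
        picked_set = set(picked)
        pool = [s for i, s in enumerate(pool) if i not in picked_set]
        # Step 5: retrain on the enlarged parallel corpus.
        model = train_fn(parallel)
        spent += used
    return model, parallel
```

A model agnostic acquisition function simply ignores the `model` argument, which is why it can be evaluated before any training has happened.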
We use Transformer throughout our experiments. As this architecture is commonly used and our implementation differs little from the original, we skip an exhaustive background description of the underlying model. One can refer to Vaswani et al. (2017) for details. The active NMT training loop is shown in part (b) of Figure 1.

Model Related Acquisition Functions
All model related acquisition functions we try are based on uncertainty. Settles and Craven (2008) tried these methods on sequence labeling tasks. For NMT, we use greedy decoding to generate a synthetic translation of each sentence $x = (x_1, \cdots, x_n)$ in the monolingual corpus $U$. We denote this synthetic translation as $\hat{y} = (\hat{y}_1, \cdots, \hat{y}_m)$. In the $i$-th decoding step, the model outputs a probability distribution over the entire vocabulary, $P_\theta(\cdot \mid x, \hat{y}_{<i})$.
Least Confident (lc) A direct interpretation of model uncertainty is the average confidence level on the generated translation. We strengthen the model on its weaknesses and force it to learn more on intrinsically hard sentences.
Minimum Margin (margin) Margin means the average probability gap between the model's most confident word $y^*_{i,1}$ and second most confident word $y^*_{i,2}$ in each decoding step. With a small margin, the model is unable to distinguish the best translation from an inferior one.
Token Entropy (te) Concentrated distributions tend to have low entropy. Entropy is also an appropriate measurement of uncertainty. In NMT, we calculate the average entropy in each decoding step as given by the following equation.
Total Token Entropy (tte) To avoid favoring long sentences, we average over the sentence length in the above three methods. However, it remains an open question whether querying long sentences should be discouraged. We design an acquisition function to investigate this question by removing the $\frac{1}{m}$ term from Token Entropy.
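The four uncertainty scores described above can be sketched from the per-step output distributions. This is an illustration written from the verbal definitions, not the paper's equations: the function name and input format are hypothetical, and the sign conventions (1 minus confidence, negated margin) are assumptions chosen so that a higher score always means a more uncertain sentence. Under greedy decoding the generated token is the most probable one, so its probability is the maximum of each distribution.

```python
import math


def uncertainty_scores(step_probs):
    """Compute lc, margin, te and tte for one sentence.

    step_probs: list of m probability distributions (one per decoding
    step), each a list of at least two probabilities summing to 1.
    """
    m = len(step_probs)
    lc_sum = margin_sum = entropy_sum = 0.0
    for probs in step_probs:
        top = sorted(probs, reverse=True)
        lc_sum += 1.0 - top[0]           # low confidence in the greedy token
        margin_sum += top[0] - top[1]    # gap between best and second best
        entropy_sum += -sum(p * math.log(p) for p in probs if p > 0)
    return {
        "lc": lc_sum / m,            # average lack of confidence
        "margin": -margin_sum / m,   # negated so small margins rank first
        "te": entropy_sum / m,       # average token entropy
        "tte": entropy_sum,          # same sum without the 1/m factor
    }
```

Because tte drops the division by $m$, it grows with sentence length, which is exactly the effect the text sets out to examine.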

Model Agnostic Acquisition Functions
Uncertainty based acquisition functions depend purely on probability. We propose a model agnostic acquisition function that focuses on linguistic features. In NMT, it is important to enable the model to translate unseen future sentences. In other words, we wish to choose sentences that are representative of all the untranslated sentences but less similar to what has previously been selected.
Algorithm 1 (partial): calculate delfy(s) by Equation (7); 7: Û = Û ∪ {s}; 8: end for; 9: for s in sort(U) by delfy score do
In each active training round, we have a set of untranslated sentences on the source language side, which is denoted as $U$. The sentences that have been selected in previous active training rounds are denoted as $L$. We denote a sentence as $s = (s_1, \cdots, s_K)$, which differs from the notation in Section 3.2 because we are now working at the word level instead of the subword level. First, we define the logarithm frequency of a word $w$ in $U$, namely, $F(w \mid U)$.
Here $C(w \mid \cdot)$ denotes the number of occurrences of a word $w$ in a given sentence set. As shown in Equation (6), the representativeness of a sentence $s$ is determined by its average logarithm word frequency in $U$. A decay factor $\lambda_1 \geq 0$ is introduced to help the model pay more attention to words that are uncommon in the previously selected corpus $L$.
Directly using lf scores is problematic. The algorithm favors a small number of function words (like "a", "the") which account for a high proportion of the entire corpus. Also, redundancy arises since sentences of similar content share similar scores. These two drawbacks are disastrous for building a well-performing translation system. A gradual reranking is used to ease these two problems. Equation (6) is employed for the first round of sorting. $\hat{U}(s)$ is the set of all sentences that have a higher lf score than $s$. If $s$ has a high lf score, but each word $s_i$ in $s$ frequently appears in $\hat{U}(s)$, we use a decay term $e^{-\lambda_2 C(s_i \mid \hat{U}(s))}$ to cut down its score. In this way, we tend to discard repetitive sentences and filter out insignificant function words. Details can be found in Equations (7) and (8). $\lambda_1$ and $\lambda_2$ are non-negative constants.
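One way to write the quantities described in this section is given below. It is a reconstruction from the verbal definitions and should be read as a sketch: the add-one smoothing and the normalization in $F$ are assumptions, and no claim is made that the layout matches the numbered equations referred to in the text.

```latex
\begin{align*}
F(w \mid U) &= \frac{\log\bigl(C(w \mid U) + 1\bigr)}
                    {\sum_{w' \in U} \log\bigl(C(w' \mid U) + 1\bigr)} \\
\mathrm{lf}(s) &= \frac{1}{K} \sum_{i=1}^{K} F(s_i \mid U)\,
                  e^{-\lambda_1 C(s_i \mid L)} \\
\mathrm{delfy}(s) &= \frac{1}{K} \sum_{i=1}^{K} F(s_i \mid U)\,
                  e^{-\lambda_1 C(s_i \mid L)}\,
                  e^{-\lambda_2 C(s_i \mid \hat{U}(s))}
\end{align*}
```

The first exponential lowers the weight of words already frequent in $L$; the second lowers the weight of words already frequent among the sentences ranked above $s$ in the lf ordering.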
We name this model agnostic acquisition function decay logarithm frequency (delfy); it is summarized in Algorithm 1.
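The two-pass scoring described in this section can be sketched as follows. This is a hedged illustration rather than a transcription of Algorithm 1: the function names are hypothetical, sentences are assumed to be non-empty lists of words, and the add-one smoothing and normalization of the log frequency are assumptions.

```python
import math
from collections import Counter


def log_frequency(corpus):
    """Normalized log word frequency over a list of tokenized sentences."""
    counts = Counter(w for sent in corpus for w in sent)
    logs = {w: math.log(c + 1) for w, c in counts.items()}  # +1 is assumed
    total = sum(logs.values())
    return {w: v / total for w, v in logs.items()}


def delfy_ranking(unlabeled, labeled, lam1=1.0, lam2=1.0):
    """Return (indices sorted by delfy score, scores) for `unlabeled`."""
    freq = log_frequency(unlabeled)
    in_labeled = Counter(w for sent in labeled for w in sent)

    def lf(sent):
        # Average log frequency, decayed by occurrences in the labeled set.
        return sum(freq[w] * math.exp(-lam1 * in_labeled[w])
                   for w in sent) / len(sent)

    # First pass: sort by lf score, highest first.
    first_pass = sorted(range(len(unlabeled)),
                        key=lambda i: lf(unlabeled[i]), reverse=True)

    # Second pass: decay each word by its count among higher-ranked sentences.
    seen_above = Counter()
    scores = {}
    for i in first_pass:
        sent = unlabeled[i]
        scores[i] = sum(freq[w]
                        * math.exp(-lam1 * in_labeled[w])
                        * math.exp(-lam2 * seen_above[w])
                        for w in sent) / len(sent)
        seen_above.update(sent)

    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores
```

On a toy pool containing the same sentence twice, the second copy is pushed below an unrelated sentence, which is the redundancy effect the reranking is meant to produce.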

Active NMT with Data Augmentation
Directly incorporating active learning into NMT can be beneficial. However, is there any technique that consumes no extra budget to further improve translation performance? The answer depends on the availability of some related parallel corpus. Transferring knowledge from a related language pair can be considered if an extra parallel corpus is available. Iterative back-translation is worth trying if not.
Transfer Learning We assume that there exists a rich parallel corpus in a related translation direction, e.g., we try to build a German-English NMT system and we have access to French-English sentence pairs. The model is initialized by training on this related parallel corpus. Active NMT training is carried out as described in Section 3.1 after model initialization.
Iterative Back-Translation Iterative back-translation (IBT) (Sennrich et al., 2016a; Hoang et al., 2018) has proved helpful in boosting model performance. In active NMT training, IBT offers a data augmentation technique that is budget free (no human translator is needed). However, simply using the entire monolingual corpus to generate a synthetic parallel corpus hurts rather than improves model performance. We designed some experiments to validate this argument. Detailed results can be seen in Appendix B.
Two reasons may cause these poor results. First, the quality of the synthetic corpus varies. Some of the synthetic sentence pairs can be beneficial, while others only introduce noise into the NMT model. Second, the percentage of the synthetic corpus in the entire training corpus is too high. To cope with these two problems, we propose a new Active IBT framework. Models of opposite translation directions are responsible for constructing the training corpus for each other. Sentences with the highest acquisition function scores are divided into two parts. One part is translated by an oracle to enrich the parallel corpus. The other part is used to generate a new synthetic corpus. In this way, we manage to control the quality as well as the percentage of the synthetic corpus.
This framework is shown in part (c) of Figure 1, and some details can be found in Algorithm 2.
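The division of top-scored sentences into an oracle part and a synthetic part might look like the sketch below. It is an assumption-laden illustration, not Algorithm 2: the function and parameter names are hypothetical, and the choice to fill the oracle part first and then the synthetic part, each up to its own token budget, is one plausible reading of the description.

```python
def split_for_active_ibt(ranked_sentences, oracle_budget, synthetic_budget):
    """Split sentences (already sorted by descending acquisition score).

    ranked_sentences: list of tokenized sentences (lists of words).
    oracle_budget / synthetic_budget: token limits for each part
    (hypothetical parameters for this sketch).
    Returns (oracle_part, synthetic_part).
    """
    oracle_part, synthetic_part = [], []
    oracle_used = synthetic_used = 0
    for sent in ranked_sentences:
        cost = len(sent)
        if oracle_used + cost <= oracle_budget:
            # Highest-scoring sentences go to the human translator.
            oracle_part.append(sent)
            oracle_used += cost
        elif synthetic_used + cost <= synthetic_budget:
            # The next best are back-translated by the opposite-direction model.
            synthetic_part.append(sent)
            synthetic_used += cost
    return oracle_part, synthetic_part
```

Capping the synthetic part with its own budget is what keeps the share of synthetic data in the training corpus under control.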
Active IBT++ Active learning aims at choosing informative sentences to train the model. Is there any way that we can exploit more value from these selected sentences? Inspired by Nguyen et al. (2019), we propose some further data augmentation techniques after Active IBT is done. Models of the last k 1 rounds are used for translating the final parallel corpus, such that each selected sentence will have diversified translations. We merge the diversified parallel corpus with the synthetic corpus of a specific translation direction in the last k 2 rounds. Duplicate sentence pairs are filtered out. The NMT model is re-initialized and trained on this enlarged training corpus.
We name this technique Active IBT++ and summarize it in Algorithm 3. For simplicity, we only consider one translation direction in Algorithm 3. The same technique can easily be applied to the other translation direction.
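The corpus construction step of Active IBT++ can be illustrated with a small merge-and-deduplicate routine. This is a sketch under assumptions, not Algorithm 3: the names are hypothetical, each model from the last rounds is represented as a plain callable from a source sentence to a translation, and keeping the authentic pairs in the merged corpus is an assumption.

```python
def build_enlarged_corpus(parallel, recent_translators, recent_synthetic):
    """Merge diversified translations with recent synthetic corpora.

    parallel: list of (source, target) pairs selected by active learning.
    recent_translators: callables standing in for the last k1 models.
    recent_synthetic: synthetic corpora (lists of pairs) from the last
        k2 rounds of one translation direction.
    Returns a list of unique sentence pairs in first-seen order.
    """
    merged, seen = [], set()

    def add(pair):
        # Filter out duplicate sentence pairs.
        if pair not in seen:
            seen.add(pair)
            merged.append(pair)

    for source, target in parallel:
        add((source, target))
        # Each recent model contributes one more translation of the source.
        for translate in recent_translators:
            add((source, translate(source)))
    for corpus in recent_synthetic:
        for pair in corpus:
            add(pair)
    return merged
```

The re-initialized model would then be trained on the returned list.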

Dataset, Preprocessing and Implementation
We experiment on three language pairs, namely German-English (DE-EN), Russian-English (RU-EN) and Lithuanian-English (LT-EN). For active NMT with or without transfer learning, we only experiment on translating into English. For active iterative back-translation (IBT), in contrast, evaluation is carried out on translating both from and into English. The evaluation metric is BLEU (Papineni et al., 2002). Model hyperparameters are identical to Transformer base (Vaswani et al., 2017). The Adam optimizer (Kingma and Ba, 2014) is used with a learning rate of $7 \times 10^{-4}$. We use the same learning rate scheduling strategy as Vaswani et al. (2017), with 4000 warmup steps. During training, the label smoothing factor and the dropout probability are both set to 0.1. $\lambda_1$ and $\lambda_2$ in Algorithm 1 are both set to 1.0.
Our implementation is based on PyTorch (http://pytorch.org/). All models are trained on 8 RTX 2080Ti GPU cards with a mini-batch of 4096 tokens. We stop training if validation perplexity does not decrease for 10 epochs in each active training round.
Footnote 2: https://github.com/moses-smt/mosesdecoder

Active NMT
As a starting point, we empirically compare the different acquisition functions proposed in Section 3.2 and Section 3.3, as well as the uniformly random selection baseline. Twelve rounds of active NMT training are done. In each round, 1.67% of the entire parallel corpus is selected and added to the training corpus. Thus, the token budget reaches 20% of the entire parallel corpus in the final round (12 × 1.67% ≈ 20%). The training corpus in the first round is identical across acquisition functions to ensure a fair comparison.
Results are shown in Figure 2. Most active acquisition functions outperform the random selection baseline in all three language pairs. Our model agnostic acquisition function (delfy) is also better than the best uncertainty based acquisition function. Because delfy and the uncertainty based functions capture different aspects of the informativeness of a sentence, we try combining delfy with a well-performing uncertainty based function, and choose token entropy (te). We add the ranks given by the two acquisition functions, rather than their raw scores, to avoid the magnitude problem. For example, if a sentence gets the highest delfy score as well as the second-highest te score, then its delfy rank is 1 and its te rank is 2, so its rank sum is 1 + 2 = 3. Since we sort sentences in descending order of their scores, we multiply the sum of the ranks by −1 to obtain the final score. This combined acquisition function is named te-delfy.
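The rank-sum combination described above is small enough to sketch directly. The function name is hypothetical, and how ties are broken (here, by original position) is an assumption, since the text does not address it.

```python
def combine_by_rank(scores_a, scores_b):
    """Combine two acquisition functions by negated rank sum.

    scores_a, scores_b: per-sentence scores from the two functions,
    aligned by index (higher means more informative).
    Returns one combined score per sentence (higher is better).
    """
    def ranks(scores):
        # Rank 1 goes to the highest score; ties keep original order.
        order = sorted(range(len(scores)),
                       key=lambda i: scores[i], reverse=True)
        result = [0] * len(scores)
        for rank, index in enumerate(order, start=1):
            result[index] = rank
        return result

    ranks_a, ranks_b = ranks(scores_a), ranks(scores_b)
    # Negate so that sorting in descending order favors small rank sums.
    return [-(a + b) for a, b in zip(ranks_a, ranks_b)]
```

With delfy scores `[0.9, 0.5, 0.1]` and te scores `[0.5, 0.8, 0.2]`, the first sentence has delfy rank 1 and te rank 2, so its combined score is −3, matching the worked example in the text.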
Our combined method (te-delfy) proves to be more effective, outperforming all the other acquisition functions in each active NMT training round in all three language pairs. To be more specific, in the last active training round, te-delfy surpasses the best uncertainty based acquisition function by 1.4 BLEU points in DE-EN, 1.6 BLEU points in RU-EN and 1.1 BLEU points in LT-EN.

Active NMT with Transfer Learning
To evaluate different acquisition functions in active NMT with transfer learning, we start from a French to English NMT model. The parallel corpus for building this initial model contains 4M sentence pairs which are randomly selected from the WMT 2014 shared task. To share vocabulary between different languages, we latinize all the Russian sentences (using https://github.com/barseghyanartur/transliterate). Figure 3 shows the results. All the active acquisition functions are still advantageous compared with the random selection baseline, except total token entropy (tte). Our combined method (te-delfy) is also the best in most active training rounds. Te-delfy yields the best final results, beating the best uncertainty based acquisition function by 0.5 BLEU points in DE-EN, 0.3 BLEU points in RU-EN and 0.5 BLEU points in LT-EN. However, in active NMT with transfer learning, the performance gains brought by the different acquisition functions are not as large as they are in active NMT (Section 4.2).

Active Iterative Back-Translation
For active iterative back-translation (IBT), we randomly select 10% of the entire parallel corpus to train an initial NMT model. The initial model is shared across different acquisition functions. We do 10 rounds of Active IBT training. In each round, 1% of the entire parallel corpus is added to the training corpus. The total token budget is therefore still 20% (the initial 10% plus 10 rounds of 1%), as in Section 4.2 and Section 4.3. We set $\alpha$ in Algorithm 2 to half the amount of the authentic parallel corpus in the current Active IBT round. $k_1$ and $k_2$ in Algorithm 3 are set to 3 and 6 respectively.
Results are summarized in Figure 4. Our combined method (te-delfy) becomes even more powerful than it is in active NMT, leading all the way to the final round in all the experiments. All active acquisition functions we try surpass the random baseline by a large margin, with a minimum performance gain of 1.1 BLEU points. We argue that synthetic sentence pairs need more sophisticated selection criteria than authentic ones. Low-quality pseudo-parallel data can damage rather than help model performance.
We compare the actively learned models with the fully supervised Transformer in Table 1. The best results are all achieved by te-delfy, which further supports its superiority. Active IBT++ (Algorithm 3) is applied with te-delfy. With a token budget of 20% of the entire parallel corpus, we surpass the vanilla Transformer in every translation direction. These results show that Active IBT and Active IBT++ are promising approaches for enhancing NMT models.

Linguistic Features
In order to find the common features of the beneficial sentences in translation, we analyze the final parallel corpus constructed by different acquisition functions in active NMT from four aspects. All the analyses are done on word level instead of the subword level. First, we study the impact of the average sentence length. Second, we study the vocabulary coverage by calculating the ratio of the vocabulary size of the selected corpus to the total/test vocabulary size. Finally, the lexical diversity of the selected corpus is analyzed based on the MTLD metric (McCarthy and Jarvis, 2010). Analyses are done on random selection, the best uncertainty based method, delfy and te-delfy. The results are shown in Figure 5.
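Two of these word-level statistics can be sketched in a few lines. This is an illustration only: the function names are hypothetical, tokenization is plain whitespace splitting, and coverage is interpreted here as the fraction of the reference vocabulary that appears in the selected corpus, which is an assumption about how the ratio is computed. MTLD is omitted because it needs a dedicated implementation.

```python
def average_sentence_length(corpus):
    """Mean number of whitespace-separated words per sentence."""
    return sum(len(sentence.split()) for sentence in corpus) / len(corpus)


def vocabulary_coverage(selected, reference):
    """Fraction of the reference vocabulary seen in the selected corpus.

    selected: sentences chosen by an acquisition function.
    reference: the full corpus or the test set, depending on which
        coverage figure is wanted.
    """
    selected_vocab = {w for sentence in selected for w in sentence.split()}
    reference_vocab = {w for sentence in reference for w in sentence.split()}
    return len(selected_vocab & reference_vocab) / len(reference_vocab)
```

When the reference is the full corpus from which sentences were selected, the intersection is just the selected vocabulary, so the value reduces to a plain ratio of vocabulary sizes.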
Most algorithms tend to choose medium-length sentences, rather than extremely long or short ones. We also try sentence length itself as an acquisition function (choosing the longest or shortest sentences), which performs poorly (Appendix A). Vocabulary coverage varies among acquisition functions, with random selection always being the lowest. Higher vocabulary coverage means fewer unseen words, which might create a more knowledgeable model. Also, delfy and te-delfy always achieve higher MTLD scores than the other two methods do. Note that a higher vocabulary coverage does not necessarily mean a higher diversity score. In LT-EN and RU-EN, delfy always has a larger vocabulary size than te-delfy, but its selected corpus is less diverse. In general, a good acquisition function should favor medium-length sentences and have a large vocabulary coverage. Meanwhile, a diversified training corpus is also beneficial to model performance.

Reverse Active learning
Active learning chooses difficult samples for the model. In contrast, several curriculum learning methods (Zhang et al., 2018; Platanios et al., 2019; Liu et al., 2020) accelerate model convergence by starting training with easy data samples and gradually moving to hard ones. The success of curriculum learning makes it reasonable to ask whether the reverse of active learning is also beneficial. Reverse active learning selects sentences with the lowest acquisition function scores in each round. We compare active learning and reverse active learning in Table 2. Reverse active learning lags behind active learning with all acquisition functions we try. Also, reverse active learning cannot beat the random baseline of 18.5 BLEU points. Curriculum learning emphasizes the training process of networks (easy to hard), which might accelerate convergence. However, when the amount of training data is limited, active learning is a better choice.

Conclusion
We evaluate various acquisition functions on active NMT, active NMT with transfer learning and active iterative back-translation (IBT). Our experimental results show that active learning is beneficial to NMT. Our combined method (te-delfy) achieves the best final BLEU score in every experiment we do. Also, the proposed Active IBT++ framework efficiently exploits the selected parallel corpus to further enhance model accuracy. These techniques may also be useful for unsupervised NMT. Active pre-training is worth trying, and Active IBT has already shown its capability. We leave it for future work to study more acquisition functions in more NMT scenarios.