Token-level Adaptive Training for Neural Machine Translation

There exists a token imbalance phenomenon in natural language, as different tokens appear with different frequencies, which leads to different learning difficulties for tokens in Neural Machine Translation (NMT). The vanilla NMT model usually adopts trivial equal-weighted objectives for target tokens with different frequencies and tends to generate more high-frequency tokens and fewer low-frequency tokens than the golden token distribution. However, low-frequency tokens may carry critical semantic information, and neglecting them harms translation quality. In this paper, we explore target token-level adaptive objectives based on token frequencies to assign appropriate weights to each target token during training. Our aim is for meaningful but relatively low-frequency words to receive larger weights in the objective, encouraging the model to pay more attention to these tokens. Our method yields consistent improvements in translation quality on ZH-EN, EN-RO, and EN-DE translation tasks, especially on sentences that contain more low-frequency tokens, where we obtain BLEU increases of 1.68, 1.02, and 0.52 over the baseline, respectively. Further analyses show that our method can also improve the lexical diversity of translation.


Introduction
Neural machine translation (NMT) systems (Kalchbrenner and Blunsom, 2013; Cho et al., 2014
Table 1: The average frequency on the NIST training set and the proportion of tokens with different frequencies in the reference and in the translation of the vanilla NMT model (a Transformer model) on the NIST test sets. All the target tokens (BPE sub-words with 30K merge operations) of the training set are ranked by their frequencies in descending order. The 'Token Order' column represents the frequency interval ([10%, 30%) means the frequency of the token is between the top 10% and 30%). The 'Average Frequency' column represents the average frequency of the tokens in each interval, which shows the token imbalance phenomenon in natural language. The last two columns show that the vanilla NMT model tends to generate more high-frequency tokens and fewer low-frequency tokens than the reference.
Some work tries to improve rare word translation by maintaining phrase tables or a back-off vocabulary (Luong et al., 2015; Jean et al., 2015; Li et al., 2016; Pham et al., 2018) or by adding extra components (Gülçehre et al., 2016; Zhao et al., 2018), which brings in extra training complexity and computing expense. Some NMT techniques based on a smaller translation granularity can alleviate this issue, such as the hybrid word-character-based model (Luong and Manning, 2016), the BPE-based model (Sennrich et al., 2016) and the word-piece-based model (Wu et al., 2016). These effective methods alleviate the token imbalance phenomenon to a certain extent and have become the de facto standard in most NMT models. Although sub-word based NMT models have achieved significant improvements, they still face the token-level frequency imbalance phenomenon, as Table 1 shows.
Furthermore, current NMT models generally assign equal training weights to target tokens without considering their frequencies. NMT models are very likely to ignore the loss produced by low-frequency tokens because of their small proportion in the training sets. The parameters related to them cannot be adequately trained, which in turn makes NMT models tend to prioritize output fluency over translation adequacy and to ignore the generation of low-frequency tokens during decoding. This is illustrated in Table 1, which shows that the vanilla NMT model tends to generate more high-frequency tokens and fewer low-frequency tokens. However, low-frequency tokens may carry critical semantic information, and neglecting them may harm translation quality.
To address the above issue, we proposed token-level adaptive training objectives based on target token frequencies. Our aim was for meaningful but relatively low-frequency tokens to be assigned larger loss weights during training so that the model learns more about them. To explore suitable adaptive objectives for NMT, we first applied existing adaptive objectives from other tasks to NMT and analyzed their performance. We found that although they could bring a modest improvement in the translation of low-frequency tokens, they did much damage to the translation of high-frequency tokens, which led to an obvious degradation in overall performance. This implies that the objective should ensure the training of high-frequency tokens first. Then, based on our observations, we proposed two heuristic criteria for designing token-level adaptive objectives based on the target token frequencies. Last, we presented two specific forms for different application scenarios according to the criteria. Our method yields consistent improvements in translation quality on ZH-EN, EN-RO, and EN-DE translation tasks, especially on sentences that contain more low-frequency tokens, where we obtain BLEU increases of 1.68, 1.02, and 0.52 over the baseline, respectively. Further analyses show that our method can also improve the lexical diversity of translation.
Our contributions can be summarized as follows:
• We analyzed the performance of the existing adaptive objectives when they were applied to NMT. Based on our observations, we proposed two heuristic criteria for designing token-level adaptive objectives and presented two specific forms to alleviate the problem brought by the token imbalance phenomenon.
• The experimental results validate that our method can improve not only the translation quality, especially on those low-frequency tokens, but also the lexical diversity.

Background
In our work, we apply our method in the framework of the Transformer (Vaswani et al., 2017), which will be briefly introduced here. We denote the input sequence of symbols as $x = (x_1, \ldots, x_J)$, the ground-truth sequence as $y^* = (y^*_1, \ldots, y^*_K)$ and the translation as $y = (y_1, \ldots, y_K)$.
The Encoder & Decoder The encoder is composed of N identical layers. Each layer has two sublayers. The first sublayer is a multi-head attention unit used to compute the self-attention of the input, named the self-attention multi-head sublayer, and the second one is a fully connected feed-forward network, named the FNN sublayer. Both sublayers are followed by a residual connection operation and a layer normalization operation. The input sequence $x$ is first converted to a sequence of vectors, each of which is the sum of the word embedding and the position embedding of the source word $x_j$. Then, this input sequence of vectors is fed into the encoder, and the output of the N-th layer is taken as the source hidden states. The decoder is also composed of N identical layers. In addition to the same two kinds of sublayers as in each encoder layer, a third cross-attention sublayer is inserted between them, which performs multi-head attention over the output of the encoder. The final output of the N-th layer gives the target hidden states $S = [s_1; \ldots; s_K]$, where $s_k$ is the hidden state of $y_k$.
Table 2: BLEU on the validation set of the Chinese-English translation task. 'Low' is the subset of the validation set which contains more low-frequency tokens while 'High' contains more high-frequency tokens.
The Objective The model is optimized by minimizing a cross-entropy loss against the ground-truth sequence $y^*$, taken over the K positions of the target sentence.
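As an illustration, a common token-level form of this cross-entropy objective is sketched below. The conditioning notation $y^*_{<k}$ (teacher forcing on the ground-truth prefix) and the choice to sum rather than average over the K positions are assumptions made for this sketch; dividing by K would only rescale the per-sentence loss.

```latex
\mathcal{L}_{\mathrm{CE}} = -\sum_{k=1}^{K} \log p\left(y^*_k \mid y^*_{<k},\, x\right)
```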

Method
Our work aims to explore suitable adaptive objectives that can not only improve the learning of low-frequency tokens but also avoid harming the translation quality of high-frequency tokens. We first investigated two existing adaptive objectives, which were proposed for solving the token imbalance problems in other tasks, and analyzed their performance. Then, based on our observations, we introduced two heuristic criteria for designing the adaptive objective. Based on the proposed criteria, we put forward two simple but effective functional forms from different perspectives, which can be adapted to various application scenarios in NMT.

Existing Adaptive Objectives Investigation
The adaptive objective takes the form of the cross-entropy loss in which the log-probability of each target token $y_i$ is multiplied by a weight $w(y_i)$, which varies as the token frequency changes. There are some existing adaptive objectives which have been proven effective for other tasks, and applying them to NMT can help us understand what is necessary for a suitable adaptive objective for NMT.
The first objective we investigated is the form in Focal loss (Lin et al., 2017), which was proposed for solving the label imbalance problem in the object detection task: $w(y_i) = (1 - p(y_i))^{\gamma}$ (3), where $p(y_i)$ is the model's prediction probability for $y_i$. Although it does not utilize the frequency information directly, it reduces the weights of the high-frequency classes more, because they are usually easier to classify and receive higher prediction probabilities. We set γ to 1 as suggested by their experiments. We noticed that this method greatly reduced the weights of high-frequency tokens, and that the variance of the weights is large.
The second is the linear weighting function (Jiang et al., 2019), which was proposed for the dialogue response generation task. It is defined in terms of $\mathrm{Count}(y_k)$, the frequency of token $y_k$ in the training set, and the target vocabulary $V_t$. The normalized weights $w(y_i)$, which have a mean of 1, are then assigned to the target tokens. We noticed that the weights of high-frequency tokens are only slightly less than 1, and that the variance of the weights is small.
We tested these two objectives on the Chinese to English translation task, and the results on the validation set are given in Table 2. To verify their effects on the high- and low-frequency tokens, we also divided the validation set into two subsets based on the average token frequency of the sentences; these results are also given in Table 2. They show that although these two methods can bring a modest improvement in the translation of low-frequency tokens, they do considerable harm to high-frequency tokens, which has a negative impact on the overall performance.
We noted that both methods reduced the weights of the high-frequency tokens to different degrees, and we argued that, since high-frequency tokens account for a large proportion of the NMT corpus, this hinders their normal training. To validate our argument, we simply added 1 to the weighting term of focal loss: $w(y_i) = (1 - p(y_i))^{\gamma} + 1$. (5) The results are also given in Table 2 (Row 5) and indicate that this variant avoids the damage to the high-frequency tokens. The overall results indicate that reducing the weight of high-frequency tokens is not a robust way to improve the learning of low-frequency tokens during the training of NMT. Although our goal is to improve the training of low-frequency tokens, we should first ensure the training of high-frequency tokens, and then increase the weights of low-frequency tokens appropriately. Based on the above findings, we proposed the following criteria.
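The contrast between the focal-style weight and its shifted variant can be made concrete with a minimal NumPy sketch. The probabilities below are hypothetical, the function names are illustrative only, and the loss is summed over positions as an assumption; the sketch only shows that the plain focal weight falls below 1 for confidently predicted tokens, while adding 1 keeps every weight at or above 1.

```python
import numpy as np

def focal_weight(p_gold, gamma=1.0, offset=0.0):
    """Focal-style token weight (1 - p)^gamma, optionally shifted by `offset`."""
    p_gold = np.asarray(p_gold, dtype=float)
    return (1.0 - p_gold) ** gamma + offset

def weighted_nll(p_gold, weights):
    """Weighted negative log-likelihood, summed over target positions."""
    p_gold = np.asarray(p_gold, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(-(weights * np.log(p_gold)).sum())

# Hypothetical model probabilities for the gold tokens of one sentence:
# two confidently predicted tokens and one poorly predicted token.
p_gold = np.array([0.9, 0.8, 0.1])

w_focal = focal_weight(p_gold)                 # every weight is below 1
w_shifted = focal_weight(p_gold, offset=1.0)   # every weight is at least 1

print("focal weights:  ", w_focal)
print("shifted weights:", w_shifted)
print("losses:", weighted_nll(p_gold, w_focal), weighted_nll(p_gold, w_shifted))
```

With the plain focal weight, the two confident tokens contribute almost nothing to the loss, which mirrors the observation that down-weighting well-predicted (typically high-frequency) tokens can starve them of training signal.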

Heuristic Criteria for Token Weighting
We proposed two heuristic criteria for designing the token-level training weights:
Minimum Weight Ensurence. The training weight of any token in the target vocabulary should be equal to or bigger than 1, which can be described as: $w(y_k) \geq 1, \ \forall y_k \in V_t$. (6) Although we can force the model to pay more attention to low-frequency tokens by shrinking the weights of high-frequency tokens, the previous analyses have shown that the training performance is more sensitive to changes in the weights of high-frequency tokens, due to their large proportion in the training set. A relatively small decrease in the weights of high-frequency tokens will prevent the generation probabilities of ground-truth tokens from ascending continually, which may result in an obvious degradation of the overall performance. Therefore, we ensure that all the token weights are equal to or bigger than 1, considering both training stability and design convenience.
Weights Expectation Range Control. Once the first criterion is satisfied, high-frequency tokens can already be well learned without any extra attention, so low-frequency tokens can be assigned higher weights. Meanwhile, we also need to ensure that the weights of low-frequency tokens are not too large, or they will hurt the training of high-frequency tokens. Therefore, the expectation of the training weights on the whole training set should be in $[1, 1 + \delta]$, where the expectation runs over the target vocabulary, $|V_t|$ denotes the size of the target vocabulary, and $\delta$ is a relatively small number compared with 1. A larger weight expectation means we allocate larger weights to low-frequency tokens, which risks the damage described above; an appropriate weight expectation, as defined in this criterion, can help improve the overall performance.
The two criteria proposed here are not the only options for NMT, but the adaptive objective satisfying these two criteria can improve not only the translation performance of low-frequency tokens but also the overall performance based on our experimental observations.
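The two criteria can be checked mechanically for any candidate weight table. The sketch below is only an illustration: the function name and the toy numbers are hypothetical, and it assumes that "expectation on the whole training set" means the count-weighted mean of the per-token weights.

```python
import numpy as np

def check_criteria(counts, weights, delta):
    """Check the two heuristic criteria for a table of per-token weights.

    counts  : training-set frequency of each vocabulary entry
    weights : training weight assigned to each vocabulary entry
    delta   : allowed slack above 1 for the expected weight
    """
    counts = np.asarray(counts, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Criterion 1: no token is weighted below 1.
    min_ok = bool(np.all(weights >= 1.0))
    # Criterion 2: expected weight of a token drawn from the training set
    # (assumption: tokens are drawn in proportion to their counts).
    expectation = float((counts * weights).sum() / counts.sum())
    range_ok = 1.0 <= expectation <= 1.0 + delta
    return min_ok, range_ok, expectation

# Hypothetical three-token vocabulary: frequent, medium, rare.
counts = [1000, 100, 10]
print(check_criteria(counts, [1.0, 1.5, 2.5], delta=0.1))  # satisfies both criteria
print(check_criteria(counts, [0.9, 1.0, 1.1], delta=0.1))  # violates both criteria
```

The second call shows why weights centred on a mean of 1 across the vocabulary can still fail: frequent tokens dominate the expectation, so slightly shrinking their weights pulls it below 1.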

Two Specific Adaptive Objectives
In this paper, we proposed two simple functional forms for w(y k ) heuristically based on the previous criteria and justified them with some intuitions.
Exponential: Given the target token $y_k$, we define the exponential weighting function as: $w(y_k) = A \cdot e^{-T \cdot \mathrm{Count}(y_k)} + 1$. (8) There are two hyperparameters in it, A and T, which control the shape and the value range of the function. They can be set according to the two criteria above. The plot of this weighting function is presented in Figure 1. In this case, we do not consider the factor of noisy tokens, so the weight increases monotonically as the frequency decreases. Therefore, this weighting function is suitable for cleaner training data, where the extremely low-frequency tokens only take up a small proportion.
Chi-Square: The exponential weighting function is not suitable for training data which contain many noisy tokens, because they would be assigned relatively large weights and have a bigger impact when their weights are summed together. To alleviate this problem, we proposed another form of the weighting function: $w(y_k) = A \cdot \mathrm{Count}^2(y_k) \, e^{-T \cdot \mathrm{Count}(y_k)} + 1$. (9) The form of this function is similar to that of the chi-square distribution, so we named it chi-square. The plot of this weighting function is presented in Figure 1. We can see from the plot that the weight at first increases as the frequency decreases. Then, after a specific frequency threshold, which is decided by the hyperparameter T, the weight decreases as the frequency decreases. In this case, the most frequent tokens and the extremely rare tokens, which could be noise, will all be assigned small weights. Meanwhile, the middle-frequency words will have larger weights. Most of them are meaningful and valuable for translation but cannot be well learned with an equal-weighted objective function. This form of weighting function is suitable for noisier training data.
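The shapes of the two weighting functions can be sketched in a few lines of NumPy. The exponential form is assumed here to be $A \cdot e^{-T \cdot \mathrm{Count}} + 1$, by analogy with the chi-square form above; the values of A and T and the frequency grid are illustrative only. For the chi-square form, setting the derivative of $\mathrm{Count}^2 e^{-T \cdot \mathrm{Count}}$ to zero gives a peak at $\mathrm{Count} = 2/T$, which is the threshold below which the weight starts to fall again.

```python
import numpy as np

def exp_weight(count, A, T):
    """Exponential form: A * exp(-T * count) + 1."""
    count = np.asarray(count, dtype=float)
    return A * np.exp(-T * count) + 1.0

def chi_square_weight(count, A, T):
    """Chi-square form: A * count^2 * exp(-T * count) + 1."""
    count = np.asarray(count, dtype=float)
    return A * count**2 * np.exp(-T * count) + 1.0

# Hypothetical normalized frequencies (for example, raw count divided by a
# reference count), on an evenly spaced grid.
freq = np.linspace(0.01, 10.0, 1000)
A, T = 1.0, 0.5  # illustrative values only, not tuned settings

w_exp = exp_weight(freq, A, T)
w_chi = chi_square_weight(freq, A, T)

# The exponential weight grows monotonically as frequency shrinks, while the
# chi-square weight peaks near 2 / T and decays on both sides.
peak = freq[np.argmax(w_chi)]
print("chi-square peak at", peak, "expected near", 2.0 / T)
```

Both functions stay at or above 1 for any non-negative A, so the first criterion holds by construction; A and T then only need to be chosen so that the expected weight stays within the allowed range.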

Data Preparation
ZH→EN. The training data consists of 1.25M sentence pairs from LDC corpora, with 27.9M Chinese words and 34.5M English words, respectively. The data set MT02 was used for validation and MT03, MT04, MT05, MT06, and MT08 were used for testing. We tokenized and lowercased English sentences using the Moses scripts, and segmented the Chinese sentences with the Stanford Segmenter. The two sides were further segmented into subword units using Byte-Pair Encoding (BPE) (Sennrich et al., 2016) with 30K merge operations separately.
EN→RO. We used the preprocessed version of the WMT2016 English-Romanian dataset released by Lee et al. (2018), which includes 0.6M sentence pairs. We used newsdev2016 for validation and newstest2016 for testing. The two languages shared the same vocabulary generated with 40K merge operations of BPE.
EN→DE. The training data is from WMT2016 and consists of about 4.5M sentence pairs with 118M English words and 111M German words. We chose newstest2013 for validation and newstest2014 for testing. BPE with 32K merge operations was applied to both sides jointly.

Systems
We used the open-source toolkit Fairseq-py (Edunov et al., 2017) released by Facebook as our Transformer system.
• Baseline. The baseline system strictly followed the base model configuration in Vaswani et al. (2017). Since our method further trains a pre-trained model at a low learning rate, we also trained another baseline model following the same procedure as our methods, except that all the target tokens share equal weights in the objective, denoted as Baseline-FT.
• Fine Tuning (Luong and Manning, 2015). This model was first trained with all the training sentence pairs and then further trained with sentences containing more low-frequency tokens. To select sentences containing more low-frequency tokens, the method in Platanios et al. (2019) was adopted as our judging metric with a small modification (Eq. 10): we added a factor $\frac{1}{I}$, where $I$ is the sentence length, to eliminate the influence of sentence length. All the target sentences were ranked by this metric in ascending order and the bottom one third of the training sentences were chosen as the in-domain data. This method tries to utilize frequency information at the sentence level, whereas our work uses it at the token level.
• Sampler (Chu et al., 2017). This method oversampled the sentences containing more low-frequency tokens, as filtered by Eq. 10, three times and then concatenated them with the rest of the training data. Thus the NMT model is trained with more low-frequency tokens in every epoch.
• Entropy Regularization (ER) (Pereyra et al., 2017). This method was proposed for solving the overconfidence problem; it adds a confidence penalty term, scaled by $\alpha$, to the original objective. It is known that token imbalance is one of the causes of the overconfidence problem (Jiang and de Rijke, 2018), so this method may also alleviate the token imbalance problem. We varied $\alpha$ from 0.05 to 0.4 and chose the best value according to the results on the validation sets for different languages. Since the label smoothing applied in the vanilla Transformer model has a similar effect on the output, we removed it from the model when we tested this method.
• Linear (Jiang et al., 2019). This method was proposed for solving the token imbalance problem in the dialogue response generation task. The normalized weights, which had a mean of 1, were applied to the training objective.
• Our Exp. This system was first trained with the normal objective (Equation 1), where all the target tokens have the same training weights. Then the model was further trained with the adaptive objective at a low learning rate. The weights were produced by the Exponential form (Equation 8). For computing stability, we used $\mathrm{Count}(y_k) / C_{median}$ instead of $\mathrm{Count}(y_k)$ in the weighting function, where $C_{median}$ is the median of the token frequency.
• Our K2. This system was trained following the same procedure as system Our Exp, except that the training weights were produced by the Chi-Square form (Equation 9).
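For the Entropy Regularization baseline, the confidence penalty of Pereyra et al. (2017) subtracts a multiple of the output entropy from the cross-entropy loss. The following NumPy sketch shows that idea for one sentence; the function name, the toy distributions, and the summation over positions are assumptions for illustration, and it expects strictly positive probabilities so that the logarithm is defined.

```python
import numpy as np

def confidence_penalty_loss(probs, gold, alpha):
    """Cross-entropy minus alpha times the entropy of each output distribution.

    probs : array of shape (K, V); row k is the model distribution at step k
            (assumed strictly positive)
    gold  : length-K sequence of gold token indices
    alpha : strength of the confidence penalty
    """
    probs = np.asarray(probs, dtype=float)
    gold = np.asarray(gold)
    # Negative log-likelihood of the gold token at each step.
    nll = -np.log(probs[np.arange(len(gold)), gold])
    # Entropy of the full output distribution at each step.
    entropy = -(probs * np.log(probs)).sum(axis=1)
    return float((nll - alpha * entropy).sum())

# Hypothetical two-step, two-word example.
probs = [[0.5, 0.5], [0.9, 0.1]]
gold = [0, 0]
print(confidence_penalty_loss(probs, gold, alpha=0.0))  # plain cross-entropy
print(confidence_penalty_loss(probs, gold, alpha=0.1))  # rewards higher entropy
```

Because the entropy term is largest for flat distributions, the penalty discourages the model from placing nearly all probability mass on a single (often high-frequency) token.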
The translation quality was evaluated by 4-gram BLEU (Papineni et al., 2002) with the multi-bleu.pl script. Besides, we used beam search with a beam size of 4 and a length penalty of 0.6 during the decoding process.

Hyperparameters
There are two hyperparameters in our weighting functions, A and T. In our experiments, we fixed A to narrow the search space, so that the overall weight range is [1, e]. We tuned the other hyperparameter, T, on the validation data sets under the criteria proposed in section 3.2. The results are shown in Table 3. According to the results, the best hyperparameters differ across language pairs, because they are affected by the proportion of low-frequency and high-frequency words. Generally speaking, when the proportion of low-frequency words gets smaller, T should be set smaller too. The results also show that our methods readily obtain a stable improvement over the baseline system when the criteria above are followed. Finally, we used the best hyperparameters found on the validation data sets for the final evaluation on the test data sets, for example, T = 0.35 in the exponential form for ZH→EN and T = 4.00 in the chi-square form for EN→RO.

Main Results
The results are shown in Table 4. The contrast methods cannot bring stable improvements over the baseline system. They cause excessive damage to the translation of high-frequency tokens, as shown by the analysis experiments in the next section. In contrast, our methods bring stable improvements over Baseline-FT with almost no additional computing or storage expense. On the EN→RO and EN→DE translation tasks, Our Exp is more effective than Our K2, while on the ZH→EN translation task the result is reversed. The reason is that the NIST training data set contains more noisy tokens, which the Our K2 method can ignore. More analyses based on the token frequency are shown in the next section.

Effects on Translation Quality with Considering Token Frequencies
To further illustrate the effects of our method, we evaluated the performance based on the token frequency. For the ZH→EN translation task, we concatenated the MT03-08 test sets together as a big test set. For the EN→RO and EN→DE translation tasks, we just used their test sets. Each sentence was scored according to Eq. 10 and sorted in ascending order. Then the test set was divided into three subsets of equal size, denoted as HIGH, MIDDLE, and LOW, respectively. Sentences in the LOW subset contain more low-frequency tokens, while those in the HIGH subset contain more high-frequency tokens.
Table 4: BLEU scores on three translation tasks. The ∆ column shows the improvement compared to Baseline-FT. ** and * mean that the improvement over Baseline-FT is statistically significant (Collins et al., 2005) (p < 0.01 and p < 0.05, respectively). The results show that our methods can achieve significant improvements in translation quality.
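The scoring and three-way split described above can be sketched as follows. This is a minimal illustration under stated assumptions: the score is taken to be the length-normalized negative log unigram probability (the rarity measure of Platanios et al. (2019) with the added $\frac{1}{I}$ factor), every token is assumed to appear in the count table, and the function names and toy corpus are hypothetical.

```python
import math
from collections import Counter

def rarity_score(sentence, counts, total):
    """Length-normalized negative log unigram probability of a token list."""
    return -sum(math.log(counts[tok] / total) for tok in sentence) / len(sentence)

def split_by_rarity(sentences, counts):
    """Sort by ascending score and cut into HIGH, MIDDLE, and LOW thirds.

    Low scores mean frequent tokens (HIGH); any remainder goes to LOW.
    """
    total = sum(counts.values())
    ranked = sorted(sentences, key=lambda s: rarity_score(s, counts, total))
    third = len(ranked) // 3
    return ranked[:third], ranked[third:2 * third], ranked[2 * third:]

# Hypothetical toy corpus; a real setup would count the BPE tokens of the
# training targets and score the test sentences with those counts.
corpus = [
    ["the", "coalmine", "mine"],
    ["the", "the", "the"],
    ["the", "mine", "the"],
]
counts = Counter(tok for sent in corpus for tok in sent)
high, middle, low = split_by_rarity(corpus, counts)
print("HIGH:", high)
print("MIDDLE:", middle)
print("LOW:", low)
```

Dividing by the sentence length keeps long sentences from being ranked as rare merely because they contain more tokens.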

ZH→EN
The results are given in Table 5 and Table 6. The contrast methods outperform Baseline-FT on the LOW subset but are worse on the HIGH and MIDDLE subsets, which indicates that their gains on the translation of low-frequency tokens come at the expense of the translation of high-frequency tokens. In contrast, both of our methods not only bring a significant improvement on the LOW subset but also achieve a modest improvement on the HIGH and MIDDLE subsets. It can be concluded that our methods improve the translation of low-frequency tokens without hurting the translation of high-frequency tokens.

Effects on Translation Quality with Different BPE Sizes
It is known that the BPE size has a large impact on the data distribution. Intuitively, a smaller BPE size will bring a more balanced data distribution, but it will also increase the average sentence length and neglect some token co-occurrences. To verify the effectiveness of our method with different BPE sizes, we varied the BPE size from 1K to 40K on the ZH→EN translation task. The results are shown in Figure 2. As the BPE size increases, the BLEU of the baseline rises first and then declines. Compared with the baseline systems, our method always brings improvements, and the larger the BPE size, i.e., the more imbalanced the data distribution, the larger the improvement brought by our method. In practice, the BPE size either comes from experience or is chosen by trial and error. In either case, our method brings a stable improvement.
Figure 3: The count of tokens with different frequencies in references, translations of the baseline systems and our methods on the ZH→EN translation task. The tokens are ranked by their frequencies in the training sets. The x-axis represents the frequency interval ([20%, 30%) means the frequency of the tokens is between the top 20% and 30%); the y-axis is the common logarithm of the token count in each interval.

Effects on Token Distribution and Lexical Diversity
Compared with the reference, the outputs of the vanilla NMT model contain more high-frequency tokens and have lower lexical diversity (Vanmassenhove et al., 2019b). To verify whether our methods can alleviate these problems, we did the following experiments based on the ZH→EN translation task. The tokens in the target vocabulary were first arranged in descending order according to their token frequencies. Then they were divided equally into ten intervals. Finally, we counted the number of tokens in each token frequency interval of the reference and of the translations of different systems. The results are shown in Figure 3, where we applied a common logarithm for display convenience. There is an obvious gap between Baseline-FT and the reference, and the curve of Baseline-FT is lower than the curve of the reference in every frequency interval except for the top 10%. In contrast, our methods reduce this gap, and the token distribution is closer to the real distribution. Besides, we also measured the lexical diversity of the translations with several criteria, namely, type-token ratio (TTR) (Templin, 1957), the approximation of hypergeometric distribution (HD-D) and the measure of textual lexical diversity (MTLD) (McCarthy and Jarvis, 2010). The results are given in Table 7. They show that our method can also improve the lexical diversity of the translation.

Case Study
Table 8 shows two translation examples in the ZH→EN translation direction. In the first sentence, the Baseline-FT system failed to generate the low-frequency noun 'coalmine' (frequency: 43), but generated a relatively high-frequency word 'mine' (frequency: 1155). We can see that this low-frequency token carries the central information of this sentence, and mistranslating it prevents people from understanding the sentence correctly.
In the second sentence, our methods generated the low-frequency verb 'assigned' (frequency: 841) correctly, while Baseline-FT generated a more frequent token 'match' (frequency: 1933), which reduced the translation accuracy and fluency. These examples can be part of the evidence for the effectiveness of our methods.
Table 8: Two translation examples in the ZH→EN translation direction.
Source       búduàn guānbì nàxiē wūrǎn huánjìng de méikuàng .
Reference    those coalmines pollute the environment should be continuously shut down .
Baseline-FT  continually close down those mines that pollute the environment .
Our Exp      those coalmines that pollute the environment should be continuously closed.
Our K2       those coalmines that pollute the environment should be continuously closed.
Source       yǐhòu kěyǐ gěi wǒ dāndú pèi jiān bàngōngshì .
Reference    an exclusive office could be assigned me later on .
Baseline-FT  later i could match my office alone .
Our Exp      i could be assigned an office alone later .
Our K2       later i could be assigned an office alone .
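Of the lexical diversity measures used in the analysis above, the type-token ratio is the simplest: the number of distinct tokens divided by the total number of tokens. A minimal sketch follows; the function name and the toy token lists are hypothetical, and HD-D and MTLD are not covered here.

```python
def type_token_ratio(tokens):
    """Number of distinct tokens divided by the total number of tokens."""
    if not tokens:
        raise ValueError("TTR is undefined for an empty token list")
    return len(set(tokens)) / len(tokens)

# Hypothetical outputs: one that repeats frequent tokens, one that does not.
repetitive = ["the", "mine", "the", "mine"]
varied = ["the", "coalmine", "a", "mine"]
print(type_token_ratio(repetitive), type_token_ratio(varied))
```

TTR is sensitive to text length, so it is usually compared between outputs of similar size, such as different systems translating the same test set.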

Related Work
Rare Word Translation. Rare word translation is one of the key challenges for NMT. Word-level NMT models are limited in handling a large vocabulary because of the training complexity and computing expense. Some work tries to solve this problem by maintaining phrase tables or a back-off vocabulary (Luong et al., 2015; Jean et al., 2015; Li et al., 2016). Subword-based NMT (Sennrich et al., 2016; Luong and Manning, 2016; Wu et al., 2016) greatly reduces the size of the vocabulary and has gradually become the mainstream technology. Gowda and May (2020) gave a detailed analysis of the effects of the BPE size on the data distribution and translation quality. Some recent work tried to further improve the translation of rare words with the help of a memory network or a pointer network (Zhao et al., 2018; Pham et al., 2018). In contrast, our methods can improve the translation performance without extra cost and can be combined with other techniques.
Class Imbalance. Class imbalance means that the total number of examples of some classes is far smaller than that of other classes. This problem can be observed in various tasks (Wei et al., 2013; Johnson and Khoshgoftaar, 2019). In NMT, the class imbalance problem might be the underlying cause of, among others, the gender-biased output problem (Vanmassenhove et al., 2019a), the inability of MT systems to handle morphologically richer languages correctly (Passban et al., 2018), or the exposure bias problem (Ranzato et al., 2016; Shao et al., 2018; Zhang et al., 2019). Methods that try to solve this can be divided into two types. The data-based methods (Baloch and Rafi, 2015; Ofek et al., 2017) make use of over- and under-sampling to reduce the imbalance. The algorithm-based methods (Zhou and Liu, 2005; Lin et al., 2017) give extra reward to different classes. Our method is algorithm-based and brings no extra cost.
Word Frequency-based Methods. Some work also makes use of word frequency information to help learning, such as in word segmentation (Sun et al., 2014) and term extraction (Frantzi et al., 1998; Vu et al., 2008). In NMT, word frequency information is used for curriculum learning (Kocmi and Bojar, 2017; Platanios et al., 2019) and domain adaptation data selection (Wang et al., 2017; Zhang and Xiong, 2018; Gu et al., 2019). Wang et al. (2020) analyzed the miscalibration problem on low-frequency tokens. Jiang et al. (2019) proposed a linear weighting function to solve the word imbalance problem in the dialogue response generation task. Compared with it, our method is more suitable for NMT.

Conclusion
In this work, we focus on the token imbalance problem of NMT. We show that the output of vanilla NMT contains more high-frequency tokens and has lower lexical diversity. To alleviate this problem, we investigated existing adaptive objectives for other tasks and then proposed two heuristic criteria based on the observations. Next, we gave two simple but effective forms based on the criteria, which can assign appropriate training weights to target tokens. The final results show that our methods can achieve significant improvement in performance, especially on sentences that contain more low-frequency tokens. Further analyses show that our method can also improve the lexical diversity.