Data Rejuvenation: Exploiting Inactive Training Examples for Neural Machine Translation

Large-scale training datasets lie at the core of the recent success of neural machine translation (NMT) models. However, the complex patterns and potential noise in the large-scale data make training NMT models difficult. In this work, we explore identifying the inactive training examples which contribute less to the model performance, and show that the existence of inactive examples depends on the data distribution. We further introduce data rejuvenation to improve the training of NMT models on large-scale datasets by exploiting inactive examples. The proposed framework consists of three phases. First, we train an identification model on the original training data, and use it to distinguish inactive examples from active examples by their sentence-level output probabilities. Then, we train a rejuvenation model on the active examples, which is used to re-label the inactive examples with forward-translation. Finally, the rejuvenated examples and the active examples are combined to train the final NMT model. Experimental results on the WMT14 English-German and English-French datasets show that the proposed data rejuvenation consistently and significantly improves performance for several strong NMT models. Extensive analyses reveal that our approach stabilizes and accelerates the training process of NMT models, resulting in final models with better generalization capability.


Introduction
Neural machine translation (NMT) is a data-hungry approach, which requires a large amount of data to train a well-performing NMT model (Koehn and Knowles, 2017). However, the complex patterns and potential noise in the large-scale data make training NMT models difficult. To alleviate this problem, several approaches have been proposed to better exploit the training data, such as curriculum learning (Platanios et al., 2019), data diversification (Nguyen et al., 2019), and data denoising (Wang et al., 2018).
In this paper, we explore an interesting alternative: reactivating the inactive examples in the training data of NMT models. By definition, inactive examples are the training examples that contribute only marginally to, or even harm, the performance of NMT models. Concretely, we use the sentence-level output probability (Kumar and Sarawagi, 2019) assigned by a trained NMT model to measure the activeness level of training examples, and regard the examples with the lowest probabilities as inactive examples ( §3.1). Experimental results show that removing the 10% most inactive examples can marginally improve translation performance. In addition, we observe a high overlapping ratio (e.g., around 80%) of the most inactive and active examples across random seeds, model capacities, and model architectures ( §4.2). These results provide empirical support for our hypothesis that inactive examples exist in large-scale datasets independently of specific NMT models, depending instead on the data distribution itself.
We further propose data rejuvenation to rejuvenate the inactive examples to improve the performance of NMT models. Specifically, we train a rejuvenation model on the active examples and use it to re-label the inactive examples with forward-translation ( §3.2). Experimental results show that the data rejuvenation approach consistently and significantly improves performance for SOTA NMT models (e.g., LSTM (Domhan, 2018), TRANSFORMER (Vaswani et al., 2017), and DYNAMICCONV (Wu et al., 2019)) on the benchmark WMT14 English-German and English-French datasets ( §4.4). Encouragingly, our approach is also complementary to existing data manipulation methods (e.g., data diversification (Nguyen et al., 2019) and data denoising (Wang et al., 2018)), and combining them can further improve performance.
Finally, we conduct extensive analyses to better understand the inactive examples and the proposed data rejuvenation approach. Quantitative analyses reveal that the inactive examples are more difficult to learn than active ones, and rejuvenation can reduce the learning difficulty ( §5.1). The rejuvenated examples stabilize and accelerate the training process of NMT models ( §5.2), resulting in final models with better generalization capability ( §5.3).
The main contributions of this work are as follows:
• Our study demonstrates the existence of inactive examples in large-scale translation datasets, which mainly depends on the data distribution.
• We propose a general framework to rejuvenate the inactive examples to improve the training of NMT models.

Related Work
Data Manipulation. Our work is closely related to previous studies on manipulating the training data of NMT models, which exploit the original training data without augmenting additional data. For example, the data denoising approach (Wang et al., 2018) aims to identify and clean noisy training examples. Data diversification (Nguyen et al., 2019) diversifies the training data by applying forward-translation (Zhang and Zong, 2016) to the source side of the parallel data, or back-translation (Sennrich et al., 2016a) to the target side in the reverse translation direction. Our approach is complementary to theirs, and using them together can further improve translation performance (Table 4). Another distantly related direction is to simplify the source sentences so that a black-box machine translation system can better translate them (Mehta et al., 2020), which is out of the scope of this work. Other related studies schedule the training data with curriculum learning (Kocmi and Bojar, 2017; Zhang et al., 2018; Platanios et al., 2019; Liu et al., 2020b).

Methodology

Figure 1 shows the framework of the data rejuvenation approach, which introduces two models: an identification model and a rejuvenation model. There are many possible ways to implement the general idea of data rejuvenation. The aim of this paper is not to explore this whole space but simply to show that one fairly straightforward implementation works well and that data rejuvenation helps.
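To make the three-phase framework concrete, the following is a minimal end-to-end sketch. `train_nmt`, `.score`, and `.translate` are placeholders of our own (not the paper's API) standing in for real NMT training, sentence-level scoring, and beam-search decoding:

```python
def data_rejuvenation(train_data, train_nmt, ratio=0.1):
    """Three-phase data rejuvenation pipeline.

    train_data: list of (source, target) sentence pairs.
    train_nmt:  placeholder for a real NMT training routine; it returns a
                model with .score(src, tgt) -> sentence-level log-prob and
                .translate(src) -> forward-translated target (assumptions).
    """
    # Phase 1: identification -- score every pair with a model trained on
    # the full data, label the bottom `ratio` fraction as inactive.
    ident = train_nmt(train_data)
    ranked = sorted(train_data, key=lambda p: ident.score(*p))
    k = int(len(train_data) * ratio)
    inactive, active = ranked[:k], ranked[k:]
    # Phase 2: rejuvenation -- train on active examples only, then
    # re-label the inactive sources by forward-translation.
    rejuv = train_nmt(active)
    rejuvenated = [(src, rejuv.translate(src)) for src, _ in inactive]
    # Phase 3: the final model is trained on active + rejuvenated examples.
    return active + rejuvenated

# Toy usage with a stub model in place of real NMT training/decoding.
class _StubModel:
    def score(self, src, tgt):   # pretend longer targets are less confident
        return -len(tgt)
    def translate(self, src):    # pretend decoding; real code would beam-search
        return src.upper()

data = [("a b", "x x"), ("c d", "y"), ("e f", "z z z z z")]
final_data = data_rejuvenation(data, lambda examples: _StubModel(), ratio=0.34)
```

The stub makes the control flow visible: the longest-target pair is labelled inactive, re-labelled by the stub "rejuvenation model", and merged back with the active pairs.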

Identification Model
We describe a simple heuristic to implement the identification model by leveraging the output probabilities of NMT models. The training objective of an NMT model is to maximize the log-likelihood of the training data \{(x^n, y^n)\}_{n=1}^{N}:

\mathcal{L}(\theta) = \sum_{n=1}^{N} \log P(y^n | x^n; \theta).   (1)

The trained NMT model assigns a sentence-level probability P(y|x) to each sentence pair (x, y), indicating the confidence of the model to generate the target sentence y from the source sentence x (Kumar and Sarawagi, 2019; Wang et al., 2020). Intuitively, if a training example has a low sentence-level probability, it is less likely to provide useful information for improving model performance, and is thus regarded as an inactive example. Therefore, we adopt the sentence-level probability as the metric to measure the activeness level of each training example:

I(y|x) = \frac{1}{T} \sum_{t=1}^{T} \log P(y_t | y_{<t}, x),   (2)

where T is the number of target words in the training example. I(y|x) is normalized by the length of the target sentence y to avoid a length bias. We train an NMT model on the original training data, use it to score each training example, and treat a certain percentage of the training examples with the lowest sentence-level probabilities as inactive examples.
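The scoring and labelling step can be sketched as follows; the per-token log-probabilities would come from the trained identification model, and here are toy values:

```python
def activeness(token_logprobs):
    """I(y|x): sum of target-token log-probabilities divided by the
    target length T (length normalization avoids a bias toward short
    sentences)."""
    return sum(token_logprobs) / len(token_logprobs)

def split_by_activeness(examples, scores, ratio=0.1):
    """Treat the `ratio` fraction of examples with the lowest
    sentence-level scores as inactive; the rest are active."""
    order = sorted(range(len(examples)), key=lambda i: scores[i])
    k = int(len(examples) * ratio)
    inactive = [examples[i] for i in order[:k]]
    active = [examples[i] for i in order[k:]]
    return active, inactive

# Toy usage: per-token log-probs as a trained NMT model might assign them.
examples = ["pair_a", "pair_b", "pair_c"]
scores = [activeness(lp) for lp in (
    [-0.1, -0.2],         # high confidence
    [-3.0, -4.0, -5.0],   # low confidence -> inactive
    [-0.5, -0.6],
)]
active, inactive = split_by_activeness(examples, scores, ratio=0.34)
```

Note that length normalization matters here: without dividing by T, long sentences would dominate the bottom of the ranking regardless of per-token confidence.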

Rejuvenation Model
Inspired by recent successes of data augmentation for NMT, we adopt the widely-used back-translation (Sennrich et al., 2016a) and forward-translation (Zhang and Zong, 2016) approaches to implement the rejuvenation model, which re-labels the identified inactive examples.

Experimental Setup

Data. We conducted experiments on the benchmark WMT14 English-German (En⇒De) and English-French (En⇒Fr) datasets, and applied byte-pair encoding (BPE) (Sennrich et al., 2016b) with 32K merge operations for both language pairs. The experimental results are reported in case-sensitive BLEU score (Papineni et al., 2002).
Model. We validated our approach on three representative NMT architectures: • LSTM (Domhan, 2018), which is implemented in the TRANSFORMER framework.
• TRANSFORMER (Vaswani et al., 2017), which is based solely on attention mechanisms.
• DYNAMICCONV (Wu et al., 2019), which is implemented with lightweight and dynamic convolutions and performs competitively with the best reported TRANSFORMER results.
We adopted the open-source toolkit Fairseq (Ott et al., 2019) to implement the above NMT models, and followed the settings in the original works to train them. In brief, we trained the LSTM model for 100K steps with 32K (4096 × 8) tokens per batch. For TRANSFORMER, we trained 100K and 300K steps with 32K tokens per batch for the BASE and BIG models, respectively. We trained the DYNAMICCONV model for 30K steps with 459K (3584 × 128) tokens per batch. We selected the model with the best perplexity on the validation set as the final model. We first conducted ablation studies on the identification model ( §4.2) and the rejuvenation model ( §4.3) on the WMT14 En⇒De dataset with TRANSFORMER-BASE. Then we reported the translation performance on different model architectures and language pairs, as well as the comparison with previous studies ( §4.4).

Identification of Inactive Examples
In this section, we investigated the reasonableness and consistency of the identified inactive examples.
Identified Inactive Examples. As aforementioned, we ranked the training examples according to the sentence-level output probability (i.e., confidence) assigned by a trained NMT model, following Wang et al. (2020), and split them into 10 equal-sized data bins by this ranking.

Consistency across Model Variants. We then examined the consistency of the identified inactive examples across three factors that may affect the performance of NMT models: 1) random seeds for TRANSFORMER-BASE: "1", "12", "123", "1234", and "12345"; 2) model capacity for TRANSFORMER: TINY (3 × 256), BASE (6 × 512), and BIG (6 × 1024); and 3) model architectures: the aforementioned architectures in Section 4.1. For each data bin, we calculated the ratio of examples that are shared by different model variants (e.g., different random seeds). Generally, a high overlapping ratio denotes that the identified examples are agreed upon by different models, which suggests the examples are not model-specific. Figure 4 depicts the results. As expected, there is always a high overlapping ratio (over 80%) for the most inactive examples (i.e., the 1st data bin) across model variants and language pairs. The high consistency of the identified inactive examples demonstrates that the proposed identification is invariant to specific models, and depends on the data distribution itself.
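The overlapping ratio used above can be computed as follows; the confidence scores are toy values standing in for two model variants' sentence-level probabilities:

```python
def overlap_ratio(scores_a, scores_b, bin_frac=0.1):
    """Ratio of examples shared by the most-inactive bins identified
    under two model variants (e.g., two random seeds). Each score list
    holds one sentence-level confidence per training example."""
    n = len(scores_a)
    k = max(1, int(n * bin_frac))
    bottom = lambda s: set(sorted(range(n), key=lambda i: s[i])[:k])
    return len(bottom(scores_a) & bottom(scores_b)) / k

# Toy usage: two seeds assign similar confidences to 10 examples, and
# both agree that examples 2 and 5 are the least confident.
seed1 = [0.9, 0.8, 0.10, 0.7, 0.6, 0.05, 0.5, 0.4, 0.3, 0.20]
seed2 = [0.9, 0.8, 0.15, 0.7, 0.6, 0.02, 0.5, 0.4, 0.3, 0.25]
agreement = overlap_ratio(seed1, seed2, bin_frac=0.2)
```

A ratio near 1.0 means the two variants identify essentially the same inactive set, i.e., the identification is a property of the data rather than of the model.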

Rejuvenation of Inactive Examples
In this section, we evaluated the impact of different components on the rejuvenation model.

Ratio of Examples Labelled as Inactive.
After all examples were assigned a sentence-level probability by the identification model, we labelled the R% of examples with the lowest probabilities as inactive examples. We investigated the effect of different values of R on translation performance, as shown in Figure 5. Clearly, rejuvenating the inactive examples consistently outperforms the non-rejuvenated counterpart, demonstrating the necessity of data rejuvenation. Concerning the rejuvenation model, the BLEU score decreases as R increases. As shown in Table 2, removing 10% random examples inversely harms the translation performance, and rejuvenating them leads to a further decrease in performance. In contrast, the proposed data rejuvenation improves performance as expected. These results provide empirical support for our claim that the improvement comes from the proposed data rejuvenation rather than from forward-translation alone.

Table 3 lists the results across model architectures and language pairs. Our TRANSFORMER models achieve better results than those reported in previous work (Vaswani et al., 2017), especially on the large-scale En⇒Fr dataset (e.g., by more than 1.0 BLEU point). Ott et al. (2018) showed that models of larger capacity benefit from training with large batches. Analogous to DYNAMICCONV, we trained another TRANSFORMER-BIG model with 459K tokens per batch ("+ Large Batch" in Table 3) as a strong baseline. We tested statistical significance with paired bootstrap resampling (Koehn, 2004) using compare-mt. Clearly, our data rejuvenation consistently and significantly improves translation performance in all cases, demonstrating the effectiveness and universality of the proposed approach. It is worth noting that our approach achieves significant improvements without introducing any additional data or model modifications, which makes it robustly applicable to most existing NMT systems.
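The significance test can be sketched as below. This is a simplification of our own: per-sentence quality scores stand in for the corpus-level BLEU that tools like compare-mt actually recompute on each resample:

```python
import random

def paired_bootstrap(scores_sys, scores_base, n_samples=1000, seed=0):
    """Paired bootstrap resampling in the spirit of Koehn (2004):
    repeatedly resample test sentences with replacement and count how
    often the system outscores the baseline on the resampled set."""
    rng = random.Random(seed)
    n = len(scores_sys)
    wins = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_sys[i] for i in idx) > sum(scores_base[i] for i in idx):
            wins += 1
    return wins / n_samples

# Toy usage: the "system" is better on every sentence, so it wins in
# every resample (win rate 1.0, i.e., a significant improvement).
win_rate = paired_bootstrap([2.0] * 20, [1.0] * 20)
```

In practice one reports significance when the win rate exceeds a threshold such as 0.95.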

Main Results
Comparison with Previous Work. The proposed data rejuvenation approach belongs to the family of data manipulation. Accordingly, we compare it with several widely-used manipulation strategies: data diversification (Nguyen et al., 2019), and data denoising (Wang et al., 2018).
For data diversification, we used both forward-translation (FT, Zhang and Zong, 2016) and back-translation (BT, Sennrich et al., 2016a) to diversify the training data. For data denoising, a noise score is computed for each sentence pair based on the noisy and denoised models, which is used for instance sampling during training. Table 4 shows the comparison results on the WMT14 En⇒De test set. All approaches improve translation performance individually, except for data diversification with back-translation. Our approach obtains further improvement on top of these manipulation approaches, indicating that data rejuvenation is complementary to them.
In addition, we computed the overlapping ratio between the noisiest and the most inactive examples (10% of the training data) identified by the data denoising and data rejuvenation approaches, respectively. We found that only 32% of the examples are shared by the two approaches, indicating that inactive examples are not necessarily noisy examples. To better understand the characteristics of inactive examples, we provide more detailed analyses of their linguistic properties in Section 5.1.

Analysis and Discussion
In this section, we performed an extensive study to understand inactive examples and data rejuvenation in terms of linguistic properties ( §5.1), learning stability ( §5.2) and generalization capacity ( §5.3). We also investigated the strategy to speed up the pipeline of data rejuvenation ( §5.4). Unless otherwise stated, all experiments were conducted on the En⇒De dataset with TRANSFORMER-BASE.

Linguistic Properties

In this section, we investigated the linguistic properties of the identified inactive examples. We explored three types of properties: frequency rank, coverage, and uncertainty. Frequency rank measures the rarity of words, which is calculated for the target words since the proposed data rejuvenation method modifies the target side of the training examples. Coverage measures the ratio of source words being aligned by any target word. Uncertainty measures the level of multi-modality of a parallel corpus (Zhou et al., 2019). These properties reflect how difficult the training examples are for NMT models to learn. Figure 6 depicts the results. As seen, the linguistic properties consistently suggest that inactive examples are more difficult than active ones. By rejuvenation, the inactive examples are transformed into much simpler patterns so that NMT models are able to learn from them.

Table 5: Margin and GSNR of NMT models trained with and without data rejuvenation on the WMT14 En⇒De dataset.

Model               | Margin | GSNR
TRANSFORMER-BASE    | 0.68   | 5.2e-3
+ Data Rejuvenation | 0.71   | 8.5e-3

Learning Stability
In this section, we studied how data rejuvenation improves translation performance from the perspective of the optimization process, as shown in Figure 7. Concerning the training loss (Figure 7(a)), our approach converges faster and fluctuates much less than the baseline model during the whole training process. Correspondingly, the BLEU score on the validation set is significantly boosted (Figure 7(b)). These results suggest that data rejuvenation is able to accelerate and stabilize the training process.

Table 6: Results of speeding up ("Rej.-Big") on the WMT14 En⇒De dataset. "Time" denotes the time of the whole process using 4 NVIDIA Tesla V100 GPUs.

Generalization Capability
In this section, we investigated how data rejuvenation affects the generalization capability of NMT models with two measures, namely Margin (Bartlett et al., 2017) and Gradient Signal-to-Noise Ratio (GSNR; Liu et al., 2020a). Table 5 lists the results, in which the GSNR values are of the same order of magnitude as those reported by Liu et al. (2020a). As seen, our approach achieves noticeably larger Margin and GSNR values, demonstrating that data rejuvenation improves the generalization capability of NMT models.

Speeding Up
The pipeline of data rejuvenation in Figure 1 is time-consuming: training the identification and rejuvenation models in sequence, as well as the scoring and rejuvenating procedures, makes the time cost of data rejuvenation more than 3X that of the standard NMT system. To save time, a promising strategy is to let the identification model also take on the responsibility of rejuvenation. Therefore, we used the TRANSFORMER-BIG model with the large-batch configuration trained on the raw data to accomplish both identification and rejuvenation. The resulting data is used to train two final models, i.e., TRANSFORMER-BIG and DYNAMICCONV. Table 6 lists the results. With almost no decrease in translation performance, the time cost of data rejuvenation is reduced by about 33%. This makes the total time cost comparable with that of data manipulation or augmentation techniques that require additional NMT systems, such as data diversification (Nguyen et al., 2019) and back-translation (Sennrich et al., 2016a). In addition, the superior performance of DYNAMICCONV (i.e., 30.4) further demonstrates the high agreement of inactive examples across architectures.

Analysis on Inactive Examples
Human Translations from Target to Source as Inactive Examples? Since forward-translation rejuvenates the inactive examples well, a natural hypothesis is that the inactive examples are human translations from the target language to the source language. The information of source-translated/natural examples is unavailable for the training examples, but fortunately is provided for the test sets. We split the test examples of En⇒De into 10 data bins according to the sentence-level probability (see Eq. (2)) assigned by the identification model (i.e., TRANSFORMER-BASE), and then calculated the ratio of source-translated examples in each bin. As seen in Figure 8, the ratios of source-translated examples in the 1st and 2nd bins (i.e., 69% and 59%) significantly exceed that in the whole test set (i.e., 1500/3003), suggesting that human translations from target to source are more likely to be inactive examples.
Case Study. By inspecting the inactive examples, we find that the target sentences tend to be paraphrases of the source sentences rather than direct translations. We provide two cases in Table 7. In the first case, the target sentence does not translate "finished the destruction of the first" in the source sentence directly, but rephrases it as "tat dann das seine und zerstörte den Rest", meaning "then did his and destroyed the rest" (i.e., the rest that was not destroyed by the First World War). As for the second case, "denied by the latter" uses the passive voice, but its corresponding phrase in the target sentence is in the active voice. These observations indicate that an inconsistent structure or expression between source and target sentences can make examples difficult for NMT models to learn well.

Second case from Table 7 (En⇒Fr):
X:  Anything denied by the latter was effectively confirmed as true .
Y:  Tout ce que démentait cette agence se révélait dans la pratique bien réel . (=>En: Everything that this agency denied turned out to be very real in practice .)
Y': Toute chose niée par ce dernier a été effectivement confirmée comme vraie . (=>En: Anything denied by the latter has actually been confirmed to be true .)

Conclusion
In this study, we propose data rejuvenation to exploit the inactive training examples in large-scale datasets for neural machine translation. The proposed data rejuvenation scheme is a general framework in which one can freely define, for instance, the identification and rejuvenation models. Experimental results on different model architectures and language pairs demonstrate the effectiveness and universality of the data rejuvenation approach. Future directions include exploring advanced identification and rejuvenation models that better reflect the learning abilities of NMT models, as well as validation on other NLP tasks such as dialogue and summarization.

A.2 Linguistic Properties

Frequency Rank. The inactive examples have higher frequency ranks on the target side than the active examples, i.e., they contain more rare words, which makes them more difficult for NMT models to learn than the active examples.
Coverage. Coverage measures the ratio of source words being aligned by any target word (Tu et al., 2016). We first train an alignment model on the training data with fast-align (Dyer et al., 2013) and force-align the source and target sentences of each subset. We then calculate the coverage of each source sentence and report the average coverage of each subset. The lower coverage of inactive examples indicates that they are not as well aligned as the active examples, which also makes them more difficult for NMT models to learn.

Uncertainty. Uncertainty measures the level of multi-modality of a parallel corpus (Zhou et al., 2019). The uncertainty of a source sentence reflects the number of its possible translations on the target side. We consider corpus-level uncertainty, which measures the complexity of each subset and is simplified as the sum of the entropies of target words conditioned on the aligned source words, denoted H(y|x = x_t). Therefore, an alignment model is also required here. To prevent uncertainty from being dominated by frequent words, we follow Zhou et al. (2019) and instead average the entropy of target words conditioned on a source word over the source word types. As shown in Figure 6, the inactive examples present higher uncertainty. That is to say, inactive examples contain more complex patterns, which are more difficult for NMT models to learn.
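Both measures can be sketched directly from word-alignment links; the alignment pairs below are toy values standing in for fast-align output:

```python
import math
from collections import Counter, defaultdict

def coverage(src_len, alignments):
    """Ratio of source positions aligned to at least one target word,
    given (source_index, target_index) links from an aligner such as
    fast-align."""
    return len({s for s, _ in alignments}) / src_len

def avg_conditional_entropy(aligned_word_pairs):
    """Corpus-level uncertainty: average over source word types x_t of
    the entropy H(y | x = x_t) of their aligned target words
    (Zhou et al., 2019)."""
    by_src = defaultdict(Counter)
    for src_word, tgt_word in aligned_word_pairs:
        by_src[src_word][tgt_word] += 1
    entropies = []
    for counts in by_src.values():
        total = sum(counts.values())
        entropies.append(-sum((c / total) * math.log(c / total)
                              for c in counts.values()))
    return sum(entropies) / len(entropies)

# Toy usage: "Haus" aligns to two different target words (entropy ln 2),
# "Katze" always to one (entropy 0).
cov = coverage(4, [(0, 0), (1, 1), (1, 2)])   # 2 of 4 source words aligned
unc = avg_conditional_entropy(
    [("Haus", "house"), ("Haus", "home"), ("Katze", "cat")])
```

Averaging the entropy over source word types (rather than summing over tokens) is what keeps frequent words from dominating the uncertainty estimate.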

A.3 Generalization Capability
Margin. Margin (Bartlett et al., 2017) is a classic concept in support vector machines, measuring the geometric distance between the support vectors and the decision boundary. To apply margin to NMT models, we follow Li et al. (2019) and compute the word-wise margin, defined as the probability of the correctly predicted word minus the maximum probability of the other word types. We compute the word-wise margin over the training set and report the averaged value.
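The word-wise margin can be sketched as follows; the probabilities are toy values standing in for a model's softmax output at one decoding step:

```python
def word_margin(probs, gold_index):
    """Word-wise margin (Li et al., 2019): probability assigned to the
    correct word minus the maximum probability among all other word
    types. Large positive values indicate confident, well-separated
    predictions."""
    others = max(p for i, p in enumerate(probs) if i != gold_index)
    return probs[gold_index] - others

# Toy softmax output over a 4-word vocabulary; the gold word is index 2.
margin = word_margin([0.1, 0.2, 0.6, 0.1], gold_index=2)
# The reported Margin averages this value over all target words
# in the training set.
```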
GSNR. The gradient signal-to-noise ratio (GSNR) metric (Liu et al., 2020a) has been shown to positively correlate with generalization performance. The GSNR of a parameter is defined as the ratio between the squared mean and the variance of its gradient over the data distribution. For NMT models, we compute the GSNR of each parameter and report the value averaged over all parameters.
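For a single parameter, the definition above reduces to a few lines; the per-example gradients here are toy scalars:

```python
def gsnr(per_example_grads):
    """GSNR of one parameter (Liu et al., 2020a): the squared mean of
    its gradient over the data divided by the gradient's variance."""
    n = len(per_example_grads)
    mean = sum(per_example_grads) / n
    var = sum((g - mean) ** 2 for g in per_example_grads) / n
    return mean ** 2 / var

# Gradients that agree in direction across examples give a high GSNR,
# which correlates with better generalization.
consistent = gsnr([1.0, 1.2, 0.8])
noisy = gsnr([1.0, -1.2, 0.8])
```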
Compared with the baseline model trained on the raw data, the model trained with our data rejuvenation has larger Margin and GSNR, suggesting that data rejuvenation is able to improve the generalization capability of the final NMT models.

A.4 Validation Performance
In Table 8, we provide details of the main results, including the translation performance on both the validation and test sets. Generally, the models with our data rejuvenation outperform the baseline models on both validation and test sets.

A.5 More Ablation Studies
Reversed Models for Identification and Rejuvenation. Some researchers may wonder whether the back-translation strategy works if reversed NMT models are adopted for both identification and rejuvenation. To study this strategy, we trained a reversed translation model on the raw data as the identification model, and another reversed translation model on the active examples as the rejuvenation model.

Fine-tuning on Inactive Examples. We also tried a more straightforward strategy to re-use the inactive examples, i.e., fine-tuning the baseline NMT model on the inactive examples. We investigated this strategy on the En⇒De dataset with a pre-trained TRANSFORMER-BASE model. Experimental results show that the model diverges after fine-tuning on the inactive examples, either individually or in combination with similar-sized active examples (the latter diverges more slowly), suggesting that fine-tuning on the inactive examples may not be a promising strategy.

A.6 Doubts on Main Results
Random Seeds. Some researchers may doubt if the improvement achieved by our approach comes from lucky random starts. To dispel this doubt, we conducted experiments on the En⇒De dataset using the TRANSFORMER-BASE model with three random seeds (i.e., 1, 12, and 123). Our approach consistently outperforms the baseline model in all cases (i.e., 27.5/28.3, 27.4/28.2, and 27.1/27.9), demonstrating the effectiveness of our approach.
Source Language. One may also question that both language pairs used in the experiments have English as the source language, which could bias the rejuvenation strategy. To demonstrate the universality of our approach across language directions, we conducted an experiment on the WMT14 De⇒En translation task. The TRANSFORMER-BASE model achieved a BLEU score of 31.2, and the data rejuvenation approach improved performance by +0.6 BLEU point.