Dynamic Data Selection and Weighting for Iterative Back-Translation

Back-translation has proven to be an effective method to utilize monolingual data in neural machine translation (NMT), and iteratively conducting back-translation can further improve the model performance. Selecting which monolingual data to back-translate is crucial, as we require that the resulting synthetic data are of high quality and reflect the target domain. To achieve these two goals, data selection and weighting strategies have been proposed, with a common practice being to select samples close to the target domain but also dissimilar to the average general-domain text. In this paper, we provide insights into this commonly used approach and generalize it to a dynamic curriculum learning strategy, which is applied to iterative back-translation models. In addition, we propose weighting strategies based on both the current quality of the sentence and its improvement over the previous iteration. We evaluate our models on domain adaptation, low-resource, and high-resource MT settings and on two language pairs. Experimental results demonstrate that our methods achieve improvements of up to 1.8 BLEU points over competitive baselines.


Introduction
Back-translation (Sennrich et al., 2016b) is an effective strategy for improving the performance of neural machine translation (NMT) using monolingual data, delivering impressive gains over already competitive NMT models (Edunov et al., 2018). The strategy is simple: given monolingual data in the target language, one can use a translation model in the opposite of the desired translation direction to back-translate the monolingual data, effectively synthesizing a parallel dataset, which is in turn used to train the final translation model. Further improvements can be obtained by iteratively repeating this process (Hoang et al., 2018) in both directions, resulting in strong forward and backward translation models, an approach known as iterative back-translation.
However, not all monolingual data are equally important. An envisioned downstream application is very often characterized by a unique data distribution. In such cases of domain shift, back-translating target domain data can be an effective strategy for obtaining a better in-domain translation model. One common strategy is to select samples that are both (1) close to the target distribution and (2) dissimilar to the average general-domain text (Moore and Lewis, 2010). However, as depicted in Figure 1, this method is not ideal because the second objective could bias the selection towards sentences far from the center of the target distribution, potentially leading to a non-representative set of sentences.
Even if we could select all in-domain monolingual data, the back-translation model would still not be suited for translating them because it has not been trained on in-domain parallel data and thus the back-translated data will be of poor quality. As we demonstrate in the experiments, the quality of the back-translated data can have a large influence on the final model performance.
To achieve the two goals of both selecting target-domain data and back-translating them with high quality, in this paper, we propose a method to combine dynamic data selection with weighting strategies for iterative back-translation. Specifically, the dynamic data selection selects subsets of sentences from a monolingual corpus at each training epoch, gradually transitioning from selecting general-domain data to choosing target-domain sentences. The gradual transition ensures that the back-translation model of each iteration can adequately translate the selected sentences, as they are close to the distribution of its current training data. We also assign weights to the back-translated data that reflect their quality, which further reduces the effect of potential noise due to low-quality translations. The proposed data selection and weighting strategies are complementary to each other, as the former focuses on domain information while the latter emphasizes the quality of sentences.
We investigate the performance of our methods in domain adaptation, low-resource and high-resource MT settings and on German-English and Lithuanian-English datasets. Our strategies demonstrate improvements of up to 1.8 BLEU points over a competitive iterative back-translation baseline and up to 1.2 BLEU points over the best static data selection strategies. In addition, our analysis reveals that the selected samples can represent the target distribution well and that the weighting strategies are effective in noisy settings.

Background: Back-Translation
Back-translation (Sennrich et al., 2016a) has proven to be an effective way of utilizing monolingual data for machine translation. Given a parallel training corpus $D_{EF}$, we first train a target-to-source machine translation model $M_{FE}$. Then, we use the pre-trained model $M_{FE}$ to translate a target-language monolingual corpus $D_F$ into the source language and obtain a synthetic parallel corpus $(D_E, D_F)$. Last, we concatenate the back-translated data $(D_E, D_F)$ with the original parallel corpus $D_{EF}$ to train a source-to-target model $M_{EF}$.
The success of back-translation has motivated researchers to investigate and extend the method (Edunov et al., 2018). The dual learning approach integrates training on parallel data and training on monolingual data via round-trip translation and the use of language models to improve output fluency. Cheng et al. (2016) attempt to augment back-translation with weighting strategies and the back-translation steps are conducted iteratively.
Because a better back-translation system will likely lead to a better synthetic corpus, Hoang et al. (2018) propose to use iterative back-translation and achieve improvements over previous state-of-the-art models in various settings. As shown in Algorithm 1, at each training step, a batch of monolingual sentences is sampled from one language and back-translated to the other language. The back-translated data is utilized to train the model in the other direction. The process is repeated in both directions. Parallel data can be mixed with the back-translated data to train the translation models.
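The loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `ToyModel`, its `translate` and `train_step` methods, and the function `iterative_back_translation` are hypothetical names introduced here, and the "translation" is a placeholder that only tags sentences.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


class ToyModel:
    """Hypothetical stand-in for an NMT model; not a real library class."""

    def __init__(self, tag: str) -> None:
        self.tag = tag
        self.training_pairs: List[Pair] = []

    def translate(self, sentences: List[str]) -> List[str]:
        # Placeholder "translation": prefix each sentence with the model tag.
        return [f"[{self.tag}] {s}" for s in sentences]

    def train_step(self, pairs: List[Pair]) -> None:
        # Placeholder update: record the pairs the model was trained on.
        self.training_pairs.extend(pairs)


def iterative_back_translation(
    m_ef: ToyModel,
    m_fe: ToyModel,
    mono_e: List[str],
    mono_f: List[str],
    parallel_ef: List[Pair],
    steps: int,
    batch_size: int,
    seed: int = 0,
) -> None:
    rng = random.Random(seed)
    parallel_fe = [(f, e) for e, f in parallel_ef]
    for _ in range(steps):
        # Back-translate a batch of F sentences with M_FE, then train M_EF
        # on (synthetic E, real F) pairs mixed with the parallel data.
        batch_f = rng.sample(mono_f, min(batch_size, len(mono_f)))
        m_ef.train_step(list(zip(m_fe.translate(batch_f), batch_f)) + parallel_ef)
        # Same procedure in the opposite direction to train M_FE.
        batch_e = rng.sample(mono_e, min(batch_size, len(mono_e)))
        m_fe.train_step(list(zip(m_ef.translate(batch_e), batch_e)) + parallel_fe)
```

In each direction the real monolingual sentence sits on the target side of the synthetic pair, so the model being trained always learns to produce genuine text.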

Methods
In this section, we first introduce our data selection and weighting strategies separately, and then illustrate how we combine the two ideas. In our problem setting, we are given two MT models $M_{EF}$ and $M_{FE}$ pretrained on parallel data $D_{EF}$, and both source and target monolingual corpora $D_E$ and $D_F$. The goal is to select and weight samples from the two monolingual corpora for back-translation, in order to best improve the performance of the two translation models.

Data Selection Strategies
We first describe a commonly used static selection strategy, and then illustrate our dynamic approach.
The method of Moore and Lewis (2010) scores each sentence $s$ by
$$\mathrm{score}(s) = H_{LM_{in}}(s) - H_{LM_{gen}}(s), \quad (1)$$
where $H_{LM_{in}}(s)$ and $H_{LM_{gen}}(s)$ represent the cross-entropy scores of $s$ measured with an in-domain and a general-domain language model (LM) respectively. Sentences with the highest scores will be selected for training. Typically, the in-domain language model $LM_{in}$ is trained with a small set of sentences in the target domain and $LM_{gen}$ is trained with all data available.

Our Two Scoring Criteria
Instead of static data selection, we propose a new curriculum strategy for iterative back-translation. Specifically, for each sentence $s$ in the monolingual corpus we measure both representativeness, i.e., how well the sentence represents the target distribution, and simplicity, i.e., how well the MT models can translate the sentence. First, we select the simplest samples for back-translation to ensure the quality of the back-translated data. As the training progresses, the model will become better at translating in-domain sentences, and we will then shift to choosing more representative examples. Formally, at each epoch $t$, we rank the corpus according to
$$\mathrm{score}(s) = \lambda(t)\,\mathrm{repr}(s) + (1 - \lambda(t))\,\mathrm{simp}(s), \quad (2)$$
where $\mathrm{repr}(s)$ and $\mathrm{simp}(s)$ denote the representativeness and simplicity of sentence $s$ respectively. We discuss the representativeness and the simplicity metrics in the following sections. The term $\lambda(t)$ balances between the two criteria and is a function of the current epoch $t$.
We adopt the square-root growing function for $\lambda$ (Platanios et al., 2019) and set
$$\lambda(t) = \min\left(1, \sqrt{t\,\frac{1 - c_0^2}{T} + c_0^2}\right), \quad (3)$$
where $c_0$ is the initial value and $T$ denotes the time after which we solely select representative samples. $\lambda$ increases relatively quickly at first and its growth then gradually slows as training progresses, which suits our task: the sentences selected at first are relatively simple, so we do not need to spend much time on them.
Connections to Moore and Lewis (2010). Our proposed criteria generalize Moore and Lewis (2010). The first term of Equation 1, namely $H_{LM_{in}}(s)$, measures the representativeness of data because the in-domain LM assigns low entropy to sentences that appear frequently in the target domain. The second term $H_{LM_{gen}}(s)$, on the other hand, measures the simplicity of the sentences. If $H_{LM_{gen}}(s)$ is high, it is likely that some n-grams of the sentence $s$ appear frequently in the parallel training data $D_{EF}$, indicating that the MT models will likely translate the sentence well. In other words, the sentence $s$ can provide limited additional information if $H_{LM_{gen}}(s)$ is high. Therefore, one can view Moore and Lewis (2010) as selecting the most representative and difficult sentences.
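The schedule and ranking score can be sketched as below. The sketch assumes the square-root schedule of Platanios et al. (2019), $\lambda(t) = \min(1, \sqrt{t(1 - c_0^2)/T + c_0^2})$, and a linear interpolation between the representativeness and simplicity scores. The function names `lam` and `curriculum_score` are introduced here for illustration, and the defaults `c0=0.1`, `T=5` are the values reported later in the experimental setup.

```python
import math


def lam(t: float, c0: float = 0.1, T: float = 5.0) -> float:
    """Square-root growing schedule: starts at c0 and reaches 1 at t = T."""
    return min(1.0, math.sqrt(t * (1.0 - c0 ** 2) / T + c0 ** 2))


def curriculum_score(repr_s: float, simp_s: float, t: float,
                     c0: float = 0.1, T: float = 5.0) -> float:
    """Interpolate between simplicity (early) and representativeness (late).

    Both inputs are assumed to be already normalized to [0, 1].
    """
    w = lam(t, c0, T)
    return w * repr_s + (1.0 - w) * simp_s


if __name__ == "__main__":
    for epoch in range(7):
        print(epoch, round(lam(epoch), 3))
```

Under these assumptions the score at epoch 0 is dominated by simplicity (weight 0.9), and from epoch $T$ onward only representativeness matters.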

Representativeness Metrics
We propose three approaches to measure the sentence representativeness.
In-Domain Language Model Cross-Entropy (LM-in). As in Axelrod et al. (2011) and Duh et al. (2013), we can use $H_{LM_{in}}$ to measure the representativeness of the instances. Concretely, we train a language model $LM_{in}$ with in-domain monolingual data and compute the score $\frac{1}{|s|} \sum_{t=1}^{|s|} \log P_{LM_{in}}(s_t \mid s_{<t})$ for each sentence $s$.
TF-IDF Scores (TF-IDF). The TF-IDF score is another way to perform data selection for machine translation (Kirchhoff and Bilmes, 2014). For a sentence $s = v_1, \ldots, v_N$, one can compute the term frequency $\mathrm{TF}_{s,v_n}$ and the inverse document frequency $\mathrm{IDF}_{v_n}$ for each word $v_n$. We can thus obtain the TF-IDF vector of $s$ from the products $\mathrm{TF}_{s,v_n} \cdot \mathrm{IDF}_{v_n}$. Finally, we calculate the cosine similarity between the TF-IDF vectors of $s$ and each sentence $s_{in}$ in a small in-domain dataset, and treat the maximum value as its representativeness score.
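For the TF-IDF metric, a minimal sketch of the maximum-cosine-similarity score might look as follows. The exact TF and IDF variants are not given in the text, so the length-normalized term frequency and the smoothed IDF below are assumptions, and all function names are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List

Sentence = List[str]


def idf_table(corpus: List[Sentence]) -> Dict[str, float]:
    """Smoothed inverse document frequency (an assumed variant)."""
    n = len(corpus)
    df = Counter(w for sent in corpus for w in set(sent))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}


def tfidf_vector(sent: Sentence, idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse TF-IDF vector; words missing from the IDF table get weight 0."""
    tf = Counter(sent)
    return {w: (c / len(sent)) * idf.get(w, 0.0) for w, c in tf.items()}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def representativeness(sent: Sentence, in_domain: List[Sentence],
                       idf: Dict[str, float]) -> float:
    """Maximum cosine similarity to any sentence of the in-domain set."""
    vec = tfidf_vector(sent, idf)
    return max(cosine(vec, tfidf_vector(s, idf)) for s in in_domain)
```

Taking the maximum rather than the mean means that a sentence counts as representative as soon as it closely resembles one in-domain example.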

BERT Representation Similarities (BERT).
BERT (Devlin et al., 2019) has proven to be a powerful model for learning sentence representations. Following the conclusion of Pires et al. (2019), we feed each sentence to the multilingual BERT model and average the hidden states for all the input tokens except [CLS] and [SEP] at the eighth layer to obtain the sentence representation. We then compute the cosine similarity between representations of sentence $s$ in the monolingual corpus and each sentence $s_{in}$ in a small in-domain set, and the maximum value is treated as the representativeness score.

Simplicity Metrics
In our experiments, we study two metrics for measuring the simplicity of sentences.
General-Domain Language Model Cross-Entropy (LM-gen). We train a language model $LM_{gen}$ on one side of the parallel training data $D_{EF}$. Then, for each sentence $s$ we compute the score $\frac{1}{|s|} \sum_{t=1}^{|s|} \log P_{LM_{gen}}(s_t \mid s_{<t})$.
Round-Trip BLEU (R-BLEU). Given two pretrained MT models $M_{EF}$ and $M_{FE}$, round-trip translation first translates a sentence $s$ into the other language using $M_{EF}$ and then back-translates the result using $M_{FE}$, obtaining the reconstructed sentence $s'$. The BLEU score between $s$ and $s'$ is treated as our simplicity metric. Similar ideas have been applied to filter low-quality sentences (Imankulova et al., 2017).
The representativeness and simplicity scores are each normalized separately to [0, 1] using $\frac{\mathrm{score}(s) - \mathrm{score}_{min}}{\mathrm{score}_{max} - \mathrm{score}_{min}}$, where $\mathrm{score}_{max}$ and $\mathrm{score}_{min}$ are the maximum and minimum scores. Also, both the representativeness and simplicity scores can be computed in a pre-processing step, and we only need to adjust $\lambda(t)$ in Equation 3 during training.
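The min-max normalization can be sketched as a one-off pre-processing function. The name `min_max_normalize` is illustrative, and the handling of the degenerate case where all scores are equal is an assumption, since the text does not cover it.

```python
from typing import List


def min_max_normalize(scores: List[float]) -> List[float]:
    """Rescale scores to [0, 1] via (score - min) / (max - min)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: with no spread, map every score to 0.0.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]
```

Because each metric is normalized on its own, log-probability scores (which are negative) and BLEU or cosine scores end up on a common scale before they are interpolated.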

Weighting Strategies
Next, we illustrate how we perform data weighting on the back-translated data by measuring both the current quality of the sentences and their improvement over the previous iteration.

Measuring the Current Quality
As general translation models could perform poorly on the in-domain data, it is important to downweight the examples of bad quality. To this end, we present two ways to measure the current quality of the back-translated sentences.

Encoder Representation Similarities (Enc).
We feed the source sentence $x$ and the target sentence $y$ to the encoders of $M_{EF}$ and $M_{FE}$ respectively, and average the hidden states at the final layer to obtain the representations $\mathrm{enc}_{EF}(x)$ and $\mathrm{enc}_{FE}(y)$. The cosine similarity between them is treated as the quality metric.
Agreement Between Forward and Backward Models (Agree). Inspired by Junczys-Dowmunt (2018), the second approach utilizes the agreement of the two translation models. For each sentence pair $(x, y)$, we compute the length-normalized conditional probabilities $H_{EF}(y|x)$ and $H_{FE}(x|y)$, then exponentiate the negated absolute difference between them to obtain $\exp(-|H_{EF}(y|x) - H_{FE}(x|y)|)$. Intuitively, the back-translated sentences are likely to be of poor quality if the two models disagree strongly on them.
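Both quality metrics reduce to a few lines once the model outputs are available. The sketch below assumes the encoder hidden states and the length-normalized scores have already been computed elsewhere; `mean_pool`, `cosine_similarity`, and `agreement_weight` are illustrative names, not functions from any toolkit.

```python
import math
from typing import List, Sequence


def mean_pool(hidden_states: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token hidden states into a single sentence vector."""
    n = len(hidden_states)
    return [sum(column) / n for column in zip(*hidden_states)]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Enc metric: cosine similarity of the two pooled encoder outputs."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def agreement_weight(h_ef: float, h_fe: float) -> float:
    """Agree metric: exp(-|H_EF(y|x) - H_FE(x|y)|), a value in (0, 1]."""
    return math.exp(-abs(h_ef - h_fe))
```

The agreement weight equals 1 when both models assign the pair the same length-normalized score and decays exponentially as they diverge.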

Measuring Quality Improvements
In domain adaptation, it is natural that at first the in-domain sentences are poorly translated. As training progresses, however, the quality should improve. We therefore propose a metric to measure the improvement in translation quality and combine it with the current quality metric, in order to encourage the inclusion of in-domain sentences whose translation quality has improved. Specifically, every time we obtain the quality score of sentence $s$, we store it; the next time we come across the same sentence, we compare the new quality score with the previous one and apply a clipping function that limits the weights to a reasonable range. We set $(w_{low}, w_{high})$ to $(\frac{1}{2}, 2)$.
Table caption: We report the best-performing models of only using selection strategies ("Best Selection"), only using curriculum strategies ("Best Curriculum"), only using weighting strategies ("Best Weighting") and using both the best curriculum and the best weighting strategies ("Best Curriculum + Best Weighting"). "Enc-Imp" indicates both the encoder representation similarities and the quality improvement metrics are used for weighting. The highest scores are in bold and * indicates statistical significance compared with the best baseline (p < 0.05, computed using comparemt).
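One way to realize the store-and-compare step is sketched below. The exact comparison is not recoverable from the text, so this sketch assumes the improvement is the ratio of the new to the previous quality score, clipped to $(w_{low}, w_{high})$, and that it is combined with the current quality by multiplication. Both choices, and the class name `ImprovementWeighter`, are assumptions made for illustration only.

```python
from typing import Dict


def clip(x: float, low: float, high: float) -> float:
    """Limit x to the closed interval [low, high]."""
    return max(low, min(high, x))


class ImprovementWeighter:
    """Tracks per-sentence quality and rewards sentences that improve."""

    def __init__(self, w_low: float = 0.5, w_high: float = 2.0) -> None:
        self.w_low = w_low
        self.w_high = w_high
        self.previous: Dict[str, float] = {}

    def weight(self, sentence_id: str, quality: float) -> float:
        prev = self.previous.get(sentence_id)
        self.previous[sentence_id] = quality  # remember for the next visit
        if prev is None or prev <= 0.0:
            improvement = 1.0  # first visit: no improvement signal yet
        else:
            # Assumed form: clipped ratio of new to previous quality.
            improvement = clip(quality / prev, self.w_low, self.w_high)
        # Assumed combination: current quality times improvement factor.
        return quality * improvement
```

With the range $(\frac{1}{2}, 2)$, a sentence whose quality collapses is at most halved in weight and one whose quality jumps is at most doubled, which keeps a single noisy estimate from dominating a batch.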

Overall Algorithm: Combining Curriculum and Weighting Strategies
Our final algorithm is shown in Figure 2. At each epoch, we compute the score for each sentence in the monolingual corpora using Equation 2 and select the top $p$% of the sentences, where $p$ is a hyper-parameter. Afterwards, we perform back-translation and data weighting on the selected data, then use the back-translated data to train the translation model. The process is repeated iteratively for both directions, with $\lambda$ increased at each training epoch.
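The per-epoch selection step can be sketched as follows, assuming the two scores are precomputed and normalized to [0, 1] and that they are combined by linear interpolation with the current value of $\lambda$. The function name `select_top_p` and the rounding rule for the subset size are illustrative assumptions.

```python
from typing import List


def select_top_p(repr_scores: List[float], simp_scores: List[float],
                 lam: float, p: float) -> List[int]:
    """Return the (sorted) indices of the top p% sentences for this epoch."""
    n = len(repr_scores)
    combined = [lam * r + (1.0 - lam) * s
                for r, s in zip(repr_scores, simp_scores)]
    k = max(1, int(n * p / 100))  # assumption: round down, keep at least one
    ranked = sorted(range(n), key=lambda i: combined[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    repr_scores = [0.9, 0.1, 0.5, 0.2]
    simp_scores = [0.1, 0.9, 0.6, 0.2]
    print(select_top_p(repr_scores, simp_scores, lam=0.1, p=50))
    print(select_top_p(repr_scores, simp_scores, lam=1.0, p=50))
```

Only `lam` changes between epochs, so the selected subset drifts from simple sentences to representative ones without any score being recomputed.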

Experiments on Domain Adaptation
We first conduct experiments in the domain adaptation setting, where we adapt models from a general domain to a specific domain.

Setup
Datasets. We first train the translation models on the (general-domain) WMT-14 German-English dataset, consisting of about 4.5M training sentences, then perform iterative back-translation with (in-domain) law or medical OPUS monolingual data (Tiedemann, 2012). We de-duplicate the law and medical training data and sub-sample 250K and 200K sentences respectively in both languages to obtain the monolingual corpora. The development and test sets contain 2K sentences in each domain. Byte-pair encoding (Sennrich et al., 2016b) is applied with 32K merge operations. The general-domain and in-domain language models are trained on the WMT training data and the OPUS monolingual data respectively. The OPUS development sets are used to compute the TF-IDF and BERT representativeness scores.
Models. We implement our approaches upon the Transformer-base model (Vaswani et al., 2017). Both the encoder and decoder consist of 6 layers and the hidden size is set to 512. For the translation models, weights of the top 4 layers of the encoders and bottom 4 layers of the decoders are shared between forward and backward models, as these parameters tend to be language-independent (Yang et al., 2018). We also tie the source and target word embeddings. We build 5-gram language models with modified Kneser-Ney smoothing using KenLM (Heafield, 2011).
Hyper-Parameters. $c_0$ and $T$ in Equation 3 are set to 0.1 and 5, respectively. We select 30% of the sentences with the highest score at each epoch for our curriculum methods and 50% of the sentences for the static data selection baselines.

Results
We compare our dynamic curriculum and weighting methods with three baselines: the iterative back-translation baseline, a baseline trained with only data selection strategies, and a baseline trained with only data weighting strategies. The results with the best-performing representativeness and simplicity metrics (TF-IDF and R-BLEU, respectively) in the domain adaptation setting are listed in Table 1.
Iterative Back-Translation. The iterative back-translation method is rather competitive, as it improves over the unadapted baseline by 9.6 BLEU points and over simple back-translation by 1.8 BLEU points.
Selection Strategies. We can see from the table that the best-performing selection strategy, namely selecting sentences with high TF-IDF scores, is generally effective and improves the baseline by about 0.5 BLEU points.
Curriculum and Weighting Strategies. Both our curriculum and weighting strategies outperform the unadapted and the iterative back-translation models, with the curriculum learning method achieving better performance and improving the strong iterative back-translation baseline by 1.1 BLEU points. Combining curriculum and weighting methods can further improve the performance by up to 0.5 BLEU points, demonstrating the two strategies are complementary to each other.

Choices of Metrics
We examine different choices of representativeness and simplicity metrics. The performance of different models is listed in Table 2.
Representativeness Metrics. All data selection strategies outperform the baseline, with TF-IDF, LM-diff, and BERT metrics exhibiting fairly robust performance in all settings. Due to its simplicity, we choose TF-IDF for experiments where a good in-domain development set is available.
Data Weighting Strategies. The agreement-based weighting method ("Agree") performs slightly worse than the encoder-similarity weighting strategy ("Enc"), probably because the two languages are similar and thus encoders with shared parameters can accurately measure the data quality.
Curriculum Strategies. Our curriculum strategies need to consider both representativeness and simplicity metrics. Table 2 demonstrates that TF-IDF is a better metric than other representativeness metrics in both static and dynamic data selection settings. Also, the round-trip BLEU score can be better at measuring the simplicity of sentences than LM-gen. Last, by comparing the Moore-Lewis method ("LM-diff") with our curriculum strategy ("LM-in+LM-gen"), the advantages of dynamic data selection are clear.

Analysis
In this part, we investigate how noise in the backtranslated data impacts the model performance and how many sentences we should select. We also qualitatively examine if our weighting methods assign weights appropriately.
Effect of Back-Translation Quality. We generate the back-translated data using sampling, greedy search, and beam search for iterative back-translation; the results are listed in Table 3. We find that the sampling method significantly degrades the model performance, as it introduces more noise than the other approaches, demonstrating that noise can have a negative impact in domain adaptation settings. The conclusion is similar to the findings of Edunov et al. (2018) in low-resource settings. In addition, we find that our weighting strategies are more beneficial in noisy settings.
Table 4: Examples of our weighting strategy (Enc). We use our model (Curri+Enc) at the 5K-, 10K-, 15K-th iterative back-translation step to weight sentences. The assigned weights correlate well with the BLEU scores.
[Table: example sentences grouped by R-BLEU and TF-IDF scores, each either High (≈ 1) or Low (≈ 0)]
Effect of the Percentage p. We test how many sentences should be selected at each epoch for our curriculum strategies. As shown in Figure 3, selecting 30% of the monolingual sentences achieves the best performance in general. Selecting fewer samples can discard valuable information whereas choosing more instances can introduce more noise.
Weighting Examples. We use our model (Curri+Enc) to back-translate some sentences from the monolingual corpus and Table 4 shows the weights our models assign at different training stages. It is clear that the assigned weights correlate well with the BLEU scores, demonstrating our methods can perform weighting appropriately.

Characteristics of the Selected Data
In this part, we investigate certain characteristics of the selected samples.
Lengths. Figure 4 shows the average lengths of the selected sentences in each bucket. We can see that 1) both LM-in and BERT favor long sentences, with one possible explanation being that those sentences are more likely to contain in-domain words; 2) TF-IDF does not share this feature, likely due to the IDF term; 3) sentences with high R-BLEU scores are generally short, which is unsurprising since NMT models are typically bad at translating long sentences.
Unigram Distribution Distance. We also compute the unigram distribution distance using the Hellinger distance. Concretely, we compute the unigram distributions $P$ and $Q$ for the selected data and the test set, and calculate
$$H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{V} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2},$$
where $V$ is the size of the vocabulary and $p_i$, $q_i$ are the probabilities of the $i$-th word under $P$ and $Q$. The larger the Hellinger distance is, the more dissimilar the two distributions are. Figure 4 shows that both TF-IDF and BERT match the test distribution well. Also, LM-in performs better than LM-diff, which confirms our hypothesis that the data selected by the Moore-Lewis method cannot adequately represent the target distribution.
Diversity Among Selected Data at Each Epoch. As our curriculum strategies dynamically select different subsets of data, here we examine how many new sentences are actually introduced at each epoch. We find that starting from the second epoch, 12.5%, 10.4%, 12.5%, 18.3%, and 21.5% of the selected sentences are replaced at successive epochs, and 52.5% of the monolingual sentences are selected at least once in total.
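The Hellinger distance between two unigram distributions can be sketched as follows, using its standard definition $\frac{1}{\sqrt{2}}\sqrt{\sum_i (\sqrt{p_i} - \sqrt{q_i})^2}$. The helper names are illustrative, and the vocabulary is taken to be the union of the words observed in either corpus, which is an assumption about how unseen words are handled.

```python
import math
from collections import Counter
from typing import Dict, List


def unigram_distribution(sentences: List[List[str]]) -> Dict[str, float]:
    """Relative frequency of each token across a tokenized corpus."""
    counts = Counter(tok for sent in sentences for tok in sent)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def hellinger(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Hellinger distance in [0, 1]; 0 means identical distributions."""
    vocab = set(p) | set(q)
    squared = sum(
        (math.sqrt(p.get(w, 0.0)) - math.sqrt(q.get(w, 0.0))) ** 2
        for w in vocab
    )
    return math.sqrt(squared) / math.sqrt(2.0)
```

The distance is bounded by 1, which it reaches when the two corpora share no vocabulary, so values for different selection methods can be compared directly.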
Examples. [...] contain some out-of-vocabulary words, while sentences with low TF-IDF but high R-BLEU scores are generally short and frequently include digits and single characters. Most of the sentences with both low TF-IDF and R-BLEU scores are extremely noisy and can be safely discarded.

Experiments on Low-Resource and High-Resource Scenarios
Next, we conduct experiments in both low- and high-resource scenarios over two language pairs: German-English and Lithuanian-English.

Setup
Data statistics are shown in Table 7. When the target distribution is the news domain, we train the in-domain LMs with 500K sentences from the news monolingual data. The other settings (including hyperparameters) are the same as before.

Results
The results are reported in Table 6. We find that LM-in and LM-gen form the best metric combination for curriculum strategies when the target distribution is the news domain. TF-IDF and R-BLEU are the best representativeness and simplicity metrics in all other settings.
Low-Resource Settings. In low-resource settings, iterative back-translation can improve the baseline model by a large margin, and our curriculum strategies can still outperform the strong baseline by 1.3 BLEU points. Weighting methods also generally help and our best method outperforms iterative back-translation by 1.8 BLEU points.
High-Resource Settings. In high-resource settings, our curriculum strategies improve the iterative back-translation baseline by up to 0.3 BLEU points. Data weighting strategies do not always help, probably because in high-resource settings the back-translated data is already of high quality. Our best method outperforms iterative back-translation by 0.6 BLEU points.

Related Work
Back-translation (Sennrich et al., 2016a) has proven to be successful and several extensions of it have been proposed (Cheng et al., 2016), among which iterative back-translation methods (Cotterell and Kreutzer, 2018; Hoang et al., 2018; Niu et al., 2018) have demonstrated strong empirical performance. For domain adaptation, Moore and Lewis (2010) and Axelrod et al. (2011) use language model cross-entropy differences to select data that are similar to in-domain text, which is adapted by Duh et al. (2013) to neural models. Similarly, Kirchhoff and Bilmes (2014) propose to use TF-IDF scores to select relevant samples for machine translation. van der Wees et al. (2017) propose dynamic data selection strategies for machine translation models, and Zhang et al. (2019) extend the idea to curriculum strategies. As for filtering noisy sentences, Junczys-Dowmunt (2018) proposes to utilize the agreement between forward and backward translation models and Wang et al. (2019a) propose uncertainty-based confidence estimation to improve back-translation. Wang et al. (2019b) compose dynamic domain-data selection with dynamic clean-data selection. Our methods generalize the previous strategies and our primary focus is to improve iterative back-translation.

Conclusion
In this paper, we provide a novel insight into a widely-used data selection method (Moore and Lewis, 2010) and generalize it to a curriculum strategy for iterative back-translation. We also propose data weighting methods. Extensive experiments are performed to evaluate the model performance. Analyses reveal the selected samples can represent the target domain well, and our weighting strategies benefit noisy settings the most. Future directions include experimenting on other datasets and language pairs, as well as developing better scoring criteria and score combination techniques.