Sampling and Filtering of Neural Machine Translation Distillation Data

In most neural machine translation distillation or stealing scenarios, the highest-scoring hypothesis of the target model (teacher) is used to train a new model (student). If reference translations are also available, better hypotheses (with respect to the references) can be oversampled and poor hypotheses removed or undersampled. This paper explores the landscape of sampling methods (pruning, hypothesis oversampling and undersampling, deduplication, and their combinations) with English-to-Czech and English-to-German MT models, using standard MT evaluation metrics. We show that careful oversampling, combined with the original data, leads to better performance than training only on the original data, only on the synthesized data, or on their direct combination.


Introduction
Model distillation is the process of transferring the knowledge of one or more, usually larger, models into another, usually smaller, model (Buciluǎ et al., 2006). A variation of this is training a new model so that its performance is similar to that of an already trained one. This is achieved using either the teacher's predictions (black-box) or other by-products of the teacher's computation, such as attention scores or decoder scores (grey/glass-box). Assuming access to a parallel corpus, we focus on sampling the translation hypotheses, making use not only of the teacher scores but also of their comparison to the reference.
There are various possible motivations for model distillation. The student model can be much smaller than the teacher model, which has the benefit of faster inference speed (Germann et al., 2020). It can also be used for model stealing, where an adversary tries to copy the teacher functionality. This is a practical concern for production-level MT systems (Wallace et al., 2020).
One of the approaches for knowledge distillation is to use the teacher model to generate a new dataset for the student model to train on. Having access to a trained teacher model, this approach does not require parallel data and can leverage large monolingual corpora. Reference translations, however, help with determining which of the teacher's translations are good and which are of low quality.
We focus on this approach and propose and compare several importance sampling approaches to prepare training data for student models that leverage access to reference translations. These include pruning, upsampling and undersampling, deduplication and their combination. We show that a combination of these methods improves the student performance over just using the reference or the best hypothesis (by the decoder score), which is a common distillation practice.
The experiment code is available open-source. 1

Related work
The general methodology for knowledge distillation in the form of teacher-student was proposed by Hinton et al. (2015). For the MT task specifically, Kim and Rush (2016) show that taking either the top sentence with respect to the teacher decoder score or BLEU (Papineni et al., 2002) improves the performance. Germann et al. (2020) presented student models that distil knowledge from a larger teacher model with a negligible loss in performance. They manipulate the queried data based on target sentence quality, for example by removing sentences that are not correctly recognized by a language identifier. For the parallel part of the data, they extract the best-BLEU-scoring sentence out of 8 hypotheses. Freitag et al. (2017) experiment with pruning sentences that are below some TER (Snover et al., 2006) threshold (lower is better). They further document the effect of using an ensemble of teachers and of reducing the student model size.

Methods
The evaluation of every sampling method follows a three-step process. First, the parallel corpus (Section 2.1) is translated by the teacher model (Section 2.2) for the considered translation direction. Second, new datasets are created based on the metrics; the reference is taken into consideration during hypothesis selection. Third, we train new models (students) on these datasets and measure their performance. For every source sentence, the teacher provides 12 hypotheses (the default in Marian NMT) using beam search, which we consider when composing a new dataset.

Figure 1 shows an example of the sampling process with BLEU. Twelve translations of the source are produced, and each receives a score against the provided reference. The new data contain Translation 2 three times because of its high score; Translation 12 is omitted because of its low score. This upsampling is explained in detail in Section 2.3.
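The per-sentence selection in Figure 1 can be illustrated in code. The sketch below is our own minimal illustration, not the paper's implementation: a toy unigram-F1 similarity stands in for sentence-level BLEU (to keep the example dependency-free), and the `counts` upsampling vector is hypothetical.

```python
# Toy sketch of the sampling in Figure 1: each of the teacher's
# hypotheses is scored against the reference, good ones are
# duplicated ("upsampled") and bad ones dropped. Unigram F1 stands
# in for sentence-level BLEU to avoid external dependencies.

def unigram_f1(hyp: str, ref: str) -> float:
    hyp_tokens, ref_tokens = hyp.split(), ref.split()
    overlap = len(set(hyp_tokens) & set(ref_tokens))
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def sample_hypotheses(source, hypotheses, reference, counts=(3, 1)):
    """Keep the best hypothesis counts[0] times, the second best
    counts[1] times, and drop the rest (cf. Figure 1)."""
    ranked = sorted(hypotheses,
                    key=lambda h: unigram_f1(h, reference),
                    reverse=True)
    new_data = []
    for k, hyp in zip(counts, ranked):
        new_data.extend([(source, hyp)] * k)
    return new_data
```

With a real metric such as sentence-level BLEU from Sacrebleu, only the `key` function would change.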

Data
We use the Europarl v10 parallel corpus (Koehn, 2005) for English-Czech (0.6M sentences) and English-German (1.8M sentences). The sentences are longer (23 target words per sentence on average) than in the WMT News Task domain (Barrault et al., 2020). By modern standards, this dataset is relatively small and very domain-restricted; it was chosen deliberately because of computational limitations. 2 Despite that, it demonstrates the results of the different sampling methods relative to each other. These results may not transfer to large parallel corpora in which training data is abundant.
For every language pair, we randomly sample 15k sentences as a development set (used only for determining the best epoch and early stopping) and 15k sentences as a final test set, on which we report results. The WMT News test set is not used for student evaluation because the students are trained on a limited amount of data from a different domain. Of the WMT20 News tokens, 0.18% are not present in the Europarl training set. This would introduce higher variance into the WMT News test evaluation, which would largely depend on the diversity of the teacher vocabulary.
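The split itself is straightforward; a minimal sketch (function name and seed are our own) of carving out the two 15k held-out sets from a list of sentence pairs:

```python
import random

def split_corpus(pairs, dev_size=15_000, test_size=15_000, seed=0):
    """Randomly carve out dev and test sets; the rest is training data."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # fixed seed for reproducibility
    dev = pairs[:dev_size]
    test = pairs[dev_size:dev_size + test_size]
    train = pairs[dev_size + test_size:]
    return train, dev, test
```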

Models
The teachers 3 in this experiment are transformer-based (Vaswani et al., 2017), speed-optimized, and were themselves created by knowledge distillation from state-of-the-art models (Junczys-Dowmunt, 2019), as proposed by Germann et al. (2020). The Czech↔English model is described by Germann et al. (2020) and the English→German model by Bogoychev et al. (2020). Our student models follow the teacher's architecture with half the embedding size (256 instead of 512) and half the attention heads (4 instead of 8). Student models were trained with early stopping after 20 evaluations on validation data, with an evaluation performed every 10k sentences. Vocabularies were not shared with the teacher because sharing did not affect the results, and not using them makes fewer assumptions about the level of access to the teacher model. Marian NMT (Junczys-Dowmunt et al., 2018) is used for teacher decoding and student training. Table 1 shows the teacher performance measured on WMT20 News and the test subset of Europarl. The Czech models performed better on Europarl than on the News task, while for the German model the trend was the opposite. This may be because the models were distilled from a system that had Europarl as part of the training data, CzEng 2.0 (Kocmi et al., 2020).

Sampling
Concerning the sampling metrics (always computed between the considered hypothesis and the reference), we use BLEU, ChrF (Popović, 2015), TER (negated), the difference (negated absolute value) in subword unit counts produced by SentencePiece (Kudo and Richardson, 2018) (SP), and the decoder probability divided by the number of output tokens (score). TER and SP are negated in Section 3 so that higher is always better. The motivation for SP is to capture the difference in length between a hypothesis and the reference. It is a very naive metric, but it provides a point of comparison for the behaviour of all the other metrics. Although BLEU is a document-level metric, it can also be used to estimate sentence similarity. Standard machine translation metrics 4 are computed using Sacrebleu (Post, 2018).

Different sampling methods are used even though the goal is to maximize the BLEU scores of the student models; there is no reason to assume that sampling based only on BLEU will lead to the best results. The number of training sentences differs for every method. We define the following notation.

• T (top): T^n_metric takes the n top translation hypotheses according to metric; it is equal to S^{1,1,...,1 (n times)}_metric. The student model may benefit from seeing, e.g., the second-best hypothesis, even though it is not the best available. This results in n times the number of original sentences, all of which are different.

• S (skewed): S^{k1,k2,...,kn}_metric takes k1 copies of the top translation hypothesis according to metric, k2 copies of the second-best, etc. As opposed to T^n_metric, this method tries to preserve the information of the ordering by setting k1 ≥ k2 ≥ ... ≥ kn. This results in (Σ k_i) times the number of original sentences, of which only n times are different.

• Dedup (deduplication): removes duplicate sentence pairs. It is used after joining the results of other methods and is useful for emulating the or operation: Dedup[A + B] then means "all sentences in either A or B." The output size is strictly dependent on the overlap of A and B.

• G (greater than): G^m_metric takes all translation hypotheses with a metric score of at least m. This results in sentences that are close to the reference according to the metric. The number of output sentences is highly dependent on the threshold and is discussed in the corresponding section.

Sampling methods can be combined: T^2_BLEU + G^{-10}_score joins the top 2 hypotheses measured by BLEU with the hypotheses that have a decoder score of at least −10. Duplicates are intentionally not removed; hypotheses selected by both sampling methods are thus upsampled.

4 Sacrebleu metric version strings: BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a+v1.4.14, ChrF2+numchars.6+space.false+v1.4.14, TER+tok.tercom-nonorm-punct-noasian-uncased+v1.4.14
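The four operations above can be sketched as list transformations over a per-source list of (score, hypothesis) pairs sorted best-first. This is our own illustrative reconstruction, not the paper's code; the function names simply mirror the notation.

```python
def T(scored, n):
    """T^n: the n top hypotheses (scored is sorted best-first)."""
    return [hyp for _, hyp in scored[:n]]

def S(scored, ks):
    """S^{k1,...,kn}: k1 copies of the best hypothesis, k2 of the
    second best, etc."""
    out = []
    for k, (_, hyp) in zip(ks, scored):
        out += [hyp] * k
    return out

def G(scored, m):
    """G^m: every hypothesis whose metric score is at least m."""
    return [hyp for score, hyp in scored if score >= m]

def dedup(hyps):
    """Dedup: drop exact duplicates while keeping order
    (emulates the 'or' of two joined samples)."""
    return list(dict.fromkeys(hyps))
```

A combination such as T^2 + G^m is then plain list concatenation, `T(scored, 2) + G(scored, m)`, which deliberately keeps duplicates unless wrapped in `dedup`.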

Results
Baseline. Table 2 shows results for baseline sampling methods. Original corresponds to training only on the provided parallel corpus (references). T^1_score takes only the highest-scoring hypothesis according to the decoder, which corresponds to the scenario where no reference is available and the decoder score is the best measure of hypothesis quality. 5 The sampling method T^12_− takes all available hypotheses (the metric does not matter).
Training on the original data leads to better results than training on the best-scoring hypotheses. Training on all hypotheses results in slightly lower BLEU performance. This may be caused by the small amount of training data available, in which case taking all hypotheses mainly improves the vocabulary coverage and language-modelling capacity.
Best hypotheses. The results of datasets created by taking either the best one or the four best hypotheses for every source sentence are shown in Table 3. When multiple hypotheses have the same score, the one with the highest decoder score is chosen. The top one and top four hypotheses were chosen to show that the optimum is neither the top one nor the top twelve (all) hypotheses. On average, the hypothesis overlap 6 in sampling between metrics is 29% for T^1 and 51% for T^4. This is expected and shows that when more top hypotheses are taken into the new dataset, the individual metrics tend to matter less.

Taking only the top-scoring hypothesis of reference-based metrics, T^1 showed better results than the baseline (training on the original data, taking the highest decoder-scoring hypothesis, or taking all hypotheses). In all cases T^4 outperformed T^1. The main gains were on CS→EN and EN→CS. Although the results on EN→DE are only slightly better than the baseline, they are systematic across all metrics except SP. The effect of the choice of metric for the top four hypotheses seems marginal, even compared to sampling based on the decoder score. The only exception is the SP difference, which leads to lower results.

6 Overlap computed as the average over metric pairs m1 ≠ m2 of |T^1_m1 ∩ T^1_m2|/n and of |T^4_m1 ∩ T^4_m2|/(4n), where n is the original data size.
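The overlap statistic from footnote 6 can be computed directly from the sampled sets. The sketch below is our reconstruction: each metric's sample is represented as a set of (source, hypothesis) pairs of equal size (n for T^1, 4n for T^4), and the reported number is the average pairwise intersection divided by that size.

```python
from itertools import combinations

def avg_overlap(samples):
    """samples: dict mapping metric name -> set of sampled
    (source, hypothesis) pairs, all sets of equal size.
    Returns the average |A ∩ B| / |A| over all unordered
    pairs of metrics (cf. footnote 6)."""
    sets = list(samples.values())
    size = len(sets[0])
    metric_pairs = list(combinations(sets, 2))
    return sum(len(a & b) for a, b in metric_pairs) / (len(metric_pairs) * size)
```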
Thresholding. A single threshold for all datasets leads to a vastly different number of hypotheses being selected (G^65_BLEU results in 1.3× the original dataset size for CS→EN, but 0.6× for EN→DE). Therefore, we establish a separate metric threshold for every dataset so that the new datasets are 1× to 1.5× the original size, for consistent results across language pairs. Some source sentences were easier to translate, and more of their hypotheses were put into the new dataset; others had no hypothesis above the threshold and were not included in the new data at all. On average, only 25% of the original source sentences were preserved for BLEU, ChrF, TER and SP; for the decoder score, it is 46%. This high loss of source sentences is expected, since most hypotheses share large portions of the target sentence and differ only in a few words; all of them then behave similarly with respect to the metric.

The highest performance is achieved using G_score, which can be explained by how many of the original sentences were preserved. G_score shows that it is possible to achieve performance comparable to T^1_score with less than half of the source sentences, by taking all hypotheses with a decoder score above a threshold. G_BLEU gets worse results (on average −1.1 BLEU), but with only 27% of source sentences preserved.
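The per-dataset threshold can be read off the sorted scores rather than found by trial and error: taking the score of the k-th best hypothesis, with k set to the target multiple of the original size, yields exactly the requested dataset size. A sketch (our own construction, not necessarily the paper's procedure):

```python
def pick_threshold(all_scores, n_sources, target_ratio=1.25):
    """Choose m for G^m so that the new dataset is roughly
    target_ratio x the original size: return the score of the
    k-th best hypothesis, k = target_ratio * n_sources."""
    k = int(target_ratio * n_sources)
    return sorted(all_scores, reverse=True)[k - 1]
```

Keeping every hypothesis with a score of at least the returned `m` then selects k sentence pairs (modulo ties at the threshold).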

Better performance could be achieved by lowering the threshold to allow more source sentences and intersecting the result with one of the other sampling methods, thus eliminating only the very low-quality sentence pairs. This is the approach (done with 5 hypotheses) taken by Freitag et al. (2017): T^1_score ∩ G^{-0.8}_TER.

Upsampling. In the first upsampling case, S^{4,3,2,1}, the best hypothesis is present four times, the second best three times, the third best twice and the fourth best once. The reason for upsampling better hypotheses is that we want the optimizer to make bigger steps for sentence pairs of high quality, while still presenting other hypotheses to enlarge the vocabulary and improve the student's language model. The most straightforward way to do this is to put multiple copies of the high-quality examples into the dataset. We also experiment with S^{2,2,1,1}, because the upsampling intensity for every hypothesis rank is an independent variable as well. Both of these schemes are relatively conservative, so that they can be compared to each other and to T^4. Results for upsampling within a single metric are shown in Table 5.

Both versions of upsampling (S^{4,3,2,1} and S^{2,2,1,1}) outperformed all of the previous results. There seems to be no systematic difference between the two. With the exception of SP and the decoder score, the metrics are comparable. A direct comparison can be made to T^4 = S^{1,1,1,1}, because both T^4 and the upsampling methods contain all source sentences and even the same hypotheses; the only difference is that in the upsampling case the better hypotheses are duplicated. Here, S^{2,2,1,1} outperformed T^4 with p < 0.005.

Combination. For the combination scenarios, the newly sampled datasets are joined together. This is shown in Table 6. In the first four cases, the sampling methods were joined with the original data.
A baseline to this is T^1_score + Original, which is commonly used for distillation.
Deduplicating the top four hypotheses according to BLEU or the decoder score and adding them to the original data did not improve over the baseline. Combining the upsampling according to the decoder score with the original data also did not help; replacing the decoder score with BLEU resulted in a significant improvement. The original data is upsampled so that the ratio of synthetic to original data is 4:1 in the first case and 2:1 in the second.
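Building such a combination, e.g. S^{4,3,2,1}_BLEU joined with the duplicated original data, amounts to concatenating the upsampled hypothesis pairs with copies of the reference pairs. A minimal sketch, under the same (score, hypothesis) representation as before and with names of our own choosing:

```python
def build_distillation_data(hyps_by_source, references,
                            ks=(4, 3, 2, 1), original_copies=4):
    """Build an S^{ks}_metric + original_copies x Original dataset.
    hyps_by_source maps source -> list of (score, hypothesis) pairs
    sorted best-first; references maps source -> reference."""
    data = []
    # Upsampled teacher hypotheses: ks[0] copies of the best, etc.
    for src, scored in hyps_by_source.items():
        for k, (_, hyp) in zip(ks, scored):
            data += [(src, hyp)] * k
    # Original reference pairs, duplicated to balance the ratio.
    for _ in range(original_copies):
        data += [(src, ref) for src, ref in references.items()]
    return data
```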
For the rest of the cases, the methods are combined without the original data; baselines are shown in Table 2. The combination of the top four hypotheses (T^4_BLEU or T^4_score) with all of the hypotheses, T^12_−, improved over the baseline, including T^12_−, but performed poorly compared with the other methods. Taking hypotheses that are in the top four according to either BLEU or the decoder score leads to the best results in this section. The top hypothesis according to BLEU is upsampled at least two and at most four times. This seems to work best for EN→DE, where the training data was three times larger.
Bigger student model. To demonstrate the behaviour of the data sampling methods on slightly larger models, the common distillation baseline (T^1_score + Original) and the best-performing proposed sampling method (S^{4,3,2,1}_BLEU + 4×Original) were used to train a student of the same size as the teacher (embedding dimension 512 and 8 attention heads). The results are shown in Table 7. They are systematically higher than for the smaller models, and the difference between the baseline and the best sampling method is preserved.

Summary
Although widely used, taking only the highest-scoring hypothesis (with respect to the decoder score or any reference-based metric, such as BLEU) does not lead to the best results. In the context of the proposed experiments, the best results are achieved by a combination of top hypotheses and the original data, such as S^{4,3,2,1}_BLEU + 4×Original (upsampling according to BLEU and joining with the original data duplicated four times). This yields an average improvement of +2 BLEU over T^1_score + Original.
The choice of the sampling metric does not significantly influence the results, especially in cases where more than the top one hypothesis is sampled. Because of this, in most scenarios the decoder score can be used instead, reducing the need for translation references.
Future work. We worked with only two upsampling schemes, S^{4,3,2,1} and S^{2,2,1,1}. These two vectors are, however, arbitrary, and more of the vast space of upsampling vectors should be explored, especially with more than the top four hypotheses considered or with schemes more strongly skewed towards the best hypothesis. More sophisticated methods based on the value of the metric, instead of just the ordering, could also be tried.
The effects of large models (both teacher and student) and data access should be explored to verify the transferability of the results of the current setup. Specifically, the teacher model should not be a distilled model itself. The robustness of the training should also be established.
Even though this paper focused solely on MT, the importance sampling methods could also be applied and verified on other NLP tasks, possibly even on more general machine learning problems.