Dynamically Composing Domain-Data Selection with Clean-Data Selection by “Co-Curricular Learning” for Neural Machine Translation

Noise and domain are important aspects of data quality for neural machine translation. Existing research focuses separately on domain-data selection, clean-data selection, or their static combination, leaving the dynamic interaction between them unexamined. This paper introduces a "co-curricular learning" method that composes dynamic domain-data selection with dynamic clean-data selection, enabling transfer learning across both capabilities. We apply an EM-style optimization procedure to further refine the "co-curriculum". Experimental results and analysis on two domains demonstrate the effectiveness of the method and the properties of the data scheduled by the co-curriculum.


Introduction
Significant advances have been witnessed in neural machine translation (NMT), thanks to better modeling and data. As a result, NMT has found successful use cases in, for example, domain translation and in supporting other NLP applications, e.g., (Buck et al., 2018; McCann et al., 2017). As these tasks scale to more domains, a challenge surfaces: given a source monolingual corpus, how can we use it to improve an NMT model so that it translates same-domain sentences well? Data selection plays an important role in this context.
In machine translation, data selection has been a fundamental research topic. One idea (van der Wees et al., 2017; Axelrod et al., 2011) for this problem is to use language models to select parallel data out of a background parallel corpus, seeded by the source monolingual sentences. This approach, however, performs poorly on noisy data, such as large-scale, web-crawled datasets, because data noise hurts NMT performance (Khayrallah and Koehn, 2018). The lower learning curve in Figure 1 shows the effect of noise on domain-data selection.
The NMT community has recognized the harm data noise does to translation quality, leading to efforts in data denoising (Koehn et al., 2018), as has been popular in computer vision (Hendrycks et al., 2018). The upper curve in Figure 1 shows the effect of clean-data selection on the same noisy data. These denoising methods, however, cannot be directly used for the problem in question because they require trusted parallel data as input.
We introduce a method to dynamically combine clean-data selection and domain-data selection. We treat them as independent curricula and compose them into a "co-curriculum". We summarize our contributions as:
1. "Co-curricular learning", for transfer learning across data quality. It extends the single-curriculum learning work in NMT and makes the existing domain-data selection method work better with noisy data.
2. An EM-style optimization procedure to refine the co-curriculum. While it brings modest improvement with deep models, it surprisingly improves shallow models by 8-10 BLEU points. We find that bootstrapping seems to "regularize" the curriculum and make it easier for a small model to learn on.
3. We hope our work contributes towards a better understanding of data properties, such as noise, domain, or "easy to learn", and their interaction with the NMT network.
Related Work

Measuring Domain and Noise in Data
Data selection for MT usually uses a scoring function to rank sentence pairs. The cross-entropy difference (Moore and Lewis, 2010) between two language models is commonly used for selecting domain sentences, e.g., (van der Wees et al., 2017; Axelrod et al., 2011). For a source sentence x of length |x|, with a general-domain language model (LM) parameterized as ϑ and an in-domain LM ϑ̃, the domain relevance of x is calculated as:

ϕ(x; ϑ, ϑ̃) = (log P(x; ϑ̃) − log P(x; ϑ)) / |x|    (1)

Alternative measures (Wang et al., 2017; Chen and Huang, 2016; Chen et al., 2016) have also shown effectiveness. When Eq. 1 is used to select data, the data distribution (domain quality) of the in-domain monolingual data used to train P(x; ϑ̃) is transferred into the selected data through the scoring.
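As an illustration, the length-normalized cross-entropy difference in Eq. 1 can be sketched as below. The function and variable names are ours, not the paper's, and the log-probabilities stand in for real LM scores:

```python
def domain_relevance(logp_in, logp_gen, src_len):
    """Length-normalized cross-entropy difference (Moore & Lewis, 2010).

    logp_in:  total log-probability of source sentence x under the in-domain LM
    logp_gen: total log-probability of x under the general-domain LM
    src_len:  |x|, the number of tokens in x
    """
    return (logp_in - logp_gen) / src_len

# A sentence the in-domain LM prefers scores higher.
print(domain_relevance(logp_in=-40.0, logp_gen=-55.0, src_len=10))  # 1.5
```

Sentences scoring high under this function look more like the in-domain seed data than like the general-domain background.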
Data selection has also been used for data denoising (Junczys-Dowmunt, 2018; Wang et al., 2018b), by using NMT models and trusted data to measure the noise level in a sentence pair. One such scoring function uses a baseline NMT model, θ, trained on noisy data, and a cleaner NMT model, θ̃, obtained by fine-tuning θ on a small trusted parallel dataset; it measures the quality of a sentence pair (x, y) as:

φ(x, y; θ, θ̃) = (log P(y|x; θ̃) − log P(y|x; θ)) / |y|    (2)

Using NMT models for selection can also lead to faster convergence (Wang et al., 2018a). With Eq. 2, the distribution (data quality) of the trusted parallel data is transferred into the selected data. These scoring functions usually use smaller networks.
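A minimal sketch of Eq. 2, again with illustrative names and stand-in log-probabilities rather than real NMT model outputs:

```python
def quality_score(logp_clean, logp_base, tgt_len):
    """Eq. 2 shape: per-token gain of the fine-tuned (cleaner) NMT model
    over the noisy baseline on a candidate pair (x, y)."""
    return (logp_clean - logp_base) / tgt_len

# Rank a toy corpus: each entry is (pair_id, logP under the clean model,
# logP under the noisy baseline, target length |y|).
pairs = [
    ("clean_pair", -20.0, -30.0, 10),
    ("noisy_pair", -45.0, -40.0, 10),
]
ranked = sorted(pairs, key=lambda p: quality_score(p[1], p[2], p[3]), reverse=True)
print([p[0] for p in ranked])  # ['clean_pair', 'noisy_pair']
```

Pairs the fine-tuned model prefers relative to the noisy baseline rank higher, so ranking by this score surfaces cleaner data first.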

Curriculum Learning for NMT
Curriculum learning (CL) (Bengio et al., 2009) organizes training as a sequence of criteria over training steps. Hence, for a training run with T maximum steps, C is a sequence: C = Q_1, ..., Q_t, ..., Q_T. At step t, an online learner samples data from Q_t to train on, resulting in a task (or model), m_t. Therefore, C corresponds to a sequence of tasks, M = m_1, ..., m_t, ..., m_f, where m_f is the final task of interest. Intermediate tasks, m_t, are sorted in increasing relevance to m_f as a series of "stepping stones" to m_f, making curriculum learning a form of transfer learning that transfers knowledge through M to benefit m_f. A performance metric P(C, m_f) is used to evaluate m_f. There has already been rich research in CL for NMT. Fine-tuning a baseline on in-domain parallel data is a good strategy (Thompson et al., 2018; Sajjad et al., 2017; Freitag and Al-Onaizan, 2016). Mini-batch sampling is important for CL. Several alternatives have been introduced to evolve the training criterion Q_t over time (Zhang et al., 2018; Wang et al., 2018b; van der Wees et al., 2017; Kocmi and Bojar, 2017; Platanios et al., 2019). In these curricula, tasks in M are sequenced in order of increasing relevance. Earlier tasks are exposed to a diversity of examples, and later tasks progressively concentrate on data subsets more relevant to the final task.

More Related Work
Junczys-Dowmunt (2018) introduces a practical and effective method to combine (static) features for data filtering. Mansour et al. (2011) combine an n-gram LM and IBM translation Model 1 (Brown et al., 1993) for domain-data filtering. We instead compose different types of dynamic online selection rather than combining static features.
Back-translation (BT), e.g., (Sennrich et al., 2016), is another important approach to using monolingual data for NMT. Here we use monolingual data to seed data selection, rather than generating parallel data directly from it. Furthermore, we study the use of source-language monolingual data, in which case BT cannot be applied directly. Wang et al. (2018b) use Eq. 2 to sort data by noise level into a denoising curriculum.

Problem Setting
The setup, however, assumes that in-domain, trusted parallel data, D^ID_XY, does not exist. Our goal is to use an easily available monolingual corpus and to recycle existing trusted parallel data, reducing the cost of curating in-domain parallel data.
We are interested in a composed curriculum, C_co, that improves on either original curriculum. We hope P(C_co, m_f) ≈ P(C_true, m_f), as if a small in-domain, trusted parallel dataset were available. We call this co-curricular learning.

Curriculum Mini-Batching
To facilitate the definition of co-curricular learning, and following (Platanios et al., 2019; Wang et al., 2018b), we define a dynamic data selection function, D^φ_λ(t, D), which returns the top λ(t) of examples in a dataset D, sorted by a scoring function φ, at training step t. We use λ(t) = 0.5^{t/H} (0 < λ ≤ 1) as a pace function that returns a selection-ratio value decaying over time, controlled by a hyper-parameter H. During training, D^φ_λ(t, D) progressively evolves into smaller sub-datasets that are more relevant to the final task according to the scoring function. In practice, D^φ_λ(t, D') can be applied to a small buffer D' of random examples drawn from the much bigger D, for efficient online training. It may also be desirable to set a floor value on λ(t) to avoid potential data-selection bias. This is how we implement a curriculum in our experiments. We introduce two different co-curricula below.
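The pace function and the dynamic selection function above can be sketched as follows. This is a simplified illustration under our own naming, with a toy score table in place of a real scoring function and buffer:

```python
def pace(t, half_life, floor=0.1):
    """lambda(t) = 0.5 ** (t / half_life), floored to avoid selection bias."""
    return max(floor, 0.5 ** (t / half_life))

def dynamic_select(t, buffer, score_fn, half_life, floor=0.1):
    """Top-lambda(t) fraction of a buffer of examples, best scores first."""
    keep = max(1, int(round(pace(t, half_life, floor) * len(buffer))))
    return sorted(buffer, key=score_fn, reverse=True)[:keep]

scores = {"a": 3.0, "b": 1.0, "c": 2.0, "d": 0.5}
print(dynamic_select(0, list(scores), scores.get, half_life=1000))
# at t = 0 all four examples survive; the selection shrinks as t grows
```

The floor (here 0.1) keeps the selection from collapsing entirely late in training, matching the floor values described in the experiment setup.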
We can then constrain the re-weighting, W_t(x, y), to assign non-zero weights only to examples in D^ψ_λ(t, D_XY) at a training step, with uniform sampling. The co-curriculum is thereby fully instantiated based on Eq. 3 and Eq. 4. However, the values of φ and ϕ may not be on the same scale, or even come from the same family of distributions. Therefore, despite its simplicity, C^mix_co may not be able to enforce either curriculum sufficiently.
The cascaded co-curriculum defines two selection functions and nests them. Let β(t) = 0.5^{t/F} and γ(t) = 0.5^{t/G} be two pace functions, implemented similarly to λ(t) above, with different hyper-parameters F and G. Then Eq. 3 is redefined into Eq. 4 with uniform sampling. At a time step, both pace functions, at their respective paces, discard examples that become less relevant to their own tasks. All surviving examples then have an equal opportunity to be sampled. Even though uniformly sampled, examples that are more relevant are retained longer in training and are thus weighted more over time.
Table 1 shows a toy example of how two curricula are composed. At step 1, no example has been discarded yet, and all examples have equal sampling opportunity (W_1's). At step 2, the denoising curriculum discards the noisiest example, 2, but the domain curriculum still keeps all, so only 1 and 3 are retained in the co-curriculum (W_2). At step 3, the domain curriculum discards the least in-domain example, 3, so only 1 is left in the co-curriculum (W_3). The denoising curriculum has a slower pace than the domain curriculum. Over the four steps, example 1 is kept longest and thus weighted most.
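The nested, cascaded selection can be sketched as below, mirroring the toy example: an inner clean-data selection paced by β(t), then an outer domain-data selection paced by γ(t) over the survivors. Names and toy scores are illustrative, not the paper's implementation:

```python
def pace(t, half_life):
    # 0.5 ** (t / half_life); a floor would be added in practice
    return 0.5 ** (t / half_life)

def top_frac(examples, score, frac):
    keep = max(1, int(round(frac * len(examples))))
    return sorted(examples, key=score, reverse=True)[:keep]

def cascade(t, data, denoise_score, domain_score, F, G):
    # inner selection: top beta(t) of the background data by denoising score
    survivors = top_frac(data, denoise_score, pace(t, F))
    # outer selection: top gamma(t) of the survivors by domain score
    return top_frac(survivors, domain_score, pace(t, G))

denoise = {"1": 3.0, "2": 0.5, "3": 2.0}
domain = {"1": 2.0, "2": 3.0, "3": 1.0}
# early on, everything survives; later, only the clean AND in-domain example 1 remains
print(cascade(0, list(denoise), denoise.get, domain.get, F=100, G=100))
print(cascade(400, list(denoise), denoise.get, domain.get, F=100, G=100))
```

As in Table 1, an example survives late into training only if both scoring functions rank it highly.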

Curriculum Optimization
We further improve the co-curriculum using an EM (Dempster et al., 1977)-style optimization procedure. With D_XY and D^ID_X, we train a domain scoring function, ϕ(x; ϑ, ϑ̃). With D_XY and D^OD_XY, we train a denoising scoring function, φ(y|x; θ, θ̃). The in-domain component ϑ̃ of ϕ and the clean component θ̃ of φ are obtained by fine-tuning ϑ or θ on the respective seed data. These initialize the procedure (iteration 0).
At iteration i, we generate a concrete co-curriculum using the dynamic re-weighting, W_t, as defined in Section 4. Let GEN-C denote this curriculum generation process. We then fine-tune the original noisy NMT component, θ, of φ on C_co; the resulting θ*_i is used to replace the clean component of φ and is then compared against the original θ for scoring. The updated φ and the constant ϕ work together to generate a new co-curriculum in the next iteration, going back to Eq. 8. In this process, only the denoising function φ is iteratively updated, making it more aware of the domain.
We call the procedure EM-style because D_XY is treated as incomplete without the (hidden) data order. The generated C_co in each iteration sorts the data and is thus viewed as complete. It is then used to train θ by maximizing the performance of the final task. θ and C_co bootstrap each other. The process finishes after a pre-defined number of iterations. We use shallow parameterization for the scoring functions, but we can train a deep model on the final C_co. The process also uses fine-tuning, so it can be run efficiently.
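The bootstrapping loop can be illustrated with a drastically simplified toy, in which "models" are reduced to per-example score tables, fine_tune merely nudges scores toward the data it sees, and a simple additive "quality + domain" ranking stands in for GEN-C. All names and dynamics are our illustrative assumptions, not the paper's implementation:

```python
def fine_tune(model, dataset):
    # move the model's per-example scores toward the examples it is tuned on
    return {ex: s + (1.0 if ex in dataset else 0.0) for ex, s in model.items()}

def quality(ex, theta_noisy, theta_clean):
    # Eq. 2 shape: clean-component preference minus noisy-baseline preference
    return theta_clean[ex] - theta_noisy[ex]

def refine(theta_noisy, domain_score, background, trusted, iters=3, keep=2):
    theta_clean = fine_tune(theta_noisy, trusted)          # iteration 0 init
    for _ in range(iters):
        # GEN-C stand-in: rank the background data by the current scorers
        c_co = sorted(
            background,
            key=lambda ex: quality(ex, theta_noisy, theta_clean) + domain_score[ex],
            reverse=True,
        )[:keep]
        theta_clean = fine_tune(theta_noisy, c_co)         # theta* replaces the clean component
    return theta_clean

theta_noisy = {"a": 0.0, "b": 0.0, "c": 0.0}
domain_score = {"a": 1.0, "b": 0.0, "c": 2.0}
final = refine(theta_noisy, domain_score, background=["a", "b", "c"], trusted=["a"])
print(final)  # the refined clean component favors the clean, in-domain examples
```

Only the clean component is updated across iterations, while the noisy baseline θ and the domain scorer ϕ stay fixed, matching the procedure described above.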
In principle, the domain-data scoring function ϕ can be updated in a similar manner, too, by updating its in-domain component, ϑ̃. This may help when the in-domain monolingual corpus is very small. An alternating optimization process could be used to bootstrap both. We, however, do not investigate this here.

Setup
We consider two background datasets and two test domains, giving four experiment configurations. Each configuration has as inputs a background dataset, an in-domain source-language corpus, and a (small) trusted parallel dataset that is out-of-domain. The inputs of a configuration are shown in Figure 2.
As alternative background datasets, we use the English→French Paracrawl data (300 million pairs) and the WMT14 training data (40 million pairs). The former is severely noisier than the latter. We adopt the sentence-piece model, using its open-source implementation (Kudo, 2018), to segment data into sub-word units with a source-target shared 32000 sub-word vocabulary.
We use the WMT14 and IWSLT15 test sets as the two test domains; the trusted data are reversely shared across the two test domains, so each domain's trusted parallel data is out-of-domain for it. Additionally, WMT 2012-2013 are used as the validation set for the WMT14 test domain. Our method does not require the in-domain trusted data, but we use it to construct bounds in evaluation.
Training on Paracrawl uses Adam during warm-up and then SGD, for a total of 3 million steps, with batch size 128 and learning rate 0.5, annealed at step 2 million down to 0.05. Training on WMT14 uses batch size 96 and dropout probability 0.2 for a total of 2 million steps, with learning rate 0.5 likewise annealed, at step 1.2 million, down to 0.05. No dropout is used in Paracrawl training due to its large data volume.
For the pace hyper-parameters (Section 4), floor values for λ, β, and γ are empirically set to the top 0.1, 0.2, and 0.5 selection ratios, respectively, so that in the cascaded co-curriculum case the tightest effective selection ratio is the same: 0.1 = 0.2 × 0.5. All single-curriculum experiments use the same pace setting as C_mix.

Baselines and Oracles
We build various systems below as baselines and oracles. Oracle systems use in-domain trusted parallel data. We will see whether our method is better than either original curriculum and how close it comes to the true-curriculum oracle. In most experiments, we fine-tune a warmed-up (baseline) model to compare curricula, for quicker experiment cycles.
Baseline and oracle BLEU scores are shown in Table 2. Note that, except for P1 and W1, the two BLEU scores in a row are from two different training runs, each focusing on its own test domain. On either training dataset, the domain curriculum, C_domain, improves the baseline, C_random, by 0.8-1.1 BLEU (P2 vs P1, W2 vs W1). C_domain falls behind C_denoise on the noisy Paracrawl dataset (P2 vs P3) but delivers matched performance on the cleaner WMT dataset (W2 vs W3); noise compromises the domain capability. On the WMT training data, C_denoise improves the baseline by about +1.0 BLEU on either test domain (W3 vs W1), and by more on the noisier Paracrawl data: +2.0 on either test domain (P3 vs P1). The true curriculum (P4, W4) bounds the performance of C_domain and C_denoise. Simple in-domain fine-tuning gives good improvements (P5 vs P1, W5 vs W1).

Co-Curricular Learning
Cascading vs. mixing. Per-step cascading works better than mixing, measured in BLEU on WMT14. The cascaded co-curriculum is better than either constituent curriculum (P2 or P3) and close to the true curriculum (P4).
So C_co outperforms either constituent curriculum, as targeted in Section 3. In both background-data cases, using in-domain trusted parallel data to build oracles (P5, W5) is more effective than selecting data in our setup.

Effect of Curriculum Optimization
We further bootstrap the co-curriculum with the EM-style optimization procedure (Figure 2) for three iterations, for all four configurations.

Shallow models. We use the translation performance of the clean component P(y|x; θ̃) in the scoring function φ (Eq. 2) as an indicator of the quality of C_co per iteration. Figure 3 shows that the BLEU scores of P(y|x; θ̃) steadily improve over iterations. (The curves also include two initialization points: the noisy θ, and the initial clean θ̃ obtained by fine-tuning θ on the clean data.) These shallow models have 512 dimensions and 3 layers. Surprisingly, EM-3 improves the baseline by +10 BLEU on IWSLT15 and +8.2 BLEU on WMT14, and performs better than fine-tuning the baseline with the clean, out-of-domain parallel data we have. It even reaches the performance of C_random (P1), which uses a much deeper model (1024 dimensions × 8 layers) trained on the vanilla data.
Deep models. Table 5 shows the BLEU scores of deep models (1024 dimensions × 8 layers) trained on the final co-curriculum. P8 performs slightly better than the non-bootstrapped version, P7, on Paracrawl: +0.6 BLEU on the WMT14 test and +0.2 on the IWSLT15 test. The differences on the WMT training data appear to be smaller (W8 vs. W7). So, curriculum bootstrapping has a small impact overall on deep models.
Why is there such a difference? We analyze the properties of the co-curriculum. Each curve in Figure 4 corresponds to a single curriculum that simulates the online data selection, from looser selection (left of the x-axis) to tighter selection (right of the x-axis). Over the course of a single CL run, the curriculum pushes "harder" examples, with higher per-word loss (than the baseline), to the early curriculum phase (for exploration), and "easier-to-learn" examples, with lower per-word loss, to the late curriculum phase (for exploitation). Over iterations, a later-iteration curriculum schedules even easier examples than the previous iteration at the late curriculum stage. The reverse happens at the early curriculum stage, due to conservation of probability mass. Figure 5 shows a similar story for per-word loss variance. So, curriculum optimization "regularizes" the curriculum and makes it easier to learn towards the end of CL. These properties may be important for a small-capacity model to learn efficiently. The fact that the deep model does not improve as much suggests that 'clean' may have already taken most of the headroom for deep models.
Meanwhile, as Figure 6 shows, due to the use of the denoising curriculum, the data in the curriculum becomes cleaner, too. So, although the co-curriculum schedules data from hard to easier-to-learn, which seems opposite to general CL, it also schedules data from less in-domain to cleaner and more in-domain, which captures the spirit of CL.

Retraining
On Paracrawl, retraining NMT with the co-curriculum improves over dynamic fine-tuning, as shown in Table 6 (P9 vs. P8): +0.6 BLEU on IWSLT15 and +1.0 BLEU on WMT14. On the WMT14 training data, retraining (W9) performs similarly to fine-tuning a warmed-up model (W8): +0.3 on IWSLT15 but -0.2 on WMT14. We speculate that this may be due to the smaller WMT training data size.

Dynamic vs. Static Data Selection
Co-curricular learning is dynamic. How much does being dynamic matter? Table 7 shows that fine-tuning on the statically selected top 10% of the data (P10, W10) gives good improvements over baselines P1 and W1, but the co-curriculum (P9, W9) can do better. This confirms the findings of van der Wees et al. (2017). What if we retrain on the static data, too? In Table 8, W11 vs. W9 shows that models retrained on the static data fall far behind for the WMT14 training data: the top 10% selection has only 4 million examples. On Paracrawl, P11 vs. P9 are closer, but retraining on the co-curriculum still performs better. In all cases, co-curricular learning gives the best results. We might tune the static selection ratio for better results, but that is exactly the point of CL: to evolve the data re-weighting without the need for a hard cutoff on the selection ratio.

Discussion
Evidence of data-quality transfer. Figure 7 visualizes how CL in one domain (e.g., web) may enable CL in another. This is the foundation of our proposed method. To draw the figure, using a random sample of 2000 pairs from the WMT training data and some additional in-domain parallel data, we sort examples by tightening the selection ratio according to a true web curriculum. The web curve shows the correlation between the selection ratio and data relevance to the web domain. The same data order appears to yield increasing relevance to other domains, too, with a bigger effect on the closer 'news' domain and smaller effects on 'patent' and 'short' (sentences).
Regularizing data without a teacher. The analysis in Section 5.4 shows that the denoising scoring function and its bootstrapped versions tend to regularize the late curriculum and make the scheduled data easier for small models to learn on. One potential further application of this data property may be in learning a multi-task curriculum, where regularized data may help multiple task distributions work together in the same model. This has been achieved by knowledge distillation in existing research (Tan et al., 2019), by regularizing data with a teacher; we could instead regularize data by example selection, without a teacher. We leave this examination for future research.
Pace-function hyper-parameters. In experiments, we found that the data-discarding pace functions seem to work best when they decay down to their respective floors simultaneously. Adaptively adjusting them seems an interesting direction for future work.

Conclusion
We present a co-curricular learning method to make domain-data selection work better on noisy data, by dynamically composing it with clean-data selection. We show that the method improves over either constituent selection and over their static combination. We further refine the co-curriculum with an EM-style optimization procedure and show its effectiveness, in particular on small-capacity models. In the future, we would like to extend the method to handle more than two curriculum objectives.

Figure 1 :
Figure 1: BLEU curves over NMT training steps: domain-data selection on Paracrawl English→French data (lower curve) vs. clean-data selection on the same data (upper curve). Setup details are in the experiment section.

Figure 2 :
Figure 2: Co-curricular learning with an EM-style optimization procedure. Thicker arrows form the bootstrapping loop.

Figure 3 :
Figure 3: The EM-style optimization has a big impact on small-capacity models, measured in BLEU. Experiments were carried out on the Paracrawl data.

Figure 4 :
Figure 4: Curriculum learning and optimization push "easier-to-learn" (lower per-word loss) examples to the late curriculum (right) and harder examples (higher per-word loss) to the early curriculum (left).

Figure 5 :
Figure 5: Curriculum learning and optimization push "regularized" (lower-variance) examples to the late curriculum and higher-variance examples to the early curriculum.

Figure 7 :
Figure 7: Curriculum learning in one domain may enable curriculum learning in another.
Curriculum learning has been used to further improve traditional static selection. In CL, a curriculum, C, is a sequence of training criteria over training steps. A training criterion, Q_t(y|x), at step t is associated with a set of weights, W_t(x, y), over training examples (x, y) in a dataset D, where y is the translation of x. Q_t(y|x) is a re-weighting of the training distribution P(y|x).

Table 1 :
Curriculum and co-curriculum examples generated from a toy dataset. Each is characterized by its re-weighting, W_t, over four steps, stochastically ordering data to benefit a final task. ϕ: the domain scoring function (Eq. 1). φ: the denoising scoring function (Eq. 2). Strikethrough marks discarded examples.
They control the data-discarding paces for clean-data selection and domain-data selection, respectively. At step t, D^φ_β(t, D_XY) retains the top β(t) of the background data D_XY, sorted by the scoring function φ(x, y).
Strictly speaking, though all are news, the WMT 2014 monolingual data, the WMT 2011-2012 test sets, and the 2014 test set are not necessarily in the exact same news domain. So this news test domain can be treated as a looser case than the IWSLT domain, examining the method at a slightly different position in the spectrum of the problem.

Table 3 :
Per-step cascading works better than mixing on Paracrawl data.

Table 4 :
Co-curriculum improves over either constituent curriculum and over no CL, and can be close to the true curriculum on noisy data.

Table 5 :
EM-style optimization further improves the domain curriculum. But, overall, it has a small impact on deep models.

Figure 6 :
Figure 6: Each individual curriculum concentrates more on news in-domain examples as training progresses. Over iterations, bootstrapping makes the co-curriculum more news-domain aware.

Table 6 :
Retraining with a curriculum may work better than fine-tuning with it on a large, noisy dataset.

Table 7 :
Curriculum learning works slightly better than fine-tuning a warmed-up model with a static top selection.