Learning a Multi-Domain Curriculum for Neural Machine Translation

Most data selection research in machine translation focuses on improving a single domain. We perform data selection for multiple domains at once. This is achieved by carefully introducing instance-level domain-relevance features and automatically constructing a training curriculum that gradually concentrates on multi-domain relevant and noise-reduced data batches. Both the choice of features and the use of a curriculum are crucial for balancing and improving all domains, including out-of-domain. In large-scale experiments, the multi-domain curriculum simultaneously matches or exceeds the performance of the individual single-domain curricula and brings solid gains over no-curriculum training.


Introduction
In machine translation (MT), data selection, e.g., (Moore and Lewis, 2010; Axelrod et al., 2011), has remained a fundamental and important research topic. It has played a crucial role in domain adaptation, by selecting domain-matching training examples, and in data cleaning (aka denoising), by selecting high-quality examples. So far, the most extensively studied scenario assumes a single domain to improve.
It is both technically challenging and practically appealing to build a large-scale multi-domain neural machine translation (NMT) model that performs well on multiple domains at once. This requires addressing research challenges such as catastrophic forgetting (Goodfellow et al., 2014) at scale and data balancing. Such a model has many potential use cases, e.g., as a solid general-purpose service, as a starting point for downstream transfer learning, or for better deployment efficiency.
Unfortunately, existing single-domain data-selection methods do not work well for multiple domains. For example, improving the translation accuracy of one domain often hurts that of another (van der Wees et al., 2017; Britz et al., 2017), and improving model generalization across all domains by clean-data selection does not guarantee improvement on any particular domain. Multiple aspects need to be considered when training a multi-domain model. This paper presents a dynamic data selection method for multi-domain NMT. What we do differently from previous work on data mixing is the choice of instance-level features and the use of a multi-domain curriculum that can additionally denoise. Both are crucial for mixing and improving all domains, including out-of-domain. We experiment with large datasets at different noise levels and show that the resulting models meet our requirements.

Related Work
In MT, the research most relevant to our work is data selection and data mixing, both concerned with how to sample examples to train an MT model, usually for domain adaptation. Table 1 categorizes previous research along two aspects and shows where our work stands: 1. Is the method concerned with a single domain or multiple domains? 2. Does it select or mix data statically or dynamically?
Static data selection for a single domain. Moore and Lewis (2010) select in-domain data for n-gram language model (LM) training; Axelrod et al. (2011) later generalize the method to select parallel data for training MT models. Other work uses classifiers to select domain data. Clean-data selection (Koehn et al., 2019; Junczys-Dowmunt, 2018) reduces harmful data noise to improve translation quality across domains. All these works select a data subset for a single "domain"¹ and treat the selected data as a static/flat distribution.
Dynamic data selection for a single domain. Static selection has two shortcomings: it discards data, and it treats all examples equally after selection. When data is scarce, any data could be helpful, even if it is out of domain or noisy.² Dynamic data selection is introduced to "sort" data from least in-domain to most in-domain. Training NMT models on data sorted this way effectively takes advantage of transfer learning. Curriculum learning (CL) (Bengio et al., 2009) has been used as a formulation for dynamic data selection. Domain curricula (van der Wees et al., 2017; Zhang et al., 2019) are used for domain adaptation. Model stacking (Sajjad et al., 2017; Freitag and Al-Onaizan, 2016) is a practical idea for building domain models. CL is also used for denoising (Wang et al., 2018a,b), and for faster convergence and improved general quality (Zhang et al., 2018; Platanios et al., 2019). Wang et al. (2018a) introduce a curriculum for training efficiency. In addition to data sorting/curricula, instance/loss weighting (Wang et al., 2017; Wang et al., 2019b) has been used as an alternative. CL for NMT represents the SOTA data-selection method, but most existing works target a single "domain", be it a specific domain or the "denoising domain".

¹ We treat denoising as a domain in this paper, inspired by previous works that treat data noise using domain adaptation methods, e.g., (Junczys-Dowmunt, 2018).
² We refer to data regularization (using more data) and to transfer learning (fine-tuning) to exploit both data quantity and quality, the idea behind dynamic data selection. See Appendix C.

Static data mixing for multiple domains. Britz et al. (2017) learn domain-discerning (or -invariant) network representations with a domain discriminator network for NMT. These methods, however, require that domain labels be available in the data. Tars and Fishel (2018) cluster data and tag each cluster as multi-domain NMT training data, but treat the data in each cluster as a flat distribution. Farajian et al. (2017) implement multi-domain NMT by on-the-fly data retrieval and adaptation per sentence, at increased inference cost. Most existing methods (or experimental setups) have the following problems: (i) they mix data statically; (ii) they do not consider the impact of data noise, which is a source of catastrophic forgetting; (iii) experiments are carried out with small datasets, without separate examination of the data regularization effect; (iv) they do not examine out-of-domain performance.
Automatic data balancing for multiple domains. Wang et al. (2020) automatically learn to weight (flat) data streams of multiple languages (or "domains"). We instead perform dynamic data selection and regularization through a multi-domain curriculum.
Automatic curriculum learning. Our work falls under automatic curriculum construction (Graves et al., 2017) and is directly inspired by Tsvetkov et al. (2016), who learn, through Bayesian Optimization, to weight and combine instance-level features to form a curriculum for an embedding learning task. A similar idea (Ruder and Plank, 2017) is used to improve other NLP tasks. Here, we use the idea for NMT to construct a multi-domain data selection scheme with various selection scores at our disposal. The problem we study is connected to the more general multi-objective optimization problem. Duh (2018) uses Bandit learning to tune hyper-parameters such as the number of network layers for NMT.
More related work. Previously, catastrophic forgetting has mostly been studied in the continued-training setup (Saunders et al., 2019), referring to degraded performance on the out-of-domain task when a model is fine-tuned on in-domain data. This setup is a popular topic in general machine-learning research (Aljundi et al., 2019). Other work studies domain adaptation by freezing subnetworks. Our work instead addresses forgetting in the data-balancing scenario for multiple domains. We use a curriculum to generalize fine-tuning.

Figure 1: Three sentence pairs S1, S2, S3 with per-domain usefulness scores (1), and the data orders of the corresponding single-domain curricula (2)-(4).

Curriculum Learning for NMT
We first introduce curriculum learning (CL) (Bengio et al., 2009), which serves as the formulation for SOTA single-domain dynamic data selection and which our method builds upon and generalizes. In CL, a curriculum, C, is a sequence of training criteria over training steps. A training criterion, Q_t(y|x), at step t is associated with a set of weights, W_t(x, y),³ over training sentence pairs (x, y) in a parallel dataset D, where y is the translation for x. Q_t(y|x) is a re-weighting of the original training distribution P(y|x):

    Q_t(y|x) ∝ W_t(x, y) P(y|x).    (1)

Hence, for T maximum training steps, C is a sequence:

    C = ⟨Q_1, ..., Q_T⟩.

At step t, an online learner randomly samples a data batch from Q_t to fine-tune model m_{t-1} into m_t. Therefore, C corresponds to a sequence of models:

    m_1, ..., m_T = M,

where M is the final model that the entire curriculum has been optimizing towards. Intermediate models, m_t, serve as "stepping stones" to M, transferring knowledge through them and regularizing the training for generalization. A performance metric P(C) evaluates M on a development or test set after training on C. Table 2 (1) corresponds to the data order in Figure 1 (2).
In NMT, CL is used to implement dynamic data selection. First, a scoring function (Section 4.3) is employed to measure the usefulness of an example to a domain and to sort the data. Then mini-batch sampling, e.g., (Kocmi and Bojar, 2017), is designed to realize the weighting W_t and dynamically evolve the training criteria Q_t towards in-domain. Figure 1 (1) shows three sentence pairs, S_1, S_2, S_3, each with three scores representing usefulness to three domains. A grey-domain training curriculum, for example, relies on the data order in Figure 1 (2) and gradually discards the least useful examples according to W_t(x, y) (Eq. 1) in Table 2 (1): at step 1, the learner uniformly samples from all examples (W_1), producing model m_1. At step 2, the least in-domain S_3 is discarded (strikethrough) by W_2, so we sample uniformly from the subset {S_1, S_2} to reach m_2. We repeat this until reaching the final model M. In this process, sampling is uniform within each step, but in-domain examples (e.g., S_1) are reused more across steps. Similarly, we can construct the dark-domain curriculum from Figure 1 (3) and the white-domain curriculum from Figure 1 (4).
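The shrinking-support re-weighting above can be sketched as a toy illustration. This is not the paper's implementation: the function name and the linear shrink schedule are ours (the paper's actual schedule is the ρ(t) decay described under Curriculum Construction); it only mirrors the Table 2 (1) example of uniform sampling over a progressively smaller, higher-scoring subset.

```python
def domain_curriculum_weights(scores, t, total_steps):
    """W_t(x, y) for a single-domain curriculum: at step t, discard the
    lowest-scoring examples and weight the survivors uniformly.

    `scores` are per-example usefulness scores for one domain; the linear
    shrink schedule here is illustrative only.
    """
    n = len(scores)
    # number of examples kept: all of them at t=0, only the best at the end
    keep = max(1, n - round(t / total_steps * (n - 1)))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:keep])
    return [1.0 / keep if i in kept else 0.0 for i in range(n)]

# grey-domain scores for S1, S2, S3 as in Figure 1 (1)
scores = [0.9, 0.5, 0.1]
w1 = domain_curriculum_weights(scores, 0, 2)  # uniform over {S1, S2, S3}
w2 = domain_curriculum_weights(scores, 1, 2)  # S3 discarded
w3 = domain_curriculum_weights(scores, 2, 2)  # only S1 remains
```

Note how S_1, the most in-domain pair, keeps a non-zero weight at every step and is therefore reused most often over training.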

Our Approach: Learning a Multi-Domain Curriculum

General Idea
The challenges in multi-domain/-task data selection lie in addressing catastrophic forgetting and data balancing. In Figure 1, while curriculum (2) moves a model in the grey-domain direction, this direction may not be positively consistent with the dark domain (Figure 1 (3)), causing dark-domain performance to drop. Ideally, a training example that introduces the least forgetting across all domains would have gradients that move the model in a common direction towards all domains. While this may not be feasible by selecting a single example, we would like the intuition to hold for a data batch on average. Therefore, our idea is to carefully introduce per-example data-selection scores (called features) to measure "domain sharing", intelligently weight them to balance the domains of interest, and dynamically schedule examples to trade off between regularization and domain adaptation.

Figure 2: Learning a multi-domain curriculum.
A method to realize the above idea has the following properties: 1. Features of an example reflect its relevance to domains.
2. Feature weights are jointly learned/optimized based on end model performance.
3. Training is dynamic, by gradually focusing on multi-domain relevant and noise-reduced data batches.
Furthermore, a viable multi-domain curriculum meets the following performance requirements: (i) It improves the baseline model across all domains.
(ii) It simultaneously reaches (or outperforms) the peak performance of individual singledomain curricula.
The above requires improvement on out-of-domain, too.

The Framework
Formally, for a sentence pair (x, y), let f_n(x, y) ∈ R be its n-th feature, which specifies how useful (x, y) is to a domain. Suppose we are interested in K domains and each example has N features. For instance, each sentence pair among S_1, S_2, S_3 in Figure 1 (1) has three features (N = 3), one per domain (K = 3).⁴ We represent (x, y)'s features as a feature vector F(x, y) = [f_1(x, y), ..., f_N(x, y)]. Given a candidate weight vector V ∈ R^N, we compute an aggregated score for each sentence pair,

    f(x, y) = V · F(x, y),    (4)

and sort the entire dataset in increasing order of f. We then construct a curriculum C(V) to fine-tune a warmed-up model, evaluate its performance, and propose a next weight vector. After several iterations/trials, the optimal weight vector V* is the one with the best end performance:

    V* = argmax_V P(C(V)).    (5)

Figure 2 shows the framework. For the process to be practical and scalable, C fine-tunes a warmed-up model for a small number of steps. The learned V* can then be used to retrain a final model from scratch.
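The scoring and sorting step can be sketched as below, using the three pairs of Figure 1 (1) as a toy example. The function names and the specific feature/weight values are ours, for illustration only.

```python
def aggregate_score(feature_vec, weight_vec):
    """f(x, y) = V . F(x, y) (Eq. 4): linear combination of an example's
    per-domain features under a candidate weight vector V."""
    return sum(w * f for w, f in zip(weight_vec, feature_vec))

def curriculum_order(features, weight_vec):
    """Sort example indices in increasing order of f(x, y); the tail of the
    order is the most multi-domain-relevant data the curriculum ends on."""
    return sorted(range(len(features)),
                  key=lambda i: aggregate_score(features[i], weight_vec))

# toy example: 3 sentence pairs with N = 3 features each (one per domain)
F = [[0.9, 0.2, 0.1],
     [0.5, 0.6, 0.3],
     [0.1, 0.9, 0.8]]
V = [1.0, 0.5, 0.2]  # a candidate weight vector proposed by the optimizer
order = curriculum_order(F, V)  # least to most relevant under V
```

Changing V reorders the data, which is exactly the degree of freedom the outer optimization (Eq. 5) searches over.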

Instance-Level Features
We design the following types of features for each training example and instantiate them in Experiments (Section 5).
NMT domain features (q_Z) compute, for a pair (x, y), the cross-entropy difference between two NMT models:

    q_Z(x, y) = [log P(y|x; θ_Z) − log P(y|x; θ_base)] / |y|,    (6)

where P(y|x; θ_base) is a baseline model with parameters θ_base trained on the background parallel corpus, P(y|x; θ_Z) is a Z-domain model with parameters θ_Z obtained by fine-tuning θ_base on a small Z-domain parallel corpus D_Z of trusted quality, and |y| is the length of y. q_Z discerns both noise and domain Z (Wang et al., 2019a). Each domain Z has its own D_Z. Importantly, Grangier (2019) shows that, under a Taylor approximation (Abramowitz and Stegun, 1964), q_Z approximates the dot product between the gradient, g(x, y; θ_base), of training example (x, y) and the gradient, g(D_Z, θ_base), of the seed data D_Z.⁵ Thus an example with positive q_Z likely moves a model towards domain Z. For multiple domains, Z_1, ..., Z_K, selecting a batch of examples whose q_{Z_k}'s are all positive would move a model in a common direction shared across the domains, which alleviates forgetting. The Z-domain feature q_Z(x, y) can easily be generalized into a single multi-domain feature, q_𝒵, for a set of domains 𝒵 = {Z_1, ..., Z_K}:

    q_𝒵(x, y) = [log P(y|x; θ_𝒵) − log P(y|x; θ_base)] / |y|,    (8)

where θ_𝒵 is obtained by concatenating the seed parallel corpora D_Z of the constituent domains into D_𝒵 and using it to fine-tune the baseline θ_base. A benefit of q_𝒵 is scalability: a single feature value approximates (x, y)'s gradient consistency with multiple domains at once. Simple concatenation means, however, that domain balancing is not optimized as in Eq. 5.

⁵ That is, according to Grangier (2019), the approximation holds when θ_base and θ_Z are close, which is the case for fine-tuning: θ_Z = θ_base + λ g(D_Z, θ_base).
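In code, the q_Z feature is just a length-normalized difference of sentence-level log-probabilities; the two log-probabilities would come from the trained base and fine-tuned Z-domain NMT models. A minimal sketch (function names ours):

```python
def q_feature(logprob_base, logprob_domain, target_length):
    """q_Z(x, y) (Eq. 6): per-token difference between the Z-domain model's
    and the base model's log-probabilities of the pair (x, y). Positive
    values suggest the example's gradient points towards domain Z."""
    return (logprob_domain - logprob_base) / target_length

def shared_direction(logprob_base, domain_logprobs, target_length):
    """True if q_Z is positive for every domain of interest, i.e. the
    example plausibly moves the model in a direction all domains share,
    which is the batch-level intuition behind alleviating forgetting."""
    return all(q_feature(logprob_base, lp, target_length) > 0
               for lp in domain_logprobs)
```

For example, a pair whose log-probability rises under every fine-tuned domain model has all-positive q_Z values and is a good multi-domain candidate.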
NLM domain features (d_Z) similarly compute, for the source sentence x, the cross-entropy difference between two neural language models (NLMs):

    d_Z(x, y) = [log P(x; ϑ_Z) − log P(x; ϑ_base)] / |x|,    (9)

where P(x; ϑ_base) is an NLM with parameters ϑ_base trained on the x half of the background parallel data, and P(x; ϑ_Z) is obtained by fine-tuning P(x; ϑ_base) on Z-domain monolingual data. Although d_Z may not necessarily reflect the translation gradient of an example under an NMT model, it effectively assesses Z-domain relevance and, furthermore, allows us to include additional, larger amounts of in-domain monolingual data. We do not use its bilingual version (Axelrod et al., 2011), but consider only the source side, for simplicity.
Cross-lingual embedding similarity feature (emb) computes the cosine similarity of a sentence pair in a cross-lingual embedding space. The embedding model is trained to produce similar representations exclusively for true bilingual sentence pairs, following Yang et al. (2019).
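A sketch of the similarity computation, assuming sentence vectors from such a bilingual encoder are already available (the function name is ours; the encoder itself is the model of Appendix A):

```python
import math

def emb_feature(src_vec, tgt_vec):
    """Cosine similarity between the source and target sentence embeddings
    in the shared cross-lingual space; true translation pairs should score
    near 1, mismatched or noisy pairs lower."""
    dot = sum(a * b for a, b in zip(src_vec, tgt_vec))
    norm_s = math.sqrt(sum(a * a for a in src_vec))
    norm_t = math.sqrt(sum(b * b for b in tgt_vec))
    return dot / (norm_s * norm_t)
```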
These features complement each other by capturing information about a sentence pair from different aspects: NLM features capture domain; NMT features additionally discern noise; BERT and emb are introduced for denoising, transferring the strength of the data they are trained on. All these features come from previous research; here we integrate them to solve a generalized problem.

Performance Metric P
Eq. 5 evaluates the end performance P(C(V)) of a multi-domain curriculum candidate. We simply concatenate the validation sets from the multiple domains into a single validation set and report the perplexity of the last model checkpoint after training on C(V). The best multi-domain curriculum minimizes the model's perplexity (or, per Eq. 5, maximizes its negative) on the mixed validation set. We experiment with different mixing ratios.

Curriculum Optimization
We solve Eq. 5 with Bayesian Optimization (BayesOpt) (Shahriari et al., 2016) as the optimizer in Figure 2. BayesOpt is derivative-free and can optimize expensive black-box functions, with no assumption about the form of P. It has recently become popular for training expensive machine-learning models in the "AutoML" paradigm. It consists of a surrogate model for approximating P(C(V)) and an acquisition function for deciding the next sample to evaluate. The surrogate model evaluates C(V) without running the actual NMT training, using Gaussian process (GP) priors over functions that express assumptions about P. The acquisition function depends on previous trials as well as the GP hyper-parameters; the Expected Improvement (EI) criterion (Srinivas et al., 2010) is typically used. Algorithm 1 depicts how BayesOpt works in our setup. We use Vizier (Golovin et al., 2017) for Batched Gaussian Process Bandits, but open-source implementations of BayesOpt are readily available.⁷
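The outer loop of Figure 2 can be sketched as follows. Here `evaluate(v)` stands in for "fine-tune a warmed-up model on C(v) for a small number of steps and return P(C(v))", and, to keep the sketch dependency-free, a random proposer replaces the GP surrogate and EI acquisition (random search is also the baseline optimizer compared later). Function names are ours.

```python
import random

def learn_curriculum_weights(evaluate, num_features, trials=30, seed=0):
    """Solve Eq. 5 by black-box search: propose a weight vector V, run a
    short curriculum trial C(V), and keep the V with the best end
    performance P(C(V)).

    `evaluate(v)` is assumed to run the (expensive) curriculum trial and
    return a score to maximize, e.g. negative perplexity on the mixed
    validation set. A real BayesOpt implementation would let a GP surrogate
    plus an acquisition function propose `v`; here proposals are random.
    """
    rng = random.Random(seed)
    best_v, best_p = None, float("-inf")
    for _ in range(trials):
        v = [rng.uniform(0.0, 1.0) for _ in range(num_features)]  # weights in [0, 1]
        p = evaluate(v)
        if p > best_p:
            best_v, best_p = v, p
    return best_v, best_p
```

Swapping the proposer is the only change needed to move from random search to BayesOpt, which is why the two are directly comparable in the experiments.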

Curriculum Construction
We pre-compute all features for each sentence pair (x, y) in the training data and turn its features into a single score f(x, y) by Eq. 4, given a weight vector. We then construct a curriculum by instantiating its re-weighting W_t(x, y) (Eq. 1). To that end, we define a Boolean, dynamic data-selection function χ^f_ρ(x, y; t) that checks, at step t, whether (x, y) ∈ D belongs to the top ρ(t)-ratio of examples in the training data D sorted in increasing order of f(x, y) (0 < ρ ≤ 1); χ^f_ρ is thus a mask. Suppose n(t) examples are selected by χ^f_ρ(x, y; t); the re-weighting is then W_t(x, y) = 1/n(t) × χ^f_ρ(x, y; t).
Filtered examples have zero weight and selected ones are uniformly weighted. We set ρ(t) = (1/2)^{t/H} to decay/tighten over time,⁸ controlled by the hyper-parameter H. During training, χ^f_ρ(x, y; t) progressively selects higher-f(x, y)-scoring examples. In implementation, we integrate χ^f_ρ(x, y; t) into the data feeder to pass only selected examples to the downstream model trainer; we also normalize f(x, y) offline so it can be compared directly to ρ(t) online to decide filtering. As an example, the W_t(x, y) for the multi-domain curriculum order in Figure 1 (5) can look like Table 2 (2).
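The schedule and mask can be sketched as below (function names ours; in the paper this logic lives inside the data feeder, operating on normalized scores rather than explicit sorting):

```python
def rho(t, H):
    """Retained-data ratio rho(t) = (1/2) ** (t / H): starts at 1.0 and
    halves every H steps, tightening the selection over time."""
    return 0.5 ** (t / H)

def selection_mask(f_scores, t, H, floor=0.2):
    """chi^f_rho(x, y; t): keep only the top rho(t)-ratio of examples by
    f(x, y). `floor` mirrors the experiments, where the ratio plateaus at
    20%; selected examples then receive uniform weight 1/n(t)."""
    ratio = max(rho(t, H), floor)
    n_keep = max(1, round(ratio * len(f_scores)))
    ranked = sorted(range(len(f_scores)),
                    key=lambda i: f_scores[i], reverse=True)
    kept = set(ranked[:n_keep])
    return [i in kept for i in range(len(f_scores))]
```

Early in training the mask passes everything (regularization from the full, noisy data); late in training it passes only the highest-scoring, multi-domain-relevant fraction.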

Setup
Data and domains. We experiment with two English→French training datasets: the noisy ParaCrawl data⁹ (290 million sentence pairs) and the WMT14 training data (38 million pairs). We use a SentencePiece model (Kudo, 2018) for subword segmentation, with a source-target shared vocabulary of 32,000 subword units. We evaluate our method on three "domains": two specific domains, news and TED subtitles, and out-of-domain. The news domain uses the WMT14 news testset (N14) for testing, and WMT12-13 for validation with early stopping (Prechelt, 1997). [...] (Vaswani et al., 2017) (more details in Appendix A). For the BERT feature, we sample positive pairs from the same data to train the cross-lingual embedding model. The negatives are generated using the cross-lingual embedding model, via 10-nearest-neighbor retrieval in the embedding space, excluding the true translation. We pick the nearest neighbor to form a hard negative pair with the English sentence, and a random neighbor to form another negative pair. We sample 600k positive pairs and produce 1.8M pairs in total.

⁷ E.g., https://github.com/tobegit3hub/advisor
⁸ When the training data is small, we can, in practice, let a model warm up before applying the schedule.
⁹ https://paracrawl.eu
Model. We use LSTM NMT (Wu et al., 2016) as our models, but with the Adam optimizer (Kingma and Ba, 2015). The batch size is 10k, averaged over 8 length-buckets (with synchronous training). NLM/NMT features use 512 dimensions and 3 layers; the NLM shares the same architecture as the NMT model, using dummy source sentences (Sennrich et al., 2016). [...] and the last 5 trials in exploitation. Each trial trains for 2k steps¹² by fine-tuning a warmed-up model with the candidate curriculum. The curriculum ratio ρ(t) decays from 100% and plateaus at 20% at step 2k. We simply and heuristically set a range of [0.0, 1.0] for all feature weights. We do not normalize feature values when weighting them.

Results
We evaluate if the multi-domain curriculum meets requirements (i) and (ii) in Section 4.1.

Compared to no curriculum
We compare: • B: baseline that does not use curriculum learning.
• C_6-feats: multi-domain curriculum with 6 features, d_N, d_T, q_N, q_T, BERT, emb, with weights learned by BayesOpt. Table 3 shows that C_6-feats improves over B on all testsets, especially on noisy ParaCrawl; requirement (i) is met. It is important to note that our WMT baseline (W1) matches Wu et al. (2016) on N14, as shown by re-computed tokenized BLEU (italics).

Compared to single-domain curricula
We examine the following individual curricula, by training NMT models with each, respectively: • C d N , uses news NLM feature d N (Eq. 9).
• C d T , uses TED subtitle NLM feature d T .
• C q N , uses news NMT feature q N (Eq. 6).
• C q T , uses TED NMT feature q T .
• C BERT , uses BERT quality feature.
¹² 2k steps is an empirically practical choice: Eq. 5 requires a number of fine-tuning trials, NMT training is expensive, so we do not want each trial to tune for many steps, and NMT adapts quickly to domain data, so each trial does not need many steps. We find no significant difference among 1k, 2k, and 6k.

In Table 4, frame boxes mark the best BLEUs (P* or W*) per column, across P3-P7 or W3-W7. The last column shows BLEU averaged over all testsets. Bold font indicates that C_6-feats matches or improves on W*. As shown, C_6-feats matches or slightly outperforms the per-domain curricula across testsets. Therefore, C_6-feats meets requirement (ii).

Features
Strengths and weaknesses of features. Table 4 also reveals the relative strengths and weaknesses of each type of feature. The peak BLEU (in a frame box) on each testset is achieved by one of C_BERT/emb, C_q_N, and C_q_T, less often by the NLM features d_N, d_T. This contrast appears bigger on the noisy ParaCrawl, but the NLM features do bring gains over B. Overall, C_BERT/emb (P5, W5) perform well, attributable to their denoising power, but lose to the NMT features (P7, W7) on T15, due to their lack of explicit domain modeling. The NMT features seem to partly compensate for domain coverage, and the denoising features for noise, and combining all features improves the model.
BERT and emb features. Both BERT and emb use knowledge external to the experiment setup. For a fair comparison to the baselines and a better understanding of these features, we drop them and build: • C_4-feats, a multi-domain curriculum that excludes BERT and emb and uses the remaining 4 features. Table 5 shows that the BERT and emb features in C_6-feats improve over C_4-feats on ParaCrawl, adding to the intuition that they have a denoising effect.
Learned feature weights. Figure 3 shows that BayesOpt learns to weight features adaptively in C_6-feats on ParaCrawl (grey) and WMT (white), respectively. ParaCrawl is very noisy, so the noise-insensitive features d_N and d_T do not have a chance to help, but their weights become stronger on the cleaner WMT training data. It is surprising that the BERT feature is still useful for WMT training. We hypothesize that the BERT feature may have strengths beyond denoising, or that data noise can be subtle and exist even in cleaner data.

BayesOpt vs. random search
We compare BayesOpt (BO) and Random Search (RS) (Bergstra and Bengio, 2012) for solving Eq. 5, as well as uniform weighting (Uniform). In Table 6, all improve over the baselines, especially on ParaCrawl (P). RS does surprisingly well on ParaCrawl, but BayesOpt appears better overall.

Mixing validation sets
Eq. 5 evaluates P using the concatenated validation set (Section 4.4). Table 7 shows that the news-vs-TED mixing ratio can affect the per-domain BLEUs. For example, on ParaCrawl, when news sentences are absent from the validation set, N14 drops by 0.7 BLEU (P8 vs. P13). We use the same four features as C_4-feats in this examination.

Dynamic data balancing
We simulate dynamic data selection with a random sample of 2000 pairs from the WMT data and have human raters annotate each pair on a 0 (nonsense) to 4 (perfect) quality scale (following Wang et al. (2018b)). We sort the pairs by f(x, y) (Eq. 4). A threshold selects a subset of pairs, for which we average the respective NMT feature values as the domain relevance. Figure 4 shows that the multi-domain curriculum (C_6-feats) learns to dynamically increase quality and multi-domain relevance. Therefore, our idea (Section 4.1) works as intended. Furthermore, training seems to increase quality and domain relevance gradually and at different speeds, determined by Eq. 5.

Weighting loss vs. curriculum
With the learned weights, we compute a score for each example and sort the data to form a curriculum. Alternatively, we could weight each sentence's cross-entropy loss during training (Wang et al., 2017). Table 8 shows that the curriculum yields improvements over per-sentence loss weighting.

In-domain fine-tuning

C_q_N and C_q_T each use a small in-domain parallel dataset, but we can also simply fine-tune the final models on either dataset (+N, +T) or their concatenation (+N+T). Table 9 shows that C_6-feats can be further improved by in-domain fine-tuning¹⁴ and that both C_6-feats and its fine-tuned versions still improve over the fine-tuned baselines, in particular on ParaCrawl.

Discussion: Feature Dependency
One potential issue with using multiple per-domain features (the q_Z(x, y)'s in Eq. 6) is that scores are not shared across domains and linear weighting may not capture feature dependency. For example, we need two NMT features if there are two domains. We replace the two NMT features, q_N and q_T, in C_4-feats with a single two-domain feature q_{Z={N,T}} (Eq. 8), keeping the two corresponding NLM features unchanged (so the new experiment has 3 features). Table 10 shows that the multi-domain feature contributes slightly more than the linear combination of per-domain features (P19 vs. P8). The per-domain features, however, have the advantage of efficient feature weighting. With many features, learning to compress them seems an interesting future investigation.
¹⁴ We fine-tune with SGD for 20k steps, with batch size 16 and learning rate 0.0001.

Table 9: The multi-domain curricula still bring improvements, even after models are fine-tuned on in-domain parallel data. +N: fine-tune on news parallel data D_N (Section 5.1); +T: fine-tune on TED parallel data D_T; +N+T: fine-tune on their concatenation.

Table 10: The multi-domain/task feature (Eq. 8) seems to contribute slightly better than a linear combination of multiple per-domain features (Eq. 6).

Conclusion
Existing curriculum learning research in NMT focuses on a single domain. We present a multi-domain curriculum learning method. We carefully introduce instance-level features and learn a training curriculum that gradually concentrates on multi-domain relevant and noise-reduced data batches. End-to-end experiments and ablation studies on large datasets at different noise levels show that the multi-domain curriculum simultaneously matches or exceeds the performance of the individual single-domain curricula and brings solid gains over no-curriculum training, on both in-domain and out-of-domain testsets.

Appendices A Cross-lingual Embedding Model Parameters
The sentence encoder has a shared 200k token multilingual vocabulary with 10k OOV buckets. For each token, we also extract character n-grams (n = [3, 6]) hashed to 200k buckets. Word token items and character n-gram items are mapped to 320 dim. character embeddings. Word and character n-gram representations are summed together to produce the final input token representation. The encoder is a 3-layer Transformer with hidden size of 512, filter size of 2048, and 8 attention heads. We train for 40M steps using an SGD optimizer with batch size K=100 and learning rate 0.003. During training, the word and character embeddings are scaled by a gradient multiplier of 25.

B Transformer-Big Results
We replicate experiments with the Transformer-Big architecture. Table 11 shows the Transformer-Big results that correspond to the RNN results in Table 3. These results show that the multi-domain curriculum meets the performance requirement (i) (Section 4.1) using the Transformer architecture. Table 12 shows the Transformer-Big results corresponding to RNN results in Table 4. They show that the proposed multi-domain curriculum meets the performance requirement (ii) using Transformer.

C An Explanation: Noisy Data Useful in Low-Resource Setup
With noisy, limited data (e.g., 100k pairs), we can train a model A on all the data, or a model B on the filtered subset (e.g., 10k). We can also fine-tune A on the filtered data to produce model C. C could be better than A due to its use of higher-quality data, and better than B due to its use of more data (100k > 10k). Therefore, by "noisy data can be helpful", we refer to data regularization (using more data) and to transfer learning (fine-tuning) to exploit both data quantity and quality, the idea behind dynamic data selection.