Target Conditioned Sampling: Optimizing Data Selection for Multilingual Neural Machine Translation

To improve low-resource Neural Machine Translation (NMT) with a multilingual corpus, training on the most related high-resource language only is generally more effective than using all data available (Neubig and Hu, 2018). However, it remains an open question whether a smart data selection strategy can further improve low-resource NMT with data from other auxiliary languages. In this paper, we seek to construct a sampling distribution over all multilingual data, so that it minimizes the training loss of the low-resource language. Based on this formulation, we propose an efficient algorithm, Target Conditioned Sampling (TCS), which first samples a target sentence, and then conditionally samples its source sentence. Experiments show that TCS brings significant gains of up to 2 BLEU on three of the four languages we test, with minimal training overhead.


Introduction
Multilingual NMT has led to impressive gains in translation accuracy of low-resource languages (LRLs) (Zoph et al., 2016; Firat et al., 2016; Gu et al., 2018; Neubig and Hu, 2018; Nguyen and Chiang, 2018). Many real-world datasets provide sentences that are multi-parallel, with the same content in a variety of languages.
Examples include TED (Qi et al., 2018), Europarl (Koehn, 2005), and many others (Tiedemann, 2012). These datasets open up the tantalizing prospect of training a system on many different languages to improve accuracy, but previous work has found that methods using only a single related high-resource language (HRL) often outperform systems trained on all available data (Neubig and Hu, 2018). In addition, because the resulting training corpus is smaller, using a single language is also substantially faster to train, speeding experimental cycles (Neubig and Hu, 2018). In this paper, we go a step further and ask: can we design an intelligent data selection strategy that allows us to choose the most relevant multilingual data to further boost NMT performance and training speed for LRLs?
Prior work has examined data selection from the view of domain adaptation, selecting good training data from out-of-domain text to improve in-domain performance. In general, these methods select data that score above a preset threshold according to some metric, such as the difference between in-domain and out-of-domain language models (Axelrod et al., 2011; Moore and Lewis, 2010) or sentence embedding similarity (Wang et al., 2017). Other works use all the data but weight training instances by domain similarity (Chen et al., 2017), or sample subsets of training data at each epoch (van der Wees et al., 2017). However, none of these methods are trivially applicable to multilingual parallel datasets, which usually contain many different languages from the same domain. Moreover, most of these methods need to pretrain language models or NMT models on a reasonable amount of data, and their accuracy can suffer in low-resource settings like those encountered for LRLs (Duh et al., 2013).
In this paper, we create a mathematical framework for data selection in multilingual MT that selects data from all languages, such that minimizing the training objective over the sampled data approximately minimizes the loss of the LRL MT model. The formulation leads to a simple, efficient, and effective algorithm that first samples a target sentence and then conditionally samples which of several source sentences to use for training. We name the method Target Conditioned Sampling (TCS). We also propose and experiment with several design choices for TCS that are especially effective for LRLs. On the TED multilingual corpus (Qi et al., 2018), TCS leads to large improvements of up to 2 BLEU on three of the four languages we test, and no degradation on the fourth, with only slightly increased training time. To our knowledge, this is the first successful application of data selection to multilingual NMT.

Multilingual Training Objective
First, we introduce our problem formally. We use upper-case letters X, Y to denote random variables, and the corresponding lower-case letters x, y to denote their actual values. Suppose our objective is to learn the parameters θ of a translation model from a source language s into a target language t. Let x be a source sentence from s, and y be the equivalent target sentence from t. Given a loss function L(x, y; θ), our objective is to find the optimal parameters θ* that minimize

    \theta^* = \arg\min_\theta \; \mathbb{E}_{x,y \sim P_s(X,Y)} [L(x, y; \theta)]    (1)

where P_s(X, Y) is the data distribution of s-t parallel sentences. Unfortunately, we do not have enough data to accurately estimate θ*; instead, we have a multilingual corpus of parallel data from languages {s_1, s_2, ..., s_n}, all into t. We therefore resort to multilingual training to facilitate the learning of θ. Formally, we want to construct a distribution Q(X, Y) with support over s_1-t, s_2-t, ..., s_n-t, and augment the s-t data with samples from Q during training, so that training also minimizes

    \mathbb{E}_{x,y \sim Q(X,Y)} [L(x, y; \theta)]    (2)

Intuitively, a good Q(X, Y) will make the expected loss in Eqn 2 correlate with Eqn 1 over the space of all θ, so that training on data sampled from Q(X, Y) can facilitate the learning of θ. Next, we explain a version of Q(X, Y) designed to promote efficient multilingual training.

Target Conditioned Sampling
We argue that the optimal Q(X, Y) should satisfy the following two properties. First, Q(X, Y) and P_s(X, Y) should be target invariant; the marginalized distributions Q(Y) and P_s(Y) should match as closely as possible:

    Q(Y) \approx P_s(Y)    (3)

This property ensures that Eqn 1 and Eqn 2 are optimizing towards the same target Y distribution.
Second, to have Eqn 2 correlated with Eqn 1 over the space of all θ, we need Q(X, Y) to be correlated with P_s(X, Y), which can be loosely written as

    Q(X, Y) \approx P_s(X, Y)    (4)

Because we also make the target invariance assumption in Eqn 3, this reduces to

    Q(X \mid Y) \approx P_s(X \mid Y)    (5)

We call this approximation of P_s(X|Y) by Q(X|Y) conditional source invariance. Based on these two assumptions, we define Target Conditioned Sampling (TCS), a training framework that first samples y ∼ Q(Y), and then conditionally samples x ∼ Q(X|y) during training. Note that P_s(X|Y = y) is the optimal back-translation distribution, which implies that back-translation (Sennrich et al., 2016) is a particular instance of TCS.
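To spell out the reduction from Eqn 4 to Eqn 5 (our addition; it follows from the chain rule together with Eqn 3):

    \begin{aligned}
    Q(X, Y) &= Q(X \mid Y)\, Q(Y) \\
            &\approx Q(X \mid Y)\, P_s(Y) && \text{by Eqn 3} \\
            &\approx P_s(X \mid Y)\, P_s(Y) = P_s(X, Y) && \text{when Eqn 5 holds}
    \end{aligned}

That is, under target invariance, matching the conditionals Q(X|Y) and P_s(X|Y) is exactly what makes Q(X, Y) track P_s(X, Y).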
Of course, we do not have enough s-t parallel data to obtain a good estimate of the true back-translation distribution P_s(X|y) (otherwise, we could simply use that data to learn θ). However, we posit that even a small amount of data is sufficient to construct an adequate data selection policy Q(X|y) for sampling sentences x from the multilingual data during training. Thus, the training objective that we optimize is

    \theta^* \approx \arg\min_\theta \; \mathbb{E}_{y \sim Q(Y)} \, \mathbb{E}_{x \sim Q(X \mid y)} [L(x, y; \theta)]    (6)

Next, in Section 2.3, we discuss the choices of Q(Y) and Q(X|y).
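As a concrete illustration, here is a minimal sketch of how Eqn 6 turns into a sampled training step. All names are ours for illustration: model.loss stands in for whatever computes L(x, y; θ), and q_table holds the precomputed conditional distributions Q(X|y) discussed in Section 2.3.

    import random

    def tcs_training_step(targets, sources_by_target, q_table, model):
        """One TCS step: sample y ~ Q(Y), then x ~ Q(X|y), and train on (x, y).

        targets: list of target sentences (the support of Q(Y), sampled uniformly).
        sources_by_target: dict mapping y to its candidate source sentences.
        q_table: dict mapping y to a probability list aligned with
            sources_by_target[y], i.e. the precomputed Q(X|y).
        model: assumed to expose loss(x, y), computing L(x, y; theta).
        """
        y = random.choice(targets)  # y ~ Q(Y), uniform over target sentences
        candidates = sources_by_target[y]
        x = random.choices(candidates, weights=q_table[y], k=1)[0]  # x ~ Q(X|y)
        return model.loss(x, y)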

Choosing the Sampling Distributions
Choosing Q(Y). Target invariance requires Q(Y) to match P_s(Y), the distribution over the target side of the s-t data. We have parallel data from multiple languages s_1, s_2, ..., s_n, all into t. Assuming no systematic inter-language distribution differences, a uniform sample of a target sentence y from the multilingual data can approximate P_s(Y). We thus simply sample y uniformly from the union of all the extra data.
Choosing Q(X|y). Choosing Q(X|y) to approximate P_s(X|y) is more difficult, and a number of methods could be used to do so. We note that, conditioning on the same target y and restricting the support of P_s(X|y) to the sentences that translate into y in at least one of s_1, ..., s_n, P_s(X = x|y) simply measures how likely x is to be a sentence of s. We thus define a heuristic function sim(x, s) that approximates the probability that x is a sentence in s, and follow the data augmentation objective of Wang et al. (2018) in defining this probability as

    Q(X = x \mid Y = y) = \frac{\exp(\mathrm{sim}(x, s) / \tau)}{\sum_{x'} \exp(\mathrm{sim}(x', s) / \tau)}    (7)

where τ is a temperature parameter that adjusts the peakiness of the distribution, and the sum runs over all candidate sources x' of y.
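Concretely, Eqn 7 is a temperature softmax over the candidate sources of a given y; a minimal sketch, with the similarity scores assumed to be precomputed by one of the measures in Section 2.5:

    import math

    def q_x_given_y(sim_scores, tau=1.0):
        """Normalize sim(x, s) scores for the candidates of one target y
        into the sampling distribution Q(X|y) of Eqn 7.

        sim_scores: one sim(x_i, s) value per candidate source x_i.
        tau: temperature; smaller values make the distribution peakier.
        """
        exps = [math.exp(score / tau) for score in sim_scores]
        total = sum(exps)
        return [e / total for e in exps]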

Algorithms
The formulation of Q(X, Y) allows one to sample multilingual data with the following algorithm:

1. Select the target y based on Q(Y); in our case we can simply use the uniform distribution.
2. Given the target y, gather all data (x_i, y) in s_1-t, s_2-t, ..., s_n-t and calculate sim(x_i, s).
3. Sample (x_i, y) based on Q(X|y).

The algorithm requires calculating Q(X|y) repeatedly during training. To reduce this overhead, we propose two implementation strategies: 1) Stochastic: compute Q(X|y) before training starts, and dynamically sample each minibatch using the precomputed Q(X|y); 2) Deterministic: compute Q(X|y) before training starts and always select x' = argmax_x Q(x|y) for training. The deterministic method is equivalent to setting τ, the parameter controlling the diversity of Q(X|y), to 0.
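A minimal sketch of the precomputation shared by both strategies, reusing the illustrative q_x_given_y helper from above (sim is assumed to score a candidate source against the LRL s):

    def precompute_tcs(sources_by_target, sim, tau):
        """Precompute, for every target y, both the full distribution Q(X|y)
        (used by the stochastic strategy) and its argmax candidate (used by
        the deterministic strategy, equivalent to tau -> 0)."""
        q_table, argmax_table = {}, {}
        for y, candidates in sources_by_target.items():
            scores = [sim(x) for x in candidates]   # sim(x_i, s)
            q_table[y] = q_x_given_y(scores, tau)   # stochastic: full Q(X|y)
            best = max(range(len(candidates)), key=lambda i: scores[i])
            argmax_table[y] = candidates[best]      # deterministic choice
        return q_table, argmax_table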

Similarity Measure
In this section, we define two formulations of the similarity measure sim(x, s), which is essential for constructing Q(X|y). Each of the similarity measures can be calculated at two granularities: 1) language level, where we calculate one similarity score for each language based on all of its training data; 2) sentence level, where we calculate a similarity score for each sentence in the training data.
Vocab Overlap provides a crude measure of surface-form similarity between two languages. It is efficient to calculate, and is often quite effective, especially for low-resource languages. Here we use the number of character n-grams that two languages share to measure the similarity between the two languages. The language-level similarity between s_i and s is

    \mathrm{sim}_{\text{vocab-lang}}(s_i, s) = \frac{|\mathrm{vocab}_k(s_i) \cap \mathrm{vocab}_k(s)|}{k}

where vocab_k(·) represents the top k most frequent character n-grams in the training data of a language. We then assign the same language-level similarity to all the sentences in s_i. This can be easily extended to the sentence level by replacing vocab_k(s_i) with the set of character n-grams of all the words in the sentence x.
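A minimal sketch of the language-level measure; the n-gram order and the cutoff k are not pinned down above, so n=4 and k=10000 below are illustrative defaults, not the paper's settings:

    from collections import Counter

    def top_k_char_ngrams(sentences, n=4, k=10000):
        """Top-k most frequent character n-grams in a language's training data."""
        counts = Counter()
        for sent in sentences:
            for i in range(len(sent) - n + 1):
                counts[sent[i:i + n]] += 1
        return {gram for gram, _ in counts.most_common(k)}

    def sim_vocab_lang(train_si, train_s, n=4, k=10000):
        """|vocab_k(s_i) & vocab_k(s)| / k for two lists of training sentences."""
        overlap = top_k_char_ngrams(train_si, n, k) & top_k_char_ngrams(train_s, n, k)
        return len(overlap) / k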
A Language Model trained on s can be used to calculate the probability that a data sequence belongs to s. Although it might not perform well if s does not have enough training data, it may still be sufficient for use in the TCS algorithm. The language-level metric is defined as

    \mathrm{sim}_{\text{LM-lang}}(s_i, s) = \exp\left( - \frac{\sum_{c_i \in s_i} \mathrm{NLL}_s(c_i)}{|\{c_i \in s_i\}|} \right)

where NLL_s(·) is the negative log likelihood of a character-level LM trained on data from s, and the sum runs over the sentences c_i in the training data of s_i. Similarly, the corresponding sentence-level metric is the LM probability of each individual sentence x.
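A sketch of this measure with a tiny character bigram LM standing in for the character-level LM (the text above does not specify the LM's order or smoothing, so the add-one-smoothed bigram model here is purely illustrative):

    import math
    from collections import Counter

    class CharBigramLM:
        """Add-one-smoothed character bigram LM; an illustrative stand-in
        for the character-level LM trained on the LRL s."""
        def __init__(self, sentences):
            self.bigrams, self.unigrams = Counter(), Counter()
            for sent in sentences:
                chars = ["<s>"] + list(sent)
                for a, b in zip(chars, chars[1:]):
                    self.bigrams[(a, b)] += 1
                    self.unigrams[a] += 1
            self.vocab_size = len(self.unigrams) + 1

        def nll(self, sent):
            """Negative log likelihood NLL_s(c) of one sentence."""
            chars = ["<s>"] + list(sent)
            return -sum(
                math.log((self.bigrams[(a, b)] + 1) /
                         (self.unigrams[a] + self.vocab_size))
                for a, b in zip(chars, chars[1:]))

    def sim_lm_lang(train_si, lm):
        """exp of the negative average sentence NLL over s_i's training data."""
        return math.exp(-sum(lm.nll(c) for c in train_si) / len(train_si))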
Baselines

We use multiple settings as baselines: 1) Bi: each LRL is paired with its related HRL, following Neubig and Hu (2018); the statistics of the LRLs and their corresponding HRLs are listed in Table 1. 2) All: we train a model on all 58 languages. 3) Copied: following Currey et al. (2017), we use the union of all English sentences as monolingual data by copying them to the source side.

Experiment Settings
A standard sequence-to-sequence NMT model (Sutskever et al., 2014) with attention is used for all experiments. Byte Pair Encoding (BPE) (Sennrich et al., 2016; Kudo and Richardson, 2018) with a vocabulary size of 8,000 is applied for each language individually. Details of other hyperparameters can be found in Appendix A.1.
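For example, per-language subword vocabularies of this kind can be trained with the SentencePiece toolkit of Kudo and Richardson (2018); the sketch below is one plausible setup, not the paper's actual preprocessing script, and the file paths and language list are placeholders:

    import sentencepiece as spm

    # One BPE model per language, each with an 8k vocabulary, matching the
    # per-language setting described above. Extend LANGS with the related
    # HRLs and English as appropriate.
    LANGS = ["aze", "bel", "glg", "slk"]
    for lang in LANGS:
        spm.SentencePieceTrainer.train(
            input=f"data/{lang}.train.txt",  # placeholder path
            model_prefix=f"bpe/{lang}",
            vocab_size=8000,
            model_type="bpe",
        )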

Results
We test both the Deterministic (TCS-D) and Stochastic (TCS-S) algorithms described in Section 2.4. For each algorithm, we experiment with the similarity measures introduced in Section 2.5. The results are listed in Table 2.
Of all the baselines, Bi in general has the best performance, while All, which uses all the data and takes much longer to train, generally hurts performance. This is consistent with findings in prior work (Neubig and Hu, 2018). Copied is only competitive for slk, which indicates that the gain of TCS is not simply due to extra English data.
TCS-S combined with language-level similarity achieves the best performance on all four languages, improving by around 1 BLEU over the best baseline for aze, and by around 2 BLEU for glg and slk. For bel, TCS leads to no degradation while taking much less training time than the best baseline, All.
TCS-D vs. TCS-S. Both algorithms, when using language-level similarity, improve over the baseline for all languages. TCS-D is quite effective without any extra sampling overhead. TCS-S performs slightly better still, at the cost of a small sampling overhead.

Back-translation. We also compare TCS with back-translation. To make it comparable to Bi, we use the sentence from the LRL and its most related HRL if there is one for the sampled y, but use the back-translated sentence otherwise. Table 2 shows that for slk, back-translation achieves results comparable to the best similarity measure, mainly because slk has enough data to train a reasonable back-translation model. However, it performs much worse for aze and bel, which have the smallest amounts of data.

Effect on SDE
To ensure that our results also generalize to other models, specifically ones tailored for better sharing of information across languages, we also test TCS on a slightly different multilingual NMT model using soft decoupled encoding (SDE; Wang et al., 2019), a word encoding method that assists lexical transfer in multilingual training.
The results are shown in Table 3. The overall scores are stronger than with the standard model, but the best TCS setting still outperforms the SDE baseline by 0.5 BLEU for aze, and by around 2 BLEU for the other three languages, suggesting that the gains from data selection are orthogonal to those from better multilingual training methods.

Effect on Training Curves
In Figure 1, we plot the development perplexity of all four languages during training. Compared to Bi, TCS always achieves lower development perplexity, with only slightly more training steps. Although it uses data from all languages, TCS decreases the development perplexity at a rate similar to Bi. This indicates that TCS is effective at sampling multilingual data that is helpful for training NMT models for LRLs.

Conclusion
We propose Target Conditioned Sampling (TCS), an efficient data selection framework for multilingual data that constructs a data sampling distribution to facilitate the NMT training of LRLs. TCS brings up to 2 BLEU improvements over strong baselines with only a slight increase in training time.

Table 1: Statistics of our datasets.