Understanding Data Augmentation in Neural Machine Translation: Two Perspectives towards Generalization

Many Data Augmentation (DA) methods have been proposed for neural machine translation. Existing works measure the superiority of DA methods in terms of their performance on a specific test set, but we find that some DA methods do not exhibit consistent improvements across translation tasks. Based on this observation, this paper makes an initial attempt to answer a fundamental question: what benefits, consistent across different methods and tasks, does DA in general obtain? Inspired by recent theoretical advances in deep learning, the paper studies DA from two perspectives on the generalization ability of a model: input sensitivity and prediction margin. Both are defined independently of any specific test set and may therefore lead to findings with relatively low variance. Extensive experiments show that relatively consistent benefits across five DA methods and four translation tasks are achieved regarding both perspectives.


Introduction
Data Augmentation (DA) is a training paradigm that has proved very effective in many modalities (Park et al., 2019; Perez and Wang, 2017; Sennrich et al., 2016a), especially for classification (Perez and Wang, 2017). In structured domains, Neural Machine Translation (NMT) is the frontier of DA research (Sennrich et al., 2016a; Norouzi et al., 2016; Zhang and Zong, 2016; Fadaee et al., 2017; Wang et al., 2018; Edunov et al., 2018; Fadaee and Monz, 2018). However, by investigating a variety of DA methods, we find that their test performance does not improve consistently across different translation tasks, a phenomenon that can already be observed in Wang et al. (2018). A possible reason is that an evaluation metric computed on a specific test set, as opposed to the whole data population that generates all possible data, has large variance, which leads to the inconsistency. This evaluation dilemma is also recognized and explored by Recht et al. (2018, 2019) and Werpachowski et al. (2019), and is especially notorious for language generation tasks (Chaganty et al., 2018; Hashimoto et al., 2019), where the evaluation metrics, e.g. BLEU (Papineni et al., 2001), are extrinsic and rely heavily on the references provided. Therefore, we ask a fundamental question: what benefits, more consistent across different DA methods and translation tasks, can DA in general obtain?
∗ Work done at Tencent AI Lab.
A direct answer to the above question is to use the generalization gap (Kawaguchi et al., 2018), defined as the difference between population risk and empirical risk. This measure does not rely on any specific test set and accurately depicts generalization, but it is intractable to compute. Recently, many theorists have therefore proposed either non-vacuous generalization bounds (Dziugaite and Roy, 2017) or novel generalization measures (Novak et al., 2018; Bartlett et al., 2017; Neyshabur et al., 2017; Jiang et al., 2019) that roughly reflect the gap. Inspired by them, we propose to understand the benefits of DA from two perspectives: input sensitivity and prediction margin. The two underlying measures are adapted from Novak et al. (2018) and Bartlett et al. (2017) and can be computed on the training samples alone to unveil the consistent benefits of DA. Under a carefully designed fair setting over four different translation tasks, we examine five methods from two main categories of DA and compare them with a model trained without DA. The experiments demonstrate the following findings: a) DA methods exhibit more consistent effects across different translation tasks in terms of both measures; b) DA methods can either alleviate input sensitivity or promote prediction margin.
By and large, our main contributions are:
• We make an initial attempt to understand the essence of DA in NMT by investigating its benefits which are relatively consistent across five DA methods and four translation tasks.
• We highlight two perspectives towards generalization to measure the benefits of DA in NMT and study them with carefully designed fair experiments.

Training Objective Decomposition
Given the train set $T$, the baseline NMT model $p_\theta(y|x)$ without DA is trained under the empirical data distribution $\hat{p}(X, Y | T)$ through maximum likelihood estimation:

$$\hat{\theta} = \arg\max_{\theta} \; \mathbb{E}_{x, y \sim \hat{p}(X, Y | T)} \big[ \log p_\theta(y | x) \big], \tag{1}$$

where $\hat{p}$ is a mixture of Dirac distributions concentrated around each training instance with uniform mixture coefficients ($1/|T|$). We then define the augmentation (AUG) model as a conditional distribution over the train set, $q(X, Y | T)$.¹ Under the AUG model, the training objective becomes:

$$\hat{\theta} = \arg\max_{\theta} \; \mathbb{E}_{x, y \sim q(X, Y | T)} \big[ \log p_\theta(y | x) \big]. \tag{2}$$

More realistically, for any DA method in any training run, we can collect the augmented instances to form a set $A$ distinct from $T$ and consider the curriculum of mixing $A$ with the original train set $T$. Since we would like to derive a conceptual framework that reflects this form of importance weighting, we further decompose the AUG model into a linear interpolation (with coefficient $\alpha$) of $\hat{p}(X, Y | T)$ and an augmentation distribution $q_{\mathrm{AUG}}(X, Y | T)$:

$$q(X, Y | T) = \alpha \, \hat{p}(X, Y | T) + (1 - \alpha) \, q_{\mathrm{AUG}}(X, Y | T), \tag{3}$$

where $\alpha$ controls the mixture ratio within a batch during SGD training. This ratio has been found to be an important factor influencing final performance (Sennrich et al., 2016a; Fadaee et al., 2017; Edunov et al., 2018; Fadaee and Monz, 2018).

¹ In this paper, we do not consider using monolingual data for DA and thus condition only on bilingual data, since monolingual data would bring in monolingual data selection, discussed in Fadaee and Monz (2018), as a further factor influencing the performance of different DA methods; we leave this factor for future study.

Method     Fr⇒En       En⇒Fr       Zh⇒En       En⇒De
Baseline   38.38 (5)   38.88 (6)   17.25 (6)   26.19 (4)

Key factors. Through Eq. 3, we can identify two key factors for conducting fair experiments: a) the number of SGD updates on every original training instance, which determines how much the model learns from $T$; b) the mixture ratio, which determines how much the model learns online from $A$. Together, the two factors balance the learning of translation knowledge.
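As a minimal sketch of the interpolation referred to as Eq. 3, the snippet below fills each slot of a mini-batch from the original train set $T$ with probability $\alpha$ and from the augmented set $A$ otherwise. The function name `sample_mixed_batch`, the toy sentence pairs, and the convention that $\alpha$ weights the original data are illustrative assumptions, not the setup used in the experiments.

```python
import random


def sample_mixed_batch(train_set, aug_set, alpha, batch_size, rng=None):
    """Draw a batch in which each slot comes from train_set with
    probability alpha and from aug_set otherwise.

    Assumption: alpha is the weight of the original data in the mixture.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    rng = rng or random.Random()
    batch = []
    for _ in range(batch_size):
        # rng.random() is in [0, 1), so alpha=1 always picks train_set
        # and alpha=0 always picks aug_set.
        source = train_set if rng.random() < alpha else aug_set
        batch.append(rng.choice(source))
    return batch


# Toy usage with hypothetical (source, target) pairs.
T = [("bonjour", "hello"), ("merci", "thanks")]
A = [("bonjour", "hi"), ("merci", "thank you")]
batch = sample_mixed_batch(T, A, alpha=0.5, batch_size=8, rng=random.Random(0))
```

Fixing the number of draws from $T$ across methods while varying only `alpha` is one way to keep the two key factors separate when comparing DA methods.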

Settings and Main Performance
Settings. By carefully controlling the above two factors, we conduct fair and extensive experiments with Transformer (Vaswani et al., 2017) on four translation tasks for five DA methods. Fairseq (Ott et al., 2019) is used as our codebase. We use the standard benchmarks IWSLT17 En-Fr, WMT19 Zh-En, and WMT19 En-De, and train both translation directions on the IWSLT corpus. The five DA methods are briefly summarized as follows:
• RAML: reward-augmented maximum likelihood training, which augments the target side with a sampling distribution $P(Y | Y^*)$ concentrated around $Y^*$ (Norouzi et al., 2016).
• SwitchOut (SO): similar to RAML, but also adds the same kind of augmentation to the source side (Wang et al., 2018).
• Self-training (ST): fixes the source side and uses a forward NMT model to generate the target side (Zhang and Zong, 2016).
• Target-agree (TA): similar to ST, but uses a forward NMT model with a right-to-left decoder.
• Back-translation (BT): fixes the target side and uses a backward NMT model to generate the source side (Sennrich et al., 2016a).
The implementations of RAML and SO are borrowed from the appendix of Wang et al. (2018).
To measure the degree of consistency, we use Kendall's coefficient of concordance (Kendall and Smith, 1939; Mazurek, 2011) to evaluate the correlation of the rankings produced on the four translation tasks (Appendix C). A value close to 1 indicates strong consistency (correlation) among the rankings. We call this correlation value the Cross-Task Consistency measure, or CTC. The CTC for the BLEU measure is 0.62, which indicates weak consistency. This phenomenon might result from using a single specific test set as a substitute for the whole data population in evaluation. In the next section, we introduce two measures that are more consistent (with CTC values closer to 1). To some extent they reflect model generalization, and they are easy to compute as well.
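For readers unfamiliar with the statistic, the following is a minimal sketch of Kendall's coefficient of concordance in its standard form without tie correction; whether Appendix C uses exactly this variant is an assumption. The function name `kendalls_w` and the example rankings are hypothetical, with one ranking of the methods per translation task.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance W for m rankings of n items.

    rankings[t][i] is the rank (1..n) that task t assigns to method i.
    Assumes no tied ranks (no tie correction is applied).
    """
    m = len(rankings)
    if m < 2:
        raise ValueError("need at least two rankings")
    n = len(rankings[0])
    if n < 2 or any(len(r) != n for r in rankings):
        raise ValueError("all rankings must cover the same n >= 2 items")
    # Sum of ranks each item receives across the m rankings.
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    # Sum of squared deviations of the rank sums from their mean.
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Four tasks that rank six methods identically agree perfectly.
same = [[1, 2, 3, 4, 5, 6]] * 4
w_same = kendalls_w(same)
```

W ranges from 0 (no agreement among tasks) to 1 (identical rankings on every task).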

Two Measures Towards Generalization
We attempt to understand the benefits that DA can obtain by quantifying input sensitivity and prediction margin. The two measures are adapted from Novak et al. (2018) and Bartlett et al. (2017). Large-scale experiments have shown them to be correlated with model generalization. Our main purpose here is to use them to unveil the consistency (measured by CTC) of DA across different methods and translation tasks. The next two subsections define the two measures and report their statistics on subsamples of the train set.

Input Sensitivity
Input sensitivity is the sensitivity of the loss computed from the model to a minor change in the input representation. Given a point of interest $x$, the original form in Novak et al. (2018) is defined as the expected Jacobian norm of the loss vector $\log p_\theta(\cdot|x)$, where $p_\theta$ is a softmax classifier:

$$\mathbb{E}_{x}\big[ \| J(x) \|_F \big], \tag{4}$$

where $J(x) = \partial \log p_\theta(\cdot|x) / \partial x^{\top}$ and $\|\cdot\|_F$ is the Frobenius (L2) norm of the matrix. The paper also suggests a quantity that is more predictive of generalization ability: take advantage of the label $y$ of $x$ and compute only the L2 norm of the slice of the Jacobian matrix indexed by the label. We adopt the latter measure, which is the norm of the gradient of the loss scalar indexed by $y$ with respect to $x$:

$$\mathbb{E}_{x, y}\left[ \left\| \frac{\partial \log p_\theta(y|x)}{\partial x} \right\|_2 \right]. \tag{5}$$

If $x \in \mathbb{R}^d$ lies in a space with differential structure, we can apply Eq. 5 directly. But in NMT the naive representation of an instance $(x, y)$ is the sequence of token indices given by the vocabulary, so we cannot compute the gradient of the loss with respect to $x$. We follow Sundararajan et al. (2017) and use the result of the embedding lookup of $x$ as its learned representation, denoted $\mathrm{Emb}(x) \in \mathbb{R}^{L_x \times d}$, where $L_x$ is the length of the input and $d$ the size of the embedding. We regard the translation model $p_\theta(y|x)$ as a function that takes $\mathrm{Emb}(x)$ as input, decomposes over the steps of $y$, and outputs a scalar average log-likelihood, $L_{x,y} = \frac{1}{L_y} \sum_t \log p_\theta(y_t | y_{<t}, x)$. Moreover, initial experiments showed that using only the single original $x_i$ to evaluate the gradient still results in inconsistency, because the notion of locality differs from the continuous setting: for language input, locality is defined between discrete inputs in the neighborhood of $x_i$. So we evaluate the sensitivity of $x_i$ by averaging gradient norms over its $k$ nearest neighbors $x_{i(j)} \in \mathrm{kNN}[x_i]$, found through cosine similarity between word embeddings. We set $k$ to 5 in our experiments so that the words among the $k$ nearest neighbors have similar semantic meaning. Formally, we define the input sensitivity of an NMT model as:

$$\mathbb{E}_{x, y}\left[ \frac{1}{L_x} \sum_{i=1}^{L_x} \frac{1}{k} \sum_{x_{i(j)} \in \mathrm{kNN}[x_i]} \left\| \left. \frac{\partial L_{x,y}}{\partial \mathrm{Emb}(x)_i} \right|_{x_i = x_{i(j)}} \right\|_2 \right], \tag{6}$$

where $\mathrm{Emb}(x)_i$ is the embedding lookup of the $i$-th token index, so we compute the average token-wise gradient norm of each instance.

Figure 1: Sensitivity-binned average token frequency statistics. Each point represents a bin; its x and y coordinates are the token-level average sensitivity difference between the DA method and the baseline and the token-level average frequency.
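The kNN-averaged gradient norm can be illustrated on a toy model. The sketch below is not an NMT model: it assumes a stand-in classifier whose logits are a linear map `W` of the mean token embedding, for which the gradient of $\log p(y|x)$ with respect to each token embedding has the closed form $(W_y - \sum_v p_v W_v)/L_x$. It also assumes one reading of the neighbor averaging, namely substituting $x_i$ by each neighbor and re-evaluating the gradient norm. All names (`knn`, `grad_norm`, `input_sensitivity`) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def knn(token, emb, k):
    """The k tokens whose embeddings are most cosine-similar to `token`."""
    others = [t for t in emb if t != token]
    others.sort(key=lambda t: cosine(emb[token], emb[t]), reverse=True)
    return others[:k]


def grad_norm(tokens, label, emb, W):
    """L2 norm of d log p(label|x) / d Emb(x)_i for the toy model
    logits = W @ mean_i Emb(x)_i (the same for every position i)."""
    L, d = len(tokens), len(W[0])
    mean = [sum(emb[t][j] for t in tokens) / L for j in range(d)]
    logits = [sum(w[j] * mean[j] for j in range(d)) for w in W]
    mx = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(z - mx) for z in logits]
    probs = [e / sum(exps) for e in exps]
    grad = [(W[label][j] - sum(p * w[j] for p, w in zip(probs, W))) / L
            for j in range(d)]
    return math.sqrt(sum(g * g for g in grad))


def input_sensitivity(tokens, label, emb, W, k):
    """Average, over positions and over the k nearest neighbours of each
    token, of the gradient norm after substituting the neighbour."""
    total = 0.0
    for i, tok in enumerate(tokens):
        norms = [grad_norm(tokens[:i] + [nb] + tokens[i + 1:], label, emb, W)
                 for nb in knn(tok, emb, k)]
        total += sum(norms) / len(norms)
    return total / len(tokens)


# Toy vocabulary with 2-d embeddings and a 3-class output layer.
emb = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
sens = input_sensitivity(["a", "c"], label=0, emb=emb, W=W, k=2)
```

For a real Transformer the closed-form gradient would be replaced by automatic differentiation of $L_{x,y}$ with respect to the embedding output.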
We use subsamples of the train set to approximately compute the expectation in Eq. 6, and the overall statistics are shown in Table 2. Similar to Table 1, the values for the DA methods are reported relative to the baseline. The first thing to notice is that the ranking is more stable across tasks (CTC=0.72). The table also shows that, for input $x$, DA in general reduces the gradient norm of the prediction loss with respect to $\mathrm{Emb}(x)_i$, which indicates that DA yields a model that is more stable under data corruption.
To further understand what effect DA in general has on each input token type, we compute the sensitivity difference between the baseline and one DA method on the same token type and sort the token types with a positive difference (which means DA reduces the sensitivity of that token type) by that difference. We then divide the sorted types into ten bins and compute the average token type frequency of each bin. As shown in Figure 1, DA in general reduces the sensitivity of token types with relatively low frequency more than that of high-frequency types, and thus may improve the translation quality of low-frequency token types.
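The binning procedure can be sketched as follows. The function name `bin_by_sensitivity_gain`, the dictionary inputs, and the choice of contiguous, near-equal-sized bins are assumptions made for illustration.

```python
def bin_by_sensitivity_gain(baseline_sens, da_sens, freq, num_bins=10):
    """Bin token types by how much a DA method reduces their sensitivity.

    baseline_sens, da_sens: dict mapping token type -> average sensitivity.
    freq: dict mapping token type -> frequency in the train set.
    Returns one (average gain, average frequency) pair per non-empty bin.
    """
    # Positive gain means the DA model is less sensitive than the baseline.
    gains = {t: baseline_sens[t] - da_sens[t]
             for t in baseline_sens if t in da_sens}
    kept = sorted((t for t, g in gains.items() if g > 0),
                  key=lambda t: gains[t])
    n = len(kept)
    bins = []
    for b in range(num_bins):
        # Contiguous slices of near-equal size over the sorted types.
        chunk = kept[b * n // num_bins:(b + 1) * n // num_bins]
        if not chunk:
            continue
        avg_gain = sum(gains[t] for t in chunk) / len(chunk)
        avg_freq = sum(freq[t] for t in chunk) / len(chunk)
        bins.append((avg_gain, avg_freq))
    return bins


# Toy statistics for four hypothetical token types.
baseline = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 1.0}
augmented = {"a": 0.5, "b": 1.0, "c": 1.0, "d": 2.0}
frequency = {"a": 100, "b": 10, "c": 1, "d": 5}
points = bin_by_sensitivity_gain(baseline, augmented, frequency, num_bins=3)
```

Each returned pair corresponds to one point in a plot of sensitivity difference against average frequency.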

Prediction Margin
Margin is a classic concept in support vector machines (Vapnik, 2013), where it is defined as the geometric distance between the support vectors and the decision boundary. A larger margin implies better generalization. In the nonlinear case, it reflects the distance that a correctly classified input representation of class $i$ must move to reach the decision boundary between $i$ and any other class $j$ (Jiang et al., 2019). However, since the decision boundary has no analytical form due to nonlinearity, computing the geometric distance is intractable. In our setting we regard the NMT model as doing step-wise classification with $z = (x, y_{<t})$ as the input feature and $y_t$ as the label. In Bartlett et al. (2017), the original definition of the margin of a correctly predicted input is:

$$m_{z, y_t} = \frac{p_\theta(y_t | z) - \max_{v' \neq y_t} p_\theta(v' | z)}{R \, \|x\|_2 / N}, \tag{7}$$

where $y_t$ is the ground-truth label, $v'$ another class type, $R$ the spectral complexity of the model, and $N$ the number of training instances in the train subsamples used for computing the margins. We simplify Eq. 7 to consider only the numerator. The reasons are: a) under the same model architecture, the values of $R$ are very close across different DA methods; b) we can omit $\|x\|_2 / N$ since it remains unchanged as well.
In this way, we can map every $(z, y_t)$ to a margin with label type $y_t = v$, where $v \in V_{\mathrm{tgt}}$:

$$m^{v}_{z, y_t} = p_\theta(y_t | z) - \max_{v' \neq y_t} p_\theta(v' | z). \tag{8}$$

So for every target token type $v$ we can collect a set of margins $\{m^{v}_{z, y_t}\}$, and the margin sets of all token types are combined into the total margin set $\bigcup_v \{m^{v}_{z, y_t}\}$. Following Neyshabur et al. (2017), we do not compute the minimum margin of the total set, which can be highly sensitive to outliers. Instead, if the total set has cardinality $N'$, we take the $\epsilon N'$-th smallest margin of the set as the overall prediction margin, with a tolerance coefficient $\epsilon \in [0, 0.1]$ ($\epsilon$ is set to 0.001). We can also obtain a token-wise prediction margin from $\{m^{v}_{z, y_t}\}$, which is the prediction margin of a specific token type $v$.
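A minimal sketch of the simplified margin and the outlier-tolerant overall margin follows. It assumes that the margin of a decoding step is the gap between the probability of the gold token and the largest competing probability, that only correctly predicted steps are kept, and that $\epsilon N'$ is rounded up to an integer rank; the function names and the rounding convention are illustrative choices, not taken from the paper.

```python
import math


def step_margin(probs, gold):
    """Probability of the gold token minus the largest other probability."""
    best_other = max(p for v, p in enumerate(probs) if v != gold)
    return probs[gold] - best_other


def collect_margins(steps):
    """Margins of correctly predicted steps only.

    steps: iterable of (probs, gold) pairs, one per decoding step.
    """
    margins = []
    for probs, gold in steps:
        m = step_margin(probs, gold)
        if m > 0:  # the gold token is the top prediction
            margins.append(m)
    return margins


def overall_margin(margins, eps=0.001):
    """The ceil(eps * N')-th smallest margin (1-indexed, at least the 1st)."""
    if not margins:
        raise ValueError("margin set is empty")
    ordered = sorted(margins)
    rank = max(1, math.ceil(eps * len(ordered)))
    return ordered[rank - 1]


# Toy decoding steps over a 3-word target vocabulary.
steps = [([0.7, 0.2, 0.1], 0), ([0.3, 0.6, 0.1], 1), ([0.5, 0.4, 0.1], 1)]
margins = collect_margins(steps)   # the third step is mispredicted
margin = overall_margin(margins)
```

Grouping the collected margins by gold token type before calling `overall_margin` would give the token-wise variant.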
The overall prediction margins are listed in Table 3. The relative rank is highly consistent across the four translation tasks (CTC=0.98). Although RAML and SO seem to be inferior to the baseline, the other DA methods improve the margin in general. We give a possible explanation for this in the next subsection. As in the previous subsection, we also report the average token type frequency of each margin-binned token group in Figure 2 and find that DA, in general, brings a larger margin improvement for low-frequency tokens.

Discussion
Why use these two measures? We have conducted a relatively complete survey of recent measures of generalization ability proposed by the deep learning community, such as model complexity (Neyshabur et al., 2017), flatness (Dinh et al., 2017), stiffness (Fort et al., 2019), and the second-order Hessian with respect to the input or the number of linear regions in hidden layers (Novak et al., 2018; Montufar et al., 2014). Some of these measures, such as linear regions, have complex definitions; others, such as the Hessian and stiffness, are very expensive to compute for models as large as Transformer. We also computed weight norms in the different forms proposed by Neyshabur et al. (2017) and found no regularity, which suggests that norm-based complexity measures for architectures like Transformer or convolutional/recurrent neural networks might be very different from those for simple feedforward networks; this might still be an open problem in the theoretical deep learning community. Therefore, given their ease of computation and large-scale empirical evidence, we choose sensitivity and margin as the measures.

Why no absolute consistency between the two measures? In Sections 3.1 and 3.2, the two measures are not fully consistent with each other: under the margin-based measure, RAML and SO do not exhibit superiority over the baseline as they do under the sensitivity measure. One reason might be that, although the measures have been shown empirically to reflect generalization, each of them offers only one view of generalization. Specifically, in recent generalization theory (Novak et al., 2018; Bartlett et al., 2017), the measures are evaluated between models with extremely evident differences in generalization ability (measured by the difference in test performance), for example, between models trained with random labels and with true labels. In contrast, our comparison is among well-trained models with similar capacity, which makes it challenging to obtain very consistent statistics through a single view. This may inspire us to combine multiple views of model training to design better measures with stronger correlation.

Conclusion and Future Work
This paper aims to deliver relatively consistent measures of the benefits of DA, motivated by the phenomenon of inconsistent BLEU improvement across translation tasks. As expected, the two proposed measures exhibit relative consistency (especially prediction margin) on five DA methods across four translation tasks, which demonstrates that DA can benefit the model through improved sensitivity or prediction margin, especially for low-frequency words.
However, this study of intrinsic evaluation and of the unreasonable effectiveness of DA should be just a start. DA is a tradeoff between noise and knowledge injection, so it could be a nice theoretical direction to think about DA under the statistical query model (Kearns, 1998) with translation between formal languages (ws-, 2019). This could inspire another essential question: what are the intrinsic properties of the augmented data (Branchaud-Charron et al., 2019) that matter in the discrete domain? Applications such as active data selection (Coleman et al., 2019) guided by margin or sensitivity can be derived. In general, understanding the behavior of NMT models (not only with DA) beyond BLEU (Neubig et al., 2019) should be taken seriously; for example, designing a behavior suite like that of Osband et al. (2019) would be most valuable.