How Effective is Task-Agnostic Data Augmentation for Pretrained Transformers?

Task-agnostic forms of data augmentation have proven widely effective in computer vision, even on pretrained models. In NLP, similar results are reported most commonly in low data regimes, for non-pretrained models, or only situationally for pretrained models. In this paper we ask how effective these techniques really are when applied to pretrained transformers. Using two popular varieties of task-agnostic data augmentation (not tailored to any particular task), Easy Data Augmentation (Wei and Zou, 2019) and Back-Translation (Sennrich et al., 2015), we conduct a systematic examination of their effects across 5 classification tasks, 6 datasets, and 3 variants of modern pretrained transformers, including BERT, XLNet, and RoBERTa. We observe a negative result, finding that techniques which previously reported strong improvements for non-pretrained models fail to consistently improve performance for pretrained transformers, even when training data is limited. We hope this empirical analysis helps inform practitioners where data augmentation techniques may confer improvements.


Introduction
"Task-agnostic" data augmentations -those which are not tailored to a task, but are broadly applicable across the visual or textual domain -have long been a staple of machine learning. Task-agnostic data augmentation techniques for computer vision, such as image translation, rotation, shearing, and contrast jittering, have achieved considerable success, given their ease of use, and wide-spread applicability (Cubuk et al., 2018;Perez and Wang, 2017). In natural language processing, benefits of data augmentation have usually been observed where the augmentations are suited to the task: as with backtranslation for machine translation (Edunov et al., * equal contribution 2018;Xia et al., 2019), or negative sampling for question answering and document retrieval (Zhang et al., 2017;Yang et al., 2019a;Xiong et al., 2020). Outside of application-tailored augmentations, improvements are primarily reported on autoregressive models without unsupervised pretraining or contextual embeddings, such as LSTMs and CNNs, and even then in low data regimes (Zhang et al., 2015;Coulombe, 2018;Wei and Zou, 2019;Yu et al., 2018). Additionally, in computer vision taskagnostic augmentations continue to report benefits when applied to pretrained representations (Gu et al., 2019). However, in NLP it is less clear whether these general augmentations benefit modern Transformer architectures with unsupervised pretraining at scale.
We pose the question: to what extent do modern NLP models benefit from task-agnostic data augmentations? In this paper, we provide empirical results across a variety of tasks, datasets, architectures, and popular augmentation strategies. Among data augmentation techniques, we select Easy Data Augmentation (Wei and Zou, 2019) and Back-Translation (Sennrich et al., 2015), EDA and BT respectively. Both are popular task-agnostic options, and both report significant gains for LSTMs on a wide variety of datasets. We apply these techniques to 6 classification-oriented datasets, spanning 5 tasks with varying linguistic objectives and complexity. For fair comparison, we tune each of BERT, XLNET, and ROBERTA extensively, allocating an equal budget of trial runs to models trained with and without augmentations. As a separate dimension, we also vary the availability of training data to understand under what specific conditions data augmentation is beneficial.

Augmentation Techniques
Among the many variations of data augmentation, two families are widely used in NLP: back translation and text editing. Back Translation (BT): We use an English-to-German machine translation model (Ott et al., 2018) and a German-to-English model (Ng et al., 2019). 1 We selected German due to its strong results as a pairing with English for back translation, as reported in Yu et al. (2018) and Sennrich et al. (2015). We translate each English sentence to one German sentence, and back to six candidate English sentences. From these candidates we obtain the best results by sampling the sentence most distant from the original English sentence, measured by word edit distance. In manual inspection, this approach produced the most diverse paraphrases, though the strategy needs to be tailored to the machine translation systems employed. The overall aim is to maximize linguistic variety while retaining sentence coherency.
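The candidate-selection step can be sketched as follows. `word_edit_distance` is a plain word-level Levenshtein distance; the translation models themselves are assumed to exist elsewhere and only their candidate outputs appear here:

```python
def word_edit_distance(a, b):
    """Levenshtein distance computed over whitespace-separated words."""
    a, b = a.split(), b.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (wa != wb)))  # substitution
        prev = cur
    return prev[-1]


def most_distant_paraphrase(original, candidates):
    """Pick the back-translated candidate farthest from the original sentence."""
    return max(candidates, key=lambda c: word_edit_distance(original, c))
```

Maximizing edit distance to the original favors genuine paraphrases over near-verbatim round trips, at the cost of occasionally admitting noisier translations.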
Easy Data Augmentation (EDA): Following Wei and Zou (2019) we employ a combination of popular text editing techniques that have shown strong performance on LSTMs. 2 Text edits include synonym replacement, random swap, random insertion, and random deletion. To improve upon EDA further, we enforce part-of-speech consistency for synonym selection. As an example, the verb "back" in the phrase "to back the government" will not be replaced by "rear", which is a synonym of the noun "back".
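A minimal sketch of the POS-consistent synonym replacement step; the synonym table and pre-tagged input below are toy stand-ins for a real lexicon (e.g. WordNet) and a POS tagger, neither of which is shown here:

```python
import random

# Toy synonym table keyed by (word, POS) - an assumption standing in for a
# real thesaurus lookup restricted to the tagged part of speech.
SYNONYMS = {
    ("back", "VERB"): ["support", "endorse"],
    ("back", "NOUN"): ["rear", "spine"],
    ("film", "NOUN"): ["movie", "picture"],
}


def pos_aware_synonym_replace(tagged_tokens, p=0.3, rng=None):
    """Replace each (word, pos) token with a same-POS synonym with probability p."""
    rng = rng or random.Random(0)
    out = []
    for word, pos in tagged_tokens:
        options = SYNONYMS.get((word, pos))
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)
```

With this constraint, the verb "back" can become "support" but never "rear", since the latter is listed only under the noun reading.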

Experimental Setup
To conduct a fair assessment of each data augmentation technique, we ensure three properties of our experimental setup: (I) our tuning procedure mimics that of a machine learning practitioner; (II) the selected hyperparameters cannot be improved enough to change our conclusions; and (III) each strategy is evaluated with an equal number of trial runs. 3 We experiment with 3 types of Transformers, which each use slightly different pretraining strategies. BERT and ROBERTA are both pretrained with Masked Language Modeling, but with different auxiliary objectives and numbers of training steps. XLNET is pretrained with its own "Permutation Language Modeling". For each model and dataset, 1k examples are randomly selected for each of the validation and test sets. Separately from these fixed sets, we iterate over five training data sizes N ∈ {500, 1000, 2000, 3000, Full} to simulate data scarcity.

Algorithm 1:
  for each augmentation α ∈ {NoDA, BT, EDA} do
    // Find best hyperparameters Hα for augmentation α
    Hα ← RANDOMSEARCH(M, Dtrain, K, T1)
    Mα ← M.USE(Hα)
    // Compute validation scores for augmentation α
    for s = 1 to T2 do
      Scores ← TRAIN(Mα, Dtrain, seed=s)
    end for
    // Select test scores using best validation scores
    µα, σα ← SELECTBEST(Scores, 10)
  end for
  return (µNoDA, σNoDA), (µBT, σBT), (µEDA, σEDA)

Given a model M, dataset D, and training set size N, we allocate an equal number of training runs to No Augmentation (NO DA), EDA, and BT. For each setting, we define continuous ranges for the learning rate, dropout, and number of epochs. The EDA and BT settings also tune a "dosage" parameter τ ∈ {0.5, 1, 1.5, 2} governing augmentation: N × τ is the quantity of augmented examples added to the original training set.
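The two-stage tuning procedure can be sketched in Python as follows; `train_fn` and `sample_hparams` are placeholders for real fine-tuning code and for the random search's configuration sampler:

```python
import statistics


def tune_and_evaluate(train_fn, sample_hparams, K=30, T1=3, T2=20, top=10):
    """Sketch of the two-stage tuning procedure.

    train_fn(hparams, seed) -> (val_acc, test_acc) and sample_hparams() are
    stand-ins; a real run would fine-tune a Transformer for each call.
    """
    # Stage 1: random search, scoring each configuration by its mean
    # validation accuracy over T1 random training seeds.
    best_h = max(
        (sample_hparams() for _ in range(K)),
        key=lambda h: statistics.mean(train_fn(h, seed=s)[0] for s in range(T1)),
    )
    # Stage 2: retrain the best configuration over T2 random seeds.
    runs = [train_fn(best_h, seed=s) for s in range(T2)]
    # Keep only the `top` runs by validation accuracy, then report the
    # mean and spread of their *test* accuracies.
    top_runs = sorted(runs, key=lambda r: r[0], reverse=True)[:top]
    tests = [t for _, t in top_runs]
    return statistics.mean(tests), statistics.stdev(tests)
```

Discarding the bottom half of seed runs mirrors the paper's guard against the seed sensitivity of fine-tuned Transformers.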
First, we conduct a RANDOMSEARCH over K = 30 hyperparameter choices, each repeated for T1 = 3 trials with differing random training seeds. As shown in Algorithm 1, this stage returns the optimal hyperparameter choices Hα for each augmentation type α ∈ {NoDA, BT, EDA}. The best hyperparameters are selected by mean validation accuracy over the random seed trials. In the second stage, a model with these best hyperparameters (Algorithm 1, line 6) is trained over T2 = 20 random seed trials. 4 Finally, the 10 best trials by validation accuracy are selected for each setting (line 12). We report the mean and 95% confidence intervals of their test results. The bottom 10 trials are discarded to account for the high accuracy variance of pretrained language models with respect to weight initialization and data order (Dodge et al., 2020). This procedure closely mimics that of an ML practitioner looking to select the best model. 5

Empirical Results

Figure 1 shows both the baseline NO DA test accuracies as a reference point and the mean relative improvement from applying EDA and BT. Empirically, improvements are marginal for 5 of the 6 datasets, only exceeding 1% for BERT-B in a couple of instances where N ≤ 1000. XLNET-B and ROBERTA-B see no discernible improvements at almost any data level, and just as frequently observe regressions in mean accuracy from EDA or BT. MNLI presents a clear outlier, with augmentations yielding relative improvements in excess of 2%, but only for BERT. In contrast, the other pretrained transformers experience unpredictable, and mostly negative, results.
In terms of augmentation preferences for BERT, BT confers superior results to EDA in 60% of cases, by an average absolute margin of 0.18%. This advantage is muted for both XLNET and ROBERTA, with only 53% of cases preferring BT, and at smaller margins. Table 2 shows the improvement of either EDA or BT over NO DA, averaged across all 6 datasets. We compare against Wei and Zou (2019)'s experiments, measuring the impact of EDA on LSTM and CNN models over 5 classification datasets. 6 They observe consistent improvements for non-pretrained models: LSTMs and CNNs improve 3.8% and 2.1% on average at N = 500 training points, and 0.9% and 0.5% on average with full data (approximately equivalent to our own FULL setting). Compared to these, BERT observes muted benefits. Excluding MNLI from this average (it is not present in Wei and Zou (2019)'s experiments) would reduce all of BERT's improvements to well below 1%. ROBERTA and XLNET again show no signs of improvement, frequently yielding worse results than the baseline, even with the best data augmentation.
Finally, we examine whether data augmentation confers an advantage with any statistical significance. We use a one-sided t-test with the null hypothesis that data augmentation confers greater mean performance than training without it, at a p-value threshold of .05. Examining BT and EDA vs. NO DA over all datasets and data sizes, we reject this null hypothesis in 43%, 85%, and 87% of cases for BERT, XLNET, and ROBERTA respectively. Moreover, for ROBERTA, the inverse hypothesis (that NO DA is statistically better than DA) holds in 28% of cases.
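For reference, the underlying test statistic can be computed with a hand-rolled Welch's t; this is a sketch, since the paper does not specify the exact t-test variant used:

```python
import math
import statistics


def welch_t_statistic(da_scores, noda_scores):
    """Welch's t statistic for the comparison mean(da) vs. mean(noda).

    A one-sided test at p = .05 then compares t against the critical value
    of a t distribution (roughly 1.7-1.8 for 10-20 degrees of freedom).
    """
    na, nb = len(da_scores), len(noda_scores)
    va = statistics.variance(da_scores)    # sample variances
    vb = statistics.variance(noda_scores)
    return (statistics.mean(da_scores) - statistics.mean(noda_scores)) / math.sqrt(va / na + vb / nb)
```

Welch's variant is chosen here because the two seed populations need not share a variance; a pooled-variance t-test would be the alternative.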
We believe these results are surprising given the advantages afforded to data augmentation in this experimental setup. Only one combination (BERT-B on MNLI) sees significant benefits from data augmentation. We speculate this outlier result could pertain to the inherent difficulty of natural language inference in low data regimes. Alternatively, Gururangan et al. (2018) discuss "annotation artifacts" in MNLI that lead models to rely on simple heuristics, such as the presence of the word "not", in order to make classifications. EDA or BT could mitigate these spurious cues by distributing artifacts more evenly across labels.

Why can Data Augmentation be ineffective?
Our results consistently demonstrate that augmentation provides more benefit to BERT than to ROBERTA and XLNET. The key distinguishing factor between these models is the scale of unsupervised pretraining; therefore, we hypothesize that pretraining provides the same benefits targeted by common augmentation techniques. Examining the set of examples where the LSTM requires augmentation to classify correctly but ROBERTA does not, we observe rare word choice, atypical sentence structure, and generally off-beat reviews. This set contains reviews such as "suffers from over-familiarity since hit-hungry british filmmakers have strip-mined the monty formula mercilessly since 1997", "wishy-washy", or "wanker goths are on the loose! run for your lives!", as compared to "exceptionally well acted by diane lane and richard gere", which is more representative of examples outside this set. We verify this quantitatively: for 100 examples in this set there are 206 (rare) words which appear only in this set, whereas for 100 random samples we see an average of 116 such rare words. Interestingly, we also notice label skew in this set (34% of examples are positive, against an overall mean of 50%). While we leave deeper analysis to future work, we believe these results suggest data augmentation and pretraining both improve a model's ability to handle complex linguistic structure, ambiguous word usage, and unseen words within a label category.
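The rare-word count can be reproduced with a small vocabulary-difference sketch; the two review sets themselves are assumed inputs:

```python
def words_unique_to(subset, rest):
    """Count word types that occur in `subset` but nowhere in `rest`.

    subset, rest: lists of raw review strings; tokenization here is a naive
    lowercase whitespace split, an assumption for illustration.
    """
    subset_vocab = {w for text in subset for w in text.lower().split()}
    rest_vocab = {w for text in rest for w in text.lower().split()}
    return len(subset_vocab - rest_vocab)
```

Comparing this count for the augmentation-dependent set against random samples of equal size gives the 206-vs-116 contrast reported above.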

When can Data Augmentation be useful?
Given these observations, where might task-agnostic data augmentation be useful (with pretrained models)? One candidate application is out-of-domain generalization. However, we believe the target domain must not be well represented in the pretraining corpus. For instance, Longpre et al. (2019) did not find standard BT useful for improving the generalization of question answering models. While their training domains are diverse, they are mostly based in Wikipedia and other common sources well represented in the BERT pretraining corpus. Additionally, we suspect it is not enough to vary or modify examples in ways already seen in pretraining. Our results motivate more sophisticated (read: targeted) augmentation techniques rather than generic, task- (and domain-) agnostic strategies whose benefits unsupervised pretraining may capture more effectively. Another candidate application of task-agnostic data augmentation is semi-supervised learning. Xie et al. (2019) illustrate a use for generic data augmentations as a noising agent in their consistency training method, assuming large quantities of unlabeled, in-domain data are available. While task-agnostic data augmentations are effective in this particular setup, they are not the critical factor in the method's success, nor is it clear whether more tailored noising techniques would achieve even greater success.
To our knowledge, our experiments provide the most extensive examination of task-agnostic data augmentation for pretrained transformers. Nonetheless, our scope has been limited to classification tasks, and to the more common models and augmentation techniques.

Conclusion
We examine the effect of task-agnostic data augmentation in modern pretrained transformers. Isolating low data regimes (< 10k training data points) across a range of factors, we observe a negative result: popular augmentation techniques fail to consistently improve performance for modern pretrained transformers. Further, we provide empirical evidence that suggests the scale of pretraining may be the primary factor in the diminished efficacy of textual augmentations. We hope our work provides guidance to ML practitioners in deciding when to use data augmentation and encourages further examination of its relationship to unsupervised pretraining.

A.1 Transformer Models and Training
We share the details of our hyper-parameter selection, for easy reproducibility. For each of BERT, XLNET, and ROBERTA we use configurations mostly consistent with their original releases' recommendations. In all cases code is adapted with minimal changes from open source repositories. The majority of changes to each repository pertain to supporting all 6 datasets, their augmentations, as well as better metrics reporting. All models were trained on 1 NVIDIA Tesla V100 GPU.
For each model we tune over 4 hyperparameters to which the final performance was particularly sensitive. The "augmentation dose" parameter, as described in the paper, only applies to models trained with either EDA or BT. We verify in Appendix Section B that the addition of this tuning dimension did not alter our conclusions with respect to the impact of data augmentation when fully tuned. Lastly, we note that the final model size varies slightly depending on the size of the classification head, which is dictated by the number of classes in the task.

A.2 BERT-Base

See Table 3 for details of our training setup and hyperparameter tuning ranges.

A.3 XLNet-Base
For XLNET (Yang et al., 2019b) we also use the original implementation in TensorFlow. 9 See Table 4 for details of our training setup and hyperparameter tuning ranges.

A.4 RoBERTa-Base

See Table 5 for details of our training setup and hyperparameter tuning ranges.

B Verifying Tuning Sufficiency
To ensure our conclusions are reliable we must verify that our tuning is sufficient to capture all the benefits of data augmentation. Accordingly, we double the number of hyperparameter configurations (K) and see if any of the conclusions change. As this experiment is computationally expensive, we benchmark the results only for BERT on SST-2. Full results are shown in Table 6.
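This check exploits a simple property of random search: since the 60 sampled configurations include the first 30, the best validation score can only stay the same or improve when the budget is doubled, so any conclusion-changing gain from extra tuning would surface directly. A toy simulation of one search trajectory:

```python
import random


def best_of_first_k(scores, k):
    """Best validation score among the first k sampled configurations."""
    return max(scores[:k])


# Simulated validation accuracies for 60 random configurations; the score
# range is illustrative, not taken from the paper.
rng = random.Random(42)
scores = [rng.uniform(0.80, 0.90) for _ in range(60)]
```

On any such trajectory, `best_of_first_k(scores, 60) >= best_of_first_k(scores, 30)` holds by construction; the empirical question answered in Table 6 is merely how large that gap is.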
We observe that, on average, doubling the number of configuration trials from KA = 30 to KB = 60 results in minor accuracy improvements at lower training set sizes (e.g. +0.38 at N = 500), and negligible variations at higher training set sizes (e.g. −0.04 at N = Full). We also measure the resulting change in the difference between using and not using any data augmentation (∆(DA − NO DA)). While improvements are reported in favour of data augmentation over NO DA, they are all < 0.15%, indicating that at K = 30 our conclusions are robust.

15 Available at https://gluebenchmark.com/tasks
16 Available at https://www.nyu.edu/projects/bowman/multinli/

C Empirical Results
Detailed results are provided for analysis. In each results table we include the mean accuracy and 95% confidence interval for every dataset, augmentation type, and training data size. These are the outputs of the second stage of tuning, "SELECTBEST", which uses the best hyperparameters found per setting in the first RANDOMSEARCH stage. We select only the top 10 of 20 trials (by validation accuracy) to compute these test statistics, due to the observed volatility in fine-tuning Transformers with different seeds (Dodge et al., 2020).
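A sketch of this selection and interval computation; the t critical value 2.262 (two-sided 95%, df = 9) is an assumption, since the paper does not state the exact CI formula it uses:

```python
import statistics


def top10_test_stats(trials, top=10, t_crit=2.262):
    """Mean test accuracy and 95% CI half-width over the `top` trials
    ranked by validation accuracy.

    trials: list of (val_acc, test_acc) pairs from the T2 = 20 seed runs.
    """
    best = sorted(trials, key=lambda r: r[0], reverse=True)[:top]
    tests = [t for _, t in best]
    mean = statistics.mean(tests)
    half_width = t_crit * statistics.stdev(tests) / len(tests) ** 0.5
    return mean, half_width
```

Ranking by validation score but reporting test statistics keeps the selection step honest: the test set never influences which seeds survive.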
The full results are shown below for BERT-BASE (Table 7), XLNET-BASE (Table 8), and ROBERTA-BASE (Table 9).

Table 9: ROBERTA-BASE mean test accuracy and the 95% confidence interval for each task, augmentation, and data size, computed over the top 10 best trials by validation score.