Generative Data Augmentation for Commonsense Reasoning

Recent advances in commonsense reasoning depend on large-scale human-annotated training data to achieve peak performance. However, manual curation of training examples is expensive and has been shown to introduce annotation artifacts that neural models can readily exploit and overfit to. We investigate G-DAUG^C, a novel generative data augmentation method that aims to achieve more accurate and robust learning in the low-resource setting. Our approach generates synthetic examples using pretrained language models, and selects the most informative and diverse set of examples for data augmentation. In experiments with multiple commonsense reasoning benchmarks, G-DAUG^C consistently outperforms existing data augmentation methods based on back-translation, and establishes a new state-of-the-art on WinoGrande, CODAH, and CommonsenseQA. Further, in addition to improvements in in-distribution accuracy, G-DAUG^C-augmented training also enhances out-of-distribution generalization, showing greater robustness against adversarial or perturbed examples. Our analysis demonstrates that G-DAUG^C produces a diverse set of fluent training examples, and that its selection and training approaches are important for performance. Our findings encourage future research toward generative data augmentation to enhance both in-distribution learning and out-of-distribution generalization.


Introduction
While recent advances in large-scale neural language models (Devlin et al., 2019; Liu et al., 2019; Radford et al., 2019; Raffel et al., 2019) have led to strong performance on several commonsense reasoning benchmarks (Talmor et al., 2019; Lv et al., 2020; Sakaguchi et al., 2020), their accuracy by and large depends on the availability of large-scale human-authored training data. However, crowdsourcing examples at scale for each new task and domain can be prohibitively expensive. Moreover, human-authored data has been shown to exhibit annotation artifacts (Gururangan et al., 2018; Agrawal et al., 2018; Schwartz et al., 2017), leading to models with considerably weaker performance on out-of-distribution samples (Jia and Liang, 2017; Belinkov and Bisk, 2017; Iyyer et al., 2018).

Figure 1: Example of a selected high-quality generated example compared to a human-authored example from the WINOGRANDE dataset. Composing commonsense questions can require creativity.
A candidate solution that has shown promise in other tasks, such as reading comprehension, is to augment a human-authored training set with a large set of synthetically-generated examples (Zhou et al., 2017; Du et al., 2017; Zhao et al., 2018a). But generating synthetic examples for commonsense reasoning poses a unique challenge. In reading comprehension, for instance, the goal of data augmentation is to generate questions that are directly answerable by a given reference passage. In contrast, answering commonsense questions relies on commonsense notions that are seldom stated explicitly (Gordon and Van Durme, 2013; Forbes and Choi, 2017), and authoring such questions can require creativity (see Figure 1). Based on promising evidence from previous work (Yang et al., 2018; Trinh and Le, 2018; Bosselut et al., 2019; Davison et al., 2019), we hypothesize that pretrained language models, such as GPT-2 (Radford et al., 2019), capture some common sense expressed implicitly in their pretraining corpus. Could questions generated by such models serve as helpful training data? In this work, we explore this question through Generative Data Augmentation for commonsense reasoning (G-DAUG c ; §2): a novel framework for augmenting training data with diverse and informative synthetic training examples to improve both in-distribution performance and out-of-distribution generalization of commonsense reasoning models.^1

Although a generative model allows us to produce large pools of synthetic training examples, the generated examples may be noisy or redundant. To ensure that we use the most informative examples for augmentation, we introduce data selection methods based on influence functions (Koh and Liang, 2017) and a heuristic to maximize the diversity of the generated data pool. Finally, we propose an effective two-stage training scheme for augmentation with synthetic data.
In experiments across multiple commonsense benchmarks, we show that G-DAUG c can mitigate the expense and brittleness resulting from large training sets for commonsense reasoning tasks.
To summarize, our contributions include:
1. G-DAUG c , a generative data augmentation framework for commonsense reasoning (§2),
2. novel selection methods that identify informative and diverse synthetic training examples from the generated pool (§3),
3. experiments showing that G-DAUG c improves in-distribution performance, achieving a 1-4% average absolute gain across four commonsense reasoning datasets and state-of-the-art results on the WINOGRANDE (Sakaguchi et al., 2020), COMMONSENSEQA (Talmor et al., 2019), and CODAH (Chen et al., 2019) benchmarks, and also improves model robustness in terms of resistance to adversarial attacks (Jin et al., 2020) and accuracy on perturbed evaluation sets (§4), and
4. a comprehensive analysis of the factors that influence G-DAUG c 's performance (§5).

G-DAUG c
We now describe our framework for Generative Data Augmentation for Commonsense Reasoning (G-DAUG c ). Figure 2 shows an overview of the approach. We describe G-DAUG c 's data generation procedure (steps 1 and 2 in the figure) in this section, and cover data selection and training in the next section.

Figure 2: Illustration of the G-DAUG c process: (1) generate synthetic data and train a task model, (2) relabel the generated data using the task model, (3) filter the generated data based on estimated influence scores, (4) further select a subset based on a diversity-maximizing heuristic, (5) train a new task model using the filtered generations (synthetic training), and (6) further train this model using the original training data (organic training).

Synthetic Training Data Generation
We will use multiple-choice question answering as a running example to describe synthetic data generation. Formally, consider a dataset of N questions D = {(Q^i, C^i, y^i) : i = 1, 2, ..., N}, where Q^i is a sequence of words denoting the i-th question, C^i = {C^i_j : j = 1, 2, ..., K} is the corresponding choice set with K choices, each also a word sequence, and y^i ∈ {1, 2, ..., K} is the ground-truth label. We denote the answer as C^i_{y^i} and the distractors as C^i_{j≠y^i}. Our text generators are pretrained generative language models, finetuned to maximize the log-likelihood of a sequence of text W:

  L_W(θ) = Σ_{t=1}^{T} log P(w_t | W_{1:t−1}; θ),

where W_{1:t−1} denotes a subsequence of W and θ denotes the model parameters.^2 Below, we describe how we use variations of this objective to finetune different LMs to generate questions, answers, and distractors.^3

Generating Synthetic Questions. To train our question generator, we finetune the LM on the training question set {Q^i} to optimize the language modeling objective:

  L_q(θ_q) = Σ_{i=1}^{N} log P(Q^i; θ_q),

where θ_q denotes the parameters of the question generator. After finetuning, we generate new questions with nucleus sampling (Holtzman et al., 2020), which is suitable for generating long-form text.
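To make the decoding step concrete, here is a minimal NumPy sketch of nucleus (top-p) sampling over a single next-token distribution; the 5-token vocabulary and probabilities are illustrative, not from the paper:

```python
import numpy as np

def nucleus_sample(probs, p=0.9, rng=None):
    """Sample a token id from the smallest set of top tokens whose
    cumulative probability reaches p (Holtzman et al., 2020)."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]                   # most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1  # minimal prefix reaching p
    nucleus = order[:cutoff]
    renormed = probs[nucleus] / probs[nucleus].sum()  # renormalize within nucleus
    return int(rng.choice(nucleus, p=renormed))

# Toy next-token distribution over a 5-token vocabulary.
probs = np.array([0.5, 0.3, 0.1, 0.07, 0.03])
token = nucleus_sample(probs, p=0.6)  # nucleus is {0, 1}: 0.5 + 0.3 >= 0.6
```

In the actual system, `probs` would come from the finetuned question generator's softmax at each decoding step, applied autoregressively until an end-of-sequence token.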

Generating Synthetic Answers and Distractors
To generate choice sets, we independently finetune two separate generative LMs, one for answers and the other for distractors. The answer and distractor generators are trained to maximize the conditional log-likelihood of the answer and the distractors, respectively, given the question:

  L_a(θ_a) = Σ_{i=1}^{N} log P(C^i_{y^i} | Q^i; θ_a),
  L_d(θ_d) = Σ_{i=1}^{N} Σ_{j≠y^i} log P(C^i_j | Q^i; θ_d),

where θ_a and θ_d denote the parameters of the answer and distractor generators, respectively. For answers, we use nucleus sampling with low temperature (for long answers) or greedy decoding (for short answers). To encourage diversity across generated distractors, we use nucleus sampling without temperature scaling for these.
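The role of the decoding temperature can be sketched in a few lines of NumPy: dividing the logits by a temperature below 1 sharpens the softmax (confident answers), temperature 1.0 leaves it unchanged (diverse distractors), and greedy decoding simply takes the argmax. The logits below are illustrative:

```python
import numpy as np

def apply_temperature(logits, temperature):
    """Turn next-token logits into probabilities at a given temperature."""
    scaled = logits / temperature   # T < 1 sharpens, T = 1 is unchanged
    scaled = scaled - scaled.max()  # stabilize the softmax numerically
    probs = np.exp(scaled)
    return probs / probs.sum()

logits = np.array([2.0, 1.0, 0.5])
sharp = apply_temperature(logits, 0.5)  # answer-style: mass concentrates on token 0
flat = apply_temperature(logits, 1.0)   # distractor-style: plain softmax
greedy_token = int(np.argmax(logits))   # greedy decoding for short answers
```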
Data Relabeling. Our choice of generative LMs naturally defines labels for the synthetic choice sets. Alternatively, we consider using a supervised task model trained on the original training set, to relabel a candidate pool of synthetic answers and distractors. This is similar to treating the synthetic questions as unlabeled data and applying self-training. The utility of this self-training can be task-dependent; in our experiments, we used validation performance to determine whether or not to relabel our synthetic training data.
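A sketch of the relabeling step, with a hypothetical `task_model(question, choices)` callable standing in for the supervised task model (it returns one score per choice); each synthetic example keeps its question and choices but takes the task model's argmax as its new label:

```python
def relabel(synthetic_examples, task_model):
    """Self-training-style relabeling: replace each generator-assigned label
    with the argmax prediction of a task model trained on the original data."""
    relabeled = []
    for question, choices, _generator_label in synthetic_examples:
        scores = task_model(question, choices)
        new_label = max(range(len(choices)), key=lambda j: scores[j])
        relabeled.append((question, choices, new_label))
    return relabeled
```

Whether these relabeled examples help more than the generator's own labels is, as noted above, task-dependent and worth checking on validation data.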

Synthetic Data Selection and Training
The above generation method can produce a large pool of examples, but training on all of them would be computationally expensive and might harm performance due to noisy generations. Here, we propose three data selection methods aimed at choosing more effective training examples from the generated pool ( §3.1). Further, we outline a simple staged training procedure ( §3.2) to mitigate the negative impact from noise in the synthetic data.

Selecting High-quality and Diverse Synthetic Examples
A randomly sampled synthetic dataset may contain examples that are similar to one another, along with low-quality generations (Holtzman et al., 2020). We refer to this baseline random selection approach as G-DAUG c -Rand. We hypothesize that a diverse and high-quality synthetic set would benefit the task model more, and present three data selection algorithms that target quality, diversity, and a combination of both.
Filtering with Influence Functions. We hypothesize that filtering out detrimental synthetic training examples can boost downstream performance (Bras et al., 2020). A training example x is considered detrimental if including it in the training set results in a higher generalization error, approximated by the validation loss:

  L(X_val, θ̂(X_tr ∪ {x})) − L(X_val, θ̂(X_tr)) > 0.

Evaluating this directly would require retraining the model with x, which is computationally prohibitive. Fortunately, the change in validation loss can be efficiently approximated through the use of influence functions (Atkinson et al., 1983; Koh and Liang, 2017). The main result tells us that the influence of upweighting a training example x by some small ε on the model parameters θ̂, with parameter space Θ, is given by:

  I_up,params(x) = dθ̂_{ε,x}/dε |_{ε=0} = −H_θ̂^{−1} ∇_θ l(x, θ̂),

where H_θ̂ = (1/N) Σ_i w_i ∇²_θ l(x_i, θ̂) is the Hessian evaluated at θ̂ and w_i is the weight of training example x_i. The above result is a slight generalization of Koh and Liang (2017), but it is straightforward to generalize their proof to the weighted empirical risk case. Then, we apply the chain rule to get the influence of upweighting x on the validation loss:

  I_up,loss(x) = ∇_θ L(X_val, θ̂)^⊤ I_up,params(x).

Note that L(X_tr, θ) can be rewritten in the following weighted-average form to incorporate a new training example x_new:

  L(X_tr, θ) = (1/N) Σ_{i=1}^{N+1} w_i l(x_i, θ),

where w_i = 1 for all i ≠ N+1, w_{N+1} = 0, and x_{N+1} = x_new. Adding the new training example x_new is then equivalent to upweighting x_{N+1} by ε = 1/N, which yields the following linear approximation of the validation loss change:

  L(X_val, θ̂_new) − L(X_val, θ̂) ≈ (1/N) I_up,loss(x_new).

We adopt the stochastic estimation method described in Koh and Liang (2017) to efficiently compute I_up,loss. Detrimental synthetic data will have (1/N) I_up,loss > 0.
Another distinction between our approach and Koh and Liang (2017) is that they compute the influence of a single training example on a single test example, whereas we estimate influence of a synthetic training example on all validation examples at once, which makes our approach scalable to large pools of synthetic data. Our approach, referred to as G-DAUG c -Influence, filters out detrimental synthetic data (i.e., the examples that have a positive estimated influence on the validation loss).
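To make the filtering criterion concrete, the sketch below computes influence scores exactly for a tiny L2-regularized logistic regression standing in for the task model (labels in {-1, +1}); the paper instead uses the stochastic estimator of Koh and Liang (2017) for neural models, where the Hessian cannot be inverted directly. All data here are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lam=0.1, lr=0.5, steps=2000):
    """Fit L2-regularized logistic regression by gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ theta)
        grad = -(X * (y * sigmoid(-margins))[:, None]).mean(axis=0) + lam * theta
        theta = theta - lr * grad
    return theta

def influence_scores(X_tr, y_tr, X_val, y_val, X_syn, y_syn, theta, lam=0.1):
    """Approximate validation-loss change from adding each synthetic example:
    (1/N) * I_up,loss(x) = -(1/N) * grad_l(x)^T H^{-1} g_val.
    Positive scores mark detrimental examples, which get filtered out."""
    n, d = X_tr.shape
    m = y_tr * (X_tr @ theta)
    s = sigmoid(m) * sigmoid(-m)                            # per-example curvature
    H = (X_tr * s[:, None]).T @ X_tr / n + lam * np.eye(d)  # Hessian at theta
    g_val = -(X_val * (y_val * sigmoid(-y_val * (X_val @ theta)))[:, None]).mean(axis=0)
    g_syn = -(X_syn * (y_syn * sigmoid(-y_syn * (X_syn @ theta)))[:, None])
    return -(g_syn @ np.linalg.solve(H, g_val)) / n

# Toy 1-D task: positive x => positive label; validation set = training set here.
X_tr = np.array([[-2.0], [-1.0], [1.0], [2.0]]); y_tr = np.array([-1.0, -1.0, 1.0, 1.0])
theta = train_logreg(X_tr, y_tr)
X_syn = np.array([[2.0], [2.0]]); y_syn = np.array([1.0, -1.0])  # second one mislabeled
scores = influence_scores(X_tr, y_tr, X_tr, y_tr, X_syn, y_syn, theta)
kept = [i for i, sc in enumerate(scores) if sc <= 0]  # influence filtering
```

On this toy pool the mislabeled example receives a positive score and is filtered out, while the consistent one is kept.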
Selecting Diverse Examples. While G-DAUG c -Influence promotes training data quality, it ignores diversity; we hypothesize that better diversity can provide a more reliable training signal. We propose a simple greedy algorithm that iteratively selects a synthetic training example from the pool that maximizes a diversity measure. Here, we use a simple measure of diversity equal to the number of unique unigrams in the selected training set. Surprisingly, preliminary experiments with a more sophisticated diversity method based on embedding distance did not improve results (see Appendix E for details).
We refer to this approach as G-DAUG c -Diversity (see Algorithm 1).

Algorithm 1 G-DAUG c -Diversity
Input: Synthetic data pool D_pool, target size N
Output: Synthetic dataset D_syn
Initialization: D_syn ← ∅
while |D_syn| < N: select the x ∈ D_pool that maximizes the number of unique unigrams in D_syn ∪ {x}; move x from D_pool to D_syn
return D_syn

Combining Influence Filtering and Diversity Maximization. G-DAUG c -Influence and G-DAUG c -Diversity have complementary benefits: the former aims at improving the quality of individual examples by filtering out detrimental ones, and the latter is designed to compose a diverse training set but does not consider quality. To reap both benefits, we propose a combined selection technique, G-DAUG c -Combo, that first filters the data using G-DAUG c -Influence, then selects examples according to G-DAUG c -Diversity.
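A faithful Python rendering of the greedy loop in Algorithm 1, selecting whichever candidate adds the most unseen unigrams (equivalently, maximizing the unique-unigram count of the selected set); the toy pool is illustrative:

```python
def select_diverse(pool, target_size):
    """Greedy unigram-diversity selection (G-DAUG^C-Diversity)."""
    selected, covered = [], set()
    remaining = list(pool)
    while remaining and len(selected) < target_size:
        # Marginal gain: unigrams a candidate would add to the selected set.
        gains = [len(set(text.lower().split()) - covered) for text in remaining]
        best = max(range(len(remaining)), key=gains.__getitem__)
        choice = remaining.pop(best)
        selected.append(choice)
        covered |= set(choice.lower().split())
    return selected

pool = ["a b c", "a b", "d e f g", "a b c"]
picked = select_diverse(pool, 2)  # first "d e f g" (4 new unigrams), then "a b c"
```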

Training with Synthetic Data
In traditional data augmentation, new data is usually mixed with the original training examples to create an augmented training set (Wei and Zou, 2019; Kafle et al., 2017). However, when augmenting with data produced by a generative model, label noise can be detrimental to learning (Kafle et al., 2017). Moreover, the generated questions themselves can be noisy, i.e., nonsensical or ambiguous (see Table 7 under §4.2). To address this issue, we propose a simple training procedure that treats the synthetic and original data differently: we first train a model on the synthetic data (Synthetic Training), then further train it on the original, human-authored training set (Organic Training). The motivation is to correct any unfavorable noise learned during the first stage by subsequently training on original data, since more recent training data is favored by neural models (Goodfellow et al., 2014).
We also experiment with a mixing approach that minimizes a weighted average of the loss for the synthetic data and the original data, with an importance weight to downweight the synthetic examples to mitigate noise. We find that two-stage training performs better than the importance-weighted loss (see Section 5).
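The contrast between the two schedules can be seen even on a toy 1-D problem: a location parameter fit with per-example gradient steps on squared loss ends up near whichever data it saw last, which is why organic training comes second. All numbers are illustrative:

```python
import numpy as np

def sgd_stage(theta, data, lr=0.1, epochs=50):
    """One training stage: per-example gradient steps on 0.5 * (theta - x)^2."""
    for _ in range(epochs):
        for x in data:
            theta = theta - lr * (theta - x)
    return theta

synthetic = np.array([2.0, 2.2, 1.8])  # noisy generated data (stage 1: synthetic training)
organic = np.array([1.0, 1.1, 0.9])    # human-authored data (stage 2: organic training)

theta = sgd_stage(0.0, synthetic)
theta = sgd_stage(theta, organic)
# theta ends near the organic mean (1.0): later-stage data dominates.
```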

Experiments
We present experiments on four commonsense multiple-choice QA benchmarks: COMMONSENSEQA (Talmor et al., 2019), WINOGRANDE (Sakaguchi et al., 2020), CODAH (Chen et al., 2019) and HellaSwag (Zellers et al., 2019). Our techniques are also directly applicable to other closed-book multiple-choice QA setups, such as science QA, and to textual entailment tasks with minor modifications. To evaluate G-DAUG c 's extensibility to these settings, we also experiment with a textual entailment task, SNLI (Bowman et al., 2015), and a closed-book version of the ARC-Challenge Scientific QA task (Clark et al., 2018) in which access to the scientific corpus for the ARC dataset (or any other information sources) is disallowed at test time. We simulate low-resource settings on the large HellaSwag and SNLI datasets by downsampling them to 2K and 3K training samples, respectively; the other datasets are either already low-resource or have a low-resource component. Dataset details are provided in Appendix A.
Robustness Evaluation. In addition to measuring in-distribution performance, we also analyze robustness to perturbed or adversarial data. Following Wei and Zou (2019), we perform WordNet-based (Fellbaum, 1998) synonym replacement on the validation or test set (when test labels are available) with a 10% replacement rate.^5 Our second evaluation uses TextFooler (Jin et al., 2020), which identifies the most important words and replaces them with the most semantically and grammatically suitable substitutes, until the model prediction is altered. We adopt two metrics to measure robustness under TextFooler's attacks: 1) failure rate: the proportion of examples for which TextFooler fails to change the prediction, and 2) average perturbation ratio: the average fraction of words replaced when TextFooler succeeds in altering a prediction. We re-implement TextFooler with two minor changes: we only swap words in questions, not answers, and we replace the Universal Sentence Encoder with SROBERTA (Reimers and Gurevych, 2019).

4 https://leaderboard.allenai.org/winogrande/submissions/public, https://www.tau-nlp.org/csqa-leaderboard
5 https://github.com/jasonwei20/eda_nlp
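A self-contained sketch of the synonym-replacement perturbation; a small hand-written synonym table stands in here for the WordNet lookup used in the paper:

```python
import random

def synonym_replace(sentence, synonyms, rate=0.1, rng=None):
    """Replace ~`rate` of the words with a listed synonym, if one exists."""
    rng = rng or random.Random()
    words = sentence.split()
    replaceable = [i for i, w in enumerate(words) if w in synonyms]
    n_swap = max(1, round(rate * len(words))) if replaceable else 0
    for i in rng.sample(replaceable, min(n_swap, len(replaceable))):
        words[i] = rng.choice(synonyms[words[i]])
    return " ".join(words)

table = {"big": ["large", "huge"], "quick": ["fast"]}  # toy stand-in for WordNet
perturbed = synonym_replace("the quick dog saw the big cat", table, rate=0.1)
```

An unaugmented and an augmented model would then be compared on the perturbed copy of the evaluation set.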

Experimental Settings
We use ROBERTA (Liu et al., 2019) as our pretrained task model, and GPT-2 (Radford et al., 2019) as our pretrained generator. 6 We use validation performance to decide whether to do relabeling for COMMONSENSEQA and WINOGRANDE, and apply relabeling by default on all other tasks (tuning this choice may boost performance). To perform a controlled comparison, we restrict the synthetic set size to be equal across all methods. We repeat all experiments with 10 random restarts and pick the best model based on validation performance. Additional experimental details, with hyperparameters, are provided in Appendix C.
Baselines Our first baseline is a finetuned ROBERTA model with no augmentation. We compare with existing work on data augmentation via a BACKTRANSLATION approach from Xie et al. (2019); under our setting the original and backtranslated data are mixed at random. 7

In-Distribution Results
Our main results for commonsense question answering are reported in Table 1. All G-DAUG c variants outperform the baselines, highlighting the impact of generative data augmentation. On average, every other variant achieves higher test performance than G-DAUG c -Rand, which further highlights the importance of our data selection approaches. In addition, the influence and diversity selection methods score similarly; however, their combination (G-DAUG c -Combo) outperforms either alone, which suggests that they are complementary selection approaches. More specifically, G-DAUG c -Combo performs the best on 3/4 tasks and obtains the highest average score. Further, G-DAUG c -Combo provides a 5.0% absolute gain over previously published state-of-the-art results on WINOGRANDE.^8 For COMMONSENSEQA, G-DAUG c -Combo outperforms the previous non-ensemble state-of-the-art (Zhu et al., 2020) by 0.4%. We also achieve a new state-of-the-art on CODAH, where the previous best (BERT-based) score was 67.5% (Chen et al., 2019).

Robustness Results
Table 2 presents our evaluation on the synonym-replacement sets. The G-DAUG c variants outperform the baselines, and G-DAUG c -Combo obtains the best average performance. Table 3 shows results under the TextFooler adversarial attacks. Models trained with data augmentation are more robust to adversarial attacks, as all G-DAUG c variants and BACKTRANSLATION outperform the ROBERTA baseline on both metrics. G-DAUG c -Diversity obtains the best failure rate and average perturbation ratio (higher is better for both metrics), and G-DAUG c -Combo performs comparably, with slightly lower numbers. Overall, the findings suggest that optimizing diversity increases robustness.

Results on ARC and SNLI
We explore G-DAUG c 's applicability outside of the commonsense domain in Table 4, via evaluation on the closed-book ARC-Challenge Scientific QA task. Valid science questions are hard to generate because their semantics need to be precise, and we find that many of G-DAUG c 's generations for ARC are noisy. Nonetheless, perhaps surprisingly, G-DAUG c outperforms the baselines by a large margin. G-DAUG c -Influence achieves the best in-distribution performance, while G-DAUG c -Diversity is the most robust against TextFooler but has worse accuracy than G-DAUG c -Rand. This may suggest that optimizing for quality is more important when the synthetic data is noisier. We also evaluate G-DAUG c on a textual entailment task using the SNLI dataset (Bowman et al., 2015) in Table 4. This task has a different format; it is a pair-wise classification task with 3 labels (details in Appendix A). We find that G-DAUG c slightly improves accuracy and robustness over the baselines. The performance is likely affected by a label skew introduced by influence-based filtering.

Analysis and Discussion
We now analyze G-DAUG c 's performance, focusing on WINOGRANDE where G-DAUG c offers the most benefit. We first identify several factors that affect performance, and then present evidence that G-DAUG c works by transferring knowledge from the pretrained generator to the task model.

Factors that Affect G-DAUG c 's Performance
G-DAUG c is effective at different training sizes. Figure 3 illustrates that our winning strategy, G-DAUG c -Combo, remains effective as the amount of WINOGRANDE training data varies. The improvement over the baseline is largest in the low-resource (small training size) regime. For the smallest sizes, XS and S, G-DAUG c -Combo increases the effective training size by a factor of 4 (i.e., training on XS or S matches unaugmented ROBERTA's performance on S or M, respectively). In contrast, BACKTRANSLATION only helps for the XS size, and hurts performance on larger sizes.
Staged training is essential. G-DAUG c uses a two-stage training method (Section 3.2) aimed at mitigating the effect of noise in the generated data. We analyze alternative training protocols on the WINOGRANDE-L dataset: Mixing (training on the union of generated and original data) and Importance-Weighted Loss. Compared to a no-augmentation baseline (with accuracy of 75.9), two-stage training (+1.8) outperforms both mixing (+0.0) and the importance-weighted loss (+0.7).

Filtering synthetic data does not hurt accuracy. G-DAUG c 's filtering methods are designed to identify a high-quality and diverse subset of the generated data, to reduce training cost (compared to training on the entire generated pool) without harming accuracy. We evaluate whether G-DAUG c achieves this in Table 5, comparing G-DAUG c -Influence and G-DAUG c -Diversity against using the entire synthetic data pool.^9 The selection approaches provide comparable or better accuracy than using the entire pool, despite using three times less data.

Why Does G-DAUG c Work?
Below, we present analysis suggesting that G-DAUG c works by transferring knowledge from the pretrained model to the task model. In particular, we find that using a pre-trained generator is critical, and that the generated questions are often coherent, include new semantic units, and carry informative labels.
Using a Pretrained Generator is critical. We analyze the impact of the pretrained generator by comparing our standard G-DAUG c -Rand setting with a setting where the generator is not pretrained, but instead trained from scratch. We find that using GPT-2 trained from scratch results in a score of 67.8% on the WINOGRANDE-M validation set. This is a slight improvement (by 0.2%) over the unaugmented baseline, but is far inferior to the 3.9% improvement obtained when using the pretrained GPT-2. This suggests that a pretrained generator is critical for G-DAUG c .

Synthetic data labels are important. Even fully unsupervised language model pretraining can improve performance when using task-relevant data (Gururangan et al., 2020). This raises the question of whether G-DAUG c boosts performance by simply exposing the model to more task-relevant text, or whether the generated labels are in fact informative. A related question is whether G-DAUG c 's optional self-supervised relabeling improves performance. We analyze these questions for WINOGRANDE-L and COMMONSENSEQA in Table 6, evaluating G-DAUG c with three labeling methods: (i) generator labels, (ii) random relabeling, and (iii) relabeling with a task model. When the generator labels are flipped randomly, G-DAUG c is unable to outperform the baselines for either dataset (in fact, it dramatically underperforms on WINOGRANDE-L). This implies that the correctness of labels is crucial for G-DAUG c . Self-supervised relabeling provides a 1.5% absolute gain on WINOGRANDE-L, but a 0.4% drop on COMMONSENSEQA, which suggests its utility is task-dependent.
G-DAUG c introduces new semantic units. We investigate how distinct the generated questions are from each other and from the original training data. We observe that G-DAUG c only rarely generates exact duplicate questions (e.g., on COMMONSENSEQA, 0.06% of the questions are duplicates). We further investigate whether G-DAUG c introduces new entities and relations to the training data, or merely reuses the ones found in the original training set. We quantify the diversity of our synthetic dataset relative to the original data by counting the number of unique semantic units produced by performing Open Information Extraction (Banko et al., 2007) on the data. Specifically, we run the Stanford Open IE package and report the number of unique triplets, relations and entities extracted from our WINOGRANDE-M datasets in Figure 4. The synthetic data includes many more unique semantic units than the original training data, suggesting that G-DAUG c does introduce new semantic units into the training set.
G-DAUG c produces mostly fluent questions.
To evaluate G-DAUG c 's output for fluency, we employ three human annotators to rate generated COMMONSENSEQA questions for their coherence and answerability on a scale of 1 to 4, where a rating of 3 denotes an acceptable question. We obtained a total of 1,387 labels. We measured annotator agreement on a separate set of 50 questions, obtaining a Fleiss' Kappa of 0.41, which is at the low end of moderate annotator agreement, acceptable given the subjective nature of the task. A large majority (74.04%) of questions met the acceptability threshold, with an overall average rating of 3.34. Examples are shown in Table 7. Next, we asked annotators to answer the 1,027 acceptable questions, where they could edit choices (but not questions) if they were unable to pick a unique correct answer from the given choices. The editing rate is relatively high, at 55.3%. We mix these human-labeled examples with the original training set to train a ROBERTA model, and obtain 78.1% validation accuracy, which is comparable to G-DAUG c , despite using approximately 50x fewer questions. This suggests that human labels can provide higher leverage than the noisy labels from G-DAUG c , although human labeling is expensive.

Table 7: Examples and prevalence of generated commonsense questions with different manually-assigned fluency ratings, for the COMMONSENSEQA dataset. Ratings of 3 and higher correspond to questions that are answerable and address common sense, and most of G-DAUG c 's generated questions fall into this category.

Additional analyses, provided in Appendix F, show that model sharpness approximated by the Hessian trace (Yao et al., 2019) does not completely explain G-DAUG c 's performance, and that G-DAUG c is more effective than ensembling with a finetuned generator. Our framework also centers on generative models for data augmentation, but our work is the first to present a generative approach for the challenging commonsense QA setting, and we introduce new data selection approaches to improve the informativeness and diversity of synthetic data.

Related Work
Concurrently, there has been work on generating adversarial examples for analyzing black-box classifiers. These approaches use generative adversarial networks (Zhao et al., 2018b) and population-based optimization algorithms (Alzantot et al., 2018). Previous work has also presented methods to generate questions for reading comprehension. Our work is distinct in that it targets question generation in a closed-book setting, investigates the generation of answers as well as distractors, and is aimed at data augmentation.

Conclusion
We introduced G-DAUG c , a novel data augmentation framework that generates synthetic training data while preserving quality and diversity. We demonstrated that G-DAUG c is effective on multiple commonsense reasoning benchmarks, with improvements in in-distribution performance as well as robustness to perturbed evaluation sets and challenge sets. Our analysis shows that G-DAUG c tends to perform better in low-resource settings and that our data selection strategies are important for performance. Future work might explore more sophisticated methods to enhance the quality and diversity of generated training data, including having humans in the loop for relabeling.

A Dataset Details

WINOGRANDE (Sakaguchi et al., 2020): WINOGRANDE is a benchmark of fill-in-the-blank problems that come in near-duplicate "twin" pairs with different answers. We use the following method to generate twin questions: 1. generate a sequence until a blank symbol "_" is produced; 2. use two independent runs of sampling to complete the question in two different ways to form twins. The above process does not guarantee that the labels will differ for the two twins, so we further filter out generated pairs that do not have different labels.
CODAH (Chen et al., 2019): CODAH is an adversarially-constructed benchmark which tests commonsense reasoning using sentencecompletion questions, inspired by the Swag dataset (Zellers et al., 2018). It contains 2,801 questions in total, and uses 5-fold cross validation for evaluation. 10 We lower the temperature to 0.5 for the answer generation in order to increase the confidence of the generated answers.
HellaSwag (Zellers et al., 2019): HellaSwag is a more challenging version of the Swag dataset (Zellers et al., 2018), and the task is similar to CODAH. The dataset consists of 70K questions, where each question comes from one of two domains: ActivityNet or WikiHow. In order to test our methods under a low-resource setting, we downsample the training set to 2,000 examples. We take a random sample of 1,000 questions from the original validation set to serve as our validation data, and another non-overlapping random sample of 5,000 questions from the same set as our test data. The generation settings are the same as CODAH's.

SNLI (Bowman et al., 2015): SNLI is a natural language inference dataset with 570K pairs of labeled sentences. The label assigned to each sentence pair is one of entailment, contradiction or neutral. For low-resource experiments, we downsample the dataset to 3K training examples, which contains 1K unique premises and a hypothesis for all three labels. Similarly, we use a downsampled development set with 999 examples (333 premises and 3 hypotheses for each label). The generative model is finetuned by providing the premise, label and hypothesis, separated by special delimiters marking the beginning and end of each element.

ARC-Challenge (Clark et al., 2018): The ARC dataset consists of 7,787 natural grade-school science questions that are used on standardized tests. The ARC-Challenge set contains 2,590 questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. We use the official split, which has 1,119 train, 299 validation, and 1,172 test examples. The generation settings are the same as COMMONSENSEQA's.

10 The original CODAH work does not specify a particular 5-fold split, so we choose these randomly. We will release our splits for replicability.

B Validation Set Results
In Table 8, we summarize our main results on the validation sets, comparing the G-DAUG c methods against an unaugmented baseline and a backtranslation augmentation baseline. All G-DAUG c methods consistently outperform the baseline methods on every benchmark. The proposed selection methods provide an extra boost on average, compared to G-DAUG c -Rand. Among them, G-DAUG c -Influence achieves the best performance across all tasks, which is expected, as G-DAUG c -Influence selects examples that are helpful in reducing validation loss. Interestingly, G-DAUG c -Combo scores lower than G-DAUG c -Influence, although it outperforms G-DAUG c -Diversity. Finally, backtranslation does not demonstrate any benefit and obtains lower results than the unaugmented baseline on all benchmarks.

Table 8: Results on the validation sets of four commonsense benchmarks. All G-DAUG c methods outperform the baseline methods; in particular, G-DAUG c -Influence performs the best on all tasks, which is expected as it selects examples which are helpful in reducing validation loss.

C Hyperparameter Settings and Input Formats
Hyperparameter settings for finetuning GPT-2, RoBERTa, and G-DAUG^C are shown in Tables 11, 12, 14, 15 and 16. We manually tune the learning rate and the number of epochs for GPT-2 finetuning based on validation perplexity. For finetuning the RoBERTa baseline models, we select the number of epochs from {1, 3, 5, 8, 10} based on validation accuracy for CSQA, WINOGRANDE and HellaSwag-2K; for CODAH, SNLI-3K and ARC-Challenge, we simply use 5 epochs. For G-DAUG^C synthetic training, we train all models with a learning rate of 5e-6 for one epoch. For G-DAUG^C organic training, we use the same hyperparameter settings as the RoBERTa baselines (except for CSQA and HellaSwag-2K, where we find that reducing the number of epochs by 2 gives significantly better results). In Tables 9 and 10, we specify the input formats for finetuning GPT-2 and RoBERTa. Finally, we benchmark the running time of our implementations of the influence and diversity selection methods on the task of selecting 127,478 examples from a pool of 380,700 candidates for WINOGRANDE-M, using one Nvidia 2080 Ti GPU and one Intel Core i9-7900X with 10 cores and a clock speed of 3.3 GHz. The influence and diversity algorithms take about 8.3 hours and 2.9 hours, respectively.
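As a concrete illustration of the delimiter-based input format for GPT-2 finetuning (Table 9), the following sketch serializes an SNLI example; the helper function is ours, but the delimiter tokens follow the format shown in the table:

```python
# Sketch of the GPT-2 input serialization from Table 9 for an SNLI
# example; the helper name is ours, the delimiter tokens are the
# paper's (PREM/ANS/HYP with matching closing tags).
def format_snli_for_gpt2(premise, label, hypothesis):
    return (f"PREM {premise} /PREM "
            f"ANS {label} /ANS "
            f"HYP {hypothesis} /HYP")

s = format_snli_for_gpt2("Five black dogs run in a field.",
                         "entailment",
                         "Some animals running.")
```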

D Influence Functions
In practice, since the generalization error is usually approximated by the validation loss, a training example $x_i$ is considered detrimental if it increases the validation loss, i.e.:

$L(X_{\mathrm{val}}, \hat{\theta}(X_{\mathrm{train}} \cup \{x_i\})) - L(X_{\mathrm{val}}, \hat{\theta}(X_{\mathrm{train}})) > 0,$

where $X_{\mathrm{val}}$ is a validation set, $l$ is a loss function, $L(X, \theta)$ denotes the average of $l$ over $X$, and $\hat{\theta}(X_{\mathrm{train}}) = \mathrm{argmin}_{\theta \in \Theta} L(X_{\mathrm{train}}, \theta)$ is an empirical risk minimizer.
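A small self-contained numerical check of this definition, and of the influence-based linear approximation used to avoid retraining, assuming a toy squared loss $l(x, \theta) = \frac{1}{2}(x - \theta)^2$ whose empirical risk minimizer is the training mean and whose Hessian is 1 (all data here are illustrative):

```python
import numpy as np

# Toy instantiation (ours, for illustration): l(x, t) = 0.5 * (x - t)^2,
# so theta_hat is the training mean and the per-example Hessian is 1.
train = np.linspace(0.0, 2.0, 101)       # theta_hat = 1.0
val = np.array([1.0, 1.2])
x_new = 0.0
N = len(train)

theta = train.mean()                      # theta_hat(X_train)

def val_loss(t):                          # L(X_val, t)
    return 0.5 * np.mean((val - t) ** 2)

# Exact validation-loss change from retraining with x_new added:
actual_delta = val_loss(np.append(train, x_new).mean()) - val_loss(theta)

# Influence-based linear approximation (1/N) * I_up,loss(x_new),
# using I_up,params(x) = -H^{-1} grad l(x, theta_hat) = x - theta_hat:
grad_val = np.mean(theta - val)           # grad of L(X_val, .) at theta_hat
approx_delta = grad_val * (x_new - theta) / N

print(actual_delta > 0)                   # x_new increases validation loss
```

Both quantities come out small and positive here, i.e. the approximation agrees with retraining that `x_new` is detrimental.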
The main result from previous work (Atkinson et al., 1983; Koh and Liang, 2017) tells us that the influence of upweighting a training example $x$ by some small $\epsilon$ on the model parameters $\hat{\theta}$, with corresponding parameter space $\Theta$, is given by:

$\mathcal{I}_{\mathrm{up,params}}(x) := \frac{d\hat{\theta}_{\epsilon,x}}{d\epsilon}\Big|_{\epsilon=0} = -H_{\hat{\theta}}^{-1} \nabla_{\theta} l(x, \hat{\theta}),$

where $w_i$ is the weight of training example $x_i$ and $H_{\hat{\theta}} = \frac{1}{\sum_i w_i} \sum_i w_i \nabla^2_{\theta} l(x_i, \hat{\theta})$ is the Hessian evaluated at $\hat{\theta}$. This result is a slight generalization of Koh and Liang (2017), since the simple average used in that work is a special case of our weighted average; it is straightforward to extend their proof to the weighted empirical risk case, so we omit the details here. We then apply the chain rule to obtain the influence of upweighting $x$ on the validation loss:

$\mathcal{I}_{\mathrm{up,loss}}(x) := \frac{dL(X_{\mathrm{val}}, \hat{\theta}_{\epsilon,x})}{d\epsilon}\Big|_{\epsilon=0} = \nabla_{\theta} L(X_{\mathrm{val}}, \hat{\theta})^{\top} \mathcal{I}_{\mathrm{up,params}}(x).$

Note that $L(X_{\mathrm{train}}, \theta)$ can be rewritten in the following weighted-average form to incorporate a new training example $x_{\mathrm{new}}$:

$L(X_{\mathrm{train}}, \theta) = \frac{1}{\sum_{i} w_i} \sum_{i=1}^{N+1} w_i \, l(x_i, \theta),$

where $w_i = 1$ for all $i \leq N$, $w_{N+1} = 0$, and $x_{N+1} = x_{\mathrm{new}}$. Adding the new training example $x_{\mathrm{new}}$ is then equivalent to upweighting $x_{N+1}$ by $\frac{1}{N}$. Applying the influence function $\mathcal{I}_{\mathrm{up,loss}}(x)$, we obtain the following linear approximation of the validation loss change upon adding $x_{\mathrm{new}}$:

$L(X_{\mathrm{val}}, \hat{\theta}(X_{\mathrm{train}} \cup \{x_{\mathrm{new}}\})) - L(X_{\mathrm{val}}, \hat{\theta}(X_{\mathrm{train}})) \approx \frac{1}{N}\, \mathcal{I}_{\mathrm{up,loss}}(x_{\mathrm{new}}).$

We adopt the stochastic estimation method described in Koh and Liang (2017) to approximate the inverse-Hessian-vector products efficiently.

Table 9: Input formats for GPT-2. "Q:" and "A:" are the prefixes for a question and a candidate answer (choice). For example, an SNLI input is formatted as "PREM Five black dogs run in a field. /PREM ANS entailment /ANS HYP Some animals running. /HYP", and an ARC-Challenge input as "Q: Which of the following is an example of a physical change? A: breaking a glass /s".

Flat Minima. Prior work (2019) shows that pretraining helps BERT to achieve flat and wide optima in the finetuning stage, which partially explains its performance benefits. We investigate whether G-DAUG^C's data augmentation may also encourage flatter optima.
Specifically, using the fact that a larger Hessian trace implies a sharper local minimum (Yao et al., 2019), we compute the Hessian trace of 10 baseline and 10 G-DAUG^C-Combo models using the Hutchinson method (Avron and Toledo, 2011), and find an average relative decrease of 9.5% for G-DAUG^C-Combo, suggesting that G-DAUG^C does find slightly flatter optima. Likewise, when comparing the best-performing models of each approach, G-DAUG^C-Combo's best model is slightly flatter than the baseline's (a relative decrease of 0.2%). However, we also find the contradictory fact that, over the 20 models, flatter optima tend to be associated with worse task performance (Spearman correlation of 0.39, p ≈ 0.09). So it does not appear that sharpness explains G-DAUG^C's performance advantage over the baseline; a more thorough analysis of this hypothesis is left for future work.

[Tables 14, 15 and 16 (per-task learning rates, epochs, and gradient clipping settings) appear here.]
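The Hutchinson estimator referenced above approximates the trace of a matrix as the expectation of $v^\top H v$ over random Rademacher vectors $v$. A minimal sketch, using a small explicit matrix for illustration (in practice $v^\top H v$ is obtained from Hessian-vector products without ever forming $H$):

```python
import numpy as np

# Hutchinson's trace estimator: tr(H) = E[v^T H v] for Rademacher v.
rng = np.random.default_rng(0)
H = np.array([[2.0, 1.0],
              [1.0, 3.0]])              # true trace = 5

def hutchinson_trace(H, n_samples=2000):
    d = H.shape[0]
    total = 0.0
    for _ in range(n_samples):
        v = rng.choice([-1.0, 1.0], size=d)   # Rademacher probe vector
        total += v @ H @ v
    return total / n_samples

est = hutchinson_trace(H)               # close to 5 for moderate n_samples
```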
Generator/Task Model Ensemble. G-DAUG^C harnesses pretrained knowledge from GPT-2 in order to improve a RoBERTa-based task model. A more standard approach to model combination (albeit with twice the computational cost at runtime) would be to ensemble the two models instead. We evaluate ensembling a baseline RoBERTa model with a finetuned GPT-2 generator for WINOGRANDE in Table 13. We adopt a weighted-average ensemble method, where the weights are tuned on validation data (the tuning is important for achieving peak performance). The ensemble model performs the same as the baseline model, and G-DAUG^C-Combo outperforms both of them by 3.9%. This suggests that G-DAUG^C is more effective than simply ensembling the finetuned generator.
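The weighted-average ensemble with validation-tuned weights can be sketched as follows; the function names, grid, and toy predictions are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

# Minimal sketch: mix two models' class probabilities with a single
# weight w, tuned on validation accuracy by grid search.
def ensemble(p_a, p_b, w):
    return w * p_a + (1.0 - w) * p_b

def tune_weight(p_a, p_b, labels, grid=np.linspace(0.0, 1.0, 21)):
    accs = [(np.argmax(ensemble(p_a, p_b, w), axis=1) == labels).mean()
            for w in grid]
    return grid[int(np.argmax(accs))]

# Toy validation predictions for four binary-choice examples.
p_a = np.array([[0.9, 0.1], [0.4, 0.6], [0.8, 0.2], [0.3, 0.7]])
p_b = np.array([[0.6, 0.4], [0.7, 0.3], [0.2, 0.8], [0.4, 0.6]])
labels = np.array([0, 1, 0, 1])
w = tune_weight(p_a, p_b, labels)
```

Tuning on validation data matters because, as in this toy case, a poorly chosen fixed weight (e.g. equal mixing) can underperform the better single model.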