Counterfactually-Augmented SNLI Training Data Does Not Yield Better Generalization Than Unaugmented Data

A growing body of work shows that models exploit annotation artifacts to achieve state-of-the-art performance on standard crowdsourced benchmarks (datasets collected from crowdworkers to create an evaluation task) while still failing on out-of-domain examples for the same task. Recent work has explored the use of counterfactually-augmented data (data built by minimally editing a set of seed examples to yield counterfactual labels) to augment training data associated with these benchmarks and build more robust classifiers that generalize better. However, Khashabi et al. (2020) find that this type of augmentation yields little benefit on reading comprehension tasks when controlling for dataset size and cost of collection. We build upon this work by using English natural language inference data to test model generalization and robustness. We find that models trained on a counterfactually-augmented SNLI dataset do not generalize better than models trained on unaugmented datasets of similar size, and that counterfactual augmentation can hurt performance, yielding models that are less robust to challenge examples. Counterfactual augmentation of natural language understanding data through standard crowdsourcing techniques does not appear to be an effective way of collecting training data, and further innovation is required to make this general line of work viable.


Introduction
While standard crowdsourced benchmarks have helped create significant progress within natural language processing (NLP), a growing body of evidence shows the existence of exploitable annotation artifacts in these datasets (Gururangan et al., 2018; Poliak et al., 2018; Tsuchiya, 2018) and that models can use artifacts to achieve state-of-the-art performance on these benchmarks (McCoy et al., 2019; Naik et al., 2018). The existence of these artifacts makes it difficult to predict out-of-domain generalization and creates uncertainty around the abilities these tasks are designed to test.
Recent work has explored using counterfactually-augmented datasets to address annotation artifacts with the intent to build more robust classifiers (Kaushik et al., 2020; Khashabi et al., 2020). These datasets are collected by first sampling a set of seed examples and then creating new examples by minimally editing the seed examples to yield counterfactual labels. This type of data collection has been found to mitigate the presence of artifacts in SNLI (Bowman et al., 2015) and is presented as a way to "elucidate the difference that makes a difference" (Kaushik et al., 2020). Further, Khashabi et al. (2020) present this as an efficient method to collect training data yielding models that are "more robust to minor variations and generalize better". However, they also find that unaugmented datasets yield better performance than datasets with 50-50 original-to-augmented data when controlling for training set size and annotation cost.
In our work, we further study whether training with counterfactually-augmented data collected through standard crowdsourcing methods yields models with better generalization and robustness by focusing on the domain of natural language inference (NLI): the task of inferring whether a hypothesis is true given a true premise. We compare RoBERTa (Liu et al., 2019) models trained on three different datasets: (1) the counterfactually-augmented natural language inference (CNLI) training set of 8.3k seed and augmented SNLI examples from Kaushik et al. (2020), (2) a subsampled set of 8.3k unaugmented SNLI examples to control for size, and (3) the 1.7k CNLI seed examples originally sampled from SNLI. We then compare model performances on MNLI (Williams et al., 2018), a dataset for the same task with examples out-of-domain to SNLI, and on two diagnostic sets (Naik et al., 2018; Wang et al., 2019a).
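To make the relationship between the three training conditions concrete, the following is a minimal Python sketch rather than the code used for the experiments. The function name `build_training_conditions`, its arguments, and the choice to draw the size-matched SNLI sample uniformly without replacement under a fixed random seed are assumptions made for illustration; examples are treated as opaque records.

```python
import random


def build_training_conditions(snli, cnli_seed, cnli_augmented, seed=0):
    """Build three training sets: CNLI, size-matched SNLI, and CNLI seed only.

    All names here are hypothetical. `snli`, `cnli_seed`, and `cnli_augmented`
    are sequences of examples in any representation.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible

    # (1) CNLI: seed examples plus their counterfactually-edited versions.
    cnli = list(cnli_seed) + list(cnli_augmented)

    # (2) Unaugmented SNLI, subsampled to the same size as CNLI.
    # Assumption: uniform sampling without replacement from all of SNLI.
    snli_subsample = rng.sample(list(snli), k=len(cnli))

    # (3) Only the seed examples that were originally sampled from SNLI.
    return {
        "cnli": cnli,
        "snli_subsample": snli_subsample,
        "cnli_seed": list(cnli_seed),
    }


# Toy usage with placeholder strings standing in for premise-hypothesis pairs.
toy_snli = [f"snli-{i}" for i in range(100)]
toy_seed = [f"seed-{i}" for i in range(4)]
toy_augmented = [f"edit-{i}" for i in range(16)]
conditions = build_training_conditions(toy_snli, toy_seed, toy_augmented)
print({name: len(examples) for name, examples in conditions.items()})
```

Matching condition (2) to the size of condition (1) is what lets a difference between the two be attributed to augmentation rather than to the number of training examples.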
We find that RoBERTa trained on CNLI yields similar performance on out-of-domain MNLI examples when compared to the unaugmented subsampled SNLI training set and that adding counterfactually-augmented examples to the CNLI seed set improves generalization. Further, we find that the improvement over seed examples corresponds to an increase in n-grams from the addition of augmented examples, roughly doubling the number of 4-grams, and may be a result of improved lexical diversity from a larger training set. While we see similar trends in most of our diagnostic evaluations, we also find evidence that including augmented examples can yield worse performance than only training with seed examples.
While there is evidence of the benefits of using this type of data for model evaluation (Gardner et al., 2020), we find that using counterfactually-augmented data for training can yield less robust models. We argue that further innovation is required to effectively crowdsource counterfactually-augmented natural language understanding (NLU) data for training more robust models with better generalization.

Related Work
Recent works show that several NLI benchmark datasets contain exploitable annotation artifacts. Several studies (Poliak et al., 2018; Gururangan et al., 2018; Tsuchiya, 2018) show that models trained on hypothesis-only examples manage to perform as much as 35 points higher than chance. Gururangan et al. (2018) also find negation words such as no or never are strongly associated with contradiction predictions. Other works (Naik et al., 2018; McCoy et al., 2019) find that models can exploit premise-hypothesis word overlap to achieve state-of-the-art performance on benchmarks by using associations of high overlap with entailment predictions and low overlap with neutral predictions. Nie et al. (2020) use an adversarial human-and-model-in-the-loop procedure to address these concerns in Adversarial NLI (ANLI). Using a model in the loop makes ANLI inherently adversarial towards the model used, and we instead focus on naturally collected human-in-the-loop augmented data. Kaushik et al. (2020) crowdsource counterfactually-augmented NLI examples that reduce the presence of hypothesis-only bias in SNLI by providing a set of seed examples to crowdworkers and prompting them to minimally edit either the hypothesis or premise to yield a counterfactual label. Khashabi et al. (2020) present this type of data collection as an efficient method to build training sets yielding robust models that generalize better by crowdsourcing counterfactually-augmented BoolQ examples. However, they also find that augmented datasets yield similar or worse performance when the cost of augmenting an example is no cheaper than collecting a new one and the datasets are controlled for size. We differ from Kaushik et al. (2020) by focusing on performance on out-of-domain examples and from Khashabi et al. (2020) by focusing on the task of NLI instead of reading comprehension.
Gardner et al. (2020) use contrast sets written manually by NLP researchers to evaluate models on various annotated tasks. They show that most datasets require 1-3 minutes per augmented example, taking 17-50 hours to create 1,000 examples. We differ by using crowdsourced counterfactually-augmented data and focusing on their use for training instead of evaluation.

Experimental Setup
We perform two experiments to study the effects of counterfactually-augmented NLI training data. All experiments use RoBERTa trained on SNLI, CNLI, or CNLI seed examples originally sampled from SNLI and compare performances on various tasks. We first compare MNLI performances to evaluate the impact on model generalization to out-of-domain data. We then use the diagnostic examples from Naik et al. (2018) and the GLUE diagnostic set (Wang et al., 2019a) to study model robustness to challenge examples.
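One of the Naik et al. (2018) diagnostic categories discussed in the results, Negation, is built by appending the phrase "and false is not true" to every hypothesis in the MNLI validation set. The Python sketch below illustrates that construction under stated assumptions: the dictionary keys `premise`, `hypothesis`, and `label`, the function name `add_negation_distractor`, and the handling of whitespace and punctuation are hypothetical and may differ from the released stress-test data.

```python
NEGATION_SUFFIX = "and false is not true"


def add_negation_distractor(example, suffix=NEGATION_SUFFIX):
    """Return a copy of `example` with a tautology appended to the hypothesis.

    The appended phrase is always true, so the gold label is kept unchanged.
    Assumption: the phrase is joined with a single space and any existing
    punctuation in the hypothesis is left as it is.
    """
    perturbed = dict(example)  # shallow copy, the input is not modified
    perturbed["hypothesis"] = f"{example['hypothesis'].rstrip()} {suffix}"
    return perturbed


# Toy usage with a made-up example.
original = {
    "premise": "A man is playing a guitar on stage.",
    "hypothesis": "A man is performing music",
    "label": "entailment",
}
challenge = add_negation_distractor(original)
print(challenge["hypothesis"])
```

A model that has learned to associate negation words with the contradiction label may change its prediction on the perturbed example even though the gold label is the same.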
Training Data In SNLI, Bowman et al. (2015) prompt crowdworkers with a scene description premise to collect three hypothesis sentences corresponding to entailment, neutral, and contradiction labels, yielding 570k English premise-hypothesis pairs. Kaushik et al. (2020)

Results
Generalization to MNLI From the median scores in Figure 1, we see that models trained on CNLI perform no better than models trained on a comparably large sample of unaugmented SNLI examples. This is in line with findings from Khashabi et al. (2020), where training with their minimally perturbed BoolQ dataset of seed and augmented examples yields similar or worse performance on out-of-domain tasks compared to the original BoolQ training set. Additionally, the improvement of CNLI over the 1.7k seed examples shows that counterfactual examples are somewhat helpful when they are strictly additive, as in Khashabi et al. (2020).
Figure 2 presents performances on the diagnostic sets from Naik et al. (2018) and Wang et al. (2019a). For the GLUE diagnostic sets, we follow the authors and use R_3 (Gorodkin, 2004) as our evaluation metric. The distributions of classification accuracy again show that CNLI yields similar performance compared to unaugmented datasets of similar size on most of the categories. However, we find that training on CNLI yields worse performance than using either unaugmented SNLI or CNLI seed examples for Negation examples. These challenge examples append the phrase "and false is not true" to every hypothesis in the MNLI validation set. This construction introduces the strong negation word "not" to target the association between negation words and the contradiction label without changing the truth condition of the
Lexical Diversity Given the minimal edits constraint in CNLI, we study the lexical diversity of the training sets to see the effectiveness of this constraint and whether the general improvement of CNLI over seed examples is a result of greater diversity from a larger training set. Table 1 provides the number of n-grams present in each training set, with n varying from one to four. We see that including minimally edited examples in CNLI increases the number of n-grams present relative to the seed examples, roughly doubling the number of 4-grams, which corresponds to the general improvement over seed examples. We also observe that CNLI contains only roughly 70% as many 2-, 3-, and 4-grams as similarly large unaugmented training sets. This seems natural given the minimal edits constraint when collecting counterfactually-augmented examples and highlights the fact that this type of data augmentation results in less diversity per example.
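The n-gram comparison can be illustrated with a short Python sketch. It is not the script behind Table 1: it assumes that "number of n-grams present" means distinct n-gram types, and it uses lowercased whitespace tokenization, both of which are assumptions. The function name `count_unique_ngrams` is hypothetical.

```python
def count_unique_ngrams(sentences, max_n=4):
    """Count distinct n-grams (n = 1..max_n) over a collection of sentences.

    Assumptions: lowercased whitespace tokens, and n-grams never span
    sentence boundaries.
    """
    seen = {n: set() for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in seen:
            # zip over n shifted views of the token list yields all n-grams;
            # sentences shorter than n contribute nothing.
            seen[n].update(zip(*(tokens[i:] for i in range(n))))
    return {n: len(grams) for n, grams in seen.items()}


# Toy usage: one seed sentence and one minimally edited counterpart.
seed_only = ["a man is sleeping"]
seed_plus_edit = ["a man is sleeping", "a man is running"]
print(count_unique_ngrams(seed_only))       # {1: 4, 2: 3, 3: 2, 4: 1}
print(count_unique_ngrams(seed_plus_edit))  # {1: 5, 2: 4, 3: 3, 4: 2}
```

In the toy case a single-word edit doubles the 4-gram count but adds only one new unigram, which shows how minimally edited examples can raise higher-order n-gram counts while contributing little new vocabulary per example.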

Conclusion
We follow a similar setup to Khashabi et al. (2020) and use English NLI data to test whether counterfactually-augmented training data yields models that generalize better to out-of-domain data and are more robust to challenge examples. We first find that adding counterfactually-augmented data improves generalization, but provides no advantage over adding similar amounts of unaugmented data. Further, we find that the improvement over seed examples corresponds to an increase in n-gram diversity. We also find that including counterfactually-augmented data can make models less robust to challenge examples. Assuming that crowdworkers take a similar amount of time to make targeted edits to examples and to write new examples (Bowman et al., 2020), there is then no obvious value in crowdsourcing augmentations under current protocols for use as training data.
Despite these findings, we argue that there is still value in naturally collected counterfactually-augmented NLU data. Gardner et al. (2020) show that collecting this type of data can be used as a method to address systematic gaps in testing data. As performances on benchmarks become saturated, we still view this style of augmenting test sets as a viable method to provide longer-lasting benchmarks in addition to standard test set creation.
The success of Gardner et al. (2020) in using expert-designed counterfactual augmentation to target specific phenomena for evaluation suggests that it may be possible to target heuristics in training data with expert guidance during the crowdsourcing process. Further, understanding how to identify heuristics to target and the types of useful augmentations to collect, assuming such a thing is possible, are important directions we leave to future work.