Syntactic Data Augmentation Increases Robustness to Inference Heuristics

Pretrained neural models such as BERT, when fine-tuned to perform natural language inference (NLI), often show high accuracy on standard datasets, but display a surprising lack of sensitivity to word order on controlled challenge sets. We hypothesize that this issue is not primarily caused by the pretrained model’s limitations, but rather by the paucity of crowdsourced NLI examples that might convey the importance of syntactic structure at the fine-tuning stage. We explore several methods to augment standard training sets with syntactically informative examples, generated by applying syntactic transformations to sentences from the MNLI corpus. The best-performing augmentation method, subject/object inversion, improved BERT’s accuracy on controlled examples that diagnose sensitivity to word order from 0.28 to 0.73, without affecting performance on the MNLI test set. This improvement generalized beyond the particular construction used for data augmentation, suggesting that augmentation causes BERT to recruit abstract syntactic representations.


Introduction
In the supervised learning paradigm common in NLP, a large collection of labeled examples of a particular classification task is randomly split into a training set and a test set. The system is trained on this training set, and is then evaluated on the test set. Neural networks, in particular systems pretrained on a word prediction objective such as ELMo (Peters et al., 2018) or BERT (Devlin et al., 2019), excel in this paradigm: with large enough pretraining corpora, these models match or even exceed the accuracy of untrained human annotators on many test sets (Raffel et al., 2019).
At the same time, there is mounting evidence that high accuracy on a test set drawn from the same distribution as the training set does not indicate that the model has mastered the task. This discrepancy can manifest as a sharp drop in accuracy when the model is applied to a different dataset that illustrates the same task (Talmor and Berant, 2019; Yogatama et al., 2019), or as excessive sensitivity to linguistically irrelevant perturbations of the input (Jia and Liang, 2017; Wallace et al., 2019).
One such discrepancy, where strong performance on a standard test set did not correspond to mastery of the task as a human would define it, was documented by McCoy et al. (2019b) for the Natural Language Inference (NLI) task. In this task, the system is given two sentences, and is expected to determine whether one (the premise) entails the other (the hypothesis). Most if not all humans would agree that NLI requires sensitivity to syntactic structure; for example, the following sentences do not entail each other, even though they contain the same words:

(1) The lawyer saw the actor.
(2) The actor saw the lawyer.
McCoy et al. constructed the HANS challenge set, which includes examples of a range of such constructions, and used it to show that, when BERT is fine-tuned on the MNLI corpus (Williams et al., 2018), the fine-tuned model achieves high accuracy on the test set drawn from that corpus, yet displays little sensitivity to syntax; the model wrongly concluded, for example, that (1) entails (2). We consider two explanations as to why BERT fine-tuned on MNLI fails on HANS. Under the Representational Inadequacy Hypothesis, BERT fails on HANS because its pretrained representations are missing some necessary syntactic information. Under the Missed Connection Hypothesis, BERT extracts the relevant syntactic information from the input (cf. Goldberg 2019; Tenney et al. 2019), but it fails to use this information with HANS because there are few MNLI training examples that indicate how syntax should support NLI (McCoy et al., 2019b). It is possible for both hypotheses to be correct: there may be some aspects of syntax that BERT has not learned at all, and other aspects that have been learned, but are not applied to perform inference.
The Missed Connection Hypothesis predicts that augmenting the training set with a small number of examples from one syntactic construction would teach BERT that the task requires it to use its syntactic representations. This would not only cause improvements on the construction used for augmentation, but would also lead to generalization to other constructions. In contrast, the Representational Inadequacy Hypothesis predicts that to perform better on HANS, BERT must be taught how each syntactic construction affects NLI from scratch. This predicts that larger augmentation sets will be required for adequate performance and that there will be little generalization across constructions.
This paper aims to test these hypotheses. We constructed augmentation sets by applying syntactic transformations to a small number of examples from MNLI. Accuracy on syntactically challenging cases improved dramatically as a result of augmenting MNLI with only about 400 examples in which the subject and the object were swapped (about 0.1% of the size of the MNLI training set). Crucially, even though only a single transformation was used in augmentation, accuracy increased on a range of constructions. For example, BERT's accuracy on examples involving relative clauses (e.g., The actors called the banker who the tourists saw ↛ The banker called the tourists) was 0.33 without augmentation, and 0.83 with it. This suggests that our method does not overfit to one construction, but taps into BERT's existing syntactic representations, providing support for the Missed Connection Hypothesis. At the same time, we also observe limits to generalization, supporting the Representational Inadequacy Hypothesis in those cases.

Background
HANS is a template-generated challenge set designed to test whether NLI models have adopted three syntactic heuristics. First, the lexical overlap heuristic is the assumption that any time all of the words in the hypothesis are also in the premise, the label should be entailment. In the MNLI training set, this heuristic often makes correct predictions, and almost never makes incorrect predictions. This may be due to the process by which MNLI was generated: crowdworkers were given a premise and were asked to generate a sentence that contradicts or entails the premise. To minimize effort, workers may have overused lexical overlap as a shortcut to generating entailed hypotheses. Of course, the lexical overlap heuristic is not a generally valid inference strategy, and it fails on many HANS examples; e.g., as discussed above, the lawyer saw the actor does not entail the actor saw the lawyer.
HANS also includes cases that are diagnostic of the subsequence heuristic (assume that a premise entails any hypothesis which is a contiguous subsequence of it) and the constituent heuristic (assume that a premise entails all of its constituents). While we focus on counteracting the lexical overlap heuristic, we will also test for generalization to the other heuristics, which can be seen as particularly challenging cases of lexical overlap. Examples of all constructions used to diagnose the three heuristics are given in Tables A.5, A.6 and A.7.
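As an illustration, the three heuristics can be written as simple rule-based predictors. This is a sketch of ours, not code from HANS itself; the function names are ours, and the constituent heuristic is shown as operating over a list of constituents assumed to be supplied by a parser:

```python
def lexical_overlap(premise, hypothesis):
    """Predict 'entailment' iff every hypothesis word also occurs in the premise."""
    p, h = premise.lower().split(), hypothesis.lower().split()
    return "entailment" if set(h) <= set(p) else "non-entailment"

def subsequence(premise, hypothesis):
    """Predict 'entailment' iff the hypothesis is a contiguous subsequence of the premise."""
    p, h = premise.lower().split(), hypothesis.lower().split()
    n = len(h)
    hit = any(p[i:i + n] == h for i in range(len(p) - n + 1))
    return "entailment" if hit else "non-entailment"

def constituent(premise_constituents, hypothesis):
    """Predict 'entailment' iff the hypothesis matches a premise constituent
    (constituents are assumed to come from a constituency parse)."""
    spans = {c.lower() for c in premise_constituents}
    return "entailment" if hypothesis.lower() in spans else "non-entailment"
```

On the subject/object swap example above, `lexical_overlap` wrongly predicts entailment, since the two sentences contain exactly the same words.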
Data augmentation is often employed to increase robustness in vision (Perez and Wang, 2017) and language (Belinkov and Bisk, 2018; Wei and Zou, 2019), including in NLI (Minervini and Riedel, 2018; Yanaka et al., 2019). In many cases, augmentation with one kind of example improves accuracy on that particular case, but does not generalize to other cases, suggesting that models overfit to the augmentation set (Jia and Liang, 2017; Ribeiro et al., 2018; Iyyer et al., 2018; Liu et al., 2019).

We generated augmentation examples by applying two syntactic transformations, subject/object inversion and passivization, to sentences from MNLI, keeping either the original premise or the untransformed hypothesis as the premise of the new pair (see Table 1 for examples, and §A.2 for details). We experimented with three augmentation set sizes: small (101 examples), medium (405) and large (1215). All augmentation sets were much smaller than the MNLI training set (297k examples). We did not attempt to ensure the naturalness of the generated examples; e.g., in the INVERSION transformation, The carriage made a lot of noise was transformed into A lot of noise made the carriage. In addition, the labels of the augmentation dataset were somewhat noisy; e.g., we assumed that INVERSION changed the correct label from entailment to neutral, but this is not necessarily the case (if The buyer met the seller, it is likely that The seller met the buyer). As we show below, this noise did not hurt accuracy on MNLI.
Finally, we included a random shuffling condition, in which an MNLI premise and its hypothesis were both randomly shuffled, with a random label. We used this condition to test whether a syntactically uninformed method could teach the model that, when word order is ignored, no reliable inferences can be made.
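A minimal sketch of how such examples can be generated (ours, not the authors' released code; the actual pipeline extracts the subject, verb, and object from MNLI's constituency parses, as described in §A.2):

```python
import random

def invert(subj, verb, obj):
    # Swap subject and object: "the carriage made a lot of noise"
    # -> "a lot of noise made the carriage".
    return f"{obj} {verb} {subj}"

def augment_inversion(premise, subj, verb, obj, transformed_hypothesis=True):
    """Generate one inversion example from a source pair whose hypothesis is
    subj+verb+obj. Inversion is assumed to turn entailment into neutral."""
    hyp = f"{subj} {verb} {obj}"
    inv = invert(subj, verb, obj)
    if transformed_hypothesis:
        return (hyp, inv, "neutral")   # premise is the untransformed hypothesis
    return (premise, inv, "neutral")   # keep the original MNLI premise

def augment_shuffle(premise, hypothesis, rng=random):
    """Random-shuffling condition: shuffle both sentences, pick a random label."""
    p, h = premise.split(), hypothesis.split()
    rng.shuffle(p)
    rng.shuffle(h)
    label = rng.choice(["entailment", "neutral", "contradiction"])
    return (" ".join(p), " ".join(h), label)
```

For example, `augment_inversion("p", "the detective", "followed", "the suspect")` yields the pair (the detective followed the suspect, the suspect followed the detective) with label neutral.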

Experimental setup
We added each augmentation set separately to the MNLI training set, and fine-tuned BERT on each resulting training set. Further fine-tuning details are in Appendix A.1. We repeated this process for five random seeds for each combination of augmentation strategy and augmentation set size, except for the most successful strategy (INVERSION + TRANSFORMED HYPOTHESIS), for which we ran 15 runs for each augmentation size. Following McCoy et al. (2019b), when evaluating on HANS, we merged the neutral and contradiction labels produced by the model into a single non-entailment label.
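The label-merging step can be sketched as follows (a minimal illustration; the function names are ours):

```python
def to_hans_label(mnli_label):
    """Collapse a three-way MNLI prediction to HANS's binary label scheme."""
    return "entailment" if mnli_label == "entailment" else "non-entailment"

def hans_accuracy(predictions, gold):
    """predictions: three-way model outputs; gold: binary HANS labels."""
    merged = [to_hans_label(p) for p in predictions]
    return sum(m == g for m, g in zip(merged, gold)) / len(gold)
```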
For both ORIGINAL PREMISE and TRANSFORMED HYPOTHESIS, we experimented with using each of the transformations separately, and with a combined dataset including both inversion and passivization. We also ran separate experiments with only the passivization examples with an entailment label, and with only the passivization examples with a non-entailment label. As a baseline, we used 100 runs of BERT fine-tuned on the unaugmented MNLI (McCoy et al., 2019a).
We report the models' accuracy on HANS, as well as on the MNLI development set (MNLI test set labels are not publicly available). We did not tune any parameters on this development set. All of the comparisons we discuss below are significant at the p < 0.01 level (based on two-sided t-tests).
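As an illustrative sketch (not the authors' analysis code), the statistic underlying such a comparison of per-seed accuracies between two training conditions can be computed with the standard library; Welch's unequal-variances form is shown here, with the p-value then obtained from the t distribution (e.g., via scipy.stats):

```python
import math
from statistics import mean, stdev

def welch_t(xs, ys):
    """Welch's two-sample t statistic (unequal variances) for comparing
    per-seed HANS accuracies between two training conditions."""
    vx, vy = stdev(xs) ** 2, stdev(ys) ** 2
    return (mean(xs) - mean(ys)) / math.sqrt(vx / len(xs) + vy / len(ys))
```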

Results
Accuracy on MNLI was very similar across augmentation strategies and matched that of the unaugmented baseline (0.84), suggesting that syntactic augmentation with up to 1215 examples does not harm overall performance on the dataset. By contrast, accuracy on HANS varied significantly, with most models performing worse than chance (which is 0.50 on HANS) on non-entailment examples, suggesting that they adopted the heuristics (Figure 1). The most effective augmentation strategy, by a large margin, was inversion with a transformed hypothesis. Accuracy on the HANS word overlap cases for which the correct label is non-entailment (e.g., the doctor saw the lawyer ↛ the lawyer saw the doctor) was 0.28 without augmentation, and 0.73 with the large version of this augmentation set. Simultaneously, this strategy decreased BERT's accuracy on the cases where the heuristic makes the correct prediction (The tourists by the actor called the authors → The tourists called the authors); in fact, the best model's accuracy was similar across cases where lexical overlap made correct and incorrect predictions, suggesting that this intervention prevented the model from adopting the heuristic.
The random shuffling method did not improve over the unaugmented baseline, suggesting that syntactically-informed transformations are essential (Table A.2). Passivization yielded a much smaller benefit than inversion, perhaps due to the presence of overt markers such as the word by, which may lead the model to attend to word order only when those are present. Intriguingly, even on the passive examples in HANS, inversion was more effective than passivization (large inversion augmentation: 0.13; large passivization augmentation: 0.01). Finally, inversion on its own was more effective than the combination of inversion and passivization.
We now analyze in more detail the most effective strategy, inversion with a transformed hypothesis. First, this strategy is similar on an abstract level to the HANS subject/object swap category, but the two differ in vocabulary and some syntactic properties; despite these differences, performance on this HANS category was perfect (1.00) with medium and large augmentation, indicating that BERT benefited from the high-level syntactic structure of the transformation. For the small augmentation set, accuracy on this category was 0.53, suggesting that 101 examples are insufficient to teach BERT that subjects and objects cannot be freely swapped. Conversely, tripling the augmentation size from medium to large had a moderate and inconsistent effect across HANS subcases (see Appendix A.3 for case-by-case results); for clearer insight about the role of augmentation size, it may be necessary to sample this parameter more densely.
Although inversion was the only transformation in this augmentation set, performance also improved dramatically on constructions other than subject/object swap (Figure 2); for example, the models handled examples involving a prepositional phrase better, concluding, for instance, that The judge behind the manager saw the doctors does not entail The doctors saw the manager (unaugmented: 0.41; large augmentation: 0.89). There was a much more moderate, but still significant, improvement on the cases targeting the subsequence heuristic; this smaller degree of improvement suggests that contiguous subsequences are treated separately from lexical overlap more generally. One exception was accuracy on "NP/S" inferences, such as the managers heard the secretary resigned ↛ The managers heard the secretary, which improved dramatically from 0.02 (unaugmented) to 0.50 (large augmentation). A number of approaches have recently been proposed for improving performance on HANS. While some of these approaches yield higher accuracy on HANS than ours, including better generalization to the constituent and subsequence cases (see Table A.4), they are not directly comparable: our goal is to assess how the prevalence of syntactically challenging examples in the training set affects BERT's NLI performance, without modifying either the model or the training procedure.

Discussion
Our best-performing strategy involved augmenting the MNLI training set with a small number of instances generated by applying the subject/object inversion transformation to MNLI examples. This yielded considerable generalization: both to another domain (the HANS challenge set), and, more importantly, to additional constructions, such as relative clauses and prepositional phrases. This supports the Missed Connection Hypothesis: a small amount of augmentation with one construction induced abstract syntactic sensitivity, instead of just "inoculating" the model against failing on the challenge set by providing it with a sample of cases from the same distribution (Liu et al., 2019). At the same time, the inversion transformation did not completely counteract the heuristic; in particular, the models showed poor performance on passive sentences. For these constructions, then, BERT's pretraining may not yield strong syntactic representations that can be tapped into with a small nudge from augmentation; in other words, this may be a case where our Representational Inadequacy Hypothesis holds. This hypothesis predicts that pretrained BERT, as a word prediction model, struggles with passives, and may need to learn the properties of this construction specifically for the NLI task; this would likely require a much larger number of augmentation examples.
The best-performing augmentation strategy involved generating premise/hypothesis pairs from a single source sentence-meaning that this strategy does not rely on an NLI corpus. The fact that we can generate augmentation examples from any corpus makes it possible to test if very large augmentation sets are effective (with the caveat, of course, that augmentation sentences from a different domain may hurt performance on MNLI itself).
Ultimately, it would be desirable to have a model with a strong inductive bias for using syntax across language understanding tasks, even when overlap heuristics lead to high accuracy on the training set; indeed, it is hard to imagine that a human would ignore syntax entirely when understanding a sentence. An alternative would be to create training sets that adequately represent a diverse range of linguistic phenomena; crowdworkers' (rational) preferences for using the simplest generation strategies possible could be counteracted by approaches such as adversarial filtering (Nie et al., 2019). In the interim, however, we conclude that data augmentation is a simple and effective strategy to mitigate known inference heuristics in models such as BERT.

A.1 Fine-tuning details
We used bert-base-uncased for all experiments. As is standard, we fine-tuned this pretrained model on MNLI by training a linear classifier to predict the label from the CLS token's final layer embedding, while continuing to update BERT's parameters (Devlin et al., 2019). The order of training examples was reshuffled for each model. All models were trained for three epochs.

A.2 Generating augmentation examples
The following list describes the augmentation strategies we used. Table A.1 illustrates all of these strategies as applied to a particular source sentence. Note that inversion generally changes the meaning of the sentence (the detective followed the suspect refers to a different event from the suspect followed the detective), but passivization on its own does not (the detective followed the suspect refers to the same event as the suspect was followed by the detective).
• Inversion (original premise): For a source example (p, h, →), generate (p, INV(h), ↛), where INV returns the source sentence with the subject and object switched. Ignore source examples whose label is ↛.
• Passivization (original premise): For a source example (p, h) (with any label), generate (p, PASS(h)), with the same label, where PASS returns the passive version of the source sentence (without changing its meaning).
• Transformed hypothesis variants: For each transformation, we also created a TRANSFORMED HYPOTHESIS variant, in which both members of the generated pair come from a single source sentence: the untransformed sentence serves as the premise, and its transformed version as the hypothesis (e.g., (h, INV(h), ↛)).
We identified transitive sentences in MNLI that could serve as source sentences using the constituency parses provided with MNLI, excluding the noisier TELEPHONE genre. We did so by searching for matrix S nodes with exactly one NP daughter of the VP, where the subject and the object were both full noun phrases (i.e., neither were a personal pronoun such as me), and where the verb lemma was not be or have. We kept the original tense of the verb, and modified its agreement features if necessary (e.g., the movie stars Matt Dillon and Gary Sinise was transformed into Matt Dillon and Gary Sinise star the movie).
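As an illustration (not the authors' code), the criteria above can be approximated over bracketed Penn-Treebank-style parse strings with a small pure-Python matcher; here the lemma check for be/have is simplified to a surface-form list:

```python
import re

def parse(sexp):
    """Parse a bracketed constituency parse, e.g.
    '(S (NP (DT the) (NN movie)) (VP (VBZ stars) (NP (NNP Matt) (NNP Dillon))))',
    into nested (label, children) tuples; leaf children are word strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", sexp)
    def read(i):
        assert tokens[i] == "("
        label = tokens[i + 1]
        i += 2
        children = []
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = read(i)
            else:
                child = tokens[i]
                i += 1
            children.append(child)
        return (label, children), i + 1
    tree, _ = read(0)
    return tree

def is_transitive_source(tree):
    """Matrix S with one NP subject and one VP; the VP has exactly one NP
    daughter; neither argument is a personal pronoun (tag PRP); and the verb
    is not a form of 'be' or 'have' (surface-form stand-in for a lemma check)."""
    label, kids = tree
    if label != "S":
        return False
    nps = [k for k in kids if isinstance(k, tuple) and k[0] == "NP"]
    vps = [k for k in kids if isinstance(k, tuple) and k[0] == "VP"]
    if len(nps) != 1 or len(vps) != 1:
        return False
    vp_kids = vps[0][1]
    obj_nps = [k for k in vp_kids if isinstance(k, tuple) and k[0] == "NP"]
    if len(obj_nps) != 1:
        return False
    def has_prp(node):
        return any(isinstance(k, tuple) and k[0] == "PRP" for k in node[1])
    if has_prp(nps[0]) or has_prp(obj_nps[0]):
        return False
    verbs = [k for k in vp_kids if isinstance(k, tuple) and k[0].startswith("VB")]
    if not verbs:
        return False
    return verbs[0][1][0].lower() not in {"is", "are", "was", "were", "be",
                                          "been", "being", "has", "have", "had"}
```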
The size of the largest augmentation set was 1215 for all strategies. This size was determined by the largest augmentation dataset we could generate from MNLI for the inversion with original premise strategy using the procedure described above. For fair comparison, we kept the same size even for strategies where we could have generated a larger dataset. We also created a medium dataset by randomly sampling 405 of the cases identified using this procedure, as well as a small dataset with 101 examples. We performed this process only once for each strategy: as such, runs varied only in the classifier's weight initialization and the order of examples, but not in the augmentation examples included in training.
To create the Combined augmentation dataset, we concatenated the inversion and passivization datasets, then randomly discarded half of the examples (to match the size of the combined dataset with the others). As with the other datasets, we only did this once: the Combined augmentation set was the same across runs. One consequence of this procedure is that the number of passivization and inversion examples was not exactly identical.
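The sampling procedure can be sketched as follows (an illustration under the assumption of one fixed seed per strategy; the function names are ours):

```python
import random

def make_sets(examples, sizes=(101, 405, 1215), seed=0):
    """Subsample fixed-size augmentation sets once, so that training runs
    differ only in initialization and example order, not in the sets."""
    rng = random.Random(seed)
    large = rng.sample(examples, sizes[-1]) if len(examples) > sizes[-1] else list(examples)
    return {n: rng.sample(large, n) for n in sizes}

def make_combined(inversion, passivization, seed=0):
    """Concatenate the two strategies' sets, then discard half at random so
    the combined set matches the others in size."""
    rng = random.Random(seed)
    pool = inversion + passivization
    return rng.sample(pool, len(pool) // 2)
```

Because `make_combined` samples from the pooled examples rather than from each strategy separately, the numbers of inversion and passivization examples it keeps are not exactly equal, matching the procedure described above.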

A.3 Detailed Results
The following tables provide the detailed results of our experiments. Table A.2 shows each strategy's mean accuracy on MNLI, as well as on the HANS cases that diagnose each of the three heuristics (the Lexical Overlap Heuristic, the Subsequence Heuristic, and the Constituent Heuristic), for which the correct label is non-entailment (↛). Table A.3 zooms in on the effect of the best-performing augmentation strategy, subject/object inversion with a transformed hypothesis, on BERT's accuracy on HANS, both when the correct label is entailment (→) and when it is non-entailment (↛). Finally, the last three tables detail the effect of augmentation by inversion with a transformed hypothesis on each of the 30 HANS subcases, broken down by the heuristic that they were designed to diagnose: the Lexical Overlap Heuristic (Table A.5), the Subsequence Heuristic (Table A.6), and the Constituent Heuristic (Table A.7).

Table A.5: Subject/object inversion with a transformed hypothesis: results for the HANS subcases that are diagnostic of the lexical overlap heuristic, for four training regimens: unaugmented (trained only on MNLI), and with small (n = 101), medium (n = 405) and large (n = 1215) augmentation sets. Chance performance is 0.5. Top: cases in which the gold label is non-entailment. Bottom: cases in which the gold label is entailment.

Table A.6: Subject/object inversion with a transformed hypothesis: results for the HANS subcases diagnostic of the subsequence heuristic, for four training regimens: unaugmented (trained only on MNLI), and with small (n = 101), medium (n = 405) and large (n = 1215) augmentation sets. Top: cases in which the gold label is non-entailment. Bottom: cases in which the gold label is entailment.

Table A.7: Subject/object inversion with a transformed hypothesis: results for the HANS subcases diagnostic of the constituent heuristic, for four training regimens: unaugmented (trained only on MNLI), and with small (n = 101), medium (n = 405) and large (n = 1215) augmentation sets. Chance performance is 0.5. Top: cases in which the gold label is non-entailment. Bottom: cases in which the gold label is entailment.