BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance

If the same neural network architecture is trained multiple times on the same dataset, will it make similar linguistic generalizations across runs? To study this question, we fine-tuned 100 instances of BERT on the Multi-genre Natural Language Inference (MNLI) dataset and evaluated them on the HANS dataset, which tests syntactic generalization in natural language inference. On the MNLI development set, the behavior of all instances was remarkably consistent, with accuracy ranging between 83.6% and 84.8%. In stark contrast, the same models varied widely in their generalization performance. For example, on the simple case of subject-object swap (e.g., determining that “the doctor visited the lawyer” does not entail “the lawyer visited the doctor”), accuracy ranged from 0.0% to 66.2%. Such variation is likely due to the presence of many local minima in the loss surface that are equally attractive to a low-bias learner such as a neural network; decreasing the variability may therefore require models with stronger inductive biases.


Introduction
Generalization is a crucial component of learning a language. No training set can contain all possible sentences, so learners must be able to generalize to sentences that they have never encountered before. We differentiate two types of generalization:

1. In-distribution generalization: generalization to examples which are novel but which are drawn from the same distribution as the training set.

2. Out-of-distribution generalization: generalization to examples drawn from a different distribution than the training set.
Standard test sets in natural language processing are generated in the same way as the corresponding training set, therefore testing only in-distribution generalization. Current neural architectures perform very well at this type of generalization. For example, on the natural language understanding tasks included in the GLUE benchmark (Wang et al., 2019), several Transformer-based models (Liu et al., 2019a,b; Raffel et al., 2020) have surpassed the human baselines from Nangia and Bowman (2019). However, this strong performance does not necessarily indicate mastery of language. Because of biases in training distributions, it is often possible for a model to achieve strong in-distribution generalization by using shallow heuristics rather than deeper linguistic knowledge. Therefore, evaluating only on standard test sets cannot reveal whether a model has learned abstract properties of language or whether it has only learned shallow heuristics.
An alternative evaluation approach addresses this flaw by testing how the model handles particular linguistic phenomena, using datasets designed to be impossible to solve using shallow heuristics. In this line of investigation, which tests out-of-distribution generalization, the results are more mixed. Some works have found successful handling of phenomena such as subject-verb agreement (Gulordava et al., 2018) and filler-gap dependencies (Wilcox et al., 2018). Other works, however, have illuminated surprising failures even on seemingly simple types of examples (Marvin and Linzen, 2018). Such results make it clear that there is still much room for improvement in how neural models perform on syntactic structures that are rare in training corpora.
In this work, we investigate whether the linguistic generalization behavior of a given neural architecture is consistent across multiple instances of that architecture. This question is important because, in order to tell which types of architectures generalize best, we need to know whether successes and failures of generalization should be attributed to aspects of the architecture or to random luck in the choice of the model's initial weights.
We investigate this question using the task of natural language inference (NLI). We fine-tuned 100 instances of BERT (Devlin et al., 2019) on the MNLI dataset (Williams et al., 2018). These 100 instances differed only in (i) the initial weights of the classifier trained on top of BERT, and (ii) the order in which training examples were presented. All other aspects of training, including the initial weights of BERT, were held constant. We evaluated these 100 instances on both the in-distribution MNLI development set and the out-of-distribution HANS evaluation set (McCoy et al., 2019), which tests syntactic generalization in NLI models.
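In practical terms, each instance corresponds to re-running an identical fine-tuning procedure with a different random seed, where the seed influences only the classifier head's initialization and the order in which training batches are presented. The following is a minimal sketch of this setup using the HuggingFace transformers library and PyTorch (the actual experiments used the fine-tuning code from the BERT GitHub repository, as described below); the hyperparameters and the pre-tokenized train_dataset are assumptions for illustration.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification

def finetune_instance(seed, train_dataset, num_epochs=3, lr=2e-5):
    # The seed controls (i) the random initialization of the classifier
    # head and (ii) the order in which training examples are presented.
    torch.manual_seed(seed)

    # The BERT encoder starts from the same pre-trained checkpoint in every
    # run; only the freshly initialized classification head varies with seed.
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased",
        num_labels=3,  # entailment, contradiction, neutral
    )

    # Seeding the DataLoader's generator makes the example order a pure
    # function of the seed.
    loader = DataLoader(
        train_dataset, batch_size=32, shuffle=True,
        generator=torch.Generator().manual_seed(seed),
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    model.train()
    for _ in range(num_epochs):
        for batch in loader:  # dicts with input_ids, attention_mask, labels
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model

# 100 instances that differ only in their seed:
# models = [finetune_instance(seed, mnli_train) for seed in range(100)]
```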
We found that these 100 instances were remarkably consistent in their in-distribution generalization accuracy, with all accuracies on the MNLI development set falling in the range 83.6% to 84.8%, and with a high level of consistency on labels for specific examples (e.g., we identified 526 examples that all 100 instances labeled incorrectly). In contrast, these 100 instances varied dramatically in their out-of-distribution generalization performance; for example, on one of the thirty categories of examples in the HANS dataset, accuracy ranged from 4% to 76%. These results show that, when assessing the linguistic generalization of neural models, it is important to consider multiple training runs of each architecture, since models can differ vastly in how they perform on examples drawn from a different distribution than the training set, even when they perform similarly on an in-distribution test set.

In-distribution generalization
Several works have noted that the same architecture can have very different in-distribution generalization across restarts of the same training process (Reimers and Gurevych, 2017, 2018; Madhyastha and Jain, 2019). Most relevantly for our work, fine-tuning of BERT is unstable for some datasets, such that some runs achieve state-of-the-art results while others perform poorly (Devlin et al., 2019; Phang et al., 2018). Unlike these past works, we focus on out-of-distribution generalization rather than in-distribution generalization.

Out-of-distribution generalization
Several other works have noted variation in out-of-distribution syntactic generalization. Weber et al. (2018) trained 50 instances of a sequence-to-sequence model on a symbol replacement task. These instances consistently had above 99% accuracy on the in-distribution test set but varied on out-of-distribution generalization sets; in the most variable case, accuracy ranged from close to 0% to over 90%. Similarly, McCoy et al. (2018) trained 100 instances for each of six types of networks, using a synthetic training set that was ambiguous between two generalizations. Some architectures consistently made the same generalization across runs, but others varied considerably, with some instances strongly preferring one of the two generalizations that were plausible given the training set and other instances strongly preferring the other. Finally, Liška et al. (2018) trained 5,000 instances of recurrent neural networks on the lookup tables task. Most of these instances failed at compositional generalization, but a small number generalized well.
These works on variation in out-of-distribution generalization all used simple, synthetic tasks with training sets designed to exclude certain types of examples. Our work tests whether models are still as variable when trained on a natural-language training set that is not adversarially designed. In concurrent work, Zhou et al. (2020) also measured variability in out-of-distribution performance for 3 models (including BERT) on 12 datasets (including HANS). Their work has impressive breadth, whereas we instead aim for depth: we analyze the particular categories within HANS to give a fine-grained investigation of syntactic generalization, while Zhou et al. only report overall accuracy averaged across categories. In addition, we fine-tuned 100 instances of BERT, while Zhou et al. only fine-tuned 10 instances. The larger number of instances allows us to investigate the extent of the variability in more detail.

Linguistic analysis of BERT
Many recent papers have sought a deeper understanding of BERT, whether to assess its encoding of sentence structure (Lin et al., 2019; Hewitt and Manning, 2019; Chrupała and Alishahi, 2019; Jawahar et al., 2019; Tenney et al., 2019b); its representational structure more generally (Abnar et al., 2019); its handling of specific linguistic phenomena such as subject-verb agreement (Goldberg, 2019), negative polarity items (Warstadt et al., 2019), function words (Kim et al., 2019), or a variety of psycholinguistic phenomena (Ettinger, 2020); its internal workings (Coenen et al., 2019; Tenney et al., 2019a; Clark et al., 2019); or its inductive biases (Warstadt and Bowman, 2020). The novel contribution of this work is its focus on variability across a large number of fine-tuning runs; previous works have generally used models without fine-tuning or have used only a small number of fine-tuning runs (usually one, and at most ten).

Task and datasets
We used the task of natural language inference (NLI, also known as Recognizing Textual Entailment; Condoravdi et al., 2003; Dagan et al., 2006, 2013), which involves giving a model two sentences, called the premise and the hypothesis. The model must then output entailment if the premise entails (i.e., implies the truth of) the hypothesis, contradiction if the premise contradicts the hypothesis, or neutral otherwise. For training, we used the training set of the MNLI dataset (Williams et al., 2018). For evaluation, we used the HANS dataset, which targets three shallow heuristics that NLI models may adopt: the lexical overlap heuristic, the subsequence heuristic, and the constituent heuristic (defined and exemplified in Figure 1).
To assess whether a model has learned these heuristics, HANS contains examples where each heuristic makes the right predictions (i.e., where the correct label is entailment) and examples where each heuristic makes the wrong predictions (i.e., where the correct label is non-entailment). A model that has adopted one of the heuristics will output entailment for all examples targeting that heuristic, even when the correct answer is non-entailment.
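One simple way to operationalize this diagnostic is to compute accuracy separately for the entailment and non-entailment examples targeting each heuristic. A minimal sketch follows; the field names and the predict interface are assumptions for illustration, not the actual HANS distribution format.

```python
from collections import defaultdict

def accuracy_by_heuristic(examples, predict):
    """examples: dicts with 'premise', 'hypothesis', 'heuristic', and a
    'label' of 'entailment' or 'non-entailment' (field names assumed).
    predict: maps (premise, hypothesis) to one of those two labels."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        key = (ex["heuristic"], ex["label"])
        total[key] += 1
        correct[key] += predict(ex["premise"], ex["hypothesis"]) == ex["label"]
    return {key: correct[key] / total[key] for key in total}

# A model that has adopted a heuristic scores near 100% on that heuristic's
# entailment examples and near 0% on its non-entailment examples.
```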

Models and training
All of our models consisted of BERT with a linear classifier on top of it outputting labels of entailment, contradiction, or neutral. We fine-tuned 100 instances of this model on MNLI using the fine-tuning code from the BERT GitHub repository. The BERT component of each instance was initialized with the pre-trained bert-base-uncased weights. For evaluation on HANS, we translated outputs of contradiction and neutral into a single non-entailment label, following McCoy et al. (2019).
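For example, assuming a classifier that produces logits over the three MNLI labels, the collapsing step might look like the following sketch; the label ordering here is an assumption and depends on the label encoding used during fine-tuning.

```python
import torch

# Assumed MNLI label encoding (the order depends on the fine-tuning setup):
MNLI_LABELS = ["entailment", "contradiction", "neutral"]

def hans_label(logits: torch.Tensor) -> str:
    """Map a 3-way MNLI prediction onto HANS's 2-way label space:
    contradiction and neutral both count as non-entailment."""
    pred = MNLI_LABELS[int(logits.argmax(dim=-1))]
    return "entailment" if pred == "entailment" else "non-entailment"
```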

In-distribution generalization
The 100 instances were remarkably consistent on in-distribution generalization, with all models scoring between 83.6% and 84.8% on the MNLI development set (Figure 2).

Figure 1: The three heuristics targeted by HANS. In each example, the heuristic wrongly predicts that the premise entails the hypothesis (↛ indicates non-entailment).

Lexical overlap: Assume that a premise entails all hypotheses constructed from words in the premise.
Example: The doctor was paid by the actor. ↛ The doctor paid the actor.

Subsequence: Assume that a premise entails all of its contiguous subsequences.
Example: The doctor near the actor danced. ↛ The actor danced.

Constituent: Assume that a premise entails all complete subtrees in its parse tree.
Example: If the artist slept, the actor ran. ↛ The artist slept.

On average, among any pair of fine-tuned BERT instances, the two members of the pair agreed on the labels of 93.1% of the examples (when considering all three labels of entailment, contradiction, and neutral, rather than the collapsed labels of entailment and non-entailment). To give a sense of consistency across all 100 instances (rather than only among pairs of instances), we also identified the 526 examples that all 100 instances answered incorrectly, some of which are shown in (7) through (14). Some of these examples arguably have incorrect labels in the dataset, such as (7) (because the hypothesis mentions a report which the premise does not mention), so it is unsurprising that models found such examples difficult. Other consistently difficult examples involve areas that one might intuitively expect to be tricky for models trained on natural language, such as world knowledge (e.g., (8) requires knowledge of how long forearms are, and (9) requires knowledge of what nodding is), the ability to count (e.g., (10)), or fine-grained shades of meaning that might require multiple steps of reasoning (e.g., (11) and (12)). Some of the consistently difficult examples have a high degree of lexical overlap yet are not labeled entailment (such as (13)); the difficulty of such examples adds further evidence for the conclusion that these models have adopted the lexical overlap heuristic. Finally, there are some examples, such as (14), for which it is unclear why models find them so difficult.
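Both consistency measures reported above can be computed directly from a matrix of per-example predictions; the following is a minimal sketch, assuming predictions are stored as an instances-by-examples array of 3-way label IDs.

```python
import numpy as np

def mean_pairwise_agreement(preds: np.ndarray) -> float:
    """preds: (n_instances, n_examples) array of 3-way label IDs.
    Returns the fraction of examples on which two instances agree,
    averaged over all pairs of instances."""
    n = preds.shape[0]
    pair_scores = [
        (preds[i] == preds[j]).mean()
        for i in range(n) for j in range(i + 1, n)
    ]
    return float(np.mean(pair_scores))

def universally_wrong_examples(preds: np.ndarray, gold: np.ndarray) -> np.ndarray:
    """Indices of examples that every instance labels incorrectly
    (the analogue of the 526 MNLI examples discussed above)."""
    return np.flatnonzero((preds != gold).all(axis=0))
```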

Out-of-distribution generalization
On HANS, performance was much more variable than on the MNLI development set. HANS consists of 6 main categories of examples, each of which can be further divided into 5 subcategories. Performance was reasonably consistent on five of these categories, but on the sixth category (lexical overlap examples that are inconsistent with the lexical overlap heuristic), performance varied dramatically, ranging from 5% accuracy to 55% accuracy (Figure 6). Since this is the most variable category, we focus on it for the rest of the analysis.
The category of lexical overlap examples that are inconsistent with the lexical overlap heuristic encompasses examples for which the correct label is non-entailment and for which all the words in the hypothesis also appear in the premise, but not as a contiguous subsequence. This category has five subcategories; examples and results for each subcategory are in Figure 5. Chance performance on HANS was 50%; on all subcategories except for passives, accuracies ranged from far below chance to modestly above chance. Models varied considerably even on categories that humans find simple. For example, accuracy on the subject-object swap examples, which can be handled with only rudimentary knowledge of syntax (in particular, the distinction between subjects and objects), ranged from 0% to 66%.

Results by subcase. Each row gives the minimum, maximum, mean, and standard deviation of accuracy across the 100 instances, followed by an example (→ indicates that the premise entails the hypothesis; ↛ indicates that it does not).

Subcases for which the correct label is entailment:

Lexical overlap heuristic:
- Untangling relative clauses: min 0.94, max 1.00, mean 0.98, std. dev. 0.01. The athlete who the judges saw called the manager. → The judges saw the athlete.
- Sentences with PPs: min 0.98, max 1.00, mean 1.00, std. dev. 0.00. The tourists by the actor called the authors. → The tourists called the authors.
- Sentences with relative clauses: min 0.97, max 1.00, mean 0.99, std. dev. 0.01. The actors that danced encouraged the author. → The actors encouraged the author.

Subsequence heuristic:
- PP on object: min 1.00, max 1.00, mean 1.00, std. dev. 0.00. The authors called the judges near the doctor. → The authors called the judges.

Constituent heuristic:
- Embedded under preposition: min 0.81, max 1.00, mean 0.96, std. dev. 0.02. Because the banker ran, the doctors saw the professors. → The banker ran.
- Outside embedded clause: min 1.00, max 1.00, mean 1.00, std. dev. 0.00. Although the secretaries slept, the judges danced. → The judges danced.
- Embedded under verb: min 0.93, max 1.00, mean 0.99, std. dev. 0.01. The president remembered that the actors performed. → The actors performed.
- Conjunction: min 1.00, max 1.00, mean 1.00, std. dev. 0.00. The lawyer danced, and the judge supported the doctors. → The lawyer danced.
- Adverbs: min 1.00, max 1.00, mean 1.00, std. dev. 0.00. Certainly the lawyers advised the manager. → The lawyers advised the manager.

Subcases for which the correct label is non-entailment:

Subsequence heuristic:
- Relative clause on subject: min 0.00, max 0.23, mean 0.07, std. dev. 0.04. The secretary that admired the senator saw the actor. ↛ The senator saw the actor.
- MV/RR: min 0.00, max 0.02, mean 0.00, std. dev. 0.00. The senators paid in the office danced. ↛ The senators paid in the office.
- NP/Z: min 0.02, max 0.13, mean 0.06, std. dev. 0.02. Before the actors presented the doctors arrived. ↛ The actors presented the doctors.

Constituent heuristic:
- Embedded under preposition: min 0.14, max 0.70, mean 0.41, std. dev. 0.12. Unless the senators ran, the professors recommended the doctor. ↛ The senators ran.
- Outside embedded clause: min 0.00, max 0.03, mean 0.00, std. dev. 0.01. Unless the authors saw the students, the doctors resigned. ↛ The doctors resigned.
- Embedded under verb: min 0.02, max 0.42, mean 0.17, std. dev. 0.08. The tourists said that the lawyer saw the banker. ↛ The lawyer saw the banker.
- Disjunction: min 0.00, max 0.03, mean 0.00, std. dev. 0.01. The judges resigned, or the athletes saw the author. ↛ The athletes saw the author.
- Adverbs: min 0.00, max 0.17, mean 0.06, std. dev. 0.04. Probably the artists saw the authors. ↛ The artists saw the authors.
Examples from two further lexical overlap subcategories (see Figure 5):

Passive: The student was stopped by the doctor. ↛ The student stopped the doctor.
Conjunction: The doctors saw the athlete and the judge. ↛ The athlete saw the judge.

Overall, although these models performed consistently on the in-distribution test set, they have nevertheless learned highly variable representations of syntax.

Discussion
We have found that models that differ only in their initial weights and the order of training examples can vary substantially in out-of-distribution linguistic generalization. We found this variation even with the vast majority of initial weights held constant (i.e., all the weights in the BERT component of the model). We conjecture that models might be even more variable if the pre-training of BERT were also redone across instances. These results underscore the importance of evaluating models on multiple restarts, as conclusions drawn from a single instance of a model might not hold across instances. Further, these results highlight the importance of evaluating out-of-distribution generalization; since all of our instances displayed similar in-distribution generalization, only their out-of-distribution generalization illuminates the substantial differences in what they have learned.
In stark contrast to the models we have looked at, which generalized in highly variable ways despite being trained on the same set of examples, humans tend to converge to similar linguistic generalizations despite major differences in the linguistic input that they encounter as children (Chomsky, 1965, 1980). This suggests that reducing the generalization variability of NLP models may help bring them closer to human performance in one major area where they still dramatically lag behind humans, namely out-of-distribution generalization.
How could the out-of-distribution generalization of models be made more consistent? The variability that we have observed likely reflects the presence of many local minima in the loss surface, all of which are equally attractive to our models; this makes a model's choice of minimum essentially arbitrary and easily affected by the initial weights and the order of training examples. To reduce this variability, then, one approach would be to use models with stronger inductive biases, which can help distinguish between the many local minima. An alternative approach would be to use training sets that better represent a wide range of linguistic phenomena, to decrease the probability of there being local minima that ignore certain phenomena.