The Sensitivity of Language Models and Humans to Winograd Schema Perturbations

Large-scale pretrained language models are the major driving force behind recent improvements in performance on the Winograd Schema Challenge, a widely employed test of commonsense reasoning ability. We show, however, with a new diagnostic dataset, that these models are sensitive to linguistic perturbations of the Winograd examples that minimally affect human understanding. Our results highlight interesting differences between humans and language models: language models are more sensitive to number or gender alternations and synonym replacements than humans, and humans are more stable and consistent in their predictions, maintain a much higher absolute performance, and perform better on non-associative instances than associative ones.


Introduction
Large-scale pre-trained language models have recently led to improvements across a range of natural language understanding (NLU) tasks (Devlin et al., 2019; Radford et al., 2019; Yang et al., 2019), but there is some scepticism that benchmark leaderboards do not represent the full picture (Kaushik and Lipton, 2018; Jumelet and Hupkes, 2018; Poliak et al., 2018). An open question is whether these models generalize beyond their training data samples.
In this paper, we examine how pre-trained language models generalize on the Winograd Schema Challenge (WSC).
Named after Terry Winograd, the WSC, in its current form, was proposed by Levesque et al. (2012) as an alternative to the Turing Test.

(Figure 1 example pair and its perturbation:)
(a) The man couldn't lift his son because he was so heavy. / The man couldn't lift his son because he was so weak.
(b) The men couldn't lift their sons because they were so heavy. / The men couldn't lift their sons because they were so weak.

The task takes the form of a binary reading comprehension test where a statement with two referents and a pronoun (or a possessive adjective) is given, and the correct antecedent of the pronoun must be chosen. Examples are chosen carefully to have a preferred reading, based on semantic plausibility rather than co-occurrence statistics. WSC examples come in pairs that are distinguished only by a discriminatory segment that flips the correct referent, as shown in Figure 1a. Levesque et al. define a set of qualifying criteria for instances and the pitfalls to be avoided when constructing examples (see §3.2). These combine to ensure an instance functions as a test of what they refer to as 'thinking' (or common sense reasoning).
Recent work has reported significant improvements on the WSC (Kocijan et al., 2019; Sakaguchi et al., 2019). As with many other NLU tasks, this improvement is primarily due to large-scale language model pre-training, followed by fine-tuning for the target task. We believe that further examination is warranted to determine whether these impressive results reflect a fundamental advance in reasoning ability, or whether our models have learned to simulate this ability in ways that do not generalize. In other words, do models learn accidental correlations in our datasets, or do they extract patterns that generalize in robust ways beyond the dataset samples?
In this paper, we conduct experiments to investigate this question. We define a set of lexical and syntactic variations and perturbations for the WSC examples and use altered examples (Figure 1b) to test models that have recently reported improved results. These variations and perturbations are designed to highlight the robustness of human linguistic and reasoning abilities and to test models under these conditions.
Contributions We introduce a new Winograd Schema dataset for evaluating generalization across seven controlled linguistic perturbations. We use this dataset to compare human and language model sensitivity to those perturbations, finding marked differences in model performance. We present a detailed analysis of the behaviour of the language models and how they are affected by the perturbations. Finally, we investigate the effect of fine-tuning with large task-specific datasets, and present an error analysis for all models.

Related Work
Probing datasets Previous studies have explored the robustness of ML models to different linguistic phenomena (Belinkov and Glass, 2019), e.g., by creating challenge datasets such as the one introduced here. When predicting subject-verb agreement, Linzen et al. (2016) found that inserting a relative clause hurt the performance of recurrent networks. (This contrasts with our results with Transformer-based architectures and is probably explained by memory loss in recurrent networks trained on short sequences. Similarly, Gulordava et al. (2018) tested whether a recurrent neural network can predict long-distance number agreement in various constructions, comparing natural and nonsensical sentences where RNNs cannot rely on semantic or lexical cues.) A large body of research has since emerged on probing pre-trained (masked) language models for linguistic structure (Goldberg, 2019; Hewitt and Manning, 2019; Lin et al., 2019; Clark et al., 2019) and analysing them via comparison to psycholinguistic and brain imaging data (Abnar et al., 2019; Ettinger, 2019; Abdou et al., 2019; Gauthier and Levy, 2019). Other recent work has attempted to probe these models for what is referred to as common sense or factual knowledge (Petroni et al., 2019; Feldman et al., 2019). Their findings show that these models do indeed encode such knowledge and can be used for knowledge base completion or common sense mining from Wikipedia. (Code and dataset can be found at: https://github.com/mhany90/enhanced_wsc/)
Clever Hans A considerable amount of work has also been devoted to what might be described as the Clever Hans effect. This work has aimed to quantify the extent to which models are learning what we expect them to, as opposed to leveraging statistical artifacts. This line of work has to date revealed significant problems (and some possible solutions to those problems) with reading comprehension datasets (Chen et al., 2016; Kaushik and Lipton, 2018), natural language inference datasets (Tsuchiya, 2018; Gururangan et al., 2018; Poliak et al., 2018; Belinkov et al., 2019a; McCoy et al., 2019), and the story cloze challenge (Schwartz et al., 2017), among others.
Winograd Schema Challenge Trinh and Le (2018) first proposed using neural language models for the WSC, achieving an accuracy of 63.7% using an ensemble of 14 language models. Ruan et al. (2019) and Kocijan et al. (2019) fine-tune BERT (Devlin et al., 2019) on the PDP (Rahman and Ng, 2012) and an automatically generated MaskedWiki dataset, reaching accuracies of 71.1% and 72.5% respectively. Meanwhile, Radford et al. (2019) report an accuracy of 70.7% without fine-tuning using the GPT-2 language model. Most recently, Sakaguchi et al. (2019) present an adversarial filtering algorithm which they use for crowd-sourcing a large corpus of WSC-like examples. Fine-tuning RoBERTa (Liu et al., 2019) on this, they achieve an accuracy of 90.1%.
In an orthogonal direction, Trichelair et al. (2018) presented a timely critical treatment of the WSC. They classified the dataset examples into associative and non-associative subsets, showing that the success of the LM ensemble of Trinh and Le (2018) mainly resulted from improvements on the associative subset. Moreover, they suggested switching the candidate referents (where possible) to test whether systems make predictions by reasoning about the "entirety of a schema" or by exploiting "statistical quirks of individual entities".
In a similar spirit, our work is a controlled study of robustness along different axes of linguistic variation. This type of study is rarely possible in NLP due to the large size of the datasets used and the focus on obtaining improved results on those datasets. As a carefully constructed dataset which is thought to require true natural language understanding, the WSC presents an ideal testbed for this investigation.

Perturbations
We define a suite of seven perturbations that can be applied to the WSC examples; Table 1 shows an example of each. The perturbations are: tense switch (TEN), number switch (NUM), gender switch (GEN), voice switch (VC), relative clause insertion (RC), adverbial qualification (ADV), and synonym/name substitution (SYN/NA).

Number switch (NUM) The referents and the pronoun of interest are pluralized, as in Figure 1b.

Voice switch (VC) Instances are rendered in the opposite voice. Sentences in the passive voice are known to be more difficult for humans to process (Olson and Filby, 1972; Feng et al., 2015).

Gender switch (GEN) Each of the referents in the sentence has their gender switched by replacing their names with other randomly drawn frequent English names of the opposite gender; 92% of the generated data involved a gender switch for a name. Though humans may be biased towards gender (Collins, 2011; Desmond and Danilewicz, 2010; Hoyle et al., 2019), the gender perturbation does not alter the correct referent.
Relative clause insertion (RC) A relative clause is inserted after the first referent. For each example, an appropriate clause was constructed by first choosing a template such as "who we had discussed" or "that is known for" from a preselected set of 19 such templates. An appropriate ending, such as "who we had discussed with the politicians", is then appended to the template depending on the semantics of the particular instance. Relative clauses impose an increased demand on working memory capacity, thereby making processing more difficult for humans (Just and Carpenter, 1992; Gibson, 1998).
Adverbial qualification (ADV) An adverb is inserted to qualify the main verb of each instance. When a conjunction is present, both verbs are modified. For instances with multiple sentences, all main verbs are modified.
Synonym/Name substitution (SYN/NA) Each of the two referents in an example is substituted with an appropriate synonym, or if it is a name, is replaced with a random name of the same gender from the same list of names used for the gender perturbation.
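The clause-insertion step of the RC perturbation can be sketched as follows. This is a minimal illustrative sketch, not the authors' actual implementation: the function name and the naive first-occurrence string replacement are our own simplifications.

```python
def insert_relative_clause(sentence, referent, template, ending):
    """Insert a relative clause (a preselected template plus an
    instance-specific ending) directly after the first referent.
    Naive sketch: assumes the referent string's first occurrence
    in the sentence is its referent position."""
    clause = f"{template} {ending}".strip()
    return sentence.replace(referent, f"{referent}, {clause},", 1)
```

For instance, applying the template "who we had discussed" with the ending "with the politicians" to the first referent of the Figure 1a example yields "The man, who we had discussed with the politicians, couldn't lift his son because he was so heavy."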

Human Judgments
We expect that humans are robust to these perturbations because they represent naturally occurring phenomena in language; we test this hypothesis by collecting human judgements for the perturbed examples using Amazon Mechanical Turk. The annotators are presented with each instance, with the pronoun of interest boldfaced and in red font, along with two options, one for each of the possible referents. They are then instructed to choose the most likely option, in exchange for $0.12. Following Sakaguchi et al. (2019), each instance is annotated by three annotators and the majority vote is taken; results are reported in §5. All three annotators agreed on the most likely option in 82-83% of the instances, except for gender, where full agreement was obtained for only 78% of the instances. See Appendix B for further annotation statistics, a sample of the template presented to annotators, and the restrictions applied to the pool of annotators. We did not require an initial qualification task to select participants.

Confounds and Pitfalls
Constructing WSC problems is known to be difficult. Indeed, the original dataset was carefully crafted by domain experts, and subsequent attempts at creating WSC-like datasets by non-experts, such as that of Rahman and Ng (2012), have produced examples which were found to be less challenging than the original dataset. Two likely pitfalls listed in Levesque et al. (2012) concern A) statistical preferences which make one answer more readily associated with the special discriminatory segment or other components of an example (this is termed Associativity, and described as non-Google-proofness in Levesque et al. (2012)), and B) instances whose intended reading is too ambiguous to be resolved with confidence.

To verify that the perturbed examples maintain the properties of the original problems with regards to pitfall A, we employ pointwise mutual information (PMI) to test the associativity of both the original and perturbed examples. PMI is known to be a reasonable measure of associativity (Church and Hanks, 1990) and, among a variety of measures, has been shown to correlate best with association scores from human judgements of contextual word association (Frassinelli, 2015). We compute unigram PMI on the two corpora used to train BERT (see Appendix C for details). Figure 2 shows the divergence of the perturbed examples from the original WSC dataset. We estimate divergence as the average difference in PMI between the correct (C) and incorrect (I) candidates: ∆ = pmi(c_j, x_j) − pmi(i_j, x_j), where x_j is either i) the discriminatory segment or ii) the full text of example j, and pmi(·, ·) is average unigram PMI. ∆ can be seen as a measure of whether the correct or incorrect candidate is a better 'associative fit' for either the discriminatory segment or the full context, which would make the examples trivial to resolve. Observe that this difference in PMI declines for the perturbed examples, showing that the perturbed examples do not increase in associativity.
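The ∆ measure can be sketched as follows. This is a minimal sketch assuming precomputed unigram and word-pair counts; the count tables and tokenization are hypothetical stand-ins for statistics extracted from the BERT training corpora.

```python
import math

def unigram_pmi(w, c, word_counts, pair_counts, total_pairs):
    """PMI(w, c) = log p(w, c) / (p(w) p(c)), estimated from corpus counts."""
    joint = pair_counts.get((w, c), 0) / total_pairs
    if joint == 0:
        return 0.0  # unseen pair: treat undefined PMI as no association
    pw = word_counts[w] / total_pairs
    pc = word_counts[c] / total_pairs
    return math.log(joint / (pw * pc))

def avg_pmi(candidate, segment, word_counts, pair_counts, total_pairs):
    """Average unigram PMI between candidate tokens and segment tokens."""
    scores = [unigram_pmi(w, c, word_counts, pair_counts, total_pairs)
              for w in candidate for c in segment]
    return sum(scores) / len(scores)

def delta(correct, incorrect, segment, word_counts, pair_counts, total_pairs):
    """Delta = pmi(c_j, x_j) - pmi(i_j, x_j) for one example; positive values
    mean the correct candidate is the better 'associative fit'."""
    return (avg_pmi(correct, segment, word_counts, pair_counts, total_pairs)
            - avg_pmi(incorrect, segment, word_counts, pair_counts, total_pairs))
```

Averaging delta over all examples in a (perturbed) set gives the per-perturbation divergence plotted in Figure 2.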
Confirming Solvability Three expert annotators are asked to solve the small subset of examples (99 in total across perturbations) which were annotated incorrectly by the majority vote of the Mechanical Turk workers. To address pitfall B, the expert annotators are asked to both attempt to solve the instances and indicate if they believe them to be too ambiguous to be solved. The majority vote of the annotators determines the preferred referent or whether an instance is ambiguous. Out of a total of 99 examples, 10 were found to be ambiguous. Of the remaining 89 examples, 67 were answered correctly by the majority vote. See Appendix D for more details.

Experimental Protocol
Our experiments are designed to test the robustness of language models to the Winograd Schema perturbations described in the previous section.
Evaluation Models are evaluated using two types of measures. The first is accuracy. For each of the perturbations, we report (a) the accuracy on the perturbed set (Perturbation accuracy), (b) the difference in accuracy between the perturbed set and the equivalent subset of the original dataset: ∆ Acc. = Perturbation accuracy − Original subset accuracy, and (c) Pair accuracy, defined as the number of pairs for which both examples in the pair are correctly answered, divided by the total number of pairs.
The second measure is stability, S: the proportion of perturbed examples for which the predicted referent is the same as the prediction on the corresponding original example,

S = (1/N) Σ_j 1[pred(p_j) = pred(o_j)],

where p_j is a perturbed example, o_j is its original counterpart, and N is the number of perturbed examples. Since the perturbations do not alter the correct referent, this provides a strong indication of robustness towards them.
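The four measures can be sketched directly from their definitions; the prediction records passed in are hypothetical.

```python
def accuracy(preds, gold):
    """Fraction of examples whose predicted referent matches the gold referent."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

def delta_acc(pert_preds, pert_gold, orig_preds, orig_gold):
    """Delta Acc. = Perturbation accuracy - Original subset accuracy."""
    return accuracy(pert_preds, pert_gold) - accuracy(orig_preds, orig_gold)

def pair_accuracy(pairs):
    """Fraction of WSC pairs with both members answered correctly;
    `pairs` holds ((pred1, gold1), (pred2, gold2)) tuples."""
    both = sum(p1 == g1 and p2 == g2 for (p1, g1), (p2, g2) in pairs)
    return both / len(pairs)

def stability(pert_preds, orig_preds):
    """S: proportion of perturbed examples whose prediction matches the
    prediction on the corresponding original example."""
    return sum(p == o for p, o in zip(pert_preds, orig_preds)) / len(orig_preds)
```

Note that stability compares a model against its own original predictions, not against the gold answers, so a model can be perfectly stable while being consistently wrong.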
Baseline We take the unigram PMI between candidates and discriminatory segments (see §3.2) as a baseline.We expect that this simple baseline will perform well for instances with a high level of associativity but not otherwise.
Language Models Our analysis is applied to three out-of-the-box language models (LMs): BERT (Devlin et al., 2019), ROBERTA (Liu et al., 2019), and XLNET (Yang et al., 2019). These models are considered to be the state of the art for the wide variety of natural language understanding tasks found in the GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) benchmarks. We use the large pre-trained publicly available models (Wolf et al., 2019).

Fine-tuned Language Models We also examine the effect of fine-tuning language models. BERT+WW uses BERT fine-tuned on the datasets of Kocijan et al. (2019), and ROBERTA+WG uses ROBERTA fine-tuned on WinoGrande (Sakaguchi et al., 2019).

Results and Analysis
Following the experimental protocol, we evaluate the three out-of-the-box language models and the two fine-tuned models on the original WSC and each of the perturbed sets. Table 2 shows Perturbation accuracy results for all models and contrasts them with human judgements and the PMI baseline.

Language Models
Humans maintain a much higher performance than out-of-the-box LMs across perturbations. The difference in accuracy between the perturbed and original examples, ∆ Acc., as defined in §4, is shown in Figure 4. A general trend of decreasing accuracy can be observed for both models and humans across the perturbations. This decline is on average comparable between models and humans, with a handful of exceptions. Taking the large gap in absolute accuracy into account, this result might be interpreted in two ways. If the comparison is made relative to the upper bound of performance, human performance has suffered the larger error increase. Alternatively, if we compare relative to the lower bound of performance, then the decline in the already low performance of language models is more meaningful, since 'there is not much more to lose'.
A more transparent view can be gleaned from the stability results shown in Table 3. Here it can be seen that the three out-of-the-box LMs are substantially more likely than humans to switch predictions due to the perturbations. Furthermore, we observe that the LMs are least stable under word-level perturbations like gender (GEN), number (NUM), and synonym or name replacement (SYN/NA), while humans appear to be most affected by sentence-level ones, such as relative clause insertion (RC) and the voice perturbation (VC).

Understanding Language Model Performance
To better understand the biases acquired through pre-training which are pertinent to this task, we consider a) a case of essential feature omission and b) the marginal cases where LMs are correct or incorrect by the largest margins, in both the original and perturbed datasets. We present the analysis for BERT, but similar findings hold for the other LMs.
Masking discriminatory segments results in identical sentence pairs, because these segments are the only part of a sentence that sets WSC pairs apart (see Figure 1a). To determine whether there is a bias in the selectional preference for one of the candidates over the other, we test BERT on examples where these discriminatory segments have been replaced with the MASK token. An unbiased model should be close to random selection, but BERT consistently prefers (by a margin of ∼25-30%) the candidate which appears second in the text over the one appearing first, for all perturbations except voice, where it prefers the first. This observation holds even when the two referents are inverted, which is possible for the 'switchable' subset of the examples as shown in Trichelair et al. (2018). This indicates that the selections are not purely semantic but also syntactic or structural, and it points towards BERT having a preference for referents in the object role. Detailed results are presented in Appendix F.
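The positional-preference check can be sketched as follows; the probability records are hypothetical stand-ins for BERT's scores on the masked examples.

```python
def second_referent_preference(probs_first, probs_second):
    """Fraction of masked examples where the second-mentioned referent
    receives a higher probability than the first. A value near 0.5 would
    indicate no positional bias; BERT's observed values are far from it."""
    prefers_second = sum(s > f for f, s in zip(probs_first, probs_second))
    return prefers_second / len(probs_first)
```

Because the discriminatory segments are masked, any deviation from 0.5 measured this way can only come from the positions and identities of the referents, not from the semantic content that distinguishes the pair.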
Marginal examples are found where the model assigns a much higher probability to one referent than the other. We extract the top 15% of examples where the correct candidate is preferred by the largest margin (P_correct ≫ P_incorrect) and the bottom 15% where the incorrect one is preferred (P_incorrect ≫ P_correct). Surprisingly, we find that there is a large overlap (50%-60%) between these two sets of examples, in both the original and the perturbed datasets. For the examples which are both the most correct and the most incorrect, BERT strongly prefers one of the candidates without considering the special discriminatory segment which flips the correct referent. Indeed, we find that the correlation between the probability assigned by BERT to a referent when it is the correct referent and when it is not is very strong and significant, with Spearman's ρ ≈ 0.75 across perturbations (see Appendix G for details). Consider the following pair:

(i) Alice looked for her friend Jade in the crowd. Since she always has good luck, Alice spotted her quickly.

(ii) Alice looked for her friend Jade in the crowd. Since she always wears a red turban, Alice spotted her quickly.

The first example gives P_correct ≫ P_incorrect by the largest margin, and its counterpart gives P_incorrect ≫ P_correct by the largest margin. In other words, the model assigns a much higher probability to Alice in both cases.

The effect of fine-tuning
The accuracy and stability results (Tables 2 and 3) indicate that fine-tuning makes language models more robust to the perturbations. ROBERTA+WG, in particular, is the most accurate and most stable model. While impressive, this is not entirely surprising: fine-tuning on task-specific datasets is a well-tested recipe for bias correction (Belinkov et al., 2019b). Indeed, these results provide evidence that it is possible to construct larger fine-tuning datasets whose distribution is correct for the WSC. We note that both fine-tuned models perform worst on the VC and RC perturbations, which may not frequently occur in the crowd-sourced datasets used for fine-tuning. To test this intuition, we apply a dependency parser, UDPipe (Straka et al., 2016), to the WinoGrande XL examples.

How do perturbations affect token probability distributions? To obtain a holistic view of the effect the perturbations have on LMs and fine-tuned LMs, we analyze the shift in the probability distribution (over the entire vocabulary) which a model assigns to a MASK token inserted in place of the pronoun of interest. We apply probability distribution truncation with a threshold of p = 0.9, as proposed in Holtzman et al. (2019), to filter out the uninformative tail of the distribution. Following this, we compute the Jensen-Shannon distance between this dynamically truncated distribution for an original example and each of its perturbed counterparts. Figure 5 shows the average of this measure over the subset of 128 examples which are common to all perturbations. Overall, we observe that large shifts in the distribution correspond to lower stability and accuracy scores, and that fine-tuned models exhibit lower shifts than their non-fine-tuned counterparts. The difference in shifts between out-of-the-box models and their fine-tuned counterparts is lower for the VC, RC, and ADV perturbations, meaning that when fine-tuned, the models' probability distributions are roughly as divergent for these perturbations as they were before fine-tuning. As in §5.2, we hypothesize that these examples are under-represented in the fine-tuning corpora; indeed, these results roughly correspond to the differences in ∆ Acc. from Figure 4.
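The truncation-then-distance computation can be sketched as follows. This is a sketch assuming the MASK-token distributions have already been extracted from a model; note that scipy's `jensenshannon` returns the distance, i.e. the square root of the divergence.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def nucleus_truncate(probs, p=0.9):
    """Keep the smallest set of highest-probability tokens whose cumulative
    mass reaches p (Holtzman et al., 2019); zero the tail and renormalize."""
    order = np.argsort(probs)[::-1]  # token indices by descending probability
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    kept = np.zeros_like(probs)
    kept[order[:cutoff]] = probs[order[:cutoff]]
    return kept / kept.sum()

def distribution_shift(p_orig, p_pert, p=0.9):
    """Jensen-Shannon distance between dynamically truncated MASK-token
    distributions of an original example and its perturbed counterpart."""
    return jensenshannon(nucleus_truncate(p_orig, p), nucleus_truncate(p_pert, p))
```

Averaging `distribution_shift` over the examples shared by all perturbed sets gives the per-perturbation values plotted in Figure 5.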
Further details about the number of examples excluded via the probability distribution truncation and other measures of the perturbations' effect can be found in Appendix G.

Error Analysis
Pair Accuracy Here we consider a more challenging evaluation setting where each WSC pair is treated as a single instance. Since the WSC examples are constructed as minimally contrastive pairs (Levesque et al., 2012), we argue that this is an appropriate standard of evaluation. Consider again the example in Figure 1a. It is reasonable to suppose that, for an answerer which truly 'understands' (Levesque et al., 2012), being able to link the concepts heavy and son in one of the resolutions is closely related and complementary to linking the concepts weak and man in the other. The results for this evaluation are shown in Figure 6. They show that human resolution of the problems exhibits greater complementarity compared to the language models; human pair accuracy (pair) is closer to perturbation accuracy (single) than is the case for the LMs. Furthermore, human performance on pair accuracy is more robust to perturbations when compared to the models. Indeed, the large gap between pair accuracy and perturbation accuracy raises some doubts about the performance of these models. However, ROBERTA+WG is a notable exception, showing near-human robustness to pair complementarity.
Associativity Next, we examine the effect of associativity on performance. Figure 7 shows accuracy results for all perturbations on the associative and non-associative subsets of the WSC, as labelled by Trichelair et al. (2018). We observe that the difference between the associative and non-associative subsets is much smaller for humans and that, unlike all language models, humans perform better on the non-associative subset than on the associative one. As expected, the PMI baseline does almost as well as the LMs on the associative subset, but it performs at chance level on the non-associative subset.

Conclusion
We presented a detailed investigation of the effect of linguistic perturbations on how language models and humans perform on the Winograd Schema Challenge. We found that, compared to out-of-the-box models, humans are significantly more stable under the perturbations; they answer non-associative examples with higher accuracy than associative ones, show sensitivity to WSC pair complementarity, and are more sensitive to sentence-level (as opposed to word-level) perturbations. In an analysis of the behaviour of language models, we observe that there is a preference for referents in the object role and that the models do not always consider the discriminatory segments of examples. Finally, we find that fine-tuning language models can lead to much-improved accuracy and stability. It remains an open question whether this task-specific approach to generalisation constitutes a true advancement in "reasoning". Fine-tuning a model on a rather large number of examples similar to the WSC leads to increased robustness, but this stands in stark contrast to humans, who are robust to the perturbations without having been exposed to similar examples in the past.

A Observations on original dataset
1. A few of the original examples were of unorthodox design: for instance, consider the pair:

(1) a. Look! There is a minnow swimming right below that duck! It had better get away to safety fast!
    b. Look! There is a shark swimming right below that duck! It had better get away to safety fast!

Here, instead of having a discriminatory segment select which of the two nouns could be the antecedent, one of the nouns is switched out with another.
2. Example 90 has a typo in the question where Kamchatka is spelled as 'Kamtchatka'.

B Human Judgements
Table 4 shows the proportion of instances for which all three annotators agreed and the average time required by annotators for the original examples and each of the perturbed datasets. Figure 8 shows the Amazon Mechanical Turk template used. The annotator pool was restricted to native speakers of English located in the United States who were classified by MTurk as 'masters' and had a HIT approval rate above 99%.

C Pointwise Mutual Information
We compute unigram Pointwise Mutual Information statistics using the Hyperwords package (Levy et al., 2015). If a corpus is split into a collection D of observed word-context pairs (w, c), with w ∈ W and c ∈ C, the PMI of a pair is estimated from counts as:

PMI(w, c) = log( p(w, c) / (p(w) p(c)) ) = log( #(w, c) · |D| / (#(w) · #(c)) )

D Confirming Solvability
Table 5 shows the breakdown by perturbation type of the expert annotations which were gathered for examples that were annotated incorrectly by the Mechanical Turk workers.

E Notes on construction of perturbed dataset
Tense switch (TEN) Examples 168-172 could not be changed while keeping the semantics of the instance intact.

Relative clause insertion (RC) The clause templates used were:
• "who we had discussed "
• "who he had discussed "
• "who she had discussed "
• "who you had discussed "
• "which we had seen "
• "which he had seen "
• "which she had seen "
• "which you had seen "
• "who we know from "
• "who he knows from "
• "who she knows from "
• "who you know from "
• "that is mentioned in "
• "that is located at "
• "that is close to "
• "that is known for "
• "which had been "
• "who you met "
• "that is "
• "which was put there "

Synonym/Name substitution (SYN/NA) No appropriate synonyms were found for tide and wind in examples 130 and 131.
Adverbial qualification (ADV) Two instances (95 and 96) in which the main verb was already modified were excluded.

F Referent preferences
Table 6 shows the percentage of examples in the switchable subset of the datasets where the second referent in the text was assigned a higher probability than the first, for both the original and reversed referent order.

G Effect of perturbations
Probability shift is defined as the difference in the probability of a candidate before and after a perturbation is applied. Figure 9 shows the difference in average probability shift between the correct candidates and the incorrect candidates for each of the models, per perturbation type. This provides a view that is meaningfully different from accuracy, as the probability of a candidate can shift without exceeding the threshold required to change a model's prediction. We find that there is a general trend of the incorrect candidates becoming more likely relative to the correct ones. This can be seen as confirming that, on average, nearly all perturbations make the problems more difficult for all models.
Hidden state representation distance is used to provide a more holistic view of the correspondence between the representations derived for the different perturbations. The analysis is conducted on the 128 examples which are common to all datasets. A representation is derived for each example by taking the max-pool of the hidden-state representations of a model's final layer. For each of the seven perturbations p, we compute the pairwise correlation distance between each pair of original and perturbed example representations, yielding a vector D_p ∈ R^128. The mean of D_p is then computed as an aggregate measure of the distance between the representations derived from a perturbation p and the original o. Figure 10 shows a plot of this for all perturbations for each of the models.
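The representation-distance computation can be sketched as follows; the hidden-state arrays are hypothetical, and scipy's `correlation` is 1 minus the Pearson correlation of the two vectors.

```python
import numpy as np
from scipy.spatial.distance import correlation

def example_representation(final_layer_states):
    """Max-pool a model's final-layer hidden states over the token axis;
    input shape: (num_tokens, hidden_dim)."""
    return final_layer_states.max(axis=0)

def mean_perturbation_distance(original_states, perturbed_states):
    """Mean pairwise correlation distance between original and perturbed
    example representations (the mean of the vector D_p in the text)."""
    d_p = [correlation(example_representation(o), example_representation(q))
           for o, q in zip(original_states, perturbed_states)]
    return float(np.mean(d_p))
```

Identical representations give a distance of 0, and the correlation distance is bounded above by 2, so per-perturbation means are directly comparable across models.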

H Candidate probability correlations
Figure 11 shows the average correlation between a candidate's probability when it is the correct referent and when it is not.

Figure 1 :
Figure 1: An example pair from the Winograd Schema Challenge (a) and its perturbation (b). The pronoun resolves to one of the two referents, depending on the choice of the discriminatory segment. The perturbation in (b) pluralizes the referents and the antecedents.

Figure 2 :
Figure 2: PMI divergence from the original WSC examples in average ∆ for each perturbation. Values below 0 indicate that the difference in PMI between the correct candidate and the incorrect one decreased.

Figure 3 :
Figure 3: Accuracy and stability scores (averaged across perturbations) for ROBERTA when fine-tuned on five increasing training split sizes.

Figure 4 :
Figure 4: ∆ Acc. results for all models across perturbations. Values below the x-axis indicate a decline in accuracy compared to the original dataset.

Figure 5 :
Figure 5: Jensen-Shannon distance between the original and perturbed examples when masking the pronoun of interest.

Figure 6:
Figure 6: Pair accuracy and Perturbation accuracy results. The latter are labeled as single.

Figure 7:
Figure 7: Accuracy results on the associative and non-associative subsets of the WSC for all perturbations.

Figure 8 :
Figure 8: Sample of Mturk template shown to annotators.

Figure 9 :
Figure 9: The difference between average probability shift for the correct and the incorrect referents per perturbation. Y-axis values above zero mean the correct referent became more likely on average after a perturbation, and vice versa.

Figure 10:
Figure 10: Pronoun hidden state representation distance from the original for each perturbation.

Figure 11:
Figure 11: The average correlation between a candidate's probability when it is the correct referent and when it is not.

Table 1 :
Examples from our dataset of the different perturbations applied to a WSC instance.
BERT+WW is fine-tuned on datasets consisting of 2.4M and 1322 examples (Kocijan et al., 2019), and ROBERTA+WG is fine-tuned on WinoGrande XL, which consists of 40,938 adversarially filtered examples (Sakaguchi et al., 2019). Both fine-tuned models have been reported by recent work to achieve significant improvements on the WSC.

Scoring To score the two candidate referents in each WSC instance we employ one of two mechanisms. The first, proposed in Trinh and Le (2018) and adapted to masked LMs by Kocijan et al. (2019), involves computing the probability of the two candidates c1 and c2, given the rest of the text in the instance, s. To accomplish this, the pronoun of interest is replaced with a number of MASK tokens corresponding to the number of tokens in each of c1 and c2. The probability of a candidate, p(c|s), is then computed as the average of the probabilities assigned by the model to the candidate's tokens, and the maximum-probability candidate is taken as the answer. This scoring method is used for all models except ROBERTA+WG.
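The masked-candidate scoring mechanism of Trinh and Le (2018), as adapted by Kocijan et al. (2019), can be sketched as follows. This shows only the masking and averaging steps; obtaining the per-token MASK probabilities from an actual masked LM is omitted, and the helper names are ours.

```python
def mask_pronoun(text, pronoun, num_candidate_tokens, mask_token="[MASK]"):
    """Replace the pronoun of interest with one MASK per candidate token.
    Naive first-occurrence replacement, for illustration only."""
    masks = " ".join([mask_token] * num_candidate_tokens)
    return text.replace(pronoun, masks, 1)

def candidate_probability(mask_token_probs):
    """p(c|s): average of the probabilities the model assigns to the
    candidate's tokens at the MASK positions."""
    return sum(mask_token_probs) / len(mask_token_probs)

def resolve(c1_probs, c2_probs):
    """Pick the maximum-probability candidate (0 for c1, 1 for c2)."""
    return 0 if candidate_probability(c1_probs) >= candidate_probability(c2_probs) else 1
```

Each candidate is scored against its own masked version of the sentence (one MASK per subword token of that candidate), so multi-token candidates are handled by averaging rather than by a single MASK probability.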

Table 2 :
Original dataset accuracy (ORIG) and Perturbation accuracy results for all models and humans. The penultimate column shows the average Perturbation accuracy results. The rightmost column shows the ∆ Acc. results, averaged over all perturbations.

Table 3 :
Stability results for all models and humans.
Applying UDPipe to the WinoGrande XL examples, we find that only ∼5% of the examples are in the passive voice and ∼6.5% contain relative clauses.

Table 5 :
Breakdown of solvability annotation counts by perturbation. Ambig. indicates the count of examples labeled as Ambiguous; Non-Ambig. is the number of remaining examples; Correct indicates how many of those were solved correctly.

Table 6:
Percentage of examples in the switchable subset where the second referent in the text was assigned a higher probability than the first, for both the original and reversed referent order.

Table 7 shows the average number of vocabulary items kept after nucleus sampling with p = 0.9 is applied.

Table 7 :
Average number of vocabulary items left after probability distribution truncation with p = 0.9 is applied.