Undersensitivity in Neural Reading Comprehension

Current reading comprehension methods generalise well to in-distribution test sets, yet perform poorly on adversarially selected data. Prior work on adversarial inputs typically studies model oversensitivity: semantically invariant text perturbations that cause a model’s prediction to change. Here we focus on the complementary problem: excessive prediction undersensitivity, where input text is meaningfully changed but the model’s prediction does not, even though it should. We formulate an adversarial attack which searches among semantic variations of the question for which a model erroneously predicts the same answer, and with even higher probability. We demonstrate that models trained on both SQuAD2.0 and NewsQA are vulnerable to this attack, and then investigate data augmentation and adversarial training as defences. Both substantially decrease adversarial vulnerability, which generalises to held-out data and held-out attack spaces. Addressing undersensitivity furthermore improves model robustness on the previously introduced ADDSENT and ADDONESENT datasets, and models generalise better when facing train / evaluation distribution mismatch: they are less prone to overly rely on shallow predictive cues present only in the training set, and outperform a conventional model by as much as 10.9% F1.

While semantically invariant text transformations can remarkably alter a model's predictions, the complementary problem of model undersensitivity is equally troublesome: a model's text input can often be drastically changed in meaning while retaining the original prediction. In particular, previous works (Feng et al., 2018; Ribeiro et al., 2018a; Welbl et al., 2020) show that even after deletion of all but a small fraction of input words, models often produce the same output. However, such reduced inputs are usually unnatural to a human reader, and it is unclear both what behaviour we should expect from natural language models evaluated on unnatural text, and how to use such unnatural inputs to improve models.
In this work we explore RC model undersensitivity for natural language questions, and we show that addressing undersensitivity not only makes RC models more sensitive where they should be, but also less reliant on shallow predictive cues. Fig. 1 shows an example for a BERT-Large model trained on SQuAD2.0 (Rajpurkar et al., 2018) that is given a text and a comprehension question, i.e. What was Fort Caroline renamed to after the Spanish attack?, which it correctly answers as San Mateo with 98% probability. Altering this question, however, can increase the model's probability for this same prediction to 99%, although the new question is unanswerable given the same context. That is, we observe an increase in probability despite removing relevant question information and replacing it with new, irrelevant content.
We formalise the process of finding such questions as an adversarial search in a discrete space arising from perturbations of the original question. There are two types of discrete perturbations we consider, based on part-of-speech tags and named entities, with the aim of obtaining grammatical and semantically consistent alternative questions that do not accidentally have the same correct answer. We find that SQuAD2.0 and NewsQA models can be attacked on a substantial proportion of samples.
The observed undersensitivity correlates negatively with in-distribution test set performance metrics (EM/F1), suggesting that this phenomenon, where present, is indeed a reflection of a model's lack of question comprehension. When training models to defend against undersensitivity attacks with data augmentation and adversarial training, we observe that they can generalise their robustness to held-out evaluation data without sacrificing in-distribution test set performance. Furthermore, the models improve on the adversarial datasets proposed by Jia and Liang (2017), and behave more robustly in a learning scenario that has dataset bias with a train / evaluation distribution mismatch, increasing performance by up to 10.9% F1.
List of Contributions: i) We propose a new type of adversarial attack exploiting the undersensitivity of neural RC models to input changes, and show that contemporary models are vulnerable to it; ii) We compare data augmentation and adversarial training as defences, and show their effectiveness at reducing undersensitivity errors on both held-out data and held-out perturbations without sacrificing nominal test performance; iii) We demonstrate that the resulting models generalise better on the adversarial datasets of Jia and Liang (2017), and in the biased data setting of Lewis and Fan (2019).

Related Work
Adversarial Attacks in NLP Adversarial examples have been studied extensively in NLP - see Zhang et al. (2019) for a recent survey. Yet automatically generating adversarial inputs is non-trivial, as altering a single word can change the semantics of an instance or render it incoherent. Prior work typically considers semantic-invariant input transformations to which neural models are oversensitive. For instance, Ribeiro et al. (2018b) use a set of simple perturbations such as replacing Who is with Who's. Other semantics-preserving perturbations include typos (Hosseini et al., 2017), the addition of distracting sentences (Jia and Liang, 2017; Wang and Bansal, 2018), character-level adversarial perturbations (Ebrahimi et al., 2018; Belinkov and Bisk, 2018), and paraphrasing (Iyyer et al., 2018b). In this work, we focus on the complementary problem of undersensitivity of neural RC models to semantic perturbations of the input. Our method is based on the idea that modifying, for instance, the named entities in a question can completely change its meaning and, as a consequence, the question should become unanswerable given the context. Our approach does not assume white-box access to the model, as do e.g. Ebrahimi et al. (2018) and Wallace et al. (2019).
Undersensitivity Jacobsen et al. (2019) demonstrated classifier undersensitivity in computer vision. Niu and Bansal (2018) investigated undersensitivity in dialogue models and addressed the problem with a max-margin training approach. Ribeiro et al. (2018a) describe a general model diagnosis tool to identify minimal feature sets that are sufficient for a model to form high-confidence predictions. Feng et al. (2018) showed that it is possible to reduce inputs to minimal input word sequences without changing a model's predictions. Welbl et al. (2020) investigated formal verification against undersensitivity to text deletions. We see our work as a continuation of these lines of inquiry, with a particular focus on undersensitivity in RC. In contrast to Feng et al. (2018) and Welbl et al. (2020), we consider concrete alternative questions, rather than arbitrarily reduced input word sequences, and furthermore address the observed phenomenon using dedicated training objectives, in contrast to Feng et al. (2018) and Ribeiro et al. (2018a), who simply highlight it. Gardner et al. (2020) and Kaushik et al. (2020) also recognise the problem of models learning shallow but successful heuristics, and propose counterfactual data annotation paradigms as prevention. The perturbations used in this work define such counterfactual samples; composing them does not require additional annotation effort, and we furthermore adopt an adversarial perspective on the choice of such samples. Finally, one of the methods we evaluate for defending against undersensitivity attacks is a form of data augmentation that has similarly been used for de-biasing NLP models (Lu et al., 2018). Concurrent work on model CHECKLIST evaluation (Ribeiro et al., 2020) includes an invariance test which also examines model undersensitivity. In contrast to CHECKLIST, our work focuses in more detail on the analysis of the invariance phenomenon, the automatic generation of probing samples, and an investigation of concrete methods to overcome undesirably invariant model behaviour, and shows that adherence to invariance tests leads to more robust model generalisation.

Unanswerability Rajpurkar et al. (2018) proposed the SQuAD2.0 dataset, which includes over 43,000 human-curated unanswerable questions. NewsQA is a second dataset with unanswerable questions, in the news domain (Trischler et al., 2017). Training on these datasets should conceivably result in models with an ability to tell whether questions are answerable; we will however see that this does not extend to adversarially chosen unanswerable questions. Hu et al. (2019) address unanswerability of questions from a given text using additional verification steps. Other approaches have shown the benefit of synthetic data for improving performance on SQuAD2.0 (Zhu et al., 2019; Alberti et al., 2019). In contrast to prior work, we demonstrate that despite improved performance on test sets that include unanswerable questions, the problem persists when adversarially choosing from a larger space of questions.

Methodology
Problem Overview Consider a discriminative model f_θ, parameterised by a collection of vectors θ, which transforms an input x into a prediction ŷ = f_θ(x). In our task, x = (t, q) is a given text t paired with a question q about this text. The label y is the answer to q where it exists, or a NoAnswer label where it cannot be answered. In an RC setting, the set of possible answers is large, and predictions ŷ should be dependent on x. And indeed, randomly choosing a different input (t′, q′) usually changes the model prediction ŷ. However, there exist many examples where the prediction erroneously remains stable; the goal of the attack formulated here is to find such cases. Formally, the goal is to discover inputs x′ for which f_θ still erroneously predicts f_θ(x′) = f_θ(x), even though x′ is not answerable from the text.
Identifying suitable candidates for x′ can be achieved in many ways. One approach is to search among a large question collection, but we find this to only rarely be successful; an example is shown in Table 8, Appendix E. Generating x′, on the other hand, is prone to result in ungrammatical or otherwise ill-formed text. Instead, we consider a perturbation space X_T(x) spanned by perturbing original inputs x using a perturbation function family T:

X_T(x) = { x′ | x′ = T(x), T ∈ T }

Ideally the transformation function family T is chosen such that the correct label of these inputs is changed, and the question becomes unanswerable. We will later search within X_T(x) to find inputs x′ which erroneously retain the same prediction as x: ŷ(x) = ŷ(x′).
Part-of-Speech (PoS) Perturbations We first consider the perturbation space X_{T_P}(x) with PoS perturbations T_P of the original question: we swap individual tokens with other, PoS-consistent alternative tokens, drawing from large collections of tokens of the same PoS types. For example, we might alter the question "Who patronized the monks in Italy?" to "Who betrayed the monks in Italy?" by replacing the past-tense verb "patronized" with "betrayed". There is, however, no guarantee that the altered question will require a different answer (e.g. due to synonyms). Even more so, there might be type clashes or other semantic inconsistencies. We perform a qualitative analysis to investigate the extent of this problem and find that, while a valid concern, for the majority of attackable samples (at least 51%) there exist attacks based on correct, well-formed questions (see Section 5).
Named Entity (NE) Perturbations The space X_{T_E}(x) of the transformation family T_E is created by substituting NE mentions in the question with different type-consistent NEs, drawn from a large set E. For example, the question "Who patronized the monks in Italy?" could be altered to "Who patronized the monks in Las Vegas?", replacing the geopolitical entity "Italy" with "Las Vegas", chosen from E. Altering NEs often changes the question specifics and alters the answer requirements, which are unlikely to be satisfied by what is stated in the given text, given the broad nature of entities in E. While perturbed questions are not guaranteed to be unanswerable or to require a different answer, we will see in a later analysis that for the large majority of cases (at least 84%) they do.
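To make this perturbation step concrete, the following is a minimal sketch of type-consistent NE substitution using a pretrained spaCy pipeline (the paper uses spaCy's taggers); the toy entity collection E, the function name, and the sampling budget are illustrative, and the PoS variant is analogous, matching token.tag_ instead of entity labels:

```python
import random
import spacy

nlp = spacy.load("en_core_web_sm")  # pretrained spaCy tagger/NER

def ne_perturbations(question, entity_sets, eta=32, rng=random):
    """Return perturbed questions in which one NE mention is swapped for a
    randomly drawn, type-consistent entity from `entity_sets`
    (a mapping from NE label, e.g. 'GPE', to surface forms)."""
    doc = nlp(question)
    perturbed = []
    for ent in doc.ents:
        candidates = entity_sets.get(ent.label_, [])
        for repl in rng.sample(candidates, min(eta, len(candidates))):
            if repl.lower() == ent.text.lower():
                continue  # skip no-op replacements
            perturbed.append(question[:ent.start_char] + repl + question[ent.end_char:])
    return perturbed

# Hypothetical toy entity collection; the paper harvests thousands of
# entities per type from SQuAD2.0 training paragraphs.
E = {"GPE": ["Italy", "Las Vegas", "Tuvalu"]}
print(ne_perturbations("Who patronized the monks in Italy?", E))
```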
Undersensitivity Attacks Thus far we have described methods for perturbing questions. We will search in the resulting perturbation spaces X_{T_P}(x) and X_{T_E}(x) for inputs x′ for which the model prediction remains constant. However, we pose a slightly stronger and more conservative requirement to rule out cases where the prediction is retained, but with lower probability: f_θ should assign a higher probability to the same prediction ŷ(x) = ŷ(x′) than for the original input:

P(ŷ | x′) > P(ŷ | x)    (1)

To summarise, we are searching in a perturbation space for altered questions which lead the model to assign a higher probability to the same answer as for the original input question. If we have found an altered question that satisfies inequality (1), then we have identified a successful attack, which we will refer to as an undersensitivity attack.

Adversarial Search in Perturbation Space In its simplest form, a search for an adversarial attack in the previously defined spaces amounts to a search over a list of single lexical alterations for the maximum (or any) higher prediction probability. We can however repeat the replacement procedure multiple times, arriving at texts with larger lexical distance to the original question. For example, in two iterations of PoS-consistent lexical replacement, we can alter "Who was the duke in the battle of Hastings?" to inputs like "Who was the duke in the expedition of Roger?". The space of possibilities grows combinatorially, and with increasing perturbation radius it becomes computationally infeasible to comprehensively cover the full perturbation space arising from iterated substitutions. To address this, we follow Feng et al. (2018) and apply beam search to narrow the search space, and seek to maximise the difference Δ = P(ŷ | x′) − P(ŷ | x). Beam search is conducted up to a pre-specified maximum perturbation radius ρ, but once an x′ with Δ > 0 has been found, we stop the search.
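A minimal sketch of this search procedure follows; `prob_of` and `perturb` are assumed callables (not from the paper's code) that return, respectively, the model's probability for the original prediction ŷ under an altered question, and a set of single-token, type-consistent alterations of a question:

```python
def undersensitivity_attack(question, p_orig, prob_of, perturb, rho=6, beam=5):
    """Beam search over iterated perturbations for an altered question x'
    with P(y_hat | x') > P(y_hat | x), i.e. inequality (1).
    Candidates are ranked by probability, which is equivalent to
    maximising Delta = P(y_hat | x') - P(y_hat | x)."""
    frontier = [(p_orig, question)]
    for _ in range(rho):  # perturbation radius: depth of iterated substitution
        candidates = []
        for _, q in frontier:
            for q_new in perturb(q):
                p_new = prob_of(q_new)
                if p_new > p_orig:  # Delta > 0: successful attack, stop early
                    return q_new, p_new
                candidates.append((p_new, q_new))
        if not candidates:
            break
        frontier = sorted(candidates, reverse=True)[:beam]  # keep top-`beam`
    return None  # no attack found within the search budget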
Relation to Attacks in Prior Work Note that this type of attack stands in contrast to other attacks based on small, semantically invariant input perturbations which investigate oversensitivity problems. Such semantic invariance comes with stronger requirements and relies on synonym dictionaries (Ebrahimi et al., 2018) or paraphrases harvested from back-translation (Iyyer et al., 2018a), which are both incomplete and noisy. Our attack is instead focused on undersensitivity, i.e. where the model is stable in its prediction even though it should not be. Consequently the requirements are not as difficult to fulfil when defining perturbation spaces that alter the question meaning, and one can rely on sets of entities and PoS examples automatically extracted from a large text collection.

Experiments: Model Vulnerability
Training and Dataset Details We next conduct experiments using the attacks laid out above to investigate model undersensitivity. We attack the BERT model fine-tuned on SQuAD2.0, and measure to what extent the model exhibits undersensitivity to adversarially chosen inputs. Our choice of BERT is motivated by the currently widespread adoption of its variants across the NLP field, and its empirical success across a wide range of datasets. SQuAD2.0 by design contains unanswerable questions; models are thus trained to predict a NoAnswer option where a comprehension question cannot be answered. As the test set is unavailable, we split off 5% from the original training set for development purposes and retain the remaining 95% for training, stratified by articles. The original SQuAD2.0 development set is then used as evaluation data, where the model reaches 73.0% EM and 76.5% F1; we will compute the undersensitivity attacks on this entirely held-out part of the dataset. Appendix B provides further training details.
Attack Details To compute the perturbation spaces, we gather large collections of NE and PoS expressions across types that define the perturbation families T_E and T_P. These are gathered from the Wikipedia paragraphs used in the SQuAD2.0 training set, with the pretrained taggers in spaCy, using the Penn Treebank tag set for PoS. This results on average in 5,126 different entities per entity type, and 2,337 different tokens per PoS tag. When computing PoS perturbations, we found it useful to disregard perturbations of particular PoS types that often led to only minor changes or incorrectly formed expressions, such as punctuation or determiners; details can be found in Appendix A. As the number of possible perturbations to consider is potentially very large, we limit the beam search at each step to a maximum of η randomly chosen type-consistent entities from E, or tokens from P, and re-sample these throughout the search. We use a beam width of b = 5, resulting in a bound on the total computation spent on adversarial search of b · ρ · η model evaluations per sample, where ρ is the perturbation "radius" (maximum search depth).
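As a concrete instance of this bound: with b = 5 and, e.g., η = 256 and ρ = 6 (the setting used in the qualitative analysis below), the search performs at most b · ρ · η = 5 · 6 · 256 = 7,680 model evaluations per sample, and typically fewer, since it stops as soon as an x′ with Δ > 0 is found.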
We quantify vulnerability to the described attacks by measuring the fraction of evaluation samples for which at least one undersensitivity attack is found given a computational search budget, disregarding cases where a model predicts NoAnswer (altering such samples likely retains their unanswerability).
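As a minimal sketch (names illustrative), this adversarial error rate can be computed as:

```python
def adversarial_error_rate(samples):
    """Fraction of evaluation samples with at least one successful attack.
    `samples` holds (predicted_no_answer, attack_found) boolean pairs;
    samples the model already answers with NoAnswer are excluded."""
    eligible = [attacked for no_answer, attacked in samples if not no_answer]
    return sum(eligible) / len(eligible)
```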
Results Fig. 2 shows plots of adversarial error rates on SQuAD2.0 for both perturbation types. We observe that attacks based on PoS perturbations reach more than 60% attack success rates already for very small search budgets (η = 32, ρ = 1), and this number can be raised to 95% with a larger computational budget. For perturbations based on NE substitution, we find overall lower attack success rates, but still more than half of the samples can successfully be attacked with the budgets tested. Note that where attacks were found, we observed that there often exist multiple alternatives satisfying inequality (1).
These findings demonstrate that BERT is not necessarily considering the entire content of a comprehension question given to it, and that even though trained to tell when questions are unanswerable, the model often fails when facing adversarially selected unanswerable questions.
In a side experiment we also investigated undersensitivity attacks using NE perturbations on SQuAD1.1, which proves even more vulnerable, with an adversarial error rate of 70% already at η = 32; ρ = 1 (compared to 34% on SQuAD2.0). While this demonstrates that undersensitivity is also an issue for SQuAD1.1, behaviour on unanswerable questions is not well-defined for models trained without them, rendering the results difficult to interpret. On the other hand, the notable drop between the datasets demonstrates the effectiveness of the unanswerable questions added during training in SQuAD2.0.

Analysis of Vulnerable Samples
Qualitative Analysis of Attacks The attacks are potentially noisy, and the introduced substitutions are by no means guaranteed to result in semantically meaningful and consistent expressions, or to require a different answer than the original. To gauge the extent of this we inspect 100 successful attacks conducted at ρ = 6 and η = 256 on SQuAD2.0, both for PoS and NE perturbations. We label them as either: i) Having a syntax error (e.g. What would platform lower if there were fewer people?). These are mostly due to cascading errors stemming from incorrect NE / PoS tag predictions. ii) Semantically incoherent (Who built the monks?). iii) Questions that require the same correct answer as the original, e.g. due to a paraphrase. iv) Valid attacks: perturbed questions that would either demand a different answer or are unanswerable given the text (e.g. When did the United States / Tuvalu withdraw from the Bretton Woods Accord?). Table 1 shows several example attacks along with their annotations, and Table 2 summarises the respective proportions:

                          PoS    NE
Valid attack              51%    84%
Syntax error              10%     6%
Semantically incoherent   24%     5%
Same answer               15%     5%

We observe that a non-negligible portion of questions has some form of syntax error or incoherent semantics, especially for PoS perturbations. Questions with the identical correct answer are comparatively rare. Finally, about half (51%) of the attacks for PoS, as well as 84% for NE, are valid questions that should either have a different answer or are unanswerable.
Overall the NE perturbations result in cleaner questions than PoS perturbations, which suffer from semantic inconsistencies in about a quarter (24%) of the cases. While these questions have some sort of inconsistency (e.g. What year did the case go before the supreme court? vs. a perturbed version What scorer did the case go before the supreme court?), it is remarkable that the model assigns higher probabilities to the original answer even when faced with incoherent questions, casting doubt on the extent to which the replaced question information is used to determine the answer. Since NE-based attacks have a substantially larger fraction of valid, well-posed questions, we will focus our study on these for the remainder of this paper.
Characterising Successfully Attacked Samples We observe that models are vulnerable to undersensitivity adversaries, yet not all samples are successfully attacked. This raises the question of what distinguishes samples that can and cannot be attacked. We thus examine various characteristics, aiming to understand the causes of model vulnerability.
First, successfully attacked questions produce lower original prediction probabilities: on average 72.9%, vs. 83.8% for unattackable questions. That is, there exists an inverse link between a model's original prediction probability and a sample's vulnerability to an undersensitivity attack. The adversarially chosen questions had an average probability of 78.2% - a notable gap to the original questions. It is worth noting that the search halted once a single question with higher probability was found; continuing the search would increase the respective probabilities further.
Attackable questions have on average 12.3 tokens, whereas unattackable ones are slightly shorter with on average 11.1 tokens.
Next we considered the distribution of different question types (What, Who, When, ...) for both attackable and unattackable samples and did not observe notable differences apart from the single most frequent question type, What; it is a lot more prevalent among the unattacked questions (56.4%) than among successfully attacked questions (42.1%). This is by far the most common question type, and furthermore one that is comparatively open-ended and does not prescribe particular type expectations for its answer, as, e.g., a Where question requires a location. A possible explanation for the prevalence of What questions among unsuccessfully attacked samples is that the model cannot rely on type constraints alone to arrive at its predictions (Sugawara et al., 2018), and is thus less prone to such exploitation - see Section 6 for a more in-depth analysis.

Finally, Figure 3 shows a histogram of the 10 most common NE tags appearing in unsuccessfully attacked samples, and the corresponding fraction of replaced entities in successfully attacked samples. Apart from one exception, the distributions are remarkably similar: undersensitivity can be induced for a variety of entity types used in the perturbation. Notably, questions with geopolitical entities (GPE) are particularly error-prone. A possible explanation is provided by observations regarding (non-contextualised) word vectors, which cluster geopolitical entities (e.g. countries) together, thus making them harder to distinguish for a model operating on these embeddings (Mikolov et al., 2013).

Robustness to Undersensitivity Attacks
We will now investigate methods for mitigating excessive model undersensitivity. Prior work has considered both data augmentation and adversarial training for more robust models; we will conduct experiments with both. Adding a robustness objective can negatively impact standard test metrics (Tsipras et al., 2019), and it should be noted that there exists a natural trade-off between performance on one particular test set and performance on a dataset of adversarial inputs. We perform data augmentation and adversarial training by adding a corresponding loss term to the log-likelihood training objective:

L_Total = L_llh(Ω) + λ · L_llh(Ω′)

where Ω is the standard training data, fit with a discriminative log-likelihood objective, Ω′ is either a set of augmentation data points or of successful adversarial attacks where they exist, and λ ∈ R+ is a hyperparameter. In data augmentation, we randomly sample perturbed input questions, whereas in adversarial training we perform an adversarial search to identify them. In both cases, alternative data points in Ω′ are fit to a NULL label representing the NoAnswer prediction, again using a log-likelihood objective. We continuously update Ω′ throughout training to reflect adversarial samples with respect to the current model. We conduct experiments on both SQuAD2.0 and NewsQA; details on training and hyperparameters can be found in the appendix.
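A minimal PyTorch sketch of this combined objective, assuming the common convention of encoding NoAnswer as the [CLS] span (position 0) in a BERT-style QA head; the paper's exact NULL representation and batching are not specified, so names and shapes here are illustrative:

```python
import torch.nn.functional as F

def qa_nll(start_logits, end_logits, start_pos, end_pos):
    """Span negative log-likelihood for a BERT-style QA head; NoAnswer is
    represented by start_pos = end_pos = 0 (the [CLS] token)."""
    return F.cross_entropy(start_logits, start_pos) + F.cross_entropy(end_logits, end_pos)

def total_loss(orig_batch, pert_batch, lam):
    """L_Total = L_llh(Omega) + lambda * L_llh(Omega'). Each batch is a tuple
    (start_logits, end_logits, start_pos, end_pos); the perturbed batch
    carries NULL span labels throughout, since its questions are unanswerable."""
    return qa_nll(*orig_batch) + lam * qa_nll(*pert_batch)
```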
Experimental Outcomes Results for these experiments can be found in Table 3 for the two datasets, respectively. First, we observe that both data augmentation and adversarial training substantially reduce the number of undersensitivity errors the model commits, consistently across adversarial search budgets and across the two datasets. This demonstrates that both training methods are effective defences and can mitigate - but not eliminate - the model's undersensitivity problem. Notably, the improved robustness - especially for data augmentation - comes without sacrificing performance in the overall standard metrics EM and F1; even slight improvements are possible. Second, data augmentation is a more effective defence training strategy than adversarial training. This holds true both for standard and adversarial metrics and hints at potential adversarial overfitting.
Finally, a closer inspection of how performance changes on answerable (HasAns) vs. unanswerable (NoAns) samples of the datasets reveals that models with modified training objectives show improved performance on unanswerable samples, while sacrificing some performance on answerable samples (the NoAns threshold is tuned on the respective validation sets). This suggests that the trained models - even though similar in standard metrics - evolve on different paths during training: the modified objectives prioritise unanswerable questions to a higher degree.
Evaluation on Held-Out Perturbation Spaces In Table 3 results are computed using the same perturbation spaces also used during training. These perturbation spaces are relatively large, and questions are about a disjoint set of articles at evaluation time. Nevertheless there is the potential of overfitting to the particular perturbations used during training. To measure the extent to which the defences generalise to new, held-out sets of perturbations, we assemble a new, disjoint perturbation space of identical size per NE tag as those used during training, and evaluate models on attacks with respect to these perturbations. Named entities are chosen from English Wikipedia using the same method as for the training perturbation spaces, and chosen such that they are disjoint from the training perturbation space. We then execute adversarial attacks using these new attack spaces on the previously trained models, and find that both the vulnerability rates of the standard model and the relative defence success transfer to the new attack spaces. For example, with η = 256 we observe vulnerability ratios of 51.7%, 20.7%, and 23.8% on SQuAD2.0 for standard training, data augmentation, and adversarial training, respectively. Detailed results for different values of η, as well as for NewsQA, can be found in Appendix D.
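As a sketch of how such a disjoint, size-matched space might be assembled (the exact filtering procedure is not specified in the paper; `wiki_entities` is an assumed fresh harvest from English Wikipedia):

```python
def held_out_space(wiki_entities, train_space):
    """Build a held-out perturbation space disjoint from the training one,
    with identical size per NE tag. Both arguments map NE tags
    (e.g. 'GPE') to lists of entity surface forms."""
    held_out = {}
    for tag, train_ents in train_space.items():
        seen = set(train_ents)
        fresh = [e for e in wiki_entities.get(tag, []) if e not in seen]
        held_out[tag] = fresh[:len(train_ents)]  # identical size per tag
    return held_out
```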
Generalisation in a Biased Data Setting Datasets for high-level NLP tasks often come with annotation and selection biases; models then learn to exploit shortcut triggers which are dataset-specific but not task-specific (Gururangan et al., 2018). For example, a model might be confronted with question/paragraph pairs which only ever contain one type-consistent answer span, e.g. a text mentioning a single number, paired with a How many...? question. It is then sufficient to learn to pick out numbers from text to solve the task, irrespective of other information given in the question. Such a model might then have trouble generalising to articles that mention several numbers, as it never learned that it is necessary to take into account other relevant question information to determine the correct answer.
We test models in such a scenario: a model is trained on SQuAD1.1 questions with paragraphs containing only a single type-consistent answer expression for either a person, date, or numerical answer. At test time, we present it with question/article pairs of the same respective question types, but now there are multiple possible type-consistent answers in the paragraph. We obtain such data from Lewis and Fan (2019), who first described this biased data scenario. Previous experiments on this dataset were conducted without a dedicated development set, so while using the same training data, we split the test set with a 40/60% split into development and test data (cf. Appendix G). We then test both a vanilla fine-tuned BERT-Base transformer model, and a model trained to be less vulnerable to undersensitivity attacks using data augmentation. Finally, we perform a control experiment, where we join and shuffle all data points from train/dev/test (of each question type, respectively), and split the dataset into new parts of the same size as before, which now follow the same data distribution (w/o data bias setting). Table 4 shows the results. In this biased data scenario we observe a marked improvement across metrics and answer type categories when a model is trained with unanswerable samples. This demonstrates that the negative training signal stemming from related, but unanswerable, questions counterbalances the signal from answerable questions in such a way that the model learns to better take into account relevant information in the question. This allows it to correctly distinguish among several type-consistent answer possibilities in the text, which the standard BERT-Base model does not learn well.

Evaluation on Adversarial SQuAD We next evaluate BERT-Large and BERT-Large + Augmentation Training on ADDSENT and ADDONESENT, which contain adversarially composed samples (Jia and Liang, 2017). Our results, summarised in Table 5, show that including altered samples during the training of BERT-Large improves EM/F1 scores by 2.7/4.3 and 0.1/1.6 points on the two datasets, respectively.
Transferability of Attacks We train a RoBERTa model on SQuAD2.0, and conduct undersensitivity attacks (ρ = 6; η = 256). Notably, the resulting undersensitivity error rates are considerably lower for RoBERTa (34.5%) than for BERT (54.7%). Interestingly, when considering only those samples where RoBERTa was found vulnerable, BERT has an undersensitivity error rate of 90.7%. That is, the same samples tend to be vulnerable to undersensitivity attacks in both models. Even more, we find that concrete adversarial inputs x′ selected with RoBERTa transfer when evaluating them on BERT for 17.5% of samples (i.e. they satisfy inequality (1)).
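A sketch of this transfer check, under the assumption that `prob_of_bert(q)` returns BERT's probability for its own original prediction when given question q (an illustrative interface, as in the earlier sketches):

```python
def transfer_rate(attacks, prob_of_bert):
    """`attacks` holds (original_question, adversarial_question) pairs found
    against RoBERTa; an attack transfers to BERT if inequality (1) also
    holds under BERT's probabilities for its own original prediction."""
    transferred = [prob_of_bert(adv) > prob_of_bert(orig) for orig, adv in attacks]
    return sum(transferred) / len(transferred)
```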

Conclusion
We have investigated undersensitivity: a problematic behaviour of RC models, where they are overly stable in their predictions when given semantically altered questions. A model's robustness to undersensitivity attacks can be drastically improved with appropriate defences without sacrificing nominal performance, and the resulting models become more robust also in other adversarial data settings.

B Appendix: Training Details

Perturbations are used for the defence methods, and adversarial attacks with η = 32 and ρ = 1 in adversarial training. Where no attack is found for a given question sampled during SGD training, we instead consider a different sample from the original training data. We evaluate the model on its validation data every 5,000 steps and perform early stopping with a patience of 5.
NewsQA Following the experimental protocol for SQuAD, we further test a BERT-Base model on NewsQA, which - like SQuAD2.0 - contains unanswerable questions. As annotators often do not fully agree on a single annotation in NewsQA, we opt for a conservative choice and filter the dataset such that only samples with the same majority annotation are retained, following the preprocessing pipeline of Talmor and Berant (2019).

D Appendix: Generalisation to Held-out Perturbations
Vulnerability results for new, held-out perturbation spaces, disjoint from those used during training, can be found in Table 6 for SQuAD2.0, and in Table 7 for NewsQA.

E Appendix: Adversarial Example from a Question Collection
Searching in a large collection of (mostly unrelated) natural language questions, e.g. among all questions in the SQuAD2.0 training set, yields several cases where the model's prediction probability increases compared to the original question; see Table 8 for one such example. These cases are however relatively rare, and we found the yield of this type of search to be very low.
F Appendix: Attack Examples

Table 9 shows more examples of successful adversarial attacks on SQuAD2.0.

G Appendix: Biased Data Setup
For completeness and direct comparability, we also include an experiment with the data setup of Lewis and Fan (2019) (not holding aside a dedicated validation set). Results can be found in Table 10. We again observe improvements in the biased data setting. Furthermore, the robust model outperforms GQA (Lewis and Fan, 2019) in two of the three subtasks.
H Appendix: Vulnerability Analysis on NewsQA