Adversarial Examples for Evaluating Reading Comprehension Systems

Standard accuracy metrics indicate that reading comprehension systems are making rapid progress, but the extent to which these systems truly understand language remains unclear. To reward systems with real language understanding abilities, we propose an adversarial evaluation scheme for the Stanford Question Answering Dataset (SQuAD). Our method tests whether systems can answer questions about paragraphs that contain adversarially inserted sentences, which are automatically generated to distract computer systems without changing the correct answer or misleading humans. In this adversarial setting, the accuracy of sixteen published models drops from an average of 75% F1 score to 36%; when the adversary is allowed to add ungrammatical sequences of words, average accuracy on four models decreases further to 7%. We hope our insights will motivate the development of new models that understand language more precisely.


Introduction
Quantifying the extent to which a computer system exhibits intelligent behavior is a longstanding problem in AI (Levesque, 2013). Today, the standard paradigm is to measure average error across a held-out test set. However, models can succeed in this paradigm by recognizing patterns that happen to be predictive on most of the test examples, while ignoring deeper, more difficult phenomena (Rimell et al., 2009;Paperno et al., 2016).
In this work, we propose adversarial evaluation for NLP, in which systems are instead evaluated on adversarially-chosen inputs. We focus on the Figure 1: An example from the SQuAD dataset. The BiDAF Ensemble model originally gets the answer correct, but is fooled by the addition of an adversarial distracting sentence (in blue).
SQuAD reading comprehension task (Rajpurkar et al., 2016), in which systems answer questions about paragraphs from Wikipedia. Reading comprehension is an appealing testbed for adversarial evaluation, as existing models appear successful by standard average-case evaluation metrics: the current state-of-the-art system achieves 84.7% F1 score, while human performance is just 91.2%. 1 Nonetheless, it seems unlikely that existing systems possess true language understanding and reasoning capabilities.
Carrying out adversarial evaluation on SQuAD requires new methods that adversarially alter reading comprehension examples. Prior work in computer vision adds imperceptible adversarial perturbations to input images, relying on the fact that such small perturbations cannot change an image's true label (Szegedy et al., 2014;Goodfellow et al., 2015). In contrast, changing even one word of a paragraph can drastically alter its meaning. Instead of relying on semantics-preserving perturbations, we create adversarial examples by adding distracting sentences to the input paragraph, as shown in Figure 1. We automatically generate these sentences so that they confuse models, but do not contradict the correct answer or confuse humans. For our main results, we use a simple set of rules to generate a raw distractor sentence that does not answer the question but looks related; we then fix grammatical errors via crowdsourcing. While adversarially perturbed images punish model oversensitivity to imperceptible noise, our adversarial examples target model overstabilitythe inability of a model to distinguish a sentence that actually answers the question from one that merely has words in common with it.
Our experiments demonstrate that no published open-source model is robust to the addition of adversarial sentences. Across sixteen such models, adding grammatical adversarial sentences reduces F1 score from an average of 75% to 36%. On a smaller set of four models, we run additional experiments in which the adversary adds nongrammatical sequences of English words, causing average F1 score to drop further to 7%. To encourage the development of new models that understand language more precisely, we have released all of our code and data publicly.

Task
The SQuAD dataset (Rajpurkar et al., 2016) contains 107,785 human-generated reading comprehension questions about Wikipedia articles. Each question refers to one paragraph of an article, and the corresponding answer is guaranteed to be a span in that paragraph.

Models
When developing and testing our methods, we focused on two published model architectures: BiDAF (Seo et al., 2016) and Match-LSTM (Wang and Jiang, 2016). Both are deep learning architectures that predict a probability distribution over the correct answer. Each model has a single and an ensemble version, yielding four systems in total.
We also validate our major findings on twelve other published models with publicly available test-time code: ReasoNet Single and Ensemble versions (Shen et al., 2017), Mnemonic Reader Single and Ensemble versions , Structural Embedding of Dependency Trees (SEDT) Single and Ensemble versions (Liu et al., 2017), jNet (Zhang et al., 2017), Ruminating Reader (Gong and Bowman, 2017), Multi-Perspective Context Matching (MPCM) Single version , RaSOR (Lee et al., 2017), Dynamic Chunk Reader (DCR) (Yu et al., 2016), and the Logistic Regression Baseline (Rajpurkar et al., 2016). We did not run these models during development, so they serve as a held-out set that validates the generality of our approach.

Standard Evaluation
Given a model f that takes in paragraph-question pairs (p, q) and outputs an answerâ, the standard accuracy over a test set D test is simply where v is the F1 score between the true answer a and the predicted answer f (p, q) (see Rajpurkar et al. (2016) for details).

General Framework
A model that relies on superficial cues without understanding language can do well according to average F1 score, if these cues happen to be predictive most of the time. Weissenborn et al. (2017) argue that many SQuAD questions can be answered with heuristics based on type and keyword-matching. To determine whether existing models have learned much beyond such simple patterns, we introduce adversaries that confuse deficient models by altering test examples. Consider the example in Figure 1: the BiDAF Ensemble model originally gives the right answer, but gets confused when an adversarial distracting sentence is added to the paragraph.
We define an adversary A to be a function that takes in an example (p, q, a), optionally with a model f , and returns a new example (p , q , a ). The adversarial accuracy with respect to A is While standard test error measures the fraction of the test distribution over which the model gets the correct answer, the adversarial accuracy measures the fraction over which the model is robustly correct, even in the face of adversarially-chosen alterations. For this quantity to be meaningful, the adversary must satisfy two basic requirements: first, it should always generate (p , q , a ) tuples that are valid-a human would judge a as the correct answer to q given p . Second, (p , q , a ) should be somehow "close" to the original example (p, q, a).

Semantics-preserving Adversaries
In image classification, adversarial examples are commonly generated by adding an imperceptible amount of noise to the input (Szegedy et al., 2014;Goodfellow et al., 2015). These perturbations do not change the semantics of the image, but they can change the predictions of models that are oversensitive to semantics-preserving changes. For language, the direct analogue would be to paraphrase the input (Madnani and Dorr, 2010). However, high-precision paraphrase generation is challenging, as most edits to a sentence do actually change its meaning.

Concatenative Adversaries
Instead of relying on paraphrasing, we use perturbations that do alter semantics to build concatenative adversaries, which generate examples of the form (p + s, q, a) for some sentence s. In other words, concatenative adversaries add a new sentence to the end of the paragraph, and leave the question and answer unchanged. Valid adversarial examples are precisely those for which s does not contradict the correct answer; we refer to such sentences as being compatible with (p, q, a). We use semantics-altering perturbations to that ensure that s is compatible, even though it may have many words in common with the question q. Existing models are bad at distinguishing these sentences from sentences that do in fact address the question, indicating that they suffer not from oversensitivity but from overstability to semantics-altering edits. Table 1 summarizes this important distinction. The decision to always append s to the end of p is somewhat arbitrary; we could also prepend it to the beginning, though this would violate the expectation of the first sentence being a topic sentence. Both are more likely to preserve the validity of the example than inserting s in the middle of p, which runs the risk of breaking coreference links. Now, we describe two concrete concatenative adversaries, as well as two variants. ADDSENT, our main adversary, adds grammatical sentences that look similar to the question. In contrast, ADDANY adds arbitrary sequences of English words, giving it more power to confuse models. Figure 2 illustrates these two main adversaries.

ADDSENT
ADDSENT uses a four-step procedure to generate sentences that look similar to the question, but do not actually contradict the correct answer. Refer to Figure 2 for an illustration of these steps. In Step 1, we apply semantics-altering perturbations to the question, in order to guarantee that the resulting adversarial sentence is compatible. We replace nouns and adjectives with antonyms from WordNet (Fellbaum, 1998), and change named entities and numbers to the nearest word in GloVe word vector space 2 (Pennington et al., 2014) with the same part of speech. 3 If no words are changed during this step, the adversary gives up and immediately returns the original example. For example, given the question "What ABC division handles domestic television distribution?", we would change "ABC" to "NBC" (a nearby word in vector space) and "domestic" to "foreign" (a WordNet antonym), resulting in the question, "What NBC division handles foreign television distribution?" In Step 2, we create a fake answer that has the same "type" as the original answer. We define a set  of 26 types, corresponding to NER and POS tags from Stanford CoreNLP , plus a few custom categories (e.g., abbreviations), and manually associate a fake answer with each type. Given the original answer to a question, we compute its type and return the corresponding fake answer. In our running example, the correct answer was not tagged as a named entity, and has the POS tag NNP, which corresponds to the fake answer "Central Park." In Step 3, we combine the altered question and fake answer into declarative form, using a set of roughly 50 manually-defined rules over CoreNLP constituency parses. For example, "What ABC division handles domestic television distribution?" triggers a rule that converts questions of the form "what/which NP 1 VP 1 ?" to "The NP 1 of [Answer] VP 1 ". After incorporating the alterations and fake answer from the previous steps, we generate the sentence, "The NBC division of Central Park handles foreign television distribution." The raw sentences generated by Step 3 can be ungrammatical or otherwise unnatural due to the incompleteness of our rules and errors in constituency parsing. Therefore, in Step 4, we fix errors in these sentences via crowdsourcing. Each sentence is edited independently by five workers on Amazon Mechanical Turk, resulting in up to five sentences for each raw sentence. Three additional crowdworkers then filter out sentences that are ungrammatical or incompatible, resulting in a smaller (possibly empty) set of human-approved sentences. The full ADDSENT adversary runs the model f as a black box on every human-approved sentence, and picks the one that makes the model give the worst answer. If there are no humanapproved sentences, the adversary simply returns the original example.
A model-independent adversary. ADDSENT requires a small number of queries to the model under evaluation. To explore the possibility of an adversary that is completely model-independent, we also introduce ADDONESENT, which adds a random human-approved sentence to the paragraph. In contrast with prior work in computer vision (Papernot et al., 2017;Narodytska and Kasiviswanathan, 2016;Moosavi-Dezfooli et al., 2017), ADDONESENT does not require any access to the model or to any training data: it generates adversarial examples based solely on the intuition that existing models are overly stable.

ADDANY
For ADDANY, the goal is to choose any sequence of d words, regardless of grammaticality. We use local search to adversarially choose a distracting sentence s = w 1 w 2 . . . w d . Figure 2 shows an example of ADDANY with d = 5 words; in our experiments, we use d = 10.
We first initialize words w 1 , . . . , w d randomly from a list of common English words. 4 Then, we run 6 epochs of local search, each of which iterates over the indices i ∈ {1, . . . , d} in a random order. For each i, we randomly generate a set of candidate words W as the union of 20 randomly sampled common words and all words in q. For each x ∈ W , we generate the sentence with x in the i-th position and w j in the j-th position for each j = i. We try adding each sentence to the paragraph and query the model for its predicted probability distribution over answers. We update w i to be the x that minimizes the expected value of the F1 score over the model's output distribution. We return immediately if the model's argmax prediction has 0 F1 score. If we do not stop after 3 epochs, we randomly initialize 4 additional word sequences, and search over all of these random initializations in parallel.
ADDANY requires significantly more model access than ADDSENT: not only does it query the model many times during the search process, but it also assumes that the model returns a probability distribution over answers, instead of just a single prediction. Without this assumption, we would have to rely on something like the F1 score of the argmax prediction, which is piecewise constant and therefore harder to optimize. "Probabilistic" query access is still weaker than access to gradients, as is common in computer vision (Szegedy et al., 2014;Goodfellow et al., 2015).
We do not do anything to ensure that the sentences generated by this search procedure do not contradict the original answer. In practice, the generated "sentences" are gibberish that use many question words but have no semantic content (see Figure 2 for an example).
Finally, we note that both ADDSENT and ADDANY try to incorporate words from the question into their adversarial sentences. While this is an obvious way to draw the model's attention, we were curious if we could also distract the model without such a straightforward approach. To this end, we introduce a variant of ADDANY called ADDCOMMON, which is exactly like ADDANY except it only adds common words. 4 We define common words as the 1000 most frequent words in the Brown corpus (Francis and Kucera, 1979).     age F1 score to fall to 46.1%, despite only adding common words. We also verified that our adversaries were general enough to fool models that we did not use during development. We ran ADDSENT on twelve published models for which we found publicly available test-time code; we did not run ADDANY on these models, as not all models exposed output distributions. As seen in Table 3, no model was robust to adversarial evaluation; across the sixteen total models tested, average F1 score fell from 75.4% to 36.4% under ADDSENT.

Main Experiments
It is noteworthy that the Mnemonic Reader models  outperform the other models by about 6 F1 points. We hypothesize that Mnemonic Reader's self-alignment layer, which helps model long-distance relationships between parts of the paragraph, makes it better at locating all pieces of evidence that support the correct answer. Therefore, it can be more confident in the correct answer, compared to the fake answer inserted by the adversary.

Human Evaluation
To ensure our results are valid, we verified that humans are not also fooled by our adversarial examples. As ADDANY requires too many model queries to run against humans, we focused on ADDSENT. We presented each original and adversarial paragraph-question pair to three crowdworkers, and asked them to select the correct answer by copy-and-pasting from the paragraph. We then took a majority vote over the three responses (if all three responses were different, we picked one at random). These results are shown in Table 4. On original examples, our humans are actually slightly better than the reported number of 91.2 F1 on the entire development set. On ADDSENT, human accuracy drops by 13.1 F1 points, much less than the computer systems.
Moreover, much of this decrease can be explained by mistakes unrelated to our adversarial sentences. Recall that ADDSENT picks the worst case over up to five different paragraph-question pairs. Even if we showed the same original example to five sets of three crowdworkers, chances are that at least one of the five groups would make a mistake, just because humans naturally err. Therefore, it is more meaningful to evaluate humans on ADDONESENT, on which their accuracy drops by only 3.4 F1 points.

Analysis
Next, we sought to better understand the behavior of our four main models under adversarial evaluation. To highlight errors caused by the adversary, we focused on examples where the model originally predicted the (exact) correct answer. We divided this set into "model successes"-examples where the model continued being correct during adversarial evaluation-and "model failures"examples where the model gave a wrong answer during adversarial evaluation.

Manual verification
First, we verified that the sentences added by ADDSENT are actually grammatical and compatible. We manually checked 100 randomly chosen BiDAF Ensemble failures. We found only one where the sentence could be interpreted as answering the question: in this case, ADDSENT replaced the word "Muslim" with the related word "Islamic", so the resulting adversarial sentence still contradicted the correct answer. Additionally, we found 7 minor grammar errors, such as subject-verb disagreement (e.g., "The Alaskan Archipelago are made up almost entirely of hamsters.") and misuse of function words (e.g., "The gas of nitrogen makes up 21.8 % of the Mars's atmosphere."), but no errors that materially impeded understanding of the sentence.
We also verified compatibility for ADDANY. We found no violations out of 100 randomly chosen BiDAF Ensemble failures.

Error analysis
Next, we wanted to understand what types of errors the models made on the ADDSENT examples. In 96.6% of model failures, the model predicted a span in the adversarial sentence. The lengths of the predicted answers were mostly similar to those of correct answers, but the BiDAF models occasionally predicted very long spans. The BiDAF Single model predicted an answer of more than 29 words-the length of the longest answer in the SQuAD development set-on 5.0% of model failures; for BiDAF Ensemble, this number was 1.6%. Since the BiDAF models independently predict the start and end positions of the answer, they can predict very long spans when the end pointer is influenced by the adversarial sentence, but the start pointer is not. Match-LSTM has a similar structure, but also has a hard-coded rule that stops it from predicting very long answers.
We also analyzed human failures-examples where the humans were correct originally, but wrong during adversarial evaluation. Humans predicted from the adversarial sentence on only 27.3% of these error cases, which confirms that many errors are normal mistakes unrelated to adversarial sentences.

Categorizing ADDSENT sentences
We then manually examined sentences generated by ADDSENT. In 100 BiDAF Ensemble failures, we found 75 cases where an entity name was changed in the adversarial sentence, 17 cases where numbers or dates were changed, and 33 cases where an antonym of a question word was used. 5 Additionally, 7 sentences had other miscellaneous perturbations made by crowdworkers during Step 4 of ADDSENT. For example, on a question about the "Kalven Report", the adversarial sentence discussed "The statement Kalven cited" instead; in another case, the question, "How does Kenya curb corruption?" was met by the unhelpful sentence, "Tanzania is curbing corruption" (the model simply answered, "corruption").

Reasons for model successes
Finally, we sought to understand the factors that influence whether the model will be robust to adversarial perturbations on a particular example. First, we found that models do well when the question has an exact n-gram match with the original paragraph. Figure 3 plots the fraction of examples for which an n-gram in the question appears verbatim in the original passage; this is much higher for model successes. For example, 41.5% of BiDAF Ensemble successes had a 4-gram in common with the original paragraph, compared to only 21.0% of model failures.
We also found that models succeeded more often on short questions. Figure 4 shows the dis-

Transferability across Models
In computer vision, adversarial examples that fool one model also tend to fool other models (Szegedy et al., 2014;Moosavi-Dezfooli et al., 2017); we investigate whether the same pattern holds for us. Examples from ADDONESENT clearly do transfer across models, since ADDONESENT always adds the same adversarial sentence regardless of model. Table 5 shows the results of evaluating the four main models on adversarial examples generated by running either ADDSENT or ADDANY against each model. ADDSENT adversarial examples transfer between models quite effectively; in particular, they are harder than ADDONESENT examples, which implies that examples that fool one model are more likely to fool other models. The ADDANY adversarial examples exhibited more limited transferability between models. For both ADDSENT and ADDANY, examples transferred slightly better between single and ensemble versions of the same model.

Training on Adversarial Examples
Finally, we tried training on adversarial examples, to see if existing models can learn to become more robust. Due to the prohibitive cost of running ADDSENT or ADDANY on the entire training set, we instead ran only Steps 1-3 of ADDSENT (everything except crowdsourcing) to generate a raw adversarial sentence for each training example. We then trained the BiDAF model from scratch on  the union of these examples and the original training data. As a control, we also trained a second BiDAF model on the original training data alone. 6 The results of evaluating these models are shown in Table 6. At first glance, training on adversarial data seems effective, as it largely protects against ADDSENT. However, further investigation shows that training on these examples has only limited utility. To demonstrate this, we created a variant of ADDSENT called ADDSENTMOD, which differs from ADDSENT in two ways: it uses a different set of fake answers (e.g., PERSON named entities map to "Charles Babbage" instead of "Jeff Dean"), and it prepends the adversarial sentence to the beginning of the paragraph instead of appending it to the end. The retrained model does almost as badly as the original one on ADDSENTMOD, suggesting that it has just learned to ignore the last sentence and reject the fake answers that ADDSENT usually proposed. In order for training on adversarial examples to actually improve the model, more care must be taken to ensure that the model cannot overfit the adversary.

Discussion and Related Work
Despite appearing successful by standard evaluation metrics, existing machine learning systems for reading comprehension perform poorly under adversarial evaluation. Standard evaluation is overly lenient on models that rely on superficial cues. In contrast, adversarial evaluation reveals that existing models are overly stable to perturbations that alter semantics.
To optimize adversarial evaluation metrics, we may need new strategies for training models. For certain classes of models and adversaries, efficient training strategies exist: for example, Globerson and Roweis (2006) train classifiers that are optimally robust to adversarial feature deletion. Ad-versarial training (Goodfellow et al., 2015) can be used for any model trained with stochastic gradient descent, but it requires generating new adversarial examples at every iteration; this is feasible for images, where fast gradient-based adversaries exist, but is infeasible for domains where only slower adversaries are available.
We contrast adversarial evaluation, as studied in this work, with generative adversarial models. While related in name, the two have very different goals. Generative adversarial models pit a generative model, whose goal is to generate realistic outputs, against a discriminative model, whose goal is to distinguish the generator's outputs from real data (Smith, 2012;. Bowman et al. (2016) and Li et al. (2017) used such a setup for sentence and dialogue generation, respectively. Our setup also involves a generator and a discriminator in an adversarial relationship; however, our discriminative system is tasked with finding the right answer, not distinguishing the generated examples from real ones, and our goal is to evaluate the discriminative system, not to train the generative one.
While we use adversaries as a way to evaluate language understanding, robustness to adversarial attacks may also be its own goal for tasks such as spam detection. Dalvi et al. (2004) formulated such tasks as a game between a classifier and an adversary, and analyzed optimal strategies for each player. Lowd and Meek (2005) described an efficient attack by which an adversary can reverseengineer the weights of a linear classifier, in order to then generate adversarial inputs. In contrast with these methods, we do not make strong structural assumptions about our classifiers.
Other work has proposed harder test datasets for various tasks. Levesque (2013) proposed the Winograd Schema challenge, in which computers must resolve coreference resolution problems that were handcrafted to require extensive world knowledge. Paperno et al. (2016) constructed the LAMBADA dataset, which tests the ability of language models to handle long-range dependencies. Their method relies on the availability of a large initial dataset, from which they distill a difficult subset; such initial data may be unavailable for many tasks. Rimell et al. (2009) showed that dependency parsers that seem very accurate by standard metrics perform poorly on a subset of the test data that has unbounded dependency construc-tions. Such evaluation schemes can only test models on phenomena that are moderately frequent in the test distribution; by perturbing test examples, we can introduce out-of-distribution phenomena while still leveraging prior data collection efforts.
While concatenative adversaries are well-suited to reading comprehension, other adversarial methods may prove more effective on other tasks. As discussed previously, paraphrase generation systems (Madnani and Dorr, 2010) could be used for adversarial evaluation on a wide range of language tasks. Building on our intuition that existing models are overly stable, we could apply meaningaltering perturbations to inputs on tasks like machine translation, and adversarially choose ones for which the model's output does not change. We could also adversarially generate new examples by combining multiple existing ones, in the spirit of Data Recombination (Jia and Liang, 2016). The Build It, Break It shared task (Bender et al., 2017) encourages researchers to adversarially design minimal pairs to fool sentiment analysis and semantic role labeling systems.
Progress on building systems that truly understand language is only possible if our evaluation metrics can distinguish real intelligent behavior from shallow pattern matching. To this end, we have released scripts to run ADDSENT on any SQuAD system, as well as code for ADDANY. We hope that our work will motivate the development of more sophisticated models that understand language at a deeper level.