What do Models Learn from Question Answering Datasets?

While models have reached superhuman performance on popular question answering (QA) datasets such as SQuAD, they have yet to outperform humans on the task of question answering itself. In this paper, we investigate if models are learning reading comprehension from QA datasets by evaluating BERT-based models across five datasets. We evaluate models on their generalizability to out-of-domain examples, responses to missing or incorrect data, and ability to handle question variations. We find that no single dataset is robust to all of our experiments and identify shortcomings in both datasets and evaluation methods. Following our analysis, we make recommendations for building future QA datasets that better evaluate the task of question answering through reading comprehension. We also release code to convert QA datasets to a shared format for easier experimentation at https://github.com/amazon-research/qa-dataset-converter.


Introduction
Question answering (QA) through reading comprehension has seen considerable progress in recent years. This progress is in large part due to the release of large language models like BERT and new datasets that have introduced impossible questions (Rajpurkar et al., 2018), bigger scales (Kwiatkowski et al., 2019), and context (Choi et al., 2018; Reddy et al., 2019) to question answering. At the time of writing this paper, models have outperformed human baselines on the widely-used SQuAD 1.1 and SQuAD 2.0 datasets, and more challenging datasets such as QuAC have models less than 7 F1 points away. Despite these increases in F1 scores, we are still far from saying question answering is a solved problem.
Concerns have been raised over how challenging QA datasets are. Previous work has found that simple heuristics can give good performance on SQuAD 1.1 (Weissenborn et al., 2017), and successful SQuAD models lack robustness by giving inconsistent answers (Ribeiro et al., 2019) or being vulnerable to adversarial attacks (Jia and Liang, 2017; Wallace et al., 2019). If state-of-the-art models are excelling at test sets but not solving the underlying task of reading comprehension, then our test sets are flawed. We need to better understand what models learn from QA datasets. In this work, we ask three questions: (1) Does performance on individual QA datasets generalize to new datasets? (2) Do models need to learn reading comprehension for QA datasets? (3) Can QA models handle variations in questions?
To answer these questions, we evaluate BERT models trained on five QA datasets using simple generalization and robustness probes. We find that (1) Model performance does not generalize well outside of heuristics like question-context overlaps, (2) Removing or corrupting dataset examples does not always harm model performance, showing that models can rely on simpler methods than reading comprehension to answer questions, and (3) No dataset fully prepares models to handle question variations like filler words or negation. These findings suggest that while QA models can learn heuristics around question-context overlaps and named entities, they do not need to learn reading comprehension to perform well on QA datasets. Based on these findings, we make recommendations on how to better create and evaluate QA datasets.

Related Work
Our work is inspired by recent trends in NLP to evaluate generalizability and probe what models learn from datasets. In terms of generalizability, prior work has been done by Yogatama et al. (2019), who evaluated a SQuAD 1.1 model across four datasets, and Talmor and Berant (2019), who comprehensively evaluated ten QA datasets. The MRQA 2019 shared task (Fisch et al., 2019) evaluated transferability across multiple datasets, and Khashabi et al. (2020) proposed a method to train one model on 17 different QA datasets. In our work, we focus on question answering through reading comprehension and extend the work on generalizability by including impossible questions in all our datasets and analyzing the effects of question-context overlap on generalizability.
There is also growing interest in probing what models learn from datasets (McCoy et al., 2019; Geirhos et al., 2020; Richardson and Sabharwal, 2020; Si et al., 2020). Previous work in question answering has found that state-of-the-art models can get good performance on incomplete input (Agrawal et al., 2016; Sugawara et al., 2018; Niven and Kao, 2019), under-rely on important words (Mudrakarta et al., 2018), and over-rely on simple heuristics (Weissenborn et al., 2017; Ko et al., 2020). Experiments on SQuAD in particular have shown that SQuAD models are vulnerable to adversarial attacks (Jia and Liang, 2017; Wallace et al., 2019) and not robust to paraphrases (Ribeiro et al., 2018; Gan and Ng, 2019).
Our work continues exploring what models learn by comprehensively testing multiple QA datasets against a variety of simple but informative probes. We take inspiration from previous studies, and we make novel contributions by using BERT, a state-of-the-art model, and running several experiments against five different QA datasets to investigate the progress made in reading comprehension.

Datasets
We compare five datasets in our experiments: SQuAD 2.0, TriviaQA, Natural Questions, QuAC, and NewsQA. All our datasets treat question answering as a reading comprehension task where the question is about a document and the answer is either an extracted span of text or labeled unanswerable. To consistently compare and experiment across models, we convert all datasets into a SQuAD 2.0 JSON format. Since most datasets have a hidden test set, we evaluate models on the dev set and consequently refer to the dev sets as test sets in this paper. The train and dev set sizes are shown in Table 1.

Dataset    Question  Context  Answer
SQuAD      10        120      3
TriviaQA   15        746      2
NQ         9         96       4
QuAC       7         395      14
NewsQA     8         709      4
Table 2: Comparison of the average number of words in questions, contexts, and answers in each dataset

Below we describe each dataset and any modifications we made to run our experiments:
SQuAD 2.0 (Rajpurkar et al., 2018) consists of 150K question-answer pairs on Wikipedia articles. To create SQuAD 1.1, crowd workers wrote questions about a Wikipedia paragraph and highlighted the answer (Rajpurkar et al., 2016). SQuAD 2.0 includes an additional 50K plausible but unanswerable questions written by crowd workers.
TriviaQA (Joshi et al., 2017) includes 95K question-answer pairs from trivia websites. The questions were written by trivia enthusiasts and the evidence documents were retrieved by the authors retrospectively. We use the variant of TriviaQA where the documents are Wikipedia articles.
Natural Questions (NQ) (Kwiatkowski et al., 2019) contains 300K questions from Google search logs. For each question, a crowd worker found a long and short answer on a Wikipedia page. We use the subset of NQ with a long answer and frame the task as finding the short answer in the long answer. We only include examples with answers in the paragraph text (as opposed to a table or list).
QuAC (Choi et al., 2018)
NewsQA (Trischler et al., 2017) contains 100K questions on 10K CNN articles. One set of crowd workers wrote questions based on a headline and summary, and a second set of workers found the answer in the article. We reintroduce unanswerable questions that were excluded in the original paper.
There are notable differences among our datasets in terms of genre and how they were built. In Table 2, we see a large variation in the average number of words in questions, contexts, and answers. Despite these differences, all our datasets are reading comprehension tasks. We believe a good reading comprehension model should handle question answering well regardless of dataset differences, and so we compare across all five datasets.

Model
Table 4. We evaluate on the dev set with the SQuAD 2.0 evaluation script (Rajpurkar et al., 2018). We run our experiments on a single NVIDIA Tesla V100 16GB GPU.
In Table 5, we provide a comparison between our models and previously published BERT results. Differences occur when we make modifications to match SQuAD. We simplified NQ by removing the long answer identification task and framed the short answer task in a SQuAD format, so we see higher results than the NQ BERT baseline. For QuAC, we ignored all context-related fields and treated each example as an independent question, so we see lower results than models built on the full dataset. For NewsQA, we introduced impossible questions, resulting in lower performance. We accept these drops in performance since we are interested in comparing changes to a baseline rather than achieving state-of-the-art results.

Table 5 (columns: Dataset, Reference, Ours)

Experiments
In this section, we discuss the experiments run to evaluate what models learn from QA datasets. All results are reported as F1 scores since they are correlated with Exact Match scores and are more forgiving of sometimes arbitrary cutoffs of answers (for example, we prefer to give some credit to a model for selecting "Charles III" even if the answer was "King Charles III").

Does performance on individual QA datasets generalize to new datasets?
(Fisch et al., 2019), and building cross-dataset evaluation methods (Dua et al., 2019). We test generalizability by fine-tuning models on each dataset and evaluating against all five test sets. The results are reported as F1 scores in Table 3. The rows show a single model's performance across all five datasets, and the columns show the performance of all the models on a single test set. The model-on-self baseline is indicated in bold.
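The token-level F1 score used throughout the experiments can be illustrated with a short sketch. This is a simplified version written for illustration: the function name `token_f1` is hypothetical, and the sketch only lowercases and splits on whitespace, whereas the official SQuAD 2.0 evaluation script applies additional answer normalization.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Two empty answers match (the unanswerable case); otherwise no credit.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Partial credit for the example from the text: precision 2/2, recall 2/3,
# so F1 is approximately 0.8, while Exact Match would be 0.
score = token_f1("Charles III", "King Charles III")
```

Under Exact Match the same prediction would receive no credit, which is the motivation given above for reporting F1.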
All models take a drop in performance when evaluated on an out-of-domain test set. This shows that performance on an individual dataset does not generalize across datasets, confirming results found on different mixes of datasets (Talmor and Berant, 2019; Yogatama et al., 2019). However, there is variation in how the models perform. For example, models score up to 53.5 F1 points on SQuAD without seeing SQuAD examples, while models do not score above 21.6 F1 points on QuAC without QuAC examples. This suggests that some test sets are easier than others.
To quantify this, we calculate what proportion of each test set can be correctly answered by how many models. This data is represented as a bar graph in Figure 1. Each bar represents one dataset, and the segments show how much of the test set is answered correctly by 0 to 5 of the models.
We consider questions easier if more models correctly answer them. The figure shows that QuAC and NewsQA are more challenging test sets and contain a higher proportion of questions that are answered by 0 or 1 model. In contrast, more than half of SQuAD and NQ and almost half of TriviaQA can be answered correctly by 3 or more models.
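The per-question difficulty breakdown behind Figure 1 can be sketched as follows. The function name and the input layout (one set of correctly answered question ids per model) are assumptions made for illustration, and the text does not state which criterion marks an answer as correct, so that decision is left to the caller.

```python
from collections import Counter
from typing import Dict, List, Set


def models_correct_distribution(
    correct_by_model: Dict[str, Set[str]], question_ids: List[str]
) -> Dict[int, float]:
    """Fraction of the test set answered correctly by exactly k models."""
    n_models = len(correct_by_model)
    total = len(question_ids)
    if total == 0:
        return {k: 0.0 for k in range(n_models + 1)}
    # For each question, count how many models got it right.
    counts = Counter(
        sum(qid in correct for correct in correct_by_model.values())
        for qid in question_ids
    )
    return {k: counts.get(k, 0) / total for k in range(n_models + 1)}


# Toy example with two hypothetical models and three questions.
dist = models_correct_distribution(
    {"model_a": {"q1", "q2"}, "model_b": {"q1"}}, ["q1", "q2", "q3"]
)
```

Each key of the returned dictionary corresponds to one segment of a bar in the figure, from 0 models up to all models.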
Table 6 (columns: Experiment, Question, Answer Text, Answer Start)
While difficult questions pose a challenge for QA models, too many easy questions inflate our understanding of a model's performance. What makes a question easy? We identified a trend between question difficulty and the overlap between the question and the context. We measured overlap as the number of words that appeared in both the question and the context divided by the number of words in the question. For answerable questions, Figure 2 shows that more models return correct answers when there is higher overlap, while Figure 3 shows that fewer models correctly identify impossible questions when there is higher overlap. This suggests that models learn to use question-context overlap to identify answers. Models may even over-rely on this strategy and return answers to impossible questions when no answer exists.
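The overlap measure defined above can be sketched in a few lines. The tokenization (lowercasing and a `\w+` regular expression) and the choice to count every question token, including repeats, are assumptions of this sketch; the text does not specify them.

```python
import re


def question_context_overlap(question: str, context: str) -> float:
    """Share of question words that also appear in the context."""
    question_words = re.findall(r"\w+", question.lower())
    context_words = set(re.findall(r"\w+", context.lower()))
    if not question_words:
        return 0.0
    shared = sum(word in context_words for word in question_words)
    return shared / len(question_words)


# "invented", "the", and "telescope" appear in the context; "who" does not.
overlap = question_context_overlap(
    "Who invented the telescope?", "The telescope was invented in 1608."
)
```

In this toy example three of the four question words are found in the context, giving an overlap of 0.75.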
Overall, the results show that our models do not generalize well to different datasets. The models do seem to exploit question-context overlap, even on questions that are out-of-domain. Reducing the number of high overlap questions in a dataset could create more challenging datasets in the future and better evaluate generalization and test more complex strategies for question answering.

Do models need to learn reading comprehension for QA datasets?
State-of-the-art models get good performance on QA datasets, but does good performance mean good reading comprehension? Or are models able to take shortcuts to arrive at the same answers? We explore this by performing three dataset ablation experiments with random labels, shuffled contexts, and incomplete questions. If models can answer test set questions with incorrect or missing information, then the models are likely not learning the task of reading comprehension. The three experiments and their results are discussed in the next sections.

Random Labels
A robust model should withstand some amount of noise at training time to offset annotation error. However, if a model can perform well with a high level of noise, we should be wary of what the model has learned. In our first dataset ablation experiment, we evaluated how various amounts of noise at training time affected model performance.
To introduce noise to the training sets, we randomly selected 10%, 50%, or 90% of the training examples that were answerable and updated the answer to a random string from the same context and of the same length as the original answer. We ensured that the random answer contained no overlaps with the original answer. For simplicity, we did not alter impossible examples. An example of a random label is in the second row of Table 6.
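A minimal sketch of this label corruption is given below. It assumes that "same length" means the same number of characters and that "no overlaps" means the new character span does not intersect the original answer span; both readings are assumptions, as are the helper names `random_label` and `noisy_indices`.

```python
import random
from typing import List, Optional, Tuple


def random_label(
    context: str, answer_start: int, answer_len: int, rng: random.Random
) -> Optional[Tuple[str, int]]:
    """Pick a random same-length span that does not overlap the original answer."""
    answer_end = answer_start + answer_len
    candidates = [
        start
        for start in range(len(context) - answer_len + 1)
        if start + answer_len <= answer_start or start >= answer_end
    ]
    if not candidates:
        return None  # the context is too short to hold a non-overlapping span
    start = rng.choice(candidates)
    return context[start:start + answer_len], start


def noisy_indices(n_answerable: int, fraction: float, rng: random.Random) -> List[int]:
    """Choose which answerable training examples receive a random label."""
    return rng.sample(range(n_answerable), round(n_answerable * fraction))


rng = random.Random(0)
context = "Galileo improved the telescope in 1609."
new_answer = random_label(context, answer_start=0, answer_len=len("Galileo"), rng=rng)
```

Impossible examples are left untouched, matching the description above, so only the indices of answerable examples are passed to `noisy_indices`.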
We fine-tuned new models on increasingly noisy training sets and evaluated them on the original test sets. The results are in Table 7 in terms of F1 scores and reported only for answerable questions. On training sets with 10% random labels, all models see an F1 score drop. SQuAD, NQ, and NewsQA achieve over 90% of their baseline score, showing robustness to a reasonable level of noise. TriviaQA and QuAC take larger F1 hits (achieving only 78% and 81% of their baselines), suggesting that they are less robust to this noise.
As the amount of noise increases, most F1 scores drop to nearly 0. SQuAD and NQ, however, are suspiciously robust even when 90% of their training examples are random. SQuAD achieves 41% of its baseline and NQ achieves 27% of its baseline with training sets that are 90% noise. We find it unlikely that randomly selected strings provide a signal, so this suggests that some examples in each test set are answerable trivially and without learning reading comprehension.

Shuffled Context
The task of reading comprehension aims to measure how well a model understands a given passage. If a model is able to answer questions without understanding the logic or structure of a passage, we can get high scores on a test set but be no closer to learning reading comprehension. In our second experiment, we investigate how much of model performance can be accounted for without understanding the full passage.
For each context paragraph in the test set, we split the context by sentence, randomly shuffled the sentences, and rejoined the sentences into a new paragraph. The original answer text remained the same, but the answer start token was updated by locating the correct answer text in the shuffled context. An example is in the third row of Table 6.
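The shuffling procedure can be sketched as below. The regular-expression sentence splitter is a stand-in chosen to keep the example dependency-free; the text does not say which sentence splitter was used, and the function name is hypothetical.

```python
import random
import re
from typing import Tuple


def shuffle_context(
    context: str, answer_text: str, rng: random.Random
) -> Tuple[str, int]:
    """Shuffle the sentences of a context and relocate the answer start."""
    # Naive splitter (an assumption): break after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", context.strip())
    rng.shuffle(sentences)
    shuffled = " ".join(sentences)
    # str.find returns the first occurrence, or -1 if the answer no longer
    # appears contiguously (for example, if it spanned a sentence boundary).
    return shuffled, shuffled.find(answer_text)


rng = random.Random(0)
context = "Galileo built a telescope. He observed Jupiter. The moons surprised him."
shuffled, answer_start = shuffle_context(context, "Jupiter", rng)
```

The answer text itself is unchanged; only its character offset is recomputed against the shuffled paragraph.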
We used our models fine-tuned on the original training sets and evaluated on the shuffled context test sets. The results are in Table 8. TriviaQA sees the largest drop in performance, achieving only 66% of its baseline, followed by NewsQA with 80% of its baseline. SQuAD and QuAC, on the other hand, get over 93% of their original baselines even with randomly shuffled contexts. TriviaQA and NewsQA have longer contexts, with an average of over 700 words, and so shuffling longer contexts seems more detrimental. While these results show that models do not rely on naive approaches, like position, they do show that for many questions, models do not need to understand a paragraph's structure to correctly predict the answer.

Incomplete Input
QA dataset creators and their crowd workers spend considerable effort hand-crafting questions that are meant to challenge a model's ability to understand language. But are models using the questions? In previous work, Agrawal et al. (2016) found that a Visual Question Answering (VQA) model could get good performance with just half the original question, while Sugawara et al. (2018) saw drops in BiDAF model performance on QA datasets with increasingly incomplete questions. We expand on these previous works by using BERT, including impossible questions, and introducing NER baselines for comparison.
We created three variants of each test set containing only the first half, first word, or no words from each question. The answer expectations were not changed. Examples are in the fourth, fifth, and sixth rows of Table 6.
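The three question variants can be produced with a small helper such as the one below. Whitespace tokenization and rounding the half down for odd-length questions are assumptions of this sketch, and the dictionary keys are hypothetical names.

```python
from typing import Dict


def incomplete_variants(question: str) -> Dict[str, str]:
    """Build the first-half, first-word, and empty variants of a question."""
    words = question.split()
    return {
        "first_half": " ".join(words[: len(words) // 2]),
        "first_word": words[0] if words else "",
        "no_words": "",
    }


variants = incomplete_variants("Who invented the telescope?")
```

For this four-word question the first-half variant is "Who invented" and the first-word variant is "Who"; the expected answers stay as they were in the original test set.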
We evaluated models fine-tuned on the original training set on the incomplete test sets. The results are in Table 9. F1 scores mostly decrease on test sets with incomplete input, but models can return correct answers without being given the question. SQuAD achieves 65% of its baseline given no question, an increase from the First Word test set primarily because of higher success on impossible questions. NQ achieves up to 68% of its baseline F1 score given the first word and up to 44% given no question. These results show that not all examples require full or any question understanding to make correct predictions. We also see higher F1 scores compared to Sugawara et al. (2018) when using the first word. In TriviaQA, Sugawara et al. (2018) saw their F1 score drop by 75% (from 49.3 to 12.5) while we see a drop of 46% (from 58.7 to 31.8), which could suggest that our BERT models have overfit more than BiDAF models.
How can models answer without the full question? We investigated our results further by creating two naive named entity recognition (NER) baselines using spaCy to see if models can rely on entity types. For our First Word NER baseline, we used the first word of the question to choose an entity as the answer. If a question started with "who", we returned the first person entity in the context, for "when", the first date, for "where", the first location, and for "what", the first organization, event, or work of art. The results are in the First Word NER column of
Our Person NER baseline returns the first person entity found in each context. The results are shown in Table 10. Both NQ and SQuAD achieve 33-35% of their baseline by only extracting the first person entity. TriviaQA sees a much larger drop when using only person entities, suggesting there is more entity type variety in the TriviaQA test set. These results show that some questions can be answered by extracting entity types and without needing most or all of the question text.
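The two NER baselines can be sketched without running an NER model by passing in the entities as (text, label) pairs in document order. The label names follow spaCy's English entity scheme, but the exact mapping is an assumption of this sketch, in particular treating "location" as GPE or LOC, and so is returning an empty answer for question words the text does not mention.

```python
from typing import List, Tuple

# Assumed mapping from the first question word to entity labels.
FIRST_WORD_TO_LABELS = {
    "who": {"PERSON"},
    "when": {"DATE"},
    "where": {"GPE", "LOC"},
    "what": {"ORG", "EVENT", "WORK_OF_ART"},
}


def first_word_ner_answer(question: str, entities: List[Tuple[str, str]]) -> str:
    """Return the first entity whose label matches the question's first word."""
    words = question.lower().split()
    labels = FIRST_WORD_TO_LABELS.get(words[0], set()) if words else set()
    for text, label in entities:
        if label in labels:
            return text
    return ""  # empty string stands for "no answer"


def person_ner_answer(entities: List[Tuple[str, str]]) -> str:
    """Person NER baseline: the first person entity, whatever the question is."""
    return first_word_ner_answer("who", entities)


entities = [("1608", "DATE"), ("Hans Lippershey", "PERSON"), ("Netherlands", "GPE")]
answer = first_word_ner_answer("Who invented the telescope?", entities)
```

With spaCy, the entity list could be built from a processed document as `[(ent.text, ent.label_) for ent in doc.ents]`.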

Can QA models handle variations in questions?
The previous section found that models can perform well on test sets even as seemingly important features are stripped from datasets. This section considers the opposite problem: Can models remain robust as features are added to datasets? To analyze this, we run two experiments where we add filler words and negation to test set questions.

Filler Words
While previous work has found that models are not robust to paraphrased questions (Ribeiro et al., 2018; Gan and Ng, 2019), we take an even simpler approach by adding filler words that do not affect the rest of the question. For each question in the test set, we randomly added one of three filler words (really, definitely, or actually) before the main verb, as identified by spaCy (https://spacy.io). An example is shown in the seventh row of Table 6.
Table 11 shows the results of models fine-tuned on their original training sets and evaluated on the filler word test sets. All models drop between 2 and 5 F1 points. Although these drops may seem small, they do show that even a naive approach can hurt performance. It is no surprise that more sophisticated paraphrases of questions cause models to fail. The SQuAD model in particular had better performance when 50% of the training set was randomly labeled (73.9) than when filler words were added to the test set (69.5), suggesting that our models have become robust to less consequential features.
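The filler-word insertion can be sketched as follows. The sketch takes a pre-tokenized question and the index of its main verb as inputs; how the main verb is located with spaCy is not described in the text, so that step is left to the caller, and the function name is hypothetical.

```python
import random
from typing import List

FILLERS = ["really", "definitely", "actually"]


def add_filler(tokens: List[str], main_verb_index: int, rng: random.Random) -> List[str]:
    """Insert one randomly chosen filler word directly before the main verb."""
    filler = rng.choice(FILLERS)
    return tokens[:main_verb_index] + [filler] + tokens[main_verb_index:]


rng = random.Random(0)
tokens = ["Who", "invented", "the", "telescope", "?"]
with_filler = add_filler(tokens, main_verb_index=1, rng=rng)
```

The rest of the question is left untouched, so the expected answer does not change.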

Negation
Negation is an important grammatical construction for QA systems to understand. Giving the same answer to a question and its negative (Who invented the telescope? vs. Who didn't invent the telescope?) can frustrate or mislead users. In previous work, Kassner and Schütze (2019) studied negation by manually negating 305 SQuAD 1.1 questions and found that a BERT model largely ignored negation. We expand on this work by using the full SQuAD 2.0 dataset and comparing performance across five datasets.
For each dataset, we negated every question in the test set by mapping common verbs (e.g., is, did, has) to their contracted negative form (e.g., isn't, didn't, hasn't) or by inserting never before the main verb, as identified by spaCy. We keep the original answers in the test set since we want to evaluate how often the original answer is returned for the negative question. In this case, a lower F1 score means better performance. An example is in the last row of Table 6.
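A simplified sketch of the two negation rules is shown below. The verb map contains only the three verbs named in the text, the main verb index is supplied by the caller rather than found with spaCy, and the function name is hypothetical. The sketch does not lemmatize, so it would turn "Who invented the telescope?" into a question with "never invented" rather than the "didn't invent" form shown above.

```python
from typing import List

# Only the verbs named in the text; a fuller list would be an assumption.
NEGATED = {"is": "isn't", "did": "didn't", "has": "hasn't"}


def negate(tokens: List[str], main_verb_index: int) -> List[str]:
    """Negate a tokenized question with a contraction, or insert "never"."""
    for i, token in enumerate(tokens):
        negated = NEGATED.get(token.lower())
        if negated is not None:
            # Keep sentence-initial capitalization ("Is" becomes "Isn't").
            if token[0].isupper():
                negated = negated.capitalize()
            return tokens[:i] + [negated] + tokens[i + 1:]
    # Fallback rule: place "never" directly before the main verb.
    return tokens[:main_verb_index] + ["never"] + tokens[main_verb_index:]


negated_question = negate(["When", "did", "he", "leave", "?"], main_verb_index=3)
```

Because the gold answers are kept unchanged, a model that scores high F1 on the negated questions is one that ignores the negation.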
We used the models fine-tuned on their original training sets and evaluated them on the negated test sets. The results are in Table 12 and show how often each model continued to return the original answer given a negative question. We see that SQuAD outperforms all the other models by giving the original answer to a negative question less than 3% of the time. Other models return the original answer to the negative question between 48% and 94% of the time, suggesting that the negation is largely ignored.
Table 13: The percentage of questions in the training set containing n't or never that are impossible
Does the SQuAD model understand negation, or is this a sign of bias? Table 13 shows how often a question containing n't or never was impossible in the training set. SQuAD has a high bias, with 85% of questions containing n't and 89% of questions containing never being impossible. This difference could be a result of SQuAD annotators having a bias to include n't or never more often in impossible questions than answerable questions, while impossible questions in other datasets were more organically collected. This suggests that SQuAD's performance is due to an annotation artifact. These results suggest that no dataset adequately prepares a model to understand negation.
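The statistic reported in Table 13 can be computed with a sketch like the one below. It assumes each training example is a dictionary with a `question` string and the `is_impossible` flag used by the SQuAD 2.0 format, and it uses plain substring matching, which is an assumption (it would also match "never" inside a longer word).

```python
from typing import Dict, List, Optional


def impossible_rate(examples: List[Dict], marker: str) -> Optional[float]:
    """Share of questions containing `marker` that are labeled impossible."""
    hits = [ex for ex in examples if marker in ex["question"].lower()]
    if not hits:
        return None  # no question contains the marker
    return sum(bool(ex["is_impossible"]) for ex in hits) / len(hits)


# Toy training set, invented for illustration only.
train = [
    {"question": "Who didn't sign the treaty?", "is_impossible": True},
    {"question": "What wasn't included in the report?", "is_impossible": True},
    {"question": "Which city isn't in France?", "is_impossible": False},
    {"question": "Who signed the treaty?", "is_impossible": False},
]
rate = impossible_rate(train, "n't")
```

A rate far above the overall share of impossible questions in a training set indicates that the marker alone is a strong cue for predicting "no answer".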

Conclusions
In this work, we probed five QA datasets with six tasks and found that our models did not learn to generalize well, remained suspiciously robust to incorrect or missing data, and failed to handle variations in questions. These findings show that models learn simple heuristics around question-context overlap or entity types and pick up on underlying patterns in the datasets that allow them to remain robust to corrupt examples but not to valid variations. The shortcomings in datasets and evaluation methods make it difficult to judge if models are learning reading comprehension. Based on our work, we make the following recommendations to researchers who create or evaluate QA datasets:
• Test for generalizability: Models are more valuable to real-world applications if they generalize. New QA models should report performance across multiple relevant datasets.
• Challenge the models: Evaluating on too many easy questions can inflate our judgement of what a model has learned. Discard questions that can be solved trivially by high overlap or extracting the first named entity.
• Be wary of cheating: Good performance does not mean good understanding. Probe datasets by adding noise, shuffling contexts, or providing incomplete input to ensure models aren't taking shortcuts.
• Include variations: Models should be prepared to handle a variety of questions. Include variations such as filler words or negation to existing questions to evaluate how well models have understood a question.
• Standardize dataset formats: When creating new datasets, consider following a standardized format, such as SQuAD, to make cross-dataset evaluations simpler.