A guide to the dataset explosion in QA, NLI, and commonsense reasoning

Question answering, natural language inference and commonsense reasoning are increasingly popular as general NLP system benchmarks, driving both modeling and dataset work. For question answering alone, there are already over 100 datasets, with over 40 published after 2018. However, most new datasets get “solved” soon after publication, largely not because of the verbal reasoning capabilities of our models, but because of annotation artifacts and shallow cues in the data that the models can exploit. This tutorial aims to (1) provide an up-to-date guide to the recent datasets, (2) survey the old and new methodological issues with dataset construction, and (3) outline the existing proposals for overcoming them. The target audience is NLP practitioners who are lost in the dozens of recent datasets and would like to know what these datasets are actually measuring. Our overview of the problems with the current datasets and the latest tips and tricks for overcoming them will also be useful to researchers working on future benchmarks.


Reading list
The core approaches to machine reading comprehension and several widely-used datasets are covered in the survey by Qiu et al. (2019). For NLI, we refer the reader to the surveys on resources and approaches (Storks et al., 2019b), as well as issues with the current data (Schlegel et al., 2020). A survey on benchmarks and approaches is also available for commonsense reasoning (Storks et al., 2019a).

Tutorial outline
The tutorial will present three hours of content with a thirty-minute break.
Motivation. We will start by discussing the place of high-level reasoning tasks in the current NLP system evaluation paradigm: how the focus shifted away from low-level tasks such as POS tagging, and how low-level linguistic competencies seem to be coming back (Ribeiro et al., 2020).
The dataset explosion. This first part of the tutorial will provide an overview of the main types of datasets for QA, NLI, and commonsense reasoning. For each sub-type, we will discuss representative dataset examples.
The field of question answering encompasses both open-world QA and reading comprehension (RC). Open-world QA focuses on factoid questions, with the answers typically extracted from web snippets or Wikipedia. The questions usually come from search engine queries (Kwiatkowski et al., 2019) and quiz data (Joshi et al., 2017). Bordering on open-world QA is the task of QA on structured data, such as tables and databases (Jiang et al., 2019).
Most current reading comprehension datasets are extractive (Rajpurkar et al., 2016; Dua et al., 2019), i.e. the correct answer is contained within the text itself, and the task is to find the correct span. Multiple-choice questions are harder to generate, as they need good distractors, and often come from curated test collections (Lai et al., 2017). Free-form answers remain rare (Bajaj et al., 2016), as evaluating them runs into the general problem of evaluating language generation. Most RC datasets are single-domain, with a few exceptions (Reddy et al., 2018).
For NLI, we will organize the discussion in terms of the domains covered by the current datasets: single-domain (Bowman et al., 2015), multi-domain (Williams et al., 2017), and specialized domains (Romanov and Shivade, 2018). In both QA and NLI there have also been attempts to recast datasets from other tasks as QA/NLI problems (McCann et al., 2018; White et al., 2017), and researchers working on NLI often rely on the datasets for the related problem of RTE (Dzikovska et al., 2013).
Commonsense reasoning datasets come in different formats: multiple-choice reading comprehension (Ostermann et al., 2018), extractive reading comprehension, story completion (Mostafazadeh et al., 2017), and multiple-choice questions over a single-sentence input (Levesque et al., 2012). The task of commonsense reasoning is supposed to involve a combination of context-internal knowledge with context-external world knowledge, and we will briefly mention the major sources of such knowledge that are typically recommended in commonsense challenges, such as scripts (Wanzare et al., 2016), frames (Baker et al., 1998), and entity relations (Speer et al., 2017).
Reality check. One of the reasons there are so many new datasets is that most of them get "solved" very soon after publication, as happened with CoQA (Reddy et al., 2018). However, this is not necessarily a testimony to the linguistic power of deep learning. It is becoming increasingly clear that, given the opportunity, our models exploit annotation artifacts and shallow lexical cues, achieving high performance without a correspondingly high degree of language understanding. The second part of the tutorial will synthesize a string of papers exposing such issues (Niven and Kao, 2019; McCoy et al., 2019; Geva et al., 2019; Wallace et al., 2019).
To give a few examples, for QA it has been shown that human-level performance on SQuAD can be achieved while relying only on superficial cues (Jia and Liang, 2017), and 73% of NewsQA can be solved by simply identifying the single most relevant sentence (Chen et al., 2016). A system trained on one QA dataset does not tend to perform well on another one, even within the same domain (Yatskar, 2019). Research on adversarial attacks suggests that it is possible to find dataset-specific phrases that, when added to any input, force a QA system to output a certain prediction. For example, a SQuAD-trained QA system can be hacked in this way to always predict "to kill American people" as the answer to any question (Wallace et al., 2019).
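To make the "single most relevant sentence" observation concrete, the sketch below implements the kind of deliberately shallow baseline such findings point to: pick the passage sentence with the greatest word overlap with the question. This is a minimal illustration, not the method of Chen et al. (2016); the function name and toy passage are ours.

```python
import re

def most_relevant_sentence(question: str, passage: str) -> str:
    """Return the passage sentence sharing the most word types with the question."""
    tokenize = lambda text: set(re.findall(r"\w+", text.lower()))
    q_tokens = tokenize(question)
    sentences = re.split(r"(?<=[.!?])\s+", passage)
    return max(sentences, key=lambda s: len(tokenize(s) & q_tokens))

passage = ("The Eiffel Tower was completed in 1889. It was designed by Gustave Eiffel's "
           "engineering company. The tower is located on the Champ de Mars in Paris.")
print(most_relevant_sentence("Where is the Eiffel Tower located?", passage))
# -> "The tower is located on the Champ de Mars in Paris."
```

If a baseline this simple recovers a large share of a dataset's answers, the dataset is unlikely to require much reasoning beyond lexical matching.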
In NLI, 67% of SNLI (Bowman et al., 2015) and 53% of MultiNLI (Williams et al., 2017) can be solved without looking at the premise (Gururangan et al., 2018). The HANS dataset showed that models trained on MultiNLI actually learn to rely on shallow cues and can be fooled by syntactic heuristics (McCoy et al., 2019). Furthermore, the models trained on such datasets lack the lexical knowledge that would have enabled them to solve simple WordNet-based substitutions in the original data (Glockner et al., 2018).
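The hypothesis-only finding can be checked with a very simple probe: a classifier trained on hypotheses alone, with premises discarded. The sketch below is an illustrative assumption (toy data, a scikit-learn bag-of-words model), not the setup used by Gururangan et al. (2018); the point is that if such a premise-blind probe scores well above chance on real data, the labels leak through the hypotheses.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# (hypothesis, label) pairs only -- premises are deliberately discarded.
train_hypotheses = [
    ("A man is sleeping.", "contradiction"),
    ("Nobody is outside.", "contradiction"),
    ("A person is outdoors.", "entailment"),
    ("Someone is doing something.", "entailment"),
    ("The man is waiting for his friend.", "neutral"),
    ("A woman is on her way to work.", "neutral"),
]
texts, labels = zip(*train_hypotheses)

probe = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
probe.fit(texts, labels)

# Predictions made without ever seeing a premise.
print(probe.predict(["Nobody is sleeping.", "A person is outside."]))
```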
In commonsense reasoning, by definition, the challenge is to get the system to make decisions based on both the current context and some general knowledge about the world. However, in SemEval-2018 Task 11 (Ostermann et al., 2018) most participants did not use any extra knowledge sources, and one of them still achieved 0.82 accuracy vs. the 0.84 achieved by the ConceptNet-based winner. It is argued that large pre-trained language models already possess much of this knowledge: for instance, BERT (Devlin et al., 2018) achieved over 86% on SWAG (Zellers et al., 2018).
We will also mention the widespread methodological problem of under-reporting environment factors that may make as much difference as the proposed architecture changes. The effect of factors such as random seed, hardware, and library versions has been discussed for several QA datasets (Crane, 2018).
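One low-cost mitigation is to log such factors next to every reported score. The snippet below is a minimal sketch of that habit; the field names are an illustrative choice rather than a standard from the cited work, and in practice one would also record library versions and hardware details.

```python
import json, platform, random, sys

def experiment_record(seed: int, results: dict) -> dict:
    """Bundle results with the environment factors they depend on."""
    random.seed(seed)  # set (and report) the seed actually used
    return {
        "seed": seed,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "results": results,
    }

# `results` here is a placeholder for the actual metrics of a run.
print(json.dumps(experiment_record(seed=42, results={"dev_f1": None}), indent=2))
```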
Methodology developments and challenges. For existing datasets, simply removing annotation artifacts will not solve the problem, as it creates other exploitable artifacts (Gururangan et al., 2018). Among the recent improvements in dataset collection methodology are complex queries that require aggregating information from several sources (Dua et al., 2019; Kocisky et al., 2018). Reliance on shallow patterns could be reduced by paraphrasing, including adversarial paraphrasing with a model in the loop as an oracle that rejects questions that are too easy (Dua et al., 2019). Another alternative is balanced datasets with as many question types and genres as possible (Rogers et al., 2020). Diversity can also be somewhat improved with partly synthesized data (Labutov et al., 2018), but any templates or annotator examples are themselves potential sources of bias.
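The model-in-the-loop idea can be summarized in a few lines: candidate questions are only admitted if a baseline model fails on them. The sketch below is a simplified illustration; `baseline_answer` and the example schema are hypothetical placeholders rather than an API from any of the cited datasets, and real collection pipelines typically use softer criteria than exact-match failure.

```python
from typing import Callable, Iterable, List

def adversarial_filter(
    candidates: Iterable[dict],
    baseline_answer: Callable[[str, str], str],
) -> List[dict]:
    """Keep only the candidate questions that the baseline model answers incorrectly."""
    kept = []
    for ex in candidates:  # ex: {"passage": ..., "question": ..., "answer": ...}
        prediction = baseline_answer(ex["passage"], ex["question"])
        if prediction.strip().lower() != ex["answer"].strip().lower():
            kept.append(ex)  # questions the baseline solves are dropped; hard ones stay
    return kept
```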
Questions are more difficult if they are collected independently of the text (Kwiatkowski et al., 2019), written from summaries (Trischler et al., 2016), or written from hints. Finally, unanswerable questions (Rajpurkar et al., 2018), in conjunction with adversarial inputs, should also force the model to go beyond lexical pattern-matching.
A radically different direction is shifting to exclusively out-of-distribution evaluation (Linzen, 2020), e.g. with adversarial (McCoy et al., 2019) and multi-dataset (Fisch et al., 2019) evaluation. However, this still requires knowing the training distribution, which becomes particularly challenging with very large pre-trained models, where it is hard to guarantee that the test examples were not seen during pre-training (Brown et al., 2020).
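As a schematic illustration of the multi-dataset protocol, the sketch below trains on one dataset and reports scores only on evaluation sets drawn from other datasets. `train_model`, `evaluate`, and the dataset names in the usage comment are hypothetical placeholders standing in for whatever system and benchmarks are being compared.

```python
def out_of_distribution_report(train_model, evaluate, train_set, eval_sets: dict) -> dict:
    """Train on one dataset, then score only on evaluation sets from other datasets."""
    model = train_model(train_set)
    return {name: evaluate(model, data) for name, data in eval_sets.items()}

# Usage sketch (dataset variables are placeholders):
# out_of_distribution_report(train_model, evaluate,
#                            train_set=squad_train,
#                            eval_sets={"NewsQA": newsqa_dev, "TriviaQA": triviaqa_dev})
```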

Diversity efforts
The tutorial will be presented by an all-female team with a senior researcher and a post-doc as the lead organizer.
The survey will focus on English datasets, but we will provide references to the existing datasets in other languages that we are aware of.