ProtoQA: A Question Answering Dataset for Prototypical Common-Sense Reasoning

Given questions regarding some prototypical situation -- such as Name something that people usually do before they leave the house for work? -- a human can easily answer them via acquired experiences. There can be multiple right answers for such questions with some more common for a situation than others. This paper introduces a new question answering dataset for training and evaluating common-sense reasoning capabilities of artificial intelligence systems in such prototypical situations. The training set is gathered from an existing set of questions played in a long-running international trivia game show -- Family Feud. The hidden evaluation set is created by gathering answers for each question from 100 crowd-workers. We also propose an open-domain task where a model has to output a ranked list of answers, ideally covering all prototypical answers for a question. On evaluating our dataset with various competitive state-of-the-art models, we find there is a significant gap between the best model and human performance on a number of evaluation metrics.


Introduction
Humans possess the innate ability to implicitly reason using a wealth of shared background knowledge, much of which is acquired via experiences. For example, consider the question in Figure 1: "Name something that people usually do before they leave the house for work?". Humans can easily answer such questions with 'prototypical' answers, i.e. a set of answers that they commonly associate with situations invoked by the question. These questions require common-sense reasoning; however, because of its 'common' nature, such knowledge is often not explicitly captured in text (reporting bias) (Gordon and Van Durme, 2013). We present a new dataset to train and evaluate models about their common-sense knowledge of prototypical situations. In such situations, there are often multiple right answers, with some answers more prototypical (common) than others, thereby forming a distribution over them. For example, when we polled 100 people (Figure 1), popular answers to the previous question were 'shower/cleaning' (43) or 'breakfast' (30). However, we also received very reasonable answers in the tail such as 'lock door/grab keys' (7), 'say goodbye' (4) and 'pray' (1). We think that, for artificial intelligence (AI) systems to achieve human-level common-sense reasoning, they should be able to match the distribution over prototypical answers.
Figure 1: We focus on common-sense reasoning over prototypical situations when there could be many different answers but some are more common than others. Our task is in open-domain style (not multiple-choice format). Answers to a question are crowd-sourced from 100 workers and are then manually clustered into categories. To perform well, a model has to output a ranked list of answers covering multiple categories. Example questions shown: (i) Name something that people usually do before they leave for work? (ii) Name a piece of equipment that you are likely to find at your office and not at home? Example categories shown: phone (24), toothbrush/towels (17), clothing/shoes (15), keys (14), purse/wallet (14), accessories (8), charger (5).
* Equal contribution. Data and interactive demo available at http://protoqa.com
Our dataset and task are inspired by a source of naturally occurring common-sense questions (not developed for any particular NLP task) used in a competitive trivia game show, FAMILY-FEUD. FAMILY-FEUD is a long-running trivia game show which started on American television in 1976 and has been adopted internationally in more than 50 countries. The game show is played by asking participants questions such as those in Figure 1, and they receive points not for a "correct" answer, but when their answer matches the answers from other people surveyed, and in proportion to how many people gave that answer. What makes FAMILY-FEUD appealing is that the original answers to each question were collected by a professional polling company through a telephone survey of 100 different people all over the country and further clustered into meaningful categories (e.g. cantaloupe, honeydew, watermelons as 'melons'), thereby automatically giving us a distribution over both the answers and the underlying concepts those answers refer to.
We present a common-sense reasoning task in which a model has to output a ranked list of answers for each question. Our evaluation metrics (§ 3) are designed to encourage both popularity and diversity in answers, i.e. models are rewarded not only for predicting the more popular answers but also for covering all plausible answer categories. We present both a publicly available set of around 9.7K questions, each with 7-8 labeled answer categories and oriented towards this kind of commonsense knowledge, and a newly annotated and unseen test set for evaluation. This test set provides a set of 15,400 crowd-sourced human judgments over 154 new questions. We created these questions by perturbing existing FAMILY-FEUD questions to ensure that they do not occur in any iteration of the game show, while maintaining the same level of common-sense and prototypical reasoning needed to answer them (§ 2.4). The crowd-sourced answers were further categorized manually by two expert annotators, creating the same setup as in the original game show.
Recent common-sense reasoning benchmarks often use a multiple-choice paradigm, where the task is to identify the most plausible answer from a list of options (Mostafazadeh et al., 2016; Zellers et al., 2018; Talmor et al., 2019). However, it has been shown that language models (LMs) trained on large amounts of unlabeled data, such as BERT (Devlin et al., 2019), do exceedingly well on these datasets, achieving human-level accuracy on a few of them. To counter this, an adversarial filtering approach was proposed in which benchmarks evolve adversarially as new models are introduced, and the HELLASWAG dataset was created using powerful LMs such as BERT. Soon after its introduction, ROBERTA (Liu et al., 2019) improved upon the accuracy of the BERT model by 45 points. A primary reason for this is that generating hard negative examples for the multiple-choice format is challenging, even for humans (Schwartz et al., 2017; Gururangan et al., 2018; Poliak et al., 2018).
http://shorturl.at/guW34 http://shorturl.at/bhKS1 http://shorturl.at/rFMT2
Instead of the multiple-choice paradigm, we set up our task in an open-domain question answering (QA) format where a model has to output a ranked list of answers which is 'matched' to crowd-sourced answers in each category. While such an approach can penalize a correct model prediction when it does not match an existing reference answer, we counter this issue (a) by collecting 100 answer annotations per question, a number substantially higher than in any other work in open-domain QA, (b) by proposing evaluation metrics which use large lexical resources such as WordNet (Miller, 1995) to perform matching, and (c) by focusing upon methods to score ranked lists of answers, instead of focusing upon a top score. We suggest that open-domain evaluation of some common-sense reasoning tasks is a natural and realistic paradigm, one which shares the evaluation challenges found in various natural language generation tasks such as summarization (Radev et al., 2003) and translation (Callison-Burch et al., 2010), and that it should be an area of active research.
We evaluate this dataset on a variety of baseline models, from models built on a symbolic common-sense knowledge store such as ConceptNet (Speer et al., 2017), to QA models powered by large masked LMs such as BERT, to the direct prediction of answers in a language-modeling paradigm using a large GPT-2 LM (Radford et al., 2019). While most models perform quite poorly at this challenging task, when GPT-2 was fine-tuned using the FAMILY-FEUD training set, its performance improved drastically, although it remained significantly below human-level performance.
The contributions of this paper are as follows. (a) We introduce a large-scale QA dataset regarding common-sense knowledge of prototypical situations, and a rich evaluation set for models trained upon that data. (b) We present this as an open-domain task, and review a range of directions for robust evaluation in this open-domain setting, both with rich data (large sets of reference answers and clustering over answers) and with evaluation measures such as a WordNet-based similarity. (c) We design evaluation metrics that encourage models to provide diverse answers covering all plausible answer categories. (d) We evaluate existing models on our dataset, and reveal the strong ability of large contextualized language models when fine-tuned on this data. Finally, (e) we discuss the gap between model and human performance on this task, showing that it remains a challenging task for models, with room for improvement.

Training Corpus Collection
Questions were collected from three publicly available fan websites for the show Family Feud, which have transcribed questions from the show together with their answers. Well over 10,000 questions were collected with their answers, and a set of 9,762 questions remained after filtering, quality control, and de-duplication. That filtering included the omission of classes of questions (such as "name a vegetable") which did not evaluate interesting commonsense knowledge.
While any "commonsense" dataset inherently bears the risks of encoding culture-specific information and biases, questions regarding prototypical scenarios and prototypical behavior are naturally quite susceptible to this, and we note it as an important issue to be aware of regarding such data. A small subset of 29 questions which might be viewed as problematic or encoding stereotypes were explicitly labeled and will be released separately with the training data, so that one might evaluate the extent to which models trained on such a task might acquire undesirable biases.

Test Corpus Collection
In order to focus upon a rich, open-ended answer generation task, we collected 100 answers for each question from the crowd-sourcing platform FigureEight, and then provided rich double-annotated clustering over those answers. By gathering large sets of possible answers and clustering them, we can provide rough distributions over the expected answers, increasing the ability to recognize any way of expressing one of those answers (for example, https://protoqa.com/).
We gathered a test set of new questions with an eye towards maintaining the same domain and the same commonsense reasoning seen in the training data. In order to maintain similarity to existing questions, we removed a set of questions from the scraped data and then perturbed important aspects of them. For example, given an existing question of "Name something a person might forget to put on if they leave the house in a hurry.", changes of polarity and events would derive a related question "Name something that people usually do before they leave house for work". Deriving such unseen test questions was especially important to avoid the risk of having a publicly-available question be included in the training data for contextual language models; by making new data, we can be more confident that any high-performing model has not yet seen the data.
Having derived new questions, we then created tasks on FigureEight for each of those questions to be answered by 100 workers. To match the training data (which is inherently grounded in US culture), we limited workers to US locations. Low-quality workers were automatically detected through test questions during annotation, and the clustering pass provided a second manual quality control check. This left us with 154 questions which we split into a test set and dev set of 102 and 52 respectively.

Answer Clustering
After the initial collection of 100 answers for each question, we clustered the answers to each question. Each list was manually clustered by two different experts familiar with the task. The clusterings were generated separately and then compared, and a final clustering was agreed on. During this clustering phase, answers could also be marked as invalid, most commonly due to either low-quality annotators or a clear misunderstanding of a question. In order to keep these clusters roughly similar to the granularity of answers used in the training data and to avoid low-quality evaluation, we eliminated questions for which the 8 most popular clusters did not cover more than 85 of the 100 responses.
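The coverage filter described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name, the input format (a list of cluster sizes per question), and the example sizes are assumptions.

```python
def passes_coverage_filter(cluster_sizes, top_n=8, min_covered=85):
    """Keep a question only if its top_n largest clusters
    cover MORE than min_covered of the collected responses."""
    covered = sum(sorted(cluster_sizes, reverse=True)[:top_n])
    return covered > min_covered


# Hypothetical cluster sizes for two questions (each sums to 100 responses).
concentrated = [43, 30, 7, 6, 5, 4, 3, 1, 1]            # top 8 cover 99
scattered = [20, 15, 10, 10, 8, 7, 6, 5, 5, 5, 5, 4]    # top 8 cover 81

print(passes_coverage_filter(concentrated))
print(passes_coverage_filter(scattered))
```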
Since each set of answers was clustered twice and adjudicated, we measure the agreement with the cluster agreement metric BLANC (Recasens and Hovy, 2011; Luo et al., 2014), an extension of the Rand index used to score coreference clustering. Using this, the similarity between the clusters produced by any two annotators averaged out to a BLANC score of 83.17, suggesting a high level of agreement regarding the clustering of answers. Figure 2 illustrates how this crowd-sourced test set relates to the training data; the actual size of the largest clusters remains similar between the two datasets, but our data tends to have more clusters, generally capturing all possible answers within the top 8 clusters, but often using seven or eight clusters. More clusters provide a more relaxed evaluation, as we include more answer strings through the smaller clusters, which also provide us with more interesting answer strings.
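To make the agreement measure concrete, the sketch below computes the basic pairwise form of BLANC: the average of the F-scores over "same cluster" links and "different cluster" links. It assumes that both annotators clustered exactly the same list of answers, handles degenerate cases (no links of one kind) crudely by scoring them as 0, and does not implement the extension of Luo et al. (2014). The function names and label encoding are illustrative choices. Scores are on a 0-1 scale, so the figure reported above would correspond to 0.8317.

```python
from itertools import combinations


def f1(right, wrong_predicted, wrong_missed):
    """F-score from counts of correct, spurious, and missed pair decisions."""
    precision = right / (right + wrong_predicted) if right + wrong_predicted else 0.0
    recall = right / (right + wrong_missed) if right + wrong_missed else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def blanc(labels_a, labels_b):
    """labels_x[i] is the cluster id annotator x assigned to answer i."""
    rc = wc = rn = wn = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            rc += 1          # both annotators link the pair
        elif same_b:
            wc += 1          # only annotator b links the pair
        elif same_a:
            wn += 1          # only annotator a links the pair
        else:
            rn += 1          # both annotators keep the pair apart
    f_link = f1(rc, wc, wn)
    f_no_link = f1(rn, wn, wc)
    return (f_link + f_no_link) / 2


print(blanc([0, 0, 1, 1], [5, 5, 9, 9]))  # identical partitions
print(blanc([0, 0, 1, 1], [0, 0, 0, 1]))  # partial disagreement
```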

Exploration of Dataset
The data presented here involves a range of different types of commonsense knowledge. To examine the distribution of different kinds of reasoning, and to examine whether that distribution of reasoning varied between the publicly available data and the crowdsourced development and test set, we propose a small inventory of six types of commonsense reasoning often present in these questions.
These types consist of (1) MENTAL OR SOCIAL REASONING, (2) KNOWLEDGE OF PROTOTYPICAL SITUATIONS which one is familiar with, (3) REASONING ABOUT NOVEL, COMPLEX EVENTS, (4) NEGATION AND EXCEPTIONS and understanding their consequences, (5) SPECIFIC ENTITY KNOWLEDGE of named people, locations, or organizations, and finally (6) KNOWLEDGE OF HABITUAL ACTIVITIES of specific occupations or types of entities.
To study the distribution over the data, we took 25 questions from the training collection and 25 questions from the crowd-sourced development set, and marked each one with any number of the six categories which seemed necessary for the question as a simple approximation of prior works which examine the types of knowledge required for reasoning tasks (LoBue and Yates, 2011;Boratko et al., 2018). Table 1 illustrates examples of questions with these types, and one can see the frequency of each type used in Table 2. The counts shown for each dataset illustrate that while the creation methodology varied between the two resources, the kind of commonsense reasoning tasks evaluated by these models is quite similar between the two corpus types. The greatest difference to note is that the crowd-sourced data makes less use of questions regarding specific entities, which were avoided as they tended to involve fact-based world-knowledge rather than commonsense reasoning.

Evaluation
Recent commonsense reasoning benchmarks often use a multiple-choice paradigm where the task is to identify the most plausible answer from a list of options (Zellers et al., 2018). However, generating challenging negative examples is hard, so that often within months of the release of a dataset, models may achieve human-level or near-human-level performance, as occurred with BERT for the SWAG dataset and ROBERTA for the HELLASWAG dataset. Such issues highlight the difficulty of establishing robust and stable metrics using negative samples and adversarial methods alone. It has also been shown that generating negative examples is hard even for humans, who can inadvertently introduce annotation artifacts which models can easily identify in order to solve the task (Schwartz et al., 2017; Gururangan et al., 2018; Poliak et al., 2018).
An appealing alternative for benchmarking models is via open-domain answer generation tasks where the model has to generate the correct answers. This side-steps the need to find challenging negative examples. However, this paradigm introduces another challenge: the possibility of models being wrongly penalized for predictions not in the list of correct answers. This problem is also faced in other natural language generation tasks such as machine translation (MT), summarization and dialog generation.
Our solution to the above problem is to collect and cluster a large number of open-ended responses (100 crowd-sourced responses in our case). This is much higher than in other typical tasks (e.g. there are usually very few reference translations or summaries). Also, compared to summarization or MT, a prototypical answer for our task is only a word or a short phrase, making the problem less severe. Furthermore, as described shortly, we do not restrict ourselves to rigid exact matches and propose a similarity measure that uses synonyms from WordNet (Miller, 1995). The next subsections describe how we score a ranked list of model predictions with respect to the answer clusters containing the 100 crowd-sourced responses. We first describe a similarity function to compute the similarity between two strings (§ 3.1) and then describe how we score the ranked list while encouraging diversity in the answers (§ 3.2).

Table 1: Examples of questions with their reasoning types.
Question | Example Answers | Types
Name a profession where you might be fired if you lost your voice | radio host, teacher | 3, 4, 6
Name something a boy scout might learn. | knot tying, camping | 2, 5, 6
Name a bad sport for someone who is afraid of the water. | diving, water polo | 1, 3, 6
Name something a monk probably would not own. | weapons, smartphone | 2, 4, 6
Name something parents tell their kids not to do | steal, smoke | 1, 2, 4, 6
Name a reason why someone would wear gloves | cold weather, cleaning | 2, 3

WordNet based similarity
With the large number of raw answers retrieved for each question, exact string matching of a new answer to those in each answer cluster works surprisingly well. Still, it is clear that reasonable answer strings (e.g. synonyms or slightly embellished phrases) may be incorrectly marked as wrong with such a stringent matching criterion. METEOR (Banerjee and Lavie, 2005; Lavie and Denkowski, 2009) addressed similar issues in machine translation via stemming and synonym matching. We take a similar approach, expanding the set of answer clusters using WordNet synsets, and comparing all possible partitions of the tokenization of the raw strings. For more details, please see the appendix.
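One possible reading of this matching procedure is sketched below. It is not the implementation from the appendix: a small hand-written synonym table (`SYNSETS`, hypothetical) stands in for WordNet synsets, and two strings are treated as matching when some split of each into contiguous chunks yields the same sequence after synonym normalization. All names here are illustrative assumptions.

```python
from itertools import combinations

# Hypothetical toy synonym groups standing in for WordNet synsets.
SYNSETS = [
    {"cell phone", "phone", "mobile"},
    {"purse", "handbag"},
]


def canonical(chunk):
    """Map a chunk to a shared id if it belongs to a synonym group."""
    for idx, group in enumerate(SYNSETS):
        if chunk in group:
            return f"syn{idx}"
    return chunk


def partitions(tokens):
    """Yield every split of the token list into contiguous chunks."""
    n = len(tokens)
    for n_cuts in range(n):
        for cuts in combinations(range(1, n), n_cuts):
            bounds = (0, *cuts, n)
            yield [" ".join(tokens[a:b]) for a, b in zip(bounds, bounds[1:])]


def normal_forms(text):
    """All synonym-normalized chunk sequences for a string."""
    tokens = text.lower().split()
    return {tuple(canonical(c) for c in part) for part in partitions(tokens)}


def similar(prediction, reference):
    """True if any partition of the two strings normalizes identically."""
    return bool(normal_forms(prediction) & normal_forms(reference))


print(similar("my cell phone", "my mobile"))
print(similar("purse", "wallet"))
```

Enumerating partitions lets a multi-word expression such as "cell phone" be looked up as a single unit, which token-by-token comparison would miss.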

Encouraging Diversity of Answers
As mentioned before, we want to design evaluation metrics that favor models that can cover all plausible answer categories and not just predict the most popular answer. We first compute an alignment score between each answer in the ranked list and each of our answer clusters. The alignment score is computed as the maximum score between the predicted answer string and any reference string present in the cluster, scaled by the size of the cluster. After computing the alignment scores between all pairs of answers and clusters, we employ the Hungarian matching algorithm (Kuhn, 1955; Munkres, 1957) to compute the exact optimal matching of answers to clusters. It is worth noting that a model which produces a ranked list of answers falling in only one cluster will be penalized, while a model which maximally covers all plausible clusters will achieve the maximum score. Lastly, to make the comparison between lists of different lengths uniform, we propose the following metrics.
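The matching step described above can be illustrated with a small sketch. Several things here are assumptions: exact string match stands in for the similarity function, "scaled by the size of the cluster" is read as multiplication by the cluster size, and the member strings of each cluster are invented (only the cluster sizes 43, 30 and 7 come from the running example). To stay dependency-free, the sketch finds the optimal one-to-one assignment by exhaustive search rather than with the Hungarian algorithm; both return an optimal matching, and a library routine such as SciPy's `linear_sum_assignment` would normally be used at realistic sizes.

```python
from itertools import permutations


def exact_match(pred, ref):
    return 1.0 if pred.strip().lower() == ref.strip().lower() else 0.0


def alignment_scores(answers, clusters, sim=exact_match):
    """scores[i][j]: best similarity of answer i to any string in
    cluster j, multiplied by the size of cluster j."""
    return [[max(sim(a, ref) for ref in refs) * size for refs, size in clusters]
            for a in answers]


def best_matching(scores):
    """Exhaustively search for the one-to-one answer/cluster assignment
    with the highest total score (stand-in for Hungarian matching)."""
    n_answers, n_clusters = len(scores), len(scores[0])
    if n_answers <= n_clusters:
        candidates = (list(zip(range(n_answers), p))
                      for p in permutations(range(n_clusters), n_answers))
    else:
        candidates = (list(zip(p, range(n_clusters)))
                      for p in permutations(range(n_answers), n_clusters))
    best_total, best_pairs = float("-inf"), []
    for pairs in candidates:
        total = sum(scores[a][c] for a, c in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return best_total, best_pairs


# (reference strings, cluster size); member strings are hypothetical.
clusters = [(["shower", "clean"], 43),
            (["breakfast", "eat"], 30),
            (["lock door", "grab keys"], 7)]

diverse = alignment_scores(["shower", "breakfast", "grab keys"], clusters)
narrow = alignment_scores(["shower", "clean", "take a shower"], clusters)
print(best_matching(diverse)[0])  # one answer per cluster
print(best_matching(narrow)[0])   # all answers compete for one cluster
```

Because each cluster can be matched to at most one answer, the second list earns credit for the largest cluster only once, which is the penalty for low diversity described above.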

MAX ANSWERS@k limits the total number of answers allowed to up to k answers.
[...] et al., 2017), a QA-based model which retrieves related posts in a discussion forum for each question, and a language-modeling baseline which examines how well modern pre-trained language models do at directly producing the answers.

Knowledge Base Baseline
ConceptNet (Speer et al., 2017) is a knowledge base containing common-sense triples which has been shown to be helpful for various downstream tasks (Zhong et al., 2019; Wang et al., 2019). This makes it a good potential source for solving this task, as well as for assessing how well this dataset captures existing notions of common sense. For example, with the question 'Besides music, name something you might hear on a morning radio show' and the answer 'weather', the ConceptNet triples (listen to radio, Cause, you hear local weather report) and (listen to radio, HasSubevent, hear weather report) provide valid support for the answer 'weather report'. We use this idea to build the ConceptNet baseline, a purely symbolic baseline which can be evaluated without the need for training. Given a question, we extract a list of key words from the question by removing stop words. We perform the same key word extraction procedure for term1 and term2 of the ConceptNet triples. We then compare the question key words against term1 of each ConceptNet triple. If any words overlap between the question key words and the term1 key words, we return the corresponding term2 as a possible answer to the question.
The answers are further ranked by the corresponding ConceptNet triple score. Note that this is a fairly generous baseline since the model is able to return an unlimited number of answers; however, the resulting answers will also be noisy, which prevents it from performing well on the Max Incorrect ranking task. The set intersection score gives some indication of the overlap between the common sense captured in ConceptNet and that captured in this dataset, and provides an idea of how much could be gained by a more sophisticated ranking of the outputs.
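The retrieval procedure of this baseline can be sketched on a toy scale. The stop-word list, the triple weights, and the third triple are made up for illustration, and the real baseline operates over the full ConceptNet graph; only the two radio triples come from the example above.

```python
import re

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"a", "an", "the", "to", "on", "you", "might",
              "something", "name", "besides"}


def keywords(text):
    """Lower-case the text and keep the words that are not stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}


def conceptnet_answers(question, triples):
    """Return term2 of every triple whose term1 shares a key word with
    the question, ranked by triple weight (highest first)."""
    q_words = keywords(question)
    hits = [(weight, term2)
            for term1, _relation, term2, weight in triples
            if q_words & keywords(term1)]
    return [term2 for _weight, term2 in sorted(hits, key=lambda h: -h[0])]


# (term1, relation, term2, weight); weights are invented.
triples = [
    ("listen to radio", "Cause", "you hear local weather report", 1.0),
    ("listen to radio", "HasSubevent", "hear weather report", 2.0),
    ("bake cake", "UsedFor", "celebrate birthday", 3.0),
]

question = "Besides music, name something you might hear on a morning radio show"
print(conceptnet_answers(question, triples))
```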

Question-Answering Baseline
As this dataset is in the form of questions and answers, it may be treated as a QA dataset, although the content is far from the fact-based data usually modeled in QA tasks. As the training set only shows answers out of context, one must use distant supervision in order to train a QA model on the data, a well-explored situation in modern QA work (Joshi et al., 2017).
We should note that, unlike in factoid-based QA, one may expect there to be a limit on the performance of such models, as commonsense data is well known to have a reporting bias (Gordon and Van Durme, 2013) wherein many parts of general knowledge are never explicitly stated in text. Because of that, models trained in this paradigm cannot be expected to find explicit statements of the generalizations (which would often be left unstated) but can only hope to learn how to identify situations where a particular fact is presupposed or entailed.
To train a model in this approach, we collected a set of 85,781 documents by using a web search for each question. All searches were constrained to Reddit, which contains a large amount of advice and personal narratives of a domain useful for the task. For any post matching that query, any strings matching an answer to that question in the training data would be treated as a positive example for the QA model. Table 4 illustrates the kind of examples found for a single query "name something you do at a concert", which illustrates that while many examples are roughly correct in that they address those activities in a concert environment, learning a QA model from them is more difficult.
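The distant supervision step can be illustrated with a short sketch: every occurrence of a known training answer inside a retrieved passage becomes a positive span. The matching rule used here (case-insensitive, whole-word) and the example passage are assumptions, not details given in the paper.

```python
import re


def distant_spans(passage, answers):
    """Return (start, end, answer) for every whole-word, case-insensitive
    occurrence of a known answer string in the passage."""
    spans = []
    for answer in answers:
        pattern = r"\b" + re.escape(answer) + r"\b"
        for match in re.finditer(pattern, passage, flags=re.IGNORECASE):
            spans.append((match.start(), match.end(), answer))
    return sorted(spans)


# Hypothetical passage for the query "name something you do at a concert".
passage = "We sing and dance at every concert."
print(distant_spans(passage, ["sing", "dance", "clap"]))
```

Spans found this way are noisy labels: the string may appear in a context that does not actually answer the question, which is part of why learning a QA model from them is difficult.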
For the baseline results reported here, we fine-tune the "Bert for QA" model of the transformers package (Wolf et al., 2019) designed for SQuAD 2.0 (Rajpurkar et al., 2018), fine-tuning BERT-large (Devlin et al., 2019). At test time, the model was applied to all passages of all Reddit threads found in the first page of the search query for the question, and the 20-best scores from each passage were combined, reporting a ranked list using the summed scores.
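The test-time aggregation can be sketched as summing the n-best scores per answer string across passages. The input format and the lower-casing of answer strings before summing are assumptions made for this illustration.

```python
from collections import defaultdict


def combine_nbest(per_passage_nbest):
    """per_passage_nbest: one list per passage of (answer_text, score).
    Returns answer strings ranked by their summed score."""
    totals = defaultdict(float)
    for nbest in per_passage_nbest:
        for text, score in nbest:
            totals[text.strip().lower()] += score
    return sorted(totals, key=totals.get, reverse=True)


# Hypothetical n-best lists from two passages.
ranked = combine_nbest([
    [("sing", 0.5), ("dance", 0.3)],
    [("Dance", 0.4), ("cheer", 0.1)],
])
print(ranked)
```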

Language Model Baseline
We also report a language model generation baseline, given the improved representational power of language models. The baseline uses the GPT2 large model (Radford et al., 2019) in the Hugging Face PyTorch implementation (Wolf et al., 2019). We obtain answer predictions from the language model either with or without fine-tuning on our training data.
We transform the original question using hand-designed transformation rules so that it is compatible with the GPT2 training data. For example, "Name something people do when they wake up." → "One thing people do when they wake up is ...". The hand-designed rules are given in the appendix of the paper. The transformed question is used as the input to the language model; GPT2 is expected to finish the sentence, and we take the generated tokens as our predicted answer. The reported fine-tuned model is trained on the scraped corpus, with the best model selected based on performance on our annotated dev data. In order to generate diverse answers for a given sentence, we use Nucleus Sampling (Holtzman et al., 2019) as our decoding method. We draw 300 sampled answers for each question and then group them by counts. The returned list is ranked by each answer's number of occurrences in the 300 samples.
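The two non-neural steps of this baseline, rewriting the question into a prompt and ranking sampled completions by frequency, can be sketched as follows. Only the single rewrite rule shown in the example above is implemented; the regular expression and the normalization of samples (strip and lower-case) are assumptions, and the sampled strings are hypothetical stand-ins for model output.

```python
import re
from collections import Counter


def to_prompt(question):
    """Apply one illustrative rewrite rule:
    'Name something X.' -> 'One thing X is'."""
    match = re.match(r"Name something (.+?)[.?]?$", question.strip())
    return f"One thing {match.group(1)} is" if match else question


def rank_by_count(samples):
    """Rank distinct sampled answers by how often they were generated."""
    counts = Counter(s.strip().lower() for s in samples if s.strip())
    return [answer for answer, _count in counts.most_common()]


print(to_prompt("Name something people do when they wake up."))
# Hypothetical completions standing in for the 300 nucleus samples.
print(rank_by_count(["coffee", "shower", "Coffee ", ""]))
```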

Human Performance
As can be seen from Table 3, the fine-tuned GPT2 model, which independently samples answers, performs the best. To benchmark human performance against such models, we collected 30 human responses per question and aggregated them by counts (like the GPT2 predictions). The last column in Table 3 reports the human performance.

Discussion
It is clear from Table 3 that both the KB and QA based models significantly underperform on our dataset. The low performance of the KB baseline hints at the low coverage in ConceptNet of the knowledge required for answering the prototypical scenarios of our questions. Similarly, powerful QA models and large LMs that are trained on large corpora also seem to lack the common-sense knowledge required to answer the questions. Surprisingly, the performance of the GPT2 model that was further fine-tuned on our training data improved significantly, suggesting the usefulness of our accompanying training set. However, human performance on all our metrics significantly exceeds that of all baselines, suggesting large scope for improvement.

Related Work
A wide variety of commonsense reasoning datasets now exist, although none address the same kind of commonsense generalizations evaluated here. Datasets exist evaluating plausible contexts, reasons or results for physical commonsense, social reasoning, visual question answering and abductive reasoning (Bhagavatula et al., 2019; Huang et al., 2019; Sap et al., 2019b; Zellers et al., 2018), but differ in evaluating against negative samples during evaluation. The ATOMIC dataset (Sap et al., 2019a) is more similar to the dataset proposed here; while ATOMIC focuses on if-then reasoning (such as the resultant states and motivations of participants), it also assumes an open-domain task of freely predicting strings, although its authors evaluate using human assessment. Related veins of work study commonsense reasoning and inference or entailment (Zhang et al., 2017; Bowman et al., 2015; Roemmele et al., 2011; Levesque et al., 2012).
This particular dataset might be said to study generalizations, or prototypical events and situations. This naturally has connections to the modeling of scripts and frames (Schank and Abelson, 1977; Chambers and Jurafsky, 2009; Fillmore et al., 1976; Ferraro and Van Durme, 2016), but we assume no need to predict latent structures.

Conclusion
We have presented a new common sense dataset with many novel features. The inclusion of counts over clusters of answers provides a very rich structure to train and evaluate with. The collection of a large set of answers and a proposed automated method of assigning answers to clusters facilitates an open-ended style of evaluation, which is often the desired use-case for these models. As shown in Table 3, existing fine-tuned state-of-the-art models still have a way to go before they can model the distribution of this common sense data.

Future Work
In addition to the elements of this task which are appealing from a common sense modeling perspective, the inherent appeal of this task to humans opens a number of possibilities for future data collection and evaluation. While the high availability of crowd-source workers has led to great progress in dataset generation, it is not without its flaws, and weeding out poor-quality responses is often nontrivial for more interesting tasks. On the other hand, millions of people play this game as an app on their phones, not for any monetary gain but simply for their own enjoyment. In the future we propose to collect more data by creating a form of this game, leveraging people's natural interest and enjoyment and using mechanism design to encourage high-quality answers to more common sense questions.