GLUCOSE: GeneraLized and COntextualized Story Explanations

When humans read or listen, they make implicit commonsense inferences that frame their understanding of what happened and why. As a step toward AI systems that can build similar mental models, we introduce GLUCOSE, a large-scale dataset of implicit commonsense causal knowledge, encoded as causal mini-theories about the world, each grounded in a narrative context. To construct GLUCOSE, we drew on cognitive psychology to identify ten dimensions of causal explanation, focusing on events, states, motivations, and emotions. Each GLUCOSE entry includes a story-specific causal statement paired with an inference rule generalized from the statement. This paper details two concrete contributions: First, we present our platform for effectively crowdsourcing GLUCOSE data at scale, which uses semi-structured templates to elicit causal explanations. Using this platform, we collected 440K specific statements and general rules that capture implicit commonsense knowledge about everyday situations. Second, we show that existing knowledge resources and pretrained language models do not include or readily predict GLUCOSE's rich inferential content. However, when state-of-the-art neural models are trained on this knowledge, they can start to make commonsense inferences on unseen stories that match humans' mental models.


Introduction
Humans make countless implicit commonsense inferences about everyday situations. For example, consider the following short story from the ROCStories corpus (Mostafazadeh et al., 2016): Gage was riding his bike. A car turned in front of him. Gage turned his bike sharply. He fell off of his bike. Gage skinned his knee. When even young children read this story, they construct a coherent representation of what happened and why, combining information from the text with relevant background knowledge (Kintsch and Van Dijk, 1978). For example, they can construct the causal chain that explains how the car's unexpected turn ultimately led to Gage falling, describe how Gage's emotion and location changed throughout, and even hypothesize that he likely shouted for help after falling.
* Current affiliation: Verneek, Inc.
Though humans build such mental models with ease (Zwaan et al., 1995), AI systems for tasks such as reading comprehension and dialogue remain far from exhibiting similar commonsense reasoning capabilities. Two major bottlenecks have been acquiring commonsense knowledge and successfully incorporating it into state-of-the-art AI systems. To address the first bottleneck, we have built an effective platform to acquire causal commonsense knowledge at scale. To address the second, we show that pre-trained neural models can start making similar inferences when trained on such rich curated data.
We introduce the GLUCOSE (GeneraLized and COntextualized Story Explanations) dataset. Given a short story and a sentence X in the story, GLUCOSE captures ten dimensions of causal explanation related to X. These dimensions, inspired by human cognitive psychology, cover often-implicit causes and effects of X, including events, location, possession, and other attributes, the vast majority of which are not captured by existing resources and models. Importantly, GLUCOSE encodes commonsense knowledge in the form of semi-structured inference rules (mini-theories about the world), each grounded in a specific story. As the examples in Table 1 illustrate, each rule is grounded in a particular context. To facilitate acquisition at scale, we designed an effective multi-stage crowdsourcing platform. Using this platform, we acquired 440K GLUCOSE annotations in the context of children's stories, which will be released with this paper. Our analysis shows that these explanations extend substantially beyond the scope of the existing knowledge resources.
Given the breadth of commonsense knowledge needed for real-world inference tasks, no static knowledge source is expected to provide sufficient coverage. GLUCOSE's key contribution is enabling models to dynamically produce general inference rules to explain novel scenarios. To systematically evaluate such models, we present an evaluation task where given a story S, a sentence X, and dimension d, a model predicts relevant specific and general rules as captured in GLUCOSE. We evaluate on the task using a curated test set, based on novel stories not used for any training purposes.
We show a strong correlation between human and automatic evaluation metrics, which makes systematic and reliable evaluation of models feasible. We show that pre-trained neural models perform poorly on the task; however, when finetuned on GLUCOSE data, they are able to generate commonsense explanations that rival humans'. This finding supports our hypothesis that a promising recipe for giving machines commonsense is to use high-quality commonsense knowledge for training neural models that have pre-existing lexical and conceptual knowledge.

Related Work
One well-known type of commonsense knowledge is script knowledge, defined by Schank and Abelson (1977) as structured knowledge about stereotypical event sequences and their participants. However, manual encoding of such knowledge is notoriously unscalable and brittle. A more recent line of work is unsupervised learning of "narrative schemas" (Chambers and Jurafsky, 2008, 2009; Balasubramanian et al., 2013; Sha et al., 2016), where common event sequences are automatically induced from large corpora. While promising, this approach has not produced high-quality knowledge usable for downstream tasks at scale (Mostafazadeh et al., 2016). Furthermore, since commonsense knowledge is often implicit, such corpus-based methods are unlikely to induce implicit commonsense inferences (Gordon and Van Durme, 2013). In contrast, our data collection framework enables us to acquire high-quality and robust commonsense knowledge, including often unstated rules such as "Someone_A gives Someone_B Something_A Results in Someone_B possesses Something_A" or "Someone_A is at Somewhere_A Enables Someone_A puts Something_A at Somewhere_A."
The most fruitful efforts to date for acquiring commonsense knowledge have been crowdsourced knowledge resources. ConceptNet (Speer et al., 2017), a partially crowdsourced resource, is a relational knowledge graph that connects short natural-language phrases via semantic edges. Most ConceptNet knowledge is taxonomic, consisting of factoids like "apple is a fruit"; however, it also includes some causal relations, e.g., "kill is motivated by revenge." Despite its broad coverage, ConceptNet has been found to be noisy. Its knowledge also lacks context, hampering accurate application at inference time, e.g., "kill requires eat breakfast" is hard to make sense of without more context.
A more directly relevant resource is ATOMIC, which consists of 877K textual descriptions of if-then knowledge. Each entry describes a likely cause/effect of one of 24K+ events. ATOMIC entries are organized into nine categories such as xIntent (PersonX's intention) and xEffect (effect on PersonX). For instance, "PersonX makes PersonY's coffee xEffect PersonX gets thanked". ATOMIC is a step forward in acquiring high-quality inferential knowledge. However, it has two main shortcomings. First, ATOMIC is non-contextual and conflates knowledge about an event that may have occurred under different scenarios, which hinders interpreting and applying the knowledge in context. For example, the event "PersonX arrives the next day" has xIntents "to go on vacation" and "to attend a reunion," and xEffects "get time to relax" and "meet some friends." Although each xIntent should be associated with only one of the xEffects, such dependencies are not encoded in ATOMIC. As a result, ATOMIC cannot be used to determine which xEffect is more likely given an xIntent. GLUCOSE addresses this by grounding each piece of inferential knowledge to a particular story context consistent across dimensions.
Second, events and relations in ATOMIC are person-centric; agentless events are not covered, and each relation is either about PersonX or PersonY. As a result, ATOMIC cannot describe events involving common entity types such as places, things, or groups of people, nor can it encode causes and effects other than to PersonX and their peers. In GLUCOSE, sentence X can describe any event/state, and GLUCOSE general rules can refer to indexed variables such as "Someone_A" or "Somewhere_C." Beyond these major shortcomings, ATOMIC also does not cover many commonsense knowledge types in GLUCOSE, including change of attributes such as location, which will be further discussed in Section 4.3.

The Knowledge Model of GLUCOSE
GLUCOSE has a unique take on explaining story events. As illustrated in Table 1, each story is explained through ten causal dimensions. The semistructured explanation for each dimension includes both a specific statement and a general rule.

Causal Dimensions of Explanation
One of our main contributions is the identification of ten causal dimensions of explanation in the context of narratives, for which we can reliably collect high-quality data from lay crowd workers. Cognitive psychology research on human comprehension of narratives (Kintsch and Van Dijk, 1978; Zwaan and Radvansky, 1998; Grazzani et al., 2018) suggests that humans primarily focus on events, their timeline, locations of entities throughout the story, causes and motivations of events, and emotional trajectory of characters. Based on this research, GLUCOSE dimensions are designed to focus on causal reasoning around events and states, eliciting event causal chains, character motivations, emotions, naive psychology, and change of attributes such as location and possessions to core story entities. For an event or state X stated in a sentence, we categorize the dimensions of causality into events and states happening before X and those occurring after X. Each category includes five dimensions, as shown in Table 1. The precise definition and scope of these ten dimensions are the result of multiple pilot studies with crowd workers to identify intuitive and distinguishable causal dimensions, so that the overlap among dimensions is minimized and the agreement among workers is maximized.

Semi-structured Inference Rules
To uncover what constitutes a good explanation, we ran several pilot studies exploring how people define, generate, and present explanations about short stories. We concluded that in order to achieve some consensus among explanations and to facilitate further processing and evaluation, the explanations should not be entirely free-form. Instead, we represent them as semi-structured inference rules whose expressivity lies between free text and logical forms. Each rule takes the form "antecedent connective consequent," where the antecedent and consequent are composed by filling in syntactic slots for subject, verb, object(s), and preposition(s). For some dimensions, slot-filling involves choosing from a predefined list, e.g., dimension 2, which states a motivating emotion or basic human drive, limits its verb choices to feel, want, like. Details regarding the slots can be found in Appendix A.
To eliminate the need for pronoun resolution when applying our general rules, variables are indexed, such as "Someone_A" and "Something_A and Something_B", to refer to the same entities on both sides of the rule. Each variable can be further elaborated using an attribute phrase in the form of a relative clause, e.g., "Somewhere_C (that is Someone_A's location)." Our studies indicate that this format gives the explainers sufficient expressivity to convey their reasoning, yet constrains the resulting explanations enough to identify commonalities between them. Note that the semi-structured rules are deterministically converted to natural language form by simply concatenating all the filled slots. Table 1 shows examples of semi-structured GLUCOSE explanations.
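The slot-concatenation step described above can be sketched as follows. This is a minimal illustration, not the released GLUCOSE tooling: the `Clause` and `Rule` classes are hypothetical names, the slot inventory is simplified to an ordered list of strings, and the `Someone_A` notation stands in for the subscripted variables.

```python
from dataclasses import dataclass


@dataclass
class Clause:
    """One side of a rule: filled syntactic slots in surface order (hypothetical type)."""
    slots: list[str]

    def render(self) -> str:
        # Concatenate the filled slots; empty slots are skipped.
        return " ".join(s for s in self.slots if s)


@dataclass
class Rule:
    """A rule of the form 'antecedent connective consequent' (hypothetical type)."""
    antecedent: Clause
    connective: str
    consequent: Clause

    def render(self) -> str:
        return f"{self.antecedent.render()} {self.connective} {self.consequent.render()}"


# General rule quoted earlier in the paper, with indexed variables.
general = Rule(
    antecedent=Clause(["Someone_A", "gives", "Someone_B", "Something_A"]),
    connective="Results in",
    consequent=Clause(["Someone_B", "possesses", "Something_A"]),
)
rendered = general.render()
print(rendered)
```

Because the variables carry explicit indices, the rendered string needs no pronoun resolution: every mention of `Someone_B` refers to the same entity on both sides of the connective.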

Generalized and Contextualized
Each GLUCOSE explanation is stated both as a specific statement (grounded in a given context) and a corresponding general rule (applicable to other contexts). Research in cognitive psychology suggests that humans typically choose which of an event's many causes to cite based on its relevance to the context (Miller, 2019). Hence, grounding explanations in context is crucial for acquiring accurate explanations. Furthermore, it has been shown that human explanations take situation-specific information and link it to pre-existing knowledge about the world; people explain by appealing to broader theories that enable generalization (Lombrozo, 2006). Also, there is evidence that explanations and generalizations help scaffold cognitive development in humans (Busch et al., 2018), which can potentially play a role in the learning capabilities of AI systems as well. By explicitly stating general rules as mini-theories of how the world works, GLUCOSE seeks to enable better generalization and causal reasoning in future AI systems.

Data Acquisition Platform
To enable developing models that can build mental models of narratives, we aimed to crowdsource a large, high-quality dataset. Beyond the scalability benefits, using crowd workers (as opposed to a small set of expert annotators) ensures diversity of thought, thus broadening coverage of a commonsense knowledge resource.
The annotation task is complex: it requires annotators to understand different causal dimensions in a variety of contexts and to come up with generalized theories beyond the story context. For strict quality control, we designed a three-stage knowledge acquisition pipeline for crowdsourcing the GLUCOSE dataset on the Amazon Mechanical Turk (MTurk) platform. The workers first go through a qualification test where they must score at least 90% on 10 multiple-choice questions on select GLUCOSE dimensions. Next, qualified workers can work on the main GLUCOSE data collection task: given a story S and a story sentence X, they are asked to fill in (allowing for non-applicable) all ten GLUCOSE dimensions, getting step-by-step guidance from our designed UI. To ensure data consistency, the same workers answer all dimensions for an S, X pair. Finally, the submissions are reviewed by an expert who rates each worker on a scale from 0 to 3, and provides feedback on how to improve. Our final UIs are

Dataset Composition and Statistics
Our source of stories for the GLUCOSE dataset is ROCStories (Mostafazadeh et al., 2016). ROCStories consists of crowdsourced five-sentence everyday stories rich in causal and temporal relations, making them ideal for acquiring commonsense knowledge. We focus on children's stories due to their simpler language and concepts. We computed an estimated target age (footnote 5) for each story and sampled from the 5-8 age group. To ensure diverse viewpoints and hypotheses, each S, X pair was assigned to three workers. Data collection statistics are shown in Table 2 and Figure 1.
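The footnote on target age names the Coleman-Liau Index as one of the readability tests used. As a rough illustration of what such a test computes, the sketch below applies the standard Coleman-Liau formula to the example story from the introduction. The tokenization is a simplifying assumption, and how the authors combined this index with the other tests and with age-of-acquisition norms is not stated in the paper.

```python
import re


def coleman_liau_index(text: str) -> float:
    """Coleman-Liau grade level: 0.0588*L - 0.296*S - 15.8.

    L = letters per 100 words, S = sentences per 100 words.
    Word and sentence splitting here are naive (an assumption of this sketch).
    """
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    letters = sum(ch.isalpha() for w in words for ch in w)
    letters_per_100 = letters / len(words) * 100
    sentences_per_100 = len(sentences) / len(words) * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


story = (
    "Gage was riding his bike. A car turned in front of him. "
    "Gage turned his bike sharply. He fell off of his bike. "
    "Gage skinned his knee."
)
grade = coleman_liau_index(story)
print(round(grade, 2))
```

The very low grade level for this story is consistent with the short words and short sentences that motivate sampling stories aimed at young readers.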

Comparison to Other Resources
To assess the novelty of GLUCOSE knowledge, we compared its coverage against that of the two most relevant commonsense resources: ConceptNet and ATOMIC (footnote 6). We performed a best-effort mapping from GLUCOSE dimensions to relations in ConceptNet and ATOMIC.
Since all three resources contain mostly natural-language entries, it is not possible to automatically quantify their precise overlap, so we adopted a lenient evaluation scheme. For each GLUCOSE general rule A relation B, we queried each target resource for tuples R'(A', B'), where R' is the resource's mapped equivalent of relation, and A' and B' consist of just the main verb in A and B. Using fuzzy matching on A' and B', we retrieved a large number of hits for the query, then filtered to those with >50% lexical overlap with the GLUCOSE rule. The results, shown in Table 3, represent a ceiling in overlap with other resources. The results indicate that GLUCOSE captures extensive commonsense knowledge unavailable in existing resources.
Footnote 4: Our pilot studies helped narrow our dimensions from 18 down to 10, which workers could reliably distinguish. Notably, we collapsed Enable and Cause, on which workers had significant disagreement.
Footnote 5: Target age was judged by age-of-acquisition and readability tests: Flesch-Kincaid Grade Level, the Coleman-Liau Index, and the Dale-Chall formula (Kuperman et al., 2012).
Footnote 6: Note that Rashkin et al. (2018a) and Rashkin et al. (2018b) are in essence a subset of ATOMIC, and hence have even lower coverage compared with GLUCOSE.
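The filtering step of the lenient coverage check can be sketched as below. The paper does not define "lexical overlap" precisely, so this sketch assumes the fraction of the rule's distinct lowercase tokens that also appear in a retrieved entry; the function names and the candidate strings are hypothetical and are not real ConceptNet or ATOMIC entries.

```python
def lexical_overlap(rule: str, candidate: str) -> float:
    """Fraction of the rule's distinct tokens found in the candidate (assumed definition)."""
    rule_tokens = set(rule.lower().split())
    candidate_tokens = set(candidate.lower().split())
    if not rule_tokens:
        return 0.0
    return len(rule_tokens & candidate_tokens) / len(rule_tokens)


def filter_hits(rule: str, hits: list[str], threshold: float = 0.5) -> list[str]:
    """Keep retrieved entries whose overlap with the rule exceeds the threshold."""
    return [hit for hit in hits if lexical_overlap(rule, hit) > threshold]


# Hypothetical query result for one side of a general rule.
rule_text = "someone gives someone something"
retrieved = ["someone gives something to someone", "someone eats breakfast"]
kept = filter_hits(rule_text, retrieved)
print(kept)
```

A set-based measure like this is deliberately generous, which matches the paper's framing of the reported numbers as a ceiling on overlap.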

Empirical Evaluation Task
We set up a standalone evaluation task for evaluating models that predict GLUCOSE explanations: given a story S, a story sentence X, and a dimension d, provide an explanation in both specific and general forms.
Test Set Curation. For a test set on commonsense reasoning to offer accurate and reliable evaluation, it should contain unambiguous examples with clear gold answers. This led to a curation process that identifies examples on which humans have high agreement, as follows: we sampled S, X pairs annotated by any three workers with the highest quality rating. A dimension d for S, X was allowed into the test set if 1) d was annotated by all three workers, and 2) the three specific statements had a round-robin average sentence-level BLEU (Lin and Och, 2004) score above 0.75. Finally, two in-house annotators manually removed cases with typographical or core content errors, resulting in a test set of 500 story/sentence pairs, each with 1-5 dimensions answered.
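The agreement filter can be sketched as follows, under two labeled assumptions. First, "round-robin" is read here as scoring each worker's statement against the other two as references and averaging the three scores. Second, the scorer below is a simple clipped unigram precision used as a stand-in so the example runs without dependencies; it is not sentence-level BLEU, and a real BLEU implementation would be passed in through the `score` parameter.

```python
from collections import Counter
from statistics import mean


def unigram_precision(hypothesis: str, references: list[str]) -> float:
    """Stand-in scorer (NOT BLEU): clipped unigram precision against the references."""
    hyp_counts = Counter(hypothesis.lower().split())
    if not hyp_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for token, count in Counter(ref.lower().split()).items():
            max_ref_counts[token] = max(max_ref_counts[token], count)
    matched = sum(min(count, max_ref_counts[token]) for token, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())


def round_robin_average(statements: list[str], score=unigram_precision) -> float:
    """Score each statement against all the others, then average."""
    scores = []
    for i, hypothesis in enumerate(statements):
        references = statements[:i] + statements[i + 1:]
        scores.append(score(hypothesis, references))
    return mean(scores)


def admit_dimension(statements: list[str], threshold: float = 0.75, score=unigram_precision) -> bool:
    """Admit a dimension only if all three workers answered and agreement is high."""
    return len(statements) == 3 and round_robin_average(statements, score) > threshold


agreeing = [
    "a car turned in front of Gage",
    "a car turned in front of Gage",
    "a car turned in front of him",
]
disagreeing = ["Gage was scared", "the road was wet", "a dog barked"]
print(admit_dimension(agreeing), admit_dimension(disagreeing))
```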
Human and Automatic Evaluation. Human evaluation is crucial for any language generation task. We crowdsourced our human evaluation on MTurk, using a dedicated UI (footnote 9), asking three of our top-rated crowd workers from the main GLUCOSE crowdsourcing job to rate the predictions. We set up the following evaluation process to ensure calibrated judgments: the judge first reads a story with a highlighted sentence X, then reads a question about X corresponding to a GLUCOSE dimension. Next, they are shown a randomly shuffled list of candidate answers, each produced by a different system. Finally, the judge rates each candidate answer on a four-point Likert scale: "completely incorrect," "almost incorrect," "almost correct," and "completely correct." To compare system performance, the ratings are mapped to numerical scores of 0-3, which are then averaged.
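The mapping from Likert labels to scores is simple enough to state directly. The sketch below assumes the labels map to 0-3 in the order listed; the example ratings are invented for illustration and are not data from the paper.

```python
from statistics import mean

# Four-point Likert scale mapped to 0-3, in the order given in the text.
LIKERT_SCORES = {
    "completely incorrect": 0,
    "almost incorrect": 1,
    "almost correct": 2,
    "completely correct": 3,
}


def system_score(ratings: list[str]) -> float:
    """Average numerical score over all ratings collected for one system."""
    return mean(LIKERT_SCORES[rating] for rating in ratings)


# Invented ratings from three judges for a single candidate answer.
example_ratings = ["completely correct", "almost correct", "almost correct"]
average = system_score(example_ratings)
print(round(average, 2))
```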
Automatic evaluation for tasks involving language generation has been a major bottleneck for research (Liu et al., 2016; Hashimoto et al., 2019). BLEU's ease of replicability has made it a popular automated metric, but its correlation with human judgement has proven weak on various tasks (Novikova et al., 2017; Gatt and Krahmer, 2018). For automatic evaluation, we use SacreBLEU (Post, 2018) with equal weights up to 4-grams at corpus level on the three-reference test set. Using pairwise correlation analysis, we found strong correlation between human and BLEU scores on our test set, with correlation coefficients Spearman's ρ = 0.891, Pearson's r = 0.855, and Kendall's τ = 0.705, all with p-value < 0.001. The high correlation is due to various design choices, including 1) semi-structured inference rules in GLUCOSE are designed to be evaluable, where the structure constrains the variability of the rules, and 2) we minimized the noise in our human evaluation by designing a UI that could collect calibrated ratings from human judges educated about the task. The strong correlation suggests that BLEU is a viable metric for reporting future results on the GLUCOSE test set.
Footnote 9: GLUCOSE evaluation UI: https://bit.ly/2rJWFwy
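For readers who want to reproduce this kind of correlation analysis on their own system outputs, the three coefficients can be computed as sketched below. The implementation is dependency-free and simplified: Kendall's coefficient is the tau-a variant, which assumes no tied values. The paired scores are invented for illustration and are not the paper's data.

```python
from itertools import combinations
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson's r: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau_a(x: list[float], y: list[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / number of pairs; assumes no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Invented per-system scores: mean human rating (0-3) and BLEU.
human_scores = [0.4, 1.1, 1.9, 2.6, 2.8]
bleu_scores = [10.0, 35.0, 30.0, 60.0, 72.0]
rho = spearman(human_scores, bleu_scores)
r = pearson(human_scores, bleu_scores)
tau = kendall_tau_a(human_scores, bleu_scores)
print(round(rho, 3), round(r, 3), round(tau, 3))
```

Significance testing (the p-values reported above) is not covered by this sketch.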

Models
We developed several models for tackling the prediction task described in Section 5. The train and development sets for each model consisted of all GLUCOSE data minus entries that share the context story with the test instances. Due to their superior performance in sequence prediction, all our neural models use transformer blocks (Vaswani et al., 2017), which use multi-headed attention and fully connected layers to encode sequences. For decoding, all models use top-k random sampling (Fan et al., 2018). Details on all the models we experimented with can be found in Appendix C.
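The story-disjoint split described above amounts to a simple filter. In the sketch below, entries are plain dictionaries and the `story_id` field name is a hypothetical choice for illustration; the released data may identify stories differently.

```python
def story_disjoint_pool(all_entries: list[dict], test_entries: list[dict]) -> list[dict]:
    """Drop every entry whose context story also appears in the test set."""
    test_story_ids = {entry["story_id"] for entry in test_entries}
    return [entry for entry in all_entries if entry["story_id"] not in test_story_ids]


# Toy records; "story_id" and "dimension" are illustrative field names.
all_entries = [
    {"story_id": "s1", "dimension": 1},
    {"story_id": "s1", "dimension": 6},
    {"story_id": "s2", "dimension": 1},
    {"story_id": "s3", "dimension": 3},
]
test_entries = [{"story_id": "s1", "dimension": 1}]
train_dev_pool = story_disjoint_pool(all_entries, test_entries)
print([entry["story_id"] for entry in train_dev_pool])
```

Filtering by story rather than by individual entry matters here: the second `s1` record is not itself a test instance, but keeping it would let a model see the test story's context during training.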

Pretrained Language Model (PT-LM)
PT-LM tests what GLUCOSE-like knowledge is captured by the pretrained 774M-parameter GPT-2 (Radford et al., 2019) language model. We elicit commonsense explanations from GPT-2 by feeding it the story followed by sentence X and a dimension-specific trigger word like "because", and allowing the model to complete the sentence. For best results, we implemented "constrained decoding" by conditioning the GPT-2 model on the input S, X as context, then generating the next token for a dimension d as follows: if d's template specifies a set of allowable words at the current position (e.g., locative prepositions for dimensions 3 and 8), sample from the options based on their likelihood as conditioned on the preceding tokens. Otherwise, allow sampling freely from the entire vocabulary. See Appendix C for a list of all templates used.
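A single step of the constrained decoding described above can be sketched as follows. The next-token distribution is a toy dictionary standing in for GPT-2's output, the function name is hypothetical, and the top-k truncation used by the actual models is omitted for brevity.

```python
import random


def constrained_sample(next_token_probs: dict[str, float], allowed=None, rng=random) -> str:
    """Sample one token, restricted to `allowed` when the template specifies options.

    `next_token_probs` maps tokens to model probabilities given the preceding tokens.
    When `allowed` is empty or None, sampling is over the entire vocabulary.
    """
    if allowed:
        candidates = {t: p for t, p in next_token_probs.items() if t in allowed}
        if not candidates:
            raise ValueError("none of the allowed tokens are in the vocabulary")
    else:
        candidates = dict(next_token_probs)
    tokens = list(candidates)
    # random.choices normalizes the weights, so the restricted options are
    # sampled in proportion to their original model likelihoods.
    weights = [candidates[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy distribution at a position where the template expects a locative preposition.
toy_probs = {"in": 0.30, "at": 0.15, "on": 0.05, "because": 0.40, "bike": 0.10}
locative_prepositions = {"in", "at", "on"}
rng = random.Random(0)
token = constrained_sample(toy_probs, allowed=locative_prepositions, rng=rng)
print(token)
```

Even though "because" is the most likely token overall in this toy distribution, it can never be produced at a slot that the template restricts to prepositions.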

Language Models
We finetuned separate language models for specific and general rules. Each model monolithically covers all ten GLUCOSE dimensions: it generates rules given a dimension indicator as input. Rules are sampled from the learned distribution $p(s) = \prod_{i=1}^{n} p(s_i \mid s_1, \ldots, s_{i-1})$, where s is the concatenation of input and output sequences. For all models in this section, we finetuned the PT-LM model described above.
One-sided Generation (1S-LM). One side of a GLUCOSE rule (the antecedent or the consequent, depending on the dimension) is always a paraphrase and/or a generalization of sentence X. In the one-sided model, we use X as is for this side of the specific statement; the model generates only the target side. Each training example is a text sequence S#X#d#answer#EOS, where d is the dimension number and answer is the target side. At test time, the model generates answer characters until it produces an EOS token.
Full Rule Generation (Full-LM). Full-LM learns to produce the complete rule, including the connective and the paraphrase of X. Instead of just the target side of the rule, the training examples have the full rule as the answer portion of the sequence. This allows the model to produce more human-like rules, including paraphrasing and/or generalizing X appropriately.
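The two training formats differ only in what goes into the answer field of the S#X#d#answer#EOS sequence. The sketch below builds both variants for the example story; the literal `<EOS>` string and the connective wording in the full rule are assumptions made for illustration, since the actual special token and connective text depend on the tokenizer and on the dimension templates.

```python
SEPARATOR = "#"
EOS = "<EOS>"  # assumed literal; the real end-of-sequence token depends on the tokenizer


def training_sequence(story: str, sentence: str, dimension: int, answer: str) -> str:
    """Build one S#X#d#answer#EOS training sequence."""
    return SEPARATOR.join([story, sentence, str(dimension), answer, EOS])


story = (
    "Gage was riding his bike. A car turned in front of him. "
    "Gage turned his bike sharply. He fell off of his bike. "
    "Gage skinned his knee."
)
sentence_x = "Gage turned his bike sharply."

# 1S-LM: the answer is only the target side of the rule.
one_sided = training_sequence(story, sentence_x, 1, "A car turned in front of Gage")

# Full-LM: the answer is the complete rule, connective and paraphrase of X included.
full_rule = training_sequence(
    story,
    sentence_x,
    1,
    "A car turned in front of Gage Causes Gage turned his bike sharply",
)
print(one_sided)
print(full_rule)
```

At test time the same prefix (everything up to and including the dimension number and its trailing separator) would be given to the model, which then generates until it emits the end-of-sequence token.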

Encoder-Decoder Model (Enc-Dec)
Our most complex model is an encoder-decoder transformer model that jointly predicts the specific and general rules. It maximizes $p(y \mid x) = \prod_{i=1}^{n} p(y_i \mid x; y_1, \ldots, y_{i-1})$, where x is the input and y is the answer. We obtained the best results by formulating the input as #d: S*[X], where d is the dimension and S*[X] is the story S with sentence X surrounded by asterisks. We chose to finetune the state-of-the-art T5 model (with 770M parameters, to be comparable to the size of the LM model), using the same hyperparameters as in Raffel et al. (2019).

Results and Discussion
Table 4 shows the results from the models described in Section 6, evaluated as per Section 5. It shows that Enc-Dec uniformly outperforms all other models, confirming that full visibility into context helps an architecture better learn the intricacies of GLUCOSE rules. In fact, Enc-Dec performs competitively with humans in many dimensions. The strength of this model's performance in predicting both specific and general rules is a testament to the high quality of the GLUCOSE training data. Its worst performance is on general rules for dimensions 5 and 10, which have the lowest number of training points and are the most diverse in content.
Other models perform as expected. PT-LM's poor performance shows that finetuning on our dataset significantly improves the commonsense inference capabilities of LMs. 1S-LM, which only predicts half of an inference rule, outperforms Full-LM in predicting specific statements, but lacks the ability to generalize them. We also tested various other baselines, including an ATOMIC-trained transformer model, retrieval of K-nearest-neighbors, and non-contextual variants of the presented models, all of which significantly underperformed the results in Table 4, and are presented in Appendix C.
Our results also show that our best models perform noticeably better on specific statements than on general rules. This is because generating a specific statement involves paraphrasing a story sentence and predicting an antecedent/consequent, while a general rule requires further generalizing the paraphrase and the antecedent/consequent appropriately such that the rule remains a generally valid statement about the world.
Although rule generalization can sometimes be as simple as replacing a named entity (e.g., Gage) with a typed variable (Someone_A), more often more complex transformations are needed, such as generalizing the action and producing type constraints on variables in the form of attribute phrases. For example, consider the Enc-Dec results in Table 5. For dimension 3, the generalization of the story sentence, Karen makes a pan of lasagna, included generalizing Karen to Someone_A and makes a pan of lasagna to cooks Something_A. Note that sentence generalizations are dimension-specific: for dimension 6, the generalization of the same sentence retains the verb make but adds a type constraint to the object, Something_A (that is a food), which is required for making the rule generally valid. Table 1 shows another complex transformation example, where turning his bike is generalized into moves away from Something (that is dangerous), which takes into account the story context.
In our current evaluation setup, we evaluate each dimension for each sentence individually, without consideration for consistency across dimensions or across sentences. In the future, we plan to explore joint prediction of all the dimensions across the story, a considerably more challenging endeavor that would yield more useful predictions for a downstream task. We also intend to show the value of incorporating GLUCOSE-trained models in other downstream NLP tasks such as reading comprehension and dialog. It is important to note that static test sets are inherently narrow and prone to hidden curation biases (Sharma et al., 2018; Belinkov et al., 2019). We believe that the ultimate evaluation for models that show GLUCOSE-like commonsense reasoning capabilities should be on naturally occurring arbitrary stories and through our presented human evaluation process.

Conclusions
We introduced GLUCOSE, a large-scale dataset of implicit commonsense knowledge, encoded as explanatory mini-theories grounded in a narrative context. The theories are categorized into ten causal dimensions, inspired by cognitive psychology.
We presented our multi-stage pipeline for acquiring semi-structured causal explanations at scale from lay workers, resulting in 440K annotations in the context of everyday children's stories. We demonstrated the utility of GLUCOSE data in two ways. 1) Our analysis showed that GLUCOSE rules capture knowledge not available in existing resources or pre-trained models. 2) In order to evaluate how well AI models can predict GLUCOSE knowledge on novel inputs, the ultimate value of such a dataset, we defined a standalone evaluation task for predicting specific and general inference rules given a story/sentence pair and a dimension. We curated a doubly-vetted test set, developed a platform to facilitate human judgment of system outputs, and validated BLEU as a strong automated evaluation metric. We show that training on GLUCOSE data improves model performance significantly on unseen stories.
Our results validate our hypothesis that a promising approach for imbuing machines with commonsense is to use carefully crafted data, as in GLUCOSE, to train neural architectures that have a wide range of lexical and conceptual knowledge encoded, as in models pretrained on large corpora. Together with this paper, we release our dataset and models, which we hope will help advance commonsense reasoning research in the AI community.