Event2Mind: Commonsense Inference on Events, Intents, and Reactions

We investigate a new commonsense inference task: given an event described in a short free-form text (“X drinks coffee in the morning”), a system reasons about the likely intents (“X wants to stay awake”) and reactions (“X feels alert”) of the event’s participants. To support this study, we construct a new crowdsourced corpus of 25,000 event phrases covering a diverse range of everyday events and situations. We report baseline performance on this task, demonstrating that neural encoder-decoder models can successfully compose embedding representations of previously unseen events and reason about the likely intents and reactions of the event participants. In addition, we demonstrate how commonsense inference on people’s intents and reactions can help unveil the implicit gender inequality prevalent in modern movie scripts.


Introduction
Understanding a narrative requires commonsense reasoning about the mental states of people in relation to events. For example, if "Alex is dragging his feet at work", pragmatic implications about Alex's intent are that "Alex wants to avoid doing things" (Figure 1). We can also infer that Alex's emotional reaction might be feeling "lazy" or "bored". Furthermore, while not explicitly mentioned, we can infer that people other than Alex are affected by the situation, and these people are likely to feel "frustrated" or "impatient".
This type of pragmatic inference can potentially be useful for a wide range of NLP applications that require accurate anticipation of people's intents and emotional reactions, even when they are not explicitly mentioned. For example, an ideal dialogue system should react in empathetic ways by reasoning about the human user's mental state based on the events the user has experienced, without the user explicitly stating how they are feeling. Similarly, advertisement systems on social media should be able to reason about the emotional reactions of people after events such as mass shootings and remove ads for guns which might increase social distress (Goel and Isaac, 2016). Also, pragmatic inference is a necessary step toward automatic narrative understanding and generation (Tomai and Forbus, 2010; Ding and Riloff, 2016; Ding et al., 2017). However, this type of social commonsense reasoning goes far beyond the widely studied entailment tasks (Bowman et al., 2015; Dagan et al., 2006) and thus falls outside the scope of existing benchmarks.
Figure 1: Examples of commonsense inference on mental states of event participants. In the third example event, common sense tells us that Y is likely to feel betrayed as a result of X reading their diary.
* These two authors contributed equally.
In this paper, we introduce a new task, corpus, and model, supporting commonsense inference on events with a specific focus on modeling stereotypical intents and reactions of people, described in short free-form text. Our study is in a similar spirit to recent efforts of Ding and Riloff (2016) and Zhang et al. (2017), in that we aim to model aspects of commonsense inference via natural language descriptions. Our new contributions are: (1) a new corpus that supports commonsense inference about people's intents and reactions over a diverse range of everyday events and situations, (2) inference about even those people who are not directly mentioned by the event phrase, and (3) a task formulation that aims to generate the textual descriptions of intents and reactions, instead of classifying their polarities or classifying the inference relations between two given textual descriptions.
Our work establishes baseline performance on this new task, demonstrating that, given the phrase-level inference dataset, neural encoder-decoder models can successfully compose phrasal embeddings for previously unseen events and reason about the mental states of their participants. Furthermore, in order to showcase the practical implications of commonsense inference on events and people's mental states, we apply our model to modern movie scripts, which provides new insight into the gender bias in modern films beyond what previous studies have offered (England et al., 2011; Agarwal et al., 2015; Ramakrishna et al., 2017; Sap et al., 2017). The resulting corpus includes around 25,000 event phrases, which combine automatically extracted phrases from stories and blogs with all idiomatic verb phrases listed in the Wiktionary. Our corpus is publicly available. 1

Dataset
One goal of our investigation is to probe whether it is feasible to build computational models that can perform limited, but well-scoped commonsense inference on short free-form text, which we refer to as event phrases. While there has been much prior research on phrase-level paraphrases (Pavlick et al., 2015) and phrase-level entailment (Dagan et al., 2006), relatively little prior work has focused on phrase-level inference that requires pragmatic or commonsense interpretation. We scope our study to two distinct types of inference: given a phrase that describes an event, we want to reason about the likely intents and emotional reactions of people who cause or are affected by the event. This complements prior work on more general commonsense inference (Speer and Havasi, 2012; Li et al., 2016; Zhang et al., 2017), by focusing on the causal relations between events and people's mental states, which are not well covered by most existing resources.
We collect a wide range of phrasal event descriptions from stories, blogs, and Wiktionary idioms. Compared to prior work on phrasal embeddings (Wieting et al., 2015; Pavlick et al., 2015), our work generalizes the phrases by introducing (typed) variables. In particular, we replace words that correspond to entity mentions or pronouns with typed variables such as PersonX or PersonY, as shown in examples in Table 1. More formally, the phrases we extract are a combination of a verb predicate with partially instantiated arguments. We keep specific arguments together with the predicate if they appear frequently enough (e.g., PersonX eats pasta for dinner). Otherwise, the arguments are replaced with an untyped blank (e.g., PersonX eats ___ for dinner). In our work, only person mentions are replaced with typed variables, leaving other types to future research.
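
The argument-blanking rule described above can be illustrated with a short sketch. It is not the extraction pipeline used for the corpus: the triple format `(verb, argument, tail)`, the function name `build_event_phrases`, the `min_count` threshold, and the `___` rendering of the blank are all assumptions made for illustration.

```python
from collections import Counter

BLANK = "___"  # illustrative rendering of the untyped blank


def build_event_phrases(triples, min_count=2):
    """Turn (verb, argument, tail) triples into PersonX event phrases.

    An argument is kept only if its (verb, argument) pair occurs at least
    `min_count` times; otherwise it is replaced by an untyped blank.
    The threshold value is hypothetical.
    """
    pair_counts = Counter((verb, arg) for verb, arg, _ in triples)
    phrases = set()
    for verb, arg, tail in triples:
        kept = arg if pair_counts[(verb, arg)] >= min_count else BLANK
        # Skip empty pieces so phrases without a tail have no trailing space.
        phrases.add(" ".join(part for part in ("PersonX", verb, kept, tail) if part))
    return sorted(phrases)


if __name__ == "__main__":
    toy = [
        ("eats", "pasta", "for dinner"),
        ("eats", "pasta", "for dinner"),
        ("eats", "sushi", "for dinner"),
    ]
    for phrase in build_event_phrases(toy):
        print(phrase)
```

On this toy input the frequent argument "pasta" is kept, while the rare argument "sushi" collapses into the blank pattern.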
Inference types The first type of pragmatic inference is about intent. We define intent as an explanation of why the agent causes a volitional event to occur (or "none" if the event phrase was unintentional). The intent can be considered a mental pre-condition of an action or an event. For example, if the event phrase is PersonX takes a stab at ___, the annotated intent might be that "PersonX wants to solve a problem".
The second type of pragmatic inference is about emotional reaction. We define reaction as an explanation of how the mental states of the agent and other people involved in the event would change as a result. The reaction can be considered a mental post-condition of an action or an event. For example, if the event phrase is that PersonX gives PersonY ___ as a gift, PersonX might "feel good about themselves" as a result, and PersonY might "feel grateful" or "feel thankful".
Table 2: Data and annotation agreement statistics for our new phrasal inference corpus. Each event is annotated by three crowdworkers.

Event Extraction
We extract phrasal events from three different corpora for broad coverage: the ROC Story training set (Mostafazadeh et al., 2016), the Google Syntactic N-grams (Goldberg and Orwant, 2013), and the Spinn3r corpus (Gordon and Swanson, 2008). We derive events from the set of verb phrases in our corpora, based on syntactic parses (Klein and Manning, 2003). We then replace the predicate subject and other entities with the typed variables (e.g., PersonX, PersonY), and selectively substitute verb arguments with blanks (___). We use frequency thresholds to select events to annotate (for details, see Appendix A.1). Additionally, we supplement the list of events with all 2,000 verb idioms found in Wiktionary, in order to cover events that are less compositional. 2 Our final annotation corpus contains nearly 25,000 event phrases, spanning over 1,300 unique verb predicates (Table 2).

Crowdsourcing
We design an Amazon Mechanical Turk task to annotate the mental pre- and post-conditions of event phrases. A snippet of our MTurk HIT design is shown in Figure 2. For each phrase, we ask three annotators whether the agent of the event, PersonX, intentionally causes the event, and if so, to provide up to three possible textual descriptions of their intents. We then ask annotators to provide up to three possible reactions that PersonX might experience as a result. We also ask annotators to provide up to three possible reactions of other people, when applicable. These other people can be either explicitly mentioned (e.g., "PersonY" in PersonX punches PersonY's lights out), or only implied (e.g., given the event description PersonX yells at the classroom, we can infer that other people such as "students" in the classroom may be affected by the act of PersonX). For quality control, we periodically removed workers with high disagreement rates, at our discretion.
Figure 2: Intent portion of our annotation task. We allow annotators to label events as invalid if the phrase is unintelligible. The full annotation setup is shown in Figure 8 in the appendix.
To prune the set of events that will be annotated for intent and reaction, we ran a preliminary annotation to filter out candidate events that have implausible coreferences. In this preliminary task, annotators were shown a combinatorial list of coreferences for an event (e.g., PersonX punches PersonX's lights out, PersonX punches PersonY's lights out) and were asked to select only the plausible ones (e.g., PersonX punches PersonY's lights out). Each set of coreferences was annotated by 3 workers, yielding an overall agreement of κ = 0.4. This annotation excluded 8,406 events with implausible coreference from our set (out of 17,806 events).
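
As a rough illustration of the combinatorial coreference list shown to annotators, the sketch below enumerates every assignment of person variables to the open slots of an event template. The template syntax (`{}` placeholders), the function name `coreference_candidates`, and the restriction to two variables are assumptions for this example, not details taken from the annotation setup.

```python
from itertools import product


def coreference_candidates(template, num_slots, variables=("PersonX", "PersonY")):
    """Enumerate all ways of filling `num_slots` placeholders with person variables.

    `template` uses str.format-style `{}` placeholders for the non-subject
    person slots; the subject is assumed to be written out as PersonX.
    """
    return [template.format(*combo) for combo in product(variables, repeat=num_slots)]


if __name__ == "__main__":
    for candidate in coreference_candidates("PersonX punches {}'s lights out", 1):
        print(candidate)
```

Annotators would then mark which of the enumerated candidates are plausible; that filtering step is a human judgment and is not modeled here.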

Mental State Descriptions
Our dataset contains nearly 25,000 event phrases, with annotators rating 91% of our extracted events as "valid" (i.e., the event makes sense). Of those events, annotations for the multiple choice portions of the task (whether or not there exists intent/reaction) agree moderately, with an average Cohen's κ = 0.45 (Table 2). The individual κ scores generally indicate that turkers disagree half as often as if they were randomly selecting answers.
Importantly, this level of agreement is acceptable in our task formulation for two reasons. First, unlike linguistic annotations on syntax or semantics where experts in the corresponding theory would generally agree on a single correct label, pragmatic interpretations may better be defined as distributions over multiple correct labels (e.g., after PersonX takes a test, PersonX might feel relieved and/or stressed; de Marneffe et al., 2012). Second, because we formulate our task as a conditional language modeling problem, where a distribution over the textual descriptions of intents and reactions is conditioned on the event description, this variation in the labels is to be expected.
A majority of our events are annotated as willingly caused by the agent (86%, Cohen's κ = 0.48), and 26% involve other people (κ = 0.41). Most event patterns in our data are fully instantiated, with only 22% containing blanks (___). In our corpus, the intent annotations are slightly longer (3.4 words on average) than the reaction annotations (1.5 words).
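
For readers unfamiliar with the agreement statistic reported above, the following sketch computes Cohen's κ for two annotators. With observed agreement p_o and chance agreement p_e, κ = (p_o − p_e) / (1 − p_e) = 1 − (1 − p_o) / (1 − p_e), so a κ near 0.45 to 0.5 means annotators disagree roughly half as often as chance would predict. The function name and the toy labels are illustrative; how the pairwise scores of three annotators were averaged is not specified here and is left out.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators always use the same single label
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    annotator_1 = ["yes"] * 5 + ["no"] * 5
    annotator_2 = ["yes"] * 4 + ["no"] + ["no"] * 4 + ["yes"]
    print(round(cohens_kappa(annotator_1, annotator_2), 3))  # 0.6
```

In the toy example the annotators agree on 8 of 10 items (p_o = 0.8) with balanced marginals (p_e = 0.5), giving κ = 0.6.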

Models
Given an event phrase, our models aim to generate three entity-specific pragmatic inferences: PersonX's intent, PersonX's reaction, and others' reactions. The general outline of our model architecture is illustrated in Figure 3.
The input to our model is an event pattern described through free-form text with typed variables such as PersonX gives PersonY ___ as a gift. For notation purposes, we describe each event pattern $E$ as a sequence of word embeddings $\langle e_1, e_2, \ldots, e_n \rangle \in \mathbb{R}^{n \times D}$. This input is encoded as a vector $h_E \in \mathbb{R}^{H}$ that will be used for predicting output. The output of the model is its hypotheses about PersonX's intent, PersonX's reaction, and others' reactions ($v_i$, $v_x$, and $v_o$, respectively). We experiment with representing the decoder output in two decoding set-ups: three vectors interpretable as discrete distributions over words and phrases (n-gram reranking) or three sequences of words (sequence decoding).
Figure 3: Overview of the model architecture. From an encoded event, our model predicts intents and reactions in a multitask setting.
Encoding events The input event phrase $E$ is compressed into an $H$-dimensional embedding $h_E$ via an encoding function $f : \mathbb{R}^{n \times D} \rightarrow \mathbb{R}^{H}$:
$$h_E = f(e_1, e_2, \ldots, e_n)$$
We experiment with several ways for defining $f$, inspired by standard techniques in sentence and phrase classification (Kim, 2014). First, we experiment with max-pooling and mean-pooling over the word vectors $\{e_i\}_{i=1}^{n}$. We also consider a convolutional neural network (ConvNet; LeCun et al., 1998), taking the last layer of the network as the encoded version of the event. Lastly, we encode the event phrase with a bi-directional RNN (specifically, a GRU; Cho et al., 2014), concatenating the final hidden states of the forward and backward cells as the encoding:
$$h_E = [\overrightarrow{h}_n ; \overleftarrow{h}_1]$$
For hyperparameters and other details, we refer the reader to Appendix B.
Though the event sequences are typically rather short (4.6 tokens on average), our model still benefits from the ConvNet and BiRNN's ability to compose words.
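
The two pooling encoders mentioned above are simple enough to sketch directly. The version below uses plain Python lists in place of tensors; the function names are illustrative, and for these encoders the output dimension equals the embedding dimension (H = D). The ConvNet and BiRNN encoders are not reproduced here.

```python
def mean_pool(word_vectors):
    """Average each embedding dimension over the n words of the event phrase."""
    if not word_vectors:
        raise ValueError("event phrase must contain at least one word vector")
    n = len(word_vectors)
    return [sum(column) / n for column in zip(*word_vectors)]


def max_pool(word_vectors):
    """Take the per-dimension maximum over the n words of the event phrase."""
    if not word_vectors:
        raise ValueError("event phrase must contain at least one word vector")
    return [max(column) for column in zip(*word_vectors)]


if __name__ == "__main__":
    # Toy 2-dimensional embeddings for a two-word phrase (values are made up).
    phrase = [[1.0, -2.0], [3.0, 0.0]]
    print(mean_pool(phrase))  # [2.0, -1.0]
    print(max_pool(phrase))   # [3.0, 0.0]
```

Both functions ignore word order, which is one reason an order-sensitive encoder can help even on short phrases.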
Pragmatic inference decoding We use three decoding modules that take the event phrase embedding $h_E$ and output distributions of possible PersonX's intent ($v_i$), PersonX's reactions ($v_x$), and others' reactions ($v_o$). We experiment with two different decoder set-ups.
First, we experiment with n-gram re-ranking, considering the $|V|$ most frequent {1, 2, 3}-grams in our annotations. Each decoder projects the event phrase embedding $h_E$ into a $|V|$-dimensional vector, which is then passed through a softmax function. For instance, the distribution over descriptions of PersonX's intent is given by:
$$v_i = \mathrm{softmax}(W_i h_E + b_i)$$
Second, we experiment with sequence generation, using RNN decoders to generate the textual description. The event phrase embedding $h_E$ is set as the initial state $h_{\mathrm{dec}}$ of three decoder RNNs (using GRU cells), which then output the intent/reactions one word at a time (using beam-search at test time). For example, an event's intent sequence ($v_i = v_i^{(0)} v_i^{(1)} \ldots$) is computed as follows:
$$v_i^{(t+1)} = \mathrm{softmax}\left(W_i \, \mathrm{RNN}\left(v_i^{(t)}, h_{i,\mathrm{dec}}^{(t)}\right) + b_i\right)$$
Training objective We minimize the cross-entropy between the predicted distribution over words and phrases and the one actually observed in our dataset. Further, we employ multitask learning, simultaneously minimizing the loss for all three decoders at each iteration.
Training details We fix our input embeddings, using 300-dimensional skip-gram word embeddings trained on Google News (Mikolov et al., 2013). For decoding, we consider a vocabulary of size $|V|$ = 14,034 in the n-gram re-ranking setup. For the sequence decoding setup, we only consider the unigrams in $V$, yielding an output space of 7,110 at each time step.
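
To make the n-gram re-ranking decoder and the multitask objective more concrete, here is a minimal sketch under stated assumptions: each decoder is a single linear projection followed by a softmax, and the multitask loss is the unweighted sum of the three per-decoder cross-entropies for one training example. The task keys (`intent`, `xreact`, `oreact`), the parameter layout, and the unweighted sum are illustrative choices, not details confirmed by the text.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decoder_distribution(h_event, weights, bias):
    """Project the event embedding to |V| logits and normalize them."""
    logits = [
        sum(w * h for w, h in zip(row, h_event)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


def multitask_loss(h_event, decoders, gold_indices):
    """Sum of cross-entropies over the three decoders for one example.

    `decoders` maps a task name to (weights, bias); `gold_indices` maps the
    same task name to the vocabulary index of the annotated n-gram.
    """
    loss = 0.0
    for task, (weights, bias) in decoders.items():
        probs = decoder_distribution(h_event, weights, bias)
        loss -= math.log(probs[gold_indices[task]])
    return loss


if __name__ == "__main__":
    vocab_size, hidden = 4, 2  # toy sizes; the real |V| is far larger
    zero_decoder = ([[0.0] * hidden for _ in range(vocab_size)], [0.0] * vocab_size)
    decoders = {"intent": zero_decoder, "xreact": zero_decoder, "oreact": zero_decoder}
    gold = {"intent": 0, "xreact": 1, "oreact": 2}
    print(multitask_loss([0.5, -1.0], decoders, gold))
```

With all-zero parameters every decoder predicts a uniform distribution, so the loss is three times log |V|, a useful sanity check before training.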
We randomly divided our set of 24,716 unique events (57,094 annotations) into a training/dev./test set using an 80/10/10% split. Some annotations have multiple responses (i.e., a crowdworker gave multiple possible intents and reactions), in which case we take each of the combinations of their responses as a separate training example.

Empirical Results
Table 3 summarizes the performance of different encoding models on the dev and test set in terms of cross-entropy and recall at 10 predicted intents and reactions. As expected, we see a moderate improvement in recall and cross-entropy when using the more compositional encoder models (ConvNet and BiRNN; both n-gram and sequence decoding setups). Additionally, BiRNN models outperform ConvNets on cross-entropy in both decoding setups. Looking at the recall split across intent vs. reaction labels ("Intent", "XReact" and "OReact" columns), we see that much of the improvement in using these two models is within the prediction of PersonX's intents. Note that recall for "OReact" is much higher, since a majority of events do not involve other people.
Table 3: Average cross-entropy (lower is better) and recall @10 (percentage of times the gold falls within the top 10 decoded; higher is better) on development and test sets for different modeling variations. We show recall values for PersonX's intent, PersonX's reaction and others' reaction (denoted as "Intent", "XReact", and "OReact"). Note that because of the two different decoding setups, cross-entropy values for n-gram and sequence decoding are not directly comparable.
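
The recall @10 metric used in Table 3 (the percentage of times the gold annotation falls within the top 10 decoded outputs) can be sketched as follows. The function name and the data layout, one ranked list and one gold string per example, are assumptions for illustration.

```python
def recall_at_k(ranked_predictions, gold_labels, k=10):
    """Percentage of examples whose gold label is among the top-k predictions.

    `ranked_predictions[j]` is the model's ranked output list for example j,
    best first; `gold_labels[j]` is the annotated intent or reaction.
    """
    if len(ranked_predictions) != len(gold_labels) or not gold_labels:
        raise ValueError("need one ranked list per gold label")
    hits = sum(
        1 for ranked, gold in zip(ranked_predictions, gold_labels)
        if gold in ranked[:k]
    )
    return 100.0 * hits / len(gold_labels)


if __name__ == "__main__":
    predictions = [["happy", "satisfied", "tired"], ["none", "angry", "sad"]]
    gold = ["satisfied", "grateful"]
    print(recall_at_k(predictions, gold, k=2))  # 50.0
```

Under this definition, a decoder that ranks "none" highly scores well on any label set dominated by "none", which is consistent with the higher "OReact" recall noted above.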
Human evaluation To further assess the quality of our models, we randomly select 100 events from our test set and ask crowd-workers to rate generated intents and reactions. We present 5 workers with an event's top 10 most likely intents and reactions according to our model and ask them to select all those that make sense to them. We evaluate each model's precision @10 by computing the average number of generated responses that make sense to annotators. Figure 4 summarizes the results of this evaluation. In most cases, the performance is higher for the sequential decoder than the corresponding n-gram decoder. The biggest gain from using sequence decoders is in intent prediction, possibly because intent explanations are more likely to be longer. The BiRNN and ConvNet encoders consistently have higher precision than mean-pooling, with the BiRNN-seq setup slightly outperforming other models. Unless otherwise specified, this is the model we employ in further sections.
[...] similar for all three sets of events, it is 10% behind intent prediction on the full development set. Additionally, predicting other people's reactions is more difficult for the model when other people are explicitly mentioned. Unsurprisingly, idioms are particularly difficult for commonsense inference, perhaps due to the difficulty in composing meaning over nonliteral or noncompositional event descriptions.
To further evaluate the geometry of the embedding space, we analyze interpolations between pairs of event phrases (from outside the train set), similar to the homotopic analysis of Bowman et al. (2016). For a handful of event pairs, we decode intents, reactions for PersonX, and reactions for other people from points sampled at equal intervals on the interpolated line between two event phrases. We show examples in Figure 5. The embedding space distinguishes changes from generally positive to generally negative words and is also able to capture small differences between event phrases (such as "washes" versus "cuts").
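
The interpolation analysis samples points at equal intervals on the line between two event embeddings and decodes from each. A minimal sketch of the sampling step, assuming plain linear interpolation between two encoded vectors, is given below; the function name and the number of points are illustrative, and decoding from each point is left to the trained model.

```python
def interpolate(h_start, h_end, num_points=5):
    """Return `num_points` equally spaced vectors from h_start to h_end, inclusive."""
    if num_points < 2:
        raise ValueError("need at least the two endpoints")
    if len(h_start) != len(h_end):
        raise ValueError("embeddings must have the same dimension")
    points = []
    for step in range(num_points):
        alpha = step / (num_points - 1)  # 0.0 at h_start, 1.0 at h_end
        points.append([(1 - alpha) * a + alpha * b for a, b in zip(h_start, h_end)])
    return points


if __name__ == "__main__":
    for point in interpolate([0.0, 0.0], [1.0, 2.0], num_points=3):
        print(point)
```

Each returned vector would be passed to the three decoders in place of an encoded event.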

Analyzing Bias via Event2Mind Inference
Through Event2Mind inference, we can attempt to bring to the surface what is implied about people's behavior and mental states. We employ this inference to analyze implicit bias in modern films. As shown in Figure 7, our model is able to analyze character portrayal beyond what is explicit in text, by performing pragmatic inference on character actions to explain aspects of a character's mental state. In this section, we use our model's inference to shed light on gender differences in intents behind and reactions to characters' actions.

Processing of Movie Scripts
For our portrayal analyses, we use scene descriptions from 772 movie scripts released by Gorinski and Lapata (2015), assigned to over 21,000 characters as done by Sap et al. (2017). We extract events from the scene descriptions, and generate their 10 most probable intent and reaction sequences using our BiRNN sequence model (as in Figure 7). We then categorize generated intents and reactions into groups based on LIWC category scores of the generated output (Tausczik and Pennebaker, 2016). 3 The intent and reaction categories are then aggregated for each character, and standardized (zero-mean and unit variance).
We compute correlations with gender for each category of intent or reaction using a logistic regression model, testing significance while using Holm's correction for multiple comparisons (Holm, 1979). 4 To account for the gender skew in scene presence (29.4% of scenes have women), we statistically control for the total number of words in a character's scene descriptions. Note that the original event phrases are all gender agnostic, as their participants have been replaced by variables (e.g., PersonX). We also find that the types of gender biases uncovered remain similar when we run these analyses on the human annotations or the generated words and phrases from the BiRNN with n-gram re-ranking decoding setup.
Figure 7: [...] (1990, bottom), augmented with Event2Mind inferences on the characters' intents and reactions. E.g., our model infers that the event PersonX sits on PersonX's bed, lost in thought implies that the agent, Vivian, is sad or worried.
3 [...] and Needs', 'Personal Concerns', 'Biological Processes', 'Cognitive Processes', 'Social Words', 'Affect Words', 'Perceptual Processes'. We refer the reader to Tausczik and Pennebaker (2016) or http://liwc.wpengine.com/compare-dictionaries/ for a complete list of category descriptions.
4 Given the data limitation, we represent gender as a binary, but acknowledge that gender is a more complex social construct.
Our Event2Mind inferences automate portrayal analyses that previously required manual annotations (Behm-Morawitz and Mastro, 2008; Prentice and Carranza, 2002; England et al., 2011). As shown in Table 4, our results indicate a gender bias in the behavior ascribed to characters, consistent with psychology and gender studies literature (Collins, 2011). Specifically, events with female semantic agents are intended to be helpful to other people (intents involving FRIEND, FAMILY, and AFFILIATION), particularly relating to eating and making food for themselves and others (INGEST, BODY). Events with male agents, on the other hand, are motivated by and result in achievements (ACHIEVE, MONEY, REWARDS, POWER).
Women's looks and sexuality are also emphasized, as their actions' intents and reactions are sexual, seen, or felt (SEXUAL, SEE, PERCEPT). Men's actions, on the other hand, are motivated by violence or fighting (DEATH, ANGER, RISK), with strong negative reactions (SAD, ANGER, NEGATIVE EMOTION).
Our approach decodes nuanced implications into more explicit statements, helping to identify and explain gender bias that is prevalent in modern literature and media. Specifically, our results indicate that modern movies tend to portray female characters as having pro-social attitudes, whereas male characters are portrayed as being competitive or pro-achievement. This is consistent with gender stereotypes that have been studied in movies in both NLP and psychology literature (Agarwal et al., 2015; Madaan et al., 2017; Prentice and Carranza, 2002; England et al., 2011).
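
Two of the statistical steps used in this analysis, standardizing per-character category scores and applying Holm's correction to a family of p-values, can be sketched in a few lines. The logistic regression itself is omitted. The function names, the use of the population standard deviation, and the example p-values are assumptions for illustration and are not values from Table 4.

```python
import math


def standardize(values):
    """Rescale values to zero mean and unit variance (population variance)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0.0:
        return [0.0] * n  # constant feature: nothing to scale
    return [(v - mean) / std for v in values]


def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure; returns one reject/keep flag per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The (rank+1)-th smallest p-value is compared with alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject


if __name__ == "__main__":
    print(holm_reject([0.01, 0.04, 0.03, 0.005]))  # [True, False, False, True]
```

In the example, the two smallest p-values pass their thresholds (0.0125 and about 0.0167), the third (0.03) exceeds 0.025, and the procedure stops there.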

Related Work
Prior work has sought formal frameworks for inferring roles and other attributes in relation to events (Baker et al., 1998; Das et al., 2014; Schuler et al., 2009; Hartshorne et al., 2013, inter alia), implicitly connoted by events (Reisinger et al., 2015; White et al., 2016; Greene, 2007; Rashkin et al., 2016), or sentiment polarities of events (Ding and Riloff, 2016; Choi and Wiebe, 2014; Russo et al., 2015; Ding and Riloff, 2018). In addition, recent work has studied the patterns which evoke certain polarities (Reed et al., 2017), the desires which make events affective (Ding et al., 2017), the emotions caused by events (Vu et al., 2014), or, conversely, identifying events or reasoning behind particular emotions (Gui et al., 2017). Compared to this prior literature, our work uniquely learns to model intents and reactions over a diverse set of events, includes inference over event participants not explicitly mentioned in text, and formulates the task as predicting the textual descriptions of the implied commonsense instead of classifying various event attributes.
Previous work in natural language inference has focused on linguistic entailment (Bowman et al., 2015; Bos and Markert, 2005) while ours focuses on commonsense-based inference. There also has been inference or entailment work that is more generation focused: generating, e.g., entailed statements (Zhang et al., 2017; Blouw and Eliasmith, 2018), explanations of causality (Kang et al., 2017), or paraphrases (Dong et al., 2017). Our work also aims at generating inferences from sentences; however, our models infer implicit information about mental states and causality, which has not been studied by most previous systems.
Also related are commonsense knowledge bases (Espinosa and Lieberman, 2005; Speer and Havasi, 2012). Our work complements these existing resources by providing commonsense relations that are relatively less populated in previous work. For instance, ConceptNet contains only 25% of our events, and only 12% have relations that resemble intent and reaction. We present a more detailed comparison with ConceptNet in Appendix C.

Conclusion
We introduced a new corpus, task, and model for performing commonsense inference on textually described everyday events, focusing on stereotypical intents and reactions of people involved in the events. Our corpus supports learning representations over a diverse range of events and reasoning about the likely intents and reactions of previously unseen events. We also demonstrate that such inference can help reveal implicit gender bias in movie scripts.