Causal Inference of Script Knowledge

When does a sequence of events define an everyday scenario and how can this knowledge be induced from text? Prior work on inducing such scripts has relied, in one form or another, on measures of correlation between instances of events in a corpus. We argue on both conceptual and practical grounds that a purely correlation-based approach is insufficient, and instead propose an approach to script induction based on the causal effect between events, formally defined via interventions. Through both human and automatic evaluations, we show that the output of our method based on causal effects better matches the intuition of what a script represents.


Introduction
Commonsense knowledge of everyday situations, defined in terms of prototypical sequences of events, has long been held to play a major role in text comprehension and understanding (Abelson, 1975, 1977; Bower et al., 1979; Abbott et al., 1985). Naturally, this has motivated a large body of work looking to learn such knowledge from text corpora through data-driven approaches.
A minimal (and oftentimes implicit) preliminary requirement for any such approach is to provide a reasonable answer to the following: for any pair of events e_1 and e_2, what quantitative measure can be used to determine whether e_2 should follow e_1 in a commonsense scenario (a 'script')?
The initial work of Chambers and Jurafsky (2008, 2009) adopted point-wise mutual information (PMI) between events as an answer to the above. Later work in the same tradition employed probabilities from a
Figure 1: The events of Watching a sad movie, Eating popcorn, and Crying, may highly co-occur in a hypothetical corpus. What distinguishes valid event pair inferences (event pairs linked in a commonsense scenario; noted by checkmarks above) versus invalid inferences (noted by an 'X')?
Despite differences, these previous approaches largely follow the same underlying principle: a high enough value of the conditional probability p(e_2 | e_1) should indicate that e_2 should follow e_1 in a script. As with any measure, introspection is required: Does a measure rooted in p(e_2 | e_1) capture the notion of whether e_2 should follow e_1 in a script? We posit that it does not; observed correlations between events indicate relatedness, but relatedness is not the only factor in determining whether events form a meaningful script.
An example given in Ge et al. (2016) illustrates this point: a hurricane event may be prototypically connected with the event of donations coming in. Likewise, the hurricane event may also be connected to an evacuation. But the donation event is not connected to the evacuation event in the same sense (and vice-versa). Nevertheless, strong statistical associations will be built between the two. Figure 1 provides another example of this issue; clearly eating popcorn is not linked to crying. But if they were to co-occur together in a hypothetical corpus due to shared associations with the event of watching a sad movie, how could a measure based on conditional probability tell the difference? In this instance, even temporal information does not provide the answer. The problem is exacerbated when one considers that events (such as the hurricane) may not even be made fully explicit in corresponding text; they may only be strongly implied in some given context.
So what is a measure based on p(e_2 | e_1) missing?
In both examples, the 'invalid' inferences (let's say, inferring that e_2 = crying is linked with e_1 = eating popcorn) arise from the same underlying issue: observing the eating popcorn event raises the probability of crying, not due to the eating popcorn event itself, but because observing the popcorn event implies a context of possible prior events (like watching a sad movie) that by themselves do raise the probability of the crying event. To put it another way: the act of introducing a popcorn event in an ongoing discourse would in no scenario raise the probability/degree of belief in the crying event. Introducing the sad movie event would. Observing the popcorn event (or, to continue with the analogy, being told the event happened without further context) does raise this probability, but only by virtue of the shared link with the sad movie event. Clearly, the former relationship is more in line with the type of information we wish to extract, but p(e_2 | e_1) captures the latter by definition.
In this paper, we argue that capturing this former relationship (does introducing e_1 into a discourse raise the probability of e_2?) is essential for any method purporting to extract this flavor of script knowledge, on both conceptual and practical grounds.
On conceptual grounds, we posit that modeling this relationship better captures an important property that most events linked within a classical script possess: that they be causally linked, something underscored both in the original papers defining scripts and related works in psychology (Black and Bower, 1980; Trabasso and Sperry, 1985). We argue that the practical issues noted above are byproducts of ignoring this conceptual property: a mismatch between the knowledge we want to extract, and the measures we are using to extract it.
We show that this notion of 'introducing e_1 into the discourse' and its resultant effects on the probability of e_2 can be cleanly formalized as the distribution over e_2 under a particular intervention, a central object of study in the field of causal inference (Hernan and Robins, 2019). We contend that measures for extracting script events from text are more aptly based on this distribution.
The exact semantics of this intervention are unambiguously specified by a graphical causal model of our problem (Spirtes et al., 2000; Pearl, 2000), which we design utilizing insights from prior work in discourse processing. Under this model, we show how these intervention distributions can be defined and estimated from observational data. Using crowdsourced human evaluations and a variant of the automatic cloze evaluation, we show how this definition better captures the notion of script knowledge compared to prior standard measures, PMI and event sequence language models.

Motivation
Does the fact that event e_2 is often observed after e_1 in the data (i.e., p(e_2 | e_1) is "high") mean that e_2 prototypically follows e_1, in the sense of being part of a script? As an example of what we mean: the event of paying is expected to follow the event of eating while the event of running is not. In this section we argue that conditional probability is not sufficient for the purpose of extracting this information from text. We argue from a conceptual standpoint that some notion of causal relevance is required. We then give examples showing the practical pitfalls that may arise from ignoring this component. Finally, we propose our intervention-based definition for script events, and show how it both explicitly defines a notion of 'causal relevance,' while simultaneously fixing the aforementioned practical pitfalls.

The Significance of Causal Relevance
The original works defining scripts are unequivocal about the importance of causal linkage between script events, and other components of the original script definition (e.g. what-ifs, preconditions, postconditions, etc.) are arguably causal in nature. Early rule-based works on inducing scripts heavily utilize intuitively causal concepts in their schema representations (DeJong, 1983; Mooney and DeJong, 1985), as do related works in psychology looking at how humans store and utilize discourse information in memory (Black and Bower, 1980; Trabasso and Sperry, 1985; Trabasso and Van Den Broek, 1985; Van den Broek, 1990).
But any measure based solely on p(e_2 | e_1) is agnostic to notions of causal relevance. Does this matter in practice? A relatively high p(e_2 | e_1) indicates either: (1) a causal influence of e_1 on e_2, or (2) a common cause e_0 between the two, meaning the relation between e_1 and e_2 is mostly spurious. In the latter case, e_0 acts essentially as a confounder between e_1 and e_2.
Ge et al. (2016) acknowledge that the associations picked up by correlational measures may often be spurious (as seen in the example in the introduction). Their solution relies on using trends of words in a temporal stream of newswire data, and hence is fairly domain specific. In this work, we show how a more general solution may be arrived at by recognizing the problem as what it is: a confounding problem, and hence, a causal problem.
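The difference between cases (1) and (2) can be illustrated with a small simulation. The sketch below is a toy example only, not part of the proposed method: the function name `simulate` and all probabilities are assumptions chosen for illustration. A sad movie (e_0) raises the chance of both popcorn (e_1) and crying (e_2), while popcorn has no effect on crying.

```python
import random

def simulate(n=200_000, seed=0):
    """Toy world with a common cause: movie -> popcorn, movie -> cry.

    Returns (p_obs, p_do), where
      p_obs estimates p(cry | popcorn) from observational samples, and
      p_do  estimates p(cry | do(popcorn)), i.e. popcorn is forced to
            happen regardless of whether a sad movie is being watched.
    All probabilities below are illustrative assumptions.
    """
    rng = random.Random(seed)
    n_popcorn = 0
    n_cry_and_popcorn = 0
    n_cry_do = 0
    for _ in range(n):
        movie = rng.random() < 0.3
        # Observational world: popcorn depends on the movie.
        popcorn = rng.random() < (0.8 if movie else 0.1)
        cry = rng.random() < (0.6 if movie else 0.05)
        if popcorn:
            n_popcorn += 1
            n_cry_and_popcorn += cry
        # Interventional world: popcorn is set to True by fiat, so the
        # movie -> popcorn arrow is cut. Crying still depends only on movie.
        cry_do = rng.random() < (0.6 if movie else 0.05)
        n_cry_do += cry_do
    return n_cry_and_popcorn / n_popcorn, n_cry_do / n

p_obs, p_do = simulate()
print(f"p(cry | popcorn)     ~ {p_obs:.3f}")
print(f"p(cry | do(popcorn)) ~ {p_do:.3f}")
```

Under these assumed numbers the observational conditional is roughly twice the interventional one, even though popcorn never influences crying; the gap is produced entirely by the confounder.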

Defining Causal Relevance
Early works such as Schank and Abelson (1975) are vague with respect to the meaning of "causally chained." Can one say that watching a movie has causal influence on the subsequent event of eating popcorn happening? Furthermore, can this definition be operationalized in practice?
We argue that both of these questions may be elucidated by taking a manipulation-based view of causation. Roughly speaking, this view holds that a causal relationship is one that is "potentially exploitable for the purposes of manipulation and control" (Woodward, 2005). In other words, a causal relationship between A and B means that (in some cases) manipulating the value of A should result in a change in the value of B. A primary benefit of this view is that the meaning of a causal claim can be clarified by specifying what these 'manipulations' are exactly. We take this approach below to clarify what exactly is meant by 'causal relevance' between script events.
Imagine an agent reading a discourse. After reading a part of the discourse, the agent has some expectations for events that might happen next. Now imagine that, before the agent reads the next passage, we surreptitiously replace it with an alternate passage in which the event e_1 happens. We then allow the agent to continue reading. If e_1 is causally relevant to e_2, then this replacement should, in some contexts, raise the agent's degree of belief in e_2 happening next (compared to a case where we didn't intervene to make e_1 happen).
So, for example, if we replaced a passage such that e_1 = watching a movie was true, we could expect on average that the agent's degree of belief that e_2 = eating popcorn happens next will be higher. In this way, we say these events are causally relevant, and are for our purposes, script events. For event pairs that are not linked in a script, the opposite is true. There exist very few contexts in which replacing the passage with the popcorn event would raise the probability of crying.
With this little 'story,' we have clarified the conceptual notion of causal relevance in our problem, and connected it to the notion of "introducing e_1 into a discourse" described in the introduction. In the next section, we further formalize this story into a causal model, a necessary first step for anyone looking to compute causal effects from observed data.

Method
Here we define our causal model, show how the effects of interventions may be computed, and how these effects may be employed in extracting script-like associations between pairs of events.
To best contrast with prior work, we use the event representation of Chambers and Jurafsky (2008). Each event is a pair (p, d), where p is the event predicate (e.g. hit), and d is the dependency relation (e.g. nsubj) between the predicate and the protagonist entity. The protagonist is the entity that participates in every event in the considered event chain, e.g., the 'Bob' in the chain 'Bob sits, Bob eats, Bob pays.' We additionally make the oft-used simplifying assumption that document order is the same as temporal order. Future work can consider whether improvements over this assumption can be had via models for document timeline generation such as by Govindarajan et al. (2019).
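As a small illustration of this representation, the sketch below encodes each event as a (predicate, dependency) pair and reads off adjacent event pairs in document order. The names `Event` and `chain_to_pairs` are hypothetical and introduced only for this example.

```python
from typing import List, NamedTuple, Tuple

class Event(NamedTuple):
    """An atomic event type: predicate plus dependency to the protagonist."""
    predicate: str   # e.g. "hit"
    dependency: str  # e.g. "nsubj"

def chain_to_pairs(chain: List[Event]) -> List[Tuple[Event, Event]]:
    """Return adjacent (e_{i-1}, e_i) pairs, treating document order as temporal order."""
    return list(zip(chain, chain[1:]))

# The chain 'Bob sits, Bob eats, Bob pays', with Bob as protagonist.
chain = [Event("sit", "nsubj"), Event("eat", "nsubj"), Event("pay", "nsubj")]
pairs = chain_to_pairs(chain)
for prev, nxt in pairs:
    print(prev.predicate, "->", nxt.predicate)
```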

Defining a Causal Model
A causal model defines a set of causal assumptions that are needed when computing causal quantities (such as the effect of interventions). In this paper, we use the formalism of Causal Bayesian Networks (CBNs) (Spirtes et al., 2000; Pearl, 2000). Informally, a CBN can be thought of as a Bayesian network whose edges imply a direction of causal influence (though see both Pearl (2000) and Bareinboim et al. (2012) for formal characterizations). Our variable of interest is the categorical variable e_i ∈ E, where e_i indicates the identity of the i-th event mentioned in text, and E is the set of possible atomic event types (the predicate-dependency pairs described above). It is important to note that e_i does not represent a 'real world' event; it is solely a property of the text. This interpretation of the variable e_i is what is implicitly taken in prior work. In the context of the high level 'story' given in section 2, e_i represents the identity of the event that an agent would infer upon reading the text. [5]
To create our causal model, we must identify the factors that play a causal role in determining the value of e_i. In the graph, these variables will have a directed edge incident on e_i. We list these variables below, along with their meaning in italics. Variables in bold are those posited to have a direct causal effect on e_i:
T_i: All the text describing the i-th event. [6] Clearly the text corresponding to e_i directly affects it (e_i is, after all, the identity of the event that an agent would infer after reading the text T_i!). However, due to the ambiguity and vagueness of text, it obviously cannot fully determine it; further context may be needed.
[5] As we utilize automatic tools in this paper to extract the identities of events, it is important to note that there will be bias due to measurement error. Fortunately, there do exist methods in the causal inference literature that can adjust for this bias (Kuroki and Pearl, 2014; Miao et al., 2018). Wood-Doughty et al. (2018) derive equations in a case setting similar to ours (i.e., with measurement bias on the variable being intervened on). For now, we leave these efforts for future work.
[6] In this paper, we use the text output of PredPatt (White et al., 2016) as the textual representation of an event.
The identity of e_i does not only depend on T_i; the prior context will also play a role in identifying e_i. The prior context comes in two forms: (1) the text describing prior events in the discourse, T_0, ..., T_{i-1}, and (2) the identities of the previous events, e_0, ..., e_{i-1}. It is here that we make our largest causal assumption:
Assumption 1: Given the identities of the prior events, e_0, ..., e_{i-1}, and the current chunk of text T_i, the identity of e_i is invariant to changes (consistent with the given values of T_i and e_0, ..., e_{i-1}) in the previous textual content, T_0, ..., T_{i-1}. In other words, the identities of the prior events capture the relevant information needed from the prior text.
This assumption posits that the prior text affects e_i via only two causal paths: by influencing the current text that was written, T_i, or through its semantic content encapsulated by the identities of the prior events. Stylistic changes in T_0, ..., T_{i-1} that do not affect its core semantic content do not affect how we infer e_i, and hence, we do not include an arrow from T_0, ..., T_{i-1} to e_i.
While this assumption has intuitive appeal, it can also be justified by prior work in causal network theories of discourse processing (Trabasso and Sperry, 1985; Trabasso and Van Den Broek, 1985; Van den Broek, 1990). These theories hold that the causal network among events in a discourse is a primary part of how a read discourse is represented in human memory. One could read Assumption 1 along these lines: the prior chain of events is a sufficient representation of the prior discourse to allow reasoning about e_i.
Since we assume no direct causal influence from the prior text to e_i, it is clear that there must exist one between the prior events e_0, ..., e_{i-1} and e_i. For notational convenience, we represent the prior events as a single combinatorial variable M_{i-1}, and assume a direct arrow from M_{i-1} to e_i. We describe this variable below:
M_i: A variable taking a value in 2^E indicating all events, both described in text and left out, that happened prior to e_i. The prior chain of events provides the required context to, along with T_i, determine (up to noise) the identity of e_i. Note that this variable accounts for both events described previously in the text, and those not explicitly stated in the text (out-of-text events). The value of M_i is affected by M_{i-1}, e_{i-1}, and U, described next.
U: The World. An unknowable, immeasurable variable representing the context of the world in which the text was written.
A causal diagram given in Figure 2 gives a clear picture of the causal assumptions made for our problem. Solid arrows indicate posited causal dependencies. Bidirectional arrows indicate unknown dependencies (that is, we don't claim to know the causal dependencies between parts of the text; any configuration is possible).
We make one other assumption for practical reasons: M_i is restricted to only the previous 10 in-text events, and only contains out-of-text (inferred) events from step i.

Identifying Intervention Distributions
As specified by our story in section 2, our goal is to compute the effect that intervening and setting the preceding event e_{i-1} to k ∈ E has on the distribution over the subsequent event e_i. Now that we have a causal model in the form of Figure 2, we can meaningfully define this quantity. Using the notation of Pearl (2000), we write this as:
$$p(e_i \mid do(e_{i-1} = k)) \tag{1}$$
The semantics of do(e_{i-1} = k) are defined with respect to the graph, corresponding to a graph in which we have deleted the incoming arrows of e_{i-1} and set it to k (the dotted arrows in Figure 2). Before a causal query such as Eqn. 1 can be estimated we must first establish identifiability (Shpitser and Pearl, 2008): can the causal query be written as a function of (only) the observed data? Eqn. 1 is identified by noting that variables T_{i-1} and M_{i-1} meet the 'back-door criterion' of Pearl (1995), allowing us to write Eqn. 1 as the following:
$$p(e_i \mid do(e_{i-1} = k)) = \mathbb{E}_{T_{i-1}, M_{i-1}}\left[ p(e_i \mid e_{i-1} = k, M_{i-1}, T_{i-1}) \right] \tag{2}$$
Our next step will be estimating the above equation. If one has an estimate for the conditional p(e_i | e_{i-1}, M_{i-1}, T_{i-1}), then one may plug it into Eq 2 and use a Monte Carlo estimate to approximate the expectation (using samples of (T, M) from our dataset). This leads to a simple estimator called a plug-in estimator, and is what we utilize here.
It is important to be aware of the fact that this estimator, specifically when plugging in machine learning methods, is quite naive (e.g., Chernozhukov et al. (2018)), and will suffer from an asymptotic (first order) bias which prevents one from constructing meaningful confidence intervals or performing certain hypothesis tests. That said, in practice these machine learning based plug-in estimators can achieve quite reasonable performance (see for example, the results in Shalit et al. (2017)), and since our current use case can in some sense be validated empirically (quite the rare occurrence), we save the utilization of more sophisticated estimators for future work.
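The plug-in idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the paper: `plugin_do_distribution`, `toy_cond_prob`, and the toy contexts are hypothetical, and the conditional here is a hand-written function rather than a learned model. The estimator fixes e_{i-1} = k and averages the conditional over contexts (M, T) sampled from data.

```python
def plugin_do_distribution(cond_prob, k, contexts, event_types):
    """Plug-in estimate of p(e_i | do(e_{i-1} = k)).

    cond_prob(e, k, m, t): estimate of p(e_i = e | e_{i-1} = k, M_{i-1} = m, T_{i-1} = t)
    contexts: list of (m, t) pairs sampled from the dataset
    event_types: iterable of candidate values for e_i
    """
    totals = {e: 0.0 for e in event_types}
    for m, t in contexts:
        for e in totals:
            totals[e] += cond_prob(e, k, m, t)
    n = len(contexts)
    # Monte Carlo average over the sampled contexts.
    return {e: total / n for e, total in totals.items()}

def toy_cond_prob(e, k, m, t):
    """Assumed toy conditional: crying depends only on a sad movie in the prior events."""
    p_cry = 0.6 if "watch_sad_movie" in m else 0.05
    return p_cry if e == "cry" else 1.0 - p_cry

contexts = [
    (frozenset({"watch_sad_movie"}), "text a"),
    (frozenset(), "text b"),
    (frozenset(), "text c"),
    (frozenset(), "text d"),
]
dist = plugin_do_distribution(toy_cond_prob, "eat_popcorn", contexts, ["cry", "other"])
print(dist)
```

Because the toy conditional ignores k, intervening on the popcorn event leaves the estimate for crying at its context-averaged value, which is the behavior the back-door adjustment is meant to expose.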

Estimating the Needed Conditional
Eq 2 has a dependency on the conditional, p_{e_i} = p(e_i | e_{i-1}, M_{i-1}, T_{i-1}), which we estimate via standard machine learning techniques using a dataset of samples drawn from p(e_i, e_{i-1}, M_{i-1}, T_{i-1}). There are two issues to deal with here: (1) How to deal with out-of-text events in M_{i-1}? (2) What form will p_{e_i} take?
Dealing with Out-of-Text Events. Recall that M_i is a 'bag' of all previous events, both those that occur in the text, M^I_i, and those that are implicit and not in the text, M^O_i. To learn a model for p_{e_i} we require samples from the full joint (which includes M^O_i), though we only have access to p(e_i, e_{i-1}, M^I_{i-1}, T_{i-1}). If, for the samples in our current dataset, we could draw samples from p_M = p(M^O_{i-1} | e_i, e_{i-1}, M^I_{i-1}, T_{i-1}), we would obtain a dataset with samples drawn from the full joint.
In order to 'draw' samples from p_M we employ human annotation. Annotators are presented with a human readable form of (e_i, e_{i-1}, M^I_{i-1}, T_{i-1}) and are asked to annotate for possible events belonging in M^O_{i-1}. Rather than opt for noisy annotations obtained via freeform elicitation, we instead provide users with a set of 6 candidate choices for members of M^O_{i-1}. The candidates are obtained from various knowledge sources: ConceptNet (Speer and Havasi, 2012), VerbOcean (Chklovski and Pantel, 2004), and high PMI events from the NYT Gigacorpus (Graff et al., 2003). The top two candidates are selected from each source.
In a scheme similar to Zhang et al. (2017), we ask users to rate candidates on an ordinal scale and consider candidates rated above a certain value to be within M^O_{i-1}. We found annotator agreement to be quite high, with a Krippendorff's α of 0.79. Under this scheme, we crowdsourced a dataset of 2000 fully annotated examples on the Mechanical Turk platform. An image of our annotation interface is provided in the Appendix.
The Conditional Model. We opt to use neural networks to model p_{e_i}. In order to deal with the small amount of fully annotated data available, we employ a finetuning paradigm. We first train a model on a large dataset that does not include an-

Extracting Script Knowledge
Provided a model of the conditional p_{e_i}, we can estimate p(e_i | do(e_{i-1} = k)) by Eq 2. We evaluate the expectation by Monte Carlo, taking our annotated dataset of N = 2000 examples and computing the following average:
$$C_k = \frac{1}{N} \sum_{j=1}^{N} p(e_i \mid e_{i-1} = k, M^{(j)}_{i-1}, T^{(j)}_{i-1}) \tag{3}$$
where (M^{(j)}_{i-1}, T^{(j)}_{i-1}) is the context of the j-th annotated example. This gives us a vector C_k ∈ R^{|E|} whose l-th component, C_{kl}, gives p(e_i = l | do(e_{i-1} = k)). We compute this vector for all values of k (this computation only needs to be done once).
There are several ways one could extract script-like knowledge using this information. In this paper, we define a normalized score over intervened-on events such that the script compatibility score between two concurrent events is defined as:
$$S(e_{i-1} = k, e_i = l) = \frac{C_{kl}}{\sum_{j \in E} C_{jl}} \tag{4}$$
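As a hedged illustration of how such a score might be computed from the vectors C_k, the sketch below assumes one plausible reading of "normalized over intervened-on events": each entry C_{kl} is divided by the sum of column l over all intervened-on events k. The function names and the toy matrix are hypothetical and are not taken from the paper.

```python
def compatibility_scores(C):
    """Normalize C[k][l] ~ p(e_i = l | do(e_{i-1} = k)) over intervened-on events k.

    Assumed form for this sketch: S[k][l] = C[k][l] / sum_j C[j][l].
    """
    causes = list(C)
    effects = list(C[causes[0]])
    col_sums = {l: sum(C[k][l] for k in causes) for l in effects}
    return {
        k: {l: (C[k][l] / col_sums[l] if col_sums[l] > 0 else 0.0) for l in effects}
        for k in causes
    }

def top_causes(S, target, n=2):
    """Return the n intervened-on events with the highest score for `target`."""
    return sorted(S, key=lambda k: S[k][target], reverse=True)[:n]

# Toy interventional distributions (each row sums to 1); values are illustrative.
C = {
    "watch_sad_movie": {"cry": 0.6, "pay": 0.1, "other": 0.3},
    "eat_popcorn":     {"cry": 0.2, "pay": 0.1, "other": 0.7},
    "eat_meal":        {"cry": 0.2, "pay": 0.6, "other": 0.2},
}
S = compatibility_scores(C)
print(top_causes(S, "cry", n=1))
print(top_causes(S, "pay", n=1))
```

Normalizing per target event keeps an effect that is likely under every intervention from dominating the ranking, since only its relative increase under a given intervention is rewarded.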

Experiments and Evaluation
Automatic evaluation of methods that extract script-like knowledge is an open problem that we do not attempt to tackle here, relying foremost on crowdsourced human evaluations to validate our method. However, as we aim to provide a contrast to prior script-induction approaches, we perform an experiment looking at a variant of the popular, but known to be flawed (Chambers, 2017), automatic narrative cloze evaluation, in which the cloze test set is increasingly filtered to remove instances whose answers are high frequency events.

Dataset
For these experiments, we use the Toronto Books corpus (Zhu et al., 2015;Kiros et al., 2015), a collection of fiction novels spanning multiple genres. The original corpus contains 11,040 books by unpublished authors. We remove duplicate books from the corpus (by exact file match), leaving a total of 7,101 books. The books are assigned randomly to train, development, and test splits in 90%-5%-5% proportions. Each book is then run through a pipeline of tokenization with CoreNLP 3.8 (Manning et al., 2014), parsing with CoreNLP's universal dependency parser (Nivre et al., 2016)

Baselines
In this paper, we compare against the two dominant approaches for script induction (under an atomic event representation): PMI (similar to Chambers and Jurafsky (2008, 2009)) and LMs over event sequences (Rudinger et al., 2015; Pichotta and Mooney, 2016). We defer definitions for these models to the cited papers; below we provide the relevant details for each baseline, with further training details provided in the Appendix. For computing PMI we follow many of the details from Jans et al. (2012). Due to the nature of the evaluations, we utilize their 'ordered PMI' variant. Also like Jans et al. (2012), we use skip-bigrams with a window of 2 to deal with count sparsity. Consistent with prior work we additionally employ the discount score of Pantel and Ravichandran (2004). For the LM, we use a standard, 2 layer, GRU-based neural network language model, with 512 dimensional hidden states, trained on a log-likelihood objective.
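For readers unfamiliar with the PMI baseline, the following is a minimal sketch of ordered PMI over skip-bigrams. It is an illustration under assumptions, not the exact baseline configuration: the parameter `max_skip` reads "a window of 2" as allowing up to two intervening events, the discount score is omitted, and the function names are hypothetical.

```python
import math
from collections import Counter

def ordered_skip_bigrams(chain, max_skip=2):
    """Yield ordered pairs (a, b) where b follows a with at most max_skip events between."""
    for i, a in enumerate(chain):
        for j in range(i + 1, min(i + 2 + max_skip, len(chain))):
            yield a, chain[j]

def ordered_pmi(chains, max_skip=2):
    """Compute PMI(a, b) = log p(a, b) / (p(a) p(b)) over ordered skip-bigram counts."""
    pair_counts = Counter()
    first_counts = Counter()   # how often an event appears as the earlier member
    second_counts = Counter()  # how often an event appears as the later member
    for chain in chains:
        for a, b in ordered_skip_bigrams(chain, max_skip):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
            second_counts[b] += 1
    total = sum(pair_counts.values())
    return {
        (a, b): math.log(c * total / (first_counts[a] * second_counts[b]))
        for (a, b), c in pair_counts.items()
    }

chains = [["eat", "pay"], ["eat", "pay"], ["sit", "run"]]
pmi = ordered_pmi(chains)
print(pmi[("eat", "pay")], pmi[("sit", "run")])
```

Because the counts are ordered, a pair such as (pay, eat) receives no score here unless it is actually observed in that order.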

Eval I: Pairwise Event Associations
Any system aimed at extracting script-like knowledge should be able to answer the following abductive question: given an event e_i happened, what previous event e_{i-1} best explains why e_i is true? In other words, what e_{i-1}, if it were true, would maximize my belief that e_i was true? We evaluate each method's ability to do this via a human evaluation.
On each task, annotators are presented with six event pairs (e_{i-1}, e_i), where e_i is the same for all pairs, but e_{i-1} is generated by one of the three systems. Similar to the human evaluation in Pichotta and Mooney (2016), we filter out outputs in the top-20 most frequent events list for all systems.
For each system, we pick the top two events that maximize S(·, e_i), PMI(·, e_i), and p_lm(·, e_i), for the Causal, PMI, and LM systems respectively, and present them in random order. For each pair, users are asked to provide a scalar annotation (from 0%-100%, via a slider bar) on the chance that e_i is true afterwards or happened as a result of e_{i-1}. The annotation scheme is modeled after the one presented in Sakaguchi and Van Durme (2018), and shown to be effective for paraphrase evaluation in Hu et al. (2019). Example outputs for systems are provided for several e_1 choices for this task in Table 2. The evaluation is done for 150 randomly chosen instances of e_i, each with 6 candidate e_{i-1}. We have two annotators provide annotations for each task, and similar to Hu et al. (2019), average these annotations together for a gold annotation.
Table 2: Example system outputs for several target events.
Causal       LM         PMI          Target
X tripped    X came     X featured   X fell
X lit        X sat      X laboured   X inhaled
X aimed      X came     X alarmed    X fired
X poured     X nodded   X credited   X refilled
X radioed    X made     X fostered   X ordered
In Table 1 we provide the results of the experiment, providing both the average annotation score for the outputs of each system, as well as the average relative ranking (with a rank of 6 indicating the annotators ranked the output as the highest/best in the task, and a rank of 1 indicating the opposite). We find that annotators consistently rated the Causal system higher. The differences (in both Score and Rank) between the Causal system and the next best system are significant under a Wilcoxon signed-rank test (p < 0.01).

Eval II: Event Chain Completion
Of course, while pairwise knowledge between events is a minimum prerequisite, we would also like to generalize to handle chains of events containing multiple events (in our case, essentially equivalent to the 'narrative chains' studied in Chambers and Jurafsky (2008)). In this section, we look at each system's ability to provide an intuitive completion to an event chain. More specifically, the model is provided with a chain of three context events, (e_1, e_2, e_3), and is tasked with providing a suitable e_4 that might follow given the first three events. We evaluate each method's ability to do this via a human evaluation.
Since both PMI and the Causal model [15] work only as pairwise models, we adopt the method of Chambers and Jurafsky (2008) for chains. For both the PMI and Causal model, we pick the e_4 that maximizes
$$\frac{1}{3} \sum_{i=1}^{3} \mathrm{Score}(e_i, e_4)$$
where Score is either PMI or Eq 4. The LM model chooses an e_4 that maximizes the joint over all events.
Our annotation task is similar to the one in Section 4.3, except the pairs provided consist of a context (e_1, e_2, e_3) and a system generated e_4. Each system generates its top choice for e_4, giving annotators 3 pairs [16] to annotate for each task (i.e. each context). On each task, human annotators are asked to provide a scalar annotation (from 0%-100%, via a slider) on the chance that e_4 is true afterwards or happened as a result of the chain of context events. The evaluation is done for 150 tasks, with two annotators on each task. As before, we average these annotations together for a gold annotation.
In Table 3 we provide results of the experiment. Note that the rankings are now from 1 to 3 (higher is better). We find annotators usually rated the Causal system higher, though the LM model is much closer in this case. The differences (in both Score/Rank) between the Causal and LM system outputs are not significant under a Wilcoxon signed-rank test, though the differences between the Causal and PMI systems are (p < 0.01). The fact that the pairwise Causal model is still able to (at minimum) match the full sequential model on a chain-wise evaluation speaks to the robustness of the event associations mined from it, and further motivates work in extending the method to the sequential case.
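The averaging scheme for chains can be sketched as follows. This is a minimal illustration: `complete_chain` and the toy score table are hypothetical, and `score` stands in for whichever pairwise measure (PMI or the Causal score) is being used.

```python
def complete_chain(context, candidates, score):
    """Return the candidate e_4 with the highest mean pairwise score against the context."""
    def mean_score(e4):
        return sum(score(e, e4) for e in context) / len(context)
    return max(candidates, key=mean_score)

# Toy pairwise scores; the values are illustrative assumptions.
toy_scores = {
    ("sit", "pay"): 0.1, ("order", "pay"): 0.5, ("eat", "pay"): 0.6,
    ("sit", "run"): 0.2, ("order", "run"): 0.0, ("eat", "run"): 0.1,
}

def toy_score(a, b):
    return toy_scores.get((a, b), 0.0)

best = complete_chain(["sit", "order", "eat"], ["pay", "run"], toy_score)
print(best)
```

Averaging over context events tends to favor candidates that are moderately compatible with every context event over candidates strongly tied to only one of them.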

Diversity of System Outputs
But what type of event associations are found from the Causal model? As noted both in Rudinger et al. (2015) and in Chambers (2017), PMI based approaches can often extract intuitive event relationships, but may sometimes overweight low frequency events or suffer problems from count sparsity. LM based models, on the other hand, were noted for their preference towards boring, uninformative, high frequency events (like 'sat' or 'came'). So where does the Causal model lie on this scale? We study this by looking at the percentage of unique words used by each system in the previous evaluations, presented in Table 5. Unsurprisingly, we find that PMI chooses a new word to output often (77%-84% of the time), while the LM model very rarely does (only 7%-13%). The Causal model, while not as adventurous as the PMI system, tends to produce very diverse output, generating a new output 60%-76% of the time. Both the PMI and Causal system produce relatively less diverse output on the chain task, which is expected due to the averaging scheme used by each to select events.
[Table: Method | Pairwise | Chain]
[15] Generalizing the Causal model to multiple interventions, though out of scope here, is a clear next step for future work.
[16] We found providing six pairs per task to be overwhelming given the longer context.
Qualitatively looking at the output, it appears that the Causal model indeed produces answers similar to the 'good' outputs of the PMI system, while also being more robust to noise due to sparse counts. The two most frequently output events of each system for both annotations are provided to illustrate this in Table 4. See also the model outputs in Table 2.
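One simple way to quantify this kind of output diversity is the share of distinct items among a system's outputs. The sketch below assumes that reading of "percentage of unique words"; `percent_unique` and the sample outputs are hypothetical.

```python
def percent_unique(outputs):
    """Percentage of distinct items among a system's outputs (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    return 100.0 * len(set(outputs)) / len(outputs)

# Illustrative outputs only: a repetitive system versus a varied one.
repetitive = ["sat", "came", "sat", "sat"]
varied = ["tripped", "lit", "aimed", "poured"]
print(percent_unique(repetitive), percent_unique(varied))
```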

Infrequent Narrative Cloze
The narrative cloze task, or some variant of it, has remained a popular automatic test for systems aiming to extract 'script' knowledge. The task is usually formulated as follows: given a chain of events e_1, ..., e_{n-1} that occurs in the data, predict the held out next event that occurs in the data, e_n. There exist various measures to calculate a model's ability to perform in this task, but arguably the most used one is the Recall@N measure introduced in Jans et al. (2012). Recall@N works as follows: for a cloze instance, a system will return the top N guesses for e_n. Recall@N is the percentage of times e_n is found anywhere in the top N list.
Table 6: Recall@100 Narrative Cloze Results. < C indicates that instances whose cloze answer is one of the top C most frequent events are not evaluated on.
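The Recall@N computation described above can be written compactly. In this sketch, `recall_at_n`, `rank_events`, and the toy ranker are hypothetical names; `rank_events(context)` is assumed to return a system's guesses for e_n ordered from best to worst.

```python
def recall_at_n(instances, rank_events, n=100):
    """Percentage of cloze instances whose held-out event is in the top-n guesses.

    instances: list of (context, answer) pairs, where context is e_1..e_{n-1}
    rank_events: function mapping a context to a best-first list of guesses
    """
    if not instances:
        return 0.0
    hits = sum(1 for context, answer in instances if answer in rank_events(context)[:n])
    return 100.0 * hits / len(instances)

def toy_ranker(context):
    """A context-independent ranker that always prefers frequent events."""
    return ["came", "sat", "pay", "drink"]

instances = [(("eat",), "pay"), (("sit",), "drink"), (("run",), "fall")]
print(recall_at_n(instances, toy_ranker, n=3))
print(recall_at_n(instances, toy_ranker, n=4))
```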
The automatic version of the cloze task has notable limitations. As noted in Rudinger et al. (2015), the cloze task is essentially a language modeling task; it measures how well a model fits the data. The question then becomes whether data fit implies valid script knowledge was learned. The work of Chambers (2017) casts serious doubts on this, with various experiments showing automatic cloze evaluations are biased to high frequency, uninformative events, as opposed to informative, core, script events. They further posit human annotation as a necessary requirement for evaluation.
In this experiment, we provide another datapoint for the inadequacy of the automatic cloze, while simultaneously showing the relative robustness of the knowledge extracted from our Causal system. For the experiment, we make the following assumptions: (1) Highly frequent events tend to appear in many scenarios, and hence are less likely to be a informative 'core' event for a script (such as 'pay' or 'shoot'), and (2) Less frequent events are more likely to appear only in specific scenarios, and are thus more likely to be informative events. If these are true, then a system that has extracted useful script knowledge should keep (or even improve) performance on the cloze when the correct answer for e n is a less frequent event.
We thus propose an Infrequent Cloze task. In this task we create a variety of different cloze datasets (each with 2000 instances) from our test set. Each set is indexed by a value C, such that the indicated dataset does not include instances whose answer is one of the top C most frequent events (C = 0 is the normal cloze setting). We compute Recall@100 on 7 such sets with various values of C and report results in Table 6.
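The construction of one Infrequent Cloze set can be sketched as below. This is an illustrative sketch only: the function name, the representation of an instance as a (context chain, answer event) pair, the random sampling down to 2000 instances, and the source of the frequency counts are all assumptions, since the text does not specify them.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple

ClozeInstance = Tuple[Sequence[str], str]  # (context chain, held-out answer)


def infrequent_cloze_set(instances: List[ClozeInstance],
                         event_counts: Counter,
                         c: int,
                         size: int = 2000,
                         seed: int = 0) -> List[ClozeInstance]:
    """Keep instances whose answer is NOT among the top-c most frequent events.

    With c = 0 nothing is filtered, which is the normal cloze setting.
    """
    top_c = {event for event, _ in event_counts.most_common(c)}
    kept = [inst for inst in instances if inst[1] not in top_c]
    # Assumption: sample down to `size` instances uniformly at random.
    rng = random.Random(seed)
    rng.shuffle(kept)
    return kept[:size]
```

Calling this function once per value of C and scoring each resulting set with Recall@100 would give one row of results per model.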
At C = 0, as expected, the LM model is vastly superior. The performance of the LM model drops drastically, however, as soon as C increases, indicating an overreliance on prior probability. The LM performance drops below 2% once C = 200, indicating almost no ability to predict informative events such as drink or pay, both of which occur in this set in our case.
The performance of the PMI and Causal models, on the other hand, steadily improves as C increases, with the Causal model consistently outperforming PMI. This result, when combined with the results of the human evaluation, gives further evidence of the relative robustness of the Causal model in extracting informative core events. The precipitous drop in performance of the LM further underscores problems that a naive automatic cloze evaluation may cover up.

Related Work
Our work looks at script-like associations between events in a manner similar to Chambers and Jurafsky (2008), and works along similar lines (Jans et al., 2012; Pichotta and Mooney, 2016). Related lines of work exist, such as work using generative models to induce probabilistic schemas (Chambers, 2013; Cheung et al., 2013; Ferraro and Van Durme, 2016), and other work showing how script knowledge may be mined from user-elicited event sequences (Regneri et al., 2010; Orr et al., 2014). The cognitive linguistics literature is rich with work studying the role of causal semantics in linguistic constructions and argument structure (Talmy, 1988; Croft, 1991, 2012), as well as the causal semantics of lexical items themselves (Wolff and Song, 2003; Wolff, 2007). Work in the NLP literature on extracting causal relations has benefited from this line of work, utilizing the systematic way in which causation is expressed in language to mine relations (Girju and Moldovan, 2002; Girju, 2003; Blanco et al., 2008; Do et al., 2011; Bosselut et al., 2019). This line of work aims to extract causal relations between events that are in some way explicitly expressed in the text (e.g. through the use of particular constructions). Taking advantage of how causation is expressed in language may benefit our causal model, and is a potential path for future work.

Conclusions and Future Work
In this work we argued for a causal basis in script learning. We showed how this causal definition could be formalized and used in practice utilizing the tools of causal inference, and verified our method with human and automatic evaluations. In the current work, we showed a method for calculating the 'goodness' of a script in the simplest case, between pairs of events, which we showed to still be quite useful. A causal definition is in no way limited to this pairwise case, and future work may generalize it to the sequential case or to event representations that are compositional (for example, by performing multiple interventions). Having a causal model shines a light on the assumptions made here, and indeed, future work may further refine or overhaul them, a process which may in turn clarify the nature of the knowledge we are after.
For these experiments, we use the Toronto Books corpus (Zhu et al., 2015; Kiros et al., 2015), a collection of fiction novels spanning multiple genres. The original corpus contains 11,040 books by unpublished authors. We remove duplicate books from the corpus (by exact file match), leaving a total of 7,101 books; a distribution by genre is provided in Table 7. The books are assigned randomly to train, development, and test splits in 90%-5%-5% proportions (6,405 books in train, and 348 in development and test splits each). Each book is then sentence-split and tokenized with CoreNLP 3.8 (Manning et al., 2014); these sentence and token boundaries are observed in all downstream processing.

Narrative Chain Extraction Pipeline
In order to extract the narrative chains from the Toronto Books data, we implement the following pipeline. First, we note that coreference resolution systems are trained on documents much smaller than full novels (Pradhan et al., 2012); to accommodate this limitation, we partition each novel into non-overlapping windows that are 100 sentences in length, yielding approximately 400,000 windows in total. We then run CoreNLP's universal dependency parser (Nivre et al., 2016;Chen and Manning, 2014), part of speech tagger (Toutanova et al., 2003), and neural coreference resolution system (Clark and Manning, 2016a,b) over each window of text. For each window, we select the longest coreference chain and call the entity in that chain the "protagonist," following Chambers and Jurafsky (2008).
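The windowing and protagonist-selection steps can be sketched as follows. This is a hedged illustration, not the actual pipeline code: the function names are hypothetical, a coreference chain is represented simply as a list of mentions, "longest" is assumed to mean the chain with the most mentions, and the final window of a book is assumed to be allowed to contain fewer than 100 sentences.

```python
from typing import Dict, Hashable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def partition_into_windows(sentences: Sequence[T],
                           window_size: int = 100) -> List[Sequence[T]]:
    """Split a book into consecutive, non-overlapping windows of sentences.

    Assumption: a trailing window shorter than window_size is kept.
    """
    return [sentences[i:i + window_size]
            for i in range(0, len(sentences), window_size)]


def select_protagonist(coref_chains: Dict[Hashable, list]) -> Optional[Hashable]:
    """Return the id of the chain with the most mentions, or None if empty.

    coref_chains maps a chain id to the list of its mentions in one window.
    """
    if not coref_chains:
        return None
    return max(coref_chains, key=lambda chain_id: len(coref_chains[chain_id]))
```

In this sketch the coreference system would be run separately on each window returned by `partition_into_windows`, and `select_protagonist` applied to its output for that window.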
We feed the resulting universal dependency (UD) parses into PredPatt (White et al., 2016), a rule-based predicate-argument extraction system that runs over universal dependency parses. From PredPatt output, we extract predicate-argument edges, i.e., a pair of token indices in a given sentence where the first index is the head of a predicate, and the second index is the head of an argument to that predicate. Edges with non-verbal predicates are discarded.
At this stage in the pipeline, we merge information from the coreference chain and predicate-argument edges to determine which events the protagonist is participating in. For each predicate-argument edge in every sentence, we discard it if the argument index does not match the head of a protagonist mention. Each of the remaining predicate-argument edges therefore represents an event that the protagonist participated in.
With a list of PredPatt-determined predicate-argument edges (and their corresponding sentences), we are now able to extract the narrative event representations, (p, d). For p, we take the lemma of the (verbal) predicate head. For d, we take the dependency relation type (e.g., nsubj) between the predicate head and argument head indices (as determined by the UD parse); if a direct arc relation does not exist, we instead take the unidirectional dependency path from predicate to argument; if a unidirectional path does not exist, we use a generic "arg" relation.
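The three-way fallback used to choose d can be sketched as below. The parse representation (parallel lists of head indices and relation labels, with -1 marking the root), the function names, and the ">" separator used to join relations along a path are assumptions made for illustration; the text does not specify how a dependency path is serialized.

```python
from typing import List, Tuple


def dependency_label(pred: int, arg: int,
                     heads: List[int], deprels: List[str]) -> str:
    """Choose the relation d between a predicate head and an argument head.

    heads[i] is the index of token i's head (-1 for the root);
    deprels[i] is the relation label on the arc from heads[i] to token i.
    """
    # Case 1: a direct arc from predicate to argument.
    if heads[arg] == pred:
        return deprels[arg]
    # Case 2: a downward (unidirectional) path from predicate to argument,
    # found by walking up from the argument until the predicate is reached.
    path, node, seen = [], arg, set()
    while node != -1 and node not in seen:
        seen.add(node)
        path.append(deprels[node])
        node = heads[node]
        if node == pred:
            return ">".join(reversed(path))
    # Case 3: no unidirectional path exists.
    return "arg"


def make_event(lemmas: List[str], pred: int, arg: int,
               heads: List[int], deprels: List[str]) -> Tuple[str, str]:
    """Build the (p, d) event: predicate lemma plus relation to the argument."""
    return lemmas[pred], dependency_label(pred, arg, heads, deprels)
```

For a toy parse of "She wanted to eat" with heads `[1, -1, 3, 1]` and relations `["nsubj", "root", "mark", "xcomp"]`, the pair (wanted, She) takes the direct-arc case, while (eat, She) has no downward path and falls back to the generic relation.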
To extract a factuality feature for each narrative event (i.e. whether the event happened or not, according to the meaning of the text), we use the neural model of Rudinger et al. (2018a). As input to this model, we provide the full sentence in which the event appears, as well as the index of the event predicate's head token. The model returns a factuality score on a [−3, 3] scale, which is then discretized using the following intervals: [1, 3] is "positive" (+), (−1, 1) is "uncertain," and [−3, −1] is "negative" (−).
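The discretization of the factuality score follows directly from the stated intervals and can be written as a short Python function. The function name and the choice to reject out-of-range scores with an error are assumptions for illustration.

```python
def discretize_factuality(score: float) -> str:
    """Map a factuality score in [-3, 3] to one of three labels.

    [1, 3]   -> "positive"
    (-1, 1)  -> "uncertain"
    [-3, -1] -> "negative"
    """
    if not -3.0 <= score <= 3.0:
        # Assumption: scores outside the stated scale are treated as errors.
        raise ValueError(f"factuality score out of range: {score}")
    if score >= 1.0:
        return "positive"
    if score <= -1.0:
        return "negative"
    return "uncertain"
```

Note that the boundary values 1 and −1 fall in the closed "positive" and "negative" intervals respectively, so only scores strictly between them are labeled "uncertain".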
From this extraction pipeline, we obtain one sequence of narrative events (i.e., narrative chain) per text window.

RNN Encoder
We use a single-layer GRU-based RNN encoder with a 300-dimensional hidden state and 300-dimensional input event embeddings to encode the previous events into a single 300-dimensional vector.

CNN Encoder
We use a CNN to encode the text into a 300 dimensional output vector. The CNN uses 4 filters with ngram windows of (2, 3, 4, 5) and max pooling.

Training Details - Pretraining
The conditional for the Causal model is trained using Adam with a learning rate of 0.001, gradient clipping at 10, and a batch size of 512. The model is trained to minimize cross entropy loss. We train the model until loss on the validation set does not go down after three epochs, after which we keep the model with the best validation performance, which in our case was epoch 4.
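The stopping rule described above can be sketched as a simple patience-based check over per-epoch validation losses. This sketch reads "does not go down after three epochs" as "three consecutive epochs without a new best validation loss", which is an assumption; the function name is hypothetical, and the actual training loop, model, and optimizer are omitted.

```python
from typing import Sequence, Tuple


def best_epoch_with_patience(val_losses: Sequence[float],
                             patience: int = 3) -> Tuple[int, int]:
    """Simulate early stopping over a sequence of validation losses.

    Returns (best_epoch, stopped_at), both 1-indexed. Training stops once
    `patience` epochs have passed since the best validation loss was seen;
    the checkpoint from best_epoch is the one that would be kept.
    """
    best_loss = float("inf")
    best_epoch = 0
    stopped_at = 0
    for epoch, loss in enumerate(val_losses, start=1):
        stopped_at = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, stopped_at
```

Under this reading, a run whose best checkpoint is epoch 4 would stop after epoch 7 and restore the epoch 4 weights.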

Training Details - Finetuning
The model is then finetuned on our dataset of 2000 annotated examples. We use the same objective as above, training using Adam with a learning rate of 0.00001, gradient clipping at 10, and a batch size of 512. We split our 2000 samples into a train set of 1800 examples and a dev set of 200 examples. We train the model in a way similar to above, keeping the best validation model (at epoch 28).

Training and Model Details - LM Baseline
We use a 2-layer GRU-based RNN encoder with a 512-dimensional hidden state and 300-dimensional input event embeddings as our baseline event sequence LM model.

Training Details
The LM model is trained using Adam with a learning rate of 0.001, gradient clipping at 10, and a batch size of 64. We found using dropout at the embedding layer and the output layers to be helpful (with dropout probability of 0.1). The model is trained to minimize cross entropy loss. We train the model until loss on the validation set does not go down after three epochs, after which we keep the model with the best validation performance, which in our case was epoch 5.

Annotation Interfaces
To give an idea of the annotation setups used here, we also provide screenshots of the annotation interfaces for all three annotation experiments. The out-of-text annotation experiment of Section 3.3 is shown in Figure 3. The pairwise annotation evaluation of Section 4.3 is shown in Figure 4. The chain completion annotation evaluation of Section 4.4 is shown in Figure 5.