Inferring Narrative Causality between Event Pairs in Films

To understand narrative, humans draw inferences about the underlying relations between narrative events. Cognitive theories of narrative understanding define these inferences as four different types of causality, ranging from pairs of events A, B where A physically causes B (X drop, X break) to pairs of events where A causes emotional state B (Y saw X, Y felt fear). Previous work on learning narrative relations from text has either focused on "strict" physical causality, or has been vague about what relation is being learned. This paper learns pairs of causal events from a corpus of film scene descriptions, which are action rich and tend to be told in chronological order. We show that event pairs induced using our methods are of high quality and are judged to have a stronger causal relation than event pairs from Rel-grams.


Introduction
Telling and understanding stories is a central part of human experience, and many types of human communication involve narrative structures. Theories of narrative posit that NARRATIVE CAUSALITY underlies human understanding of a narrative (Warren et al., 1979; Trabasso et al., 1989; Van den Broek, 1990). However, previous computational work on narrative schemas, scripts or event schemas learns "collections of events that tend to co-occur" (Chambers and Jurafsky, 2008; Balasubramanian et al., 2013; Pichotta and Mooney, 2014), rather than causal relations between events (Rahimtoroghi et al., 2016). Another limitation of previous work is that it has mostly been applied to newswire, limiting what is learned to relations between newsworthy events, rather than everyday events (Rahimtoroghi et al., 2016; Hu et al., 2013; Beamer and Girju, 2009; Manshadi et al., 2008).
Our focus here is on NARRATIVE CAUSALITY (Trabasso et al., 1989; Van den Broek, 1990), the four different relations posited by narrative theories to underlie narrative coherence:
• PHYSICAL: Event A physically causes event B to happen.
• MOTIVATIONAL: Event A happens with B as a motivation.
• PSYCHOLOGICAL: Event A brings about emotions (expressed in event B).
• ENABLING: Event A creates a state or condition for B to happen. A enables B.
Previous work on learning causal relations has primarily focused on physical causality (Riaz and Girju, 2010;Beamer and Girju, 2009), while our aim is to learn event pairs manifesting all types of narrative causality, and test their generality as a source of causal knowledge. We posit that film scene descriptions are a good resource for learning narrative causality because they are: (1) action rich; (2) about everyday events; and (3) told in temporal order, providing a primary cue to causality (Beamer and Girju, 2009;Hu et al., 2013).
Film scenes contain many descriptions encoding PHYSICAL CAUSALITY, e.g. in Fig. 1, Scene 1, Frodo grabs Pippin's sleeve, causing Pippin to spill his beer (grab - spill). Pippin then pushes Frodo away, causing Frodo to stumble backwards and fall to the floor (push - stumble, stumble - fall, and push - fall). But they also contain all other types of narrative causality: in Scene 2, Gandalf has to stoop, because he wants to avoid hitting his head on the low ceiling (stoop - avoid: MOTIVATIONAL). He then looks around, and enjoys the result of looking: the familiarity of Bag End (look - enjoy: PSYCHOLOGICAL). He turns, which causes [...]
[...] (Riaz and Girju, 2010; Rahimtoroghi et al., 2016), we explore differences in learning between genres of film, positing e.g. that horror films may feature very different types of events than comedies. We also test the quality of what is learned when we train on genre-specific texts vs. the whole collection. Our results show that:
• human judges can distinguish between strongly and weakly causal event pairs induced using our method (Section 3.1);
• our strongly causal event pairs are rated as more likely to be causal than those provided by the Rel-gram corpus (Balasubramanian et al., 2013) (Section 3.2);
• human judges can recognize different types of narrative causality (Section 3.3);
• using both whole-corpus and genre-specific methods yields similar results for quality, despite the smaller size of the genre-specific subcorpora. Moreover, the genre-specific method learns some event pairs that differ from the whole-corpus event pairs, while still being high quality (Section 3.4).
We explain our method in Section 2, and then present experimental results in Section 3. We leave a more detailed discussion of related work until Section 4, where we can compare it more directly with our own.

Experimental Method
We estimate the likelihood of a narrative causality relation between events in film scenes.

Film Scenes & Pre-Processing.
We chose 11 genres with more than 100 films from a corpus of film scene descriptions (Walker et al., 2012; Hu et al., 2013), resulting in 955 unique films. Film scripts were scraped from the IMSDb website; film dialogs and scene descriptions were then automatically separated. Films per genre range from 107 to 579. Films can belong to multiple genres, e.g. the scenes from The Fellowship of the Ring shown in Figure 1 would become part of the genres of Action, Adventure, and Fantasy. Each film's scene descriptions range from 2000 to 35000 words. Table 1 enumerates the sizes of each genre, illustrating the potential tradeoff between getting good probability estimates for event co-occurrence when the same events are repeated within a genre, vs. across the whole corpus. We use Stanford CoreNLP 3.5.2 to tokenize, lemmatize, POS tag, dependency parse and label named entities (Manning et al., 2014).

Compute Event Representations.
An event is defined as a verb lemma, as in previous work (Chambers and Jurafsky, 2008;Do et al., 2011;Riaz and Girju, 2010;Manshadi et al., 2008). We extract events by keeping all tokens whose POS tags begin with VB: VB, VBD, VBG, VBN, VBP, and VBZ. This results in extracting deverbal nouns that implicitly evoke events, such as the events of ripping and tearing in Scene 4 of Figure 1. This definition also allows us to pick up resultative clauses along with the action that caused the result (Hovav and Levin, 2001;Goldberg and Jackendoff, 2004), e.g. in He slammed the door shut, both slammed and shut are picked up as verbs. We exclude light verbs e.g. be, let, do, begin, have, start, try, because they often only represent a meaningful event when combined with their complements.
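The extraction rule above can be illustrated with a minimal sketch. It assumes the text has already been lemmatized and POS-tagged, and that each token is available as a (lemma, tag) tuple; the function name, the input format, and the example tagging are hypothetical. The light-verb set contains only the examples listed above and may not be the full list used in the experiments.

```python
# Light verbs named in the text (given as examples, so possibly incomplete).
LIGHT_VERBS = {"be", "let", "do", "begin", "have", "start", "try"}


def extract_events(tagged_tokens):
    """Return verb lemmas, in order, for tokens whose POS tag starts with 'VB'.

    tagged_tokens: iterable of (lemma, pos_tag) tuples (hypothetical format).
    Light verbs are skipped because they rarely denote an event on their own.
    """
    events = []
    for lemma, pos_tag in tagged_tokens:
        lemma = lemma.lower()
        if pos_tag.startswith("VB") and lemma not in LIGHT_VERBS:
            events.append(lemma)
    return events


# Hypothetical tagging of "He slammed the door shut".
sentence = [("he", "PRP"), ("slam", "VBD"), ("the", "DT"),
            ("door", "NN"), ("shut", "VBN")]
events = extract_events(sentence)
```

With this tagging, both the action and its result are kept as events, which is how resultative clauses are picked up.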
We extract the subject (nsubj, agent), direct object (dobj, nsubjpass), indirect object (iobj) and particle of the verb (compound:prt), if any. In order to abstract and merge different arguments, we generalize the arguments to two types: person and something. We generalize an argument to person when: (1) its named entity type is PERSON; or (2) it is a pronoun (except "it"); or (3) it is a noun in WordNet with more than half of its Synsets having lexical filename noun.person, e.g. doctor, soldier, waiter, man, woman. Our narrative causal semantics would be more specific if we could generalize over other types of named entities as well, such as location. However, named entities that Stanford NER can identify rarely occur in film data.
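The three generalization rules can be written as a small decision function. This is only a sketch: the function name and signature are hypothetical, the pronoun check assumes Penn Treebank tags, and the share of an argument's WordNet synsets with lexical filename noun.person is passed in as a precomputed number rather than looked up, so the example has no external dependencies.

```python
# Assumption: pronouns carry the Penn Treebank tags PRP or PRP$.
PRONOUN_TAGS = {"PRP", "PRP$"}


def generalize_argument(word, pos_tag, ner_tag, person_synset_ratio=0.0):
    """Map an argument to 'person' or 'something'.

    person_synset_ratio: fraction of the word's WordNet synsets whose lexical
    filename is noun.person (hypothetical precomputed input).
    """
    if ner_tag == "PERSON":                                   # rule (1)
        return "person"
    if pos_tag in PRONOUN_TAGS and word.lower() != "it":      # rule (2)
        return "person"
    if pos_tag.startswith("NN") and person_synset_ratio > 0.5:  # rule (3)
        return "person"
    return "something"
```

For example, `generalize_argument("doctor", "NN", "O", 0.75)` falls under rule (3), while `generalize_argument("it", "PRP", "O")` is mapped to something.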
For every event, we record the combination of its arguments and particle for every instance. For example, the instance of the event "pick" in the sentence He picked it up... a pearl has the combination subj: person, dobj: something, iobj: none, particle: up. We pick the combination with the highest frequency to represent the arguments and particle for each event.
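Choosing the most frequent argument combination per event amounts to a per-verb frequency count. The sketch below assumes each instance is given as a verb lemma plus a (subj, dobj, iobj, particle) tuple; this representation and the function name are hypothetical.

```python
from collections import Counter, defaultdict


def representative_frames(instances):
    """Pick, for each verb, its most frequent (subj, dobj, iobj, particle) tuple.

    instances: iterable of (verb_lemma, frame_tuple) pairs (hypothetical format).
    """
    counts = defaultdict(Counter)
    for verb, frame in instances:
        counts[verb][frame] += 1
    return {verb: frames.most_common(1)[0][0] for verb, frames in counts.items()}


instances = [
    ("pick", ("person", "something", None, "up")),
    ("pick", ("person", "something", None, "up")),
    ("pick", ("person", "person", None, None)),
    ("fall", ("person", None, None, None)),
]
frames = representative_frames(instances)
```

How ties between equally frequent combinations are resolved is not stated in the text; here the combination seen first wins.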

Calculating Narrative Causality.
We use the Causal Potential (CP) measure in (1), shown to work well in previous work (Beamer and Girju, 2009; Hu et al., 2013):

$$CP(e_1, e_2) = PMI(e_1, e_2) + \log \frac{P(e_1 \rightarrow e_2)}{P(e_2 \rightarrow e_1)} \tag{1}$$

where

$$PMI(e_1, e_2) = \log \frac{P(e_1, e_2)}{P(e_1)\,P(e_2)}$$

and the arrow notation means ordered event pairs, i.e. event $e_1$ occurs before event $e_2$. CP consists of two terms: the first is pointwise mutual information (PMI) and the second is the relative ordering of bigrams. PMI measures how often events occur as a pair (without considering their order), whereas relative ordering accounts for the order of the event pairs, because temporal order is one of the strongest cues to causality (Beamer and Girju, 2009; Riaz and Girju, 2010).
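As a rough illustration of how CP can be computed from raw counts, the sketch below sums a PMI term over the unordered pair and a log ratio of the two orderings. The function name, the count-based probability estimates, and the normalization by total event and pair counts are assumptions made for the example, not details taken from the experiments; ordered pair counts of zero are raised to 1 as a simple form of smoothing.

```python
import math


def causal_potential(c_e1, c_e2, c_12, c_21, n_events, n_pairs):
    """CP(e1, e2) = PMI(e1, e2) + log(P(e1 -> e2) / P(e2 -> e1)).

    c_e1, c_e2: counts of the individual events
    c_12, c_21: counts of the ordered pairs e1 -> e2 and e2 -> e1
    n_events, n_pairs: total event and pair counts used for normalization
                       (assumed estimation scheme)
    """
    c_12 = max(c_12, 1)  # unseen ordered pairs get frequency 1
    c_21 = max(c_21, 1)
    p_e1 = c_e1 / n_events
    p_e2 = c_e2 / n_events
    p_joint = (c_12 + c_21) / n_pairs  # pair probability, ignoring order
    pmi = math.log(p_joint / (p_e1 * p_e2))
    ordering = math.log(c_12 / c_21)   # normalizers cancel in the ratio
    return pmi + ordering


forward = causal_potential(10, 10, 8, 0, 100, 100)   # e.g. stumble -> fall
backward = causal_potential(10, 10, 0, 8, 100, 100)  # fall -> stumble
```

The PMI term is the same in both directions, so the difference between `forward` and `backward` comes entirely from the ordering term.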
[Figure: HIT instructions]
In this task, we will present you with two pairs of events (upper case verbs) that were automatically extracted from film scripts, and ask you to tell us which event pair is more likely to have a narrative causality relation. According to the theories of narrative, in a pair of events [A -> B], the narrative causality relation consists of 4 possible types of event relations, given below with defining examples. Note that the order of event A and B matters.
(1) Physical Causality: event A physically causes event B to happen. Thus the assumption is that when A is put into the context of the story, B will inevitably follow.
[person PUSH person -> person FALL]: Pippin pushes Frodo away...he stumbles backwards, and falls to the floor.
(2) Motivational Causality: event A happens with B as a motivation.
[person SWERVE -> AVOID something]: He swerves to avoid an ugly pickup truck crawling like a snail ahead.
Given any common story context that you can imagine, which event pair is more likely to have a narrative causality relation?
(1) All the events are in their verb base forms. But they can be in any tense in order to satisfy the narrative causality relation.
(2) Please use the arguments (subject, object etc) as reference only and focus on the events. Arguments are extracted automatically and could be incorrect. "person" and "something" are merely indicators of types of arguments (human or thing). In an event pair, "person" does not necessarily refer to the same person, and "something" does not necessarily refer to the same thing either.

We obtain the frequency of every event and event pair for each genre. Unseen event pairs are smoothed with frequency equal to 1. In this paper, the notion of window size indicates how many events after the current event are paired with the current event. We use window sizes 1, 2 and 3, and calculate narrative causality for each window size. In film scenes, events are very densely distributed (see Figure 1), thus related event pairs are often adjacent to one another, but the discourse structure of film scenes, not surprisingly, also contains related events separated by other events (Grosz and Sidner, 1986; Mann and Thompson, 1987). For example, in Scene 3 of Figure 1, Bilbo pulling out the ring enables him to slide it off his palm later (pull out - slide off). Moreover, while related events are less frequently separated (window size 3), we assume that unrelated events will be filtered out by their low probabilities. We thus define a CPC measure, shown in (2), that combines the frequencies across window sizes:

$$CPC(e_1, e_2) = \sum_{i=1}^{w_{max}} \frac{CP_i(e_1, e_2)}{i} \tag{2}$$

where $w_{max}$ is the max window size and $CP_i(e_1, e_2)$ is the CP score for event pair $e_1, e_2$ calculated using window size $i$. The CPC measure combines frequencies across window sizes, but punishes event pairs from larger window sizes, thus assuming that nearby events are more likely to be causal.
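The window-based pairing and the combination across window sizes can be sketched as follows. Two things here are assumptions made for illustration: the function names, and the specific penalty, which divides each window's CP score by its window size as one simple way to punish pairs from larger windows.

```python
from collections import Counter


def ordered_pair_counts(events, window):
    """Count ordered pairs (e1, e2) where e2 is one of the `window` events after e1."""
    counts = Counter()
    for i, first in enumerate(events):
        for second in events[i + 1 : i + 1 + window]:
            counts[(first, second)] += 1
    return counts


def cpc(cp_by_window):
    """Combine per-window CP scores, down-weighting larger windows.

    cp_by_window: dict mapping window size i to CP_i for one event pair.
    Assumed weighting: each CP_i is divided by i before summing.
    """
    return sum(score / size for size, score in cp_by_window.items())


scene = ["pull", "look", "slide"]
adjacent = ordered_pair_counts(scene, window=1)
wider = ordered_pair_counts(scene, window=2)
combined = cpc({1: 3.0, 2: 2.0, 3: 3.0})
```

With window size 1 only adjacent events are paired; with window size 2 the pair pull - slide is also counted, even though another event intervenes.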

Evaluation and Results
We posit that human judgments are the best way to evaluate the quality of the induced event pairs, as opposed to automatic measures such as Narrative Cloze, which assume that the event pairs in a particular instance of text can be used as held-out test data (Chambers and Jurafsky, 2008). Our first experiment tests whether event pairs with high CPC scores are more likely to have a narrative causality relation. Our second experiment compares pairs with high CPC scores with their corresponding top Rel-gram pairs. Our third experiment tests whether annotators can distinguish narrative causality types. Our final experiment compares the quality and type of causal pairs learned on a per genre basis, vs. those learned on the whole film corpus.

High vs. Low CPC Event Pairs
After processing all the data, we have a list of event pairs scored by CPC, and rank-ordered within each genre. Some of the genre-specific event pairs seem to intuitively reflect their genre; however, many learned pairs overlap across genres. We select the top 3000 event pairs with high scores from all the genres ("high pairs"). The number of event pairs from a genre is proportional to the number of films in that genre. We also select the bottom 6000 event pairs with low scores from all the genres using a similar method ("low pairs"). Since many pairs are duplicated across genres, the high pairs and low pairs are then de-duplicated (two event pairs are defined as equal if they have the same verbs in the same order). We keep the arguments with the highest frequencies. This results in 960 high pairs. If an event has no subject, "person" is added as [...]
The results show that humans judge the high pairs as more likely to have a narrative causality relation in 82.8% of items. Among those, all the items receive 3 or more votes for the high pairs. Overall, all five Turkers select the high CPC pairs in 51% of the items. The average pairwise Krippendorff's Alpha score is respectable at 0.56. Table 2 shows items where all 5 Turkers selected the high pair. For example, clink - drink in Row 1 could have either a MOTIVATIONAL or ENABLING narrative causality depending on the context, but the causal relation in either case is much clearer than with the low CPC pair strike - give. In Row 2 and Row 5, beckon - come and crane - see both have ENABLING causality, which is a weakly causal relation, but are again more meaningful than their low CPC counterparts. In Row 3, it is clear that a person often bends with the motivation to pick up something. In Row 4, a person coughing PHYSICALLY causes him to splutter everywhere. Table 3 shows majority vote results for percentages of high pairs that are considered to exhibit more narrative causality, sorted by genre.
The results for all genres are good, ranging from ∼82% to ∼91%. Interestingly, Drama has the most films but the lowest percentage of judged narrative causality, while Fantasy has the fewest films but the highest percentage. This may be because the Drama category is a catch-all (over half of the films are categorized this way, suggesting that it has low coherence as a genre). The poor performance on Drama would then be consistent with previous work that shows that topical coherence (genre in this case) improves causal relation learning (Rahimtoroghi et al., 2016; Riaz and Girju, 2010). We will return to this point in Section 3.4.

CPC vs. Rel-gram Event Pairs
We then compare the narrative causality event pairs (high pairs) with event pairs from the Rel-grams corpus (Balasubramanian et al., 2012, 2013). Rel-grams (Relational n-grams) are pairs of open-domain relational tuples (T, T'). They are analogous to lexical n-grams, but are computed over relations rather than over words. For example, "A person who gets arrested is typically charged with some activity." yields the tuple: T = ( [...]
Over 1.8M newswire documents are used to build a database of Rel-gram co-occurrence statistics. Using a similar HIT template, we randomly sample 100 high CPC event pairs from the 960 high CPC pairs, where we ensure that the first events of the pairs are all distinct. We use the publicly available search interface for Rel-grams (footnote 4) to find Rel-gram statement pairs that have the same first event. Mirroring our own experimental setup, we set the co-occurrence window to 5 (footnote 5), and select the Rel-gram pair with the highest #50(FS) (frequency of first statement occurring before second statement within a window of 50).
To make Rel-gram event pairs similar to ours, we generalize their arguments to "person" and "something" manually. We keep the verb particle if any. It is possible that this disadvantages Rel-grams in some way, but our main focus is on the causality relation between verbs, which should not be affected. Moreover, the two sets of event pairs cannot be compared without this generalization. The same 5 annotators participate in these 5 HITs (100 items).
The results show that humans judge the CPC pairs to be more likely to manifest a narrative causality relation 81% of the time. The average pairwise Krippendorff's Alpha score of all Turkers is 0.482. Table 4 shows items where all Turkers judge the CPC pairs as more likely to be causally related. For example, in Row 1 to clear seems more likely to enable something being revealed, instead of causing a person to hit something. In Row 2, even though embrace and kiss might only have an ENABLING narrative causality relation, the reversed causality between embrace and meet in the Rel-gram pair is based on symmetric conditional probability (SCP) rather than explicit causal modeling. SCP combines bigram probability in both directions as follows:

$$SCP(e_1, e_2) = P(e_2 \mid e_1) \times P(e_1 \mid e_2) \tag{3}$$

In Row 4, marrying someone might just possibly enable one to think about something, but could hardly enable/cause someone to die. In Row 5 stumble physically causes one to fall, while it is more difficult to see the causal relation between stumbling on someone and then a person taking another person (somewhere).

Footnote 4: http://relgrams.cs.stonybrook.edu/
Footnote 5: The search interface does not support a window size of 3, thus we chose 5 as it's the closest window size larger than 1.
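To see why SCP by itself cannot encode direction, the sketch below estimates both conditional probabilities from the same pair count, so swapping the two events leaves the score unchanged. The function name and the count-based estimates are assumptions for illustration and are not taken from the Rel-grams implementation.

```python
def scp(c_pair, c_e1, c_e2):
    """Symmetric conditional probability: P(e2 | e1) * P(e1 | e2).

    c_pair: co-occurrence count of the two events (assumed order-independent)
    c_e1, c_e2: counts of the individual events
    """
    p_e2_given_e1 = c_pair / c_e1
    p_e1_given_e2 = c_pair / c_e2
    return p_e2_given_e1 * p_e1_given_e2


score_ab = scp(5, 10, 20)  # e.g. scoring the pair (a, b)
score_ba = scp(5, 20, 10)  # the same pair with the roles swapped
```

Both calls give the same score, in contrast to the ordering term of CP, which changes sign when the events are swapped.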

Narrative Causality Type | Count | Example Pair
Physical                 | 13    | fire - blast
Motivational             | 29    | bend - retrieve
Psychological            | 9     | look - astonish
Enabling                 | 28    | lean - whisper
Table 5: Distribution of narrative causality types.

Narrative Causality Types
Although theories of narrative posit four different types of narrative causality, previous work has not conducted reliability studies with non-experts such as Turkers. Here we explore whether humans can distinguish narrative causality types, by asking Turkers to decide which relation holds between an event pair. The instructions contain descriptions of narrative causality types and the strength of these relations (from strong to weak: PHYSICAL, MOTIVATIONAL, PSYCHOLOGICAL and ENABLING (Trabasso et al., 1989)). Because the stronger types of narrative causality could also be considered ENABLING, Turkers are instructed to choose the strongest narrative causality that could be applied to the event pair.
We select 100 pairs randomly from the high CPC pairs of the 479 questions that had the highest Turker agreement. Among all 100 questions, 79% of the items receive a majority vote result (3 or more Turkers selecting the same answer). The distribution of narrative causality types of the 79 items is shown in Table 5. Interestingly, films are full of motivational causality, which often reflects action sequences where protagonists pursue particular narratively relevant goals ([...] Gerrig, 2006, 2002).

Genre Specific Causality
Previous work suggests that topical coherence and similarity of events within the corpus used for learning causal/contingent event relations might be as important as the size of the corpus (Riaz and Girju, 2010;Rahimtoroghi et al., 2016). In other words, smaller corpora filtered by topic or genre might be more useful than large undifferentiated sets (Riloff, 1996), although obviously very large corpora that are topic or genre sorted could be even more useful. We therefore test whether separating films by genre yields higher quality event pairs than a method that combines all films, irrespective of genre. We assume that the very notion of a film genre defines a set of films with similar types of events.
We first compute a list of CPC scores using films from all genres and take the 960 event pairs with the highest scores. Comparing the 960 event pairs from all films with the 960 pairs from merging genres described in Section 3.1, we find that 728 pairs overlap between the two sets. Thus with the smaller genre-specific corpora we learn more than 70% of the same causal pairs. The results shown in Table 3 suggest furthermore that the genre-specific pairs are high quality. However, it is still possible that the 232 pairs from each set that do not overlap vary in quality from the 728 pairs that do. We therefore pick 100 random pairs from each set, match the pairs randomly to form items, and repeat the event pair comparison HIT with these pairs. The results suggest that there are no differences between the two methods as far as quality is concerned: in 48 of the 100 questions, pairs from the genre-separated method receive the Turkers' majority vote, vs. 52 of the 100 questions in which pairs from the combined genres receive the majority vote.
Moreover, we obtain more high-quality, reliable narrative causality relations using both methods, and we learn some genre-specific causal relations that we do not learn on the whole corpus. Table 8 shows the overlap in learned pairs amongst the top 30 CPC pairs in five of the most distinct genres (genres with highest percentages in Table 3: Fantasy, Sci-Fi, Horror, Mystery and Thriller) vs. all films (All). Mystery has the smallest overlap with All, followed by Fantasy and Sci-Fi.
To illustrate some of the differences, Table 6 shows event pairs with the highest CPC scores in Fantasy, Action, Sci-Fi and Thriller genres. Table 7 shows event pairs unique to each genre within its top 30 CPC pairs.
We also compare our 960 pairs from merging genres described in Section 3.1 with 200 event pairs extracted from camping and storm personal blog stories in Rahimtoroghi et al. (2016). The only pairs that overlap are sit - eat and play - sing, illustrating again that the causal relations learned are not as dependent on the size of the corpus as they are on its topical and event-based coherence. Since most previous work on narrative schemas, scripts, event schemas or Rel-grams has only been applied to one large corpus of newswire (Gigaword corpus), these methods have only learned relations about newsworthy topics, and even then, perhaps only the most frequent, highly common news events. In contrast, both our approach and that of Rahimtoroghi et al. (2016) learn fine-grained causal relations that underlie narratives, which we believe are more in the spirit of Schank's original motivation for scripts (Lehnert, 1981; Schank et al., 1977; Wilensky, 1982; de Jong, 1979).

Related Work
[...] causality relations between events in PDTB in context (both verbs and nouns) (Prasad et al., 2008). They present a detailed formula for calculating contingency/causality that takes into account several different kinds of argument overlap between adjacent events. However, they do not provide any evidence that all the components of this formula actually contribute to their results. Gordon et al. (2011) used event n-grams and discourse cues to learn causal relations from first-person stories posted on weblogs and evaluated them with respect to the COPA SEM-EVAL task. Other related work learns likely sequences of temporally ordered events but does not explicitly model CAUSALITY (Chambers and Jurafsky, 2009; Balasubramanian et al., 2013; Manshadi et al., 2008).
Work on VerbOcean (Chklovski and Pantel, 2004) uses lexical patterns to learn semantic verb relations of similarity, strength, antonymy, enablement and happens-before. Balasubramanian et al. (2013) use symmetric conditional probability to learn semantically typed relational triples (actor, relation, actor), which they call Rel-grams (relational n-grams), and show that their schemas outperform previous work (Chambers and Jurafsky, 2009). We thus compared our event pairs with Rel-grams, showing that humans are more likely to perceive narrative causality in our event pairs.

Discussion and Future Work
We present an unsupervised model based on Causal Potential (Beamer and Girju, 2009) to induce event pairs with narrative causality relations from film scenes in 11 genres. Results from four human evaluations show that narrative causality event pairs induced using our method are of high quality, and are perceived as more causally related than corresponding Rel-grams. We show that humans can identify different types of narrative causality, but we leave automatic identification of these to future work. We also show that inducing narrative causality event pairs using both whole-corpus and genre-specific methods yields similar results for quality, despite the smaller size of the genre-specific subcorpora. Moreover, the genre-specific method learns high-quality event pairs that differ from the whole-corpus event pairs.
We are looking into applying our CPC method to, and evaluating it on, other genre- and topic-sorted datasets such as books and personal blogs. We want to expand our set of event pairs with narrative causality relations, which could potentially aid text understanding, information extraction, question answering, and content summarization. We also aim to explore features for narrative causality type classification. Information such as event A physically causes event B, or event C enables event D, could further help the aforementioned applications.