Inference of Fine-Grained Event Causality from Blogs and Films

Human understanding of narrative is driven mainly by reasoning about causal relations between events, and thus recognizing them is a key capability for computational models of language understanding. Computational work in this area has proceeded along two different routes: by acquiring a knowledge base of common causal relations between events, or by attempting to understand a particular story or macro-event along with its storyline. In this position paper, we focus on the knowledge-acquisition approach and claim that newswire is a relatively poor source for learning fine-grained causal relations between everyday events. We describe experiments using an unsupervised method to learn causal relations between events in the narrative genres of first-person narratives and film scene descriptions. We show that our method learns fine-grained causal relations, judged by humans as likely to be causal over 80% of the time. We also demonstrate that the learned event pairs do not exist in publicly available event-pair datasets extracted from newswire.


Introduction
Computational models of language understanding must recognize narrative structure because many types of natural language texts are narratively structured, e.g. news, reviews, film scripts, conversations, and personal blogs (Polanyi, 1989; Jurafsky et al., 2014; Bell, 2005; Gordon et al., 2011a). Human understanding of narrative is driven by reasoning about causal relations between the events and states in the story (Gerrig, 1993; Graesser et al., 1994; Lehnert, 1981; Goyal et al., 2010). Thus previous work has aimed to learn a knowledge base of semantic relations between events from text (Chklovski and Pantel, 2004; Gordon et al., 2011a; Chambers and Jurafsky, 2008; Balasubramanian et al., 2013; Pichotta and Mooney, 2014; Do et al., 2011), with the long-term aim of using this knowledge for understanding. Some of this work explicitly models causality; other work characterizes the semantic relations more loosely as "events that tend to co-occur". Related work points out that causality is granular in nature, and that humans flexibly move back and forth between different levels of granularity of causal knowledge (Hobbs, 1985). Thus methods are needed to learn causal relations and reason about them at different levels of granularity (Mulkar-Mehta et al., 2011).

Figure 1: Excerpt from a personal narrative about a camping trip: "We packed all our things on the night before Thu (24 Jul) except for frozen food. We brought a lot of things along. We woke up early on Thu and JS started packing the frozen marinated food inside the small cooler... In the end, we decided the best place to set up the tent was the squarish ground that's located on the right. Prior to setting up our tent, we placed a tarp on the ground. In this way, the underneath of the tent would be kept clean. After that, we set the tent up."
One limitation of prior work is that it has primarily focused on newswire, and thus has only learned relations about newsworthy topics, likely the most frequent, highly common (coarse-grained) news events. But news articles are not the only resource for learning about relations between events. Much of the content on social media in personal blogs is written by ordinary people about their daily lives (Burton et al., 2009), and these blogs contain a large variety of everyday events (Gordon et al., 2012). Film scene descriptions are also action-rich and told in fine-grained detail (Beamer and Girju, 2009; Hu et al., 2013). Moreover, both of these genres typically report events in temporal order, which is a primary cue to causality. In this position paper, we claim that knowledge about fine-grained causal relations between everyday events is often not available in news, and can be better learned from other narrative genres.
For example, Figure 1 shows a part of a personal narrative written in a blog about a camping trip (Burton et al., 2009). The major event in this story is camping, which is contingent upon several finer-grained events, such as packing things the night before, waking up in the morning, packing frozen food, and later on at the campground, placing a tarp and setting up the tent. Similarly film scene descriptions, such as the one shown in Figure 2, typically contain fine-grained causality. In this scene from Lord of the Rings, grabbing leads to spilling, and pushing leads to stumbling and falling.
We show that unsupervised methods for modeling causality can learn fine-grained event relations from personal narratives and film scenes, even when the corpus is relatively small compared to those that have been used for newswire. We learn high-quality causal relations, with over 80% judged as causal by humans. We claim that these fine-grained causal relations are much closer in spirit to those motivating earlier work on scripts (Lehnert, 1981;Schank et al., 1977;Wilensky, 1982;de Jong, 1979), and we show that the causal knowledge we learn is not found in causal knowledge bases learned from news.
Section 2 first summarizes previous work on learning causal knowledge. We then present our experiments and results on modeling event causality in blogs and film scenes in Section 3. Conclusions and future directions are discussed in Section 4.

Background and Related Work
Cognitive theories of narrative understanding define narrative coherence in terms of four different sources of causal inferences between events A and B (Trabasso and van den Broek, 1985; Warren et al., 1979; Trabasso et al., 1989; Van den Broek, 1990).

There has been a great deal of interest in learning narrative relations or narrative schemas in an unsupervised or weakly supervised manner from text. Here we focus on work where the resulting knowledge bases have been made publicly available, allowing us to compare the learned knowledge directly. The VerbOcean project learned five different semantic relations between event types (verbs) from newswire, with the HAPPENS-BEFORE relation defined as "indicating that the two verbs refer to two temporally disjoint intervals or instances". WordNet's cause relation, between a causative and a resultative verb (as in buy::own), is tagged as an instance of HAPPENS-BEFORE in VerbOcean, consistent with the heuristic that temporal ordering is a major component of causality. Other examples of the HAPPENS-BEFORE relation in the VerbOcean knowledge base include marry::divorce, detain::prosecute, enroll::graduate, schedule::reschedule, and tie::untie (Chklovski and Pantel, 2004).

Balasubramanian et al. (2013) generate pairs of event relational tuples, called Rel-grams, which are publicly available through an online search interface. Rel-gram tuples are extracted using a co-occurrence statistic, Symmetric Conditional Probability (SCP), which combines the bigram probabilities in both directions:

SCP(e1, e2) = P(e2|e1) × P(e1|e2) (1)

Their evaluation experiments directly compared the knowledge learned in Rel-grams to previous work on narrative schemas (Chambers and Jurafsky, 2008, 2009), showing that they achieve better results; thus our work compares directly to the tuples available in Rel-grams.
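To make the SCP statistic concrete, the sketch below computes it from simple co-occurrence counts. This is a minimal illustration with hypothetical event names and counts, not the Rel-grams extraction pipeline itself.

```python
from collections import Counter

def scp(pair_counts, event_counts, e1, e2):
    """Symmetric Conditional Probability of an event pair:
    SCP(e1, e2) = P(e2|e1) * P(e1|e2),
    estimated from raw co-occurrence and event counts."""
    joint = pair_counts[(e1, e2)]
    if joint == 0:
        return 0.0
    return (joint / event_counts[e1]) * (joint / event_counts[e2])

# Toy counts (hypothetical): how often each event occurs,
# and how often the ordered pair co-occurs.
event_counts = Counter({"grab": 10, "spill": 4, "walk": 20})
pair_counts = Counter({("grab", "spill"): 3})

# P(spill|grab) * P(grab|spill) = (3/10) * (3/4)
score = scp(pair_counts, event_counts, "grab", "spill")
```

Because SCP multiplies the conditional probabilities in both directions, it rewards pairs that predict each other, regardless of which one is more frequent overall.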
Other work focuses more directly on learning causal or contingency relations between events. Beamer and Girju (2009) introduced a distributional measure called Causal Potential to assess the likelihood of a causal relation holding between two events. This measure is based on Suppes' probabilistic theory of causality (Suppes, 1970).
CP(e1, e2) = PMI(e1, e2) + log( P(e1 → e2) / P(e2 → e1) ) (2)

where

PMI(e1, e2) = log( P(e1, e2) / ( P(e1) P(e2) ) )

and the arrow notation denotes ordered event pairs, i.e. event e1 occurs before event e2. CP consists of two terms: the first is pointwise mutual information (PMI) and the second is the relative ordering of bigrams. PMI measures how often events occur as a pair (without considering their order), whereas relative ordering accounts for the order of the event pairs, because temporal order is one of the strongest cues to causality (Beamer and Girju, 2009; Riaz and Girju, 2010, 2013). This work explicitly links its definitions to research using the Penn Discourse Treebank (PDTB) definition of CONTINGENCY. Beamer and Girju (2009) applied the CP measure to 173 film scripts, finding a high correlation between human-judged causality and the CP measure. Their paper provides a list of 90 verb pairs, selected from the high, middle, and low CP ranges in their learned causal pairs. We compare their 30 highest-CP events with causal event pairs that we learn from film. Riaz and Girju (2010) apply a similar measure to topic-sorted news stories about Hurricane Katrina and the Iraq War and present ranked causality relations between events for these topics, suggesting that topic-sorted corpora can produce better causal knowledge. Other work has also used CP to measure the contingency relation between two events, reporting better results than achieved with PMI or bigrams alone (Hu et al., 2013; Rahimtoroghi et al., 2016).
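A minimal sketch of the CP computation, assuming ordered pair counts and simple maximum-likelihood probability estimates; the event names and counts are toy values, not drawn from any corpus.

```python
import math
from collections import Counter

def causal_potential(ordered_counts, event_counts, e1, e2):
    """Causal Potential (after Beamer and Girju, 2009):
    CP(e1, e2) = PMI(e1, e2) + log(P(e1 -> e2) / P(e2 -> e1)),
    where ordered_counts[(a, b)] is how often event a occurs before b."""
    total_pairs = sum(ordered_counts.values())
    total_events = sum(event_counts.values())
    forward = ordered_counts[(e1, e2)]
    backward = ordered_counts[(e2, e1)]
    # PMI term: unordered co-occurrence relative to chance.
    p_pair = (forward + backward) / total_pairs
    p_e1 = event_counts[e1] / total_events
    p_e2 = event_counts[e2] / total_events
    pmi = math.log(p_pair / (p_e1 * p_e2))
    # Ordering term: how strongly e1 tends to precede e2.
    ordering = math.log(forward / backward)
    return pmi + ordering

# Toy counts for two events in a film scene.
ordered = Counter({("push", "fall"): 6, ("fall", "push"): 1})
events = Counter({"push": 10, "fall": 7})
cp = causal_potential(ordered, events, "push", "fall")
```

Note how the ordering term is antisymmetric: swapping the arguments flips its sign, so a pair that usually occurs in one temporal order scores much higher in that direction.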

Methods and Evaluations
Our primary goal is simply to show that fine-grained causal relations can be learned from film scripts and blogs, and that these are not found in causal knowledge bases learned from newswire. In this section we describe our datasets and methods, and then present two evaluations. First, we evaluate whether the learned relations are judged causal by humans; second, we compare them to existing causal knowledge bases.

Datasets
Topical coherence and similarity of events within the corpus used for learning event relations can be as important as the size of the corpus (Riaz and Girju, 2010; Rahimtoroghi et al., 2016). We use two datasets for learning causal event pairs: first-person narratives from blogs (Burton et al., 2009; Rahimtoroghi et al., 2016), and film scene descriptions, excluding dialog because dialog is not as action-rich (Walker et al., 2012; Hu et al., 2013). Our experiment on blogs learns causal relations from a topic-sorted corpus of ∼1000 camping stories. We also posit that the genre of a film may select for similar types of events. However, genres can be defined broadly or narrowly; e.g. the Drama genre overlaps with many other genres. We thus compare the two narrow film genres of Fantasy and Mystery with the Drama genre from an existing corpus (Walker et al., 2012; Hu et al., 2013). The raw numbers for each subcorpus are shown in Table 1. Note that the Camping corpus consists of blog posts, which are much shorter than movie scripts; its word count is thus much smaller than that of the film corpus despite the larger number of documents.

Methods
In the blogs, related event pairs are more frequently separated by utterances that provide state descriptions or affective reactions to events (Swanson et al., 2014). We therefore use the Causal Potential (CP) measure to assess the causal relation between events and apply a skip-2 bigram method for modeling event pairs. In film scenes, by contrast, events are very densely distributed: related event pairs are often adjacent to one another, and nearby events are more likely to be causal. So, for event pairs extracted from film scenes, we compute CP over multiple window sizes and average the results (Riaz and Girju, 2010, 2013; Do et al., 2011; Pichotta and Mooney, 2014).
CP(e1, e2) = (1 / w_max) × Σ_{i=1..w_max} CP_i(e1, e2)

where w_max is the maximum window size (how many events after the current event are paired with the current event), and CP_i(e1, e2) is the CP score for the event pair (e1, e2) calculated using window size i.
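One plausible implementation of the windowed pair extraction and averaging is sketched below; the function names and the simple-average reading of the windowed measure are our assumptions, not the authors' released code.

```python
from collections import Counter

def extract_window_pairs(event_sequence, window):
    """Pair each event with up to `window` following events,
    preserving their order of occurrence in the document."""
    pairs = Counter()
    for i, e1 in enumerate(event_sequence):
        for e2 in event_sequence[i + 1 : i + 1 + window]:
            pairs[(e1, e2)] += 1
    return pairs

def averaged_cp(cp_by_window, w_max, e1, e2):
    """Average the per-window CP scores CP_i(e1, e2)
    over window sizes i = 1..w_max."""
    return sum(cp_by_window[i][(e1, e2)]
               for i in range(1, w_max + 1)) / w_max

# Toy event sequence from a single scene.
events = ["grab", "spill", "push", "stumble", "fall"]
pairs_w2 = extract_window_pairs(events, 2)
# ("push", "fall") is captured with window 2 but not window 1.
```

Widening the window trades precision for recall: larger windows capture related but non-adjacent events, at the cost of pairing more unrelated ones.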

Experiments and Results
We process the data in each dataset and calculate the causal potential score for each extracted event pair, resulting in a rank-ordered list of causal event pairs. We evaluate the top 100 event pairs for camping, and the top 684 event pairs for films. We take a number of event pairs from each film genre (proportional to the number of films in that genre; see Tables 1 and 3), then remove duplicate event pairs, which results in the 684 event pairs from film. Table 2 presents examples of learned high-CP event pairs from each corpus. In the following Mechanical Turk experiments, Turkers have to pass qualification tests similar to the actual HITs in order to participate in our task. In a study on each genre of films, we compare high-CP pairs to a random sample of low-CP pairs on Mechanical Turk to see whether pairs with a high CP score more strongly encode causal relations than ones with a low CP score. For every event pair among the 684 high-CP pairs, we randomly select a low-CP pair in order to collect human judgments on Mechanical Turk. The task first defines events and event pairs, then gives examples of event pairs with causal relations. Turkers are asked to select the event pair that is more likely to manifest a causal relation. The results, summarized in Table 3, show that humans judge a large majority of the high-CP pairs to have a causal relation, and that the results vary by genre. The highest causality rates are achieved for the more focused genres, Fantasy (90.7%) and Mystery (87.7%), despite their smaller size, and the lowest for Drama (82.6%). We believe this result is further evidence that topical coherence improves causal relation learning (Rahimtoroghi et al., 2016; Riaz and Girju, 2010).
In our second evaluation, we compare the learned CP event pairs to existing causal knowledge collections. First, we compare our results to the Rel-grams data (learned from newswire) (Balasubramanian et al., 2013). For event pairs from films, we randomly sample 100 high-CP event pairs, ensuring that each of the first events of the pairs is distinct. We use the publicly available search interface for Rel-grams to find tuples with the same first event, for a direct comparison of the content of the learned knowledge. We set the co-occurrence window to 5, and select the Rel-gram tuples with the highest #50(FS) (the frequency of the first statement occurring before the second statement within a window of 50) to choose high-quality tuples. We evaluate the extracted Rel-gram tuples using the same Mechanical Turk HIT described above. Table 4 shows Mechanical Turk evaluation results for our method on films vs. Rel-grams: in 81% of questions, humans judge the high-CP pairs to be more likely to manifest a causal relation. We believe this is because the fine-grained event pairs we learn do not exist in the Rel-gram collections, and thus the Rel-gram tuples that matched our first events are not highly coherent, despite the filtering we applied.

For event pairs from camping blogs, we evaluate all 100 high-CP pairs in a Mechanical Turk study where Turkers are asked to choose whether an event pair has a causal relation or not. We also evaluate Rel-gram tuples using the same task. However, Rel-grams are not sorted by topic. To find tuples relevant to Camping Trip, we use our top 10 indicative events and extract all the Rel-gram tuples that include at least one event corresponding to one of the Camping indicative events, e.g. go camp. We remove any tuple with frequency less than 25 and sort the rest by total symmetric conditional probability.
The evaluation results presented in Table 5 show that 82% of the blog pairs were labeled as causal, whereas only 42% of the Rel-gram pairs were labeled as causal. We argue that this is mainly due to the limitations of the newswire data, which does not contain the fine-grained everyday events that we have extracted from our corpus.

Next, we compare our results to the event pairs in VerbOcean (learned from newswire) with the HAPPENS-BEFORE relation (Chklovski and Pantel, 2004). We use all 6497 event pairs from VerbOcean, comparing with our 684 event pairs from films and 100 event pairs from camping blogs with high CP scores. Our results show that there are 12 event pairs that exist in both VerbOcean and films, e.g. turn-leave and slow-stop, and there is only one event pair that exists in both VerbOcean and camping blogs: pack-leave. This confirms that most causal relations learned from other narrative genres do not exist in the currently available knowledge bases extracted from newswire. A number of event pairs from these collections share the first event, e.g. dig-find and scan-spot from films vs. dig-repair and scan-upload from VerbOcean; drive-park and pick-eat from blogs vs. drive-drag and pick-plunk from VerbOcean.
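The overlap comparison itself reduces to a set intersection over verb pairs. The sketch below illustrates the idea with small made-up samples; the pair lists are illustrative, not the full collections.

```python
def pair_overlap(pairs_a, pairs_b):
    """Event pairs (first_verb, second_verb) present in both collections."""
    return set(pairs_a) & set(pairs_b)

# Illustrative samples of high-CP film pairs and VerbOcean
# HAPPENS-BEFORE pairs (not the full lists).
film_pairs = [("turn", "leave"), ("slow", "stop"), ("dig", "find")]
verbocean_pairs = [("turn", "leave"), ("slow", "stop"), ("dig", "repair")]
shared = pair_overlap(film_pairs, verbocean_pairs)
```

In practice the verbs on both sides would need to be lemmatized consistently before intersecting, since the collections use different normalization schemes.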
Finally, we compare our high-CP pairs learned from film to the high-CP event pairs of Beamer and Girju (2009), learned from only 173 films. There is no public release of Beamer and Girju's event pairs, so we take the 29 event pairs with high CP scores presented in their paper. A total of 14 of their 29 pairs are also in our top 684 film pairs, including pairs such as swerve-avoid, leave-stand and unlock-open. However, on our larger genre-sorted corpus we also learn pairs such as grab-haul, scratch-claw and saddle-mount that do not exist in their collection.