Script Induction as Association Rule Mining

We show that the count-based Script Induction models of Chambers and Jurafsky (2008) and Jans et al. (2012) can be unified in a general framework of narrative chain likelihood maximization. We provide efficient algorithms, based on Association Rule Mining (ARM) and weighted set cover, that discover interesting patterns in the training data and combine them in a reliable and explainable way to predict the missing event. Unlike prior work, the proposed method does not assume full conditional independence and makes use of higher-order count statistics. We perform an ablation study and conclude that the inductive biases introduced by ARM are conducive to better performance on the narrative cloze test.


Introduction
The goal of this paper is to demonstrate how the efforts in Script Induction (SI), until recently dominated by statistical approaches (Chambers and Jurafsky, 2008; Jans et al., 2012; Pichotta and Mooney, 2014; Rudinger et al., 2015a,b), can be productively framed and extended as a special case of Association Rule Mining (ARM), a well-established problem in Data Mining (Agrawal et al., 1993, 1994; Han et al., 2000).
We start by introducing SI and ARM and then demonstrate a unification under a general chain likelihood maximization framework. We discuss how the existing count-based SI models tackle this maximization problem using naïve Bayes assumptions. We provide an alternative: mining higher-order count statistics using ARM and picking the most reliable rules using the weighted set cover algorithm. We validate the proposed approach and demonstrate improved performance over other count-based approaches. We conclude with a discussion of the implications and potential extensions of the proposed framework.

[Table 1: Mapping between ARM and count-based SI terminology; the surviving rows pair sup(I), |I| > 2 with Eq. 5 and int(A → {e}), |A| > 1 with Eq. 12. Bolded entries are contributions of this paper.]

Namely, we make use of frequent itemsets and interesting rules, i.e., higher-order count statistics that can be efficiently mined and used in the narrative cloze test.
Our intent in this work is not to establish new state-of-the-art results in SI. Rather, our primary contribution is retrospective: drawing a connection between a sub-topic of Computational Linguistics (CL) and a major pre-existing area of Computer Science, namely Data Mining. If one approaches SI by counting co-occurrence statistics, the existing tools of ARM lead naturally to solutions that had not previously been considered within CL.

Association Rule Mining
ARM is a prevalent problem in Data Mining, introduced by Agrawal et al. (1993). The task is often referred to as market basket analysis due to its widespread usage for discovering interesting patterns in consumer purchases. The applicability of ARM extends far beyond this specific scenario: examples of ARM usage in NLP applications include detecting annotation inconsistencies (Novák and Razímová, 2009), discovering strongly related events (Shibata and Kurohashi, 2011), adding missing knowledge to knowledge bases (Galárraga et al., 2013), and understanding clinical narratives (Boytcheva et al., 2017).
ARM aims to extract interesting patterns from a transactional database D. A transaction is a set of items, and a non-empty subset of a transaction is called an itemset. We define support as the number of transactions in which we observe an itemset I:

sup(I) = |{t | t ∈ D, I ⊆ t}|.    (1)

We say that an itemset I is frequent if its support (on a given database D) exceeds a user-defined threshold t_sup: sup(I) ≥ t_sup.
A pair of itemsets A, B is called a rule if A ∩ B = ∅ and is denoted as A → B. We say that a rule A → B is interesting if 1) both A and B are frequent, and 2) the interestingness of the rule exceeds a user-defined threshold t_int: int(A → B) ≥ t_int. The definition of the interestingness function int(·) is problem-specific.
ARM is thus concerned with: 1. mining frequent itemsets from a transactional database, 2. discovering interesting rules from frequent itemsets.
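The two steps above can be sketched with a minimal, pure-Python level-wise miner. This is an Apriori-style toy (not the implementation used in the paper); the database and threshold are illustrative:

```python
from itertools import combinations

def support(itemset, db):
    """sup(I): number of transactions in db containing itemset I (Eq. 1)."""
    s = set(itemset)
    return sum(1 for t in db if s <= t)

def frequent_itemsets(db, t_sup, max_size=3):
    """Level-wise (Apriori-style) mining: a (k+1)-itemset can only be
    frequent if its k-subsets are frequent, so we grow candidates from
    the frequent sets of the previous level."""
    items = {i for t in db for i in t}
    frequent = {}
    level = [frozenset([i]) for i in items]
    size = 1
    while level and size <= max_size:
        survivors = []
        for I in level:
            s = support(I, db)
            if s >= t_sup:
                frequent[I] = s
                survivors.append(I)
        # candidate generation: join frequent k-itemsets into (k+1)-itemsets
        level = list({a | b for a, b in combinations(survivors, 2)
                      if len(a | b) == size + 1})
        size += 1
    return frequent

# toy "market basket" database of four transactions
db = [{"eat", "drink", "pay"}, {"eat", "pay"}, {"eat", "drink"}, {"drink", "pay"}]
freq = frequent_itemsets(db, t_sup=2)
```

With t_sup = 2, all singletons and pairs survive, while the full triple (support 1) is pruned.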

Script Induction
The concept of script knowledge in AI, along with early knowledge-based methods to learn scripts, was introduced by Minsky (1974); Schank and Abelson (1977); Mooney and DeJong (1985).
With the rise of statistical methods, the next generation of algorithms made use of co-occurrence statistics and distributional semantics for script learning (Chambers and Jurafsky, 2008, 2009; Jans et al., 2012; Pichotta and Mooney, 2014). Our primary focus is on drawing connections between ARM and this body of work.
Following Chambers and Jurafsky (2008), we define a narrative chain as "a partially ordered set of narrative events that share a common actor", where the partial ordering typically represents the temporal or causal order of events, and a narrative event is "a tuple of an event and its participants, represented as typed dependencies". Formally, we define a narrative event e := (v, d), where v is a verb lemma and d is a dependency arc between the verb and the common actor (dobj or nsubj). An example of a narrative chain is given in Figure 1.

[Figure 1: An example narrative chain, after Chambers and Jurafsky (2008). Arrows indicate partial temporal ordering.]
SI is thus concerned with: 1. automatic mining of commonly co-occurring sets of narrative events from text, 2. partially ordering those sets.
The narrative cloze test (Chambers and Jurafsky, 2008) is a standard extrinsic evaluation procedure for Task 1 of SI. In this test, a sequence of narrative events is automatically extracted from a document, and one event is removed; the goal is to predict the missing event. Formally, given an incomplete narrative chain {e_1, e_2, ..., e_L} and an insertion point k ∈ [L], we would like to predict the most likely missing event ê to complete the chain: {e_1, e_2, ..., e_k, ê, e_{k+1}, ..., e_L}.
Although recent work in SI (Rudinger et al., 2015b; Pichotta and Mooney, 2016; Peng and Roth, 2016; Weber et al., 2018) has focused on a Language Modeling (LM) approach to the narrative cloze test, that approach is fundamentally different in that it makes use of the total ordering of events; it is thus not directly comparable to ARM, which does not assume any ordering of events within a chain.
In the next section, we survey two of the most influential count-based SI models, showing how each of them is related to ARM.

Unordered PMI model
The original model for this task by Chambers and Jurafsky (2008) is based on the pointwise mutual information (PMI) between events:

pmi(e_1, e_2) = log [ P(e_1, e_2) / (P(e_1) · P(e_2)) ],  where  P(e_1, e_2) = C(e_1, e_2) / Σ_{e', e'' ∈ E} C(e', e''),    (2)

where C(e_1, e_2) is defined as the number of narrative chains in which e_1 and e_2 both occurred, and E is a fixed vocabulary of narrative events. The model selects the missing event ê in the narrative cloze test according to the score

ê = argmax_{e ∈ E} Σ_{i=1}^{L} pmi(e_i, e),    (3)

assuming that the missing event ê is inserted at the end of the existing chain (k = L).

From (2) and (3) we observe that

ê = argmax_{e ∈ E} Σ_{i=1}^{L} log P(e_i | e),    (4)

since the Σ_i log P(e_i) term does not depend on e and drops out of the argmax. One way to interpret Eq. 4 is to say that it was obtained from the following model with the naïve Bayes assumption:

ê = argmax_{e ∈ E} P(e_1, e_2, ..., e_L | e).    (5)

Importantly, in the above equation no assumptions are made about the order in which the events e_1, ..., e_L happened: we treat the narrative chain as a document in which individual events are features (the "bag of events" assumption).
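As a concrete illustration of Eqs. 2-5, the following sketch estimates pairwise counts from a handful of invented toy chains and ranks candidates by their summed PMI with the observed events:

```python
import math
from collections import Counter

# toy narrative chains; events are (verb, dependency) pairs written as strings
chains = [["arrest-obj", "charge-obj", "convict-obj"],
          ["arrest-obj", "charge-obj"],
          ["eat-subj", "pay-subj"]]

C = Counter()  # C(e1, e2): number of chains where both events occur
for chain in chains:
    for i in range(len(chain)):
        for j in range(i + 1, len(chain)):
            C[frozenset([chain[i], chain[j]])] += 1

total = sum(C.values())
marg = Counter()  # marginal co-occurrence mass per event
for pair, c in C.items():
    for e in pair:
        marg[e] += c

def pmi(e1, e2):
    c = C[frozenset([e1, e2])]
    return math.log(c * total / (marg[e1] * marg[e2])) if c else float("-inf")

def score(chain, candidate):
    # Eq. 3: sum of PMIs between the candidate and every observed event
    return sum(pmi(e, candidate) for e in chain)

observed = ["arrest-obj", "charge-obj"]
best = max((e for e in marg if e not in observed),
           key=lambda e: score(observed, e))
```

Here the top-ranked completion for the chain {arrest-obj, charge-obj} is convict-obj, since the cooking-unrelated events never co-occur with the observed ones.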

Bigram Probability model
The bigram probability model was proposed by Jans et al. (2012) and was also used by Pichotta and Mooney (2014). It utilizes positional information between co-occurring events, selecting the missing event ê according to the score

ê = argmax_{e ∈ E} Σ_{i=1}^{k} log P(e | e_i) + Σ_{i=k+1}^{L} log P(e_i | e),    (6)

where k is the insertion point of the missing event ê, P(e_2 | e_1) = C_ord(e_1, e_2) / C_ord(e_1, *), and the counts C_ord(e_1, e_2) are ordered, i.e., in general C_ord(e_1, e_2) ≠ C_ord(e_2, e_1).
Similarly to the Unordered PMI model, we can relax the conditional independence assumption. However, to apply Bayes' theorem, we need (e_1, e_2) and (e_2, e_1) to be the same events in the outcome space; thus we have to assume unordered counts: C(e_1, e_2) = C_ord(e_1, e_2) + C_ord(e_2, e_1). Proceeding with this, we get:

ê = argmax_{e ∈ E} Σ_{i=1}^{L} log P(e_i | e) + k · log P(e) = argmax_{e ∈ E} log P(e_1, ..., e_L | e) + k · log P(e),    (7)

where the last equality is obtained by relaxing the full conditional independence assumption (similarly to Eq. 5). It follows that the Bigram Probability model with unordered counts is exactly the Unordered PMI model augmented with the prior probability of the missing event multiplied by its position in the chain. Additionally, note that if k = 1, this model is equivalent to maximizing the posterior probability of the missing event (rather than the likelihood of the narrative chain in Eq. 5):

ê = argmax_{e ∈ E} log P(e_1, ..., e_L | e) + log P(e)
  = argmax_{e ∈ E} log (P(e_1, ..., e_L | e) · P(e))
  = argmax_{e ∈ E} log P(e | e_1, ..., e_L).
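A minimal sketch of the bigram score (Eq. 6) on toy data of our own; the add-one smoothing is our addition to keep unseen pairs finite and is not part of the original model:

```python
import math
from collections import Counter

chains = [["arrest", "charge", "convict"],
          ["arrest", "charge"],
          ["eat", "pay"]]

C_ord = Counter()  # ordered skip-bigram counts C_ord(e1, e2)
vocab = set()
for chain in chains:
    vocab.update(chain)
    for i in range(len(chain)):
        for j in range(i + 1, len(chain)):
            C_ord[(chain[i], chain[j])] += 1

def p_cond(e2, e1, alpha=1.0):
    """Add-alpha smoothed P(e2 | e1) = C_ord(e1, e2) / C_ord(e1, *)."""
    num = C_ord[(e1, e2)] + alpha
    den = sum(c for (a, _), c in C_ord.items() if a == e1) + alpha * len(vocab)
    return num / den

def score(chain, k, e):
    # Eq. 6: events before the gap condition the candidate from the left,
    # events after the gap are conditioned on the candidate
    left = sum(math.log(p_cond(e, e_i)) for e_i in chain[:k])
    right = sum(math.log(p_cond(e_i, e)) for e_i in chain[k:])
    return left + right

chain = ["arrest", "charge"]
```

With k = L = 2 (insertion at the end), "convict" outscores an unrelated event such as "eat" for this chain.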
Similarly to Eq. 5, we view the narrative chain e_1, ..., e_L as a set, and thus Eq. 6 is not a language model in the traditional NLP sense.

SI as ARM
The models defined by Eqs. 5, 6, and 7 are hard to compute directly: without simplifying assumptions, they would require a huge number of parameters and large training sets (Jurafsky and Martin, 2019). A common approach in the existing count-based SI work is to assume full conditional independence. A viable and less restrictive alternative, as we show in this section, is estimating higher-order count statistics by mining association rules (Section 4.1) and combining the most confident rules to predict the missing event with a simple weighted set cover algorithm (Section 4.2).
More formally, during the training phase, we would like to populate a set of interesting rules S = {S → {e}}, whose antecedents are subsets of the event space, S ⊂ E, and whose consequents are single events e with e ∉ S. We denote by S_e all the rules with the same consequent event e.
During the test phase, where we have an incomplete narrative chain {e_1, e_2, ..., e_L} and want to predict the missing event, we use rules from S_e to efficiently decompose P(e_1, e_2, ..., e_L | e) into P(S_1 | e) · P(S_2 | e) · ... · P(S_t | e) for each candidate event e. Naturally, this means selecting a set of rules whose antecedents {S_1, S_2, ..., S_t} (we call this set a candidate cover) are pairwise disjoint (S_i ∩ S_j = ∅ for all i ≠ j) and cover the event chain fully (S_1 ∪ S_2 ∪ ... ∪ S_t = {e_1, e_2, ..., e_L}).
To quantify the goodness of the decomposition, we define a score function for a candidate cover {S_1, ..., S_t} and a candidate event e as follows:

score(S_1, ..., S_t; e) = Π_{j=1}^{t} P(S_j | e).    (8)

For each candidate event e, we select the best candidate cover Ŝ_e according to the score function:

Ŝ_e = argmax_{{S_1, ..., S_t}} score(S_1, ..., S_t; e),    (9)

and predict the candidate event with the highest-scoring cover:

ê = argmax_{e ∈ E} score(Ŝ_e; e).    (10)

In Section 4.1, we explain how the set of rules S is populated from the SI training corpus. In Section 4.2, we provide a greedy algorithm that solves problem 9 with a provably bounded approximation error.

Mining interesting rules
As discussed in Section 2.1, in order to discover the set of interesting rules S, we need to mine frequent itemsets first. This can be achieved by any frequent itemset mining algorithm, such as Apriori (Agrawal et al., 1994), Eclat (Zaki, 2000), or FP-growth (Han et al., 2000).
Next, for the rule mining step, we define an interestingness function int(S → E) over a rule S → E:

int(S → E) = sup(S ∪ E) / Σ_{S' : |S'| = |S|, S' ∩ E = ∅} sup(S' ∪ E),    (11)

where S' ranges over all itemsets of size |S| that are disjoint with E. Note that int(S → E) provides a maximum likelihood estimate of P(S | E) for the probability space defined over sets of events, and sup(·) is a generalization of the previously defined C(·, ·) to event sets of size larger than two.
The denominator of (11) requires calculating the support of exponentially many itemsets. We can instead use the following simpler formula:

wsup_k(E) = Σ_{t ∈ D, E ⊆ t} binom(|t| − |E|, k),

where D is a transactional database of narrative event chains and binom(n, k) is the binomial coefficient: each transaction containing E contributes the number of k-subsets of its remaining items. Our intent is to use the above interestingness function to score rules from S that have a single event as a consequent, and thus Eq. 11 can be further simplified:

int(S → {e}) = sup(S ∪ {e}) / wsup_{|S|}({e}).    (12)

Assuming that for each rule S → {e} the antecedent is small and bounded in size, we can precompute wsup_k({e}) for each e ∈ E and each k ∈ [|S|] in a single pass over the database. Note also that wsup_0(I) = sup(I), and thus wsup_k(·) is a generalization of support (1).
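One consistent combinatorial reading of wsup_k({e}) is that each transaction containing e contributes the number of k-subsets of its remaining events; the sketch below (our own reconstruction on a toy database) checks that reading against the brute-force denominator of Eq. 11:

```python
from math import comb

def sup(itemset, db):
    """sup(I): number of transactions containing itemset I (Eq. 1)."""
    s = set(itemset)
    return sum(1 for t in db if s <= t)

def wsup(k, e, db):
    """wsup_k({e}): each transaction t containing e contributes the number
    of k-subsets of t \\ {e}; this equals the sum of sup(S' ∪ {e}) over all
    itemsets S' with |S'| = k and e not in S'."""
    return sum(comb(len(t) - 1, k) for t in db if e in t)

def interestingness(S, e, db):
    # Eq. 12: maximum likelihood estimate of P(S | {e}) over event sets
    return sup(set(S) | {e}, db) / wsup(len(S), e, db)

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}]

# brute-force check of the denominator over all size-1 itemsets S' != {"a"}
items = {i for t in db for i in t}
brute = sum(sup({s, "a"}, db) for s in items - {"a"})
```

On this database the one-pass wsup and the brute-force sum agree, and wsup_0({e}) reduces to sup({e}) as stated in the text.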
Given an interestingness function, we can now proceed to mine interesting rules over frequent event sets. The rule mining process is shown in Algorithm 1.
After the set of interesting rules S is populated, we can perform test-time inference on new narrative chains with Eqs. 9 and 10. To facilitate this, we frame the inference problem as the weighted set cover problem. The latter was shown to be NP-complete by Karp (1972), but a simple greedy algorithm by Chvatal (1979) provides an approximate solution. To make it applicable to the search problem 9, we run it (for each candidate event e) on the set S mined by Algorithm 1, with the following weight function:

w(S → {e}) = −log int(S → {e}).    (13)

The following lemma provides a lower bound on the score of the candidate cover obtained by Algorithm 2.
Algorithm 2 (Greedy weighted set cover). Input: a set of interesting rules S_e, and a narrative chain e_1, e_2, ..., e_L.

Score estimation via weighted set cover
Lemma 2. Algorithm 2 finds a candidate cover {S_1, ..., S_t} for a narrative chain {e_1, ..., e_L} and a candidate event e such that score(S_1, ..., S_t; e) ≥ OPT^(ln L + 1), where OPT is the score of the best candidate cover Ŝ_e.
The greedy algorithm guarantees a cover of total weight W ≤ (ln L + 1) · OPT_cover, where OPT_cover is the weight of the minimum-weight cover. By exponentiating the left- and right-hand sides and noting that OPT = e^(−OPT_cover) (by definition of the weight and score functions), we get:

score(S_1, ..., S_t; e) = e^(−W) ≥ e^(−(ln L + 1) · OPT_cover) = OPT^(ln L + 1).

If we group the rules S → {e} by the consequent event and order them by w(S)/|S| within each group, then step 8 in Algorithm 2 becomes equivalent to iterating over the ordered rules in S_e. The overall running time to score a candidate event e is O(L + |S_e|).
Additionally, O(Σ_{e ∈ E} |S_e| log |S_e|) preprocessing time is needed to group and order the rules in S.
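A sketch of the greedy idea behind Algorithm 2 (our reconstruction from the text: Chvátal-style greedy restricted to disjoint antecedents, with rule weights w(S) = −log int(S → {e}); the rules and their interestingness values are invented):

```python
import math

def greedy_cover(chain, rules):
    """Greedily build a disjoint cover of the chain from rule antecedents,
    picking at each step the rule with the lowest weight per covered event.
    `rules` maps frozenset antecedents to their interestingness for a fixed
    candidate event e; returns (cover, score) or None if no cover exists."""
    uncovered = set(chain)
    cover, total_weight = [], 0.0
    while uncovered:
        best, best_ratio = None, math.inf
        for S, interest in rules.items():
            # restricting to S ⊆ uncovered keeps the chosen antecedents disjoint
            if S <= uncovered:
                ratio = -math.log(interest) / len(S)
                if ratio < best_ratio:
                    best, best_ratio = S, ratio
        if best is None:
            return None  # the mined rules cannot cover this chain
        cover.append(best)
        total_weight += -math.log(rules[best])
        uncovered -= best
    # the score (Eq. 8) is the product of interestingness values, i.e. exp(-W)
    return cover, math.exp(-total_weight)

rules = {frozenset({"a", "b"}): 0.5,
         frozenset({"a"}): 0.4,
         frozenset({"b"}): 0.4,
         frozenset({"c"}): 0.2}
cover, sc = greedy_cover(["a", "b", "c"], rules)
```

Here the greedy step prefers the size-two rule {a, b} (lower weight per covered event than two singletons) and completes the cover with {c}, giving score 0.5 · 0.2.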

Dataset
We perform experiments on the New York Times portion of the Annotated Gigaword dataset (Napoles et al., 2012). Chains of narrative events are constructed from the (automatically generated) in-document coreference chains: from each document in the dataset, we extract all coreference chains and retain the longest one, provided its length is at least two. We also filter out the 10 most frequent events, which are mostly reporting verbs such as "say" and "think" and convey little meaning for the SI task.
Training is done on the 1994-2006 portion (1.3M chains with 8.7M narrative events); the development set is a subset of the 2007-2008 portion (10K chains with 62K narrative events), and the test set is a subset of the 2009-2010 portion (5K chains with 31K narrative events).

Model setup
We implement and compare models described in Sections 3 and 4, along with a strong baseline Unigram model by Pichotta and Mooney (2014), which ranks each event according to its unigram probability in the training corpus.
For testing the Unordered PMI and Bigram models, we use implementations from the Nachos software package (Rudinger et al., 2015a). Both models are tuned to use skip-grams (as defined by Jans et al. (2012)) of size up to the chain length, which helps reduce data sparsity and is consistent with the set of rules (of size two) generated by ARM.
ARM consists of 1) mining frequent itemsets and 2) obtaining interesting rules from those itemsets. For frequent itemset mining, we use the FP-growth algorithm of Han et al. (2000) with a threshold of t_sup = 100. For rule mining, we implement Algorithm 1. Since the rule mining step is much less computationally intensive than itemset mining, we can use a more permissive threshold of t_int = 10^−5. We use the same thresholds across all models by applying a back-off strategy in the Unordered PMI and Bigram models with threshold t_ARM = max(t_sup, C(*, e) · t_int).

Experimental Results
We perform two experiments, comparing existing count-based SI models with three variants of the proposed ARM model. The performance is measured using Recall@50 and Mean Reciprocal Rank.
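For reference, both metrics can be computed from the rank each model assigns to the gold event on every test chain (the ranks below are hypothetical):

```python
def mrr(ranks):
    """Mean Reciprocal Rank over the gold event's rank (1-indexed)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_k(ranks, k=50):
    """Fraction of cloze instances whose gold event appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

ranks = [1, 3, 60, 12, 2]  # hypothetical gold-event ranks on five test chains
```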
In the first experiment, we establish that the count-based pruning introduced by the ARM support and interestingness thresholds (t_sup and t_int, respectively) for reducing the search space during rule mining does contribute to better performance on the narrative cloze test. We also validate empirically that the ARM model with binary (size-two) rules is equivalent to the Unordered PMI (UOP) model by Chambers and Jurafsky (2008). Finally, we compare variants of the ARM model, which differ in how they incorporate the prior probability of the missing event. We conclude that the posterior ARM model, given by Eq. 7, achieves the best performance. The results of this experiment are outlined in Table 2.

In the second experiment, we compare the best-scoring ARM model against the other baseline models on 5,000 test chains. We achieve a 5% relative improvement in Mean Reciprocal Rank (MRR) and 10% in Recall@50, which can be attributed to the use of higher-order count statistics and the selection of the prior for the missing event. The scalability of both the rule mining and inference algorithms suggests that performance may improve further as the training corpus grows and more reliable higher-order statistics become available. The results of this experiment are shown in Table 3.
Similarly to Rudinger et al. (2015b), we also note that all models tend to improve their performance on longer chains, which may be explained by the availability of additional contextual information.

Conclusion
Our decision to approach count-based SI as ARM was motivated by a previously under-explored similarity between these well-established areas, which we outlined in this paper. Drawing on the existing work on Classification using Association Rules (CAR) (Liu et al., 1998; Thabtah et al., 2005), we proposed a scoring function that uses ARM-based count statistics to reliably predict the missing event in the narrative cloze test. One downside of relying solely on count-based statistics is the low support of longer itemsets due to data sparsity. Modern contextual encoders (Devlin et al., 2018), on the other hand, mitigate this via parameter sharing. Reliably mining rules whose support and interestingness are based on both counts and the properties of dense embeddings is a promising direction for future work.