Script Induction as Language Modeling

The narrative cloze is an evaluation metric commonly used in work on automatic script induction. While prior work in this area has focused on count-based methods from distributional semantics, such as pointwise mutual information, we argue that the narrative cloze can be productively reframed as a language modeling task. By training a discriminative language model for this task, we attain improvements of up to 27 percent over prior methods on standard narrative cloze metrics.


Introduction
Although the concept of scripts in artificial intelligence dates back to the 1970s (Schank and Abelson, 1977), interest in this topic has been renewed by recent efforts to automatically induce scripts from text on a large scale. One particularly influential work in this area, Chambers and Jurafsky (2008), treats the problem of script induction as one of learning narrative chains, which they accomplish using simple textual co-occurrence statistics. For the novel task of learning narrative chains, they introduce a new evaluation metric, the narrative cloze test, which involves predicting a missing event from a chain of events drawn from text. Several follow-up works (Chambers and Jurafsky, 2009; Jans et al., 2012; Pichotta and Mooney, 2014; Rudinger et al., 2015) employ and extend Chambers and Jurafsky (2008)'s methods for learning narrative chains, each using the narrative cloze to evaluate their work. (A number of related works on script induction use alternative task formulations and evaluations: Chambers, 2013; Cheung and Penn, 2013; Frermann et al., 2014; Manshadi et al., 2008; Modi and Titov, 2014; Regneri et al., 2010.) In this paper, we take the position that the narrative cloze test, which has been treated predominantly as a method for evaluating script knowledge, is more productively thought of simply as a language modeling task. To support this claim, we demonstrate a marked improvement over previous methods on this task using a powerful discriminative language model, the log-bilinear model (LBL). Based on this finding, we believe one of the following conclusions must follow: either discriminative language models are a more effective technique for script induction than previous methods, or the narrative cloze test is not a suitable evaluation for this task.

Task Definition
Following the definitions of Chambers and Jurafsky (2008), a narrative chain is "a partially ordered set of narrative events that share a common actor," where a narrative event is "a tuple of an event (most simply a verb) and its participants, represented as typed dependencies" (de Marneffe et al., 2006). Formally, e := (v, d), where e is a narrative event, v is a verb lemma, and d is the syntactic dependency (nsubj or dobj) between v and the protagonist. As an example, consider the following narrative: "John studied for the exam and aced it. His teacher congratulated him."
In the narrative cloze test, a sequence of narrative events (like the example provided here) is extracted automatically from a document, and one narrative event is removed; the task is to predict the missing event.
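The event representation and cloze setup above can be sketched as follows; the chain contents and the helper name `make_cloze` are illustrative, not taken from the authors' code. For the example narrative, John is the protagonist: he studies (nsubj), aces (nsubj), and is congratulated (dobj).

```python
from typing import List, Tuple

# A narrative event e := (v, d): verb lemma plus the dependency
# (nsubj or dobj) linking the verb to the protagonist.
NarrativeEvent = Tuple[str, str]

# Chain extracted for the protagonist "John" in the example narrative.
chain: List[NarrativeEvent] = [
    ("study", "nsubj"),
    ("ace", "nsubj"),
    ("congratulate", "dobj"),
]

def make_cloze(chain: List[NarrativeEvent], k: int):
    """Remove the event at index k; the task is to predict it."""
    held_out = chain[k]
    context = chain[:k] + chain[k + 1:]
    return context, held_out

context, answer = make_cloze(chain, 1)
# context == [("study", "nsubj"), ("congratulate", "dobj")]
# answer  == ("ace", "nsubj")
```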
Data Each of the models discussed in the following section is trained and tested on chains of narrative events extracted from stories in the New York Times portion of the Gigaword corpus (Graff et al., 2003) with Concrete annotations (Ferraro et al., 2014). Training is on the entirety of the 1994-2006 portion (16,688,422 chains with 58,515,838 narrative events); development is a subset of the 2007-2008 portion (10,000 chains with 35,109 events); and test is a subset of the 2009-2010 portion (5,000 chains with 17,836 events). All extracted chains are of length two or greater.
Chain Extraction To extract chains of narrative events for training and testing, we rely on the (automatically-generated) coreference chains present in Concretely Annotated Gigaword. Each narrative event in an extracted chain is derived from a single mention in the corresponding coreference chain, i.e., it consists of the verb and syntactic dependency (nsubj or dobj) that governs the head of the mention, if such a dependency exists. Overlapping mentions within a coreference chain are collapsed to a single mention to avoid redundant extractions.
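The extraction step can be sketched as below, under loose assumptions: the toy mention records and the `governor` field are illustrative stand-ins, not the Concrete schema, and in the real pipeline the governing (verb, dependency) pair comes from the corpus's automatic dependency parses.

```python
def overlaps(a, b):
    """True if token spans a = (start, end) and b overlap."""
    return a[0] < b[1] and b[0] < a[1]

def extract_chain(mentions):
    """Map each coreference mention to a (verb, dep) narrative event,
    collapsing overlapping mentions to a single one and skipping
    mentions with no governing nsubj/dobj dependency."""
    events, kept_spans = [], []
    for m in mentions:
        span = (m["start"], m["end"])
        if any(overlaps(span, s) for s in kept_spans):
            continue  # redundant extraction: mention overlaps one already kept
        kept_spans.append(span)
        gov = m.get("governor")  # (verb_lemma, dep) or None
        if gov is not None and gov[1] in ("nsubj", "dobj"):
            events.append(gov)
    return events

mentions = [
    {"start": 0, "end": 1, "governor": ("study", "nsubj")},
    {"start": 0, "end": 1, "governor": ("study", "nsubj")},        # overlapping span
    {"start": 9, "end": 10, "governor": ("congratulate", "dobj")},
    {"start": 12, "end": 13, "governor": None},                    # no nsubj/dobj
]
chain = extract_chain(mentions)
# chain == [("study", "nsubj"), ("congratulate", "dobj")]
```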

Models
In this section we present each of the models we train for the narrative cloze evaluation. In a single narrative cloze test, a sequence of narrative events, (e_1, ..., e_L), with an insertion point, k, for the missing event is provided. Given a fixed vocabulary of narrative events, V, a candidate sequence is generated for each vocabulary item by inserting that item into the sequence at index k. Each model generates a score for each candidate sequence, yielding a ranking over the vocabulary items. The rank assigned to the actual missing vocabulary item is the score the model receives on that cloze test. In this case, we set V to include all narrative events, e, that occur at least ten times in training, yielding a vocabulary size of 12,452. All out-of-vocabulary events are converted to (and scored as) the symbol UNK.
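The candidate-generation and ranking loop can be sketched as follows; `score_sequence` is a placeholder for any of the models in this section, and the toy frequency model at the bottom is illustrative only.

```python
def cloze_rank(score_sequence, context, k, vocab, answer):
    """Insert each vocabulary item at position k, score the resulting
    candidate sequence, and return the rank (1 = best) of `answer`."""
    scored = sorted(
        ((score_sequence(context[:k] + [e] + context[k:]), e) for e in vocab),
        key=lambda pair: -pair[0],  # higher score = better
    )
    ranking = [e for _, e in scored]
    return ranking.index(answer) + 1

# Toy stand-in model: score a candidate sequence by the frequency of
# the event inserted at index k (here k = 1).
freq = {("ace", "nsubj"): 3, ("eat", "nsubj"): 2, ("run", "nsubj"): 1}
score = lambda seq: freq.get(seq[1], 0)
vocab = list(freq)

rank = cloze_rank(
    score, [("study", "nsubj"), ("congratulate", "dobj")], 1,
    vocab, ("ace", "nsubj"),
)
# rank == 1, since ("ace", "nsubj") is the highest-scoring candidate
```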

Count-based Methods
Unigram Baseline (UNI) A simple but strong baseline introduced by Pichotta and Mooney (2014) for this task is the unigram model: candidates are ranked by their observed frequency in training, without regard to context.
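The baseline amounts to a frequency table over training events; a minimal sketch over a toy training set (the chains below are illustrative, not corpus data):

```python
from collections import Counter

# Toy training corpus of narrative event chains.
train_chains = [
    [("eat", "nsubj"), ("pay", "nsubj")],
    [("eat", "nsubj"), ("leave", "nsubj")],
]
unigram_counts = Counter(e for chain in train_chains for e in chain)

def unigram_rank(vocab, answer):
    """Rank of `answer` (1 = best) under frequency-only scoring;
    the cloze context is ignored entirely."""
    ranking = sorted(vocab, key=lambda e: -unigram_counts[e])
    return ranking.index(answer) + 1

vocab = [("eat", "nsubj"), ("pay", "nsubj"), ("leave", "nsubj")]
# ("eat", "nsubj") appears most often in training, so it is always ranked first.
```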

Unordered PMI (UOP) The original model for this task, proposed by Chambers and Jurafsky (2008), is based on the pointwise mutual information (PMI) between events.
Here, C(e_1, e_2) is the number of times e_1 and e_2 occur in the same narrative event sequence, i.e., the number of times they "had a coreferring entity filling the values of [their] dependencies," and the ordering of e_1 and e_2 is not considered. In our implementation, individual counts are defined as marginals over the pairwise counts:

C(e_1, *) = Σ_e C(e_1, e)    and    C(*, e_2) = Σ_e C(e, e_2),

so that pmi(e_1, e_2) = log C(e_1, e_2) − log C(e_1, *) − log C(*, e_2), up to an additive constant that does not affect the ranking. This model selects the best candidate event for a cloze test over the observed events e_1, ..., e_L according to the following score:

score(e) = Σ_{i=1}^{L} pmi(e, e_i),

choosing the candidate e ∈ V that maximizes it. We tune this model with an option to apply a modified version of discounting for PMI from Pantel and Ravichandran (2004).
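The symmetric counts C(e_1, e_2) and the PMI-sum scoring can be sketched as below, with a toy corpus standing in for the Gigaword counts and the discounting option omitted:

```python
import math
from collections import Counter
from itertools import combinations

# Toy corpus of narrative event chains.
chains = [
    [("eat", "nsubj"), ("pay", "nsubj"), ("leave", "nsubj")],
    [("eat", "nsubj"), ("pay", "nsubj")],
    [("run", "nsubj"), ("leave", "nsubj")],
]

# Symmetric pair counts: order within a chain is not considered.
C = Counter()
for ch in chains:
    for e1, e2 in combinations(ch, 2):
        C[(e1, e2)] += 1
        C[(e2, e1)] += 1

# Individual counts as marginals over the pairwise counts.
row = Counter()
for (e1, _e2), n in C.items():
    row[e1] += n

def pmi(e1, e2):
    if C[(e1, e2)] == 0:
        return float("-inf")
    # Constant corpus-size terms are dropped; they cancel in the ranking.
    return math.log(C[(e1, e2)]) - math.log(row[e1]) - math.log(row[e2])

def chain_score(candidate, observed):
    """Sum of PMI between the candidate and every observed event."""
    return sum(pmi(candidate, e) for e in observed)
```

With these toy counts, (pay, nsubj) co-occurs with (eat, nsubj) and so outscores (run, nsubj), which never does.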
Ordered PMI (OP) This model is a slight variation on Unordered PMI introduced by Jans et al. (2012). The only distinction is that C(e 1 , e 2 ) is treated as an asymmetric count, sensitive to the order in which e 1 and e 2 occur within a chain.
For each of the four narrative cloze scoring metrics we report (average rank, mean reciprocal rank, recall at 10, and recall at 50), we tune the Unordered PMI, Ordered PMI, and Bigram Probability models over a parameter space that includes a pairwise count threshold, T.

A Discriminative Method
Log-Bilinear Language Model (LBL) The log-bilinear language model was introduced by Mnih and Hinton (2007). Like other language models, the LBL produces a probability distribution over the next possible word given a sequence of N previously observed words; N is a hyperparameter that determines the size of the context used for computing the probabilities. While many variants of the LBL have been proposed since its introduction, we use the simple variant described below.
Formally, we associate one context vector c_e ∈ R^d, one bias parameter b_e ∈ R, and one target vector t_e ∈ R^d with each narrative event e ∈ V ∪ {UNK, BOS, EOS}, where V is the vocabulary of events and BOS, EOS, and UNK are the beginning-of-sequence, end-of-sequence, and out-of-vocabulary symbols, respectively. The probability of an event e that appears after a sequence s = [s_1, s_2, ..., s_N] of context events is defined as:

P(e | s) = exp(t_e · p + b_e) / Σ_{e'} exp(t_{e'} · p + b_{e'}),    where    p = Σ_{j=1}^{N} m_j ⊙ c_{s_j}.

The operator ⊙ performs element-wise multiplication of two vectors. The parameters optimized during training are m_j for all j ∈ [1, ..., N] and b_e, c_e, t_e for all e ∈ V ∪ {UNK, BOS, EOS}. To calculate the log-probability of a sequence of narrative events E = (e_1, ..., e_L) we compute:

log P(E) = Σ_{i=1}^{L+1} log P(e_i | f_E(e_i)),

where e_{L+1} = EOS and f_E is a function that returns the sequence of N events that precede e_i in the sequence made by prepending N BOS tokens and appending a single EOS token to E. The LBL models are trained by maximizing this log-probability (equivalently, minimizing the negative log-likelihood) over all sequences in the training corpus. We used the OxLM toolkit (Baltescu et al., 2014), which internally uses noise-contrastive estimation (NCE) (Gutmann and Hyvärinen, 2010) and processor parallelization to speed up training. For this task, we train LBL models with N = 2 (LBL2) and N = 4 (LBL4); in our experiments, increasing the context size to N = 6 did not significantly improve (or degrade) performance.

Table 1 shows the results of 17,836 narrative cloze tests (derived from 5,000 held-out test chains), with results bucketed by chain length. Performance is reported on four metrics: average rank, mean reciprocal rank, recall at 10, and recall at 50.
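The LBL scoring rule can be sketched in a few lines of dependency-free Python; the toy dimensions and random initialization are illustrative only (a trained model would learn these parameters, e.g. via NCE as in OxLM).

```python
import math
import random

random.seed(0)
V, d, N = 5, 8, 2  # toy vocabulary size, embedding dimension, context size

def vec():
    return [random.gauss(0.0, 1.0) for _ in range(d)]

ctx = [vec() for _ in range(V)]                  # context vectors c_e
tgt = [vec() for _ in range(V)]                  # target vectors t_e
bias = [random.gauss(0.0, 1.0) for _ in range(V)]  # biases b_e
M = [vec() for _ in range(N)]                    # position weights m_j

def next_event_probs(context_ids):
    """P(e | s) for every e: softmax over t_e . p + b_e,
    where p = sum_j m_j (elementwise-times) c_{s_j}."""
    p = [0.0] * d
    for j, s in enumerate(context_ids):
        for i in range(d):
            p[i] += M[j][i] * ctx[s][i]
    logits = [sum(t[i] * p[i] for i in range(d)) + b
              for t, b in zip(tgt, bias)]
    m = max(logits)                       # subtract max for stability
    exps = [math.exp(l - m) for l in logits]
    Z = sum(exps)
    return [e / Z for e in exps]

probs = next_event_probs([1, 3])  # a valid distribution over all V events
```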

Experimental Results
For each of the four metrics, the best overall performance is achieved by one of the two LBL models (context size 2 or 4); the LBL models also achieve the best performance on every chain length. Not only are the gains achieved by the discriminative LBL consistent across metrics and chain length, they are large. For average rank, the LBL achieves a 27.0% relative improvement over the best non-discriminative model; for mean reciprocal rank, a 19.9% improvement; for recall at 10, a 22.8% improvement; and for recall at 50, a 17.6% improvement. (See Figure 1.) Furthermore, note that both PMI models and the Bigram model have been individually tuned for each metric, while the LBL models have not. (The two LBL models are tuned only for overall perplexity on the development set.) All models trend toward improved performance on longer chains. Because the unigram model also improves with chain length, it appears that longer chains contain more frequent events and are thus easier to predict. However, LBL performance is also likely improving on longer chains because of additional contextual information, as is evident from LBL4's slight relative gains over LBL2 on longer chains.

Conclusion
Pointwise mutual information and other related count-based techniques have been widely used to identify semantically similar words (Church and Hanks, 1990; Lin and Pantel, 2001; Turney and Pantel, 2010), so it is natural that these techniques have also been applied to the task of script induction. Qualitatively, PMI often identifies intuitively compelling matches; among the top 15 events sharing a high PMI with (eat, nsubj) under the Unordered PMI model, for example, we find events such as (overeat, nsubj), (taste, nsubj), (smell, nsubj), (cook, nsubj), and (serve, dobj). When evaluated by the narrative cloze test, however, these count-based methods are overshadowed by the performance of a general-purpose discriminative language model. Our decision to attempt this task with the log-bilinear model was motivated by the simple observation that the narrative cloze test is, in reality, a language modeling task. Does the LBL's success on this task mean that work in script induction should abandon traditional count-based methods for discriminative language modeling techniques? Or does it mean that an alternative evaluation metric is required to measure script knowledge? While we believe our results are sufficient to conclude that one of these alternatives is the case, we leave the task of determining which to future research.