Integrating Order Information and Event Relation for Script Event Prediction

There has been a recent line of work automatically learning scripts from unstructured texts by modeling narrative event chains. While the dominant approach groups events using event-pair relations, LSTMs have also been used to encode full chains of narrative events. The latter has the advantage of learning long-range temporal orders, yet the former is more adaptive to partial orders. We propose a neural model that leverages the advantages of both methods, by using LSTM hidden states as features for event-pair modelling. A dynamic memory network is utilized to automatically induce weights on existing events for inferring a subsequent event. Standard evaluation shows that our method significantly outperforms both methods above, giving the best results reported so far.


Introduction
Frequently recurring sequences of events in prototypical scenarios, such as visiting a restaurant and driving to work, are a useful source of world knowledge. Two examples are shown in Figure 1, which are different variations of the "restaurant visiting" scenario, where events are partially ordered and can be flexible. Such knowledge is useful for natural language understanding because texts typically do not include event details when mentioning a scenario. For example, the reader is expected to infer that the narrator could have been driving or cycling given the text "I got a flat tire". Another typical use of event chain knowledge is to help infer what is likely to happen next given a previous event sequence in a scenario. We investigate the modeling of stereotypical event chains, which is remotely similar to language modeling, but with events being more sparse and flexibly ordered than words.
Our work follows a recent line of NLP research on script learning. Stereotypical knowledge about partially-ordered events, together with their participant roles such as "customer", "waiter", and "table", is conventionally referred to as scripts (Schank et al., 1977). NLP algorithms have been investigated for automatically inducing scripts from unstructured texts (Mooney and DeJong, 1985; Chambers and Jurafsky, 2008). In particular, Chambers and Jurafsky (2008) made a first attempt to learn scripts from text by inducing event chains, grouping events based on their narrative coherence, calculated using Pointwise Mutual Information (PMI). Jans et al. (2012) showed that the method can be improved by calculating event relations using skip bigram probabilities, which explicitly model the temporal order of event pairs. Jans et al. (2012)'s model is adopted by a line of subsequent methods on inducing event chains from text (Orr et al., 2014; Pichotta and Mooney, 2014; Rudinger et al., 2015).
While the above methods are statistical, neural network models have recently been used for event sequence modeling. Granroth-Wilding and Clark (2016) used a Siamese Network instead of PMI to calculate the coherence between two events. Rudinger et al. (2015) extended the idea of Jans et al. (2012) by using a log-bilinear neural language model (Mnih and Hinton, 2007) to calculate event probabilities. By learning embeddings for reducing sparsity, the above models give much better results compared to the models of Chambers and Jurafsky (2008) and Jans et al. (2012). Similar in spirit, Modi (2016) predicted the probability of an event belonging to a certain event chain by modeling known events in the chain as a bag of vectors, showing that it outperforms discrete statistical methods. These neural methods are consistent with the earlier statistical models in leveraging event-pair relations. Pichotta and Mooney (2016) experimented with LSTM for script learning, using an existing sequence of events to predict the probability of a next event, which outperformed strong discrete baselines. One advantage of LSTMs is that they can encode unbounded time sequences without losing long-term historical information. LSTMs capture significantly more order information compared to the methods of Granroth-Wilding and Clark (2016), Rudinger et al. (2015), and Modi (2016), which model the temporal order of only pairs of events. On the other hand, a strong-order LSTM model can also suffer the disadvantage of over-fitting, given the flexible order of event chains in a script, as demonstrated by the cases of Figure 1. In this aspect, event-pair models are more adaptive for flexible orders. However, no direct comparisons have been reported between LSTM and various existing neural network methods that model event-pairs. 
We make such comparisons using the same benchmark, finding that the method of Pichotta and Mooney (2016) does not necessarily outperform event-pair models, such as that of Granroth-Wilding and Clark (2016). LSTM temporal ordering and event pair modeling have their respective strengths.
To leverage the advantages of both methods, we propose to integrate chain temporal order information into event relation measuring. In particular, we calculate event pair relations by representing events in a chain using LSTM hidden states, which encode temporal information. The LSTM over-fitting issue is mitigated by using the temporal order in a chain as a feature for event pair modeling, rather than as the direct model output. In addition, observing that the importance of existing events can vary for inferring a subsequent event, we use a dynamic memory network model to automatically induce a weight for each existing event when inferring the next event. In contrast, previous methods give equal weights to existing events (Chambers and Jurafsky, 2008; Modi, 2016; Granroth-Wilding and Clark, 2016).
Results on a multi-choice narrative cloze benchmark show that our model significantly outperforms both Granroth-Wilding and Clark (2016) and Pichotta and Mooney (2016), improving the state-of-the-art accuracy from 49.57% to 55.12%. Our contributions can be summarized as follows:
• We make a systematic comparison of LSTM and pair-based event sequence learning methods using the same benchmarks.
• We propose a novel dynamic memory network model, which combines the advantages of both LSTM temporal order learning and traditional event pair coherence learning.
• We obtain the best results in the standard multi-choice narrative cloze test.

Related Work
Scripts have been a traditional subject in AI research (Schank et al., 1977), where event sequences are manually encoded in knowledge bases, and used for end tasks such as inference. They are also connected with research in linguistics and psychology, and sometimes referred to as frames (Minsky, 1975;Fillmore, 1982) and schemata (Rumelhart, 1975). The same concept is also studied as templates in information extraction (Sundheim, 1991). Chambers and Jurafsky (2008) pioneered the recent line of work on script induction (Jans et al., 2012;Pichotta and Mooney, 2016;Granroth-Wilding and Clark, 2016), where the focus is on modeling narrative event chains, a crucial subtask for script modeling from raw text. Below we summarize such investigations.
With respect to event representation, Chambers and Jurafsky (2008) cast narrative events as tuples of the form (event, dependency), where the event is typically represented by a verb and the dependency represents the typed dependency relation between the event and a protagonist, such as "subject" and "object". Chambers and Jurafsky (2008) organized narrative chains around a central actor, or protagonist, mining events that share a common protagonist from texts by using a syntactic parser and a coreference resolver. Balasubramanian et al. (2013) observed that the protagonist representation of event chains can suffer from weaknesses such as lack of coherence, and proposed to represent events as (arg 1 , relation, arg 2 ), where arg 1 and arg 2 represent the subject and object, respectively. Such a representation is inspired by open information extraction (Mausam et al., 2012), and offers richer features for event pair modeling. Pichotta and Mooney (2014) adopted a similar idea, using v(e s , e o , e p ) to represent an event, where v is a verb lemma, e s is the subject, e o is the object, and e p is an entity with a prepositional relation to v. Their representation is used by subsequent work such as Modi (2016) and Granroth-Wilding and Clark (2016). We follow Pichotta and Mooney (2016) in our event representation form.
With respect to modeling, existing methods can be classified into two main categories, namely weak-order models, which calculate relations between pairs of events, and strong-order models, which consider the temporal order of events in a full sequence. Event-pair models have so far been the dominant method in the literature. Earlier work used discrete event representations and estimated event relations by statistical counting. As mentioned earlier, Chambers and Jurafsky (2008) used PMI to calculate event relations, and Jans et al. (2012) used skip bigram probabilities to the same end, which is order-sensitive. Most subsequent methods followed Jans et al. (2012) in using skip n-grams (Pichotta and Mooney, 2014; Rudinger et al., 2015).
Events being multi-argument structures, counting-based methods can suffer from sparsity issues. Recent work employed embeddings to address this disadvantage. Rudinger et al. (2015) learned event embeddings as a by-product of training a log-bilinear language model for events; Granroth-Wilding and Clark (2016) leveraged the skip-gram model of Mikolov et al. (2013) for training the embeddings of events and arguments by ordering them into a pseudo sentence. Modi (2016) utilized word embeddings of verbs and arguments directly, using a hidden layer to automatically consolidate the word embeddings into a single structured event embedding. We follow Modi (2016) and use a hidden layer to learn event argument compositions given word embeddings, training the composition function as a part of the event chain learning process.
Besides mitigating the sparsity issue of event representations, neural methods can capture temporal orders between events beyond skip n-grams. Our model integrates the advantages of strong-order learning and event-pair learning by using LSTM hidden states as feature representations of existing events in the calculation of event pair relationships. In addition, we use a memory network model to weigh existing events, which gives better results compared to the equal weighting of existing models.
With respect to evaluation, Chambers and Jurafsky (2008) proposed the Narrative Cloze Test, which asks for a missing event in a given event chain with a gap. The task has been adopted by various subsequent work for comparing results with Chambers and Jurafsky (2008) (Jans et al., 2012;Pichotta and Mooney, 2014;Rudinger et al., 2015). One issue of the narrative cloze test is that there can sometimes be multiple plausible answers, but only one gold-standard answer, which can make it overly expensive to manually evaluate system outputs. To address this issue, Modi (2016) proposed the Adversarial Narrative Cloze (ANC) task, which is to discriminate between pairs of real and corrupted event chains. Granroth-Wilding and Clark (2016) proposed the Multi-Choice Narrative Cloze (MCNC) task, which is to choose the most likely next event from a set of candidates given a chain of events. We choose MCNC for comparing different models.
Other related work includes learning temporal relations of events (Modi and Titov, 2014; UzZaman et al., 2013; Abend et al., 2015), evaluated using different metrics. There has also been work using graph models to induce frames, which emphasizes more learning event structures and less temporal orders (Chambers, 2013; Cheung et al., 2013). The above methods focus on one of the two subtasks we consider here. Frermann et al. (2014) used a Bayesian model to jointly cluster web collections of explicit event sequences and learn input event-pair temporal orders. However, their work is under a different input setting (Regneri et al., 2010), not learning event chains from texts. Mostafazadeh et al. (2016) proposed the Story Cloze Test (SCT), which is to predict the ending given an unfinished story. Our narrative chain prediction task can be regarded as a subtask of the story cloze task, to which it can contribute as a major component. On the other hand, information beyond event chains can also be useful for the story cloze task.

[Figure: example event chain with X = Customer, Y = Waiter: walk(X, restaurant), seat(X), order(X, food), serve(Y, food), eat(X, food), make(X, payment); candidate next events c 1 : receive(X, response), c 2 : drive(X, mile), c 3 : seem(X), c 4 : discover(X, truth)]

Problem Definition
As shown in Figure 2, given a chain of narrative events e 1 , e 2 , ..., e n−1 , our goal is to predict the likelihood of a next event candidate e n . Formally, an event e is a structure v(a 0 , a 1 , a 2 ), where v is a verb describing the event, a 0 and a 1 are its subject and direct object, respectively, and a 2 is a prepositional object. For example, given the sentence "John brought Mary to the restaurant", an event bring(John, Mary, to the restaurant) can be extracted.
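For illustration, the event structure can be sketched as a minimal data type (a hypothetical `Event` class for exposition only; the model operates on embeddings of these fields, not on the strings themselves):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Event:
    """An event v(a0, a1, a2): a verb plus up to three arguments."""
    verb: str                 # v, the verb describing the event
    a0: Optional[str] = None  # subject
    a1: Optional[str] = None  # direct object
    a2: Optional[str] = None  # prepositional object

# "John brought Mary to the restaurant" -> bring(John, Mary, to the restaurant)
e = Event("bring", "John", "Mary", "to the restaurant")
```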
We follow the standard script induction setting (Chambers and Jurafsky, 2008;Granroth-Wilding and Clark, 2016), extracting events from a text corpus using a syntactic parser and a named entity resolver. A neural network is used to model chains of extracted events for script learning.
In particular, we model the probability of a subsequent event given a chain of events. For evaluation, we solve the multi-choice narrative cloze task: given a chain of events and a set of candidate next events, the most likely candidate is chosen as the output.

Model
The overall structure of our model is shown in Figure 3, which has three main components. First, given an event v(a 0 , a 1 , a 2 ), a representation layer is used to compose the embeddings of v, a 0 , a 1 , and a 2 into a single event vector e. Second, an LSTM is used to map a sequence of existing events e 1 , e 2 , ..., e n−1 into a sequence of hidden vectors h 1 , h 2 , ..., h n−1 , which encode the temporal order. Given a next event candidate e c , the recurrent network takes one further step from h n−1 to derive its hidden vector h c , which encodes e c . Third, h c is paired with each of h 1 , h 2 , ..., h n−1 individually, and passed to a dynamic memory network to learn a relatedness score s, which denotes the connectedness between the candidate subsequent event and the context event chain.

Event Representation
We learn vector representations of events by composing the pre-trained word embeddings of their verbs and arguments. The skip-gram model (Mikolov et al., 2013) is used to train word vectors. For arguments that consist of more than one word, we use the average of the word embeddings as the representation. OOV words are represented simply using zero vectors. For events with fewer than three arguments, such as "John fell", where v = fall, a 0 = John, a 1 = NULL, and a 2 = NULL, the NULL arguments are represented using all-zero vectors.
Denoting the embeddings of v, a 0 , a 1 , and a 2 as e(v), e(a 0 ), e(a 1 ), and e(a 2 ), respectively, the embedding of e is calculated using a tanh composition layer:

e = tanh(W v e e(v) + W 0 e e(a 0 ) + W 1 e e(a 1 ) + W 2 e e(a 2 ) + b)

Here W v e , W 0 e , W 1 e , W 2 e , and b are model parameters, which are randomly initialized and tuned during the training of the main network.
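The composition layer can be sketched as follows (a minimal numpy sketch; the dimensions and random initialization are illustrative, and in the real model the parameters are tuned by backpropagation):

```python
import numpy as np

rng = np.random.default_rng(0)
d_w, d_e = 300, 128  # word / event embedding sizes, as in the paper's setup

# Composition parameters, randomly initialized (tuned with the main network in practice).
W_v, W_0, W_1, W_2 = (rng.standard_normal((d_e, d_w)) * 0.01 for _ in range(4))
b = np.zeros(d_e)

def compose_event(e_v, e_a0, e_a1, e_a2):
    """tanh composition of verb and argument embeddings into one event vector."""
    return np.tanh(W_v @ e_v + W_0 @ e_a0 + W_1 @ e_a1 + W_2 @ e_a2 + b)

# NULL / missing arguments are all-zero vectors, as described above.
e_vec = compose_event(rng.standard_normal(d_w), rng.standard_normal(d_w),
                      np.zeros(d_w), np.zeros(d_w))
```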

Modeling Temporal Orders
Given the embeddings of the existing chain of events e 1 , e 2 , ..., e n−1 , we use a standard LSTM (Hochreiter and Schmidhuber, 1997) without coupled input and forget gates or peephole connections to model the temporal order. We obtain a sequence of hidden state vectors h 1 , h 2 , ..., h n−1 by recurrently feeding e(e 1 ), e(e 2 ), ..., e(e n−1 ) as inputs to the LSTM, where h i = LSTM(e(e i ), h i−1 ). The initial state h s and all standard LSTM parameters are randomly initialized and tuned during training. Now for each candidate next event e c , we obtain its vector representation e(e c ) in the same way as for e 1 to e n−1 . e(e c ) is then appended to the existing event chain to obtain a temporal-order-sensitive feature vector h c , by advancing the recurrent encoding process for one step from h n−1 : h c = LSTM(e(e c ), h n−1 ). With multiple next event candidates e 1 c , e 2 c , ..., e m c (m ≥ 1), m feature vectors are obtained, as shown in Figure 4, each being used as a basis for estimating the probability of the corresponding event candidate.
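The chain encoding step can be sketched as follows (a plain numpy LSTM for illustration; the gate parameterization is a textbook LSTM under stated assumptions, not the authors' exact implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
d_e, d_h = 128, 128  # event embedding and hidden sizes

# One stacked weight matrix for the input, forget, output, and cell gates.
W = rng.standard_normal((4 * d_h, d_e + d_h)) * 0.01
b = np.zeros(4 * d_h)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev):
    """One standard LSTM step: h_i = LSTM(e(e_i), h_{i-1})."""
    z = W @ np.concatenate([x, h_prev]) + b
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

# Encode an existing chain of n-1 events, then take one more step for a candidate.
chain = [rng.standard_normal(d_e) for _ in range(4)]  # e(e_1) .. e(e_{n-1})
candidate = rng.standard_normal(d_e)                  # e(e_c)
h, c = np.zeros(d_h), np.zeros(d_h)
hs = []
for e in chain:
    h, c = lstm_step(e, h, c)
    hs.append(h)                       # h_1 .. h_{n-1}
h_c, _ = lstm_step(candidate, h, c)    # order-sensitive candidate feature h_c
```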

Modeling Pairwise Event Relations
After obtaining the hidden states for events, we model event pair relations using these hidden state vectors. A straightforward approach to modeling the relation between two events is a Siamese network (Granroth-Wilding and Clark, 2016). The order-sensitive LSTM features for the existing events h 1 , h 2 , ..., h n−1 and the candidate event h c are used as event representations. Given a pair of events h i (i ∈ [1..n − 1]) and h c , the relatedness score is calculated by:

s i = tanh(W si h i + W sc h c + b s )

where W si , W sc , and b s are model parameters.
Given the relation score s i between h c and each existing event h i , the likelihood of e c given e 1 , e 2 , ..., e n−1 can be calculated as the average of s i :

s = (1 / (n − 1)) Σ i s i

Weighting existing events. The drawback of the above approach is that it treats the contribution of every event in the chain as equal. However, given a chain of existing events, some are more informative for inferring a subsequent event than others. For example, given the events "wait in queue", "get seated", and "order food", "order food" is more relevant for inferring "eat food" than the other two. Given information over the full event chain, this link can be even more evident, since the scenario is likely restaurant visiting.
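A sketch of the equal-weight pair scoring (the tanh scoring form and parameter shapes are assumptions consistent with the surrounding layers, not guaranteed to match the authors' exact equation):

```python
import numpy as np

rng = np.random.default_rng(2)
d_h = 128  # LSTM hidden size

# Scoring parameters (illustrative: vector parameters give a scalar score).
W_si = rng.standard_normal(d_h) * 0.01
W_sc = rng.standard_normal(d_h) * 0.01
b_s = 0.0

def pair_score(h_i, h_c):
    """Scalar relatedness between one existing event and the candidate."""
    return np.tanh(W_si @ h_i + W_sc @ h_c + b_s)

hs = [rng.standard_normal(d_h) for _ in range(4)]   # h_1 .. h_{n-1}
h_c = rng.standard_normal(d_h)                      # candidate feature
s = np.mean([pair_score(h_i, h_c) for h_i in hs])   # equal-weight average
```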
We use an attentional neural network to calculate the relative importance of each existing event with respect to the subsequent event candidate, using h i (i ∈ [1..n − 1]) and h c as event representations:

u i = tanh(W ei h i + W c h c + b u )
α i = exp(u i ) / Σ j exp(u j )

where α i ∈ [0, 1] is the weight of h i , and Σ i α i = 1. W ei , W c , and b u are model parameters.
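The attention step can be sketched as follows (the softmax normalization is an assumption inferred from the constraint that the weights are in [0, 1] and sum to 1):

```python
import numpy as np

rng = np.random.default_rng(4)
d_h = 128

# Attention parameters (shapes illustrative; tuned with the rest of the model).
W_ei = rng.standard_normal(d_h) * 0.1
W_c = rng.standard_normal(d_h) * 0.1
b_u = 0.0

def attention_weights(hs, h_c):
    """Softmax-normalized importance of each existing event w.r.t. the candidate."""
    u = np.array([np.tanh(W_ei @ h_i + W_c @ h_c + b_u) for h_i in hs])
    a = np.exp(u - u.max())   # numerically stable softmax
    return a / a.sum()

hs = [rng.standard_normal(d_h) for _ in range(4)]
h_c = rng.standard_normal(d_h)
alpha = attention_weights(hs, h_c)   # alpha_i in [0, 1], sums to 1
```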
After obtaining the weight α i of each existing event h i , the relatedness of e c with the existing events can be calculated as the weighted sum:

s = Σ i α i s i

Multi-layer attention using a deep memory network. Memory networks (Weston et al., 2014; Mikolov et al., 2014) have been used to explore deep semantic information in tasks such as question answering (Sukhbaatar et al., 2015; Kumar et al., 2016) and reading comprehension (Hermann et al., 2015). Our task is analogous to such semantic tasks in the sense that deep semantic information can be necessary for making the most rational inference. Hence, we are motivated to use a deep memory network model to refine event weight and event relation calculation by recurrently modeling more abstract representations of the scenario. Different from previous research, we use the memory network to model the event chain, refining the attention mechanism used to explore the pairwise relations between events. The memory model consists of multiple dynamic computational layers (hops). For the first layer (hop 1), the weights α for the existing events e 1 , e 2 , ..., e n−1 can be calculated using the same attention mechanism as Eq. 4 and Eq. 5. Given the weights α, we build a consolidated representation of the context event chain e 1 , e 2 , ..., e n−1 as a weighted sum of h 1 , h 2 , ..., h n−1 :

h e = Σ i α i h i

The event candidate h c and the new representation h e of the existing chain can then be integrated to deduce a deeper representation of the full event chain hypothesis, which is passed to the next layer (hop 2) and denoted as v. v contains deeper semantic information compared with h c , which encodes the temporal order of the event chain [h 1 , h 2 , ..., h n−1 , h c ] without differentiating the weights of the events.

[Figure 5: Memory network at hop t. h i is the hidden variable of the existing event chain, v t is the semantic representation between the context events and the candidate event, α t is the weight of the context events, and g is the gated recurrent network of Eq. 10.]
As a result, in the next hop, better event weights can potentially be deduced by using v instead of h c in the calculation of attention:

u t i = tanh(W ei h i + W c v t + b u )
α t i = exp(u t i ) / Σ j exp(u t j )

In the same way, we stack multiple hops, repeating these steps so that more abstract evidence can be extracted according to the chain of existing events. The process is performed recurrently, taking h c as an initial scenario representation v 0 , then repeatedly calculating h t e given h 1 , h 2 , ..., h n−1 and v t , and using h t e and v t to find a deeper scenario representation v t+1 . Following Chung et al. (2014) and Tran et al. (2016), a gated recurrent network is used to this end:

v t+1 = g(h t e , v t )

At any step, if the value of ‖v t+1 − v t ‖ is less than the threshold µ, we consider that the process has converged. Figure 5 shows an overview of the memory network at hop t.
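The multi-hop process can be sketched end-to-end as follows (a hypothetical sketch: the attention form and the GRU-style gate g are assumptions consistent with the description above, not the authors' exact equations):

```python
import numpy as np

rng = np.random.default_rng(3)
d_h = 128

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Attention parameters (Eq. 4-5 analogue) and a GRU-style update g, all illustrative.
W_ei = rng.standard_normal(d_h) * 0.01
W_c = rng.standard_normal(d_h) * 0.01
b_u = 0.0
Wz, Wr, Wn = (rng.standard_normal((d_h, 2 * d_h)) * 0.01 for _ in range(3))

def attend(hs, v):
    """Attention over existing events, conditioned on the scenario vector v."""
    u = np.array([np.tanh(W_ei @ h_i + W_c @ v + b_u) for h_i in hs])
    a = np.exp(u - u.max())
    return a / a.sum()

def g(h_e, v):
    """GRU-style gated update combining the chain summary h_e with the scenario v."""
    x = np.concatenate([h_e, v])
    z, r = sigmoid(Wz @ x), sigmoid(Wr @ x)
    n = np.tanh(Wn @ np.concatenate([h_e, r * v]))
    return (1 - z) * v + z * n

def multi_hop(hs, h_c, mu=0.1, max_hops=10):
    v = h_c                                          # hop 0: candidate feature
    for _ in range(max_hops):
        alpha = attend(hs, v)
        h_e = sum(a * h for a, h in zip(alpha, hs))  # weighted chain summary
        v_next = g(h_e, v)
        if np.linalg.norm(v_next - v) < mu:          # convergence threshold mu
            return v_next, alpha
        v = v_next
    return v, alpha

hs = [rng.standard_normal(d_h) for _ in range(4)]
v, alpha = multi_hop(hs, rng.standard_normal(d_h))
```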

Training
Given a set of event chains, each with a gold-standard subsequent event and a number of non-subsequent events, our training objective is to minimize the cross-entropy loss between the gold subsequent event and the set of non-subsequent events:

L(Θ) = − Σ i (y i log σ(s i ) + (1 − y i ) log(1 − σ(s i ))) + λ ||Θ|| 2

where s i is the relation score of candidate i, σ is the sigmoid function, y i is the label of the candidate (y i = 1 for a positive sample, and y i = 0 for a negative sample), Θ is the set of model parameters, and λ is a parameter for L2 regularization. We apply online training, where model parameters are optimized using AdaGrad (Duchi et al., 2011). We train word embeddings using the skip-gram algorithm (Mikolov et al., 2013).
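A sketch of the loss computation (the sigmoid mapping from relation scores to probabilities is an assumption, since raw relation scores are not valid probabilities; the regularization form is illustrative):

```python
import numpy as np

def chain_loss(scores, labels, params, lam=1e-8):
    """Binary cross-entropy over candidate relation scores, plus L2 regularization."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(scores, dtype=float)))  # sigmoid (assumed)
    y = np.asarray(labels, dtype=float)
    ce = -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    l2 = lam * sum(float(np.sum(w ** 2)) for w in params)       # lambda * ||Theta||^2
    return ce + l2

# One gold subsequent event (label 1) and four negatives (label 0):
loss = chain_loss([2.0, -1.0, -0.5, -2.0, -1.5], [1, 0, 0, 0, 0],
                  params=[np.zeros(3)], lam=1e-8)
```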

Datasets
Following Granroth-Wilding and Clark (2016), we extract events from the NYT portion of the Gigaword corpus (Graff et al., 2003). The C&C tools (Curran et al., 2007) are used for POS tagging and dependency parsing, and OpenNLP for phrase structure parsing and coreference resolution. The training set consists of 1,500,000 event chains. We follow Granroth-Wilding and Clark (2016) and use 10,000 event chains as the test set, and 1,000 event chains for development. There are 5 candidate output events for each input chain, given by Granroth-Wilding and Clark (2016). This dataset is referred to as G&C16.
We also adapt Chambers and Jurafsky (2008)'s dataset to the multiple-choice setting, and use it as the second benchmark. The dataset contains 69 documents, with 346 multiple-choice event chain samples. We randomly sample 4 negative subsequent events for each event chain to form the multiple-choice candidates. This dataset is referred to as C&J08. For both datasets, accuracy (Acc.) of choosing the correct subsequent event is used to measure performance.

Hyper-parameters
There are several important hyper-parameters in our models, and we tune their values using the development dataset. We set the regularization weight λ = 10 −8 and the initial learning rate to 0.01. The size of word vectors is set to 300, and the size of hidden vectors in the LSTM to 128. In order to avoid over-fitting, dropout (Hinton et al., 2012) is applied to word embeddings with a ratio of 0.2. The neighbor similarity threshold η is set to 0.25, and the convergence threshold µ of the memory network to 0.1.

Development Experiments
We conduct a set of development experiments on the G&C16 development set to study the influence of event argument representations and network configurations of the proposed MemNet model.

Influence of Event Structure
Existing literature has discussed various structures to denote events, such as v(a 0 , a 1 ) and v(a 0 , a 1 , a 2 ). We investigate the influence of integrating the argument values of the subject a 0 , object a 1 , and prepositional object a 2 by performing ablation experiments on the development data. The results are shown in Table 1, where the system using all arguments gives a 54.36% accuracy. By removing a 2 , which exists in 17.6% of the events in our development data, the accuracy drops to 54.02%. In contrast, by removing a 0 or a 1 , which exist in 87.6% and 64.6% of the events in the development data, respectively, the accuracies drop to 53.43% and 53.57%, respectively, which demonstrates the relative importance of a 0 (i.e., the subject) and a 1 (i.e., the object) for event modelling. While most previous work (Chambers and Jurafsky, 2008; Balasubramanian et al., 2013; Pichotta and Mooney, 2014) modelled only a 0 and a 1 , recent work (Pichotta and Mooney, 2016; Granroth-Wilding and Clark, 2016) modelled a 2 also. By removing both a 1 and a 2 , the accuracy drops further to 53.32%. Interestingly, by removing the verb while keeping only the arguments, the accuracy drops to 42.63%. While this demonstrates the central value of the verb in denoting an event, it also suggests that the arguments themselves play a useful role in inferring the stereotypical scenario.

Influence of Network Configurations
We study the influence of various network configurations by performing ablation experiments, as shown in Table 2. MemNet is the full model of this paper; -LSTM denotes ablation of the LSTM layer, using e(e 1 ), e(e 2 ), ..., e(e n−1 ) instead of h 1 , h 2 , ..., h n−1 to represent events; -Hop denotes ablation of the dynamic memory network, using only the attention mechanism to calculate the weight of each existing event; -Attention denotes ablation of the attention mechanism, using the same weight for each existing event when inferring e c . The model "-Attention, -LSTM" is hence similar to the method of Granroth-Wilding and Clark (2016), although we use a different way of deriving event embeddings. The model "LSTM-only" is a baseline that directly uses the LSTM hidden vector h n−1 to predict the next event, which is similar to the method of Pichotta and Mooney (2016).
Influence of Temporal Order. By comparing "MemNet" with "-LSTM", and "-Attention" with "-Attention, -LSTM", one can find that temporal order information over the whole event chain has a significant influence on the results (p < 0.01 using t-test).
On the other hand, using the LSTM to directly predict the subsequent event ("LSTM-only") does not give better accuracies compared to modeling event pairs ("-Attention, -LSTM"). This confirms our intuition that strong-order modelling and event-pair modelling each have their own strength.

Influence of Attention.
Comparison between "-Attention" and "-Hop", and between "-Attention, -LSTM" and "-Hop, -LSTM", shows that giving different weights to different events does lead to improved results. Our analysis in Section 4.3 gives more intuition for this observation. Finally, comparison between "-Hop" and "MemNet", and between "-Hop, -LSTM" and "-LSTM", shows that a multi-hop deep memory network can indeed enhance the single-level attention model by offering more effective semantic representations of the scenarios.

Final Results
Table 3 shows the final results on the G&C16 and C&J08 datasets, respectively. We compare the results of our final model with the following baselines:
• PMI is the co-occurrence based model of Chambers and Jurafsky (2008), who calculate event pair relations based on Pointwise Mutual Information (PMI), scoring each candidate event e c by the sum of PMI scores between the given events e 0 , e 1 , ..., e n−1 and the candidate.
• Bigram is the counting-based model of Jans et al. (2012), calculating event pair relations based on skip bigram probabilities, trained using maximum likelihood estimation.
• Event-Comp is the neural event relation model proposed by Granroth-Wilding and Clark (2016), which learns event representations and calculates pairwise event scores using a Siamese network.
• RNN is the method of Pichotta and Mooney (2016), who model event chains by directly using h c in Section 4.2 to predict the output, rather than taking them as features for event pair relation modeling.
• MemNet is the proposed deep memory network model.
Our reimplementation of PMI and Bigram follows Granroth-Wilding and Clark (2016). It can be seen from the table that the statistical counting-based models PMI and Bigram significantly underperform the neural network models Event-Comp, RNN, and MemNet, which is largely due to their sparsity and lack of semantic representation power. Under our event representation, Bigram does not significantly outperform PMI either, despite considering the order of event pairs. This is likely due to the sparsity of events when all arguments are considered.
Direct comparison between Event-Comp and RNN shows that the event-pair model gives comparable results to the strong-order LSTM model. Although Granroth-Wilding and Clark (2016) and Pichotta and Mooney (2016) both compared with statistical baselines, they did not make direct comparisons between their methods, which represent two different approaches to the task. Our results show that they each have their unique advantages, which confirms our intuition in the introduction. By considering both pairwise relations and chain temporal orders, our method significantly outperforms both Event-Comp and RNN (p < 0.01 using t-test), giving the best reported results on both datasets.

Conclusion
We proposed a dynamic memory network to integrate chain order information into event relation measuring, calculating event pair relations by representing events in a chain using LSTM hidden states, which encode temporal orders, and using a dynamic memory model to automatically induce a weight for each existing event. Standard evaluation showed that our method significantly outperforms state-of-the-art event pair models and event chain models, giving the best results reported so far.