Multi-Relational Script Learning for Discourse Relations

Modeling script knowledge can be useful for a wide range of NLP tasks. Current statistical script learning approaches embed events such that their relationships are indicated by their similarity in the embedding space. While intuitive, these approaches fall short of representing the nuanced relations needed for downstream tasks. In this paper, we propose viewing event embedding learning as a multi-relational problem, which allows us to capture different aspects of event pairs. We model a rich set of event relations, such as Cause and Contrast, derived from the Penn Discourse Tree Bank. We evaluate our model on three types of tasks: the popular Multiple-Choice Narrative Cloze and its variants, several multi-relational prediction tasks, and a related downstream task, implicit discourse sense classification.


Introduction
Representing world knowledge that can be used for commonsense reasoning is a long-standing AI goal. Scripts (Schank and Abelson, 1977) are structured knowledge representations capturing the relationships between prototypical event sequences and their participants in a given scenario. For example, given the event "John shot Jim with a gun", we can infer that "he got arrested by police" is more probable than "he fell asleep".
In recent years, the problem of extracting script knowledge from text has attracted significant attention. Early works (Chambers and Jurafsky, 2008) focused on symbolic event representations and used Pointwise Mutual Information (PMI) between events to capture their relationships. Recent works (Pichotta and Mooney, 2016a; Granroth-Wilding and Clark, 2016; Lee and Goldwasser, 2018; Li et al., 2018) represent events using dense vectors, based on event co-occurrence, and use vector similarity over their embeddings to measure their relationship.
Our main observation in this paper is that while models for learning script knowledge have improved significantly over the last decade, they can essentially represent only a single event relationship, co-occurrence. That is, events appearing in similar contexts tend to have similar representations. Although this idea works well for many NLP tasks, it is too coarse for modeling commonsense, which should account for fine-grained relationships. To better understand this, consider the example described in Figure 1. Given the first event, corresponding to the sentence "Jenny went to her favorite restaurant." (Step 1), any of the candidate events in Step 2 would be highly related, and thus similar, to the input event. That is, "It was raining outside" and "She was very hungry" are both possible NEXT events. Using event similarity alone is too coarse to support many relevant inferences. However, if the relation between the events is given, more clues can be applied to support reliable inferences. In Figure 1, given that Step 2 is a Reason for Step 1, analogous to asking the question "Why did Jenny go there?", the event "She was very hungry" is clearly a more reasonable choice. In other words, capturing the Reason for an event should produce a different set of relevant events than capturing Temporal (next) events. To help prioritize among the diverse types of event relations and to provide a framework for this discussion, we focus on a set of discourse relations introduced by the Penn Discourse Tree Bank (PDTB) (Prasad et al., 2007). Traditional script learning models would fall short of making the inferences here. For example, the last inference step in Figure 1 asks for an event that Contrasts with the previous step. Based on human commonsense, we can identify that the most probable scenario is "She ordered a meal but she liked the food better last time."
Modeling the relation type helps us capture different expectations about subsequent events. We use the fact that these relations are often indicated by discourse markers (e.g., "but", capturing the contrasting relation) to extract supervision for learning these relations. Our goal in this paper is to support such inferences.
[Figure 1: example inference prompts, "Why did Jenny go there?", "What did Jenny do next?", "Jenny was disappointed because…"]
We introduce a multi-relational event embedding approach, which generalizes the notion of event embedding by allowing it to capture multiple fine-grained relationships. Our approach builds on recent translation-based embeddings (Bordes et al., 2013; Lin et al., 2015), originally introduced in the context of knowledge graph completion. We adapt these methods to textual inputs and propose a compositional neural network for capturing the event's internal linguistic structure, while using the translation-based embedding objective to capture different relationships between events. We include 11 relation types, capturing the progression of the narrative: COREF NEXT, the next event in the coreference chain; NEXT, the next event that occurs subsequently in text; and 9 discourse relations, collectively referred to as DISCOURSE NEXT.
We evaluate our model in three settings. In the first, we evaluate it on a common benchmark, the Multiple-Choice Narrative Cloze (MCNC) task (Granroth-Wilding and Clark, 2016), and its sequential variants proposed by Lee and Goldwasser (2018). We show that we can outperform previously published work by a large margin. In the second setting, we further examine our model's characteristics on three intrinsic tasks. In the last setting, we evaluate on a challenging downstream task, implicit discourse sense classification, exemplifying the model's applicability.

Related Work
Statistical Script Learning was popularized by Chambers and Jurafsky (2008), who framed the problem as an unsupervised learning problem, using a PMI-based learning model to approximate a conditional probability of event occurrence. Recent approaches build on representation learning techniques, learning event embeddings with neural networks. Granroth-Wilding and Clark (2016) utilized Skip-Gram (Mikolov et al., 2013) and an event compositional neural network to adjust event representations. Pichotta and Mooney (2016b; 2016a) applied an LSTM Recurrent Neural Network (RNN), coupled with Beam Search, to model event sequences and their representations. Weber et al. (2017) used three-dimensional tensor-based networks to construct the event representations. Lee and Goldwasser (2018) trained the event embedding with additional features in a hierarchical architecture. Li et al. (2018) constructed an event graph and utilized its network information to make script event predictions. In this paper we combine GRU (Chung et al., 2014), for encoding fine-grained argument information, with a compositional network to generate event representations. GRU was shown to be a competitive alternative to LSTM while requiring fewer parameters (Kiros et al., 2015; Hochreiter and Schmidhuber, 1997).
Modeling multi-relational data was originally explored for Knowledge Graph Completion, typically focusing on a family of translation-based embedding models, which view relations as translations in the vector space. For example, TransE (Bordes et al., 2013) captures the relation between $h$, $t$, and $r$ (the embeddings of arg0, arg1, and the relation) by minimizing the distance between $h + r$ and $t$. TransH (Wang et al., 2014) and TransR (Lin et al., 2015) project the entities into relation-specific spaces. Recent models address issues such as maintaining structures (Xie et al., 2016; Yoon et al., 2016) and capturing richer interactions (Nickel et al., 2016). In this paper, we adapt TransE and TransR for narrative script learning, generalizing relation embedding to commonsense inference.
Several recent works looked at modeling specific relationships between events and extracting commonsense knowledge. Zhao et al. (2017) explored modeling cause-effect relations between events;  focused on If-Then relations and showed that their joint multi-task model outperforms the models trained in isolation, based on human evaluations. Peng and Roth (2016) utilized discourse markers to extract relations between semantic frames and modeled them with prevalent language models. Event2Mind (Rashkin et al., 2018) created a dataset capturing the relationship between an event description and its participants' intent and emotional reaction. This idea is related to our work, as the intent and reaction can correspond to Reason and Result discourse relations in our case. Our goal in this paper is to present a relational generalization over such relationships using a shared embedding space.
The Narrative Cloze (NC) task (Chambers and Jurafsky, 2008) was introduced to evaluate statistical script models by removing an event from a chain and observing the ranking of the correct answer over the entire event vocabulary, given the rest of the chain. However, when complex event structures were considered, e.g., multi-argument events (Pichotta and Mooney, 2014), the large vocabulary size introduced both computational issues and ambiguity into the evaluation. As a result, Granroth-Wilding and Clark (2016) proposed a multiple-choice variation, called MCNC, which simplifies the evaluation process and reduces its computational burden. A similar multiple-choice adaptation can be found in recent works, such as Story Cloze (Mostafazadeh et al., 2017) and SWAG (Zellers et al., 2018). In this paper, we evaluate our models on MCNC and on two recent variants (Lee and Goldwasser, 2018) that turn MCNC into a sequential inference task. We also introduce relation-specific evaluations capturing the ability of our model to account for nuanced relations beyond co-occurrence.

Model
We propose a learning framework, which accounts for the internal predicate-argument structure of events, tuning it to respect different relation types.
Overview Our framework has two preprocessing phases: Event Extraction and Relational Triplet Extraction. In Event Extraction, we aim to identify events from free-form text. The process builds on a dependency parser and coreference resolution. Once events are extracted, we address their relations, specifically three types: (1) events with coreferent entities, (2) events located near each other, and, more importantly, (3) events connected with discourse relations.
The output of the preprocessing phases is a set of relation triplets $(e_h, e_t, r)$, where $e_h$ and $e_t$ are the head and tail events, and $r$ is their relation type. We then feed them to a neural network for learning event and relation embeddings. The network objective is an energy function $f(e_h, e_t, r)$, which can be used to approximate the conditional probabilities $p(e_t \mid e_h, r)$ or $p(r \mid e_t, e_h)$. This objective captures commonsense knowledge expressed in event relations and embeds it in a vector space, which can be utilized in downstream tasks. Two model variants are proposed in this paper. The first model, EventTransE, assumes that all the relations are in the same embedding space and jointly learns representations for events and relations. It works well in some cases, though it might not be expressive enough in others. The second model, EventTransR, addresses this issue by introducing relation-specific parameters, which project events into relation-specific spaces when measuring their relatedness.

Event Extraction
We construct a preprocessing pipeline to extract events and relations over a large text collection. Each event $e$ consists of three components: predicate (pred(e)), subject (subj(e)), and object (obj(e)). Due to computational considerations we restrict the event representation to two arguments. We use a special empty argument representation, NONE, for events that have fewer arguments. To obtain the event representation from text, we first run a dependency parser and coreference resolution to acquire the needed information.
Events are extracted by connecting entity mentions on the coreference chain with their corresponding predicate and additional argument, based on the dependency tree. E.g., given, "Jenny went into her favorite restaurant," we extract (go into, jenny, her favorite restaurant).
Unlike the previous works (Lee and Goldwasser, 2018;Granroth-Wilding and Clark, 2016;Pichotta and Mooney, 2016a), which only consider headwords of entity mentions, we use complete mention spans. In our running example, we consider the object as "her favorite restaurant", rather than just "restaurant". This allows the models to capture the nuanced information relevant for many commonsense inferences, such as the "favorite" here. Other preprocessing steps follow the previous works and are detailed in the appendix.

Relational Triplet Extraction
Relations are expressed as triplets $(e_h, e_t, r)$, where $r$ is the relation type, and $e_h$ and $e_t$ are events that have an internal structure of (pred(e), subj(e), obj(e)). We consider 11 relation types in this paper: COREF NEXT, NEXT, and 9 discourse relations, collectively referred to as DISCOURSE NEXT. COREF NEXT captures sequential relationships between events on the same coreference chain. The NEXT relation is defined between event pairs that co-occur in a fixed-size context window of width $w_{context}$. It aims to capture related events that do not share arguments. For example, in "The forest was on fire. Trees burned.", the two events do not share arguments, but they often co-occur, and thus are related. Previous works on script learning (Pichotta and Mooney, 2016a; Granroth-Wilding and Clark, 2016; Lee and Goldwasser, 2018; Li et al., 2018) use either COREF NEXT or NEXT independently, and thus fail to leverage the shared information.
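As a minimal sketch of how NEXT and COREF NEXT triplets could be enumerated, assume each document is an ordered list of event tuples and each coreference chain is a list of indices into that list. The function names, the tuple layout, the `"COREF_NEXT"` label string, and the choice to pair each event with the following `w_context` events are illustrative assumptions, not the authors' pipeline.

```python
from typing import List, Tuple

Event = Tuple[str, str, str]           # (predicate, subject, object)
Triplet = Tuple[Event, Event, str]     # (head event, tail event, relation)


def next_triplets(events: List[Event], w_context: int) -> List[Triplet]:
    """Pair each event with the events that follow it within the window."""
    triplets = []
    for i, head in enumerate(events):
        for tail in events[i + 1 : i + 1 + w_context]:
            triplets.append((head, tail, "NEXT"))
    return triplets


def coref_next_triplets(events: List[Event], chain: List[int]) -> List[Triplet]:
    """Link consecutive events on one coreference chain (indices into events)."""
    ordered = sorted(chain)
    return [
        (events[a], events[b], "COREF_NEXT")
        for a, b in zip(ordered, ordered[1:])
    ]


# Toy document, loosely based on the running example.
events = [
    ("go into", "jenny", "her favorite restaurant"),
    ("rain", "it", "NONE"),
    ("order", "she", "a meal"),
]
print(next_triplets(events, w_context=1))
print(coref_next_triplets(events, chain=[0, 2]))
```

Because the same pair of events can appear in both outputs, a pair may carry several relation labels at once, which is what lets a shared event representation be trained across relations.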
For DISCOURSE NEXT, 9 discourse relations, taken from PDTB, are denoted in Table 1. These relations correspond to commonsense judgments. For example, we can do causal inference with the Reason and Result; or we can identify the juxtaposition between events by utilizing Contrast.
The discourse relations can be represented with a relation type and a pair of argument spans. For example, "Jenny went to a restaurant, because she was hungry" has the relation Reason, and the spans are the two clauses (omitting the connective "because"). Since training event embeddings requires significantly more data than is annotated in the PDTB corpus, we approximate the annotation by building a rule-based annotator. We first identify explicit discourse connectives, such as "because," and assume that the surrounding clauses are their argument spans. To determine the relation type, we map the connectives to their most probable type based on the PDTB data. To mitigate the noise, we only take connectives that are highly indicative of their type (85% of connective occurrences are of that type). Note that in our setup a given pair of events might have up to three relations annotated: a discourse relation, NEXT, and COREF NEXT. We create negative examples by corrupting the positive triplets, randomly replacing $e_h$, $e_t$, or $r$ with another event or relation. For each positive triplet we sample one negative triplet. While our weakly supervised relation extraction is noisy, we demonstrate empirically its ability to capture these relations. Figure 2 shows the architecture of our models.
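The two rules described above, keeping only highly indicative connectives and corrupting one slot of a positive triplet, can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the toy counts are invented rather than taken from PDTB, the 85% figure is read as a minimum ratio, and the corruption step is assumed to always pick a replacement that differs from the original.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def indicative_connectives(
    annotated: List[Tuple[str, str]], min_ratio: float = 0.85
) -> Dict[str, str]:
    """Map each connective to its majority sense if that sense is frequent enough."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for connective, sense in annotated:
        counts[connective][sense] += 1
    mapping = {}
    for connective, senses in counts.items():
        sense, freq = senses.most_common(1)[0]
        if freq / sum(senses.values()) >= min_ratio:
            mapping[connective] = sense
    return mapping


def corrupt(triplet, events, relations, rng):
    """Replace one randomly chosen slot of a positive triplet."""
    head, tail, rel = triplet
    slot = rng.choice(["head", "tail", "rel"])
    if slot == "head":
        head = rng.choice([e for e in events if e != head])
    elif slot == "tail":
        tail = rng.choice([e for e in events if e != tail])
    else:
        rel = rng.choice([r for r in relations if r != rel])
    return head, tail, rel


# Toy annotations (invented for illustration, not PDTB statistics).
toy = (
    [("because", "Reason")] * 9 + [("because", "Result")]
    + [("while", "Contrast")] * 5 + [("while", "Asynchronous")] * 5
)
mapping = indicative_connectives(toy)
print(mapping)  # "while" is dropped because no sense reaches the ratio

rng = random.Random(0)
positive = ("e1", "e2", "Reason")
negative = corrupt(positive, ["e1", "e2", "e3"], ["Reason", "Contrast", "NEXT"], rng)
print(negative)
```

Dropping ambiguous connectives trades coverage for label precision, which matters here because the resulting labels are the only supervision for the discourse relations.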

Compositional Event Representation
Each event $e$ has a raw representation (pred(e), subj(e), obj(e)). The predicate pred(e) is embedded with a lookup table, a matrix of size $|P| \times d_a$, where $P$ denotes the predicate vocabulary. subj(e) and obj(e) are encoded with two separate Bi-GRUs (Chung et al., 2014), which we call the subject encoder and the object encoder, as shown in the figure. The output of each encoder is $d_a$-dimensional. In each GRU, $x_t$ is the input token at timestep $t$; $W^{(z)}, U^{(z)}, W^{(r)}, U^{(r)}, W, U$ are parameters to be trained; $h_t \in \mathbb{R}^{d_r/2}$ is the hidden memory at timestep $t$; and $z_t$ and $r_t$ are the update and reset gates. The final argument representation is the concatenation of the GRU hidden representations.
The encoded representations of the event components are then fed into an Event Composition network. The network is fully connected and has one hidden layer; its input $x_e$ is the concatenation of the encoded predicate, subject, and object, and its output $e \in \mathbb{R}^{d_r}$ is the event embedding.
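For reference, one common bias-free GRU formulation that is consistent with the parameters named above is sketched here, following Chung et al. (2014). The gate convention (which of $z_t$ and $1 - z_t$ weights the previous state) and the placement of the reset gate are assumptions and may differ from the authors' implementation.

```latex
\begin{aligned}
z_t &= \sigma\left(W^{(z)} x_t + U^{(z)} h_{t-1}\right) \\
r_t &= \sigma\left(W^{(r)} x_t + U^{(r)} h_{t-1}\right) \\
\tilde{h}_t &= \tanh\left(W x_t + U \left(r_t \odot h_{t-1}\right)\right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

Here $\sigma$ is the logistic sigmoid, $\odot$ is element-wise multiplication, and $\tilde{h}_t$ is the candidate state. In a bidirectional encoder, one such recurrence runs over the argument tokens in each direction.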
For the relations, we embed them using another embedding lookup table. The table size is $n_{rel} \times d_r$, where $n_{rel}$ is the number of relation types. In our case, $n_{rel} = 11$.

Model: EventTransE
EventTransE is an event embedding model inspired by TransE (Bordes et al., 2013). The idea is to embed nodes and their relations in the same vector space so that the distance between nodes reflects their relations; the original paper calls this a translating operation. Based on this idea, we explore a new possibility of learning event embeddings that can make inferences conditioned on different relations. We connect the TransE objective to the outputs of the compositional network described above, which can be formulated as follows:
$$f_{transe}(e_h, e_t, r) = \lVert e_h + r - e_t \rVert \quad (1)$$
where $e_h, e_t \in \mathbb{R}^{d_r}$ are the embeddings from the Event Composition network and $r \in \mathbb{R}^{d_r}$ is the relation embedding. Note that Equation 1 is a dissimilarity measure: lower scores mean that the two given events are more strongly related.
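To make the translation idea concrete, the sketch below scores two candidate tail events against one head event and one relation. It is a toy illustration: the Euclidean norm, the 2-dimensional vectors, and the variable names are assumptions made for this example.

```python
import math
from typing import Sequence


def f_transe(e_h: Sequence[float], e_t: Sequence[float], r: Sequence[float]) -> float:
    """Dissimilarity ||e_h + r - e_t|| (Euclidean norm assumed here)."""
    return math.sqrt(sum((h + x - t) ** 2 for h, x, t in zip(e_h, r, e_t)))


# Toy 2-dimensional embeddings, invented for illustration.
went_to_restaurant = [0.0, 1.0]
was_hungry = [1.0, 1.0]
fell_asleep = [-2.0, 3.0]
reason = [1.0, 0.0]

score_hungry = f_transe(went_to_restaurant, was_hungry, reason)
score_asleep = f_transe(went_to_restaurant, fell_asleep, reason)
print(score_hungry, score_asleep)  # the lower score marks the better-fitting tail
```

In this toy space the head translated by the Reason vector lands exactly on "was hungry", so its dissimilarity is zero, while "fell asleep" stays far away.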

Model: EventTransR
A known issue of EventTransE is its limited ability to deal with reflexive, 1-to-N, N-to-1, or N-to-N relations (Wang et al., 2014). Consider a simple example illustrating the problem: given Equation (1), it is possible to learn a zero relation vector $r$ and two arbitrary but identical event representations $e_h$ and $e_t$, which minimize the loss. EventTransR is proposed to address these issues by separating the event and relation spaces, as in TransR (Lin et al., 2015). It introduces relation-specific parameters to model the interactions between the spaces. EventTransR is defined as follows:
$$f_{transr}(e_h, e_t, r) = \lVert e_h M_r + r - e_t M_r \rVert \quad (2)$$
where $r \in \mathbb{R}^{d_r}$ and $e_h, e_t \in \mathbb{R}^{d_e}$ are the input embeddings, treated as row vectors, and $M_r \in \mathbb{R}^{d_e \times d_r}$ is the relation-specific projection matrix.
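The relation-specific projection can be sketched in a few lines of plain Python. The row-vector convention $e M_r$ follows TransR; the Euclidean norm, the toy dimensions ($d_e = 3$, $d_r = 2$), and the helper names are assumptions made for illustration.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]  # shape d_e x d_r


def project(e: Sequence[float], m_r: Matrix) -> List[float]:
    """Row-vector product e M_r, mapping an event from R^{d_e} to R^{d_r}."""
    d_r = len(m_r[0])
    return [sum(e[i] * m_r[i][j] for i in range(len(e))) for j in range(d_r)]


def f_transr(e_h, e_t, r, m_r) -> float:
    """Dissimilarity ||e_h M_r + r - e_t M_r|| (Euclidean norm assumed)."""
    h_r, t_r = project(e_h, m_r), project(e_t, m_r)
    return math.sqrt(sum((h + x - t) ** 2 for h, x, t in zip(h_r, r, t_r)))


# Toy values: this M_r simply drops the third event coordinate.
m_reason = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
e_h = [0.0, 1.0, 5.0]
e_t = [1.0, 1.0, -4.0]
r = [1.0, 0.0]
print(f_transr(e_h, e_t, r, m_reason))
```

The two events differ strongly in their third coordinate, yet the projection for this relation ignores it, so the pair still scores as a perfect match. A different relation could use a matrix that keeps that coordinate, which is the extra expressiveness the relation-specific parameters buy.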

Training Objective
The objective is the margin-based ranking loss:
$$L = \sum_{(e_h, e_t, r) \in T} \; \sum_{(e'_h, e'_t, r') \in T^*} \max\left(0,\; \delta + f(e_h, e_t, r) - f(e'_h, e'_t, r')\right)$$
where $T$ is the set of positive relational triplets; $T^*$ is the set of corrupted relational triplets; $\delta$ is the margin; and $f \in \{f_{transe}, f_{transr}\}$. At test time, we can leverage the dissimilarity measures to either predict the tail event given the head event and relation, or predict the relation given the head and tail events:
$$\hat{e}_t = \arg\min_{e^* \in E} f(e_h, e^*, r); \qquad \hat{r} = \arg\min_{r^* \in R} f(e_h, e_t, r^*)$$
where $E$ and $R$ are the event and relation vocabularies.
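The loss and the two test-time predictions can be sketched as below. The stand-in score `l1_score`, the toy embeddings, and the pairing of each positive triplet with exactly one negative are assumptions for illustration; any dissimilarity function with the same signature could be plugged in.

```python
from typing import Callable, Dict, Sequence

Vec = Sequence[float]
Score = Callable[[Vec, Vec, Vec], float]


def l1_score(e_h: Vec, e_t: Vec, r: Vec) -> float:
    """Stand-in dissimilarity ||e_h + r - e_t||_1 (the norm is an assumption)."""
    return sum(abs(h + x - t) for h, x, t in zip(e_h, r, e_t))


def margin_ranking_loss(pairs, f: Score, delta: float) -> float:
    """Sum of max(0, delta + f(positive) - f(negative)) over paired triplets."""
    return sum(max(0.0, delta + f(*pos) - f(*neg)) for pos, neg in pairs)


def predict_tail(e_h: Vec, r: Vec, candidates: Dict[str, Vec], f: Score) -> str:
    """Return the candidate event with the lowest dissimilarity."""
    return min(candidates, key=lambda name: f(e_h, candidates[name], r))


def predict_relation(e_h: Vec, e_t: Vec, relations: Dict[str, Vec], f: Score) -> str:
    """Return the relation with the lowest dissimilarity."""
    return min(relations, key=lambda name: f(e_h, e_t, relations[name]))


# Toy embeddings, invented for illustration.
events = {"hungry": [1.0, 1.0], "asleep": [-2.0, 3.0]}
relations = {"Reason": [1.0, 0.0], "Contrast": [0.0, -1.0]}
head = [0.0, 1.0]

pos = (head, events["hungry"], relations["Reason"])
neg = (head, events["asleep"], relations["Reason"])
print(margin_ranking_loss([(pos, neg)], l1_score, delta=1.0))
print(predict_tail(head, relations["Reason"], events, l1_score))
print(predict_relation(head, events["hungry"], relations, l1_score))
```

The loss is zero once every positive triplet scores at least `delta` lower than its corrupted counterpart, so training only pushes on pairs that violate the margin.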

Experiments
We divide our experimental evaluation into three parts. The first focuses on comparing our models with previous work on several common script learning evaluation tasks. The second evaluates our model's ability to capture different relation types between events. In the third, we apply our models to a related downstream task, implicit discourse sense classification, and achieve competitive results by combining our event embeddings with ELMo (Peters et al., 2018), a contextualized word embedding model. We provide additional qualitative analysis, showing inferences made by our model, in the appendix. For training, we use the New York Times (NYT) section of the English Gigaword (Parker et al., 2011). It contains 2M newswire articles, which we split into train/dev/test sets, replicating the setup given by Granroth-Wilding and Clark (2016). 500M triplets are extracted from the training set. All the experimental results are averaged over 5 runs.
We leave the details of hyperparameter tuning to the appendix. The source code and pre-trained models are publicly available.

Multiple Choice Narrative Cloze Tasks
We begin by evaluating our model on three event representation tasks: Multiple-Choice Narrative Cloze (MCNC), Multiple-Choice Narrative Sequence (MCNS), and Multiple-Choice Narrative Explanation (MCNE). MCNC, proposed by Granroth-Wilding and Clark (2016), measures a script learning model's ability to predict a missing event, given its context, in a multiple-choice setting. This evaluation task is not perfect, as automatic extraction tools introduce noise, but the noise is not so common as to invalidate the results, and the evaluation is therefore widely accepted. Lee and Goldwasser (2018) generalized this single-step task and suggested two sequence inference versions, MCNS and MCNE. Figure 3 explains the three tasks. Given an event chain, MCNC chooses one step as a multiple-choice question and generates four negative choices for that step. MCNS turns it into a sequence prediction problem by creating multiple-choice questions for each step, except the start event. MCNE provides an additional clue, which is the end event. The inference model has to connect the start and end by explaining what happened in between.
Following the setup in Granroth-Wilding and Clark (2016), we evaluate on coreferent event chains, where a protagonist participates in each event. The minimum length of the event chains is 9, as short chains are likely to be caused by parsing errors. Our models naturally score the candidates with our training objective $f \in \{f_{transe}, f_{transr}\}$ using the COREF NEXT relation, while other baselines use cosine similarity.

Multiple-Choice Narrative Cloze
We compared two versions of our models, using the entire argument span or just its headword, with several recently published results.
We compare our models with the following baselines on MCNC:
• Random uniformly selects a candidate.
• Word2Vec (Mikolov et al., 2013) refers to the pre-trained word embeddings from Word2Vec SkipGram. The sum of the word embeddings of the predicate and argument mentions is used to represent an event.
• EvSkipGram (Granroth-Wilding and Clark, 2016) uses SkipGram to learn representations from "sentences" formed by predicates and argument headwords.
• EventComp (Granroth-Wilding and Clark, 2016) uses a neural network to learn a compositional function for EvSkipGram and outputs a coherence score for event pairs.
• SGNN (Li et al., 2018) is a graph-based model specifically designed for MCNC. It considers each event chain as a sub-graph and feeds it into GRU-based recurrent networks, which output relatedness scores for the candidates.
• FEEL (Lee and Goldwasser, 2018) is an event embedding model that does multi-task learning for inter-event relations and intra-event features.
• PairLSTM is an event embedding model that considers event order information and uses an LSTM network's hidden states for event representations.
Since we need the complete argument spans for events, which are not available from the pre-processing procedure of Granroth-Wilding and Clark (2016), we re-implement the event extraction step by carefully following their procedure. We mark the results based on the newly sampled evaluation set with a star sign (*). We released the newly sampled evaluation set for future comparisons. Table 2 shows the results. Our models outperform the best baseline model by more than 7% absolute accuracy. We attribute the improvement to three factors: (1) our models encode complete argument mentions rather than just headwords; EventTransE-headword and EventTransR-headword, our model variants that use only headwords for arguments, show that about 3% of the improvement comes from this; (2) our models share event representations over multiple relations, which regularizes the representations in diverse aspects, while other baselines do not make use of relations other than COREF NEXT; (3) our models' training objective directly measures relation-specific dissimilarity between events, while most others are based on simple cosine similarity.

Methods                                           Accuracy
Random*                                           20.00
PPMI (Chambers and Jurafsky, 2008)                30.52
BiGram (Jans et al., 2012)                        29.67
Word2Vec* (Mikolov et al., 2013)                  37.39
EvSkipGram* (Granroth-Wilding and Clark, 2016)    46.28
EventComp (Granroth-Wilding and Clark, 2016)      49.57
FEEL*                                             51.62
SGNN (Li et al., 2018)                            52.45
PairLSTM                                          55.12

Multiple-Choice Narrative Sequence
MCNC looks at a single transition between events and thus does not capture the flow of the entire narrative. Lee and Goldwasser (2018) proposed MCNS, which, instead of sampling candidate options for one event, samples options for all the events on the chain except the first, which is used as the starting point for predictions (Figure 3b). Based on the dissimilarity scores calculated by our models, we can compute transition probabilities for each step and then find the most likely sequence using the Viterbi inference algorithm (Viterbi, 1967). We follow the evaluation setting used in Lee and Goldwasser (2018) and compare three decision models: (1) Viterbi, which finds the most probable sequence of predictions; (2) Baseline-Inf, which greedily picks the best transition at each step based on the previous prediction; (3) Skyline-Inf, which breaks down a sequence of decisions into local decisions, each using the gold states of all the contextual events. Table 3 shows the results. Our models outperform FEEL (Lee and Goldwasser, 2018), which introduced the task. The reasons given for MCNC also explain this improvement. We also note that EventTransE is especially strong in making predictions for COREF NEXT.
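The difference between Viterbi and Baseline-Inf decoding can be illustrated with a small min-cost sketch. It assumes that a lower pairwise dissimilarity stands in for a higher transition probability, so minimizing the summed cost plays the role of maximizing sequence probability; the cost table and function names are invented for this example, and the end-event constraint of MCNE is not modeled.

```python
from typing import Callable, List, Sequence, TypeVar

E = TypeVar("E")


def viterbi(start: E, steps: Sequence[Sequence[E]],
            cost: Callable[[E, E], float]) -> List[int]:
    """Return the index of the chosen candidate at each step (min total cost)."""
    if not steps:
        return []
    # best[j]: lowest cost of any path ending in candidate j of the current step.
    best = [cost(start, c) for c in steps[0]]
    back: List[List[int]] = []
    for prev_cands, cands in zip(steps, steps[1:]):
        new_best, pointers = [], []
        for c in cands:
            scores = [best[i] + cost(p, c) for i, p in enumerate(prev_cands)]
            i_min = min(range(len(scores)), key=scores.__getitem__)
            new_best.append(scores[i_min])
            pointers.append(i_min)
        best = new_best
        back.append(pointers)
    # Trace back from the best final candidate.
    j = min(range(len(best)), key=best.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    return path[::-1]


def greedy(start: E, steps: Sequence[Sequence[E]],
           cost: Callable[[E, E], float]) -> List[int]:
    """Baseline-Inf style decoding: pick the best transition from the last pick."""
    path, prev = [], start
    for cands in steps:
        j = min(range(len(cands)), key=lambda k: cost(prev, cands[k]))
        path.append(j)
        prev = cands[j]
    return path


# Toy events are integers; the cost table is invented for illustration.
table = {(0, 1): 1.0, (0, 2): 2.0, (1, 3): 9.0, (1, 4): 9.0, (2, 3): 1.0, (2, 4): 5.0}


def toy_cost(a: int, b: int) -> float:
    return table[(a, b)]


steps = [[1, 2], [3, 4]]
print(greedy(0, steps, toy_cost))   # locally best first step, costly second step
print(viterbi(0, steps, toy_cost))  # globally cheapest sequence
```

In this toy chain the greedy decoder commits to the locally cheapest first candidate and then faces only expensive continuations, whereas Viterbi accepts a slightly worse first step to obtain a cheaper overall sequence.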

Multiple-Choice Narrative Explanation
MCNE is another extension of MCNC. In addition to the first event, the final event is also given (Figure 3c). Intuitively, the goal of this evaluation task is to capture explanations, consisting of event sequences, that connect the start and end points. The same inference algorithms as in MCNS are adopted. The right three columns of Table 3 give the results (note that Baseline-Inf and Skyline-Inf are shared with MCNS). The results show a similar trend to MCNS, but with higher scores, due to the additional information brought by the last event. Note that when calculating the accuracy, we only consider the event blanks in the middle (ignoring the last prediction made in MCNS) for both MCNS and MCNE. This ensures a fair comparison.

Intrinsic Discourse Relations Evaluation
We suggest three intrinsic tasks, depicted in Figure 4, evaluating how multi-relational information is captured. Given a triplet (e1, e2, r), the tasks are to (1) predict the next event e2, (2) predict the relation r, and (3) classify whether the triplet (e1, e2, r) is valid.
Figure 4: Three intrinsic tasks for evaluating our models: (4a) predicts the next event given an event and a relation; (4b) predicts the relation given a pair of events; (4c) binary classification for triplets.
Table 4: Accuracy scores (%) of the next event prediction, given an event and a relation. ELMo is a contextualized word embedding model. -Random and -NEXT are our model variants that replace the given relation with a random and NEXT relation respectively.
Predict the Next Event Similar in spirit to the setup described in Fig. 1, we ask whether knowing the relation, connecting the head to the tail event, would change the expectation about the tail event.
Given a set of triplets that have discourse relations, for each triplet, we corrupt $e_t$ and sample four extra negative choices to form a multiple-choice question. We compare our model variants with a strong baseline model, ELMo (Peters et al., 2018). ELMo is a context-aware word embedding model that has shown strong performance in language understanding tasks. To get the contextualized word embeddings, we have to provide the context, usually the sentence where the target words appear. To retrieve the context, for each event $e$, we reconstruct its "sentence" by concatenating its subj(e), pred(e), and obj(e). The averaged word embedding of the context is used to represent the event. ELMo predicts the next event based on cosine similarity, disregarding the relation. We also build two variants to show our models' awareness of relation types. One replaces the correct discourse relation with a random relation; the other replaces it with a NEXT relation. Table 4 shows the results. All our model variants outperform the ELMo baseline, as our models are aware of the relation between events. Similarity-based models, which capture frequently co-occurring events, fail to consider the nuanced relations. EventTransR performs the best, as it has relation-specific parameters emphasizing the relational nuances. Interestingly, using only the NEXT relation is also very indicative for predicting the next event, which explains why previous works failed to address the nuanced relations. The results for -Random relations indicate that EventTransR is very sensitive to incorrect relations. This is due to the separation between the relation and event embedding spaces, which is useful for relation-sensitive tasks. Also, the fact that the EventTransE-Random model works better than EventTransE-NEXT suggests that our models with discourse relations do capture their fine-grained differences. Note that even EventTransR with scrambled relations outperforms ELMo by a large margin. We hypothesize that ELMo emphasizes similarity rather than nuanced discourse relations between sentences.
Predict the Relation We predict the correct relation out of the 9 discourse relations (Table 1), given two events. Table 5 shows the results. With the additional relation-specific parameters, EventTransR performs better than EventTransE. Note that the ability to rank the correct relation is also important, as there might be more than one possible next event. According to the MRR and Recall@4, both models are competitive.
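The two ranking metrics mentioned above can be computed from per-relation dissimilarity scores as in the following sketch. The scores and helper names are invented for illustration, and ranking by ascending dissimilarity is an assumption that matches the convention that lower scores indicate a stronger relation.

```python
from typing import Dict, List


def rank_of_gold(scores: Dict[str, float], gold: str) -> int:
    """1-based rank of the gold relation when sorting by ascending dissimilarity."""
    ordered = sorted(scores, key=scores.get)
    return ordered.index(gold) + 1


def mrr(ranks: List[int]) -> float:
    """Mean reciprocal rank."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def recall_at_k(ranks: List[int], k: int) -> float:
    """Fraction of examples whose gold relation is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


# Toy dissimilarity scores for two event pairs (invented numbers).
examples = [
    ({"Reason": 0.2, "Result": 0.9, "Contrast": 1.5}, "Reason"),
    ({"Reason": 0.4, "Result": 0.7, "Contrast": 0.1}, "Result"),
]
ranks = [rank_of_gold(s, g) for s, g in examples]
print(ranks, mrr(ranks), recall_at_k(ranks, k=2))
```

Unlike plain accuracy, both metrics give partial credit when the gold relation is ranked near the top, which is useful when several relations are plausible for the same event pair.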
Triplet Classification This task is inspired by Triplet Classification in Knowledge Graph Completion (Socher et al., 2013; Wang et al., 2014). It predicts whether a given triplet $(e_h, e_t, r)$ is valid or not. We sample positive triplets from our dev and test splits and create negative triplets by corrupting $e_t$. We use the dev split to develop a set of relation-specific thresholds $\lambda_r$. The score is calculated using $f \in \{f_{transe}, f_{transr}\}$. If the score is lower than $\lambda_r$, the triplet is classified as positive; otherwise, it is negative. We sample 500 positive and negative triplets for each relation. The ELMo baseline is similar to the one in the previous experiments; we also develop a set of relation-specific thresholds based on ELMo's similarity scores to make predictions. Table 6 summarizes the results and shows that the similarity-based model, ELMo, cannot represent the nuanced relation information as well as our models. Interestingly, both our models excel at predicting the Expansion relations (Instant. and Restat.). EventTransR gets high scores on Temporal relations (Async.), which implies its applicability to tasks like event order inference (Ning et al., 2018). In general, for tasks requiring nuanced relations, EventTransR works better; if we only need to know the NEXT or COREF NEXT events, EventTransE is better. In addition, EventTransE has fewer trainable parameters and converges much faster.
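The thresholding step can be sketched as a one-dimensional search over dev-set scores for a single relation. The candidate cut-offs (midpoints between neighbouring scores), the toy scores, and the function names are assumptions made for illustration; the source does not specify how the thresholds are searched.

```python
from typing import List, Tuple


def accuracy(threshold: float, scored: List[Tuple[float, bool]]) -> float:
    """Fraction classified correctly, predicting positive iff score < threshold."""
    return sum((s < threshold) == label for s, label in scored) / len(scored)


def fit_threshold(dev: List[Tuple[float, bool]]) -> float:
    """Pick the candidate cut-off with the best dev accuracy for one relation."""
    values = sorted({s for s, _ in dev})
    # Candidate cut-offs: both extremes plus midpoints between neighbouring scores.
    candidates = [values[0] - 1.0, values[-1] + 1.0]
    candidates += [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max(candidates, key=lambda t: accuracy(t, dev))


# Toy dev scores for one relation: (dissimilarity, is_valid_triplet).
dev_reason = [(0.3, True), (0.5, True), (1.4, False), (2.0, False), (0.9, True)]
lambda_reason = fit_threshold(dev_reason)
print(lambda_reason, accuracy(lambda_reason, dev_reason))
```

Fitting one threshold per relation matters because the dissimilarity scale can differ between relations, so a single global cut-off would favor some relations over others.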

Implicit Discourse Sense Classifications
The final evaluation task is a subtask in CoNLL 2016 Shared Task  on implicit discourse sense classification. We follow the same setting as the shared task, with 15 sense classes. More details can be found in .
Three baselines, the best and median system of each subtask, are provided. In addition, we also trained a strong baseline based on ELMo. We first create word embeddings for the words in the argument spans using ELMo and put an attention layer on top of the words. The attention layer weights the words and creates the argument representation. We feed the representations of the two arguments to a neural classifier with two fully-connected hidden layers of dimensions 256 and 128. ReLU (Nair and Hinton, 2010) is used as the activation function and AdaGrad (Duchi et al., 2011) is used for optimizing the parameters. We combine EventTransE with the ELMo baseline by adding another attention layer on top of the event embeddings and concatenating all the argument representations in the network. Table 7 shows the results. The ELMo baseline is highly competitive, comparable to the winners of the task (ecnucs and ttr). Our combined model (ELMo+EventTransE) consistently improves performance, demonstrating the benefit of our model for downstream tasks.

Summary
We consider the problem of learning relation-aware event embeddings for commonsense inference, which can account for different relations between events, beyond simple event similarity. We include several event relations, which make it possible, for example, to identify the causes of events. We show that weak supervision, provided by a rule-based annotator, is enough for training our models.
We evaluated and compared two models, EventTransE and EventTransR, on several narrative cloze and relation-specific tasks, and showed that the learned embeddings can capture relation-specific information as well as improve performance on a downstream task.
This work lays the foundation for reasoning over narratives and explaining how sentences combine to form them. In the future we would like to expand this direction, and find ways to connect event and relation representation, learning and inference in a unified framework.

A Details of Event Preprocessing
This section describes the detailed preprocessing steps for extracting event predicates and arguments. It follows previous work (Lee and Goldwasser, 2018), except for the handling of argument mentions.
• Unlike previous works (Lee and Goldwasser, 2018;Granroth-Wilding and Clark, 2016;Pichotta and Mooney, 2016a), which only consider the headword of entity mentions, we use the entire mention span. This change allows the models to capture the nuanced information in the entity mentions, which is relevant for many commonsense inferences. For example, capturing the relationship between "a hungry man walked on a street" and "he grabbed some food" hinges on capturing the modifier "hungry."
• Predicates are lemmatized and lower-cased.
• Predicates are not only verbs but also predicative adjectives. For instance, in "Jenny was hungry. She ordered a big meal," the predicate "hungry" plays an important role.
• Negations are applied to predicates, e.g., "She didn't eat dinner" results in a new predicate: "not eat."
• Particles and clausal complements (xcomp) are included in verb predicates, since verbs such as "go" and "have" are not strong enough to give meaningful information. For instance, in "He went shopping last night," the predicate is "go shop," rather than "go."
• Low-frequency predicates and words in the entity mentions are treated as Out-Of-Vocabulary (OOV) during training. As the vocabulary size is constrained by memory and rare words are highly likely to introduce noise, only the most frequent n pred predicates and n argword argument words are considered.
• For the same reason, the maximum entity mention lengths, l subj and l obj , are set.
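The predicate and vocabulary rules above can be sketched in a few lines. The sketch assumes that lemmas, negation flags, particles, and xcomp heads have already been produced by a dependency parser; the function names, the `<OOV>` token, and the ordering of particle and complement inside the predicate string are hypothetical choices made for illustration.

```python
from collections import Counter

OOV = "<OOV>"  # hypothetical placeholder for out-of-vocabulary items

def build_predicate(lemma, negated=False, particle=None, xcomp=None):
    """Compose a predicate string from pre-parsed parts (all lower-cased)."""
    parts = []
    if negated:
        parts.append("not")
    parts.append(lemma.lower())
    if particle:
        parts.append(particle.lower())
    if xcomp:
        parts.append(xcomp.lower())
    return " ".join(parts)

def top_k_vocab(items, k):
    """Keep only the k most frequent items."""
    return {item for item, _ in Counter(items).most_common(k)}

def encode_mention(words, vocab, max_len):
    """Truncate a mention to max_len words and map unknown words to OOV."""
    return [w if w in vocab else OOV for w in words[:max_len]]

# Examples taken from the bullet points above.
went_shopping = build_predicate("go", xcomp="shop")
did_not_eat = build_predicate("eat", negated=True)
vocab = top_k_vocab(["eat", "eat", "go", "go", "sleep"], k=2)
mention = encode_mention(["a", "hungry", "man"], {"hungry", "man"}, max_len=2)
```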

B Negative Sampling for Event Triplets
For each positive triplet (e h , e t , r), we create one negative triplet by randomly replacing e h , e t , or r with equal probability. The replacement events are sampled from the event vocabulary collected from the training set, and the replacement relations are drawn from the 11 types we support. We have experimented with other negative sampling strategies, such as corrupting the tail event only or sampling with different event distributions. None of them performed better.
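A minimal sketch of this sampling scheme is given below. Uniform sampling over the vocabulary and the exclusion of the original element (so that the negative always differs from the positive) are assumptions of the sketch, and the event and relation names are toy values.

```python
import random

def corrupt_triplet(triplet, event_vocab, relations, rng):
    """Create one negative triplet by replacing the head, the tail, or the
    relation, each chosen with probability 1/3."""
    head, tail, rel = triplet
    slot = rng.choice(("head", "tail", "rel"))
    if slot == "head":
        # Excluding the original element is an assumption of this sketch.
        head = rng.choice([e for e in event_vocab if e != head])
    elif slot == "tail":
        tail = rng.choice([e for e in event_vocab if e != tail])
    else:
        rel = rng.choice([r for r in relations if r != rel])
    return (head, tail, rel)

# Toy vocabulary and relation set (hypothetical values).
rng = random.Random(0)
events = ["e1", "e2", "e3"]
relations = ["Reason", "Result", "Contrast"]
positive = ("e1", "e2", "Reason")
negatives = [corrupt_triplet(positive, events, relations, rng) for _ in range(100)]
```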

C Hyperparameters
We have experimented with different sets of hyperparameters and settled on the following setting: the numbers of active predicates and argument words, n pred and n argword , are both set to 25000; the maximum argument lengths, l subj and l obj , are set to 15; the event contextual window size w context for extracting the NEXT relation is 5; the event composition hidden layer has dimension d h = 1000 and uses the Rectified Linear Unit (ReLU) (Nair and Hinton, 2010) as the activation function; the embedding dimensions are d a = 500, d e = 500, and d r = 500; the margin δ is empirically set to 1; the optimizer is Adagrad (Duchi et al., 2011) with initial learning rate 0.01; the word embeddings for the entity mention encoders are initialized with the word embeddings pre-trained in Skip-Thoughts (Kiros et al., 2015); all the experimental results are averaged over 5 runs.
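For convenience, the setting above can be collected in a single configuration object. The class and field names below are hypothetical; only the values come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EventEmbeddingConfig:
    n_pred: int = 25000          # number of active predicates
    n_argword: int = 25000       # number of active argument words
    l_subj: int = 15             # maximum subject mention length
    l_obj: int = 15              # maximum object mention length
    w_context: int = 5           # context window for extracting NEXT relations
    d_h: int = 1000              # event composition hidden layer size
    d_a: int = 500               # embedding dimension d_a
    d_e: int = 500               # embedding dimension d_e
    d_r: int = 500               # embedding dimension d_r
    margin: float = 1.0          # margin delta
    learning_rate: float = 0.01  # initial Adagrad learning rate
    n_runs: int = 5              # results are averaged over this many runs

config = EventEmbeddingConfig()
```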

D Qualitative Analysis
The experiments in the paper provide quantitative evaluations of our models. To give a more comprehensive understanding, we also perform a qualitative analysis, which shows the exact inferences our models make. In this analysis, our models make inferences in grounded scenarios, where we have clearer expectations about possible events and outcomes. To do so, we create two confined "worlds," where each world has only a limited number of entities and predicates, and hence a limited number of candidate events. This limitation is enforced because it helps us examine the quality of the inferences. Table 8 shows the entities and predicates that are selected for the two worlds. The topic of the first world (a) is a murder and the topic of the second world (b) is stock markets. Both are common topics in newswire articles, which we use for training the model. Note that since each event triplet has two entity components (subject and object) and one predicate, the number of candidate events is calculated as $n_{pred} \cdot n_{ent}^2$, where $n_{pred}$ is the number of predicates and $n_{ent}$ is the number of entities. In these two worlds, we have 1100 and 1400 candidate events, respectively.
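The candidate set of a confined world can be enumerated directly, as in the following sketch. The predicates and entities here are toy values rather than the contents of Table 8, and, in line with the count above, the subject and the object are allowed to be the same entity.

```python
from itertools import product

def enumerate_candidates(predicates, entities):
    """Return every (subject, predicate, object) event in a confined world."""
    return [(subj, pred, obj)
            for pred, subj, obj in product(predicates, entities, entities)]

# Toy world (hypothetical values): 3 predicates and 2 entities.
toy_predicates = ["shoot", "arrest", "die"]
toy_entities = ["John", "police"]
candidates = enumerate_candidates(toy_predicates, toy_entities)
n_candidates = len(candidates)  # 3 * 2**2 = 12
```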
To conduct the inference, our model ranks the candidate events according to their relevance to a given starting scenario, which is a sequence of events. We use EventTransR to embed and rank all the events. For each candidate, we jointly consider its relevance to each start event. The dissimilarity scoring function s(.) is defined as follows: $s(e_c) = \sum_{e_s \in S} f_{transr}(e_s, e_c, r), \ \forall e_c \in C$, where S is the set of all events in the starting scenario, C is the set of possible candidate events, and r is the embedding of the discourse relation of interest. We rank all the candidates based on this function; candidates with lower scores are ranked higher. We consider four discourse relations (Contrast, Reason, Result, and Asynchronous) in this analysis, as they are particularly interesting for commonsense inference. Table 9 summarizes the analysis. In each case, we only list the top 2-3 events. In world (a), EventTransR can precisely predict events in three out of four relations. In particular, we can contrast the fact that "John died" with "John survived," which has not been addressed in previous works. For Asynchronous, on which EventTransR fails, the signal for temporal relations is noisier, as many possible outcomes are reasonable. In world (b), our model succeeds in all four relations. Our model is also able to tell the difference between Result and Reason, as indicated by the predictions that "the stock has soared" leads to "CEO made money," and "Because shares increased, the stock soared." These results show that we are able to control the inferences over different discourse perspectives, which is useful for tasks like story generation.
This analysis helps provide more intuition about the knowledge learned by our models. Note that this is a challenging task even when grounded with a small set of candidate events, as was reported by previous works that looked at event-ranking based evaluations (Pichotta and Mooney, 2016a).
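The ranking step can be sketched as follows. The sketch assumes the standard TransR form of the dissimilarity, in which both events are projected by a relation-specific matrix before translation, and it uses toy two-dimensional embeddings; the function names are hypothetical.

```python
import numpy as np

def transr_score(head, tail, rel, proj):
    """Dissimilarity in the standard TransR form ||h M_r + r - t M_r||_2 (assumed)."""
    return float(np.linalg.norm(head @ proj + rel - tail @ proj))

def rank_candidates(start_events, candidates, rel, proj):
    """Score each candidate by summing its dissimilarity to every start event,
    then sort in ascending order so that lower scores rank higher."""
    scores = [sum(transr_score(s, c, rel, proj) for s in start_events)
              for c in candidates]
    order = np.argsort(scores)
    return [(int(i), scores[int(i)]) for i in order]

# Toy embeddings (hypothetical values) with an identity projection.
proj = np.eye(2)
rel = np.array([1.0, 0.0])
start = [np.array([0.0, 0.0])]
cands = [np.array([5.0, 5.0]), np.array([1.0, 0.0]), np.array([2.0, 0.0])]
ranking = rank_candidates(start, cands, rel, proj)
ranked_indices = [i for i, _ in ranking]
```

Swapping in the embedding and projection matrix of a different relation changes the ranking over the same candidates, which is what allows one starting scenario to yield different Reason, Result, or Contrast inferences.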