Implicit Argument Prediction with Event Knowledge

Implicit arguments are not syntactically connected to their predicates, and are therefore hard to extract. Previous work has used models with large numbers of features, evaluated on very small datasets. We propose to train models for implicit argument prediction on a simple cloze task, for which data can be generated automatically at scale. This allows us to use a neural model, which draws on narrative coherence and entity salience for predictions. We show that our model has superior performance on both synthetic and natural data.


Introduction
When parts of an event description in a text are missing, this event cannot be easily extracted, and it cannot easily be found as the answer to a question. This is the case with implicit arguments, as in this example from the reading comprehension dataset of Hermann et al. (2015):
Text: More than 2,600 people have been infected by Ebola in Liberia, Guinea, Sierra Leone and Nigeria since the outbreak began in December, according to the World Health Organization. Nearly 1,500 have died.
Question: The X outbreak has killed nearly 1,500.
In this example, it is Ebola that broke out, and Ebola was also the cause of nearly 1,500 people dying, but the text does not state this explicitly. Ebola is an implicit argument of both outbreak and die, which is crucial to answering the question.
We are particularly interested in implicit arguments that, like Ebola in this case, do appear in the text, but not as syntactic arguments of their predicates. Event knowledge is key to determining implicit arguments. In our example, diseases are maybe the single most typical things to break out, and diseases also typically kill people.
The task of identifying implicit arguments was first addressed by Gerber and Chai (2010) and Ruppenhofer et al. (2010). However, the datasets for the task were very small, and to our knowledge there has been very little further development on the task since then.
In this paper, we address the data issue by training models for implicit argument prediction on a simple cloze task, similar to the narrative cloze task (Chambers and Jurafsky, 2008), for which data can be generated automatically at scale. This allows us to train a neural network to perform the task, building on two insights. First, event knowledge is crucial for implicit argument detection. Therefore we build on models for narrative event prediction (Granroth-Wilding and Clark, 2016; Pichotta and Mooney, 2016a), using them to judge how coherent the narrative would be when we fill in a particular entity as the missing (implicit) argument. Second, the omitted arguments tend to be salient, as Ebola is in the text from which the above example is taken. So in addition to narrative coherence, our model takes into account entity salience (Dunietz and Gillick, 2014).
In an evaluation on a large automatically generated dataset, our model clearly outperforms even strong baselines, and we find salience features to be important to the success of the model. We also evaluate against a variant of the Gerber and Chai (2012) model that does not rely on gold features, finding that our simple neural model outperforms their much more complex model.
Our paper thus makes two major contributions. 1) We propose an argument cloze task to generate synthetic training data at scale for implicit argument prediction. 2) We show that neural event models for narrative schema prediction can be used on implicit argument prediction, and that a straightforward combination of event knowledge and entity salience can do well on the task.

Related Work
While dependency parsing and semantic role labeling only deal with arguments that are available in the syntactic context of the predicate, implicit argument labeling seeks to find arguments that are not syntactically connected to their predicates, like Ebola in our introductory example.
The most relevant work on implicit argument prediction came from Gerber and Chai (2010), who built an implicit arguments dataset by selecting 10 nominal predicates from NomBank (Meyers et al., 2004) and manually annotating implicit arguments for all occurrences of these predicates.
In an analysis of their data they found implicit arguments to be very frequent, as their annotation added 65% more arguments to NomBank. Gerber and Chai (2012) also trained a linear classifier for the task relying on many hand-crafted features, including gold features from FrameNet (Baker et al., 1998), PropBank (Palmer et al., 2005) and NomBank. This classifier has, to the best of our knowledge, not been outperformed by follow-up work (Laparra and Rigau, 2013; Schenk and Chiarcos, 2016; Do et al., 2017). We evaluate on the Gerber and Chai dataset below. Ruppenhofer et al. (2010) also introduced an implicit argument dataset, but we do not evaluate on it as it is even smaller and much more complex than Gerber and Chai (2010). More recently, Modi et al. (2017) introduced the referent cloze task, in which they predicted a manually removed discourse referent from a human annotated narrative text. This task is closely related to our argument cloze task.
Since we intend to exploit event knowledge in predicting implicit arguments, we here refer to recent work on statistical script learning, started by Chambers and Jurafsky (2008, 2009). They introduced the idea of using statistical information on coreference chains to induce prototypical sequences of narrative events and participants, which is related to the classical notion of a script (Schank and Abelson, 1977). They also proposed the narrative cloze evaluation, in which one event is removed at random from a sequence of narrative events, then the missing event is predicted given all context events. We use a similar trick to define a cloze task for implicit argument prediction, discussed in Section 3.
Many follow-up papers on script learning have used neural networks. Rudinger et al. (2015) showed that sequences of events can be efficiently modeled by a log-bilinear language model. Pichotta and Mooney (2016a,b) used an LSTM to model a sequence of events. Granroth-Wilding and Clark (2016) built a network that produces an event representation by composing its components. To do the cloze task, they select the most probable event based on pairwise event coherence scores. For our task we want to do something similar: We want to predict how coherent a narrative would be with a particular entity candidate filling the implicit argument position. So we take the model of Granroth-Wilding and Clark (2016) as our starting point.
The Hermann et al. (2015) reading comprehension task, like our cloze task, requires systems to guess a removed entity. However in their case the entity is removed in a summary, not in the main text. In their case, the task typically amounts to finding a main text passage that paraphrases the sentence with the removed entity; this is not the case in our cloze task.

The Argument Cloze Task
We present the argument cloze task, which allows us to automatically generate large scale data for training (Section 6.1) and evaluation (Section 5.1).
In this task, we randomly remove an entity from an argument position of one event in the text. The entity in question needs to appear in at least one other place in the text. The task is then for the model to pick, from all entities appearing in the text, the one that has been removed. We first define what we mean by an event, then what we mean by an entity. Like Pichotta and Mooney (2016a) and Granroth-Wilding and Clark (2016), we define an event e as consisting of a verbal predicate v, a subject s, a direct object o, and a prepositional object p (along with the preposition). Here we only allow one prepositional argument in the structure, to avoid variable length input in the event composition model. By an entity, we mean a coreference chain with a length of at least two, that is, the entity needs to appear at least twice in the text.
For example, from a piece of raw text (Figure 1a), we automatically extract a sequence of events from a dependency parse, and a list of entities from coreference chains. In Figure 1b, e0~e4 are events and x0~x2 are entities. The arguments electricity-dobj and energy-dobj are not in coreference chains and are thus not candidates for removal. An example of the argument cloze task is shown in Figure 1c. Here the prep_to argument of e1 has been removed.
Coreference resolution is very noisy. Therefore we use gold coreference annotation for creating evaluation data, but automatically generated coreference chains for creating training data.

Figure 1:
(a) A piece of raw text from OntoNotes corpus.
Manville Corp. said it will build a $ 24 million power plant to provide electricity to its Igaras pulp and paper mill in Brazil .
The company said the plant will ensure that it has adequate energy for the mill and will reduce the mill's energy costs .
(b) Extracted events (e0~e4) and entities (x0~x2), using gold annotations from OntoNotes.
x0 = The company, x1 = mill, x2 = power plant
e0: ( build-pred, x0-subj, x2-dobj, - )
e1: ( provide-pred, -, electricity-dobj, x1-prep_to )
e2: ( ensure-pred, ...
(c) Example of an argument cloze task for prep_to of e1.
e0, e2, e3, e4: same as above
e1: ( provide-pred, -, electricity-dobj, ??-prep_to )
x0 = The company, x1 = mill, x2 = power plant
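The construction of a cloze instance described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `Event` tuple, the slot names, and the helpers `removable_slots` and `make_cloze` are hypothetical names, and only the two events that are fully legible in Figure 1 (e0 and e1) are used. Because every entity is a coreference chain of length at least two, any entity filler automatically appears elsewhere in the text.

```python
import random
from typing import NamedTuple, Optional

SLOTS = ("subj", "dobj", "pobj")  # argument positions of an event

class Event(NamedTuple):
    pred: str
    subj: Optional[str] = None   # entity id such as "x0", a plain word, or None
    dobj: Optional[str] = None
    pobj: Optional[str] = None
    prep: Optional[str] = None   # preposition that goes with pobj

def removable_slots(events, entities):
    """All (event index, slot) pairs whose filler is an entity, i.e. a coreference chain."""
    return [(i, s) for i, e in enumerate(events)
            for s in SLOTS if getattr(e, s) in entities]

def make_cloze(events, entities, rng):
    """Blank out one randomly chosen entity argument; return the masked events,
    the removed position, and the removed entity (the gold answer)."""
    i, slot = rng.choice(removable_slots(events, entities))
    answer = getattr(events[i], slot)
    masked = list(events)
    masked[i] = events[i]._replace(**{slot: "??"})
    return masked, (i, slot), answer

entities = {"x0": "The company", "x1": "mill", "x2": "power plant"}
events = [
    Event("build", subj="x0", dobj="x2"),
    Event("provide", dobj="electricity", pobj="x1", prep="to"),
]
masked_events, position, gold = make_cloze(events, entities, random.Random(0))
```

Note that `electricity` is never selected for removal, since it is a plain word rather than an entity id.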

Modeling Narrative Coherence
We model implicit argument prediction as selecting the entity that, when filled in as the implicit argument, makes the overall most coherent narrative. Suppose we are trying to predict the direct object argument of some target event $e_t$. Then we complete $e_t$ by putting an entity candidate into the direct object argument position, and check the coherence of the resulting event with the rest of the narrative. Say we have a sequence of events $e_1, e_2, \ldots, e_n$ in a narrative, and a list of entity candidates $x_1, x_2, \ldots, x_m$. Then for any candidate $x_j$, we first complete the target event to be
$$e_t(j) = (v_t, s_t, x_j, p_t) \qquad (1)$$
where $v_t$, $s_t$, and $p_t$ are the predicate, subject, and prepositional object of $e_t$ respectively, and $x_j$ is filled as the direct object. (Event completion for omitted subjects and prepositional objects is analogous.) Then we compute the narrative coherence score $S_j$ of the candidate $x_j$ by
$$S_j = \max_{c = 1,\, c \neq t}^{n} \mathrm{coh}\big(\vec{e}_t(j), \vec{e}_c\big) \qquad (2)$$
where $\vec{e}_t(j)$ and $\vec{e}_c$ are representations for the completed target event $e_t(j)$ and one context event $e_c$, and coh is a function computing a coherence score between two events, both depending on the model being used. The candidate $x_j$ with the highest score $S_j$ is then selected as our prediction.
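The selection procedure can be sketched as follows, assuming (as the baseline description in the evaluation section states) that a candidate's score is the maximum pairwise coherence with any context event. The function `pick_candidate`, the dictionary event format, and the lookup table `TOY_COH` with its scores are all hypothetical stand-ins; in the paper, coh is computed by the trained model.

```python
def pick_candidate(target, slot, context_events, candidates, coh):
    """Fill each candidate into `slot` of the target event and score it by the
    maximum pairwise coherence with the context events; return the best one."""
    scores = {}
    for x in candidates:
        completed = {**target, slot: x}          # completed target event
        scores[x] = max(coh(completed, e_c) for e_c in context_events)
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical coherence values, keyed by (pobj filler, context predicate).
TOY_COH = {
    ("x0", "build"): 0.4, ("x0", "reduce"): 0.1,
    ("x1", "build"): 0.3, ("x1", "reduce"): 0.8,
    ("x2", "build"): 0.5, ("x2", "reduce"): 0.2,
}

def toy_coh(completed, context):
    """Stand-in for a learned coherence function."""
    return TOY_COH[(completed["pobj"], context["pred"])]

target = {"pred": "provide", "subj": None, "dobj": "electricity", "pobj": None}
context = [
    {"pred": "build", "subj": "x0", "dobj": "x2", "pobj": None},
    {"pred": "reduce", "subj": "x2", "dobj": "cost", "pobj": None},
]
best, scores = pick_candidate(target, "pobj", context, ["x0", "x1", "x2"], toy_coh)
```

With these made-up scores the candidate x1 wins because a single context event supports it strongly, which is the behavior a maximum (rather than an average) over context events is meant to give.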

The Event Composition Model
To model coherence (coh) between a context event and a target event, we build an event composition model consisting of three parts, as shown in Figure 2: event components are represented through event-based word embeddings, which encode event knowledge in word representations; the argument composition network combines the components to produce event representations; and the pair composition network computes a coherence score for two event representations.
This basic architecture is as in the model of Granroth-Wilding and Clark (2016). However, our model is designed for a different task, argument cloze rather than narrative cloze, and for our task entity-specific information is more important. We therefore create the training data in a different way, as described in Section 4.2.1. We now discuss the three parts of the model in more detail.

Figure 2: Diagram for event composition model. Input: a context event and a target event. Event-Based Word Embeddings: embeddings for components of both events that encode event knowledge. Argument Composition Network: produces an event representation from its components. Pair Composition Network: computes a coherence score coh from two event representations. Extra Features: argument index and entity salience features as additional input to the pair composition network.

Event-Based Word Embeddings
The event composition model takes embeddings of predicates and arguments as input to compute event representations. To better encode event knowledge at the word level, we train an SGNS (skip-gram with negative sampling) word2vec model (Mikolov et al., 2013) with event-specific information. For each extracted event sequence, we create a sentence with the predicates and arguments of all events in the sequence. An example of such a training sentence is given in Figure 3.
Figure 3: build-pred company-subj plant-dobj provide-pred electricity-dobj mill-prep_to ensure-pred plant-subj has-pred company-subj energy-dobj mill-prep_for reduce-pred plant-subj cost-dobj

Pair Composition Network
The pair composition network (light blue area in Figure 2) computes a coherence score coh between 0 and 1, given the vector representations of a context event and a target event. The coherence score should be high when the target event contains the correct argument, and low otherwise. So we construct the training objective function to distinguish the correct argument from wrong ones, as described in Equation 3.

Training for Argument Prediction
To train the model to pick the correct candidate, we automatically construct training samples as event triples consisting of a context event $e_c$, a positive event $e_p$, and a negative event $e_n$. The context event and positive event are randomly sampled from an observed sequence of events, while the negative event is generated by replacing one argument of the positive event by a random entity in the narrative, as shown in Figure 4. We want the coherence score between $e_c$ and $e_p$ to be close to 1, while the score for $e_c$ and $e_n$ should be close to 0. Therefore, we train the model to minimize cross-entropy as follows:
$$-\log\big(\mathrm{coh}(e_{ci}, e_{pi})\big) - \log\big(1 - \mathrm{coh}(e_{ci}, e_{ni})\big) \qquad (3)$$
where $e_{ci}$, $e_{pi}$, and $e_{ni}$ are the context, positive, and negative events of the ith training sample respectively.
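A rough sketch of the triple construction and the per-sample loss is given below. The helper names `make_triple` and `triple_loss` are hypothetical, and two details are assumptions rather than statements from the text: the replaced argument is taken to be an entity argument, and the random replacement entity is required to differ from the original filler.

```python
import math
import random

SLOTS = ("subj", "dobj", "pobj")

def make_triple(events, entity_ids, rng):
    """Sample (context, positive, negative) from one event sequence.
    The negative copies the positive with one entity argument swapped
    for a different entity from the same narrative (assumption)."""
    c_idx, p_idx = rng.sample(range(len(events)), 2)
    context, positive = events[c_idx], events[p_idx]
    slots = [s for s in SLOTS if positive[s] in entity_ids]
    if not slots:
        return None                      # no entity argument to corrupt
    slot = rng.choice(slots)
    others = [x for x in entity_ids if x != positive[slot]]
    if not others:
        return None
    negative = {**positive, slot: rng.choice(others)}
    return context, positive, negative

def triple_loss(coh_pos, coh_neg, eps=1e-12):
    """Cross-entropy for one triple: push coh(context, positive) towards 1
    and coh(context, negative) towards 0. eps guards against log(0)."""
    return -math.log(coh_pos + eps) - math.log(1.0 - coh_neg + eps)

events = [
    {"pred": "build", "subj": "x0", "dobj": "x2", "pobj": None},
    {"pred": "reduce", "subj": "x2", "dobj": "cost", "pobj": None},
]
triple = make_triple(events, ["x0", "x1", "x2"], random.Random(1))
```

A confident, correct model is rewarded: `triple_loss(0.9, 0.1)` is much smaller than `triple_loss(0.5, 0.5)`.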

Entity Salience
Implicit arguments tend to be salient entities in the document. So we extend our model with entity salience features, building on recent work by Dunietz and Gillick (2014), who introduced a simple model with several surface level features for entity salience detection. Among the features they used, we discard those that require external resources, and only use the remaining three features, as illustrated in Table 1. Dunietz and Gillick found mentions to be the most powerful indicator for entity salience among all features. We expect similar results in our experiments. However, we include all three features in our event composition model for now, and conduct an ablation test afterwards.

Table 1:
Feature      Description
1st loc      Index of the sentence where the first mention of the entity appears
head count   Number of times the head word of the entity appears
mentions     A vector containing the numbers of named, nominal, pronominal, and total mentions of the entity

The entity salience features are directly passed into the pair composition network as additional input. We also add an extra feature for argument position index (encoding whether the missing argument is a subject, direct object, or prepositional object), as shown in the red area in Figure 2.
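To make the three salience features concrete, the sketch below computes them for the entity mill in the Manville example. The `Mention` class, the function `salience_features`, the choice of the first mention's head as "the head word of the entity", and the mention types assigned to each mention are all assumptions made for illustration; the text is tokenized by hand so that "mill 's" is split.

```python
from dataclasses import dataclass

@dataclass
class Mention:
    sent_idx: int   # index of the sentence containing the mention
    head: str       # head word of the mention
    kind: str       # "named", "nominal", or "pronominal"

def salience_features(mentions, doc_tokens):
    """Compute the 1st loc, head count, and mentions features for one entity."""
    first_loc = min(m.sent_idx for m in mentions)
    head = mentions[0].head.lower()          # assumption: head of the first mention
    head_count = sum(1 for tok in doc_tokens if tok.lower() == head)
    kinds = ("named", "nominal", "pronominal")
    per_kind = [sum(1 for m in mentions if m.kind == k) for k in kinds]
    return {
        "1st_loc": first_loc,
        "head_count": head_count,
        "mentions": per_kind + [len(mentions)],   # named, nominal, pronominal, total
    }

doc_tokens = (
    "Manville Corp. said it will build a $ 24 million power plant to provide "
    "electricity to its Igaras pulp and paper mill in Brazil . "
    "The company said the plant will ensure that it has adequate energy for "
    "the mill and will reduce the mill 's energy costs ."
).split()

mill_mentions = [
    Mention(0, "mill", "nominal"),
    Mention(1, "mill", "nominal"),
    Mention(1, "mill", "nominal"),
]
features = salience_features(mill_mentions, doc_tokens)
```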

Argument Cloze Evaluation
Previous implicit argument datasets were very small. To overcome that limitation, we automatically create a large and comprehensive evaluation dataset, following the argument cloze task setting in Section 3.
Since the events and entities are extracted from dependency labels and coreference chains, we do not want to introduce systematic error into the evaluation from imperfect parsing and coreference algorithms. Therefore, we create the evaluation set from OntoNotes (Hovy et al., 2006), which contains human-labeled dependency and coreference annotation for a large corpus. So the extracted events and entities in the evaluation set are gold. Note that this is only for evaluation; in training we do not rely on any gold annotations (Section 6.1).
There are four English sub-corpora in OntoNotes Release 5.0 that are annotated with dependency labels and coreference chains. Three of them, which are mainly from broadcast news, share similar statistics in document length, so we combine them into a single dataset and name it ON-SHORT as it consists mostly of short documents. The fourth subcorpus is from the Wall Street Journal and has significantly longer documents. We call this subcorpus ON-LONG and evaluate on it separately. Some statistics are shown in Table 2.

The Gerber and Chai (G&C) Dataset
The implicit argument dataset from Gerber and Chai (2010) (referred to as G&C henceforth) consists of 966 human-annotated implicit argument instances on 10 nominal predicates. To evaluate our model on G&C, we convert the annotations to the input format of our model as follows: We map nominal predicates to their verbal form, and semantic role labels to syntactic argument types based on the NomBank frame definitions. One of the examples (after mapping semantic role labels) is as follows:
[Participants] subj will be able to transfer [money]
For the nominal predicate investment, there are three arguments missing (subj, dobj, prep_to). The model first needs to determine that each of those argument positions in fact has an implicit filler. Then, from a list of candidates (not shown here), it needs to select Participants as the implicit subj argument, money as the implicit dobj argument, and either other investment funds or a stock fund and a money-market fund as the implicit prep_to.

Implementation Details
We train our neural model using synthetic data as described in Section 3. For creating the training data, we do not use gold parses or gold coreference chains. We use the 20160901 dump of English Wikipedia, with 5,228,621 documents in total. For each document, we extract plain text and break it into paragraphs, while discarding all structured data like lists and tables. We construct a sequence of events and entities from each paragraph, by running Stanford CoreNLP (Manning et al., 2014) to obtain dependency parses and coreference chains. We lemmatize all verbs and arguments. We incorporate negation and particles in verbs, and normalize passive constructions. We represent each argument by the corresponding entity's representative mention if it is linked to an entity, otherwise by its head lemma. We keep verbs and arguments with counts over 500, together with the 50 most frequent prepositions, leading to a vocabulary of 53,345 tokens; all other words are replaced with an out-of-vocabulary token. The most frequent verbs (with counts over 100,000) are down-sampled.
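The count-based vocabulary cut-off can be illustrated with a small sketch. The function names and the `<OOV>` token string are hypothetical, and a tiny threshold is used in the toy call; the paper's threshold for verbs and arguments is a count over 500. Down-sampling of frequent verbs is not shown, since the text does not specify the procedure.

```python
from collections import Counter

def build_vocab(sequences, min_count):
    """Keep tokens whose corpus count is strictly greater than min_count."""
    counts = Counter(tok for seq in sequences for tok in seq)
    return {tok for tok, c in counts.items() if c > min_count}

def apply_vocab(seq, vocab, oov="<OOV>"):
    """Replace every token outside the vocabulary with one shared OOV token."""
    return [tok if tok in vocab else oov for tok in seq]

toy_sequences = [
    ["build-pred", "plant-dobj"],
    ["build-pred", "mill-prep_to"],
    ["build-pred", "plant-dobj"],
]
toy_vocab = build_vocab(toy_sequences, min_count=1)   # the paper uses 500
filtered = apply_vocab(toy_sequences[1], toy_vocab)
```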
For training the event-based word embeddings, we create pseudo-sentences (Section 4.2) from all events of all sequences (approximately 87 million events) as training samples. We train an SGNS word2vec model with embedding size = 300, window size = 10, subsampling threshold = $10^{-4}$, and negative samples = 10, using the Gensim package (Řehůřek and Sojka, 2010).
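The pseudo-sentence construction can be sketched as below; applied to the five events of the running example, it reproduces the training sentence of Figure 3. The dictionary event format and the function names are hypothetical. The `W2V_PARAMS` dictionary simply restates the hyperparameters above using the keyword names of Gensim 4.x `Word2Vec` (older Gensim versions call the first one `size`); whether the authors passed them in exactly this form is an assumption, and Gensim itself is not imported here so that the sketch runs on its own.

```python
def event_tokens(event):
    """Turn one event into role-tagged tokens, skipping empty arguments."""
    tokens = [f"{event['pred']}-pred"]
    if event.get("subj"):
        tokens.append(f"{event['subj']}-subj")
    if event.get("dobj"):
        tokens.append(f"{event['dobj']}-dobj")
    if event.get("pobj"):
        tokens.append(f"{event['pobj']}-prep_{event['prep']}")
    return tokens

def pseudo_sentence(events):
    """Concatenate the tokens of all events in a sequence into one sentence."""
    return [tok for e in events for tok in event_tokens(e)]

example_events = [
    {"pred": "build", "subj": "company", "dobj": "plant"},
    {"pred": "provide", "dobj": "electricity", "pobj": "mill", "prep": "to"},
    {"pred": "ensure", "subj": "plant"},
    {"pred": "has", "subj": "company", "dobj": "energy", "pobj": "mill", "prep": "for"},
    {"pred": "reduce", "subj": "plant", "dobj": "cost"},
]
sentence = pseudo_sentence(example_events)

# Hyperparameters from the text, in Gensim 4.x keyword form (sg=1 selects skip-gram).
# They would be passed as Word2Vec(sentences=all_pseudo_sentences, **W2V_PARAMS).
W2V_PARAMS = {"vector_size": 300, "window": 10, "sample": 1e-4, "negative": 10, "sg": 1}
```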
For training the event composition model, we follow the procedure described in Section 4.2.1, and extract approximately 40 million event triples as training samples. We use a two-layer feedforward neural network with layer sizes 600 and 300 for the argument composition network, and another two-layer network with layer sizes 400 and 200 for the pair composition network. We use cross-entropy loss with L2 regularization of 0.01. We train the model using stochastic gradient descent (SGD) with a learning rate of 0.01 and a batch size of 100 for 20 epochs.
To study how the size of the training set affects performance, we downsample the 40 million training samples to another set of 8 million training samples. We refer to the resulting models as EVENTCOMP-8M and EVENTCOMP-40M.

Evaluation on Argument Cloze
For the synthetic argument cloze task, we compare our model with 3 baselines.
RANDOM Randomly select one entity from the candidate list.
MOSTFREQ Always select the entity with highest number of mentions.
EVENTWORD2VEC Use the event-based word embeddings described in Section 4.2 for predicates and arguments. The representation of an event e is the sum of the embeddings of its components, i.e.,
$$\vec{e} = \vec{v} + \vec{s} + \vec{o} + \vec{p}$$
where $\vec{v}$, $\vec{s}$, $\vec{o}$, $\vec{p}$ are the embeddings of verb, subject, object, and prepositional object, respectively. The coherence score of two events in this baseline model is their cosine similarity. As in our main model, the coherence score of the candidate is then the maximum pairwise coherence score, as described in Section 4.1.
The evaluation results on the ON-SHORT dataset are shown in Table 3. The EVENTWORD2VEC baseline is much stronger than the other two, achieving an accuracy of 38.40%. In fact, EVENTCOMP-8M by itself does not do better than EVENTWORD2VEC, but adding entity salience greatly boosts performance. Using more training data (EVENTCOMP-40M) helps by a substantial margin both with and without entity salience features.
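The EVENTWORD2VEC baseline described above can be sketched in plain Python: an event vector is the sum of its component embeddings, two events are compared by cosine similarity, and a candidate's score is the maximum over context events. The two-dimensional embeddings are made-up toy values, and treating missing or out-of-vocabulary components as zero vectors is an assumption.

```python
import math

def event_vector(event_toks, emb, dim):
    """Sum the embeddings of an event's components (unknown tokens add nothing)."""
    vec = [0.0] * dim
    for tok in event_toks:
        for k, val in enumerate(emb.get(tok, [0.0] * dim)):
            vec[k] += val
    return vec

def cosine(a, b):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def baseline_score(target_toks, context, emb, dim):
    """Maximum cosine similarity between the completed target and any context event."""
    t = event_vector(target_toks, emb, dim)
    return max(cosine(t, event_vector(c, emb, dim)) for c in context)

# Hypothetical 2-d embeddings, for illustration only.
emb = {
    "provide-pred": [1.0, 0.0], "electricity-dobj": [1.0, 0.0],
    "mill-prep_to": [0.0, 1.0], "company-prep_to": [1.0, 0.0],
    "reduce-pred": [0.0, 1.0], "plant-subj": [0.0, 1.0], "cost-dobj": [1.0, 1.0],
}
context = [["reduce-pred", "plant-subj", "cost-dobj"]]
score_mill = baseline_score(["provide-pred", "electricity-dobj", "mill-prep_to"], context, emb, 2)
score_company = baseline_score(["provide-pred", "electricity-dobj", "company-prep_to"], context, emb, 2)
```

Because the representation is a plain sum, the only trace of argument position is the role suffix on each token.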
To see which of the entity salience features are important, we conduct an ablation test with the EVENTCOMP-8M model on ON-SHORT. From the results in Table 4, we can see that in our task, as in Dunietz and Gillick (2014), the entity mentions features, i.e., the numbers of named, nominal, pronominal, and total mentions of the entity, are most helpful. In fact, the other two features even decrease performance slightly.
We take a closer look at several of the models in Figure 5. Figure 5a breaks down the results by the argument type of the removed argument. On subjects, the EVENTWORD2VEC baseline matches the performance of EVENTCOMP, but not on direct objects and prepositional objects. Subjects are semantically much less diverse than the other argument types, as they are very often animate.
A similar pattern is apparent in Figure 5b, which has results by the part-of-speech tag of the head word of the removed entity. Note that an entity is a coreference chain, not a single mention; so when the head word is a pronoun, this is an entity which has only pronoun mentions. A pronoun entity provides little semantic content beyond, again, animacy. And again, EVENTWORD2VEC performs well on pronoun entities, but less so on entities described by a noun. It seems that EVENTWORD2VEC can pick up on a coarse-grained pattern such as animate/inanimate, but not on more fine-grained distinctions needed to select the right noun, or to select a fitting direct object or prepositional object. This matches the fact that EVENTWORD2VEC gets a less clear signal on the task, in two respects: It gets much less information than EVENTCOMP on the distinction between argument positions, and it only looks at overall event similarity while EVENTCOMP is trained to detect narrative coherence.
Entity salience contributes greatly across all argument types and parts of speech, but more strongly on subjects and pronouns. This is again because subjects, and pronouns, are semantically less distinct, so they can only be distinguished by relative salience.
Figure 5c analyzes results by the frequency of the removed entity, that is, by its number of mentions. The MOSTFREQ baseline, unsurprisingly, only does well when the removed entity is a highly frequent one. The EVENTCOMP model is much better than MOSTFREQ at picking out the right entity when it is a rare one, as it can look at the semantic content of the entity as well as its frequency. Entity salience boosts the performance of EVENTCOMP in particular for frequent entities.
The ON-LONG dataset, as discussed in Section 5.1, consists of OntoNotes data with much longer documents than found in ON-SHORT. Evaluation results on ON-LONG are shown in Table 5. Although the overall numbers are lower than those for ON-SHORT, we are selecting from 36.95 candidates on average, more than three times as many as for ON-SHORT. Considering that the accuracy of randomly selecting an entity is as low as 2.71%, the performance of our best performing model, with an accuracy of 27.87%, is quite good.

Evaluation on G&C
The G&C data differs from the Argument Cloze data in two respects. First, not every argument position that seems to be open needs to be filled: The model must additionally make a fill / no-fill decision. Whether a particular argument position is typically filled is highly predicate-specific. As the small G&C dataset does not provide enough data to train our neural model on this task, we instead train a simple logistic classifier, the fill / no-fill classifier, with a small subset of shallow lexical features used in Gerber and Chai (2012), to make the decision. These features describe the syntactic context of the predicate. We use only 14 features; the original Gerber and Chai model had more than 80 features, and our re-implementation, described below, has around 60. The second difference is that in G&C, an event may have multiple open argument positions. In that case, the task is not just to select a candidate entity, but also to determine which of the open argument positions it should fill. So the model must do multi implicit argument prediction. We can flexibly adapt our method for training data generation to this case. In particular, we create extra negative training events, in which an argument of the positive event has been moved to another argument position in the same event, as shown in Figure 6. We can then simply train our EVENTCOMP model on this extended training data. We refer to the extra training process as multi-arg training.
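The extra negatives used in multi-arg training can be sketched as follows. The function `multi_arg_negatives` is a hypothetical name, and restricting the move to argument positions that are empty in the positive event is an assumption based on the single example given for this procedure; the text does not say how occupied positions are handled.

```python
SLOTS = ("subj", "dobj", "pobj")

def multi_arg_negatives(positive, entity_ids):
    """Create negatives by moving an entity argument of the positive event
    into another, currently empty, argument position of the same event."""
    negatives = []
    for src in SLOTS:
        if positive[src] not in entity_ids:
            continue                      # only entity arguments are moved
        for dst in SLOTS:
            if dst != src and positive[dst] is None:
                neg = dict(positive)
                neg[dst] = neg[src]       # same entity, wrong position
                neg[src] = None
                negatives.append(neg)
    return negatives

positive = {"pred": "reduce", "subj": "x2", "dobj": "cost", "pobj": None}
negatives = multi_arg_negatives(positive, {"x0", "x1", "x2"})
```

Training on such pairs penalizes the model for placing the right entity in the wrong slot, which is the distinction needed when an event has several open positions.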
Figure 6:
x0 = The company, x1 = mill, x2 = power plant
Context: ( build-pred, x0-subj, x2-dobj, - )
Positive: ( reduce-pred, x2-subj, cost-dobj, - )
Negative: ( reduce-pred, -, cost-dobj, x2-prep )

We compare our models to that of Gerber and Chai (2012). We present the evaluation results in Table 6. The original EVENTCOMP models do not perform well, which is as expected since the model is not designed to do the fill / no-fill decision and multi implicit argument prediction tasks as described above. With the fill / no-fill classifier, precision rises by around 13 points because this classifier prevents many false positives. With additional multi-arg training, F1 score improves by another 22-23 points. At this point, our model achieves a performance comparable to the much more complex G&C reimplementation GCAUTO. Adding entity salience features further boosts both precision and recall, showing that implicit arguments do tend to be filled by salient entities, as we had hypothesized. Again, more training data substantially benefits the task. Our best performing model, at 49.6 F1, clearly outperforms GCAUTO, and is comparable with the original Gerber and Chai (2012) model trained with gold features.

Conclusion
In this paper we have addressed the task of implicit argument prediction. To support training at scale, we have introduced a simple cloze task for which data can be generated automatically. We have introduced a neural model, which frames implicit argument prediction as the task of selecting the textual entity that completes the event in a maximally narratively coherent way. The model prefers salient entities, where salience is mainly defined through the number of mentions. Evaluating on synthetic data from OntoNotes, we find that our model clearly outperforms even strong baselines, that salience is important throughout for performance, and that event knowledge is particularly useful for the (more verb-specific) object and prepositional object arguments. Evaluating on the naturally occurring data from Gerber and Chai, we find that in a comparison without gold features, our model clearly outperforms the previous state-of-the-art model, where again salience information is important.
The current paper takes a first step towards predicting implicit arguments based on narrative coherence. We currently use a relatively simple model for local narrative coherence; in the future we will turn to models that can test global coherence for an implicit argument candidate. We also plan to investigate how the extracted implicit arguments can be integrated into a downstream task that makes use of event information; in particular, we would like to experiment with reading comprehension.