Joint Extraction of Events and Entities within a Document Context

Events and entities are closely related; entities are often actors or participants in events and events without entities are uncommon. The interpretation of events and entities is highly contextually dependent. Existing work in information extraction typically models events separately from entities, and performs inference at the sentence level, ignoring the rest of the document. In this paper, we propose a novel approach that models the dependencies among variables of events, entities, and their relations, and performs joint inference of these variables across a document. The goal is to enable access to document-level contextual information and facilitate context-aware predictions. We demonstrate that our approach substantially outperforms the state-of-the-art methods for event extraction as well as a strong baseline for entity extraction.


Introduction
Events are things that happen or occur; they involve entities (people, objects, etc.) that perform or are affected by the events, as well as spatio-temporal aspects of the world. Understanding events and their descriptions in text is necessary for any generally applicable machine reading system. It is also essential in facilitating practical applications such as news summarization, information retrieval, and knowledge base construction.
The interpretation of event descriptions is highly contextually dependent. To make correct predictions, a model needs to account for mentions of events and entities together with the discourse context. Consider, for example, the following excerpt from a news report: "On Thursday, there was a massive U.S. aerial bombardment in which more than 300 Tomahawk cruise missiles rained down on Baghdad. Earlier Saturday, Baghdad was again targeted. ..." The excerpt describes two U.S. attacks on Baghdad. The two event anchors (triggers) are boldfaced and the mentions of entities and spatio-temporal information are italicized. The first event anchor "aerial bombardment", along with its surrounding entity mentions "U.S.", "Tomahawk cruise missiles", and "Baghdad", describes an attack from the U.S. on Baghdad with Tomahawk cruise missiles being the weapon. The second sentence on its own contains little event-related information, but together with the context of the previous sentence, it indicates another U.S. attack on Baghdad.
State-of-the-art event extraction systems have difficulties inferring such information for two main reasons. First, they extract events and entities in separate stages: entities such as people, organizations, and locations are first extracted by a named entity tagger, and then these extracted entities are used as inputs for extracting events and their arguments (Li et al., 2013). This often causes errors to propagate. In the above example, if the entity tagger mistakenly identifies "Baghdad" as a person, then the event extractor will fail to extract "Baghdad" as the place where the attack happened. In fact, previous work (Li et al., 2013) observes that using previously extracted entities in event extraction results in a substantial decrease in performance compared to using gold-standard entity information.
Second, most existing work extracts events independently from each individual sentence, ignoring the rest of the document (Li et al., 2013; Judea and Strube, 2015; Nguyen and Grishman, 2015). Very few attempts have been made to incorporate document context for event extraction. Ji and Grishman (2008) model the information flow in two stages: the first stage trains classifiers for event triggers and arguments within each sentence; the second stage applies heuristic rules to adjust the classifiers' outputs to satisfy document-wide (or document-cluster-wide) consistency. Liao and Grishman (2010) further improved the rule-based inference by training additional classifiers for event triggers and arguments using document-level information. Both approaches only propagate the highly confident predictions from the first stage to the second stage. To the best of our knowledge, there is no unified model that jointly extracts events from sentences across the whole document.
In this paper, we propose a novel approach that simultaneously extracts events and entities within a document context (the code for our system is available at https://github.com/bishanyang/EventEntityExtractor). We first decompose the learning problem into three tractable subproblems: (1) learning the dependencies between a single event and all of its potential arguments, (2) learning the co-occurrence relations between events across the document, and (3) learning for entity extraction. Then we combine the learned models for these subproblems into a joint optimization framework that simultaneously extracts events, semantic roles, and entities in a document. In summary, our main contributions are:
1. We propose a structured model for learning within-event structures that can effectively capture the dependencies between an event and its arguments, and between the semantic roles and entity types for the arguments.
2. We introduce a joint inference framework that combines probabilistic models of within-event structures, event-event relations, and entity extraction for joint extraction of the set of entities and events over the whole document.
3. We conduct extensive experiments on the Automatic Content Extraction (ACE) corpus, and show that our approach significantly outperforms the state-of-the-art methods for event extraction and a strong baseline for entity extraction.

Task Definition
We adopt the ACE definition for entities (LDC, 2005a) and events (LDC, 2005b):
• Entity mention: An entity is an object or set of objects in the world. An entity mention is a reference to an entity in the form of a noun phrase or a pronoun.
• Event trigger: the word or phrase that most clearly expresses the occurrence of an event. Event triggers can be verbs, nouns, and occasionally adjectives like "dead" or "bankrupt".
• Event argument: event arguments are entities that fill specific roles in the event. They mainly include participants (i.e., the entities that are involved in the event), general event attributes such as place and time, and some event-type-specific attributes that have certain values (e.g., JOB-TITLE, CRIME).
We are interested in extracting entity mentions, event triggers, and event arguments. We consider ACE entity types PER, ORG, GPE, LOC, FAC, VEH, WEA and ACE VALUE and TIME expressions, and focus on 33 ACE event subtypes, each of which has its own set of semantic roles for the potential arguments. There are 35 such roles in total, but we collapse 8 of them that are time-related (e.g., TIME-HOLDS, TIME-AT-END) into one, because most of these roles have very few training examples; this leaves 28 roles. Figure 1 shows an example of ACE annotations for events and entities in a sentence. Note that not every entity mention in the sentence is involved in events and a single entity mention can be associated with multiple events.

Figure 1: Each event trigger has an event subtype marked above it and each entity mention has an entity type marked above it. Each event trigger evokes an event with semantic roles that are filled by entity mentions. The roles are marked on the links between event trigger and entity mentions. For example, "conviction" evokes a CONVICT event, and has the CRIME and DEFENDANT roles filled by "blasphemy" and "Christian man" respectively.
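The task definition above can be pictured as a small data model. The following is only an illustrative sketch: the class names, field names, character offsets, and the entity types assigned to "Christian man" and "blasphemy" are assumptions made for the example, not part of ACE or of our system. The role-collapsing helper also assumes that all time-related roles share a "TIME-" prefix.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Span = Tuple[int, int]  # hypothetical (start, end) character offsets


@dataclass(frozen=True)
class EntityMention:
    span: Span
    text: str
    entity_type: str  # PER, ORG, GPE, LOC, FAC, VEH, WEA, VALUE, or TIME


@dataclass
class EventMention:
    trigger_span: Span
    trigger_text: str
    subtype: str  # one of the 33 ACE event subtypes
    # (semantic role, entity mention) pairs; one entity may serve several events
    arguments: List[Tuple[str, EntityMention]] = field(default_factory=list)


def collapse_time_role(role: str) -> str:
    """Merge time-related roles such as TIME-HOLDS into a single TIME role.

    Assumption: every time-related role starts with the prefix "TIME-".
    """
    return "TIME" if role.startswith("TIME-") else role


# The CONVICT example from the text; offsets and entity types are made up.
man = EntityMention((0, 13), "Christian man", "PER")
crime = EntityMention((40, 49), "blasphemy", "VALUE")
conviction = EventMention((25, 35), "conviction", "CONVICT",
                          [("DEFENDANT", man), ("CRIME", crime)])
```

Keeping arguments as (role, mention) pairs on the event, rather than storing a role on the mention, reflects the observation that a single entity mention can fill roles in several events.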

Approach
In this section, we describe our approach for joint extraction of events and entities within a document context. We first decompose the learning problem into three tractable subproblems: learning within-event structures, learning event-event relations, and learning for entity extraction. We will describe the probabilistic models for learning these subproblems. Then we present a joint inference framework that integrates these learned models into a single model to jointly extract events and entities across a document.

Learning Within-event Structures
As described in Section 2, a mention of an event consists of an event trigger and a set of event arguments. Each event argument is also an entity mention with an entity type. In the following, we develop a probabilistic model to learn this dependency structure for each individual event mention.
Given a document $x$, we first generate a set of event trigger candidates $T$ and a set of entity candidates $N$. For each trigger candidate $i \in T$, we associate it with a discrete variable $t_i$ that takes values from the 33 ACE event types and a NONE class indicating other events or no events. Denote the set of entity candidates that are potential arguments for trigger candidate $i$ as $N_i$. For each $j \in N_i$, we associate it with a discrete variable $r_{ij}$, which models the event-argument relation between trigger candidate $i$ and entity candidate $j$. It takes values from 28 semantic roles and a NONE class indicating invalid roles. Each argument candidate $j$ is also associated with an entity type variable $a_j$, which takes values from 9 entity types and a NONE class indicating invalid entity types. We define the joint distribution of variables $t_i$, $r_{i\cdot} = \{r_{ij}\}_{j \in N_i}$, and $a_\cdot = \{a_j\}_{j \in N_i}$ conditioned on the observations, which can be factorized according to the factor graph shown in Figure 2:

$$p_\theta(t_i, r_{i\cdot}, a_\cdot \mid i, N_i, x) \propto \exp\big(\theta_1^\top f_1(t_i, i, x)\big) \prod_{j \in N_i} \exp\big(\theta_2^\top f_2(r_{ij}, i, j, x) + \theta_3^\top f_3(t_i, r_{ij}, i, j, x) + \theta_4^\top f_4(a_j, j, x) + \theta_5^\top f_5(r_{ij}, a_j, j, x)\big)$$

where $\theta_1, \ldots, \theta_5$ are vectors of parameters that need to be estimated, and $f_1, \ldots, f_5$ are different forms of feature functions which we will describe later.

Note that not all configurations of the variables are valid in our model. Based on the definitions in Section 2, each event type takes arguments with certain semantic roles. For example, the arguments of the event MARRY can only play the roles of PERSON, TIME, and PLACE. In addition, a NONE event type should not take any arguments. Similarly, each semantic role should be filled with entities with compatible types. For example, the PERSON role type can only be filled with an entity of type PER. However, a NONE role type can be filled with an entity of any type. To account for these compatibility constraints, we enforce the probabilities of all invalid configurations to be zero.
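To make the factorization and the compatibility constraints concrete, the sketch below computes the joint distribution for one trigger candidate by brute-force enumeration over a tiny label set. It is a toy under stated assumptions, not our implementation: the label inventories are cut down to a few values, the scores stand in for the dot products $\theta_k^\top f_k$, the PLACE-to-GPE compatibility entry is assumed for the example, and exhaustive enumeration replaces the sum-product algorithm that we use in training.

```python
import itertools
import math

EVENT_TYPES = ["MARRY", "NONE"]
ROLES = ["PERSON", "PLACE", "NONE"]
ENTITY_TYPES = ["PER", "GPE", "NONE"]

# Toy compatibility tables. MARRY -> {PERSON, PLACE} and PERSON -> {PER}
# follow the examples in the text; PLACE -> {GPE} is an assumption.
VALID_ROLES = {"MARRY": {"PERSON", "PLACE", "NONE"}, "NONE": {"NONE"}}
VALID_ENTITY_TYPES = {"PERSON": {"PER"}, "PLACE": {"GPE"},
                      "NONE": set(ENTITY_TYPES)}


def is_valid(t, roles, ents):
    """Check the event-role and role-entity-type compatibility constraints."""
    return all(r in VALID_ROLES[t] and a in VALID_ENTITY_TYPES[r]
               for r, a in zip(roles, ents))


def joint_distribution(s1, s2, s3, s4, s5):
    """Return p(t, r, a) for one trigger by enumerating valid configurations.

    s1[t], s2[j][r], s4[j][a] are unary scores; s3[(t, r)] and s5[(r, a)]
    are pairwise scores (missing pairs score 0). Each score stands in for
    a dot product theta_k . f_k.
    """
    n_args = len(s2)
    weights = {}
    for t in EVENT_TYPES:
        for roles in itertools.product(ROLES, repeat=n_args):
            for ents in itertools.product(ENTITY_TYPES, repeat=n_args):
                if not is_valid(t, roles, ents):
                    continue  # invalid configurations get probability zero
                score = s1[t]
                for j, (r, a) in enumerate(zip(roles, ents)):
                    score += (s2[j][r] + s3.get((t, r), 0.0)
                              + s4[j][a] + s5.get((r, a), 0.0))
                weights[(t, roles, ents)] = math.exp(score)
    z = sum(weights.values())
    return {cfg: w / z for cfg, w in weights.items()}


def uniform_scores(n_args):
    """All-zero scores for a trigger with n_args argument candidates."""
    s1 = {t: 0.0 for t in EVENT_TYPES}
    s2 = [{r: 0.0 for r in ROLES} for _ in range(n_args)]
    s4 = [{a: 0.0 for a in ENTITY_TYPES} for _ in range(n_args)]
    return s1, s2, {}, s4, {}
```

With all-zero scores and a single argument candidate, `joint_distribution(*uniform_scores(1))` spreads the probability mass uniformly over the valid configurations only; a configuration such as a NONE event with a PERSON argument never appears in the result.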
Features. $f_1$, $f_2$, and $f_4$ are unary feature functions that depend on trigger variable $t_i$, argument variable $r_{ij}$, and entity variable $a_j$ respectively. We construct a set of features for each feature function (see Table 1). Many of these features overlap with those used in previous work (Li et al., 2013; Li et al., 2014), except for the word embedding features for triggers and the features for entities which are derived from multiple entity resources. $f_3$ and $f_5$ are pairwise feature functions that depend on trigger-argument pair $(t_i, r_{ij})$ and argument-entity pair $(r_{ij}, a_j)$ respectively. We consider simple indicator functions $\mathbf{1}_{t,r}$ and $\mathbf{1}_{r,a}$ as features ($\mathbf{1}_y(x)$ equals 1 when $x = y$ and 0 otherwise).
Training. For model training, we find the optimal parameters $\theta$ using maximum-likelihood estimation with L2 regularization:

$$\theta^* = \arg\max_\theta \; \mathcal{L}(\theta) - \lambda \lVert \theta \rVert_2^2, \qquad \mathcal{L}(\theta) = \sum_i \log p_\theta(t_i, r_{i\cdot}, a_\cdot \mid i, N_i, x),$$

where $\lambda$ is the regularization weight. We use L-BFGS to optimize the training objective. To calculate the gradient, we use the sum-product algorithm to compute the exact marginals for the unary cliques $t_i$, $r_{ij}$, $a_j$ and the pairwise cliques $(t_i, r_{ij})$, $(r_{ij}, a_j)$. Typically the training complexity for graphical models with unary and pairwise cliques is quadratic in the size of the label set. However, the complexity of our model is much lower than that since we only need to compute the joint distributions over valid variable configurations. Denote the number of event subtypes as $T$, the number of event argument roles as $N$, the average number of argument roles for each event subtype as $k_1$, the average number of entity types for each event argument as $k_2$, and the average number of argument candidates for each trigger candidate as $M$. The complexity of computing the joint distribution is $O(M \times (k_1 T + k_2 N))$, and $k_1$ and $k_2$ are expected to be small in practice ($k_1 = 6$, $k_2 = 3$ in ACE).
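As a rough illustration of the savings, the snippet below counts pairwise table entries per argument candidate with and without the validity constraints. The dense count is an assumption about what an unconstrained model would enumerate (all event-role pairs plus all role-entity-type pairs, ignoring the NONE labels); only the sparse count corresponds to the $k_1 T + k_2 N$ term above.

```python
def pairwise_entries_per_argument(T, N, E, k1, k2):
    """Count pairwise label combinations for one argument candidate.

    T: event subtypes, N: argument roles, E: entity types,
    k1: average valid roles per subtype, k2: average valid types per role.
    """
    dense = T * N + N * E      # every (event, role) and (role, entity type) pair
    sparse = k1 * T + k2 * N   # only the valid pairs
    return dense, sparse


# ACE-sized label sets from the text: 33 subtypes, 28 roles, 9 entity types.
dense, sparse = pairwise_entries_per_argument(T=33, N=28, E=9, k1=6, k2=3)
print(dense, sparse)  # 1176 282
```

Under these assumptions the constrained model touches roughly a quarter of the entries per argument candidate.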

Learning Event-Event Relations
So far we have described a model for learning structures for a single event. However, the inference of the event types for individual events may depend on other events that are mentioned in the document. For example, an ATTACK event is more likely to occur with INJURE and DIE events than with life events like MARRY and BORN. In order to capture this intuition, we develop a pairwise model of event-event relations in a document.
Our training data consists of all pairs of trigger candidates that co-occur in the same sentence or are connected by a coreferent subject/object if they are in different sentences. We want to propagate information between these trigger pairs since they are more likely to be related.
Formally, given a trigger candidate pair $(i, i')$, we estimate the probabilities for their event types $t_i$ and $t_{i'}$ as

$$p_\phi(t_i, t_{i'} \mid i, i', x) \propto \exp\big(\phi^\top g(t_i, t_{i'}, i, i', x)\big)$$

where $\phi$ is a vector of parameters and $g$ is a feature function that depends on the trigger candidate pair and their context. We consider both trigger-specific features and relational features. For trigger-specific features, we use the same trigger features listed in Table 1. For relational features, we consider for each pair of trigger candidates: (1) whether they are connected by a conjunction dependency relation (based on dependency parsing); (2) whether they share a subject or an object (based on dependency parsing and coreference resolution); (3) whether they have the same head word lemma; (4) whether they share a semantic frame based on FrameNet. During training, we use L-BFGS to compute the maximum-likelihood estimates of $\phi$.
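The pairwise model amounts to a softmax over pairs of event types. The sketch below shows that computation for a toy label set; the function names, the sparse-dictionary feature representation, and the single type-pair feature are assumptions made for illustration, and the real feature function $g$ would also include the trigger-specific and relational features described above.

```python
import math
from itertools import product


def pair_distribution(labels, phi, features):
    """Log-linear distribution over event-type pairs for one trigger pair.

    phi maps feature names to weights; features(t, t2) returns a dict of
    feature name -> value for the candidate label pair (t, t2).
    """
    scores = {
        (t, t2): sum(phi.get(name, 0.0) * value
                     for name, value in features(t, t2).items())
        for t, t2 in product(labels, repeat=2)
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {pair: math.exp(s - top) for pair, s in scores.items()}
    z = sum(exp_scores.values())
    return {pair: v / z for pair, v in exp_scores.items()}


def toy_features(t, t2):
    """Hypothetical feature: an indicator for the ordered pair of types."""
    return {f"types={t}|{t2}": 1.0}


labels = ["ATTACK", "DIE", "MARRY"]
phi = {"types=ATTACK|DIE": 1.0}  # made-up weight favoring ATTACK with DIE
probs = pair_distribution(labels, phi, toy_features)
```

With this made-up weight, the pair (ATTACK, DIE) receives more probability than any other pair, which is the kind of co-occurrence preference the model is meant to learn from data.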

Entity Extraction
Table 1:
Trigger
  [...]
  5. semantic frames that associate with the head word and its part-of-speech tag, based on FrameNet (Li et al., 2014)
  6. pre-trained vector for the head word (Mikolov et al., 2013)
  Syntactic resources: Stanford parser
  7. dependency edges involving the head word, both lexicalized and unlexicalized
  8. whether the head word is a pronoun
Argument
  Lexical resources: WordNet
  1. lemmas of the words in the entity mention
  2. lemmas of the words in the trigger mention
  3. words between the entity mention and the trigger mention
  Syntactic resources: Stanford parser
  4. the relative position of the entity mention to the trigger mention (before, after, or contain)
  5. whether the entity mention and the trigger mention are in the same clause
  6. the shortest dependency paths between the entity mention and the trigger mention
Entity
  Entity resources: Stanford NER, NELL KB
  1. gender and animacy attributes of the entity mention
  2. Stanford NER type for the entity mention
  3. semantic type for the entity mention based on the NELL knowledge base (Mitchell et al., 2015)
  4. predicted entity type and confidence score for the entity mention output by the entity extractor described in Section 3.3

For entity extraction, we trained a standard linear-chain Conditional Random Field (CRF) (Lafferty et al., 2001). Its features include (1) current words and part-of-speech tags; (2) context words in a window of size 2; (3) word type such as all-capitalized, is-capitalized, and all-digits; (4) gazetteer-based entity type if the current word matches an entry in the gazetteers collected from Wikipedia (Ratinov and Roth, 2009). In addition, we consider pre-trained word embeddings (Mikolov et al., 2013) as dense features for each word in order to improve the generalizability of the model.
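The token-level CRF features (1) to (4) can be sketched as a simple feature function. This is an illustrative approximation rather than our feature extractor: the function names, feature keys, lowercasing, and the dictionary-based gazetteer lookup are assumptions, and the dense word-embedding features are omitted.

```python
def word_type(word):
    """Coarse orthographic class of a token."""
    if word.isdigit():
        return "all-digits"
    if word.isupper():
        return "all-capitalized"
    if word[:1].isupper():
        return "is-capitalized"
    return "other"


def token_features(words, pos_tags, i, gazetteer, window=2):
    """Features for token i: word, POS tag, word type, context, gazetteer."""
    feats = {
        "word": words[i].lower(),
        "pos": pos_tags[i],
        "word_type": word_type(words[i]),
    }
    # Context words within the window, keyed by signed offset.
    for offset in range(-window, window + 1):
        k = i + offset
        if offset != 0 and 0 <= k < len(words):
            feats[f"word[{offset:+d}]"] = words[k].lower()
    # Gazetteer-based entity type, if the current word has an entry.
    entity_type = gazetteer.get(words[i].lower())
    if entity_type is not None:
        feats["gazetteer"] = entity_type
    return feats


words = ["Earlier", "Saturday", ",", "Baghdad", "was", "again", "targeted"]
pos_tags = ["RB", "NNP", ",", "NNP", "VBD", "RB", "VBN"]
gazetteer = {"baghdad": "GPE"}  # hypothetical one-entry gazetteer
feats = token_features(words, pos_tags, 3, gazetteer)
```

A dictionary of named features per token is the input shape that common CRF toolkits accept, which is why the sketch returns one rather than a fixed-length vector.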

Joint Inference
Our end goal is to extract coherent event mentions and entity mentions across a document. To achieve this, we propose a joint inference approach that allows information flow among the three local models and finds globally-optimal assignments of all variables, including the trigger variables $t$, the argument role variables $r$, and the entity variables $a$. Specifically, we define the following objective:

$$\max_{t, r, a} \; \sum_{i \in T} E(t_i, r_{i\cdot}, a_\cdot) + \sum_{i, i' \in T} R(t_i, t_{i'}) + \sum_{j \in N} D(a_j)$$

The first term is the sum of confidence scores for individual event mentions based on the parameter estimates from the within-event model. $E(t_i, r_{i\cdot}, a_\cdot)$ can be further decomposed into three parts.
The second term is the sum of confidence scores for event relations based on the parameter estimates from the pairwise event model, where $R(t_i, t_{i'}) = \log p_\phi(t_i, t_{i'} \mid i, i', x)$. The third term is the sum of confidence scores for entity mentions, where $D(a_j) = \log p_\psi(a_j \mid j, x)$ and $p_\psi(a_j \mid j, x)$ is the marginal probability derived from the linear-chain CRF described in Section 3.3. The optimization is subject to agreement constraints that enforce the overlapping variables among the three components to agree on their values. The joint inference problem can be formulated as an integer linear program (ILP). To solve it efficiently, we find solutions for the relaxation of the problem using a dual decomposition algorithm, AD$^3$ (Martins et al., 2011). AD$^3$ has been shown to be orders of magnitude faster than a general purpose ILP solver in practice (Das et al., 2012). It is also particularly suitable for our problem since it involves decompositions that have many overlapping simple factors. We observed that AD$^3$ recovers the exact solutions for all the test documents in our experiments and the runtime for labeling each document is only three seconds on average on a 64-bit machine with two 2GHz CPUs and 8GB of RAM.
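The effect of the joint objective can be seen on a toy version of the Baghdad example from the introduction. The sketch below maximizes the sum of the within-event, event-event, and entity terms by exhaustive search, which is a stand-in for AD$^3$ that is only feasible at this scale. All probabilities, label sets, and function names are made up for the illustration, and sharing one variable per trigger and per entity across the three terms plays the role of the agreement constraints.

```python
import math
from itertools import product

NEG_INF = float("-inf")

# Hypothetical local probabilities for the two triggers and one entity.
P_TRIGGER = {"bombardment": {"ATTACK": 0.9, "NONE": 0.1},
             "targeted": {"ATTACK": 0.4, "NONE": 0.6}}
P_ROLE = {("ATTACK", "PLACE"): 0.7, ("ATTACK", "NONE"): 0.3,
          ("NONE", "NONE"): 1.0}
P_PAIR = {("ATTACK", "ATTACK"): 0.6, ("ATTACK", "NONE"): 0.15,
          ("NONE", "ATTACK"): 0.15, ("NONE", "NONE"): 0.1}
P_ENTITY = {"Baghdad": {"GPE": 0.8, "PER": 0.2}}


def E(i, t, roles, ents):
    """Within-event score; -inf marks an invalid configuration."""
    score = math.log(P_TRIGGER[i][t])
    for j, r in roles.items():
        if (t, r) not in P_ROLE or (r == "PLACE" and ents[j] != "GPE"):
            return NEG_INF
        score += math.log(P_ROLE[(t, r)])
    return score


def R(t, t2):
    """Event-event score for a pair of event types."""
    return math.log(P_PAIR[(t, t2)])


def D(j, a):
    """Entity score for entity j taking type a."""
    return math.log(P_ENTITY[j][a])


def joint_map(triggers, entities, pairs, use_pairs=True):
    """Maximize sum_i E + sum_(i,i') R + sum_j D by exhaustive search."""
    slots = [(i, j) for i in triggers for j in entities]
    best, best_score = None, NEG_INF
    for t in product(["ATTACK", "NONE"], repeat=len(triggers)):
        t_of = dict(zip(triggers, t))
        for a in product(["GPE", "PER"], repeat=len(entities)):
            a_of = dict(zip(entities, a))
            for r in product(["PLACE", "NONE"], repeat=len(slots)):
                r_of = dict(zip(slots, r))
                score = sum(D(j, a_of[j]) for j in entities)
                score += sum(
                    E(i, t_of[i], {j: r_of[(i, j)] for j in entities}, a_of)
                    for i in triggers)
                if use_pairs:
                    score += sum(R(t_of[i], t_of[k]) for i, k in pairs)
                if score > best_score:
                    best, best_score = (t_of, r_of, a_of), score
    return best, best_score
```

With these made-up numbers, calling `joint_map(["bombardment", "targeted"], ["Baghdad"], [("bombardment", "targeted")])` labels both triggers as ATTACK, whereas passing `use_pairs=False` leaves "targeted" as NONE: the event-event term carries evidence from the first sentence into the second.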

Experiments
We conduct experiments on the ACE2005 corpus. It contains text documents from a variety of sources such as newswire reports, weblogs, and discussion forums. We use the same data split as in Li et al. (2013). Table 2 shows the data statistics.
We adopt the evaluation metrics for events as defined in Li et al. (2013). An event trigger is correctly identified if its offsets match those of a gold-standard trigger; and it is correctly classified if its event subtype (33 in total) also matches the subtype of the gold-standard trigger. An event argument is correctly identified if its offsets and event subtype match those of any of the reference argument mentions in the document; and it is correctly classified if its semantic role (28 in total) is also correct. For entities, a predicted mention is correctly extracted if its head offsets and entity type (9 in total) match those of the reference entity mention.
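These matching criteria reduce to set comparisons over tuples. The sketch below computes pooled (micro-averaged) precision, recall, and F1 for trigger identification and classification; the tuple layout, document identifier, offsets, and labels are assumptions chosen for the example.

```python
def prf1(predicted, gold):
    """Precision, recall, and F1 over exact-match items pooled across documents."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical triggers as (doc_id, start, end, subtype).
gold = [("d1", 5, 7, "ATTACK"), ("d1", 12, 13, "DIE")]
pred = [("d1", 5, 7, "ATTACK"), ("d1", 12, 13, "INJURE"), ("d1", 20, 21, "MARRY")]

# Identification compares offsets only; classification also compares subtypes.
identification = prf1([p[:3] for p in pred], [g[:3] for g in gold])
classification = prf1(pred, gold)
```

Here identification yields precision 2/3 and recall 1, while classification yields precision 1/3 and recall 1/2, because the second trigger has the right offsets but the wrong subtype.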
Note that our approach requires entity mention candidates and event trigger candidates as input. Instead of enumerating all possible text spans, we generate high-quality entity mentions from the k-best predictions of our CRF entity extractor (in Section 3.3). 7 Similarly, we train a CRF for event trigger extraction using the same features except for the gazetteers, and generate trigger candidates based on the k-best predictions. We set k = 50 for entities and k = 10 for event triggers based on performance on the development set. They cover 92.3% of the gold-standard entity mentions and 96.3% of the gold-standard event triggers in the test set.
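Candidate generation and the coverage figures can be pictured as a union over the k-best outputs followed by a set intersection with the gold spans. The sketch below assumes each of the k predictions has already been decoded into a set of (start, end) spans; the function name and the toy spans are hypothetical.

```python
def candidate_coverage(kbest_span_sets, gold_spans):
    """Pool spans from the k-best predictions and measure gold coverage."""
    candidates = set().union(*kbest_span_sets)
    gold = set(gold_spans)
    coverage = len(gold & candidates) / len(gold) if gold else 1.0
    return candidates, coverage


# Three hypothetical predictions (k = 3), each a set of (start, end) spans.
kbest = [{(0, 1), (5, 7)}, {(0, 1), (5, 6)}, {(0, 2)}]
gold = {(0, 1), (5, 6), (9, 10)}
candidates, coverage = candidate_coverage(kbest, gold)
```

In this toy case the pooled candidate set has four spans and covers two of the three gold spans; raising k trades a larger candidate set for higher coverage, which is why k is tuned on the development set.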

Results
Event Extraction. We compare the proposed models WITHINEVENT (in Section 3.1) and JOINTEVENTENTITY (in Section 3.4) with two strong baselines. One is JOINTBEAM (Li et al., 2013), a state-of-the-art event extractor that uses a structured perceptron with beam search for sentence-level joint extraction of event triggers and arguments. The other is STAGEDMAXENT, a typical two-stage approach that detects event triggers first and then event arguments. We use the same event trigger candidates and entity mention candidates as input to all the compared models except for JOINTBEAM, because JOINTBEAM only extracts event mentions and assumes entity mentions are given. We consider a realistic experimental setting where no gold-standard annotations are available for entities during testing. To obtain results from JOINTBEAM, we ran the actual system used in Li et al. (2013) (https://github.com/oferbr/BIU-RPI-Event-Extraction-Project) using the entity mentions output by our CRF-based entity extractor. Table 3 shows the average precision, recall, and F1 score for event triggers and event arguments; we report the micro-average scores as in previous work (Li et al., 2013). We can see that our WITHINEVENT model, which explicitly models the trigger-argument dependencies and argument-role-entity-type dependencies, outperforms the MaxEnt pipeline, especially in event argument extraction. This shows that modeling the trigger-argument dependencies is effective in reducing error propagation.

Table 3: Event extraction results on the ACE2005 test set. * indicates that the difference in F1 compared to the second best model (WITHINEVENT) is statistically significant (p < 0.05).

Compared to the state-of-the-art event extractor JOINTBEAM, the improvements introduced by WITHINEVENT are substantial in both event triggers and event arguments. We believe there are two main reasons: (1) WITHINEVENT considers all possible joint trigger/argument label assignments, whereas JOINTBEAM considers only a subset of the possible assignments based on a heuristic beam search. More specifically, when predicting labels for token i, JOINTBEAM considers only the K-best (K = 4 in their paper) partial trigger/argument label configurations for the previous i − 1 tokens. As the length of the sentence increases, a large amount of information will be thrown away. (2) WITHINEVENT models argument-role-entity-type dependencies, whereas JOINTBEAM assumes the entity types are given. This can cause error propagation.

Table 4:
Model                               Trigger   Arg
CROSS-DOC (Ji and Grishman, 2008)   67.3      42.6
CNN (Nguyen and Grishman, 2015)     67.6      -
JOINTEVENTENTITY                    68.7      48.4

7 [...] parts and consider the k-best predictions for each part.
JOINTEVENTENTITY provides the best performance among all the models on all evaluation categories. It boosts both precision and recall compared to WITHINEVENT. This demonstrates the advantages of JOINTEVENTENTITY in allowing information propagation across event mentions and entity mentions and making more context-aware and semantically coherent predictions. (All significance tests reported in this paper were computed using the paired bootstrap procedure (Berg-Kirkpatrick et al., 2012).)
We also compare the results of JOINTEVENTENTITY with the best known results on the ACE event extraction task in Table 4. CROSS-DOC (Ji and Grishman, 2008) performs cross-document inference of events using document clustering information, and CNN (Nguyen and Grishman, 2015) is a convolutional neural network for extracting event triggers at the sentence level. We see that JOINTEVENTENTITY outperforms both models and achieves new state-of-the-art results for event trigger and argument extraction in an end-to-end evaluation setting.
Entity Extraction. In addition to extracting event mentions, JOINTEVENTENTITY also extracts entity mentions. We compare its output with the output of a strong entity extraction baseline CRFENTITY (described in Section 3.3). Table 5 shows the (micro-)average precision, recall, and F1 score. We see that JOINTEVENTENTITY introduces a significant improvement in recall and F1. Table 6 further shows the F1 score for four major entity types PER, GPE, ORG, and TIME in ACE. The promising improvements indicate that joint modeling of events and entities allows for more accurate predictions about not only events but also entities.

Error Analysis
Table 7 divides the errors made by JOINTEVENTENTITY based on different subtasks and the classification error types in each task. For event triggers, the majority of the errors relate to missing triggers and only 3.7% involve misclassified event types (e.g., a DEMONSTRATION event is mistaken for a TRANSPORT event). Among the missing triggers, we examine the cases where the event types are correctly identified in a sentence but with incorrect triggers, and find that they account for only 5% of the cases. For event arguments, the majority of the errors relate to missing arguments and only 4.1% are about misclassified argument roles. Among the missing event arguments, 10% have correctly identified entity types.
In general, the errors for event extraction are commonly due to three reasons: (1) Lexical sparsity. For example, in the sentence "At least three members of a family ... were hacked to death ...", our model fails to detect that "hacked" triggers an ATTACK event, because it has never seen "hacked" with this sense during training. Using WordNet and pretrained word vectors may alleviate the sparsity issue. It is also important to disambiguate word senses in context. (2) Shallow understanding of context, especially long-range context. For example, given the sentence "She is being held on 50,000 dollars bail on a charge of first-degree reckless homicide ...", the model detects that "homicide" triggers an event, but fails to detect that "She" refers to the AGENT who committed the homicide. This is mainly due to the complex long-distance dependency between the trigger and the argument. (3) Use of complex language such as metaphor, idioms, and sarcasm. Addressing these phenomena is in general difficult since it requires richer background knowledge and more sophisticated inference.
For entity extraction, we find that integrating event information into entity extraction successfully improves recall and F1. However, since the ACE dataset is restricted to a limited set of events, a large portion of the sentences do not contain any event triggers or event arguments that are of interest. For these sentences, there is little or no benefit from joint modeling. We also find that some entity misclassification errors can be avoided if entity coreference information is available. We plan to investigate coreference resolution as an additional component to our joint model in future work.

Related Work
Event extraction has been mainly studied using the ACE data (Doddington et al., 2004) and biomedical data for the BioNLP shared tasks (Kim et al., 2009). To reduce task complexity, early work employs a pipeline of classifiers that extracts event triggers first, and then determines their arguments (Ahn, 2006; Björne et al., 2009). Recently, Convolutional Neural Networks have been used to improve the pipeline classifiers (Nguyen and Grishman, 2015). As pipeline approaches suffer from error propagation, researchers have proposed methods for joint extraction of event triggers and arguments, using either structured perceptron (Li et al., 2013), Markov Logic (Poon and Vanderwende, 2010), or dependency parsing algorithms (McClosky et al., 2011). However, existing joint models largely rely on heuristic search to aggressively shrink the search space. One exception is the work of Riedel and McCallum (2011), which uses dual decomposition to solve joint inference with runtime guarantees. Our work is similar to Riedel and McCallum (2011). However, there are two main differences: first, our model extracts both event mentions and entity mentions; second, it performs joint inference across sentence boundaries. Although our approach is evaluated on ACE, it can be easily adapted to BioNLP data by using appropriate features for event triggers, argument roles, and entities. We consider this as future work.
There has been work on improving event extraction by exploiting document-level context. Berant et al. (2014) exploit event-event relations, e.g., causality and inhibition, which frequently occur in biological texts. For general texts most work focuses on exploiting temporal event relations (Chambers and Jurafsky, 2008; Do et al., 2012; McClosky and Manning, 2012). For the ACE domain, there is work on utilizing event type co-occurrence patterns to propagate event classification decisions (Ji and Grishman, 2008; Liao and Grishman, 2010). Our model is similar to their work. It models the co-occurrence relations between event types (e.g., a DIE event tends to co-occur with ATTACK events and TRANSPORT events). It can be extended to handle other types of event relations (e.g., causal and temporal) by designing appropriate features. Chambers and Jurafsky (2009; 2011) learn narrative schemas by linking event verbs that have coreferring syntactic arguments. Our model also adopts this intuition to relate event triggers across sentences. In addition, each event argument is grounded by its entity type (e.g., an entity mention of type PER can only fill roles that can be played by a person).

Conclusion
In this paper, we introduce a new approach for automatic extraction of events and entities across a document. We first decompose the learning problem into three tractable subproblems: learning within-event structures, learning event-event relations, and learning for entity extraction. We then integrate these learned models into a single model that performs joint inference of all event triggers, semantic roles for events, and entities across the whole document. Experimental results demonstrate that our approach outperforms the state-of-the-art event extractors by a large margin and substantially improves a strong entity extraction baseline. For future work, we plan to integrate entity and event coreference as additional components into the joint inference framework. We are also interested in investigating the integration of more sophisticated event-event relation models of causality and temporal ordering.