Document-level Event Extraction with Efficient End-to-end Learning of Cross-event Dependencies

Fully understanding narratives often requires identifying events in the context of whole documents and modeling the relations between events. However, document-level event extraction is a challenging task, as it requires resolving event and entity coreference and capturing arguments that span across sentences. Existing works on event extraction are usually confined to extracting events from single sentences, and thus fail to capture the relationships between event mentions at the scale of a document, as well as event arguments that appear in a different sentence than the event trigger. In this paper, we propose an end-to-end model leveraging Deep Value Networks (DVN), a structured prediction algorithm, to efficiently capture cross-event dependencies for document-level event extraction. Experimental results show that our approach achieves performance comparable to CRF-based models on ACE05, while enjoying significantly higher computational efficiency.


Introduction
Narratives are accounts of a series of related events or experiences (Urdang, 1968). Extracting events from literature can help machines better understand the underlying narratives. A robust event extraction system is therefore crucial for fully understanding narratives.
Event extraction aims to identify events, composed of a trigger of a pre-defined type and the corresponding arguments, from plain text (Grishman et al., 2005). To gain full information about the extracted events, entity coreference and event coreference are important, as demonstrated in Figure 1a. These two tasks require document-level modeling. The majority of previous event extraction works focus on the sentence level (Li and Ji, 2014; Lin et al., 2020). Some later works leverage document-level features, but still extract events at the scope of a sentence (Yang and Mitchell, 2016; Zhao et al., 2018b; Wadden et al., 2019). More recently, document-level event extraction has been treated as a template-filling task. Li et al. (2020a) performs event mention extraction and the two coreference tasks independently using a pipeline approach. However, none of the previous works learn entity and event coreference jointly with event mention extraction. We hypothesize that jointly learning event mention extraction, event coreference, and entity coreference can result in richer representations and better performance.

Figure 1: Example document: "Bob was shot on the street. He reported to the police about the incident." (a) demonstrates why coreference resolution is essential for event extraction. In the second sentence, without entity coreference, an event extraction system cannot identify which real-world entity He refers to. Similarly, incident and shot would be incorrectly linked to two different real-world events without event coreference. (b) shows the importance of cross-event dependencies. The local trigger classifier falsely classifies death as type DIE. Instead, it is an EXECUTE event, as a person's life is taken away by an authority. A structured prediction model that learns cross-event interactions can potentially infer the correct event type for death, given that the preceding SENTENCE event is often carried out by authorities.
Moreover, learning cross-event dependencies is crucial for event extraction. Figure 1b shows a real example from the ACE05 dataset of how learning dependencies among event mentions can help correct errors made by local trigger classifiers. However, efficiency is a challenge when modeling such dependencies at the scale of a document. While some works attempted to capture such dependencies with conditional random fields or other structured prediction algorithms over hand-crafted features (Li et al., 2013; Lin et al., 2020), these approaches are subject to scalability issues and require a certain level of human effort. In this work, we study end-to-end learning methods for an efficient energy-based structured prediction algorithm, Deep Value Networks (DVN), for document-level event extraction.
The contribution of this work is two-fold. First, we propose a document-level event extraction model, DEED (Document-level Event Extraction with DVN). DEED utilizes DVN to capture cross-event dependencies while simultaneously handling event mention extraction, event coreference, and entity coreference. Using gradient ascent to produce structured trigger predictions, DEED enjoys a significant efficiency advantage in capturing inter-event dependencies. Second, to accommodate evaluation at the document level, we propose two evaluation metrics for document-level event extraction. Experimental results show that the proposed approach achieves comparable performance with much better training and inference efficiency than strong baselines on the ACE05 dataset.

Related Works
In this section, we summarize existing works on document-level information extraction and event extraction, and the application of structured prediction to event extraction tasks.

Document-level Information Extraction
Information extraction (IE) was mostly studied at the scope of a sentence by early works (Ju et al., 2018; Qin et al., 2018; Stanovsky et al., 2018). Recently, there has been increasing interest in extracting information at the document level. Jia et al. (2019) proposed a multiscale mechanism that aggregates mention-level representations into entity-level representations for document-level N-ary relation extraction. Jain et al. (2020) presented a dataset for salient entity identification and document-level N-ary relation extraction in the scientific domain. Li et al. (2020b) utilized a sequence labeling model with feature extractors at different levels for document-level relation extraction in the biomedical domain. Hu et al. (2020) leveraged contextual information of multi-token entities for document-level named entity recognition. The few studies that tackle document-level event extraction are reviewed next.
Document-level Event Extraction Similar to other IE tasks, most event extraction methods make predictions within sentences. Initial attempts at event extraction relied on hand-crafted features and a pipeline architecture (Ahn, 2006; Gupta and Ji, 2009; Li et al., 2013). Later studies gained significant improvements from neural approaches, especially large pre-trained language models (Wadden et al., 2019; Nguyen et al., 2016; Lin et al., 2020; Balali et al., 2020). Recently, event extraction at the document level has gained more attention. Yang et al. (2018) proposed a two-stage framework for Chinese financial event extraction: 1) sentence-level sequence tagging, and 2) document-level key event detection and heuristic-based argument completion. Zheng et al. (2019) transforms tabular event data into entity-based directed acyclic graphs to tackle the argument-scattering challenge. Du and Cardie (2020) employed a multi-granularity reader to aggregate representations from different levels of granularity. However, none of these approaches handle entity coreference and event coreference jointly. Our work focuses on extracting events at the scope of a document, while jointly resolving both event and entity coreference.

Structured Prediction on Event Extraction
Existing event extraction systems integrating structured prediction typically use conditional random fields (CRFs) to capture dependencies between predicted events (Wang et al., 2018). However, CRFs are only applicable to modeling linear dependencies, and suffer from scalability issues, as the computation cost grows at least quadratically in the size of the label set. Another line of solutions incorporates beam search with structured prediction algorithms. Li et al. (2013) leveraged a structured perceptron to learn from hand-crafted global features. Lin et al. (2020) adopted hand-crafted global features with a global scoring function and uses beam search for inference. While these structured prediction methods can model beyond linear dependencies and alleviate the scalability issue, they require a pre-defined order for running beam search. In contrast, our method addresses both issues by adopting an efficient structured prediction algorithm, Deep Value Networks, which runs in time linear in the size of the label set and does not require a pre-defined decoding order.
Document-level Event Extraction

Task Definition
The input to the document-level event extraction task is a document of tokens D = {d_1, d_2, ..., d_m}, with spans S = {s_1, s_2, ..., s_n} generated by enumerating k-grams in each sentence (Wadden et al., 2019). Our model aims to jointly solve event mention extraction, event coreference, and entity coreference.
Event Mention Extraction refers to the subtask of 1) identifying event triggers in D by predicting the event type for each token d_i, and 2) given each trigger, extracting the corresponding arguments in S and their argument roles. This task is similar to the sentence-level event extraction task addressed by previous studies (Wadden et al., 2019; Lin et al., 2020). The difference is that we require extracting the full spans of all name, nominal, and pronoun arguments, while these works focus on extracting the head spans of name arguments.

Entity Coreference aims to find which entity mentions refer to the same entity. Our model predicts the most likely antecedent span s_j for each span s_i.

Event Coreference is to recognize event mentions that are coreferent with each other. Similar to entity coreference, we predict the most likely antecedent trigger d_j for each predicted trigger d_i.

Entity Extraction is performed as an auxiliary subtask for richer representations. Each entity mention corresponds to a span s_i in S.

Task Evaluation
Evaluation metrics used by previous sentence-level event extraction studies (Wadden et al., 2019; Zheng et al., 2019; Lin et al., 2020) are not suitable for our task, as event coreference and entity coreference are not considered. One prior metric evaluates entity coreference using bipartite matching; however, it does not consider event coreference or less informative arguments (nominals and pronouns). As a solution, we propose two metrics, DOCTRIGGER and DOCARGUMENT, to properly evaluate event extraction at the document level. The purpose is to conduct evaluation on event coreference clusters and argument coreference clusters. DOCTRIGGER considers trigger span, event type, and event coreference. Triggers in the same event coreference chain are clustered together. The metric first aligns gold and predicted trigger clusters, and computes a matching score between each gold-predicted trigger cluster pair. A predicted trigger cluster gets full score if all the associated triggers are correctly identified. To enforce the constraint that one gold trigger cluster can be mapped to at most one predicted trigger cluster, the Kuhn-Munkres algorithm (Kuhn, 1955) is adopted. DOCARGUMENT considers argument span, argument role, and entity coreference. We define an argument cluster as an argument with its coreferent entity mentions. Similar to DOCTRIGGER, DOCARGUMENT uses the Kuhn-Munkres algorithm to align gold and predicted argument clusters, and computes a matching score between each argument cluster pair. An event extraction system gets full credit on DOCARGUMENT as long as it identifies the most informative coreferent entity mentions and does not predict false positive coreferent entity mentions. 1 Details of the evaluation metrics are included in Appendix C.
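As an illustration of the alignment step shared by both metrics, the sketch below aligns gold and predicted trigger clusters one-to-one. The overlap-based matching score is a simplifying assumption (the actual DOCTRIGGER score also checks spans and event types), and the brute-force search over permutations stands in for the polynomial-time Kuhn-Munkres algorithm, which finds the same optimum:

```python
from itertools import permutations

def cluster_score(gold, pred):
    """Illustrative matching score between one gold and one predicted
    trigger cluster: fraction of gold triggers correctly identified."""
    if not gold:
        return 0.0
    return len(set(gold) & set(pred)) / len(gold)

def align_clusters(gold_clusters, pred_clusters):
    """One-to-one alignment maximizing the total matching score.
    Brute force over permutations for small inputs; Kuhn-Munkres
    reaches the same optimum in polynomial time."""
    n = max(len(gold_clusters), len(pred_clusters))
    # Pad with empty clusters so every permutation is a full assignment,
    # enforcing the at-most-one mapping constraint.
    gold = gold_clusters + [[]] * (n - len(gold_clusters))
    pred = pred_clusters + [[]] * (n - len(pred_clusters))
    best, best_pairs = -1.0, []
    for perm in permutations(range(n)):
        pairs = [(i, perm[i]) for i in range(n)]
        total = sum(cluster_score(gold[i], pred[j]) for i, j in pairs)
        if total > best:
            best, best_pairs = total, pairs
    return best, best_pairs

gold = [["shot", "incident"], ["resignation"]]
pred = [["shot"], ["resignation"], ["aid"]]
score, pairs = align_clusters(gold, pred)
```

A production implementation would replace the permutation loop with an assignment solver such as scipy's `linear_sum_assignment`.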

Proposed Approach
We develop a base model that makes independent predictions for each subtask under a multi-task IE framework. The proposed end-to-end framework, DEED, then incorporates DVN into the base model to efficiently capture cross-event dependencies.

Base Model
Our BASE model is built on a span-based IE framework, DYGIE++ (Wadden et al., 2019). DYGIE++ learns entity classification, entity coreference, and event extraction jointly. The base model extends the entity coreference module of DYGIE++ to handle event coreference.
Encoding Ideally, we want to encode all tokens in a document D = {d_1, d_2, ..., d_m} with embeddings that cover the context of the entire document. However, due to hardware limitations for long documents, each document is split into multi-sentences, where each multi-sentence corresponds to a chunk of consecutive sentences. We obtain rich contextualized embeddings e = {e_1, e_2, ..., e_n} for the tokens of each multi-sentence using BERT-BASE (Devlin et al., 2019).
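The chunking step might be sketched as follows; the greedy grouping and the token budget are assumptions, since the exact chunking procedure is an implementation detail not specified in the paper:

```python
def make_multisentences(sentences, max_tokens=512):
    """Greedily group consecutive sentences into chunks
    ("multi-sentences") whose total token count stays under the
    encoder's length limit."""
    chunks, current, current_len = [], [], 0
    for sent in sentences:  # each sentence is a list of tokens
        if current and current_len + len(sent) > max_tokens:
            chunks.append(current)
            current, current_len = [], 0
        current.extend(sent)
        current_len += len(sent)
    if current:
        chunks.append(current)
    return chunks

doc = [["Bob", "was", "shot", "."],
       ["He", "reported", "the", "incident", "."]]
chunks = make_multisentences(doc, max_tokens=6)
```

Each chunk would then be encoded independently by BERT-BASE, trading document-wide context for fitting the hardware budget.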
Span Enumeration Conventional event extraction systems use the BIO tagging scheme to identify the start and end position of each trigger and entity. Nevertheless, this method fails to handle nested entities. As a solution, we enumerate all possible spans from uni-grams to k-grams to generate event mention and entity mention candidates. 2 Each span s_i is represented by the corresponding head token e_h, tail token e_t, and the distance embedding c_{h,t}, denoted as x_i = [e_h; e_t; c_{h,t}].

Classification We use task-specific feed-forward networks (FFN) to compute the label probabilities. Trigger extraction is performed on each token, y^trig_i = FFN^trig(e_i), while entity extraction is done on each span, y^ent_i = FFN^ent(x_i). For argument extraction, event coreference, and entity coreference, we score each pair of candidate spans, y^t_{i,j} = FFN^t([x_i; x_j]), where t refers to a specific task. Cross-entropy loss is used to learn trigger extraction and argument extraction:

L_t = -(1/N_t) Σ_i log y^t_i[y^{t*}_i],

where y^{t*} denotes the ground-truth labels, N_t denotes the number of instances, and t denotes different tasks.
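The span enumeration itself is straightforward; a sketch follows (with a smaller maximum width than the paper's k = 12, for illustration):

```python
def enumerate_spans(sentence_tokens, max_width=12):
    """Enumerate all candidate spans from uni-grams up to k-grams,
    so that nested mentions remain representable."""
    spans = []
    n = len(sentence_tokens)
    for start in range(n):
        # `end` is the inclusive index of the last token in the span.
        for end in range(start, min(start + max_width, n)):
            spans.append((start, end))
    return spans

spans = enumerate_spans(["Bob", "was", "shot"], max_width=2)
```

Enumeration is done per sentence, so the candidate count grows linearly (by a factor of at most k) in the document length rather than quadratically.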
For entity coreference and event coreference, BASE optimizes the marginal log-likelihood of all correct coreferent spans among the candidate spans:
2 k is empirically determined to be 12.
L^t_coref = -Σ_i log Σ_{j ∈ COREF(i)} P^t(j | i),

where COREF(i) denotes the gold set of spans coreferent with candidate span i, and t denotes different tasks. The total loss function for BASE is the weighted sum over all tasks, L_BASE = Σ_t β_t L_t, where β_t is the loss weight for task t.
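A minimal numeric sketch of this marginal log-likelihood, assuming softmax-normalized antecedent scores with a dummy "no antecedent" candidate at index 0 (a common convention in span-based coreference models, not specified in the paper):

```python
import math

def marginal_coref_loss(antecedent_scores, gold_antecedents):
    """Negative marginal log-likelihood over all correct antecedents.
    antecedent_scores[i][j]: score for span i taking candidate j as
    antecedent (index 0 is a dummy "no antecedent" option).
    gold_antecedents[i]: set of correct candidate indices for span i."""
    loss = 0.0
    for scores, gold in zip(antecedent_scores, gold_antecedents):
        # log partition over all candidates (softmax denominator)
        log_z = math.log(sum(math.exp(s) for s in scores))
        # Marginalize over all correct antecedents before the log.
        log_gold = math.log(sum(math.exp(scores[j]) for j in gold))
        loss += log_z - log_gold
    return loss

# One span with three candidates; candidate 1 is the only gold antecedent.
loss = marginal_coref_loss([[0.0, 2.0, -1.0]], [{1}])
```

Because the sum over COREF(i) sits inside the logarithm, the model is rewarded for putting probability mass on any correct antecedent, not forced to pick one.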

Cross-event Dependencies
A main issue for document-level event extraction is the increased complexity of capturing event dependencies. Due to the larger number of events at the scope of a document, efficiency is a key challenge in modeling inter-event interactions. We incorporate DVN (Gygli et al., 2017) into BASE to address this issue, given its advantage in computational efficiency.
Deep Value Networks DVN is an energy-based structured prediction architecture v(x, y; θ), parameterized by θ, that learns to evaluate the compatibility between a structured prediction y and an input x. The objective of v(x, y; θ) is to approximate an oracle value function v*(y, y*), which measures the quality of the output y in comparison to the ground truth y*, such that ∀y ∈ Y, v(x, y; θ) ≈ v*(y, y*). The final evaluation metric is usually used as the oracle value function v*(y, y*). For simplicity, we drop the parameter notation θ and use v(x, y) to denote the DVN.
Inference aims to find ŷ = argmax_y v(x, y) for every pair of input and output. A local optimum of v(x, y) can be efficiently found by performing gradient ascent, which runs in time linear in the size of the label set. Given DVN's higher scalability compared with other structured prediction algorithms, we leverage DVN to capture cross-event dependencies.
Deep Value Networks Integration The local trigger classifier predicts the event type scores for each token independently. DVN takes the predictions of the local trigger classifier y^trig and the embeddings of all tokens e as inputs. The structured outputs ŷ^trig should correct errors made by the local trigger classifier due to uncaptured cross-event dependencies. ŷ^trig is obtained by performing h iterations of gradient-ascent updates on the local trigger predictions y^trig, 3

y_{τ+1} = P_Y(y_τ + α ∇_y v(e, y_τ)),

where y_1 = y^trig, α denotes the inference learning rate, and P_Y denotes a function that clamps its inputs into the range (0, 1). The most likely event type for token i is determined by computing argmax(ŷ^trig_i).
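The inference loop can be sketched as below with a toy value function; the quadratic v(y) is only a stand-in for the learned network v(e, y), chosen so its gradient is available in closed form:

```python
def clamp01(y):
    """P_Y: project each coordinate back into [0, 1]."""
    return [min(1.0, max(0.0, v)) for v in y]

def dvn_inference(y_init, grad_v, alpha=0.1, h=50):
    """Gradient-ascent inference: y_{t+1} = P_Y(y_t + alpha * grad v(y_t)).
    Each step touches every label coordinate once, so the cost is
    linear in the label size, for h fixed iterations."""
    y = list(y_init)
    for _ in range(h):
        g = grad_v(y)
        y = clamp01([v + alpha * gi for v, gi in zip(y, g)])
    return y

# Toy value function v(y) = -sum((y - target)^2), which peaks at `target`.
target = [1.0, 0.0, 1.0]
grad_v = lambda y: [-2.0 * (v - t) for v, t in zip(y, target)]
y_hat = dvn_inference([0.5, 0.5, 0.5], grad_v, alpha=0.1, h=50)
```

In the actual model the gradient would come from backpropagating through the trained value network with respect to y, not from a closed-form expression.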
End-to-end DVN Learning We train DEED in an end-to-end fashion by directly feeding the local trigger predictions to both the DVN and the oracle value function. The trigger classification F1 metric adopted by previous works (Wadden et al., 2019; Lin et al., 2020) is used as the oracle value function v*(y^trig, y^{trig*}). To accommodate continuous outputs, v*(y^trig, y^{trig*}) needs to be relaxed: the output label for each token is relaxed from the discrete set {0, 1} to the continuous interval [0, 1], and the intersection and union set operations for computing the F1 score are replaced with element-wise minimum and maximum operations, respectively. The loss function for the trigger DVN is then

-v*(y^trig, y^{trig*}) log v(e, y^trig) - (1 - v*(y^trig, y^{trig*})) log(1 - v(e, y^trig)). (2)
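A sketch of the relaxed oracle value function and the DVN loss, under the assumption that the relaxed F1 takes the usual soft-set form 2·|intersection| / (|y| + |y*|), with the intersection as an element-wise minimum and the union as an element-wise maximum:

```python
import math

def relaxed_f1(y, y_star):
    """Relaxed F1: intersection -> element-wise min, union -> element-wise
    max. Since |y| + |y*| = |intersection| + |union|, this reduces to
    2*sum(min) / (sum(min) + sum(max)). Equals exact F1 on binary inputs."""
    inter = sum(min(a, b) for a, b in zip(y, y_star))
    union = sum(max(a, b) for a, b in zip(y, y_star))
    if inter + union == 0:
        return 1.0  # both predictions empty: define as perfect
    return 2.0 * inter / (inter + union)

def dvn_loss(v_pred, y, y_star):
    """Cross-entropy between the DVN output v_pred = v(e, y) and the
    relaxed oracle value v*(y, y*), as in Eq. (2)."""
    v_star = relaxed_f1(y, y_star)
    return -v_star * math.log(v_pred) - (1.0 - v_star) * math.log(1.0 - v_pred)

v_star = relaxed_f1([0.9, 0.1, 0.8], [1.0, 0.0, 1.0])
```

Because the relaxed F1 is defined on continuous inputs, the local classifier's soft predictions can supervise the DVN directly, without discretization.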
The total loss function for training DEED end-to-end is the summation of the BASE loss and the DVN loss.

Noise Injection In this training setup, DVN observes a large portion of high-scoring examples at the later stage of the training process, when the local trigger classifier starts to overfit the training examples. A naive solution is to additionally feed random noise to DVN during training, besides the outputs of the local trigger classifier. Yet, the distribution of such noise is largely distinct from the output of the trigger classifier, and therefore easily distinguishable by DVN. Thus, we incorporate swap noise into the local trigger predictions, where s% of the local trigger outputs y^trig are swapped, as depicted in Figure 2. 4 This way, the noisy local trigger predictions have distributions similar to the original trigger predictions. We also hypothesize that higher-confidence predictions are often easier to identify, and that swapping higher-confidence trigger predictions may not help DVN learn. We therefore also experimented with swapping only the lower-confidence trigger predictions.
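Swap noise might be implemented as below; the swap fraction and the confidence-based eligibility rule for the w/SNLC variant are illustrative assumptions about details the paper leaves to its Figure 2 and appendix:

```python
import random

def swap_noise(trigger_scores, s=0.2, confidences=None, rng=None):
    """Swap the score vectors of roughly s% of tokens with those of other
    random tokens, so the noisy predictions keep the classifier's output
    distribution. If per-token confidences are given, only the
    lower-confidence half is eligible (the w/SNLC variant)."""
    rng = rng or random.Random(0)
    n = len(trigger_scores)
    eligible = list(range(n))
    if confidences is not None:
        ranked = sorted(range(n), key=lambda i: confidences[i])
        eligible = ranked[: max(2, n // 2)]  # lower-confidence half
    noisy = [list(row) for row in trigger_scores]  # copy, keep input intact
    n_swaps = max(1, int(s * n))
    for _ in range(n_swaps):
        i, j = rng.sample(eligible, 2)
        noisy[i], noisy[j] = noisy[j], noisy[i]
    return noisy

scores = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.5, 0.5]]
noisy = swap_noise(scores, s=0.5)
```

Since swapping only permutes existing score vectors, the noisy predictions are drawn from the same marginal distribution as the classifier's outputs, which is exactly what makes them harder for DVN to distinguish than random noise.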

Experimental Setup
Our models are evaluated on the ACE05 dataset, which contains event, relation, entity, and coreference annotations. Experiments are conducted at the document level, instead of at the sentence level as in previous works (Wadden et al., 2019; Lin et al., 2020).

Baselines and Model Variations
We compare DEED with three baselines: (1) BASE, the base model described in Section 4.1; (2) BCRF, which extends BASE by adding a CRF layer on top of the trigger classifier; (3) OneIE+, a pipeline composed of the joint model presented in Lin et al. (2020) and coreference modules adapted from BASE. Lin et al. (2020) is the state-of-the-art sentence-level event extraction model, which utilizes beam search and a CRF with global features to model cross-subtask dependencies. For a fair comparison, all models are re-trained using BERT-BASE (Devlin et al., 2019) as the encoder.
In addition to the original DEED model, we consider three variations of it, as discussed in Section 4.2. DEED w/RN incorporates random noise while learning DVN, whereas DEED w/SN integrates swap noise. DEED w/SNLC is an extension of DEED w/SN, where swap noise is only applied to lower-confident trigger predictions.

Overall Results
The overall results are summarized in Table 1. DEED w/SNLC achieves the highest DOCTRIGGER score and combined score.

Performance of Each Component
To understand the capabilities of each module, we show an evaluation breakdown for each component in Table 2, following previous works (Wadden et al., 2019; Lin et al., 2020). 5 Both BCRF and DEED obtain significant performance gains over BASE across all tasks. On the trigger-related tasks, Trig-I and Trig-C, DEED w/SNLC achieves the highest scores. Yet, BCRF performs best on Evt-Co. This explains the close performance of DEED w/SNLC and BCRF on DOCTRIGGER, as shown in Table 1. On the argument-related tasks, OneIE+ achieves the best performance on Arg-I and Arg-C. This suggests that cross-subtask modeling can be important for improving argument extraction. The Arg-I and Arg-C scores are much lower than those reported by previous studies (Wadden et al., 2019; Lin et al., 2020), which indicates the difficulty of extracting the full spans of pronoun and nominal arguments.

Table 3 describes the computation time of the different models. DEED only requires slightly more computation time than BASE, in both training and inference. By contrast, compared to BCRF, DEED is ∼3.5x faster in training and ∼6x faster in inference. This demonstrates the efficiency of our approach, given the small increase in computation time and the performance comparable to BCRF detailed in Tables 1 and 2. We also added experiments with OneIE+ as a reference, but the comparison focuses on end-to-end frameworks.

5 These studies focus on extracting the head spans of name arguments, while we extract the full spans of all types of arguments.

Value Function Approximation
To show that the performance gain of DEED results from the improved capability of DVN to judge the structure of predicted triggers, we investigate how closely DVN approximates the oracle value function under different training settings. We use the cross-entropy loss between the output of DVN and the output of the oracle value function on the test set as the distance function: the lower the loss, the closer the output of DVN is to the output of the oracle value function. Table 4 shows the approximation results. The SNLC variation (swap noise applied to lower-confidence predicted triggers) yields the lowest loss compared to the base model and the other variations. Together with the results shown in Table 2, this shows that a lower DVN loss results in better trigger scores, and demonstrates that integrating noise into the DVN training procedure is effective for learning a better DVN and obtaining better overall performance.

Error Analysis
We manually compared gold and predicted labels of event mentions on the ACE05 test set and analyzed the mistakes made by our model. These errors are categorized as demonstrated in Figure 3.

Annotation Errors One category of errors stems from the gold annotation itself. In one sentence, the trigger label for the token resignation should be END-POSITION according to the annotation guideline, yet it is not annotated as a trigger in the gold annotation. In other cases, two sentences with similar structures contain inconsistent gold annotations, such as: Separately, former WorldCom CEO Bernard Ebbers failed on April 29 to make a first repayment of 25 million dollars ...

Former senior banker Callum McCarthy begins what is one of the most important jobs in London 's financial world in September
The two examples above share a similar context. However, former in the first sentence is not involved in any event, whereas former in the second sentence is annotated as a trigger of type END-POSITION.
Conceptual Events Another common source of false positive errors is extracting "conceptual" events, which did not happen or may only happen in the future. For instance: ... former WorldCom CEO Bernard Ebbers failed on April 29 to make a first repayment of 25 million dollars ... Our model predicts the word repayment as a TRANSFER-MONEY trigger, which would be correct had the repayment indeed happened; yet it failed, as indicated at the beginning of the sentence. To handle this type of error, models need to be aware of tense and of whether a negative sentiment is associated with the predicted events.
Weak Textual Evidence Our model commonly made false negative errors in cases where the textual information is vague.
But both men observed an uneasy truce over US concerns about Russian aid to the nuclear program of Iran ...
In the above sentence, DVN fails to identify the token aid as a trigger of type TRANSFER-MONEY. In fact, it is hard to determine whether the aid is monetary or military given the context of the whole document. In such cases, models have to draw on information from other sources, such as knowledge bases or other news articles.
Cross-event Dependencies Although our model is able to correct many mistakes made by BASE that require modeling of cross-event dependencies, some cases remain. In one example, DVN correctly predicts suicide as a trigger of type DIE, but falsely predicts shot as type ATTACK instead of type DIE. If our model could capture the interaction between suicide and shot, it would be able to handle this case. There is still room for improvement in cross-event dependency modeling.

Conclusion
In this paper, we investigate document-level event extraction, which requires joint modeling of event and entity coreference. We propose a document-level event extraction framework, DEED, which uses DVN to capture cross-event dependencies, and we explore different end-to-end learning methods for DVN. Experimental results show that DEED achieves comparable performance to competitive baseline models, while being much more favorable in terms of computational efficiency. We also found that incorporating noise into the end-to-end DVN training procedure can result in higher DVN quality and better overall performance.

Ethics
Biases have been studied in many information extraction tasks, such as relation extraction (Gaut et al., 2020), named entity recognition (Mehrabi et al., 2020), and coreference resolution (Zhao et al., 2018a). Nevertheless, few works investigate biases in event extraction tasks, particularly on ACE05.
We analyze the proportion of male pronouns (he, him, and his) and female pronouns (she and her) in the ACE05 dataset. In total, there are 2780 male pronouns, while only 970 female pronouns appear in the corpus. Given the significant imbalance between male and female entity annotations, we would expect the trained model to perform better when extracting events in which male arguments are involved, and to make more mistakes for events involving female arguments. After analyzing the performance of DEED w/SNLC on the test set, we found that it scores 54.90 and 73.80 Arg-C F1 for male and female pronoun arguments, respectively. Surprisingly, our model is better at identifying female pronoun arguments than male pronoun arguments.
While our proposed framework may not be subject to gender biases on ACE05, whether such issues can occur when our model is deployed for public use is unknown. Rigorous studies on out-of-domain corpora are needed to answer this question.

A Data Statistics
The statistics of ACE05 are shown in Table 6. We observe that the event coreference annotation is very sparse.

B Implementation Details
We adopted part of the pre-processing pipeline from Wadden et al. (2019).