Document-Level Event Role Filler Extraction using Multi-Granularity Contextualized Encoding

Few works in the event extraction literature have gone beyond individual sentences to make extraction decisions. This is problematic when the information needed to recognize an event argument is spread across multiple sentences. We argue that document-level event extraction is a difficult task since it requires a view of a larger context to determine which spans of text correspond to event role fillers. We first investigate how end-to-end neural sequence models (with pre-trained language model representations) perform on document-level role filler extraction, as well as how the length of the captured context affects model performance. To dynamically aggregate information captured by neural representations learned at different levels of granularity (e.g., the sentence- and paragraph-level), we propose a novel multi-granularity reader. We evaluate our models on the MUC-4 event extraction dataset, and show that our best system performs substantially better than prior work. We also report findings on the relationship between context length and neural model performance on the task.


Introduction
The goal of document-level event extraction (1) is to identify in an article events of a pre-specified type along with their event-specific role fillers, i.e., arguments. The complete document-level extraction problem generally requires role filler extraction, noun phrase coreference resolution and event tracking (i.e., determining which extracted role fillers belong to which event). In this work, we focus only on document-level role filler extraction. Figure 1 provides a representative example of this task:

[S1] ... by special urban troops, four terrorists have been arrested in soacha.
[S2] They are responsible for the car bomb attack on the Newspaper El Espectador, to a series of bogota dynamite attacks, to the freeing of a group of paid assassins.
[S3] The terrorists are also connected to the murder of Teofilo Forero Castro, ...
[S4] General Ramon is the commander of the 13th infantry brigade.
[S5] He said that at least two of those arrested have fully confessed to having taken part in the accident of Luis Carlos Galan Sarmiento in soacha, Cundinamarca.

(1) The task is also referred to as template filling (MUC-4, 1992).

Given an article consisting of multiple paragraphs/sentences, and a fixed set of event types (e.g., terrorist events) and associated roles (e.g., PERPETRATOR INDIVIDUAL, VICTIM, WEAPON), we aim to identify those spans of text that denote the role fillers for each event described in the text. This generally requires both sentence-level understanding and accurate interpretation of context beyond the sentence. Examples include identifying "Teofilo Forero Castro" (mentioned in S3) as a victim of the car bomb attack event (mentioned in S2) and determining that there is no role filler in S4 (both of which rely mainly on sentence-level understanding), as well as identifying "four terrorists" in S1 as a perpetrator individual (which requires coreference resolution across sentence boundaries). Generating document-level extractions for events is essential for downstream applications such as information retrieval and article summarization (Yang and Mitchell, 2016), and for real-life applications such as trends analysis of world events (Sundheim, 1992).
Recent work in document-level event role filler extraction has employed a pipeline architecture with separate classifiers for each type of role and for relevant context detection (Patwardhan and Riloff, 2009; Huang and Riloff, 2011). However, these methods: (1) suffer from error propagation across the pipeline stages; and (2) require heavy feature engineering (e.g., lexico-syntactic pattern features for candidate role filler extraction; lexical bridge and discourse bridge features for detecting event-relevant sentences at the document level). Moreover, the features are manually designed for a particular domain, which requires linguistic intuition and domain expertise.
Neural end-to-end models have been shown to excel at sentence-level information extraction tasks, such as named entity recognition (Lample et al., 2016; Chiu and Nichols, 2016) and ACE-type within-sentence event extraction (Chen et al., 2015; Nguyen et al., 2016; Wadden et al., 2019). However, to the best of our knowledge, no prior work has investigated the formulation of document-level event role filler extraction as an end-to-end neural sequence learning task. In contrast to extracting events and their role fillers from standalone sentences, document-level event extraction poses special challenges for neural sequence learning models. First, capturing long-term dependencies in long sequences remains a fundamental challenge for recurrent neural networks (Trinh et al., 2018). To model long sequences, most RNN-based approaches use backpropagation through time, but it is still difficult for these models to scale to very long sequences. We provide empirical evidence for this for event extraction in Section 4.3. Second, although pre-trained bi-directional transformer models such as BERT (Devlin et al., 2019) capture long-distance dependencies better than an RNN architecture, they still impose a constraint on the maximum sequence length, which is below the length of many articles about events.
In the sections below, we study how to train and apply end-to-end neural models for event role filler extraction. We first formalize the problem as a sequence tagging task over the tokens in a set of contiguous sentences in the document. To address the aforementioned challenges for neural models applied to long sequences, (1) we investigate the effect of context length (i.e., maximum input segment length) on model performance, and find the most appropriate length; and (2) we propose a multi-granularity reader that dynamically aggregates the information learned from the local context (e.g., sentence-level) and the broader context (e.g., paragraph-level). A quantitative evaluation and qualitative analysis of our approach on the MUC-4 dataset (MUC-4, 1992) both show that the multi-granularity reader achieves substantial improvements over the baseline models and prior work.
For replication purposes, our repository with the evaluation and preprocessing scripts will be available at https://github.com/xinyadu/doc_event_role.

Related Work
Event extraction has been mainly studied under two paradigms: detecting the event trigger and extracting the arguments from an individual sentence (e.g., the ACE task (Doddington et al., 2004)), vs. at the document level (e.g., the MUC-4 template-filling task (Sundheim, 1992)).
Sentence-level Event Extraction The ACE event extraction task requires extraction of the event trigger and its arguments from a sentence. For example, in the sentence "... Iraqi soldiers were killed by U.S. artillery ...", the goal is to identify the "die" event triggered by killed and the corresponding arguments (PLACE, VICTIM, INSTRUMENT, etc.). Many approaches have been proposed to improve performance on this specific task (Li et al., 2013, 2015). These approaches generally focus on sentence-level context for extracting event triggers and arguments and rarely generalize to the document-level event extraction setting (Figure 1).
Only a few models have gone beyond individual sentences to make decisions. Ji and Grishman (2008) enforce event role consistency across documents. Liao and Grishman (2010) explore event type co-occurrence patterns to propagate event classification decisions. Similarly, Yang and Mitchell (2016) propose jointly extracting events and entities within a document context. Also related to our work are Duan et al. (2017) and Zhao et al. (2018), which utilize document embeddings to aid event detection with recurrent neural networks. Although these approaches make decisions with cross-sentence information, their extractions are still at the sentence level.
Document-level Event Extraction has been studied mainly under the classic MUC paradigm (MUC-4, 1992). The full task involves the construction of answer key templates, one template per event (some documents in the dataset describe more than one event). Typically three steps are involved: role filler extraction, role filler mention coreference resolution, and event tracking. In this work we focus on role filler extraction.
From the modeling perspective, recent work explores both the local and additional context to make role filler extraction decisions. GLACIER (Patwardhan and Riloff, 2009) jointly considers cross-sentence and noun phrase evidence in a probabilistic framework to extract role fillers. TIER (Huang and Riloff, 2011) proposes to first determine the document genre with a classifier and then identify event-relevant sentences and role fillers in the document. Huang and Riloff (2012) propose a bottom-up approach that first aggressively identifies candidate role fillers (with lexico-syntactic pattern features), and then removes the candidates that are in spurious sentences (i.e., not event-related) via a cohesion classifier (with discourse features). Similar to Huang and Riloff (2012), we also incorporate both intra-sentence and cross-sentence (paragraph-level) features, but instead of using manually designed linguistic information, our models automatically learn how to dynamically incorporate learned representations of the article. Also, in contrast to prior work that is pipeline-based, our approach tackles the task as an end-to-end sequence tagging problem.
There has also been work on unsupervised event schema induction (Chambers and Jurafsky, 2011; Chambers, 2013) and open-domain event extraction (Liu et al., 2019) from documents: the main idea is to group entities corresponding to the same role into an event template. Our models, on the other hand, are trained in a supervised way, and the event schemas are pre-defined.
Apart from event extraction, there has been increasing interest in cross-sentence relation extraction (Mintz et al., 2009; Peng et al., 2017; Jia et al., 2019). This work assumes that mentions are provided, and thus is more of a mention/entity-level classification problem. Our work instead focuses on role filler/span extraction using sequence tagging approaches; the role filler type is determined during this process.
Capturing Long-term Dependencies for Neural Sequence Models For training neural sequence models such as RNNs, capturing long-term dependencies in sequences remains a fundamental challenge (Trinh et al., 2018). Most approaches use backpropagation through time (BPTT), but it is difficult to scale BPTT to very long sequences. Many model variants have been proposed to mitigate the effect of long sequence length, such as Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997; Gers et al., 1999; Graves, 2013) and Gated Recurrent Unit networks (Cho et al., 2014). Transformer-based models (Vaswani et al., 2017; Devlin et al., 2019) have also shown improvements in modeling long text. In our work on document-level event role filler extraction, we implement LSTM layers in the models and utilize the pre-trained representations provided by the bi-directional transformer model BERT. From an application perspective, we investigate the suitable length of context to incorporate for the neural sequence tagging model in the document-level extraction setting. We also study how to mitigate problems associated with long sequences by dynamically incorporating both sentence-level and paragraph-level representations in the model (Figure 3).

Methodology
In the following we describe (1) how we transform the document into paired token-tag sequences and formalize the task as a sequence tagging problem (Section 3.1); and (2) the architectures of our base k-sentence reader (Section 3.2) and multi-granularity reader (Section 3.3).

Constructing Paired Token-tag Sequences from Documents and Gold Role Fillers
We formalize document-level event role filler extraction as an end-to-end sequence tagging problem. Figure 2 illustrates the general idea. Given a document and the text spans associated with the gold-standard (i.e., correct) fillers for each role, we adopt the BIO (Beginning, Inside, Outside) tagging scheme to transform the document into paired token/BIO-tag sequences.
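As a concrete illustration, the span-to-BIO conversion can be sketched as follows (a minimal sketch; the function name and the token-indexed span format are our own assumptions, not the paper's preprocessing code):

```python
def spans_to_bio(tokens, role_spans):
    """Convert gold role-filler spans to BIO tags.

    tokens: list of tokens for one sequence.
    role_spans: list of (start, end, role) tuples, token-indexed with an
        exclusive end, assumed non-overlapping.
    """
    tags = ["O"] * len(tokens)
    for start, end, role in role_spans:
        tags[start] = "B-" + role          # Beginning of a role filler
        for i in range(start + 1, end):
            tags[i] = "I-" + role          # Inside of the same filler
    return tags
```

For example, tagging "four terrorists" from S1 as a perpetrator individual yields a B- tag on the first token and an I- tag on the second, with all other tokens tagged O.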
We construct example sequences of varying context lengths for training and testing our end-to-end k-sentence readers (i.e., the single-sentence, double-sentence, paragraph and chunk readers). By "chunk", we mean a chunk of contiguous sentences that fits within the sequence length constraint for BERT (512 tokens in this case). Specifically, we use a sentence splitter to divide the document into sentences s_1, s_2, ..., s_n. To construct the training set, starting from each sentence s_i, we concatenate the k contiguous sentences (s_i to s_{i+k-1}) to form overlapping candidate sequences of length k: sequence 1 consists of {s_1, ..., s_k}, sequence 2 consists of {s_2, ..., s_{k+1}}, etc. To make the training set balanced, we sample the same number of positive and negative sequences from the candidate sequences, where a "positive" sequence contains at least one event role filler and a "negative" sequence contains no event role fillers. To construct the dev/test sets, where the reader is applied, we simply group each k contiguous sentences together in order, producing n/k sequences (i.e., sequence 1 consists of {s_1, ..., s_k}, sequence 2 consists of {s_{k+1}, ..., s_{2k}}, etc.). For the paragraph reader, we set k to the average paragraph length for the training set, and to the actual paragraph length for the test set.

Figure 2: An overview of our framework for training the sequence reader for event role filler extraction.
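The overlapping-window construction and positive/negative balancing described above can be sketched as follows (assuming sentences are already tokenized and BIO-tagged; the function name and seed handling are hypothetical):

```python
import random


def make_training_sequences(sentences, tag_seqs, k, seed=0):
    """Build overlapping k-sentence training windows and balance them.

    sentences: list of token lists, one per sentence.
    tag_seqs: parallel list of BIO-tag lists.
    Returns a list of (tokens, tags) pairs with equally many positive
    (contains a role filler) and negative (all "O") windows.
    """
    candidates = []
    for i in range(len(sentences) - k + 1):
        toks = [t for s in sentences[i:i + k] for t in s]
        tags = [t for s in tag_seqs[i:i + k] for t in s]
        candidates.append((toks, tags))
    pos = [c for c in candidates if any(t != "O" for t in c[1])]
    neg = [c for c in candidates if all(t == "O" for t in c[1])]
    rng = random.Random(seed)
    n = min(len(pos), len(neg))
    return rng.sample(pos, n) + rng.sample(neg, n)
```

At test time the analogous grouping is non-overlapping: the document is simply cut into consecutive blocks of k sentences.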
We denote the tokens in the sequence with x. The input for the k-sentence reader is {x_1^(1), ..., x_{l_1}^(1), ..., x_1^(k), ..., x_{l_k}^(k)}, where x_i^(j) is the i-th token of the j-th sentence and l_j is the length of the j-th sentence.

k-sentence Reader
Since our general k-sentence reader does not recognize sentence boundaries, we simplify the notation for the input sequence as {x 1 , x 2 , ..., x m } here.
Embedding Layer In the embedding layer, we represent each token x_i in the input sequence as the concatenation of its word embedding and its contextualized token representation:

• Word Embedding: We use the 100-dimensional GloVe pre-trained word embeddings (Pennington et al., 2014) trained on 6B tokens of Web crawl data. We keep the pre-trained word embeddings fixed, so given a token x_i, we simply look up its word embedding.

• Pre-trained LM representation: Contextualized embeddings produced by pre-trained language models (Peters et al., 2018; Devlin et al., 2019) have proved capable of modeling context beyond the sentence boundary and improving performance on a variety of tasks. Here we employ the contextualized representations produced by BERT-base for our k-sentence labeling model, as well as for the multi-granularity reader introduced next. Specifically, after empirical trials (4), we use the average of all 12 layers' representations and freeze the weights during training (Peters et al., 2019).

We forward the concatenation of the two representations for each token to the upper layers.

BiLSTM Layer To help the model better capture task-specific features between the sequence tokens, we use a multi-layer (3-layer) bi-directional LSTM encoder, denoted BiLSTM, on top of the token representations.

CRF Layer Drawing inspiration from sentence-level sequence tagging models on tasks like NER (Lample et al., 2016), we model the labeling decisions jointly rather than independently, which improves the model's performance (e.g., the tag "I-Weapon" should not follow "B-Victim"). We use a conditional random field (Lafferty et al., 2001). After passing the BiLSTM outputs {p_1, p_2, ..., p_m} through a linear layer, we obtain a matrix P of size m × |tag space|, where P_{i,j} is the score of tag j for the i-th token in the sequence.
For a tag sequence y = {y_1, ..., y_m}, the score for the sequence-tag pair is:

score(x, y) = sum_{i=0}^{m} A_{y_i, y_{i+1}} + sum_{i=1}^{m} P_{i, y_i}

where A is the transition matrix of scores such that A_{i,j} represents the score of a transition from tag i to tag j (y_0 and y_{m+1} denote the special start and end tags). A softmax function is applied over the scores of all possible tag sequences, which yields a probability for the gold sequence y_gold. The log-probability of the gold tag sequence is maximized during training. During decoding, the model predicts the output sequence that obtains the maximum score.
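The CRF scoring and maximum-score decoding described above can be illustrated in plain Python (a didactic sketch: the emissions P and transitions A below are tiny hand-made lists rather than learned weights, and the special start/end transitions are omitted for simplicity):

```python
def sequence_score(P, A, y):
    """Score of tag sequence y: emission scores P[i][y_i] plus
    transition scores A[y_i][y_{i+1}] (start/end tags omitted)."""
    s = sum(P[i][y[i]] for i in range(len(y)))
    s += sum(A[y[i]][y[i + 1]] for i in range(len(y) - 1))
    return s


def viterbi_decode(P, A):
    """Return the highest-scoring tag sequence (as tag indices)
    under emissions P (m x T) and transitions A (T x T)."""
    m, T = len(P), len(P[0])
    best = [P[0][:]]                  # best[i][t]: best score ending in tag t
    back = []                         # backpointers for path recovery
    for i in range(1, m):
        row, ptr = [], []
        for t in range(T):
            cand = [best[-1][u] + A[u][t] for u in range(T)]
            u = max(range(T), key=lambda x: cand[x])
            row.append(cand[u] + P[i][t])
            ptr.append(u)
        best.append(row)
        back.append(ptr)
    t = max(range(T), key=lambda x: best[-1][x])
    path = [t]
    for ptr in reversed(back):
        t = ptr[t]
        path.append(t)
    return list(reversed(path))
```

The full model additionally normalizes these scores with a softmax over all tag sequences (computed by the forward algorithm) to train on the gold sequence's log-probability; decoding only needs the max-score path shown here.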

Multi-Granularity Reader
To explore the effect of aggregating contextualized token representations learned from contexts of different granularities (the sentence- and paragraph-level), we introduce our multi-granularity reader.

(4) Using the representations of the last layer only, or summing all 12 layers' representations, gives consistently worse results.
Similar to the general k-sentence reader, we use the same embedding layer here to represent the tokens, but we apply it to two granularities of the paragraph text (the sentence- and paragraph-level). Although the word embeddings are the same for the embedding layers at both granularities, the contextualized representations differ for each token, depending on whether the token is encoded in the context of a sentence or in the context of a paragraph.
Correspondingly, we build two BiLSTMs (BiLSTM_sent and BiLSTM_para) on top of the sentence-level and paragraph-level contextualized token representations.

Sentence-Level BiLSTM BiLSTM_sent is applied sequentially to each sentence in the paragraph.

Paragraph-Level BiLSTM Another BiLSTM layer (BiLSTM_para) is applied to the entire paragraph (as compared to BiLSTM_sent, which is applied to each sentence separately), to capture the dependencies between all tokens in the paragraph.

For each token x_i^(j) (the i-th token in the j-th sentence), to fuse the representation learned at the sentence level with the one learned at the paragraph level, we propose two options: the first uses a sum operation, and the second uses a gated fusion operation.

Figure 3: Overview of our multi-granularity reader. The dark blue BiLSTM_sent produces sentence-level representations for each token; the yellow BiLSTM_para produces paragraph-level representations for each token.
• Simple Sum Fusion: the fused representation is the element-wise sum of the sentence-level and paragraph-level token representations.

• Gated Fusion: the gated fusion computes a gate vector g_i^(j) from the token's sentence-level representation and paragraph-level representation, to control how much information is incorporated from each of the two representations.
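A minimal numpy sketch of the gated fusion, assuming a single gate matrix W and bias b applied to the concatenated representations (the exact parameterization in the paper may differ; W and b stand in for learned weights):

```python
import numpy as np


def gated_fusion(p_sent, p_para, W, b):
    """Fuse sentence- and paragraph-level token representations.

    A gate g = sigmoid(W [p_sent; p_para] + b) decides, per dimension,
    how much of the sentence-level view to keep versus the
    paragraph-level view.
    """
    z = W @ np.concatenate([p_sent, p_para]) + b
    g = 1.0 / (1.0 + np.exp(-z))          # element-wise sigmoid gate
    return g * p_sent + (1.0 - g) * p_para
```

With zero weights the gate is 0.5 everywhere and the fusion degenerates to the simple average of the two views; the learned gate lets the model lean toward whichever granularity is more informative for each token.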
Similarly to the general k-sentence reader, we add the CRF layer (Section 3.2) on top of the fused representations {p_1, ..., p_m} for the tokens in the paragraph.

Experiments and Analysis
We evaluate our models' performance on the MUC-4 event extraction benchmark (MUC-4, 1992), and compare to prior work. We also report findings on the effect of context length on the end-to-end readers' performance on this document-level task.

MUC-4 Event Extraction Dataset
The MUC-4 dataset consists of 1,700 documents with associated answer key (role filler) templates. To make sure our results are comparable to previously reported results on this dataset, we use 1,300 documents for training, 200 documents (TST1+TST2) as the development set, and 200 documents (TST3+TST4) as the test set.
Evaluation Metrics Following prior work, we use head noun phrase match to compare the extractions against gold role fillers for evaluation (5); besides head noun matching, we also report exact match accuracy, to capture how well the models identify the role fillers' boundaries (6). Our results are reported as Precision (P), Recall (R) and F-measure (F-1) scores for the macro average over all event roles. In Table 2, we also present the scores for each event role (i.e., PERPETRATOR INDIVIDUALS, PERPETRATOR ORGANIZATIONS, PHYSICAL TARGETS, VICTIMS and WEAPONS) based on the head noun match metric. The detailed documentation and implementation of the evaluation script will be released.
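Under head-noun matching with duplicate conflation, per-role scoring reduces to a set overlap. A simplified sketch (here the head nouns are assumed to be already extracted; the official scorer operates on full noun phrases, so this is an illustration, not the released evaluation script):

```python
def role_prf(gold_heads, pred_heads):
    """Precision/recall/F-1 for one event role under head-noun matching.

    Duplicates are conflated: each distinct head noun counts once,
    so producing the same head twice is a single hit (or a single miss
    if absent from the gold set).
    """
    gold = set(gold_heads)
    pred = set(pred_heads)
    hits = len(gold & pred)
    p = hits / len(pred) if pred else 0.0
    r = hits / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

The macro average reported in Table 1 would then simply average the per-role F-1 values over the five event roles.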

Baseline Systems and Our Systems
We compare to pipeline systems based on manual feature engineering. GLACIER (Patwardhan and Riloff, 2009) consists of a sentential event classifier and a set of plausible role filler recognizers, one for each event role; the final extraction decisions are based on the product of normalized sentential and phrasal probabilities. TIER (Huang and Riloff, 2011) proposes a multi-stage approach that processes a document in three stages: classifying narrative documents, recognizing event sentences, and analyzing noun phrases. Cohesion Extract (Huang and Riloff, 2012) adopts a bottom-up approach, which first aggressively identifies candidate role fillers in the document and then refines the candidate set with a cohesion sentence classifier. Cohesion Extract obtains substantially better precision than GLACIER and TIER with a similar level of recall.

(5) Duplicate role fillers (i.e., extractions for the same role that have the same head noun) are conflated before being scored; they are counted as one hit (if the system produces any of them) or one miss (if the system fails to produce any of the duplicate mentions).
(6) Similarly, duplicate extractions with the same string are counted as one hit or miss.
To investigate how the neural models capture long-range dependencies in contexts of varying length (single-sentence, double-sentence, paragraph or longer), we set k in the k-sentence reader to different values to build: the Single-Sentence Reader (k = 1), which reads through the document sentence by sentence to extract the event role fillers; the Double-Sentence Reader (k = 2), which reads the document in steps of two sentences; the Paragraph Reader (k = the number of sentences in the paragraph), which reads the document paragraph by paragraph; and the Chunk Reader (k = the maximum number of sentences that fits within the length constraint of the pre-trained LM), which reads the document with the longest step (the 512-token constraint of the BERT model). Tables 1 and 2 present the results obtained with our Multi-Granularity Reader. Similar to the paragraph reader, it reads through the document paragraph by paragraph, but learns representations for both intra-sentence and inter-sentence context.
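The chunk reader's grouping of contiguous sentences under a pre-trained LM's length limit can be sketched with a greedy pass (a simplification: subword tokenization, which determines the true length against BERT's 512-token budget, is ignored here):

```python
def chunk_sentences(sentences, max_tokens=512):
    """Greedily group contiguous sentences into the longest chunks that
    stay within a sequence-length budget (e.g., BERT's 512 tokens).

    sentences: list of token lists. Returns a list of chunks, each a
    list of sentences. A single over-long sentence forms its own chunk.
    """
    chunks, cur, cur_len = [], [], 0
    for s in sentences:
        if cur and cur_len + len(s) > max_tokens:
            chunks.append(cur)            # budget exceeded: start a new chunk
            cur, cur_len = [], 0
        cur.append(s)
        cur_len += len(s)
    if cur:
        chunks.append(cur)
    return chunks
```

Each resulting chunk is then tagged as one input sequence, giving the longest-context reader variant compared in the experiments.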

Results and Findings
We report the macro average results in Table 1. To understand in detail how the models extract the fillers for each event role, we also report per event role results in Table 2. We summarize the results into the important findings below: • The end-to-end neural readers achieve nearly the same level of performance as, or significantly better results than, the pipeline systems. Although our models rely on no hand-designed features, the contextualized double-sentence reader and paragraph reader achieve nearly the same F-1 as Cohesion Extract (CE), judging by the head noun matching metric. Our multi-granularity reader performs significantly better (∼60) than the prior state-of-the-art.
• Contextualized embeddings consistently improve the neural readers' performance. The results show that the contextualized k-sentence readers all outperform their non-contextualized counterparts, especially when k > 1. These trends also appear in the per event role analysis (Table 2). Note that we freeze the transformer's parameters during training (fine-tuning yields worse results).
• Modeling longer context does not always yield a better neural sequence tagging model on this document-level task. When we increase the input context from a single sentence to two sentences, the reader has better precision and lower recall, resulting in no better F-1. When we increase the input context length further to the entire paragraph, precision increases and recall remains at the same level, resulting in higher F-1. When we keep increasing the length of the input context, the reader becomes more conservative and F-1 drops significantly. All this indicates that focusing on the local (intra-sentence) and the broader (paragraph-level) context are both important for the task. Similar findings regarding context length have been reported for document-level coreference resolution (Joshi et al., 2019).
• Our multi-granularity reader, which dynamically incorporates sentence-level and paragraph-level contextual information, performs significantly better than the non-end-to-end systems and our base k-sentence readers on the macro average F-1 metric.
The per event role performance is reported in Table 2.

Further Analysis
We conduct an ablation study on how the modules of our multi-granularity reader affect its performance on this document-level extraction task (Table 3). From the results, we find that: (1) when replacing the gated fusion operation with a simple sum of the sentence- and paragraph-level token representations, precision and F-1 drop substantially, which demonstrates the importance of dynamically incorporating context; (2) when removing BERT's contextualized representations, the model becomes more conservative and yields substantially lower recall and F-1; (3) when replacing the CRF layer with independent labeling decisions for each token, both precision and recall drop substantially.
We also conduct an error analysis on examples and predictions from different models, to understand qualitatively the advantages and disadvantages of our models. In the first example below (green span: gold extraction; the label after a span is its event role), the multi-granularity (MG) reader and the single-sentence reader correctly extract the two target expressions, which the paragraph reader overlooks. Although the attack and its targets are mentioned only in the last sentence, our MG reader successfully captures this by focusing on both the paragraph-level and intra-sentence context.
... the announcer says president virgilio barco will tonight disclose his government's peace proposal. ...... . Near the end, the announcer adds to the initial report on the el tomate attack with a 3-minute update that adds 2 injured, 21 houses Target destroyed, and 1 bus Target burned.
In the second example (red span: false positive PerpInd extraction by the single-sentence reader), although "members of the civil group" appears in a sentence about an explosion, judging from the paragraph-level context or reasoning about the expression itself should help confirm that it is not a perpetrator individual. The MG and paragraph readers correctly handle this and also extract "the bomb". There is still substantial room for improvement in our MG reader's predictions: there are many role fillers that the reader overlooks. In the example below, "La Tandona" being a perpetrator organization is only implicitly expressed in the document, and the phrase does not appear elsewhere in the corpus; external knowledge (e.g., Wikipedia) could help confirm its event role. In the last example, there is no explicit expression such as "kill" or "kidnap" in the context of the target. It thus requires a deeper understanding of the entire narrative and reasoning about the surrounding context to understand that "Jorge Serrano Gonzalez" is involved in a terrorism event.
... said that the guerrillas are desperate and ... . The president expressed his satisfaction at the release of Santander department senator Jorge Serrano Gonzalez Target, whom he described as one of the most important people that colombian democracy has at this moment.

Conclusion and Future Work
We have demonstrated that document-level event role filler extraction can be successfully tackled with end-to-end neural sequence models. Our investigation of how input context length affects the neural sequence readers' performance shows that very long contexts are hard for neural models to capture and result in lower performance. We propose a novel multi-granularity reader to dynamically incorporate paragraph- and sentence-level contextualized representations. Evaluations on the benchmark dataset and qualitative analysis show that our model achieves substantial improvements over prior work. In future work, it would be interesting to explore how the model can be adapted to jointly extract role fillers, resolve coreferential mentions, and construct event templates.