Event Extraction by Answering (Almost) Natural Questions

The problem of event extraction requires detecting the event trigger and extracting its corresponding arguments. Existing work in event argument extraction typically relies heavily on entity recognition as a preprocessing/concurrent step, causing the well-known problem of error propagation. To avoid this issue, we introduce a new paradigm for event extraction by formulating it as a question answering (QA) task, which extracts the event arguments in an end-to-end manner. Empirical results demonstrate that our framework outperforms prior methods substantially; in addition, it is capable of extracting event arguments for roles not seen at training time (zero-shot learning setting).


Introduction
Event extraction is a long-studied and challenging task in Information Extraction (IE) (Sundheim, 1992; Riloff et al., 1993; Riloff, 1996). The goal is to extract structured information - "what is happening" and the persons/objects that are involved - from unstructured text. Understanding the structure of events in text is of great importance for downstream applications such as news summarization and information retrieval (Yang and Mitchell, 2016). The task is illustrated via an example in Figure 1 from the ACE 2005 corpus (Doddington et al., 2004). It depicts an ownership transfer event (the event type), which is triggered in the sentence by the word "sale" (the event trigger), and accompanied by its extracted arguments - text spans denoting entities that fill a set of (semantic) roles associated with the event type (e.g., BUYER, SELLER and ARTIFACT for ownership transfer events).
Prior and recent successful approaches to event extraction have benefited from dense features extracted by neural models (Chen et al., 2015; Nguyen et al., 2016; Liu et al., 2018) and contextualized representations from pretrained language models (Zhang et al., 2019b; Wadden et al., 2019). However, they (1) rely heavily on entity information for argument extraction; in particular, they generally require a multi-step approach for event argument extraction - first identifying entities and their types with trained models (Wadden et al., 2019) or a parser (Sha et al., 2018), then assigning argument roles (or no role) to each entity. Although joint models (Yang and Mitchell, 2016; Nguyen and Nguyen, 2019; Zhang et al., 2019a) have been proposed to mitigate this issue, error propagation still happens during this process - using extracted/predicted entities in event extraction results in a significant drop in argument extraction performance, as compared to using gold entity information (Li et al., 2013; Yang and Mitchell, 2016); (2) do not consider the semantic similarity across different argument roles. For example, in the ACE 2005 corpus (Doddington et al., 2004), the CONFLICT.ATTACK and JUSTICE.EXECUTE events come with the argument roles TARGET and PERSON, respectively. In both events, the argument role refers to a human being (who) affected by an action. Not considering the similarity between them can hurt performance, especially for argument roles with few/no examples at training time (similar to the zero-shot setting in Levy et al. (2017)).

Input: As part of the 11-billion-dollar sale of USA Interactive's film and television operations to the French company and its parent company in December 2001, USA Interactive received 2.5 billion dollars in preferred shares in Vivendi Universal Entertainment.

Extracted Event:
  Event type: Transaction-Transfer-Ownership
  Trigger: "sale"
  Arguments:
    Buyer: "French company", "parent company"
    Seller: "USA Interactive"
    Artifact: "operations"
    Place: -
    Beneficiary: -

Figure 1: Extracting the event trigger and its corresponding arguments.
In this paper, we propose a new paradigm for the event extraction task -formulating it as a question answering (QA)/machine reading comprehension (MRC) task. The general framework is illustrated in Figure 2. We design fixed question templates for trigger detection and varied question templates for extracting each argument role. The input sentence is instantiated with the templates before being fed into the models to obtain the extractions. Details will be explained in Section 2.
Our paradigm brings many advantages for tackling the problem: (1) Our approach requires no entity annotation (gold or predicted entity information); to be more specific, it is end-to-end for event argument extraction, with no entity recognition pre-step needed. (2) The question answering paradigm helps the model learn to extract event arguments by transferring across different but semantically similar argument roles; we show empirically that our performance on both trigger and argument extraction outperforms prior methods (Section 3.2). We also show that our framework is able to extract event arguments for unseen roles (zero-shot setting). (3) Under our paradigm, advantages of models from the question answering/machine reading comprehension literature (e.g., MatchLSTM (Wang and Jiang, 2016), BiDAF (Seo et al., 2016), etc.) can be explored. Our main contributions and findings can be summarized as follows:
• We propose a question answering framework (Figure 2) for detecting event triggers and extracting their corresponding arguments. To the best of our knowledge, this is the first attempt to cast the event extraction problem as a QA task.
• We conduct extensive experiments to evaluate our framework on the Automatic Content Extraction (ACE) event extraction task. We propose several questioning strategies and investigate their effect on our model's performance. We find that using the annotation-guideline-based questioning strategy (i.e., questions that encode more naturalness and semantics) together with trigger information yields the best result, especially in the setting with unseen argument roles. Our best model outperforms the prior models on the ACE event extraction task.
Our code and question templates for this work will be open sourced at https://github.com/xinyadu/eeqa for reproducibility.

Methodology
In this section, we first provide an overview of the framework (Figure 2), then go into the details of its components: the questioning strategies, model training, and inference.

Framework Overview
Given an input sentence, we instantiate the trigger question template to get the input sequence for the trigger detection QA model (green box BERT_QA_Trigger). After obtaining the extracted trigger and its type (i.e., the event type) for the input sentence, instantiation is done for each argument role associated with the predicted event type. The instantiated input sequences are then passed into another QA model for argument extraction (orange box BERT_QA_Arg). Finally, a dynamic threshold is applied to the extracted candidate arguments, and only the top arguments are kept. The input sequences for the two QA models share a similar format:

[CLS] question [SEP] sentence [SEP]

where [CLS] is the special classification token and [SEP] is the special separator token. We provide details on how to obtain the question with various strategies in Section 2.2. Details on the QA models and the inference process can be found in Section 2.3.
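To make the two-stage flow concrete, a minimal sketch is given below; trigger_qa, arg_qa, role_questions and threshold are illustrative stand-ins rather than the paper's released implementation.

```python
# A minimal sketch of the two-stage pipeline. `trigger_qa` stands in for
# BERT_QA_Trigger (returns a trigger token and event type, or None), `arg_qa`
# for BERT_QA_Arg (returns (span, na_score) candidates), and `role_questions`
# maps event types to {role: question} dicts. All names are hypothetical.

def extract_event(sentence, trigger_qa, arg_qa, role_questions, threshold):
    # Stage 1: instantiate the fixed trigger question and detect the trigger.
    trigger_seq = f"[CLS] verb [SEP] {sentence} [SEP]"
    detected = trigger_qa(trigger_seq)
    if detected is None:
        return None
    trigger, event_type = detected  # e.g., ("sale", "Transaction-Transfer-Ownership")

    # Stage 2: one QA query per argument role of the predicted event type.
    arguments = {}
    for role, question in role_questions[event_type].items():
        # Append "in [trigger]" so the model knows which event is asked about.
        q = question.rstrip("?")
        arg_seq = f"[CLS] {q} in {trigger}? [SEP] {sentence} [SEP]"
        candidates = arg_qa(arg_seq)  # [(span_text, na_score), ...]
        # Dynamic threshold (Section 2.3): keep candidates scoring above it.
        arguments[role] = [span for span, score in candidates if score > threshold]
    return {"trigger": trigger, "event_type": event_type, "arguments": arguments}
```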

Questioning Strategies
For event trigger detection, we design simple fixed templates ("what is the trigger", "trigger", "action", "verb"); basically, we use the fixed literal phrase as the question. For example, if we choose the "verb" template, the input sequence after instantiation is:

[CLS] verb [SEP] sentence [SEP]

Next we introduce our question templates for argument extraction. We design three templates, based on the argument role name, a basic argument-based question, and an annotation-guideline-based question, respectively.
• Template 1 (argument role name) For this template, we use the argument role name (e.g., artifact, agent, place) as the question.
• Template 2 (argument based question) Instead of directly using the argument role name ([argument]) as the question, we first determine the argument role's basic type (person, place, or other). Based on this type information, we determine the "wh" word ([wh_word]) for the question - who for person, where for place, and what for other. In summary, the question is: [wh_word] is the [argument]?
In this way, more semantic information is encoded in the template 2 question as compared to the template 1 question.
• Template 3 (annotation guideline based question) To incorporate even more naturalness and semantic information into the question, we utilize the description for each argument role in ACE annotation guidelines for events (Linguistic Data Consortium, 2005) to design the (almost) natural question.
Finally, to encode the trigger information, we add "in [trigger]" at the end of the question (where [trigger] is instantiated with the actual trigger token from the trigger detection phase). For example, the template 2 question incorporating trigger information would be:

[wh_word] is the [argument] in [trigger]?

To help better understand all the strategies above, Table 1 presents an example for the argument roles of the event type MOVEMENT.TRANSPORT. We see in the table that the annotation-guideline-based question is more natural and encodes more semantics about a given argument role. For example, for "artifact", the question "what is being transported" (from the role's description in the annotation guideline) is more natural than the simple question "what is the artifact".
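A small sketch of how these templates might be assembled is given below; the role metadata (basic type and guideline question) are assumed inputs, and the function name is ours, not from the released code.

```python
# Illustrative assembly of the three argument-question templates (Section 2.2).
# `role_type` and `guideline_question` are assumed inputs, not the paper's code.

WH_WORDS = {"person": "who", "place": "where", "other": "what"}

def build_question(role, role_type, guideline_question, template, trigger=None):
    if template == 1:
        question = role                                     # Template 1: role name only
    elif template == 2:
        question = f"{WH_WORDS[role_type]} is the {role}?"  # Template 2: wh-word question
    else:
        question = guideline_question                       # Template 3: guideline-based
    if trigger is not None:                                 # optionally add trigger info
        question = f"{question.rstrip('?')} in {trigger}?"
    return question

# For the ARTIFACT role of MOVEMENT.TRANSPORT (guideline text paraphrased):
print(build_question("artifact", "other", "what is being transported?", 3, "transport"))
# -> "what is being transported in transport?"
```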

Question Answering Models
We use BERT (Devlin et al., 2019) as the base model to obtain contextualized representations of the input sequences for both BERT_QA_Trigger and BERT_QA_Arg; its parameters are updated during training. After instantiation with the question templates, the sequences are of the format [CLS] question [SEP] sentence [SEP]. We then obtain the contextualized representation of each token for trigger detection and argument extraction with BERT_Tr and BERT_Arg, respectively. For the input sequence (e_1, e_2, ..., e_N) prepared for trigger detection, we have:

(e'_1, e'_2, ..., e'_N) = BERT_Tr(e_1, e_2, ..., e_N)

As for the input sequence (a_1, a_2, ..., a_M) prepared for argument span extraction, we have:

(a'_1, a'_2, ..., a'_M) = BERT_Arg(a_1, a_2, ..., a_M)

The output layer differs: BERT_QA_Trigger predicts the type of each token in the sentence, while BERT_QA_Arg predicts the start and end offsets of the argument span.
For trigger prediction, we introduce a new parameter matrix W_tr ∈ R^(H×T), where H is the hidden size of the transformer and T is the number of event types plus one (for non-trigger tokens). Softmax normalization is applied across the T types:

P_tr(t | e_i) = softmax(e'_i W_tr)_t

For argument span prediction, we introduce two new parameter matrices W_s ∈ R^(H×1) and W_e ∈ R^(H×1); softmax normalization is applied across the input tokens a_1, a_2, ..., a_M to get the probability of each token being selected as the start/end of the argument span:

P_s(i) = softmax(a'_1 W_s, ..., a'_M W_s)_i,  P_e(i) = softmax(a'_1 W_e, ..., a'_M W_e)_i

To train the models (BERT_QA_Trigger and BERT_QA_Arg), we minimize the negative log-likelihood loss for both models. In particular, the loss for the argument extraction model is the sum of two parts: the start token loss and the end token loss. For training examples with no argument span, we use the first token ([CLS]) as the gold start and end position.
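For concreteness, a rough PyTorch sketch of the two models' output layers follows, assuming the HuggingFace transformers library; the paper's released code may differ in details such as loss masking and tokenization.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class BertQATrigger(nn.Module):
    """Token-level classifier: one label per token (event type or non-trigger)."""
    def __init__(self, num_types_plus_one, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        H = self.bert.config.hidden_size
        self.w_tr = nn.Linear(H, num_types_plus_one)  # W_tr in R^(H x T)

    def forward(self, input_ids, attention_mask):
        reps = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        return torch.log_softmax(self.w_tr(reps), dim=-1)  # per-token type log-probs

class BertQAArg(nn.Module):
    """Span extractor: start/end distributions over the input tokens."""
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        H = self.bert.config.hidden_size
        self.w_s = nn.Linear(H, 1)  # W_s in R^(H x 1)
        self.w_e = nn.Linear(H, 1)  # W_e in R^(H x 1)

    def forward(self, input_ids, attention_mask):
        reps = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        p_start = torch.log_softmax(self.w_s(reps).squeeze(-1), dim=-1)  # over tokens
        p_end = torch.log_softmax(self.w_e(reps).squeeze(-1), dim=-1)
        return p_start, p_end
```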
At test time, for trigger detection, we obtain the type of each token e_1, e_2, ..., e_N by simply applying argmax to P_tr.
Inference with Dynamic Threshold for Argument Spans

At test time, predicting the argument spans is more complex, since for each argument role there might be several spans or none to extract. After the output layer, we have the probability of each token a_i ∈ (a_1, a_2, ..., a_M) being the start (P_s(i)) and end (P_e(i)) of the argument span.
We run Algorithm 1 to get all the valid candidate argument spans from the sentence for each argument role. Basically, we (1) enumerate probable start offsets for the span, (2) pair each start with valid end offsets (discarding candidates whose offsets are invalid, e.g., the end precedes the start), and (3) calculate the relative no-answer score (na_score) for each candidate span and add the candidate to the list (lines 7-9).
Then, in Algorithm 2, we obtain the threshold that achieves the best evaluation result on the dev set (lines 1-9). Finally, we apply this best threshold (best_thresh) to all the candidate argument spans in the test set and keep only the top arguments whose na_score is larger than the threshold (lines 10-13). With a dynamic threshold determining the number of arguments to extract for each role, we avoid adding a (hard) hyperparameter for this purpose.
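A hedged sketch of this inference logic is shown below. We assume na_score compares a span's start/end probabilities against the no-answer position ([CLS], index 0) and that the threshold is chosen by a simple sweep over dev-set scores; the exact algorithms in the paper may differ in detail.

```python
# Sketch of candidate-span enumeration and threshold tuning (Algorithms 1-2).
# Assumption: na_score is the span's start+end probability relative to the
# no-answer ([CLS]) position; in practice, offsets falling on the question
# tokens would also need to be excluded.

def candidate_spans(p_start, p_end, top_k=20, max_len=10):
    """Return [((start, end), na_score), ...] for one role's QA output."""
    starts = sorted(range(len(p_start)), key=lambda i: -p_start[i])[:top_k]
    ends = sorted(range(len(p_end)), key=lambda j: -p_end[j])[:top_k]
    no_answer = p_start[0] + p_end[0]  # probability mass on [CLS]
    spans = []
    for i in starts:
        for j in ends:
            if 0 < i <= j < i + max_len:  # valid, bounded span
                spans.append(((i, j), (p_start[i] + p_end[j]) - no_answer))
    return spans

def best_threshold(dev_candidates, dev_gold, thresholds, f1_fn):
    """Pick the na_score cutoff maximizing dev-set F1 (Algorithm 2, lines 1-9)."""
    def keep(th):
        return [[span for span, score in cands if score > th]
                for cands in dev_candidates]
    return max(thresholds, key=lambda th: f1_fn(keep(th), dev_gold))
```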

Dataset and Evaluation Metric
We conduct experiments on the ACE 2005 corpus (Doddington et al., 2004), which contains documents crawled between 2003 and 2005 from a variety of areas such as newswire (nw), weblogs (wl), broadcast conversations (bc) and broadcast news (bn). The part we use for evaluation is fully annotated with 5,272 event triggers and 9,612 arguments. We use the same data split and preprocessing steps as in prior work (Zhang et al., 2019b; Wadden et al., 2019).
As for evaluation, we adopt the criteria defined in Li et al. (2013): an event trigger is correctly identified (ID) if its offsets match those of a gold-standard trigger, and it is correctly classified if its event type (33 in total) also matches that of the gold-standard trigger. An event argument is correctly identified (ID) if its offsets and event type match those of any of the reference argument mentions in the document, and it is correctly classified if its semantic role (22 in total) is also correct. Though our framework does not involve a separate trigger/argument identification step and tackles identification + classification in an end-to-end way, we still report identification results to compare with prior work. Identification alone can be seen as a more lenient evaluation metric than the final trigger detection and argument extraction metric (ID + classification), which requires both the offsets and the type to be correct. All the aforementioned elements are evaluated using precision (P), recall (R) and F1 scores (F1).
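A simplified sketch of these precision/recall/F1 computations (collapsing the document-level "any reference mention" matching for arguments into exact tuple matching) could look as follows:

```python
# Simplified P/R/F1 for the criteria above. Each predicted/gold item is a
# hashable tuple of the compared fields, e.g. (sent_id, start, end, event_type)
# for trigger ID + classification; duplicates are collapsed via sets.

def prf1(predicted, gold):
    pred_set, gold_set = set(predicted), set(gold)
    correct = len(pred_set & gold_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```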

Results
Evaluation on ACE Event Extraction

We compare our framework's performance to a number of prior competitive models: dbRNN (Sha et al., 2018) is an LSTM-based framework that leverages dependency graph information to extract event triggers and argument roles. Joint3EE (Nguyen and Nguyen, 2019) is a multi-task model that performs entity recognition, trigger detection and argument role assignment via shared Bi-GRU hidden representations. GAIL (Zhang et al., 2019b) is an ELMo-based model that utilizes a generative adversarial network to help the model focus on harder-to-detect events. DYGIE++ (Wadden et al., 2019) is a BERT-based framework that models text spans and captures within-sentence and cross-sentence context.
In Table 2, we present a comparison of the models' performance on trigger detection. We also implement a BERT fine-tuning baseline, which reaches nearly the same performance as its counterpart in DYGIE++. We observe that our BERT_QA_Trigger model with the best trigger questioning strategy reaches comparable or better performance than the baseline models. Table 3 shows the comparison between our model and the baseline systems on argument extraction. Notice that argument extraction performance is directly affected by trigger detection, since an argument is counted as correct only if the trigger it refers to is correctly identified and classified. We observe that (1) our BERT_QA_Arg model with the best argument questioning strategy (annotation-guideline-based questions) outperforms prior work significantly, although it uses no entity recognition resources; (2) the drop in F1 from argument identification (correct offsets) to argument ID + classification (correct offsets and argument role) is only around 1%, while the gap is around 3% for prior models that rely on entity recognition and a multi-step process for argument extraction. This once again demonstrates the benefit of our new formulation of the task as question answering.
To better understand how the dynamic threshold affects our framework's performance, we conduct an ablation study (Table 3) and find that the threshold increases precision and overall F1 substantially; the last row of the table reports our model's performance without it. To evaluate our framework's ability to extract arguments for roles unseen at training time (similar to the zero-shot relation extraction setting in Levy et al. (2017)), we conduct another experiment where we keep 80% of the argument roles (16 roles) seen at training time, and 20% (6 roles) only seen at test time. Specifically, the unseen roles are Vehicle, Artifact, Target, Victim, Recipient and Buyer. Table 4 presents the results. Random NE is a random baseline that selects a named entity from the sentence; it achieves a reasonable performance of nearly 25%. Prior models such as GAIL are not capable of handling unseen roles. Using our QA-based framework, as we add more semantic information and naturalness to the question (from question template 1 to 2 to 3), both precision and recall increase substantially.

Influence of Questioning Templates
To investigate how the questioning strategies affect event extraction performance, we experiment with different strategies for trigger and argument extraction, respectively. In Table 5, we try different questions for trigger detection (by "leaving empty", we mean instantiating the question with the empty string). There is no substantial gap between the different alternatives; using "verb" as the question, our BERT_QA_Trigger model achieves the best performance (measured by F1 score).

Table 6: Influence of questioning strategy on argument extraction.
The comparison between different questioning strategies for argument extraction is even more interesting. In Table 6, we present the results in two settings: event argument extraction with predicted triggers (the same setting as in Table 3), and with gold triggers. In summary, we find that:
• Adding "in [trigger]" after the question consistently improves performance; it serves as an indicator of what/where the trigger is in the input sentence. Without "in [trigger]", for each template (1, 2 & 3), the F1 of the models' predictions drops by around 3 points when given predicted triggers, and by more when given gold triggers.
• Our template 3 questioning strategy, which is the most natural, achieves the best performance. As mentioned earlier, template 3 questions are based on the argument role descriptions in the annotation guideline and thus encode more semantic information about the role name. This corresponds to the accuracy of the models' predictions: template 3 outperforms templates 1 & 2 both with and without "in [trigger]". Moreover, we observe that template 2 (adding a wh-word to form the question) achieves better performance than template 1 (directly using the argument role name).

Error Analysis
We further conduct error analysis and provide a number of representative examples. Compared to the gold data, our framework extracts more argument spans in only around 14% of the cases. Most of the time (54.37%), our framework extracts fewer argument spans, which corresponds to the results in Table 3, where the precision of our models is higher. In around 30% of the cases, our framework extracts the same number of argument spans as in the gold data, and half of these match the gold arguments exactly. After examining the examples, we find that the errors fall mainly into three categories: (1) Lack of knowledge for determining the exact boundary of an argument span. For example, in "Negotiations between Washington and Pyongyang on their nuclear dispute have been set for April 23 in Beijing ...", two argument spans should be extracted for the ENTITY role ("Washington" and "Pyongyang"), while our framework predicts the entire "Washington and Pyongyang" as the argument span. Although there is an overlap between the prediction and the gold data, the model gets no credit for it. (2) Lack of reasoning over document-level context. In the sentence "MCI must now seize additional assets owned by Ebbers, to secure the loan.", there is a TRANSFER-MONEY event triggered by "loan", with MCI being the GIVER and Ebbers the RECIPIENT. The previous paragraph mentions that Ebbers failed to repay a certain amount of money on the loan from MCI; without this context, it is hard to determine that Ebbers should be the recipient of the loan. (3) Data and lexical sparsity. In the following two examples, our model fails to detect the triggers of type END-POSITION: "Minister Tony Blair said ousting Saddam Hussein now was key to solving similar crises." and "There's no indication if Erdogan would purge officials who opposed letting in the troops." This is partially because these words were not seen as trigger words during training; "ousting" is a rare word that is not in the tokenizer's vocabulary, and inferring the event purely from sentence context is hard.

Related Work
Event Extraction Most event extraction research has focused on the 2005 Automatic Content Extraction (ACE) sentence-level event task (Walker et al., 2006). In recent years, continuous representations from convolutional neural networks (Nguyen and Grishman, 2015; Chen et al., 2015) and recurrent neural networks (Nguyen et al., 2016) have been shown to help pipeline classifiers substantially. To mitigate the effect of error propagation, joint models have been proposed for event extraction. Yang and Mitchell (2016) consider structural dependencies between events and entities, which requires heavy feature engineering to capture discriminative information. Nguyen and Nguyen (2019) propose a multitask model that performs entity recognition, trigger detection and argument role prediction by sharing Bi-GRU hidden representations. Zhang et al. (2019a) utilize a neural transition-based extraction framework (Zhang and Clark, 2011), which requires specially designed transition actions and still recognizes entities during decoding, though entity recognition and argument role prediction are done jointly. These methods generally perform trigger detection → entity recognition → argument role assignment during decoding. Different from the works above, our framework completely bypasses the entity recognition stage (thus needing no such annotation resources) and directly tackles event argument extraction. Also related to our work is Wadden et al. (2019), who model entity/argument spans (with start and end offsets) instead of labeling with the BIO scheme. Different from our work, their learned span representations are later used to predict the entity/argument type, whereas our QA model directly extracts spans for a given argument role. Contextualized representations produced by pre-trained language models (Peters et al., 2018; Devlin et al., 2019) have been shown to be helpful for event extraction (Zhang et al., 2019b; Wadden et al., 2019) and question answering (Rajpurkar et al., 2016). The attention mechanism helps capture relationships between tokens in the question and the input sequence; we use BERT in our framework to capture the semantic relationship between the question and the input sentence.

Machine Reading Comprehension (MRC)
The span-based MRC tasks involve extracting a span from a paragraph (Rajpurkar et al., 2016) or from multiple paragraphs (Joshi et al., 2017; Kwiatkowski et al., 2019). Recently, there have been explorations into formulating NLP tasks as question answering. McCann et al. (2018) propose the natural language decathlon challenge (decaNLP), which consists of ten tasks (e.g., machine translation, summarization, question answering); they cast all tasks as question answering over a context and propose a general model for it. In the information extraction literature, Levy et al. (2017) propose the zero-shot relation extraction task and reduce it to answering crowd-sourced reading comprehension questions. Li et al. (2019b) cast entity-relation extraction as a multi-turn question answering task, but their questions lack diversity and naturalness; for example, for the PART-WHOLE relation, the template question is "find Y that belongs to X", where X is instantiated with the pre-given entity. The follow-up work of Li et al. (2019a) proposes better query strategies incorporating synonyms and examples for named entity recognition. Different from the works above, we focus on the more complex event extraction task, which involves both trigger detection and argument extraction. Our questions for extracting event arguments are more natural (based on annotation guidelines) and leverage trigger information.

Conclusion
In this paper, we introduce a new paradigm for event extraction based on question answering. We investigate how questioning strategies affect the performance of our framework on both trigger detection and argument extraction, and find that more natural questions lead to better performance. Our framework outperforms prior work on the ACE 2005 benchmark, and is capable of extracting event arguments for roles unseen at training time. For future work, it would be interesting to incorporate broader context (e.g., paragraph/document-level context (Ji and Grishman, 2008; Huang and Riloff, 2011)) into our methods to improve the accuracy of the predictions.