Event Extraction as Machine Reading Comprehension

Event extraction (EE) is a crucial information extraction task that aims to extract event information from texts. Previous methods typically model EE as a classification task, which is data-hungry and suffers from the data scarcity problem. In this paper, we propose a new learning paradigm for EE by explicitly casting it as a machine reading comprehension (MRC) problem. Our approach includes an unsupervised question generation process, which transfers an event schema into a set of natural questions, followed by a BERT-based question-answering process that retrieves answers as EE results. This learning paradigm enables us to strengthen the reasoning process of EE by introducing sophisticated models from MRC, and to relieve the data scarcity problem by introducing large-scale MRC datasets. The empirical results show that: i) our approach attains state-of-the-art performance by considerable margins over previous methods; ii) our model excels in data-scarce scenarios, for example, obtaining 49.8% in F1 for event argument extraction with only 1% of the training data, compared with 2.2% for the previous method; iii) our model also fits zero-shot scenarios, achieving 37.0% and 16.6% in F1 on two datasets without using any EE training data.


Introduction
Event extraction (EE), a crucial information extraction (IE) task, aims to extract event information from texts. For example, in sentence S1 (shown in Figure 1 (a)), an EE system should recognize an Attack event (according to the ACE event ontology), expressed by the event trigger stabbed, with four event arguments: Sunday (Role=Time), a protester (Role=Attacker), an officer (Role=Target), and a paper cutter (Role=Instrument). EE has been shown to benefit a wide range of applications, including knowledge base augmentation (Ji and Grishman, 2011), document summarization, question answering (Berant et al., 2014), and others.
In the current literature, EE is mostly formulated as a classification problem, aiming to locate and categorize each event trigger/argument (Ahn, 2006; Li et al., 2013; Chen et al., 2015; Nguyen et al., 2016). Despite many advances, classification-based methods are data-hungry, requiring a great deal of training data to ensure good performance (Li et al., 2013; Liu et al., 2018a). Moreover, such methods generally cannot deal with new event types never encountered at training time (Huang et al., 2018).
In this study, we introduce a new learning paradigm for EE, shedding light on tackling the above problems simultaneously. Our main motivation is that EE can essentially be viewed as a machine reading comprehension (MRC) problem (Hermann et al., 2015; Chen et al., 2016) involving text understanding and matching, aiming to find event-specific information in texts. For example, in S1, the extraction of the role filler of Instrument is semantically equivalent to the following question-answering process (shown in Figure 1 (b)): Q1: What Instrument did the protester use to stab the officer? A1: a paper cutter. This implies new ways to tackle EE, which come with two major advantages. First, by framing EE as MRC, we can leverage recent advances in MRC (e.g., BERT (Devlin et al., 2019)) to boost the EE task, which may greatly strengthen the reasoning process of the model. Second, we can directly leverage the abundant MRC datasets to boost EE, which may relieve the data scarcity problem (we refer to this as cross-domain data augmentation). The second advantage also opens a door to zero-shot EE: for unseen event types, we can list questions defining their schema and use an MRC model to retrieve answers as EE results, instead of obtaining training data for them in advance.
To bridge MRC and EE, the key challenge lies in generating relevant questions describing an event schema (e.g., generating Q1 for Instrument). Note that we cannot adopt supervised question generation methods (Duan et al., 2017; Yuan et al., 2017; Elsahar et al., 2018), owing to the lack of aligned question-event pairs. Previous works connecting MRC with other tasks usually adopt human-designed templates (Levy et al., 2017; FitzGerald et al., 2018; Li et al., 2019b,a; Gao et al., 2019; Wu et al., 2019). For example, in QA-SRL (FitzGerald et al., 2018), the question for the predicate publish is always "Who published something?", regardless of the context. Such questions may not be expressive enough to instruct an MRC model to find answers.
We overcome the above challenge by proposing an unsupervised question generation process, which can generate questions that are both relevant and context-dependent. Specifically, in our approach, we assume that each question can be decomposed into two parts, reflecting the query topic and context-related information, respectively. For example, Q1 can be decomposed into "What instrument" and "did the protester use to stab the officer?". To generate the query topic expression, we design a template-based generation method, combining role categorization and interrogative word realization. To generate the more challenging context-dependent expression, we formulate it as an unsupervised translation task (Lample et al., 2018b) (or style transfer (Prabhumoye et al., 2018)), which transforms a descriptive statement into a question-style expression, based on in-domain de-noising auto-encoding (Vincent et al., 2008) and cross-domain back-translation (Sennrich et al., 2016). Figure 1 (b) gives another example.
Note that the training process only needs a large volume of descriptive statements and unaligned question-style statements. Finally, after the questions are generated, we build a BERT-based MRC model (Devlin et al., 2019) to answer each question and synthesize all of the answers as the result of EE.
To evaluate our approach, we have conducted extensive experiments on the benchmark EE datasets, and the experimental results justify the effectiveness of our approach. Specifically, 1) in the standard evaluation, our method attains state-of-the-art performance and outperforms previous EE methods by considerable margins (§ 4.2). 2) In data-scarce scenarios, our approach demonstrates promising results, for example, achieving 49.8% in F1 using 1% of the training data, compared with only 2.2% in F1 for the previous EE method (§ 4.3). 3) Our approach also fits zero-shot scenarios, achieving 37.0% and 16.6% in F1 on two datasets without using any EE training data (§ 4.4).
To sum up, we make the following contributions:
• We investigate a new formulation of EE by framing it explicitly as an MRC problem. We show that this new formulation can boost EE by leveraging both models and data from the area of MRC. Our work may encourage more research on transfer learning from MRC to boost information extraction.
• We propose an unsupervised question generation method to bridge MRC and EE. Compared with previous works using templates to generate questions, our method can generate questions that are both topic-relevant and context-dependent, which can better instruct an MRC model for question-answering.
• We report state-of-the-art performance on the benchmark EE dataset. Our method also demonstrates promising results in data-scarce and zero-shot scenarios.

Related Work
Event Extraction. EE is a crucial IE task that aims to extract event information from texts and has attracted extensive attention among researchers. Traditional EE methods employ manually designed features, such as syntactic features (Ahn, 2006), document-level features (Ji and Grishman, 2008), entity-level features (Hong et al., 2011), and others (Liao and Grishman, 2010; Li et al., 2013). Modern EE methods employ neural models, such as Convolutional Neural Networks (Chen et al., 2015), Recurrent Neural Networks (Nguyen et al., 2016; Sha et al., 2018), Graph Convolutional Neural Networks (Liu et al., 2018b, 2019b), and other advanced architectures (Yang and Mitchell, 2016; Liu et al., 2018a, 2019a; Nguyen and Nguyen, 2019; Zhang et al., 2019). Despite many advances, as mentioned in the Introduction, most previous approaches formulate EE as a classification problem, which usually suffers from the data scarcity problem and generally cannot deal with new event types never seen at training time.
MRC for Other Tasks. Our work also relates to works connecting MRC with other tasks, such as relation extraction (Levy et al., 2017; Li et al., 2019b), semantic role labeling (FitzGerald et al., 2018), named entity recognition (Li et al., 2019a), and others (Wu et al., 2019; Gao et al., 2019). In particular, Du and Cardie (2020) adopt a similar idea of framing EE as MRC. Different from our work, however, most of the above methods (Levy et al., 2017; Li et al., 2019b; FitzGerald et al., 2018; Du and Cardie, 2020) adopt human-designed, context-independent questions, which may not provide enough contextual evidence for question-answering. Some works do not even adopt question-style queries (Li et al., 2019a; Gao et al., 2019). For example, Li et al. (2019a) use "Find organizations in the text" as a query command to find ORGANIZATION entities. The discrepancy between such non-natural "queries" and the natural questions in MRC datasets may hinder effective transfer learning from MRC to the task. By contrast, our work aims to generate both relevant and context-dependent questions via an unsupervised question generation method.

The Approach
Our approach, denoted RCEE (Reading Comprehension for Event Extraction), is visualized in Figure 2. Given a sentence S1, RCEE first identifies an event trigger "stabbed" and its event type Attack, on receiving the special query "[EVENT]". Second, RCEE generates a question for each semantic role in the event schema of Attack. Third, RCEE builds an MRC model to answer each question as event argument extraction. Finally, RCEE synthesizes all of the answers as the final result of EE. The technical details of RCEE are presented in the following. In the illustration, we denote a sentence as c = {c_1, ..., c_n}, and we structure the presentation as event trigger extraction, unsupervised question generation, event argument extraction, and the training procedure of RCEE.
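To make the overall flow concrete, the following minimal sketch mirrors the four steps above; the function names, the schema representation, and the simple synthesis step are illustrative assumptions rather than the exact implementation.

```python
# Minimal, illustrative sketch of the RCEE pipeline; the callables passed in
# (trigger_extractor, question_generator, mrc_model) stand in for the components
# described in the following subsections.
def rcee_pipeline(sentence, event_schema, trigger_extractor, question_generator, mrc_model):
    """event_schema maps an event type to its list of semantic roles."""
    # Step 1: trigger extraction with the special "[EVENT]" query.
    triggers = trigger_extractor(sentence)                 # e.g. [("stabbed", "Attack")]
    events = []
    for trigger_word, event_type in triggers:
        arguments = {}
        # Step 2: generate one question per semantic role of the predicted event type.
        for role in event_schema.get(event_type, []):
            question = question_generator(sentence, trigger_word, role)
            # Step 3: answer the question with the MRC model (zero or more answers).
            answers = mrc_model(question, sentence)
            if answers:
                arguments[role] = answers
        # Step 4: synthesize all answers into a structured event record.
        events.append({"trigger": trigger_word, "type": event_type, "arguments": arguments})
    return events
```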

Event Trigger Extraction
To extract event triggers, we use "[EVENT]" as a special query command, indicating that all event triggers in the text should be found. The reason is that event triggers are usually verbs, and it is hard to design natural questions for them. Note also that this special query command enables event trigger extraction and argument extraction to share the same encoding model. Next, we adopt a classification-based method (instead of a span-based one) for trigger extraction, considering that most triggers (over 95% in ACE) are single words, and span-based answer generation may be too heavy. Specifically, we first jointly encode "[EVENT]" with the sentence c to compute an encoded representation (see § 3.3 for details). Then, for each word c_i in c, we take its encoded representation as the input of a logistic regression model and compute a vector o_{c_i} containing the probabilities of different event types; the probability of the l-th event type for c_i is the l-th entry of o_{c_i}.
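As a concrete illustration of this step, the sketch below jointly encodes the "[EVENT]" query with a sentence using a BERT encoder and classifies every sentence token into an event type. The checkpoint name, the toy label set, and the untrained linear classification head are assumptions for illustration only.

```python
import torch
from transformers import BertTokenizerFast, BertModel

# Illustrative classification-based trigger extraction: encode "[EVENT]" jointly
# with the sentence, then classify each sentence token into an event type.
tokenizer = BertTokenizerFast.from_pretrained("bert-large-uncased")
encoder = BertModel.from_pretrained("bert-large-uncased")

event_types = ["None", "Attack", "Die", "Transport"]   # toy subset of the ACE ontology
classifier = torch.nn.Linear(encoder.config.hidden_size, len(event_types))  # untrained head

sentence = "A protester stabbed an officer with a paper cutter on Sunday."
inputs = tokenizer("[EVENT]", sentence, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state          # (1, N, hidden)
    probs = torch.softmax(classifier(hidden), dim=-1)     # per-token event-type probabilities

# Keep tokens that belong to the sentence (second segment) and are predicted as triggers.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
segments = inputs["token_type_ids"][0]
for tok, seg, p in zip(tokens, segments, probs[0]):
    label = event_types[int(p.argmax())]
    if seg == 1 and label != "None":
        print(tok, label)
```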

Unsupervised Question Generation
After trigger extraction, RCEE generates a set of questions according to the predicted event type.
Here we assume each question can be composed of two parts: 1) a query topic, which reflects the relevance of the question, and 2) a question-style event statement, which encodes the context-related information.
Question Topic Generation. We devise a template-based method for query topic generation. Note that to make a question natural, we should use different interrogative words for different semantic roles. For example, the query topic for the semantic role Time might be "When [...]", whereas for Attacker it might be "Who [...]". With this motivation, we first group semantic roles into different categories and then design templates for each category. Table 1 shows our categorization (i.e., time-related, place-related, person-related, and general roles) and templates for the ACE 2005 event ontology. According to the table, the generated query topic for Victim is "Who is the Victim".
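A minimal sketch of this template-based step is given below; the role-to-category assignments and template strings beyond the examples stated above (and in Table 1) are assumptions.

```python
# Illustrative query-topic generation (cf. Table 1). Role categories and templates
# beyond the examples mentioned in the text are assumptions.
ROLE_CATEGORIES = {
    "time":   {"Time"},
    "place":  {"Place", "Origin", "Destination"},
    "person": {"Attacker", "Victim", "Defendant", "Adjudicator"},
    # any other role falls into the "general" category
}
TEMPLATES = {
    "time":    "When",                  # "When [...]" -- completed by the contextualized part
    "place":   "Where",
    "person":  "Who is the {role}",
    "general": "What is the {role}",
}

def query_topic(role: str) -> str:
    for category, roles in ROLE_CATEGORIES.items():
        if role in roles:
            return TEMPLATES[category].format(role=role)
    return TEMPLATES["general"].format(role=role)

print(query_topic("Victim"))      # -> "Who is the Victim"
print(query_topic("Instrument"))  # -> "What is the Instrument"
print(query_topic("Time"))        # -> "When"
```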
Question Contextualization. Question contextualization aims to generate the remaining question-style event statement. We formulate this as an unsupervised translation task (Lample et al., 2018a,b), whose goal is to map a descriptive statement (such as the sentence) to a question-style statement with no parallel resources. It can also be viewed as style transfer (Prabhumoye et al., 2018). To achieve this goal, we first build large corpora of descriptive statements (denoted as S) and unaligned natural questions (denoted as Q), restricting each instance in S to a window of words centered at a verb, and each instance in Q to a question with interrogative words such as When/Where/Who/What removed. Second, following Lample et al. (2018b), we build two MT models: P_{S→Q}(q_s | s), which maps a descriptive statement s ∈ S to a question-style statement q_s, and P_{Q→S}(s_q | q), which conducts the translation in the reverse direction. Each MT model includes an encoder in the source domain and a decoder in the target domain; for example, P_{S→Q}(q_s | s) has an encoder E_S for S and a decoder D_Q for Q. Third, we train P_{S→Q}(q_s | s) and P_{Q→S}(s_q | q) jointly via in-domain de-noising auto-encoding (Vincent et al., 2008) and cross-domain online back-translation (Sennrich et al., 2016), as shown in Figure 3. Finally, at inference time, a window of words centered at the predicted trigger (denoted s_x) is taken as the input of P_{S→Q}(q_s | s), and we compute the question-style statement q_{s_x} via q_{s_x} = argmax_q P_{S→Q}(q | s_x); q_{s_x} is then concatenated with the pre-generated query topic to form the final question.
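The inference step just described can be sketched as follows; the s2q_translate callable stands in for the learned P_{S→Q} model, and the window size and the toy translation are placeholders, not the actual system output.

```python
def contextualize_question(tokens, trigger_index, query_topic, s2q_translate, window=10):
    """Sketch of question contextualization at inference time.

    tokens:        the sentence as a list of words
    trigger_index: position of the predicted trigger
    s2q_translate: placeholder for the learned P_{S->Q} model, mapping a
                   descriptive statement (list of words) to a question-style one
    """
    # Take a window of words centered at the predicted trigger (s_x in the text).
    left = max(0, trigger_index - window)
    right = min(len(tokens), trigger_index + window + 1)
    statement = tokens[left:right]
    # Translate the descriptive statement into a question-style statement (q_{s_x}).
    question_style = s2q_translate(statement)
    # Concatenate the pre-generated query topic with the question-style statement.
    return query_topic + " " + " ".join(question_style)

# Toy usage with a hard-coded stand-in for the unsupervised translation model.
toy_s2q = lambda s: "did the protester use to stab the officer ?".split()
sent = "a protester stabbed an officer with a paper cutter on Sunday".split()
print(contextualize_question(sent, trigger_index=2, query_topic="What Instrument",
                             s2q_translate=toy_s2q))
# -> "What Instrument did the protester use to stab the officer ?"
```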

Event Argument Extraction
RCEE then performs event argument extraction as question answering, using a BERT-based MRC model. Let a question be q = {q_1, ..., q_m}.

Learning Input Representations. We first encode q and c jointly to learn the input representations, by constructing the sequence "[CLS] q [SEP] c" as the input of BERT. To further enhance the representation, we devise a new embedding, the word sharing embedding, as an input to BERT, motivated by the observation that words shared by q and c are more likely to convey event information. Specifically, the word sharing embedding of a word w_i (in q or c) is p_sh if w_i appears in both q and c, and p_no otherwise (Eq. 2), where p_sh and p_no are two embedding vectors updated during training. After encoding, we take the last hidden layer of BERT, H_{qc} ∈ R^{N×d_2}, as the final representation of q and c, where N = m + n + 2 (accounting for the [CLS] and [SEP] tokens) and d_2 denotes BERT's hidden dimension.
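As an illustration, the sketch below builds the "[CLS] q [SEP] c" input, marks shared words, and injects the word sharing embedding into BERT's input embeddings. The checkpoint name, the two-entry embedding table, the projection into BERT's hidden size, and the word-level (rather than subword-level) sharing test are simplifying assumptions.

```python
import torch
from transformers import BertTokenizerFast, BertModel

# Illustrative input representation with a word sharing embedding; the way the
# extra embedding is combined with BERT's input embeddings is an assumption.
tokenizer = BertTokenizerFast.from_pretrained("bert-large-uncased")
encoder = BertModel.from_pretrained("bert-large-uncased")

share_dim = 100                                                    # dimension used in the paper
p_embed = torch.nn.Embedding(2, share_dim)                         # index 0 -> p_no, index 1 -> p_sh
project = torch.nn.Linear(share_dim, encoder.config.hidden_size)   # map into BERT's input space

question = "what instrument did the protester use to stab the officer ?"
context = "a protester stabbed an officer with a paper cutter on sunday ."
inputs = tokenizer(question, context, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# A token is "shared" if its surface form occurs in both the question and the context.
shared_words = set(question.split()) & set(context.split())
shared = torch.tensor([[1 if t in shared_words else 0 for t in tokens]])

word_embeds = encoder.embeddings.word_embeddings(inputs["input_ids"])
with torch.no_grad():
    outputs = encoder(inputs_embeds=word_embeds + project(p_embed(shared)),
                      attention_mask=inputs["attention_mask"],
                      token_type_ids=inputs["token_type_ids"])
h_qc = outputs.last_hidden_state      # H_{qc}, shape (1, N, d_2); d_2 = 1024 for BERT-Large
print(h_qc.shape)
```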
Adaptive Argument Generation. Different from triggers, event arguments are generated with a span-based algorithm (Hermann et al., 2015), as they are usually entities and may contain multiple words. Since over 14% of semantic roles have zero or multiple arguments, we revise the existing algorithm to tackle this issue (shown in Algorithm 1). Specifically, given the joint representation H_{qc} of q and c, we first compute two probability vectors over every position in c, p^start = softmax(H_{qc} W_start) and p^end = softmax(H_{qc} W_end), containing the probabilities of the start and end positions of the answer, where W_start and W_end ∈ R^{d_2×1} are model parameters. Then, we regard the special token "[SEP]" as a "no-answer" indicator, and we only use start/end positions whose probabilities are higher than that of "[SEP]" to construct candidate answers. We adopt several heuristics regarding i) the relative positions of the start/end indices, ii) a length constraint, and iii) a likelihood threshold δ to filter out illegal answers. The new algorithm can generate zero, one, or more answers for a question. Additionally, when entity information is known (a setting adopted in many approaches (Chen et al., 2015; Nguyen et al., 2016)), we further adopt golden entity refinement, which enforces that answers have the same boundaries as ground-truth entities.
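A sketch of the adaptive answer generation step is given below; it reconstructs the procedure from the description above (the precise form of the filtering heuristics and of the threshold test are assumptions, not the paper's exact Algorithm 1).

```python
import torch

def generate_answers(h_qc, sep_index, w_start, w_end, delta=0.3, max_len=10):
    """Sketch of adaptive argument generation (cf. Algorithm 1).

    h_qc:           joint representation H_{qc}, shape (N, d_2)
    sep_index:      position of the "[SEP]" token used as the no-answer indicator
    w_start, w_end: projection vectors, shape (d_2,)
    """
    p_start = torch.softmax(h_qc @ w_start, dim=0)   # start-position probabilities
    p_end = torch.softmax(h_qc @ w_end, dim=0)       # end-position probabilities

    answers = []
    for s in range(h_qc.size(0)):
        if p_start[s] <= p_start[sep_index]:          # must beat the no-answer indicator
            continue
        for e in range(s, min(s + max_len, h_qc.size(0))):   # end after start, bounded length
            if p_end[e] <= p_end[sep_index]:
                continue
            score = float(p_start[s] * p_end[e])
            if score >= delta:                        # likelihood threshold
                answers.append((s, e, score))
    return sorted(answers, key=lambda x: -x[2])       # zero, one, or several answers

# Toy usage with random (untrained) tensors, for shape illustration only.
h = torch.randn(12, 16)
print(generate_answers(h, sep_index=5, w_start=torch.randn(16), w_end=torch.randn(16), delta=0.0))
```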

Training
To train RCEE, we adopt a pre-training followed by fine-tuning strategy, which can jointly train a model using datasets of MRC and EE.
Pre-training Stage. In the pre-training stage, we train RCEE on MRC datasets with the loss:

L_mrc(θ) = − Σ_{(c,q,a)} P(a | c, q)

where (c, q, a) denotes an MRC example consisting of a context c, a query q, and an answer a; P(a | c, q) indicates the likelihood of the ground-truth answer a given c and q, which is defined as:

P(a | c, q) = log p(g_s^a | c, q) + log p(g_e^a | c, q)    (6)

where g_s^a and g_e^a are respectively the ground-truth start and end positions.
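Concretely, the per-example objective can be written as a small function like the one below, a minimal sketch of Eq. (6) that omits batching and masking details.

```python
import torch
import torch.nn.functional as F

def mrc_loss(start_logits, end_logits, gold_start, gold_end):
    """Negative log-likelihood of the gold answer span for one (c, q, a) example,
    following Eq. (6); a minimal sketch without batching or masking."""
    log_p_start = F.log_softmax(start_logits, dim=0)
    log_p_end = F.log_softmax(end_logits, dim=0)
    # P(a|c,q) = log p(g_s^a | c, q) + log p(g_e^a | c, q); the loss is its negation.
    return -(log_p_start[gold_start] + log_p_end[gold_end])

# Toy usage with random logits.
print(mrc_loss(torch.randn(20), torch.randn(20), gold_start=3, gold_end=5))
```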
Fine-Tuning Stage. In the fine-tuning stage, we train RCEE on EE datasets with the loss:

L_ev(θ) = − Σ_e [ log p(g_e | w_e) + Σ_{r ∈ Arg(e)} P(a_r | c_e, q_r) ]    (7)

where e ranges over each event instance; w_e indicates the trigger of e; g_e indicates the event type of e; Arg(e) designates the role set of g_e; and r ranges over each role. We adopt Adam (Kingma and Ba, 2014) to update the parameters of RCEE.

Experiments

Following previous works (Li et al., 2013; Chen et al., 2015; Yang and Mitchell, 2016), we adopt precision (P), recall (R), and F1-score (F1) as evaluation metrics to ensure comparability. Significance tests are conducted using the method proposed by Yeh (2000), with a significance level of p = 0.05.
Implementation Details. We adopt BERT-Large, which has 24 layers, 1024 hidden units, and 16 attention heads, as our MRC model. Other hyper-parameters are tuned on the validation set via grid search. Specifically, the dimension of the word sharing embedding is set to 100 (chosen from {10, 50, 100, 200, 500}). The answer prediction threshold δ is set to 0.3 (chosen from {0.1, 0.2, ..., 0.9}). The batch size is set to 10 (chosen from {2, 5, 10, 15}). The dropout rate is set to 0.5. We adopt SQuAD 2.0 (Rajpurkar et al., 2018) for cross-domain data augmentation (our MRC model achieves 83.9% in F1 on it). Implementation details of the unsupervised question generation are given in the supplementary materials. Our code will be released at https://github.com/jianliu-ml/EEasMRC.
Baseline Models. We compare our model with: 1) JointBeam (Li et al., 2013), a state-of-the-art feature-based method for EE; 2) DMCNN (Chen et al., 2015), a neural model based on dynamic multi-pooling CNNs; 3) dbRNN (Sha et al., 2018) and 4) JMEE (Liu et al., 2018b), two models exploring syntax information via RNNs and Graph Convolutional Networks (GCNs), respectively. Joint EE models are also considered, including: 5) Joint3EE (Nguyen and Nguyen, 2019), which uses a unified architecture to predict entities and events, and 6) JointTrans (Zhang et al., 2019), which adopts a left-to-right transition-based method for EE. To further investigate whether the improvements come from the BERT representations, we also consider: 7) BERTEE, which adopts BERT representations but uses a classification strategy for EE. Our model is denoted RCEE, and RCEE ER ("ER" denotes golden entity refinement). We use DA to indicate cross-domain data augmentation.

Standard Evaluation
In the standard evaluation, we consider two settings with 1) known entities, which is considered by many previous methods, and 2) unknown entities, which is a more realistic setting.
Results with Known Entities. Table 2 gives the results of trigger extraction (Trigger Ex.) and argument extraction (Argument Ex.) with known entities. We also report results of argument extraction with oracle triggers (Argument Ex. (O)), to exclude potential error propagation from trigger extraction. From the results: 1) RCEE ER attains state-of-the-art performance, outperforming all baselines by considerable margins (+0.6% in trigger extraction; +3.6% (5.4%) in argument extraction). 2) In particular, RCEE ER outperforms BERTEE (which also uses BERT representations) by over 5% in argument extraction, which indicates that the improvements come mainly from the problem reformulation rather than from introducing BERT representations.
3) The high recall of RCEE ER indicates that it can predict more examples than the baselines, which may imply that RCEE ER can tackle difficult cases on which the baseline models fail.
Results with Unknown Entities. Table 3 gives results with unknown entities. In this setting, classification-based methods need to identify entities first, so we implement a BERT-based entity tagger for them (the tagger reaches 85.4%/85.9%/85.6% in P/R/F1, matching the state of the art (Yang and Mitchell, 2016)). Joint EE methods, which do not require entity information, are also compared. We use RCEE for this comparison, which excludes entity refinement. From the results, RCEE still demonstrates the best performance: it beats both the classification-based methods (by over 9.3% in F1) and the joint models (by over 6.0%). By checking ΔF1, we note that RCEE relies relatively little on golden entities (−4.3% in F1 without them), whereas classification-based methods depend heavily on them, suffering a drop of over 8% in F1 with predicted entities.

Results in Data-Scarce Scenarios
Figure 4 compares our models and BERTEE in data-scarce scenarios, and Table 4 gives results in the extremely data-low scenario (≤ 20% of the training data; for simplicity, we assume golden triggers in the following experiments). From the results, our model demonstrates superior performance, for example, obtaining 49.8% in F1 with only 1% of the EE training data, compared with 2.2% in F1 for BERTEE. We note that the improvement comes from two aspects: 1) Data augmentation (DA). For example, DA improves RCEE ER by +47.6% and +33.4% in the experiments with 1% and 5% of the data, according to Table 4. 2) The answer generation algorithm. Note that RCEE ER without DA still consistently outperforms BERTEE in data-low scenarios, implying that the answer generation algorithm is more data-efficient than the classification method. The reason might be that the answer generation algorithm in our approach is position-based, which may be robust to unseen words, whereas the classification method in previous EE approaches is largely word-based, which requires more labeled data.

Results in Zero-Shot Scenarios

Table 5 shows the results for zero-shot EE, where EE data is completely withheld from training (only DA is used for model pre-training). To increase the persuasiveness of the results, we adopt another dataset, FrameNet (Baker, 2014).

Further Discussion

Impact of Question Generation
We compare different question generation strategies: 1) QRole, which uses a role's name as the query; 2) QCommand, which uses "Find the #Role" as the query (Li et al., 2019a); and 3) QTemplate, which uses the template "What is the #ROLE in the #event trigger event?" as the query (FitzGerald et al., 2018). From the results, QRole, QCommand, and QTemplate achieve 60.1%, 64.9%, and 68.5% in F1 for argument extraction, compared with 70.1% for our approach. We note that the inferiority of those methods may lie in their poor expressiveness. For example, for the sentence "The pair flew to Singapore last year after ...", QRole uses "Time" as the query; QCommand uses "Find the Time"; and QTemplate uses "What is the Time in the flew event?". In contrast, our approach directly generates a nearly perfect question: "[When] do the pair fly to Singapore?" We provide more examples in the supplementary materials. Figure 5 further shows the performance of RCEE on four randomly selected semantic roles with differing amounts of training data.

Error Analysis
We conduct an error analysis in this section. 1) The first typical error relates to long-range dependencies, accounting for 23.4% of all errors (here "long-range" means the distance between a trigger and an argument is ≥ 10 words). Table 6 (a) shows a case where the argument Evian, France is about 20 words away from the trigger leave, making it difficult to identify the argument.
2) The second error relates to roles whose meanings are general, e.g., Entity and Agent; it is usually difficult to generate meaningful questions for these roles, which causes 32.7% of all errors.
3) The third error relates to co-reference, which accounts for 17.2%. Consider the example in Table 6 (b), where die evokes a Die event with "Laleh" and "Ladan" filling the semantic role Victim. Our model predicts "them" (two words before die) as the answer; although "them" is a reference to "Laleh and Ladan", it is considered an error under the current evaluation. This also raises the question of whether we should take co-reference into account when evaluating EE systems.

Conclusion and Future Work
In this paper, we take a fresh look at EE by casting it as an MRC problem. Our method includes an unsupervised question generation process that can generate both relevant and context-dependent questions, and its effectiveness is verified by empirical results. In the future, we plan to adapt our method to other IE tasks to study its application scope.

A Implementation Details of Unsupervised Question Generation
Following Lample et al. (2018b), we use FastBPE to split each example into sub-word units, with a vocabulary size of 60k. We implement both the encoders and the decoders as 4-layer transformers, where one layer is domain-specific for both the encoder and the decoder and the rest are shared. We use the standard hyper-parameter settings recommended by Lample et al. (2018b). The input word embeddings are initialized with FastText vectors trained on the concatenation of S and Q.
Negotiations between Washington and Pyongyang on
Role = Time/Place
(When/Where) did the negotiations between Washington and Pyongyang begin ?

founder Stelios Haji-Ioannou , who set up easyJet in 1995 and built
Role = Time/Place
(When/Where) did founder Stelios Ioannescu set up his company ?

divorce in September after their marriage broke down .
Role = Time/Place
(When/Where) did the divorce occur after their marriage ?

The total purchase cost is estimated at 300
Role = Price
(What is the price) of the total cost of building a nuclear power plant ?

His wife will go on trial next week on charges of
Role = Defendant
(Who is the defendant) on trial next week ?

Security Council for its 1990 invasion of Kuwait should be removed
Role = Attacker
(Who is the attacker) for its 1990 Gulf War ?

in U.S. troops for a war against Iraq even though it
Role = Attacker
(Who is the attacker) for a war against Iraq ?

Kuvaldin of a research center funded by former Soviet president Mikhail
Role = Organization
(What is the organization) of Kubidran University funded by ?

Table 7: Examples of generated questions. In each cell, the first line is the original sentence (event triggers are in italics in the original paper); the second line is the semantic role; the third line is the generated question. Parentheses mark the query topic generated by the templates, and the remaining part is the question-style expression generated by our model.
During training, we reduce the coefficient of the auto-encoding loss from 1.0 to 0.5 by 100K steps and to 0 by 300K steps. We cease training when the BLEU score between back-translated and input questions stops improving, usually around 800K steps. For inference, we use a beam size of 5 and a language model to evaluate all candidates and select the best one.
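The decay of the auto-encoding coefficient described above can be expressed as a simple schedule, sketched here for concreteness (the linear interpolation between the stated checkpoints is an assumption).

```python
def autoencoding_coefficient(step: int) -> float:
    """Weight of the auto-encoding loss: 1.0 -> 0.5 over the first 100K steps,
    then 0.5 -> 0.0 by 300K steps (linear interpolation is an assumption)."""
    if step <= 100_000:
        return 1.0 - 0.5 * step / 100_000
    if step <= 300_000:
        return 0.5 * (1.0 - (step - 100_000) / 200_000)
    return 0.0

# e.g. autoencoding_coefficient(0) == 1.0, (100_000) == 0.5, (300_000) == 0.0
```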

B Generated Questions
Some generated questions are given in Table 7.
Note that these examples are taken directly from our model's output without any manual editing (we do not even add a question mark at the end of each question).