Reading the Manual: Event Extraction as Definition Comprehension

We propose a novel approach to event extraction that supplies models with \emph{bleached statements}: machine-readable natural language sentences that are based on annotation guidelines and that describe generic occurrences of events. We introduce a model that incrementally replaces the bleached arguments in a statement with responses obtained by querying text with the statement itself. Experimental results demonstrate that our model is able to extract events under closed ontologies and can generalize to unseen event types simply by reading new bleached statements.


Introduction
Natural language processing seeks to help humans understand large amounts of text, as demonstrated by recent research in information extraction. However, there is a disconnect between how humans and machines carry out a task such as information extraction, as shown in Figure 1: humans read annotation manuals consisting of guidelines and illustrative examples and then label data, whereas machines label data (by making predictions) based on previously seen examples. Traditionally, machines consume only examples as their data without considering the annotation guidelines specifying how annotation decisions were made. In essence, humans perform the task based on a set of prescribed rules, whereas machines perform the task based on learned patterns describing the data.
In this work, we explore the feasibility of giving a model access to information derived from annotation manuals. We focus on the task of event extraction and convert annotation guidelines describing event types into natural language bleached statements.
As an example, a bleached statement for the ACE 2005 (Walker et al. 2006) LIFE:BE-BORN event type is:

[some person]_PERSON was born in [some location]_PLACE at [some time]_TIME

The bleached statement describes a general occurrence of an event of a given type. The event's arguments are initialized with bleached placeholders (e.g. some person, some location) to be replaced with extracted spans from the text. When the number of training examples for an event type is small or even zero (so-called "few-" and "zero-shot learning", respectively), models that use bleached statements still have access to information about how those examples were annotated. We hypothesize that machines that make use of information derivable from the guidelines can achieve good performance after seeing a small set of examples, as humans do.
We additionally propose a model that incrementally populates bleached statements by querying partially filled statements against text. This strategy is similar to the tasks of machine reading comprehension (MRC) and question answering (QA), in which an answer span is predicted in response to a question about a document.
Experimental results demonstrate that zero- and few-shot event extraction are feasible with this approach. Additionally, our model achieves state-of-the-art performance on trigger identification and trigger classification on the ACE 2005 dataset.
The contributions of this work are:
• A novel approach to event extraction that takes into account annotation guidelines through bleached statements;
• A multiple-span selection model that demonstrates the feasibility of the approach for event extraction as well as for zero- and few-shot settings.

Background
Event extraction is traditionally viewed as three subtasks: (1) event trigger detection, where triggers of events (words that most clearly express the occurrences of events) are detected; (2) entity mention detection, where all potential arguments (entity mentions) to events are detected; and (3) argument role prediction, where relations between detected entity mentions and trigger words are recognized with respect to each event type's defined set of roles.
Because pipelined approaches suffer from error propagation, in which errors from earlier subtasks (e.g. entity mention detection) are inherited by later subtasks, joint modeling of the three subtasks has been attempted. Yang and Mitchell (2016) jointly model the three components with hand-crafted features, but still need to detect entity mentions and event triggers separately. Nguyen and Nguyen (2019) jointly model the three tasks using neural networks with shared underlying representations. The models proposed in these two works are the baselines used in this paper. Huang et al. (2018) approach zero-shot event extraction by stipulating a graph structure for each event type and finding the event type graph structure whose learned representation most closely matches the learned representation of the parsed AMR (Banarescu et al. 2013) structure of a text. In contrast, our approach forgoes explicit graph-structured semantic representations such as AMR.
Researchers have introduced large question answering (QA) / machine reading comprehension (MRC) datasets in a cloze style (Hermann et al. 2015; Onishi et al. 2016), where a query sentence contains a placeholder and the model is expected to fill the blank. Our work can be viewed as an extension of such work, in which multiple placeholders are filled. Li et al. (2019) cast relation extraction as multi-turn question answering with natural language questions, where in each turn one argument of the relation is found through QA. The method requires writing a question for each entity type and each relation type.
In Levy et al. (2017), sets of crowdsourced paraphrastic questions are written for each relation type in the ontology. In contrast, for each event type we use a single declarative bleached statement derived from the annotation guidelines. Soares et al. (2019) propose a model for relation extraction by filling in two blanks given a contextual relation statement.
These three methods focus on binary relation extraction, and do not readily generalize to n-ary events or relations. Our approach naturally supports variable-arity events and relations.

Problem Formulation
A bleached statement consists of: the statement tokens $S = (s_1, \dots, s_n)$; a set of placeholder arguments, each with a role $r_k$, where $r_k$ is the predefined role of that argument (e.g. AGENT, PATIENT); and an index set $I_k \subseteq \{1, \dots, n\}$ for each argument that picks out its placeholder tokens in $S$. An example bleached statement in the ACE 2005 dataset for the event type LIFE:DIE (also used in our Figure 2) is:

someone killed someone else with something in some place at some time

This statement is accompanied by a placeholder dictionary $R = \{r_k : I_k\}$, in which each role is mapped to an index set that highlights the placeholder in the bleached statement.³ The event extraction task as defined in the ACE 2005 dataset also requires finding an event trigger: a span in the text that most clearly expresses the event's occurrence. In this example, the trigger is the word "killed". For a consistent implementation, we consider the trigger to be a special argument of the event, with role name TRIGGER.

2 Our bleached statements are inspired in part by linguistic resource creation efforts by White and Rawlins (2018).
3 The model itself does not see the role names. They are used only for human readability and evaluation.
Formally, the task is: given a bleached statement $S$, its placeholder dictionary $R$, and text tokens $T$, return a dictionary (denoted $E$ in Algorithm 1) that contains the event trigger and the extracted arguments. Such a result is shown in the bottom right of Figure 2. Note that the INSTRUMENT role is not filled in the example because the model does not find a span to fill it.
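The formulation above can be made concrete with a small data structure. The following is a minimal sketch, not the authors' implementation: the class name `BleachedStatement`, the 0-based indexing, and the role names AGENT and VICTIM are assumptions made for illustration (only INSTRUMENT and TRIGGER are named in the text).

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class BleachedStatement:
    """A bleached statement S together with its placeholder dictionary R."""
    tokens: List[str]                   # statement tokens s_1 ... s_n
    placeholders: Dict[str, Set[int]]   # role name -> 0-based token indices

    def placeholder_text(self, role: str) -> str:
        """Return the bleached phrase that currently stands in for a role."""
        return " ".join(self.tokens[i] for i in sorted(self.placeholders[role]))


# Hypothetical encoding of the LIFE:DIE statement; role names are illustrative.
die = BleachedStatement(
    tokens="someone killed someone else with something in some place at some time".split(),
    placeholders={
        "AGENT": {0},
        "VICTIM": {2, 3},
        "INSTRUMENT": {5},
        "PLACE": {7, 8},
        "TIME": {10, 11},
    },
)
```

Keeping the role names only as dictionary keys mirrors the remark that the model never sees them: a model would consume `tokens` and one index set at a time.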

Approach
Given a bleached statement with multiple placeholders, we do not fill the placeholders in parallel; instead, we fill them incrementally in an enforced order. In each step, the model attempts to fill a single focused placeholder, which is replaced by the extracted span(s), thereby creating a refined statement (see Figure 2). In this work we fill the placeholders in the statement from left to right and leave other orders as future work.
Formally, in each round, our model returns multiple arguments (see "Multiple Argument Selector" section below) for each placeholder:
$$A = \mathrm{GETARGS}(S, I, T)$$
where $S$ is the (partially refined) statement, $I$ is the index set that covers the focused placeholder (which corresponds to a role), and $T$ is the text to extract from. The returned argument set $A$ contains a number of text spans (potentially zero) in $T$ that replace the placeholder in $S$ picked out by $I$.
If $A$ is the empty set, then the model did not find an appropriate text span to replace the placeholder. If the answer set $A$ is not empty, we replace the placeholder with the extracted span. Note that in some cases, there can be more than one argument that fits a role. Consider the following bleached statement (for the ACE 2005 event LIFE:MARRY), focused on the first placeholder "some people":

[some people]_PERSON married in [some location]_PLACE at [some time]_TIME

We run this iterative process until all roles of an event are visited. An advantage of this method is that during the incremental refinement process, the statement always remains a natural language sentence. The incremental process for extracting event arguments is formalized in Algorithm 1, given the initial bleached statement $S$, the role dictionary $R$, and the text $T$ to extract from.
Annotation manuals of interest usually define multiple event types. For each event type $\tau$ described in the manual, we require a bleached statement $S$ and a role dictionary $R$. These together form our ontology $O = \{(\tau_k, S_k, R_k)\}$. To perform full event extraction (Algorithm 2), we first run a trigger detection model for all event types (see "Trigger Identification" section below) specified in the ontology. For those event types whose trigger is found, we proceed with argument extraction (Algorithm 1).
Algorithm 1 Argument extraction given bleached statement
Input: statement S, placeholder dictionary R, text T
Output: extracted argument structure E
function EXTRACTARGUMENTS(S, R, T)
    i ← 1                                      ▷ i-th round
    S^(1) ← S                                  ▷ the initial statement
    E ← ∅                                      ▷ extracted event
    for (r, I) ∈ R do
        A ← GETARGS(S^(i), I, T)
        if A ≠ ∅ then
            S^(i+1) ← replace the I tokens in S^(i) with A    ▷ refine the statement
            E ← E ∪ {(r : A)}                  ▷ event argument extracted
        else
            S^(i+1) ← S^(i)                    ▷ skip to the next role
        i ← i + 1
    return E
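The incremental loop of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch under simplifying assumptions, not the authors' code: placeholders are represented as contiguous `[start, end)` spans rather than arbitrary index sets, roles are assumed to be listed left to right, and `toy_get_args` is a hypothetical lookup table standing in for the trained selector.

```python
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int]  # [start, end) offsets of a placeholder in the statement
GetArgs = Callable[[List[str], Span, List[str]], List[str]]


def extract_arguments(statement: List[str],
                      roles: List[Tuple[str, Span]],
                      text: List[str],
                      get_args: GetArgs) -> Tuple[Dict[str, List[str]], List[str]]:
    """Fill placeholders left to right, refining the statement after each round."""
    statement = list(statement)
    extracted: Dict[str, List[str]] = {}
    shift = 0  # how far earlier replacements have moved later placeholders
    for role, (start, end) in roles:
        start, end = start + shift, end + shift
        answers = get_args(statement, (start, end), text)
        if not answers:
            continue  # nothing found: keep the bleached phrase, go to next role
        filler = " and ".join(answers).split()  # several answers are joined by "and"
        statement[start:end] = filler           # refine the statement
        shift += len(filler) - (end - start)
        extracted[role] = answers
    return extracted, statement


def toy_get_args(statement: List[str], span: Span, text: List[str]) -> List[str]:
    """Hypothetical stand-in for the trained selector: a fixed lookup table."""
    phrase = " ".join(statement[span[0]:span[1]])
    return {"some people": ["Kim", "Pat"], "some time": ["noon"]}.get(phrase, [])


marry = "some people married in some location at some time".split()
roles = [("PERSON", (0, 2)), ("PLACE", (4, 6)), ("TIME", (7, 9))]
event, refined = extract_arguments(
    marry, roles, "Kim and Pat married at noon".split(), toy_get_args)
```

In this toy run the PLACE role stays unfilled, so its bleached phrase remains in the refined statement while PERSON and TIME are replaced.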

Model
Architecture for MRC. In light of recent advancements in NLP from large-scale pre-training, we use BERT (Devlin et al. 2019) as our sequence encoder. We first review the answer selector architecture for machine reading comprehension (MRC) used in BERT, then extend it for our approach.
Under the formulation of MRC, each training data point is of the form $(S, T)$, where $S$ is a natural language question with tokens $S = (s_1, \dots, s_n)$ and $T$ is the text to extract answers from, with tokens $T = (t_1, \dots, t_m)$. The model returns a span in $T$ or predicts that the question is not answerable, in which case an empty span is returned.
To perform MRC, Devlin et al. (2019) proposed the following architecture. First the question $S$ and the text $T$ are concatenated with special delimiters and passed through the BERT contextualizer:
$$\mathrm{BERT}([\mathrm{CLS}], s_1, \dots, s_n, [\mathrm{SEP}], t_1, \dots, t_m, [\mathrm{SEP}])$$
where CLS is a special sentinel token whose embedding encompasses the whole string, and SEP is a sentence separator. We denote the output encoding of each question token $s_i$ ($1 \le i \le n$) as $\mathbf{s}_i \in \mathbb{R}^d$, and the encoding of each text token $t_j$ ($1 \le j \le m$) as $\mathbf{t}_j \in \mathbb{R}^d$.
Additionally, two vectors, $\mathbf{b}_{\text{left}}$ and $\mathbf{b}_{\text{right}}$, for the left and right boundaries of the answer span are learned. The probability of each token $t_j$ ($1 \le j \le m$) being the left or right boundary of the answer span is computed as
$$P_{\text{left}}(t_j) = \operatorname{softmax}_j(\mathbf{b}_{\text{left}} \cdot \mathbf{t}_j), \qquad P_{\text{right}}(t_j) = \operatorname{softmax}_j(\mathbf{b}_{\text{right}} \cdot \mathbf{t}_j).$$
The two vectors $\mathbf{b}_{\text{left}}$ and $\mathbf{b}_{\text{right}}$ act as attention query vectors to the text, resulting in a soft pointer over the text tokens.
If the text has no answer span for a given question, both the left and right boundaries of the answer span should point to the CLS sentinel token instead of any text token $t_j$, i.e. $P_{\text{left}}(\mathrm{CLS})$ and $P_{\text{right}}(\mathrm{CLS})$ should be the maximum among all probabilities of the left and right boundaries.
At decoding time, the model selects the left boundary $l$ and right boundary $r$ such that $l \le r$ and the overall logit score $\mathbf{b}_{\text{left}} \cdot \mathbf{t}_l + \mathbf{b}_{\text{right}} \cdot \mathbf{t}_r$ is the highest.
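The decoding rule above (choose l ≤ r maximising the summed boundary logits) can be sketched with a single left-to-right pass. This is a sketch under assumptions: the function name `best_span` is hypothetical, the logits are taken as precomputed lists, and the no-answer decision is made by comparing the best span score with a combined CLS score `null_score`, which is one common convention and not necessarily the one used in this work.

```python
from typing import List, Optional, Tuple


def best_span(left: List[float], right: List[float],
              null_score: float) -> Optional[Tuple[int, int]]:
    """Return (l, r) with l <= r maximising left[l] + right[r], or None.

    left[j] and right[j] stand for the logits b_left . t_j and b_right . t_j;
    null_score stands for the combined logit of pointing both ends at CLS.
    """
    best, best_score = None, float("-inf")
    best_l = 0  # index of the largest left logit seen in positions 0..r
    for r in range(len(right)):
        if left[r] > left[best_l]:
            best_l = r
        score = left[best_l] + right[r]
        if score > best_score:
            best, best_score = (best_l, r), score
    return best if best_score > null_score else None


span = best_span([0.1, 2.0, 0.3, 0.0], [1.5, 0.2, 1.0, 0.4], null_score=0.0)
```

Tracking the running argmax of the left logits keeps the search linear in the text length while still respecting the l ≤ r constraint.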
Multiple Argument Selector. Our scenario is fundamentally different from MRC in two ways: (1) our query is not formulated as a natural language question; instead, it is a cloze-style problem with a natural language statement and a highlighted blank to fill; (2) in some cases, there can be more than one answer for a given blank. Previous MRC models support extracting at most one answer to a question.
To accommodate these requirements, we propose a new architecture for this scenario that implements the GETARGS function in Algorithm 1. Given a bleached statement $S = (s_1, \dots, s_n)$ with a highlighted placeholder span with indices $I$, we do not use two attention query vectors to point at the boundaries of a single answer span; instead, we treat answer span selection as a tagging problem (Yao et al. 2013), where answer spans are tagged using a linear-chain conditional random field (CRF) (Lafferty, McCallum, and Pereira 2001). By considering answer span selection as tagging, our model is able to select multiple spans for a query.
We enforce the constraint that all extracted spans come from the same sentence in the text, but in general this constraint need not be enforced.Additionally, our model operates on single-sentence contexts, so information available in other sentences is not considered.
We use the BIO tagging scheme (Ramshaw and Marcus 1995), where each token in the text is tagged with B (beginning), I (inside), or O (outside).
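To see why tagging allows several answers per query, the following sketch converts a B/I/O tag sequence into token spans. The function name `bio_to_spans` is hypothetical, and treating a stray I tag (one that does not follow B or I) as the start of a span is an assumption of this sketch, not something specified in the text.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Convert a B/I/O tag sequence into [start, end) token spans."""
    spans, start = [], None
    for j, tag in enumerate(tags):
        if tag == "B":                      # a new span begins (closing any open one)
            if start is not None:
                spans.append((start, j))
            start = j
        elif tag == "O":                    # outside: close the open span, if any
            if start is not None:
                spans.append((start, j))
            start = None
        elif tag == "I" and start is None:  # stray I: treated here as a span start
            start = j
    if start is not None:                   # close a span that runs to the end
        spans.append((start, len(tags)))
    return spans


text = "Kim and Pat married".split()
spans = bio_to_spans(["B", "O", "B", "O"])
answers = [" ".join(text[a:b]) for a, b in spans]
```

Here two separate spans come out of one tag sequence, which a single left/right boundary pair could not express.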
In a linear-chain CRF, the probability of an output tag sequence $y_1, \dots, y_m$ (for each $j$, $y_j \in \{B, I, O\}$) given the text $T$ is
$$P(y_1, \dots, y_m \mid T) \propto \exp\Big(\sum_{j=1}^{m} \psi(y_{j-1}, y_j, j)\Big)$$
where we define the potential function $\psi(y_{j-1}, y_j, j)$ as the output of a neural function described below. Our model is trained to maximize $P$ for the gold tag sequence.
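Given potentials of this form, the highest-probability tag sequence can be found with the standard Viterbi recursion. The text does not state which decoder is used, so the following is only a sketch of that standard technique; the function name `viterbi`, the convention of treating the tag before the first token as O, and the toy potential `toy_psi` are all assumptions.

```python
from typing import Callable, Dict, List, Sequence

TAGS = ("B", "I", "O")


def viterbi(m: int, psi: Callable[[str, str, int], float],
            tags: Sequence[str] = TAGS, start: str = "O") -> List[str]:
    """Return the tag sequence maximising sum_j psi(y_{j-1}, y_j, j)."""
    if m == 0:
        return []
    score: Dict[str, float] = {y: psi(start, y, 0) for y in tags}
    back: List[Dict[str, str]] = []
    for j in range(1, m):
        new_score, pointers = {}, {}
        for y in tags:
            prev = max(tags, key=lambda p: score[p] + psi(p, y, j))
            new_score[y] = score[prev] + psi(prev, y, j)
            pointers[y] = prev
        score = new_score
        back.append(pointers)
    y = max(tags, key=lambda t: score[t])
    path = [y]
    for pointers in reversed(back):  # follow back-pointers from the last token
        y = pointers[y]
        path.append(y)
    return path[::-1]


# Toy potential: per-token tag scores plus a large penalty for the O -> I transition.
EMIT = [{"B": 2.0, "I": 0.0, "O": 1.0},
        {"B": 0.0, "I": 1.5, "O": 1.0},
        {"B": 0.0, "I": 0.0, "O": 3.0}]


def toy_psi(prev: str, cur: str, j: int) -> float:
    penalty = -1e9 if (prev == "O" and cur == "I") else 0.0
    return EMIT[j][cur] + penalty


best_tags = viterbi(3, toy_psi)
```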
We first compute an attentive representation $\bar{\mathbf{s}}_j$ for a placeholder with respect to each text token $t_j$, using the attention mechanism proposed by Luong, Pham, and Manning (2015), since the placeholder is of variable length but we desire a fixed-size vector representation. Here $\bar{\mathbf{s}}_j$ is an attention-weighted combination of the encodings $\mathbf{s}_i$ of the placeholder tokens ($i \in I$). Then the attentive placeholder representation $\bar{\mathbf{s}}_j$, together with its corresponding text token representation $\mathbf{t}_j$, are joined using the matching methods proposed in Mou et al. (2016):
$$\mathbf{x}_j = [\bar{\mathbf{s}}_j ; \mathbf{t}_j ; |\bar{\mathbf{s}}_j - \mathbf{t}_j| ; \bar{\mathbf{s}}_j \odot \mathbf{t}_j] \quad (7)$$
yielding the joined feature vector $\mathbf{x}_j \in \mathbb{R}^{4d}$. Finally, the joined feature vector $\mathbf{x}_j$ is passed through a multi-layer feed-forward neural network to get the final potential function for each token and each predicted tag type $y_j \in \{B, I, O\}$. In our experiments, we pass $\mathbf{x}_j$ through 4 layers, with output dimensions $2d$, $d$, $d$, and 1, respectively, and tanh as the nonlinearity function between layers.
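The pooling and matching steps can be illustrated with plain Python lists. This is a sketch, not the trained model: the function names are hypothetical, and the attention score is taken to be a plain dot product, which is only one of the scoring variants in Luong, Pham, and Manning (2015); the text does not say which variant is used.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def attend(placeholder: List[Vector], t: Vector) -> Vector:
    """Pool the placeholder token vectors into one vector, attending from t."""
    scores = [dot(s, t) for s in placeholder]       # dot-product scores (assumed)
    peak = max(scores)                              # subtract max for stability
    weights = [math.exp(x - peak) for x in scores]
    total = sum(weights)
    return [sum(w / total * s[k] for w, s in zip(weights, placeholder))
            for k in range(len(t))]


def matching_features(s_bar: Vector, t: Vector) -> Vector:
    """Concatenate [s; t; |s - t|; s * t], giving a vector of size 4d."""
    return (s_bar + t
            + [abs(a - b) for a, b in zip(s_bar, t)]
            + [a * b for a, b in zip(s_bar, t)])


s_bar = attend([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])
x = matching_features([1.0, -2.0], [3.0, 0.5])
```

Whatever the placeholder length, `attend` returns a vector of size d, so `matching_features` always yields size 4d as in Equation (7).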
Trigger Identification. The trigger of an event can be thought of as a special argument, which is usually the main verb (or a nominalized verb) that expresses the occurrence of the event.
We reuse the argument selection model for trigger identification: the highlighted token set for the trigger is all tokens in the statement that are not part of any standard argument:
$$I_{\text{TRIGGER}} = \{1, \dots, n\} \setminus \bigcup_{k} I_k$$
For example, the highlighted token set for the trigger of the statement in Figure 2 consists of the tokens bracketed below:

someone [killed] someone else [with] something [in] some place [at] some time
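The trigger index set is simply the complement of the argument placeholders, as this short sketch shows. The function name `trigger_indices`, the 0-based indexing, and the role names AGENT and VICTIM are assumptions for illustration.

```python
from typing import Dict, List, Set


def trigger_indices(n: int, placeholders: Dict[str, Set[int]]) -> List[int]:
    """Return the indices of statement tokens not covered by any argument placeholder."""
    covered = set().union(*placeholders.values())
    return [i for i in range(n) if i not in covered]


tokens = "someone killed someone else with something in some place at some time".split()
# Hypothetical role names; 0-based token indices of each placeholder.
placeholders = {"AGENT": {0}, "VICTIM": {2, 3}, "INSTRUMENT": {5},
                "PLACE": {7, 8}, "TIME": {10, 11}}
trigger = trigger_indices(len(tokens), placeholders)
trigger_words = [tokens[i] for i in trigger]
```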

Training Data Generation
We generate data examples in the form of $(S, I, T)$ triples to train the argument extractor, where $S$ is a bleached statement, $I$ is the index set of the focused placeholder, and $T$ is the text. Algorithm 1 generates a sequence of bleached statements, where each successive statement is a refinement of its predecessor. During training, instead of replacing placeholders with their predicted arguments $A \leftarrow \mathrm{GETARGS}(S, I, T)$, we replace them with the gold argument(s) from the event extraction dataset.
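The generation of training triples can be sketched as the same left-to-right loop with gold arguments substituted for predictions. This is an illustrative sketch only: the function name `training_examples` is hypothetical, placeholders are simplified to contiguous `[start, end)` spans listed left to right, and each yielded example also carries the gold answers as a fourth element for convenience.

```python
from typing import Dict, Iterator, List, Tuple

Span = Tuple[int, int]  # [start, end) offsets of a placeholder in the statement
Example = Tuple[List[str], Span, List[str], List[str]]


def training_examples(statement: List[str],
                      roles: List[Tuple[str, Span]],
                      text: List[str],
                      gold: Dict[str, List[str]]) -> Iterator[Example]:
    """Yield (S, I, T, gold answers), refining S with gold arguments each round."""
    statement, shift = list(statement), 0
    for role, (start, end) in roles:
        start, end = start + shift, end + shift
        answers = gold.get(role, [])
        yield list(statement), (start, end), text, answers
        if answers:                               # refine with the gold argument(s)
            filler = " and ".join(answers).split()
            statement[start:end] = filler
            shift += len(filler) - (end - start)


marry = "some people married in some location at some time".split()
roles = [("PERSON", (0, 2)), ("PLACE", (4, 6)), ("TIME", (7, 9))]
examples = list(training_examples(
    marry, roles, "Kim and Pat married".split(), {"PERSON": ["Kim", "Pat"]}))
```

Roles with no gold argument still produce an example (with an empty answer list), so the model also sees cases where it should extract nothing.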
Negative Sampling. For trigger identification, we augment each example with negative samples from the set of event types not found in the example's text. For each event, α% of the non-occurring event types are taken as negative samples. We tune α ∈ {10, 20, 30, 40, 50}.
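A minimal sketch of this sampling step is given below. The function name `sample_negative_types`, the rounding of the sample size to the nearest integer, and the placeholder type names T0 ... T10 are assumptions; the text specifies only the rate α.

```python
import random
from typing import Iterable, List


def sample_negative_types(all_types: Iterable[str], present_types: Iterable[str],
                          alpha: float, rng: random.Random) -> List[str]:
    """Sample alpha percent of the event types that do not occur in the text."""
    candidates = sorted(set(all_types) - set(present_types))
    k = round(len(candidates) * alpha / 100)  # rounding rule is an assumption
    return rng.sample(candidates, k)


event_types = [f"T{i}" for i in range(11)]    # hypothetical ontology of 11 types
negatives = sample_negative_types(event_types, ["T0"], alpha=30, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible across runs.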

Recasting MRC Data for Pre-training
SQuAD (Rajpurkar et al. 2016) is a reading comprehension dataset consisting of questions on a set of Wikipedia articles, where the answer to each question is a span of text extracted from the corresponding reading passage. Its version 2.0 (Rajpurkar, Jia, and Liang 2018) contains additional data that poses unanswerable questions to reading comprehension systems. To do well, a system should learn to abstain from answering when no answer is supported by the text.
We employ recast versions of the training and development splits of SQuAD 2.0 as pre-training data for our event extraction system. We cast each SQuAD natural language question to a format similar to our bleached statements, where the wh-question phrases of the questions are tagged as the placeholders to be filled. For example, given the following SQuAD question,

What form of oxygen is composed of three oxygen atoms?

the extracted wh-phrase is "What form of oxygen", which is chosen as the single placeholder in this statement, yielding:

[What form of oxygen]_ANSWER is composed of three oxygen atoms?

which is answered as

[ozone]_ANSWER is composed of three oxygen atoms?

This methodology is linguistically motivated, as both questions (as in SQuAD) and bleached statements (this work) reduce to logical forms with the same predicate. The denotations can be written (using a generic operator Q) as

Qx. form-of-oxygen(x) ∧ composed-of-three-oxygen-atoms(x)

where Q is λ for the question and ∃ for the bleached statement. Hence the wh-phrase is semantically similar to an existentially quantified phrase (e.g. some form of oxygen, where some introduces existential quantification), despite their pragmatic difference in illocutionary force (inquiring vs. stating). Additionally, wh-phrases presuppose the existence of their answer referent; to use a wh-phrase when no referent exists would be infelicitous. Hence wh-question phrases serve the same function as the existentially quantified placeholder phrases in our bleached statements, and so the recast SQuAD questions are appropriate data for pre-training.
We extract wh-phrases through syntactic analysis of the questions. We define the wh-phrase of a question to be the maximum span in its constituency parse that bears any of the following tags in the Penn Treebank (Marcus, Santorini, and Marcinkiewicz 1993) parsing annotation guideline:
• WHADJP (wh-adjectival phrase): e.g. how hot;
• WHADVP (wh-adverbial phrase): e.g. why;
• WHNP (wh-noun phrase): e.g. which book;
• WHPP (wh-prepositional phrase): e.g. by whose authority;
• WDT (wh-determiner): e.g. which;
• WP (wh-pronoun): e.g. who;
• WP$ (possessive wh-pronoun): e.g. whose;
• WRB (wh-adverb): e.g. where.
We employ the neural span-based constituency parser (Stern, Andreas, and Klein 2017) in the AllenNLP (Gardner et al. 2018) toolkit to parse the SQuAD questions for extracting the wh-phrases.
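The wh-phrase rule can be illustrated on a bracketed Penn Treebank style parse. The sketch below does not call any parser: the parse string is hand-written for this example (an assumption, not real parser output), and the function names `wh_span` and `recast` are hypothetical. "Maximum span" is interpreted here as the widest node carrying one of the listed tags.

```python
import re
from typing import List, Optional, Tuple

WH_TAGS = {"WHADJP", "WHADVP", "WHNP", "WHPP", "WDT", "WP", "WP$", "WRB"}


def wh_span(parse: str) -> Optional[Tuple[int, int]]:
    """Return the [start, end) word span of the widest wh-labelled node, if any."""
    tokens = re.findall(r"\(|\)|[^\s()]+", parse)
    stack: List[Tuple[str, int]] = []  # (label, index of first word covered)
    n_words, best = 0, None
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":                 # open a node; the next token is its label
            stack.append((tokens[i + 1], n_words))
            i += 2
        elif tok == ")":               # close a node and check its label and width
            label, start = stack.pop()
            if label in WH_TAGS and (best is None or n_words - start > best[1] - best[0]):
                best = (start, n_words)
            i += 1
        else:                          # a leaf word
            n_words += 1
            i += 1
    return best


def recast(words: List[str], span: Tuple[int, int], answer: str) -> List[str]:
    """Replace the wh-phrase placeholder with an answer string."""
    start, end = span
    return words[:start] + answer.split() + words[end:]


# Hand-written parse of the example question (illustrative, not parser output).
PARSE = ("(SBARQ (WHNP (WHNP (WDT What) (NN form)) (PP (IN of) (NP (NN oxygen)))) "
         "(SQ (VBZ is) (VP (VBN composed) (PP (IN of) "
         "(NP (CD three) (NN oxygen) (NNS atoms))))) (. ?))")
words = "What form of oxygen is composed of three oxygen atoms ?".split()
span = wh_span(PARSE)
filled = recast(words, span, "ozone")
```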

Experiments and Discussions

Event Extraction on ACE 2005
We evaluate our approach on the ACE 2005 dataset and use the same data splits as previous work, in which 40 newswire documents are used as the test set, another 30 documents of different genres are selected as the development set, and the remaining 529 documents constitute the training set (Li, Ji, and Huang 2013; Yang and Mitchell 2016; Nguyen and Nguyen 2019). Following previous work, we use four evaluation metrics:
• Trigger Identification: a trigger is correctly identified if its span offsets exactly match a reference trigger.
• Trigger Classification: a trigger is correctly classified if its span offsets and event subtype exactly match a reference trigger.
• Argument Identification: an argument is correctly identified if its span offsets and corresponding event subtype exactly match a reference argument.
• Argument Classification: an argument is correctly classified if its span offsets, corresponding event subtype, and argument role exactly match a reference argument.
The overall performance is evaluated using precision (P), recall (R), and F-measure (F1) for each metric.
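All four metrics reduce to exact matching of tuples, with more fields in the tuple for the stricter metrics. The following sketch shows this for trigger classification; the function name `prf1` and the toy offsets and event subtypes are illustrative assumptions, not data from the experiments.

```python
from typing import Hashable, Iterable, Tuple


def prf1(predicted: Iterable[Hashable], gold: Iterable[Hashable]) -> Tuple[float, float, float]:
    """Precision, recall and F1 under exact match of the given tuples."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Trigger classification: (start offset, end offset, event subtype) must all match.
gold_triggers = {(3, 4, "LIFE:DIE")}
pred_triggers = {(3, 4, "LIFE:DIE"), (7, 8, "CONFLICT:ATTACK")}
p, r, f = prf1(pred_triggers, gold_triggers)
```

Dropping the subtype from each tuple gives trigger identification, and adding a role field to argument tuples gives argument classification.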
We use BERT for sequence encoding. For pre-training on the recast SQuAD 2.0 dataset, we follow the previously mentioned pre-processing strategy. We pre-train on the training set of SQuAD 2.0 and perform early stopping using the development partition. Examples that do not have exactly 1 wh-phrase are discarded. The maximum sequence length is 512 word pieces, the maximum query length is 128 word pieces, the learning rate is $3 \times 10^{-5}$ with an Adam optimizer, the maximum gradient norm for gradient clipping is set to 1.0, and the number of training epochs is 3.
After pre-training on SQuAD 2.0, we fine-tune the model on ACE 2005. While keeping other hyperparameters unchanged, we set the learning rate to $1 \times 10^{-5}$ and the number of training epochs to 8. During fine-tuning, we employ negative sampling and set the negative sampling rate to 30%.
In addition to fine-tuning on the full training set of ACE 2005, we consider a single-genre "partial" training setting in which the model is trained only on the 58 documents that appear in the newswire portion of the full training set.

Experimental Results & Discussion
We train our model using full and partial training data and compare with two joint event extraction model baselines. The JOINTFEATURE model (Yang and Mitchell 2016) is a feature-based model that exploits document-level information; the JOINT3EE model (Nguyen and Nguyen 2019) is a neural model that achieves state-of-the-art performance on ACE 2005. These two models represent the state-of-the-art performance for feature-based and neural models, respectively.
Table 1 reports the performance of the systems on the four evaluation metrics. Training on the full training set improves F1 performance over training on the partial training set, giving the largest improvement on trigger identification and classification. Additionally, pre-training on the recast SQuAD 2.0 dataset provides large F1 improvements (4.8%-10.2% absolute F1 increase) on all four evaluation metrics. Our model also tends to have higher recall than precision, especially on trigger identification and classification, and suffers from low precision compared to prior work. Our model achieves state-of-the-art performance on trigger identification and trigger classification.
Because our model does not explicitly incorporate entity mention detection, we hypothesize that our MRC-inspired approach predicts answer spans that are semantically correct but do not exactly match the gold answers, hurting performance on argument-related subtasks. We compare predicted arguments with gold references and find the following sources of errors:
• Relative clauses: Our model predicts Mosul whereas the gold answer is Mosul, where U.S. troops killed 17 people in clashes earlier in the week.
• Counts: The gold annotation is 300 billion yen but our model predicts 300 billion.
• Durations: The gold annotation is lasted two hours but our model predicts two hours.

Few-shot Learning on FrameNet
In order to evaluate our approach under a lower-resource setting than the partial training setting, we consider few- and zero-shot learning on annotated documents from FrameNet. In the few-shot setting, we train on between 1 and 5 documents.⁹ We pick the top-10 most frequent frames that represent events and write bleached statements based on their frame definitions.
Following the same pre-training setup, we then train the model using the same hyperparameters as the model fine-tuned on ACE 2005.

Zero-shot Learning on FrameNet
We additionally investigate the model's ability to generalize to unseen event types using the same dataset as the few-shot setting. We apply a leave-one-out strategy over the frames in Table 2, training on 9 frames and testing on the remaining one.

9 Because we do not perform a hyperparameter sweep in this setting, we do not use a development set.

Experimental Results & Discussion
The results in Table 2 reveal a large variation in performance across the frames. The best performance is achieved on the ATTACK frame, but frames such as STATEMENT achieve poor performance. We report the macro-averaged F1 over all frames rather than the micro-average to reveal overall performance, since we care how the approach generalizes to different frames. Overall, the macro-averaged F1 shows that the model can feasibly extract information about events of unseen types, but performance varies greatly across frames.
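The difference between the two averages can be seen in a short sketch. The frame names and the true-positive, false-positive and false-negative counts below are made-up toy numbers chosen only to show the effect; they are not results from this work, and the function names are hypothetical.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def f1(tp: int, fp: int, fn: int) -> float:
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def macro_f1(counts: Dict[str, Counts]) -> float:
    """Average the per-frame F1 scores, so every frame counts equally."""
    return sum(f1(*c) for c in counts.values()) / len(counts)


def micro_f1(counts: Dict[str, Counts]) -> float:
    """Pool the counts over frames first, so frequent frames dominate."""
    tp, fp, fn = (sum(c[k] for c in counts.values()) for k in range(3))
    return f1(tp, fp, fn)


# Toy counts for two hypothetical frames: one frequent and easy, one rare and hard.
toy = {"FRAME_A": (90, 10, 10), "FRAME_B": (1, 9, 9)}
macro, micro = macro_f1(toy), micro_f1(toy)
```

With these toy counts the micro-average is dominated by the frequent frame, while the macro-average exposes the weak frame, which is the behaviour the choice of macro-averaging is meant to capture.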
Possible reasons why the STATEMENT frame has low performance include: (1) the event type being too general, (2) the bleached statement being poorly constructed, and (3) the span for the "message" role being long and difficult to tag exactly correctly using the BIO scheme.

Conclusion & Future Work
We present an approach to event extraction that uses bleached statements to give a model access to information contained in annotation manuals. Our model incrementally refines the statements with values extracted from text. We also demonstrate the feasibility of making predictions on event types seen rarely or not at all. Future work can apply our approach to n-ary relation extraction.

Figure 1: Comparison of data sources for human annotators, traditional information extraction systems, and our proposed approach. Human annotators use annotation guidelines and limited illustrative examples, traditional systems use large amounts of labeled examples, and our proposed system uses bleached statements (derived from annotation guidelines) and large amounts of labeled examples.

Figure 2: An example of our approach on a sentence from the ACE 2005 dataset for the LIFE:DIE event type. The bleached statement is incrementally populated with values from the text (in the order denoted by the superscripts), and not all event arguments are supported by the text. The red circles denote the model. The grayed out text in the paragraph is given for context to the reader, but our model operates on single-sentence contexts.

We expect multiple arguments for the same role PERSON in the LIFE:MARRY event. If our model returns $A$ = {"Kim", "Pat"}, i.e. a set containing multiple extracted arguments, we replace the placeholder with all arguments, concatenated with the "and" token, and shift the focus to the next placeholder, creating the refined statement:

[Kim and Pat]_PERSON married in [some location]_PLACE at [some time]_TIME

If the model returns nothing, i.e. $A = \emptyset$, we simply skip the placeholder and move the focus to the next placeholder. For example, if the model finds no argument for the PLACE role, the refined statement of the next iteration would still be

[Kim and Pat]_PERSON married in [some location]_PLACE at [some time]_TIME

with the focus moved to the TIME placeholder.

Instead of using two attention query vectors $\mathbf{b}_{\text{left}}$ and $\mathbf{b}_{\text{right}}$ to get the left and right boundaries of the answer span, we consider the problem of answer span selection as a tagging problem, first proposed in Yao et al. (2013).
The results in Figure 3 show that F1 performance on all evaluation metrics increases as more documents are added to the training set. The marginal utility of adding training documents decreases almost monotonically as the training set grows, so that performance from training on 3 documents roughly matches performance from training on 5 documents.

Table 2: Zero-shot results on FrameNet frames. Trigger classification performance is equivalent to trigger identification performance because we evaluate on only one frame in each zero-shot learning experiment.
Macro-averaged F1: 19.6 19.6 13.4 12.46