Resource-Enhanced Neural Model for Event Argument Extraction

Event argument extraction (EAE) aims to identify the arguments of an event and classify the roles that those arguments play. Despite great efforts made in prior work, there remain many challenges: (1) data scarcity; (2) capturing the long-range dependency, specifically, the connection between an event trigger and a distant event argument; and (3) integrating event trigger information into candidate argument representations. For (1), we explore using unlabeled data. For (2), we use a Transformer whose attention mechanism is guided by dependency parses. For (3), we propose a trigger-aware sequence encoder with several types of trigger-dependent sequence representations. We also support argument extraction either from text annotated with gold entities or from plain text. Experiments on the English ACE 2005 benchmark show that our approach achieves a new state-of-the-art.


Introduction
Event argument extraction (EAE) aims to identify the entities that serve as arguments of an event and to classify the specific roles they play. As in Fig. 1, "two soldiers" and "yesterday" are arguments, where the event triggers are "attacked" (event type ATTACK) and "injured" (event type INJURY). For the trigger "attacked", "two soldiers" plays the argument role Target while "yesterday" plays the role Attack Time. For the trigger "injured", "two soldiers" and "yesterday" play the roles Victim and Injury Time, respectively. There has been significant work on event extraction (EE) (Liao and Grishman, 2010; Hong et al., 2011; Li et al., 2013), but the EAE task remains a challenge and has become the bottleneck for improving the overall performance of EE (Wang et al., 2019a).

Supervised data for EAE is expensive and hence scarce. One possible solution is to use other available resources such as unlabeled data. To that end, (1) we use BERT (Devlin et al., 2018) as our model encoder, which leverages a much larger unannotated corpus where semantic information is captured. Unlike Yang et al. (2019), who added a final prediction layer to BERT for argument extraction, we use BERT as a token embedder and build a sequence of EAE task-specific components on top of it (Sec. 2). (2) We use (unlabeled) in-domain data to adapt the BERT parameters in a subsequent pretraining step, as in Gururangan et al. (2020). This makes the encoder domain-aware. (3) We perform self-training to construct auto-labeled (silver) data.
A crucial aspect of EAE is to integrate event trigger information into the learned representations. This matters because arguments depend on triggers: the same argument span plays completely different roles with respect to different triggers. An example is shown in Fig. 1, where "two soldiers" plays the role Target for the ATTACK event and the role Victim for the INJURY event. (EAE has similarities with semantic role labeling: event triggers are comparable to predicates in SRL. However, the roles in most SRL datasets follow a standard who-did-what-to-whom convention, whereas EAE has a custom taxonomy of roles per domain. We also draw inspiration from the SRL body of work (Strubell et al., 2018; Wang et al., 2019b; He et al., 2017; Marcheggiani and Titov, 2017).) Different from existing work that relies on regular sequence encoders, we design a novel trigger-aware encoder which simultaneously learns four different types of trigger-informed sequence representations.
Capturing long-range dependencies is another important factor, e.g., the connection between an event trigger and a distant argument. Syntactic information can be useful here, as it helps bridge the gap between a word and a distant but highly related word (Sha et al., 2018; Liu et al., 2018; Strubell et al., 2018). We modify a Transformer (Vaswani et al., 2017) by explicitly incorporating syntax via an attention layer driven by the dependency parse of the sequence.
We design our role-specific argument decoder to seamlessly accommodate both settings (with and without the availability of entities). We also tackle the role overlap problem (Yang et al., 2019) using a set of classifiers or taggers in our decoder.
Our model achieves the new state-of-the-art on ACE2005 Events data (Grishman et al., 2005).

Task Setup
An event trigger g is a span x_ab indicating an event of type y_g, where y_g belongs to a fixed set of pre-defined trigger types. Given a sequence–trigger pair (X, g) as input, EAE has two goals: (1) identify all argument spans in X, and (2) classify the role r of each argument. In some settings, a set of entities is given (each entity is a span in X), and these entities serve as the candidate pool for arguments. For example, "two soldiers" and "yesterday" are candidate entities in Fig. 1.

Trigger-Aware Sequence Encoder: This encoder is designed to distinguish candidate arguments conditioned on different triggers. Note that a span may encode different argument information for two triggers; for example, in Fig. 1, "two soldiers" plays the role Target for the ATTACK event and Victim for the INJURY event. To model this, our encoder uses BERT to embed the input tokens, where the BERT embedding of token x_t is denoted b_t. A segment (0/1) embedding seg_t for each token x_t, indicating whether x_t belongs to the trigger (Logeswaran et al., 2019, inter alia), is summed with the token embedding and position embedding as input to BERT (Fig. 2). The encoder then concatenates the following learned representations for each token: (1) a trigger representation h_g obtained by max pooling over the BERT embeddings of the tokens in trigger g; (2) a trigger type embedding p_{y_g} for y_g; (3) a trigger indicator (0/1) embedding l_t, indicating whether x_t belongs to the trigger; and (4) the token embedding b_t. This yields a trigger-aware representation c_t = Concat(b_t; p_{y_g}; l_t; h_g) for each token, and C for the whole sequence of T tokens.
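As a toy illustration of this concatenation, the sketch below builds c_t from plain Python lists. All helper names, dimensions, and list-based "embeddings" are ours, not the authors' implementation:

```python
def max_pool(vectors):
    """Element-wise max over a list of equal-length vectors (trigger pooling)."""
    return [max(col) for col in zip(*vectors)]

def trigger_aware_repr(bert_embs, trigger_span, type_emb, indicator_embs):
    """bert_embs: list of per-token BERT vectors b_t; trigger_span: (a, b) token
    indices of the trigger g; type_emb: trigger type embedding p_{y_g};
    indicator_embs: {0: vec, 1: vec} trigger-indicator embeddings l_t."""
    a, b = trigger_span
    h_g = max_pool(bert_embs[a:b + 1])  # (1) pooled trigger representation h_g
    reprs = []
    for t, b_t in enumerate(bert_embs):
        l_t = indicator_embs[1 if a <= t <= b else 0]  # (3) trigger indicator
        # list '+' is concatenation: c_t = Concat(b_t; p_{y_g}; l_t; h_g)
        reprs.append(b_t + type_emb + l_t + h_g)
    return reprs
```

Because h_g, p_{y_g}, and l_t are shared or trigger-dependent, every token's c_t changes when the same sentence is paired with a different trigger, which is exactly the point of the encoder.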

Modeling Argument Extraction
Syntax-Attending Transformer: Dependency parsing has been used as a feature to improve EE (Sha et al., 2018; Liu et al., 2018). Inspired by Strubell et al. (2018), we utilize dependency parses by modifying one attention head in each layer of a Transformer. Note that this Transformer is separate from the BERT component: it aims to capture long-range dependencies on top of the trigger-aware representations learned by our sequence encoder. The output C of our encoder becomes the input to this Transformer and passes through L layers of the modified syntax-attending Transformer, each with N self-attention heads. For each layer l, we modify one of these N heads to be a dependency-based attention head (the d-head) with output H^l:

    Q^l = U^{l-1} W^l_Q,   K^l = U^{l-1} W^l_K,   V^l = U^{l-1} W^l_V    (1)
    A^l = softmax(Q^l (K^l)^T / sqrt(d_k)) V^l                           (2)

where Q^l, K^l, and V^l are the query, key, and value representations and the W's are learned parameters. U^l = {u^l_1, ..., u^l_T} is the layer-l output of our Transformer, with U^0 = C. Eq. 2 uses scaled dot-product attention (Vaswani et al., 2017). The d-head differs from the other heads in how its keys K and values V are constructed: for each token x_t, the valid keys and values are restricted to those tokens x_j such that x_t and x_j share an edge in the dependency parse of X. This makes every a^l_t ∈ A^l a weighted attention sum over the neighbor values v^l_j of token x_t in the dependency parse. We then concatenate a^l_t with a linear projection of the token's own representation u^l_t, and project the result back to the same dimensions as the outputs of the other N − 1 attention heads. By concatenating the outputs of all heads, our model captures both syntax-informed and globally-attending information. The final output of this Transformer component is U^L = {u^L_1, u^L_2, ..., u^L_T}.

Role-Specific Argument Decoder: We consider two settings: (1) with and (2) without entities.
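Returning to the syntax-attending Transformer, the d-head's parse-restricted attention can be sketched in NumPy as follows. This is a hedged illustration: the masking strategy, the self-edge, and all names are our assumptions, not the authors' code.

```python
import numpy as np

def d_head_attention(U, W_q, W_k, W_v, dep_edges):
    """Dependency-restricted self-attention head (d-head sketch).
    U: (T, d) layer input; dep_edges: set of undirected (i, j) parse edges.
    Allowing each token to attend to itself is our assumption."""
    T = U.shape[0]
    Q, K, V = U @ W_q, U @ W_k, U @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[1])  # scaled dot-product scores
    mask = np.full((T, T), -np.inf)
    for i in range(T):
        mask[i, i] = 0.0                    # self edge (assumption)
    for i, j in dep_edges:
        mask[i, j] = mask[j, i] = 0.0       # parse neighbors are the valid keys
    scores = scores + mask
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over valid keys
    return weights @ V  # each row: weighted sum over the token's neighbor values
```

In a full layer this output would be concatenated with a projection of each token's own representation and with the other, unrestricted heads, as described above.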
When entities are provided, they are used to form candidates for arguments; when they are not provided, our model infers arguments from plain text.
For (1), we assume that all arguments are entities, but not vice versa. We therefore treat all entity spans within a fixed sentence window around the trigger g as candidate arguments. An entity representation is formed by pooling u^L_t over all tokens x_t in the entity span. Note that, since the encoder is trigger-aware, this representation is already conditioned on (X, g) for role classification. Commonly used datasets like ACE2005 have a 10% role overlap problem (Yang et al., 2019). Concretely, consider the sentence "The suicide bomber died in the blast he set off". Here, "suicide bomber" plays two distinct roles, Attacker and Victim, for the same trigger "blast", which denotes an ATTACK event. Hence, we perform role classification for every role independently (as a multi-label classification problem), using a set of classifiers where each classifier handles one particular role, i.e., it is role-specific (e.g., Victim, Target, or Origin, shown in orange in Fig. 2). We thus call this decoder the role-specific argument decoder.
More specifically, we apply one binary classifier per role permissible for the current trigger type to this entity representation. The outcome of the classifier for role r determines whether the entity plays role r for the current trigger.
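A minimal sketch of this per-role binary classification follows. The weights, role names, and the 0.5 threshold are illustrative placeholders, not values from the paper:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def classify_roles(entity_repr, role_classifiers, threshold=0.5):
    """entity_repr: pooled entity vector; role_classifiers: {role: (w, b)},
    one independent binary classifier per permissible role. Because roles are
    decided independently, overlapping roles can fire for the same entity."""
    predicted = []
    for role, (w, b) in role_classifiers.items():
        score = sigmoid(sum(x * wi for x, wi in zip(entity_repr, w)) + b)
        if score >= threshold:
            predicted.append(role)
    return predicted
```

With suitable weights, the same entity representation can be labeled both Attacker and Victim, which a single softmax over roles could not do.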
For (2), in the absence of entities, we have no candidate argument spans. Using the final-layer output of the syntax-attending Transformer, we predict a sequence of BIO tags with one sequence tagger per role. In this setting, the role-specific argument decoder thus comprises a set of sequence taggers.
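Decoding one role's BIO tag sequence into argument spans might look like the sketch below; the decoding heuristics (e.g., how a stray "I" is treated) are our illustration, as the paper does not specify them:

```python
def bio_to_spans(tags):
    """Decode one role's BIO tags into (start, end) argument spans, inclusive.
    A sketch of per-role tagger post-processing, not the authors' code."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                spans.append((start, i - 1))  # close the previous span
            start = i                          # open a new span
        elif tag == "O":
            if start is not None:
                spans.append((start, i - 1))
                start = None
        # "I" simply extends the currently open span
    if start is not None:
        spans.append((start, len(tags) - 1))   # close a span ending at the sequence end
    return spans
```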

Training Regimes for Data Scarcity
Domain-adaptive pretraining: An additional phase of in-domain pretraining has been shown to be effective for downstream tasks (Gururangan et al., 2020). Motivated by this, we perform a second phase of domain-adaptive pretraining with both BERT pretraining losses (masked language modeling and next-sentence prediction) before fine-tuning the BERT encoder.

Self-training: For self-training (Scudder, 1965; Chapelle et al., 2009, inter alia), we first train our model on the gold data. Next, we use that model to tag unlabeled data, yielding a much larger but noisy silver dataset (Sec. 3). We then train a new version of our model on the silver dataset; the resulting model is finally fine-tuned on the gold data.
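The three-stage regime above can be summarized as a short Python sketch, where train and tag are placeholder callables standing in for model training and inference, not the authors' API:

```python
def self_training(gold_data, unlabeled_data, train, tag):
    """Three-stage self-training sketch.
    train(init, data) -> model; tag(model, x) -> predicted labels for x."""
    teacher = train(init=None, data=gold_data)                     # 1) train on gold
    silver_data = [(x, tag(teacher, x)) for x in unlabeled_data]   # 2) auto-label silver
    student = train(init=None, data=silver_data)                   # 3) train on silver...
    return train(init=student, data=gold_data)                     # ...then fine-tune on gold
```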
Auxiliary tasks: Although trigger detection is not the focus of this work, we model it as an auxiliary task to help EAE. We share the BERT encoder (Sec. 2.2) between the two tasks. The trigger detection task uses the standard sequence-tagging decoder for BERT (Devlin et al., 2018). The intuition is that performing trigger detection improves (1) the representation of the shared BERT component and (2) the trigger representation.
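The joint training loop alternates between the two tasks by sampling; a hedged sketch is below. The loader interfaces are our illustration, while the 0.9/0.1 split is the one reported in the Experiments section:

```python
import random

def alternating_batches(main_loader, aux_loader, p_main=0.9, steps=100, rng=None):
    """Yield (task, batch) pairs, sampling the main EAE task with probability
    p_main and the auxiliary trigger-detection task otherwise. Loaders are any
    iterators of batches; rng allows reproducible sampling."""
    rng = rng or random.Random()
    for _ in range(steps):
        if rng.random() < p_main:
            yield "eae", next(main_loader)
        else:
            yield "trigger", next(aux_loader)
```

Sampling (rather than a fixed interleaving schedule) keeps the expected task mix at 9:1 while avoiding any periodic ordering effects; whether the authors sampled or scheduled deterministically is not stated, so this is one plausible reading.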

Experiments
Data and Tools: We use the ACE 2005 English Event data (Grishman et al., 2005) with the standard splits of Li et al. (2013): 529 documents (14,385 sentences) for training, 30 documents (813 sentences) for development, and 40 documents (632 sentences) for test. For self-training and domain-adaptive pretraining, we randomly sample 50k documents (626k sentences) from Gigaword (https://catalog.ldc.upenn.edu/LDC2011T07) to construct silver data. We use Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/) for tokenization, sentence segmentation, and dependency parsing.

Training Setup: We use 50 dimensions for the trigger indicator and trigger type embeddings. The syntax-attending Transformer has L = 2 layers with N = 2 attention heads and dropout of 0.1. When entities are available, we only consider entities in the same sentence as the trigger as candidates for argument extraction. During training, we use Adam (Kingma and Ba, 2014) as the optimizer with a batch size of 32 for both the main EAE task and the auxiliary trigger detection task; we alternate between batches of the main and auxiliary tasks with probabilities of 0.9 and 0.1, respectively. We stop training early if performance on the development set does not improve for 20 epochs. All model parameters are fine-tuned during training. For BERT pretraining, we use the same settings as Devlin et al. (2018) but with an initial learning rate of 1e-5, and we stop pretraining after 10k steps. To obtain reliably predicted triggers as input for EAE, we trained a five-model ensemble trigger detection system following Wadden et al. (2019). Since trigger detection is not our main task and improving it is not the focus of this work, its results are not meant for comparison and are excluded from the main result tables.

Results and Analyses: Table 1 shows the results in two experimental settings: with and without entities. In the setting with entities, we use gold entities, as in prior work. We make the following observations: (1) Our model achieves the best RC results (overall F1 scores) ever reported in both experimental settings.
(2) Our model does not achieve the highest scores on AI. However, since good RC performance requires good AI, our model appears to narrow this gap by coupling these two mutually dependent sub-tasks more closely.
(3) Self-training yields gains of about 1 F1 point both with and without entities. (4) Domain-adaptive pretraining shows small improvements on both AI and RC (smaller than those from self-training). There are two possible reasons. First, Gigaword is news text while ACE covers more than news, so we adapted to only part of the target domains. Second, even with a small learning rate during pretraining, 50k unlabeled documents is a small amount of data for pretraining.

Related Work
Event argument extraction (EAE) is an important subtask of event extraction (EE). Early studies designed lexical, contextual, or syntactic features to tackle the EE problem (Ji and Grishman, 2008; Liao and Grishman, 2010; Hong et al., 2011; Li et al., 2013). Later on, neural networks (Yubo et al., 2015; Sha et al., 2016; Nguyen et al., 2016; Sha et al., 2018; Liu et al., 2018; Yang et al., 2019; Wang et al., 2019a) demonstrated their effectiveness in representation learning without manual feature engineering. Our proposed model belongs to the latter category.
Here we discuss the studies most closely related to our work. Yang et al. (2019) used a pre-trained model with a state-machine-based span boundary detector and relied on heuristics to resolve final span boundaries. Wang et al. (2019a) also used a pre-trained model, together with a handcrafted conceptual hierarchy. Our approach requires neither such heuristics nor a conceptual hierarchy. In terms of modeling, their approaches used regular BERT as the encoder, so argument representations are not explicitly conditioned on triggers. In contrast, our encoder is enhanced with trigger-oriented information, with BERT serving as only one of its components, resulting in a trigger-aware sequence encoder. This allows us to better model interactions between arguments and triggers. Liu et al. (2018) added a GCN layer to integrate syntactic information into a neural model. Different from their solution, we encode syntax jointly with the attention mechanism, which simplifies learning, makes it more efficient, and achieves better results. Finally, no prior work has studied the data scarcity issue in EAE in depth, while we exploit several techniques to tackle it in this work.

Conclusion
We present a new model that achieves the best reported results on the EAE task. The model generates trigger-aware argument representations, incorporates syntactic information via dependency parses, and handles the role overlap problem with a role-specific argument decoder. We also experiment with several methods to address the data scarcity issue. Experimental results show the effectiveness of our proposed approaches.