Jointly Multiple Events Extraction via Attention-based Graph Information Aggregation

Event extraction is of practical utility in natural language processing. In the real world, it is a common phenomenon that multiple events exist in the same sentence, and extracting them is more difficult than extracting a single event. Previous works on modeling the associations between events with sequential modeling methods suffer from low efficiency in capturing very long-range dependencies. In this paper, we propose a novel Jointly Multiple Events Extraction (JMEE) framework to jointly extract multiple event triggers and arguments by introducing syntactic shortcut arcs to enhance information flow and attention-based graph convolution networks to model graph information. The experimental results demonstrate that our proposed framework achieves competitive results compared with state-of-the-art methods.


Introduction
Extracting events from natural language text is an essential yet challenging task for natural language understanding. Given a document, event extraction systems need to recognize event triggers with their specific types and their corresponding arguments with their roles. Technically speaking, as defined by the ACE 2005 dataset (https://catalog.ldc.upenn.edu/ldc2006t06), a benchmark for event extraction (Grishman et al., 2005), the event extraction task can be divided into two subtasks, i.e., event detection (identifying and classifying event triggers) and argument extraction (identifying the arguments of event triggers and labeling their roles).
In event extraction, it is a common phenomenon that multiple events exist in the same sentence. Extracting the correct multiple events from such sentences is much more difficult than in the one-event-one-sentence cases because the various types of events are often associated with each other. For example, in the sentence "He left the company, and planned to go home directly.", the trigger word left may trigger a Transport (a person left a place) event or an End-Position (a person retired from a company) event. However, if we take the following event triggered by go into consideration, we are more confident in judging it as a Transport event rather than an End-Position event. This phenomenon is quite common in the real world: Injure and Die events are more likely to co-occur with Attack events than others, whereas Marry and Born events are less likely to co-occur with Attack events. As we investigated in the ACE 2005 dataset, around 26.2% (1042/3978) of sentences belong to this category.
Significant efforts have been dedicated to solving this problem. Most of them exploit various features (Liu et al., 2016b; Yang and Mitchell, 2016; Li et al., 2013; Keith et al., 2017; Liu et al., 2016a; Li et al., 2015), introduce memory vectors and matrices, introduce more transition arcs (Sha et al., 2018), or keep more contextual information (Chen et al., 2015) in sentence-level sequential modeling methods such as RNNs and CRFs. Some also seek features with document-level methods (Liao and Grishman, 2010; Ji and Grishman, 2008). However, sentence-level sequential modeling methods suffer from low efficiency in capturing very long-range dependencies, while the feature-based methods require extensive human engineering, which also largely affects model performance. Besides, these methods do not adequately model the associations between events.
An intuitive way to alleviate this phenomenon is to introduce shortcut arcs, represented by linguistic resources such as dependency parsing trees, to drain the information flow from a point to its target through fewer transitions.

Figure 1: An example of a dependency parsing result produced by Stanford CoreNLP. There are two events in the sentence: a Die event triggered by the word killed with four arguments in red, and an Attack event triggered by the word barrage with three arguments in blue. The red dotted arc is the shortcut path consisting of three directed arcs from the trigger killed to the other trigger barrage.

Compared with the sequential order, modeling with these arcs often substantially reduces the number of hops needed from one event trigger to another in the same sentence. In Figure 1, for example, there are two events: a Die event triggered by the word killed with four arguments in red, and an Attack event triggered by the word barrage with three arguments in blue. We need six hops from killed to barrage according to the sequential order, but only three hops along the arcs in the dependency parsing tree (along the nmod arc from killed to witnesses, along the acl arc from witnesses to called, and along the xcomp arc from called to barrage). These three arcs constitute a shortcut path, draining the dependency syntactic information flow from killed to barrage with fewer hops.

In this paper, we propose a novel Jointly Multiple Events Extraction (JMEE) framework by introducing syntactic shortcut arcs to enhance information flow and attention-based graph convolution networks to model the graph information. To implement modeling with the shortcut arcs, we adopt graph convolutional networks (GCNs) (Kipf and Welling, 2016; Marcheggiani and Titov, 2017; Nguyen and Grishman, 2018) to learn a syntactic contextual representation of each node from the representative vectors of its immediate neighbors in the graph. We then utilize the syntactic contextual representations to extract triggers and arguments jointly with a self-attention mechanism that aggregates information while especially keeping the associations between multiple events.
We extensively evaluate the proposed JMEE framework on the widely-used ACE 2005 dataset to demonstrate its benefits, especially in capturing the associations between events. To summarize, our contributions in this work are as follows: • We propose a novel joint event extraction framework JMEE based on syntactic structures, which enhances information flow and alleviates the phenomenon where multiple events appear in the same sentence.
• We propose a self-attention mechanism that aggregates information while especially keeping the associations between multiple events, and show it is useful in event extraction.
• We achieve state-of-the-art performance on the widely used dataset for event extraction using the proposed model with GCNs and the self-attention mechanism.

Approach
Generally, event extraction can be cast as a multi-class classification problem deciding whether each word in the sentence forms a part of an event trigger candidate and whether each entity in the sentence plays a particular role in the event triggered by the candidate triggers. There are two main approaches to event extraction: (i) the joint approach that extracts event triggers and arguments simultaneously as a structured prediction problem, and (ii) the pipelined approach that first performs trigger prediction and then identifies arguments in separate stages. We follow the joint approach, which can effectively avoid the errors propagated along the pipeline. Additionally, we extract events at the sentence level. • The entity type label embedding vector of w_i: Similarly to the POS-tagging label embedding vector of w_i, we annotate the entity mentions in a sentence using the BIO annotation schema and transform the entity type labels into real-valued vectors by looking up the embedding table. Note that we use the whole entity extent in the ACE 2005 dataset, which contains overlapping entity mentions, and we sum all the possible entity type label embedding vectors for each token.
The transformation from the token w_i to the vector x_i essentially converts the input sentence W into a sequence of real-valued vectors X = (x_1, x_2, ..., x_n), which will be fed into the later modules to learn more effective representations for event extraction.
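As a concrete sketch of this transformation (the dimensions and table contents below are invented for illustration; the paper uses learned and pretrained embedding tables), the per-token vector x_i is the concatenation of the word embedding, the POS-tagging label embedding, and the summed entity type label embeddings:

```python
import numpy as np

# Hypothetical dimensions; the paper does not fix them in this excerpt.
D_WORD, D_POS, D_ENT = 8, 4, 4
rng = np.random.default_rng(0)

# Toy embedding tables (in practice: pretrained word vectors, learned label tables).
word_table = {"he": rng.normal(size=D_WORD), "left": rng.normal(size=D_WORD)}
pos_table = {"PRP": rng.normal(size=D_POS), "VBD": rng.normal(size=D_POS)}
ent_table = {"O": np.zeros(D_ENT), "B-PER": rng.normal(size=D_ENT)}

def token_vector(word, pos, ent_labels):
    """Concatenate word, POS, and (summed) entity-type embeddings into x_i."""
    # Overlapping entity mentions: sum all applicable entity-label embeddings.
    ent_vec = np.sum([ent_table[e] for e in ent_labels], axis=0)
    return np.concatenate([word_table[word], pos_table[pos], ent_vec])

x = token_vector("he", "PRP", ["B-PER"])
assert x.shape == (D_WORD + D_POS + D_ENT,)
```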

Syntactic Graph Convolution Network
Consider a graph G = (V, E) as the syntactic parsing tree of sentence W, where V = {v_1, v_2, ..., v_n} (|V| = n) and E are the sets of nodes and edges, respectively. In V, each v_i is the node representing token w_i in W. Each edge (v_i, v_j) ∈ E is a directed syntactic arc from token w_i to token w_j, with the type label K(w_i, w_j). Additionally, to allow information to flow against the direction of an arc, we also add the reversed edge (v_j, v_i) with the type label K′(w_i, w_j). Following Kipf and Welling (2016), we also add all the self-loops, i.e., (v_i, v_i) for any v_i ∈ V. For example, in the dependency parsing tree shown in Figure 1, there are four arcs in the subgraph with only the two nodes "killed" and "witnesses": the dependency arc with the type label K("killed", "witnesses") = nmod, the reversed dependency arc with the type label K("witnesses", "killed") = nmod′, and the two self-loops of "killed" and "witnesses" with the type label K("killed", "killed") = K("witnesses", "witnesses") = loop.
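The graph construction above (original arcs, reversed arcs, and self-loops) can be sketched in a few lines; the edge-list representation and label-naming convention here are our own:

```python
def build_syntactic_graph(n_tokens, dep_arcs):
    """dep_arcs: list of (head_idx, dep_idx, label) triples from a dependency
    parse. Returns an edge list with original, reversed, and self-loop arcs,
    each carrying a type label as described in the text."""
    edges = []
    for i, j, lab in dep_arcs:
        edges.append((i, j, lab))          # original arc with label K(w_i, w_j)
        edges.append((j, i, lab + "'"))    # reversed arc with label K'(w_i, w_j)
    for v in range(n_tokens):
        edges.append((v, v, "loop"))       # self-loop for every node
    return edges

# Two-node subgraph "killed" (0) -> "witnesses" (1) with an nmod arc
# yields four edges: nmod, nmod', and the two self-loops.
g = build_syntactic_graph(2, [(0, 1, "nmod")])
assert len(g) == 4
```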
Therefore, in the k-th layer of the syntactic graph convolution network module, the graph convolution vector h_v^(k+1) of node v is calculated as:

h_v^(k+1) = f( Σ_{u ∈ N(v)} ( W_{K(u,v)}^(k) h_u^(k) + b_{K(u,v)}^(k) ) )   (1)

where K(u, v) indicates the type label of the edge (u, v); W_{K(u,v)}^(k) and b_{K(u,v)}^(k) are the weight matrix and the bias for the given type label K(u, v), respectively; N(v) is the set of neighbors of v including v itself (because of the self-loops); and f is the activation function. Moreover, we use the output of the word representation module x_i to initialize the node representation h_{v_i}^0 in the first layer of the GCNs. After applying the above two changes, the number of predefined directed arc type labels (say, N) is doubled and incremented (to 2N + 1). This means we would need 2N + 1 pairs of parameters W_K^(k) and b_K^(k) for a single GCN layer. In this work, we use the Stanford Parser (Klein and Manning, 2003) to generate the arcs in the dependency parsing trees as the shortcut arcs. The current representation contains approximately 50 different grammatical relations, which leads to too many parameters for a single GCN layer and is not compatible with the existing training data scale. To reduce the number of parameters, following Marcheggiani and Titov (2017), we modify the definition of the type label K(w_i, w_j) to distinguish only the direction of the arc:

K(w_i, w_j) = along, if (w_i, w_j) ∈ E; rev, if (w_j, w_i) ∈ E; loop, if i = j   (2)

so that the new K(w_i, w_j) has only three type labels.
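A minimal numpy sketch of one such GCN layer with the three collapsed type labels follows (the label names and the dense edge-loop formulation are our assumptions; a real implementation would batch and vectorize this):

```python
import numpy as np

def gcn_layer(H, edges, W, b, f=np.tanh):
    """One syntactic GCN layer.
    H: (n, d) node states; edges: list of (u, v, label);
    W[label]: (d, d) weight matrix; b[label]: (d,) bias.
    Collapsed type labels: 'along', 'rev', 'loop'."""
    out = np.zeros_like(H)
    for u, v, lab in edges:
        out[v] += W[lab] @ H[u] + b[lab]   # message from neighbor u to node v
    return f(out)

rng = np.random.default_rng(1)
d = 4
labels = ("along", "rev", "loop")
W = {lab: rng.normal(scale=0.1, size=(d, d)) for lab in labels}
b = {lab: np.zeros(d) for lab in labels}
H0 = rng.normal(size=(2, d))  # e.g. initialized from the word representations
edges = [(0, 1, "along"), (1, 0, "rev"), (0, 0, "loop"), (1, 1, "loop")]
H1 = gcn_layer(H0, edges, W, b)
assert H1.shape == (2, d)
```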
Moreover, not all types of edges are equally informative for the downstream task, and there is also noise in the generated syntactic parsing structures; we therefore apply gates on the edges to weight their individual importance. Inspired by Dauphin et al. (2017), we calculate a scalar gate g_{u,v}^(k) for each edge (u, v), indicating its importance for event extraction, as:

g_{u,v}^(k) = σ( h_u^(k) · V_{K(u,v)}^(k) + d_{K(u,v)}^(k) )   (3)

where σ is the logistic sigmoid function, and V_{K(u,v)}^(k) and d_{K(u,v)}^(k) are the weight vector and the bias of the gate. With this additional gating mechanism, the final syntactic GCN computation is formulated as:

h_v^(k+1) = f( Σ_{u ∈ N(v)} g_{u,v}^(k) ( W_{K(u,v)}^(k) h_u^(k) + b_{K(u,v)}^(k) ) )   (4)

Since stacking k layers of GCNs models information within k hops, and the length of the shortcut path between two triggers is sometimes less than k, to avoid information over-propagating we adapt highway units (Srivastava et al., 2015), which allow unimpeded information flow across the stacked GCN layers. Typically, highway layers conduct a nonlinear transformation as:

t = σ( W_t h^(k) + b_t )   (5)
h̃^(k) = t ⊙ g( W_h h^(k) + b_h ) + (1 − t) ⊙ h^(k)   (6)

where σ is the sigmoid function; ⊙ is the element-wise product operation; g is a nonlinear activation function; t is called the transform gate and (1 − t) is called the carry gate. Therefore, the input of the next GCN layer is h̃^(k) instead of h^(k).
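A minimal numpy sketch of the gated message passing and the highway unit described above (the function signatures and dense per-edge loop are our assumptions, not the authors' code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_gcn_layer(H, edges, W, b, V, dgate, f=np.tanh):
    """Gated syntactic GCN layer: each message from u to v is scaled by a
    scalar gate sigmoid(h_u . V_{K(u,v)} + d_{K(u,v)})."""
    out = np.zeros_like(H)
    for u, v, lab in edges:
        g = sigmoid(H[u] @ V[lab] + dgate[lab])      # scalar edge importance
        out[v] += g * (W[lab] @ H[u] + b[lab])
    return f(out)

def highway(h, Wt, bt, Wh, bh, g=np.tanh):
    """Highway unit: t * g(W_h h + b_h) + (1 - t) * h, with transform gate
    t = sigmoid(W_t h + b_t) and carry gate (1 - t)."""
    t = sigmoid(h @ Wt + bt)
    return t * g(h @ Wh + bh) + (1.0 - t) * h
```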
The GCNs are designed to capture the dependencies through shortcut arcs, while the number of GCN layers limits the ability to capture local graph information. However, in this case, we find that leveraging the local sequential context helps to expand the information flow without increasing the number of GCN layers, which suggests that LSTMs and GCNs may be complementary. Therefore, instead of feeding the word representation X = (x_1, x_2, ..., x_n) into the first GCN layer directly, we follow Marcheggiani and Titov (2017) and apply a Bidirectional LSTM (Bi-LSTM) (Hochreiter and Schmidhuber, 1997) to encode the word representation X as:

p_t = LSTM_f( p_{t−1}, x_t )   (7)
q_t = LSTM_b( q_{t+1}, x_t )   (8)

and the input of the t-th token to the GCNs is h_{v_t}^0 = [p_t ; q_t], where [· ; ·] is the concatenation operation. The Bi-LSTM adaptively accumulates and abstracts the context for each token in the sentence.
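The bidirectional encoding and per-token concatenation can be sketched as follows; for brevity we use a plain tanh-RNN cell standing in for each LSTM direction (an assumption of this sketch, not the gated LSTM cell the paper uses):

```python
import numpy as np

def simple_rnn(X, Wx, Wh, b):
    """Toy tanh-RNN over the sequence X (n, dx), returning (n, dh) states.
    A stand-in for one LSTM direction in this sketch."""
    h = np.zeros(Wh.shape[0])
    outs = []
    for x in X:
        h = np.tanh(Wx @ x + Wh @ h + b)
        outs.append(h)
    return np.stack(outs)

def bi_encode(X, fwd_params, bwd_params):
    """Run the encoder forward and backward, then concatenate the hidden
    states per token: h_t = [p_t ; q_t]."""
    hf = simple_rnn(X, *fwd_params)
    hb = simple_rnn(X[::-1], *bwd_params)[::-1]  # reverse back to token order
    return np.concatenate([hf, hb], axis=-1)
```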

Self-Attention Trigger Classification
Taking each token as the current word, we obtain the representation D of all tokens calculated by the GCNs. Traditional event extraction systems often use max-pooling or its variants to aggregate information at each position. However, such max-pooling aggregation mechanisms tend to produce similar results after the GCN modules in our framework. For example, if we compute the aggregated vector Ag_i at each position i by the max-pooling mechanism Ag_i = maxpooling(H_1, ..., H_n) over the GCN outputs {H_j | j = 1, ..., n}, where n is the sentence length, then the vector Ag_i is identical at every position. Besides, predicting a trigger label for a token should take the other possible trigger candidates into consideration. To capture the associations between triggers in a sentence, we design a self-attention mechanism to aggregate information while especially keeping the associations between multiple events.
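The position-independence of plain max-pooling can be seen in a tiny numpy check (toy values, for illustration only):

```python
import numpy as np

# Toy GCN outputs H_j for a three-token sentence, hidden size 2.
H = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [0.5, 0.5]])

# Max-pooling over all positions ignores i entirely, so every position i
# would receive the same aggregated vector Ag_i.
Ag = H.max(axis=0)
assert np.allclose(Ag, [1.0, 2.0])
```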
Given the current token w_i, the self-attention score vector and the context vector at position i are calculated as:

score_i = norm( exp( W_2 f( W_1 D + b_1 ) + b_2 ) )   (9)
C_i = score_i D   (10)

where norm denotes the normalization operation. We then feed the context vector C_i into a fully-connected network to predict the trigger label in the BIO annotation schema as:

y_i^t = softmax( W_4 f( W_3 C_i + b_3 ) + b_4 )   (11)

where f is a nonlinear activation and y_i^t is the final output of the i-th trigger label.
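A minimal numpy sketch of this attention-based aggregation follows (the tensor shapes and the interpretation of norm as dividing by the score sum are our assumptions):

```python
import numpy as np

def self_attention_context(D, W1, b1, W2, b2):
    """Aggregate GCN outputs D (n, d) into a context vector with a
    two-layer scoring function: exponentiated scores are normalized over
    positions and used as weights for a sum over token states."""
    s = np.exp(W2 @ np.tanh(D @ W1 + b1).T + b2)  # unnormalized scores, (n,)
    s = s / s.sum()                               # norm(.): weights sum to 1
    return s @ D                                  # weighted sum of token states
```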

Argument Classification
When we have extracted an entire trigger candidate, i.e., when we meet an O label after an I-Type or B-Type label, we use the aggregated context vectors C to perform argument classification over the entity list in the sentence. For each entity-trigger pair, as both the entity and the trigger candidate may be a subsequence of tokens, we aggregate the context vectors of the subsequences into a trigger candidate vector T_i and an entity vector E_j by average pooling along the sequence-length dimension. We then concatenate them together and feed them into a fully-connected network to predict the argument role as:

y_{ij}^a = softmax( W_6 f( W_5 [T_i ; E_j] + b_5 ) + b_6 )   (12)

where y_{ij}^a is the final output indicating which role the j-th entity plays in the event triggered by the i-th trigger candidate. When training our framework, if the trigger candidate under consideration is not a correct trigger, we set all the golden argument labels concerning that trigger candidate to OTHER (not any role). With this setting, the labels of the trigger candidate will be further adjusted to reach a reasonable probability distribution.
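The pooling-and-concatenation step can be sketched as follows (a single linear scoring layer stands in for the fully-connected network here, and the span convention is half-open; both are our simplifications):

```python
import numpy as np

def argument_logits(C, trig_span, ent_span, Wa, ba):
    """Average-pool the context vectors over the trigger and entity
    subsequences, concatenate, and score argument roles.
    C: (n, d) context vectors; spans are half-open (start, end) indices;
    Wa: (n_roles, 2d); ba: (n_roles,)."""
    T = C[trig_span[0]:trig_span[1]].mean(axis=0)  # trigger candidate vector T_i
    E = C[ent_span[0]:ent_span[1]].mean(axis=0)    # entity vector E_j
    return Wa @ np.concatenate([T, E]) + ba        # role logits for y^a_ij
```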

Biased Loss Function
In order to train the networks, we minimize the joint negative log-likelihood loss function. Due to the data sparsity in the ACE 2005 dataset, we adapt the joint negative log-likelihood loss function by adding a bias item as:

J(θ) = − Σ_{p=1}^{N} [ Σ_{i=1}^{n_p} I(y_i^t) log p(y_i^t | W_p, θ) + β Σ_{j=1}^{t_p} Σ_{k=1}^{e_p} log p(y_{jk}^a | W_p, θ) ]   (13)

where N is the number of sentences in the training corpus; n_p, t_p, and e_p are the numbers of tokens, extracted trigger candidates, and entities of the p-th sentence; I(y_i^t) is an indicator function that outputs a fixed positive floating-point number α bigger than one if y_i^t is not O, and one otherwise; and β is also a floating-point hyper-parameter like α.
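For one sentence, the biased loss can be sketched as below (the concrete α and β values and array layouts are illustrative assumptions; the paper tunes them on the development set):

```python
import numpy as np

def biased_nll(trig_probs, trig_gold, arg_probs, arg_gold,
               alpha=5.0, beta=2.0, o_label=0):
    """Biased negative log-likelihood for one sentence.
    trig_probs: (n, n_trig_classes) predicted distributions; trig_gold: (n,) ids.
    arg_probs: (t, e, n_roles) distributions; arg_gold: (t, e) role ids.
    Non-O trigger labels are up-weighted by alpha; argument terms by beta."""
    loss = 0.0
    for i, y in enumerate(trig_gold):
        w = alpha if y != o_label else 1.0   # I(y^t_i): bias toward non-O labels
        loss -= w * np.log(trig_probs[i, y])
    for j in range(arg_gold.shape[0]):
        for k in range(arg_gold.shape[1]):
            loss -= beta * np.log(arg_probs[j, k, arg_gold[j, k]])
    return loss
```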

Dataset, Resources and Evaluation Metric
We evaluate our JMEE framework on the ACE 2005 dataset. The ACE 2005 dataset annotates 33 event subtypes and 36 role classes; along with the NONE class and the BIO annotation schema, we classify each token into 67 categories in event detection and 37 categories in argument extraction. To comply with previous work, we use the same data split (Ji and Grishman, 2008; Liao and Grishman, 2010; Li et al., 2013; Chen et al., 2015; Liu et al., 2016b; Yang and Mitchell, 2016; Sha et al., 2018). This data split uses 40 newswire articles (881 sentences) as the test set, 30 other documents (1,087 sentences) as the development set, and the 529 remaining documents (21,090 sentences) as the training set.
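The category counts follow directly from the label schema, as this small check shows:

```python
# 33 event subtypes under the BIO schema: a B-type and an I-type tag per
# subtype, plus the single O (NONE) tag.
n_trigger_classes = 2 * 33 + 1
# 36 argument roles plus the no-role (NONE) class.
n_role_classes = 36 + 1
assert (n_trigger_classes, n_role_classes) == (67, 37)
```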
We deploy the Stanford CoreNLP toolkit to preprocess the data, including tokenization, sentence splitting, POS-tagging, and generating dependency parsing trees.
Also, we follow the criteria of the previous work (Ji and Grishman, 2008; Liao and Grishman, 2010; Li et al., 2013; Chen et al., 2015; Liu et al., 2016b; Yang and Mitchell, 2016; Sha et al., 2018) to judge the correctness of the predicted event mentions. We set the dropout rate to 0.5 and the L2-norm coefficient to 1e-8. The batch size in our experiments is 32, and we use a maximum sentence length of n = 50 by padding shorter sentences and cutting off longer ones. These hyper-parameters are either randomly searched or chosen by experience when tuning on the development set. We use ReLU (Glorot et al., 2011) as our nonlinear activation function. We apply the stochastic gradient descent algorithm with mini-batches and the AdaDelta update rule (Zeiler, 2012). The gradients are computed using back-propagation. During training, besides the weight matrices, we also fine-tune all the embedding tables.

Overall Performance
We compare our performance with state-of-the-art methods. Table 1 shows the overall performance compared with these methods using golden-standard entities. From the table, we can see that our JMEE framework achieves the best F1 scores for both the trigger classification and argument-related subtasks among all the compared methods. There is a significant gain in the trigger classification and argument role labeling performances, which are 2% higher than the best-reported models. These results demonstrate the effectiveness of our method in incorporating graph convolution and syntactic shortcut arcs.

Effect on Extracting Multiple Events
To evaluate the effect of our framework in alleviating the multiple-events phenomenon, we divide the test data into two parts (1/1 and 1/N) following Nguyen et al. (2016) and Chen et al. (2015) and perform evaluations separately. 1/1 means that one sentence has only one trigger or that only one argument plays a role in the sentence; otherwise, 1/N is used. Table 2 illustrates the performance (F1 scores) of JRNN (Nguyen et al., 2016), DMCNN (Chen et al., 2015), the two baseline models Embedding+T and CNN in Chen et al. (2015), and our framework on the trigger classification subtask and the argument role labeling subtask. Embedding+T uses word embedding vectors and the traditional sentence-level features in Li et al. (2013).

Conclusion

In our framework, we introduce syntactic shortcut arcs to enhance information flow and adapt the graph convolution network to capture the enhanced representations, alleviating the multiple-events phenomenon. Then a self-attention aggregation mechanism is applied to aggregate the associations between events. Besides, we jointly extract event triggers and arguments by optimizing a biased loss function due to the imbalances in the dataset. The experimental results demonstrate the effectiveness of our proposed framework. In the future, we plan to exploit the information of one argument playing different roles in various events to do better in the event extraction task.