Joint Event Extraction via Recurrent Neural Networks

Event extraction is a particularly challenging problem in information extraction. The state-of-the-art models for this problem have either applied convolutional neural networks in a pipelined framework (Chen et al., 2015) or followed the joint architecture via structured prediction with rich local and global features (Li et al., 2013). The former is able to learn hidden feature representations automatically from data based on the continuous and generalized representations of words. The latter, on the other hand, is capable of mitigating the error propagation problem of the pipelined approach and exploiting the inter-dependencies between event triggers and argument roles via discrete structures. In this work, we propose to do event extraction in a joint framework with bidirectional recurrent neural networks, thereby benefiting from the advantages of the two models as well as addressing issues inherent in the existing approaches. We systematically investigate different memory features for the joint model and demonstrate that the proposed model achieves the state-of-the-art performance on the ACE 2005 dataset.


Introduction
We address the problem of event extraction (EE): identifying event triggers of specified types and their arguments in text. Triggers are often single verbs or nominalizations that evoke events of interest, while arguments are the entities participating in those events. This is an important and challenging task of information extraction in natural language processing (NLP), as the same event might be present in various expressions, and an expression might express different events in different contexts.
There are two main approaches to EE: (i) the joint approach that predicts event triggers and arguments for sentences simultaneously as a structured prediction problem, and (ii) the pipelined approach that first performs trigger prediction and then identifies arguments in separate stages.
The most successful joint system for EE (Li et al., 2013) is based on the structured perceptron algorithm with a large set of local and global features. These features are designed to capture the discrete structures that are intuitively helpful for EE using NLP toolkits (e.g., part-of-speech tags, dependency and constituent tags). The advantages of such a joint system are twofold: (i) mitigating the error propagation from the upstream component (trigger identification) to the downstream classifier (argument identification), and (ii) benefiting from the inter-dependencies among event triggers and argument roles via global features. For example, consider the following sentence from the ACE 2005 dataset: In Baghdad, a cameraman died when an American tank fired on the Palestine hotel.
In this sentence, died and fired are the event triggers for the events of types Die and Attack, respectively. In the pipelined approach, it is often simple for the argument classifiers to realize that cameraman is the Target argument of the Die event due to the proximity between cameraman and died in the sentence. However, as cameraman is far away from fired, the argument classifiers in the pipelined approach might fail to recognize cameraman as the Target argument for the event Attack with their local features. The joint approach can overcome this issue by relying on the global features to encode the fact that a Victim argument for the Die event is often the Target argument for the Attack event in the same sentence.
Despite the advantages presented above, the joint system by Li et al. (2013) suffers from the lack of generalization over unseen words/features and the inability to extract the underlying structures for EE (due to its discrete representation from the handcrafted feature set) (Nguyen and Grishman, 2015b; Chen et al., 2015).
The most successful pipelined system for EE to date (Chen et al., 2015) addresses these drawbacks of the joint system by Li et al. (2013) via dynamic multi-pooling convolutional neural networks (DMCNN). In this system, words are represented by continuous representations (Bengio et al., 2003; Turian et al., 2010; Mikolov et al., 2013a) and features are automatically learned from data by the DMCNN, thereby alleviating the unseen word/feature problem and extracting more effective features for the given dataset. However, as the system by Chen et al. (2015) is pipelined, it still suffers from the inherent limitations of error propagation and failure to exploit the inter-dependencies between event triggers and argument roles. Finally, we notice that the discrete features, shown to be helpful in the previous studies for EE, are not considered in Chen et al. (2015).
Guided by these characteristics of the EE systems by Li et al. (2013) and Chen et al. (2015), in this work we propose to solve the EE problem with the joint approach via recurrent neural networks (RNNs) (Hochreiter and Schmidhuber, 1997) augmented with the discrete features, thus inheriting all the benefits of both systems as well as overcoming their inherent issues. To the best of our knowledge, this is the first work to employ neural networks for joint EE.
Our model involves two RNNs that run over the sentences in both forward and reverse directions to learn a richer representation for the sentences. This representation is then utilized to predict event triggers and argument roles jointly. In order to capture the inter-dependencies between triggers and argument roles, we introduce memory vectors/matrices to store the prediction information during the course of labeling the sentences.
We systematically explore various memory vector/matrices as well as different methods to learn word representations for the joint model. The experimental results show that our system achieves the state-of-the-art performance on the widely used ACE 2005 dataset.

Event Extraction Task
We focus on the EE task of the Automatic Content Extraction (ACE) evaluation. ACE defines an event as something that happens or leads to some change of state. We employ the following terminology:
• Event mention: a phrase or sentence in which an event occurs, including one trigger and an arbitrary number of arguments.
• Event trigger: the main word that most clearly expresses an event occurrence.
• Event argument: an entity mention, temporal expression or value (e.g., Job-Title) that serves as a participant or attribute with a specific role in an event mention.
ACE annotates 8 types and 33 subtypes (e.g., Attack, Die, Start-Position) for event mentions, which also correspond to the types and subtypes of the event triggers. Each event subtype has its own set of roles to be filled by the event arguments. For instance, the roles for the Die event include Place, Victim and Time. The total number of roles for all the event subtypes is 36.
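To make the terminology concrete, the structures above can be sketched as simple Python records. The field names are illustrative rather than the official ACE schema; the example event is taken from the sentence discussed in the introduction.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class EventArgument:
    text: str   # the entity mention, temporal expression or value
    role: str   # e.g. "Place", "Victim", "Time"

@dataclass
class EventMention:
    trigger: str                 # the word that most clearly expresses the event
    subtype: str                 # one of the 33 ACE subtypes, e.g. "Die"
    arguments: List[EventArgument] = field(default_factory=list)

# The Die event from "In Baghdad, a cameraman died when an American
# tank fired on the Palestine hotel":
die_event = EventMention(
    trigger="died", subtype="Die",
    arguments=[EventArgument("cameraman", "Victim"),
               EventArgument("Baghdad", "Place")])
```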
Given an English text document, an event extraction system needs to recognize event triggers with specific subtypes and their corresponding arguments with the roles for each sentence. Following the previous work (Li et al., 2013; Chen et al., 2015), we assume that the argument candidates (i.e., the entity mentions, temporal expressions and values) are provided (by the ACE annotation) to the event extraction systems.

Model
We formalize the EE task as follows. Let W = w_1 w_2 ... w_n be a sentence, where n is the sentence length and w_i is the i-th token. Also, let E = e_1, e_2, ..., e_k be the entity mentions in this sentence (k is the number of entity mentions and can be zero). Each entity mention comes with the offsets of its head and its entity type. We further let i_1, i_2, ..., i_k be the indexes of the last words of the mention heads of e_1, e_2, ..., e_k, respectively.
In EE, for every token w i in the sentence, we need to predict the event subtype (if any) for it. If w i is a trigger word for some event of interest, we then need to predict the roles (if any) that each entity mention e j plays in such event.
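As a concrete illustration of these prediction targets, consider the toy annotation below for the running example sentence (the specific gold labels are illustrative):

```python
sentence = ["a", "man", "died", "when", "a", "tank", "fired", "in", "Baghdad"]
entity_heads = [1, 5, 8]     # head indexes i_1..i_k of "man", "tank", "Baghdad"

n, k = len(sentence), len(entity_heads)
triggers = ["Other"] * n                     # one subtype t_i per token
roles = [["Other"] * k for _ in range(n)]    # one role a_ij per (token, mention)

# Illustrative gold labels for the two events in this sentence:
triggers[2] = "Die";    roles[2][0] = "Victim"; roles[2][2] = "Place"
triggers[6] = "Attack"; roles[6][2] = "Place"
```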
The joint model for event extraction in this work consists of two phases: (i) the encoding phase that applies recurrent neural networks to induce a more abstract representation of the sentence, and (ii) the prediction phase that uses the new representation to perform event trigger and argument role identification simultaneously for W . Figure 1 shows an overview of the model.

Sentence Encoding
In the encoding phase, we first transform each token w_i into a real-valued vector x_i using the concatenation of the following three vectors: 1. The word embedding vector of w_i: This is obtained by looking up a pre-trained word embedding table D (Collobert and Weston, 2008; Turian et al., 2010; Mikolov et al., 2013a).
2. The real-valued embedding vector for the entity type of w_i: This vector is motivated by the prior work of Nguyen and Grishman (2015b) and generated by looking up the entity type embedding table (initialized randomly) for the entity type of w_i. Note that we also employ the BIO annotation schema to assign entity type labels to each token in the sentences, using the heads of the entity mentions, as done by Nguyen and Grishman (2015b).
3. The binary vector whose dimensions correspond to the possible relations between words in the dependency trees: The value at each dimension of this vector is set to 1 only if there exists an edge of the corresponding relation connected to w_i in the dependency tree of W. This vector represents the dependency features that have been shown to be helpful in previous research.
Note that we do not use the relative position features, unlike the prior work on neural networks for EE (Nguyen and Grishman, 2015b; Chen et al., 2015). The reason is that we predict triggers and argument roles for the whole sentence jointly, so there is no fixed position for anchoring in the sentence.
The transformation from the token w_i to the vector x_i essentially converts the input sentence W into a sequence of real-valued vectors X = (x_1, x_2, ..., x_n), to be used by recurrent neural networks to learn a more effective representation.
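A minimal sketch of this token-to-vector transformation, with toy lookup tables and a hypothetical relation inventory (the real model uses a pre-trained 300-dimensional word embedding table, a learned 50-dimensional entity type table, and the full dependency relation set):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy tables standing in for the real pre-trained / learned tables
word_emb = {w: rng.normal(size=300) for w in ["man", "died", "tank"]}
entity_type_emb = {t: rng.normal(size=50) for t in ["O", "B-PER"]}
DEP_RELATIONS = ["nsubj", "dobj", "advcl", "prep"]   # hypothetical inventory

def encode_token(word, entity_tag, dep_rels):
    """Concatenate the three vectors described above for one token."""
    dep_vec = np.zeros(len(DEP_RELATIONS))
    for r in dep_rels:                 # binary dependency-relation features
        dep_vec[DEP_RELATIONS.index(r)] = 1.0
    return np.concatenate([word_emb[word],
                           entity_type_emb[entity_tag],
                           dep_vec])

x = encode_token("died", "O", ["nsubj", "advcl"])   # a 300+50+4 = 354-dim x_i
```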
Given X, a recurrent neural network computes, at each position i, a hidden vector α_i = Φ(x_i, α_{i−1}) for some non-linear function Φ, yielding the forward sequence RNN(x_1, x_2, ..., x_n) = (α_1, α_2, ..., α_n). An important characteristic of the recurrent mechanism is that it adaptively accumulates the context information from position 1 to i into the hidden vector α_i, making α_i a rich representation. However, α_i is not sufficient for the event trigger and argument predictions at position i, as such predictions might need to rely on the context information in the future (i.e., from position i to n). In order to address this issue, we run a second RNN in the reverse direction from x_n to x_1 to generate a second hidden vector sequence RNN'(x_n, x_{n−1}, ..., x_1) = (α'_n, α'_{n−1}, ..., α'_1), in which α'_i summarizes the context information from position n to i. Eventually, we obtain the new representation (h_1, h_2, ..., h_n) for X by concatenating the hidden vectors of the two sequences: h_i = [α_i, α'_i]. Note that h_i essentially encapsulates the context information over the whole sentence (from 1 to n) with a greater focus on position i.
[Figure 1: The joint EE model for the input sentence "a man died when a tank fired in Baghdad" with local context window d = 1. Only the memory matrices G^arg/trg_i are shown. Green corresponds to the trigger candidate "died" at the current step, while violet and red are for the entity mentions "man" and "Baghdad", respectively.]
Regarding the non-linear function, the simplest form of Φ in the literature considers it as a one-layer feed-forward neural network. Unfortunately, this function is prone to the "vanishing gradient" problem (Bengio et al., 1994), making it challenging to train RNNs properly. This problem can be alleviated by long short-term memory units (LSTM) (Hochreiter and Schmidhuber, 1997; Gers, 2001). In this work, we use a variant of LSTM, the Gated Recurrent Units (GRU) of Cho et al. (2014). GRU has been shown to achieve performance comparable to LSTM (Chung et al., 2014; Józefowicz et al., 2015).
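A toy numpy sketch of this bidirectional encoder using GRU-style gating (the sizes, initialization and gate parameterization are illustrative; the real model uses 300-unit hidden layers and learned parameters):

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_h = 4, 3                      # toy input / hidden sizes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_gru():
    # one weight matrix per gate, acting on the concatenation [x_t; h_prev]
    return {g: rng.normal(scale=0.1, size=(d_h, d_in + d_h)) for g in "zrh"}

def gru_step(p, x_t, h_prev):
    """One GRU update: update gate z, reset gate r, candidate state."""
    xh = np.concatenate([x_t, h_prev])
    z = sigmoid(p["z"] @ xh)
    r = sigmoid(p["r"] @ xh)
    h_tilde = np.tanh(p["h"] @ np.concatenate([x_t, r * h_prev]))
    return (1 - z) * h_prev + z * h_tilde

def run_rnn(p, xs):
    h, out = np.zeros(d_h), []
    for x_t in xs:
        h = gru_step(p, x_t, h)
        out.append(h)
    return out

X = [rng.normal(size=d_in) for _ in range(5)]
fwd, bwd = make_gru(), make_gru()
alphas = run_rnn(fwd, X)                    # forward pass: positions 1..n
alphas_rev = run_rnn(bwd, X[::-1])[::-1]    # backward pass, re-aligned to 1..n
H = [np.concatenate([a, b]) for a, b in zip(alphas, alphas_rev)]  # h_i = [α_i, α'_i]
```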

Prediction
In order to jointly predict triggers and argument roles for W, we maintain a binary memory vector G^trg_i for triggers, and binary memory matrices G^arg_i and G^arg/trg_i for arguments (at each time step i). These vector/matrices are initially set to zeros (i = 0) and updated during the prediction process for W.
Given the bidirectional representation h_1, h_2, ..., h_n from the encoding phase and the initialized memory vector/matrices, the joint prediction procedure loops over the n tokens in the sentence (from 1 to n). At each time step i, we perform the following three stages in order: (i) trigger prediction for w_i; (ii) argument role prediction for all the entity mentions e_1, e_2, ..., e_k with respect to w_i; and (iii) computing the memory vector/matrices for the current step. The output of this process is the predicted trigger subtype t_i for w_i, the predicted argument roles a_i1, a_i2, ..., a_ik and the memory vector/matrices G^trg_i, G^arg_i and G^arg/trg_i for the current step. Note that t_i should be the event subtype if w_i is a trigger word for some event of interest, or "Other" otherwise. a_ij, in contrast, should be the argument role of the entity mention e_j with respect to w_i if w_i is a trigger word and e_j is an argument of the corresponding event; otherwise a_ij is set to "Other" (j = 1 to k).
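A skeleton of this three-stage decoding loop; the scoring networks are replaced by hard-coded stubs, so only the control flow and the memory book-keeping are illustrated (the set-based memory encoding is a simplification of the binary vector/matrices):

```python
def predict_trigger(i, memory):          # stub standing in for F^trg
    return "Die" if i == 2 else "Other"

def predict_role(i, j, memory):          # stub standing in for F^arg
    return "Victim" if (i, j) == (2, 0) else "Other"

def joint_decode(n, k):
    # memory vector/matrices, all empty at i = 0
    memory = {"trg": set(), "arg": set(), "arg_trg": set()}
    triggers, roles = [], []
    for i in range(n):
        t_i = predict_trigger(i, memory)                     # stage (i)
        a_i = ["Other"] * k
        if t_i != "Other":
            a_i = [predict_role(i, j, memory) for j in range(k)]  # stage (ii)
            memory["trg"].add(t_i)                           # stage (iii)
            for j, a in enumerate(a_i):
                if a != "Other":
                    memory["arg"].add((j, a))
                    memory["arg_trg"].add((j, t_i))
        triggers.append(t_i)
        roles.append(a_i)
    return triggers, roles

trg, arg = joint_decode(n=9, k=3)
```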

Trigger Prediction
In the trigger prediction stage for the current token w_i, we first compute the feature representation vector R^trg_i for w_i using the concatenation of the following three vectors:
• h_i: the hidden vector encapsulating the global context of the input sentence.
• L^trg_i: the local context vector for w_i, i.e., the concatenation of the embedding vectors of the words in a context window of size d around w_i.
• G^trg_{i−1}: the memory vector from the previous step (see "The Memory Vector/Matrices").
R^trg_i is then fed into a feed-forward neural network F^trg with a softmax layer in the end to compute the probability distribution P^trg_i;t over the possible trigger subtypes, where t is a trigger subtype. Finally, we compute the predicted type t_i for w_i by: t_i = argmax_t (P^trg_i;t).
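A minimal sketch of this scoring step with toy dimensions (R stands in for the concatenated R^trg_i; the hidden size and random initialization are illustrative, while the output size of 34 follows the 33 ACE subtypes plus "Other"):

```python
import numpy as np

rng = np.random.default_rng(2)
d_r, d_hid, n_subtypes = 10, 8, 34     # toy input / hidden sizes; 33 + "Other"

W1 = rng.normal(scale=0.1, size=(d_hid, d_r))
W2 = rng.normal(scale=0.1, size=(n_subtypes, d_hid))

def softmax(z):
    e = np.exp(z - z.max())            # shift for numerical stability
    return e / e.sum()

def trigger_distribution(R_trg):
    """One-hidden-layer network F^trg with a softmax output layer."""
    return softmax(W2 @ np.tanh(W1 @ R_trg))

R = rng.normal(size=d_r)               # stands in for [h_i; L^trg_i; G^trg_{i-1}]
P = trigger_distribution(R)
t_i = int(np.argmax(P))                # index of the predicted subtype
```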

Argument Role Prediction
In the argument role prediction stage, we first check whether the predicted trigger subtype t_i from the previous stage is "Other". If so, we can simply set a_ij to "Other" for all j = 1 to k and proceed to the next stage immediately. Otherwise, we loop over the entity mentions e_1, e_2, ..., e_k. For each entity mention e_j with the head index i_j, we predict the argument role a_ij with respect to the trigger word w_i using the following procedure.
First, we generate the feature representation vector R^arg_ij for e_j and w_i by concatenating the following vectors:
• h_i and h_{i_j}: the hidden vectors capturing the global context of the input sentence for w_i and e_j, respectively.
• L^arg_ij: the local context vector for w_i and e_j, i.e., the concatenation of the vectors of the words in the context windows of size d around w_i and w_{i_j}.
• B_ij: the hidden vector for the binary feature vector V_ij. V_ij is based on the local argument features between the tokens i and i_j from (Li et al., 2013). B_ij is computed by feeding V_ij into a feed-forward neural network F^binary for further abstraction: B_ij = F^binary(V_ij).
R^arg_ij is then fed into a feed-forward neural network F^arg with a softmax layer in the end to compute the probability distribution P^arg_ij;a over the possible argument roles, where a is an argument role. Eventually, the predicted argument role for w_i and e_j is a_ij = argmax_a (P^arg_ij;a). Note that the binary vector V_ij enriches the feature representation R^arg_ij for argument labeling with the discrete structures discovered in the prior work on feature analysis for EE. These features include the shortest dependency paths, the entity types, subtypes, etc.
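The abstraction of the discrete features into B_ij can be sketched as follows (sizes and active feature indexes are hypothetical; in the model, V_ij is built from the local argument features of Li et al. (2013)):

```python
import numpy as np

rng = np.random.default_rng(3)
d_bin, d_b = 20, 6                     # toy sizes for V_ij and B_ij

W_bin = rng.normal(scale=0.1, size=(d_b, d_bin))

def f_binary(V_ij):
    """F^binary: abstracts the discrete argument features V_ij into B_ij."""
    return np.tanh(W_bin @ V_ij)

V = np.zeros(d_bin)
V[[1, 7, 13]] = 1.0                    # illustrative active discrete features
B = f_binary(V)                        # dense vector fed into R^arg_ij
```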

The Memory Vector/Matrices
An important characteristic of EE is the existence of dependencies between trigger subtypes and argument roles within the same sentence. In this work, we encode these dependencies into the memory vector/matrices G^trg_i, G^arg_i and G^arg/trg_i, which store the trigger subtypes and argument roles that have been predicted up to the current step and provide this information as features for the predictions at the later steps.
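One plausible update rule for the G^arg/trg memory matrix can be sketched as below (an assumption-laden illustration: cell [j][t] records that entity mention j has already served as an argument of an event of subtype t earlier in the sentence, which is the kind of trigger-argument dependency described in the introduction):

```python
SUBTYPES = ["Die", "Attack"]           # toy subtype inventory
k = 3                                  # number of entity mentions

# binary memory matrix, all zeros before any prediction (i = 0)
G_arg_trg = [[0] * len(SUBTYPES) for _ in range(k)]

def update(G, j, subtype):
    """Record that mention j was an argument of an event of this subtype."""
    G[j][SUBTYPES.index(subtype)] = 1

# "cameraman" (j = 0) is labeled Victim of the Die event at an earlier step:
update(G_arg_trg, 0, "Die")
# Later, when scoring "cameraman" against the trigger "fired", the row
# G_arg_trg[0] exposes its earlier participation in a Die event.
```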

Training
Denote the gold trigger subtypes and argument roles for W at training time by T = t*_1, t*_2, ..., t*_n and A = (a*_ij) for i = 1 to n, j = 1 to k. We train the network by minimizing the joint negative log-likelihood function C for triggers and argument roles, which sums the negative log-probabilities of the gold trigger subtypes and of the gold argument roles, using the indicator function I(t*_i ≠ "Other") to include the argument terms only at trigger positions. We apply the stochastic gradient descent algorithm with mini-batches and the AdaDelta update rule (Zeiler, 2012). The gradients are computed using back-propagation. During training, besides the weight matrices, we also optimize the word and entity type embedding tables to achieve the optimal states. Finally, we rescale the weights whose Frobenius norms exceed a hyperparameter (Kim, 2014; Nguyen and Grishman, 2015a).
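The objective, whose formula was lost in extraction, can plausibly be reconstructed as follows (a hedged reconstruction consistent with the notation of the prediction phase, not a verbatim quotation of the original):

```latex
C(T, A, X) = -\sum_{i=1}^{n} \log P^{trg}_{i;\,t^{*}_{i}}
             \;-\; \sum_{i=1}^{n} \sum_{j=1}^{k}
             I\!\left(t^{*}_{i} \neq \text{``Other''}\right)
             \log P^{arg}_{ij;\,a^{*}_{ij}}
```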

Word Representation
Following the prior work (Nguyen and Grishman, 2015b; Chen et al., 2015), we pre-train word embeddings from a large corpus and employ them to initialize the word embedding table. Two log-linear models for training word embeddings were proposed in Mikolov et al. (2013a; 2013b): the continuous bag-of-words model (CBOW) and the continuous skip-gram model (SKIP-GRAM). The CBOW model attempts to predict the current word based on the average of the context word vectors, while the SKIP-GRAM model aims to predict the surrounding words in a sentence given the current word. In this work, besides the CBOW and SKIP-GRAM models, we examine a concatenation-based variant of CBOW (C-CBOW) to train word embeddings and compare the three models to understand their effectiveness for EE. The objective of C-CBOW is to predict the target word using the concatenation of the vectors of the words surrounding it.
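The difference between the two context representations can be sketched in a few lines (toy dimensions; only the input side of the objectives is shown):

```python
import numpy as np

rng = np.random.default_rng(4)
ctx = [rng.normal(size=4) for _ in range(4)]   # 4 context word vectors

# CBOW summarizes the context by averaging, so every context word is
# weighted identically when predicting the target word:
cbow_input = np.mean(ctx, axis=0)              # shape (4,)

# C-CBOW keeps the context vectors separate by concatenation, so the
# prediction layer can weight each context position differently:
c_cbow_input = np.concatenate(ctx)             # shape (16,)
```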

Resources, Parameters and Dataset
For all the experiments below, in the encoding phase, we use 50 dimensions for the entity type embeddings, 300 dimensions for the word embeddings and 300 units in the hidden layers for the RNNs.
Regarding the prediction phase, we employ the context window of 2 for the local features, and the feed-forward neural networks with one hidden layer for F trg , F arg and F binary (the size of the hidden layers are 600, 600 and 300 respectively).
Finally, for training, we use a mini-batch size of 50 and a Frobenius norm threshold of 3.
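The Frobenius-norm rescaling mentioned in the training section can be sketched as a standard max-norm constraint (the threshold of 3 follows the text; the exact application schedule is an assumption):

```python
import numpy as np

def rescale(W, s=3.0):
    """Rescale a weight matrix whose Frobenius norm exceeds s."""
    norm = np.linalg.norm(W)           # Frobenius norm for 2-D arrays
    return W * (s / norm) if norm > s else W

W = np.full((10, 10), 1.0)             # Frobenius norm = 10 > 3
W = rescale(W, s=3.0)                  # scaled back onto the norm ball
```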
These parameter values are either inherited from the prior research (Nguyen and Grishman, 2015b;Chen et al., 2015) or selected according to the validation data.
We pre-train the word embeddings on the English Gigaword corpus using the word2vec toolkit (modified to add the C-CBOW model). Following Baroni et al. (2014), we employ a context window of 5, subsampling of the frequent words set to 1e-05 and 10 negative samples.
We evaluate the model on the ACE 2005 corpus. For the purpose of comparison, we use the same data split as the previous work (Li et al., 2013; Nguyen and Grishman, 2015b; Chen et al., 2015). This data split includes 40 newswire articles (672 sentences) for the test set, 30 other documents (836 sentences) for the development set and the 529 remaining documents (14,849 sentences) for the training set. Also, we follow the criteria of the previous work (Li et al., 2013; Chen et al., 2015) to judge the correctness of the predicted event mentions.

Memory Vector/Matrices
This section evaluates the effectiveness of the memory vector and matrices presented in Section 3.2.3. In particular, we test the joint model in the different cases where the memory vector for triggers G^trg and the memory matrices for arguments G^arg/trg and G^arg are included in or excluded from the model. As there are 4 different ways to combine G^arg/trg and G^arg for argument labeling and two options to employ G^trg or not for trigger labeling, we have 8 systems for comparison in total. Table 1 reports the identification and classification performance (F1 scores) for triggers and argument roles on the development set. Note that we use the word embeddings trained with the C-CBOW technique in this section.

We observe that the memory vector G^trg is not helpful for the joint model, as it worsens both trigger and argument role performance (considering the same choice of the memory matrices G^arg/trg and G^arg, i.e., the same row in Table 1, except in the row with G^arg/trg + G^arg).
The clearest trend is that G^arg/trg is very effective in improving the performance of argument labeling. This holds whether G^trg is included or excluded. G^arg and its combination with G^arg/trg, on the other hand, have a negative effect on this task. Finally, G^arg/trg and G^arg do not contribute much to the trigger labeling performance in general (except in the case where G^trg, G^arg/trg and G^arg are all applied).
These observations suggest that the dependencies among trigger subtypes and among argument roles are not strong enough to be helpful for the joint model in this dataset. This is in contrast to the dependencies between argument roles and trigger subtypes, which improve the joint model significantly.
The best system corresponds to the application of the memory matrix G^arg/trg only and is used in all the experiments below.

Word Embedding Evaluation
We investigate different techniques to obtain the pre-trained word embeddings for initialization in the joint EE model. Table 2 presents the performance (for both triggers and argument roles) on the development set when the CBOW, SKIP-GRAM and C-CBOW techniques are utilized to obtain word embeddings from the same corpus. We also report the performance of the joint model when it is initialized with the Word2Vec word embeddings from Mikolov et al. (2013a; 2013b), as well as with random initialization (RANDOM). The first observation from the table is that RANDOM is not good enough to initialize the word embeddings for joint EE, and we need to borrow some pre-trained word embeddings for this purpose. Second, SKIP-GRAM, WORD2VEC and CBOW have comparable performance on trigger labeling, while the argument labeling performance of SKIP-GRAM and WORD2VEC is much better than that of CBOW for the joint EE model. Third, and most importantly, among the compared word embeddings, it is clear that C-CBOW significantly outperforms all the others. We believe that the better performance of C-CBOW stems from its concatenation of the multiple context word vectors, providing more information to learn better word embeddings than SKIP-GRAM and WORD2VEC. In addition, the concatenation mechanism essentially helps to assign different weights to different context words, thereby being more flexible than CBOW, which applies a single weight for all the context words. From now on, for consistency, C-CBOW is utilized in all the following experiments.

Comparison to the State of the art
The state-of-the-art systems for EE on the ACE 2005 dataset have been the pipelined system with dynamic multi-pooling convolutional neural networks by Chen et al. (2015) (DMCNN) and the joint system with structured prediction and various discrete local and global features by Li et al. (2013) (Li's structure).
Note that the pipelined system in Chen et al. (2015) is also the best-reported neural-network system for EE. Table 3 compares these state-of-the-art systems with the joint RNN-based model in this work (denoted by JRNN). For completeness, we also report the performance of the following representative systems: 1) Li's baseline: the pipelined system with local features by Li et al. (2013).
2) Liao's cross-event: the system by Liao and Grishman (2010) with the document-level information.
3) Hong's cross-entity: the system by Hong et al. (2011) that exploits cross-entity inference, and is also the best-reported pipelined system with discrete features in the literature.
From the table, we see that JRNN achieves the best F1 scores (for both trigger and argument labeling) among all of the compared models. This is significant for the argument role labeling performance (an improvement of 1.9% over the best-reported model DMCNN of Chen et al. (2015)), and demonstrates the benefit of the joint model with RNNs and memory features in this work. In addition, as JRNN significantly outperforms the joint model with discrete features of Li et al. (2013) (an improvement of 1.8% and 2.7% for trigger and argument role labeling, respectively), we can confirm the effectiveness of RNNs in learning effective feature representations for EE.

Sentences with Multiple Events
In order to further demonstrate the effectiveness of JRNN, especially for sentences with multiple events, we divide the test data into two parts according to the number of events in the sentences (i.e., single event vs. multiple events) and evaluate the performance separately, following Chen et al. (2015). Table 4 shows the performance (F1 scores) of JRNN, DMCNN and two other baseline systems, named Embeddings+T and CNN in Chen et al. (2015). Embeddings+T uses word embeddings and the traditional sentence-level features in Li et al. (2013), while CNN is similar to DMCNN, except that it applies the standard pooling mechanism instead of the dynamic multi-pooling method (Chen et al., 2015).
The most important observation from the table is that JRNN significantly outperforms all the other methods by large margins when the input sentences contain more than one event (i.e., the row labeled 1/N in the table). In particular, JRNN is 13.9% better than DMCNN on trigger labeling, while the corresponding improvement for argument role labeling is 6.5%, further suggesting the benefit of JRNN with the memory features. Regarding the performance on the single-event sentences, JRNN is still the best system on trigger labeling, although it is worse than DMCNN on argument role labeling. This can be partly explained by the fact that DMCNN includes the position embedding features for arguments, and the memory matrix G^arg/trg in JRNN is not functioning in this single-event case.

Conclusion
We present a joint model for EE based on bidirectional RNNs to overcome the limitations of the previous models for this task. We introduce the memory matrix that can effectively capture the dependencies between argument roles and trigger subtypes. We demonstrate that the concatenation-based variant of the CBOW word embeddings is very helpful for the joint model. The proposed joint model is empirically shown to be effective on sentences with multiple events and yields the state-of-the-art performance on the ACE 2005 dataset. In the future, we plan to apply this joint model to the event argument extraction task of the KBP evaluation, as well as to extend it to other joint tasks such as mention detection together with relation extraction.