Document Embedding Enhanced Event Detection with Hierarchical and Supervised Attention

Document-level information is very important for event detection even at sentence level. In this paper, we propose a novel Document Embedding Enhanced Bi-RNN model, called DEEB-RNN, to detect events in sentences. This model first learns event detection oriented embeddings of documents through a hierarchical and supervised attention based RNN, which pays word-level attention to event triggers and sentence-level attention to those sentences containing events. It then uses the learned document embedding to enhance another bidirectional RNN model to identify event triggers and their types in sentences. Through experiments on the ACE-2005 dataset, we demonstrate the effectiveness and merits of the proposed DEEB-RNN model via comparison with state-of-the-art methods.


Introduction
Event Detection (ED) is an important subtask of event extraction. It extracts event triggers from individual sentences and further identifies the type of the corresponding events. For instance, according to the ACE-2005 annotation guideline, in the sentence "Jane and John are married", an ED system should be able to identify the word "married" as a trigger of the event "Marry". However, it may be difficult to identify events from isolated sentences, because the same event trigger might represent different event types in different contexts.
Existing ED methods can mainly be categorized into two classes, namely, feature-based methods (e.g., McClosky et al., 2011; Hong et al., 2011; Li et al., 2014) and representation-based methods (e.g., Nguyen and Grishman, 2015; Chen et al., 2015; Liu et al., 2016a). The former mainly rely on a set of hand-designed features, while the latter employ distributed representations to capture meaningful semantic information. In general, most of these existing methods mainly exploit sentence-level contextual information. However, document-level information is also important for ED, because the sentences in the same document, although they may contain different types of events, are often correlated with respect to the theme of the document. For example, the following sentences appear in ACE-2005: "... I knew it was time to leave. Isn't that a great argument for term limits? ..." If we only examine the first sentence, it is hard to determine whether the trigger "leave" indicates a "Transport" event, meaning that he wants to leave the current place, or an "End-Position" event, indicating that he will stop working for his current organization. However, if we can capture the contextual information of this sentence, we can more confidently label "leave" as the trigger of an "End-Position" event. Based on this observation, some feature-based studies (Ji and Grishman, 2008; Liao and Grishman, 2010; Huang and Riloff, 2012) construct rules to capture document-level information for improving sentence-level ED. However, they suffer from two major limitations. First, the features used therein often need to be manually designed and may involve error propagation from natural language processing tools. Second, they discover inter-event information at document level by constructing inference rules, which is time-consuming, and it is hard to make the rule set complete.
Besides, a representation-based study has been presented in (Duan et al., 2017), which employs the PV-DM model to train document embeddings and further uses them in an RNN-based event classifier. However, limited by the unsupervised training process, the document-level representation cannot specifically capture event-related information.
In this paper, we propose a novel Document Embedding Enhanced Bi-RNN model, called DEEB-RNN, for ED at sentence level. This model first learns ED oriented embeddings of documents through a hierarchical and supervised attention based bidirectional RNN, which pays word-level attention to event triggers and sentence-level attention to those sentences containing events. It then uses the learned document embeddings to facilitate another bidirectional RNN model to identify event triggers and their types in individual sentences. This learning process is guided by a general loss function that integrates the losses corresponding to attention at both word and sentence levels with the loss of event type identification. It should be mentioned that although the attention mechanism has recently been applied effectively in various tasks, including machine translation, question answering (Hao et al., 2017), document summarization (Tan et al., 2017), etc., this is the first study, to the best of our knowledge, that adopts a hierarchical and supervised attention mechanism to learn ED oriented embeddings of documents.
We evaluate the developed DEEB-RNN model on the benchmark dataset, ACE-2005, and systematically investigate the impacts of different supervised attention strategies on its performance. Experimental results show that the DEEB-RNN model outperforms both feature-based and representation-based state-of-the-art methods in terms of recall and F1-measure.

The Proposed Model
We formalize ED as a multi-class classification problem. Given a sentence, we treat every word in it as a trigger candidate, and classify each candidate into a certain event type. In the ACE-2005 dataset, there are 8 event types, which are further divided into 33 subtypes, and a "Not Applicable (NA)" type. Without loss of generality, in this paper we regard the 33 subtypes as 33 event types. Figure 1 presents the schematic diagram of the proposed DEEB-RNN model, which contains two main modules:

1. The ED Oriented Document Embedding Learning (EDODEL) module, which learns the distributed representations of documents from both word and sentence levels via the well-designed hierarchical and supervised attention mechanism.
2. The Document-level Enhanced Event Detector (DEED) module, which tags each trigger candidate with an event type based on the learned embedding of documents.

The EDODEL Module
To learn the ED oriented embedding of a document, we apply the hierarchical and supervised attention network presented in Figure 1, which consists of a word-level Bi-GRU (Schuster and Paliwal, 2002) encoder with attention on event triggers and a sentence-level Bi-GRU encoder with attention on sentences with events. Given a document with $L$ sentences, DEEB-RNN learns its embedding for detecting events in all sentences.
Word-level embeddings. Consider a sentence $s_i$ ($i = 1, 2, ..., L$) consisting of words $\{w_{it} \mid t = 1, 2, ..., T\}$. For each word $w_{it}$, we first concatenate its embedding $\mathbf{w}_{it}$ and its entity type embedding$^{1}$ $\mathbf{e}_{it}$ (Nguyen and Grishman, 2015) as the input $\mathbf{g}_{it}$ of a Bi-GRU and thus obtain the bidirectional hidden state $\mathbf{h}_{it}$:
$$\mathbf{h}_{it} = [\overrightarrow{GRU}_w(\mathbf{g}_{it}), \overleftarrow{GRU}_w(\mathbf{g}_{it})].$$
We then feed $\mathbf{h}_{it}$ to a perceptron with no bias to get $\mathbf{u}_{it} = \tanh(W_w \mathbf{h}_{it})$ as a hidden representation of $\mathbf{h}_{it}$ and also obtain an attention weight $\alpha_{it} = \mathbf{u}_{it}^{T} \mathbf{c}_w$, which is then normalized through a softmax function over the words of the sentence. Here, similar to that in (Yang et al., 2016), $\mathbf{c}_w$ is a vector representing the word-level context of $w_{it}$, which is initialized at random. Finally, the embedding of the sentence $s_i$ can be obtained by summing up the $\mathbf{h}_{it}$ with their weights:
$$\mathbf{s}_i = \sum_{t=1}^{T} \alpha_{it} \mathbf{h}_{it}.$$
To pay more attention to trigger words than to other words, we construct the gold word-level attention signals $\alpha^{*}_{i}$ for the sentence $s_i$, as illustrated in Figure 2a. We can then take the square error as the general loss of the attention at word level to supervise the learning process:
$$E_w(\alpha^{*}, \alpha) = \sum_{i=1}^{L} \sum_{t=1}^{T} (\alpha^{*}_{it} - \alpha_{it})^{2}.$$
$^{1}$ The words in the ACE-2005 dataset are annotated with their entity types (annotated as "NA" if they are not an entity).
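The word-level attention and its supervised loss can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the toy sizes, the random hidden states standing in for the Bi-GRU output, and the gold signal that puts all mass on a single trigger position are assumptions made for illustration (Figure 2a is not reproduced here).

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def word_attention(H, W_w, c_w):
    """Attention-pool the hidden states of one sentence.

    H   : (T, 2h) hidden states h_it of the T words (Bi-GRU output)
    W_w : (a, 2h) weight of the bias-free perceptron
    c_w : (a,)    word-level context vector
    Returns the normalized weights alpha (T,) and the sentence embedding (2h,).
    """
    U = np.tanh(H @ W_w.T)      # u_it = tanh(W_w h_it)
    alpha = softmax(U @ c_w)    # softmax over the scores u_it^T c_w
    s = alpha @ H               # s_i = sum_t alpha_it h_it
    return alpha, s

def attention_loss(gold, pred):
    """Squared error between gold and predicted attention weights."""
    return float(np.sum((gold - pred) ** 2))

rng = np.random.default_rng(0)
T, two_h, a = 5, 8, 6                 # toy sizes (assumed)
H = rng.normal(size=(T, two_h))       # stand-in for Bi-GRU hidden states
W_w = rng.normal(size=(a, two_h))
c_w = rng.normal(size=a)

alpha, s = word_attention(H, W_w, c_w)

# Assumed gold signal: the trigger is the last word of the sentence.
gold = np.zeros(T)
gold[4] = 1.0
loss = attention_loss(gold, alpha)
```

Minimizing `loss` pushes the attention mass toward the trigger position, which is the effect the supervised term is meant to have.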
Sentence-level embeddings. Given the sentence embeddings $\{\mathbf{s}_i \mid i = 1, 2, ..., L\}$, we first get the hidden state $\mathbf{q}_i$ via a Bi-GRU:
$$\mathbf{q}_i = [\overrightarrow{GRU}_s(\mathbf{s}_i), \overleftarrow{GRU}_s(\mathbf{s}_i)].$$
Then we feed $\mathbf{q}_i$ to a perceptron with no bias to get the hidden representation $\mathbf{t}_i = \tanh(W_s \mathbf{q}_i)$ and also obtain an attention weight $\beta_i = \mathbf{t}_i^{T} \mathbf{c}_s$, which is then normalized via softmax. Similarly, $\mathbf{c}_s$ represents the sentence-level context of $s_i$ and is randomly initialized. We eventually obtain the document embedding $\mathbf{d}$ as the sum over the sentences of the document, weighted by the normalized attention weights $\beta_i$. We also expect that the sentences containing events should obtain more attention than other ones. Therefore, similar to the case at word level, we construct the gold sentence-level attention signals $\beta^{*}$ for the document $d$, as illustrated in Figure 2b, and further take the square error as the general loss of the attention at sentence level to supervise the learning process:
$$E_s(\beta^{*}, \beta) = \sum_{i=1}^{L} (\beta^{*}_{i} - \beta_{i})^{2}.$$
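As a rough illustration of the sentence-level supervision, the sketch below derives a gold signal from per-word event labels and computes the squared-error loss. The construction rule (1 for a sentence with at least one non-NA label, 0 otherwise) is an assumption, since Figure 2b is not reproduced here; the helper names, the second toy sentence and the predicted weights are hypothetical.

```python
import numpy as np

def gold_sentence_attention(doc_labels, na_label="NA"):
    """Assumed rule: 1.0 if a sentence has at least one non-NA word label, else 0.0."""
    return np.array(
        [float(any(lab != na_label for lab in sent)) for sent in doc_labels]
    )

def sentence_attention_loss(beta_gold, beta):
    """E_s: squared error between gold and predicted sentence attention."""
    return float(np.sum((beta_gold - beta) ** 2))

# Toy document: "Jane and John are married" plus a hypothetical event-free sentence.
doc_labels = [
    ["NA", "NA", "NA", "NA", "Marry"],
    ["NA", "NA", "NA"],
]
beta_gold = gold_sentence_attention(doc_labels)   # one entry per sentence
beta_pred = np.array([0.7, 0.3])                  # hypothetical softmax output
e_s = sentence_attention_loss(beta_gold, beta_pred)
```

Under this rule the first sentence receives gold weight 1 and the second receives 0, so the loss penalizes attention spent on the event-free sentence.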

The DEED Module
We employ another Bi-GRU encoder and a softmax output layer to model the ED task, which can handle event triggers with multiple words. Specifically, given a sentence $s_j$ ($j = 1, 2, ..., L$) in document $d$, for each of its words $w_{jt}$ ($t = 1, 2, ..., T$), we concatenate its word embedding $\mathbf{w}_{jt}$ and entity type embedding $\mathbf{e}_{jt}$ with the corresponding document embedding $\mathbf{d}$ as the input $\mathbf{r}_{jt}$ of the Bi-GRU and thus obtain the hidden state $\mathbf{f}_{jt}$:
$$\mathbf{f}_{jt} = [\overrightarrow{GRU}_e(\mathbf{r}_{jt}), \overleftarrow{GRU}_e(\mathbf{r}_{jt})].$$
Finally, we get the probability vector $\mathbf{o}_{jt}$ with $K$ dimensions through a softmax layer for $w_{jt}$, where the $k$-th element, $o^{(k)}_{jt}$, of $\mathbf{o}_{jt}$ indicates the probability of classifying $w_{jt}$ to the $k$-th event type. The loss function, $J(y, o)$, can thus be defined in terms of the cross-entropy error of the real event type $y_{jt}$ and the predicted probability $o^{(k)}_{jt}$ as follows:
$$J(y, o) = -\sum_{j=1}^{L} \sum_{t=1}^{T} \sum_{k=1}^{K} I(y_{jt} = k) \log o^{(k)}_{jt},$$
where $I(\cdot)$ is the indicator function.
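The input construction and the cross-entropy loss of the DEED module can be sketched for a single sentence as follows. This is only an illustration under stated assumptions: the toy dimensions are arbitrary, and a random linear map (`W_out`, hypothetical) stands in for the Bi-GRU and output layer so that the example stays self-contained.

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise numerically stable softmax."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def build_inputs(word_emb, ent_emb, d):
    """r_jt = [w_jt ; e_jt ; d]: the same document embedding is appended to every word."""
    T = word_emb.shape[0]
    return np.concatenate([word_emb, ent_emb, np.tile(d, (T, 1))], axis=1)

def cross_entropy(O, y):
    """-sum_t sum_k I(y_t = k) log o_t^(k); the indicator selects log o_t^(y_t)."""
    T = O.shape[0]
    return float(-np.sum(np.log(O[np.arange(T), y])))

rng = np.random.default_rng(0)
T, K = 3, 4                                  # toy sentence length and number of classes
word_emb = rng.normal(size=(T, 6))
ent_emb = rng.normal(size=(T, 2))
d = rng.normal(size=5)                       # stand-in document embedding

R = build_inputs(word_emb, ent_emb, d)       # shape (T, 6 + 2 + 5)
W_out = rng.normal(size=(R.shape[1], K))     # hypothetical stand-in for Bi-GRU + output layer
O = softmax_rows(R @ W_out)                  # one probability vector o_jt per word
y = np.array([0, 0, 2])                      # toy gold event type indices
loss = cross_entropy(O, y)
```

Because the indicator is non-zero only for the gold class, the triple sum reduces to picking one log-probability per word, which is what the fancy indexing in `cross_entropy` does.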

Joint Training of the DEEB-RNN model
In the DEEB-RNN model, the above two modules are jointly trained. For this purpose, we define the joint loss function in the training process upon the losses specified for different modules as follows:
$$J(\theta) = \sum_{d \in \phi} \left( J(y, o) + \lambda E_w(\alpha^{*}, \alpha) + \mu E_s(\beta^{*}, \beta) \right),$$
where $\theta$ denotes, as a whole, the parameters used in DEEB-RNN, $\phi$ is the training document set, and $\lambda$ and $\mu$ are hyper-parameters for striking a balance among $J(y, o)$, $E_w(\alpha^{*}, \alpha)$ and $E_s(\beta^{*}, \beta)$.
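The way the three losses are combined can be written as a small helper. It is a sketch with made-up per-document loss values; the function name and the tuple layout are hypothetical, and only the weighting by λ and µ comes from the text.

```python
def joint_loss(per_doc_losses, lam, mu):
    """Sum J(y, o) + lam * E_w + mu * E_s over the training documents.

    per_doc_losses: iterable of (J, E_w, E_s) tuples, one per document.
    """
    return sum(j + lam * e_w + mu * e_s for j, e_w, e_s in per_doc_losses)

# Made-up loss values for two documents, for illustration only.
losses = [(2.0, 0.5, 0.25), (1.0, 0.0, 0.75)]

unsupervised = joint_loss(losses, lam=0.0, mu=0.0)   # attention learned without gold signals
word_only = joint_loss(losses, lam=1.0, mu=0.0)      # word-level supervision only
sentence_only = joint_loss(losses, lam=0.0, mu=1.0)  # sentence-level supervision only
both = joint_loss(losses, lam=1.0, mu=1.0)           # supervision at both levels
```

Setting λ or µ to 0 switches off the corresponding supervision term, which is how the attention strategies compared in the experiments differ from each other.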

Datasets and Settings
We validate the proposed model through comparison with state-of-the-art methods on the ACE-2005 dataset. In the experiments, the validation set has 30 documents from different genres, the test set has 40 documents and the training set contains the remaining 529 documents. All the data preprocessing and evaluation criteria follow those in (Ghaeini et al., 2016).
Hyper-parameters are tuned on the validation set. We set the dimensions of the hidden layers corresponding to $GRU_w$, $GRU_s$, and $GRU_e$ to 300, 200, and 300, respectively, the output sizes of $W_w$ and $W_s$ to 600 and 400, respectively, the dimension of entity type embeddings to 50, the batch size to 25, and the dropout rate to 0.5. In addition, we utilize the pre-trained word embeddings with 300 dimensions from (Mikolov et al., 2013) for initialization. For entity types, their embeddings are randomly initialized. We train the model using Stochastic Gradient Descent (SGD) over shuffled mini-batches and using dropout (Krizhevsky et al., 2012) for regularization.
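For reference, the reported settings can be collected in a small configuration object, together with the dimensions they imply. The class and field names are hypothetical, and the derived sizes assume that each Bi-GRU concatenates its forward and backward hidden states.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeebRnnConfig:
    """Hyper-parameters as reported in the text (field names are hypothetical)."""
    word_emb_dim: int = 300     # pre-trained word embeddings
    entity_emb_dim: int = 50    # randomly initialized entity type embeddings
    gru_w_hidden: int = 300     # word-level GRU, per direction
    gru_s_hidden: int = 200     # sentence-level GRU, per direction
    gru_e_hidden: int = 300     # event detector GRU, per direction
    att_w_out: int = 600        # output size of W_w
    att_s_out: int = 400        # output size of W_s
    batch_size: int = 25
    dropout: float = 0.5

    @property
    def word_input_dim(self) -> int:
        """Size of g_it: word embedding concatenated with entity type embedding."""
        return self.word_emb_dim + self.entity_emb_dim

    @property
    def word_state_dim(self) -> int:
        """Size of h_it, assuming forward and backward states are concatenated."""
        return 2 * self.gru_w_hidden

    @property
    def sent_state_dim(self) -> int:
        """Size of q_i, under the same concatenation assumption."""
        return 2 * self.gru_s_hidden

cfg = DeebRnnConfig()
dims = (cfg.word_input_dim, cfg.word_state_dim, cfg.sent_state_dim)
```

Under that assumption the bidirectional state sizes (600 and 400) coincide with the reported output sizes of $W_w$ and $W_s$.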

Baseline Models
In order to validate the proposed DEEB-RNN model through experimental comparison, we choose the following typical models as the baselines.
Sentence-level is a feature-based model proposed in (Hong et al., 2011), which regards entity-type consistency as a key feature to predict event mentions.
Joint Local is a feature-based model developed in (Li et al., 2013), which incorporates features that explicitly capture the dependency among multiple triggers and arguments.
JRNN is a representation-based model that exploits the inter-dependency between event triggers and argument roles via discrete structures.
Skip-CNN is a representation-based model that proposes a novel convolution to exploit nonconsecutive k-grams for event detection.
ANN-S2 is a representation-based model that explicitly exploits argument information for event detection via supervised attention mechanisms.
Cross-event is a feature-based model proposed in (Liao and Grishman, 2010), which learns relations among event types from the training corpus and further helps predict the occurrence of events.
PSL is a feature-based model developed in (Liu et al., 2016b), which encodes global information such as event-event association in the form of logic using the probabilistic soft logic model.
DLRNN is a representation-based model proposed in (Duan et al., 2017), which automatically extracts cross-sentence clues to improve sentence-level event detection.

Impacts of Different Attention Strategies
In this section, we conduct experiments on the ACE-2005 dataset to demonstrate the effectiveness of different attention strategies.
Bi-GRU is the basic ED model, which does not employ document-level embeddings.
DEEB-RNN uses the document embeddings and computes attentions without supervision, in which hyper-parameters λ and µ are set to 0.
DEEB-RNN1/2/3 use the gold attention signals as supervision information. Specifically, DEEB-RNN1 uses only the gold word-level attention signal (λ = 1 and µ = 0), DEEB-RNN2 uses only the gold sentence-level attention signal (λ = 0 and µ = 1), whilst DEEB-RNN3 employs the gold attention signals at both word and sentence levels (λ = 1 and µ = 1).
Table 1 compares these methods, where we can observe that the methods with document embeddings (i.e., the last four) significantly outperform the pure Bi-GRU method, which suggests that document-level information is very beneficial for ED. An interesting phenomenon is that, as compared to DEEB-RNN, DEEB-RNN2 changes the precision-recall balance: its precision is higher, while its recall is lower. There are two reasons for this. On the one hand, as compared to DEEB-RNN, DEEB-RNN2 uses the gold sentence-level attention signal, indicating that it pays special attention to the sentences containing events with event triggers. In this way, the Bi-RNN model for learning document embeddings will filter out the sentences containing events but without explicit event triggers. That means the events detected by DEEB-RNN2 are basically the ones with explicit event triggers. Therefore, as compared to DEEB-RNN, the precision of DEEB-RNN2 is improved. On the other hand, the above strategy may result in less learning of words that are event triggers but do not appear in the training dataset. Therefore, the sentences with such event triggers cannot be detected, and the recall of DEEB-RNN2 is thus lowered as compared to DEEB-RNN. Moreover, DEEB-RNN3 shows the best performance, indicating that the gold attention signals at both word and sentence levels are useful for ED.

Performance Comparison
Table 2 presents the overall performance of all methods on ACE-2005. We can see that different versions of DEEB-RNN consistently outperform the existing state-of-the-art methods in terms of both recall and F1-measure, while their precision is comparable to that of others.
The better performance of DEEB-RNN can be explained by the following reasons: (1) Compared with feature-based methods, including Sentence-level and Joint Local, and representation-based methods, including JRNN, Skip-CNN and ANN-S2, our method exploits document-level information (i.e., the ED oriented document embeddings) from both word and sentence levels in a document by the supervised attention mechanism, which enhances the ability of identifying trigger words; (2) Compared with feature-based methods using document-level information, such as Cross-event and PSL, our method can automatically capture event types in documents via an end-to-end Bi-RNN based model without manually designed rules; (3) Compared with representation-based methods using document-level information, such as DLRNN, our method can learn event detection oriented embeddings of documents through the hierarchical and supervised attention based Bi-RNN network.
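Since the comparison is stated in terms of precision, recall and F1-measure, a minimal sketch of how such scores can be computed over per-word predictions may be useful. It assumes the common convention that a prediction counts as correct only when a non-NA label matches the gold label at the same position; the exact criteria used in the experiments follow (Ghaeini et al., 2016) and may differ in detail. The labels below are toy data.

```python
def precision_recall_f1(gold, pred, na_label="NA"):
    """Scores over per-word event type labels (equal-length lists)."""
    correct = sum(1 for g, p in zip(gold, pred) if p != na_label and p == g)
    n_pred = sum(1 for p in pred if p != na_label)   # predicted triggers
    n_gold = sum(1 for g in gold if g != na_label)   # gold triggers
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy labels: one correct trigger, one wrong type, one spurious trigger.
gold = ["NA", "Marry", "NA", "End-Position", "NA"]
pred = ["NA", "Marry", "Transport", "Transport", "NA"]
p, r, f = precision_recall_f1(gold, pred)
```

In this toy case one of three predicted triggers is correct and one of two gold triggers is found, so a system that predicts fewer but more reliable triggers would raise precision at the cost of recall.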

Conclusions and Future Work
In this study, we proposed a hierarchical and supervised attention based and document embedding enhanced Bi-RNN method, called DEEB-RNN, for event detection. We explored different strategies to construct gold word- and sentence-level attentions to focus on event information. Experiments on the ACE-2005 dataset demonstrate that DEEB-RNN achieves better performance than the state-of-the-art methods in terms of both recall and F1-measure. In this paper, we strike a balance between sentence and document embeddings by adjusting their dimensions. In the future, we may improve the DEEB-RNN model to automatically determine the weights of sentence and document embeddings.