Collective Event Detection via a Hierarchical and Bias Tagging Networks with Gated Multi-level Attention Mechanisms

Traditional approaches to the task of ACE event detection primarily regard multiple events in one sentence as independent and recognize them separately by using sentence-level information. However, events in one sentence are usually interdependent, and sentence-level information is often insufficient to resolve ambiguities for some types of events. This paper proposes a novel framework dubbed Hierarchical and Bias Tagging Networks with Gated Multi-level Attention Mechanisms (HBTNGMA) to solve the two problems simultaneously. Firstly, we propose hierarchical and bias tagging networks to detect multiple events in one sentence collectively. Then, we devise a gated multi-level attention to automatically extract and dynamically fuse the sentence-level and document-level information. The experimental results on the widely used ACE 2005 dataset show that our approach significantly outperforms other state-of-the-art methods.


Introduction
Event detection (ED) is a crucial subtask of event extraction, which aims to identify event triggers in text and classify them into specific types. According to the task defined in Automatic Content Extraction (ACE; http://projects.ldc.upenn.edu/ace/), given the following sentence S1, a robust ED system should be able to recognize two events: a Die event triggered by died and an Attack event triggered by fired.
S1: In Baghdad, a cameraman died when an American tank fired on the Palestine Hotel.
To this end, most methods (Ahn, 2006; Hong et al., 2011; Chen et al., 2015; Liu et al., 2017) model ED as a multi-classification task and predict every word in the sentence separately, using sentence-level information to determine whether it triggers a specific type of event. However, they face two problems: (1) they neglect event interdependency by predicting each event separately; (2) sentence-level information is usually insufficient to resolve ambiguities for some types of events. In the following, we use examples to illustrate these two problems.
S2: The project leader was fired for the bankruptcy of the subsidiary company.
Event interdependency: In S1, fired triggers an Attack event, while it triggers an End-Position event in S2. Because of this ambiguity, a traditional approach may mislabel fired in S1 as a trigger of an End-Position event. However, if we know that died triggers a Die event in S1, which is easier to disambiguate, we tend to predict that fired triggers an Attack event. The reason is that the events mentioned in the same sentence tend to be semantically coherent, and a Die event usually co-occurs with an Attack event. A similar phenomenon can be found in S2. We conduct a statistical analysis on the ACE 2005 dataset and find that nearly 30% of sentences contain multiple events, a proportion that cannot be ignored. To give an intuitive illustration, the top 5 event types that co-occur with the Attack event in the same sentence are shown in Figure 1. We call such clues event interdependency. Some works (Yang and Mitchell, 2016; Liu et al., 2016b) rely on a set of elaborately designed features and complicated natural language processing (NLP) tools to capture event interdependency. However, these methods lack generalization, require a large amount of human effort and are prone to error propagation. Though Nguyen et al. (2016) use a Recurrent Neural Network (RNN) based classification model to capture the interdependency between the current event candidate and the former (left) predicted events, they miss the interdependency between the current event candidate and the later (right) predicted events, so the later events cannot change the type of the current event. The reason is that they classify the words of the sentence from left to right one by one and only use the former events to predict the later event types. We claim that both the former and the later predicted events are important for predicting the event type of the current trigger candidate. For example, in S1 the former predicted Die event can help us to predict that fired triggers an Attack event, while in S2 the later predicted bankruptcy event can help us to predict that fired triggers an End-Position event. Thus, how to use a neural-based model to capture all event interdependencies (the interdependencies between the current event candidate and its former/later predicted events) in the whole sentence is a challenging problem.
Sentence-level and document-level information: Besides event interdependency, knowing that American tank is a weapon gives us additional evidence to predict that fired triggers an Attack event in S1. Similarly in S2, knowing that project leader is a job title helps us to predict that fired triggers an End-Position event. We call such clues sentence-level information. However, sometimes it is difficult even for people to classify event types from an isolated sentence, and we must resort to document-level information. For example, consider the following sentence with the ambiguous word left:
S3: He left the company.
It is hard to tell whether left triggers a Transport event, which means that he left the place, or an End-Position event, which means that he resigned from the company. However, if we read the whole document, a clue like "He planned to go shopping before he went home, because he got off work early today." would give us more confidence to believe that left triggers a Transport event, while a clue like "They held a party for his retirement." would indicate that the aforementioned event is an End-Position event. We call such clues document-level information. Moreover, the confidence of sentence-level and document-level information should be taken into consideration when using them together to construct a broader range of contextual information. For example, in S3 document-level information gives us more evidence, while in S1 sentence-level information is enough to disambiguate the types of events. There have been some feature-based studies (Ji and Grishman, 2008; Liao and Grishman, 2010; Huang and Riloff, 2012) that construct rules to capture document-level information for improving sentence-level ED. However, they suffer from two problems: (1) the features they use often need to be manually designed and may involve error propagation from existing NLP tools; (2) sentence-level and document-level information are integrated by a large number of fixed rules, which are complicated to construct and far from complete. Thus, how to use a neural-based model to automatically extract sentence-level and document-level information and dynamically integrate them is another challenging problem.
In this paper, we propose Hierarchical and Bias Tagging Networks with Gated Multi-level Attention Mechanisms (HBTNGMA) to address the two problems stated above simultaneously. To capture event interdependency and collectively detect multiple events in one sentence, we propose hierarchical and bias tagging networks for event detection, in which we exploit a hierarchical RNN-based tagging layer to capture all event interdependencies in the whole sentence and devise a bias objective function to reinforce the influence of trigger tags on the model. To use a broader range of contextual information for the event candidate, we propose a gated multi-level attention, which can automatically extract sentence-level and document-level information and integrate them dynamically. In summary, the contributions of this paper are as follows:
• We propose a novel framework for event detection, which can automatically extract and dynamically integrate sentence-level and document-level information and collectively detect multiple events in one sentence.
• To capture event interdependency, we exploit hierarchical and bias tagging networks to detect multiple events in one sentence collectively. To automatically extract and dynamically integrate contextual information, we devise gated multi-level attention mechanisms. To our knowledge, this is the first work to jointly use event interdependency, sentence-level information and document-level information via a neural tagging schema for the event detection task.
• We conduct extensive experiments on the widely used ACE 2005 dataset, and the experimental results show that our approach significantly outperforms other state-of-the-art methods.

Task Description
Event detection (ED) is a crucial subtask of event extraction (EE). In this paper, we focus on the ED task defined in the ACE evaluation, where an event is defined as a specific occurrence involving one or more participants. We first introduce some ACE terminology to facilitate the understanding of this task:
Event trigger: the main word or phrase that most clearly expresses the occurrence of an event.
Event arguments: the mentions that are involved in an event (viz., participants).
Event mention: a phrase or sentence within which an event is described, including a trigger and arguments.
Given an English text document, an ED system should identify event triggers and categorize their event types for each sentence. For instance, in the sentence "He died in the hospital", an ED system is expected to detect a Die event along with the trigger word "died". The ACE 2005 evaluation defines 8 event types and 33 subtypes, such as Attack or Die. Following previous works (Chen et al., 2015; Liu et al., 2017), we categorize triggers into these 33 subtypes.

Methodology
In this paper, we formulate event detection as a sequence labelling task. As shown in Figure 2, we label all words in one sentence collectively via Hierarchical and Bias Tagging Networks with Gated Multi-level Attention Mechanisms (HBTNGMA). We assign a tag to each word to indicate whether it triggers a specific type of event. We adopt the "BIO" tagging schema, where the tag "O" is the "other" tag, meaning that the corresponding word does not trigger any event, and the tags "B-EventType" and "I-EventType" are the "Begin-EventType" and "Inside-EventType" tags respectively. "EventType" means that the word triggers a specific type of event, and "B" and "I" give the position of the word within a trigger, which handles triggers that contain multiple words such as "take over" and "go off". Thus, the total number of tags is $N_t = 2 \times |N_{EventType}| + 1$, where $|N_{EventType}|$ is the number of predefined event types; $|N_{EventType}| = 33$ in this paper as stated above, so $N_t = 67$.
Figure 2 describes the architecture of HBTNGMA, which primarily involves the following four components: (i) the embedding layer, which transforms each word into a continuous vector; (ii) the BiLSTM layer, which uses a Bidirectional Long Short-Term Memory (BiLSTM) network to encode the semantics of each word considering the forward and backward information; (iii) the gated multi-level attention, in which we propose a sentence-level and a document-level attention to automatically extract sentence-level and document-level information respectively, and devise a fusion gate to dynamically integrate them as context information; and (iv) the hierarchical tagging layer, in which we propose two Tagging LSTMs (TLSTM1 and TLSTM2) and a tagging attention to automatically capture the event interdependency and tag all the words of the sequence collectively.
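The size of the tag inventory can be checked with a short sketch. The event type names below are placeholders (assumed for illustration; they are not the actual ACE 2005 subtype list), and `build_bio_tags` is a hypothetical helper, not part of the paper's implementation.

```python
def build_bio_tags(event_types):
    """Return the BIO tag inventory: one "O" tag plus B-/I- tags per event type."""
    tags = ["O"]
    for event_type in event_types:
        tags.append(f"B-{event_type}")
        tags.append(f"I-{event_type}")
    return tags


# Placeholder names standing in for the 33 ACE 2005 subtypes (assumption).
event_types = [f"Type{i}" for i in range(1, 34)]
tags = build_bio_tags(event_types)
print(len(tags))  # 2 * 33 + 1 = 67
```

Each word in a sentence then receives exactly one of these 67 tags.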

Embedding Layer
This paper uses the learned word embeddings as the source of basic features. Specifically, we use the Skip-gram model (Mikolov et al., 2013) to learn word embeddings on the NYT corpus.
We are given a document $d = \{s_1, s_2, ..., s_i, ..., s_{N_s}\}$, where $N_s$ is the number of sentences in the document. The $i$-th sentence $s_i$ can be represented as a token sequence $s_i = \{w_1, w_2, ..., w_t, ..., w_{N_w}\}$, where $N_w$ is the length of the sentence and $w_t$ is the $t$-th token of the sentence. The word embedding for token $w_t$ is denoted $e_t$, and we use it as the input of the following layer.

BiLSTM Layer
In sequence labelling problems, the BiLSTM has been proven effective at capturing the semantic information of each word (Lample et al., 2016). In this paper, we use the LSTM unit described in (Zaremba and Sutskever, 2014). For each word $w_t$, the forward LSTM encodes $w_t$ by considering the contextual information from word $w_1$ to $w_t$, which is marked as $\overrightarrow{h_t}$. Similarly, the backward LSTM encodes $w_t$ based on the contextual information from $w_{N_w}$ to $w_t$, which is marked as $\overleftarrow{h_t}$. Finally, we concatenate $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ to represent the information of the word $w_t$, denoted as $h_t = [\overrightarrow{h_t}, \overleftarrow{h_t}]$, and we concatenate $\overrightarrow{h_{N_w}}$ and $\overleftarrow{h_1}$ to represent the encoding information of the whole sentence $s_i$, denoted as $h_{s_i} = [\overrightarrow{h_{N_w}}, \overleftarrow{h_1}]$.

Gated Multi-level Attention
Gated multi-level attention primarily involves the following three components: (i) the sentence-level attention layer, which automatically captures important sentence-level information by considering the current word; (ii) the document-level attention layer, which automatically captures important document-level information by considering the current sentence; and (iii) the fusion gate layer, which uses a fusion gate to dynamically integrate sentence-level and document-level information.

Sentence-level Attention Layer
The sentence-level attention layer aims to capture the important clues at the sentence level. For each candidate word $w_t$ in the sentence, its sentence-level semantic information $sh_t$ is calculated as follows:
$$sh_t = \sum_{k=1}^{N_w} \alpha_s^k h_k$$
where $\alpha_s^k$ is the weight of each word representation $h_k$. In this paper, we define $\alpha_s^k$ as follows:
$$\alpha_s^k = \frac{\exp(z_s^k)}{\sum_{j=1}^{N_w} \exp(z_s^j)}$$
where $z_s^k$ is the relatedness between the $t$-th word representation $h_t$ and the $k$-th word representation $h_k$, modeled by bilinear attention as:
$$z_s^k = \tanh(h_t W_{sa} h_k^{\top} + b_{sa})$$
where $W_{sa}$ is the weight matrix and $b_{sa}$ is the bias term. With this sentence-level attention mechanism, we obtain the sentence-level information for each word $w_t$ conditioned on the semantic information of $w_t$ itself.
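The sentence-level attention can be illustrated with a minimal pure-Python sketch. It assumes a tanh nonlinearity on the bilinear score and uses toy vectors and an identity weight matrix; the function names are hypothetical and the numbers are made up.

```python
import math


def bilinear_score(h_t, W, h_k, b):
    """Relatedness z = tanh(h_t W h_k^T + b) between two word vectors."""
    projected = [sum(h_t[i] * W[i][j] for i in range(len(h_t))) for j in range(len(h_k))]
    return math.tanh(sum(p * x for p, x in zip(projected, h_k)) + b)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sentence_attention(H, t, W, b):
    """Return (weights, sh_t): attention weights and the weighted sum of word vectors for word t."""
    weights = softmax([bilinear_score(H[t], W, h_k, b) for h_k in H])
    dim = len(H[0])
    sh_t = [sum(w * h_k[d] for w, h_k in zip(weights, H)) for d in range(dim)]
    return weights, sh_t


# Toy 2-dimensional word representations for a 3-word sentence (made-up numbers).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_sa = [[1.0, 0.0], [0.0, 1.0]]  # identity weight matrix, for illustration only
weights, sh_t = sentence_attention(H, t=0, W=W_sa, b=0.0)
print(weights, sh_t)
```

In this toy case the first and third words receive equal weight, since both have the same bilinear score against the first word, while the orthogonal second word receives less. The same routine applied to sentence representations instead of word representations would give the document-level variant.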

Document-level Attention Layer
Similar to sentence-level attention, document-level attention captures the vital clues at the document level. The document-level semantic information $dh_i$ for the $i$-th sentence is calculated as follows:
$$dh_i = \sum_{k=1}^{N_s} \alpha_d^k h_{s_k}$$
$$\alpha_d^k = \frac{\exp(z_d^k)}{\sum_{j=1}^{N_s} \exp(z_d^j)}$$
$$z_d^k = \tanh(h_{s_i} W_{da} h_{s_k}^{\top} + b_{da})$$
where $\alpha_d^k$ is the weight of each sentence representation $h_{s_k}$, $z_d^k$ is the relatedness between the $i$-th sentence representation $h_{s_i}$ and the $k$-th sentence representation $h_{s_k}$, $W_{da}$ is the weight matrix and $b_{da}$ is the bias term. In contrast to sentence-level information, all words in the $i$-th sentence share the same document-level information $dh_i$.

Fusion Gate Layer
We devise a fusion gate to dynamically integrate the sentence-level information $sh_t$ and the document-level information $dh_i$ for the $t$-th word $w_t$ in the $i$-th sentence $s_i$, and calculate the contextual information representation $cr_t$ as follows:
$$cr_t = (G_t \odot sh_t) + ((1 - G_t) \odot dh_i)$$
where $G_t$ is a fusion gate that models the confidence of the clues provided by the sentence-level information $sh_t$ and the document-level information $dh_i$, which is calculated as follows:
$$G_t = \sigma(W_g [sh_t, dh_i] + b_g)$$
where $W_g$ is the weight matrix and $b_g$ is the bias term, $\sigma$ is the sigmoid function and $\odot$ denotes element-wise multiplication. Finally, the contextual information $cr_t$ of word $w_t$ and its word embedding $e_t$ are concatenated into a single vector $xr_t = [e_t, cr_t]$ as the feature representation of $w_t$.
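A minimal sketch of the fusion gate follows. It assumes the gate is a sigmoid over a linear map of the concatenated sentence-level and document-level vectors and that the two sources are mixed as a per-dimension convex combination; all names and numbers are illustrative, not taken from the paper's implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fusion_gate(sh_t, dh_i, W_g, b_g):
    """Compute one gate value per dimension from the concatenation [sh_t, dh_i]."""
    concat = sh_t + dh_i
    return [sigmoid(sum(w * x for w, x in zip(row, concat)) + b) for row, b in zip(W_g, b_g)]


def fuse(sh_t, dh_i, gate):
    """Element-wise mix: gate * sh_t + (1 - gate) * dh_i."""
    return [g * s + (1.0 - g) * d for g, s, d in zip(gate, sh_t, dh_i)]


sh_t = [0.2, 0.8]  # toy sentence-level information
dh_i = [0.6, 0.0]  # toy document-level information
W_g = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]  # zero weights give a gate of 0.5 everywhere
b_g = [0.0, 0.0]
gate = fusion_gate(sh_t, dh_i, W_g, b_g)
cr_t = fuse(sh_t, dh_i, gate)
print(gate, cr_t)
```

With all-zero gate parameters every gate dimension equals 0.5, so the result is the plain average of the two vectors; a trained gate can instead shift each dimension towards whichever source is more reliable for the current word.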

Hierarchical Tagging Layer
In hierarchical tagging layer, we propose two Tagging LSTMs (TLSTM1 and TLSTM2) and a tagging attention to automatically capture the event interdependency and tag the sequence collectively.

The First Tagging Layer: TLSTM1
When detecting the tag of word $w_t$ in TLSTM1, the inputs are: the feature representation $xr_t$ obtained from the embedding layer and the gated multi-level attention layer, the former predicted tag vector $T^1_{t-1}$, and the former hidden vector $h^1_{t-1}$ in TLSTM1. From these inputs the unit computes the following quantities: $i_t$ is an input gate, $u_t$ is an input modulation gate, $f_t$ is a forget gate, $o_t$ is an output gate, $c_t$ is a memory cell and $T^1_t$ is a predicted tagging vector.
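One plausible form of the TLSTM1 unit is sketched below. It assumes a standard LSTM cell in which the former predicted tag vector enters every gate as a third input and the tag vector is a linear map of the hidden state. The parameter names ($W_\ast$, $U_\ast$, $V_\ast$, $b_\ast$, $W_T$, $b_T$) are assumptions introduced for illustration.

```latex
\begin{aligned}
i_t &= \sigma\left(W_i\, xr_t + U_i\, h^1_{t-1} + V_i\, T^1_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_f\, xr_t + U_f\, h^1_{t-1} + V_f\, T^1_{t-1} + b_f\right) \\
o_t &= \sigma\left(W_o\, xr_t + U_o\, h^1_{t-1} + V_o\, T^1_{t-1} + b_o\right) \\
u_t &= \tanh\left(W_u\, xr_t + U_u\, h^1_{t-1} + V_u\, T^1_{t-1} + b_u\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot u_t \\
h^1_t &= o_t \odot \tanh(c_t) \\
T^1_t &= W_T\, h^1_t + b_T
\end{aligned}
```

Under this reading, feeding $T^1_{t-1}$ back into the gates is what lets an earlier predicted event influence the tag of the current word.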

The Second Tagging Layer: TLSTM2
Though TLSTM1 can capture the interdependency between the current event candidate and the former predicted event tags, it cannot capture the interdependency between the current event candidate and the later predicted event tags. Thus, we devise a second tagging layer (TLSTM2) on top of TLSTM1 to capture the interdependency between the current event candidate and both the former and the later predicted event tags from TLSTM1. When detecting the tag of word $w_t$ in TLSTM2, the inputs are: the feature representation $xr_t$, the former predicted tag vector $T^2_{t-1}$ in TLSTM2, the preliminary predicted information $T^a_t$ calculated from TLSTM1, and the former hidden vector $h^2_{t-1}$ in TLSTM2. The unit structure of TLSTM2 is similar to that of TLSTM1. Two points deserve attention: (1) the initial hidden input $h^2_0$ of TLSTM2 is the last hidden vector $h^1_{N_w}$ of TLSTM1; (2) the preliminary predicted information $T^a_t$ is calculated from TLSTM1 by using a tagging attention as follows.

Tagging Attention
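Tagging attention aims to automatically encode the preliminary predicted information $T^a_t$ for the word $w_t$, and the details are as follows:
$$T^a_t = \sum_{k=1}^{N_w} \alpha_T^k T^1_k$$
$$\alpha_T^k = \frac{\exp(z_T^k)}{\sum_{j=1}^{N_w} \exp(z_T^j)}$$
$$z_T^k = \tanh(T^1_t W_{ta} (T^1_k)^{\top} + b_{ta})$$
where $\alpha_T^k$ is the weight of each preliminary predicted tag $T^1_k$, $z_T^k$ is the relatedness between the $t$-th preliminary predicted tag $T^1_t$ and the $k$-th preliminary predicted tag $T^1_k$, $W_{ta}$ is the weight matrix and $b_{ta}$ is the bias term.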
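The final normalized tag probability for word $w_t$ is based on the predicted tag vector $T^2_t$ from TLSTM2 and computed as follows:
$$p(O_t^i \mid s_j, \theta) = \frac{\exp(O_t^i)}{\sum_{n=1}^{N_t} \exp(O_t^n)} \quad (10)$$
where $O_t$ is the vector of tag scores derived from $T^2_t$, $p(O_t^i \mid s_j, \theta)$ is the probability of assigning the $i$-th tag to word $w_t$ in sentence $s_j$ when the parameters are $\theta$, and $N_t$ is the total number of tags.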

Training with Bias Objective Function
In one sentence, the number of "O" tags is much larger than the number of trigger tags. Thus, we devise a bias objective function $J(\theta)$ to reinforce the influence of trigger tags on the model, which is defined as follows:
$$J(\theta) = \sum_{j=1}^{N_{ts}} \sum_{t=1}^{N_w} \left( \log p(O_t^{y_t} \mid s_j, \theta) \cdot I(O) + \alpha \cdot \log p(O_t^{y_t} \mid s_j, \theta) \cdot (1 - I(O)) \right) \quad (11)$$
where $N_{ts}$ is the number of training sentences, $N_w$ is the length of sentence $s_j$, $p(O_t^{y_t} \mid s_j, \theta)$ is the normalized tag probability defined in Formula 10, $y_t$ is the golden tag of word $w_t$ in sentence $s_j$, and $\alpha$ is the bias weight: the larger $\alpha$ is, the greater the influence of trigger tags on the model. Besides, $I(O)$ is a switching function that distinguishes the loss of the tag "O" from that of trigger tags, which is defined as follows:
$$I(O) = \begin{cases} 1, & \text{if the golden tag is "O"} \\ 0, & \text{otherwise} \end{cases}$$
To compute the network parameters $\theta$, we maximize the log likelihood $J(\theta)$ through stochastic gradient descent over shuffled mini-batches with the Adadelta (Zeiler, 2012) update rule.
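The effect of the bias weight can be seen in a small sketch. It assumes that the log-probability of the golden tag is summed over words and that words whose golden tag is not "O" are scaled by the bias weight; the probabilities and the `B-Die` label are toy values, and `biased_log_likelihood` is a hypothetical helper.

```python
import math


def biased_log_likelihood(tag_probs, gold_tags, alpha):
    """Sum log p(gold tag) over words, scaling non-"O" (trigger) words by alpha."""
    total = 0.0
    for probs, gold in zip(tag_probs, gold_tags):
        log_p = math.log(probs[gold])
        total += log_p if gold == "O" else alpha * log_p
    return total


# One toy sentence with three words; the probabilities are made up.
tag_probs = [
    {"O": 0.9, "B-Die": 0.1},
    {"O": 0.5, "B-Die": 0.5},
    {"O": 0.8, "B-Die": 0.2},
]
gold_tags = ["O", "B-Die", "O"]
unbiased = biased_log_likelihood(tag_probs, gold_tags, alpha=1.0)
biased = biased_log_likelihood(tag_probs, gold_tags, alpha=5.0)
print(unbiased, biased)
```

With a bias weight of 1 the objective reduces to the ordinary log-likelihood; with a larger weight, an uncertain prediction on a trigger word lowers the objective more strongly than the same uncertainty on an "O" word.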

Dataset and Evaluation Metrics
We conduct experiments on the widely used ACE 2005 dataset. For a fair comparison, following previous works (Liao and Grishman, 2010; Chen et al., 2015; Nguyen et al., 2016; Liu et al., 2017), we use the same test set with 40 documents and the same development set with 30 documents, and the remaining 529 documents are used for training. As in previous work, we use Precision (P), Recall (R) and F measure (F1) as the evaluation metrics.
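The metrics can be computed as in the sketch below. It assumes the common convention that a predicted trigger counts as correct only when both its span and its event type match a golden trigger; the tuples are toy data and `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(predicted, gold):
    """Micro P/R/F1 over sets of (sentence_id, start, end, event_type) triggers."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data: three golden triggers, two predictions, one of them with the wrong type.
gold = {(0, 5, 6, "Die"), (0, 10, 11, "Attack"), (1, 4, 5, "End-Position")}
predicted = {(0, 5, 6, "Die"), (0, 10, 11, "End-Position")}
p, r, f1 = precision_recall_f1(predicted, gold)
print(p, r, f1)
```

Here one of two predictions is correct and one of three golden triggers is found, so precision is 1/2, recall is 1/3 and F1 is 0.4.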

Hyper-parameter Setting
Hyper-parameters are tuned on the development dataset by grid search. We train the word embeddings using the Skip-gram algorithm 4 on the NYT corpus 5. We set the dimension of word embeddings to 100, the dimension of the tag vector to 20, the size of all LSTMs (in the BiLSTM layer, TLSTM1 and TLSTM2) to 100, the bias parameter $\alpha$ in Formula 11 to 5, the batch size to 20, the learning rate to 0.001 and the dropout rate to 0.5.

Our Method vs. State-of-the-art Methods
We select the following state-of-the-art methods for comparison, which can be classified into two types: separate and collective methods.
Separate methods:
1) Li's MaxEnt: the method that detects events in one sentence separately by using human-designed features.
2) Liao's CrossEvent: the method that uses cross-event information (Liao and Grishman, 2010).
3) Hong's CrossEntity: the method that uses cross-entity information (Hong et al., 2011).
4) Chen's DMCNN: the dynamic multi-pooling convolutional neural networks method (Chen et al., 2015).
5) Chen's DMCNN+: the DMCNN method augmented with automatically labeled data.
6) Liu's FrameNet: the method that leverages FrameNet as extended training data to improve ED.
7) Liu's ANN-Aug: the method that uses the annotated argument information via a supervised attention to improve ED (Liu et al., 2017).
Collective methods:
1) Li's Structure: the method that collectively detects events by using human-designed features.
2) Yang's JointEE: the method that detects events and entities in one sentence jointly based on human-designed features (Yang and Mitchell, 2016).
3) Nguyen's JRNN: the method that exploits an RNN model to collectively detect events by only using sentence-level information (Nguyen et al., 2016).
4) Liu's PSL: the method that uses probabilistic soft logic to detect events by using human-designed features (Liu et al., 2016b).
Experimental results are shown in Table 1. From the table, we have the following observations: (1) Among all the methods, our HBTNGMA achieves the best performance. It improves on the F1 of the best collective method (Liu's PSL) by 3.9% and on the F1 of the best separate method (Liu's ANN-Aug) by 1.4%, although Liu's ANN-Aug uses FrameNet as an external resource. We also perform a t-test (p ≤ 0.05), which indicates that our method significantly outperforms all of the compared methods. (2) Our HBTNGMA achieves a better performance than the separate methods, which shows that collectively predicting multiple events in one sentence is effective. (3) Our HBTNGMA outperforms the feature-based collective methods (Li's Structure, Yang's JointEE and Liu's PSL), which shows that our automatically learned features can efficiently capture semantic information from plain texts. (4) Compared with Nguyen's JRNN, HBTNGMA gains a 4.0% improvement in F1. The reason is that Nguyen's JRNN only uses sentence-level information while our model exploits multi-level information, and our model can capture the interdependencies between the current event candidate and both its former and later predicted events simultaneously.
4 https://code.google.com/p/word2vec/
5 https://catalog.ldc.upenn.edu/LDC2008T19

Effect of the Hierarchical and Bias Tagging Networks
In this subsection, we examine the effectiveness of the hierarchical and bias tagging networks for collective ED. We select the following methods as baselines:
1) LSTM+Softmax: a simplified version of our HBTNGMA, which directly uses a softmax layer to detect events separately after we get the feature representation $xr_t$ of each word $w_t$.
2) LSTM+CRF: a method similar to our HBTNGMA, which uses a CRF layer to tag words instead of our Hierarchical TLSTM (HTLSTM) tagging layer.
3) LSTM+TLSTM: a method similar to our HBTNGMA, which only uses TLSTM1 and gives all tags the same influence in the training loss (i.e. $\alpha$ in Formula 11 is set to 1).
4) LSTM+HTLSTM: a method similar to our HBTNGMA, which uses an HTLSTM (TLSTM1+TLSTM2) and does not use the bias objective function.
LSTM+HTLSTM+Bias is our proposed HBTNGMA. Moreover, we divide the testing data into two parts according to the number of events in a sentence (single event and multiple events) and perform evaluations separately.
Surprisingly, LSTM+HTLSTM+Bias yields a 14.9% improvement over LSTM+Softmax on sentences that contain multiple events. This shows that the neural tagging schema is effective for the ED task, especially for sentences that contain multiple events. 2) LSTM+TLSTM achieves better performance than LSTM+CRF, and LSTM+HTLSTM achieves better performance than LSTM+TLSTM. These results show the effectiveness of the TLSTM layer and the HTLSTM layer. 3) Compared with LSTM+HTLSTM, LSTM+HTLSTM+Bias gains a 0.9% improvement on all sentences, which demonstrates the effectiveness of our proposed bias objective function.

Effect of The Gated Multi-level Attention
This subsection studies the effectiveness of our gated multi-level attention. We adopt the same architecture as our HBTNGMA, shown in Figure 2, with clues from different levels as baselines:
1) Word Only: the method only uses the word embedding $e_t$ to identify events.
2) Word+SA: uses sentence-level attention to capture important sentence-level information as additional clues.
3) Word+DA: uses document-level attention to capture important document-level information as additional clues.
4) Word+Average MA: uses both sentence-level and document-level attention to capture multi-level information and integrates them with an average gate (all dimensions of the fusion gate are set to 0.5), which is a special case of our proposed HBTNGMA.
Word+Gated MA is our proposed HBTNGMA model.
Results are shown in Table 3. From the results, we have the following observations: 1) Compared with Word Only, Word+SA achieves a better performance. We can make the same observation when comparing Word+DA with Word Only. This shows that both sentence-level and document-level information are helpful for the ED task. 2) Compared with Word+DA, Word+SA achieves a better performance. This shows that in most cases sentence-level information provides more clues than document-level information. 3) Word+Gated MA gains a 0.9% improvement over Word+Average MA. This demonstrates the effectiveness of our fusion gate in dynamically integrating clues from multiple levels.

Case Study
Interesting Cases: Our neural tagging schema can not only model the interdependency between multiple events in one sentence, as shown in Subsection 4.3, but the "BIO" tagging schema also inherently handles multi-word triggers. We conduct a statistical analysis on the experimental results and find that nearly 50% of the cases with multi-word triggers are solved by our model. An example is shown in Figure 3.
Early Saturday, more units were waiting in Kuwait to smash through any Iraqi resistance.
Trigger tags: B-Attack I-Attack
Figure 3: An example of a case solved by our model.
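Turning a predicted BIO sequence back into triggers can be sketched as follows. The example assumes that the two tags in the figure apply to the words "smash through"; `decode_triggers` is a hypothetical helper, not part of the paper's implementation.

```python
def decode_triggers(tokens, tags):
    """Group BIO tags into (trigger_text, event_type) pairs."""
    triggers = []
    current_words, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_words:
                triggers.append((" ".join(current_words), current_type))
            current_words, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_words and tag[2:] == current_type:
            current_words.append(token)
        else:
            # "O" or an inconsistent "I-" tag closes any open trigger.
            if current_words:
                triggers.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        triggers.append((" ".join(current_words), current_type))
    return triggers


tokens = "more units were waiting in Kuwait to smash through any Iraqi resistance".split()
tags = ["O"] * 7 + ["B-Attack", "I-Attack"] + ["O"] * 3
print(decode_triggers(tokens, tags))  # [('smash through', 'Attack')]
```

A word-by-word classifier would have to label "smash" and "through" independently, whereas the B/I distinction lets the two words be read as one trigger.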
Attention Visualization: Due to limited space, we take one sentence with a high sentence-level gate weight (example 1) and one sentence with a high document-level gate weight (example 2) as examples for attention visualization. As shown in Figure 4, in example 1, sentence-level information plays a more important role in disambiguating fired, and the words (tank, died and Baghdad) give us ample evidence to predict that fired triggers an Attack event. In contrast, document-level information plays a more important role in example 2. The surrounding sentence "this is ... tiresome." gives us more confidence to predict that leave triggers an End-Position event.
Example 2 (Figure 4): … this is when you were in the Senate -- "less and less information was new, fewer and fewer arguments were fresh, and the repetitiveness of the old arguments became tiresome." "I was becoming almost as cynical as my constituents. I knew it was time to leave." Isn't that a great argument for term limits?

Related Works
Event detection is an increasingly hot and challenging research topic in NLP. Generally, existing approaches could roughly be divided into two groups: separate and collective methods.
Separate methods: These methods regard multiple events in one sentence as independent and recognize them separately. They include feature-based methods, which exploit a diverse set of strategies to convert classification clues into feature vectors (Ahn, 2006; Ji and Grishman, 2008; Liao and Grishman, 2010; Hong et al., 2011; Huang and Riloff, 2012), and neural-based methods, which use neural networks to automatically capture clues from plain texts (Nguyen and Grishman, 2015; Feng et al., 2016; Chen et al., 2017; Duan et al., 2017; Liu et al., 2017). Although effective, these methods neglect event interdependency by predicting each event separately.
Collective methods: These methods try to model the event interdependency and detect multiple events in one sentence collectively. However, nearly all of these methods are feature-based (McClosky et al., 2011; Yang and Mitchell, 2016; Liu et al., 2016b); they rely on elaborately designed features and suffer from error propagation from existing NLP tools. Nguyen et al. (2016) exploit a neural-based method to detect multiple events collectively. However, they only use sentence-level information and neglect document-level clues, and they can only capture the interdependencies between the current event candidate and its former predicted events. Moreover, their method cannot handle multi-word triggers.

Conclusion
This paper proposes a novel framework for event detection, which can automatically extract and dynamically integrate sentence-level and document-level information and collectively detect multiple events in one sentence. Hierarchical and bias tagging networks are proposed to detect multiple events in one sentence collectively. A gated multi-level attention is devised to automatically extract and dynamically integrate contextual information. The experimental results on the widely used ACE 2005 dataset demonstrate the effectiveness of the proposed method.