Exploiting Contextual Information via Dynamic Memory Network for Event Detection

The task of event detection involves identifying and categorizing event triggers. Contextual information has been shown to be effective for the task. However, existing methods that utilize contextual information only process the context once. We argue that the context can be better exploited by processing it multiple times, allowing the model to perform complex reasoning and to generate a better context representation, thus improving the overall performance. Meanwhile, the dynamic memory network (DMN) has demonstrated promising capability in capturing contextual information and has been applied successfully to various tasks. In light of the multi-hop mechanism of the DMN for modeling context, we propose the trigger detection dynamic memory network (TD-DMN) to tackle the event detection problem. We performed five-fold cross-validation on the ACE-2005 dataset, and experimental results show that the multi-hop mechanism does improve the performance and that the proposed model achieves the best F1 score compared to the state-of-the-art methods.


Introduction
According to the ACE (Automatic Content Extraction) event extraction program, an event is identified by a word or a phrase called the event trigger, which most clearly expresses that event. For example, in the sentence "No major explosion we are aware of", an event trigger detection model should identify the word "explosion" as the event trigger and further categorize it as an Attack event. The ACE-2005 dataset also includes annotations for event arguments, a set of words or phrases that describe the event. However, in this work we do not tackle event argument classification and focus solely on event trigger detection.
The difficulty of the event trigger detection task lies in the complicated interaction between the event trigger candidate and its context. For instance, consider a sentence at the end of a passage: "they are going over there to do a mission they believe in and as we said, 250 left yesterday."
It is hard to directly classify the trigger word "left" as an "End-Position" event or a "Transport" event, because we are not certain what the number "250" and the pronoun "they" refer to. But if we see the sentence "we are seeing these soldiers head out", which is several sentences away from the former one, we now know that "250" and "they" refer to "the soldiers", and from the clue "these soldiers head out" we are more confident in classifying the trigger word "left" as a "Transport" event.
From the above, we can see that the event trigger detection task involves complex reasoning across the given context. Existing methods (Liu et al., 2017; Chen et al., 2015; Li et al., 2013; Nguyen et al., 2016; Venugopal et al., 2014) mainly exploited sentence-level features, while (Liao and Grishman, 2010; Zhao et al., 2018) proposed document-level models to utilize the context. The methods mentioned above either do not directly utilize the context or process it only once while classifying an event trigger candidate. We argue that processing the context multiple times, with later steps re-evaluating the context using information acquired in previous steps, improves model performance. Such a mechanism allows the model to perform complicated reasoning across the context. As in the example, we are more confident in classifying "left" as a "Transport" event if we already know that "250" and "they" refer to "soldiers".
We utilize the dynamic memory network (DMN) (Xiong et al., 2016; Kumar et al., 2016) to capture the contextual information of the given trigger word. It contains four modules: the input module for encoding the reference text where the answer clues reside, the memory module for storing knowledge acquired in previous steps, the question module for encoding the questions, and the answer module for generating answers given the outputs of the memory and question modules. The DMN was proposed for the question answering task; the event trigger detection problem, however, does not have an explicit question. The original DMN handles such a case by initializing the question vector produced by the question module with a zero or a bias vector, while we argue that each sentence in the document can be deemed a question. We propose the trigger detection dynamic memory network (TD-DMN) to incorporate this intuition: the question module of TD-DMN treats each sentence in the document as implicitly asking the question "What are the event types for the words in the sentence given the document context?" The high-level architecture of the TD-DMN model is illustrated in Figure 1.
We compared our results with two models, DMCNN (Chen et al., 2015) and DEEB-RNN (Zhao et al., 2018), through five-fold cross-validation on the ACE-2005 dataset. Our model achieves the best F1 score, and experimental results further show that processing the context multiple times and adding implicit questions do improve the model performance. The code of our model is available online.1

The Proposed Approach
We model the event trigger detection task as a multi-class classification problem, following existing work. In the rest of this section, we describe the four modules of the TD-DMN separately, along with how data is propagated through them. For simplicity, we discuss the single-document case. The detailed architecture of our model is illustrated in Figure 2.

1 https://github.com/AveryLiu/TD-DMN

Input Module
The input module further contains two layers: the sentence encoder layer and the fusion layer. The sentence encoder layer encodes each sentence into a vector independently, while the fusion layer gives these encoded vectors a chance to exchange information between sentences.
Sentence encoder layer Given document d with l sentences (s_1, . . . , s_l), let s_i denote the i-th sentence in d with n words (w_i1, . . . , w_in). For the j-th word w_ij in s_i, we concatenate its word embedding w_ij with its entity type embedding e_ij to form the vector W_ij as the input to the sentence encoder Bi-GRU (Cho et al., 2014) of size H_s. We obtain the hidden state h_ij by merging the forward and backward hidden states from the Bi-GRU:

h_ij = h_ij→ + h_ij←

where h_ij→ and h_ij→ denote the forward and backward hidden states and + denotes element-wise addition. We feed h_ij into a two-layer perceptron to generate the unnormalized attention scalar u_ij:

u_ij = tanh(h_ij W_s1) W_s2

where W_s1 and W_s2 are weight parameters of the perceptron and we omitted bias terms. u_ij is then normalized to obtain the scalar attention weight α_ij:

α_ij = exp(u_ij) / Σ_k exp(u_ik)

The sentence representation s_i is obtained by:

s_i = Σ_j α_ij h_ij

Fusion layer The fusion layer processes the encoded sentences and outputs fact vectors which contain exchanged information among sentences. Let s_i denote the i-th sentence representation obtained from the sentence encoder layer. We generate the fact vector f_i by merging the forward and backward states from the fusion GRU:

f_i = f_i→ + f_i←

Let H_f denote the hidden size of the fusion GRU. We concatenate the fact vectors f_1 to f_l to obtain the matrix F of size l by H_f, where the i-th row of F stores the i-th fact vector f_i, i.e., F = {f_1, . . . , f_l}.
The question module encodes sentence s_i into the question vector q*. The memory module initializes m_0 with q* and iterates t times; at each step k it produces a memory vector m_k using the fact matrix F, the question vector q*, and the previous memory state m_k−1. The answer module outputs the predicted trigger type for each word in s_i using the concatenation of the question module's hidden states and the last memory state m_t.
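The attention pooling in the sentence encoder can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the weight names, the toy shapes, and the softmax helper are ours.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def sentence_representation(h, W1, W2):
    """Attention-pool the word states h of shape (n, Hs) into one vector.

    u_j   = tanh(h_j W1) W2     unnormalized attention scalar per word
    alpha = softmax(u)          attention weights over the n words
    s     = sum_j alpha_j h_j   weighted sum of hidden states
    """
    u = np.tanh(h @ W1) @ W2        # (n, 1) unnormalized scores
    alpha = softmax(u.ravel())      # (n,) attention weights
    return alpha @ h                # (Hs,) sentence vector
```

When the perceptron outputs identical scores for every word, the weights become uniform and attention pooling degenerates to plain averaging of the word states.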

Question Module
The question module treats each sentence s in d as implicitly asking a question: What are the event types for each word in the sentence s given the document d as context? For simplicity, we only discuss the single sentence case. Iteratively processing from s 1 to s l will give us all encoded questions in document d.
Let W_ij be the vector representation of the j-th word in s_i. The question GRU generates the hidden state q_ij by:

q_ij = GRU_q(W_ij, q_i(j−1))

The question vector q* is obtained by averaging all hidden states of the question GRU:

q* = (1/n) Σ_j q_ij

Let H_q denote the hidden size of the question GRU; q* is a vector of size H_q. The intuition here is to obtain a single vector that represents the question sentence. q* is then passed to the memory module.
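A minimal sketch of the question module in NumPy. We assume a forward-only standard GRU (the direction of the question GRU is not stated here) and omit bias terms, as the paper does for its perceptrons; all names and toy shapes are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x, Wz, Uz, Wr, Ur, Wh, Uh):
    """One standard GRU step (Cho et al., 2014), biases omitted."""
    z = sigmoid(x @ Wz + h_prev @ Uz)              # update gate
    r = sigmoid(x @ Wr + h_prev @ Ur)              # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h_prev) @ Uh)  # candidate state
    return (1 - z) * h_prev + z * h_tilde

def question_vector(words, params, Hq):
    """Run the question GRU over the word vectors W_ij of one sentence
    and average its hidden states to produce q* of size Hq."""
    h = np.zeros(Hq)
    states = []
    for x in words:                  # words: (n, D) word representations
        h = gru_step(h, x, *params)
        states.append(h)
    return np.mean(states, axis=0)   # q*
```

Because each GRU state is a convex combination of the previous state and a tanh candidate, every hidden state, and hence their average q*, stays strictly inside (-1, 1).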

Memory Module
The memory module has three components: the attention gate, the attentional GRU (Xiong et al., 2016) and the memory update gate. The attention gate determines how much the memory module should attend to each fact given the facts F , the question q * , and the acquired knowledge stored in the memory vector m t−1 from the previous step.
The three inputs are transformed by:

u = [F * q*; |F − q*|; F * m_t−1; |F − m_t−1|] (8)

where ; denotes concatenation, and *, − and |·| are element-wise product, subtraction and absolute value, respectively. F is a matrix of size (l, H_f), while q* and m_t−1 are vectors of size (1, H_q) and (1, H_m), where H_m is the output size of the memory update gate. To allow element-wise operations, H_f, H_q and H_m are set to the same value H. Meanwhile, q* and m_t−1 are broadcast to the size of (l, H).
In equation 8, the first two terms measure the similarity and difference between the facts and the question. The last two terms have the same functionality for the facts and the last memory state. Let β of size l denote the generated attention vector; the i-th element of β is the attention weight for fact f_i. β is obtained by transforming u with a two-layer perceptron:

β = softmax(tanh(u · W_m1) · W_m2) (9)

where W_m1 and W_m2 are the parameters of the perceptron and we omitted bias terms.
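The attention gate of equations 8 and 9 can be sketched as follows. The toy dimensions below scale down the reported hyperparameters (W_m1 of size 4H by 2H, W_m2 of size 2H by 1 for H = 300); variable names are ours.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fact_attention(F, q, m, Wm1, Wm2):
    """Attention weights over facts given question q and memory m.

    F: (l, H) fact matrix; q, m: (H,) vectors broadcast to every fact row.
    Builds u = [F*q; |F-q|; F*m; |F-m|] of shape (l, 4H), then applies a
    two-layer perceptron followed by a softmax over the l facts.
    """
    Q = np.broadcast_to(q, F.shape)
    M = np.broadcast_to(m, F.shape)
    u = np.concatenate([F * Q, np.abs(F - Q), F * M, np.abs(F - M)], axis=1)
    return softmax((np.tanh(u @ Wm1) @ Wm2).ravel())   # beta: (l,)
```

The softmax guarantees that β is a proper distribution over the l facts, so it can directly drive the update gate of the attentional GRU.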
The attentional GRU takes the facts F and the fact attention β as input and produces the context vector c of size H_c. At each time step t, the attentional GRU takes fact f_t as input and uses β_t as its update gate weight. Due to space limitations, we refer the reader to (Xiong et al., 2016) for the detailed computation.
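A sketch of the attentional GRU, following Xiong et al. (2016): the update gate of a standard GRU is replaced by the scalar attention weight β_t, so a fact with β_t near zero leaves the episode state unchanged. Names and shapes are illustrative, biases omitted.

```python
import numpy as np

def attentional_gru_step(c_prev, f_t, beta_t, Wr, Ur, Wh, Uh):
    """One attentional GRU step: beta_t replaces the learned update gate."""
    r = 1.0 / (1.0 + np.exp(-(f_t @ Wr + c_prev @ Ur)))  # reset gate
    h_tilde = np.tanh(f_t @ Wh + (r * c_prev) @ Uh)      # candidate state
    return beta_t * h_tilde + (1.0 - beta_t) * c_prev

def episode(F, beta, params, Hc):
    """Run the attentional GRU over all facts; the final state is the
    context vector c of size Hc."""
    c = np.zeros(Hc)
    for f_t, b_t in zip(F, beta):
        c = attentional_gru_step(c, f_t, b_t, *params)
    return c
```

With β set to all zeros every fact is ignored and the context vector stays at its initial zero state, which is exactly the gating behavior the attention weights are meant to provide.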
The memory update gate outputs the updated memory m_t using the question q*, the previous memory state m_t−1 and the context c:

m_t = ReLU(W_u [m_t−1; c; q*]) (10)

where W_u is the parameter of the linear layer. The memory module can be iterated several times, with a new β generated each time. This allows the model to attend to different parts of the facts in different iterations, which enables the model to perform complicated reasoning across sentences. The memory module outputs the memory m_t of the last iteration.
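A sketch of the memory update, assuming the ReLU-of-linear form of Xiong et al. (2016); the reported 900-by-300 shape of W_u matches the concatenation of three H = 300 vectors. Names and toy shapes are illustrative.

```python
import numpy as np

def update_memory(m_prev, c, q, Wu):
    """m_t = ReLU(W_u [m_{t-1}; c; q*]): a linear layer over the
    concatenated previous memory, context, and question vectors."""
    z = np.concatenate([m_prev, c, q])   # (3H,) concatenated input
    return np.maximum(0.0, z @ Wu)       # (H,) updated memory
```

Feeding all three vectors into the update lets each memory hop condition on what was already stored, on the newly attended context, and on the implicit question at once.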

Answer Module
The answer module predicts the event type for each word in a sentence. For each question GRU hidden state q_ij, the answer module concatenates it with the memory vector m_t as the input to the answer GRU with hidden size H_a. The answer GRU outputs a_ij by merging its forward and backward hidden states. A fully connected dense layer then transforms a_ij to the size of the number of event labels O, and a softmax layer outputs the probability vector p_ij. The k-th element of p_ij is the probability of word w_ij being the k-th event type. Let y_ij be the true event type label of word w_ij. Assuming all sentences are padded to the same length n, the cross-entropy loss for a single document d is:

J = − Σ_i Σ_j Σ_k I(y_ij = k) log p_ij^k

where I(·) is the indicator function and i, j, k range over sentences, words and event labels, respectively.
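The document loss can be sketched as a cross-entropy over all word positions. The indicator I(y_ij = k) amounts to picking out the gold-label probability p_ij[y_ij] at each position; the padding mask is our addition to make explicit that padded positions should not contribute.

```python
import numpy as np

def document_loss(p, y, mask):
    """Cross-entropy over all word positions of one padded document.

    p:    (l, n, O) per-word event-type probability vectors
    y:    (l, n)    gold event-type labels
    mask: (l, n)    1.0 for real words, 0.0 for padding
    """
    l, n, _ = p.shape
    # Advanced indexing selects p[i, j, y[i, j]] for every position (i, j).
    gold = p[np.arange(l)[:, None], np.arange(n)[None, :], y]
    return -(mask * np.log(gold)).sum()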

Dataset and Experimental Setup
Dataset Different from prior work, we performed five-fold cross-validation on the ACE-2005 dataset. We partitioned the 599 files into 5 parts; the file names of each fold can be found online 3. We chose a different fold each time as the testing set and used the remaining four folds as the training set.

Baselines
We compared our model with two other models: DMCNN (Chen et al., 2015) and DEEB-RNN (Zhao et al., 2018). DMCNN is a sentence-level event detection model that enhances traditional convolutional networks with a dynamic multi-pooling mechanism customized for the task. DEEB-RNN is a state-of-the-art document-level event detection model that first generates a document embedding and then uses it to aid the event detection task.

Evaluation
We report the precision, recall and F1 score of each fold, along with the averaged F1 score over all folds. We evaluated all the candidate trigger words in each testing set. A candidate trigger word is correctly classified if its event subtype and offsets match its human-annotated label.
Implementation Details We split each document into separate sentences. We down-sampled negative samples to ease the class-imbalance problem.
The setting of the hyperparameters is the same for different hops of the TD-DMN model. We set H, H_s, H_c, and H_a to 300, the entity type embedding size to 50, W_s1 to 300 by 600, W_s2 to 600 by 1, W_m1 to 1200 by 600, W_m2 to 600 by 1, W_u to 900 by 300, and the batch size to 10 (each batch contains 10 documents). We set the down-sampling ratio to 9.5 and used the Adam optimizer (Kingma and Ba, 2014) with weight decay set to 1e-5. We set the dropout (Srivastava et al., 2014) rate before the answer GRU to 0.4 and all other dropout rates to 0.2. We used the pre-trained word embeddings from (Le and Mikolov, 2014).

Results on the ACE 2005 Corpus
The performance of each model is listed in Table 1. The first observation is that models using document context drastically outperform the model that only uses sentence-level features, which indicates that document context is helpful for the event detection task. The second observation is that increasing the number of hops improves model performance, which further implies that processing the context multiple times better exploits the context. The performance drop at the fourth hop suggests that the model may have exhausted the context and started to overfit.
The performance of the reference models is much lower than that reported in their original papers. Possible reasons are that we partitioned the dataset randomly and performed five-fold cross-validation, while the testing set of the original partition mainly contains similar types of documents.

The Impact of the Question Module
To reveal the effects of the question module, we ran the model in two different settings. In the first setting, we initialized the memory vector m_0 and the question vector q* with a zero vector; in the second setting, we ran the model untouched. The results are listed in Table 2. The two models perform comparably under the 1-hop setting, which implies that the model is unable to distinguish between the two initializations of the question vector in the 1-hop setting. For higher numbers of hops, the untouched model outperforms the modified one. This indicates that with a higher number of memory iterations, the question vector q* helps the model better exploit the context information.
We still observe the increase-then-drop pattern of the F1 score for the untouched model. However, such a pattern is not obvious with empty questions. This implies that we are unable to obtain a steady gain without the question module in this specific task.

Future Work
In this work, we explored the TD-DMN architecture to exploit document context. Extending the model to include wider contexts across several similar documents may also be of interest. The detected event trigger information can be incorporated into the question module when extending TD-DMN to the argument classification problem. Other tasks that have document context but lack explicit questions may also benefit from this work.

Conclusion
In this paper, we proposed the TD-DMN model, which utilizes the multi-hop mechanism of the dynamic memory network to better capture contextual information for the event trigger detection task. We cast event trigger detection as a question answering problem. We carried out five-fold cross-validation experiments on the ACE-2005 dataset, and the results show that the multi-hop mechanism does improve the model performance; our model achieves the best F1 score compared to the state-of-the-art models.