Event Detection with Neural Networks: A Rigorous Empirical Evaluation

Detecting events and classifying them into predefined types is an important step in knowledge extraction from natural language texts. While neural network models have generally led the state of the art, the differences in performance between architectures have not been rigorously studied. In this paper we present a novel GRU-based model that combines syntactic information with temporal structure through an attention mechanism. We show that it is competitive with other neural network architectures through empirical evaluations under different random initializations and different training-validation-test splits of the ACE2005 dataset.


Introduction
Events are the lingua franca of news stories and narratives and describe important changes of state in the world. Identifying events and classifying them into different types is a challenging aspect of understanding text. This paper focuses on the task of event detection, which includes identifying the "trigger" words that indicate events and classifying the events into refined types. Event detection is the necessary first step in inferring more semantic information about the events including extracting the arguments of events and recognizing temporal and causal relationships between different events.
Neural network models have been the most successful methods for event detection. However, most current models ignore the syntactic relationships in the text. One of the main contributions of our work is a new DAG-GRU architecture (Chung et al., 2014) that captures the context and syntactic information through a bidirectional reading of the text with dependency parse relationships. This generalizes the GRU model to operate on a graph by novel use of an attention mechanism.
1 Also associated with Oregon State University.
Following the long history of prior work on event detection, ACE2005 is used both for the precise definition of the task and as the data for evaluation. One of the challenges of the task is the size and sparsity of this dataset. It consists of 599 documents, which are broken into training, development, and test sets of 529, 30, and 40 documents respectively. This split has become a de facto evaluation standard since (Li et al., 2013). Furthermore, the test set is small and consists only of newswire documents, even though ACE2005 spans multiple domains. These two factors lead to a significant difference between the training and test event type distributions. Though some work has been done comparing methods across domains (Nguyen and Grishman, 2015), variation in training/test splits that include all the domains has not been studied. We evaluate the sensitivity of model accuracy to changes in the training and test split through a randomized study.
Given the limited amount of training data in comparison to other datasets used by neural network models, and the narrow margin between many high performance methods, the effect of the initialization of these methods needs to be considered. In this paper, we conduct an empirical study of the sensitivity of the system performance to the model initialization.
Results show that our DAG-GRU method is competitive with other state-of-the-art methods. However, the performance of all methods is more sensitive to the random model initialization than expected. Importantly, the ranking of different methods based on the performance on the standard training-validation-test split is sometimes different from the ranking based on the average over multiple splits, suggesting that the community should move away from single split evaluations.

Related Work
Event detection and extraction are well-studied tasks with a long history of research.
Nguyen and Grishman (2015) used CNNs to represent windows around candidate triggers. Each word is represented by the concatenation of its word and entity type embeddings with its distance to the candidate trigger. Global max-pooling summarizes the CNN filter outputs, and the result is passed to a linear classifier.
Nguyen and Grishman (2016) followed up with a skip-gram based CNN model, which allows the filter to skip non-salient or otherwise unnecessary words in the middle of word sequences. Feng et al. (2016) combined a CNN, similar to (Nguyen and Grishman, 2015), with a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) to create a hybrid network. The outputs of both networks were concatenated together and fed to a linear model for final predictions. Recently, Nguyen and Grishman (2018) used a graph CNN (GCNN) in which the convolutional filters are applied to syntactically dependent words in addition to consecutive words. The addition of entity information into the network structure produced the state-of-the-art CNN model. Another neural network model that includes syntactic dependency relationships is the DAG-based LSTM (Qian et al., 2018). It combines the syntactic hidden vectors by weighted average and adds them through a dependency gate to the output gate of the LSTM model. To the best of our knowledge, none of the neural models combine syntactic information with attention, which motivates our research.

DAG-GRU Model
Event detection is often framed as a multi-class classification problem (Chen et al., 2015; Ghaeini et al., 2016). The task is to predict the event label for each word in the test documents, with NIL for words that are not event triggers. A sentence is a sequence of words x_1 ... x_n, where each word is represented by a k-dimensional vector. The standard GRU model produces a hidden vector h_t for each word x_t by combining its representation with the previous hidden vector; thus h_t summarizes both the word and its prior context.
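As a reference point, the standard GRU recurrence can be sketched as follows (a minimal NumPy version; biases are omitted for brevity, and the parameter names are illustrative rather than taken from any particular implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, params):
    """One GRU step: combine the word vector x_t with the previous hidden state."""
    Wz, Uz, Wr, Ur, Wh, Uh = (params[k] for k in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh"))
    z = sigmoid(Wz @ x_t + Uz @ h_prev)               # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)               # reset gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev))   # candidate state
    return (1 - z) * h_prev + z * h_tilde             # new hidden state h_t

def gru(xs, params, hidden_size):
    """Run the GRU over a sentence x_1 ... x_n; hs[t] summarizes word t and its prior context."""
    h = np.zeros(hidden_size)
    hs = []
    for x_t in xs:
        h = gru_cell(x_t, h, params)
        hs.append(h)
    return hs
```

The DAG-GRU model described next keeps this cell unchanged and only alters which previous hidden vectors are combined to form its recurrent input.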
Our proposed DAG-GRU model incorporates syntactic information through dependency parse relationships and is similar in spirit to (Nguyen and Grishman, 2018) and (Qian et al., 2018). However, unlike those methods, DAG-GRU uses attention to combine syntactic and temporal information. Rather than using an additional gate as in (Qian et al., 2018), DAG-GRU creates a single combined representation over previous hidden vectors and then applies the standard GRU model. Each relationship is represented as an edge (t, t', e) between the words at indices t and t' with an edge type e. The standard GRU edges are included as (t, t-1, temporal).
A dependency relationship may hold between any two words, which could produce a graph with cycles. However, back-propagation through time (Mozer, 1995) requires a directed acyclic graph (DAG). Hence the sentence graph, consisting of the temporal and dependency edges E, is split into two DAGs: a "forward" DAG G_f that consists only of edges (t, t', e) where t' < t, and a corresponding "backward" DAG G_b where t' > t. The dependency relation between t and t' also encodes the parent-child orientation, e.g., nsubj-parent or nsubj-child for an nsubj (subject) relation.
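This graph construction amounts to a simple partition of the edge list. The sketch below is a hypothetical helper; the edge labels and the convention that an edge (t, t', e) carries information from t' into t are illustrative:

```python
def split_into_dags(n, dep_edges):
    """Split a sentence graph into forward and backward DAGs.

    n: number of words in the sentence.
    dep_edges: list of (t, t_prime, label) dependency edges between word
    indices, where `label` already includes the parent/child orientation,
    e.g. "nsubj-parent" or "nsubj-child".
    """
    # Standard GRU (temporal) edges connect each word to its predecessor.
    temporal = [(t, t - 1, "temporal") for t in range(1, n)]
    edges = temporal + list(dep_edges)
    # Forward DAG G_f: edges arriving from earlier words only (t' < t).
    g_forward = [(t, tp, e) for (t, tp, e) in edges if tp < t]
    # Backward DAG G_b: edges arriving from later words only (t' > t).
    g_backward = [(t, tp, e) for (t, tp, e) in edges if tp > t]
    return g_forward, g_backward
```

Because every edge in each DAG points strictly in one direction over word indices, processing words in index order (or reverse order) guarantees that all incoming hidden vectors are available when a node is visited.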
An attention mechanism is used to combine the multiple hidden vectors. A matrix D_t is formed at each word x_t by collecting and transforming all the previous hidden vectors coming into node t, one per incoming edge. The attention α gives a distribution weighting the importance of the edges.
Figure 1: The hidden state of "hacked" in the sentence "At least three members of a family in India's northeastern state of Tripura were hacked to death by a tribal mob" is a combination of previous output vectors. In this case, three vectors are aggregated with DAG-GRU's attention model. The hidden vector of "members" is included in the input to the attention model since it is accessible through the nsubj dependency edge; the preceding word's hidden vector is included twice because it is connected through both a narrative edge and a dependency edge of type auxpass. The input matrix is non-linearly transformed by U_a and tanh. Next, w_a determines the importance of each vector in D_t. Finally, the attention a_t is produced by tanh followed by softmax, then applied to D_t. The subject "members" would be distant under a standard RNN model, but the DAG-GRU model can focus on this important connection via dependency edges and attention.
Finally, the combined hidden vector h_a is created by summing the rows of D_t weighted by the attention.
However, having a separate set of parameters U_e for each edge type e is over-specific for small datasets. Instead, a shared set of parameters U_a is used in conjunction with an edge embedding v_e.
The edge type embedding v_e is concatenated with the incoming hidden vector and then transformed by the shared weights U_a. This limits the number of parameters while still flexibly weighting the different edge types. The combined hidden vector h_a is then used in place of h_{t-1} in the GRU equations.
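Putting the attention mechanism together, a minimal NumPy sketch is given below. It follows the computation described for Figure 1 (rows of D_t from tanh(U_a [h; v_e]), scores from w_a, softmax weighting); the parameter shapes and function names are assumptions for illustration:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())   # subtract the max for numerical stability
    return z / z.sum()

def combine_incoming(incoming, U_a, w_a, edge_emb):
    """Attention over the hidden vectors arriving at one node.

    incoming: list of (h_prev, edge_type) pairs for the edges into node t.
    edge_emb: dict mapping an edge type to its embedding vector v_e.
    Returns h_a, the combined hidden vector fed to the GRU cell.
    """
    # Row i of D_t: tanh(U_a [h_{t'}; v_e]) for incoming edge i.
    D = np.stack([np.tanh(U_a @ np.concatenate([h, edge_emb[e]]))
                  for h, e in incoming])
    scores = np.tanh(D @ w_a)   # importance of each row (w_a)
    alpha = softmax(scores)     # attention distribution over edges
    return alpha @ D            # weighted sum of rows -> h_a
```

Because U_a and w_a are shared across edge types, only the low-dimensional embeddings v_e grow with the edge-type inventory, which matches the parameter-sharing motivation above.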
The model is run forward and backward, and the outputs are concatenated, h_{c,t} = [h_{f,t}; h_{b,t}], for a representation that includes the entire sentence's context and dependency relations. After applying dropout (Srivastava et al., 2014) at a rate of 0.5 to h_{c,t}, a linear model with softmax is used to make a prediction for each word at index t.

Experiments
We use the ACE2005 dataset for evaluation. Each word in each document is marked with one of the thirty-three event types or NIL for non-triggers. Several high-performance models were reproduced for comparison. Each is a good-faith reproduction of the original, with some adjustments to level the playing field.
For word embeddings, ELMo was used to generate a fixed representation for every word in ACE2005 (Peters et al., 2018). The three vectors produced per word were concatenated into a single representation. We did not use entity type embeddings for any method. The models were trained to minimize the cross-entropy loss using Adam (Kingma and Ba, 2014) with L2 regularization set to 0.0001. The learning rate was halved every five epochs, starting from 0.0005, for a maximum of 30 epochs or until convergence as determined by the F1 score on the development set.
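The word representation and learning rate schedule above can be sketched as follows (a minimal illustration; the 1024-dimensional ELMo layer size is an assumption, and the schedule treats epochs as 0-indexed):

```python
import numpy as np

def word_representation(elmo_layers):
    """Concatenate the three ELMo layer vectors into one fixed representation.

    With the usual 1024-dimensional layers (an assumption here), this
    yields a 3072-dimensional vector per word.
    """
    assert len(elmo_layers) == 3
    return np.concatenate(elmo_layers)

def learning_rate(epoch, base_lr=5e-4):
    """Learning rate halved every five epochs, starting from 5e-4."""
    return base_lr * 0.5 ** (epoch // 5)
```

Under this schedule the rate is 5e-4 for epochs 0-4, 2.5e-4 for epochs 5-9, and so on, down to the 30-epoch cap.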
The same training method and word embeddings were used across all the methods. Based on preliminary experiments, these settings resulted in better performance than those originally specified. Notably, however, both the GRU (Nguyen et al., 2016) and DAG-LSTM (Qian et al., 2018) models were not used as joint models. Further, the GRU implementation did not use a memory network; instead, the final vectors from the forward and backward passes were concatenated to each timestep's output for additional context. For the CNN (Nguyen and Grishman, 2015), the number of filters was reduced to 50 per filter size. The CNN+LSTM (Feng et al., 2016) hybrid was reproduced with the same embeddings and training regime.

Effects of Random Initialization
Given that ACE2005 is small by the standards of neural network models, the effect of the random initialization of these models needs to be studied. Although some methods include tests of significance, the type of statistical test is often not reported. Simple statistical significance tests, such as the t-test, are not applicable to a single F1 score; instead, the average of F1 scores should be tested (Wang et al., 2015).
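For concreteness, a sketch of such a test over per-seed F1 scores is given below. This implements Welch's unequal-variance form of the t statistic; whether the pooled-variance or Welch variant was used in prior work is typically unreported:

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two samples
    of per-seed F1 scores (one list per model)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

The resulting statistic is compared against the t distribution with df degrees of freedom to obtain a p-value; with 20 seeds per model, the samples are small enough that this matters.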
We reproduced and evaluated five different systems with different initializations to empirically assess the effect of initialization. The experiments were done on the standard ACE2005 split, and the aggregated results over 20 random seeds are given in Table 1. The random initializations of the models had a significant impact on their performance. The variation was large enough that the observed ranges of the F1 scores overlapped across almost all the models. However, the differences in the average performance of the different methods, except between CNN and DAG-LSTM, were significant at p < 0.05 according to the t-test, without controlling for multiple hypotheses.
Both the GRU (Nguyen et al., 2016) and CNN (Nguyen and Grishman, 2015) models perform well, with their best scores close to the reported values. The CNN+LSTM model's results were significantly lower than the published values, though this method also has the highest variation. It is possible that some unknown factor, such as the preprocessing of the data, significantly impacted the results, or that the published value is an outlier. Likewise, the DAG-LSTM model underperformed. However, its published results were based on a joint event and argument extraction model and probably benefited from the additional entity and argument information.
DAG-GRU A consistently and significantly outperforms the other methods in this comparison. The best observed F1 score for DAG-GRU, 71.1%, is close to the published state-of-the-art scores of DAG-LSTM and GCNN, at 71.9% and 71.4% respectively. With additional entity information, GCNN achieves a score of 73.1%. Also, the attention mechanism used in DAG-GRU A shows a significant improvement over the averaging method of DAG-GRU B. This indicates that some syntactic links are more useful than others and that the weighting that attention applies is necessary to exploit the syntactic information.
Another source of variation was the distributional difference between the development and test sets. The test set includes only newswire articles, whereas the training and dev. sets also contain informal writing such as web log (WL) documents. The two sets have different proportions of event types, and each model saw at least a 2% drop in performance between dev. and test on average; at worst, the DAG-LSTM model dropped by 5.26%. This is a problem for model selection, since the dev. score is used to choose the best model, hyperparameters, or random initialization. The distributional differences mean that methods which outperform others on the dev. set do not necessarily perform as well on the test set. For example, DAG-GRU A performs worse than DAG-GRU B on the dev. set, yet achieves a higher mean score on the test set.
One method of model selection over random initializations is to train the model k times and pick the best one based on the dev. score. Repeating this model selection procedure many times for each model is prohibitively expensive, so the experiment was approximated by bootstrapping the twenty samples per model (Efron, 1992). For each model, 5 dev. & test score pairs were sampled with replacement from the twenty available pairs; the initialization with the best dev. score was selected and the corresponding test score was taken. This model selection process of picking the best of 5 random samples was repeated 1000 times, and the results are shown in Table 2. The process did not substantially increase average performance beyond the results in Table 1, although it did reduce the variance, except for the CNN model. It appears that using the dev. score for model selection is only marginally helpful.
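The bootstrap approximation of best-of-5 model selection can be sketched as follows (illustrative; the function name and fixed seed are assumptions):

```python
import random

def best_of_k(pairs, k=5, trials=1000, seed=0):
    """Bootstrap best-of-k model selection.

    pairs: list of (dev_f1, test_f1) pairs, one per random initialization.
    Each trial samples k pairs with replacement, selects the pair with the
    best dev. score, and records its test score. Returns the mean selected
    test score over all trials.
    """
    rng = random.Random(seed)
    selected = []
    for _ in range(trials):
        sample = [rng.choice(pairs) for _ in range(k)]
        dev_best = max(sample, key=lambda p: p[0])  # pick by dev. score
        selected.append(dev_best[1])                # keep its test score
    return sum(selected) / len(selected)
```

If dev. and test scores were perfectly correlated, this procedure would reliably lift the expected test score; the weak gains reported in Table 2 reflect how loosely the two sets track each other.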

Randomized Splits
In order to explore the effect of the training/testing split popularized by (Li et al., 2013), a randomized cross-validation experiment was conducted. From the set of 599 documents in ACE2005, 10 random splits were created, maintaining the same 529, 30, and 40 document counts for training, development, and testing, respectively. This procedure isolates the effect of the standard split, since it maintains the same data proportions while varying only the split. The results of the experiment are found in Table 3. The effect of the split is substantial. Almost all models' performance dropped, except for DAG-LSTM's, and the variance increased across all models. In the worst case, the standard deviation increased threefold, from 0.86% to 2.60%, for the GRU model. In fact, the increased variation across splits means that the confidence intervals for all the models overlap. This aligns with cross-domain analysis: some domains, such as WL, are known to be much more difficult than the newswire domain, which comprises all of the test data under the standard split (Nguyen and Grishman, 2015). Further, the difference in splits also negates the benefits of the attention mechanism in DAG-GRU A. This is likely due to the test partitions' inclusion of WL and other kinds of informal writing: syntactic links are much more likely to be noisy for informal writing, reducing the syntactic information's usefulness and reliability. All these sources of variation are greater than most advances in event detection, so quantifying and reporting this variation is essential when assessing model performance. Further, understanding this variation is important for reproducibility and is necessary for making valid claims about a model's relative effectiveness.
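The randomized splitting procedure amounts to shuffling document identifiers (a sketch; documents are represented here simply by their indices):

```python
import random

def random_split(doc_ids, seed):
    """One random 529/30/40 split of the 599 ACE2005 documents."""
    rng = random.Random(seed)
    docs = list(doc_ids)
    rng.shuffle(docs)
    # Same set sizes as the standard split, but a random assignment.
    return docs[:529], docs[529:559], docs[559:599]

# Ten splits, one per seed, as in the experiment above.
splits = [random_split(range(599), seed) for seed in range(10)]
```

Keeping the set sizes fixed ensures that any change in scores is attributable to which documents land in each partition, not to how much data each partition holds.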

Conclusions
We introduced and evaluated a DAG-GRU model along with four previous models in two different settings, the standard ACE2005 split with multiple random initializations and the same dataset with multiple random splits. These experiments demonstrate that our model, which utilizes syntactic information through an attention mechanism, is competitive with the state-of-the-art. Further, they show that there are several significant sources of variation which had not been previously studied and quantified. Studying and mitigating this variation could be of significant value by itself. At a minimum, it suggests that the community should move away from evaluations based on single random initializations and single training-test splits.