Event Detection with Multi-Order Graph Convolution and Aggregated Attention

Syntactic relations are widely used in many NLP tasks. For event detection, syntactic relation representations based on dependency trees capture the interrelations between candidate trigger words and related entities better than sentence representations do. However, existing studies use only first-order syntactic relations (i.e., the arcs) in dependency trees to identify trigger words. This paper therefore proposes a new method for event detection, which uses a dependency-tree-based graph convolution network with aggregated attention to explicitly model and aggregate multi-order syntactic representations in sentences. Experimental comparison with state-of-the-art baselines shows the superiority of the proposed method.


Introduction
As an important information extraction task, event detection aims to find event mentions of specific event types in given texts. Each event mention is identified by a word or phrase called the event trigger, which serves as the main word(s) of the corresponding event. For example, in the sentence presented in Figure 1, an event detection system should be able to recognize the trigger word "fired" corresponding to the event type "Attack".
Among existing methods for event detection, sequence-based ones (e.g., Chen et al. (2015); Nguyen et al. (2016)) use only the given sentences and are therefore inefficient at capturing long-range dependencies. In contrast, dependency-tree-based methods use the syntactic relations (i.e., arcs) in the dependency tree of a given sentence to capture more effectively the interrelation between each candidate trigger word and its related entities or other triggers. For example, Nguyen and Grishman (2018) and others treat each dependency tree as a graph and adopt a Graph Convolution Network (GCN) (Kipf and Welling, 2016) to extract event triggers.
The syntactic relations between trigger words and related entities may be of first order, embodied as direct arcs in the dependency tree. They may also be of high order (i.e., paths of more than one hop in the dependency tree). In particular, according to our statistics on the benchmark ACE-2005 dataset, about 51% (4977/9793) of event-related entities need more than one hop to reach the corresponding trigger words and related entities in their dependency trees. For example, in Figure 1, at least 4 hops (i.e., "fired"-"evidence"-"blood"-"soldiers") are needed to figure out that the word "fired" means "shot" rather than "dismissed". However, the above dependency-tree-based methods explicitly use only first-order syntactic relations, although they may implicitly capture high-order syntactic relations by stacking more GCN layers. Stacking has a cost: as the number of GCN layers increases, the representations of neighboring words in the dependency tree become more and more similar, since they are all calculated from those of their neighbors in the dependency tree. This is the so-called over-smoothing problem (Zhou et al., 2018), which damages the diversity of the representations of neighboring words.
To overcome this problem, this paper proposes a Multi-Order Graph Attention Network based method for Event Detection, called MOGANED. It uses both the first-order syntactic graph and high-order syntactic graphs to explicitly model multi-order representations of candidate trigger words. To calculate multi-order representations for each word, we apply the Graph Attention Network (GAT) proposed by Veličković et al. (2017) to weight the importance of its neighboring words in the syntactic graphs of different orders. We then employ an attention aggregation mechanism to merge its multi-order representations. Finally, through experimental comparison with state-of-the-art baselines, we show the superiority of the proposed method in terms of both precision and $F_1$-measure. It should be mentioned that, to the best of our knowledge, this is the first study applying GAT to event detection.

The Proposed Model
Following existing studies, we regard event detection as a multi-class classification problem. Let $W = w_1, w_2, \ldots, w_n$ be a sentence of length $n$, where $w_i$ is its $i$-th token. Since event triggers may contain multiple words, we adopt the "BIO" schema for annotation. The number of labels is thus $2L + 1$, where $L$ is the number of predefined event types.
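The label count above can be checked with a small sketch. It assumes one "B-" and one "I-" label per event type plus a single shared "O" label; the function name `bio_labels` and the three event types are illustrative only, not the full set used in the experiments.

```python
def bio_labels(event_types):
    """Build a BIO label set: one shared "O" plus B-/I- labels per event type."""
    labels = ["O"]
    for event_type in event_types:
        labels.append("B-" + event_type)  # first token of a trigger
        labels.append("I-" + event_type)  # later tokens of the same trigger
    return labels


# Illustrative subset of event types (assumption, not the full inventory).
event_types = ["Attack", "Transport", "Die"]
labels = bio_labels(event_types)
num_labels = len(labels)  # 2 * L + 1 with L = len(event_types)
```

With $L$ event types, the list always has $2L + 1$ entries, which is the size of the output layer of the classifier.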
The proposed model contains three modules: (i) word encoding, which encodes the input sentence into a sequence of vectors; (ii) a multi-order graph attention network, which performs graph attention convolution over multi-order syntactic graphs; (iii) attention aggregation, which aggregates the multi-order representations of each word with different attention weights to predict its label.

Word Encoding
The word encoding module first transforms each input token $w_i$ into a comprehensive embedding vector $x_i$ by concatenating its word embedding $word_i$, entity type embedding $et_i$, POS-tagging embedding $pos_i$ and position embedding $ps_i$ (Nguyen et al., 2016; Nguyen and Grishman, 2018), where $word_i$ is obtained by looking up a word embedding table pre-trained on a large corpus and the others are randomly initialized.
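The concatenation can be sketched as follows. This is a minimal illustration, not the authors' code: the table sizes, the ids passed to `encode_token`, and the random initialization range are assumptions, and the random `word_table` merely stands in for a pre-trained table. The dimensions (100 for words, 50 for the other embeddings) are the ones reported in the experiment settings.

```python
import random

WORD_DIM, OTHER_DIM = 100, 50  # dimensions reported in the experiment settings


def make_table(num_rows, dim, rng):
    """Randomly initialized embedding table (list of row vectors)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]


rng = random.Random(0)
word_table = make_table(10, WORD_DIM, rng)       # stand-in for a pre-trained table
entity_table = make_table(4, OTHER_DIM, rng)     # entity type embeddings
pos_table = make_table(6, OTHER_DIM, rng)        # POS-tagging embeddings
position_table = make_table(50, OTHER_DIM, rng)  # position embeddings


def encode_token(word_id, entity_id, pos_id, position_id):
    """x_i: concatenation of the four embeddings of one token."""
    return (word_table[word_id] + entity_table[entity_id]
            + pos_table[pos_id] + position_table[position_id])


x_i = encode_token(word_id=3, entity_id=1, pos_id=2, position_id=0)
```

Under these dimensions each $x_i$ has 100 + 3 × 50 = 250 components.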
The input sentence $W$ is then transformed into a sequence of vectors $X = x_1, x_2, \ldots, x_n$. Since graph convolution updates each word only from its neighbors in the dependency tree through dependency arcs, following previous methods, we employ a Bidirectional Long Short-Term Memory network (BiLSTM) (Hochreiter and Schmidhuber, 1997) to encode $X$ with its context as $P = p_1, p_2, \ldots, p_n$, which is used as the input of the multi-order GAT. Here, $p_i = [\overrightarrow{\mathrm{LSTM}}(x_i) \,\|\, \overleftarrow{\mathrm{LSTM}}(x_i)]$, where $\|$ is the concatenation operation.

Multi-Order Graph Attention Network
Each dependency tree can be represented as a first-order syntactic graph in terms of an adjacency matrix. Let $A$ be the adjacency matrix of the first-order syntactic graph, which is generated from the dependency tree of the sentence $W$.
$A$ contains three submatrices $A_{along}$, $A_{rev}$ and $A_{loop}$ (Marcheggiani and Titov, 2017), all of dimension $n \times n$. $A_{along}(i, j) = 1$ if there is a dependency arc from $w_i$ to $w_j$ in the dependency tree, and 0 otherwise. As the reverse graph of $A_{along}$, $A_{rev} = A_{along}^{T}$, and $A_{loop}$ is the identity matrix.
The multi-order GAT module learns a list of representations over multi-order syntactic graphs with a few parallel GAT layers (Veličković et al., 2017), which weight the importance of the neighbors of each word in each syntactic graph during convolution.
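A minimal sketch of how the three subgraphs might be built from a list of dependency arcs. Two points are assumptions rather than statements from the text: the reverse graph is taken to be the transpose of $A_{along}$ with $A_{loop}$ as the identity, and the $k$-th-order graph is taken to be the $k$-th matrix power of the first-order one, so that an entry is nonzero exactly when a directed path of $k$ arcs exists. The helper names and the toy chain are hypothetical.

```python
def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][m] * b[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]


def first_order_graphs(n, arcs):
    """Return (along, rev, loop) for a sentence of n tokens.

    arcs holds (i, j) pairs meaning "there is a dependency arc from w_i to w_j".
    """
    along = [[0] * n for _ in range(n)]
    for i, j in arcs:
        along[i][j] = 1
    rev = [[along[j][i] for j in range(n)] for i in range(n)]  # transpose
    loop = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    return along, rev, loop


def kth_order(matrix, k):
    """Assumed k-th-order graph: the k-th power of a first-order matrix."""
    result = matrix
    for _ in range(k - 1):
        result = matmul(result, matrix)
    return result


# Hypothetical toy tree: a chain 0 -> 1 -> 2 -> 3.
arcs = [(0, 1), (1, 2), (2, 3)]
along, rev, loop = first_order_graphs(4, arcs)
along_2 = kth_order(along, 2)  # connects tokens two arcs apart
along_3 = kth_order(along, 3)  # connects tokens three arcs apart
```

In a tree each token has at most one head, so there is at most one directed path between two tokens and the powers stay binary; a general graph would need the entries clipped to 0/1.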
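The representation $h_i^k$ of word $w_i$ in the $k$-th-order syntactic graph $A^k$ is calculated from its representations over the subgraphs of $A^k$:
$$h_i^k = f(p_i, A^k_{along}) \oplus f(p_i, A^k_{rev}) \oplus f(p_i, A^k_{loop}),$$
where $f(\cdot)$ is the graph attention convolution function and $\oplus$ is element-wise addition. For a subgraph $a^k$ of $A^k$,
$$f(p_i, a^k) = \sigma\Big(\sum_{j \in N_i} u_{ij}\,(W_{a,k}\, p_j + \epsilon_{a,k})\Big),$$
where $\sigma$ is the exponential linear unit (ELU) (Clevert et al., 2015); $W_{a,k}$ and $\epsilon_{a,k}$ are the weight matrix and bias term for $a^k$; and $u_{ij}$ is the normalized weight of the neighbor $w_j$ when updating $w_i$:
$$u_{ij} = \mathrm{softmax}(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{l \in N_i} \exp(e_{il})},$$
where $e_{ij} = \gamma(W_{comb}[W_{att}\, p_i \,\|\, W_{att}\, p_j])$. Here, $N_i$ is the neighbor set of $w_i$ in the subgraph; $\gamma$ is LeakyReLU (with negative input slope $\alpha = 0.2$) (Maas et al., 2013); $W_{comb}$ and $W_{att}$ are weight matrices. After graph attention convolution, each candidate trigger word $w_i$ gets a list of multi-order representations $h_i^k$, $k \in [1, K]$, where $K$ is the highest order used in this module.
[Figure: model overview; panels labeled Word Encoding, Multi Order Graph Attention Network, and Trigger Label.]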
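The graph attention convolution described in this section can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the attention weights are normalized with a softmax over the neighbor set $N_i$, $W_{comb}$ is represented as a single row vector `w_comb` so that each score is a scalar, a word with no neighbors in a subgraph receives a zero vector, and all function and parameter names are hypothetical.

```python
import math


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def elu(x):
    return x if x > 0 else math.exp(x) - 1.0


def graph_attention_conv(P, adj, W, b, W_att, w_comb):
    """f(p_i, a) for every word i over one subgraph with adjacency matrix adj."""
    n = len(P)
    Z = [matvec(W_att, p) for p in P]                           # W_att p_j
    V = [[y + c for y, c in zip(matvec(W, p), b)] for p in P]   # W p_j + bias
    out = []
    for i in range(n):
        nbrs = [j for j in range(n) if adj[i][j]]
        if not nbrs:  # assumption: no neighbors gives a zero vector
            out.append([0.0] * len(b))
            continue
        # e_ij = LeakyReLU(w_comb . [W_att p_i || W_att p_j])
        scores = [leaky_relu(sum(c * z for c, z in zip(w_comb, Z[i] + Z[j])))
                  for j in nbrs]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        u = [e / total for e in exps]  # softmax over the neighbor set
        agg = [sum(u[t] * V[j][d] for t, j in enumerate(nbrs))
               for d in range(len(b))]
        out.append([elu(a) for a in agg])
    return out


def multi_subgraph_representation(P, subgraphs, params):
    """h^k: element-wise sum of f over the subgraphs of one order."""
    outputs = [graph_attention_conv(P, adj, *params[name])
               for name, adj in subgraphs.items()]
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*outputs)]


# Toy example: 3 words, 2-dimensional inputs, identity weights, zero attention
# parameters (so every neighbor gets the same weight).
P = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
shared = (identity, [0.0, 0.0], identity, [0.0, 0.0, 0.0, 0.0])
subgraphs = {
    "along": [[0, 1, 1], [0, 0, 0], [0, 0, 0]],
    "rev": [[0, 0, 0], [1, 0, 0], [1, 0, 0]],
    "loop": [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
}
params = {name: shared for name in subgraphs}
along_out = graph_attention_conv(P, subgraphs["along"], *shared)
h = multi_subgraph_representation(P, subgraphs, params)
```

In the model each subgraph of each order has its own weight matrix and bias; the toy example shares one parameter set only to keep the numbers easy to follow.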

Attention Aggregation
To aggregate the multi-order representations $h_i^k$ of each word $w_i$, we employ a weighted attention aggregation mechanism from Hierarchical Attention Networks (Yang et al., 2016):
$$h_i = \sum_{k=1}^{K} v_i^k\, h_i^k,$$
where $v_i^k$ is the normalized weight of the $k$-th-order graph representation of the word $w_i$, which is calculated by
$$v_i^k = \mathrm{softmax}(s_i^k) = \frac{\exp(s_i^k \cdot ctx)}{\sum_{j=1}^{K} \exp(s_i^j \cdot ctx)},$$
where $s_i^j = \tanh(W_{awa}\, h_i^j + \epsilon_{awa})$. Here, $W_{awa}$ and $\epsilon_{awa}$ are the weight matrix and the bias term, respectively; $ctx$ is a randomly initialized context vector, which captures the importance of the graph representations of each order.
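A minimal sketch of this aggregation step, assuming the weights come from a softmax over the dot products between $ctx$ and the tanh-projected representations, as in Hierarchical Attention Networks. The function name `aggregate` and the toy values are illustrative assumptions.

```python
import math


def aggregate(reps, W_awa, b_awa, ctx):
    """Merge the K multi-order representations of one word.

    reps: list of K vectors (one per order).
    Returns the aggregated vector and the K attention weights.
    """
    scores = []
    for h in reps:
        # s = tanh(W_awa h + bias), then score = s . ctx
        s = [math.tanh(sum(w * x for w, x in zip(row, h)) + c)
             for row, c in zip(W_awa, b_awa)]
        scores.append(sum(si * ci for si, ci in zip(s, ctx)))
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(reps[0])
    merged = [sum(weights[k] * reps[k][d] for k in range(len(reps)))
              for d in range(dim)]
    return merged, weights


# Toy example with K = 3 orders and 2-dimensional representations.
reps = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
h_uniform, w_uniform = aggregate(reps, identity, [0.0, 0.0], ctx=[0.0, 0.0])
h_biased, w_biased = aggregate(reps, identity, [0.0, 0.0], ctx=[5.0, 0.0])
```

With a zero context vector all orders receive equal weight, so the result reduces to mean pooling; a non-zero `ctx` shifts weight toward the orders whose projected representation aligns with it.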
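Finally, we use the aggregated representation $h_i$ to predict the trigger label of word $w_i$ as follows:
$$y_i^q = \frac{\exp(O_i^q)}{\sum_{l=1}^{2L+1} \exp(O_i^l)},$$
where $y_i^q$ denotes the probability of assigning label $q$ to word $w_i$, $O_i^q$ is the $q$-th component of $O_i$, and $O_i = w_o h_i + \epsilon_o$. Here, $w_o$ and $\epsilon_o$ are the weight matrix and the bias term, respectively.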
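The prediction step amounts to a linear layer followed by a softmax over the $2L + 1$ labels. The sketch below assumes exactly that; the function name and the toy weights (for $L = 1$, hence three labels) are hypothetical.

```python
import math


def predict_label_probs(h, W_o, b_o):
    """Softmax over the logits O = W_o h + bias for one word."""
    logits = [sum(w * x for w, x in zip(row, h)) + c
              for row, c in zip(W_o, b_o)]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: L = 1 event type, so 2L + 1 = 3 labels.
W_o = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
b_o = [0.0, 0.0, 0.0]
probs = predict_label_probs([2.0, 0.0], W_o, b_o)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

The predicted label for the word is the index with the highest probability.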

Bias Loss Function
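Since the number of "O" labels in the data is much larger than that of event labels, we use a bias loss function (Chen et al., 2018) to enhance the influence of event labels during training:
$$J(\theta) = -\sum_{i=1}^{N_s} \sum_{j=1}^{N_{i,w}} \Big[\log p(y_j^t \mid s_i, \theta) \cdot I(O) + \lambda \log p(y_j^t \mid s_i, \theta) \cdot (1 - I(O))\Big],$$
where $\theta$ denotes the model parameters; $p(y_j^t \mid s_i, \theta)$ is the predicted probability of the gold label of the $j$-th word in sentence $s_i$; $N_s$ is the number of sentences; $N_{i,w}$ is the number of words in $s_i$; $I(O)$ equals 1 if the label of the word is "O" and 0 otherwise; $\lambda$ is a weight parameter larger than 1.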
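A minimal sketch of such a bias loss, assuming it is a negative log-likelihood in which every word whose gold label is not "O" has its term multiplied by $\lambda$. The input format (pairs of predicted gold-label probability and gold label) and the function name are illustrative assumptions.

```python
import math


def bias_loss(sentences, lam=5.0):
    """Weighted negative log-likelihood over all words of all sentences.

    sentences: list of sentences, each a list of
               (probability assigned to the gold label, gold label) pairs.
    Words labeled "O" have weight 1; event-labeled words have weight lam.
    """
    total = 0.0
    for sentence in sentences:
        for prob, gold in sentence:
            weight = 1.0 if gold == "O" else lam
            total -= weight * math.log(prob)
    return total


# Toy batch: one "O" word and one trigger word, both predicted with p = 0.5.
batch = [[(0.5, "O"), (0.5, "B-Attack")]]
loss_biased = bias_loss(batch, lam=5.0)
loss_plain = bias_loss(batch, lam=1.0)
```

With `lam=1.0` the function reduces to the ordinary negative log-likelihood; with `lam=5.0` the same mistake on a trigger word costs five times as much as on an "O" word.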

Experiment Settings
Dataset and Resources We compare our MOGANED model with baseline methods on ACE-2005 with the same data split, where 40 newswire documents are used as the test set, 30 newswire documents as the validation set and the remaining 529 documents as the training set.
For data preprocessing, we use the Stanford CoreNLP toolkit for sentence splitting, tokenizing, POS-tagging and dependency parsing. We adopt word embeddings trained over the New York Times corpus with the Skip-gram algorithm (Mikolov et al., 2013; Chen et al., 2018). In the experiments below, we use 100 as the dimension of word embeddings and 50 as that of the other embeddings. We set the hidden units of the BiLSTM network to 100. We set the highest order $K$ to 3 and the dimension of the graph representation to 150. We set the batch size to 30 and use a fixed maximum sentence length $n = 50$, padding shorter sentences and truncating longer ones. During training, we use the AdaDelta update rule (Zeiler, 2012) with a learning rate of 0.001. We set the dropout rate to 0.3 and the L2-norm coefficient to 1e-5. We set the bias loss parameter $\lambda$ to 5.
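For reference, the settings above can be collected in one place, together with a sketch of the fixed-length preprocessing. The dictionary keys, the padding token and the function name are hypothetical; only the values come from the text.

```python
# Hyperparameter values as reported above; the key names are illustrative.
CONFIG = {
    "word_dim": 100,
    "other_embedding_dim": 50,
    "bilstm_hidden_units": 100,
    "highest_order_K": 3,
    "graph_dim": 150,
    "batch_size": 30,
    "max_sentence_length": 50,
    "learning_rate": 0.001,
    "dropout": 0.3,
    "l2_coefficient": 1e-5,
    "bias_loss_lambda": 5,
}

PAD = "<pad>"  # hypothetical padding token


def pad_or_truncate(tokens, max_len=CONFIG["max_sentence_length"], pad=PAD):
    """Cut sentences longer than max_len and pad shorter ones on the right."""
    kept = tokens[:max_len]
    return kept + [pad] * (max_len - len(kept))


short = pad_or_truncate(["the", "soldiers", "fired"])
long = pad_or_truncate(["tok%d" % i for i in range(70)])
```

Padding on the right is a design choice of this sketch; the text only states that short sentences are padded and long ones cut.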

Overall Performance
We compare our model with the following state-of-the-art baselines: 1) Cross Event (Liao and Grishman, 2010), which uses document-level information for event extraction; 2) DMCNN (Chen et al., 2015), which builds a dynamic multi-pooling CNN model; 3) JRNN (Nguyen et al., 2016), which uses a bidirectional RNN and human-designed features; 4) DEEB-RNN, which uses hierarchical supervised attention with document-level information for event detection; 5) dbRNN (Sha et al., 2018), which adds dependency arcs over a Bi-LSTM network to improve event extraction; 6) GCN-ED (Nguyen and Grishman, 2018), which uses an argument pooling mechanism for event detection based on GCN; 7) JMEE, which uses GCN with a highway network and self-attention.
Table 1 presents the performance comparison between the different methods. MOGANED achieves 1.6% and 1.7% improvements in precision and $F_1$-measure, respectively, compared with the best baselines. MOGANED reaches a lower recall than two sequence-based methods, JRNN and DEEB-RNN. This may be caused by errors propagated from the dependency parsing tool. In some long sentences, trigger words and their corresponding entities are parsed into two independent dependency trees without any connection. The trigger words and their related entities are therefore unable to interact in the multi-order graph attention network, which leads to a lower recall. However, MOGANED still achieves the best precision, recall and $F_1$-measure among all dependency-based methods, which suggests the effectiveness of multi-order representations.

Ablation Study
This study aims to validate the impact of multi-order representations, the graph attention network and attention aggregation. For this purpose, we design three architectures based on MOGANED: 1) MOGANED-First, which uses only the first-order syntactic graph (i.e., $K = 1$); 2) MOGANED-GCN, which uses a traditional GCN instead of GAT; 3) MOGANED-Mean, which replaces the attention aggregation mechanism with mean pooling over the multi-order representations of each word. The experimental results are shown in Table 2. All three modified models perform worse than MOGANED. MOGANED-First achieves the worst performance, which suggests that high-order syntactic relations play an important role in event detection. MOGANED-GCN drops more in precision than in recall, which indicates that the attention learned by GAT helps MOGANED predict trigger words more precisely. The performance drop of MOGANED-Mean is the smallest among the three modified models. Although averaging the multi-order representations achieves competitive performance for event detection, the proposed attention aggregation module still distinguishes the importance of syntactic representations of different orders, which yields a 1.7% improvement in $F_1$-measure.

Conclusion
In this paper, we proposed the MOGANED model, which models multi-order representations via GAT and employs an attention aggregation mechanism to better capture dependency contextual information for event detection. The experimental results demonstrate its effectiveness and its superiority over state-of-the-art baselines.