Edge-Enhanced Graph Convolution Networks for Event Detection with Syntactic Relation

Event detection (ED), a key subtask of information extraction, aims to recognize instances of specific event types in text. Previous studies on the task have verified the effectiveness of integrating syntactic dependency into graph convolutional networks. However, these methods usually ignore dependency label information, which conveys rich and useful linguistic knowledge for ED. In this paper, we propose a novel architecture named Edge-Enhanced Graph Convolution Networks (EE-GCN), which simultaneously exploits syntactic structure and typed dependency label information to perform ED. Specifically, an edge-aware node update module is designed to generate expressive word representations by aggregating syntactically-connected words through specific dependency types. Furthermore, to fully explore clues hidden in dependency edges, a node-aware edge update module is introduced, which refines the relation representations with contextual information. These two modules are complementary and reinforce each other. We conduct experiments on the widely used ACE2005 dataset, and the results show significant improvement over competitive baseline methods.


Introduction
Event Detection (ED) is an important information extraction task that seeks to recognize events of specific types in a given text. Specifically, each event in a sentence is marked by a word or phrase called the "event trigger". The task of ED is to detect event triggers and classify them into specific types of interest. Taking Figure 1 as an example, ED is supposed to recognize the event trigger "visited" and classify it as the event type Meet. Dependency trees convey rich structural information that has proven useful for ED (Nguyen and Grishman, 2018; Liu et al., 2018b). Recent works on ED focus on building Graph Convolutional Networks (GCNs) over the dependency tree of a sentence to exploit syntactic dependencies (Nguyen and Grishman, 2018; Liu et al., 2018b). Compared to sequence-based models, GCN-based models are able to capture non-local syntactic relations that are obscure from the surface form alone, and usually achieve better performance. Nevertheless, existing GCN-based ED methods do not consider dependency labels, which may serve as significant indicators of whether a word is a trigger. As shown in Figure 1, the dependencies "nsubj" (nominal subject) and "dobj" (direct object) show that "Putin" and "Bush" are the subject and object of "visited", respectively, and the words connected to "visited" by the "nmod" (nominal modifier) dependency express when and where the event happened. Such dependency labels constitute effective evidence for predicting the event type of "visited" as Meet. In addition, our statistics on the benchmark ACE2005 dataset show that "nsubj", "dobj" and "nmod" account for 32.2% of trigger-related dependency labels (compared with an average of 2.5% per relation across all 40 dependency relations), which suggests that jointly modeling syntactic structure and dependency labels can be crucial for making full use of dependency trees to further improve ED performance.
Besides, we also observe that the same dependency label may convey different signals for ED in different contexts. Again taking Figure 1 as an example, the "nmod" dependency connected to "ranch" indicates where the event happens, whereas the "nmod" dependency connected to "November" indicates when it happens. This observation shows that assigning a single context-independent representation to each dependency label is not enough to express the complex relations between words. That is, the representations of dependency relations should be context-dependent and dynamic, calculated and updated according to the sentential context by a network structure.
To model the above ideas, we propose a novel neural architecture named Edge-Enhanced Graph Convolutional Networks (EE-GCN), which explicitly takes advantage of typed dependency labels with dynamic representations. In particular, EE-GCN transforms a sentence into a graph by treating words as nodes and dependency labels as typed edges. Accordingly, an adjacency tensor is constructed to represent the graph, capturing both syntactic structure and typed dependency labels. To encode the heterogeneous information in the adjacency tensor, EE-GCN simultaneously performs two kinds of propagation learning. In each layer, an edge-aware node update module first aggregates information from the neighbors of each node through specific edges. Then a node-aware edge update module dynamically refines each edge representation with the representations of its connected nodes, making the edge representation more informative. These two modules reinforce each other by updating each other iteratively.
Our contributions are summarized as follows:
• We propose the novel EE-GCN, which simultaneously integrates syntactic structure and typed dependency labels to improve neural event detection, and learns to update the relation representations in a context-dependent manner. To the best of our knowledge, there is no similar work in ED.
• Experiments conducted on the ACE2005 benchmark show that EE-GCN achieves SOTA performance. Further analysis confirms

Related Works
In earlier ED studies, researchers focused on leveraging various kinds of linguistic features and manually designed features for the task. However, all feature-based methods depend on the quality of the features designed in a pre-processing step.
Most recent works have focused on leveraging neural networks for this task (Chen et al., 2015; Nguyen and Grishman, 2015; Nguyen et al., 2016; Ghaeini et al., 2016; Feng et al., 2016). The existing approaches can be categorized into two classes. The first class improves ED through special learning techniques, including adversarial training (Hong et al., 2018), knowledge distillation (Liu et al., 2019) and model pre-training (Yang et al., 2019). The second class improves ED by introducing extra resources, such as argument information, document information (Duan et al., 2017; Chen et al., 2018), multi-lingual information (Liu et al., 2018a, 2019), knowledge bases and syntactic information (Sha et al., 2018).
Syntactic information plays an important role in ED. Sha et al. (2018) exploited a dependency-bridge recurrent neural network to integrate the dependency tree into the model. Orr et al. (2018) proposed a directed-acyclic-graph GRU model to introduce syntactic structure into the sequence structure. With the rise of GCN (Kipf and Welling, 2017), researchers proposed to transform the syntactic dependency tree into a graph and employ GCN to conduct ED through information propagation over the graph (Nguyen and Grishman, 2018; Liu et al., 2018b). Although these works use syntactic structure, few of them take dependency label information into consideration, whose importance we demonstrate here. How to effectively leverage typed dependency information remains a challenge in this task.

Problem Statement
In this section, we formally describe the event detection problem. Following previous works (Chen et al., 2015; Nguyen et al., 2016; Chen et al., 2018), we formulate event detection as a sequence labeling task. Each word is assigned a label that contributes to event annotation. The tag "O" represents the "Other" tag, which means that the corresponding word is irrelevant to the target events. In addition to "O", the other tags consist of two parts: the word position in the trigger and the event type. We use the "BI" (Begin, Inside) signs to represent the position of a word in the event trigger. The event type is taken from a pre-defined set of event types. Thus, the total number of tags is $2 \times N_{EventType} + 1$, where $N_{EventType}$ is the number of pre-defined event types.
Figure 2 gives an illustration of the EE-GCN based event detection architecture, which is mainly composed of three components: the Input Layer, the Edge-Enhanced GCN Layer and the Classification Layer. Next, we detail all components sequentially from bottom to top.
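The tagging scheme from the problem statement can be sketched as follows. This is a minimal illustration, not the authors' code: the helper name `build_tag_set` and the `B-`/`I-` string format are assumptions made for this example, and the toy event types stand in for the real inventory.

```python
def build_tag_set(event_types):
    """Return the tag inventory: "O" plus a Begin and an Inside tag per event type."""
    tags = ["O"]
    for event_type in event_types:
        tags.append(f"B-{event_type}")  # first word of a trigger of this type
        tags.append(f"I-{event_type}")  # any later word of the same trigger
    return tags


# Toy inventory (assumed names); with the 33 ACE2005 event types the same
# construction gives 2 * 33 + 1 = 67 tags.
toy_tags = build_tag_set(["Meet", "Attack", "Arrest-Jail"])
```

A single-word trigger such as "visited" would receive only the Begin tag of its event type, and every non-trigger word would receive "O".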

Input Layer
Let $S = \{w_1, w_2, \ldots, w_n\}$ denote an $n$-word sentence. We first transform each word into a real-valued vector $x_i$ by concatenating the following vectors:
• Word embedding $w_i$: it captures the meaningful semantic regularities of words. Following previous works (Chen et al., 2018), we use word embeddings pre-trained by Skip-gram on the NYT Corpus.
• Entity type embedding $e_i$: entities in the sentence are annotated with the BIO schema, and we map each entity type label to a real-valued embedding by looking up an embedding table.
Thus, the input embedding of $w_i$ can be defined as $x_i = [w_i; e_i] \in \mathbb{R}^{d_w + d_e}$, where $d_w$ and $d_e$ denote the dimensions of the word embedding and the entity type embedding, respectively. Then, a BiLSTM layer is adopted to capture the contextual information for each word. For simplicity, we denote the contextualized word representations as $S = [h_1, \cdots, h_n]$, where $S \in \mathbb{R}^{n \times d}$ is used as the initial node features in EE-GCN.

Edge-Enhanced Graph Convolutional Networks
In this subsection, we start by introducing the baseline GCN model, and then present the proposed EE-GCN, which can make full use of dependency label features for better representation learning.

Vanilla Graph Convolutional Network
GCN (Kipf and Welling, 2017), which is capable of encoding graphs, is an extension of the convolutional neural network. For an $L$-layer GCN where $l \in [1, \cdots, L]$, if we denote by $H^{l-1}$ the input state and by $H^{l}$ the output state of the $l$-th layer, the graph convolutional operation can be formulated as:
$$H^{l} = \sigma\left(A H^{l-1} W\right) \tag{1}$$
where $A \in \mathbb{R}^{n \times n}$ is an adjacency matrix expressing connectivity between nodes, $W$ is a learnable convolutional filter and $\sigma$ denotes a nonlinear activation function, e.g., ReLU.
Previous GCN-based ED methods (Nguyen and Grishman, 2018; Liu et al., 2018b) transform the dependency tree into a graph according to syntactic connectivity, with each word in the sentence regarded as a node. The graph is represented by an $n \times n$ adjacency matrix $A$, where $A_{ij} = 1$ if there is a syntactic dependency edge between node $i$ and node $j$, and $A_{ij} = 0$ otherwise. Such approaches use a binary adjacency matrix as structural information and omit typed dependency label features, which can be potentially useful for ED as discussed in the introduction.
It is worth explaining why these methods ignore typed dependency labels. An intuitive way for vanilla GCN to exploit these labels is to encode different types of dependency relations with different convolutional filters, which is similar to RGCN (Schlichtkrull et al., 2018). However, RGCN suffers from over-parameterization, since the number of parameters grows rapidly with the number of relations. Given that there exist approximately 40 types of dependency relations and the size of the ED dataset is only moderate, models with a large number of parameters are likely to overfit, which is why previous works on ED ignore typed dependency labels.
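The vanilla GCN update over a binary adjacency matrix, as described in this subsection, can be illustrated with a small sketch. It assumes the propagation rule $H^{l} = \sigma(A H^{l-1} W)$ with ReLU and no normalization or self-loops; the helper names, the three-word toy sentence and all numeric values are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Apply ReLU element-wise."""
    return [[max(0.0, v) for v in row] for row in m]


def adjacency_from_edges(n, edges):
    """Binary symmetric adjacency: A[i][j] = 1 iff words i and j share a dependency edge."""
    a = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = 1.0
        a[j][i] = 1.0
    return a


def gcn_layer(a, h, w):
    """One vanilla GCN layer: ReLU(A H W). Dependency labels play no role here."""
    return relu(matmul(matmul(a, h), w))


# Toy sentence "Putin visited Bush": visited(1)-Putin(0) and visited(1)-Bush(2).
A = adjacency_from_edges(3, [(1, 0), (1, 2)])
H0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 2-d node features
W = [[1.0, -1.0], [0.5, 2.0]]              # toy filter
H1 = gcn_layer(A, H0, W)
```

Note that "Putin" and "Bush" end up with identical representations: both are connected only to "visited", and the binary matrix cannot tell the subject edge from the object edge.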

Edge-Enhanced GCN
Edge-Enhanced GCN (EE-GCN) is an extension of the vanilla GCN mentioned above, which incorporates typed dependency label information into the feature aggregation process to obtain better representations. Specifically, EE-GCN constructs an adjacency tensor $E \in \mathbb{R}^{n \times n \times p}$ to describe the graph structure instead of the binary adjacency matrix used in the vanilla GCN, where $E_{i,j,:} \in \mathbb{R}^{p}$ is the $p$-dimensional relation representation between node $i$ and node $j$, and $p$ can also be understood as the number of channels in the adjacency tensor. Formally, $E$ is initialized according to the dependency tree: if a dependency edge exists between $w_i$ and $w_j$ and the dependency label is $r$, then $E_{i,j,:}$ is initialized to the embedding of $r$ obtained from a trainable embedding lookup table; otherwise, $E_{i,j,:}$ is initialized to a $p$-dimensional all-zero vector. Following previous works (Marcheggiani and Titov, 2017), we initialize $E$ based on an undirected graph, which means that $E_{i,j,:}$ and $E_{j,i,:}$ are initialized to the same embedding. For the ROOT node in the dependency tree, we add a self loop with a special relation "ROOT". In order to fully leverage the adjacency tensor and effectively mine latent relation information beyond the dependency labels, two modules are implemented at each layer $l$ of EE-GCN to update the node representations ($H$) and edge representations ($E$) mutually through information aggregation:
$$H^{l} = \mathrm{EANU}(E^{l-1}, H^{l-1}), \qquad E^{l} = \mathrm{NAEU}(E^{l-1}, H^{l}) \tag{2}$$

Edge-Aware Node Update Module
With the words in a sentence interpreted as nodes in a graph, the edge-aware node update (EANU) module updates the representation of each node by aggregating the information from its neighbors through the adjacency tensor. Mathematically, this operation can be defined as follows:
$$H^{l} = \mathrm{EANU}(E^{l-1}, H^{l-1}) = \sigma\left(\mathrm{Pool}(H^{l}_{1}, H^{l}_{2}, \ldots, H^{l}_{p})\right) \tag{3}$$
Specifically, the aggregation is conducted channel by channel in the adjacency tensor as follows:
$$H^{l}_{i} = E^{l-1}_{:,:,i} H^{l-1} W \tag{4}$$
where $E^{l-1} \in \mathbb{R}^{n \times n \times p}$ is the adjacency tensor from initialization or from the previous EE-GCN layer, $E^{l-1}_{:,:,i} \in \mathbb{R}^{n \times n}$ denotes the $i$-th channel slice of $E^{l-1}$, $H^{0}$ is the output of the BiLSTM, $W \in \mathbb{R}^{d \times d}$ is a learnable filter, $d$ is the dimension of the node representation, and $\sigma$ is the ReLU activation function. A mean-pooling operation is applied to compress the features, since it covers information from all channels.
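The adjacency tensor initialization and the channel-wise aggregation of the EANU module can be sketched as below. This is a hedged illustration under the definitions in the text (per-channel product $E_{:,:,k} H W$, mean pooling over channels, then ReLU); the function names, the tiny label embedding table and the toy values are assumptions, and a real implementation would use trainable tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def init_adjacency_tensor(n, typed_edges, label_emb, root, p):
    """E[i][j] holds the p-dim embedding of the label on edge (i, j), zeros otherwise."""
    e = [[[0.0] * p for _ in range(n)] for _ in range(n)]
    for i, j, label in typed_edges:
        e[i][j] = list(label_emb[label])
        e[j][i] = list(label_emb[label])     # undirected graph: same embedding both ways
    e[root][root] = list(label_emb["ROOT"])  # self loop on the root word
    return e


def eanu(e, h, w):
    """Edge-aware node update: ReLU of the mean over channels k of E[:, :, k] H W."""
    n, p = len(h), len(e[0][0])
    hw = matmul(h, w)
    d = len(hw[0])
    pooled = [[0.0] * d for _ in range(n)]
    for k in range(p):
        channel = [[e[i][j][k] for j in range(n)] for i in range(n)]  # n x n slice
        out = matmul(channel, hw)
        for i in range(n):
            for c in range(d):
                pooled[i][c] += out[i][c] / p  # mean pooling across channels
    return [[max(0.0, v) for v in row] for row in pooled]


# Toy example: "Putin visited Bush" with p = 2 and hand-picked label embeddings.
label_emb = {"nsubj": [1.0, 0.0], "dobj": [0.0, 1.0], "ROOT": [1.0, 1.0]}
E0 = init_adjacency_tensor(3, [(1, 0, "nsubj"), (1, 2, "dobj")], label_emb, root=1, p=2)
H0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity filter keeps the arithmetic easy to follow
H1 = eanu(E0, H0, W)
```

In this toy setting the two label embeddings are one-hot, so each channel acts like a label-specific adjacency matrix; with learned dense embeddings every channel mixes all edge types with different weights.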

Node-Aware Edge Update Module
In the original adjacency tensor, the relation representation between words is initialized to the dependency label embedding. However, as mentioned in the introduction, the same dependency label may convey different signals for ED in different contexts, so assigning a single context-independent representation to each dependency label is not enough to express the complex relations between words. To address this issue, we propose a novel node-aware edge update (NAEU) module to dynamically calculate and update edge representations according to the node context. Formally, the NAEU operation is defined as:
$$E^{l}_{i,j,:} = \mathrm{NAEU}(E^{l-1}_{i,j,:}, h^{l}_{i}, h^{l}_{j}) = \left[E^{l-1}_{i,j,:} \oplus h^{l}_{i} \oplus h^{l}_{j}\right] W_u \tag{5}$$
where $\oplus$ denotes the concatenation operator, $h^{l}_{i}$ and $h^{l}_{j}$ denote the representations of node $i$ and node $j$ in the $l$-th layer after the EANU operation, respectively, $E^{l-1}_{i,j,:} \in \mathbb{R}^{p}$ is the relation representation between node $i$ and node $j$, and $W_u \in \mathbb{R}^{(2 \times d + p) \times p}$ is a learnable transformation matrix. This operation refines the adjacency tensor in a context-dependent manner, so that the latent relation information expressed in the node representations can be effectively mined and injected into the adjacency tensor. As a result, the adjacency tensor is no longer constrained to convey only the dependency label information and gains more representation power. The updated adjacency tensor is fed into the next EE-GCN layer to perform another round of edge-aware node update, and this mutual update process can be stacked over $L$ layers.
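The node-aware edge update can be sketched as a per-pair linear map over the concatenation of the old edge vector and the two node vectors. This is an illustrative reading of the description above, not the authors' implementation: the concatenation order, the absence of a bias term and the function name `naeu` are assumptions.

```python
def naeu(e, h, w_u):
    """Node-aware edge update: new E[i][j] = concat(E[i][j], h[i], h[j]) @ W_u.

    e   : n x n x p adjacency tensor (nested lists)
    h   : n x d node representations after the edge-aware node update
    w_u : (p + 2d) x p transformation matrix
    """
    n = len(h)
    p = len(w_u[0])
    new_e = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            concat = e[i][j] + h[i] + h[j]  # list concatenation, length p + 2d
            new_e[i][j] = [
                sum(concat[r] * w_u[r][c] for r in range(len(concat)))
                for c in range(p)
            ]
    return new_e


# Toy example with n = 2 words, d = 1 and p = 1.
E_prev = [[[0.0], [1.0]], [[1.0], [0.0]]]
H = [[2.0], [3.0]]
W_u = [[1.0], [0.5], [0.5]]
E_next = naeu(E_prev, H, W_u)
```

As written, the update is applied to every word pair, so entries that started as all-zero vectors become non-zero after the first layer. Whether pairs without a dependency edge are masked is not stated in the text; this sketch leaves them unmasked.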

Classification Layer
After aggregating word (node) representations from each layer of EE-GCN, we finally feed the representation of each word into a fully-connected network, followed by a softmax function that computes the distribution $p(t \mid h)$ over all event types:
$$p(t \mid h) = \mathrm{softmax}(W_t h + b_t) \tag{6}$$
where $W_t$ maps the word representation $h$ to the feature score for each event type and $b_t$ is a bias term. After the softmax, the event label with the largest probability is chosen as the classification result.
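A minimal sketch of the classification step, assuming a single linear layer followed by softmax and an argmax over tags. The tag names, weights and the helper `classify` are hypothetical values chosen for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]


def classify(h, w_t, b_t, tags):
    """Score each tag with W_t h + b_t, normalize with softmax, return the best tag."""
    scores = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(w_t, b_t)]
    probs = softmax(scores)
    best = max(range(len(tags)), key=probs.__getitem__)
    return tags[best], probs


tags = ["O", "B-Meet", "I-Meet"]           # assumed tag strings
h = [1.0, 2.0]                             # toy word representation
W_t = [[0.1, 0.0], [1.0, 1.0], [0.0, 0.2]]  # one row of weights per tag
b_t = [0.0, 0.0, 0.0]
label, probs = classify(h, W_t, b_t, tags)
```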

Bias Loss Function
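Following popular choices (Chen et al., 2018), we adopt a bias loss function to strengthen the influence of event type labels during training, since the number of "O" tags is much larger than that of event type tags. The bias loss function is formulated as follows:
$$J(\theta) = -\sum_{i=1}^{N_s} \sum_{j=1}^{n_i} \Big[ \log p(y_j^t \mid s_i, \theta) \cdot I(O) + \alpha \log p(y_j^t \mid s_i, \theta) \cdot (1 - I(O)) \Big] \tag{7}$$
where $N_s$ is the number of sentences, $n_i$ is the number of words in the $i$-th sentence $s_i$, $y_j^t$ is the gold tag of the $j$-th word, and $I(O)$ is a switching function that distinguishes the loss of the tag "O" from that of event type tags. It is defined as follows:
$$I(O) = \begin{cases} 1, & \text{if the tag is "O"} \\ 0, & \text{otherwise} \end{cases} \tag{8}$$
where $\alpha$ is the bias weight. The larger $\alpha$ is, the greater the influence of event type tags on the model.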
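The effect of the bias weight can be illustrated with a small sketch: the negative log-likelihood of each word is weighted by 1 when its gold tag is "O" and by α otherwise. The data layout (one dictionary of tag probabilities per word) and the function name are assumptions made for this example.

```python
import math


def bias_loss(batch_probs, batch_gold, alpha, other_tag="O"):
    """Weighted negative log-likelihood over all words of all sentences.

    batch_probs[i][j] maps each tag to its predicted probability for word j
    of sentence i; batch_gold[i][j] is the gold tag of that word.
    """
    loss = 0.0
    for sent_probs, sent_gold in zip(batch_probs, batch_gold):
        for dist, tag in zip(sent_probs, sent_gold):
            weight = 1.0 if tag == other_tag else alpha  # switching function I(O)
            loss -= weight * math.log(dist[tag])
    return loss


# One toy sentence with two words, both predicted with probability 0.5.
probs = [[{"O": 0.5, "B-Meet": 0.5}, {"O": 0.5, "B-Meet": 0.5}]]
gold = [["O", "B-Meet"]]
loss = bias_loss(probs, gold, alpha=5)
```

With α = 5, the trigger word contributes five times as much to the loss as the "O" word even though both are predicted with the same probability.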

Dataset and Evaluation Metrics
We conduct experiments on the ACE2005 dataset, which is the standard supervised dataset for event detection. The Stanford CoreNLP toolkit is used for dependency parsing. ACE2005 contains 599 documents annotated with 33 event types. We use the same data split as previous works (Chen et al., 2015; Nguyen et al., 2016; Chen et al., 2018) for the train, dev and test sets, and describe the details in the supplementary material (Data.zip). We evaluate the models using the official scorer in terms of Precision (P), Recall (R) and $F_1$-score.
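For readers unfamiliar with the metrics, the sketch below computes micro-averaged Precision, Recall and $F_1$ over predicted triggers. It is not the official scorer; it assumes that a prediction counts as correct only when its sentence, span and event type all match a gold trigger, and the tuple layout is made up for this example.

```python
def precision_recall_f1(gold, predicted):
    """Micro P/R/F1 over sets of (sentence_id, start, end, event_type) tuples."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


gold = [(0, 1, 1, "Meet"), (1, 3, 3, "Attack")]
predicted = [(0, 1, 1, "Meet"), (1, 3, 3, "Die"), (2, 0, 0, "Meet")]
p, r, f1 = precision_recall_f1(gold, predicted)
```

Here one of three predictions is correct and one of two gold triggers is found; the second prediction has the right span but the wrong type and therefore counts as an error.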

Hyper-parameter Setting
The hyper-parameters are manually tuned on the dev set. We adopt 100-dimensional word embeddings pre-trained on the NYT corpus with the Skip-gram algorithm. The entity type and dependency label embeddings are randomly initialized with 25- and 50-dimensional vectors, respectively. The hidden state sizes of the BiLSTM and EE-GCN are set to 100 and 150, respectively. Parameter optimization is performed using SGD with learning rate 0.1 and batch size 30. We use L2 regularization with a parameter of 1e-5 to avoid overfitting. Dropout is applied to word embeddings and hidden states with a rate of 0.6. The bias parameter $\alpha$ is set to 5. The maximum sentence length is set to 50 by padding shorter sentences and truncating longer ones. The number of EE-GCN layers is 2, which is the best-performing depth in pilot studies. We ran all the experiments using PyTorch 1.1.0 on an Nvidia Tesla P100 GPU with an Intel Xeon E5-2620 CPU.
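The reported settings can be collected in a plain configuration dictionary, together with a sketch of the pad-or-truncate step. The key names, the `<pad>` token and the helper function are assumptions for illustration; only the values come from the text.

```python
HPARAMS = {
    "word_emb_dim": 100,       # Skip-gram embeddings pre-trained on the NYT corpus
    "entity_emb_dim": 25,      # randomly initialized
    "dep_label_emb_dim": 50,   # randomly initialized; equals p in the adjacency tensor
    "bilstm_hidden": 100,
    "eegcn_hidden": 150,
    "optimizer": "SGD",
    "learning_rate": 0.1,
    "batch_size": 30,
    "l2": 1e-5,
    "dropout": 0.6,
    "bias_alpha": 5,
    "max_len": 50,
    "num_layers": 2,
}


def pad_or_truncate(tokens, max_len=HPARAMS["max_len"], pad="<pad>"):
    """Cut sentences longer than max_len and right-pad shorter ones."""
    return tokens[:max_len] + [pad] * max(0, max_len - len(tokens))


short = pad_or_truncate(["Putin", "visited", "Bush"])
long = pad_or_truncate(["w"] * 70)
```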

Baselines
In order to comprehensively evaluate our proposed EE-GCN model, we compare it with a range of baselines and state-of-the-art models, which can be categorized into three classes: feature-based, sequence-based and GCN-based.
Feature-based models use human-designed features to perform event detection. 1) MaxEnt is proposed by Li et al. (2013)
GCN-based models build a Graph Convolutional Network over the dependency tree of a sentence to exploit syntactic information. 1) GCN-ED (Nguyen and Grishman, 2018) is the first attempt to explore how to effectively use GCN in event detection; 2) JMEE (Liu et al., 2018b) enhances GCN with self-attention and a highway network to improve the performance of GCN for event detection; 3) RGCN (Schlichtkrull et al., 2018), which models relational data with relation-specific adjacency matrices and convolutional filters, was originally proposed for knowledge graph completion. We adapt it to the task of event detection by using the same classification layer and bias loss as our model; 4) MOGANED improves GCN with aggregated attention to combine multi-order word representations from different GCN layers, and is the state-of-the-art method on the ACE2005 dataset.

Overall Performance
We report our experimental results on the ACE2005 dataset in Table 1.
Table 1 (P, R and F1 in %)
Model    P    R    F1
MaxEnt (Li et al., 2013)    74.5    59.1    65.9
CrossEntity (Hong et al., 2011)    72.9    64.3    68.3
DMCNN (Chen et al., 2015)    75.6    63.6    69.1
JRNN (Nguyen et al., 2016)    66.0    73.0    69.3
ANN-AugAtt    78.0    66.3    71.7
dbRNN † (Sha et al., 2018)    74.1    69.8    71.9
HBTNGMA (Chen et al., 2018)    77.9    69.1    73.3
GCN-ED †    77.9    68.8    73.1
JMEE † (Liu et al., 2018b)    76.3    71.3    73.7
RGCN † ‡ (Schlichtkrull et al., 2018)
Our model, EE-GCN, outperforms all the baselines and achieves a state-of-the-art $F_1$-score. We attribute the performance gain to two aspects: 1) The introduction of typed dependency labels. EE-GCN outperforms all existing GCN-based models that only utilize syntactic structure and ignore the specific typed dependency labels, which demonstrates that the type of a dependency label can provide key information for event detection. 2) The design of context-dependent relation representations. Compared with the baseline RGCN, which also exploits both syntactic structure and dependency labels, EE-GCN still improves by an absolute margin of 4.2%. We believe this is because RGCN distinguishes different dependency labels with different convolution filters, so the same dependency label keeps the same representation regardless of context. As a result, the potential relation information expressed beyond dependency labels is not fully exploited. By contrast, our EE-GCN model learns a context-dependent relation representation during information aggregation with the help of the node-aware edge update module, and thus better captures the information underlying the relations between words. We also observe that EE-GCN gains its improvements mainly on Recall. We hypothesize that this is because EE-GCN introduces dependency labels, which help to capture more fine-grained trigger-related features, so that more triggers are detected.
Meanwhile, MOGANED surpasses EE-GCN on Precision. As analyzed in the original paper, MOGANED uses GAT (GCN with attention) as its basic encoder, and the attention mechanism helps to predict event triggers more precisely.
Additionally, we notice that EE-GCN performs remarkably better than all sequence-based neural models that do not use dependency structure, which clearly demonstrates that a reasonable use of syntactic dependency information can indeed improve the performance of event detection. Compared with dbRNN, which adds weighted syntactic dependency arcs to a BiLSTM, our model improves on both P and R. This illustrates that GCN is capable of modeling dependency structure more effectively and that the multi-dimensional embedding of dependency labels in EE-GCN learns more information than a single weight in dbRNN.

Ablation Study
To demonstrate the effectiveness of each component, we conduct an ablation study on the ACE2005 dev set, as Table 2 shows.
1) -Typed Dependency Label (TDL): to study whether the typed dependency labels contribute to the performance improvement, we initialize each $E_{i,j,:}$ in the adjacency tensor $E$ to the same vector if there is a syntactic dependency edge between node $i$ and node $j$, so that the typed dependency label information is removed. As a result, the $F_1$-score drops by 0.5% absolutely, which demonstrates that typed dependency label information plays an important role in EE-GCN.
2) -Node-Aware Edge Update Module (NAEU): removing the node-aware edge update module hurts the result by 0.99% $F_1$-score, which verifies that the context-dependent relation representations provide more evidence for event detection than the context-independent ones.
3) -TDL & NAEU: we remove the edge-aware node update module and the node-aware edge update module simultaneously, so that the model degenerates to the vanilla GCN. We observe that the performance drops by 1.69%, which again confirms the effectiveness of our model.
4) -Multi-dimensional Edge Representation (MDER): when we set the dimension of the relation representation to 1, that is, compress the adjacency tensor $E \in \mathbb{R}^{n \times n \times p}$ to $E \in \mathbb{R}^{n \times n \times 1}$, the $F_1$-score drops by 0.77% absolutely, which indicates that the multi-dimensional representation captures more information than a single scalar parameter or weight.
5) -BiLSTM: when the BiLSTM before EE-GCN is removed, the performance drops sharply. This illustrates that the BiLSTM captures important sequential information that GCN misses. Therefore, GCN and BiLSTM are complementary to each other for event detection.
Note that the $F_1$-score of the model on the ACE2005 dev set is significantly lower than that on the test set. We conjecture that the performance difference comes from a domain gap, since the ACE2005 dev set and test set are collected from different domains (Nguyen and Grishman, 2015).

Table 2: Ablation study on the ACE2005 dev set (columns: Model, Dev F1).

Effect of Edge Representation Dimension
As shown in the ablation study, reducing the dimension of the edge representation to 1 substantially hurts the performance of EE-GCN. One may wonder what the appropriate dimension for EE-GCN is. Therefore, we study the performance of models with different edge representation dimensions in this part. We vary the dimension from 1 to 80 with an interval of 20 and check the corresponding $F_1$-score of EE-GCN on the dev and test sets of ACE2005. The results on the ACE2005 dev and test sets are illustrated in Figure 3 and Figure 4, respectively. We can see that the $F_1$-score peaks when the dimension is 50 and then falls. This again justifies the effectiveness of introducing a multi-dimensional edge representation. Besides, overfitting takes effect when the dimension rises beyond a threshold, which explains why the curve falls after the 50-dimensional representation in Figure 3.

Effectiveness of Dependency Label
To further confirm the effectiveness of dependency labels, we conduct an additional experiment in which individual dependency labels are added back to EE-GCN without TDL. Starting from F1 = 75.51% on the test set with TDL removed, the largest improvements are obtained when adding the nmod, nsubj and dobj labels, which yield F1 = 77.09%, 77.22% and 76.69%, respectively. This shows that these three labels are the main contributors, which is consistent with our statistics in the Introduction.

Performance of Different Event Types
We examined the F1-score of each event type using EE-GCN and GCN, respectively, and observe that End-ORG (F1 = 0.0) and Start-ORG (F1 = 41.67%) are the hardest event types for GCN to detect. These two event types improve significantly with EE-GCN (F1 = 75.00% for End-ORG and F1 = 71.43% for Start-ORG), which demonstrates that introducing dependency labels does help to improve ED. Besides, we notice that EE-GCN performs poorly on the event types ACQUIT, EXTRADITE and NOMINATE, which may be attributed to the very small number of annotated instances of these types (only 6, 7 and 12, respectively).

Impact of EE-GCN layers
As EE-GCN can be stacked over $L$ layers, we investigate the effect of the layer number $L$ on the final performance. Different numbers of layers ranging from 1 to 10 are considered. As shown in Figure 5, the performance first increases with the number of EE-GCN layers, reaches its best value at $L = 2$ and degrades as more layers are added; the same trend holds on the test set in Figure 6. We see two reasons for this observation. First, EE-GCN can only utilize first-order syntactic relations over the dependency tree when $L = 1$, which is not enough to bring important context words that are multiple hops away from the event trigger on the dependency tree into the trigger representation. Second, stacking many EE-GCN layers over shallow dependency trees tends to over-smooth the node representations, making them indistinguishable and thus hurting model performance.

Efficiency Advantage
Since EE-GCN and RGCN both exploit syntactic structure and typed dependency labels simultaneously, we compare the efficiency of these two architectures from two aspects: number of parameters and running speed. For the sake of fairness, we run them on the same GPU server with the same batch size. According to our statistics, the EE-GCN and RGCN event detection architectures have 2.39M and 4.12M parameters, respectively. Besides, EE-GCN performs 9.46 times faster than RGCN at inference time. Together with the performance shown in Table 1, we can conclude that EE-GCN not only achieves better performance, but also outperforms RGCN in efficiency. This is mainly because EE-GCN exploits typed dependency labels by mapping them to relation embeddings, while RGCN encodes different types of dependency labels with different convolutional filters. Mathematically, given a graph with $r$ types of relations, the number of relation-related parameters in EE-GCN is only $p \times r$, while that in RGCN is $r \times h \times h$, where $p$ is the dimension of the relation embedding and $h$ is the hidden state size of GCN. Considering that $p$ and $h$ are usually of the same order of magnitude, the number of parameters in RGCN increases more rapidly than in EE-GCN, because $h \times h$ is significantly greater than $p$. Overall, EE-GCN incorporates typed dependency relations without badly hurting efficiency.
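The difference in relation-related parameters can be made concrete with the two formulas above. The sketch below plugs in the settings reported in this paper (about 40 dependency relations, $p = 50$, $h = 150$); it counts only the relation-specific parameters of a single layer, and the function names are made up for this example.

```python
def relation_params_eegcn(num_relations, p):
    """Relation embeddings: one p-dimensional vector per dependency relation."""
    return num_relations * p


def relation_params_rgcn(num_relations, h):
    """Relation-specific filters: one h x h matrix per dependency relation."""
    return num_relations * h * h


r, p, h = 40, 50, 150  # about 40 relations, 50-d label embeddings, hidden size 150
eegcn_count = relation_params_eegcn(r, p)
rgcn_count = relation_params_rgcn(r, h)
ratio = rgcn_count / eegcn_count
```

Under these settings the relation-specific part amounts to 2,000 parameters for EE-GCN against 900,000 for one RGCN layer, a factor of 450.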

Case Study
In this section, we present a visualization of the behavior of EE-GCN on two instances chosen from the ACE2005 test set, with the aim of validating the motivation provided in the introduction. We wish to examine whether EE-GCN indeed focuses on modeling the relationship between event-related words through a per-instance inspection, which is shown in Figure 7. Following Sabour et al. (2017), we use the $\ell_2$ norm of the relation representation in the adjacency tensor of the last EE-GCN layer ($L = 2$) to represent the relevance score of the corresponding word pair. In the first case, each word has a high relevance score with "visited" (the third column), because it is the event trigger. This trigger has the strongest connections with "Putin", "ranch", "November" and "Bush" (the third row), which means that these four words are the top contributors to the detection of "visited" in EE-GCN. Similarly, in the second case, EE-GCN is able to precisely connect the event trigger "arrested" with its subject "Police" and object "people". In general, the visualization result accords with human intuition and shows the power of EE-GCN in capturing event-related relations between words.
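The relevance score used for the visualization can be sketched as the $\ell_2$ norm of each edge vector in the final adjacency tensor. The function name and the toy tensor are assumptions for illustration; the real scores come from the trained model.

```python
import math


def relevance_scores(e):
    """Return an n x n matrix whose (i, j) entry is the l2 norm of E[i][j]."""
    n = len(e)
    return [[math.sqrt(sum(v * v for v in e[i][j])) for j in range(n)] for i in range(n)]


# Toy 2-word adjacency tensor with p = 2 channels.
E_last = [
    [[3.0, 4.0], [0.0, 0.0]],
    [[1.0, 0.0], [0.0, 2.0]],
]
scores = relevance_scores(E_last)
```

Plotting such a matrix as a heat map gives one row and one column per word, so the row and column of a trigger show which words it is most strongly related to.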

Conclusion and Future Works
In this paper, we propose a novel model named Edge-Enhanced Graph Convolutional Networks (EE-GCN) for event detection. EE-GCN introduces typed dependency label information into the graph modeling process, and learns to update the relation representations in a context-dependent manner. Experiments show that our model achieves state-of-the-art results on the ACE2005 dataset. In the future, we would like to apply EE-GCN to other information extraction tasks, such as relation extraction and aspect extraction.