Graph Transformer Networks with Syntactic and Semantic Structures for Event Argument Extraction

The goal of Event Argument Extraction (EAE) is to find the role of each entity mention for a given event trigger word. Prior work has shown that the syntactic structures of sentences are helpful for deep learning models for EAE. However, a major limitation of such prior work is that it fails to exploit the semantic structures of the sentences to induce effective representations for EAE. Consequently, in this work, we propose a novel model for EAE that exploits both syntactic and semantic structures of the sentences with Graph Transformer Networks (GTNs) to learn more effective sentence structures for EAE. In addition, we introduce a novel inductive bias based on the information bottleneck to improve the generalization of the EAE models. Extensive experiments demonstrate the benefits of the proposed model, leading to state-of-the-art performance for EAE on standard datasets.


Introduction
Event Extraction (EE) is an important task of Information Extraction that aims to recognize events and their arguments in text. In the literature, EE is often divided into two sub-tasks: (1) Event Detection (ED) to detect the event trigger words, and (2) Event Argument Extraction (EAE) to identify the event arguments and their roles for the given event triggers. In recent years, ED has been studied extensively with deep learning while EAE is relatively less explored (Wang et al., 2019b). As EAE is necessary to accomplish EE and helpful for many downstream applications (Yang et al., 2003; Cheng and Erk, 2018), further studies are required to improve the performance of EAE. This work focuses on EAE to meet this requirement for EE.
The current state-of-the-art methods for EAE have involved deep learning models that compute an abstract representation vector for each word in the input sentences based on the information from the other context words. The representation vectors for the words are then aggregated to perform EAE (Chen et al., 2015;Nguyen et al., 2016). Our main motivation in this work is to exploit different structures in the input sentences to improve the representation vectors for the words in the deep learning models for EAE. In this work, a sentence structure (or view) refers to an importance score matrix whose cells quantify the contribution of a context word for the representation vector computation of the current word for EAE. In particular, we consider two types of sentence structures in this work, i.e., syntactic and semantic structures. As such, the importance score for a pair of words in the syntactic structures is determined by the syntactic connections of the words in the dependency parsing trees while the contextual semantics of the words in the input sentences are exploited to compute the importance scores in the semantic structures. Consider the following sentence as an example: Iraqi Press constantly report interviews with Hussain Molem, the Hanif Bashir's son-in-law, while US officials confirmed all Bashir's family members were killed last week.
In this sentence, an EAE system should be able to realize the entity mention "Hussain Molem" as the Victim of the Attack event triggered by "killed". As "Hussain Molem" and "killed" are far away from each other in the sentence as well as in its dependency tree, the EAE models might find it challenging to make the correct prediction in this case. In order for the models to be successful in this case, our intuition is that the models should first rely on the direct connections between "killed" and "all Bashir's family members" in the dependency tree to capture the role of "all Bashir's family members" in the representation vectors for "killed". Afterward, the models can rely on the close semantic similarity between "all Bashir's family members" and "the Hanif Bashir's son-in-law" to further connect "the Hanif Bashir's son-in-law" to "killed" so the role information of "the Hanif Bashir's son-in-law" can be recorded in the representation vector for "killed". Finally, the direct apposition relation between "the Hanif Bashir's son-in-law" and "Hussain Molem" can be exploited to connect "Hussain Molem" with "killed" to obtain the necessary representations to perform argument prediction for "Hussain Molem". On the one hand, this example suggests that both syntactic and semantic structures are necessary for the EAE models. On the other hand, the example also hints that it is not enough to apply the syntactic and semantic structures separately. In fact, these structures should be explicitly combined to complement each other in identifying important context words to obtain effective representations for EAE. To our knowledge, this is the first work to explore syntactic and semantic structures for EAE.
How should we combine the syntactic and semantic structures to aid the learning of effective representations for EAE? In this work, we propose to employ Graph Transformer Networks (GTNs) (Yun et al., 2019) to perform the syntax-semantic merging for EAE. GTNs facilitate the combination of multiple input structures via two steps. The first step obtains the weighted sums of the input structures, serving as the intermediate structures that are able to capture the information from different input perspectives (i.e., structures). In the second step, the intermediate structures are multiplied to generate the final structures whose goal is to leverage the multi-hop paths/connections between a pair of nodes/words (i.e., involving the other words) to compute the importance scores for the final structures. As the multi-hop paths with heterogeneous types of connections along the way (i.e., syntax or semantics) have been illustrated to be helpful in our running example (i.e., between "Hussain Molem" and "killed"), we expect that GTNs can help to combine the syntactic and semantic structures to produce effective representations for EAE.
Finally, in order to further boost the performance for EAE, we propose a novel inductive bias for the proposed GTN model, aiming to improve the generalization of GTNs using the Information Bottleneck idea (Tishby et al., 2000; Belghazi et al., 2018). In particular, the use of the rich combined structures from syntax and semantics might augment GTNs with high capacity to encode the detailed information in the input sentences. Coupled with the generally small training datasets for EAE, the GTN models could learn to preserve all the context information in the input sentences, including the irrelevant information for EAE. This likely leads to the overfitting of GTNs on the training data. In order to overcome this issue, we propose to treat the GTN model in this work as an information bottleneck in which the produced representations of GTNs are trained to not only achieve good prediction performance for EAE but also minimize the mutual information with the input sentences (Belghazi et al., 2018). To this end, we introduce the mutual information between the generated representations of GTNs and the input sentences as an additional term in the overall loss function to improve the generalization of GTNs for EAE. Our extensive experiments on two benchmark datasets for EAE show that the proposed model can achieve the state-of-the-art performance for EAE.
Related Work

Among the two subtasks of EE, while ED has been studied extensively by the recent deep learning work (Nguyen and Grishman, 2015; Chen et al., 2015; Nguyen et al., 2016; Chen et al., 2017; Liu et al., 2018a; Zhao et al., 2018; Wang et al., 2019a; Lai et al., 2020c), EAE has been relatively less explored. The closest work to ours is (Wang et al., 2019b), which focuses on EAE and exploits the concept hierarchy of event argument roles to perform the task. Our work differs from (Wang et al., 2019b) in that we employ the syntactic and semantic structures of the sentences to better learn the representations for EAE. We also note some new directions on EE based on zero-shot learning (Huang et al., 2018), few-shot learning (Lai et al., 2020a,b) and multimodal learning.

Model
EAE can be formulated as a multi-class classification problem in which the input involves a sentence W = w_1, w_2, ..., w_N (w_i is the i-th word/token in the sentence of length N), and an argument candidate and an event trigger at indexes a and e in the sentence (i.e., the words w_a and w_e) respectively. The goal in this problem is to predict the role that the argument candidate w_a plays in the event triggered by w_e. Note that the set of the possible roles also includes a special type None to indicate that the argument candidate is not an actual argument of the event (i.e., no roles). In order to achieve a fair comparison, following the prior work on EAE (Wang et al., 2019b), we take as inputs the event triggers that are detected by an independent model (i.e., the model in (Wang et al., 2019a)) separate from our proposed model. We also consider each entity mention in the sentence as a candidate argument for the role prediction task. Our model for EAE in this work involves four major components: (i) sentence encoding, (ii) structure generation, (iii) structure combination, and (iv) model regularization, as described in the following.
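The task interface above can be made concrete with a minimal sketch. The field names and the role inventory below are illustrative placeholders (the actual ACE role set is larger), not definitions from the paper:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical role inventory; "None" marks a candidate that is not an argument.
ROLES = ["None", "Victim", "Attacker", "Place", "Time"]

@dataclass
class EAEExample:
    words: List[str]   # the sentence W = w_1 ... w_N
    a: int             # index of the argument-candidate word w_a
    e: int             # index of the event-trigger word w_e
    role: str          # gold argument role, one of ROLES

ex = EAEExample(
    words="US officials confirmed all members were killed".split(),
    a=4, e=6, role="Victim")
assert ex.words[ex.e] == "killed" and ex.role in ROLES
```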

Sentence Encoding
To represent the sentence, we encode each word w i with a real-valued vector x i that is the concatenation of the two following vectors: (i) the embeddings of the relative distances of the word w i to the argument candidate (i.e., i − a) and event trigger (i.e., i − e) (these embeddings are initialized randomly and updated during training), and (ii) the BERT embedding of the word w i . In particular, to achieve a fair comparison with (Wang et al., 2019b), we run the BERT base cased model (Devlin et al., 2019) over W and use the hidden vector for the first wordpiece of w i in the last layer of BERT as the embedding vector (of 768 dimensions) for w i .
The word encoding step then produces a sequence of vectors X = x_1, ..., x_N to represent the input sentence W. In order to better combine the BERT embeddings and the relative distance embeddings, we further feed X into a Bidirectional Long Short-Term Memory network (BiLSTM), resulting in the hidden vector sequence H = h_1, ..., h_N as the representation vectors for the next steps.
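The word-level input vectors x_i described above can be sketched as follows. The embedding tables here are random stand-ins (in practice the distance table is trained and the 768-dim vectors come from BERT), and the BiLSTM step is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
N, a, e = 7, 4, 6             # sentence length, argument and trigger indexes
DIST_DIM, BERT_DIM = 30, 768  # 30-dim distance embeddings as in the paper

# Randomly initialized relative-distance table, indexed by (i - a) and
# (i - e) shifted by N to be non-negative.
dist_table = rng.normal(size=(2 * N + 1, DIST_DIM))
bert = rng.normal(size=(N, BERT_DIM))  # stand-in for BERT wordpiece vectors

# x_i = [emb(i - a); emb(i - e); BERT(w_i)]
X = np.stack([
    np.concatenate([dist_table[i - a + N], dist_table[i - e + N], bert[i]])
    for i in range(N)
])
assert X.shape == (N, 2 * DIST_DIM + BERT_DIM)  # (7, 828)
```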

Structure Generation
As presented in the introduction, the motivation for our EAE model is to employ the sentence structures to guide the computation of effective representation vectors for EAE with deep learning. These sentence structures would involve the score matrices of size N × N in which the score at the (i, j) cell is expected to capture the importance of the contextual information from w j with respect to the representation vector computation of w i in the deep learning models for EAE (called the importance score for the pair (w i , w j )). In this work, we consider two types of sentence structures for EAE, i.e., the syntactic structures and the semantic structures.
Syntactic Structures: It has been shown in the prior work that the dependency relations in the dependency trees can help to connect a word to its important context words to obtain effective representation vectors for EAE (Sha et al., 2018). To this end, we use the adjacency matrix A^d of the dependency tree for W as one of the syntactic structures for EAE in this work. Note that A^d here is a binary matrix whose cell (i, j) is set to 1 only if w_i and w_j are linked in the dependency tree for W.
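Building the binary adjacency matrix A^d from a dependency parse is straightforward; a small sketch with a toy head array (the tree below is illustrative, not from the paper):

```python
import numpy as np

# Dependency tree as a head array: head[i] is the parent of word i (-1 = root).
head = [2, 2, -1, 4, 6, 6, 2]   # toy 7-word sentence
N = len(head)

A_d = np.zeros((N, N), dtype=int)
for i, h in enumerate(head):
    if h >= 0:                   # undirected edge between child and its head
        A_d[i, h] = A_d[h, i] = 1

assert (A_d == A_d.T).all() and A_d.sum() == 2 * (N - 1)  # tree: N-1 edges
```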
One problem with the A^d structure is that it is agnostic to the argument candidate w_a and event trigger w_e for our EAE task. As the argument candidate and event trigger are the most important words in EAE, we argue that the sentence structures should be customized for these words to produce more effective representation vectors for EAE. In order to obtain task-specific syntactic structures for EAE, our intuition is that the words closer to the argument candidate w_a and event trigger w_e in the dependency tree would be more informative for revealing the contextual semantics of w_a and w_e than the farther words in W (Nguyen and Grishman, 2018). These syntactic neighboring words of w_a and w_e should thus be assigned higher importance scores in the sentence structures, serving as the mechanism to tailor the syntactic structures to the argument candidate and event trigger for EAE in this work. Consequently, besides the general structure A^d, we propose to generate two additional customized syntactic structures for EAE based on the lengths of the paths between w_a, w_e and the other words in the dependency tree of W (i.e., one for the argument candidate and one for the event trigger). In particular, for the argument candidate w_a, we first compute the length d^a_i of the path from w_a to every other word w_i in W. The length d^a_i is then converted to an embedding vector d̄^a_i by looking up a length embedding table D (initialized randomly and updated during training). The scores in the argument-specific structure A^a = {s^a_{i,j}}_{i,j=1..N} are then computed from these length embeddings via: s^a_{i,j} = FF([d̄^a_i, d̄^a_j, d̄^a_i ⊙ d̄^a_j]), where [] is the vector concatenation, ⊙ is the element-wise multiplication, and FF is a two-layer feed-forward network to convert a vector to a scalar. We expect that learning the syntactic structures in this way would introduce the flexibility to infer effective importance scores for EAE.
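The construction of the argument-specific structure can be sketched in two steps: tree distances from w_a via BFS, then a learned scoring function over length embeddings. Since the exact scoring formula is not fully shown in this excerpt, the combination of the two length embeddings below (an element-wise product fed to FF) is an assumption; the weights are random stand-ins for trained parameters:

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(1)
head = [2, 2, -1, 4, 6, 6, 2]    # toy dependency tree (head array)
N, a = len(head), 4
adj = [[] for _ in range(N)]
for i, h in enumerate(head):
    if h >= 0:
        adj[i].append(h); adj[h].append(i)

def tree_dist(src):
    """d_i: length of the tree path from src to every word (BFS)."""
    d = [-1] * N; d[src] = 0; q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if d[v] < 0:
                d[v] = d[u] + 1; q.append(v)
    return d

d_a = tree_dist(a)                # d^a_i for every word w_i

# Length-embedding table D and a two-layer feed-forward net FF -> scalar.
DIM = 30
D = rng.normal(size=(N, DIM))
W1, W2 = rng.normal(size=(DIM, DIM)), rng.normal(size=(DIM,))
FF = lambda v: np.tanh(v @ W1) @ W2

# Assumed form: score each pair (i, j) from both words' length embeddings.
A_a = np.array([[FF(D[d_a[i]] * D[d_a[j]]) for j in range(N)]
                for i in range(N)])
assert A_a.shape == (N, N)
```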
The same procedure can then be applied to generate the trigger-specific syntactic structure A^e = {s^e_{i,j}}_{i,j=1..N} for the event trigger w_e for the EAE model in this work. Finally, the general structure A^d and the task-specific syntactic structures A^a and A^e would be used as the syntactic structures for our structure combination component in the next step.
Semantic Structures: The semantic structure aims to learn the importance score for a pair of words (w_i, w_j) by exploiting the contextual semantics of w_i and w_j in the sentence. As mentioned in the introduction, the semantic structure is expected to provide complementary information to the syntactic structures that, once combined, can lead to effective representation vectors for EAE. In particular, we employ the BiLSTM hidden vectors H = h_1, ..., h_N from the sentence encoding component to capture the contextual semantics of the words for the semantic structure in this work. The semantic importance scores s^s_{i,j} for the semantic structure A^s = {s^s_{i,j}}_{i,j=1..N} are computed via s^s_{i,j} = f(h_i, h_j), where f is some function to produce a score for h_i and h_j. Motivated by the self-attention scores in (Vaswani et al., 2017), we use the following key- and query-based function for f (Equation 1): k_i = U_k h_i, q_i = U_q h_i, s^s_{i,j} = k_i^T q_j, where U_k and U_q are trainable weight matrices (the biases are omitted in this work for brevity). Similar to the general syntactic structure A^d, a problem with this function is that the semantic scores s^s_{i,j} are not aware of the argument candidate and event trigger words, the two most important words for EAE. To this end, we propose to involve the contextual semantics of the argument candidate w_a and event trigger w_e (i.e., h_a and h_e) in the computation of the semantic structure scores s^s_{i,j} for EAE (Equation 2): k'_i = k_i ⊙ c_k, q'_i = q_i ⊙ c_q, s^s_{i,j} = k'^T_i q'_j. The rationale in this formula is to use the hidden vectors for the argument candidate w_a and event trigger w_e to generate the task-specific control vectors c_k and c_q. These control vectors are then employed to filter the information in the key and query vectors (i.e., k_i and q_i) so only the relevant information about w_a and w_e is preserved in k_i and q_i via the element-wise products ⊙. The resulting key and query vectors (i.e., k'_i and q'_i) are then utilized to compute the task-specific importance score s^s_{i,j} for the semantic structure A^s in this work.
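The key/query scoring with control vectors can be sketched as follows. How c_k and c_q are produced from h_a and h_e is not fully specified in this excerpt, so the linear map U_c below is an assumption; all weights are random stand-ins for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
N, H = 7, 16
h = rng.normal(size=(N, H))      # BiLSTM hidden vectors h_1 .. h_N
a, e = 4, 6                      # argument-candidate and trigger indexes

U_k, U_q = rng.normal(size=(H, H)), rng.normal(size=(H, H))
k, q = h @ U_k, h @ U_q          # plain keys/queries (the Equation-1 form)

# Assumed: control vectors from h_a and h_e via a shared linear map.
U_c = rng.normal(size=(2 * H, H))
c_k = np.tanh(np.concatenate([h[a], h[e]]) @ U_c)
c_q = np.tanh(np.concatenate([h[e], h[a]]) @ U_c)

k2, q2 = k * c_k, q * c_q        # filter keys/queries element-wise
A_s = k2 @ q2.T                  # s^s_{i,j} = k'_i . q'_j
assert A_s.shape == (N, N)
```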

Structure Combination
The four initial structures in A = [A d , A a , A e , A s ] can be interpreted as four different types of relations between the pairs of words in W (i.e., using the syntactic and semantic information) (called the word relation types). The cell (i, j) in each initial structure is deemed to capture the degree of connection between w i and w j based on their direct interaction/edge (i.e., the one-hop path (w i , w j )) and the corresponding relation type for this structure. Given this interpretation, this component seeks to combine the four initial structures in A to obtain richer sentence structures for EAE. On the one hand, we expect the importance scores between a pair of words (w i , w j ) in the combined structures to be able to condition on the possible interactions between w i , w j and the other words in the sentence (i.e., the multi-hop paths between w i and w j that involve the other words). On the other hand, the multi-hop paths between w i and w j for the importance scores should also be able to accommodate the direct edges/connections between the words of different relation types (i.e., the heterogeneous edge types). Note that both the multi-hop paths and the heterogeneous edge types along the paths (i.e., for syntax and semantics) have been demonstrated to be helpful for EAE in the introduction. Consequently, in this work, we propose to apply Graph Transformer Networks (GTNs) (Yun et al., 2019) to simultaneously achieve these two goals for EAE.
In particular, following (Yun et al., 2019), we first add the identity matrix I (of size N × N) into the set of structures in A to enable GTNs to learn multi-hop paths of different lengths, i.e., A = [A^d, A^a, A^e, A^s, I]. Given the initial structures in A and inspired by the transformers in (Vaswani et al., 2017), the GTN model is organized into C channels, in which the i-th channel involves M intermediate structures Q^i_1, Q^i_2, ..., Q^i_M, each obtained as a weighted sum of the initial structures in A (with the weights learned during training). The weighted sums enable the intermediate structures to reason with any of the initial word relation types depending on the context, thus offering structural flexibility for the model. Afterward, in order to capture the multi-hop paths for the importance scores in the i-th channel, the intermediate structures are multiplied to obtain a single sentence structure Q^i for this channel: Q^i = Q^i_1 × Q^i_2 × ... × Q^i_M (called the final structures). It has been shown in (Yun et al., 2019) that Q^i is able to model any multi-hop paths between the words with lengths up to M. Such multi-hop paths can also host heterogeneous word relation types in their edges (due to the flexibility of the intermediate structures Q^i_j), thus introducing rich sentence structures for our EAE problem.
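The two GTN steps (softmax-weighted sums over the relation types, then a matrix product over the M intermediate structures) can be sketched as follows; the initial structures are random stand-ins and the mixing weights would be learned in practice:

```python
import numpy as np

rng = np.random.default_rng(3)
N, C, M = 7, 3, 3
# Stack of initial structures: A^d, A^a, A^e, A^s plus the identity I.
A = np.stack([rng.random((N, N)) for _ in range(4)] + [np.eye(N)])
K = A.shape[0]                                   # 5 relation types

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

finals = []
for _ in range(C):                               # one final structure per channel
    Q = np.eye(N)
    for _ in range(M):                           # M intermediate structures
        w = softmax(rng.normal(size=K))          # learnable mixing weights
        Q_m = np.tensordot(w, A, axes=1)         # weighted sum over relation types
        Q = Q @ Q_m                              # multiply -> multi-hop paths
    finals.append(Q)

assert len(finals) == C and finals[0].shape == (N, N)
```

Because every Q^i is a product of M mixtures, cell (i, j) of a final structure aggregates all length-≤M paths between w_i and w_j, with possibly different relation types on each hop.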
In the next step, the final structures Q^1, Q^2, ..., Q^C of GTN are treated as different adjacency matrices for the fully connected graph between the words in W. These matrices are then consumed by a Graph Convolutional Network (GCN) model (Kipf and Welling, 2017) to produce more abstract representation vectors for the words in our EAE problem. In particular, the GCN model in this work consists of several layers (i.e., G layers in our case) to compute representation vectors at different abstraction levels for the words in W. For the k-th final structure Q^k, the representation vector for the word w_i in the t-th GCN layer is computed via: ĥ^{k,t}_i = ReLU(U^t Σ_{j=1..N} Q^k_{i,j} ĥ^{k,t-1}_j), where U^t is the weight matrix for the t-th GCN layer and the input vectors ĥ^{k,0}_i for the GCN model are obtained from the BiLSTM hidden vectors (i.e., ĥ^{k,0}_i = h_i). Given the outputs from the GCN model, the hidden vectors in the last GCN layer (i.e., the G-th layer) of the word w_i for all the final structures (i.e., ĥ^{1,G}_i, ĥ^{2,G}_i, ..., ĥ^{C,G}_i) are then concatenated to form the final representation vector h'_i for w_i in the proposed GTN model: h'_i = [ĥ^{1,G}_i, ĥ^{2,G}_i, ..., ĥ^{C,G}_i]. Finally, in order to predict the argument role for w_a and w_e in W, we assemble a representation vector R based on the hidden vectors for w_a and w_e from the GCN model via: R = [h'_a, h'_e]. This vector is then sent to a two-layer feed-forward network with a softmax layer in the end to produce a probability distribution P(·|W, a, e) over the possible argument roles. We then optimize the negative log-likelihood L_pred to train the model in this work: L_pred = −log P(y|W, a, e), where y is the golden argument role for the input example.
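The GCN-over-final-structures step and the prediction head can be sketched together. The exact form of R is partly reconstructed here (concatenation of the vectors for w_a and w_e), and all weights are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(4)
N, H, G, C, N_ROLES = 7, 16, 2, 3, 5
a, e = 4, 6
h = rng.normal(size=(N, H))                      # BiLSTM vectors (GCN input)
finals = [rng.random((N, N)) for _ in range(C)]  # GTN final structures Q^1..Q^C

relu = lambda x: np.maximum(x, 0.0)
U = [rng.normal(size=(H, H)) * 0.1 for _ in range(G)]

# One GCN stack per final structure Q^k; layer t:
#   hhat^{k,t} = ReLU(Q^k @ hhat^{k,t-1} @ U^t), with hhat^{k,0} = h
reps = []
for Q in finals:
    hh = h
    for t in range(G):
        hh = relu(Q @ hh @ U[t])
    reps.append(hh)

h_prime = np.concatenate(reps, axis=-1)          # h'_i: concat over channels
R = np.concatenate([h_prime[a], h_prime[e]])     # representation for (w_a, w_e)

# Two-layer feed-forward + softmax over the argument roles.
W1 = rng.normal(size=(R.size, H)) * 0.1
W2 = rng.normal(size=(H, N_ROLES)) * 0.1
logits = np.tanh(R @ W1) @ W2
P = np.exp(logits - logits.max()); P /= P.sum()
assert P.shape == (N_ROLES,) and abs(P.sum() - 1.0) < 1e-9
```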

Model Regularization
As presented in the introduction, the high representation learning capacity of the GTN model could lead to memorizing the information that is only specific to the training data (i.e., overfitting). In order to improve the generalization, we propose to regularize the representation vectors obtained by GTN so only the effective information for EAE is preserved in the GTN representations for argument prediction and the nuisance information of the training data (i.e., the irrelevant one for EAE) can be avoided. To this end, we propose to treat the GTN model as an Information Bottleneck (IB) (Tishby et al., 2000) in which the GTN-produced representation vectors H' = h'_1, h'_2, ..., h'_N would be trained to both (1) retain the effective information to perform argument prediction for EAE (i.e., the high prediction capacity) and (2) maintain a small Mutual Information (MI) with the representation vectors of the words from the earlier layers of the model (i.e., the minimality of the representations) (Belghazi et al., 2018). In this work, we follow the common practice to accomplish the high prediction capacity by using the GTN representation vectors to predict the argument roles and minimizing the induced negative log-likelihood in the training phase. However, for the minimality of the representations, we propose to achieve this by explicitly minimizing the MI between the GTN-produced vectors H' = h'_1, h'_2, ..., h'_N and the BiLSTM hidden vectors H = h_1, h_2, ..., h_N from sentence encoding. By encouraging a small MI between H and H', we expect that only the relevant information for EAE in H is passed through the GTN bottleneck to be retained in H' for better generalization.
As H and H' are sequences of vectors, we first transform them into the summarized vectors h and h' (respectively) to facilitate the MI estimation via the max-pooling function: h = max_pool(h_1, ..., h_N), h' = max_pool(h'_1, ..., h'_N). Afterward, we seek to compute the MI between h and h' and include it in the overall loss function for minimization. However, as h and h' are high-dimensional vectors, their MI estimation is prohibitively expensive in this case. To this end, we propose to apply the mutual information neural estimation (MINE) method in (Belghazi et al., 2018) to approximate the MI with its lower bound. In particular, motivated by (Hjelm et al., 2019), we propose to further approximate the lower bound of the MI between h and h' via the adversarial approach using the loss function of a variable discriminator. As the MI between h and h' is defined as the KL divergence between the joint and marginal distributions of these two variables, the discriminator aims to differentiate the vectors that are sampled from the joint distribution from those sampled from the product of the marginal distributions for h and h'. In our case, we sample from the joint distribution for h and h' by simply concatenating the two vectors (i.e., [h', h]) and treat it as the positive example. To obtain a sample from the product of the marginal distributions, we concatenate h' with ĥ, the aggregated vector (via max-pooling) of the BiLSTM hidden vectors for another sentence (obtained from the same batch as the current sentence during training) (i.e., [h', ĥ] as the negative example). These positive and negative examples are then fed into a two-layer feed-forward network D (i.e., the discriminator) to produce a scalar score, serving as the probability to perform a binary classification for the variables.
Afterward, the logistic loss L_disc of D is used as an estimate of the MI between h and h' and added into the overall loss function for minimization: L_disc = −log D([h', h]) − log(1 − D([h', ĥ])). Finally, the overall loss function L to train the model in this work is: L = L_pred + α_disc L_disc, where α_disc is a trade-off parameter.
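The discriminator-based MI term can be sketched end to end: max-pool the sequences, score a positive pair (same sentence) and a negative pair (another sentence from the batch), and take the logistic loss. The discriminator architecture below (tanh hidden layer, sigmoid output) is an assumption, and all vectors/weights are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(5)
N, H = 7, 16
h_lstm = rng.normal(size=(N, H))       # BiLSTM vectors H for current sentence
h_gtn = rng.normal(size=(N, H))        # GTN-produced vectors H'
h_other = rng.normal(size=(N, H))      # BiLSTM vectors for another sentence

# Max-pool each sequence into a single summary vector.
h = h_lstm.max(axis=0)
h_p = h_gtn.max(axis=0)                # h'
h_hat = h_other.max(axis=0)            # h-hat from the other sentence

# Two-layer discriminator D: concat -> hidden -> sigmoid probability.
W1 = rng.normal(size=(2 * H, H)) * 0.1
w2 = rng.normal(size=(H,)) * 0.1
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
D = lambda v: sigmoid(np.tanh(v @ W1) @ w2)

pos = D(np.concatenate([h_p, h]))      # sample from the joint distribution
neg = D(np.concatenate([h_p, h_hat]))  # sample from product of marginals

# Logistic loss of D, used as the MI estimate added to the overall loss.
L_disc = -np.log(pos) - np.log(1.0 - neg)
alpha_disc = 0.1
L_pred = 1.5                           # placeholder prediction loss
L = L_pred + alpha_disc * L_disc
assert L_disc > 0 and L > L_pred
```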

Experiments
Datasets & Parameters: We evaluate the models on two benchmark datasets, i.e., ACE 2005 and TAC KBP 2016 (Ellis et al., 2016). ACE 2005 is a widely used EE dataset, involving 599 documents, 33 event subtypes and 35 argument roles. We use the same data split as the prior work (Chen et al., 2015; Wang et al., 2019b) for a fair comparison (i.e., 40 documents for the test data, 30 other documents for the development set, and the remaining 529 documents for the training data). For TAC KBP 2016, as no training data is provided, following (Wang et al., 2019b), we use the ACE 2005 training data to train the models and then evaluate them on the TAC KBP 2016 test data. To evaluate the models' performance, for a fair comparison with the previous work (Chen et al., 2015; Wang et al., 2019b), we consider an argument classification as correct if its predicted event subtype, offsets and argument role match the golden data.
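The scoring criterion above (an argument is correct only if event subtype, offsets, and role all match) can be sketched as a small predicate; the field names are illustrative:

```python
# A predicted argument counts as correct only if event subtype, argument
# offsets, and argument role all match the gold annotation.
def correct(pred, gold):
    return (pred["subtype"] == gold["subtype"]
            and pred["offsets"] == gold["offsets"]
            and pred["role"] == gold["role"])

gold = {"subtype": "Conflict.Attack", "offsets": (12, 14), "role": "Victim"}
assert correct(dict(gold), gold)                    # exact match
assert not correct({**gold, "role": "Target"}, gold)  # wrong role fails
```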
We fine-tune the hyper-parameters for our model on the ACE 2005 development set, leading to the following values: 30 dimensions for the relative distance and length embeddings (i.e., D), 200 hidden units for the feed-forward network, BiLSTM and GCN layers, 2 layers for the BiLSTM and GCN modules (G = 2), C = 3 channels for GTN with M = 3 intermediate structures in each channel, and 0.1 for the parameter α_disc. Finally, besides BERT, we also evaluate the proposed model when BERT is replaced by the word2vec embeddings (of 300 dimensions) (Mikolov et al., 2013) to make it comparable with some prior works. Note that, as in (Wang et al., 2019b), the proposed model with BERT takes as inputs the predicted event triggers from the BERT-based ED model in (Wang et al., 2019a) while the proposed model with word2vec utilizes the predicted event triggers from the word2vec-based ED model in (Chen et al., 2015) for compatibility.
Comparison with the State of the Art: To evaluate the effectiveness of the proposed model (called SemSynGTN), we first compare it with the baselines on the ACE 2005 dataset. Following (Wang et al., 2019b), we use the following baselines in our experiments: (i) the feature-based models (i.e., Li's Joint (Li et al., 2013) and RBPB (Sha et al., 2016)), (ii) the deep sequence-based models that run over the sequential order of the words in the sentences (i.e., DMCNN (Chen et al., 2015), JRNN (Nguyen et al., 2016), PLMEE (Yang et al., 2019), and DMBERT (i.e., DMCNN with BERT) (Wang et al., 2019b)), (iii) the deep structure-based models that employ dependency trees for BiLSTMs or GCNs (i.e., dbRNN (Sha et al., 2018) and JMEE (Liu et al., 2018b)), (iv) the model with Generative Adversarial Imitation Learning (GAIL (Zhang et al., 2019a)), and (v) the deep learning model that exploits the hierarchical concept correlation among argument roles (i.e., HMEAE (Wang et al., 2019b)).

Table 1 (excerpt, P / R / F1 on the ACE 2005 test set):
RBPB (Sha et al., 2016): 54.1 / 53.5 / 53.8
DMCNN (Chen et al., 2015): 62.2 / 46.9 / 53.5
JRNN (Nguyen et al., 2016): 54.2 / 56.7 / 55.4
dbRNN (Sha et al., 2018): 66 / … / …

Table 1 presents the performance of the models on the ACE 2005 test set. Note that we distinguish between the models that employ BERT for the pre-trained word embeddings and those that do not for a clear comparison in the table. The most important observation is that the proposed model SemSynGTN significantly outperforms all the baseline models (with p < 0.01), regardless of whether BERT is used for the pre-trained word embeddings or not. SemSynGTN achieves the state-of-the-art performance on ACE 2005 when BERT is applied in the model, thus demonstrating the benefits of the proposed model with the syntactic and semantic structure combination for EAE in this work.
In order to further demonstrate the effectiveness of the proposed model, following the previous work (Wang et al., 2019b), we evaluate the models on the TAC KBP 2016 dataset. In particular, we compare SemSynGTN with the top four systems in the TAC KBP 2016 evaluation (Dubbin et al., 2016; Hsi et al., 2016; Ferguson et al., 2016), the DMCNN model in (Chen et al., 2015), and the DMBERT and HMEAE models in (Wang et al., 2019b). Note that HMEAE is also the state-of-the-art model for EAE on this dataset. The results are shown in Table 2, which corroborates our findings from Table 1. Specifically, SemSynGTN significantly outperforms all the baseline models with large margins (p < 0.01) (whether BERT is used or not), thus confirming the advantages of the SemSynGTN model in this work.
Table 2 (excerpt, P / R / F1 on the TAC KBP 2016 test set):
DISCERN-R (Dubbin et al., 2016): 7.9 / 7.4 / 7.7
Washington4 (Ferguson et al., 2016): 32.1 / 5.0 / 8.7
CMU CS Event 1 (Hsi et al., 2016): 31.2 / 4.9 / 8.4
Washington1 (Ferguson et al., 2016): 26.5 / 6.8 / 10.8
DMCNN (Chen et al., 2015): 17 / … / …

Ablation Study: This part analyzes the effectiveness of the components in the proposed model for EAE by removing each of them from the overall model and evaluating the performance of the remaining models on the ACE 2005 development set. In particular, the first major component in SemSynGTN involves structure customization, which seeks to tailor the initial syntactic and semantic structures to the argument candidate and event trigger. We evaluate two ablated models for this component: (i) eliminating the task-specific syntactic customization from SemSynGTN, which amounts to excluding the customized syntactic structures A^a and A^e from the initial structure set A (called SemSynGTN - SynCustom), and (ii) removing the task-specific semantic customization from SemSynGTN, which leads to the use of the simple key-query version (i.e., Equation 1) to compute the importance scores in the semantic structure A^s (i.e., instead of using Equation 2 as in the full model) (called SemSynGTN - SemCustom).
The second major component is structure combination, which aims to generate the mixed structures from the initial structures in A via GTN. We consider two ablated models for this component. Finally, the third component corresponds to the regularization loss based on the information bottleneck L_disc in Section 3.4. The removal of L_disc from the overall loss L leads to the ablated model SemSynGTN - IB. As this component relies on the MI between the hidden vectors returned by the BiLSTM and GTN models, we evaluate another variant of SemSynGTN in this case where the regularization loss is eliminated, but the hidden vectors from the BiLSTM model h_1, h_2, ..., h_N are included in the final representation vector R for argument role prediction (called SemSynGTN - IB + LSTM in R). The table clearly shows that all the components are necessary for SemSynGTN to achieve the highest performance. In particular, the GTN model, the syntactic and semantic structure customization, and the structure multiplication are all important as eliminating any of them hurts the performance significantly. This evidence highlights the importance of combining the customized sentence structures for EAE in this work. In addition, "SemSynGTN - Bottleneck" and "SemSynGTN - IB + LSTM in R" are also significantly worse than SemSynGTN, suggesting the effectiveness of the information bottleneck in regularizing the model for better generalization of GTNs in EAE.

Structure Analysis: The proposed model generates four initial sentence structures in A (i.e., A^d, A^a, A^e, and A^s) to capture the general and task-specific structures for EAE based on the syntactic and semantic information. In order to evaluate their contribution to SemSynGTN, Table 4 presents the performance of the remaining models when each of these structures is eliminated from the model (i.e., from A).
It is clear from the table that the model performance is significantly worse when we remove any of the initial structures in A, thus demonstrating the benefits of such structures for EAE.

Information Bottleneck Analysis: In order to prevent overfitting for the GTN model in this work, we propose to cast GTN as an information bottleneck that seeks to minimize the mutual information between the GTN-produced vectors H' = h'_1, h'_2, ..., h'_N and the representation vectors of the words from the earlier layers of the model (i.e., prior to GTN, for the minimality of the representations). In particular, in the implementation of this idea, we propose to achieve the minimality of the representations by minimizing the mutual information between the vectors in H' and the BiLSTM hidden vectors H = h_1, h_2, ..., h_N in the sentence encoding. However, there are other layers prior to GTN whose hidden vectors can also be used for this MI minimization, including (1) the BERT-generated vectors for the words in the input sentence (i.e., E = e_1, ..., e_N where e_i is the hidden vector of the first wordpiece of w_i in the last layer of the BERT model), and (2) the input vectors X = x_1, ..., x_N for the BiLSTM layer in the sentence encoding, where x_i is the concatenation of e_i and the relative distance embeddings for w_i toward w_a and w_e. In this analysis, we evaluate the performance of SemSynGTN when the BiLSTM vectors H are replaced by the vectors in E or X in the computation of L_disc for the MI minimization. The first observation from the table is that SemSynGTN with the vectors in X for the MI performs better than that with the BERT-generated vectors E. We attribute this to the fact that, in addition to the BERT-generated vectors in E, the vectors in X also include the relative distance embeddings of the words (i.e., for the argument candidate and trigger word).
In principle, this makes X more compatible with H' than E, as both X and H' have access to the relative distances of the words to capture the positions of the argument candidate and trigger word in the sentences. Such compatibility of information sources enables a more meaningful comparison between X and H' for the MI minimization, providing more effective training signals to improve the representation vectors for EAE. More importantly, we see that the proposed MI minimization mechanism between H' and H helps SemSynGTN achieve significantly better performance than the other variants (i.e., with X or E for the MI). This clearly justifies our proposal of employing the BiLSTM hidden vectors H to compute L_disc in this work. In fact, the advantage of H over X for the MI minimization demonstrates the benefit of the BiLSTM layer in better combining the BERT-generated vectors e_i and the relative distance embeddings for w_i in x_i to generate effective hidden vectors for the MI-based comparisons with H' for EAE.
Performance Analysis: To understand how the proposed model improves the performance over the baselines, we examine the outputs of SemSynGTN and the two major baseline models, i.e., (1) HMEAE (Wang et al., 2019b), the most related work, which ignores the syntactic and semantic structures and previously had the best BERT-based performance for EAE, and (2) SemSynGTN - GTN, which considers the syntactic and semantic structures but does not model their interactions to capture multi-hop paths with GTN. Our investigation suggests that while SemSynGTN outperforms HMEAE and SemSynGTN - GTN in general, the performance gaps between the models become substantially larger for the sentences with large numbers of words (i.e., distances) between the argument candidates and event triggers (called #bw). In particular, Table 6 presents the performance of the three models on two subsets of the ACE 2005 development set, i.e., one with #bw ≤ 10 and one with #bw > 10. As we can see, the performance gaps between SemSynGTN and the two baselines on the subset with #bw > 10 are much larger than those with #bw ≤ 10. We attribute this better performance of SemSynGTN to its abilities to employ the combined structures based on syntax and semantics, and to model the multi-hop paths between words to compute the importance scores in the final structures of GTN. These abilities essentially enable SemSynGTN to capture longer and more flexible paths between words to compute effective representations for EAE. SemSynGTN is then able to perform better for the sentences with large #bw, where encoding more context words is necessary to achieve high performance.