Graph based Neural Networks for Event Factuality Prediction using Syntactic and Semantic Structures

Event factuality prediction (EFP) is the task of assessing the degree to which an event mentioned in a sentence has happened. For this task, both syntactic and semantic information are crucial for identifying the important context words. Previous work on EFP has combined this information only in a simple way that cannot fully exploit their coordination. In this work, we introduce a novel graph-based neural network for EFP that can integrate the semantic and syntactic information more effectively. Our experiments demonstrate the advantage of the proposed model for EFP.


Introduction
Events are often presented in sentences via anchor/trigger words (i.e., the main words that evoke the events, called event mentions) (Nguyen et al., 2016a). Event mentions can appear with varying degrees of uncertainty/factuality to reflect the intent of the writers. For event mentions to be useful (e.g., for knowledge extraction tasks), it is important to determine their factual certainty so that the events that actually happened can be retrieved; this is the event factuality prediction (EFP) problem. In this work, we focus on the recent regression formulation of EFP that aims to predict a real score in the range of [-3,+3] to quantify the occurrence possibility of a given event mention (Stanovsky et al., 2017; Rudinger et al., 2018). This provides more meaningful information for downstream tasks than the classification formulation of EFP (Lee et al., 2015). For instance, the word "left" in the sentence "She left yesterday." expresses an event that certainly happened (i.e., corresponding to a score of +3 in the benchmark datasets), while the event mention associated with "leave" in the sentence "She forgot to leave yesterday." certainly did not happen (i.e., a score of -3). EFP is a challenging problem because different context words (i.e., the cue words) might jointly reveal the factuality of an event mention, and these words may be located in different parts of the sentence, far away from the anchor word of the event. There are two major mechanisms that can help models identify the cue words and link them to the anchor words: the syntactic trees (i.e., the dependency trees) and the semantic information (Rudinger et al., 2018). The syntactic trees can connect the anchor words to function words (e.g., negation, modal auxiliaries) that are far away but convey important information affecting the factuality of the event mentions.
For instance, the dependency tree of the sentence "I will, after seeing the treatment of others, go back when I need medical care." directly links the anchor word "go" to the modal auxiliary "will", which helps to successfully predict the non-factuality of the event mention. Regarding the semantic information, the meaning of some important context words in the sentences can contribute significantly to the factuality of an event mention. For example, in the sentence "Knight lied when he said I went to the ranch.", the meaning represented by the cue word "lied" is crucial for classifying the event mention associated with the anchor word "went" as non-factual. The meaning of such cue words and their interactions with the anchor words can be captured via their distributed representations (e.g., with word embeddings and long short-term memory networks (LSTMs)) (Rudinger et al., 2018).
The current state-of-the-art approach for EFP involves deep learning models (Rudinger et al., 2018) that examine both syntactic and semantic information in the modeling process. However, in these models, the syntactic and semantic information are employed separately in different deep learning architectures to generate syntactic and semantic representations. Such representations are only concatenated in the final stage to perform the factuality prediction. A major problem with this approach occurs for event mentions where neither the syntactic nor the semantic information can identify the important structures for EFP by itself. In such cases, both the syntactic and semantic representations from the separate deep learning models would be noisy and/or insufficient, so their simple combination would be of poor quality for EFP. For instance, consider the previous example with the anchor word "go": "I will, after seeing the treatment of others, go back when I need medical care." On the one hand, while the syntactic information (i.e., the dependency tree) can directly connect "will" to "go", it will also promote some noisy words (e.g., "back") at the same time due to their direct links (see the dependency tree in Figure 1). On the other hand, while deep learning models with the sequential structure can help to downgrade the noisy words (e.g., "back") based on their semantic importance despite their close distance to "go", these models will struggle to capture "will" for the factuality of "go" due to the long distance between them.
From this example, we also see that the syntactic and semantic information can complement each other to both promote the important context words and blur the irrelevant words. Consequently, we argue that the syntactic and semantic information should be allowed to interact earlier in the modeling process to produce more effective representations for EFP. In particular, we propose a novel method to integrate the syntactic and semantic structures of the sentences based on graph convolutional neural networks (GCNs) (Kipf and Welling, 2016) for EFP. GCNs rely on affinity matrices that quantify the connection strength between pairs of words, which facilitates the integration of syntactic and semantic information. In the proposed model, the semantic affinity matrices of the sentences are induced from long short-term memory networks (LSTMs) and then linearly combined with the syntactic affinity matrices of the dependency trees to produce the enriched affinity matrices for the GCNs in EFP. Our experiments on four benchmark datasets show that the proposed model is very effective for EFP.

Related Work
EFP is one of the fundamental tasks in Information Extraction. Early work on this problem employed rule-based approaches (Nairn et al., 2006; Saurí, 2008; Lotan et al., 2013), machine learning approaches with manually designed features (Diab et al., 2009; Prabhakaran et al., 2010; De Marneffe et al., 2012; Lee et al., 2015), or hybrid approaches of both (Saurí and Pustejovsky, 2012; Qian et al., 2015). Recently, deep learning has been applied to EFP. Qian et al. (2018) employ Generative Adversarial Networks (GANs) for EFP, while Rudinger et al. (2018) utilize LSTMs for both sequential and dependency representations of the input sentences. Finally, deep learning has also been considered for tasks related to EFP, including event detection (Nguyen and Grishman, 2015b; Nguyen et al., 2016b; Lu and Nguyen, 2018; Nguyen and Nguyen, 2019), event realis classification (Mitamura et al., 2015; Nguyen et al., 2016g), uncertainty detection (Adel and Schütze, 2017), modal sense classification (Marasovic and Frank, 2016) and entity detection (Nguyen et al., 2016d).
[...] the current event mention has happened. There are three major components in the EFP model proposed in this work: (i) sentence encoding, (ii) structure induction, and (iii) prediction.

Sentence Encoding
The first step is to convert each word in the sentences into an embedding vector. In this work, we employ the contextualized word representations of BERT (Devlin et al., 2018) for this purpose. BERT is a pre-trained language representation model with multiple computation layers that has been shown to improve many NLP tasks. In particular, the sentence $(x_1, x_2, \ldots, x_n)$ is first fed into the pre-trained BERT model, and the contextualized embeddings of the words in the last layer are used for further computation. We denote these word embeddings for the words in $(x_1, x_2, \ldots, x_n)$ as $(e_1, e_2, \ldots, e_n)$, respectively.
In the next step, we further abstract $(e_1, e_2, \ldots, e_n)$ for EFP by feeding them into two layers of bidirectional LSTMs (as in (Rudinger et al., 2018)). This produces $(h_1, h_2, \ldots, h_n)$ as the hidden vector sequence in the last bidirectional LSTM layer (i.e., the second one). We consider $(h_1, h_2, \ldots, h_n)$ as a rich representation of the input sentence $(x_1, x_2, \ldots, x_n)$ where each vector $h_i$ encapsulates the context information of the whole input sentence with a greater focus on the current word $x_i$.

Structure Induction
Given the hidden representation $(h_1, h_2, \ldots, h_n)$, it is possible to use the hidden vector $h_k$ corresponding to the anchor word as the features to perform factuality prediction (as done in (Rudinger et al., 2018)). However, despite the rich context information over the whole sentence, the features in $h_k$ are not directly designed to focus on the important context words for factuality prediction. In order to explicitly encode the information of the cue words into the representation of the anchor word, we propose to learn an importance matrix $A = (a_{ij})_{i,j=1..n}$ in which the value $a_{ij}$ quantifies the contribution of the context word $x_i$ to the hidden representation at $x_j$ when the representation vector at $x_j$ is used to form features for EFP. The importance matrix $A$ is then used as the adjacency/weight matrix in graph convolutional neural networks (GCNs) (Kipf and Welling, 2016; Nguyen and Grishman, 2018) to accumulate the current hidden representations of the context words into the new hidden representations for each word in the sentence.
In order to learn the weight matrix $A$, as presented in the introduction, we propose to leverage both the semantic and syntactic structures of the input sentence. In particular, for the semantic structure, we use the representation vectors from the LSTMs for $x_i$ and $x_j$ (i.e., $h_i$ and $h_j$) as the features to compute the contribution score $a^{sem}_{ij}$ of the semantic weight matrix $A^{sem} = (a^{sem}_{ij})_{i,j=1..n}$. Note that we omit the biases in the equations of this paper for convenience. Essentially, $a^{sem}_{ij}$ is a scalar that determines the amount of information that should be sent from the context word $x_i$ to the representation at $x_j$ based on the semantic relevance for EFP.
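The defining equations for $a^{sem}_{ij}$ do not appear in the text above, so the following is only a minimal sketch of one plausible scoring scheme. It assumes that each $h_i$ is first projected to a vector $h'_i = \tanh(W_1 h_i)$ and that the score is a sigmoid over a linear function of the concatenation of $h'_i$ and $h'_j$. The names `W1`, `w2`, and `semantic_affinity`, as well as this specific functional form, are illustrative assumptions rather than the model's confirmed definition.

```python
import math


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def semantic_affinity(H, W1, w2):
    """Return an n x n matrix of scores a_sem[i][j] in (0, 1).

    H  : list of n hidden vectors h_i (e.g., BiLSTM outputs)
    W1 : projection matrix, h'_i = tanh(W1 h_i)             (assumed form)
    w2 : weight vector applied to the concatenation of
         h'_i and h'_j, followed by a sigmoid               (assumed form)
    Biases are omitted, as in the paper's equations.
    """
    P = [[math.tanh(z) for z in matvec(W1, h)] for h in H]
    n = len(P)
    return [
        [sigmoid(sum(w * x for w, x in zip(w2, P[i] + P[j]))) for j in range(n)]
        for i in range(n)
    ]


# Toy example: three words with 2-dimensional hidden vectors.
H = [[0.5, -1.0], [1.0, 0.0], [0.0, 2.0]]
W1 = [[1.0, 0.0], [0.0, 1.0]]
w2 = [0.5, -0.5, 1.0, 0.25]
A_sem = semantic_affinity(H, W1, w2)
```

Because the score depends on the ordered pair $(i, j)$, the resulting matrix is generally not symmetric, which matches the directional reading of "information sent from $x_i$ to $x_j$".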
In the next step, for the syntactic structure, we employ the dependency tree of the input sentence to generate the adjacency/weight matrix $A^{syn} = (a^{syn}_{ij})_{i,j=1..n}$, where $a^{syn}_{ij}$ is set to 1 if $x_i$ is connected to $x_j$ in the tree, and 0 otherwise. Note that we augment the dependency trees with self-connections and reverse edges to improve the coverage of the weight matrix.
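The construction of the syntactic matrix can be sketched as follows. The sketch assumes the dependency tree is given as a list of head indices (with -1 for the root); this input format, the function name `syntactic_adjacency`, and the example parse are illustrative assumptions.

```python
def syntactic_adjacency(heads):
    """Build a 0/1 adjacency matrix from dependency heads.

    heads[i] is the index of the head of word i, or -1 if word i is the root.
    Self-connections and reverse edges are added, so the matrix is symmetric
    with ones on the diagonal.
    """
    n = len(heads)
    A = [[0.0] * n for _ in range(n)]
    for i, head in enumerate(heads):
        A[i][i] = 1.0          # self-connection
        if head >= 0:
            A[head][i] = 1.0   # edge head -> dependent
            A[i][head] = 1.0   # reverse edge dependent -> head
    return A


# Hypothetical parse of "She forgot to leave yesterday":
# She -> forgot, forgot = root, to -> leave, leave -> forgot, yesterday -> leave
heads = [1, -1, 3, 1, 3]
A_syn = syntactic_adjacency(heads)
```

In this toy parse, "leave" is directly connected to "forgot", so the cue word sits one hop from the anchor word in the graph.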
Finally, the weight matrix $A$ for the GCNs is the linear combination of the semantic structure $A^{sem}$ and the syntactic structure $A^{syn}$ with the trade-off parameter $\lambda$. Given the weight matrix $A$, the GCNs (Kipf and Welling, 2016) are applied to augment the representations of the words in the input sentence with the contextual representations for EFP. In particular, let $H^0$ be the matrix with $(h_1, h_2, \ldots, h_n)$ as its rows. One layer of GCNs takes an input matrix $H^i$ ($i \geq 0$) and produces the output matrix $H^{i+1}$: $H^{i+1} = g(A H^i W^g_i)$, where $g$ is a non-linear function. In this work, we employ two layers of GCNs (a number tuned on the development datasets) on the input matrix $H^0$, resulting in the semantically and syntactically enriched matrix $H^2$ with the rows $(h^g_1, h^g_2, \ldots, h^g_n)$ for EFP.
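As a rough illustration of this step, the sketch below combines the two structures and applies two GCN layers of the form $H^{i+1} = g(A H^i W^g_i)$. Two choices here are assumptions made for the example only: the combination is written as the convex form $\lambda A^{sem} + (1 - \lambda) A^{syn}$ (the text only states a linear combination with trade-off $\lambda$), and $g$ is taken to be ReLU (the text only requires a non-linear function). All matrix values are toy numbers.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def combine(A_sem, A_syn, lam):
    """Assumed convex combination: lam * A_sem + (1 - lam) * A_syn."""
    return [
        [lam * s + (1.0 - lam) * t for s, t in zip(row_s, row_t)]
        for row_s, row_t in zip(A_sem, A_syn)
    ]


def gcn_layer(A, H, W):
    """One GCN layer H_next = g(A H W), with g = ReLU (assumed), biases omitted."""
    return [[max(0.0, z) for z in row] for row in matmul(matmul(A, H), W)]


# Toy example with n = 2 words and 2-dimensional hidden vectors.
A_sem = [[0.5, 0.2], [0.1, 0.9]]
A_syn = [[1.0, 1.0], [1.0, 1.0]]
A = combine(A_sem, A_syn, lam=0.6)

H0 = [[1.0, 0.0], [0.0, 1.0]]
W0 = [[1.0, -1.0], [0.5, 0.5]]   # first-layer weights
W1 = [[1.0], [1.0]]              # second-layer weights
H1 = gcn_layer(A, H0, W0)
H2 = gcn_layer(A, H1, W1)        # rows play the role of h^g_1, ..., h^g_n
```

Because $A$ enters every layer, a word that is weakly connected in the dependency tree can still receive information through a high semantic score, and vice versa.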

Prediction
This component predicts the factuality degree of the input event mention based on the context-aware representation vectors $(h^g_1, h^g_2, \ldots, h^g_n)$. In particular, as the anchor word is located at the $k$-th position (i.e., the word $x_k$), we first use the vector $h^g_k$ as the query to compute an attention weight for each representation vector in $(h^g_1, h^g_2, \ldots, h^g_n)$. These attention weights are then employed to obtain the weighted sum of $(h^g_1, h^g_2, \ldots, h^g_n)$ as the feature vector $V$, where $W^a_1$, $W^a_2$ and $W^a_3$ are the model parameters of the attention mechanism. The normalized attention weights $\alpha'_i$ help to promote the contribution of the important context words to the feature vector $V$ for EFP.
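The attention equations are not shown in the text above, so the following sketch uses a standard additive attention as a stand-in. It assumes the unnormalized score for word $i$ is $w_1 \cdot \tanh(W_2 h^g_k + W_3 h^g_i)$ and that the weights $\alpha'_i$ are obtained with a softmax. How the three parameter matrices are actually arranged in the model is an assumption, as are the names `attention_pool`, `w1`, `W2`, and `W3`.

```python
import math


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def attention_pool(Hg, k, w1, W2, W3):
    """Return (alphas, V) using the anchor vector Hg[k] as the query.

    score_i = w1 . tanh(W2 Hg[k] + W3 Hg[i])   (assumed additive form)
    alphas  = softmax(scores)
    V       = sum_i alphas[i] * Hg[i]
    """
    query = matvec(W2, Hg[k])
    scores = []
    for h in Hg:
        key = matvec(W3, h)
        hidden = [math.tanh(q + c) for q, c in zip(query, key)]
        scores.append(sum(w * x for w, x in zip(w1, hidden)))
    top = max(scores)                              # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(Hg[0])
    V = [sum(a * h[d] for a, h in zip(alphas, Hg)) for d in range(dim)]
    return alphas, V


# Toy example: three words, anchor at position k = 1.
Hg = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
W2 = [[1.0, 0.0], [0.0, 1.0]]
W3 = [[0.5, 0.0], [0.0, 0.5]]
w1 = [1.0, -1.0]
alphas, V = attention_pool(Hg, 1, w1, W2, W3)
```

The feature vector `V` would then be passed to the regression layers described next.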
Finally, similar to (Rudinger et al., 2018), the feature vector V is fed into a regression model with two layers of feed-forward networks to produce the factuality score. Following (Rudinger et al., 2018), we train the proposed model by optimizing the Huber loss with δ = 1 and the Adam optimizer with learning rate = 1.0.
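For reference, the Huber loss with threshold $\delta$ is quadratic for small errors and linear for large ones. A minimal sketch of the standard per-example definition follows; the function name `huber` is illustrative, and averaging over a batch is left out.

```python
def huber(pred, gold, delta=1.0):
    """Standard Huber loss for one prediction.

    0.5 * err^2                    if |err| <= delta
    delta * (|err| - 0.5 * delta)  otherwise
    """
    err = abs(pred - gold)
    if err <= delta:
        return 0.5 * err * err
    return delta * (err - 0.5 * delta)


# A small error is penalized quadratically, a large one linearly.
small = huber(2.5, 3.0)    # |err| = 0.5
large = huber(-3.0, 3.0)   # |err| = 6.0
```

With scores in [-3, +3], the linear regime limits the influence of examples where the prediction lands at the opposite end of the scale.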

Datasets, Resources and Parameters
Following the previous work (Stanovsky et al., 2017; Rudinger et al., 2018), we evaluate the proposed EFP model using four benchmark datasets: FactBank (Saurí and Pustejovsky, 2009), UW (Lee et al., 2015), Meantime (Minard et al., 2016) and UDS-IH2 (Rudinger et al., 2018). The first three datasets (i.e., FactBank, UW, and Meantime) are the unified versions described in (Stanovsky et al., 2017), where the original annotations for these datasets are scaled to a number in [-3, +3]. For the fourth dataset (i.e., UDS-IH2), we follow the instructions in (Rudinger et al., 2018) to scale the scores to the range of [-3, +3]. Each dataset comes with its own training data, test data and development data. Table 2 shows the numbers of examples in all data splits for each dataset used in this paper.
We tune the parameters for the proposed model on the development datasets. The best values we find in the tuning process are: 300 for the number of hidden units in the bidirectional LSTM layers, 1024 for the dimension of the projected vector $h'_i$ in the structure induction component, 300 for the number of feature maps in the GCN layers, 600 for the dimension of the transformed vectors for attention based on $(W^a_1, W^a_2, W^a_3)$, and 300 for the number of hidden units in the two layers of the final regression model. For the trade-off parameter $\lambda$ between the semantic and syntactic structures, the best value for the datasets FactBank, UW and Meantime is $\lambda = 0.6$, while the best value for UDS-IH2 is $\lambda = 0.8$.

Comparing to the State of the Art
This section evaluates the effectiveness of the proposed model for EFP on the benchmark datasets. We compare the proposed model with the best reported systems in the literature with linguistic features (Lee et al., 2015;Stanovsky et al., 2017) and deep learning (Rudinger et al., 2018). Table 1 shows the performance. Importantly, to achieve a fair comparison, we obtain the actual implementation of the current state-of-the-art EFP models from (Rudinger et al., 2018), introduce the BERT embeddings as the inputs for those models and compare them with the proposed models (i.e., the rows with "+BERT"). Following the prior work, we use MAE (Mean Absolute Error), and r (Pearson Correlation) as the performance measures.
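The two measures can be computed as in the following minimal sketch, which uses the standard definitions of mean absolute error and the Pearson correlation coefficient. The function names and the toy scores are illustrative only and are not taken from the benchmark data.

```python
import math


def mae(pred, gold):
    """Mean absolute error between predicted and gold factuality scores."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(gold)


def pearson_r(pred, gold):
    """Pearson correlation coefficient between predictions and gold scores."""
    n = len(gold)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_g = sum((g - mean_g) ** 2 for g in gold)
    return cov / math.sqrt(var_p * var_g)


# Toy scores in [-3, +3]; lower MAE and higher r are better.
gold = [3.0, -3.0, 1.0, 0.0]
pred = [2.0, -1.0, 1.0, 0.5]
error = mae(pred, gold)
corr = pearson_r(pred, gold)
```

The two measures are complementary: MAE penalizes absolute deviations from the gold score, while r rewards predictions that rank and scale consistently with the gold scores even if they are offset.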
In the table, we distinguish two methods to train the models investigated in the previous work: (i) training and evaluating the models on separate datasets (i.e., the rows associated with *), and (ii) training the models on the union of FactBank, UW and Meantime, resulting in single models to be evaluated on the separate datasets (i.e., the rows with **). It is also possible to train the models on the union of all four datasets (i.e., FactBank, UW, Meantime and UDS-IH2), corresponding to the rows with w/UDS-IH2 in the table. From the table, we can see that with the first training method, the proposed model is significantly better than all the previous models on FactBank, UW and UDS-IH2 (except for the MAE measure on UW), and achieves comparable performance with the best model (Stanovsky et al., 2017) on Meantime. In fact, the proposed model trained on the separate datasets also significantly outperforms the current best models on FactBank, UW and UDS-IH2 when these models are trained on the union of the datasets with multi-task learning (except for MAE on FactBank, where the performance is comparable). Regarding the second method with multiple datasets for training, the proposed model (only trained on the union of FactBank, UW and Meantime) is further improved, achieving better performance than all the other models in this setting across datasets and performance measures. Overall, the proposed model yields state-of-the-art performance over all the datasets and measures (except for MAE on UW, where the performance is comparable), clearly demonstrating the advantages of the model in this work for EFP.
Table 3 presents the performance of the proposed model when different elements are excluded to evaluate their contribution. We only analyze the proposed model when it is trained with multiple datasets (i.e., FactBank, UW and Meantime). However, the same trends are observed for the models trained with separate datasets.
As we can see from the table, both semantic and syntactic information are important for the proposed model as eliminating any of them would hurt the performance. Removing both elements (i.e., not using the structure induction component) would significantly downgrade the performance. Finally, we see that both the BERT embeddings and the attention in the prediction are necessary for the proposed model to achieve good performance.

Conclusion & Future Work
We present a graph-based deep learning model for EFP that exploits both the syntactic and semantic structures of the sentences to effectively model the important context words. We achieve state-of-the-art performance on several EFP datasets.
One potential issue with the current approach is that it depends on the existence of a high-quality dependency parser. Unfortunately, such a parser is not always available for different domains and languages. Consequently, in future work, we plan to develop methods that can automatically induce the sentence structures for EFP.