Biomedical Event Extraction with Hierarchical Knowledge Graphs

Biomedical event extraction is critical for understanding the biomolecular interactions described in the scientific literature. One of the main challenges is to identify nested structured events associated with non-indicative trigger words. We propose to incorporate domain knowledge from the Unified Medical Language System (UMLS) into a pre-trained language model via Graph Edge-conditioned Attention Networks (GEANet) and a hierarchical graph representation. To better recognize trigger words, each sentence is first grounded to a sentence graph based on a jointly modeled hierarchical knowledge graph from UMLS. The grounded graphs are then propagated by GEANet, a novel graph neural network with enhanced capabilities for inferring complex events. On the BioNLP 2011 GENIA Event Extraction task, our approach achieved 1.41% and 3.19% absolute F1 improvements on all events and complex events, respectively. Ablation studies confirm the importance of GEANet and the hierarchical KG.


Introduction
Biomedical event extraction is the task of identifying, from natural language texts, a set of actions among proteins or genes that are associated with biological processes (Kim et al., 2009, 2011). Development of biomedical event extraction tools enables many downstream applications, such as domain-specific text mining (Ananiadou et al., 2015; Spangher et al., 2020), semantic search engines (Miyao et al., 2006), and automatic population and enrichment of databases (Hirschman et al., 2012).
A typical event extraction system 1) finds triggers that most clearly demonstrate the presence of events, 2) recognizes the protein participants (arguments), and 3) associates the arguments with the corresponding event triggers. For instance, the sentence "Protein A inhibits the expression of Protein B" will be annotated with two nested events: Gene expression(Trigger: expression, Arg-Theme: Protein B) and Negative Regulation(Trigger: inhibits, Arg-Theme: Gene expression(Protein B), Arg-Cause: Protein A). Early attempts at biomedical event extraction adopted hand-crafted features (Björne et al., 2009; Björne and Salakoski, 2011; Riedel and McCallum, 2011; Venugopal et al., 2014a). Recent advances have shown improvements using deep neural networks via distributional word representations in the biomedical domain (Moen and Ananiadou, 2013; Rao et al., 2017a; Björne and Salakoski, 2018; ShafieiBavani et al., 2019). Li et al. (2019) further extend the word representations with embeddings of descriptive annotations from a knowledge base and demonstrate the importance of domain knowledge in biomedical event extraction.

Figure 1: An example of a UMLS-based hierarchical KG assisting event extraction. Circles represent concept nodes and triangles represent semantic nodes. Nodes associated with the tokens in the example sentence are boldfaced. Bidirectional edges imply a hierarchical relation between concept and semantic nodes. The word "induces" is a trigger of a Positive regulation event, whose trigger role and corresponding argument role cannot be easily determined with only textual input. The KG provides clues for identifying this trigger and its corresponding arguments given the red and blue double-line reasoning paths connecting the nodes BMP-6, Induce, Phosphorylation, and Positive regulation of biological process. We can infer that: 1) "induces" is an action of a biological function, 2) a biological function can be qualified by positive regulation, and 3) positive regulation can result in phosphorylation.
However, encoding knowledge with distributional embeddings does not provide adequate clues for identifying challenging events with non-indicative trigger words and nested structures. These embeddings do not contain structural or relational information about the biomedical entities. To overcome this challenge, we present a framework that incorporates knowledge from hierarchical knowledge graphs through graph neural networks (GNNs) built on top of a pre-trained language model.
Our first contribution is a novel representation of knowledge as hierarchical knowledge graphs, based on the Unified Medical Language System (UMLS), a biomedical knowledge base, containing both conceptual and semantic reasoning paths that enable better trigger word identification. Fig. 1 shows an example where the Positive regulation event can be better identified with knowledge graphs and factual relational reasoning. Our second contribution is a new GNN, Graph Edge-conditioned Attention Networks (GEANet), that encodes complex domain knowledge. By integrating edge information into the attention mechanism, GEANet has greater capability for reasoning about the plausibility of different event structures through factual relational paths in knowledge graphs (KGs).
Experiments show that our proposed method achieved state-of-the-art results on the BioNLP 2011 event extraction task (Kim et al., 2011). Our code for pre-processing, modeling, and evaluation is available at https://github.com/PlusLabNLP/GEANet-BioMed-Event-Extraction.

Background

UMLS Knowledge Base. The Unified Medical Language System (UMLS) is a knowledge base for biomedical terminology and standards, which includes three knowledge sources: the Metathesaurus, the Semantic Network, and the Specialist Lexicon and Lexical Tools (Bodenreider, 2004). We use the former two sources to build hierarchical KGs. The concept network from the Metathesaurus contains the relationships between pairs of biomedical concepts, and each concept carries one or more semantic types that can be found in the Semantic Network. The concept network provides direct definition lookup for recognized biomedical terms, while the semantic network supplies additional knowledge on the semantic level. Example tuples can be found in Figure 1. There are 3.35M concepts, 10 concept relations, 182 semantic types, and 49 semantic relations in total.

Figure 2: Overview of knowledge incorporation. Contextualized embeddings for each token are generated by SciBERT. GEANet updates node embeddings for v_1, v_2, and v_3 via the corresponding sentence graph.

Proposed Approach
Our event extraction framework builds upon the pre-trained language model SciBERT (Beltagy et al., 2019) and supplements it with a novel graph neural network, GEANet, that encodes domain knowledge from hierarchical KGs. We first describe each component and then discuss how training and inference are performed.

Hierarchical Knowledge Graph Modeling
The two knowledge sources discussed in Section 2 are jointly modeled as a hierarchical graph for each sentence, which we refer to as a sentence graph. Each sentence graph is constructed in three steps: concept mapping, concept network construction, and semantic type augmentation. The first step maps each sentence in the corpus to UMLS biomedical concepts with MetaMap, an entity mapping tool for UMLS concepts (Aronson, 2001). There are 7903 concepts (entities) mapped from the corpus, denoted as K. The next step is concept network construction, where a minimum spanning tree (MST) connecting the concepts mapped in the previous step is identified, forming concept reasoning paths. This step is NP-complete. We adopt a 2-approximate solution that constructs a global MST for the GE'11 corpus by running breadth-first search, assuming all edges are of unit distance. To prune less relevant nodes and improve computational efficiency, concept nodes that are not in K and have fewer than T neighbors in K are removed. The spanning tree for each sentence is then obtained by depth-first search on the global MST. Each matched token in the corpus is also included as a token node in the sentence graph, connected with its corresponding concept node. Finally, the semantic types of each concept node are modeled as nodes linked with the associated concept nodes in the sentence graph. Two semantic type nodes are also linked if they have a known relationship in the Semantic Network.
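The BFS-based construction and pruning above can be sketched as follows. This is a minimal illustration on a toy adjacency list, assuming unit-weight edges and a single BFS root; the helper names and the simplified single-source BFS are our own, not the authors' implementation.

```python
from collections import deque

def bfs_spanning_tree(adj, root):
    """BFS over a unit-weight concept graph. With unit edges the BFS tree
    is a shortest-path tree, which serves as a cheap stand-in for the
    (NP-complete) minimum spanning structure over the mapped concepts."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent  # node -> parent; tree edges are (node, parent)

def prune_nodes(adj, mapped, T):
    """Keep mapped concept nodes (K), plus unmapped nodes having at
    least T neighbors in K."""
    return {n for n in adj
            if n in mapped
            or sum(1 for v in adj[n] if v in mapped) >= T}
```

On a toy chain A-B-C-D with mapped concepts {A, D}, BFS from A yields tree paths between the mapped concepts, and pruning with a larger T removes the weakly connected intermediate nodes.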

GEANet
The majority of existing graph neural networks (GNNs) consider only the hidden states of nodes and the adjacency matrix, without modeling edge information. To properly model the hierarchy of the graph, it is essential for the message passing function of a GNN to consider edge features. We propose Graph Edge-conditioned Attention Networks (GEANet) to integrate edge features into the attention mechanism for message propagation. The node embedding update of GEANet at the l-th layer can be expressed as

α_{i,j} = softmax_{j∈N(i)}( MLP_ψ([h^l_i ; h^l_j ; e_{i,j}]) ),
h^{l+1}_i = Σ_{j∈N(i)} α_{i,j} · MLP_θ([h^l_j ; e_{i,j}]),

where h^l_i denotes the embedding of node i at layer l, N(i) denotes the neighbors of node i, e_{i,j} denotes the embedding for edge (i, j), [;] denotes concatenation, and MLP_ψ and MLP_θ are two multi-layer perceptrons.
GEANet is inspired by Edge-Conditioned Convolution (ECC), where the convolution operation depends on the edge type (Simonovsky and Komodakis, 2017). Compared to ECC, GEANet is able to determine the relative importance of neighboring nodes with an attention mechanism.
Knowledge Incorporation. We build GEANet on top of SciBERT (Peters et al., 2019) to incorporate domain knowledge into rich contextualized representations. Specifically, we take the contextual embeddings {h_1, ..., h_n} produced by SciBERT as inputs and produce knowledge-aware embeddings {ĥ_1, ..., ĥ_n} as outputs. To initialize the embeddings for a sentence graph, for each mapped token we project its SciBERT contextual embedding to initialize the corresponding node embedding, h_{i,KG} = h_i W_KG + b_KG. Other nodes and edges are initialized with pre-trained KG embeddings (details in Section 4.1). To accommodate multiple relations between two entities in UMLS, the edge embedding e_{i,j} is initialized by summing the embeddings of all relations between nodes i and j. We then apply layers of GEANet to encode the graph, h^l_{i,KG} = GEANet(h^{l-1}_{i,KG}). The knowledge-aware representation ĥ_i is obtained by aggregating the SciBERT representation h_i and the final KG node representation. The process is illustrated in Figure 2.
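As a concrete, deliberately simplified sketch of one such propagation step, the snippet below implements edge-conditioned attention over a tiny sentence graph using plain Python lists. Here `score_fn` and `msg_fn` stand in for the paper's MLP_ψ and MLP_θ; the toy dimensions and all function names are our own assumptions, not the released implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def geanet_layer(h, edges, score_fn, msg_fn):
    """One edge-conditioned attention update (sketch).
    h:      {node: feature vector (list of floats)}
    edges:  {node: [(neighbor, edge feature vector), ...]}
    score_fn(h_i, h_j, e) -> scalar attention logit (plays MLP_psi)
    msg_fn(h_j, e)        -> message vector        (plays MLP_theta)"""
    new_h = {}
    for i, nbrs in edges.items():
        if not nbrs:                      # isolated node: keep state
            new_h[i] = h[i]
            continue
        logits = [score_fn(h[i], h[j], e) for j, e in nbrs]
        alpha = softmax(logits)           # attention over neighbors
        msgs = [msg_fn(h[j], e) for j, e in nbrs]
        new_h[i] = [sum(a * m[d] for a, m in zip(alpha, msgs))
                    for d in range(len(h[i]))]
    return new_h
```

With a single neighbor the attention weight collapses to 1 and the update reduces to the edge-conditioned message alone, which makes the ECC connection easy to see.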

Event Extraction
The entire framework is trained with a multitask learning pipeline consisting of trigger classification and argument classification, following Han et al. (2019a,b). Trigger classification predicts the trigger type of each token; the predicted score of the i-th token is computed as ŷ^tri_i = MLP_tri(ĥ_i), where ĥ_i = h_i for tokens without a mapped concept. In the argument classification stage, each possible pair of a gold trigger and a gold entity is gathered and labeled with the corresponding argument role (during inference, predicted triggers are used instead). The argument score between the i-th and j-th tokens is computed as ŷ^arg_{i,j} = MLP_arg([ĥ_i ; ĥ_j]), where [;] denotes concatenation. Cross entropy is used for both tasks,

L_t = − Σ_{i=1}^{N_t} y^t_i log ŷ^t_i,

where t denotes the task, N_t denotes the number of training instances of task t, y^t_i denotes the ground-truth label, and ŷ^t_i denotes the predicted label. Multitask learning minimizes the sum of the two losses, L = L_tri + L_arg, during training. During inference, unmerging is conducted to combine identified triggers and arguments into events with multiple arguments (Björne and Salakoski, 2011). We adopt similar unmerging heuristics: for Regulation events, we use the same heuristics as Björne et al. (2009); for Binding events, we subsume all Theme arguments associated with a trigger into one event, such that every trigger corresponds to a single Binding event.

Experimental Setup
Our models are evaluated on the BioNLP 11 GENIA event extraction task (GE'11). All models were trained on the training set, validated on the dev set, and tested on the test set. A separate evaluation on Regulation events is conducted to validate the effectiveness of our framework on nested events with non-indicative trigger words. Reported results are obtained from the official evaluator under the approximate span and recursive criteria.
In the preprocessing step, the GE'11 corpus was parsed with the TEES preprocessing pipeline (Björne and Salakoski, 2018). Tokenization is done by the SciBERT tokenizer. Biomedical concepts in each sentence are then recognized with MetaMap and aligned with their corresponding tokens. The best performing model was found by grid search on the dev set. The edge and node representations in the KGs were initialized with 300-dimensional embeddings pre-trained with TransE (Bordes et al., 2013). The entire framework is optimized with the BERTAdam optimizer for a maximum of 100 epochs with a batch size of 4. Training is stopped if the dev set F1 does not improve for 5 consecutive epochs (see Appendix for more details).

Results and Analysis
Comparison with existing methods. We compare our method with the following prior works: TEES and Stacked Gen. use SVM-based models with token- and sentence-level features (Björne and Salakoski, 2011; Majumder et al., 2016); KB-driven T-LSTM incorporates knowledge-base information into a tree-LSTM (Li et al., 2019); SciBERT-FT is a fine-tuned SciBERT without external resources, the knowledge-agnostic counterpart of GEANet-SciBERT. According to Table 1, SciBERT-FT achieves performance similar to KB-driven T-LSTM, implying that SciBERT may have stored domain knowledge implicitly during pre-training. A similar hypothesis has also been studied in commonsense reasoning (Wang et al., 2019). GEANet-SciBERT achieves an absolute improvement of 1.41% F1 on the test data over the previous state-of-the-art method. On Regulation events, Table 2 shows that GEANet-SciBERT outperforms the previous system and fine-tuned SciBERT by 3.19% and 1.39% F1, respectively.

Ablation study. To better understand the importance of different model components, an ablation study is conducted and summarized in Table 3. GEANet achieves the highest F1 compared to two other GNN variants, ECC and GAT (Veličković et al., 2018), demonstrating its stronger knowledge incorporation capacity. The hierarchical knowledge graph representation is also shown to be critical: removing semantic type (STY) nodes from the hierarchical KGs leads to a performance drop.

Impact of amount of training data. Model performance on different amounts of randomly sampled training data is shown in Fig. 3. GEANet-SciBERT shows consistent improvement over fine-tuned SciBERT across different fractions. The performance gain is slightly larger with less training data, illustrating the robustness of GEANet in integrating domain knowledge and its particular advantage in low-resource settings.

Error Analysis. By comparing the predictions of GEANet-SciBERT with the gold events in the dev set, we identify two major failure cases:

• Adjective Trigger: Most events are associated with a verb or noun trigger. Adjective triggers are scarce in the training set (∼7%), which makes this type of trigger hard to identify. Although knowledge-aware methods should in theory be able to resolve these errors, adjective triggers often cannot be linked with UMLS concepts. Without proper grounding, it is hard for our model to recognize these triggers.

• Misleading Trigger: Triggers providing "clues" about incorrect events can be misleading. For instance: "Furthermore, expression of an activated PKD1 mutant enhances HPK1-mediated NFkappaB activation." Our model predicts expression as a trigger of type Gene expression, while the gold label is Positive regulation. Although our model can sometimes handle such cases given grounded biomedical concepts and factual reasoning paths, there is still room for improvement.

Related Work
Event Extraction. Most existing event extraction systems focus on extracting events from news. Early attempts relied on hand-crafted features and a pipeline architecture (Gupta and Ji, 2009; Li et al., 2013). Later studies gained significant improvements from neural architectures, such as convolutional neural networks (Chen et al., 2015; Nguyen and Grishman, 2015) and recurrent neural networks (Nguyen et al., 2016). More recent studies leverage large pre-trained language models to obtain richer contextual information (Wadden et al., 2019; Lin et al., 2020). Another line of work utilizes GNNs to enhance event extraction performance; Liu et al. (2018) applied attention-based graph convolution networks to dependency parse trees. We instead propose a GNN, GEANet, for integrating domain knowledge into contextualized embeddings from pre-trained language models.

Biomedical Event Extraction. Event extraction for biomedicine is more challenging due to a higher demand for domain knowledge. The BioNLP 11 GENIA event extraction task (GE'11) is the major benchmark for measuring the quality of biomedical event extraction systems (Kim et al., 2011). Similar to event extraction in the news domain, initial studies tackled biomedical event extraction with human-engineered features and pipeline approaches (Miwa et al., 2012; Björne and Salakoski, 2011). A great portion of recent work observed significant gains from neural models (Venugopal et al., 2014b; Rao et al., 2017b; Jagannatha and Yu, 2016; Björne and Salakoski, 2018). Li et al. (2019) incorporated information from Gene Ontology, a biomedical knowledge base, into tree-LSTM models with distributional representations. Instead, our strategy is to model two knowledge graphs from UMLS hierarchically with conceptual and semantic reasoning paths, providing stronger clues for identifying challenging events in biomedical corpora.

Conclusion
We have proposed a framework to incorporate domain knowledge for biomedical event extraction. Evaluation results on GE'11 demonstrated the efficacy of GEANet and the hierarchical KG representation in improving the extraction of nested events with non-indicative trigger words. We also showed that our method is robust across different amounts of training data and is particularly advantageous in low-resource scenarios. Future work includes grounding adjective triggers to knowledge bases, better biomedical knowledge representation, and extracting biomedical events at the document level.

A Implementation Details
Our models are implemented in PyTorch (Paszke et al., 2019). Hyper-parameters are found by grid search within the search ranges listed in Table 4. The hyper-parameters of the best performing model are summarized in Table 5. All experiments are conducted on a 12-CPU machine running CentOS Linux 7 (Core) and an NVIDIA RTX 2080 with CUDA 10.1.
To pre-train the KG embeddings, we leverage the TransE implementation from OpenKE (Han et al., 2018). All tuples associated with the selected nodes described in Section 3.1 are used for pre-training with a margin loss and negative sampling,

L = Σ_{(h,ℓ,t)∈S} Σ_{(h′,ℓ,t′)∈S′} [γ + d(h + ℓ, t) − d(h′ + ℓ, t′)]_+,

where γ denotes the margin and d(x, x′) denotes the ℓ1 distance between x and x′. h and t are the embeddings of the head and tail entities from the gold training set S with relation ℓ, and (h′, ℓ, t′) denotes a corrupted tuple with either the head or the tail entity replaced by a random entity. TransE is optimized using Adam (Kingma and Ba, 2015) with the hyper-parameters listed in Table 6. Every 50 epochs, a model checkpoint is saved if the mean reciprocal rank on the development set improves over the last checkpoint; otherwise, training is stopped.
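For concreteness, the margin loss for a single gold/corrupted pair can be sketched as below. This is a toy re-implementation under the ℓ1-distance formulation, not the OpenKE code; the function names are our own.

```python
def l1_dist(x, y):
    """l1 distance between two vectors represented as lists."""
    return sum(abs(a - b) for a, b in zip(x, y))

def transe_margin_loss(h, rel, t, h_c, t_c, gamma):
    """[gamma + d(h + rel, t) - d(h_c + rel, t_c)]_+ for one gold tuple
    (h, rel, t) and one corrupted tuple (h_c, rel, t_c)."""
    add = lambda u, v: [a + b for a, b in zip(u, v)]
    pos = l1_dist(add(h, rel), t)      # distance for the gold tuple
    neg = l1_dist(add(h_c, rel), t_c)  # distance for the corrupted tuple
    return max(0.0, gamma + pos - neg)
```

When the corrupted tuple is far enough from satisfying h + rel ≈ t, the hinge clips the loss to zero, so only hard negatives within the margin γ contribute gradients.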

B Dataset
The statistics of GE'11 are shown in Table 7. The corpus contains 14496 events, 37.2% of which have a nested structure (Björne and Salakoski, 2011). We use the official dataset split for all reported results.

Table 4: Hyper-parameter search ranges (columns: Hyper-parameter, Range; rows include the relation MLP dimension).