Biomedical Event Extraction using Abstract Meaning Representation

We propose a novel, Abstract Meaning Representation (AMR) based approach to identifying molecular events/interactions in biomedical text. Our key contributions are: (1) an empirical validation of our hypothesis that an event is a subgraph of the AMR graph, (2) a neural network-based model that identifies such an event subgraph given an AMR, and (3) a distant supervision based approach to gather additional training data. We evaluate our approach on the 2013 Genia Event Extraction dataset and show promising results.


Introduction
For several years now, the biomedical community has been working towards the goal of creating a curated knowledge base of biomolecule entity interactions. The scientific literature in the biomedical domain runs to millions of articles and is an excellent source of such information. However, automatically extracting information from text is a challenge because natural language allows us to express the same information in many different ways. The series of Genia Event Extraction shared tasks (Kim et al., 2009, 2011, 2013, 2016) has produced a range of significant approaches to biomolecule event extraction, spanning methods that use patterns learnt from annotated text (Bui et al., 2013) to machine learning methods (Björne and Salakoski, 2013) that use syntactic parses as features. (The dataset used in this work is different from the BioNLP 2016 GE dataset.) In this work, we find that a semantic analysis of text that relies on Abstract Meaning Representations (Banarescu et al., 2013) is highly useful because it normalizes many lexical and syntactic variations in text.

Figure 1: AMR with sample event annotations for the sentence "This LPA-induced rapid phosphorylation of radixin was significantly suppressed in the presence of C3 toxin, a potent inhibitor of Rho"

AMR is a rooted, directed acyclic graph (DAG) that captures the notion of who did what to whom in text, in such a way that sentences with the same basic meaning often have the same AMR. The nodes in the graph (also called concepts) map to words in the sentence, and the edges map to relations between the words. Recently, there have been several efforts towards parsing a sentence into its AMR (Flanigan et al., 2014; Wang et al., 2015; Pust et al., 2015; May, 2016). AMR naturally captures hierarchical relations between entities in text, making it well suited to complex event detection.
For example, consider the following sentence from the biomedical literature: "This LPA-induced rapid phosphorylation of radixin was significantly suppressed in the presence of C3 toxin, a potent inhibitor of Rho". Figure 1 shows its Abstract Meaning Representation (AMR). The subgraph rooted at phosphorylate-01 identifies the event E1 and the subgraph rooted at induce-01 identifies the event E2, where E1 = phosphorylation of radixin and E2 = LPA induces E1.
We hypothesize that an event structure is a subgraph of a DAG structure like AMR, and under this assumption we cast the event extraction task as a graph identification problem.

Table 1: Event types and their arguments in the 2013 Genia Event Extraction task

  Type                  Primary Args.
  Gene expression       T(P)
  Transcription         T(P)
  Localization          T(P)
  Protein catabolism    T(P)
  Binding               T(P)+
  Phosphorylation       T(P/Ev), C(P/Ev)
  Regulation            T(P/Ev), C(P/Ev)
  Positive regulation   T(P/Ev), C(P/Ev)
  Negative regulation   T(P/Ev), C(P/Ev)

Our first contribution is the testing of the above hypothesis that an event structure is a subgraph of an AMR graph. Given a sentence, we automatically obtain its AMR using an AMR parser (Pust et al., 2015) and explain how an event can be defined as a subgraph of the AMR graph. Under the assumption that we can correctly identify such an event subgraph from an AMR graph when it exists, we evaluate how good our definition is (Section 2).
Our second contribution is a supervised neural network-based model that is trained to identify an event subgraph given an AMR (Section 3). Our model is built on the intuition that the path between an interaction term and an entity term in an AMR graph carries an important signal for identifying the relation between them. For example, in Figure 1 the path {'induce-01', 'arg0', 'LPA'} suggests that LPA is the cause of induce. We encode this path using word embeddings pre-trained on millions of words of biomedical text and develop two pipelined neural network models: (a) one to identify the theme of an interaction; and (b) one to identify the cause of the interaction, if one exists.
Experimental results show that our model, although it achieves reasonable precision, suffers from low recall. Our third contribution is a distant supervision (Mintz et al., 2009) based approach to collect additional annotated training data. Distant supervision works on the assumption that, given a known relation between two entities, a sentence containing the two entities is likely to express this relation and hence can serve as training data for that relation. Data gathered using such a method can be noisy (Takamatsu et al., 2012); Roth et al. (2013) discuss several prior works that address this issue. In our work, we introduce a method based on an AMR path heuristic to selectively sample the sentences we obtain using distant supervision (Section 3) and show its effectiveness over our vanilla neural network model.
We evaluate our event extraction model on the 2013 Genia Event Extraction dataset and show that it achieves promising results when compared to the state-of-the-art system. Given that AMR parsing is still a young field, our model, which currently uses a parser with 67% accuracy, should perform better as AMR parsers improve.

Task description
The biomedical event extraction task in this work is adopted from the Genia Event Extraction subtask of the well-known BioNLP shared task (Kim et al., 2009, 2011, 2013). Table 2 shows a sample event annotation for the sentence in Figure 1. The protein annotations T1-T4 are given as starting points. The task is to identify the events E1-E4 with their interaction types and arguments. Table 1 describes the various event types and the arguments they accept. The first four event types require only a unary theme argument. The binding event can take a variable number of theme arguments. The last four event types take a theme argument and, when expressed, also a cause argument. Their theme or cause may in turn be another event, creating a nested event (e.g., event E2 in Table 2).

Model description
We cast this event extraction problem as a subgraph identification problem. Given a sentence, we first obtain its AMR graph automatically using an AMR parser (Pust et al., 2015). Next, we identify protein nodes and interaction nodes in the graph.

Protein Node Identification: In both the training and the test set, protein terms are pre-annotated (e.g., T1 to T4 in Table 2). We use the AMR graph alignment information to identify the nodes in the AMR graph aligned to these protein terms, giving us our protein nodes P.

Interaction Node Identification: In the training data, interaction terms are pre-annotated (e.g., T5 to T8 in Table 2). To identify the interaction terms in the test set, we use the following heuristic: any term that was annotated as an interaction term more than once in the training data is considered an interaction term in the test data as well. We then use the AMR graph alignment information to identify the nodes in the AMR graph aligned to these interaction terms, giving us our interaction nodes T.

Given P and T, we identify an event subgraph using the following two-step process:

a. Theme Identification: Every pair (p_i, t_j) where p_i ∈ P and t_j ∈ T is a candidate for an event e_m defined as e_m: (Type: t_j, Theme: p_i), where Type is one of the nine event types in Table 1. If e_m can take other events as arguments (the last four event types in Table 1) and the shortest path between t_j and p_i includes an interaction term t_k such that the pair (p_i, t_k) is an event e_n in itself, then we define the event e_m instead as e_m: (Type: t_j, Theme: e_n). For example, in Figure 1 the path between induce-01 and radixin includes phosphorylate-01, which is an event in itself (E1); hence event E2 is defined with E1 as its theme (in Table 2).

b. Cause Identification: For events e_m: (Type: t_j, Theme: p_i) that can take a cause argument, we identify possible candidates for their cause by again looking for all pairs (p_l, t_j) where p_l ∈ P and l ≠ i, and add a cause to the event e_m as e_m: (Type: t_j, Theme: p_i, Cause: p_l). Since these events can also take other events as their cause argument, we identify additional candidates for their cause by looking for all pairs (e_n, t_j) where e_n ∈ E and n ≠ m, and add a cause to the event e_m as e_m: (Type: t_j, Theme: p_i, Cause: e_n).
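The candidate generation over the AMR described above can be sketched as follows. The graph encoding, edge labels, and helper names are our own illustrative assumptions, not the paper's implementation; the AMR (a DAG) is traversed as an undirected graph so that a trigger and a protein in different branches are still connected:

```python
from collections import deque

def shortest_path(graph, src, dst):
    """BFS shortest path, returning the alternating node/edge-label
    sequence from src to dst, or None if the nodes are not connected."""
    adj = {}
    for parent, edges in graph.items():
        for label, child in edges:
            adj.setdefault(parent, []).append((label, child))
            adj.setdefault(child, []).append((label, parent))
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for label, nxt in adj.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [label, nxt])
    return None

def theme_candidates(graph, proteins, interactions):
    """Every connected (interaction, protein) pair is a candidate event;
    the connecting path is what the classifier later sees."""
    return [(t, p, shortest_path(graph, t, p))
            for t in interactions for p in proteins
            if shortest_path(graph, t, p) is not None]

# A fragment of the Figure 1 AMR; edge labels are illustrative guesses.
amr = {
    "induce-01": [("arg0", "LPA"), ("arg2", "phosphorylate-01")],
    "phosphorylate-01": [("arg1", "radixin")],
}
print(shortest_path(amr, "induce-01", "radixin"))
# ['induce-01', 'arg2', 'phosphorylate-01', 'arg1', 'radixin']
```

Note how the path between induce-01 and radixin passes through phosphorylate-01, which is exactly the condition the nesting rule in step (a) checks for.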

Upper bound using "event is a subgraph of AMR" hypothesis
Before we learn to identify event subgraphs from an AMR graph, we first calculate the upper bound that we are setting for our model by using an AMR parser instead of gold AMRs. For calculating this upper bound, we obtain the AMR graph of a sentence using the AMR parser and then assume that, if an event is a subgraph of this AMR graph, we can identify it correctly.

Table 3: Upper bound on the dev set using our "event is a subgraph of AMR" hypothesis

Table 3 shows the upper bound we get on the dev set of the 2013 Genia Event Extraction dataset (described in Section 5.1). The last column in the table is the state-of-the-art F1 score obtained by the system EVEX (Hakala et al., 2013) on the test set of the dataset. In the case of simple events, i.e., events that take only proteins as theme arguments, an event is always a subgraph of the AMR unless there is an alignment error causing the protein node or the interaction node to be missing. Hence the upper bound on our precision is 100%, whereas the upper bound on our recall is 82.48% for these simple events. In the case of the other event types, where an event can take other events as arguments, an event is correctly identified only if the path between the pair (p_i, t_j) in the AMR graph includes all its subevents. We therefore lose more precision and recall in these cases due to AMR parsing errors, bringing our overall upper bound on precision down to 85.44% and our overall upper bound on recall down to 65.98%. These results give us the following two important insights: 1. By using this hypothesis, we have set an upper bound of 74.18% F1-score for our learning model. 2. As the accuracy of automatic AMR parsers improves, our model will perform better at the event extraction task.
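The per-event recoverability check underlying this upper bound might look like the following sketch; the event and graph encodings are our own assumptions, not the paper's code:

```python
from collections import deque

def path_nodes(graph, src, dst):
    """Set of nodes on a BFS shortest path (AMR traversed undirected),
    or None if src and dst are not connected."""
    adj = {}
    for u, edges in graph.items():
        for _, v in edges:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return set(path)
        for nxt in adj.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

def innermost_protein(event):
    theme = event["theme"]
    return theme if isinstance(theme, str) else innermost_protein(theme)

def recoverable(graph, event):
    """A gold event survives parsing if its trigger and theme are both
    aligned to AMR nodes and, for a nested theme, the trigger-to-protein
    path passes through the sub-event's trigger."""
    trigger, theme = event["trigger"], event["theme"]
    if isinstance(theme, dict):                       # nested event
        nodes = path_nodes(graph, trigger, innermost_protein(theme))
        return (recoverable(graph, theme)
                and nodes is not None and theme["trigger"] in nodes)
    return path_nodes(graph, trigger, theme) is not None

amr = {"induce-01": [("arg0", "LPA"), ("arg2", "phosphorylate-01")],
       "phosphorylate-01": [("arg1", "radixin")]}
e1 = {"trigger": "phosphorylate-01", "theme": "radixin"}
e2 = {"trigger": "induce-01", "theme": e1}            # nested, like E2
```

Counting the fraction of gold events for which `recoverable` holds on the parsed AMRs gives the recall bound; events whose nodes are missing or whose paths skip a sub-event trigger are the ones lost to parsing and alignment errors.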

LSTM based learning model
In this section we describe our model, which learns to identify an event subgraph from an AMR graph. The key idea is that the path between the interaction node and the entity node (where the term entity is used to denote both a protein and a sub-event) contains information about how the event is structured. We build on this idea to develop a supervised model using the Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) architecture, which learns to identify events using the nodes and edges in the AMR path between the interaction term and the entity term.

Motivation
The input to our problem is a sequence of words (w_i) interleaved with edge labels (e_j) of the form w_1, e_1, w_2, e_2, ..., e_{n-1}, w_n that exists in the path between an interaction node and an entity node in an AMR graph. Due to the large semantic variation in naturally occurring text, traditional feature-based methods suffer from sparsity issues when learning from such sequences. Neural network-based models provide a framework for learning from dense representations. In particular, LSTMs are known to handle sequences of variable length and capture long-range dependencies well. Since our input sequences fall into this category, we build our model using the LSTM framework.

Event identification
We model the event identification task as a two-step process: Theme Identification and Cause Identification. For simple events, this process includes only theme identification (since they do not have a cause). We describe the two LSTM models corresponding to the two steps as follows:

Theme Identification
Given a pair of an interaction node (t_j) and a protein node (p_i), the task is to determine whether there exists an event with t_j as the interaction and p_i as the theme, and if so, the type of that event.
We cast this problem as a multi-class classification task with label set L: {NULL ∪ Event types}, where the event types correspond to the nine types described in Table 1 and NULL corresponds to no event. We train an LSTM model for this task with the input layer as the embeddings corresponding to the sequence of words interleaved with edge labels in the shortest path between p_i and t_j in the AMR graph. We use a hidden layer of size 100 and an output layer of the size of our label set L. For example, in Figure 2 the sequence {'phosphorylate-01', 'arg1', 'radixin'} is the input sequence and the event type Phosphorylation is its label.
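A toy forward pass of this classifier can be sketched as follows. This is an untrained NumPy mock-up with random weights and shrunken dimensions, not the actual implementation; it only illustrates how the path sequence is consumed and mapped to the ten labels:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; the four gates are stacked as [input, forget, cell, output]."""
    n = h.shape[0]
    z = W @ x + U @ h + b
    i, f = sigmoid(z[:n]), sigmoid(z[n:2 * n])
    g, o = np.tanh(z[2 * n:3 * n]), sigmoid(z[3 * n:])
    c = f * c + i * g
    return o * np.tanh(c), c

def classify_path(tokens, emb, W, U, b, W_out, b_out):
    """Run the LSTM over the node/edge-label path and softmax the final
    hidden state into the label set (NULL + the nine event types)."""
    h = np.zeros(U.shape[1])
    c = np.zeros(U.shape[1])
    for tok in tokens:
        h, c = lstm_step(emb[tok], h, c, W, U, b)
    logits = W_out @ h + b_out
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, n, L = 8, 16, 10                     # the paper uses d = n = 100
path = ["phosphorylate-01", "arg1", "radixin"]   # the Figure 2 example
emb = {tok: rng.normal(size=d) for tok in path}
W, U = rng.normal(size=(4 * n, d)), rng.normal(size=(4 * n, n))
W_out = rng.normal(size=(L, n))
probs = classify_path(path, emb, W, U, np.zeros(4 * n), W_out, np.zeros(L))
```

With trained weights, the argmax over `probs` would be NULL or one of the nine event types for the pair (p_i, t_j).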

Cause Identification
The last four event types in Table 1 can take proteins or other events as a cause argument. We cast this problem as a binary classification task in which, for an event, we ask whether a protein/event is its cause argument, for every protein and every other event in that sentence. Let e_m be an identified event e_m: (Type: t_j, Theme: p_i) that can take a cause argument. Let C = P ∪ E, where P is the set of all other proteins in the AMR graph (except p_i) and E is the set of all identified events (except e_m). For every c_k ∈ C, we take the shortest path between c_k and t_j, combine it with the shortest path between p_i and t_j, and use the words and edges in this combined path as the input layer of our second LSTM model. We use a hidden layer of size 100 and an output layer of size one, corresponding to the binary prediction of whether c_k is the cause of the event e_m.

Initialization of Embeddings
When initializing our model, we have two choices: we can initialize the embeddings in the input layer randomly, or we can initialize them with values that reflect the meanings of the word types. It has been shown that using pre-trained word embeddings improves the performance of RNN models over random initialization (Collobert and Weston, 2008; Socher et al., 2011). We initialize the vectors corresponding to words in our input layer with 100-dimensional vectors generated by a word2vec (Mikolov et al., 2013) model trained on over one million words from the PubMed Central article repository. Words not included in the pre-trained model, as well as the edge labels, are initialized randomly by uniform sampling from [-0.25, +0.25] to match the embeddings' standard deviation.
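This initialization scheme can be sketched as follows; the `pretrained` lookup below is a stand-in for the PubMed word2vec model, not the model itself:

```python
import numpy as np

def init_embeddings(vocab, pretrained, dim=100, scale=0.25, seed=0):
    """word2vec vectors where available; out-of-vocabulary words and
    edge labels are sampled uniformly from [-scale, +scale]."""
    rng = np.random.default_rng(seed)
    table = {}
    for tok in vocab:
        if tok in pretrained:
            table[tok] = np.asarray(pretrained[tok], dtype=float)
        else:
            table[tok] = rng.uniform(-scale, scale, size=dim)
    return table

# stand-in for the pre-trained PubMed word2vec vectors
pretrained = {"radixin": np.full(100, 0.1)}
table = init_embeddings(["radixin", "arg1", "phosphorylate-01"], pretrained)
```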

Event Construction
During test time, we first make predictions using our LSTM model for theme identification. For every pair (p_i, t_j) with a non-zero label l, we construct events as follows. For a label l corresponding to interaction types that take only proteins as theme arguments, we construct the event as e_m: (Type: t_j, Theme: p_i). For a label l corresponding to interaction types that can take another event as their theme, we look at the path between t_j and p_i in the AMR. If this path includes a pair (t_k, p_i) that has a non-zero label, then we construct an event e_n: (Type: t_j, Theme: e_p), where e_p is the event constructed from the pair (t_k, p_i). Otherwise, we construct the event as e_n: (Type: t_j, Theme: p_i).
For each predicted event e_m: (Type: t_j, Theme: p_i) that can take a cause argument, we run the second LSTM model for its cause identification. If there is a pair (p_i, c_k) with a positive label, then we assign c_k as the cause of the event e_m.
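The theme-side construction logic can be sketched as follows; the dict encodings and the SIMPLE set are our own shorthand (per Table 1, Phosphorylation and the regulation types are the ones allowed to take events as themes):

```python
SIMPLE = {"Gene expression", "Transcription", "Localization",
          "Protein catabolism", "Binding"}

def build_events(labels, paths):
    """labels: {(trigger, protein): predicted type or None};
    paths: {(trigger, protein): [node, edge, node, ...]} from the AMR."""
    events = {}
    for (t, p), label in labels.items():
        if not label:
            continue
        theme = p
        if label not in SIMPLE:
            # nest if an inner positively-labelled trigger lies on the path
            for node in paths[(t, p)][2:-1:2]:        # interior nodes only
                if labels.get((node, p)):
                    theme = ("EVENT", node, p)        # pointer to the inner event
                    break
        events[(t, p)] = {"type": label, "trigger": t, "theme": theme}
    return events

labels = {("phosphorylate-01", "radixin"): "Phosphorylation",
          ("induce-01", "radixin"): "Positive regulation"}
paths = {("phosphorylate-01", "radixin"):
             ["phosphorylate-01", "arg1", "radixin"],
         ("induce-01", "radixin"):
             ["induce-01", "arg2", "phosphorylate-01", "arg1", "radixin"]}
events = build_events(labels, paths)
```

On the Figure 1 example, this yields E1 = Phosphorylation(radixin) and E2 = Positive regulation with E1 as its theme, matching the nesting in Table 2.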

Distant Supervision
An empirical evaluation of our LSTM-based learning model (Section 5.4) shows that it suffers from low recall. Obtaining additional human-annotated data for our complex event extraction task would be very costly. This motivates us to develop an approach that can gather more training data with minimal supervision.

Motivation
Distant supervision as a learning paradigm was introduced by Mintz et al. (2009) for relation extraction in the general domain. They use Freebase to get a set of relation instances and the entity pairs participating in those relations, extract all sentences containing those entity pairs from Wikipedia text, and use these sentences as their training data. This work and many others show that the distant supervision technique yields significant improvements in relation extraction.

Figure 3: Sample sentences extracted from PubMed Central articles using BioPax database relations

Neural network models like LSTMs need to be trained on substantial amounts of data to generalize well. However, due to the lack of labeled data in the biomedical domain, most relation extraction work in this domain has been restricted to purely supervised techniques. In this work we cope with this problem by gathering additional training data using distant supervision from a knowledge base.

Methodology
Relation extraction using distant supervision requires two things: 1) a knowledge base containing relations between proteins, and 2) a large corpus of unannotated text containing protein mentions. We use the BioPax (Biological Pathway Exchange) database (Demir et al., 2010) as our knowledge base of protein relations, and we use the PubMed Central articles as our unannotated text corpus. Given a database entry of the form ('Protein1', 'Protein2', 'relation'), we extract all sentences from the PubMed Central articles in which the two proteins co-occur. For example, Figure 3 shows some sample sentences extracted for the database entry ('DAG', 'PKC', 'increases'). The first two sentences in the figure indeed express the relation in the database, but the third sentence merely mentions the two proteins in a comma-separated list. We observe that many of the extracted sentences fall into the category of the third sentence. Hence, as a first step, we filter such instances by tagging the sentences with their parts of speech and removing those in which the two proteins are separated only by nouns (or punctuation).
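This filtering step can be sketched as follows; we assume the sentence is already POS-tagged with Penn Treebank tags, and for simplicity we locate each protein by its first occurrence:

```python
def list_like(tagged, prot1, prot2):
    """True when every token strictly between the two protein mentions
    is a noun or punctuation, i.e. the sentence only co-mentions the
    proteins in a list and should be discarded."""
    words = [w for w, _ in tagged]
    i, j = sorted((words.index(prot1), words.index(prot2)))
    between = [pos for _, pos in tagged[i + 1:j]]
    return all(pos.startswith("NN") or pos in {",", ":", "."}
               for pos in between)

# "DAG, PKC, and PLC ..." -> bare co-mention list, discard
listy = [("DAG", "NNP"), (",", ","), ("PKC", "NNP"),
         (",", ","), ("and", "CC"), ("PLC", "NNP")]
# "DAG activates PKC" -> likely expresses a relation, keep
relational = [("DAG", "NNP"), ("activates", "VBZ"), ("PKC", "NNP")]
```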

AMR Path Based Selection
The traditional distant supervision approach says that all the sentences extracted using the method above can be used as additional training data, under the assumption that all sentences in which the proteins co-occur express the relation mentioned in the database. However, Takamatsu et al. (2012) note that this approach can often lead to many false positives, and Roth et al. (2013) discuss several prior works that try to reduce such noise in the data. In our work, we develop a novel selection technique for reducing such noise using an AMR path heuristic. We make the observation that, given two protein nodes in an AMR, if there is a relation r between the two, then the shortest path between the two protein nodes in the AMR contains the interaction term expressing the relation r.

Table 4: Mapping between event types and BioPax model relations
For example, Figure 4 shows the AMR for the sentence "DAG is important for the activation of PKC, which phosphorylates tyrosinase, and can also be released...", which was extracted using the database entry ('DAG', 'PKC', 'increases'). The interaction term 'activate', suggesting the relation 'increases', exists in the shortest path between the proteins DAG and PKC. Figure 5 shows the AMR for the sentence "The sun-network links TCF3 with ZYX and HOXA9 via NEDD9 and CREBBP, respectively.", extracted for the entry ('TCF3', 'HOXA9', 'increases'). Here there is no interaction term suggesting the relation 'increases' in the shortest path between the proteins TCF3 and HOXA9. Table 4 shows the mapping we define between the event types and the relations found in the entries ('Protein1', 'Protein2', 'relation') that we extracted from the BioPax model. For each sentence extracted for the database entry ('P1', 'P2', 'r'), we check whether the shortest path between the two protein nodes P1 and P2 in the AMR of the sentence contains one of the interaction terms corresponding to the event type mapped to the relation r. We discard all sentences that do not satisfy this constraint.
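The selection constraint reduces to a membership test over the already-computed shortest path. In this sketch, the relation mapping and term inventory are illustrative placeholders in the spirit of Table 4, not the actual table, and the two paths are simplified readings of the Figure 4 and Figure 5 AMRs:

```python
# illustrative mapping in the spirit of Table 4
REL_TO_TYPE = {"increases": "Positive regulation"}
TYPE_TERMS = {"Positive regulation": {"activate-01", "increase-01", "induce-01"}}

def keep_sentence(path, relation):
    """Keep a distantly-labelled sentence only if an interaction term
    mapped to the database relation lies on the protein-protein path.
    The path alternates nodes and edge labels; nodes sit at even indices."""
    terms = TYPE_TERMS[REL_TO_TYPE[relation]]
    return any(node in terms for node in path[::2])

good = ["DAG", "arg0", "activate-01", "arg1", "PKC"]    # like Figure 4
bad = ["TCF3", "op1", "link-01", "op3", "HOXA9"]        # like Figure 5
```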

Using Data for LSTM Model
We use these selected sentences as additional training data for our two LSTM models as follows:

a. Theme identification: Let S be a sentence extracted for the database entry ('DAG', 'PKC', 'increases'), and let 'activates' be the interaction term that exists in the shortest path between the protein nodes. Since the database entry refers to 'DAG' as the cause and 'PKC' as the theme, we assume these roles for the two proteins in the extracted sentence S as well. Therefore, we can now use the path between the interaction term 'activates' and the theme 'PKC' as an input sequence for our model, with the label corresponding to the event type of the interaction term 'activates'.

b. Cause identification: For cause identification, instead of using the path between the interaction term and the theme entity, we use the shortest path between the cause entity and the theme entity via the interaction term, and use this as an input sequence to our model with a positive label.

Dataset and task setting
The event extraction task described in this work corresponds to Task 1 of the Genia Event Extraction task described by the BioNLP Shared Task series (2009, 2011, and 2013). We train a model on a combination of the abstract collection (from the 2009 edition) and the full-text collections (from the 2011 and 2013 editions). We test our model on the dev set of the 2013 edition (since the gold annotations are publicly available only for the dev set, not the test set).

Data preparation
The dataset made available for the Shared Task is in the form of sentences and event annotations, as shown in Table 2. We convert these event annotations into input sequences and labels for our multi-class classification task (theme identification) and for our binary classification task (cause identification) as follows:

a. Theme identification: Given a sentence, we define the set T as the set of interaction terms corresponding to all its event annotations, and the set P as the set of all its protein mentions. For every pair (t_j, p_i) where p_i ∈ P and t_j ∈ T, we create a training instance of the form {w_1, e_1, w_2, e_2, ..., e_{n-1}, w_n, label}, where the input sequence corresponds to the words interleaved with edge labels in the shortest path between t_j and p_i, and the label is the event type of the event e_m if there exists an event e_m: (Type: t_j, Theme: p_i), and NULL otherwise. We create the test data similarly, except that we do not use event annotations for creating the set T but instead identify terms in the sentence that were annotated as an interaction term more than once in the training data.

b. Cause identification: For every pair (t_j, p_k) where t_j is part of some event annotation e_m: (Type: t_j, Theme: p_i) of an event type that can take a cause argument and p_k ∈ P, we create a training instance of the form {w_1, e_1, w_2, e_2, ..., e_{n-1}, w_n, label}, where the input sequence corresponds to the shortest path between p_k and p_i via t_j, and the label is 1 if p_k is the cause of the event e_m, and 0 otherwise.
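The instance construction for theme identification can be sketched as follows; the dict formats are our own shorthand for the path and gold-annotation inputs:

```python
def theme_instances(paths, gold):
    """paths: {(trigger, protein): [node, edge, ..., node]};
    gold: {(trigger, protein): event type}. Pairs without a gold
    event receive the NULL label."""
    return [(path, gold.get(pair, "NULL")) for pair, path in paths.items()]

paths = {("phosphorylate-01", "radixin"):
             ["phosphorylate-01", "arg1", "radixin"],
         ("phosphorylate-01", "LPA"):
             ["phosphorylate-01", "arg2", "induce-01", "arg0", "LPA"]}
gold = {("phosphorylate-01", "radixin"): "Phosphorylation"}
data = theme_instances(paths, gold)
```

Each returned pair is one row of training data: the path sequence to be embedded and its multi-class label.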

LSTM model setup
We implement our LSTM models using the Lasagne library. For the first LSTM model, we use softmax as our non-linear function and optimize the categorical cross-entropy loss using Adam (Kingma and Ba, 2014). For the second LSTM model, we use a sigmoid non-linear function and optimize the binary cross-entropy loss using Adam. We use a dropout of 0.5, a batch size of 100, and a learning rate of 0.001.

Results and Discussion

Table 5 shows the results of our LSTM and distant supervision based event extraction model. We compare our results with the state-of-the-art event extraction system EVEX (Hakala et al., 2013). We report the Approximate Span/Approximate Recursive metric in all our tables (described in the Shared Task (Kim et al., 2013)). The columns on the left (with column heading LSTM) show the performance of our model trained only on the official training data. The columns on the right (with column heading LSTM+Distant Supervision) show the performance of our model trained on the official training data plus the additional training data of 11,792 sentences we gather using our distant supervision strategy.
The table highlights some of our results. Firstly, we note that in cases where we obtain a large number of extra sentences using distant supervision (highlighted in the column "DS Sents"), we see a considerable gain in the recall values between the "LSTM" and "LSTM+Distant Supervision" models. On the contrary, in cases where we extract only a small number, we see a small gain (or sometimes even a decrease in performance). This suggests that we should explore further ways of selecting our extra sentences. Secondly, although the overall performance of our model using the automatic AMR parser is lower than that of the current state-of-the-art system, the gap of 5% in F1 score can hopefully be reduced with the ongoing improvements in AMR parsing.

Related work
The biomedical event extraction task described in this work was first introduced in the BioNLP Shared Task in 2009 (Kim et al., 2009). This task helped shift the focus of relation extraction efforts from identifying simple binary interactions to identifying complex nested events that better represent the biological interactions frequently stated in text. Existing approaches to this task include SVM-based methods (Björne and Salakoski, 2013) and other machine learning approaches (Riedel and McCallum, 2011; Miwa et al., 2010, 2012). Some approaches learn subgraph patterns from the event annotations in the training data and cast event detection as a subgraph matching problem. Non-feature-based approaches like graph kernels compare syntactic structures directly (Airola et al., 2008; Bunescu et al., 2005). Rule-based methods that either use manually crafted rules or generate rules from training data (Cohen et al., 2009; Kaljurand et al., 2009; Kilicoglu and Bergler, 2011; Bui et al., 2013) have obtained high precision on these tasks. In our work, we take inspiration from the Turku Event Extraction System (TEES) (Björne and Salakoski, 2013) (the event extraction system behind EVEX), which has consistently been the top performer in this series of tasks. They represent events using a graph format and break the event extraction task into separate multi-class classification tasks using SVM as their classifier. In our work we go a step further by using a deeper semantic representation as a starting point and identifying subgraphs in the AMR graph.
AMR has been successfully used for deeper semantic tasks such as entity linking (Pan et al., 2015) and abstractive summarization (Mihalcea et al., 2015). Work by Garg et al. (2015) is the first to make use of the AMR representation for extracting interactions from biomedical text. They use graph kernel methods to answer the binary question of whether a given AMR subgraph expresses an interaction or not. Our work departs from theirs in that they concentrate only on binary interactions, whereas we use AMR to identify complex nested events. In addition, our approach makes use of distant supervision to cope with the problem of limited annotated data.
Distant supervision techniques have been successfully used before for relation extraction in the general domain (Mintz et al., 2009). Recent work by Liu et al. (2014) uses a minimal supervision strategy for extracting relations particularly in biomedical texts. Our work departs from theirs in that we introduce a novel AMR path based heuristic to selectively sample the sentences obtained through distant supervision.

Conclusion
In this work, we show the effectiveness of using a deep semantic representation based on Abstract Meaning Representations for extracting complex nested events expressed in biomedical text. We hypothesize that an event structure is an AMR subgraph and empirically validate this hypothesis. To learn to extract such event subgraphs from AMRs automatically, we develop two Recurrent Neural Network based models: one for identifying the theme and the other for identifying the cause of an event. To overcome the dearth of manually annotated data in the biomedical domain, which explains the low recall of event extraction systems, we train our model on additional training data gathered automatically using a selective distant supervision strategy. Our experiments strongly suggest that improvements in AMR parsing, which are expected given the youth of this field, together with the exploitation of larger, manually curated BioPax-like models and collections of biomolecular text, will be catalysts for driving future improvements in this task.