Biomedical Event Extraction Using Convolutional Neural Networks and Dependency Parsing

Event and relation extraction are central tasks in biomedical text mining. Where relation extraction concerns the detection of semantic connections between pairs of entities, event extraction expands this concept with the addition of trigger words, multiple arguments and nested events, in order to more accurately model the diversity of natural language. In this work we develop a convolutional neural network that can be used for both event and relation extraction. We use a linear representation of the input text, where information is encoded with various vector space embeddings. Most notably, we encode the parse graph into this linear space using dependency path embeddings. We integrate our neural network into the open source Turku Event Extraction System (TEES) framework. Using this system, our machine learning model can be easily applied to a large set of corpora from e.g. the BioNLP, DDI Extraction and BioCreative shared tasks. We evaluate our system on 12 different event, relation and NER corpora, showing good generalizability to many tasks and achieving improved performance on several corpora.


Introduction
Detection of semantic relations is a central task in biomedical text mining where information is retrieved from massive document sets, such as scientific literature or patient records. This information often consists of statements of interactions between named entities, such as signaling pathways between proteins in cells, or the combinatorial effects of drugs administered to a patient. Relation and event extraction are the primary methods for retrieving such information.
Relations are usually described as typed, sometimes directed, pairwise links between defined named entities. Automated relation extraction aims to develop computational methods for their detection.
Event extraction is a proposed alternative to relation extraction. Events differ from relations in that they can connect more than two entities, that they have an annotated trigger word (usually a verb) and that events can act as arguments of other events. In the GENIA corpus, a sentence stating "The binding of proteins A and B is regulated by protein C" would be annotated with two nested events REGULATION(C, BIND(A, B)). While events can capture the semantics of text more accurately, their added complexity makes their extraction a more complicated task.
Many methods have been developed for relation extraction, with various kernel methods such as the graph kernel being widely used (Mooney and Bunescu, 2006; Giuliano et al., 2006; Airola et al., 2008). For the more complex task of event extraction, approaches such as pipeline systems (Björne, 2014; Miwa et al., 2010), semantic parsing (McClosky et al., 2011) and joint inference have been explored.
In recent years, the advent of deep learning has resulted in advances in many fields, and relation and event extraction are no exception. Considerable performance increases have been gained with methods such as convolutional (Zeng et al., 2014) and recurrent neural networks (Miwa and Bansal, 2016). Some proposed systems have relied entirely on word embeddings (Quan et al., 2016), while others have developed various network architectures for utilizing parse graphs as an additional source of information (Collobert et al., 2011; Liu et al., 2015; Xu et al., 2015; Ma et al., 2015; Peng et al., 2017a).
In this work we present a new convolutional neural network method for extraction of both events and relations. We integrate our network as a classification module into the Turku Event Extraction System (Björne, 2014, http://jbjorne.github.io/TEES/), allowing it to be easily applied to corpora or texts stored in the TEES XML format. Our neural network model is characterized by a unified representation of input examples that can be applied to detection of both keywords and their relations.

Corpora
We develop and evaluate our method on a large number of event and relation corpora (See Table 1).
These corpora originate from three BioNLP Shared Tasks (Kim et al., 2009, 2011; Nédellec et al., 2013), the two Drug-Drug Interaction (DDI) Extraction tasks (Segura-Bedmar et al., 2011, 2013) and the recent BioCreative VI Chemical-Protein relation extraction task (Krallinger et al., 2017). The BioNLP corpora cover various domains of molecular biology and provide the most complex event annotations. The DDI and BioCreative corpora use pairwise relation annotations, and one of the DDI corpora also defines a drug named entity recognition (NER) task.
All of these corpora are used in the TEES XML format and are installed or generated with the TEES system. The corpora are parsed with the TEES preprocessing pipeline, which utilizes the BLLIP parser (Charniak and Johnson, 2005) with the McClosky biomodel (McClosky, 2010), followed by conversion of the constituency parses into dependency parses with the Stanford Tools (de Marneffe et al., 2006). These tools generate the deep parse graph which is used as the source for our dependency path features.

TEES Overview
The TEES system is based around a graph representation of events and relations. Named entities and event triggers are nodes, and relations and event arguments are the edges that connect them. An event is modelled as a trigger node and its set of outgoing edges. For a detailed overview of TEES we refer to Björne (2014).
TEES works as a pipeline method that models relation and event extraction as four consecutive classification tasks (See Figure 2). The first stage is entity detection where each word token in a sentence is classified as an entity or a negative, generating the nodes of the graph. This stage is used in NER tasks as well as for event trigger word detection. The second stage is edge detection where relations and event arguments are predicted for all valid pairs of named entity and trigger nodes. For relation extraction tasks where entities are given as known data this is the only stage used.
In the entity detection stage TEES predicts a maximum of one entity per word token. However, since events are n-ary relations, event nodes may overlap. The unmerging stage duplicates event nodes by classifying each candidate event as a real event or not. Finally, modifier detection can be used to detect event modality (such as speculation or negation) on corpora where this is annotated.

Neural Network Overview
In TEES the four classification stages are implemented as multiclass classification tasks using the SVM multiclass support vector machine (Tsochantaridis et al., 2005) and a rich set of features derived mostly from the dependency parse graph.
We develop our convolutional neural network method using the Keras (Chollet et al., 2015) package with the TensorFlow backend (Dean et al., 2015). We extend the TEES system by replacing the SVM-based classifier modules with our network, using various vector space embeddings as input features. Our neural network design follows a common approach in NLP where the input sequence is processed by parallel convolutional layers (Kim, 2014;Zeng et al., 2014;Quan et al., 2016).
We use the same basic network structure for all four TEES classification stages (See Figure 1).
The input examples are modelled in the context of a sentence window, centered around the candidate entity, relation or event. The sentence is modelled as a linear sequence of word tokens. Each word token is mapped to relevant vector space embeddings. These embeddings are concatenated together, resulting in an n-dimensional vector for each word token.
This merged input sequence is processed by a set of 1D convolutions with window sizes 1, 3, 5 and 7. Global max pooling is applied for each convolutional layer and the resulting features are merged together into the convolution output vector. This output vector is fed into a dense layer of 200-800 neurons, which is connected to the final classification layer where each label is represented by one neuron. The classification layer uses sigmoid activation, and the other layers use relu activation. Classification is performed as multilabel classification where each example may have 0-n positive labels. We use the adam optimizer with binary crossentropy and a learning rate of 0.001.

Figure 1: We use the same network architecture for all four classification tasks, with only the input embeddings and the number of predicted labels changing between tasks. This figure demonstrates entity detection where the word "involve" is being classified as belonging to one or more of 10 entity types. The input sentence is padded for the fixed example length (11 in this case) and the word tokens are mapped to corresponding embeddings, in this example word vectors, POS tags, relative positions and distances. The embedding vectors are merged together before the convolutional layers, and the results of the convolution operations are likewise merged before the dense layer, after which the final layer shows the predicted labels.
Dropout of 0.1-0.5 is applied at two stages in the system to increase generalization. Weights are learned for all input embeddings except for the word vectors, where we use the original weights as-is to ensure generalization to words outside the task's training corpus.
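A minimal Keras sketch of the architecture described above is given below. It is our own reconstruction, not the released implementation: it takes the already-merged per-token embedding sequence as input (the default dimensionality of 232 assumes 200-dimensional word vectors plus four 8-dimensional learned embeddings), and the dropout placement is an illustrative choice within the ranges stated in the text.

```python
from tensorflow.keras import layers, Model

def build_classifier(seq_len=11, embedding_dim=232, n_labels=10,
                     filters=256, dense_units=400, dropout=0.5):
    """Sketch of the shared network: parallel 1D convolutions over the
    merged embedding sequence, global max pooling, a dense layer and a
    sigmoid multilabel output."""
    inp = layers.Input(shape=(seq_len, embedding_dim))
    x = layers.Dropout(dropout)(inp)
    # Parallel convolutions with window sizes 1, 3, 5 and 7,
    # each followed by global max pooling.
    pooled = [layers.GlobalMaxPooling1D()(
                  layers.Conv1D(filters, w, activation='relu')(x))
              for w in (1, 3, 5, 7)]
    merged = layers.Concatenate()(pooled)
    merged = layers.Dropout(dropout)(merged)
    hidden = layers.Dense(dense_units, activation='relu')(merged)
    # Sigmoid activation: each example may have 0-n positive labels.
    out = layers.Dense(n_labels, activation='sigmoid')(hidden)
    model = Model(inp, out)
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model
```

Only the input embeddings and the size of the output layer would change between the four classification stages.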

Input Embeddings
All of the features used by our system are represented as embeddings, sets of vectors where each unique input item (such as a word string) maps to its own n-dimensional vector. The type and number of embeddings we use varies by classification task and is used to model the unique characteristics of each task (See Figure 2). The pre-made word vectors we use are 200-dimensional and the rest of the embeddings (learnt from the input corpus) are 8-dimensional.
Word Vectors are the most important of these embeddings. We use word2vec (Mikolov et al., 2013) vectors induced on a combination of the English Wikipedia and millions of biomedical research articles from PubMed and PubMed Central.

POS (part-of-speech) tags generated by the BLLIP parser are used to define the syntactic categories of the words.
Entity features are used in cases where such information is already available, as in relation extraction where the pairs of entities are already known, or in event extraction where named entities are predefined.
Distance features follow the approach proposed by Zeng et al. (2014) where the relative distances to tokens of interest are mapped to their own vectors.
Relative Position features are used to mark whether tokens are located (B)efore, (A)fter or in the (M)iddle of the classified structure, or if they form a part of it as entities, event triggers or arguments. These features aim to model the context of the example in a manner somewhat similar to the shallow linguistic kernel of Giuliano et al. (2006).
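As an illustration of how the distance and relative position features could be mapped to embedding-table indices, consider the following sketch. The clipping range, function names and the single 'E' tag for structure-internal tokens are our own simplifying assumptions, not the system's actual values (the text above notes that entities, triggers and arguments receive their own markers).

```python
def distance_index(token_pos, center_pos, max_dist=10):
    """Map a token's signed linear distance to a token of interest into
    a bounded index for a distance-embedding lookup table.  The clipping
    range max_dist is an illustrative choice."""
    d = token_pos - center_pos
    d = max(-max_dist, min(max_dist, d))
    return d + max_dist  # shift to a non-negative index: 0 .. 2*max_dist

def relative_position(token_pos, e1_pos, e2_pos):
    """Tag a token as (B)efore, in the (M)iddle of, or (A)fter the
    classified structure; tokens that form part of the structure get a
    single simplified 'E' tag here."""
    if token_pos in (e1_pos, e2_pos):
        return 'E'
    if token_pos < min(e1_pos, e2_pos):
        return 'B'
    if token_pos > max(e1_pos, e2_pos):
        return 'A'
    return 'M'
```

Each returned index or tag would then select one learned 8-dimensional vector from the corresponding embedding table.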
Path Embeddings describe the shortest undirected path from a token of interest to another token in the sentence. Multiple sets of vectors (0-4), one for directed dependencies at each distance, are used for the dependencies of the path. For example, if paths of depth 4 are used, a shortest path of three directed dependencies connecting two tokens of interest could be modelled with four embedding vectors, e.g. ←dobj, nsubj→, nn→, NULL. Our path embeddings are inspired by the concept of distance embeddings used by Zeng et al. (2014): since it is possible to model linear distances between tokens in the input sentence, it is also possible to model any other distance between these tokens, in our case paths in the dependency parse.

Shortest Path Embeddings follow the approach of the n-gram features used in methods such as the TEES SVM system. The shortest path consists of the tokens and dependencies connecting the two entities of a candidate relation. For each token on the path we define two embeddings, one for the incoming and one for the outgoing dependency. For example, if the shortest path would consist of three tokens and the two dependencies connecting them, the shortest path embedding vectors for the three tokens could be e.g. ([begin], ←nsubj), (←nsubj, dobj→), (dobj→, [end]). Thus, our shortest path embeddings can be seen as a more detailed extrapolation of the "on dep path embeddings" of Fu et al. (2017).

Figure 2: System stages. The TEES pipeline performs event extraction in four consecutive stages, generating first 1. the nodes (entities) and 2. edges (relations) of the event graph, which is then "pulled apart" in 3. unmerging, followed optionally by 4. modifier detection. The example being classified is shown with a dotted line in each image, and other examples in the same sentence with light gray dotted lines. We replace the four SVM classification stages in the TEES pipeline with convolutional neural networks. In place of the rich feature representations we use a sentence model where word token and dependency parse information is represented by embeddings. The word vector, POS and entity features are straightforwardly produced from the information of each token. The distance and relative position features model the position of the token in the sentence. The path features mark the dependencies connecting each token to a token of interest (candidate entity or relation endpoint). The shortest path features mark the set of dependencies forming the shortest path for a candidate relation. In the unmerging stage candidate event arguments are also used as features.
Event Argument Embeddings are used only in the unmerging stage where predicted entities and edges are divided into separate events.
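The dependency-path lookup described above can be sketched as a breadth-first search over the undirected dependency graph. This is our own illustrative reconstruction, with the arrow direction markers written as '<label' and 'label>' in place of ← and →; the function names and triple format are assumptions, not the TEES implementation.

```python
from collections import deque

def shortest_dep_path(n_tokens, dependencies, source, target):
    """Breadth-first search for the shortest undirected path between two
    tokens in a dependency parse.  Each dependency is a
    (head, dependent, label) triple; returned labels carry the traversal
    direction, e.g. '<dobj' when walking from dependent to head."""
    adj = {i: [] for i in range(n_tokens)}
    for head, dep, label in dependencies:
        adj[head].append((dep, label + '>'))   # head -> dependent
        adj[dep].append((head, '<' + label))   # dependent -> head
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for nxt, lab in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [lab]))
    return None  # no path, e.g. a fragmented parse

def pad_path(path, depth=4):
    """Truncate or pad the label sequence to the fixed number of
    path-embedding slots, filling unused slots with 'NULL'."""
    return (path[:depth] + ['NULL'] * depth)[:depth]
```

A path of three directed dependencies would thus yield labels such as ['<nsubj', 'dobj>', 'nn>', 'NULL'] at depth 4, mirroring the example in the text.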

Parameter Optimization
When developing our system, we use the training set for learning, and the development set for parameter validation. We use the early stopping approach where the network is trained until the validation loss no longer decreases. We train for up to 500 epochs, stopping once validation loss has no longer decreased for 10 consecutive epochs.
Neural network models can be very sensitive to the initial random weights. Despite the relatively large training and validation sets, our model exhibits performance differences of several percentage points with different random seeds. In the current TensorFlow backend it is not possible to efficiently fix the random seed, and in any case this would be unhelpful, as the impact of any given seed varies with the training data. Instead, we compensate for the inherent randomness of the network by training multiple models with randomized initializations and use as the final model the one which achieved the best performance on the validation set (measured using the micro-averaged F-score).
In addition to the random seed and optimal epoch, neural networks depend on a large number of hyperparameters. We use the process of training multiple randomized models also for parameter optimization. In addition to varying the random seed, we pick a random combination of hyperparameters from the ranges to be optimized, so that different models are randomized both in terms of initialization and parameters. We test sizes of 200, 400 and 800 for the final dense layer, filter sizes of 128, 256 and 512 for the convolutional layers and dropout values of 0.1, 0.2 and 0.5. In addition, we experiment with path depths of 0-4 for the path embeddings.

Training a single model can still be prone to overfitting if the validation set is too small. To improve generalizability, we explore the use of model ensembles. Instead of using the best randomized model as the final one, we take the n best models, ranked by micro-averaged F-score on the validation set, and use their averaged predictions. These ensemble predictions are calculated for each label as the average of all the models' predicted confidence scores.
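The combined random search and n-best ensembling could be sketched as follows. The function names and data layout are our own assumptions; only the hyperparameter ranges come from the text.

```python
import random

# Hyperparameter ranges tested in the text.
PARAM_SPACE = {
    'dense_units': [200, 400, 800],
    'filters': [128, 256, 512],
    'dropout': [0.1, 0.2, 0.5],
    'path_depth': [0, 1, 2, 3, 4],
}

def sample_params(rng):
    """Pick one random combination of hyperparameters for a model."""
    return {k: rng.choice(v) for k, v in PARAM_SPACE.items()}

def ensemble_predict(model_scores, n_best, val_fscores):
    """Average the per-label confidence scores of the n best models,
    ranked by validation-set micro-averaged F-score.
    model_scores[i][j] is model i's confidence for label j."""
    ranked = sorted(range(len(model_scores)),
                    key=lambda i: val_fscores[i], reverse=True)[:n_best]
    n_labels = len(model_scores[0])
    return [sum(model_scores[i][j] for i in ranked) / n_best
            for j in range(n_labels)]
```

Each sampled parameter combination would be paired with a fresh random initialization, so models differ both in weights and in hyperparameters.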
With SVMs or random forests it is possible to "refit" a classifier after parameter selection, by retraining on the combined training and optimization sets, and this approach is also used by the TEES SVM classifiers. With the neural network, we cannot retrain with the validation set, as there would be no remaining data for detecting the optimal epoch. We approach also this issue using model ensembles. As the final source of randomization, we randomly redistribute the training and validation set documents before training each model. In this manner, the n-best models will together cover a larger part of the training data.
By training a large set of randomized models and using the n-best ones, we aim to address the effect of random initialization, parameter optimization and coverage of training data using the same process. However, with the size of the corpora used, training even a single model is relatively time consuming. In practice we are able to train only around 20 models for each of the four stages of the classification pipeline. Thorough parameter optimization comparable to the SVM system is thus not computationally feasible with the neural network, but good performance on varied corpora indicates that the current approach is at least adequate.

Results and Discussion
The results of applying our proposed system on the various corpora are shown in Table 2. We compare our method with previous results from the shared tasks for which these corpora were introduced, as well as with later research. In the next sections we analyse our results for the different corpus categories.

The BioNLP Event Extraction Tasks
The BioNLP Event Extraction tasks provide the most complex corpora with often large sets of event types and at times relatively small corpus sizes. The GENIA corpora from 2009 and 2011 have been the subject of most event extraction research. Our proposed method achieves F-scores of 57.84 and 58.10 on GE09 and GE11, respectively. Compared to the best reported results of 58.27 (Miwa et al., 2012) and 58.07 (Venugopal et al., 2014), our method shows similar performance on these corpora.
Our CNN reimplementation of TEES outperforms the original TEES SVM system on all the BioNLP corpora. In addition, we achieve, to the best of our knowledge, the highest reported performance on the GE11, EPI11, REL11, CG13 and PC13 BioNLP Shared Task corpora.
The annotations for the test sets of the BioNLP Shared Task corpora are not provided; instead, users upload their predictions to the task organizers' servers for evaluation. While this method provides very good protection against overfitting and data leaks, unfortunately many of these evaluation servers are no longer working. Thus, we were able to evaluate our system on only a subset of all existing BioNLP Shared Task corpora.

The Drug-Drug Interactions Tasks
There have been two instances of the Drug-Drug Interactions Shared Task. The first one in 2011 concerned the detection of untyped relations for adverse drug effects. Unlike the other corpora, no official evaluator system or program exists for this corpus so we use our own F-score calculation. The lower performance compared to the original TEES system warrants further examination, but in any case the DDI11 corpus has been mostly superseded by the more detailed DDI13 corpora.
On the DDI13 corpora task 9.1, drug named entity recognition, our CNN system performs better than the original TEES entry, but neither of these TEES versions can detect more than single-token entities so they are not well suited for this task. Nevertheless, this result demonstrates the potential applicability of our method also to NER tasks.
Of all the DDI corpora, the DDI13 task 9.2 corpus, typed relation extraction, has been the subject of much neural network based research in the past few years. A large variety of methods have been tested, and good results have been achieved by highly varying network models, some of which use no parsing or graph-like features, such as the multichannel convolutional neural network of Quan et al. (2016), which combines multiple sets of word vectors and achieves an F-score of 70.21. The highest result of 73.5 so far has been reported by Lim et al. (2018), who used a binary tree-LSTM model ensemble; compared to this, our system achieves minutely higher, in practice comparable performance. Most recent DDI13 systems use corpus-specific rules for filtering negative candidate relations from the training data, which usually results in performance gains. As we aim to develop a generic method easily applicable to any corpus, we did not implement these DDI filtering rules.

The CHEMPROT Task
Of all the evaluated corpora the CHEMPROT corpus used in the BioCreative VI Chemical-Protein relation extraction task is the most recent. Thus it provides an interesting point of comparison with current methods in relation extraction. All of our models outperform the task-winning system combination of Peng et al. (2017b), with our mixed five-model ensemble achieving a 5 pp increase over the shared task winning result. The CHEMPROT corpus is relatively large compared to its low number of five relation types, possibly making learning easier for our system.

Table 3: The effect of path embeddings. The impact of using increasing depths of paths for embeddings is shown in terms of averaged F-score on the development set for entity (n) and edge (e) detection.

Effect of Deep Parsing
Compared to neural networks which use either only word vectors, or model parse structure at the network level (e.g. graph-LSTMs), an interesting aspect of our method is that it can function both with and without parse information. By turning dependency path features on and off we can evaluate the impact of deep parsing on the system (See Table 3). The path embeddings have the most impact on GE11 entity detection, where these paths link the entity candidate token to each other token. In GE11 event argument extraction the role of the path context embeddings is diminished. Surprisingly, on the DDI13 9.2 relation corpus path embeddings reduce performance, perhaps due to very long sentences and very indirect relations between the entity pairs. However, on another relation corpus, the CHEMPROT corpus, the path embeddings again increase performance, perhaps indicating that the CHEMPROT relation annotations follow sentence syntax more closely.

Computational Requirements
Our system improves on performance compared to the SVM-based TEES, but at the cost of increased computational requirements. The neural network effectively requires a specialized GPU for training and even then training times can be an issue.
For example, training the original TEES system on a four-core CPU for the GE09 task takes about 3 hours and classification of the test set with this model can be done in four minutes. For comparison, our GE09 neural network with 20 models for all four stages takes around nine hours to train on a Tesla P100 GPU. However, test set classification with a single model takes only about three minutes and using a five model ensemble about ten minutes.
Thus, while training the proposed method is much slower, classification can be performed relatively quickly. While the hardware and time requirements are much higher than with the SVM system, our proposed system can for some corpora achieve performance increases of even 10 pp. In most applications such gains are likely worth the increased computational requirements.

Conclusions
We have developed a convolutional neural network system that together with different vector space embeddings can be applied to diverse text classification tasks. We replace the TEES system's event extraction pipeline components with this network and demonstrate considerable performance gains on a set of large event and relation corpora, achieving state-of-the-art performance on many of them and the best reported performance on the GE11, EPI11, REL11, CG13, PC13, DDI13 9.2 and CP17 corpora.
To the best of our knowledge our system represents the first application of neural networks to extraction of complex events from the BioNLP GENIA corpora. Our system uses a unified linear sentence representation where graph analyses such as dependency parses are fully included using our dependency path embeddings, and we demonstrate that these path embeddings can increase the performance of the convolutional model. Unlike systems where separate subnetworks are used to model graph structures, our network receives all of the information through the unified linear representation, allowing the whole model to learn from all the features.
The Turku Event Extraction System provides a unified approach for utilizing a large number of event and relation extraction corpora. As we integrate our proposed convolutional neural network method into the TEES system, it can be used as easily as the original TEES system, with the framework handling tasks such as preprocessing and format conversions. Our Keras-based neural network implementation can also be extended and modified, allowing continued experimentation on the wide set of corpora supported by TEES. We publish our method and our trained neural network models as part of the TEES open source project (http://jbjorne.github.io/TEES/).