External Evaluation of Event Extraction Classifiers for Automatic Pathway Curation: An extended study of the mTOR pathway

This paper evaluates the impact of various event extraction systems on automatic pathway curation using the popular mTOR pathway. We quantify the impact of training data sets as well as different machine learning classifiers and show that some improve the quality of automatically extracted pathways.


Introduction
Biological pathways encode sequences of biological reactions, such as phosphorylation and activation, involving various biological species, such as genes and proteins (Aldridge et al., 2006; Kitano, 2002). Studying and analyzing pathways is crucial to understanding biological systems and to developing effective disease treatments and drugs (Creixell et al., 2015; Khatri et al., 2012). There have been numerous efforts to reconstruct detailed process-based and disease-level pathway maps, such as the Parkinson's disease map (Fujita et al., 2014), the Alzheimer's disease map (Mizuno et al., 2012), the mTOR pathway map (Caron et al., 2010), and the TLR pathway map (Oda and Kitano, 2006). Traditionally, these maps are constructed and curated by expert pathway curators who manually read numerous biomedical documents, comprehend and assimilate the knowledge in them, and construct the pathway.
With the increasing number of scientific publications, manual pathway curation is becoming increasingly infeasible. Therefore, Automated Pathway Curation (APC) and semi-automated biological knowledge extraction have been an active research area (Ananiadou et al., 2010; Ohta et al., 2013; Szostak et al., 2015) that tries to overcome the limitations of manual curation using various techniques, from hand-crafted NLP systems (Allen et al., 2015) to machine learning techniques (Björne et al., 2011). Machine-learning NLP systems, in particular, show good performance in BioNLP tasks, but they still perform less well in automated pathway curation, partly because there have been few attempts to measure the performance of NLP systems for APC directly.
Recently, there has been some attempt at remedying the situation and new datasets and evaluation measures have been proposed. For instance, Spranger et al. (2016) use the popular human-generated mTOR pathway map (Caron et al., 2010;Efeyan and Sabatini, 2010;Katiyar et al., 2009) and quantify the performance of a particular APC system and its ability to recreate the complete pathway automatically. Results reported were mixed.
One of the key components in such APC systems is identification of triggers, events and their relationships. These machine learning-based systems are essentially just supervised classification components.
This paper explores whether we can improve the results of automated pathway curation for the mTOR pathway by using different training datasets and learning algorithms. We show that the choice of event extraction classifier increases F-score by up to 20% compared to a state-of-the-art system. Our results also show that, within limits, the choice of training data has significantly less impact on results than the choice of classifier. Finally, our results suggest that additional research is necessary to solve the problem of APC.

Automatic Pathway Curation
We constructed an automatic pathway curation system that takes as input scientific articles in PDF format and transforms them into SBML-encoded, annotated pathway maps. The pipeline has multiple steps.
1. PDFs are translated into pure text files using the cermine tool.
2. Preprocessing provides tokenization, POS tagging, dependency and syntax parsing.
3. An event extraction system extracts the mentions of entities (genes, proteins etc.), reactions (e.g. phosphorylation) and their arguments (theme, cause, product).
4. A converter constructs pathways from the information provided by the event extraction system.
5. An annotation system maps extracted entities and events to Entrez Gene identifiers and SBO terms.
The following sections detail steps 3 to 5.

Event Extraction
We used the Turku Event Extraction System (TEES) for event extraction (Björne et al., 2010). This system is one of the most successful BioNLP systems. It has not only won 1st place in BioNLP competitions but was also the only NLP system that participated in all BioNLP-ST 2013 tasks (Björne et al., 2012 [...] Salakoski, 2015) allow the SVM classifiers to be exchanged easily for other supervised classification algorithms. For example, all scikit-learn multiclass supervised learning algorithms that support sparse feature matrices can be applied (Pedregosa et al., 2011). This makes it possible to test different algorithms for the event extraction task and for automatic pathway extraction. For this paper, we exchange the classifiers in all steps 1-4, as described in Section 3. The output of TEES is a standoff-formatted representation of entities and events.

Conversion of Standoff to SBML Pathways
In principle, events and entities extracted by TEES correspond to biological species and reactions. We translate the NLP representation into SBML, the standard XML-based markup language for representing biological models (Hucka et al., 2003). SBML essentially encodes models using biological players called sbml:species. An sbml:species can participate in interactions, called sbml:reaction. Species participate in an interaction as sbml:reactant, sbml:product or sbml:modifier. The basic idea is that some quantity of a reactant is consumed to produce a product. Reactions are influenced by modifiers. The mapping algorithm is adopted from and described in more detail in Spranger et al. (2015).
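To make the species/reaction structure concrete, the following minimal sketch builds an SBML-like XML tree for a single extracted event. It is only an illustration and not the mapping algorithm of Spranger et al. (2015): the function name `event_to_sbml`, the example event, and the assignment of theme to reactant and cause to modifier are assumptions made here, and required SBML attributes such as the namespace and compartments are omitted.

```python
import xml.etree.ElementTree as ET


def event_to_sbml(reaction_id, reactants, products, modifiers):
    """Build a minimal SBML-like tree for a single extracted event (illustrative)."""
    sbml = ET.Element("sbml", {"level": "2", "version": "4"})
    model = ET.SubElement(sbml, "model", {"id": "extracted_pathway"})

    # Every entity taking part in the event becomes one species.
    species_list = ET.SubElement(model, "listOfSpecies")
    for sid in sorted(set(reactants) | set(products) | set(modifiers)):
        ET.SubElement(species_list, "species", {"id": sid, "name": sid})

    reaction_list = ET.SubElement(model, "listOfReactions")
    reaction = ET.SubElement(reaction_list, "reaction", {"id": reaction_id})
    roles = [
        ("listOfReactants", "speciesReference", reactants),
        ("listOfProducts", "speciesReference", products),
        ("listOfModifiers", "modifierSpeciesReference", modifiers),
    ]
    for list_tag, ref_tag, members in roles:
        if members:  # skip roles that have no participants
            role_list = ET.SubElement(reaction, list_tag)
            for sid in members:
                ET.SubElement(role_list, ref_tag, {"species": sid})
    return sbml


# Hypothetical phosphorylation event: theme S6K1, cause mTORC1 (assumed mapping).
tree = event_to_sbml(
    "phosphorylation_1",
    reactants=["S6K1"],
    products=["S6K1_phosphorylated"],
    modifiers=["mTORC1"],
)
print(ET.tostring(tree, encoding="unicode"))
```

The reactant is consumed, the product is produced, and the modifier only influences the reaction, which mirrors the three edge labels used later in the evaluation.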

SBO/GO, Entrez Gene Annotations
The SBML-encoded, automatically extracted pathway is further annotated using Systems Biology Ontology (SBO) (Le Novère, 2006) and Gene Ontology (GO) terms. SBO also provides a class hierarchy for reaction types. For instance, the NLP system identifies phosphorylation reactions, which are a subclass of conversion reactions. All reactions in the data are automatically annotated with SBO/GO terms (coverage 100%) using an annotation scheme detailed in Spranger et al. (2015).
Species (e.g. proteins, genes) were annotated using GNAT (Hakenberg et al., 2011), a publicly available gene/protein named entity recognition and normalization tool. GNAT returns a set of Entrez Gene identifiers (Maglott et al., 2005) for each input string. Species were annotated using all returned Entrez Gene identifiers for a particular species (organism human). We call the set of Entrez Gene identifiers returned by GNAT for a species its Entrez Gene signature.

Classifiers for Event Extraction
In this paper we evaluate classifiers for event extraction (Section 2.1) and their impact on the overall performance of the automatic pathway extraction system. We compare the following classifiers:
• Support Vector Machines (SVM) are the default TEES classifier (Joachims, 1999). The implementation is optimized for linear classification and its performance scales linearly with the number of training examples.
• Decision Tree (DT) creates a model that predicts the target value by learning simple decision rules inferred from the training data. Compared to the other techniques, decision trees are relatively fast: the cost of using a tree is logarithmic in the number of examples. We use the Gini impurity criterion to evaluate the quality of a split.
• Random Forest (RF) classifiers fit an ensemble of decision tree classifiers, each built from a bootstrap sample of the training set. The best split of a node is chosen only from a random subset of the features, not from all features. The individual trees are combined by averaging their probabilistic predictions. A single tree has a higher bias but, due to averaging, the variance of the random forest as a whole decreases.
• Multinomial Naive Bayes (MNNB) is an implementation of the naive Bayes algorithm for multinomial data, one of the classic variants used in the classification of discrete features (e.g. text classification). The additive smoothing parameter was set to 1.
• Multi-layer Perceptron (MLP) is a feedforward neural network model. We use a hidden layer with 100 neurons and the rectified linear unit activation function. We optimize the logarithmic loss using stochastic gradient descent. The learning rate is constant and equal to 0.001.
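The classifier settings listed above could be expressed in scikit-learn roughly as follows. This is a sketch under stated assumptions, not the configuration used in the experiments: `LinearSVC` is only a stand-in for the default TEES SVM, the number of trees, the random seeds and `max_iter` are not given in the text, and the sparse toy matrix merely stands in for the TEES feature vectors.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    # Stand-in for the default TEES SVM (assumption).
    "SVM": LinearSVC(),
    # Gini impurity as split criterion.
    "DT": DecisionTreeClassifier(criterion="gini", random_state=0),
    # Bootstrap samples and random feature subsets are the scikit-learn defaults;
    # the number of trees is not stated in the text, so the default is kept.
    "RF": RandomForestClassifier(random_state=0),
    # Additive smoothing parameter set to 1.
    "MNNB": MultinomialNB(alpha=1.0),
    # One hidden layer of 100 ReLU units, SGD with constant learning rate 0.001.
    "MLP": MLPClassifier(
        hidden_layer_sizes=(100,),
        activation="relu",
        solver="sgd",
        learning_rate="constant",
        learning_rate_init=0.001,
        max_iter=50,  # kept small for the toy example (assumption)
        random_state=0,
    ),
}

# Toy sparse, non-negative feature matrix with three classes (made-up data).
rng = np.random.default_rng(0)
X = csr_matrix(rng.integers(0, 3, size=(30, 8)).astype(np.float64))
y = np.arange(30) % 3

# All five estimators accept sparse input, so they can be swapped freely.
predictions = {name: clf.fit(X, y).predict(X) for name, clf in classifiers.items()}
for name, predicted in predictions.items():
    print(name, predicted[:6])
```

Because every estimator exposes the same `fit`/`predict` interface on sparse matrices, exchanging one classifier for another only requires changing the dictionary entry.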

Training Datasets
In order to quantify the impact of training data, we test the following three training sets.
• ANN consists of 60 abstracts of scientific papers from the PubMed database related to the mTOR pathway map. This dataset was human-annotated for NLP system training (Ohta et al., 2011, Corpus annotations (c) GENIA Project).
• GE11 consists of 908 abstracts and full texts of scientific papers used in BioNLP ST 2011 GENIA Event Extraction task as training data (Kim et al., 2012).
• PC13 consists of 260 abstracts of scientific papers used in BioNLP ST 2013 Pathway Curation task as training data (Ohta et al., 2013). The task goal was to evaluate the applicability of event extraction systems to support the automatic curation and evaluation of biomolecular pathway models.
The overall corpus statistics are summarized in Table 1. GE11 and PC13 have the largest number of annotated events. ANN is much smaller in comparison. Also, the distribution of event types differs between datasets (Table 2). GE11 uses more general terms (Binding, Regulation) compared to PC13, where some specific events appear only a few times (Deacetylation, Hydroxylation, Methylation).
We use the GE11-Devel dataset of BioNLP ST 2011 for hyperparameter optimization of all classifiers.

Test Data
Performance of classifiers is tested on the mTOR pathway map (Caron et al., 2010). The map was constructed by expert human curators using 522 full-text papers from the PubMed database. The experts curated a single large map using CellDesigner (Funahashi et al., 2008), a software tool for modeling and executing mechanistic models of pathways. CellDesigner represents information using a heavily customized XML-based SBML format (Hucka et al., 2003).
Target: human expert data. We translate the curator map into standard SBML and further enrich the information using SBO/GO and Entrez Gene annotations. For SBO/GO, we use existing annotations provided by curators and extend them by automatic annotations deduced from reactants and products of reactions. For example, if a phosphoryl group is added in a reaction, it is annotated using the SBO term for phosphorylation. Each reaction may be annotated with multiple SBO/GO terms. We also annotate the curated map with Entrez Gene identifiers (similar to the automatic extraction data). We call this pathway TARGET.
Testing classifiers. The 522 full-text papers used by human curators for the construction of the mTOR pathway are used for evaluating the different text mining classifiers. For this, we plug (trained) classifiers into the automatic pathway extraction pipeline, which performs preprocessing, event extraction, conversion to SBML and annotation (see also Section 2). The output is an annotated SBML file that is subsequently compared to the human-curated SBML-encoded pathway data.

Evaluation
Evaluation of the classifiers (and the system as a whole) is performed by comparing the automatically extracted pathway with the hand-curated pathway. Spranger et al. (2016) propose a number of graph overlap algorithms for quantifying the difference and similarity of two pathways. Here we employ the same measures. The following summarizes the strategies.
Species. In order to decide whether species in two pathways are the same, we use their names and their Entrez Gene signatures.
nmeq: Two species are equal if their names are exactly equal. We remove certain prefixes from the names (e.g. phosphorylated).
appeq: Two species are equal if their names are approximately equal. Two names are approximately equal iff their Levenshtein-based string similarity score is above 90 (Levenshtein, 1966).
enteq: Two species are equal if their Entrez Gene identifiers are exactly equal. This basically translates to the two species' bqbiol:is identifier sets being exactly the same (order does not matter).
entov: Two species are equal if their Entrez Gene identifier sets overlap. This basically translates to the two species' bqbiol:is identifier sets overlapping.
wc: Human-curated data contains complex species that contain other species as constituents (species that consist of various proteins etc.). wc allows species to match with constituents of complexes.
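The species matching strategies above can be sketched as simple predicates. The following is an illustration under assumptions rather than the evaluation code itself: the normalization of the Levenshtein distance to a 0-100 similarity score, the prefix list, the dictionary layout, and the placeholder identifiers `id1` and `id2` are all hypothetical choices made for this example.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Edit distance rescaled to 0-100 (this normalization is an assumption)."""
    longest = max(len(a), len(b))
    return 100.0 if longest == 0 else 100.0 * (1 - levenshtein(a, b) / longest)


PREFIXES = ("phosphorylated ",)  # hypothetical prefix list


def strip_prefix(name):
    for prefix in PREFIXES:
        if name.startswith(prefix):
            return name[len(prefix):]
    return name


def species(name, entrez=(), constituents=()):
    return {"name": name, "entrez": set(entrez), "constituents": list(constituents)}


def nmeq(a, b):
    return strip_prefix(a["name"]) == strip_prefix(b["name"])


def appeq(a, b):
    return similarity(strip_prefix(a["name"]), strip_prefix(b["name"])) > 90


def enteq(a, b):
    # Empty signatures are treated as non-matching (assumption).
    return bool(a["entrez"]) and a["entrez"] == b["entrez"]


def entov(a, b):
    return bool(a["entrez"] & b["entrez"])


def wc(match, a, b):
    """Apply `match` to the species themselves and to their constituents."""
    left = [a] + a["constituents"]
    right = [b] + b["constituents"]
    return any(match(x, y) for x in left for y in right)


mtor = species("mTOR", entrez={"id1"})
raptor = species("Raptor", entrez={"id2"})
complex_species = species("mTORC1", constituents=[mtor, raptor])
print(nmeq(mtor, complex_species), wc(nmeq, mtor, complex_species))
```

In this sketch `wc` is a wrapper around any of the other predicates, which reflects how strategies such as nmeq/wc are written as combinations in the text.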
Reactions. Reactions match based on their SBO/GO annotations.
sboeq: Two reactions are equal iff their signatures are exactly the same. That is, the whole set of SBO/GO terms of one reaction is the same as that of the other reaction.
sboov: Two reactions are equal iff their signatures overlap. That is, the intersection of the set of SBO/GO terms of one reaction with the set of SBO/GO terms of the other reaction is not empty.
sboisa: Two reactions are equal iff there is at least one SBO/GO term in each signature such that the two terms stand in an is-a relationship in the SBO reaction type hierarchy. For instance, if there is a phosphorylation reaction and a conversion reaction, then sboisa will match because phosphorylation is a subclass of conversion according to the SBO type hierarchy.
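The three reaction matching strategies can be illustrated with set operations over reaction signatures. The sketch below is a toy under assumptions: the hierarchy `PARENT` uses term names instead of real SBO identifiers and contains only the phosphorylation/conversion example from the text, and treating identical terms as an is-a match is a choice made here.

```python
# Toy is-a hierarchy (child -> parent); names stand in for SBO terms.
PARENT = {"phosphorylation": "conversion", "conversion": None, "binding": None}


def ancestors(term):
    """All terms reachable by following is-a links upwards."""
    found = set()
    parent = PARENT.get(term)
    while parent is not None:
        found.add(parent)
        parent = PARENT.get(parent)
    return found


def sboeq(sig_a, sig_b):
    return set(sig_a) == set(sig_b)


def sboov(sig_a, sig_b):
    return bool(set(sig_a) & set(sig_b))


def sboisa(sig_a, sig_b):
    # Identical terms also count as a match here (assumption).
    return any(a == b or a in ancestors(b) or b in ancestors(a)
               for a in sig_a for b in sig_b)


extracted = {"phosphorylation"}
curated = {"conversion"}
print(sboeq(extracted, curated), sboov(extracted, curated), sboisa(extracted, curated))
```

Each strategy is strictly looser than the previous one in this sketch: signatures that satisfy sboeq also satisfy sboov, and those that satisfy sboov also satisfy sboisa.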
Edges only match if their labels are strictly equal. So if an edge is a reactant, then it has to be a reactant in the other pathway. The same holds for products and modifiers.
Subgraph matching strategies are combinations of matching strategies for species and reactions (and for edges, where the strategy is always the same). For instance, the matching strategy nmeq, sboeq is the strictest and requires that species names are exactly equal and that SBO/GO signatures of reactions are exactly equal. The matching strategy appeq/entov/wc, sboisa is the loosest strategy. In this strategy, two species match if their names are approximately equal or if their Entrez Gene identifiers overlap or if any of this applies to one of the constituents of the two species. Two reactions match if any of their SBO/GO terms are in an is-a relationship. We compare a total of 24 matching strategies.
Subgraph overlap is computed as follows. For each subgraph in the extracted pathway we search for subgraphs in the human curated data that match according to some subgraph matching strategy. We use micro-averaged F-score, precision and recall (Sokolova and Lapalme, 2009) for quantifying the retrieval results. F-score is used to quantify the overlap of species, reactions and edges. We then macro-average these results to get a total F-score quantifying performance of the extraction system as a whole.
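The scoring step can be summarized in a few lines. This sketch assumes that matches are available as (true positive, false positive, false negative) counts per retrieval unit, which the text does not specify; the function names and the example numbers are made up for illustration.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-score from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score


def micro_average(counts):
    """Micro-averaging: pool the counts first, then compute the scores once."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return precision_recall_f(tp, fp, fn)


def macro_average(scores):
    """Macro-averaging: plain mean of already computed scores."""
    return sum(scores) / len(scores)


# Made-up (tp, fp, fn) counts for species, reactions and edges.
species_f = micro_average([(1, 1, 0), (1, 0, 1)])[2]
reaction_f = micro_average([(1, 3, 1)])[2]
edge_f = micro_average([(2, 2, 2)])[2]
total_f = macro_average([species_f, reaction_f, edge_f])
print(round(species_f, 3), round(reaction_f, 3), round(edge_f, 3), round(total_f, 3))
```

Micro-averaging lets frequent items dominate within one element type, whereas the final macro-average gives species, reactions and edges equal weight in the total score.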

Results
Some classifiers take a long time to train, so we only have partial results for MLP. However, all other classifiers (DT, MNNB, RF, SVM) finished training on all selected combinations of training datasets.
Since we tested 24 subgraph overlap measures with 18 classifiers, we obtain a large amount of data that cannot be discussed in detail in this paper. Here, we concentrate on general trends in the data. Code and datasets are published as appropriate.

Extraction Results: Species, Reactions, Subgraphs
Generally speaking, the extracted pathways contain two orders of magnitude more species, reactions, and edges than the TARGET pathway (see Table 3 for all results). This is expected, since the extracted pathways consist of all combinations of entity and event mentions in the text. The same entities may occur more often in the text than they are referenced in the actual pathway.
Our results show that extraction classifiers perform inconsistently with respect to the identification of compartments. While some classifiers retrieve a lot of compartment information (via localization events), others (especially MNNB trained on the ANN and PC13 datasets) do not extract any compartments. MNNB with our parameter choice might not be able to learn many different event types, so it skips the least frequent reaction types (one of which is the localization event).
Measuring how many subgraphs there are per pathway, we can see that more than half of all species extracted by classifiers are isolated and not connected to any reactions. Similarly we see many (small) subgraphs being extracted by the classifiers, whereas TARGET consists of essentially one large connected graph (with a few modeling mistakes).

General Trends: Subgraph Overlap
Let us first concentrate on overall performance, especially with respect to previous results. For this we compute the best classifiers and their scores for different matching strategies. For each matching strategy, we evaluate all classifiers, choose the best performing one, and compare it with the results reported in Spranger et al. (2016), referred to as Spr16. Table 4 shows that the best classifiers outperform Spr16 in all cases, and for some subgraph overlap measures by 10 points.
(with 14 precision, 13 recall scores). For the loosest strategy (appeq/entov/wc, sboisa) this goes up to an F-score of 52 (47 precision, 66 recall). These results show that the classifiers fail badly when it comes to exact extraction, whereas with looser overlap strategies performance becomes reasonable and there is some overlap between the extracted and the human-curated data. Of course, this also entails that the automatically extracted pathway does not completely capture what humans are constructing from the text. Generally speaking, overlap strategies that are loose with respect to constituents of complex species (wc) outperform their non-wc counterparts. For instance, nmeq/wc, sboeq performs much better than nmeq, sboeq. This shows that complex species are important for the mTOR pathway but their extraction is not very detailed, which is why the overlap matching strategy has to be lenient with respect to complex species constituents. The increase in F-score for wc matching strategies is primarily driven by an increase in recall. For instance, the difference in recall between nmeq, sboeq and nmeq/wc, sboeq is more than 20 points, whereas precision does not improve that much. The reason is that the same subgraphs in the extracted pathway overlap with more subgraphs in TARGET. So it is not the case that other subgraphs in the extracted pathway overlap with TARGET.
Results also show that recall is in general much higher than precision for looser strategies. For instance, wc strategies (right hand side of Figure 1) double the recall score with respect to their precision scores. This also shows that in principle loosening matching strategies impacts mostly recall, as the same subgraphs in the extracted data overlap with the human-curated data.

Classifier Performance in Detail
The bottom figure in Figure 2 shows the best classifiers in terms of precision, recall and F-score. We measured how often a classifier is the best classifier (for each of the 24 subgraph overlap strategies). Overall, the Random Forest (RF) classifier clearly performs best. For all 24 matching strategies a Random Forest classifier is better than any competitor, with RF trained on PC13 and ANN being the most frequent best classifier overall. Second place is Random Forest trained on GE11 alone (the largest dataset in terms of entities and events). No other classifier (SVM, MLP, MNNB, DT) outperforms RF. Training on all datasets (RF+GE11+PC13+ANN) does not seem to increase success significantly. Performance across the different RF classifiers is on par and good (see Table 5).
Results in the top figure of Figure 2 show that RF has the best precision performance. RF+PC13+ANN is the most frequent best classifier with respect to precision, with RF+GE11 and RF+GE11+PC13+ANN performing comparably. Compared to recall, this means that RF classifiers win on F-score because they are best in precision.
No RF classifier performs best in recall. Results show that MLP, DT and SVM all perform well for certain subgraph overlap strategies, with SVM being most often the best classifier, followed by various DT-based classifiers and MLP.
Figure 2: Histogram of best classifiers. This histogram is generated by counting how often a classifier is the best for a particular subgraph matching strategy.
Figure 3: Statistics of classifier performance across all matching strategies. X-axis: classifiers. Y-axis: macro precision (top), macro recall (middle) and macro F-score (bottom), with 100 being a perfect score.
Figure 3 gives results for all classifiers across all matching strategies. Looser strategies give the maximum and strict matching strategies the minimum data points. We can see that performance is primarily driven by the choice of classifier, as the F-score mostly varies with the type of classifier used (even though there are a few outliers). The situation is a bit more varied for precision and recall. Interestingly, the choice of dataset seems to have less impact. Generally speaking, MNNB classifiers are the least successful. RF classifiers clearly dominate precision on average and are close to DT and SVM on recall.

Conclusion
This paper continues the current trend of extending NLP systems for APC and building more complete systems that allow evaluation with respect to some external standard, here the hand-curated mTOR pathway.
We measured the impact of different classifiers on retrieval performance and showed that certain classifiers have the potential to increase retrieval performance. Especially Random Forest classifiers perform much better on mTOR than previously tried Support Vector Machines. On the other hand, the training data choice seems to have little impact (at least for the tested ANN, GE11 and PC13 training datasets). Spranger et al. (2016) argue that not all of the problems of APC can be overcome by using more training data on event extraction systems. They argue that additions such as complex species recognition, co-reference resolution and pathway construction are needed to ultimately solve the problem posed by APC. This certainly remains true and is not directly questioned by results in this paper. The system described here does not automatically compose single pathway maps from the extracted data. Nevertheless, our results suggest that a lot of progress can be made by improving on the event extraction part of the pipeline. This paper focuses on evaluating current machine learning techniques for event extraction. We are currently in the process of evaluating other systems including rule-based ones.