Biomedical Event Extraction as Sequence Labeling

We introduce Biomedical Event Extraction as Sequence Labeling (BEESL), a joint end-to-end neural information extraction model. BEESL recasts the task as sequence labeling, taking advantage of a multi-label aware encoding strategy and jointly modeling the intermediate tasks via multi-task learning. BEESL is fast, accurate, end-to-end, and unlike current methods does not require any external knowledge base or preprocessing tools. BEESL outperforms the current best system (Li et al., 2019) on the Genia 2011 benchmark by 1.57% absolute F1 score, reaching 60.22% F1 and establishing a new state of the art for the task. Importantly, we also provide first results on biomedical event extraction without gold entity information. Empirical results show that BEESL's speed and accuracy make it a viable approach for large-scale real-world scenarios. The source code is available at https://github.com/cosbi-research/beesl.


Introduction
Biomedical event extraction provides invaluable means for assisting domain experts in the curation of knowledge bases and biomolecular pathways (Ananiadou et al., 2010). While the task has received significant attention in research over the last decade, it remains challenging, and progress has largely stagnated (see Figure 1).
Events are typically highly complex and nested structures, which require deep contextual knowledge to resolve. This is particularly the case for biomedical NLP (Kim et al., 2011), where biomolecular events can be nested (Miwa et al., 2014) and long-distance arguments are frequent (Li et al., 2019). Figure 2 shows an example with four events. Each event consists of an event mention (trigger) and one or more arguments. For instance, there is a +REGULATION event triggered by the span "induced", with a PROTEIN entity (i.e., "IL-12") as CAUSE and a nested +REGULATION event (i.e., "activation") as THEME.

[Figure 1: Performance of biomedical event extraction on the BioNLP Genia 2011 test set over time.]

Many state-of-the-art biomedical event extraction systems still work as a pipeline and extract event triggers and their arguments independently (Björne and Salakoski, 2018; Li et al., 2019). They typically employ dependency parses as features in a CNN model ensemble (Björne and Salakoski, 2018) or in Tree-LSTMs with knowledge bases (Li et al., 2019).

We propose a new approach to biomedical event extraction by casting it as a sequence labeling task (BEESL). Our approach is conceptually simple: we convert the event structures into a representation suitable for sequence labeling, and leverage a multi-label aware decoder with BERT (Devlin et al., 2019) in a multi-task sequence labeling model. This reduces the problem of predicting a structured output for an input sequence to word-level tagging decisions. Compared to previous alternatives (cf. Section 7), which cast event extraction as a syntactic or semantic tree- or graph-parsing task, this leads to a faster, joint model which also mitigates the error propagation of locally-optimized classifier pipelines (Björne and Salakoski, 2018).

Contributions To the best of our knowledge, we are the first to cast biomedical event extraction as sequence labeling. We demonstrate that BEESL is an attractive and efficient solution for extracting biomedical events. We evaluate it on the BioNLP Genia 2011 benchmark, obtaining a new state of the art (cf. Figure 1) while gaining in efficiency. We additionally provide empirical results on the impact of alternative multi-task encodings, and, to the best of our knowledge, the first results on biomedical event extraction without assuming gold entities.

Encoding Event Structures
This section introduces the event structures and how we encode them for sequence labeling.

Event structures
Events are structured representations which comprise multiple information units (Figure 2, top). An event is anchored to a trigger, a text span which indicates the presence of an event (Figure 2, rounded boxes). Each event has one or more arguments, namely entities or other events (Figure 2, end of arrows), which are assigned a role in the event (Figure 2, labels on arrows). For example, an EXPRESSION event is indicated at "production" in Figure 2, involving the PROTEIN "IL-10" as its argument. Nested structures are possible and frequent. For instance, the +REGULATION event centered on "activation" is an argument of both the "induced"-anchored and the "promote"-anchored events.
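To make this structure concrete, the following is a minimal Python sketch (ours, not the authors' code) of how such nested events could be represented; the class names Entity, Event, and Argument are illustrative. It models the "induced"-anchored +REGULATION event described above.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Entity:
    span: str   # surface text, e.g. "IL-12"
    type: str   # e.g. "PROTEIN"

@dataclass
class Event:
    trigger: str   # trigger span, e.g. "induced"
    type: str      # e.g. "+REGULATION"
    args: List["Argument"] = field(default_factory=list)

@dataclass
class Argument:
    role: str                       # "THEME" or "CAUSE"
    target: Union[Entity, "Event"]  # arguments may be entities or nested events

# The +REGULATION event triggered by "induced" (cf. Figure 2):
il12 = Entity("IL-12", "PROTEIN")
activation = Event("activation", "+REGULATION")  # nested event, args omitted
induced = Event("induced", "+REGULATION",
                [Argument("CAUSE", il12), Argument("THEME", activation)])
```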

Sequence labeling encoding
Given a sequence of n tokens [x_1, ..., x_n], we encode event structures as token-level labels [y_1, ..., y_n], reducing the task to a sequence labeling problem. Adopting dependency parsing terminology, we encode the label y_i for each token x_i as a tuple ⟨d, r, h⟩, where the dependent d refers to the token and its mention type (trigger, entity, or nothing), the relation r refers to its role, and the head h denotes the event the token refers to (Figure 2, bottom). In more detail, to discriminate event heads with the same type in the text, we encode heads h as relative head mention positions. For instance, h = +REG_+1 means the head is the first +REGULATION to the right of d in the relative surface order, whereas h = +REG_-2 means it is the second +REGULATION to the left. In Figure 2 the label for "production" is ⟨EXPRESSION, THEME, +REG_-1⟩, denoting that the token is an EXPRESSION trigger and the THEME of the first +REGULATION event on its left. As opposed to dependency parsing, tokens may have zero or multiple roots, and thus multiple heads and relations. This poses additional challenges. For instance, the "activation"-anchored event (Figure 2) is both THEME and CAUSE of the "induced"- and "promote"-anchored event heads, respectively. As a result, both r and h are multi-label, and the label for "activation" is encoded as ⟨+REGULATION, [THEME, CAUSE], [+REG_-1, +REG_+1]⟩, where the order of the r and h items is preserved.
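As a worked illustration of this encoding, the sketch below (our simplification, not the authors' implementation) shows how a relative head reference such as +REG_-2 can be resolved back to an absolute token position at decoding time:

```python
from typing import List, Optional

def resolve_relative_head(i: int, head_type: str, offset: int,
                          mention_types: List[Optional[str]]) -> Optional[int]:
    """Resolve a relative head reference like +REG_-2 ('second +REGULATION
    to the left of token i') to an absolute token index."""
    step = 1 if offset > 0 else -1
    remaining = abs(offset)
    j = i + step
    while 0 <= j < len(mention_types):
        if mention_types[j] == head_type:
            remaining -= 1
            if remaining == 0:
                return j
        j += step
    return None  # ill-formed reference: no such head in the sentence

# Toy sentence with two +REGULATION triggers around an EXPRESSION trigger:
types = [None, "+REGULATION", None, "EXPRESSION", None, "+REGULATION"]
# EXPRESSION token at index 3 with head label +REG_-1 -> index 1
assert resolve_relative_head(3, "+REGULATION", -1, types) == 1
# ... and with head label +REG_+1 -> index 5
assert resolve_relative_head(3, "+REGULATION", +1, types) == 5
```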

Event Extraction as Sequence Labeling
Formally, we aim to learn a function f : X → Y that assigns each token x_i a structured label y_i, i.e., ⟨d, r, h⟩. A straightforward solution is to predict the label y_i as an atomic entity (i.e., a single label) in a single-task model. For BEESL, we instead propose to use multi-task learning (MTL), which allows learning interdependencies while cutting down the label space, paired with multi-label prediction. An overview of BEESL is shown in Figure 3. We use BERT (Devlin et al., 2019) as encoder, pretrained on biomedical texts (Section 4). We mask entity spans for better generalization (Alt et al., 2019). The first WordPiece (Schuster and Nakajima, 2012) of each token x_i is used for prediction, where the contextual hidden representation e_i of the token x_i is encoded with layer-wise attention over the BERT layers, similarly to Kondratyuk and Straka (2019). As decoders, we use a standard softmax with a cross-entropy loss unless otherwise specified, and introduce a multi-label decoder (Section 3.2) (Figure 3, upper right).
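The layer-wise attention over BERT layers can be realized as a learned scalar mixture of the hidden states, as in UDify (Kondratyuk and Straka, 2019); below is a minimal PyTorch sketch of this idea (our rendering, not BEESL's exact code):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Combine the hidden states of all BERT layers into a single
    representation per token, with learned per-layer weights."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layers: torch.Tensor) -> torch.Tensor:
        # layers: (num_layers, batch, seq_len, hidden)
        alphas = torch.softmax(self.weights, dim=0)
        mixed = (alphas.view(-1, 1, 1, 1) * layers).sum(dim=0)
        return self.gamma * mixed

# e.g. 13 layers (embeddings + 12 transformer layers) of BERT-base:
mix = ScalarMix(num_layers=13)
e = mix(torch.randn(13, 2, 16, 768))  # -> (2, 16, 768) token encodings e_i
```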
We empirically evaluate both single-task and multi-task setups, including several MTL encoding alternatives, discussing their limitations and benefits. In the following, we first introduce the multi-task setups, and then multi-label decoding.

Multi-task strategies
We denote the label spaces for each component of the labels as d i ∈ D, r i ∈ R, and h i ∈ H. Further, we use L to refer to the maximum label space size.
Single-task A single-task (ST) setup is used as a baseline. It predicts a single label y_i = ⟨d, r, h⟩ for each input token x_i. The label space size is up to L = |D| × |R| × |H|.
Multi-task The label y_i for each token x_i is decomposed into parts (hereafter, sub-labels), each treated as a prediction task. The decomposition of the label space allows each sub-label space to be framed as a different task with its own private decoder, mitigating output space sparsity. Depending on the decomposition of the label y_i = ⟨d, r, h⟩, we have four multi-task learning options (pairs of tasks, or each subpart as a task, respectively). Option 4 encodes each subpart as its own task. While this leads to the smallest label space, it decouples the problem into three separate tasks. Options 1-3 are pair-wise task setups, each combining two of the three subparts into a single task; option 1 is ⟨d⟩ + ⟨r, h⟩, separating mention detection from head labeling. We hypothesize that BEESL benefits from this disentanglement (option 1); a decomposition sketch follows below.
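To make the decompositions concrete, here is an illustrative sketch of splitting a full label ⟨d, r, h⟩ into per-task sub-labels; the exact pairings assigned to options 2 and 3 are our assumption, consistent with the description above (only option 1 is explicitly defined in the text):

```python
def decompose(d: str, r: str, h: str, option: int):
    """Split a full label <d, r, h> into per-task sub-labels."""
    if option == 1:   # mention detection vs. head labeling (BEESL default)
        return (d,), (f"{r}|{h}",)
    if option == 2:   # assumed pairing: <d, r> vs. <h>
        return (f"{d}|{r}",), (h,)
    if option == 3:   # assumed pairing: <d, h> vs. <r>
        return (f"{d}|{h}",), (r,)
    if option == 4:   # each subpart as its own task
        return (d,), (r,), (h,)
    raise ValueError(option)

# Label of "production" in Figure 2 under option 1:
print(decompose("EXPRESSION", "THEME", "+REG_-1", option=1))
# -> (('EXPRESSION',), ('THEME|+REG_-1',))
```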
As illustrated in Figure 3, BEESL uses the predicted sub-labels to form the complete label tuple ŷ_i = ⟨d̂, r̂, ĥ⟩. In case r and h belong to different sub-label spaces (as is possible in options 2-4), we require that both predictions r̂ and ĥ are present (non-empty) to ensure well-formedness. This is a downside of the alternative options 2-4, as we will see empirically (Section 5).
During training, the MTL loss is computed as L = Σ_t λ_t L_t, where L_t is the loss for each task t, given by the respective decoder (see also Section 3.2), and λ_t is a task-specific weighting parameter. In our experiments we kept λ_t = 1.0 for all tasks, since preliminary experiments showed that weighting sub-tasks differently was not beneficial. In the single-task setup, the loss reduces to L = L_t.
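In code, the combined objective is simply a weighted sum of the per-task losses; a minimal sketch (ours):

```python
def mtl_loss(task_losses, task_weights=None):
    """L = sum_t lambda_t * L_t; lambda_t = 1.0 for all tasks in BEESL."""
    if task_weights is None:
        task_weights = [1.0] * len(task_losses)
    return sum(w * l for w, l in zip(task_weights, task_losses))
```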

Multi-label decoder
The multi-label decoder is designed to handle multiple labels per token, making it suitable for predicting relations and heads. Given a task with labels l_j ∈ L, it models P(l_j | e_i) for each label l_j. Differently from the single-label decoder, each label is predicted with a sigmoid, and all labels contribute equally to the loss. Given the probabilities P(l_j | e_i) for the labels l_j ∈ L and a threshold τ, the token x_i is assigned all labels l_j with probability P(l_j | e_i) ≥ τ. If no P(l_j | e_i) ≥ τ is found, we fall back to the highest-scoring label l_j (which may also be empty). We employ a binary cross-entropy loss, averaged across all batches.
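A minimal sketch of this decoding rule (our rendering of the description above, with illustrative names):

```python
import torch

def decode_multilabel(logits: torch.Tensor, labels: list, tau: float = 0.5):
    """Assign every label whose sigmoid probability >= tau; if none passes
    the threshold, fall back to the single highest-scoring label."""
    probs = torch.sigmoid(logits)          # (num_labels,)
    selected = [l for l, p in zip(labels, probs) if p >= tau]
    if not selected:                       # fallback: best-only prediction
        selected = [labels[int(probs.argmax())]]
    return selected

# A multi-headed token like "activation" in Figure 2 carries two roles;
# both may pass the threshold:
labels = ["<empty>", "THEME", "CAUSE"]
print(decode_multilabel(torch.tensor([-2.0, 1.3, 0.9]), labels, tau=0.5))
# -> ['THEME', 'CAUSE']
```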

Experimental Setup
We evaluate BEESL on the Genia 2011 benchmark (Kim et al., 2011), which comprises both abstracts and full texts. The corpus consists of annotations for PROTEIN entities and 9 fine-grained event types. The Genia event extraction tasks expect both texts and entities as input, and complete events need to be predicted. Statistics on the dataset are shown in Table 1. Event types can be categorized into simple, binding, and complex events, according to the number and types of their arguments. Simple events require a THEME only, binding events require one or more THEME arguments, and complex events take both THEME and CAUSE arguments, both of which can in turn be other events, resulting in nested structures. Björne and Salakoski (2011) estimated that 37.2% of the events in the data are nested. We refer the reader to Appendix A.1 for formal event definitions.
BEESL is based on MaChAmp (van der Goot et al., 2020), a toolkit for multi-task learning and fine-tuning of BERT-like models. We extend MaChAmp to also handle multi-label sequence labeling. We experiment with BEESL in the single-task and in different multi-task setups.
After sequence labeling, token-level labels are converted into the official BioNLP-ST standoff format for evaluation (Kim et al., 2011). We simply split the event arguments based on their formal definition, producing complete structures (e.g., an EXPRESSION event with k THEME arguments is split into k EXPRESSION events, with one THEME each). Similarly to previous work, we focus on sentence-level events. We used BioBERT-Base 1.1 as our BERT model for experiments, since it provides state-of-the-art performance across multiple biomedical information extraction tasks. For multi-label decoding, we tune the threshold τ for each setup (yielding τ_MT = 0.5 and τ_ST = 0.7). Other hyper-parameter values and tuning details are provided in Appendix A.2.
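The splitting step can be sketched as follows (illustrative only; the real converter emits BioNLP-ST standoff lines, and the identifiers here are made up):

```python
def split_event(event_type: str, trigger_id: str, themes: list):
    """Split an event with k THEME arguments into k events with one
    THEME each (applied to non-BINDING events such as EXPRESSION)."""
    return [
        (f"E{n}", f"{event_type}:{trigger_id}", [("Theme", theme)])
        for n, theme in enumerate(themes, start=1)
    ]

# An EXPRESSION trigger T1 predicted with two THEME arguments:
for event in split_event("Gene_expression", "T1", ["T2", "T3"]):
    print(event)
# -> ('E1', 'Gene_expression:T1', [('Theme', 'T2')])
#    ('E2', 'Gene_expression:T1', [('Theme', 'T3')])
```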
Evaluation In line with previous work, we evaluate BEESL in terms of precision (P), recall (R), and F1 score according to the approximate recursive span matching criterion (Kim et al., 2011), using the official BioNLP online evaluation service. For early stopping during training, we employ the simpler span-based F1 score (as used in named entity recognition) as a proxy metric; we found it correlates highly with the official approximate recursive span matching F1.
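The proxy metric is the usual span-based F1 over predicted versus gold labeled spans; a compact sketch (ours, with toy spans):

```python
def span_f1(gold: set, pred: set) -> float:
    """Span-based F1 between sets of (start, end, label) spans,
    as commonly used in named entity recognition."""
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {(0, 2, "PROTEIN"), (5, 6, "EXPRESSION")}
pred = {(0, 2, "PROTEIN"), (4, 6, "EXPRESSION")}
print(round(span_f1(gold, pred), 2))  # -> 0.5
```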
No gold entities In biomedical event extraction, entities are typically given in advance. To evaluate BEESL in a setup with predicted entities (Section 6.3), we first employ our model as a single-task sequence labeler for BIO-tagged entity mentions, using default settings and a standard CRF decoder (see the sketch below). Note that, for comparison purposes, in all other experiments we assume gold entity mentions. We then evaluate BEESL with raw texts and predicted entities as input, which indirectly penalizes events that attach to over-predicted entities or that miss under-predicted entities.
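For this setup, entity mention detection is standard BIO tagging; the example below (ours, with an illustrative sentence) shows the tag scheme and a helper that recovers entity spans from the predicted tags:

```python
def bio_to_spans(tags):
    """Collect (start, end, type) spans from a BIO tag sequence."""
    spans, start = [], None
    for i, tag in enumerate(tags + ["O"]):  # sentinel to flush the last span
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i, tags[start][2:]))
            start = None
        if tag.startswith("B-"):
            start = i
    return spans

tokens = ["IL-12", "induced", "activation", "of", "NF-kappa", "B"]
tags   = ["B-PROTEIN", "O", "O", "O", "B-PROTEIN", "I-PROTEIN"]
print(bio_to_spans(tags))  # -> [(0, 1, 'PROTEIN'), (4, 6, 'PROTEIN')]
```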

Results
First, we evaluate the MTL and multi-label decoding strategies on the development set to determine the best setup (Sections 5.1, 5.2). Then, we compare BEESL to the results obtained by the top performing systems on the official test set (Section 5.3). Finally, we gauge its speed (Section 5.4).

Multi-task strategies

Table 3 (top) reports the results of the multi-task setups on the development set, with option 1 outperforming the other MTL options, particularly in recall. These results show that a multi-task setup with separate tasks for mention detection and head labeling, respectively, is the most useful. Option 1, i.e., ⟨d⟩ + ⟨r, h⟩, is therefore the default multi-task option for BEESL (Figure 3) and is used in the following experiments.

Adding the multi-label decoder
We evaluate the multi-label decoder for both the single-task (BEESL_ST) and multi-task (BEESL_MT) setups (Table 3, bottom). Multi-label decoding is beneficial, as the data contains many multi-headed tokens, and modeling them improves both setups. Single-task performance increases substantially, from 61.13 to 63.34 F1. Similar significant gains are observed for multi-task learning, from 62.37 to 65.04 F1. Regardless of multi-label modeling, the multi-task setup provides the highest overall performance.

Comparison to the state of the art
We now compare the multi-task multi-label BEESL (hereafter, simply BEESL) to the top performing systems. As shown in Table 2, BEESL outperforms the state of the art, with an absolute improvement of 1.57 points in F1 score over the KB-Tree LSTM model (Li et al., 2019) (hereafter, KBTL). It improves both precision and recall, yielding a new state of the art with an F1 score of 60.22%, while being conceptually simple. Table 4 compares the F1 scores of BEESL and the previous best model at the per-event level (precision and recall are provided in Appendix A.3). BEESL outperforms the KBTL approach (Li et al., 2019) on 7 out of the 9 event types. From a coarse-grained perspective, BEESL outperforms KBTL on the simple, binding, and complex event categories. In particular, improvements over KBTL on simple events are as large as +13% F1. The improvements for binding and for nested, complex events are also noticeable: our model achieves 50.19% and 48.32% F1, respectively. On closer inspection, the recall of BEESL on simple events is substantially higher than KBTL's, which eases the correct identification of complex events.
Next, we look at performance per text type (i.e., the abstract and full-text subsets). BEESL achieves 62.14% F1 on abstract-only documents and 55.59% F1 on full texts. This confirms that full texts are harder to process than abstracts, due to differences in structure and content (Cohen et al., 2010).
To sum up, BEESL handles events well and, unlike most prior work, does not use knowledge bases or dependency parsers as a pre-processing step. BEESL uses multi-task learning with a contextual encoder and multi-label aware decoding, thereby bringing progress to the biomedical event extraction task, as illustrated in Figure 1.

Speed comparison
We compare BEESL to TEES, the Turku Event Extraction System (Björne and Salakoski, 2018), in terms of speed at inference time on commodity hardware. TEES is the second best-performing system (Figure 1), and its code is freely available. To the best of our knowledge, the source code of Li et al. (2019) is not yet available. Table 5 shows that on a consumer-grade CPU, BEESL is ∼2x faster than the TEES single system and ∼5x faster than the TEES ensemble. In terms of throughput, BEESL processes ∼500 sents/min, compared to 255 sents/min for TEES single (3.42% lower F1) and 101 sents/min for the TEES ensemble (2.12% lower F1).

[Table 6: Ablation study on BEESL when removing the multi-task capability (i.e., replacing MTL with independent classifiers) and the multi-label handling.]

Analysis and Discussion
To gain insights into BEESL, we shed more light on several aspects. First, we analyze how much BEESL gains from multi-task learning, compared to using a powerful contextualized BERT encoder alone in a single-task setup and to a formulation with two independent classifiers (Section 6.1). Then, we quantify the stability of the threshold τ of the multi-label decoder (Section 6.2). We also examine model performance without gold entities (Section 6.3), and qualitatively study the sources of BEESL's prediction errors (Section 6.4).

How important is multi-task learning?
As opposed to running one single model that predicts ⟨d⟩ and ⟨r, h⟩ jointly in a multi-task setup, we compare to a single-task (ST) model and to an experiment in which two independent classifiers predict the two sub-labels of the best MTL setup separately. This allows us to gauge the effectiveness of the multi-task learning approach compared to local classifiers that use strong BERT-based encoding, and compared to predicting an atomic label in ST.
Results in Table 6 confirm that leveraging a shared encoder and multi-task learning for both triggers and heads is crucial. Without multi-task learning and multi-label decoding, the F1 score drops to 61.44 (independent classifiers) and 61.13 (single-task, cf. Table 3). Adding multi-label decoding helps, as expected. However, the full power of BEESL is only achieved by using both the multi-task and the multi-label approach, which leads to the new state of the art.

[Table 7: Ablation study on the threshold τ of the multi-label decoder ("with best-only prediction": τ = 1.0). BEESL_MT (multi-label): 65.04; with best-only prediction: 64.54 (-0.50).]

How brittle is BEESL to the threshold τ ?
As shown in Table 3, the multi-label decoder largely increases performance over a single-label decoder (from 62.37 to 65.04 F1). An open question, however, is how much the threshold τ impacts performance. To investigate, we first performed an ablation study setting τ = 1.0. As introduced in Section 3.2, this reduces to predicting only the highest-scoring label, although within the reduced label space induced by the multi-label aware decoder. We found that only part of the improvement is due to the threshold τ, in both the multi-task and single-task settings (+0.50% and +0.47%, respectively; Table 7).
Moreover, we evaluated BEESL with different τ values. As shown in Figure 4, a threshold in the range 0.3-0.7 only marginally alters the results, which remain better than predicting the highest-scoring label only (τ = 1.0).

What is the effect of using gold entities?
The standard in biomedical event extraction is to evaluate system performance on gold entities. In real-world situations, however, it is unlikely that the data is annotated for entities. We therefore believe it is important to estimate the impact of non-gold entities (hereafter, silver entities) on system performance. The entity prediction model achieves 87.95 span-based F1 on the development set. The results on the event extraction task using silver entities are shown in Table 8. The overall drop in F1 amounts to around 5%, well balanced across precision and recall. This shows that BEESL's performance is clearly affected, but that the system is relatively robust to noisy, non-gold silver entities. We believe this performance gap could be further reduced by using jackknifing (Agić and Schluter, 2017) to reduce the data mismatch; however, this requires aligning the predicted entities with the existing events in the training data, which is non-trivial, and we leave it for future work.

What are the sources of errors?
We randomly sampled 30 documents (comprising 168 gold events) from the development set for manual scrutiny of the sources of errors. We classified errors into two broad categories, namely trigger and argument errors, and further into fine-grained categories based on the type of error: under-prediction, over-prediction, and wrong type. Table 9 summarizes the results.
We notice that the largest fraction of errors is due to trigger errors. On closer inspection, under-predicted triggers account for 31.43% of the total, and over-predicted triggers for 28.57%. We investigated the reasons for these errors, finding that over-predicted triggers are often due to generic words that are frequently used to indicate specific trigger types. For instance, BEESL identifies a +REGULATION event anchored at "activated" in the sentence "Tax [...] maximally activated HTLV-I-LTR-CAT and kappa B-fos-CA", although the gold standard does not contain the event in this instance. From a semantic point of view, however, we believe these errors are acceptable. Other cases include words such as "detected" and "influences", which are often used as EXPRESSION and REGULATION event triggers, respectively. Under-prediction of triggers is instead due to a variety of reasons. Both rare words (e.g., a +REGULATION event centered on "co-transfected") and uncertain events account for a large fraction of this error type. An example of an uncertain event is the +REGULATION trigger "importance" in the sentence "[...] importance of NF-kappa B in LT gene expression", which BEESL does not predict.
Wrongly typed triggers represent only 10% of the errors. One example involves ambiguous trigger types: in the sentence "T cells upregulates A3G mRNA levels", BEESL classifies "levels" as an EXPRESSION trigger, while the gold annotation marks it as a TRANSCRIPTION trigger. On closer inspection, we found that some triggers in the corpora are annotated as EXPRESSION and TRANSCRIPTION interchangeably, owing to the fact that TRANSCRIPTION is a form of gene EXPRESSION.
Regarding the identification of arguments, over-predictions are quite uncommon. The main argument errors we found may benefit from syntactic information, which we aim to integrate in a multi-task setup in future work. We found no misclassification of argument types in our document sample. Under-prediction of arguments is instead mostly due to under-predicted events.

Related Work
Biomedical event extraction has a long-standing tradition (Miwa et al., 2012; Vlachos and Craven, 2012; Venugopal et al., 2014; Majumder et al., 2016). Current work has explored neural methods and uses multiple classification stages, first identifying trigger mentions and then evaluating all entity pairs (Li et al., 2019; Björne and Salakoski, 2018). These approaches come with the shortcomings of traditional pipeline methods. Many studies use dependency parsers to obtain features or to guide Tree-LSTMs (Li et al., 2019; Björne and Salakoski, 2018).
Recent work in syntactic parsing has shown that reducing parsing to sequence labeling is a viable alternative for both constituent and dependency parsing (Spoustová and Spousta, 2010; Gómez-Rodríguez and Vilares, 2018; Strzyz et al., 2019), which we took as inspiration. Moreover, earlier work framed biomedical event extraction as syntactic and semantic tree- or graph-parsing (McClosky et al., 2011; Rao et al., 2017). In particular, McClosky et al. (2011) perform dependency parsing followed by a second-stage parse reranker for event extraction, and Rao et al. (2017) cast the problem as a subgraph identification problem.
Joint learning for biomedical event extraction was explored in early work (Venugopal et al., 2014; Vlachos and Craven, 2012). Contemporaneous to our work, a very recent study proposes OneIE, a joint learning model for event extraction (Lin et al., 2020): a single end-to-end model with four stages, paired with beam search, obtaining good results on ACE data. Handling multiple heads has previously been done for relation extraction using multi-head selection (Bekoulis et al., 2018a,b), and sequence labeling has been employed for joint entity and relation classification with inter-token attention (Dai et al., 2019). We instead employ multi-label prediction at the token level for sequence labeling.

Conclusion
This paper proposes BEESL, a new end-to-end biomedical event extraction system which is both efficient and accurate. BEESL is broadly applicable to event extraction and other tasks that can be recast as sequence labeling. The system's strength comes from joint multi-task modeling paired with multi-label decoding, which captures interdependencies between the tasks and is superior to alternative decoders based on strong contextualized BERT embeddings alone. BEESL is fast, and achieves state-of-the-art performance on the Genia 2011 event extraction benchmark without the need for external tools for features or resources such as knowledge bases. Our analysis shows that BEESL works very well across event types.
We release the code freely, to foster research on using BEESL for other NLP tasks as well, e.g., enhanced dependency parsing, fine-grained named entity recognition, and semantic parsing.

A.1 Data and formal event definitions
Events on the Genia 2011 benchmark follow the formal specification detailed in

A.2 Hyper-parameters
The hyper-parameter values and the search space are presented in Table 11; the number of trainable parameters in BEESL is ≈110M. For tuning, we started from the values reported in previous work on multi-task learning for NLP evaluation benchmarks, e.g., UDify (Kondratyuk and Straka, 2019). We performed 32 search trials via grid search, in which batch size and base learning rate were coupled as (32, 1e-3) and (64, 1e-2). An additional 9 search trials were performed to select the threshold τ for the BEESL multi-task multi-label model. We used the official approximate recursive span matching F1 score for model selection, whereas the sum of the span-based F1 scores of the tasks was used to determine early stopping of the training process.

A.3 Miscellaneous
Technical details Texts have been tokenized and segmented using scispaCy 0.2.4 (Neumann et al., 2019). In our data it is uncommon for multiple contiguous triggers to have the same type, so BIO encoding is not needed. In the rare case of overlapping event triggers of different types, we create a single label d concatenating their types. Similarly to previous work, for BINDING events with multiple THEME arguments we employ a simple heuristic to convert them into the BioNLP-ST standoff format (Vlachos and Craven, 2012).

Upper bound of the encoding We quantified the upper bound of our encoding strategy by directly evaluating the performance of the encoded development set. The results (P: 95.76%, R: 91.30%, F1: 93.48%) indicate the soundness of our strategy; the ≈6% missing is due to cross-sentence arguments, which we disregard, similarly to previous work.