Linguistically Informed Relation Extraction and Neural Architectures for Nested Named Entity Recognition in BioNLP-OST 2019

Named Entity Recognition (NER) and Relation Extraction (RE) are essential tools for distilling knowledge from biomedical literature. This paper presents our findings from participating in the BioNLP Shared Tasks 2019. We addressed Named Entity Recognition including nested entity extraction, Entity Normalization and Relation Extraction. Our proposed NER approach can be generalized to different languages, and we have shown its effectiveness for English and Spanish text. We investigated linguistic features, a hybrid loss including ranking and Conditional Random Fields (CRF), a multi-task objective and a token-level ensembling strategy to improve NER. We employed dictionary-based fuzzy and semantic search to perform Entity Normalization. Finally, our RE system employed a Support Vector Machine (SVM) with linguistic features. Our NER submission (team: MIC-CIS) ranked first in the BB-2019 norm+NER task with a standard error rate (SER) of 0.7159 and showed competitive performance on the PharmaCoNER task with an F1-score of 0.8662. Our RE system ranked first in the SeeDev-binary Relation Extraction task with an F1-score of 0.3738.


Introduction
Extracting knowledge from scientific articles is a challenging but very important problem. It is especially critical for biomedical literature, which is growing at a rate of at least 4% per year; as of June 2019 there are 30 million documents in PubMed (Lu, 2011). Named Entity Recognition (NER) (Settles, 2004; Lample et al., 2016) in the biomedical domain refers to the task of identifying the names of biological entities, e.g., the name of a bacterium. Relation Extraction (RE) (McDonald et al., 2005; Lever and Jones, 2016) refers to identifying relations (binary or n-ary) among biological entities. Figure 1 illustrates an example of (nested) NER and RE consisting of five entities, where three entities participate in two distinct relationships. It is often required to link named entities to a unique reference in one or more databases. For instance, one of the two occurrences of fish refers to marine fish while the second refers to a farm fish, and the two entities are linked (or normalized) to different identifiers (e.g., OBT:002793 and OBT:002903) in the biomedical database (e.g., OntoBiotope Ontology). The act of linking entities to standard entities with a unique identifier is known as entity normalization. It is challenging because several entity mentions can correspond to the same standard entity (or unique identifier), e.g., E. coli, Bacillus coli and Bacterium coli refer to the standard entity Escherichia coli in the database. The linking process relies on knowledge base (KB) search (heuristic or semantic) in order to resolve entities.
Figure 1: An illustration of (nested) NER + Normalization and Relation Extraction in biomedical entities. Each rectangular box spans an entity, where the overlapping spans indicate nested entities. E.g., fish is a nested entity (a sub-concept) of type Habitat within the parent entity fish pathogen of type Phenotype. The identifiers (e.g., OBT:002669, NCBI:40269, etc.) refer to unique IDs in biomedical databases (i.e., OBT → OntoBiotope Ontology and NCBI → NCBI Taxonomy), used to perform entity normalization (i.e., entity linking). The arrows indicate binary relationships.
NER is a critical primitive step in the NLP pipeline, as downstream tasks such as RE, text classification and Question Answering (QA) depend on it. Several methods have been devised to engineer reliable NER systems; however, most of them do not explicitly address the extraction (or recognition) of nested entities, which is especially required in the biomedical domain. A nested entity is defined as an entity or sub-concept that is part of a longer entity (i.e., a parent). For instance, in Figure 1, fish is a nested entity as it is part of the parent entity fish pathogen. In this work, we have also investigated extracting nested entities via two bi-LSTM-CRF (Lample et al., 2016) networks: one for detecting parent entities and another for detecting nested entities within the parent entity.
Figure 2: System architecture for the NER task, consisting of two bi-LSTM-CRF architectures: Level1 NER to detect parent entities and Level2 Nested NER to detect sub-concepts within the parent entities (output of Level1 NER). Here, w_e: a word embedding vector; c_e: an embedding vector for a word computed using a character-level bidirectional LSTM; t_f: a vector of additional linguistic features; B_P: B_Pathogen; B-S_H: a sub-concept of type Habitat detected by the Level2 Nested NER run over the parent entity.

Task Description and Contribution
We participated in the following three tasks organized by the BioNLP workshop 2019:
(1) PharmaCoNER: Recognition of pharmaceutical drugs and chemical entities in Spanish text.
(2) BB-norm+NER: Recognition of Microorganism, Habitat and Phenotype entities and normalization with NCBI Taxonomy and OntoBiotope habitat concepts.
(3) SeeDev Binary RE: Binary relation extraction of genetic and molecular mechanisms involved in plant seed development.
Our contributions are as follows:
1. To address the NER tasks, we employed a neural network based sequence classifier, i.e., bi-LSTM-CRF, and investigated multi-tasking of named entity detection (NED) and language modeling (LM). We further introduced a hybrid loss including CRF and ranking. We also incorporated linguistic features such as POS and orthographic features. We apply the proposed modeling approaches to both English and Spanish texts. Compared with other systems, our submission (Team: MIC-CIS) is ranked 1st in the BB-norm+NER task (Bossy et al., 2019) with a standard error rate of 0.7159. In the PharmaCoNER task (Gonzalez-Agirre et al., 2019), our submission achieved an F1-score of 0.8662.

2. To address the RE task, we employed linguistic and entity features in an SVM. Our submission (Team: MIC-CIS) is ranked 1st in the SeeDev-binary RE task (Chaix et al., 2016) with an F1-score of 0.3738.

Methodology
In the following sections we discuss our proposed models for NER and RE. Figure 2 describes the architecture of our NER model, where we design two sequence taggers, Level1 NER and Level2 Nested NER, to extract parent and nested entities respectively. Furthermore, Level1 NER can be configured in two modes: (1) LSTM-CRF (Lample et al., 2016) with word embeddings (w_e), character embeddings (c_e) and token-level features (t_f) such as POS, capitalization features, word shape, etc. (refer to Table 1 for the complete list of word-level features); (2) LSTM-CRF+Multi-task, which additionally performs entity detection and language modelling as auxiliary tasks. Note that Level2 Nested NER only operates on the parent entities detected by Level1 NER. The parent and nested entities are then normalized to unique identifiers in the KB by our entity normalization algorithm.

BiLSTM-CRF
The input to the LSTM is a sequence of word features (w_1, w_2, ..., w_n), for which it computes a hidden state for each element in the sequence (h_1, h_2, ..., h_n). These hidden states can be used to jointly model tagging decisions using a CRF (Lafferty et al., 2001). The CRF imposes ordering constraints on the tagging decisions, e.g., I_Habitat should always be preceded by B_Habitat. For an input sentence X, we consider a matrix P of scores output by the bidirectional LSTM. The size of P is n × k, where k is the number of distinct tags, and P_{i,j} corresponds to the score of the j-th tag of the i-th word in a sentence. For a sequence of predictions y = (y_1, y_2, ..., y_n), we define its score to be
$$ s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i} $$
where the matrix A expresses transition scores such that A_{i,j} represents the score of a transition from tag i to tag j, and y_0 and y_{n+1} denote the start and end tags. We add start and end tags to the set of possible tags; A is therefore a square matrix of size k + 2. During training, we minimize the negative log-probability of the correct tag sequence:
$$ -\log p(y \mid X) = -s(X, y) + \log \sum_{\tilde{y} \in Y_X} \exp\big(s(X, \tilde{y})\big) $$
where Y_X denotes the set of all possible tag sequences for X.
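The sequence score (transition scores from A plus tag scores from P) and the negative log-probability can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the start and end tags are assumed to occupy indices k and k + 1 of A, and the normalizer is computed by brute-force enumeration, which is only feasible for toy inputs (practical CRF layers use the forward algorithm instead).

```python
import itertools
import math


def sequence_score(P, A, y, start, end):
    """Score of tag sequence y: transitions (incl. start/end) plus tag scores."""
    tags = [start] + list(y) + [end]
    transition = sum(A[tags[i]][tags[i + 1]] for i in range(len(tags) - 1))
    emission = sum(P[i][tag] for i, tag in enumerate(y))
    return transition + emission


def neg_log_likelihood(P, A, y, k, start, end):
    """-log p(y | X); the normalizer enumerates all k**n tag sequences."""
    gold = sequence_score(P, A, y, start, end)
    all_scores = [
        sequence_score(P, A, candidate, start, end)
        for candidate in itertools.product(range(k), repeat=len(P))
    ]
    # Log-sum-exp with max subtraction for numerical stability.
    top = max(all_scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in all_scores))
    return log_z - gold
```

With all scores set to zero, every one of the k^n sequences is equally likely, so the loss for any sequence equals log(k^n); this gives a quick sanity check of the normalizer.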

Hybrid Loss: CRF + Ranking
We use a variant of the ranking loss function proposed by dos Santos et al. (2015). Ranking maximizes the distance between the true label y^+ and the most competitive label c^-:
$$ L_{rank} = \log\big(1 + \exp(\gamma (m^+ - s_{y^+}))\big) + \log\big(1 + \exp(\gamma (m^- + s_{c^-}))\big) $$
where s_{y^+} and s_{c^-} are the scores assigned by the model to the true label and to the most competitive label respectively, γ is the scaling factor that penalizes the predictions, and m^+ and m^- are margins for correct and incorrect labels respectively. We follow Vu et al. (2016) to set the values of the margins.
The hybrid loss function is hence the sum of the CRF tagging loss and the ranking loss:
$$ L_{hybrid} = L_{CRF} + \alpha L_{rank} $$
where α ∈ [0, 1] weighs the contribution of the ranking loss to the overall loss value. During training we minimize the hybrid loss, and we found it to improve the F1 score for both the BB-norm+NER and PharmaCoNER tasks.
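As a rough illustration, the following Python sketch computes a ranking loss in the form given by dos Santos et al. (2015) and combines it with a CRF loss value. Several points are assumptions rather than details from this paper: the function names, the application of the ranking term per token, and the default values of gamma, m_pos, m_neg and alpha, which are placeholders only.

```python
import math


def ranking_loss(scores, gold, gamma=2.0, m_pos=2.5, m_neg=0.5):
    """Pairwise ranking loss for one token (placeholder hyperparameters)."""
    s_pos = scores[gold]
    # Most competitive label: the highest-scoring label other than the gold one.
    s_neg = max(s for j, s in enumerate(scores) if j != gold)
    return (math.log1p(math.exp(gamma * (m_pos - s_pos)))
            + math.log1p(math.exp(gamma * (m_neg + s_neg))))


def hybrid_loss(crf_loss, token_scores, gold_tags, alpha=0.5):
    """CRF tagging loss plus alpha times the summed per-token ranking loss."""
    rank = sum(ranking_loss(s, g) for s, g in zip(token_scores, gold_tags))
    return crf_loss + alpha * rank
```

The loss shrinks as the gold label's score rises above m_pos and the best competing score falls below -m_neg, which is what pushes the two apart. Setting alpha to 0 recovers the plain CRF loss.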

Multi-Tasking of Named Entity Recognition, Detection and Language Modelling
We employed the auxiliary objectives of named-entity detection (NED) (Aguilar et al., 2017) and bidirectional language modelling (LM) (Rei, 2017). Multi-task learning (Collobert and Weston, 2008) with these objectives improves the overall performance. With these multi-tasking objectives, for each word token our model predicts the NED tag, the next word, the previous word and the NER tag (footnote 2). The NED and LM layers in Figure 2 realize the NED and LM objectives respectively. Note that multi-tasking is only enabled at training time and requires no additional labelling.
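To see why no additional labelling is needed, the sketch below derives all auxiliary targets from the NER-labelled tokens alone. It is a hypothetical helper, not the authors' code. It assumes hyphen-separated IOBES tags (e.g., B-Habitat) and assumes that the NED tag is the boundary part of the NER tag with the entity type removed; the paper does not spell out the exact NED label set.

```python
def auxiliary_targets(tokens, ner_tags, pad="<pad>"):
    """Derive NED tags and bidirectional LM targets from NER-labelled tokens."""
    # NED target: keep only the boundary prefix, dropping the entity type.
    ned_tags = [tag.split("-", 1)[0] for tag in ner_tags]
    # Forward LM target: the next word; backward LM target: the previous word.
    next_words = tokens[1:] + [pad]
    prev_words = [pad] + tokens[:-1]
    return ned_tags, next_words, prev_words
```

For the tokens fish, pathogen, in, water with tags B-Phenotype, E-Phenotype, O, S-Habitat, the NED targets are B, E, O, S, and the LM targets are simply the token sequence shifted by one position in each direction.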

Nested Entities
The dataset of the BB-norm+NER task contains 17.4% nested entities (footnote 3), which cannot be extracted by a standard Bi-LSTM-CRF model. We employed two Bi-LSTM-CRF models: a Level1 NER model to detect parent entities and a Level2 Nested NER model to detect nested entities. Figure 2 (right) shows the architecture of Level2 Nested NER. The parent entities detected by Level1 NER are fed to Level2 Nested NER to detect nested entities within the parent entities. Level2 Nested NER has the same architecture as Level1 NER but without the multi-tasking objectives. Note that the current architecture can only detect nested entities at level 2, i.e., one level below the parent. The final output of the model is the aggregation of parent entities and nested entities.
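The two-level control flow can be sketched in Python as follows. The taggers are passed in as plain callables so that the sketch stays independent of any model; the function name, the (start, end, type) span format with an exclusive end index, and the stub taggers in the usage note are all assumptions made for illustration.

```python
def extract_entities(tokens, level1, level2):
    """Run Level1 on the sentence, then Level2 inside each parent span."""
    parents = level1(tokens)  # list of (start, end, type), end exclusive
    nested = []
    for start, end, _ in parents:
        # Level2 only sees the parent's tokens, so its spans are relative
        # to the parent and must be shifted back to sentence offsets.
        for s, e, etype in level2(tokens[start:end]):
            nested.append((start + s, start + e, etype))
    # Final output: aggregation of parent and nested entities.
    return parents + nested
```

For example, with a stub Level1 tagger that returns the span for fish pathogen as Phenotype and a stub Level2 tagger that marks its first token as Habitat, the function returns both the parent span and the shifted nested span for fish.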

Entity Normalization
The goal of entity normalization (entity linking) is to map noisy predicted entities in text to canonical entities in a knowledge base (KB). This is challenging because: (1) not all variations of textual forms for a canonical entity exist in the KB, and (2) the predicted entity mentions exhibit syntactic variations due to misspellings, abbreviations, acronyms and boundary errors.
Footnote 2: We used the IOBES tagging scheme.
Footnote 3: https://groups.google.com/d/msg/bb-2019/A2MuFYiPQIY/9YtMmakeBQAJ
For the BB-norm+NER task, we used two biomedical databases: OntoBiotope Ontology and NCBI Taxonomy. OntoBiotope Ontology contains 3,602 canonical forms of type Habitat and Phenotype. NCBI Taxonomy contains 1,082,401 records for type Microorganism. We employed exact, fuzzy and semantic (embedding) search to perform entity normalization. Algorithm 1 illustrates the detailed steps of our algorithm; note that the type and order of search depend on the predicted named entity type. We also employed caching to minimize pairwise comparisons and improve the overall run-time efficiency.
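A minimal Python sketch of the exact-then-fuzzy lookup with caching is given below. It is not Algorithm 1: the semantic (embedding) search and the type-dependent search order are omitted, the tiny KB and its identifiers are made up for illustration, and the use of difflib string similarity with a cutoff of 0.8 is an assumption.

```python
import difflib
from functools import lru_cache

# Hypothetical toy knowledge base: canonical form -> made-up identifier.
KB = {
    "escherichia coli": "KB:0001",
    "marine fish": "KB:0002",
    "farm fish": "KB:0003",
}


@lru_cache(maxsize=None)  # cache repeated mentions to avoid re-comparison
def normalize(mention, cutoff=0.8):
    """Map a mention to a KB identifier: exact match first, then fuzzy."""
    key = mention.lower().strip()
    if key in KB:
        return KB[key]
    close = difflib.get_close_matches(key, list(KB), n=1, cutoff=cutoff)
    return KB[close[0]] if close else None
```

With this toy KB, the misspelled mention Escherichia colli falls through the exact lookup and is resolved by the fuzzy step, while a mention with no sufficiently similar entry yields None.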

Post-processing for NER+norm
Our model (see Figure 2) employs a CRF at the decoding step to impose boundary ordering constraints on the predicted named entity types, e.g., an I token should always be preceded by a B token. However, our model does not always respect such ordering constraints; therefore, we resolve boundary inconsistencies at inference time to make the NER labels consistent. The post-processing column in Table 3 illustrates how post-processing resolves inconsistent labels after voting on majority labels. Consider row r3, where post-processing correctly imposes the semantics of boundary ordering by changing I-Habitat to B-Habitat.
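One possible way to repair such inconsistencies is sketched below in Python. It covers only the case named in the text (an I tag that does not continue an entity of the same type becomes a B tag) and, as a simplifying assumption, uses plain BIO tags with a hyphen separator; the E and S tags of the IOBES scheme would need analogous rules. The function name is hypothetical.

```python
def repair_boundaries(tags):
    """Turn an I- tag that does not continue a same-type entity into B-."""
    fixed = []
    prev_type = None  # type of the entity the previous token belongs to
    for tag in tags:
        prefix, _, etype = tag.partition("-")
        if prefix == "I" and prev_type != etype:
            tag = "B-" + etype  # entity cannot start with I: promote to B
        fixed.append(tag)
        prev_type = etype if prefix in ("B", "I") else None
    return fixed
```

For instance, the sequence O, I-Habitat, I-Habitat, O becomes O, B-Habitat, I-Habitat, O.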

Relation Extraction
Deep learning based methods are the state of the art in relation extraction (Wu and He, 2019; Wang et al., 2016), but they require a large amount of labelled training data. When such large training data is not available, kernel methods like Support Vector Machines (SVM) are a suitable choice. We employed an SVM for performing relation extraction. One of the downsides of SVMs is that they usually require many hand-crafted features to train properly. Table 2 lists the computed general and entity features.
Our best model was trained with a Radial Basis Function (RBF) kernel, with the value of the penalty parameter C determined by grid search for each dataset. We employed oversampling and class-weight penalization to handle imbalanced data. Surprisingly, oversampling did not provide any performance improvement; therefore, the final models were trained only with higher class weights for minority classes. We did not normalize any input feature, as doing so resulted in reduced performance. In relation extraction the participating entities are not known in advance; the usual practice is to test every valid pair of entities for a relation. We employed a heuristic based on the token count between entities to filter out probable invalid relations. The threshold on the token count was determined using cross-validation.
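The candidate generation step with the token-count heuristic might look like the following Python sketch. The function name, the (name, type, token_index) entity format, the use of ordered pairs and the definition of distance as the difference between token indices are assumptions made for illustration; the entity types in the usage note are placeholders.

```python
from itertools import permutations


def candidate_pairs(entities, valid_type_pairs, tau):
    """Ordered entity pairs with a valid type combination within tau tokens.

    entities: list of (name, type, token_index) tuples from one sentence.
    valid_type_pairs: set of (type_1, type_2) combinations allowed by the task.
    """
    pairs = []
    for (n1, t1, i1), (n2, t2, i2) in permutations(entities, 2):
        if (t1, t2) in valid_type_pairs and abs(i1 - i2) <= tau:
            pairs.append((n1, n2))
    return pairs
```

Given entities e1 (Gene, token 0), e2 (Protein, token 5) and e3 (Protein, token 40), with (Gene, Protein) as the only valid combination and tau set to 20, only the pair (e1, e2) is kept as a candidate for the classifier.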

Ensemble Strategy
Bagging is a helpful technique to reduce variance without impacting the bias of the learning algorithm. We employed a variant of bagging (Breiman, 1996) which makes sure that every sample in the training set is part of the development set at least once and vice versa. We created three data folds and trained the model using the optimal configuration on each fold; prediction on the test set involves majority voting among the three trained models.
The commonly used tagging schemes (BIO, BIOES, etc.) for NER encode information about the boundary of an entity along with the class of an entity in a single tag, which is emitted by the model at each time-step. Due to this dual information in a single output, majority voting is not trivial, as models can disagree not only on the class but also on the boundary of an entity. Empirically, we found that our model is better at predicting the class of an entity than its boundary; therefore, we followed the strategy that the class determines the boundary. When voting results in a tie, we take the prediction of the confident model, where we treat the model trained on the original train/dev split as the confident model. We also experimented with an extreme version of ensembling where we aggregate the output of every model with distinct spans; as expected, this improves recall at the cost of reduced precision. One possible optimization of this ensemble strategy is to aggregate only the non-overlapping spans to control the reduction in precision without much decrease in recall; we will explore this in future work. Table 3 shows the ensemble correcting individual models' erroneous predictions.
For the RE ensemble, we followed the straightforward approach of majority voting at sentence level for each test sample.
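For the NER ensemble, one plausible reading of the strategy that the class determines the boundary is sketched below in Python: the models first vote on the entity class of a token, and the boundary prefix is then chosen only among the models that predicted the winning class. This reading, the function names and the hyphen-separated tag format are assumptions; the first tag in the list is taken to come from the confident model.

```python
from collections import Counter


def split_tag(tag):
    """Split 'B-Habitat' into ('B', 'Habitat'); 'O' becomes ('O', 'O')."""
    prefix, _, etype = tag.partition("-")
    return (prefix, etype) if etype else ("O", "O")


def vote(options, preferred):
    """Majority vote; ties go to `preferred` if it is among the winners."""
    counts = Counter(options)
    best = max(counts.values())
    winners = [o for o in counts if counts[o] == best]
    return preferred if preferred in winners else winners[0]


def ensemble_token(tags):
    """Combine per-model tags for one token; tags[0] is the confident model."""
    parts = [split_tag(t) for t in tags]
    etype = vote([e for _, e in parts], preferred=parts[0][1])
    if etype == "O":
        return "O"
    # Boundary is decided only among models that agree on the winning class.
    prefixes = [p for p, e in parts if e == etype]
    confident_prefix = parts[0][0] if parts[0][1] == etype else None
    return vote(prefixes, preferred=confident_prefix) + "-" + etype
```

For the per-model predictions B-Habitat, I-Habitat and O, the class Habitat wins the first vote and the boundary tie between B and I is resolved in favour of the confident model, giving B-Habitat. Any remaining boundary inconsistencies across tokens would be left to the post-processing step.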

Dataset and Experimental Setup
Data: We employed bagging (discussed in section 3.3) to split the annotated corpus into 3 folds. We used the pre-processed versions of the datasets for BB-norm+NER and SeeDev provided by the organizers. These pre-processed versions come with sentence splitting, word tokenization and POS tagging.
PharmaCoNER: The dataset consists of four entity types, with very few mentions of type UNCLEAR and NO_NORMALIZABLES, as shown in Table 4. Entities of type UNCLEAR are ignored in the evaluation of this shared task, but we still treat them as regular entities.
BB-norm+NER: The dataset consists of three entity types, with few mentions of type Phenotype (see Table 4). The dataset also contains 3.6% disconnected entities; we did not employ any strategy to handle disconnected entities and instead treat them as separate (regular) entities.
Experimental Setup: We found sub-word information to be very helpful in identifying entities and relations in the biomedical domain, and all our experiments used word embeddings trained using FastText (Bojanowski et al., 2017). For tasks in the English language we used FastText embeddings trained on PubMed (Zhang et al., 2019). We do not employ any strategy for handling imbalanced classes for NER but use class weighting by a factor of 10 for all positive classes for RE. Table 5 lists the best configuration of hyperparameters for all the tasks.
PharmaCoNER: We used SPACCC POS-TAGGER (Soares and Gonzalez-Agirre, 2019) for sentence splitting, word tokenization and POS tagging. We trained FastText embeddings on the following corpora: IBECS (Rodríguez, 2002), IULA-Spanish-English-Corpus (Marimon et al., 2017), MedlinePlus (Miller et al., 2000), PubMed (Lu, 2011), SciELO (Goldenberg et al., 2007) and PharmaCoNER. We trained embeddings on two variants of the corpora: (1) including the train and development sets of PharmaCoNER, and (2) including the complete dataset of PharmaCoNER. We concatenated these two embeddings to provide complementary information and found them empirically to work better than the embeddings trained on either corpus variant individually. We compute micro-F1 on the dev set using the script provided by the organizers (footnote 7).
BB-norm+NER: For training the NER model we compute macro-F1 (footnote 8) (Tsai et al., 2006) on the dev set. NER and entity normalization together are evaluated using the Standard Error Rate (SER) (Bossy et al., 2015). During the entity normalization step, the fuzzy and semantic search can resolve an entity mention to multiple normalization identifiers. Our algorithm returns the top 5 matched identifiers; however, we empirically found that selecting the topmost identifier gives superior performance.
SeeDev: We adopted two strategies to create negative relation instances for the train and dev+test sets: (1) Train: only consider sentences not participating in any positive relation; (2) Dev+Test: consider all the sentences. Negative relation instances are always created only among valid combinations of entity types. We also employed an extended version of the keyword matching of Li et al. (2016) as a feature (referred to as keyword vectors in Table 2).

Results on Development Set
To investigate the impact of features, we incrementally enabled them and observed the effect on performance on the dev set.
NER: Table 6 shows the scores on the dev set for PharmaCoNER and BB-norm+NER. Observe that FastText embeddings (row r2) outperform randomly initialized embeddings (row r1) and contribute the biggest performance boost for both datasets. Subsequently, orthographic (row r3) and POS (row r4) features improve the scores for PharmaCoNER but surprisingly lower the score for BB-norm+NER. In row r5, we perform multi-tasking with the auxiliary task of NED, leading to improvement only for PharmaCoNER. Next, we incorporate the hybrid loss including ranking (row r7), which consistently improves the score on both datasets. In row r8, we employ Brute Force Search (discussed in section 4.3), which significantly reduces SER for BB-norm+NER. Finally, we create an ensemble of (r7, r10, r12) and (r8, r10, r12) on the test set for PharmaCoNER and BB-norm+NER respectively.
RE: Table 7 shows the scores on the dev set for SeeDev. In row r1, negative instances dominate the training set, resulting in no learning. Observe that the introduction of class weights (row r2) compensates for the dominance of negative instances, leading to an F1 score of 0.205. Next, we added the entitytype (row r3) and sdp-entity (row r4) features; both of these features significantly improve the F1 score, i.e., by an absolute value of more than 4.0. Subsequently, emb-sdp (row r5) and lemma (row r6) contribute incremental improvements. Finally, we create an ensemble of row r6 on all three data folds.
Footnote 7: https://github.com/PlanTL-SANIDAD/PharmaCoNER-CODALAB-Evaluation-Script
Footnote 8: Evaluation measure with strict boundary detection.

Analysis on Development Set
BB-norm+NER: We also explored approaching the problem of NER and entity normalization in a reverse manner, by matching every entity mention from the biomedical databases (i.e., NCBI Taxonomy and OntoBiotope Ontology) against the text. Since this matching is in effect an exhaustive search, we refer to it as brute-force search. Figure 3 shows the comparison of: (1) brute-force search, (2) Level1 NER, (3) aggregation of brute-force search and Level1 NER, and (4) aggregation of brute-force search, Level1 NER and Level2 NER. Brute-force search yields high precision but a moderately low recall, with an SER value of 0.7. In comparison, Level1 NER has significantly higher recall with a small reduction in precision, yielding an SER value of 0.52. The aggregation of brute-force search and Level1 NER improves recall and lowers the SER value to 0.49. Finally, the aggregation of brute-force search, Level1 NER and Level2 NER results in balanced precision and recall values but an overall higher value of SER. Our submission on the test set employed the aggregation of brute-force search and Level1 NER.
SeeDev: We employed the heuristic of token counts between target entities to filter potential negative relation instances. With this heuristic in place, we only consider sentences with an entity distance less than or equal to the threshold parameter τ. Figure 4 shows the impact of different values of τ on system performance. The value of τ ≤ 20 gives a significant boost in precision with a minor decrease in recall. Our submission employed the threshold value of τ ≤ 20 between entity tokens.
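The brute-force search described for BB-norm+NER above can be illustrated with a short Python sketch that checks every token n-gram of a sentence against the database entries. It is only an approximation of the idea: the function name, the lowercased exact string match, the max_len limit on n-gram length and the toy KB in the usage note are assumptions.

```python
def brute_force_search(tokens, kb, max_len=5):
    """Return (start, end, kb_id) for every n-gram that matches a KB entry.

    kb maps lowercased canonical forms to identifiers; end is exclusive.
    """
    matches = []
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            key = " ".join(tokens[start:end]).lower()
            if key in kb:
                matches.append((start, end, kb[key]))
    return matches
```

On the tokens The, fish, pathogen, Vibrio with a toy KB containing fish and fish pathogen, the search returns both the span for fish and the overlapping span for fish pathogen, each already paired with its identifier. This is why the approach performs recognition and normalization in a single step.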

Comparison with Participating Systems
SeeDev: Our system achieves the best score among all participating systems, with an F1 score of 0.373, showing a compelling advantage. The system attains the highest precision (0.294) and recall (0.511). However, precision and recall are not balanced, and our system needs improvement to bring down false positives.
BB-norm+NER: Table 8 (right) shows the comparison of performance among participating teams on the BB-norm+NER test set. Our two submissions (MIC-CIS-1, MIC-CIS-2) ranked first and second with standard error rates (SER) of 0.7159 and 0.7867 respectively. The second submission employed Level2 NER to extract nested entities and hence has higher recall but reduced precision. MIC-CIS-1 has the highest precision (0.6242), and MIC-CIS-2 has a recall close to the best recall of BLAIR GMU-1 with a score of 0.4676. Precision and recall are not balanced; we hypothesize that improving nested entity extraction and modelling discontinuous entities will improve the system's recall.

Conclusion and Future Work
In this paper, we described the system with which we participated in the PharmaCoNER, BB-norm+NER and SeeDev shared tasks. Our NER system employed linguistic features, multi-tasking via auxiliary objectives and a hybrid loss including ranking loss to extract flat and nested entities in English and Spanish text. Our RE system employed an SVM with linguistic features. Compared to other participating systems, our submissions are ranked 1st in the BB-norm+NER and SeeDev tasks. Our system demonstrates competitive performance on PharmaCoNER with an F1-score of 0.8662.
In future, we would like to explore improved modelling strategies for nested NER and discontinuous entity extraction. Further, in this work we only addressed intra-sentence RE; we would be interested in exploring approaches for inter-sentence RE (Peng et al., 2017; Gupta et al., 2019b). Moreover, we would like to investigate the interpretability of LSTMs for NER and RE.