Simple Hierarchical Multi-Task Neural End-To-End Entity Linking for Biomedical Text

Recognising and linking entities is a crucial first step for many tasks in biomedical text analysis, such as relation extraction and target identification. Traditionally, biomedical entity linking methods rely heavily on heuristic rules and predefined, often domain-specific features that try to capture the properties of entities, as well as on complex multi-step architectures to detect and subsequently link entity mentions. We propose a significant simplification of the biomedical entity linking setup that does not rely on any heuristic methods. The system performs all the steps of the entity linking task jointly in either one or two stages. We explore the use of hierarchical multi-task learning, using mention recognition and entity typing as auxiliary tasks. We show that hierarchical multi-task models consistently outperform single-task models when the trained tasks are homogeneous. We evaluate the performance of our models on biomedical entity linking benchmarks using the MedMentions and BC5CDR datasets. We achieve state-of-the-art results on the challenging MedMentions dataset, and comparable results on BC5CDR.


Introduction & Related Work
The task of identifying and linking mentions of entities to the corresponding knowledge base is a key component of biomedical natural language processing, strongly influencing the overall performance of such systems. Existing biomedical entity linking systems can usually be broken down into two stages: (1) Mention Recognition (MR), where the goal is to recognise the spans of entity mentions in text, and (2) Entity Linking (EL, also referred to as Entity Normalisation or Standardisation), which, given a potential mention, tries to link it to an appropriate type and entity. Often, the entity linking task includes Entity Typing (ET) and Entity Disambiguation (ED) as separate steps, with the former aiming to identify the type of the mention, such as gene, protein, or disease, before passing it to the entity disambiguation stage, which grounds the mention to an appropriate entity.
Widely studied in the general domain, entity linking is particularly challenging for biomedical text. This is mostly due to the size of the ontology (here referred to as the knowledge base), the high syntactic and semantic overlap between types and entities, the complexity of terms, as well as the limited availability of annotated text.
Due to these challenges, the majority of existing methods rely on complex hand-crafted rules and architectures, including semi-Markov methods, approximate dictionary matching, or a set of external domain-specific tools with manually curated ontologies (Kim et al., 2019). These methods often comprise multiple steps, with each step carrying its errors over to subsequent stages. However, the underlying tasks are usually interdependent and have been shown to benefit from a joint objective (Durrett and Klein, 2014). Recently, both in the general and the biomedical domain, there has been a steady shift towards neural methods for EL (Kolitsas et al., 2018; Habibi et al., 2017), leveraging a range of techniques including entity embeddings (Yamada et al., 2016), multi-task learning (Mulyar and McInnes, 2020; Khan et al., 2020), and others (Radhakrishnan et al., 2018). There has also been a plethora of mixed methods combining heuristic approaches such as approximate dictionary matching with language models (Loureiro and Jorge, 2020).
This work focuses on multi-task approaches to end-to-end entity linking, which have already been studied in the biomedical domain. These include approaches leveraging pre-trained language models (Peng et al., 2020; Crichton et al., 2017; Khan et al., 2020), model dependency (Crichton et al., 2017), and cross-sharing model structures. An interesting approach has been proposed by Zhao et al. (2019), who established a multi-task deep learning model that trains NER and EL models in parallel, with each task leveraging feedback from the other. A model with a similar setup and architecture to ours, casting the EL problem as a simple per token classification problem, has been outlined by Broscheit (2019); nevertheless, its application domain, architecture, and training regime differ strongly from those proposed here.
In this study, we investigate the use of a significantly simpler model, drawing on a set of recent developments in NLP, such as pre-trained language models, hierarchical and multi-task learning to outline a simple, yet effective approach for biomedical end-to-end entity linking. We evaluate our models on three tasks, mention recognition, entity typing, and entity linking, investigating different task setups and architectures on the MedMentions and BioCreative V CDR corpora.
Our contributions are as follows: (1) we propose and evaluate two simple setups using fully neural end-to-end entity linking models for biomedical literature. We treat the problem as a per token or per entity classification problem over the entire entity vocabulary, with all steps of the entity linking task performed in either one or two stages. (2) We examine the use of mention recognition and entity typing as auxiliary tasks in both multi-task and hierarchical multi-task learning scenarios, showing that hierarchical multi-task models outperform single-task models when the tasks are homogeneous. (3) We outline the optimal training regime, including adapting the loss for the extreme classification problem.

Tasks
Our main task, which we refer to as Entity Linking (EL), aims at classifying each token or mention to an appropriate entity concept unique identifier (CUI). For a mention to be correctly identified, all of its tokens need to carry the correct gold annotation. If the model wrongly predicts the token right after or before the entity's gold annotated span, the entity prediction is wrong at the mention level (Mohan and Li, 2019). The same holds for the per entity setup, where the entity representation is derived through mean pooling of all tokens spanning a predicted entity: both the predicted span and the predicted entity need to match the gold annotation.

We also make use of two other tasks: Entity Typing (ET) and Mention Recognition (MR), with the former predicting an entity Type Unique Identifier (TUI) for each token and the latter predicting whether a token is part of a mention. We always use the BILOU scheme for mention recognition token annotation and, due to the low number of types in the BC5CDR dataset, also for the ET task on this corpus. We evaluate the ET and EL predictions at mention level in the same manner. In the per token setup, all three tasks are essentially sequence labelling problems, while in the per entity setup, only MR is a sequence labelling problem and both ET and EL are classification problems leveraging the predictions produced by the MR model.
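As an illustration, the BILOU annotation used for mention recognition can be produced from gold mention spans as follows (a minimal sketch, not the authors' code; the function name and the span convention of exclusive end indices are our own):

```python
def to_bilou(num_tokens, spans):
    """Convert gold mention spans into BILOU tags.

    spans: list of (start, end) token indices, end exclusive.
    Illustrative sketch only; tag prefixes would normally be combined
    with a type label, e.g. "B-Disease".
    """
    tags = ["O"] * num_tokens            # Outside by default
    for start, end in spans:
        if end - start == 1:
            tags[start] = "U"            # Unit-length mention
        else:
            tags[start] = "B"            # Beginning of mention
            for i in range(start + 1, end - 1):
                tags[i] = "I"            # Inside mention
            tags[end - 1] = "L"          # Last token of mention
    return tags
```

For instance, `to_bilou(6, [(1, 3), (4, 5)])` yields `["O", "B", "L", "O", "U", "O"]`, making a single-token mention distinguishable from the start of a longer one.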
We employ the ET and MR tasks in order to investigate multi-task learning methods, treating them as auxiliary tasks aimed at regularising and providing additional information to the main EL task by leveraging its inherently hierarchical structure. Correspondingly, we also examine the impact of the two auxiliary tasks on EL performance.

Models
We outline three models: a single-task model, a multi-task model, and a hierarchical multi-task model. The architecture of the latter two is depicted in Figure 2. All models take a sentence with its surrounding context as input and output a prediction per token (PT setup) or per average of the token embeddings spanning an entity (PE setup). For tokenisation, the embedding layer, and the encoder, we use SciBERT (base).
The single-task model only adds a feedforward neural network on top of the encoder transformer, acting as a decoder. In the multi-task scenario, three feedforward layers are added on top of the transformer, each corresponding to a specific task, namely MR, ET, and EL. All tasks share the encoder; during a forward pass, the encoder output is fed into each task-specific layer separately, after which the per-task losses are summed and backpropagated through the model. The intuition behind sharing the encoder is that training on multiple interdependent tasks acts as a regularisation method, improving overall performance and convergence speed.
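The shared-encoder setup can be sketched as follows (a toy NumPy illustration, not the actual model: random arrays stand in for the SciBERT encoder output, and the label-space sizes for the three heads are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)      # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

# Shared encoder output for a 4-token sequence (stand-in for SciBERT).
hidden = rng.standard_normal((4, 8))

# One linear head per task; sizes are illustrative:
# 5 BILOU tags (MR), 3 types (ET), 10 entities (EL).
heads = {"mr": rng.standard_normal((8, 5)),
         "et": rng.standard_normal((8, 3)),
         "el": rng.standard_normal((8, 10))}
labels = {"mr": np.array([0, 1, 2, 0]),
          "et": np.array([1, 1, 0, 2]),
          "el": np.array([3, 3, 0, 7])}

# Every head sees the same encoder output; the per-task losses are
# summed before the gradient flows back through the shared encoder.
total_loss = sum(cross_entropy(softmax(hidden @ heads[t]), labels[t])
                 for t in ("mr", "et", "el"))
```

In a real implementation the gradient of `total_loss` would update both the task heads and the shared encoder, which is what provides the regularisation effect described above.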
The last model is a hierarchical multi-task model that leverages the natural hierarchy between the three tasks by introducing an inductive bias: the lower-level tasks (MR, ET) are supervised at the bottom layers of the model and the higher-level task (EL) at the top layer. Similarly to Sanh et al. (2019), we add task-specific encoders and shortcut connections to pass information from lower- to higher-level tasks. A higher-level task takes the concatenation of the general transformer encoder output and the lower-level task-specific encoder output as its input. Here, we use multi-layer BiLSTMs as task-specific encoders.
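The shortcut concatenation can be illustrated as follows (a toy NumPy sketch: random arrays stand in for the shared transformer output and the task-specific BiLSTM outputs, and all dimensions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_shared, d_task = 4, 8, 6

# Output of the shared transformer encoder (stand-in for SciBERT).
shared = rng.standard_normal((n_tokens, d_shared))

# Stand-in for the MR task-specific BiLSTM output, supervised at the
# bottom of the model.
mr_enc = rng.standard_normal((n_tokens, d_task))

# Shortcut connection: the next task up receives the shared encoder
# output concatenated with the lower-level task encoder output.
et_in = np.concatenate([shared, mr_enc], axis=-1)

# Stand-in for the ET task-specific BiLSTM output.
et_enc = rng.standard_normal((n_tokens, d_task))

# The top-level EL task again concatenates the shared output with the
# output of the task-specific encoder below it.
el_in = np.concatenate([shared, et_enc], axis=-1)
```

The key design choice shown here is that every level keeps direct access to the shared encoder output, so higher-level tasks are not forced to rely solely on the lower-level representations.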
We experiment with all three models in the per token scenario, as all tasks in this setup are sequence labelling problems. For the per entity framework, we look at a single-task and hierarchical multi-task model, where only the MR step is a sequence labelling task and ET and EL are both classification tasks.

Training details
We treat both the PE and PT setups as multi-class classification problems over the entire entity vocabulary. In both cases, we use categorical cross-entropy to compute the loss. To address the class imbalance problem in the PT framework, we apply a lower weight to the Nil token's output class, keeping all other class weights equal. To improve convergence speed and memory efficiency, we compute the loss only over the entity classes present in the batch. Therefore, for token $t_i$ in a sequence $T$ (or, correspondingly, the mean-pooled entity representation from a set of tokens) with label $y_i$ and assigned class weight $w_k$, in a minibatch $B$ with entity labels derived from this batch $\hat{E} = E(B)$, the loss is computed as

$$\mathcal{L}(\theta) = -\sum_{j \in B} \sum_{i=1}^{|T_j|} \sum_{k \in \hat{E}} w_k \, y_{ij}^{k} \log h_\theta(t_{ij}, k)$$

Here, $y_{ij}^{k}$ represents the target label for token $i$ in sequence $j$ for class $k$, and $h_\theta(t_{ij}, k)$ represents the model prediction for token $t_{ij}$ and class $k$, where the parameters $\theta$ are defined by the encoder and decoder layers of the model.
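A minimal sketch of this in-batch restricted, Nil-weighted cross-entropy (our own illustrative implementation, not the authors' code; the Nil class index and the default weight are assumptions):

```python
import numpy as np

def batch_restricted_loss(logits, labels, nil_class=0, nil_weight=0.125):
    """Weighted cross-entropy over only the entity classes present in
    the minibatch (plus the Nil class).

    logits: (n_tokens, n_entities) full-vocabulary scores.
    labels: (n_tokens,) gold class indices.
    """
    # Ê = E(B): classes present in the batch, always including Nil.
    classes = np.unique(np.concatenate([labels, [nil_class]]))
    sub = logits[:, classes]                 # restrict to in-batch classes

    # Map full-vocabulary labels onto indices within the restricted set.
    remap = {c: i for i, c in enumerate(classes)}
    y = np.array([remap[l] for l in labels])

    # Softmax over the restricted class set only.
    sub = sub - sub.max(axis=1, keepdims=True)
    probs = np.exp(sub) / np.exp(sub).sum(axis=1, keepdims=True)

    # Down-weight the Nil class; all other class weights stay equal.
    w = np.where(labels == nil_class, nil_weight, 1.0)
    return -(w * np.log(probs[np.arange(len(y)), y])).mean()
```

Restricting the softmax to the in-batch classes keeps the normalisation and gradient computation small even when the full entity vocabulary runs into the hundreds of thousands.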
We found using context, namely the sentences before and after the sentence of interest, beneficial for the encoder; after encoding, the context sentences are discarded from further steps. For the encoder, we use the SciBERT (base) transformer and fine-tune its parameters during training. For the hierarchical multi-task model, we follow the training regime outlined by Sanh et al. (2019) and found that tuning the encoder only on the EL task marginally outperforms sharing it across all three tasks. We treated the Nil output class weight as an additional hyperparameter, set to 0.125 for the MedMentions (full) and BC5CDR datasets and 0.01 for MedMentions st21pv. All trainings were performed using Adam (Kingma and Ba, 2015) with a weight decay of 1e-4, a learning rate of 2e-5, a batch size of 32, and a maximum sequence length of 128. The models were trained on a single NVIDIA V100 GPU until convergence.

Datasets and Evaluation metrics
We evaluate our models on three datasets: two versions of the recently released MedMentions dataset, (1) the full set and (2) its st21pv subset (Mohan and Li, 2019), and the BioCreative V CDR task corpus (Li et al., 2016). Each mention in the datasets is labelled with a concept unique identifier (CUI) and a type unique identifier (TUI). Both MedMentions datasets target the UMLS ontology but vary in terms of the number of types and mentions, while the BioCreative V corpus is normalised with MeSH identifiers. The dataset details are summarised in Table 1.
We measure the performance of each task using the mention-level metrics described in (Mohan and Li, 2019), providing precision, recall, and F1 scores. Additionally, we record per token accuracy for the per token setup. As benchmarks, we use the SciSpacy package (Neumann et al., 2019), which has been shown to outperform other biomedical text processing tools such as QuickUMLS or MetaMap on full MedMentions and BC5CDR (Vashishth et al., 2020). Due to the scarcity of reported results on end-to-end entity linking on MedMentions, we also use a BiLSTM-CRF in the per token setup as a benchmark.
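Mention-level scoring of this kind can be sketched as follows (our own illustration following the definition above, not the official evaluation script; mentions are represented as (start, end, CUI) triples):

```python
def mention_level_prf(gold, pred):
    """Mention-level precision/recall/F1: a prediction counts as a true
    positive only if both the exact span and the linked CUI match a
    gold mention.

    gold, pred: sets of (start, end, cui) triples.
    """
    tp = len(gold & pred)                          # exact span + CUI matches
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For example, a prediction whose span is off by one token scores zero for that mention even if the CUI is correct, which is why boundary errors are penalised as described in the Tasks section.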

Results and discussion
In Tables 2 and 3 we outline the results on the MR, ET, and EL tasks. While the reported results are all optimal for single-task models, it should be noted that all multi-task models optimise for the EL task, with MR and ET serving as auxiliary tasks; hence EL is the focus of the discussion. All of the models outlined here significantly outperform SciSpacy and BiLSTM-CRF, particularly on ET and EL. The per entity setup performs better on EL than the simpler per token framework by 0.87 F1 points on average, yielding particularly better recall (2.03 points). Error analysis has shown that this is often due to the lexical overlap of some Nil tokens with entity tokens, which results in the model assigning an entity label to tokens with a gold Nil label. Furthermore, in the per token setup, the multi-task models consistently outperform the single-task models on EL, with the hierarchical multi-task model achieving the best results (on average 1.45 F1 points better than the single-task models). In contrast, this has not been the case for the per entity framework, where the single-task models have on average performed marginally better on EL. We hypothesise that this is due to the homogeneity of the tasks in the per token setup, with all tasks being sequence labelling problems, which is not the case in the per entity setup. Interestingly, the achieved results are higher for the full MedMentions dataset than for the st21pv subset. This highlights the problem of achieving high macro performance mentioned by Loureiro and Jorge (2020) for biomedical entity linking.

Conclusion & Future Work
In this work, we have proposed a simple neural approach to end-to-end entity linking for biomedical text which makes no use of heuristic features. We have shown that the problem can benefit from hierarchical multi-task learning when the tasks are homogeneous. We report state-of-the-art results on EL on the full MedMentions dataset and comparable results on the MR and ET tasks on BC5CDR (Zhao et al., 2019). The work could easily be extended by, for example, using the output of the PT setup as features, or by further developing the hierarchical multi-task framework for the end-to-end entity linking problem. Moreover, additional parameters such as output class weights or loss scaling, which have not been explored here, could easily be adapted to a particular problem.