End-to-end Biomedical Entity Linking with Span-based Dictionary Matching

Disease name recognition and normalization is a fundamental process in biomedical text mining. Recently, neural joint learning of both tasks has been proposed to utilize their mutual benefits. While this approach achieves high performance, disease concepts that do not appear in the training dataset cannot be accurately predicted. This study introduces a novel end-to-end approach that combines span representations with dictionary-matching features to address this problem. Our model handles unseen concepts by referring to a dictionary while maintaining the performance of neural network-based models. Experiments using two major datasets demonstrate that our model achieves results competitive with strong baselines, especially for concepts unseen during training.


Introduction
Identifying disease names, which is generally called biomedical entity linking, is a fundamental process in biomedical natural language processing, and it can be utilized in applications such as literature search systems and biomedical relation extraction (Xu et al., 2016). The usual system for identifying disease names consists of two modules: named entity recognition (NER) and named entity normalization (NEN). NER is the task of recognizing the span of a disease name, from the start position to the end position. NEN is the post-processing of NER, normalizing a disease name into a controlled vocabulary, such as MeSH or Online Mendelian Inheritance in Man (OMIM).
Although most previous studies have developed pipeline systems, in which the NER model first recognizes disease mentions (Weber et al., 2020) and the NEN model normalizes the recognized mentions (Leaman et al., 2013; Ferré et al., 2020; Xu et al., 2020; Vashishth et al., 2020), a few approaches employ a joint learning architecture for these tasks (Lou et al., 2017). These joint approaches simultaneously recognize and normalize disease names, utilizing their mutual benefits. For example, Leaman et al. (2013) demonstrated that dictionary-matching features, which are commonly used for NEN, are also effective for NER. While these joint learning models achieve high performance for both NER and NEN, they predominantly rely on hand-crafted features, which are difficult to construct because they require domain knowledge.
Recently, a neural network (NN)-based model that does not require any hand-crafted features was applied to the joint learning of NER and NEN (Zhao et al., 2019). NER and NEN were defined as two token-level classification tasks, i.e., their model classified each token into IOB2 tags and concepts, respectively. Although their model achieved state-of-the-art performance for both NER and NEN, a concept that does not appear in the training data (i.e., the zero-shot situation) cannot be predicted properly.
One possible approach to handling this zero-shot situation is to utilize dictionary-matching features. Suppose that an input sentence "Classic polyarteritis nodosa is a systemic vasculitis" is given, where "polyarteritis nodosa" is the target entity. Even if it does not appear in the training data, it can be recognized and normalized by referring to a controlled vocabulary that contains "Polyarteritis Nodosa (MeSH: D010488)." Combining such lookup mechanisms with NN-based models, however, is not a trivial task; dictionary matching must be performed at the entity level, whereas standard NN-based NER and NEN tasks are performed at the token level (for example, Zhao et al., 2019).
To overcome this problem, we propose a novel end-to-end approach for NER and NEN that combines span representations with dictionary-matching features. Using the score obtained from both features, it directly classifies the disease concept. Thus, our model can handle the zero-shot problem by using dictionary-matching features while maintaining the performance of the NN-based models. Our model is also effective in situations other than the zero-shot condition. Consider the following input sentence: "We report the case of a patient who developed acute hepatitis," where "hepatitis" is the target entity that should be normalized to "drug-induced hepatitis." While the longer span "acute hepatitis" also appears plausible for standalone NER models, our end-to-end architecture assigns a higher score to the correct shorter span "hepatitis" due to the existence of the normalized term ("drug-induced hepatitis") in the dictionary.
Through experiments using two major NER and NEN corpora, we demonstrate that our model achieves competitive results for both corpora. Further analysis illustrates that the dictionary-matching features improve the performance of NEN in the zero-shot and other situations.
Our main contributions are twofold: (i) we propose a novel end-to-end model for disease name recognition and normalization that utilizes both NN-based features and dictionary-matching features; (ii) we demonstrate that combining dictionary-matching features with an NN-based model is highly effective for normalization, especially in zero-shot situations.

Task Definition
Given an input sentence, which is a sequence of words $x = \{x_1, x_2, \cdots, x_{|X|}\}$ in the biomedical literature, let us define $S$ as the set of all possible spans and $L$ as the set of concepts, which contains the special label Null for a non-disease span. Our goal is to predict a set of labeled spans $y = \{\langle i, j, d \rangle_k\}_{k=1}^{|Y|}$, where $(i, j) \in S$ are the word indices of the span in the sentence and $d \in L$ is the disease concept.
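As an illustration of this formulation, the following minimal sketch enumerates the candidate span set $S$ for a sentence and attaches a concept label to each span. The 0-based inclusive indexing, the function name `enumerate_spans`, and the toy gold annotation are assumptions made for the example; the default width limit of 10 words follows the setting reported in the Implementation section.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (i, j): inclusive, 0-based word indices (an assumption)


def enumerate_spans(words: List[str], max_width: int = 10) -> List[Span]:
    """Return every span (i, j) with i <= j that covers at most max_width words."""
    spans = []
    for i in range(len(words)):
        # j ranges over end positions that keep the span within max_width words
        for j in range(i, min(i + max_width, len(words))):
            spans.append((i, j))
    return spans


words = "Classic polyarteritis nodosa is a systemic vasculitis".split()
spans = enumerate_spans(words)

# Toy gold annotation: "polyarteritis nodosa" (words 1..2) maps to MeSH D010488.
gold: Dict[Span, str] = {(1, 2): "D010488"}

# Every span receives a label; spans that are not disease names get "Null".
labeled = [(i, j, gold.get((i, j), "Null")) for (i, j) in spans]
```

Because every span is a classification target, the number of candidates grows quadratically with sentence length unless the width is capped.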

Model Architecture
Our model predicts the concept for each span based on a score, which is the weighted sum of two factors: the context score $\mathrm{score}_{\mathrm{cont}}$ obtained from span representations and the dictionary-matching score $\mathrm{score}_{\mathrm{dict}}$. Figure 1 illustrates the overall architecture of our model. We denote the score of the span $s$ for a candidate concept $c \in L$ by $\mathrm{score}(s, c)$; it is the weighted sum of $\mathrm{score}_{\mathrm{cont}}(s, c)$ and $\mathrm{score}_{\mathrm{dict}}(s, c)$, where $\lambda$ is the hyperparameter that balances the two scores. For concept prediction, the scores of all possible spans and concepts are calculated, and then the concept with the highest score is selected as the predicted concept for each span:
$$\hat{c}_s = \operatorname*{arg\,max}_{c \in L} \mathrm{score}(s, c)$$
Context score The context score is computed in a similar way to that of Lee et al. (2017), which is based on span representations. To compute the representation of each span, the input tokens are first encoded into token embeddings. We used BioBERT as the encoder, which is a variation of bidirectional encoder representations from transformers (BERT) that is trained on a large amount of biomedical text. Given an input sentence containing $T$ words, we obtain the contextualized embedding of each token using BioBERT as follows:
$$h_{1:T} = \mathrm{BioBERT}(x_1, x_2, \cdots, x_T)$$
where $h_{1:T}$ are the embeddings of the input tokens.
Span representations are obtained by concatenating several features derived from the token embeddings: $h_{\mathrm{start}(s)}$ and $h_{\mathrm{end}(s)}$, the start and end token embeddings of the span, respectively; $\hat{h}_s$, the weighted sum of the token embeddings in the span, which is obtained using an attention mechanism (Bahdanau et al., 2015); and $\phi(s)$, the size of span $s$. These representations $g_s$ are then fed into a simple feed-forward NN (FFNN) and a nonlinear function, GELU (Hendrycks and Gimpel, 2016).
Given a particular span representation and a candidate concept as the inputs, we formulate the context score as follows:
$$\mathrm{score}_{\mathrm{cont}}(s, c) = g_s \cdot W_c$$
where $W \in \mathbb{R}^{|L| \times d_g}$ is the weight matrix over all concepts, and $W_c$ represents the weight vector for the concept $c$.
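The context-score computation can be sketched in plain Python as follows. This is only an illustration under several assumptions: the token embeddings, attention logits, and all weights are toy values rather than BioBERT outputs or trained parameters; the span size is passed in as a small hypothetical feature vector `width_emb`; the FFNN is reduced to a single linear layer; and $g_s$ is taken to be the output of the FFNN and GELU, which is one reading of the description above.

```python
import math
from typing import Dict, List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def span_features(h: List[Vector], start: int, end: int,
                  attn_logits: Vector, width_emb: Vector) -> Vector:
    """Concatenate [h_start; h_end; attention-pooled h; size feature]."""
    tokens = h[start:end + 1]
    alphas = softmax(attn_logits[start:end + 1])  # attention over span tokens
    dim = len(h[0])
    pooled = [sum(a * t[d] for a, t in zip(alphas, tokens)) for d in range(dim)]
    return h[start] + h[end] + pooled + width_emb


def ffnn_gelu(x: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """One linear layer followed by GELU (the single layer is an assumption)."""
    return [gelu(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def context_scores(g_s: Vector, W: Dict[str, Vector]) -> Dict[str, float]:
    """score_cont(s, c) = g_s . W_c for every concept c."""
    return {c: sum(a * b for a, b in zip(g_s, w_c)) for c, w_c in W.items()}


# Toy inputs: three tokens with 2-dimensional embeddings.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn_logits = [0.0, 0.0, 0.0]
features = span_features(h, start=0, end=1, attn_logits=attn_logits, width_emb=[2.0])

weights = [[0.1] * 7, [-0.1] * 7]  # maps 7 concatenated features to d_g = 2
bias = [0.0, 0.0]
g_s = ffnn_gelu(features, weights, bias)

W = {"Null": [1.0, 0.0], "D010488": [0.0, 1.0]}  # one weight vector per concept
scores = context_scores(g_s, W)
```

Since the context score depends on a learned weight vector $W_c$ per concept, a concept that never occurs in training has no informative $W_c$, which is the gap the dictionary-matching score is meant to fill.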
Dictionary-matching score We used the cosine similarity of TF-IDF vectors as the dictionary-matching features. Because a concept has several synonyms, we calculated the cosine similarity for all synonyms of the concept and used the maximum cosine similarity as the score for that concept. The TF-IDF weights are calculated using character-level n-gram statistics computed over all disease names appearing in the training dataset and the controlled vocabulary. For example, given the span "breast cancer," synonyms with high cosine similarity are "breast cancer (1.0)" and "male breast cancer (0.829)."
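A minimal sketch of the dictionary-matching score and its combination with the context score is given below. Several details are assumptions rather than settings confirmed by the text: the n-gram range (1 to 3 characters), the smoothed IDF formula, lower-casing, the three-entry toy dictionary, and the combination form score_cont + λ · score_dict (the text states only that the score is a weighted sum balanced by λ). The Null label is assumed to receive no dictionary score. The sketch is not expected to reproduce the similarity values quoted above.

```python
import math
from collections import Counter
from typing import Dict, List


def char_ngrams(text: str, n_min: int = 1, n_max: int = 3) -> Counter:
    """Count character n-grams of the lower-cased text (range is an assumption)."""
    t = text.lower()
    grams: Counter = Counter()
    for n in range(n_min, n_max + 1):
        for k in range(len(t) - n + 1):
            grams[t[k:k + n]] += 1
    return grams


def fit_idf(names: List[str]) -> Dict[str, float]:
    """Smoothed IDF over all names (training mentions plus dictionary synonyms)."""
    df: Counter = Counter()
    for name in names:
        df.update(set(char_ngrams(name)))  # document frequency: once per name
    n_docs = len(names)
    return {g: math.log((1 + n_docs) / (1 + c)) + 1.0 for g, c in df.items()}


def tfidf(text: str, idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse TF-IDF vector; n-grams unseen when fitting are ignored."""
    return {g: tf * idf[g] for g, tf in char_ngrams(text).items() if g in idf}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def score_dict(span: str, synonyms: List[str], idf: Dict[str, float]) -> float:
    """Maximum cosine similarity between the span and any synonym of a concept."""
    query = tfidf(span, idf)
    return max(cosine(query, tfidf(s, idf)) for s in synonyms)


def predict(span: str, cont_scores: Dict[str, float],
            dictionary: Dict[str, List[str]], idf: Dict[str, float],
            lam: float = 0.9) -> str:
    """Pick the concept with the highest combined score (assumed form)."""
    scores = {c: cont_scores.get(c, 0.0) + lam * score_dict(span, syns, idf)
              for c, syns in dictionary.items()}
    scores["Null"] = cont_scores.get("Null", 0.0)  # no dictionary entry for Null
    return max(scores, key=scores.get)


# Toy dictionary built from concept names mentioned in this paper.
dictionary = {
    "D010488": ["Polyarteritis Nodosa"],
    "D012010": ["Pure Red-Cell Aplasias"],
    "D003643": ["Death"],
}
idf = fit_idf([s for syns in dictionary.values() for s in syns])

# Zero-shot case: no context score is available for any concept.
best = predict("polyarteritis nodosa", cont_scores={}, dictionary=dictionary, idf=idf)
```

In this toy setting the unseen concept is still recovered because the span matches a dictionary synonym exactly, while a sufficiently high context score for Null would override the dictionary match.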

Datasets
To evaluate our model, we chose two major datasets used in disease name recognition and normalization against a popular controlled vocabulary, MEDIC (Davis et al., 2012). Both datasets, the National Center for Biotechnology Information Disease (NCBID) corpus (Dogan et al., 2014) and the BioCreative V Chemical Disease Relation (BC5CDR) task corpus (Li et al., 2016), comprise PubMed titles and abstracts annotated with disease names and their corresponding normalized term IDs (CUIs). NCBID is split into 593 training, 100 development, and 100 test documents, while BC5CDR evenly divides its 1,500 documents into the three sets. We adopted the same version of MEDIC as that used by TaggerOne, and we discarded the non-disease entity annotations contained in BC5CDR.

Baseline Models
We compared our model against several baselines. DNorm (Leaman et al., 2013) and NormCo (Wright et al., 2019) were used as pipeline models due to their high performance. In addition, we used a pipeline system consisting of state-of-the-art models: BioBERT for NER and BioSyn (Sung et al., 2020) for NEN.
TaggerOne and the transition-based model (Lou et al., 2017) were used as joint-learning models. These models outperformed the pipeline models on NCBID and BC5CDR. For the model introduced by Zhao et al. (2019), we could not reproduce the reported performance. Instead, we report the performance of a simple token-level joint learning model based on BioBERT, which we refer to as "joint (token)".

Implementation
We performed several preprocessing steps: splitting the text into sentences using the NLTK toolkit (Bird et al., 2009), removing punctuation, and resolving abbreviations using Ab3P (Sohn et al., 2008), a common abbreviation resolution module. We also merged the disease names in each training set into the controlled vocabulary, following the method of Lou et al. (2017).
For training, we set the learning rate to 5e-5 and the mini-batch size to 32. λ was set to 0.9 using the development sets. For BC5CDR, we trained the model using both the training and development sets, following previous work. For computational efficiency, we only considered spans of up to 10 words.

Evaluation Metrics
We evaluated the recognition performance of our model using micro-F1 at the entity level. We consider a predicted span a true positive when it is identical to a gold span. Following previous work (Wright et al., 2019), the performance of NEN was evaluated using micro-F1 at the abstract level. If a predicted concept was found within the gold standard concepts in the abstract, regardless of its location, it was considered a true positive.

Results & Discussions
Table 1 illustrates that our model mostly achieved the highest F1-scores in both NER and NEN, except for NEN on BC5CDR, in which the transition-based model displays its strength as a baseline. The proposed model outperformed the pipeline of state-of-the-art models for both tasks, which indicates that the improvement is attributable not to the strength of BioBERT but to the model architecture, including the end-to-end approach and the combination with dictionary-matching features.
Comparing the results of the model variations, adding dictionary-matching features improved the performance in NEN. The results suggest that dictionary-matching features are effective for NN-based NEN models.
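For reference, the abstract-level micro-F1 used for the NEN results can be sketched as follows. The function name, the document IDs, and the toy gold and predicted concept sets are hypothetical; the sketch assumes that each abstract is represented by the set of concept IDs it contains, so that a concept counts once per abstract regardless of its location.

```python
from typing import Dict, Set, Tuple


def micro_f1_abstract(gold: Dict[str, Set[str]],
                      pred: Dict[str, Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over per-abstract concept sets."""
    tp = fp = fn = 0
    for doc_id in gold.keys() | pred.keys():
        g = gold.get(doc_id, set())
        p = pred.get(doc_id, set())
        tp += len(g & p)   # predicted concepts found among the gold concepts
        fp += len(p - g)   # predicted concepts absent from the gold set
        fn += len(g - p)   # gold concepts that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example with two abstracts (IDs and predictions are illustrative only).
gold = {"doc1": {"D010488", "D012010"}, "doc2": {"D003643"}}
pred = {"doc1": {"D010488"}, "doc2": {"D003643", "D020758"}}
precision, recall, f1 = micro_f1_abstract(gold, pred)
```

The entity-level NER metric follows the same counting scheme if each set element is an exact (sentence, start, end) span instead of a concept ID.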

Contribution of Dictionary-Matching
To analyze the behavior of our model in the zero-shot situation, we investigated the NEN performance on two subsets of both corpora: disease names with concepts that appear in the training data (i.e., the standard situation), and disease names with concepts that do not appear in the training data (i.e., the zero-shot situation). Table 2 shows the number of mentions and concepts in each situation. Table 3 displays the results for the zero-shot and standard situations. The proposed model with dictionary-matching features can classify disease concepts in the zero-shot situation, whereas the NN-based classification model cannot normalize these disease names.
The results for the standard situation demonstrate that combining dictionary-matching features also improves the performance even when the target concepts appear in the training data. This finding implies that an NN-based model can benefit from dictionary-matching features even when it can learn from a large amount of training data.
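The partition into standard and zero-shot subsets can be sketched as below. The function name, the representation of a mention as a (text, concept ID) pair, and the toy training concept set are assumptions made for illustration.

```python
from typing import List, Set, Tuple

Mention = Tuple[str, str]  # (mention text, gold concept ID)


def split_by_training_concepts(
    train_concepts: Set[str], test_mentions: List[Mention]
) -> Tuple[List[Mention], List[Mention]]:
    """Split test mentions into (standard, zero_shot) by concept membership."""
    standard: List[Mention] = []
    zero_shot: List[Mention] = []
    for mention in test_mentions:
        # A mention is zero-shot when its concept never occurs in training.
        (standard if mention[1] in train_concepts else zero_shot).append(mention)
    return standard, zero_shot


# Toy data: only D010488 and D003643 are assumed to occur in training.
train_concepts = {"D010488", "D003643"}
test_mentions = [
    ("polyarteritis nodosa", "D010488"),
    ("pure red cell aplasia", "D012010"),
    ("death", "D003643"),
]
standard, zero_shot = split_by_training_concepts(train_concepts, test_mentions)
```

NEN performance is then computed separately on each subset.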

Case study
We examined 100 randomly sampled sentences to determine the contributions of dictionary-matching features. There are 32 samples in which the model predicted concepts correctly only after dictionary-matching features were added. Most of these samples involve disease concepts that do not appear in the training set but appear in the dictionary. For example, "pure red cell aplasia (MeSH: D012010)" is not in the BC5CDR training set, while MEDIC contains "Pure Red-Cell Aplasias" for "D012010". In this case, a high dictionary-matching score leads to a correct prediction in the zero-shot situation.
In contrast, there are 32 samples in which the dictionary-matching features cause errors. The sources of this error type are typically general disease names in MEDIC. For example, "Death (MeSH: D003643)" is incorrectly predicted as a disease concept in NER. Because these words are also used in general contexts, our model overestimated their dictionary-matching scores.
Furthermore, in the remaining samples, our model predicted the code properly but the span incorrectly. For example, although "thoracic hematomyelia" is labeled as "MeSH: D020758" in the BC5CDR test set, our model recognized this as "hematomyelia." In this case, our model mostly relied on the dictionary-matching features and misclassified the span because "hematomyelia" is in MEDIC but not in the training data.

Limitations
Our model is inferior to the transition-based model on BC5CDR. One possible reason is that the transition-based model utilizes normalized terms that co-occur within a sentence, whereas our model does not. Disease names that co-occur within a sentence can be highly useful for normalizing one another. Although BERT implicitly considers the interaction between disease names via the attention mechanism, a more explicit method is preferable for normalizing diseases.
Another limitation is that our model treats the dictionary entries equally. Because certain terms in the dictionary may also be used for non-disease concepts, such as gene names, we must consider the relative importance of each concept.

Conclusion
We proposed an end-to-end model for disease name recognition and normalization that combines an NN-based model with dictionary-matching features. Our model achieved highly competitive results on the NCBI disease corpus and the BC5CDR corpus, demonstrating that incorporating dictionary-matching features into an NN-based model can improve its performance. Further experiments showed that dictionary-matching features enable our model to accurately predict concepts in the zero-shot situation, and that they are also beneficial in the standard situation. While the results illustrate the effectiveness of our model, we found several areas for improvement, such as the handling of general terms in the dictionary and the interaction between disease names within a sentence. A possible future direction for dealing with general terms is to jointly train parameters representing the importance of each synonym.