Clustering-based Inference for Biomedical Entity Linking

Due to the large number of entities in biomedical knowledge bases, only a small fraction of entities have corresponding labelled training data. This necessitates entity linking models which are able to link mentions of unseen entities using learned representations of entities. Previous approaches link each mention independently, ignoring the relationships between entity mentions within and across documents. These relations can be very useful for linking mentions in biomedical text, where linking decisions are often difficult due to mentions having a generic or a highly specialized form. In this paper, we introduce a model in which linking decisions can be made not merely by linking to a knowledge base entity but also by grouping multiple mentions together via clustering and jointly making linking predictions. In experiments on the largest publicly available biomedical dataset, we improve the best independent-prediction baseline for entity linking by 3.0 points of accuracy, and our clustering-based inference model further improves entity linking by 2.3 points.


Introduction
Ambiguity is inherent in the way entities are mentioned in natural language text. Grounding such ambiguous mentions to their corresponding entities, the task of entity linking, is critical to many applications: automated knowledge base construction and completion (Riedel et al., 2013; Surdeanu et al., 2012), information retrieval (Meij et al., 2014), smart assistants (Balog and Kenter, 2019), question answering (Dhingra et al., 2020), and text mining (Murty et al., 2018).
Consider the excerpt of text from a biomedical research paper in Figure 1: the three highlighted mentions (expression, facial expressions, and facially expressive) all link to the same entity, namely C0517243 - Facial Expression, in the leading biomedical KB, the Unified Medical Language System (UMLS) (Bodenreider, 2004).
Figure 1: Biomedical Entity Linking. All three highlighted mentions refer to the same entity. The mention expression is clearly related to the other two highlighted mentions, which are much less ambiguous. If considered independently, expression is more closely related to an incorrect entity. (The figure shows the abstract of "When emotion and expression diverge: The social costs of Parkinson's disease," with the mentions expression, facial expressions, and facially expressive highlighted.)

The mention expression is highly ambiguous and easily confused with the more prevalent entity, Gene expression. This linking decision may become easier with sufficient training examples (or sufficiently rich structured information in the knowledge-base). However, in biomedical (Mohan and Li, 2019) and other specialized domains (Logeswaran et al., 2019), it is often the case that the knowledge-base information is largely incomplete. Furthermore, the scarcity of training data leads to a setting in which most entities have not been observed at training time. State-of-the-art entity linking methods which are able to link entities unseen at training time make predictions for each mention independently (Logeswaran et al., 2019; Wu et al., 2019). As a result, these methods may have difficulty linking mentions which, as in the example above, have little lexical similarity with the entities in the knowledge-base, as well as mentions for which the context is highly ambiguous. Such methods cannot use information from one mention (or its linking decision) to inform the prediction of another mention. On the other hand, entity linking methods that do jointly consider entity linking decisions (Ganea and Hofmann, 2017; Le and Titov, 2018) are designed for cases in which all of the entities in the knowledge-base have example mentions or metadata at training time (Logeswaran et al., 2019).
In this paper, we propose an entity linking model in which entity mentions are either (1) linked directly to an entity in the knowledge-base or (2) joined into a cluster of other mentions which links as a cluster to an entity in the knowledge-base. Some mentions may be difficult to link directly to their ground truth entity, but may have very clear coreference relationships to other mentions. So long as one mention among the group of mentions clustered together links to the correct entity, the entire cluster can be correctly classified. This provides a joint, transductive-like inference procedure for linking. We describe both the inference procedure and the training objective for optimizing the model's inference procedure, based on recent work on supervised clustering (Yadav et al., 2019).
It is important to note that our approach does not aim to do joint coreference and linking, but rather makes joint linking predictions by clustering together mentions that are difficult to link directly to the knowledge-base. For instance, in Figure 1, the mention expression may be difficult to link to the ground truth Facial expression entity in the knowledge-base because the mention can refer to a large number of entities. However, the local syntactic and semantic information of the paragraph gives strong signals that expression is coreferent with facial expressions, which is easily linked to the correct entity.
We perform experiments on two biomedical entity linking datasets: MedMentions (Mohan and Li, 2019), the largest publicly available dataset, as well as the benchmark BC5CDR (Li et al., 2016). We find that our approach improves over the strongest baseline (Logeswaran et al., 2019) by 2.3 points of accuracy on MedMentions and 0.8 points of accuracy on BC5CDR. We further analyze the performance of our approach and observe that (1) our method better handles ambiguous mention surface forms (as in the example shown in Figure 1) and (2) our method can correctly link mentions even when the candidate generation step fails to provide the correct entity as a candidate.

Background
Each document D ∈ D has a set of mentions M(D). We denote the set of all mentions across all documents as plainly M. The task of entity linking is to classify each mention m_i as referring to a single entity e_i from a KB of entities. We use E(m_i) to refer to the ground truth entity of mention m_i and ê_i to refer to the predicted entity.
Knowledge-bases. We assume that we are given a knowledge-base corresponding to a closed world of entities. These KBs are typically massive: English Wikipedia contains just over 6M entities 1 and the 2020 release of the UMLS contains 4.28M entities 2 . We describe in Sections 5.1 & 5.2 the details of the KBs used in each of the experiments.
Candidate Generation. Given the massive number of entities to which a mention may refer, previous work (Logeswaran et al., 2019, inter alia) uses a candidate generation step to restrict the number of entities considered for a given mention m to a candidate set Γ(m). The recall of this step is critical to the overall performance of entity linking models.

Model
In this section, we describe our clustering-based approach for jointly making entity linking predictions for a set of mentions. Our proposed inference method builds a graph in which the nodes are the union of all of the mentions and entities and the edge weights denote the affinities between their endpoints. To make linking decisions, we cluster the nodes of the graph such that each cluster contains exactly one entity, after which each mention is assigned to the entity in its cluster.

Figure 2: Color families indicate ground-truth cluster assignments. The left figure shows the graph G that is the basis of the clustering task, the center figure shows predictions under the independent linking model, and the right figure shows mentions linked to entities by our proposed constrained clustering inference procedure over G, which assigns at most one entity per cluster.

Clustering-based Entity Linking
The weight of each edge is determined by ϕ or ψ depending on the vertices of the edge: w(m_i, m_j) = ϕ(m_i, m_j) and w(m_i, e_l) = ψ(m_i, e_l). Linking decisions for each mention are determined by clustering the vertices of G under the constraint that every entity must appear in exactly one cluster. Given the graph G, we start with every node in its own individual cluster. We define the affinity between a pair of clusters as the strongest cross-cluster edge between nodes in the two clusters. We then iteratively and greedily merge the pair of clusters with the largest affinity between them, under the constraint that we cannot merge two clusters which both contain an entity. When every cluster contains exactly one entity, this process can no longer merge any clusters, and thus terminates 3 . Each mention is linked to the entity present in its cluster at the end of inference. Algorithm 1 describes this process of constructing the graph and clustering nodes to make linking decisions more formally. Figure 2 shows the proposed inference in action on five entities and six mentions. Initially, every mention and entity starts in a singleton cluster. In the first round, clusters {m_1} and {m_2} are merged, followed by the merger of {e_3} and {m_6} in the second round, and so on. Note that in the fifth round, cluster c_1 = {m_4, e_2} has higher affinity with c_2 = {m_1, m_2, m_3, e_1} than with c_3 = {m_5}, yet c_1 and c_3 are merged instead of c_1 and c_2 due to the constraint that we cannot merge two clusters which both contain an entity. At the end, every mention is clustered together with exactly one entity, and some entities may remain in singleton clusters, such as {e_4} and {e_5}. Note that m_3 correctly links to its gold entity e_1 as a result of being clustered with mentions m_1 and m_2, even though it has higher affinity with entity e_3: w(m_3, e_3) > w(m_3, e_1).
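A minimal sketch of this constrained greedy agglomeration, assuming a symmetric `weight` function over node pairs (node identifiers and helper names are illustrative, not the paper's implementation):

```python
def cluster_link(mentions, entities, weight):
    """Greedy single-linkage agglomeration under the constraint that no two
    clusters containing entities may merge. `weight` maps a node pair to an
    affinity (higher = closer); nodes are any hashable identifiers."""
    # Every mention and entity starts in its own singleton cluster.
    clusters = [{n} for n in mentions + entities]
    ents = set(entities)

    def has_entity(c):
        return any(n in ents for n in c)

    while not all(has_entity(c) for c in clusters):
        best, pair = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Constraint: never merge two clusters that both hold an entity.
                if has_entity(clusters[i]) and has_entity(clusters[j]):
                    continue
                # Cluster affinity = strongest cross-cluster edge.
                a = max(weight(u, v) for u in clusters[i] for v in clusters[j])
                if best is None or a > best:
                    best, pair = a, (i, j)
        i, j = pair
        clusters[i] |= clusters[j]
        del clusters[j]

    # Each mention links to the (unique) entity in its cluster.
    link = {}
    for c in clusters:
        e = next(n for n in c if n in ents)
        for m in c:
            if m not in ents:
                link[m] = e
    return link
```

With weights where m2 independently prefers e2 but is strongly tied to m1 (which clearly links to e1), the whole cluster resolves to e1, mirroring the transitive behavior described above.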

Affinity Models
We parameterize ϕ(·, ·) and ψ(·, ·) using two separate deep transformer encoders (Vaswani et al., 2017) for our mention-mention affinity model and mention-entity affinity model. Specifically, we use the BERT architecture (Devlin et al., 2019) initialized with the weights of BioBERT (Lee et al., 2020).

Mention-Mention Model
The mention-mention model is also a cross-encoder, taking as input a pair of mentions in context and producing a single scalar affinity for the pair. The input tokens for each mention take the form c_l [START] m_i [END] c_r, where m_i is the mention tokens and c_l and c_r are the left and right context of the mention in the text, respectively; the two mention sequences are concatenated to form the pair input. The [START] and [END] tokens are special tokens fine-tuned to signify the start and end of the mention in context, respectively. We restrict the length of each input sequence to a maximum of 256 tokens. A representation of each mention is computed as the average of the encoder's output representations corresponding to the mention's input tokens. The affinity for a mention pair is computed by concatenating their mention representations and passing the result through a linear layer with a sigmoid activation. We make this affinity symmetric by averaging over the two possible orderings of the pair of mentions in the cross-encoder input sequence.
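A small sketch of the input construction and the symmetric averaging step; the exact pairing template (concatenation with [CLS]/[SEP]) and the truncation budget are assumptions here, and `score` stands in for the encoder plus linear head:

```python
# Hypothetical special tokens; the real model fine-tunes these in BERT's vocab.
CLS, SEP, START, END = "[CLS]", "[SEP]", "[START]", "[END]"

def mention_in_context(c_left, mention, c_right, budget=126):
    """One mention in context: c_l [START] m [END] c_r, truncated to budget."""
    return (c_left + [START] + mention + [END] + c_right)[:budget]

def pair_input(m_i, m_j):
    """Cross-encoder input token sequence for an (ordered) mention pair."""
    return ([CLS] + mention_in_context(*m_i) + [SEP]
            + mention_in_context(*m_j) + [SEP])

def symmetric_affinity(score, m_i, m_j):
    """Average the two input orderings so phi(m_i, m_j) == phi(m_j, m_i)."""
    return 0.5 * (score(pair_input(m_i, m_j)) + score(pair_input(m_j, m_i)))
```

Because a cross-encoder's output depends on token order, averaging both orderings is what makes ϕ usable as a symmetric edge weight in the graph.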

Mention-Entity Model
The mention-entity affinity model is a cross-encoder model (Vig and Ramea, 2019; Wolf et al., 2019; Humeau et al., 2019, inter alia) and takes as input the concatenation of the mention in context with the entity description. The input tokens take the form c_l [START] m_i [END] c_r [SEP] e, where the mention in context is the same as in the mention-mention model and e is the description of the entity. We restrict the length of this input sequence to 256 tokens. After passing the input sequence through BERT, we transform the output representation corresponding to the [CLS] token with a linear layer with one output unit. This value is finally passed through a sigmoid function to output the affinity between the mention and the entity.

Training
In this section, we explain the training procedure for the affinity models ϕ(·, ·) and ψ(·, ·) used by the clustering inference procedure. We train the mention-mention and mention-entity models independently in a way that allows the affinities to be comparable when performing inference.
Algorithm 1: Clustering Inference for Linking. (The algorithm builds the graph G from mention-mention edges and mention-candidate edges e_l ∈ Γ(m_i), sorts the edges in descending order of weight, and clusters the nodes of G under the linking constraint.)

We use triplet max-margin based training objectives to train both models. The most important aspect of our procedure is how we pick negatives during training. For the mention-entity model, we restrict negatives to come from the candidate set. For the mention-mention model, we restrict negatives to come from mentions within the same document. From these sets of possible negatives, we choose the top-k most offending ones according to the instantaneous state of the model, i.e., the negatives with the highest predicted affinities according to the model at that point during training. The following sections detail the training procedures for both models.
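The hard-negative mining described above can be sketched as follows (function names and the use of gold labels to exclude same-entity mentions are illustrative assumptions):

```python
def hardest_negatives(anchor, pool, score, k):
    """The k pool items the current model scores highest against the anchor
    -- the 'most offending' negatives at this point in training."""
    return sorted(pool, key=lambda x: score(anchor, x), reverse=True)[:k]

def mention_entity_negatives(mention, gold_entity, candidates, psi, k):
    """Mention-entity negatives are restricted to the candidate set."""
    pool = [e for e in candidates if e != gold_entity]
    return hardest_negatives(mention, pool, psi, k)

def mention_mention_negatives(mention, doc_mentions, gold_of, phi, k):
    """Mention-mention negatives are restricted to the same document,
    excluding mentions of the anchor's own ground-truth entity."""
    pool = [m for m in doc_mentions
            if m != mention and gold_of[m] != gold_of[mention]]
    return hardest_negatives(mention, pool, phi, k)
```

Because the pools are re-scored with the current model, the negatives get harder as training progresses.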

Mention-Mention Affinity Training
To train the mention-mention affinity model, we use a variant of the maximum spanning tree (MST) supervised single linkage clustering algorithm presented in Yadav et al. (2019). Let M(D)_{e_l} = {m ∈ M(D) | E(m) = e_l} be the set of mentions referring to entity e_l in any one document, and let the set of ground truth clusters be represented by C* = {M(D)_{e_l} | e_l ∈ E}. Let P be the set of positive training edges: the edges of the MST of the complete graph on the cluster C ∈ C*. Let N_ϕ(m*) be the k-nearest within-document negatives to the anchor point m* ∈ C according to the current state of the model during training. The objective of this training procedure is to minimize the following triplet max-margin loss 4 with margin µ for each cluster C ∈ C*: the sum over positive edges (m*, m+) ∈ P and negatives m− ∈ N_ϕ(m*) of ℓ_{ϕ,µ}(m*, m+, m−), where ℓ_{ϕ,µ}(a, p, n) = [ϕ(a, n) − ϕ(a, p) + µ]_+.
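The selection of positive edges via a maximum spanning tree over a gold cluster, scored by the current affinity model, can be sketched with Kruskal's algorithm (a sketch under the assumption that ϕ is symmetric; this is not the paper's implementation):

```python
def mst_positive_edges(cluster, phi):
    """Positive edges for one gold cluster: the maximum spanning tree of the
    complete graph over the cluster, with edges scored by the current
    mention-mention affinity phi (Kruskal's algorithm with union-find)."""
    nodes = list(cluster)
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    edges = [(phi(a, b), a, b)
             for i, a in enumerate(nodes) for b in nodes[i + 1:]]
    mst = []
    for _, a, b in sorted(edges, reverse=True):  # heaviest edge first
        ra, rb = find(a), find(b)
        if ra != rb:  # keep the edge only if it joins two components
            parent[ra] = rb
            mst.append((a, b))
    return mst
```

Using the MST rather than all intra-cluster pairs keeps only the strongest coreference evidence as positives, which is the intuition behind the supervised single-linkage objective.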

Mention-Entity Affinity Training
For the mention-entity model, we use a triplet max-margin based objective with margin µ, where the anchor is a mention m in the training set, the positive is the ground truth entity e+ = E(m), and the negatives are chosen from the candidate set Γ(m). Denote the k most offending negatives according to the current state of the model during training as N_ψ(m). The objective is to minimize the sum over negatives n ∈ N_ψ(m) of ℓ_{ψ,µ}(m, e+, n), where ℓ_{ψ,µ}(a, p, n) = [ψ(a, n) − ψ(a, p) + µ]_+.
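The hinge term ℓ_{ψ,µ} summed over mined negatives is straightforward to compute; a minimal sketch (the margin default here is illustrative, not the paper's setting):

```python
def triplet_margin_loss(psi, anchor, positive, negatives, margin=0.5):
    """Sum of hinge terms [psi(a, n) - psi(a, p) + mu]_+ over the mined
    negatives: each negative is pushed at least `margin` below the positive."""
    pos = psi(anchor, positive)
    return sum(max(0.0, psi(anchor, n) - pos + margin) for n in negatives)
```

A negative contributes zero loss once it scores at least `margin` below the gold entity, so gradient signal concentrates on the offending candidates.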

Experiments
We evaluate on biomedical entity linking using the MedMentions (Mohan and Li, 2019) and BC5CDR (Li et al., 2016) datasets, comparing to state-of-the-art methods. We then analyze the performance of our method in more detail and provide qualitative examples demonstrating our approach's ability to use mention-mention relationships to improve candidate generation and linking.

MedMentions
MedMentions is a publicly available 5 dataset consisting of the titles and abstracts of 4,392 PubMed articles. The dataset is hand-labeled by annotators and contains labeled mention spans and entities linked to the 2017AA full version of UMLS. Following the suggestion of Mohan and Li (2019), we use the ST21PV subset of the corpus, which restricts mentions to entities of the 21 semantic types that were deemed most important for building scientific knowledge-bases. We refer the readers to Mohan and Li (2019) for a complete analysis of the dataset and provide a few important summary statistics here. The train/dev/test split partitions the PubMed articles into three non-overlapping groups. Some entities seen at training time also appear in dev/test, while other entities appear in dev/test but not at training time. In fact, a large fraction of the entities that appear in dev/test, about 42%, are unseen at training time. See Table 1 for split details and statistics. Previous work has evaluated on MedMentions using unfairly optimistic candidate generation settings, such as using only 10 candidates that always include the ground truth (Zhu et al., 2019) or restricting candidates to entities appearing somewhere in the MedMentions corpus (Murty et al., 2018). We instead work in a much more general setting in which all entities in UMLS are considered at candidate generation time and the generated candidates might not include the ground truth entity.

BC5CDR
BC5CDR (Li et al., 2016) is another entity linking benchmark in the biomedical domain. The dataset consists of 1,500 PubMed articles annotated with labeled disease and chemical entities. Unlike MedMentions, which contains 21 types of entities, this dataset contains just two types. These chemical and disease mentions are labeled with entities from MeSH 6 , a much smaller biomedical KB than UMLS. See Table 1 for split details and statistics.

Preprocessing
The MedMentions ST21PV corpus is processed as follows: (i) Abbreviations defined in the text of each paper are identified using AB3P (Sohn et al., 2008). Each definition and abbreviation instance is then replaced with the expanded form. (ii) The text of each paper in the corpus is tokenized and split into sentences using CoreNLP (Manning et al., 2014). (iii) Overlapping mentions are resolved by preferring longer mentions that begin earlier in each sentence, and mentions are truncated at sentence boundaries. This results in 379 mentions being dropped from the total of 203,282. (iv) Finally, the corpus is saved in the IOB2 tag format. The same preprocessing steps are used for BC5CDR, except overlapping mentions are not dropped.

Figure 3: Example predictions on Ambiguous Mentions. We show two example outputs for highly ambiguous mention surface forms (inhibition and activity, from abstracts about EEG dynamics of response inhibition and about E. coli transketolase activity, respectively). The independent model makes incorrect predictions on these surface forms. The clustering-based model is able to link each ambiguous mention to a less ambiguous mention in the same abstract and thereby make correct predictions.

Candidate Generation
For both datasets, we use a character n-gram TF-IDF model to produce candidates for all of the mentions in all splits. The candidate generator uses the 200k most frequent character n-grams, n ∈ {2, . . . , 5}, and the 200k most frequent words in the names in E to produce sparse vectors for all of the mentions and entity descriptions (which in our case are the canonical name, the type, and a list of known aliases and synonyms). Table 5 provides candidate generation results for each dataset. The results report the average recall@K at different numbers of candidates K, i.e., whether or not the gold entity is in the top K candidates for a given mention.
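A toy, pure-Python version of character n-gram TF-IDF retrieval conveys the idea (the real generator uses the 200k most frequent n-grams and words over full entity descriptions; the class and names below are illustrative only):

```python
import math
from collections import Counter

def char_ngrams(text, lo=2, hi=5):
    """Character n-grams for n in {lo..hi} of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for n in range(lo, hi + 1)
            for i in range(len(text) - n + 1)]

class NGramTfidfRetriever:
    """Candidate generation by cosine similarity of char n-gram TF-IDF vectors."""

    def __init__(self, entity_names):  # entity id -> name string
        docs = {e: Counter(char_ngrams(s)) for e, s in entity_names.items()}
        df = Counter(g for c in docs.values() for g in set(c))
        n = len(docs)
        self.idf = {g: math.log(n / df[g]) for g in df}
        self.vecs = {e: self._tfidf(c) for e, c in docs.items()}

    def _tfidf(self, counts):
        """L2-normalized tf-idf vector; unseen n-grams get zero weight."""
        v = {g: tf * self.idf.get(g, 0.0) for g, tf in counts.items()}
        norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
        return {g: x / norm for g, x in v.items()}

    def candidates(self, mention, k):
        """Top-k entity ids by cosine similarity to the mention string."""
        q = self._tfidf(Counter(char_ngrams(mention)))
        sims = {e: sum(q.get(g, 0.0) * w for g, w in v.items())
                for e, v in self.vecs.items()}
        return sorted(sims, key=sims.get, reverse=True)[:k]
```

Character n-grams make the retriever robust to morphological variation ("expressions" vs. "expression"), which is why it serves well as a high-recall first stage.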

Training and Inference Details
Our model contains 220M parameters, the majority of which are contained within the two separate BERT-based models. We optimize both models with mini-batch stochastic gradient descent using the Adam optimizer (Kingma and Ba, 2014) with the recommended learning rate of 5e-5 and no warm-up steps. We accumulate gradients over all of the triples within document clusters, for a batch size of 16. We compute the top-k most offending negatives on-the-fly for each batch by running the model in inference mode preceding each training step. Training and inference are done on a single machine with 8 NVIDIA 1080 Ti GPUs. We train our model on MedMentions for two epochs and on BC5CDR for four epochs. Training takes approximately three days for MedMentions and one day for BC5CDR. Clustering-based inference takes about three hours for MedMentions and one hour for BC5CDR. Code and data to reproduce our experiments will be made available.

Results
We compare our clustering-based inference procedure, which we refer to as CLUSTERING-BASED, to a state-of-the-art independent inference procedure, INDEPENDENT, which is the zero-shot architecture proposed by Logeswaran et al. (2019). This same model is used as the mention-entity affinity model in our approach. We also compare to an n-gram TF-IDF model (our candidate generation model), TAGGERONE (Leaman and Lu, 2016), BIOSYN (Sung et al., 2020), and SAPBERT (Liu et al., 2020) on both MedMentions and BC5CDR. Table 2 shows the performance of the baseline models, INDEPENDENT, and the CLUSTERING-BASED inference procedure on MedMentions and BC5CDR.

Table 2: Entity Linking Results. We report linking accuracy (overall, and on mentions of entities seen and unseen at training) on the MedMentions and BC5CDR datasets, with gold mention spans and with gold mention spans and types. We observe that CLUSTERING-BASED inference provides improved accuracy in each setting, with additional improvements seen when gold entity types are provided. († Hits at one synonym: multiple entities could be predicted.)

We report results using the gold mention segmentation (rather than end-to-end) to focus on the performance of each model in terms of linking, rather than confounding the results with segmentation performance. Due to TAGGERONE's joint entity recognition, typing, and linking architecture, we cannot make predictions for gold mention boundaries without also using its gold types. So, to have a fair comparison to TAGGERONE, we also provide the gold mention boundaries and types to each system and report these results as well.
We use seen and unseen to refer to the sets of mentions whose ground truth entities are seen and unseen at training, respectively. Note that even if a mention is in the subset of mentions referred to as seen, it does not mean that we have seen the particular surface form before in the training set, merely that we have seen other mentions of that particular entity.
On MedMentions, when the models are provided with only the gold mention span, the CLUSTERING-BASED inference procedure outperforms INDEPENDENT by 1.3 points of accuracy, and we see improvements in accuracy for both seen and unseen entities. When the models are additionally provided with the gold type, we see substantial improvements in accuracy for both INDEPENDENT and CLUSTERING-BASED over TAGGERONE, namely 3.0 and 5.3 points of improvement, respectively.
On BC5CDR, when the models are provided with only the gold mention span, the CLUSTERING-BASED inference procedure outperforms INDEPENDENT by 0.4 points of accuracy, and we see improvements in accuracy for both seen and unseen entities. When the models are additionally provided with the gold type, we see improvements in accuracy for both INDEPENDENT and CLUSTERING-BASED over TAGGERONE, namely 0.8 and 1.6 points of improvement, respectively.
Observe that the candidate generation results are drastically different for the two datasets (Table 5). We posit that the ability to generate correct candidates correlates with the relative difficulty of the linking task on each dataset.

Analysis: Recovering from Poor Candidate Generation
We hypothesize that our clustering-based inference procedure allows for better performance on mentions for which candidate generation is difficult. Observe that while the performance of the independent model is upper bounded by the recall of candidate generation, this is not an upper bound for the clustering-based model: the clustering-based model allows mentions that have no suitable candidates to link via other mentions in the same document. We report the accuracy of both systems with respect to whether or not the ground truth entity is in each mention's list of candidates. The accuracy for each system and each partition of mentions is shown in Table 3. Observe that our approach provides a correct resolution for a large number of mentions that the independent model could not link correctly because the ground truth entity was missing from the candidate list. Additionally, CLUSTERING-BASED does sacrifice some performance in comparison to INDEPENDENT when the ground truth entity is among the candidates, but more than makes up for it in the case where it is not.

Analysis: Handling Ambiguous Mentions
We also hypothesize that for mentions which are highly ambiguous and could refer to many different entities, such as common nouns like virus, disease, etc., clustering-based inference should offer improvements. Table 4 shows that our approach is able to correctly link more of these ambiguous mentions compared to the independent model 7 . Figure 3 shows two examples from this subset where CLUSTERING-BASED inference is able to make the correct linking decision and INDEPENDENT is not.
Related Work

In the biomedical domain, the Unified Medical Language System (UMLS) is often used as a knowledge-base of entities (Mohan and Li, 2019). While UMLS is a rich ontology of concepts and the relationships between them, this domain is low-resource compared to Wikipedia with respect to the amount of labeled training data for each entity. This leads to a zero-shot setting in datasets such as MedMentions (Mohan and Li, 2019), where new entities are seen at test time. Previous work has addressed this zero-shot setting using models of the type hierarchy (Murty et al., 2018; Zhu et al., 2019). This previous work uses an unrealistic candidate generation setting in which the true positive candidate is always within the candidate set and/or entities are limited to those in the dataset rather than those in the knowledge-base.
Mention-mention relationships are also explored by Le and Titov (2018), who extend the pairwise CRF model of Ganea and Hofmann (2017) to use mention-level relationships in addition to entity relationships. These works use attention to build the context representations of the mentions. However, as noted by Logeswaran et al. (2019), this approach is not well suited for zero-shot linking.
Coreference (both within and across documents) has also been explored in past work (Dutta and Weikum, 2015), which uses an iterative procedure that performs hard clustering to aggregate the contexts of entity mentions. Durrett and Klein (2014) present a CRF-based model for joint NER, within-document coreference, and linking, and show that jointly modeling these three tasks improves performance over independent baselines. This differs from our work in that we do not require coreference decisions to be correct in order to make correct linking decisions. Other work performs joint entity and event coreference (Barhom et al., 2019) without linking.

Conclusion
In this work, we presented a novel clustering-based inference procedure that enables joint entity linking predictions. We evaluated the effectiveness of our approach on two biomedical entity linking datasets, including the largest publicly available dataset. We showed through analysis that our approach is better suited to link mentions with ambiguous surface forms and to link mentions for which the ground truth entity is not in the candidate set.

Ethical Considerations
Entity linking is a task with the intention of providing useful information when building a semantic index of documents. This semantic index is a core component of systems which allow users to search, retrieve, and analyze text documents. In our specific case, we are interested in building semantic indexes of scientific documents where the end users would be scientists and researchers. The goal is to help them navigate the vast amount of literature and accelerate science. That being said, users need to treat the outputs of such a system as suggestions, with the potential that the information is incorrect. Researchers must be aware that the system is not perfect and should not jump to conclusions, especially about important decisions. Additionally, the researcher can always verify the decisions made by the system. While this paper focuses on biomedical entity linking, the technique could be extended to other domains. In such other domains, users might not have as much expertise, but the user is still responsible for making decisions on their own, since the system is not perfect. In addition, system developers and designers need to be aware of their particular application in order to mitigate harm which could come from such a system. For example, in any application that deals with personalized data, we need to be wary of the potential outcomes of an entity-linking-based system or semantic index, such as privacy violations or other malicious behaviour or unforeseen consequences due to the decisions being made by the system.

Table 3: Performance when Candidate Generation Fails. We report the accuracy of each method on mentions for which the ground truth entity is in the candidate list (E(m) ∈ Γ(m)) and is not in the list (E(m) ∉ Γ(m)). We observe that our proposed approach is able to perform reasonably well even when candidate generation fails.

Table 4: Performance on Ambiguous Mentions. Accuracy: INDEPENDENT (Logeswaran et al., 2019) 71.91; CLUSTERING-BASED (ours) 73.03. We select mentions for which the surface form is labeled with 10 or more different entities in MedMentions and measure performance on instances of these surface forms in the test data. We observe that CLUSTERING-BASED is able to more accurately link these mentions. Figure 3 shows examples of these mentions.

Table 5: Candidate Generation Recall. Recall is measured by whether or not the ground truth entity is in the top K candidate entities for the given mention. We report the micro average recall over all mentions.