Triplet-Trained Vector Space and Sieve-Based Search Improve Biomedical Concept Normalization

Concept normalization, the task of linking textual mentions of concepts to concepts in an ontology, is critical for mining and analyzing biomedical texts. We propose a vector-space model for concept normalization, where mentions and concepts are encoded via transformer networks that refine existing pre-trained models and are trained with a triplet objective and online hard triplet mining. The online triplet mining keeps training efficient even with hundreds of thousands of concepts by sampling training triplets within each mini-batch. We introduce a variety of strategies for searching with the trained vector-space model, including approaches that incorporate domain-specific synonyms at search time with no model retraining. Across five datasets, our models that are trained only once on their corresponding ontologies come within 3 points of state-of-the-art models that are retrained for each new domain. Our models can also be retrained for each domain, achieving new state-of-the-art results on multiple datasets.


Introduction
Concept normalization (a.k.a. entity linking or entity normalization) is a fundamental information extraction task that aims to map concept mentions in text to standard concepts in a knowledge base or ontology. The task is important for mining and analyzing unstructured text in the biomedical domain, because texts describing biomedical concepts exhibit many morphological and orthographic variations and use different word orderings or synonymous words. For instance, heart attack, coronary attack, MI, myocardial infarction, cardiac infarction, and cardiovascular stroke all refer to the same concept. Linking such terms to their corresponding concepts in an ontology or knowledge base is critical for data interoperability and for the development of natural language processing (NLP) techniques.
Research on concept normalization has grown thanks to shared tasks such as disorder normalization in the 2013 ShARe/CLEF (Suominen et al., 2013), chemical and disease normalization in the BioCreative V Chemical Disease Relation (CDR) task, and medical concept normalization in the 2019 n2c2 shared task (Henry et al., 2020), and to the availability of annotated data (Dogan et al., 2014; Luo et al., 2019). Existing approaches can be divided into three categories: rule-based approaches using string matching or dictionary look-up (Leal et al., 2015; D'Souza and Ng, 2015; Lee et al., 2016), which rely heavily on handcrafted rules and domain knowledge; supervised multi-class classifiers (Limsopatham and Collier, 2016; Lee et al., 2017; Tutubalina et al., 2018; Niu et al., 2019; Li et al., 2019), which cannot generalize to concepts not present in their training data; and two-step frameworks based on a non-trained candidate generator and a supervised candidate ranker (Leaman et al., 2013; Li et al., 2017; Liu and Xu, 2017; Nguyen et al., 2018; Murty et al., 2018; Mondal et al., 2019; Ji et al., 2020; Xu et al., 2020), which require complex pipelines and fail whenever the candidate generator does not produce the gold concept.
We propose a vector space model for concept normalization, where mentions and concepts are encoded as vectors via transformer networks trained with a triplet objective and online hard triplet mining, and mentions are matched to concepts by vector similarity. The online hard triplet mining strategy selects hard positive/negative exemplars from within each mini-batch during training, which ensures consistently increasing difficulty of triplets as the network trains, enabling fast convergence. Applying a vector space model to concept normalization has two advantages: 1) it is computationally cheap compared with other supervised classification approaches, as we compute the representations for all concepts in the ontology only once, after the network is trained; 2) it allows concepts and synonyms to be added or deleted after the network is trained, a flexibility that is important for the biomedical domain, where frequent updates to ontologies like the Unified Medical Language System (UMLS) Metathesaurus are common. Unlike prior work, our simple and efficient model requires neither negative sampling before training nor a candidate generator during inference.
Our work makes the following contributions:
• We propose a triplet network with online hard triplet mining for training a vector-space model for concept normalization, a simpler and more efficient approach than prior work.
• We propose and explore a variety of strategies for matching mentions to concepts using the vector-space model, the most successful being a simple sieve-based approach that checks domain-specific synonyms before domain-independent ones.
• Our framework produces models trained only on the ontology (no domain-specific training) that can incorporate domain-specific concept synonyms at search time without retraining; these models come within 3 points of state-of-the-art on five datasets.
• Our framework also allows models to be trained for each domain, achieving state-of-the-art performance on multiple datasets.

Related work
Earlier work on concept normalization focused on using morphological information to conduct lexical look-up and string matching (Kang et al., 2013; D'Souza and Ng, 2015; Leal et al., 2015; Kate, 2016; Lee et al., 2016; Jonnagaddala et al., 2016). These approaches rely heavily on handcrafted rules and domain knowledge; e.g., D'Souza and Ng (2015) define 10 types of rules at different priority levels to measure morphological similarity between mentions and candidate concepts in the ontologies. The lack of lexical overlap between concept mentions and concepts in domains like social media makes rule-based approaches that rely on lexical matching less applicable.
Supervised approaches for concept normalization have improved with the availability of annotated data and deep learning techniques. When the number of concepts to be predicted is small, classification-based approaches (Limsopatham and Collier, 2016; Lee et al., 2017; Tutubalina et al., 2018; Niu et al., 2019; Li et al., 2019; Miftahutdinov and Tutubalina, 2019) are often adopted, with the size of the classifier's output space equal to the number of concepts. Approaches differ in their neural architectures, such as character-level convolutional neural networks (CNNs) with multi-task learning (Niu et al., 2019) and pre-trained transformer networks (Li et al., 2019; Miftahutdinov and Tutubalina, 2019). However, classification approaches struggle when the annotated training data does not contain examples of all concepts, which is common when the ontology has many concepts, since the output space of the classifier will not include concepts absent from the training data.
To alleviate the problems of classification-based approaches, researchers have applied learning to rank to concept normalization, a two-step framework comprising a non-trained candidate generator and a supervised candidate ranker that takes both the mention and a candidate concept as input. Previous candidate rankers have used point-wise learning to rank (Li et al., 2017), pair-wise learning to rank (Leaman et al., 2013; Liu and Xu, 2017; Nguyen et al., 2018; Mondal et al., 2019), and list-wise learning to rank (Murty et al., 2018; Ji et al., 2020; Xu et al., 2020). These learning-to-rank approaches also have drawbacks. First, if the candidate generator fails to produce the gold concept, the candidate ranker will also fail. Second, training the candidate ranker requires negative sampling in advance, and it is unclear whether these pre-selected negative samples remain informative throughout the training process (Hermans et al., 2017; Sung et al., 2020).
Inspired by Schroff et al. (2015), we propose a triplet network with online hard triplet mining for concept normalization. Our framework sets up concept normalization as a one-step process, calculating similarity between vector representations of the mention and of all concepts in the ontology. Online hard triplet mining allows such a vector space model to generate triplets of (mention, true concept, false concept) within a mini-batch, leading to efficient training and fast convergence (Schroff et al., 2015). In contrast with previous vector space models where mentions and candidate concepts are mapped to vectors via TF-IDF (Leaman et al., 2013), TreeLSTMs (Liu and Xu, 2017), CNNs (Nguyen et al., 2018; Mondal et al., 2019), or ELMo (Schumacher et al., 2020), we generate vector representations with BERT (Devlin et al., 2019), since it can encode both surface and semantic information (Ma et al., 2019).
A few works are similar to our vector space model: CNN-triplet (Mondal et al., 2019), BIOSYN (Sung et al., 2020), RoBERTa-Node2Vec (Pattisapu et al., 2020), and TTI (Henry et al., 2020). CNN-triplet is a two-step approach, requiring a generator to produce candidates for training the triplet network, and requiring various embedding resources as input to its CNN-based encoder. BIOSYN, RoBERTa-Node2Vec, and TTI are one-step approaches. BIOSYN requires an iterative candidate retrieval over the entire training data during each training step, requires both BERT-based and TF-IDF-based representations, and performs a variety of pre-processing steps such as acronym expansion. Both RoBERTa-Node2Vec and TTI use a BERT-based encoder to encode mention texts into a vector space, but they differ in how they generate vector representations for medical concepts. Specifically, RoBERTa-Node2Vec uses the Node2Vec graph embedding approach to generate concept representations and fixes these representations during training, while TTI randomly initializes concept representations and keeps them learnable during training. Note that none of these works explore search strategies that allow domain-specific synonyms to be added without retraining the model, as we do.

Proposed methods
We define a concept mention m as a text string in a corpus D, and a concept c as a unique identifier in an ontology O. The goal of concept normalization is to find a mapping function f that maps each textual mention to its correct concept, i.e., c = f(m). We define a concept text t as a text string denoting the concept c, with t ∈ T(c), where T(c) is the set of all concept texts denoting concept c. A concept text may come from an ontology, t ∈ O(c), where O(c) is the set of synonyms of concept c in the ontology O, or from an annotated corpus, t ∈ D(c), where D(c) is the set of mentions of concept c in the annotated corpus D. T(c) allows the generation of tuples (t, c) such as (MI, C0027051) and (Myocardial Infarction, C0027051). Note that, for a given concept, it is common for there to be more synonyms in the ontology than in the annotated corpus, for the ontology and the annotated corpus to provide different synonyms, and for the annotated corpus to cover only a small subset of the concepts in the ontology.

We implement f as a vector space model:

f(m) = argmax_{c ∈ O} Sim(V(m), V(c))

where V(x) is a vector representation of text x and Sim is a similarity measure such as cosine similarity, inner product, or Euclidean distance. We learn the vector representations V using a triplet network architecture (Hoffer and Ailon, 2015), which learns from triplets (anchor text t_i, positive text t_p, negative text t_n), where t_i and t_p are texts for the same concept and t_n is a text for a different concept. The triplet network attempts to learn V such that, for all training triplets:

Sim(V(t_i), V(t_p)) > Sim(V(t_i), V(t_n))

The triplet network architecture has been adopted for learning representations of images (Schroff et al., 2015; Gordo et al., 2016) and text (Neculoiu et al., 2016; Reimers and Gurevych, 2019). It consists of three instances of the same sub-network (with shared parameters). When fed a (t_i, t_p, t_n) triplet of texts, the sub-network outputs a vector representation for each text, and these are then fed into a triplet loss. We adopt PubMed-BERT (Gu et al., 2020) as the sub-network, where the representation for a concept text is an average pooling of the representations of all its sub-word tokens. This architecture is shown in Figure 1.

Figure 1: Example of loss calculation for a single instance of triplet-based training. The same BERT model is used for encoding t_i, t_p, and t_n.

The inputs to our model are only the mentions or synonyms. We leave the resolution of ambiguous mentions, which will require exploring contextual information, for future work.
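To make the matching step concrete, the following is a minimal sketch of the vector-space lookup. The `encode` function here is a toy hashed-character-trigram stand-in for the triplet-trained PubMed-BERT encoder (an illustrative assumption, not the paper's model), and `normalize_mention` implements f(m) as an argmax of cosine similarity over (concept text, concept) tuples.

```python
import numpy as np

def encode(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for the triplet-trained PubMed-BERT encoder:
    a unit-normalized bag of hashed character trigrams."""
    vec = np.zeros(dim)
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        vec[hash(padded[i:i + 3]) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def normalize_mention(mention, concept_texts):
    """f(m): return the concept whose text is most cosine-similar to m."""
    m_vec = encode(mention)
    best_concept, best_sim = None, -1.0
    for text, concept in concept_texts:
        sim = float(m_vec @ encode(text))  # cosine: both vectors are unit length
        if sim > best_sim:
            best_concept, best_sim = concept, sim
    return best_concept, best_sim

# Tuples (t, c) as in the text, e.g., (MI, C0027051).
T = [("myocardial infarction", "C0027051"),
     ("heart attack", "C0027051"),
     ("hypertension", "C0020538")]
concept, sim = normalize_mention("myocardial infarct", T)
```

With a real trained encoder, the text vectors for the whole ontology would be computed once after training, so inference reduces to a single nearest-neighbor search.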

Online hard triplet mining
An essential part of learning with a triplet loss is how to generate the triplets. As the number of synonyms grows, the number of possible triplets grows cubically, making exhaustive training impractical. We follow the idea of online triplet mining (Schroff et al., 2015), which considers only triplets within a mini-batch. We first feed a mini-batch of b concept texts to the PubMed-BERT encoder to generate a d-dimensional representation for each concept text, resulting in a matrix M ∈ R^(b×d). We then compute the pairwise similarity matrix:

S = M M^T

where each entry S_ij is the similarity score between the i-th and j-th concept texts in the mini-batch. Since easy triplets would not contribute to the training and would result in slower convergence (Schroff et al., 2015), for each concept text t_i we select only a hard positive t_p and a hard negative t_n from within the mini-batch, such that:

p = argmin_{j : C(j) = C(i), j ≠ i} S_ij
n = argmax_{j : C(j) ≠ C(i)} S_ij

where C(x) is the ontology concept from which t_x was taken, i.e., if t_x ∈ T(c) then C(x) = c. We train the triplet network using the batch hard soft margin loss (Hermans et al., 2017):

L = (1/b) Σ_i log(1 + exp(S_in − S_ip))

where S, p, and n are as defined in the equations above.
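The mining and loss steps above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation (training actually uses PubMed-BERT embeddings via sentence-transformers); it assumes each anchor has at least one in-batch positive, as guaranteed by the batching described in the implementation details.

```python
import numpy as np

def batch_hard_soft_margin_loss(M, labels):
    """Batch-hard soft-margin triplet loss over one mini-batch.

    M: (b, d) array of unit-normalized concept-text embeddings.
    labels: length-b sequence of concept ids C(i).
    For each anchor i, pick the hardest positive (lowest-similarity text of
    the same concept) and hardest negative (highest-similarity text of a
    different concept), then average log(1 + exp(S_in - S_ip)).
    """
    S = M @ M.T                                  # pairwise similarity matrix
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    diag = np.eye(len(labels), dtype=bool)

    pos_mask = same & ~diag                      # same concept, not itself
    neg_mask = ~same                             # different concept
    S_ip = np.where(pos_mask, S, np.inf).min(axis=1)   # hardest positive
    S_in = np.where(neg_mask, S, -np.inf).max(axis=1)  # hardest negative

    # Skip anchors that lack a positive or a negative in this batch.
    valid = np.isfinite(S_ip) & np.isfinite(S_in)
    return float(np.mean(np.log1p(np.exp(S_in[valid] - S_ip[valid]))))
```

A batch whose concepts form tight, well-separated clusters yields a small loss, while a batch where texts of different concepts are as similar as texts of the same concept yields a large one.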

Similarity search
Once our vector space model has been trained, we consider several options for how to find the most similar concept c for a text mention m, varying both what we search over (the search target) and how each target is represented (the representation type). (For the encoder's pooling, we also experimented with using the output of the CLS token, and with max-pooling of the output representations of the sub-word tokens as proposed by Reimers and Gurevych (2019), but neither resulted in better performance than average pooling.)

(The search strategies below are named by their search target (Ontology, Training Data) and their representation type (Text, Concept).)

We consider the following search targets:

Data: We search over the concepts in the annotated data. These mentions will be more domain-specific (e.g., PT may refer to patient in clinical notes, but to physical therapy in scientific articles), but may be more predictive if the evaluation data is from the same domain. We search over the train subset of the data for dev evaluation, and the train + dev subsets for test evaluation.

Ontology: We search over the concepts in the ontology. The synonyms will be more domain-independent, and the ontology will cover concepts never seen in the annotated training data.

Data and ontology: We search over the concepts in both the training data and the ontology. For concepts in the annotated training data, their representations are averaged over mentions in the training data and synonyms in the ontology.
We consider the following representation types:

Text: We represent each text (ontology synonym or training data mention) as a vector by running it through our triplet-fine-tuned PubMed-BERT encoder. Concept normalization then compares the mention vector to each text vector:

f(m) = C(argmax_{t ∈ T} Sim(V(m), V(t)))

where C(t) is the concept associated with text t. When a retrieved text t is present in more than one concept (e.g., no appetite appears in concepts C0426579, C0003123, and C1971624), and thus we see the same Sim for multiple concepts, we pick a concept randomly to break ties.
Concept: We represent each concept as a vector by taking the average of the triplet-fine-tuned PubMed-BERT representations of that concept's texts (ontology synonyms and/or training data mentions). Concept normalization then compares the mention vector to each concept vector:

f(m) = argmax_{c ∈ O} Sim(V(m), (1/|T(c)|) Σ_{t ∈ T(c)} V(t))

The averaging here means that different concepts with some (but not all) overlapping synonyms (e.g., C0426579, C0003123, and C1971624 in UMLS all have the synonym no appetite) will end up with different vector representations.
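The two representation types can be sketched as follows, assuming mention and text vectors are already unit-normalized NumPy arrays (the function names are illustrative, not from the paper's code):

```python
import numpy as np

def search_by_text(m_vec, text_vecs, text_concepts):
    """Text representation (-T): compare the mention to every text vector
    and return the concept of the single most similar text."""
    sims = text_vecs @ m_vec
    return text_concepts[int(np.argmax(sims))]

def search_by_concept(m_vec, text_vecs, text_concepts):
    """Concept representation (-C): average each concept's text vectors
    into one concept vector, then compare the mention to those averages."""
    grouped = {}
    for vec, concept in zip(text_vecs, text_concepts):
        grouped.setdefault(concept, []).append(vec)
    best_concept, best_sim = None, -np.inf
    for concept, vecs in grouped.items():
        sim = float(m_vec @ np.mean(vecs, axis=0))
        if sim > best_sim:
            best_concept, best_sim = concept, sim
    return best_concept
```

The two can disagree: a concept with one text very close to the mention wins under -T, while a concept whose texts are all moderately close can win under -C, since averaging dilutes the influence of any single synonym.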

Sieve-based search
Traditional sieve-based approaches to concept normalization (D'Souza and Ng, 2015; Jonnagaddala et al., 2016; Luo et al., 2019; Henry et al., 2020) achieved competitive performance by ordering a sequence of searches over dictionaries from most precise to least precise. Inspired by this work, we consider a sieve-based similarity search that: 1) searches over the annotated training data, then 2) searches over the ontology (possibly combined with the annotated training data).

Datasets

We evaluate on five datasets, including the disorder normalization data from the 2013 ShARe/CLEF shared task (Suominen et al., 2013). The statistics of each dataset are described in Table 2. For MCN, we take 40 clinical notes from the released data as training, consisting of 5,334 mentions, and use the standard evaluation data with 6,925 mentions as our test set. Around 2.7% of mentions in MCN are assigned the CUI-less label.

Implementation details
Unless specifically noted otherwise, we use the same training procedure and hyper-parameter settings across all experiments and all datasets. As triplet mining requires at least one positive text in a batch for each anchor text, we randomly sample one positive text for each anchor text and group them into batches. Like previous work (Schroff et al., 2015; Hermans et al., 2017), we use Euclidean distance as the similarity score during training, while at inference time we compute cosine similarity, as it is simpler to interpret. For the sieve-based search, if the cosine similarity between the mention and the prediction of the first sieve is above 0.95, we use the prediction of the first sieve; otherwise, we use the prediction of the second sieve. When training the triplet network on the combination of the ontology and annotated corpus, we take all the synonyms from the ontology and repeat the concept texts in the annotated corpus such that |D| / |O| = 1/3, since in preliminary experiments we found that large ontologies overwhelmed small annotated corpora. We experimented with three ratios (1/3, 2/3, and 1) between annotated concept texts and ontology synonyms on the NCBI and BC5CDR-D datasets, and found that the ratio of 1/3 achieves the best performance for Train:OD models; we then kept the same ratio for all datasets. We did not thoroughly explore other ratios and leave that to future work.
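The two-sieve search with the 0.95 threshold described above can be sketched as follows. This is a sketch under stated assumptions: vectors are unit-normalized, and `best_match` is a hypothetical helper standing in for any of the similarity searches of the previous section.

```python
import numpy as np

def best_match(m_vec, index):
    """Hypothetical helper: index is a list of (unit vector, concept id)
    pairs; returns (concept, cosine similarity) of the closest entry."""
    sim, concept = max((float(m_vec @ vec), c) for vec, c in index)
    return concept, sim

def sieve_search(m_vec, data_index, ontology_index, threshold=0.95):
    """Two-sieve search: trust the domain-specific (annotated data) sieve
    only when its cosine similarity is above the threshold; otherwise fall
    back to the ontology sieve."""
    concept, sim = best_match(m_vec, data_index)
    if sim > threshold:
        return concept
    concept, _ = best_match(m_vec, ontology_index)
    return concept
```

The high threshold makes the first sieve behave like the precise dictionaries in traditional sieve pipelines: it fires only on near-exact matches to domain-specific mentions, and everything else falls through to the broader ontology search.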
For all experiments, we use PubMed-BERT (Gu et al., 2020) as the starting point; it pre-trains a BERT-style model from scratch on PubMed abstracts and full texts. In preliminary experiments, we also tried BioBERT (Lee et al., 2020) as the text encoder, but it resulted in worse performance across the five datasets. We use the PyTorch implementation of sentence-transformers to train the triplet network for concept normalization, with the following hyper-parameters: sequence_length = 8, batch_size = 1500, epoch_size = 100, optimizer = Adam, learning_rate = 3e-5, warmup_steps = 0.

Evaluation metrics
The standard evaluation metric for concept normalization is accuracy, because only the top-ranked predicted concept is of primary interest. For composite mentions like breast and ovarian cancer that are mapped to more than one concept in the NCBI, BC5CDR-D, and BC5CDR-C datasets, we adopt the evaluation strategy that a composite mention is correct only if the prediction for each of its component mentions is correct (Sung et al., 2020).
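A small sketch of this accuracy computation, where each mention's gold annotation is a tuple of concept ids (length greater than 1 for composite mentions, in component order); the function names and the example ids in the test are illustrative assumptions:

```python
def composite_correct(pred_ids, gold_ids):
    """A composite mention (e.g., breast and ovarian cancer mapping to two
    concepts) counts as correct only if every component's predicted concept
    matches its gold concept."""
    return len(pred_ids) == len(gold_ids) and all(
        p == g for p, g in zip(pred_ids, gold_ids))

def accuracy(predictions, golds):
    """Concept-normalization accuracy over (possibly composite) mentions;
    each element of predictions/golds is a tuple of concept ids."""
    return sum(
        composite_correct(p, g) for p, g in zip(predictions, golds)
    ) / len(golds)
```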

Model selection
We use the development data to choose whether to train the triplet network on just the ontology or also on the training data, and to choose among the similarity search strategies described in section 3.2. Table 4 shows the performance of all such systems across the five corpora. The top half of the table covers settings where the triplet network only needs to be trained once, on the ontology, and the bottom half covers settings where the triplet network is retrained for each new dataset. For each half of the table, the last column gives the average rank of each setting's performance across the five corpora. For example, when training the triplet network only on the ontology, the search strategy D-C (search the training data using concept vectors) is almost always the worst performing, ranking 14th of 14 in four corpora and 12th of 14 in one corpus, for an average rank of 13.6.

Table 4: Dev performance of the triplet network trained on the ontology and on ontology + data with different similarity search strategies. The last column (Avg. Rank) shows the average rank of each similarity search strategy across the datasets. Models with the best average rank are highlighted in grey; models with the best accuracy are bolded.

Table 4 shows that the best models search over both the ontology and the training data. Models that only search over the training data (D-T and D-C) perform worst, with average ranks of 12.6 or higher regardless of what the triplet network is trained on, most likely because the training data covers only a fraction of the concepts in the test data. Models that only search over the ontology (O-T and O-C) are only slightly better, with average ranks between 9.6 and 12, though the models in the first two rows of the table at least have the advantage that they require no annotated training data (they train on and search over only the ontology). However, the performance of such models can be improved by adding domain-specific synonyms to the ontology, i.e., OD-T vs. O-T (rows 5 vs. 1) and OD-C vs. O-C (rows 6 vs. 2), or by adding domain-specific synonyms and then searching in a sieve-based manner (rows 7-14). Table 4 also shows that searching based on text vectors (ontology synonyms or training data mentions) typically outperforms searching based on concept vectors (averages of text vectors). Each pair of rows in the table shows such a comparison, and only in rows 15-16 and 19-20 do the -C models rank better than the -T models.
Table 4 also shows that search strategies with concept-vector (-C) components generally have worse ranks than their text-only (-T) counterparts. For instance, going from Train:O-Search:O-C to Train:O-Search:O-T improves the average rank from 12 to 10.2, and going from Train:OD-Search:D-T+OD-C to Train:OD-Search:D-T+OD-T improves the average rank from 5.2 to 2.4. There are a few exceptions to this on the MCN dataset. We analyzed the differences between the predictions of Train:OD-Search:D-T+OD-T (row 25) and Train:OD-Search:D-T+OD-C (row 26) on this dataset, and found that concept vectors sometimes help to resolve ambiguous mentions by averaging over concept texts. For instance, for the mention somnolent, the OD-T model retrieves both C0013144 and C2830004, as each has the overlapping synonym somnolent, while the OD-C model ranks C2830004 higher, because the other concept's averaged vector also reflects additional synonyms such as Drowsy and Sleepiness.
Finally, Table 4 shows that sieve-based searches outperform their single-search counterparts (e.g., rows 21 vs. 15, 17, and 19).

From this analysis on the dev set, we select the following models to evaluate on the test set:

Train:O + Search:O-T: The best approach that requires only the ontology; no annotated training data is used.

Train:O + Search:D-T+OD-T: The best approach that only needs to be trained once (on the ontology), as the training data is used only to add extra concept texts at search time. This resembles a real-world scenario where a user manually adds some extra domain-specific synonyms for the concepts they care about.

Train:OD + Search:D-T+OD-T: The best approach that can be created from any combination of ontology and training data. The triplet network must be retrained for each new domain.

Train:OD + Search:tuned: The bolded models in the second half of Table 4. This requires not only retraining the triplet network for each new domain, but also trying out all search strategies on the new domain and selecting the best one.

Table 5 shows the results of our selected models on the test set, alongside the best models in the literature, including CNN-based ranking (Li et al., 2017), BERT-based ranking (Ji et al., 2020; Xu et al., 2020), and BIOSYN (Sung et al., 2020). Our Train:OD + Search:tuned model achieves a new state-of-the-art on BC5CDR-C (p = 0.0291), equivalent performance on NCBI (p = 0.6753) and BC5CDR-D (p = 0.1204), <1 point worse on ShARe (p = 0.0375), and <2 points worse on MCN (p = 0). (Significance was assessed with a one-sample bootstrap resampling test; the one sample is 10,000 bootstrap runs of our system's results.) Note that the performance of TTI comes from an ensemble of multiple system runs.
Yet this model is simpler than most prior work: it requires no two-step generate-and-rank framework (Li et al., 2017;Ji et al., 2020;Xu et al., 2020), no iterative candidate retrieval over the entire training data (Sung et al., 2020), no hand-crafted rules or features (D'Souza and Ng, 2015;Luo et al., 2019), and no acronym expansion or TF-IDF transformations (D'Souza and Ng, 2015;Ji et al., 2020;Sung et al., 2020).

Results
The PubMed-BERT rows in Table 5 demonstrate that the triplet training is a critical part of the success: if we use PubMed-BERT without triplet training, performance is 2 to 8 points worse than our best models, depending on the dataset. Yet, we can see that our proposed search strategies are also important, as on the BC5CDR datasets, PubMed-BERT can get within 3 points of the state-of-the-art using the D-T+OD-T search strategy (though it is much further away on the other datasets).
Perhaps most interestingly, our triplet network trained only on the ontology and no annotated training data, Train:O+Search:D-T+OD-T, achieves within 3 points of state-of-the-art on all datasets. We believe this represents a more realistic scenario: unlike prior work, our triplet network does not need to be retrained for each new dataset or domain whose concepts come from the same ontology. Instead, the model can be adapted to a new dataset or domain by simply pointing out any extra domain-specific synonyms for concepts, and the search can integrate these directly. Domain-specific synonyms do seem to be necessary for all datasets; without them (i.e., Train:O+Search:O-T), performance is about 10 points below state-of-the-art.

Table 6: Top similar texts, their concepts, and similarity scores for the mention primary HPT (D049950), predicted by the models PubMed-BERT + Search:OD-T, Train:O + Search:OD-T, and Train:OD + Search:OD-T.

As a small qualitative analysis of the models, Table 6 shows an example of similarity search results, where the systems have been asked to normalize the mention primary HPT. PubMed-BERT fails, producing unrelated acronyms, while both triplet network models find the concept and rank it with the highest similarity score.

Limitations and future research
Our ability to normalize polysemous concept mentions is limited by their context-independent representations. Although our PubMed-BERT encoder is a pre-trained contextual model, we feed in only the mention text, not any context, when producing a representation vector. This is not ideal for mentions with multiple meanings, e.g., potassium in clinical notes may refer to the substance (C0032821) or the measurement (C0202194), and only the context will reveal which one. A better strategy to generate the contextualized representation for the concept mention, e.g., Schumacher et al. (2020), may yield improvements for such mentions.
We currently train a separate triplet network for each ontology (one for MEDIC, one for CTD, one for SNOMED-CT, etc.) but in the future we would like to train on a comprehensive ontology like the UMLS Metathesaurus (Bodenreider, 2004), which includes nearly 200 different vocabularies (SNOMED-CT, MedDRA, RxNorm, etc.), and more than 3.5 million concepts. We expect such a general vector space model would be more broadly useful to the biomedical NLP community.
We explored one type of triplet training network, but in the future we would like to explore other variants, such as semi-hard triplet mining (Schroff et al., 2015) for generating samples, cosine similarity for measuring the similarity during training and inference, and multi-similarity loss  for calculating the loss.

Conclusions
We presented a vector-space framework for concept normalization, based on pre-trained transformers, a triplet objective with online hard triplet mining, and a new approach to vector similarity search. Across five datasets, our models that require only an ontology to train are competitive with state-of-the-art models that require domain-specific training.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Dongfang Xu, Zeyu Zhang, and Steven Bethard. 2020. A generate-and-rank framework with semantic type regularization for biomedical concept normalization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8452-8464, Online. Association for Computational Linguistics.