A Generate-and-Rank Framework with Semantic Type Regularization for Biomedical Concept Normalization

Concept normalization, the task of linking textual mentions of concepts to concepts in an ontology, is challenging because ontologies are large. In most cases, annotated datasets cover only a small sample of the concepts, yet concept normalizers are expected to predict all concepts in the ontology. In this paper, we propose an architecture consisting of a candidate generator and a list-wise ranker based on BERT. The ranker considers pairings of concept mentions and candidate concepts, allowing it to make predictions for any concept, not just those seen during training. We further enhance this list-wise approach with a semantic type regularizer that allows the model to incorporate semantic type information from the ontology during training. Our proposed concept normalization framework achieves state-of-the-art performance on multiple datasets.


Introduction
Mining and analyzing the constantly growing unstructured text in the biomedical domain offers great opportunities to advance scientific discovery (Gonzalez et al., 2015; Fleuren and Alkema, 2015) and improve clinical care (Rumshisky et al., 2016). However, lexical and grammatical variations are pervasive in such text, posing key challenges for data interoperability and the development of natural language processing (NLP) techniques. For instance, heart attack, MI, myocardial infarction, and cardiovascular stroke all refer to the same concept. It is critical to disambiguate these terms by linking them to their corresponding concepts in an ontology or knowledge base. Such linking allows downstream tasks (relation extraction, information retrieval, text classification, etc.) to access the ontology's rich knowledge about biomedical entities, their synonyms, semantic types, and mutual relationships.

Concept normalization is the task of mapping concept mentions, the in-text natural-language mentions of ontological concepts, to concept entries in a standardized ontology or knowledge base. Techniques for concept normalization have been advancing, thanks in part to recent shared tasks including clinical disorder normalization in the 2013 ShARe/CLEF (Suominen et al., 2013) and 2014 SemEval Task 7 Analysis of Clinical Text (Pradhan et al., 2014), and adverse drug event normalization in Social Media Mining for Health (SMM4H) (Sarker et al., 2018; Weissenbacher et al., 2019). Most existing systems use string-matching or dictionary look-up approaches (Leal et al., 2015; D'Souza and Ng, 2015; Lee et al., 2016), which are limited to matching morphologically similar terms, or supervised multi-class classifiers (Belousov et al., 2017; Tutubalina et al., 2018; Niu et al., 2019; Luo et al., 2019a), which may not generalize well when the ontology contains many concepts and not all concepts that must be predicted appear in the training data.
We propose an architecture (shown in Figure 1) that is able to consider both morphological and semantic information. We first apply a candidate generator to produce a list of candidate concepts, and then use a BERT-based list-wise classifier to rank them. This two-step architecture allows unlikely candidates to be filtered out before the final classification, a necessary step when dealing with ontologies containing millions of concepts. In contrast to previous list-wise classifiers (Murty et al., 2018), which take only the concept mention as input, our BERT-based list-wise classifier takes both the concept mention and the candidate concept name as input, and is thus able to handle concepts that never appear in the training data. We further enhance this list-wise approach with a semantic type regularizer that allows our ranker to leverage semantic type information from the ontology during training.
Our work makes the following contributions:

• We propose a concept normalization framework consisting of a candidate generator and a list-wise classifier. Our framework is easier to train, and the list-wise classifier is able to predict concepts never seen during training.

• We introduce a semantic type regularizer that encourages the model to consider the semantic type information of the candidate concepts. This regularizer improves performance over the BERT-based list-wise classifier on multiple datasets.

• Our proposed concept normalization framework achieves state-of-the-art performance on multiple datasets.
The code for our proposed generate-and-rank framework is available at https://github.com/dongfang91/Generate-and-Rank-ConNorm.

Related work
Traditional approaches to concept normalization involve string matching and dictionary look-up. These approaches differ in how they construct dictionaries, such as collecting concept mentions from the labeled data as extra synonyms (Leal et al., 2015; Lee et al., 2016), and in their string matching techniques, such as string overlap and edit distance (Kate, 2016). Two of the most commonly used knowledge-intensive concept normalization tools, MetaMap (Aronson, 2001) and cTAKES (Savova et al., 2010), both employ rules to first generate lexical variants for each noun phrase and then conduct dictionary look-up for each variant. Several systems (D'Souza and Ng, 2015; Jonnagaddala et al., 2016) have demonstrated that rule-based concept normalization can achieve performance competitive with other approaches through a sieve-based design that carefully selects combinations and orders of dictionaries, exact and partial matching, and heuristic rules. However, such rule-based approaches struggle when there is great variation between concept mention and concept, which is common, for example, when comparing social media text to medical ontologies.

Due to the availability of shared tasks and annotated data, the field has shifted toward machine learning techniques. We divide the machine learning approaches into two categories: classification (Savova et al., 2008; Stevenson et al., 2009; Limsopatham and Collier, 2016; Yepes, 2017; Festag and Spreckelsen, 2017; Lee et al., 2017; Tutubalina et al., 2018; Niu et al., 2019) and learning to rank (Leaman et al., 2013; Liu and Xu, 2017; Li et al., 2017; Nguyen et al., 2018; Murty et al., 2018).
Most classification-based approaches using deep neural networks have shown strong performance. They differ in their architectures, such as Gated Recurrent Units (GRU) with attention mechanisms (Tutubalina et al., 2018), multi-task learning with auxiliary tasks to generate attention weights (Niu et al., 2019), or pre-trained transformer networks (Li et al., 2019; Miftahutdinov and Tutubalina, 2019); in the sources used to train word embeddings, such as Google News (Limsopatham and Collier, 2016) or concept definitions from the Unified Medical Language System (UMLS) Metathesaurus (Festag and Spreckelsen, 2017); and in their input representations, such as character embeddings (Niu et al., 2019). All classification approaches share the disadvantage that the output space must be the same size as the number of concepts to be predicted, so the output space tends to be small, such as the 2,200 concepts in Limsopatham and Collier (2016) or the roughly 22,500 concepts in Weissenbacher et al. (2019). Classification approaches also struggle with concepts that have only a few example mentions in the training data.
Researchers have applied point-wise learning to rank (Liu and Xu, 2017; Li et al., 2017), pair-wise learning to rank (Leaman et al., 2013; Nguyen et al., 2018), and list-wise learning to rank (Murty et al., 2018; Ji et al., 2019) to concept normalization. Generally, the learning-to-rank approach has the advantage of reducing the output space by first obtaining a smaller list of possible candidate concepts from a candidate generator and then ranking them. DNorm (Leaman et al., 2013), based on a pair-wise learning-to-rank model in which both mentions and concept names were represented as TF-IDF vectors, was the first to use learning to rank for concept normalization and achieved the best performance in the ShARe/CLEF eHealth 2013 shared task. List-wise learning-to-rank approaches are computationally more efficient than pair-wise approaches (Cao et al., 2007) and empirically outperform both point-wise and pair-wise approaches (Xia et al., 2008). There are two neural-network implementations of list-wise classifiers for concept normalization: Murty et al. (2018) treat the selection of the best candidate concept as a flat classification problem, losing the ability to handle concepts not seen during training; Ji et al. (2019) take a generate-and-rank approach similar to ours, but they do not leverage resources such as synonyms or semantic type information from UMLS in their BERT-based ranker.

Concept normalization framework
We define a concept mention $m$ as an abbreviation such as "MI", a noun phrase such as "heart attack", or even a short text such as "an obstruction of the blood supply to the heart". The goal is then to assign $m$ a concept $c$. Formally, given a list of pre-identified concept mentions $M = \{m_1, m_2, \ldots, m_n\}$ in the text and an ontology or knowledge base with a set of concepts $C = \{c_1, c_2, \ldots, c_t\}$, the goal of concept normalization is to find a mapping function $c_j = f(m_i)$ that maps each textual mention to its correct concept.
We approach concept normalization in two steps: we first use a candidate generator $G(m, C) \rightarrow C_m$ to generate a list of candidate concepts $C_m$ for each mention $m$, where $C_m \subseteq C$ and $|C_m| \ll |C|$. We then use a candidate ranker $R(m, C_m) \rightarrow \hat{C}_m$, where $\hat{C}_m$ is a re-ranked list of candidate concepts sorted by their relevance, preference, or importance. But unlike information retrieval tasks, where the order of the entire sorted list $\hat{C}_m$ is crucial, in concept normalization we care only that the one true concept is at the top of the list.
The main idea of the two-step approach is that we first use a simple and fast system with high recall to generate candidates, and then a more precise system with more discriminative input to rank the candidates.
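To make the control flow concrete, here is a minimal sketch of the two-step pipeline in Python; `generate_candidates` and `rank_candidates` are hypothetical placeholders for the components described in the next sections.

```python
from typing import Callable, List

def normalize(mention: str,
              generate_candidates: Callable[[str], List[str]],
              rank_candidates: Callable[[str, List[str]], List[str]]) -> str:
    """Two-step concept normalization: a fast, high-recall generator
    followed by a slower, more precise ranker."""
    candidates = generate_candidates(mention)      # C_m, with |C_m| << |C|
    ranked = rank_candidates(mention, candidates)  # re-ranked candidate list
    return ranked[0]  # only the top-ranked concept matters
```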

Candidate generator
We implement two kinds of candidate generators: a BERT-based multi-class classifier when the number of concepts in the ontology is small, and a Lucene-based dictionary look-up when there are hundreds of thousands of concepts in the ontology.

BERT-based multi-class classifier
BERT (Devlin et al., 2019) is a contextualized word representation model that has shown strong performance on many NLP tasks. Here, we use BERT in a multi-class text-classification configuration as our candidate concept generator. We use the final hidden vector $V_m \in \mathbb{R}^H$ corresponding to the first input token ([CLS]) generated from BERT$(m)$ and a classification layer with weights $W \in \mathbb{R}^{|C| \times H}$, and train the model using a standard classification loss:

$$\mathcal{L} = -y \cdot \log(\mathrm{softmax}(V_m W^\top))$$

where $y$ is a one-hot vector and $|y| = |C|$. The scores for all concepts are calculated as:

$$p(C) = \mathrm{softmax}(V_m W^\top)$$

We select the top $k$ most probable concepts in $p(C)$ and feed that list $C_m$ to the ranker.
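The following is a minimal sketch of such a generator using the Hugging Face transformers library; it uses `bert-base-uncased` as a stand-in for BioBERT, and the class and variable names are ours, not the paper's. Training would use the cross-entropy loss above over the logits.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

class BertCandidateGenerator(nn.Module):
    """Multi-class classifier over the full concept inventory C.
    Scores come from a linear layer applied to the [CLS] vector."""

    def __init__(self, num_concepts: int, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_concepts)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vec = outputs.last_hidden_state[:, 0]  # V_m: [CLS] hidden state
        return self.classifier(cls_vec)            # logits over all of C

# Generate the top-k candidate list C_m for one mention.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertCandidateGenerator(num_concepts=22500)
enc = tokenizer("heart attack", return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(enc["input_ids"], enc["attention_mask"]), dim=-1)
topk = torch.topk(probs, k=10, dim=-1)  # indices of the k most probable concepts
```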

Lucene-based dictionary look-up system
Multi-pass sieve-based systems (D'Souza and Ng, 2015; Jonnagaddala et al., 2016; Luo et al., 2019b) achieve competitive performance when used with the right combinations and orders of dictionaries, exact and partial matching, and heuristic rules. Such systems, which rely on basic lexical matching algorithms, are simple and fast to implement, but they can only generate candidate concepts that are morphologically similar to a given mention. Inspired by the work of Luo et al. (2019b), we implement a Lucene-based sieve normalization system consisting of the following components (see Appendix A.1 for details):

a. A Lucene index over the training data finds all mentions that exactly match m.

b. A Lucene index over the ontology finds concepts whose preferred name exactly matches m.

c. A Lucene index over the ontology finds concepts where at least one synonym of the concept exactly matches m.

d. A Lucene index over the ontology finds concepts where at least one synonym of the concept has high character overlap with m.

The ranked list C_m generated by this system is fed as input to the candidate ranker. A sketch of the sieve's fall-through logic follows.
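The sketch below illustrates the fall-through behavior only; the toy dictionaries stand in for Lucene indexes (the actual system issues Lucene queries), and the use of difflib's SequenceMatcher for step (d) is a simple stand-in for the overlap rules detailed in Appendix A.1.

```python
import difflib
from typing import Dict, List

# Toy in-memory stand-ins for the Lucene indexes (hypothetical data).
TRAINING_MENTIONS: Dict[str, List[str]] = {"mi": ["C0027051"]}
PREFERRED_NAMES: Dict[str, List[str]] = {"myocardial infarction": ["C0027051"]}
SYNONYMS: Dict[str, List[str]] = {"heart attack": ["C0027051"]}

def sieve_candidates(mention: str) -> List[str]:
    """Fall-through sieve: each step runs only if the previous ones
    produced no candidates."""
    m = mention.lower()
    # (a) exact match against mentions seen in the training data
    if m in TRAINING_MENTIONS:
        return TRAINING_MENTIONS[m]
    # (b) exact match against concept preferred names
    if m in PREFERRED_NAMES:
        return PREFERRED_NAMES[m]
    # (c) exact match against any concept synonym
    if m in SYNONYMS:
        return SYNONYMS[m]
    # (d) high character overlap with any synonym (SequenceMatcher is a
    # simple stand-in for the overlap rules in Appendix A.1)
    scored = [(difflib.SequenceMatcher(None, m, syn).ratio(), cui)
              for syn, cuis in SYNONYMS.items() for cui in cuis]
    return [cui for score, cui in sorted(scored, reverse=True) if score > 0.5]

print(sieve_candidates("heart attack"))  # -> ['C0027051']
```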

Candidate ranker
After the candidate generator produces a list of concepts, we use a BERT-based list-wise classifier to select the most likely candidate. BERT allows us to match morphologically dissimilar (but semantically similar) mentions and concepts, and the list-wise classifier takes both the mention and the candidate concepts as input, allowing us to handle concepts that appear infrequently (or never) in the training data. Here, we use BERT in a configuration similar to question answering: given a concept mention $m$, the task is to choose the most likely candidate concept $c_m$ from all candidate concepts $C_m$. As shown in Figure 1, our classifier input includes the text of the mention $m$ and all synonyms of the candidate concept $c_m$, and takes the form

[CLS] $m$ [SEP] $\mathit{syn}_1(c_m)$ [SEP] $\ldots$ [SEP] $\mathit{syn}_q(c_m)$ [SEP]

where $\mathit{syn}_i(c_m)$ is the $i$-th synonym of $c_m$. We calculate the final hidden vector $V_{(m,c_m)} \in \mathbb{R}^H$ corresponding to the first input token ([CLS]) generated from BERT for each such input, and then concatenate the hidden vectors of all candidate concepts to form a matrix $V_{(m,C_m)} \in \mathbb{R}^{|C_m| \times H}$. We use this matrix and classification layer weights $W \in \mathbb{R}^H$, and compute a standard classification loss:

$$\mathcal{L} = -y \cdot \log(\mathrm{softmax}(V_{(m,C_m)} W))$$

where $y$ is a one-hot vector and $|y| = |C_m|$.
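A minimal sketch of this ranker follows, again with `bert-base-uncased` standing in for BioBERT; the class name, the exact way synonyms are joined into the second segment, and the inference-time API are our assumptions, not the paper's code.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

class BertListwiseRanker(nn.Module):
    """Scores each (mention, candidate) pair with a single weight
    vector W in R^H, then normalizes over the candidate list."""

    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.w = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, mention, candidate_synonyms, tokenizer):
        # One input per candidate: "[CLS] mention [SEP] syn1 [SEP] syn2 ..."
        texts = [" [SEP] ".join(syns) for syns in candidate_synonyms]
        enc = tokenizer([mention] * len(texts), texts,
                        padding=True, truncation=True, return_tensors="pt")
        cls = self.bert(**enc).last_hidden_state[:, 0]  # V_(m,c) per candidate
        return self.w(cls).squeeze(-1)  # scores; softmax + CE at training time

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
ranker = BertListwiseRanker()
scores = ranker("an abdominal wall hernia",
                [["Hernia of abdominal wall"], ["Abdominal wall mass"]],
                tokenizer)
best = scores.argmax().item()  # index of the top-ranked candidate
```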

Semantic type regularizer
To encourage the list-wise classifier toward a more informative ranking than just getting the correct concept to the top of the list, we propose a semantic type regularizer that is optimized when candidate concepts with the correct semantic type are ranked above candidate concepts with incorrect types. The semantic type of a candidate concept is considered correct only if it exactly matches the semantic type of the gold concept; if the concept has multiple semantic types, all must match. Our semantic type regularizer consists of two components:

$$R_p(\hat{y}, y) = \sum_{p \in P(y) \setminus \{t\}} \max(0,\, m_1 + \hat{y}_p - \hat{y}_t)$$

$$R_n(\hat{y}, y) = \sum_{p \in P(y)} \max(0,\, m_2 + \max_{n \in N(y)} \hat{y}_n - \hat{y}_p)$$

where $N(y)$ is the set of indexes of candidate concepts with incorrect semantic types (negative candidates), $P(y)$ (positive candidates) is the complement of $N(y)$, and $\hat{y}_t$ is the score of the gold candidate concept, so $t \in P(y)$.
The margins $m_1$ and $m_2$ are hyper-parameters controlling the minimal distances between $\hat{y}_t$ and $\hat{y}_p$ and between $\hat{y}_p$ and $\hat{y}_n$, respectively. Intuitively, $R_p$ tries to push the score of the gold concept above all positive candidates by at least $m_1$, and $R_n$ tries to push the best-scored negative candidate below all positive candidates by at least $m_2$. The final loss function we optimize for the BERT-based list-wise classifier is:

$$\mathcal{L}_{final} = \mathcal{L} + \lambda\, R_p(\hat{y}, y) + \mu\, R_n(\hat{y}, y)$$

where $\lambda$ and $\mu$ are hyper-parameters that control the tradeoff between the standard classification loss and the semantic type regularizer.
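A PyTorch sketch of the two hinge penalties is given below; the margin defaults are illustrative rather than the tuned values, and the function and argument names are ours.

```python
import torch

def semantic_type_regularizer(scores: torch.Tensor,
                              positive_mask: torch.Tensor,
                              gold_index: int,
                              m1: float = 0.1,
                              m2: float = 0.2):
    """Hinge-style penalties matching the equations above. `scores` is
    the (|C_m|,) vector of candidate scores, `positive_mask` is True for
    candidates whose semantic types all match the gold concept's."""
    idx = torch.arange(scores.numel())
    pos_idx = idx[positive_mask]
    pos = scores[pos_idx]
    neg = scores[~positive_mask]
    gold = scores[gold_index]

    # R_p: the gold concept should outscore every other positive by m1.
    pos_other = scores[pos_idx[pos_idx != gold_index]]
    r_p = torch.clamp(m1 + pos_other - gold, min=0).sum()

    # R_n: every positive should outscore the best negative by m2.
    if neg.numel() > 0:
        r_n = torch.clamp(m2 + neg.max() - pos, min=0).sum()
    else:
        r_n = scores.new_zeros(())
    return r_p, r_n

# Combined with the ranker's cross-entropy loss:
# loss = ce_loss + lam * r_p + mu * r_n
```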

Datasets

SMM4H-17 The SMM4H-17 dataset (http://dx.doi.org/10.17632/rxwfb3tysd.1) consists of 9,149 manually curated ADR expressions from tweets. The mentions are mapped to 22,500 concepts with 61 semantic types from MedDRA Preferred Terms (PTs). We use the 5,319 mentions from the released set as our training data, and keep the 2,500 mentions from the original test set for evaluation.

MCN The MCN dataset consists of 13,609 concept mentions drawn from 100 discharge summaries from the fourth i2b2/VA shared task (Uzuner et al., 2011). The mentions are mapped to 3,792 unique concepts out of 434,056 possible concepts with 125 semantic types in SNOMED-CT and RxNorm. We take 40 clinical notes from the released data as training, consisting of 5,334 mentions, and use the standard evaluation data with 6,925 mentions as our test set. Around 2.7% of mentions in MCN could not be mapped to any concept in the terminology and are assigned the CUI-less label.
A major difference among the datasets is the space of concepts that systems must consider. For AskAPatient and TwADR-L, all concepts in the test data also appear in the training data, and in both cases only a couple thousand concepts must be considered. SMM4H-17 and MCN define much larger concept spaces: SMM4H-17 considers 22,500 concepts (though only 513 appear in the data) and MCN considers 434,056 (though only 3,792 appear in the data). AskAPatient and TwADR-L have no unseen concepts in their test data, SMM4H-17 has a few (43), and MCN has a large number (2,256). Even a classifier that perfectly learned all concepts in the training data could achieve only 70.15% accuracy on MCN. MCN also has more unseen mentions: 53.9%, whereas the other datasets have less than 40%. The MCN dataset is thus harder to memorize, as systems must handle many mentions and concepts never seen in training.
Unlike the clinical MCN dataset, in the three social media datasets (AskAPatient, TwADR-L, and SMM4H-17) it is common for ADR expressions to share no words with their target medical concepts. For instance, the ADR expression "makes me like a zombie" is assigned the concept "C1443060" with preferred term "feeling abnormal". The social media datasets do not include context, only the mentions themselves, while the MCN dataset provides the entire note surrounding each mention. Since only 4.5% of mentions in the MCN dataset are ambiguous, we ignore this additional context information in the current experiments.

Unified Medical Language System
The UMLS Metathesaurus (Bodenreider, 2004) links similar names for the same concept from nearly 200 different vocabularies, including SNOMED-CT, MedDRA, and RxNorm. There are over 3.5 million concepts in UMLS, and for each concept, UMLS also provides its definition, preferred term, synonyms, semantic types, and relationships with other concepts.
In our experiments, we make use of synonym and semantic type information from UMLS. We restrict our concepts to three vocabularies, MedDRA, SNOMED-CT, and RxNorm, in UMLS version 2017AB. For each concept in the ontologies of the four datasets, we first find its concept unique identifier (CUI) in UMLS. We then extract synonyms and semantic type information for that CUI. Synonyms (English only) are collected from level 0 terminologies, which contain vocabulary sources for which no additional license agreements are necessary.
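For readers unfamiliar with the UMLS distribution format, here is one way this extraction might look in Python over the pipe-delimited MRCONSO.RRF and MRSTY.RRF files; the column positions follow the documented RRF layouts (CUI=0, LAT=1, STR=14, SRL=15 for MRCONSO; CUI=0, STY=3 for MRSTY) and should be verified against your UMLS release, and this is our sketch rather than the paper's code.

```python
from collections import defaultdict

synonyms = defaultdict(set)
semantic_types = defaultdict(set)

# MRCONSO.RRF: one concept name per line, pipe-delimited.
with open("MRCONSO.RRF", encoding="utf-8") as f:
    for line in f:
        row = line.rstrip("\n").split("|")
        cui, lat, string, srl = row[0], row[1], row[14], row[15]
        if lat == "ENG" and srl == "0":   # English, level-0 license sources
            synonyms[cui].add(string)

# MRSTY.RRF: semantic types per CUI.
with open("MRSTY.RRF", encoding="utf-8") as f:
    for line in f:
        row = line.rstrip("\n").split("|")
        semantic_types[row[0]].add(row[3])
```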

Evaluation metrics
For all four datasets, the standard evaluation metric for concept normalization systems is accuracy. For the AskAPatient and TwADR-L datasets, which use 10-fold cross-validation, accuracy is averaged over the 10 folds.

Implementation details
We use the BERT-based multi-class classifier as the candidate generator on the three social media datasets AskAPatient, TwADR-L, and SMM4H-17, and the Lucene-based candidate generator for the MCN dataset. In the social media datasets, the number of concepts in the data is small, few test concepts are unseen in the training data, and there is a greater need to match expressions that are morphologically dissimilar from medical concepts. In the clinical MCN dataset, the opposite is true in each case.
For all experiments, we use BioBERT-base, which further pre-trains BERT on PubMed abstracts and PubMed Central (PMC) full-text articles. We use Hugging Face's PyTorch implementation of BERT. We select the best hyper-parameters based on performance on the dev set. See Appendix A.2 for hyper-parameter settings.

Comparisons with related methods
We compare our proposed architecture with the following state-of-the-art systems. Among these is the sieve-based system of Luo et al. (2019b): given a mention as input, its exact-match module first looks for mentions in the training data that exactly match the input, and then looks for concepts from the ontology whose synonyms exactly match the input. If no concepts are found, the mention is fed into MetaMap. They run this sieve-based normalization model twice. In the first round, the model lower-cases the mentions and includes acronym/abbreviation tokens during dictionary look-up. In the second round, the model lower-cases the mention spans and also removes special tokens such as "'s" and quotation marks.
Since our focus is individual systems, not ensembles, we compare only to other non-ensemble systems. (For reference, an ensemble of three systems, including CharLSTM+WordLSTM and LR+MeanEmbedding, achieved 88.7% accuracy on the SMM4H-17 dataset (Sarker et al., 2018).)

Models
We separate out the contributions of the following components of our architecture.
BERT The BERT-based multi-class classifier. When used alone, we select the most probable concept as the prediction.

Lucene The Lucene-based dictionary look-up.
When used alone, we take the top-ranked candidate concept as the prediction.
+BERT-rank The BERT-based list-wise classifier, always used in combination with either BERT or Lucene as a candidate generator.

+ST-reg The semantic type regularizer, always used in combination with BERT-rank.
We also consider the setting (+gold) where we artificially inject the correct concept into the candidate generator's list if it is not already there.

Results
Table 2 shows that our complete model, BERT + BERT-rank + ST-reg, achieves a new state-of-the-art on two of the social media test sets, and Table 3 shows that Lucene + BERT-rank + ST-reg achieves a new state-of-the-art on the clinical MCN test set. The TwADR-L dataset is the most difficult, with our complete model achieving 47.02% accuracy. On the other datasets, performance of our complete model is much higher: 87.46% for AskAPatient and 88.24% for SMM4H-17. (Miftahutdinov and Tutubalina (2019) use the same architecture as our BERT-based multi-class classifier but report 89.28% accuracy on SMM4H-17; we were unable to replicate this result, as their code and parameter settings were unavailable.) On the TwADR-L, SMM4H-17, and MCN test sets, adding the BERT-based ranker improves performance over the candidate generator alone, and adding the semantic type regularizer improves it further. For example, Lucene alone achieves 79.25% accuracy on the MCN data; adding the BERT ranker increases this to 82.75%, and adding the semantic type regularizer increases it to 83.56%. On AskAPatient, the performance of the full model is similar to that of the BERT multi-class classifier alone, perhaps because BERT alone already improves the state-of-the-art from 85.71% to 87.52%.

The +gold setting lets us ask how well our ranker would perform if the candidate generator made no mistakes. First, if the correct concept is always in the candidate list, our list-wise ranker (+BERT-rank) outperforms the multi-class classifier (BERT) on all test sets. Second, the benefits of the semantic type regularizer are amplified in this setting, with the TwADR-L and MCN test sets showing more than a 1.00% gain in accuracy from the regularizer. These findings suggest that improving the quality of the candidate generator is a fruitful direction for future work.
Overall, we see the biggest performance gains from our proposed generate-and-rank architecture on the MCN dataset. This is the most realistic setting, where the number of candidate concepts is large and many test concepts were never seen during training. In such cases, a multi-class classifier cannot serve as the candidate generator, since it would never generate unseen concepts. Here our ranker shines in its ability to sort through the long list of possible concepts.
Qualitative analysis

Table 4 shows an example that is impossible for the multi-class classification approach to concept normalization. The concept mention "an abdominal wall hernia" in the clinical MCN dataset needs to be mapped to the concept with the preferred name "Hernia of abdominal wall", but that concept never appears in the training data. The Lucene-based candidate generator finds this concept, but ranks it only 4th in its list. The BERT ranker is able to compare "an abdominal wall hernia" to "Hernia of abdominal wall", recognize it as a better match than the other options, and re-assign it to rank 1.

Table 5 shows an example that illustrates why the semantic type regularizer helps. The mention "felt like I was coming down with flu" in the social media AskAPatient dataset needs to be mapped to the concept with the preferred name "influenza-like symptoms", which has the semantic type sign or symptom. The BERT ranker ranks two disease or syndrome concepts higher, placing the correct concept at rank 3. After the semantic type regularizer is added, the system recognizes that the mention should be mapped to a sign or symptom and correctly ranks it above the disease or syndrome concepts. Note that this happens even though the ranker does not see the semantic type of the input mention at prediction time.

Limitations and future research
The available concept normalization datasets are somewhat limited. Lee et al. (2017) note that AskAPatient and TwADR-L have issues including duplicate instances, which can bias systems; many phrases have multiple valid mappings to concepts, but the context necessary to disambiguate them is not part of the dataset; and the 10-fold cross-validation makes training complex models unnecessarily expensive. These datasets are also unrealistic in that all concepts in the test data are seen during training. Future research should focus on more realistic datasets that follow the approach of MCN in annotating mentions of concepts from a large ontology and including the full context.
Our ability to explore the size of the candidate list was limited by our available computational resources. As the size of the candidate list increases, the true concept is more likely to be included, but the number of training instances also increases, making the computational cost larger, especially for the datasets using 10-fold cross-validation. We chose candidate list sizes as large as we could afford, but there are likely further gains possible with larger candidate lists.
Our semantic type regularizer is limited to exact matching: it checks only whether the semantic type of a candidate exactly matches the semantic type of the true concept. The UMLS ontology includes many other relations, such as is-a and part-of relations, and extending our regularizer to encode such rich semantic knowledge may yield further improvements in the BERT-based ranker.

Conclusion
We propose a concept normalization framework consisting of a candidate generator and a list-wise classifier based on BERT.
Because the candidate ranker makes predictions over pairs of concept mentions and candidate concepts, it is able to predict concepts never seen during training. Our proposed semantic type regularizer allows the ranker to incorporate semantic type information into its predictions without requiring semantic types at prediction time. This generate-and-rank framework achieves state-of-the-art performance on multiple concept normalization datasets.

A Appendices
A.1 Lucene-based dictionary look-up system

The Lucene-based dictionary look-up system consists of the following components:

(a) A Lucene index over the training data finds all CUI-less mentions that exactly match mention m.

(b) A Lucene index over the training data finds the CUIs of all training mentions that exactly match mention m.

(c) A Lucene index over UMLS finds CUIs whose preferred name exactly matches mention m.

(d) A Lucene index over UMLS finds CUIs where at least one synonym of the CUI exactly matches mention m.

(e) A Lucene index over UMLS finds CUIs where at least one synonym of the CUI has high character overlap with mention m. To check character overlap, we run the following three rules sequentially: token-level matching, fuzzy string matching with a maximum edit distance of 2, and character 3-gram matching.
See Figure A1 for the flow of execution across the components. Whenever a component (a) through (e) generates multiple CUIs, they are fed, along with the concept mention, to the BERT-based re-ranker (f). During training, we used component (e) alone instead of the combination of components (b)-(e) to generate training instances for the BERT-based re-ranker, as it generated many more training examples and resulted in better performance on the dev set. During evaluation, we used the whole pipeline.

Figure A1: Architecture of the Lucene-based dictionary look-up system. The edges out of a search process indicate the number of matches necessary to follow the edge. Outlined nodes are terminal states that represent the predictions of the system.
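For illustration, here is one way the three character-overlap rules of component (e) might look in plain Python; the trigram threshold is an assumption, and the actual system implements these checks as Lucene queries rather than this sketch.

```python
def token_match(mention: str, synonym: str) -> bool:
    # Rule 1: token-level matching -- same bag of tokens, any order.
    return sorted(mention.lower().split()) == sorted(synonym.lower().split())

def edit_distance(a: str, b: str) -> int:
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def fuzzy_match(mention: str, synonym: str, max_dist: int = 2) -> bool:
    # Rule 2: fuzzy string matching with a maximum edit distance of 2.
    return edit_distance(mention.lower(), synonym.lower()) <= max_dist

def trigram_match(mention: str, synonym: str, threshold: float = 0.5) -> bool:
    # Rule 3: character 3-gram overlap (Jaccard; threshold is an assumption).
    grams = lambda s: {s[i:i + 3] for i in range(len(s) - 2)}
    a, b = grams(mention.lower()), grams(synonym.lower())
    return bool(a and b) and len(a & b) / len(a | b) >= threshold
```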

A.2 Hyper-parameters
To keep the size of the candidate list equal to k for every mention during training, we apply the following rules: if the list is already of length k but does not contain the gold concept, we replace one incorrect candidate with the gold concept; if the list is shorter than k, we inject the gold concept (if missing) and pad the list with the most frequent concepts from the training set until it reaches length k.
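A small sketch of these padding rules follows; the function is hypothetical and the choice of which incorrect candidate to replace (here, the last) is our assumption.

```python
from typing import List

def pad_candidate_list(candidates: List[str], gold: str,
                       frequent_concepts: List[str], k: int) -> List[str]:
    """Force a training candidate list to length k and ensure it
    contains the gold concept."""
    cands = list(candidates[:k])
    if gold not in cands:
        if len(cands) == k:
            cands[-1] = gold           # replace an incorrect candidate
        else:
            cands.append(gold)
    for c in frequent_concepts:        # pad with frequent training concepts
        if len(cands) >= k:
            break
        if c not in cands:
            cands.append(c)
    return cands
```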