MedNorm: A Corpus and Embeddings for Cross-terminology Medical Concept Normalisation

The medical concept normalisation task aims to map textual descriptions to standard terminologies such as SNOMED-CT or MedDRA. Existing publicly available datasets annotated using different terminologies cannot simply be merged and utilised, and therefore become less valuable when developing machine learning-based concept normalisation systems. To address this, we designed a data harmonisation pipeline and engineered a corpus of 27,979 textual descriptions simultaneously mapped to both MedDRA and SNOMED-CT, sourced from five publicly available datasets across the biomedical and social media domains. The pipeline can be used in the future to integrate new datasets into the corpus, and could also be applied in relevant data curation tasks. We also describe a method to merge different terminologies into a single concept graph that preserves their relations, and demonstrate that a representation learning approach based on random walks on a graph can efficiently encode both hierarchical and equivalent relations and capture semantic similarities not only between concepts inside a given terminology but also between concepts from different terminologies. We believe that making a corpus and embeddings for cross-terminology medical concept normalisation available to the research community will contribute to a better understanding of the task.


Introduction
The medical concept normalisation task aims to assign a corresponding identifier from a standard terminology to textual descriptions. Depending on the domain, descriptions may vary from formal medical jargon (e.g. "Dizziness") to more informal and colloquial expressions that rather explain how the patient feels (e.g. "everything that surrounds me is circling or rolling", "kept bumping into things"). There are multiple terminologies of medical concepts that are commonly used for mapping, such as SNOMED-CT (Systematized Nomenclature of Medicine - Clinical Terms) (Stearns et al., 2001) and MedDRA (Medical Dictionary for Regulatory Activities) (Brown et al., 1999). The Unified Medical Language System (UMLS) (Schuyler et al., 1993) integrates concepts from various biomedical vocabularies and lexicons, including SNOMED-CT and MedDRA; each concept is represented by its Concept Unique Identifier (CUI). Clinicians choose the most suitable terminology based on their particular case or application. Hence, when creating corpora with annotated medical concepts, there is no general agreement on which terminology to use or which annotation guidelines to follow. Also, the variety of available concepts in the terminologies (e.g. over 70,000 lowest level terms in MedDRA and over 350,000 concepts in SNOMED-CT) makes it harder to achieve high agreement between annotators. For instance, annotators could pick a different level of the hierarchy (e.g. Fatigue or the more specific term Tiredness) or inconsistently pick from similarly described concepts when a description is vague (e.g. Insomnia and Poor quality sleep). As a result, such variable annotations cannot simply be merged and utilised, and therefore such data become less valuable when developing machine learning-based concept normalisation systems.
To combine and harmonise datasets, we need to tackle various problems associated with providing cross-terminology mappings between concepts and resolving inconsistent annotations from different datasets. Due to the heterogeneous structures of medical terminologies, simple one-to-one mappings may be insufficient to match and compare concepts. Therefore, it is also necessary to harmonise and align the terminologies and to find a way to represent medical concepts considering the relations between them regardless of the terminology. Representation learning techniques have shown promising results in encoding structural information about nodes in graphs and heterogeneous networks (Perozzi et al., 2014; Grover and Leskovec, 2016; Dong et al., 2017; Hamilton et al., 2017); however, this requires integrating various medical terminologies into a single graph or network, which remains challenging. Recently, it has also been demonstrated that terminological embeddings can capture semantic similarities and are especially well-suited for biomedical ontology alignment (Kolyvakis et al., 2018). In this paper, we present the MedNorm corpus, consisting of 27,979 textual descriptions (phrases) simultaneously mapped to both MedDRA and SNOMED-CT, which have been sourced from five publicly available datasets across the biomedical and social media domains. To combine them, we designed a data harmonisation pipeline that can be re-used in the future to integrate new datasets into the corpus, or applied in relevant annotation and data processing tasks. We have also described a method to merge multiple medical terminologies into a single network preserving both terminology-specific and cross-terminology relations. We demonstrated that a representation learning approach based on random walks on a graph can efficiently encode equivalent and hierarchical relations and capture semantic similarities not only between concepts inside a given terminology, but also between concepts from different terminologies.
Finally, we have provided an analysis of the corpus, investigated textual and conceptual similarities between the utilised datasets, and analysed the cross-terminology medical concept embeddings. The corpus and concept embeddings, as well as the harmonisation pipeline, are publicly available. Making such resources available to the research community is intended to contribute to a better understanding of the task.

Target medical ontologies
Relationships between medical concepts are encoded differently in medical ontologies. In this section we describe the two ontologies that have been used for mappings in the corpus.
SNOMED-CT (SCT) is a structured clinical terminology that enables consistent documentation and annotation of clinical data. There are both hierarchical and semantic (e.g. finding site, associated morphology) relations between terms. Each term can have multiple hierarchical paths with different lengths, so their specific level in the hierarchy is undefined.
MedDRA is a hierarchical terminology with five levels (from very specific to very general) designed for encoding adverse drug events for regulatory affairs. The most specific level, Lowest Level Terms (LLT), describes how a concept might be reported in practice (e.g. "Feeling queasy"). Each LLT is linked to exactly one Preferred Term (PT), a distinct descriptor for a symptom, sign, disease diagnosis, indication, procedure or medical history characteristic (e.g. "Nausea"). Related PTs are grouped into High Level Terms (HLTs, e.g. "Nausea and vomiting symptoms"), then into High Level Group Terms (HLGTs, e.g. "Gastrointestinal signs and symptoms"), and finally into System Organ Classes (SOCs, e.g. "Gastrointestinal disorders"). Note that a single HLT can be linked to more than one HLGT, and as a result a PT can have more than one hierarchical path to a SOC.
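The multi-path property above can be made concrete with a toy parent map. The terms follow the examples in the text, but the second HLGT link (and its SOC) are purely hypothetical, added only to illustrate how a PT can reach more than one SOC:

```python
# Sketch of MedDRA's five-level hierarchy (LLT -> PT -> HLT -> HLGT -> SOC).
# Terms follow the examples in the text; the "(hypothetical)" entries are
# invented to illustrate an HLT linked to more than one HLGT.
PARENTS = {
    "LLT: Feeling queasy": ["PT: Nausea"],
    "PT: Nausea": ["HLT: Nausea and vomiting symptoms"],
    "HLT: Nausea and vomiting symptoms": [
        "HLGT: Gastrointestinal signs and symptoms",
        "HLGT: Some other group (hypothetical)",
    ],
    "HLGT: Gastrointestinal signs and symptoms": ["SOC: Gastrointestinal disorders"],
    "HLGT: Some other group (hypothetical)": ["SOC: Some other class (hypothetical)"],
}

def paths_to_soc(term):
    """Enumerate every hierarchical path from a term up to a SOC."""
    parents = PARENTS.get(term, [])
    if not parents:                      # a SOC has no parent
        return [[term]]
    return [[term] + path for p in parents for path in paths_to_soc(p)]
```

Because the sketched HLT has two HLGT parents, `paths_to_soc("PT: Nausea")` returns two distinct paths, mirroring the ambiguity described above.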

Source corpora
The data for the MedNorm corpus was collected across two different domains: biomedical documents (drug labels and PubMed abstracts) and social media (online health forums and drug-related discussions on Twitter). The source datasets and their descriptions are provided below; Table 1 gives an overview of the utilised terminologies.
CADEC: The CSIRO Adverse Drug Event Corpus (CADEC) (Karimi et al., 2015) is an annotated corpus of patient-reported adverse drug events (ADEs) sourced from the medical forum AskAPatient, which collects consumer ratings and reviews of medications. It contains 1,250 forum posts annotated for mentions of Drug, ADR, Disease, Symptom and Finding. Every mention other than Drug has been mapped to the corresponding SNOMED-CT concept identifier, and ADR mentions have also been mapped to the corresponding MedDRA term.
TwADR-L: The TwADR-L dataset has been constructed by the University of Cambridge (Limsopatham and Collier, 2016) from a collection of three months of Twitter posts, which was sampled and annotated by undergraduate-level linguists who mapped each phrase to one of the concepts in the UMLS Metathesaurus.
TwiMed: This corpus consists of 1,000 tweets and 1,000 PubMed sentences, selected using the same strategy and annotated by two pharmacists for a set of drugs, diseases and symptoms (Alvaro et al., 2017). The TwiMed-Twitter set contains 827 phrases and the TwiMed-PubMed set contains 1,142 phrases, both mapped to the UMLS Metathesaurus.
SMM4H-2017: This dataset of concept mentions and their corresponding human-assigned MedDRA PTs was provided as part of the 2nd Social Media Mining for Health Applications Shared Task at AMIA 2017 (Subtask 3) (Sarker et al., 2018). It consists of two sets: the SMM4H2017-train set (6,650 phrases) and the SMM4H2017-test set (2,500 phrases).
TAC 2017 (ADR Track): The Text Analysis Conference (TAC) 2017 Shared Task included a track on Adverse Drug Reaction Extraction from Drug Labels (Demner-Fushman et al., 2018), the final task of which focused on mapping ADRs extracted from Structured Product Labels (SPLs) to MedDRA PTs. The released training set (TAC2017 ADR) of 101 annotated drug labels contains 7,045 ADR mentions mapped to MedDRA.

Corpus creation
The data harmonisation pipeline used to create the corpus is illustrated in Figure 1. Initially, we combined all seven datasets from the five data sources mentioned above into a single set of instances, where each phrase is associated with its original identifiers in the different terminologies. We represented the corpus as a graph to preserve relations between datasets and their annotations (Section 3.1). Then, we extracted hierarchical relations and linked all concepts to their closely matched (equivalent) concepts across terminologies (Section 3.2). We encoded both the hierarchical and equivalent relations between concepts in different terminologies in a low-dimensional vector space that makes it possible to measure the similarity between them (Section 3.3). In addition, we attempted to identify and resolve potential inconsistencies in human annotations (Section 3.4). In order to achieve consistent hierarchy levels across annotations, all instances have been simultaneously mapped to either the Preferred Term (PT) level or a higher level (e.g. when the original annotation was less specific) in MedDRA and its equivalent level in SNOMED-CT. After this process, each phrase could have more than one equivalent mapping candidate (multi-label). Therefore, to provide a one-to-one mapping between phrases and concepts, multiple candidates have been reduced to a single concept (single-label). As a result, we constructed our corpus of 27,979 textual descriptions (phrases) simultaneously mapped to both MedDRA (version 21.1) and SNOMED-CT (version 2018-07-31).
The corpus is represented as a graph of instances: each INSTANCE can be originally annotated with one or multiple CONCEPTs (e.g. LLT:10041017, SCT:248255005) and is described by a textual PHRASE (e.g. "unable to sleep"), which in turn contains a set of TOKENs (e.g. {sleep, unable, to}). Each CONCEPT has a corresponding NAME in the terminology (e.g.
Sleeplessness, Cannot sleep at all), which is encoded using a NAMED AS link and also contains a set of tokens (similar to a phrase). To represent hierarchical relations between concepts extracted from the medical terminologies, each CONCEPT can be linked to its parent node (i.e. the concept from the higher level in the hierarchy) with an IS A link (e.g. Sleeplessness → Insomnia → Disturbances in initiating and maintaining sleep → Sleep disorders and disturbances → Psychiatric disorders) and mapped to an equivalent concept node using a MAPPED TO relation (e.g. Sleeplessness LLT:10041017 → Insomnia SCT:193462001). Representing the corpus as a graph makes further processing and analysis easier. For example, testing whether a particular phrase has been inconsistently annotated within the same dataset (i.e. has more than one associated concept) can be done by counting the number of unique CONCEPT nodes reachable from the target phrase. Moreover, all links between concepts in different terminologies (despite their differing structures) are stored inside a single graph.
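As a rough sketch of the graph representation and the reachable-concept check described above (using plain adjacency lists rather than the actual implementation; edge-type names such as ANNOTATED_WITH and USED_IN are illustrative, while the identifiers follow the examples in the text):

```python
from collections import defaultdict

# Minimal sketch of the corpus graph as adjacency lists of (relation, node)
# pairs. Edge-type names are illustrative; identifiers follow the text.
graph = defaultdict(list)

def add_edge(src, rel, dst):
    graph[src].append((rel, dst))

# An INSTANCE annotated with two concepts and described by a phrase.
add_edge("INSTANCE:1", "ANNOTATED_WITH", "LLT:10041017")
add_edge("INSTANCE:1", "ANNOTATED_WITH", "SCT:248255005")
add_edge("INSTANCE:1", "DESCRIBED_BY", "PHRASE:unable to sleep")
add_edge("PHRASE:unable to sleep", "USED_IN", "INSTANCE:1")
# Cross-terminology and hierarchical concept links.
add_edge("LLT:10041017", "MAPPED_TO", "SCT:193462001")
add_edge("LLT:10041017", "IS_A", "PT:10022437")

def concepts_for_phrase(phrase):
    """Collect unique CONCEPT nodes reachable from a phrase via its instances."""
    concepts = set()
    for rel, instance in graph[phrase]:
        if rel != "USED_IN":
            continue
        for rel2, node in graph[instance]:
            if rel2 == "ANNOTATED_WITH":
                concepts.add(node)
    return concepts
```

A phrase whose reachable set contains more than one concept from the same terminology would then be flagged as a candidate inconsistency.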

Cross-terminology mapping
The automatic mapping between UMLS, MedDRA and SNOMED-CT has been done using community-based mappings from BioPortal (Noy et al., 2008) through its REST API. Two concepts from different ontologies are considered equivalent or closely matched if they share the same UMLS Concept Unique Identifier (CUI). After a careful review of the results, we observed that some frequently mentioned concepts had not been mapped automatically. Therefore, with the help of medical experts, we defined an additional set of manually curated mapping rules (provided in Appendix A, Table 6).
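The CUI-based matching rule can be sketched as follows; the CUI assignments shown are illustrative placeholders, not values retrieved from UMLS or BioPortal:

```python
# Sketch of the CUI-based matching rule: two concepts from different
# terminologies are treated as equivalent or closely matched if they share
# a UMLS CUI. The CUI values below are illustrative placeholders.
concept_to_cuis = {
    "LLT:10041017": {"C0917801"},   # MedDRA "Sleeplessness" (example)
    "SCT:193462001": {"C0917801"},  # SNOMED-CT "Insomnia" (example)
    "SCT:22253000": {"C0030193"},   # SNOMED-CT "Pain" (example)
}

def equivalent(concept_a, concept_b):
    """Concepts are closely matched if their CUI sets intersect."""
    return bool(concept_to_cuis.get(concept_a, set())
                & concept_to_cuis.get(concept_b, set()))
```

Concepts the rule misses (empty intersection despite a genuine match) are exactly the cases the manually curated rules in Appendix A cover.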

Learning cross-terminology representations of concepts
Cross-terminology mappings allowed us to link concepts from multiple terminologies together, but their heterogeneous hierarchical structures (i.e. concepts located at different depths in the hierarchy, or with different numbers of relations) make graph distance alone insufficient to measure the similarity between concepts in different terminologies. However, medical concepts (or their corresponding nodes) can be embedded into a low-dimensional vector space. Initially, we constructed a simplified hierarchical concept graph whose vertices are groups of equivalent concepts (i.e. nodes linked with the MAPPED TO relation in the main corpus graph) and whose edges are hierarchical IS A relations. Then, we used DeepWalk (Perozzi et al., 2014), a deep learning method that generalises language modelling to streams of short random walks, treating them as the equivalent of sentences. Performing 10 random walks per node (with a length of 40 nodes) and training a Skip-gram model (Mikolov et al., 2013) with a window size of 5, we generated 64-dimensional concept vectors; the vector size was chosen empirically. Finally, we split the groups back into individual concepts under the assumption that all concepts in a group (i.e. equivalent concepts) should have the same vectors. Table 2 shows three selected MedDRA concepts and their most similar concepts (by cosine similarity) from all terminologies. It demonstrates that both equivalent and hierarchical relations between concepts have been successfully encoded, and that semantic similarity can be captured by calculating the cosine similarity between two corresponding concept vectors.
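The random-walk corpus generation step can be sketched in a few lines. The tiny graph fragment below is illustrative (based on the hierarchy example above, with edges made bidirectional so walks do not dead-end); in the actual pipeline the resulting walks would be fed to a Skip-gram model, for example gensim's Word2Vec, to obtain the 64-dimensional vectors:

```python
import random

# DeepWalk-style walk generation: 10 truncated random walks per node, each
# up to 40 nodes long, as described in the text. The graph fragment is an
# illustrative piece of the hierarchy, not real terminology data.
edges = {
    "Sleeplessness/Insomnia": ["Disturbances in initiating and maintaining sleep"],
    "Disturbances in initiating and maintaining sleep": [
        "Sleeplessness/Insomnia", "Sleep disorders and disturbances"],
    "Sleep disorders and disturbances": [
        "Disturbances in initiating and maintaining sleep", "Psychiatric disorders"],
    "Psychiatric disorders": ["Sleep disorders and disturbances"],
}

def random_walks(graph, walks_per_node=10, walk_length=40, seed=0):
    """Generate truncated random walks over the concept graph."""
    rng = random.Random(seed)
    walks = []
    for start in graph:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbours = graph.get(walk[-1])
                if not neighbours:       # stop early at a node with no edges
                    break
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks
```

Treating each walk as a "sentence" of concept tokens is what lets a standard Skip-gram implementation (window size 5, 64 dimensions, per the text) learn the concept vectors.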

Corpus consistency
In order to make all annotations in our final corpus consistent, we have performed the two operations described below.
Resolving inconsistent annotations: After performing a manual analysis of the combined corpus, we noticed inconsistencies in the original human annotations. For example, in CADEC, where phrases can be mapped simultaneously to both SNOMED-CT and MedDRA, 27 instances which were (correctly) annotated as Stomach cramps (SCT:51197009) were also co-annotated as Learning disorder (MedDRA PT:10061265). To identify potential annotation errors in the original datasets, we utilised the concept graph to calculate distances between concept nodes (i.e. the shortest path length) and the cosine similarity of the corresponding vectors in the latent vector space model (VSM). We also made an effort to locate inconsistent annotations across different datasets by identifying ambiguous tokens. Typically, a specific token is used to describe groups (clusters) of similar concepts (e.g. "walk" frequently describes concepts related to walking or mobility). An ambiguous token, by contrast, mostly describes clusters of similar concepts but sometimes also describes concepts outside those clusters (i.e. the difference between the numbers of occurrences in the groups is high). Note that common tokens (e.g. "unable"), which are not specific to a particular group of concepts, will usually occur in a high number of groups, but with a relatively small difference between the numbers of occurrences.
We attempted to identify such outliers by calculating distances between concepts and their deviations from the clusters. For example, the token "walk" was mentioned in 98 phrases and mapped to 23 concepts in total. The most popular annotation was Walking disability (e.g. "can barely walk"); however, it was also annotated as Myocardial infarction (e.g. "walk a little funny"), which could be a potential annotation error. After this analysis and a manual review, we identified and re-mapped 110 annotations (the list is provided with the source code).
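A simplified version of the ambiguous-token heuristic is sketched below. The counts and thresholds are made up for illustration and are not the exact criteria used in the paper:

```python
from collections import Counter

# Sketch of the ambiguous-token heuristic: for each token, count how often
# it is mapped to each concept; flag concepts that are far rarer than the
# token's dominant concept. Counts and thresholds are illustrative only.
token_concept_counts = {
    "walk": Counter({"Walking disability": 70, "Gait disturbance": 20,
                     "Myocardial infarction": 1}),
    "unable": Counter({"Insomnia": 5, "Walking disability": 4, "Anorexia": 4}),
}

def suspicious_concepts(token, dominance=0.5, rare=0.05):
    """Return concepts mapped far more rarely than the dominant cluster."""
    counts = token_concept_counts[token]
    total = sum(counts.values())
    top_share = counts.most_common(1)[0][1] / total
    if top_share < dominance:    # common, unspecific token (e.g. "unable")
        return []
    return [c for c, n in counts.items() if n / total < rare]
```

Under these toy counts, "walk" flags its single Myocardial infarction mapping for review, while "unable" (spread evenly over many groups) flags nothing, matching the distinction drawn in the text.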

Consistent hierarchical mapping:
The Preferred Term (PT) level in MedDRA describes a single medical concept. It has therefore been selected as the standard hierarchical level for annotations in our corpus. However, not all phrases are specific enough to be mapped to the PT level or its equivalent; in such cases, we kept annotations equivalent to higher MedDRA levels (i.e. HLT, HLGT or SOC). All lower-level (i.e. LLT-equivalent) annotations have been mapped to their PT-equivalent parents. Using the corpus graph, we were able to automate this process. Initially, all instances, regardless of the terminology used in the original annotations, were recursively mapped to their corresponding equivalent PT candidates (i.e. including mappings of mappings). Then, for each MedDRA candidate, we selected equivalent candidates from SNOMED-CT. To filter out concepts that emerged purely from this automatic mapping, all concepts not observed in the original annotations were removed (except where such a concept was the only possible candidate). After this process, each phrase could have more than one candidate per terminology (multi-label). Therefore, to provide a one-to-one mapping between phrases and terminologies, in each multi-label group we first identified the MedDRA concept that was most similar to the original annotation (i.e. from the source dataset) and also the most popular across the whole corpus (i.e. to minimise the number of outliers). Then, we selected the SNOMED-CT concept (from the multi-label group) most similar to the selected MedDRA concept, to achieve consistency in the mapping between terminologies. Hereby, each phrase has been mapped simultaneously to exactly one (single-label) MedDRA concept and its corresponding SNOMED-CT concept. As a result, the final corpus has 27,957 PT-equivalent, two HLT-equivalent, 18 HLGT-equivalent and two SOC-equivalent annotations. Table 3 gives examples of originally annotated phrases and their multi-label and single-label mappings; the concepts selected during the multi-label reduction to single-label are shown in bold-italic. Prefixes for concept identifiers: SCT - SNOMED-CT; C - UMLS; LLT, PT, HLT, HLGT, SOC - MedDRA (depending on the level); equivalent concepts have a similarity value of 1.0.
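The recursive lift of LLT-equivalent annotations to the PT-equivalent level can be sketched as follows; the level assignments and parent links are illustrative, not real MedDRA data:

```python
# Sketch of the recursive lift to PT-equivalent level: annotations below the
# PT level are followed up IS A links until a PT-equivalent concept is
# reached; annotations at PT level or above are kept as they are.
# Levels and parent links below are illustrative placeholders.
LEVEL = {"LLT:10041017": "LLT", "PT:10022437": "PT", "HLT:10040998": "HLT"}
PARENT = {"LLT:10041017": "PT:10022437", "PT:10022437": "HLT:10040998"}
ORDER = ["LLT", "PT", "HLT", "HLGT", "SOC"]

def lift_to_pt(concept):
    """Map an annotation to its PT-equivalent parent (mappings of mappings)."""
    while ORDER.index(LEVEL[concept]) < ORDER.index("PT"):
        concept = PARENT[concept]
    return concept
```

Annotations already at HLT, HLGT or SOC level pass through unchanged, matching the rule that less specific annotations are kept at their original level.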

Corpus analysis
The descriptive statistics of the datasets constituting the corpus (grouped into the biomedical and social domains) are presented in Table 4. Medical concept descriptions (phrases) are longer in the social domain. The longest phrase has been found in the CADEC corpus: "when I went to sit down instead of siting normally I would almost fall down in the chair no control no strength, upon getting up I had to hold on to something to get up" (36 tokens), which describes Muscle weakness. We have also investigated the degree of class imbalance in the corpus and illustrated the most reported MedDRA concepts in Figure 3.

Asymmetric transferability between datasets
To investigate how knowledge acquired from one dataset can potentially be transferred to another, we introduced an asymmetric transferability index that takes into account both conceptual (i.e. the concepts from various terminologies used in a dataset) and textual (i.e. the language used to describe those concepts) similarities. The asymmetry shows how much information in one dataset can be understood given complete information about the other. The index utilises two similarity measures: the cosine similarity CS(X, Y) = (X · Y) / (‖X‖ ‖Y‖) and the special case of the Tversky Index (Tversky, 1977) with α = 1 and β = 0, which can be rewritten as TI(X, Y) = |X ∩ Y| / (|X ∩ Y| + |X − Y|). The similarity between two sequences of labels l1 and l2 can be calculated as the cosine similarity between the corresponding label count vectors c(l1) and c(l2). However, that measure is symmetric, so we multiply it by the asymmetric set-based similarity:

s(l1, l2) = TI(l1, l2) × CS(c(l1), c(l2))    (1)

Given two datasets A and B, their sets of phrases P_A and P_B, and their sets of words W_A and W_B, we obtain the textual transferability index (from A to B) as the arithmetic mean of the phrasal and verbal asymmetric similarities:

T_txt(A, B) = (s(P_A, P_B) + s(W_A, W_B)) / 2

For each terminology t, we extract the sequences of labels ℓ(A, t) in dataset A and ℓ(B, t) in dataset B. The conceptual transferability index is the average asymmetric similarity over the terminology-specific label sequences:

T_con(A, B) = (1 / |T|) × Σ_{t ∈ T} s(ℓ(A, t), ℓ(B, t))

Finally, we obtain the overall transferability index as the mean of the two:

T(A, B) = (T_txt(A, B) + T_con(A, B)) / 2

The textual, conceptual and overall transferability matrices are presented in Figure 4. A higher transferability index indicates a better chance of understanding information (i.e. matching vocabulary or concepts). The most transferable dataset was TwADR-L, whereas the least transferable was TwiMed-PubMed; this directly corresponds to the numbers of unique concepts, phrases and words reported in Table 4. Also, the datasets collected from Twitter are highly transferable between each other.
The CADEC dataset, although collected from AskAPatient reviews rather than Twitter, is still more similar to the Twitter datasets (i.e. the social domain).
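The asymmetric similarity of Equation (1), and the textual transferability index built from it, can be implemented directly; the example label sequences in the test below are made up:

```python
import math
from collections import Counter

# Equation (1): Tversky Index (alpha=1, beta=0) over label sets, multiplied
# by the cosine similarity of label count vectors. With alpha=1 and beta=0,
# TI(X, Y) = |X ∩ Y| / (|X ∩ Y| + |X − Y|) = |X ∩ Y| / |X|.
def asymmetric_similarity(labels_a, labels_b):
    set_a, set_b = set(labels_a), set(labels_b)
    tversky = len(set_a & set_b) / len(set_a) if set_a else 0.0
    ca, cb = Counter(labels_a), Counter(labels_b)
    dot = sum(ca[k] * cb[k] for k in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    cosine = dot / norm if norm else 0.0
    return tversky * cosine

def textual_transferability(phrases_a, phrases_b, words_a, words_b):
    """Arithmetic mean of phrasal and verbal asymmetric similarities."""
    return (asymmetric_similarity(phrases_a, phrases_b)
            + asymmetric_similarity(words_a, words_b)) / 2
```

Because the Tversky factor divides by |X| only, the measure is genuinely asymmetric: a small dataset fully contained in a larger one scores higher "from small to large" than the reverse.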

Cross-terminology concept representations
In order to analyse the cross-terminology concept representations, we used t-distributed Stochastic Neighbour Embedding (t-SNE) (Maaten and Hinton, 2008) to reduce dimensionality from 64D to 2D (Figure 5). It can be observed that semantically similar concepts are clustered together, providing additional evidence that the concept representations encode hierarchical and equivalent relations and capture semantic similarities. In Table 5 we present the most similar MedDRA and SNOMED-CT annotations (i.e. the final labels in the corpus) for the three most frequently reported concepts: Insomnia, Pain and Fatigue. Although these representations encode conceptual similarity well, they are insufficient to correctly identify opposite concepts (e.g. Fatigue and Energy increased). This is because we only utilised hierarchical relations in the terminologies (information about opposite concepts is not provided in these terminologies explicitly).
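A nearest-neighbour lookup over the concept vectors (as used to produce lists like Table 5) can be sketched with plain cosine similarity. The 3-dimensional toy vectors below stand in for the real 64-dimensional embeddings; they are constructed only to illustrate the behaviour, including the opposite-concept limitation:

```python
import math

# Nearest-neighbour lookup by cosine similarity. The toy vectors are
# illustrative stand-ins for the learned 64-dimensional embeddings; the
# "Energy increased" vector is deliberately placed near "Fatigue" to mimic
# the opposite-concept limitation discussed in the text.
vectors = {
    "PT:Insomnia": [0.9, 0.1, 0.0],
    "SCT:Insomnia": [0.88, 0.12, 0.0],
    "PT:Fatigue": [0.1, 0.9, 0.0],
    "PT:Energy increased": [0.1, 0.85, 0.1],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

def most_similar(concept, k=2):
    """Return the k concepts with the highest cosine similarity."""
    sims = sorted(((cosine(vectors[concept], v), c)
                   for c, v in vectors.items() if c != concept), reverse=True)
    return [c for _, c in sims[:k]]
```

Note that in this toy space the nearest neighbour of "PT:Fatigue" is "PT:Energy increased", mirroring how hierarchy-only embeddings place semantically related but opposite concepts close together.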

Conclusion
We have presented a corpus for cross-terminology medical concept normalisation that has been sourced from five publicly available datasets across the biomedical and social domains. The data harmonisation pipeline described in the paper combines instances from various datasets and provides consistent simultaneous mappings to both the MedDRA and SNOMED-CT terminologies. The pipeline can be used in the future to integrate new datasets into the corpus, or applied in relevant data annotation and processing tasks. We have also described a method to merge multiple medical terminologies and demonstrated that equivalent and hierarchical relations can be encoded into cross-terminology concept representations that capture semantic similarities not only between concepts inside a given terminology but also between concepts from different terminologies. The generated cross-terminology medical concept representations can be used to improve and analyse the performance of concept normalisation systems. Making these resources available to the research community, together with the analysis of the final corpus, is intended to contribute to a better understanding of the task and its associated challenges.