Biomedical Concept Relatedness – A large EHR-based benchmark

A promising application of AI to healthcare is the retrieval of information from electronic health records (EHRs), e.g. to aid clinicians in finding relevant information for a consultation or to recruit suitable patients for a study. This requires search capabilities far beyond simple string matching, including the retrieval of concepts (diagnoses, symptoms, medications, etc.) related to the one in question. The suitability of AI methods for such applications is tested by predicting the relatedness of concepts with known relatedness scores. However, all existing biomedical concept relatedness datasets are notoriously small and consist of hand-picked concept pairs. We open-source a novel concept relatedness benchmark overcoming these issues: it is six times larger than existing datasets and concept pairs are chosen based on co-occurrence in EHRs, ensuring their relevance for the application of interest. We present an in-depth analysis of our new dataset and compare it to existing ones, highlighting that it is not only larger but also complements existing datasets in terms of the types of concepts included. Initial experiments with state-of-the-art embedding methods show that our dataset is a challenging new benchmark for testing concept relatedness models.


Introduction
The adoption of electronic health records (EHRs) facilitates interoperability, meaning that more and more information from different sources is being stored about a patient. This makes it increasingly challenging for doctors to efficiently filter a patient's record for relevant information during a consultation without missing anything. This is particularly problematic since consultations are time-constrained: in the UK, for example, general practitioners (GPs) usually have less than 10 minutes per patient consultation (Flaxman, 2015; Salisbury, 2019).
EHRs consist not only of free-text records but are furthermore tagged by doctors with medical concept codes. This coding aims to standardise health records to enable, e.g., the seamless transfer of patient information between practices and the analysis of health data from different practices (Sinha et al., 2012; Morrison et al., 2014). In addition, coded EHRs enable searching an EHR for concept codes. However, retrieving not only the exact concept in question but also related ones, as doctors do when reading a patient's record, is less straightforward.
For a patient with potential liver failure, related information of interest to a doctor in the patient's history includes, for example, 'alcohol abuse', a risk factor of liver failure, and 'jaundice', a symptom of liver failure. Other risk factors, symptoms, treatments, conditions, or tests associated with liver failure would also be considered relevant.
Concept representation models, such as embeddings or ontology-based methods (McInnes et al., 2009; Pivovarov and Elhadad, 2012; Henry et al., 2018; Smalheiser et al., 2019; Park et al., 2019), have been developed to tackle the task of identifying and retrieving related concepts. These methods have the potential to aid doctors in finding related information in a patient's EHR, which can improve the quality of medical outcomes by increasing efficiency, thereby alleviating time pressure, and by ensuring that doctors do not miss important information. It is, however, unclear how well these methods would perform in real-world EHR concept retrieval settings, as they have so far only been tested on very small datasets, as pointed out by Schulz and Juric (2020).
We address this issue by constructing a novel open-source1 biomedical concept relatedness dataset consisting of 3630 concept pairs, six times more than the largest existing dataset. Instead of manually selecting and pairing concepts as done in previous work, our dataset is sampled from EHRs to ensure concepts are relevant for the EHR concept retrieval task. The relatedness scores assigned to concept pairs in our dataset are of high quality, as shown by good inter-annotator agreement and reliability metrics. A detailed analysis of the concepts in our novel dataset reveals a far larger coverage compared to existing datasets. We furthermore report the results of initial experiments with state-of-the-art embeddings, illustrating that our dataset constitutes a challenging new benchmark.

Related Work
Relatedness and similarity are not to be confused, even though embedding models are often tested on both types of relations (Chiu et al., 2018; Henry et al., 2019; Schulz and Juric, 2020). Semantic similarity is a specific type of semantic relatedness (Pakhomov et al., 2010; Pakhomov et al., 2011), meaning that similar concepts are generally related but not vice versa. As an example, 'liver failure' and 'alcohol abuse' are medically related but not semantically similar. We are here concerned with relatedness.

Pedersen et al. (2007) hand-picked 120 pairs of medical concepts from UMLS (Bodenreider, 2004) that were expected to have a balanced distribution across four categories: closely related, somewhat related, somewhat unrelated, and completely unrelated. The pairs were then rated by 13 medical coders on a 1-10 relatedness scale. Coders were not given a definition of the scale and were instructed to use their intuition. Since the coders' agreement was low, the 29 concept pairs with the highest agreement were chosen and annotated again by nine medical coders and three physicians as synonyms (4), related (3), marginally related (2), or unrelated (1), resulting in the MiniMayoSRS dataset.

Pakhomov et al. (2011) selected a subset of 101 pairs from the original 120, excluding duplicates and ambiguous pairs, and analysed the coders' ratings in more detail. This subset is available as the MayoSRS dataset. Based on their observations, they also proposed a framework for the future creation of concept relatedness datasets, which we closely follow. MiniMayoSRS and MayoSRS both lack size and coverage. They are thus not suitable for testing concept relatedness models with the purpose of selecting the best one for real-world applications.

Pakhomov et al. (2010) introduced the UMNSRS-Sim and UMNSRS-Rel datasets, consisting of manually chosen pairs of UMLS concepts rated on a continuous scale of 0-1600 regarding their similarity and relatedness, respectively.
The rating was performed on a touch screen, where the continuous scale corresponds to pixels on the screen. Raters had only 4 seconds to rate each pair and were not given any definition of the similarity/relatedness scale. Out of 724 given concept pairs, four medical coders rated 566 regarding similarity and another four coders rated 587 regarding relatedness.

Hliaoutakis (2005) presented 36 pairs of MeSH terms with a similarity score of 0-1. Chiu et al. (2018) created the Bio-SimLex and Bio-SimVerb datasets of, respectively, 988 pairs of nouns and 1000 pairs of verbs that frequently occur in PubMed. Since both works involve similarity rather than relatedness and their terms are not linked to concepts in any biomedical ontology, they are omitted from our comparison of existing datasets with our novel benchmark.

A New Concept Relatedness Benchmark
To enable the development of reliable models for searching EHRs and biomedical literature, appropriate benchmark datasets for testing are essential. Existing datasets (see Section 2) have various shortcomings, which we address in the construction of our novel benchmark: 1) Size: an appropriate test set needs to be of sufficient size to allow for the generalisability of performance results. Our novel benchmark is six times larger than the biggest existing dataset.

Table 1: The relatedness scale used for annotation.
Score  Definition
0      Unrelated: completely unrelated concepts, i.e. the concepts have nothing in common and no relationship links them
1      Marginally related: there is a correlation between the concepts, but an established link might not exist
2      Related: the concepts are strongly related medically, e.g. one leads to the other (nausea leads to vomiting), or the concepts have an established link (obesity and ischemic heart disease)
3      Extremely related: the concepts always occur together medically, or one cannot happen without the other (alcoholic liver disease and liver cirrhosis)

In the following, we describe the selection of concept pairs for our dataset and their annotation.

Constructing medical concept pairs from IMRD
In contrast to existing datasets, where concepts are either manually selected from UMLS/MeSH or sampled from PubMed and then paired, we directly sample concept pairs from EHR data. In particular, we use IQVIA Medical Research Data (IMRD) incorporating data from The Health Improvement Network (THIN, a Cegedim database), which consists of anonymised primary care EHRs, covering 5% of the UK population.
A patient's consultation in IMRD may include concepts from the following categories: symptom, diagnosis, presenting complaint, examination, intervention, management, and administration. We here only consider concepts from the first three categories as they are the most relevant to the purpose of EHR search to aid consultations. For each patient in IMRD, we pair all distinct concepts (from the three categories) occurring in the patient's EHR, resulting in a total of 1,345,193 unique pairs made from 34,794 unique concepts.
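The per-patient pairing step can be sketched as follows. This is a minimal illustration only: the record structure, category labels, and function name are assumptions for the sketch, not the actual IMRD format.

```python
from itertools import combinations

# Categories considered relevant for EHR search (as described above).
RELEVANT = {"symptom", "diagnosis", "presenting complaint"}

def concept_pairs(patients):
    """Collect unique unordered pairs of distinct concepts that
    co-occur in the same patient's record, restricted to the three
    relevant categories."""
    pairs = set()
    for events in patients.values():
        concepts = {code for cat, code in events if cat in RELEVANT}
        pairs.update(combinations(sorted(concepts), 2))
    return pairs

# Toy records (illustrative only): patient id -> (category, concept) events.
patients = {
    "p1": [("symptom", "jaundice"), ("diagnosis", "liver failure"),
           ("administration", "repeat prescription")],
    "p2": [("symptom", "jaundice"), ("symptom", "nausea")],
}
```

Applied to the full IMRD data, this procedure yields the 1,345,193 unique pairs reported above.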
The concepts in IMRD are given as Read Version 2 codes, a coding system that is almost exclusively used in the UK (Robinson et al., 1997). To ensure international compatibility, we map all concepts in the extracted concept pairs to SNOMED-CT IDs (Donnelly, 2006) using the mappings provided by NHS Digital 2 .
Despite belonging to the symptom, diagnosis, or presenting complaint categories, some of the extracted concepts describe administrative or navigational rather than medical information, e.g. "did not attend appointment" or "situation with explicit context". Such concepts are manually flagged and filtered out along with all their descendants specified in SNOMED-CT. The mapping and filtering results in 1,066,541 unique concept pairs made of 30,276 unique concepts represented by SNOMED-CT IDs.

Annotation scale and setup
Our five annotators are experienced doctors, registered and licensed with the General Medical Council (GMC). To ensure that all annotators have the same understanding of relatedness, we define a relatedness scale of zero to three, as shown in Table 1, based on detailed discussions with doctors. We also perform a small pre-annotation study with all annotators to train them in applying the relatedness scale and to discuss potential misunderstandings and difficulties.
EHR-RelA: We first randomly select 120 pairs from the list of concept pairs for annotation by all five annotators. Our analysis of these annotations (details are discussed in Section 4) shows that the distribution of the relatedness scores is highly skewed towards non-related concepts.

Table 2: Pairwise Krippendorff's α and average α, Cohen's κ, and Spearman's ρ for each Ann(otator).

EHR-RelB:
To create a more balanced dataset, we sample concept pairs based on the assumption that concept pairs occurring frequently in EHRs are more likely to be related. The 1,066,541 unique concept pairs are therefore sorted by their number of occurrences in descending order. The pairs are then filtered so that only the top six occurrences of each concept are included, ensuring a higher coverage of unique concepts in our dataset. We then choose the top 4,000 concept pairs, which include 2,479 unique concepts. Since our analysis of the preliminary EHR-RelA annotations shows good annotator agreement (see Section 4), each of the 4,000 concept pairs is annotated by only three annotators to save resources (different concept pairs are annotated by a different subset of three of the five annotators).
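One plausible reading of this selection step can be sketched as follows; the function name and the exact tie-breaking behaviour are assumptions of the sketch, not taken from the paper's implementation.

```python
from collections import Counter

def select_pairs(pair_counts, per_concept_cap=6, top_n=4000):
    """Sort co-occurring pairs by frequency (descending), keep at most
    `per_concept_cap` pairs per concept to increase the coverage of
    unique concepts, then take the top `top_n` pairs."""
    seen = Counter()
    selected = []
    for pair, _count in sorted(pair_counts.items(), key=lambda kv: -kv[1]):
        a, b = pair
        if seen[a] >= per_concept_cap or seen[b] >= per_concept_cap:
            continue  # this concept already appears often enough
        seen[a] += 1
        seen[b] += 1
        selected.append(pair)
        if len(selected) == top_n:
            break
    return selected
```

With the real data, `pair_counts` would map each of the 1,066,541 unique concept pairs to its EHR co-occurrence count.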

Dataset Analysis
Some concept pairs in EHR-RelA and EHR-RelB were not rated as the meaning of some concepts was unclear. Excluding these pairs, EHR-RelA consists of 111 concept pairs and EHR-RelB of 3630.

Annotation quality and reliability
To assess the quality of our annotated datasets as well as the difficulty of the task, we analyse the annotators' agreement. We closely follow the methodology set out by Pakhomov et al. (2011), considering 1) inter-annotator agreement, measured as pairwise coefficients between each of the annotators, and 2) multi-rater reliability, measured as summary statistics of all annotators together.

Inter-annotator agreement
Pairwise agreement measures are useful in identifying single annotators with low performance as well as disagreements between pairs of annotators. Following Pakhomov et al. (2011), we use three measures: Spearman's ρ (correlation), Cohen's κ and Krippendorff's α.
The agreement between all annotators in terms of Krippendorff's α is 0.64 for the EHR-RelA annotation and 0.59 for EHR-RelB. The higher agreement on the 111 EHR-RelA concept pairs compared to the EHR-RelB annotation of 3630 concept pairs can be attributed to the fact that the EHR-RelA annotation was highly skewed towards unrelated concept pairs, as will be shown in Section 4.2.
The pairwise Krippendorff's α agreement is presented in Table 2. For space reasons, and since the pairwise κ and ρ scores follow the trends of the pairwise α measure, we omit them and only report the averaged κ and ρ scores for each annotator. The table shows that, overall, there is satisfactory agreement between all annotators. The pairwise agreement scores reveal that annotator E agrees least with the other annotators. However, the agreement is still moderate, so we include annotator E's annotations in our dataset. Since we publish not only the average relatedness score but also each annotator's individual annotations, future studies are free to exclude concept pairs with high disagreement.
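Two of the pairwise measures can be sketched in pure Python as follows (Krippendorff's α is more involved and is available, e.g., via the `krippendorff` package on PyPI); this is an illustrative sketch, not the paper's implementation.

```python
from statistics import mean

def _ranks(xs):
    """Ranks with ties assigned their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    ra, rb = _ranks(a), _ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    cats = set(a) | set(b)
    po = sum(x == y for x, y in zip(a, b)) / n
    pe = sum((a.count(c) / n) * (b.count(c) / n) for c in cats)
    return (po - pe) / (1 - pe)
```

Each pairwise score is then computed on the two annotators' score lists for the concept pairs they both rated.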

Rater reliability
To assess the reliability of annotations, we follow Pakhomov et al. (2011) in using Kendall's coefficient of concordance (Kendall's W) and the Intra-class Correlation Coefficients ICC(C,1) and ICC(C,k). McGraw and Wong (1996) define 10 types of Intra-class Correlation Coefficient (ICC), depending on the intended use. Like Pakhomov et al. (2011), we select ICC(C,1) and ICC(C,k), because: 1) they consider annotators as representative of a larger population of similar annotators, in our case doctors, and 2) they measure consistency instead of absolute agreement, i.e. systematic errors of an annotator are cancelled out. ICC(C,1) measures the reliability of a single rater selected from the larger rater population, whereas ICC(C,k) measures the reliability of an average of multiple raters from the larger rater population.

Table 3: EHR-RelB compared to existing datasets. *Unclear which ICC(C, ·) the authors used, so we assume ICC(C,1). † Unclear which correlation the authors used, so we assume Spearman's ρ.

As shown in Table 3, the annotation reliability is good to excellent (Cicchetti, 1994). As can be expected from the inter-annotator agreement analysis, the reliability on EHR-RelA is higher. We also observe that ICC(C,1) is lower than ICC(C,k), indicating that the average annotation score is more reliable than a single annotator's scores.
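For a ratings table of n items by k raters, the consistency ICCs of McGraw and Wong (1996) are ICC(C,1) = (MS_R − MS_E) / (MS_R + (k−1) MS_E) and ICC(C,k) = (MS_R − MS_E) / MS_R, where MS_R is the between-items mean square and MS_E the residual mean square of a two-way model. A sketch (function name is ours):

```python
from statistics import mean

def icc_consistency(ratings):
    """ICC(C,1) and ICC(C,k) for a ratings table of shape
    (n items) x (k raters), following McGraw and Wong (1996)."""
    n, k = len(ratings), len(ratings[0])
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ms_r = ss_rows / (n - 1)                                      # between-items
    ms_e = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))   # residual
    icc_c1 = (ms_r - ms_e) / (ms_r + (k - 1) * ms_e)
    icc_ck = (ms_r - ms_e) / ms_r
    return icc_c1, icc_ck
```

Note that a constant offset between raters (a systematic error) leaves both coefficients at 1.0, which is exactly the consistency property described above.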
In comparison to existing datasets, Table 3 shows that the inter-annotator agreement on our datasets in terms of average Spearman's ρ is higher than for the MayoSRS dataset. Note that the agreement for MiniMayoSRS is very high since only high-agreement concept pairs were chosen (see Section 2). Furthermore, the reliability of each of our individual annotators, as indicated by ICC(C,1), is higher than for MayoSRS. The higher average reliability (given by ICC(C,k)) for MayoSRS can be attributed to its much higher number of annotators. For UMNSRS, only one reliability metric is given, which is lower than for our datasets. The comparison shows that the quality of annotations in our datasets is at least as high as, if not higher than, that of existing datasets.
Distribution of relatedness scores
From here onward, we consider the average of all annotations for a concept pair as its relatedness score. Figure 1 illustrates the distribution of relatedness scores in the two datasets. EHR-RelA is highly skewed towards 'unrelated' pairs of concepts. This can be attributed to the random selection of co-occurring concept pairs. In contrast, EHR-RelB was constructed by choosing the most frequently co-occurring pairs of concepts, leading to a balanced distribution of relatedness scores. This is particularly interesting as Pakhomov et al. (2011) found that hand-picking concept pairs to create a balanced dataset is highly challenging. As we show, concepts frequently co-occurring in EHRs naturally result in a balanced dataset.

Due to the small size and skewness of EHR-RelA, we consider and recommend only EHR-RelB as a new benchmark dataset. The analyses and experiments in the following sections therefore investigate EHR-RelB only.

Concept Coverage
Clearly, our new EHR-RelB dataset is larger than existing ones in terms of the number of concept pairs and unique concepts. In this section, we further investigate the types of concepts in EHR-RelB compared to existing datasets. Since UMNSRS-Rel and UMNSRS-Sim consist of nearly the same concept pairs, their concept coverage is very similar. We thus only present results for the relatedness dataset UMNSRS-Rel.

Mapping SNOMED IDs to UMLS CUIs (Concept Unique Identifiers)
The (Mini)MayoSRS as well as the UMNSRS-Rel datasets were constructed in terms of UMLS concepts (Bodenreider, 2004). In contrast, our new dataset is made of SNOMED concepts. To compare existing datasets with EHR-RelB, we thus map all SNOMED IDs in our new benchmark to UMLS CUIs, as detailed in Algorithm 1. To get all CUIs associated with a SNOMED code (line 2) and to find preferred SNOMED terms for a CUI (line 6), the UMLS REST API 3 is used.
Since the SNOMED IDs in EHR-RelB are obtained from Read codes, some of them are not contained in the SNOMED-CT International version, as they are from the SNOMED-CT United Kingdom release, which is not included in UMLS. Therefore, some SNOMED IDs in EHR-RelB cannot be mapped to a UMLS CUI. The mapping results in 3225 pairs of UMLS concepts (out of the 3630 SNOMED pairs).
Note that expressing our new benchmark dataset in terms of UMLS CUIs is not only useful for the comparison with existing datasets, but also allows for the application of CUI embedding models (Henry et al., 2018; Henry et al., 2019; Park et al., 2019) for predicting concept relatedness.

Semantic types
Pakhomov et al. (2010) constructed their concept pairs in the UMNSRS datasets by choosing concepts with the semantic types 'drug', 'disorder', and 'symptom' and combining them so as to obtain a balanced number of semantic type combinations. We chose concepts tagged as 'presenting complaint', 'diagnosis', or 'symptom' in the EHRs, but did not use these tags to inform the creation of concept pairs. We thus analyse the semantic types of all UMLS CUIs in EHR-RelB as well as in existing datasets.

Many CUIs have more than one semantic type. Since UMLS semantic types are organised in a hierarchical semantic network, we determine the most specific common ancestor of a CUI's semantic types and choose this to be its unique semantic type. Again, we make use of the UMLS REST API.

Figure 2 shows the distribution of the most common semantic type combinations (those making up more than 3% of a dataset) in EHR-RelB compared to the existing MayoSRS and UMNSRS-Rel datasets. Note that in UMLS the semantic type 'symptom' is a specific type of 'finding', so combinations involving these two semantic types are similar (e.g. finding-symptom is similar to symptom-symptom).
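The most-specific-common-ancestor step can be sketched over a toy parent-pointer hierarchy as follows (in our pipeline the actual hierarchy is queried via the UMLS REST API; the node names below are illustrative, not real UMLS semantic types):

```python
def ancestors(node, parent):
    """Path from node up to the root (node itself first)."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def most_specific_common_ancestor(nodes, parent):
    """The common ancestor deepest in the hierarchy, i.e. closest to
    the given nodes."""
    paths = [ancestors(n, parent) for n in nodes]
    common = set(paths[0]).intersection(*map(set, paths[1:]))
    return min(common, key=lambda n: paths[0].index(n))

# Toy fragment of a semantic-type hierarchy (illustrative only).
parent = {"symptom": "finding", "finding": "phenomenon",
          "disease": "phenomenon"}
```

A CUI typed as both 'symptom' and 'finding' would thus be assigned 'finding', while one typed 'symptom' and 'disease' would fall back to their common ancestor.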

Distribution of semantic type combinations
The most frequently occurring semantic types in EHR-RelB are 'symptom' and 'disease', matching the three concept tags used to filter the IMRD EHRs. 21% of concept pairs in EHR-RelB are of type finding-finding (or the similar finding-symptom), 13% are disease-disease, and 11% combine the two semantic types. The remaining concept pairs belong to one of 172 less frequent semantic type combinations.
As expected, UMNSRS-Rel exhibits an even distribution of semantic type combinations, in particular of 'disease', 'symptom', and 'chemical'. Interestingly, none of the concepts in UMNSRS-Rel are of semantic type 'clinical drug' (and 'chemical' and 'clinical drug' are only vaguely related as descendants of 'physical object'). In contrast to EHR-RelB and UMNSRS-Rel, less than 4% of concept pairs in MayoSRS are of type symptom-symptom (or similar combinations with 'finding') and are thus not represented in Figure 2. MayoSRS also has 10% of concept pairs belonging to types disease-pathological function and 4% to disease-neoplastic process, which occur much less frequently in the other datasets. Note however that 10% of MayoSRS constitutes only 10 concept pairs, whereas 10% of EHR-RelB is 363 concept pairs. MiniMayoSRS consists of only 29 concept pairs, but includes 19 different semantic type combinations, so nearly all combinations occur only once. It is thus not included in Figure 2.
There are 28 semantic type combinations in MayoSRS and UMNSRS-Rel that do not occur in EHR-RelB, 13 of these contain a 'chemical' semantic type, which is not represented in EHR-RelB. Our new benchmark EHR-RelB comprises 148 combinations of semantic types not present in the existing datasets.
Our analysis shows 1) that the distribution of most frequently co-occurring semantic types in EHRs, as given in EHR-RelB, is similar to that of the manually constructed existing datasets, and 2) that our new EHR-RelB benchmark complements semantic type combinations in existing datasets.

Semantic type combinations versus relatedness scores
Having analysed the distribution of semantic type combinations, we investigate whether any of the datasets has a bias of relatedness scores for the different semantic type combinations. In other words, is the semantic type combination a good predictor of concept relatedness? To answer this question, we compute the median relatedness score for each semantic type combination in a dataset. This is used as a baseline, predicting for each concept pair the median score of its semantic type combination. The performance of this baseline is evaluated in terms of Spearman's correlation on concept pairs with a semantic type combination occurring more than once.

Table 4: Performance of the semantic type baselines for each dataset.
                        MayoSRS   UMNSRS-Rel   EHR-RelB
Spearman's correlation    0.46       0.23        0.33

Table 4 shows that the semantic type combination of a concept pair is not a good predictor of relatedness in our new benchmark dataset or UMNSRS-Rel. This indicates that the relatedness scores in the datasets are not biased by semantic types. Note that this is the case despite the fact that the baselines are "trained" on the same data used for testing. The higher correlation for the MayoSRS dataset can be attributed to its small size: concept pairs with a semantic type combination that occurs only two or three times, which is the case for many concept pairs in the MayoSRS dataset, are likely to have a more accurate median prediction than concept pairs belonging to a high-frequency combination. We omit the MiniMayoSRS dataset as its small size does not allow for a meaningful comparison.
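The baseline construction can be sketched as follows (the function name and input format are ours; the resulting gold/predicted lists would then be compared with Spearman's ρ, e.g. via `scipy.stats.spearmanr`):

```python
from collections import defaultdict
from statistics import median

def semantic_type_baseline(pairs):
    """pairs: list of (semantic_type_combination, relatedness_score).
    Predict for each pair the median score of its semantic type
    combination and return (gold, predicted) lists, restricted to
    combinations occurring more than once."""
    by_combo = defaultdict(list)
    for combo, score in pairs:
        by_combo[combo].append(score)
    gold, pred = [], []
    for combo, score in pairs:
        if len(by_combo[combo]) > 1:
            gold.append(score)
            pred.append(median(by_combo[combo]))
    return gold, pred
```

Note that, as stated above, the medians are computed on the same pairs they are evaluated on, so this baseline is deliberately optimistic.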

Concept specificity
The previous section showed that our new EHR-RelB benchmark complements existing datasets in terms of the semantic types of concepts. Another interesting aspect of concepts is their specificity, i.e. whether they are very general or very specific. Given a hierarchical organisation of concepts, specificity can be defined in a straightforward way as the length of a concept's shortest path from the root. Since UMLS has no hierarchy of its own, we choose the SNOMED-CT hierarchy to measure specificity.
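Since SNOMED-CT concepts can have multiple parents, the shortest path from the root is naturally found with a breadth-first search. A sketch (the `children` mapping and root label are illustrative assumptions, not the real SNOMED-CT data):

```python
from collections import deque

def specificity(concept, children, root):
    """Length of the shortest is-a path from `root` to `concept`
    (0 for the root itself); None if unreachable. `children` maps a
    concept to its direct descendants; breadth-first search handles
    multiple parents correctly."""
    queue = deque([(root, 0)])
    visited = {root}
    while queue:
        node, depth = queue.popleft()
        if node == concept:
            return depth
        for child in children.get(node, ()):
            if child not in visited:
                visited.add(child)
                queue.append((child, depth + 1))
    return None
```

For a CUI mapped to several SNOMED IDs, the minimum of their specificities is taken, as described below.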

UMLS CUIs to SNOMED IDs
To compare specificity in EHR-RelB with existing datasets, we map the UMLS CUIs in the existing datasets to SNOMED IDs. As in Algorithm 1, line 6, all SNOMED IDs whose preferred term is associated with the CUI in question are obtained. If there are no such SNOMED IDs, the CUI's name as given in the dataset is used to search for SNOMED IDs. We thus obtain a list of SNOMED IDs for each CUI. The specificity of a concept is then computed as the shortest root path of any SNOMED ID in the list. Note that for some CUIs in the UMNSRS-Rel dataset, it is not possible to find a matching SNOMED ID. We thus had to exclude 19 concept pairs from the analysis.

Figure 3 illustrates the distribution of concept specificity combinations in EHR-RelB compared to UMNSRS-Rel and MayoSRS. We observe that concepts in existing datasets do not go beyond a specificity of 11, whereas our new benchmark EHR-RelB contains concepts with a maximum specificity of 14. Furthermore, EHR-RelB also covers more general concepts: the most general concept combination has specificities of two and three. The most frequent combination in EHR-RelB is of concepts with specificity six and six, the same as in MayoSRS. In UMNSRS-Rel, the most frequent combination is of slightly more general concepts with a specificity of five and five.
Similar to the semantic type analysis, this evaluation shows that our new EHR-RelB benchmark goes beyond existing datasets in terms of concept coverage as it adds more specific concepts while also containing very general ones. It also shows that existing datasets do not cover the breadth of concepts frequently occurring in EHRs. It is thus questionable how well the performance of concept relatedness models tested on existing datasets would generalise to real-world EHR concept retrieval.

Experiments with SOTA Embeddings
As an initial experimental evaluation on our dataset, we evaluate the 13 state-of-the-art open-source biomedical word embeddings tested by Schulz and Juric (2020) on existing datasets: PMC, PM, PP, and PPW by Pyysalo et al. (2013), ASQ by Kosmopoulos et al. (2016), and LTL2 and LTL30 by Chiu et al. (2016), among others. We do not consider the sentence embeddings tested by Schulz and Juric (2020) as they showed poor performance on existing datasets.

Table 5 shows the performance of each embedding on EHR-RelB in terms of Spearman's correlation. Note that the performance is computed on a subset of 3350 out of the total 3630 concept pairs which could be embedded by all embeddings. For each embedding, we use both fuzzy Jaccard similarity (Zhelezniak et al., 2019) and the standard average cosine as a similarity measure between vectors. We observe that using fuzzy Jaccard similarity yields consistently higher performance for all embeddings. LTL30 has the highest performance with a correlation of 0.49. Furthermore, it significantly outperforms4 all but two embeddings (it does not outperform ASQ or extr). Schulz and Juric (2020) showed that most existing datasets are too small to observe significant differences between embeddings. Our results demonstrate that EHR-RelB is a promising new benchmark large enough to observe significant performance differences.
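The two similarity measures can be sketched as follows. The fuzzy Jaccard sketch follows the DynaMax-Jaccard formulation of Zhelezniak et al. (2019) under the assumption that this is the variant used; function names are ours.

```python
def _dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def fuzzy_jaccard(xs, ys):
    """DynaMax-Jaccard similarity between two multi-word concepts,
    each given as a list of word vectors: max-pool both onto the
    universe of all their word vectors, then take a fuzzy Jaccard
    ratio of the pooled features."""
    universe = xs + ys
    a = [max(_dot(x, u) for x in xs) for u in universe]
    b = [max(_dot(y, u) for y in ys) for u in universe]
    num = sum(min(ai, bi) for ai, bi in zip(a, b))
    den = sum(max(ai, bi) for ai, bi in zip(a, b))
    return num / den

def average_cosine(xs, ys):
    """Baseline: cosine between the mean word vectors of each concept."""
    mx = [sum(c) / len(xs) for c in zip(*xs)]
    my = [sum(c) / len(ys) for c in zip(*ys)]
    return _dot(mx, my) / (_dot(mx, mx) ** 0.5 * _dot(my, my) ** 0.5)
```

A concept such as 'alcoholic liver disease' would be passed in as the list of its word vectors.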
Compared to existing datasets, the performance of embeddings on EHR-RelB is lower. The best of the 13 embeddings yields a Spearman's correlation of 0.57 on MayoSRS and 0.59 on UMNSRS-Rel (Schulz and Juric, 2020). To investigate possible reasons for the lower performance, we measure the performance of embeddings on a further subset of 2978 concept pairs, excluding concept pairs with high disagreement between annotators (three different scores assigned). However, this only marginally improves performance, indicating that low-agreement concept pairs are not a source of the lower model performance. A possible explanation for the lower performance is that EHR-RelB consists of 89% multi-word concepts, whereas MayoSRS has only 47% and UMNSRS-Rel 0%. Representing multi-word concepts with word embeddings is likely to introduce noise, whereas representing single-word concepts does not.
The Human Upper Bound (HUB), i.e. the maximum Spearman's correlation achieved by any annotator with the mean rating, is 0.88. Note that this is a slightly biased metric, as the mean rating includes the annotator's own rating. If the HUB is instead computed by comparing an annotator's rating with the mean rating of the other annotators, it is lower at 0.70. The HUB shows that there is large scope for improving relatedness models, but that performance scores of 0.9 or higher should not be expected.
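Both HUB variants can be sketched in one function; the correlation measure is passed in as a parameter (Spearman's ρ in our case), and the function name is ours.

```python
from statistics import mean

def human_upper_bound(ratings, corr, leave_one_out=False):
    """ratings: one score list per annotator, aligned by concept pair.
    HUB = max over annotators of corr(annotator, mean rating); with
    leave_one_out=True the annotator's own scores are excluded from
    the mean, removing the bias of the simpler variant."""
    best = float("-inf")
    for i, ann in enumerate(ratings):
        others = [r for j, r in enumerate(ratings)
                  if not (leave_one_out and j == i)]
        ref = [mean(col) for col in zip(*others)]
        best = max(best, corr(ann, ref))
    return best
```

On our annotations, the plain variant yields 0.88 and the leave-one-out variant 0.70, as reported above.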
Our initial experiments show that EHR-RelB is a challenging new benchmark for the performance analysis of concept relatedness models.

Conclusions
We presented a novel biomedical concept relatedness dataset sampled from EHR data, thus ensuring its relevance to EHR retrieval tasks. It is six times bigger than existing datasets, has high quality annotations, and complements existing datasets in terms of concept coverage. Initial experiments showed that it is a challenging new benchmark for state-of-the-art biomedical word embedding models.
Although our benchmark is much larger than existing datasets, we hope that this work inspires others to use our methodology to build even larger ones. As explained, our dataset covers around 3,000 unique concepts, whereas SNOMED-CT consists of close to 350,000 concepts. We here focused on the most frequently co-occurring concept pairs; it would be interesting to expand this to less frequent pairs in future work. This could also involve focusing on specific areas of medicine.
Since our new benchmark consists of concept pairs expressed as 1) biomedical terms, 2) SNOMED IDs, and 3) UMLS CUIs, it can be used as a test bed for a large variety of concept representation models. In our initial experiments, we only considered word embedding models based on terms. In future work, it will be interesting to evaluate UMLS concept embeddings (Yu et al., 2017;Beam et al., 2018;Henry et al., 2019;Park et al., 2019) as well as graph-embeddings (Crichton et al., 2018;Agarwal et al., 2019).
Our work was motivated by the retrieval of information in EHRs related to a patient's presenting complaint. However, the usage of this benchmark goes far beyond this motivation. Coding in EHRs is not always perfect; for example, doctors do not always code both symptoms and diagnoses. Enabling the search for related information is thus crucial to overcome the challenges associated with missing data.