Comparing the Intrinsic Performance of Clinical Concept Embeddings by Their Field of Medicine

Pre-trained word embeddings are becoming increasingly popular for natural language processing tasks. This includes medical applications, where embeddings are trained for clinical concepts using specific medical data. Recent work continues to improve on these embeddings. However, no one has yet sought to determine whether these embeddings work as well for one field of medicine as they do in others. In this work, we use intrinsic methods to evaluate embeddings from the various fields of medicine as defined by their ICD-9 systems. We find significant differences between fields, and motivate future work to investigate whether extrinsic tasks will follow a similar pattern.


Introduction
The application of natural language processing (NLP) and machine learning to medicine presents an exciting opportunity for tasks requiring prediction and classification. Examples so far include predicting the risk of suicide or accidental death after a patient is discharged from general hospitals (McCoy et al., 2016) or classifying which patients have peripheral vascular disease (Afzal et al., 2017). A common resource across NLP for such tasks is high-dimensional vector word representations. These word embeddings include the popular word2vec system (Mikolov et al., 2013), which was initially trained on general English text using a skip-gram model on a Google News corpus.
Due to considerable differences between the language of medical text and general English writing, prior work has trained medical embeddings using specific medical sources. Generally, these approaches have trained embeddings to represent medical concepts according to their 'concept unique identifiers' (CUIs) in the Unified Medical Language System (UMLS) (Bodenreider, 2004). Words in a text can then be mapped to these CUIs (Yu and Cai, 2013). Various sources have been used, such as medical journal articles, clinical patient records, and insurance claims (De Vine et al., 2014), (Minarro-Giménez et al., 2014), (Choi et al., 2016).
Prior authors have sought to improve the quality of these embeddings, for example by using different training techniques or more training data (Beam et al., 2018). To judge the quality of these embeddings, they have primarily used evaluation methods quantifying intrinsic qualities, such as their ability to predict drug-disease relations noted in the National Drug File - Reference Terminology (NDF-RT) ontology (Minarro-Giménez et al., 2014), or whether similar types of clinical concepts had vectors with high cosine similarity (Choi et al., 2016).
To date these embeddings have been both trained and evaluated on general medical data. That is, no fields of medicine were specified or excluded; data could be from an obstetrician delivering a baby, a cardiologist placing a stent, or a dermatologist suggesting acne treatment. It is unclear how well such embeddings perform for a specific field of medicine. For example, we can consider psychiatry, the field of medicine concerned with mental illnesses such as depression or schizophrenia. Prior work has shown that psychiatric symptoms are often described in a long, varied, and subjective manner (Forbush et al., 2013) which may present a particular challenge for training these embeddings and NLP tasks generally.
As these pre-trained embeddings may increasingly be used for down-stream NLP tasks in specific fields of medicine, we seek to determine whether embeddings from one field perform well or poorly relative to others. Specifically, we aim to follow prior work using intrinsic evaluation methods, comparing the geometric properties of embedding vectors against others given known relationships. This will offer a foundation for future work that may compare the performance on extrinsic NLP tasks in different medical fields. Finding relative differences may suggest that certain medical fields would benefit from embeddings trained on data specific to their field, or from domain adaptation techniques as sometimes used in the past (Yu et al., 2017).

Sets of Embeddings
We sought to compare a variety of clinical concept embeddings trained on medical data. Table 1 contains details of the sets compared in this project, all of which are based on word2vec. We obtained DeVine200 (De Vine et al., 2014), ChoiClaims300, and ChoiClinical300 (Choi et al., 2016) all from the latter's GitHub. We downloaded BeamCui2Vec500 (Beam et al., 2018) from this site. Unfortunately, we were unable to obtain other sets of embeddings mentioned in the literature (Minarro-Giménez et al., 2014), (Zhang et al., 2018), (Xiang et al., 2019).

Determining a Field of Medicine's Clinical Concepts
A clinical concept's corresponding field of medicine is not necessarily obvious. In order to have an objective and unambiguous classification, we utilized the ninth revision of the International Statistical Classification of Diseases and Related Health Problems (ICD-9) (Slee, 1978). This is a widely used system of classifying medical diseases and disorders, dividing them into seventeen chapters representing medical systems/categories such as mental disorders, or diseases of the respiratory system. While the tenth revision is available, we chose the ninth based on prior work using it, and the pending release of the eleventh. We use these ICD9 systems to define the different medical fields.
We determined a CUI's field of medicine according to a CUI-to-ICD9 dictionary available from the UMLS (Bodenreider, 2004). We consider a pharmacological substance to be related to a field of medicine if it treats or prevents a disease with an ICD9 code within that field's ICD9 system. We determine this by using the NDF-RT dictionary, which maps CUIs of substances to the CUIs of conditions they treat or prevent, and then convert these CUIs to the ICD9 systems as before. As such, a CUI representing a drug may have multiple ICD9 systems and therefore multiple medical fields.
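The two-step lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the identifiers, system labels, and dictionary layout are hypothetical placeholders, whereas the real mappings come from the UMLS CUI-to-ICD9 dictionary and NDF-RT.

```python
from collections import defaultdict

# Hypothetical toy data (placeholder identifiers, not real CUIs).
# Disease CUI -> ICD9 system, standing in for the UMLS CUI-to-ICD9 dictionary.
disease_system = {
    "C_DEPRESSION": "Mental Disorders",
    "C_BIPOLAR": "Mental Disorders",
    "C_EPILEPSY": "Diseases of the Nervous System and Sense Organs",
}

# Drug CUI -> disease CUIs it may treat or prevent, standing in for NDF-RT.
treats_or_prevents = {
    "C_FLUOXETINE": ["C_DEPRESSION"],
    "C_VALPROIC_ACID": ["C_BIPOLAR", "C_EPILEPSY"],
    "C_UNMAPPED_DRUG": ["C_NO_ICD9_CODE"],
}


def drug_systems(treats_or_prevents, disease_system):
    """Assign each drug every ICD9 system containing a disease it treats or prevents."""
    systems = defaultdict(set)
    for drug, diseases in treats_or_prevents.items():
        for disease in diseases:
            system = disease_system.get(disease)
            if system is not None:  # skip diseases without an ICD9 mapping
                systems[drug].add(system)
    return dict(systems)


fields_of_drug = drug_systems(treats_or_prevents, disease_system)
print(fields_of_drug)
```

In this toy example the anticonvulsant placeholder ends up in two systems, while a drug whose only indication lacks an ICD9 code is left unassigned.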

Evaluation Methods
We sought to compare multiple methods for evaluating the quality of a medical field's embeddings based on prior work. We were unable to use Yu et al's (2017) method, based on comparing the correlation of vector cosine similarity against human judgements from the UMNSRS-Similarity dataset (Pakhomov, 2018), due to there being too few examples across many medical fields. The code for all implemented methods will be publicly available upon publication of this work from the first author's GitHub.
Medical Relatedness Measure (MRM) This method from Choi et al (2016) is based on quantifying whether concepts with known relations are neighbours of each other. They use known relationships between drugs and the diseases they treat or prevent, and also the relations between diseases that are grouped together in the Clinical Classifications Software (CCS) hierarchical groupings, a classification from the Agency for Healthcare Research and Quality (Cli). The scoring utilizes Discounted Cumulative Gain, which attributes a diminishing score the further away a known relationship is found if within k neighbours.
In our implementation, we calculate the Medical Relatedness Measure (MRM) based on the 'coarse' groupings from the CCS hierarchies. Scores are calculated for CUIs that represent diseases with a known ICD9 code. The mean MRM is then calculated for all CUIs within a given ICD9 system. The implementation was adapted from Python 2.7 code available from the original author's GitHub. The MRM is defined in terms of the following quantities: V are the medical conditions, F a field of medicine, V(F) the medical conditions within an ICD-9 system/field of medicine, G the CCS group that medical condition v ∈ V(F) is part of, and V(G) the subset of medical conditions found in this group. The indicator 1_G is 0 or 1 depending on whether v(i), the ith closest neighbour to a condition v, is in the same group. k neighbours are considered.
To illustrate this, consider calculating the MRM for F "Diseases of the Musculoskeletal System". It involves summing the scores for its conditions, such as rheumatoid arthritis (v ∈ V (F )). This condition is part of the CCS-coarse grouping (G), "Rheumatoid arthritis and related disease". This group contains twelve conditions, such as Felty's syndrome and Rheumatoid lung. With Choi et al's choice of k = 40, the score for rheumatoid arthritis would depend on how many of the eleven other conditions in this group are within the 40 nearest neighbours (v(i)) to rheumatoid arthritis, and would give a higher score the nearer they are, the highest being if they are the eleven nearest neighbours.
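The calculation in this example can be sketched in code. The sketch is not the implementation adapted from Choi et al; it assumes the standard discounted gain in which a same-group neighbour at rank i contributes 1/log2(i + 1), and it omits any normalization by group size because the exact form cannot be recovered from the text. The concept names and 2-dimensional vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nearest_neighbours(target, embeddings, k):
    """The k concepts most cosine-similar to `target`, nearest first."""
    others = [c for c in embeddings if c != target]
    others.sort(key=lambda c: cosine(embeddings[target], embeddings[c]), reverse=True)
    return others[:k]


def condition_score(v, group, embeddings, k):
    """Assumed discount: a same-group neighbour at rank i adds 1 / log2(i + 1)."""
    neighbours = nearest_neighbours(v, embeddings, k)
    return sum(1.0 / math.log2(i + 1)
               for i, n in enumerate(neighbours, start=1) if n in group)


def mean_mrm(field_conditions, group_of, embeddings, k=40):
    """Mean score over the conditions V(F) of one field."""
    scores = [condition_score(v, group_of[v], embeddings, k) for v in field_conditions]
    return sum(scores) / len(scores)


# Hypothetical toy data: placeholder names and vectors, not real CUIs.
embeddings = {
    "rheumatoid_arthritis": (1.0, 0.0),
    "feltys_syndrome": (0.9, 0.1),
    "rheumatoid_lung": (0.8, 0.3),
    "asthma": (0.0, 1.0),
}
ra_group = {"rheumatoid_arthritis", "feltys_syndrome", "rheumatoid_lung"}
group_of = {"rheumatoid_arthritis": ra_group}

score = mean_mrm(["rheumatoid_arthritis"], group_of, embeddings, k=2)
print(score)
```

With k = 2, both group members are the two nearest neighbours, so the toy score is the maximum attainable for two neighbours, 1/log2(2) + 1/log2(3).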
Medical Conceptual Similarity Measure (MCSM) The other method used by Choi et al's work evaluates whether embeddings known to be of a particular set are clustered together. They use conceptual sets from the UMLS such as 'pharmacologic substance' or 'disease or syndrome'. Discounted Cumulative Gain is again used, based on whether a CUI has other CUIs of its set within k neighbours.
We reimplement this method, but instead of using the UMLS conceptual sets, we create sets from the ICD9 systems, again giving a score to neighbours that are diseases or drugs from the same field of medicine/ICD9 system. Again, this was adapted from code from Choi et al's GitHub. The Medical Conceptual Similarity Measure (MCSM) is defined analogously to the MRM: F is a medical field/ICD9 system, V(F) the medical conditions within a system, and the indicator 1_F is 0 or 1 depending on whether neighbour v(i) is also in this medical field.
For illustration, consider an example calculating the MCSM for the medical field/system (F) "Infectious and Parasitic Diseases". This involves calculating the score for the medical condition (v) primary tuberculous infection. If rifampin, an antibiotic, was found to be nearby, it would contribute to the MCSM, as it treats conditions in "Infectious and Parasitic Diseases" and so would be classified as being part of this system. On the other hand, if the respiratory illness asthma was one of the k nearest neighbours, it would add nothing to the MCSM score, as it is a disease in a different system, "Diseases of the Respiratory System".
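A minimal sketch of this tuberculosis example follows. It assumes the same 1/log2(i + 1) discount used in Discounted Cumulative Gain and hypothetical names, vectors, and system labels; it is not the code adapted from Choi et al.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mcsm(field, conditions, systems_of, embeddings, k):
    """Mean discounted gain over a field's conditions.

    A neighbour (disease or drug) counts when `field` is among its systems.
    """
    total = 0.0
    for v in conditions:
        ranked = sorted((c for c in embeddings if c != v),
                        key=lambda c: cosine(embeddings[v], embeddings[c]),
                        reverse=True)[:k]
        total += sum(1.0 / math.log2(i + 1)
                     for i, n in enumerate(ranked, start=1)
                     if field in systems_of.get(n, set()))
    return total / len(conditions)


# Hypothetical toy data: placeholder names and 2-d vectors, not real CUIs.
embeddings = {
    "primary_tuberculous_infection": (1.0, 0.0),
    "rifampin": (0.9, 0.2),
    "asthma": (0.5, 0.9),
}
systems_of = {
    "primary_tuberculous_infection": {"Infectious and Parasitic Diseases"},
    "rifampin": {"Infectious and Parasitic Diseases"},
    "asthma": {"Diseases of the Respiratory System"},
}

infectious_mcsm = mcsm("Infectious and Parasitic Diseases",
                       ["primary_tuberculous_infection"],
                       systems_of, embeddings, k=2)
print(infectious_mcsm)
```

Here rifampin is the nearest neighbour and contributes 1/log2(2) = 1, while asthma at rank 2 contributes nothing.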
Significance against Bootstrap Distribution (Bootstrap) Beam et al (2018) also evaluate how well known relationships between concepts are represented by embedding vector similarity. For a given type of known relation, they generate a bootstrap distribution by calculating cosine similarities between randomly sampled embedding vectors of the same classes (e.g. a random drug and disease when evaluating drug-disease relations). For a given known relation, they consider that the embeddings produced an accurate prediction if their cosine similarity is within the top 5% of this distribution, the equivalent of p < 0.05 for a one-sided t-test.
Our implementation considers the may-treat or may-prevent known relationships from the NDF-RT dataset. For each medical field, we calculate the percentage of its known drug-disease relations that meet this criterion. Beam et al have not yet made their code publicly available, so we reimplemented this technique in Python.
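The procedure can be sketched as below. Because neither the original nor the reimplemented code is shown here, the sample count, the seed, the handling of ties, and all names and vectors are assumptions made for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def null_threshold(drugs, diseases, embeddings, n_samples=1000, top_fraction=0.05, seed=0):
    """Smallest similarity still inside the top `top_fraction` of random drug-disease pairs."""
    rng = random.Random(seed)
    sims = sorted(
        cosine(embeddings[rng.choice(drugs)], embeddings[rng.choice(diseases)])
        for _ in range(n_samples)
    )
    n_top = max(1, round(top_fraction * n_samples))
    return sims[n_samples - n_top]


def percent_recovered(known_pairs, threshold, embeddings):
    """Percentage of known drug-disease pairs at least as similar as the threshold."""
    hits = sum(1 for drug, disease in known_pairs
               if cosine(embeddings[drug], embeddings[disease]) >= threshold)
    return 100.0 * hits / len(known_pairs)


# Hypothetical toy data: placeholder names and 2-d vectors, not real CUIs.
embeddings = {
    "drug_1": (1.0, 0.0), "drug_2": (0.7, 0.7), "drug_3": (0.0, 1.0), "drug_4": (-1.0, 0.2),
    "disease_1": (1.0, 0.1), "disease_2": (0.1, 1.0),
    "disease_3": (-0.5, -0.5), "disease_4": (0.3, -1.0),
}
drugs = ["drug_1", "drug_2", "drug_3", "drug_4"]
diseases = ["disease_1", "disease_2", "disease_3", "disease_4"]
known_pairs = [("drug_1", "disease_1"), ("drug_3", "disease_2"), ("drug_2", "disease_3")]

threshold = null_threshold(drugs, diseases, embeddings)
print(threshold, percent_recovered(known_pairs, threshold, embeddings))
```

In practice the known pairs would be restricted to one medical field at a time, so that `percent_recovered` yields that field's Bootstrap score.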
System Vector Accuracy (SysVec) We implement a new, simple method to evaluate a medical field's embeddings. A representative vector is calculated for each medical field/ICD9 system by taking the mean of the normalized embedding vectors of a field's diseases. We then consider all of the drugs known to treat or prevent a disease of a given medical field. A field's System Vector Accuracy is then the percentage of these drugs whose most similar (by cosine similarity) system vector is this field's. A higher score indicates better performance. We implemented this method in Python.
For example, a system vector for "Mental Disorders" would be calculated from the embeddings for diseases such as schizophrenia and major depressive disorder. "Mental Disorders'" System Vector Accuracy is the percentage of its medications (e.g. fluoxetine, risperidone, paroxetine) whose embedding vectors are more similar to the "Mental Disorders" system vector than all others. Fluoxetine is an anti-depressant medication solely used to treat "Mental Disorders", so we would expect its vector to be more similar to this system vector than, say, the system vector representing "Diseases of the Skin and Subcutaneous Tissue". Some drugs treat or prevent diseases in multiple (n) medical fields. For a field, such a drug is classified as being accurately predicted if the field's system vector is amongst the n most similar system vectors. For instance, valproic acid is an anticonvulsant used to treat both mental disorders and those of the nervous system. "Mental Disorders'" System Vector Accuracy would take into account whether its system vector was one of the n=2 most similar system vectors. For further illustration, Table 2 shows an example SysVec calculation.
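The SysVec calculation described above can be sketched as follows. The embeddings, the shortened system labels, and the drug named drug_x are hypothetical toy values chosen so that one drug is misplaced; this is an illustration of the described steps rather than the code used for the reported results.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def system_vectors(diseases_by_system, embeddings):
    """Mean of the normalized disease vectors of each system."""
    vectors = {}
    for system, diseases in diseases_by_system.items():
        unit = [normalize(embeddings[d]) for d in diseases]
        vectors[system] = [sum(col) / len(unit) for col in zip(*unit)]
    return vectors


def sysvec_accuracy(system, drug_systems, sys_vecs, embeddings):
    """Percentage of a system's drugs that rank it within their top n system vectors."""
    drugs = [d for d, systems in drug_systems.items() if system in systems]
    correct = 0
    for drug in drugs:
        n = len(drug_systems[drug])  # number of fields the drug belongs to
        ranked = sorted(sys_vecs,
                        key=lambda s: cosine(embeddings[drug], sys_vecs[s]),
                        reverse=True)
        if system in ranked[:n]:
            correct += 1
    return 100.0 * correct / len(drugs)


# Hypothetical toy data: placeholder names and 3-d vectors, not real CUIs.
embeddings = {
    "schizophrenia": (1.0, 0.1, 0.0), "major_depressive_disorder": (0.9, 0.0, 0.1),
    "epilepsy": (0.1, 1.0, 0.0), "eczema": (0.0, 0.1, 1.0),
    "fluoxetine": (1.0, 0.0, 0.0), "valproic_acid": (0.6, 0.7, 0.0),
    "drug_x": (0.0, 0.0, 1.0),
}
diseases_by_system = {
    "Mental": ["schizophrenia", "major_depressive_disorder"],
    "Nervous": ["epilepsy"],
    "Skin": ["eczema"],
}
drug_systems = {
    "fluoxetine": {"Mental"},
    "valproic_acid": {"Mental", "Nervous"},
    "drug_x": {"Mental"},
}

sys_vecs = system_vectors(diseases_by_system, embeddings)
mental_accuracy = sysvec_accuracy("Mental", drug_systems, sys_vecs, embeddings)
print(mental_accuracy)
```

In this toy set, fluoxetine and valproic acid are counted as correct for the "Mental" system (the latter because the system is within its two most similar vectors), while drug_x lies closest to the "Skin" vector, giving an accuracy of two out of three.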

Comparing Scores
Comparing Sets of Embeddings We calculated the mean scores for an embedding set, only including embeddings with corresponding ICD9 values and present in all of the compared sets. For the MCSM and MRM scores, we conducted two-way paired t-tests between the scores from each embedding set, adjusted with the Bonferroni correction. For the binary Bootstrap and SysVec scores, we judged statistical significance by calculating z-scores and their corresponding Bonferroni corrected p-values.
A negative control set of embeddings was constructed by taking the embeddings from Beam et al (2018) and randomly arranging which clinical concepts an embedding corresponds to.
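Such a negative control can be built with a simple permutation, sketched below. The seed and the toy identifiers are assumptions for illustration; the shuffle keeps every vector but reassigns which concept it belongs to.

```python
import random


def shuffled_control(embeddings, seed=0):
    """Return a copy in which the same vectors are randomly reassigned to concepts."""
    rng = random.Random(seed)
    concepts = list(embeddings)
    vectors = [embeddings[c] for c in concepts]
    rng.shuffle(vectors)  # permute vectors while the concept order stays fixed
    return dict(zip(concepts, vectors))


# Hypothetical toy data: placeholder identifiers, not real CUIs.
embeddings = {
    "concept_a": (1.0, 0.0),
    "concept_b": (0.0, 1.0),
    "concept_c": (0.7, 0.7),
    "concept_d": (-1.0, 0.5),
}
control = shuffled_control(embeddings)
print(control)
```

Because the vectors themselves are unchanged, any score the control achieves reflects chance alignment between concepts and vectors rather than learned structure.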
Comparing Fields of Medicine As the embeddings from Beam et al (2018) are most recent, trained on the most data, and have significantly higher scores than the other embeddings compared, we used these embeddings to compare scores from the different fields of medicine. This set also contained the most embeddings, allowing more embeddings from each field to be compared.
We sought to determine whether a field of medicine's embeddings were significantly worse or better than the average. As such, for each field of medicine we calculated the mean score from each evaluation method. We then used statistical tests to compare a field's scores from a given evaluation method with the same scores from all other fields. For MCSM and MRM scores we used two-tailed t-tests, and for Bootstrap and SysVec, z-scores, all corrected with the Bonferroni correction.
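For the binary Bootstrap and SysVec scores, one common way to obtain such z-scores is a pooled two-proportion z-test, sketched below. The text does not specify the exact test or the number of comparisons used in the Bonferroni correction, so both are assumptions here, as are the counts in the example.

```python
import math


def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Pooled two-proportion z-score and its two-sided normal p-value."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return z, p_value


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical counts: 90 of 100 drugs correct in one field versus 50 of 100 elsewhere.
z, p = two_proportion_z(90, 100, 50, 100)
# Assumed number of comparisons (for example, one per ICD9 system).
p_corrected = bonferroni(p, n_tests=17)
print(z, p_corrected)
```

A field would then be flagged as significantly above or below the others when the corrected p-value falls under 0.05, with the sign of z giving the direction.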
To aggregate a medical field's results, we calculated a 'Net Significance' metric by taking how many of the four methods' scores were significantly above the mean, minus how many were significantly below. We found this more interpretable than other methods such as aggregating normalized scores.
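The Net Significance tally can be written in a few lines. In this sketch the input layout (one tuple per evaluation method holding the field's mean, the comparison mean, and an already Bonferroni-corrected p-value) and the example numbers are hypothetical.

```python
def net_significance(method_results, alpha=0.05):
    """Count methods significantly above the mean minus those significantly below.

    method_results: iterable of (field_mean, overall_mean, corrected_p_value).
    """
    net = 0
    for field_mean, overall_mean, corrected_p in method_results:
        if corrected_p < alpha:
            net += 1 if field_mean > overall_mean else -1
    return net


# Hypothetical results for one field across the four methods.
example_field = [
    (0.80, 0.60, 0.002),  # significantly above
    (0.50, 0.60, 0.004),  # significantly below
    (0.70, 0.60, 0.400),  # not significant
    (0.90, 0.60, 0.001),  # significantly above
]
print(net_significance(example_field))
```

With four methods the metric ranges from -4 to 4, and a value of 0 can mean either no significant differences or an equal number in each direction.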

Differences Between Sets of Embeddings
Comparing the sets of embeddings (Table 3) shows consistent differences. BeamCui2Vec500's scores are the highest across all methods, and this difference is highly significant, with p-value 10^-5 after Bonferroni correction. The ChoiClaims300 embeddings seem next best, and the remaining sets still have much higher scores than those of the negative control.

Differences Between Medical Systems
Differences are also observed between embeddings from the various fields of medicine as represented by the ICD-9 systems (Table 4). For instance, embeddings related to the medical field Mental Disorders have scores significantly above the mean score across all systems for two evaluation methods, while those of the field Symptoms, Signs, and Ill-defined Conditions are significantly below for three. For two medical fields, too few documented drug-disease relationships were available, so scores were not calculated for the methods that rely on these relationships.

Discussion and Future Direction
To our knowledge, this is the first investigation into whether clinical concept embeddings from a given field of medicine perform relatively well or poorly compared to others. We conducted this investigation comparing available sets of such embeddings, using a variety of previously described intrinsic evaluation methods in addition to a new one. Given that one set of embeddings performed better than others, we used this set to compare the different fields of medicine, and found significant differences between various fields.
The superior performance of one set of embeddings, those from Beam et al (2018), is consistent with the depth and breadth of data used to train these embeddings. Training used three different types of data: health insurance claims, clinical narratives, and full texts from medical journals. The size of the dataset was also much larger than that of the others. Our work validates their findings that their embeddings offer the best performance. However, it would be interesting to also consider the recent clinical concept embeddings developed by Xiang et al (2019). They use a similar amount of data (50 million) as Beam et al, using a large dataset from electronic health records, and apply a novel method to incorporate time-sensitive information. At the time of submission, we were unable to obtain their embeddings, and so leave this comparison to future work.
Examining the differences between fields of medicine, we note that the poor performance of embeddings from the system "Symptoms, Signs, and Ill-defined Conditions" may support the validity of the results. This collection of miscellaneous medical conditions would not be expected to have the intrinsic vector similarity and cohesion measured by our evaluation methods.
Further work may explore why the other systems have varied performance. We wonder if the observed results correlate with possible distinctiveness of the various medical fields. For example, one of the best performing systems was "Neoplasms". The conditions in this field are often unambiguous (a cancer like non-small cell lung cancer has little other meaning) and the drugs used for these diseases tend to be similarly specific. On the other hand, poorly performing systems such as "Diseases of the Skin and Subcutaneous Tissue" and "Diseases of the Musculoskeletal Systems and Connective Tissue" often utilize immunosuppressant medications that are used across many fields of medicine. Future work could investigate this conjecture by comparing scores when restricting what clinical concepts are compared, such as only common or distinct medications.
This work evaluated embeddings using intrinsic measures of embedding quality. This presents some advantages, but also the most obvious limitation and direction for future work. These intrinsic methods allowed a consistent evaluation to be carried out between medical fields, and allowed a wide variety of embedding sets to be compared. The methods all evaluate qualities that well-trained embeddings should have, though still represent artificial use-cases. Evaluating these embeddings on extrinsic, down-stream tasks may provide more practically relevant comparisons. However, these tasks will need to be comparable and available for multiple medical fields. For instance, the recent work by Xiang et al (2019) compared embeddings trained by different methodologies on a task predicting the onset of heart failure (Rasmy et al., 2018). This would be an appropriate task to judge embeddings from "Diseases of the Circulatory System"; others would be needed for other systems. We also plan to investigate the validity of these intrinsic evaluation methods by comparing them to extrinsic results.
Injury and Poisoning 0.59 9.09 0.75 0 0
Table 4: Comparison of mean scores using different evaluation methods for the fields of medicine as represented by their ICD-9 system. The row All Systems shows the mean score for each method across embeddings from all systems. A bold score indicates that a system's score was significantly above the All Systems score, while an italic score indicates it was below. Significance is judged by having a p-value <0.05 after Bonferroni correction. Net Significance is the number of these significant differences above the All Systems score minus the number below. A system's score is not calculated if there are fewer than ten examples for a method. See Methods section for evaluation method abbreviations. All scores in this table are calculated using the embeddings from Beam et al.
Another future direction could be to investigate what could be done to improve performance in the fields with lower scores. For instance, Zhang et al (2018) used domain adaptation techniques for psychiatric embeddings, and this could also be carried out for those systems we identified as doing poorly. Alternatively, one could train embeddings solely on data from one field of medicine and investigate how this affects performance.