WikiUMLS: Aligning UMLS to Wikipedia via Cross-lingual Neural Ranking

We present our work on aligning the Unified Medical Language System (UMLS) to Wikipedia, to facilitate manual alignment of the two resources. We propose a cross-lingual neural reranking model to match a UMLS concept with a Wikipedia page, which achieves a recall@1 of 72%, a substantial improvement of 20% over word- and char-level BM25, enabling manual alignment with minimal effort. We release our resources, including ranked Wikipedia pages for 700k UMLS concepts, and WikiUMLS, a dataset for training and evaluation of alignment models between UMLS and Wikipedia collected from Wikidata. This will provide easier access to Wikipedia for health professionals, patients, and NLP systems, including in multilingual settings.


Introduction
The Unified Medical Language System (UMLS) is a controlled vocabulary resource, enabling standardisation of biomedical terminology and interoperability of electronic health systems across the world (Bodenreider, 2004). UMLS has good coverage in only a handful of languages such as English, impeding its uptake in health systems in different language settings (Markó et al., 2006). In addition, concept definitions in UMLS are either missing or very short, limiting its use as a medical encyclopedia. Wikipedia, on the other hand, is a crowd-sourced encyclopedia that is a primary source of online medical knowledge for practitioners, students, and the general public (Heilman et al., 2011; Heilman and West, 2015; Shafee et al., 2017; Murray, 2019). Wikipedia hosts an increasing number of health-related articles (Matheson-Monnet and Matheson, 2017), and its articles exist in hundreds of languages, far exceeding the multilingual support in UMLS. Together, these characteristics point to the value of Wikipedia for patient-facing health information.
Our goal in this work is to align UMLS concepts to their corresponding Wikidata entities and, through those, to the available Wikipedia pages, to expand the language support for UMLS terminology with little effort. This will have a direct impact on patients worldwide by facilitating adoption of UMLS (including the clinical terminology SNOMED-CT (Donnelly, 2006) and Medical Subject Headings, MeSH (Aronson and Lang, 2010)) in international healthcare systems, and by supporting the medical information seeking needs of patients with varying linguistic backgrounds. For example, a patient whose native language is not English might receive a discharge summary in English with mentions of symptoms, diagnoses, or medications in formal clinical language. This language may not match Wikipedia titles exactly, making search hard (and noisy). Aligning UMLS and Wikipedia facilitates such information seeking, makes medical information accessible in a patient's native language, and supports the need for patient access to information in more consumer-friendly terms (Zeng et al., 2001; Smith and Stavri, 2005). Most common medical terms are covered in Wikipedia; for instance, it is estimated that 80% of SNOMED-CT terms have a corresponding Wikipedia page (Ngo et al., 2019).
Our contributions are as follows: (1) we use the multilingual resources in UMLS and Wikidata in a neural ranking model to retrieve Wikipedia articles given a UMLS concept, and achieve a recall@1 of 72%, representing a 20% increase over a BM25 model; (2) we release our new WikiUMLS dataset (collected from Wikidata) for training and evaluation of UMLS-to-Wikipedia alignment models; and (3) we release the output of our cross-lingual ranking model as a large-scale silver-standard alignment of UMLS and Wikipedia, facilitating new applications that can make use of the alignment.

Background
UMLS combines multiple vocabularies into a unified coding system by mapping terms referring to the same concept to a single concept ID (CUI). A single CUI can have several aliases in different languages from various vocabularies. A large proportion of UMLS concepts (roughly 100k) have descriptions that are very short (mostly one sentence), and so are not adequate for dissemination of knowledge or information access. Medical entity pages in Wikipedia have longer descriptions, are connected to other entities through hyperlinks, and are enriched by the Wikidata (Vrandecic and Krötzsch, 2014) knowledge graph. Wikipedia is rich in content and multilinguality, as anyone can contribute to or revise an article. For example, the diabetes Wikipedia page exists in more than 134 languages, compared to only a dozen languages in UMLS.
Knowledge-base alignment: There are two main approaches for aligning knowledge bases such as Wikipedia and UMLS: (1) embedding-based alignment, and (2) string and semantic matching. In embedding-based methods, entity embeddings are learnt from text co-occurrence statistics or knowledge-graph (KG) relations separately, and are then aligned using a seed alignment dictionary (Mikolov et al., 2013; Chen et al., 2017), or adversarial learning (Qu et al., 2019). However, the text-based methods suffer from non-comparable corpora (Ormazabal et al., 2019), and the KG methods have not been tried on Wikidata because of its large scale and entity variety.
String and semantic matching methods are based on similarity between the entity names or descriptions in the two knowledge-bases (KBs). Each entity in KB$_1$ is used as a query against all the entities in KB$_2$ using a scoring method such as cosine similarity or BM25. Entities are represented by a bag of words or a bag of character n-grams. The results of these cheap methods can be further improved by supervised (neural) ranking models (Guo et al., 2016; Rao et al., 2019; Wang et al., 2018). In particular, when query and candidate documents are represented by an encoder such as BERT, pre-trained on massive amounts of text data, neural rerankers perform substantially better than IR-based methods (Nogueira and Cho, 2019; Akkalyoncu Yilmaz et al., 2019). Our methods fit within this group.
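As an illustration of the string-matching family described above, the following is a minimal sketch of BM25 scoring over bag-of-words alias sets. The function name `bm25_scores`, the parameter defaults `k1=1.2` and `b=0.75`, the non-negative IDF variant, and the toy alias lists are all assumptions made for this example; they are not taken from the system described in this paper.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score each tokenised document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue  # BM25 only rewards exact term matches
            # Non-negative IDF variant (an assumption of this sketch).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy alias sets standing in for KB 2 entities (invented for illustration).
kb2 = [
    "gastroesophageal reflux disease gerd".split(),
    "gerd müller footballer".split(),
    "diabetes mellitus".split(),
]
query = "gastroesophageal reflux disease".split()
scores = bm25_scores(query, kb2)
best = max(range(len(kb2)), key=scores.__getitem__)
print(best, scores)
```

Documents that share no exact term with the query receive a score of zero, which is the weakness that motivates character-level and neural alternatives.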

Method
Given a UMLS concept $c_i$ represented by query $q_i = \{t_i^1, \ldots, t_i^N\}$, where $t_i^n$ is an alias term for $c_i$ in UMLS, we use the English Wikipedia augmented with multilingual Wikidata as a document collection $D = \{d_1, \ldots, d_{|D|}\}$, to retrieve the page $d_j$ matching concept $c_i$. Each page is represented by its title, text (only for candidate generation), and multilingual aliases from Wikidata. We follow a two-stage retrieval procedure: (1) candidate generation, where an IR method (e.g. BM25) is used to retrieve related documents; and (2) reranking of the top $k$ candidates via a learn-to-rank method (Liu, 2009).
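The two-stage procedure can be sketched as a function that composes a candidate generator with a reranker. This is only a schematic under stated assumptions: `two_stage_retrieve`, `toy_generate`, `toy_rerank_score`, and the `PAGES` dictionary are hypothetical names with invented data, and the toy scorers (alias overlap and Jaccard similarity) merely stand in for BM25 and the neural reranker.

```python
from typing import Callable, List, Sequence, Tuple


def two_stage_retrieve(
    query: Sequence[str],
    generate: Callable[[Sequence[str], int], List[str]],
    rerank_score: Callable[[Sequence[str], str], float],
    k: int = 64,
) -> List[Tuple[str, float]]:
    """Stage 1 retrieves up to k candidates; stage 2 re-scores and sorts them."""
    candidates = generate(query, k)
    scored = [(page, rerank_score(query, page)) for page in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy collection: page title -> alias set (invented for illustration).
PAGES = {
    "Gastroesophageal reflux disease": {
        "gerd", "gastroesophageal reflux disease", "acid reflux"},
    "Gerd Müller": {"gerd", "gerd müller"},
    "Diabetes": {"diabetes", "diabetes mellitus"},
}


def toy_generate(query, k):
    # Stand-in for stage 1: keep any page sharing at least one alias.
    q = set(query)
    return [title for title, aliases in PAGES.items() if q & aliases][:k]


def toy_rerank_score(query, title):
    # Stand-in for stage 2: Jaccard similarity between alias sets.
    q, aliases = set(query), PAGES[title]
    return len(q & aliases) / len(q | aliases)


ranking = two_stage_retrieve(
    ["gerd", "gastroesophageal reflux disease"], toy_generate, toy_rerank_score)
print(ranking)
```

Keeping the two stages behind plain callables makes it easy to swap in a different candidate generator or reranker without touching the pipeline.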

Candidate Generation
We index the Wikipedia collection $D$ using Lucene, and build query $q_i$ from UMLS to retrieve the top $k=64$ relevant pages. We use a Boolean disjunction between all alias terms in UMLS, and search in the title, text, and multilingual aliases fields in $D$. BM25 relies on exact term matches, and small variations can result in a mismatch. We therefore also experimented with a character n-gram method (TFIDF$_{char}$) successfully used in Murty et al. (2018) for candidate generation in medical entity linking. We build a bag of character n-grams ($n \in [1, 5]$) weighted by TF-IDF within term boundaries, and use cosine similarity between $q_i$ and each $d \in D$ (excluding page text) to generate the top $k=64$ candidates.
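The character n-gram variant can be sketched in a few lines. The code below is an illustrative approximation, not the implementation used in the experiments: the helper names, the smoothed IDF formula, the fallback weight for n-grams unseen in the collection, and the toy pages are assumptions of this sketch.

```python
import math
from collections import Counter


def char_ngrams(aliases, n_min=1, n_max=5):
    """Count character n-grams inside each whitespace-delimited term."""
    grams = Counter()
    for alias in aliases:
        for term in alias.lower().split():  # n-grams never cross term boundaries
            for n in range(n_min, n_max + 1):
                for i in range(len(term) - n + 1):
                    grams[term[i:i + n]] += 1
    return grams


def fit_idf(bags):
    """Smoothed IDF per n-gram, plus a default weight for unseen n-grams."""
    n_docs = len(bags)
    df = Counter(g for bag in bags for g in bag)
    idf = {g: math.log((1 + n_docs) / (1 + c)) + 1.0 for g, c in df.items()}
    return idf, math.log(1 + n_docs) + 1.0


def weight(bag, idf, unseen):
    return {g: tf * idf.get(g, unseen) for g, tf in bag.items()}


def cosine(u, v):
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k(query_aliases, pages, k=64):
    """Rank pages (title -> alias list) by cosine similarity to the query."""
    titles = list(pages)
    bags = [char_ngrams(pages[t]) for t in titles]
    idf, unseen = fit_idf(bags)
    q_vec = weight(char_ngrams(query_aliases), idf, unseen)
    scored = [(t, cosine(q_vec, weight(b, idf, unseen)))
              for t, b in zip(titles, bags)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy pages (invented); note the spelling variant in the query.
pages = {
    "Gastroesophageal reflux disease": ["gastroesophageal reflux disease", "GERD"],
    "Gerd Müller": ["Gerd Müller"],
    "Diabetes": ["diabetes mellitus"],
}
candidates = top_k(["gastro-oesophageal reflux"], pages, k=2)
print(candidates)
```

Unlike exact term matching, the spelling variant "gastro-oesophageal" still shares many n-grams with "gastroesophageal", so the intended page can be retrieved.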

Reranking
We formulate the reranking task as passage pair binary classification (Nogueira and Cho, 2019), where the first passage is $q_i$ for concept $c_i$ from UMLS, and the second passage is the set of Wikipedia alias names for each of the top $k=64$ documents ranked by candidate generation (Wikipedia text is excluded for reranking). The goal is to predict whether a pair is a match by minimising a binary classification objective over the candidate pairs, where $(q_i, d^+)$ is the matching UMLS-Wiki pair, and $cand_i^-$ is the set of remaining negative candidates generated by BM25. Function $f$ is the passage pair encoder, for which we use BERT's <CLS> token encoding and a linear projection. We also experiment with BioBERT (Lee et al., 2019) because it is pre-trained on medical literature, which is a better domain fit for our task.
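The objective itself is not reproduced in the text above. A plausible reading, assuming the standard binary cross-entropy used for passage pair classification, is $L_i = -\log \sigma(f(q_i, d^+)) - \sum_{d^- \in cand_i^-} \log(1 - \sigma(f(q_i, d^-)))$, where $\sigma$ is the logistic function applied to the encoder's output. The sketch below computes this assumed loss for one query from raw scores; `pair_loss` and `softplus` are hypothetical helper names, and the example scores are invented.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pair_loss(pos_logit, neg_logits):
    """Assumed binary cross-entropy for one query.

    pos_logit:  encoder score f(q_i, d+) for the matching pair.
    neg_logits: encoder scores f(q_i, d-) for the negative candidates.
    """
    loss = softplus(-pos_logit)                    # -log sigmoid(pos_logit)
    loss += sum(softplus(z) for z in neg_logits)   # -log(1 - sigmoid(z))
    return loss


# A well-separated ranking yields a smaller loss than an inverted one.
good = pair_loss(4.0, [-4.0, -4.0])
bad = pair_loss(-4.0, [4.0, 4.0])
print(good, bad)
```

Writing both terms through `softplus` avoids overflow for large scores, which matters once the scores come from an unbounded linear projection.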
The major shortcoming of BERT and BioBERT is that they do not encode multilingual aliases effectively, particularly if the scripts are different. Aliases of every concept in UMLS and Wikipedia are available on average in 1.4 and 11.2 different languages, respectively. This multilingual data can be utilised for triangulation of the matching pair. For example, GERD is a term used in UMLS for gastroesophageal reflux disease, and is also a given name in Germanic languages. If it is used alone for retrieval, many Wikipedia pages about people with that name will rank high. However, if alias names of GERD in other languages (e.g. Japanese) are used in the query and documents, disambiguation becomes easier. To encode multilingual alias names we require a model that embeds tokens of different languages into the same embedding space. To this end, we also experiment with multilingual BERT (BERT$_{multi}$).

Data
WikiUMLS is a UMLS-to-Wikipedia aligned dataset that we create in this work. It consists of about 17.8k UMLS concepts that are manually linked to their matching Wikipedia page by Wikipedia content contributors. All the UMLS concepts in WikiUMLS are from the Medical Subject Headings (MeSH) vocabulary (Lipscomb, 2000). A WikiUMLS record is a tuple of (UMLS CUI, UMLS concept alias set, Wikipedia page title, Wikidata concept alias set). We use a UMLS concept's alias set as the query, and compare it with every Wikidata entity's alias set to retrieve the matching Wikipedia page. The Wikipedia aliases and their links to UMLS are taken from their corresponding record in Wikidata, a collaborative knowledge-base that is tightly connected to Wikipedia. An entity in Wikidata has multilingual aliases, and is linked to a Wikipedia page and possibly to a UMLS concept. We split the matching pairs into roughly 10k, 2k, and 5.8k for the training, validation, and test data sets, respectively. There are about 3 million remaining UMLS concepts from various vocabularies that are not aligned to Wikipedia, which we hope to align in this work.
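The record structure and split can be pictured with a small data class. This is a sketch only: the class name `WikiUMLSRecord`, its field names, the placeholder CUIs, and the seeded random shuffle are assumptions, since the text specifies the tuple contents and approximate split sizes but not a file format or splitting procedure.

```python
import random
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence, Tuple


@dataclass(frozen=True)
class WikiUMLSRecord:
    cui: str                          # UMLS concept ID
    umls_aliases: FrozenSet[str]      # used as the query
    wikipedia_title: str              # gold matching page
    wikidata_aliases: FrozenSet[str]  # multilingual aliases of the page


def split_records(
    records: Sequence[WikiUMLSRecord], n_train: int, n_valid: int, seed: int = 0
) -> Tuple[List[WikiUMLSRecord], List[WikiUMLSRecord], List[WikiUMLSRecord]]:
    """Shuffle with a fixed seed (an assumption) and cut into three parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


# Placeholder records; the CUIs and titles below are not real identifiers.
records = [
    WikiUMLSRecord(
        cui=f"C{i:07d}",
        umls_aliases=frozenset({f"term {i}"}),
        wikipedia_title=f"Page {i}",
        wikidata_aliases=frozenset({f"term {i}", f"alias {i}"}),
    )
    for i in range(18)
]
# Scaled-down analogue of the roughly 10k / 2k / 5.8k split.
train, valid, test = split_records(records, n_train=10, n_valid=2)
print(len(train), len(valid), len(test))
```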

Evaluation Methodology
We use the aliases of a concept in UMLS as a query, and the aliases for entities in Wikipedia as the document collection. A document is relevant to a query if the pair are manually aligned in WikiUMLS (a matching UMLS-Wikipedia pair). We evaluate the models by measuring recall at different positions of their ranking (recall@k). In this paper, recall and accuracy are equivalent because we assume each UMLS concept has a unique matching Wikipedia page. Because our proposed retrieval method has two stages (candidate generation followed by neural reranking), we use normalised recall (N. recall@k) to evaluate the neural reranking stage independently of the candidate generation stage, by ignoring queries for which the candidate generator did not retrieve a relevant document.
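Both metrics are straightforward to compute from ranked lists. The sketch below assumes one gold page per query, as stated above; the function names and the toy rankings are invented for illustration.

```python
def recall_at_k(rankings, gold, k):
    """Fraction of queries whose gold page appears in the top k."""
    hits = sum(1 for q, ranked in rankings.items() if gold[q] in ranked[:k])
    return hits / len(rankings)


def normalised_recall_at_k(rankings, gold, k):
    """Recall@k over only those queries whose candidate list contains the gold page."""
    kept = {q: ranked for q, ranked in rankings.items() if gold[q] in ranked}
    if not kept:
        return 0.0
    return recall_at_k(kept, gold, k)


# Toy reranked candidate lists (invented). Query "c3" never retrieved its gold page.
gold = {"c1": "A", "c2": "C", "c3": "D", "c4": "E"}
rankings = {
    "c1": ["A", "X"],
    "c2": ["X", "C"],
    "c3": ["X", "Y"],
    "c4": ["E", "X"],
}
print(recall_at_k(rankings, gold, 1))             # 2 of 4 queries
print(normalised_recall_at_k(rankings, gold, 1))  # 2 of the 3 retrievable queries
```

Here "c3" counts against recall@k but is excluded from normalised recall@k, which isolates the quality of the reranker from that of the candidate generator.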

Results
The performance of the candidate generation methods for word-level (BM25) and character-level (TFIDF$_{char}$) retrieval, and of the reranking methods (BERT, BioBERT, and BERT$_{multi}$), is shown in Figure 1. At all top-$k$ positions, BM25 outperforms TFIDF$_{char}$, which is surprising given that TFIDF$_{char}$ was reported to achieve strong performance for entity linking in Murty et al. (2018). As shown in Table 1, BM25 is able to retrieve the correct Wikipedia page at $k=64$ with 85% accuracy (compared to 75% for TFIDF$_{char}$), which is an upper bound on the performance of the reranking models. Among the reranking models, BioBERT performs only slightly better than BERT (68% vs. 67% recall@1), although it has been pretrained on large amounts of biomedical literature. BERT$_{multi}$ performs better than both BERT and BioBERT, achieving a recall@1 of 72%, as expected given the cross-lingual nature of the task. Compared to classic vector space approaches (e.g. BM25), BERT$_{multi}$ shows improvements of 22% and 10% in recall@1 and recall@10, respectively. We also report normalised recall at $k=1$ and $k=4$, by excluding the test instances for which BM25 does not retrieve the gold candidate. Here, BERT$_{multi}$ achieves normalised recall at $k=1$ and $k=4$ of 84% and 95%, respectively. This indicates that BERT$_{multi}$ is highly successful at ranking when the correct document is retrieved.

Conclusions
We proposed passage pair ranking models based on pretrained contextual encodings for aligning UMLS and Wikipedia, to help bridge between health information systems, and to empower consumers with a better understanding of their health conditions. We developed a dataset, WikiUMLS, for training and testing alignment models between the two knowledge-bases, and proposed neural reranking models that substantially outperform BM25. We showed that the use of multilingual aliases in BERT$_{multi}$ substantially improves recall@1 compared to BioBERT (72% vs. 68%).
The use of subword information such as BPE (Sennrich et al., 2016) as used in XLM (Conneau and Lample, 2019) might improve performance, which we leave for future work. Utilising the relationships between concepts in UMLS and Wikipedia (through Wikidata) to align the two knowledge graphs is also an interesting future direction. We also intend to release a large Wikipedia-based Entity Linking (EL) dataset by using the top-ranked Wikipedia pages for UMLS queries, to be used in state-of-the-art EL models such as zeshel (Logeswaran et al., 2019).