MeDAL: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining

One of the biggest challenges that prohibit the use of many current NLP methods in clinical settings is the limited availability of public datasets. In this work, we present MeDAL, a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. We pre-trained several models of common architectures on this dataset and empirically showed that such pre-training leads to improved performance and convergence speed when fine-tuning on downstream medical tasks.


Introduction
Recent work in mining medical texts focuses on building deep learning models for different medical tasks, such as mortality prediction (Grnarova et al., 2016) and diagnosis prediction (Li et al., 2020). However, because of the private nature of medical records, there are few large-scale, publicly available medical text datasets that are suitable for pre-training models, and real-world, private datasets are often small-scale and imbalanced. As a result, one of the biggest challenges in building deep learning-based NLP systems for biomedical corpora is the limited availability of public datasets (Wang et al., 2018).
To tackle this problem, we present the Medical Dataset for Abbreviation Disambiguation for Natural Language Understanding (MeDAL; https://github.com/BruceWen120/medal), a large dataset of medical texts curated for the task of medical abbreviation disambiguation, which can be used for pre-training natural language understanding models. Figure 1 shows an example of a sample in the dataset, where the true meaning of the abbreviation 'DHF' is inferred from its context, and Figure 2 shows the pre-training framework. Although this dataset can be used for building abbreviation-expansion systems, its main purpose is to enable effective pre-training and improve performance on downstream tasks during fine-tuning. The motivation behind using abbreviation disambiguation as the pre-training task is two-fold. First, abbreviations are widely used in medical records by healthcare professionals and can often be ambiguous (Xu et al., 2007; Islamaj Dogan et al., 2009). The ubiquity of abbreviations poses a restriction on building deep learning models for medical tasks, such as mortality prediction (Grnarova et al., 2016) and diagnosis prediction (Li et al., 2020).
Second, we believe that understanding natural language in a knowledge-rich domain such as medicine requires understanding of domain knowledge at some level, similar to how humans can understand medical text only after receiving medical training. The abbreviation disambiguation task enables models to use domain knowledge to understand the global and local context, as well as the possible meanings of the abbreviation in the medical domain.
Medical abbreviation disambiguation has long been studied (Skreta et al., 2019; Li et al., 2019; Finley et al., 2016; Joopudi et al., 2018; Jin et al., 2019) and our work builds upon many of these studies. In particular, our data generation process is inspired by the reverse substitution technique (Skreta et al., 2019; Finley et al., 2016). Our work differs from them in two main aspects. First, instead of trying to improve performance on abbreviation disambiguation itself, we propose to use it as a pre-training task for transfer learning on other clinical tasks. Second, existing datasets for medical abbreviation disambiguation, for instance CASI (Moon et al., 2014), are small compared to datasets used for general language model pre-training, and as noted by Li et al. (2019) some are erroneous. Thus, we chose to construct a new dataset large enough for effective pre-training.
Our main contributions are: (a) we present a large dataset for pre-training on the task of medical abbreviation disambiguation; (b) we provide empirical evidence of the benefit of abbreviation pre-training for a wide range of deep learning architectures.

Dataset Summary
The MeDAL dataset consists of 14,393,619 articles and on average 3 abbreviations per article. The statistics of MeDAL are summarized in Table 1.
The distribution of number of words and the distribution of number of abbreviations are shown in Figure 3a and Figure 3b, respectively.

Dataset Creation
The MeDAL dataset is created from PubMed abstracts released in the 2019 annual baseline. PubMed is a search engine that indexes scientific publications in the biomedical domain. The PubMed corpus contains 18,374,626 valid abstracts with 80 words in each abstract on average.
We use reverse substitution (Skreta et al., 2019) to generate samples without human labeling. We identify full terms in text that have known abbreviations and replace them with their abbreviations. For reverse substitution, mappings of abbreviations to expansions established by Zhou et al. (2006) are used. Mappings where the abbreviation maps to only one expansion or the expansion maps to multiple abbreviations are discarded, resulting in 24,005 valid pairs of mappings. Among the valid mappings are 5,886 abbreviations, which means each abbreviation maps to about 4 expansions on average.
To avoid completely removing all expansions and making them unseen to models, the expansions are substituted with a pre-defined probability. For our study, expansions are substituted with a probability of 0.3, although our processing scripts allow for other values for future use.
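The mapping filter and the probabilistic substitution described above can be sketched as follows. This is a minimal illustration and not the released processing script: the function names, the case-insensitive whole-word matching, and the choice to compute both filter conditions on the unfiltered pairs are assumptions, and the example expansions for 'DHF' and 'MR' are illustrative rather than taken from the dataset.

```python
import random
import re
from collections import defaultdict


def filter_mappings(pairs):
    """Keep (abbreviation, expansion) pairs where the abbreviation is
    ambiguous (more than one expansion) and the expansion maps to exactly
    one abbreviation. Returns a dict: lower-cased expansion -> abbreviation."""
    abbr_to_exp = defaultdict(set)
    exp_to_abbr = defaultdict(set)
    for abbr, exp in pairs:
        abbr_to_exp[abbr].add(exp.lower())
        exp_to_abbr[exp.lower()].add(abbr)
    valid = {}
    for exp, abbrs in exp_to_abbr.items():
        if len(abbrs) != 1:
            continue  # expansion maps to multiple abbreviations
        abbr = next(iter(abbrs))
        if len(abbr_to_exp[abbr]) > 1:  # drop unambiguous abbreviations
            valid[exp] = abbr
    return valid


def reverse_substitute(text, exp_to_abbr, prob=0.3, rng=None):
    """Replace each known expansion in `text` by its abbreviation with
    probability `prob`. Returns the new text and the (abbreviation,
    true expansion) labels of the substitutions that were made."""
    if not exp_to_abbr:
        return text, []
    rng = rng or random.Random()
    # Longest expansions first, so a longer term wins over a term it contains.
    terms = sorted(exp_to_abbr, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE
    )
    labels = []

    def replace(match):
        expansion = match.group(0)
        if rng.random() >= prob:
            return expansion  # leave the expansion visible to the model
        abbr = exp_to_abbr[expansion.lower()]
        labels.append((abbr, expansion.lower()))
        return abbr

    return pattern.sub(replace, text), labels
```

With `prob=0.3`, roughly 70% of expansions stay in the text, so models still see the full terms during pre-training.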

Pretraining
The task of abbreviation disambiguation is treated as a classification problem, where the classes are all possible expansions.
Considering the huge size of the dataset and the associated computational cost, a subset of 5 million data points is sampled from the complete corpus and split into 3 million training samples, 1 million validation samples and 1 million test samples. This subset is used throughout this study.
When creating this subset, because the distribution of true expansions is highly imbalanced, a sampling strategy is adopted that caps the number of samples per class. The strategy works as follows: for each class label $C$, $N_C = \min(F_C, T)$ samples with this label are randomly selected, where $F_C$ is the frequency of that class in the unsampled dataset and $T$ is a threshold computed using Algorithm 1 such that each class has at most $T$ samples and $\sum_C N_C$ is equal to the total number of samples $N$. To compute $T$, the algorithm iteratively removes classes in increasing order of frequency, and at every iteration decreases $N$ (the number of remaining samples) and $L$ (the number of remaining labels). Then, the rate $r$ is calculated as the number of samples each of the remaining $L$ classes would receive if the remaining $N$ samples were divided equally among them. The iteration stops as soon as the frequency $f_C$ of the current class is greater than the rate $r$.

Algorithm 1 Compute threshold T
Require: array of class frequencies f, N > 0
Sort f in increasing order
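Only the first steps of Algorithm 1 are listed here, so the following is a sketch reconstructed from the prose description of the sampling strategy, not the authors' exact procedure. The function name, the use of a real-valued rate, and the handling of the case where every class fits are assumptions.

```python
def compute_threshold(freqs, n_total):
    """Return a threshold T such that sum(min(f, T) for f in freqs)
    equals n_total, assuming sum(freqs) >= n_total.

    freqs: class frequencies in the unsampled dataset.
    n_total: desired total number of samples (N > 0).
    """
    if n_total <= 0 or not freqs:
        raise ValueError("need N > 0 and at least one class")
    remaining = n_total        # N: samples still to be allocated
    labels_left = len(freqs)   # L: classes not yet removed
    for f in sorted(freqs):    # increasing order of frequency
        rate = remaining / labels_left  # r: equal share per remaining class
        if f > rate:
            # Every class from here on is at least this frequent,
            # so each of them is capped at the current rate.
            return rate
        # The class is rare enough to be kept in full: remove it.
        remaining -= f
        labels_left -= 1
    # Assumption: if all classes fit, no capping is needed.
    return float(max(freqs))
```

For example, with frequencies 1, 2, 10 and 20 and N = 13, the two rare classes are kept in full and the two frequent ones are capped at 5 samples each, which gives 1 + 2 + 5 + 5 = 13. In practice the rate would be rounded to an integer before sampling.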

Evaluation Tasks
Mortality Prediction As a downstream task to evaluate models' performance in clinical settings, mortality prediction aims at predicting the mortality of a patient at the end of a hospital admission, using ICU patient notes. The mortality prediction dataset is generated from MIMIC-III (Johnson et al., 2016). Medical notes in MIMIC-III comprise free-form text documents written by nurses, doctors, and many types of specialists, and are written throughout the patient's stay. Only notes written by physicians and nurses at least twenty-four hours before the discharge time are used, since the goal is to accurately predict whether a patient is at risk of dying by the end of the admission. In order to balance positive and negative samples (roughly 10% of patients expire at the end of an admission) while keeping as much text diversity as possible, we sample at most four notes from each surviving patient.
The generated dataset has a total of 137,607 negatively-labelled and 138,864 positively-labelled notes. Then, using stratified random splitting, we selected 75%/10%/15% of the patients to be included in the training/validation/test splits. As an example of the ubiquity of abbreviations, 'MR' appears 1,612 times in 1,366 samples in the test set alone.
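A patient-level stratified split of this kind could look like the sketch below. It is an illustration only: the function name, the seed, the rounding of split sizes, and the use of one label per patient are assumptions and not details given in the text.

```python
import random
from collections import defaultdict


def stratified_patient_split(patient_labels, fractions=(0.75, 0.10, 0.15), seed=0):
    """Split patient ids into train/valid/test so that each label is
    represented in roughly the same proportion in every split.

    patient_labels: dict mapping patient id -> label (e.g. 0 or 1).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for pid, label in patient_labels.items():
        by_label[label].append(pid)

    splits = {"train": [], "valid": [], "test": []}
    for label in sorted(by_label):
        pids = sorted(by_label[label])  # fixed order before shuffling
        rng.shuffle(pids)
        n_train = round(fractions[0] * len(pids))
        n_valid = round(fractions[1] * len(pids))
        splits["train"] += pids[:n_train]
        splits["valid"] += pids[n_train:n_train + n_valid]
        splits["test"] += pids[n_train + n_valid:]  # remainder
    return splits
```

Splitting by patient rather than by note keeps all notes of one patient in the same split, which avoids leaking patient-specific text from training into evaluation.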
Diagnosis Prediction Similar to mortality prediction, diagnosis prediction aims to predict the diagnoses associated with a hospital admission from medical notes written during the admission. The same MIMIC-III medical notes and the same splits from mortality prediction are used, except that seven training samples with no recorded diagnosis are removed. In MIMIC-III, diagnoses are recorded with International Classification of Diseases (ICD) codes, which are standardized codes designed for billing purposes. We discard minor distinctions between ICD codes under the same category by taking the first three digits of each code (the first four digits for codes that start with 'E' or 'V'). After grouping, there are 1,204 unique diagnosis codes.
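The grouping rule can be written as a short helper. This is a sketch under two assumptions: codes are stored as strings without a decimal point, and "the first four digits" for 'E' and 'V' codes is read as the first four characters including the leading letter. The function name is hypothetical.

```python
def group_icd_code(code):
    """Collapse an ICD code to its category prefix.

    Codes starting with 'E' or 'V' keep their first four characters;
    all other codes keep their first three.
    """
    code = code.strip().upper()
    if code.startswith(("E", "V")):
        return code[:4]
    return code[:3]


def group_admission_codes(codes):
    """Return the sorted set of grouped codes for one admission."""
    return sorted({group_icd_code(c) for c in codes})
```

Applied to an admission with codes such as `25000` and `25001`, both collapse to the single label `250`, which is how the label space is reduced.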
Top-k recall is used for evaluation of models based on the similarities to real-life medical decision making (Choi et al., 2015), which is defined as the number of diagnosis codes in that admission that are present in the top k predictions of the

Models
The models are first pre-trained on the MeDAL dataset, then pre-trained weights are used to initialize models for training on the downstream tasks. We compared this training strategy with training respective models from scratch to validate the benefit of pre-training.
LSTM BiLSTM is used as a baseline model. Specifically, the BiLSTM consists of three layers with a hidden size of 512. A pre-trained fastText model is used for word embeddings (Bojanowski et al., 2017).
LSTM + Self Attention To allow for leveraging information extracted by the LSTM in a flexible manner, soft attention layers are added on top of the LSTM. The attention layer is largely based on the soft attention by Bahdanau et al. (2014). Its detailed formulation is included in Appendix A.
Transformers We used the pre-trained ELECTRA-small discriminator (Clark et al., 2020) as an example of a Transformer-based model (Vaswani et al., 2017) and, since it was not pre-trained on medical text, we compared its performance with and without pre-training on abbreviation disambiguation.
Task-specific Output Layer Depending on the task, the output layer can take various forms. For abbreviation disambiguation, the output layer is a fully connected layer whose input is the hidden vector at the location of the abbreviation from the previous layers and whose output space is all possible expansions. For mortality or diagnosis prediction, which are not associated with any specific token, hidden vectors from the previous layers first need to be aggregated into one vector. This can be achieved by either a pooling layer or an additional attention layer with a learnable query vector. The output layer is then a fully connected layer that takes the aggregated vector as input. The attention output layer is illustrated in Figure 4. In preliminary experiments we found that the attention output layer generally improves models' performance compared to the max-pooling output layer, and therefore it is used throughout the rest of the study unless otherwise noted.
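To make the aggregation step concrete, the sketch below pools a sequence of hidden vectors with a single query vector. It is written in plain Python for readability and assumes dot-product scoring followed by a softmax; the exact scoring function of the attention output layer is not specified in this section, so this is an illustration rather than the formulation used in the experiments. In a real model the query would be a trainable parameter.

```python
import math


def attention_pool(hidden, query):
    """Aggregate token vectors into one vector with attention.

    hidden: list of token vectors (each a list of floats, same length).
    query:  query vector of the same dimension (learnable in practice).
    Returns (pooled_vector, attention_weights).
    """
    # Dot-product score between the query and each hidden vector.
    scores = [sum(h_j * q_j for h_j, q_j in zip(h, query)) for h in hidden]
    # Numerically stable softmax over token positions.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the hidden vectors.
    dim = len(query)
    pooled = [sum(w * h[j] for w, h in zip(weights, hidden)) for j in range(dim)]
    return pooled, weights
```

Unlike max-pooling, which keeps only the largest activation per dimension, the weighted sum lets the model decide which token positions matter for the prediction.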

Results
Models' performance on the pre-training task, abbreviation disambiguation, is shown in Figure 5. As the goal is not to optimize performance on this task, Figure 5 serves to confirm that the models are properly pre-trained. After pre-training, models are fine-tuned on the two downstream tasks to evaluate the benefit of pre-training. On the mortality prediction task, all three pre-trained models perform better than their from-scratch counterparts, as shown in Table 2.
The benefit of pre-training is more significant on diagnosis prediction, as shown in Figure 6. Both LSTM and LSTM + self attention perform considerably better if they are pre-trained. In fact, the performance of both models increases by more than 70% in relative terms. While the gain for ELECTRA is not as significant, pre-training leads to faster convergence during fine-tuning.
On the two downstream tasks, experimental results show that pre-training improves ELECTRA's performance even though the model is already fully pre-trained on non-medical texts and is among the state of the art, and brings the other models' performance close to ELECTRA's. This shows that pre-training on the MeDAL dataset can generally improve models' ability to understand language in the medical domain. The complete results can be found in Appendix C.

Conclusion and Discussion
In this work, we present MeDAL, a large dataset on abbreviation disambiguation, designed for pretraining natural language understanding models in the medical domain. We pre-trained a variety of models using common architectures and empirically showed that such pre-training leads to improvement in performance as well as faster convergence when fine-tuning on two downstream clinical tasks.