Deep Contextualized Biomedical Abbreviation Expansion

Automatic identification and expansion of ambiguous abbreviations are essential for biomedical natural language processing applications, such as information retrieval and question answering systems. In this paper, we present the DEep Contextualized Biomedical Abbreviation Expansion (DECBAE) model. DECBAE automatically collects substantial and relatively clean annotated contexts for 950 ambiguous abbreviations from PubMed abstracts using a simple heuristic. It then uses BioELMo to extract contextualized word features and feeds those features to abbreviation-specific bidirectional LSTMs, where the hidden states of the ambiguous abbreviations are used to assign the exact definitions. Our DECBAE model outperforms other baselines by large margins, achieving an average accuracy of 0.961 and a macro-F1 of 0.917 on the dataset. It also surpasses human performance for expanding a sample abbreviation, and remains robust in imbalanced, low-resource and clinical settings.


Introduction
Abbreviations are shortened forms of text strings. They are prevalent in biomedical literature such as scientific articles, clinical notes and user queries in information retrieval systems. Abbreviations can be ambiguous (e.g., ER can refer to estrogen receptor, endoplasmic reticulum or emergency room), especially when they appear in short or professional texts where the definitions are not given. For instance, about 15% of PubMed queries include abbreviations (Islamaj Dogan et al., 2009), and about 14.8% of all tokens in a clinical note dataset are abbreviations (Xu et al., 2007). In both cases, the definitions of the abbreviations are rarely provided. Thus, automatic expansion of ambiguous abbreviations to their full forms is vital in biomedical natural language processing (NLP) systems.
In this paper, we focus on the cases where definitions of ambiguous abbreviations are not directly available in the contexts, so reasoning over the contexts is required for disambiguation. When definitions are provided in the contexts, one can easily extract them using rule-based methods.
We present the DEep Contextualized Biomedical Abbreviation Expansion (DECBAE) model. DECBAE uses a simple heuristic to automatically construct large supervised disambiguation datasets for 950 abbreviations from PubMed abstracts: in scientific writing, authors define abbreviations the first time they are used, and later occurrences of the same abbreviation in the abstract carry the same definition. We extract all the sentences containing the same abbreviation in each PubMed abstract, and use the definition given in the first sentence as the full-form label of the abbreviation in the following sentences. We group the definitions for each abbreviation and formulate abbreviation expansion as a classification task, where the input is an ambiguous abbreviation with its context, and the output is one of its possible definitions.
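The labeling heuristic described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' pipeline: the regular expression, the initial-letter matching rule and the function names `find_definition` and `label_abstract` are assumptions made for the example.

```python
import re

# Assumed pattern: an upper-case abbreviation of 2-6 letters in parentheses.
ABBR_IN_PARENS = re.compile(r"\(([A-Z]{2,6})\)")

def find_definition(sentence):
    """Return (abbreviation, definition) for a 'Definition (ABBR)' match, else None.

    Assumption: the definition is the run of words right before the
    parenthesis whose initials spell the abbreviation.
    """
    for match in ABBR_IN_PARENS.finditer(sentence):
        abbr = match.group(1)
        words = sentence[:match.start()].split()
        candidate = words[-len(abbr):]
        initials = "".join(word[0].lower() for word in candidate)
        if len(candidate) == len(abbr) and initials == abbr.lower():
            return abbr, " ".join(candidate).lower()
    return None

def label_abstract(sentences):
    """Label later mentions of an abbreviation with its first definition."""
    definitions = {}
    instances = []
    for sentence in sentences:
        # Sentences after the defining one inherit the stored definition.
        for abbr, definition in definitions.items():
            if re.search(r"\b" + re.escape(abbr) + r"\b", sentence):
                instances.append((abbr, definition, sentence))
        found = find_definition(sentence)
        if found and found[0] not in definitions:
            definitions[found[0]] = found[1]
    return instances

abstract = [
    "The endoplasmic reticulum (ER) is a large organelle.",
    "The ER has multiple functions.",
    "ER stress triggers signaling.",
]
instances = label_abstract(abstract)
```

Here `instances` holds one labeled example per later sentence that mentions ER. The defining sentence itself is not stored, since its context already reveals the answer.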
Recent breakthroughs in language models (LMs) pre-trained on large corpora, such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018), clearly show that unsupervised LM pre-training can vastly improve the performance of downstream models. To fully utilize the knowledge encoded in PubMed abstracts, DECBAE uses BioELMo (Jin et al., 2019), a domain-adapted version of ELMo, to embed the words. After the embedding layer, DECBAE applies abbreviation-specific bidirectional LSTM (biLSTM) classifiers to perform the abbreviation expansion, where the biLSTM parameters are trained separately for each abbreviation. We train DECBAE on the automatically collected dataset of 950 ambiguous abbreviations.
At inference time, DECBAE computes the BioELMo embeddings of the whole sentence and feeds them to the corresponding abbreviation-specific biLSTM classifiers to disambiguate the abbreviations in the sentence. We show that DECBAE outperforms other baselines by large margins and even performs better than a single human expert. Although the training instances of DECBAE are collected from PubMed, it covers 85% of the clinically related abbreviations mentioned in a previous work (Xu et al., 2012). Moreover, DECBAE remains robust in low-resource and imbalanced settings.

Related Work
Contextualized word embeddings: Recently, contextualized word representations pre-trained on large corpora, such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018), have significantly improved the performance of various NLP tasks. ELMo is a pre-trained biLSTM language model. ELMo word embeddings are calculated as a weighted sum of the hidden states of each biLSTM layer. The weights are task-specific learnable parameters, while the biLSTM layers are fixed. Contextual embeddings trained in-domain further improve the performance on domain-specific tasks. In this paper, we use BioELMo, which is a biomedical version of ELMo trained on 10M PubMed abstracts (Jin et al., 2019). BioELMo outperforms general ELMo by large margins on several biomedical NLP tasks.
We do not use BERT for contextualized embeddings because it requires fine-tuning: users only need to download one BioELMo model and N abbreviation-specific biLSTM weights to run DECBAE locally, which takes significantly less disk space than N fine-tuned BERT models, one for each abbreviation. Here N is the number of abbreviations.
Word sense disambiguation (WSD): The goal of WSD is to determine the correct sense of words in different contexts. Abbreviation expansion is a specific case of WSD where the ambiguous words are abbreviations. In this paper, we use abbreviation expansion and abbreviation disambiguation interchangeably. Several human-annotated datasets are available for supervised WSD (Navigli et al., 2013; Camacho-Collados et al., 2016; Raganato et al., 2017b). However, human annotations can be expensive, especially in domain-specific settings. To address this problem, some automatic dataset collection methods have been proposed (Yu et al., 2007; Ciosici et al., 2019), where abbreviations are automatically labeled if they are defined previously in the same documents. We use a similar approach in this work. Peters et al. (2018) report that simply matching the ELMo embedding of the target word with the nearest sense representation, calculated by averaging the ELMo embeddings of each sense, leads to WSD performance comparable with state-of-the-art models using hand-crafted features (Iacobacci et al., 2016) or task-specific biLSTMs trained with multiple tasks (Raganato et al., 2017a). Instead of searching for nearest neighbors between the contextualized embeddings of abbreviations and definitions, we model abbreviation expansion as classification.
Biomedical abbreviation expansion: Various methods have been introduced for automatically expanding biomedical abbreviations. Yu et al. (2007) train naive Bayes and SVM classifiers with bag-of-words features on an automatically collected dataset from PubMed. Some works disambiguate abbreviations to their senses in controlled vocabularies like Medical Subject Headings (MeSH) and the Unified Medical Language System (UMLS). Xu et al. (2015) use pooled neighbor word embeddings of the abbreviations as features to train SVM classifiers for clinical abbreviation disambiguation. Jimeno-Yepes et al. (2011) introduced the MSH WSD dataset to test the performance of supervised biomedical WSD systems, and several supervised models have been proposed on it (Antunes and Matos; Yepes, 2017). Recently, Pesaranghader et al. (2019) presented deepBioWSD, which sets new state-of-the-art performance on it. DeepBioWSD uses a single biLSTM encoder for disambiguation of all abbreviations by calculating the pairwise similarity between context representations and sense representations.
To the best of our knowledge, DECBAE is the first model that uses deep contextualized word embeddings for biomedical abbreviation expansion.

Methods
Figure 1 shows the architecture of DECBAE. During training, we first construct abbreviation expansion datasets from PubMed (§3.1). We use BioELMo (§3.2) to get the contextualized representations of words, and train a specific biLSTM classifier (§3.3) for each abbreviation. During inference (§3.5), we first detect whether there are ambiguous abbreviations in the input sentences using the expert-curated ambiguous abbreviation vocabulary. If so, we use BioELMo and the corresponding abbreviation-specific biLSTM classifiers to perform the disambiguation.
Figure 1: Architecture of DECBAE. Training and inference phases are illustrated in the left and right boxes, respectively. The PubMed corpus is used in training BioELMo (Jin et al., 2019) and collecting the disambiguation dataset. We train a separate biLSTM classifier for each abbreviation, and the specific pre-trained classifier is retrieved in the inference phase.

Dataset Collection
Figure 2 shows our approach to automatically collecting the disambiguation dataset. For each abstract, we first detect and extract the pattern "Definition (Abbreviation)", e.g., "endoplasmic reticulum (ER)". Then we collect all the following sentences that contain the abbreviation, and label them with the definition. This generates a noisy label set due to variations in writing the same definition (e.g., emergency department and emergency departments). To group the same definitions together, we use MetaMap-derived MeSH terms (Demner-Fushman et al., 2017) as features of definitions and define the MeSH similarity between definition a and definition b in terms of $M_a$ and $M_b$, the MeSH term sets of definitions a and b, respectively. We group definitions that have high MeSH similarity and small edit distance, using heuristic thresholds. We collected 1970 abbreviations. However, due to the unsupervised nature of the collection process, some abbreviations are invalid or not ambiguous. For this reason, one biomedical expert filtered the abbreviations we found, based on 1) Validity: abbreviations should be biomedically meaningful; 2) Ambiguity: abbreviations should have multiple possible definitions, and the prevalence of the dominant one should be < 99%. After the filtering, there are 950 valid ambiguous abbreviations. Their statistics are shown in Table 1. We split the instances of each abbreviation into training, development and test sets: if there are more than 10k instances, we randomly select 1k each for the development and test sets. Otherwise, we randomly select 10% of all instances each for the development and test sets.
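The grouping and splitting steps can be illustrated with a short sketch. Several parts are assumptions rather than details given in the text: MeSH similarity is taken to be the Jaccard overlap of the two term sets, the thresholds `min_similarity` and `max_distance` are hypothetical values, and the 10% share is rounded down.

```python
def mesh_similarity(mesh_a, mesh_b):
    """Jaccard overlap of two MeSH term sets (assumed form of the similarity)."""
    mesh_a, mesh_b = set(mesh_a), set(mesh_b)
    if not mesh_a and not mesh_b:
        return 0.0
    return len(mesh_a & mesh_b) / len(mesh_a | mesh_b)

def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def same_definition(def_a, def_b, mesh_a, mesh_b,
                    min_similarity=0.5, max_distance=3):
    """Merge two surface forms if both heuristics agree (hypothetical thresholds)."""
    return (mesh_similarity(mesh_a, mesh_b) >= min_similarity
            and edit_distance(def_a, def_b) <= max_distance)

def split_sizes(n_instances):
    """Return (train, dev, test) sizes following the split rule in the text."""
    held_out = 1000 if n_instances > 10_000 else n_instances // 10
    return n_instances - 2 * held_out, held_out, held_out
```

With these definitions, "emergency department" and "emergency departments" are one edit apart, so they merge whenever their MeSH term sets overlap enough.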

BioELMo
BioELMo is a biomedical version of ELMo pre-trained on 10 million PubMed abstracts (Jin et al., 2019). It serves as a contextualized feature extractor in DECBAE: given an input sentence of $L$ tokens $[t_1; t_2; \dots; t_L]$, BioELMo computes the embeddings $E = [e_1; e_2; \dots; e_L]$, where $e \in \mathbb{R}^D$ is the token embedding and $D$ is the embedding dimension.
Figure 2: An example of automatically generated training instances for disambiguation from the abstract of Schwarz and Blower (2016). In this case, we extract "endoplasmic reticulum" as the definition for all ER mentions in the abstract, and store those instances in the dataset.
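ELMo-style models produce each token embedding as a scaled, softmax-weighted average of the hidden states of the language model's layers (Peters et al., 2018). The sketch below shows that combination in plain Python. The function name `scalar_mix` and the nested-list layout are illustrative assumptions, and the layer outputs are toy numbers rather than real BioELMo activations.

```python
import math

def scalar_mix(layers, layer_logits, gamma):
    """Return gamma * sum_j softmax(layer_logits)[j] * layers[j].

    layers: list of per-layer outputs, each a list of token vectors.
    layer_logits: one learnable scalar per layer (softmax-normalized here).
    gamma: learnable scaling factor.
    """
    max_logit = max(layer_logits)
    exps = [math.exp(s - max_logit) for s in layer_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    n_tokens, dim = len(layers[0]), len(layers[0][0])
    return [
        [gamma * sum(w * layer[t][d] for w, layer in zip(weights, layers))
         for d in range(dim)]
        for t in range(n_tokens)
    ]

# Three layers, two tokens, two dimensions (toy values).
toy_layers = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[4.0, 5.0], [6.0, 7.0]],
    [[7.0, 8.0], [9.0, 10.0]],
]
mixed = scalar_mix(toy_layers, layer_logits=[0.0, 0.0, 0.0], gamma=1.0)
```

With equal logits and `gamma=1.0` the result is the plain average of the three layers. Training moves the logits and `gamma` while the layer outputs themselves stay fixed.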

Abbreviation-specific biLSTM Classifiers
For each abbreviation, we train a specific biLSTM classifier, denoted as $\mathrm{biLSTM}_i$ for abbreviation $i$. We feed the BioELMo representations of sentences containing abbreviation $i$ to $\mathrm{biLSTM}_i$: $[h_1; h_2; \dots; h_L] = \mathrm{biLSTM}_i(E)$, where $h \in \mathbb{R}^{2H}$ is the concatenation of the forward and backward hidden states of the biLSTM. We take as input the concatenated hidden states of abbreviation $i$ (i.e., the ambiguous token), $h_a$, and use several feed-forward neural network (FFN) layers with a softmax output unit to predict its definition: $P(\mathrm{def}_k) \propto \exp(w_k^{\top} \mathrm{FFN}_i(h_a))$, where $w_k$ is the learnt weight vector corresponding to definition $k$, and $\mathrm{def}_k$ is the $k$-th definition of abbreviation $i$ in our dataset. Similarly, we train the FFN separately for different abbreviations. Note that the BioELMo representations are obtained after scaling and averaging the 3 BioELMo layers using task-specific weights.
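The classification step on top of the biLSTM can be sketched as follows. This is an illustration under stated assumptions: the biLSTM itself is omitted and its hidden states are passed in as plain lists, a single ReLU hidden layer stands in for the "several FFN layers", and all names and weights are hypothetical.

```python
import math

def dense(vector, weights, biases):
    """Affine layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, biases)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_definition(hidden_states, abbr_index, hidden_layer, output_layer,
                       definitions):
    """Classify the abbreviation from its own concatenated hidden state h_a."""
    h_a = hidden_states[abbr_index]
    z = [max(0.0, v) for v in dense(h_a, *hidden_layer)]  # assumed ReLU layer
    probs = softmax(dense(z, *output_layer))               # rows are the w_k
    best = max(range(len(probs)), key=probs.__getitem__)
    return definitions[best], probs

# Toy example: 3 tokens with 2-dimensional hidden states, 3 definitions.
toy_states = [[0.0, 0.0], [1.0, -1.0], [0.5, 0.5]]
toy_hidden = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
toy_output = ([[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]], [0.0, 0.0, 0.0])
toy_definitions = ["dopamine transporter", "direct antiglobulin test",
                   "dementia of the Alzheimer type"]
label, probs = predict_definition(toy_states, 1, toy_hidden, toy_output,
                                  toy_definitions)
```

Only the hidden state at the abbreviation's position is used. The rest of the sentence influences the prediction through the recurrent encoder that produced that state.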

Training
The weights of BioELMo are pre-trained and fixed, while the averaging weights and scaling factor of the BioELMo embeddings are trained separately for each abbreviation along with the abbreviation-specific biLSTM classifiers. We use Adam (Kingma and Ba, 2014) to minimize the cross-entropy loss between the predicted distribution and the ground-truth label.

Inference
At inference time, we denote the tokenized input sentence as $[t_1; t_2; \dots; t_L]$ and our ambiguous abbreviation set as $A$. If $\exists\, t_j \in A$, we run DECBAE to expand $t_j$: First, we use BioELMo to compute the representations of all the input tokens, $E = [e_1; e_2; \dots; e_L]$. The trained biLSTM for abbreviation $t_j$, denoted as $\mathrm{biLSTM}_{t_j}$, is retrieved and used to calculate the hidden states given the BioELMo embeddings of the input sentence: $[h_1; h_2; \dots; h_L] = \mathrm{biLSTM}_{t_j}(E)$. Then $h_{t_j}$, which is the concatenated hidden states of the ambiguous abbreviation $t_j$, is used for disambiguation through the trained abbreviation-specific FFN: $P(\mathrm{def}_k) \propto \exp(w_k^{\top} \mathrm{FFN}_{t_j}(h_{t_j}))$.

Experiments
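The inference procedure amounts to a vocabulary lookup followed by dispatch to the matching classifier. The sketch below assumes that `classifiers` maps each abbreviation in the set A to a callable taking the sentence embeddings and the token index, and that `embed` stands in for BioELMo. Both interfaces and the toy stand-ins are hypothetical.

```python
def expand_abbreviations(tokens, classifiers, embed):
    """Expand every token that has a trained abbreviation-specific classifier.

    Returns a dict mapping token positions to predicted definitions.
    """
    to_expand = [j for j, token in enumerate(tokens) if token in classifiers]
    if not to_expand:
        return {}  # no ambiguous abbreviation: skip the embedding step
    embeddings = embed(tokens)  # computed once for the whole sentence
    return {j: classifiers[tokens[j]](embeddings, j) for j in to_expand}

# Toy stand-ins for BioELMo and a trained classifier.
def toy_embed(tokens):
    return [[float(len(token))] for token in tokens]

toy_classifiers = {"ER": lambda embeddings, index: "endoplasmic reticulum"}

result = expand_abbreviations(["The", "ER", "lumen"], toy_classifiers, toy_embed)
```

Embedding the sentence once and reusing the result is a natural choice when a sentence contains several ambiguous abbreviations, since each classifier only needs the shared embeddings and its own token position.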

Baseline Settings
A trivial baseline is to predict the majority definition for all cases, which can still lead to high accuracy in severely imbalanced datasets. We denote this method as Majority. We also test other baseline settings with different feature learning schemes. They are all followed by several FFN layers and a softmax output unit.
Bag-of-words: Following most previous works, we use bag-of-words features to represent the context by $c \in \mathbb{R}^{|V|}$, where $|V|$ is the vocabulary size.
BioELMo: We take the BioELMo embeddings of the ambiguous abbreviations as input features.
biLSTM: We use biomedical w2v (Moen and Ananiadou) as word embeddings, train task-specific biLSTMs, and use the hidden states of the ambiguous abbreviations as input features.
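The two simplest baselines can be sketched directly. The function names are illustrative, and the use of raw term counts (rather than binary indicators) for the bag-of-words vector is an assumption.

```python
from collections import Counter

def majority_baseline(train_labels):
    """Return a predictor that always outputs the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda tokens: majority

def bag_of_words(tokens, vocabulary):
    """Represent a context as a count vector with one entry per vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

predict = majority_baseline(["dopamine transporter", "dopamine transporter",
                             "direct antiglobulin test"])
features = bag_of_words(["the", "er", "the"], vocabulary=["er", "the", "cell"])
```

The count vector discards word order, which is one reason such features give a weaker context representation than recurrent encoders.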
We also measure human performance: due to limited resources, we only study single-expert performance on one sampled abbreviation. For this, the expert is shown the test sentences and asked to classify the ambiguous abbreviation into one of its possible definitions. An ensemble of experts would likely produce better results, so our single-expert results represent only a lower bound of human performance.

Subset Settings
We report the model performance on different subsets of our dataset. Statistics of those datasets are shown in Table 1.
Random samples: It is computationally expensive and unnecessary to test the models on all 950 abbreviations. Instead, we use 100 randomly sampled abbreviations to represent the whole set.
Imbalanced samples: We define abbreviations whose dominant definitions have over 95% frequency as imbalanced samples. Multi-class classification with imbalanced classes is considered a hard machine learning task.
Low-resources samples: We define abbreviations that have fewer than 1k training instances as low-resources samples. This is motivated by the fact that most biomedical datasets are limited in scale, so models that still perform well under low-resource settings have the potential to be applied in real-world settings.
Clinical samples: Though our abbreviations are collected from PubMed abstracts, we have included 11 of the 13 clinical ambiguous abbreviations mentioned in a previous work on clinical abbreviation disambiguation (Xu et al., 2012). We also test our models on the subset of these 11 clinically related abbreviations.
Testing sample for human expert: We test human performance on one abbreviation (DAT), due to limited resources. The statistics of DAT abbreviation expansion dataset are close to the averages of the whole dataset, as shown in Table 1. Possible definitions of DAT include: 1) Dopamine transporter (63.9%); 2) Direct antiglobulin test (5.8%); 3) Direct agglutination test (5.8%); 4) Dementia of the Alzheimer type (24.5%).
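The membership rules for the Imbalanced and Low-resources subsets can be written as a small function. The function name is illustrative, and computing the dominant-definition frequency over the training labels is an assumption, since the text does not say which split the 95% threshold refers to.

```python
from collections import Counter

def subset_tags(train_labels):
    """Return the subsets an abbreviation belongs to, given its training labels."""
    tags = set()
    counts = Counter(train_labels)
    dominant_share = max(counts.values()) / len(train_labels)
    if dominant_share > 0.95:          # dominant definition over 95%
        tags.add("imbalanced")
    if len(train_labels) < 1000:       # fewer than 1k training instances
        tags.add("low-resources")
    return tags

tags = subset_tags(["definition a"] * 97 + ["definition b"] * 3)
```

An abbreviation can belong to both subsets at once, as in this toy case with 100 instances and a 97% dominant definition.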

Evaluation Metrics
We model abbreviation expansion as a multi-class classification task, and use the following metrics to measure the performance of different models:
Accuracy: Accuracy is defined as the proportion of correct predictions among all predictions. Most of the definition labels are imbalanced, so accuracy can be misleadingly high for a trivial majority solution in these cases, and thus may not reflect the real capability of models.
Macro-F1: In multi-class classification, macro-F1 is calculated as the unweighted average of the F1 score of each class. The class-wise F1 score is defined as follows: $F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$, where precision and recall are calculated for each class.
Kappa Statistic: Cohen's kappa was originally introduced as a metric to measure inter-rater agreement (Cohen, 1960). It can also be used to evaluate predictions of multi-class classification: $\kappa = \frac{p_o - p_e}{1 - p_e}$, where $p_o$ is the observed agreement, which equals accuracy in the case of classification, and $p_e$ is the expected agreement that can be achieved by pure chance: $p_e = \sum_c p_c \hat{p}_c$, where $p_c$ and $\hat{p}_c$ refer to the proportion of class $c$ in the ground-truth labels and the predictions, respectively. Empirical results in Table 2 show that kappa statistics are often lower than accuracy and macro-F1, and thus serve as a more discriminative metric for our task.
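The three metrics can be computed with the standard library alone. The sketch below follows the definitions above, with two assumptions: macro-F1 averages over the classes present in the ground truth, and a class with no true positives gets an F1 of zero.

```python
from collections import Counter

def accuracy(gold, pred):
    """Proportion of predictions that match the ground truth."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the classes seen in gold."""
    scores = []
    for c in sorted(set(gold)):
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denominator = precision + recall
        scores.append(2 * precision * recall / denominator if denominator else 0.0)
    return sum(scores) / len(scores)

def cohen_kappa(gold, pred):
    """(p_o - p_e) / (1 - p_e); undefined (division by zero) when p_e is 1."""
    n = len(gold)
    p_o = accuracy(gold, pred)
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    p_e = sum(gold_counts[c] / n * pred_counts[c] / n for c in gold_counts)
    return (p_o - p_e) / (1 - p_e)

gold = ["a", "a", "a", "b"]
model_pred = ["a", "a", "b", "b"]
majority_pred = ["a", "a", "a", "a"]
```

On this toy data both predictors reach the same accuracy, but the majority predictor gets a kappa of zero and a much lower macro-F1, which illustrates why accuracy alone can mislead on imbalanced labels.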

Results
In Table 2, we report the means and standard deviations of each model's performance on different subsets, evaluated by the three metrics. In all subsets, DECBAE performs significantly better than most other models by large margins. A general trend of DECBAE > BioELMo > biLSTM > BoW-FFN > Majority holds across subsets. In the Random subset, which represents the whole dataset, all metrics of DECBAE exceed 0.90, setting very promising state-of-the-art performance despite the potential noise of the dataset.
In the Imbalanced subset, where the most frequent definitions account for over 95% of all the labels, a trivial Majority solution gets over 95% accuracy. However, for macro-F1 and the kappa statistic, the performance of the baselines drops dramatically, while DECBAE still produces decent results.
DECBAE and BioELMo alone remain robust in the Low-resources setting. This is due to the transfer learning nature of BioELMo, which utilizes the knowledge encoded in the PubMed abstracts.
Our abbreviation expansion dataset covers roughly 85% of the clinical abbreviations mentioned in Xu et al. (2012). On this Clinical subset, DECBAE achieves strong results and vastly outperforms the other baselines despite the variety of possible definitions in this subset (8.5 possible definitions per abbreviation, as shown in Table 1).
On the test set for human performance (i.e., abbreviation expansion for DAT), DECBAE and even some neural baselines outperform the single human expert.

Analysis
In Fig. 3, we use confusion matrices to visualize the differences between DECBAE or the human expert and the ground truth labels, for disambiguation of abbreviation "DAT". The high agreement level between human expert predictions and the automatically assigned labels indicates that our pipeline of collecting the abbreviation disambiguation dataset is valid.
In general, both DECBAE and the human expert perform well in the task, with only a few misclassifications. Specifically, DECBAE, and even other neural baselines like biLSTM and BioELMo, outperform the human expert in all metrics. Compared to DECBAE, the human expert is more likely to confuse direct agglutination test with direct antiglobulin test (9 vs. 1), and dementia of the Alzheimer type with dopamine transporter (7 vs. 0). We show several instances of human and DECBAE errors in Table 3.
One limitation of this work is that we only test DECBAE on our automatically collected dataset. Since the proposed model can also be applied to other biomedical abbreviation expansion datasets, evaluating it on datasets like MSH WSD is a clear direction for future work.
Another potential direction for improvement is to accelerate inference. Currently DECBAE uses BioELMo for embedding and an abbreviation-specific biLSTM for classification, resulting in two recurrent models in total. Our results show that BioELMo with several FFN layers alone also generates decent results, so in some cases we might use only BioELMo as a compromise for faster inference.

Conclusion
We present DECBAE, a state-of-the-art biomedical abbreviation expansion model on the automatically collected dataset from PubMed. The results show that, with only minimal expert involvement, we can still perform well in such a domain-specific task by automatically collecting training data from a large corpus and utilizing embeddings from pre-trained biomedical language models.

Acknowledgement
We are grateful to the anonymous reviewers of BioNLP 2019, who gave us very insightful comments and suggestions. J.L. is supported by NLM training grant 5T15LM007059-32.