Evaluation of Transfer Learning for Adverse Drug Event (ADE) and Medication Entity Extraction

We evaluate several biomedical contextual embeddings (based on BERT, ELMo, and Flair) for the detection of medication entities such as Drugs and Adverse Drug Events (ADE) from Electronic Health Records (EHR) using the 2018 n2c2 ADE and Medication Extraction (Track 2) data-set. We identify best practices for transfer learning, such as language-model fine-tuning and scalar mix. Our transfer learning models achieve strong performance on the overall task (F1=92.91%) as well as on ADE identification (F1=53.08%). Flair-based embeddings excel at identifying context-dependent entities such as ADE. BERT-based embeddings excel at recognizing clinical terminology such as Drug and Form entities. ELMo-based embeddings deliver competitive performance on all entities. We develop a sentence-augmentation method for enhanced ADE identification that benefits BERT-based and ELMo-based models by up to 3.13% in F1. Finally, we show that a simple ensemble of these models outperforms most current methods in ADE extraction (F1=55.77%).


Introduction
Adverse Drug Events (ADE) arising from medical intervention with drugs account for 1.3 million visits to the emergency department in the United States alone (CDC, 2017). Randomized controlled trials (RCTs), the primary mechanism for monitoring and identifying ADEs, are hampered by insufficient sample sizes (Sultana et al., 2013). Pharmacovigilance databases such as the Food and Drug Administration's Adverse Event Reporting System (FAERS) strive to be authoritative sources for physicians; however, they require regular manual data entry (Hoffman et al., 2014; Chedid et al., 2018).
Electronic Health Records (EHRs) contain valuable information about patient medication history: drugs prescribed, reasons for administration, dosages/strengths, and ADEs. Automated extraction of these medication entities by Natural Language Processing (NLP) techniques can facilitate wide-scale pharmacovigilance (Moore and Furberg, 2015; Liu et al., 2019a).
Incorporating such a predictive system within the clinical note-taking interface may help the physician by alleviating the need to access external clinical decision support applications (Chen et al., 2016). For instance, if a physician notes down 'started on Dilantin for seizure prophylaxis for a few days', the text could be quickly parsed, highlighting 'Dilantin' as a drug, 'seizure prophylaxis' as the reason for administration, 'few days' as the duration, and warning of 'eye discharge', 'oral sores', etc. as potential ADEs. In this example, 'seizure prophylaxis' and 'few days' may occur anywhere in the clinical text, but only in the context of 'Dilantin' do they indicate the reason and duration of administration. Such 'dynamic' interfaces can also help medical students learn from their collective experiences.
Among medication entities, ADE and Reason are challenging to disambiguate (Henry et al., 2020). Frequently, the specific reason for drug administration appears in a subsequent sentence (Dandala et al., 2020). Moreover, ADE data-sets include gold annotations for these entities only if they are associated with a drug, which significantly reduces the number of gold annotations (Wei et al., 2020).
As part of our work on uniting clinical decision support functions and note-taking interfaces, we needed to develop a high-performing medication extraction model using open-source NLP frameworks. Following Miller et al. (2019), we modeled this as a named-entity recognition task (Uzuner et al., 2011; Si et al., 2019) and experimented with transfer learning using openly available biomedical contextual embeddings. In this context:
1. We evaluate transfer learning models incorporating BioBERT (Lee et al., 2020), ClinicalBERT (Alsentzer et al., 2019), ELMo (Peters et al., 2018), and Flair (Akbik et al., 2018) contextual embeddings pre-trained on PubMed abstracts (Fiorini et al., 2018).
2. We evaluate embedding-specific methods to maximize performance: language-model fine-tuning, scalar mix, and sub-word token aggregation.
3. Based on the performance of the transfer learning models, we develop procedures for enhanced ADE and Reason identification. Sentence augmentation at prediction time benefits ADE extraction by up to +3.13% in F1. It also facilitates a deeper understanding of the behavior of the embeddings. Ensembling strategies improve performance on all three challenging entities: ADE, Duration, and Reason, with up to +2.63% in F1 gains for ADE.
Our main intention was to get a transfer learning pipeline working with these embeddings, and therefore we did not perform any detailed hyper-parameter optimization. Despite this, we achieved strong performance with all the embeddings. Standalone models achieved F1-scores of 53.08% in ADE extraction and 92.91% in the overall task with default features. A basic ensemble constructed from these standalone models achieved F1-scores of 55.77% in ADE extraction and 92.82% in the overall task, confirming the viability of the overall strategy.
Related Work

The 2018 n2c2 Adverse Drug Events and Medication Extraction in EHR data-set (Buchan et al.) spurred wide participation in medication entity extraction. Most participants leveraged the BiLSTM-CRF neural model in their work (Chalapathy et al., 2016). We list the top-performing methods from the 2018 n2c2 ADE challenge in Table 1. Dandala et al. (2020) custom-trained biomedical ELMo embeddings using the MIMIC-III data-set (Johnson et al., 2016); they also used a rich set of sentence tokenization rules. Ju et al. (2020) leveraged a tree architecture to detect overlapping spans in addition to lexical and knowledge features (e.g., word shapes, Human Disease Ontology / MedDRA side-effect database information).
Relationship association for medication entities is complementary to our work and can be implemented either jointly or in a pipeline. Such a joint architecture utilizes the signals from the relations task to filter out unwanted medication entities. Wei et al. (2020) adopted such a joint-approach with a three-classifier ensemble achieving 52.95% in ADE extraction. Chen et al. (2020) also used a joint-architecture supplemented by UMLS (Bodenreider, 2004) concept lookups and unique modeling of temporal entities. Dai et al. (2020) cascaded classifiers sequentially to widen the contextual information available for ADE identification. This model also facilitates improved identification when spans overlap. They evaluated ten pre-trained embedding models: half of them were based on MIMIC-III while the rest were general-purpose. Kim and Meystre (2020) uniquely leveraged SEARN (Daumé et al., 2009), a search-based prediction algorithm for its preference of precision over recall.
Our work is most similar to Miller et al. (2019); they demonstrate that strong medication extraction models can be constructed with minimal engineering using contextual embeddings. The main differences from the above-mentioned studies are our evaluation of a broader array of contemporary biomedical embeddings, a detailed study of fine-tuning strategies, and augmentation methods for ADE extraction.

Data and Pre-Processing
We use the 2018 n2c2 Adverse Drug Events and Medication Extraction (Track 2) data-set for our experiments. The data-set has a total of 505 clinical notes with nine medication entities, as shown in Table 2. We convert these files into CoNLL 2000 BIO (Begin, Inside, Outside) format after pre-processing: splitting sentences into words, normalizing numeric values, and treating a subset of punctuation characters as word-boundary markers.
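For illustration, the conversion into BIO tags can be sketched as follows. This is a minimal example, not our exact pre-processing code; the token-index span representation is an assumption made for the sketch.

```python
def to_bio(tokens, spans):
    """Convert a tokenized sentence plus (start, end, label) spans into
    BIO tags for CoNLL-style output.  Spans use token indices; `end`
    is exclusive.  Tokens outside every span are tagged 'O'."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label            # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label            # continuation tokens
    return list(zip(tokens, tags))

# Hypothetical example sentence from the Introduction
tokens = ["Started", "on", "Dilantin", "for", "seizure", "prophylaxis"]
spans = [(2, 3, "Drug"), (4, 6, "Reason")]
```

Each `(token, tag)` pair is then written out one per line, with a blank line between sentences, per the CoNLL convention.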

Transfer Learning Model
We formulate the medication extraction task as a standard NER task incorporating a single biomedical embedding from the list below:
1. BioBERT (BB) is a version of BERT pre-trained on PubMed abstracts. We used the Base version.
2. ClinicalBERT (CB) is a version of BERT further pre-trained on clinical notes (Alsentzer et al., 2019).
3. ELMo-PubMed (EP) is an ELMo contextual embedding pre-trained on PubMed abstracts.
4. Flair-PubMed (FP) is a Flair contextual embedding pre-trained on PubMed abstracts.
We also incorporated the GloVe (Pennington et al., 2014) classical word embedding into our model after a brief evaluation (Section 4.2). Our architectural formulation allows experimenting with newer or combined embeddings with incremental effort.

Experimental Setup
We implement our models using the Flair open-source framework (Akbik et al., 2019). Flair, based on PyTorch, provides an off-the-shelf BiLSTM-CRF model and a pluggable architecture for adding embeddings and data-sets. We retained default hyper-parameters and training procedures (details in Appendix A). During parameter selection, we train for 50 epochs. Final models are trained for 150 epochs or until convergence. We used the evaluation script provided as part of the data-set to appraise our models on the test-set. We report the 'Relaxed F1' score per prevailing practice.

Model Selection Procedures
In transfer learning, the linguistic information encoded by a contextual embedding acts as the primary input to the downstream task layer (BiLSTM). Fine-tuning is generally accepted to be beneficial. However, it requires familiarity with the scripts and frameworks specific to each embedding, as well as data-set adaptation.

BERT Embeddings
BERT-Base models have a dozen transformer layers. Understanding the linguistic information encoded by these layers and their relative contribution to downstream tasks is an active research area (Liu et al., 2019b; Kovaleva et al., 2019). By default, Flair uses the last four layers of the BERT models to generate embeddings.
1. Choice of Layers (4L vs. All): The default setting of the last four transformer layers leads to sub-optimal performance (under-fitting) on the training set (Table 3, Row 1). Rather than choosing specific layers, we tried using all layers. This option generates a vast number of features (11 x 768) for the downstream task (BiLSTM) and causes training to run out of memory.

2. Scalar Mix (SM): As an alternative, we adopted Scalar Mix (Peters et al., 2018), a pooling mechanism over the layer-generated representations. Scalar Mix results in a reasonable number of features (768) and performs optimally (Row 2).
3. Mean-Pooling of sub-tokens (MP): BERT models uniquely use word-piece tokenization for out-of-vocabulary (OOV) words. Embeddings can be generated using the first sub-token, the first and last sub-tokens, or an aggregate (mean-pooling) of all sub-tokens. The last option provides the best performance (Row 3).
These settings deliver optimal performance for the BERT models.

Akbik et al. (2018) show that paired use of classic word embeddings (such as GloVe) and contextual embeddings improves sequence-labeling performance.
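The scalar-mix pooling described above computes a softmax-weighted sum of a token's per-layer representations, scaled by a learned factor gamma (Peters et al., 2018). The following plain-Python sketch illustrates the mechanism with toy two-dimensional vectors; it is not the framework's implementation.

```python
import math

def scalar_mix(layer_vectors, weights, gamma=1.0):
    """Collapse L per-layer vectors of size d into one d-dimensional
    vector: softmax-normalize the learned layer weights, take the
    weighted sum, and scale by gamma (Peters et al., 2018)."""
    exp_w = [math.exp(w) for w in weights]
    total = sum(exp_w)
    norm = [w / total for w in exp_w]          # softmax over layer weights
    dim = len(layer_vectors[0])
    mixed = [0.0] * dim
    for s, vec in zip(norm, layer_vectors):
        for i in range(dim):
            mixed[i] += s * vec[i]
    return [gamma * x for x in mixed]
```

In training, the weights and gamma are learned jointly with the downstream task, so the model decides how much each transformer layer contributes; the feature size stays at 768 regardless of the number of layers.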

Flair Embedding Fine-Tuning
Language-model fine-tuning aims to improve the performance of Flair-PubMed contextual embeddings on specialty corpora. We performed fine-tuning for 10 epochs using the 4391 clinical notes from the i2b2/n2c2 data-sets. While all entities exhibited gains, the prominent gainers are shown in Table 5. We used this fine-tuned model for the rest of our experiments.

Error Analysis

3. Form and Route: Unusual Routes ('take one tab under your tongue') were naturally ignored by all models. Commonly, the method of drug administration is also used to describe the drug form. In 'Heparin 5,000 unit/mL Solution Sig: One (1) Injection TID (3 times a day)', 'Injection' refers to the method of administration and is hence a Route, while in 'EGD with epinephrine injection and BICAP cautery', it refers to the drug Form. Likewise, 'infusion' generates disagreement. BERT models generally do well.
4. Dosages and Strength: Dosages were most commonly mislabeled as Strengths ('iron 0.5 ml per day') by all models, followed by Frequency. In 'levophed @ 12 mcg/min', the FP model correctly identifies 'mcg/min' as Strength, while the other models identify it as Frequency.

Mislabeling between ADE and Reason: The CB model generates the highest number of mislabels (low recall), while EP does the best, as shown in Table 8.

Mislabeling of ADE/Reason with Drug:
In 'Heme/onc was consulted regarding hemolysis and anticoagulation. ... Given her multiple indications for anticoagulation, decision was made to begin coumadin ...', the first reference to 'anticoagulation' is a Drug gold annotation ('blood thinners') while the latter is a Reason ('medical indication'). This example demonstrates the need for good contextual disambiguation. The BB and FP models identify both correctly. The EP model ignores the former and incorrectly identifies the latter as a Drug. The CB model fails to identify both entities. The context needed to disambiguate an ADE or Reason may occur in a subsequent sentence, creating a challenge for the model. We verify this hypothesis in the next section.

Prediction-time Sentence Augmentation
We evaluated model behavior by combining a sentence with one or more of its subsequent sentences. For example, the 'Look-ahead-1' strategy pairs a sentence with the one immediately following it. We progressively increased the pairing length up to a paragraph. Table 10 shows the ADE performance resulting from this augmentation strategy. Table 11 lists several examples (Drug entities are marked bold when they occur in the subsequent sentence).
1. Reason: 'Hypothyroid' is detected by augmentation due to the co-occurrence of 'Syn-
3. In Ex. 5, 'altered mental status' is identified at sentence level but is un-annotated (despite 'somnolent' indicating a drowsy state). 'AMS' is recognized by augmentation but is un-annotated, probably because of its diagnostic nature.
The 'Look-ahead-1' strategy is the most effective: ADE F1 scores increase by +3.11%, +2.21%, and +1.67% for the BB, CB, and EP models, respectively, despite a reduction in Precision. Recall gains for the FP model are offset by a larger reduction in Precision. For the Reason entity, all models benefit from augmentation, with gains ranging from 0.51% to 1.23%. This exercise shows that inter-sentence word context impacts ADE and Reason identification and is beneficial when the underlying model is unable to contextualize effectively.
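The look-ahead augmentation itself can be sketched as follows. This is a minimal illustrative version; after tagging the augmented strings, predictions must be mapped back to the tokens of the original sentence, and the details of that mapping are an implementation choice omitted here.

```python
def look_ahead(sentences, k=1):
    """Prediction-time 'Look-ahead-k': join each sentence with up to k
    subsequent sentences so that inter-sentence context (e.g. a drug
    mentioned in the next sentence) is visible to the tagger.  The
    final sentences simply get shorter windows."""
    augmented = []
    for i in range(len(sentences)):
        window = sentences[i : i + k + 1]      # this sentence + up to k more
        augmented.append(" ".join(window))
    return augmented
```

With k=1 ('Look-ahead-1'), each sentence is paired with the one immediately following it; increasing k extends the pairing toward paragraph length.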

Model Ensembles
We briefly evaluated model ensembling strategies for enhanced ADE performance. We generate predictions from the underlying models and combine non-conflicting entities. In the case of a conflict, we prioritize ADE predictions; otherwise, we choose the entity using the confidence score. Table 12 shows three ensemble models based on their 'Overall F1' scores. Table 13 shows the entity-wise performance of the FP+EP ensemble model (selected based on the highest ADE F1 score). The ensemble model delivers the best performance on all three challenging entities: ADE, Duration, and Reason, validating the feasibility of the strategy.

Limitations

There are a few limitations in this study that we plan to address in future work:
1. We did not fine-tune the BERT- and ELMo-based embedding models. Doing so may alter the performance profile of these models. Hence, an apples-to-apples comparison between the models is not recommended.
2. Adoption of better tokenization methods (e.g., clinical text processing tools) and handling of special cases (such as abbreviations) may further enhance model robustness.
3. We also did not conduct an exhaustive survey of the available embeddings; other embeddings may prove more effective.
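The conflict-resolution rule used in our ensembles can be sketched as follows. The span tuple format and the tie-breaking when both conflicting spans are ADEs are illustrative assumptions, not the exact implementation.

```python
def resolve(a, b):
    """Given two overlapping spans (start, end, label, confidence),
    keep the ADE prediction if exactly one span is an ADE; otherwise
    keep the higher-confidence span."""
    if (a[2] == "ADE") != (b[2] == "ADE"):
        return a if a[2] == "ADE" else b
    return a if a[3] >= b[3] else b

def merge_entities(pred_a, pred_b):
    """Ensemble two models' entity predictions: unite non-conflicting
    spans; resolve each overlap with the rule above."""
    merged = list(pred_a)
    for ent in pred_b:
        clashes = [e for e in merged if e[0] < ent[1] and ent[0] < e[1]]
        if not clashes:
            merged.append(ent)                 # no conflict: keep both
        else:
            winner = ent
            for rival in clashes:
                winner = resolve(winner, rival)
                merged.remove(rival)
            merged.append(winner)
    return sorted(merged)
```

For example, merging a Reason span with an overlapping ADE span keeps the ADE, while two overlapping non-ADE spans are resolved by confidence.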

Conclusion
In this study, we presented strong-performing transfer learning models for the extraction of medication entities using several biomedical contextual embeddings. Our experiments shed light on the strengths of the various embeddings: the Flair-PubMed embedding excels at ADE extraction; the BioBERT and ClinicalBERT embeddings excel at recognizing Drug and Form medication entities; the ELMo-PubMed embedding delivers competitive performance on all medication entities. We showed that sentence augmentation and ensembling are viable strategies for enhancing ADE performance. Our approach is free of hand-generated features and built using off-the-shelf neural models, default hyper-parameters, and training procedures. These factors decrease the development effort. A detailed analysis of embedding-specific factors contributing to mis-classification and the inclusion of fine-tuning procedures are part of our ongoing work.

Acknowledgements
We thank the anonymous reviewers for their valuable suggestions and feedback. This work was supported by the biomedical AI groups of Amrita Technologies, Amritapuri, India and Amrita Institute of Medical Sciences, Kochi, India.
Availability of Data and Materials
1. The 2018 n2c2 ADE and Medication Extraction (Track 2) data-set is protected by a Data Usage Agreement. It can be obtained from the Harvard DBMI Portal.
2. The code and setup instructions used for the experiments in this paper are available from Git.