Extracting Adherence Information from Electronic Health Records

Patient adherence is a critical factor in health outcomes. We present a framework to extract adherence information from electronic health records, including both sentence-level labels indicating overall adherence (full, partial, none, etc.) and span-level annotations capturing additional details such as adherence type (medication or non-medication), reasons and outcomes. We annotate and make publicly available a new corpus of 3,000 de-identified sentences, and discuss the language physicians use to document adherence information. We also explore models based on state-of-the-art transformers to automate both tasks.


Introduction
Patient adherence, also known as compliance, is the degree to which a patient follows medical advice. Adherence includes not only taking medications as prescribed, but also using medical devices as instructed, following diet and exercise recommendations, etc. As we shall see, adherence is a critical factor in health outcomes, particularly for patients with chronic illnesses. Patients play the central role in following medical advice and communicating adherence to health providers: they largely self-administer, self-regulate and self-report. While insurance claims and prescription fill dates may be available as structured data, these sources provide only a partial view of patient adherence. Indeed, skipping a specialist appointment or failing to fill a prescription is sufficient to detect some forms of non-adherence, but attending appointments and filling prescriptions on time is insufficient to guarantee adherence. Additionally, many other forms of non-adherence are common (e.g., skipping medications, taking the wrong dosage, following a diet only occasionally).
From a computational perspective, including both annotation and model-building efforts, patient adherence has been primarily modeled as a binary decision, with an emphasis on pinpointing non-adherence. Adherence, however, ranges from absolute non-adherence to full adherence, and partial adherence and non-adherence can be due to many factors. Adherence models by medical experts include intentional and unintentional non-adherence (Ng et al., 2014), social support (Simoni et al., 2006) and other patient attributes such as age and time since diagnosis (Weaver et al., 2005).
In this paper, we extract medication adherence information from electronic health records. Natural language processing is required to solve this problem, as physicians record self-reported patient adherence (or non-adherence) in unstructured free text. We identify not only whether a patient has adhered fully or not at all to medical advice, but also partial adherence, as well as doctors reviewing or instructing patients about treatments and the importance of future adherence. In addition, we target for the first time information that helps explain adherence. This information includes adherence type (medication or non-medication), the source of the adherence information (the patient, a caregiver, etc.), the reason for non-adherence (cost, intolerance, forgetfulness, etc.), speculation (doesn't remember, is not sure, etc.), negation (forgot, stopped, etc.), and others. The main contributions of this paper are:
• a corpus of 3,000 de-identified sentences from electronic health records, with annotations indicating which adherence information is discussed (full, partial, none, review) and additional information covering 10 attributes (medication, non-medication, source, target, reason, outcomes, etc.);
• an analysis of the language doctors use to record adherence information; and
• experimental results showing that transformer models can partially automate this task, with error analyses providing insights into when the task is most challenging.

Background and Previous Work
Electronic health records have emerged in the last decade as a standard in healthcare (Jha et al., 2009; Henry et al., 2018). Studies have shown that they are beneficial both in small practices and in large organizations (Buntin et al., 2011). While electronic health records are notoriously difficult to annotate and mine for information (Chapman et al., 2011; Friedman et al., 2013), modern machine learning, and in particular deep learning, has seen many successes (Shickel et al., 2018). Some of the remaining challenges include the lack of a universal standard and the inherent difficulties of working in the medical domain (abbreviations, jargon, incomplete sentences, etc.).

The Consequences of Non-Adherence. Among individuals with chronic illnesses, medication non-adherence is an important contributor to (a) poor patient outcomes, including increased morbidity and mortality, and (b) increased healthcare costs from avoidable hospitalizations and emergency department visits (Ho et al., 2006; Yang et al., 2009; Asche et al., 2011). Non-adherence is associated with $100-$300 billion of avoidable healthcare costs (3-10% of the total) in the US annually (Viswanathan et al., 2012; Iuga and McGuire, 2014). Despite these facts, studies have shown that up to 50% of patients in the US with chronic illnesses stop taking their medications within one year of being prescribed (Lee et al., 2006), and over 31% of prescriptions are not filled within 9 months (Tamblyn et al., 2014).
Adherence and Electronic Health Records. Previous work has measured medication adherence objectively using medication event monitoring systems (Wu et al., 2008). To our knowledge, the work by Turchin et al. (2008) is the first to extract adherence from physician notes. Their system, however, is based on 87 heuristics using pattern matching, and their evaluation considers only 82 notes.
Physician-patient transcripts have been studied as a source of information to predict non-adherence. For example, Howes et al. (2012) conclude that unigrams in psychiatrist-patient transcripts are good predictors of future adherence to treatment for schizophrenia (as determined by the physician), and Howes et al. (2013) investigate the role of automatically identified topics. More recently, and still using traditional machine learning, Wallace et al. (2014) automatically identify whether utterances discuss adherence barriers. Similar to our sentence-level annotations, all of these previous works treat adherence as a text classification task; that is, they assign one label to a piece of text (e.g., a full dialogue transcript, sentence, or utterance). Unlike them, we also consider span-level adherence information including adherence type (medication or non-medication), the reasons for and outcomes of adhering (or not adhering), etc.
Social media has also been studied as a source of adherence information. Onishi et al. (2018) find that out of 400 tweets mentioning drugs, 9 indicate non-adherence and 6 include the reason for not adhering (e.g., adverse effects). They do not, however, attempt to automatically extract any adherence information. Moseley et al. (2020) present a corpus (1,102 discharge summaries and 1,000 nursing progress notes) annotated with 13 patient phenotypes, one of which is non-adherence. They also present results with a CNN, but do not report results identifying non-adherence. Their definition of non-adherence is equivalent to our PARTIAL and NONE adherence (Section 3), and like the other works described above, they reduce the problem to text classification and disregard span-level information.

Deep learning approaches have proven useful to extract information from electronic health records beyond adherence. For example, Jagannatha and Yu (2016) show that BiLSTM networks are useful for medical event detection, and Rosenthal et al. (2019) experiment with transfer learning, GRUs and BERT to identify sections (e.g., allergies, chief complaint, examination, family history, procedures, etc.) in electronic health records.

Annotating Adherence
Large corpora of electronic health records are not publicly available because of privacy concerns; thus, we annotate a new collection of sentences. We summarize below the final annotation guidelines, which were refined over several iterations of pilot annotations and discussions with the annotators.
Source sentences We work with sentences retrieved from real electronic health records of patients suffering from diabetes or mental health disorders. The criterion for selecting sentences was the presence of keywords likely to signal a discussion of adherence: adhere, adhered, adherence, adhering, compliance, complied, taking medications and taking meds. A manual process ensured that all protected health information (e.g., patients' names, hospitals, locations, dates) was replaced with dummy tokens before the annotation process started. We selected and de-identified 3,000 sentences following these steps.
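The keyword-based retrieval step can be sketched as a simple substring filter. This is a minimal illustration using the keyword list from the text above; the function name is our own, not from the paper's codebase:

```python
# Keywords from the guidelines that likely signal a discussion of adherence.
ADHERENCE_KEYWORDS = [
    "adhere", "adhered", "adherence", "adhering",
    "compliance", "complied", "taking medications", "taking meds",
]

def mentions_adherence(sentence: str) -> bool:
    """Return True if the sentence contains any adherence keyword (case-insensitive)."""
    lowered = sentence.lower()
    return any(keyword in lowered for keyword in ADHERENCE_KEYWORDS)
```

Note that such a filter over-generates (e.g., Umbilical cord is still adhered matches), which is why the annotators could discard sentences not discussing medical adherence.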

Sentence-Level Annotations
The first annotation task is to determine whether the patient adheres to the treatment prescribed by the doctor generating the electronic health record. We use five labels:
• FULL: the patient is completely adherent to the treatment as prescribed by the doctor;
• PARTIAL: the patient is following some of the treatment, but not exactly as prescribed by the doctor;
• NONE: the patient is not following the treatment at all;
• REVIEW: either (a) the patient received guidance about how to adhere to the treatment or (b) the doctor reviews the prescribed treatment, but the patient's adherence (or lack thereof) is not discussed; or
• UNKNOWN: the sentence is ambiguous; there is not enough information to choose another label.
FULL adherence requires not only medication adherence, but also adherence to other treatment aspects such as diet, not lifting weights, or resting. PARTIAL adherence covers patient adherence ranging from just above NONE to almost FULL, regardless of the reason (forgetting, intolerance, etc.). We use REVIEW for sentences describing instructions to patients that do not report on the patient's adherence. Tables 1 and 5 exemplify the five sentence-level adherence labels. Annotators could also discard sentences not discussing medical adherence, e.g., Umbilical cord is still adhered. Annotators discarded only 223 sentences (6.9%) because they did not discuss medical adherence.
Span-Level Annotations Regardless of the sentence-level annotations, we further annotate sentences with spans indicating ten attributes related to adherence. These more detailed annotations could be described as the slots of an adherence event or, following FrameNet (Baker et al., 1998), the frame elements of an instance of the adherence frame. We use ten span-level labels:
• MEDICATION: medications that are part of a treatment;
• NON-MEDICATION: any form of treatment other than medications (e.g., diet, exercise);
• NEGATION: phrases indicating that the patient is non-adherent to the treatment;
• SPECULATION: phrases indicating uncertainty on the part of the doctor;
• REASON: explanation or justification for non-adherence (e.g., forgetting, being financially unable, being unable to travel to the pharmacy, adverse effects) or, rarely, for adherence (e.g., encouragement by family);
• OUTCOME: results of non-adherence (e.g., lack of recovery) or adherence (e.g., full recovery, or an adverse effect the patient had to tolerate in order to be adherent);
• SOURCE: the person reporting adherence (usually the patient, but also relatives and caregivers);
• TARGET: the individual responsible for adherence (e.g., the patient, a caregiver);
• TIME: phrases expressing time and related to adherence;
• PROBLEM: phrases explaining the need for the medication or treatment (e.g., illness, symptoms).
Tables 1 and 6 provide examples, and we briefly summarize important criteria from the annotation guidelines. First, multiple smaller spans are preferred to one larger span covering several individual elements.

Table 2: Label frequencies (percentage of sentences with each sentence-level and span-level label) and agreements. We detail agreements in the annotation and adjudication phases.

Annotation Process The annotation process consisted of two phases and involved four individuals with complementary expertise. In the first phase, two natural language processing practitioners completed both annotation tasks independently. In the second phase, a medical scribe and a doctor adjudicated the annotations resulting from the first phase. In the annotation phase, both annotators annotated each sentence independently. In the adjudication phase, both adjudicators adjudicated 20% of the sentences, and the remaining 80% were adjudicated by a single adjudicator. We discuss agreements in Section 4.

Sentence-Level Annotations
The top block of Table 2 presents the distribution of sentence-level labels and agreements. Annotators could make a decision for the vast majority of sentences (UNKNOWN: 11%). Surprisingly, we found that doctors often record in electronic health records when they REVIEW treatments and emphasize the importance of adherence (30%). This may signal a lack of adherence, or that a doctor suspects future non-adherence and wishes to address the (potential) issue. Complete non-adherence is rare (7%); partial adherence is the most common case (31%), followed by FULL adherence (21%).
We calculate agreements with Cohen's κ (Cohen, 1960) in order to discount the probability of agreeing by chance. Inter-annotator and especially inter-adjudicator agreements are high (All, κ: 0.73 and 0.79). Per-label inter-adjudicator κ coefficients range between 0.56 and 0.92, and are above 0.75 for all labels except UNKNOWN, the least frequent label. Values of κ in the 0.6-0.8 range indicate substantial agreement, and values over 0.8 (nearly) perfect agreement (Artstein and Poesio, 2008).
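Cohen's κ can be computed directly from two annotators' parallel label sequences. A minimal self-contained sketch (equivalent in spirit to scikit-learn's cohen_kappa_score; not the authors' code):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the annotators labeled independently at random,
    # given each annotator's marginal label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in counts_a.keys() | counts_b.keys()
    )
    return (observed - expected) / (1 - expected)
```

Perfect agreement yields κ = 1, while agreement at chance level yields κ = 0.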

Span-Level Annotations
The bottom block of Table 2 presents the frequency of spans per sentence and agreements. Most sentences discussing adherence mention a MEDICATION (68%), and, surprisingly, a substantial number also mention a NON-MEDICATION treatment (41%). Many sentences also include spans indicating non-adherence (NEGATION, 40%), and around a quarter discuss TIME (22%), the SOURCE reporting adherence (23%), or the PROBLEM requiring treatment (and adherence; 28%). These counts show that our span-level annotations provide information beyond the sentence-level annotations.
Calculating inter-annotator agreement for span-level annotations is more involved than for sentence-level annotations (a span covers a sequence of tokens vs. one label per sentence). We follow previous work and consider partial matches in the agreement calculations (Chinchor and Sundheim, 1993; Collins et al., 2016). More specifically, we calculate precision, recall and F1 over the token-level annotations after marking each token as correct, incorrect, partial, spurious or missing. Unlike the aforementioned works, we consider a token a partial match as long as it overlaps with the gold annotations, regardless of whether the last token of the span is the same. We again observe high inter-annotator and inter-adjudicator agreements (F1: 0.6-1.0, all above 0.7 except REASON, which is present in only 6% of sentences).

Span Correlations

Figure 1 presents a heatmap of the correlations between pairs of span-level annotations observed in the same sentence. Specifically, we generate the heatmap taking into account the presence of span-level annotations rather than their counts. We observe that doctors tend not to discuss MEDICATION and NON-MEDICATION adherence (-0.31), or SOURCE and TARGET information (-0.26), in the same sentence. On the other hand, they tend to use NEGATION when discussing REASON (0.22), TIME (0.18) and SOURCE (0.16), and MEDICATION correlates with TIME (0.16) and PROBLEM (0.14).

Spans and Sentence Length

Intuitively, longer sentences have more span-level annotations, and certain span-level annotations include more tokens than others. The plots in Figure 2 confirm this intuition. For example, sentences that include PROBLEM, REASON and SPECULATION tend to be longer than sentences that do not contain these spans, as describing this information requires more tokens than, for example, introducing negation. Sentence length is roughly uniform regardless of whether other span-level annotations are present. The supplementary materials provide similar plots for all span-level annotations.

Spans and Common Words

Figure 3 plots the most common bigrams (i.e., sequences of two words) for two span-level annotations: SPECULATION and SOURCE. We observe that doctors document SPECULATION in the first person (I do, I am), with verbs conveying uncertainty (seems, think) and speculative adverbs (entirely confident, likely, possibly). We also note that negation is quite common in SPECULATION spans: not entirely, (do)n't think, (do)n't know, (I) am not. Regarding SOURCE, we observe that the patient is rarely named (PHI PERSON, de-identified). Rather, doctors refer to the SOURCE with patient or the shorthand pt, personal pronouns (she, he), and a long list of sources that are not the patient: per record, family, mother, mom, etc. Additionally, common communication verbs are used to indicate the SOURCE: says, states, reports, indicates, admits, attests, claims, etc. The supplementary materials include sunbursts for the remaining span-level annotations. We note that most NEGATION spans do not include common negation cues such as not, non or never. Instead, doctors commonly document NEGATION with verbs and adjectives that indicate negation in a nuanced manner: stopped taking, unable to (take, follow, etc.), prefers to, been skipping, minimal compliance, etc.

Table 3: Results obtained classifying sentences into their sentence-level adherence information. The last system is the only one using pretraining specific to the clinical domain.
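The bigram statistics behind this analysis are straightforward to reproduce. A minimal sketch assuming whitespace tokenization (a simplification of whatever tokenizer was actually used; the function name is ours):

```python
from collections import Counter

def top_bigrams(spans, k=5):
    """Return the k most common bigrams (pairs of adjacent words) across text spans."""
    counts = Counter()
    for span in spans:
        tokens = span.lower().split()
        # zip pairs each token with its successor, yielding the span's bigrams.
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(k)
```

Applied to all spans with a given label (e.g., every SPECULATION span), this recovers the kind of frequency lists plotted in Figure 3.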

Experiments and Results
We create stratified train, development, and test splits (75/5/20) and experiment with transformers to automate both annotation tasks. More specifically, we work with four versions of BERT (Devlin et al., 2019). Three are pretrained on general English: the base and large BERT models (110M and 340M parameters) and DistillBERT, a smaller version (65M parameters) shown to be similarly effective. The fourth transformer, ClinicalBERT (Si et al., 2019), is pretrained on text from the clinical domain. We use the transformers package, keras (Chollet and others, 2015), and ktrain (Maiya, 2020) to tune the models with the train and development splits.
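A stratified split preserves the sentence-level label distribution across train, development, and test. A minimal stdlib sketch of the idea (in practice one would use scikit-learn's train_test_split with its stratify argument; the function and its signature are ours):

```python
import random
from collections import defaultdict

def stratified_split(examples, labels, ratios=(0.75, 0.05, 0.20), seed=13):
    """Split examples into train/dev/test while preserving label proportions."""
    by_label = defaultdict(list)
    for example, label in zip(examples, labels):
        by_label[label].append(example)
    rng = random.Random(seed)
    train, dev, test = [], [], []
    # Split each label's examples separately so every split keeps the label mix.
    for label, items in by_label.items():
        rng.shuffle(items)
        n_train = round(len(items) * ratios[0])
        n_dev = round(len(items) * ratios[1])
        train += items[:n_train]
        dev += items[n_train:n_train + n_dev]
        test += items[n_train + n_dev:]
    return train, dev, test
```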
Sentence-Level Annotations The models for sentence-level classification are simple: a transformer to obtain a distributed representation, dropout of 0.2, and a fully connected layer of size 5 with softmax activation to make the classification. Table 3 shows the results obtained with each transformer on the test split after training and tuning with the train and development splits. It is difficult even for humans to draw non-binary conclusions about adherence from a single sentence. Therefore, as intuition would suggest, PARTIAL and UNKNOWN are the most challenging labels for the models to identify correctly. Across all four transformers, these labels have F1-measures well below those of the narrower FULL, NONE, and REVIEW labels, even though NONE represents only 7% of the dataset. ClinicalBERT yields the best results by a small margin (F1: 0.82 vs. 0.78-0.80), particularly on sentences annotated with the FULL label (F1: 0.83 vs. 0.72-0.76). As adherence is related to improved overall health outcomes, ClinicalBERT's domain-specific pretraining likely helps it interpret medical jargon related to outcomes. This advantage disappears with the NONE label, where negation already provides a strong signal and all transformers obtain similar results (F1: 0.88-0.91).
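At inference time, the head on top of the transformer reduces to a linear layer of size 5 followed by a softmax. A plain-Python sketch of that forward pass (the pooled sentence representation, weights, and biases here are illustrative placeholders, not trained parameters from the paper):

```python
import math

LABELS = ["FULL", "PARTIAL", "NONE", "REVIEW", "UNKNOWN"]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(pooled, weights, bias):
    """Fully connected layer of size 5 with softmax activation.

    pooled: the transformer's sentence representation (list of floats)
    weights: one weight vector per label; bias: one bias per label
    """
    scores = [
        sum(w_i * x_i for w_i, x_i in zip(w, pooled)) + b
        for w, b in zip(weights, bias)
    ]
    probs = softmax(scores)
    return LABELS[probs.index(max(probs))], probs
```

Dropout is active only during training, so it does not appear in this inference-time sketch.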

Span-Level Annotations
The models for predicting span-level annotations are slightly more complex. We use a transformer to obtain contextualized word embeddings, followed by a BiLSTM (Graves and Schmidhuber, 2005) of size 100, dropout of 0.5, and a fully connected layer of size 40 with softmax activation, trained as a sequence labeling model that outputs span-level annotations. We use the BILOU standard (Ratinov and Roth, 2009) to represent spans. Table 6 shows the results obtained with each transformer. All models achieve the same overall precision and recall (0.87), though performance differs across span types. We note, however, that BERT-large is the only model to detect SPECULATION. As SPECULATION occurs infrequently in the dataset (1% of sentences), the larger model (more than three times as many parameters as the other transformers) is apparently the only one rich enough to learn any correlation from so few instances.
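Under the BILOU standard, each token is tagged as the Beginning, Inside, or Last token of a multi-token span, as a Unit (single-token span), or as O (outside any span). A minimal encoder illustrating the scheme (token-indexed spans with exclusive ends; the function name is ours):

```python
def to_bilou(num_tokens, spans):
    """Convert (start, end, label) token spans (end exclusive) to BILOU tags."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        if end - start == 1:
            # Single-token span: Unit tag.
            tags[start] = f"U-{label}"
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"L-{label}"
    return tags
```

For example, in Pt stopped taking metformin, the SOURCE span Pt becomes U-SOURCE while the NEGATION span stopped taking becomes B-NEGATION L-NEGATION.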

Error Analysis
We conduct a manual error analysis of a sample of 100 sentences from the test set in order to identify the most common error types made by the best model for each task. We identify well-known linguistic phenomena that are challenging in any domain as well as issues specific to electronic health records. Our future plans include improving the models based on these insights.
Sentence-Level Adherence Information. We exemplify the most common error types in predicting sentence-level annotations in Table 5 (to improve readability, we only show the gold and predicted labels for the error being exemplified). We observe uncertainty in 25% of errors; in the example, it is unclear whether taking tylenol was prescribed by a doctor or the sentence simply describes the patient's background and medical history. Other uncertainty errors involve a lack of information (e.g., The patient is taking medications, but did not bring list and does not remember name of the medications.). Solving 20% of errors would require drawing implications, i.e., inferring adherence information that follows from the sentence even though it is not explicitly stated. In the example, not always consistent implicates sometimes consistent and thus PARTIAL adherence. Other examples include helps him adhere to his medications' schedule better (so he is already partially adhering), would like to see increased compliance (so there is already some compliance), and work together to improve adherence to CPAP (so there is already some adherence). In 16% of errors, the sentence contains a contrast: the doctor first states FULL adherence (e.g., excellent adherence) and then qualifies this statement to indicate PARTIAL adherence (e.g., reports missing 1-2 doses a month). Other contrast errors include She reports compliance with medicine but has been missing simvastatin occasionally. A less frequent error (11%) occurs when adherence information is buried in a long list of medical terms; the example in Table 5 additionally indicates adherence with a numeric score. Finally, our analysis identifies short sentences as the cause of 9% of errors. In the example, the model is misled by the communication verb question(s), which appears to indicate REVIEW although there is not enough information to make this judgment.
Span-Level Adherence Information. Most sentences discussing adherence contain more than one span, and we analyze the most common errors by gold label (Table 6). Most errors mislabeling OUTCOMEs (62%) occur when the gold outcome is a long phrase or clause and the model misses part of it; in the example, the adverb significantly is the only token identified as the outcome. The most common error with NON-MEDICATIONs is missing them altogether (68%). The model is quite effective at identifying non-medication treatments related to diet and exercise regardless of the specific wording (e.g., avoid carbs, eliminate fat), but it often misses infrequent therapies (e.g., nasal pillows). Mentions of generic medications, unlike specific drugs such as methyldopa, hydrocodone and benzopril, are the most common source of errors mislabeling MEDICATIONs. TIME spans are most challenging (39% of errors) when they refer to unspecific temporal information (e.g., occasionally, current). Another common error with TIME is identifying temporal expressions related to when the treatment was prescribed rather than when adherence (or non-adherence) took place.

Conclusions
Patient adherence is critical to positive health outcomes, especially for patients with chronic illnesses. Non-adherence results in $100-$300 billion of avoidable healthcare costs annually in the US alone (Viswanathan et al., 2012; Iuga and McGuire, 2014). Doctor-patient communication is critical for adherence, but ultimately patients have the responsibility to follow medical advice: they largely self-administer, self-regulate and self-report. Crucially, patient adherence is documented by physicians in electronic health records as unstructured free text; thus, extracting adherence information requires natural language processing. Previous work on adherence relied primarily on physician-patient transcripts. Unlike that work, we work with electronic health records as generated by physicians, which are commonplace in both small practices and large organizations. Beyond determining whether patients are fully adherent or not (a binary decision), as previous work does, we differentiate between FULL adherence, PARTIAL adherence, and no adherence at all (NONE). We also identify when doctors document reviewing medical treatments with patients in order to improve adherence. More importantly, in addition to approaching adherence as a sentence-level text classification task, we also consider span-level information including adherence type (MEDICATION or NON-MEDICATION), REASONs and OUTCOMEs. Our annotation effort (3,000 sentences) shows that sentences often contain more than one span.
The work presented here opens the door to applications that would benefit from extracting adherence information from electronic health records, the largest source of physician-generated patient records. For example, it enables studies to identify the most common reasons for and outcomes of non-adherence, and tools to anticipate potential patient non-adherence so that medical staff can apply an intervention.