Toward Cross-Domain Engagement Analysis in Medical Notes

We present a novel annotation task evaluating a patient's engagement with their health care regimen. The concept of engagement supplements the traditional concept of adherence with a focus on the patient's affect, lifestyle choices, and health goal status. We describe an engagement annotation task across two patient note domains: traditional clinical notes and a novel domain, care manager notes, where we find engagement to be more common. The annotation task resulted in a κ of .53 (moderate to substantial agreement), suggesting that annotators share reliable intuitions regarding engagement-bearing language. In addition, we report the results of a series of preliminary engagement classification experiments using domain adaptation.


Introduction
The recent trend in medicine toward health promotion, rather than disease management, has foregrounded the role of patient behavior and lifestyle choices in positive health outcomes. Social-cognitive theories of health promotion (Maes and Karoly, 2005; Bandura, 2005) stress patient self-monitoring of lifestyle choices, goal adoption, and the enlistment of self-efficacy beliefs as health-promotive. We call this cluster of behavioral characteristics patient engagement. Traditional strategies of patient follow-up have also been affected by this trend: healthcare providers increasingly employ "care managers" (CMs) to monitor patient well-being and adherence to physician-recommended changes in health behavior, i.e., engagement. In this paper, we present an annotation schema for (lack of) engagement in CM notes (CMNs) and generalize the schema to the related domain of electronic health records (EHRs). Our high-level research questions are: (1) Is the concept of engagement sufficiently well-defined that annotators can recognize the concept across text domains with an acceptable level of agreement?
(2) Can the annotations produced in (1) be used to classify engagement-bearing language across text domains?
In section 3, we report the results of our exploration of (1), describing an annotation task involving ~6,500 CMN and EHR sentences that resulted in an average κ of .53. In sections 4 and 5 we address (2) and report the results of several classification experiments that ablate classes of features and use domain adaptation to adapt these features to the CMN and EHR target domains.

Related Work
The notion of patient engagement explored here is inspired by the self-regulation paradigm of Bandura (2005), Leventhal et al. (2012), and Mann et al. (2013), where a patient's successful completion of health-related goals is predicated on their ability to "self-regulate", i.e., to plan and execute actions that promote attaining those goals, and their ability to maintain a positive attitude toward self-care. We are also aligned with the more recent work of Higgins et al. (2017), whose definition of engagement includes a "desire and capability to actively choose to participate in care".
NLP approaches assessing doctor compliance include Hazelhurst et al. (2005) who evaluate notes for doctor compliance to tobacco cessation guidelines and Mishra et al. (2012) who assess ABCs protocol compliance in discharge summaries.

Label: Engagement with care
Description: The patient is engaged in their well-being by describing/exhibiting healthy behavior, positive outlook, and social ties.
Examples: "Patient disappointed by lack of weight loss but is just beginning exercise regimen"; "Patient joined book club."

Label: Engagement with CM
Description: Adherence to a doctor or CM instruction, or understanding of CM advice.
Examples: "Patient verbalized understanding"; "Patient confided that she has gaps in nitroglycerin use."

Label: Lack of engagement with care
Description: Lack of engagement shown by language suggestive of non-adherence to guidelines, health-adverse behavior, lack of social ties, or a negative impression of patient self-care.

Label: Lack of engagement with CM
Description: Non-adherence to a prescribed instruction or a negative response to interaction.
Examples: "Patient rude during call"; "Patient angrily refused further outreach."

Label: CM Advice
Description: CM advice or suggestion.
Example: "I suggested he watch his diet and increase exercise"

Label: Other
Description: Default label to be chosen when no other label fits.
Examples: "Patient has a history of atrial fibrillation on corticosteroids"; "Chest is clear with no crackles."

Table 1: Annotation labels with descriptions and anecdotal examples. We use the term CM to describe both the para-professionals interacting with patients in CM notes and the physicians in EHRs.
While there exists work dealing with sentiment in clinical notes, such as positive or negative affect (Ghassemi et al., 2015) and speculative language (Cruz Díaz et al., 2012), (lack of) engagement cannot be reduced to sentiment. Lack-of-engagement-bearing language, for example, can also contain positive sentiment, e.g., patient is feeling better so she has stopped taking her medication. We include sentiment in our feature set, as described in Section 4.
The most closely related work is Topaz et al. (2017) who developed a document-level discharge note classification model that identifies the adherence of a patient in the discharge note. Their annotation task differs from ours, however, as they focus only on lack of adherence, specifically, towards medication, diet, exercise, and medical appointments. We also distinguish the targets of both engagement and lack of engagement by allowing annotators to identify either the CM or the care itself as the target.

Annotation Task and Data
The majority of our data consists of CMNs generated by a care manager service located in Florida, USA. CMs typically contact patients via phone to inquire into the patient's status with respect to health goals and enter the resulting information into the structured sections of a reporting tool. In addition, CMs record their impressions of the patient in a note as unstructured text, which we use here. To expand the domain scope of the task, we included EHR notes from the i2b2 Heart Disease Risk Factors Challenge Data Set, which includes notes dealing with diabetic patients at risk for Coronary Artery Disease (CAD). All notes were annotated in the same manner regardless of source. Table 1 includes descriptions of the annotation labels along with anecdotal examples of each label type (original sentences are excluded due to privacy constraints). Annotators were allowed to choose more than one label for each sentence, or no label at all (considered other). Our schema captures three different label classes: engagement, lack of engagement, and cm advice. We included cm advice because it can provide an indication that the next sentence should be classified as (lack of) engagement. We initially explored "barrier" language (e.g., patient could not get to his appt because he didn't have a car) as this can be indicative of lack of engagement; however, we found it to be too rare to include in the annotation tasks.

Annotation Challenges
Our first challenge was encoding a distinction between engagement and the more familiar concept of patient "adherence" (Vermeire et al., 2001; Topaz et al., 2017) in the annotation guidelines. While engagement-bearing language can include adherence-bearing language (e.g., is monitoring blood sugar, made follow-up appointment), the reverse is often not the case: Engagement-bearing language can include mentions of social ties (e.g., discusses struggles to lose weight with sister) and positive or negative evaluations of health-related goals (e.g., patient was irritable when asked about efforts to reduce smoking), neither of which involve adherence per se. By annotating such examples as engagement-bearing, we capture "self-efficacy beliefs," which theories of patient self-regulation (Bandura, 1998, 2005) have suggested are predictive of health goal attainment. An additional distinction that emerged during the annotation process involved the target of the engagement-bearing language: Is the patient (not) engaged with the CM or with the care itself? This distinction is evident in sentences that display a lack of engagement with care but a level of engagement with the CM. For example, in the sentence He appeared cheerful in our interactions and admitted that he has not been exercising daily, the patient is confiding in their CM (engagement) that they are not pursuing their health goals (lack of engagement). By allowing annotators to annotate such sentences as both engaged with the CM but unengaged with care, we were able to exclude sentences that contained internally inconsistent engagement-bearing language from our data.
Another challenge involved the frequent use of "canned language" in the CM data, i.e., language that does not report the CM's interactions with the patient but is used to meet some reporting criterion recommended by the health-care provider. For example, Patient is scheduled for follow up appointment in two weeks is a frequently occurring example of such canned language. We therefore excluded common canned language sentences from the data.

Data Statistics
After several initial pilot rounds, inter-annotator agreement for our six annotators on a final pilot round of 200 sentences (100 from each source) ranged from .46 to .66, with an overall average of .53 (using Cohen's κ), indicating moderate to substantial agreement (McHugh, 2012). We annotated 4,011 CMN sentences, extracted from ~10,000 unique CMNs. In order to broaden the range of language in our data, we also annotated 2,561 EHR sentences, with an equal number of sentences drawn from the three patient cohorts included in the i2b2 data. For each EHR, we restricted our annotation effort to sections that were more likely to include engagement-bearing language, specifically, the social history, family history, personal medical history, and history of the present illness sections. Table 2 shows the label distribution of the annotated data relative to note source. Although we allowed the annotators to differentiate between engagement/lack of engagement with care and with the CM, we conflate these targets in the classification experiments below.
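Pairwise Cohen's κ of the kind reported above can be computed with scikit-learn. A minimal sketch follows; the annotator labels are invented stand-ins, not sentences or judgments from our data:

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Hypothetical sentence-level labels from three annotators on the same pilot set.
annotators = {
    "A1": ["eng_care", "other", "lack_cm", "other",    "eng_cm", "other"],
    "A2": ["eng_care", "other", "lack_cm", "eng_care", "eng_cm", "other"],
    "A3": ["other",    "other", "lack_cm", "other",    "eng_cm", "other"],
}

# Average pairwise Cohen's kappa, the agreement measure used above.
scores = [cohen_kappa_score(annotators[a], annotators[b])
          for a, b in combinations(annotators, 2)]
print(min(scores), max(scores), sum(scores) / len(scores))
```

Reporting the range alongside the average, as in the figures above, exposes whether any single annotator pair diverges sharply from the rest.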

Method
Given the small size of our data we elected to use a feature-engineering-based approach along with a discriminative classification algorithm in our experiments. Our features can be divided into five categories: lexico-syntactic, lexical-count, sentiment, medical, and embeddings.
Lexico-syntactic. Standard NLP features for text classification such as n-grams and part-of-speech (POS) tags, along with dependency tuples (De Marneffe and Manning, 2008) with either the governor or dependent generalized to its POS.
Lexical-count. Frequency-based features such as sentence length, min and max word length, and number of out of vocabulary words.
Sentiment. We ran two sentiment classifiers over the data (Socher et al., 2013; Hutto and Gilbert, 2014) and included the resulting tags as features. In addition, we developed "comply word" features by inducing a lexicon based on WordNet- (Fellbaum, 1998) and Unified Medical Language System (UMLS)-based synonym expansion of seed words such as "take" and "decline."
Medical. Using the MetaMap tool, we generated Concept Unique Identifiers (CUIs) for any medical concepts in the sentence. We also included both the "preferred names" and semantic types returned by UMLS for each concept.
Embeddings. We extracted term-term, CUI-CUI, and term-CUI co-occurrence pairs from a large medical corpus and used word2vecf (Levy and Goldberg, 2014) to learn embeddings from this co-occurrence dataset. We generated the mean of the embeddings for all content words and CUIs in the sentence as a feature.
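The mean-embedding feature can be sketched as follows; the two-dimensional vectors and the CUI string are invented placeholders rather than actual word2vecf output:

```python
import numpy as np

# Toy embedding table standing in for the learned word2vecf vectors;
# in practice these come from the term/CUI co-occurrence corpus above.
emb = {
    "patient":  np.array([0.1, 0.3]),
    "exercise": np.array([0.5, -0.2]),
    "C0015259": np.array([0.4, 0.0]),   # hypothetical CUI placeholder
}

def mean_embedding(tokens, table, dim=2):
    """Mean of the embeddings of all in-vocabulary tokens (content words or CUIs);
    a zero vector when nothing is in vocabulary."""
    vecs = [table[t] for t in tokens if t in table]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# "daily" is out of vocabulary here and is simply skipped.
feat = mean_embedding(["patient", "exercise", "C0015259", "daily"], emb)
```

Averaging keeps the feature length fixed regardless of sentence length, which suits the fixed-width input an SVM expects.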

Experiments
All experiments were performed using an SVM classifier with a linear kernel and a one-vs-rest multiclass classification strategy as implemented in scikit-learn. To deal with the skew in class distribution we experimented with both over- and under-sampling, but got our best performance by simply adjusting class weights to be inversely proportional to class frequencies. Given the relatively small size of our data we used 5-fold cross-validation throughout. We also conflated cm advice with other to boost performance. We show F-score results for all three classes, but our analysis will focus on (lack of) engagement since other is trivially high-performing due to the massive data skew.
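This setup can be sketched in scikit-learn as below; the synthetic dataset, feature count, and class proportions are placeholders chosen only to mimic the heavy skew toward other, not our actual features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Synthetic, heavily skewed three-class data standing in for the
# engagement / lack-of-engagement / other label distribution.
X, y = make_classification(n_samples=600, n_features=50, n_informative=10,
                           n_classes=3, weights=[0.05, 0.10, 0.85],
                           random_state=0)

# class_weight="balanced" sets weights inversely proportional to class
# frequencies, as described above; linear-kernel SVM, one-vs-rest strategy.
clf = OneVsRestClassifier(LinearSVC(class_weight="balanced", max_iter=5000))

# 5-fold cross-validation, scoring macro-F1 so the rare classes count equally.
scores = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
print(scores.mean())
```

Macro-averaged F1 is used in the sketch because, as noted above, accuracy on other alone would be trivially high under this skew.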
In our first set of experiments we examined the impact of training and testing on EHRs and CMNs individually, as well as together, while ablating the feature classes described in section 4. As shown in Table 3, all feature classes seem to help the model, but sentiment helps more for predicting lack of engagement in the CMNs while medical features help more for predicting lack of engagement in the EHRs. These experiments show that a CMN-trained model can perform well on EHRs. The best result for lack of engagement occurs when training on CM notes, with an F-score of 13.0.
The results in Table 3 suggested that we could improve performance on the smaller dataset (EHRs) while also taking advantage of the larger dataset (CMNs) by considering EHRs to be "in-domain" and CMNs to be "out-of-domain" and applying domain adaptation (DA). As this is still preliminary work, we started with a simple, yet effective DA strategy: the feature representation transformation procedure described in Daumé III (2007). Table 4 shows the results using DA. In the EHRs, where there is less data, DA provided an improvement on average, particularly for lack of engagement.
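Daumé III's procedure simply triples the feature space into shared, source-only, and target-only blocks, letting the classifier learn which features transfer across domains. A minimal sketch, with toy matrices standing in for real feature vectors:

```python
import numpy as np

def augment(X, domain):
    """Daumé III (2007) 'frustratingly easy' feature augmentation.
    Each feature vector is copied into a shared block plus one
    domain-specific block: [shared | source-only | target-only]."""
    zeros = np.zeros_like(X)
    if domain == "source":           # out-of-domain, e.g., CMNs
        return np.hstack([X, X, zeros])
    return np.hstack([X, zeros, X])  # in-domain, e.g., EHRs

# Toy feature matrices: 4 out-of-domain rows, 2 in-domain rows, 3 features.
X_cmn = np.ones((4, 3))
X_ehr = np.full((2, 3), 2.0)

# Train a single classifier on the concatenated, augmented data.
X_train = np.vstack([augment(X_cmn, "source"), augment(X_ehr, "target")])
```

Features useful in both domains receive weight in the shared block, while domain-specific quirks are absorbed by the per-domain copies.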

Conclusion
In this paper we presented an annotation schema that captures engagement in CMNs and EHRs. We described the challenges of developing an annotation schema for a subjective task and showed that annotators achieved moderate to high agreement in our final task. We annotated 6,572 sentences for (lack of) engagement and showed preliminary results of a classification experiment on our dataset using feature ablation and domain adaptation. Our results are promising, showing that both features and domain adaptation are useful. However, they remain preliminary due to the rarity of (lack of) engagement labels. In future work, we plan to explore transfer learning to increase the size of our data, which in turn will allow us to explore deep learning approaches to this task.