Classifying reported ability in clinical mobility descriptions

Assessing how individuals perform different activities provides key information for modeling health states of individuals and populations. Descriptions of activity performance in clinical free text are complex, including syntactic negation and similarities to textual entailment tasks. We explore a variety of methods for the novel task of classifying four types of assertions about activity performance: Able, Unable, Unclear, and None (no information). We find that ensembling an SVM trained with lexical features and a CNN achieves 77.9% macro F1 score on our task, and yields nearly 80% recall on the rare Unclear and Unable samples. Finally, we highlight several challenges in classifying performance assertions, including capturing information about sources of assistance, incorporating syntactic structure and negation scope, and handling new modalities at test time. Our findings establish a strong baseline for this novel task, and identify intriguing areas for further research.


Introduction
Information on how individuals perform activities and participate in social roles informs conceptualizations of quality of life, disability, and social well-being. Importantly, activity performance and role participation are highly dependent on the environment in which they occur; for example, one individual may be able to walk around an office without issue, but experience severe difficulty walking along mountain paths. Thus, determining what level of performance an individual can achieve for activities in different environments is critical for identifying ability to meet work requirements, and designing public policy to support the participation of all people.
However, the interaction between individuals and environments makes modeling performance information a complex task. Assessments of activity performance within clinical healthcare settings are typically recorded in free text (Bogardus et al., 2004; Nicosia et al., 2019), and exhibit high flexibility in structure. Syntactic negation can be present, but is not necessarily indicative of inability to perform an action; for example, Patient can walk with rolling walker and Patient cannot walk without rolling walker are both likely to be used to assert the ability of the patient to walk with the use of an assistive device. Information about performance may also be given without a clear assertion, as in the cane makes it difficult to walk. Thus, extraction of performance information must not only distinguish between positive and negative assertions, but also identify those which cannot be clearly evaluated.

* These authors contributed equally to this work.
To the best of our knowledge, this is the first work to explore assertions of activity performance in health data. We explore a variety of methods for classifying assertion types, including rule-based approaches, statistical methods using common text features, and convolutional neural networks. We find that machine learning approaches set a strong baseline for discriminating between four assertion types, including rare negative assertions. While this work focuses on a relatively constrained and homogeneous corpus, error analysis suggests several broader directions for future research on classifying performance assertions.

Related Work
Though this is the first work focusing on the polarity of activity performance, three areas of prior work are particularly relevant to this research.
The first is concerned with applying NLP techniques and linguistic annotation to information about whole-person function, particularly activity performance. Harris et al. (2003) experimented with term extraction for the purpose of terminology discovery to support information retrieval relating to functioning, disability, and health, using linguistic, n-gram, and hybrid techniques. Bales et al. (2005) and Kukafka et al. (2006) modified and applied the MedLEE NLP extraction tool to code Rehabilitation Discharge Summaries using ICF (World Health Organization, 2001) encodings. Kuang et al. (2015) studied UMLS term coverage of functional status terms found in VA clinical notes and in social media sources, reporting that there is a need to extend existing terminologies to cover this area. Finally, Thieu et al. (2017) reported on an effort to build an annotated corpus of Physical Therapy (PT) notes with functional status information from the Clinical Center of the National Institutes of Health (NIH). This corpus was also used for an investigation into using named entity recognition (NER) techniques to extract information about patient mobility (Newman-Griffis and Zirikly, 2018).
The second area is research on negation. Negation detection is a well-researched area (Morante and Sporleder, 2012), and both negation and uncertainty have historically been studied in the clinical NLP context (Mowery et al., 2012; Peng et al., 2018). Previous work has studied incorporating dependency parsers to help identify negation scope (Sohn et al., 2012; Mehrabi et al., 2015). Recent work in this area involves the use of neural network models, where Long Short-Term Memory (LSTM) architectures, or variations thereof, have yielded competitive results on detection of negation cues and scope (Taylor and Harabagiu, 2018).
One closely related work to ours is Wu et al. (2014), which investigates detection of binary semantic negation status (i.e., the presence or absence of a finding, as opposed to syntactic negation) for clinical findings in EHR text. However, as Action Polarity is defined in terms of the interaction between an individual and a specific environment, it adds a layer of complexity beyond non-interactive physiological observations. Gkotsis et al. (2016) investigate using parsing-based scoping limitations for negation detection in complex clinical statements, though their focus is specifically on mentions of suicide.
Finally, classifying the assertion status of activity performance descriptions bears similarities to the problem of recognizing textual entailment (RTE) (Dagan et al., 2006; Marelli et al., 2014). RTE asks whether a given premise entails a specific hypothesis, and has historically been pursued in the general domain, though recent efforts have developed datasets in biomedical literature (Ben Abacha et al., 2015; Ben Abacha and Demner-Fushman, 2016) and in clinical text (Romanov and Shivade, 2018). Our task, by asking whether a given description entails the ability to perform an action in a given environment, is more constrained than RTE, but poses a related research challenge.

Data
We use an extended version of the dataset initially described by Thieu et al. (2017), consisting of 400 English-language Physical Therapy initial assessment and reassessment notes from the Rehabilitation Medicine Department of the NIH Clinical Center. These text documents have been annotated to identify descriptions and assessments of mobility status, typically including one or more specific Actions; for example, Pt walked 300' with rolling walker (Action underlined).
Each Action annotation was assigned one of four Polarity values, indicating what (if any) information the containing mobility description provides about the subject's ability to perform the given Action in the context of any described environmental factors. The Polarity labels are defined in the following paragraphs.
Able The subject is able to complete the activity in the environment described. For example, She states she can walk 20 minutes before tiring; in the case of now requires assistance of one person with transfers, it is unknown whether the patient can perform the action independently, but they are able to do so with the assistance described.
Unable The subject is not able to complete the activity in the environment described; for example, He is unable to walk. More specific information may also be included, as in Pt is now unable to walk more than 50 feet.
Unclear Some information is provided about the subject's ability to perform the action, but not enough to make a definitive positive or negative judgment. For example, in The cane makes it difficult to walk, it is undetermined whether the subject can or cannot walk. This label also includes some cases of negated environmental factors; for example, unable to propel wheelchair independently.
None No direct information about ability to perform the action is provided. Common examples of this label refer to a scale that is either unavailable or distant in the document, as in Ambulation: 1. Other cases refer to a specific aspect of performing an action, without evaluation, as in tendency during gait to quickly extend the leg from swing to stance.
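The four labels above can be summarized as a simple enumeration. This is an illustrative sketch; the `Polarity` class and `EXAMPLES` mapping are ours, not taken from the authors' code, and the example strings are paraphrased from the label definitions.

```python
from enum import Enum

class Polarity(Enum):
    """The four assertion types assigned to each Action mention."""
    ABLE = "Able"        # subject can perform the action as described
    UNABLE = "Unable"    # subject cannot perform the action
    UNCLEAR = "Unclear"  # some evidence, but no definitive judgment
    NONE = "None"        # no direct information about ability

# Illustrative examples, paraphrased from the label definitions above:
EXAMPLES = {
    Polarity.ABLE: "She states she can walk 20 minutes before tiring",
    Polarity.UNABLE: "He is unable to walk",
    Polarity.UNCLEAR: "The cane makes it difficult to walk",
    Polarity.NONE: "Ambulation: 1",
}
```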
We randomly split the 400 documents into 320 training records and 80 testing records, stratified by distribution of Polarity labels. Table 1 provides frequencies of each label in these splits.
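A stratified document split of this kind can be sketched as follows. This is a minimal stdlib illustration, assuming records can be grouped by a label function; it is not the authors' splitting code.

```python
import random
from collections import defaultdict

def stratified_split(records, label_of, test_fraction=0.2, seed=0):
    """Split records into train/test, roughly preserving label proportions.

    `records` is any list; `label_of` maps a record to its label
    (here, a Polarity value). `test_fraction=0.2` mirrors the
    320/80 split described above.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for r in records:
        by_label[label_of(r)].append(r)
    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```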

Methods
We investigate a variety of methods to classify the Polarity values of Action annotations. Rule-based methods have been used to great effect in clinical information extraction (Kang et al., 2013; Chapman et al., 2007), and form an important baseline for our task. We also make use of several common machine learning methods, such as support vector machines and k-nearest neighbors, along with more recent neural models such as convolutional neural networks (CNNs). Finally, we experiment with ensembled combinations of our best-performing models. These approaches are described in the following subsections.

Rule-based
A UIMA-based pipeline (Ferrucci and Lally, 2004) was constructed from components of v3NLP-Framework (Divita et al., 2016) to identify Action polarity. Leveraging the relationship of our task to detecting contextual attributes such as negation, the conTEXT algorithm (Chapman et al., 2007) embedded in the v3NLP-Framework was augmented with a few additional entries, including "able" and "independent" as asserted evidence and "unable" as negative evidence.
The conTEXT algorithm relies on a lexicon of evidence and accompanying clues to indicate when evidence found to the right or left of a relevant entity within a bounded window should be applied. We used the sentence containing an Action mention as the bounds of its context window. An Action Polarity UIMA annotator was built to assign Polarity, given an Action annotation. This annotator is downstream from the conTEXT annotator, which assigned negation, assertion, conditional, hypothetical, historical, and subject attributes to named entities. Within conTEXT-processed entities, we assigned Unable polarity to Actions that had previously been attributed as negated, and Able polarity to Actions that had previously been assigned only asserted attributes. Actions that were tagged as conditional or hypothetical were not assigned a Polarity.
The v3NLP-Framework pipeline includes document decomposition annotators to identify sections, section names, sentences, slots and values, questions and their answers, and, to a lesser extent, checkboxes (Divita et al., 2014). Action mentions in clinical text occur within the boundaries of each of these elements. ConTEXT addresses Action mentions within prose, but is not relevant for Action mentions found in the semi-structured constructs. The Action Polarity annotator was thus augmented with additional rules to aid in polarity assignment based on where the mention was found. The most relevant rules are as follows:
• Action mentions that are in the slot part of a slot:value construct get their polarity assignment from positive or negative evidence in the value part of the construct. Table 2 provides guidelines for assigning polarity from slot:value and question-and-answer constructs.
• Action mentions that are within Goals or Education sections do not get a polarity. The section name is known for each named entity. For the time being, section names containing "plan," "goals," "education," "intervention," and "recommendations" qualify. These are considered to be hypothetical constructs; the exception is that if a goal is noted to have been met, it gets an Able polarity.
• Action mentions within only the value part of a slot:value construct were handled the same way as Action mentions within prose.
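The rule logic described in this subsection can be sketched as a single decision function. This is a simplified illustration; the attribute names, section list, and function signature are ours, not the v3NLP-Framework API.

```python
def assign_polarity(attributes, in_section=None, slot_value_evidence=None):
    """Map conTEXT-style attributes on an Action mention to a Polarity label.

    `attributes` is a set of contextual attribute names;
    `slot_value_evidence` is "positive"/"negative"/None, standing in for
    evidence found in the value part of a slot:value construct.
    """
    HYPOTHETICAL_SECTIONS = {"plan", "goals", "education",
                             "intervention", "recommendations"}
    # Goals/Education and similar sections receive no polarity
    if in_section and in_section.lower() in HYPOTHETICAL_SECTIONS:
        return None
    # conditional/hypothetical mentions receive no polarity
    if "conditional" in attributes or "hypothetical" in attributes:
        return None
    # evidence from the value part of a slot:value construct wins
    if slot_value_evidence == "negative":
        return "Unable"
    if slot_value_evidence == "positive":
        return "Able"
    # otherwise fall back to conTEXT negation/assertion attributes
    if "negated" in attributes:
        return "Unable"
    if attributes == {"asserted"}:
        return "Able"
    return None
```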

Machine learning models
We evaluated the following common machine learning-based classification methods for our Polarity labeling task:
• Random forest (RF), using 100 estimators;
• Naïve Bayes (NB), using Gaussian estimators;
• k-nearest neighbors (kNN), using k=5 with Euclidean distance;
• Support vector machine (SVM), with linear kernel;
• Deep neural network (DNN), using a 100-dimensional hidden layer followed by a 10-dimensional hidden layer.
For a given Action mention a contained in a Mobility description m, we explored using both bag of binary unigram features and word embedding features as model input. For both kinds of features, we experimented with using only the context words in m − a (i.e., all words in m except for the Action mention itself), and with including the text of the Action mention a. Word embedding features were calculated by averaging the embeddings of all words used (either context alone, or context words and Action mention words combined). We also analyzed per-class performance of each model. Interestingly, as Table 4 illustrates, we found that all models except Naïve Bayes were surprisingly robust to the class imbalance in our dataset, with both SVM and DNN achieving over 76% F1 on the smallest class (Unable). Across all four classes, the SVM and the 2-layer DNN yield statistically equivalent performance (p ≥ 0.001); we therefore use absolute macro F1 to choose the SVM as the best baseline model for comparing across approaches.
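The feature construction described above can be sketched as follows. This is an illustrative numpy implementation assuming a `vocab` list and an `embeddings` dictionary (word to vector); the function name and signature are ours, not the authors' code.

```python
import numpy as np

def featurize(mobility_words, action_words, vocab, embeddings,
              include_action=True):
    """Build the feature vector for one Action mention.

    Binary unigram indicators over `vocab`, concatenated with the
    average word embedding of the words used. With
    include_action=False, only the context words m - a are used.
    """
    context = [w for w in mobility_words if w not in set(action_words)]
    words = context + (action_words if include_action else [])
    # binary bag-of-unigrams over the selected words
    present = set(words)
    unigrams = np.array([1.0 if v in present else 0.0 for v in vocab])
    # averaged word embeddings, skipping out-of-vocabulary words
    vecs = [embeddings[w] for w in words if w in embeddings]
    dim = len(next(iter(embeddings.values())))
    avg = np.mean(vecs, axis=0) if vecs else np.zeros(dim)
    return np.concatenate([unigrams, avg])
```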

CNN model
We adopt the Convolutional Neural Network (CNN) architecture introduced by Kim (2014). In our architecture, shown in Figure 1, we combine word embeddings with character embeddings to reduce the impact of out-of-vocabulary words, as opposed to using words alone. Additionally, character-level CNNs have been shown to improve the results of text classification (Zhang et al., 2015), though the improvement is more evident with larger data sizes. Although our task is close to negation detection, it differs in that we do not need to detect the span of the Action: we take as inputs the Action mention and its parent mobility mention (a self-contained text span that can be considered a sentence). Unlike sequence tagging problems, where Long Short-Term Memory (LSTM) architectures would be a good fit (Fancellu et al., 2016), we treat the problem as a text classification task.
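The convolution-and-pooling building block of this architecture, as in Kim (2014), can be sketched in numpy. This illustrates only the forward pass of one block with ReLU activation; the actual model also combines character embeddings and learns the filters during training.

```python
import numpy as np

def conv_maxpool(token_embeddings, filters, width=3):
    """One convolution + max-over-time pooling block.

    `token_embeddings` has shape (seq_len, emb_dim); `filters` has
    shape (n_filters, width * emb_dim). Returns one pooled feature
    per filter.
    """
    seq_len, emb_dim = token_embeddings.shape
    # slide a window of `width` tokens over the sequence
    windows = [token_embeddings[i:i + width].reshape(-1)
               for i in range(seq_len - width + 1)]
    # (n_windows, width*emb_dim) @ (width*emb_dim, n_filters), then ReLU
    activations = np.maximum(0.0, np.stack(windows) @ filters.T)
    # max-over-time pooling: one value per filter
    return activations.max(axis=0)
```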
We experiment with character and word embeddings of the following inputs:
• previous context (prev): the set of words preceding and including the Action mention.
• next context (next): the set of words following and including the Action mention.
• full context (full): the union of prev and next.
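The three context variants above can be sketched with simple slicing. The function name and index-based signature are illustrative assumptions, not from the authors' code.

```python
def context_variants(mobility_tokens, action_start, action_end):
    """Build the prev/next/full input variants for one Action mention.

    `action_start` and `action_end` delimit the Action mention (as a
    half-open token index range) within its mobility description.
    """
    prev = mobility_tokens[:action_end]    # words preceding + the action
    nxt = mobility_tokens[action_start:]   # the action + words following
    full = mobility_tokens                 # the union of prev and next
    return {"prev": prev, "next": nxt, "full": full}
```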
We also compare the impact of using character (full char) or word (full word) embeddings only, as opposed to combining both (* all), as shown in Table 5. We note that relying on only part of the context significantly drops Unable performance. However, as expected, prev outperforms next, given that the words preceding the Action mention carry most of the ability-related information. For the rare Unable class, character embeddings outperform word embeddings, with an F1 of 72.9% on the test set, the highest across all systems.
Hyperparameters were optimized on a dev set (we used a 90/10 train/dev split), yielding a learning rate of 0.0001, dropout of 0.5, an embedding size of 100, and Adam optimization (Kingma and Ba, 2014) with L2 regularization.

Ensemble models
Ensembling methods have been shown to improve performance in a variety of classification tasks (Buda et al., 2018), including class-imbalanced tasks (Ju et al., 2018). In order to combine the strengths of each modeling approach, we therefore experimented with ensembling all three systems, using two ensembling strategies:
Majority voting Predictions from the single best configurations of the SVM and CNN models were combined to make a single decision. When the systems agreed, that label was chosen as output; in the case of disagreement, we chose the predicted label that was less frequent in the training data, in order to prefer the strengths of individual models on rare classes.
DNN chooser Predictions from all three systems (rule-based and the best pretrained SVM and CNN models) were passed as inputs to a DNN with a single 10-unit hidden layer. In order to compensate for the class imbalance in our dataset, which would lead to preferring the CNN due to its higher precision, we identified all training samples that the three models disagreed on, grouped them by label, and identified the smallest of these disagreement sets. We then sampled no more than twice this number of points from each disagreement set, yielding a training sample of 182 points.
Using this downsampled training set, we trained the DNN to predict which, if any, of the systems chose the correct answer. As multiple systems may have made the correct prediction, this is a multi-label classification task. At test time, the system with the highest probability output from the DNN was chosen as the reference decision for the final classification.
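The majority-voting rule with its rare-label tiebreak can be sketched as follows. The `train_label_counts` argument stands in for training-set label frequencies; this is an illustration of the decision rule, not the authors' implementation.

```python
def vote(svm_label, cnn_label, train_label_counts):
    """Majority-vote ensemble of the SVM and CNN predictions.

    On agreement, return the shared label; on disagreement, prefer
    the label that was rarer in training, favoring the rare classes.
    """
    if svm_label == cnn_label:
        return svm_label
    return min((svm_label, cnn_label),
               key=lambda label: train_label_counts[label])
```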
We also experimented with three approaches to predict the final class directly: using a DNN with the predictions of each system as input, using an SVM with predictions as input, and adding rule-based and CNN predictions as additional features to the SVM with lexical features. All variants underperformed the chooser in cross validation experiments on training data, thus we omit them from our results.

Results
The test results of the systems we compared are given in Table 6. The ensembled systems achieve the best overall performance, with 77.4% macro F1 with the DNN chooser and 77.9% with majority voting. (For the chooser, adding rule-based predictions consistently improved results over using just the SVM and CNN; experiments with a 64-unit hidden layer, to cover all possible label combinations, yielded the same results in cross validation.) Due in large part to the class imbalance in the dataset, the SVM, CNN, and ensemble methods do not yield statistically significantly different results in most cases (p > 0.001), although the voting ensemble does produce significantly higher precision on None samples than other methods (p < 0.001). While performance is considerably better on the more frequent Able and None classes, the learned systems achieve good results on Unclear and the very rare Unable. Figure 2 shows the confusion matrices for all systems. The most common confusions are with Able and None, with only a small number of false positives for Unable and Unclear, and no confusion between the two in the machine learning approaches.
Comparing between individual systems, the CNN is best at making the important distinction between Able and Unable. It consistently achieves high precision across all classes, but suffers large drops in recall for the rare labels. The SVM model reverses this tradeoff, yielding high recall for Unable and Unclear, but much lower precision. The ensembled methods are able to strike a good middle ground, keeping the high recall of the SVM without sacrificing too much of the CNN's precision.

Discussion
As is evident from the results, correctly classifying the minority classes Unable and Unclear is not trivial. This is caused not only by the lack of training data for those classes, but, in the case of Unclear, also by its semantic ambiguity, even for humans.
An important area of confusion is when actions are hypothetical, as is the case for plans, recommendations, or feelings towards an action (e.g., eager to walk), which should all be tagged as None. Semantic problems can also arise around the use of an assistive device. In the following synthetic example, the annotated polarity is Able: she is unable to ambulate more than a few feet without support. Without the mention of assistance, it would have been Unable. In future work, assistance mentions will be modeled explicitly to better capture this distinction.
Overall, we obtain models that perform well across the board, where each approach has different strengths, as illustrated in Figure 2. Out of the 955 test instances, the rule-based approach classifies 37 correctly that no other system got right. Likewise, the SVM and CNN have 27 and 25 unique true positives, respectively. Forty-six instances are misclassified by all classifiers. The ensemble is able to pick up on 31 of the unique true positives from the machine learning systems, but consistently ignores valid suggestions from the rule-based approach. This suggests that different ensembling parameters should be considered to take better advantage of the rule-based system's strengths. Below, we discuss system-specific observations in more detail.
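The unique-true-positive analysis above can be reproduced with simple set logic. A minimal sketch; the function name and the dictionary-of-lists input format are our assumptions.

```python
def unique_true_positives(predictions, gold):
    """Count, per system, the instances only that system got right.

    `predictions` maps system name -> list of predicted labels,
    aligned with the reference labels in `gold`. Also returns the
    number of instances misclassified by every system.
    """
    systems = list(predictions)
    unique = {s: 0 for s in systems}
    all_wrong = 0
    for i, g in enumerate(gold):
        correct = [s for s in systems if predictions[s][i] == g]
        if len(correct) == 1:
            unique[correct[0]] += 1
        elif not correct:
            all_wrong += 1
    return unique, all_wrong
```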

Rule-based
The following failures were observed in the training and testing output:
Scoping negation The scope for assigning negation attribution was set to be within sentential boundaries. Ideally, the scope should be tighter, at the major phrase level. However, v3NLP-Framework does not currently employ a dependency parser. Breaking on phrasal boundaries was not successful, primarily due to the inability to distinguish between list markers such as commas, coordinating conjunctions (and/or), and true scope-limiting phrasal boundaries. Several false negatives were due to incorrect Unable assignments caused by negation scoping.
Identifying variants of slots and values accurately Negation and assertion assignment are dependent upon whether the Action is within prose, a slot, or a value. A number of errors were due to multiple slot:value constructs within the same line, making it difficult to identify the values, and/or nested constructs (i.e., the value of a slot:value construct was itself a slot:value construct).
Nested sections A number of missed None errors were the result of mis-identifying which section the annotation was within, and picking up an inner section name. Several other issues arose from the use of spaces as delimiters between slots and values, as well as from slots and values embedded within bulleted lists.
Pertinent negatives (Divita et al., 2014) In some statements, the action mention had clear negative evidence but actually meant the patient could perform the action; for example, no trouble walking. An easy amelioration would be to gather constructs like "no trouble" and add them to the assertion evidence lexicon.
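The proposed amelioration amounts to a small lexicon check before negation is applied. A sketch; the pattern list is a tiny illustrative sample, not a full assertion evidence lexicon.

```python
def is_pertinent_negative(text):
    """Detect 'pertinent negative' constructs such as 'no trouble walking'.

    These surface patterns contain negation cues but actually assert
    ability, so they should be treated as asserted evidence rather
    than negative evidence.
    """
    ASSERTED_CONSTRUCTS = ("no trouble", "no difficulty", "without problems")
    lowered = text.lower()
    return any(construct in lowered for construct in ASSERTED_CONSTRUCTS)
```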

Machine learning
The machine learning systems are prone to failures on sentences that have multiple Action mentions, if their Polarity values differ. This is because the systems do not take sentence structure into account. Similarly, sentence length seems to have a negative effect on performance, as it dilutes the information salient to the focus mention. In future work, we would limit the context information to exclude other mentions' contexts, add parse tree information relevant to the focus mention, or improve the neural network architecture to better model the sequential nature of the data.
The models would also benefit from better capturing semantic similarity. An example would be Pt. is fearful to start walking again (class: None), where the modality expressed by fearful might not have been learned from the training data. Additionally, lemmatization, stemming, and character embeddings can blunt the impact of such unseen tokens, but using embeddings from large corpora would be more robust.
Finally, one potential limitation in our machine learning results is our use of pretrained embeddings from web text. As Newman-Griffis and Zirikly (2018) show, when only a small amount of text from the target domain is available, out-of-domain embeddings can roughly match the performance of in-domain embedding features; however, developing or tuning more targeted word embeddings for use in this dataset is a useful area of future work.

Generalizability
It is important to note that the dataset used in this study was derived from one specialty (Physical Therapy) within a single institution (the NIH Clinical Center). Thus, the texts analyzed are likely to be more homogeneous than a broader dataset would be. Evaluating the generalization of our findings to free text from other healthcare subdomains and other institutions, and describing the ways in which performance assertions vary between these sources, is a valuable area of future work.

Conclusion
We have presented an evaluation of several approaches for the task of classifying whether a given description of an individual performing an activity indicates that they are able to perform it, that they are unable to, that their ability is unclear, or that insufficient information is provided to determine their ability. We found that machine learning approaches with lexical features perform surprisingly well on the task, including in detecting the rarer labels of Unable and Unclear, and that an ensembled approach sets a strong baseline of 77.9% macro F1 for our dataset. In-depth analysis of system errors suggested several intriguing problems for future work. For instance, we intend to investigate hybrid models and test how information related to report formatting, section structure, slot information, and assistive devices could improve performance. To resolve confusion about a patient's ability, we need models that can differentiate between factual and hypothetical statements (e.g., Pt can run vs. Pt dislikes running). Additionally, we would like to incorporate contextual representations such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) into our models.
To our knowledge, this is the first work expanding on the problem of clinical negation detection to complex interactions between individuals and their environments.This work joins a growing body of research on application of NLP techniques to information about activity performance and role participation, and identifies several research challenges in adapting NLP methods to this new domain.

Figure 2: Confusion matrices for results on the test set.

Table 1: Number of samples with each Polarity label in train and test data.

Table 2: Slot:value rules for Action Polarity assignment.

Table 3: Macro F1 over Polarity classes in 5-fold cross validation feature selection experiments. All experiments start with binary unigram features using context words alone, and add Action words, embedding features from context words, or both (i.e., unigrams and embedding features from context and Action words combined). The best performing model configurations are marked in bold.

Table 5: CNN performance using different inputs.

Table 6: Precision (Pr), Recall (Rec), and F1 for each model evaluated on the test set. Top rows are individual models; bottom rows are ensembled results. The best result in each column is marked in bold.