Unsupervised Domain Adaptation for Clinical Negation Detection

Detecting negated concepts in clinical texts is an important part of NLP information extraction systems. However, generalizability of negation systems is lacking, as cross-domain experiments suffer dramatic performance losses. We examine the performance of multiple unsupervised domain adaptation algorithms on clinical negation detection, finding only modest gains that fall well short of in-domain performance.


Introduction
Natural language processing applied to health-related texts, including clinical reports, can be valuable for extracting information that does not exist in any other form. One important NLP task for clinical texts is concept extraction and normalization, where text spans representing medical concepts are found (e.g., colon cancer) and mapped to controlled vocabularies such as the Unified Medical Language System (UMLS) (Bodenreider and McCray, 2003). However, clinical texts often refer to concepts that are explicitly not present in the patient, for example, to document the process of ruling out a diagnosis. These negated concepts, if not correctly recognized and extracted, can cause problems in downstream use cases. For example, in phenotyping, a concept for a disease (e.g., asthma) is a strong feature for a classifier finding patients with asthma. But if the text "ruled out asthma" occurs and the negation is not detected, this text will give the exact opposite of the signal that its inclusion intended.
There exist many systems for negation detection in the clinical domain (Chapman et al., 2001, 2007; Harkema et al., 2009; Sohn et al., 2012; Wu et al., 2014; Mehrabi et al., 2015), and there are also a variety of datasets available (Uzuner et al., 2011; Albright et al., 2013). However, generalizability of negation systems is still lacking, as cross-domain experiments suffer dramatic performance losses, even while obtaining F1 scores over 90% in the domain of the training data (Wu et al., 2014).
Prior work has shown that there is a problem of generalizability in negation detection, but has done little to address it. In this work, we describe preliminary experiments to assess the difficulty of the problem and evaluate the efficacy of existing domain adaptation algorithms on it. We implement three unsupervised domain adaptation methods from the machine learning literature, and find that multiple methods obtain similarly modest performance gains, falling well short of in-domain performance. Our research has broader implications, as the general problem of generalizability applies to all clinical NLP problems. Research in unsupervised domain adaptation can have a huge impact on the adoption of machine learning-based NLP methods for clinical applications.

Background
Domain adaptation is the task of using labeled data from one domain (the source domain) to train a classifier that will be applied to a new domain (the target domain). When there is some labeled data available in the target domain, this is referred to as supervised domain adaptation, and when there is no labeled data in the target domain, the task is called unsupervised domain adaptation (UDA). As the unsupervised version of the problem more closely aligns to real-world clinical use cases, we focus on that setting.
One common UDA method in natural language processing is structural correspondence learning (SCL; Blitzer et al. (2006)). SCL hypothesizes that some features act consistently across domains (so-called pivot features) while others are still informative but are domain-dependent. The SCL method combines the feature sets extracted from source and target data, trains classifiers to predict the value of pivot features, uses singular value decomposition to reduce the dimensionality of the pivot feature space, and uses this reduced dimensionality space as an additional set of features. This method has been successful for part of speech tagging (Blitzer et al., 2006), sentiment analysis (Blitzer et al., 2007), and authorship attribution (Sapkota et al., 2015), among others, but to our knowledge has not been applied to negation detection (or any other biomedical NLP tasks). One difficulty of SCL is in selecting the pivot features, for which most existing approaches use heuristics about what features are likely to be domain independent.
Another approach to UDA, known as bootstrapping or self-training, uses a classifier trained in the source domain to label target instances, and adds confidently predicted target instances to the training data with the predicted label. This method has been successfully applied to POS tagging, spam email classification, named entity classification, and syntactic parsing (Jiang and Zhai, 2007; McClosky et al., 2006).
Clinical negation detection has a long history because of its importance to clinical information extraction. Rule-based systems such as Negex (Chapman et al., 2001) and its successor, ConText (Harkema et al., 2009), contain manually curated lists of negation cue words and apply rules about their scopes based on word distance and intervening cues. While these methods do not learn, the word distance parameter can be tuned by experts to apply to their own datasets. The DepNeg system (Sohn et al., 2012) used manually curated dependency path features in a rule-based system to abstract away from surface features. The Deepen algorithm (Mehrabi et al., 2015) also uses dependency parses in a rule-based system, but applies the rules as a post-process to Negex, and only to the concepts marked as negated.
Machine learning approaches typically use supervised classifiers such as logistic regression or support vector machines to label individual concepts based on features extracted from surrounding context. These features may include manually curated lists, such as those from Negex and ConText, features intended to emulate the rules of those systems, and more exhaustive contextual features common to NLP classification problems. The 2010 i2b2/VA Challenge (Uzuner et al., 2011) had an "assertion classification" task, where concepts had mutually exclusive present, absent (negated), possible, conditional, hypothetical, and non-patient attributes, and this task attracted a variety of submissions that used some kind of machine learning. The top-performing system (de Bruijn et al., 2011) used a multi-level ensemble classifier, classifying assertion status of each word with three different machine learning systems, then feeding those outputs into a concept-level multi-class support vector machine classifier for the final prediction. In addition to standard bag of words features for representing context, this system used Brown clusters to abstract away from surface feature representations. The MITRE system (Clark et al., 2011) used conditional random fields to tag cues and their scopes, then incorporated cue information, section features, semantic and syntactic class features, and lexical surface features into a maximum entropy classifier. Finally, Wu et al. (2014) incorporated many of the dependency features from the rule-based DepNeg system (Sohn et al., 2012) and the best features from the i2b2 Challenge into a machine learning system.

Methods
In this work, we apply unsupervised domain adaptation algorithms to machine learning systems for clinical negation detection, evaluating the extent to which performance can be improved when systems are trained on one domain and applied to a new domain. We make use of the Wu et al. (2014) system in these experiments, as it is freely available as part of the Apache cTAKES (Savova et al., 2010) clinical NLP software and can be easily retrained.
Unsupervised domain adaptation (UDA) takes place in the setting where there is a source dataset $D_s = \{X, y\}$ and a target dataset $D_t = \{X\}$, with feature representations $X \in \mathbb{R}^{N \times D}$ for $N$ instances and $D$ feature dimensions, and labels $y \in \mathbb{R}^{N}$. Our goal is to build classifiers that will perform well on instances from $D_s$ as well as $D_t$, despite having no gold labels from $D_t$ to use at training time. Here we describe a variety of approaches that we have implemented.
The baseline cTAKES system that we use is a support vector machine-based system with L1 and L2 regularization. Regularization is a penalty term added to the classifier's cost function during training that penalizes "more complex" hypotheses, and is intended to reduce overfitting to the training data. L2 regularization adds the L2 norm to the classifier cost function as a penalty and tends to favor smaller feature weights. L1 regularization adds the L1 norm as a penalty and favors sparse feature weights (i.e., setting many weights to zero).
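For reference, the two penalties can be written out as follows. This is a sketch under the assumption of the common soft-margin linear SVM formulation with a generic loss $\ell$ (typically the hinge or squared hinge loss); the exact loss used by the cTAKES classifier is not stated here, and the symbols $w$, $x_i$, $y_i$, and $\ell$ are introduced only for illustration.

```latex
% L2-regularized linear SVM, with labels y_i \in \{-1, +1\}
\min_{w} \; \frac{1}{2}\lVert w \rVert_2^2 \;+\; C \sum_{i=1}^{N} \ell\bigl(y_i,\, w^\top x_i\bigr)

% L1-regularized variant: only the penalty term changes
\min_{w} \; \lVert w \rVert_1 \;+\; C \sum_{i=1}^{N} \ell\bigl(y_i,\, w^\top x_i\bigr)

% Example loss (hinge): \ell(y, s) = \max(0,\, 1 - y s)
```

In this form $C$ multiplies the training loss rather than the penalty, so a smaller $C$ gives the penalty relatively more weight.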
Before attempting any explicit UDA methods, we evaluate the simple method of increasing regularization. While regularization is already intended to reduce overfitting, the classifier may still overfit with respect to a target domain, since its regularization hyperparameter is tuned on the source domain. In a real unsupervised domain adaptation scenario it is not possible to tune this parameter on the target domain, so for this work we use a heuristic to set the adapted regularization parameter. We first find the optimal regularization hyperparameter C using cross-validation on the source data, then increase the regularization strength by an order of magnitude (that is, divide C by ten) and retrain before testing on target data. For example, if we find that the best F1 score occurs when C = 1 for a 5-fold cross-validation experiment on the source data, we retrain the classifier at C = 0.1 before applying it to target test data.² Changing this parameter by one order of magnitude is purely a heuristic approach, chosen because that is how we (the authors) typically would vary this parameter during tuning. Future work may explore whether this parameter can be tuned on target data without supervision, perhaps by using some information about the data distribution in the target domain.
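The heuristic can be sketched in a few lines. This is a minimal illustration, not the cTAKES implementation: the function name `adapted_c`, the `factor` argument, and the cross-validation scores in the usage example are all hypothetical.

```python
def adapted_c(cv_f1_by_c, factor=10.0):
    """Pick the best source-domain C, then strengthen regularization by `factor`.

    cv_f1_by_c maps each candidate C to its cross-validation F1 on source data.
    """
    # Best C according to source-domain cross-validation.
    best_c = max(cv_f1_by_c, key=cv_f1_by_c.get)
    # C is the cost of misclassifying training instances, so dividing C
    # by `factor` increases the regularization strength.
    return best_c / factor


# Hypothetical cross-validation scores, for illustration only.
source_cv_f1 = {0.01: 0.81, 0.1: 0.88, 1.0: 0.92, 10.0: 0.90}
print(adapted_c(source_cv_f1))  # 0.1
```

The classifier would then be retrained on all source data with the returned value before being applied to the target corpus.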
The first UDA algorithm we implement is structural correspondence learning (SCL) (Blitzer et al., 2006). Following Blitzer et al., we select as pivot features those features that occur more than 50 times in both the source and target data. Then, for each data instance $i$ in $X_c = \{X_s \cup X_t\}$ and each pivot feature $p$, we extract the non-pivot features of $i$ (non-pivot features are simply all features not selected as pivot features), $x_i = X_c[i, \text{non-pivots}]$, and a classification target, $y_i[p] = \llbracket X_c[i, p] > 0.5 \rrbracket$.³ For each pivot feature $p$, we train a linear classifier on the $(x_i, y_i[p])$ classification instances, take the resulting feature weights, $w_p$, and concatenate them into a matrix $W$. We decompose $W$ using singular value decomposition, $W = U \Sigma V^T$, and construct $\theta$ as the first $d$ dimensions of $U$. This matrix $\theta$ represents a projection from non-pivot features to a reduced dimensionality version of the pivot feature space.
² Note that since C is the cost of misclassifying training instances, increasing regularization means lowering C.
³ We use $\llbracket expr \rrbracket$ to denote the indicator function, which returns 1 if expr is true and 0 otherwise.
The next UDA algorithm we implement is bootstrapping. Jiang and Zhai (2007) introduced a variety of methods for UDA, under the broad heading of instance weighting, but the method they call bootstrapping was the only one that does not rely on any target domain labeled data. This method creates pseudo-labels for a portion of the target data by running a classifier trained only on source data on the target data, and adding confidently classified target instances to the training data, labeled with whatever the classifier decided. Jiang and Zhai experiment with the weights of these instances, either giving higher weights to target instances or weighting them the same as source instances.
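Returning to structural correspondence learning, the steps described earlier in this section (pivot selection, one linear predictor per pivot, SVD of the stacked weights) can be sketched with NumPy. This is an illustration under stated assumptions, not the authors' implementation: the function names and the `ridge` argument are hypothetical, ridge-regularized least squares stands in for whichever linear classifier was actually trained, and appending the projected features to the original ones is one plausible way to use $\theta$.

```python
import numpy as np


def scl_projection(X_source, X_target, min_count=50, d=25, ridge=1.0):
    """Return (theta, pivot_idx, nonpivot_idx) for SCL."""
    # Pivot features: nonzero in more than `min_count` instances of BOTH domains.
    src_counts = (X_source > 0).sum(axis=0)
    tgt_counts = (X_target > 0).sum(axis=0)
    is_pivot = (src_counts > min_count) & (tgt_counts > min_count)
    pivot_idx = np.flatnonzero(is_pivot)
    nonpivot_idx = np.flatnonzero(~is_pivot)

    # Pool the (unlabeled) instances of both domains.
    X_c = np.vstack([X_source, X_target])
    X_np = X_c[:, nonpivot_idx]                       # predictors: non-pivot features
    Y = np.where(X_c[:, pivot_idx] > 0.5, 1.0, -1.0)  # targets: is each pivot active?

    # One linear predictor per pivot, solved jointly.
    # ASSUMPTION: ridge regression replaces the paper's linear classifier.
    A = X_np.T @ X_np + ridge * np.eye(X_np.shape[1])
    W = np.linalg.solve(A, X_np.T @ Y)                # shape (n_nonpivot, n_pivot)

    # SVD of the weight matrix; keep the first d left singular vectors.
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    theta = U[:, :d]                                  # shape (n_nonpivot, <= d)
    return theta, pivot_idx, nonpivot_idx


def add_scl_features(X, theta, nonpivot_idx):
    """Append the projected non-pivot features to the original feature matrix."""
    return np.hstack([X, X[:, nonpivot_idx] @ theta])
```

The same `theta` and `nonpivot_idx` would be applied to source instances at training time and to target instances at test time, so that both are represented in the same augmented feature space.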
We implemented a simpler version of bootstrapping that does not modify instance weights, and adds instances based on the initial classifier score (rather than iteratively re-training and adding additional instances). We allow up to 1% of the target instances to be added.
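A minimal sketch of this single-pass selection step is shown below. It assumes the source-trained classifier produces a signed decision value per target instance whose magnitude reflects confidence; the function name and arguments are hypothetical. The optional `only_label` argument restricts selection to one class, which corresponds to the minority-class variant described next.

```python
def select_pseudo_labeled(scores, max_fraction=0.01, only_label=None):
    """Return (index, pseudo_label) pairs for the most confident target instances.

    scores: signed decision values from the source-trained classifier, one per
            target instance (positive -> label +1, otherwise -> label -1).
    """
    # Cap on how many target instances may be added (1% by default).
    budget = int(len(scores) * max_fraction)
    # Pair each instance with its predicted label and its confidence |score|.
    candidates = [(i, 1 if s > 0 else -1, abs(s)) for i, s in enumerate(scores)]
    if only_label is not None:
        candidates = [c for c in candidates if c[1] == only_label]
    # Most confident first; no iterative re-training and no instance re-weighting.
    candidates.sort(key=lambda c: c[2], reverse=True)
    return [(i, label) for i, label, _ in candidates[:budget]]
```

The selected instances would be appended to the source training data with their pseudo-labels, and the classifier retrained once.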
In addition to adding the highest-scoring instances, we also experiment with adding only high-scoring instances from the minority class. In many NLP tasks, including negation detection, the label of interest has low prevalence, and there is a danger that the classifier will be most confident on the majority class and only add target instances with that label. We therefore experiment with only adding minority class instances, enriching the training data to have a more even class distribution.
The final UDA algorithm we experiment with uses instance similarity features (ISF) (Yu and Jiang, 2015). This method extends the feature space in the source domain with a set of similarity features computed by comparison to features extracted from target domain instances. Formally, the method selects a random subset of $K$ exemplar instances from $D_t$ and normalizes each exemplar $e$ as $\hat{e} = e / \lVert e \rVert$. Similarity feature $k$ for instance $i$ in the source data set is computed as the dot product $X_s[i] \cdot \hat{e}_k$. Following Yu and Jiang, we set $K = 50$ and concatenate the similarity features to the full set of extracted features for each source instance at training. These exemplar instances must be kept past training time, so that at test time similarity features can be created in the same way for test instances.
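The instance similarity features can be sketched as follows, using plain Python lists as dense feature vectors. This is an illustration only: the function names and the `seed` argument are hypothetical, and a real system would likely use sparse vectors.

```python
import math
import random


def fit_exemplars(X_target, k=50, seed=0):
    """Sample up to k target instances at random and scale each to unit L2 norm."""
    rng = random.Random(seed)
    chosen = rng.sample(X_target, min(k, len(X_target)))
    exemplars = []
    for e in chosen:
        norm = math.sqrt(sum(v * v for v in e))
        # Leave all-zero vectors unchanged to avoid dividing by zero.
        exemplars.append([v / norm for v in e] if norm > 0 else list(e))
    return exemplars


def add_similarity_features(x, exemplars):
    """Concatenate the original features with one dot product per exemplar."""
    sims = [sum(a * b for a, b in zip(x, e)) for e in exemplars]
    return list(x) + sims
```

The list returned by `fit_exemplars` is the object that has to be stored with the model: `add_similarity_features` is called with the same exemplars for source instances during training and for target instances at test time.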

Evaluation
Our evaluation makes use of four corpora of clinical notes with negation annotations: i2b2 (Uzuner et al., 2011), Mipacq (Albright et al., 2013), SHARP (Seed), and SHARP (Stratified). We first perform cross-domain experiments in the no adaptation setting to replicate Wu et al.'s experiments.⁴ One difference from Wu et al. is that we evaluate on the training split of the target domain; we made this choice because the development and test sets for some of the corpora are quite small and the training data gives us a more stable estimate of performance. We tune two hyperparameters, the choice of L1 vs. L2 regularization and the value of the regularization parameter C, with five-fold cross validation on the source corpus. We record results for training on each of the four corpora and testing on the three remaining target domains, as well as a cross-validation experiment to measure in-domain performance. Table 1 shows these results, which replicate Wu et al. in finding dramatic performance declines across corpora.
⁴ See that paper for a discussion of corpus differences.
In our domain adaptation experiments, we also use all four corpora as source domains, and for each source domain we perform experiments where the other three corpora are target domains. These results are reported in Table 2.
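The shape of this protocol can be sketched as below: every ordered (source, target) pair of distinct corpora is one experiment, scored with F1. The helper names are hypothetical, and computing F1 on the negated class is an assumption rather than something stated above.

```python
from itertools import permutations

CORPORA = ["i2b2", "Mipacq", "SHARP (Seed)", "SHARP (Stratified)"]


def cross_domain_pairs(corpora=CORPORA):
    """All ordered (source, target) pairs with source != target."""
    return list(permutations(corpora, 2))


def f1_score(gold, predicted, positive=1):
    """F1 for the `positive` label (assumed here to be the negated class)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With four corpora this yields twelve cross-domain experiments per adaptation method, in addition to the four in-domain cross-validation runs.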

Discussion and Conclusion
These results show that unsupervised domain adaptation can provide, at best, a small improvement to clinical negation detection systems.
Strong regularization, while not obtaining the highest average performance, provides nominal improvements over no adaptation in all settings except when the source corpus is Mipacq, in which case performance suffers severely. Mipacq has two unique aspects that might be relevant: first, it is the largest training set, and second, it pulls documents from a very diverse set of sources (clinical notes, clinical questions, and medical encyclopedias), while the other corpora only contain clinical notes. Perhaps because the within-corpus variation is already quite high, the regularization parameter that performs best during tuning is already sufficient to prevent overfitting on any target corpus with less variation, and increasing it leads to underfitting and thus poor target domain performance. Future work may explore this hypothesis, which will require some attempt to relate the within- and between-corpus variation.
Four different systems tie for the highest average performance: BS-All (standard bootstrapping), BS-Minority (bootstrapping with minority class enrichment), structural correspondence learning (SCL A+N), and instance similarity features (ISF) all show a gain of 3 percentage points (71% to 74%). While the presence of some improvement is encouraging, the improvements within any given technique are not consistent, so that without labeled data from the target domain it would not be possible to know which UDA technique to use. We set aside the question of "statistical significance," as that is probably too low of a bar; whether or not these results reach that threshold, they are still disappointingly low and likely to cause issues if applied to new data.
In summary, selecting a method is difficult, and many of these methods have hyper-parameters (e.g., pivot selection for SCL, number of bootstrapping instances, number of similarity features) that could potentially be tuned, yet in the unsupervised setting there are no clear metrics to use for tuning performance. Future work will explore the use of unsupervised performance metrics that can serve as proxies to test set performance for optimizing hyperparameters and selecting UDA techniques for a given problem.