Assertion Detection in Clinical Notes: Medical Language Models to the Rescue?

In order to provide high-quality care, health professionals must efficiently identify the presence, possibility, or absence of symptoms, treatments, and other relevant entities in free-text clinical notes. This is the task of assertion detection: identifying the assertion class (present, possible, absent) of an entity based on textual cues in unstructured text. We evaluate state-of-the-art medical language models on the task and show that they outperform the baselines in all three classes. As transferability is especially important in the medical domain, we further study how the best performing model behaves on unseen data from two other medical datasets. For this purpose, we introduce a newly annotated set of 5,000 assertions for the publicly available MIMIC-III dataset. We conclude with an error analysis that reveals situations in which the models still go wrong and points towards future research directions.


Introduction
The clinical information buried in narrative reports is difficult for humans to access for clinical, teaching, or research purposes (Perera et al., 2013). To provide high-quality patient care, health professionals need to have better and faster access to crucial information in a summarized and interpretable format. In this paper, we focus on English discharge summaries and the task of assertion detection, which is the classification of clinical information as demonstrated in Figure 1.
Given a piece of text, we need to identify two pieces of information: a medical entity and textual cues indicating the presence or absence of that entity. Medical entity extraction has been studied extensively (Lewis et al., 2020); we thus focus our work on the task of predicting the present / possible / absent class of a medical entity, addressing an important information need of health professionals. This setting is reflected in the dataset released by the 2010 i2b2 Challenge Assertions Task (de Bruijn et al., 2011a), on which we base our main evaluation.
Clinical assertion detection is known to be a difficult task (Chen, 2019) due to the free-text format of the clinical notes considered. Detecting possible assertions is particularly challenging, because they are often vaguely expressed and occur far less frequently than present and absent assertions. Language models pre-trained on medical data have been shown to create useful representations for a multitude of tasks in the domain (Peng et al., 2019). We apply them to our setup of assertion detection to evaluate whether they can increase performance (especially on the minority class) and where they still need improvement.
We argue that clinical assertion detection models must be transferable to data that differs from the training data, e.g. due to the different writing styles of health professionals from other clinics or from other medical fields. As existing datasets do not represent such diversity, we manually annotate 5,000 assertions in clinical notes from several fields in the publicly available MIMIC-III dataset. We then use these annotated notes as an additional evaluation set to test the transferability of the best performing model.

| Dataset | Text type | present | possible | absent |
|---|---|---|---|---|
| 2010 i2b2 Challenge Assertions Task | discharge summaries | 21,064 | 1,418 | 6,144 |
| BioScope | scientific publications | – | 3,474 | 2,161 |
| MIMIC-III Clinical Database (new) | discharge summaries | 2,610 | 250 | 980 |
| | physician letters | 204 | 34 | 66 |
| | nurse letters | 293 | 14 | 59 |
| | radiology reports | 249 | 40 | 130 |

Table 1: Distribution of text types and classes in the three employed datasets. Note that possible is a minority class across datasets as well as text types. In the i2b2 dataset, for instance, only 5% of all labels are possible.
Our contributions are summarized as follows: 1) We evaluate medical language models on assertion detection in clinical notes and show that they clearly outperform previous baselines. We further study the transferability of such models to clinical text from other medical areas.
2) We manually annotate 5,000 assertions for the MIMIC-III Clinical Database (Johnson et al., 2016). We release the annotations to the research community to tackle the problem of label sparsity and the lack of diversity in existing assertion data.
3) We conduct an error analysis to understand the capabilities of the best performing model on the task and to reveal directions for improvement. We make our system publicly available as a web application to allow further analyses.

Related Work
One of the earliest approaches to assertion detection is NegEx (Chapman et al., 2001), where hand-crafted word patterns are used to extract the absent category of assertions in discharge summaries. In 2010, the i2b2 Challenge Assertions Task (de Bruijn et al., 2011a) was introduced, and an accompanying corpus was released.
There is a variety of prior work focused on scope resolution for assertions, which differs from our setting in that it does not consider medical concepts but the scopes of a certain assertion cue. Representative current approaches for this task setup include a CNN-based (Convolutional Neural Network) one by Qian et al. (2016), reaching an F1 of 0.858 on the more challenging possible category. Sergeeva et al. (2019) propose an LSTM-based (Long Short-Term Memory) approach to detect only absent scopes. When "gold negation cues" are made available to the model and synthetic features are applied, an F1 of 0.926 is reached. NegBert (Khandelwal and Sawant, 2020) is another approach to detect absent scopes. As its name suggests, it is BERT-based and reaches an F1 of 0.957 on BioScope abstracts.
In contrast to these approaches, we focus our work on entity-specific assertion detection, the results of which are of more practical help for supporting health professionals. Bhatia (2019) apply a bidirectional LSTM model with attention to the task and evaluate it on the i2b2 corpus. While these models reach F1-scores above 0.9 on the majority classes, the F1 on the challenging possible class does not surpass 0.65. We show that medical language models outperform these scores, especially on the minority class.
Furthermore, Wu et al. (2014) compared the then state-of-the-art approaches for negation detection and found a lack of generalization to arbitrary clinical text. We thus want to examine the transfer capabilities of recent language models to understand whether they can mitigate this phenomenon.

Methodology
We want to understand the abilities of medical language models on the task of assertion detection. We hence fine-tune various (medical) language models on the i2b2 corpus described below. We further apply the best performing model to the BioScope dataset and our newly introduced MIMIC-III assertion dataset without further fine-tuning to test its performance on unseen medical data.

Datasets
The 2010 i2b2 Assertions Task (de Bruijn et al., 2011a) provides a corpus of assertions in clinical discharge summaries. The task is split into six classes, namely present, possible, absent, hypothetical, conditional and associated with someone else. However, the distribution is highly skewed: only 6% of the assertions belong to the latter three classes. Hence, we use only the present, possible, and absent assertions for our evaluation, as they represent the most important information for doctors.
BioScope (Vincze et al., 2008) is a corpus of assertions in biomedical publications. It was specifically curated for the study of negation and speculation scope (absent and possible in this paper's terminology) and does not contain present annotations. As mentioned before, the BioScope dataset does not completely match the information need of health professionals, and the i2b2 corpus lacks varied medical text types. We thus introduce a new set of labelled assertions to complement existing data.
The MIMIC-III Clinical Database (Johnson et al., 2016) provides texts from discharge summaries as well as other clinical notes (physician letters, nurse letters, and radiology reports), representing a promising source of varied medical text. Two annotators followed the annotation guidelines from the i2b2 challenge and labelled 5,000 assertions, i.e. word spans of entities and their corresponding present / possible / absent class. The inter-annotator agreement, measured as Cohen's kappa coefficient, is 0.847, which indicates a strong level of agreement. The annotations were further verified by a medical doctor, who provided feedback to correct a small number of labels and confirmed that the end results were satisfactory.
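For reference, Cohen's kappa relates the observed agreement p_o to the agreement p_e expected by chance, κ = (p_o − p_e) / (1 − p_e); a value of 0.847 therefore means the annotators close roughly 85% of the gap between chance-level and perfect agreement.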
It is important to note that even though the newly annotated data from MIMIC-III adds variation to the existing corpora, the dataset has its own limitations. The clinical notes are collected from a single institution (with a mostly White patient population) and from Intensive Care Unit patients only. We therefore argue that progress in assertion detection requires further initiatives for releasing more diverse sets of clinical notes. Table 1 summarizes the assertion distribution in the introduced datasets and shows the unbalanced nature of the data.

Data Preprocessing
We make predictions about assertions on a per-entity level. However, we want our models to consider the context of an entity. We therefore pass the whole sentence to the models and surround the entity tokens with special indicator tokens [entity], whose embeddings are randomly initialised. A sample input sequence thus looks as follows (using an illustrative sentence): [CLS] the patient denies [entity] chest pain [entity] . [SEP]. We apply the same preprocessing to all three datasets.
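A minimal sketch of this preprocessing step, assuming a Hugging Face tokenizer; the checkpoint name, the helper function, and the example sentence are illustrative placeholders, not part of our released pipeline:

```python
from transformers import AutoTokenizer

# Any BERT-style checkpoint works the same way; this one is a placeholder.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Register [entity] as a special token so it is never split into subwords.
# Its embedding is later randomly initialised when the model's embedding
# matrix is resized: model.resize_token_embeddings(len(tokenizer)).
tokenizer.add_special_tokens({"additional_special_tokens": ["[entity]"]})

def mark_entity(tokens, start, end):
    """Surround the entity span tokens[start:end] with [entity] markers."""
    return " ".join(tokens[:start] + ["[entity]"] + tokens[start:end]
                    + ["[entity]"] + tokens[end:])

sentence = "the patient denies chest pain".split()
text = mark_entity(sentence, 3, 5)   # entity span: "chest pain"
encoding = tokenizer(text)           # [CLS] and [SEP] are added here
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
```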

Fine-tuning Medical Language Models
There are various pre-trained (bio-)medical and clinical language models available to evaluate on the assertion detection task. We select the most prevalent ones and describe them briefly below:
BERT (Devlin et al., 2019) was pre-trained on non-medical data and serves as a baseline for Transformer-based pre-trained language models.
BioBERT (Lee et al., 2020) was pre-trained on biomedical publications from PubMed abstracts and PMC full-text articles. Bio+Discharge Summary BERT (Alsentzer et al., 2019) further pre-trains BioBERT on clinical notes, in particular discharge summaries, from MIMIC-III.

After an initial grid search, we fix our hyperparameters to a learning rate of 1e-5, a batch size of 32, and 2 epochs of training.
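Under these hyperparameters, the fine-tuning setup can be sketched roughly as follows, using the Hugging Face Trainer; the checkpoint name, label order, and toy dataset are our assumptions for illustration, not the exact training script:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Assumed hub name for Bio+Discharge Summary BERT.
MODEL = "emilyalsentzer/Bio_Discharge_Summary_BERT"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.add_special_tokens({"additional_special_tokens": ["[entity]"]})

# Three assertion classes: 0=present, 1=possible, 2=absent (label order is
# our assumption, not prescribed by the paper).
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=3)
model.resize_token_embeddings(len(tokenizer))  # random init for [entity]

# Toy stand-in for the preprocessed i2b2 sentences.
train = Dataset.from_dict({
    "text": ["the patient denies [entity] chest pain [entity] ."],
    "label": [2],
}).map(lambda ex: tokenizer(ex["text"], truncation=True), batched=True)

args = TrainingArguments(
    output_dir="assertion-model",
    learning_rate=1e-5,                 # hyperparameters from the grid search
    per_device_train_batch_size=32,
    num_train_epochs=2,
)
Trainer(model=model, args=args, train_dataset=train,
        tokenizer=tokenizer).train()
```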

Evaluation and Discussion
We start by evaluating the mentioned models on the i2b2 corpus. We use the training and test data as defined in the i2b2 challenge and compare our results to previous state-of-the-art approaches in Table 2. Next, we apply the best performing Bio+Discharge Summary BERT to the BioScope and MIMIC-III corpora without additional fine-tuning (Table 3). Neither dataset was seen during training, so this setup shows the model's performance on medical text from unseen sources. Note that the number of evaluation samples is very low for some text types (e.g. the possible class in nurse letters), which impairs the expressiveness of those results.
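All scores discussed below are per-class F1. As a toy sketch of how such a breakdown is computed (assuming scikit-learn and the label order from the fine-tuning sketch above):

```python
from sklearn.metrics import classification_report

# y_true / y_pred would be gold and predicted labels on the test set;
# the values here are toy data for illustration only.
labels = ["present", "possible", "absent"]
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
print(classification_report(y_true, y_pred, target_names=labels, digits=3))
```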

Results
Language models outperform baselines. Table 2 shows that all evaluated medical language models are able to increase F1-scores on all three classes. The improvement is clearest on the most challenging possible class, with gains of up to ∼15pp, which shows that the models are better at handling sparse occurrences coupled with vague expressions.
Medical pre-training is important. The vanilla BERT baseline is the weakest of our evaluated models, which shows that models specialized for the medical domain are effective not only for more complex medical tasks but also for assertion detection. This is in line with the claim by Gururangan et al. (2020) that domain-specific pre-training is almost always of use. Bio+Discharge Summary BERT is the best model, probably because it was pre-trained on text very similar to the i2b2 corpus.
Text style matters. Table 3 shows the ability of the Bio+Discharge Summary BERT language model to transfer to other text styles. The assertions in the BioScope corpus are difficult for the model to identify, as they clearly differ from those used by doctors in clinical notes. The text style of the MIMIC-III data is more similar to the originally learned data, which is reflected in the results. However, physician letters appear to contain more specialized expressions and therefore evoke more errors. This points towards a lack of generalization, possibly caused by the limited variety of assertion cues in the training data.

Error Analysis
We analyse all errors made by the best model to identify the main sources of error and to point towards future research directions.
Inconsistent data in pre-existing datasets accounts for roughly 45% of errors. This includes obvious labelling mistakes, but also disagreements among annotators. For example, phrases such as "appeared to be," "concerning for" and "consistent with" are labelled inconsistently, sometimes as present and sometimes as possible.
Long-range dependencies account for roughly 20% of all errors: cases in which an entity and its cue are more than a few tokens apart. While the model's attention mechanism could in principle attend to distant tokens, the model appears to have learned to consider only nearby assertion cues. The following is an example of a distant cue indicating the absent class which was missed by the model: "His rash on the right hand was examined further and is now resolved."
Lists of assertions are found in 8% of error samples. Here the assertion is not directly coupled to an entity but must be inferred from the way it is listed. Such somewhat ambiguous cases are usually easily understood by humans, but difficult for our models.
Misspellings account for 5% of all observed errors, but they reveal a critical yet surprising limitation. For instance, the misspelled cues "appeas" and "probalbe", which indicate possible instances, are missed. While Transformer-based models are generally capable of dealing with misspellings due to subword tokenization, the limited variety of expressions in the data appears to lead the models to focus on a specific set of textual cues without generalizing to new phrases or even misspellings.
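One way to inspect this behaviour is to look at how a WordPiece tokenizer decomposes the misspelled cues (a sketch with a generic BERT vocabulary; the exact subword splits depend on the tokenizer used):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
for word in ["probable", "probalbe", "appears", "appeas"]:
    print(word, "->", tokenizer.tokenize(word))
# The misspelled cues decompose into subword fragments that the fine-tuned
# model may never have seen acting as assertion cues, so the cue is missed.
```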

Conclusion and Future Work
In this work, we presented an evaluation of medical language models for detecting assertions in clinical texts, with experimental results showing that they outperform baseline approaches. We further provided a new corpus of assertion annotations on the MIMIC-III dataset that augments existing data collections and demonstrates the model's capability to transfer to other sources, provided the text styles do not differ strongly. We suggest that future work investigate generalization to unseen data and expressions. We further encourage work on multi-task learning of entity extraction and assertion detection, to support health professionals with systems that learn both tasks jointly in an end-to-end fashion.