Detecting mentions of pain and acute confusion in Finnish clinical text

We study and compare two different approaches to the task of automatic assignment of predefined classes to clinical free-text narratives. In the first approach this is treated as a traditional mention-level named-entity recognition task, while the second approach treats it as a sentence-level multi-label classification task. Performance comparison across these two approaches is conducted in the form of sentence-level evaluation and state-of-the-art methods for both approaches are evaluated. The experiments are done on two data sets consisting of Finnish clinical text, manually annotated with respect to the topics pain and acute confusion. Our results suggest that the mention-level named-entity recognition approach outperforms sentence-level classification overall, but the latter approach still manages to achieve the best prediction scores on several annotation classes.


Introduction
In relation to patient care in hospitals, clinicians document the administered care on a regular basis. The documented information is stored as clinical notes in electronic health record (EHR) systems. In many countries and hospital districts, a substantial portion of the information that clinicians document concerning patient status, performed interventions, thoughts, uncertainties and plans is written in a narrative manner using (natural) free text. This means that much of the patient information is only found in free-text form, as opposed to structured or coded information (cf. standardized terminology, medications and diagnosis codes).
* These authors contributed equally.
When it comes to information retrieval, management and secondary use, having the computer automatically identify and extract information from health records related to a given query or topic is desirable. This could, for example, be information about pain treatment given to a patient, or a patient group. Although free text is easy to produce by humans and allows for great flexibility and expressibility, it is challenging to have computers automatically classify and extract information from such text. The use of computers to automatically extract, label and structure information in free text is referred to as information extraction (Meystre et al., 2008), with named-entity recognition as a sub-task (Patawar Maithilee, 2015;Quimbaya et al., 2016). Due to the complexity of free text, this task is commonly approached using manually annotated text as training data for machine learning algorithms (see e.g. Velupillai and Kvist (2012)).
We present ongoing work towards automated annotation of text, i.e. labelling with pre-defined classes/entity types, by first having the computer learn from a set of manually annotated clinical notes. The annotations concern two topics relevant to clinical care: Pain and Acute Confusion. To gain better insight into these topics and how they are documented, two separate data sets have been manually annotated, one for each topic. For each of the two topics, a set of classes was initially identified that reflects the information the domain experts are interested in. An example sentence demonstrating the annotations is presented in Figure 1. The ultimate aim of this annotation work is to achieve improved documentation, assessment, handling and treatment of pain and acute confusion in hospitals (Heikkilä et al., 2016;Voyer et al., 2008). We now want to investigate how best to train the computer to automatically detect and annotate mentions of these topics in new, unseen text by exploring various machine learning methods.
We address this by testing and comparing two different overall approaches:
• Named-entity recognition (NER), where we have the computer attempt to detect the mention-level annotation boundaries.
• Sentence classification (SC), where we have the computer attempt to label sentences based on the contained annotations.
The motivation for comparing these two approaches is that: (a) the experts are satisfied with having the computer identify and extract information on sentence level; and (b) we hypothesize that several classes, in particular those reflecting the more complex concepts, are easier for the computer to identify when approached as a sentence classification task. Further, we are not aware of any other work where a similar comparison has been reported. The methods and algorithms that we explore are based on state-of-the-art machine learning methods for NER and SC.

Data
Pain is something that most patients experience to various degrees during or related to a hospital stay. Pain experience is subjective and hence it can be challenging for clinicians to properly assess if, how and to what extent patients are experiencing pain. Acute confusion is a mental state that patients may enter as a result of serious illness, infections, intense pain, anesthesia, surgery and/or drug use. When clearly evident, this is commonly diagnosed as acute confusion or delirium (Fearing and Inouye, 2009), which is identified as a mental disorder that affects perception, cognition, memory, personality, mood, psychomotor function and the sleep-wake rhythm. However, it can be challenging to clearly identify acute confusion or delirium at the point of care, in particular the milder cases. Still, signs and symptoms can often be found in the free text that clinicians document (Voyer et al., 2008), and the same goes for pain (Gunningberg and Idvall, 2007). Our annotated data consists of a random sample of 280 care episodes gathered from patients who underwent open-heart surgery and were admitted to one university hospital in Finland during the years 2005-2009. This sample includes 1327 days of nursing narratives and 2156 notes written by physicians. The same sample was used as the data set for both topics (i.e. pain and acute confusion). Ethical approval and organizational permission from the hospital district were obtained before the data collection.
Separate annotation schemes, reflecting the classes and guidelines for the annotation work, were iteratively developed based on the literature for both topics. For pain the annotation scheme has 15 classes, while the acute confusion scheme has 37 classes (see supplementary materials for more details). The annotation schemes were initially tested and refined by having the annotators annotate a separate data set of another 100 care episodes (not included in this study). The annotation task was conducted by four persons working in pairs, so that all the text was annotated by (at least) two annotators. This team of annotators consisted of two domain experts and two non-domain experts with an informatics background. Finally, the annotators reviewed the annotations to reach a consensus before producing the final annotated data sets used in this study. The annotations were made using the brat annotation tool (Stenetorp et al., 2012).
The two data sets were individually divided into training (60%), development (20%) and test (20%) sets. As preprocessing of the data we tokenize and enrich the text with linguistic information in the form of lemmas and part-of-speech (POS) tags for each token. For this we use the Finnish dependency parser (Haverinen et al., 2014).
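The 60/20/20 division can be illustrated with a minimal sketch. The unit of splitting (here whole care episodes), the random shuffling, the seed and the name `split_dataset` are assumptions made for illustration; the text above only states the proportions.

```python
import random

def split_dataset(items, train=0.6, dev=0.2, seed=0):
    """Shuffle items and cut them into train/dev/test partitions.

    The test partition receives whatever remains after train and dev.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_train = round(len(shuffled) * train)
    n_dev = round(len(shuffled) * dev)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])

# Hypothetical identifiers standing in for the 280 care episodes.
episodes = [f"episode_{i}" for i in range(280)]
train_set, dev_set, test_set = split_dataset(episodes)
print(len(train_set), len(dev_set), len(test_set))
```

Splitting by care episode rather than by sentence would keep all text from one patient stay in a single partition, which is one way to avoid leakage between training and test data.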
For training of word embeddings (word-level semantic vectors), we used a large corpus consisting of both physician and nursing narratives, extracted from the same university hospital (in Finland). In total, this corpus consists of approximately 0.5M nursing narratives and 0.4M physician notes, which amounts to 136M tokens.

Experiment and Methods
Below (Sections 3.1 and 3.2) we describe the methods, algorithm implementations and hyperparameters used in the two approaches, i.e., named-entity recognition (NER) and sentence classification (SC). In the Results section, Section 4, we compare the scores achieved by these two approaches for each of the two topics (i.e. pain and acute confusion).

Named-entity recognition (NER)
In this approach we focus on methods for predicting word-level annotation spans. More precisely we explore two such methods that have shown state-of-the-art performance in NER.
NERsuite. Conditional random fields (CRFs) are a class of sequence modeling methods that have shown state-of-the-art performance in learning to identify biomedical named entities in text (Campos et al., 2013). We use a named-entity recognition toolkit called NERsuite (Cho et al., 2010), which is built on top of CRFsuite (Okazaki, 2007). For each of the two topics, one NERsuite model is trained using the corresponding training set, and the mentions are labeled using the common IOB tagging scheme. As training features, we use the original tokens, lemmas and POS tags. Although NERsuite allows the user to adjust regularization and label weight parameters, for this initial study we have used the default hyperparameters. It is worth noting that adjusting the regularization parameter is not as crucial for CRFs as it is, for instance, for support vector machines, and strong results can be achieved even with the default values.
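As an illustration of the IOB tagging scheme mentioned above, the following sketch converts token-level mention spans into one tag per token. The example sentence, the class names `Pain` and `Location`, and the function name `spans_to_iob` are invented for illustration and are not taken from the annotated data or from NERsuite.

```python
def spans_to_iob(tokens, mentions):
    """Convert mentions to IOB tags.

    tokens:   list of token strings
    mentions: list of (start_token, end_token_exclusive, label),
              assumed not to overlap each other
    """
    tags = ["O"] * len(tokens)           # O = outside any mention
    for start, end, label in mentions:
        tags[start] = "B-" + label       # B = first token of a mention
        for i in range(start + 1, end):
            tags[i] = "I-" + label       # I = continuation of the mention
    return tags

# Invented example: "The patient complains of severe pain in the chest."
tokens = ["Potilas", "valittaa", "kovaa", "kipua", "rinnassa", "."]
mentions = [(2, 4, "Pain"), (4, 5, "Location")]
iob_tags = spans_to_iob(tokens, mentions)
for token, tag in zip(tokens, iob_tags):
    print(token, tag)
```

In the setup described above, each token would additionally carry its lemma and POS tag as features alongside the IOB tag.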
Several of the annotated entities have overlapping spans, e.g. the Finnish compound word rintakipu (chest pain) includes both pain and location mentions, but the standard CRF implementations are not able to do multi-label classification. Thus we form combination classes from the full spans of overlapping entities. This slightly distorts the annotated spans as the original mentions may have had only partial overlaps. Another option would have been to train separate models for each class, but as the number of classes is relatively high for both topics, this would have been very impractical.
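One way to form such combination classes is sketched below. Overlapping mentions are merged into a single span covering their full extent, labeled with the sorted, joined names of the original classes. The `+` separator, the class names and the offsets are assumptions made for illustration; the exact naming used in the experiments is not specified here.

```python
def merge_overlapping(mentions):
    """Merge overlapping mentions into combination-class mentions.

    mentions: list of (start, end_exclusive, label) character offsets.
    Returns a list of (start, end_exclusive, combined_label).
    """
    merged = []
    for start, end, label in sorted(mentions):
        if merged and start < merged[-1][1]:
            # Overlaps the previous span: extend it and add the label.
            prev_start, prev_end, prev_labels = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), prev_labels | {label})
        else:
            merged.append((start, end, {label}))
    return [(s, e, "+".join(sorted(labels))) for s, e, labels in merged]

# "rintakipu": the whole word is a pain mention, "rinta" a location mention.
# The third mention is a hypothetical non-overlapping one.
mentions = [(0, 9, "Pain"), (0, 5, "Location"), (12, 20, "Intensity")]
combined = merge_overlapping(mentions)
print(combined)
```

Note how the location mention, which originally covered only part of the word, ends up covering the full merged span. This is the slight distortion of the annotated spans described above.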
CNN-BiLSTM-CRF. The second method that we explore is an end-to-end neural model following the approach by Ma and Hovy (2016), which has produced state-of-the-art results for general domain English NER tasks. This model uses a CRF layer for the final predictions, but instead of relying on handcrafted features it utilizes a bidirectional recurrent neural network layer, with a long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997;Gers et al., 2000) chain, over input word embeddings. In addition to the input word embeddings, a convolutional layer is used over character embedding sequences to form another encoding for each token. This model is therefore often called a CNN-BiLSTM-CRF network. For training the model we use the example implementation provided by the authors.
Training the CNN-BiLSTM-CRF is computationally much more demanding than training a standard CRF classifier, and we have thus not run an exhaustive hyperparameter search. Instead, we use the default values from the original paper, except for setting the LSTM state dimensionality to 100 and the learning rate to 0.05, as these produced slightly better results than the default values. The word embeddings are initialized with a word2vec (Mikolov et al., 2013) model trained on the large clinical Finnish text corpus.

Sentence classification (SC)
In this approach, we regard the task as a multi-label text classification task in which a sentence can be associated with multiple labels. For this task, we rely on artificial neural networks (ANN) since they have been shown to achieve state-of-the-art performance in text classification tasks (see e.g. Zhang et al. (2015); Tang et al. (2015)).
Neural network architecture. We tried several neural network architectures, but report only the architecture that performed best. For both topics, we apply a deep learning-based neural network architecture that uses three separate LSTM chains: one each for the sequence of words, lemmas and POS tags.
The network has three separate channels for the words, lemmas and POS tags in the sentence. Each channel receives a sequence (words, lemmas or POS tags) as input. The items in the sequence are then mapped into their corresponding vector representations using a dedicated embedding look-up layer. The sequence of vectors is then input to an LSTM chain and the last step-wise output of the chain is regarded as the representation of the sentence based on its words (or lemmas or POS tags).
Next, the outputs of the three channels are concatenated and the resulting vector is forwarded into the classification (decision) layer, which has a dimensionality equal to the number of annotation classes. The sigmoid activation function is applied on the output of the decision layer.
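The concatenation and decision layer described above can be sketched in plain Python. This is a toy illustration rather than the Keras implementation used in the experiments: the vectors and weights are made-up values, the LSTM channels are replaced by fixed vectors, and the 0.5 threshold for turning probabilities into labels is an assumption.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decision_layer(word_vec, lemma_vec, pos_vec, weights, biases, threshold=0.5):
    """Concatenate the three channel outputs and score every class independently.

    weights: one row per annotation class, one column per concatenated feature.
    Returns (probabilities, boolean labels); several labels can be True at once.
    """
    features = word_vec + lemma_vec + pos_vec   # list concatenation
    probs = []
    for row, bias in zip(weights, biases):
        z = sum(w * f for w, f in zip(row, features)) + bias
        probs.append(sigmoid(z))                # independent sigmoid per class
    return probs, [p >= threshold for p in probs]

# Toy channel outputs (stand-ins for the last LSTM outputs) and two classes.
word_vec, lemma_vec, pos_vec = [1.0, 0.0], [0.5], [-1.0]
weights = [[2.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 3.0]]
biases = [0.0, 0.0]
probs, labels = decision_layer(word_vec, lemma_vec, pos_vec, weights, biases)
print(probs, labels)
```

Because each class has its own sigmoid output rather than sharing a softmax, the scores do not compete with each other, which is what allows a sentence to receive several labels.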
Training and optimization. For implementation we use the Keras deep learning library (Chollet, 2015), with the Theano tensor manipulation library (Bastien et al., 2012) as the back-end engine. We use binary cross-entropy as the objective function and the Adam optimization algorithm (Kingma and Ba, 2014) for training the network. We initialize the embeddings for words and lemmas with pre-trained vectors, trained using word2vec on the Finnish clinical corpus. For hyperparameter optimization, we do a grid search and evaluate each model on the development set. To determine the number of epochs needed for training, we use early stopping. Optimization is done against the micro-averaged F-score.
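For reference, the micro-averaged F-score pools true positives, false positives and false negatives over all classes and sentences before computing precision and recall. The sketch below assumes that gold and predicted labels are available as one set per sentence; the function name and the toy labels are illustrative.

```python
def micro_f_score(gold, predicted):
    """Micro-averaged F-score over multi-label sentence annotations.

    gold, predicted: lists of label sets, one set per sentence.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)   # labels both annotated and predicted
        fp += len(p - g)   # predicted but not annotated
        fn += len(g - p)   # annotated but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example with three sentences.
gold = [{"Pain", "Location"}, set(), {"Pain"}]
pred = [{"Pain"}, {"Pain"}, {"Pain"}]
score = micro_f_score(gold, pred)
print(round(score, 4))
```

Since counts are pooled, frequent classes dominate the micro-average, so a model can score well overall while still performing poorly on rare classes.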
To avoid overfitting, we apply dropout (Srivastava et al., 2014) regularization with a rate of 20% on the input gates and with a rate of 1% on the recurrent connections of all LSTM units. In addition, we have set the dimensionality of the word, lemma and POS tag embeddings to 300, and the dimensionality of the LSTMs' output is also set to 300.

Results
We first evaluate the two NER methods on mention level using a strict offset matching criterion. The micro-averaged results are presented in Table 1. The NERsuite model achieves F-scores of 73.10 and 48.11 on the test sets of the pain and acute confusion data sets, respectively. Surprisingly, the CNN-BiLSTM-CRF model is not able to reach the performance of the vanilla NERsuite on the pain data set, even though it is able to utilize pre-trained word embeddings. This might be due to the data sets being limited to open-heart surgery patients and thus to a rather narrow vocabulary. Consequently, we do not train CNN-BiLSTM-CRF on the confusion data. To analyse the performance of the NER approach in relation to the SC approach, we also convert the detected entity mentions to sentence-level predictions. For this, the predictions from the best performing method, i.e. NERsuite, are used. Table 2 shows the sentence-level scores for both the NER and SC approach. The best performing neural network used in the SC approach achieves slightly inferior results compared to the NER approach (when evaluated on sentence level). This seems to partly contradict our hypothesis that sentence-level classification methods could perform better than mention-level NER methods when the task is evaluated at the sentence level. Still, in Table 3 we see that the SC approach achieves the best overall prediction scores for several of the annotation classes (see also supplementary materials). Based on our analysis so far, it is difficult to say whether these classes (i.e. the concepts they represent) are more "complex" than the others, or if there are some other factors affecting the results. In an attempt to gain better insight into this, we calculated the average annotation span length and vocabulary size associated with the different classes. However, these numbers did not show any clear trend.
Table 2: Micro-averaged F-scores for the different approaches on the test sets of the pain and acute confusion data sets. NERsuite was used to produce the NER scores.
The actual pain mentions, which are divided into explicit, implicit and potential pain subcategories, all achieve relatively high performance, with implicit pain being the hardest to predict (see supplementary materials for more details). The other classes, which describe additional information about the pain mentions, are generally speaking harder to detect than the actual pain mentions. The acute confusion related entities seem to be much harder to predict due to the vague and sparse nature of these concepts.
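The conversion of mention-level predictions into sentence-level predictions used for the comparison above can be sketched as follows. It assumes that sentences and mentions are given as character offsets and that a sentence receives the label of every mention overlapping it; the offsets, class names and function name are illustrative, and any combination classes would first have to be split back into their component classes.

```python
def mentions_to_sentence_labels(sentence_spans, mentions):
    """Project mention-level predictions onto sentences.

    sentence_spans: list of (start, end_exclusive) character offsets
    mentions:       list of (start, end_exclusive, label)
    Returns one label set per sentence.
    """
    labels = [set() for _ in sentence_spans]
    for m_start, m_end, label in mentions:
        for i, (s_start, s_end) in enumerate(sentence_spans):
            if m_start < s_end and m_end > s_start:   # spans overlap
                labels[i].add(label)
    return labels

# Two hypothetical sentences and three predicted mentions.
sentence_spans = [(0, 30), (31, 60)]
mentions = [(5, 10, "Pain"), (12, 20, "Location"), (40, 48, "Pain")]
sentence_labels = mentions_to_sentence_labels(sentence_spans, mentions)
print(sentence_labels)
```

The resulting label sets have the same form as the output of a multi-label sentence classifier, so both approaches can be scored with the same sentence-level metric.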

Discussion and Future Work
In this study we have gathered initial results for detecting mentions of pain and acute confusion in Finnish clinical text. We also use a relaxed evaluation based on sentence-level predictions and experiment with approaches designed specifically for this setting. Surprisingly, the NERsuite-based mention-level approach outperforms all other tested methods overall, making it the best suited alternative for real-world applications. However, since the sentence classification approach achieves the best scores on several individual classes, the two approaches may be complementary.
As the data sets used are limited to open-heart surgery patients, a critical future work direction will be assessing the generalizability of the trained models on larger sets of patient health records, and on records from other hospital units. This study also reveals that multiple classes in the annotation schemes, in particular for acute confusion, need more manually annotated data, i.e. more training examples, in order to be reliably detected automatically.
As many of the classes can be considered as descriptive attributes of the pain and acute confusion mentions, but the relations have not been annotated explicitly, another future work direction is to investigate how often these relations are ambiguous and whether the relation extraction could be solved in an unsupervised fashion.

A Supplemental Material
The supplementary material includes specific information about the annotation classes for the pain and acute confusion data sets, as well as the detailed evaluation of the studied methods.