Classification of Syncope Cases in Norwegian Medical Records

Loss of consciousness, so-called syncope, is a commonly occurring symptom associated with worse prognosis for a number of heart-related diseases. We present a comparison of methods for a diagnosis classification task in Norwegian clinical notes, targeting syncope, i.e. fainting cases. We find that an often neglected baseline with keyword matching constitutes a rather strong basis, but more advanced methods do offer some improvement in classification performance, especially a convolutional neural network model. The developed pipeline is planned to be used for quantifying unregistered syncope cases in Norway.


Introduction
Neural methods have revolutionized the field of NLP in recent years, including in the clinical domain. The performance gain, however, may not always be proportional to the increased complexity and decreased transparency that their use might entail, especially in data-sparse domains and target languages. The limited availability of data and its linguistic characteristics, i.e. a high density of terminology, repetitions, abbreviations and misspellings (Allvin et al., 2011), greatly influence the efficiency of the NLP methods applied. These methods have been compared to some extent in previous work (Baumel et al., 2018; Mascio et al., 2020; Karimi et al., 2017), but they are often evaluated on the same (and often limited) openly available datasets (Pestian et al., 2007; Johnson et al., 2016). The real-world utility of various approaches in clinical text processing, especially for languages other than English, remains to be investigated (Ching et al., 2018). Moreover, comparison to a simple rule-based baseline is often missing, leaving some uncertainty around the advantage of more advanced methods.
Starting from a close collaboration with Akershus University Hospital, we re-examine the question of the optimal methodological choice in the context of diagnosis coding in Norwegian clinical notes. Diagnosis codes are standard alphanumeric codes representing a disease, a widely adopted scheme being ICD-10 (World Health Organization et al., 2004). ICD-10 codes are used for a variety of purposes, including hospital billing and reimbursement, population health statistics, and clinical research. Additionally, the re-use of structured health data in clinical decision support and risk assessment has been suggested. ICD-10 coding records the most relevant classification of the reason for contact, underlying conditions or procedures related to the stay. A host of signs, events and observations are not coded. Syncope, or similar signs, may be regarded as secondary or irrelevant for a certain patient, and thus only mentioned but not coded. Clearly, accurate coding is important, but since coding is a human process prone to error and bias, the quality of ICD-10 codes has been questioned. This is the case for syncope, a transient loss of consciousness typically due to insufficient blood flow to the brain, which was chosen as the use case for our study. A large study of Danish medical records (Ruwald et al., 2012) found that around a third of actual syncope records did not have the appropriate ICD-10 code. Since syncope can be an important sign of heart disease and a marker of elevated risk of death in certain conditions such as hypertrophic cardiomyopathy (Elliott et al., 2015), being able to retrieve information about a patient's syncope events even when an ICD-10 code is not present is crucial for better risk assessment. This work also constitutes a first step in the direction of an automatic diagnosis coding system for Norwegian, which is currently not available.
The research questions we investigate in this context are: (i) How do linear and neural models compare to a simple keyword matching baseline for binary automatic diagnosis code classification?; and (ii) How useful are pre-trained embeddings for this task? In what follows, we first describe our health record data and our pre-processing steps. We then compare three types of methods for syncope classification: a rule-based one relying on keyword matching, linear machine learning models and neural models. Besides estimating the number of unregistered syncope cases in Norway, our processing and classification pipeline can also easily be re-used to train more generic diagnosis code classifiers.

Background
Since medical language is rather terminology-heavy, rule-based methods can often go a long way in clinical NLP tasks and are, therefore, still rather widespread (Koleck et al., 2019). Statistical approaches are better at handling linguistic phenomena such as synonyms, code-switching and negation; however, they are computationally more expensive, require resources and, in particular the neural ones, are often less interpretable (Linzen et al., 2019). Moreover, neural methods substantially alleviate the burden of feature engineering, but are considerably more challenging in terms of hyper-parameter tuning. Incorporating such models into clinical data processing pipelines is thus worthwhile only if they can demonstrate a clear advantage over their simpler counterparts. Dipaola et al. (2019) developed linear classifiers with manually and automatically selected n-grams as features for classifying syncope in Italian medical records. A frequent target of investigations has been the 2007 Computational Medicine Challenge (CMC) dataset, focusing on automatic ICD coding in radiology reports. Both rule-based (Farkas and Szarvas, 2008) and statistical methods (Crammer et al., 2007), including neural ones (Karimi et al., 2017), have been tested and sometimes compared on this data. Karimi et al. (2017) reported that the performance of a Support Vector Machine (SVM) with term frequency-inverse document frequency (TF-IDF) bag-of-words (BOW) features remained considerably below the results of a Convolutional Neural Network (CNN) with dynamic in-domain pre-trained Word2Vec embeddings, with F1 scores of .65 and .81 respectively. A direct comparison across these works, however, is difficult given differences in the evaluation and data subset used.
More recently, using another dataset, MIMIC-III (Johnson et al., 2016), experiments presented by Baumel et al. (2018) indicated that neural methods outperform linear models for the same type of multi-class classification of ICD codes, although not always by a large margin. Mascio et al. (2020) also described a comparison between linear and neural models, but for different clinical binary classification tasks (e.g. status and negation prediction) and showed that recurrent neural networks tuned for their task performed on par with the more recent transformer models (Devlin et al., 2019). Rule-based baselines were often not included in these recent studies (Karimi et al., 2017; Baumel et al., 2018; Mascio et al., 2020), so the practical advantage of different approaches compared to methods based on heuristics remains somewhat unclear.

Dataset
Our data consisted of de-identified discharge summaries from Akershus University Hospital. Half of the notes were diagnosed syncope cases (SYN), the other half were notes with a variety of diagnosis codes for patients with no recorded and coded history of syncope (NONS). The documents were authored between 2005 and 2016. While patients in SYN were from a variety of departments, all NONS patients were from the Cardiology Department. Moreover, only patients who were ≥ 18 years old at the time of discharge were included. The notes contained free text with some structuring in the form of titled sections with information about e.g. diagnosis, family history and current status. There were, however, inconsistencies in the section titles as well as in the presence and order of these sections. A previous study (Røst et al., 2020) using EHRs from Akershus University Hospital in a text classification task has also identified a need for improving interoperability when exporting such unstructured data.

Experimental Setup
The first pre-processing step consisted of tokenization with UDPipe (Straka et al., 2016). Diagnosis information reflecting the labels used for classification (SYN vs. NONS) was then removed from the documents using: (i) lexical matching for section title identification; and (ii) UDPipe paragraph information for determining section boundaries. We divided our data into three stratified splits: 70% was reserved for training, 15% was used as validation data for hyper-parameter tuning and the remaining 15% was set aside for testing. We compared a keyword matching baseline to two linear classifiers, a Logistic Regression (LR) classifier and an SVM, and to neural models, namely CNNs. These learning algorithms have been commonly and successfully used in previous NLP studies, including in the clinical domain (Dipaola et al., 2019; Karimi et al., 2017).
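As an illustration of the 70/15/15 stratified split described above, the following is a minimal standard-library sketch. The function name `stratified_split`, the fixed seed and the toy label list are assumptions made for this example; the tooling actually used for the split is not specified here.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Return (train, val, test) index lists that preserve class proportions.

    Hypothetical helper: each class is shuffled and cut separately, so every
    split keeps the same SYN/NONS ratio as the full dataset.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


# Toy balanced dataset (assumed sizes, not the study's actual counts).
labels = ["SYN"] * 100 + ["NONS"] * 100
train_idx, val_idx, test_idx = stratified_split(labels)
```

Splitting per class, rather than shuffling the whole dataset once, is what keeps the balanced SYN/NONS ratio intact in the smaller validation and test portions.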
Baseline with lexical matching (LEXM) We computed a baseline consisting of simple lexical matching applied to the pre-processed documents using the term synkope 'syncope', which would find both its base form and other derived forms without additional lemmatization. Whenever a document contained this term at least once, it was classified as belonging to the SYN class, and otherwise as NONS.
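The LEXM decision rule can be sketched in a few lines. This is an illustrative reading of the description above, assuming case-insensitive substring matching; the function name `lexm_classify` and the example sentences are invented for the sketch and are not taken from the data.

```python
def lexm_classify(document: str, term: str = "synkope") -> str:
    """Label a document SYN if the term occurs at least once, else NONS.

    Substring matching (an assumption of this sketch) also catches derived
    forms such as 'synkopert' without any lemmatization.
    """
    return "SYN" if term in document.lower() else "NONS"


# Invented example sentences, for illustration only.
examples = [
    "Pasienten hadde en synkope hjemme.",
    "Pasienten har aldri synkopert.",
    "Ingen brystsmerter ved innkomst.",
]
predictions = [lexm_classify(text) for text in examples]
```

Note that the second example is labeled SYN even though the mention is negated, which is the kind of false positive a pure matching rule cannot avoid.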
Linear models For training the linear models, we used scikit-learn (Pedregosa et al., 2011), and we employed Keras with TensorFlow (Abadi et al., 2016) as backend for the neural models. For both SVM and LR, we used BOW features extracted with a TF-IDF vectorizer. We performed a grid search to find the optimal hyper-parameters on the validation data.
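To make the TF-IDF BOW representation concrete, the sketch below computes TF-IDF weights with the standard library only. It assumes smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which mirrors the default behavior of scikit-learn's TfidfVectorizer; the vectorizer settings actually used are not stated, and the toy documents are invented.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return (vectors, idf) where each vector maps term -> TF-IDF weight.

    Assumed weighting: raw term count * smoothed IDF, then L2 normalization.
    """
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

    vectors = []
    for doc in tokenized_docs:
        weights = {t: count * idf[t] for t, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors, idf


# Invented toy documents, already tokenized.
docs = [
    ["pasienten", "synkope"],
    ["pasienten", "brystsmerter"],
    ["pasienten", "svimmel", "synkope"],
]
vectors, idf = tfidf_vectors(docs)
```

Terms that occur in every document, such as pasienten here, receive the lowest IDF, so rarer and more discriminative terms dominate the feature vector passed to the linear classifier.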
Neural models For the CNN, Word2Vec (Mikolov et al., 2013) embeddings were used as input representation to capture contextual similarity between words. We adopted a common CNN architecture (Kim, 2014) consisting of an input layer of 100 dimensions, a convolutional layer concatenating 100 filters of sizes 3 to 5, with rectified linear units, max pooling and a dropout of 0.5, followed by a fully connected softmax layer. We used binary cross-entropy loss, the Adam Optimizer, a learning rate of 0.001 and a batch size of 32. We trained for 10 epochs with early stopping based on validation accuracy and a patience of 2 epochs.
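The core operation of this architecture, a convolution filter slid over consecutive token embeddings followed by a rectified linear unit and max-over-time pooling, can be illustrated with a toy sketch in plain Python. The 2-dimensional embeddings, the single filter of width 3 and its weights are made up for readability; the model described above uses 100-dimensional embeddings and 100 filters per size, and was built with Keras rather than code like this.

```python
def conv_max_pool(embedded, filt, bias=0.0):
    """Apply one convolution filter over a token sequence, then max-pool.

    embedded: list of token vectors (one list of floats per token).
    filt:     list of weight vectors, one per position in the filter window.
    Returns the single feature this filter contributes for the document.
    """
    width = len(filt)
    activations = []
    for start in range(len(embedded) - width + 1):
        window = embedded[start:start + width]
        score = bias + sum(
            w * x
            for weight_vec, token_vec in zip(filt, window)
            for w, x in zip(weight_vec, token_vec)
        )
        activations.append(max(0.0, score))  # rectified linear unit
    return max(activations)  # max-over-time pooling


# Toy input: 4 tokens with 2-dimensional embeddings (assumed values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
filter_w3 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
feature = conv_max_pool(tokens, filter_w3)
```

In the full model, the pooled features of all filters are concatenated into one vector before dropout and the final fully connected layer.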
We experimented with different embedding initializations, inspired by Kim (2014): a randomly initialized one (W2V-R) and two where weights were based on pre-trained embeddings. In one case, weights were not trainable during the learning process (static) and in the other, we continued training these weights (dynamic). This type of transfer learning consisting of fine-tuning pre-trained embeddings for a specific task is often beneficial when the size of the available training data is small (Kim, 2014).
Pre-trained embeddings In the absence of pre-trained clinical embeddings for Norwegian, we compared two other types of pre-trained embeddings, both 100-dimensional Word2Vec skip-gram models trained with Gensim (Řehůřek and Sojka, 2010): (i) general language embeddings W2V-G trained on OCR-ed books, news and web corpora, namely model nr. 100 from the NLPL repository (Fares et al., 2017); and (ii) domain-related embeddings W2V-M, which we trained on data from the Norsk legemiddelhåndbok 'Norwegian drug manual'. The medical vocabulary of these disease and drug descriptions was closely connected to the clinical domain. We used default parameters for training W2V-M, but lowered the minimum word count to 1 given the small data size.

Model Comparison and Error Analysis
In Table 2, we present the classification results for the approaches tested, where R-SENS represents sensitivity, i.e. recall for the positive class, SYN, and R-AVG is average recall for both classes. For the CNN models, an average of three runs (and standard deviation) is reported.
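For clarity, the two recall-based measures can be computed as in the following sketch, where R-AVG is taken to be the unweighted mean of the per-class recalls. The helper name `recall` and the toy gold and predicted labels are assumptions for illustration and do not correspond to the reported results.

```python
def recall(gold, pred, target):
    """Fraction of gold instances of class `target` that were predicted as it."""
    predictions_for_target = [p for g, p in zip(gold, pred) if g == target]
    return sum(p == target for p in predictions_for_target) / len(predictions_for_target)


# Toy labels (invented): 4 SYN and 4 NONS documents.
gold = ["SYN", "SYN", "SYN", "SYN", "NONS", "NONS", "NONS", "NONS"]
pred = ["SYN", "SYN", "SYN", "NONS", "NONS", "NONS", "SYN", "SYN"]

r_sens = recall(gold, pred, "SYN")                # sensitivity: recall for SYN
r_avg = (r_sens + recall(gold, pred, "NONS")) / 2  # average recall over both classes
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
```

On a balanced test set such as the one used here, average recall and accuracy coincide, as they do in this toy example.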
Lexical matching provided a rather high baseline, namely .80 accuracy, which suggests that similar terminology matching methods are worth testing and comparing to in terminology-rich domains such as the clinical one. Although we started from a strong baseline, we found that, with increasing computational complexity, performance improved somewhat. LR proved to be the best linear model (.86 accuracy) with L1 penalty, C = 10 with a
For neural models, initializing embeddings randomly worked best. The proportion of in-embedding words was in fact rather low for both W2V-G and W2V-M, namely 51% and 27.5% respectively. In addition, W2V-G results might be influenced by a difference in domains. Models with W2V-M produced not only lower scores, but also more instability, as the standard deviation shows, likely due to the small vocabulary size (50K) and few in-embedding words. Dynamic embeddings showed improvements over static ones, especially for W2V-M, in line with previous findings (Kim, 2014).
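The share of in-embedding words can be estimated as in the sketch below. It assumes coverage is measured over distinct word types rather than running tokens, which is not specified above; the function name `embedding_coverage` and the toy vocabulary are likewise invented for illustration.

```python
def embedding_coverage(corpus_tokens, embedding_vocab):
    """Share of distinct corpus word types that have a pre-trained vector."""
    types = set(corpus_tokens)
    return len(types & set(embedding_vocab)) / len(types)


# Toy corpus and embedding vocabulary (assumed values).
corpus_tokens = ["pasienten", "synkope", "synkope", "svimmel", "gulvet"]
embedding_vocab = {"pasienten", "gulvet", "bok"}
coverage = embedding_coverage(corpus_tokens, embedding_vocab)
```

Words without a pre-trained vector have to be initialized some other way, for instance randomly, so low coverage limits how much a model can benefit from pre-training.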
We also compared our methods with McNemar's test (McNemar, 1962) 4 and found a statistically significant difference in the misclassifications at α = 0.05 only between the baseline and CNN-W2V-R (p = 0.003), but not between the other two model pairs, namely LR vs. baseline (p = 0.163) and LR vs. CNN-W2V-R (p = 0.077). Figure 1 shows the receiver-operating characteristic (ROC) curve for LR and CNN-W2V-R on the test set, which also indicates rather similar performance.
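The binomial (exact) variant of McNemar's test depends only on the discordant pairs, i.e. the documents that exactly one of the two compared models classified correctly. A minimal standard-library sketch follows; the function name `mcnemar_exact` and the counts in the example are invented for illustration and are not the counts behind the p-values above.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: documents model A got right and model B got wrong.
    c: documents model B got right and model A got wrong.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Invented example: 1 vs. 9 discordant documents.
p_value = mcnemar_exact(1, 9)
```

The exact form is generally preferred over the chi-squared approximation when the number of discordant pairs is small.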
To gain a better understanding of what the best performing linear and neural models, LR and CNN-W2V-R respectively, had learned, we inspected the 30 words which received the highest weights after training. For both models, these included near-synonyms such as svimmel 'dizziness' and bevissthetstap 'unconsciousness' and even the English translation of the term (syncope). Yet another group of informative features described typical circumstances of syncope (e.g. gulvet 'floor'). LR also captured inflectional variants like synkopert 'syncopated'. Both models' decisions thus relied on factors relevant to the target medical phenomenon.
4 With binomial distribution given the small sample size.
Our error analysis revealed that around half of the NONS instances misclassified by both LR and CNN-W2V-R as SYN did contain mentions of 'syncope', but sometimes either as part of a patient's previous history of illnesses or with negation (aldri synkopert 'never syncoped'). Slightly more (60%) of the misclassifications occurred for NONS texts; however, 23% (LR) and 38% (CNN-W2V-R) of these appeared to be unregistered syncope cases. Manually re-diagnosed data might therefore improve performance.

Conclusions
We described a set of experiments using keyword matching as well as machine learning methods for the classification of syncope cases in Norwegian clinical notes. Our results indicate that neural methods provide some advantage over a keyword baseline, but the latter performs surprisingly well, which indicates that terminological cues can be easily leveraged for such binary clinical text classification tasks in the absence of access to training data. This type of baseline thus constitutes a valuable starting and reference point for comparison to more advanced methods.
Future work includes hyper-parameter tuning of the neural models and assessing the generalizability of our models to new data, including different note types. We plan to use the developed models for quantifying the number of unregistered syncope cases in Norway and to extend them to classify a variety of diagnostic codes. Embeddings trained on large Norwegian clinical data would be valuable to boost performance for both this and other tasks.
This work showcases a fruitful collaboration between an NLP research environment and a hospital. Aligning clinical data processing interests and needs is particularly important for smaller languages without publicly available data, both for moving the clinical NLP research front forward and for bringing findings closer to clinical practice.

Ethics
In accordance with Norwegian law, the project has been approved by the hospital's internal Privacy Ombudsman, ref. 2019 15.