Distinguishing between Dementia with Lewy bodies (DLB) and Alzheimer’s Disease (AD) using Mental Health Records: a Classification Approach

While Dementia with Lewy Bodies (DLB) is the second most common type of neurodegenerative dementia following Alzheimer’s Disease (AD), it is difficult to distinguish from AD. We propose a method for DLB detection by using mental health record (MHR) documents from a (3-month) period before a patient has been diagnosed with DLB or AD. Our objective is to develop a model that could be clinically useful to differentiate between DLB and AD across datasets from different healthcare institutions. We cast this as a classification task using a Convolutional Neural Network (CNN), an efficient neural model for text classification. We experiment with different representation models, and explore the features that contribute to model performance. In addition, we apply temperature scaling, a simple but efficient model calibration method, to produce more reliable predictions. We believe the proposed method has important potential for clinical applications using routine healthcare records, and for generalising to other relevant clinical record datasets. To the best of our knowledge, this is the first attempt to distinguish DLB from AD using mental health records, and to improve the reliability of DLB predictions.


Introduction
Alzheimer's disease (AD) is the most prevalent type of dementia, characterised by progressive cognitive impairment such as memory loss. Dementia with Lewy bodies (DLB), also known as Lewy body dementia, is the second most common type of neurodegenerative dementia after AD, with the defining features of fluctuating cognition, recurrent visual hallucinations, rapid eye movement (REM) sleep behaviour disorder, and Parkinsonian motor symptoms in addition to dementia (Walker et al., 2015). Particularly in the early stages, prior to diagnosis, DLB and AD are difficult to distinguish; hence the detection rates of DLB are sub-optimal, with a large proportion of cases missed or misdiagnosed as AD (Kane et al., 2018). Detection of DLB is, however, crucial because, compared to AD and other forms of dementia (e.g. Parkinson's disease dementia (PDD)), DLB has a worse prognosis across key outcomes such as mortality, hospitalisation, move into residential care, quality of life, and healthcare costs (Mueller et al., 2017). Moreover, not only is early diagnosis paramount, but different types of treatment (e.g. antipsychotics) can also have different impacts on these patient groups, which adds to the importance of accurate and timely diagnoses.
Due to the challenges in recognising DLB clinically, it has been difficult to recruit large research cohorts of representative patients with DLB, and the increasing use of routinely collected healthcare data has been suggested as a potential solution to this shortage. Applying classical methods of symptom ascertainment using natural language processing (NLP) in routinely collected data is however difficult in patients with DLB, as clinicians tend to record the defining features only if they have also made the correct DLB diagnosis (Mueller et al., 2018). Therefore, we applied novel neural models of NLP to test whether these can be clinically useful to distinguish DLB and AD, and to provide assistance to mitigate expensive outcomes from misdiagnoses of DLB.
This task is challenging because DLB and AD share certain clinical and biological similarities that make them particularly difficult to differentiate. Motivated by the emergence of neural models and NLP methods applied to the biomedical domain, we cast this as a binary text classification task, where we use convolutional neural networks (CNNs) (LeCun et al., 1998;Krizhevsky et al., 2012;Kim, 2014) to address it. Additionally, the generalisation of well-trained models is notably more difficult, since different formats and grammatical patterns emerge in MHRs across different healthcare institutions. In order to test the efficiency of our proposed methodology, we use three datasets from two different MHR (clinical documentation) systems and healthcare institutions, with the aim of comparing the model's performances on similar datasets containing relevant data, but with different contextual structures.
To assist the analysis of our experimental results, and to bridge the gap between model accuracy and confidence, we also study an approach where the model confidence estimates are calibrated. Confidence calibration is important for classification models. Classification networks must not only be accurate, but should also indicate when they are likely to be incorrect; a well-calibrated network matches its confidence to its accuracy so that it is confident when it is accurate, and uncertain when it is not. We use the calibration method named temperature scaling, where expected calibration error (ECE), the expectation of the differences between confidence and accuracy, is used as the primary empirical metric to measure calibration (Guo et al., 2017).
In this paper, we present our preliminary work towards automatically distinguishing individuals diagnosed with DLB or AD using neural network models and MHR texts. This methodology can provide an efficient technique for detecting DLB and intervening in its treatment. Our contributions are threefold: 1) we introduce a CNN approach for the classification of DLB and AD using MHRs; 2) we investigate the performance of the proposed model on two MHR datasets from two different healthcare institutions with different formats and patterns; 3) we also apply a neural model calibration method to help in understanding when the model predictions tend to be brittle, so that the model can output confidence scores with higher reliability.

Related Work
With the success of neural models for many NLP tasks, deep learning methods, as well as word embeddings, have started to be applied to the biomedical and/or clinical domains (Cohen and Demner-Fushman, 2014;Wang et al., 2018;Kormilitzin et al., 2020) including mental health, such as automatic detection and classification of cognitive impairment.
For example, three neural models (CNN-, LSTM-RNN-, and CNN-LSTM-based) were applied to distinguish AD and Control patients from DementiaBank (Karlekar et al., 2018;Becker et al., 1994). The CNN-LSTM model achieved state-of-the-art performance on the AD classification task. Since neural models are usually black boxes and it is hard to interpret the reasoning behind final classification decisions, various visualisation techniques have been proposed for neural networks (Mahendran and Vedaldi, 2015;Samek et al., 2016;Li et al., 2016;Kádár et al., 2017). Karlekar et al. (2018) illustrated two visualisation methods for interpretation, based on activation clustering and first-derivative saliency, to assist the analysis and consolidation of distinctive grammatical patterns of contextual information from AD patients.
Early detection plays a crucial part in the study of dementia. Pan et al. (2019) propose a hierarchical model that encompasses both the hierarchical and sequential structures of picture descriptions with an attention mechanism, detecting signs of cognitive decline at both the word and sentence levels, using DementiaBank and an in-house database of Cookie Theft picture descriptions (Mirheidari et al., 2017). They show that both the proposed hierarchical structure and the attention mechanism contribute to the improvement in AD detection.
Most NLP studies addressing dementia use language transcripts from clinical cohorts, such as DementiaBank (Becker et al., 1994). To our knowledge, very few studies have used MHR documents and NLP for modelling detection of dementia types, and we are not aware of any studies using NLP and MHRs for detection of DLB. McCoy Jr. et al. (2020) present a study using electronic health record (EHR) data for stratifying risk for dementia onset, using a bespoke NLP approach for scoring symptoms in the clinical texts. This NLP approach, however, relies on pre-defined terms, and addresses a slightly different clinical problem.
When applying neural networks to real-world decision-making systems, classification networks must not only be accurate, but should also indicate when they are likely to be incorrect. A network should provide a calibrated confidence measure in addition to its prediction. Calibrated confidence estimates are also important for model interpretability. Guo et al. (2017) identify methods that can alleviate miscalibration in neural networks, and offer insight and intuition into network training and architectural trends that may cause miscalibration. Good confidence estimates can provide valuable extra information to establish trustworthiness in early detection of cognitive impairment.

Methodology
Our proposed approach uses a CNN model to distinguish DLB and AD patients. We compare the performance of using an embedding layer (Emb-layer) and pre-trained embeddings (BioWord2Vec) on our classification task, and finally apply a post-processing method (temperature scaling) for model calibration.

Input representation: word embeddings
We compare two approaches for the input, using high-dimensional word vectors (Mikolov et al., 2013): 1) a randomly initialised embedding layer trained with the neural network, and 2) pre-trained biomedical word embeddings.
For the pre-trained embeddings, we use BioWord2Vec, the distributed word representations proposed by Zhang et al. (2019). These biomedical word embeddings are learnt from medical subject heading (MeSH) terms and text sequences, employing the fastText (Bojanowski et al., 2017) subword embedding model. These non-contextualised embeddings performed best in our setting. We also conducted experiments using the contextualised BioBERT (Lee et al., 2019) embeddings available at that time, but they performed comparatively worse owing to the specifics of subword tokenisation and the greater length of clinical documents relative to the standard configurations in the BioBERT pre-training framework. During the preparation of this paper, more work on advanced pre-trained word embeddings emerged, and we applied BioWord2Vec, the one most relevant to our datasets.
BioWord2Vec outperforms the current state-of-the-art non-contextualised word embeddings in most BioNLP and/or ClinicalNLP tasks, suggesting that the sub-word information and domain knowledge are indeed able to improve the quality of biomedical word representations and better capture their semantics.

Convolutional Neural Network
We apply the convolutional neural network (CNN) model (Kim, 2014) to our DLB and AD classification task. The input to the model is the concatenation of all documents of each patient, represented as a matrix using each of the embedding configurations. We use filters that slide over full rows of the matrix. The height of the filters may vary, but sliding windows over 3-5 words at a time are typical. Next, we max-pool (a sample-based discretisation process) the result of the convolutional layer into a long feature vector, add dropout regularisation, and pass the result to a softmax layer that outputs probabilities over the two classes.
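The forward pass described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the ReLU non-linearity, the toy embedding values, and the toy output weights are all assumptions made for illustration, and dropout is left out because the sketch only shows the inference-time computation.

```python
import math


def conv_max_pool(doc_matrix, filt, bias=0.0):
    """Slide one filter over full rows of the document matrix, apply a ReLU
    (an assumed non-linearity), and return the max-over-time pooled value."""
    height = len(filt)  # number of consecutive words covered by the filter
    if len(doc_matrix) < height:
        raise ValueError("document is shorter than the filter height")
    best = 0.0  # ReLU outputs are never below zero
    for start in range(len(doc_matrix) - height + 1):
        total = bias
        for filt_row, word_vec in zip(filt, doc_matrix[start:start + height]):
            total += sum(w * x for w, x in zip(filt_row, word_vec))
        best = max(best, max(0.0, total))
    return best


def softmax(logits):
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def cnn_forward(doc_matrix, filters, out_weights, out_bias):
    """Collect one pooled feature per filter, then apply a softmax layer
    over the two classes. Dropout is omitted (inference-time view)."""
    features = [conv_max_pool(doc_matrix, f) for f in filters]
    logits = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(out_weights, out_bias)
    ]
    return softmax(logits)


# Toy document: 4 words, each a 2-dimensional embedding (made-up numbers).
toy_doc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
toy_filters = [
    [[1.0, 0.0], [0.0, 1.0]],              # spans 2 words
    [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],  # spans 3 words
]
toy_probs = cnn_forward(
    toy_doc,
    toy_filters,
    out_weights=[[1.0, -1.0], [-1.0, 1.0]],  # one row per class
    out_bias=[0.0, 0.0],
)
```

In a full model, each filter height would have many filters, and the pooled features of all of them would be concatenated before the softmax layer.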
We use a logistic regression (LR) model as a baseline. Documents are pre-processed by tokenising and lowercasing. We compare two different text representations: bag-of-words (BoW) and term frequency-inverse document frequency (TF-IDF) counts. For TF-IDF counts, we selected a minimum document frequency of 5 and a maximum of 5,000 features.

Temperature Scaling
Temperature scaling is a post-processing technique which can almost perfectly restore network calibration (Guo et al., 2017), and can be easily added to any model. For classification problems, the neural network model outputs a vector known as the logits. The logits vector is passed through a softmax function to get class probabilities. Temperature scaling simply divides the logits vector by a learnt scalar parameter, i.e.
$$\hat{y} = \mathrm{softmax}(z / T),$$
where $\hat{y}$ is the prediction, $z$ is the logits vector, and $T$ is the learnt parameter. $T$ is learnt on the validation set, where it is chosen to minimise the negative log-likelihood (NLL). Intuitively, temperature scaling simply softens the neural network outputs. This makes the network slightly less confident, which in turn makes the confidence scores better reflect true probabilities.
This post-processing calibration method is applied on our DLB and AD classification task, to narrow the gap between model confidence and accuracy. The calibrated confidence provides further assistance when deciding whether the individual prediction might be reliable or incorrect.
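As a hedged illustration of this step, the sketch below fits a single temperature on held-out logits by minimising the NLL. The grid search is a simplification chosen for the example (Guo et al. (2017) use a gradient-based optimiser), and the function names and toy validation logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply a stable softmax."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    loss = 0.0
    for logits, label in zip(logit_rows, labels):
        probs = softmax_with_temperature(logits, temperature)
        loss -= math.log(probs[label])
    return loss / len(labels)


def fit_temperature(val_logits, val_labels, candidates=None):
    """Pick the temperature with the lowest validation NLL.
    Grid search is an assumption made for brevity, not the published method."""
    if candidates is None:
        candidates = [0.05 * k for k in range(1, 201)]  # 0.05 .. 10.0
    return min(candidates, key=lambda t: nll(val_logits, val_labels, t))


# Hypothetical over-confident validation logits; the last prediction is wrong.
val_logits = [[4.0, 0.0], [3.5, 0.0], [0.0, 4.0], [4.0, 0.0]]
val_labels = [0, 0, 1, 1]
best_t = fit_temperature(val_logits, val_labels)
calibrated = [softmax_with_temperature(z, best_t) for z in val_logits]
```

Because every logit is divided by the same positive scalar, the predicted class of each sample is unchanged; only the confidence attached to it moves.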
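A scalar summary statistic for calibration can be useful to compare two distributions: accuracy and confidence. The difference between accuracy and confidence is defined as:
$$\mathbb{E}_{\hat{P}}\left[\,\left|\,\mathbb{P}\left(\hat{Y} = Y \mid \hat{P} = p\right) - p\,\right|\,\right],$$
where $\hat{Y}$ is a class prediction, $Y$ is the true label, and $\hat{P}$ is the confidence associated with the prediction, i.e. the probability of correctness.
In practice, the model predictions are grouped into $M$ interval bins (each of size $1/M$). Expected calibration error (ECE) is computed as the weighted average of the bins' accuracy/confidence differences:
$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n}\,\bigl|\,\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\,\bigr|,$$
where $B_m$ is the set of indices of samples whose prediction confidence falls into the interval $\left(\frac{m-1}{M}, \frac{m}{M}\right]$, $\mathrm{acc}(B_m)$ and $\mathrm{conf}(B_m)$ are the average accuracy and average confidence within bin $m$, and $n$ is the total number of samples across all bins. Perfect calibration is achieved when $\mathrm{ECE} = 0$, that is, $\mathrm{acc}(B_m) = \mathrm{conf}(B_m)$ for all bins $m$.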
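The binned ECE computation described above can be sketched in a few lines. This is an illustrative version under stated assumptions: the function name and the default of 10 bins are not taken from the paper, and confidences lying exactly on a bin boundary may fall on either side because of floating-point rounding.

```python
import math


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - confidence| over equal-width bins.

    confidences: predicted probability of the predicted class, in [0, 1].
    correct: True/False (or 1/0) per prediction.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Bin m (1-based) covers the interval ((m-1)/M, m/M].
        idx = min(n_bins - 1, max(0, math.ceil(conf * n_bins) - 1))
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        acc = sum(ok for _, ok in members) / len(members)
        avg_conf = sum(c for c, _ in members) / len(members)
        ece += len(members) / n * abs(acc - avg_conf)
    return ece


# 100 predictions with confidence 0.8, of which 80 are correct.
well_calibrated = expected_calibration_error([0.8] * 100, [True] * 80 + [False] * 20)
# 4 predictions with confidence 0.9, of which only 2 are correct.
over_confident = expected_calibration_error([0.9] * 4, [True, True, False, False])
```

The first toy case has an ECE of (numerically) zero, while the second has a gap of 0.4 between confidence and accuracy.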

Materials and Experimental Setup
By applying two types of word embeddings (Emb-layer and BioWord2Vec) for word representations, convolutional neural network (CNN) for model training, and temperature scaling for model calibration, we investigated and evaluated the efficiency of our proposed methodology on three datasets from two healthcare institutions.

Datasets
We use de-identified mental health records (MHRs) from two databases held at different healthcare institutions: (1) CRIS and (2) CRATE (CPFT). From each MHR database, we extract documents for patients diagnosed with either DLB or AD. Acquisition of ground truth differed for the two datasets. For CRIS, the MHRs were identified using an information extraction technique that matched any text strings associated with a diagnosis statement of Lewy body dementia or disease. The performance of this automatic extraction was verified by DLB experts, as described in Mueller et al. (2018). For CRATE, two experienced clinicians with knowledge of DLB diagnostic criteria and symptom presentation determined ground truth DLB cases in a set of records pre-selected by an information extraction procedure. Cases were identified as ground truth DLB if a diagnosis had been given by a clinician within the healthcare institution and was the most recent recorded diagnosis within the MHR (see Price et al. (2017) for more details on data collection for CRATE). To obtain a dataset more comparable to CRATE, we also created CRIS†, in which we randomly selected AD cases from CRIS to obtain a more balanced distribution, while the DLB cases remain identical to those in CRIS.
Within each dataset, we have information about the Patient ID and the Diagnosis Date of DLB and AD patients respectively. For each patient with either of these diagnoses, we use only the text written from the first consultation until the date 3 months before the diagnosis (concatenated into one document). The intuition is that MHRs closer to the date of diagnosis could be more informative of the two diseases and would make the differentiation using NLP trivial, so we remove them. There is a total of 90 DLB patients and 750 AD patients in CRIS, and 98 DLB patients and 80 AD patients in CRATE (see Table 1). The distribution of DLB and AD patients in CRIS is close to the real distribution: diagnosed DLB is currently about 5% of all dementias, there is evidence that DLB should be around 10%, and AD is around 70% (Mueller et al., 2017). The more balanced distribution of CRATE is an outcome of the manual extraction. In CRIS†, the AD cases were extracted randomly from CRIS with the aim of making the results more comparable by equalising the number of DLB and AD patients (closer to the distribution in CRATE). The length of each document varies in the datasets, ranging from tens of words to hundreds of thousands (see Table 2). On average, documents are longer in CRIS and CRIS†. Since the standard CNN model used for text classification takes the maximum length of samples as the uniform length, we instead normalise the length to the median (as shown in Table 2) for better use of computational resources: documents are padded or cut to the same length, and the latest records are used as the training sample if a document exceeds the median. For the CNN model, we use a sequence length of 4,406 for CRIS and 2,710 for CRATE; for CRIS†, we applied the same median length (2,710) as CRATE, in order to make the results more comparable.

Experimental Setup
In our binary classification task we consider DLB cases as positive and AD cases as negative. We preprocess the datasets by lowercasing and tokenising using regular expression operations. We use 5-fold cross-validation (CV) to segment the training datasets and ensure that particular subgroups have no deterministic effect on final model performance. All our models use an Adam optimizer (Kingma and Ba, 2014), with a learning rate of 0.001. We used a 2-D CNN. Filter sizes of [3, 4, 5] were used with 128 filters per filter size. Batch size was set to 32. To avoid overfitting, we apply dropout to the output of all the functional layers (Srivastava et al., 2014), with the dropout rate set to 0.5. The final metrics are calculated by averaging the 5-fold cross-validation results.
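The preprocessing steps described above (lowercasing, regular-expression tokenisation, normalising documents to the median length, and 5-fold splitting) might look roughly as follows. This is a sketch under assumptions: the token pattern, the padding token, the function names, and the unstratified shuffle are illustrative choices not specified in the paper, and it assumes each patient's text is concatenated in chronological order so that the last tokens are the most recent.

```python
import random
import re
import statistics

TOKEN_PATTERN = re.compile(r"[a-z0-9]+")  # assumed pattern, not from the paper
PAD = "<pad>"                             # hypothetical padding token


def tokenise(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def median_length(token_lists):
    """Median document length, used as the uniform sequence length."""
    return int(statistics.median(len(tokens) for tokens in token_lists))


def normalise_length(tokens, target_len):
    """Keep the most recent tokens of long documents; right-pad short ones."""
    if len(tokens) >= target_len:
        return tokens[-target_len:]
    return tokens + [PAD] * (target_len - len(tokens))


def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train, test) index lists so that each sample is tested once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]


# Made-up example texts, for illustration only.
docs = [tokenise(t) for t in ["Visual Hallucinations, noted!", "Memory loss", "Seen in clinic today."]]
target = median_length(docs)
padded = [normalise_length(d, target) for d in docs]
```

With strongly imbalanced classes such as those in CRIS, a stratified split that preserves the DLB/AD ratio in each fold may be preferable; the sketch does not do this.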
In the ablation study, we remove important words from the training data to trace changes in model performance. These important words are either the most informative of DLB and AD (Model B, which uses a manually composed list of terms, expressions, and abbreviations related to the diagnoses of DLB and AD), or obtained from our baseline model as those contributing the most to the LR predictions (Model C). We believe these words are also indicative for neural models. Four models are designed and compared:
• Model A: The training data are the raw text for all the datasets.
• Model B: "lewy", "body", "bodies", "dlb", "ad", "lbd", "dementia" are removed from the original text.
• Model C: "parkinson", "hallucinations", "visual", "symptoms" are removed from the original text.
• Model D: Words mentioned in Model B and Model C are all removed from the original text.
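The four ablation settings can be expressed as a simple token filter. The word lists below are taken from the model descriptions above; the function name and the exact-match rule on lowercased tokens are assumptions, since the paper does not state how variants such as possessives were handled.

```python
MODEL_B_TERMS = {"lewy", "body", "bodies", "dlb", "ad", "lbd", "dementia"}
MODEL_C_TERMS = {"parkinson", "hallucinations", "visual", "symptoms"}

# Words removed from the training text under each ablation setting.
ABLATIONS = {
    "A": set(),                          # raw text, nothing removed
    "B": MODEL_B_TERMS,                  # diagnosis-related terms
    "C": MODEL_C_TERMS,                  # top LR features
    "D": MODEL_B_TERMS | MODEL_C_TERMS,  # both lists
}


def ablate(tokens, model):
    """Drop every token that exactly matches a removed word (assumed rule)."""
    removed = ABLATIONS[model]
    return [t for t in tokens if t not in removed]


example = ["dlb", "with", "visual", "hallucinations"]
ablated = {name: ablate(example, name) for name in ABLATIONS}
```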
We use the temperature scaling calibration method, which does not affect the model's accuracy. We would want the confidence estimates (output probabilities) to be calibrated. For example, given 100 predictions, each with confidence of 0.8, we expect that 80 should be correctly classified. A perfect calibration should be an identity function between accuracy and confidence. We decide to measure calibration by using expected calibration error (ECE).

Evaluation
In order to test the efficiency of our model, we report performance in terms of precision, recall, and F1-score. All the reported results are the average of 5-fold cross-validation (CV). We also report F1-scores for each fold. In addition, to better understand the underlying data, we extract the top-20 words contributing the most to the DLB classification in the LR model with both the BoW and TF-IDF counts representations.
Table 5: Top-20 words contributing the most to the DLB detection using logistic regression (LR) with BoW and TF-IDF counts representation (a minimum document frequency of 5 and a maximum of 5,000 features).
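For reference, the evaluation metrics can be computed as in the sketch below, with DLB as the positive class as stated in the experimental setup. Whether the reported scores are positive-class or averaged over both classes is not specified, so the positive-class variant here is an assumption, as are the function names and the toy labels.

```python
import statistics


def precision_recall_f1(y_true, y_pred, positive="DLB"):
    """Precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def average_over_folds(fold_scores):
    """Mean of per-fold scores, as used for the reported 5-fold results."""
    return statistics.mean(fold_scores)


# Toy labels: one true positive, one false negative, one false positive.
toy_true = ["DLB", "DLB", "AD", "AD"]
toy_pred = ["DLB", "AD", "DLB", "AD"]
toy_scores = precision_recall_f1(toy_true, toy_pred)
```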

Results
Overall classification results are reported in Table 4. Two kinds of word representations are used with the LR model: BoW and TF-IDF. Using BoW features resulted in higher F1-score (0.66) as compared to TF-IDF features (0.49) for CRIS; while the opposite is observed for CRATE (0.69 for BoW and 0.77 with TF-IDF features). In general, CNN achieves better results compared to the baseline LR (0.87 for CNN with Emb-layer on CRIS), and lower deviation for each fold in 5-fold cross-validation. On CRATE, the LR model with TF-IDF features performs best (0.77).
Comparing the performance of randomly initialised word embeddings (Emb-layer) and pre-trained BioWord2Vec, Emb-layer achieves a higher F1-score (0.87) than BioWord2Vec (0.63) for CRIS. Results on CRATE using Emb-layer and BioWord2Vec are, on the other hand, quite close in terms of F1-scores and their stability across the 5-fold CV.
However, for CRIS†, where the data sizes of DLB and AD are more comparable, the pre-trained BioWord2Vec embeddings (0.78) perform better than Emb-layer (0.73). Our proposed model, CNN with BioWord2Vec, achieves the highest F1-score (0.78) among the four models (LR with BoW and TF-IDF, CNN with Emb-layer and BioWord2Vec). With the same settings, the F1-score is also higher than that on CRIS (0.63), with lower deviation, which might be the outcome of a more balanced dataset. In comparison to CRATE (0.71), although the F1-score on CRIS† (0.77) is slightly higher, CRATE has a comparatively lower deviation. This result might stem from the fact that there is a significant increase in the overlap between the CRIS† and CRATE datasets (see Table 3).
We also report the top-20 most important features contributing to the prediction in the LR model using BoW and TF-IDF counts representations (see Table 5). Notably, "hallucinations", "parkinson", "visual", and "symptoms" are ranked highly in both CRIS and CRATE.
Inspired by the important features from LR, our baseline method, we removed the top-ranked important words from the pilot training data. We observed that after removing the core dementia-related words we still obtain similar F1-scores for CRATE using both types of embeddings (see Table 6: Models B-D compared to A). These words, however, seem to contribute more to the predictions for CRIS patients and are as informative as DLB symptoms in this case. Results for CRIS† indicate the benefit of a more balanced dataset and higher vocabulary overlap with BioWord2Vec, where we observed a smaller performance decrease when removing informative words. This implies that the remaining words could also contribute to the model predictions.
Using Model A as the base model for model calibration, where raw text serves as input to our CNN model with the BioWord2Vec word representation, we obtain well-calibrated models for all of CRIS, CRIS†, and CRATE (see Table 7).
It is worth noting that the models trained on the three datasets experience some degree of miscalibration.
(1) The confidence of the models decreases from an over-confident to a more reliable level after temperature scaling. The difference between the two confidence scores (before and after calibration) indicates the performance of calibration and the model's stability. If the confidence level drops significantly (for instance, CRATE), this means there is more uncertainty in the calibrated model estimates, but a smaller gap between accuracy and confidence.
(2) According to Guo et al. (2017), the ECE is typically between 4% and 10% on benchmark datasets. In our experiment, we expected the ECE scores to be higher, as MHRs are much more free-form and noisy. Comparing the ECE before and after calibration, we observe that temperature scaling does calibrate the models on these datasets, which is also supported by the reduction in confidence and NLL.
(3) The NLL is often used to measure how well a neural network classifies data. A high NLL means the classification is inaccurate, whereas a low NLL indicates that the prediction matches the expected value. The NLL decrease in our models on the datasets means that the calibration produces more reliable prediction outputs.

Discussion
To our knowledge, this is the first study on automatically distinguishing dementia with Lewy bodies (DLB) from Alzheimer's disease (AD) using MHRs. We investigated the performance of CNN models using different embedding representations on MHRs from two different healthcare institutions, and incorporated model calibration into DLB classification to obtain more reliable predictions.
To be able to apply NLP models to real-world biomedical tasks, we first need to embrace the challenges of the datasets. In our case, we face a range of such challenges: small data size, hence reduced reliability of predictions; class imbalance; noisy data; and contextual differences between datasets. These might be the reasons behind the higher deviation and instability of F1-scores observed in some predictions (see Table 4).
We attempted to mitigate these challenges by using a set of fairly standard techniques. We use 5-fold CV to ensure that every example appears during both training and testing. With 5-fold CV, important information is more likely to be learnt, which yields better approximations and enhances robustness, whereas with larger datasets there is more chance of having a proper distribution of information for both training and testing.
Since most MHRs are written with different formats and grammatical patterns, we considered using pre-trained biomedical word embeddings (BioWord2Vec) to get a unified word representation across different datasets. Those embeddings helped our models to rely less on explicit indicators of diagnoses (e.g. direct mentions of a diagnosis) while producing predictions, and stabilised performance over cross-validation splits. However, the use of those embeddings might be hindered by excessive noise in the data (concatenation of words and punctuation, misspellings) and hence poor vocabulary overlap. Better performance in this case can naturally be achieved if more in-domain data is available and embeddings are trained from scratch. Finally, to improve the reliability of model predictions, temperature scaling, a simple but efficient calibration method, is used to narrow the gap between accuracy and confidence. The ECE scores before and after calibration are used as the primary measures of model calibration.
The well-calibrated model decreases in confidence. This can reflect the true probability of model predictions, and can provide a good assistance and reference when evaluating the model outputs.
Our proposed model and calibration method could prove useful clinically. Currently in clinical care there is a high level of under-diagnosis as well as lack of confidence in making a DLB diagnosis. Moreover, appropriate treatment is crucial, e.g. it is important to avoid antipsychotic prescribing for this patient group. Although the F1-scores and calibration results are not always perfect, they indicate that using routine healthcare data could be valuable for predictive model development even in cases where it is hard to obtain large datasets.

Conclusion
In this paper, we propose to use a CNN approach for the task of detecting DLB patients by distinguishing them from AD patients. Our well-calibrated models are relatively robust after using temperature scaling, where calibrated probabilities are more informative of good probability estimates and true predictions. The proposed model is investigated on two MHR datasets from two different healthcare institutions, and achieves competitive results using two types of embeddings (Emb-layer and BioWord2Vec). The pre-trained biomedical word embeddings (BioWord2Vec) are efficient for all three datasets, whilst CRIS relies much more on in-domain word distributions. In particular, BioWord2Vec can achieve lower deviations in model performance in the ablation study.
Future work will focus on the effectiveness of contextualised embeddings for a more general methodology in which the detection of DLB can be realised across healthcare institutions. We would also like to investigate more effective pre-processing techniques to clean the raw texts before feeding them into advanced models, and to mitigate the noise commonly present in health records. The code is available at https://github.com/zixuwang1996/dlb_ad_classification.