Towards Early Dementia Detection: Fusing Linguistic and Non-Linguistic Clinical Data

Dementia is an increasing problem for an aging population, with a lack of available treatment options, as well as expensive patient care. Early detection is critical to eventually postpone symptoms and to prepare health care providers and families for managing a patient's needs. Identification of diagnostic markers may be possible with patients' clinical records. Text portions of clinical records are integrated into predictive models of dementia development in order to gain insights towards automated identification of patients who may benefit from providers' early assessment. Results support the potential power of linguistic records for predicting dementia status, both in the absence of, and in complement to, corresponding structured non-linguistic data.


Introduction
Dementia is a problem for the aging population, and it is the 6th leading cause of death in the US (Alzheimer's Association, 2014). Around 35 million people worldwide suffer from some form of dementia, and this number is expected to double by 2030 (Prince et al., 2013). The most common form of dementia is Alzheimer's Disease, which has no known cure and limited treatment options. The clinical care for dementia focuses on prolonged symptom management, resulting in high personal and financial costs for patients and their families, straining the healthcare system in the process. Early detection is critical for potential postponement of symptoms, and for allowing families to adjust and adequately plan for the future. Despite this importance, current detection methods are costly, invasive, or unreliable, with most patients not being diagnosed until their symptoms have already progressed. Dementia diagnosis is a life-changing event not only for the patient but for the caretakers that have to adjust to the ensuing life changes. Improved understanding and recognition of early warning signs of dementia would greatly benefit the management of the disease, and enable long-term planning and logistics for healthcare providers, health systems, and caregivers.
With the advent of electronic clinical records comes the potential for large-scale analysis of patients' clinical data to understand or discover warning signs of dementia progression. The ability to follow the evolution of the disease based on patients' records would be key to developing intelligent support systems to assist medical decision-making and the provision of care. Current research using records mainly focuses on structured data, i.e. numerical or categorical data, such as test results or patient demographics (Himes et al., 2009). However, unstructured data, such as text notes taken during interactions between patients and doctors, presents a potentially rich source of information that may be both more straightforwardly interpretable for humans and helpful for early dementia detection. Whereas structured data from innovative diagnostic tests are often absent due to their cost and limited accessibility, text notes are generated for nearly every visit of a patient. Moreover, text notes in medical records are a source of natural language, and potentially more flexibly encode the diagnostic expertise and reasoning of the clinical professionals who write them.
Processing and computationally analyzing natural language remains a formidable task, but insights gleaned from it may translate particularly well into actual clinical practice, given its interpretable and accessible nature. Therefore, the ability to predict dementia development based on both structured and unstructured data would be useful for intelligent support systems that could automatically flag individuals who would benefit from further evaluation, reducing the impact of late diagnosis.

Related Work
Structured clinical data has been useful for identifying known disease markers (Himes et al., 2009). Procedural and diagnostic codes (e.g., ICD-9) can provide high specificity for identifying a disease, but may not provide sufficient sensitivity (Birman-Deych et al., 2005; Kern et al., 2006). A patient's history, however, is typically summarized by a clinician in text form, and can provide informative expressiveness and granularity not adequately captured by ICD-9 codes (Li et al., 2008). Prior work has shown that natural language data can help synthesize details and discover trends in medical records. Natural language processing and text mining have been applied to the identification of various known medical conditions. One method maps specific conditions to relevant terms from ontologies (curated knowledge bases). For example, SNOMED-CT predicted post-operative patient complications (Murff et al., 2011), and MedLEE (Friedman et al., 1995) identified colorectal cancer cases (Xu et al., 2011), suspicious mammogram findings (Jain and Friedman, 1997), and adverse events related to central venous catheters (Penz et al., 2007). Similarly, the language analysis-based resource SymText (Haug et al., 1995) has been used for detecting bacterial pneumonia cases from descriptions of chest X-rays (Fiszman et al., 2000).
While such studies with medical knowledge bases are useful for disease identification, they mostly involve conditions with well-known markers and known relationships between words and clinical concepts, typically available once the patient is symptomatic. However, many cognitive conditions, such as dementia, as well as other illnesses of interest, are not well understood and their onsets gradually evolve over long periods of time. Furthermore, diagnosing such conditions is often primarily a function of experts' analysis, transcribed into notes. Thus, discovering lexical associations with the progression of these conditions could be tremendously beneficial, and could also help to validate and enhance the use of resources such as the Alzheimer's Disease Ontology (Malhotra et al., 2013).
Topic models have produced interesting results across domains (Chan et al., 2013; Resnik et al., 2013; McCallum et al., 2007; Paul and Dredze, 2011). Latent Semantic Indexing (LSI) has been used in medicine to discover statistical relationships between lexical items in a corpus. LSI has been used to supplement the development of a clinical vocabulary associated with post-traumatic stress disorder (Luther et al., 2011), and for forecasting ambulatory falls in elderly patients (McCart et al., 2013). However, LSI often requires around 300-500 concepts or dimensions to produce stable results (Bradford, 2008). This limitation can be overcome by using LDA, whose identified groups of related terms are also more intuitive for human interpretation than LSI results. Additionally, representing documents by their LDA topic distribution reduces the dimensionality of the feature space. Furthermore, a study with microtext data demonstrated that document length influences topic models, and that aggregating short documents by author can be beneficial (Hong and Davison, 2010). This finding is relevant for this study due to the short nature of clinical texts.
This study is concerned with the fusion of linguistic data with structured non-linguistic data, as well as the integration of distinct models suitable for each. Approaches to the former case have been studied (Ruta and Gabrys, 2000). For the latter case, integration of classifiers typically involves multiple models of the same data, e.g. ensemble methods such as random forests, and often utilizes voting algorithms to produce the final combined output. However, here we focus on the combination of two distinct models: one based on linguistic data and one on structured non-linguistic data. This setup complicates the use of typical voting methods, and thus we explore a less frequently studied solution that leverages Bayesian probability to produce posterior distributions (Bailer-Jones and Smith, 2011).

Our Contributions
(1) We compare performance of predictive modeling with linguistic vs. non-linguistic features, studying if linguistic features used alone as predictors yield performance comparable to that of non-linguistic record data, especially when the latter exclude cognitive assessment scores from expert-administered tests. Our results show the utility of linguistic data for dementia prediction, e.g., when relevant structured data are unavailable in the records, as is often the case.
(2) We explore the use of Latent Dirichlet Allocation (LDA) (Blei et al., 2003) as textually interpretable dimensionality reduction of the lexical feature space into a topic space. We examine if LDA can transform the sparse term space into a reduced topic space that meaningfully characterizes the texts, and we discuss its practical value for classification.
(3) We study the challenge of fusing linguistic and non-linguistic data from records in additional classification experiments. If fusion improves performance, this would strengthen the utility of records-based linguistic features for disease prediction. We explore two integration methods: combining feature vectors computed independently from structured and text data, or leveraging probabilistic outputs of their respective trained classifiers.
This paper is organized as follows. Section 2 describes the data for the dementia detection problem. Section 3 presents our framework and integration. Section 4 outlines experiments and results. We conclude with future directions in Section 5.

Dementia Detection Problem: Data
This study makes a secondary use of a data set from the Alzheimer's Disease Neuroimaging Initiative (ADNI) (adni.loni.usc.edu). The ADNI study contains mostly structured data, such as measurements from brain imaging scans, blood, and cerebrospinal fluid biomarkers. The dataset also contains optional text fields in which examiners include notes or descriptions at their discretion.
Each ADNI subject is labeled upon entering the study. (The ADNI study refers to its participants as subjects.) ADNI's original labeling scheme was modified in later phases of the study, resulting in some subjects having updated labels, while others remain unchanged. Therefore, only subjects who joined the study under the most recent phase, ADNI-2, are included in this work. Subjects with a label of SMC (Significant Memory Complaint; reflecting a self-reported memory issue) are excluded, as it is not a real diagnostic category outside of ADNI. A subject's record must have both unstructured text and structured data to be included, resulting in 679 usable subjects; from here on we refer to their data.
The ADNI-2 phase of the ADNI collection uses several labels to indicate the progression to Alzheimer's Disease: NL (Normal), EMCI (Early Mild Cognitive Impairment), LMCI (Late MCI), and AD (Alzheimer's Disease). The label (class) distribution of the remaining 679 subjects is relatively balanced (see Figure 1). Moderately sized data sets are common in clinical NLP contexts, where data is understandably more challenging to collate and access. For the text data, we considered text source files with considerable quantities of information. All 679 subjects possess text notes in at least one of these four files. Entries from these files are aggregated by subject and concatenated to yield one text document per subject.
There are 22 structured data fields in this ADNI subset. The problem of missing values in the structured data was handled through multiple imputation (using the Amelia II package in R). This process uses log-likelihoods to generate probable complete datasets. Most structured data comes from either cerebrospinal fluid samples or brain imaging scans, while three fields correspond to scores on cognitive exam evaluations: the Clinical Dementia Rating (CDR), the Mini Mental State Examination (MMSE), and the Alzheimer's Disease Assessment Scale (ADAS13). Importantly, a meaningful distinction can be made between structured data from cognitive assessments versus those from biophysical tests/markers. A cognitive assessment is administered by a clinical professional, and thus is a reflection of that person's opinion and expertise. Essentially, cognitive assessment scores are outputs of professional interpretation, whereas other structured data are inputs for future interpretation. Cognitive assessments are also usually administered when providers already suspect dementia, and thus can be regarded as post-symptomatic. Patients, providers, and families will benefit from early detection, and such automated detection can also help prioritize the scheduling of expert-based cognitive assessments in resource-strained healthcare environments.

Modeling of Linguistic Data
There are three main feature representations for the linguistic data: bag-of-words (BOW), term-frequency inverse-document-frequency (tf-idf) on top of BOW, and topics from LDA.
Preprocessing and text normalization were performed in Python and NLTK, involving lowercasing, punctuation removal, stop-listing, and number removal (with the exception of age mentions). Besides regular stop-listing, words or phrases revealing a subject's diagnostic state (for example MCI) were removed. Words in a document were lemmatized to merge inflections (removing distinctions between, for instance, cataracts and cataract). Abbreviation expansion used lexical lists. The 200 most frequent lexical content bigrams and trigrams were extracted and each concatenated into a single token (breast cancer → breast_cancer). Lastly, while dates were removed, age expressions were kept after conversion and binning (AGE >=70 <80), as they may be important for this problem. Ages below 40 were represented as AGE <40 and ages at or above 90 as AGE >=90.
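The age-binning step can be illustrated with a short Python sketch. The function name `age_token` and the underscore-joined token spelling are assumptions made for illustration; only the bin boundaries (below 40, decade bins, 90 and above) are taken from the description above.

```python
def age_token(age: int) -> str:
    """Map an age in years to a coarse bin token (hypothetical token format)."""
    if age < 40:
        return "AGE_<40"
    if age >= 90:
        return "AGE_>=90"
    lower = (age // 10) * 10  # start of the decade bin, e.g. 70 for age 73
    return f"AGE_>={lower}_<{lower + 10}"
```

Binning in this way keeps age information in the vocabulary while preventing every distinct age from becoming its own sparse feature.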
BOW and tf-idf were implemented using gensim (Řehůřek and Sojka, 2010). The standard BOW representation is very sparse, since any document only contains a small subset of the vocabulary. An extension weights the terms based on their distribution in the corpus using tf-idf. Thus higher weights are assigned to terms which appear more times in fewer documents, and lower weights to terms which appear fewer times and/or in more documents. The feature space of tf-idf corresponds to standard BOW, but the values are the weights.
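To make the weighting concrete, the following is a minimal pure-Python sketch of one common tf-idf variant (raw term count times log2 of the inverse document frequency). It is not the gensim implementation used in the study, and the function name `tfidf` and the toy documents are assumptions for illustration only.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weighted.append({t: tf[t] * math.log2(n_docs / df[t]) for t in tf})
    return weighted

docs = [["memory", "loss", "memory"], ["back", "pain"], ["memory", "pain"]]
weights = tfidf(docs)
```

In this toy corpus, "loss" occurs once in a single document and receives a higher weight than "memory", which occurs twice in the first document but also appears in a second one.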
LDA is a generative model for identifying latent topics of related terms in a text corpus, D, which consists of M documents and is assumed to contain K topics. Each topic k follows a multinomial distribution over the corpus vocabulary, parameterized by φ_k, which is drawn from a Dirichlet distribution, i.e., φ_k ∼ Dir(β). Similarly, each document i follows a multinomial distribution over the set of topics in the corpus, parameterized by θ_i, which is also drawn from a Dirichlet distribution, θ_i ∼ Dir(α). Working backwards, the probability of each term in a document is determined by the term distribution of its topic, which is in turn determined by the topic distribution of the document (Blei et al., 2003).
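The generative story can be sketched in a few lines of Python. This is only an illustration of the sampling process under symmetric Dirichlet priors, not the inference procedure used in the study; the helper names `dirichlet` and `generate_corpus`, the fixed document length, and the seed are all assumptions.

```python
import random

def dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet via normalized Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(vocab, n_topics, n_docs, doc_len, alpha, beta, seed=0):
    rng = random.Random(seed)
    # phi_k ~ Dir(beta): one term distribution per topic.
    phi = [dirichlet(beta, len(vocab), rng) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # theta_i ~ Dir(alpha): the topic distribution of this document.
        theta = dirichlet(alpha, n_topics, rng)
        doc = []
        for _ in range(doc_len):
            k = rng.choices(range(n_topics), weights=theta)[0]  # choose a topic
            doc.append(rng.choices(vocab, weights=phi[k])[0])   # choose a term
        corpus.append(doc)
    return corpus
```

Inference runs this story in reverse: given only the observed terms, it estimates the topic-term and document-topic distributions that most plausibly generated them.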
Under LDA, a document is modeled as a probabilistic distribution over topics, learned from the occurrence of terms through Collapsed Variational Bayesian (CVB) inference methods using the Stanford Topic Modeling Toolbox (Teh et al., 2007). Since topics are determined based on statistical relationships of terms, the effectiveness of the model can be hampered by extremely frequent or infrequent terms. For this reason, we filter the vocabulary (Boyd-Graber et al., 2014, p. 9), removing terms that appear fewer than 3 times as well as the 30 most common terms.
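A possible reading of this vocabulary filter is sketched below in Python. Whether frequency is counted over the whole corpus or per document is not stated, so the sketch assumes corpus-wide counts; the function name `filter_vocabulary` and its parameter names are likewise illustrative.

```python
from collections import Counter

def filter_vocabulary(docs, min_count=3, n_most_common=30):
    """Drop rare terms and the most frequent terms from tokenized documents."""
    # Assumption: frequencies are corpus-wide token counts.
    counts = Counter(term for doc in docs for term in doc)
    too_common = {t for t, _ in counts.most_common(n_most_common)}
    keep = {t for t, c in counts.items() if c >= min_count and t not in too_common}
    return [[t for t in doc if t in keep] for doc in docs]
```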

Integration with Structured Data Models
Integration is performed on the results of each unstructured modeling experiment (BOW, tf-idf, and LDA) and those of each structured one (with vs. without cognitive assessment features). For LDA, only the parameters with the highest performance are used in integration. The most intuitive form of integration is concatenation of the feature vectors for structured and unstructured data. Here, concatenation refers to joining two vectors of length n and m into a single new vector of length n + m. This concatenated feature vector is used in classification.
The second integration approach leverages posterior probabilities from the individual (linguistic vs. non-linguistic) classification models. For each input, a classifier produces a posterior probability for each class label and selects the most probable as its output. One classifier is trained on structured data features X_s, and another on unstructured data features X_u, resulting in two posterior distributions, p(C_k | X_s) and p(C_k | X_u). The combined probability of a class C_k is then denoted as p(C_k | X_s, X_u). If X_s and X_u are assumed to be conditionally independent given the class label, then by Bayes' theorem:
$$p(C_k \mid X_s, X_u) \propto \frac{p(C_k \mid X_s)\, p(C_k \mid X_u)}{p(C_k)}$$
From here, the class label with the highest probability is selected as the output; for details see Bailer-Jones and Smith (2011).
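As a minimal sketch of this composition rule: under the conditional-independence assumption, the combined posterior for each class is proportional to the product of the two individual posteriors divided by the class prior, and is then renormalized over classes. The function name `combine_posteriors` and the toy probabilities below are assumptions for illustration, not values from the study.

```python
def combine_posteriors(post_s, post_u, prior):
    """Combine two per-class posteriors, assuming the two feature sets are
    conditionally independent given the class."""
    unnormalized = [ps * pu / pk for ps, pu, pk in zip(post_s, post_u, prior)]
    total = sum(unnormalized)
    return [v / total for v in unnormalized]

post_s = [0.6, 0.4]   # posterior from the structured-data classifier
post_u = [0.7, 0.3]   # posterior from the text classifier
prior = [0.5, 0.5]    # class priors, e.g. training-set class frequencies
combined = combine_posteriors(post_s, post_u, prior)
predicted = combined.index(max(combined))
```

When both classifiers lean the same way, as in this toy example, the combined posterior is more peaked than either input; when they disagree, the more confident classifier dominates.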
For integration purposes, we use logistic regression for all classification experiments, implemented in scikit-learn (Pedregosa et al., 2011), to compute the posterior probabilities of all classes. We adopt a regularized logistic regression model to further improve the predictive accuracy. By incorporating a regularization term into the basic logistic regression model, regularized logistic regression is able to reach a good bias-variance trade-off and hence achieve a better generalization capability. The regularization is governed by two parameters: C, the inverse of regularization strength, and the penalty function (either the L1 or L2 vector norm). A smaller C corresponds to harsher penalties for large coefficients. The values of these parameters are selected through a grid search of possible values, evaluated by accuracy in cross-validation. The process is repeated for each labeling scheme.

Experimental Study
Each subject is annotated with a dementia status class label. Each subject's linguistic and structured non-linguistic data are used, separately or integrated, as instances for classification. Two different classification problems are reported on. One involves all four classes (NL, EMCI, LMCI, AD). This 4-class problem is henceforth referred to as Standard. As discussed, early detection of dementia is critical. Accordingly, EMCI subjects are of particular interest, as they represent the beginning of the disease's progression. In the second experiment, we use the 367 subjects labeled either NL or EMCI (187 NL, 180 EMCI). While this does not perfectly match the reality of diagnosis, as it excludes the later dementia stages, it could be argued that those later stages are in less need of automatic analysis since they are more readily observable. The resulting binary problem is referred to here as Early Risk.
The results and discussions presented later in this paper include a comparison to a majority class baseline. However, this is included merely as a standard point of reference; the comparison of actual interest is between the integration of non-linguistic (with vs. without cognitive assessment scores) and linguistic features and each of those feature groups in isolation.

Held-out Data
The data set is randomly split into 80% (n = 544 subjects) for model development (dev set), and 20% (n = 135 subjects) for final evaluation (held-out set). Models are only exposed to the held-out set after satisfactory performance is achieved using the dev set. Class distributions are preserved in the dev and held-out sets.
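One way to obtain a split that preserves class distributions is to sample within each class separately. The Python sketch below illustrates that idea; the function name `stratified_split`, the rounding rule, and the seed are assumptions and do not describe the exact procedure used in the study.

```python
import random
from collections import defaultdict

def stratified_split(labels, held_out_fraction=0.2, seed=0):
    """Return (dev_indices, held_out_indices), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    dev, held_out = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_held = round(len(indices) * held_out_fraction)  # per-class quota
        held_out.extend(indices[:n_held])
        dev.extend(indices[n_held:])
    return sorted(dev), sorted(held_out)
```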
LOO Cross-Validation
Although the dev and held-out sets have similar class distributions, overfitting is still a potential issue. For this reason, after the held-out evaluation is complete, a leave-one-out cross-validation (LOO or LOOCV) procedure is run on the entire merged dataset to serve as an additional evaluation, to either confirm or call into question the trends from held-out testing, which may be evident through differences in performance of the same features and models. LOOCV is a case of k-fold cross-validation where k is equal to the number of training instances, resulting in one fold for every data point in which all other data points are used for training.
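The leave-one-out procedure can be sketched as follows. The 1-nearest-neighbor classifier is only a stand-in so that the example is self-contained (the study itself uses regularized logistic regression), and the names `loocv_accuracy` and `nearest_neighbor` as well as the toy data are assumptions.

```python
def loocv_accuracy(X, y, fit_predict):
    """Leave-one-out CV: each instance serves as the test set exactly once."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]  # all instances except the i-th
        train_y = y[:i] + y[i + 1:]
        correct += fit_predict(train_X, train_y, X[i]) == y[i]
    return correct / len(X)

def nearest_neighbor(train_X, train_y, x):
    """Stand-in classifier: label of the closest training point."""
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_X]
    return train_y[dists.index(min(dists))]

X = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
y = ["NL", "NL", "NL", "AD", "AD", "AD"]
accuracy = loocv_accuracy(X, y, nearest_neighbor)
```

Because every fold trains on all but one instance, each model sees more training data than in the 80/20 split, at the cost of fitting as many models as there are subjects.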

Topic Exploration and Evaluation
Tuning of the topic number parameter is essential to finding an appropriate LDA model. This process is performed by iteratively measuring classification accuracy at values of K ranging from 5 to 100, in multiples of 5, using the training data from the held-out evaluation split. LDA is being used here with two goals in mind: to improve classification performance as a form of dimensionality reduction, as well as to provide human-interpretable topics. The former is more convenient and appropriate in the context of this work, but does not necessarily imply good results for the latter. A clinical expert viewing the output of such a model would likely prefer fewer topics, each with higher interpretability. Accordingly, LDA models in classification are examined with various per-topic metrics known to correlate well with human evaluation. Thus, the best-performing reduced topic-feature space is selected for classification results and then additionally analyzed using the topic coherence metric (Mimno et al., 2011), which measures how often the most probable words of a topic appear together in documents, and has been shown to match well with human evaluation of topic quality (Boyd-Graber et al., 2014).
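The topic coherence metric of Mimno et al. (2011) sums, over ordered pairs of a topic's top terms, the log of the smoothed co-document frequency divided by the document frequency of the higher-ranked term. A minimal Python sketch is given below; the function name `topic_coherence` is an assumption, and the sketch assumes every top term occurs in at least one document.

```python
import math

def topic_coherence(top_terms, docs):
    """top_terms: a topic's most probable terms, most probable first.
    docs: list of token lists. Higher (closer to zero) is more coherent."""
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*terms):
        # Number of documents containing all of the given terms.
        return sum(all(t in d for t in terms) for d in doc_sets)

    score = 0.0
    for m in range(1, len(top_terms)):
        for l in range(m):
            co_freq = doc_freq(top_terms[m], top_terms[l])
            score += math.log((co_freq + 1) / doc_freq(top_terms[l]))
    return score
```

Term pairs that rarely share a document pull the score down, which is why topics mixing unrelated terms tend to rank low on this metric.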

Classification of Standard Labels
The upper part of Table 1 shows the results of structured vs. linguistic features in isolation for the Standard problem, while the rest of the table shows results of integration techniques. Overall, performance improved in LOOCV, with a few exceptions (e.g. tf-idf), which is likely due to the greater number of available training instances in this evaluation.
The performance of structured data alone is substantially higher than the majority class baseline, and more so when cognitive assessment features were included (+cognitive), as expected. Importantly, the BOW representation for text data achieved similar performance compared to the structured data without cognitive assessment scores, showing that simple text modeling can be useful in the common event that structured data are missing.
The benefit of tf-idf appears inconsistent between held-out and LOOCV evaluations, possibly attributable to differences in document frequency of important terms in the different training data (dev vs. dev+held-out, respectively).
For LDA, performance was dependent on the number of topics, as seen in Figure 2, with two performance peaks (at K = 60 and K = 85) surpassing BOW. This supports the claim that dimensionality reduction by LDA can improve performance, but data size may influence results. This is a limitation of using an unsupervised algorithm for a supervised task. Performance differences between held-out and LOOCV indicate overfitting to the dev set in particular.
... referencing people in their 60's. Topic 45 pertains to regular medical visits (PCP is primary care physician), with some common concerns of elderly patients (back, heart). Topic 25 captures heart disease (cardiac, stent, chest pain) and related visits (hospitalization, admitted, discharged).
Linguistic and non-linguistic models are integrated to improve classification performance. Table 1 shows results for 16 integrated models (2 non-linguistic models × 4 linguistic models × 2 integration methods). Similar trends were observed for BOW and tf-idf in most cases. Interestingly, integrating with BOW is better than including cognitive assessment scores for held-out. The LDA-reduced features are again less consistent than other text features, but still comparatively improved performance in many cases. LDA integration experiments appear more robust between held-out and LOOCV than when LDA features were used alone, likely due to structured features taking the brunt of the decision.
It was predicted that the posterior probability composition method would yield better results than vector concatenation. Interestingly, this is not apparent, with many cases revealing the opposite. Yet overall, the best performing cases include results where integration is done by this method. One potential limitation of the posterior probability composition is that a stronger decision is made when each of the underlying classifiers produces an asymmetric posterior class distribution. A limitation of this method is its dependence on strong or accurate decisions from the underlying models. Vector concatenation is not subject to this limitation, but has the drawback of potentially overwhelming a smaller feature set with a larger sparse one. As for class-specific differences, the NL (normal) and AD (Alzheimer's disease) subjects were classified with higher precision and recall scores than were the MCI classes in nearly all integration experiments, pointing to the challenge of subtler disease stages.

Table 1: Held-out Evaluation | Leave-one-out Cross-validation
Features | Acc. | P / R | P / R | P / R | P / R | Acc. | P / R | P / R | P / R | P / R
Baseline (majority class) | 32.6% | 33 / 100 | − / 0 | − / 0 | − / 0 | 27.5% | 28 / 100 | − / 0 | − / 0 | − / 0

Classification of Early Risk
In addition to the experiments above, the more specific problem of distinguishing normal (NL) subjects from those with early mild cognitive impairment (EMCI) was also explored. Only LOOCV is performed because the subsampling of NL and EMCI subjects slightly distorts the class distributions in the original held-out set. Results are given in Table 3. As in the Standard problem, all non-linguistic and linguistic feature types perform well above the majority class baseline. One major difference here is that all linguistic data types outperform the structured features when cognitive assessments are excluded. This may suggest a potential linguistic difference in clinical notes at the onset of MCI.
The number of LDA topics is selected as before (but using the whole Early Risk subsample, as opposed to the Standard dev set). Two peaks found at K = 65 and K = 100 achieve the same classification accuracy, but do not outperform BOW and tf-idf. The difficulties LDA faced in the Standard problem are also faced here, and thus similar performance shortcomings are observed. The ability to approximately match tf-idf performance is still noteworthy since the LDA features are a smaller and denser representation than tf-idf, which may be more easily interpretable by clinical professionals. Table 2b shows 5 of the top 10 topics from the 100-topic model trained on the Early Risk subset, based on the topic coherence metric. A consequence of a smaller sample of subjects is a smaller vocabulary and thus weaker statistical judgments. Topics 38, 25, 36, and 56 appear to be about routine visits/tests, cognitive evaluations, smoking habits, and cardiac issues, respectively. Topic 55 is an example of a chained topic (Boyd-Graber et al., 2014, p. 17), in which terms are linked through shared co-occurring words, in this case with left and right seeming to link eye and hand, along with their associated terms cataract and arthritis.
The performance trends for the integrated models are slightly more consistent for the Early Risk problem than they were for the Standard problem. When excluding cognitive assessment scores, all integration experiments result in a modest improvement, although there is little to no difference between the two integration methods employed. This may suggest that results can be achieved without extra sophistication provided by posterior probability composition, or that further sophistication is needed beyond either of these techniques. In general, our results further justify the integration of linguistic and non-linguistic features and/or models.

Conclusion and Future Work
We explored classification of dementia progression status of subjects from a study on Alzheimer's disease, and the integration of text data models with those of structured data, with vs. without cognitive assessment scores. Experiments support texts' viability as a useful source for dementia classification, as an important complement to structured data, or alone when structured data are missing. LDA was also studied as interpretable dimensionality reduction. With a larger sample size, the LDA model may converge to a more stable set of topics, but other appropriate public datasets (with both linguistic and non-linguistic data) are presently not available. An alternative is to apply supervised versions of LDA (Blei and McAuliffe, 2007; Ramage et al., 2009). Furthermore, with access to a pool of clinical specialists, it would be useful to integrate experts in evaluating the latent topics. Chang et al. (2009) proposed various such human evaluation techniques, such as the word intrusion task, in which human evaluators are presented with a list of n high probability terms of a randomly chosen topic, and one additional low probability term from that topic, and asked to identify the latter. A drawback is that it would require access to a large enough pool of dementia specialists.
Other avenues of future work would include the incorporation of lexical similarity measures from sources like WordNet.