Using linguistic features longitudinally to predict clinical scores for Alzheimer’s disease and related dementias

We use a set of 477 lexicosyntactic, acoustic, and semantic features extracted from 393 speech samples in DementiaBank to predict clinical MMSE scores, an indicator of the severity of cognitive decline associated with dementia. We use a bivariate dynamic Bayes net to represent the longitudinal progression of observed linguistic features and MMSE scores over time, and obtain a mean absolute error (MAE) of 3.83 in predicting MMSE, comparable to within-subject interrater standard deviation of 3.9 to 4.8 [1]. When focusing on individuals with more longitudinal samples, we improve MAE to 2.91, which suggests the importance of longitudinal data collection.


Introduction
Research into the early assessment, pathogenesis, and progression of dementia is becoming increasingly important, as the proportion of people it affects grows every year. Alzheimer's disease (AD), the most common type of dementia, affects more than half of the population above 80 years of age and its impact on society is expected to grow as the "baby boomer" generation ages [2,3,4].
There is no single laboratory test that can identify dementia with absolute certainty. Typically, probable dementia is diagnosed using the Mini Mental State Examination (MMSE), which provides a score on a scale of 0 (greatest cognitive decline) to 30 (no cognitive decline), based on a series of questions in five areas: orientation, registration, attention, memory, and language [5]. While MMSE provides a unified scale for measuring the severity of the disease, it can be time-consuming and relatively costly, often requiring a trained neuropsychologist or physician to administer the test in a clinical setting.
Changes in cognitive ability due to neurodegeneration associated with AD lead to a progressive decline in memory and language quality. Patients experience deterioration in sensory, working, declarative, and non-declarative memory, which leads to a decrease in the grammatical complexity and lexical content of their speech [6]. Such changes differ from the pattern of decline expected in older adults [6], which suggests that temporal changes in linguistic features can aid in disambiguation of healthy older adults from those with dementia.
Some previous work used machine learning classifiers with linguistic features for two-class separation of patients with AD from controls (see section 1.1), but there appears to be no previous research that has used them to infer a clinical score for dementia, i.e., an indicator of the degree of cognitive decline. The present work uses a set of automatically extracted lexicosyntactic, acoustic, and semantic (LSAS) features for estimating continuous MMSE scores on a scale of 0 to 30, using a dynamic Bayes network for representing relationships between observed linguistic measures and underlying clinical scores.
Since dynamic changes in linguistic ability in patients with AD differ from those in typical healthy older adults [6], we hypothesize that considering speech samples over time would aid in estimating underlying cognitive status. Previous studies analyzing dynamic progression of language features in patients with AD did not employ machine learning techniques, and are characterized by a small number of subjects (between 3 and 6) and a limited set of features that do not include acoustics. The present work improves on these analyses by extracting LSAS features from a relatively large collection of longitudinal speech, in order to estimate MMSE scores.

Related Work
Previous work has explored the use of lexicosyntactic features for distinguishing individuals with AD from controls. Orimaye et al. [7] used DementiaBank, one of the largest existing datasets of pathological speech [8], to perform binary classification of 242 patients with dementia and 242 controls; a support vector machine classifier achieved their best F-measure of 0.74 [7]. Another experiment by Jarrold et al. collected spontaneous speech data from 9 controls, 9 patients with AD, and 30 patients with frontotemporal lobar degeneration (FTLD) [9]. A multi-layer perceptron model obtained classification accuracy of 88% on a two-class task (AD:controls, and FTLD:controls), and 80% on a three-class task (AD:FTLD:controls).
While these studies have obtained promising results in classifying patients with dementia based on linguistic features, there is limited work modelling the progression of such features over time. Le et al. [10] examined the longitudinal changes in a small set of hand-selected lexicosyntactic measures, such as vocabulary size, repetition, word class deficit, and syntactic complexity, in 57 novels of three British authors written over a period of several decades. They found statistically significant lexical deterioration in Agatha Christie's work evidenced by vocabulary impoverishment and a pronounced increase in word repetitions [10], but the measures for syntactic complexity did not yield conclusive results. A similar analysis performed by Sundermann examined the progression of a small set of lexicosyntactic features, such as length, frequency, and vocabulary measures in 6 patients with AD or mild cognitive impairment (MCI), with a minimum of 3 longitudinal samples in DementiaBank [11]. Analysis of the features over time did not reveal conclusive patterns; Sundermann suggested that the limited sample size and feature set selection may be the cause. Neither study involved acoustics or machine learning techniques.

Data
We use data from DementiaBank, a large dataset of speech produced by people with dementia (including probable AD, possible AD, vascular dementia, and MCI) and healthy older adults, recorded longitudinally at the University of Pittsburgh's Alzheimer's Disease Research Center [8]. Annual visits with each subject consist of a recording of speech data, its textual transcription, and an MMSE score. Subjects have a variable number of longitudinal samples (min = 1, max = 5, M = 1.54, SD = 0.79). Each speech sample consists of a verbal description of the Boston Cookie Theft picture, which typically lasts about a minute. We partition subjects between controls (CT) and those with probable AD, possible AD or MCI (or, collectively, "AD"; ongoing work distinguishes between AD and MCI). Considering only subjects with associated MMSE scores, the working set consists of 393 speech samples from 255 subjects (165 AD, 90 CT).

Features
Three major types of features are extracted from the speech samples and their transcriptions: (1) lexicosyntactic measures, extracted from syntactic parse trees constructed with the Brown parser and POS-tagged transcriptions of the narratives [12,13,14,15,16]; (2) acoustic measures, including the standard Mel-frequency cepstral coefficients (MFCCs), formant features, and measures of disruptions in vocal fold vibration regularity [17]; and (3) semantic measures, pertaining to the ability to describe concepts and objects in the Cookie Theft picture. The full list of features, along with their major type and subtype, is shown in Table 1.

Feature Analysis
Two feature selection methods are used to identify the most informative features for disambiguating AD from CT. Since the MMSE score is a measure of the progression of cognitive impairment and is used to distinguish AD from CT generally, we hypothesize that highly discriminating features of the two groups would also be good predictors of MMSE. This is quantified by Spearman's rank-order correlation between the most informative features and the MMSE score, ρ_MMSE, shown in Table 2.
The first feature ranking method is a two-sample t-test (α = 0.001, two-tailed) which quantifies the significance of the difference in each feature value between the two classes; the features are ordered by increasing p-value. Table 2 shows the type and p-value of the top 10 features, along with their correlation with MMSE. Control subjects use longer utterances, more gerund + prepositional phrase constructions (VP → VBG PP, e.g., standing on the chair), more content words such as noun phrases (NP) and verbs, and are more likely to talk about what they see through the window (info_window), which is in the background of the scene (e.g., it seems to be summer out). On the other hand, subjects with AD use more words not found in the dictionary (NID), and more function words such as pronouns (PRP). Honoré's statistic measures lexical richness, extending type-token ratio, which is decreased in AD. These findings are consistent with expectations. Since the majority of the extracted acoustic features consist of MFCCs and measures related to aperiodicity of vocal fold vibration, the lack of significance of the acoustic features as discriminators between the two classes may be attributed to the fact that AD is not strongly associated with motor impairment of the articulators involved in speech production.
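The t-test ranking described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: it assumes the pooled-variance (Student) form of the two-sample test and complete data for every feature, in which case all features share the same degrees of freedom and ordering by increasing two-tailed p-value is the same as ordering by decreasing |t|. The function names and the toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Student's two-sample t statistic with pooled variance (an assumed variant)."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))


def rank_features_by_t(features_ad, features_ct):
    """Order feature names from most to least discriminating.

    Each argument maps a feature name to the list of values for one group.
    With equal degrees of freedom for every feature, decreasing |t| is the
    same ordering as increasing two-tailed p-value.
    """
    scores = {
        name: abs(pooled_t_statistic(features_ad[name], features_ct[name]))
        for name in features_ad
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy values invented for illustration only (not DementiaBank data).
toy_ad = {
    "mean_utterance_length": [5.0, 6.0, 5.5, 6.5],
    "pronoun_ratio": [0.30, 0.28, 0.33, 0.31],
    "noise": [1.0, 2.0, 3.0, 4.0],
}
toy_ct = {
    "mean_utterance_length": [9.0, 10.0, 9.5, 10.5],
    "pronoun_ratio": [0.25, 0.27, 0.24, 0.26],
    "noise": [1.5, 2.5, 2.0, 3.5],
}
ranking = rank_features_by_t(toy_ad, toy_ct)
```

The sign of the statistic indicates the direction of the difference: in the toy data it is negative for utterance length (shorter in the AD group) and positive for the pronoun ratio.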
The second feature selection method is minimum-redundancy-maximum-relevance (mRMR), which minimizes the average mutual information between features and maximizes the mutual information between each feature and the class [18]; the features were ranked from most relevant to least. The results of this technique generally corroborate the selection made by the t-test, with no acoustic features among the top 10 selected. Here, mRMR selects a greater proportion of semantic features (e.g., mentions of the window and sink, and the number of occurrences of curtain and stool), placing more weight on the content of what the speaker is saying as a way of discriminating the two classes.
All of the features displayed in Table 2 have a moderate, statistically significant correlation with MMSE (p < 0.001).
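A greedy form of mRMR can be sketched as follows. The sketch makes two assumptions that are not stated in the text: features are already discretized (continuous LSAS features would first need binning), and candidates are scored by relevance minus mean redundancy, which is one common variant of the criterion. All names and the toy data are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    count_x, count_y, count_xy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


def mrmr_select(features, labels, k):
    """Greedily pick k features maximizing relevance minus mean redundancy."""
    relevance = {name: mutual_information(vals, labels) for name, vals in features.items()}
    selected, candidates = [], set(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while candidates and len(selected) < k:
        # Sorting the candidates makes tie-breaking deterministic.
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy binary data invented for illustration: 1 = AD, 0 = CT.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_features = {
    "f_strong": [1, 1, 1, 1, 0, 0, 0, 1],         # most informative about the class
    "f_redundant": [1, 1, 1, 0, 0, 0, 0, 1],      # informative, but overlaps with f_strong
    "f_complementary": [0, 1, 1, 1, 1, 0, 0, 0],  # informative, little overlap with f_strong
    "f_noise": [0, 1, 0, 1, 0, 1, 0, 1],          # unrelated to the class
}
selected = mrmr_select(toy_features, labels, k=2)
```

In this toy example f_redundant and f_complementary are equally relevant to the class, but the redundancy penalty leads the second pick to favour the feature that shares less information with the first.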
Since we are interested in the task of predicting clinical MMSE scores, the experiments described in Sec. 3 use correlation itself as a third feature selection method. The features are ranked by their correlation with MMSE, and the ones with the highest correlations are selected.
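Correlation-based selection can be sketched as below. The sketch assumes that Spearman's rank-order correlation is the correlation in question and that features are ranked by its absolute value, so that strongly negatively correlated features are also retained; neither detail is specified in the text. Function names and toy values are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


def top_k_by_correlation(features, mmse, k):
    """Names of the k features with the largest |rho| against MMSE (assumed criterion)."""
    scores = {name: abs(spearman(values, mmse)) for name, values in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy values invented for illustration only.
toy_mmse = [29, 27, 24, 20, 15, 10]
toy_features = {
    "utterance_length": [12, 11, 9, 8, 6, 5],
    "pronoun_ratio": [0.10, 0.12, 0.11, 0.20, 0.25, 0.30],
    "noise": [3, 1, 4, 1, 5, 2],
}
top_two = top_k_by_correlation(toy_features, toy_mmse, k=2)
```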

Predicting MMSE score using LSAS features
To model the longitudinal progression of MMSE scores and LSAS features, we constructed a dynamic Bayes network (DBN) with continuous nodes, i.e., a Kalman filter with 2 variables, shown in Figure 1. Each time slice (Q_t, Y_t) represents one annual visit for a subject. Each conditioning node Q_t represents the underlying continuous MMSE score for that visit (Q_t ∈ R^(1×1)), while each node Y_t represents the vector of observed continuous LSAS features (Y_t ∈ R^(477×1)). A Kolmogorov-Smirnov test for normality was performed on the MMSE scores of all AD subjects, with the null hypothesis that they come from a normal distribution. The test did not reject this null hypothesis at the 5% significance level, which is consistent with the data coming from a normal distribution.

Table 1: Features by major type and subtype (number of features in parentheses).
  Lexicosyntactic
    Production rule (121): Number of times a production rule is used, divided by the total number of productions.
    Phrase type (9): Phrase type proportion, rate and mean length.
    Syntactic complexity (4): Depth of the syntactic parse tree.
    Subordination/coordination (3): Proportion of subordinate and coordinate phrases to the total number of phrases, and ratio of subordinate to coordinate phrases.
    Word type (25): Word type proportion; type-to-token ratio, Honoré's statistic.
    Word quality (10): Imageability; age of acquisition (AoA); familiarity; transitivity.
    Length measures (5): Average length of utterance, T-unit and clause, and total words per transcript.
    Perseveration (5): Cosine distance between pairs of utterances within a transcript.
  Sem. (85)
    Mention of a concept ( [...]

The feature set described in Sec. 2.2 is preprocessed to (i) remove features with zero variance across all samples, and (ii) normalize feature values to zero mean and unit variance, as is standard practice. Since the number of features (477) is large compared to the number of samples (393), the three feature selection methods described in Sec. 2.3 (i.e., a two-sample two-tailed t-test, mRMR, and correlation with MMSE score) are used to avoid overfitting, by varying the number of features selected by each method in order to determine the optimal feature set size.
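The two preprocessing steps can be sketched as follows. The sketch assumes the population standard deviation is used for scaling, which the text does not specify, and the function name and toy matrix are hypothetical. In a cross-validation setting the means and standard deviations would normally be estimated on the training folds only.

```python
from statistics import mean, pstdev


def standardize(samples):
    """Drop zero-variance columns, then z-score the remaining columns.

    `samples` is a list of equal-length feature vectors, one per speech sample.
    Returns (indices of the kept columns, standardized rows).
    """
    columns = list(zip(*samples))
    kept = [j for j, col in enumerate(columns) if pstdev(col) > 0.0]
    stats = {j: (mean(columns[j]), pstdev(columns[j])) for j in kept}
    rows = [[(row[j] - stats[j][0]) / stats[j][1] for j in kept] for row in samples]
    return kept, rows


# Toy matrix invented for illustration: the middle column is constant.
toy_samples = [
    [1.0, 7.0, 10.0],
    [2.0, 7.0, 20.0],
    [3.0, 7.0, 30.0],
]
kept_columns, standardized = standardize(toy_samples)
```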
The parameters of the three probability distributions in our model are trained using maximum likelihood estimation (MLE) since all training data are fully observed. During testing, the observed features for each test case are provided and junction tree inference on the trained model computes the marginal distribution of the now hidden (MMSE) nodes. Performance is measured as the mean absolute error (MAE) between actual and predicted MMSE scores. Since not all subjects have the same number of longitudinal samples, MAE is evaluated at the first and last hidden node, and averaged. Experiments are performed with leave-one-out cross-validation, where data from each subject, in turn, are used for testing and all other data for training, over all 255 subjects.
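The training and inference procedure can be illustrated with a small linear-Gaussian sketch. It is not the implementation used in the experiments and rests on several labeled assumptions: the three distributions are taken to be a Gaussian prior on the first MMSE score, a linear-Gaussian transition between visits, and a linear-Gaussian observation model with independent noise per feature; inference is forward filtering only, so the estimate at the last visit uses all observations while earlier estimates use only past and current visits (junction tree inference would also use later visits). All names and the toy data are hypothetical.

```python
from statistics import mean, pvariance

VAR_FLOOR = 1e-6  # assumed floor that keeps every fitted variance positive


def fit_line(xs, ys):
    """Maximum-likelihood fit of y = slope * x + intercept + Gaussian noise."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    noise_var = mean([(y - slope * x - intercept) ** 2 for x, y in zip(xs, ys)])
    return slope, intercept, max(noise_var, VAR_FLOOR)


def fit_model(subjects):
    """Fit prior, transition and observation parameters from fully observed data.

    `subjects` is a list of visit sequences; each visit is (mmse, feature_list).
    """
    first = [visits[0][0] for visits in subjects]
    prev = [v[t - 1][0] for v in subjects for t in range(1, len(v))]
    curr = [v[t][0] for v in subjects for t in range(1, len(v))]
    all_q = [q for visits in subjects for q, _ in visits]
    n_features = len(subjects[0][0][1])
    observation = [
        fit_line(all_q, [y[j] for visits in subjects for _, y in visits])
        for j in range(n_features)
    ]
    return {
        "prior": (mean(first), max(pvariance(first), VAR_FLOOR)),
        "transition": fit_line(prev, curr),
        "observation": observation,
    }


def filter_mmse(model, feature_sequence):
    """Filtered mean of the hidden MMSE score at each visit."""
    mu, var = model["prior"]
    a, b, q = model["transition"]
    estimates = []
    for t, y in enumerate(feature_sequence):
        if t > 0:  # predict the next visit from the previous posterior
            mu, var = a * mu + b, a * a * var + q
        # Independent observation noise allows one scalar update per feature.
        for (c, d, r), y_j in zip(model["observation"], y):
            gain = var * c / (c * c * var + r)
            mu = mu + gain * (y_j - (c * mu + d))
            var = (1.0 - gain * c) * var
        estimates.append(mu)
    return estimates


def toy_visit(mmse, wobble):
    """Hypothetical visit: two features that depend linearly on MMSE."""
    return (mmse, [0.5 * mmse + 1.0 + wobble, -0.2 * mmse + 8.0 - wobble])


train = [
    [toy_visit(28, 0.2), toy_visit(26, -0.1), toy_visit(23, 0.1)],
    [toy_visit(22, -0.2), toy_visit(19, 0.1)],
    [toy_visit(29, -0.1), toy_visit(28, 0.2)],
    [toy_visit(18, 0.1), toy_visit(14, -0.2), toy_visit(11, -0.1)],
]
model = fit_model(train)
# Held-out subject whose true (hidden) scores are 25 and then 21.
test_features = [toy_visit(25, 0.0)[1], toy_visit(21, 0.0)[1]]
estimates = filter_mmse(model, test_features)
```

Because every training case is fully observed, each distribution can be fitted independently by least squares, which is why no iterative estimation such as EM is needed in this setting.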
The results, with varying feature set sizes and feature selection methods, are shown in Table 3. The lowest MAE of 3.83 (σ = 0.49) is achieved when correlation is used to select the top 40 features. A two-factor repeated measures ANOVA performed on the mean MAE shows that both main effects are statistically significant, i.e., feature set size (F(7, 24) = 8.67, p < 0.001) and the feature selection method (F(2, 24) = 4.07, p < 0.05). The interaction effect is not significant (F(14, 24) = 0.16, ns), as expected given that the factors are independent.
To illustrate the longitudinal changes in cognitive and linguistic ability, Fig. 2 shows the pattern of decline of MMSE and the top 5 most correlated features for the subset of subjects with AD. This demonstrates the MMSE score declining nonmonotonically over four annual visits (the maximum number of visits for AD subjects in DementiaBank), along with similar patterns across the indicated LSAS features.

Effect of longitudinal data on predicted MMSE score
To test the hypothesis that using longitudinal speech data aids in identifying underlying cognitive status (i.e., improving MMSE estimation), the Kalman filter experiment described in Sec. 3.1 is repeated for subsets of the dataset consisting of different amounts of longitudinal data. The number of subjects with at least four visits is too low to conduct statistical experiments. The number of features used in the model is fixed to the optimal feature set size found in Sec. 3.1, and the feature selection method is varied (t-test, mRMR, correlation). Leave-one-out cross-validation is performed on each of the four datasets. The results are presented in Table 4. The lowest MAE for each feature selection method occurs on the dataset consisting of the highest number of longitudinal visits (T ≥ 3). A two-factor repeated measures ANOVA performed on the mean MAE shows that the main effect of the data subset is statistically significant (F(3, 9) = 5.43, p < 0.05) while neither the second main effect (F(2, 9) = 0.94, ns) nor the interaction effect (F(6, 9) = 0.54, ns) is significant.

Discussion
Automatically extracted linguistic features can be used to effectively estimate underlying cognitive status, in terms of the most predominant clinical measure of dementia. The best result obtained with leave-one-out cross-validation on the entire dataset of 393 samples is an MAE of 3.83 (σ = 0.49), using correlation to select the top 40 features. This corresponds to a mean absolute relative error (MARE) of 21.0% (obtained as the absolute difference between predicted and actual MMSE score, divided by the actual MMSE score, and averaged over all runs). Molloy and Standish [19] reported that different rating styles among clinicians administering the MMSE and variance in test-retest scoring can lead to a within-subject interrater standard deviation of 3.9 to 4.8 and within-subject intrarater standard deviation of 4.8, with higher variation in low-scoring subgroups of subjects [1,19]. The MAE obtained through statistical speech analysis in our present work is comparable to such variability. Further, the results obtained with the Kalman filter model significantly outperform an initial baseline multilinear regressor run with leave-one-out cross-validation on the same dataset (t = 2.31, p < 0.05). This is being explored further.
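The two error measures can be written out as follows. This is a small sketch with hypothetical function names and invented scores; it computes the errors over a flat list of predictions, whereas the experiments average over cross-validation runs.

```python
def mean_absolute_error(actual, predicted):
    """Average of |predicted - actual| over all cases."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_relative_error(actual, predicted):
    """Average of |predicted - actual| / actual; undefined if an actual score is 0."""
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


# Invented MMSE scores for illustration only.
toy_actual = [30, 24, 18, 12]
toy_predicted = [27, 26, 21, 16]
mae = mean_absolute_error(toy_actual, toy_predicted)
mare = mean_absolute_relative_error(toy_actual, toy_predicted)
```

Because the relative error divides by the actual score, an error of a given size contributes more to MARE for subjects with low MMSE scores than for subjects near the top of the scale.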
The fact that correlation outperforms the other two feature selection methods is expected, as it computes the relationship between the features and the MMSE score directly whereas the others use the presumed diagnosis to dichotomize the data into classes. The majority of features selected on each iteration of cross-validation are typically lexicosyntactic and semantic, with acoustic features typically not being among the most relevant. While this may suggest that anatomical irregularities in speech production are less meaningful, we note that the lexicosyntactic features depend, to a large extent, on the free expression of language through speech. Specifically, the working memory impairment associated with AD affects preferred syntactic constructions in speech, leading to shorter utterances, fewer complex noun and verb phrases, a higher number of pronouns, and lexical impoverishment indicated by Honoré's statistic.
We also show that focussing on subsets of subjects with a higher number of longitudinal samples improves the accuracy of inference in the Kalman filter model, lowering MAE to 2.91 (σ = 0.31) or equivalently lowering MARE to 12.5%, using a t-test for selecting the top 40 features. Since DementiaBank contains a variable number of samples for each subject, the number of subjects and the proportion of subjects with AD in each subgroup explored in Sec. 3.2 are not balanced. We therefore suggest that future data collection of pathological speech should involve more longitudinal samples across participants.
While MMSE is one of the most widely used clinical tests for cognitive ability, it is somewhat coarse, lacking sensitivity to subtle changes in cognition in the early stages of dementia, as well as having a high false-negative rate in addition to interannotator disagreement and test-retest variability [20,1,19]. While automated prediction of the MMSE score may aid the screening process for AD by reducing the cost and time involved, and improving reliability, future work will explore more precise measures of cognitive decline. The Montreal Cognitive Assessment (MoCA) and the Repeatable Battery for the Assessment of Neuropsychological Status (RBANS) [21] are screening tests which have been shown to have higher sensitivity than MMSE to subtle changes in cognitive decline in populations with MCI and mild dementia [22]; future studies are needed to assess the validity of automatic scoring of such tests as a more fine-grained measure of the progression of cognitive decline.

Acknowledgements
This work is funded by an NSERC Discovery grant (RGPIN 435874) and by a Young Investigator award from the Alzheimer Society of Canada.