Detection of Alzheimer’s disease based on automatic analysis of common objects descriptions

Many studies have examined the language alterations that take place over the course of Alzheimer's disease (AD). As a consequence, it is now accepted that healthy and ailing patients can be discriminated solely on the basis of their language production. Most of these studies, however, were conducted on very small samples (30 participants per study, on average), or involved a great deal of manual work in their analysis. In this paper, we present an automatic analysis of transcripts of elderly participants describing six common objects. We used part-of-speech and lexical richness measures as linguistic features to train an SVM classifier to automatically discriminate between healthy participants and AD patients in the early and moderate stages. The participants in the corpus used for this study were 63 Spanish adults over 55 years old (29 controls and 34 AD patients). With an accuracy of 88%, our experimental results compare favorably to those relying on the manual extraction of attributes, providing evidence that the need for manual analysis can be overcome without sacrificing performance.


Introduction
As life expectancy increases, age-related disorders increase as well, bringing great social, health and economic challenges for governments and societies in general. Researchers across the world are trying to find methods for detecting and treating these disorders in effective, non-invasive and cost-efficient ways.
AD affects one in ten adults over 65 years old in the United States (Alzheimer's Association, 2015). Interventions may be more effective in the early stages of dementia. Nevertheless, it is highly common, especially in low and middle income countries, to diagnose AD several years after the disease begins, leading to a treatment gap for early dementia sufferers (Alzheimer's Disease International, 2011). This gap could reduce the effectiveness of treatments, prolonging the patients' state of reduced independence. Alzheimer's Disease International (2015) identifies early diagnosis and treatment as a means of attenuating care costs and reducing this gap. Furthermore, an early diagnosis would allow the sufferers and their families to get their affairs in order by foreseeing the future better and preparing accordingly.
Many researchers have studied the early detection of AD. These studies usually follow two main approaches: the analysis of biomarkers and the examination of patients' declining cognitive abilities. The first approach yields reliable results in the detection of AD in its moderate and advanced stages, but still performs insufficiently in the early stages of the disease (Alzheimer's Association, 2015). The second approach has gained more attention in recent years because, in clinical practice, it has shown promise in the early detection of AD (Taler and Phillips, 2008; Schröder et al., 2010). Furthermore, when compared to the first approach, the analysis of the decline of cognitive abilities represents an inexpensive and non-invasive alternative.
Language skills are among the first cognitive abilities to diminish during the course of AD, with alterations appearing even before any symptom is experienced. Clinicians have designed many standard tests to evaluate language in elderly patients (Taler and Phillips, 2008), such as asking them to retrieve words from certain categories, to think of words that start with the same letter, or to name objects in pictures. These tests, although sufficient to give a reasonably accurate diagnosis, present some problems in clinical practice (Smith and Bondi, 2013), including the production of nervousness and discomfort in elderly patients, as well as a "practice effect". Moreover, these tests do not necessarily reflect patients' real performance in language production. Analyzing actual language production, apart from aiding in early detection of the disease, could help further our understanding of the disease, its progression, and the parts of the brain affected in early stages (before the damage is visible on MRI images).
In this article, we introduce our first experimental approach for automatic analysis of transcripts from elderly Spanish speakers. We aim to discriminate cognitively-healthy participants from (early and moderate) AD sufferers.

Related Work
Relatively few authors (Bucks et al., 2000; Jarrold et al., 2010; Guinn and Habash, 2012; Guinn et al., 2014; Jarrold et al., 2014; Alegria et al., 2013) have researched the automatic discrimination of AD patients through language analysis of transcripts, although interest has been growing in recent years. In most studies, researchers examined the free discourse of elderly English speakers. The most often-used features are part-of-speech rates, lexical richness measures, pauses, and incomplete words. Overall accuracy ranges from 73% to 95%, but authors disagree on which features to use. Some works, like Khodabakhsh et al. (2015), even downplay the usefulness of these types of features.
Most of these studies used very small samples (8-32 AD patients and 16-51 controls) collected in different settings (phone vs. face-to-face conversations, hospital vs. familiar environments, inconsistent topics, etc.). These differences make it difficult to compare their findings. Given the small size of the samples, it would be helpful to use corpora with constrained settings, such as restricted discourse and controlled environments, in order to rule out differences attributable to factors unrelated to language. Moreover, further studies with non-English speakers would enrich our understanding of language alterations due to AD.
In a different approach, Guerrero et al. (2016) trained a Bayesian network to detect AD, using manually extracted conceptual components along with age, gender, and educational level as prior probabilities. Their corpus consisted of transcripts of elderly Spanish participants orally describing six objects (Peraita and Grasso, 2010). The authors reported the following performance metrics: accuracy (Acc), precision (Pre), recall/sensitivity (Rec), F1-score, false positive rate (FPR), and false negative rate (FNR):

Acc    Pre    Rec    F1     FPR    FNR
0.91   0.94   0.87   0.90   0.05   0.01

Table 1: Results reported by Guerrero et al. (2016).
For this work, we studied the restricted-discourse corpus used in Guerrero et al. (2016) and trained an SVM using some of the linguistic features employed by previous authors in the analysis of free conversations. We additionally incorporated two scarcely explored part-of-speech-based features: conjunction rate and secondary verb rate. We compared our automatic analysis results to those obtained by Guerrero et al. using manually extracted conceptual components.

Corpus
Peraita and Grasso (2010) created a dataset 1 of oral descriptions in Spanish to study linguistic pathologies related to dementia, particularly AD. All recordings were obtained with the written informed consent of the participants (Grasso et al., 2011). The authors granted us permission to use their corpus for this study. We chose this corpus because its availability, restricted discourse, and homogeneous recordings facilitate the comparison of our results with those of other researchers. Likewise, the size of this sample is comparable to the largest samples used in related works.
The cohort used by Guerrero et al. (2016) in their study includes a total of 69 participants (30 controls and 39 AD patients previously diagnosed by neurologists) aged between 55 and 95 years old. For each participant, Peraita and Grasso recorded free oral descriptions of six common objects (referred to in their work as "semantic categories"): dog, apple and pine (living things), and car, trousers and chair (non-living things). These descriptions were manually transcribed. Any interactions with or interventions by the interviewer were excluded from the transcription. In addition, the authors of the corpus noted when a participant went "off-topic", but did not include these utterances in the transcripts. They annotated the corpus with marks for interruptions, off-topic speech, and unintelligible words.
In their study, Peraita and Grasso (2010) manually analyzed and extracted attributes from the description of each object. These attributes were divided into eleven categories: taxonomic, types, parts, functional, evaluative, places/habitat, behavior, cause/generate, procedural, life cycle, and others. In Figure 1, we provide a translated version of a sample taken from the corpus.
From the sample, we removed all participants with no utterances in the description of one or more objects. Additionally, since our objective was to evaluate the performance of a classifier for early detection of AD, we proceeded like Guerrero et al. (2016) and only considered the controls and patients in the early and moderate stages of AD. Our final sample consisted of a total of 63 participants (29 controls and 34 AD patients).

Linguistic features
For this work, we used a combination of 5 features that most authors have found suitable: verb, noun and preposition rates, Brunet's W index, and Honoré's R statistic. Additionally, in a previous non-automatic study on the preservation of syntax in AD, Kemper et al. (1993) found that sentences produced by cognitively healthy adults usually contain more secondary verbs and conjunctions. We incorporated these findings, resulting in a total of 7 features.

Part-Of-Speech features:
• Verb, noun, preposition and conjunction rates: the number of verbs, nouns, prepositions and conjunctions per 100 words, respectively.
• Secondary verb rate: number of secondary verbs divided by the total number of verbs.
Lexical richness features:
• We used Brunet's W index (Brunet, 1978) to determine the richness of speakers' vocabularies: W = N^(V^-0.165), where N is the total number of words used and V is the vocabulary size (the number of different words used). Lower values of W indicate a richer vocabulary.
• Honoré's R statistic (Honoré, 1979) measures lexical richness based on the number of a speaker's once-mentioned words: R = 100 · log(N) / (1 − V1/V), where N is the total number of words used, V is the vocabulary size, and V1 is the number of words mentioned only once.

Implementation
To perform the binary classification, we used a Support Vector Machine (SVM) implementation of the Python library scikit-learn (Pedregosa et al., 2011). For the automatic tokenization, lemmatization, and part-of-speech extraction, we used FreeLing 3.0 (Padró and Stanilovsky, 2012), an open source language analysis tool suite. We selected this package for its good performance in Spanish (although it also supports other languages), and for the way it encapsulates multiple text analysis services in a single application.
In their experiments, Guerrero et al. (2016) trained their Bayesian Network without directly linking risk factor variables (such as age, gender, or education) to the rest of the model. Instead, they used these a priori probabilities as deterministic inputs. In our experiments, we did not consider these variables. We performed our classification based solely on linguistic features.
Using the above-mentioned risk factor variables and their correlation to AD could be useful in the improvement of the overall accuracy of these types of experiments. However, these correlations vary significantly depending on factors such as country, race, quality of life, diet, pollution, environment, etc. Moreover, in most countries, there are no reliable statistics about these correlations (Alzheimer's Disease International, 2015). Furthermore, most AD datasets have very few participants, and their distributions are not usually an accurate representation of the population. In practice, training an algorithm with the socio-demographic information presented in these datasets would lead to biased results.
At the core of their Bayesian network, Guerrero et al. (2016) calculated the probability of a person having a lexical-semantic-conceptual deficit (LSCD), which the authors consider a major sign of cognitive impairment, in two main categories: "living things" and "non-living things". The authors obtained these probabilities based on the number of attributes present in each of the 11 categories, first individually for each object, and then jointly for the main category to which the objects belonged (living / non-living things). The reason behind this categorical division is that previous researchers have found an important difference between the number of attributes of living and non-living things in the descriptions given by AD patients in early stages and those given by healthy individuals. The authors used the k-means++ algorithm to discretize the presence of LSCD given the numbers of living-thing and non-living-thing attributes mentioned by a participant.
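A discretization of this kind can be sketched with scikit-learn, whose KMeans estimator uses k-means++ initialization. The attribute counts below are invented for illustration; they do not reproduce Guerrero et al.'s actual counts or thresholds.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical attribute counts per participant:
# column 0 = attributes for living things, column 1 = non-living things.
counts = np.array([
    [42, 38], [40, 35], [39, 37],   # richer descriptions (control-like)
    [12, 15], [10, 11], [14, 13],   # sparser descriptions (AD-like)
])

# Two clusters ("deficit" vs. "no deficit") with k-means++ seeding.
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(counts)

# Label the cluster with the lower total attribute count as the LSCD group.
lscd_cluster = int(np.argmin(km.cluster_centers_.sum(axis=1)))
has_lscd = km.labels_ == lscd_cluster
print(has_lscd)
```

The cluster-to-label mapping is chosen by comparing cluster centers, since k-means assigns cluster indices arbitrarily.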
We designed two different experiments. In the first experiment, we followed the lead of Guerrero et al. (2016) and divided each subject's descriptions into living and non-living things. From this, we extracted a total of 14 linguistic features (set1): 7 features (verb, noun, preposition, secondary verb and conjunction rates, Brunet's W index, and Honoré's R statistic) from the descriptions of living things, and the same 7 features from the descriptions of non-living things. In the second experiment, we considered all the descriptions from each subject as a unit and extracted the 7 linguistic features (set2).
Calibration: We tested two SVM kernels for both experiments, linear and Radial Basis Function (RBF), and used 5-fold cross-validation to calibrate their respective hyperparameters. Each cross-validation run used 80% of the data for training and 20% for testing; the training and testing samples were shuffled and selected at random. For set1, the best model (accuracy = 86%) used an RBF kernel (C = 1.0, gamma = 0.0001). The best model (accuracy = 88%) for set2 used a linear kernel (C = 0.1). The best models' accuracies reflect the performance of the classifiers when dealing with completely unseen data.
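A calibration of this kind can be reproduced with scikit-learn's GridSearchCV. The sketch below uses synthetic data of the same shape as set1 (63 participants, 14 features) rather than the actual corpus, and the hyperparameter grids are illustrative rather than the exact ones we searched.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Stand-in for the 63 x 14 matrix of set1 linguistic features.
X, y = make_classification(n_samples=63, n_features=14, random_state=0)

# Shuffled 80/20 split, as in the calibration described above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0)

# 5-fold cross-validated grid search over both kernels.
grid = GridSearchCV(
    SVC(),
    param_grid=[
        {"kernel": ["linear"], "C": [0.1, 1.0, 10.0]},
        {"kernel": ["rbf"], "C": [0.1, 1.0, 10.0],
         "gamma": [1e-4, 1e-3, 1e-2]},
    ],
    cv=5,
)
grid.fit(X_tr, y_tr)
print(grid.best_params_, grid.score(X_te, y_te))
```

The held-out 20% is touched only once, after the grid search, so the reported score reflects performance on unseen data.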

Results
We evaluated the two best models using 5-times-repeated 10-fold cross-validation over the dataset. In Table 2 we show the averages of the performance metrics most commonly used in medical applications: accuracy, precision, recall (sensitivity), F1-score, FPR (false positive rate), and FNR (false negative rate). Additionally, we obtained the ROC curves and areas under the curves (AUC) for both experiments (see Figure 2).
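An evaluation protocol of this shape can be expressed with scikit-learn's RepeatedStratifiedKFold and cross_validate. The data below are synthetic stand-ins with the shape of set2 (63 participants, 7 features); only the scorers built into scikit-learn are shown, while FPR and FNR would be derived from the per-fold confusion matrices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.svm import SVC

# Stand-in data with the study's shape (63 participants, 7 set2 features).
X, y = make_classification(n_samples=63, n_features=7, random_state=0)

# 10-fold cross-validation repeated 5 times, averaging common metrics.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
scores = cross_validate(
    SVC(kernel="linear", C=0.1), X, y, cv=cv,
    scoring=["accuracy", "precision", "recall", "f1", "roc_auc"])

for name in ("accuracy", "precision", "recall", "f1", "roc_auc"):
    print(name, round(float(np.mean(scores[f"test_{name}"])), 3))
```

Averaging over 5 x 10 = 50 folds smooths out the variance that a single 10-fold split would show on a sample this small.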

Discussion and future directions
As shown in Table 2, the differences in accuracy and F1-score between the AD classifiers trained with set1 and set2 are barely perceptible. The classifier of the second experiment has a slightly higher sensitivity (2% more), which means it has a lower tendency to let AD participants go unrecognized. When comparing the AUC of both classifiers, the difference is more noticeable; set2 performed better than set1. From this, we concluded that, for the linguistic features considered, there is no need to separate participants' descriptions into living and non-living categories. Guerrero et al. (2016) reported an accuracy of 91% (see Table 1) and an AUC of 0.9636, using a Bayesian network fed with manually extracted attributes and incorporating participants' socio-demographic information as a priori deterministic inputs. We obtained an accuracy of 88% and an AUC of 0.9685 by performing automatic language analysis, without taking into account any socio-demographic information. Although the classifier based on manually extracted attributes performs slightly better, automatic language analysis reduces time and human effort and provides consistency and replicability.
The corpus used in this work contains another cohort of 143 speakers from Argentina. To our knowledge, no experimental work has been done on it yet. The corpus is provided in a read-only application, and manually transforming the data into text format took a great amount of time; for this reason, we only analyzed the cohort of Spanish participants, as Guerrero et al. (2016) did. Our next step will be to experiment with the Argentinian cohort to explore intra-language variations. We also intend to perform a study on less restrictive discourse contexts, like the work of Prud'hommeaux and Roark (2011) with story retellings.
For our first set of experiments, we selected some basic linguistic features commonly used in the analysis of free spontaneous discourse and applied them to a particular restricted-discourse context, with very encouraging results for detecting AD in its early and moderate stages. In future experiments we will test more sophisticated linguistic features and perform computational syntactic and semantic analyses. Furthermore, we will investigate the performance of other classification algorithms. An in-depth analysis of the features used and their relevance to this task is also planned.