Detecting Mild Cognitive Impairment by Exploiting Linguistic Information from Transcripts

Here we seek to automatically identify Hungarian patients suffering from mild cognitive impairment (MCI) based on linguistic features collected from their speech transcripts. Our system uses machine learning techniques and is based on several linguistic features like characteristics of spontaneous speech as well as features exploiting morphological and syntactic parsing. Our results suggest that it is primarily morphological and speech-based features that help distinguish MCI patients from healthy controls.


Background
Mild cognitive impairment (MCI) is a heterogeneous set of symptoms that are essential in the early detection of Alzheimer's Disease (AD) (Negash et al., 2007). Symptoms such as language dysfunctions may occur even nine years before the actual diagnosis (APA, 2000). Thus, the language use of the patient may often indicate MCI well before the clinical diagnosis of dementia.
MCI is known to influence the (spontaneous) speech of the patient via three main aspects. First, verbal fluency declines, which results in longer hesitations and a lower speech rate (Roark et al., 2011). Second, the lexical frequency of words and part-of-speech tags may also change significantly as the patient has problems with finding words (Croot et al., 2000). Third, the emotional responsiveness of the patient was also observed to change in many cases (Lopez-de Ipiña et al., 2015).
For many patients, MCI is never recognized: in the early stage of the disease, detecting cognitive impairment is not trivial even for experts. According to Boise et al. (2004), up to 50% of MCI patients are never diagnosed with MCI. Although there are well-known tests such as the Mini Mental Test, they are usually not sensitive enough to reliably filter out MCI in its early stage. Tests on linguistic memory prove more efficient in detecting MCI, but they tend to yield a relatively high number of false positive diagnoses (Roark et al., 2011).
Although language abilities are impaired from an early stage of the disease, evaluating the language capacities of patients has received only marginal attention when diagnosing AD (Bayles, 1982). However, if MCI is diagnosed early, proper medical treatment may delay the occurrence of other (more severe) symptoms of dementia for as long as possible (Kálmán et al., 2013).
Here we seek to automatically identify Hungarian patients suffering from mild cognitive impairment based on their speech transcripts. Our system uses machine learning techniques and is based on several features like linguistic characteristics of spontaneous speech as well as features exploiting morphological and syntactic parsing.
Recently, several studies have reported results on identifying different types of dementia with NLP and speech recognition techniques. For instance, automatic speech recognition tools were employed in detecting aphasia (Fraser et al., 2013b; Fraser et al., 2014; Fraser et al., 2013a), mild cognitive impairment (Lehr et al., 2012) and Alzheimer's Disease (Baldas et al., 2010; Satt et al., 2014). Jarrold et al. (2014) distinguished four types of dementia on the basis of spontaneous speech samples. Lexical analysis of spontaneous speech may also indicate different types of dementia (Bucks et al., 2000; Holmes and Singh, 1996) and may be exploited in the automatic detection of patients suffering from dementia (Thomas et al., 2005). As for written language, changes in a person's writing style may also indicate dementia (Garrard et al., 2005; Hirst and Wei Feng, 2012; Le et al., 2011).
Concerning the automatic detection of MCI in Hungarian subjects, Tóth et al. (2015) experimented with speech recognition techniques. However, to the best of our knowledge, this is the first attempt to identify MCI on the basis of written texts, i.e. speech transcripts for Hungarian.
In the long run, we would like to develop a system that can automatically detect linguistic symptoms of MCI in its early stage, so that the person can get medical treatment as early as possible. It should be noted, however, that our goal cannot be an official diagnosis as diagnosing patients requires medical experience. All we can do is implement a test supported by methods used in artificial intelligence, which indicates whether the patient is at risk and if so, s/he can turn to medical experts who will provide the clinical diagnosis.

Data
In our experiments, two short animated films were presented to the patients at the memory clinic of the University of Szeged. Patients were asked to talk about the first film, then about their previous day, and lastly about the second film. Their speech productions were recorded and transcribed by linguists, who explicitly marked speech phenomena like hesitations and pauses in the transcripts. These transcripts formed the basis of our experiments, i.e. we exploited only written information.
All of our 84 subjects were native speakers of Hungarian, a morphologically rich language. For each person, a clinical diagnosis was at our disposal, i.e. it was clinically established whether the patient suffered from MCI or not. On the basis of these data, subjects were classified as either MCI patients or healthy controls at the university's memory clinic. Table 1 shows data on the subjects' gender and diagnosis, while Table 2 shows the mean values for age and education (in terms of years of schooling).
Speech transcripts reflect several characteristics of spontaneous speech. On the one hand, they contain several forms of hesitations and silent pauses, which are also marked in the transcripts. On the other hand, they abound in phenomena typical of spontaneous Hungarian speech such as phonological deletion (mer instead of the standard form mert "because" or ement instead of the standard form elment "(he) left") and lengthening (utánna instead of the standard form utána "then"). There are duplications (ez ezt "this this-ACC") and neologisms created by the speaker (feltkáva, which probably means főtt kávé "boiled coffee"). Fillers also deserve special attention when studying transcripts. Besides hesitations, we treated words and phrases referring to some kind of uncertainty, together with indefinite pronouns, as fillers, such as ilyen "such", olyan "such", izé "thing, gadget", és aztán "and then", valamilyen "some kind of", valahogy "somehow", valamerre "somewhere". MCI patients often seem to substitute content words with fillers or indefinite pronouns. Moreover, they also appear to use many paraphrases, which likewise indicate uncertainty, such as egy ilyen bagolyszerűség (a such owl-likeness) "something similar to an owl" or az olyan délelőtt volt (that such morning was) "that happened some time in the morning".

Experiments
In order to determine the status of the subjects, we experimented with machine learning tools. The task was regarded as binary classification, i.e. subjects were classified as either an MCI patient or a healthy control, on the basis of a feature set derived from their transcripts.
At first, transcripts were morphologically and syntactically analysed with magyarlanc, a linguistic preprocessing toolkit developed for Hungarian (Zsibrita et al., 2013). For classification, we exploited morphological, syntactic and semantic features extracted from the output of magyarlanc.
Each person was asked to recall three different stories. As MCI is strongly related to memory deficit, we believe that the order of the tasks might also influence performance, hence we opted for processing each transcript separately. Thus, for each person, features to be discussed below were calculated separately for the three transcripts and all of them were exploited in the system.

Feature set
In our experiments, we employed features of spontaneous speech and morphological and semantic features derived from the transcripts and their automatic linguistic analyses. When defining our features, we took into account the fact that the speech of MCI patients may contain more pauses and hesitations than that of healthy controls (Tóth et al., 2015) and they are also supposed to have a restricted vocabulary due to cognitive deficit, which may affect the choice of words and the frequency of parts of speech (Croot et al., 2000) and might even yield neologisms. We also made use of demographic features that were at our disposal.
Our feature set contained the following features:
Spontaneous speech based features: number of filled and silent pauses; number and rate of hesitations compared to the number of tokens; number of pauses that follow an article and precede content words, as this might reflect that MCI patients have difficulties with finding the appropriate content words; number of lengthened sounds (which we considered a special form of hesitation).
Morphological features: number of tokens and words; number and rate of distinct lemmas; number of punctuation marks; number and rate of nouns, verbs, adjectives, pronouns and conjunctions; number of first person singular verbs, as how often the patient refers to him/herself might also be indicative; number and rate of unanalyzed words, i.e. those with an "unknown" POS tag, which might indicate neologisms created by the speaker on the spot.
Semantic features: number and rate of fillers and uncertain words compared to the number of all tokens; number and rate of words/phrases related to memory activity (e.g. nem emlékszem (not remember-1SG) "I can't remember"), as they directly signal problems with memory and recall; number of negation words; number and rate of content words and function words; number of thematic words related to the content of the films, based on manually constructed lists.
The mean values for each feature are reported in Table 3.
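To make the feature definitions above more concrete, the following is a minimal sketch of how a few of the count and rate features could be computed from an analysed transcript. It is not the authors' implementation: the `(form, lemma, pos)` input format, the `[HES]` and `[SIL]` markers, the POS tag names and the function names are all assumptions made for illustration, and the filler list covers only the single-word examples quoted in the Data section. Features from the three transcripts of one person are kept separate, as described in the Experiments section.

```python
from collections import Counter

# Hypothetical annotation conventions (assumptions, not taken from the paper):
HESITATION = "[HES]"     # marker for a filled pause / hesitation
SILENT_PAUSE = "[SIL]"   # marker for a silent pause
# Single-word fillers quoted in the Data section (multiword "és aztán" omitted).
FILLERS = {"ilyen", "olyan", "izé", "valamilyen", "valahogy", "valamerre"}


def extract_features(tokens):
    """Compute a few features from one transcript.

    tokens: list of (form, lemma, pos) triples; the tag names "PUNCT",
    "NOUN" and "X" (unknown) are illustrative assumptions.
    """
    n_tokens = len(tokens)
    # Words exclude punctuation and the pause/hesitation markers.
    words = [t for t in tokens
             if t[2] != "PUNCT" and t[0] not in (HESITATION, SILENT_PAUSE)]
    pos_counts = Counter(pos for _, _, pos in words)
    n_hes = sum(1 for form, _, _ in tokens if form == HESITATION)
    n_sil = sum(1 for form, _, _ in tokens if form == SILENT_PAUSE)
    n_fillers = sum(1 for _, lemma, _ in words if lemma in FILLERS)
    lemmas = {lemma for _, lemma, _ in words}

    def rate(count):
        # Rates are taken relative to the number of all tokens.
        return count / n_tokens if n_tokens else 0.0

    return {
        "n_tokens": n_tokens,
        "n_words": len(words),
        "n_hesitations": n_hes,
        "hesitation_rate": rate(n_hes),
        "n_silent_pauses": n_sil,
        "n_distinct_lemmas": len(lemmas),
        "distinct_lemma_rate": rate(len(lemmas)),
        "n_nouns": pos_counts["NOUN"],
        "noun_rate": rate(pos_counts["NOUN"]),
        "n_unknown": pos_counts["X"],
        "n_fillers": n_fillers,
        "filler_rate": rate(n_fillers),
    }


def person_features(transcripts):
    """Combine the features of a person's transcripts, one prefix per task."""
    combined = {}
    for i, tokens in enumerate(transcripts, start=1):
        for name, value in extract_features(tokens).items():
            combined[f"t{i}_{name}"] = value
    return combined
```

Calling `person_features` with the three analysed transcripts of one subject yields a single flat dictionary whose keys are prefixed with `t1_`, `t2_` and `t3_`, so that the same feature measured on different tasks remains distinguishable.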

Statistical analysis of features
In order to reveal which features can most effectively distinguish healthy controls from MCI patients, we carried out a statistical analysis of the data (t-tests for each feature and transcript). For most of the features, significant differences were found between the two groups; p-values are listed in Table 3. Age also shows a significant difference: people who were at least 71 years old were more likely to suffer from MCI than those who were younger at the time of the experiment (p = 0.0124).
According to the data, each group of features has a significant effect in distinguishing controls and MCI patients. It is mostly in the second transcript (the one including the narratives about the subjects' previous days) that significant differences are found between MCI patients and the control group. However, significant differences exist for the other two types of texts as well.
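The per-feature group comparison described in this section can be sketched as follows. The text does not say which t-test variant was used, so Welch's unequal-variance test is an assumption here, and the function names are hypothetical. The sketch relies only on the standard library and therefore returns the t statistic and the approximate degrees of freedom rather than a p-value.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom).

    Assumes two independent samples with at least two values each and
    non-zero variance in at least one of them.
    """
    na, nb = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means.
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


def compare_groups(mci_rows, control_rows):
    """Apply the test to every feature in two lists of feature dicts."""
    results = {}
    for name in mci_rows[0]:
        a = [row[name] for row in mci_rows]
        b = [row[name] for row in control_rows]
        results[name] = welch_t(a, b)
    return results
```

If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` computes the same statistic together with a two-sided p-value. Because one test is run per feature and per transcript, a multiple-comparison correction may be worth considering when interpreting the individual p-values.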

Machine learning experiments
To automatically identify MCI patients, we used support vector machines (SVM) (Cortes and Vapnik, 1995) with the default settings of Weka (Hall et al., 2009). Due to the small size of the dataset, we applied leave-one-out cross validation. As a baseline, majority labeling was used. For evaluation, we used accuracy, precision, recall and F-measure.
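The evaluation protocol can be illustrated with a short sketch of leave-one-out cross validation combined with the majority-label baseline. This is not the Weka setup used in the experiments: the function names are hypothetical, the feature vectors are placeholders, and the group sizes (48 MCI patients and 36 controls) are inferred from the recall figures and the 57.14% baseline reported in the Results section rather than quoted from Table 1.

```python
from collections import Counter


def majority_classifier(train_labels):
    """Return a predictor that always outputs the most frequent training label."""
    label = Counter(train_labels).most_common(1)[0][0]
    return lambda features: label


def leave_one_out(features, labels, fit):
    """Hold out each subject once; fit(train_x, train_y) must return a predictor."""
    predictions = []
    for i in range(len(labels)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predict = fit(train_x, train_y)
        predictions.append(predict(features[i]))
    return predictions


# Inferred group sizes (an assumption, see the lead-in): 48 MCI, 36 controls.
labels = ["MCI"] * 48 + ["control"] * 36
features = [[0.0] for _ in labels]  # placeholder vectors, ignored by the baseline

baseline = leave_one_out(features, labels, lambda x, y: majority_classifier(y))
accuracy = sum(p == g for p, g in zip(baseline, labels)) / len(labels)
print(f"baseline accuracy: {accuracy:.4f}")  # baseline accuracy: 0.5714
```

Any classifier can be plugged in through the `fit` argument, as long as it returns a function that maps one feature vector to a label. With these group sizes the majority label is MCI in every fold, which reproduces the reported baseline accuracy of 57.14%.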
In order to examine the effect of certain groups of features, we carried out an ablation study, i.e. we retrained the system without making use of one specific group of features. The results and differences are shown in Table 4.
Table 4: Results and differences. MCI: mild cognitive impairment, P: precision, R: recall, F: F-measure, %: accuracy.
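The ablation procedure, retraining once per feature group with that group removed, can be sketched as below. To keep the example self-contained, a nearest-centroid classifier stands in for the SVM; this substitution, the function names and the dictionary-based data layout are assumptions for illustration only.

```python
def nearest_centroid_fit(train_x, train_y):
    """Stand-in classifier (NOT the SVM used in the paper): nearest class centroid."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(x):
        def dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, c))
        # Sorting the labels makes tie-breaking deterministic.
        return min(sorted(centroids), key=lambda label: dist(centroids[label]))

    return predict


def loo_accuracy(rows, labels, columns):
    """Leave-one-out accuracy using only the given feature columns."""
    data = [[row[c] for c in columns] for row in rows]
    correct = 0
    for i in range(len(data)):
        predict = nearest_centroid_fit(data[:i] + data[i + 1:],
                                       labels[:i] + labels[i + 1:])
        correct += predict(data[i]) == labels[i]
    return correct / len(data)


def ablation(rows, labels, groups):
    """Return the full-model accuracy and the change when each group is removed.

    groups maps a group name to the list of feature names it contains.
    """
    all_columns = [c for cols in groups.values() for c in cols]
    full = loo_accuracy(rows, labels, all_columns)
    deltas = {}
    for name, cols in groups.items():
        kept = [c for c in all_columns if c not in cols]
        deltas[name] = loo_accuracy(rows, labels, kept) - full
    return full, deltas
```

A negative value in `deltas` means that accuracy dropped when the group was removed, i.e. the group contributed to performance; a positive value means the system did better without it.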

Results and Discussion
Using all the features, our system achieved an accuracy of 69.05%, that is, 58 out of the 84 subjects were correctly classified. 12 MCI patients were falsely classified as healthy and 14 controls were falsely labeled as MCI patients. Our results outperformed the baseline (57.14% in terms of accuracy). The system obtained a high recall value for MCI patients (75.0) but a lower one for controls (61.1), which is encouraging given that our main goal is to identify the widest possible range of potential MCI patients, who can then turn to clinical experts for a clinical diagnosis.
We also experimented with using only features that displayed statistically significant differences between controls and MCI patients (see Table 3). Somewhat surprisingly, an accuracy of 75% could be achieved in this way, which indicates that some of our original features are superfluous and only confused the system. This result needs further investigation.
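As a consistency check on the figures in this paragraph, the per-class metrics can be recomputed from the stated error counts. The sketch below assumes 48 MCI patients and 36 controls; these group sizes are not stated in this paragraph but follow from the reported recall values and the 57.14% baseline. The precision and F-measure values it prints are derived here and may differ in rounding from those in Table 4.

```python
def prf(tp, fp, fn):
    """Precision, recall and F-measure (in percent, one decimal) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return tuple(round(100 * v, 1) for v in (precision, recall, f_measure))


# Counts stated in the text.
total, mci_missed, controls_missed = 84, 12, 14
# Inferred group sizes (assumption, see the lead-in).
n_mci, n_control = 48, 36

mci_correct = n_mci - mci_missed               # 36 MCI patients found
control_correct = n_control - controls_missed  # 22 controls found

accuracy = 100 * (mci_correct + control_correct) / total
print(f"accuracy: {accuracy:.2f}")                                   # accuracy: 69.05
print("MCI:", prf(mci_correct, controls_missed, mci_missed))         # MCI: (72.0, 75.0, 73.5)
print("control:", prf(control_correct, mci_missed, controls_missed)) # control: (64.7, 61.1, 62.9)
```

The recomputed recall values (75.0 for MCI patients and 61.1 for controls) agree with those reported above, which supports the inferred group sizes.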
An ablation study was also carried out to analyze the added value of each feature group. Speech-based, demographic and morphological features unequivocally contributed to performance. However, the effect of semantic features seems less obvious as they harm performance taken as a whole but some individual semantic features are useful for the system, as shown by the results achieved with just using significant features.
When investigating the errors made by our system, we found that MCI patients who spoke only a few short sentences were often classified as healthy controls. They had a lower number and rate of hesitations and pauses. Moreover, their vocabulary contained fewer fillers and uncertain words, and these features resemble those typical of healthy controls. Conversely, healthy subjects who talked more also hesitated more, which might be indicative of MCI. Furthermore, their use of pronouns and conjunctions was also more similar to that of MCI patients, hence the system falsely predicted a positive diagnosis for them.
Due to the specific characteristics of the data and the complexity of data collection, which requires clinical experiments, our dataset can be expanded only step by step. However, we found statistically significant differences between MCI patients and healthy controls concerning several linguistic and speech-based features even in our small dataset, which may be beneficial for our future experiments and might also be exploited by those who study spontaneous speech.

Conclusions
In this study, we introduced our system that automatically detects Hungarian patients suffering from mild cognitive impairment on the basis of their speech transcripts. The system is based on features derived from morphological and syntactic analysis as well as characteristics of spontaneous speech. Both statistical and machine learning results revealed that morphological and spontaneous speech-based features have an essential role in distinguishing MCI patients from healthy controls.
In the future, we would like to extend our dataset with new transcripts. Also, we intend to improve our machine learning system and investigate the role of semantic features. Lastly, we would like to integrate features from automatic speech recognition into our system so that tools from both speech technology and natural language processing can contribute to the automatic detection of mild cognitive impairment.