Impact of ASR on Alzheimer’s Disease Detection: All Errors are Equal, but Deletions are More Equal than Others

Automatic Speech Recognition (ASR) is a critical component of any fully-automated speech-based dementia detection model. However, despite years of speech recognition research, little is known about the impact of ASR accuracy on dementia detection. In this paper, we experiment with controlled amounts of artificially generated ASR errors and investigate their influence on dementia detection. We find that deletion errors affect detection performance the most, due to their impact on the features of syntactic complexity and discourse representation in speech. We show that this trend generalises across two different datasets for cognitive impairment detection. In conclusion, we propose optimising the ASR to reflect a higher penalty for deletion errors in order to improve dementia detection performance.


Introduction
There is rapid growth in the number of people living with Alzheimer's disease (AD) [1]. Clinical research has shown that quantifiable signs of cognitive impairment associated with AD are detectable in spontaneous speech [2]. For example, a reduction in syntactic and lexical complexity in speakers with AD has often been reported in previous studies. A study comparing 22 participants with AD and 24 controls [3] found that the group with dementia produced significantly fewer subordinate clauses. A study with 20 individuals with AD and 30 healthy controls [4] found that the number of complex T-units¹ in speech was significantly lower in speakers with AD. Similarly, the AD group was found to use fewer auxiliary verbs, fewer gerund verbs, fewer participles, and more sentence fragments [5].
Machine learning models have proved successful in detecting AD using speech and language patterns, such as acoustic markers of speech and the syntactic and lexical complexity of language [5,6,7]. To use speech and language patterns in machine learning models, it is important to have, in addition to acoustic features, lexico-syntactic features extracted from transcripts of speech [5,8,9,10]. Since transcripts should be accurate enough to properly represent specific syntactic and linguistic characteristics, current approaches [11,12] frequently rely on 100% accurate transcripts produced by trained human transcriptionists. Relying on manual transcripts makes AD detection systems time-consuming and hard to scale, and thus inefficient.
Automatic Speech Recognition (ASR) is a critical component of a fully-automated and fast speech-based AD detection model. Extensive previous research shows that the presence of cognitive impairments, such as AD, causes the accuracy of ASR to decrease [10,13]. For speakers with dementia, the word error rate (WER) was found to vary between 38.24% and 93.54%, depending on the dataset used for training the ASR system [10]. Similar results were reported for an ASR system tested on personal robots recognising the speech of AD patients [13]. Two main issues are associated with lower ASR performance for people with AD: a) accuracy is lower for older voices and for speakers with cognitive impairment, due to increased breathiness and decreased intelligibility [14,10]; b) errors introduced by ASR, such as word deletions or insertions, occlude important signals of impairment that can be extracted from speech [11]. Strategies to deal with issue a) have been proposed in prior research [10] and include augmenting the training data of the ASR system with a dataset of pathological speech. However, very little previous research has addressed issue b). To the best of our knowledge, no prior research has examined which patterns of impaired speech are influenced the most by ASR errors, and how this impacts the performance of AD detection using machine learning models.
¹ T-unit: minimally terminable unit.
In this paper, we focus on issue b) and study the effect of deletion, insertion and substitution errors on lexico-syntactic features extracted from transcripts of speech. We artificially generate noisy transcripts by applying controlled amounts of insertions, substitutions, and deletions to manually transcribed data, which allows classification experiments across a wide range of noisy speech samples. We study the effect of these errors on binary classification performance (AD vs. controls) and provide suggestions on how to improve ASR in order to maintain reasonable AD classification performance.
We find that deletion errors affect classification more than substitution and insertion errors do on two datasets of spontaneous speech. The effect of deletion errors is most profound on features related to syntactic complexity and discourse representations in speech, such as production rules, word-level structure and repetitions. These features are also identified as the most important for the classification task by a gradient-based feature importance metric.

DementiaBank (DB)
The DementiaBank dataset is a large publicly accessible dataset of pathological speech. It consists of narrative picture descriptions from participants aged between 45 and 90 [15]. Participants in the longitudinal study describe the 'Cookie Theft' image that they are shown. Of the 210 participants in the study, 117 were diagnosed with AD (180 samples of speech) and 93 were healthy (HC, 229 samples), with many participants repeating the task annually. Each participant has 1.7 samples on average. Voice recordings and manual transcriptions (with some transformations to the CHAT protocol [16]) are available for all samples. This dataset is used for the experiments in Sec. 4, 5 and 6.

Famous People (FP)
The Famous People dataset [17] consists of publicly available spontaneous speech samples from 17 famous individuals (e.g., Woody Allen, Clint Eastwood, Ronald Reagan), recorded between 1956 and 2017 and spanning periods from early adulthood to older age, with an average of 25 samples per person. Approximately half of the subjects in the dataset were diagnosed with AD. The rest are considered to be healthy controls (HC, N = 231), given the absence of any reported diagnosis or subjective memory complaints. The HC group covers a variety of speaker ages, from 30 to 88 (µ = 60.9, σ = 15.4). Similarly, the AD group covers ages from 31 to 97 (µ = 65.3, σ = 11.5). Voice recordings and manual transcriptions (following the same protocol as DB) are available for all samples. This dataset is used for testing the generalisability of our results (Sec. 7).

ASR Setup
The ASR system used in this work is based on the open-source Kaldi toolkit [18], which in turn is based on the Weighted Finite State Transducer (WFST) approach. The ASR acoustic model is trained on the Fisher speech corpus [19] using a Hidden Markov Model (HMM) over MFCC features. The ASR uses the ASPiRE chain model trained on the multi-condition Fisher English corpus, with a 3-gram language model. Rates of ASR errors for healthy and impaired speakers in the DB and FP datasets are given in Tab. 1. The majority of errors arise from deletions and substitutions, for both datasets and both groups.
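To make the error categories concrete, the following minimal sketch shows one common way to break a word error rate into substitutions, deletions and insertions using a Levenshtein alignment between a reference and a hypothesis transcript. The function name `wer_breakdown` is hypothetical, and this is an illustration only, not the scoring tool used to produce Tab. 1. When several alignments have the same total cost, the reported split depends on the tie-breaking order.

```python
def wer_breakdown(reference, hypothesis):
    """Return (substitutions, deletions, insertions, wer) for two word lists."""
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = (total, subs, dels, ins) for aligning reference[:i] with hypothesis[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)  # every hypothesis word inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, ins = cost[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diagonal = (t, s, d, ins)          # match, no cost
            else:
                diagonal = (t + 1, s + 1, d, ins)  # substitution
            t, s, d, ins = cost[i - 1][j]
            deletion = (t + 1, s, d + 1, ins)
            t, s, d, ins = cost[i][j - 1]
            insertion = (t + 1, s, d, ins + 1)
            # Keep the cheapest option; ties resolve in the order listed.
            cost[i][j] = min(diagonal, deletion, insertion, key=lambda c: c[0])
    total, subs, dels, ins = cost[n][m]
    wer = total / n if n else 0.0
    return subs, dels, ins, wer


reference = "the boy is on the stool".split()
hypothesis = "the boy on a stool".split()
print(wer_breakdown(reference, hypothesis))
```

In this example the hypothesis drops "is" and replaces "the" with "a", so the call reports one substitution, one deletion and no insertions over six reference words.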

Feature Extraction and Aggregation
Following previous studies [5,20], we extract 505 lexico-syntactic and acoustic features that can be aggregated into the following major groups:
Syntactic Complexity: features that capture the syntactic complexity of speech, such as the number of occurrences of various production rules, the mean length of a clause (in words), etc.
Discourse Mapping: features that help identify cohesion in speech using a visual representation of message organization in speech. The main representation we use is a speech graph [21], where each word is a node and temporal links between words correspond to directed edges. Examples of features include the number of edges in the graph, the number of self-loops, the cosine distance across unique utterances, etc.
Lexical Complexity and Richness: measures of lexical density and variation, such as average familiarity scores of all nouns, age of word acquisition, frequency of POS tags, etc.
Acoustic: voice markers such as MFCCs and Zero Crossing Rate (ZCR) related features.
Additionally, we extract features quantifying difficulty in finding the right words (e.g. filled pauses), measures related to description of content in the picture (e.g. number of content units), and coherence in speaking at local and global level.
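As an illustration of the speech graph representation described above, the sketch below builds a graph from a word sequence and counts nodes, distinct directed edges, self-loops, and loops made of three edges. The function name `speech_graph_features` and the exact feature definitions are assumptions made for this example; the feature extraction used in the paper may define them differently.

```python
def speech_graph_features(words):
    """Count basic speech graph features for a list of words.

    Each distinct word is a node; each pair of consecutive words (a, b)
    contributes a directed edge a -> b.
    """
    nodes = set(words)
    edges = set(zip(words, words[1:]))
    # A self-loop is a word immediately followed by itself.
    self_loops = sum(1 for a, b in edges if a == b)
    successors = {}
    for a, b in edges:
        if a != b:
            successors.setdefault(a, set()).add(b)
    # Directed loops a -> b -> c -> a over three distinct words.
    # Every such loop is found once per starting node, hence the division by 3.
    rotations = 0
    for a in successors:
        for b in successors[a]:
            for c in successors.get(b, ()):
                if c != a and a in successors.get(c, ()):
                    rotations += 1
    return {
        "nodes": len(nodes),
        "edges": len(edges),
        "self_loops": self_loops,
        "loops_3": rotations // 3,
    }


print(speech_graph_features("the boy is is taking the cookie".split()))
```

Deleting a word from the sequence removes up to two edges and creates a new edge between its neighbours, which gives some intuition for why graph-based features can be sensitive to deletions.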

Artificial ASR Errors
We follow a method similar to the one used by Fraser et al. [11] to artificially add errors to manual transcripts at predefined WER levels of 20%, 40% and 60%. All altered words w are selected at random. The following modifications are made: a) deletion: the word instance w is deleted; b) insertion: a new word w1 is added after the word w; c) substitution: the word w is replaced with a new word w1.
For deletion, we simply delete random words from the manual transcript at the specified rate. To substitute a word w, we select, from the 2,000 most frequent unigrams in the Fisher language model, the unigram with the smallest Levenshtein distance to w, based on the phonemic model from the Carnegie Mellon Pronouncing Dictionary [22]. If w is not found in the Fisher language model, a random unigram from the top 2,000 is used for the substitution. For insertion, we select from the language model's bigram list the word with the highest probability of following w and insert it if it does not match the next word in the transcript. In case of a match, the next most probable word is inserted. If w is not found in the bigram list, a random unigram is used for insertion.
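The three modifications can be sketched as follows. This is a simplified illustration under several assumptions, not the exact procedure: `add_errors`, `vocab` and `bigrams` are hypothetical names, the Levenshtein distance is computed over characters rather than over phonemes from the pronouncing dictionary, `vocab` stands in for the list of most frequent unigrams, and `bigrams` is assumed to map a word to its followers sorted by decreasing probability. The sketch also assumes that a substituted word must differ from the original.

```python
import random


def levenshtein(a, b):
    """Edit distance between two sequences (here: character strings)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def add_errors(words, rate, kind, vocab, bigrams, rng):
    """Apply one error type to a randomly chosen fraction `rate` of the words."""
    n_errors = round(rate * len(words))
    chosen = set(rng.sample(range(len(words)), n_errors))
    out = []
    for i, w in enumerate(words):
        if i not in chosen:
            out.append(w)
        elif kind == "deletion":
            continue  # drop the word
        elif kind == "substitution":
            # Closest vocabulary word other than w itself (assumption).
            candidates = [v for v in vocab if v != w]
            out.append(min(candidates, key=lambda v: levenshtein(w, v)))
        elif kind == "insertion":
            following = words[i + 1] if i + 1 < len(words) else None
            # Most probable follower that differs from the actual next word.
            followers = [f for f in bigrams.get(w, []) if f != following]
            out.append(w)
            out.append(followers[0] if followers else rng.choice(vocab))
        else:
            raise ValueError(f"unknown error type: {kind}")
    return out


rng = random.Random(0)
transcript = "the boy is taking a cookie from the cookie jar".split()
vocab = ["the", "a", "is", "boy", "girl", "cookie", "jar", "and"]
bigrams = {"the": ["boy", "cookie"], "a": ["cookie"]}
for kind in ("deletion", "insertion", "substitution"):
    print(kind, add_errors(transcript, 0.4, kind, vocab, bigrams, rng))
```

Passing a seeded `random.Random` instance keeps the corrupted transcripts reproducible across runs, which is convenient when comparing error types at the same rate.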

Noise Addition
We perturb the features with Gaussian noise to mimic random sources of error. We do this in order to compare the effect of random noise with, and distinguish it from, the consequences of ASR errors. The modification is implemented by adding a random number to each extracted feature, where the mean of the number added to a given feature is zero and the standard deviation varies depending on the amount of noise we wish to add.
Noise of level k is added to an input X_i as follows:
$$\tilde{X}_{ij} = X_{ij} + \epsilon_{ij}, \qquad \epsilon_{ij} \sim \mathcal{N}\left(0, (k\sigma_j)^2\right)$$
where i is the sample number, j is the feature number, and the standard deviation of the added noise is k times the standard deviation σ_j of the original feature. Note that the standard deviation σ_j of each feature is calculated over all samples.
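The noise procedure described above can be sketched in a few lines. The function name `add_feature_noise` is hypothetical, the feature matrix is represented as a plain list of rows, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text does not specify which one is used.

```python
import random
import statistics


def add_feature_noise(X, k, rng):
    """Add zero-mean Gaussian noise to every feature value.

    X is a list of samples, each a list of feature values. The noise for
    feature j has standard deviation k times the standard deviation of
    feature j computed over all samples.
    """
    n_features = len(X[0])
    sigmas = [statistics.pstdev([row[j] for row in X]) for j in range(n_features)]
    return [
        [x + rng.gauss(0.0, k * sigmas[j]) for j, x in enumerate(row)]
        for row in X
    ]


rng = random.Random(0)
X = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0], [4.0, 10.0]]
print(add_feature_noise(X, 0.2, rng))
```

Because the noise is scaled per feature, a feature that is constant across samples (the second column here) receives no noise at any level k.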

Classification Model
All our experiments are based on predictions obtained from the Dual Feature Decomposition (DFD) model [17], a 3-layer neural network with two latent states: one predicting the probability that a given sample is healthy (HC) and another predicting the probability that a given sample is impaired (AD). We henceforth refer to these states as the HC component and the AD component. The model is optimized to separate the components and to classify the AD component as impaired for all AD samples and the HC component as healthy for all healthy samples. We use DFD for our experiments because prior work [17] has shown that this model works well for classification tasks on both cross-sectional and longitudinal data. We benchmark AD detection performance using this model on the two datasets, DB and FP.
Implementation: All 3 hidden layers of our network have 10 units each. We initialize all the weights in the network with He initialization [23] using a uniform distribution. We use the Adam optimizer [24] with an initial learning rate of 0.01. We report evaluation metrics with the model trained for a maximum of 25 epochs on DB, with early stopping to prevent overfitting.

Classification with Manual and ASR-generated Transcripts
We evaluate the performance of classifying speech samples into two classes, AD or healthy, using the DB dataset. We use 10-fold cross-validation, with folds constructed so that no subject's samples occur in both the training and the testing set of a fold. Note that the input to the model consists of 505 lexico-syntactic and acoustic features (see Sec. 3.1). Fig. 1 shows that deletion errors affect classification performance significantly more strongly than insertion and substitution errors do. A 40% deletion rate reduces the F1 score by more than 8%, while a 40% insertion rate reduces it by only 2.4% and a 40% substitution rate by 0.1%. These differences become even more pronounced as more errors are added. Paired t-tests reveal that the trajectory of the F1 score under varying levels of noise differs significantly from that under varying levels of deletion errors (t = −3.45, p < 0.05), but not from that under insertions (p = 0.84) or substitutions (p = 0.29). Insertion and substitution errors therefore influence classification performance in a way similar to random noise, whereas deletion errors have a significantly stronger effect. The correlation between the error level, ranging from 0% to 80% (0% denotes manual transcription), and the F1 score is likewise insignificant for insertions (p = 1) and substitutions (p = 0.50), but strong and significant for deletion errors (Spearman correlation test, ρ = −0.99, p < 0.001).
It is also interesting to note that the model using automatic transcripts from ASR retains a performance level of 73.97%, which is in line with the decrease in performance expected from the ASR's rate of deletion errors (48.14%).
[Table residue: deletion-affected features and their ranks. Recoverable entries: graph self-loops with 3 edges; lexical richness group (group rank 2): 'Particles' (7), POS 'Verb Past Tense' (16), POS 'Wh-pronoun' (17).]
The different effects of the error types on classification performance suggest that some features, extracted from the speech samples and used as input to the classification algorithm, are affected far more substantially by deletions than by any other type of error. This leads us to inspect the correlation between feature values and the amount of deletions.
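The subject-wise cross-validation used in this section requires that all samples from one subject fall into the same fold. A minimal sketch of one way to build such folds is given below; the function name `subject_folds` and the greedy size-balancing strategy are assumptions for illustration, and the actual fold assignment used in the experiments is not specified.

```python
def subject_folds(subject_ids, n_folds):
    """Assign each sample to a fold so that all samples of a subject share a fold.

    subject_ids holds one subject identifier per sample. Returns a list of
    fold indices, one per sample.
    """
    counts = {}
    for s in subject_ids:
        counts[s] = counts.get(s, 0) + 1
    fold_sizes = [0] * n_folds
    fold_of_subject = {}
    # Place subjects with the most samples first, each into the smallest fold so far.
    for s in sorted(counts, key=lambda s: (-counts[s], str(s))):
        f = fold_sizes.index(min(fold_sizes))
        fold_of_subject[s] = f
        fold_sizes[f] += counts[s]
    return [fold_of_subject[s] for s in subject_ids]


subjects = ["s1", "s1", "s2", "s3", "s3", "s3", "s4"]
folds = subject_folds(subjects, n_folds=2)
for test_fold in range(2):
    test = [i for i, f in enumerate(folds) if f == test_fold]
    train = [i for i, f in enumerate(folds) if f != test_fold]
    print(test_fold, train, test)
```

Grouping by subject matters here because many DB participants contribute several samples; splitting at the sample level would let the model see a test speaker during training.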

Distinctive Effects of Deletion Errors
We identify features that correlate more strongly with the amount of deletions than with the amount of insertions and substitutions. Tab. 5 summarizes these deletion-affected features and their relative ranking, where a higher rank denotes a higher absolute Spearman correlation, as well as the aggregated group ranking (all significant with p < 0.001). Note that this comparison is made across all 505 features.
We observe 17 features in total that distinctively correlate with deletions. Of these, a clear majority of 14 features (82.35% of all selected) are associated with syntactic complexity (production rules of a constituency parser) and discourse phenomena (graph self-loops with 3 edges), and 3 (17.6%) with lexical richness in speech. Other feature groups, such as acoustic features or those associated with word-finding difficulty, do not meet the required conditions. These results show that the syntactic structure of language is much more vulnerable to deletions than to other ASR errors. This can be explained by the fact that insertions and substitutions use words from the language model (i.e. the most probable words) for the modifications, which to some extent helps maintain basic syntactic rules and structure.
The correlation between the number of deletions and features of syntactic structure shows the vulnerability of the feature group representing syntactic complexity and discourse phenomena to ASR deletion errors. However, it does not by itself explain the decrease in classification performance when deletion errors are added. In Sec. 6 we inspect whether features of syntactic complexity are more influential in AD detection than other characteristics of speech.
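The selection criterion used in this section can be sketched as follows: compute the Spearman correlation between a feature's value and the error level for each error type, and keep the feature if its absolute correlation with deletions exceeds that with insertions and substitutions. The helper names and the toy feature values below are hypothetical, and the sketch omits the significance test.

```python
def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            result[order[pos]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def deletion_distinctive(rho_del, rho_ins, rho_sub):
    """True if the feature correlates most strongly with the deletion level."""
    return abs(rho_del) > max(abs(rho_ins), abs(rho_sub))


error_levels = [0, 20, 40, 60, 80]
# Toy values of one feature at each error level (illustrative only).
feature_under_deletions = [10.0, 8.0, 7.0, 3.0, 1.0]
feature_under_insertions = [10.0, 11.0, 9.0, 10.5, 9.5]
feature_under_substitutions = [10.0, 9.0, 10.5, 9.5, 10.2]
rho_del = spearman(error_levels, feature_under_deletions)
rho_ins = spearman(error_levels, feature_under_insertions)
rho_sub = spearman(error_levels, feature_under_substitutions)
print(rho_del, rho_ins, rho_sub, deletion_distinctive(rho_del, rho_ins, rho_sub))
```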

Model-based Analysis of Feature Importance
In order to quantify the importance of input features for classification, we obtain the gradient of the HC and AD components (see Sec. 3.3) with respect to the input feature values, to understand which features and corresponding groups contribute most to each component. We define feature importance for a given input X_i with respect to the two intermediate components in the DFD classification model as:
$$I^{label}_{ik} = \frac{\partial C^{label}(X_i)}{\partial X_{ik}}$$
where label refers to HC or AD, C^{label} is the corresponding component, k is a given feature (1 to D), and i is the sample index (1 to N) in the DB dataset. Additionally, to understand the influence of healthy samples and that of impaired samples separately on each of the components, we take the difference between the average importance over healthy samples and over AD samples per feature as:
$$\Delta I^{label}_{k} = \frac{1}{N_{hc}} \sum_{i \in HC} I^{label}_{ik} - \frac{1}{N_{ad}} \sum_{i \in AD} I^{label}_{ik}$$
where label = HC or AD, N_hc is the number of healthy samples, and N_ad is the number of impaired samples. Finally, importance values are scaled to the range (0,1) for easier comparison between components and better interpretability. Note that feature importance is analysed based on manual transcripts.
Fig. 2 visualises the importance of each feature group calculated from the gradient values and shows that features associated with syntactic complexity and discourse phenomena make up the largest group of features important for classification. This is equally true for both the HC and AD components, meaning that characteristics of syntactic language structure and discourse phenomena contribute the most to extracting AD-related specifics, as well as to describing healthy cognition. Moreover, the results in Tab. 4 show that the average normalised importance of the features associated with syntactic complexity and discourse is higher than the average importance of lexical richness features when the top-10 most important features across all groups are selected for comparison. To conclude, the feature group of syntactic complexity and discourse phenomena is the one most significantly and distinctively affected by deletion errors, as seen in Sec. 5. This also indicates why classification is affected significantly by deletion errors, tracing the effects from the initial step of adding artificial errors in different amounts through to the final predictions.
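The gradient-based importance described in this section can be sketched without a deep learning framework by approximating gradients with central finite differences. Everything below is an illustrative assumption: `toy_component` is a stand-in for one latent component of the model (not the DFD network), the helper names are hypothetical, and min-max scaling is one possible way to map the values to the unit range.

```python
def numerical_gradient(f, x, eps=1e-5):
    """Central finite-difference gradient of a scalar function f at point x."""
    grad = []
    for k in range(len(x)):
        up = list(x)
        up[k] += eps
        down = list(x)
        down[k] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def importance_difference(component, X, labels):
    """Per feature: mean gradient over HC samples minus mean gradient over AD samples."""
    n_features = len(X[0])
    sums = {"HC": [0.0] * n_features, "AD": [0.0] * n_features}
    counts = {"HC": 0, "AD": 0}
    for x, y in zip(X, labels):
        gradient = numerical_gradient(component, x)
        counts[y] += 1
        for k in range(n_features):
            sums[y][k] += gradient[k]
    return [
        sums["HC"][k] / counts["HC"] - sums["AD"][k] / counts["AD"]
        for k in range(n_features)
    ]


def min_max_scale(values):
    """Scale values linearly so that the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def toy_component(x):
    # Hypothetical component: gradient is (2 * x[0], 3).
    return x[0] ** 2 + 3 * x[1]


X = [[1.0, 0.5], [2.0, 0.1], [0.0, 0.7], [-1.0, 0.2]]
labels = ["HC", "HC", "AD", "AD"]
difference = importance_difference(toy_component, X, labels)
print(difference, min_max_scale(difference))
```

With an actual neural network, the same quantities would normally come from automatic differentiation rather than finite differences; the aggregation over healthy and impaired samples stays the same.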

Generalisability Across Datasets and Tasks
In order to test how well our conclusions generalise to a different dataset and a different task, we repeat the same experiments on a dataset of spontaneous speech, Famous People (FP). Speech in FP is elicited in interaction-based interviews with no protocol, and hence is completely unstructured, while speech in DB follows a well-defined protocol with clear instructions such as 'describe everything you see in this image', and hence is structured. Additionally, in DB the topic of speech is constrained by the stimulus image shown to the participant, while the FP dataset consists of free speech on unconstrained topics. We follow the same method as described in Sec. 3: extracting the features, obtaining the latent representations, and using leave-one-out cross-validation with a Mixed Effects Random Forest model [25] for FP.
As with the results obtained on DB data, deletion errors affect classification performance on FP the most. Furthermore, deletion errors differentiate the same feature group of syntactic complexity and discourse phenomena (see Tab. 5): on the FP dataset, 45 features correlate more strongly with deletions than with insertions or substitutions, with 73.33% of these features belonging to the aggregate group of syntactic complexity and discourse, and 26.67% to the group of lexical richness. The rank of the feature groups, based on the average absolute Spearman correlation of all the features included in each group, corresponds to the rank observed on the DB dataset, with a stronger significant correlation for the group of syntactic complexity than for lexical richness. Since we use a classification model trained on DB data to obtain the latent states, the gradient-based feature importances are the same as those obtained from the DB dataset (Fig. 2 and Tab. 4).
These results suggest that the features of syntactic complexity and discourse are potentially the most vulnerable to deletion errors and, at the same time, that this group of features is the most influential for AD/HC classification. Both trends potentially generalise to structured as well as unstructured spontaneous speech, and across different tasks.

Conclusions
In summary, we observe that simulated deletion errors have a particularly strong effect on classification performance, which can be traced back to their effect on syntactic complexity and discourse representations. This behaviour was studied on two highly varied datasets of speech for AD detection and was found to generalise across different tasks. With this observation in mind, our practical suggestion is to change the ASR optimization functions to reflect a higher penalty for deletion errors in order to improve AD detection performance.
In future work, we will focus on the optimisation of ASR performance and its effect on AD detection. We will also investigate how similar the simulated errors are to the true insertion, deletion and substitution errors produced by ASR.