Fantastic Features and Where to Find Them: Detecting Cognitive Impairment with a Subsequence Classification Guided Approach

Despite the widely reported success of embedding-based machine learning methods on natural language processing tasks, the use of more easily interpreted engineered features remains common in fields such as cognitive impairment (CI) detection. Manually engineering features from noisy text is time and resource consuming, and can potentially result in features that do not enhance model performance. To combat this, we describe a new approach to feature engineering that leverages sequential machine learning models and domain knowledge to predict which features help enhance performance. We provide a concrete example of this method on a standard data set of CI speech and demonstrate that CI classification accuracy improves by 2.3% over a strong baseline when using features produced by this method. This demonstration provides an example of how this method can be used to assist classification in fields where interpretability is important, such as health care.


Introduction
In recent years, word and sentence embedding-based methods have had a significant impact on the field of NLP (Devlin et al., 2019; Mikolov et al., 2013; Pennington et al., 2014; Di Palo and Parde, 2019). These approaches stand as an alternative to classical feature engineering approaches, where carefully crafted features, such as word length or part of speech tag, are extracted from text and used as input. Despite the promise of embedding-based methods, there are still several advantages to feature engineering. Most notably, using embeddings as input can lead to issues with interpretability (Heimerl and Gleicher, 2018; Hooker et al., 2019; Kindermans et al., 2017), which is especially important in a healthcare domain (Balagopalan et al., 2020). Meanwhile, feature engineering approaches directly lend themselves to easily interpretable models (Ribeiro et al., 2016). As such, feature engineering remains an important practice for fields such as health care, where interpretability is imperative. An extensive body of work has been produced where ML methods and engineered features have been applied to cognitive impairment (CI) detection (Balagopalan et al., 2018; Karlekar et al., 2018; Zhu et al., 2019).
In this work, we present a new feature engineering method that is guided by classifying subsets of a pause-centred speech sequence (subsequences), and inspired by literature suggesting that CI could be indicated by the words that subjects pause before (Calley et al., 2010; Mack et al., 2013; Seifart et al., 2018). This approach aims to extract pause-related information while minimizing the noise added from unrelated factors. This method generates interpretable and effective features, potentially saving time and resources spent on excess feature engineering. We validate this method by presenting a 2.3% accuracy increase over a strong baseline on CI vs healthy (HC) classification, matching the state of the art (Hernández-Domínguez et al., 2018).
In summary, our major contributions are:
• A method of classifying speech using only a token of interest and a small context around it, i.e. subsequence classification (Sec. 3.3).
• Validating this approach by showing that it aids in achieving classification results comparable to the state of the art (Sec. 5.2).

Related work
Several authors report increases in CI detection performance by extracting acoustic features such as filled and unfilled pause counts, as well as average pause duration (Tóth et al., 2015, 2018; Pistono [...]). [...] Dementiabank (Sec. 3.1) transcripts as CI or HC using an extended set of lexical features, which is, to the best of our knowledge, the state of the art (SOTA). We use their recorded performance as a benchmark when validating our approach.
Several authors have reported performance gains by using subsequences to aid with classification. These authors use subsequences only as a means to process full sequences (Phan et al., 2017), or they use the presence of common subsequences as a feature for longer text sequences (Iglesias et al., 2007;Kumar et al., 2005). To the best of our knowledge, no prior work describes using subsequence classification to guide feature engineering.

Experimental Method
In this section, we define the data sets and methodology used in our experimental framework.

Data Sets
Dementiabank (DB): Dementiabank is a large public data set of pathological speech (Becker et al., 1994), containing audio files and transcripts of participants describing the 'Cookie Theft' image. Transcripts are created manually by trained transcriptionists following the CHAT protocol (MacWhinney, 2014). Out of the 286 participants, 193 are diagnosed with some form of CI (CI; N = 321 transcripts) and 93 are healthy controls (HC; N = 229 transcripts). Transcripts receive a CI or HC label corresponding to whether the participant who produced the transcribed speech was cognitively impaired or not.
Subsequence-based Data Subsets: To conduct subsequence classification, we extract subsequences of varying length from each transcript.
Figure 1: Visualization of the difference between contexts and distances in a pause-focussed subsequence.
For each transcript in DB, each utterance containing a pause was extracted and labelled as positive if the sample contains CI speech, or negative otherwise. Subsequences were extracted from these utterances by taking the first one, two, or three speech tokens before and after each pause. We created three data subsets by including subsequences of at most one, two, or three tokens around the pause: Context 1 (DB-C1), Context 2 (DB-C2), and Context 3 (DB-C3), respectively. We also included one data subset consisting of the full utterances that contain pauses, DB-Utt (Tab. 1). Subsequences inherit the label of their source transcript: those extracted from HC transcripts are labelled HC, and those extracted from CI transcripts are labelled CI. Identical subsequences found in both classes were removed. We refer to the tokens that are next to the pause as Distance 1 (D1), the tokens that are one token away from the pause as Distance 2 (D2), and the tokens that are two tokens away from the pause as Distance 3 (D3). The differences between context and distance are shown in Fig. 1. For example, for the pause sequence "The boy is *uh* stealing a cookie", only the tokens "boy" and "a" would be considered the Distance 2 tokens, while the tokens "boy", "is", "*uh*", "stealing", and "a" would be considered the Context 2 tokens.
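The distinction between context and distance can be made concrete with a small sketch. The code below is illustrative only and not the extraction pipeline used in this work: the helper names (`is_pause`, `context_subsequence`, `distance_tokens`) are hypothetical, whitespace tokenization is assumed, and pauses are assumed to be marked with surrounding asterisks as in the example sentence.

```python
from typing import List


def is_pause(token: str) -> bool:
    """Hypothetical pause test: tokens wrapped in asterisks, e.g. '*uh*'."""
    return len(token) > 2 and token.startswith("*") and token.endswith("*")


def context_subsequence(tokens: List[str], pause_idx: int, context: int) -> List[str]:
    """Return the pause plus up to `context` tokens on each side (Context n)."""
    start = max(0, pause_idx - context)
    return tokens[start:pause_idx + context + 1]


def distance_tokens(tokens: List[str], pause_idx: int, distance: int) -> List[str]:
    """Return the tokens exactly `distance` positions from the pause (Distance n)."""
    found = []
    for idx in (pause_idx - distance, pause_idx + distance):
        if 0 <= idx < len(tokens):
            found.append(tokens[idx])
    return found


# Example sentence from the text, split on whitespace (an assumption).
utterance = "The boy is *uh* stealing a cookie".split()
pause_positions = [i for i, tok in enumerate(utterance) if is_pause(tok)]

for p in pause_positions:
    print("Context 2: ", context_subsequence(utterance, p, 2))
    print("Distance 2:", distance_tokens(utterance, p, 2))
```

Subsequences near the start or end of an utterance are simply truncated here, which is consistent with the "at most" wording above but is otherwise an assumption of this sketch.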

Feature Extraction
In this section, we describe how features are extracted on the transcript-level (for transcript classification) and on the token-level (for subsequence classification).
Transcript-Level Features: We extract over 500 linguistic and acoustic features from each transcript, such as part of speech counts and average word length (App. A). These features, referred to as the Original feature set, are used to provide a baseline to benchmark transcript-level classification performance. We also use the Original feature set as a base that we extend with newly engineered transcript-level features (Sec. 4). To produce an additional baseline, we perform feature selection on the Original feature set, and find that k = 85 features leads to the greatest performance.
Token-Level Features: In order to conduct subsequence classification, we extract features on the token-level for each of the subsequence-based data subsets. Of the Original feature set, we select a subset of features that have a clear token-level analogue (App. A). For instance, the transcript-level feature of average word length has the token-level analogue of individual word length. After feature extraction, each token is represented by a 23-dimensional input vector. Consequently, each subsequence in the DB-C1, DB-C2, DB-C3, and DB-Utt data subsets is represented as a T by 23 matrix, where T is the length of the sequence in tokens.

Classification
In this section, we describe the methodology used for subsequence and transcript classification. Subsequence classification is used to guide the engineering of new features, while transcript classification validates the new features' effectiveness.
Subsequence Classification: Our subsequence classification experiment involves performing 5-fold cross validation with each of the DB-C1, DB-C2, DB-C3, and DB-Utt data subsets. Subsequences are classified as either HC or CI. We conduct classification using GRU-based (Cho et al., 2014) models with an attention mechanism designed for document classification (Yang et al., 2016), with model parameters tuned for each of the data subsets. We report accuracy for M-C1, M-C2, M-C3, and M-Utt, the models that achieved the highest accuracy on DB-C1, DB-C2, DB-C3, and DB-Utt, respectively (additional training details provided in App. B).
Transcript Classification: We evaluate the efficacy of our feature engineering approach (Sec. 4) by performing transcript-level 10-fold cross validation with a variety of feature sets. Transcripts are classified as either HC or CI. We use the Original feature set, as well as the top 85 of the Original features, based on their ANOVA F-values, as baselines. Additionally, we extend the Original feature set with the k best features, based on ANOVA F-values, from each of the feature sets generated using our novel feature engineering approach (Sec. 4), separately. k is optimized for accuracy for each extending feature set separately. To classify DB transcripts, we use five ML models: an SVM, a gradient boosting ensemble, a 2-layer neural network (NN), a random forest, and an ensemble of the previous four models (Ens).
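As an illustration of extending a base feature set with the k best features ranked by ANOVA F-value, the following is a minimal pure-Python sketch. It is a stand-in for whatever library routine was actually used, which the text does not specify; the function names (`anova_f`, `top_k_features`) and the toy data are hypothetical and are not taken from DB.

```python
from statistics import mean
from typing import List, Sequence


def anova_f(values: Sequence[float], labels: Sequence[int]) -> float:
    """One-way ANOVA F-value of a single feature across class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    grand_mean = mean(values)
    n_samples, n_groups = len(values), len(groups)
    between_ss = 0.0
    within_ss = 0.0
    for group in groups.values():
        group_mean = mean(group)
        between_ss += len(group) * (group_mean - grand_mean) ** 2
        within_ss += sum((v - group_mean) ** 2 for v in group)
    between = between_ss / (n_groups - 1)
    within = within_ss / (n_samples - n_groups)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


def top_k_features(rows: List[List[float]], labels: Sequence[int], k: int) -> List[int]:
    """Indices of the k columns with the largest F-values, in column order."""
    n_features = len(rows[0])
    scores = [anova_f([r[j] for r in rows], labels) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy data (made up): 4 transcripts, 1 base feature, 3 candidate aggregate features.
labels = [0, 0, 1, 1]
original_rows = [[0.5], [0.4], [0.9], [0.8]]
aggregate_rows = [[1.0, 5.0, 0.2], [1.2, 3.0, 0.1], [3.0, 4.0, 0.3], [3.1, 4.5, 0.2]]

chosen = top_k_features(aggregate_rows, labels, k=2)
extended_rows = [o + [a[j] for j in chosen] for o, a in zip(original_rows, aggregate_rows)]
print(chosen)
print(extended_rows[0])
```

In practice k would be tuned per extending feature set, as described above. To avoid information leaking from held-out data, the ranking would typically be computed on the training portion of each fold only; whether that was done here is not stated.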
We report the accuracy (Acc), precision (Prec), sensitivity (Sens), and specificity (Spec) for the model that achieved the greatest cross validated accuracy for each feature set, separately (training details provided in App.B).
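For reference, the four reported metrics can be computed from the counts of a binary confusion matrix. The sketch below assumes CI is the positive class, in line with the labelling described in Sec. 3.1; the function name `binary_metrics` and the toy predictions are hypothetical.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> Dict[str, float]:
    """Accuracy, precision, sensitivity (recall), and specificity for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1

    def ratio(num: int, den: int) -> float:
        # Return 0.0 when a denominator is empty rather than raising.
        return num / den if den else 0.0

    return {
        "acc": ratio(tp + tn, tp + tn + fp + fn),
        "prec": ratio(tp, tp + fp),
        "sens": ratio(tp, tp + fn),
        "spec": ratio(tn, tn + fp),
    }


# Toy example (made up): 1 = CI, 0 = HC.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```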

Proposed Feature Engineering Approach
Our approach to engineering new transcript-level features involves three major steps:
1) Subsequences of varying length centred around a token of interest, in our case a pause, must be extracted from each of the input transcripts and grouped into subsets based on maximum length. Each of the tokens in these subsequences must have token-level features extracted. The token-level features, as well as the central token, should be chosen based on in-domain knowledge.
2) A sequential ML model must be cross validated on each of the subsequence data subsets from the previous step in a subsequence classification experiment. Here, we are specifically attempting to exploit the ability for sequential machine learning models to uncover patterns in sequential data. The mean cross validated accuracy on each of these length-based data subsets should be used as an indicator of how much distinguishing information can be extracted from tokens within the specified range of the pause.
3) Based on the recorded cross validated accuracies from the previous step, transcript-level aggregations of the token-level features must be created at various distances from the pause. We propose two methods of aggregating token-level features (DB-specific details provided in App. A):
• Continuous features can be aggregated simply by taking the average of a feature across each of the tokens. An example of this would be calculating the average word length for each of the tokens found at a specified distance from a pause.
• Categorical features can be aggregated using counts or ratios, such as the number of nouns occurring at a specified distance from a pause.
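The two aggregation methods above can be sketched as follows. This is an illustrative example only: the token representation (a dictionary with `word_length` and `pos` keys), the function names, and the toy tokens are assumptions, not the DB-specific implementation described in App. A.

```python
from statistics import mean
from typing import Dict, List, Tuple

Token = Dict[str, object]


def aggregate_continuous(tokens: List[Token], feature: str) -> float:
    """Average of a continuous token-level feature, e.g. word length."""
    values = [float(tok[feature]) for tok in tokens]
    return mean(values) if values else 0.0


def aggregate_categorical(tokens: List[Token], feature: str, category: str) -> Tuple[int, float]:
    """Count and ratio of tokens whose categorical feature equals `category`."""
    count = sum(1 for tok in tokens if tok[feature] == category)
    ratio = count / len(tokens) if tokens else 0.0
    return count, ratio


# Toy example (made up): all tokens found at D2 across one transcript's pauses.
d2_tokens = [
    {"word": "boy", "word_length": 3, "pos": "NOUN"},
    {"word": "a", "word_length": 1, "pos": "DET"},
    {"word": "cookie", "word_length": 6, "pos": "NOUN"},
]

avg_len_d2 = aggregate_continuous(d2_tokens, "word_length")
noun_count_d2, noun_ratio_d2 = aggregate_categorical(d2_tokens, "pos", "NOUN")
print(avg_len_d2, noun_count_d2, noun_ratio_d2)
```

Each call yields one transcript-level value, so repeating it per feature and per distance produces a feature set for that distance.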
These transcript-level aggregates should only be extracted for the distances that produced the greatest cross validated accuracy during subsequence classification, as the subsequence classification performance indicates that the features found in that range are the most distinguishing. For instance, if subsequences of up to two tokens around a pause produced the most accurate subsequence classifier, transcript-level aggregates should only be extracted for tokens at the D1 and D2 positions in reference to the pause, and not the D3 position.
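The selection rule in this paragraph can be expressed as a short sketch: pick the context length with the highest mean cross validated accuracy, then aggregate only at distances up to that length. The function name and the fold accuracies below are placeholders invented for illustration; they are not the results reported in this work.

```python
from statistics import mean
from typing import Dict, List, Tuple


def distances_to_aggregate(cv_accuracy_by_context: Dict[int, List[float]]) -> Tuple[int, List[int]]:
    """Return the best context length and the distances D1..Dn it implies."""
    mean_accuracy = {c: mean(accs) for c, accs in cv_accuracy_by_context.items()}
    best_context = max(mean_accuracy, key=mean_accuracy.get)
    return best_context, list(range(1, best_context + 1))


# Placeholder fold accuracies per context length (made up for illustration).
cv_accuracy_by_context = {
    1: [0.56, 0.58, 0.57, 0.55, 0.59],
    2: [0.60, 0.62, 0.61, 0.59, 0.63],
    3: [0.58, 0.57, 0.60, 0.59, 0.56],
}

best_context, distances = distances_to_aggregate(cv_accuracy_by_context)
print(best_context, distances)
```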
To validate this method, we create five transcript-level feature sets: features aggregated from tokens at the D1 position in reference to a pause (F-D1), features aggregated from the D2 position (F-D2), features aggregated from the D3 position (F-D3), the combination of F-D1, F-D2, and F-D3 (F-C3), and the combination of F-D1 and F-D2 (F-C2).

Results
In this section, we report the results for the subsequence and transcript classification experiments.

Subsequence Classification
After averaging across four random seeds, M-C2 achieved an accuracy of 60.7%, higher than M-C1, M-C3, or M-Utt (Tab. 3). This leads us to the conclusion that using features from the two tokens preceding and succeeding a pause could enhance transcript-level classification performance.

Transcript Classification
We create the F-D1, F-D2, and F-C2 aggregate feature sets, as the highest subsequence classification accuracy was achieved by a model trained on DB-C2. Additionally, in order to validate our claims, we create F-D3 and F-C3. The highest accuracy of 77.09% on transcript classification is achieved by an ensemble model that used the Original + F-D2 feature set (Tab. 2).
Using one of the four random seeds used to produce the average performance metrics presented in Tab. 2, the model using F-D2 features achieved an accuracy of 78.36%, on par with the single-seed SOTA accuracy of 78% (Hernández-Domínguez et al., 2018).

Discussion
As shown in Tab. 3 and Tab. 2, features from tokens within two tokens of a pause were the most effective in enhancing both subsequence and transcript classification. To determine how these two tasks are connected, we conduct a statistical analysis on the token-level and transcript-level features. Two-sided t-tests comparing features extracted from tokens at D1, D2, and D3 across the two classes show similar patterns, at both the token and transcript level, in which features differ significantly between classes. Larger concentrations of distinguishing features are found at D1 and D2 than at D3. This could explain the effectiveness of features from the D2 position in both tasks (Tab. 4).
However, this pattern congruity does not explain why F-D3 features on their own are more effective than F-D1 features on their own. The trend that both the F-D3 and F-C3 feature sets produced greater transcript-level accuracy than the F-D1 feature set, and lower transcript-level accuracy than the F-D2 and F-C2 feature sets, is the same as the trend for the subsequence classification results reported in Tab. 3. This indicates that subsequence classification may be able to provide better insight into potential transcript classification performance than traditional statistical testing.
It is important to consider the implications of producing a model with the F-D2 feature set that achieved significantly higher accuracy than the most accurate model produced with the F-C3 feature set. As described in Section 3.3, we perform feature selection using ANOVA F-values for each of the aggregate feature sets. Since F-D2 is a subset of F-C3, this implies that this more traditional feature selection method did not select a group of features from F-C3 that was more effective than the features from F-D2, even though it was able to select any and all of the features in F-D2. This serves as a testament to our feature engineering method, as it demonstrates that even popular feature selection methods are not able to completely remove the negative effects of engineering an excessive number of ineffective features.
Following several other works that used the DB data set (Hernández-Domínguez et al., 2018; Pou-Prom and Rudzicz, 2018; Sarawgi et al., 2020), all of our experiments are conducted with K-fold cross validation. While the small size of the DB data set helps to justify this as a validation procedure, optimizing a cross validated performance metric (accuracy, F1, etc.) may cause results obtained with K-fold cross validation to overestimate generalization performance.
DB-C2 produced a more accurate subsequence classifier than any other data subset of DB. This suggests that the class distinguishing signal from the pause is strongest within a two token radius around the pause. Beyond that radius, the signal may be obstructed by noise from other patterns in speech. However, in different data sets, a different subsequence length may present the strongest, least noisy signal. New aggregate features should be created for tokens within whichever range produces the best subsequence classification performance.
However, our results do indicate that there is a strong link between how well features from certain token positions contribute to both subsequence and transcript classification. This may relate to the effect of noise on those token positions, which we use subsequence classification to identify.

Conclusion and Future Work
In this work, we present two principal contributions. First, we describe a novel method for speech classification, subsequence classification, in which speech is modelled as a token of interest, such as a pause, along with surrounding tokens of context. Second, we demonstrate how subsequence classification can be used to engineer features that extract distinguishing information while minimizing added noise, and consequently match SOTA performance on a standard data set of CI speech.
Future work should be done to understand why certain context lengths are more conducive to subsequence classification than others, and when that performance can transfer to effective transcript-level classification. Finally, additional work should be done to develop techniques for finding tokens of interest, such as pauses, that can be exploited using our feature engineering technique.