Dissociating Semantic and Phonemic Search Strategies in the Phonemic Verbal Fluency Task in early Dementia

Effective management of dementia hinges on timely detection and precise diagnosis of the underlying cause of the syndrome at an early mild cognitive impairment (MCI) stage. Verbal fluency tasks are among the most often applied tests for early dementia detection due to their efficiency and ease of use. In these tasks, participants are asked to produce as many words as possible belonging to either a semantic category (SVF task) or a phonemic category (PVF task). Even though SVF and PVF share neurocognitive function profiles, the PVF is typically believed to be less sensitive to MCI-related cognitive impairment, and recent research on fine-grained automatic evaluation of VF tasks has mainly focused on the SVF. Contrary to this belief, we show that applying state-of-the-art semantic and phonemic distance metrics in the automatic analysis of PVF word productions allows in-depth conclusions about the production strategy of MCI patients. Our results reveal a dissociation between semantically and phonemically guided search processes in the PVF. Specifically, we show that subjects with MCI rely less on semantic and more on phonemic processes to guide their word production than healthy controls (HC). We further show that semantic similarity-based features improve automatic MCI versus HC classification by 29 points in AUC over previous approaches for the PVF. As such, these results point towards the yet underexplored utility of the PVF for in-depth assessment of cognition in MCI.


Introduction
Dementia is a syndrome primarily presenting with broad cognitive impairments. Multiple underlying causes can result in dementia, such as Alzheimer's Disease (AD), fronto-temporal lobar degeneration, or focal lesions (MacPherson et al., 2016). These sub-forms have different neurocognitive profiles. The most common form, AD-related dementia, is typically driven by an amnestic cognitive impairment (Kidd, 2008), whereas fronto-temporal dementia is often associated with executive function impairment (Huey et al., 2009).
Early identification of dementia as well as precise differentiation between dementia sub-forms is crucial for effective management of the syndrome (Thyrian et al., 2016). Pairing high diagnostic sensitivity with ease of use, verbal fluency tests (VF) are amongst the most-applied tests in cognitive assessment of dementia (Troyer et al., 1997). In these tests, participants are asked to produce as many words from a specific category as they can in a fixed time. The two main variants of VF tests are the semantic verbal fluency (SVF) and the phonemic verbal fluency (PVF). In the SVF, the word category is defined by semantics (e.g. all animal words), whereas in the PVF participants need to produce words starting with a specific letter (e.g. "S"). Traditionally, test scores are computed by counting the number of correctly named words within the given time (Gomez and White, 2006). Although both VF variants are quite similar in the way they engage different neurocognitive functions, the cognitive strategies of the task can indicate different patterns of the underlying neuropathology. For instance, an SVF impairment is often only regarded as evidence for amnestic dementia (Vaughan et al., 2016;Teng et al., 2013) whereas a PVF impairment is almost exclusively regarded as evidence for fronto-temporal dementias (Dubois et al., 2000).
Recently, advanced Natural Language Processing (NLP) techniques have been applied to allow for in-depth analysis of the produced word sequence in VF tasks, particularly for the SVF (Linz et al., 2017a; Kim et al., 2019; Diaz-Orueta et al., 2020; Zemla et al., 2020). By extracting clusters from the produced word sequence and by modelling the semantic relationships between and within these clusters, it is possible to disentangle the effects of memory impairment from the effects of executive function impairment. Despite the success of these qualitative features in the SVF, their utility for automatic analysis of the PVF remains underexplored.
In this paper, we investigate both phonemic and semantic motivations for the underlying strategy of the phonemic verbal fluency task, and thereby reduce the gap between clinical theory and computational approaches to evaluating cognitive speech tasks. By contrasting semantic and phonemic distance measures in an analysis based on time bins, we show a dissociation between semantically and phonemically guided search processes: subjects with mild cognitive impairment (MCI) exhibit significantly less semantic similarity in their productions than healthy controls (HC). Finally, in experiments on automatic classification of MCI vs. HC from PVF word productions, we show that semantic features improve over previous approaches by 29 points in AUC. Taken together, our results pave the way towards more fine-grained analysis of the PVF task that can help to improve clinical decision processes.
VF tasks are used to assess semantic memory and executive functions, as good VF performance hinges on intact semantic memory stores as well as the ability to access these memory stores (Chertkow and Bub, 1990; Hodges et al., 1992; Mueller et al., 2015). Executive functioning, specifically working memory, is thought to allow a person to effectively search through phonological and semantic stores while regulating and adapting the search strategy to produce more words over the task (Faust, 2012; Rende et al., 2002; Troyer et al., 1997; Rosen, 1980). Both PVF and SVF are hypothesised to span multiple overlapping cognitive abilities: executive, verbal, and attention abilities (Mueller et al., 2015; Li et al., 2017; Shao et al., 2014; Schmidt et al., 2017). However, there is evidence that each task measures a set of distinct cognitive processes.
PVF burdens executive resources whereas the SVF demands linguistic-conceptual knowledge (Thompson-Schill et al., 1997; Vigneau et al., 2006; Shao et al., 2014; Mueller et al., 2015; Schmidt et al., 2017; Birn et al., 2010). SVF is theorized to engage the temporal lobe for lexical-semantic access and retrieval from the semantic store (Newcombe, 1969; Mueller et al., 2015; Cerhan et al., 2002), whereas the PVF is thought to rely on executive functioning and prefrontal lobe processes (Mueller et al., 2015) as well as phonological and orthographic cues for word retrieval (Li et al., 2017; Clark et al., 2013). Generally, it is hypothesised that SVF requires both semantic and retrieval processes whereas PVF relies only on retrieval processes (Fisher et al., 2004). However, there is conflicting research suggesting that PVF taps into the semantic network, although to a lesser extent than semantic fluency (Lezak et al., 2004; Mueller et al., 2015; Schmidt et al., 2017; Clark et al., 2013). Bizzozero et al. (2013) investigated the extent to which SVF and PVF were related to semantic and attention processes and found evidence of semantic processes in both SVF and PVF. Nutter-Upham et al. (2008) observed a larger effect size for the amnestic MCI (aMCI) group's deficit on semantic verbal fluency (Cohen's d=0.98) than for their deficit on phonemic verbal fluency (Cohen's d=0.66), due to greater variability in phonemic verbal fluency performance. Therefore, an alternative interpretation is that their findings actually do reflect a preferential deficit on semantic verbal fluency in aMCI. Supporting these findings, imaging studies combined with factor analysis have also suggested that the PVF task relies on both semantic and phonemic processes (Schmidt et al., 2017; Clark et al., 2013).

VF for Diagnosis
Both the phonemic and semantic varieties of verbal fluency are commonly used to diagnose and monitor cognitive decline such as mild cognitive impairment (MCI) and Alzheimer's Disease and Related Dementias (ADRD) (Marra et al., 2011; Clark et al., 2009; Gomez and White, 2006; Troyer et al., 1998).
SVF has been found to be more impaired than PVF in ADRD (Cerhan et al., 2002; Barr and Brandt, 1996; Zhao et al., 2013), and deficits in both semantic and phonemic memory have been reported. However, there is conflicting research for PVF and SVF in the MCI group. For aMCI, some studies report that only the SVF shows impairment (Hodges, 2006; Murphy et al., 2006; Teng et al., 2013), while other studies show decline on both the PVF and SVF tasks for MCI (Mueller et al., 2015; Vita et al., 2014; Nutter-Upham et al., 2008). Rinehardt et al. (2014) compared controls with aMCI, non-aMCI and AD and found that both MCI groups were less impaired on the SVF than the PVF, behaving more like controls than the AD group. Clark et al. (2013) considered computationally based phonemic and semantic measures when analyzing the PVF and SVF tasks in relation to gray matter correlates for HC, MCI and AD. They concluded that both tasks showed greater semantic than phonemic motivation, even in the PVF task.
PVF may be a sensitive test for investigating phonemic and semantic processes but a global word count does not provide the in-depth information needed to understand the underlying cognitive processes (Gomez and White, 2006;Becker and Salles, 2016). In this paper, we apply recently developed automatic analysis techniques from computational linguistics to the PVF to obtain a better insight into the degradation of semantic and phonemic processes.

Analyzing Semantic and Phonemic Strategy for VF
Several modes of analysis have been proposed with the goal of observing the role that different cognitive strategies play throughout VF tasks. Much work has been done on the semantic variety of verbal fluency, specifically for the animal category. Troyer et al. (1997) introduced a semantically motivated hierarchical list of animals for determining semantic clusters. To overcome this time-intensive and subjective annotation process, previous research worked on automatically producing semantic clusters over SVF productions (Ryan, 2013; Pakhomov et al., 2015b, 2016; Linz et al., 2017b; König et al., 2018; Kim et al., 2019). For example, Pakhomov et al. (2015a) compared traditional and novel computational methods of evaluating SVF using medical imaging techniques between healthy and cognitively impaired individuals. The semantic relatedness of words was determined using latent semantic analysis of word co-occurrences from a large online corpus. This study showed that computational methods of evaluating the SVF were beneficial in understanding the relationships between the different cognitive processes.
Building off of this, Linz et al. (2017a) used neural word embeddings as a data-driven way to model semantic clustering in the SVF task. König et al. (2018) showed high correlations (r = 0.9) between automatically extracted clustering and switching features and clinical methods. From these clusters, several features including cluster size or number of switches between clusters were calculated to reflect cognitive processes (Linz et al., 2017a;König et al., 2018).
In addition to the SVF, Troyer et al. (1997) proposed a rule-based method for finding phonemically related clusters of words in PVF productions. Lindsay et al. (2019) automated this rule-based method for determining phonemic clusters and proposed three additional phonemic similarity metrics for evaluating the PVF task on healthy German students, namely the Levenshtein distance (LD), the phonemically weighted Levenshtein distance (PHON-LD), and the position-weighted Levenshtein distance (POS-LD). Clark et al. (2013) proposed another phonemic distance measure using an English pronouncing dictionary and a formula for measuring string overlap to estimate the phonemic relatedness of adjacent words over the task.
Recently, a binning-based approach (Fernaeus et al., 2008) was considered for the automatic analysis of the SVF. In this approach, features were calculated separately on non-overlapping 10-second time bins, which allowed a deeper investigation into the evolution of a participant's production strategy over time. Temporal binning was used to analyse at what points in time during SVF word production HC differed from MCI and AD patients with respect to word count, transition length, and word frequency.
To conclude, while previous works introduced metrics for quantifying semantic as well as phonemic similarity in VF word productions, no comprehensive comparison of these metrics was performed on the PVF in a clinical setting. This leaves a gap between clinical theory of motivating cognitive strategies and computational methods as to how to automatically evaluate both phonemic and semantic strategy for the PVF task. To allow for a fine-grained analysis of production strategy over the course of the PVF task, we analyze semantic and phonemic distance metrics in the temporal binning framework.

PVF-based MCI Classification
Compared to the amount of work on HC versus MCI classification from the SVF (Linz et al., 2017a; König et al., 2018), considerably fewer studies have investigated this classification task using the PVF (Ryan, 2013; Lindsay et al., 2020). Ryan (2013) used logistic regression to classify between HC and MCI using only repetitions (AUC=0.53) and word count (AUC=0.5) from the PVF. Lindsay et al. (2020) reported a baseline PVF experiment between HC and MCI with an AUC of 0.75 using only word count on a very small dataset (8 HC/19 MCI). Additional temporal features lowered the classification performance (AUC=0.55). To the best of our knowledge, no study has so far investigated HC versus MCI classification with the PVF using phonemic and semantic measures.

Data
The data used in this research was collected during the Dem@Care (Karakostas et al., 2017) and ELEMENT (Tröger et al., 2017) projects. Participants were recruited through the Memory Clinic located in Nice University Hospital at the Institute Claude Pompidou in Nice, France. The study was approved by the Nice Ethics Committee. All participants were native speakers of French and asked to give informed consent before participating in the study. The French data was collected in the form of speech recordings via an automated recording application installed on a tablet computer. The recordings were manually transcribed in PRAAT (Boersma and Weenink, 2009) according to the CHAT protocol (MacWhinney, 1991). Participants were asked to complete a battery of cognitive tests, including a 60 second phonemic verbal fluency task for the letter category F. Demographics for the data used are displayed in Table 1. A Mann-Whitney U test was conducted between the HC and MCI populations to check for significant differences between age (W = 1106, p-value = 0.40) and education (W = 1492, p-value = 0.08) but none were found.

Binning, Clustering & Global Resolutions of VF Analysis
We look at three resolutions of the verbal fluency task that have been applied to the SVF task and consider them for the PVF task: temporal binning, clustering and switching, and global features. Each method provides a different resolution for looking at word retrieval strategy. Temporal binning (Fernaeus et al., 2008) gives the finest resolution of strategy. Clustering is motivated by clinical theory to investigate the different cognitive processes (Troyer et al., 1998). Global features are the current norm in clinical practice (Troyer et al., 1998; Gomez and White, 2006).

Binning Methods
To produce temporal bins for the PVF, we follow a methodology that was previously used for the SVF. The complete 60-second PVF response is split into six 10-second bins. This produces a new resolution of the task from which we can then compute features. As in that earlier work, we include the word count as well as the average temporal distance (TD) between consecutive words. In addition, we include the average semantic distance between consecutive words as well as the averages of the three phonemic distance measures LD, PHON-LD, and POS-LD. This allows for a separate investigation of the phonemic, semantic and temporal measures that guide search processes during the span of the word production in the PVF task.
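The binning step can be sketched as follows. This is a minimal illustration, not the original implementation: the `Word` tuple and both function names are hypothetical, words are assigned to a bin by their onset time (an assumption), and the temporal distance is averaged only over consecutive word pairs that fall in the same bin (also an assumption).

```python
from typing import List, NamedTuple, Optional


class Word(NamedTuple):
    text: str
    onset: float   # seconds from the start of the task
    offset: float  # seconds from the start of the task


def bin_words(words: List[Word], task_len: float = 60.0,
              bin_len: float = 10.0) -> List[List[Word]]:
    """Assign each word to a non-overlapping time bin by its onset (assumption)."""
    n_bins = int(task_len // bin_len)
    bins: List[List[Word]] = [[] for _ in range(n_bins)]
    for w in words:
        idx = int(w.onset // bin_len)
        if 0 <= idx < n_bins:  # words starting outside the task window are ignored
            bins[idx].append(w)
    return bins


def mean_temporal_distance(bin_: List[Word]) -> Optional[float]:
    """Mean gap between the end of one word and the onset of the next, within a bin."""
    gaps = [b.onset - a.offset for a, b in zip(bin_, bin_[1:])]
    return sum(gaps) / len(gaps) if gaps else None


# Invented toy production for the letter F
words = [Word("fleur", 1.0, 1.5), Word("feu", 3.0, 3.4), Word("fromage", 12.0, 12.8)]
bins = bin_words(words)
word_counts = [len(b) for b in bins]
```

The other per-bin features (semantic and phonemic distances) would be averaged over the same within-bin word pairs.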
Semantic Distance (SD) We follow Linz et al. (2017a) and compute the semantic distance between two words as the cosine distance between their embedding vectors, where cosine distance = 1 − cosine similarity. To construct word embeddings, FastText models (Bojanowski et al., 2016) are used.
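As a small worked example of this definition, the sketch below computes the cosine distance and its average over consecutive words. The function names are hypothetical and the vectors are invented toy values standing in for real word embeddings.

```python
import math
from typing import List, Optional, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine distance = 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def mean_consecutive_distance(vectors: List[Sequence[float]]) -> Optional[float]:
    """Average cosine distance between each word and the word that follows it."""
    dists = [cosine_distance(a, b) for a, b in zip(vectors, vectors[1:])]
    return sum(dists) / len(dists) if dists else None


# Toy 2-dimensional "embeddings" (invented for illustration)
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
avg_sd = mean_consecutive_distance(toy_vectors)
```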
Levenshtein Distance (LD) Lindsay et al. (2019) used the Levenshtein distance as a measure of phonetic distance when evaluating the PVF task. They first phonetically transliterate each word using the Python package epitran (Mortensen et al., 2018). They then use the traditional Levenshtein distance to measure the number of edits (insertions, substitutions and deletions) between consecutive words (Levenshtein, 1966). They also proposed two weighted variants of LD, as described below.
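The unweighted Levenshtein distance between two phoneme sequences can be sketched with the standard dynamic-programming recurrence. The function name is hypothetical, and the phoneme lists are hand-written illustrative transcriptions rather than output of the transliteration package.

```python
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Illustrative transcriptions of two French F-words
fleur = ["f", "l", "œ", "ʁ"]
fleuve = ["f", "l", "œ", "v"]
ld = levenshtein(fleur, fleuve)
```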
Phonemic-weighted Levenshtein Distance (PHON-LD) In addition to LD, Lindsay et al. (2019) proposed a phonemically weighted version of the Levenshtein distance. Using the epitran package, each phoneme has a corresponding 21-length phonological vector that represents the characteristics of the sound (e.g. voiced/unvoiced, front/back). When computing the Levenshtein distance, they weighted substitutions as the cosine between the two phonological vectors. Insertions and deletions are still valued at 1.
Temporal Distance (TD) The temporal distance is defined as the time in seconds between the boundaries of consecutive words in the PVF production.
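Returning to PHON-LD, the weighting scheme can be sketched by swapping the fixed substitution cost of the Levenshtein recurrence for a feature-based one. Several things here are assumptions made for illustration: the substitution cost is taken to be the cosine distance (so identical phonemes cost 0), the function names are hypothetical, and the 3-dimensional feature vectors are invented toy values, not the 21-length phonological vectors mentioned above.

```python
import math
from typing import Dict, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def phon_weighted_levenshtein(a: Sequence[str], b: Sequence[str],
                              features: Dict[str, Sequence[float]]) -> float:
    """Levenshtein distance with feature-based substitution costs (assumed: cosine distance)."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [float(i)]
        for j, y in enumerate(b, start=1):
            sub = 0.0 if x == y else cosine_distance(features[x], features[y])
            curr.append(min(prev[j] + 1.0,       # deletion, still valued at 1
                            curr[j - 1] + 1.0,   # insertion, still valued at 1
                            prev[j - 1] + sub))  # weighted substitution
        prev = curr
    return prev[-1]


# Invented toy feature vectors; "f" and "v" differ in one feature only
TOY_FEATURES = {"f": [1.0, 0.0, 1.0], "v": [1.0, 1.0, 1.0], "a": [0.0, 1.0, 1.0]}
d_fv = phon_weighted_levenshtein(["f", "a"], ["v", "a"], TOY_FEATURES)
```

Under this scheme, substituting similar sounds costs less than a full edit, so phonemically close word pairs receive a smaller distance than under plain LD.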

Clustering Methods
Clustering-based approaches for VF evaluation consist of two steps. First, the produced word sequence is partitioned into a set of clusters. Second, features (e.g. mean cluster size) are computed from the automatically produced clusters. In this study, we consider a rule-based phonemic clustering as well as an automated version of semantic clustering, and temporal clustering to investigate production. For both the phonemic and semantic clustering types, the mean cluster size and number of switches are computed.
Phonemic Clustering In the case of phonemic clustering features, we determine clusters in the word sequence following the phonemically motivated clinical approach from Troyer et al. (1997), which has since been automated. This approach uses phonemic similarity rules to determine whether subsequent words belong to the same cluster or not.
Semantic Clustering Semantic Clusters are determined as in Linz et al. (2017a). Using the semantic distance method described previously, a semantic threshold is determined for each participant by averaging the semantic distance between all words in the production. If the semantic distance between consecutive words is lower than the threshold, the words are said to be in a cluster. If the semantic distance between consecutive words is greater than the threshold, this introduces a cluster boundary.
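The thresholding procedure can be sketched as follows. This is an illustration under stated assumptions: the per-participant threshold is taken as the mean cosine distance over all unordered word pairs, a distance exactly equal to the threshold is treated as a cluster boundary, the function names are hypothetical, and the 2-dimensional vectors are invented stand-ins for real embeddings.

```python
import math
from itertools import combinations
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def semantic_clusters(vectors: List[Sequence[float]]) -> List[List[int]]:
    """Partition word indices into clusters using a participant-specific threshold."""
    n = len(vectors)
    if n == 0:
        return []
    if n == 1:
        return [[0]]
    pair_dists = [cosine_distance(vectors[i], vectors[j])
                  for i, j in combinations(range(n), 2)]
    threshold = sum(pair_dists) / len(pair_dists)
    clusters = [[0]]
    for i in range(1, n):
        if cosine_distance(vectors[i - 1], vectors[i]) < threshold:
            clusters[-1].append(i)  # close to the previous word: same cluster
        else:
            clusters.append([i])    # cluster boundary (a switch)
    return clusters


toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = semantic_clusters(toy_vectors)
n_switches = len(clusters) - 1
mean_cluster_size = sum(len(c) for c in clusters) / len(clusters)
```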
To obtain semantic word embeddings, the pretrained French fastText model (https://fasttext.cc/docs/en/crawl-vectors.html) is used. This model is trained on Common Crawl and Wikipedia corpora using the continuous bag of words (CBOW) algorithm with a negative sampling loss function. FastText models are trained at the character level using a character n-gram model. The 300-dimension

Global Features
In addition to the binning features and clustering features (Section 4.2.2), we include the traditional way of evaluating verbal fluency tasks, which computes aggregate features for the whole 60-second word production. For an overview of all features used, please see Appendix A. The most general and widely adopted measures of verbal fluency are the word count and repetition count (Spreen et al., 1991; Tombaugh et al., 1999). The word count is the count of all relevant words produced (e.g. all words that start with the letter F), excluding repeated words. The repetition count is the number of words produced more than once.
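A minimal sketch of these two global features is given below. It assumes that relevance means starting with the target letter and that the repetition count is the number of distinct words said more than once; the function name and the example production are invented for illustration.

```python
from collections import Counter
from typing import Dict, List


def global_features(words: List[str], letter: str = "f") -> Dict[str, int]:
    """Word count (unique relevant words) and repetition count for one production."""
    relevant = [w.lower() for w in words if w.lower().startswith(letter.lower())]
    counts = Counter(relevant)
    return {
        "word_count": len(counts),  # repeated words are counted once
        "repetition_count": sum(1 for c in counts.values() if c > 1),
    }


production = ["fleur", "feu", "fleur", "maison", "Fromage"]
features = global_features(production)
```

In this toy production, "maison" is excluded as irrelevant and "fleur" is the only repeated word.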

Experiments
Statistical analysis was done in RStudio (R Core Team, 2017). All coding experiments are implemented in Python 3.7. For significance testing, the non-parametric Mann-Whitney U test is used throughout.

Comparing Strategic Processes With Binning Methods
To visualize the strategic process over the duration of the PVF task, we plot the group averages of each feature across the bins. For overall performance, we plot the average word count and transition time by bin. To investigate semantic processes, we plot the semantic distance between the words in each bin. To investigate the phonemic measures, we plot the LD, PHON-LD, and POS-LD. In addition, we compute the bin average and standard error (se) for each group over all distance measures. A non-parametric Mann-Whitney U test is reported to see whether the bin averages differ between groups.
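To make the group comparison concrete, the sketch below computes the Mann-Whitney U statistic for the first of two samples from average ranks. It is only an illustration of the statistic: the function name is hypothetical, the values are invented, and the p-value would in practice come from a statistics package.

```python
from typing import Sequence


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> float:
    """U statistic of sample x: rank sum of x minus n_x * (n_x + 1) / 2."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1  # pooled[i:j] is a group of tied values
        avg_rank = (i + 1 + j) / 2.0  # mean of the ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n_x = len(x)
    return rank_sum_x - n_x * (n_x + 1) / 2.0


# Invented per-participant bin averages for two groups
hc_values = [0.62, 0.58, 0.66]
mci_values = [0.71, 0.75, 0.69]
u_hc = mann_whitney_u(hc_values, mci_values)
```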

Classification Experiments
The classification models are created using the scikit-learn library (Pedregosa et al., 2011).
For the classification application of these features, we focused on an early diagnostic scenario: distinguishing between healthy controls and mild cognitive impairment. To observe how age and education bias our classifier, we trained individual models on each potential bias (Nogueira et al., 2016; Petti et al., 2020). For the clinical baseline, a model was trained on word count only (Lindsay et al., 2020). To compare to previous work, a model was trained on the number of repetitions (Ryan, 2013).
In addition to the baseline comparison experiments, we investigated individual and combined models. Four individual models were built using the features for semantic clustering, semantic binning, phonemic clustering or phonemic binning.
To investigate the proposed analysis modes and cognitive strategies, we built four combined models; all binning features (binning), all clustering features (clustering), all semantic features (semantic), and all phonemic features (phonemic).
Finally, we investigate a model using all features (All) and compare the model's performance to the proposed baselines.
Classification Specifications To compare these methods, the extremely randomized trees (also known as extra trees) algorithm is used to train a classifier for each experimental scenario. This algorithm was chosen for its ability to reduce variance and its lower likelihood of overfitting on a relatively small dataset with high dimensionality. Due to the limited amount of data available (34 HC/48 MCI), training-testing splits were created using leave-one-out cross-validation to maximize the amount of training data while still testing on every available data point. Because of the randomness of the chosen algorithm, performance metrics can fluctuate between runs. To mitigate the effect of random initialization, each experiment is repeated 50 times. For each model, the Area Under the Receiver Operating Characteristic Curve (AUC) is averaged over the 50 iterations and reported.
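The evaluation protocol can be sketched with scikit-learn as follows. This is not the original experimental code: the helper name `loocv_auc` is hypothetical, the hyperparameters are assumptions, the data are synthetic, and computing one AUC per repeat from the pooled held-out probabilities is an assumed way of combining leave-one-out predictions.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut


def loocv_auc(X: np.ndarray, y: np.ndarray, n_repeats: int = 50,
              n_estimators: int = 100) -> float:
    """Mean AUC over repeated leave-one-out runs of an extra trees classifier."""
    aucs = []
    for seed in range(n_repeats):
        scores = np.zeros(len(y))
        for train_idx, test_idx in LeaveOneOut().split(X):
            clf = ExtraTreesClassifier(n_estimators=n_estimators, random_state=seed)
            clf.fit(X[train_idx], y[train_idx])
            # probability of the positive class (label 1) for the held-out sample
            scores[test_idx] = clf.predict_proba(X[test_idx])[:, 1]
        aucs.append(roc_auc_score(y, scores))
    return float(np.mean(aucs))


# Synthetic stand-in for a feature matrix (rows: participants, columns: features)
rng = np.random.default_rng(0)
y_demo = np.array([0] * 10 + [1] * 10)
X_demo = rng.normal(size=(20, 4)) + y_demo[:, None]  # shift the positive class
auc_demo = loocv_auc(X_demo, y_demo, n_repeats=3, n_estimators=25)
```

The demo uses few repeats and small forests only to keep the run short; the 50 repetitions described above correspond to the default `n_repeats=50`.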

Results
Results from the experiments to investigate strategic processes, as described in Section 4.4.1, are visualized in Figure 1. Significance tests between the HC and MCI groups are given in Table 2.

Strategic Processes
For all binning features, excluding word count, a lower average bin distance represents a higher similarity between adjacent words. Compared to the HC group, the MCI group has a lower average word count and produces words that are less semantically and more phonemically related. They also have longer transition times. The MCI group also shows significantly smaller phonemic clusters (p=0.03) and a lower number of semantic switches (p=0.01).

Classification results
To reduce the complexity of Figure 2, baseline and combined classifications are visualized with ROC-AUC curves and additional classification experiments are reported in the text of this section.
Both the age (AUC=0.41) and education (AUC=0.24) models perform below chance. The most common clinical evaluation, word count, performs at chance (AUC=0.50). The model trained using all features (AUC=0.71) proposed in this study improves over all baselines including the previous Ryan (2013) model (AUC=0.42) by 29 points.
In experiments not shown in Figure 2, we compare each of the semantic and phonemic processes in combination with the binning and clustering methods. Semantic features achieve similar performance with clustering (AUC=0.61) and with binning (AUC=0.64), whereas phonemic features perform best when combined with the binning methods (AUC=0.70) but poorly for clustering (AUC=0.45).
As shown in Figure 2, the combined binning methods (AUC=0.67) perform similarly to the combined clustering methods (AUC=0.64). The combined phonemic features (AUC=0.76) perform the best overall for the early diagnostic classification scenario.

Discussion
The phonemic verbal fluency task remains underexplored in its use for clinical assessment as well as research of MCI.
However, in this paper we show that, with state-of-the-art semantic as well as phonemic distance metrics, the PVF can reveal neurocognitive function involvement that is crucial to better assess MCI. Our data show that with recent semantic and phonemic similarity metrics, we can capture MCI-related impairments, such as a general semantic impairment, that have also been reported in the SVF (Verma and Howard, 2012; Taler and Phillips, 2008) but not in the PVF. Our results show significantly lower semantic distance for HC responses when compared to the MCI group in the PVF task, which is, by nature, phonemically motivated. In turn, MCI patients show significantly lower phonemic distance. This could possibly be explained by the MCI group relying heavily on a phonemic strategy to guide their search rather than utilizing a semantic strategy. The higher semantic distance for the MCI group could be interpreted as a structural deficit in accessing semantic memory efficiently, as has been shown to be very prominent at all stages of AD-related dementia (Verma and Howard, 2012). This is especially striking as one would expect the distance between adjacent words to increase as more words are produced (with a larger number of words per bin, the mean distance of adjacent words should be higher). Such an increase is the case for the phonemic distance, where MCI patients produce fewer words overall that are more phonemically related in comparison to HC, who produce more words and have a larger average phonemic distance over the bins. However, the exact opposite is the case for the semantic distance, where MCI patients produce fewer words while generating a list of less semantically related words in comparison to the HC group. This strongly points towards the conclusion that MCI patients struggle to exploit the associative network of their semantic memory.
By making neurocognitive processes visible in the PVF that are traditionally reserved for the SVF in clinical practice, the PVF becomes significantly more relevant to real-world MCI and dementia assessment. In order to support the diagnostic usage of the PVF for MCI assessment, we simulate a diagnostic decision scenario through downstream machine learning classification using the semantic as well as phonemic features in the PVF. Our results show that by using semantic and phonemic features we can improve classification results over previous clinical and automatic baselines. The all-features model (AUC=0.71) outperforms both the word count baseline (AUC=0.50) and the previous work of Ryan (2013) (AUC=0.42).
Both clustering (AUC=0.64) and binning (AUC=0.67) methods of analysis perform comparably. Both the semantic (AUC=0.65) and phonemic (AUC=0.76) measures outperform the clinical baseline (AUC=0.50). The classification results support that, while the task is overall a phonemic task, semantic investigation of the PVF is relevant for future research and capable of discriminating between HC and MCI better than the clinical baseline.
As an additional finding, the machine learning task benefits from a combined binning and clustering approach when modelling the phonemic processes (AUC=0.76), increasing over only phonemic clustering (AUC=0.45) or phonemic binning methods (AUC=0.70) for classification.

Conclusion
This paper set out to investigate the ability of computational linguistic techniques for understanding phonemic and semantic cognitive processes of the under-explored phonemic verbal fluency task. Utilizing three resolutions of analysis, temporal binning, clustering and global measures, combined with semantic and phonemic distance measures, we found semantic impairment in a phonemic task as has been hypothesized in previous clinical research. In addition to giving a finer resolution for understanding the PVF task, the additional phonemic and semantic features improved classification over previous clinical and automatic baselines for early dementia detection with the PVF task. Future work should investigate these measures in additional languages and possibly combine the features presented in this paper with medical imaging techniques to see if the findings can be replicated.

The following features are computed for each of the six 10-second bins.

Word Count by Bin: The number of words per 10-second bin
LD by Bin: Levenshtein distance per 10-second bin
POS-LD by Bin: Position-weighted Levenshtein distance per 10-second bin
PHON-LD by Bin: Phonemic-weighted Levenshtein distance per 10-second bin
Semantic Distance by Bin: Semantic distance between consecutive words per 10-second bin
Mean Temporal Distance by Bin: The average transition time in seconds between the end of one word and the onset of the next word by 10-second bin

Table 3: The following features were extracted from the PVF task produced by the participants.