Speech-based Estimation of Bulbar Regression in Amyotrophic Lateral Sclerosis

Amyotrophic Lateral Sclerosis (ALS) is a progressive neurological disease that leads to degeneration of motor neurons and, as a result, inhibits the ability of the brain to control muscle movements. Monitoring the progression of ALS is of fundamental importance due to the wide variability in disease outlook across patients. This progression is typically tracked using the ALS Functional Rating Scale - Revised (ALSFRS-R), the current clinical assessment of a patient's level of functional impairment across speech and other motor tasks. In this paper, we investigated automatic estimation of the ALSFRS-R bulbar subscore from acoustic and articulatory movement samples. Experimental results demonstrated that the ALSFRS-R bulbar subscore can be predicted from speech samples, which has clinical implications for automatic monitoring of ALS disease progression using speech information.


Introduction
Amyotrophic Lateral Sclerosis (ALS, also known as Lou Gehrig's disease) is a progressive neurological disease that destroys nerve cells and inhibits the normal voluntary motor function of the affected individual. The progression of this disease rapidly limits the patient's ability to perform normal daily tasks such as walking, speaking, and eventually even breathing. Although there is currently no cure for ALS, early detection and accurate tracking of disease progression are crucial to the planning of treatment strategies and therapeutic intervention (Kiernan et al., 2011). The current clinical measure of disease progression is the patient self-reported ALSFRS-R score, which estimates the degree of functional impairment across motor tasks such as speaking and walking, as well as common daily tasks such as getting dressed and climbing stairs (Cedarbaum et al., 1999).
The ALSFRS-R comprises 12 questions, with a total score ranging from 0 to 48, and is composed of three factors: bulbar functions, fine and gross motor functions, and respiratory function (Franchignoni et al., 2013). Bulbar functions include speaking, salivating, and swallowing. The efficacy of the ALSFRS-R for measuring motor function and levels of self-sufficiency in individuals with ALS has been thoroughly demonstrated. The ALSFRS-R has shown high inter-rater reliability, test-retest reliability, and internal consistency (Cedarbaum and Stambler, 1997; Brinkmann et al., 1997). Additionally, the ALSFRS-R is highly correlated with the clinical stage of ALS (Balendra et al., 2014) and has been shown to be a useful predictor of patient survival (Magnus et al., 2002). Despite the utility and reliability of the ALSFRS-R, it is only able to quantify specific degradations in motor function along a five-point scale. As such, it lacks the resolution to capture more subtle changes in motor function that can be observed through instrumentation-based measures (Allison et al., 2017).
Recently, there has been a surge of research using speech analytics to detect and track a range of neurological diseases such as Parkinson's (Orozco-Arroyave et al., 2016a,b; Hsu et al., 2017; Benba et al., 2015) and ALS (Norel et al., 2018; Wang et al., 2016a,b, 2018). Efforts toward tracking disease progression in this area have typically focused on the estimation of speech-specific measures such as speech intelligibility (Berisha et al., 2013; Kim et al., 2015), speaking rate (Jiao et al., 2016; Martens et al., 2015), or severity (Tu et al., 2017; Asgari and Shafran, 2010). While these efforts have shown success in objectively measuring functional changes directly related to speech, whether speech can be used to measure functional impairment on other tasks in ALS remains largely unexplored.
In this paper, we sought to address this question by examining how well speech and articulation data can predict the ALSFRS-R bulbar subscore (which ranges from 0 to 12). The long-term goal of this research is to develop objective measures of broad-level motor function; at this early stage, we focused on the bulbar subscore. To our knowledge, this paper is the first to predict the ALSFRS-R (bulbar) score directly from speech information. Two regression models, a simple linear ridge regression model and a machine learning algorithm (support vector machine), were used in the regression analysis.

Participants
Sixty-six speakers diagnosed with early-onset ALS participated in this study, each attending up to four data collection sessions at intervals of four to six months. At each session, participants or caregivers completed the ALSFRS-R, which included the bulbar subscore. Speech intelligibility (percentage of understandable words, judged by listeners) and speaking rate (words produced per minute) were assessed by a speech-language pathologist using the Sentence Intelligibility Test (SIT) software (Dorsey et al., 2007). Intelligible speaking rate, called communication efficiency, was also calculated; it is the number of understandable words per minute (speech intelligibility × speaking rate) (Yorkston and Beukelman, 1981). The whole data set was used for the basic correlation analysis between the ALSFRS-R and the speech performance measures, while the data from 28 participants were used for the regression analysis. This subset included 15 male and 13 female participants, whose age averaged 57.3 years with a standard deviation of 10.7 years.

Stimuli and Procedure
The participants were asked to produce 20 sentences in a fixed order, such as I need some assistance and call me back when you can. A complete list of the stimuli used for data collection is included in the Appendix. The sentences were selected because they are commonly used in augmentative and alternative communication (AAC) devices (Beukelman et al., 1984). All speech stimuli were presented on a TV screen in front of the participants. The stimuli were repeated for a total of four recordings at the participants' habitual speaking rate, among other speech tasks.

Figure 1: Sensor locations (x, y, z) for the Wave system
The NDI Wave System (Northern Digital Inc., Waterloo, Canada) was used to collect articulatory movement data with an accuracy of 0.5 mm (Berry, 2011). An optimal four-sensor setup (Wang et al., 2016c) was used to collect articulatory data from the tongue tip (TT, 5 mm from apex), tongue back (TB, 10 mm from TT), upper lip (UL, vermillion border), and lower lip (LL, vermillion border). The sensors were attached using nontoxic dental glue (PeriAcryl 90, GluStitch) or medical tape. A lightweight helmet with a 6-degree-of-freedom sensor served as a head reference point. Prior to the start of each data collection session, the speakers had 3-5 minutes to adapt to the wired sensors. In this paper, we use x, y, and z to represent lateral, vertical, and anterior-posterior movements, respectively. A visual depiction of the sensor locations and coordinate system is displayed in Figure 1. To capture acoustic signals simultaneously, a Shure Microflex microphone with a sampling rate of 22 kHz was positioned approximately 15 cm from each speaker's mouth.

Data Processing
Head rotation and translation movements were removed from the articulatory data prior to analysis. A low-pass filter with a 15 Hz cutoff was applied to remove noise (Wang et al., 2016c). SMASH (Green et al., 2013a), a MATLAB-based software package, was used to segment the time-matched articulatory and acoustic data into individual phrase samples.
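The low-pass filtering step can be sketched as follows. The 100 Hz sampling rate, the filter order, and the use of a zero-phase Butterworth filter are assumptions for illustration, as the paper does not report these details.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def lowpass_filter(signal, fs=100.0, cutoff=15.0, order=4):
    """Zero-phase low-pass Butterworth filter for one movement channel."""
    nyquist = fs / 2.0
    b, a = butter(order, cutoff / nyquist, btype="low")
    return filtfilt(b, a, signal)  # filtfilt avoids phase distortion

# Synthetic example: a slow articulatory movement plus high-frequency sensor noise.
t = np.arange(0, 2, 1 / 100.0)                 # 2 s at an assumed 100 Hz
movement = np.sin(2 * np.pi * 2 * t)           # 2 Hz articulatory component
noise = 0.3 * np.sin(2 * np.pi * 40 * t)       # 40 Hz noise, above the 15 Hz cutoff
clean = lowpass_filter(movement + noise)       # recovers the slow component
```

Zero-phase filtering (forward-backward) matters here because the downstream features depend on the timing of articulatory movements.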

Relationship between ALSFRS-R scores and speech performance measures
In this section, we evaluated the relationship between traditional speech metrics, such as speaking rate and speech intelligibility, and ALSFRS-R scores. Because speech represents only a small component of the broad motor function assessed by the ALSFRS-R, we compared the relationship between speech and not only the ALSFRS-R as a whole, but also the bulbar subscore, which reflects the portion of the ALSFRS-R related to speaking, salivating, and swallowing.
There are several important factors to consider when evaluating the relationship between these measures and the ALSFRS-R score. First, neither the speech metrics nor the ALSFRS-R scores being compared are perfect measures of the underlying decline in motor function that they attempt to quantify. Speaking rate is highly sensitive to natural variation between speakers and to compensatory strategies that can mask changes in motor function (Green et al., 2013b). Speech intelligibility suffers from ceiling and floor effects that prevent it from tracking disease progression outside of a fixed severity range (Yorkston and Beukelman, 1981). Second, because the ALSFRS-R measures each motor component along a five-point scale, it cannot capture subtle changes in motor control that occur between points on this scale. Despite this limitation, the ALSFRS-R has proven reliable in test-retest analysis (Cedarbaum and Stambler, 1997) and correlates highly with the clinical stage of individuals with ALS (Balendra et al., 2014).

Figure 2 displays the relationship between speech intelligibility, speaking rate, and intelligible speaking rate (ISR) and the ALSFRS-R bulbar subscore for participants in our data set. Although all three scatter plots show a correlation between the speech measures and the ALSFRS-R bulbar subscore, there remains substantial variability in the ALSFRS-R that cannot be explained by the speech measures. This is particularly true of intelligibility: participants could score as low as 4/12 on the ALSFRS-R bulbar subscore while maintaining near-perfect intelligibility. Among the three speech measures, ISR had the highest correlation with the ALSFRS-R bulbar subscore.
To better understand the relationship between these speech measures and the different components of the ALSFRS-R, we performed a correlation analysis between the three speech measures (speech intelligibility, speaking rate, and ISR) and three ALSFRS-R component scores (Table 1). The three component scores were (a) the total score, which provides a broad assessment of motor function; (b) the bulbar subscore, which covers the functions of motor control most closely related to speech, including speaking, swallowing, and salivating; and (c) the non-bulbar component, which is the difference between the total ALSFRS-R score and the bulbar subscore. This analysis found a strong correlation between all three speech measures and the bulbar subscore, with all correlations between 0.5 and 0.7 and all p-values less than 10^-6. Although there is a statistically significant relationship between each of the speech measures and the total ALSFRS-R score, the significance disappears once the bulbar component is removed. This relationship is therefore simply further evidence of the speech measures' ability to track the bulbar component.
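This type of correlation analysis can be illustrated with a small sketch on synthetic data; the arrays below are hypothetical stand-ins, not the study's measurements.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Hypothetical data: a bulbar subscore (0-12) and an ISR measure that
# tracks it with added noise, standing in for the study's real measurements.
bulbar = rng.uniform(0, 12, size=100)
isr = 10.0 * bulbar + rng.normal(0, 20, size=100)

# Pearson correlation coefficient and two-sided p-value.
r, p = pearsonr(isr, bulbar)
```

A strong linear relationship yields a correlation well above 0.5 with a vanishingly small p-value, mirroring the pattern reported for the bulbar subscore in Table 1.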

Methods
As mentioned earlier, this analysis was based on a subset of twenty-eight participants from the previously described data set whose speech data had been manually parsed for automatic processing. Fifteen of the participants made only a single visit, six made two visits, five made three visits, and only two made the full four visits. Although approximately 80 samples were typically collected per session, some of the participants were not able to complete all of the recording tasks. In these cases, predictions were made based on the reduced set of available samples.

Acoustic Features
The acoustic features used in this paper were based on frame-level Mel-frequency cepstral coefficients (MFCCs). Although MFCCs were originally popularized by their effectiveness in automatic speech recognition systems, they have recently seen increasing use in a range of other speech assessment tasks, including the detection of motor speech disorders such as Parkinson's disease (Benba et al., 2015). Because the mel cepstrum encodes spectral magnitude information related to the shape of the vocal tract, MFCCs can capture articulatory changes resulting from conditions that affect speech production (Fraile et al., 2008). For each frame we extracted 14 MFCCs, along with their first and second temporal derivatives, ∆MFCC and ∆∆MFCC. From these 42 variables across time, we calculated three summary statistics (mean, standard deviation, and pairwise variability), yielding a total of 126 features.
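The summarization step can be sketched as follows. The MFCC matrix below is synthetic, and the definition of pairwise variability as the mean absolute frame-to-frame difference is an assumption, since the paper does not define it.

```python
import numpy as np

def summarize_mfcc(mfcc):
    """Summarize frame-level MFCCs into a fixed-length feature vector.

    mfcc: array of shape (n_frames, 14).
    Returns a 126-dim vector: {MFCC, dMFCC, ddMFCC} x {mean, std, pairwise variability}.
    """
    delta = np.gradient(mfcc, axis=0)    # first temporal derivative
    delta2 = np.gradient(delta, axis=0)  # second temporal derivative
    feats = []
    for block in (mfcc, delta, delta2):
        feats.append(block.mean(axis=0))
        feats.append(block.std(axis=0))
        # Pairwise variability (assumed definition): mean absolute
        # difference between successive frames.
        feats.append(np.abs(np.diff(block, axis=0)).mean(axis=0))
    return np.concatenate(feats)

frames = np.random.default_rng(1).normal(size=(200, 14))  # stand-in MFCC matrix
vec = summarize_mfcc(frames)                              # 126 features per sample
```

The per-frame MFCCs themselves would come from a standard front end; the sketch focuses on the 42-variable-to-126-feature reduction described above.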

Articulatory Features
For a specific sensor, we have three positional arrays x = [x_1, ..., x_N], y = [y_1, ..., y_N], and z = [z_1, ..., z_N] corresponding to dimensions x, y, and z. For any index i ∈ [1, ..., N − 1], we can calculate the corresponding distance traveled as

d_i = √((x_{i+1} − x_i)² + (y_{i+1} − y_i)² + (z_{i+1} − z_i)²)

and form the corresponding distance vector D = [d_1, ..., d_{N−1}]. This distance vector forms the basis for the articulation features used in this paper. From D, we extracted eight summary statistics: mean, standard deviation, skewness, kurtosis, maximum, minimum, range, and pairwise variability. In addition to these baseline features, we considered three procedures for normalizing the features based on measurements of the overall distance trajectory. The first procedure divided the features by the overall distance traveled, Σ d_i, to control for the length of the overall articulation motion, which was heavily dependent on the specific phrase being produced. The second and third procedures attempted to control for the size of each individual's articulation motion space: the second normalized by the maximum distance between any two points in (x, y, z), and the third formed a convex hull around the articulation path and normalized by the volume of the resulting hull. Combining the eight statistics and four normalization methods (including no normalization) with the four sensors (tongue tip, tongue back, lower lip, and upper lip) yielded a total of 128 features. To illustrate how these articulatory features might help assess motor function for different individuals, we plotted the articulation data from two patients on opposite ends of the severity spectrum in our data set (Figure 3). The first sample was from participant DA001 on his/her first visit. This participant had experienced minimal decline in speaking ability and scored a perfect 12/12 on the ALSFRS-R bulbar subscore.
The second sample was drawn from participant DA016 on her second visit, by which point she had experienced severe speech decline, scoring only 3/12 on the bulbar subsection. Figure 3 displays the tongue-tip articulation tracks for the two participants, along with a box plot comparing the distributions of the distance values between points. Figure 3c shows a stark difference between the distributions of pairwise distances across the two participants. The distances for participant DA016 are significantly lower on average than those for DA001, indicating a pronounced reduction in the speed of articulation.
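The distance-based feature extraction for a single sensor can be sketched as below. As with the acoustic features, the pairwise-variability definition (mean absolute difference between successive step distances) is an assumption, and the eight statistics with four normalization variants give 32 features per sensor (128 across the four sensors).

```python
import numpy as np
from scipy.stats import skew, kurtosis
from scipy.spatial import ConvexHull
from scipy.spatial.distance import pdist

def articulatory_features(x, y, z):
    """Distance-based features for one sensor's (x, y, z) trajectory."""
    pts = np.column_stack([x, y, z])
    # Step distances d_1..d_{N-1} between consecutive sensor positions.
    d = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    stats = np.array([
        d.mean(), d.std(), skew(d), kurtosis(d),
        d.max(), d.min(), d.max() - d.min(),
        np.abs(np.diff(d)).mean(),   # pairwise variability (assumed definition)
    ])
    total = d.sum()                       # overall distance traveled
    span = pdist(pts).max()               # max distance between any two points
    hull_volume = ConvexHull(pts).volume  # volume of the articulation motion space
    # Raw statistics plus the three normalized variants: 8 x 4 = 32 features.
    return np.concatenate([stats, stats / total, stats / span, stats / hull_volume])

traj = np.random.default_rng(2).normal(size=(50, 3))  # stand-in trajectory
feats = articulatory_features(traj[:, 0], traj[:, 1], traj[:, 2])
```

The convex hull normalization is the most expensive step but runs quickly at these trajectory lengths.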

Regression Analysis
The regression analysis conducted in this experiment began with the 3959 × 254 feature matrix extracted via the procedure outlined in the previous section. To ensure the regression model's ability to generalize to new speakers, it was evaluated by leave-one-speaker-out cross-validation. Thus, at every stage of cross-validation, the model was trained on 27 participants and evaluated on the single left-out participant. When a participant with multiple recording sessions was moved to the validation set, all of that participant's sessions were moved as a group, and separate predictions were made and evaluated for each session.
All data samples were z-scored (the mean subtracted and the result divided by the standard deviation) to normalize the feature data. This prevented the scale of different attributes from affecting how much they contribute to the model.
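The evaluation loop can be sketched with scikit-learn's LeaveOneGroupOut. The data below are synthetic, and fitting the z-scoring parameters on the training folds only is a choice made here to avoid leakage; the paper does not specify where in the pipeline normalization was applied.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 10))
y = 2.0 * X[:, 0] + rng.normal(0, 0.1, size=120)  # toy target
speakers = np.repeat(np.arange(12), 10)           # 12 hypothetical speakers, 10 phrases each

preds = np.zeros_like(y)
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=speakers):
    # All phrases from the held-out speaker leave the training set together.
    scaler = StandardScaler().fit(X[train_idx])   # z-score fit on training folds only
    model = Ridge(alpha=1.0).fit(scaler.transform(X[train_idx]), y[train_idx])
    preds[test_idx] = model.predict(scaler.transform(X[test_idx]))
```

Grouping by speaker rather than by sample is the key point: random sample-level splits would let a speaker appear on both sides of a fold and inflate performance.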
Two regression models were used in this analysis: a simple ridge regression model and a support vector machine (SVM). Ridge regression is similar to ordinary least-squares regression but utilizes an L2 regularization term in order to better model data that is subject to multicollinearities (Hoerl and Kennard, 1970). Unlike traditional regression models that minimize observed training error, support vector regression (SVR) minimizes a generalization bound in order to ensure the model performs well on out-of-sample data (Basak et al., 2007). This factor, combined with the ability of SVMs to use non-linear kernels to model complex non-linear patterns in data, has made them widely used for both classification and regression problems. The SVMs used in this paper employed a linear kernel and were trained using the sequential minimal optimization (SMO) algorithm.
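Both models are available in scikit-learn, whose SVR is backed by the SMO-based libsvm solver; the hyperparameters below are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.svm import SVR

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
true_coef = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_coef + rng.normal(0, 0.2, size=100)  # toy linear target

ridge = Ridge(alpha=1.0).fit(X, y)                       # L2-regularized least squares
svr = SVR(kernel="linear", C=1.0, epsilon=0.1).fit(X, y)  # epsilon-insensitive linear SVR
```

The epsilon parameter gives SVR its flat-region loss: residuals smaller than epsilon incur no penalty, which is part of what distinguishes it from the squared-error objective of ridge regression.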
In addition to the baseline model (using all previously described acoustic and articulation features), we also tested the performance of five other feature groups: acoustic only, acoustic + lips, tongue only, lips only, and tongue + lips. The initial predictions were made on individual samples (phrases) and were then averaged to form a final prediction for each patient-session pair.
Two measures were used to assess regression performance: root mean squared error (RMSE) and the correlation of the resulting set of predictions with the true ALSFRS-R (bulbar) scores. Low RMSE indicates a small difference between the predicted and true ALSFRS-R values. High correlation indicates that changes in the predicted ALSFRS-R values correspond to proportional changes in the true values.
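A minimal sketch of the two evaluation measures (the score arrays below are hypothetical session-level values, not the study's results):

```python
import numpy as np

def evaluate(pred, true):
    """RMSE and Pearson correlation between predicted and true bulbar subscores."""
    rmse = np.sqrt(np.mean((pred - true) ** 2))
    r = np.corrcoef(pred, true)[0, 1]
    return rmse, r

# Hypothetical session-level predictions against true bulbar subscores (0-12).
true_scores = np.array([12.0, 10.0, 7.0, 3.0, 9.0])
pred_scores = np.array([11.0, 10.0, 8.0, 4.0, 9.0])
rmse, r = evaluate(pred_scores, true_scores)  # rmse = sqrt(0.6) ≈ 0.77
```

The two measures are complementary: a model with a constant offset can have high correlation yet poor RMSE, while a model that predicts the mean everywhere can have modest RMSE with no correlation.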

Results
The results for each of the six feature groups and two regression models are displayed in terms of both RMSE and correlation in Figure 4. The highest performance was achieved by the SVR model using acoustic data along with all articulatory motion data (RMSE = 1.78, r = 0.64).

Figure 4: Bar graphs describing the performance of Ridge (blue) and SVM (orange) models across the six feature groupings described along the x-axis, in terms of both root mean-squared error and correlation.

Figure 4 indicates a few interesting findings. First, we found that models trained on both the articulation motion data and the acoustic data tended to outperform either grouping by itself. This is consistent with the literature on both ISR prediction (Wang et al., 2016b) and ALS early detection (Wang et al., 2016a), which has shown the performance benefits of adding articulatory data to acoustic models.
In addition, the performance on data from the tongue and lips separately shows that, viewed in isolation, the tongue sensors were significantly more powerful than the lip sensors for predicting ALSFRS-R scores. This is not surprising, as the tongue is the primary articulator. Wang and colleagues also found that tongue information outperformed lip information in predicting intelligible speaking rate for ALS (Wang et al., 2016b).
Interestingly, when comparing the performance of the "Acoustic+Lips" group and the "All Features" group, we found that adding tongue data (on top of acoustic and lip motion data) did not significantly improve performance. Further studies are required to verify this finding with a deeper analysis of the contribution of tongue information beyond acoustic and lip information. This finding supports the idea that a mobile app recording speech and lip motion (via a webcam) would be beneficial for future home-based data collection from patients. Future research should also investigate the degree to which non-invasive, video-based measures of lip motion can be substituted for the more traditional motion sensors.
Finally, when comparing the performance of the two regression models tested, the SVM tended to perform slightly better than the ridge regression models. The lone exception was the tongue-only articulation data, where the ridge regression model slightly outperformed the SVM. Future work will involve more complex models such as convolutional neural networks (CNNs), which have recently shown potential in ALS early detection.

Conclusion
This paper explored automatic estimation of the ALSFRS-R bulbar subscore from speech information, using both acoustic and articulatory motion data collected during speech production. Two regression models, support vector regression and ridge regression, were applied to six different feature groups. The highest performance was achieved by the SVR model using acoustic data along with all articulatory motion data. To our knowledge, this is the first demonstration of the feasibility of automatically predicting the ALSFRS-R bulbar subscore from speech samples. Future research on this topic will focus on the degree to which non-speech information can be included to predict ALS motor function decline more broadly.