Oral-Motor and Lexical Diversity During Naturalistic Conversations in Adults with Autism Spectrum Disorder

Autism spectrum disorder (ASD) is a neurodevelopmental condition characterized by impaired social communication and the presence of restricted, repetitive patterns of behaviors and interests. Prior research suggests that restricted patterns of behavior in ASD may be cross-domain phenomena that are evident in a variety of modalities. Computational studies of language in ASD provide support for the existence of an underlying dimension of restriction that emerges during a conversation. Similar evidence exists for restricted patterns of facial movement. Using tools from computational linguistics, computer vision, and information theory, this study tests whether cognitive-motor restriction can be detected across multiple behavioral domains in adults with ASD during a naturalistic conversation. Our methods identify restricted behavioral patterns, as measured by entropy in word use and mouth movement. Results suggest that adults with ASD produce significantly less diverse mouth movements and words than neurotypical adults, with an increased reliance on repeated patterns in both domains. The diversity values of the two domains are not significantly correlated, suggesting that they provide complementary information.


Introduction
Autism spectrum disorder (ASD) is a behaviorally-defined neurodevelopmental condition that affects approximately 1.5% of children in the U.S. (Christensen et al., 2016). Individuals with ASD are characterized by social communication impairments and the presence of restricted and repetitive patterns of interests and activities (APA, 2013). One of the most striking features of ASD is extreme heterogeneity in its clinical presentation. For example, verbal abilities in ASD range from minimally verbal (a few words or sounds) to above average (Pickles et al., 2014). This heterogeneity makes it harder to diagnose ASD reliably, and indeed, expert clinicians may disagree about whether or not an individual meets criteria (Regier et al., 2013). Diagnostic challenges are compounded by shortcomings in current phenotyping approaches, which are either time-consuming and expensive, or provide limited information via questionnaires. Moreover, although ecologically valid stimuli have been shown to be superior for capturing ASD-related differences in behavior (Chevallier et al., 2015), most traditional ASD assessments continue to be conducted in highly controlled contexts. Taken together, these challenges highlight the need for a precision medicine approach to ASD (Beversdorf, 2016) that includes quantified and precise behavioral assessments in naturalistic settings.
Recent computational methodologies, including wearable technologies, computer vision, and natural language processing, have great potential to facilitate automated identification of novel phenotypic markers of behavior in ecologically valid settings, with exquisite precision, and in a highly scalable manner. Clinically, these technological advancements in "quantified behavior" could support diagnostic decision making, while providing critical information about intervention effectiveness.
In this study, we explore the applicability of computational behavioral assessments for identifying manifestations of the restricted/repetitive dimension in ASD. Building on existing knowledge about language production (Bone et al., 2013, 2014; Heeman et al., 2010; Tanaka et al., 2014; Goodkind et al., 2018; Parish-Morris et al., 2016b) and facial movements in ASD (Yirmiya et al., 1989; Borsos and Gyori, 2017; Guha et al., 2018; Owada et al., 2018), as well as the known interrelation between the two domains (Busso and Narayanan, 2007), this study investigates patterns of word production and mouth movements during natural conversation. Our goal is to test whether an underlying dimension of cognitive-motor restriction can be detected across multiple behavioral domains in ASD.
Prior research suggests that restricted patterns of behavior may be cross-domain phenomena in autism, and are therefore evident in a variety of modalities. For example, computational studies of language in ASD provide support for the existence of multifaceted restricted language patterns that emerge during conversation. Children with ASD produce significantly more semantically overlapping turns than typically developing children during clinical evaluations (Rouhizadeh et al., 2015). They also engage in more echolalia (repetition of words or phrases) than typical children during semi-structured interviews (van Santen et al., 2013), and utilize a restricted range of narrative tools (Capps et al., 2000) and words (Baixauli et al., 2016) during storytelling. Less is known about linguistic diversity in adults with ASD, particularly during naturalistic conversations.
While similar evidence exists for atypical patterns of facial movement in ASD, most prior work has investigated facial expressions in the context of emotion recognition and imitation. Individuals with ASD produce flattened facial expressions (Yirmiya et al., 1989) that are hard to read (Brewer et al., 2016), and overt facial expression mimicry is impaired (Yoshimura et al., 2015). Reduced complexity of facial behavior, particularly in the eye region, has also been reported while participants produced a variety of facial expressions (Guha et al., 2018). Limited research, however, has examined facial expressions and oral-motor movement in dynamic social contexts such as conversations.
This study adds to the existing literature by combining tools from computational linguistics, computer vision, and information theory to characterize lexical and oral-motor diversity in adults with ASD. We demonstrate the utility of our approach in a young adult data set consisting of 44 conversational partners, 17 with ASD, in naturalistic social scenarios. Results showed that participants with ASD used fewer words than the typically developing (TD) control group during 3-minute "get to know you" conversations, and paused more. They also produced significantly less diverse mouth movements and words, suggesting increased reliance on repeated patterns (i.e., restriction) in both domains. Notably, the correlation between the diversity values of the two domains was not significant, suggesting that they provide complementary information. The findings reported here suggest that reduced behavioral diversity, across domains, captures an underlying dimension of restriction and repetition in ASD that distinguishes individuals on the spectrum from typical controls. In the future, these methods could be utilized to identify and track highly quantifiable treatment targets, thus advancing the goal of precision medicine for autism.

Participants
Forty-four adults participated in the present study (ASD: N=17, TD: N=27, all native English speakers). Participant groups did not differ significantly on mean chronological age, full-scale IQ estimates (WASI-II) (Wechsler, 2011), verbal IQ estimates, or sex ratio (Table 1). There was a trend toward a difference in full-scale IQ, so this variable was considered in models comparing diagnostic groups. Participants were diagnosed using the Clinical Best Estimate process (Lord et al., 2012b), informed by the Autism Diagnostic Observation Schedule, 2nd Edition, Module 4 (ADOS-2) (Lord et al., 2012a) and adhering to DSM-5 criteria for ASD (APA, 2013). All aspects of this study were approved by the Institutional Review Boards of the University of Pennsylvania and the Children's Hospital of Philadelphia.

Procedure
After providing written informed consent to participate in a novel social skills intervention (NIH R34MH104407, "Services to enhance social functioning in adults with autism spectrum disorders", PI: Brodkin) participants underwent a battery of tasks at three time points separated by approximately 6 months each. These tasks assessed social communication competence and included a slightly modified Contextual Assessment of Social Skills (CASS) (Ratto et al., 2011). The current analysis focuses on the third time point, after all participants with ASD received the social skills intervention. Typical participants did not receive intervention, and participated in the CASS once after providing informed consent. The CASS is a semi-structured assessment of conversational ability designed to mimic real-life first-time encounters. Participants engaged in two 3-minute face-to-face conversations with two different confederates (research staff, blind to participant diagnostic status and unaware of the dependent variables of interest). In the first conversation (interested condition), the confederate demonstrated social interest by engaging both verbally and non-verbally in the conversation. In the second conversation (bored condition), the confederate indicated boredom and disengagement both verbally (e.g., one-word answers, limited follow-up questions) and physically (e.g., neutral affect, limited eye-contact and gestures). The current analysis is based on the interested condition only. Prior to each conversation, study staff provided the following prompt to the participants and confederates before leaving the room: "Thank you both so much for coming in today. Right now, you will have 3 minutes to talk and get to know each other, and then I will come back into the room." CASS confederates included 10 undergraduate students or BA-level research assistants (3 males, 7 females, all native English speakers). 
Confederates were semi-randomly selected, based on availability and clinical judgment (4 confederates interacted with the ASD group, 8 with the TD group, 2 with both). In order to provide opportunities for participants to initiate and develop the conversation, confederates were trained to speak for no more than 50% of the time and to wait 10s to initiate the conversation. If conversational pauses occurred, confederates were trained to wait 5s before re-initiating the conversation. No specific prompts were provided to either speaker.
Audio and video of the CASS were recorded using a specialized "TreeCam", built in-house, that was placed between the participant and confederate on a floor stand. This device has two HD video cameras pointing in opposite directions to allow simultaneous recording of the participant and the confederate as they sit facing each other, with a central microphone to record audio. For the face analysis, the first 10 seconds of the video were cropped to remove RA instructions (which may have also removed a few seconds of the CASS), and recordings continued for 3 minutes. For the lexical analysis, the sample began when the first word of the CASS was uttered, after study staff left the room, and ended when study staff re-entered.

Processing of Language Data
Audio streams were extracted from video recordings, and saved in lossless .flac format. A team of reliable annotators produced time-aligned, verbatim, orthographic transcripts of the recordings in XTrans (Glenn et al., 2009). Each recording was processed by two junior annotators and one senior annotator, all of whom were undergraduate students and native English speakers. Before becoming junior annotators for this cohort, each team member received at least 10 hours of training in Quick Transcription (Kimball et al., 2004) modified for use with clinical interviews of participants with ASD (Parish-Morris et al., 2016a,b, 2017). In addition, annotators were trained to reliability (defined as >90% in common with a Gold Standard transcript) on segmenting (marking speech start and stop times) and transcribing (writing down words and sounds produced, using the modified Quick Transcription specification). Training files included audio recordings of conversations between individuals with and without autism that were not used in this study. For the CASS, one reliable junior annotator segmented utterances into pause groups, while the second transcribed words produced by each speaker. A senior annotator then thoroughly reviewed and corrected each file (Figure 2). All senior annotators had at least 6 months of prior transcription experience. Final language data were exported from XTrans as tab-delimited files that were batch imported into R. Annotations marking non-speech sounds like laughter, indicators of language errors like stutters, and punctuation were removed, while other disfluencies (including filled pauses and whole-word repetitions) were left in. Total words, speech rate (total words/total length of speaking segments), sum of participant response latencies (Confederate-to-Participant inter-turn pauses or C2P; overlaps excluded), and number of conversational turns were calculated across each session.
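The session-level features described above can be sketched in Python (the original computation was done in R; the speaker labels, tuple layout, and the simple turn/latency rules below are illustrative assumptions, not the authors' implementation):

```python
def conversation_features(segments):
    """Compute session-level conversational features from time-aligned
    segments, given as (speaker, start, end, words) tuples sorted by
    start time. "P" = participant, "C" = confederate (assumed labels)."""
    total_words = sum(len(words) for spk, _, _, words in segments if spk == "P")
    speak_time = sum(end - start for spk, start, end, _ in segments if spk == "P")
    speech_rate = total_words / speak_time if speak_time else 0.0
    c2p_latency = 0.0  # sum of Confederate-to-Participant inter-turn pauses
    turns = 0
    prev_spk, prev_end = None, 0.0
    for spk, start, end, _ in segments:
        if spk != prev_spk:
            turns += 1
            # Count only positive gaps after a confederate turn (overlaps excluded).
            if prev_spk == "C" and spk == "P" and start > prev_end:
                c2p_latency += start - prev_end
        prev_spk, prev_end = spk, end
    return {"total_words": total_words, "speech_rate": speech_rate,
            "c2p_latency": c2p_latency, "turns": turns}
```

A segment here corresponds to one pause group from the transcripts; consecutive segments by the same speaker are treated as a single turn.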

Processing of Vision Data
CASS videos were processed by an image processing and feature extraction pipeline that included face detection, face registration, and facial movement quantification.
For face detection and localization of multiple facial landmarks (eyes, lip corners, nose, etc.) within each face, we used a publicly available tool (OpenFace) (Baltrusaitis et al., 2016). The computation of facial movements requires image registration across frames, which we achieved via part-based registration (Sariyanidi et al., 2015). Using landmarks from the corners of the eyes and mouth at each frame, we subdivided the face into three overlapping parts covering the left eye region, the right eye region, and the mouth region (see Figure 3). Cropped sequences had visible jitter due to imprecise landmark localization at each frame, which is detrimental to the analysis of subtle face/head movements. We eliminated jitter using a video stabilization technique (Sariyanidi et al., 2017), which registers consecutive frames to one another. Quantification of facial movements was done using the Facial Bases method (Sariyanidi et al., 2017). This method uses 180 facial movement basis functions, 60 of which correspond to mouth movements. Each basis provides differential information (i.e., change of appearance) about a movement that occurs in a particular region of the face. Most bases are semantically interpretable; for example, one basis is activated when the lip corner of the subject moves upwards/downwards, and another basis is activated when the subject's lower lip moves, which typically occurs when the subject is talking (Figure 4). In this study, we used the 60 bases corresponding to mouth movements. The entire video sequence of a participant was represented as a collection of 60 time series, where each time series quantified the activation level of one basis over time (Figure 4). In our analyses, we only used time points when participants were speaking.
Each time series underwent smoothing, peak detection, and normalization steps for reliability and comparability between participants and across 60 bases. We first smoothed each time series using a Gaussian filter with a filter width of 2 standard deviations. We then detected peaks by determining the time points of sign change in the first derivative (i.e. the point at which an increase in activation stops and a decrease begins).
Each facial basis may have a different maximum activation magnitude (Sariyanidi et al., 2017). We therefore normalized the heights of detected peaks via z-normalization, by using the time series from research confederates to calculate the mean activation and the standard deviation for each basis. Finally, we removed outlier peaks by setting activations whose absolute value is above 6 standard deviations to zero.
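The smoothing, peak detection, and normalization steps can be sketched with NumPy as follows (the kernel construction, function names, and parameter defaults are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def preprocess_basis_series(x, conf_mean, conf_std, sigma=2.0, clip_sd=6.0):
    """Smooth one basis-activation time series, detect peaks at sign
    changes of the first derivative, z-normalize peak heights against
    confederate statistics, and zero out outlier peaks beyond clip_sd."""
    # Gaussian smoothing kernel, truncated at 3 sigma.
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-t**2 / (2 * sigma**2))
    kernel /= kernel.sum()
    smoothed = np.convolve(np.asarray(x, dtype=float), kernel, mode="same")
    # A peak is a point where an increase in activation stops and a
    # decrease begins (first derivative changes sign from + to -).
    d = np.diff(smoothed)
    peak_idx = np.where((d[:-1] > 0) & (d[1:] <= 0))[0] + 1
    # z-normalize peak heights using confederate mean/std for this basis.
    peaks = (smoothed[peak_idx] - conf_mean) / conf_std
    # Outlier removal: peaks beyond clip_sd standard deviations are set to zero.
    peaks[np.abs(peaks) > clip_sd] = 0.0
    return peak_idx, peaks
```

Running this on a synthetic series with two smooth activation bumps recovers exactly two peaks at the bump locations.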

Computation of Diversity
For both modalities (language and mouth movements) we quantified diversity using Shannon entropy (Cover and Thomas, 2006). From an information theoretical perspective, entropy can be described as the amount of information a data modality carries. Intuitively, one expects a higher entropy (diversity) when, for instance, a participant makes a rich set of facial expressions while speaking compared to a participant who generates only a restricted set of mouth movements. Similarly in the cognitive domain, higher lexical entropy (diversity) is expected when participants use a variety of words, and lower entropy is expected when participants produce repetitive speech. Shannon entropy (H) is calculated as

H = -Σ_i p(x_i) log_b p(x_i),

where b is the base of the logarithm. In this work we used b = 2, yielding a measure of entropy in bits. The probability of generating a word x_i (or activation of a facial basis), p(x_i), is calculated from the sample of generated words (or basis activations).
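As a concrete illustration, the estimator amounts to a few lines of Python (the study itself used R; this is a minimal sketch of the same formula):

```python
import math
from collections import Counter

def shannon_entropy(tokens, base=2):
    """H = -sum_i p(x_i) * log_b(p(x_i)), with p estimated from the
    empirical distribution of the observed tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total, base)
                for c in counts.values())

# A repetitive sample carries less information than a varied one.
print(shannon_entropy(["yes", "yes", "yes", "no"]))   # ≈ 0.81 bits
print(shannon_entropy(["yes", "no", "maybe", "so"]))  # 2.0 bits
```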
The 'diversity' function of the 'qdap' package in R (R Core Team, 2017) was used to calculate lexical (word-level) entropy for each participant. This function counts the number of times each distinct word is produced by a participant, resulting in a vector of word counts. The probability of each word, p(x_i), is then calculated by dividing its count by the total number of words. Note that the possible number of words and the exact words used can differ from one participant to another. Therefore, we also tested whether calculated entropy values were affected by total word counts (see Results).
For mouth movements, all participants were assessed using the same set of 60 bases. We calculated the number of times each facial basis was activated (analogous to word counts), also taking the magnitude of activation into account, by summing over the entire time series. Note that summing the raw values of a time series should yield zero, since a basis activation (i.e., a positive value) is followed by a deactivation (a negative value); for example, when a lip corner is stretched, it is then relaxed. Therefore, instead of summing the raw values of the time series, we summed the positive values and the absolute values of the negative values separately, taking their average as our final count value. We repeated this procedure for all 60 bases, yielding a vector of movement counts. Different facial bases may have different expected activation patterns, with some naturally activated more frequently than others. We therefore normalized the total activation count of each basis by the maximum count observed for the same basis among research confederates. Finally, entropy was calculated using the normalized counts.
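The count construction for the movement modality can be sketched as follows (function names and the toy confederate maxima are illustrative; the entropy helper mirrors the Shannon formula defined above):

```python
import numpy as np

def movement_counts(series_list, confederate_max):
    """Turn per-basis activation time series into a vector of 'counts':
    for each basis, average the summed positive activations and the
    summed magnitudes of negative deactivations, then normalize by the
    maximum count observed for the same basis among confederates."""
    counts = []
    for ts in series_list:
        ts = np.asarray(ts, dtype=float)
        pos = ts[ts > 0].sum()
        neg = np.abs(ts[ts < 0]).sum()
        counts.append((pos + neg) / 2.0)
    return np.array(counts) / np.asarray(confederate_max, dtype=float)

def count_entropy(counts, base=2):
    """Shannon entropy of a count vector, in bits by default."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log(p) / np.log(base)).sum())
```

For example, two bases with equal normalized counts yield the maximum entropy of 1 bit, while a participant relying on a single basis would score 0 bits.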

Statistical Analysis
Our research design included repeated confederates across participants (i.e., the same 10 confederates joined multiple conversations with different ASD and TD participants). In order to account for this nested design when assessing group differences in diversity values (ASD vs TD), we began by using linear mixed effects models that included confederate ID as a random effect (function 'lmer' from package 'lme4' in R) (R Core Team, 2017; Bates et al., 2015).
We measured the contribution of random effects to the model by comparing the conditional and marginal coefficients of determination, using the 'MuMIn' package and its 'r.squaredGLMM' function. The marginal and conditional coefficients of determination correspond to the variance explained by fixed effects alone and by both fixed and random effects combined, respectively. When there was no difference between the two (i.e., random effects did not contribute to model fit), we also fit ordinary linear regression models using the 'lm' function. Due to our small sample size (n = 44), simpler models were used when possible, to preserve degrees of freedom.
The ASD and TD groups did not differ significantly on mean age, sex ratio, or verbal IQ estimates, but there was a trend toward a difference in full-scale IQ (Table 1). To gauge the robustness of diagnostic group differences and check for the utility of these variables as potential predictors, we also fit models that included sex, age, and IQ as covariates. For the analysis of mouth movements, we used speech length (the sum of participant speech segments) as a covariate; more movement is expected with longer talk times, which may impact diversity. The pipeline for mouth movements described above is sensitive to overall head movements since facial bases may be spuriously activated with head movement. Therefore, we quantified the average head movement of each participant (as provided by OpenFace), by measuring the total motion of the head center during the conversation, and used it as another covariate.
Effect sizes for group differences are reported using Cohen's d. We calculated Cohen's d by dividing the estimated coefficient of the diagnostic variable (0: TD, 1: ASD) in the fitted model (lmer or lm) by the pooled standard deviation of the diversity value (i.e., the average standard deviation of the ASD and TD groups). Following Cohen (1988), d values between 0.20 and 0.50 reflect a small effect, between 0.50 and 0.80 a medium effect, and > 0.80 a large effect.
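This effect-size computation reduces to a one-liner; as a sanity check, plugging in the lexical-diversity group summaries reported in the Results (means 4.50 vs. 4.64; SDs 0.22 and 0.12), and treating the raw mean difference as a stand-in for the model coefficient, recovers a value close to the reported d = 0.82:

```python
def cohens_d(coef, sd_group1, sd_group2):
    """Cohen's d as defined in the text: the fitted model coefficient for
    diagnosis divided by the average of the two groups' standard deviations."""
    return coef / ((sd_group1 + sd_group2) / 2.0)

# Lexical diversity example from the Results section (ASD mean 4.50, TD mean 4.64).
print(round(abs(cohens_d(4.50 - 4.64, 0.22, 0.12)), 2))  # ≈ 0.82
```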
Agreement between lexical and mouth movement diversity was measured using Spearman's rank-order correlation coefficient.

Basic Conversational Differences
Preliminary analyses revealed that conversations differed on a variety of basic linguistic features, according to the diagnosis of the participant (Table 2; t-values of the main effect of diagnosis are reported; the random effect of confederate ID contributed only to models for confederate word count and conversational turns; ordinary linear models are reported for all other variables). Conversational length did not differ for ASD and TD participants, which was expected given the controlled 3-minute task design. Confederates produced comparable numbers of words regardless of the diagnosis of their conversational partners. However, participants with ASD produced fewer words than TD participants (p = 0.002), and conversational partners exchanged marginally fewer turns when the participant had ASD (p = 0.10). Participant groups did not differ on speech rate, but the ASD group had a significantly larger sum of Confederate-to-Participant (C2P) pauses than the TD group. These results demonstrate that participants with ASD produced fewer words and longer pauses than TD participants during the CASS, and trended toward engaging in fewer conversational turns despite comparable task duration.

Lexical Diversity
Preliminary analyses revealed that inclusion of confederate ID as a random effect did not significantly improve model fit for lexical diversity in any model that included diagnosis as a fixed effect; we therefore report ordinary linear models. A simple linear model revealed significantly reduced lexical diversity in participants with ASD (Mean=4.50, SD=0.22) as compared to TD participants (Mean=4.64, SD=0.12; t(42)=2.85, p=0.007, Cohen's d=0.82). The effect of diagnosis on diversity continued to be significant after accounting for age, IQ, and gender (t(39)=3.25, p=0.002). Diversity of confederate language did not differ by participant diagnosis (t(35.26)=0.17, p=0.86), suggesting that the effect of diagnosis on diversity in participants is driven by internal participant-level variables and not by differences in confederate language.
Given the expected (neurotypical) association between word count and entropy (Witten and Bell, 1990; Shannon, 1951), a second model was constructed that included word count, diagnosis, and the interaction between word count and diagnosis as predictors of participant lexical diversity. A significant interaction was revealed (Table 3), such that the slope of the relationship between word count and diversity was greater in the TD group than the ASD group (Figure 5).

Diversity of Mouth Movements
The random effect of confederate ID did not contribute to model fit when predicting mouth movement diversity; therefore, we report the results of ordinary linear models.
Mirroring our language findings, we observed a significant decrease in mouth movement diversity in the ASD group as compared to the TD group (Cohen's d=1.0, p=0.009) in a model using head movement and speech length as covariates. This difference remained significant when age, sex, and IQ were included as covariates (Cohen's d=1.0, t=-2.52, p=0.016). None of the covariates contributed significantly to the model. In contrast to the observed relationship between word count and word diversity (Table 3), there was no significant relationship between speech length and mouth movement diversity (t=0.50, p=0.619).

Correlations Between Language and Mouth Modalities
We also investigated whether the two modalities (mouth movement and words produced) provided redundant information when characterizing ASD-related restriction in oral-motor and linguistic diversity. The diversity values of the two modalities were not significantly correlated in the ASD group (Spearman's r=-0.08, p=0.758), in the TD group (Spearman's r=-0.11, p=0.566), or across the sample as a whole (Spearman's r=0.18, p=0.240). This suggests that lexical and oral-motor diversity provide unique information, and could potentially account for independent variance in future models designed to predict restricted interests/repetitive behaviors in ASD.

Figure 5: The relationship between word count and linguistic diversity differed by diagnostic status, with a steeper slope in the TD group than the ASD group.

Discussion
In this study, we identified medium-to-large group differences in behavioral entropy in adults with ASD vs. neurotypical adults, specifically in the areas of word production and mouth movement. This study is the first to use both computer vision and computational linguistics to show a "restricted" dimension in adult conversations with non-clinicians (most prior research used children's interactions with psychologists during semi-structured clinical evaluations) (Rouhizadeh et al., 2015; van Santen et al., 2013). In addition to basic group differences, our results revealed a novel interactive effect of word count and diagnosis on lexical diversity. As participants with ASD produced increasing numbers of words, they did not reach the same levels of linguistic diversity as their non-ASD peers. This gap may widen over the course of longer conversations, and may differ by word category (e.g., function words vs. content words). We will explore these questions in future research with longer samples, wherein we evaluate the relationship between reduced linguistic diversity and impressions of social communication ability by gathering post-conversational ratings of social communication quality from confederates.
Our finding that mouth movements are less diverse in ASD is also novel. One possible explanation for this finding is subtle oral-motor impairments in the ASD sample, as children with ASD have been reported to have oral-motor deficits (Adams, 1998), and oral-motor abilities in infancy and toddlerhood predict later speech fluency (Gernsbacher et al., 2008). However, all participants in this study were fluent English speakers without overt oral-motor impairments. Reduced phonological diversity could also result in restricted mouth movements, a hypothesis that will be explored in future analyses.
Reduced facial expressiveness (McIntosh et al., 2006), atypical expressiveness (Samad et al., 2018;Loveland et al., 1994), and limited integration of expressions and vocalizations (Lord et al., 2012a) have all been reported in ASD, which could lead to reduced diversity in mouth movements. Typically, when people take part in a conversation, vocalizations are accompanied by subtle changes in facial expressions (Busso and Narayanan, 2007). Integration across different modalities (e.g., language and facial expressions) is a critical aspect of social communication, and impairment in this area is assessed in common diagnostic instruments for ASD, such as the ADOS (Lord et al., 2012a). However, to the best of our knowledge, there are no objective methods for directly quantifying the degree to which such integration occurs during natural conversations. Development of novel computational tools to fill this gap is an especially promising future direction.
Of clinical note, adults with ASD who participated in our study had just completed an intensive intervention to improve social interaction skills.
It is striking that decreased entropy was evident across domains in this sample, despite the recent intervention that targeted social reciprocity and conversational skills. This suggests that our results may in fact underestimate the magnitude of differences that could be present in untreated individuals.

Conclusion
Adults with ASD exhibit restricted/repetitive patterns of behavior (APA, 2013), but computational efforts to quantify the restricted/repetitive dimension in real-world contexts are just beginning to emerge (Rouhizadeh et al., 2015; Bone et al., 2015; Goodwin et al., 2014). This knowledge gap makes adult impairments difficult to treat, and tracking the effectiveness of interventions that target RRBs is a significant challenge for clinicians and researchers. Our results suggest that cross-domain entropy during naturalistic conversations could serve as a quantitative behavioral marker of ASD.
This study advances the field by applying computational methods across oral-motor and lexical domains, to identify restricted patterns of behavior in ASD in real-world contexts. In future research, we will explore relationships between reduced behavioral diversity and clinical phenotype, with the goal of moving beyond group differences to predict individual variability, and establishing external validity with established measures. We envision that future iterations of the methods described here will be utilized to identify and track highly quantifiable treatment targets in the area of restricted/repetitive behaviors, and will advance the goal of precision medicine for individuals with autism and their families.