Analysis of Behavior Classification in Motivational Interviewing

Analysis of client and therapist behavior in counseling sessions can provide helpful insights for assessing the quality of the session and, consequently, the client's behavioral outcome. In this paper, we study the automatic classification of standardized behavior codes (annotations) used for the assessment of psychotherapy sessions in Motivational Interviewing (MI). We develop models for classifying client behaviors throughout MI sessions, comparing the performance of models trained on large pretrained embeddings (RoBERTa) versus interpretable, expert-selected features (LIWC). Our best-performing model, using pretrained RoBERTa embeddings, beats the baseline model, achieving an F1 score of 0.66 on subject-independent 3-class classification. Through statistical analysis of the classification results, we identify prominent LIWC features that may not have been captured by the model using pretrained embeddings. Although classification using LIWC features underperforms RoBERTa, our findings motivate the future direction of incorporating auxiliary tasks in the classification of MI codes.


Introduction
Motivational Interviewing (MI) is a psychotherapy treatment style for resolving ambivalence toward a problem such as alcohol or substance abuse. MI approaches focus on eliciting clients' own intrinsic reasons for changing their behavior toward the desired outcome. MI commonly leverages a behavioral coding (annotation) system, the Motivational Interviewing Skills Code (MISC) (Miller et al., 2003), which human annotators follow to code both the client's and the therapist's utterance-level intentions and behaviors. These codes have been shown to be effective means of assessing the quality of the session, training therapists, and estimating clients' behavioral outcomes (Lundahl et al., 2010; Diclemente et al., 2017; Magill et al., 2018). Due to the high cost and labor-intensive procedure of manually annotating utterance-level behaviors, existing efforts have worked on automatic coding of MI behaviors. Client utterances throughout the MI session are categorized based on their expressed attitude toward behavior change: (1) Change Talk (CT): willing to change, (2) Sustain Talk (ST): resisting change, and (3) Follow/Neutral (FN): other talk unrelated to change. An example conversation between a therapist (T) and a client (C) is shown below.
• T: [...] you talked about drinking about 7 times a week [...] Does that sound about right, or?
• C: I don't know so much any, like 5, probably like, the most 4 now, in the middle of the week I try to just kinda do work, (CT)
• C: I mean, like I would (ST)
• C: but, but getting up's worse, it's like being tired, not so much hungover just feeling uhh, class. [...] (CT)
• T: When you do drink, how much would you say, would you say the ten's about accurate?
• C: About around ten, maybe less, maybe more, depends like, I don't really count or anything but, it's probably around ten or so. (FN)

Previous work in the MI literature has mainly approached automatic classification of behavior codes by modeling utterance-level representations. Aswamenakul et al. (2018) trained a logistic regression model using both interpretable linguistic features (LIWC) and GloVe embeddings, finding that Sustain Talk is associated with a positive attitude toward drinking, and Change Talk with the opposite. To account for dialog context, Can et al. (2015) formulated the task as a sequence labeling problem and trained a Conditional Random Field (CRF) to predict MI codes. More recent approaches leveraged advances in neural networks, using standard recurrent neural networks (RNNs) (Ewbank et al., 2020; Huang et al., 2018) or hierarchical encoders with attention (Cao et al., 2019). In addition to context modeling, Tavabi et al. (2020) leveraged pretrained contextualized embeddings (Devlin et al., 2019) and incorporated the speech modality to classify MI codes, beating the previous baseline of Aswamenakul et al. (2018) on a similar dataset. As with many other NLP tasks, most of the gain seemed to come from powerful pretrained embeddings. However, it is unclear what these BERT-like embeddings learn, as they are not as interpretable as psycholinguistically motivated features (LIWC).
In this paper, we study the quality of automatic MI coding models in an attempt to understand what distinguishes the language patterns of Change Talk, Sustain Talk, and Follow/Neutral. We develop a system for classifying clients' utterance-level MI codes by modeling the client's utterance and the preceding context history from both the client and the therapist. We compare the effectiveness and interpretability of contextualized pretrained embeddings and hand-crafted features by training classifiers using (1) pretrained RoBERTa embeddings (Liu et al., 2019) and (2) an interpretable, dictionary-based feature set, Linguistic Inquiry and Word Count (LIWC) (Pennebaker et al., 2001). Our best-performing model outperforms the baseline model from previous work on the same dataset (Tavabi et al., 2020), improving macro F1 from 0.63 to 0.66.
In examining misclassifications by both models, we identify features that are significant across classes. Our findings suggest that large pretrained embeddings like RoBERTa, despite their high representational power, might not capture all the salient features that are important for distinguishing the classes. We identified prominent features that are statistically significant across classes, both on the entire dataset and on the misclassified samples. These findings suggest that our systems might benefit from fine-tuning the pretrained embeddings, adding auxiliary tasks (e.g., sentiment classification), and better context modeling.

Data
We use two clinical datasets (Borsari et al., 2015) collected on college campuses from real MI sessions with students experiencing alcohol-related problems. The data consists of transcripts and audio recordings of the client-therapist in-session dialogues. The sessions were manually transcribed and labeled per utterance using MISC codes. The dataset includes 219 sessions from 219 clients, comprising about 93k client and therapist utterances; clients and therapists account for 0.44 and 0.54 of the utterances, respectively. The dataset is highly imbalanced, with a class distribution of [0.13, 0.59, 0.28] for [Sustain Talk, Follow/Neutral, Change Talk]. In addition to the in-session text and speech data, the dataset contains session-level measures of clients' behavioral changes toward the desired outcome. Additional metadata includes session-level global metrics such as therapist empathy, MI spirit, and client engagement.

Embeddings and Feature sets
Pretrained RoBERTa Embeddings. RoBERTa (Liu et al., 2019) is an improved representation model based on BERT (Devlin et al., 2019). RoBERTa differs from BERT in several respects: removal of the Next Sentence Prediction objective, introduction of dynamic masking, and pretraining on a larger dataset with larger mini-batches and longer sequences. These changes can improve the representations for our data, especially since dialogue utterances in psychotherapy can consist of very long sequences. Our preliminary experiments fine-tuning both BERT and RoBERTa on our task showed that RoBERTa performed better; we therefore select RoBERTa to obtain utterance representations.
Interpretable LIWC Features. LIWC (Pennebaker et al., 2001) is a dictionary-based tool that assigns scores in psychologically meaningful categories, including social and affective processes, based on the words in a text input. It was developed by experts in social psychology and linguistics, and provides a mechanism for gaining interpretable and explainable insights into the text. Given our focus domain of clinical psychology, where domain knowledge is highly valuable, we select the psychologically motivated LIWC feature set as a natural point of comparison.
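As a rough illustration of how a dictionary-based tool of this kind scores text, consider the sketch below; the categories and word lists are hypothetical stand-ins, not the actual (proprietary) LIWC dictionary, and real LIWC also handles word stems and hierarchical categories.

```python
# Illustrative dictionary-based scoring in the style of LIWC.
# Categories and word lists below are hypothetical stand-ins.
CATEGORIES = {
    "assent": {"yes", "ok", "okay", "agree", "yeah"},
    "negate": {"no", "not", "never", "cannot"},
}

def category_scores(text):
    """Return, per category, the percentage of tokens that fall in it."""
    tokens = text.lower().split()
    if not tokens:
        return {name: 0.0 for name in CATEGORIES}
    return {
        name: 100.0 * sum(t in words for t in tokens) / len(tokens)
        for name, words in CATEGORIES.items()
    }

scores = category_scores("yeah ok I guess I should not drink")
# scores["assent"] is 25.0 (2 of 8 tokens), scores["negate"] is 12.5
```

Each utterance thus maps to a fixed-length vector of category percentages, which is what makes downstream statistical analysis of individual features straightforward.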

Classification Model
For classifying the clients' MI codes, we learn the client utterance representation using the features described in the previous section, as well as the preceding history from both the client and the therapist. The input window includes the current utterance and its history context. Specifically, the input window spans a total of 3 or more turn changes across speakers, where each turn consists of one or more consecutive utterances by the same speaker. At the beginning of the session, where the history is shorter than the specified threshold, the context consists of whatever preceding utterances are available. The size of the context window was selected empirically from among 3, 4, and 5 turn changes.
Our input samples contain between 6 and 28 utterances depending on the dynamics of the dialogue; for example, an input could be [T C T T T C C T C], where T denotes a therapist utterance and C a client utterance. The motivation for encoding the entire window, ending with the current utterance, is that the encoding by our recurrent neural network (RNN) carries more information from the final utterance and the closer context, while retaining relevant information from the beginning of the window. We also investigated encoding the current utterance separately from the context using a linear layer, but did not see improvements in the classification results.
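The windowing described above can be sketched as follows; the function name and the exact turn-change counting are our illustrative reading of the procedure, not released code.

```python
def context_window(utterances, idx, max_turn_changes=3):
    """Collect the current utterance plus preceding context, walking
    backward until more than `max_turn_changes` speaker changes occur.
    `utterances` is a list of (speaker, text) pairs and `idx` indexes
    the current client utterance being classified."""
    window = [utterances[idx]]
    changes = 0
    for i in range(idx - 1, -1, -1):
        # A turn change occurs when the speaker differs from the
        # earliest utterance collected so far.
        if utterances[i][0] != window[0][0]:
            changes += 1
            if changes > max_turn_changes:
                break
        window.insert(0, utterances[i])
    return window

dialogue = [("T", "t1"), ("C", "c1"), ("T", "t2"), ("C", "c2"), ("C", "c3")]
# With a budget of 2 turn changes, the window for "c3" keeps c1 onward.
print(context_window(dialogue, 4, max_turn_changes=2))
```

Near the start of a session the loop simply runs out of history, matching the truncated-context behavior described in the text.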
For RoBERTa embeddings, each utterance representation is the concatenation of (1) the CLS token, (2) mean pooling of the tokens from the last hidden state, and (3) max pooling of the tokens from the last hidden state. Figure 1 illustrates this process. For LIWC representations, the features are already extracted at the utterance level. Additionally, for both RoBERTa and LIWC representations, we add a binary dimension to each utterance to indicate the speaker. The history context representation for both RoBERTa and LIWC is obtained by concatenating the utterance-level representation vectors into a 2D matrix. These inputs are then fed into a unidirectional GRU, and its last hidden state is passed to the final classification layer.
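A minimal sketch of this utterance representation, operating on plain Python lists for clarity; in practice `cls_vec` and `token_vecs` would be tensors from the RoBERTa encoder's last hidden state.

```python
def utterance_vector(cls_vec, token_vecs, is_client):
    """Concatenate the CLS embedding with mean pooling and max pooling
    of the token embeddings, plus a binary speaker indicator
    (1.0 = client, 0.0 = therapist)."""
    dim = len(cls_vec)
    mean_pool = [sum(v[d] for v in token_vecs) / len(token_vecs)
                 for d in range(dim)]
    max_pool = [max(v[d] for v in token_vecs) for d in range(dim)]
    return cls_vec + mean_pool + max_pool + [1.0 if is_client else 0.0]

# Toy example with 2-dimensional embeddings and two tokens:
vec = utterance_vector([0.0, 1.0], [[1.0, 2.0], [3.0, 4.0]], is_client=True)
# vec == [0.0, 1.0, 2.0, 3.0, 3.0, 4.0, 1.0]
```

Stacking these vectors across the window yields the 2D matrix fed to the GRU.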

Results and Discussions
For training, we use 5-fold subject-independent cross-validation. 10% of the training data from each fold is randomly selected in a stratified fashion and held out as the validation set. We optimize the network using AdamW (Loshchilov and Hutter, 2019), with a learning rate of 10^-4 and a batch size of 32. We train our model for 25 epochs with early stopping after 10 epochs, and select the model with the highest macro F1 on the validation set. To handle class imbalance, we use a cross-entropy loss with a weight vector inversely proportional to the number of samples in each class. The GRU hidden dimension is 256 for RoBERTa representations and 32 for LIWC representations.
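One common way to realize weights "inversely proportional to the number of samples in each class" is sketched below; the normalization so that the weights sum to the number of classes is our assumption (a frequent convention), since the paper does not specify a scaling.

```python
def inverse_frequency_weights(class_counts):
    """Cross-entropy class weights inversely proportional to class size,
    normalized (by convention, an assumption here) to sum to the number
    of classes."""
    inv = [1.0 / c for c in class_counts]
    scale = len(class_counts) / sum(inv)
    return [w * scale for w in inv]

# With the paper's class distribution [ST, FN, CT] = [0.13, 0.59, 0.28],
# the minority class Sustain Talk receives the largest weight:
weights = inverse_frequency_weights([0.13, 0.59, 0.28])
```

The resulting vector would be passed, e.g., as the `weight` argument of a weighted cross-entropy loss so that errors on Sustain Talk are penalized most.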
We compare our work to the best-performing model from previous work (Tavabi et al., 2020), trained on the same dataset under the same evaluation protocol. Briefly, this baseline differs from our current model in several respects: BERT embeddings are used as input; the representation vector for the current client utterance is fed into a linear layer; the client and therapist utterances within the context window are separated, mean-pooled, and fed individually into two different linear layers; and the output encodings from the three linear layers are merged and fed into another linear layer before the classification layer.
We perform statistical analysis to identify prominent LIWC features across pairs of classes, as well as across the samples misclassified by each classifier. Since the classifiers encode context, we incorporate the context in the statistical analysis by averaging the feature vectors over the utterances within the input window.

Classifier Performance
The classification results are shown in Table 1. The model trained on RoBERTa embeddings outperforms the model trained on LIWC features, in addition to beating the baseline model of Tavabi et al. (2020), with F1-macro=0.66. The improvement over the baseline is likely due to the following: 1) the previous model encodes the client and therapist utterances from the context history separately, and therefore potentially misses information from the dyadic interaction; 2) the RNN in our current model temporally encodes the dyadic interaction window; 3) RoBERTa embeddings improve over BERT embeddings, as RoBERTa was trained on larger datasets and longer sequences, making its representations more powerful. Results from other work on classifying client codes in MI range from F1-macro=0.44 (Can et al., 2015) to F1-macro=0.54 (Cao et al., 2019) on different datasets. Aswamenakul et al. (2018), who used a dataset similar to ours, reached F1-macro=0.57. Huang et al. (2018) obtained F1-macro=0.70 by using (ground truth) labels from prior utterances as model input and domain adaptation for theme shifts throughout the session.
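For reference, macro F1, the metric reported throughout, is the unweighted mean of per-class F1 scores, so minority classes such as Sustain Talk count as much as the majority class. A minimal sketch (equivalent in spirit to standard library implementations):

```python
def macro_f1(y_true, y_pred, labels=("ST", "FN", "CT")):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # F1 is 0 when the class is never correctly predicted.
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)
```

Because every class contributes equally to the average, a model that ignores Sustain Talk is penalized even though that class is only 13% of the data.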

Error Analysis
The F1 scores show that Sustain Talk, the minority class, is consistently the hardest to classify, and Follow/Neutral, the majority class, the easiest. This mirrors findings from previous work, e.g., Can et al. (2015), and remains a challenge in automated MI coding. Using approaches like upsampling toward a more balanced dataset will be part of our future work. For these systems to be deployable in the clinical setting, the standard we adhere to is guided by a range developed by biostatisticians in the field, in which values higher than 0.75 are considered "excellent" (Cicchetti, 1994). Therefore, despite the good results, there is much room for improvement before such systems can be autonomously utilized in real-world MI sessions.

Figure 2 shows the confusion matrices for the models using LIWC features vs. RoBERTa embeddings. Comparing between classes, Sustain Talk is misclassified about equally as Follow/Neutral and Change Talk by the RoBERTa model, but much more often as Change Talk by the LIWC model. Conversely, Change Talk is more often misclassified as Follow/Neutral by RoBERTa, but as Sustain Talk by LIWC. We also experimented with a simple concatenation of RoBERTa and LIWC features, but did not find significant improvements over the RoBERTa-only model. Better methods for combining RoBERTa and LIWC features might improve our results; this will be part of future work.

Salient Features
Statistical analysis of LIWC features across the classes can help identify the salient features distinguishing the classes, and can therefore signal important information picked up by the LIWC classifier. We used a hierarchical Analysis of Variance (ANOVA), with talk types nested under sessions to account for individual differences, to find linguistic features that are significantly different across MI codes. To further examine statistical significance across pairs of classes, we performed a Tukey post hoc test. We found the following features to be the most statistically different across all pairs of classes: 'WPS' (mean words per sentence), 'informal', 'assent' (e.g., agree, ok, yes), and 'analytic'. Additionally, 'AllPunc' (use of punctuation) and 'function' (use of function words such as pronouns) were prominent features that significantly distinguished Follow/Neutral from the other classes.
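The hierarchical ANOVA above nests talk types under sessions; as a simplified illustration of the underlying test statistic, a plain one-way ANOVA F-statistic (ignoring the session nesting, so not the exact analysis used here) can be computed as:

```python
def one_way_anova_f(groups):
    """F-statistic for a one-way ANOVA: ratio of between-group to
    within-group mean squares. A simplification of the hierarchical
    ANOVA used in the paper (no nesting under sessions)."""
    n = sum(len(g) for g in groups)          # total observations
    k = len(groups)                          # number of groups (classes)
    grand = sum(sum(g) for g in groups) / n  # grand mean
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Each group would hold one LIWC feature's window-averaged values for
# one talk type (ST, FN, or CT); toy numbers shown here:
f = one_way_anova_f([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
# f == 13.5 for this toy input
```

A large F indicates that a feature's mean differs across talk types far more than it varies within them, which is what flags features like 'WPS' or 'assent' as discriminative.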
We further looked into samples that the RoBERTa model misclassified but the LIWC model classified correctly, i.e., samples where the RoBERTa representations might be limited. Using ANOVA, we found the most prominent features in such samples across the 3 classes: 'swear' (6.06), 'money' (5.29), 'anger' (2.24), 'death' (2.19), and 'affiliation' (2.00), where numbers in parentheses denote the F-statistic from the hierarchical ANOVA. This is consistent with our error analysis in Section 4.2, as shown in Figure 3. The mean scores of the 'swear,' 'money,' and 'anger' categories are higher for Change Talk than for the other classes. We hypothesize that 'swear' and 'anger' in Change Talk may represent anger toward oneself regarding drinking behavior. Words in the 'money' category might relate to the high cost of alcohol (especially for college-age clients), which can be a motivation for behavior change. The Change Talk samples misclassified by the RoBERTa model may indicate the model's failure to capture such patterns.

Conclusion
We developed models for the classification of clients' MI codes. We experimented with pretrained RoBERTa embeddings and interpretable LIWC features as model inputs; the RoBERTa model outperformed the baseline from previous work, reaching F1=0.66. Through statistical analysis, we investigated prominent LIWC features that are significantly different across pairs of classes. We further examined samples misclassified by each classifier and identified prominent features that may not have been captured by the RoBERTa model. This finding motivates the use of auxiliary tasks such as sentiment and affect prediction, in addition to fine-tuning the model on domain-specific data and better context modeling.
With this work, we aim to develop systems for enhancing effective communication in MI, which can potentially generalize to other types of therapy approaches. Identifying patterns of change language can lead to MI strategies that will assist clinicians with treatment, while facilitating efficient means for training new therapists. These steps contribute to the long-term goal of providing cost- and time-effective evaluation of treatment fidelity, education of new therapists, and ultimately broadening access to lower-cost clinical resources for the general population.