Towards Low-Resource Real-Time Assessment of Empathy in Counselling

Gauging therapist empathy in counselling is an important component of understanding counselling quality. While session-level empathy assessment based on machine learning has been investigated extensively, it relies on relatively large amounts of well-annotated dialogue data, and real-time evaluation has been overlooked in the past. In this paper, we focus on the task of low-resource utterance-level binary empathy assessment. We train deep learning models on heuristically constructed empathy vs. non-empathy contrast in general conversations, and apply the models directly to therapeutic dialogues, assuming correlation between empathy manifested in those two domains. We show that such training yields poor performance in general, probe its causes, and examine the actual effect of learning from empathy contrast in general conversation.


Introduction
As a pillar of psychotherapy, empathy is crucial to effective counselling, owing to its importance in building counsellor-client rapport (Elliott et al., 2011) that can enable more effective interventions and better outcomes (McCambridge et al., 2011; Gaume et al., 2009). In particular, "listening with empathy" is considered a guiding principle (Rollnick et al., 2008) for motivational interviewing (MI) (Miller and Rollnick, 2012), a psychotherapeutic approach widely adopted to elicit positive behaviour change by evoking motivation from clients. Gauging counsellor-side empathy is, therefore, essential to assessing MI integrity (Moyers et al., 2016).
Empathy assessment for MI has conventionally been conducted manually by trained annotators, which requires extensive annotator training and transcript review. Since such a time-consuming and costly setup is difficult to scale up, recent years have seen attempts at automating the process with machine learning, including transcript-based (Xiao et al., 2012; Gibson et al., 2015), speech-based (Xiao et al., 2014, 2015), and multimodal (Xiao et al., 2016b) methods. Those works are, however, limited in that 1) therapist empathy is only assessed at session level rather than utterance level; 2) classical machine learning with heuristic feature engineering is used, while recent deep-learning frameworks have not been utilised for this purpose; 3) the machine-learning-based approaches all assume access to privately-owned sizeable corpora of therapeutic dialogues with empathy annotation at session level, but in reality such well-annotated data are often very limited, even more so at utterance level; and 4) the link between empathy manifested in general conversation and in MI counselling remains unexplored.
In this work, we make the first attempt (to the best of our knowledge) at addressing those limitations while probing the correlation between empathy manifestations in different domains. Specifically, we employ pre-trained language models such as BERT (Devlin et al., 2019) for text-based binary classification of utterance-level therapist empathy, optionally taking the conversation context as input. We consider any counsellor utterance to be empathetic if it shows empathy, and non-empathetic if it does not (ranging from neutral to apathetic). Our models have no access to counselling conversations during their training and validation, as we experiment with learning from contrast of empathy vs. non-empathy in out-of-domain (OOD) training data. To that end, we leverage publicly available datasets of general conversations with heuristic empathy labels (Rashkin et al., 2019;Zhong et al., 2020) for OOD training, investigating the connections between general-conversational empathy and therapeutic empathy, as illustrated in Figure 1.
Figure 1: Training a binary empathy classifier on heuristically constructed empathetic vs. non-empathetic utterances in general conversations (i.e. out-of-domain w.r.t. MI), and then testing it on MI conversations. In this case, the empathy contrast for training is r/OffMyChest vs. r/CasualConversation. The classifier can take only the listener/therapist utterance (bold) as input or additionally use the preceding speaker/client utterance (italic).

To benchmark the models, we manually annotated utterance-level empathy for a subset of transcribed high- vs. low-quality counselling demonstrations (Pérez-Rosas et al., 2019) that are publicly available. We also build unsupervised baselines for the task by a) formulating binary empathy classification as natural language inference (NLI), as proposed by Yin et al. (2019), and b) tackling the surrogate task of client-counsellor agreement via NLI, under the assumption that an empathetic reply from the counsellor tends to show accordance with the client utterance in the preceding turn. Our experiments show that models trained on OOD empathy contrast are not sufficiently accurate predictors of MI empathy/non-empathy, even though the benefit of such training can be observed when compared to training on OOD data without empathy contrast. Upon probing, we argue that more fine-grained (e.g. sentence-level) empathy annotation and prediction could yield better results.


Related Work

Xiao et al. (2012) proposed one of the earliest approaches for utterance-level empathy classification, using an n-gram language model. Psycholinguistic norm features are used in addition to other linguistic features in the work of Gibson et al. (2015). More recently, long short-term memory networks (LSTMs) (Hochreiter and Schmidhuber, 1997) have been utilised to generate turn-level behavioural acts that are further processed by a deep neural network to predict session-level empathy.
Speech features have also been examined. Xiao et al. (2014) investigated features such as jitter and shimmer from speech signals, Xiao et al. (2015) studied speech rate entrainment, while Pérez-Rosas et al. (2017) used an array of acoustic and linguistic features to train their multimodal models.
There are also a number of recent studies on data-driven MI behaviour coding based on text (Cao et al., 2019; Tanana et al., 2016; Xiao et al., 2016a; Gibson et al., 2018), speech (Singla et al., 2020), and both (Chen et al., 2019; Flemotomos et al., 2021), but they are less relevant to this work due to their lack of explicit empathy modelling. Different from the research listed above, this work addresses utterance-level empathy classification instead of session-level assessment, similar to Wu et al. (2020), which proposes utterance-level prediction of whether the therapist needs to show empathy given the context.

Data-Driven Text-Based Research on Empathy in General Conversation
Recent years have witnessed a boom of research on data-driven analysis and application of empathy in general conversations. In terms of empathy analysis for open-domain conversations, Zhou et al. (2021) addressed scoring empathy grounded in specific situations, Welivita and Pu (2020) created a taxonomy of empathetic response intents in social dialogues, while Guda et al. (2021) proposed to take user demographic information into account for empathy prediction.
As therapeutic conversation data is scarce, recent works on empathy analysis have also turned to peer-support dialogues from online communities. Zhou and Jurgens (2020) analysed Reddit conversations for the relationships between condolence, distress and empathy, Hosseini and Caragea (2021) studied empathy seeking and providing with dialogues from a cancer survivor network, and Sharma et al. (2020) proposed an empathy framework of reaction-interpretation-exploration for conversations from mental-health-related online forums.
While early general empathetic chatbots (Zhou and Wang, 2018; Lubis et al., 2018) were mostly based on recurrent neural networks and produced emotion-conditioned output, their more recent counterparts are predominantly based on pre-trained language models and leverage emotions in various ways, including emotion detection as an auxiliary objective (Lin et al., 2020), emotion-based mixture-of-experts decoding (Lin et al., 2019), and rewarding response candidates likely to induce positive user emotion.

Data
We leverage two types of data: general conversations and transcripts of MI demonstration videos.
We define an utterance as everything said by an interlocutor in their turn in a 2-person conversation, which is the most widely used definition of utterance in the literature of deep-learning-based conversational intelligence. This differs from some utterance definitions in psychotherapy. For example, an "utterance" in this work is identical to a "volley" as defined in the motivational interviewing skill code (MISC) (Miller et al., 2003), while an "utterance" in MISC is "a complete thought" that "ends either when one thought is completed or a new thought begins with the same speaker, or by an utterance from the other speaker".

General Conversations
Our general conversation data is from two datasets: Persona-based Empathetic Conversation (PEC) (Zhong et al., 2020) and EmpatheticDialogues (ED) (Rashkin et al., 2019). Their statistics are listed in Table 1. For each 2-interlocutor dialogue, we consider the initiator of the conversation as the speaker and the other as the listener.
PEC consists of general conversations crawled from 3 subreddits: r/Happy (r/H), r/OffMyChest (r/OMC), and r/CasualConversation (r/CC). Reddit users exchange happy experiences and thoughts in r/H, share emotional stories that cannot be told easily in r/OMC, and simply talk casually in r/CC. The original PEC dataset includes conversations between more than two participants, and some conversations are subsets of other conversations (e.g. a 2-turn conversation that in effect constitutes the first 2 turns of a 4-turn conversation). To align with the counsellor-client nature of therapeutic conversations, we retain only the non-subset conversations between 2 interlocutors; the filtered PEC contains around 56% of the conversations in the original dataset.
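The filtering described above can be sketched as follows. The conversation representation (a list of (speaker, utterance) turns) and the function name are illustrative assumptions, not the paper's actual code.

```python
def filter_pec(conversations):
    """Keep only 2-interlocutor conversations that are not a
    prefix (i.e. subset) of any other conversation in the corpus.

    Each conversation is assumed to be a list of
    (speaker_id, utterance) turns.
    """
    # Restrict to dialogues between exactly 2 interlocutors.
    two_party = [c for c in conversations
                 if len({spk for spk, _ in c}) == 2]

    def is_prefix_of_other(conv):
        # A conversation is a subset if it equals the first
        # len(conv) turns of some longer conversation.
        return any(other != conv and other[:len(conv)] == conv
                   for other in two_party)

    return [c for c in two_party if not is_prefix_of_other(c)]
```

Applied to the whole corpus, this kind of filter would retain the roughly 56% of PEC conversations described above.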
EmpatheticDialogues (abbreviated as ED) is comprised of 23.1K general conversations from MTurker pairs. The speaker of each dialogue was first given an emotion label (e.g. "Afraid"), then described a situation where they had felt the emotion before (e.g. "I've been hearing noises around the house at night"), and finally initiated the conversation about this situation with a listener.

Empathy vs. Non-Empathy
We divide the general conversation data into 2 parts: empathetic-listener conversations and non-empathetic-listener ones. Specifically, we assign "empathetic" labels to all the listener utterances of the dialogues in r/H, r/OMC and ED, and "non-empathetic" to the counterparts in r/CC.
For PEC, the heuristic empathy labelling is based on the annotator ratings from the original paper, which suggest that comments (i.e. listener utterances) in r/H and r/OMC are significantly more empathetic than those in r/CC; the inter-annotator agreement on this, as measured by Fleiss' kappa (Fleiss, 1971), was "substantial". For ED, the empathy labelling is intuitive, as the authors explicitly instructed the "listeners" to respond empathetically during the data collection. We note that our heuristic labelling for PEC and ED is based on the corpus-level labels given by the creators of the datasets, so it may not be completely accurate at utterance or sentence level. We nevertheless utilise the heuristic labels for our experiments and leave more fine-grained annotation for future work.

Table 1: Statistics of PEC and ED. For PEC, we utilise 2-interlocutor conversations only. #Conv: number of conversations in the data split. We consider r/Happy, r/OffMyChest and EmpatheticDialogues to consist of mostly empathetic (‡) listener utterances and r/CasualConversation to be comprised of predominantly non-empathetic (¶) ones. Note that the statistics of PEC are about the filtered dataset as described in Section 3.1. See Table 4 for more details.

Motivational Interviewing
Our counselling conversations are from Pérez-Rosas et al. (2019), who collected the first and only (to the best of our knowledge) publicly available dataset of MI conversations. The dialogues are the transcripts of 152 demonstrations of high-quality (MI adherent) and another 101 of low-quality (MI non-adherent) counselling from video-sharing platforms such as YouTube and Vimeo. The original transcripts were obtained with the automatic captioning tool of YouTube, so the conversations have minor transcription errors and are mostly without punctuation. We refer to this dataset as ROLEPLAYMI, and list its statistics in Table 2.

Manual Empathy Annotation
We select a subset of ROLEPLAYMI to manually annotate utterance-level empathy to build a benchmark dataset for our models. The annotation guideline follows the definition of high empathy in MISC: counsellors high on the empathy scale show an active interest in making sure they understand what the client is saying, including the client's perceptions, situation, meaning, and feelings. We ask the annotators to consider an utterance that shows MISC-defined high empathy as empathetic, and otherwise as non-empathetic. Thus, non-empathy in this context can range from neutrality to apathy.

Table 2: Statistics of ROLEPLAYMI and ANNO. #Conv: number of conversations in the subset. "T-u" is short for "Therapist Utterance". See Table 5 for more details.

                        ROLEPLAYMI          ANNO
                        High    Low     High    Low
  %(emp. T-u)           n/a     n/a     38.7%   2.3%
  %(¬Q. T-u)            n/a     n/a     71.9%   73.8%
  p(emp | ¬Q, T-u)      n/a     n/a     0.50    0.03
  p(emp | Q, T-u)       n/a     n/a     0.10    0.00

We choose 7 transcripts (217 counsellor utterances in total) from the high-quality subset with negligible transcription errors, and 14 transcripts (214 counsellor utterances in total) from the low-quality one. The 431 selected utterances are presented to 2 human annotators for binary utterance-level empathy annotation. One annotator is a senior researcher who has received formal MI training in the past, and the other is a PhD student who has read in depth about MI (incl. Rollnick et al. (2008)). Their annotations show an inter-annotator agreement of 0.71 measured by Cohen's kappa (Cohen, 1968), indicating "substantial agreement". Finally, the annotators discussed their results and resolved the differences. The annotated MI conversations are denoted as ANNO in the rest of the paper.
As Table 2 shows, 38.7% of the therapist utterances in the high-quality subset are empathetic (i.e. 61.3% non-empathetic), while only 2.3% of those in the low-quality subset are (i.e. 97.7% non-empathetic), suggesting a marked difference between the empathy levels in high- and low-quality counselling.
We note that our empathy annotation is at utterance level on the punctuation-free MI transcripts, which means an utterance is marked as empathetic as long as a part of it is, even if the remainder is not. More fine-grained annotation would be possible with punctuated utterances, which we leave for future work.

Question & Empathy
Empirically, we observe that questions in MI do not show empathy in general, which is intuitive since the purpose of questions is to gather more information. Indeed, we notice that the vast majority of the examples of open and closed questions provided by MISC are not empathetic.
Therefore, we additionally conduct binary annotation for each therapist utterance in ANNO as to whether the utterance is (predominantly) a question, marking an utterance as a question utterance if more than half of its tokens constitute at least one open or closed question as defined by MISC. For instance, "it's good to see you up and about how are you feeling after your last little hospitalization" is considered a question utterance, since "how are you feeling after your last little hospitalization" is an open question and makes up more than half of the utterance. We denote the non-question subset of ANNO as ¬Q.ANNO.
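The token-majority rule above can be sketched as follows, assuming the MISC open/closed question spans within an utterance have already been identified (in the paper this is done by human annotators); the function name and the (start, end) span representation are hypothetical.

```python
def is_question_utterance(utterance, question_spans):
    """Mark an utterance as a question utterance if more than half of
    its whitespace tokens fall inside MISC-defined question spans.

    `question_spans` is a list of (start, end) token-index pairs, one
    per open/closed question; spans are assumed non-overlapping.
    """
    tokens = utterance.split()
    question_tokens = sum(end - start for start, end in question_spans)
    return question_tokens > len(tokens) / 2
```

On the paper's example, the open question "how are you feeling after your last little hospitalization" covers 9 of the 17 tokens, so the utterance is classified as a question utterance.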
The relationship between empathy and questions found in ANNO confirms our observation: a non-question therapist utterance from high-quality counselling is substantially more likely (0.50) to be empathetic than one from low-quality counselling (0.03). The same does not hold for question utterances (0.10 for high-quality and 0.00 for low-quality), which indicates that therapist questions are overall very unlikely to be empathetic.

General-Conversation Empathy vs. Therapeutic Empathy
Comparing ROLEPLAYMI with PEC & ED, we noticed a pronounced difference between empathy in general conversation and in therapy: an MI-adherent therapist tends to express empathy through non-questions (as shown in Table 2), e.g. "The blood sugars have increased some, so you're concerned that things are not as good as they were last time that we talked". Conversely, participants in general conversations often show empathy via questions, e.g. "Oh no! That's scary! What do you think it is?". Thus, analysing sentence-level empathy (instead of utterance-level) could better separate the empathetic and non-empathetic parts, and more overlap between general-conversation empathy and therapeutic empathy may be found in the non-question sentences. This was not possible in our experiments as ROLEPLAYMI is not punctuated, so we leave it for future work. We note that another domain difference is that ROLEPLAYMI consists of transcripts of spoken dialogues, whereas PEC and ED contain "written" chat conversations. This difference is smoothed by the high-quality transcription of the ROLEPLAYMI videos, so we do not use specific techniques to address it, but we plan to investigate this factor in future work.

Binary Empathy Classification
In this section, we first define the task of binary empathy classification, then lay out the out-of-domain empathy contrast strategy behind our supervised models for the task, and finally describe our unsupervised baselines driven by NLI.

Task Definition
We denote D_MI = {(u^C_i, u^T_i, e_i)}, i = 1, ..., N, as a collection of (client utterance, therapist utterance, empathy label) tuples, where u^T_i is the therapist reply to the client utterance u^C_i, e_i ∈ {emp, ¬emp} denotes whether u^T_i shows empathy, and N is the number of such tuples in the dataset. Our task can be formulated as follows: given u^T_i and optionally u^C_i for more context, predict the correct empathy label e_i of u^T_i. We use ANNO as D_MI.
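As a minimal sketch, one tuple of D_MI might be represented as follows; the class and field names are illustrative, not taken from the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class MIExample:
    """One (client utterance, therapist utterance, empathy label)
    tuple from D_MI."""
    client_utt: str      # u^C_i
    therapist_utt: str   # u^T_i
    empathetic: bool     # e_i: True for emp, False for ¬emp

# Illustrative example (the label here is hypothetical).
example = MIExample(
    client_utt="Everyone's getting on me about my drinking.",
    therapist_utt="Kind of like a bunch of crows pecking at you.",
    empathetic=True,
)
```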

Supervised Learning: Using Out-of-Domain Empathy Contrast
Since our manually annotated subset of ROLEPLAYMI is too small to be a proper training set, we resort to learning from out-of-domain (OOD, i.e. non-MI) empathy contrast. Specifically, as described in Section 3.1.1 and Figure 1, we utilise all listener utterances in r/H, r/OMC and ED as positive (empathetic) examples and their counterparts in r/CC as negative (non-empathetic) examples, as we aim to leverage parallels between general-conversation empathy and psychotherapeutic empathy. We build 3 empathy vs. non-empathy contrast pairs from general conversations: (r/H vs. r/CC); (r/OMC vs. r/CC); (ED vs. r/CC). For each pair, we sample an equal number of examples from the empathetic (positive) and non-empathetic (negative) subsets to construct a contrast dataset, where in each sample the empathy label e_j ∈ {emp, ¬emp} denotes whether the listener response u^L_j is empathetic towards its preceding speaker utterance u^S_j. Our sampling ensures that the 2 classes (i.e. emp & ¬emp) in each pair are balanced during training.

Table 3: An example premise (P) with three hypotheses (H) and their NLI labels.
P: Client: Everyone's getting on me about my drinking. | Therapist: Kind of like a bunch of crows pecking at you.
H: The therapist is empathetic towards the patient. → Entailment
H: The client wants to smoke more. → Neutral
H: The therapist is not listening to the client. → Contradiction
For each contrast pair, we train a 1-utterance general-conversation empathy classifier cls^(1) to predict e_j given u^L_j, as well as a 2-utterance counterpart cls^(2) to predict e_j given (u^S_j, u^L_j). Finally, we apply the trained cls^(1) and cls^(2) directly on D_MI, using u^C_i as u^S_j and u^T_i as u^L_j.
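The class-balanced sampling behind each contrast dataset can be sketched as follows; the function name is hypothetical, and capping at the smaller pool's size is an illustrative stand-in for the paper's choice of capping at the size of ED, its smallest dataset.

```python
import random

def build_contrast_pair(pos_utts, neg_utts, n=None, seed=0):
    """Sample an equal number of empathetic (label 1) and
    non-empathetic (label 0) listener utterances, so that the two
    classes are exactly balanced in the resulting dataset.
    """
    rng = random.Random(seed)
    if n is None:
        n = min(len(pos_utts), len(neg_utts))
    data = ([(u, 1) for u in rng.sample(pos_utts, n)]
            + [(u, 0) for u in rng.sample(neg_utts, n)])
    rng.shuffle(data)
    return data
```

Varying the seed gives the independently resampled groups used later to quantify sampling-induced variance.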

Unsupervised Baseline: Text Classification as Natural Language Inference
Natural language inference (NLI) is the task of determining whether a hypothesis is true (entailment), false (contradiction), or undetermined (neutral) given a premise (see https://paperswithcode.com/task/natural-language-inference); Table 3 shows an example. Following Yin et al. (2019), where NLI models prove effective as ready-made zero-shot sequence classifiers, we formulate our empathy classification task as an NLI problem. Assuming only u^T_i is available, we use it as the premise, and define the 1-utterance empathy hypothesis h^(1) as "This text is empathetic.". We then utilise an off-the-shelf NLI model M as an unsupervised 1-utterance empathy classifier nli_E^(1) to directly predict a label from {entailment, contradiction, neutral} given (u^T_i, h^(1)). We consider u^T_i to be classified as an empathetic utterance only if the predicted label is entailment.

We also investigate a client-therapist exchange scenario where both u^C_i and u^T_i are provided. The premise p_i is then formatted as "Client: u^C_i | Therapist: u^T_i", and we define the 2-utterance hypothesis as h^(2) = "The Therapist is empathetic towards the Client.". We use the same M as an unsupervised 2-utterance empathy classifier nli_E^(2) given the input (p_i, h^(2)). Again, only entailment is deemed equivalent to categorising u^T_i as empathetic.
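The NLI formulation can be sketched as below. `nli_model` is a stand-in for any NLI predictor (the paper uses BART-LARGE fine-tuned on MultiNLI) and is assumed to return one of the three NLI labels as a string; the function name is hypothetical.

```python
H1 = "This text is empathetic."
H2 = "The Therapist is empathetic towards the Client."

def nli_empathy(nli_model, therapist_utt, client_utt=None):
    """Cast binary empathy classification as NLI: build the premise,
    pick the matching hypothesis, and map the predicted NLI label to
    a binary decision (only 'entailment' counts as empathetic).
    """
    if client_utt is None:
        # 1-utterance setup, nli_E^(1): therapist utterance as premise.
        premise, hypothesis = therapist_utt, H1
    else:
        # 2-utterance setup, nli_E^(2): client-therapist exchange.
        premise = f"Client: {client_utt} | Therapist: {therapist_utt}"
        hypothesis = H2
    return nli_model(premise, hypothesis) == "entailment"
```

With a real model, `nli_model` would wrap a forward pass over the (premise, hypothesis) pair and an argmax over the three NLI classes.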

Unsupervised Baseline: Client-Therapist Agreement as Natural Language Inference
It is our observation from MISC as well as ROLEPLAYMI that an empathetic therapist tends to acknowledge the difficulties and feelings of clients, and hence we experiment with NLI-style modelling of client-therapist agreement. Specifically, we use M as an unsupervised 2-utterance agreement classifier nli_A^(C→T) to measure the agreement between u^C_i and u^T_i, using the former as the premise and the latter as the hypothesis. We interpret only an entailment prediction from M as the therapist agreeing with, and hence empathising with, the client.

Implementation
For OOD empathy contrast (Section 4.2), we keep the original train/dev/test splits of PEC and ED. Since the two datasets in each contrast pair can be vastly different in their sizes (e.g. ED has only 17.8K training examples whereas r/CC has 530.2K), we always sample the positive and negative subsets so that their sizes are identical to that of ED, the smallest dataset, which ensures a) the two classes are balanced in each pair, and b) different cls models are trained with equal amounts of data and their performances are hence comparable.
To minimise the bias in training data caused by such sampling, we train the classifier of each contrast pair 5 times, each time with its own randomly sampled data. Note that this leads to 5 different groups of class-balanced {train, dev, test} datasets for each pair.
We leverage pre-trained language models for all our experiments. BERT (Devlin et al., 2019) is the backbone of our OOD empathy contrast models; we choose its BERT-BASE-UNCASED variant. We add a fully connected layer atop the classification token ([CLS]) position of the language model to implement a binary classifier, and train the entire model end-to-end on the empathy contrast pairs. For the backbone M of the unsupervised zero-shot baselines, we use the BART-LARGE variant of BART (Lewis et al., 2020) that has been fine-tuned on MultiNLI (Williams et al., 2018). For more details, see Section B.
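Schematically, the added classification head looks like the following numpy sketch. This stands in for the actual end-to-end BERT fine-tuning: the encoder output is mocked, and the only detail tied to the paper's setup is the BERT-BASE hidden size of 768.

```python
import numpy as np

def cls_head(hidden_states, W, b):
    """Binary classification head over the [CLS] token.

    hidden_states: (seq_len, hidden_dim) encoder output for one input;
    with BERT-BASE-UNCASED, hidden_dim = 768.
    W: (hidden_dim, 2) weights and b: (2,) bias of the added
    fully connected layer.
    Returns softmax probabilities over {non-empathetic, empathetic}.
    """
    cls_vec = hidden_states[0]            # [CLS] is the first position
    logits = cls_vec @ W + b
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()
```

In training, the cross-entropy loss on these probabilities would be backpropagated through both the head and the encoder.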
To measure model performance on ANNO, we choose the Matthews correlation coefficient (MCC), since it is robust to class imbalance: only 38.7% of the ANNO examples from the high-quality subset are marked as empathetic, and only 2.3% from the low-quality subset. We also use MCC to measure test set performance for comparability.
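For reference, MCC can be computed from the binary confusion counts as below; this is the standard formula, not code from the paper.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0/1).

    Returns a value in [-1, 1]; 0 corresponds to chance-level
    prediction, which is what makes MCC robust to class imbalance
    (a degenerate all-one-class predictor scores 0, not high).
    """
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # conventional value when any marginal is empty
    return (tp * tn - fp * fn) / denom
```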

Results
We examine the performances achieved on ANNO by the models introduced in Section 4, namely the blue bars in the "OOD (1) w/ Contrast" (1-utterance models trained on OOD empathy contrast, i.e. cls^(1)), "OOD (2) w/ Contrast" (2-utterance models trained on OOD empathy contrast, i.e. cls^(2)), and "Baselines" subplots of Figure 2. The value of each blue bar indicates the mean MCC of the 5 models from the corresponding pair, and the error bar represents +/- one standard deviation from the mean, illustrating the variation among the scores of the 5 models.
Also, we show in Figure 3 the performances of the OOD models on their respective test sets. In the test set of each of the 5 models from a (D^+, D^-) OOD pair, we have N_T random samples from D^+ and another N_T from D^-, where N_T is the size of the original test set of ED, in line with our sampling method for the OOD training sets. The mean (bar value) and standard-deviation (error bar) representation follows that of Figure 2. By comparing the scores of the 5 models from an OOD setup on their own test sets and on ANNO, it becomes clear how the domain shift from general conversation to MI affects model performance.
We first observe that, while each test set in the OOD setups is different because we address class imbalance with random sampling, it is still obvious that the OOD models achieve considerably better scores on their test sets but experience significant drops on ANNO. In particular, ED vs. r/CC (2) reaches over 0.9 MCC on average on its test sets but only around 0.10 on ANNO. As a result, none of the OOD empathy contrast models can serve as a reliable indicator of therapeutic empathy.
There is also considerable variation in the scores on ANNO (but not on the test sets) of the OOD models from the same empathy contrast pair. For instance, while r/OMC vs. r/CC (2) reaches 0.17 MCC on average, the standard deviation is 0.03. Further, we find that among the 5 models of the r/OMC vs. r/CC (2) pair, the MCC can be as high as 0.21 and as low as 0.11, even though a) the 5 models only differ in the randomness of their training data sampling, and b) the models have negligible variation in their test set performances (Figure 3). This pattern is present in all the OOD models, revealing their brittleness w.r.t. MI empathy classification.
As for the choice between 1-utterance and 2-utterance input, the effects are mixed. Specifically, r/H vs. r/CC and ED vs. r/CC both show decreased performance on ANNO going from 1-utterance to 2-utterance, while r/OMC vs. r/CC benefits from this transition. In fact, in terms of the average score, r/OMC vs. r/CC (2) is the best setup. This could be because a client talks more about negative experiences in a therapy session, not unlike how the typical speaker shares emotional stories in r/OMC. In contrast, the speakers in r/H are more likely to tell positive experiences, which could explain the performance drop resulting from including the speaker utterance in r/H vs. r/CC (2).
The unsupervised zero-shot baselines do not fare better in general. nli_E^(1) and nli_E^(2) score around 0.05 and 0.02, respectively, both below most of the mean scores achieved by the OOD empathy contrast models. This can be attributed to the fact that knowledge gained from NLI tasks is not sufficient for reasoning about complex concepts such as empathy. nli_A^(C→T), on the other hand, shows better results and outperforms half of the OOD empathy contrast models, which suggests a correlation between client-therapist agreement and therapist empathy. As a probing step, we swap the client and therapist utterances to reverse the premise-hypothesis formulation and observe that this variant (nli_A^(T→C)) leads to a substantial drop to -0.04 MCC, further illustrating the aforementioned correlation.

Analysis
To shed light on the impact of the OOD design choices we made in Section 4, we add a control group of OOD models that are trained without empathy contrast for comparison, as shown by the blue bars in the "OOD (1) w/o Contrast" and "OOD (2) w/o Contrast" subplots. More specifically, we build 3 pairs: (r/OMC vs. r/H), (ED vs. r/H), and (ED vs. r/OMC), as we consider them (empathy vs. empathy) pairs from which an OOD model is not able to learn empathy vs. non-empathy contrast. Additionally, we inspect the performances (orange bars) of all the models on ¬Q.ANNO to understand model behaviour in a less noisy context (i.e. with question utterances removed).

Figure 2: Results of all models on ANNO and ¬Q.ANNO, measured with Matthews correlation coefficient (Matthews, 1975). The names of the baseline models (shown in the rightmost subplot) are re-written in the figure for better visibility, e.g. "NLI\nE\n(1)" instead of nli_E^(1). The first 4 subplots on the left show the performances of OOD-trained models: the first two show the 1-utterance (e.g. r/H vs. r/CC (1)) and 2-utterance OOD models (e.g. r/H vs. r/CC (2)) trained on data with empathy contrast (e.g. r/H vs. r/CC, which is empathy vs. non-empathy), while the third and fourth show the 1- and 2-utterance OOD models trained on data without empathy contrast (e.g. ED vs. r/H, which is empathy vs. empathy). As explained in Section 5.1, for each OOD pair (e.g. r/H vs. r/CC), we randomly sample from the class-unbalanced OOD data 5 times to obtain 5 groups of class-balanced {train, dev, test} data, in order to address class imbalance and data selection bias. For each OOD pair, therefore, we train 5 models independently with the training data from their respective groups. Thus, the value of each bar indicates the mean of the scores of the 5 models from the 5 data groups of the corresponding OOD pair, and the error bar shows +/- one standard deviation from the mean.
Interestingly, the control group models score around 0.11 MCC and are not far behind empathy contrast models such as r/OMC vs. r/CC and ED vs. r/CC in the 1-utterance scenario, albeit with similarly large variation in their results. When it comes to 2-utterance input, however, the lead of the empathy contrast models (except r/H vs. r/CC) becomes more obvious, with r/OMC vs. r/CC scoring over 0.15 MCC in contrast to ED vs. r/OMC recording less than 0.05. This shows that the benefit of learning from OOD empathy contrast, though small, does exist, and is more pronounced when a) compared against learning from no-empathy-contrast OOD data and b) more conversation context is taken into account by the models.
Finally, for the OOD contrast models, we notice mixed effects of removing questions from the benchmark dataset. It enables performance gains for r/H vs. r/CC (1) and ED vs. r/CC (2) but performance drops for the other OOD empathy contrast models. This shows that, despite the annotations indicating that question therapist utterances are predominantly non-empathetic, whether a therapist utterance is a question generally does not substantially impact the empathy prediction of an OOD contrast model. One possible explanation, among others, is that the models simply did not learn to associate questions with non-empathy during the OOD contrast training and instead based their classification on semantic cues unrelated to the question/non-question distinction. Echoing Section 3.3, we argue that analysing non-questions at sentence level would be less noisy, making better predictions possible; we leave this for future work.

Clinical Application & Impact
The motivation for this work was to minimise the annotation effort needed for training an utterance-level classifier of therapeutic empathy/non-empathy, based on the assumption that 1) pre-trained language models can be fine-tuned to distinguish between empathy and non-empathy in general conversations, and 2) the fine-tuned model can be leveraged to directly predict therapeutic empathy/non-empathy.

Figure 3: Test set performances (in MCC) of all OOD models. The first subplot on the left shows the test set performances of the 1- and 2-utterance OOD models trained on data with empathy contrast, and the second shows the test set performances of the 1- and 2-utterance models trained on data without empathy contrast. As explained in Figure 2, each OOD pair (e.g. r/H vs. r/CC (1)/(2)) corresponds to 5 groups of randomly sampled {train, dev, test} data and hence 5 trained models; the model trained on the training data of a group has a test set score associated with the test data of that group. Therefore, the value of each bar indicates the mean of the test set scores of the 5 models from the same OOD pair, and the error bar shows +/- one standard deviation from the mean.
Our results, for the most part, show that this simple OOD training approach did not yield sufficiently accurate classification, which limits its applicability in clinical settings. Compared to supervised learning of session-level empathy on sizeable corpora of well-annotated therapeutic conversations, utterance-level empathy classification with no in-domain training is more challenging, and the models unsurprisingly fared worse. As discussed, the coarse, heuristic empathy labelling of the training utterances and the domain gap between general conversation and therapeutic dialogue likely contributed considerably to the sub-optimal performance.
Nevertheless, we believe this work is a meaningful step towards low-resource real-time assessment of empathy in counselling, and that the idea of utilising pre-trained language models for low-resource scenarios in clinical psychology remains relevant. With smoothed domain gaps and more fine-grained annotation, future work can still use pre-trained language models to leverage parallels between empathy manifestations in general conversation and therapeutic dialogue. For instance, knowledge of empathy vs. non-empathy learned from well-annotated general conversations can serve as a bootstrapping step for training on a minimal amount of well-annotated therapeutic conversations: in practice, a specialised domain often has a small to modest amount of therapeutic dialogue data rather than none at all, and in-domain fine-tuning can then start from OOD empathy knowledge, maximising the benefit of OOD empathy training.
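The bootstrapping recipe above can be sketched in miniature. All utterances below are hypothetical, and a hashed bag-of-words with logistic regression stands in for the pre-trained language models; this is a toy illustration of the two-stage idea, not the paper's setup:

```python
# Stage 1: train on plentiful out-of-domain (general-conversation) empathy
# contrast. Stage 2: continue training on a handful of in-domain therapist
# utterances, starting from the OOD weights rather than from scratch.
import math
import zlib

DIM = 4096  # hashed feature space

def featurize(text):
    """Hash each token into a sparse bag-of-words vector (deterministic)."""
    vec = {}
    for tok in text.lower().split():
        idx = zlib.crc32(tok.encode()) % DIM
        vec[idx] = vec.get(idx, 0) + 1.0
    return vec

def sgd_step(weights, x, y, lr=0.5):
    """One logistic-regression SGD update on a sparse example."""
    z = sum(weights.get(i, 0.0) * v for i, v in x.items())
    p = 1.0 / (1.0 + math.exp(-z))
    for i, v in x.items():
        weights[i] = weights.get(i, 0.0) + lr * (y - p) * v

def predict(weights, text):
    z = sum(weights.get(i, 0.0) * v for i, v in featurize(text).items())
    return int(z > 0)  # 1 = empathetic, 0 = non-empathetic

ood_data = [  # hypothetical general-conversation empathy contrast
    ("that sounds really hard, i am here for you", 1),
    ("i understand how painful that must feel", 1),
    ("what time is the meeting tomorrow", 0),
    ("send me the report by five", 0),
]
mi_data = [  # hypothetical handful of in-domain therapist utterances
    ("it sounds like quitting has been a real struggle for you", 1),
    ("how many drinks do you have per week", 0),
]

weights = {}
for _ in range(10):                      # stage 1: OOD bootstrapping
    for text, label in ood_data:
        sgd_step(weights, featurize(text), label)
for _ in range(10):                      # stage 2: in-domain fine-tuning
    for text, label in mi_data:
        sgd_step(weights, featurize(text), label)

print(predict(weights, "it sounds like this has been painful for you"))
```

The design point is that stage 2 reuses the stage-1 weights instead of reinitialising, which is what lets a small in-domain set benefit from the OOD empathy contrast.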

Conclusion
We find that our models trained to learn from empathy vs. non-empathy contrast in general conversation (i.e. out-of-domain w.r.t. counselling) are generally not reliable predictors of empathy/non-empathy in motivational interviewing. Upon probing, we observe that OOD empathy contrast learning is still marginally better than OOD learning without empathy contrast, particularly when more conversation context is available.
In future work, we plan to investigate more fine-grained empathy annotation and prediction, such as at sentence level, where we expect less noise and more accurate predictions. In addition, we will explore few-shot methods for the empathy classification task, with out-of-domain empathy contrast training as a bootstrapping step.

Ethics & Privacy
Empathy often involves deeply personal circumstances (e.g. distress & struggle), and computational studies of it therefore warrant ethical consideration. The greatest ethical risk of this work lies in its privacy implications, as the conversational data we used could contain large amounts of sensitive identifiable information. To mitigate this risk, we experimented only with de-identified data, in which mentions of information such as names, dates, and locations are replaced with placeholders. As a counterbalance, this study has considerable benefit as the first investigation of using knowledge of general-conversation empathy to support low-resource computational analysis of MI empathy, and the findings can inspire future efforts to make research on therapeutic empathy more accessible.

Table 5: Statistics of ROLEPLAYMI and ANNO.

-uLen.)             28.5   20.6   24.4    21.6
%(emp.T-u)          n/a    n/a    38.7%   2.3%
%(¬Q.T-u)           n/a    n/a    71.9%   73.8%
p(emp | ¬Q, T-u)    n/a    n/a    0.50    0.03
p(emp | Q, T-u)     n/a    n/a    0.10    0.00

The abbreviation convention is similar to that in Table 4, with "T-u" short for "Therapist Utterance(s)" and "C-u" for "Client Utterance(s)". #Conv: number of conversations in the subset. #T-u: number of therapist utterances in the subset. %(emp.T-u): percentage of empathetic therapist utterances. %(¬Q.T-u): percentage of non-question therapist utterances. p(emp | ¬Q, T-u): probability of a non-question therapist utterance being empathetic. p(emp | Q, T-u): probability of a question therapist utterance being empathetic.