Observing Dialogue in Therapy: Categorizing and Forecasting Behavioral Codes

Automatically analyzing dialogue can help understand and guide behavior in domains such as counseling, where interactions are largely mediated by conversation. In this paper, we study modeling behavioral codes used to assess a psychotherapy treatment style called Motivational Interviewing (MI), which is effective for addressing substance abuse and related problems. Specifically, we address the problem of providing real-time guidance to therapists with a dialogue observer that (1) categorizes therapist and client MI behavioral codes and (2) forecasts codes for upcoming utterances to help guide the conversation and potentially alert the therapist. For both tasks, we define neural network models that build upon recent successes in dialogue modeling. Our experiments demonstrate that our models can outperform several baselines for both tasks. We also report the results of a careful analysis that reveals the impact of the various network design tradeoffs for modeling therapy dialogue.


Introduction
Conversational agents have long been studied in the context of psychotherapy, going back to chatbots such as ELIZA (Weizenbaum, 1966) and PARRY (Colby, 1975). Research in modeling such dialogue has largely sought to simulate a participant in the conversation.
In this paper, we argue for modeling dialogue observers instead of participants, and focus on psychotherapy. An observer could help an ongoing therapy session in several ways. First, by monitoring fidelity to therapy standards, a helper could guide both veteran and novice therapists towards better patient outcomes. Second, rather than generating therapist utterances, it could suggest the type of response that is appropriate. Third, it could alert a therapist about potentially important cues from a patient. Such assistance would be especially helpful in the increasingly prevalent online or text-based counseling services.
We ground our study in a style of therapy called Motivational Interviewing (MI; Miller and Rollnick, 2003, 2012), which is widely used for treating addiction-related problems. To help train therapists, and also to monitor therapy quality, utterances in sessions are annotated using a set of behavioral codes called Motivational Interviewing Skill Codes (MISC; Miller et al., 2003). Table 1 shows standard therapist and patient (i.e., client) codes with examples. Recent NLP work (Tanana et al., 2016; Xiao et al., 2016; Pérez-Rosas et al., 2017; Huang et al., 2018, inter alia) has studied the problem of using MISC to assess completed sessions. Despite its usefulness, automated post hoc MISC labeling does not address the desiderata for ongoing sessions identified above; such models use information from utterances yet to be said. To provide real-time feedback to therapists, we define two complementary dialogue observers:
1. Categorization: Monitoring an ongoing session by predicting MISC labels for therapist and client utterances as they are made.
2. Forecasting: Given a dialogue history, forecasting the MISC label for the next utterance, thereby alerting or guiding therapists.
Via these tasks, we envision a helper that offers assistance to a therapist in the form of MISC labels.
We study modeling challenges associated with these tasks related to: (1) representing words and utterances in therapy dialogue, (2) ascertaining relevant aspects of utterances and the dialogue history, and (3) handling label imbalance (as evidenced in Table 1). We develop neural models that address these challenges in this domain.
Experiments show that our proposed models outperform baselines by a large margin. For the categorization task, our models even outperform previous session-informed approaches that use information from future utterances. For the more difficult forecasting task, we show that even without having access to an utterance, the dialogue history provides information about its MISC label. We also report the results of an ablation study that shows the impact of the various design choices.
In summary, in this paper, we (1) define the tasks of categorizing and forecasting Motivational Interviewing Skill Codes to provide real-time assistance to therapists, (2) propose neural models for both tasks that outperform several baselines, and (3) show the impact of various modeling choices via extensive analysis.

Background and Motivation
Motivational Interviewing (MI) is a style of psychotherapy that seeks to resolve a client's ambivalence towards their problems, thereby motivating behavior change. Several meta-analyses and empirical studies have shown the high efficacy and success of MI in psychotherapy (Burke et al., 2004; Martins and McNeil, 2009; Lundahl et al., 2010). However, MI skills take practice to master and require ongoing coaching and feedback to sustain (Schwalbe et al., 2014). Given the emphasis on using specific types of linguistic behaviors in MI (e.g., open questions and reflections), fine-grained behavioral coding plays an important role in MI theory and training.
The code is available online at https://github.com/utahnlp/therapist-observer.
Motivational Interviewing Skill Codes (MISC, Table 1) is a framework for coding MI sessions. It facilitates evaluating therapy sessions via utterance-level labels that are akin to dialogue acts (Stolcke et al., 2000; Jurafsky and Martin, 2019), and are designed to examine therapist and client behavior in a therapy session. As Table 1 shows, client labels mark utterances as discussing changing or sustaining problematic behavior (CT and ST, respectively) or being neutral (FN). Therapist utterances are grouped into eight labels, some of which (RES, REC) correlate with improved outcomes, while MI non-adherent (MIN) utterances are to be avoided. MISC labeling was originally done by trained annotators performing multiple passes over a session recording or a transcript. Recent NLP work speeds up this process by automatically annotating a completed MI session (e.g., Tanana et al., 2016; Xiao et al., 2016; Pérez-Rosas et al., 2017).
Instead of providing feedback to a therapist after the completion of a session, can a dialogue observer provide online feedback? While past work has shown the helpfulness of post hoc eval-
Table 2:
i    s_i   u_i                                    l_i
1    T     Have you used drugs recently?          QUC
2    C     I stopped for a year, but relapsed.    FN
3    T     You will suffer if you keep using.     MIN
4    C     Sorry, I just want to quit.            CT
...  ...   ...                                    ...

Task Definitions
In this section, we will formally define the two NLP tasks corresponding to the vision in §2 using the conversation in Table 2 as a running example. Suppose we have an ongoing MI session with utterances $u_1, u_2, \dots, u_n$: together, the dialogue history $H_n$. Each utterance $u_i$ is associated with its speaker $s_i$, either C (client) or T (therapist). Each utterance is also associated with the MISC label $l_i$, which is the object of study. We will refer to the last utterance $u_n$ as the anchor.
We will define two classification tasks over a fixed dialogue history with $n$ elements: categorization and forecasting. As the conversation progresses, the history will be updated with a sliding window. Since the therapist and client codes share no overlap, we will design separate models for the two speakers, giving us four settings in all.
Task 1: Categorization. The goal of this task is to provide real-time feedback to a therapist during an ongoing MI session. In the running example, the therapist's confrontational response in the third utterance is not MI adherent (MIN); an observer should flag it as such to bring the therapist back on track. The client's response, however, shows an inclination to change their behavior (CT). Alerting a therapist (especially a novice) can help guide the conversation in a direction that encourages it.
In essence, we have the following real-time classification task: Given the dialogue history $H_n$, which includes the speaker information, predict the MISC label $l_n$ for the last utterance $u_n$.
The key difference from previous work in predicting MISC labels is that we are restricting the input to the real-time setting. As a result, models can only use the dialogue history to predict the label, and in particular, we cannot use models such as a conditional random field or a bi-directional LSTM that need both past and future inputs.
Task 2: Forecasting. A real-time therapy observer may be thought of as an expert therapist who guides a session with suggestions to the therapist. For example, after a client discloses their recent drug use relapse, a novice therapist may respond in a confrontational manner (which is not recommended, and hence coded MIN). On the other hand, a seasoned therapist may respond with a complex reflection (REC) such as "Sounds like you really wanted to give up and you're unhappy about the relapse." Such an expert may also anticipate important cues from the client. The forecasting task seeks to mimic the intent of such a seasoned therapist: Given a dialogue history $H_n$ and the next speaker's identity $s_{n+1}$, predict the MISC code $l_{n+1}$ of the yet unknown next utterance $u_{n+1}$.
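The two task definitions differ only in what the model is allowed to see and which label it must predict. The following is a minimal sketch of how input/target pairs could be built from a labeled session, using the running example. The `Utterance` type and both function names are hypothetical and chosen for illustration; they are not taken from the released code.

```python
from typing import List, NamedTuple, Tuple


class Utterance(NamedTuple):
    speaker: str  # "T" (therapist) or "C" (client)
    text: str
    label: str    # MISC code


# The running example conversation.
SESSION = [
    Utterance("T", "Have you used drugs recently?", "QUC"),
    Utterance("C", "I stopped for a year, but relapsed.", "FN"),
    Utterance("T", "You will suffer if you keep using.", "MIN"),
    Utterance("C", "Sorry, I just want to quit.", "CT"),
]


def categorization_example(
    session: List[Utterance], n: int
) -> Tuple[List[Tuple[str, str]], str]:
    """Input: history H_n = u_1..u_n with speakers. Target: l_n (anchor label)."""
    history = [(u.speaker, u.text) for u in session[:n]]
    return history, session[n - 1].label


def forecasting_example(
    session: List[Utterance], n: int
) -> Tuple[List[Tuple[str, str]], str, str]:
    """Input: history H_n and next speaker s_{n+1}. Target: l_{n+1}.

    The text of u_{n+1} is never placed in the input.
    """
    history = [(u.speaker, u.text) for u in session[:n]]
    upcoming = session[n]  # 0-based index n is utterance u_{n+1}
    return history, upcoming.speaker, upcoming.label
```

With `n = 3`, the categorization target is MIN (the therapist's confrontational third utterance), while the forecasting target is CT for the client's not-yet-seen fourth utterance.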
The MISC forecasting task is a previously unstudied problem. We argue that forecasting the type of the next utterance, rather than selecting or generating its text as has been the focus of several recent lines of work (e.g., Schatzmann et al., 2005; Lowe et al., 2015; Yoshino et al., 2018), allows the human in the loop (the therapist) the freedom to creatively participate in the conversation within the parameters defined by the seasoned observer, and perhaps even to reject suggestions. Such an observer could be especially helpful for training therapists (Imel et al., 2017). The forecasting task is also related to recent work on detecting antisocial comments in online conversations (Zhang et al., 2018) whose goal is to provide an early warning for such events.

Models for MISC Prediction
Modeling the two tasks defined in §3 requires addressing four questions: (1) How do we encode a dialogue and its utterances? (2) Can we discover discriminative words in each utterance? (3) Can we discover which of the previous utterances are relevant? (4) How do we handle label imbalance in our data? Many recent advances in neural networks can be seen as plug-and-play components. To facilitate the comparative study of models, we will describe components that address the above questions. In the rest of the paper, we will use boldfaced terms to denote vectors and matrices and SMALL CAPS to denote component names.

Encoding Dialogue
Since both our tasks are classification tasks over a dialogue history, our goal is to convert the sequence of utterances into a single vector that serves as input to the final classifier.
We will use a hierarchical recurrent encoder (Li et al., 2015; Sordoni et al., 2015; Serban et al., 2016, and others) to encode dialogues, specifically a hierarchical gated recurrent unit (HGRU) with an utterance and a dialogue encoder. We use a bidirectional GRU over word embeddings to encode utterances. As is standard, we represent an utterance $u_i$ by concatenating the final forward and reverse hidden states. We will refer to this utterance vector as $v_i$. Also, we will use the hidden states of each word as inputs to the attention components in §4.2. We will refer to such contextual word encoding of the $j$-th word as $v_{ij}$. The dialogue encoder is a unidirectional GRU that operates on a concatenation of utterance vectors $v_i$ and a trainable vector representing the speaker $s_i$. The final state of the GRU aggregates the entire dialogue history into a vector $H_n$.
The HGRU skeleton can be optionally augmented with the word and dialogue attention described next. All the models we will study are two-layer MLPs over the vector $H_n$ that use a ReLU hidden layer and a softmax layer for the outputs.

Word-level Attention
Certain words in the utterance history are important to categorize or forecast MISC labels. The identification of these words may depend on the utterances in the dialogue. For example, to identify that an utterance is a simple reflection (RES) we may need to discover that the therapist is mirroring a recent client utterance; the example in Table 1 illustrates this. Word attention offers a natural mechanism for discovering such patterns.
We can unify a broad collection of attention mechanisms in NLP under a single high-level architecture (Galassi et al., 2019). We seek to define attention over the word encodings $v_{ij}$ in the history (called queries), guided by the word encodings in the anchor $v_{nk}$ (called keys). The output is a sequence of attention-weighted vectors, one for each word in the $i$-th utterance. The $j$-th output vector $a_{ij}$ is computed as a weighted sum of the keys:
$$a_{ij} = \sum_{k} \alpha_j^k \, v_{nk}$$
The weighting factor $\alpha_j^k$ is the attention weight between the $j$-th query and the $k$-th key, computed as
$$\alpha_j^k = \frac{\exp\big(f_m(v_{nk}, v_{ij})\big)}{\sum_{k'} \exp\big(f_m(v_{nk'}, v_{ij})\big)}$$
Here, $f_m$ is a match scoring function between the corresponding words, and different choices give us different attention mechanisms. Finally, a combining function $f_c$ combines the original word encoding $v_{ij}$ and the above attention-weighted word vector $a_{ij}$ into a new vector representation $z_{ij}$ as the final representation of the query word encoding:
$$z_{ij} = f_c(v_{ij}, a_{ij})$$
The attention module, identified by the choice of the functions $f_m$ and $f_c$, converts word encodings in each utterance $v_{ij}$ into attended word encodings $z_{ij}$. To use them in the HGRU skeleton, we will encode them a second time using a BiGRU to produce attention-enhanced utterance vectors. For brevity, we will refer to these vectors as $v_i$ for the utterance $u_i$. If word attention is used, these attended vectors will be treated as word encodings.
Table 3: Summary of word attention mechanisms. We simplify BiDAF with multiplicative attention between word pairs for $f_m$, while GMGRU uses additive attention influenced by the GRU hidden state. The vector $w_e \in \mathbb{R}^{d}$ and the matrices $W_k \in \mathbb{R}^{d \times d}$ and $W_q \in \mathbb{R}^{2d \times 2d}$ are parameters of the BiGRU. The vector $h_{j-1}$ is the hidden state from the BiGRU in GMGRU at the previous position $j-1$. For the combination function, BiDAF concatenates bidirectional attention information from both the key-aware query vector $a_{ij}$ and a similarly defined query-aware key vector. GMGRU uses simple concatenation for $f_c$.
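The generic template (match function, normalized weights, weighted sum of keys, combining function) can be sketched in a few lines of plain Python. This is an illustration only: the dot-product `f_m` and the concatenating `f_c` below are simple stand-ins chosen for brevity, not the BiDAF or GMGRU definitions summarized in Table 3, and normalizing the weights over the keys is an assumption of this sketch.

```python
import math
from typing import Callable, List

Vector = List[float]


def dot(key: Vector, query: Vector) -> float:
    """Stand-in match function f_m(v_nk, v_ij): a plain dot product."""
    return sum(k * q for k, q in zip(key, query))


def concat(v: Vector, a: Vector) -> Vector:
    """Stand-in combining function f_c(v_ij, a_ij): concatenation."""
    return v + a


def attention_weights(
    query: Vector, keys: List[Vector], f_m: Callable[[Vector, Vector], float] = dot
) -> List[float]:
    """Softmax over the keys of the match scores for one query word."""
    scores = [f_m(k, query) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(
    queries: List[Vector],
    keys: List[Vector],
    f_m: Callable[[Vector, Vector], float] = dot,
    f_c: Callable[[Vector, Vector], Vector] = concat,
) -> List[Vector]:
    """Map each query word encoding v_ij to an attended encoding z_ij."""
    dim = len(keys[0])
    attended = []
    for v in queries:
        weights = attention_weights(v, keys, f_m)
        # a_ij: attention-weighted sum of the anchor (key) word encodings.
        a = [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
        attended.append(f_c(v, a))
    return attended
```

Swapping in a different `f_m` or `f_c` changes the attention mechanism without touching the rest of the pipeline, which is the sense in which the components are plug-and-play.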
To complete this discussion, we need to instantiate the two functions. We use two commonly used attention mechanisms: BiDAF (Seo et al., 2016) and gated matchLSTM (Wang et al., 2017). For simplicity, we replace the sequence encoder in the latter with a BiGRU and refer to it as GMGRU. Table 3 shows the corresponding definitions of $f_c$ and $f_m$. We refer the reader to the original papers for further details. In subsequent sections, we will refer to the two attended versions of the HGRU as BiDAF^H and GMGRU^H.

Utterance-level Attention
While we assume that the history of utterances is available for both our tasks, not every utterance is relevant to decide a MISC label. For categorization, the relevance of an utterance to the anchor may be important. For example, a complex reflection (REC) may depend on the relationship of the current therapist utterance to one or more of the previous client utterances. For forecasting, since we do not have an utterance to label, several previous utterances may be relevant. For example, in the conversation in Table 2, both $u_2$ and $u_4$ may be used to forecast a complex reflection.
To model such utterance-level attention, we will employ the multi-head, multi-hop attention mechanism used in Transformer networks (Vaswani et al., 2017). As before, due to space constraints, we refer the reader to the original work for details. We will use the $(Q, K, V)$ notation from the original paper here. These matrices represent a query, key and value respectively. The multi-head attention is defined as:
$$\mathrm{Multihead}(Q, K, V) = [\mathrm{head}_1; \dots; \mathrm{head}_h] \, W^o$$
$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$
The $W_i$'s refer to projection matrices for the three inputs, and the final $W^o$ projects the concatenated heads into a single vector. The choices of the query, key and value define the attention mechanism. In our work, we compare two variants: anchor-based attention and self-attention. The anchor-based attention is de- [...] Self-attention is defined by setting all three matrices to $[v_1 \cdots v_n]$. For both settings, we use four heads and stack them for two hops, and refer to them as SELF_42 and ANCHOR_42.
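To make the self-attention variant concrete, here is a deliberately reduced sketch: a single head, a single hop, and no learned projection matrices, so that only the scaled dot-product step from Vaswani et al. (2017) remains. The function names are hypothetical, and the full mechanism described above additionally uses four projected heads stacked for two hops.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(
    Q: List[Vector], K: List[Vector], V: List[Vector]
) -> List[Vector]:
    """One attention head without projections: softmax(QK^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out_dim = len(V[0])
    outputs = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, V)) for d in range(out_dim)]
        )
    return outputs


def self_attention(utterance_vectors: List[Vector]) -> List[Vector]:
    """Self-attention: query, key and value are all [v_1 ... v_n]."""
    return scaled_dot_product_attention(
        utterance_vectors, utterance_vectors, utterance_vectors
    )
```

Each output row is a convex combination of the utterance vectors, so every utterance representation is re-expressed in terms of the utterances it is most similar to.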

Addressing Label Imbalance
From Table 1, we see that both client and therapist labels are imbalanced. Moreover, rarer labels are more important in both tasks. For example, it is important to identify CT and ST utterances. For therapists, it is crucial to flag MI non-adherent (MIN) utterances; seasoned therapists are trained to avoid them because they correlate negatively with patient improvements. If not explicitly addressed, the frequent but less useful labels can dominate predictions.
To address this, we extend the focal loss (FL; Lin et al., 2017) to the multiclass case. For a label $l$ with probability $p_t$ produced by a model, the loss is defined as
$$\mathrm{FL}(p_t) = -\alpha_t \, (1 - p_t)^{\gamma} \log(p_t)$$
In addition to using a label-specific balance weight $\alpha_t$, the loss also includes a modulating factor $(1 - p_t)^{\gamma}$ to dynamically downweight well-classified examples with $p_t \gg 0.5$. Here, the $\alpha_t$'s and $\gamma$ are hyperparameters. We use FL as the default loss function for all our models.
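As a small worked illustration of the multiclass focal loss, the sketch below computes the per-example loss in plain Python. The function name and the dictionary-based interface are hypothetical; the client balance weights are the ones reported in the appendix (1.0, 1.0 and 0.25 for CT, ST and FN).

```python
import math
from typing import Dict

# Client balance weights alpha_t, as reported in the appendix.
CLIENT_ALPHA = {"CT": 1.0, "ST": 1.0, "FN": 0.25}


def focal_loss(
    probs: Dict[str, float], gold: str, alpha: Dict[str, float], gamma: float
) -> float:
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) for one example.

    `probs` is the model's softmax output and p_t is the probability it
    assigns to the gold label.
    """
    p_t = probs[gold]
    return -alpha[gold] * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct prediction on the frequent FN label ...
easy = {"CT": 0.05, "ST": 0.05, "FN": 0.90}
# ... and a poorly classified example of the rare CT label.
hard = {"CT": 0.10, "ST": 0.20, "FN": 0.70}

loss_easy = focal_loss(easy, "FN", CLIENT_ALPHA, gamma=1.0)
loss_hard = focal_loss(hard, "CT", CLIENT_ALPHA, gamma=1.0)
```

With $\gamma = 0$ the modulating factor is 1 and the loss reduces to $\alpha$-balanced cross-entropy; larger $\gamma$ shrinks the contribution of the easy example much more than that of the hard one.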

Experiments
The original psychotherapy sessions were col-

Preprocessing and Model Setup
An MI session contains about 500 utterances on average. We use a sliding window of size N = 8 utterances with padding for the initial ones. We assume that we always know the identity of the speaker for all utterances. Based on this, we split the sliding windows into client and therapist windows to train separate models. We tokenized and lower-cased utterances using spaCy (Honnibal and Montani, 2017). To embed words, we concatenated 300-dimensional GloVe embeddings (Pennington et al., 2014) with ELMo vectors (Peters et al., 2018). The appendix details the model setup and hyperparameter choices.
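The windowing step can be sketched as follows. This is an illustrative reading of the description above, not the released preprocessing code: the function names are hypothetical, and padding on the left with `None` is an assumption (the text only says that the initial windows are padded).

```python
from typing import Dict, List, Optional, Tuple

Turn = Tuple[str, str]  # (speaker, text), speaker is "T" or "C"
Window = List[Optional[Turn]]


def sliding_windows(turns: List[Turn], size: int = 8) -> List[Window]:
    """One window per utterance, ending at that utterance (the anchor).

    Windows near the start of the session are left-padded with None so
    that every window has exactly `size` slots.
    """
    windows = []
    for end in range(1, len(turns) + 1):
        start = max(0, end - size)
        window: Window = list(turns[start:end])
        padding: Window = [None] * (size - len(window))
        windows.append(padding + window)
    return windows


def split_by_anchor_speaker(windows: List[Window]) -> Dict[str, List[Window]]:
    """Group windows by the speaker of the anchor (last) utterance."""
    by_speaker: Dict[str, List[Window]] = {"C": [], "T": []}
    for window in windows:
        anchor = window[-1]
        assert anchor is not None  # the anchor slot is never padding
        by_speaker[anchor[0]].append(window)
    return by_speaker
```

For the categorization task, the therapist model would be trained on the `"T"` group and the client model on the `"C"` group; for forecasting, the grouping would instead use the speaker of the utterance that follows each window.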

Results
Best Models. Our goal is to discover the best client and therapist models for the two tasks. We identified the following best configurations using $F_1$ score on the development set:
1. Categorization: For the client, the best model does not need any word or utterance attention. For the therapist, it uses GMGRU^H for word attention and ANCHOR_42 for utterance attention. We refer to these models as C_C and C_T, respectively.
2. Forecasting: For both client and therapist, the best model uses no word attention, and uses SELF_42 utterance attention. We refer to these models as F_C and F_T, respectively.
Here, we show the performance of these models against various baselines. The appendix gives label-wise precision, recall and $F_1$ scores.
Results on Categorization. Tables 4 and 5 show the performance of the C_C and C_T models and the baselines. For both therapist and client categorization, we compare the best models against the same set of baselines. The majority baseline illustrates the severity of the label imbalance problem.
Table 4: Main results on categorizing client codes, in terms of macro $F_1$ and $F_1$ for each client code. Our model C_C uses the final dialogue vector $H_n$ and the current utterance vector $v_n$ as input to the MLP for the final prediction. We found that predicting using MLP($H_n$) + MLP($v_n$) performs better than just MLP($H_n$).
The first set of baselines (above the line) do not encode dialogue history and use only the current utterance encoded with a BiGRU. The work of Xiao et al. (2016) falls in this category, and uses a 100-dimensional domain-specific embedding with weighted cross-entropy loss. Previously, it was the best model in this class. We also re-implemented this model to use either ELMo or GloVe vectors with focal loss. The second set of baselines (below the line) are models that use dialogue context. Both Can et al. (2015) and Tanana et al. (2016) use well-studied linguistic features and then tag the current utterance using both past and future utterances, with a CRF and an MEMM, respectively. To study the usefulness of the hierarchical encoder, we implemented a model that uses a bidirectional GRU over a long sequence of flattened utterances. We refer to this as CONCAT^C. This model is representative of the work of Huang et al. (2018), but was reimplemented to take advantage of ELMo.
For categorizing client codes, BiGRU_ELMo is a simple but robust baseline model. It outperforms the previous best no-context model by more than 2 points on macro $F_1$. Using the dialogue history, the more sophisticated model C_C gains a further 1 point. Especially important is its improvement on the infrequent, yet crucial labels CT and ST. It shows a drop in $F_1$ on the FN label, which is essentially considered to be an unimportant, background class from the point of view of assessing patient progress. For therapist codes, as the highlighted numbers in Table 5 show, GMGRU^H, which incorporates only GMGRU-based word-level attention, already outperforms many baselines. Our proposed model C_T, which uses both GMGRU-based word-level attention and anchor-based multi-head, multi-hop sentence-level attention, achieves the best overall performance. Also, note that our models outperform approaches that take advantage of future utterances.
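Because the comparison rests on macro $F_1$, it may help to see why this metric is sensitive to rare labels. The sketch below computes per-label and macro $F_1$ in plain Python on a toy label sequence; the data and function names are made up for illustration and are not results from the paper.

```python
from typing import Dict, List


def per_label_f1(gold: List[str], pred: List[str], labels: List[str]) -> Dict[str, float]:
    """F1 for each label, computed from true/false positives and false negatives."""
    scores = {}
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(gold: List[str], pred: List[str], labels: List[str]) -> float:
    """Unweighted mean of the per-label F1 scores."""
    scores = per_label_f1(gold, pred, labels)
    return sum(scores.values()) / len(labels)


CLIENT_LABELS = ["CT", "ST", "FN"]

# Toy data: FN dominates, as it does among the client codes.
toy_gold = ["FN", "FN", "FN", "CT", "ST"]
toy_pred = ["FN", "FN", "CT", "CT", "FN"]
majority = ["FN"] * len(toy_gold)

toy_macro = macro_f1(toy_gold, toy_pred, CLIENT_LABELS)
majority_macro = macro_f1(toy_gold, majority, CLIENT_LABELS)
```

On this toy data the majority predictor gets 60% of the labels right but scores zero $F_1$ on both CT and ST, so its macro $F_1$ is lower than that of the imperfect predictor that recovers the rare CT label.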
For both client and therapist codes, concatenating dialogue history with CONCAT^C always performs worse than the hierarchical method and even the simpler BiGRU_ELMo.
Results on Forecasting. Since the forecasting task is new, there are no published baselines to compare against. Our baseline systems essentially differ in their representation of dialogue history. The model CONCAT^F uses the same architecture as the model CONCAT^C from the categorizing task. We also show comparisons to the simple HGRU model and the GMGRU^H model that uses a gated matchGRU for word attention. Tables 6(a) and 6(b) show our forecasting results for client and therapist respectively. For client codes, we also report the CT and ST performance on the development set because of their importance. For the therapist codes, we also report the recall@3 to show the performance of a suggestion system that displayed three labels instead of one. The results show that even without an utterance, the dialogue history conveys signal about the next MISC label. Indeed, the performance for some labels is even better than that of some categorization baseline systems. Surprisingly, word attention (GMGRU^H) in Table 6 did not help in the forecasting setting, and a model with the SELF_42 utterance attention is sufficient.
The forecasting task bears similarity to the next utterance selection task in dialogue state tracking work (Yoshino et al., 2018). In preliminary experiments, we found that the Dual-Encoder approach used for that task consistently underperformed the other baselines described here.
For the therapist labels, if we always predicted the three most frequent labels (FA, GI, and RES), the recall@3 is only 67.7, suggesting that our models are informative if used in this suggestion-mode.
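The recall@3 metric used in suggestion mode can be sketched as below. The function name and the toy gold labels are hypothetical; only the fixed FA, GI, RES ranking corresponds to the frequency baseline described above.

```python
from typing import List


def recall_at_k(ranked_predictions: List[List[str]], gold: List[str], k: int = 3) -> float:
    """Fraction of examples whose gold label is among the top-k suggestions."""
    hits = sum(
        1 for ranked, label in zip(ranked_predictions, gold) if label in ranked[:k]
    )
    return hits / len(gold)


# Frequency baseline: always suggest the three most frequent therapist labels.
FREQUENT = ["FA", "GI", "RES"]

toy_gold = ["FA", "REC", "QUO", "GI"]           # made-up gold labels
baseline = [FREQUENT for _ in toy_gold]          # same suggestions every time
toy_baseline_recall = recall_at_k(baseline, toy_gold, k=3)
```

A forecasting model replaces the fixed ranking with one sorted by its predicted label probabilities for each dialogue history, so it can surface less frequent codes such as REC when the context calls for them.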

Analysis and Ablations
This section reports error analysis and an ablation study of our models on the development set. The appendix shows a comparison of pretrained domain-specific ELMo/GloVe with generic ones and the impact of the focal loss compared to simple or weighted cross-entropy. Table 7.

Client Examples (Gold MISC)
Reasoning is required to understand whether a client wants to change behavior, even with full context (50, 42)
T: On a scale of zero to ten how confident are you that you can implement this change? C: I don't know, seven maybe (CT); I have to wind down after work (ST)
Concise utterances which are easy for humans to understand, but missing information such as coreference, zero pronouns (
The first category requires more complex reasoning than just surface form matching. For example, the phrase seven out of ten indicates that the client is very confident about changing behavior; the phrase wind down after work indicates, in this context, that the client drinks or smokes after work. We also found that another frequent source of error is incomplete information. In a face-to-face therapy session, people may use concise and efficient verbal communication, with gestures and other body language conveying information without explaining details about, for example, coreference. With only textual context, it is difficult to infer the missing information. The third category of errors is introduced when speech is transcribed into text. The last category is about ambivalent speech. Discovering the real attitude towards behavior change behind such utterances could be difficult, even for an expert therapist.
Figures 1 and 2 show the label confusion matrices for the best categorization models. We will examine confusions that are not caused purely by a label being frequent. We observe a common confusion between the two reflection labels, REC and RES. Compared to the confusion matrix from Xiao et al. (2016), we see that our models show much-decreased confusion here. There are two reasons for this confusion persisting. First, the reflections may require a much longer information horizon. We found that by increasing the window size to 16, the overall reflection results improved. Second, we need to capture richer meaning beyond surface word overlap for RES.
We found that complex reflections usually add meaning or emphasis to previous client statements using devices such as analogies, metaphors, or similes rather than simply restating them.
Closed questions (QUC) and simple reflections (RES) are known to be a confusing set of labels. For example, an utterance like Sounds like you're suffering? may be both. Giving information (GI) is easily confused with many labels because they relate to providing information to clients, but with different attitudes. The MI adherent (MIA) and non-adherent (MIN) labels may also provide information, but with a supportive or critical attitude that may be difficult to disentangle, given the limited number of examples.
Table 8: Ablation study on categorizing client codes. * is our best model C_C. All ablations are based on it. The symbol + means adding a component to it. The default window size is 8 for our ablation models in the word attention and sentence attention parts.

How Do Context and Attention Help?
We evaluated various ablations of our best models to see how changing various design choices changes performance. We focused on the context window size and the impact of different word-level and sentence-level attention mechanisms. Tables 8 and 9 summarize our results.
History Size. Increasing the history window size generally helps. The biggest improvements are for categorizing therapist codes (Table 9), especially for RES and REC. However, increasing the window size beyond 8 does not help to categorize client codes (Table 8) or to forecast (in appendix).
Word-level Attention. Only the model C_T uses word-level attention. As shown in Table 9, when we remove the word-level attention from it, the overall performance drops by 3.4 points, while the performance on RES and REC drops by 3.3 and 5 points respectively. Changing the attention to BiDAF decreases performance by about 2 points (still higher than the model without attention).
Sentence-level Attention. Removing sentence attention from the best models that have it decreases performance for the models C_T and F_T (in appendix). It makes little impact on F_C, however. Table 8 shows that neither attention helps in categorizing client codes.

Can We Suggest Empathetic Responses?
Our forecasting models are trained on regular MI sessions. According to the label distribution in Table 1, these contain both MI adherent and non-adherent data. Hence, our models are trained to show how therapists usually respond to a given statement.
Table 9: Ablation study on categorizing therapist codes. * is our proposed model C_T. \ means substituting and − means removing that component. Here, we only report the important REC and RES labels for guiding, and the MIN label for warning a therapist.
To show whether our model can mimic good MI policies, we selected 35 MI sessions from our test set which were rated 5 or higher on a 7-point scale for empathy or spirit. On these sessions, we still achieve a recall@3 of 76.9, suggesting that we can learn good MI policies by training on all therapy sessions. These results suggest that our models can help train new therapists who may be uncertain about how to respond to a client.

Conclusion
We addressed the question of providing real-time assistance to therapists and proposed the tasks of categorizing and forecasting MISC labels for an ongoing therapy session. By developing a modular family of neural networks for these tasks, we show that our models outperform several baselines by a large margin. Extensive analysis shows that our model can decrease the label confusion compared to previous work, especially for reflections and rare labels, but also highlights directions for future work.

MIN 1019
Group of MI Non-adherent codes: Confront (CO); Direct (DI); Advise without permission (ADW); Warn (WA); Raise concern without permission (RCW)
"You hurt the baby's health for cigarettes?" (CO) "You need to xxx." (DI) "You ask them not to drink at your house." (ADW) "You will die if you don't stop smoking." (WA) "You may use it again with your friends." (RCW)
Table 10

Model Setup
We use 300-dimensional GloVe embeddings pre-trained on 840B tokens from Common Crawl (Pennington et al., 2014). We do not update the embeddings during training. Tokens not covered by GloVe are mapped to a randomly initialized UNK embedding. We also use the character-level deep contextualized embedding model ELMo 5.5B by concatenating the corresponding ELMo word encoding after the word embedding vector. For speaker information, we randomly initialize 8-dimensional vectors and update them during training. We used a dropout rate of 0.3 for the embedding layers. We trained all models using Adam (Kingma and Ba, 2015) with the learning rate chosen by cross validation from the range [1e-4, 5e-4], gradient norm clipping chosen from the range [1.0, 5.0], and minibatch sizes of 32 or 64. We use the same hidden size for the utterance encoder, the dialogue encoder and the attention memory; it has been selected from {64, 128, 256, 512}. We set a smaller dropout of 0.2 for the final two fully connected layers. All the models are trained for 100 epochs with early stopping based on macro $F_1$ over development results.

Detailed Results of Our Main Models
In the main text, we only show the $F_1$ score of each of our proposed models. We summarize the performance of our best models for both categorizing and forecasting MISC codes in Table 11 with precision, recall and $F_1$ for each code.
Table 11: Performance of our proposed models with respect to precision, recall and $F_1$ on categorizing and forecasting tasks for client and therapist codes.

Domain Specific GloVe and ELMo
We use the general psychotherapy corpus with 6.5M words (Alexander Street Press) to train the domain-specific word embeddings GloVe_psyc with 50, 100 and 300 dimensions. Also, we trained ELMo with 1 highway connection and 256-dimensional output size to get ELMo_psyc. We found that ELMo 5.5B performs better than ELMo_psyc in our experiments, and general GloVe-300 is better than GloVe_psyc. Hence, for the main results of our models, we use generic ELMo by default. Please see more details in Table 12.
Table 13: Ablation on the forecasting task for both client and therapist codes. The * rows are results of our best forecasting models F_C and F_T. \ means substituting anchor attention with self attention. +GMGRU ANCHOR_42 means using word-level attention and anchor-based sentence-level attention together.

Full Results for Ablation on Forecasting Tasks
In addition to the ablation tables in the main paper for the categorizing tasks, we report more ablation details on the forecasting task in Table 13. Word-level attention does not help for either client or therapist codes, while sentence-level attention helps more on therapist codes than on client codes. Multi-head self attention also achieves better performance than anchor-based attention in forecasting tasks.

Label Imbalance
We always use the same α for all weighted focal losses. Besides considering the label frequency, we also consider the performance gap between previously reported $F_1$ scores. We choose the balance weights α as {1.0, 1.0, 0.25} for CT, ST and FN respectively, and {0.5, 1.0, 1.0, 1.0, 0.75, 0.75, 1.0, 1.0} for FA, RES, REC, GI, QUC, QUO, MIA and MIN. In Table 14, we report our ablation studies on cross-entropy loss, weighted cross-entropy loss, and focal loss. Besides the fixed weights, focal loss offers flexible hyperparameters to weight examples in different tasks. Experiments show that, except for the model C_T, focal loss outperforms cross-entropy loss and weighted cross-entropy.
Table 14: Ablation study of different loss functions on the categorizing and forecasting tasks. Based on our proposed models for our four settings, we compared our best model with cross-entropy loss (ce), α-balanced cross-entropy (wce) and focal loss. Here we only report the macro $F_1$ for rare labels and the overall macro $F_1$. γ = 1 is the best for both the models C_C and F_C, while γ = 0 is the best for C_T and γ = 3 for F_T. Note that when γ = 0, the focal loss reduces to α-balanced cross-entropy, so the first two rows are the same for the therapist model.