A Multi-Persona Chatbot for Hotline Counselor Training

Suicide prevention hotline counselors aid individuals during difficult times through millions of calls and chats. A chatbot cannot safely replace a counselor, but we explore whether a chatbot can be developed to help train human counselors. Such a system needs to simulate intimate situations across multiple practice sessions. Open-domain dialogue systems frequently suffer from generic responses that do not characterize personal stories, so we look to infuse conversations with persona information by mimicking prototype conversations. Towards building a “Crisisbot” hotline visitor simulation, we propose a counseling strategy annotation scheme and a multi-task framework that leverages these counselor strategies to retrieve similar examples, generate diverse sub-utterances, and interleave prototype and generated sub-utterances into complex responses. We evaluate this framework with crowdworkers and experienced hotline counselors. The framework considerably increases response diversity and specificity, with limited impact on coherence. Our results also show a considerable discrepancy between crowdworker and counselor judgements, which emphasizes the importance of including target populations in system development and evaluation.


Introduction
Crisis hotlines provide the opportunity for individuals to reach out, be heard, and get help during intense times of need. They are invaluable, highly used resources, as evidenced by The National Suicide Prevention Lifeline centers alone handling 2.2 million calls during 2018 (Vibrant Emotional Health, 2019). Further, many helplines have expanded services to chat, in addition to voice calls, to support texting and increased demand. To help visitors who contact them, helplines rely on counselors trained in various active listening techniques and de-escalation strategies (Gould et al., 2013; Knox et al., 2012; Labouliere et al., 2020).
As in other high-stakes settings, training hotline counselors suffers from the fundamental challenge that counselors must practice counseling skills without putting any distressed hotline visitors in danger. Crisis counselors usually obtain experience by practicing with experienced counselors who role-play hotline visitors. Role-playing has been shown to improve crisis intervention (Cross et al., 2011), but requires significant resources that can be challenging for hotlines to provide on tight budgets. For example, it requires experienced counselors' time that could instead be used to help hotline visitors in need.
To provide a realistic, low-risk practice environment to augment crisis hotline counselor training, we seek to develop Crisisbot: a dialogue agent that can simulate a hotline visitor and be used to train human counselors. A chatbot clearly cannot safely and thus ethically take on a counseling role, so we instead explore this direction of using a chatbot to support human counselors. In particular, we consider whether a chatbot can be used to simulate a visitor to better train human counselors. By chatting with Crisisbot, novice counselors could safely practice counseling and build confidence before interacting with hotline visitors.
In addition to returning relevant messages, Crisisbot needs to simulate multiple visitor stories to begin to expose counselors to the astonishing breadth of scenarios that they may encounter. Multiple stories are also needed to support multiple practice sessions, and these visitor stories must be intimate conversations with realistically complex messages (as in Figure 1). However, collecting pre-defined personas in addition to example conversations burdens a limited pool of skilled workers, so we seek to develop stories by mimicking example conversations directly.
Developing personas for chatbots is challenging, as generic responses remain a shortfall of dialogue generation models (Roller et al., 2020). Generic responses are especially problematic for characterizing distinct personas and for carrying engaging conversations. Without some specificity in responses, different scenarios cannot be simulated for multiple training sessions.
To simulate multiple, specific, and distinct personas, we propose a counselor strategy annotation scheme and a novel multi-task training framework that constructs persona-relevant responses from example transcripts rather than pre-defined personas. Our framework leverages counselor strategies, a pre-trained generative model, utterance segmentation, and sub-utterance selection. It allows for interleaving generated and prototype text to construct complex responses, which are more realistic.
We explore the utility of our strategy annotation scheme and multi-task framework in the context of hotline counseling. We use our scheme to annotate a corpus of realistic conversations and then use our framework to leverage these annotated conversations for response construction. We first test the trained model components through interactive chats with untrained crowdworkers to assess response quality and experience. We then test the framework with experienced hotline counselors to additionally assess how realistic and useful the system is perceived to be. The interactive conversations are evaluated with both subjective human judgements of user experience and automatically calculated measures of response diversity.
We find that our approach considerably increases the diversity of chatbot responses and the presence of persona-relevant sub-utterances in them. We also find a discrepancy between crowdworker and counselor evaluations. This discrepancy stresses the importance of our two-fold evaluation approach and the general need for testing within a target setting, especially for specialized systems.
While additional improvements are needed for conversational coherence between turns, our annotation scheme and modeling approach contribute to curating chat experiences with richer responses that characterize distinct personas. We study this framework in the context of counseling, but it could be used more broadly in applications that require simulating multiple scenarios. We make anonymized conversations available for research use.

Related work
Our work builds on a growing interest in using natural language processing to address challenges in mental health care. Unlike previous work that has predominantly focused on studying the quality of care provided or delivering mental health care with chatbots, we focus on training hotline counselors with a Crisisbot. With this approach, we must address the persistent challenge of generic dialogue response generation (Roller et al., 2020) to avoid tedious and unrealistic training experiences. We attempt to construct personas by mimicking personas from prototype conversations with a multi-task framework that enables constructing responses from pre-written and generated text.

NLP for hotline and therapy insights
Recent interest in hotline conversations has analyzed the structure of real conversation transcripts with the hope of better understanding what makes hotline conversations perceived as helpful (Althoff et al., 2016; Zhang and Danescu-Niculescu-Mizil, 2020). Similar work has considered talk therapy conversations, seeking to understand conversations and predict counseling acts (Pérez-Rosas et al., 2017) or forecast therapists' next actions (Cao et al., 2019).

Chatbots for therapy care and training
An alternative line of work has followed the seminal Eliza system (Weizenbaum, 1966) and attempted to create dialogue agents that provide therapeutic benefit, i.e., "therapybots" (Fitzpatrick et al., 2017; Inkster et al., 2018). Hotlines are too high-risk to responsibly replace human counselors with chatbots, so, for safety, we instead seek to support mental health professionals with improved training and practice. Our approach is most similar to work that explored building a chatbot to train talk therapists, who use different methods than hotline counselors. That work did not explore enabling multiple personas, but trained a single general persona and found generic responses to be prevalent and tedious for crowdworkers.

Personality in chatbots
There has recently been a surge of interest in generating coherent and consistent dialogues grounded on pre-defined persona profile information from the PersonaChat dataset (Zhang et al., 2018). Approaches to enforce consistent personas on this dataset have included retrieving relevant profile facts (Zhang et al., 2018), retrieving and refining relevant utterances, increasing the probability of copying a word from the profile (Yavuz et al., 2019), tuning to discourage inconsistent responses (Li et al., 2019a), reranking candidate responses (Welleck et al., 2019), and combining natural language inference with reinforcement learning (Song et al., 2019). Unfortunately, these methods fall short of generating responses that are as grammatical, diverse, engaging, and descriptive as natural human-generated conversation (See et al., 2019; Roller et al., 2020). Further, pre-defined profile facts, which these works rely on, are not present or realistic for our application, which only has example conversations. To leverage the rich example transcripts for simulating personas, we propose a multi-task training framework.

Multi-task fine-tuning for dialogue response generation
Recent advances in large pre-trained models (Devlin et al., 2018; Radford et al., 2018) have led researchers to consider leveraging these models for dialogue response generation. A particular approach has been multi-task training, such as for generating persona-relevant responses (Wolf et al., 2018) and adapting to domains (Li et al., 2019b). We build on this multi-task approach by using components that enable us to interleave pre-written and generated sub-utterances to construct descriptive and persona-consistent responses.

Retrieving and generating messages
Previous work has considered combining retrieval and generation methods for dialogue systems. The majority of this work has taken a retrieve-and-modify approach, e.g., using a retrieved prototype response to guide the generation of a full response for social chat (Cai et al., 2019) or tech support systems (Pandey et al., 2018). Additional work retrieved facts to build knowledge-grounded systems. Most closely related to our work are studies that use rankers to choose between generated or retrieved responses in social chatbots (Song et al., 2016; Papaioannou et al., 2017a,b; Shalyminov et al., 2018) and in knowledge-grounded systems. Our work differs from these approaches, which used fully generated or fully retrieved messages, by interleaving both selected text and generated text into complex messages. Further, unlike other work, we do this to curate personas and as a component of a larger multi-task framework that combines counselor strategy detection, diverse generation, and a fine-tuned transformer model.

Dataset
To scaffold conversations and help the model identify similar contexts, we develop a counselor strategy annotation scheme. We use this scheme to annotate a dataset of realistic hotline training conversations that was previously collected with experienced hotline counselors (DeMasi et al., 2019). The counselors were asked to role-play hotline conversations, as during training, and to preserve the privacy of all hotline visitors. These conversations are representative of how counselors currently acquire experience during training. However, as they are synthetic, we are able to consider models that repeat portions of text without violating the privacy of vulnerable individuals. Conversations were required to be at least 40 total turns in length and were allowed to be up to 60 total turns. Additional dataset statistics can be found in Table 1. The dataset was collected in multiple phases, with unique conversations collected first and additional examples then built by paraphrasing this initial set. Here, we consider the conversations collected in all phases to initially fine-tune a pre-trained language model. Then, for the full multi-task framework, we consider an annotated subset of the unique conversations and not the paraphrased conversations. The corpus contains extremely diverse scenarios that are usually characterized by relatively long messages with multiple sentences. The considerable diversity of scenarios makes our task challenging, but emphasizes the need for Crisisbot to simulate multiple personas.

Sub-utterance segmentation
Counselor and visitor messages were automatically segmented into sub-utterances, e.g., separate sentences, with a sentence tokenizer. Leading greetings, positive, and negative responses were also separated as sub-utterances for consistent strategy annotation. Segmentation and spelling errors were manually corrected during annotation. Splitting sub-utterances was necessary for annotation, as counselors are often encouraged to reply with multiple components, e.g., to acknowledge feelings before asking a question. For visitors, segmenting messages enables construction of more complex messages, e.g., generating a sub-utterance to respond to an unseen question and then selecting an informative sub-utterance from a prototype conversation to elaborate on the response.
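As a rough sketch of this segmentation step, the following uses a naive regex-based sentence splitter and an illustrative list of leading greeting/response tokens; the paper's actual sentence tokenizer and token inventory are not specified here.

```python
import re

# Hypothetical leading tokens split off as their own sub-utterances; the
# paper separates greetings and positive/negative responses, but this
# inventory is illustrative, not the authors' exact list.
LEADING = ("hi", "hello", "hey", "yes", "no", "yeah", "okay", "thanks")

def segment_message(message: str) -> list[str]:
    """Split a message into sub-utterances: naive sentence boundaries plus
    peeling off a leading greeting/affirmation when one is present."""
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", message.strip())
                 if s.strip()]
    out = []
    for sent in sentences:
        m = re.match(r"^(%s)[,.!]?\s+(.+)$" % "|".join(LEADING),
                     sent, re.IGNORECASE)
        if m:
            out.append(m.group(1))   # e.g. "Yeah"
            out.append(m.group(2))   # remainder of the sentence
        else:
            out.append(sent)
    return out
```

A real pipeline would still need the manual correction pass described above, since both sentence splitting and leading-token detection make errors on informal chat text.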

Counselor strategy annotation
To understand input counselor messages and enable the model to look up similar contexts, we develop a counselor strategy annotation scheme and annotate the strategies of counselor sub-utterances in 160 transcripts. The counselor annotation scheme was based on conversations with professionals who train hotline counselors and on training materials. It was iteratively refined and expanded by the researchers to cover the sub-utterances observed. The annotation scheme consists of 25 annotations, including active listening techniques, de-escalation strategies, and social norms (examples in Table 2, additional details in the Appendix). This approach relates to previous work that annotated conversation strategies for persuasion, negotiation (He et al., 2018), and motivational interviewing (Cao et al., 2019; Pérez-Rosas et al., 2017), but our annotations are specialized for a hotline counseling context. However, they could be generalized to related applications and used to extract related conversational exchanges to generate diverse response components. They could also be used to contrast with similar domains and to understand what distinguishes hotline counseling.

Method
To simulate realistic hotline visitors, we seek to construct complex responses that characterize specific, consistent personas across multiple turns. Towards persona consistency, we base each conversation on a prototype conversation from the corpus and try to mimic the visitor's persona in that conversation. As a result, we do not generate personas, but model conversations on individual training example conversations. To do this, we propose an end-to-end multi-task modeling framework that leverages our annotation scheme and a large pre-trained language model. This framework fine-tunes a generative language model while learning two additional model components: a strategy detector to select similar conversational exchanges, i.e., message-response pairs, and a next sub-utterance classifier that can interleave generated sub-utterances and sub-utterances selected from the prototype conversation into a coherent response. An overview of using these components to construct responses is in Figure 2.

Figure 2: Overview of the process to construct responses to messages input by a counselor: (0) select a prototype conversation from the corpus; (1) classify the strategy of the last input sub-utterance; (2) extract exchanges from the prototype conversation that match the last strategy; (3) condition on these exchanges to generate next sub-utterance candidates; then select the next sub-utterance from the generated candidates or the prototype conversation. The framework bases each conversation on a single prototype conversation and learns three components to construct responses: a counselor strategy classifier to select relevant exchanges to condition the generator on, a generator that decodes diverse sub-utterances, and a classifier that selects to append either a generated sub-utterance or one selected from the prototype conversation.
We combine generated and selected prototype sub-utterances so that the model can respond with diverse, specific responses and appropriately respond to unseen contexts.

Detecting counselor strategies
To identify user intent, i.e., strategies used in a counselor's input message, we use a counselor strategy classifier. This classifier is trained on the counselor strategies that we annotated for each sub-utterance, as counselor messages typically contain multiple sub-utterances with different strategies (as shown in the introductory example in Figure 1). The counselor strategy classifier predicts the probability of strategy s_t^i for the i-th sentence at turn t as

p(s_t^i) = softmax(W_s [h_{t-1}^l ; h_t^i]),

where h_{t-1}^l and h_t^i are hidden states from an underlying fine-tuned language model for the last sentence l of the visitor's message at the previous turn t-1 and at the end of the i-th sentence of the counselor's input message at turn t, respectively. The weights W_s are learned. At each training step, this classifier incurs a cross-entropy loss across strategies, L_strat.
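Assuming the classifier head is a softmax over the concatenated hidden states, a toy numpy sketch of the strategy head and its cross-entropy term might look like the following (dimensions and the exact head form are illustrative, not the authors' implementation):

```python
import numpy as np

HIDDEN, N_STRATEGIES = 8, 25   # 25 strategies in the annotation scheme
rng = np.random.default_rng(0)

# Learned projection over the concatenated hidden states [h_{t-1}^l ; h_t^i]
# (toy dimensions; the real hidden size comes from the language model).
W_s = rng.normal(size=(N_STRATEGIES, 2 * HIDDEN))

def strategy_probs(h_prev_last: np.ndarray, h_cur_i: np.ndarray) -> np.ndarray:
    """p(s_t^i) = softmax(W_s [h_{t-1}^l ; h_t^i])."""
    logits = W_s @ np.concatenate([h_prev_last, h_cur_i])
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

def strategy_loss(probs: np.ndarray, gold: int) -> float:
    """Per-sub-utterance cross-entropy term contributed to L_strat."""
    return -float(np.log(probs[gold]))
```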

Extracting conversational exchanges based on strategies
We use the last detected counselor strategy to identify similar, relevant conversational exchanges in the prototype conversation. Relevant exchanges are selected as counselor message and visitor response pairs where the last counselor sub-utterance strategy matches the last strategy in the counselor's current input message.
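The matching step can be sketched as follows; the `prototype` data layout here is a hypothetical structure for annotated turns, not the authors' implementation:

```python
def extract_exchanges(prototype, last_strategy):
    """Select (counselor message, visitor response) exchanges from the
    prototype conversation whose final counselor sub-utterance matches the
    strategy detected in the counselor's current input.

    `prototype` is a list of turn pairs, each of the form
    ([(sub_utterance, strategy), ...], visitor_response).
    """
    matches = []
    for counselor_subutts, visitor_response in prototype:
        if counselor_subutts and counselor_subutts[-1][1] == last_strategy:
            counselor_msg = " ".join(text for text, _ in counselor_subutts)
            matches.append((counselor_msg, visitor_response))
    return matches
```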

Generating diverse sub-utterances conditioned on exchanges
To generate sub-utterances characteristic of a visitor's persona, we condition generation from a fine-tuned language model on the selected conversational exchanges. The exchanges, and the speaker roles within them, are distinguished from the current context messages by prepending special tokens to each message. The examples and roles are further distinguished from current conversation roles by token embeddings, as in previous work distinguishing speaker roles (Wolf et al., 2018). This approach differs from imbuing conversation with factual knowledge or profile facts (Wolf et al., 2018; Li et al., 2019b) in that we condition on conversational exchanges and in how we select the exchanges.
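One way this input assembly could look, with hypothetical special-token names (the actual tokens added to the vocabulary are not specified in this excerpt):

```python
# Illustrative special tokens; token-type embeddings in the model would
# mirror the same example-vs-current split marked here.
EX_COUNSELOR, EX_VISITOR = "<ex-counselor>", "<ex-visitor>"
COUNSELOR, VISITOR = "<counselor>", "<visitor>"

def build_model_input(exchanges, history):
    """Prepend retrieved prototype exchanges to the live conversation
    context, marking example roles and current roles with distinct tokens."""
    parts = []
    for counselor_msg, visitor_resp in exchanges:
        parts += [EX_COUNSELOR, counselor_msg, EX_VISITOR, visitor_resp]
    for role, msg in history:
        parts += [COUNSELOR if role == "counselor" else VISITOR, msg]
    return " ".join(parts)
```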

Choosing a selected or generated sub-utterance
To add detail to responses, while maintaining persona consistency, we allow sub-utterances to be selected from the prototype conversation. We also allow generated sub-utterances, so that the model can appropriately respond to any input. For example, our system needs to generate a response to an unseen direct question, but then may elaborate with text selected from the prototype conversation.
We consider sub-utterances as individual units and train a next-sub-utterance classifier to select between generated sub-utterance candidates and sub-utterances from the prototype conversation. This classifier takes the form

y_t^i = σ(W_y h_t^i),

where y_t^i is the prediction of whether a candidate sub-utterance is the true i-th sub-utterance in turn t, h_t^i is the last hidden state of extending turn t with an i-th candidate sub-utterance, and the weights W_y are learned. This model component incurs a loss L_next. To train this component, a set of five examples is constructed for each sub-utterance in the ground-truth response: the true next sub-utterance, three randomly selected sub-utterances, and one sub-utterance from the visitor's recent conversation history, i.e., everything the generator considers. All of these candidates are selected from the same prototype conversation, and the inclusion of a recently said sub-utterance is used to decrease repetition.
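The candidate-set construction for training can be sketched as follows (function and argument names are illustrative):

```python
import random

def build_candidates(true_next, prototype_subutts, recent_history, seed=0):
    """Build the five training candidates for one ground-truth sub-utterance:
    the true next sub-utterance, three random sub-utterances from the same
    prototype conversation, and one sub-utterance the visitor recently said
    (included to discourage repetition)."""
    rng = random.Random(seed)
    pool = [s for s in prototype_subutts if s != true_next]
    candidates = ([true_next]
                  + rng.sample(pool, 3)          # random negatives
                  + [rng.choice(recent_history)])  # recently said negative
    rng.shuffle(candidates)
    return candidates
```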

Multi-task training
The model components described above are trained in a multi-task fashion by minimizing the composite loss

L = λ L_LM + L_strat + L_next,

where L_LM is the cross-entropy loss from the fine-tuned pre-trained language model and the other losses are as described above. The parameter λ controls the focus on the language model and is empirically set to 2.0 in all experiments. The language model that we use is first tuned for dialogue in a multi-task framework on the PersonaChat dataset (Wolf et al., 2018), then fine-tuned on the full corpus of transcripts collected, and finally tuned on our annotated subset of transcripts in the above framework.
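The composite objective can be expressed directly; the sketch below assumes the three losses have already been computed as scalars:

```python
LAMBDA = 2.0  # empirical weight on the language-modeling loss

def composite_loss(l_lm: float, l_strat: float, l_next: float) -> float:
    """Multi-task objective: L = lambda * L_LM + L_strat + L_next."""
    return LAMBDA * l_lm + l_strat + l_next
```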

Experimental setup
To evaluate the proposed full framework and ablated versions, we consider human judgements of technical conversation quality, as well as both automatic measures and human judgements of engaging persona development. We collect these measures first with untrained crowdworkers. Then, based on the crowdworker assessments, we select a subset of models for counselors to assess. Not all of the models are evaluated by counselors, as only a limited number of experienced hotline counselors could be recruited to participate. Further, each model needs to be considered multiple times by each counselor in order for them to assess the distinctness of personas between conversations. All human ratings used 5-point Likert scales. This study was approved by the internal review board.

Baseline models
We consider three models with subsets of the full model's components. The Tuned model is a pre-trained language model (Wolf et al., 2018) fine-tuned on the full 554 counseling conversations collected. The +rank model jointly trains the next sub-utterance classifier and fine-tunes the pre-trained language model on the subset of annotated transcripts. The sub-utterance classifier is the mechanism used to increase response length and diversity by interleaving generated sub-utterances with those selected from the prototype conversation. The third model, +cond, trains the strategy classifier and fine-tunes the pre-trained language model dynamically conditioned on example exchanges from the prototype conversation; it uses no sub-utterance ranking. Finally, we consider the Full model, which fine-tunes the language model with both the strategy and next sub-utterance classifiers.

Training and interaction details
Our annotated dataset was randomly split, with 90% of conversations in the training set and the rest used for validation. During testing, we randomly select a prototype conversation from the validation set at the start of the conversation and only consider selecting sub-utterances and conditioning on examples from that prototype conversation. This approach is intended to focus the conversation on a single visitor for persona consistency. To reduce repetition, we remove examples and sub-utterances from further consideration once a sub-utterance has been copied while dynamically conditioning on prototype examples or has been selected from the set of the visitor's sub-utterances.
To encourage visitor coherence, we use five turns of conversation history for context. To accommodate limited context length, we consider the last sub-utterance of the counselor's messages, as this is typically where counselors will advance the conversation and require a response. We found in preliminary testing that this balance improved the visitor coherence and decreased repetition without significant detriment to responsiveness. Hyper-parameter settings are in the Appendix.

Automatic measures of response diversity
We calculate several metrics to compare the diversity of responses generated by each model. We measure the specificity of a model with the average negative log likelihood (nll.) of each generated response given its context, computed with respect to the Tuned model so that it is comparable across models. We measure the breadth of responses with the entropy (ent.) of a unigram language model over all generated tokens, and the diversity of response elements with d1 and d2, the ratios of the number of unique unigrams or bigrams to the total number generated, respectively. Finally, we measure the length of responses with the average number of tokens in a response (#tok).
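The ent., d1, d2, and #tok measures can be computed as follows; a whitespace tokenizer stands in here for whatever tokenization the authors used:

```python
import math
from collections import Counter

def diversity_metrics(responses):
    """Unigram entropy (ent.), distinct-1/2 ratios (d1, d2), and mean
    response length in tokens (#tok) over a set of model responses."""
    tokens, bigrams = [], []
    for r in responses:
        toks = r.split()
        tokens.extend(toks)
        bigrams.extend(zip(toks, toks[1:]))
    counts = Counter(tokens)
    total = len(tokens)
    # Entropy of the unigram distribution over all generated tokens.
    ent = -sum((c / total) * math.log2(c / total) for c in counts.values())
    d1 = len(set(tokens)) / total
    d2 = len(set(bigrams)) / max(len(bigrams), 1)
    return {"ent": ent, "d1": d1, "d2": d2, "#tok": total / len(responses)}
```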

Human judgements of quality
For conversation quality, we asked both crowdworker and counselor participants to rate Crisisbot on coherence, fluency and grammatical correctness of responses, and human-likeness.

Human judgements of persona
We also sought to gauge the quality of persona simulation in conversations. To do this, we asked all participants to rate each conversation on persona consistency and level of response detail. We additionally asked counselor participants to rate how realistic each conversation was of a hotline visitor. After their second and third conversations with Crisisbot, we asked counselors to rate how distinct the persona was from previous conversations. Finally, at the end of each interaction (one conversation for crowdworkers and three conversations for counselors), we asked participants to rate how useful they thought such conversations would be for training.

Crowdworker user study
Crowdworkers with US-based location qualification who chatted with Crisisbot were required to write at least 10 turns before rating their experience.
The interface they interacted with included suggestions of things a counselor might say and it was suggested that they explore the problems facing Crisisbot. A persona, i.e., prototype conversation, was randomly selected for each conversation.

Hotline counselor user study
After the crowdworkers' preliminary evaluation, we asked experienced hotline counselors to chat with Crisisbot. Counselors were informed of the purpose of the study and asked to interact with Crisisbot for three conversations in each of two different settings, i.e., three conversations with each of two models. In each setting a counselor interacted with a single model but with three different personas, and the personas were fixed between models. As with crowdworkers, counselors were required to write at least 10 messages per conversation. A survey followed each conversation. Unlike the crowdworker interface, the counselor interface did not include examples of things to say.

Evaluation Results
Each of the four models was interactively evaluated by 30 crowdworkers, and we calculated measures of model response diversity on all responses generated (Table 3). Based on this initial crowdworker evaluation, two models were selected for counselor testing. Only two models were tested by counselors because testing each model required significant participant commitment: counselors had to chat with each model multiple times to consider whether the model simulated distinct personas. Because we could only access a limited pool of experienced counselors, a total of five counselors interacted with the two selected models, resulting in the evaluation of 15 conversations for each model (Table 4).

Crowdworker evaluation
Crowdworkers generally reviewed all model variants positively. While the Full model had slightly lower scores on some quality measures (e.g., coherence and fluency), it was rated as having the greatest level of detail and hypothesized utility for training, which coincided with increased response diversity across all diversity measures (Table 3). This result indicates that crowdworkers preferred the utility of diverse responses for training different scenarios and could accept a minimal impact on other quality measures.

Hotline counselor evaluation
Full model responses were more diverse across all measures when responding to counselor messages than Tuned model responses. Open counselor feedback appears in Table 6. In this feedback, all of the counselors cited variability in responses and specificity of issues as what they liked about the Full model. When asked what they disliked, themes emerged about counselors finding it difficult to get more information from the Tuned model and feeling that its simulated problem scenarios did not feel real. In contrast, none of the counselors reported disliking anything related to specificity with the Full model, but some issues arose with responsiveness and consistency across turns, which is an ongoing challenge for dialogue systems (Roller et al., 2020).

Qualitative review of conversations
Reviewing the interactive conversations reveals considerable differences between counselor and crowdworker conversation approaches, enormous variability in counselor ratings, and inconsistency in how counselor ratings relate to researcher perceptions of individual conversation quality. Examples of conversations can be found in the Appendix. Counselors were generally more engaged than crowdworkers, asked follow-up questions, and, as they are trained to do, tried to explore issues confronting visitors. Conversely, crowdworker messages were more focused within individual turns and hopped freely between topics. This unsurprising difference emphasizes the need for a specialized chatbot to train counselors how to follow up.
As the average quantitative ratings in Tables 3 and 4 reflect, even relatively coherent conversations that counselors had with Crisisbot received considerably lower scores than comparable conversations with crowdworkers. These deflated counselor ratings could relate to counselors' greater experience interacting with humans over chat and their training to ask follow-up questions that help individuals dive deeper into conversation but are more challenging for a bot to respond to. Similarly, the gap could relate to crowdworkers' greater experience with chatbots and thus lower expectations. Regardless, dialogue systems are notoriously difficult to evaluate, and it is also possible that our participants struggled to separate potential improvements on fine-grained dimensions from overall chatbot performance, which resulted in different trends between the evaluations.

Model - Counselor liked
Tuned: depression is simulated well, but the train of thought doesn't last long
Tuned: They are good practice for thinking on your toes and receiving answers to your messaging, so you can work on various approaches depending on the situations. Also, there was a range of suicidality and crisis level.
Tuned: First two were generally more coherent than in my previous session. The chatbot seems to respond well to messages showing validation and empathy.
Tuned: They were a good opportunity to work with ambivalence and practice some text-based OARS
Tuned: Content seemed pretty realistic but still not quite coherent enough to be realistic
Full: More crises situations thrown in. Variety from previous submission.
Full: There was a variety of responses that prompted critical thinking and creativity.
Full: i think the statements about depression and past suicidal ideation and attempts are useful for training counselors. likewise the mention of bullying.
Full: Variety of issues presented

Model - Counselor disliked
Tuned: The third chatbot tried to end the conversation early and kept saying thanks, goodbye, etc. but I had to keep the conversation going for testing - not sure if this was an error.
Tuned: Trying to get more information was very difficult when answers were short and didn't really make sense
Tuned: i did not feel like i was talking to someone with a real life or real problems
Tuned: The responses were not connected to the unfolding narrative of the conversation, so it does not replicate the rapport building process that is needed in this work.
Full: When they would mimic my name choice, took me out of it, although I suppose it does happen occasionally.
Full: There were many inconsistencies in the conversation and the flow was not there. It felt like responding to one individual statement at a time more than a series of statements that went together.
Full: i think the bot has to be more flexible in responding to my input - often the topic got changed altogether
Full: It would help if the personas and storylines were more defined and consistent.

Conclusion
We explored developing a Crisisbot that imitates hotline visitors' varied personal stories to better train human counselors. Towards this goal, we developed a counselor strategy annotation scheme and proposed a multi-task framework. This framework bases conversations on personas from prototype conversations and interleaves generated text with text selected from the prototype. Through two user studies, we observed increased response diversity overall, but a considerable difference between crowdworker and counselor perceptions and inconsistency in how ratings reflect conversations. These results reinforce the known challenge of dialogue system evaluation and caution against relying on crowdworkers alone for developing system insights. The disparity in ratings between counselors and crowdworkers stresses that the involvement of specialized, intended users in system development and evaluation is crucial for success. While perhaps intuitive, this result is sometimes overlooked during system evaluation. Counselor open feedback indicates that additional development is required for coherence across turns, but that our framework adds detail and variety in personas, which could enable better variety for training counselors.

Strategy class: Strategies
Functional: greet, self-intro, close-conversation, ask-name, thanks
Procedural: identify-coping-plan, identify-resource, intent-to-self-harm, identify-self-harm-plan, access-to-self-harm-plan, when-self-harm

A Appendices
We present auxiliary materials supporting our work, including additional detail on our counselor strategy annotation scheme, hyperparameter settings, information about evaluation scales, example model output from interactive testing, and open counselor feedback.

B Counselor strategies
Each counselor sub-utterance was labeled with a conversation strategy; strategies are grouped into classes. A full list of the annotated strategies is in Table 7.

C Training details
The framework was trained with 4 gradient accumulation steps, a batch size of 1, and the Adam optimizer with a learning rate decay of 0.75 for 4-5 epochs, based on minimum validation loss. For decoding, the maximum decoding length was set to 50 tokens, and tokens were decoded with nucleus sampling (probability mass 0.9) at temperature 0.7. For selecting sub-utterances, 10 candidates were generated for consideration. To avoid the model only learning to copy from selected exchanges, we condition the language model on exchanges for a randomly selected counselor strategy for 20% of messages during training.
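As a concrete illustration of the decoding setup above, the following is a minimal sketch of temperature-scaled nucleus (top-p) sampling for a single token. The function name and plain-list logits interface are our own for illustration; the paper's implementation would operate over a full model vocabulary within a token-by-token decoding loop.

```python
import math
import random

def nucleus_sample(logits, temperature=0.7, top_p=0.9, rng=random):
    """Sample one token id from temperature-scaled logits with top-p filtering."""
    # Temperature-scale the logits, then apply a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of highest-probability tokens whose mass >= top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize within the nucleus and draw a sample.
    r = rng.random() * mass
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]

# Generating several candidate sub-utterances amounts to repeating the full
# decoding loop; here we draw 10 single-token samples as a stand-in.
candidates = [nucleus_sample([2.0, 2.0, -10.0]) for _ in range(10)]
```

Restricting sampling to the top-p nucleus truncates the low-probability tail that temperature sampling alone would occasionally draw from, which trades a little diversity for noticeably fewer incoherent tokens.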

D Human judgment scales
The scales used by crowdworkers and counselor participants were labeled as follows.
• How coherent was the conversation? 1 (didn't make sense) - 5 (could have been a human)
• How realistic was the conversation as a simulation of a hotline visitor? 1 (it wasn't anything like chatting with a visitor) - 5 (very realistic and representative)
• How consistent was the persona? 1 (many contradictory statements) - 5 (consistent and plausible person)
• How fluent and grammatical were the responses? 1 (constant mistakes, poor English) - 5 (no mistakes)
• How detailed were the chatbot's responses? 1 (only vague answers without any detail) - 5 (lots of details)
• Overall, how human-like was the chatbot? 1 (not at all) - 5 (could have been a human)
• If you were training to be a counselor, how useful would these conversations have been for your training? 1 (not at all useful) - 5 (very useful)

E Example conversations
Example conversations between crowdworkers and counselors can be found in Tables 8-11.

F Selected successful transcript from counselor
The full conversation of the example of a counselor interacting with the Full model that was excerpted in the main text is in Table 12. This conversation highlights low counselor feedback in spite of relatively coherent responses. Numbered turns are the human counselor; unnumbered turns are the simulated visitor.

Visitor: i'll talk to my friend. ok.
Counselor (9): having a close friend to talk to in times like this is great. that should help you!
Visitor: i will. thank you for your help.
Counselor (10): ok, thanks for chatting with me! i hope everything goes well.
Visitor: i will. thank you so much for your help.

Visitor: she's been really depressed lately. my parents want to divorce and i'm so stressed that i can't focus on school.
Counselor (5): i'm sorry to hear that. divorce is very stressful for the whole family, and it makes sense that you are stressed and can't focus.
Visitor: yeah, i can't even face my own dad right now. he's always pressuring me.
Counselor (6): wow, on top of that, you are feeling pressured by your dad and it sounds like that is a difficult relationship, too. has anything felt helpful to you when you have felt stressed in the past?
Visitor: sometimes i get really overwhelmed by stuff, and i can't focus on school. i feel like i'm a burden and i don't know how to deal with school.
Counselor (7): i hear that life feels very overwhelming. no wonder you can't focus on school! has there been a time in the past when you were overwhelmed, but things got better?
Visitor: hopefully, i'll be able to set an appointment with a therapist soon and also see my girlfriend in the meantime though.
Counselor (8): yes, i hear that talking to your therapist is something that helps you, and also spending time with your girlfriend. are you able to reach out to either of them today?
Visitor: i'm not ready to think about that yet. but i do miss her and she's the only person who's ever really supported me emotionally.
Counselor (9): i see, maybe it feels like you are not ready for that kind of social contact yet? is there something that you can do on your own that helps you feel more relaxed?
Visitor: okay, i'll keep that in mind next time i check in. thank you. bye
Counselor (10): okay, thank you so much for reaching out today. we are always here to talk. how are you feeling after our chat today?
Visitor: not great, but i'm okay. i've been really stressed lately.