Building blocks of a task-oriented dialogue system in the healthcare domain

There has been significant progress in dialogue systems research. However, dialogue systems research in the healthcare domain is still in its infancy. In this paper, we analyse recent studies and outline three building blocks of a task-oriented dialogue system in the healthcare domain: i) privacy-preserving data collection; ii) medical knowledge-grounded dialogue management; and iii) human-centric evaluations. To this end, we propose a framework for developing a dialogue system and show preliminary results of simulated dialogue data generation by utilising expert knowledge and crowd-sourcing.


Introduction
Dialogue systems research has progressed significantly in recent years with the help of large-scale pre-trained language models (LMs) (Vaswani et al., 2017; Radford et al., 2019; Lewis et al., 2020). Pre-trained LMs generalise well thanks to massive training data collected from the internet, and they achieve state-of-the-art performance over a wide range of dialogue domains (Zhang et al., 2020). While many studies exist on general-purpose dialogues, research on dialogue systems for healthcare applications is still in its infancy.
There are two major directions in the development of a dialogue system. One direction is to build a chatbot that can hold a conversation with a user. This approach mainly focuses on generating an appropriate response given the user input and the dialogue history. Researchers have pursued this direction to create systems that produce more human-like (Adiwardana et al., 2020), consistent (Wolf et al., 2019), and empathetic (Rashkin et al., 2019) responses. The other direction is to build a task-oriented dialogue system that performs a specific task, such as triage or diagnosis within the healthcare domain, where researchers focus on developing systems that can detect implicit symptoms or reach a precise diagnosis or triage decision (Middleton et al., 2016; Razzaki et al., 2018; Xu et al., 2019; Wei et al., 2018).
In this study, we consider a dialogue system for a sleep coaching programme for healthy people who would like to optimise their sleep. Motivated by cognitive behaviour therapy for insomnia (CBT-I), we focus on investigating the relationship between how people think, behave, and sleep (Morin et al., 2006). The first step of the coaching programme is a complaints assessment, which identifies sleep issues and their potential causes and decides the next step (e.g., referring to sleep apnea treatment, providing sleep education, suggesting a behaviour change programme, etc.). During this process, a coaching provider (coach) acts as an active listener, asking questions to probe for specific information, while a coaching receiver (user) has more opportunity to voice complaints and elaborate on them.
The real challenges in the development of a dialogue system, especially a machine learning-based system, come from three fundamental questions: i) how to obtain relevant data; ii) how to develop an automated system; and iii) how to evaluate a system. In this paper, we first analyse existing approaches that address these questions (Section 2). We then propose our method to address them (Section 3), present preliminary results, and discuss the method's limitations (Section 4).
The major contributions of this paper are as follows:
• Identifying gaps in existing dialogue systems in the healthcare domain.
• Proposing a framework consisting of three building blocks.
• Constructing a dataset to illustrate the validity of the proposed method.
2 Related Work

Data Collection
Obtaining dialogue data is time-consuming, and such data might not be available at all, especially in the healthcare domain. Several recent studies have created large-scale conversation datasets in the healthcare domain by scraping dialogues from online websites (Wei et al., 2018; Xu et al., 2019; Zeng et al., 2020). These web-scraping approaches, however, are not scalable and might create potential privacy issues.
To mitigate the scalability issue, some studies leverage domain knowledge to generate simulated dialogue. For example, Liednikova et al. (2020) modelled a typical dialogue flow between doctor and patient in the form of a tree. They then augmented the data by adding similar sentences extracted from an online forum. A drawback of this approach is that it requires access to data sources, which might not be available within European countries in the light of the General Data Protection Regulation (GDPR). In contrast, Liu et al. (2019) proposed a framework for generating simulated data based on templates, which are logically and clinically verified, and incorporated linguistic knowledge to create diverse augmented data.
Another line of work on collecting dialogue data is to utilise a user simulator. User simulators have been widely used to interact with dialogue systems. Some recent works adapted an agenda-based user simulator (Schatzmann and Young, 2009) to create training data for dialogue-based diagnosis systems (Wei et al., 2018; Xu et al., 2019). However, they still utilised web-scraped data to model user behaviour.

Dialogue Management
Dialogue management is the component of a dialogue system that processes the dialogue context and decides the next action for the agent to take (Young et al., 2013). For health-related dialogue (e.g., symptom check, triage, diagnosis, etc.), the role of dialogue management is to decide what to ask, answer, or inform given the context. Middleton et al. (2016) cast triage as a sequence of questions and answers. They modelled the triage flow as a graph by encoding medical knowledge. This graph plays the role of dialogue management, guiding the system as it interacts with users and makes a triage decision. This approach has the following advantages: 1) it alleviates the issue of data collection, since it relies on human expert knowledge rather than on machine learning with large-scale data; 2) it can reason about its predictions. However, the limitation of this approach is that it requires a lot of expert resources.
Some task-oriented dialogue systems learn how to manage a dialogue flow by reinforcement learning (RL) (Wei et al., 2018; Xu et al., 2019). For example, Wei et al. (2018) framed a dialogue management module as an RL agent with a deep Q-network (Mnih et al., 2015). With this approach, the RL agent can decide the next action (i.e., to inquire about implicit symptoms, to make a diagnosis, etc.) based on the current dialogue state. Later, Xu et al. (2019) showed that incorporating a medical knowledge graph and symptom-disease relations allows an RL agent to ask about more relevant implicit symptoms and make a precise diagnosis.
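To make the RL framing concrete, the following is a minimal sketch of a tabular Q-learning agent for dialogue management. It is a deliberate simplification and not the deep Q-network used by Wei et al. (2018): the action names, state labels, and hyperparameters are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict

# Hypothetical action set: probe a factor or end the dialogue with a decision.
ACTIONS = ["ask_caffeine", "ask_screen_time", "conclude"]


class QAgent:
    """Tabular Q-learning agent (assumed stand-in for a deep Q-network)."""

    def __init__(self, alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
        self.q = defaultdict(float)  # (state, action) -> estimated return
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)

    def act(self, state):
        """Epsilon-greedy choice of the next dialogue action."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state, done):
        """One Q-learning step towards reward + gamma * max_a Q(next_state, a)."""
        best_next = 0.0 if done else max(self.q[(next_state, a)] for a in ACTIONS)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])


agent = QAgent(alpha=0.5, gamma=0.9)
agent.update("asked_all", "conclude", 1.0, "end", done=True)
agent.update("asked_one", "ask_screen_time", 0.0, "asked_all", done=False)
print(agent.q[("asked_all", "conclude")], agent.q[("asked_one", "ask_screen_time")])
```

In a full system the dialogue state would encode the symptoms or factors observed so far, and the reward would come from a user simulator or from real users.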
There are also some recent works on developing generative models for an end-to-end dialogue system in the healthcare domain (Liednikova et al., 2020; Zeng et al., 2020) by utilising generative pre-trained LMs (Wolf et al., 2019; Radford et al., 2018, 2019; Lewis et al., 2020; Zhang et al., 2020; Vaswani et al., 2017). However, since these generative models are less controllable (Wallace et al., 2019), using a pre-trained LM-based generative model for health-related conversation could be risky.

Evaluation
To evaluate a task-oriented dialogue system, multiple metrics are used: both automatic evaluation metrics and human evaluation metrics. Automatic evaluation metrics include success rate, the average number of turns per dialogue session, matching rate, and average reward for an RL-based system (Li et al., 2017; Wei et al., 2018; Xu et al., 2019). While the automatic metrics focus on task completion, human evaluation metrics consider qualitative aspects of the dialogue, such as the quality of the dialogue flow, the appropriateness of decision making (diagnosis validity), and dialogue fluency scored by experts (Razzaki et al., 2018; Xu et al., 2019).
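Two of the automatic metrics above, success rate and average number of turns, can be computed directly from session logs. The sketch below assumes a hypothetical `Session` record with a success flag and a turn count; the cited works may define and log these quantities differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Session:
    """One finished dialogue session (hypothetical record format)."""
    success: bool   # whether the system completed the task
    num_turns: int  # number of turns in the session


def success_rate(sessions: List[Session]) -> float:
    """Fraction of sessions in which the task was completed."""
    if not sessions:
        return 0.0
    return sum(s.success for s in sessions) / len(sessions)


def average_turns(sessions: List[Session]) -> float:
    """Mean number of turns per dialogue session."""
    if not sessions:
        return 0.0
    return sum(s.num_turns for s in sessions) / len(sessions)


sessions = [Session(True, 6), Session(False, 10), Session(True, 8)]
print(success_rate(sessions), average_turns(sessions))
```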
However, the user perspective has received less attention in the evaluation of task-oriented dialogue systems in healthcare. User-centric metrics, such as a user rating score or a user preference score, are widely used for evaluating general-purpose dialogue systems (Shah et al., 2018; Budzianowski and Vulić, 2019; Roller et al., 2020). A user-centric metric can be used not only to assess the performance of a system but also to debug it. For example, a user might have difficulty understanding the complex language that a system uses or be annoyed by too many questions without a proper explanation. In this case, proper user-centric metrics can provide insight into which aspects of a system should be updated.

Building Blocks
Here we outline three building blocks of a dialogue system in the healthcare domain and identify open research questions for each building block. To this end, we propose a framework for developing a conversation agent for healthcare-related dialogues.

Privacy-Preserving Data Collection
As mentioned earlier, potential privacy issues create challenges in data collection, especially in European countries in the light of GDPR. We identify three potential methods of data collection that safeguard privacy. The first is to apply appropriate privacy protection techniques to the collected data, such as de-identification, which replaces sensitive information in text (Neamatullah et al., 2008; Meystre et al., 2010; Neubauer and Heurix, 2011). The second is to generate synthetic data by training generative models on the collected data (Guan et al., 2019; Hatua et al., 2019; Pan et al., 2020). The third is to generate simulated data by building a user simulator that can interact with a dialogue system (Wei et al., 2018; Xu et al., 2019; Kao et al., 2018). Applying these three methods, however, raises the following questions: How large is the risk of information leakage? What is the difference in performance between models trained on de-identified, synthesised, simulated, and real data?
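As a rough illustration of the first method, de-identification can be sketched as pattern-based replacement of sensitive spans with placeholder tags. The patterns and tags below are assumptions made for this example and are far from sufficient for real clinical text; the cited de-identification systems are considerably more sophisticated.

```python
import re

# Illustrative patterns only (assumptions); real de-identification also has to
# handle names, addresses, record numbers, and many other categories.
# Dates are replaced before phone numbers so their digits are not re-matched.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "[DATE]"),
    (re.compile(r"\+?\d[\d\s-]{7,}\d"), "[PHONE]"),
]


def deidentify(text: str) -> str:
    """Replace every match of a sensitive pattern with its placeholder tag."""
    for pattern, tag in PATTERNS:
        text = pattern.sub(tag, text)
    return text


message = "Mail me at jane.doe@example.com or call +31 6 1234 5678 after 12/03/2021."
print(deidentify(message))
```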

Medical Knowledge-Grounded Dialogue Management
Unlike an open-domain dialogue, a healthcare-related dialogue should be grounded in medical knowledge. Two types of knowledge can be included in a dialogue system. The first is knowledge about the dialogue between a healthcare professional and a healthcare recipient. For example, in the healthcare domain, there is a typical dialogue structure that professionals are advised to follow. Modelling this dialogue structure can guide a system towards an appropriate dialogue flow (Middleton et al., 2016; Razzaki et al., 2018). The second is medical knowledge, including correlations between symptoms and causal relations between symptoms and diseases. Incorporating medical knowledge can allow a system to hold a more appropriate dialogue and make a precise decision (Ni et al., 2017; Ghosh et al., 2018; Xu et al., 2019). The open questions are: How can expert knowledge be efficiently encoded into a machine-accessible format (e.g., knowledge graph, knowledge base), and how can it be incorporated into a machine learning model? How can previously built knowledge be kept up to date?

Human-Centric Evaluation
Since a dialogue system is designed to interact with a user, human evaluation should be considered the ideal evaluation. More specifically, two types of human evaluation metrics should be considered to correctly evaluate a dialogue system in the healthcare domain: one from the expert (healthcare professional) perspective and the other from the end-user (healthcare recipient) perspective. Domain experts should validate the appropriateness of the dialogue actions taken by an agent and assess the quality of the dialogue (Razzaki et al., 2018; Xu et al., 2019). In addition, end-users should evaluate a system in terms of satisfaction, usability, and comprehensibility by rating each aspect (Shah et al., 2018) or by indicating the preferred system (Roller et al., 2020). This raises the following questions: Which aspects are critical for assessing both the functionality and the usability of a system? How can these evaluations be fed back to update a system efficiently?

A Proposed Framework
Considering the above-mentioned building blocks, we propose a framework for developing a conversational agent in the healthcare domain as illustrated in Figure 1.

Simulated Data Generation
The proposed framework generates simulated dialogue data to avoid potential privacy issues in data collection. We follow recent works on generating a simulated data set based on knowledge of user behaviour and the characteristics of dialogue, without using real user data (Shah et al., 2018). This consists of two steps: first, a template is constructed by exploiting expert knowledge; second, the data is augmented by utilising crowdsourcing.
Reinforcement Learning Agent
Similar to previous studies (Wei et al., 2018; Xu et al., 2019), we frame the dialogue management module as an RL agent. We propose a two-step training procedure. In the first step, the RL agent is trained with a user simulator, either an agenda-based (Schatzmann and Young, 2009) or a model-based (El Asri et al., 2016; Kreyssig et al., 2018) one. In the second step, the RL agent is further trained by interacting with real-world users.
Model Evaluation
To evaluate the model, we use both automatic and human evaluation metrics. Since we consider a task-oriented dialogue system, success rate and matching rate (Xu et al., 2019) are used as automatic metrics. For the human evaluation metrics, validity scores by experts (Razzaki et al., 2018) and preference scores by users (Li et al., 2019) are used.

Preliminary Results
This section describes an initial approach to generating simulated dialogues based on a template and crowdsourced data. The goal of a dialogue is to assess user complaints related to their sleep and to identify all potential behavioural factors that might be associated with the reported complaints.

Dialogue Template
We consulted an expert in the sleep domain to model a dialogue between user and coach in the form of a tree. The dialogue template is structured in three parts, with questions and potential answers related to sleep issues, the impacts of sleep issues, and behavioural factors (i.e., habits/lifestyles that might affect sleep quality). More specifically, the sleep issue part contains one open-ended question associated with 11 potential answers and two close-ended follow-up questions (i.e., the frequency and the duration of the reported issue); the impact part contains one open-ended question associated with 10 potential answers and one close-ended follow-up question (i.e., an enquiry regarding daytime fatigue); and the behavioural factor part contains 11 close-ended questions. A subset of the dialogue template and a corresponding dialogue example are shown in Figure 2.
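One possible way to represent such a template in code is a small tree of question nodes grouped by part. The sketch below is illustrative only: the `Question` class, the question wording, and the answer labels are hypothetical, and only a small subset of each part is shown rather than the full template.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Question:
    """One coach question in the template tree."""
    text: str
    open_ended: bool
    answers: List[str] = field(default_factory=list)            # potential answer labels
    follow_ups: List["Question"] = field(default_factory=list)  # child questions


def walk(question: Question) -> Iterator[Question]:
    """Yield a question and all of its follow-ups, depth first."""
    yield question
    for child in question.follow_ups:
        yield from walk(child)


# Hypothetical wording and labels; only a small subset of each part is shown.
TEMPLATE: Dict[str, List[Question]] = {
    "sleep_issue": [
        Question(
            "What bothers you most about your sleep?",
            open_ended=True,
            answers=["difficulty falling asleep", "waking up during the night"],
            follow_ups=[
                Question("How often does this happen?", open_ended=False),
                Question("How long has this been going on?", open_ended=False),
            ],
        )
    ],
    "impact": [
        Question(
            "How does this affect you?",
            open_ended=True,
            answers=["poor concentration", "low mood"],
            follow_ups=[Question("Do you feel tired during the day?", open_ended=False)],
        )
    ],
    "behavioural_factor": [
        Question("Do you drink coffee in the evening?", open_ended=False),
        Question("Do you use a screen in bed?", open_ended=False),
    ],
}

all_questions = [q for part in TEMPLATE.values() for root in part for q in walk(root)]
closed = [q for q in all_questions if not q.open_ended]
print(len(all_questions), len(closed))
```

Keeping follow-up questions as children of the question they refine makes it straightforward to traverse one dialogue path at a time.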

Crowdsourced Data
We then collected crowdsourced data via the Amazon MTurk platform. Participants were asked to answer two open-ended questions related to sleep issues and their impacts and to check all applicable behavioural factors. Further, the participants were asked to paraphrase the specific sleep conditions (i.e., issues, impacts), if they had ever experienced them, and the selected behavioural factors. The former and the latter data are denoted as the answer data set and the paraphrase data set, respectively. The answer data set is further used to create user goals. Following previous works (Schatzmann and Young, 2009; Wei et al., 2018; Xu et al., 2019), we create a user goal G = (E, I) consisting of explicit information E, which is reported in the answers to the open-ended questions, and implicit information I, which comprises the answers on behavioural factors that can be retrieved via probing questions.
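The user goal G = (E, I) maps naturally onto a small record type. The following sketch assumes a hypothetical layout for one entry of the answer data set (the keys `issues`, `impacts`, and `behavioural_factors` and the factor names are invented for illustration).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class UserGoal:
    """User goal G = (E, I)."""
    explicit: Dict[str, List[str]]  # E: reported in the open-ended answers
    implicit: Dict[str, bool]       # I: behavioural factors, revealed by probing


def build_user_goal(record: dict) -> UserGoal:
    """Turn one answer-data-set record (hypothetical layout) into a user goal."""
    explicit = {"issue": list(record["issues"]), "impact": list(record["impacts"])}
    implicit = dict(record["behavioural_factors"])
    return UserGoal(explicit=explicit, implicit=implicit)


record = {
    "issues": ["difficulty falling asleep"],
    "impacts": ["daytime fatigue"],
    "behavioural_factors": {"evening_caffeine": True, "screen_in_bed": False},
}
goal = build_user_goal(record)
print(goal)
```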

Dialogue Simulation
The collected crowdsourced data are further used to simulate dialogues. At the beginning of each dialogue, a user goal is sampled from the answer data set. Then a dialogue is simulated based on the dialogue template with a set of handcrafted rules and augmented by using the paraphrase data set. An example of a user goal and examples of the simulated and augmented dialogues are shown in Appendix B.

Limitations and Future Study
In this paper, we show preliminary results of simulating dialogues based on the dialogue template and crowdsourced data. Our approach aims to augment the size of the simulated dialogue data set by replacing user answers with samples from the separate paraphrase data set. However, there are a few limitations that might be associated with the proposed method. More specifically, the following concerns should be addressed in a future study. First, the paraphrased sentences should be diverse and the simulated dialogues should cover all potential dialogue paths. To validate their quality, the paraphrased sentences and the simulated dialogues need to be assessed with proper measures. Second, as has been pointed out previously, the RL agent may not generalise well enough to real-world dialogues even though it works well with a user simulator. Therefore, there should be an additional step of on-line learning by interacting with real-world users (Shah et al., 2018) to mitigate this issue.

Conclusion
In this paper, we analyse recent studies on the development of a dialogue system in the healthcare domain and outline three building blocks, namely: i) privacy-preserving data collection; ii) medical knowledge-grounded dialogue management; and iii) human-centric evaluations. To this end, we propose a framework for developing a dialogue system and show preliminary results of simulated dialogue data generation by utilising expert knowledge and crowdsourcing. In future work, we foresee implementing a user simulator that can interact with a reinforcement learning agent, assessing the quality of the simulated dialogues, and deploying the reinforcement learning agent to interact with both a user simulator and real-world users.

A Crowdsourced Data
We collected two crowdsourced data sets for experiments. The answer data set contains user goals consisting of answers to the two open-ended questions (i.e., sleep issue and the impact of the issue) and one multiple-choice question (i.e., habits/lifestyles). The paraphrase data set contains paraphrased answers related to the sleep conditions (i.e., sleep issue and the impact of the issue) and the selected multiple-choice answers (i.e., habits/lifestyles). The collected data were annotated with class labels as shown in Tables 2 to 4. Figure 3 shows the label distributions of the collected data sets.

B User Goal and Simulated Dialogue
An example of a user goal is shown in Figure 4. To simulate a dialogue, we used the dialogue template with a set of handcrafted rules to select the coach's next question. Each question is followed by an answer drawn from the sampled user goal. If the question cannot be answered from the user goal, we randomly select either Yes or No as the answer. The simulated dialogue is then paraphrased by replacing user answers with samples from the paraphrase data set. Table 5 shows examples of a simulated dialogue and an augmented dialogue.
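The simulation and augmentation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual rule set: the template is flattened to a fixed question order instead of using handcrafted selection rules, the user goal is a flat slot-to-answer mapping, and all slot names, questions, and paraphrases are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Turn = Tuple[str, str, str]  # (slot, coach question, user answer)

# Hypothetical template, flattened to (slot, question) pairs in asking order.
TEMPLATE = [
    ("issue", "What bothers you most about your sleep?"),
    ("impact", "How does this affect you?"),
    ("evening_caffeine", "Do you drink coffee in the evening?"),
    ("screen_in_bed", "Do you use a screen in bed?"),
]


def simulate(goal: Dict[str, str], rng: random.Random) -> List[Turn]:
    """Answer each question from the user goal, else pick Yes or No at random."""
    turns = []
    for slot, question in TEMPLATE:
        answer = goal[slot] if slot in goal else rng.choice(["Yes", "No"])
        turns.append((slot, question, answer))
    return turns


def augment(turns: List[Turn], paraphrases: Dict[Tuple[str, str], List[str]],
            rng: random.Random) -> List[Turn]:
    """Swap each user answer for a sampled paraphrase when one is available."""
    augmented = []
    for slot, question, answer in turns:
        candidates = paraphrases.get((slot, answer), [])
        new_answer = rng.choice(candidates) if candidates else answer
        augmented.append((slot, question, new_answer))
    return augmented


rng = random.Random(0)
goal = {"issue": "difficulty falling asleep", "impact": "daytime fatigue",
        "evening_caffeine": "Yes"}
paraphrases = {
    ("issue", "difficulty falling asleep"): ["It takes me ages to drift off."],
    ("evening_caffeine", "Yes"): ["I usually have an espresso after dinner."],
}
dialogue = simulate(goal, rng)
for _, question, answer in augment(dialogue, paraphrases, rng):
    print("Coach:", question)
    print("User: ", answer)
```

Sampling a different paraphrase for the same simulated dialogue yields several augmented dialogues that share one dialogue path.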
Figure 3: Class label distributions of the collected data sets. Panels: (a) issue labels in the answer data set; (b) issue labels in the paraphrase data set; (c) impact labels in the answer data set; (d) impact labels in the paraphrase data set; (e) habit labels in the answer data set; (f) habit labels in the paraphrase data set. Note that the answer data set and the paraphrase data set have identical habit class label distributions, but the former contains binary values (i.e., True, False) and the latter contains free-text values (i.e., paraphrased sentences).