Task-oriented Dialogue System for Automatic Diagnosis

In this paper, we make a move to build a dialogue system for automatic diagnosis. We first build a dataset collected from an online medical forum by extracting symptoms from both patients' self-reports and conversational data between patients and doctors. We then propose a task-oriented dialogue system framework that makes diagnoses for patients automatically and can converse with patients to collect additional symptoms beyond their self-reports. Experimental results on our dataset show that the additional symptoms extracted from conversation can greatly improve the accuracy of disease identification, and that our dialogue system is able to collect these symptoms automatically and make a better diagnosis.


Introduction
Automatic phenotype identification using electronic health records (EHRs) has been a rising topic in recent years (Shivade et al., 2013). Researchers have explored various machine learning approaches to identify symptoms and diseases for patients given multiple types of information (both numerical data and pure text). Experimental results demonstrate effectiveness for identifying heart failure (Jonnalagadda et al., 2017; Choi et al., 2016), type 2 diabetes (Li et al., 2015; Zheng et al., 2017), autism spectrum disorders (Doshi-Velez et al., 2014), infections (Tou et al., 2018), etc. Currently, most attempts focus on specific types of diseases, and it is difficult to transfer models from one disease to another.
In general, each EHR contains multiple types of data, including personal information, admission notes, diagnostic tests, vital signs and medical images. It is collected cumulatively following the diagnostic procedure in the clinic, which involves interactions between patients and doctors and some complicated medical tests. Therefore, it is very expensive to collect EHRs for different diseases. How to collect information from patients automatically remains a challenge for automatic diagnosis.
Recently, due to its promising potential and alluring commercial value, research on task-oriented dialogue systems (DS) has attracted increasing attention in different domains, including ticket booking (Peng et al., 2017a), online shopping (Yan et al., 2017) and restaurant searching (Wen et al., 2017). We believe that applying DS in the medical domain has great potential to reduce the cost of collecting data from patients.
However, there is a gap to fill before DS can be applied to disease identification, with two major challenges: first, the lack of an annotated medical dialogue dataset; second, the lack of a DS framework designed for disease identification. By addressing these two problems, we make the first move to build a dialogue system that facilitates automatic information collection and diagnosis for the medical domain. Our contributions are two-fold:

• We annotate the first medical dataset for dialogue systems. It consists of two parts: self-reports from patients and conversational data between patients and doctors.
• We propose a reinforcement learning based framework for medical DS. Experimental results on our dataset show that our dialogue system is able to collect symptoms from patients via conversation and improve the accuracy of automatic diagnosis.

Dataset for Medical DS
Our dataset is collected from the pediatric department of a Chinese online healthcare community, a popular website where users inquire with doctors online. Usually, a patient provides a piece of self-report presenting his/her basic conditions. A doctor then initiates a conversation to collect more information and makes a diagnosis based on both the self-report and the conversational data. An example is shown in Table 1. As we can see, the doctor can obtain additional symptoms during the conversation beyond the self-report.
For each patient, we can also obtain the final diagnosis from the doctor as the label. For clarity, we term symptoms from self-reports explicit symptoms and those from conversational data implicit symptoms.
We choose four types of diseases for annotation: upper respiratory infection, children's functional dyspepsia, infantile diarrhea and children's bronchitis. We invite three annotators (one with a medical background) to label all the symptom phrases in both self-reports and conversational data. The annotation is performed in two steps, namely symptom extraction and symptom normalization.
Symptom Extraction We follow the BIO (begin-in-out) schema for symptom identification (Figure 1). Each Chinese character is assigned a label of "B", "I" or "O". In addition, each extracted symptom expression is tagged with True or False, indicating whether the patient suffers from this symptom or not. To improve the annotation agreement between annotators, we create two guidelines, one for the self-reports and one for the conversational data. Each record is annotated by at least two annotators, and any inconsistency is further judged by the third. The Cohen's kappa coefficients between two annotators are 71% and 67% for self-reports and conversations respectively.

Table 1: An example of a user record. Each record consists of two parts: the self-report from the patient and the conversation between the doctor and the patient. Underlined phrases are symptom expressions.
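The BIO extraction step above can be sketched in a few lines. The utterance, tags and helper below are illustrative assumptions (a real system would use a trained sequence labeler; an English string stands in for Chinese characters).

```python
# Minimal sketch of recovering symptom spans from BIO tags: "B" opens a
# span, "I" extends it, "O" closes it. Returns (span_text, start_index).
def extract_symptoms(chars, tags):
    spans, current, start = [], [], None
    for i, (ch, tag) in enumerate(zip(chars, tags)):
        if tag == "B":
            if current:                      # close any open span first
                spans.append(("".join(current), start))
            current, start = [ch], i
        elif tag == "I" and current:
            current.append(ch)
        else:                                # "O" closes an open span
            if current:
                spans.append(("".join(current), start))
            current, start = [], None
    if current:
        spans.append(("".join(current), start))
    return spans

# Hypothetical tagged sequence standing in for a Chinese utterance.
chars = list("cough now")
tags = ["B", "I", "I", "I", "I", "O", "O", "O", "O"]
print(extract_symptoms(chars, tags))  # [('cough', 0)]
```

Each recovered span would then carry the True/False flag assigned by the annotators.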
Symptom Normalization After symptom expression identification, medical experts manually link each symptom expression to the most relevant concept in SNOMED CT for normalization. Table 2 shows some phrases that describe symptoms in the example, together with the related concepts in SNOMED CT. An overview of the dataset is presented in Table 3.
After symptom extraction and normalization, there are 144 unique symptoms. To reduce the size of the action space of the DS, only the 67 symptoms with a frequency of 10 or higher are kept. We then generate one sample, called a user goal (see Figure 2), from each real-world patient record.

Proposed Framework

Our dialogue system consists of three components: natural language understanding (NLU), dialogue management (DM) and natural language generation (NLG). The NLU detects the user's intention and slots; the DM tracks the dialogue states and takes system actions; the NLG generates natural language given the system actions. In this work, we focus on the DM for automatic diagnosis, which consists of two sub-modules, namely the dialogue state tracker (DST) and policy learning. Both NLU and NLG are implemented with template-based models. Typically, a user simulator is designed to interact with the dialogue system (Liu et al., 2017; Peng et al., 2017b; Su et al., 2016; Schatzmann et al., 2006), and we follow the same setting to design our medical DS. At the beginning of a dialogue session, the user simulator samples a user goal (see Figure 2), while the agent attempts to make a diagnosis for the user. The system learns to select the best response action at each time step by maximizing a long-term reward.
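For concreteness, one user goal might be represented as the dictionary below. The field names, symptom strings and disease label are illustrative assumptions, not the dataset's actual schema.

```python
# A sketch of one user goal after symptom normalization: the disease tag,
# explicit symptoms (from the self-report), implicit symptoms (from the
# conversation) and the slot the user requests.
user_goal = {
    "disease_tag": "infantile diarrhea",
    "explicit_inform_slots": {"diarrhea": True, "vomiting": True},
    "implicit_inform_slots": {"fever": False, "dehydration": True},
    "request_slots": {"disease": "UNK"},
}
```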

User Simulator
At the beginning of each dialogue session, a user simulator samples a user goal from the experiment dataset. At each turn t, the user takes an action a u,t according to the current user state s u,t and the previous agent action a t−1 , and transits into the next user state s u,t+1 . In practice, the user state s u is factored into an agenda A (Schatzmann et al., 2007) and a goal G, noted as s u = (A, G). During the course of the dialogue, the goal G ensures that the user behaves in a consistent, goal-oriented manner, while the agenda contains a list of symptoms and their status (whether or not they have been requested) to track the progress of the conversation.

Figure 2: An example of a user goal. Each user goal consists of four parts: the disease tag is the disease that the user suffers from; explicit symptoms are symptoms extracted from the user's self-report; implicit symptoms are symptoms extracted from the conversational data between the patient and the doctor; the request slot is the disease slot that the user requests.

Every dialogue session is initiated by the user via the user action a u,1 , which consists of the requested disease slot and all explicit symptoms. When the agent requests a symptom during the course of the dialogue, the user takes one of three actions: True (if the symptom is positive), False (if the symptom is negative), or not sure (if the symptom is not mentioned in the user goal). If the agent informs the correct disease, the dialogue session is terminated as successful by the user. Otherwise, the dialogue session is recognized as failed if the agent makes an incorrect diagnosis or the dialogue reaches the maximum turn T.
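The user's three-way answer rule can be sketched as below; the goal layout is an assumption mirroring the user-goal structure described above.

```python
# How the simulated user might answer an agent's symptom request:
# True if the goal marks it positive, False if negative, not_sure
# if the symptom does not appear in the goal at all.
def user_answer(goal, requested_symptom):
    symptoms = {}
    symptoms.update(goal["explicit_inform_slots"])
    symptoms.update(goal["implicit_inform_slots"])
    if requested_symptom not in symptoms:
        return "not_sure"
    return "True" if symptoms[requested_symptom] else "False"

goal = {
    "explicit_inform_slots": {"diarrhea": True},
    "implicit_inform_slots": {"fever": False},
}
print(user_answer(goal, "fever"))     # prints "False"
print(user_answer(goal, "headache"))  # prints "not_sure"
```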

Dialogue Policy Learning
Markov Decision Process Formulation for Automatic Diagnosis We cast DS as a Markov Decision Process (MDP) (Young et al., 2013) and train the dialogue policy via reinforcement learning (Cuayahuitl et al., 2015). An MDP is composed of states, actions, rewards, a policy, and transitions. State S. A dialogue state s includes the symptoms requested by the agent and informed by the user up to the current time t, the previous action of the user, the previous action of the agent and the turn information. The representation vector of symptoms has dimension equal to the number of all symptoms; its elements are 1 for positive symptoms, -1 for negative symptoms, -2 for not-sure symptoms and 0 for not-mentioned symptoms. Each state s ∈ S is the concatenation of these four vectors.
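The symptom portion of the state vector can be encoded as follows. The symptom list is a small illustrative subset of the 67 kept symptoms, not the actual inventory.

```python
# Encode symptom statuses into the 1 / -1 / -2 / 0 scheme described above.
SYMPTOMS = ["cough", "fever", "diarrhea", "vomiting"]
CODE = {"positive": 1, "negative": -1, "not_sure": -2}

def encode_symptoms(status):
    """status: dict mapping symptom -> 'positive' | 'negative' | 'not_sure'.
    Symptoms absent from the dict are not-mentioned and encode to 0."""
    return [CODE.get(status.get(s), 0) for s in SYMPTOMS]

print(encode_symptoms({"cough": "positive", "fever": "not_sure"}))
# [1, -2, 0, 0]
```

The full state would concatenate this vector with encodings of the previous user action, previous agent action and turn counter.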
Actions A. An action a ∈ A is composed of a dialogue act (e.g., inform, request, deny and confirm) and a slot (i.e., normalized symptoms or a special slot disease). In addition, thanks and close dialogue are also two actions.
Transition T . The transition from s t to s t+1 is the update of state s t based on the agent action a t , the previous user action a u,t−1 and the time step t.
Reward R. The reward r t+1 = R(s t , a t ) is the immediate reward at time step t after taking the action a t , also known as the reinforcement.
Policy π. The policy describes the behaviors of an agent, which takes the state s t as input and outputs the probability distribution over all possible actions π(a t |s t ).
Learning with DQN In this paper, the policy is parameterized with a deep Q-network (DQN) (Mnih et al., 2015), which takes the state s t as input and outputs Q(s t , a; θ) for all actions a. A Q-network can be trained by updating the parameters θ i at iteration i to reduce the mean squared error between the Q-value computed from the current network, Q(s, a; θ i ), and the Q-value obtained from the Bellman equation, y i = r + γ max a′ Q(s′, a′; θ i − ), where θ i − are the parameters of a target network from some previous iteration. In practice, the behavior distribution is often selected by an ε-greedy policy that takes the action a = arg max a′ Q(s t , a′; θ) with probability 1 − ε and selects a random action with probability ε, which improves the efficiency of exploration. When training the policy, we use a technique known as experience replay: we store the agent's experience at each time step, e t = (s t , a t , r t , s t+1 ), in a fixed-size, queue-like buffer D.
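A minimal NumPy sketch of these pieces is shown below. The single linear layer mirrors the single-layer network used in the experiments, but the dimensions, learning rate and sampled data are illustrative assumptions.

```python
import numpy as np

# Sketch of ε-greedy action selection and one TD update toward the
# Bellman target, with a linear Q-network Q(s, a) = (s @ W)[a].
rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS, GAMMA = 8, 5, 0.9

W = rng.normal(scale=0.1, size=(STATE_DIM, N_ACTIONS))  # current network θ
W_target = W.copy()                                     # target network θ⁻ (frozen)

def q_values(state, weights):
    return state @ weights  # vector of Q(s, ·)

def epsilon_greedy(state, epsilon=0.1):
    if rng.random() < epsilon:
        return int(rng.integers(N_ACTIONS))       # explore
    return int(np.argmax(q_values(state, W)))     # exploit

def td_update(s, a, r, s_next, lr=0.01):
    """One gradient step on the squared TD error for a single transition."""
    y = r + GAMMA * np.max(q_values(s_next, W_target))  # Bellman target
    error = q_values(s, W)[a] - y
    W[:, a] -= lr * error * s  # gradient of 0.5 * error**2 w.r.t. W[:, a]

s = rng.normal(size=STATE_DIM)
a = epsilon_greedy(s)
td_update(s, a, r=-1.0, s_next=rng.normal(size=STATE_DIM))
```

In a full implementation the update would be applied to mini-batches drawn from the replay buffer rather than single transitions.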
In a simulation epoch, the current DQN is updated multiple times (depending on the batch size and the current size of the replay buffer) with different batches drawn randomly from the buffer, while the target DQN is kept fixed during the updates. At the end of each epoch, the target network is replaced by the current network and the current network is evaluated on the training set. The buffer is flushed if the current network performs better than all previous versions.
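The fixed-size, queue-like buffer D can be sketched with a `deque`; the sizes here are illustrative (the experiments use a buffer of 10000 and batches of 30).

```python
from collections import deque
import random

# Replay buffer: appending past the maximum size evicts the oldest
# experience from the front of the queue.
buffer = deque(maxlen=5)

def store(s, a, r, s_next):
    buffer.append((s, a, r, s_next))

def sample_batch(batch_size):
    return random.sample(list(buffer), min(batch_size, len(buffer)))

for t in range(8):  # 8 stores into a size-5 buffer evict the 3 oldest
    store(f"s{t}", t % 3, -1.0, f"s{t+1}")
print(len(buffer))  # 5
```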

Experimental Setup
The maximum dialogue turn T is 22. A positive reward of +44 is given to the agent at the end of a successful dialogue, and a reward of −22 is given for a failed one. We apply a step penalty of −1 for each turn to encourage shorter dialogues. The dataset is divided into two parts: 80% for training with 568 user goals and 20% for testing with 142 user goals. The ε of the ε-greedy strategy is set to 0.1 for effective exploration of the action space, and the γ in the Bellman equation is 0.9. The size of the buffer D is 10000 and the batch size is 30. The DQN is a single-layer network, and the learning rate is 0.001. Each simulation epoch consists of 100 dialogue sessions, and the current network is evaluated on 500 dialogue sessions at the end of each epoch. Before training, the buffer is pre-filled with the experiences of the rule-based agent (see below) to warm-start our dialogue system.
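The reward scheme above can be written as a small function; how the step penalty combines with the terminal reward is one plausible accounting, sketched here as an assumption.

```python
# Reward scheme: +44 for a successful diagnosis, -22 for a failed one,
# and a -1 step penalty for every intermediate (ongoing) turn.
def reward(status):
    """status: 'success' | 'failure' | 'ongoing'."""
    if status == "success":
        return 44
    if status == "failure":
        return -22
    return -1  # step penalty

# Return of a session that succeeds on its 5th turn: four -1 penalties,
# then the terminal +44.
episode = ["ongoing"] * 4 + ["success"]
print(sum(reward(s) for s in episode))  # 40
```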
To evaluate the performance of the proposed framework, we compare our model with baselines in terms of three evaluation metrics following prior work (Peng et al., 2017a,b), namely success rate, average reward and the average number of turns per dialogue session. For the classification models, we use accuracy as the metric.
The baselines include: (1) SVM: This model treats automatic diagnosis as a multi-class classification problem. It takes the one-hot representation of the symptoms in the user goal as input and predicts the disease. There are two configurations: one takes both explicit and implicit symptoms as input (denoted SVM-ex&im), and the other takes only explicit symptoms (denoted SVM-ex).
(2) Random Agent: At each turn, the random agent takes an action randomly from the action space as the response to the user's action.
(3) Rule-based Agent: The rule-based agent takes actions based on handcrafted rules. Conditioned on the current dialogue state s t , the agent informs a disease if all the known symptoms related to it have been detected. If no disease can be identified, the agent randomly selects one of the remaining symptoms to request. The relations between diseases and symptoms are extracted from the annotated corpus in advance. In this work, only the first T/2.5 symptoms with the highest frequency are kept for each disease, so that the rule-based agent can inform a disease within the maximum dialogue turn T.

Experimental Results

Table 4 shows the accuracy of the two SVM-based models. The results show that implicit symptoms can greatly improve the accuracy of disease identification for all four diseases, which demonstrates the contribution of implicit symptoms when making diagnoses for patients. Figure 3 shows the learning curves of the three dialogue agents, and Table 5 shows their performance on the test set. Due to the large action space, the random agent performs badly. The rule-based agent outperforms the random agent by a large margin, indicating that the rule-based agent is well designed. The RL-based DQN agent in turn significantly outperforms the rule-based agent. Moreover, the DQN agent outperforms SVM-ex by collecting additional implicit symptoms via conversing with patients. However, there is still a gap between the DQN agent and SVM-ex&im in terms of accuracy, which indicates that there is still room for improving the dialogue system.
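As an illustration, the SVM baseline with both explicit and implicit symptoms (SVM-ex&im) might look like the sketch below, using scikit-learn's `LinearSVC`. The symptoms, samples and disease labels are toy assumptions, not the actual dataset.

```python
import numpy as np
from sklearn.svm import LinearSVC

# One-hot symptom vectors in, disease label out.
SYMPTOMS = ["cough", "fever", "diarrhea", "vomiting", "wheezing", "bloating"]

def one_hot(symptoms):
    return [1 if s in symptoms else 0 for s in SYMPTOMS]

# Toy training data: one illustrative symptom pattern per disease.
X = np.array([
    one_hot({"cough", "fever"}),        # upper respiratory infection
    one_hot({"diarrhea", "vomiting"}),  # infantile diarrhea
    one_hot({"cough", "wheezing"}),     # bronchitis
    one_hot({"bloating", "vomiting"}),  # functional dyspepsia
])
y = ["URI", "diarrhea", "bronchitis", "dyspepsia"]

clf = LinearSVC(C=10).fit(X, y)
print(clf.predict([one_hot({"diarrhea", "vomiting"})])[0])
```

SVM-ex would simply restrict `one_hot` to the explicit symptoms.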

Related Works
In 2003, an ontology-based dialogue system supporting electronic referrals for breast cancer was proposed (Milward and Beveridge, 2003); it can handle informative user responses based on medical domain ontologies. In addition, there are two works in which deep reinforcement learning is applied to automatic diagnosis (Tang et al., 2016; Kao et al., 2018). However, their models require extra human effort to categorize the diseases into different groups, and the data used is simulated and cannot reflect the situation of real patients.

Conclusions and Future Works
In this paper, we propose a reinforcement learning based dialogue system framework for automatic diagnosis, and build a dataset for training DS derived from dialogues between real patients and doctors. Experimental results on this self-constructed dataset show that our dialogue system is able to collect additional symptoms via conversation with patients and improve the accuracy of automatic diagnosis. The relationship between diseases and symptoms is external knowledge thought to be useful for automatic diagnosis; one of our future directions is to explore models that can incorporate such external knowledge for better policy learning.