MedDialog: A Large-scale Medical Dialogue Dataset

Medical dialogue systems are promising in assisting in telemedicine to increase access to healthcare services, improve the quality of patient care, and reduce medical costs. To facilitate the research and development of medical dialogue systems, we build large-scale medical dialogue datasets - MedDialog - which contain 1) a Chinese dataset with 3.4 million conversations between patients and doctors, 11.3 million utterances, 660.2 million tokens, covering 172 specialties of diseases, and 2) an English dataset with 0.26 million conversations, 0.51 million utterances, 44.53 million tokens, covering 96 specialties of diseases. To the best of our knowledge, MedDialog is the largest medical dialogue dataset to date. We pretrain several dialogue generation models on the Chinese MedDialog dataset, including Transformer, GPT, and BERT-GPT, and compare their performance. Models trained on MedDialog are able to generate clinically correct and human-like medical dialogues. We also study the transferability of models trained on MedDialog to low-resource medical dialogue generation tasks. Human evaluation and automatic evaluation show that transfer learning, which fine-tunes the models pretrained on MedDialog, can greatly improve performance on medical dialogue generation tasks with small datasets. The datasets and code are available at https://github.com/UCSD-AI4H/Medical-Dialogue-System


Introduction
Telemedicine refers to the practice of delivering patient care remotely, where doctors provide medical consultations to patients using HIPAA-compliant video-conferencing tools. As an important complement to traditional face-to-face medicine practiced physically in hospitals and clinics, telemedicine has a number of advantages. First, it increases access to care. For people living in medically under-served communities (e.g., rural areas) that face a shortage of clinicians, telemedicine enables them to receive faster and cheaper care compared with traveling over a long distance to visit a clinician. Second, it reduces healthcare costs. A study by Jefferson Health shows that diverting patients from emergency departments with telemedicine can save more than $1,500 per visit. Third, telemedicine can improve the quality of care. The study in (Pande and Morris, 2015) shows that telemedicine patients score lower for depression, anxiety, and stress, and have 38% fewer hospital admissions. Other advantages include improving patient engagement and satisfaction, improving provider satisfaction, etc. Please refer to (Wootton et al., 2017) for a more comprehensive review.
While telemedicine is promising, it has several limitations. First, it puts an additional burden on physicians. In addition to practicing face-to-face medicine, which already keeps physicians very busy, physicians need to provide remote telemedicine consultations, which further increases the risk of physician burnout. Second, different from in-hospital patients, whose medical conditions can be easily tracked by clinicians, remote patients are difficult to track and monitor. To address such problems, there has been increasing research interest in developing artificial intelligence (AI) methods to assist in telemedicine. In particular, medical dialogue systems are being developed to serve as "virtual doctors". These "virtual doctors" are aimed at interacting with patients via natural dialogues, asking about the medical conditions and history of patients and providing clinical advice. They can also proactively reach out to patients to ask about the progression of patients' conditions and provide timely interventions.
To build medical dialogue systems, a large collection of conversations between patients and doctors is needed as training data. Due to data privacy concerns, such data is difficult to obtain. The existing medical dialogue datasets (Xu et al., 2019; Yang et al., 2020) are limited in size or biased toward certain diseases, and therefore cannot adequately serve the purpose of training medical dialogue systems that can achieve doctor-level intelligence and cover many specialties in medicine.
To address the limitations of existing datasets, we build large-scale medical dialogue datasets - MedDialog - that contain 1) a Chinese dataset with 3.4 million conversations between patients and doctors, 11.3 million utterances, 660.2 million tokens, covering 172 specialties of diseases, and 2) an English dataset with 0.26 million conversations, 0.51 million utterances, 44.53 million tokens, covering 96 specialties of diseases. Both datasets cover almost all specialties in medicine, ranging from internal medicine to family medicine, and cover a wide spectrum of diseases, including cancer, pneumonia, etc. To the best of our knowledge, they are the largest Chinese and English medical dialogue datasets to date. The data is open to the public. Each consultation starts with a description of the patient's medical conditions and history, followed by the conversation between doctor and patient. In certain consultations, doctors make diagnosis conclusions and give suggestions on treatment. The conversations have multiple turns.
On the Chinese MedDialog (MedDialog-CN) dataset, we train several dialogue generation models for the interested community to benchmark with. Generating a response given the conversation history can be formulated as a sequence-to-sequence (seq2seq) learning problem, for which we use the Transformer (Vaswani et al., 2017) architecture. Transformer consists of an encoder, which embeds the conversation history, and a decoder, which generates the response. Both the encoder and decoder use self-attention to capture long-range dependencies between tokens. In addition to training the Transformer on MedDialog-CN from scratch, we can pretrain the encoder and decoder on corpora much larger than MedDialog-CN, then finetune them on MedDialog-CN. BERT-GPT (Wu et al., 2019; Lewis et al., 2019) is a pretrained model where the encoder is pretrained using BERT (Devlin et al., 2018) and the decoder is pretrained using GPT (Radford et al.). Besides the seq2seq formulation, dialogue generation can be formulated as a language modeling problem, which generates the next token in the response conditioned on the concatenation of the already generated tokens in the response and the conversation history. GPT (Radford et al.; Zhang et al., 2019) is a pretrained language model based on the Transformer decoder. BERT-GPT and GPT are finetuned on MedDialog-CN. We evaluate these models using automatic metrics including perplexity, BLEU (Papineni et al., 2002a), Dist (Li et al., 2015), etc. The generated responses are clinically informative, accurate, and human-like.
We utilize the models trained on the large-scale MedDialog-CN dataset to improve performance in low-resource dialogue generation tasks where the dataset size is small. The study is performed on COVID-19 dialogue generation on the CovidDialog (Yang et al., 2020) dataset, which contains 1,088 dialogues and 9,494 utterances. The small size of this dataset incurs high risk of overfitting, if directly training the large-sized neural models on it. To alleviate this risk, we take the weights of dialogue generation models pretrained on MedDialog-CN and finetune the weights on CovidDialog. Human evaluation and automatic evaluation show that pretraining on MedDialog-CN can greatly improve the performance on CovidDialog and generate clinically meaningful consultations about COVID-19.
The major contributions of this paper are: • We build large-scale medical dialogue datasets - MedDialog - which contain 1) a Chinese dataset with 3.4 million conversations between patients and doctors, 11.3 million utterances, 660.2 million tokens, covering 172 specialties of diseases, and 2) an English dataset with 0.26 million conversations, 0.51 million utterances, 44.53 million tokens, covering 96 specialties of diseases. To the best of our knowledge, they are the largest of their kind to date.
• We pretrain several dialogue generation models on the Chinese MedDialog dataset, including Transformer, BERT-GPT, and GPT, and compare their performance using automatic metrics.
• Through human evaluation and automatic evaluation, we show that the pretrained models on MedDialog-CN can significantly improve performance on medical dialogue generation tasks where the dataset size is small, via transfer learning.
The rest of this paper is organized as follows. Sections 2 and 3 present the datasets and dialogue generation models (DGMs). Section 4 gives experimental results of developing DGMs on Chinese MedDialog and studies the transferability of DGMs trained on MedDialog-CN to other low-resource medical dialogue generation tasks. Section 5 reviews related works and Section 6 concludes the paper.

Related Works
There have been several works investigating medical dialogue generation. Wei et al. built a task-oriented dialogue system for automatic diagnosis. The system detects the user intent and slots with values from utterances, tracks dialogue states, and generates responses. Xu et al. (Xu et al., 2019) developed a medical dialogue system for automatic medical diagnosis that converses with patients to collect additional symptoms beyond their self-reports and automatically makes a diagnosis. This system incorporates a medical knowledge graph into the topic transition in dialogue management. Xia et al. (Xia et al.) developed a reinforcement learning (RL) based dialogue system for automatic diagnosis. They proposed a policy gradient framework based on the generative adversarial network to optimize the RL model.

Datasets
Our MedDialog consists of a Chinese dataset and an English dataset, collected from different sources.

The Chinese MedDialog dataset
The Chinese MedDialog (MedDialog-CN) dataset contains 3.4 million Chinese dialogues (consultations) between patients and doctors. The total number of utterances is 11.3 million. Each consultation starts with a narration of the patient's medical condition and history, including present disease, duration of the disease, allergies, medications, past diseases, etc. It is followed by a multi-turn conversation between patient and doctor. In the conversation, there are cases where multiple consecutive utterances come from the same person (either doctor or patient) and were posted at different time points. For such cases, we combine the consecutive utterances from the same person into a single utterance. Optionally, at the end of the consultation, the doctor makes a diagnosis and gives treatment suggestions. Table 1 shows statistics of the Chinese dataset. Figure 1 shows an exemplar consultation. The data is crawled from an online consultation website, haodf.com, which provides consultation services to patients. The dialogues cover 29 broad categories of specialties, including internal medicine, pediatrics, dentistry, etc., and 172 fine-grained specialties, including cardiology, neurology, gastroenterology, urology, etc. The consultations were conducted from 2010 to 2020.
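The merging of consecutive same-person utterances described above can be sketched in a few lines. The list-of-(speaker, text)-pairs representation here is an assumption for illustration, not the dataset's actual storage format.

```python
def merge_consecutive_utterances(dialogue):
    """Combine consecutive utterances from the same speaker into one.

    `dialogue` is a list of (speaker, text) pairs, e.g.
    [("patient", "..."), ("patient", "..."), ("doctor", "...")].
    """
    merged = []
    for speaker, text in dialogue:
        if merged and merged[-1][0] == speaker:
            # Same speaker as the previous utterance: concatenate the text.
            merged[-1] = (speaker, merged[-1][1] + " " + text)
        else:
            merged.append((speaker, text))
    return merged
```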

The English MedDialog dataset
The English MedDialog (MedDialog-EN) dataset contains 0.26 million English consultations between patients and doctors. The total number of utterances is 0.51 million. Each consultation consists of two parts: (1) a description of the patient's medical conditions; (2) the conversation between patient and doctor. The data is crawled from iclinic.com and healthcaremagic.com, two online healthcare service platforms that offer symptom self-checking, video consultation, online chat with doctors, etc. The consultations cover 51 categories of communities, including diabetes, elderly problems, pain management, etc., and 96 specialties, including andrology, cardiology, nephrology, pharmacology, etc. The consultations were conducted from 2008 to 2020. Table 2 shows statistics of the English dataset.

Advantages of our datasets
To the best of our knowledge, MedDialog-CN and MedDialog-EN are the largest Chinese and English medical dialogue datasets, respectively. They have the following advantages.
• Large number of conversations and utterances. MedDialog-CN has about 3.4 million conversations and 11.3 million utterances. MedDialog-EN has about 0.3 million conversations and 0.5 million utterances.
• Broad coverage of medical specialties. The dialogues cover almost all specialties in medicine and a wide spectrum of diseases, which greatly minimizes population biases in these two datasets. Table 3 shows a comparison of our datasets with several other medical dialogue datasets. The numbers of dialogues and diseases in our datasets are both much larger than those in other datasets.

Methods
We train several dialogue generation models on the Chinese MedDialog dataset for the interested research community to benchmark with. During training, given a dialogue containing a sequence of alternating utterances between patient and doctor, we process it into a set of (source, target) pairs, where each target t is a response from the doctor and the corresponding source s is the concatenation of all utterances (from both patient and doctor) before t. A dialogue generation model takes s as input and generates t. This problem can be formulated either as a sequence-to-sequence learning problem, where the goal is to generate t conditioned on s via an encoder-decoder model, or as a language modeling problem, which generates the i-th token t i in t conditioned on the concatenation of the conversation history s and the already generated tokens t 1 , · · · , t i−1 in the response before t i via a language model.
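The construction of (source, target) training pairs can be sketched as follows; the (speaker, text) pair representation of a dialogue is an assumption for illustration.

```python
def make_training_pairs(dialogue):
    """Turn one dialogue into (source, target) training pairs.

    Each doctor utterance becomes a target; the source is the
    concatenation of all utterances (patient and doctor) before it.
    `dialogue` is a list of (speaker, text) pairs.
    """
    pairs = []
    history = []
    for speaker, text in dialogue:
        if speaker == "doctor" and history:
            pairs.append((" ".join(history), text))
        history.append(text)
    return pairs
```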

Dialogue Generation as Sequence-to-Sequence Modeling
The problem of response generation can be formulated as a sequence-to-sequence (seq2seq) learning (Sutskever et al., 2014) problem: given the conversation history s, generate the response t. We use the Transformer (Vaswani et al., 2017) architecture for seq2seq modeling. Transformer consists of an encoder, which embeds the input sequence into a latent space, and a decoder, which takes the embedding of the input sequence as input and generates the output sequence. Different from LSTM-based seq2seq models (Sutskever et al., 2014), which learn representations of a sequence of tokens in a recurrent manner and therefore suffer computational inefficiency due to their sequential nature, Transformer uses self-attention to capture the long-range dependency among tokens by calculating the similarity between each pair of tokens in the sequence. Self-attention avoids sequential computation and greatly facilitates parallel computation. A building block in Transformer contains the following modules: a self-attention sub-layer, a token-wise feed-forward sub-layer, residual connections (He et al., 2016) between sub-layers, and layer normalization (Ba et al., 2016). Both the encoder and decoder are composed of a stack of such building blocks. The encoder generates an encoding for each token in the input sequence. These encodings are fed into the decoder to generate the output sequence. To generate the token at position i, the decoder encodes the generated tokens from 1 to i − 1 (like an encoder), calculates an attentional representation by performing attention between the encodings of input tokens and the encodings of output tokens 1, · · · , i − 1, then feeds the attentional representation into a softmax layer to generate token i. Transformer learns the weights in the encoder and decoder by maximizing the conditional likelihood of responses conditioned on conversation histories.
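The scaled dot-product self-attention at the core of these building blocks can be illustrated with a minimal single-head sketch. As a simplification for illustration, the token vectors serve as queries, keys, and values directly; a real Transformer layer first applies learned linear projections to obtain them.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X):
    """Single-head scaled dot-product self-attention over token vectors X.

    Each output vector is an attention-weighted average of all token
    vectors, with weights given by pairwise dot-product similarity
    scaled by sqrt(d_k) -- this is how Transformer relates every pair
    of tokens without any recurrence.
    """
    d_k = len(X[0])
    out = []
    for q in X:
        # Similarity of this token to every token, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in X]
        weights = softmax(scores)
        # Output = attention-weighted average of all token vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, X))
                    for i in range(d_k)])
    return out
```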

Dialogue Generation as Language Modeling
Besides the sequence-to-sequence formulation, response generation can be formulated as a language modeling problem. Given the conversation history s, a language model defines the following probability on the sequence of tokens t = t 1 , · · · , t n in the response:

p(t|s) = ∏_{i=1}^{n} p(t i |s, t 1 , · · · , t i−1 ),    (1)

where s, t 1 , · · · , t i−1 denotes the concatenation of s and t 1 , · · · , t i−1 . GPT (Radford et al.) is a pretrained language model which uses the Transformer decoder to model the conditional probability p(t i |s, t 1 , · · · , t i−1 ) in Eq. (1): it first encodes the tokens in s, t 1 , · · · , t i−1 , then predicts t i based on the encodings. GPT learns the weights of the decoder by maximizing the likelihood (defined based on Eq. (1)) on the responses in the training data.
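The chain-rule factorization in Eq. (1) can be made concrete in a few lines. `cond_prob` here is a hypothetical stand-in for a trained next-token model such as GPT, not an actual API.

```python
import math


def response_log_likelihood(source_tokens, target_tokens, cond_prob):
    """Log-likelihood of a response under the language-model
    factorization p(t|s) = prod_i p(t_i | s, t_1..t_{i-1}).

    `cond_prob(context, token)` is any model of the next-token
    distribution; a trained GPT would play this role in practice.
    """
    ll = 0.0
    context = list(source_tokens)
    for tok in target_tokens:
        ll += math.log(cond_prob(context, tok))
        context.append(tok)  # condition on already generated tokens
    return ll
```

Training maximizes exactly this quantity summed over all responses in the training data.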

Pretraining
Before training Transformer and GPT on the MedDialog-CN dataset, we can first pretrain them on general-domain text datasets, which are much larger than MedDialog-CN, to get a good initialization of the weight parameters. BERT-GPT (Wu et al., 2019; Lewis et al., 2019) is a pretraining approach for Transformer, which uses BERT (Devlin et al., 2018) to pretrain the Transformer encoder and GPT to pretrain the Transformer decoder. Given a sequence of tokens, BERT randomly masks out some of them. The masked sequence is fed into the Transformer encoder, which aims to recover the masked tokens. The weights in the encoder are learned by maximizing the accuracy of recovery.
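The masking step can be sketched as follows. This simplified version always replaces a selected token with [MASK]; actual BERT additionally keeps the token unchanged or substitutes a random one for a fraction of selected positions, which is omitted here.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Randomly corrupt a token sequence for BERT-style pretraining.

    Returns the corrupted sequence and the recovery labels: the
    original token at masked positions, None elsewhere. The encoder
    is trained to predict the labels at the masked positions.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(mask_token)
            labels.append(tok)  # the encoder must recover this token
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels
```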
In BERT-GPT, the BERT encoder generates representation of the input sequence, which is then fed into the GPT decoder to generate the response.

Experimental Settings
We split the Chinese MedDialog dataset into a training set, a validation set, and a test set with a ratio of 0.8:0.1:0.1. The split was based on dialogues, not on source-target pairs. The split statistics are summarized in Table 4. The models were built at the Chinese character level. The validation set was used for hyperparameter tuning. Training was stopped when the validation loss stopped decreasing. For Transformer, the implementation by HuggingFace was used, with hyperparameters following the default settings of the original Transformer (Vaswani et al., 2017). In BERT-GPT, the BERT encoder and GPT decoder are Transformers with 12 layers. The hidden state size is 768. The weight parameters were optimized using stochastic gradient descent with a learning rate of 1e-4. The maximum length of input sequences was truncated to 400 and that of output sequences to 100. For GPT, the DialoGPT-small (Zhang et al., 2019) architecture was used, with 10 layers. We set the embedding size to 768 and the context size to 300. In layer normalization, the epsilon hyperparameter was set to 1e-5. In multi-head self-attention, we set the number of heads to 12. The weight parameters were learned with Adam (Kingma and Ba, 2014). The initial learning rate was set to 1.5e-4 and the batch size to 32. The learning rate scheduler was Noam, with 2000 warm-up steps. Top-k random sampling (Fan et al., 2018) with k = 50 was used for decoding in all methods. We evaluated the trained models using automatic metrics including perplexity, NIST-n (Doddington, 2002), BLEU-n (Papineni et al., 2002a), METEOR (Lavie and Agarwal, 2007), Entropy-n (Zhang et al., 2018), and Dist-n (Li et al., 2015). Perplexity measures the language quality of the generated responses; the lower, the better. NIST, BLEU, and METEOR measure the similarity between the generated responses and the ground truth via n-gram matching; the higher, the better. Entropy and Dist measure the lexical diversity of generated responses; the higher, the better.
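The top-k random sampling decoding mentioned above can be sketched as a minimal single-step sampler; the dictionary-of-logits interface is an assumption for illustration, not the decoder's actual output format.

```python
import math
import random


def top_k_sample(logits, k=50, rng=None):
    """Top-k random sampling (Fan et al., 2018): keep the k tokens with
    the highest logits, renormalize with a softmax over just those,
    and sample the next token from that truncated distribution.

    `logits` maps candidate tokens to unnormalized scores.
    """
    rng = rng or random.Random()
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    m = max(v for _, v in top)
    weights = [math.exp(v - m) for _, v in top]  # stable softmax weights
    tokens = [t for t, _ in top]
    return rng.choices(tokens, weights=weights, k=1)[0]
```

Generation repeats this step, appending each sampled token to the context, until an end-of-sequence token is produced.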
BERT-GPT is pretrained on Chinese corpora collected from the Large Scale Chinese Corpus for NLP. The corpora include Chinese Wikipedia with 104 million documents, News with 2.5 million news articles from 63,000 sources, Community QA with 4.1 million documents belonging to 28 thousand topics, and Baike QA with 1.5 million question-answering pairs from 493 domains. The total size of these datasets is 15.4 GB. GPT is pretrained on the Chinese Chatbot Corpus containing 14 million dialogues and 500k-Chinese-Dialog containing 500K Chinese dialogues. Table 5 shows the performance on the MedDialog-CN test set. From this table, we make the following observations. First, BERT-GPT achieves lower perplexity than Transformer. This is because BERT-GPT is pretrained on a large collection of corpora before being finetuned on MedDialog-CN. Pretraining enables the model to better capture the linguistic structure among words, which yields lower perplexity. Second, on machine translation metrics including NIST-4, BLEU-2, BLEU-4, and METEOR, BERT-GPT performs worse than Transformer. This indicates that Transformer is able to generate responses that have more overlap with the ground truth. However, it is worth noting that the studies in (Liu et al., 2016) show that machine translation metrics are not reliable evaluation metrics for dialogue generation. Given the same conversation history, many responses are valid; a response should not be deemed bad simply because it has little overlap with the response given by a doctor. Third, on diversity metrics, BERT-GPT and Transformer are on par, which indicates that they have similar capability in generating diverse responses. Fourth, compared with BERT-GPT, GPT has worse perplexity, better machine translation scores, and comparable diversity scores. Figure 2 shows an example of generated responses on the MedDialog-CN test set. The response generated by BERT-GPT is clinically informative and accurate.
It prescribes Ebastine and gives detailed instructions for taking this medication. Ebastine is a medication for treating eczema. The patient mentioned that his/her baby has eczema, so this prescription is clinically meaningful. The language quality of the response is also good: it is syntactically and semantically correct and smooth. The response generated by GPT is also good, but less specific. It believes the baby has a skin allergy issue, but does not pinpoint the exact issue as BERT-GPT does. The response generated by Transformer is less clinically informative. It does not give medical suggestions, but it asks for further information, which is also a valid response. Figure 3 shows another example. The response generated by BERT-GPT is clinically accurate and concise, and its language quality is good. The response generated by GPT is self-conflicting. It says "if there is no abnormality at the throat, you can take a laryngoscope test; if abnormal, you should take a laryngoscope test", which is semantically inconsistent. The response generated by Transformer prescribes two repetitive laryngoscope tests, which is clinically nonsensical.

Transfer to Other Datasets
In this section, we study how to use the models pretrained on MedDialog-CN to improve the performance on low-resource dialogue generation tasks where the dataset size is small. The target task is generating medical dialogues related to COVID-19 on the small-sized CovidDialog-Chinese (Yang et al., 2020) dataset. We finetune the MedDialog-pretrained models on CovidDialog-Chinese, and use the finetuned models to generate COVID-19-related dialogues.

Data
We use a Chinese dialogue dataset about COVID-19, CovidDialog-Chinese (Yang et al., 2020), for the experiments. This dataset has 1,088 patient-doctor dialogues about COVID-19, with 9,494 utterances and 406,550 tokens (Chinese characters) in total. Duplicated and incomplete dialogues were removed. The dialogues are multi-turn; the average number of utterances in a dialogue is 8.7. The utterances are reasonably long; the average number of tokens in an utterance is 42.8. Table 6 shows the statistics of this dataset.

(Figures 2 and 3: example conversation histories with ground-truth responses and the responses generated by Transformer, GPT, and BERT-GPT on the MedDialog-CN test set.)
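The per-dialogue and per-utterance averages quoted above follow directly from the dataset counts:

```python
# CovidDialog-Chinese counts reported in the text.
n_dialogues = 1088
n_utterances = 9494
n_tokens = 406550  # Chinese characters

avg_utterances_per_dialogue = n_utterances / n_dialogues  # about 8.7
avg_tokens_per_utterance = n_tokens / n_utterances        # about 42.8
```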

Experimental settings
We split the CovidDialog-Chinese dataset into a training set, a validation set, and a test set with a ratio of 0.8:0.1:0.1. The split is based on dialogues. The split statistics are summarized in Table 7. Most hyperparameter settings follow those in Section 4.1, except that the batch size was set to 8. We evaluate the trained models using automatic metrics including perplexity, NIST-4 (Doddington, 2002), BLEU-2, 4 (Papineni et al., 2002a), METEOR (Lavie and Agarwal, 2007), Entropy-4 (Zhang et al., 2018), and Dist-1, 2 (Li et al., 2015). We also perform human evaluation. We randomly select 100 dialogue examples and ask 5 undergraduate and graduate students to rate the generated responses in terms of informativeness, relevance, and human-likeness. Informativeness is about whether a response contains sufficient medical information, such as explanations of diseases and suggestions for treatment. Relevance is about whether the content of a response matches that of the conversation history. Human-likeness is about whether a response sounds like a human. The ratings are from 1 to 5; the higher, the better. The ratings from different annotators are averaged as the final results.

First, on Transformer, pretraining on MedDialog-CN improves results on all metrics. This demonstrates that pretraining on MedDialog-CN can improve performance on low-resource medical dialogue generation tasks. Second, on GPT, pretraining on MedDialog-CN improves 5 of the 8 metrics. On BERT-GPT, pretraining on MedDialog-CN improves half of the metrics. The reason that the improvement on GPT and BERT-GPT is not as significant as that on Transformer is probably that these two models are already pretrained on other corpora; therefore, the value of pretraining on MedDialog-CN is diminished. However, it is still useful to pretrain on MedDialog-CN to adapt these two models to the medical dialogue domain.
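As one concrete example of the automatic metrics above, Dist-n can be computed as follows (a sketch of the usual definition from Li et al., 2015; tokenization details vary across papers):

```python
def dist_n(responses, n):
    """Dist-n: number of distinct n-grams divided by the total number
    of n-grams over all generated responses. Higher values mean more
    lexically diverse output. Each response is a list of tokens.
    """
    total = 0
    distinct = set()
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            distinct.add(tuple(tokens[i:i + n]))
            total += 1
    return len(distinct) / total if total else 0.0
```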
We perform significance tests between different methods based on the two-sided Student's t-test. The results are shown in Table 10. As can be seen, in most cases the p-value is less than 0.015, demonstrating high statistical significance. For Transformer, GPT, and BERT-GPT, using pretraining (PT) on MedDialog-CN achieves significantly better performance than not using pretraining (No-PT). Figure 4 shows an example of generating a doctor's response given the utterance of a patient. As can be seen, models pretrained on MedDialog-CN perform better than their unpretrained counterparts. For example, the response generated by GPT without pretraining on MedDialog-CN is not understandable by humans. With pretraining on MedDialog-CN, it generates a much better response that gives medical advice. Figure 5 shows another example. Similarly, without MedDialog pretraining, the response generated by GPT is not readable. With pretraining, the generated response is smooth and clinically informative.
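A two-sample t statistic for such a two-sided test can be sketched as follows. The paper does not state which t-test variant it uses, so Welch's unequal-variance form is assumed here; converting t into a p-value additionally requires the t-distribution CDF (e.g. from scipy.stats), which is omitted.

```python
import math


def welch_t(xs, ys):
    """Welch's t statistic for comparing the means of two samples
    (an illustrative sketch, not necessarily the paper's exact test).
    """
    def mean(v):
        return sum(v) / len(v)

    def var(v):  # unbiased sample variance
        m = mean(v)
        return sum((x - m) ** 2 for x in v) / (len(v) - 1)

    # Standard error of the difference of means, allowing unequal variances.
    se = math.sqrt(var(xs) / len(xs) + var(ys) / len(ys))
    return (mean(xs) - mean(ys)) / se
```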

Conclusions and Future Works
To facilitate the research and development of medical dialogue systems that can potentially assist in telemedicine, we build large-scale medical dialogue datasets - MedDialog - which contain 1) a Chinese dataset with 3.4 million conversations between patients and doctors, 11.3 million utterances, 660.2 million tokens, covering 172 specialties of diseases, and 2) an English dataset with 0.26 million conversations, 0.51 million utterances, 44.53 million tokens, covering 96 specialties of diseases. To the best of our knowledge, they are the largest of their kind. We pretrain Transformer, GPT, and BERT-GPT on MedDialog-CN. The results show that the dialogues generated by these pretrained models are clinically meaningful and human-like. We use transfer learning to apply these pretrained models to low-resource dialogue generation. On a COVID-19 dialogue generation task where the dataset is small, human evaluation and automatic evaluation show that models pretrained on MedDialog-CN can effectively improve the quality of generated responses.
For future work, we will annotate medical entities in our datasets. Such annotations can facilitate the development of goal-oriented medical dialog systems.