Summarizing Medical Conversations via Identifying Important Utterances

Summarization is an important natural language processing (NLP) task that identifies key information in text. For conversations, a summarization system needs to extract salient content from spontaneous utterances by multiple speakers. In the task-oriented scenario of medical conversations between patients and doctors, symptoms, diagnoses, and treatments are of particular importance because the purpose of such a conversation is to find a medical solution to the problem raised by the patient. Current online medical platforms provide millions of publicly available conversations between real patients and doctors, in which patients describe their medical problems and registered doctors offer diagnoses and treatments; yet such a conversation is often long, and its key information is hard to locate. Summaries of the patients' problems and the doctors' treatments in these conversations can therefore be highly useful, giving other patients with similar problems a precise reference for potential medical solutions. In this paper, we focus on medical conversation summarization, using a dataset of medical conversations and corresponding summaries crawled from a well-known online healthcare service provider in China. We propose a hierarchical encoder-tagger model (HET) that generates summaries by identifying important utterances (with respect to problem statement and problem solving) in the conversations. For the dataset used in this study, we show that high-quality summaries can be generated by extracting two types of utterances, namely problem statements and treatment recommendations. Experimental results demonstrate that HET outperforms strong baselines and models from previous studies, and that adding conversation-related features further improves system performance.


Introduction
Applying natural language processing (NLP) techniques to the medical field is a prevailing trend nowadays and has great potential in many applications, such as key information extraction from medical literature (Kim et al., 2011; Dernoncourt et al., 2017; Ševa et al., 2018), risk factor identification in electronic health records (Chang et al., 2015; Cormack et al., 2015; Cheng et al., 2016), and medical question answering (Pampari et al., 2018; Tian et al., 2019). As the demand for healthcare services has increased greatly in the past decades, it is urgent to improve the quality and efficiency of healthcare, reduce the workload and mental stress of health providers, and increase patient satisfaction. Recently, Internet-based healthcare platforms such as online doctor systems and doctor-patient cyber communities have been increasingly used by patients and health professionals, with the hope that they would alleviate the ever-increasing demand for healthcare services and reduce the inaccessibility of services caused by geographical and socio-economic barriers. On such platforms, a patient can start a conversation with a registered doctor by typing their medical problem, and the doctor may then ask the patient to specify the problem (e.g., symptoms, treatments already taken, etc.). Since the conversation is asynchronous, one speaker (either the patient or the doctor) may type multiple lines (utterances) before the other speaker responds.

Figure 1: An example of a conversation and its different types of summaries. P and D stand for the speaker roles, i.e., patient and doctor, and PD, DT, and OT in the last column refer to the utterance tags for problem description, diagnosis or treatment, and others, respectively. SUM1 is a summary of the medical problem from the patient; SUM2 is a summary of the diagnosis and treatment from the doctor. The English translation is not part of the corpus and is added as a reference.
Through this process, all key information regarding a medical problem, as well as its diagnosis and medical recommendations, is recorded in the conversation. Once the platforms make such conversations publicly available, other patients with similar medical problems can search relevant conversations and find potentially helpful solutions. However, when a conversation is too long or its key information is scattered, one can hardly find the essential contents and may misread them. As a result, summarization of the conversation, especially of the problem statement and treatment recommendations, is an important task that helps new patients locate useful information to address their medical concerns. Given the nature of a medical conversation, i.e., a task that seeks medical recommendations for a particular health problem, the task can be performed by identifying important utterances in the conversation. In this study, important utterances are those that contain key information about the medical problem or the treatment. Our focus therefore differs from existing studies on utterances in conversations, which pay more attention to assessing utterances with respect to their functionalities in the conversation, such as analyzing automatically generated utterances regarding their suitability within particular conversational contexts (Inaba and Takahashi, 2016; Lison and Bibauw, 2017), evaluating human conversational performance on readability, sensibility, and social involvement (Dascalu et al., 2010), and identifying segments of utterances produced with more emphasis for certain interactional purposes (Takeuchi et al., 2007). Little research has been done to identify important utterances that contribute to a specific outcome of a conversation, which in this study refers to the content about the patient's problem and the treatment recommendations in the conversation.
To conduct the medical conversation summarization task, we propose in this paper a new benchmark dataset in Chinese, which has over 40K cases covering nearly 2K disease types. Each case consists of a medical conversation between a patient and a doctor, and two summaries: one for the problem statement and the other for treatment recommendations. Figure 1 shows an example conversation with the two types of summaries: "SUM1" for the problem statement and "SUM2" for treatment recommendations. SUM2 has two types, i.e., Type A and Type B, which are explained in the next section.
In addition, we propose a hierarchical encoder-tagger (HET) model for extractive summarization that tags each utterance in a medical conversation with regard to whether it is a problem statement or a treatment recommendation. We further enhance the model with end-to-end memory networks (Sukhbaatar et al., 2015) to incorporate information from relevant utterances in the conversation. We use BERT (Devlin et al., 2019) as the token-level encoder and try several utterance-level encoders and taggers. Experimental results show that HET outperforms strong baselines as well as models from previous studies on this dataset. Analyses are also conducted to better understand the results.

A Corpus of Medical Conversations
Medical conversation is a type of task-oriented conversation. Different from ordinary conversations, in which topics are often fluid, in task-oriented conversations participants interact to accomplish a projected set of goals and sub-goals (Litman and Allen, 1987; Drew and Heritage, 1992). Specifically, for conversations in the medical domain from online medical platforms, the projected goal is for the doctor to diagnose and offer treatment recommendations for the patient's problem (Drew and Heritage, 1992; Robinson, 2012; Wang et al., 2020). Particularly in China, many platforms make such medical conversations publicly available so that new patients with similar problems can search relevant conversations and find helpful information in them. Summarization of the patient's problem and the doctor's recommendations in a conversation is therefore highly valuable, because such summaries help new patients locate the key information, especially when a conversation is long. A straightforward way to conduct such summarization is to identify the important utterances that contain key information for problem statements or treatment recommendations. However, few corpora exist for training such a summarization model, especially for Chinese. We therefore develop a Chinese corpus for medical conversation summarization and describe it in detail below.
The Raw Data The original data are crawled from one of the most well-known online health provider platforms in China, under a section called "Frequently Inquired Health Problems." In these conversations, patients consult registered doctors about health problems; doctors help them determine the nature of the problems, provide treatment recommendations, and/or advise them to seek further medical attention from other health facilities. Instead of isolated question-answer segments or parts of conversations, the data contain full conversations between patients and doctors, covering the entire interaction process. In addition to the dialogues, each conversation contains meta information such as the type of disease and the corresponding hospital department, as well as the speakership of the utterances in the conversation. Many (but not all) conversations include a summary added by the doctor after the conversation is conducted. The summary has two parts: SUM1 describes the medical problem that the patient has; SUM2 summarizes the doctor's diagnosis or treatment recommendations. SUM2 is of two types: Type A (denoted SUM2-A) is the concatenation of a few utterances in the conversation, whereas Type B (denoted SUM2-B) is a more concise summary written by the doctor and may contain text that does not appear in the conversation. In all, we crawled 109,850 conversations from 23 hospital departments or sub-divisions, covering 1,839 disease types, which form our raw corpus. Among them, only about half contain both SUM1 and SUM2. This again underscores the necessity of this summarization task: if we can automatically generate the missing summaries for problem statements and treatment recommendations, new patients will have more references when they search for conversations relevant to their problems.
Data Processing To facilitate the task of conversation summarization, we process the raw corpus by keeping only the conversations that have both SUM1 and SUM2, and further clean the resulting data by removing duplicates and conversations containing only one utterance. The cleaned data contain both the input and the output for the summarization task. In particular, SUM1 and SUM2-A are concatenations of selected utterances in the conversation that provide key information for the problem statement and treatment recommendations. Therefore, the important utterances in a conversation are those likely to appear in the summary. In detail, following Nallapati et al. (2017) and Chen and Bansal (2018), we use ROUGE scores to measure the overlap between an utterance and a summary, and label the utterances accordingly; that is, we break the summary into segments (the summaries in this corpus typically use a full-width comma, U+FF0C, as the delimiter, which we use to split them), and then for each segment, find the closest utterance in the conversation according to the ROUGE-1 score. If the score is greater than a threshold, we label that utterance "PD" if the summary is SUM1 and "DT" if it is SUM2. All other utterances are labeled "OT". We call the resulting "PD", "DT", and "OT" annotations silver-standard labels. Table 1 shows the statistics of the processed dataset: Table 1a reports the overall statistics of all data and the train/test splits (we use 80% for training and 20% for testing), and Table 1b illustrates the number of conversations where SUM2 is SUM2-A only or has both SUM2-A and SUM2-B. A few points are worth mentioning. First, on average, each dialogue includes 19.0 utterances (about half of them by the doctor), but only 4.5 of them are tagged with the label "DT", which indicates that more than half of the doctors' utterances are not included in the summary. Such utterances can be greetings, symptom inquiries, etc.
Second, the conversations between the patient and the doctor are asynchronous: either party can type some messages, walk away, and later come back to continue the discussion (only typed messages are included in our dataset). This property makes the corpus different from other benchmark corpora (such as AMI (McCowan et al., 2005)) that consist of dialogues during in-person meetings. Third, for SUM2, all conversations have SUM2-A, and only a small portion (around 7.5% in the training and test sets) have both SUM2-A and SUM2-B. Therefore, for the conversations with both SUM2-A and SUM2-B, we use their concatenation to compute the average length reported in Table 1a. Fourth, while this paper focuses on summarization, the corpus can be used for other NLP tasks such as question answering and dialogue analysis.
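The silver-label construction described above can be sketched in a few lines of Python. This is our illustration, not the authors' code; the character-level ROUGE-1 matching, the helper names, and the 0.5 threshold are assumptions for the example.

```python
from collections import Counter

def rouge1_f(candidate, reference):
    """ROUGE-1 F-score over character tokens (Chinese text is commonly
    tokenized per character)."""
    cand, ref = list(candidate), list(reference)
    if not cand or not ref:
        return 0.0
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(cand), overlap / len(ref)
    return 2 * p * r / (p + r)

def silver_labels(utterances, summary, tag, threshold=0.5):
    """Label an utterance with `tag` ('PD' for SUM1, 'DT' for SUM2) if it
    is the closest match to some summary segment and exceeds the
    threshold; all other utterances keep the 'OT' label. Segments are
    split on the full-width comma used in the corpus."""
    labels = ["OT"] * len(utterances)
    for segment in summary.split("\uff0c"):
        scores = [rouge1_f(u, segment) for u in utterances]
        best = max(range(len(utterances)), key=lambda i: scores[i])
        if scores[best] > threshold:
            labels[best] = tag
    return labels
```

Running the procedure once per summary (with "PD" for SUM1 and "DT" for SUM2) yields the silver-standard tag sequence for a conversation.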

Summarization via Tagging
To model conversation, a common approach is to use a two-level hierarchical sequential model (Serban et al., 2016), in which a conversation may be modeled as a sequence of utterances, and each utterance is modeled as a sequence of words or characters. Using such hierarchical models, conventional studies mainly focused on conversation generation (Sordoni et al., 2015;Serban et al., 2016;Serban et al., 2017), where a decoder is employed to generate responses conditioning upon the vectors encoded from the hierarchical modeling of previous utterances.
For our dataset, there is a large overlap between the utterances and the summaries; for instance, as shown in Table 1b, SUM2 in the majority of the conversations (92.5% in the training set and 92.3% in the test set) is of type SUM2-A only, and the rest contain both SUM2-A and SUM2-B, where SUM2-A is generated by concatenating several utterances from the conversation. To take advantage of this property, we treat summarization as a tagging task; that is, we generate the summaries by first labeling the utterances with the PD, DT, and OT tags and then concatenating the labeled utterances to form the summaries.
We define the input utterance sequence as U = u_1, u_2, ..., u_i, ..., u_n, with each u_i represented as a sequence of basic tokens (e.g., words or characters) u_i = w_{i,1}, w_{i,2}, ..., w_{i,l_i}. To model the input, our model follows the typical hierarchical structure in which tokens and utterances are encoded by separate, hierarchically stacked encoders. A tagger is then attached at the utterance level to predict the PD/DT/OT labels. Afterwards, we concatenate the utterances labeled PD and DT to generate the summaries of the medical problem and the doctor's diagnosis, respectively. To further enhance the model, we adopt memory networks (Sukhbaatar et al., 2015) to incorporate information from relevant utterances in the conversation. Our model is thus a hierarchical encoder-tagger (HET) with a memory module applied between the token-level and utterance-level encoders, as illustrated in Figure 2. It is worth noting that our method generates the two types of summaries simultaneously, since they come directly from the predicted PD/DT/OT labels. In the following, we first introduce the memory module and then elaborate the whole hierarchical tagging process with the memories.
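The final summary-generation step described above reduces to a one-liner: given per-utterance tags, SUM1 and SUM2 are formed by concatenating the PD- and DT-tagged utterances in their original order. This sketch is ours; the direct string concatenation (natural for Chinese text, which needs no separating spaces) is an assumption.

```python
def summaries_from_tags(utterances, tags):
    """Build SUM1 from PD-tagged utterances and SUM2 from DT-tagged ones,
    preserving the original conversation order."""
    sum1 = "".join(u for u, t in zip(utterances, tags) if t == "PD")
    sum2 = "".join(u for u, t in zip(utterances, tags) if t == "DT")
    return sum1, sum2
```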

Utterance Memories
As discussed above, we regard our summarization task as an utterance tagging process. Similar to other tagging tasks, in which contextual information is highly helpful in determining the output tags (Song and Xia, 2012; Marcheggiani and Titov, 2017; Higashiyama et al., 2019; Tian et al., 2020a; Tian et al., 2020b), for each utterance u_i in the conversation, the relevant utterances in the conversation also provide useful information for determining whether a particular utterance is important. To exploit this information, we adopt end-to-end memory networks (Sukhbaatar et al., 2015), which (as well as their variants) have been demonstrated to be useful in many tasks (Miller et al., 2016; Tian et al., 2020c). In doing so, we first map all utterances [u_1, ..., u_j, ..., u_n] in the conversation into their memory vectors and value vectors. The memory vectors (denoted m_j for u_j) are directly copied from the utterance representations obtained from the token encoder; the value vectors (denoted v_j for u_j) are obtained by a BiLSTM encoder. Specifically, the memory vectors m_j are used to compute the similarity with the input utterance, while v_j carries u_j's encoding information for generating the final memory output. Then, for each utterance u_i with representation h_i, we use h_i to address the relevant utterances through the memory, which is formalized as

p_{i,j} = δ_{i,j} · exp(h_i · m_j) / Σ_{j'=1}^{n} δ_{i,j'} · exp(h_i · m_{j'})

Here, δ_{i,j} ∈ {0, 1} is a binary activator that equals 1 if the speaker of u_j is identical to that of u_i and 0 otherwise; m_j = h_j because the memory vectors are copied from the utterance representations obtained from the token encoder (TE); and p_{i,j} is the weight measuring the relevance between u_j and u_i. Afterwards, the value vectors v_j are weighted by p_{i,j} and summed:

a_i = Σ_{j=1}^{n} p_{i,j} · v_j

where a_i is the vector representing the information from relevant utterances via a weighted sum.
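The memory addressing step can be sketched numerically: a speaker-gated softmax over memory vectors, followed by a weighted sum of value vectors. This is a minimal numpy illustration of the mechanism under our reading of the description (the gating-by-masking formulation is our assumption), not the authors' implementation.

```python
import numpy as np

def memory_read(h_i, M, V, same_speaker):
    """h_i: (d,) query utterance representation; M: (n, d) memory vectors
    m_j; V: (n, d) value vectors v_j; same_speaker: (n,) binary activator
    delta_{i,j}. Returns a_i, the weighted sum of value vectors."""
    scores = M @ h_i                                       # h_i . m_j
    scores = np.where(same_speaker == 1, scores, -np.inf)  # gate by speaker
    scores = scores - scores.max()                         # numerical stability
    p = np.exp(scores)
    p = p / p.sum()                                        # weights p_{i,j}
    return p @ V                                           # a_i = sum_j p_{i,j} v_j
```

Masking with negative infinity before the softmax is equivalent to zeroing the gated terms in both the numerator and the denominator of the weight formula.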

The Hierarchical Encoder-Tagger with Memories
To obtain the representation of each input utterance u_i, we apply BERT (Devlin et al., 2019) as our token-level encoder (TE) and use the encoded hidden vector of "[CLS]" as h_i to represent the utterance u_i. Once a_i is obtained from the memory module, we concatenate it with h_i to get the resulting utterance representation for utterance-level encoding:

h'_i = h_i ⊕ a_i

Then, an utterance-level encoder (UE) is applied to model the utterance representations sequentially. For example, if we use an LSTM for the UE, the utterance-level encoding is formulated as

o_i = LSTM(h'_i, o_{i-1})

where o_i is the step-wise state for the utterances and h'_i is the input to the UE at each time step. Note that, in addition to LSTM, there are many other choices for the UE, e.g., BiLSTM; we use LSTM here as an example for simplicity. On top of the encoder is the tagger layer performing the identification task, where a trainable matrix W and bias vector b align o_i to the output space:

o'_i = W · o_i + b

Afterwards, a softmax or conditional random field (CRF) (Lafferty et al., 2001) layer is applied to o'_i to obtain the output tags. Finally, we concatenate all utterances with the labels PD and DT to generate the summary of the patient's problem (SUM1) and the doctor's diagnoses (SUM2), respectively.
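The utterance-level half of the pipeline can be sketched compactly. In this illustration (ours, with stand-in components: a plain tanh RNN replaces the LSTM UE, and precomputed arrays replace the BERT [CLS] vectors and memory outputs), each utterance representation is concatenated with its memory vector, encoded sequentially, and projected to a distribution over the PD/DT/OT tags.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def het_tag(H, A, params):
    """H: (n, d) utterance representations from the token encoder;
    A: (n, d) memory outputs a_i; params: (Wr, Ur, W, b) recurrence,
    input, and output weights. Returns per-utterance distributions over
    the three tags (PD, DT, OT)."""
    Wr, Ur, W, b = params
    X = np.concatenate([H, A], axis=-1)   # h'_i = h_i concat a_i
    o = np.zeros(Wr.shape[0])
    outs = []
    for x in X:                           # o_i = RNN(h'_i, o_{i-1})
        o = np.tanh(Ur @ x + Wr @ o)
        outs.append(o)
    O = np.stack(outs)
    return softmax(O @ W.T + b)           # tagger: W . o_i + b, then softmax
```

Replacing the softmax output with a CRF layer changes only the final decoding step, not the hierarchical encoding.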

Settings
We experiment with our HET model, with and without the memory, on our corpus. For model implementation, at the token-level encoder (TE), we use the Chinese versions of BERT and ZEN (Diao et al., 2019) with their default settings: for both BERT and ZEN, we use 12 layers of multi-head attention with the dimension of hidden vectors set to 768. For the utterance level, we first run experiments with no encoder; then, following previous studies such as Kalchbrenner and Blunsom (2013) and Kumar et al. (2018), we experiment with two recurrent neural network models (LSTM and BiLSTM) to encode the utterance sequence of each conversation, with the dimension of the hidden states set to 300 for the LSTM and 150 for the BiLSTM encoder.
In the memory module, the embedding matrix and the BiLSTM encoder used to obtain the value vectors v_j for u_j are applied directly to the Chinese characters in the utterance. All parameters of the embedding matrix and the BiLSTM encoder in the memory module are initialized randomly, with the dimensions of the embeddings and hidden states set to 768 and 384, respectively (which allows the dimension of v_j to match that of the hidden vectors of BERT and ZEN).
For the tagger, we run two types, i.e., softmax and CRF, in order to test whether there is a strong dependency between the importance labels of adjacent utterances. We use cross-entropy and negative log-likelihood as the loss functions for softmax and CRF, respectively. For evaluation, we use F scores for the tagging results (computed with the sklearn framework: https://scikit-learn.org/stable/modules/classes.html) and ROUGE-1, ROUGE-2, and ROUGE-L scores (computed with the code from https://github.com/google-research/google-research/tree/master/rouge) to evaluate the generated summaries, using SUM1 and SUM2 in the dataset as the gold standard. If the SUM2 of a conversation includes both SUM2-A and SUM2-B, we treat the concatenation of SUM2-A and SUM2-B as the gold standard for SUM2 in all experiments, except for the results in Table 4.

Table 2: The results of HET using BERT and ZEN as the token encoder with and without the memory module (M), with different combinations of utterance encoders (UE) (i.e., none, LSTM, and BiLSTM) and taggers (i.e., softmax and CRF). PD and DT are the two tags for important utterances; P, R, and F are the precision, recall, and F scores of the predicted labels when compared with the silver-standard PD/DT/OT labels; R-1, R-2, and R-L are the ROUGE-1, ROUGE-2, and ROUGE-L scores of the generated summaries when compared with the gold references in the corpus (i.e., SUM1 and SUM2).
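For reference, the ROUGE F-scores used in the evaluation can be computed with a compact self-contained implementation (our own sketch over pre-tokenized sequences, not the google-research package used in the experiments):

```python
from collections import Counter

def _f(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

def rouge_n(cand, ref, n):
    """ROUGE-N F-score: n-gram overlap between candidate and reference."""
    cg = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    rg = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    if not cg or not rg:
        return 0.0
    overlap = sum((cg & rg).values())
    return _f(overlap / sum(cg.values()), overlap / sum(rg.values()))

def rouge_l(cand, ref):
    """ROUGE-L F-score based on the longest common subsequence."""
    m, n = len(cand), len(ref)
    if not m or not n:
        return 0.0
    dp = [[0] * (n + 1) for _ in range(m + 1)]  # LCS dynamic program
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if cand[i] == ref[j] \
                else max(dp[i][j + 1], dp[i + 1][j])
    return _f(dp[m][n] / m, dp[m][n] / n)
```

For Chinese text, the sequences would typically be character lists, matching the character-level ROUGE used in the data-processing step.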

Basic HETs
The first experiment explores how the HET models perform under different settings on the proposed dataset, testing models with and without the memory module, configured with different token encoders (BERT and ZEN), UEs (none, LSTM, and BiLSTM), and taggers (softmax and CRF). Tables 2(a) and 2(b) show the results of utterance tagging (in terms of precision, recall, and F scores) and summarization (in terms of ROUGE-1, ROUGE-2, and ROUGE-L) for both problem statement (SUM1) and treatment recommendation (SUM2) when the BERT and ZEN token encoders are used.

Table 3: Experimental results of our runs of models from previous studies as well as our best HET (with BiLSTM UE, softmax tagger, and the memory module).

Some observations are in order. First, the overall results demonstrate that generating summaries via tagging works well on our dataset. In most cases, models that perform well on tagging (F scores) also perform well on summarization (ROUGE scores). Second, for both the BERT and ZEN encoders, the HET model works well with different combinations of UEs and taggers, which illustrates the validity of our approach. Among the different settings, the one using the BiLSTM UE outperforms the others, suggesting that the sequential organization of utterances plays an important role in identifying important utterances in conversations. Third, compared with models without the memory module, models with memories achieve greater improvements on the doctor's diagnoses (SUM2), while the effect of memories is not as pronounced for the problem description (SUM1). One possible explanation is that information from other utterances is more useful for determining whether an utterance should be tagged for SUM2 than for SUM1; the memory module can appropriately model such information, and thus including it in HET is more helpful for SUM2 than for SUM1.

Comparison with Previous Studies
On our dataset, we compare our approach with two previous extractive summarization models: SummaRuNNer (Nallapati et al., 2017) and a contextualized extractive method (CEM) proposed by Wang et al. (2019). Since these models were originally designed for document summarization and cannot generate summaries for the patient's problem and the doctor's diagnosis simultaneously, in our experiments we directly concatenate all utterances to form a document as the input (i.e., the conversation utterances are regarded as document sentences) and train the models for SUM1 and SUM2 separately. For both models, we apply the Chinese character embeddings from Tencent Embedding and select the top-ranked 7% and 24% of the utterances (sentences) as the summaries of the patient's problem and the doctor's diagnosis, respectively. Table 3 shows the best results of the two reference models as well as our model using BERT and ZEN with the best setting (i.e., BiLSTM UE, softmax tagger, and the memory module); our approach outperforms both reference systems on both SUM1 and SUM2, with the ZEN-based model obtaining the best results.
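The decoding step used for the reference models reduces to keeping a fixed fraction of the highest-scoring utterances in their original order; a small sketch (ours, with an assumed rounding rule for the cutoff):

```python
def top_fraction(utterances, scores, fraction):
    """Keep the top-`fraction` of utterances by model score, preserving
    their original order in the conversation."""
    k = max(1, round(len(utterances) * fraction))
    keep = sorted(sorted(range(len(utterances)),
                         key=lambda i: scores[i], reverse=True)[:k])
    return [utterances[i] for i in keep]
```

With `fraction=0.07` for SUM1 and `fraction=0.24` for SUM2, this reproduces the selection ratios reported above.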

SUM2-A vs. SUM2-B as Gold Standard
As shown in Table 1b, 7.7% of the conversations in the test set contain both SUM2-A and SUM2-B. So far, for those conversations, we have used the concatenation of SUM2-A and SUM2-B as the gold standard (see Tables 2-3). Table 4(a) shows the performance of the four systems (i.e., Ref-1 from Nallapati et al. (2017), Ref-2 from Wang et al. (2019), and our model using BERT and ZEN as TE under the best setting (i.e., BiLSTM UE, softmax tagger, with the memory module)) on the entire test set, but with SUM2-A as the gold standard. Not surprisingly, for all systems, the performance with SUM2-A as the gold standard is higher than with the concatenation of SUM2-A and SUM2-B (see the last three columns in Table 3).

Table 4: Results of the two reference models (Ref-1 (Nallapati et al., 2017) and Ref-2 (Wang et al., 2019)) and our best model (with BiLSTM UE, softmax tagger, and the memory module), where different parts of SUM2 (i.e., SUM2-A or SUM2-B) are regarded as the gold standard.
Table 4(b) reports the results on the 697 conversations in the test set that have both SUM2-A and SUM2-B, with either SUM2-A or SUM2-B as the gold standard. For all systems, the ROUGE scores with SUM2-B as the gold standard are much lower than those with SUM2-A, indicating that generating summaries similar to manually crafted ones remains a challenging task.

HETs with Meta-Information
In addition to the utterances, each conversation in the dataset has three major types of meta-information, namely speaker role (patient or doctor) (SR), hospital department (HD), and disease name (DN). We experiment with adding such meta-information on top of our model using BERT and ZEN as TE under the best setting. To incorporate the meta-information, we use a single-layer neural network to map it into vectorized representations and concatenate them with the corresponding encoder layers: SR is added to the TE, while HD and DN are added to the UE. Table 5 reports the performance of our HET models with different combinations of the meta-information, where the results without any meta-information are shown in the first row (identical to the last row in Table 3). Compared to the baselines, models with meta-information achieve better performance in most cases. In particular, adding SR yields larger improvements than HD and DN. One possible explanation is that the utterances from the patient and the doctor are more important for generating the problem statement (SUM1) and the treatment recommendation (SUM2), respectively; adding SR thus helps our model focus more on the patient's and the doctor's utterances when predicting the PD and DT labels for SUM1 and SUM2, respectively.
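The meta-information mechanism described above can be sketched as follows: a categorical feature (e.g., the speaker role) is mapped through a single-layer network to a vector, which is concatenated to the corresponding encoder-layer representation. The class names, dimensions, and random initialization here are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

class MetaFeature:
    """Single-layer mapping from a categorical value to a dense vector
    (a one-hot input through a linear layer is an embedding lookup)."""
    def __init__(self, vocab, dim):
        self.index = {v: i for i, v in enumerate(vocab)}
        self.E = rng.normal(size=(len(vocab), dim))
    def __call__(self, value):
        return self.E[self.index[value]]

def augment(h, meta_vec):
    """Concatenate a meta vector to an encoder-layer representation."""
    return np.concatenate([h, meta_vec])
```

For example, an SR feature would be attached to each token-encoder output, while HD and DN features would be attached at the utterance level.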

Extractive Summarization
As a research line directly related to our work, extractive summarization aims to extract important sentences from the input and use them to form a summary. Most previous studies focused on document summarization (Nallapati et al., 2017; Narayan et al., 2018; Xiao and Carenini, 2019; Luo et al., 2019), while some focused on summarization of meeting transcripts (Riedhammer et al., 2010; Singla et al., 2017); their problem settings and data preparation differ from ours. Specifically, compared with document summarization, conversation summarization is more challenging because utterances in a conversation are less formally written and the speaker roles change throughout the conversation; compared with meeting-transcript summarization, where the summary resembles a short meeting log, our task requires generating more informative summaries that provide useful information to potential patients of the online platform. General extractive approaches to summarization face the challenge of redundancy when using extracted sentences to produce an informative and readable summary within a length limit, which requires additional modeling even with powerful neural models, e.g., BiLSTMs (Nallapati et al., 2017), Transformers, and attention (Xiao and Carenini, 2019). In contrast, in our work this challenge is less of an issue because redundancy in the original input is limited, and directly concatenating the selected utterances in their original order does not lead to unreadable summaries in most cases. Still, to perform well on conversation summarization in the medical domain, task-specific designs of the summarization model are needed.

Table 5: Results of our models using BERT and ZEN TE under the best setting (with BiLSTM UE, softmax tagger, and the memory module). "SR", "HD", and "DN" stand for the meta-information of speaker roles, hospital departments, and disease names, respectively.

Utterance Modeling in Conversations
Studies on dialogue systems have drawn much attention recently, and many of them address utterance modeling in human-human conversations (Wang et al., 2018a; Liu et al., 2019). One stream of utterance modeling focuses on dialogue act classification, which aims to assign one of a set of predefined acts to each utterance in a conversation (Lee and Dernoncourt, 2016; Liu et al., 2017; Kumar et al., 2018; Wang et al., 2018b; Raheja and Tetreault, 2019). Another stream focuses on assessing utterances in terms of their quality in various aspects, such as sentiment analysis (Inaba and Takahashi, 2016; Lison and Bibauw, 2017; Misra et al., 2019). Our study on extractive summarization for conversations can be regarded as belonging to the latter stream of evaluating utterances in human-human conversations, where little research has been done on assessing utterances by their importance to the pragmatic outcomes of a conversation (i.e., in our study, the summaries of the problem statement and treatment recommendations).

Conclusion and Future Work
In this paper, we proposed the new task of medical conversation summarization, performed by identifying important utterances in conversations between patients and doctors. Based on real data from a Chinese online medical service provider, we proposed a hierarchical encoder-tagger model (HET), enhanced by a memory module, to tag each utterance in a conversation as a problem statement or a treatment recommendation; the labeled utterances are then concatenated to form summaries. The experimental results demonstrate the validity of our approach to medical conversation summarization via identifying important utterances on the proposed dataset. For future work, we plan to perform further key information extraction on the conversation summaries of similar medical problems, so as to obtain relevant information such as symptoms and treatment recommendations for a particular medical problem and help new patients locate more precise references.