Enhancing Dialogue Symptom Diagnosis with Global Attention and Symptom Graph

Symptom diagnosis is a challenging yet profound problem in natural language processing. Most previous research focuses on standard electronic medical records for symptom diagnosis, while the dialogues between doctors and patients, which contain richer information, have not been well studied. In this paper, we first construct a dialogue symptom diagnosis dataset based on an online medical forum with a large number of dialogues between patients and doctors. Then, we provide several benchmark models on this dataset to promote research on dialogue symptom diagnosis. To further enhance the performance of symptom diagnosis over dialogues, we propose a global attention mechanism to capture more symptom-related information, and build a symptom graph to model the associations between symptoms rather than treating each symptom independently. Experimental results show that both the global attention and the symptom graph effectively boost dialogue symptom diagnosis. In particular, our proposed model achieves state-of-the-art performance on the constructed dataset.


Introduction
With the widespread use of electronic health records (EHRs) in medical treatment, symptom diagnosis based on EHRs has received a lot of attention in the natural language processing (NLP) research community (Linder et al., 2007; Shivade et al., 2013). Previous work on EHRs achieved great success in determining the diagnosis of clinical depression (Trinh et al., 2011), identifying community-acquired pneumonia (DeLisle et al., 2013), improving medication reconciliation (Persell et al., 2018) and detecting infections. However, EHRs usually contain historical information, such as medical or health records, which cannot fully reflect the current symptoms of a patient. In contrast, the dialogues between doctors and patients during the medical consultation process provide many valuable clues for current symptom diagnosis. Only a few researchers have focused on such dialogues; for example, a reinforcement learning based framework has been proposed for a medical dialogue system for automatic diagnosis. As shown in Table 1, a kid has a cough, and the doctor asks whether the kid has a fever. The patient then describes the kid's real situation: some symptoms such as coughing are present, some such as fever are absent, and some such as cold are uncertain because the doctor cannot make a clear judgment at that time. Although dialogues show great potential in medical treatment, symptom diagnosis based on dialogues, namely dialogue symptom diagnosis, has rarely been studied. Moreover, to the best of our knowledge, there are no public datasets on dialogue symptom diagnosis.
In this paper, we focus on dialogue symptom diagnosis and define it by two subtasks: symptom recognition and symptom inference. Symptom recognition aims to identify symptom-related entities in the dialogues, which is the basic step in finding symptoms or diseases. It is similar to the disease named entity recognition (NER) task (Dogan et al., 2014), which is generally treated as a sequence labeling problem (Chinchor and Robinson, 1997; Sang and De Meulder, 2003). However, symptom recognition in dialogues is more challenging due to the short texts and nonstandard oral descriptions. Symptom inference then decides whether each recognized symptom is True, False, or Uncertain for the patient, which helps the doctor diagnose the disease in the next step.
To promote research on dialogue symptom diagnosis, we collect a large number of dialogues between patients and doctors from a Chinese online medical forum, and construct a dataset for the above two sub-tasks. In addition, we provide several classical and advanced baselines on this dataset for further research. Furthermore, we propose an approach that embeds a global attention mechanism and a symptom graph to improve the performance of dialogue symptom diagnosis. Specifically, the global attention incorporates related information from the whole dialogue and the whole corpus for better symptom entity representations, which are used for both symptom recognition and inference. The symptom graph is built by treating each symptom as a node and connecting edges according to symptom co-occurrence in the dialogues. It models the associations between symptoms rather than treating each symptom independently, which improves inference precision.
The contributions of this work can be summarized as follows:
• We provide a public dataset to promote the research of dialogue symptom diagnosis, which contains annotations for symptom recognition and symptom inference in dialogues.
• We present a global attention mechanism, which captures more symptom-related information from both dialogues and the corpus to boost the performance of dialogue symptom diagnosis.
• We build a symptom graph to model the associations between symptoms, which further helps improve the precision of symptom inference.
• We perform extensive experiments, and the results demonstrate the effectiveness of our proposed approach on the two sub-tasks of dialogue symptom diagnosis.

Related Work
Early attempts at the biomedical NER task were based on rule-based dictionary matching and machine learning methods. Lin et al. (2004) used maximum entropy as the underlying machine learning method, combined with dictionary-based and rule-based post-processing, to identify biomedical entities. Jimeno et al. (2008) used MetaMap, provided by the National Library of Medicine, together with a dictionary matching method to identify diseases. In recent years, researchers have proposed many neural network-based models for this problem, most of which use an encoder-decoder architecture. Collobert et al. (2011) used a convolutional neural network (CNN) as the encoder and a conditional random field (CRF) (Lafferty et al., 2001) as the decoder. More recent work used LSTMs as encoders, which perform better on sequential problems: Huang et al. (2015) used a bidirectional LSTM as the encoder, and the resulting BiLSTM-CRF model achieved state-of-the-art results on many datasets. Therefore, many researchers choose the BiLSTM-CRF model as a baseline for sequence labeling problems. Others have attempted to obtain better word representations: Ma and Hovy (2016) added a CNN to represent character-level features on top of BiLSTM-CRF; with the character encoder, the model can extract features inside words and obtain better representations.
In the symptom NER task, some symptom entity names are complex, and there have been many efforts to exploit features beyond individual sequences. Yaghoobzadeh and Schütze (2016) used a knowledge base and aggregated corpus-level contextual information to learn an entity's classes. To address the challenge of identifying rare and complex disease names, Xu et al. (2019) proposed a method that incorporates both disease dictionary matching and a document-level attention mechanism into BiLSTM-CRF for disease NER. Xu et al. (2018) used a document-level attention mechanism to capture long-range contextual dependencies for clinical NER. Symptom recognition is a very important step, but these studies focus on recognition only and do not further infer the status of the recognized symptoms.

Dataset
In this section, we describe our dataset, which is constructed from the pediatric department of a Chinese online health community 1 . Patients submit their health problems, and then doctors start a conversation to learn more about the patient and provide professional suggestions.

Annotation
Symptoms reflect the abnormal state of the patient or the presence of a disease. The annotation consists of three parts, namely symptom recognition, symptom normalization and symptom inference. Figure 1 gives an example. We apply the BIO (begin-in-out) schema at the character level, and each symptom is tagged with an extra label (True, False or Uncertain) indicating whether the patient really has the symptom. Each symptom is also linked to the most relevant concept in SNOMED CT 2 for normalization. To ensure the quality of the dataset, we hired three annotators with medical backgrounds. Each character is marked by two annotators, and any inconsistency is further judged by the third annotator. The Cohen's kappa coefficient (Fleiss and Cohen, 1973) is used to measure inter-annotator agreement.

Data Details
Our dataset has a total of 2,067 conversations, and we focus on four diseases, namely "upper respiratory infection", "functional dyspepsia", "infantile diarrhea" and "bronchitis". The distribution of the diseases is shown in Table 2, and Table 3 presents some statistics of the dataset, where SNE stands for symptom named entity. The proportions of the symptom statuses True, False and Uncertain are around 63%, 12% and 25%, respectively. To enable fair comparison, we split the dataset by a 3:1:1 ratio into training, validation and test sets 3 .
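As a minimal sketch of the character-level BIO + status scheme described above (the helper function and example spans are hypothetical, and English characters are used here only for readability):

```python
def bio_tag(chars, symptom_spans):
    """Tag each character with the BIO schema; each symptom span also
    carries a status label (True/False/Uncertain) appended to its tag."""
    tags = ["O"] * len(chars)
    for start, end, status in symptom_spans:  # end is exclusive
        tags[start] = "B-" + status
        for i in range(start + 1, end):
            tags[i] = "I-" + status
    return tags

chars = list("has a cough")                # character-level tokens
tags = bio_tag(chars, [(6, 11, "True")])   # "cough" is a confirmed symptom
```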

Proposed Model
The framework of our proposed model is presented in Figure 2. Our model consists of three parts: symptom recognition, the symptom graph, and symptom inference. We first encode the word sequence with a Bi-LSTM. Then we apply a global attention mechanism to gather contextual information at the document level and the corpus level. Next, we re-encode the hidden states obtained above and decode with a CRF to recognize the symptoms. To model the associations between symptom entities in the dialogue, we construct a symptom graph, which is incorporated into the classification layer for symptom inference. Each step is described in detail in the following sections.

Bi-LSTM Encoder
In this work, we use the bidirectional long short-term memory network (Bi-LSTM) (Hochreiter and Schmidhuber, 1997) to encode the input sequences. Bi-LSTM has been widely used to extract contextual text features: it encodes the input from left to right and the same sequence in reverse (Huang et al., 2015). Given an input sequence $X = (x_1, x_2, ..., x_n)$, we obtain the hidden states by concatenating the forward and backward LSTM outputs at each position. Formally, the basic units, including the hidden state $h_t$ and the memory $c_t$, are updated with the following equations:
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$
where $\sigma$ is the sigmoid function and $\odot$ is the element-wise product. $x_t$ is the input vector at time $t$, and $i_t$, $f_t$, $o_t$ denote the input, forget and output gates respectively.
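A minimal numpy sketch of one LSTM update following the gate equations above (shapes and the stacked parameter layout are assumptions, not the paper's implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM update; W, U, b stack the parameters for the input,
    forget, output gates and the candidate cell in that order."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b          # (4H,) pre-activations
    i_t = sigmoid(z[0:H])                 # input gate
    f_t = sigmoid(z[H:2*H])               # forget gate
    o_t = sigmoid(z[2*H:3*H])             # output gate
    g_t = np.tanh(z[3*H:4*H])             # candidate memory
    c_t = f_t * c_prev + i_t * g_t        # element-wise products
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
d, H = 4, 3
h, c = lstm_step(rng.standard_normal(d), np.zeros(H), np.zeros(H),
                 rng.standard_normal((4 * H, d)),
                 rng.standard_normal((4 * H, H)),
                 np.zeros(4 * H))
```

A Bi-LSTM simply runs this recurrence in both directions and concatenates the two hidden states per position.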

Global Attention
Our global attention mechanism is shown in Figure 3, which consists of two parts, namely document-level attention and corpus-level attention. We will describe the details in the following.
Document-level Attention
In a dialogue, the information provided by a single sentence is very limited, and the same word may have different meanings due to ambiguity. Therefore, we apply a document-level attention mechanism to make full use of the information in the whole dialogue and alleviate the ambiguity problem. We define a document (or dialogue) $D = (S_1, S_2, ...)$ and a sentence $S_p = (w_{p1}, w_{p2}, ...)$, where $S_p$ is the $p$th sentence of the document, $w_{pi}$ is the $i$th word of $S_p$, and $h_{pi}$ is the hidden state of $w_{pi}$. We search the current document for sentences containing the same word $w_{pi}$ and feed each found sentence into the same Bi-LSTM model. For example, as shown in Figure 3, $w_{pi}$ is the word "cough", and the sentences $S^D_q$ and $S^D_r$ in the current dialogue also contain "cough". We collect the hidden states of the word in these sentences into a set $\tilde{h}_{pi} = \{\tilde{h}^1_{pi}, \tilde{h}^2_{pi}, ...\}$; in Figure 3, $\tilde{h}_{qj}$ and $\tilde{h}_{rk}$ are $\tilde{h}^1_{pi}$ and $\tilde{h}^2_{pi}$ respectively. We weight the hidden states by document-level attention, and the attentive representation is formulated as follows:
$$e^j_{pi} = v^\top \tanh(W_h h_{pi} + W_{\tilde{h}} \tilde{h}^j_{pi} + b_e)$$
$$\alpha^{D,j}_{pi} = \frac{\exp(e^j_{pi})}{\sum_k \exp(e^k_{pi})}$$
$$H^D_{pi} = \sum_j \alpha^{D,j}_{pi} \tilde{h}^j_{pi}$$
where $v$, $W_h$, $W_{\tilde{h}}$ and $b_e$ are the parameters to be learned, and $H^D_{pi}$ denotes the contextual information of the word $w_{pi}$ in the dialogue.
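A compact numpy sketch of this additive-attention weighting (parameter names and shapes are assumptions):

```python
import numpy as np

def doc_level_attention(h, support, v, W_h, W_s, b_e):
    """Weight supporting hidden states for the current word and return the
    attentive representation: a softmax over additive scores."""
    scores = np.array([v @ np.tanh(W_h @ h + W_s @ s + b_e) for s in support])
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                 # attention weights sum to 1
    return alpha @ support, alpha        # H^D and the weights

rng = np.random.default_rng(1)
d = 4
H_D, alpha = doc_level_attention(rng.standard_normal(d),
                                 rng.standard_normal((3, d)),  # 3 supports
                                 rng.standard_normal(d),
                                 rng.standard_normal((d, d)),
                                 rng.standard_normal((d, d)),
                                 np.zeros(d))
```

The corpus-level attention described next reuses the same mechanism over supporting sentences drawn from other dialogues.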
Corpus-level Attention
Noting that the same word in different dialogues may carry additional associations, we devise a corpus-level attention mechanism to capture this extra information.
We define the corpus $C = \{D_1, D_2, D_3, ...\}$. Similar to the document-level attention, we find supporting sentences in the corpus that contain the current word. In Figure 3, the sentences $S^C_s$ and $S^C_t$ contain the word "cough", and $\tilde{h}_{sm}$ and $\tilde{h}_{tn}$ are the corresponding hidden states. We apply corpus-level attention to obtain the attentive representation of the hidden states in the corpus:
$$H^C_{pi} = \sum_j \alpha^{C,j}_{pi} \tilde{h}^j_{pi}$$
where $H^C_{pi}$ denotes the related information of the word $w_{pi}$ in the corpus, and $\alpha^{C,j}_{pi}$ is the attention weight for the corresponding hidden state in the corpus, computed in the same way as the document-level attention weights.

Both Document and Corpus-level Attention
To integrate the information obtained from the document-level and corpus-level attentions, we concatenate $h_{pi}$, $H^C_{pi}$ and $H^D_{pi}$ and feed the result into another Bi-LSTM model. Thus, the final hidden state of each word contains complementary information from both the dialogue and the corpus.

Symptom Recognition
In this work, we apply the Conditional Random Field (CRF) (Lafferty et al., 2001) as the decoder for symptom recognition. A CRF computes the globally optimal tag sequence and efficiently captures the dependencies among tags (e.g., label 'I' cannot follow 'O') by jointly decoding the chain of labels. The Viterbi algorithm (Viterbi, 1967) is used for inference via dynamic programming. Given the representation of a sequence, we first map it to the tag space with a linear layer. Then, the score of the input $X$ along with a prediction $y = (y_1, ..., y_n)$ is given by:
$$s(X, y) = \sum_{i=0}^{n} T_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$
where $T$ is a transition matrix, $T_{i,j}$ represents the score of the transition from tag $i$ to tag $j$; $P$ is the matrix of outputs from the last layer, and $P_{i,j}$ is the score of the $j$th tag for the $i$th word in the sentence. The goal is to predict the best tag path:
$$y^* = \arg\max_{y} s(X, y)$$
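A sketch of Viterbi decoding over an emission matrix P and transition matrix T (the toy scores below are illustrative, not from the model):

```python
import numpy as np

def viterbi(P, T):
    """Return the highest-scoring tag path for emissions P (n, k) and
    transitions T (k, k), via dynamic programming with backpointers."""
    n, k = P.shape
    score = P[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        cand = score[:, None] + T + P[t][None, :]   # (k, k) path scores
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# A forbidden transition (e.g. tag 0 -> tag 1) gets a very negative score,
# which steers decoding away from invalid tag sequences.
P = np.array([[1.0, 0.0], [0.0, 2.0]])
T = np.array([[0.0, -100.0], [0.0, 0.0]])
path = viterbi(P, T)
```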

Symptom Graph
Symptom entities have a certain probability of co-occurring in a dialogue: for example, "fever" may appear together with "cold", and "cough" may appear together with "sputum" in the same dialogue. To capture the associations between symptom entities, we build a graph $G = (V, E)$, where $V = \{v_1, v_2, ..., v_m\}$ is the node set and $E \subset V \times V$ is the edge set. The edges $e_{i,j} = (v_i, v_j)$ are undirected. The nodes are the normalized symptom entities with status True from the training corpus, and an edge $e_{i,j} = (v_i, v_j)$ indicates that symptom entities $v_i$ and $v_j$ co-occur in a document. The co-occurrence count of two entities is normalized by min-max normalization to obtain the edge weight, so the weight $w_{i,j} \in (0, 1)$.
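A sketch of this graph construction (the dialogue symptom sets are invented; note that plain min-max maps the rarest edge to exactly 0, while the paper states weights in the open interval (0, 1), so treat the normalization here as illustrative):

```python
from collections import Counter
from itertools import combinations

def build_symptom_graph(dialogues):
    """Count co-occurrences of True-status symptoms per dialogue, then
    min-max normalize the counts into undirected edge weights."""
    counts = Counter()
    for symptoms in dialogues:                      # each is a set of entities
        for edge in combinations(sorted(symptoms), 2):
            counts[edge] += 1
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1                           # avoid division by zero
    return {edge: (c - lo) / span for edge, c in counts.items()}

weights = build_symptom_graph([
    {"cough", "sputum"},
    {"cough", "sputum", "fever"},
    {"cold", "fever"},
])
```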

Symptom Inference
Intuitively, the associations between symptom entities can help enhance symptom inference. Therefore, we first define a smoothness loss that quantitatively measures the entity associations in the constructed symptom graph:
$$S = \mathbf{y}^\top L' \mathbf{y}$$
where $y_i$ is 0 or 1 depending on whether the entity has been recognized by the symptom recognition module. $L' = D' - A'$ denotes the Laplacian matrix of the undirected subgraph $G'$ with $k$ nodes and $m$ edges corresponding to the current document, $A' \in \mathbb{R}^{k \times k}$ denotes the weighted adjacency matrix of $G'$, $D'$ is the degree matrix with $D'_{ii} = \sum_j A'_{ij}$, and $\mathbf{y}$ is a $k$-dimensional vector. Theoretically, if the model fully recognizes the symptoms, the value of $S$ is 0, indicating that the symptom graph is smooth. When some symptoms are not recognized, the value of $S$ depends on the weights between the nodes of the missed symptoms and their neighbors.

With the smoothness loss defined above, we incorporate it into the loss function for symptom inference. Symptoms are classified into three categories, True, False or Uncertain, and we adopt a softmax function in the classification layer to predict the probability of a symptom belonging to each category. The classification layer and the CRF layer share the hidden states of the upper Bi-LSTM encoder. The objective is to minimize the joint cross-entropy loss of classification and the smoothness loss of the graph:
$$J = -\sum_{d \in D} \sum_{i} \sum_{j} \log p^c_{i,j} + \lambda S$$
where $D$ is the document set, $i$ is the index of the sentence, $j$ is the index of the symptom, $p^c_{i,j}$ is the predicted probability of the gold-standard class $c$ for the $j$th symptom in the $i$th sentence of the document, and $\lambda$ is a weight parameter controlling the importance of the smoothness loss.
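The smoothness term can be computed directly from the subgraph's weighted adjacency matrix. A numpy sketch (the adjacency values are invented):

```python
import numpy as np

def smoothness_loss(A, y):
    """S = y^T (D - A) y for weighted adjacency A and a 0/1 recognition
    indicator y; S is 0 when all connected symptoms are recognized alike."""
    D = np.diag(A.sum(axis=1))
    return float(y @ (D - A) @ y)

A = np.array([[0.0, 0.8], [0.8, 0.0]])   # two strongly associated symptoms
s_all = smoothness_loss(A, np.array([1.0, 1.0]))   # both recognized
s_miss = smoothness_loss(A, np.array([1.0, 0.0]))  # one missed
```

Missing one of two strongly associated symptoms incurs a penalty proportional to the edge weight, which is exactly the signal the joint objective exploits.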

Experimental Setup
We use 200-dimensional Chinese embeddings trained on Wikipedia and fine-tune them during model training by back-propagating the gradients. The weight matrices are initialized by the Xavier method (Glorot and Bengio, 2010). Stochastic gradient descent (SGD) with a momentum of 0.9 is used for optimization. The initial learning rate is $\eta_0 = 0.015$, and the learning rate gradually decreases with the training epochs according to $\eta_t = \eta_0 / (1 + \rho t)$, where $\rho = 0.05$ and $t$ is the number of training epochs. Gradient clipping is set to 5 to avoid gradient explosion. The dropout rate is 0.5 and the Bi-LSTM hidden dimension is 200. We build the symptom graph from our training set; it has 162 nodes and 1,646 edges. We initialize the labels of all nodes as 1.
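The decay schedule above is straightforward to reproduce:

```python
def learning_rate(t, eta0=0.015, rho=0.05):
    """eta_t = eta_0 / (1 + rho * t), decayed per training epoch."""
    return eta0 / (1 + rho * t)

lr_start = learning_rate(0)    # 0.015
lr_epoch20 = learning_rate(20) # halved after 20 epochs
```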
Look-up Table and Stop Words
For the attention mechanism, we select at most three document-level supporting sentences and three corpus-level supporting sentences. We build a look-up table that returns the index of a word in each sentence and the index of a sentence in each document, so the time complexity of finding supporting sentences and words is O(1) per lookup. Meanwhile, we use a stop-word list containing 178 words to further reduce the time cost.
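Such a look-up table can be sketched as an inverted index (the toy corpus and stop-word list below are invented):

```python
from collections import defaultdict

def build_index(corpus, stop_words):
    """Map each non-stop word to the (doc_id, sent_id) positions where it
    occurs, so supporting sentences can be fetched in O(1) per lookup."""
    index = defaultdict(list)
    for d, doc in enumerate(corpus):
        for s, sent in enumerate(doc):
            for w in set(sent):              # each word once per sentence
                if w not in stop_words:
                    index[w].append((d, s))
    return index

index = build_index(
    [[["cough", "the"], ["fever"]], [["cough"]]],   # two toy dialogues
    stop_words={"the"},
)
```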
Symptom Normalization
We treat symptom normalization as a text classification problem. Our dataset contains 162 normalized symptoms, and we apply a convolutional neural network (CNN) to classify symptom mentions into these 162 categories. The accuracy of symptom normalization on the test set is 97.04%, so normalization does not introduce much noise into our model.
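A minimal sketch of the core of such a CNN classifier, 1D convolution with max-over-time pooling (filter sizes and shapes are assumptions; a linear + softmax layer over these features would give the 162-way prediction):

```python
import numpy as np

def cnn_features(E, W, b):
    """E: (n, d) character embeddings; W: (f, w, d) holds f filters of
    width w. Returns a fixed-size vector via ReLU and max pooling."""
    n, d = E.shape
    f, w, _ = W.shape
    conv = np.array([[(W[j] * E[i:i + w]).sum() + b[j]
                      for i in range(n - w + 1)]
                     for j in range(f)])            # (f, n-w+1) feature maps
    return np.maximum(conv, 0).max(axis=1)          # (f,) pooled features

rng = np.random.default_rng(2)
feats = cnn_features(rng.standard_normal((5, 3)),   # 5 chars, dim 3
                     rng.standard_normal((2, 2, 3)),
                     np.zeros(2))
```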

Performance of Symptom Recognition
Symptom recognition is the basis of symptom inference. We report the results of recent strong baselines as well as variants of our proposed method. Specifically, we compare the performance of the following models: • Bi-RNNs (Dyer et al., 2015): These models use an LSTM or GRU sentence encoder and treat symptom entity recognition as a classification problem with a softmax function. Hereafter, we use RNNs to denote LSTM or GRU for ease of description.
• Bi-RNNs-CRF (Huang et al., 2015): These models use RNNs as the sentence encoder and a CRF layer as the decoder, which yields the tag prediction for each token.
• Corpus-level Attention: It is a Bi-LSTM-CRF model that incorporates corpus-level features via our corpus-level attention.
• Document-level Attention: It is a Bi-LSTM-CRF model that incorporates document-level features via our document-level attention.
• Both Corpus and Document-level Attention: It is a Bi-LSTM-CRF model that incorporates both document-level and corpus-level features via our global attention.
The overall results of symptom recognition are shown in Table 4. We observe that the Bi-RNNs models, including Bi-GRU and Bi-LSTM, have similar performance, reaching about 81% F1 score on our dataset. The Bi-RNNs-CRF models perform much better than Bi-RNNs, which indicates the effectiveness of the CRF for sequence tagging. In addition, the performance can be slightly improved by incorporating character-level information with a CNN. Furthermore, integrating either our corpus-level attention or our document-level attention into the existing models significantly boosts performance. In particular, our model with global attention achieves the best performance in terms of all metrics.

Performance of Symptom Inference
Table 5 presents the symptom inference results of the classical Bi-LSTM CRF-inference model and our proposed joint model (Figure 2). The results show that our proposed model with global attention significantly outperforms the Bi-LSTM CRF-inference model across all categories. In particular, we achieve substantial improvements in inferring the False and Uncertain categories of symptoms by utilizing the global information in the current dialogue and the whole corpus.

To investigate the effect of the symptom graph on symptom inference, we compare the models with and without the symptom graph. The results in Table 5 show that incorporating the symptom graph further boosts the performance of each model. These observations verify the effectiveness of modeling the associations between symptoms via graphs for symptom inference.

Table 6 presents a case of symptom recognition using the baseline and our model. We observe that the baseline Bi-LSTM CRF model identifies only the word "allergic" as a symptom.
In contrast, our model recognizes the phrase "allergic rhinitis" by utilizing the related information (i.e., "allergies rhinitis" and "allergic caused rhinitis") in the document-level and corpus-level supporting sentences, which describes the symptom more accurately in this case. Table 7 shows the symptom inference results for a case using the baseline and our joint model. From the patient's answer, we know that the kid has no "allergy". However, the symptom "allergies" in the doctor's question is inferred as Uncertain by the baseline. By incorporating the global attention mechanism, our joint model correctly infers the symptom as False.

Model: Bi-LSTM CRF
Sentence: 医生：相对来说，这个年龄的孩子出现过敏性鼻炎比较少见。 (Doctor: Relatively speaking, allergic rhinitis is rare in children of this age.)

To gain insight into why the symptom graph can help boost symptom inference, we select several frequent symptoms in dialogues, namely "Cough", "Sputum", "Fever", "Diarrhea", "Snot", "Cold" and "Indigestion", and visualize the associations between them in Figure 4, where a darker color indicates a larger association weight. We observe that "cough" and "sputum" are highly associated, which matches the intuition that a patient who coughs will probably also have sputum. To make this clearer, we show the inference results for each symptom with and without the graph in Figure 5. The results show that our model with the graph achieves larger improvements than the one without for highly associated symptoms such as "cough" and "sputum", which indicates the necessity of incorporating the symptom graph to enhance symptom diagnosis.

Conclusions and Future Work
In this paper, we construct a dataset for dialogue symptom diagnosis, and present a model with global attention and symptom graph for diagnosing symptoms in dialogues. Our global attention mechanism consists of the document-level and corpus-level attentions, which select supporting sentences from the current dialogue and corpus to overcome the information limitations. Experiments on our dataset show that our global attention can effectively boost the performance of dialogue symptom diagnosis. Furthermore, we build a symptom graph to model the associations between symptoms, which helps improve the performance of symptom inference.
In the future, we will build a larger symptom graph and use external medical information to further improve the performance of symptom diagnosis on dialogues.