Counterfactual Supporting Facts Extraction for Explainable Medical Record Based Diagnosis with Graph Network

Providing a reliable explanation for clinical diagnosis based on the Electronic Medical Record (EMR) is fundamental to the application of Artificial Intelligence in the medical field. Current methods mostly treat the EMR as a text sequence and provide explanations based on a precise medical knowledge base, which is disease-specific and difficult for experts to obtain in reality. Therefore, in this paper we propose a counterfactual multi-granularity graph supporting facts extraction (CMGE) method that extracts supporting facts from the irregular EMR itself, without external knowledge bases. Specifically, we first structure the sequence of the EMR into a hierarchical graph network and then obtain the causal relationship between multi-granularity features and diagnosis results through counterfactual intervention on the graph. The features with the strongest causal connection to the results provide interpretive support for the diagnosis. Experimental results on real Chinese EMRs of lymphedema demonstrate that our method can diagnose four types of EMR correctly and can provide accurate supporting facts for the results. More importantly, the results on different diseases demonstrate the robustness of our approach, which indicates its potential for application in the medical field.


Introduction
Electronic Medical Record (EMR) based diagnosis has attracted extensive attention with the development of natural language processing and medical informatics, owing to its comprehensive historical information and clinical descriptions (Yang et al., 2018; Choi et al., 2018; Liu et al., 2019; Dong et al., 2020; Ma et al., 2020b). The application of deep learning in medicine requires adequate medical explanations for the results. Specific to diagnosis from EMRs, the model needs to provide the text descriptions supporting the diagnosis results.1

1 The code is available at https://github.com/CKRE/CMGE

Figure 1: An example of an EMR. We consolidated the various parts of the EMR into a single document as input, and our goal is to extract supporting facts at the granularity of the clause.
As shown in Figure 1, an irregular EMR is a document of disease-related information, including symptoms, history of the disease, preliminary examination results, and so on, which is disordered and sparse, with meaningless noisy text. Existing methods provide explanations through medical entities (Yuan et al., 2020), text spans (Mullenbach et al., 2018), and the weights of external knowledge (Ma et al., 2018). Entities are critical to the diagnosis (Sha and Wang, 2017; Girardi et al., 2018), but for a medical explanation, they cannot provide specific information about symptoms (such as whether they are positive or negative). And the span form is too fragmented and lacks readability. Therefore, the clause, as a more informative and readable representation, needs to be introduced above the level of entities.
Most of the previous methods provide reliable explanations for diagnosis by calculating the similarity with an external medical knowledge base (ICD 2 and CCS 3) (Xu et al., 2019, 2020). KAME (Ma et al., 2018) uses the weights of the nodes in an introduced knowledge graph to provide explanations. Depending on the hierarchical relations in the database, GMAN (Yuan et al., 2020) builds a disease hierarchy graph and a causal graph to find critical entities. However, a trusted medical knowledge base requires a large amount of expertise in different fields to build, and it may be incomplete or erroneous in practical clinical applications. So far, how to extract supporting facts from the EMR itself, without an external medical knowledge base, remains an open problem.
Counterfactual reasoning provides a link between what actually happened and what could have happened had the inputs been changed (Verma et al., 2020). Doctors usually make a judgment based on several related symptoms when diagnosing a disease. In this regard, we can consider a question: would a doctor make a misdiagnosis without one of the critical symptoms? The answer is clearly yes. In a counterfactual way, if we gradually weaken a feature until the diagnosis changes dramatically, then this feature can be considered a supporting fact.
Based on this consensus, we propose a counterfactual multi-granularity graph supporting facts extraction (CMGE) method for irregular EMRs in this paper. First, we model the EMR as a hierarchical graph structure that contains sentences, clauses, and entities. Specifically, sentences model the temporal relationship, clauses provide a complete descriptive explanation, and entities provide symptom support. On this basis, we use a graph attention network to aggregate information across the different granularities. Then, we perform a counterfactual intervention to obtain the causal relation between features and diagnosis. Specifically, we train a learnable soft-mask matrix to mask the features of nodes or edges in the graph while keeping the diagnosis unchanged; the remaining features are the supporting facts of the diagnosis. Counterfactual reasoning on the graph requires enhancing the medical features contained in text of different granularities, so we use clustering labels 4 to cluster clauses and entities. The experimental results demonstrate the effectiveness of our method. The contributions of this paper are summarized as follows:

Figure 2: This figure shows the hierarchical connection structure between multi-granularity nodes. The black edges in the graph represent the tree-structured connections between the four types of nodes in the EMR. For the red edges, the left part shows the connections between the clause nodes and the graph aggregate node, and the right part shows the fully connected form between clause nodes.

• We propose a multi-granularity structured modeling method based on a hierarchical graph network that decomposes the EMR into sentences, clauses, and entities, and uses clustering labels to enhance the expression of medical features.
• We adapt counterfactual intervention to extract critical supporting facts from the EMR during diagnosis. Importantly, our method is disease-independent and does not require a precise external medical knowledge base, so that it is suitable for a wide range of applications.
• The evaluation conducted on the real EMR dataset shows that our method can correctly diagnose the types of lymphedema. Keyword coverage and human evaluation show that the counterfactual reasoning method has better extraction accuracy and robustness compared to two existing methods reimplemented by ourselves.

Proposed Method
Given an irregular EMR in the form of free text X = [x_1, x_2, ..., x_L] with L words, our task is to extract, while performing the diagnosis, supporting facts that can be used to explain the diagnosis result without relying on external knowledge. The supporting facts can be entities or clauses of text.

Multi-Granularity Graph Construction
The medical features in the EMR are sparse, and medical entities alone cannot provide a sufficient explanation for diagnosis. Therefore, we perform multi-granularity segmentation of EMRs, which enhances both the symptom features of entities and the explanation of the diagnosis, while maintaining the integrity of the text. An EMR can be divided by periods into sentences, which can be further divided into clauses by commas or semicolons as a more fine-grained segmentation. In order to keep the symptom features of entities, we perform Named Entity Recognition 5 and number extraction for each clause 6. In addition, we add two general nodes representing the gender and age of the patient respectively.

Figure 3: An overview of the counterfactual multi-granularity graph supporting facts extraction network. To show the soft-mask process clearly, we assume that the features of both nodes and edges in the graph are 1; ⊙ denotes element-wise multiplication between graph features and the mask matrix. All the edges in the graph are bidirectional. For readability, we only mark the monodirectional mask value for the bidirectional edges in the Edge-Mask.
After segmentation, as shown in Figure 2, we build a hierarchical tree structure. The nodes at each level represent the text of sentences, clauses, and entities respectively. Specifically, for each EMR, we connect the two general nodes and the sentence nodes sequentially. Then, we connect each clause node to the sentence node to which it belongs and to the entity nodes extracted from it. In particular, a fully connected relationship is established between all the clause nodes, which overcomes the defect that a Graph Attention Network (GAT) can only aggregate information from adjacent nodes when the network is shallow, and expands the receptive field of each clause node to the whole EMR. Then, all clause nodes are connected to an aggregate node that is used to perform the diagnosis. All the edges in the graph are bidirectional so that information flows better between nodes.

5 https://github.com/daiyizheng123/Bert-BiLSTM-CRFpytorch
6 We recommend Stanza (Qi et al., 2020; Zhang et al., 2020) for English EMRs. https://github.com/stanfordnlp/stanza
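The connection rules above can be sketched in a few lines. This is a minimal illustration (general nodes, sequential sentence nodes, clause-to-sentence and clause-to-aggregate links, fully connected clauses); the node indexing, data layout, and function names are our own assumptions, and entity nodes with NER are omitted for brevity.

```python
# A minimal sketch of the hierarchical graph construction described above.
# Node indexing and helper names are assumptions, not the authors' code.

def build_emr_graph(sentences):
    """sentences: list of sentences, each a list of clause strings.
    Returns a node list and a set of undirected edges (i, j)."""
    nodes, edges = [], set()

    def add_edge(i, j):
        edges.add((min(i, j), max(i, j)))  # bidirectional, stored once

    # Two general nodes (gender, age) plus one aggregate node for diagnosis.
    gender, age = len(nodes), len(nodes) + 1
    nodes += [("general", "gender"), ("general", "age")]
    aggregate = len(nodes)
    nodes.append(("aggregate", "AGG"))
    add_edge(gender, age)

    prev_sent, clause_ids = age, []
    for sent in sentences:
        s = len(nodes)
        nodes.append(("sentence", " ".join(sent)))
        add_edge(prev_sent, s)          # sentence nodes connected sequentially
        prev_sent = s
        for clause in sent:
            c = len(nodes)
            nodes.append(("clause", clause))
            clause_ids.append(c)
            add_edge(s, c)              # clause -> its sentence
            add_edge(c, aggregate)      # clause -> aggregate node
            # entity nodes would hang off the clause here (NER omitted)

    # Fully connect all clause nodes to widen each clause's receptive field.
    for i in range(len(clause_ids)):
        for j in range(i + 1, len(clause_ids)):
            add_edge(clause_ids[i], clause_ids[j])
    return nodes, edges
```

For example, an EMR with two sentences split into three clauses yields eight nodes and a fully connected triangle among the clause nodes.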

Clustering Labels
In the original EMR, all tokens have the same weight, so noisy text degrades the performance of both diagnosis and explanation. To improve the accuracy of symptom representation, clustering labels are used to cluster clauses and entities into corresponding medical classifications. Specifically, clauses are divided into 33 classes and entities into 10 classes, a scientific classification method in medicine derived from the textbook "Diagnostics" (Xuehong Wan, 2013). These labels are disease-independent and can be labeled without expert knowledge through crowdsourced annotation. We manually annotated the corresponding labels for the entire dataset on our own platform. We also trained a BERT-based (Devlin et al., 2019) text classifier on 30% of the data, which achieves an annotation accuracy of 80.76% on clauses and 97.13% on entities on the remaining data. This shows that our method can easily annotate large-scale data. With these labels, we can gather the same types of features together in the feature space, thereby enhancing the model's overall attention to important types of features. Please refer to Appendix B.2 for more details.

Input Encoder
After building the multi-granularity graph for a medical record, each node in the graph contains a sequence X_node = [x_1, x_2, ..., x_n] with n words, which is tokenized by the tokenizer of BERT (Devlin et al., 2019). In order to maintain the consistency of the encodings at different granularities, we use a single bi-directional RNN (Schuster and Paliwal, 1997) with GRU cells (Cho et al., 2014) to convert the sequences of sentences, clauses, entities, and general information into hidden state sequences H_m = (h_1, h_2, ..., h_n):

h_t = BiGRU(h_{t-1}, e(x_t)),

where h_t is the hidden state of the t-th token and e(x_t) is the randomly initialized embedding vector of x_t. Finally, we use the last hidden state of the i-th text sequence as the feature H_i of node i.
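A toy sketch may make the encoder concrete: one shared BiGRU runs over each node's token embeddings, and the final forward and backward hidden states are concatenated as the node feature H_i. The GRU cell implementation, weight shapes, and random initialization below are illustrative assumptions, not the authors' code.

```python
import numpy as np

# Toy NumPy sketch of the shared BiGRU node encoder described above.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    def __init__(self, d_in, d_hid, seed=0):
        rng = np.random.default_rng(seed)
        w, u = (d_hid, d_in), (d_hid, d_hid)
        # update (z), reset (r) and candidate (h) gate weights
        self.Wz, self.Uz = rng.normal(0, 0.1, w), rng.normal(0, 0.1, u)
        self.Wr, self.Ur = rng.normal(0, 0.1, w), rng.normal(0, 0.1, u)
        self.Wh, self.Uh = rng.normal(0, 0.1, w), rng.normal(0, 0.1, u)

    def step(self, x, h):
        z = sigmoid(self.Wz @ x + self.Uz @ h)             # update gate
        r = sigmoid(self.Wr @ x + self.Ur @ h)             # reset gate
        h_cand = np.tanh(self.Wh @ x + self.Uh @ (r * h))  # candidate state
        return (1 - z) * h + z * h_cand

def encode_node(embeddings, fwd, bwd, d_hid):
    """embeddings: (n, d_in) token embeddings of one node's text.
    Returns the node feature: concatenated last forward/backward states."""
    h_f = np.zeros(d_hid)
    for x in embeddings:           # left-to-right pass
        h_f = fwd.step(x, h_f)
    h_b = np.zeros(d_hid)
    for x in embeddings[::-1]:     # right-to-left pass
        h_b = bwd.step(x, h_b)
    return np.concatenate([h_f, h_b])
```

Sharing the same cell pair across sentence, clause, entity, and general nodes mirrors the paper's use of a single encoder for all granularities.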

Graph Reasoning
Once we obtain the node features, we use the Graph Attention Network (GAT) (Velickovic et al., 2018) to aggregate information between the different granularities. GAT obtains correlation scores between nodes through an attention mechanism, which is the key to the interpretability of our model. Specifically, GAT takes all the node features as input and calculates the attention coefficients α_ij by

e_ij = LeakyReLU(a^T [W H_i ‖ W H_j]),
α_ij = softmax_j(e_ij) = exp(e_ij) / Σ_{k∈N_i} exp(e_ik),

where H_i is the feature of node i, W ∈ R^{d×d} is a learnable weight matrix for the linear projection, and a ∈ R^{2d} is a learnable weight vector that transforms the concatenated adjacent node representations into the edge score e_ij between the i-th and j-th nodes. The softmax normalizes the attention scores over all the edges connected to node i. Then, we update the feature of each node by

H_i' = σ(Σ_{j∈N_i} α_ij W H_j).

After graph reasoning, the representation H of each node has been updated with the granular information aggregated from adjacent nodes and can be used for the subsequent tasks.
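The single-head GAT update can be sketched in NumPy, assuming the standard formulation of Velickovic et al. (2018): an edge score from the concatenated projections, a softmax over each node's neighbours, and a weighted sum of projected neighbour features. Multi-head attention and the output nonlinearity are omitted for brevity.

```python
import numpy as np

# Single-head GAT layer sketch. Requires self-loops in adj so every
# row of the attention matrix has at least one neighbour.

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(H, adj, W, a):
    """H: (N, d) node features, adj: (N, N) 0/1 adjacency with self-loops,
    W: (d, d) projection, a: (2d,) attention vector. Returns (N, d)."""
    Z = H @ W.T                                    # linear projection W H_i
    N = H.shape[0]
    e = np.full((N, N), -np.inf)                   # -inf masks non-edges
    for i in range(N):
        for j in range(N):
            if adj[i, j]:
                e[i, j] = leaky_relu(a @ np.concatenate([Z[i], Z[j]]))
    e = e - e.max(axis=1, keepdims=True)           # stable softmax per row
    alpha = np.exp(e)
    alpha /= alpha.sum(axis=1, keepdims=True)      # alpha_ij over neighbours
    return alpha @ Z                               # weighted neighbour sum
```

A node whose only edge is its self-loop simply keeps its own projection, which is a handy sanity check on the masking.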

Multi-task Prediction
After obtaining the updated node features, we use them in three subtasks: (i) graph classification for automatic diagnosis; (ii) clause classification for clustering; and (iii) entity classification for clustering.
Taking entity node classification as an example, for each entity node, we use a two-layer MLP with the ReLU activation function to calculate the probability. For an entity node i, we obtain

P_entity(i) = softmax(MLP(H_i)).

In the same way, we obtain the probabilities P_graph, P_clause, and P_entity. As in common multi-task learning, we combine all the losses:

L = λ_1 L_graph + λ_2 L_clause + λ_3 L_entity, (6)

where λ_1, λ_2 and λ_3 are hyper-parameters, and all the losses are cross-entropy losses.
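The prediction heads and the joint objective can be sketched as follows; the head shapes, weight initialization, and default loss weights are illustrative assumptions.

```python
import numpy as np

# Sketch of the per-node classification head and the joint multi-task loss.

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example."""
    logits = logits - logits.max()                 # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[label]

def mlp_head(h, W1, b1, W2, b2):
    """Two-layer MLP with ReLU, producing class logits from a node feature."""
    return W2 @ np.maximum(W1 @ h + b1, 0.0) + b2

def joint_loss(l_graph, l_clause, l_entity, lambdas=(1.0, 1.0, 1.0)):
    """L = lambda_1 * L_graph + lambda_2 * L_clause + lambda_3 * L_entity."""
    l1, l2, l3 = lambdas
    return l1 * l_graph + l2 * l_clause + l3 * l_entity
```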

Counterfactual Reasoning on Graph
Providing supporting information while making the diagnosis is the key to applying Artificial Intelligence to the medical field. Inspired by (Ying et al., 2019), after training we add a node-mask or edge-mask into the GAT to obtain the counterfactual result and eliminate the noise nodes while keeping the diagnostic results unchanged. For the edge-mask, we introduce a learnable matrix M with the same form as the adjacency matrix of the medical record graph. Each element m_ij of the matrix represents the degree of masking of the message aggregation from node i to node j. With this method, the calculation of the attention coefficients in the GAT becomes

α_ij = exp(sigmoid(m_ij) e_ij) / Σ_{k∈N_i} exp(sigmoid(m_ik) e_ik).

For the node-mask, similarly, we introduce a learnable parameter β_i for each node i in the graph, representing the degree of masking of that node's feature. After the node-mask, the calculations of e_ij and H_i become

e_ij = LeakyReLU(a^T [W(sigmoid(β_i) H_i) ‖ W(sigmoid(β_j) H_j)]),
H_i' = σ(Σ_{j∈N_i} α_ij W (sigmoid(β_j) H_j)).

In the training of counterfactual reasoning, we jointly optimize three loss functions to obtain accurate counterfactual results. To ensure that the model still makes a correct diagnosis after the counterfactual intervention, we use the original model to obtain the factual result D_i and maximize the probability of selecting the correct diagnosis in counterfactual reasoning. Besides, we minimize the sum of all elements in the mask matrix to ensure that as many noise nodes as possible are filtered out. Since there is an exponential number of possible counterfactual interventions through the node-mask or edge-mask, we also minimize the information entropy of the mask matrix regarding which nodes to select, to reduce the uncertainty of the result. Finally, the loss of counterfactual reasoning is

L_cf = -λ_4 log P(D_i) + (λ_5 / N) Σ_{ij} sigmoid(m_ij) + (λ_6 / N) Σ_{ij} H(sigmoid(m_ij)),

where H(p) = -p log p - (1 - p) log(1 - p), λ_4, λ_5 and λ_6 are hyper-parameters, N is the number of elements in the mask matrix M, and all the elements in M are mapped to [0, 1] by the sigmoid function. For the node-mask, the training is similar.
After counterfactual reasoning, we extract the nodes or edges (each edge represents the two nodes connected) represented by the top-k elements in the mask matrix as supporting facts.
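A sketch of the Edge-Mask objective and the top-k extraction step, under our reading of the loss described above: a prediction term that keeps the factual diagnosis D_i likely, a sparsity term on the sigmoid-mapped mask, and an element-wise entropy term. Abstracting the classifier into a probability input and the exact normalizations are assumptions.

```python
import numpy as np

# Sketch of the Edge-Mask counterfactual loss and supporting-fact extraction.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def edge_mask_loss(M_logits, p_fact, lam4=1.0, lam5=0.005, lam6=1.0):
    """M_logits: raw mask matrix; p_fact: probability (in (0, 1)) that the
    masked graph still yields the factual diagnosis D_i."""
    m = sigmoid(M_logits)
    n = m.size
    pred = -np.log(p_fact)                # keep the diagnosis unchanged
    sparsity = m.sum() / n                # filter as many edges as possible
    eps = 1e-12
    entropy = -(m * np.log(m + eps) + (1 - m) * np.log(1 - m + eps)).mean()
    return lam4 * pred + lam5 * sparsity + lam6 * entropy

def top_k_edges(M_logits, k):
    """The edges with the k largest mask values are the supporting facts."""
    order = np.argsort(M_logits, axis=None)[::-1][:k]
    return [tuple(int(v) for v in np.unravel_index(i, M_logits.shape))
            for i in order]
```

Note how a confident, near-zero mask scores lower than a uniform one: both the sparsity and entropy terms push each entry to commit to 0 or 1.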

Experimental Setting
Based on cooperation with hospitals, we conducted experiments on real EMR data. We selected EMRs from the department of lymphedema and diagnose the diseases of primary lymphedema (原发性淋巴水肿), secondary lymphedema (继发性淋巴水肿), chylous reflux lymphedema (乳糜返流性淋巴水肿) and others (其他). The reasons for choosing this department are as follows: (I) Lymphedema is a sub-discipline in medicine, so research on it, whether in Medicine or Artificial Intelligence, is still limited.
For example, ICD10 cannot provide full medical support for it. (II) The pathogenesis and treatment methods of different types of lymphedema vary greatly, but their outward manifestations are similar. Therefore, there is an urgent need for a simple, earlier diagnosis system for lymphedema. (III) Specialist doctors pay more attention to diagnosis within sub-discipline diseases and are less concerned with large-scale rough diagnosis.
Formally, 1000 EMRs are used in our experiment, of which 900 are used for training and 100 for testing. The statistics of the four types of diseases are shown in Table 1.

Baselines
We designed two representative models, based on attention and on variational inference respectively, to compare the ability to extract medical supporting facts under similar task conditions.
Self-Attention This method represents most existing approaches and provides explanations through attention weights. We use a BiGRU to encode the EMR. With the sequence embedding, following (Choi et al., 2016), we use average pooling to obtain the overall representation for automatic diagnosis. For supporting fact extraction, following (Mullenbach et al., 2018), we calculate the self-attention weight of each token and design a sliding-window method to obtain the average attention score of fixed-length spans; the spans with the highest scores are taken as the supporting facts.
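The sliding-window span extraction of this baseline can be sketched as below; the window size, the non-overlap rule, and the function name are our own assumptions for illustration.

```python
import numpy as np

# Sketch of the Self-Attention baseline's span extraction: average the
# per-token attention scores over a sliding window and keep the
# highest-scoring non-overlapping spans.

def top_spans(scores, window=5, k=3):
    """scores: (n,) per-token attention weights. Returns up to k
    non-overlapping (start, end) spans with the highest mean score."""
    n = len(scores)
    means = [(np.mean(scores[i:i + window]), i) for i in range(n - window + 1)]
    means.sort(reverse=True)                     # best windows first
    spans, used = [], np.zeros(n, bool)
    for mean_score, i in means:
        if not used[i:i + window].any():         # skip overlapping windows
            spans.append((i, i + window))
            used[i:i + window] = True
        if len(spans) == k:
            break
    return spans
```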
PostKS This is the other method we designed, based on variational inference. Inspired by the dialogue knowledge selection model PostKS (Lian et al., 2019), we convert pivotal information extraction into a clause selection problem. This method uses the text of the diagnosis (as shown in Figure 1) to calculate its correlation with each clause as the posterior distribution through an attention mechanism, and then uses self-attention and average pooling between clauses to obtain correlation scores as the prior distribution. During training, based on variational inference, the model uses the posterior information to guide the prior selection, making the prior and posterior distributions consistent. Finally, during inference, we select the clauses with high prior attention scores as supporting facts. Please refer to Appendix A.2 for more details.
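Under our reading of this baseline, the clause-selection distributions can be sketched with dot-product attention (the specific attention form is an assumption; the paper only says "attention mechanism"):

```python
import numpy as np

# Sketch of PostKS-style clause selection: the posterior scores clauses
# against the diagnosis feature, the prior uses only the clauses, and a
# KL term pulls the prior towards the posterior during training.

def softmax(x):
    x = x - x.max()                               # numerical stability
    p = np.exp(x)
    return p / p.sum()

def posterior(C, d):
    """C: (N, h) clause features, d: (h,) diagnosis feature."""
    return softmax(C @ d)

def prior(C):
    """Self-attention between clauses followed by average pooling."""
    att = np.stack([softmax(C @ c) for c in C])   # (N, N) row-wise attention
    return att.mean(axis=0)                       # average over queries

def kl(post, pri, eps=1e-12):
    """KL(posterior || prior), minimised to align the two distributions."""
    return float(np.sum(post * (np.log(post + eps) - np.log(pri + eps))))
```

At inference time only the prior is available, which is why training must make it mimic the diagnosis-aware posterior.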

Evaluation Metrics
To measure the performance of our pivotal information extraction module, we built simple diagnostic criteria based on (Levine, 2017), a complete diagnosis and treatment guide for lymphedema written by medical experts. Based on these diagnostic criteria, we used a combination of automatic evaluation and human evaluation.
Automatic Evaluation Precision, recall, and F1 are used as metrics to measure the diagnostic accuracy of the model, which is the basis for practical application. Specifically, several key-phrases are manually identified for each of the three types of lymphedema to represent diagnostic features; they are re-descriptions of the diagnostic criteria in the guide using phrases from the EMRs. We use hit@1/3/5 (Bordes et al., 2013) to measure the coverage rate of the key-phrases by the extracted results. These metrics indicate whether one of the diagnostic features is included in the top-1/3/5 extracted results. Please refer to Appendix B.3 for more details.
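The hit@k coverage described above can be sketched as a simple containment check; treating key-phrase matching as substring containment is an assumption about the matching rule.

```python
# Sketch of the hit@k metric: a sample counts as a hit if any manually
# identified key-phrase appears in the top-k extracted supporting facts.

def hit_at_k(extracted, key_phrases, k):
    """extracted: ranked list of supporting-fact strings for one EMR."""
    return any(kp in fact for fact in extracted[:k] for kp in key_phrases)

def coverage(all_extracted, all_keys, k):
    """Fraction of samples whose top-k results contain a key-phrase."""
    hits = [hit_at_k(e, kp, k) for e, kp in zip(all_extracted, all_keys)]
    return sum(hits) / len(hits)
```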
Human Evaluation Since some implicit medical features cannot be covered by key-phrases, human evaluation is necessary. We used each model to extract the top 3 supporting facts for all 100 EMR samples in the test set and randomly shuffled the order of the results. We then invited 3 evaluators with medical backgrounds who had read the guide to determine whether the results conform to medical knowledge. We focus on the comprehensiveness and trustworthiness of each model. Comprehensiveness measures whether the model can provide more medical features, and trustworthiness measures whether the extraction results are helpful for diagnosis. Each evaluator scores each item from 0 to 2, and the final indicator is the average over the three evaluators.

Diagnostic Result
The diagnostic results are shown in Table 2. From the results, we can see that our model performs better than all the compared models and achieves about 99% accuracy in the diagnosis of lymphedema, exceeding the comparison models by 3%-5% in precision, recall, and F1. With our model, the categories of clauses and entities can be distinguished correctly, which demonstrates that the clustering information contained in the pseudo-labels is correctly learned by our multi-granularity model. This result indicates that the accuracy of our method in the diagnosis of lymphedema meets clinical requirements. Since our goal is to make the model truly help doctors in clinical practice with reliable medical explanations, in the following we focus on the performance of the counterfactual extraction of supporting facts for the diagnosis. Please refer to Appendix A.4 for the effectiveness of our model in diagnosis on benchmark data.

Counterfactual Extraction Result
Automatic Result Table 3 shows the automatic evaluation results of the supporting facts extraction.
Since the identified keywords can hardly cover all the features used for diagnosis, and the models have different adaptability to various diseases, performance differs across diseases. Compared with the other models, the counterfactual-based methods, especially the Edge-Mask method, have an advantage in accuracy and robustness overall. Hit@1 shows that the Edge-Mask can locate key facts more quickly than the comparison methods, and hit@5 shows that it achieves over 70% accuracy on secondary lymphedema and chylous reflux lymphedema. Across the different types of lymphedema, the other methods show greater performance degradation; only the Edge-Mask maintains high accuracy on all diseases, indicating that the Edge-Mask method is highly robust to different diseases.

Human Result Table 4 shows the results of the human evaluation on the four categories of diagnosis. Compared with the other methods, the counterfactual-based methods have great advantages in comprehensiveness, which indicates that our method can focus more on useful medical information and eliminate invalid noise in the EMR. The fourth category deserves particular attention: it includes all non-lymphedema medical records, and its diseases are diverse and complex. The counterfactual reasoning method performs strongly on this category in terms of both comprehensiveness and trustworthiness, indicating that our method is truly independent of the type of disease and suitable for large-scale deployment.

Ablation Study Table 2 shows the ablation experiment results for the clustering labels. For the experiments without the corresponding labels, we used a classifier with randomly initialized parameters, which reflects the expected ability of the model to encode medical features. The results show that both the clause labels and the entity labels improve the diagnostic accuracy by about 1% on top of an accuracy of over 96%.
Since we use the same encoder to encode the three granularities of text (sentence, clause, and entity), adding clause labels also improves the accuracy of entity classification, and vice versa. This indicates that the introduction of clustering labels enhances the expression of medical information in the model and enables it to better extract and utilize relevant medical knowledge from irregular text.

Advanced Analysis
Results in Primary Lymphedema Since primary lymphedema is mainly diagnosed by excluding the other types of lymphedema, the keywords we established are not standardized in the EMRs; the performance of all models in Table 3 therefore declines significantly and should only be used for comparison. The performance in human evaluation is consistent with the other diseases in Table 4.
Results in Chylous Reflux Lymphedema Except for the Edge-Mask, the performance of the other methods on chylous reflux lymphedema drops significantly. Since this type of EMR accounts for only 4% of the dataset, models based on frequency statistics find it difficult to capture the key features. The Edge-Mask, which uses counterfactual intervention to obtain causal relations, is disease-independent and can adapt to scarce data.

Node-Mask and Edge-Mask
The effect of the Edge-Mask is included in that of the Node-Mask: masking the feature of a node inevitably reduces the flow of information on all of its connected edges. Compared to the Node-Mask, the Edge-Mask is therefore a finer-grained counterfactual intervention. With the Node-Mask, the flow of multi-granularity information between nodes is truncated wholesale: for example, when a clause node is masked, the entity features belonging to it are truncated together. Therefore, the Node-Mask performs worse than the Edge-Mask.

Visual Presentation of Results

Figure 4 is an example randomly selected from the test set. In this graph, each node represents a clause that contains the entities used to describe the symptoms of the disease, and the edges represent the connections between them. All the aforementioned features constitute a hierarchical supporting graph that provides effective help for doctors' diagnoses. As we can see, our model successfully extracted the patient's history of cancer, surgery, and chemotherapy, which clearly indicates that the patient is suffering from secondary lymphedema. This shows that the supporting facts we extracted are effective. We provide a comparison of the extraction results of different models in Appendix A.3. Figure 5 shows an example of the visualization of the Edge-Mask matrix. It can be seen that most of the edges have been masked, and only the edges from two key feature nodes have been preserved. This proves that our method can effectively filter noisy features and extract supporting facts.

Related Work

Document Modeling Previous work proposes hierarchical models that combine document structure and entity structure. We used a multi-granularity hierarchical graph network to model the EMR documents.

Counterfactual Reasoning Providing explanations based on counterfactual reasoning has a long history (Lewis, 1973; Woodward, 2005). In recent years, (Oberst and Sontag, 2019) introduced a structural causal model to generate counterfactual trajectories in a synthetic environment of sepsis management. (Lin et al., 2020) presents a patient simulator to generate informative counterfactual responses in disease diagnosis. (Lenis et al., 2020) identifies salient regions of a medical image by measuring the effect of local counterfactual image perturbations. We use counterfactual reasoning in EMRs to provide explanations for diagnosis.

Conclusion
In this paper, we propose a counterfactual multi-granularity graph supporting facts extraction (CMGE) method for irregular EMRs that requires no external medical knowledge base. Based on this model, we can correctly diagnose lymphedema. The proposed counterfactual-based approach can discover the causal relationship between symptoms and diagnosis. The results of supporting fact extraction show that our method is highly robust and maintains accuracy across various diseases, even in categories with few data resources. In the future, we will introduce multi-modal data such as radiology images into the model to discover more medical knowledge from EMRs.

A.1 Implementation Details
To implement our model, we use the tokenizer of BERT (Devlin et al., 2019) to obtain the tokens of the EMR text sequence. For the BiGRU (Cho et al., 2014) encoder, the embedding dimension is 300 and the hidden dimension is 256 with two layers. In graph reasoning, we use 2 multi-head GAT layers with 8 heads. The input dimension of the GAT is 1024 and the output dimension is 128. For counterfactual reasoning, we fix the parameters of the diagnostic model and only optimize the matrix of the Edge-Mask or Node-Mask. The hyper-parameters were tuned manually for diagnostic accuracy: except for λ_5, all the hyper-parameters of the loss function are set to 1, and λ_5 is set to 0.1 for the Node-Mask and 0.005 for the Edge-Mask. We trained the diagnostic model for 20 epochs and ran counterfactual training for 200 epochs on each sample. Our model has 16.7M parameters in total and can easily be trained and run on a Titan XP. Since we do not use parallel processing, counterfactual reasoning is the most time-consuming step, taking about 7 seconds per instance.

A.2 PostKS
We adapted the PostKS (Lian et al., 2019) model to this task. In order to improve the accuracy of supporting fact extraction, in addition to the diagnosis label we use some extra diagnostic descriptions related to the disease, shown under "Diagnosis" in Figure 1. Figure 6 shows an overview of the variational inference model. All the clauses and the diagnosis are encoded by a BiGRU, and we take the last hidden state h_n as the feature sequence C = [c_1, c_2, ..., c_n] for the clauses and the feature d for the diagnosis. Based on these, we calculate the posterior distribution as

p(c = c_i | C, d) = exp(c_i · d) / Σ_{j=1}^{N} exp(c_j · d),

where N is the number of clauses and c_i is the feature of the i-th clause. For the prior distribution, we use self-attention between clauses followed by average pooling to obtain the self-attention weight p(c = c_i | C) of each clause, and optimize

L = λ_d L_diagnosis + λ_k KL(p(c | C, d) ‖ p(c | C)),

where λ_d and λ_k are hyper-parameters and L_diagnosis is the cross-entropy loss for the diagnosis result.

A.3 Comparison of Extraction Results

The posterior information of PostKS has worked, but it cannot provide an explanation for the diagnosis. The Edge-Mask discovered "swelling immediately after surgery", which is the best support for the diagnosis of secondary lymphedema, indicating its effectiveness. For chylous reflux lymphedema, only Self-Attention and Edge-Mask find critical information such as "milky white liquid". Compared with Self-Attention, the Edge-Mask gives a more complete description of the supporting facts.

A.4 Evaluation on benchmark data
We did not find any benchmark for the task of directly extracting supporting facts from EMRs without external knowledge. To better demonstrate the performance of our model, we ran experiments on the English EMR benchmark "MIMIC-III-50" (Mullenbach et al., 2018) for the task of assigning ICD codes to EMRs. This task assigns multiple codes to each EMR from 50 labels; compared to our diagnosis over four EMR types, it is clearly harder.
The key module of our model in diagnosis is the multi-granularity hierarchical graph (MHG) document modeling method based on clauses and entities. In the experiments, we connect our multi-granularity hierarchical graph network module after BiGRU (Mullenbach et al., 2018) and MultiResCNN (Li and Yu, 2020) to further encode the EMRs. Since clause categories are not labeled in this dataset, we only used the entity labels obtained by NER and did not constrain the clause nodes.
The results in Table 5 show that our module achieves effective performance improvements on all metrics on top of both MultiResCNN and BiGRU. With our module, BiGRU even surpasses MultiResCNN on some metrics, despite the large original gap in performance between them. This experimental result proves the effectiveness of our model in diagnosis on benchmark data.

B.1 Data Collection
We collected data from real historical electronic medical records (EMRs) of the department of lymphedema. Each record contains the patient's self-complaint, history of present illness, past illness, personal history, family history, physical examination, and specialist examination. In order to protect the privacy of patients, we deleted all content related to personal information. For the experiment, we extracted three types of EMRs from all EMRs: primary lymphedema, secondary lymphedema, and chylous reflux lymphedema. In addition, we added 25% confounding EMRs of patients who were hospitalized in the department of lymphedema but whose final diagnosis was another disease. The statistics of the four diseases in the final dataset are shown in Table 1.
Although the EMR distinguishes information such as the history of present illness and past illness, the content of each part is still irregular text, and most existing EMRs are not standardized, so we treat the EMR as unstructured text and connect all the pieces together. Since our EMRs contain a complete physical examination and life history, most of the symptomatic entities present are negative and unrelated to the diagnosis, which introduces a lot of noise into diagnosis and explanation. This is also an important reason why we cannot use entities alone as supporting facts. We do not have permission from the hospitals to publish the Chinese EMR data, since they are protected by law, so we can only provide two examples in Table 7 together with the extraction results of each model.

B.2 Clustering Label
The detailed description of the clustering labels is shown in Table 6. They are derived from the textbook "Diagnostics" (Xuehong Wan, 2013), a scientific classification method in medicine. In our experimental data, we manually annotated the corresponding labels on our own platform. These labels are coarse and disease-independent, and categories may overlap, because they are only used to cluster clauses or entities and do not require a high degree of accuracy. Therefore, we can easily annotate a small part of the dataset manually and train a BERT-based (Devlin et al., 2019) text classifier to classify the remaining data. The experimental results show that the classifier trained on 30% of the data achieves an annotation accuracy of 80.76% on clauses and 97.13% on entities on the remaining data.

B.3 Diagnostic Criterion
In this section, we briefly introduce the diagnostic criteria for the three types of lymphedema. Different hospitals, or even departments, have their own ways of describing the recognized diagnostic criteria, which is reflected in the EMRs. Our diagnostic criteria are therefore manually annotated by analyzing the EMRs and the diagnosis guide (Levine, 2017). They are re-descriptions of the diagnostic criteria in the guide using phrases from the EMRs.
Secondary Lymphedema For secondary lymphedema, the most important diagnostic criterion is whether the patient's lymphatic vessels have been damaged. Therefore, if there are descriptions related to tumors, surgery, radiotherapy, etc. in the medical records, it is likely to be secondary lymphedema.
Primary Lymphedema For primary lymphedema, the main basis for diagnosis is whether the patient's lymphatic vessels have congenital dysplasia or edema. Since there are few descriptions of this basis in the medical records, we also take "edema appearing many years ago without an inducing cause (多年前无诱因出现水肿)" as a basis for correct extraction in the evaluation.
Chylous Reflux Lymphedema For chylous reflux lymphedema, the key to the diagnosis is whether the patient has chylous reflux. Therefore, if there are descriptions related to milky white fluid, effusion reflux, etc. in the medical record, it is roughly considered to be chylous reflux lymphedema.