Towards Interpretable Clinical Diagnosis with Bayesian Network Ensembles Stacked on Entity-Aware CNNs

Automatic text-based diagnosis remains a challenging task for clinical use because it requires an appropriate balance between accuracy and interpretability. In this paper, we propose a novel framework that stacks Bayesian Network Ensembles on top of Entity-Aware Convolutional Neural Networks (CNN) to build an accurate yet interpretable diagnosis system. The proposed framework takes advantage of the high accuracy and generality of deep neural networks as well as the interpretability of Bayesian Networks, which is critical for AI-empowered healthcare. The evaluation, conducted on real Electronic Medical Record (EMR) documents from hospitals and annotated by professional doctors, shows that the proposed framework outperforms previous automatic diagnosis methods in accuracy and that its diagnosis explanations are reasonable.


Introduction
The automatic diagnosis of diseases has drawn increasing attention from both research communities and industry in recent years due to the advancement of artificial intelligence (AI) (Liang et al., 2019; Esteva et al., 2019; Liu et al., 2018). As reported in (Anandan et al., 2019), "AI-enabled analysis software is helping to guide doctors and other health-care workers through diagnostic processes and questioning to arrive at treatment decisions with greater speed and accuracy." Although image-based diagnosis has been well studied using PACS (Picture Archiving and Communication Systems) data (Litjens et al., 2017), text-based diagnosis for Clinical Decision Support (CDS) (Berner, 2007) remains difficult due to limited access to reliable clinical corpora and the difficulty of balancing accuracy and interpretability.
Table 1: A real outpatient EMR document from a hospital (TR and Diagnosis rows).
TR: 血常规示白细胞计数升高, WBC 12.5 × 10^9/L. C反应蛋白正常. (The blood test showed elevated white blood cell count, WBC 12.5 × 10^9/L. The C-reactive protein is normal.)
Diagnosis: 急性扁桃体炎 (Acute tonsillitis)
There have been attempts to study automatic text-based diagnosis with Electronic Medical Record (EMR) documents integrated in the Hospital Information System (Mullenbach et al., 2018; Yang et al., 2018; Girardi et al., 2018). Basically, an EMR document is written by a doctor and consists of several sections that describe the illness of the patient. Besides the patient's basic information such as name, age and gender, an EMR document contains Chief Complaint (CC), History of Present Illness (HPI), Physical Examination (PE), Test Reports (TR, e.g. lab test reports and PACS reports), Diagnosis, etc. Table 1 shows a real outpatient EMR document from a hospital. These sections describe the patient's medical situation from different aspects: CC summarizes the patient's main discomforts of this visit. HPI extends CC by adding more details and findings from the conversation between doctor and patient.
PE shows the findings by physically examining the patient's body, e.g. by palpation or inspection. TR are the objective findings from the lab test reports or the PACS reports. In the hospitals, the doctors will make a comprehensive analysis mainly based on CC, HPI, PE, TR and the basic information, and make a diagnosis. However, it is very hard for computers to automatically understand all the diverse sections and capture the key information before making an appropriate diagnosis. Besides, an inpatient EMR document is similar to that in Table 1 except that HPI, PE and TR are usually more lengthy and detailed. The framework proposed in this work can be applied on both the outpatient and the inpatient EMR documents and we will not distinguish them later.
In this study, we bring forward a novel framework of automatic diagnosis with EMR documents for CDS. 1 Specifically, we propose to predict the main diagnosis based on the patient's current illness. Different from the previous works (Yang et al., 2018; Sha and Wang, 2017; Li et al., 2017; Girardi et al., 2018; Mullenbach et al., 2018) that rely solely on end-to-end neural models, we propose to stack Bayesian Network (BN) ensembles on top of Entity-aware Convolutional Neural Networks (ECNN) in automatic diagnosis, where ECNN improves the accuracy of the prediction and the BN ensembles explain the prediction. The proposed framework attempts to bring some interpretability to the predictions by incorporating the knowledge encoded in the BN ensembles. The main contributions of this work are as follows:
• We propose a novel framework that stacks Bayesian network ensembles on top of entity-aware convolutional neural networks to bring interpretability into automatic diagnosis without compromising the accuracy of deep learning. Interpretability is very important in AI-empowered healthcare studies.
• We bring forward three variants of Bayesian Networks for disease inference that provide interpretability. Moreover, we ensemble these BNs for more robust diagnosis results.
• The evaluation conducted on real EMR documents from hospitals shows that the proposed framework outperforms the previous automatic diagnosis methods with EMRs. The proposed framework has been used as a critical component in the clinical decision support system developed by Baidu, which assists physicians in diagnosis in hundreds of primary healthcare facilities in China.
• We publish the Chinese medical knowledge graph of Gynaecology and Respiration used in our Bayesian Network for disease inference with this paper for reproducibility. The data set can be downloaded from Github. 2
1 Different from Electronic Health Record (EHR), where the illnesses from a patient's multiple visits are combined, an EMR only contains the patient's illness of one particular visit. EMRs are more commonly used in the hospitals in China.

Related Work
Due to the rapid advancement of machine intelligence, text-based automatic diagnosis has become one of the most important applications of machine learning and natural language processing in recent years (Anandan et al., 2019; Koleck et al., 2019). Different from diagnosis or question answering on the Web, diagnosis for CDS takes place in hospitals and clinics, and the predictive algorithm is integrated into the Hospital Information System to assist doctors and physicians in the diagnosis. Liang et al. (2019) propose a top-down hierarchical classification method for diagnosing pediatric diseases. From the root to the leaf, each level of the diagnostic hierarchy is a logistic regression model that classifies labels from coarse to fine granularity, e.g. from organ systems down to respiratory systems and to upper respiratory systems. This method requires heavy manual annotation of training samples at different levels of the hierarchy. Zhang et al. (2017) combine the variational auto-encoder and the variational recurrent neural network to make a diagnosis based on laboratory test data. However, laboratory test data are not the only resources considered in this paper. Prakash et al. (2017) introduce memory networks into diagnostic inference based on free-text clinical records with an external knowledge source from Wikipedia. Sha and Wang (2017) propose a hierarchical GRU-based neural network to predict clinical outcomes based on the medical code sequences of the patient's previous visits. It deals with the sequential disease forecasting problem with EHR data rather than the diagnosis problem for the current visit with an EMR document. Similarly, Choi et al. (2016a) study an RNN-based model for clinical event prediction. Baumel et al. (2017) investigate the multi-label classification problem for discharge summaries of EHR with a hierarchical attention-bidirectional GRU.
The works most similar to ours are (Yang et al., 2018; Li et al., 2017), which train an end-to-end convolutional network model to predict the diagnosis based on EMRs. Besides, Girardi et al. (2018) improve the CNN model with the attention mechanism in automatic diagnosis. Moreover, Mullenbach et al. (2018) study a label-wise attention model to further improve the accuracy of diagnosis at the cost of more computation time. Choi et al. (2016b) propose a reverse time attention mechanism for interpretable healthcare studies.
Different from the previous studies, the novelty of this paper is to bring interpretability into automatic diagnosis by stacking the ensembles of Bayesian networks on top of the entity-aware convolutional neural networks.

The Proposed Framework
Automatic diagnosis can be formally considered as a classification problem where the proposed method outputs a probability distribution Pr(d|S) over all diseases d ∈ D based on the illness description S. In this study, S corresponds to the patient's EMR document, i.e. S consists of several sections of texts and some structured data like age, gender and medical department.
We bring forward a new framework that combines black-box deep learning and white-box knowledge inference to diagnose disease with EMR documents. Figure 1 shows the architecture of the proposed framework. Firstly, the medical entities are extracted from the EMR contents. Then, the EMR document is fed into the entity-aware convolutional networks to generate the disease prior probability. Next, the Bayesian network ensembles perform disease inference based on the prior probability and the probabilistic graphical models (PGMs), and their outputs are ensembled into the final predictions.

Named Entity Recognition
Before introducing the convolutional and the Bayesian networks, we first discuss a basic component of this framework: named entity recognition (NER). NER extracts the entities as well as their types from text sentences, which is very important for capturing the key information of the texts. In our experiments, we used Baidu's enterprise Chinese medical NER system, which integrates advanced NER models (Jia et al., 2019) and extracts entities of symptoms, vital signs, diseases and test report findings.
The F1 score of the NER system we use is 91% in a separate evaluation conducted on 1000 deduplicated sentences from real EMR documents by 10 certified physicians in China. 3 Meanwhile, the polarity (positive (+), negative (-) or unknown (?)) of entities is also recognized. The polarity in this work objectively means the presence or absence of a finding in a given EMR. It is recognized by combining a rule-based method that uses a vocabulary of negative Chinese words with a polarity detection model. Table 2 shows the NER results of the EMR in Table 1. Please note that the disease (acute tonsillitis) from the diagnosis section is the ground-truth label to predict and it is not included in the input to the predictive model in the evaluation.
In the offline processing of the EMR corpus, we preserved the Top-K most frequent entities of all types as the entity vocabulary. In later experiments, we empirically set K = 10,000. The entity vocabulary is used to construct the one-hot feature for each EMR document, which will be introduced later. Since NER is not the focus of this study, the readers can choose the public Chinese NER API 4 from Baidu for fast experiments. We will focus on the major contributions of the proposed framework in the next sections.

ECNN for Prior Generation
The convolutional networks take as input the list of texts w.r.t. the sections of an EMR document as well as the medical entities extracted from them, and output the probability distribution of the diseases. To distinguish it from the previous CNN models without medical entities (Yang et al., 2018; Li et al., 2017), we use ECNN to denote the entity-aware CNN model proposed in this paper, where another branch of fully connected layers processes the medical entities and outputs the corresponding feature representation. Let N denote the number of sections (CC, HPI, PE, TR, etc.) selected from the EMR document to construct ECNN. ECNN consists of two parts: (1) N convolutional towers, each of which reads a unique section, and (2) one multi-layer perceptron (MLP) branch that reads a high-dimensional hand-crafted feature.
Similar to the previous CNN method for text classification (Kim, 2014), each convolutional tower processes the input sequence with three kernels of different lengths, resulting in a multi-channel feature output. The three kernels process the input as 3-grams, 4-grams and 5-grams, respectively, and their outputs are concatenated as the output of a convolutional tower. Each kernel in the convolutional networks has 100 filters with a stride of 1. The convolution uses valid padding (i.e. no padding is added to the input) and the output is activated by ReLU.
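As a small illustration of the kernel settings above, the following Python sketch lists the n-gram windows that a kernel of width k visits under valid padding with stride 1, and the resulting feature size of one tower. The function names are hypothetical, and the pooling step is an assumption: the text only says the towers are similar to Kim (2014), so max-over-time pooling (one value per filter) is assumed here.

```python
def valid_windows(tokens, k, stride=1):
    """Return the k-gram windows a width-k kernel sees with valid padding.

    With valid padding nothing is added to the input, so a sequence of
    length L yields L - k + 1 windows at stride 1 (none if L < k).
    """
    return [tokens[i:i + k] for i in range(0, len(tokens) - k + 1, stride)]


def tower_output_dim(kernel_widths=(3, 4, 5), filters_per_kernel=100):
    """Feature size of one convolutional tower.

    Assumption (not stated in the text): max-over-time pooling as in
    Kim (2014), so each filter contributes exactly one value.
    """
    return len(kernel_widths) * filters_per_kernel


# Hypothetical segmented Chief Complaint with 7 tokens.
tokens = ["sore", "throat", "and", "fever", "for", "two", "days"]
window_counts = {k: len(valid_windows(tokens, k)) for k in (3, 4, 5)}
# 5, 4 and 3 windows for the 3-, 4- and 5-gram kernels, respectively.
```

Under these assumptions each tower emits a 300-dimensional vector regardless of the section length, which is what allows sections of very different lengths (e.g. CC versus HPI) to be concatenated later.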
For the input of the MLP, we create the entity vocabulary that consists of the top-K most frequent entities. Then, each EMR document is transformed into a K-dimensional one-hot feature f. That is, if the i-th entity in the entity vocabulary appears as a positive finding in the input EMR, then the i-th dimension of f is set to 1; otherwise, it is set to 0. Moreover, the patient's age and gender are appended to f to get the hand-crafted feature for the MLP. The MLP contains one dense layer with 128 hidden units activated by the sigmoid function.
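The construction of the hand-crafted feature can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, and the encoding of age (divided by 100) and gender (0/1) is an assumption, since the text only states that both are appended to f.

```python
from collections import Counter


def build_entity_vocab(corpus_entities, top_k):
    """Map the top_k most frequent entities in the corpus to indices."""
    counts = Counter(e for doc in corpus_entities for e in doc)
    return {entity: idx
            for idx, (entity, _) in enumerate(counts.most_common(top_k))}


def handcrafted_feature(positive_entities, vocab, age, gender):
    """Build the MLP input: one-hot entity vector plus age and gender."""
    f = [0.0] * len(vocab)
    for entity in positive_entities:
        idx = vocab.get(entity)
        if idx is not None:          # entities outside the vocabulary are ignored
            f[idx] = 1.0
    # Assumption: age is scaled by 1/100 and gender is encoded as 0/1.
    return f + [age / 100.0, 1.0 if gender == "female" else 0.0]


# Toy corpus: entities extracted by NER from three EMR documents.
corpus = [["fever", "cough"],
          ["fever", "sore throat"],
          ["fever", "cough", "headache"]]
vocab = build_entity_vocab(corpus, top_k=2)
feature = handcrafted_feature({"cough", "headache"}, vocab,
                              age=30, gender="female")
```

Only positive findings set a dimension to 1, matching the description above; negative or unknown findings leave the vector unchanged.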
ECNN is trained with the Adam optimizer (learning rate 0.001) for 20 epochs with a batch size of 32. The output of each convolutional tower and the output of the MLP are concatenated before passing through the dropout and the softmax layer. Similar to Kim (2014), the dropout rate is empirically set to 0.5. ECNN outputs a |D|-dimensional feature as the disease priors for the inference in the next step, where D is the disease set.
In ECNN, the CNNs are supposed to capture the sequential signals in the section texts and the MLP is supposed to encode the feature of the critical entities. By jointly modeling with the CNNs and the MLP, the proposed ECNN is expected to outperform either of them alone.

Bayesian Network Ensembles
Although ECNN also outputs a probability distribution over all diseases, the result is not interpretable due to its end-to-end nature. However, the interpretability is very important in the CDS to explain how the diagnosis is generated by machines. Thus, we propose the Bayesian network ensembles on top of the output of ECNN to explicitly infer disease with PGMs. There are three steps:

Relation Extraction
We extract the relations between disease and other types of entities (disease, finding) where finding can be symptom, vital sign, test report finding, etc.
The rest of this paper uses finding to denote any type of entity other than disease. Relation extraction combines (disease, finding) co-occurrence mining with a deep extraction model applied to the EMR documents and the textbooks 5 . Then, the pairs whose co-occurrence counts exceed a support threshold (e.g. 5) are preserved. The extracted relations are reviewed by 10 certified physicians. Invalid relations, which result from issues such as incorrect recognition of entities or polarities by NER, or a symptom caused by the secondary diagnosis but incorrectly paired with the first diagnosis, are removed before the relations are added to the medical knowledge graph. Therefore, the relation (disease, finding) in the medical knowledge graph can, to some extent, be interpreted as: disease causes finding.
In our study, the pairs are mined from 275,797 EMR documents of two medical departments (Gynaecology and Respiration). On average, each disease of Gynaecology in our experiments is associated with 24 findings and each disease of Respiration with 42. For Gynaecology, there are 33 diseases, 305 symptoms, 143 vital signs and 25 test report findings in the PGMs. For Respiration, there are 21 diseases, 263 symptoms, 187 vital signs and 31 test report findings in the PGMs.
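The co-occurrence mining step with a support threshold can be sketched as follows. This is a simplified illustration under stated assumptions: `mine_relations` is a hypothetical name, each finding is counted at most once per document, and the deep extraction model and the physician review are not modeled.

```python
from collections import Counter


def mine_relations(records, min_support=5):
    """Count (disease, finding) co-occurrences and keep frequent pairs.

    records: iterable of (main_diagnosis, positive_findings) per EMR.
    A pair is kept only if its count is larger than min_support.
    """
    counts = Counter()
    for disease, findings in records:
        for finding in set(findings):   # assumption: once per document
            counts[(disease, finding)] += 1
    return {pair: n for pair, n in counts.items() if n > min_support}


# Toy records: six documents share two findings, two mention a rare one.
records = ([("URI", ["pharyngalgia", "fever"])] * 6
           + [("URI", ["hemoptysis"])] * 2)
relations = mine_relations(records, min_support=5)
```

The surviving counts are the quantities n(i, j) used by the relation weights in the next subsection.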

Relation Weights Estimation
We experiment with six classical text features as the relation weights in this study.
(1) Occurrence. The weight of finding i given disease j is:
$$w(i; j) = n(i, j) \tag{1}$$
where n(i, j) is the number of co-occurrences of finding i and disease j. w(i; j) is computed by the type of findings.
(2) TF-IDF Feature. Similar to the TF-IDF feature in information retrieval, the weight of finding i given disease j is:
$$w(i; j) = n(i, j) \cdot \log \frac{|D|}{n_i} \tag{2}$$
where n_i is the number of diseases whose EMR documents contain finding i.
(3) TFC Feature. The TFC feature (Salton and Buckley, 1988) is a variant of TF-IDF and it estimates the weight of finding i given disease j as:
$$w(i; j) = \frac{n(i, j) \cdot \log \frac{|D|}{n_i}}{\sqrt{\sum_{k} \left[ n(k, j) \cdot \log \frac{|D|}{n_k} \right]^2}} \tag{3}$$
(4) TF-IWF Feature. The Term-Frequency Inverse-Word-Frequency (TF-IWF) feature (Basili et al., 1999) estimates the weight of finding i given disease j as:
$$w(i; j) = n(i, j) \cdot \left[ \log \frac{\sum_{k} t_k}{t_i} \right]^2 \tag{4}$$
where t_i represents the number of occurrences of word i in the whole training corpus.
(5) CHI Feature. The CHI feature (χ² test) measures how much a term is associated with a class from a statistical view. The CHI feature of finding i given disease j is (Yang and Pedersen, 1997):
$$\chi^2(i, j) = \frac{N \cdot (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)} \tag{5}$$
where N, A, B, C and D are the number of all documents, the number of documents containing finding i and belonging to disease j, the number of documents containing i but not belonging to j, the number of documents belonging to j but not containing i, and the number of documents not containing i and not belonging to j.
(6) Mutual Information. This feature assumes that the stronger the association between a finding and a disease, the higher their mutual information will be. Using the same notation as in the CHI feature, this feature is defined as:
$$MI(i, j) = \log \frac{A \cdot N}{(A + C)(A + B)} \tag{6}$$
5 The undergraduate teaching materials in most of the medical schools in China, authorized by the publisher.
The above features are normalized by disease before applying to the diagnosis inference. By default, the average of the six features is used as the connection weight.
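To make the weighting and averaging step concrete, the sketch below computes a few of the features from co-occurrence counts, normalizes each one per disease, and averages them. It is illustrative only: the formulas are the standard textbook forms of TF-IDF, χ² and mutual information from the cited references and may differ in detail from the definitions used in the experiments; "normalized by disease" is assumed to mean dividing by the sum of a disease's weights; and only two of the six features are averaged to keep the example short. All function names are hypothetical.

```python
import math


def tf_idf(n, num_diseases):
    """n[disease][finding] = n(i, j); returns TF-IDF weights per disease."""
    df = {}                                  # n_i: diseases containing finding i
    for findings in n.values():
        for f in findings:
            df[f] = df.get(f, 0) + 1
    return {d: {f: c * math.log(num_diseases / df[f]) for f, c in fs.items()}
            for d, fs in n.items()}


def chi_square(A, B, C, D):
    """Chi-square statistic from the 2x2 contingency counts A, B, C, D."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0


def mutual_information(A, B, C, D):
    """Pointwise mutual information estimate (Yang and Pedersen, 1997)."""
    N = A + B + C + D
    if A == 0:
        return 0.0                           # sketch choice for log(0)
    return math.log(A * N / ((A + C) * (A + B)))


def normalize_by_disease(w):
    """Assumption: divide each weight by the sum over the disease's findings."""
    out = {}
    for d, fs in w.items():
        total = sum(fs.values())
        out[d] = {f: (v / total if total > 0 else 0.0) for f, v in fs.items()}
    return out


def average_weights(tables):
    """Average several normalized weight tables with identical keys."""
    return {d: {f: sum(t[d][f] for t in tables) / len(tables)
                for f in tables[0][d]}
            for d in tables[0]}


n = {"URI": {"pharyngalgia": 8, "cough": 2}, "pneumonia": {"cough": 6}}
occurrence = normalize_by_disease(n)
tfidf = normalize_by_disease(tf_idf(n, num_diseases=2))
weights = average_weights([occurrence, tfidf])
```

In this toy example cough appears under both diseases, so its TF-IDF weight is zero and its averaged weight comes from the occurrence feature alone, while pharyngalgia, which is specific to URI, keeps a high weight.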

Diagnosis Inference
We propose the Bayesian network ensembles for the diagnosis inference. Specifically, a group of PGMs with the extracted relations and weights are ensembled towards the final predictions.
Firstly, multiple bipartite graphs between disease nodes and each type of finding nodes are derived from the medical knowledge graph. For M types of findings, there will be M bipartite graphs.
In later experiments, M = 3, i.e. (disease, symptom), (disease, vital sign) and (disease, test result finding). Based on the findings extracted from EMR document, each bipartite graph can be independently used to infer the disease distribution.
For Bayesian inference, we compute the posterior probability of diseases given the findings in the EMR document extracted by NER:
$$\Pr(d \mid F^+, F^-) = \frac{\Pr(d, F^+, F^-)}{\Pr(F^+, F^-)} \tag{7}$$
where $F^+$ and $F^-$ are the sets of the positive and the negative findings in the given EMR document, respectively. Following Eq. (7), it is straightforward to get $\Pr(d \mid F^+_{sym}, F^-_{sym})$, $\Pr(d \mid F^+_{sign}, F^-_{sign})$ and $\Pr(d \mid F^+_{test}, F^-_{test})$ w.r.t. the predictions based on symptoms alone, vital signs alone and test report findings alone. To compute the joint probabilities $\Pr(d, F^+, F^-)$ and $\Pr(F^+, F^-)$, we refer the readers to the QuickScore method (Heckerman, 1990) and the deduction therein. To speed up computation when a disease is associated with too many positive findings, the variational method on the PGMs is applied (Jordan et al., 1999).
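The posterior computation on one bipartite graph can be illustrated with a deliberately simplified Python sketch. It is not the QuickScore algorithm: it assumes that exactly one disease is present (so the posterior is a distribution over diseases), uses a noisy-OR link with a small leak probability, and treats the relation weight of finding f under disease d as the probability that d causes f. The function name, the `leak` parameter and the example weights are all hypothetical.

```python
def noisy_or_posterior(prior, weights, positive, negative, leak=1e-3):
    """Posterior over diseases under a single-disease noisy-OR model.

    prior:    {disease: prior probability}, e.g. the ECNN output
    weights:  {disease: {finding: probability that disease causes finding}}
    positive: findings present in the EMR; negative: findings absent
    leak:     probability that a finding appears without a modeled cause
    """
    scores = {}
    for d, p in prior.items():
        w = weights.get(d, {})
        likelihood = 1.0
        for f in positive:   # finding present: 1 - Pr(finding stays off)
            likelihood *= 1.0 - (1.0 - leak) * (1.0 - w.get(f, 0.0))
        for f in negative:   # finding absent: neither leak nor d turned it on
            likelihood *= (1.0 - leak) * (1.0 - w.get(f, 0.0))
        scores[d] = p * likelihood
    total = sum(scores.values())
    return {d: s / total for d, s in scores.items()}


prior = {"URI": 1 / 3, "pneumonia": 1 / 3, "phthisis": 1 / 3}
weights = {"URI": {"pharyngalgia": 0.8},
           "pneumonia": {"cough": 0.7},
           "phthisis": {"hemoptysis": 0.6}}
posterior = noisy_or_posterior(prior, weights,
                               positive={"pharyngalgia"}, negative=set())
```

With only pharyngalgia observed, URI receives almost all of the probability mass, while diseases that are not connected to the finding fall back to the small leak-driven value. The full QuickScore inference additionally allows several diseases to be present at once.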
Next, we assemble these bipartite graphs in different ways to get three variants of PGMs (Fig. 1).
(1) Parallel. This method independently performs inference with each type of finding and averages the results:
$$\Pr(d \mid F^+, F^-) = \frac{1}{3} \left[ \Pr(d \mid F^+_{sym}, F^-_{sym}) + \Pr(d \mid F^+_{sign}, F^-_{sign}) + \Pr(d \mid F^+_{test}, F^-_{test}) \right]$$
Parallel assumes that the ways to diagnose disease are different using different types of entities, and their predictions can complement each other. An extension of Parallel is to perform a weighted sum of the three predictions. For simplicity, we experiment with equal weights in this paper.
(2) Universal. This method mixes all types of findings together into a single network:
$$\Pr(d \mid F^+, F^-), \quad F^+ = F^+_{sym} \cup F^+_{sign} \cup F^+_{test}, \quad F^- = F^-_{sym} \cup F^-_{sign} \cup F^-_{test}$$
It means that Universal does not distinguish the types of entities and performs type-free Bayesian inference. Compared with the other two PGM variants, the connections between diseases and findings in Universal are much denser. It assumes that the prediction benefits from the joint inference by seeing more findings of multiple types at the same time.
(3) Cascade. This method constructs multi-layer Bayesian networks with finding types as layers and uses the output of the previous layer as the prior probability for the current layer:
$$\Pr(d_{sym}) = \Pr(d \mid F^+_{sym}, F^-_{sym}), \quad d \sim \Pr(d_{CNN})$$
$$\Pr(d_{sign}) = \Pr(d \mid F^+_{sign}, F^-_{sign}), \quad d \sim \Pr(d_{sym})$$
$$\Pr(d_{test}) = \Pr(d \mid F^+_{test}, F^-_{test}), \quad d \sim \Pr(d_{sign})$$
where $\Pr(d_{CNN})$ is the disease probability distribution computed by the convolutional networks in Sec. 3.2 and $d \sim \Pr(d_x)$ means that variable d satisfies the prior probability distribution $\Pr(d_x)$. Cascade first infers disease with symptoms alone and uses the disease probability from ECNN as priors. Then, it infers disease with vital signs alone and uses the disease probability from the symptom-based inference as priors. Finally, it infers disease with test report findings alone and uses the disease probability from the previous output as priors.
We present the cascade approach in this order because it shows the best results compared to other orders in our experiments. Cascade assumes that each type of entity can be used to refine the previous predictions by incorporating additional information.
The outputs of the above three PGMs are ensembled, e.g. by a weighted sum, to produce the final predictions. In all, the proposed framework takes the raw EMR document and the NER results as input, and outputs the diagnosis predictions.
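The three ways of assembling the bipartite graphs, and their final ensemble, can be sketched as follows. The sketch makes the same simplifying assumptions as a single-disease noisy-OR model with a leak term, which stands in for the QuickScore inference; the function names, the graph layout (`graphs[type][disease][finding]`) and all numbers are hypothetical, and equal weights are used for the ensemble.

```python
def posterior(prior, weights, pos, neg, leak=1e-3):
    """Single-disease noisy-OR posterior (a stand-in for QuickScore)."""
    scores = {}
    for d, p in prior.items():
        w = weights.get(d, {})
        like = 1.0
        for f in pos:
            like *= 1.0 - (1.0 - leak) * (1.0 - w.get(f, 0.0))
        for f in neg:
            like *= (1.0 - leak) * (1.0 - w.get(f, 0.0))
        scores[d] = p * like
    total = sum(scores.values())
    return {d: s / total for d, s in scores.items()}


def parallel(prior, graphs, findings):
    """Infer with each finding type independently, then average."""
    posts = [posterior(prior, graphs[t], *findings[t]) for t in graphs]
    return {d: sum(p[d] for p in posts) / len(posts) for d in prior}


def universal(prior, graphs, findings):
    """Merge all finding types into one graph and infer once."""
    merged, pos, neg = {}, set(), set()
    for t, g in graphs.items():
        for d, w in g.items():
            merged.setdefault(d, {}).update(w)
        pos |= set(findings[t][0])
        neg |= set(findings[t][1])
    return posterior(prior, merged, pos, neg)


def cascade(prior, graphs, findings, order=("sym", "sign", "test")):
    """Use the posterior of each layer as the prior of the next layer."""
    dist = prior
    for t in order:
        dist = posterior(dist, graphs[t], *findings[t])
    return dist


def ensemble(dists):
    """Equal-weight average of several disease distributions."""
    return {d: sum(x[d] for x in dists) / len(dists) for d in dists[0]}


ecnn_prior = {"URI": 0.5, "pneumonia": 0.5}        # hypothetical ECNN output
graphs = {"sym": {"URI": {"pharyngalgia": 0.8}, "pneumonia": {"cough": 0.7}},
          "sign": {"URI": {"tonsil swelling": 0.6}, "pneumonia": {"rales": 0.7}},
          "test": {"URI": {}, "pneumonia": {"lung shadow": 0.9}}}
findings = {"sym": ({"pharyngalgia"}, set()),      # (positive, negative)
            "sign": ({"tonsil swelling"}, set()),
            "test": (set(), {"lung shadow"})}
p_par = parallel(ecnn_prior, graphs, findings)
p_uni = universal(ecnn_prior, graphs, findings)
p_cas = cascade(ecnn_prior, graphs, findings)
final = ensemble([p_par, p_uni, p_cas])
```

One caveat of the simplification: under the single-disease assumption, Cascade and Universal multiply the same likelihood terms and therefore coincide up to floating-point error. With an inference that allows several diseases to be present at once, the variants would generally produce different distributions.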
Although we experiment with three types of entities in this paper, the proposed Bayesian network ensemble method is not limited to these types of entities. It is easy to add more entity types in the proposed method when applicable.

The Interpretability of BN Ensembles
One of the major contributions of this work is to bring interpretability into automatic diagnosis by stacking the Bayesian network ensembles on top of the convolutional networks. We illustrate how the predictions are explained, i.e. interpretability, by BN with Fig. 2. We use the symptom-based bipartite graph for simplicity; the other types of entities explain the predictions in the same way.
In Fig. 2, if only pharyngalgia is extracted from a patient's EMR, then upper respiratory infection (URI) will be predicted with high probability but the probability of pneumonia and phthisis will be set to the minimum because neither of them is likely to cause pharyngalgia based on their co-occurrences in the corpus. The proposed method can explain the prediction of URI with the symptom pharyngalgia and their co-occurrence count, in addition to the prediction probability.
Figure 2: The example of the interpretability of Bayesian network. The connection from disease d to symptom s represents that d has some probability to cause s to be present. If d is diagnosed, the detected symptoms from EMR that are connected with d can be used to explain the diagnosis.
If pharyngalgia and hemoptysis are both extracted from a patient's EMR, then URI as well as phthisis will be predicted with some positive probability (their rankings depend on both their prior probability and their connection weights to pharyngalgia and hemoptysis), but pneumonia will be predicted with the minimum probability. This is because the noisy-OR gate is used in the Bayesian inference (Heckerman, 1990). The proposed method explains the prediction of URI with the positive finding of the symptom pharyngalgia and explains the prediction of phthisis with the positive finding of the symptom hemoptysis, as well as their co-occurrences.

Experiments and Results
In this section, we will introduce the data sets we experiment with and the evaluation results.

Data Sets
The proposed framework is evaluated on real EMR documents (mostly admission records). We have collaborated with several top hospitals in China and are authorized to conduct experiments with 275,797 EMR documents of two medical departments for the evaluation (see Table 3). 6
The collected EMR documents are processed as follows. The main diagnosis in each EMR document is extracted as its disease label. Then, we select the top diseases from the collected EMR documents, which results in 33 diseases from Gynaecology (including Salpingitis, Cervical Carcinoma, Endometritis, Fibroid, etc.) and 21 diseases from Respiration (including Upper Respiratory Infection, Chronic Bronchitis, Pneumonia, Asthma, Lung Cancer, etc.) that cover over 90% of all EMR documents. There is a long-tail distribution of EMR documents by disease as shown in Fig. 3, and each of the selected diseases has over 100 EMR documents for training. The other diseases are discarded in the experiments because there are not enough EMR documents to train a trustworthy model. Next, in order to ensure the validity of the disease labels in the test set, we recruit 10 professional physicians to review the labels by evenly sampling EMR documents under each disease. In this way, we collected 606 reviewed EMR documents for Gynaecology and 214 for Respiration as the test set (see the disease distribution in the supplemental files).
The remaining EMR documents are used for training. Since we are not given the identity of the patient w.r.t. each EMR, the training and the testing sets are considered disjoint.
6 Unfortunately, we have not yet obtained permission from the hospitals to make the evaluation data sets public because EMR documents are legally protected by Chinese law and contain much sensitive information about the patients and the doctors. We are currently working with the hospitals on contributing benchmark EMR data sets for automatic diagnosis, but it takes time due to the legal issues. We suggest that the readers focus their attention on the contribution of the novel automatic diagnosis framework in this paper.
In later experiments, we separately report the performance under both departments. It is more important and difficult to distinguish diseases within the same department than that across departments due to the overlapping symptoms, signs and test report findings among the similar diseases.

Experimental Results
We conduct experiments on the collected data sets to evaluate the performance of the framework.

Experimental Settings
In the experiments, we used four CNN towers (N = 4) w.r.t. CC, HPI, PE and TR, and each tower has three channels with kernel length 3, 4 and 5 (representing 3-grams, 4-grams and 5-grams).
We use the Jieba package 7 to perform Chinese word segmentation on the training set and remove the punctuation from the segmentation results. The segmented word corpus is used to train 100-dimensional word embeddings using the Word2Vec (Mikolov et al., 2013) method (window of 5, minimum support of 5) implemented in the gensim package 8 . The top 100,000 most frequent segmented words constitute the word vocabulary in the embedding layer of ECNN. Thus, the size of the embedding layer is (100000, 100).
Besides, the top 10,000 most frequent entities (not segmented words) as well as age and gender are used to construct the one-hot feature for the MLP, which consists of one hidden dense layer (128 sigmoid units) for efficiency. Similar to Kim (2014), the dropout rate is empirically set to 0.5. By default, we use the average of all six relation weights in the experiments. The final predictions are the average of the three PGM variants. ECNN and the PGMs are trained separately offline.
Table 4 shows the Top-k sensitivity (the micro average of the per-disease Top-k sensitivity, commonly used as the accuracy measurement in healthcare studies (Liang et al., 2019)) under the two departments. Sensitivity is usually used in binary classification (mostly yes-or-no outputs). In our multi-class setting, the proposed automatic diagnosis model outputs the probability distribution over K diseases (classes) for a given EMR. Suppose that, among the n_i EMRs of disease d_i, there are l_i cases where d_i is included in the Top-k predictions (ranked by probability). The Top-k sensitivity of the proposed model on disease d_i is l_i / n_i. Furthermore, in the overall evaluation of the proposed model on all diseases, we use the micro average over all classes as the overall Top-k sensitivity:
$$\text{Top-}k\ \text{sensitivity} = \frac{\sum_{i} l_i}{\sum_{i} n_i}$$
7 https://github.com/fxsjy/jieba
8 https://radimrehurek.com/gensim/
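The Top-k sensitivity described above can be computed with a short Python function. This is a minimal sketch with a hypothetical name and toy predictions; it returns both the per-disease values l_i / n_i and the micro average, which pools the hit counts over all diseases.

```python
def topk_sensitivity(y_true, y_prob, k):
    """Per-disease and micro-averaged Top-k sensitivity.

    y_true: list of ground-truth disease labels, one per EMR
    y_prob: list of {disease: probability} predictions, one per EMR
    """
    hits, totals = {}, {}
    for label, probs in zip(y_true, y_prob):
        topk = sorted(probs, key=probs.get, reverse=True)[:k]
        totals[label] = totals.get(label, 0) + 1             # n_i
        hits[label] = hits.get(label, 0) + (label in topk)   # l_i
    per_disease = {d: hits[d] / totals[d] for d in totals}
    micro = sum(hits.values()) / sum(totals.values())
    return per_disease, micro


y_true = ["URI", "URI", "pneumonia"]
y_prob = [{"URI": 0.7, "pneumonia": 0.2, "asthma": 0.1},
          {"URI": 0.2, "pneumonia": 0.5, "asthma": 0.3},
          {"URI": 0.1, "pneumonia": 0.6, "asthma": 0.3}]
per_disease, micro = topk_sensitivity(y_true, y_prob, k=1)
```

In this toy example the second URI document is missed at k = 1, so URI has a sensitivity of 0.5, pneumonia 1.0, and the micro average is 2/3.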

Performance Accuracy
CAML (Mullenbach et al., 2018) performs the label-wise attention on top of a CNN model. CNN (Yang et al., 2018) concatenates CC, HPI and TR together before sending to the multi-channel CNN model. ACNN (Girardi et al., 2018) incorporates the gram-level attention with a CNN model. The empirical settings of hyper parameters are selected from the original papers. Besides, they share the same training set, training epochs, learning rate and batch size with the proposed methods.
Among the proposed methods, PGM-* (-C, -P, -U and -E represent Cascade, Parallel, Universal and Ensemble, respectively) are the methods that rely solely on the Bayesian networks, which use the disease distribution in the training set as the prior probability. ECNN is the proposed method without the BN ensembles. ECNN-PGM-* are the combined methods, and ECNN-PGM-E is the proposed method with ECNN and Bayesian network ensembles in Figure 1. According to the results: (1) Most of the proposed methods ECNN-PGM-* outperform the previous automatic diagnosis methods, which shows the effectiveness of the proposed methods.
(2) ECNN outperforms CNN due to the incorporation of medical entities. Jointly modeling free texts and medical entities improves accuracy compared with modeling either one alone. (3) Stacking Bayesian Networks on top of the neural networks is very likely to further improve the performance, especially with the ensemble of the predictions from multiple PGMs. Based on the analysis, the diagnosis performance of a disease is higher if it shares fewer findings with other diseases or has more specific findings.

Interpretability
The interpretability is reflected in the observed findings in the EMR that connect to the predicted disease in the medical knowledge graph as well as their co-occurrences. We generate the prediction explanation with the following template: The patient is diagnosed with disease d because (s)he is suffering from symptom s_i, and (s)he has the vital sign of v_j, and the lab test (or PACS report) shows (s)he has t_k. Besides, s_i, v_j and t_k have been found on the patients of d for n_i, n_j, n_k times, respectively, in the previous EMR documents that support this diagnosis.
Since the extracted relations in the medical knowledge graph are reviewed by certified physicians, the validity of the explanation is guaranteed from the clinical perspective. We randomly select 50 testing samples per department whose Top-1 diagnosis prediction is correct and generate the explanation for the diagnosis prediction with the above template. The explanation is evaluated by three certified physicians. The evaluation is subjective, but all of them agree that the prediction is well supported by the generated explanation.
Figure 5: The accuracy of ECNN-PGM-E using different types of features (series: Gyn-Top1, Res-Top1, Gyn-Top3, Res-Top3). Gyn and Res represent gynaecology and respiration, respectively. MI and Occ are mutual information and occurrence, respectively.
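Filling the explanation template can be sketched as follows. The sketch is illustrative only: the function name, the data layout and the co-occurrence counts are hypothetical, and only findings that are connected to the predicted disease in the knowledge graph are used in the explanation.

```python
def explain(disease, cooccurrence, observed):
    """Fill the explanation template for a predicted disease.

    cooccurrence: {disease: {finding: co-occurrence count in past EMRs}}
    observed:     {"symptom": [...], "vital sign": [...], "test": [...]}
    Returns None when no observed finding supports the disease.
    """
    linked = cooccurrence.get(disease, {})
    phrases = {"symptom": "(s)he is suffering from ",
               "vital sign": "(s)he has the vital sign of ",
               "test": "the lab test (or PACS report) shows (s)he has "}
    parts, support = [], []
    for kind, prefix in phrases.items():
        found = [f for f in observed.get(kind, []) if f in linked]
        if found:
            parts.append(prefix + ", ".join(found))
            support += [f"{f} ({linked[f]} times)" for f in found]
    if not parts:
        return None
    reason = ", and ".join(parts)
    evidence = ", ".join(support)
    return (f"The patient is diagnosed with {disease} because {reason}. "
            f"In the previous EMR documents of {disease}, "
            f"these findings were recorded: {evidence}.")


# Hypothetical knowledge-graph counts and NER output for one EMR.
graph = {"acute tonsillitis": {"pharyngalgia": 120,
                               "tonsil swelling": 95,
                               "elevated WBC": 60}}
observed = {"symptom": ["pharyngalgia", "headache"],
            "vital sign": ["tonsil swelling"],
            "test": ["elevated WBC"]}
text = explain("acute tonsillitis", graph, observed)
```

Findings that are observed but not linked to the predicted disease (headache in this example) are left out, so every statement in the explanation is backed by a reviewed relation.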

Feature Importance
Figure 5 shows the accuracy using different types of features. In this evaluation, TFC, TF-IDF and the average of all features tend to lead to higher accuracy than the other features, with a Top-3 accuracy of over 88%. In all, the above experiments show that the proposed framework can improve the accuracy of automatic diagnosis and bring reasonable interpretability into the predictions at the same time.

Conclusion
In this paper, we investigate the problem of automatic diagnosis with EMR documents for clinical decision support. We propose a novel framework that stacks the Bayesian Network ensembles on top of the Entity-aware Convolutional Neural Networks. The proposed design brings interpretability into the predictions, which is very important for the AI-empowered healthcare, without compromising the accuracy of convolutional networks. The evaluation conducted on the real EMR documents from hospitals validates the effectiveness of the proposed framework compared to the baselines in automatic diagnosis with EMR.