Leveraging Hierarchical Category Knowledge for Data-Imbalanced Multi-Label Diagnostic Text Understanding

Clinical notes are essential medical documents that record each patient’s symptoms. Each record is typically annotated with medical diagnostic codes, which indicate diagnoses and treatments. This paper focuses on predicting diagnostic codes from the description of the present illness in electronic health records by leveraging domain knowledge. We investigate various losses in a convolutional model to utilize hierarchical category knowledge of diagnostic codes, allowing the model to share semantics across different labels under the same category. The proposed model not only considers external domain knowledge but also addresses the issue of data imbalance. Experiments on the MIMIC3 benchmark show that the proposed methods effectively utilize category knowledge and provide informative cues that improve performance on the top-ranked diagnostic codes beyond the prior state of the art. The investigation and discussion demonstrate the potential of integrating domain knowledge into current machine learning based models and can guide future research directions.


Introduction
Electronic health records (EHR) usually contain clinical notes, which are free-form text generated by clinicians during patient encounters, and a set of metadata diagnosis codes from the International Classification of Diseases (ICD), which represent the diagnoses and procedures in a standard way. ICD codes have a variety of uses, ranging from billing to predictive modeling of the patient state (Choi et al., 2016). Automatic diagnosis prediction has been studied since 1998 (de Lima et al., 1998). Mullenbach et al. (2018) pointed out the main challenges of this task: 1) the large label space, with over 15,000 codes in the ICD-9 taxonomy and over 140,000 codes in the newer ICD-10 taxonomies (Organization et al., 2007), and 2) noisy text, including irrelevant information, misspellings and non-standard abbreviations, and a large medical vocabulary. Several recent works attempted to solve this task with neural models (Shi et al., 2017; Mullenbach et al., 2018).
However, most prior work considered the output labels independently, so the codes with few samples are difficult to learn (Shi et al., 2017). Therefore, Mullenbach et al. (2018) proposed an attentional model that utilizes the textual descriptions of codes to facilitate learning. In addition to the textual definitions of codes, category domain knowledge may provide additional cues that allow the codes under the same category to share parameters, so that the codes with few samples can benefit from it. To effectively utilize the category knowledge from the ICD codes, this paper proposes several refined category losses, incorporates them into convolutional models, and evaluates the performance on both MIMIC-3 (Johnson et al., 2016) and our internal dataset. The experiments on MIMIC show that the proposed knowledge integration model significantly improves over the previous methods and achieves state-of-the-art performance, and the improvement can also be observed on our internal dataset. The idea is similar to the prior work (Singh et al., 2018), which considered the keyword hierarchy for information extraction from medical documents, but our work focuses on leveraging domain knowledge for clinical code prediction. Our contributions are three-fold:
• This paper is the first to leverage external domain knowledge for diagnostic text understanding.
• The paper investigates multiple ways of incorporating the domain knowledge in an end-to-end manner.
• The proposed mechanisms improve all prior models and achieve state-of-the-art performance on the benchmark MIMIC dataset.

Methodologies
Given each clinical record in EHR, the goal is to predict the corresponding diagnostic codes with the help of external hierarchical category information. This task is framed as a multi-label classification problem. The proposed mechanism is built on top of various convolutional models and combines them with the category knowledge. Below we introduce the previously proposed convolutional models, which are used for later comparison in the experiments, and detail the mechanism that leverages hierarchical knowledge.

Convolutional Models
There are various models for sequence-level classification, and this paper focuses on two types of convolutional models for investigation. The models are described as follows. Note that the proposed mechanism is flexible for diverse models.
TextCNN. Let $x_i \in \mathbb{R}^k$ be the $k$-dimensional word embedding corresponding to the $i$-th word in the document, which is represented by the matrix $X = [x_1, x_2, \ldots, x_N]$, where $N$ is the length of the document. TextCNN (Kim, 2014) applies both convolution and max-pooling operations in one dimension along the document length. For instance, a feature $c_i$ is generated from a window of words $x_i, x_{i+1}, \ldots, x_{i+h-1}$, where $h$ is the kernel size of the filter. The pooling operation is then applied over $c = [c_1, c_2, \ldots, c_{N-h+1}]$ to pick the maximum value $\hat{c} = \max(c)$ as the feature corresponding to this filter. We implement the model with kernel sizes 3, 4, and 5 to cover different window sizes of words.
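The convolution and max-over-time pooling described above can be sketched in plain Python. This is only an illustration for a single filter: the function name `conv1d_max_over_time`, the ReLU non-linearity, the bias term, and the toy dimensions are assumptions, not details taken from the paper.

```python
import random


def conv1d_max_over_time(X, filt, bias=0.0):
    """Slide one filter along the document and max-pool the resulting features.

    X    : list of N word embeddings, each a list of k floats
    filt : list of h weight vectors of length k (kernel size h)
    Returns (c_hat, c), where c holds the N - h + 1 window features.
    """
    N, h = len(X), len(filt)
    if N < h:
        raise ValueError("document is shorter than the kernel size")
    features = []
    for i in range(N - h + 1):                 # one feature per window
        s = bias
        for j in range(h):                     # words x_i ... x_{i+h-1}
            s += sum(w * x for w, x in zip(filt[j], X[i + j]))
        features.append(max(0.0, s))           # ReLU (assumed non-linearity)
    return max(features), features


# Toy document with N = 6 words and k = 4 embedding dimensions.
rng = random.Random(0)
X = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]

pooled = []
for h in (3, 4, 5):                            # kernel sizes used in the paper
    filt = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(h)]
    c_hat, c = conv1d_max_over_time(X, filt)
    pooled.append(c_hat)                       # one pooled feature per filter
```

In a full model, many filters per kernel size would be used and their pooled features concatenated into the document representation.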

Convolutional Attention Model (CAML)
Because the number of samples per code is highly imbalanced, labels with very few samples are difficult to train. To address this issue, the CAML model (Mullenbach et al., 2018) utilizes the descriptive definitions of diagnosis codes and applies a per-label attention mechanism, which has the additional benefit of selecting the n-grams in the text that are most relevant to each predicted label.
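A per-label attention step of this kind can be sketched as follows. The sketch assumes dot-product scoring between each position's hidden vector and one learned query vector per label, followed by a softmax over positions; the names `per_label_attention`, `H`, and `U` are illustrative and not taken from the CAML implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def per_label_attention(H, U):
    """Compute one attended document vector per label.

    H : list of N hidden vectors (length d), one per text position
    U : list of L label query vectors (length d), one per label
    Returns (V, A): V[l] is the attended vector for label l,
    A[l] is its attention distribution over the N positions.
    """
    d = len(H[0])
    V, A = [], []
    for u in U:
        scores = [sum(a * b for a, b in zip(h, u)) for h in H]
        alpha = softmax(scores)                # attention over positions
        v = [sum(alpha[n] * H[n][j] for n in range(len(H))) for j in range(d)]
        V.append(v)
        A.append(alpha)
    return V, A


# Three positions, two labels: each label attends to a different position.
H = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
U = [[5.0, 0.0], [0.0, 5.0]]
V, A = per_label_attention(H, U)
```

Because every label has its own attention distribution, the positions with the highest weights in `A[l]` indicate which parts of the text drive the prediction for label `l`.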

Knowledge Integration Mechanism
Considering the hierarchical property of ICD codes, we hypothesize that using higher-level labels helps the model learn more general concepts and thus improves the performance. For instance, the definitions of ICD-9 codes 301.2 and 307.1 are "Schizoid personality disorder" and "Anorexia nervosa" respectively. If we only use the labels given by the dataset, they are seen as two independent labels; however, in the ICD structure, both 301.2 and 307.1 belong to the same high-level category "mental disorders". This external category knowledge thus provides additional cues about how codes are related. Therefore, we propose four types of mechanisms that incorporate hierarchical category knowledge to improve the ICD prediction below.
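The mapping from a low-level code to its high-level category can be illustrated with a small lookup. The sketch below assumes that categories correspond to ICD-9 chapters identified by the three-digit code prefix, which may differ from the granularity used in the paper; the table is deliberately partial and the function name is hypothetical.

```python
# A few ICD-9 chapter ranges (by three-digit prefix); not the full taxonomy.
ICD9_CHAPTERS = [
    (1, 139, "infectious and parasitic diseases"),
    (140, 239, "neoplasms"),
    (290, 319, "mental disorders"),
    (390, 459, "diseases of the circulatory system"),
]


def high_level_category(code):
    """Map a numeric ICD-9 diagnosis code such as '301.2' to a chapter name.

    Returns None for codes outside this partial table, including the
    alphanumeric V and E codes, which are not handled here.
    """
    head = code.split(".")[0]
    if not head.isdigit():
        return None
    prefix = int(head)
    for low, high, name in ICD9_CHAPTERS:
        if low <= prefix <= high:
            return name
    return None


# Both example codes from the text fall into the same category.
same = high_level_category("301.2") == high_level_category("307.1")
```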
Cluster Penalty. Motivated by Nie et al. (2018), we compute two constraints to share the parameters of the ICD codes under the same categories. The between-cluster constraint, $\Omega_{between}$, is the total distance between the mean parameter vector of all ICD codes, $\bar{\theta}$, and the mean parameter vector of each category, $\bar{\theta}_k$ for the $k$-th category.
The within-cluster constraint, $\Omega_{within}$, is the distance between the mean parameter vector of each category and the parameters of its low-level codes, computed over $J(k)$, the set of labels that belong to the $k$-th category.
$\Omega_{between}$ and $\Omega_{within}$ are formulated as additional losses to enable the model to share parameters across codes within the same categories.
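One plausible reading of the two constraints is sketched below. It assumes squared Euclidean distance and unweighted sums over categories and codes; the exact form and any normalization used in the paper may differ, and the parameter values are made up for illustration.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance (assumed distance measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_penalties(theta, clusters):
    """Return (omega_between, omega_within).

    theta    : dict mapping each code to its parameter vector
    clusters : dict mapping each category k to its list of codes, J(k)
    """
    global_mean = mean_vector(list(theta.values()))      # mean over all codes
    omega_between = 0.0
    omega_within = 0.0
    for codes in clusters.values():
        cat_mean = mean_vector([theta[c] for c in codes])  # mean of category k
        omega_between += sq_dist(cat_mean, global_mean)
        omega_within += sum(sq_dist(theta[c], cat_mean) for c in codes)
    return omega_between, omega_within


# Toy parameters for four codes in two categories.
theta = {
    "301.2": [0.0, 0.0], "307.1": [2.0, 0.0],
    "410.71": [4.0, 4.0], "428.0": [4.0, 6.0],
}
clusters = {"mental": ["301.2", "307.1"], "circulatory": ["410.71", "428.0"]}
omega_between, omega_within = cluster_penalties(theta, clusters)
```

Minimizing the within-cluster term pulls the parameters of codes in the same category toward their category mean, which is how rare codes can borrow from frequent ones.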
Multi-Task Learning. Considering that predicting the high-level category can be treated as another task, we apply a multi-task learning approach to leverage the external knowledge. This model predicts the low-level codes, $y_{low}$, and their high-level categories, $y_{high}$, individually, as illustrated in Figure 1.
The high-level prediction is parameterized by $W_{high} \in \mathbb{R}^{N_{high} \times d}$, where $N_{high}$ is the number of high-level categories and $d$ is the dimension of the hidden vectors derived from the CNN.
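A minimal sketch of the two prediction heads is given below. It assumes that both heads are linear layers with a sigmoid applied to the shared CNN hidden vector; the bias terms, the low-level weight matrix `W_low`, and all values are illustrative assumptions rather than details from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear_sigmoid(W, b, h):
    """Apply a linear layer followed by a sigmoid, one output per row of W."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, h)) + bias)
        for row, bias in zip(W, b)
    ]


def multitask_predict(h, W_low, b_low, W_high, b_high):
    """Predict low-level codes and high-level categories from the same
    hidden vector h of dimension d (two separate output heads)."""
    y_low = linear_sigmoid(W_low, b_low, h)      # N_low probabilities
    y_high = linear_sigmoid(W_high, b_high, h)   # N_high probabilities
    return y_low, y_high


# Toy sizes: d = 3, N_low = 4 codes, N_high = 2 categories.
h = [0.5, -1.0, 2.0]
W_low = [[0.1, 0.0, 0.2], [0.0, 0.3, 0.0], [-0.2, 0.1, 0.1], [0.0, 0.0, 0.0]]
W_high = [[0.2, -0.1, 0.0], [0.0, 0.0, 0.4]]
y_low, y_high = multitask_predict(h, W_low, [0.0] * 4, W_high, [0.0] * 2)
```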
Hierarchical Learning. We build a dictionary that maps our low-level labels to the corresponding high-level categories, as illustrated in Figure 1. To estimate the weights for high-level categories, $y_{high}$, two mechanisms are proposed:
• Average meta-label: The probability of the $k$-th high-level category is approximated by the average of the weights of the low-level codes that belong to the $k$-th category.
• At-least-one meta-label: Motivated by Nie et al. (2018), meta-labels are created by examining whether any disease label in the $k$-th category has been tagged, and the high-level probability is derived from the low-level probabilities of the disease labels.
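The two meta-label mechanisms can be sketched as follows. The averaging follows the description above directly. For the at-least-one variant, the target is built from the gold labels as described, but the way the high-level probability is derived is not spelled out in the text; the noisy-OR form used here, which treats the low-level predictions as independent, is only one possible choice and is an assumption.

```python
def average_meta_label(p_low, members):
    """High-level probability as the mean of its members' probabilities."""
    return sum(p_low[c] for c in members) / len(members)


def at_least_one_target(y_low, members):
    """Gold meta-label: 1 if any low-level code in the category is tagged."""
    return 1 if any(y_low[c] for c in members) else 0


def at_least_one_probability(p_low, members):
    """Probability that at least one member is present (noisy-OR, assumed)."""
    prob_none = 1.0
    for c in members:
        prob_none *= 1.0 - p_low[c]
    return 1.0 - prob_none


# Toy category with two member codes.
members = ["301.2", "307.1"]
p_low = {"301.2": 0.2, "307.1": 0.5}     # predicted low-level probabilities
y_low = {"301.2": 0, "307.1": 1}         # gold low-level labels

p_avg = average_meta_label(p_low, members)
p_any = at_least_one_probability(p_low, members)
t_any = at_least_one_target(y_low, members)
```

Neither variant adds parameters for the high-level prediction, so the category signal flows back only through the low-level outputs.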

Training
The knowledge integration mechanisms are built on top of the multi-label convolutional models, which treat each ICD label as a binary classification. The predicted values for high-level categories come from the proposed mechanisms. Considering that learning low-level labels directly is difficult due to the highly imbalanced label distribution, we add a loss term for the high-level category in order to learn the general concepts in addition to the low-level labels, and train the model in an end-to-end fashion. Note that the high-level loss is set as $\mathrm{loss}_{high} = \Omega_{between} + \Omega_{within}$ for the cluster penalty and as the binary log loss for the other methods.
The low-level loss and the high-level loss are then combined into a single training objective, where $\lambda$ controls the influence of the category knowledge; we choose $\lambda = 0.1$.
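A minimal sketch of the combined objective is shown below, assuming a simple weighted sum of the low-level loss and the high-level loss with weight λ. The additive form, the averaging over labels, and the helper names are assumptions made for illustration.

```python
import math


def binary_log_loss(y_true, y_prob, eps=1e-12):
    """Mean binary log loss over labels; probabilities are clipped for stability."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def total_loss(loss_low, loss_high, lam=0.1):
    """Combine the two terms; lam weights the category knowledge (assumed additive)."""
    return loss_low + lam * loss_high


# Toy example: three low-level codes and one high-level category.
loss_low = binary_log_loss([1, 0, 0], [0.7, 0.2, 0.1])
loss_high = binary_log_loss([1], [0.6])
loss = total_loss(loss_low, loss_high, lam=0.1)
```

For the cluster penalty variant, `loss_high` would instead be the sum of the between-cluster and within-cluster constraints.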

Experiments
In order to measure the effectiveness of the proposed methods, the following experiments are conducted.

Setup
We evaluate our model on two datasets: the benchmark MIMIC-3 data and a dataset collected by National Taiwan University Hospital (NTUH). MIMIC-3 (Johnson et al., 2016) is a benchmark dataset that contains text and structured records from a hospital ICU. We use the same setting as the prior work (Mullenbach et al., 2018), where 47,724 discharge summaries are used for training, with 1,632 and 3,372 summaries for validation and testing, respectively. We also obtain a subset of the original MIMIC3-Full, called MIMIC3-50, which contains the 50 most frequent labels. The NTUH dataset is collected from an internal hospital, where each record includes narrative notes describing a patient's stay and the associated diagnostic ICD-9 codes. There are 1,495 ICD-9 codes in total in the data, and the distribution is highly imbalanced. Our data is noisy due to typos and different writing styles, with an OOV rate of 0.373 based on the large vocabulary obtained from PubMed and PMC. As shown in Table 1, our data, Internal-200, is more challenging than the benchmark MIMIC-3 dataset due to much shorter text inputs and a higher OOV rate. We split the whole set of 25,375 records from Internal-200 into 17,762 for training, 2,537 for validation, and 5,076 for testing.
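An OOV rate of this kind can be computed as sketched below. The sketch assumes a token-level rate (the fraction of token occurrences missing from the reference vocabulary) and whitespace tokenization; the paper does not state whether its figure is token-level or type-level, and the example note and vocabulary are invented.

```python
def oov_rate(tokens, vocabulary):
    """Fraction of token occurrences that are not in the reference vocabulary."""
    if not tokens:
        return 0.0
    missing = sum(1 for t in tokens if t not in vocabulary)
    return missing / len(tokens)


# Invented note with abbreviations that a reference vocabulary may lack.
vocabulary = {"chest", "pain", "fever"}
tokens = "pt c/o chest pain x3d".split()
rate = oov_rate(tokens, vocabulary)
```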

Results
The baseline results and the results of adding the proposed mechanisms are shown in Table 2. For MIMIC3-50, all proposed mechanisms improve almost all metrics, and the best results come from hierarchical learning with the average meta-label. The consistent improvement indicates that category knowledge provides informative cues for sharing parameters across low-level codes under the same categories. For MIMIC3-Full, our proposed mechanisms still outperform the baseline CNN model, and the best performance comes from multi-task learning. The reason may be that multi-task learning imposes more flexible constraints than hierarchical learning, which makes it more suitable for this more challenging scenario with severe data imbalance. In addition, the proposed knowledge integration mechanisms using multi-task learning or hierarchical learning with the average meta-label are able to improve upon the prior state-of-the-art model, CAML (Mullenbach et al., 2018), demonstrating the importance of domain knowledge.
To further investigate the model effectiveness, we perform experiments on the NTUH dataset, with results shown in Table 3. Due to shorter clinical notes and a higher OOV rate, this dataset is more challenging and the results are lower than those on MIMIC-3. Nevertheless, the proposed methods still improve the performance by integrating category knowledge using multi-task learning or hierarchical learning with the average meta-label. In sum, our proposed category knowledge integration mechanisms improve text understanding performance by combining domain knowledge with neural models and achieve state-of-the-art results.

Qualitative Analysis
From our prediction results, we find that our proposed mechanisms tend to predict more labels than the baseline models for both CNN and CAML. Specifically, our methods can assist models to consider more categories from shared information in the hierarchy. The additional codes often contain the right answers and sometimes are in the correct categories but not exactly matched. Moreover, our mechanisms have the capability of correcting the wrong codes to the correct ones which are under the same category. The appendix provides some examples for reference.

Conclusion
This paper proposes multiple mechanisms that use refined losses to leverage hierarchical category knowledge and share semantics among the labels under the same category, so that the model can better understand clinical texts even when the training samples are limited. The experiments demonstrate the effectiveness of the proposed knowledge integration mechanisms, which achieve state-of-the-art performance and generalize across multiple datasets. In the future, we plan to analyze the performance of each label and investigate which labels benefit most from the proposed approaches.