Clinical-Coder: Assigning Interpretable ICD-10 Codes to Chinese Clinical Notes

In this paper, we introduce Clinical-Coder, an online system that assigns ICD codes to Chinese clinical notes. ICD coding has been a research hotspot in clinical medicine, but the lack of interpretability of predictions hinders its practical application. We exploit a Dilated Convolutional Attention network with N-gram Matching mechanism (DCANM) to capture semantic features of both non-continuous words and continuous n-gram words, concentrating on explaining why each ICD code is predicted. Experiments demonstrate that our approach is effective and that our system can provide supporting information for clinical decision making.


Introduction
The International Classification of Diseases (ICD) is the diagnostic classification standard in the field of clinical medicine, which assigns a unique code to each disease. The popularization of ICD codes has immensely promoted worldwide information sharing and clinical research on diseases, and has a positive influence on health condition research, insurance claims, and morbidity and mortality statistics (Shi et al., 2017). Therefore, ICD coding, which assigns proper ICD codes to a clinical note, has drawn much attention.
ICD coding has long relied on the manual work of professional staff. Manual coding is error-prone and time-consuming, since the continuously updated versions of ICD have substantially increased the number of codes. The number of ICD-10 codes is up to 72,184, more than five times that of the previous version (i.e., ICD-9). This allows for more detailed classification of patients' conditions, injuries, and diseases. However, there is no doubt that the increased granularity also increases the difficulty of manual coding. Existing studies have proposed several approaches to automatic code prediction to replace the repetitive manual work, from traditional machine learning methods (Perotte et al., 2013; Koopman et al., 2015) to neural network methods (Shi et al., 2017; Yu et al., 2019). Although these methods achieve great success, they still face a critical challenge: the interpretability of the predicted codes. Explainable models and results are essential for clinical decision making (Mullenbach et al., 2018). Thus, a practical approach should predict correct codes and simultaneously give the reason why each code is predicted. In this paper, we provide interpretability of predictions from a semantic perspective. Exact disease names or similar expressions of disease names often appear in the discharge summary. For example, as shown in Figure 1, an exact match with a disease name such as "fatty liver" is direct evidence for inference. We call such continuous, consistent words explicit semantic features. Moreover, inexact matches such as "rheumatoid multisite arthritis" are also very useful for predicting codes and should be taken into consideration. We refer to such non-continuous words as implicit semantic features. Both kinds of semantic features are clues that explain why each code is assigned, which is also the basis on which experts perform manual coding. To capture these two semantic phenomena, we exploit dilated convolution and an n-gram matching mechanism to extract implicit and explicit semantic features, respectively. Furthermore, we develop a system to assist professional coders in assigning the correct codes. In summary, the main contributions are as follows:
• We collect a large-scale Chinese clinical notes dataset, making up for the lack of Chinese ICD coding corpora.
• We propose a novel method to simultaneously capture implicit and explicit semantic features, which makes it possible to provide interpretability for each predicted code.

Related Work
Automatic ICD coding has recently become a research hotspot in the field of clinical medicine, where neural network methods show more promising results than traditional machine learning methods.
Most studies treat automatic ICD coding as a multi-label classification problem and use only the free text in summaries to predict codes (Subotin and Davis, 2015; Kavuluru et al., 2015; Yu et al., 2019), while many methods benefit from extra information. Shi et al. (2017) encode label descriptions with character-level and word-level long short-term memory networks. Rios and Kavuluru (2018) encode label descriptions by averaging word embeddings. Furthermore, adversarial learning has been employed to unify the writing styles of diagnosis descriptions and ICD code descriptions (Xie et al., 2018). Besides code descriptions, Wikipedia has also been used as an external knowledge source (Prakash et al., 2017; Bai and Vucetic, 2019).
Additionally, interpretability of the inference is a crucial challenge and obstacle for practical automatic coding, since professionals need to be convinced by model insights such as vital supporting information or the decision-making process (Vani et al., 2017; Mullenbach et al., 2018). Baumel et al. (2018) employ a bidirectional Gated Recurrent Unit with sentence-level attention to obtain relevant sentences for each code. Mullenbach et al. (2018) use attention at the word level, which is more fine-grained. Our work is inspired by Mullenbach et al. (2018): we assign an importance value for each label to spans of the discharge summary to assist in explaining the model's prediction process.

Figure 3: The overall architecture of the model. The input is the clinical text, and the output is the ICD codes. The yellow dotted box indicates how attention-based dilated convolution captures the implicit semantics of non-continuous words. The green dotted box indicates how the n-gram matching mechanism captures the explicit semantics of continuous n-gram words.

Dilated Convolution
Dilated convolution was designed in computer vision to aggregate multi-scale contextual information without losing resolution (Yu and Koltun, 2016). It inserts "holes" into the standard convolution map to increase the receptive field. This hole structure brought a breakthrough improvement to the semantic segmentation task.
Similarly, several hole-structured convolutional neural networks (CNNs) (Lei et al., 2015; Guo et al., 2017) have been designed for natural language processing tasks. In text, non-continuous semantics arise when irrelevant words are interspersed within a sentence. The holes in a dilated convolution can skip the extra words between non-continuous words, and are thus well suited to matching non-continuous semantics. Since semantic information is crucial for understanding natural language (Zuo et al., 2019), we apply dilated convolution to encode the text and capture non-continuous semantic information.

Method
We propose a Dilated Convolutional Attention network with N-gram Matching Mechanism (DCANM) for the ICD coding task. Figure 3 describes the architecture of the model. The input of the model is all sentences in a clinical note, spliced together. The input sentences interact with the ICD code names to capture explicit semantic features and generate an n-gram matrix. At the same time, the input sentences are transformed into vectors and processed by a dilated CNN to capture implicit semantic features. An attention mechanism is used to improve performance. All features are then concatenated to form the final features. Finally, we use a sigmoid classifier to predict the probability of each code. Next, we describe each component in detail.
Word Embedding. Word embedding is a low-dimensional vector representation of a word. We use a pre-trained embedding matrix $W^{wrd} \in \mathbb{R}^{d_w \times |V|}$, where $d_w$ is the dimension of the word embedding and $|V|$ is the size of the vocabulary. Given a sentence $S = [w_1, w_2, \ldots, w_N]$, where $N$ is the number of words in the sentence, we obtain each word embedding by

$e_i = W^{wrd} v_i,$

where $v_i$ is the one-hot representation of the current word, which selects the corresponding column of $W^{wrd}$.
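For illustration, the following is a minimal PyTorch sketch of this lookup (the sizes are assumed for illustration only); since $v_i$ is one-hot, the matrix product reduces to an index lookup:

```python
import torch
import torch.nn as nn

# Minimal sketch: multiplying W_wrd by a one-hot vector v_i selects
# the i-th column, so an nn.Embedding lookup is equivalent.
vocab_size, d_w = 50000, 100                     # |V| and d_w are assumed values
embed = nn.Embedding(vocab_size, d_w)

token_ids = torch.tensor([[12, 407, 5, 9031]])   # a toy sentence of word indices
E = embed(token_ids)                             # shape: (1, N, d_w)
print(E.shape)
```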
Explicit Semantic Features. An n-gram matching mechanism is applied to capture explicit semantic features. We sample the text ($T$) with n-grams drawn from the disease names ($D$). First, a sliding window is moved over the disease name $d_l \in D$ to obtain n-gram substrings. Then, the frequency of each n-gram substring in the free text is calculated. The sum of the frequencies of grams with the same length $n$ (denoted $gram_n$) reflects the occurrence of disease names in the text; nevertheless, some grams are more distinctive than others. For example, given a 2-gram string, "糖尿" (diabetes) is more representative than "慢性" (chronic), even though they have the same length.
To represent the degree of importance of different n-grams, each n-gram is given a term frequency-inverse document frequency (tf-idf) weight. Finally, for each free-text clinical note, we calculate an explicit semantic n-gram matrix $M$ of size $L \times W$, where $L$ is the number of labels and $W$ is the number of sliding windows. For example, with four sliding windows of lengths 2, 3, 4, and 5, $W$ is 4. For the item in the $l$-th row and $w$-th column of the feature map, we have

$M_{l,w} = \frac{1}{L_{n_l}} \sum_{i=1}^{L_{gram_{ln}}} count_{gram_{lni}} \cdot \log \frac{L}{freq_{gram_{lni}}},$

where $w$ is the index of the $n$-length sliding window, $gram_{ln}$ denotes all $n$-length substrings of the $l$-th disease name, $gram_{lni}$ is the $i$-th such substring, $L_{gram_{ln}}$ is the number of these substrings, $count_{gram_{lni}}$ is the frequency of $gram_{lni}$ in the text, $L_{n_l}$ is the length of the $l$-th disease name, and $freq_{gram_{lni}}$ is the frequency of $gram_{lni}$ across all disease names. A sketch of this computation is given below.
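The following minimal sketch follows the formula above, with term frequency counted in the note, inverse frequency taken over all disease names, and normalization by disease-name length; the toy disease names and exact weighting details are illustrative assumptions:

```python
import math
from collections import Counter

def ngram_match_matrix(text, disease_names, window_sizes=(2, 3, 4, 5)):
    """Sketch of the explicit semantic n-gram matrix M (L x W)."""
    L, W = len(disease_names), len(window_sizes)
    # Document frequency of each gram across all disease names (for idf).
    df = Counter()
    for name in disease_names:
        df.update({name[i:i + n] for n in window_sizes
                   for i in range(len(name) - n + 1)})
    M = [[0.0] * W for _ in range(L)]
    for l, name in enumerate(disease_names):
        for w, n in enumerate(window_sizes):
            grams = [name[i:i + n] for i in range(len(name) - n + 1)]
            score = 0.0
            for g in grams:
                tf = text.count(g)                 # frequency of the gram in the note
                idf = math.log(L / df[g])          # rarer grams weigh more
                score += tf * idf
            if grams:
                M[l][w] = score / len(name)        # normalize by disease-name length
    return M

# Toy usage with two hypothetical disease names.
names = ["脂肪肝", "类风湿性关节炎"]
note = "患者有脂肪肝病史,伴类风湿性多部位关节炎。"
print(ngram_match_matrix(note, names))
```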
This weighting distinguishes the degrees of importance of n-gram substrings. It also works for English clinical notes: for instance, in a specific case from MIMIC-III (Johnson et al., 2016), the tf-idf value of "history of" is 1.79 while that of "atrial fibrillation" is 9.32, because "history of" appears 249 times in all ICD disease names whereas "atrial fibrillation" appears only twice. The higher the value, the more representative the word. Therefore, "atrial fibrillation" is more likely to indicate a disease than "history of".
Implicit Semantic Features. Dilated convolution is applied to capture implicit semantic features. For a long clinical text, dilated convolution extends the receptive field without using pooling, so that every kernel covers a wider range of information. More importantly, the "holes" in the convolution map allow it to match non-continuous semantic information. For example, "类风湿性多部位关节炎" (rheumatoid multisite arthritis) in a clinical note refers to "类风湿性关节炎" (rheumatoid arthritis) in ICD; the convolution map with holes can tolerate the redundant parts, as shown in Figure 4. This is a distinct advantage of dilated convolution for processing text. Formally, the actual filter width of a dilated convolutional neural network is

$k_d = r \times (k - 1) + 1,$

where $r \in \{1, 2, 3, \ldots\}$ is the dilation rate and $k$ is the original filter width. For each step $n$, the typical convolution and the dilated convolution are computed as

$c_n = W_c * [e_n, e_{n+1}, \ldots, e_{n+k-1}] + b_c,$

$c_n = W_c * [e_n, e_{n+r}, \ldots, e_{n+r(k-1)}] + b_c,$

respectively. The dilated CNN reduces to the typical CNN when the dilation rate is 1, since $k_d = k$ when $r = 1$. Here $W_c \in \mathbb{R}^{k_d \times d_e \times d_c}$ is the convolutional filter map, $k_d$ is the actual filter width, $d_e$ is the size of the word embedding, $d_c$ is the size of the filter output, and $b_c \in \mathbb{R}^{d_c}$ is the bias.
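As a sketch, such a dilated convolution over an embedded note can be realized with a standard 1-D convolution and a dilation parameter; the sizes below are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Sketch of a dilated convolution over an embedded note. With dilation r,
# the effective filter width is k_d = r * (k - 1) + 1, so the filter can
# skip interspersed words between non-continuous words.
d_e, d_c, k, r = 100, 50, 3, 2               # embedding dim, filters, kernel, dilation (assumed)
dilated_conv = nn.Conv1d(d_e, d_c, kernel_size=k, dilation=r,
                         padding=r * (k - 1) // 2)   # 'same'-style padding

E = torch.randn(1, d_e, 1000)                # (batch, d_e, N): one embedded note
H = torch.tanh(dilated_conv(E))              # (batch, d_c, N)
print(H.shape)
```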
Attention. After convolution, the sentence is represented as $H \in \mathbb{R}^{d_c \times N}$. We employ the per-label attention mechanism (Mullenbach et al., 2018) to find the characters that contribute most to each label. For each label $l$, the attention distribution is computed as

$\alpha_l = \mathrm{softmax}(H^{\top} u_l),$

where $u_l \in \mathbb{R}^{d_c}$ is the vector representation of label $l$. Finally, the sentence is represented as

$m_l = H \alpha_l.$

We employ attention for both the typical CNN and the dilated CNN; for convenience of distinction, we denote the results as $m_l$ and $\tilde{m}_l$, respectively.
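A minimal sketch of the per-label attention, assuming $H$ is the convolved representation of a single note and the sizes are illustrative:

```python
import torch

# Per-label attention in the spirit of Mullenbach et al. (2018).
L, d_c, N = 50, 50, 1000                 # labels, filter dim, note length (assumed)
H = torch.randn(d_c, N)                  # convolved representation of one note
U = torch.randn(L, d_c)                  # one attention vector u_l per label

alpha = torch.softmax(U @ H, dim=1)      # (L, N): attention over positions, per label
M_att = alpha @ H.T                      # (L, d_c): label-specific note representations
print(M_att.shape)
```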
Classification. $m_l$ and $\tilde{m}_l$ are concatenated horizontally with the linearly transformed n-gram matrix, denoted $\hat{m}_l$; the aim of this step is to combine all the features. We then exploit a sigmoid classifier, and the prediction for label $i$ is computed as

$\hat{y}_i = \sigma(W^{\top} [m_i; \tilde{m}_i; \hat{m}_i] + b),$

where $i \in \{1, 2, \ldots, L\}$, $W \in \mathbb{R}^{3 d_c}$, $b$ is the bias, and $\hat{m}_i$ is the linear projection of the n-gram matrix $M$. The loss function is the multi-label binary cross-entropy (Nam et al., 2013):

$\mathcal{L} = -\sum_{i=1}^{L} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right],$

where $y_i \in \{0, 1\}$ is the ground truth for the $i$-th label and $\hat{y}_i$ is the sigmoid score for the $i$-th label.
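A minimal sketch of the classifier and loss, assuming the three $L \times d_c$ feature blocks (typical CNN, dilated CNN, and projected n-gram matrix) have already been computed:

```python
import torch
import torch.nn.functional as F

# Sketch of the per-label sigmoid classifier and multi-label BCE loss.
L, d_c = 50, 50
m, m_tilde, m_hat = (torch.randn(L, d_c) for _ in range(3))

features = torch.cat([m, m_tilde, m_hat], dim=1)      # (L, 3 * d_c)
W = torch.randn(L, 3 * d_c)                           # one weight vector per label
b = torch.zeros(L)
y_hat = torch.sigmoid((features * W).sum(dim=1) + b)  # per-label probabilities

y = torch.randint(0, 2, (L,)).float()                 # toy ground-truth labels
loss = F.binary_cross_entropy(y_hat, y)
print(loss.item())
```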

User Interface
Figure 2 illustrates the user interface of our system.
User Input. The left side of Figure 2(a) displays the user input. The user enters the whole free-text clinical note, which includes at least one of the admission situation, admission diagnosis, discharge situation, and discharge diagnosis, into the input box.
Predicted Labels. The predicted labels are presented in the list in Figure 2(a), including disease names and the corresponding ICD codes. The number of predicted codes is not always the same as the number of diseases in the discharge diagnosis, because clinicians may leave out certain diseases, and several diagnoses may need to be combined into one ICD code (Shi et al., 2017). Our model can list all these diseases and give the reason why each should be predicted.
Interpretability. Interpretability is a critical aspect of a decision-making system, especially in the clinical medicine domain. In our system, we provide two ways, the n-gram matching mechanism and attention, to help users understand why each code is predicted. A user can see why the model predicted the labels and what the key information in its decision was: (1) N-gram Matching Mechanism. When a patient suffers from a disease, text spans related to the disease name often appear in the discharge summary. As shown in Figure 2(b1), a gram from the disease name is highlighted to give users a hint when it appears in the clinical text. Highlighting not only tells users why we predict each code but also points to the location of the important information.
(2) Attention. As shown in Figure 2(b2), the red background shows the attention distribution; the darker the color, the more useful the word is for predicting the current label. The darker colors also help draw human attention when double-checking the correctness of labels.

Dataset
We evaluate our model on both Chinese and English datasets. The Chinese dataset, collected by us, contains 50,678 Chinese clinical notes and 6,200 unique ICD-10 codes. Each clinical note contains five parts: admission situation, admission diagnosis, discharge situation, discharge diagnosis, and annotated ICD-10 codes. The admission situation involves chief complaints, past medical history, etc. The discharge situation involves the results of the general examination. The admission diagnosis and discharge diagnosis involve disease names, which may not be fully consistent with the standard names in ICD-10. The manually annotated codes are based on ICD-10 and are tagged by professional coders after reading through the whole clinical note. The dataset (CN-Full) is formed with the full set of labels mentioned above, and it is divided into a train set and a test set with a ratio of 9:1. In addition, because a large number of codes are infrequent while a small number are highly frequent, we construct a sub-dataset (CN-50) with the 50 most frequent codes from the original dataset. Specifically, we filter the original train and test sets and keep the notes that have at least one of the top 50 most frequent codes. To better compare with previous work, we also evaluate our method on the MIMIC-III dataset (Johnson et al., 2016), which is the most authoritative English dataset for evaluating the performance of automatic ICD coding approaches. Detailed statistics for these datasets are listed in Table 1.
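As an illustration of the CN-50 construction described above, the following minimal sketch assumes each example is a (note, codes) pair; restricting the retained label sets to the top 50 codes mirrors the MIMIC-III-50 setting and is an assumption:

```python
from collections import Counter

# Sketch of building the CN-50 subset from (note_text, [codes]) pairs.
def build_top_k_subset(train, test, k=50):
    freq = Counter(code for _, codes in train for code in codes)
    top_k = {code for code, _ in freq.most_common(k)}

    def filter_split(split):
        # Keep notes with at least one top-k code; drop other labels.
        return [(note, [c for c in codes if c in top_k])
                for note, codes in split
                if any(c in top_k for c in codes)]

    return filter_split(train), filter_split(test)
```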

Data Preprocess and Parameters
We splice together the admission situation, admission diagnosis, discharge situation, and discharge diagnosis, which form the input of the model. The maximum input length is 1,000. The word embeddings are pre-trained using Word2Vec (Mikolov et al., 2013) with a dimension of 100, on the text of all clinical notes. The batch size is 16, the dropout rate is 0.5, and the optimizer is Adam (Kingma and Ba, 2015) with a learning rate of 0.0001.
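The following sketch illustrates this preprocessing; the field names and the character-level tokenization are illustrative assumptions (gensim 4 API):

```python
from gensim.models import Word2Vec

# Toy note with hypothetical field names.
notes = [{"admission_situation": "患者因乏力入院",
          "admission_diagnosis": "脂肪肝",
          "discharge_situation": "一般情况好转",
          "discharge_diagnosis": "脂肪肝"}]

def splice(note, max_len=1000):
    # Concatenate the four free-text parts and truncate to max_len tokens.
    text = "".join(note[f] for f in ("admission_situation", "admission_diagnosis",
                                     "discharge_situation", "discharge_diagnosis"))
    return list(text)[:max_len]        # character-level tokens (assumption)

corpus = [splice(n) for n in notes]
w2v = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1)
```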
We use Micro-F1, Macro-F1, the area under the ROC (Receiver Operating Characteristic) curve (AUC), and P@k as metrics. P@k (Precision at k) is the fraction of the k highest-scored labels that are present in the ground truth.
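For clarity, P@k can be computed as in the following sketch:

```python
import numpy as np

# Precision@k: the fraction of the k highest-scored labels
# that appear in the ground truth.
def precision_at_k(scores, gold, k=5):
    top_k = np.argsort(scores)[::-1][:k]
    return np.mean([int(i in gold) for i in top_k])

print(precision_at_k(np.array([0.9, 0.1, 0.7, 0.4]), gold={0, 2, 3}, k=2))  # 1.0
```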

Results
First, for the Chinese datasets (CN-Full and CN-50), CAML (Mullenbach et al., 2018), which uses a traditional convolutional attention network, is set as our baseline. Moreover, we test the dilated CNN and the n-gram matching mechanism separately. The results in Table 2 indicate that the dilated CNN and the n-gram matching mechanism both improve performance over the baseline, and the best results are obtained when they are combined.
We also evaluate our method on the English dataset (MIMIC-III-50). The results are shown in Table 3. CNN and Bi-GRU are classic methods, and their results are taken from Mullenbach et al. (2018). Our proposed model achieves a Micro-F1 score of 0.641, which outperforms all previous works while, more importantly, providing interpretability.
Table 3: Evaluation on the MIMIC-III-50 dataset.

Besides, we notice that Macro-F1 is always lower than Micro-F1, especially on the full-label datasets. This means the smaller classes perform more poorly than the larger classes, which is consistent with the facts. In both MIMIC-III and the Chinese dataset, the label distributions are extremely imbalanced: a minority of codes are highly frequent, while most codes are infrequent. The n-gram matching mechanism clearly helps improve Macro-F1 on the CN-Full dataset, roughly doubling the baseline. It can be inferred that utilizing grams from disease names is useful for the smaller classes.

Conclusion
In this paper, we propose a Dilated Convolutional Attention network with N-gram Matching Mechanism (DCANM) for automatic ICD coding. The dilated CNN, which is applied to the ICD coding task for the first time, captures semantic information from non-continuous words, and the n-gram matching mechanism captures continuous semantics. Both provide good interpretability for the predictions. Moreover, we develop an open-access system to help users assign ICD codes. In the future, we will try to utilize external resources to address the few-shot and zero-shot problems.