Towards Automated ICD Coding Using Deep Learning

The International Classification of Diseases (ICD) is an authoritative health care classification system of diseases and conditions for clinical and management purposes. Because assigning correct codes to each patient admission based on the overall diagnosis is a complicated and intricate process, we propose a hierarchical deep learning model with an attention mechanism that automatically assigns ICD diagnostic codes given written diagnoses. We use character-aware neural language models to generate hidden representations of written diagnosis descriptions and ICD codes, and design an attention mechanism to address the mismatch between the number of descriptions and the number of corresponding codes. Our experimental results show the strong potential of automated ICD coding from diagnosis descriptions. Our best model achieves an F1 score of 0.53 and an area under the receiver operating characteristic curve of 0.90. These results outperform those achieved with a character-unaware encoding method or without the attention mechanism. This indicates that our proposed deep learning model can code automatically in a reasonable way and provides a framework for computer-assisted ICD coding.


Introduction
The International Classification of Diseases (ICD) is a health care classification system maintained by the World Health Organization [1], which provides a hierarchy of diagnostic codes for diseases, disorders, injuries, signs, symptoms, etc. It is widely used for reporting diseases and health conditions, assisting in medical reimbursement decisions, and collecting morbidity and mortality statistics, to name a few.
While ICD codes are important for making clinical and financial decisions, medical coding, which assigns proper ICD codes to a patient admission, is time-consuming, error-prone and expensive. Medical coders review the diagnosis descriptions written by physicians in the form of textual phrases and sentences and (if necessary) other information in the electronic medical record of a clinical episode, then manually assign the appropriate ICD codes by following the coding guidelines [2]. Several types of errors frequently occur. First, when writing diagnosis descriptions, physicians often use abbreviations and synonyms, which causes ambiguity and imprecision when the coders match ICD codes to those descriptions [3]. Second, in many cases, several diagnosis descriptions are closely related and should be combined into a single combination ICD code. However, inexperienced coders may code each disease separately. Such errors are called unbundling. Third, the ICD codes are organized in a hierarchical structure where the top-level codes represent generic disease categories and the bottom-level codes represent more specific diseases. A miscoding happens when the coder matches the diagnosis description to an overly generic code instead of a more specific code. The cost incurred by coding errors and the financial investment spent on improving coding quality are estimated to be $25 billion per year in the US [4,5].
To reduce coding errors and cost, we aim to build an ICD coding machine that automatically and accurately translates free-text diagnosis descriptions into ICD codes. To achieve this goal, several technical challenges need to be addressed. First, the diagnosis descriptions written by physicians and the textual descriptions of ICD codes are written in quite different styles even when they refer to the same disease. In particular, the definitions of ICD codes are formally and precisely worded, while diagnosis descriptions are usually written in an informal and ungrammatical way, with telegraphic phrases, abbreviations, and typos. Second, as stated earlier, there is not necessarily a one-to-one mapping between diagnosis descriptions and ICD codes, and human coders should consider the overall health condition when assigning codes. In many cases, two closely related diagnosis descriptions need to be mapped to a single combination ICD code. Conversely, physicians may write two health conditions into one diagnosis description, which should then be mapped to two ICD codes.

Contributions
We present a deep learning approach to automatically perform ICD coding given the diagnosis descriptions. Specifically, we propose a hierarchical neural network model that captures the latent semantics of ICD definitions and diagnosis descriptions, despite their significant difference in writing style. An attention mechanism is designed to address the mismatch between the number of diagnosis descriptions and the number of assigned codes. We train the model on 8,066 hospital admissions, tune hyper-parameters on 1,728 admissions, and evaluate the performance on a held-out test set of 1,729 hospital admissions. We demonstrate that our coding machine can assign ICD codes with an F1 score of 0.53 and an AUC ROC of 0.90 on the 50 most frequent codes.

Related work
The accuracy and efficiency of manual ICD coding have always been a concern in clinical practice. O'Malley et al. summarized the complete workflow of assigning ICD codes manually [2], which is an intricate procedure and is prone to errors. To avoid the massive human labour of coding, researchers have proposed automatic or semi-automatic ICD classification systems, especially for narrative clinical notes, to improve health care practice [6][7][8][9][10][11][12]. However, the experimental datasets were generally small and domain-specific. For example, a shared task that involved assigning ICD-9 codes to 1954 radiology records attracted a lot of attention [10], and in 2015 Koopman et al. proposed a classification system for identifying different types of cancers from death certificates based on the ICD classification system [12]. In contrast to these experiments, the dataset in our experiment is much larger and covers various domains of clinical practice.
There have also been some attempts to assign ICD codes using the discharge summary document [13][14][15]. Leah S. Larkey and W. Bruce Croft trained three statistical classifiers with many human-tuned parameters, together with an ensemble model, to give candidate ICD labels to each discharge summary document, focusing mainly on the principal diagnostic code (the most significant diagnostic code) [13]. Franz et al. compared three coding methods given the discharge diagnosis, but their objective was to assign just one diagnostic code to each diagnosis description [14]. All of these approaches use the full-text discharge summary document and thus suffer from complicated preprocessing of the noisy text. To build a more practical ICD coding machine, we formulate our coding task as a general multi-label classification problem on diagnosis descriptions, without many parameters to tune and without restricting the number of assigned codes for each patient record.

Dataset and preprocessing
We perform the study on the publicly available MIMIC-III dataset [16], which contains de-identified and comprehensive electronic medical records of 58,976 patient visits to the Beth Israel Deaconess Medical Center from 2001 to 2012. A patient visit record usually has a clinical note called the discharge summary, which contains multiple sections of information, such as 'discharge diagnosis', 'past medical history', 'admission medications', and 'chief complaint'. The diagnosis descriptions are usually included in the 'discharge diagnosis' and 'final diagnosis' sections [17]. We use a variety of standard text pre-processing techniques, such as regular expression matching and tokenization, to turn the noisy and irregular raw note texts in these sections into clean diagnosis descriptions. Each resulting description is a short phrase or a sentence, articulating one disease or condition. Patient visits that contain no extracted diagnosis descriptions are discarded.
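The extraction step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the preprocessing pipeline used in the study: the section-header pattern, the enumeration pattern, the helper name `extract_diagnoses`, and the sample note are all hypothetical.

```python
import re

# Assumed section headers; real notes vary and would need more patterns.
SECTION_RE = re.compile(
    r"(?:discharge|final) diagnos[ie]s:\s*(.*?)(?:\n\s*\n|\Z)",
    re.IGNORECASE | re.DOTALL,
)
# Assumed enumeration markers such as "1." or "2)" at the start of a line.
ITEM_RE = re.compile(r"^\s*\d+[.)]\s*", re.MULTILINE)


def extract_diagnoses(note):
    """Return cleaned diagnosis descriptions found in one discharge summary."""
    descriptions = []
    for section in SECTION_RE.findall(note):
        # Split on enumeration markers, then collapse stray newlines and spaces.
        for item in ITEM_RE.split(section):
            cleaned = " ".join(item.split())
            if cleaned:
                descriptions.append(cleaned)
    return descriptions


# Hypothetical note with an extra newline inside the third description.
sample_note = (
    "Chief Complaint:\nchest pain\n\n"
    "Discharge Diagnosis:\n1. Coronary artery disease\n"
    "2. Hypertension\n3. Chronic\nrenal insufficiency\n\n"
    "Discharge Disposition:\nHome\n"
)
extracted = extract_diagnoses(sample_note)
print(extracted)
```

A visit for which `extract_diagnoses` returns an empty list would be discarded, matching the filtering rule stated above.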
Each patient visit has a list of ICD codes given by the medical coders. These codes are documented in structured tables. The entire dataset contains 6,984 unique codes, each of which has a textual description of a disease, symptom, or condition. Many codes are assigned to only a few patient visits. Because of this sparsity, it is very difficult to train an accurate coding model for all of them. Instead, we choose the 50 most frequent codes to carry out the study, while noting that our model can readily be extended to more codes as long as sufficient training data is available. The frequency of a code is measured as the number of patient visits that the code is assigned to. Table 1 shows a sample admission record in the raw dataset and the extracted diagnosis descriptions. 'HADMID' is the identifier MIMIC-III uses to denote each hospital admission uniquely. We omitted irrelevant sections of the original discharge summary text, such as 'discharge disposition' and 'physical examination'. The extracted diagnosis descriptions given by physicians are in enumeration style. Notice that there is an extra newline in the third written diagnosis description, which is removed after extraction. The number of diagnosis descriptions is not equal to the number of assigned diagnostic codes.
In this way, we obtain 11,523 hospital admission records with 59,302 diagnosis descriptions overall. Figure 1(a) shows the distribution of the number of extracted plain-text diagnosis descriptions across medical records. After restricting our ICD coding target to the 50 most frequent codes, the distribution of ICD code frequency is shown in Figure 1(b), and the distribution of the number of assigned codes per admission record is shown in Figure 1(c). We split the dataset into a training set with 8,066 hospital admission records, a validation set with 1,728 records, and a test set with 1,729 records.
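The label construction described above (code frequency counted per visit, restriction to the most frequent codes, and a three-way split) can be sketched as follows. The function names, the toy visit identifiers, and the use of a seeded random shuffle are assumptions made for illustration; the text does not state how the split was drawn.

```python
import random
from collections import Counter


def top_k_codes(visit_codes, k=50):
    """Rank codes by the number of visits they are assigned to; keep the top k."""
    counts = Counter()
    for codes in visit_codes.values():
        counts.update(set(codes))  # count each code at most once per visit
    return [code for code, _ in counts.most_common(k)]


def multi_hot(codes, vocabulary):
    """Encode the codes of one visit as a 0/1 vector over the kept codes."""
    present = set(codes)
    return [1 if code in present else 0 for code in vocabulary]


def split_visits(visit_ids, sizes, seed=0):
    """Shuffle visit ids (assumed random split) and cut them into three lists."""
    ids = sorted(visit_ids)
    random.Random(seed).shuffle(ids)
    n_train, n_val, n_test = sizes
    if n_train + n_val + n_test != len(ids):
        raise ValueError("split sizes must add up to the number of visits")
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


# Toy example with hypothetical admission ids; the study uses k=50 and
# split sizes (8066, 1728, 1729), which add up to 11,523 records.
toy_visits = {
    "100001": ["4019", "4280"],
    "100002": ["4019"],
    "100003": ["25000", "4019", "4280"],
}
vocabulary = top_k_codes(toy_visits, k=2)
targets = {vid: multi_hot(codes, vocabulary) for vid, codes in toy_visits.items()}
train_ids, val_ids, test_ids = split_visits(toy_visits, sizes=(1, 1, 1))
print(vocabulary, targets)
```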

Model design
The ICD coding model mainly consists of four modules, which are used for (1) encoding the diagnosis descriptions, (2) encoding the ICD codes based on their textual descriptions, (3) matching diagnosis descriptions with ICD codes, and (4) assigning the ICD codes, respectively. The overall architecture is illustrated in Figure 2. In the following we present each component in detail.

Diagnosis description encoder
We leverage the long short-term memory (LSTM) recurrent network to encode the diagnosis descriptions [18]. LSTM is a popular variant of the recurrent neural network (RNN). Owing to its capacity to capture long-range semantics in text, LSTM is widely used for language modeling and sequence encoding [19,20]. An LSTM recurrent network consists of a sequence of units, each of which models one item in the input sequence. Each unit consists of an input gate $\mathbf{i}$, a forget gate $\mathbf{f}$, a cell gate $\mathbf{g}$, an output gate $\mathbf{o}$, a cell state $\mathbf{c}$, and a hidden state $\mathbf{h}$, which are all vectors. They are computed as follows:

$$
\begin{aligned}
\mathbf{i}_t &= \mathrm{sigmoid}(\mathbf{W}_i \mathbf{x}_t + \mathbf{U}_i \mathbf{h}_{t-1} + \mathbf{b}_i) \\
\mathbf{f}_t &= \mathrm{sigmoid}(\mathbf{W}_f \mathbf{x}_t + \mathbf{U}_f \mathbf{h}_{t-1} + \mathbf{b}_f) \\
\mathbf{g}_t &= \tanh(\mathbf{W}_g \mathbf{x}_t + \mathbf{U}_g \mathbf{h}_{t-1} + \mathbf{b}_g) \\
\mathbf{o}_t &= \mathrm{sigmoid}(\mathbf{W}_o \mathbf{x}_t + \mathbf{U}_o \mathbf{h}_{t-1} + \mathbf{b}_o) \\
\mathbf{c}_t &= \mathbf{f}_t * \mathbf{c}_{t-1} + \mathbf{i}_t * \mathbf{g}_t \\
\mathbf{h}_t &= \mathbf{o}_t * \tanh(\mathbf{c}_t)
\end{aligned}
\tag{1}
$$

Here $\mathbf{W}$ and $\mathbf{U}$ are the input and recurrent weight matrices of each gate and $\mathbf{b}$ the corresponding bias vectors. For clarity, we denote scalars in plain lowercase letters, vectors in bold lowercase, and matrices in bold uppercase. The operator '$*$' in Equation 1 denotes element-wise multiplication and $t$ represents the time step in the sequence. The sigmoid function is defined as $\mathrm{sigmoid}(x) = 1/(1 + \exp(-x))$, and the tanh function is $\tanh(x) = (\exp(x) - \exp(-x))/(\exp(x) + \exp(-x))$.

For each diagnosis description, we use both a character-level LSTM network and a word-level LSTM network to obtain its hidden representation. Specifically, in the character-level LSTM, $\mathbf{x}_t$ is the embedding vector of the $t$-th character in the word, and $T$ is the total number of characters in this word. We select the hidden state of the LSTM at the last time step as the hidden representation of the word. In the word-level LSTM, $\mathbf{x}_t$ is the hidden vector of the $t$-th word in the sentence, and $T$ is the number of words. Similarly, we choose the last hidden state as the representation of the sentence. We choose the character-aware encoding method because many medical terms share the same suffix when denoting similar diseases, and we expect the character-level LSTM to capture such regularities. In the following, we denote the hidden representations of the written diagnosis descriptions as $\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_m$, where $m$ is the number of extracted diagnosis descriptions in one record.
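To make the two-level encoding concrete, the following sketch implements a standard LSTM step with the gates named above and stacks a character-level encoder under a word-level encoder. It is a toy illustration under stated assumptions rather than the trained model: the weights are random and untrained, the dimensions are tiny, each gate uses a single bias vector, the alphabet is limited to lowercase letters, and the names `TinyLSTM` and `encode_description` are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, x, b):
    """Return W x + b for a matrix W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


class TinyLSTM:
    """Single-layer LSTM with randomly initialised (untrained) weights."""

    GATES = "ifgo"  # input, forget, cell, and output gates

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size

        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.W = {name: mat(hidden_size, input_size) for name in self.GATES}
        self.U = {name: mat(hidden_size, hidden_size) for name in self.GATES}
        self.b = {name: [0.0] * hidden_size for name in self.GATES}

    def step(self, x, h, c):
        zeros = [0.0] * self.hidden_size
        pre = {}
        for name in self.GATES:
            from_input = affine(self.W[name], x, self.b[name])
            from_hidden = affine(self.U[name], h, zeros)
            pre[name] = [a + r for a, r in zip(from_input, from_hidden)]
        i = [sigmoid(v) for v in pre["i"]]
        f = [sigmoid(v) for v in pre["f"]]
        g = [math.tanh(v) for v in pre["g"]]
        o = [sigmoid(v) for v in pre["o"]]
        # Element-wise updates of the cell state and the hidden state.
        c_new = [ft * ct + it * gt for ft, ct, it, gt in zip(f, c, i, g)]
        h_new = [ot * math.tanh(ct) for ot, ct in zip(o, c_new)]
        return h_new, c_new

    def encode(self, inputs):
        """Run over a sequence and return the hidden state of the last step."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        for x in inputs:
            h, c = self.step(x, h, c)
        return h


rng = random.Random(0)
CHAR_DIM, HIDDEN = 8, 16  # toy sizes; the experiments use 200 hidden units
ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumed character vocabulary
char_embedding = {}
for ch in ALPHABET:
    char_embedding[ch] = [rng.uniform(-0.1, 0.1) for _ in range(CHAR_DIM)]
char_lstm = TinyLSTM(CHAR_DIM, HIDDEN, rng)
word_lstm = TinyLSTM(HIDDEN, HIDDEN, rng)


def encode_description(text):
    """Characters -> word vectors -> one sentence vector (last hidden states)."""
    word_vectors = []
    for word in text.lower().split():
        chars = [char_embedding[ch] for ch in word if ch in char_embedding]
        word_vectors.append(char_lstm.encode(chars))
    return word_lstm.encode(word_vectors)


h_example = encode_description("Malignant essential hypertension")
print(len(h_example))
```

In a trained system the embeddings and weights would be learned jointly with the rest of the model; here they only demonstrate the data flow.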

ICD code encoder
For each ICD code, we adopt the same two-level LSTM architecture, i.e., character-level and word-level, to obtain the hidden representation of its long title definition, which is provided in the MIMIC-III dataset. For example, in MIMIC-III, the long title of ICD code '4010' is 'Malignant essential hypertension'. The hidden vector of 'Malignant essential hypertension' obtained with the LSTM network serves as the representation of ICD code '4010'. The parameters of the neural networks for the ICD code encoder and the diagnosis description encoder are not tied, so that each can learn the language style of its own set of texts. We use $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_n$ to denote the hidden representations of the ICD codes obtained from their long title definitions, where $n$ is the total number of ICD categories. Since our experiment uses the 50 most frequent codes, $n = 50$.

Attentional match
Typically, the number of written diagnosis descriptions does not equal the number of assigned ICD codes, so we cannot directly assign one code to one diagnosis description. Since human coders are supposed to assign appropriate codes according to the overall health condition, we likewise take all diagnosis descriptions into account during coding by adopting an attention strategy. The attention mechanism provides a recipe for choosing which diagnosis descriptions are important when performing coding.
We use $u_{i,k}$ and $h_{j,k}$ to represent the $k$-th dimension of the hidden representations of the $i$-th ICD code and the $j$-th diagnosis description, respectively. For the $i$-th ICD code, we use $a_{i,j}$ to denote its attention score on the $j$-th diagnosis description, which is the cosine similarity of the hidden representations of the $i$-th ICD code and the $j$-th diagnosis description:

$$
a_{i,j} = \frac{\sum_{k} u_{i,k}\, h_{j,k}}{\sqrt{\sum_{k} u_{i,k}^{2}}\;\sqrt{\sum_{k} h_{j,k}^{2}}}
$$
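As a minimal sketch, the attention scores can be computed for every pair of code and description as below. The function names and the toy vectors are illustrative assumptions; the zero-norm guard is an added safety check that the text does not discuss.

```python
import math


def cosine(u, h):
    """Cosine similarity between a code vector u and a description vector h."""
    dot = sum(a * b for a, b in zip(u, h))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in h))
    return dot / norm if norm else 0.0  # guard against all-zero vectors


def attention_scores(code_vectors, description_vectors):
    """Return a[i][j], the score of code i on description j."""
    return [[cosine(u, h) for h in description_vectors] for u in code_vectors]


# Toy vectors: n = 2 codes and m = 3 descriptions in a 2-dimensional space.
toy_codes = [[1.0, 0.0], [0.0, 2.0]]
toy_descriptions = [[3.0, 0.0], [1.0, 1.0], [0.0, -1.0]]
scores = attention_scores(toy_codes, toy_descriptions)
print(scores)
```

The result is an n-by-m matrix, so the number of descriptions never has to match the number of codes.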
We then design two different kinds of attention layers to obtain the confidence score of ICD code assignment: the Hard-selection and Soft-attention mechanisms, which are depicted in Figure 3.

Hard-selection. Based on the assumption that the most related diagnosis description plays a decisive role when assigning an ICD code, for each ICD code we define the dominating diagnosis as the one that has the maximum attention score among all diagnosis descriptions. We apply the sigmoid function to normalize the score into a probability value in [0, 1]. The probability of the $i$-th ICD code being assigned is thus:

$$
p_i = \mathrm{sigmoid}\Big(\max_{j = 1, \ldots, m} a_{i,j}\Big)
$$

Soft-attention. Instead of choosing the single maximum attention score, here we apply a softmax function to normalize the attention scores over all diagnosis descriptions onto the probability simplex. The normalized attention scores are used as the weights of the different diagnosis descriptions. We then use the weighted average of the hidden representations of the diagnosis descriptions as the attentional hidden vector. In this way, the attentional hidden vector takes into account all diagnosis descriptions with varying levels of attention. The attentional vector of the $i$-th ICD code is denoted as $\tilde{\mathbf{u}}_i$:

$$
\tilde{\mathbf{u}}_i = \sum_{j=1}^{m} \frac{\exp(a_{i,j})}{\sum_{j'=1}^{m} \exp(a_{i,j'})}\, \mathbf{h}_j
$$
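The two attention layers can be sketched in Python as follows, taking the attention scores of one ICD code as input. The function names and toy inputs are assumptions for illustration, and subtracting the maximum inside the softmax is a common numerical-stability choice rather than something stated in the text.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    """Normalise scores onto the probability simplex."""
    top = max(scores)  # subtracting the maximum avoids overflow
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hard_selection(scores_i):
    """Probability of code i from its best-matching description only."""
    return sigmoid(max(scores_i))


def soft_attention(scores_i, description_vectors):
    """Attention-weighted average of all description vectors for code i."""
    weights = softmax(scores_i)
    dim = len(description_vectors[0])
    return [
        sum(w * h[k] for w, h in zip(weights, description_vectors))
        for k in range(dim)
    ]


# Toy inputs: scores of one code on three descriptions, and two 2-d vectors.
p_hard = hard_selection([1.0, 0.0, -1.0])
u_tilde = soft_attention([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])
print(p_hard, u_tilde)
```

Under the description above, cosine scores lie in [-1, 1], so the Hard-selection probability always falls between sigmoid(-1) and sigmoid(1), roughly 0.27 to 0.73. The Soft-attention output instead feeds a learned projection layer, which is not restricted in this way.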

Linear projection layer
For the attentional hidden vector $\tilde{\mathbf{u}}_i$, we use a linear Perceptron as the output layer to project the vector into a real value [21,22], which represents the confidence score of predicting the label to be true. Each code has its own Perceptron parameters. Finally, we apply the sigmoid function to normalize the confidence score into a probability, which ranges from 0 to 1.

Parameter learning
We use binary cross entropy as the loss function for each ICD code [23,24]. The loss function for each patient record can be formulated as follows:

$$
L = -\sum_{i=1}^{n} \big[\, t_i \log p_i + (1 - t_i) \log (1 - p_i) \,\big]
$$

where $p_i$ is the predicted probability of the $i$-th ICD code and $t_i$ is its real label, i.e., true (1) or false (0). All parameters are learned by minimizing the loss function with stochastic gradient descent [25].
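A short sketch of the output layer and the per-record loss is given below. It assumes that the loss is summed over the codes of one record (an average would differ only by a constant factor) and clips probabilities away from 0 and 1 for numerical safety; both choices, along with the function names and toy numbers, are assumptions for illustration.

```python
import math


def predict_probability(attentional_vector, weights, bias):
    """Per-code linear projection followed by a sigmoid."""
    z = sum(w * x for w, x in zip(weights, attentional_vector)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def record_loss(probabilities, labels, eps=1e-12):
    """Binary cross entropy summed over the ICD codes of one patient record."""
    total = 0.0
    for p, t in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total


# Toy record with two codes: the first is a true label, the second is not.
loss_good = record_loss([0.9, 0.2], [1, 0])
loss_bad = record_loss([0.2, 0.9], [1, 0])
p_neutral = predict_probability([0.0, 0.0], [1.0, 1.0], 0.0)
print(loss_good, loss_bad, p_neutral)
```

Confident correct predictions give a small loss, while confident wrong predictions are penalised heavily, which is what drives the gradient updates.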

Hyperparameter setting
The model is trained on the training set using the standard ADAM optimizer [26], with an initial learning rate of 0.001 and a mini-batch size of 10. Hyper-parameters are tuned on the validation set. In particular, the number of hidden units and output units of all LSTM modules is 200. For the word-level LSTM in our experiment, we apply a dropout layer with a dropout probability of 0.5 to the output, to reduce overfitting [27]. Since our model provides a probability score for each ICD code assignment, we also tune on the validation set the threshold that cuts the probability score into a binary output, i.e., true or false, so as to obtain the best F1 score.
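The threshold search can be illustrated with a simple grid search over candidate cut-offs, scoring each one by micro F1 on validation predictions. The candidate grid, the function names, and the toy validation data are assumptions; the text does not specify how the search was carried out.

```python
def micro_f1(probabilities, labels, threshold):
    """Micro F1 over all (record, code) pairs at a given probability cut-off."""
    tp = fp = fn = 0
    for record_probs, record_labels in zip(probabilities, labels):
        for p, t in zip(record_probs, record_labels):
            predicted = p >= threshold
            if predicted and t:
                tp += 1
            elif predicted:
                fp += 1
            elif t:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_threshold(probabilities, labels, candidates=None):
    """Return the first candidate threshold that maximises micro F1."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]  # assumed grid 0.01..0.99
    return max(candidates, key=lambda th: micro_f1(probabilities, labels, th))


# Toy validation set: two records, three codes each.
val_probs = [[0.9, 0.4, 0.1], [0.6, 0.3, 0.7]]
val_labels = [[1, 1, 0], [1, 0, 1]]
threshold = best_threshold(val_probs, val_labels)
print(threshold, micro_f1(val_probs, val_labels, threshold))
```

The same function also shows the trade-off available at deployment time: lowering the threshold below the tuned value raises recall, and raising it favours precision.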

Analysis and evaluation
Because ICD code assignment is generally sparse, with most ICD codes labeled as false and only a few as true, we use the micro F1-score and the micro AUC ROC (area under the receiver operating characteristic curve) score as the quantitative metrics. The micro F1-score is the harmonic mean of precision and recall computed over all record-code pairs. It is widely used to evaluate the performance of a binary classifier on imbalanced data [28]. The micro AUC ROC score is calculated as the area under the ROC curve, which is drawn by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings [29,30]. Intuitively, the AUC ROC score measures the probability that the model assigns a higher score to a positive instance than to a negative one; a random classifier obtains 0.5.

Table 2 shows the F1-score and AUC ROC score of different models evaluated on the test set. We can observe that with the Soft-attention mechanism, the F1 and AUC ROC increase by 5.2 and 2.3 percent, respectively, compared to the Hard-selection model. To further explore the efficacy of the different modules in our model, we perform an ablation study on our intact Soft-attention model, which combines the character-level and word-level LSTM encoders with the Soft-attention mechanism. The performance drops observed in each ablation experiment indicate that all the designed modules contribute to the coding process. The attention scores on a subset of the 50 ICD codes are shown in Table 3.
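The probabilistic reading of AUC ROC given above can be turned directly into code: pool all record-code pairs, then count how often a positive pair receives a higher score than a negative pair, with ties counted as one half. This pairwise formulation is a sketch chosen for clarity, not necessarily the routine used to produce the reported numbers; the function names and toy data are assumptions.

```python
def pooled_scores(probabilities, labels, wanted):
    """Collect scores of all (record, code) pairs whose label equals `wanted`."""
    return [
        p
        for record_probs, record_labels in zip(probabilities, labels)
        for p, t in zip(record_probs, record_labels)
        if bool(t) == wanted
    ]


def micro_auc_roc(probabilities, labels):
    """Fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = pooled_scores(probabilities, labels, True)
    negatives = pooled_scores(probabilities, labels, False)
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative pair")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy test set: two records, three codes each.
test_probs = [[0.9, 0.4, 0.1], [0.6, 0.3, 0.7]]
test_labels = [[1, 1, 0], [0, 0, 1]]
auc = micro_auc_roc(test_probs, test_labels)
print(auc)
```

In the toy data, eight of the nine positive-negative pairs are ordered correctly, so the score is 8/9.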

Discussion
Our model achieves promising performance on the ICD coding task, which indicates that our architecture is reasonable: extracting diagnosis descriptions from the discharge summary and using their hidden representations to predict each target ICD code, whose meaning is represented by its formal title, is a promising methodology for automated ICD coding. The Soft-attention mechanism lets the model allocate different amounts of attention to multiple diagnosis descriptions, which yields better performance than Hard-selection. In the following, we discuss in detail several important insights provided by our ablation study on the Soft-attention model.

Table 3. Attention allocation on one sample. The top part shows all the diagnosis descriptions extracted from the written discharge diagnosis given by physicians. The bottom part shows the attention allocation on a subset of ICD codes. The codes in bold have true labels.

To evaluate the effectiveness of the character-level LSTM module in learning the hidden representation of medical vocabulary, we remove it from the model. Instead, we obtain the hidden vector of each word from a tunable word embedding layer, where each word is assigned a fixed-dimension vector. The other modules remain intact and the Soft-attention strategy is used. This causes drops of 0.024 and 0.018 in the F1 score and AUC ROC, respectively, which demonstrates the necessity of incorporating a character-level encoder into representation learning. We have also tried initializing the word embedding layer with pre-trained word vectors. These vectors were learned with the word2vec tool on a large corpus of medical research papers collected from Pubmed, BioMed and PLOS [31]. In this setting, all words are transformed to lowercase and lemmatized in advance. The performance increases compared to randomly initialized word embeddings, but it is still lower than that of our character-level LSTM model.

To check that our character-level LSTM module gives reasonable representations for diagnosis descriptions, we examined the nearest neighbors of words and sentences based on Euclidean distance in the hidden space. Table 4 shows a subset of words and sentences and their nearest neighbors. The top of the table contrasts the word neighbors obtained by the model with the character-level LSTM and by the word embedding layer with pre-trained word vectors. First, it indicates that the character-level LSTM word encoder can absorb various typos and recognize different morphological forms appearing in the written diagnosis descriptions, by generating similar representations for them. For example, our model can recognize different written variants of 'Ischemia' and generate nearby representations for them. In addition, many disease names and procedures with the same suffix denote similar diseases, which our character-level LSTM encoder captures efficiently.

On the other hand, some words with the same suffix but unrelated meanings are also located close together in the hidden space. However, in many cases these words, such as 'state', do not denote disease categories, so they should have little effect on coding. Looking at the sentence neighbors shown at the bottom of Table 4, we observe that our model generates nearby embeddings for similar sentences too.
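The nearest-neighbor check can be reproduced with a few lines of Python once hidden vectors are available. The vectors below are invented toy values used only to show the mechanics; they are not outputs of the model, and the function name is hypothetical.

```python
import math


def nearest_neighbors(query, vectors, k=3):
    """Return the k names closest to `query` by Euclidean distance."""
    distances = [
        (math.dist(vectors[query], vector), name)
        for name, vector in vectors.items()
        if name != query
    ]
    return [name for _, name in sorted(distances)[:k]]


# Invented 2-d vectors standing in for learned word representations.
toy_vectors = {
    "ischemia": [1.0, 0.0],
    "ischemic": [0.9, 0.1],
    "ishemia": [1.0, 0.1],  # misspelled variant placed close on purpose
    "fracture": [-1.0, 0.5],
}
neighbors = nearest_neighbors("ischemia", toy_vectors, k=2)
print(neighbors)
```

Applied to real hidden vectors, the same routine lists, for any word or sentence, the entries that the encoder considers most similar.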
Besides the word-level LSTM encoder, the word averaging method has also been shown to perform strongly for generating sentence embeddings [32]. We have therefore also tried averaging the word embeddings in one sentence to obtain the hidden vector of the sentence, instead of using the LSTM encoder. Keeping the other modules intact, the F1 drops by 0.028 and the AUC ROC drops by 0.014, which indicates that the word-level LSTM encoder is superior to word averaging.
We evaluate the necessity of the attention mechanism by comparing our Soft-attention model with a linear classifier without an attention mechanism. We design the architecture of the linear classifier as follows. For each ICD code, we concatenate its hidden vector with the representation of the overall diagnosis, which is obtained by averaging the hidden vectors of all diagnosis descriptions. The concatenated vector is processed by a linear Perceptron to get the confidence score. The parameters of the linear Perceptron are independent across ICD codes. Replacing the attention mechanism with such a linear classifier causes the F1 and AUC ROC to drop by 0.061 and 0.018, respectively, which demonstrates the advantage of our attention mechanism.
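The no-attention baseline can be sketched as below: the description vectors are averaged, concatenated with the code vector, and passed through a per-code linear layer. Applying a sigmoid to the confidence score is an assumption made here so that the output is comparable to the attention models; the function names and toy numbers are likewise hypothetical. The same `mean_vector` helper also illustrates the word-averaging sentence encoder mentioned above.

```python
import math


def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def baseline_probability(code_vector, description_vectors, weights, bias):
    """Concatenate [code ; mean(descriptions)] and apply a per-code linear layer."""
    features = list(code_vector) + mean_vector(description_vectors)
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid assumed for comparability


# Toy example: one code vector and two description vectors, all 2-dimensional.
toy_code = [1.0, 0.0]
toy_descs = [[0.0, 2.0], [2.0, 0.0]]
toy_weights = [0.5, 0.5, -0.5, 0.0]  # one weight per concatenated feature
p_baseline = baseline_probability(toy_code, toy_descs, toy_weights, 0.0)
print(mean_vector(toy_descs), p_baseline)
```

Because every description receives the same weight in the average, the baseline cannot emphasise the description that matters for a given code, which is the capability the attention mechanism adds.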

Limitations
The performance achieved by our hierarchical neural models with the Soft-attention mechanism shows that reliable performance can be obtained even with a simple diagnosis description extraction process. However, considering the noisy format of the electronic discharge summary, we believe the performance could be improved further with more elaborate diagnosis extraction and a cleaner corpus of high-quality diagnosis descriptions.
Another limitation of our study is that the candidate ICD codes are restricted to the 50 most frequent ones. If an ICD code is too rare, there will not be enough evidence to train a valid neural model, and the label imbalance problem will be more severe [33], which makes learning harder. Obtaining more well-formatted records should help the model to learn.
By using separate linear Perceptrons to assign each ICD code, we have assumed that the assignments of different ICD codes are mutually independent. However, ICD codes do correlate with each other to some extent. For example, the long title of ICD code '40390' is 'Hypertensive chronic kidney disease, unspecified, with chronic kidney disease stage I through stage IV, or unspecified', while that of '40391' is 'Hypertensive chronic kidney disease, unspecified, with chronic kidney disease stage V or end stage renal disease'. If ICD code '40390' is assigned to a patient record, ICD code '40391' should not be assigned, since these two codes represent mutually exclusive health conditions. It might also be helpful to leverage the hierarchical structure of ICD codes [15]. Thus, modeling such correlations with structured methods may be a meaningful way to improve performance.

Conclusions
We find it promising to construct a high-quality ICD coding machine directly from the diagnosis descriptions in electronic medical records. Our model achieves an F1 score of 0.53 and an AUC ROC of 0.90, suggesting that an attentional match between the diagnosis descriptions and the textual definitions of ICD codes is well suited to the inference task. The proposed Soft-attention mechanism can learn to allocate varying attention strengths to multiple diagnosis descriptions when assigning ICD codes. Our experiment indicates the potential for real-life applications, given this performance even on noisily formatted data. We believe that with more elaborate data preprocessing techniques, and with more well-formatted electronic medical records, automatic coding can be even more accurate. Since our model gives a probability score when assigning an ICD code, we can decrease the probability threshold to get a higher recall or increase it to get a higher precision. Thus, in addition to coding ICD diagnoses directly, our model can also serve as an assistant tool for doctors, helping them to pre-select a small set of candidate codes and thus greatly alleviating their workload.
In this paper we have adopted ICD-9 diagnostic codes as the coding target; however, the proposed approach can be straightforwardly adapted to newer revisions of ICD codes, such as ICD-10 [34], as long as the formal definitions of all codes and gold-standard diagnostic codes for the training data are available.