Explainable Prediction of Medical Codes from Clinical Text

Clinical notes are text documents that are created by clinicians for each patient encounter. They are typically accompanied by medical codes, which describe the diagnosis and treatment. Annotating these codes is labor intensive and error prone; furthermore, the connection between the codes and the text is not annotated, obscuring the reasons and details behind specific diagnoses and treatments. We present an attentional convolutional network that predicts medical codes from clinical text. Our method aggregates information across the document using a convolutional neural network, and uses an attention mechanism to select the most relevant segments for each of the thousands of possible codes. The method is accurate, achieving precision@8 of 0.71 and a Micro-F1 of 0.54, which are both better than the prior state of the art. Furthermore, through an interpretability evaluation by a physician, we show that the attention mechanism identifies meaningful explanations for each code assignment.


Introduction
Clinical notes are free-text narratives generated by clinicians during patient encounters. They are typically accompanied by a set of metadata codes from the International Classification of Diseases (ICD), which present a standardized way of indicating diagnoses and procedures that were performed during the encounter. ICD codes have a variety of uses, ranging from billing to predictive modeling of patient state (Choi et al., 2016; Ranganath et al., 2015; Denny et al., 2010; Avati et al., 2017). Because manual coding is time-consuming and error-prone, automatic coding has been studied since at least the 1990s (de Lima et al., 1998). The task is difficult for two main reasons. First, the label space is very high-dimensional, with over 15,000 codes in the ICD-9 taxonomy, and over 140,000 codes combined in the newer ICD-10-CM and ICD-10-PCS taxonomies (World Health Organization, 2016). Second, clinical text includes irrelevant information, misspellings and non-standard abbreviations, and a large medical vocabulary. These features combine to make the prediction of ICD codes from clinical notes an especially difficult task, for computers and human coders alike (Birman-Deych et al., 2005).
In this application paper, we develop convolutional neural network (CNN)-based methods for automatic ICD code assignment based on text discharge summaries from intensive care unit (ICU) stays. To better adapt to the multi-label setting, we employ a per-label attention mechanism, which allows our model to learn distinct document representations for each label. We call our method Convolutional Attention for Multi-Label classification (CAML). Our model design is motivated by the conjecture that important information correlated with a code's presence may be contained in short snippets of text which could be anywhere in the document, and that these snippets likely differ for different labels. To cope with the large label space, we exploit the textual descriptions of each code to guide our model towards appropriate parameters: in the absence of many labeled examples for a given code, its parameters should be similar to those of codes with similar textual descriptions.
We evaluate our approach on two versions of MIMIC (Johnson et al., 2016), an open dataset of ICU medical records. Each record includes a variety of narrative notes describing a patient's stay, including diagnoses and procedures. Our approach substantially outperforms previous results on medical code prediction on both MIMIC-II and MIMIC-III datasets.
We consider applications of this work in a decision support setting. Interpretability is important for any decision support system, especially in the medical domain. The system should be able to explain why it predicted each code; even if the codes are manually annotated, it is desirable to explain what parts of the text are most relevant to each code. These considerations further motivate our per-label attention mechanism, which assigns importance values to n-grams in the input document, and which can therefore provide explanations for each code, in the form of extracted snippets of text from the input document. We perform a human evaluation of the quality of the explanations provided by the attention mechanism, asking a physician to rate the informativeness of a set of automatically generated explanations. 1
Table 1: Presentation of example qualitative evaluations. In the real evaluation, the names of the systems generating each 4-gram are not given. An 'I' marking indicates a snippet evaluated as informative, and 'HI' indicates that it is highly informative; see § 4 for more details.

Method
We treat ICD-9 code prediction as a multilabel text classification problem (McCallum, 1999). 2 Let $\mathcal{L}$ represent the set of ICD-9 codes; the labeling problem for instance $i$ is to determine $y_{i,\ell} \in \{0, 1\}$ for all $\ell \in \mathcal{L}$. We train a neural network which passes text through a convolutional layer to compute a base representation of the text of each document (Kim, 2014), and makes $|\mathcal{L}|$ binary classification decisions. Rather than aggregating across this representation with a pooling operation, we apply an attention mechanism to select the parts of the document that are most relevant for each possible code. These attention weights are then applied to the base representation, and the result is passed through an output layer, using a sigmoid transformation to compute the likelihood of each code. We employ a regularizer to encourage each code's parameters to be similar to those of codes with similar textual descriptions. We now describe each of these elements in more detail.
1 Our code, data splits, and pre-trained models are available at github.com/jamesmullenbach/caml-mimic.
2 We focus on codes from the ICD-9 taxonomy, rather than the more recent ICD-10, for the simple reason that this is the version of ICD used in the MIMIC datasets.

Convolutional architecture
At the base layer of the model, we have $d_e$-dimensional pre-trained embeddings for each word in the document, which are horizontally concatenated into the matrix $X = [x_1, x_2, \ldots, x_N]$, where $N$ is the length of the document. Adjacent word embeddings are combined using a convolutional filter $W_c \in \mathbb{R}^{k \times d_e \times d_c}$, where $k$ is the filter width, $d_e$ the size of the input embedding, and $d_c$ the size of the filter output. At each step $n$, we compute
$$h_n = g(W_c * x_{n:n+k-1} + b_c),$$
where $*$ denotes the convolution operator, $g$ is an element-wise nonlinear transformation, and $b_c \in \mathbb{R}^{d_c}$ is the bias. We additionally pad each side of the input with zeros so that the resulting matrix $H$ has dimension $\mathbb{R}^{d_c \times N}$.
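The convolution step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function name `conv1d_same`, the default tanh nonlinearity, the toy dimensions, and the way the zero padding is split between the two ends are all choices made here for the example.

```python
import math

def conv1d_same(X, W, b, g=math.tanh):
    """Slide a width-k filter bank over word embeddings, zero-padding the ends.

    X: list of N word embeddings, each a list of length d_e.
    W: filter weights indexed W[t][i][j] (tap t < k, input dim i, output dim j).
    b: bias, a list of length d_c.
    Returns H as a list of N vectors of length d_c (one per position).
    """
    k, d_e, d_c = len(W), len(W[0]), len(b)
    left = (k - 1) // 2  # assumed padding split; the remainder goes on the right
    zero = [0.0] * d_e
    padded = [zero] * left + X + [zero] * (k - 1 - left)
    H = []
    for n in range(len(X)):
        h_n = []
        for j in range(d_c):
            s = b[j]
            for t in range(k):
                for i in range(d_e):
                    s += W[t][i][j] * padded[n + t][i]
            h_n.append(g(s))
        H.append(h_n)
    return H

# Toy example: N = 4 words, d_e = 2, k = 3, d_c = 1, identity nonlinearity.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
W = [[[1.0], [1.0]] for _ in range(3)]  # every tap sums both input dimensions
H = conv1d_same(X, W, b=[0.0], g=lambda s: s)
```

With padding, the output has one vector per input word, so later steps can map each position back to a location in the document.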

Attention
After convolution, the document is represented by the matrix $H \in \mathbb{R}^{d_c \times N}$. It is typical to reduce this matrix to a vector by applying pooling across the length of the document, by selecting the maximum or average value at each row (Kim, 2014). However, our goal is to assign multiple labels (i.e., medical codes) for each document, and different parts of the base representation may be relevant for different labels. For this reason, we apply a per-label attention mechanism. An additional benefit is that it selects the $k$-grams from the text that are most relevant to each predicted label.
Formally, for each label $\ell$, we compute the matrix-vector product, $H^\top u_\ell$, where $u_\ell \in \mathbb{R}^{d_c}$ is a vector parameter for label $\ell$. We then pass the resulting vector through a softmax operator, obtaining a distribution over locations in the document,
$$\alpha_\ell = \mathrm{SoftMax}(H^\top u_\ell),$$
where $\mathrm{SoftMax}(x) = \frac{\exp(x)}{\sum_i \exp(x_i)}$, and $\exp(x)$ is the element-wise exponentiation of the vector $x$. The attention vector $\alpha_\ell$ is then used to compute vector representations for each label,
$$v_\ell = \sum_{n=1}^{N} \alpha_{\ell,n} h_n.$$
As a baseline model, we instead use max-pooling to compute a single vector $v$ for all labels,
$$v_j = \max_n h_{n,j}.$$
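The per-label attention and the max-pooling baseline can be illustrated with a small sketch. The function names and toy values are assumptions made for the example; here `H` is stored as a list of position vectors, and `u` stands for the attention parameter vector of a single label.

```python
import math

def softmax(x):
    m = max(x)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in x]
    total = sum(e)
    return [v / total for v in e]

def label_attention(H, u):
    """H: N position vectors of length d_c; u: attention vector for one label.

    Returns (alpha, v): weights over positions and the attended representation.
    """
    scores = [sum(h_j * u_j for h_j, u_j in zip(h_n, u)) for h_n in H]
    alpha = softmax(scores)
    v = [sum(alpha[n] * H[n][j] for n in range(len(H))) for j in range(len(u))]
    return alpha, v

def max_pool(H):
    """Baseline: one vector shared by all labels, the max of each filter."""
    return [max(h_n[j] for h_n in H) for j in range(len(H[0]))]

H = [[1.0, 0.0], [0.0, 2.0], [1.0, 0.0]]
alpha, v = label_attention(H, u=[0.0, 0.0])      # zero u gives uniform attention
alpha2, v2 = label_attention(H, u=[0.0, 5.0])    # this label prefers position 1
```

Because each label has its own `u`, two labels can attend to different positions of the same document, whereas `max_pool` returns the same vector regardless of the label.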

Classification
Given the vector document representation $v_\ell$, we compute a probability for label $\ell$ using another linear layer and a sigmoid transformation:
$$\hat{y}_\ell = \sigma(\beta_\ell^\top v_\ell + b_\ell),$$
where $\beta_\ell \in \mathbb{R}^{d_c}$ is a vector of prediction weights, and $b_\ell$ is a scalar offset. The overall model is illustrated in Figure 1.

Training
The training procedure minimizes the binary cross-entropy loss, plus the L2 norm of the model weights, using the Adam optimizer (Kingma and Ba, 2015).
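As an illustration of the output layer and the training loss, the sketch below computes a sigmoid probability for one label and the binary cross-entropy over a set of labels. It is a simplified sketch: the function names are hypothetical, the loss is summed over labels (averaging is an equally common convention), and the L2 penalty and the Adam update are omitted.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def label_probability(v, beta, b):
    """Linear layer followed by a sigmoid, for a single label."""
    return sigmoid(sum(w * x for w, x in zip(beta, v)) + b)

def bce_loss(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy summed over labels; eps guards against log(0)."""
    loss = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return loss

p = label_probability(v=[0.5, -0.5], beta=[1.0, 1.0], b=0.0)  # pre-activation is 0
loss = bce_loss([1, 0], [p, p])
```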

Embedding label descriptions
Due to the dimensionality of the label space, many codes are rarely observed in the labeled data. To improve performance on these codes, we use text descriptions of each code from the World Health Organization (2016). Examples can be found in Table 1, next to the code numbers. We use these descriptions to build a secondary module in our network that learns to embed them as vectors. These vectors are then used as the target of regularization on the model parameters $\beta_\ell$. If code $\ell$ is rarely observed in the training data, this regularizer will encourage its parameters to be similar to those of other codes with similar descriptions.
The code embedding module consists of a max-pooling CNN architecture. Let $z_\ell$ be a max-pooled vector, obtained by passing the description for code $\ell$ into the module. Let $n_y$ be the number of true labels in a training example. We add the following regularizing objective to our loss $L$,
$$L(X, y) = L_{\mathrm{BCE}} + \lambda \frac{1}{n_y} \sum_{\ell : y_\ell = 1} \lVert z_\ell - \beta_\ell \rVert_2,$$
where $\lambda$ is a tradeoff hyperparameter that calibrates the performance of the two objectives. We call this model variant Description Regularized-CAML (DR-CAML).
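A minimal sketch of the description regularizer follows, assuming it takes the form of a tradeoff coefficient times the mean, over the true labels of an example, of the L2 distance between each label's description embedding and its prediction weights. The function name and toy vectors are hypothetical, and the description embeddings are passed in directly rather than produced by a CNN.

```python
import math

def description_regularizer(betas, zs, y, lam):
    """lam * (1 / n_y) * sum over true labels of ||z_l - beta_l||_2.

    betas: per-label prediction weight vectors.
    zs:    per-label description embeddings (same dimension as betas).
    y:     0/1 gold label vector for one training example.
    """
    true_labels = [l for l, y_l in enumerate(y) if y_l == 1]
    if not true_labels:
        return 0.0  # assumed convention for an example with no true labels
    total = 0.0
    for l in true_labels:
        total += math.sqrt(sum((z - b) ** 2 for z, b in zip(zs[l], betas[l])))
    return lam * total / len(true_labels)

betas = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
zs = [[3.0, 4.0], [1.0, 1.0], [0.0, 0.0]]
penalty = description_regularizer(betas, zs, y=[1, 1, 0], lam=0.01)
```

Only labels that are present in the example contribute, so the penalty pulls the weights of observed codes toward their description embeddings.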

Evaluation of code prediction
This section evaluates the accuracy of code prediction, comparing our models against several competitive baselines. MIMIC-III (Johnson et al., 2016) is an open-access dataset of text and structured records from a hospital ICU. Following previous work, we focus on discharge summaries, which condense information about a stay into a single document. In MIMIC-III, some admissions have addenda to their summary, which we concatenate to form one document. Each admission is tagged by human coders with a set of ICD-9 codes, describing both diagnoses and procedures which occurred during the patient's stay. There are 8,921 unique ICD-9 codes present in our datasets, including 6,918 diagnosis codes and 2,003 procedure codes. Some patients have multiple admissions and therefore multiple discharge summaries; we split the data by patient ID, so that no patient appears in both the training and test sets.
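Splitting by patient ID, rather than by discharge summary, can be sketched as below. The function name, the test fraction, and the random seed are assumptions for illustration and do not reproduce the released data splits.

```python
import random

def split_by_patient(records, frac_test=0.1, seed=0):
    """records: list of (patient_id, summary) pairs.

    Whole patients are assigned to one side, so no patient contributes
    summaries to both the training and the test set.
    """
    patients = sorted({pid for pid, _ in records})
    random.Random(seed).shuffle(patients)
    n_test = max(1, int(len(patients) * frac_test))
    test_ids = set(patients[:n_test])
    train = [r for r in records if r[0] not in test_ids]
    test = [r for r in records if r[0] in test_ids]
    return train, test

# Patients 1 and 3 have two admissions each.
records = [(pid, f"note{i}") for i, pid in enumerate([1, 1, 2, 3, 3, 4, 5, 6, 7, 8, 9, 10])]
train, test = split_by_patient(records, frac_test=0.2)
```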

Datasets
In this full-label setting, we use a set of 47,724 discharge summaries from 36,998 patients for training, with 1,632 summaries and 3,372 summaries for validation and testing, respectively.

Secondary evaluations
For comparison with prior work, we also follow Shi et al. (2017) and train and evaluate on a label set consisting of the 50 most frequent labels. In this setting, we filter each dataset down to the instances that have at least one of the top 50 most frequent codes, and subset the training data to equal the size of the training set of Shi et al. (2017), resulting in 8,067 summaries for training, 1,574 for validation, and 1,730 for testing.
We also run experiments with the MIMIC-II dataset, to compare with prior work by Baumel et al. (2018) and Perotte et al. (2013). We use the train/test split of Perotte et al. (2013); see Table 2.
Preprocessing. We remove tokens that contain no alphabetic characters (e.g., removing "500" but keeping "250mg"), lowercase all tokens, and replace tokens that appear in fewer than three training documents with an 'UNK' token. We pretrain word embeddings of size $d_e = 100$ using the word2vec CBOW method (Mikolov et al., 2013) on the preprocessed text from all discharge summaries. All documents are truncated to a maximum length of 2500 tokens.
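The preprocessing steps can be sketched as follows. Whitespace tokenization and the function names are assumptions made for the example; the thresholds (three training documents, 2500 tokens) follow the description above.

```python
from collections import Counter

def tokenize(text):
    # Lowercase, split on whitespace (assumed), and drop tokens that
    # contain no alphabetic character.
    return [t for t in text.lower().split() if any(ch.isalpha() for ch in t)]

def build_vocab(train_docs, min_doc_freq=3):
    """Keep tokens that appear in at least min_doc_freq training documents."""
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(tokenize(doc)))
    return {t for t, c in doc_freq.items() if c >= min_doc_freq}

def preprocess(text, vocab, max_len=2500, unk="UNK"):
    tokens = [t if t in vocab else unk for t in tokenize(text)]
    return tokens[:max_len]

train_docs = ["Aspirin 500 250mg daily", "aspirin 250mg", "aspirin 250mg twice", "rare"]
vocab = build_vocab(train_docs)
example = preprocess("Aspirin 500 250mg nightly", vocab)
```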

Systems
We compare against the following baselines:
• a single-layer one-dimensional convolutional neural network (Kim, 2014);
• a bag-of-words logistic regression model;
• a bidirectional gated recurrent unit (Bi-GRU). 3
For the CNN and Bi-GRU, we initialize the embedding weights using the same pretrained word2vec vectors that we use for the CAML models. All neural models are implemented using PyTorch. 4 The logistic regression model consists of $|\mathcal{L}|$ binary one-vs-rest classifiers acting on unigram bag-of-words features for all labels present in the training data. If a label is not present in the training data, the model will never predict it in the held-out data.

Parameter tuning
We tune the hyperparameters of the CAML model and the neural baselines using the Spearmint Bayesian optimization package (Snoek et al., 2012; Swersky et al., 2013). 5 We allow Spearmint to sample parameter values for the L2 penalty on the model weights and the learning rate, as well as the filter size $k$, number of filters $d_c$, and dropout probability for the convolutional models, and the number and dimension of hidden layers for the Bi-GRU, using precision@8 on the MIMIC-III full-label validation set as the performance measure. We use these parameters for DR-CAML as well, and port the optimized parameters to the MIMIC-II full-label and MIMIC-III 50-label models, and manually fine-tune the learning rate in these settings. We select the regularization coefficient $\lambda$ for DR-CAML based on pilot experiments on the validation sets. Hyperparameter tuning is summarized in Table 3. Convolutional models are trained with dropout after the embedding layer. We use a fixed batch size of 16 for all models and datasets. Models are trained with early stopping on the validation set; training terminates after the precision@8 does not improve for 10 epochs, and the model at the time of the highest precision@8 is used on the test set.
3 Our pilot experiments found that GRU was stronger than long short-term memory (LSTM) for this task.
4 https://github.com/pytorch/pytorch
5 https://github.com/HIPS/Spearmint
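The early stopping rule can be sketched as a function over a sequence of per-epoch validation scores. The function name and the example scores are hypothetical; in practice the scores would be validation precision@8 values computed after each epoch.

```python
def early_stopping_epoch(val_scores, patience=10):
    """Return (best_epoch, stop_epoch) for a sequence of validation scores.

    Training stops once the score has gone `patience` epochs without
    improving; the model from best_epoch is the one kept for testing.
    """
    best, best_epoch = float("-inf"), -1
    for epoch, score in enumerate(val_scores):
        if score > best:
            best, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_scores) - 1

# With patience 2, the peak at epoch 1 is kept and training stops at epoch 3.
result = early_stopping_epoch([0.10, 0.30, 0.20, 0.25, 0.29], patience=2)
```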

Evaluation Metrics
To facilitate comparison with both future and prior work, we report a variety of metrics, focusing on the micro-averaged and macro-averaged F1 and area under the ROC curve (AUC). Micro-averaged values are calculated by treating each (text, code) pair as a separate prediction. Macro-averaged values, while less frequently reported in the multilabel classification literature, are calculated by averaging metrics computed per-label. For recall, the metrics are distinguished as follows:
$$\text{Micro-R} = \frac{\sum_{\ell=1}^{|\mathcal{L}|} \mathrm{TP}_\ell}{\sum_{\ell=1}^{|\mathcal{L}|} (\mathrm{TP}_\ell + \mathrm{FN}_\ell)}, \qquad \text{Macro-R} = \frac{1}{|\mathcal{L}|} \sum_{\ell=1}^{|\mathcal{L}|} \frac{\mathrm{TP}_\ell}{\mathrm{TP}_\ell + \mathrm{FN}_\ell},$$
where TP denotes true positive examples and FN denotes false negative examples. Precision is computed analogously. The macro-averaged metrics place much more emphasis on rare label prediction.
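The difference between micro- and macro-averaged recall can be seen in a small example with one common and one rare label. The function name is hypothetical, and assigning a recall of 0 to labels with no positive examples is an assumed convention.

```python
def micro_macro_recall(y_true, y_pred):
    """y_true, y_pred: lists of per-document 0/1 label vectors."""
    n_labels = len(y_true[0])
    tp = [0] * n_labels
    fn = [0] * n_labels
    for t_row, p_row in zip(y_true, y_pred):
        for l, (t, p) in enumerate(zip(t_row, p_row)):
            if t == 1 and p == 1:
                tp[l] += 1
            elif t == 1 and p == 0:
                fn[l] += 1
    # Micro: pool all (document, label) decisions before dividing.
    micro = sum(tp) / max(sum(tp) + sum(fn), 1)
    # Macro: compute recall per label, then average over labels.
    per_label = [tp[l] / (tp[l] + fn[l]) if tp[l] + fn[l] else 0.0
                 for l in range(n_labels)]
    macro = sum(per_label) / n_labels
    return micro, macro

# Label 0 is common and always found; label 1 is rare and always missed.
y_true = [[1, 0], [1, 0], [1, 0], [1, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 0]]
micro, macro = micro_macro_recall(y_true, y_pred)
```

Missing the single rare label costs 0.2 in micro-averaged recall but 0.5 in macro-averaged recall, which is why macro-averaged metrics emphasize rare labels.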
We also report precision at $n$ (denoted as 'P@n'), which is the fraction of the $n$ highest-scored labels that are present in the ground truth. This is motivated by the potential use case as a decision support application, in which a user is presented with a fixed number of predicted codes to review. In such a case, it is more suitable to select a model with high precision than high recall. We choose $n = 5$ and $n = 8$ to compare with prior work (Vani et al., 2017; Prakash et al., 2017). For the MIMIC-III full label setting, we also compute precision@15, which roughly corresponds to the average number of codes in MIMIC-III discharge summaries (Table 2).
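Precision at n for a single document can be sketched as below; a corpus-level value would average this over documents. The function name, the placeholder label names, and the scores are hypothetical.

```python
def precision_at_n(scores, true_labels, n):
    """scores: dict mapping label -> predicted score for one document.
    true_labels: set of gold labels for that document.
    """
    top = sorted(scores, key=scores.get, reverse=True)[:n]
    return sum(1 for label in top if label in true_labels) / n

scores = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.1}
gold = {"A", "C", "D"}
p_at_2 = precision_at_n(scores, gold, n=2)  # top 2 are A and B, one is correct
p_at_3 = precision_at_n(scores, gold, n=3)  # top 3 are A, B, C, two are correct
```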

Results
Our main quantitative evaluation involves predicting the full set of ICD-9 codes based on the text of the MIMIC-III discharge summaries. These results are shown in Table 4. The CAML model gives the strongest results on all metrics. Attention yields substantial improvements over the "vanilla" convolutional neural network (CNN). The recurrent Bi-GRU architecture is comparable to the vanilla CNN, and the logistic regression baseline is substantially worse than all neural architectures. The best-performing CNN model has 9.86M tunable parameters, compared with 6.14M tunable parameters for CAML. This is due to the hyperparameter search preferring a larger number of filters for the CNN. Finally, we observe that the DR-CAML performs worse on most metrics than CAML, with a tuned regularization coefficient of $\lambda = 0.01$. Among prior work, only Scheurwegs et al. (2017) evaluate on the full ICD-9 code set for MIMIC-III. Their reported results distinguished between diagnosis codes and procedure codes. The CAML models are stronger on both sets. Additionally, our method does not make use of any external information or structured data, whereas Scheurwegs et al. (2017) incorporate structured information from EHRs.
We feel that precision@8 is the most informative of the metrics, as it measures the ability of the system to return a small high-confidence subset of codes. Even with a space of thousands of labels, our models achieve relatively high precision: of the eight most confident predictions, on average 5.5 are correct. It is also apparent how difficult it is to achieve high Macro-F1 scores, due to the metric's emphasis on rare-label performance. To put these results in context, a hypothetical system that performs perfectly on the 500 most common labels, and ignores all others, would achieve a Macro-F1 of 0.052 and a Micro-F1 of 0.842.

Secondary evaluations
To compare with prior published work, we also evaluate on the 50 most common codes in MIMIC-III (Table 5), and on MIMIC-II (Table 6). We report DR-CAML results on the 50-label setting of MIMIC-III with $\lambda = 10$, and on MIMIC-II with $\lambda = 0.1$, which were determined by grid search on a validation set. The other hyperparameters were left at the settings for the main MIMIC-III evaluation, as described in Table 3. In the 50-label setting of MIMIC-III, we see strong improvement over prior work in all reported metrics, as well as against the baselines, with the exception of precision@5, on which the CNN baseline performs best. We hypothesize that this is because the relatively large filter size of $k = 10$ for CAML leads to a larger network that is more suited to larger datasets; tuning CAML's hyperparameters on this dataset would be expected to improve performance on all metrics. Baumel et al. (2018) additionally report a micro-F1 score of 0.407 by training on MIMIC-III, and evaluating on MIMIC-II. Our model achieves better performance using only the (smaller) MIMIC-II training set, leaving this alternative training protocol for future work.

Evaluation of Interpretability
We now evaluate the explanations generated by CAML's attention mechanism, in comparison with three alternative heuristics. A physician was presented with explanations from four methods, using a random sample of 100 predicted codes from the MIMIC-III full-label test set. The most important $k$-gram from each method was extracted, along with a window of five words on either side for context. We select $k = 4$ in this setting to emulate a span of attention over words likely to be given by a human reader. Examples can be found in Table 1. Observe that the snippets may overlap in multiple words. Given a code and its description, the evaluator was asked to select all text snippets that he felt adequately explained the presence of the code, with the option to mark snippets as "highly informative" if they were particularly informative relative to the others.
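Extracting the top-scoring 4-gram with five words of context on either side can be sketched as follows. The function name is hypothetical, and the sketch assumes that the score at position n refers to the 4-gram starting at token n; the actual alignment depends on how the convolution is padded.

```python
def explanation_snippet(tokens, scores, k=4, window=5):
    """Return (kgram, context) for the highest-scoring position.

    scores[n] is the importance of the k-gram assumed to start at token n,
    for example an attention weight for one label.
    """
    start = max(range(len(scores)), key=scores.__getitem__)
    lo = max(0, start - window)
    hi = min(len(tokens), start + k + window)
    return tokens[start:start + k], tokens[lo:hi]

tokens = [f"w{i}" for i in range(20)]
scores = [0.01] * 20
scores[8] = 0.5  # pretend the model attends most strongly to position 8
kgram, context = explanation_snippet(tokens, scores)
```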

Extracting informative text snippets
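CAML
The attention mechanism allows us to extract $k$-grams from the text that are most influential in the prediction of each label, by taking the argmax of the SoftMax output $\alpha_\ell$.

Max-pooling CNN
Defining an argmax vector $a$ that results from the max-pooling step as
$$a_j = \arg\max_n h_{n,j},$$
where $h_{n,j}$ is the output of filter $j$ at position $n$, we can compute the importance of position $n$ for label $\ell$,
$$\alpha_{n,\ell} = \sum_{j : a_j = n} \beta_{\ell,j},$$
where $\beta_\ell$ is the vector of final-layer prediction weights for label $\ell$. We then select the most important $k$-gram for a given label as $\arg\max_n \alpha_{n,\ell}$.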
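The importance score for the max-pooling CNN can be sketched as below, assuming that each filter contributes its final-layer weight to the position where it attains its maximum. The function name and toy values are hypothetical.

```python
def maxpool_importance(H, beta):
    """H: N position vectors of length d_c (filter outputs per position).
    beta: final-layer weights for one label, length d_c.

    Each filter j adds beta[j] to the position where it attains its maximum.
    """
    n_positions, d_c = len(H), len(beta)
    importance = [0.0] * n_positions
    for j in range(d_c):
        a_j = max(range(n_positions), key=lambda n: H[n][j])  # first argmax on ties
        importance[a_j] += beta[j]
    return importance

H = [[0.1, 0.9, 0.2],
     [0.8, 0.1, 0.7],
     [0.3, 0.2, 0.1]]
importance = maxpool_importance(H, beta=[0.5, -0.2, 1.0])
best_position = max(range(len(importance)), key=importance.__getitem__)
```

In this toy case, filters 0 and 2 both peak at position 1, so that position accumulates their weights and is selected as the explanation.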

Logistic regression
The informativeness of each $k$-gram with respect to label $\ell$ is scored by the sum of the coefficients of the weight matrix for $\ell$, over the words in the $k$-gram. The top-scoring $k$-gram is then returned as the explanation.
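The logistic regression heuristic amounts to a sliding-window sum of per-word coefficients, as in this sketch. The function name, the example sentence, and the coefficient values are hypothetical.

```python
def best_kgram_by_coefficients(tokens, coef, k=4):
    """coef: dict mapping word -> logistic regression coefficient for one label.

    Words without a coefficient contribute 0. Returns (kgram, score).
    """
    if len(tokens) < k:
        return tokens, sum(coef.get(t, 0.0) for t in tokens)
    best_start, best_score = 0, float("-inf")
    for s in range(len(tokens) - k + 1):
        score = sum(coef.get(t, 0.0) for t in tokens[s:s + k])
        if score > best_score:  # strict comparison keeps the first of tied windows
            best_start, best_score = s, score
    return tokens[best_start:best_start + k], best_score

tokens = "patient admitted with acute renal failure on dialysis".split()
coef = {"renal": 2.0, "failure": 1.5, "dialysis": 1.0, "patient": -0.5}
kgram, score = best_kgram_by_coefficients(tokens, coef)
```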

Code descriptions
Finally, we calculate a word similarity metric between each stemmed $k$-gram and the stemmed ICD-9 code description. We compute the idf-weighted cosine similarity, with idf weights calculated on the corpus consisting of all notes and relevant code descriptions. We then select the argmax over $k$-grams in the document, breaking ties by selecting the first occurrence. We remove those note-label pairs for which no $k$-gram has a score greater than 0, which gives an "unfair" advantage to this baseline.
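A simplified sketch of the description-similarity baseline is given below. Stemming is omitted, the idf formula log(N / df) is one common variant chosen as an assumption, and the function names and toy corpus are hypothetical.

```python
import math
from collections import Counter

def idf_weights(corpus):
    """corpus: list of token lists. Uses idf(t) = log(N / df(t)) (assumed variant)."""
    n_docs = len(corpus)
    doc_freq = Counter(t for doc in corpus for t in set(doc))
    return {t: math.log(n_docs / c) for t, c in doc_freq.items()}

def idf_cosine(a, b, idf):
    wa = {t: c * idf.get(t, 0.0) for t, c in Counter(a).items()}
    wb = {t: c * idf.get(t, 0.0) for t, c in Counter(b).items()}
    dot = sum(w * wb.get(t, 0.0) for t, w in wa.items())
    norm_a = math.sqrt(sum(w * w for w in wa.values()))
    norm_b = math.sqrt(sum(w * w for w in wb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_kgram_by_description(tokens, description, idf, k=4):
    """Return (start, score); start is None when no k-gram scores above 0."""
    best_start, best_score = None, 0.0
    for s in range(max(len(tokens) - k + 1, 1)):
        score = idf_cosine(tokens[s:s + k], description, idf)
        if score > best_score:  # strict comparison keeps the first occurrence on ties
            best_start, best_score = s, score
    return best_start, best_score

corpus = [["acute", "renal", "failure"], ["renal", "dialysis"],
          ["chest", "pain"], ["acute", "pain"]]
idf = idf_weights(corpus)
description = ["acute", "renal", "failure"]
note = ["patient", "has", "acute", "renal", "failure", "today"]
start, score = best_kgram_by_description(note, description, idf, k=3)
no_match = best_kgram_by_description(["chest", "pain", "today"], description, idf, k=3)
```

A `None` start corresponds to the note-label pairs that are removed because no snippet overlaps with the code description.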

Results
The results of the interpretability evaluation are presented in Table 7. Our model selects the greatest number of "highly informative" explanations, and selects more "informative" explanations than both the CNN baseline and the logistic regression model. While the cosine similarity metric also performs well, the examples in Table 1 demonstrate the strengths of CAML in extracting text snippets in line with more intuitive explanations for the presence of a code. As noted above, there exist some cases, which we exclude, where the cosine similarity method is unable to provide any explanation, because no $k$-grams in a note have a nonzero similarity for a given label description. This occurs for about 12% of all note-label pairs in the test set.

Related Work
Attentional Convolution for NLP. CNNs have been successfully applied to tasks such as sentiment classification (Kim, 2014) and language modeling (Dauphin et al., 2017). Our work combines convolution with attention (Bahdanau et al., 2015; Yang et al., 2016) to select the most relevant parts of the discharge summary. Other recent work has combined convolution and attention (e.g., Allamanis et al., 2016; Yin et al., 2016; dos Santos et al., 2016; Yin and Schütze, 2017). Our attention mechanism is most similar to those of Yang et al. (2016) and Allamanis et al. (2016), in that we use context vectors to compute attention over specific locations in the text. Our work differs in that we compute separate attention weights for each label in our label space, which is better tuned to our goal of selecting locations in a document which are most important for predicting specific labels.
Automatic ICD coding. ICD coding is a longstanding task in the medical informatics community, which has been approached with machine learning and handcrafted methods (Scheurwegs et al., 2015). Many recent approaches, like ours, use unstructured text data as the only source of information (e.g., Kavuluru et al., 2015; Subotin and Davis, 2014), though some incorporate structured data (Wang et al., 2016). Other prior work has relied on datasets that focus on a subset of medical scenarios (Zhang et al., 2017), or evaluated on data that are not publicly available, making direct comparison difficult (Subotin and Davis, 2016). A recent shared task for ICD-10 coding focused on coding of death certificates in English and French (Névéol et al., 2017). This dataset also contains shorter documents than those we consider, with an average of 18 tokens per certificate in the French corpus.
We use the open-access MIMIC datasets containing de-identified, general-purpose records of intensive care unit stays at a single hospital. Perotte et al. (2013) use "flat" and "hierarchical" SVMs; the former treats each code as an individual prediction, while the latter trains on child codes only if the parent code is present, and predicts on child codes only if the parent code was positively predicted. Scheurwegs et al. (2017) use a feature selection approach to ICD-9 and ICD-10 classification, incorporating structured and unstructured text information from EHRs. They evaluate over various medical specialties and on the MIMIC-III dataset. We compare directly to their results on the full label set of MIMIC-III.
Other recent approaches have employed neural network architectures. Baumel et al. (2018) apply recurrent networks with hierarchical sentence and word attention (the HA-GRU) to classify ICD-9 diagnosis codes while providing insights into the model decision process. Similarly, Shi et al. (2017) applied character-aware LSTMs to generate sentence representations from specific subsections of discharge summaries, and apply attention to form a soft matching between the representations and the top 50 codes. Prakash et al. (2017) use memory networks that draw from discharge summaries as well as Wikipedia, to predict top-50 and top-100 codes. Another recent neural architecture is the Grounded Recurrent Neural Network (Vani et al., 2017), which employs a modified GRU with dimensions dedicated to predicting the presence of individual labels. We compare directly with published results from all of these papers, except Vani et al. (2017), who evaluate on only a 5,000-code subset of ICD-9. Empirically, the CAML architecture proposed in this paper yields stronger results across all experimental conditions. We attribute these improvements to the attention mechanism, which focuses on the most critical features for each code, rather than applying a uniform pooling operation for all codes. We also observed that convolution-based models are at least as effective as, and significantly more computationally efficient than, recurrent neural networks such as the Bi-GRU.
Explainable text classification. A goal of this work is that the code predictions be explainable from features of the text. Prior work has also emphasized explainability. Lei et al. (2016) model "rationales" through a latent variable, which tags each word as relevant to the document label. Other work computes the salience of individual words by the derivative of the label score with respect to the word embedding. Ribeiro et al. (2016) use submodular optimization to select a subset of features that closely approximate a specific classification decision (this work is also notable for extensive human evaluations). In comparison to these approaches, we employ a relatively simple attentional architecture; this simplicity is motivated by the challenge of scaling to multi-label classification with thousands of possible labels. Other prior work has emphasized the use of attention for highlighting salient features of the text (e.g., Rush et al., 2015; Rocktäschel et al., 2016), although these papers did not perform human evaluations of the interpretability of the features selected by the attention mechanism.

Conclusions and Future Work
We present CAML, a convolutional neural network for multi-label document classification, which employs an attention mechanism to adaptively pool the convolution output for each label, learning to identify highly predictive locations for each label. CAML yields strong improvements over previously reported results on several formulations of the ICD-9 code prediction task, while providing satisfactory explanations for its predictions. Although we focus on a clinical setting, CAML is extensible without modification to other multi-label document tagging tasks, including ICD-10 coding. We see a number of directions for future work. From the linguistic side, we plan to integrate the document structure of discharge summaries in MIMIC-III, and to better handle non-standard writing and other sources of out-of-vocabulary tokens. From the application perspective, we plan to build models that leverage the hierarchy of ICD codes (Choi et al., 2016), and to attempt the more difficult task of predicting diagnosis and treatment codes for future visits from discharge summaries.