EMR Coding with Semi-Parametric Multi-Head Matching Networks

Coding EMRs with diagnosis and procedure codes is an indispensable task for billing, secondary data analyses, and monitoring health trends. Both speed and accuracy of coding are critical. While coding errors could lead to more patient-side financial burden and misinterpretation of a patient’s well-being, timely coding is also needed to avoid backlogs and additional costs for the healthcare facility. In this paper, we present a new neural network architecture that combines ideas from few-shot learning matching networks, multi-label loss functions, and convolutional neural networks for text classification to significantly outperform other state-of-the-art models. Our evaluations are conducted using a well-known de-identified EMR dataset (MIMIC) with a variety of multi-label performance measures.


Introduction
Electronic medical record (EMR) coding is the process of extracting diagnosis and procedure codes from the digital record (the EMR) pertaining to a patient's visit. The digital record is mostly composed of multiple textual narratives (e.g., discharge summaries, pathology reports, progress notes) authored by healthcare professionals, typically doctors, nurses, and lab technicians. Hospitals heavily invest in training and retaining professional EMR coders to manually annotate all patient visits by reviewing EMRs. Proprietary commercial software tools, often termed computer-assisted coding (CAC) systems, are already in use in many healthcare facilities and were found to be helpful in increasing medical coder productivity (Dougherty et al., 2013). Thus, progress in automated EMR coding methods is expected to directly impact real-world operations.
In the US, the diagnosis and procedure codes used in EMR coding are from the International Classification of Diseases (ICD) terminology (specifically the ICD-10-CM variant) as required by the Health Insurance Portability and Accountability Act (HIPAA). ICD codes facilitate billing activities, retrospective epidemiological studies, and also enable researchers to aggregate health statistics and monitor health trends. To code EMRs effectively, medical coders are expected to have thorough knowledge of ICD-10-CM and follow a complex set of guidelines. For example, if a coder accidentally uses the code "heart failure" (ICD-10-CM code I50) instead of "acute systolic (congestive) heart failure" (ICD-10-CM code I50.21), then the patient may be charged substantially more, causing a significant unfair burden. Therefore, it is important for coders to have better tools at their disposal to find the most appropriate codes. Additionally, if coders become more efficient, hospitals may hire fewer coders to reduce their operating costs. Thus, automated coding methods are expected to help with expedited coding, cost savings, and error control.
In this paper, we treat medical coding of EMR narratives as a multi-label text classification problem. Multi-label classification (MLC) is a machine learning task that assigns a set of labels (typically from a fixed terminology) to an instance. MLC is different from multi-class problems, which assign a single label to each example from a set of labels. Compared to general multi-label problems, EMR coding has three distinct challenges. First, with thousands of ICD codes, the label space is large and the label distribution is extremely unbalanced: most codes occur very infrequently, with a few codes occurring several orders of magnitude more often than others. Second and more importantly, a patient may have a large number of diagnoses and procedures. On average, coders annotate an EMR with more than 20 such codes and hence predicting the top one or two codes is not sufficient for EMR coding. Third, EMR narratives may be very long (e.g., discharge summaries may have over 1000 words), which may result in a needle-in-a-haystack situation when attempting to seek evidence for particular codes.
Recent advances in extreme multi-label classification have proven to work well for large label spaces. Many of these methods (Yu et al., 2014; Bhatia et al., 2015) focus on creating efficient multi-label models that can handle $10^4$ to $10^6$ labels. While these models perform well in large label spaces, they don't necessarily focus on improving prediction of infrequent labels. Typically, they optimize for the top 1, 3, or 5 ranked labels by focusing on the P@1, P@3, and P@5 evaluation measures. The labels ranked at the top usually occur frequently in the dataset and it is not obvious how to handle infrequent labels. One solution would be to ignore the rare labels. However, when the majority of medical codes are infrequent, this solution is unsatisfactory.
While neural networks have shown great promise for text classification (Kim, 2014; Yang et al., 2016; Johnson and Zhang, 2017), the label imbalances associated with EMR coding hinder their performance. Imagine a dataset that contains only one training example for every class, which leads to one-shot learning, a subtask of few-shot learning. How can we classify a new instance? A trivial solution would be to use a non-parametric 1-NN (1 nearest neighbor) classifier. 1-NN does not require learning any label-specific parameters; we only need to define features to represent our data and a distance metric. Unfortunately, defining good features and picking the best distance metric is nontrivial. Instead of manually defining the feature set and distance metric, neural network training procedures have been developed to learn them automatically (Koch et al., 2015). For example, matching networks (Vinyals et al., 2016) can automatically learn discriminative feature representations and a useful distance metric. Therefore, using a 1-NN prediction method, matching networks work well for infrequent labels. However, researchers typically evaluate matching networks on multi-class problems without label imbalance. For EMR coding, where label imbalance is extreme and several labels occur thousands of times, traditional parametric neural networks (Kim, 2014) should work very well on the frequent labels. In this paper, we introduce a new variant of matching networks (Vinyals et al., 2016; Snell et al., 2017) to address the EMR coding problem. Specifically, we combine the non-parametric idea of k-NN and matching networks with traditional neural network text classification methods to handle both frequent and infrequent labels encountered in EMR coding.
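The 1-NN baseline discussed above can be illustrated with a minimal sketch. The two-dimensional feature vectors and the code names below are hypothetical, and plain Euclidean distance is assumed as the metric; a matching network would instead learn both the features and the metric.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def one_nn_predict(query, examples):
    """Return the label of the training example closest to `query`.

    `examples` is a list of (feature_vector, label) pairs. With a single
    example per class this corresponds to the one-shot setting.
    """
    _, label = min(examples, key=lambda ex: euclidean(query, ex[0]))
    return label


# Hypothetical features: one training example per code.
train = [([0.0, 0.0], "code_A"), ([1.0, 1.0], "code_B"), ([5.0, 0.0], "code_C")]
prediction = one_nn_predict([0.9, 1.2], train)
```

No label-specific parameters are learned here; all of the modeling burden falls on the choice of features and distance, which is the part matching networks automate.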
Overall, we make the following contributions in this paper:
• We propose a novel semi-parametric neural matching network for diagnosis/procedure code prediction from EMR narratives. Our architecture employs ideas from matching networks (Vinyals et al., 2016), multiple attention (Lin et al., 2017), multi-label loss functions (Nam et al., 2014a), and convolutional neural networks (CNNs) for text classification (Kim, 2014) to produce a state-of-the-art EMR coding model.
• We evaluate our model on publicly available EMR datasets to ensure reproducibility and benchmarking; we also compare against prior state-of-the-art methods in EMR coding and demonstrate robustness across multiple standard evaluation measures.
• We analyze and measure how each component of our model affects the performance using ablation experiments.

Related Work
In this section we cover recent methodologies that are either relevant to our approach and problem or form the main ingredients of our contribution.

Extreme Multi-label Classification
Current methods for extreme MLC fall into two categories: embedding and tree-based methods.
Embedding-based methods aim to reduce the training complexity. They effectively reduce the label space by assuming the training label matrix is low rank. Intuitively, rather than learning independent classifiers for each label (binary relevance) (Tsoumakas et al., 2010), classifiers are learned in a reduced label space of size $\hat{L}$, smaller than $L$, where $L$ is the total number of labels. Likewise, a projection matrix is learned to convert predictions from the reduced label space back to the original label space. In general, embedding methods vary based on how they reduce the label space and how the projection operation is optimized. Tai and Lin (2012) use principal component analysis (PCA) to reduce the label space. Low-rank Empirical risk minimization for Multi-Label Learning (LEML) (Yu et al., 2014) jointly optimizes the label space reduction and the projection processes. RobustXML (Xu et al., 2016) is similar to LEML but it treats infrequent labels as outliers and models them separately. Neural networks have also been employed for extreme multi-label problems using a funnel-like architecture that reduces the label vector dimensionality.
Tree-based multi-label methods work by recursively splitting the feature space. These methods usually differ based on the node splitting criterion. FastXML (Prabhu and Varma, 2014) partitions the feature space using the nDCG measure as the splitting criterion. PfastreXML (Jain et al., 2016) improves on FastXML by using a propensity-scored nDCG splitting criterion and re-ranking the predicted labels to optimize various ranking measures.

Memory Augmented Neural Networks
Memory networks (Weston et al., 2014) have access to external memory, typically consisting of information the model may use to make predictions. Intuitively, informative memories concerning a given instance are found by the memory network to improve its predictive power. Kamra et al. (2017) use memory networks to fix issues of catastrophic forgetting. They show that external memory can be used to learn new tasks without forgetting previous tasks. Memory networks are now applied to a wide variety of natural language processing tasks, including question answering and language modeling (Sukhbaatar et al., 2015;Bordes et al., 2015;Miller et al., 2016).
Matching networks (Vinyals et al., 2016; Snell et al., 2017) have recently been developed for few/one-shot learning problems. We can interpret matching networks as a key-value memory network (Miller et al., 2016). The "keys" are training instances, while the "values" are the labels associated with each training example. Intuitively, the concept is similar to a hashmap. The model will search for the most similar training instance to find its respective "value". Also, matching networks can be interpreted as a k-NN based model that automatically learns an informative distance metric. Finally, Altae-Tran et al. (2017) used matching networks for drug discovery, a problem where data is limited.

Diagnosis Code Prediction
The 2007 shared task on coding radiology reports (Pestian et al., 2007) was the first effort that popularized automated EMR coding. Traditionally, linear methods have been used for diagnosis code prediction. Perotte et al. (2013) developed a hierarchical support vector machine (SVM) model that takes advantage of the ICD-9-CM hierarchy. In our prior work, we train a linear model for every label (Rios and Kavuluru, 2013) and re-rank the labels using a learning-to-rank procedure (Kavuluru et al., 2015). Data from PubMed (biomedical article corpus and search system) has also been used to supplement the diagnosis code training data, with linear models trained on both the original training data and the PubMed data.
Recent advances in neural networks have also been put to use for EMR coding: Baumel et al. (2018) trained a CNN with multiple sigmoid outputs using binary cross-entropy. Duarte et al. (2017) use hierarchical recurrent neural networks (RNNs) to annotate death reports with ICD-10 codes. Vani et al. (2017) introduced grounded RNNs for EMR coding. They found that iteratively updating their predictions at each time step significantly improved the performance. Finally, similar to our work, memory networks (Prakash et al., 2017) have recently been used for diagnosis coding. However, we would like to note two significant differences between the memory network from Prakash et al. (2017) and our model. First, they don't use a matching network and their memories rely on extracting information about each label from Wikipedia. In contrast, our model does not use any auxiliary information. Second, they only evaluate on the 50 most frequent labels, while we evaluate on all the labels in the dataset.

Our Architecture
An overview of our model is shown in Figure 1. Our model architecture has two main components.
Figure 1: The matching CNN architecture. For each input instance, x, we search a support set using different representations of x and pass the similar support instances as auxiliary features to the output layer.
1. We augment a CNN with external memory over a support set S, which consists of a small subset of the training dataset. The model searches the support set to find similar examples with respect to the input instance. We make use of the homophily assumption that similar instances in the support set are coded with similar labels. Therefore, we use the related support set examples as auxiliary features. The similar instances are chosen automatically by combining ideas from metric learning and neural attention. We emphasize that unlike in a traditional k-NN setup, we do NOT explicitly use the labels of the support set instances. The support set essentially enriches and complements the features derived from the input instance.
2. Rather than predicting labels by thresholding, we rank them and select the top k labels specific to each instance where k is predicted using an additional output unit (termed MetaLabeler). We train the MetaLabeler along with the classification loss using a multi-task training scheme.
Before we go into more specific details of our architecture, we introduce some notation. Let X represent the set of all training documents and x be an instance of X. Likewise, let S represent the set of support instances and s be an instance of S. We let L be the total number of unique labels. Our full model is described in the following subsections.

Convolutional Neural Networks
We use a CNN to encode each document following what is now a fairly standard approach consisting of an embedding layer, a convolution layer, a max-pooling layer, and an output layer (Collobert et al., 2011; Kim, 2014). However, in our architecture, the CNN additionally aids in getting intermediate representations for the multi-head matching network component (Section 3.2). Intuitively, CNNs make use of the sequential nature of text, where a non-linear function is applied to region vectors formed from vectors of words in short adjacent word sequences. Formally, we represent each document as a sequence of word vectors, $[w_1, w_2, \ldots, w_n]$, where $w_i \in \mathbb{R}^d$ represents the vector of the $i$-th word in the document. The region vectors are formed by concatenating each window of $s$ words, $w_{i-s+1} \,\|\, \ldots \,\|\, w_i$, into a local region vector $c_j \in \mathbb{R}^{sd}$. Next, $c_j$ is passed to a non-linear function
$$\hat{c}_j = \mathrm{ReLU}(W c_j + b),$$
where $W \in \mathbb{R}^{v \times sd}$, $b \in \mathbb{R}^v$, and ReLU is a rectified linear unit (Glorot et al., 2011; Nair and Hinton, 2010). Each row of $W$ represents a convolutional filter, so $v$ is the total number of filters.
After processing each successive region vector, we obtain a document representation $D = [\hat{c}_1, \hat{c}_2, \ldots, \hat{c}_{n+s-1}]$ by concatenating each $\hat{c}_j$, forming a matrix $D \in \mathbb{R}^{v \times (n+s-1)}$. Each row of $D$ is referred to as a feature map, each formed by a different convolutional filter. Unfortunately, this representation is dependent on the length of the document and we cannot pass it to an output layer. We use max-over-time pooling to create a fixed-size vector
$$g(x) = [\hat{c}^1_{\max}, \hat{c}^2_{\max}, \ldots, \hat{c}^v_{\max}],$$
where $\hat{c}^j_{\max} = \max(\hat{c}^j_1, \hat{c}^j_2, \ldots, \hat{c}^j_{n+s-1})$.
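The encoder just described can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the released implementation: the function name `cnn_encode` is hypothetical, and zero-padding the document with s-1 vectors on each side is an assumption made so that exactly n+s-1 windows are produced, matching the width of D.

```python
import numpy as np


def cnn_encode(words, W, b, s):
    """Encode a document as a fixed-size vector via convolution + max pooling.

    words: (n, d) matrix of word vectors.
    W:     (v, s*d) filter matrix, one convolutional filter per row.
    b:     (v,) bias vector.
    s:     window size in words.
    """
    n, d = words.shape
    # Assumption: zero-pad s-1 positions on both sides, giving n+s-1 windows.
    pad = np.zeros((s - 1, d))
    padded = np.vstack([pad, words, pad])
    # Each region vector concatenates s adjacent word vectors.
    regions = np.stack([padded[j:j + s].reshape(-1) for j in range(n + s - 1)])
    # ReLU(W c_j + b) for every region; D has shape (v, n+s-1).
    D = np.maximum(0.0, W @ regions.T + b[:, None])
    # Max-over-time pooling keeps one value per feature map (row of D).
    return D.max(axis=1)
```

Because pooling runs over the time axis, documents of any length map to a vector with one entry per filter.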

Multi-Head Matching Network
Using the support set and the input instance, our goal is to estimate $P(y \mid x, S)$. The support set S is chosen based on nearest neighbors and its selection process is discussed in Section 3.4. Among instances in S, our model finds informative support instances with respect to x and creates a feature vector using them. This feature vector is combined with the input instance to make predictions.
First, each support instance $s_k \in S$ is projected into the support space using a simple single-layer feed-forward NN as
$$h(g(s_k)) = \mathrm{ReLU}(W_s \, g(s_k) + b_s),$$
where $W_s \in \mathbb{R}^{z \times v}$ and $b_s \in \mathbb{R}^z$. Likewise, we project each input instance $x$ into the input space using a different feed-forward neural network,
$$p_i(g(x)) = \mathrm{ReLU}(W^i_\alpha \, g(x) + b^i_\alpha),$$
where $W^i_\alpha \in \mathbb{R}^{z \times v}$ and $b^i_\alpha \in \mathbb{R}^z$. Compared to the support set neural network, where we use only a single network, for the input instance we have $u$ projection neural networks. This means we have $u$ versions of $x$, an idea that is similar to self-attention (Lin et al., 2017), where the model learns multiple representations of an instance. Here each $p_i(g(x))$ represents a single "head" or representation of the input $x$. Using different weight matrices, $[W^1_\alpha, \ldots, W^u_\alpha]$ and $[b^1_\alpha, \ldots, b^u_\alpha]$, we create different representations of $x$ (multiple heads). For both the input multi-heads and the support instance projection, we note that the same CNN is used (also indicated in Figure 1), whose output is passed to the feed-forward neural nets outlined thus far in this section.
Rather than searching for a single informative support instance, we search for multiple relevant support instances. For each of the $u$ input instance representations, we calculate a normalized attention score
$$A_{i,k} = \frac{\exp\big(-d(p_i(g(x)), h(g(s_k)))\big)}{\sum_{k'=1}^{|S|} \exp\big(-d(p_i(g(x)), h(g(s_{k'})))\big)},$$
where $A_{i,k}$ represents the score of the $k$-th support example with respect to the $i$-th input representation $p_i(g(x))$ and
$$d(p_i(g(x)), h(g(s_k))) = \lVert p_i(g(x)) - h(g(s_k)) \rVert_2^2$$
is the square of the Euclidean distance between the input and support representations.
Next, the normalized scores are aggregated into a matrix $A \in \mathbb{R}^{u \times |S|}$. Then, we create a feature vector
$$q = \mathrm{vec}(A S),$$
where $q \in \mathbb{R}^{uz}$, vec is the matrix vectorization operator, and $S \in \mathbb{R}^{|S| \times z}$ is the support instance CNN feature matrix whose $i$-th row is $h(g(s_i))$ for $i = 1, \ldots, |S|$. Intuitively, multiple weighted averages of the support instances are created, one for each of the $u$ input representations. The final feature vector, $g(x) \,\|\, q$, is formed by concatenating the CNN representation of the input instance $x$ and the support set feature vector $q$. Finally, the output layer for $L$ labels involves computing
$$\hat{y} = \sigma\big(W_c \, (g(x) \,\|\, q) + b_c\big),$$
where $W_c \in \mathbb{R}^{L \times (uz+v)}$, $b_c \in \mathbb{R}^L$, and $\sigma$ is the sigmoid function. Because we use a sigmoid activation function, each label prediction ($\hat{y}_i$) is in the range from 0 to 1.
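To make the shapes in this section concrete, the following NumPy sketch runs one forward pass of the multi-head matching component. It is a sketch under stated assumptions, not the released code: the function name `match_forward` is hypothetical, ReLU is assumed for both projection networks, the attention is assumed to be a softmax over negative squared Euclidean distances, and vec is implemented as a row-major flatten (which differs from column stacking only by a fixed permutation of entries).

```python
import numpy as np


def relu(x):
    return np.maximum(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def match_forward(g_x, G_s, W_alpha, b_alpha, W_s, b_s, W_c, b_c):
    """One forward pass of the multi-head matching component.

    g_x:     (v,) CNN vector g(x) of the input instance.
    G_s:     (|S|, v) CNN vectors g(s_k) of the support instances.
    W_alpha: (u, z, v) and b_alpha: (u, z), one projection per head.
    W_s:     (z, v) and b_s: (z,), the single support projection.
    W_c:     (L, v + u*z) and b_c: (L,), the output layer.
    """
    S = relu(G_s @ W_s.T + b_s)                               # (|S|, z)
    P = relu(np.einsum("uzv,v->uz", W_alpha, g_x) + b_alpha)  # (u, z)
    # Squared Euclidean distance between every head and support instance.
    d = ((P[:, None, :] - S[None, :, :]) ** 2).sum(axis=2)    # (u, |S|)
    # Softmax over negative distances, stabilized by subtracting the row max.
    logits = -d
    logits = logits - logits.max(axis=1, keepdims=True)
    A = np.exp(logits)
    A = A / A.sum(axis=1, keepdims=True)
    # One weighted average of support vectors per head, flattened to (u*z,).
    q = (A @ S).reshape(-1)
    features = np.concatenate([g_x, q])                       # (v + u*z,)
    return sigmoid(W_c @ features + b_c), A
```

Note that the support labels never enter this computation; the support instances contribute only through the averaged feature vector q.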

MetaLabeler
The easiest method to convert $\hat{y}$ into label predictions is to simply threshold each element at 0.5. However, most large-scale multi-label problems are highly imbalanced. When training using binary cross-entropy, the threshold 0.5 is optimized for accuracy. Therefore, our predictions will be biased towards 0. A simple way to fix this problem is to optimize the threshold value for each label. Unfortunately, searching for the optimal threshold of each label is computationally expensive in large label spaces. Here we train a regression-based output layer $\hat{r}$, where $\hat{r}$ estimates the number of labels $x$ should be annotated with. At test time, we rank each label by its score in $\hat{y}$. Next, $\hat{r}$ is rounded to the nearest integer and we predict the top $\hat{r}$ ranked labels.
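The test-time procedure can be sketched as follows. The function name `predict_labels` and the scores are hypothetical; the sketch also assumes the rounded count is clipped to the range from 0 to the number of labels, a detail the text does not specify.

```python
def predict_labels(scores, r_hat):
    """Return the top round(r_hat) labels ranked by score.

    scores: dict mapping each label to its predicted score in [0, 1].
    r_hat:  real-valued estimate of how many labels the instance has.
    """
    # Assumption: clip the rounded count to a valid number of labels.
    k = max(0, min(len(scores), int(round(r_hat))))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Hypothetical scores: none exceeds 0.5, so thresholding predicts nothing.
scores = {"428.0": 0.31, "401.9": 0.22, "584.9": 0.12, "250.00": 0.05}
top = predict_labels(scores, 2.4)
```

In this example a 0.5 threshold would return an empty label set, whereas the estimated count of 2.4 yields the two highest-ranked codes.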

Training
To train our model, we need to define two loss functions. First, following recent work on multi-label classification with neural networks (Nam et al., 2014b), we train using a multi-label cross-entropy loss. The loss is defined as
$$\mathcal{L}_c = -\sum_{i=1}^{L} \big[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \big],$$
which sums the binary cross-entropy loss for each label. The second loss is used to train the MetaLabeler, for which we use the mean squared error
$$\mathcal{L}_r = \frac{1}{|r|} \sum_{j} (r_j - \hat{r}_j)^2,$$
where $r$ is the vector of correct numbers of labels and $\hat{r}$ is our estimate. We train these two losses using a multi-task learning paradigm (Collobert et al., 2011).
Similar to previous work with matching networks (Vinyals et al., 2016; Snell et al., 2017), "episode" or mini-batch construction can have an impact on performance. In the multi-label setting, episode construction is non-trivial. We propose a simple strategy for choosing the support set S which we find works well in practice. First, at the beginning of the training process we loop over all training examples and store $g(x)$ for every training instance. We will refer to this set of vectors as $T$. Next, for every step of the training process (for every mini-batch $M$), we search $T \setminus M$ to find the $e$ nearest neighbors (using Euclidean distance) per instance to form our support set S. Likewise, we add $e$ random examples from $T \setminus M$ to the support set. Therefore, our support set S contains up to $|M|e + e$ instances. The purpose of the random examples is to ensure the distance metric learned during training (captured by improving representations of documents as influenced by all network parameters) is robust to noisy examples.
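The support set construction for one mini-batch can be sketched as below. This is a simplified illustration: the function name `build_support_set` is hypothetical, the stored vectors are plain Python lists, and a brute-force search stands in for whatever nearest-neighbor index an actual implementation would use.

```python
import math
import random


def build_support_set(T, batch_ids, e, rng):
    """Return sorted indices into T that form the support set for a batch.

    T:         list of stored document vectors g(x), one per training instance.
    batch_ids: indices of the instances in the current mini-batch M.
    e:         number of nearest neighbors per instance (and of random extras).
    rng:       a random.Random instance used to draw the random examples.
    """
    batch = set(batch_ids)
    # Candidates are all training instances outside the mini-batch (T \ M).
    candidates = [i for i in range(len(T)) if i not in batch]
    support = set()
    for m in batch_ids:
        nearest = sorted(candidates, key=lambda i: math.dist(T[m], T[i]))[:e]
        support.update(nearest)
    # Add e random examples so the learned metric also sees unrelated instances.
    support.update(rng.sample(candidates, min(e, len(candidates))))
    return sorted(support)
```

Because neighbors of different batch instances (and the random draws) may coincide, the set can be smaller than |M|e + e, which is why that figure is an upper bound.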

Matching Network Interpretation
If we do not use the support set label vectors, then what is our network learning? To answer this question we directly compare the matching network formulation to our method. Matching networks can be expressed as
$$\hat{y} = \sum_{k} a(x, s_k) \, y_k,$$
where $a(\cdot, \cdot)$ is the attention/distance learned between two instances, $k$ indexes each support instance, and $y_k$ is a one-hot encoded vector. $a(x, s_k)$ is equivalent to $A_{1,k}$ assuming we use a single head. Traditional matching networks use one-hot encoded vectors because they are evaluated on multi-class problems. EMR coding is a multi-label problem. Hence, $y_k$ is a multi-hot encoded vector. Moreover, with thousands of labels, it is unlikely even for neighboring instance pairs to share many labels; this problem is not encountered in the multi-class setting. We overcome this issue by learning new output label vectors for each support set instance. Assuming a single head, our method can be re-written as
$$\hat{y} = \sigma\Big(W^1_c \, g(x) + \sum_{k} A_{1,k} \, \tilde{y}_k + b_c\Big), \quad (4)$$
where $\tilde{y}_k$ is the learned label vector for support instance $s_k$. Next, we define $\tilde{y}_k$, the learned support set vectors, as
$$\tilde{y}_k = W^2_c \, h(g(s_k)), \quad (5)$$
where both $W^1_c$ and $W^2_c$ are submatrices of $W_c$. Using this reformulation, we can now see that our method's main components (equations (1)-(3)) are equivalent to this more explicit matching network formulation (equations (4)-(5)). Intuitively, our method combines a traditional output layer (the first half of equation 4) with a matching network where the support set label vectors are learned to better match the labels of their nearest neighbors.
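The claimed equivalence can be checked numerically for the single-head case. The sketch below uses random stand-in values for all quantities and compares the pre-sigmoid outputs of the two formulations; it assumes W_c is split column-wise so that its first v columns act on g(x) and the remaining z columns act on the support feature vector.

```python
import numpy as np

rng = np.random.default_rng(0)
v, z, L, n_support = 6, 4, 5, 3

g_x = rng.normal(size=v)                 # CNN vector of the input
H = rng.normal(size=(n_support, z))      # rows stand in for h(g(s_k))
a = rng.random(n_support)
a = a / a.sum()                          # single-head attention weights A_{1,k}
W_c = rng.normal(size=(L, v + z))
b_c = rng.normal(size=L)

# Output layer applied to the concatenated feature vector g(x) || q.
q = a @ H
direct = W_c @ np.concatenate([g_x, q]) + b_c

# Matching-network view: split W_c and learn one label vector per support item.
W1, W2 = W_c[:, :v], W_c[:, v:]
y_tilde = H @ W2.T                       # (n_support, L) learned label vectors
matching = W1 @ g_x + a @ y_tilde + b_c

same = np.allclose(direct, matching)
```

The two computations agree because the output layer is linear in q, so the attention-weighted average can be taken either before or after multiplying by the corresponding block of W_c.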

Experiments
In this section we compare our work with prior state-of-the-art medical coding methods. In Section 4.1 we discuss the two publicly available datasets we use. Next, Section 4.2 describes the implementation details of our model. We summarize the various baselines and models we compare against in Section 4.3. The evaluation metrics are described in Section 4.4. Finally, we discuss how our method performs in Section 4.5.

Datasets
EMR data is generally not available for public use, especially if it involves textual notes. Therefore, we focus on the publicly available Medical Information Mart for Intensive Care (MIMIC) datasets for benchmarking purposes. We evaluate using two versions of MIMIC: MIMIC II (Lee et al., 2011) and MIMIC III (Johnson et al., 2016), where the former is a relatively smaller and older dataset and the latter is the most recent version. Following prior work (Perotte et al., 2013; Vani et al., 2017), we use the free text discharge summaries in MIMIC to predict the ICD-9-CM codes. The dataset statistics are shown in Table 1.
For comparison purposes, we use the same MIMIC II train/test splits as Perotte et al. (2013). Specifically, we use discharge reports collected from 2001 to 2008 from the intensive care unit (ICU) of the Beth Israel Deaconess Medical Center. Following Perotte et al. (2013), the labels for each discharge summary are extended using the parent of each label in the label set. The parents are based on the ICD-9-CM hierarchy. We use the hierarchical label expansion to maximize the prior work we can compare against.
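The label expansion step can be sketched as follows. The `parent_of` mapping here is a toy stand-in for the ICD-9-CM hierarchy and the function name is hypothetical; the sketch adds only the direct parent of each assigned code, as stated above, and whether further ancestors are also added is not specified in this section.

```python
def expand_with_parents(labels, parent_of):
    """Extend a label set with the direct parent of each assigned code.

    labels:    iterable of codes assigned to one discharge summary.
    parent_of: dict mapping a code to its parent (absent for top-level codes).
    """
    expanded = set(labels)
    for code in labels:
        parent = parent_of.get(code)
        if parent is not None:
            expanded.add(parent)
    return expanded


# Toy hierarchy fragment for illustration only.
parent_of = {"428.0": "428", "250.00": "250.0", "250.0": "250"}
expanded = expand_with_parents({"428.0", "250.00"}, parent_of)
```

Expansion of this kind increases the number of instances per label, since every child code also counts toward its parent.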
The MIMIC III dataset has been extended to include health records of patients admitted to the Beth Israel Deaconess Medical Center from 2001 to 2012 and hence provides a test bed for more advanced learning methods. Unfortunately, to our knowledge we are the first to use it for this purpose, so it has no standard train/test split that would allow comparison against prior work. Hence, we use both MIMIC II and MIMIC III for comparison purposes. Furthermore, we do not use the hierarchical label expansion on the MIMIC III dataset.
Before we present our results, we discuss an essential distinction between the MIMIC II and MIMIC III datasets. Particularly, we are interested in the differences concerning label imbalance. From Table 1, we find that MIMIC III has almost twice as many examples as MIMIC II. However, MIMIC II has more instances per label on average. Thus, although MIMIC III has more examples, each label is used fewer times on average compared to MIMIC II. This is because the label sets for each instance in MIMIC II were extended using the ICD-9 hierarchy.

Implementation Details
Preprocessing: Each discharge summary was tokenized using a simple regex tokenization scheme (\w\w+). Also, each word/token that occurs fewer than five times in the training dataset was replaced with the UNK token.
Model Details: For our CNN, we used convolution filters of size 3, 4 and 5 with 300 filters for each filter size. We used 300-dimensional skip-gram (Mikolov et al., 2013) word embeddings pre-trained on PubMed. The Adam optimizer (Kingma and Ba, 2015) was used for training with the learning rate 0.0001. The mini-batch size was set to 4, the number of nearest neighbors per instance (e) was set to 16, and the number of heads (u) was set to 8. Our code is available at: https://github.com/bionlproc/med-match-cnn
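The preprocessing described above can be sketched with the standard library. The function names are hypothetical, and the sketch applies the regex to the raw text as-is, since no case folding or other normalization is mentioned.

```python
import re
from collections import Counter

# Tokens are runs of two or more word characters, as in the (\w\w+) scheme.
TOKEN_RE = re.compile(r"\w\w+")


def tokenize(text):
    return TOKEN_RE.findall(text)


def build_vocab(train_docs, min_count=5):
    """Keep tokens that occur at least `min_count` times in the training set."""
    counts = Counter(tok for doc in train_docs for tok in tokenize(doc))
    return {tok for tok, c in counts.items() if c >= min_count}


def encode(text, vocab, unk="UNK"):
    """Tokenize `text`, replacing out-of-vocabulary tokens with UNK."""
    return [tok if tok in vocab else unk for tok in tokenize(text)]


# Hypothetical training documents.
docs = ["chest pain"] * 5 + ["rare finding"]
vocab = build_vocab(docs)
encoded = encode("chest pain, a rare x", vocab)
```

Single-character tokens such as "a" and "x" are dropped by the regex itself, while "rare" survives tokenization but is mapped to UNK because it appears only once in the training documents.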

Baseline Methods
In this paper, we focused on comparing our method to state-of-the-art methods for diagnosis code prediction such as grounded recurrent neural networks (Vani et al., 2017) (GRNN) and multi-label CNNs (Baumel et al., 2018). We also compare against traditional binary relevance methods where independent binary classifiers (L1-regularized linear models) are trained for each label. Next, we compare against hierarchical SVM (Perotte et al., 2013), which incorporates the ICD-9-CM label hierarchy. Finally, we also report the results of the traditional matching network with one modification: we train the matching network with the multi-label loss presented in Section 3.4 and threshold using the MetaLabeler described in Section 3.3.
We also present two versions of our model: Match-CNN and Match-CNN Ens. Match-CNN is the multi-head matching network introduced in Section 3. Match-CNN Ens is an ensemble that averages three Match-CNN models, each initialized using a different random seed.
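The ensembling step can be illustrated with a short sketch. It assumes that the quantity being averaged is the per-label score vector produced by each model, which the text does not state explicitly; the function name and the scores are hypothetical.

```python
def ensemble_average(score_lists):
    """Average per-label scores across models.

    score_lists: one list of label scores per model, all of equal length.
    """
    n_models = len(score_lists)
    return [sum(column) / n_models for column in zip(*score_lists)]


# Hypothetical scores from three models for two labels.
averaged = ensemble_average([[0.2, 0.8], [0.4, 0.6], [0.6, 0.4]])
```

The averaged scores can then be ranked and cut off in the same way as the scores of a single model.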

Evaluation Metrics
We evaluate our method using a wide variety of standard multi-label evaluation metrics. We use the popular micro and macro averaged F1 measures to assess how our model (with the MetaLabeler) performs when thresholding predictions. For problems with large label spaces that suffer from significant imbalances in label distributions, the default threshold of 0.5 generally performs poorly (hence our use of MetaLabeler). To remove the thresholding effect bias, we also report different versions of the area under the precision-recall (PR) and receiver operating characteristic (ROC) curves. Finally, in a real-world setting, our system would not be expected to replace medical coders. We would expect medical coders to use our system to become more efficient in coding EMRs. Therefore, we would rank the labels based on model confidence and medical coders would choose the correct labels from the top few. To understand if our system would be useful in a real-world setting, we evaluate with precision at k (P@k) and recall at k (R@k). High P@k and R@k are critical to effectively encourage human coders to use and benefit from the system.
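For a single instance, the two ranking measures can be computed as in the sketch below; corpus-level values would then be averaged over test instances. The function name and the example labels are hypothetical, and returning a recall of 0 for an instance with no gold labels is a convention chosen here for illustration.

```python
def precision_recall_at_k(ranked, gold, k):
    """Compute P@k and R@k for one instance.

    ranked: labels sorted by decreasing model confidence.
    gold:   set of correct labels for the instance.
    k:      number of top-ranked labels shown to the coder.
    """
    hits = sum(1 for label in ranked[:k] if label in gold)
    precision = hits / k
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical ranking with two of the three gold labels in the top four.
p_at_4, r_at_4 = precision_recall_at_k(["a", "b", "c", "d", "e"], {"a", "c", "f"}, 4)
```

P@k divides by the number of labels shown, whereas R@k divides by the number of correct labels, so with more than 20 codes per record on average a small k necessarily caps recall.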

Results
We show experimental results on MIMIC II in Table 2. Overall, our method improves on prior work across a variety of metrics. With respect to micro F1, we improve upon GRNN-128 by over 1%.
Also, while macro F1 is still low in general, we improve macro F1 compared to state-of-the-art neural methods by more than 1%. In general, both micro and macro F1 are highly dependent on the thresholding methodology. Rather than thresholding at 0.5, we rank the labels and pick the top k based on a trained regression output layer. Can we do better than using a MetaLabeler? To measure this, we look at the areas under PR/ROC curves. Regarding micro and macro PR-AUC, we improve on prior work by ≈ 2.5%. This suggests that via better thresholding, the chances of improving both micro and macro F1 are higher for Match-CNN compared to other methods. Finally, we are also interested in metrics that evaluate how this model would be used in practice. We perform comparably with prior work on P@k. We show strong improvements in R@k with over a 4% improvement in R@40 compared to grounded RNNs and over 1% improvement when compared with Baumel et al. (2018). Our method also outperforms matching networks across every evaluation measure.
We present MIMIC III results in Table 3. We reiterate that MIMIC III does not have a standard train/test split. Hence we compare our model to our implementations of methods from prior efforts. On MIMIC III, we also show improvements in multiple evaluation metrics. Interestingly, our method performs much better than the standard CNN on MIMIC III, compared to the relative performances of the two methods on MIMIC II. Match-CNN improves on CNN in R@40 by almost 5% on the MIMIC III dataset. The gain in R@40 is more than the 1% improvement found on MIMIC II. We hypothesize that the improvements on MIMIC III are because the label imbalance found in MIMIC III is higher than in MIMIC II. Increased label imbalance means more labels occur less often. Therefore, we believe our model works better with fewer training examples per label compared to the standard CNN model.
In Table 4 we analyze each component of our model using an ablation analysis on the MIMIC III dataset. First, we find that removing the matching component significantly affects our performance, reducing micro PR-AUC by almost 5%. Regarding micro and macro F1, we also notice that the MetaLabeler heuristic substantially improves on default thresholding (0.5). Finally, we see that the multi-head matching component provides reasonable improvements to our model across multiple evaluation measures. For example, P@8 and P@40 decrease by around 1% when we use attention with a single input representation.

Conclusion
In this paper, we introduce a semi-parametric multi-head matching network with a specific application to EMR coding. We find that by combining the non-parametric properties of matching networks with a traditional classification output layer, we improve metrics for both frequent and infrequent labels in the dataset. In the future, we plan to investigate three limitations of our current model.
1. We currently use a naive approach to choose the support set. We believe that improving the support set sampling method could substantially improve performance.
2. We hypothesize that a more sophisticated thresholding method could have a significant impact on the micro and macro F1 measures. As we show in Table 4, MetaLabeler outperforms naive thresholding strategies. However, given our method shows non-trivial gains in PR-AUC compared to micro/macro F1, we believe better thresholding strategies are a worthy avenue to seek improvements.
3. Both the MIMIC II and MIMIC III datasets have around 7,000 labels but ICD-9-CM contains over 16,000 labels and ICD-10-CM has nearly 70,000 labels. In future work, we believe significant attention should be given to zero-shot learning applied to EMR coding.
To predict labels that have never occurred in the training dataset, we think it is vital to take advantage of the ICD hierarchy. Baker and Korhonen (2017) improve neural network training by incorporating hierarchical label information to create better weight initializations. However, this does not help with respect to zero-shot learning. If we can better incorporate expert knowledge about the label space, we may be able to infer labels we have not seen before.