Ontological attention ensembles for capturing semantic concepts in ICD code prediction from clinical text

We present a semantically interpretable system for automated ICD coding of clinical text documents. Our contribution is an ontological attention mechanism which matches the structure of the ICD ontology, in which shared attention vectors are learned at each level of the hierarchy, and combined into label-dependent ensembles. Analysis of the attention heads shows that shared concepts are learned by the lowest common denominator node. This allows child nodes to focus on the differentiating concepts, leading to efficient learning and memory usage. Visualisation of the multi-level attention on the original text allows explanation of the code predictions according to the semantics of the ICD ontology. On the MIMIC-III dataset we achieve a 2.7% absolute (11% relative) improvement from 0.218 to 0.245 macro-F1 score compared to the previous state of the art across 3,912 codes. Finally, we analyse the labelling inconsistencies arising from different coding practices which limit performance on this task.


Introduction
Classification of clinical free-text documents poses some difficult technical challenges. One task of active research is the assignment of diagnostic and procedural International Classification of Diseases (ICD) codes. These codes are assigned retrospectively to hospital admissions based on the medical record, for population disease statistics and for reimbursements for hospitals in countries such as the United States. As manual coding is both time-consuming and error-prone, automation of the coding process is desirable. Coding errors may result in unpaid claims and loss of revenue (Adams et al., 2002).
Automated matching of unstructured text to medical codes is difficult because of the large number of possible codes, the high class imbalance in the data, and the ambiguous language and frequent lack of exposition in clinical text. However, the release of large datasets such as MIMIC-III (Johnson et al., 2016) has paved the way for progress, enabling rule-based systems (Farkas and Szarvas, 2008) and classical machine learning methods such as support vector machines (Suominen et al., 2008) to be superseded by neural network-based approaches (Baumel et al., 2017; Karimi et al., 2017; Shi et al., 2018; Duarte et al., 2018; Rios and Kavuluru, 2018). The most successful reported model on the ICD coding task is a shallow convolutional neural network (CNN) model with label-dependent attention introduced by Mullenbach et al. (2018) and extended by Sadoughi et al. (2018) with multi-view convolution and a modified label regularisation module.
One of the common features of the aforementioned neural network models is the use of attention mechanisms (Vaswani et al., 2017). This mirrors advances in general representation learning.
In the text domain, use of multi-headed attention has been core to the development of Transformer-based language models (Devlin et al., 2018; Radford et al., 2019). In the imaging domain, authors have had success with combining attention vectors learned at the global and local levels with Double Attention networks (Chen et al., 2018). In the domain of structured (coded) medical data, Choi et al. (2017) leveraged the ontological structure of the ICD and SNOMED CT coding systems in their GRAM model, to combine the attention vectors of a code and its ancestors in order to predict the codes for the next patient visit based on the codes assigned in the previous visit.
Our contributions are:
1. A structured ontological attention ensemble mechanism which provides improved accuracy, efficiency, and interpretability.
2. An analysis of the multi-level attention weights with respect to the text input, which allows us to interpret the code predictions according to the semantics of the ICD ontology.
3. An analysis of the limitations of the MIMIC-III dataset, in particular the labelling inconsistencies arising from variable coding practices between coders and between timepoints.

Table 1
Dataset      # Documents  # Unique patients  # ICD-9 codes  # Unique ICD-9 codes
Training     47,719       36,997             758,212        8,692
Development  1,631        1,374              28,896         3,012
Test         3,372        2,755              61,578         4,085
Total        52,722       41,126             848,686        8,929

Dataset
We used the MIMIC-III dataset (Johnson et al., 2016) ("Medical Information Mart for Intensive Care") which comes from the intensive care unit of the Beth Israel Deaconess Medical Center in Boston. We concatenated the hospital discharge summaries associated with each admission to form a single document and combined the corresponding ICD-9 codes. The data was split into training, development, and test patient sets according to the split of Mullenbach et al. (2018) (see Table 1).

Methods
We formulate the problem as a multi-label binary classification task, for which each hospital discharge summary is labelled with the presence or absence of the complete set of ICD-9 codes for the associated admission. Our model is a CNN similar to those of Mullenbach et al. (2018) and Sadoughi et al. (2018). Inspired by the graph-based attention model of Choi et al. (2017), we propose a hierarchical attention mechanism (mirroring the ICD ontology) which yields a multi-level, label-dependent ensemble of attention vectors for predicting each code. Our architecture is shown in Figure 1 and described below.

Embedding
Documents were pre-processed by lower-casing the text and removing punctuation, followed by tokenisation during which purely numeric tokens were discarded. We used a maximum input length of 4500 tokens and truncated any documents longer than this (260 training, 16 development, and 22 test). Tokens were then embedded with a 100-dimensional word2vec model. For each document, token embeddings were concatenated to give a 100 × N document embedding matrix D, where N is the document length. We pre-trained the word2vec model on the training set using continuous bag-of-words (CBOW) (Mikolov et al., 2013). The vocabulary comprises tokens which occur in at least 3 documents (51,847 tokens). The embedding model was fine-tuned (not frozen) during subsequent supervised training of the complete model.
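The pre-processing steps above can be sketched as follows. This is a minimal illustration rather than the authors' code: whitespace tokenisation and replacing ASCII punctuation with spaces are assumptions, and the function name `preprocess` is hypothetical.

```python
import string

MAX_TOKENS = 4500  # maximum input length stated in the text

# Assumed rule: every ASCII punctuation character is replaced by a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text, max_tokens=MAX_TOKENS):
    """Lower-case, strip punctuation, tokenise on whitespace, drop purely
    numeric tokens, and truncate to max_tokens."""
    cleaned = text.lower().translate(_PUNCT_TABLE)
    tokens = [tok for tok in cleaned.split() if not tok.isdigit()]
    return tokens[:max_tokens]


example = preprocess("Pt admitted 3 days ago; BP 120/80, h/o CHF.")
```

Under these assumptions a token such as "h/o" is split into two tokens; a different tokeniser would change the vocabulary but not the overall procedure.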

Convolutional module
The first part of the network proper consists of a multi-view convolutional module, as introduced by Sadoughi et al. (2018). Multiple one-dimensional convolutional kernels of varying size with stride = 1 and weights W are applied in parallel to the document embedding matrix D along the N dimension. The outputs of these kernels are padded at each end to match the input length N. This yields outputs O of size C × M × N, where C is the number of kernel sizes ("views"), M is the number of filter maps per view, and N is the length of the document. The outputs are max-pooled in the C dimension, i.e., across each set of views, to yield a matrix E of dimensions M × N:

$$E_{m,n} = \max_{c \in \{1, \ldots, C\}} O_{c,m,n}$$

Optimal values were C = 4 views with kernel lengths {6, 8, 10, 12} and M = 256 filter maps each.
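To make the shapes concrete, the following pure-Python sketch applies several views to a small document and max-pools across them. It is an illustration under stated assumptions: the input is zero-padded (rather than the output), no activation function is applied, and the way padding is split for even kernel lengths is a guess. The names `conv1d_same` and `multi_view` are hypothetical.

```python
def conv1d_same(doc, kernel):
    """Apply one filter to a document so that the output has length N.

    doc:    list of N token embeddings, each a list of d floats
    kernel: list of k weight vectors, each a list of d floats
    """
    n, k = len(doc), len(kernel)
    left = (k - 1) // 2  # assumed padding split for even kernel lengths
    out = []
    for pos in range(n):
        total = 0.0
        for j in range(k):
            idx = pos - left + j
            if 0 <= idx < n:  # positions outside the document count as zero
                total += sum(w * x for w, x in zip(kernel[j], doc[idx]))
        out.append(total)
    return out


def multi_view(doc, views):
    """views: list of C views, each a list of M kernels of one length.

    Returns E as M rows of length N, max-pooled across the C views.
    """
    per_view = [[conv1d_same(doc, kern) for kern in view] for view in views]
    m, n = len(views[0]), len(doc)
    return [
        [max(per_view[c][i][t] for c in range(len(views))) for t in range(n)]
        for i in range(m)
    ]


# Toy example: N = 3 tokens, d = 1, C = 2 views (lengths 1 and 2), M = 1.
doc = [[1.0], [2.0], [3.0]]
views = [[[[1.0]]], [[[1.0], [1.0]]]]
E = multi_view(doc, views)
```

A practical implementation would use a tensor library; the loops here only serve to show which dimension is pooled.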

Prediction via label-dependent attention
Label-specific attention vectors are employed to collapse the variable-length document representations E down to fixed-length representations. For each label $l$, given the matrix E as input, a token-wise linear layer $u_l$ is trained to generate a vector of length N. This is normalised with a softmax operation, resulting in an attention vector $a_l$:

$$a_l = \mathrm{softmax}(E^\top u_l)$$

The attention vector is then multiplied with the matrix E, which yields a vector $v_l$ of length M, a document representation specific to a label:

$$v_l = E \, a_l$$

If multiple linear layers $u_{l,0}, u_{l,1}, \ldots$ are trained for each label at this stage, multiple attention vectors (or "heads") will be generated. Thus, multiple document representations $v_{l,0}, v_{l,1}, \ldots$ could be made available, each of length M, and concatenated together to form a longer label-specific representation for the document. We experimented with multiple attention vectors and found two vectors per label to be optimal. To make a prediction of the probability of each label, $P(l)$, there is a final dense binary classification layer with sigmoid activation. This is shown for two attention vectors:

$$P(l) = \sigma\left(W_l \, [v_{l,0}; v_{l,1}] + b_l\right)$$

where $\sigma$ is the sigmoid function, $[\cdot\,;\cdot]$ denotes concatenation, and $W_l$ and $b_l$ are the weights and bias of the dense layer for label $l$.

Figure 1: Network architecture. The output of the convolutional module is fed into the ensemble of ancestral attention heads for multi-task learning. Circles with dots represent matrix product operations. Ancestors are mapped to descendants by multiplication with a mapping connectivity matrix based on the ontology structure.
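The label-dependent attention described above can be sketched in a few lines of pure Python. This is a minimal illustration, assuming a bias-free scoring layer and a dense output layer with a bias; the function names `attend` and `predict` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(E, u):
    """E: M rows of length N; u: M scoring weights for one head.

    Returns the attention vector a (length N) and the pooled
    document representation v (length M).
    """
    m, n = len(E), len(E[0])
    scores = [sum(u[i] * E[i][t] for i in range(m)) for t in range(n)]
    a = softmax(scores)
    v = [sum(a[t] * E[i][t] for t in range(n)) for i in range(m)]
    return a, v


def predict(E, heads, w, b):
    """Concatenate the representations of all heads and apply a sigmoid layer.

    heads: list of u vectors (two per label in the text)
    w, b:  weights and bias of the dense layer (assumed form)
    """
    features = []
    for u in heads:
        features.extend(attend(E, u)[1])
    z = sum(wi * fi for wi, fi in zip(w, features)) + b
    return 1.0 / (1.0 + math.exp(-z))
```

With all-zero scoring weights the attention is uniform, so `v` reduces to the mean of each row of E; training moves the attention mass towards the tokens that are informative for the label.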

Prediction via label-dependent ontological attention ensembles
The ICD-9 codes are defined as an ontology, from more general categories down to more specific descriptions of diagnosis and procedure. Rather than simply training two attention heads per code as shown in Section 3.3, we propose to exploit the ontological structure to train shared attention heads between codes on the same branch of the tree, thus pooling information across labels which share ancestry. In this work, we use two levels of ancestry, where the first level corresponds to the portion of the code before the decimal point. For instance, for the code 425.11 Hypertrophic obstructive cardiomyopathy, the first-degree ancestor is 425 Cardiomyopathy and the second-degree ancestor is 420-429 Other forms of heart disease (the chapter in which the parent occurs). This is illustrated in Figure 2. For the entire set of 8929 labels, we identified 1167 first-degree ancestors and 179 second-degree ancestors. Compared to two attention vectors per code, this reduces the parameter space and memory requirements from 17,858 attention heads (8929 × 2) to 10,275 attention heads (8929 + 1167 + 179), as well as increasing the number of training samples for each attention head. The label prediction for each code is now derived from the concatenated child (c), parent (p) and grandparent (gp) document representations:

$$P(l) = \sigma\left(W_l \, [v_c; v_p; v_{gp}] + b_l\right)$$

where $\sigma$ is the sigmoid function, $[\cdot\,;\cdot]$ denotes concatenation, and $W_l$ and $b_l$ are the weights and bias of the final dense layer for label $l$. In order to facilitate learning of multiple attention heads, we employ deep supervision using the ancestral labels, adding auxiliary outputs for predicting the parent and grandparent nodes.
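The ancestor lookup and the head-count arithmetic above can be sketched as follows. The list `RANGES` is an illustrative subset containing only the two ranges named in this paper; a real system would load the full set of ranges from the ICD-9 ontology. The function names are hypothetical, and non-numeric codes (such as V codes) are simply left unmapped here.

```python
# Illustrative subset of ICD-9 ranges (only those mentioned in the text).
RANGES = [(420, 429, "420-429"), (401, 405, "401-405")]


def parent(code):
    """First-degree ancestor: the part of the code before the decimal point."""
    return code.split(".")[0]


def grandparent(code):
    """Second-degree ancestor: the range that contains the parent category."""
    category = parent(code)
    if not category.isdigit():
        return None  # e.g. V codes; not handled in this sketch
    for low, high, name in RANGES:
        if low <= int(category) <= high:
            return name
    return None  # outside the illustrative subset


# Head counts quoted in the text.
n_codes, n_parents, n_grandparents = 8929, 1167, 179
two_heads_per_code = 2 * n_codes
ontological_heads = n_codes + n_parents + n_grandparents
```

Because every code under 425 shares the same parent and grandparent heads, each shared head sees the training examples of all its descendants.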

Training process
We trained our model with weighted binary cross entropy loss using the Adam optimiser (Kingma and Ba, 2014) with learning rate 0.0005.
Stratified shuffling: The network accepts input of any length but all instances within a single batch need to be padded to the same length. To minimise the amount of padding, we used length-stratified shuffling between epochs. For this, documents were grouped by length and shuffled only within these groups; groups were themselves then shuffled before batch selection started.
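A minimal sketch of length-stratified shuffling is given below. The text does not state how the length groups were formed, so `group_size` is a hypothetical parameter: documents are sorted by length and cut into consecutive groups of that size.

```python
import random


def stratified_batches(lengths, batch_size, group_size, rng):
    """Return batches of document indices for one epoch.

    lengths:    token count of each document
    group_size: number of similar-length documents per group (assumed)
    rng:        a random.Random instance, for reproducibility
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    groups = [order[i:i + group_size] for i in range(0, len(order), group_size)]
    for group in groups:
        rng.shuffle(group)  # shuffle documents within each length group
    rng.shuffle(groups)     # then shuffle the order of the groups
    flat = [i for group in groups for i in group]
    return [flat[i:i + batch_size] for i in range(0, len(flat), batch_size)]
```

Choosing `group_size` as a multiple of `batch_size` keeps each batch inside a single length group, which bounds the padding needed per batch.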
Dampened class weighting: We employed the standard practice of loss weighting to prevent the imbalanced dataset from affecting performance on rare classes. We used a softer alternative to empirical class re-weighting, by taking the inverse frequencies of positive (label = 1) and negative (label = 0) examples for each code c, and adding a damping factor α. Here, $n_{\mathrm{label}_c = 1}$ stands for the number of positive examples for the ICD code c, and n stands for the total number of documents in the dataset.
Upweighting for codes with 5 examples or fewer, where we do not expect to perform well in any case, was removed altogether.

Deep supervision: The loss function was weighted in favour of child codes, with progressively less weight given to the codes at higher levels in the ICD ontology. A weighting of 1 was used for the child code loss, a weighting $w_h$ for the parent code auxiliary loss, and $w_h^2$ for the grandparent code auxiliary loss, i.e.,

$$L = L_c + w_h L_p + w_h^2 L_{gp}$$

where $L_c$, $L_p$ and $L_{gp}$ are the losses for the child, parent and grandparent outputs respectively. Optimal values were α = 0.25 and $w_h$ = 0.1.
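The two weighting schemes can be sketched as below. The exact form of the damped class weights is not recoverable from the text, so the version here is an assumption: the inverse class frequency raised to the power α, with the positive weight reset to 1 for codes with 5 examples or fewer. The combination of the three losses follows the weights 1, $w_h$ and $w_h^2$ stated in the text.

```python
def damped_weights(n_pos, n_total, alpha=0.25, min_examples=5):
    """Return (w_pos, w_neg) for one code.

    Assumed form: (n_total / n_class) ** alpha. With alpha = 1 this is plain
    inverse-frequency weighting; alpha = 0 disables weighting.
    """
    n_neg = n_total - n_pos
    w_neg = (n_total / n_neg) ** alpha if n_neg > 0 else 1.0
    if n_pos <= min_examples:
        return 1.0, w_neg  # no upweighting for very rare codes
    return (n_total / n_pos) ** alpha, w_neg


def total_loss(child, parent, grandparent, w_h=0.1):
    """Deep-supervision weighting: 1 for child, w_h for parent,
    w_h ** 2 for grandparent."""
    return child + w_h * parent + w_h ** 2 * grandparent
```

Under this assumed form, a code seen in 10 of 160 documents would receive a positive weight of 2 rather than the undamped 16, which is the sense in which the weighting is "softer".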

Implementation and hyperparameters
The word2vec embedding was implemented with Gensim (Řehůřek and Sojka, 2010) and the ICD coding model was implemented with PyTorch (Paszke et al., 2017). Experiments were run on Nvidia V100 16GB GPUs. Hyperparameter values were selected by maximising the development set macro-F1 score for codes with more than 5 training examples.

Results
In our evaluation, we focus on performance across all codes and hence we prioritise macro-averaged metrics, in particular macro-averaged precision, recall, and F1 score. Micro-averaged F1 score and Precision at k (P@k) are also reported in order to directly benchmark performance against previously reported metrics. All reported numbers are the average of 5 runs, starting from different random network initialisations. We compare our model to two previous state-of-the-art models: Mullenbach et al. (2018) and Sadoughi et al. (2018) (published only on arXiv). We trained these models with the hyperparameter values quoted in the respective publications, and used the same early stopping criteria as for our model. Both Mullenbach et al. and Sadoughi et al. use [...] is included in the model reported here. However, this regularisation is not used in our own model where we observed no benefit.

Table 4: Ablation study of individual components of the final method. All models are trained with the F1 macro stopping criterion. Experiments 2 and 3 do not use the ontological attention mechanism, and instead have one or two attention heads respectively per code-level label. For experiment 4, child-parent and parent-grandparent connections were randomised, removing shared semantics between codes across the full 3 levels.
Overall results are shown in Table 2. Our method significantly outperforms the benchmarks on macro-F1 and P@8.
Previous models have optimised for the micro-averaged F1 score. Different target metrics require different design choices: after removal of the class weighting in the loss function and when using F1 micro as our stopping criterion, we are also able to surpass previous state-of-the-art results on micro-F1. The results are presented in Table 3; our method achieves the highest F1 micro score, as well as the highest P@8 score. We note that the P@8 score is consistently higher for models stopped using the F1 micro criterion.
In Table 4 we present an ablation study. It can be seen that the improvement in performance of the ontological attention model is not simply due to increased capacity of the network, since even with 73% greater capacity (17,858 compared to 10,275 attention vectors), the two-vector multi-headed model has a 1.2% drop in performance. Experiments with deep supervision and randomisation of the ontology graph connections show the benefit of each component of the ontological architecture. We also measure the effect of additional changes made during optimisation of the architecture and training.
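To make the difference between the macro- and micro-averaged metrics reported above concrete, the sketch below computes both from per-code counts, together with P@k. It assumes macro-F1 is the mean of the per-code F1 scores; another convention is to take the F1 of the macro-averaged precision and recall, and the text does not say which was used. Function names are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(counts):
    """counts: one (tp, fp, fn) tuple per code.

    Macro: average the per-code F1 scores (every code counts equally).
    Micro: pool the counts first (frequent codes dominate).
    """
    macro = sum(f1(*c) for c in counts) / len(counts)
    tp, fp, fn = (sum(col) for col in zip(*counts))
    return macro, f1(tp, fp, fn)


def precision_at_k(scores, gold, k=8):
    """Fraction of the k highest-scoring codes that are in the gold set.

    scores: one score per code index; gold: set of correct code indices.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(1 for i in top if i in gold) / k
```

For example, a frequent code with counts (90, 10, 10) and a rare code with counts (1, 0, 9) give a micro-F1 of about 0.86 but a macro-F1 of about 0.54, which is why optimising for one metric does not automatically optimise the other.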
Levels of the ontology: Three levels of the ontology (including the code itself) were found to be optimal for the Ontological Attention model (see Figure 3). Adding parent and grandparent levels provides incremental gains in accuracy. Adding a level beyond the grandparent node (i.e., the great-grandparent level) does not provide further improvement. Since we identified only 22 ancestral nodes at the level directly above the grandparent, we hypothesise that the grouping becomes too coarse to be beneficial. In fact, all procedure codes share the same ancestor at this level; the remaining [...]

Figure 3: F1 macro for models using attention ensembles across different levels of the ontological tree. Error bars represent the standard deviation across 5 different random weight initialisations. The model with 1 level has only the code-level attention head; the model with 2 levels also includes the shared parent attention heads; the model with 3 levels adds the shared grandparent attention heads (this is our reported Ontological Attention model); and finally, the model with 4 levels adds shared great-grandparent attention heads.

Analysis of the attention weights
In Figure 4 we show how the weights of code-level $u_l$ vectors (which give rise to the attention heads) change when the ontological attention ensemble mechanism is introduced. As expected, we observe that in the case of a single attention head, the weights for different codes largely cluster together based on their position in the ontology graph. Once the parent and grandparent attention heads are trained, the ontological similarity structure on the code level mostly disappears. This suggests that the common features of all codes within a parent group are already extracted by the parent attention. Thus, the capacity of the code-level attention is spent on the representation of the differences between the descendants of a single parent.

Interpretability of the attention heads
In Section 4.2, we showed the links between the ontology and the attention heads within the space of the $u_l$ vector weights. We can widen this analysis to links between the predictions and the input, by examining which words in the input documents are attended by the three levels of attention heads for a given label. A qualitative visual example is shown in Figure 5. We performed quantitative frequency analysis of high-attention terms (keywords) in the training set. A term was considered a keyword if its attention weight in a document surpassed the threshold $t_{kw}$:

$$t_{kw} = \frac{\gamma_{kw}}{N}$$

where N is the length of a document and $\gamma_{kw}$ is a scalar parameter controlling the strictness of the threshold. With $\gamma_{kw} = 1$, a term is considered a keyword if its attention weight surpasses the uniformly distributed attention. In our analysis we chose $\gamma_{kw} = 17$ for all documents. We aggregated these keywords across all predicted labels in the training set, counting how many times a term is considered a keyword for a label. The results of this analysis are in line with our qualitative analysis of attention maps. The most frequent keywords for the labels presented in the example in Figure 5 include "cancer", "ca", "tumor" at the grandparent level (focusing on the concept of cancer); "metastatic", "metastases" and "metastasis" at the parent level (focusing on the concept of metastasis); and "brain", "craniotomy", "frontal" at the code level (focusing on terms relating to specific anatomy). A sibling code (198.5 Secondary malignant neoplasm of bone and bone marrow) displays similar behaviour in focusing on anatomy, with "bone", "spine", and "back" being among the most frequent keywords.
Not all codes display such structured behaviour. For instance, the grandparent 401-405 Hypertensive disease attended to the term "hypertension" most frequently. The parent code 401 Essential hypertension does not attend to "hypertension", but neither does it attend to any useful keywords; this may be due to the code being simple compared to its sibling codes, which are more specific (e.g., 402 Hypertensive heart disease). Interestingly, the children of 401 Essential hypertension attend to the word "hypertension" again, while also focusing on terms that set them apart from each other; e.g., 401.0 Malignant essential hypertension focuses on terms implying malignancy, such as "urgency", "emergency", and "hemorrhage".
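The keyword analysis described in this section can be sketched as follows. A term is kept when its attention weight exceeds $\gamma_{kw} / N$. Counting every occurrence of a keyword (rather than once per document) is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def keywords(tokens, attention, gamma_kw=17.0):
    """Return the tokens whose attention weight exceeds gamma_kw / N."""
    threshold = gamma_kw / len(tokens)
    return [tok for tok, a in zip(tokens, attention) if a > threshold]


def keyword_counts(documents, gamma_kw=17.0):
    """Aggregate keyword frequencies for one attention head.

    documents: iterable of (tokens, attention) pairs for that head
    """
    counts = Counter()
    for tokens, attention in documents:
        counts.update(keywords(tokens, attention, gamma_kw))
    return counts
```

For a document of 100 tokens, $\gamma_{kw} = 17$ gives a threshold of 0.17, so only tokens receiving at least 17 times the uniform attention weight are retained.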

Limitations due to labelling variability
Since performance on this task appears to be much lower than might be acceptable for real-world use, we investigated further. Figure 6 shows the per-label F1 scores; it can be seen that there is high [...] Inspection of examples for some of the poorly performing codes revealed some variability in coding policy, described further below.

Figure 4: [...] It can be seen that codes naturally cluster by their parent node. Selected higher-level alignments are indicated by additional contours: for grandparent nodes (3 nodes) and for diagnoses/procedure alignment (in the case of cardiovascular disease). (c) $u_l$ vectors in the ontological attention ensemble model for the same set of codes (and the same t-SNE hyperparameters). In most cases the clustering disappears, indicating that the attention weights for the ancestral codes have extracted the similarities from descendants' clusters.

Misreporting of codes
The phenomenon of human coding errors is reported in the literature; for instance, Kokotailo and Hill estimated sensitivity and specificity to be 80% and 100% respectively for ICD codes relating to stroke and its risk factors (Kokotailo and Hill, 2005). In the MIMIC-III dataset, we inspected the assignment of smoking codes (current smoker 305.1, past smoker V15.82, or never smoked i.e., no code at all), using regular expression matching to identify examples of possible miscoding, followed by manual inspection of 60 examples (10 relating to each possible miscoding category) to verify our estimates. We estimated that 10% of patients had been wrongly assigned codes, and 30% of patients who had a mention of smoking in their record had not been coded at all. We also observed that often the "correct" code is not clear-cut. For instance, many patients had smoked in the distant past or only smoke occasionally, or had only recently quit; in these cases, where the narrator reliability may be questionable, the decision of how to code is a matter of subjective clinical judgement.
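A regular-expression screen of the kind described above might look like the following sketch. The patterns are hypothetical and deliberately simple; they are not the expressions used in the study, and any flagged document would still need manual inspection.

```python
import re

# Hypothetical patterns; real clinical text would need many more variants.
PAST = re.compile(r"\b(former smoker|ex-smoker|quit smoking)\b")
CURRENT = re.compile(r"\b(current(ly)? smok\w*|smokes)\b")

SMOKING_CODES = {"305.1", "V15.82"}  # current smoker, past smoker


def expected_code(text):
    """Guess the smoking code implied by the text, or None if no mention."""
    lowered = text.lower()
    if PAST.search(lowered):
        return "V15.82"
    if CURRENT.search(lowered):
        return "305.1"
    return None


def possible_miscoding(text, assigned_codes):
    """True when the smoking codes assigned differ from the one the text suggests."""
    expected = expected_code(text)
    expected_set = {expected} if expected else set()
    return expected_set != (set(assigned_codes) & SMOKING_CODES)
```

Checking the past-smoker pattern first is a design choice for this sketch: phrases such as "quit smoking" would otherwise be at risk of being read as current smoking by broader patterns.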

Revisions to the coding standards
Another limitation of working with the MIMIC-III dataset is that during the deidentification process, information about absolute dates was discarded. This is problematic when we consider that the MIMIC-III dataset contains data that was collected between 2001 and 2012, and the ICD-9 coding standard was reviewed and updated annually between 2006 and 2013 (Centers for Medicare & Medicaid Services) i.e., each year some codes were added, removed or updated in their meaning.
To investigate this issue, we took the 2008 standard and mapped codes created post-2008 back to this year. In total, we identified 380 codes that are present in the dataset but were not defined in the 2008 standard. An example can be seen in Figure 7. We report our results on the 2008 codeset in Table 5. It can be seen that there is an improvement to the metrics on this dataset, which we expect would increase further if all codes were mapped back to the earliest date of 2001. Without time data, it is an unfair task to predict codes which are fundamentally time-dependent. This is an interesting example of conflicting interests between (de)identifiability and task authenticity.
During real-world deployment, codes should be assigned according to current standards. In order to use older data, codes should be mapped forwards rather than backwards. The backwards operation was possible by automated re-mapping of the codes; however, the forwards operation is more arduous. Newly introduced codes may require annotation of fresh labels or one-to-many conversion, both operations requiring manual inspection of the original text. A pragmatic approach would be to mask out codes for older documents where they cannot be automatically assigned.
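The backwards re-mapping and the proposed masking can be sketched as below. The code identifiers in the usage note are placeholders, not real ICD-9 codes, and the mapping tables, the document year and the function names are all hypothetical; in MIMIC-III itself the document year is not available.

```python
def map_back(codes, back_map):
    """Replace codes introduced after the reference year by their predecessor.

    back_map: newer code -> code in the reference standard (many-to-one).
    Returns the de-duplicated, sorted code set for the document.
    """
    return sorted({back_map.get(code, code) for code in codes})


def loss_mask(label_codes, doc_year, introduced):
    """Per-label mask: 1.0 if the code existed when the document was written,
    0.0 if it could not yet have been assigned.

    introduced: code -> year of introduction (codes not listed are assumed
    to have existed throughout).
    """
    return [1.0 if introduced.get(c, 0) <= doc_year else 0.0 for c in label_codes]
```

For instance, `map_back(["A.11", "B.2", "A.10"], {"A.10": "A.1", "A.11": "A.1"})` collapses the two newer placeholder codes onto `"A.1"`. Multiplying the per-label loss by the mask would stop the model being penalised for not predicting a code that did not exist at the time of writing.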

Conclusions
We have presented a neural architecture for automated clinical coding which is driven by the ontological graph of relationships between codes. This model establishes a new state-of-the-art result for the task of automated clinical coding on the MIMIC-III dataset. Compared to simply doubling the number of attention heads, our ontological attention ensemble mechanism provides improvements in accuracy, in memory efficiency, and in interpretability. Our method is not specific to an ontology, and in fact could be used for a graph of any formation. If we were to exploit further connections within the ICD ontology e.g., between related diagnoses and procedures, and between child codes which share modifier digits, we would expect to obtain a further performance boost.
We have illustrated that labels may not be reliably present or correct. Thus, even where plenty of training examples are available, the performance may (appear to) be low. In practice, the most successful approach may be to leverage a combination of automated techniques and manual input. An active learning setup would facilitate adoption of new codes by the model as well as allowing endorsement of suggested codes which might otherwise have been missed by manual assignment, and we propose this route for future research.