HyperCore: Hyperbolic and Co-graph Representation for Automatic ICD Coding

The International Classification of Diseases (ICD) provides a standardized way of classifying diseases, endowing each disease with a unique code. ICD coding aims to assign proper ICD codes to a medical record. Since manual coding is very laborious and prone to errors, many methods have been proposed for the automatic ICD coding task. However, most existing methods predict each code independently, ignoring two important characteristics: Code Hierarchy and Code Co-occurrence. In this paper, we propose a Hyperbolic and Co-graph Representation method (HyperCore) to address the above problem. Specifically, we propose a hyperbolic representation method to leverage the code hierarchy. Moreover, we propose a graph convolutional network to utilize the code co-occurrence. Experimental results on two widely used datasets demonstrate that our proposed model outperforms previous state-of-the-art methods.


Introduction
The International Classification of Diseases (ICD) is a healthcare classification system supported by the World Health Organization, which provides a unique code for each disease, symptom, sign and so on. ICD codes have been widely used for analyzing clinical data and monitoring health issues (Choi et al., 2016; Avati et al., 2018). Due to the importance of ICD codes, ICD coding, which assigns proper ICD codes to a medical record, has drawn much attention. The task of ICD coding is usually undertaken by professional coders according to doctors' diagnosis descriptions in the form of free text. However, manual coding is very expensive, time-consuming and error-prone.

Figure 1: An example clinical text (an 87-year-old male with Parkinson's disease presenting with chest tightness and respiratory failure) and the ICD codes predicted by an automatic ICD coding model.

The cost incurred by coding errors and the financial investment spent on improving coding quality are estimated to be $25 billion per year in the US (Lang, 2007). Two main reasons can account for this. First, only people who have medical expert knowledge and specialized ICD coding skills can handle the task, and it is hard to train such an eligible ICD coder. Second, it is difficult to assign the proper codes to an input document even for professional coders, because one document can be assigned multiple ICD codes and the number of codes in the ICD taxonomy is large. For example, there are over 15,000 and 60,000 codes in the ninth version (ICD-9) and the tenth version (ICD-10) of the ICD taxonomy, respectively.
To reduce human labor and coding errors, many methods have been carefully designed for automatic ICD coding (Perotte et al., 2013; Mullenbach et al., 2018). For example, in Figure 1, given the clinical text of a patient, the ICD coding model needs to automatically predict the corresponding ICD codes. The automatic ICD coding task can be modeled as a multi-label classification task, since each clinical text is usually accompanied by multiple codes.

Figure 2: An example of ICD-9 descriptors and the derived hierarchical structure.

Most of the previous methods handle each code in isolation and convert the multi-label problem into a set of binary classification problems, predicting whether each code of interest is present or not (Mullenbach et al., 2018; Rios and Kavuluru, 2018). Though effective, they ignore two important characteristics: Code Hierarchy and Code Co-occurrence, which can be leveraged to improve coding accuracy. In the following, we introduce the two characteristics and the reasons why they are critical for automatic ICD coding.
Code Hierarchy: Based on the ICD taxonomy, ICD codes are organized in a tree-like hierarchical structure, as shown in Figure 2, which indicates the parent-child and sibling relations between codes. In the hierarchical structure, the upper level nodes represent more generic disease categories and the lower level nodes represent more specific diseases. The code hierarchy can capture the mutual exclusion of some codes: if codes X and Y are both children of Z (i.e., X and Y are siblings), it is in general unlikely that X and Y are simultaneously assigned to a patient (Xie and Xing, 2018). For example, in Figure 2, if the code "464.00 (acute laryngitis without mention of obstruction)" is assigned to a patient, it is unlikely that the code "464.01 (acute laryngitis with obstruction)" is assigned to the same patient at the same time. If automatic ICD coding models ignore this characteristic, they are prone to giving inconsistent predictions. Thus, a challenging problem is how to model the code hierarchy and use it to capture the mutual exclusion of codes.
Code Co-occurrence: Since some diseases are concurrent or have causal relationships with each other, their codes usually co-occur in the clinical text, such as "997.91 (hypertension)" and "429.9 (heart disease)". In this paper, we call this characteristic code co-occurrence; it captures the correlations between codes. Code co-occurrence can be utilized to correctly predict codes that are difficult to predict from the clinical text alone. For example, in Figure 1, the code of "acute respiratory failure" can easily be inferred by capturing apparent clues (i.e., the green bold words) from the text. Although there are also a few clues for inferring the code of "acidosis", they are very obscure, and it is hard to predict the code from these obscure clues alone. Fortunately, there is a strong association between the two diseases: one of the main causes of "acidosis" is "acute respiratory failure". This prior knowledge can be captured via the fact that the codes of the two diseases usually co-occur in clinical texts. By considering this correlation, the automatic ICD coding model can better exploit the obscure clues to predict the code of "acidosis". Therefore, another problem is how to leverage code co-occurrence for ICD coding.
In this paper, we propose a novel method termed the Hyperbolic and Co-graph Representation method (HyperCore) to address the above problems. Since the tree-likeness of the hyperbolic space makes it more suitable than the Euclidean space for representing symbolic data with hierarchical structures (Nickel and Kiela, 2017), we propose a hyperbolic representation learning method to model the code hierarchy. Meanwhile, graphs have proved effective in modeling data correlations, and the graph convolutional network (GCN) can efficiently learn node representations (Kipf and Welling, 2016). Thus, we devise a code co-occurrence graph (co-graph) to capture code co-occurrence and exploit the GCN to learn the code representations in the co-graph.
The contributions of this paper are threefold. Firstly, to the best of our knowledge, this is the first work to propose a hyperbolic representation method to leverage the code hierarchy for automatic ICD coding. Secondly, this is also the first work to utilize a GCN to exploit the code co-occurrence correlations for automatic ICD coding. Thirdly, experiments on two widely used automatic ICD coding datasets show that our proposed model outperforms previous state-of-the-art methods.

Related Work
Automatic ICD Coding. Automatic ICD coding is a challenging and important task in the medical informatics community, which has been studied with traditional machine learning methods (Larkey and Croft, 1996; Perotte et al., 2013) and neural network methods (Koopman et al., 2015; Rios and Kavuluru, 2018; Yu et al., 2019). Given discharge summaries, Perotte et al. (2013) propose a hierarchical SVM model to predict ICD codes. Recently, neural network methods have been introduced to the task. Mullenbach et al. (2018) propose an attention-based convolutional neural network (CNN) model to capture important information for each code. Xie and Xing (2018) adopt a tree long short-term memory (LSTM) network to utilize code descriptions. Though effective, these methods ignore the code hierarchy and code co-occurrence.

Figure 3: The architecture of the Hyperbolic and Co-graph Representation method (HyperCore). In the Poincaré ball B^n, we show the embedded code hierarchy (i.e., the tree-like hierarchical structure). The dots l_i (i = 1, 2, 3) on the tree-like hierarchical structure and the triangles m_i (i = 1, 2, 3) in the Poincaré ball denote hyperbolic code embeddings and hyperbolic document representations, respectively.

Hyperbolic Representation. Hyperbolic space has been applied to modeling complex networks (Krioukov et al., 2010). Recent research on representation learning demonstrates that the hyperbolic space is more suitable than the Euclidean space for representing symbolic data with hierarchical structures (Nickel and Kiela, 2017, 2018; Hamann, 2018). In the field of natural language processing (NLP), hyperbolic representations have been successfully applied to question answering (Tay et al., 2018), machine translation (Gulcehre et al., 2018) and sentence representation (Dhingra et al., 2018). To our knowledge, this is the first work to apply a hyperbolic representation method to the automatic ICD coding task.

Graph Convolutional Networks. GCN (Kipf and Welling, 2016) is a powerful neural network which operates on graph data. It yields substantial improvements over various NLP tasks such as semantic role labeling, multi-document summarization (Yasunaga et al., 2017) and machine translation (Bastings et al., 2017). Veličković et al. (2017) propose graph attention networks (GAT) to summarize neighborhood features by using masked self-attentional layers. We are the first to capture the code co-occurrence characteristic via a GCN for the automatic ICD coding task.

Method
We propose a hyperbolic and co-graph representation (HyperCore) model for automatic ICD coding. Firstly, to capture the code hierarchy, we learn hyperbolic code representations and measure the similarities between the document and codes in the hyperbolic space. Secondly, to exploit code co-occurrence, we use a GCN to learn code co-occurrence representations and use them as query vectors to obtain code-aware document representations. Finally, the document-code similarity scores and the code-aware document representations are aggregated to predict the codes. Figure 3 shows the overall architecture of our proposed model.

Convolution Neural Network Encoder
We first map each word into a low-dimensional word embedding space. The document can be denoted as X = {x_1, x_2, ..., x_N}, where N is the length of the document. Then, we exploit a CNN to encode the clinical text due to its high computational efficiency:

h_i = tanh(W_c * x_{i:i+k-1} + b_c)

where W_c is the convolutional filter, b_c is the bias, k is the filter size, and * is the convolution operator.
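As a minimal sketch of this encoder (a NumPy stand-in for the trained CNN; the tiny dimensions and random weights are illustrative assumptions, and valid, non-padded convolution is assumed):

```python
import numpy as np

def cnn_encode(X, W_c, b_c, k):
    """Slide a width-k window over the word embeddings and apply a
    tanh-activated linear map, i.e. h_i = tanh(W_c * x_{i:i+k-1} + b_c)."""
    N, d_e = X.shape
    # Flatten each window of k consecutive word embeddings into one vector.
    windows = np.stack([X[i:i + k].reshape(-1) for i in range(N - k + 1)])
    return np.tanh(windows @ W_c + b_c)  # shape: (N - k + 1, d_f)

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 4))        # N = 12 words, embedding size d_e = 4
W_c = rng.normal(size=(3 * 4, 5))   # filter size k = 3, d_f = 5 output channels
b_c = np.zeros(5)
H = cnn_encode(X, W_c, b_c, k=3)
```

Each row of `H` is the hidden representation of one k-word window of the document.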

Code-wise Attention
After encoding by the CNN, we obtain the document representation H = {h_1, h_2, ..., h_N}. Since we need to assign multiple codes to each document and different codes may focus on different sections of the document, we employ code-wise attention to learn a relevant document representation for each code. We first generate the code vector for each code by averaging the word embeddings of its descriptor:

v_i = (1 / N_d) Σ_{j=1}^{N_d} w_j

where v_i is the code vector, N_d is the length of the descriptor, w_j is the embedding of the j-th word in the descriptor, and L is the total number of codes in the dataset (Jouhet et al., 2012; Johnson et al., 2016).

The code vector set is denoted as V = {v_1, v_2, ..., v_L}.
Then, we generate the code-wise attention vector via a matrix-vector product:

α_i = softmax(H^T v_i)

Finally, we use the document representation H and the attention vector α_i to generate the code-aware document representation:

c_i = H α_i

We concatenate the c_i (i = 1, ..., L) to obtain the code-aware document representation, denoted as C = {c_1, c_2, ..., c_L} ∈ R^{d_c×L}.
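The two steps (attention weights via a matrix-vector product, then a weighted sum of hidden states) can be sketched as follows; `H` and `V` are random stand-ins for the encoded document and the code vectors:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def code_wise_attention(H, V):
    """H: (N, d_c) hidden states; V: (L, d_c) code vectors.
    For each code, attend over the N positions and pool the hidden states."""
    C = []
    for v in V:
        alpha = softmax(H @ v)  # attention weights over positions, sum to 1
        C.append(H.T @ alpha)   # weighted sum -> code-aware document vector
    return np.stack(C)          # (L, d_c)

rng = np.random.default_rng(1)
H = rng.normal(size=(6, 4))  # N = 6 positions, d_c = 4
V = rng.normal(size=(3, 4))  # L = 3 codes
C = code_wise_attention(H, V)
```

Because each c_i is a convex combination of the hidden states, it stays inside the coordinate-wise range of H.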

Document-Code Similarities in Hyperbolic Space
To capture the code hierarchy, we learn the code hyperbolic representations and measure the similarities between document and codes in the hyperbolic space. In this section, we propose a hyperbolic code embedder to obtain code hyperbolic representations, and we also propose a hyperbolic document projector to project the document representations from Euclidean space to hyperbolic space. We then compute the similarities between the document and codes in the hyperbolic space.

Hyperbolic Geometry
Hyperbolic geometry is a non-Euclidean geometry which studies spaces of constant negative curvature. Our approach is based on the Poincaré ball model (Nickel and Kiela, 2017), which is a particular model of hyperbolic space and is well-suited for gradient-based optimization. In particular, let B^n = {x ∈ R^n : ||x|| < 1} be the open n-dimensional unit ball, where ||·|| denotes the Euclidean norm. The Poincaré ball (B^n, g_x) is the Riemannian manifold given by the open unit ball equipped with the Riemannian metric tensor:

g_x = (2 / (1 − ||x||^2))^2 g^E

where x ∈ B^n and g^E denotes the Euclidean metric tensor. Furthermore, the distance between two points u, v ∈ B^n is given as:

d(u, v) = arcosh(1 + 2 ||u − v||^2 / ((1 − ||u||^2)(1 − ||v||^2)))

where arcosh is the inverse hyperbolic cosine function, i.e., arcosh(x) = ln(x + sqrt(x^2 − 1)). If we consider the origin O and two points u, v moving towards the boundary of the Poincaré ball (i.e., ||u||, ||v|| → 1), the distance d(u, v) approaches d(u, O) + d(O, v). That is, the path between the two points converges to a path through the origin, which can be seen as a tree-like hierarchical structure.
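The distance function can be implemented directly from the formula above. The small demo illustrates the tree-likeness property: for two near-boundary points on opposite sides of the ball, the direct distance is essentially the sum of their distances through the origin:

```python
import numpy as np

def poincare_distance(u, v):
    """d(u, v) = arcosh(1 + 2||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2)))."""
    sq = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq / denom)

u = np.array([0.9, 0.0])    # near the boundary
v = np.array([-0.9, 0.0])   # near the boundary, opposite side
o = np.zeros(2)             # the origin O
# The geodesic between u and v runs through the origin, so
# d(u, v) is (nearly) d(u, O) + d(O, v), mimicking a tree path.
```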

Hyperbolic Code Embedder
The tree-likeness of the hyperbolic space makes it natural to embed hierarchical structures. By embedding the code hierarchy in the Poincaré ball, the top codes are placed near the origin and the bottom codes near the boundary. The embedding norm represents the depth in the hierarchy, and the distance between embeddings represents the similarity between codes. Let D = {(l_p, l_q)} be the set of parent-child relations between code pairs, and let Θ = {θ_i}_{i=1}^{T}, θ_i ∈ B^{d_p}, be the corresponding code embedding set, where T is the number of all ICD codes. In order to enforce related codes to be closer than unrelated codes, we minimize the following loss function to obtain the hyperbolic code representations, subject to ||θ_i|| < 1 (i = 1, ..., L):

L(Θ) = − Σ_{(l_p, l_q) ∈ D} log( exp(−d(θ_p, θ_q)) / Σ_{l_q' ∈ N(l_p)} exp(−d(θ_p, θ_q')) )

where N(l_p) = {l_q' | (l_p, l_q') ∉ D} ∪ {l_p} is the set of negative samples for l_p and d(·, ·) is the distance defined in Equation (6). The hyperbolic code representations used in the rest of our model are denoted as Θ = {θ_1, θ_2, ..., θ_L}.
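A sketch of this objective under the negative-sampling scheme (the 2-D embeddings and the single negative below are hand-picked for illustration; a real training run would sample many negatives and apply Riemannian updates):

```python
import numpy as np

def poincare_distance(u, v):
    sq = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq / denom)

def hierarchy_loss(theta, pairs, negatives):
    """Negative log-softmax of exp(-d) for each parent-child pair,
    scored against sampled non-related codes."""
    loss = 0.0
    for p, q in pairs:
        num = np.exp(-poincare_distance(theta[p], theta[q]))
        den = num + sum(np.exp(-poincare_distance(theta[p], theta[n]))
                        for n in negatives[p])
        loss += -np.log(num / den)
    return loss

pairs = [(0, 1)]        # code 1 is a child of code 0
negatives = {0: [2]}    # code 2 is unrelated to code 0
# Child embedded near its parent vs. far from it:
loss_near = hierarchy_loss({0: np.array([0.1, 0.0]), 1: np.array([0.2, 0.0]),
                            2: np.array([-0.5, 0.5])}, pairs, negatives)
loss_far = hierarchy_loss({0: np.array([0.1, 0.0]), 1: np.array([-0.5, 0.5]),
                           2: np.array([0.2, 0.0])}, pairs, negatives)
```

The loss is lower when the parent-child pair is embedded close together, which is exactly what the minimization enforces.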

Hyperbolic Document Projector
To compute the similarities between the document and codes in hyperbolic space, the code-aware document representations C = {c_1, c_2, ..., c_L} need to be projected into the hyperbolic space. We exploit the re-parameterization technique (Dhingra et al., 2018; López et al., 2019) to implement this, which involves computing a direction vector r and a norm magnitude η. We use c_i as an example to illustrate the procedure:

r_i = Φ_dir(c_i) / ||Φ_dir(c_i)||,  η_i = σ(Φ_norm(c_i))

where Φ_dir : R^{d_c} → R^{d_p} is the direction function, which we parameterize as a multi-layer perceptron (MLP), and Φ_norm : R^{d_c} → R is the norm magnitude function, implemented as a linear layer. σ is the sigmoid function, which ensures that the resulting norm η_i ∈ (0, 1). The re-parameterized document representation is defined as m_i = η_i r_i, which lies in the hyperbolic space B^{d_p}.
The re-parameterization technique makes it possible to project the code-aware document representation into the Poincaré ball while avoiding the stochastic Riemannian optimization method (Bonnabel, 2013) for learning parameters in the hyperbolic space. Instead, we can exploit standard deep learning optimization methods to update the parameters of the entire model.
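A sketch of the re-parameterization, where a single linear map stands in for the MLP Φ_dir (an assumption for brevity); by construction the output norm is η_i ∈ (0, 1), so the result always lies strictly inside the unit ball:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def project_to_ball(c, W_dir, w_norm):
    """Map a Euclidean vector c into the open unit ball B^{d_p}.
    W_dir plays the role of Phi_dir (one-layer stand-in for the MLP),
    w_norm the role of Phi_norm (a linear layer)."""
    d = W_dir @ c
    r = d / np.linalg.norm(d)          # unit direction vector r_i
    eta = sigmoid(float(w_norm @ c))   # norm magnitude eta_i in (0, 1)
    return eta * r                     # m_i = eta_i * r_i

rng = np.random.default_rng(2)
c = rng.normal(size=6)            # a code-aware document vector, d_c = 6
W_dir = rng.normal(size=(3, 6))   # projects into d_p = 3 dimensions
w_norm = rng.normal(size=6)
m = project_to_ball(c, W_dir, w_norm)
```

Since the ball constraint is satisfied by construction, ordinary gradient descent on `W_dir` and `w_norm` suffices, which is the point made above.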

Compute Document-Code Similarity
Since there is no clear notion of inner product in hyperbolic space, cosine similarity is not an appropriate metric. In our work, we adopt the hyperbolic distance function to model the relationships between the document and codes. Since the hyperbolic document representation for each code has been obtained, we just need to compute the similarity with the corresponding hyperbolic code embedding:

S = [−d(m_1, θ_1); −d(m_2, θ_2); ...; −d(m_L, θ_L)]

where S ∈ R^L is the vector of document-code similarity scores, [;] is the concatenation operation, and d(·, ·) is the distance function defined in Equation (6).

Code-aware Document Representations via Graph Convolutional Network
To exploit code co-occurrence, we use a graph to model the code co-occurrence correlations and the GCN to learn code co-occurrence representations. In this section, we first construct the co-graph according to the statistics of code co-occurrence in the training set, and then exploit the GCN to encode the code co-occurrence correlations.

Code Co-graph Construction
Given a graph with L nodes, we can represent the graph using an L × L adjacency matrix A. To capture the co-occurrence correlations between codes, we build the code co-occurrence graph (co-graph), which uses the code co-occurrence matrix as the adjacency matrix: if the i-th code and the j-th code co-occur in a clinical text, there is an edge between them. Intuitively, if the i-th code co-appears with the j-th code more often than with the k-th code, the i-th and j-th codes should have stronger dependencies. Therefore, in our work, we use the number of co-occurrences of two codes as the connection weight in the adjacency matrix, which represents this prior knowledge. For example, if the i-th code co-appears n times with the j-th code, we set A_ij = n.
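Constructing the weighted adjacency matrix from training-set code assignments can be sketched as follows (the code indices and documents below are toy examples):

```python
import numpy as np

def build_cograph(code_sets, L):
    """Adjacency matrix A with A[i, j] = number of training documents
    in which code i and code j are assigned together."""
    A = np.zeros((L, L))
    for codes in code_sets:
        codes = sorted(set(codes))  # ignore duplicate codes in one document
        for i in codes:
            for j in codes:
                if i != j:
                    A[i, j] += 1.0
    return A

# Three toy documents over L = 4 codes; codes 0 and 1 co-occur twice.
A = build_cograph([[0, 1], [0, 1, 2], [1, 3]], L=4)
```

The resulting matrix is symmetric with a zero diagonal, matching the undirected co-graph described above.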

Code Co-occurrence Encoding via GCN
The inputs to the GCN are the initial code representations V, obtained via Equation (2), and the adjacency matrix A. We use the standard convolution computation (Kipf and Welling, 2016) to encode code co-occurrence:

H^{(l+1)} = ρ( D^{−1/2} Ã D^{−1/2} H^{(l)} W^{(l)} )

where Ã = A + I, I is the identity matrix, D_ii = Σ_j Ã_ij, H^{(l)} ∈ R^{L×d_c}, H^{(0)} = V, and W^{(l)} is a layer-specific trainable weight matrix. ρ is an activation function (e.g., ReLU). After co-occurrence correlation encoding via the GCN, the code representations are able to capture the code co-occurrence correlations. Then, we use the code-wise attention to obtain code-aware document representations, denoted as D = {d_1, d_2, ..., d_L}.
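One propagation step of this convolution can be sketched as follows (single layer, ReLU activation, toy dimensions; a full model would stack layers and learn `W`):

```python
import numpy as np

def gcn_layer(H, A, W):
    """One GCN step: rho(D^{-1/2} (A + I) D^{-1/2} H W) with rho = ReLU."""
    A_tilde = A + np.eye(A.shape[0])       # add self-loops
    d = A_tilde.sum(axis=1)                # degree of each node
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d)) # symmetric normalization
    return np.maximum(0.0, D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W)

rng = np.random.default_rng(3)
A = np.array([[0., 2., 1.],
              [2., 0., 0.],
              [1., 0., 0.]])   # weighted co-graph over L = 3 codes
H0 = rng.normal(size=(3, 4))   # initial code vectors V, d_c = 4
W = rng.normal(size=(4, 4))
H1 = gcn_layer(H0, A, W)
```

Each code's new representation mixes in its co-occurring neighbors, weighted by the normalized co-occurrence counts.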

Aggregation Layer
After capturing the code hierarchy and code co-occurrence, we use an aggregation layer to fuse the document-code similarity scores S and the code-aware document representations D so that the two kinds of representations enhance each other:

U = λ W_s S + (1 − λ) W_d D

where W_s and W_d are transformation matrices, U = {u_1, u_2, ..., u_L} ∈ R^L contains the final document representation for each code, and λ is a hyper-parameter.

Table 1: Comparison of our model and other baselines on the MIMIC-III dataset. We run our model 10 times and each time we use different random seeds for initialization. We report the mean ± standard deviation of each result.

Training
The prediction for each code is generated via:

ŷ_i = σ(u_i)

Our model is trained using a multi-label binary cross-entropy loss:

L = − Σ_{i=1}^{L} [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ]

where y_i ∈ {0, 1} is the ground truth for the i-th code and σ is the sigmoid function.
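A sketch of the prediction and loss computation, assuming the per-code scores u_i are turned into probabilities with a sigmoid as in the equation above (the scores and labels below are toy values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(U, y, eps=1e-12):
    """Multi-label binary cross-entropy over the L per-code scores U."""
    p = sigmoid(U)  # y_hat_i = sigma(u_i)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

U = np.array([8.0, -8.0, 7.5])  # confident scores for L = 3 codes
y = np.array([1.0, 0.0, 1.0])   # ground-truth code assignments
```

Confidently correct scores give a near-zero loss, while flipping the labels makes the loss large, as expected for cross-entropy.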

Datasets
We evaluate our proposed model on two widely used datasets, MIMIC-II (Jouhet et al., 2012) and MIMIC-III (Johnson et al., 2016). Both datasets contain discharge summaries that are tagged by human coders with a set of ICD-9 codes. For the MIMIC-III dataset, we use the same experimental setting as previous works (Shi et al., 2017; Mullenbach et al., 2018).

Metrics and Parameter Settings
Following previous work (Mullenbach et al., 2018), we use macro-averaged and micro-averaged F1, macro-averaged and micro-averaged AUC (area under the ROC, i.e., receiver operating characteristic curve) and Precision@N (P@N) as the metrics. P@N indicates the proportion of correctly-predicted labels among the top-N predicted labels.
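For concreteness, P@N can be computed as follows (the scores and gold set below are toy values):

```python
import numpy as np

def precision_at_n(scores, gold, n):
    """Proportion of the top-n scored codes that appear in the gold set."""
    top = np.argsort(-scores)[:n]  # indices of the n highest scores
    return sum(1 for i in top if i in gold) / n

scores = np.array([0.9, 0.1, 0.8, 0.3])  # model scores for 4 codes
gold = {0, 2}                            # codes actually assigned
```

Here the top-2 predictions are codes 0 and 2, both correct, so P@2 = 1.0, while P@3 drops to 2/3 because the third-ranked code is wrong.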
Hyper-parameters are tuned on the development set by grid search. The word embedding size d e is 100. The convolution filter size is 10. The size of the filter output is 200. The dropout rate is 0.4. The λ is 0.2. The batch size is 16. Adam (Kingma and Ba, 2014) is used for optimization with an initial learning rate 1e-4. We pre-train the word embeddings on the combination of training sets of MIMIC-II and MIMIC-III datasets by using word2vec toolkit (Mikolov et al., 2013).

Baselines
SVM: A hierarchical support vector machine (SVM) is proposed by Perotte et al. (2013) to use the hierarchical nature of ICD codes. It is evaluated on the MIMIC-II dataset.

C-MemNN: A condensed memory neural network is proposed by Prakash et al. (2017) to predict ICD codes on the MIMIC-III 50 dataset.

C-LSTM-ATT: A character-aware LSTM-based attention model is proposed by Shi et al. (2017). It is also evaluated on the MIMIC-III 50 dataset.

HA-GRU: A hierarchical attention gated recurrent unit model is proposed by Baumel et al. (2018) to predict ICD codes on the MIMIC-II dataset.

CAML & DR-CAML: The convolutional attention network for multi-label classification (CAML) is proposed by Mullenbach et al. (2018). DR-CAML is an extension of CAML which incorporates the code description. They achieve the state-of-the-art performance on the MIMIC-III and MIMIC-II datasets.

Compared with State-of-the-art Methods
We repeat training 10 times, each time with a different random seed for initialization, and report the mean ± standard deviation of each result. Table 1 and Table 2 show the results on the MIMIC-III and MIMIC-II datasets, respectively. Since some baselines are evaluated either on MIMIC-III or MIMIC-II, the baselines used for the two datasets differ. Overall, we observe that: (1) In Table 1, our method HyperCore outperforms all the baselines on the MIMIC-III dataset. For example, compared with the state-of-the-art model DR-CAML, our method achieves 2.2% and 3% improvements of the Micro-F1 score on MIMIC-III full and MIMIC-III 50, respectively. This indicates that, compared to neural network based models that handle each code in isolation, our method can better take advantage of the rich correlations among codes. In addition, the small standard deviations indicate that our model obtains consistently good results.
(2) As in previous work (Mullenbach et al., 2018), the Macro-F1 score of our method on MIMIC-III full is lower than that on MIMIC-III 50. The reason is that MIMIC-III full has a long-tail frequency distribution, and Macro-F1 places more emphasis on rare-code prediction. Therefore, it is difficult to achieve a high Macro-F1 score on MIMIC-III full. Nevertheless, our method still achieves the best result on the Macro-F1 metric, which indicates that our method is very effective.
(3) In Table 2, our method HyperCore also achieves the best performance over all metrics on MIMIC-II. In particular, compared with the state-of-the-art model DR-CAML, our method achieves a 5.9% improvement of Macro-AUC, which indicates the effectiveness of our method.
(4) As shown in Table 2, the neural network based methods outperform the traditional model (SVM), which indicates the limitation of human-designed features and the advancement of neural networks for the automatic ICD coding.

Ablation Experiment
To investigate the effectiveness of the hyperbolic and co-graph representations, we conduct ablation studies. The experimental results are listed in Table 3. From the results, we observe that: (1) Effectiveness of the hyperbolic representation. Compared with the model with the hyperbolic representation removed, HyperCore improves the Micro-F1 score from 0.539 to 0.551 on the MIMIC-III full dataset. It demonstrates the effectiveness of the hyperbolic representation.
(2) Effectiveness of the co-graph representation. Compared with the model with the co-graph representation removed, HyperCore improves the performance, achieving a 2.6% improvement of the Micro-F1 score on the MIMIC-III 50 dataset. This large improvement indicates that the co-graph representation is very effective.
(3) Effectiveness of the combined hyperbolic and co-graph representation. When we remove both the hyperbolic and co-graph representations, the performance drops significantly: the Micro-F1 score drops from 0.477 to 0.439 on the MIMIC-II dataset. It indicates that simultaneously exploiting the hyperbolic and co-graph representations is also very effective.

The Analysis of Hyperbolic Code Embedding Dimension
Since the dimensionality of the hyperbolic code embeddings is very important for the hyperbolic representation, we investigate its effect. The size of the hyperbolic code embeddings is set to 10, 20, 50, 70 and 100. Table 4 shows the results of our model on the MIMIC-III and MIMIC-II datasets. We have two important observations: (1) The best hyperbolic code embedding dimensionality on MIMIC-III full is larger than that on MIMIC-III 50 and MIMIC-II. The reason may be that the number of codes in MIMIC-III full is larger than in the other two datasets, which requires higher-dimensional hyperbolic code embeddings to represent the code hierarchy.
(2) The performance does not always improve as the hyperbolic code embedding size increases. We conjecture that low-dimensional embeddings can already capture the hierarchy, while the network is prone to over-fitting when high-dimensional hyperbolic code embeddings are used.

The Hierarchy of Hyperbolic Code Embedding
After embedding the ICD codes into the hyperbolic space, the top level codes are placed near the origin and the low level codes near the boundary, which is reflected in their norms. Table 5 shows examples of ICD-9 codes and their hyperbolic norms. The first and second blocks list codes of "Diseases of the Respiratory System" and "Diseases of the Digestive System", respectively. As expected, the lower level codes have higher hyperbolic norms, which confirms that the more specific the disease, the larger the hyperbolic norm. For example, code "487.8 (influenza with other manifestations)" has a higher norm than "487 (influenza)", and "550.0 (inguinal hernia with gangrene)" has a higher norm than "550 (inguinal hernia)". This indicates that the hyperbolic code embeddings can capture the code hierarchy.

Case Study
We give an example shown in Figure 4 to illustrate the visualization of code-wise attention and the effectiveness of hyperbolic and co-graph representation.
(1) Code-wise attention visualization: When the HyperCore model predicts the code "518.81 (acute respiratory failure)", it assigns larger weights to the more informative words, such as "respiratory failure" and "chest tightness". This shows that the code-wise attention can select the most informative words.
(2) The effectiveness of hyperbolic representations: Our proposed model and CNN+Attention can both correctly predict the code "518.81". However, the CNN+Attention model gives contradictory predictions. Our proposed model avoids such contradictory predictions by exploiting the code hierarchy, which demonstrates the effectiveness of the hyperbolic representations.
(3) The effectiveness of co-graph representation: Although there is no very obvious clue to predict the code "276.2 (acidosis)", our model can exploit the co-occurrence between the code "518.81" and "276.2" to assist in inferring the code "276.2". It demonstrates the effectiveness of the co-graph representation.

Conclusion
In this paper, we propose a novel hyperbolic and co-graph representation framework for the automatic ICD coding task, which can jointly exploit the code hierarchy and code co-occurrence. We exploit a hyperbolic representation learning method to leverage the code hierarchy in the hyperbolic space. Moreover, we use a graph convolutional network to capture the co-occurrence correlations. Experimental results on two widely used datasets indicate that our proposed model outperforms previous state-of-the-art methods. We believe our method can also be applied to other tasks that need to exploit hierarchical label structures and label co-occurrence, such as fine-grained entity typing and hierarchical multi-label classification.