Clinical Concept Extraction for Document-Level Coding

The text of clinical notes can be a valuable source of patient information and clinical assessments. Historically, the primary approach for exploiting clinical notes has been information extraction: linking spans of text to concepts in a detailed domain ontology. However, recent work has demonstrated the potential of supervised machine learning to extract document-level codes directly from the raw text of clinical notes. We propose to bridge the gap between the two approaches with two novel syntheses: (1) treating extracted concepts as features, which are used to supplement or replace the text of the note; (2) treating extracted concepts as labels, which are used to learn a better representation of the text. Unfortunately, the resulting concepts do not yield performance gains on the document-level clinical coding task. We explore possible explanations and future research directions.


Introduction
Clinical decision support from the raw text of notes taken by clinicians about patients has proven to be a valuable alternative to state-of-the-art models built from structured EHRs. Clinical notes contain valuable information that the structured part of the EHR does not provide, and do not rely on expensive and time-consuming human annotation (Torres et al., 2017; American Academy of Professional Coders, 2019). Impressive advances using deep learning have allowed for modeling on the raw text alone (Mullenbach et al., 2018; Rios and Kavuluru, 2018a; Baumel et al., 2018). However, these approaches have shortcomings: clinical text is noisy, and often contains heavy amounts of abbreviations and acronyms, a challenge for machine reading (Nguyen and Patrick, 2016). Additionally, rare words replaced with "UNK" tokens for better generalization may be crucial for predicting rare labels.
Clinical concept extraction tools abstract over the noise inherent in surface representations of clinical text by linking raw text to standardized concepts in clinical ontologies. The Apache clinical Text Analysis Knowledge Extraction System (cTAKES; Savova et al., 2010) is the most widely-used such tool, with over 1000 citations. Based on rules and non-neural machine learning methods and engineered for almost a decade, cTAKES provides an easily-obtainable source of human-encoded domain knowledge, although it cannot leverage deep learning to make document-level predictions.
Our goal in this paper is to maximize the predictive power of clinical notes by bridging the gap between information extraction and deep learning models. We address the following research questions: how can we best leverage tools such as cTAKES on clinical text? Can we show the value of these tools in linking unstructured data to structured codes in an existing ontology for downstream prediction?
We explore two novel hybrids of these methods: data augmentation (augmenting text with extracted concepts) and multi-task learning (learning to predict the output of cTAKES). Unfortunately, in neither case does cTAKES improve downstream performance on the document-level clinical coding task. We probe this negative result through an extensive series of ablations, and suggest possible explanations, such as the lack of word variation captured through concept assignment.

Related Work
Clinical Ontologies Clinical concept ontologies facilitate the maintenance of EHR systems with standardized and comprehensive code sets, allowing consistency across healthcare institutions and practitioners. The Unified Medical Language System (UMLS; Lindberg et al., 1993) maintains a standardized vocabulary of clinical concepts, each of which is assigned a concept unique identifier (CUI). The Systematized Nomenclature of Medicine-Clinical Terms (SNOMED-CT; Donnelly, 2006) and the International Classification of Diseases (ICD; National Center for Health Statistics, 1991) build on the UMLS and provide structure by linking concepts based on their relationships. The SNOMED ontology has over 340,000 active concepts, ranging from fine-grained ("Adenylosuccinate lyase deficiency") to extremely general ("patient"). The ICD ontology is narrower in scope, with around 13,000 diagnosis and procedure codes used for insurance billing. Unlike SNOMED, which has an unconstrained graph structure, ICD9 is organized into a top-down hierarchy of specificity (see Figure 1).

Clinical Information Extraction Tools
There are several tools for extracting structured information from clinical text. Popular types of information extraction include named-entity recognition, identifying words or phrases in the text which align with clinical concepts, and ontology mapping, labelling the identified words and phrases with their respective clinical codes from an existing ontology. Of the tools which perform both of these tasks, the open-source Apache cTAKES is used in over 50% of recent work (Wang et al., 2017), outpacing competitors such as MetaMap (Aronson, 2001) and MedLEE (Friedman, 2000).
cTAKES utilizes a rule-based system for performing ontology mapping, via a UMLS dictionary lookup on the noun phrases inferred by a part-of-speech tagger. (Ontology mapping also serves as a form of text normalization.) Taking raw text as input, the software outputs a set of UMLS concepts identified in the text and their positions, with functionality to map them to other ontologies such as SNOMED and ICD9. It is highly scalable, and can be deployed locally to avoid compromising identifiable patient data. Figure 2 shows an example cTAKES annotation on a clinical record.
Clinical Named-Entity Recognition (NER) Recent work has focused on developing tools to replace cTAKES with modern neural architectures such as Bi-LSTM CRFs (Boag et al., 2018; Tao et al., 2018; Xu et al., 2018; Greenberg et al., 2018), varying in task definition and evaluation. Newer approaches leverage contextualized word embeddings such as ELMo (Zhu et al., 2018; Si et al., 2019). In contrast, we focus on maximizing the power of existing tools such as cTAKES. This approach is more practical in the near term: the adoption of new NER systems in the clinical domain is inhibited by the computational power, data, and gold-label annotations needed to build and train such token-level models, as well as by concerns about the effectiveness of domain transfer and the necessity of performing annotation locally to protect patient data, capabilities that newer models do not provide.

NER in Text-based Models
Prior works use the output of cTAKES as features for disease- and drug-specific tasks, but either concatenate them as shallow features or substitute them for the text itself (see Wang et al. (2017) for a literature review). Weng et al. (2017) incorporate the output of cTAKES into their input feature vectors for the task of predicting the medical subdomain of clinical notes. However, they use the annotations as shallow features in a non-neural setting, combining them with the text representations by concatenating the two into one larger feature vector.
In contrast, we propose to learn dense neural concept embedding representations, and to integrate the concepts in a learnable fashion to guide the representation learning process, rather than simply concatenating them or using them as a text replacement. We additionally focus on a more challenging task setting. Boag and Kané (2017) augment a Word2Vec training objective to predict clinical concepts. This work is orthogonal to ours, as it is an unsupervised "embedding pretraining" approach rather than an end-to-end supervised model.

Automated Clinical Coding The automated clinical coding task is to predict, from the raw text of a hospital discharge summary describing a patient encounter, all of the ICD9 (diagnosis and procedure) codes which a human annotator would assign to the visit. Because these annotators are trained professionals, the assigned ICD codes serve as a natural label set for describing a patient record, and the task can be seen as a proxy for a general patient outcome or treatment prediction task. State-of-the-art methods such as CAML (Mullenbach et al., 2018) treat each label prediction as a separate task, performing many binary classifications over the many-thousand-dimensional label space. The model is described in more detail in the next section.
The label space is very large (tens of thousands of possible codes) and code frequency is long-tailed. Rios and Kavuluru (2018b) find that CAML performs weakly on rare labels.

Problem Setup
Task Notation A given discharge summary is represented as a matrix X ∈ R^{d_e × N} (we use notation for a single instance throughout). The set of diagnosis and procedure codes assigned to the visit is represented as the binary vector y ∈ {0, 1}^L. The task can be framed as L = |L| binary classifications: predict y_l ∈ {0, 1} for code l in label space L.
Data We use the publicly-available MIMIC-III dataset, a collection of deidentified discharge summaries describing patient stays in the Beth Israel Deaconess Medical Center ICU between 2001 and 2012 (Johnson et al., 2016; Pollard and Johnson, 2016). Each discharge summary has been tagged with a set of ICD9 codes. See Figure 3 for an example of a record, and Appendix A for a description of the dataset and preprocessing.

Concept Annotation
We run cTAKES on the discharge summaries (described in Appendix B). Results on the extracted concepts are presented in Table 1. Note the difference in the number of annotations provided by using the SNOMED ontology compared to ICD9. (Preliminary experiments with sparser ontologies such as RXNORM were not promising, leading us to choose these two ontologies based on their annotation richness (SNOMED) and direct relation to the prediction task (ICD9).)

Base model We evaluate against CAML (Mullenbach et al., 2018), a state-of-the-art text-based model for the clinical coding task. The model leverages a convolutional neural network (CNN) with per-label attention to predict the combination of codes to assign to a given discharge summary. Applying convolution over X results in a convolved input representation H ∈ R^{d_c × N} (with d_c < d_e) in which the column-dimensionality N is preserved.
H is then used to predict y, by attentional pooling over the columns. We include implementation details of all methods, including hyperparameters and training, in Appendix A.
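As a concrete sketch of this per-label attentional pooling, the following NumPy snippet computes one attention distribution per label over the N columns of H and one sigmoid output per label. The parameter names U, W, and b are ours for illustration, not taken from the CAML implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def per_label_attention(H, U, W, b):
    """Sketch of CAML-style attentional pooling.

    H: convolved input representation, shape (d_c, N)
    U: per-label attention vectors, shape (L, d_c)
    W: per-label output weights, shape (L, d_c)
    b: per-label biases, shape (L,)
    Returns per-label probabilities, shape (L,).
    """
    alpha = softmax(U @ H)            # (L, N): one distribution over positions per label
    V = alpha @ H.T                   # (L, d_c): label-specific document representations
    logits = (W * V).sum(axis=1) + b  # (L,): one binary prediction per label
    return sigmoid(logits)

rng = np.random.default_rng(0)
d_c, N, L = 8, 20, 5
probs = per_label_attention(rng.normal(size=(d_c, N)),
                            rng.normal(size=(L, d_c)),
                            rng.normal(size=(L, d_c)),
                            np.zeros(L))
```

Each label attends to the positions of the note most relevant to it, which is what makes the architecture suited to a very large label space.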

Approach 1: Augmentation Model
One limitation of learning-based models is their tendency to lose uncommon words to "UNK" tokens, or to suffer from poor representation learning for them. We hypothesize that rare words are important for predicting rare labels, and that text-based models may be improved by augmenting word embeddings with concept embeddings as a means to strengthen representations of rare or unseen words. We additionally hypothesize that linking multiple words to a shared concept via cTAKES annotation will reduce textual noise by grouping word variants to a shared representation in a smaller and more frequently updated parameter space.

Method
Given a discharge summary containing words w_1, w_2, ..., w_N ∈ W* and an embedding function γ : W → R^{d_e}, we additionally assume a code embedding function φ : C → R^{d_e} and a set of annotated codes for a given document c_1, c_2, . . ., c_N ∈ C*, where C is the full codeset for the ontology used to annotate the document, and c_n is the code annotated for word token w_n, if one exists (else c_n = ∅, by abuse of notation). We construct a representation for each document, D, of the same dimensionality as X, by learning one representation leveraging both the concept and word embedding at each position:

D_{:,n} = (1 − β_{w_n,c_n}) · γ(w_n) + β_{w_n,c_n} · φ(c_n)

For token n, β_{w_n,c_n} ∈ [0, 1] is a learned parameter specific to each observed word+concept pair, including UNK tokens. Intuitively, if there is a concept associated with index n, a concept embedding φ(c_n) is generated and a linear combination of the word and concept embedding is learned, using a learned parameter specific to that word+concept pair. We fix β_{w_n,c_n=∅} = 0, which reverts to the word embedding when there is no concept assigned.
We additionally propose a simpler version of this method, full replace, in which word embeddings are completely replaced with concept embeddings whenever they exist (i.e., β_{w_n,c_n} = 1 for all w_n with c_n ≠ ∅). In this formulation, if a concept spans multiple words, all of those words are represented by the same vector. Conversely, the CAML baseline corresponds to a model in which β_{w_n,c_n} = 0 for all w_n, c_n.
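The augmentation scheme reduces to a simple per-position mixture of the two embeddings. A minimal sketch, with illustrative two-dimensional embeddings (the function and values are ours, not from the paper's implementation):

```python
import numpy as np

def augment(word_emb, concept_emb, beta):
    """Combine a word embedding with its concept embedding (sketch).

    beta is the learned mixing weight for this word+concept pair:
    beta = 0 when no concept is assigned (reverts to the word embedding),
    beta = 1 reproduces the "full replace" variant.
    """
    return (1.0 - beta) * word_emb + beta * concept_emb

gamma_w = np.array([1.0, 0.0])   # word embedding gamma(w_n)
phi_c = np.array([0.0, 1.0])     # concept embedding phi(c_n)

no_concept = augment(gamma_w, phi_c, beta=0.0)    # CAML baseline behavior
full_replace = augment(gamma_w, phi_c, beta=1.0)  # full-replace variant
mixed = augment(gamma_w, phi_c, beta=0.25)        # learned linear combination
```

In the model, one such beta is learned per observed word+concept pair rather than fixed as here.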

Evaluation Setup
Metrics In addition to the metrics reported in prior work, we report average precision score (AP), which is preferred to AUC for imbalanced classes (Saito and Rehmsmeier, 2015; Davis and Goadrich, 2006). We report both macro- and micro-metrics, with the former being more favorable toward rare labels by weighting all classes equally. We additionally focus on the precision-at-k (P@k) metric, representing the fraction of the k highest-scored predicted labels that are present in the ground truth. Both macro-metrics and P@k are useful in a computer-assisted coding use-case, where the desired outcome is to correctly identify needle-in-the-haystack labels as opposed to more frequent ones, and to accurately suggest a small subset of codes with the highest confidence as annotation suggestions (Mullenbach et al., 2018).
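As a small illustration of P@k, assuming a hypothetical label-to-score mapping as the input format (the codes below are made up for the example):

```python
def precision_at_k(scores, gold, k):
    """Fraction of the k highest-scored labels present in the gold set.

    scores: dict mapping label -> predicted probability (illustrative format)
    gold: set of true labels
    """
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(1 for label in top_k if label in gold) / k

scores = {"428.0": 0.9, "401.9": 0.8, "250.00": 0.3, "584.9": 0.1}
gold = {"428.0", "584.9"}
p_at_2 = precision_at_k(scores, gold, k=2)  # top-2 are 428.0 and 401.9; one is gold
```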
Baselines Along with CAML, we evaluate a raw codes baseline in which the ICD9 annotations generated by cTAKES, c_1, c_2, . . ., c_N, are used directly as the document-level predictions. Formally, for each code c ∈ L, we predict ŷ_c = 1 if c_n = c for some n ∈ {1, . . ., N}.
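A minimal sketch of this baseline, with hypothetical per-token annotations (None marks tokens cTAKES did not annotate):

```python
def raw_codes_baseline(token_annotations, label_space):
    """Document-level predictions taken directly from token-level annotations.

    token_annotations: per-token ICD9 codes, None where no annotation exists
    label_space: the set of codes L under evaluation
    Predicts y_c = 1 for every annotated code that falls in the label space.
    """
    return {c for c in token_annotations if c is not None and c in label_space}

annotations = [None, "428.0", None, "V10.11", "428.0", None]
label_space = {"428.0", "250.00"}
predicted = raw_codes_baseline(annotations, label_space)
```

Because this baseline produces a set rather than ranked probabilities, ranking-based metrics cannot be computed for it directly.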

Results
We present results on the test set in Table 2. Overall, the concept-augmented models are indistinguishable from the baseline, and there is no significant difference between annotation type or recombination method, although the linear combination method with ICD9 annotations is the best performing and rivals the baseline.
Following the negative results for our initial attempt to augment word embeddings with concept embeddings, we tried two alternative strategies:

• We concatenated the ICD9 annotations with two other ontologies: RXNORM and SNOMED. While this led to greater coverage over the text (with slightly more than one third of the tokens receiving corresponding concept annotations), it did not improve downstream performance.
• Prior work has demonstrated that leveraging clinical ontological structure can allow models to learn more effective code embeddings in fully structured data models (Singh et al., 2014; Choi et al., 2017). We applied the methodology of Choi et al. (2017) to both the ICD9 and SNOMED annotations, but this did not improve performance. For more details, see Appendix D.

Error Analysis
Error analysis of the word-to-concept mapping produced by cTAKES exposes limitations of our initial hypothesis that cTAKES mitigates word-level variation by assigning multiple distinct word phrases to shared concepts. Figure 4 demonstrates that the vast majority of the ICD9 concepts in the corpus are assigned to only one distinct word phrase, and the same is observed for SNOMED concepts. This may explain the virtually indistinguishable performance of the augmentation models from the baseline, because randomly-initialized word and concept embeddings which are observed in strictly identical contexts should theoretically converge to the same representation. (Simulations of the augmentation method under a contrived setting, with more concept annotations per note as well as more unique word phrases mapping to a single concept, demonstrate solid performance increases over the baseline. This provides supporting evidence that the findings presented in this section, rather than our proposed architecture, may be the cause of the negative result.)

The raw codes baseline performs poorly, which aligns with the observation that cTAKES codes assigned to a discharge summary often do not have appropriate or proportional levels of specificity (for example, the top-level ICD9 code '428 Heart Failure' may be assigned by cTAKES where the gold-label code is '428.21 Acute Systolic Heart Failure'). This may also contribute to the negative result of the proposed model. (Because there are no sorted probabilities associated with this baseline, its P@k metrics were computed by randomly selecting k elements from those predicted; for the same reason we cannot report AUC or AP metrics.)

Figure 6 (included in the Appendix) illustrates prediction performance as a function of code frequency in the training set, showing that the proposed model does not improve upon the baseline for rare or semi-rare codes (we group codes as follows: rare codes have 50 or fewer occurrences in the training data, semi-rare between 50 and 1000, and common more than 1000).


Ablations

We separate and analyze the two distinct components of cTAKES' annotation ability: 1) how well cTAKES recognizes the location of concepts in the text (NER), and 2) how accurately cTAKES maps the recognized positions to the correct clinical concepts (ontology mapping). Annotation sparsity (NER) and/or cTAKES mapping error may render the raw text on its own equally useful, as observed in Table 2. We investigate these hypotheses here, evaluating the performance of each ablation relative to the augmentation model and baseline to determine whether each component individually adds value. The ablations are:

Dummy Concepts

We replace all word embeddings annotated by cTAKES with 0-vectors, and only use the remaining embeddings for prediction. If this alternative shows similar performance to the baseline, then we conclude that the positions in the text annotated by cTAKES (NER) are not valuable for prediction performance.

Concepts Only
We test the complement by replacing all word embeddings not annotated by cTAKES with a 0-vector. In contrast to Dummy Concepts, strong performance of this approach relative to the baseline will allow us to conclude that the positions in the text annotated by cTAKES are valuable for prediction performance.

Concepts Only, Concept Embeddings
We replace all word embeddings not annotated by cTAKES with a 0-vector, and then replace all remaining word embeddings with their concept embedding. If this model performs better than Concepts Only, it will demonstrate the strength of cTAKES' ontology mapping component.
Note that Dummy Concepts and Concepts Only are the decomposition of the baseline CAML. Similarly, Dummy Concepts and Concepts Only, Concept Embeddings are the decomposition of the full-replace augmentation model presented in Section 4.
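The two ablations amount to complementary masks over the embedding matrix. A sketch (function name and shapes are illustrative); note that the two masked matrices sum back to the original, mirroring the decomposition of the baseline noted above:

```python
import numpy as np

def ablate(word_embs, concept_mask, mode):
    """Zero out embeddings by cTAKES annotation status (sketch).

    word_embs: (N, d_e) matrix of word embeddings
    concept_mask: boolean array, True where cTAKES annotated the token
    mode: "dummy_concepts" zeroes the annotated positions,
          "concepts_only" zeroes the unannotated positions.
    """
    out = word_embs.copy()
    if mode == "dummy_concepts":
        out[concept_mask] = 0.0
    elif mode == "concepts_only":
        out[~concept_mask] = 0.0
    return out

embs = np.ones((4, 3))
mask = np.array([True, False, True, False])
dummy = ablate(embs, mask, "dummy_concepts")          # zeroes rows 0 and 2
concepts_only = ablate(embs, mask, "concepts_only")   # zeroes rows 1 and 3
```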
Results Results are presented in Tables 3 and 4, and are consistent with previous experiments in that augmentation with concept annotations does not improve performance. For both ontologies, neither the Dummy Concepts nor the Concepts Only models outperform the full-text models (in which both token representations are used). However, there are some interesting findings. Using SNOMED annotations, performance of the Concepts Only model is significantly higher than Dummy Concepts and very close to full-text model performance. This finding is strengthened by considering the concept coverage discussed in Table 1: the Concepts Only model achieves comparable performance while receiving only about 35% (1% in the ICD9 setting) of the input tokens which the full-text baseline receives, whereas the Dummy Concepts model receives about 65% (99% in the ICD9 setting). Thus, a significant proportion of downstream prediction performance can be attributed to the small portion of the text which is recognized by cTAKES in both the SNOMED and ICD9 settings, indicating the strength of cTAKES' NER component.

Approach 2: Multi-task Learning
We present an alternative application of cTAKES as a form of distant supervision. Our approach is inspired by recent successes in multi-task learning for NLP which demonstrate that cheaply-obtained labels framed as an auxiliary task can improve performance on downstream tasks (Swayamdipta et al., 2018; Ruder, 2017; Zhang and Weiss, 2016). We propose to predict clinical information extraction system annotations as an auxiliary task, and share lower-level representations with the clinical coding task through a jointly-trained model architecture. We hypothesize that domain knowledge embedded in cTAKES will guide the shared layers of the model architecture towards a more optimal representation for the clinical coding task.
We formulate the auxiliary task as follows: given each word embedding or word-embedding span in the input to which cTAKES has assigned a code, can the model predict the code assigned to it by cTAKES?

Method
We denote the set of non-null codes output by cTAKES for document i in the training data as {(a_{i,1}, c_{i,1}), (a_{i,2}, c_{i,2}), . . ., (a_{i,M}, c_{i,M})}, where each anchor a_{i,m} indicates the span of tokens in the text for which concept c_{i,m} is annotated. The loss term of the model is augmented to include the multi-class cross-entropy of predicting the correct code for all annotated spans in the training batch:

Loss = Σ_{i=1}^{I} [ BCE(y_i, ŷ_i) − λ Σ_{m=1}^{M} log p(c_{i,m} | a_{i,m}) ]

where BCE(y_i, ŷ_i) is the standard (binary cross-entropy) loss from the baseline for the clinical coding task, p(c_{i,m} | a_{i,m}) is the probability assigned by the auxiliary model to the true cTAKES-annotated concept given word span a_{i,m} as input, λ is the hyperparameter to trade off between the two objectives, and I is the number of instances in the batch.
Because we use the auxiliary task as a "scaffold" (Swayamdipta et al., 2018) for transferring domain knowledge encoded in cTAKES' rules into the learned representations for the clinical coding task, we need only run cTAKES and compute a forward pass through the auxiliary module at training time. At test time, we evaluate only on the clinical coding task, so the time complexity of model inference remains the same as the baseline, an advantage of this architecture. We model p(c_{i,m} | a_{i,m}) via a multi-layer perceptron with a Softmax output layer to obtain a distribution over the codeset, C. We additionally experiment with a linear-layer variant to combat overfitting on the auxiliary task by reducing the capacity of this module. The input to this module is a single vector, z_{i,m} ∈ R^{d_e}, constructed by selecting the maximum value over the s word embeddings in the span for each dimension, where s is the length of the input span. To facilitate information transfer between the clinical coding and auxiliary tasks, we experiment with tying both the randomly-initialized embedding layer, X, and a higher-level layer of the network (e.g. the outputs of the convolution layer H described in Section 3). See Figure 5 for the model architecture.
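A minimal sketch of the auxiliary input construction (dimension-wise max over a span) and the joint objective, with illustrative numbers; the function names and the example log-probabilities are ours:

```python
import numpy as np

def span_representation(word_embs, span):
    """z_{i,m}: dimension-wise max over the word embeddings in a span."""
    start, end = span
    return word_embs[start:end].max(axis=0)

def joint_loss(bce_loss, span_log_probs, lam):
    """Main-task BCE plus lambda-weighted auxiliary cross-entropy (sketch).

    span_log_probs: log p(c_{i,m} | a_{i,m}) for each annotated span.
    """
    aux = -sum(span_log_probs)  # multi-class cross-entropy over annotated spans
    return bce_loss + lam * aux

embs = np.array([[0.1, 0.9],
                 [0.8, 0.2],
                 [0.5, 0.5]])
z = span_representation(embs, (0, 2))  # max over rows 0-1

# Hypothetical auxiliary predictions for two annotated spans.
loss = joint_loss(bce_loss=0.7,
                  span_log_probs=[np.log(0.5), np.log(0.25)],
                  lam=1.0)
```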

Experiment and Results
Results are presented in Table 5 and Table 6 for ICD9 annotations. Overall, the cTAKES span-prediction task does more to hurt than help performance on the main task. (We found similar results using SNOMED annotations.) Tying the model weights at a higher layer (post-convolution as opposed to pre-convolution) results in worse performance, even though the model fits the auxiliary task well. This indicates either that the model may not have enough capacity to adequately fit both tasks, or that the cTAKES prediction task as formulated may actually misguide the clinical coding task slightly in parameter search space. We additionally remark that increasing the weight of the auxiliary task generally lowers performance on the clinical coding task, and tuning λ on the dev set does not result in more optimal performance (we include results with λ = 1 here; see Table 9 in the Appendix). Notably, for even very small values of λ, we achieve very high validation accuracy on the auxiliary task. This performance does not change with larger weightings, indicating that the auxiliary task may not be difficult enough to result in effective knowledge transfer. (While the models in Section 4 did not introduce new hyperparameters to the baseline architecture, hyperparameters for this architecture were selected by human intuition. Room for future work includes more extensive tuning; see Table 8 in Appendix A.)

Conclusion
Integrating existing clinical information extraction tools with deep learning models is an important direction for bridging the gap between rule-based and learning-based methods. We have provided an analysis of the quality of the widely-used clinical concept annotator cTAKES when integrated into a state-of-the-art text-based prediction model. In two settings, we have shown that cTAKES does not improve performance over raw text alone on the clinical coding task. We additionally demonstrate through error analysis and ablation studies that the amount of word variation captured and the differentiation between the named-entity recognition and ontology-mapping tasks may affect cTAKES' effectiveness.
While automated coding is one application area, the models presented here could easily be extended to other downstream prediction tasks such as patient diagnosis and treatment outcome prediction. Future work will include evaluating newly-developed clinical NER tools with similar functionalities to cTAKES in our framework, which can potentially serve as a means to evaluate the effectiveness of newer systems vis-à-vis cTAKES.

A Experimental Details
Data Following Mullenbach et al. (2018), we use the same train/test/validation splits for the MIMIC-III dataset, and concatenate all supplemental text for a patient discharge summary into one record. We use the authors' provided data processing pipeline to preprocess the corpus. The vocabulary includes all words occurring in at least 3 training documents. See Table 7 for descriptive statistics of the dataset.
We construct a concept vocabulary for embedding initialization following the same specification as the word vocabulary: any concept which does not occur in at least 3 training documents is replaced with an UNK token. Details on the size of the vocabulary can be found in Table 8.

Training We train with the same specifications as Mullenbach et al. (2018) unless otherwise specified, with dropout performed after concept augmentation for the models in Section 4, and early stopping with a patience of 10 epochs on the precision-at-8 metric, for a maximum of 200 epochs (note that in the multi-task learning models the stopping criterion is only a function of performance on the clinical coding task). Unlike previous work, we reduce the batch size to 12 in order to allow each batch to fit on a single GPU, and we do not use pretrained embeddings, as we find this improves performance. All models are trained on a single NVIDIA Titan X GPU with 12,189 MiB of RAM. We port the optimal hyperparameters reported in Mullenbach et al. (2018) to our experiments. With more extensive hyperparameter tuning, we may expect to see a potential increase in the performance of our models over the baseline. See Table 8 for hyperparameters and other details specific to our proposed model architectures.

B Concept Extraction
We build a custom dictionary from the UMLS Metathesaurus that includes mappings from UMLS CUIs to SNOMED-CT and ICD9-CM concepts.
We run the cTAKES annotator in advance of training for all 3 dataset splits using the resulting dictionary, allowing us to obtain annotations for each note in the dataset, along with the positions of the annotations in the raw text. Note that for the multi-task learning experiments (Section 5), we only require annotations for the training data. Annotating the MIMIC-III datafiles using these specifications takes between 4 and 5 hours for 3,000 discharge summaries on a single CPU, and can be parallelized for efficiency.

C Attention for Overlapping Concepts
We implement an attention mechanism (Bahdanau et al., 2014) to compute a single concept embedding φ(C_n) ∈ R^{d_e} when C_n = {c_1, c_2, . . ., c_J} represents a set of concepts annotated at position n instead of a single concept. Intuitively, we want to more heavily weight those concepts in the set which have the most similarity to the surrounding text. We define the context vector for position n as the concatenated word embeddings surrounding position n:

v_n = [γ(w_{n−2}); γ(w_{n−1}); γ(w_{n+1}); γ(w_{n+2})] ∈ R^{4d_e}

using a context window of n ± 2, where 2 is a hyperparameter; we choose a small value for computational efficiency.

We concatenate the word-context vector and each concept embedding c_j in C_n as [v_n, φ(c_j)] ∈ R^{5d_e}, and pass it through a multi-layer perceptron to compute a similarity score, f : R^{5d_e} → R. An attention score for each c_j is computed as:

Table 9: The effect of tuning λ on dev set performance on the ICD9 coding task, for the pre-convolution model with a linear auxiliary layer and ICD9 annotations. We select λ = 1 for reporting test results; there is no clear value which produces strictly better performance.

Figure 2 :
Figure 2: An example of cTAKES annotation output with part-of-speech tags and UMLS CUIs for named entities.

Figure 3 :
Figure 3: An example clinical discharge summary and associated ICD codes.

Figure 4 :
Figure 4: A histogram showing the distribution of ICD9 concepts in C grouped according to the number of unique word phrases in the MIMIC-III corpus associated with each.We observe the same trend when plotting SNOMED annotations.

Figure 5 :
Figure 5: The proposed architecture (for prediction on a single document, i, and auxiliary supervision on a single annotation, m).The bottom box illustrates the preconvolution model, and the top box post-convolution.The architecture on the left is the baseline.
α_j = exp(f(v_n, φ(c_j))) / Σ_{k=1}^{J} exp(f(v_n, φ(c_k)))

This represents the relevance of the concept to the surrounding word-context, normalized over the other concepts in the set. A final concept embedding φ(C_n) ∈ R^{d_e} is computed as a linear combination of the concept vectors, weighted by their attention scores:

φ(C_n) = Σ_{j=1}^{J} α_j · φ(c_j)


D Leveraging Ontological Graph Structure

Following the methodology of Choi et al. (2017), we experiment with learning higher-quality concept representations using the hierarchical structure of the ICD9 ontology. We replace concept embedding φ(c_n) with a learned linear combination of itself and its parent concepts' embeddings (see Figure 1). For child concepts which are observed infrequently or have poor representations, prior work has shown that a trained model will learn to weight the parent embeddings more heavily in the linear combination. Because the parent concepts represent more general concepts, they have most often been observed more frequently in the training set and have stronger representations. This also allows for learned representations which capture relationships between concepts. We refer the reader to Choi et al. (2017) for details.
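A sketch of the overlapping-concept attention from Appendix C, with a dot product standing in for the MLP scorer f (an assumption for brevity; the context and concept vectors here are illustrative):

```python
import numpy as np

def concept_attention(v_n, concept_embs, f):
    """Attention over overlapping concepts at one position (sketch).

    v_n: context vector for position n
    concept_embs: list of candidate concept embeddings phi(c_j)
    f: similarity scorer taking (context, concept) -> scalar
    """
    scores = np.array([f(v_n, c) for c in concept_embs])
    # Softmax-normalize the similarity scores into attention weights.
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()
    # Final embedding: attention-weighted combination of the candidates.
    return sum(a * c for a, c in zip(alpha, concept_embs))

v = np.array([1.0, 0.0])
candidates = [np.array([1.0, 0.0]),   # aligned with the context
              np.array([0.0, 1.0])]   # orthogonal to the context
phi_C = concept_attention(v, candidates, f=np.dot)
```

The candidate more similar to the context receives the larger attention weight, so it dominates the pooled concept embedding.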

Figure 6 :
Figure 6: F1 on test data as a function of code frequency in the training data ('meca' indicates the linear combination ICD9 augmentation model).

Table 1 :
Descriptive Statistics on concept extraction for the MIMIC-III corpus.

Table 2 :
Test set results using the augmentation methods.

Table 3 :
Test set results of ablation experiments on the MIMIC-III dataset, using ICD9 concept annotations.

Table 4 :
Test set results of ablation experiments on the MIMIC-III dataset, using SNOMED concept annotations.

Table 5 :
Test set performance on the ICD9 coding task for λ = 1 and using ICD9 annotations.

Table 6 :
Dev set performance on the auxiliary task for λ = 1 and using ICD9 annotations. Relatively high task performance is achieved even after one epoch with a simple model.

Table 8 :
Model details. All neural models are implemented using PyTorch, and built on the open-source implementation of CAML.