Automatic Diagnosis Coding of Radiology Reports: A Comparison of Deep Learning and Conventional Classification Methods

Diagnosis autocoding services and research aim to improve both the productivity of clinical coders and the accuracy of coding. Coding is an important step in data analysis for funding and reimbursement, as well as for health services planning and resource allocation. We investigate the applicability of deep learning to the autocoding of radiology reports using the International Classification of Diseases (ICD). Deep learning methods are known to require large amounts of training data. Our goal is to explore how to use these methods when the training data is sparse, skewed and relatively small, and how their effectiveness compares to that of conventional methods. We identify optimal parameters that could be used in setting up a convolutional neural network for autocoding, with results comparable to those of conventional methods.


Introduction
Hospitals and other medical clinics invest in clinical coders to abstract relevant information from patients' medical records and decide which diagnoses and procedures meet the criteria for coding, as per coding standards such as the International Statistical Classification of Diseases (ICD). For example, 'Multiple fractures of foot' is represented by the ICD-10 code 'S92.7'. These codes are used to compile statistics on diseases and treatments as well as for billing purposes. Clinical coding is a specialized skill requiring excellent knowledge of medical terminology, disease processes, and coding rules, as well as attention to detail and analytical skills. Apart from the high cost of labor, human errors can lead to overcoding and undercoding, resulting in misleading statistics.
To reduce the costs and increase the accuracy of coding, autocoding has been studied by the Natural Language Processing (NLP) community. It has been studied for a variety of clinical texts and tasks, such as radiology reports (Crammer et al., 2007; Perotte et al., 2014; Scheurwegs et al., 2016), surveillance of diseases or type of cancer from death certificates (Koopman et al., 2015a,b), and coding of cancer sites and morphologies (Nguyen et al., 2015).
Text classification using deep learning is relatively recent and promises to reduce the load of domain- or application-specific feature engineering. Conventional classifiers such as SVMs with well-engineered features have long shown high performance in different domains. We investigate whether deep learning methods can further improve clinical text classification. Specifically, we investigate how and in what setting one of the most popular neural architectures, Convolutional Neural Networks (CNNs), can be applied to the autocoding of radiology reports. The outcomes of our work can inform decisions on the type and settings of text classifiers for similar tasks.

Related Work
Pestian et al. (2007) organised a shared task which introduced a dataset of radiology reports to be autocoded with ICD-9 codes. This multi-label classification task attracted a large body of research over the years (e.g., Farkas and Szarvas, 2008; Suominen et al., 2008), which tackled the problem with methods such as rule-based systems, decision trees, and entropy and SVM classifiers. Text classification using SVMs has long been considered state-of-the-art. Recently, neural network based learning methods have been investigated in generic NLP as well as in domain-specific applications. For text classification, the two dominant methods are: (1) Convolutional Neural Networks (CNNs), from the category of feed-forward neural networks; and (2) Long Short-Term Memory (LSTM) networks, which have a recurrent neural network (RNN) architecture. In addition, the use of word embeddings (Le and Mikolov, 2014), which aim to capture semantic representations of words in text, has been investigated in a variety of applications as a replacement for one-hot (vector space) models, the traditional method of text representation.
Text classification using CNNs has been increasingly studied in recent years (Kalchbrenner et al., 2014; Kim, 2014). For example, CNNs have been applied to classify biomedical articles for indexing, and by Kavuluru et al. (2016) to posts on suicide watch forums.

Method
We build a CNN with the architecture proposed by Kim (2014). It consists of one convolutional layer using multiple filters and filter sizes, followed by a max-pooling layer and a fully-connected layer that assigns a label.
This model is chosen based on its success in other tasks. It sets a baseline for what is achievable with this family of algorithms without using a very deep network or a more complicated architecture.
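The data flow through such a network can be illustrated with a minimal, dependency-free sketch. It is not the implementation used in the experiments: the toy dimensions, the random weights, the omission of convolution biases, and the function names conv_max_pool and forward are all assumptions made for illustration.

```python
import random

def conv_max_pool(doc, filt):
    """Slide one filter over the document and keep its largest activation."""
    h = len(filt)  # filter size: number of consecutive words covered
    best = 0.0     # starting at 0.0 is equivalent to applying ReLU before pooling
    for start in range(len(doc) - h + 1):
        window = doc[start:start + h]
        act = sum(w * x for f_row, x_row in zip(filt, window)
                  for w, x in zip(f_row, x_row))
        best = max(best, act)
    return best

def forward(doc, filters, weights, biases):
    """One convolutional layer, max-over-time pooling, one dense layer."""
    pooled = [conv_max_pool(doc, f) for f in filters]
    return [sum(w * p for w, p in zip(row, pooled)) + b
            for row, b in zip(weights, biases)]

rng = random.Random(0)
dim, n_words, n_labels = 4, 7, 3  # toy sizes, not the settings of Table 1
# A document is a list of word vectors (one row per word).
doc = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_words)]
# Two filters for each of the filter sizes 2, 3 and 4 (illustrative values).
filters = [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(h)]
           for h in (2, 3, 4) for _ in range(2)]
weights = [[rng.uniform(-1, 1) for _ in filters] for _ in range(n_labels)]
biases = [0.0] * n_labels
scores = forward(doc, filters, weights, biases)  # one raw score per label
```

Each filter contributes a single pooled feature regardless of document length, which is why documents of different lengths can share the same fully-connected layer. Turning the raw scores into label decisions (for example with a sigmoid and a threshold) is left out of the sketch.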
Input text to the network is represented using one of two settings: (1) a matrix of random vectors representing all the words in a document; or (2) word embeddings. We refer to word embeddings created from a corpus of medical text, such as Medline citations, as in-domain, and to all others (e.g., those created from Wikipedia) as out-of-domain. We also experimented with static and dynamic embeddings. In the static setting, the embedding vectors were fixed to the values learned from the collection they were created on, whereas dynamic embeddings were updated during training.
One goal of this work is to quantify the impact of CNN hyperparameters. Tuning hyperparameters can be considered the equivalent of feature engineering in conventional machine learning tasks. We list some of the main hyperparameters to be set in a CNN in Table 1 (first two columns). Our experiments focus on tuning these and on investigating how their optimal values differ across datasets.
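The two input settings differ only in how the embedding table is initialized, which the following sketch tries to make concrete. It is an illustration under assumptions: the function name init_embeddings, the uniform range for random vectors, and the example vectors are made up, and the treatment of words missing from the pre-trained vocabulary (random initialization) is not specified in the text.

```python
import random

def init_embeddings(vocab, pretrained, dim, seed=0):
    """Return one vector per vocabulary word, pre-trained where available."""
    rng = random.Random(seed)
    table = {}
    for word in vocab:
        if word in pretrained:
            # Copy, so that updating the table never alters the source vectors.
            table[word] = list(pretrained[word])
        else:
            # Assumed fallback: small random values for unseen words.
            table[word] = [rng.uniform(-0.25, 0.25) for _ in range(dim)]
    return table

# Made-up three-dimensional vectors standing in for Word2Vec output.
pretrained = {"fracture": [0.1, 0.2, 0.3], "cough": [0.4, 0.5, 0.6]}
vocab = ["fracture", "cough", "reflux"]
emb = init_embeddings(vocab, pretrained, dim=3)
# Passing an empty dict reproduces the all-random setting.
random_emb = init_embeddings(vocab, {}, dim=3)
```

In a static setting the resulting table would be kept frozen during training, while in a dynamic setting it would be updated together with the other network weights; the sketch covers initialization only.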

Datasets
We experiment on two different datasets, one in-domain and one out-of-domain, in order to find common characteristics and domain-specific properties of these datasets for text classification. These datasets are: (a) ICD9, a dataset of radiology reports, and (b) IMDB, a sentiment analysis dataset. Both corpora are publicly available and are described below.
The ICD9 dataset is an open challenge dataset published by the Computational Medicine Center in 2007 (Pestian et al., 2007). It consists of clinical free text in the form of 978 anonymized radiology reports and their corresponding ICD-9-CM codes. There are 38 unique ICD-9 codes present in the dataset. Given the imbalance of disease categories in the dataset, with some categories having only one or two instances, we created a revised subset, rICD9, from which codes with fewer than 15 instances are removed. This subset contains 894 documents with 16 unique codes.
To measure how robust our grid search for hyperparameters is and how much its results depend on task and dataset, we use an out-of-domain dataset. The IMDB movie review dataset is a sentiment analysis dataset provided by Maas et al. (2011). It contains 100,000 movie reviews from IMDB.
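The construction of rICD9 can be sketched as a simple frequency filter. The snippet below is a hedged illustration, not the original preprocessing script: the function name drop_rare_codes and the placeholder codes are hypothetical, and dropping reports that are left without any code is an assumption, since the text does not say how such reports were handled.

```python
from collections import Counter

def drop_rare_codes(reports, min_count=15):
    """reports: list of (text, set_of_codes) pairs. Returns the filtered list."""
    counts = Counter(code for _, codes in reports for code in codes)
    frequent = {code for code, n in counts.items() if n >= min_count}
    kept = []
    for text, codes in reports:
        remaining = codes & frequent
        if remaining:  # assumption: a report left with no code is dropped
            kept.append((text, remaining))
    return kept

# Toy data with placeholder codes and a threshold of 2 instead of 15.
toy = [("report a", {"code_x"}),
       ("report b", {"code_x", "code_y"}),
       ("report c", {"code_z"})]
filtered = drop_rare_codes(toy, min_count=2)
```

On the toy data only code_x occurs twice, so report c disappears and report b keeps a single label.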

Experimental Setup
We treat this task as a multi-label classification problem. Our implementations use TensorFlow and scikit-learn. For word embeddings we use Word2Vec (Mikolov et al., 2013). For SVM and other conventional methods, we use normalized tf-idf features similar to Wang and Manning (2012).
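As a rough illustration of what normalized tf-idf features look like, the following standard-library sketch weights each term by its count and a smoothed inverse document frequency, then scales every document vector to unit length. The exact weighting variant used in the experiments is not stated, so the smoothed idf formula, the function name tfidf_vectors, and the three toy documents are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Assumed smoothed idf variant: log((1 + N) / (1 + df)) + 1
        raw = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
        # L2 normalization, so that document length does not dominate.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()} if norm else {})
    return vectors

docs = [["no", "acute", "fracture"],
        ["acute", "pneumonia"],
        ["no", "pneumonia"]]
vecs = tfidf_vectors(docs)
```

In the first toy document, "fracture" occurs in only one document and therefore receives a larger weight than "no" or "acute", which each occur in two.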
Evaluation. For evaluations on ICD9 and its variant rICD9, we use stratified 10-fold cross-validation (testing data for this dataset is no longer available). We measure classification accuracy.
Hyperparameter Tuning. The effect of varying different hyperparameters on classification accuracy is examined by a grid search that incrementally changes the values of the hyperparameters. We start from the default setting shown in Table 1 as a baseline. We then change one parameter at a time, over the wide range given in column three, and analyze the results to find the optimal hyperparameter values. All experiments are then repeated with the optimal parameter values to measure their effect.
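The one-parameter-at-a-time procedure described above can be sketched as follows. This is a minimal illustration under assumptions: the hyperparameter names, their ranges, and toy_accuracy (a stand-in for cross-validated accuracy) are hypothetical and are not taken from Table 1.

```python
def one_at_a_time_search(defaults, ranges, evaluate):
    """Vary each hyperparameter alone, keeping the others at their defaults."""
    best = dict(defaults)
    for name, candidates in ranges.items():
        scored = []
        for value in candidates:
            setting = dict(defaults, **{name: value})  # change only this one
            scored.append((evaluate(setting), value))
        # Keep the value that gave the highest score for this hyperparameter.
        best[name] = max(scored, key=lambda pair: pair[0])[1]
    return best

def toy_accuracy(setting):
    """Hypothetical score that peaks at dropout=0.5 and filters=100."""
    return (1.0 - abs(setting["dropout"] - 0.5)
            - abs(setting["filters"] - 100) / 1000)

defaults = {"dropout": 0.2, "filters": 50}
ranges = {"dropout": [0.0, 0.2, 0.5, 0.8], "filters": [50, 100, 200]}
best = one_at_a_time_search(defaults, ranges, toy_accuracy)
```

Because each hyperparameter is varied in isolation, the number of runs grows with the sum of the range sizes rather than their product, at the price of ignoring interactions between hyperparameters. Re-running the model with the combined best values, as described above, checks whether those individually chosen values still work together.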

CNN versus Conventional Classifiers
Classification accuracy was calculated for varying values of the different hyperparameters. Based on the best results, we chose the optimal value for each hyperparameter, as listed in columns 5 to 7 of Table 1. Table 2 compares three conventional classifiers (SVM, Random Forests and logistic regression) to CNNs. The results for the CNN with default values, as well as with accuracy-optimized values, on the ICD9 dataset are comparable to those of all three conventional classifiers. This means that the two sets of algorithms can achieve similar baselines with minimal feature engineering or parameter tuning.

Effect of Pre-trained Word Vectors
Pre-trained word vectors can be considered prior knowledge about the meaning of words in a dataset. That is, instead of random values, the embedding layer can be initialized to values obtained from word embeddings. We investigated whether using word embeddings would improve classification accuracy in our coding task. To this end, we created different word vectors trained on both Wikipedia and Medline with various vector sizes. We then compared the accuracy of random embeddings with that of these pre-trained embeddings. Our results are shown in Table 3.

Conclusion
We explored the potential of neural network based machine learning methods to compete with conventional classification methods, using ICD-9 coding of radiology reports as the task. Our experiments showed that some CNN hyperparameters, such as depth, are specific to a dataset or task and should be tuned, whereas others (e.g., learning rate or vector size) can be set in advance without sacrificing the results. Our results also showed the value of using dynamic word embeddings. Our best classification results were comparable or superior to those of SVM and logistic regression classifiers for autocoding of radiology reports. Our work is continuing in two major directions: (1) quantifying the relationships between hyperparameters using linear-regression analysis; and (2) applying CNN and LSTM models to ICD-10 autocoding of patient encounters in hospital settings.