Dilated Convolutional Attention Network for Medical Code Assignment from Clinical Text

Medical code assignment, which predicts medical codes from clinical texts, is a fundamental task of intelligent medical information systems. The emergence of deep models in natural language processing has boosted the development of automatic assignment methods. However, recent advanced neural architectures with flat convolutions or multi-channel feature concatenation ignore the sequential causal constraint within a text sequence and may not learn meaningful clinical text representations, especially for lengthy clinical notes with long-term sequential dependency. This paper proposes a Dilated Convolutional Attention Network (DCAN), integrating dilated convolutions, residual connections, and label attention, for medical code assignment. It adopts dilated convolutions to capture complex medical patterns with a receptive field which increases exponentially with dilation size. Experiments on a real-world clinical dataset empirically show that our model improves the state of the art.


Introduction
Medical code assignment categorizes clinical documents with sets of codes to facilitate hospital management and improve health record searching (Hsia et al., 1988; Farkas and Szarvas, 2008). These clinical texts comprise physiological signals, laboratory tests, and physician notes, where the International Classification of Diseases (ICD) coding system is widely used for annotation. Most hospitals rely on manual coding by human coders to assign standard diagnosis codes to discharge summaries for billing purposes. However, this manual process is time-consuming and error-prone (Hsia et al., 1988; Farzandipour et al., 2010). Incorrect coding can cause billing mistakes and mislead other general practitioners when patients are readmitted. Intelligent automated coding systems could act as recommendation systems that help coders allocate correct medical codes to clinical notes.
Automatic medical code assignment has been intensively researched during the past decades (Crammer et al., 2007; Stanfill et al., 2010). Recent advances in natural language processing (NLP) with deep learning techniques have inspired many methods for automatic medical code assignment (Shi et al., 2017; Mullenbach et al., 2018; Li and Yu, 2020). Zhang et al. (2019) incorporated structured knowledge into medical text representations by preserving the translational property of concept embeddings. However, several challenges remain in medical text understanding. Diagnosis notes contain complex diagnosis information, including a large professional medical vocabulary and noisy content such as non-standard synonyms and misspellings. Free-text clinical notes are lengthy documents, usually running from hundreds to thousands of tokens. Thus, medical text understanding requires effective feature representation learning and a complex cognitive process to enable multiple diagnosis code assignment.
Previous neural methods for medical text encoding generally fall into two categories. The first handles medical text modeling with recurrent neural networks (RNNs) that capture sequential dependency; such works include AttentiveLSTM (Shi et al., 2017), Bi-GRU (Mullenbach et al., 2018), and HA-GRU (Baumel et al., 2018). The other category uses convolutional neural networks (CNNs), such as CAML (Mullenbach et al., 2018) and MultiResCNN (Li and Yu, 2020). These methods capture only local features, yet they have achieved the best predictive performance on medical code assignment.
Inspired by the generic temporal convolutional network (TCN) architecture (Bai et al., 2018), we consider medical text modeling under a causal constraint, where the encoding of the current token depends only on previous tokens, using a dilated convolutional network. We combine it with a label attention network for fine-grained information aggregation.

Distinction of Our Model
The MultiResCNN is currently the state-of-the-art model. It applies a multi-channel CNN with different filters to learn features and concatenates these features to produce a final prediction. In contrast, our model extends the TCN for sequence modeling, using a single filter and the dilation operation to control the receptive field. In addition, instead of the weight tying used in the TCN, we customize it with label attention pooling to extract relevant rich features.
Our Contributions We contribute to the literature in three ways. (1) We consider medical text modeling from the perspective of imposing a sequential causal constraint on medical code assignment using dilated convolutions, which effectively capture long-range sequential dependencies and learn contextual representations of lengthy clinical notes.
(2) We propose a Dilated Convolutional Attention Network (DCAN), coupling residual dilated convolutions and a label attention network for more effective and efficient medical text modeling. (3) Experiments on real-world medical data show improvement over the state of the art. Compared with multi-channel CNN and RNN models, our model also incurs a smaller computational cost.

Proposed Model
This section describes the proposed model, the Dilated Convolutional Attention Network (DCAN). It includes three main components: dilated convolutions for learning features from word embeddings of clinical notes, residual connections for stacking a deep neural architecture, and a label attention module for prioritizing relevant representations for different labels. The architecture of our proposed model is illustrated in Fig. 1.
Our model benefits from the effective integration of these three neural modules. Dilated convolutions are widely used in audio signal modeling (Oord et al., 2016) and semantic segmentation (Yu and Koltun, 2015). Yu and Koltun (2015) proposed dilated convolutions with an exponentially large receptive field. Bai et al. (2018) utilized causal convolutions with dilation and tied weights for sequence modeling. Following their work, we integrate dilated convolutions with a label attention network for better medical text encoding to predict diagnosis codes. The dilated convolution follows the causal constraint of sequence modeling. By stacking dilated convolutions with residual connections (He et al., 2016), the DCAN model can be built as a very deep neural network that learns different levels of features. The final label attention module further extracts the information most relevant to the label space.
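As a quick illustration of this exponential growth, the receptive field of a stack of dilated convolutions with kernel size k and dilation sizes 1, 2, 4, ... can be computed as follows (a minimal sketch; the helper name is ours):

```python
def receptive_field(kernel_size, num_layers):
    """Receptive field of stacked dilated convolutions whose dilation
    doubles per layer (1, 2, 4, ...): each layer adds (k - 1) * 2**i
    positions of look-back, so growth is exponential in depth."""
    return 1 + (kernel_size - 1) * sum(2 ** i for i in range(num_layers))
```

With kernel size 3, four layers already cover 31 past positions, and eleven layers cover more than 2500, i.e., an entire truncated clinical note.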

Figure 1: The architecture of DCAN, comprising dilated convolution layers, residual connections, and a label attention layer.

Dilated Convolution Layers
A clinical note with n words is denoted as {x_1, . . ., x_n}. We use word2vec (Mikolov et al., 2013) to train word embeddings from raw tokens. The word embedding matrix of a clinical note is denoted as [w_1, . . ., w_n]^T ∈ R^{n×d_e}, where d_e is the dimension of the word vectors. The word embeddings are then fed into the dilated convolution layers, also called convolutions with dilated filters. Specifically, we apply a 1D convolution operator to each dimension (i.e., channel) of the word vectors. Given a sequence of one-dimensional elements x ∈ R^n and a convolutional filter f : {0, . . ., k − 1} → R, the one-dimensional dilated convolution F_d at position s is

F_d(s) = Σ_{i=0}^{k−1} f(i) · x_{s−d·i},

where d is the dilation size, i.e., the spacing between kernel elements, s is the position in the input sequence, k is the size of the convolving kernel (a.k.a. the filter), and s − d·i refers to past time steps. The 1D dilated convolution has h_l output channels, i.e., for each of the d_e input channels, h_l features are learned and summed over the input channels. The dilated convolution is followed by weight normalization, an activation function, and a dropout operation. Two dilated convolution layers are stacked into a dilated convolution block. It outputs a hidden representation H_l ∈ R^{n×h_l} at the l-th layer, where the dimension d_h of the hidden representation is the number of output channels in the last dilated convolution layer. To expand the receptive field, the dilation size is increased exponentially, i.e., d_i ∈ {2^i} for i = 0, 1, . . ., l − 2.
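The dilated convolution above can be sketched in a few lines of NumPy; this is an illustrative re-implementation (function name is ours), with zero left-padding so that the output at position s depends only on positions at or before s:

```python
import numpy as np

def dilated_causal_conv1d(x, f, d):
    """F_d(s) = sum_i f(i) * x[s - d*i], with x zero-padded on the left
    so every output position sees only current and past inputs."""
    n, k = len(x), len(f)
    pad = (k - 1) * d                       # maximum look-back distance
    xp = np.concatenate([np.zeros(pad), np.asarray(x, dtype=float)])
    return np.array([sum(f[i] * xp[pad + s - d * i] for i in range(k))
                     for s in range(n)])
```

With f = [1, 1] and d = 2, each output sums the current element with the one two steps back, never touching future positions, which is exactly the causal constraint.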

Residual Connections
Residual connections (He et al., 2016) across l residual blocks are built upon the dilated convolution layers to create deep neural networks. Given an input encoding vector x, the output of a residual connection is denoted as o = σ(x + G(x)), where G represents the skipped neural layers and σ is a non-linear activation function. We apply the residual mechanism between two stacked dilated convolution layers, formalized as:

H_{l+1} = σ(H_l + G(H_l)), (2)

where G denotes the dilated convolution block and H_l is the hidden representation of the l-th layer.
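In NumPy the residual mechanism is a one-liner; here g is any shape-preserving transformation (e.g., a dilated convolution block), and ReLU stands in for σ (an assumption of this sketch):

```python
import numpy as np

def residual_block(x, g):
    """o = sigma(x + G(x)): the skip connection lets gradients bypass g,
    which is what makes deep stacks of convolution blocks trainable."""
    return np.maximum(x + g(x), 0.0)   # ReLU as the non-linearity sigma
```

For example, with g(v) = 2v, the input [1, −2] yields [3, 0]: the sum passes through where positive and is clipped at zero otherwise.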

Label Attention Layer
We apply the label attention layer to prioritize the information in the hidden representation relevant to ICD codes. Specifically, dot-product attention is used to calculate the attention score A ∈ R^{n×m} as:

A = Softmax(H^L U),

where H^L (the superscript denotes the layer index, not a power) is the hidden encoding of the L-th layer, U ∈ R^{h_L×m} is the parameter matrix of the label attention layer (also known as the query), m is the number of ICD codes, and the softmax normalizes over the n word positions. The attention matrix A captures the importance of each pair of ICD code and hidden word representation. The output of the attention layer is then calculated by multiplying the attention A with the hidden representation from the residual dilated convolution layers. The attentive representation V ∈ R^{m×h_L} is formalized as

V = A^T H^L.

With features representing sequential dependency and label awareness, this final representation is used for medical code classification.
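A minimal NumPy sketch of this label-wise pooling (function name is ours; the softmax normalizes over the n word positions):

```python
import numpy as np

def label_attention(H, U):
    """H: (n, h_L) hidden word representations; U: (h_L, m) label queries.
    Returns V: (m, h_L), one attended representation per ICD code."""
    scores = H @ U                                   # (n, m) dot products
    scores -= scores.max(axis=0, keepdims=True)      # numerical stability
    A = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
    return A.T @ H                                   # V = A^T H
```

Each column of A is a distribution over the n words, so each row of V is a weighted average of word representations tailored to one code.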

Classification Layer
The classification layer is a linear fully-connected layer. The i-th projected representation Y_i ∈ R^k is calculated as:

Y_i = V_i W_i + b_i,

where W_i ∈ R^{h_L×k} is the weight matrix, b_i ∈ R^k is the bias, and Y_i and V_i are the i-th rows of Y and V for i ∈ {1, . . ., m}. The predicted logits ŷ ∈ R^m between 0 and 1 are produced by a pooling operation over the linearly projected matrix Y, passed into the sigmoid activation function:

ŷ = Sigmoid(Pool(Y)).
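Concretely, this can be sketched as follows; for brevity the sketch assumes a single shared projection W and max-pooling over the k projected dimensions (both are our assumptions, not pinned down by the text):

```python
import numpy as np

def classify(V, W, b):
    """V: (m, h_L) label-wise representations; W: (h_L, k) projection;
    b: (k,) bias. Project each V_i, max-pool over k, squash to (0, 1)."""
    Y = V @ W + b                          # (m, k) projected representations
    logits = Y.max(axis=1)                 # pooling over the k dimensions
    return 1.0 / (1.0 + np.exp(-logits))   # sigmoid scores y_hat, shape (m,)
```

Zero inputs produce zero logits and hence scores of exactly 0.5, the sigmoid's decision boundary.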

Training
ICD code assignment is a typical multi-label, multi-class classification problem. We adopt the binary cross-entropy loss:

L = − Σ_{i=1}^{m} [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ],

where y_i ∈ {0, 1} is the ground-truth label, ŷ_i is the sigmoid score of the prediction, and m is the number of ICD codes. To mitigate the effect of noisy labels, we apply label smoothing over the ground-truth labels y, penalizing the model for over-confident predictions. The modified targets ỹ are calculated as

ỹ_i = (1 − α) y_i + α / 2,

where α is the smoothing coefficient. We use the Adam optimizer (Kingma and Ba, 2014) to train the model with backpropagation.
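Putting both together, a sketch of the smoothed loss (assuming the standard binary-case smoothing ỹ = (1 − α)y + α/2; the function name is ours):

```python
import numpy as np

def smoothed_bce(y, y_hat, alpha=0.1, eps=1e-12):
    """Binary cross entropy over m codes with label-smoothed targets."""
    y_t = (1.0 - alpha) * np.asarray(y, dtype=float) + alpha / 2.0
    p = np.clip(np.asarray(y_hat, dtype=float), eps, 1.0 - eps)  # no log(0)
    return -np.sum(y_t * np.log(p) + (1.0 - y_t) * np.log(1.0 - p))
```

With α = 0 this reduces to plain binary cross entropy; with α > 0 the targets move off the {0, 1} extremes, so a perfectly confident prediction is never loss-optimal.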

Experiments
This section presents an experimental analysis on real-world clinical data. Our proposed model is compared with several recent strong baselines. The code is publicly available at https://agit.ai/jsx/DCAN.
The NLTK package is utilized for tokenization, and all tokens are converted to lowercase. Non-alphabetic characters such as numbers and punctuation are removed. All documents are truncated to a length of 2500 tokens. We adopt common settings from prior publications; for example, the word embedding dimension is 100 and the dropout rate is 0.2. The Adam optimizer (Kingma and Ba, 2014) is used to optimize the model parameters. The remaining hyper-parameters are configured via random search.
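The preprocessing can be sketched as follows; a simple regex tokenizer stands in for NLTK's word tokenizer, and the function name is ours:

```python
import re

def preprocess(text, max_len=2500):
    """Lowercase, keep purely alphabetic tokens, truncate to max_len."""
    tokens = re.findall(r"[a-z]+", text.lower())   # drops digits/punctuation
    return tokens[:max_len]
```

For example, "Patient, aged 45: stable." becomes the token list ["patient", "aged", "stable"], with the number and punctuation removed.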

Results
We evaluate the F1-score and the area under the receiver operating characteristic curve (AUC-ROC), both micro- and macro-averaged, as well as the precision at k codes with k = 5 (P@5). The results are shown in Table 1. Our model outperforms the state of the art on all metrics. To compare with the MultiResCNN model, we follow its setting and run our model three times. We average the predictive scores and calculate their standard deviation. Our model shows a clear improvement in the macro F1-score, where the macro score is a label-wise average that treats all codes equally. On the other metrics, our model still shows a marginal improvement with a lower or comparable standard deviation. We also tried the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model (Devlin et al., 2018) for sequence classification. However, the BERT model does not work well on this task; this finding is also reported by Li and Yu (2020).

Table 1 :
Results on the MIMIC-III dataset with top-50 ICD codes. "-" indicates no results reported in the original paper.

Table 2 :
Computational cost comparison.

Computational efficiency We compare computational efficiency from two perspectives, i.e., the number of parameters and the number of epochs to convergence; results are shown in Table 2. Not relying on concatenated multi-channel features, our model has fewer trainable parameters and takes less training time than the state-of-the-art MultiResCNN. Moreover, our model converges faster.

Conclusion
Automatic medical code assignment has been extensively studied in recent years. Neural clinical text encoding models use CNNs to extract local features and RNNs to preserve sequential dependency. This paper combines the benefits of both via dilated convolutions. The proposed Dilated Convolutional Attention Network (DCAN) consists of dilated convolution layers, residual connections, and a label attention layer. The DCAN model obeys the causal constraint of sequence encoding and learns rich representations that capture label-aware importance. Through experiments on the MIMIC-III dataset, our model shows better predictive performance than the state-of-the-art methods.