Noise Pollution in Hospital Readmission Prediction: Long Document Classification with Reinforcement Learning

This paper presents a reinforcement learning approach to extracting noise from long clinical documents for the task of readmission prediction after kidney transplant. We face the challenge of developing robust models on a small dataset in which each document may contain over 10K tokens and is full of noise, including tabular text and task-irrelevant sentences. We first experiment with four types of encoders to empirically determine the best document representation, and then apply reinforcement learning to remove noisy text from the long documents, modeling the noise extraction process as a sequential decision problem. Our results show that the traditional bag-of-words encoder outperforms deep learning-based encoders on this task, and that reinforcement learning improves upon the baseline while pruning out about 25% of text segments. Our analysis shows that reinforcement learning is able to identify both typical noisy tokens and task-specific noisy text.


Introduction
Prediction of hospital readmission has long been recognized as an important topic in surgery. Previous studies have shown that post-discharge readmission consumes tremendous social resources, while at least half of the cases are preventable (Basu Roy et al., 2015; Jones et al., 2016). Clinical notes, as part of patients' Electronic Health Records (EHRs), contain valuable information but are often too time-consuming for medical experts to evaluate manually. Thus, it is important to develop prediction models that utilize various sources of unstructured clinical documents.
The task addressed in this paper is to predict 30-day hospital readmission after kidney transplant, which we treat as a long document classification problem without using specific domain knowledge. The data we use are the unstructured clinical documents of each patient up to the date of discharge.
In particular, we face three types of challenges in this task. First, the document size can be very long; documents associated with these patients can have tens of thousands of tokens. Second, the dataset is relatively small with fewer than 2,000 patients available, as kidney transplant is a non-trivial medical surgery. Third, the documents are noisy, and there are many target-irrelevant sentences and tabular data in various text forms (Section 2).
The lengthy documents together with the small dataset pose a great challenge for representation learning. In this work, we experiment with four types of encoders: bag-of-words (BoW), averaged word embedding, and two deep learning-based encoders, ClinicalBERT (Huang et al., 2019) and LSTM with weight-dropped regularization (Merity et al., 2018). To overcome the long sequence issue, documents are split into multiple segments for both ClinicalBERT and LSTM (Section 4).
After identifying the best-performing encoder, we further propose to use reinforcement learning (RL) to automatically extract task-specific noisy text from the long documents, as we observe that many text segments do not contain predictive information, so removing this noise can potentially improve performance. We model the noise extraction process as a sequential decision problem, which also aligns with the fact that clinical documents are received in time-sequential order. At each step, a policy network with strong entropy regularization (Mnih et al., 2016) decides whether to prune the current segment given the context, and the reward comes from a downstream classifier after all decisions have been made (Section 5).
Empirical results show that the best-performing encoder is BoW, and that deep learning approaches suffer from severe overfitting because the feature space is huge relative to the limited training data. RL is applied to this BoW encoder and improves upon the baseline while pruning out around 25% of text segments (Section 6). Further analysis shows that RL is able to identify traditional noisy tokens with low document frequency (DF), as well as task-irrelevant tokens with high DF but little information (Section 7).

Data
This work is based on the Emory Kidney Transplant Dataset (EKTD), which contains structured chart data as well as unstructured clinical notes associated with 2,060 patients. The structured data comprise 80 features that are lab results before discharge, as well as binary labels indicating whether each patient is readmitted within 30 days after kidney transplant; 30.7% of patients are labeled as positive.
The unstructured data includes 8 types of notes such that all patients have zero to many documents for each note type. It is possible to develop a more accurate prediction model by co-training the structured and unstructured data; however, this work focuses on investigating the potentials of unstructured data only, which is more challenging.

Preprocessing
As the clinical notes are collected from various EMR sources, many noisy documents exist in EKTD: 515 documents are HTML pages and 303 of them are duplicates. These documents are removed during preprocessing. Moreover, most documents contain not only written text but also tabular data, because some EMR systems can only export entire documents in table format. While there is much tabular text in the documents (e.g., lab results and prescriptions as in Table 2), it is impractical to write rules to filter it out, as the exported formats are not consistent across EMRs. Thus, any tokens containing digits or symbols, except for one-character tokens, are removed during preprocessing. Although numbers may provide useful features, most quantitative measurements are already included in the structured data, so those features can be better extracted from the structured data if necessary. The remaining tabular text contains headers and values that do not provide much helpful information and become another source of noise, which we handle by training a reinforcement learning model to identify them (Section 5). Table 1 gives the statistics of each clinical note type after preprocessing. The average number of tokens is measured by counting tokens in all documents from the same note type of each patient. Given this preprocessed dataset, our task is to take all documents in each note type as a single input and predict whether or not the patient associated with those documents will be readmitted.
Table 2 (example of tabular text in a clinical note): Lab Fishbone (BMP, CBC, CMP, Diff) and critical labs -Last 24 hours 03/08/2013 12:45 142(Na) 104(Cl) 70H(BUN) -10.7L(Hgb) < 92(Glu) 6.5(WBC) 137L(Plt) 3.6(K) 26(CO2)
Shin et al. (2019) presented ensemble models utilizing both the structured and the unstructured data in EKTD, where separate logistic regression (LR) models are trained on the structured data and on each type of notes, and the final prediction for each patient is obtained by averaging the predictions from all models. Since some patients may lack documents from certain note types, predictions for these note types are simply ignored in the averaging process. For the unstructured notes, the concatenation of Term Frequency-Inverse Document Frequency (TF-IDF) and Latent Dirichlet Allocation (LDA) representations is fed into LR. However, we have found that the LDA representation contributes only marginally, while LDA takes significantly more inference time. Thus, we drop LDA and use only TF-IDF as our BoW encoder (Section 4.1).

Related Work
Various deep learning models for text classification have been proposed in recent years. Pretrained language models like BERT have shown state-of-the-art performance on many NLP tasks (Devlin et al., 2019). ClinicalBERT has also been introduced for the medical domain (Huang et al., 2019). However, deep learning approaches have two drawbacks on this particular dataset. First, deep learning requires a large dataset to train, whereas most of our unstructured note types have fewer than 2,000 samples. Second, these approaches are not designed for long documents and have difficulty maintaining long-term dependencies over thousands of tokens.
Reinforcement learning has been explored to combat data noise by previous work (Zhang et al., 2018) in the short-text setting. A policy network makes decisions left-to-right over tokens, and is jointly trained with another classifier. However, there is little investigation of using RL in the long-text setting, as it still requires an effective encoder to give a meaningful representation of long documents. Therefore, in our experiments, the first step is to select the best encoder, and then to apply RL to long document classification.

Bag-of-Words
For the baseline model, the bag-of-words representation with TF-IDF scores, excluding stopwords (Nothman et al., 2018), is fed into logistic regression (LR). The objective is to minimize the negative log likelihood of the gold label $y_i$:

$$\min \; -\sum_{i} \log P(y_i \mid g_i) \tag{1}$$

where $g_i$ is the TF-IDF representation of $D_i$. In addition, we experiment with two common techniques in the encoder to reduce the feature space: token stemming and document frequency cutoff.

Averaged Word Embedding
Word embeddings generated by fastText, which utilizes subwords to better represent unseen terms (Bojanowski et al., 2017), are used to establish another baseline. fastText is suitable for this task because unseen terms and misspellings frequently appear in these clinical notes. The averaged word embedding is used to represent the input document consisting of multiple notes, and is fed into LR with the same training objective.

ClinicalBERT
Following Huang et al. (2019), the pretrained BERT language model (Devlin et al., 2019) is first tuned on the MIMIC-III clinical note corpus (Johnson et al., 2016), which has been shown to provide better related-word similarities in medical domains. Then, a dense layer is added on top of the CLS token of the last BERT layer. All parameters are fine-tuned to optimize the binary cross-entropy loss, which is the same objective as Equation 1.
Since BERT has a limit on the input length, the input document of each patient is split into multiple subsequences. Each subsequence is within the BERT length limit, and serves as an independent sample with the same label as the patient. The training data is therefore noisily inflated. The final probability of readmission is computed as follows:

$$p(y_i = 1 \mid g_i) = \frac{p^{n_i}_{\max} + p^{n_i}_{\text{mean}} \, n_i / c}{1 + n_i / c} \tag{2}$$

where $g_i$ is the BERT representation of patient $i$, $n_i$ is the corresponding number of subsequences, and $c$ is a hyperparameter that controls the influence of $n_i$. $p^{n_i}_{\max}$ and $p^{n_i}_{\text{mean}}$ are the max and mean probability across the subsequences, respectively.
The motivation behind balancing the max and mean probability is that subsequences do not contain equal information. $p^{n_i}_{\max}$ represents the best potential, while longer text should give more importance to $p^{n_i}_{\text{mean}}$, because $p^{n_i}_{\max}$ is more easily affected by noise as the text length grows. Although Equation 2 seems intuitive, the use of pseudo labels on subsequences becomes another source of noise, especially when there are thousands of tokens; thus, the performance is uncertain. Section 6.2 provides a detailed empirical analysis of this model.
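The aggregation over subsequences can be illustrated with a few lines of Python. This is only a sketch: it assumes the combination takes the max/mean form of Huang et al. (2019), and the function name `readmission_probability` and the default `c=2.0` are hypothetical choices made for illustration, not values reported in this work.

```python
def readmission_probability(subseq_probs, c=2.0):
    """Combine per-subsequence probabilities into one patient-level probability.

    Assumed form: (p_max + p_mean * n / c) / (1 + n / c), where n is the
    number of subsequences. The default c is an arbitrary illustration value.
    """
    if not subseq_probs:
        raise ValueError("at least one subsequence probability is required")
    n = len(subseq_probs)
    p_max = max(subseq_probs)
    p_mean = sum(subseq_probs) / n
    weight = n / c  # grows with the number of subsequences
    # With many subsequences the mean dominates; with few, the max does.
    return (p_max + p_mean * weight) / (1.0 + weight)
```

Because the result is a weighted average of the max and the mean, it always lies between the two, and it moves toward the mean as the number of subsequences grows.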

Weight-dropped LSTM
We split the documents of each patient into multiple short segments, and feed the segment representation to a long short-term memory network (LSTM) at each time step:

$$h_j = \mathrm{LSTM}(s_j, h_{j-1}; \theta) \tag{3}$$

where $h_j$ is the hidden state at time step $j$, $s_j$ is the $j$th segment, and $\theta$ is the set of parameters.
Although segmentation of documents is still necessary, no pseudo labels are needed. We obtain the segment representation by averaging its token embeddings from the last layer of BERT. The final hidden state at each step $j$ is the concatenation of the hidden states of a single-layer bi-directional LSTM.
After we obtain the hidden state for each segment, a max-pooling operation is performed on $h_{1:n}$ over the time dimension to obtain a fixed-length vector, similar to Kim (2014) and Adhikari et al. (2019). A dense layer immediately follows. It is particularly important to strengthen regularization on this dataset because of its small sample size. Dropout (Srivastava et al., 2014) has been shown to be an effective regularizer in deep learning models, and Merity et al. (2018) have successfully applied a dropout-like technique to LSTM: DropConnect (Wan et al., 2013) is applied to the four hidden-to-hidden matrices, preventing overfitting on the recurrent weights.
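As a rough illustration of the segment-then-pool idea, the sketch below splits a token list into fixed-length segments and applies element-wise max-pooling over a sequence of hidden-state vectors. It assumes plain fixed-size chunking (the exact segmentation rule is not specified here) and uses Python lists instead of tensors; both helper names are hypothetical.

```python
def split_into_segments(tokens, max_len):
    """Split a token list into consecutive segments of at most max_len tokens."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


def max_pool_over_time(hidden_states):
    """Element-wise max over a list of equal-length hidden-state vectors."""
    if not hidden_states:
        raise ValueError("need at least one hidden state")
    # zip(*...) groups the k-th dimension of every time step together.
    return [max(column) for column in zip(*hidden_states)]
```

The pooled vector has the same dimension as a single hidden state regardless of how many segments the document contains, which is what allows a dense layer to follow.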

Reinforcement Learning
Reinforcement learning is applied to the best-performing encoder in Section 4 to prune noisy text, which can lead to comparable or even better performance, as many text segments in these clinical notes are found to be irrelevant to this task. Figure 1 gives an overview of our reinforcement learning approach. The pruning process is modeled as a sequential decision problem, because these notes are received in time order. The approach consists of two separate components: a policy network and a downstream classifier. To avoid having too many time steps, the policy operates at the segment level instead of the token level. For each patient, documents are split into short segments $g_{1:T} = \{g_1, g_2, \cdots, g_T\}$, and the policy network makes a sequence of decisions $a_{1:T} = \{a_1, a_2, \cdots, a_T\}$ over the segments. The downstream classifier is responsible for the reward, and the REINFORCE algorithm is used to train the policy (Williams, 1992).
State. At each time step, the state $s_t$ is the concatenation of two parts: the representation of the previously selected text, and the current segment representation $g_t$. The previously selected text serves as the context and provides a prior importance. Both parts are represented by an effective encoder, e.g., the best-performing encoder from Section 4.

Action
The action space at each step is binary: {Keep, Prune}. If the action is Keep, the current segment is added to the selected text; otherwise, it is discarded. The final selected text for a patient is the concatenated segments selected by the policy.
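The interaction between state and action can be sketched as a simple left-to-right rollout. The code below is a minimal illustration under stated assumptions: `encode` is a hypothetical stand-in encoder that uses raw term counts rather than TF-IDF, and `decide` is a hypothetical callback standing in for the policy network.

```python
from collections import Counter


def encode(tokens, vocab):
    """Hypothetical stand-in encoder: raw term counts over a fixed vocabulary."""
    counts = Counter(tokens)
    return [float(counts[tok]) for tok in vocab]


def rollout(segments, vocab, decide):
    """Make Keep/Prune decisions left to right.

    decide(state) -> bool is a hypothetical policy callback; True means Keep.
    Returns the selected tokens and the list of actions taken.
    """
    selected, actions = [], []
    for segment in segments:
        # State = [representation of selected text ; representation of segment].
        state = encode(selected, vocab) + encode(segment, vocab)
        keep = decide(state)
        actions.append("Keep" if keep else "Prune")
        if keep:
            selected.extend(segment)  # kept segments become future context
    return selected, actions
```

For example, a toy rule that keeps only segments mentioning a given vocabulary word can be passed as `decide`; in the actual approach this decision is sampled from the learned policy.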

Reward
The reward comes at the end, when all actions have been sampled for the entire sequence. The final selected text is fed to the downstream classifier, and negative log-likelihood of the gold label is used as the reward $R$. In addition, we also include a reward term $R_p$ to encourage pruning, as follows:

$$R_p = c \cdot \alpha \cdot \left(2\sigma\!\left(\frac{l}{\beta}\right) - 1\right) \tag{4}$$

where $c$ and $\beta$ are hyperparameters that control the scale of $R_p$, $l$ is the number of segments, $\alpha$ is the ratio of pruned segments $|\{a_k = \text{Prune}\}| / l$, and $\sigma$ is the sigmoid function. The value of the term $2\sigma(l/\beta) - 1$ falls into the range $(0, 1)$. When $l$ is small, it downgrades the encouragement of pruning; when $l$ is large, it also gives an upper bound on $R_p$. Additionally, we apply exponential decay to the reward. The final reward is $d^{l} R + R_p$, where $d$ is the discount rate.
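The reward computation can be made concrete with a short sketch. It assumes the pruning term has the form $R_p = c \cdot \alpha \cdot (2\sigma(l/\beta) - 1)$, reconstructed from the description above, and it takes the classifier reward $R$ as a given input rather than computing it. The function names are hypothetical; the defaults `beta=8.0` and `d=0.95` follow the values reported in the experiments, and `c=0.1` is one of the searched values.

```python
import math


def pruning_reward(actions, c=0.1, beta=8.0):
    """Pruning bonus R_p = c * alpha * (2 * sigmoid(l / beta) - 1) (assumed form)."""
    l = len(actions)
    if l == 0:
        return 0.0
    alpha = sum(1 for a in actions if a == "Prune") / l  # ratio of pruned segments
    length_scale = 2.0 / (1.0 + math.exp(-l / beta)) - 1.0  # in (0, 1) for l > 0
    return c * alpha * length_scale


def final_reward(classifier_reward, actions, c=0.1, beta=8.0, d=0.95):
    """Discounted classifier reward plus pruning bonus: d**l * R + R_p."""
    l = len(actions)
    return (d ** l) * classifier_reward + pruning_reward(actions, c, beta)
```

Since the length scale is bounded by 1 and the pruning ratio by 1, the bonus never exceeds `c`, which keeps it from overwhelming the classifier reward on very long sequences.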
Policy Network. The policy network maintains a stochastic policy $\pi(a_t \mid s_t; \theta)$:

$$\pi(a_t \mid s_t; \theta) = \sigma(W s_t + b)$$

where $\theta$ is the set of policy parameters $W$ and $b$, and $a_t$ and $s_t$ are the action and state at time step $t$, respectively. During training, an action is sampled at each step with the probability given by the policy. After sampling is performed over the entire sequence, the delayed reward is computed. During evaluation, the action is picked by $\arg\max_a \pi(a \mid s_t; \theta)$. The training is guided by the REINFORCE algorithm (Williams, 1992), which optimizes the policy to maximize the expected reward:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R_\tau\right] \tag{5}$$

and the gradient has the following form:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi(a_t \mid s_t; \theta) \, R_{\tau_i} \tag{6}$$

where $\tau$ represents the sampled trajectory $\{a_1, a_2, \cdots, a_T\}$ and $N$ is the number of sampled trajectories. $R_{\tau_i}$ here equals the delayed reward from the downstream classifier at the last step.
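To encourage exploration and avoid local optima, we add entropy regularization (Mnih et al., 2016) to the policy loss:

$$\nabla_\theta J_{\text{reg}}(\theta) = \frac{\lambda}{N} \sum_{i=1}^{N} \frac{1}{T_i} \sum_{t=1}^{T_i} \nabla_\theta H\big(\pi(\cdot \mid s_t; \theta)\big) \tag{7}$$

where $H$ is the entropy, $\lambda$ is the regularization strength, and $T_i$ is the trajectory length. Finally, the downstream classifier and the policy network are warm-started by separate training, and then jointly trained.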
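The policy update can be sketched in pure Python for a single-layer logistic policy. This is a simplified illustration, not the training code used in this work: it assumes that $\sigma(W s_t + b)$ is the probability of Keep, that the entropy bonus is averaged over the trajectory length, and that the returned gradient is used for gradient ascent. All function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_actions(states, w, b, rng):
    """Sample Keep (1) / Prune (0) for each state from the Bernoulli policy."""
    actions = []
    for s in states:
        p_keep = sigmoid(sum(wi * si for wi, si in zip(w, s)) + b)
        actions.append(1 if rng.random() < p_keep else 0)
    return actions


def policy_gradient(trajectories, w, b, lam=0.0):
    """REINFORCE gradient estimate with an entropy bonus (for gradient ascent).

    trajectories: list of (states, actions, reward) with one delayed reward each.
    """
    grad_w, grad_b = [0.0] * len(w), 0.0
    n = len(trajectories)
    for states, actions, reward in trajectories:
        t_len = len(states)
        for s, a in zip(states, actions):
            z = sum(wi * si for wi, si in zip(w, s)) + b
            p = sigmoid(z)
            score = (a - p) * reward          # d/dz log pi(a|s), times the reward
            entropy_grad = -p * (1.0 - p) * z  # d/dz of the Bernoulli entropy
            coeff = (score + lam * entropy_grad / t_len) / n
            grad_b += coeff
            for k, sk in enumerate(s):
                grad_w[k] += coeff * sk
    return grad_w, grad_b
```

A usage loop would sample `N` trajectories with `sample_actions`, score each with the delayed reward, and then add a small multiple of the returned gradient to `w` and `b`. The entropy term always pushes the logit toward zero, that is, toward a 50/50 policy, which is what discourages an early collapse to keeping every segment.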

Experiments
Before the experiments, we perform the preprocessing described in Section 2.1, and then randomly split the patients of every note type into 5 folds to perform cross-validation, as suggested by Shin et al. (2019). To evaluate each fold $F_i$, 12.5% of the training set (the combined data of the other 4 folds) is held out as the development set, and the best configuration on this development set is used to decode $F_i$. The same split is used across all experiments for fair comparison. Following Shin et al. (2019), the averaged Area Under the Curve (AUC) across these 5 folds is used as the evaluation metric.
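The splitting protocol can be sketched as follows. This is a minimal sketch under stated assumptions: the split is unstratified, the development set is drawn at random from the four training folds, and the function name and `seed` parameter are hypothetical.

```python
import random


def cross_validation_splits(patient_ids, n_folds=5, dev_fraction=0.125, seed=0):
    """Yield (train, dev, test) patient-id lists, one triple per fold."""
    ids = list(patient_ids)
    rng = random.Random(seed)  # fixed seed so every experiment sees the same split
    rng.shuffle(ids)
    folds = [ids[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        rest = [pid for j, fold in enumerate(folds) if j != k for pid in fold]
        n_dev = round(len(rest) * dev_fraction)
        # The first n_dev shuffled ids form the development set.
        yield rest[n_dev:], rest[:n_dev], test
```

With 5 folds and a 12.5% hold-out, each fold uses roughly 70% of the patients for training, 10% for development, and 20% for testing.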

Baseline
Bag-of-Words. We first conduct experiments using the bag-of-words encoder (BoW; Section 4.1) to establish the baseline. Experiments are performed on all note types using vanilla TF-IDF, a document frequency (DF) cutoff at 2 (removing all tokens whose DF ≤ 2), and token stemming. For every experiment, the class weight is assigned inversely proportional to the class frequencies, and the inverse of regularization strength C is searched over {0.01, 0.1, 1, 10}, where the best results are achieved with C = 1 on the development set. Table 3 describes the cross-validation results on every note type. The top AUC is 62.3%, which is within expectation given the difficulty of this task. Some note types are not as predictive as the others, such as Operative (OP) and Social Worker (SW), with AUCs under 52%. Most note types have standard deviations in the range of 0.02 to 0.03.
Compared with the previous work (Shin et al., 2019), we achieve 0.671 AUC when combining both structured and unstructured data, despite not using LDA in our encoder.

Noise Observation
The DF cutoff coupled with token stemming significantly reduces the feature space for the BoW model. As shown in Table 4, the DF cutoff alone achieves about a 50% reduction of the feature space. Furthermore, applying the DF cutoff leads to slightly higher AUCs on most of the note types, even though almost half of the tokens are removed from the vocabulary. This implies that there exists a large amount of noisy text that appears in only a few documents, causing the models to overfit more easily. These results further support our previous observation and strengthen the case for extracting noise from these long documents using reinforcement learning (Section 6.3).
Averaged Word Embedding. For the averaged word embedding encoder (AWE; Section 4.2), 300-dimensional embeddings generated by fastText trained on Common Crawl and the English Wikipedia are used. AWE is outperformed by BoW on every note type except Operative (OP; Table 3). This empirical result implies that averaging embeddings over thousands of tokens is not effective for generating document representations, so the averaged embeddings are less discriminative than the sparse vectors generated by BoW for such long documents.
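The DF cutoff itself is simple to express in code. The sketch below is a standalone illustration rather than the pipeline used in the experiments: it counts in how many documents each token occurs and removes every token whose DF is at most the cutoff, matching the "DF ≤ 2" rule stated for the baseline. The function names are hypothetical.

```python
from collections import Counter


def document_frequency(documents):
    """Count, for each token, the number of documents that contain it."""
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))  # set() so repeats within a document count once
    return df


def apply_df_cutoff(documents, cutoff=2):
    """Drop every token whose document frequency is <= cutoff.

    Returns the filtered documents and the surviving vocabulary.
    """
    df = document_frequency(documents)
    vocab = {tok for tok, count in df.items() if count > cutoff}
    filtered = [[tok for tok in tokens if tok in vocab] for tokens in documents]
    return filtered, vocab
```

Comparing the size of `vocab` with the number of distinct tokens before filtering gives the kind of feature-space reduction reported in Table 4.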

Deep Learning-based Encoders
For deep learning encoders, the four note types with good baseline performance (≈ 60% AUC) and reasonable sequence length (< 5000) are selected to use in the following experiments, which are Consultations (CO), Discharge Summary (DS), History and Physical (HP), and Selection Conference (SC) (see Tables 1 and 3).
Segmentation. For both ClinicalBERT and the LSTM models, the input document is split into segments as described in Section 4.3. For LSTM, we set the maximum segment length to 128 for CO and HP, and 64 for DS and SC, to balance segment length against sequence length. The segment length for ClinicalBERT is set to 318 (approaching 500 after BERT tokenization) to avoid the noise brought by too many pseudo labels. More statistics about segmentation are summarized in Table 5. For ClinicalBERT, we use the PyTorch BERT implementation with the base configuration: 768 embedding dimensions and 12 transformer layers, and we load the weights provided by Huang et al. (2019), whose language model has been fine-tuned on large-scale clinical notes. We fine-tune the entire ClinicalBERT with batch size 4, learning rate $2 \times 10^{-5}$, and weight decay rate 0.01.
For the weight-dropped LSTM, we set the batch size to 64, the learning rate to $10^{-3}$, and the weight-drop rate to 0.5, and search the hidden state dimension over {128, 256, 512} on the development set. Early stopping is used for both approaches. Table 3 shows the final results achieved by the ClinicalBERT and LSTM models. The AUCs of both models drop non-trivially from the baseline. Further investigation shows that both models suffer from severe overfitting under the huge feature spaces, and struggle to learn generalized decision boundaries from this data. Figure 2 shows an example of the weak correlation between the training loss and the AUC scores on the development set.

Result Analysis
As more steps are processed, the training loss gradually decreases to 0. However, the model has high variance, and it does not necessarily perform better on the development set as the training loss drops. This issue is more apparent with ClinicalBERT on CO because there are too many pseudo labels acting as noise, which makes it harder for the model to distinguish useful patterns from noise.

Reinforcement Learning
According to Table 3, the BoW model achieves the best performance. Therefore, we use TF-IDF to represent the long text of each patient, along with logistic regression as the classifier for reinforcement learning. Document segmentation is the same as for LSTM (Table 5). During training, segments within each note are shuffled to reduce the risk of overfitting, and sequences with more than 36 segments are truncated. The downstream classifier is warm-started by loading weights from the logistic regression model in the previous experiment. The policy network is then trained for 400 episodes while the downstream classifier is frozen. After the warm start, both models are jointly trained. We set the number of sampled trajectories $N$ to 10, the learning rate to $2 \times 10^{-4}$, the scaling factor $\beta$ in Equation 4 to 8, and the discount rate to 0.95. Moreover, we search the reward coefficient $c$ over {0.02, 0.1, 0.4}, and the entropy coefficient $\lambda$ over {2, 4, 6, 8}. The AUC scores and the pruning ratios (the number of pruned segments divided by the sequence length) are shown in Table 6. Our reinforcement learning approach outperforms the best-performing models in Table 3, achieving around 1% higher AUC scores on three note types, CO, HP, and SC, while pruning out up to 26% of the input documents.

Tuning Analysis. We find that two hyperparameters are essential to the final success of reinforcement learning (RL). The first is the reward discount rate $d$. The scale of the policy gradient $\nabla_\theta J(\theta)$ depends on the sequence length $T$, while the delayed reward $R_\tau$ is always on the same scale regardless of $T$. Therefore, varying sequence lengths across episodes cause fluctuations in the policy gradient, leading to unstable training. It is important to apply reward decay to stabilize the scale of $\nabla_\theta J(\theta)$.
The second is the entropy regularization coefficient $\lambda$, which biases the model towards uncertainty. Without strong entropy regularization, training easily falls into a local optimum at an early stage, which is to keep all segments, as shown in Figure 3(a). $\lambda = 6$ gives the model a decent incentive to explore aggressively, as shown in Figure 3(b), and finally leads to a higher AUC.

Noise Analysis
To investigate the noise extracted by RL, we analyze the pruned segments on the validation sets of the Consultations type (CO), and compare the results with other basic noise removal techniques. Table 7 demonstrates the potential of the learned policy to automatically identify noisy text from the long documents. The original notes of the shown examples are tabular text with headers and values, mostly lab results and medical prescriptions. After the data cleaning step, the text becomes broken and does not make much sense for humans to evaluate. The learned policy can identify noisy segments by looking at the presence of headers such as "lab fishbone", "lab report", and certain medical terms that frequently appear in tabular reports such as "chloride", "creatinine", "hemoglobin", "methylprednisolone", etc. We find that many pruned segments have strong indicators of headers and specific medical terms, which appear mostly in tabular text rather than written notes. Table 8 shows examples that are kept by the policy. Tokens that contribute towards the Keep action are words related to human and social life, such as "social worker", "engaged", "drinks", "married", "medicare", and terms related to health conditions, such as "obesity", "heart", "high blood pressure". These terms indeed appear mostly in written text rather than tabular data.
Table 7 (examples of segments pruned by the learned policy): lab fishbone ( bmp , cbc , cmp , diff ) and critical labs -last hours ( not an official lab report . please see flowsheet ( or printed official lab reports ) for official lab results . ) ( na ) ( cl ) h ( bun ) -( hgb ) ( glu ) ( wbc ) ( plt ) ( ) h ( cr ) ( hct ) na = not applicable a = abnormal ( ftn ) = footnote . laboratory studies : sodium , potassium , chloride , . , bun , creatinine , glucose . total bilirubin 1 , phos of , calcium , ast 9 , alt , alk phos . parathyroid hormone level . white blood cell count , hemoglobin , hematocrit , platelets . inr , ptt , and pt . methylprednisolone ivpb : mg , ivpb , give in surgery , routine , / , infuse over : minute . mycophenolate mofetil : mg = 4 cap , po , capsule , once , now , / , stop date / , ml . documented medications documented accupril : mg , po , qday , 0 refill , substitution allowed .
Table 8 (examples of segments kept by the learned policy): the social worker met with this pleasant year old caucasian male on this date for kidney transplant evaluation . the patient was alert , oriented and easily engaged in conversation with the social worker today . he resides in atlanta with his spouse of years , who he describes as very supportive . he reports occasional alcohol drinks per month but denies any illicit drug use . he has a grade education . he has been married for years . he is working full -time while on peritoneal dialysis as a business asset manager . he has medicare and an aarp prescriptions supplement . family history : mother deceased at age with complications of obesity , high blood pressure and heart disease .

Qualitative Analysis
In addition, we notice that the policy is able to remove certain duplicate segments. Medical professionals sometimes copy descriptions from previous notes into a new document, causing duplicate content. The policy learns to make use of the already selected context, and assigns negative coefficients to certain tokens. Duplicate segments are selected only once if the segment contains many tokens that have opposite feature importance in the context and segment vectors.
Quantitative Analysis. We examine the tokens that are pruned by RL and compare them with the document frequency (DF) cutoff. We select the 3000 unique tokens in the vocabulary that have the most negative feature importance (towards the Prune action) in the segment vector of CO. Figure 4 shows the DF distribution of these tokens. We observe that the majority of those tokens have small DF values. This shows that the learned policy is able to identify certain tokens with small DF values as noise, which aligns with the DF cutoff. Moreover, the distribution also shows a non-trivial number of tokens with large DF values, demonstrating that RL can also identify task-specific noisy tokens that commonly appear in documents, which in this case are certain tokens in noisy tabular text.
Both RL and the DF cutoff achieve higher AUC while reducing the input features, suggesting that, given the small sample size, the removed text is more likely to cause overfitting than to carry generalizable patterns, which is consistent with our initial hypothesis.

Conclusion
In this paper, we address the task of 30-day readmission prediction after kidney transplant, and propose to improve performance by applying reinforcement learning with noise extraction capability. To overcome the challenge of long document representation with a small dataset, we experiment with four different encoders. Empirical results show that bag-of-words is the most suitable encoder, surpassing overfitted deep learning models, and that reinforcement learning is able to improve performance while identifying both traditional noisy tokens that appear in few documents and task-specific noisy text that appears commonly.