Clinical XLNet: Modeling Sequential Clinical Notes and Predicting Prolonged Mechanical Ventilation

Clinical notes contain rich information, which is relatively unexploited in predictive modeling compared to structured data. In this work, we developed a new clinical text representation, Clinical XLNet, that leverages the temporal information in a sequence of notes. We evaluated our models on the prolonged mechanical ventilation prediction problem, and our experiments demonstrate that Clinical XLNet consistently outperforms the best baselines. The models and scripts are made publicly available.


Introduction
Unstructured clinical notes within Electronic Health Records (EHR) contain valuable information to support clinical decisions [1]. However, most prognostic models currently used in medical practice rely on scoring systems that only incorporate structured data [2; 3; 4; 5]. Yet to make accurate clinical decisions, clinicians have to go through numerous clinical notes to extract additional prognostic information from unstructured text. This adds to the workload of clinicians when they need to make quick decisions and thus may introduce human errors. Therefore, predictive models that utilize unstructured data could be very helpful in medical practice.
A major challenge in unleashing the clinical power of unstructured data is representing notes in ways that allow effective mining of clinically meaningful knowledge. Natural Language Processing (NLP) methods can be used to generate effective note representations and exploit their predictive value. There have been many recent advances in the standard NLP domain, such as BERT [6] and XLNet [7]. However, clinical notes differ greatly from general-domain text (Wikipedia, BookCorpus, etc.): for example, they contain jargon and abbreviations, and different grammar and syntax. It is notoriously difficult to obtain an effective note representation. Hence, bridging the gap requires an architecture that captures the nuances of each individual clinical note while taking into account the temporal nature of the sequence of notes. Recently, ClinicalBERT, which adapts the BERT model from the standard NLP domain to model clinical notes [8; 9], achieved superior performance in clinical text prediction. High-quality relationships between human-interpreted medical concepts in clinical notes were uncovered, and ClinicalBERT outperformed competitive baselines in predicting 30-day hospital readmission [8]. However, previous works still have the following limitations: 1. Note representation could be improved. In the standard NLP domain, BERT ignores the discrepancy of masked positions between the pretraining and finetuning stages. An autoregressive pretraining method named XLNet has been proposed, which empirically outperforms BERT by a large margin on many NLP tasks [7]. XLNet overcomes the pretrain-finetune discrepancy by using permutations of factorization sequences to capture bidirectional context. This presents an opportunity to further improve clinical note representation by adapting XLNet to the clinical domain. 2.
Failure to incorporate the temporal dimension of clinical notes. Clinical notes have a temporal dimension, where the order of information in sequential notes can provide additional predictive signals. Many previous models [8; 9] only aggregate individual risk scores from each note, which ignores the temporal information charted in the EHR. 3. Unrealistic prediction setup relative to clinical reality. Many previous works [10; 11; 12] have used predictors that would not be available at the time point when a clinical decision must be made.
In this paper, we present Clinical XLNet, which processes a patient's notes and predicts the probability of PMV and mortality. In particular, this model mitigates the aforementioned limitations via the following technical contributions: 1. Improved clinical note representation. We apply the permutation language modeling method proposed in XLNet to a corpus of clinical notes to generate better clinical embeddings, as demonstrated in Section 4. 2. Inclusion of temporal information. We maintain the temporal order of the note embeddings generated by Clinical XLNet and feed them into a bidirectional LSTM layer [13], which leverages information along the temporal dimension (Fig. 1). 3. Realistic prediction setup. We performed meticulous cohort curation on the MIMIC-III dataset [14] with the clinician team and set up an actionable prediction task according to the real clinical setting. Clinical notes used in this prediction task are strictly within the 48-hour time window.
To evaluate Clinical XLNet, we examined its performance in predicting prolonged mechanical ventilation (PMV). PMV consumes a substantial amount of healthcare resources, results in great financial and emotional burdens for patients and their families, and is associated with a high one-year mortality of around 50-60% [15; 16; 17]. It is projected that over 600,000 patients in the United States will require PMV by the year 2020 [18]. A surgical procedure to create an opening (stoma) in the trachea, named tracheotomy, allows breathing through an alternative airway [19]. Tracheotomy results in improved patient comfort, decreased duration of ICU and hospital stay, and reduced mortality [20].
However, the problem with tracheotomy, as an invasive procedure, is that it may not be necessary if a patient's condition improves quickly. Thus, an early and correct tracheotomy decision is critical. The clinical team makes this decision based on available ICU evidence as well as clinical judgment on the likelihood of PMV and potential mortality. The current prognosis of these factors relies on the ProVent score, which incorporates only a limited number of structured data elements in the EHR [21].

Data
We use the Multiparameter Intelligent Monitoring in Intensive Care III (MIMIC-III) dataset [14] hosted on PhysioNet [22]. We include patients who received mechanical ventilation for at least 2 days with more than 6 hours each day. We exclude organ donors and patients transferred from other hospitals. To alleviate confounding, we further remove patients with diseases that always lead to PMV, such as neuromuscular disease, malignant neoplasm, and extensive burns.
For each admission, we use the first ICU stay. For clinical notes, we are interested in physician and nursing notes, but we narrow down to nursing and respiratory notes within 48 hours from the start of the first ventilation event. The reason for selecting only nursing-related notes is to expand the cohort, as MIMIC-III is missing physician notes from 2001 to 2008. Additional criteria applied in our data curation process are shown in Fig. 2. In the end, we obtain a cohort of 7,287 unique patients and their corresponding 73,224 clinical notes. Table 1 shows the cohort demographics.
Cohort Labels. Our cohort is labeled with binary PMV and 90-day mortality labels. PMV is defined as being on mechanical ventilation for more than 7 days with at least 6 hours each day [23]. 90-day mortality is defined as death occurring within 90 days of the first ICU admission.
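The two label definitions above can be sketched in code. This is a minimal illustration only; the helper functions and the per-day ventilation-hours data structure are assumptions for exposition, not the authors' actual MIMIC-III extraction pipeline.

```python
from datetime import date

def pmv_label(vent_hours_per_day):
    """PMV: on mechanical ventilation for more than 7 days, counting
    only days with at least 6 ventilated hours [23].
    `vent_hours_per_day` is a hypothetical list of daily ventilated hours."""
    qualifying_days = sum(1 for h in vent_hours_per_day if h >= 6)
    return int(qualifying_days > 7)

def mortality_90d_label(first_icu_admit, death_date):
    """90-day mortality: death within 90 days of the first ICU admission."""
    if death_date is None:
        return 0
    return int((death_date - first_icu_admit).days <= 90)

# Example: 8 ventilated days, each with >= 6 hours -> PMV positive
print(pmv_label([12, 10, 8, 24, 24, 9, 7, 6]))                    # 1
print(mortality_90d_label(date(2021, 1, 1), date(2021, 2, 15)))   # 1
```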

Methods
This section presents our Clinical XLNet framework (Fig. 1). Clinical XLNet is an extension of XLNet [7] to the clinical text domain. It first generates a deep latent representation for clinical notes and then, by applying a bidirectional LSTM (Bi-LSTM) layer, leverages the sequential order of the notes.
Problem Settings. Our target task aims to leverage a patient's clinical notes to predict the patient's prognostic variables, such as PMV and mortality. We denote a patient as P, and each patient P is associated with an ordered sequence of notes {N_1, ..., N_i}, where i is the total number of notes.
To predict mortality L_M and PMV L_P, we aim to learn two mappings f_M, f_P : {N_1, ..., N_i} → [0, 1], whose outputs are probabilities measuring the likelihood of mortality and PMV respectively. For each sequence, XLNet and BERT both prepend a [CLS] classification token for downstream task usage. XLNet's Permutation Language Modeling (PLM) first factorizes the input sequence into a list of order-factorized sequences and then applies language modeling to predict the next word given the previous words, which overcomes the [MASK]-token discrepancy between pre-training and fine-tuning in BERT [6]. For a more detailed description, we refer the readers to the original paper [7].
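The core idea of permutation language modeling can be sketched as follows. This is a toy illustration of building a visibility mask from one sampled factorization order; it is not XLNet's actual two-stream attention implementation, and all sizes are assumptions.

```python
import torch

torch.manual_seed(0)
seq_len = 6

# Sample one factorization order over the sequence positions.
order = torch.randperm(seq_len)

# Build the PLM visibility mask: when predicting the token at position
# order[t], the model may attend only to positions earlier in the order.
mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
for t in range(seq_len):
    for s in range(t):
        mask[order[t], order[s]] = True

# Row i of `mask` marks which positions are visible when position i is
# the prediction target. Averaged over permutations, every position can
# see every other, so bidirectional context is captured without [MASK]
# tokens and without a pretrain-finetune input mismatch.
print(mask.int())
```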
After pre-training, given any clinical note N_i, we use the last encoder layer's hidden representation E_i of the [CLS] token to represent the note. Note that, as we train with a supervised signal in the downstream task, the [CLS] token gathers useful information from the entire note via Transformer-XL's self-attention mechanism. Given a temporally ordered sequence of notes associated with a patient, we thus obtain a temporally ordered sequence of note representations {E_1, ..., E_i}. Finetuning Clinical XLNet. To leverage the temporal information among the notes, we feed {E_1, ..., E_i} into a sequential modeling layer. Specifically, we use a Bi-LSTM model [25; 13]. We use a bidirectional model because not only do later notes depend on previous notes, as patients develop their symptoms in temporal order, but later notes may also contain useful clinical knowledge that enriches the representation of previous notes. The output of the Bi-LSTM layer, H_N, is then fed into a predictor neural network, which generates a probability p measuring the likelihood of the downstream target variables, PMV and mortality. The network is tuned using a binary classification loss.
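The finetuning head described above can be sketched as follows. The embedding size 768 matches XLNet-base but is an assumption here, as are the hidden size and layer structure of the hypothetical predictor network.

```python
import torch
import torch.nn as nn

class NoteSequenceHead(nn.Module):
    """Sketch of the finetuning head: a Bi-LSTM over the temporally
    ordered [CLS] embeddings {E_1, ..., E_i} of one patient's notes,
    followed by a small predictor network producing probability p."""
    def __init__(self, emb_dim=768, hidden=128):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.predictor = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, note_embs):            # (batch, n_notes, emb_dim)
        out, _ = self.bilstm(note_embs)
        h_n = out[:, -1, :]                  # H_N: latent vector for the sequence
        return torch.sigmoid(self.predictor(h_n)).squeeze(-1)  # p in (0, 1)

head = NoteSequenceHead()
p = head(torch.randn(4, 10, 768))            # 4 patients, 10 notes each
print(p.shape)                               # torch.Size([4])
```

Training would minimize a binary cross-entropy loss between p and the PMV or mortality label, as stated above.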
Meta-Finetuning. End-to-end training is ideal, as it can further tune the note representation module to fit the specific task at hand. However, as each patient is associated with a large number of notes, during training the forward and backward information flowing through the large note representation network for each note must all be stored in GPU memory. This is not scalable. Instead, we propose to approximate the task-specific note representation through an additional meta-finetuning stage. During meta-finetuning, we use a single note as input and further train the pre-trained Clinical XLNet with the downstream task label of the corresponding patient. The meta-finetuned network can then generate a task-specific note representation. Then, during the finetuning stage, we freeze the note representation XLNet module and use the static, fixed note representations from the meta-finetuned XLNet.
This meta-finetuning stage also allows fast adaptation to a new dataset for the same downstream task, because we only need to train the temporal Bi-LSTM layer, which is relatively small and takes around 5 minutes to converge on one GPU. This is an ideal property for a realistic ICU setup that requires quickly adapted information.
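The workflow above — freeze the meta-finetuned encoder, precompute static note embeddings once, and train only the small sequential head — can be sketched as follows. The `encoder` here is a stand-in linear layer for the meta-finetuned Clinical XLNet, and all dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Stand-in for the meta-finetuned Clinical XLNet note encoder.
encoder = nn.Linear(32, 768)
for param in encoder.parameters():
    param.requires_grad = False              # freeze the representation module

notes = torch.randn(10, 32)                  # one patient's 10 (toy) notes
with torch.no_grad():
    static_embs = encoder(notes)             # fixed {E_1, ..., E_i}, computed once

# Only the small sequential head receives gradients during finetuning,
# so per-note encoder activations never need to sit in GPU memory.
head = nn.LSTM(768, 128, batch_first=True, bidirectional=True)
out, _ = head(static_embs.unsqueeze(0))      # (1, n_notes, 2 * hidden)

head_trainable = sum(p.numel() for p in head.parameters() if p.requires_grad)
enc_trainable = sum(p.numel() for p in encoder.parameters() if p.requires_grad)
print(out.shape, head_trainable, enc_trainable)
```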

Experiment
To evaluate our model, we examine its prediction performance under a realistic setup. We use the first 48 hours of clinical notes, starting from the initial mechanical ventilation event, to predict two variables: mechanical ventilation longer than 7 days and 90-day mortality. Hyperparameters. For the data split, we first obtain a 10% holdout test set. We then generate different 8:1 train:validation splits using different random seeds to examine the robustness of model performance.
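The split scheme can be sketched as follows: a fixed 10% holdout test set, then repeated 8:1 train/validation splits of the remainder under different seeds. The patient IDs are synthetic, and interpreting "8:1" as a 1/9 validation fraction is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
patients = rng.permutation(7287)             # synthetic IDs; cohort size from the paper

n_test = len(patients) // 10                 # fixed 10% holdout test set
test_ids, trainval = patients[:n_test], patients[n_test:]

splits = []
for seed in (0, 1, 2):                       # three independent 8:1 splits
    shuffled = np.random.default_rng(seed).permutation(trainval)
    n_val = len(shuffled) // 9               # 8:1 -> validation is 1/9 of trainval
    splits.append((shuffled[n_val:], shuffled[:n_val]))
```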
For pre-training, we further pre-train the XLNet embedding for another 200K steps with batch size 16. For meta fine-tuning, we use batch size 32 with learning rate 1e-5 for four epochs, with early stopping on the validation area under the receiver operating characteristic curve (AUROC) score. For fine-tuning, we use a two-layer Bi-LSTM module with batch size 128 and learning rate 1e-4. The pre-training and meta fine-tuning were conducted on a server with 2 Intel Xeon E5-2670v2 2.5 GHz CPUs, 128 GB RAM, and 2 NVIDIA Tesla P40 GPUs.
Evaluation Metrics. We use the area under the receiver operating characteristic curve (AUROC) for evaluation, a standard metric in the clinical informatics domain.
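The evaluation protocol — AUROC per split, then mean and standard deviation across the independent splits — can be sketched as follows. The labels and scores here are synthetic; this only illustrates how numbers of the form reported in Table 2 would be computed.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
aurocs = []
for seed in range(3):                        # one AUROC per independent split
    y_true = rng.integers(0, 2, size=200)    # synthetic binary labels
    # Synthetic scores loosely correlated with the labels.
    y_prob = y_true * 0.3 + rng.random(200) * 0.7
    aurocs.append(roc_auc_score(y_true, y_prob))

print(f"AUROC {np.mean(aurocs):.3f} (+/- {np.std(aurocs):.3f})")
```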
Baselines. We conduct a thorough set of experiments against popular baselines: 1. LSTM [13] is the classic language modeling method that uses long-term document memory. 2. LSTM + Attention adds an attention layer on top of the sequence output of the LSTM hidden layers. 3. Hierarchical Attention Networks (HAN) [13] is a hierarchical LSTM designed specifically for document-level text classification. 4. Recurrent Convolutional Neural Network (RCNN) [26] uses a recurrent structure on top of the classic CNN network to capture as much contextual information as possible. 5. BERT [6] uses transformer-based bidirectional contextual representations learned from a massive pre-training dataset and is later finetuned using a task-specific signal. 6. XLNet [7] follows a similar pre-training and fine-tuning scheme as BERT; however, XLNet uses permutation language modeling with a Transformer-XL network backbone. 7. ClinicalBERT [8; 9] further pre-trains BERT on the MIMIC-III notes dataset. 8. Clinical XLNet-mean is an ablation that replaces the bidirectional LSTM layers with the commonly used heuristic of averaging the outputs from the sequence of notes.
Note that for BERT, XLNet, and ClinicalBERT, we attach a bidirectional LSTM layer on top of each model to leverage the sequential dimension of the notes, and for ClinicalBERT we pre-train on the same corpus as Clinical XLNet. These steps ensure a fair comparison of note representation power.

Discussion
In this work, we propose a method for predicting prognosis based only on contextual information available from clinician notes. The proposed method builds on recent advancements in the field of NLP, so we compare it with other recently proposed NLP methods for a fair comparison. We perform one ablation study to show the relevance of the proposed sequential modeling of the note embeddings in the time domain, and we compare against several state-of-the-art baselines that have been used extensively in the natural language domain as well as in the clinical context. Clinical Relevance. Our work provides timely aid in clinical decision making. For a patient under ICU observation, the clinical team could start the evaluation for a tracheotomy procedure as soon as 48 hours after mechanical ventilation begins. The 48-hour time period was chosen in consultation with a team of clinicians. Furthermore, predictive analytics on prolonged mechanical ventilation of seven days or more is important for clinicians in making the tracheotomy decision. In addition, doctors could reduce the risk of a burdensome procedure and treatment by considering the patient's 90-day mortality prediction. This approach assists patients and their families by providing more time to process and make a major decision.
Model Efficiency. The proposed method uses XLNet [7], which uses Transformer-XL [24] as the base architecture to extract embeddings from the notes. Since every note requires its own embedding, we run the base architecture multiple times, with the number of runs proportional to the number of notes. Therefore, obtaining embeddings for the whole sequence of notes is computationally expensive during both training and inference. However, there is a recent line of work [27] that allows executing transformer-based models at a much lower computational cost.
Limitations and Future Work. Our proposed method only mines task-relevant information from clinical notes written by nurses and respiratory therapists. However, one could utilize other sources of data as well, such as structured data. While structured data are commonly used in prognostic models, our preliminary study showed that they did not improve performance by any significant margin. One future direction is to explore a novel architecture design that could utilize both sources of information to improve performance further. Another is to explore ways of combining multiple sources of clinical notes, such as physician notes, admission notes, and discharge notes.

Figure 1: Clinical XLNet framework. A. We first pre-train the XLNet embedding on the MIMIC-III clinical notes dataset using Permutation Language Modeling. After pre-training, given a clinical note, the model outputs a numerical vector used as the note representation. B. To alleviate the computational burden of end-to-end training, the meta-finetuning stage uses the supervised signal to further tune the pre-trained network with individual notes N_i as input. The meta-finetuned network then generates static task-specific note representations. C. Given a sequence of a patient's notes {N_1, ..., N_i}, the meta-finetuned Clinical XLNet generates a sequence of note representations {E_1, ..., E_i}. The ordered representation sequence is then fed into a bidirectional LSTM layer, which outputs a fixed-size latent vector H_N representing the entire sequence. H_N is finally fed into a predictor neural network to generate a probability p measuring the likelihood of the target variable.

Table 1: Cohort statistics. For continuous variables, the mean and standard deviation are reported. For categorical variables, counts are given with percentages.
The outputs L_M, L_P ∈ [0, 1] are probabilities measuring the likelihood of mortality and PMV respectively. Pretraining Clinical XLNet. The text representation generated by large pre-training models depends on the corpus they are pre-trained on. XLNet is pre-trained on common language corpora such as BookCorpus, Wikipedia, and Common Crawl. However, these corpora differ from clinical notes, which are filled with jargon, abbreviations, and difficult syntax and grammar. Hence, to learn an effective representation of clinical notes, we further pre-train XLNet on a clinical corpus. Specifically, we use the note types of interest (nursing, nursing/others, and respiratory therapy) in the MIMIC-III dataset. The clinical notes used in pre-training are NOT in the holdout test set, to avoid biased results. XLNet is a stack of Transformer-XL encoders [24]. For pre-training, it uses Permutation Language Modeling (PLM) to tackle the [MASK]-token information gap between pre-training and fine-tuning in BERT [6].

Table 2: Prediction results over three independent data splits (mean and standard deviation reported). The left panel shows the PMV prediction results and the right panel the 90-day mortality prediction results.

Results. Table 2 reports the results for our prolonged mechanical ventilation and 90-day mortality tasks. Clinical XLNet achieves the best results, with AUROC scores of 0.663 (± 0.011) for PMV and 0.779 (± 0.006) for 90-day mortality. The gap between the clinically pre-trained models (Clinical XLNet and Clinical BERT) and the models without clinical pre-training (BERT and XLNet) demonstrates the necessity of pre-training on a domain-specific corpus. The gap between Clinical BERT and Clinical XLNet shows that Clinical XLNet yields a better note representation. The gap between Clinical XLNet and Clinical XLNet-mean shows the benefit of sequentially modeling the temporal dimension of notes.