Incorporating Risk Factor Embeddings in Pre-trained Transformers Improves Sentiment Prediction in Psychiatric Discharge Summaries

Reducing rates of early hospital readmission has been recognized as key to improving quality of care and reducing costs. A number of risk factors have been hypothesized to be important for understanding readmission risk, including problems with substance abuse, the ability to maintain work, and relationships with family. In this work, we develop RoBERTa-based models to predict the sentiment of sentences describing readmission risk factors in discharge summaries of patients with psychosis. We improve substantially on previous results with a scheme that shares information across risk factors while also allowing the model to learn risk-factor-specific information.


Introduction
About 1 in 5 Medicare patients discharged from the hospital is rehospitalized within 30 days (Jencks et al., 2009). Four of the ten conditions with the largest numbers of readmissions among Medicaid enrollees were mental health conditions or substance use disorders (Hines et al., 2014). Readmissions are harmful both because they are disruptive to patients and families and because they are a major driver of health-care costs in psychiatry (Wu et al., 2005; Mangalore and Knapp, 2007). Premature discharge of patients also contributes not only to rehospitalization but to increased risk of homelessness and the possibility of violent behavior or suicide. Thus, reducing rates of early hospital readmission has been recognized as key to improving quality of care and reducing costs.
A number of risk factors compiled from the literature in previous work (Holderness et al., 2018) have been hypothesized to be important for understanding readmission risk, including problems with substance abuse, the ability to maintain work, and relationships with family. Studying readmission through the use of explicit risk factors may go against prevailing trends in machine learning (black-box models that take in raw signals), but it has at least two potential benefits: 1) it allows for descriptive study of, and thus better understanding of, the factors that are important for readmission; and 2) a risk classifier that uses explicit risk factors may be more interpretable and trustworthy to providers, and thus more likely to be put into practical use.
In this work, we make use of a publicly available dataset of sentences from discharge summaries, annotated for seven risk factors along with the "sentiment" of each sentence, marked as Positive, Negative, or Neutral (more details in Background). Previous work on this dataset showed that the task was approachable with neural methods, but performance suffered because of the small dataset size (Holderness et al., 2019). Here, we address this issue with two advances. First, we show that transfer learning helps dramatically. While the previous work used smaller neural models trained from scratch, we start from pre-trained transformer models (RoBERTa (Liu et al., 2019)) and improve them for the task with architectural and data augmentation strategies.
Second, we show that sharing data between the different risk factor domains is better for performance than training separately, at least with pretrained models. Previous work trained separate classifiers for each of the risk factor domains, due to the (legitimate) belief that there are significant differences in predicting sentiment in different risk factor domains (Holderness et al., 2019). Here, we demonstrate that a method that allows for sharing of information between different risk factor domains can improve performance. Because of a mismatch in the way training and test data were annotated, we introduce a data augmentation method for the training set that allows our method to improve over the vanilla RoBERTa baseline. The improvements from this latter technique point in a direction that could allow for further improvements even without additional training data.

Background
The dataset we use was created in previous work (Holderness et al., 2018) and made publicly available. It consists of sentences extracted from discharge summaries of patients with psychosis at Boston-area hospitals. Previous work had examined two tasks on this dataset, the identification of the risk domain represented by the sentence (Appearance, Mood, Interpersonal, Substance Use, Occupation, Thought Content and Thought Process), and the sentiment of the sentence given the risk domain. In the training data, each sentence has only one risk factor, and thus one sentiment, but the test data allows for sentences to have multiple risk factors, and thus multiple different sentiments for a single sentence, conditioned on the risk factor domain. Both the previous work for risk factor classification (Holderness et al., 2018) and sentiment classification (Holderness et al., 2019) used multilayer neural networks in their experiments and found them to be the best-performing.
Despite the promise of the above work, the tasks are still unsolved. Recent work in contextualized embeddings has shown great success on sentence classification tasks. Specifically, BERT-based models (Devlin et al., 2019), based on the transformer architecture (Vaswani et al., 2017), showed that pre-training deep transformer encoders on massive text datasets with a language modeling objective can lead to improvements on a variety of tasks. The best performance is typically obtained by fine-tuning, where a classifier head is attached to a special sentence token and new tasks are learned via standard supervised learning, in which the weights of the classifier head are trained from scratch while the weights of the transformer encoder are allowed to update. In this work, we make use of the RoBERTa updates to BERT (Liu et al., 2019), which use the same architecture but are pre-trained on a larger dataset and for a longer time.

Data
The training dataset contains 3500 sentence-length texts, 500 from each of the seven readmission risk factors mentioned above. The test dataset contains 1650 texts, which can involve multiple risk factors and are more variable in length compared with the training data, as described in the previous study (Holderness et al., 2019). We divided the training instances into training and development sets with an 80%/20% split, leaving us with 2800 training instances and 700 development instances.
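The 80%/20% split can be reproduced with a simple shuffled partition; a minimal sketch (the seed and the placeholder sentence list are illustrative assumptions, not the paper's exact procedure):

```python
import random

def train_dev_split(instances, dev_frac=0.2, seed=0):
    """Shuffle and partition instances into (train, dev) sets."""
    rng = random.Random(seed)
    shuffled = instances[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_frac)
    return shuffled[n_dev:], shuffled[:n_dev]

# 3500 annotated sentences -> 2800 train / 700 dev
corpus = [f"sent_{i}" for i in range(3500)]
train, dev = train_dev_split(corpus)
```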
Since we are focusing on sentiment prediction in this work, we take the risk factor domains as given, and for test sentences with multiple risk factor domains, we create multiple instances in which the input pairs the sentence text with each domain and the target is the gold sentiment label for that domain. This results in 2103 test instances. 750 of these are labeled as the Other domain, which has no training instances or reported results, so we discard these, leaving 1353 test instances.
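This expansion of the test set can be sketched as follows; the record structure and field names are assumptions for illustration, not the dataset's actual schema:

```python
def expand_test_instances(test_records, skip_domains=("Other",)):
    """Pair each test sentence with each of its annotated risk factor
    domains, yielding one (text, domain, gold sentiment) instance per
    domain; domains in skip_domains are discarded."""
    instances = []
    for rec in test_records:
        for domain, sentiment in rec["labels"].items():
            if domain not in skip_domains:
                instances.append((rec["text"], domain, sentiment))
    return instances

# A sentence annotated with two domains becomes two instances;
# the unreported "Other" domain is dropped.
records = [{"text": "Mood improved; attends AA meetings.",
            "labels": {"Mood": "Positive", "Other": "Neutral"}}]
instances = expand_test_instances(records)
```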

Methods
We developed several variations of a risk factor sentiment classifier based on the RoBERTa architecture.

Baseline
We fine-tune seven independent RoBERTa models as the baseline, one for each of the risk factor domains. This follows previous work (Holderness et al., 2019), which suggested that positive or negative clinical sentiments might differ in different risk factor domains.

Plain RoBERTa
Instead of fine-tuning seven independent models, we fine-tune one RoBERTa model on all training texts to learn the shared representation of sentiments, ignoring the risk factor domain of the sentences during training and only learning sentiment labels. Since the test set allows for a sentence to have multiple risk factor domains, but this model can only make one sentiment prediction per sentence, the model is penalized on cases where a sentence has multiple risk factor domains with different sentiments.

Risk Factor Domain Embeddings
In this method, we modify the input representation to contain both the input sentence and the risk factor domain to be classified. In BERT-style models, this means using the special sentence-separating token ([SEP]) between the sentence tokens and the domain tokens. For example, the first sentence in Figure 1 would be represented as [CLS] has been off cocaine and etoh x 3 wks per report. [SEP] substance use [SEP]. Giving the domain as the second sentence allows the model to potentially learn what part of the input the classifier should focus on, and previous work has shown similar methods to be effective (Shi and Lin, 2019). In contrast to another possible approach, a separate input stream for the domain feature, this method allows us to utilize the pre-trained contextual embeddings of the domain words and lets the model use its attention mechanism to relate the risk factor domain to the corresponding part of the sentence and make sentiment predictions about only that part.

Figure 1: One example of the augmented sentences. Two randomly chosen sentences from the training data are concatenated to create two training instances with different risk factor domains.
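A minimal sketch of this input construction, rendered in the paper's [CLS]/[SEP] notation; in practice a RoBERTa tokenizer, given the domain string as the second segment, would insert its own special tokens (&lt;s&gt;, &lt;/s&gt;) instead:

```python
def to_paired_input(sentence, domain):
    """Render a sentence/domain pair in BERT-style notation; a real
    tokenizer receiving the domain as the second segment would add
    model-specific special tokens rather than these literals."""
    return f"[CLS] {sentence} [SEP] {domain.lower()} [SEP]"

example = to_paired_input(
    "has been off cocaine and etoh x 3 wks per report.", "Substance Use")
```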

Data Augmentation
The training data has only one risk factor domain and sentiment annotation per sentence, so the above approach alone may fail on the test data: because of the way the training data was constructed, the whole sentence is always relevant, and the model may never need to use the domain embedding at training time. When applied to the test data, the risk factor domain embedding could be useful, but the model has never been trained to use it. Therefore, we developed a data augmentation scheme that creates new synthetic training instances from pairs of existing instances, such that these synthetic instances more closely resemble test instances with multiple risk factor domains and sentiment labels. We concatenate two randomly sampled sentences in the training data from two different domains, and create two training instances from the concatenated input, with the two different domain and sentiment labels from the original instances. We add a period and space between sentences missing them so that the result looks more natural. Figure 1 shows an example of the augmented sentences. These synthetic instances surely lack the discourse coherence that real instances in the test set have, but our hypothesis is that they will at least force the model to associate domain embeddings with specific parts of the input at training time, rather than just learning the sentiment of the whole input sequence.
We created 2800 new training instances with this sampling procedure (1400 unique texts), for an augmented training data size of 5600 instances (this method is used only for augmenting the training data and is not applied to the test data). We then fine-tune the model from the previous section on this augmented dataset, using the risk factor domain as the second sentence as above.
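The augmentation procedure can be sketched as below; the dictionary-based instance format and the toy sentences are illustrative assumptions. Calling it with n_pairs=1400 would reproduce the paper's 2800 synthetic instances from 1400 unique texts:

```python
import random

def augment(train, n_pairs, seed=0):
    """Create synthetic instances: concatenate two sentences drawn from
    different risk factor domains; each concatenation yields two training
    instances, one per original (domain, sentiment) label."""
    rng = random.Random(seed)
    synthetic = []
    while len(synthetic) < 2 * n_pairs:
        a, b = rng.sample(train, 2)
        if a["domain"] == b["domain"]:
            continue                      # resample until domains differ
        text = a["text"].rstrip()
        if not text.endswith("."):
            text += "."                   # insert the missing period
        text += " " + b["text"]
        for src in (a, b):
            synthetic.append({"text": text,
                              "domain": src["domain"],
                              "sentiment": src["sentiment"]})
    return synthetic

toy_train = [
    {"text": "Sleeping well", "domain": "Mood", "sentiment": "Positive"},
    {"text": "Lost her job.", "domain": "Occupation", "sentiment": "Negative"},
    {"text": "Denies etoh use.", "domain": "Substance Use", "sentiment": "Positive"},
]
synthetic = augment(toy_train, n_pairs=2)
```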

Training Details
We use the HuggingFace Transformers library (Wolf et al., 2019) for our RoBERTa implementations. We use grid search to find the optimal hyperparameters (batch size, learning rate, and maximum sequence length) for the RoBERTa fine-tuning process. We monitor the training and validation loss for each training epoch and save the model with the highest Macro-F1 score on the development set before testing on the test set. We use pandas for data processing (Wes McKinney, 2010) and scikit-learn for model evaluation (Pedregosa et al., 2011).
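A grid search of this kind can be sketched as follows. The grid values are illustrative assumptions (the paper does not report the search space), and fine_tune_and_eval is a hypothetical stand-in for one fine-tuning run that returns the development-set Macro-F1:

```python
from itertools import product

# Hypothetical search space; not the paper's actual grid.
GRID = {"batch_size": [16, 32],
        "learning_rate": [1e-5, 2e-5, 3e-5],
        "max_seq_length": [128, 256]}

def grid_search(fine_tune_and_eval, grid=GRID):
    """Exhaustively try every hyperparameter combination and keep the
    configuration with the best development score."""
    best_score, best_cfg = float("-inf"), None
    for values in product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        score = fine_tune_and_eval(**cfg)
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score
```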

Evaluation
We evaluated the four different architectures on this task to explore the importance of sharing information between different risk factor domains as well as learning domain-specific information.
We first evaluate at the instance level, computing precision and recall for each sentiment label and combining them into an F1 score per sentiment label, as well as overall accuracy (the proportion of instances correctly predicted, regardless of sentiment or risk factor domain). Table 1 summarizes the accuracy and F1 results, with Macro F1 also reported as the average across sentiment labels.

Table 2: Results of RoBERTa and RoBERTa fine-tuned on the augmented dataset (RoBERTa+D+Aug), with the highest score on each performance metric in bold. The top row gives the reported results of the "Fully Supervised MLP" system of Holderness et al. (2019). "Average" scores are computed by taking the average of the 7 risk factor domains in the same column.

Table 3 (excerpt): Example input text, with domain, gold label, and the RoBERTa and RoBERTa+D+Aug predictions. Example A: "Pt.'s affect appeared slightly brighter, but remains flat at baseline. Her mood appears slightly improved as well." Domain: Mood; GOLD: Positive.

Table 2 shows detailed results of the plain RoBERTa and the augmented (RoBERTa+D+Aug) models, including precision and recall, broken down by risk factor domain. The average across risk factor domains is substantially higher than the results reported in previous work. The overall best performance was obtained by using domain embeddings with data augmentation (RoBERTa+D+Aug). Although fine-tuning seven independent RoBERTa models, one per risk factor domain, is the worst-performing approach, its baseline scores are still higher than those of the previous study, which did not use pre-trained models (Holderness et al., 2019). Plain RoBERTa is surprisingly strong on this task despite the fact that it ignores the risk factor domain and is forced to make the same prediction for texts with multiple sentiment labels.
The improvement obtained by fine-tuning one single RoBERTa model instead of seven independent models suggests that the benefits of sharing information between risk factor domains outweigh the potential risks that the model will learn conflicting information. However, simply using domain as the second sentence during RoBERTa fine-tuning does not lead to improvement in the overall model performance. Augmenting the training data to look more like the test data was necessary in order for the domain embedding input to show benefits.
We tested the significance of the improvements in accuracy and Macro-F1 between RoBERTa and RoBERTa+D+Aug by fine-tuning with 20 randomly selected seeds and performing a one-tailed t-test; the results were found to be significant (p<0.05 for accuracy and p<0.005 for Macro-F1).
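A comparison over seeds like this can be approximated with a one-sided permutation test, a stdlib-only alternative to the t-test used here; the two score lists below are illustrative, not the paper's actual per-seed results:

```python
import random
from statistics import mean

def permutation_test(scores_a, scores_b, n_perm=5000, seed=0):
    """One-sided test of mean(scores_a) > mean(scores_b): shuffle the
    pooled scores and count how often a random split matches or beats
    the observed difference in means."""
    rng = random.Random(seed)
    observed = mean(scores_a) - mean(scores_b)
    pooled = list(scores_a) + list(scores_b)
    k = len(scores_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mean(pooled[:k]) - mean(pooled[k:]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)   # add-one smoothing on the count

# Illustrative accuracies over 20 seeds (NOT the paper's numbers).
aug_scores = [0.90, 0.91, 0.89, 0.92, 0.90] * 4
plain_scores = [0.85, 0.86, 0.84, 0.85, 0.86] * 4
p_value = permutation_test(aug_scores, plain_scores)
```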

Discussion and Conclusion
We selected instances from the test set where the system trained on augmented data (RoBERTa+D+Aug) did well, and others where it did not (see Table 3). Example A shows that the system is able to distinguish between the positive mood and the neutral appearance, where plain RoBERTa was forced to select a single sentiment label. Example B shows where RoBERTa+D+Aug still makes mistakes: plain RoBERTa actually does better by picking the single sentiment that fits the sentence best, while RoBERTa+D+Aug tries to pick two different sentiments and gets one wrong. In fact, in 66.5% of the 221 test sentences with multiple risk factor domains, the sentiment labels are the same, which means plain RoBERTa is usually not penalized for picking a single sentiment label. Example C shows a case where both models make errors, probably due to missing the complex inference that the patient has a substance abuse issue that requires treatment (MI = motivational interviewing).
Overall, our new approach shows major gains in performance over the existing state of the art for this problem. The biggest gains come from simply using large pre-trained models. However, the modified architecture and data augmentation technique lead to further gains, and give the model the ability to separate out multiple sentiments for a single sentence on new data. Future work may see larger benefits from methods for creating augmented training data that produce more natural-looking sentence pairs. The source code used to fine-tune the models will be made publicly available.