IBMResearch at MEDIQA 2021: Toward Improving Factual Correctness of Radiology Report Abstractive Summarization

Although recent advances in abstractive summarization systems have achieved high scores on standard natural language metrics like ROUGE, their lack of factual consistency remains an open challenge for their use in sensitive real-world settings such as clinical practice. In this work, we propose a novel approach to improve the factual correctness of a summarization system by re-ranking the candidate summaries based on a factual vector of the summary. We applied this process during our participation in MEDIQA 2021 Task 3: Radiology Report Summarization, where the task is to generate an impression summary of a radiology report, given findings and background as inputs. In our system, we first used a transformer-based encoder-decoder model to generate the top N candidate impression summaries for a report, then trained another transformer-based model to predict a 14-observations-vector of the impression based on the findings and background of the report, and finally utilized this vector to re-rank the candidate summaries. We also employed a source-specific ensembling technique to accommodate the distinct writing styles of different radiology report sources. Our approach placed 2nd in the challenge.


Introduction
The radiology report is a crucial instrument in patient care and an essential part of every radiological procedure, serving as the official interpretation of a radiological study and the primary means of communication between the radiologist and referring physician. According to the American College of Radiology, a radiology report should contain certain components, such as relevant clinical information, imaging findings, limitations of the study, and an impression or conclusion (American College of Radiology, 2020). Of these, the impression is the most important component of the radiology report, containing conclusions based on the pertinent findings and suggestions for additional diagnostic studies if warranted (Wallis and McCoubrie, 2011). Previous studies have shown that oftentimes it is the only part of the report that is read; one previous study found that 43% of referring physicians only read the impression if the report was longer than one page (Clinger et al., 1988), while another study found that 23.1% of clinicians agreed with the statement "I usually only read the conclusion of a radiology report" (Bosmans et al., 2011).
In an effort to support radiologists in writing impressions in radiology reports, Zhang et al. (2018) introduced the task of automatic generation of radiology impression statements by summarizing textual findings written by radiologists. MEDIQA 2021 (Asma Ben Abacha, 2021), as part of NAACL-BioNLP 2021 workshop, aims to further research efforts in summarization in the medical domain. Task 3 of the challenge, Radiology Report Summarization (RRS), focuses specifically on radiology impression generation. The basic task setup is as follows: given the findings and background sections of a radiology report, predict the impression or summary.
In this paper, we detail our participation in the MEDIQA 2021 RRS challenge. We developed an approach that utilizes a structured label vector of the impression (predicted using the findings and background of the report) as a proxy for the facts in the impression, to re-rank the abstractive summaries generated by a trained encoder-decoder model. We further employed a source-specific ensembling technique, utilizing models fine-tuned to each radiology report source, to accommodate the distinct language patterns in each source. Our system performed well in the challenge, placing us 2nd on the leaderboard.

Related Work
Abstractive Summarization Systems. Abstractive text summarization has been intensively studied in recent literature. Rush et al. (2015) introduced an attention-based sequence-to-sequence (seq2seq) model for abstractive sentence summarization. Recent models (e.g., Zhang et al. (2020)) employ pretraining techniques like denoising or the Gap Sentence Generation task to help generation tasks, including summarization. However, few domain-specific versions of these state-of-the-art models exist. Other works, such as Liu and Lapata (2019) and Rothe et al. (2020), have demonstrated the effectiveness of initializing encoder-decoder models from pre-trained encoder-only models, such as BERT (Devlin et al., 2018) and RoBERTa, for seq2seq tasks, providing competitive results in summarization. Our work builds on these findings and utilizes a pre-existing domain-specific pretrained transformer model in an encoder-decoder setting for our summarization task.
Summarization and Factual Correctness in Radiology Reports. Zhang et al. (2018) first studied the problem of automatic generation of radiology impressions by summarizing textual radiology findings, and showed that an augmented pointer-generator model achieves high overlap with human references. They also found that about 30% of the radiology summaries generated by neural models contain factual errors. Other researchers integrated the RadLex ontology into seq2seq models (MacAvaney et al., 2019) to enhance the clinical validity of automated impression prediction systems within radiology workflows. In subsequent work, Zhang et al. (2019b) improved upon the problem of factual correctness in radiology reports by optimizing fact scores defined on radiology reports with reinforcement learning methods. They also introduced a new metric, Factual F1, which compares the predicted summaries using a descriptor vector of the gold summary. In our work, we extend the ideas put forward by Zhang et al. (2019b) by utilizing a descriptor vector (generated using off-the-shelf systems like CheXpert (Irvin et al., 2019) or CheXbert (Smit et al., 2020)) to re-rank the automatically generated summaries.

Task Description and Dataset
The MEDIQA-2021 RRS task is defined as follows: given a passage of findings represented as a sequence of tokens x = {x_1, x_2, ..., x_N}, with N being the length of the findings, and a passage of background represented as a sequence of tokens y = {y_1, y_2, ..., y_M}, with M being the length of the background, find a sequence of tokens z = {z_1, z_2, ..., z_L} that best summarizes the salient and clinically significant findings in x, with L being an arbitrary length of the impression or summary 1. Datasets for training and validation of summarization models provided by the MEDIQA organizers consisted of radiology reports with findings, background, and impression sections. The training set consists of 91,544 radiology reports from the MIMIC-CXR database (Johnson et al., 2019), while the validation set consists of an additional 4,000 radiology reports: 2,000 from MIMIC-CXR and 2,000 from the Indiana Network for Patient Care (Indiana) (Demner-Fushman et al., 2016). As part of the shared task rules, the rest of the publicly available MIMIC-CXR and Indiana radiology reports were not allowed for use in training or validation. However, the organizers allowed the use of validation data for training. At the conclusion of the shared task, to evaluate participant systems, a test set of 600 radiology reports containing only findings and background sections was released, with sources unknown at the time of the challenge. Dataset statistics are presented in Table 1.

Table 1: Dataset statistics by source.
Type         MIMIC-CXR   Indiana   Total
Training     91,544      0         91,544
Validation   2,000       2,000     4,000
Test         ?           ?         600

System Description
Our system is a three-step process in which we (1) utilize pre-trained transformer-based language models in an encoder-decoder setting to get our base summarization models, (2) improve the factual correctness of our base models' predictions by incorporating a re-ranking methodology, and (3) utilize a source-specific ensembling technique which identifies the source of a radiology report, and chooses the prediction of the best performing source-specific model accordingly. We detail the above three steps in the following sections.

Base Models
Previous work by Liu and Lapata (2019) and Rothe et al. (2020) has demonstrated the effectiveness of initializing encoder-decoder models from pre-trained encoder-only models, such as BERT and RoBERTa, for seq2seq tasks. Inspired by this work, we experimented with pre-trained transformer models used as both encoder and decoder, with parameters shared between the two. Using this setup, we experimented with RoBERTa-large, which showed promising results in Rothe et al. (2020), and BioMed-RoBERTa-base, a domain-specific version of RoBERTa that is publicly available 2 from AllenNLP (Gururangan et al., 2020), and fine-tuned both models using the training set of 91,544 MIMIC-CXR reports. Of the two models, BioMed-RoBERTa-base achieved better results and was therefore used as our initial model for subsequent experiments.
Next, we conducted experiments to evaluate the performance of this initial model on different radiology report sources. As the provided training and validation data contain two sources, MIMIC-CXR and Indiana, each with its own distinct language (more details in Section 4.3), and the official test data could come from any source, we developed two more base models. Starting from the initial BioMed-RoBERTa-base model fine-tuned on the MIMIC-CXR training set, we further fine-tuned it in two settings: (1) with a subset of reports from the Indiana validation dataset, and (2) with a subset of reports from the Indiana and MIMIC-CXR validation datasets.
Our end result is three base models tuned for three source categories:
• BioRoBERTa (M): BioMed-RoBERTa-base fine-tuned on MIMIC-CXR training data. This is our base model for the MIMIC-CXR source.
• BioRoBERTa (M+I): BioRoBERTa (M) further fine-tuned on Indiana validation data. This is our base model for the Indiana source.
• BioRoBERTa (M+M+I): BioRoBERTa (M) further fine-tuned on both MIMIC-CXR and Indiana validation data. This is our base model for unknown sources.

Fact-Aware Re-ranking (FAR)
Previous works on extracting structured labels from free-text radiology reports have identified 14 observations based on clinical relevance and their prevalence in the reports, and have developed automated systems to predict a 14-observations-vector for an impression summary of a radiology report (Irvin et al., 2019; Smit et al., 2020). The 14 observations are: "Atelectasis", "Cardiomegaly", "Consolidation", "Edema", "Enlarged Cardiomediastinum", "Fracture", "Lung Opacity", "Lung Lesion", "No Finding", "Pneumonia", "Pneumothorax", "Pleural Effusion", "Pleural Other", and "Support Devices". "Pneumonia", despite being a clinical diagnosis, was included as a label in order to represent the images that suggested primary infection as the diagnosis. The first 13 observations (excluding "No Finding") take on one of the following classes: blank, positive, negative, and uncertain. The 14th observation, "No Finding", is intended to capture the absence of all pathologies, and takes on only one of the two following classes: blank or positive.
Utilizing this 14-observations-vector, we developed an approach to improve the factual correctness of our base models by incorporating a factual re-ranking component that re-ranks the N highest scoring summaries predicted from a base model. We achieve this in the following steps: we (1) first fine-tune a transformer-based language model to predict the 14-observations-vector of the impression given the findings and background of a radiology report, (2) obtain the top N highest scoring candidate summaries predicted from our base encoder-decoder model, (3) use CheXbert to obtain the 14-observations-vector for each of the N candidate summaries, and (4) use a similarity function between the predicted 14-observations-vector for the impression (obtained in step 1) and each vector for the N candidate summaries (obtained in step 3) to re-rank these summaries. Finally, we use the highest-similarity candidate summary as our impression summary. We detail our impression 14-observations-vector prediction and our similarity function in the following sections.
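Steps (1)-(4) can be sketched as follows. This is a minimal sketch, not the authors' code: `predict_impression_vector`, `chexbert_label`, and `similarity` are hypothetical stand-ins for the fine-tuned 14-observations predictor, the CheXbert labeler, and the similarity function described in the following sections.

```python
def far_rerank(findings, background, candidates,
               predict_impression_vector, chexbert_label, similarity):
    """Return the candidate summary whose CheXbert observation vector
    best matches the vector predicted from the findings and background."""
    # Step (1): predict the target 14-observations-vector from the inputs.
    target = predict_impression_vector(findings, background)
    # Steps (3)-(4): label each top-N candidate (step 2 produced the
    # candidate list) and score it against the target vector.
    scored = [(similarity(target, chexbert_label(c)), c) for c in candidates]
    # Pick the highest-similarity candidate as the final impression.
    return max(scored, key=lambda pair: pair[0])[1]
```

Note that the base model's own ranking is used only to produce the candidate pool; the final choice is driven entirely by the factual similarity score.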
We apply our FAR methodology on the three base models introduced in section 4.1 to get our three source-specific models, and denote the new models as BioRoBERTa (M),FAR , BioRoBERTa (M+I),FAR , and BioRoBERTa (M+M+I),FAR , respectively.

Table 2: Example radiology reports from each source, with columns for source, findings, background, and impression.

Impression 14-Observations-Vector Prediction
We utilize the 14-observations-vector representation of the impression section of a radiology report, as predicted by CheXbert, as our ground truth label in a prediction task that takes the findings and background sections of the report as inputs. In this process, for each given radiology report that has findings, background, and impression sections, we (1) first utilize CheXbert to obtain the 14-observations-vector representation of the impression section, (2) convert the multiple values of each of the 14 observations to be binary (i.e., presence or absence of the observation) 3, and (3) train a transformer-based language model using the findings and background (concatenated) as input to predict the 14-observations-vector of the impression section.

Similarity Function
Among the 14 observation categories predicted by CheXbert, "No Finding" is intended to capture the absence of all pathologies, i.e., if "No Finding" is positive then all other observations must be negative. Therefore, we constructed our similarity function as follows: (1) if "No Finding" does not match, we assign a similarity score of 0; (2) if "No Finding" matches, the similarity score is the cosine similarity between the remaining entries of the two vectors, representing the 13 other observations.
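The gated cosine similarity described above can be sketched as follows. This is a minimal sketch under our own assumptions: binary vectors of length 14 with "No Finding" at index 0, and a convention (not stated in the text) for the degenerate all-zero case.

```python
import math

def fact_similarity(pred_vec, cand_vec, no_finding_idx=0):
    """Cosine similarity over the 13 pathology observations, gated on
    "No Finding": a mismatched "No Finding" yields a score of 0."""
    if pred_vec[no_finding_idx] != cand_vec[no_finding_idx]:
        return 0.0
    # Drop the "No Finding" entry and compare the remaining 13 values.
    a = [v for i, v in enumerate(pred_vec) if i != no_finding_idx]
    b = [v for i, v in enumerate(cand_vec) if i != no_finding_idx]
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        # All-zero pathology vector (assumption): identical vectors match.
        return 1.0 if a == b else 0.0
    return dot / (na * nb)
```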
3 For 13 of the observations, CheXbert outputs one of the following classes: blank, positive, negative, and uncertain. For the 14th observation, "No Finding", the labeler outputs only one of the two following classes: blank or positive. We convert uncertain to positive and blank to negative to get a binary positive/negative output for all 14 observations.
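The class-to-binary conversion described in the footnote can be sketched as (function names are ours, not from the paper):

```python
def binarize(chexbert_class):
    """Collapse a CheXbert class to binary presence: uncertain maps to
    positive (1) and blank maps to negative (0), per the footnote."""
    return 1 if chexbert_class in ("positive", "uncertain") else 0

def binarize_vector(classes):
    # `classes` is the list of 14 CheXbert class strings for one report.
    return [binarize(c) for c in classes]
```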

Source-specific Ensemble
We observed in the provided training and validation data that MIMIC-CXR and Indiana reports use distinctly different language when expressing findings, background, and impression, even when the conveyed content is very similar. As shown in Table 2, although both the MIMIC-CXR report and Indiana report convey the same two key findings in their impression, "emphysema" and "no acute cardiopulmonary disease", the MIMIC-CXR report describes these findings with more detail in prose form, while the Indiana report lists the findings more concisely using a numbered list form. This variation in language between different healthcare organizations is common in the clinical NLP domain, resulting in a need to adapt algorithms depending on the applicable dataset (Carrell et al., 2017).
To address this, we trained a BERT-based source-specific classifier which predicts the source given the findings and background as input. We trained this model using a subset of MIMIC-CXR and Indiana reports. During the prediction or evaluation phase, we chose a high threshold of 0.7 for predicting a source: if an input is predicted to be Indiana or MIMIC-CXR with a probability of 0.7 or higher, we predict it to be Indiana or MIMIC-CXR, respectively; otherwise it is marked as being of an unknown source. Based on the predicted source of a test sample (MIMIC-CXR, Indiana, or unknown), the corresponding source-specific model's output is chosen as the prediction for that sample.
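The thresholded routing can be sketched as follows (the probability-dict shape and the model keys are hypothetical; only the 0.7 threshold and the three-way fallback come from the text):

```python
def route_by_source(probs, threshold=0.7):
    """Map classifier probabilities, e.g. {"mimic": 0.9, "indiana": 0.1},
    to a source-specific model key, falling back to "unknown" when no
    source clears the threshold."""
    if probs.get("mimic", 0.0) >= threshold:
        return "mimic"      # -> BioRoBERTa (M),FAR
    if probs.get("indiana", 0.0) >= threshold:
        return "indiana"    # -> BioRoBERTa (M+I),FAR
    return "unknown"        # -> BioRoBERTa (M+M+I),FAR
```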

Evaluation Metrics
We use two sets of metrics to evaluate model performance at the corpus level: ROUGE and Factual F1.

ROUGE. We use the standard ROUGE scores (Lin, 2004), and report the F1 scores for ROUGE-1, ROUGE-2, and ROUGE-L, which compare the word-level unigram, bigram, and longest common subsequence overlap with the reference summary, respectively.

Factual F1. For factual correctness evaluation, we use the Factual F1 score as proposed by Zhang et al. (2019b). The Factual F1 scores are calculated by (1) running the CheXbert labeler on both the reference and generated summaries to obtain the binary presence values of a collection of disease variables, (2) calculating the F1 score for each of the variables over the entire test set, using reference values as the oracle, and (3) obtaining the macro-averaged F1 score over all variables. Following the process in Zhang et al. (2019b), we exclude some variables due to their small sample sizes (less than a 5% positive ratio in the entire dataset). Results in Table 6 are on the official external test data of 600 radiology reports.
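The three steps above can be sketched as follows. This is a dependency-free sketch, not the official implementation: it assumes the CheXbert outputs have already been binarized, and it omits the small-sample variable exclusion.

```python
def factual_f1(ref_labels, gen_labels):
    """Macro-averaged F1 over disease variables.
    ref_labels / gen_labels: lists of equal-length binary vectors,
    one per report, with reference values treated as the oracle."""
    n_vars = len(ref_labels[0])
    f1s = []
    for v in range(n_vars):
        # Per-variable counts over the whole test set.
        tp = sum(r[v] and g[v] for r, g in zip(ref_labels, gen_labels))
        fp = sum((not r[v]) and g[v] for r, g in zip(ref_labels, gen_labels))
        fn = sum(r[v] and (not g[v]) for r, g in zip(ref_labels, gen_labels))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    # Macro-average: every variable weighted equally.
    return sum(f1s) / n_vars
```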

Base Models
We conducted four experiments to get our three base models specific to MIMIC-CXR, Indiana, and unknown sources. We utilized MIMIC train to train our first two models, RoBERTa-large (M) and BioRoBERTa (M). We used Indiana 1800 for the model BioRoBERTa (M+I), and used Indiana 1800 and MIMIC 1800 for the model BioRoBERTa (M+M+I). In each setting, we split the available dataset 90/10 into training and validation splits. We evaluated all our models on the internal test set of 400 radiology reports. Each of our models uses a seq2seq architecture with encoder and decoder both composed of Transformer layers. For both encoder and decoder, we inherited the RoBERTa Transformer layer implementations, and we added an encoder-decoder attention mechanism. All models were fine-tuned on the target task using the Adam optimizer with a learning rate of 0.05. We used Huggingface's transformers library 4 (Wolf et al., 2019) for executing our experiments. In our encoder-decoder setup, the input was capped at 128 tokens, the output summary at 40 tokens, the beam size was 10, and the length penalty was set to 0.8. Finally, during summary generation, trigrams and longer phrases were blocked from repeating. Table 3 presents the results of the four experiments. Between the two models trained using only MIMIC train, BioRoBERTa (M) consistently outperforms RoBERTa-large (M) in this task, likely because BioRoBERTa (M) utilizes a domain-adapted version of RoBERTa. Among the three BioMed-RoBERTa-base based models, BioRoBERTa (M) performs better on MIMIC 200, and BioRoBERTa (M+I) provides better performance on Indiana 200. BioRoBERTa (M+M+I), fine-tuned on both MIMIC-CXR and Indiana, provides better performance on Combined 400 but performs poorly when we consider each source separately.
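The decoding settings above can be collected into a generation config of the kind passed to a Hugging Face `model.generate(**GEN_KWARGS)` call. This is our sketch, not the authors' exact code; in particular, the `early_stopping` flag is our assumption and is not stated in the text.

```python
# Decoding hyperparameters reported in the text, in transformers-style
# keyword form. `no_repeat_ngram_size=3` blocks repeated trigrams and,
# by implication, all longer repeated phrases.
GEN_KWARGS = {
    "max_length": 40,            # output summary cap (tokens)
    "num_beams": 10,             # beam size
    "length_penalty": 0.8,
    "no_repeat_ngram_size": 3,
    "early_stopping": True,      # assumption: common beam-search default
}
```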

Fact-aware Re-ranking (FAR)
For the prediction of the 14-observations-vector, we combined MIMIC train, MIMIC 1800, and Indiana 1800 to form our training and validation splits. Table 4 presents the F1 scores for our impression 14-observations-vector prediction model evaluated on the internal test dataset Combined 400. We utilized Smit et al. (2020)'s publicly available implementation 5 to train the domain-specific RoBERTa model (BioMed-RoBERTa-base) for predicting the impression 14-observations-vector. In this setup, the transformer architecture was modified with 14 linear heads, corresponding to the 14 observations. We concatenate the findings and background of a radiology report to form our input, which is then tokenized and capped at 128 tokens. The hidden state of the CLS token is fed as input to each of the linear heads. The model is trained using cross-entropy loss and Adam optimization with a learning rate of 2 × 10 -5. The cross-entropy losses for each of the 14 observations are added to produce the final loss.
4 https://github.com/huggingface/transformers
5 https://github.com/stanfordmlgroup/CheXbert
During training, the model was periodically evaluated and the best performing model averaged over 14 observations was saved.
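The summed multi-head loss can be sketched as follows. This is a dependency-free illustration: in the real system each head is a linear layer over the CLS hidden state and the class probabilities come from the model; here they are supplied directly.

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target class index."""
    return -math.log(probs[target])

def multi_head_loss(head_probs, targets):
    """head_probs: one probability distribution per observation head
    (14 in the system described above); targets: gold class index per
    head. The per-head cross-entropy losses are summed into one loss."""
    return sum(cross_entropy(p, t) for p, t in zip(head_probs, targets))
```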
For fact-aware re-ranking, we utilize the model trained above to re-rank the top 10 (N=10, empirically determined) generated summaries from the three base models presented in Section 5.2. Table 3 presents results for the three fact-aware re-ranking experiments: BioRoBERTa (M),FAR, BioRoBERTa (M+I),FAR, and BioRoBERTa (M+M+I),FAR. Since BioRoBERTa (M),FAR shows the best performance for MIMIC-CXR radiology reports (MIMIC 200), BioRoBERTa (M+I),FAR exhibits the best performance for Indiana radiology reports (Indiana 200), and BioRoBERTa (M+M+I),FAR shows the best performance on the combined test data (Combined 400), these models were chosen as our source-specific models for MIMIC-CXR, Indiana, and unknown sources, respectively.

Source-specific Ensemble
For training our source-specific classifier, we used a downsampled subset of MIMIC train (10,000 radiology reports) together with Indiana 1800, and formed 90/10 training and validation splits. We evaluated the model on MIMIC 200 and Indiana 200 and present our results in Table 5. We again utilized Huggingface's transformers library to conduct our experiments. In this setup, we used the BERT-base architecture with a single linear head for classifying the source. We concatenate the findings and background of a radiology report to form our input, which is then tokenized and capped at 512 tokens. The model is trained for 3 epochs using cross-entropy loss and Adam optimization with a learning rate of 2 × 10 -5.
Utilizing the above model, we identify the source of a radiology report and apply the corresponding source-specific model. The Ensemble results in Table 3 present our results after applying the source-specific ensembling technique to our internal test dataset. Our ensembled results show a slight drop in performance for the individual sources MIMIC 200 and Indiana 200 (due to classification errors), but show the best performance on the combined dataset (Combined 400). Table 6 presents our top two official submission results. Ensemble presents our best-performing source-specific ensemble technique applied to the official test data. In another submission (Ensemble + post-processing), we removed certain tokens (like "1.", "2.", "__") to clean up the source-specific ensemble output, which slightly improved the ROUGE scores.
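The post-processing cleanup can be sketched as follows; the exact set of tokens removed is an assumption based on the examples given in the text ("1.", "2.", "__").

```python
import re

def postprocess(summary):
    """Strip numbered-list markers and underscore runs from a generated
    impression, then collapse any leftover whitespace."""
    summary = re.sub(r"\b\d+\.\s*", "", summary)    # "1.", "2." markers
    summary = re.sub(r"_{2,}", "", summary)         # "__" runs
    return re.sub(r"\s{2,}", " ", summary).strip()  # tidy spacing
```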

Discussion
In this section, we present two major findings of our approach. First, we find that radiology reports from different sources have distinct language, and fine-tuning a model trained on source A with a small amount of data from source B provides significant gains in performance on source B, making the model transferable. As can be seen in Table 3, zero-shot application of our model BioRoBERTa (M), which is fine-tuned only on MIMIC-CXR (MIMIC train), shows lower performance on the Indiana dataset. However, further fine-tuning BioRoBERTa (M) on a small dataset of 1,800 Indiana reports (Indiana 1800) leads to large gains in performance on the Indiana dataset (model BioRoBERTa (M+I) on Indiana 200).
Second, our fact-aware re-ranking methodology improves the performance of the models on natural language metrics (ROUGE) as well as the factual correctness of our predictions, but metrics beyond lexical overlap are needed. As shown in Table 3, models using FAR outperform the base models when measured in ROUGE, even though FAR's objective is not to optimize ROUGE. Table 7 shows examples of the most probable predictions from the base model compared with the predictions after FAR, and the human-generated ground-truth impressions. ROUGE scores for both predictions compared to the ground truth are shown at the end of each example. In the first example, FAR chooses a better ROUGE-scoring prediction over the most probable prediction by the base model. However, in the second example, FAR does not choose the higher ROUGE-scoring prediction but rather the more factually correct one. With the current evaluation metric, ROUGE, this would register as a drop in performance. Developing and adopting new metrics that consider both lexical and factual correctness jointly (Mrabet and Demner-Fushman, 2020) is crucial to steer the research community toward systems that ensure factual correctness as well as readability.
Limitations and Future Work. We acknowledge several limitations of our work. First, we recognize our dependence on an external structured label generator. As we use CheXbert labels as our proxy for ground truth, both for training our 14-observations-vector predictor and in our similarity function, any errors in CheXbert have a direct impact on our system's performance. Second, though the FAR methodology has shown significant gains in Factual F1 and ROUGE scores, the system is limited by the generated candidate summaries. We aim to build on this approach by incorporating the methodology during training as a modified version of beam search. Third, all of our presented results are evaluated using a relatively small set of internal test data, due to the limitations on data during the challenge. Though our approach translated into similarly good performance on the official test data, we aim to further evaluate it on larger test data. Finally, as ROUGE has been shown to be an imperfect metric for radiology report summarization evaluation (Zhang et al., 2019b), we aim to further evaluate our system (1) using other automated metrics such as BERTScore (Zhang et al., 2019a), BLEURT (Sellam et al., 2020), and HOLMS (Mrabet and Demner-Fushman, 2020), and (2) by conducting qualitative evaluation of our system's predictions involving human annotators such as radiologists or subject matter experts.

Table 7: Examples depicting the most probable prediction from the base model and the re-ranked prediction using our FAR methodology, compared to the ground truth (human-generated impression).

Conclusion
We have presented our system developed during our participation in the MEDIQA 2021 RRS challenge. We found that radiology reports from different sources have distinct language, and that fine-tuning a trained model with a small amount of data from another source leads to gains in performance and makes the model transferable. Further, techniques like fact-aware re-ranking, which utilizes a factual vector of the summary to re-rank candidate summaries, not only improve the factual correctness of the summary but also improve the performance of the model on traditional natural language metrics like ROUGE. We have also identified limitations of our work and discussed promising areas of future research.