CheXbert: Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT

The extraction of labels from radiology text reports enables large-scale training of medical imaging models. Existing approaches to report labeling typically rely either on sophisticated feature engineering based on medical domain knowledge or on manual annotations by experts. In this work, we introduce a BERT-based approach to medical image report labeling that exploits both the scale of available rule-based systems and the quality of expert annotations. We demonstrate superior performance of a biomedically pretrained BERT model first trained on annotations of a rule-based labeler and then finetuned on a small set of expert annotations augmented with automated backtranslation. We find that our final model, CheXbert, is able to outperform the previous best rule-based labeler with statistical significance, setting a new SOTA for report labeling on one of the largest datasets of chest x-rays.


Introduction
The extraction of labels from radiology text reports enables important clinical applications, including large-scale training of medical imaging models. Many natural language processing systems have been designed to label reports using sophisticated feature engineering of medical domain knowledge (Pons et al., 2016). On chest x-rays, the most common radiological exam, rule-based methods have been engineered to label some of the largest available datasets (Johnson et al., 2019). While these methods have generated considerable advances, they have been unable to capture the full complexity, ambiguity, and subtlety of natural language in the context of radiology reporting.

Figure 1: We introduce a method for radiology report labeling, in which a biomedically pretrained BERT model is first trained on annotations of a rule-based labeler, and then finetuned on a small set of expert annotations augmented with automated backtranslation.

* Equal contribution
More recently, Transformers have demonstrated success in end-to-end radiology report labeling (Drozdov et al., 2020; Wood et al., 2020). However, these methods have shifted the burden from feature engineering to manual annotation, requiring considerable time and expertise for high quality. Moreover, these methods do not take advantage of existing feature-engineered labelers, which represent the state of the art on many medical tasks.
In this work, we introduce a simple method for combining the benefits of existing radiology report labelers with expert annotations to achieve highly accurate automated radiology report labeling. This approach begins with a biomedically pretrained BERT model (Devlin et al., 2019; Peng et al., 2019) trained on the outputs of an existing labeler, which is further finetuned on a small corpus of expert annotations augmented with automated backtranslation. We apply this approach to the task of radiology report labeling of chest x-rays, and call our resulting model CheXbert.
CheXbert outperforms the previous best reported labeler (Irvin et al., 2019) on an external dataset, MIMIC-CXR (Johnson et al., 2019), with an improvement of 0.055 (95% CI 0.039, 0.070) on the F1 metric, and only 0.007 F1 away from a radiologist performance benchmark. We expect this method of training medical report labelers is broadly useful for natural language processing within the medical domain, where collection of expert labels is expensive, and feature engineered labelers already exist for many tasks.

Related Work
Many natural language processing systems have been developed to extract structured labels from free-text radiology reports (Pons et al., 2016; Yadav et al., 2016; Hassanpour et al., 2017; Annarumma et al., 2019; Savova et al., 2010; Wang et al., 2018; Chen et al., 2018; Bozkurt et al., 2019). In many cases, these methods have relied on heavy feature engineering, including controlled vocabularies and grammatical rules to find and classify properties of radiological findings. NegEx (Chapman et al., 2001), a popular component of rule-based methods, uses simple regular expressions for detecting negation of findings and is often used in combination with ontologies such as the Unified Medical Language System (UMLS) (Bodenreider, 2004). NegBio, an extension to NegEx, utilizes universal dependencies for pattern definition and subgraph matching for graph traversal search, includes uncertainty detection in addition to negation detection for multiple pathologies in chest x-ray reports, and is used to generate labels for the ChestX-Ray14 dataset.
The CheXpert labeler (Irvin et al., 2019) improves upon NegBio on chest x-ray report classification through more controlled extraction of mentions and an improved NLP pipeline and rule set for uncertainty and negation extraction. The CheXpert labeler has been applied to generate labels for the CheXpert dataset and MIMIC-CXR (Johnson et al., 2019), which are amongst the largest chest x-ray datasets publicly available.
Deep learning approaches have also been trained using expert-annotated sets of radiology reports (Xue et al., 2019). In these cases, training set size, often driving the performance of deep learning approaches, is limited by radiologist time and expertise. Chen et al. (2017) trained CNNs with GloVe embeddings (Pennington et al., 2014) on 1,000 radiologist-labeled reports for classification of pulmonary embolism in chest CT reports and improved upon the previous rule-based SOTA, peFinder (Chapman et al., 2011). Bustos et al. (2019) trained both recurrent and convolutional networks in combination with attention mechanisms on 27,593 physician-labeled radiology reports and applied their labeler to generate labels. More recently, Transformer-based models have also been applied to the task of radiology report labeling. Drozdov et al. (2020) trained classifiers using BERT (Devlin et al., 2019) and XLNet (Yang et al., 2020) on 3,856 radiologist-labeled reports to detect normal and abnormal labels. Wood et al. (2020) developed ALARM, a head MRI report classifier built on BioBERT (Lee et al., 2019) models trained on 1,500 radiologist-labeled reports, and demonstrated improvement over simpler fixed-embedding and word2vec-based (Mikolov et al., 2013) models (Zech et al., 2018).

Our work is closely related to approaches that reduce the number of expert annotations required for training medical report labelers (Callahan et al., 2019; Ratner et al., 2020; Banerjee et al., 2018). A method of weak supervision known as data programming (Ratner et al., 2018) has seen successful application to medical report labeling: in this method, users write heuristic labeling functions that programmatically label training data. Saab et al. (2019) used data programming to incorporate labeling functions consisting of regular expressions that look for phrases in radiology reports, developed with the help of a clinical expert in a limited time window, to label for intracranial hemorrhage in head CTs. Dunnmon et al. (2019) demonstrated that in under 8 hours of cumulative clinician time, a data programming method can approach the efficacy of large hand-labeled training sets annotated over months or years for training medical imaging models, including chest x-ray classifiers on the task of normal/abnormal detection. Beyond data programming approaches, Drozdov et al. (2020) developed a fully unsupervised approach utilizing a Siamese Neural Network and Gaussian Mixture Models, reporting performance similar to the CheXpert labeler, without requiring any radiologist-labeled reports, on the simplified task of assigning normal and abnormal labels.

Task
The report labeling task is to extract the presence of one or more clinically important observations (e.g. consolidation, edema) from a free-text radiology report. More formally, a labeler takes in as inputs sentences from a radiology report and outputs for 13 observations one of the following classes: blank, positive, negative, and uncertain. For the 14th observation corresponding to No Finding, the labeler only outputs one of the two following classes: blank or positive.
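The task output can be summarized by the following sketch. The observation names follow the CheXpert label set; the string class encoding is illustrative rather than CheXbert's internal representation.

```python
# Sketch of the report-labeling output schema: 13 observations take one of
# 4 classes, while "No Finding" takes one of 2 classes.
OBSERVATIONS = [
    "Enlarged Cardiomediastinum", "Cardiomegaly", "Lung Opacity",
    "Lung Lesion", "Edema", "Consolidation", "Pneumonia", "Atelectasis",
    "Pneumothorax", "Pleural Effusion", "Pleural Other", "Fracture",
    "Support Devices", "No Finding",
]

CLASSES = ("blank", "positive", "negative", "uncertain")

def validate_labels(labels):
    """Check that a labeling assigns a legal class to each observation."""
    assert set(labels) == set(OBSERVATIONS)
    for obs, cls in labels.items():
        if obs == "No Finding":
            assert cls in ("blank", "positive")  # only 2 classes allowed
        else:
            assert cls in CLASSES                # all 4 classes allowed

example = {obs: "blank" for obs in OBSERVATIONS}
example["Edema"] = "uncertain"
example["Cardiomegaly"] = "positive"
validate_labels(example)
```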

Data
Two existing large datasets of chest x-rays, CheXpert (Irvin et al., 2019) and MIMIC-CXR (Johnson et al., 2019), contain reports labeled by the CheXpert labeler using the Impression section or other parts of the radiology report. A subset of both datasets also contains manual annotations by expert radiologists. On CheXpert, a total of 1000 reports (the CheXpert manual set) were reviewed by 2 board-certified radiologists, with disagreement resolution through consensus. On MIMIC-CXR, a total of 687 reports (the MIMIC-CXR test set) were reviewed by 2 board-certified radiologists and manually labeled for the same 14 medical observations as in CheXpert. In this study, CheXpert is used for the development of models, and the MIMIC-CXR test set is used for evaluation.
Some reports from the same patient appear multiple times in the CheXpert dataset. Removing duplicate reports as well as the CheXpert manual set from the CheXpert dataset results in 190,460 reports, the class prevalences for which are shown in Table A1 of the Appendix. We remove excess spaces and newlines from all reports.
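The cleanup step above can be sketched as follows. The paper removes duplicate reports from the same patient; here we approximate that with exact-match deduplication after whitespace normalization, which is an assumption about the matching criterion.

```python
import re

def clean_report(text):
    """Collapse excess spaces and newlines, as described for the reports."""
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(reports):
    """Drop exact duplicate reports, keeping first occurrences in order."""
    seen, unique = set(), []
    for report in map(clean_report, reports):
        if report not in seen:
            seen.add(report)
            unique.append(report)
    return unique
```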

Model Architecture
All models use a modification of the BERT-base architecture (Devlin et al., 2019) with 14 linear heads (as shown in Figure 2): 12 heads correspond to various medical abnormalities, 1 to medical support devices, and 1 to 'No Finding'. Each radiology report text is tokenized, and the maximum number of tokens in each input sequence is capped at 512. The final layer's hidden state corresponding to the CLS token is then fed as input to each of the linear heads.
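A minimal sketch of this architecture is below. The tiny `TransformerEncoder` is a stand-in for BERT-base (12 layers, hidden size 768, vocabulary of 30,522 WordPiece tokens); the head layout (13 four-way heads plus a two-way head for 'No Finding') follows the task definition.

```python
import torch
import torch.nn as nn

class ReportLabeler(nn.Module):
    """Encoder followed by 14 linear heads over the final-layer CLS state."""

    def __init__(self, hidden=768, vocab=30522, n_classes=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        # Stand-in for BERT-base; `hidden` must be divisible by nhead (12).
        layer = nn.TransformerEncoderLayer(hidden, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        # 13 observations get 4-way heads; 'No Finding' gets a 2-way head.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, n_classes) for _ in range(13)]
            + [nn.Linear(hidden, 2)]
        )

    def forward(self, input_ids):          # input_ids: (batch, seq <= 512)
        h = self.encoder(self.embed(input_ids))
        cls = h[:, 0]                      # hidden state of the CLS token
        return [head(cls) for head in self.heads]
```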

Training Details
For all our models, unless otherwise specified, we finetune all layers of the BERT model, including the embeddings, and feed the CLS token into the 14 linear heads to generate class scores for each medical observation. BERT-base contains ∼110 million parameters, and the linear heads contain ∼40,000 parameters.
All models are trained using cross-entropy loss and Adam optimization with a learning rate of 2 × 10−5, as used in Devlin et al. (2019) for finetuning tasks. The cross-entropy losses for each of the 14 observations are added to produce the final loss. During training, we periodically evaluate our model on the dev set and save the checkpoint with the highest performance averaged over all 14 observations. All models are trained using 3 TITAN-XP GPUs with a batch size of 18.
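The summed multi-observation loss can be sketched as follows (PyTorch; the per-head list ordering is an assumption for illustration).

```python
import torch
import torch.nn.functional as F

def multi_head_loss(logits_per_head, targets_per_head):
    """Summed cross-entropy across the 14 observation heads.

    logits_per_head: list of (batch, n_classes) tensors, one per observation.
    targets_per_head: list of (batch,) class-index tensors, same ordering.
    """
    return sum(
        F.cross_entropy(logits, targets)
        for logits, targets in zip(logits_per_head, targets_per_head)
    )

# Optimization as described in the text: Adam with a learning rate of 2e-5,
# e.g. torch.optim.Adam(model.parameters(), lr=2e-5).
```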

Evaluation
Models are evaluated on their average performance on three retrieval tasks: positive extraction, negative extraction, and uncertainty extraction. For each of the tasks, the class of interest (e.g. negative for the negative extraction and uncertain for the uncertainty extraction) is treated as the positive class, and the other classes are considered negative. For each of the 14 observations, we compute a weighted average of the F1 scores on each of the above three tasks, weighted by the support for each class of interest, which we call the weighted-F1 metric, henceforth simply abbreviated to F1.
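A pure-Python sketch of this weighted-F1 for a single observation follows; it is illustrative, not the evaluation code used in the study.

```python
def weighted_f1(y_true, y_pred, classes=("positive", "negative", "uncertain")):
    """One-vs-rest F1 per class of interest, averaged weighted by support."""
    total_support, weighted_sum = 0, 0.0
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        support = tp + fn                      # occurrences of class c
        if support == 0:
            continue                           # class absent: no weight
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / support
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        weighted_sum += support * f1
        total_support += support
    return weighted_sum / total_support if total_support else 0.0
```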
We report the simple average of the F1 across all of the observations. We include the 95% two-sided confidence intervals of the F1 using the nonparametric percentile bootstrap method with 1000 bootstrap replicates (Efron and Tibshirani, 1986).
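The percentile bootstrap can be sketched as follows; `metric` is a hypothetical placeholder for any scoring function of true and predicted labels, such as the weighted-F1.

```python
import random

def bootstrap_ci(metric, y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """95% two-sided percentile-bootstrap confidence interval for a metric."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        # Resample example indices with replacement and rescore.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx],
                            [y_pred[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```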

Supervision Strategies
We investigate models trained using three strategies: trained only on radiologist-labeled reports, trained only on labels generated automatically by the CheXpert labeler (Irvin et al., 2019), and trained on a combination of the two.
Radiologist Labels T-rad is obtained by training the model on the CheXpert manual set, finetuning all weights. As baselines, we also train models that freeze all weights in the BERT layers and only update the weights in the linear heads: T.cls-rad is identical to T-rad in architecture, while T.token-rad averages the non-padding output tokens as the input into the linear heads rather than using the CLS token output. All models are trained using a random 75%-25% train-dev split on this set, and are trained until convergence.
Automatic Labels T-auto is obtained using labels generated by the rule-based CheXpert labeler, described in Irvin et al. (2019). T-auto is trained using a random 85%-15% train-dev split of the CheXpert training set, different from the models trained on radiologist labels. T-auto is trained for 8 epochs, since slightly higher dev performance is observed compared to the typical 2-4 epochs for BERT fine-tuning tasks.
Hybrid Labels T-hybrid is obtained by initializing a model with the weights of T-auto, and then finetuning it on radiologist-labeled reports, as for T-rad.

Results As shown in Table 1, T-rad achieves an F1 of 0.705 (0.680, 0.725), significantly higher than the performance of the baselines, with T.cls-rad at 0.286 (0.265, 0.305) and T.token-rad at 0.396 (0.374, 0.416). T-auto achieves a higher F1 of 0.755 (0.731, 0.774). Superior performance is obtained by T-hybrid, with an F1 of 0.775 (0.753, 0.795).

Biomedical Language Representations
We investigate the effect of having models pretrained on biomedical data. For the following models, we use an identical training procedure to T-rad, but initialize the weights differently. Tbio-rad is obtained by using BioBERT weight initializations (Lee et al., 2019). BioBERT was obtained by further pretraining the BERT weights on a large biomedical corpus comprising PubMed abstracts (4.5 billion words) and PMC full-text articles (13.5 billion words). Tclinical-rad is obtained by using Clinical BioBERT weight initializations (Alsentzer et al., 2019), which were obtained by further pretraining the BioBERT weights on 2 million clinical notes from the MIMIC-III database. Finally, Tblue-rad is obtained by using BlueBERT, a BERT model pretrained on PubMed abstracts and clinical notes (MIMIC-III) (Peng et al., 2019).
Results As shown in Table 1, Tbio-rad achieves an F1 of 0.616 (0.587, 0.639) and Tclinical-rad achieves an F1 of 0.677 (0.651, 0.699), both lower than T-rad. However, Tblue-rad achieves an F1 of 0.741 (0.714, 0.763), higher than T-rad. The drop in performance with Tbio-rad and Tclinical-rad may be attributed to their use of a different vocabulary, sequence length, and other configurations (stopping procedure, embedding dimensions) than those used by Tblue-rad, which uses the configurations provided in Devlin et al. (2019).

Data Augmentation using Backtranslation
We investigate the use of backtranslation to improve the performance of the models. Backtranslation is designed to generate alternate formulations of sentences by translating them to another language and back. Although backtranslation has been successfully used to augment text data in a variety of NLP tasks (Yu et al., 2018; Poncelas et al., 2018), to our knowledge, the technique has not yet been applied to a medical report extraction task. In this experiment, we augment the CheXpert manual set using Facebook-FAIR's winning submission to the WMT'19 news translation task to generate backtranslations. Although this submission includes models that produce German/English and Russian/English translations, initial experiments with Russian did not demonstrate semantically correct translations, so we only continued experiments with German. We use beam search with a beam size of 1 to select the single most likely translation. We perform this experiment using our best models: Tblue-rad-bt is obtained by using an identical training procedure to Tblue-rad on the augmented dataset (which is twice the size of the CheXpert manual set). Tblue-hybrid-bt is obtained by first training a BlueBERT-based labeler on automatically generated CheXpert labels, and then fine-tuning on radiologist-labeled reports of the CheXpert manual set, augmented by backtranslation. We also report the performance of T-rad-bt and T-hybrid-bt.
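The augmentation step can be sketched as follows. Here `to_de` and `to_en` are hypothetical placeholders for the English-to-German and German-to-English translation models; any callables with the same interface work.

```python
def backtranslate(sentence, to_de, to_en):
    """Round-trip a sentence English -> German -> English.

    `to_de` / `to_en` are placeholder callables standing in for the
    WMT'19 translation models (beam size 1 in the described setup).
    """
    return to_en(to_de(sentence))

def augment(reports, labels, to_de, to_en):
    """Double the dataset: each backtranslated report keeps its labels."""
    aug_reports = reports + [backtranslate(r, to_de, to_en) for r in reports]
    aug_labels = labels + labels
    return aug_reports, aug_labels
```

In practice the two callables could wrap pretrained German/English translation models; the stubs above only fix the interface.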

Comparison to previous SOTA and radiologist benchmark
We compare the performance of our best model to the previous best reported labeler, the CheXpert labeler (Irvin et al., 2019), and to a radiologist benchmark. CheXpert is an automated rule-based labeler that extracts mentions of conditions like pneumonia by searching against a large manually curated list of words associated with the condition and then classifies mentions as uncertain, negative, or positive using rules on a universal dependency parse of the report. For the radiologist benchmark, the annotations by one of the 2 radiologists on the MIMIC-CXR test set are used, while the other's are used as ground truth. We report the improvement of our best model, Tblue-hybrid-bt, which we also call CheXbert, over the CheXpert labeler by computing the paired differences in F1 scores on 1000 bootstrap replicates and provide the mean difference along with a 95% two-sided confidence interval.
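The paired comparison can be sketched as follows, under the assumption that each bootstrap replicate resamples reports and scores both models on the same sample; `metric` is a hypothetical placeholder for the F1 computation.

```python
import random

def paired_bootstrap_diff(metric, y_true, pred_a, pred_b, n_boot=1000, seed=0):
    """Mean paired difference metric(a) - metric(b) with a 95% percentile CI."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        # Same resampled reports are scored for both models (paired design).
        idx = [rng.randrange(n) for _ in range(n)]
        t = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(metric(t, a) - metric(t, b))
    diffs.sort()
    mean = sum(diffs) / n_boot
    return mean, (diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1])
```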
Results We observe that CheXbert has a statistically significant improvement (p < 0.001) over the existing SOTA, CheXpert, which achieves a score of 0.743 (0.719, 0.764). Notably, we also find that Tblue-rad-bt, the best model trained only on manually labeled radiology reports, performs at least as well as the CheXpert labeler. Table 2 shows the F1 per class (along with 95% confidence intervals) for CheXbert and for the improvements over CheXpert. CheXbert records an improvement in all but 2 medical conditions, and a statistically significant improvement in 9 of the 14 conditions. The largest improvements are observed for Pneumonia.

Training times For all our models except the baselines, training on radiologist-labeled reports takes ∼30 minutes, training on the radiologist-labeled reports augmented via backtranslation takes ∼50 minutes, and training on the reports labeled by the CheXpert labeler takes ∼7 hours.

T-auto versus CheXpert
We analyze whether T-auto, which is trained exclusively on labels from CheXpert (a rules-based labeler), can generalize beyond those rules.
We analyze specific examples in the CheXpert manual test set which T-auto correctly labels but CheXpert mislabels. On one example, T-auto is able to correctly detect uncertainty expressed in the phrase "cannot be entirely excluded," which CheXpert is not able to detect because the phrase does not match any rule in its ruleset. Similarly, on another example containing "no evidence of pneumothorax or bony fracture," T-auto correctly labels fracture as negative, while CheXpert labels fracture as positive since the phrasing does not match any negation construct in its ruleset. T-auto, in contrast to CheXpert, also recognizes conditions with misspellings in the report, like cariomegaly in place of cardiomegaly and mediastnium in place of mediastinum. Examples of T-auto correctly labeling conditions mislabeled by CheXpert are provided in Table A4 of the Appendix.

CheXbert versus T-auto and CheXpert
We analyze how CheXbert improves on T-auto and CheXpert using examples which CheXbert correctly labels but T-auto and CheXpert incorrectly label.
CheXbert is able to correctly detect conditions which CheXpert and T-auto are not able to. On one example, T-auto and CheXpert both mislabel a "mildly enlarged heart" as blank for cardiomegaly, while CheXbert correctly labels it positive. On another, containing "Right hilum appears slightly more prominent" (an indicator for enlarged cardiomediastinum), CheXbert correctly classifies enlarged cardiomediastinum as positive, while T-auto and CheXpert do not detect the condition.
Furthermore, CheXbert correctly labels nuanced expressions of negation that both CheXpert and T-auto mislabel. On the example containing "heart size is slightly larger but still within normal range," CheXpert and T-auto mistakenly label cardiomegaly as positive, while CheXbert correctly labels cardiomegaly as negative. On another example containing the phrase "interval removal of PICC lines," CheXpert and T-auto detect "PICC lines" as an indication of a support device but are unable to detect the negation indicated by "removal," which CheXbert correctly does.
Additionally, CheXbert is able to correctly detect expressions of uncertainty that both CheXpert and T-auto mislabel. On an example containing "new bibasilar opacities, which given the clinical history are suspicious for aspiration," CheXbert correctly identifies lung opacity as positive while CheXpert and T-auto incorrectly detect uncertainty (associating "suspicious" as a descriptor of "opacities"). More examples which CheXbert correctly labels but CheXpert and T-auto mislabel can be found in Table A6 of the Appendix.

Table: Report segments, labels, and reasoning.

Report segment: ...two views of chest demonstrate cariomegaly with no focal consolidation...
Cardiomegaly
CheXpert: Blank T-auto: Positive
Reasoning: T-auto, in contrast to CheXpert, recognizes conditions with misspellings in the report like cariomegaly in place of cardiomegaly.

Edema
CheXpert: Positive T-auto: Uncertain
Reasoning: T-auto incorrectly detects uncertainty in the edema label, likely from the "and/or"; CheXpert correctly classifies this example as positive.

Enlarged Cardiomediastinum
CheXpert: Negative T-auto: Negative CheXbert: Uncertain
Reasoning: T-auto and CheXpert both incorrectly label this example as negative for enlarged cardiomediastinum; CheXbert correctly classifies it as uncertain, likely recognizing that "unchanged" is associated with uncertainty of the condition. The condition cannot be labeled positive or negative without more information.

Report Changes with Backtranslation
We analyze the phrasing and vocabulary changes that backtranslation introduces into the reports. Backtranslation frequently rephrases text. For instance, the sentence "redemonstration of multiple right-sided rib fractures" is backtranslated to "redemonstration of several rib fractures of the right side." Backtranslation also introduces some errors: the phrase "left costophrenic angle" is translated to "left costophrine angle" ("costophrine" is not a word), and the phrase "left anterior chest wall pacer in place" is backtranslated to "pacemaker on the left front of the chest wall," which omits the critical attribute of being in place. In many examples, the backtranslated text paraphrases medical vocabulary into possible semantic equivalents: "cutaneous" becomes "skin", "left clavicle" becomes "left collarbone", "osseous" becomes "bone" or "bony", "anterior" becomes "front", and "rib fracture" becomes "broken ribs". More backtranslations with analyses are provided in Table A7 of the Appendix.

Limitations
Our study has several limitations. First, our hybrid/auto approaches rely on an existing labeler to generate labels. Second, our report labeler has a maximum input size of 512 tokens and would require further engineering to work on longer medical/radiology reports. Third, our task is limited to the 14 observations labeled for, and we do not test the model's ability to label rarer conditions. However, CheXbert can mark No Finding as blank, which can indicate the presence of another condition if the other 13 conditions are also blank. Fourth, the ground truth labels for the MIMIC-CXR test set were determined by a single board-certified radiologist, and the use of more radiologists could demonstrate a truer comparison to the radiologist benchmark. Fifth, while we do test performance on a dataset from an institution unseen in training, additional datasets across institutions could be useful in further establishing the model's ability to generalize.

Conclusion
In this study, we propose a simple method for combining existing report labelers with handannotations for accurate radiology report labeling.
In this method, a biomedically pretrained BERT model is first trained on the outputs of a labeler, and then further finetuned on the manual annotations, the set of which is augmented using backtranslation. We report five findings on our resulting model, CheXbert. First, we find that CheXbert outperforms models trained only on radiologist-labeled reports, or only on the existing labeler's outputs. Second, we find that CheXbert outperforms the BERT-based model not pretrained on biomedical data. Third, we find that CheXbert outperforms models which do not use backtranslation. Fourth, we find that CheXbert outperforms the previous best labeler, CheXpert (which was rule-based), with an improvement of 0.052 (95% CI 0.037, 0.067) on the F1 metric; we also find that the best model trained only on manually labeled radiology reports (Tblue-rad-bt) performs at least as well as the CheXpert labeler. Fifth, we find that CheXbert is 0.007 F1 points from the radiologist performance benchmark, suggesting that the gap to ceiling performance is narrow.
We expect this method of training medical report labelers is broadly useful within the medical domain, where collection of expert labels can produce a small set of high-quality labels, and existing feature-engineered labelers can produce labels at scale. Extracting highly accurate labels from medical reports by taking advantage of both sources can enable many important downstream tasks, including the development of more accurate and robust medical imaging models required for clinical deployment.

Table A2: Dev set F1 scores for all our models. The dev set for all rad models and T-hybrid consists of 250 randomly sampled reports from the CheXpert manual set. The dev set for T-auto is a random 15% split of the CheXpert dataset. The dev set for all models using backtranslation is obtained by augmenting the 250 randomly sampled reports from the CheXpert manual set by backtranslation. Tblue-hybrid-bt is first trained on labels generated by the CheXpert labeler, and then fine-tuned on radiologist labels augmented by backtranslation. Before fine-tuning on radiologist labels, it obtains an F1 of 0.977 on the 15% dev split of the CheXpert dataset.

Model   F1   Training Strategy

Table A3: The differences in the number of times labels were correctly assigned by one model versus another model. For example, in the first column, named T-auto > CheXpert, we report the difference between the number of times T-auto correctly classifies a label and the number of times CheXpert correctly classifies a label. We record the differences between a pair of models by category (blank, positive, negative, uncertain) and by total. These occurrences are obtained on the MIMIC-CXR test set.

           T-auto > CheXpert   CheXbert > CheXpert   CheXbert > T-auto
Blank              0                   29                   56
Positive         -22                   11                   56
Negative          14                   45                    9
Uncertain         16                   46                   -3
Total              8                  131                  118

Table A4: Examples where T-auto correctly assigns a label while CheXpert misassigns that label on the CheXpert manual set. We include speculative reasoning for the classifications.
Example & Labels Reasoning ...redemonstration of diffuse nodular air space opacities which are unchanged from prior examination which may represent air space pulmonary edema versus infection, as clinically correlated...

Edema CheXpert: Positive T-auto: Uncertain
T-auto appears to detect uncertainties indicated by words like "may" and "versus" on conditions. In this case, this phrase did not match an uncertainty detection rule in the CheXpert classifier. ... there has been interval development of left basilar patchy airspace opacity, which likely represents atelectasis, although consolidation cannot be entirely excluded...

Consolidation CheXpert: Positive T-auto: Uncertain
Unlike CheXpert, T-auto correctly detects uncertainty conveyed in the phrase "cannot be entirely excluded".
1. no radiographic evidence of acute cardiopulmonary disease. 2. no evidence of pneumothorax or bony fracture.

Fracture
CheXpert: Positive T-auto: Negative In this example, T-auto is able to detect a negation indicated by "no evidence of". CheXpert is not able to pick up this negation construction as part of its ruleset.

Table A6: Examples which CheXbert correctly labels but CheXpert and T-auto mislabel.

...new bibasilar opacities, which given the clinical history are suspicious for aspiration...

Lung Opacity
CheXpert: Uncertain T-auto: Uncertain CheXbert: Positive The word "suspicious" does not modify "opacities" in this sentence. Although CheXbert correctly identifies this, CheXpert and T-auto misclassify the opacities as uncertain.
...Coalescent areas in the left upper and lower zones could well reflect regions of consolidation. The right lung is essentially clear...

Consolidation
CheXpert: Positive T-auto: Positive CheXbert: Uncertain CheXbert correctly detects that consolidation is uncertain, as indicated by the phrase could well reflect.
Removal of dialysis catheter with no evidence of pneumothorax. Heart is mildly enlarged and is accompanied by vascular engorgement and new septal lines consistent with interstitial edema...

Cardiomegaly
CheXpert: Blank T-auto: Blank CheXbert: Positive Due to a ruleset limitation, CheXpert only looks at the heart or heart size but not heart independently when checking for mentions of cardiomegaly.
However, CheXbert recognizes mentions of cardiomegaly implied by phrases like "heart is mildly enlarged".

No previous images. There is hyperexpansion of the lungs suggestive of chronic pulmonary disease. Prominence of engorged and ill-defined pulmonary vessels is consistent with the clinical diagnosis of pulmonary vascular congestion, though in the absence of previous images it is difficult to determine whether this appearance could reflect underlying chronic pulmonary disease. The possibility of supervening consolidation would be impossible to exclude on this single study, especially without a lateral view. No evidence of pneumothorax.

Consolidation
CheXpert: Positive T-auto: Positive CheXbert: Uncertain CheXbert correctly detects uncertainty for consolidation indicated by the word possibility. Both T-auto and CheXpert misclassify consolidation.
1. Left suprahilar opacity and fiducial seeds are again seen, although appears slightly less prominent/small in size, although as mentioned on the prior study, could be further evaluated by chest CT or PET-CT. 2. Right hilum appears slightly more prominent as compared to the prior study, which may be due to patient positioning, although increased right hilar lymphadenopathy is not excluded.
Enlarged Cardiomediastinum
CheXpert: Blank T-auto: Blank CheXbert: Positive The right hilum appearing more prominent is an indicator of enlarged cardiomediastinum, which is clinically understood.
If the hilum is growing, then the entire mediastinum is growing. Although both CheXpert and T-auto mislabeled this report impression, CheXbert successfully labeled it positive for enlarged cardiomediastinum.

Example (cont.) & Labels (cont.)
Reasoning (cont.) As compared to the previous radiograph, there is no relevant change. The reduced volume of the right hemithorax with areas of lateral pleural thickening. The areas of pleural thickening are constant, size and morphology. Unchanged perihilar areas of fibrosis. Unchanged size and aspect of the cardiac silhouette, no pathologic changes in the left lung.
Cardiomegaly CheXpert: Positive T-auto: Positive CheXbert: Uncertain CheXbert correctly identifies uncertainty, as the cardiac silhouette is "unchanged," which means that it cannot be labeled positive or negative without additional information regarding the previous state. Both CheXpert and T-auto incorrectly label this example as positive for cardiomegaly.
AP chest compared to : Small-to-moderate left pleural effusion has increased slightly over the past several days. Moderate enlargement of the cardiac silhouette accompanied by mediastinal vascular engorgement is also slightly more pronounced. Pulmonary vasculature is engorged but there is no edema. Consolidation has been present without appreciable change in the left lower lobe since at least . Mediastinum widened at the thoracic inlet by a combination of tortuous vessels and mediastinal fat deposition. Right jugular introducer ends just above the junction with left brachiocephalic vein.
Enlarged Cardiomediastinum
CheXpert: Blank T-auto: Blank CheXbert: Positive CheXbert correctly identifies enlarged cardiomediastinum from the phrase "mediastinum widened," which is a slightly different way of describing enlarged cardiomediastinum that CheXpert and T-auto both miss.
Moderately enlarged heart size, stable since . No findings concerning for pulmonary edema or pneumonia.

Edema
CheXpert: Uncertain T-auto: Uncertain CheXbert: Negative Unlike T-auto and CheXpert, CheXbert correctly labels edema as negative, presumably understanding that the initial phrase "no findings" applies to both edema and pneumonia.

AP chest compared to and : As far as I can tell, given the severe anatomic distortion of the chest cage and its contents, lungs were clear on . Small region of opacification may have been developing lateral to the left hilus on , and today there is a suggestion of some new opacification at the base of the lung, but these observations are far from certain. I am not even confident that conventional radiographs, should the patient be able to cooperate for them, would clarify the issue. CT scanning, if feasible, would certainly confirm if the lungs are clear, but in the absence of a baseline study it might be difficult to distinguish atelectasis from pneumonia. Pleural effusion is minimal if any. Heart is probably not enlarged. Nasogastric tube is looped in the stomach. Right PIC line ends in the mid SVC. No pneumothorax.

Atelectasis
CheXpert: Positive T-auto: Positive CheXbert: Uncertain The report states that "it might be difficult to distinguish atelectasis from pneumonia" which indicates uncertainty, and this is correctly identified by CheXbert. CheXpert and T-auto simply label atelectasis as positive.
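Hedged uncertainty phrases like "might be difficult to distinguish" are exactly the kind of pattern a rule-based labeler must anticipate explicitly. A minimal sketch of phrase-list classification, using small hypothetical pattern lists (the actual CheXpert rule sets are far larger and more carefully engineered):

```python
import re

# Illustrative pattern lists only; not the real CheXpert rules.
UNCERTAINTY_PATTERNS = [
    r"\bmay\b", r"\bmight\b", r"\bpossibl[ey]\b",
    r"difficult to distinguish", r"cannot (?:be )?excluded?",
]
NEGATION_PATTERNS = [r"\bno\b", r"\bwithout\b", r"\bresolved\b"]

def classify_mention(sentence: str) -> str:
    """Label a sentence containing a finding mention as
    'uncertain', 'negative', or 'positive' (in that precedence)."""
    s = sentence.lower()
    if any(re.search(p, s) for p in UNCERTAINTY_PATTERNS):
        return "uncertain"
    if any(re.search(p, s) for p in NEGATION_PATTERNS):
        return "negative"
    return "positive"
```

Any hedge not in the pattern list is silently missed, which is why a lexical approach struggles with the diversity of uncertainty expressions radiologists use.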

Two frontal views of the chest show new mild interstitial pulmonary edema. Interval increase in mediastinal caliber therefore is probably due to distention of mediastinal veins. Heart size is slightly larger but still within normal range. Pleural effusions are minimal, if any. No focal pulmonary abnormality. No pneumothorax. ET tube is in standard placement and a nasogastric tube passes below the diaphragm and out of view.

Cardiomegaly
CheXpert: Positive T-auto: Positive CheXbert: Negative Although CheXpert and T-auto mistakenly label cardiomegaly as positive given the phrase "heart size is slightly larger," the following phrase "but still within normal range" implies that cardiomegaly is negative. CheXbert correctly classifies this example as negative for cardiomegaly.
As compared to the previous radiograph, the pre-existing right upper lobe pneumonia is completely resolved. The pre-existing signs of mild fluid overload, however, are still present. The pre-existing cardiomegaly is unchanged. Several calcified lung nodules are also unchanged. Unchanged alignment of the sternal wires. No acute pneumonia, no pleural effusions.

Pneumonia
CheXpert: Positive T-auto: Positive CheXbert: Negative CheXbert correctly labels pneumonia as negative, as implied by the phrase "pneumonia is completely resolved," while CheXpert and T-auto both mislabel pneumonia as positive.
Subsegmental right lung base atelectasis. Increasing loss of vertebral body height at T11. Stable L1 compression fracture. Right shoulder humeral DJD. Interval removal of PICC lines.

Support Devices
CheXpert: Positive T-auto: Positive CheXbert: Negative CheXbert, presumably using a semantic understanding of the word "removal", correctly labels support devices as negative. CheXpert and T-auto pick up on "PICC lines" but do not detect the negation. Both incorrectly label support devices as positive.
AP chest compared to : Small-to-moderate left pleural effusion has increased slightly over the past several days. Moderate enlargement of the cardiac silhouette accompanied by mediastinal vascular engorgement is also slightly more pronounced. Pulmonary vasculature is engorged but there is no edema. Consolidation has been present without appreciable change in the left lower lobe since at least . Mediastinum widened at the thoracic inlet by a combination of tortuous vessels and mediastinal fat deposition. Right jugular introducer ends just above the junction with left brachiocephalic vein.

Support Devices
CheXpert: Blank T-auto: Blank CheXbert: Positive A jugular introducer is a support device that is not included in CheXpert's list of mentions for support devices. Consequently, CheXpert and T-auto, which is trained on CheXpert labels, both incorrectly label support devices as blank. CheXbert, however, correctly labels support devices as positive.
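The failure mode above follows directly from lexical matching: a mention absent from the vocabulary can never fire. A minimal sketch, assuming a toy mention list (deliberately omitting "jugular introducer", and not the actual CheXpert vocabulary):

```python
# Toy mention list for "support devices"; illustrative only.
SUPPORT_DEVICE_MENTIONS = ["picc", "pacemaker", "endotracheal tube", "et tube"]

def label_support_devices(report: str) -> str:
    """Return 'positive' if any known mention phrase appears, else 'blank'.
    A purely lexical matcher cannot generalize to unlisted synonyms."""
    text = report.lower()
    if any(m in text for m in SUPPORT_DEVICE_MENTIONS):
        return "positive"
    return "blank"
```

Note that the same matcher also fires on "Interval removal of PICC lines", reproducing the negation blindness shown earlier: the mention is found, but the context of removal is ignored.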

1. Interval removal of the sternal wires with placement of new sternal closure devices, mediastinal staples and tubes. Lungs are well inflated with linear streaky opacities seen at the left base likely representing scarring and/or subsegmental atelectasis. No evidence of pulmonary edema, pneumothorax, pleural effusions or focal airspace consolidation to suggest pneumonia. Slight lucency at the left apex is felt to be related to underlying emphysema rather than representing a pneumothorax.

Pneumothorax
CheXpert: Positive T-auto: Positive CheXbert: Negative CheXbert correctly labels pneumothorax as negative, as the radiologist notes that the lucency is felt to be related to underlying emphysema rather than a pneumothorax. In this complex negation, T-auto and CheXpert incorrectly label pneumothorax as positive.

Table A7: Examples of additional data samples generated using backtranslation on radiologist-annotated reports from the CheXpert manual set. Augmenting our relatively small set of radiologist-annotated reports with backtranslation proved useful in improving performance of our labeler on the MIMIC-CXR test set.
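The backtranslation augmentation referenced in the caption can be sketched as a round trip through a pivot language, which paraphrases a report while leaving the radiologist's labels valid. The translation models themselves (e.g., neural MT systems) are assumed external here, and the function names and signatures are illustrative, not the paper's implementation:

```python
from typing import Callable, List, Tuple

Translator = Callable[[str], str]

def backtranslate(report: str, to_pivot: Translator, from_pivot: Translator) -> str:
    """Round-trip a report through a pivot language to obtain a paraphrase."""
    return from_pivot(to_pivot(report))

def augment(reports: List[str],
            pivots: List[Tuple[Translator, Translator]]) -> List[str]:
    """Keep each original report and add one backtranslated copy per
    pivot-language pair, multiplying the expert-annotated training set."""
    out = list(reports)
    for to_pivot, from_pivot in pivots:
        out.extend(backtranslate(r, to_pivot, from_pivot) for r in reports)
    return out
```

Because the labels attach to the whole report rather than to token spans, the paraphrased copies can reuse the original expert annotations unchanged.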