Learning Visual-Semantic Embeddings for Reporting Abnormal Findings on Chest X-rays

Automatic medical image report generation has drawn growing attention due to its potential to alleviate radiologists’ workload. Existing work on report generation often trains encoder-decoder networks to generate complete reports. However, such models are affected by data bias (e.g. label imbalance) and face issues common to text generation models (e.g. repetition). In this work, we focus on reporting abnormal findings on radiology images; instead of training on complete radiology reports, we propose a method to identify abnormal findings from the reports and to group them with unsupervised clustering and minimal rules. We formulate the task as cross-modal retrieval and propose Conditional Visual-Semantic Embeddings to align images and fine-grained abnormal findings in a joint embedding space. We demonstrate that our method is able to retrieve abnormal findings and outperforms existing generation models on both clinical correctness and text generation metrics.


Introduction
Understanding abnormal findings on radiographs (e.g. chest X-rays) is a crucial task for radiologists. There has been growing interest in automatic radiology report generation to alleviate the workload of radiologists and improve patient care. Following the success of neural network models on image-to-text generation tasks (e.g. image captioning), researchers have trained CNN-RNN encoder-decoder networks to generate reports given radiology images (Shin et al., 2016; Kougia et al., 2019).
Although such models are able to generate fluent reports, the generation quality is often limited by biases introduced from the training data or the training process. Figure 1 shows an example of chest X-rays (CXRs) and the associated reports from a public dataset (Johnson et al., 2019), along with the outputs generated by different models. One issue is that models trained on complete reports tend to generate normal findings, as normal findings dominate the dataset (Harzig et al., 2019); another is that such generation models struggle to generate long and diverse reports, as in other natural language generation (NLG) tasks.
In this work, we focus on reporting abnormal findings on radiology images, which are of higher importance to radiologists. To address issues of data bias, we propose a method to identify abnormal findings from existing reports and further use K-Means plus minimal mutual exclusivity rules to group these abnormal findings, which reduces the substantial burden of curating templates of abnormal findings. Given that radiology reports are highly similar and have a limited vocabulary (Gabriel et al., 2018), we propose a cross-modal retrieval method to capture relevant abnormal findings from radiology images. Our contributions are summarized as follows:
• We learn conditional visual-semantic embeddings on radiology images and reports by optimizing a triplet ranking loss, which can be used to measure the similarity between image regions and abnormal findings.
• We develop an automatic approach to identify and group abnormal findings from large collections of radiology reports.
• We conduct comprehensive experiments to show that our retrieval-based method trained on the abnormal findings largely outperforms encoder-decoder generation models on clinical correctness and NLG metrics.

[Figure 1 example output. CXR-CVSE (Abnormal): "increased density projecting over the spine which could be due to additional atelectasis; however, pneumonia is also possible. possible retrocardiac opacity could be prominent vessels but consolidation is not excluded and could represent pneumonia in the appropriate clinical setting."]

The dual word-level decoder model (Harzig et al., 2019) jointly predicts whether the next sentence is a normal or abnormal finding, and uses the corresponding decoder to generate the next sentence. However, it still formulates the task as text generation and inherits the limitations of such models.

Hybrid retrieval-generation models
There has been increasing interest in studying hybrid retrieval-generation models to complement generation. Li et al. (2018) introduced a hybrid retrieval-generation framework which decides at each step whether to retrieve a template or generate a sentence. Subsequent work proposed a model based on abnormality graphs, which first predicts the abnormalities present on the radiology images, then retrieves and paraphrases the templates of those abnormalities. However, such models usually require non-trivial human effort to construct high-quality prior knowledge (e.g. sentence templates, abnormality terms). Unlike previous work, we leverage unsupervised methods and minimal rules to group sentences into different abnormality clusters, seeking to minimize human effort.

Visual-semantic embeddings for cross-modal retrieval
Learning visually grounded semantics to facilitate cross-modal retrieval (i.e., image-to-text and text-to-image) is a challenging task in cross-modal learning (Faghri et al., 2018; Wu et al., 2019). Different from image captioning tasks, radiology reports are often longer and consist of multiple sentences, each related to different abnormal findings; meanwhile, there are fewer distinct objects in radiology images and the differences among images are more subtle.

Approach
Given radiology images $I_f$ and $I_l$ from the frontal and lateral views, hierarchical CNN-RNN based methods predict a complete medical report $R = \{s_1, s_2, \dots, s_N\}$ consisting of $N$ sentences. Each sentence $s_i$ is generated hierarchically:

$$p(R \mid E_f, E_l) = \prod_{i=1}^{N} p(s_i \mid s_{<i}, E_f, E_l), \qquad p(s_i \mid s_{<i}, E_f, E_l) = \prod_{t} p(w_i^t \mid w_i^{<t}, s_{<i}, E_f, E_l), \tag{1}$$

where $E_f$ and $E_l$ are the feature maps of the images $I_f$ and $I_l$ produced by the CNN encoder, and $w_i^t$ is the $t$-th word of the $i$-th sentence.
Instead of training such generation models, we approach the task as cross-modal retrieval. In particular, we propose a model that (1) measures the similarity between images and abnormal findings, and (2) identifies fine-grained relevant image regions for each abnormal finding.

Problem definition
Assume each report has an abnormal subset $R_a = \{a_1, a_2, \dots, a_M\}$ of $M$ abnormal findings (i.e., sentences). $R_a$ is a subset of the complete report $R = \{s_1, s_2, \dots, s_N\}$, where each sentence $s_i$ may or may not be an abnormal finding.
Let $v \in \mathbb{R}^{d_1}$ be the semantic embedding of an abnormal finding $a$ of this report, and $E = \{m_j \in \mathbb{R}^{d_2}\}_{j=1}^{w \times h}$ be the feature maps of the radiology image $I$ associated with $R_a$, where $j$ indexes the $j$-th region of the feature map. We first transform them into the joint embedding space $\mathbb{R}^d$ with separate linear projection layers:

$$v' = \frac{W_a v}{\lVert W_a v \rVert_2}, \qquad m'_j = \frac{W_m m_j}{\lVert W_m m_j \rVert_2},$$

where we apply $l_2$ normalization on the joint embeddings to improve training stability, following work in visual-semantic embeddings (Faghri et al., 2018).
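For concreteness, the projection step can be sketched in a few lines of pure Python (the `project` helper and its arguments are our illustrative names; the model learns the matrices $W_a$ and $W_m$ during training):

```python
import math

# Linear projection into the joint space followed by l2 normalization.
# Illustrative sketch only; W is a learned matrix in the real model.
def project(x, W):
    """Project vector x with matrix W (given as rows), then l2-normalize."""
    y = [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]
    norm = math.sqrt(sum(v * v for v in y)) or 1.0  # guard the zero vector
    return [v / norm for v in y]
```

After this step every embedding lies on the unit sphere, so distances between text and image embeddings are directly comparable.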
Next, we need to measure the similarity between the semantic and visual embeddings. As different regions may include details about different abnormal findings, we propose Conditional Visual-Semantic Embeddings (CVSE) to learn the fine-grained matching between regions and a target abnormal finding:

$$s_j = -\lVert v' - m'_j \rVert_2^2, \qquad \alpha_j = \frac{\exp(s_j)}{\sum_k \exp(s_k)}, \qquad d(a, I) = \sum_j \alpha_j s_j,$$

where $\alpha_j$ is the attention score that represents the relevance between the region $m_j$ and the abnormal finding $v$, and $d(a, I)$ is the similarity score between image $I$ and abnormal finding $a$, calculated as an attention-weighted sum over the similarity scores of each region with the abnormal finding. We use the (negative) squared $l_2$ distance to measure similarity. Since each report has both frontal and lateral views, the final similarity score is calculated as the average:

$$d(a, I_f, I_l) = \tfrac{1}{2}\left(d(a, I_f) + d(a, I_l)\right).$$

Finally, we optimize the hinge-based triplet ranking loss to learn the visual-semantic embeddings:

$$\mathcal{L} = \sum \left[\delta - d(a^+, I^+) + d(a^-, I^+)\right]_+ + \left[\delta - d(a^+, I^+) + d(a^+, I^-)\right]_+,$$

where $\delta$ is the margin, $[x]_+ = \max(x, 0)$ is the hinge function, and $a^+$ ($I^+$) denotes a matched abnormal finding (image) from the training set while $a^-$ ($I^-$) denotes an unmatched abnormal finding (image) sampled during training.
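A minimal pure-Python sketch of the similarity and the loss, assuming embeddings are already projected into the joint space and $l_2$-normalized (all function names are ours, not from the paper's released code):

```python
import math

def sq_l2(u, v):
    """Squared l2 distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def cvse_similarity(v, regions):
    """Attention-weighted similarity d(a, I) between an abnormal-finding
    embedding v and the region embeddings of one image view."""
    scores = [-sq_l2(v, m) for m in regions]           # s_j = -||v' - m'_j||^2
    z = sum(math.exp(s) for s in scores)
    alphas = [math.exp(s) / z for s in scores]         # softmax attention
    return sum(a * s for a, s in zip(alphas, scores))  # sum_j alpha_j * s_j

def triplet_loss(d_pos, d_neg_text, d_neg_img, margin=0.2):
    """Hinge-based triplet ranking loss for one tuple: the matched pair
    must beat each mismatched pair by at least the margin."""
    return (max(0.0, margin - d_pos + d_neg_text)
            + max(0.0, margin - d_pos + d_neg_img))
```

In the full model, `cvse_similarity` is averaged over the frontal and lateral views, and the loss is summed over the sampled negatives.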

Extracting and clustering abnormal findings
To identify abnormal findings in radiology reports, we train a sentence-level classifier which determines whether a sentence includes abnormal findings or not. We fine-tune BERT (Devlin et al., 2019) on an annotated sentence-level dataset released by Harzig et al. (2019), a labeled subset of the Open-I dataset, and achieve an F1-score of 98.3 on the held-out test set. We then use the classifier to distantly label the reports from the MIMIC-CXR dataset (Johnson et al., 2019), the largest public CXR imaging report dataset. Given that most medical reports are written following certain templates, many abnormal findings are paraphrases of each other. We obtain sentence embeddings via pre-trained models and apply K-Means to cluster sentences about similar abnormal findings into 500 groups. We also design several simple mutual exclusivity rules to refine the groupings: we consider critical attributes such as position (e.g. left, right) and severity (e.g. mild, severe), which typically do not co-occur in the same finding, and apply these rules to separate each group formed by K-Means. Ultimately, we obtain 1,306 groups of abnormal findings.
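The rule-based refinement step can be sketched as follows: after K-Means groups similar sentences, each cluster is split so that sentences carrying mutually exclusive attributes (e.g. "left" vs. "right") land in different groups. The attribute sets below are illustrative examples, not the paper's exact rule list:

```python
# Example mutually exclusive attribute sets (illustrative only).
EXCLUSIVE_ATTRS = [
    ("left", "right"),
    ("mild", "moderate", "severe"),
]

def attr_signature(sentence):
    """Return which attribute (if any) from each exclusive set appears."""
    tokens = set(sentence.lower().split())
    sig = []
    for group in EXCLUSIVE_ATTRS:
        hits = [a for a in group if a in tokens]
        sig.append(hits[0] if hits else None)
    return tuple(sig)

def split_cluster(sentences):
    """Split one K-Means cluster into sub-groups by attribute signature."""
    refined = {}
    for s in sentences:
        refined.setdefault(attr_signature(s), []).append(s)
    return list(refined.values())
```

Applied to all 500 K-Means clusters, this kind of splitting is what grows the grouping to the final 1,306 groups.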

Experiments
We compare CVSE with state-of-the-art report generation models and simple baseline models to answer two research questions. RQ1: Does our retrieval-based method outperform generation models? RQ2: Do the visual-semantic embeddings capture abnormal findings grounded in the images?

Baselines
We consider (1) the Hier-CNN-RNN model (Jing et al., 2017; Liu et al., 2019), as denoted in eq. (1); (2) Hier-CNN-RNN + co-attention (Jing et al., 2017), with co-attention on both the images and the predicted medical concepts; (3) Hier-CNN-RNN + dual, with dual word-level decoders (Harzig et al., 2019). We also implement two simple variants: (4) Hier-CNN-RNN + complete, which takes the complete medical reports (i.e., both normal and abnormal findings) as input; (5) Hier-CNN-RNN + shuffle, whose input reports have a shuffled sentence order. Vinyals et al. (2015) have shown that input order affects the performance of encoder-decoder models, and (5) could potentially address the training issue caused by the static input order.
In all experiments, the abnormal set and the complete set consist of the same (image, report) pairs. As discussed in Section 3.1, the abnormal set only considers the abnormal finding sentences of each report, a subset of the sentences of the complete report. We compare these two sets to show that models trained on the abnormal sentences achieve substantial improvements over those trained on the complete reports, which has not been studied before.
We use the CheXpert labeler, the state-of-the-art medical report labeling system (Irvin et al., 2019; Johnson et al., 2019), to evaluate the clinical accuracy of the abnormal findings reported by each model. Given sentences of abnormal findings, CheXpert assigns a positive or negative label for each of 14 diseases. We then calculate precision, recall, and accuracy for each disease based on the labels obtained from each model's output and from the ground-truth reports.
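The per-disease metrics are standard binary-classification quantities computed over the labeler's outputs; a short sketch (the function name and signature are ours, and the binary labels stand in for what a labeler such as CheXpert would produce):

```python
def disease_metrics(pred, gold):
    """Precision, recall, and accuracy for one disease, given 0/1 labels
    per report from the model output (pred) and ground truth (gold)."""
    tp = sum(p == 1 and g == 1 for p, g in zip(pred, gold))
    fp = sum(p == 1 and g == 0 for p, g in zip(pred, gold))
    fn = sum(p == 0 and g == 1 for p, g in zip(pred, gold))
    tn = sum(p == 0 and g == 0 for p, g in zip(pred, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(gold)
    return precision, recall, accuracy
```

Note that a model which never predicts a disease gets 0 precision and recall for it, which is exactly the failure mode discussed in Section 5.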

Implementation details
We consider CXRs from the MIMIC-CXR dataset with both frontal and lateral views that include at least one abnormal finding, yielding 26,946/3,801/7,804 CXRs for the train/dev/test sets, respectively. For the CVSE model, we set the margin δ to 0.2 and randomly pick 8 negative samples for each sample. We use a pre-trained DenseNet-121 to obtain the feature maps of the CXR images, and pre-trained biomedical sentence embeddings to obtain initial embeddings for the abnormal findings. The dimension of the joint embedding space d is set to 512. We take the top 3 retrieval results as the predicted abnormal findings. For all CNN-RNN based models, we use a VGG-19 model as the encoder, a 1-layer LSTM as the sentence decoder, and a 2-layer LSTM as the word decoder; all dimensions are set to 512. Greedy search is applied during decoding, following Jing et al. (2017). Our code is available online.
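At inference time, retrieval amounts to scoring every candidate abnormal-finding group against the image and keeping the top k (k = 3 in our experiments); a minimal sketch, where `score_fn` stands in for the learned CVSE similarity and the names are ours:

```python
# Rank candidates by similarity to the input image, keep the top k.
def retrieve_top_k(score_fn, candidates, k=3):
    return sorted(candidates, key=score_fn, reverse=True)[:k]
```
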

Performance comparison
We conduct experiments on both the abnormal and the complete set of the MIMIC-CXR dataset, which consider the abnormal findings in reports and the complete reports, respectively. As shown in Table 1, adding co-attention over medical concepts and adding dual decoders both improve the vanilla Hier-CNN-RNN model's clinical accuracy on the complete set. However, simply training the Hier-CNN-RNN model on the abnormal set achieves even better clinical accuracy, which shows the importance of addressing dataset bias. We also observe that the Hier-CNN-RNN model with a shuffled sentence order does not improve performance, which indicates the difficulty of addressing order bias when training encoder-decoder models.
Our CVSE model outperforms all baselines on clinical accuracy metrics, which demonstrates its capability to accurately report abnormal findings. Notably, CVSE achieves significant improvements on precision and recall, whereas the baseline models consistently miss certain abnormal findings, leading to 0 precision and recall for many disease classes. More detailed results are included in the appendices.

[Figure 2 example outputs.
Real: "pa and lateral chest radiographs demonstrate a left basilar opacity most consistent with atelectasis, though an underlying infectious process cannot be excluded." Prediction: "increased density projecting over the spine which could be due to additional atelectasis; however, pneumonia is also possible."
Real: "heart size is mildly enlarged." Prediction: "interval increase in heart size."
Real: "sternotomy wires and post-surgical clips project over the cardiac silhouette." Prediction: "sternotomy wires and mediastinal clips are again noted."]
Refining the groups with mutual exclusivity rules further improves the performance of CVSE. We also report automatic NLG evaluation metrics: as shown in Table 1, CVSE achieves higher scores than the other baselines on the abnormal set.

Qualitative analysis
We performed a human evaluation in which we sampled 20 images and asked a board-certified radiologist to give Likert scores (1 to 10) based on how closely the results generated by each model relate to the input images. The ground truth obtained an average score of 7.85; our CVSE achieved 6.35, higher than the 6.15 obtained by Hier-CNN-RNN trained on the abnormal set. The radiologist commented that Hier-CNN-RNN's outputs were simpler predictions with fewer details, whereas CVSE covered more abnormalities but sometimes included false information.
In Figure 2, we visualize the attended regions on CXRs to investigate which parts are important for reporting abnormal findings. We observe that our attention mechanism is able to detect the relevant regions (e.g. heart, left opacity, wires) to determine which abnormal findings are present in the CXRs.

Conclusions
In this paper, we study how to build assistive medical imaging systems that report abnormal findings on radiology images.


A.1 Mutual exclusivity rules
• improve|resolve|clear, worsen

A.2 Parameter settings
We use PyTorch to implement all models and run them on 2 1080Ti GPUs. We resize all images to 512 × 512 for both models. For all experiments, we save the models that perform best on the validation set: for CVSE, we measure recall on the validation set; for the CNN-RNN models, we consider validation perplexity. For CVSE, we use the Adam optimizer with a learning rate of 0.001 and train for 40 epochs. For all Hier-CNN-RNN models, we set the learning rates for the encoder and decoder to 5e-6 and 2e-4, respectively, and train for 100 epochs. We use a VGG-19 model as the encoder, a 1-layer LSTM as the sentence decoder, and a 2-layer LSTM as the word decoder; we observe slightly better performance from VGG-19 than from DenseNet-121 for the generation models. For models that require medical concepts, we use SemRep (https://semrep.nlm.nih.gov/), a UMLS-based program released by NIH, to extract 93 highly frequent medical concepts from the training set.

Table 2 shows the detailed accuracy, precision, and recall on all 14 diseases for our CVSE model with mutual exclusivity rules and for the Hier-CNN-RNN model trained on the abnormal set. Overall, CVSE outperforms Hier-CNN-RNN on the macro-average of accuracy, precision, and recall.
Notably, CVSE achieves higher recall on 12 of the 14 diseases with comparable or higher precision. Meanwhile, Hier-CNN-RNN outputs no positive predictions on 4 disease types that are dominated by negative findings, which shows its limited capability to generate diverse predictions.