Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation

Neural image-to-text radiology report generation systems offer the potential to improve radiology reporting by reducing the repetitive process of report drafting and identifying possible medical errors. However, existing report generation systems, despite achieving high performance on natural language generation metrics such as CIDEr or BLEU, still suffer from incomplete and inconsistent generations. Here we introduce two new simple rewards to encourage the generation of factually complete and consistent radiology reports: one that encourages the system to generate radiology domain entities consistent with the reference, and one that uses natural language inference to encourage these entities to be described in inferentially consistent ways. We combine these with the novel use of an existing semantic equivalence metric (BERTScore). We further propose a report generation system that optimizes these rewards via reinforcement learning. On two open radiology report datasets, our system substantially improved the F1 score of a clinical information extraction metric by +22.1 (∆+63.9%). We further show via a human evaluation and a qualitative analysis that our system leads to generations that are more factually complete and consistent compared to the baselines.


Introduction
An important new application of natural language generation (NLG) is to build assistive systems that take X-ray images of a patient and generate a textual report describing clinical observations in the images (Jing et al., 2018; Li et al., 2018; Liu et al., 2019; Boag et al., 2020; Chen et al., 2020). Figure 1 shows an example of a radiology report generated by such a system. This is a clinically important task, offering the potential to reduce radiologists' repetitive work and generally improve clinical communication (Kahn et al., 2009).

Reference report: Large right pleural effusion is unchanged in size. There is associated right basilar atelectasis/scarring, also stable. Healed right rib fractures are noted. On the left, there is persistent apical pleural thickening and apical scarring. Linear opacities projecting over the lower lobe are also compatible with scarring, unchanged. There is no left pleural effusion. There is no pneumothorax. …

Generated report (image encoder → text decoder): … The heart size remains unchanged and is within normal limits. Unchanged appearance of thoracic aorta. The pulmonary vasculature is not congested. Bilateral pleural effusions are again noted and have increased in size on the right than the left. The left-sided pleural effusion has increased in size and is now moderate in size.

Figure 1: A (partial) example of a report generated from our system (with ". . . " representing abbreviated text). The system encodes images and generates text from that encoded representation. Underlined words are disease and anatomy entities. The shaded sentences are an example of a contradictory pair.
Automatic radiology report generation systems have achieved promising performance as measured by widely used NLG metrics such as CIDEr (Vedantam et al., 2015) and BLEU (Papineni et al., 2002) on several datasets (Li et al., 2018; Jing et al., 2019; Chen et al., 2020). However, reports that achieve high performance on these NLG metrics are not always factually complete or consistent. In addition to the use of inadequate metrics, the factual incompleteness and inconsistency issue in generated reports is further exacerbated by the inadequate training of these systems. Specifically, the standard teacher-forcing training algorithm (Williams and Zipser, 1989) used by most existing work can lead to a discrepancy between what the model sees during training and test time (Ranzato et al., 2016), resulting in degenerate outputs with factual hallucinations (Maynez et al., 2020). Liu et al. (2019) and Boag et al. (2020) have shown that reports generated by state-of-the-art systems still have poor quality when evaluated by clinical metrics as measured with an information extraction system designed for radiology reports. For example, the generated report in Figure 1 is incomplete since it neglects an observation of atelectasis that can be found in the images. It is also inconsistent since it mentions left-sided pleural effusion which is not present in the images. Indeed, we show that existing systems are inadequate in factual completeness and consistency, and that an image-to-text radiology report generation system can be substantially improved by replacing widely used NLG metrics with simple alternatives.
We propose two new simple rewards that can encourage the factual completeness and consistency of the generated reports. First, we propose the Exact Entity Match Reward (fact ENT ) which captures the completeness of a generated report by measuring its coverage of entities in the radiology domain, compared with a reference report. The goal of the reward is to better capture disease and anatomical knowledge that are encoded in the entities. Second, we propose the Entailing Entity Match Reward (fact ENTNLI ), which extends fact ENT with a natural language inference (NLI) model that further considers how inferentially consistent the generated entities are with their descriptions in the reference. We add NLI to control the overestimation of disease when optimizing towards fact ENT . We use these two metrics along with an existing semantic equivalence metric, BERTScore (Zhang et al., 2020a), to potentially capture synonyms (e.g., "left and right" effusions are synonymous with "bilateral" effusions) and distant dependencies between diseases (e.g., a negation like ". . . but underlying consolidation or other pulmonary lesion not excluded") that are present in radiology reports.
Although recent work in summarization, dialogue, and data-to-text generation has tried to address this problem of factual incompleteness and inconsistency by using natural language inference (NLI) (Falke et al., 2019; Welleck et al., 2019), question answering (QA) (Wang et al., 2020a), or content matching constraint (Wang et al., 2020b) approaches, they either show negative results or are not directly applicable to the generation of radiology reports due to substantial task and domain differences. To construct the NLI model for fact ENTNLI , we present a weakly supervised approach that adapts an existing NLI model to the radiology domain. We further present a report generation model which directly optimizes a Transformer-based architecture with these rewards using reinforcement learning (RL).
We evaluate our proposed report generation model on two publicly available radiology report generation datasets. We find that optimizing the proposed rewards along with BERTScore by RL leads to generated reports that achieve substantially improved performance in the important clinical metrics (Liu et al., 2019; Boag et al., 2020; Chen et al., 2020), demonstrating the higher clinical value of our approach. We make all our code and the expert-labeled test set for evaluating the radiology NLI model publicly available to encourage future research. To summarize, our contributions in this paper are:

1. We propose two simple rewards for image-to-text radiology report generation, which focus on capturing the factual completeness and consistency of generated reports, and a weak supervision-based approach for training a radiology-domain NLI model to realize the second reward.

2. We present a new radiology report generation model that directly optimizes these new rewards with RL, showing that previous approaches that optimize traditional NLG metrics are inadequate, and that the proposed approach substantially improves performance on clinical metrics (as much as ∆+64.2%) on two publicly available datasets.
Related Work

Image-to-Text Radiology Report Generation

Wang et al. (2018) and Jing et al. (2018) first proposed multi-task learning models that jointly generate a report and classify disease labels from a chest X-ray image. Their models were extended to use multiple images (Yuan et al., 2019), to adopt a hybrid retrieval-generation model (Li et al., 2018), or to consider structure information (Jing et al., 2019).
More recent work has focused on generating reports that are clinically consistent and accurate. Liu et al. (2019) presented a system that generates accurate reports by fine-tuning it with their Clinically Coherent Reward. Boag et al. (2020) evaluated several baseline generation systems with clinical metrics and found that standard NLG metrics are ill-equipped for this task. Very recently, Chen et al. (2020) proposed an approach to generate radiology reports with a memory-driven Transformer. Our work is most related to Liu et al. (2019); their system, however, is dependent on a rule-based information extraction system specifically created for chest X-ray reports and has limited robustness and generalizability to different domains within radiology. By contrast, we aim to develop methods that improve the factual completeness and consistency of generated reports by harnessing more robust statistical models and are easily generalizable.

Consistency and Faithfulness in Natural Language Generation
A variety of recent work has focused on consistency and faithfulness in generation. Our work is inspired by Falke et al. (2019), Welleck et al. (2019), and Matsumaru et al. (2020) in using NLI to rerank or filter generations in text summarization, dialogue, and headline generation systems, respectively. Other attempts in this direction include evaluating consistency in generations using QA models (Durmus et al., 2020; Wang et al., 2020a; Maynez et al., 2020), with distantly supervised classifiers (Kryściński et al., 2020), and with task-specific content matching constraints (Wang et al., 2020b). Liu et al. (2019) and Zhang et al. (2020b) studied improving the factual correctness in generating radiology reports with rule-based information extraction systems. Our work mainly differs from theirs in the direct optimization of factual completeness with an entity-based reward and of factual consistency with a statistical NLI-based reward.

Image Captioning with Transformer
The problem of generating text from image data has been widely studied in the image captioning setting. While early work focused on combining convolutional neural network (CNN) and recurrent neural network (RNN) architectures (Vinyals et al., 2015), more recent work has discovered the effectiveness of using the Transformer architecture (Vaswani et al., 2017). Li et al. (2019) and Pan et al. (2020) introduced an attention process to exploit semantic and visual information into this architecture. Herdade et al. (2019), Cornia et al. (2020), and Guo et al. (2020) extended this architecture to learn geometrical and other relationships between input  regions. We find Meshed-Memory Transformer (Cornia et al., 2020) (M 2 Trans) to be more effective in our radiology report generation task than the traditional RNN-based models and Transformer models (an empirical result will be shown in §4), and therefore use it as our base architecture.

Methods
Image-to-Text Radiology Report Generation with M 2 Trans

Formally, given K individual images x 1...K of a patient, our task involves generating a sequence of words to form a textual report ŷ, which describes the clinical observations in the images. This task resembles image captioning, except with multiple images as input and longer text sequences as output. We therefore extend a state-of-the-art image captioning model, M 2 Trans (Cornia et al., 2020), with multi-image input as our base architecture. We first briefly introduce this model and refer interested readers to Cornia et al. (2020). Figure 2 illustrates an overview of the M 2 Trans model. Given an image x k , image regions are first extracted with a CNN as X = CNN(x k ). X is then encoded with a memory-augmented attention process M mem (X) as

M mem (X) = Att(W q X, K, V),
K = [W k X; M k ], V = [W v X; M v ],
Att(Q, K, V) = softmax(QK ⊤ / √d) V,

where W q , W k , W v are weights, M k , M v are memory matrices, d is a scaling factor, and [∗; ∗] is the concatenation operation. Att(Q, K, V) is an attention process derived from the Transformer architecture (Vaswani et al., 2017) and extended to include memory matrices that can encode a priori knowledge between image regions. In the encoder, this attention process is a self-attention process since all of the query Q, the key K, and the value V depend on X. M mem (X) is further processed with a feed forward layer, a residual connection, and a layer normalization to output X̃. This encoding process can be stacked N times and is applied to K images, and the n-th layer output for the K images is denoted X̃ n,K . The meshed decoder first processes an encoded text Y with a masked self-attention and further processes it with a feed forward layer, a residual connection, and a layer normalization to output Ÿ. Ÿ is then passed to a cross attention C(X̃ n,K , Ÿ) and a meshed attention M mesh (X̃ N,K , Ÿ) as

C(X̃ n,K , Ÿ) = Att(W q Ÿ, W k X̃ n,K , W v X̃ n,K ),
M mesh (X̃ N,K , Ÿ) = Σ n=1..N α n ⊙ max K C(X̃ n,K , Ÿ),
α n = σ(W n [Ÿ; max K C(X̃ n,K , Ÿ)] + b n ),

where ⊙ is element-wise multiplication, max K is max-pooling over K images, σ is the sigmoid function, W n is a weight, and b n is a bias.
The weighted summation in M mesh (X̃ N,K , Ÿ) exploits both low-level and high-level information from the N stacked encoder layers. Differing from the self-attention process in the encoder, the cross attention uses a query that depends on Ÿ and a key and a value that depend on X̃. M mesh (X̃ N,K , Ÿ) is further processed with a feed forward layer, a residual connection, and a layer normalization to output Ỹ. As in the encoder, the decoder can be stacked N times to output Ỹ N . Ỹ N is further passed to a feed forward layer to output the report ŷ.
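As an illustration, the memory-augmented attention of the encoder can be sketched in NumPy. This is a single-head reconstruction for exposition only (the projection shapes and the absence of an output projection are simplifying assumptions), not the authors' implementation:

```python
import numpy as np

def softmax(z):
    # numerically stable row-wise softmax
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def memory_attention(X, Wq, Wk, Wv, Mk, Mv):
    # memory-augmented self-attention: the keys and values are
    # extended with learned memory slots (Mk, Mv) that can encode
    # a priori knowledge between image regions
    K = np.concatenate([X @ Wk, Mk], axis=0)
    V = np.concatenate([X @ Wv, Mv], axis=0)
    return attention(X @ Wq, K, V)
```

In the full model this output would pass through a feed forward layer, a residual connection, and layer normalization, and the whole block would be stacked N times.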

Exact Entity Match Reward (fact ENT )
We designed an F-score entity match reward to capture factual completeness. This reward assumes that entities encode disease and anatomical knowledge that relates to factual completeness. A named entity recognizer is applied to the generated report ŷ and the corresponding reference report y. Given entities E gen and E ref recognized from ŷ and y respectively, the precision (pr) and recall (rc) of the entity match are calculated as

pr = (1 / |E gen |) Σ e∈E gen δ(e, E ref ),
rc = (1 / |E ref |) Σ e∈E ref δ(e, E gen ),

δ(e, E) = 1, for e ∈ E; 0, otherwise. (10)

The harmonic mean of precision and recall is taken as fact ENT to reward a balanced match of entities. We used Stanza (Qi et al., 2020) and its clinical models (Zhang et al., 2020c) as a named entity recognizer for radiology reports. For example, in the case of Figure 1, the common entities among the reference report and the generated report are pleural and effusion, resulting in fact ENT = 33.3.
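Once entities are extracted (the paper uses Stanza's clinical NER for this step), fact ENT reduces to a set-level F-score. A minimal sketch, with hypothetical entity lists:

```python
def fact_ent(gen_entities, ref_entities):
    """Harmonic mean of entity-match precision and recall (fact_ENT)."""
    gen, ref = set(gen_entities), set(ref_entities)
    if not gen or not ref:
        return 0.0
    overlap = gen & ref
    pr = len(overlap) / len(gen)  # precision: matched / generated entities
    rc = len(overlap) / len(ref)  # recall: matched / reference entities
    return 0.0 if pr + rc == 0 else 2 * pr * rc / (pr + rc)
```

For instance, with a generated set {pleural, effusion, heart} against a reference set {pleural, effusion, atelectasis, pneumothorax}, pr = 2/3 and rc = 1/2, giving fact ENT = 4/7 ≈ 0.571.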

Entailing Entity Match Reward (fact ENTNLI )
We additionally designed an F-score style reward that expands fact ENT with NLI to capture factual consistency. NLI is used to control the overestimation of disease when optimizing towards fact ENT .
In fact ENTNLI , δ in Eq. 10 is expanded to

δ nli (e, E) = 1, for e ∈ E ∧ NLI e (P, h) ≠ contradiction; 1, for NLI e (P, h) = entailment; 0, otherwise, (11)

NLI e (P, h) = nli(argmax p∈P sim(p, h), h),

where h is a sentence that includes e, P is all sentences in the counterpart text (if h is a sentence in a generated report, P is all sentences in the corresponding reference report), nli(∗, ∗) is an NLI function that returns an NLI label which is one of {entailment, neutral, contradiction}, and sim(∗, ∗) is a text similarity function. We used BERTScore (Zhang et al., 2020a) as sim(∗, ∗) in the experiments (the details of BERTScore can be found in Appendix A). The harmonic mean of precision and recall is taken as fact ENTNLI to encourage a balanced factual consistency between a generated text and the corresponding reference text. For example, in the case of Figure 1, the sentence "The left-sided pleural effusion has increased in size and is now moderate in size." will be contradictory to "There is no left pleural effusion.", resulting in pleural and effusion being rejected in E gen .
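The expanded δ can be sketched as follows. Here `nli` and `sim` are injected stand-ins for the radiology NLI model and BERTScore, and running NLI against the premise sentence most similar to the hypothesis reflects our reading of the NLI e definition:

```python
def delta_nli(entity, counterpart_entities, hypothesis, premises, nli, sim):
    """Entailment-aware entity match (the expanded delta of fact_ENTNLI).

    `nli(premise, hypothesis)` returns one of {"entailment", "neutral",
    "contradiction"}; `sim(a, b)` scores sentence similarity. Both are
    injected so any NLI model / similarity metric can be plugged in.
    """
    # run NLI against the counterpart sentence most similar to the hypothesis
    premise = max(premises, key=lambda p: sim(p, hypothesis))
    label = nli(premise, hypothesis)
    if entity in counterpart_entities and label != "contradiction":
        return 1  # entity matched and its description is not contradicted
    if label == "entailment":
        return 1  # the description is entailed even without an entity match
    return 0
```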

Joint Loss for Optimizing Factual Completeness and Consistency
We integrate the proposed factual rewards into self-critical sequence training (Rennie et al., 2017). An RL loss L RL is minimized as the negative expectation of the reward r:

L RL = −E ŷ∼pθ [r(ŷ)].

The gradient of the loss is estimated with a single Monte Carlo sample as

∇ θ L RL ≈ −(r(ŷ sp ) − r(ŷ gd )) ∇ θ log p θ (ŷ sp ),

where ŷ sp is a sampled text and ŷ gd is a greedy decoded text. Paulus et al. (2018) and Zhang et al. (2020b) have shown that a generation can be improved by combining multiple losses. We combine a factual metric loss with a language model loss and an NLG loss as

L = λ 1 L NLL + λ 2 L RL_NLG + λ 3 L RL_FACT ,

where L NLL is a language model loss, L RL_NLG is the RL loss using an NLG metric (e.g., CIDEr or BERTScore), L RL_FACT is the RL loss using a factual reward (e.g., fact ENT or fact ENTNLI ), and λ ∗ are scaling factors to balance the multiple losses.
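In scalar form, the self-critical estimator and the joint objective look like the following sketch; in a real system these would be tensor operations over token-level log-probabilities rather than plain floats:

```python
def scst_loss_term(r_sample, r_greedy, logprob_sample):
    """Self-critical baseline: REINFORCE with the greedy-decoding reward
    as baseline, i.e. the surrogate whose gradient matches
    -(r(y_sp) - r(y_gd)) * grad log p(y_sp)."""
    return -(r_sample - r_greedy) * logprob_sample

def joint_loss(l_nll, l_rl_nlg, l_rl_fact, lam1, lam2, lam3):
    # weighted combination of the language-model loss and the two RL losses
    return lam1 * l_nll + lam2 * l_rl_nlg + lam3 * l_rl_fact
```

When the sampled reward does not beat the greedy reward, the advantage (and hence the gradient signal) is zero or negative, which is what keeps the training anchored to the greedy policy.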

A Weakly-Supervised Approach for Radiology NLI
We propose a weakly-supervised approach to construct an NLI model for radiology reports. (There already exists an NLI dataset for the medical domain, MedNLI (Romanov and Shivade, 2018), but we found that a model trained on MedNLI does not work well on radiology reports.) Given a large scale dataset of radiology reports, a sentence pair is sampled and filtered with weakly-supervised rules. The rules are prepared to extract a randomly sampled sentence pair (s 1 and s 2 ) that is in an entailment, neutral, or contradiction relation. We designed the following 6 rules for weak supervision.
Entailment 1 (E1) (1) s 1 and s 2 are semantically similar and (2) the NE of s 2 are a subset of or equal to the NE of s 1 .

Neutral 1 (N1) (1) s 1 and s 2 are semantically similar and (2) the NE of s 1 are a proper subset of the NE of s 2 .

Neutral 2 (N2) (1) the NE of s 1 are equal to the NE of s 2 and (2) s 1 includes an antonym of a word in s 2 .

Neutral 3 (N3) (1) the NE types of s 1 are equal to the NE types of s 2 and (2) the NE of s 1 and s 2 do not overlap.

Neutral 4 (N4) (1) the NE of s 1 are equal to the NE of s 2 and (2) s 1 and s 2 include observation keywords from different groups.

Contradiction 1 (C1) (1) the NE of s 1 are equal to or a subset of the NE of s 2 and (2) s 1 is a negation of s 2 .
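The set-based conditions of rules E1, N1, N2, and C1 can be sketched as follows (N3 and N4 additionally require entity types and observation keywords). The similarity, negation, and antonym checks are passed in as precomputed flags, standing in for BERTScore, NegBio, and WordNet:

```python
def weak_label(ne1, ne2, similar, s1_negates_s2=False, has_antonym=False):
    """Weak NLI label for a sentence pair (s1, s2) from entity-set
    relations. `ne1`/`ne2` are the named-entity sets of s1/s2."""
    if s1_negates_s2 and ne1 <= ne2:
        return "contradiction"        # C1: NE(s1) ⊆ NE(s2), s1 negates s2
    if similar and ne2 <= ne1:
        return "entailment"           # E1: NE(s2) ⊆ NE(s1)
    if similar and ne1 < ne2:
        return "neutral"              # N1: NE(s1) ⊊ NE(s2)
    if ne1 == ne2 and has_antonym:
        return "neutral"              # N2: antonymous modifiers
    return None                       # no rule fires; the pair is discarded
```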
The rules rely on a semantic similarity measure and the overlap of entities to determine the relationship between s 1 and s 2 . In the neutral rules and the contradiction rule, we included similarity measures to avoid extracting sentence pairs that are easy to distinguish. We evaluated this NLI model by preparing training, validation, and test data. For the training data, the training set of MIMIC-CXR (Johnson et al., 2019) is used as the source of sentence pairs. 2k pairs are extracted for each of E1 and C1, and 0.5k pairs for each of N1, N2, N3, and N4, resulting in a total of 6k pairs. The training set of MedNLI is also used as additional data. For the validation and test data, we sampled 480 sentence pairs from the validation section of MIMIC-CXR and had them annotated by two experts: one medical expert and one NLP expert. Each pair is annotated twice, swapping its premise and hypothesis, resulting in 960 pairs, which are split in half into a 480-pair validation set and a 480-pair test set. The test set of MedNLI is also used as alternative test data.
We used BERT (Devlin et al., 2019) as an NLI model since it was a strong baseline in previous MedNLI work (Ben Abacha et al., 2019), and used Stanza (Qi et al., 2020) and its clinical models (Zhang et al., 2020c) as a named entity recognizer. Table 1 shows the results of the model trained with and without the weakly-supervised data. The accuracy of NLI on radiology data increased substantially by +24.5% with the addition of the radiology NLI training set. (See Appendix A for details of the rules, the datasets, and the model configuration.)

Data
We used the training and validation sets of MIMIC-CXR (Johnson et al., 2019) to train and validate models. MIMIC-CXR is a large publicly available database of chest radiographs. We extracted the findings sections from the reports with a text extraction tool for MIMIC-CXR (https://github.com/MIT-LCP/mimic-cxr/tree/master/txt), and used them as our reference reports as in previous work (Liu et al., 2019; Boag et al., 2020). The findings section is a natural language description of the important aspects in a radiology image. The reports with empty findings sections were discarded, resulting in 152173 and 1196 reports for the training and validation sets, respectively. We used the test set of MIMIC-CXR and the entire Open-i Chest X-ray dataset (Demner-Fushman et al., 2012) as two individual test sets. Open-i is another publicly available database of chest radiographs which has been widely used in past studies. We again extracted the findings sections, resulting in 2347 reports for MIMIC-CXR and 3335 reports for Open-i. Open-i is used only for testing since the number of reports is too small to train and test a neural report generation model.

Evaluation Metrics
BLEU4, CIDEr-D & BERTScore: We first use general NLG metrics to evaluate the generation quality. These metrics include the 4-gram BLEU score (Papineni et al., 2002, BLEU4), the CIDEr score (Vedantam et al., 2015) with gaming penalties (CIDEr-D), and the F1 score of BERTScore (Zhang et al., 2020a).

Clinical Metrics: However, NLG metrics such as BLEU and CIDEr are known to be inadequate for evaluating factual completeness and consistency. We therefore followed previous work (Liu et al., 2019; Boag et al., 2020; Chen et al., 2020) by additionally evaluating the clinical accuracy of the generated reports using a clinical information extraction system. We use CheXbert (Smit et al., 2020), an information extraction system for chest reports, to extract the presence status of a series of observations (i.e., whether a disease is present or not), and score a generation by comparing the values of these observations to those obtained from the reference. The micro average of accuracy, precision, recall, and F1 scores are calculated over 5 observations (following previous work (Irvin et al., 2019)): atelectasis, cardiomegaly, consolidation, edema, and pleural effusion.

fact ENT & fact ENTNLI : We additionally include our proposed rewards fact ENT and fact ENTNLI as metrics to compare their values for different models.
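The micro-averaged clinical scores pool true/false positive counts over the observation columns before computing precision, recall, and F1. A minimal sketch, assuming the CheXbert labeling step has already produced binary label matrices:

```python
def micro_prf(gold, pred):
    """Micro-averaged precision/recall/F1 over binary label matrices
    (rows = reports, columns = observations)."""
    tp = fp = fn = 0
    for g_row, p_row in zip(gold, pred):
        for g, p in zip(g_row, p_row):
            tp += 1 if g and p else 0          # observation present in both
            fp += 1 if p and not g else 0      # predicted but not in reference
            fn += 1 if g and not p else 0      # in reference but missed
    pr = tp / (tp + fp) if tp + fp else 0.0
    rc = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pr * rc / (pr + rc) if pr + rc else 0.0
    return pr, rc, f1
```

Micro averaging (pooling counts before dividing) weights each labeled observation equally, so frequent findings such as pleural effusion dominate the score more than they would under macro averaging.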

Model Variations
We used M 2 Trans as our report generation model and used DenseNet-121 (Huang et al., 2017) as our image encoder. We trained M 2 Trans with the following variety of joint losses.
NLL M 2 Trans optimized with the NLL loss only, as a baseline.

NLL+CDr CIDEr-D and the NLL loss are jointly optimized with λ 1 = 0.01 and λ 2 = 0.99 as the scaling factors.
We additionally prepared three previous models that have been tested on MIMIC-CXR. Among them, the model of Liu et al. (2019) (CNN-RNN 2 ) is optimized with their Clinically Coherent Reward, which is a reward based on the clinical metrics.

R2Gen
The model of Chen et al. (2020) with a CNN encoder and a memory-driven Transformer optimized with NLL loss. We used the publicly available official code and its checkpoint as its implementation.
For reproducibility, we include model configurations and training details in Appendix B. Table 2 shows the results of the baselines and M 2 Trans optimized with the five different joint losses. We find that the best result for a metric or a reward is achieved when that metric or reward is used directly in the optimization objective. Notably, for the proposed factual rewards, increases of +3.6 fact ENT and +4.9 fact ENTNLI are observed on MIMIC-CXR with M 2 Trans when compared against M 2 Trans w/ BS. For the clinical metrics, the best recalls and F1 scores are obtained with M 2 Trans using fact ENT as a reward, achieving a substantial +22.1 increase (∆+63.9%) in F1 score against the best baseline R2Gen. We further find that using fact ENTNLI as a reward leads to higher precision and accuracy compared to fact ENT , with decreases in the recalls. The best precisions and accuracies were obtained by the baseline CNN-RNN 2 . This is not surprising since this model directly optimizes the clinical metrics with its Clinically Coherent Reward. However, this model is strongly optimized toward precision, resulting in low recalls and F1 scores.

Evaluation with NLG Metrics and Clinical Metrics
The results of M 2 Trans without the proposed rewards and BERTScore reveal the strength of M 2 Trans and the inadequacy of NLL loss and CIDEr for factual completeness and consistency. M 2 Trans w/ NLL shows strong improvements in the clinical metrics against R2Gen. These improvements are a little surprising since both models are Transformer-based models and are optimized with NLL loss. We assume that these improvements are due to architecture differences such as the memory matrices in the encoder of M 2 Trans. The difference between NLL and NLL+CDr on M 2 Trans indicates that NLL and CIDEr are unreliable for factual completeness and consistency.

Human Evaluation
We performed a human evaluation to further confirm whether the generated radiology reports are factually complete and consistent. Following prior studies of radiology report summarization (Zhang et al., 2020b) and image captioning evaluation (Vedantam et al., 2015), we designed a simple human evaluation task. Given a reference report (R) and two candidate model generated reports (C1, C2), two board-certified radiologists decided whether C1 or C2 is more factually similar to R. To cover cases when C1 and C2 are difficult to differentiate, we also prepared "No difference" as an answer. We sampled 100 reports randomly from the test set of MIMIC-CXR for this evaluation. Since this evaluation is (financially) expensive and there has been no human evaluation between the baseline models, we selected R2Gen as the best previous model and M 2 Trans w/ BS as the simplest proposed model, in order to be able to weakly infer that all of our proposed models are better than all of the baselines. Table 3 shows the result of the evaluation. The majority of the reports were labeled "No difference", but the proposed approach received three times as much preference as the baseline. There are two main reasons why "No difference" was frequent in the human evaluation. First, we found that a substantial portion of the examples were normal studies (no abnormal observations), which leads to generated reports of similar quality from both models. Second, in some reports with multiple abnormal observations, both models made mistakes on a subset of these observations, making it difficult to decide which model output was better.

Estimating Clinical Accuracy with Factual Rewards
The integration of fact ENT and fact ENTNLI showed improvements in the clinical metrics. We further examined whether these rewards can be used to estimate the performance on the clinical metrics, which would make them useful in evaluations where a strong clinical information extraction system like CheXbert is not available. Table 4 shows Spearman correlations calculated on the generated reports of NLL+BS. fact ENTNLI shows the strongest correlation with the clinical accuracy, which aligns with the optimization result where the best accuracy is obtained with NLL+BS+fact ENTNLI . This correlation value is slightly lower than the Spearman correlation that Maynez et al. (2020) observed with NLI for factuality (0.264). The result suggests the effectiveness of using the factual rewards to estimate the factual completeness and consistency of radiology reports, although the correlations are still limited, with some room for improvement.
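Such a correlation check needs only rank statistics. A self-contained Spearman estimate (no tie correction, so it assumes distinct scores) could look like:

```python
def spearman(xs, ys):
    """Spearman rank correlation via 1 - 6*sum(d^2)/(n*(n^2-1)).
    No tie correction: values within each list are assumed distinct."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```

In practice one list would hold per-report reward values (e.g., fact ENTNLI) and the other per-report clinical scores, so the correlation quantifies how well the reward tracks the clinical metric.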

Qualitative Analysis of Improved Clinical Completeness and Consistency
The evaluation with the clinical metrics showed improved generation performance by integrating BERTScore, fact ENT , and fact ENTNLI . As a qualitative analysis, we examined some of the generated reports to see the improvements. Example 1 in Figure 3 shows the improved factual completeness and consistency with BERTScore. The atelectasis is correctly generated and left pleural effusion is correctly suppressed with NLL+BS.
Example 2 in Figure 4 shows the improved factual completeness with fact ENTNLI . The edema is correctly generated and atelectasis is correctly suppressed with NLL+BS+fc EN . These examples reveal the strength of integrating the three metrics to generate factually complete and consistent reports.
Despite observing large improvements with our model in the clinical finding metrics evaluation, the model is still not complete and some typical factual errors can be found in their generated reports. For example, Example 3 in Figure 4 includes a comparison of an observation against a previous study as Large right pleural effusion is unchanged in size. There is associated right basilar atelectasis/scarring, also stable. Healed right rib fractures are noted. On the left, there is persistent apical pleural thickening and apical scarring. Linear opacities projecting over the lower lobe are also compatible with scarring, unchanged. There is no left pleural effusion. There is no pneumothorax. Hilar and cardiomediastinal contours are difficult to assess, but appear unchanged. Vascular stent is seen in the left axillary/subclavian region.
Report generated by R2Gen: PA and lateral chest views were obtained with patient in upright position. Analysis is performed in direct comparison with the next preceding similar study of DATE. The heart size remains unchanged and is within normal limits. Unchanged appearance of thoracic aorta. The pulmonary vasculature is not congested. Bilateral pleural effusions are again noted and have increased in size on the right than the left. The left-sided pleural effusion has increased in size and is now moderate in size.
Report generated by the proposed model: As compared to prior chest radiograph from DATE, there has been interval improvement of the right pleural effusion. There is a persistent opacity at the right lung base. There is persistent atelectasis at the right lung base. There is no left pleural effusion. There is no pneumothorax. The cardiomediastinal and hilar contours are unchanged.

Figure 3: An example of radiology reports generated by R2Gen and by the proposed model with the optimization integrating BERTScore. Repeated sentences are removed from the example to improve readability.

". . . appear more prominent since . . . " in the reference, but our model (like any previous model) cannot capture this kind of comparison, since the model is not designed to take the past reports of a patient into account. Additionally, in this example, edema is mentioned with uncertainty as "cannot be excluded" in the reference, but the generated report with fact ENTNLI simply indicates it as "There is mild pulmonary edema".

Conclusion
We proposed two new simple rewards and combined them with a semantic equivalence metric to improve image-to-text radiology report generation systems. The two new rewards make use of radiology domain entities extracted with a named entity recognizer and a weakly supervised NLI model to capture the factual completeness and consistency of the generated reports. We further presented a Transformer-based report generation system that directly optimizes these rewards with self-critical reinforcement learning. On two open datasets, we showed that our system generates reports that are more factually complete and consistent than the baselines and leads to reports with substantially higher scores in clinical metrics. The integration of entities and NLI to improve the factual completeness and consistency of generation is not restricted to the domain of radiology reports, and a similar approach might improve other data-to-text tasks.

A.1 Rules & Examples of Weakly-Supervised Radiology NLI
We prepared the 6 rules (E1, N1-N4, and C1) to train the weakly-supervised radiology NLI model. The rules are applied against sentence pairs consisting of premises (s 1 ) and hypotheses (s 2 ) to extract pairs that are in an entailment, neutral, or contradiction relation.
Entailment Rule 1: E1 1. s 1 and s 2 are semantically similar as sim(s 1 , s 2 ) ≥ 0.7. 2. The named entities (NE) of s 2 are a subset of or equal to the named entities of s 1 as NE(s 2 ) ⊆ NE(s 1 ).
We used BERTScore (Zhang et al., 2020a) as a similarity metric and set the threshold to sim(s 1 , s 2 ) ≥ 0.7. The clinical model of Stanza (Zhang et al., 2020c) is used to extract anatomy entities and observation entities. s 1 and s 2 are conditioned to be both negated or both non-negated. The negation is determined with a negation identifier or the existence of an uncertain entity, using NegBio (Peng et al., 2018) as the negation identifier and the clinical model of Stanza to extract uncertain entities. s 2 is further restricted to include at least 2 entities as |NE(s 2 )| ≥ 2. This similarity metric, named entity recognition model, and entity number restriction are also used in the neutral and contradiction rules below. The negation restriction is used in the neutral rules but not in the contradiction rule. The following is an example of a sentence pair that matches E1 with entities in bold: s 1 The heart is mildly enlarged.
s 2 The heart appears again mild-to-moderately enlarged.
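Concretely, the E1 check can be sketched as a predicate over precomputed features. This is an illustrative sketch, not the paper's implementation: the arguments stand in for the outputs of BERTScore, the Stanza clinical NER model, and the NegBio negation identifier.

```python
def matches_e1(s1_ents, s2_ents, sim, s1_negated, s2_negated):
    """Illustrative E1 check over precomputed features: `sim` stands in
    for BERTScore, the entity sets for Stanza clinical NER output, and
    the negation flags for NegBio / uncertain-entity detection."""
    return (
        sim >= 0.7                    # similarity threshold
        and s2_ents <= s1_ents        # NE(s2) is a subset of NE(s1)
        and len(s2_ents) >= 2         # hypothesis has at least 2 entities
        and s1_negated == s2_negated  # both negated or both non-negated
    )

# The example above: both sentences mention the heart being enlarged.
print(matches_e1({"heart", "enlarged"}, {"heart", "enlarged"},
                 0.85, False, False))  # True
```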
Neutral Rule 1: N1
1. The similarity between s 1 and s 2 is high as sim(s 1 , s 2 ) ≥ 0.7.
2. The named entities of s 1 are a proper subset of the named entities of s 2 as NE(s 1 ) ⊊ NE(s 2 ).
Since s 1 is the premise, this condition denotes that the counterpart hypothesis has entities that are not included in the premise. The following is an example of a sentence pair that matches N1, with entities in bold:
s 1 There is no pulmonary edema or definite consolidation.
s 2 There is no focal consolidation, pleural effusion, or pulmonary edema.
Neutral Rule 2: N2
1. The named entities of s 1 are equal to the named entities of s 2 as NE(s 1 ) = NE(s 2 ).
2. s 1 and s 2 include anatomy modifiers that are antonyms of each other.
Anatomy modifiers are extracted with the clinical model of Stanza, and antonyms are decided using WordNet (Fellbaum, 1998). Antonyms in anatomy modifiers are considered in this rule to differentiate expressions like left vs. right and upper vs. lower. The following is an example of a sentence pair that matches N2, with antonyms in bold:
s 1 Moreover, a small left pleural effusion has newly occurred.
s 2 Small right pleural effusion has worsened.
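A minimal sketch of the N2 check follows. The small ANTONYMS table is a hypothetical stand-in for the WordNet antonym lookup, and the modifier sets stand in for Stanza's anatomy-modifier output.

```python
# Hypothetical stand-in for WordNet antonym pairs among anatomy modifiers.
ANTONYMS = {("left", "right"), ("upper", "lower")}
ANTONYMS |= {(b, a) for a, b in ANTONYMS}  # make the relation symmetric

def matches_n2(s1_ents, s2_ents, s1_mods, s2_mods):
    """N2: equal entity sets, but the anatomy modifiers contain an antonym pair."""
    if s1_ents != s2_ents:
        return False
    return any((m1, m2) in ANTONYMS for m1 in s1_mods for m2 in s2_mods)

# The example above: "small left pleural effusion" vs. "small right pleural effusion".
print(matches_n2({"pleural effusion"}, {"pleural effusion"},
                 {"small", "left"}, {"small", "right"}))  # True
```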
Neutral Rule 3: N3
1. The named entity types (NE type ) of s 1 are equal to the named entity types of s 2 as NE type (s 1 ) = NE type (s 2 ).
2. The named entities of s 1 are disjoint from the named entities of s 2 as NE(s 1 ) ∩ NE(s 2 ) = ∅.
The specific entity types that we used are anatomy and observation. This rule ensures that s 1 and s 2 have related but different entities of the same types. The following is an example of a sentence pair that matches N3, with entities in bold:
s 1 There is minimal bilateral lower lobe atelectasis.
s 2 The cardiac silhouette is moderately enlarged.
Neutral Rule 4: N4
1. The named entities of s 1 are equal to the named entities of s 2 as NE(s 1 ) = NE(s 2 ).
2. s 1 and s 2 include observation keywords (KEY) that belong to different groups as KEY(s 1 ) ≠ KEY(s 2 ).
The groups of observation keywords are set up following the observation keywords of the CheXpert labeler (Irvin et al., 2019). Specifically, G1 = {normal, unremarkable}, G2 = {stable, unchanged}, and G3 = {clear} are used to determine words in different groups as indicating a neutral relation. The following is an example of a sentence pair that matches N4, with keywords in bold:
s 2 Cardiomediastinal silhouette is unchanged.
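The keyword-group condition can be sketched as follows, using the three groups listed above. Treating "different groups" as disjoint group sets is our reading, and tokenization and entity extraction are assumed to happen elsewhere.

```python
# Observation keyword groups following the CheXpert labeler, as above.
KEYWORD_GROUPS = {"normal": 1, "unremarkable": 1,
                  "stable": 2, "unchanged": 2,
                  "clear": 3}

def matches_n4(s1_ents, s2_ents, s1_tokens, s2_tokens):
    """N4: equal entity sets, but the sentences carry observation
    keywords from different (disjoint) groups."""
    if s1_ents != s2_ents:
        return False
    g1 = {KEYWORD_GROUPS[t] for t in s1_tokens if t in KEYWORD_GROUPS}
    g2 = {KEYWORD_GROUPS[t] for t in s2_tokens if t in KEYWORD_GROUPS}
    return bool(g1) and bool(g2) and g1.isdisjoint(g2)

# "... is normal" (G1) vs. "... is unchanged" (G2) over the same entity.
print(matches_n4({"cardiomediastinal silhouette"}, {"cardiomediastinal silhouette"},
                 ["cardiomediastinal", "silhouette", "is", "normal"],
                 ["cardiomediastinal", "silhouette", "is", "unchanged"]))  # True
```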
Contradiction Rule: C1
1. The named entities of s 2 are a subset of or equal to the named entities of s 1 as NE(s 2 ) ⊆ NE(s 1 ).
2. s 1 or s 2 is a negated sentence.
Negation is determined with the same approach as in E1. The following is an example of a sentence pair that matches C1, with entities in bold:
s 1 There are also small bilateral pleural effusions.

A.2 Validation and Test Datasets of Radiology NLI
We sampled 480 sentence pairs that satisfy the following conditions from the validation section of MIMIC-CXR:
1. Two sentences (s 1 and s 2 ) have BERTScore(s 1 , s 2 ) ≥ 0.5.
These conditions are introduced to reduce neutral pairs, since most pairs would be neutral with random sampling. Each sampled pair is annotated twice, swapping its premise and hypothesis, by two experts: one medical expert and one NLP expert. For pairs on which the two annotators disagreed, labels were decided in a discussion with one additional NLP expert. The resulting 960 bidirectional pairs are split in half, resulting in 480 pairs for a validation set and 480 pairs for a test set.

A.3 Configuration of Radiology NLI Model
We used bert-base-uncased as a pre-trained BERT model and further fine-tuned it on MIMIC-III (Johnson et al., 2016) radiology reports with a masked language modeling loss for 8 epochs. The model is then optimized on the training data with a classification negative log-likelihood loss. We used Adam (Kingma and Ba, 2015) as the optimization method with β 1 = 0.9, β 2 = 0.999, a batch size of 16, and a gradient clipping norm of 5.0. The learning rate is set to lr = 1e −5 by running a preliminary experiment with lr ∈ {1e −5 , 2e −5 }.
The model is optimized for a maximum of 20 epochs, and validation accuracy is used to select the model checkpoint that is evaluated on the test set. We trained the model on a single Nvidia Titan XP, taking approximately 2 hours to complete 20 epochs.

B.1 M 2 Trans
We used DenseNet-121 (Huang et al., 2017) as the CNN image feature extractor and pre-trained it on the CheXpert dataset with the 14-class classification setting. We used GloVe (Pennington et al., 2014) to pre-train text embeddings on the training set with an embedding size of 512. The model is configured with a dimensionality of 512, 8 attention heads, and 40 memory vectors. We set the number of Transformer layers to n layer = 1 by running a preliminary experiment with n layer ∈ {1, 2, 3}. The model is first trained against the NLL loss using the learning rate scheduler of Transformer (Devlin et al., 2019) with 20000 warm-up steps and is further optimized with a joint loss at a fixed learning rate of 5e −6 . Adam is used as the optimization method with β 1 = 0.9 and β 2 = 0.999. The batch size is set to 48 for the NLL loss and 24 for the joint losses. For λ * , we first swept the optimal value of λ 1 over {0.03, 0.02, 0.01, 0.001} using the development set. We restricted λ 2 and λ 3 to be equal in our experiments and constrained all λ * values to sum to 1.0. The model is trained with the NLL loss for 32 epochs and further trained for 32 epochs with the joint loss. Beam search with a beam size of 4 is used to decode texts when evaluating the model on a validation or test set. We trained the model on a single Nvidia Titan XP, taking approximately 10 days to complete its optimization.
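Since λ 2 and λ 3 are tied and all weights sum to 1.0, sweeping λ 1 fully determines the remaining weights. Assuming exactly the three weights mentioned above:

```python
def reward_weights(lambda1):
    """Derive (λ1, λ2, λ3) from λ1 under the constraints
    λ2 == λ3 and λ1 + λ2 + λ3 == 1.0."""
    lambda23 = (1.0 - lambda1) / 2.0
    return lambda1, lambda23, lambda23

# e.g. the swept value λ1 = 0.01 gives λ2 = λ3 = (1 - 0.01) / 2 = 0.495
print(reward_weights(0.01))
```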

B.2 TieNet
We used ResNet-50 as a CNN image feature extractor with default ImageNet pre-trained weights.
We used GloVe to pre-train text embeddings with the same configuration as M 2 Trans. The model is configured with an LSTM dimension of 256 and 5 global attentions. The combination of the NLL loss and the multi-label classification loss is used as its joint loss with the balance parameter α = 0.85. The model is trained against the joint loss using a step learning rate schedule with an initial learning rate of 1e −4 multiplied by 0.5 every 8 epochs. The batch size is set to 32, and the model is trained with the joint loss for 32 epochs. Adam is used as the optimization method with β 1 = 0.9 and β 2 = 0.999. Beam search with a beam size of 4 is used to decode texts. We trained the model on a single Nvidia Titan XP, taking approximately 2 days to complete its optimization.
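The decay schedule described above (initial rate 1e −4 , multiplied by 0.5 every 8 epochs) amounts to a simple step decay; a sketch under that reading:

```python
def step_lr(epoch, initial_lr=1e-4, factor=0.5, step_size=8):
    """Step decay: multiply the initial learning rate by `factor`
    once every `step_size` epochs."""
    return initial_lr * factor ** (epoch // step_size)

# Epochs 0-7 train at 1e-4, epochs 8-15 at 5e-5, and so on.
print(step_lr(0), step_lr(8), step_lr(16))
```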

B.3 CNN-RNN 2
We used DenseNet-121 as the CNN image feature extractor with default ImageNet pre-trained weights. We used GloVe to pre-train text embeddings with the same configuration as M 2 Trans. The model is configured with an LSTM dimension of 256. We replaced the information extraction system from CheXpert with CheXbert to improve the training speed of this model. The combination of CIDEr and the Clinically Coherent Reward is used as its joint loss with the balance parameter λ = 10.0. The model is first trained against the NLL loss using a step learning rate schedule with an initial learning rate of 1e −4 multiplied by 0.5 every 8 epochs.
The model is further optimized with the joint loss at a fixed learning rate of 5e −6 . Adam is used as the optimization method with β 1 = 0.9 and β 2 = 0.999. The batch size is set to 32 for the NLL loss and 24 for the joint losses. The model is trained with the NLL loss for 32 epochs and further trained for 32 epochs with the joint loss. Beam search with a beam size of 4 is used to decode texts. We trained the model on a single Nvidia Titan XP, taking approximately 11 days to complete its optimization.

Table 5 shows the detailed results of the clinical metrics for R2Gen, M 2 Trans w/ BS, M 2 Trans w/ BS+fc E , and M 2 Trans w/ BS+fc EN . In most cases, the best F 1 scores are observed when fact ENT or fact ENTNLI is included in the joint losses. Consolidation is one exception, where the best precision, recall, and F 1 scores vary among the joint losses. We assume this is due to the infrequent appearance of consolidation in both MIMIC-CXR and Open-i. For comparison against some past studies, we show the detailed results when CheXpert is used instead of CheXbert in Table 6. Since CheXbert is more accurate than or equally accurate to CheXpert for most observations, the scores in Table 6 follow trends similar to those in Table 5. Table 7 shows the detailed results for the 9 remaining observations defined in CheXpert. Note that many of these observations are infrequent and have relatively weaker and less stable extraction performances compared to the 5 observations in Table 5.