Reinforcement Learning with Imbalanced Dataset for Data-to-Text Medical Report Generation

Automated generation of medical reports that describe the findings in medical images helps radiologists by alleviating their workload. A medical report generation system should generate correct and concise reports. However, data imbalance makes it difficult to train models accurately. Medical datasets are commonly imbalanced in their finding labels because incidence rates differ among diseases; moreover, the ratios of abnormalities to normalities are significantly imbalanced. We propose a novel reinforcement learning method with a reconstructor that improves the clinical correctness of generated reports when training the data-to-text module on a highly imbalanced dataset. Moreover, we introduce a novel data augmentation strategy for reinforcement learning to additionally train the model on infrequent findings. From the perspective of practical use, we employ a Two-Stage Medical Report Generator (TS-MRGen) for controllable report generation from input images. TS-MRGen consists of two separate stages: an image diagnosis module and a data-to-text module. Radiologists can modify the image diagnosis module results to control the reports that the data-to-text module generates. We conduct experiments with two medical datasets to assess the data-to-text module and the entire two-stage model. Results demonstrate that the reports generated by our model describe the findings in the input image more correctly.


Introduction
Writing medical reports manually from medical images is a time-consuming task for radiologists. To write reports, radiologists first recognize what findings are included in medical images, such as computed tomography (CT) and X-ray images. Then radiologists compose reports that describe the recognized findings correctly and without omission. Doctors prefer radiology reports written in natural language; other formats, such as tabular reports, are difficult to understand because of their complexity.
The purpose of our work is to build an automated medical report generation system to reduce the workload of radiologists. As shown in Figure 1, the medical report generation system should generate correct and concise reports for the input images. However, data imbalance may reduce the quality of automatically generated reports. Medical datasets are commonly imbalanced in their finding labels because incidence rates differ among diseases; moreover, the ratios of abnormalities to normalities are also significantly imbalanced. Figure 2 shows an imbalanced distribution of finding labels in the MIMIC-CXR dataset (Johnson et al., 2019). For example, the finding label "Enlarged Cardiomediastinum.Negative" appears approximately 70 times more frequently than the finding label "Atelectasis.Negative". As a result of this imbalance, the generation model tends to learn only the frequent finding labels and tends to omit descriptions of the infrequent labels. This tendency makes generated reports less correct.
To improve the correctness of generated reports, we propose a novel reinforcement learning (RL) strategy for a data-to-text generation module with a reconstructor. We introduce a new reward, Clinical Reconstruction Score (CRS), to quantify how much information the generated reports retain about the input findings. The reconstructor calculates CRS and uses it as a reward for RL to train the model to generate a greater number of correct reports. Additionally, we introduce a new Reinforcement Learning with Data Augmentation method (RL-DA) to alleviate data imbalance problems that arise from infrequent findings.
To replace the entire workflow of radiologists, end-to-end image captioning approaches have primarily been considered (Monshi et al., 2020). These approaches generate reports solely from input medical images. However, they are difficult to apply in real medical settings for the following two reasons. First, the quality of generated reports is adversely affected by the insufficient accuracy of image diagnosis systems; to generate correct reports, radiologists must be able to correct wrong image diagnosis results. Second, end-to-end models cannot reflect the intentions of radiologists in reports. In contrast to abnormalities, normalities are less important but appear frequently in the images. Radiologists sometimes deliberately omit descriptions of some normalities to write concise reports, especially at return visits. To generate concise reports, radiologists should be able to select which findings the system includes in the reports.
We employed the Two-Stage Medical Report Generator (TS-MRGen), a novel framework for controllable report generation. Figure 1 presents an overview of TS-MRGen. TS-MRGen consists of two separate stages: an image diagnosis module and a data-to-text generation module. The image diagnosis module recognizes the findings in the image. Subsequently, reports are generated by the data-to-text module. Radiologists can modify the wrong or unintended results of the image diagnosis module. Next, the modified findings are used as the input to the data-to-text module. This approach greatly improves the correctness and conciseness of generated reports.
Overall, the main contributions of this study are as follows:
• We introduce a reinforcement learning strategy with Clinical Reconstruction Score (CRS) to generate more clinically correct reports.
• We propose a novel Reinforcement Learning with Data Augmentation (RL-DA) to address data imbalance difficulties.
• We design and conduct experiments to validate the effectiveness of Two-Stage Medical Report Generator (TS-MRGen) with a modification process.
We evaluate the proposed approach on two datasets: the Japanese Computed Tomography (JCT) dataset and the MIMIC-CXR dataset. Automatic and manual evaluations on the JCT dataset show that our CRS and RL-DA improve the correctness of generated reports. An experiment conducted on the MIMIC-CXR dataset shows the generality of CRS and RL-DA; moreover, the experiment on the MIMIC-CXR dataset demonstrates that TS-MRGen with the modification process generates more correct reports than the two-stage model without a modification process.

Related Work
Medical Report Generation. Many end-to-end medical report generation models have been proposed (Monshi et al., 2020) to generate reports from images. Jing et al. (2018) introduced a co-attention mechanism to align semantic tags and sub-regions of images. However, this model tends to generate sentences describing normalities to an excessive degree. This tendency results from the imbalanced frequency of findings among the medical images. Jing et al. (2019) and Harzig et al. (2019) use different decoders to generate normalities and abnormalities to address these data imbalance difficulties. Biswal et al. (2020) accept doctors' anchor words for controllable medical report generation; their model generates reports that are more faithful to doctors' preferences by retrieving template sentences from the words entered by the doctor.

Data-to-Text Generation. Data-to-text generation is the task of generating a fluent text that is faithful to the input data. Wiseman et al. (2017) proposed a data-to-text model with reconstruction-based techniques. The method trains the model so that the input data can be reconstructed from the decoder hidden states; this reconstruction makes it more likely that the decoder hidden states capture the input data properly. Ma et al. (2019) and Moryossef et al. (2019) proposed two-step data-to-text models comprising a text planning module and a text realization module. Such a model not only generates text that is more faithful to the input data than end-to-end models, but also allows user control over the generated text by supplying a modified plan to the text realization module.

Data Augmentation for Text Generation. Typical machine learning approaches that address data imbalance, such as undersampling and oversampling, are difficult to apply to this task because the inputs are sets of multiple finding class labels and the target reports are discrete sequences. Kedzie and McKeown (2019) applied a data augmentation method to data-to-text generation. To obtain additional training data, they generated data-text pairs with the model itself using a noise injection sampling method.

Text Generation with Reinforcement Learning (RL). Text generation with RL enables the model to train with non-differentiable rewards, such as the BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) metrics. Zhang et al. (2020b) improved a radiology report summarization model with RL using a factual correctness reward. Liu et al. (2019a) applied RL to medical report generation with a Clinically Coherent Reward (CCR) to directly optimize the model for clinical efficacy. Both methods leverage CheXpert Labeler (Irvin et al., 2019), a medical observation annotator, to calculate rewards.
Our work addresses data imbalance difficulties beyond the imbalance between normalities and abnormalities addressed by Jing et al. (2019). Moreover, with our approach, doctors can reflect their intentions in reports more directly than with Biswal et al. (2020). We extend the factual-correctness-based RL method (Liu et al., 2019a) to cases for which rule-based annotators are not available. Furthermore, we propose data augmentation (Kedzie and McKeown, 2019) for RL to train the model using only the input labels.

Method
Medical report generation is the task of generating a report consisting of a sequence of words $Y = \{y_1, y_2, \ldots, y_N\}$ from a set of images $X = \{x_k\}_{k=1}^{M}$. In most cases, $Y$ includes more than one sentence. We annotated a set of finding labels $F = \{f_1, f_2, \ldots, f_T\}$ for each set of images. The finding labels include abnormalities (indicated as .Positive), normalities (indicated as .Negative), and uncertain findings (indicated as .Uncertain). Each finding label can be disassembled into a sequence of words as $f_t = \{w_{t1}, w_{t2}, \ldots, w_{tK}\}$. For example, the abnormality label "Airspace Opacity.Positive" is divided into the sequence {airspace, opacity, positive}.
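For concreteness, a minimal Python sketch of this label representation follows; the helper function and label strings are illustrative, not part of the released system.

```python
# A minimal sketch of the task's data structures; `disassemble` is a
# hypothetical helper, not the authors' implementation.
from typing import List

def disassemble(finding_label: str) -> List[str]:
    """Split a finding label such as "Airspace Opacity.Positive"
    into its word sequence {w_t1, ..., w_tK}."""
    name, polarity = finding_label.rsplit(".", 1)
    return name.lower().split() + [polarity.lower()]

F = ["Airspace Opacity.Positive", "Pleural Effusion.Negative"]
print([disassemble(f) for f in F])
# [['airspace', 'opacity', 'positive'], ['pleural', 'effusion', 'negative']]
```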

Two-Stage Medical Report Generator
We employ the Two-Stage Medical Report Generator (TS-MRGen), a framework that consists of two separate stages: an image diagnosis module and a data-to-text generation module. The image diagnosis module can be regarded as an image classification task that recognizes input images $X$ and classifies them into a set of findings $F$. Radiologists can modify the image diagnosis module result $F$ if errors are found in it; alternatively, they can intentionally omit or append finding labels. The data-to-text generation module generates a report $Y$ from $F$. We consider the text generation module as a data-to-text task.

Image Diagnosis Module
We train an image classification model that takes a single-view chest X-ray as input and outputs a set of probabilities of four types of labels (positive, negative, uncertain, and no mention) for each possible finding label. We use EfficientNet-B4 (Tan and Le, 2019) as the network architecture, initialized with a model pretrained on ImageNet (Deng et al., 2009).
In some cases, the reports are written based on two images: a front view and a lateral view. Following Irvin et al. (2019), this module outputs the mean of the model's probabilities over the two images.
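The following sketch outlines how such a module could be assembled, assuming the timm library for the EfficientNet-B4 backbone; the head layout and the `diagnose` helper are our assumptions, not the authors' implementation.

```python
# A sketch of the image diagnosis module: an EfficientNet-B4 backbone with
# one 4-way head per finding, averaging probabilities over the input views.
import timm
import torch

NUM_FINDINGS, NUM_STATES = 14, 4  # positive / negative / uncertain / no mention

model = timm.create_model("efficientnet_b4", pretrained=True,
                          num_classes=NUM_FINDINGS * NUM_STATES)

def diagnose(views: torch.Tensor) -> torch.Tensor:
    """views: (num_views, 3, H, W), e.g. frontal and lateral images.
    Returns per-finding probabilities averaged over the views."""
    logits = model(views).view(-1, NUM_FINDINGS, NUM_STATES)
    probs = logits.softmax(dim=-1)   # per-image probabilities
    return probs.mean(dim=0)         # mean over the available views
```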

Text Generation Module
We adopt a table-to-text encoder-decoder model (Liu et al., 2018) for the text generation module to use the words in the finding class labels. The encoder of the text generation module has two layers: a word-level encoder and a label-level encoder.
We use a one-layer bi-directional gated recurrent unit (GRU) for the word-level encoder. The label-level encoder computes

$h^l_t = \mathrm{MLP}_{label}([h^w_{t0}, h^w_{tK}]),$

where $[h^w_{t0}, h^w_{tK}]$ denotes the concatenation of the vectors $h^w_{t0}$ and $h^w_{tK}$, and $\mathrm{MLP}_{label}$ represents a multilayer perceptron.
For the decoder, we use a one-layer GRU with an attention mechanism (Bahdanau et al., 2015). The decoder is initialized with $\bar{h}^l$, the max-pooled vector of $\{h^l_0, \ldots, h^l_T\}$. The context vector $c_n$ is calculated by attending over the label-level hidden vectors $h^l_t$ with the decoder hidden state $h^d_n$.
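A condensed PyTorch sketch of this two-level encoder is shown below; the dimensions, module names, and the use of the first and last word-level states are assumptions based on the description above, not the authors' exact architecture.

```python
# A sketch of the two-level encoder: word-level BiGRU per finding label,
# then an MLP over concatenated boundary states to get label vectors.
import torch
import torch.nn as nn

class LabelEncoder(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.word_gru = nn.GRU(dim, dim, bidirectional=True, batch_first=True)
        self.mlp_label = nn.Sequential(nn.Linear(4 * dim, dim), nn.Tanh())

    def forward(self, label_words: torch.Tensor) -> torch.Tensor:
        """label_words: (T, K) word ids for T finding labels of K words each.
        Returns label-level vectors h^l_t of shape (T, dim)."""
        h_w, _ = self.word_gru(self.embed(label_words))  # (T, K, 2*dim)
        # Concatenate the first and last word-level states of each label.
        return self.mlp_label(torch.cat([h_w[:, 0], h_w[:, -1]], dim=-1))
```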

RL with Reconstructor
We use RL to train the text generation model to improve the clinical correctness of the generated reports. A benefit of RL is that the model can be trained to produce sentences that maximize a reward, even if the word sequence does not match the reference exactly. Many studies of text generation with RL in the medical domain rely on a rule-based finding mention annotator to compute rewards. However, such an annotator is unavailable in most cases other than English chest X-ray reports. We propose a new reward, Clinical Reconstruction Score (CRS), to quantify the factual correctness of reports with a reconstructor module. Figure 3 shows an overview of our method, RL with CRS. In contrast to the data-to-text generator, the reconstructor predicts the appropriate finding labels in reverse from the generated reports. This reconstructor quantifies the clinical correctness of the reports. Therefore, we can estimate the correctness of reports without rule-based annotators.
We utilize BERT (Devlin et al., 2019) as the reconstructor and reconstruct the finding labels $\hat{F}$ as a multi-label text classification task:

$\hat{F} = \mathrm{FC}(\mathrm{BERT}(\hat{Y})),$

where $\mathrm{FC}$ and $\mathrm{BERT}$ represent the fully connected layer and the BERT layer, respectively, and $\hat{Y}$ denotes a generated report. CRS is defined as the F-score of the predicted finding labels $\hat{F}$ against the input finding labels $F$ of the data-to-text module. This BERT reconstructor is trained with a Class-Balanced Loss (Cui et al., 2019) to address imbalanced datasets. We design the overall reward as a combination of the ROUGE-L score and CRS:

$R(\hat{Y}) = \lambda_{rouge}\,\mathrm{ROUGE}(\hat{Y}, Y) + (1 - \lambda_{rouge})\,\mathrm{CRS}(\hat{F}, F),$

where $\mathrm{ROUGE}$ denotes the ROUGE-L score, $Y$ represents the gold report for the predicted report $\hat{Y}$, and $\lambda_{rouge}$ is a hyperparameter. The goal of RL is to find parameters that minimize the negative expected reward $R(\hat{Y})$ for $\hat{Y}$:

$L_{rl}(\theta) = -\mathbb{E}_{\hat{Y} \sim P_\theta}\big[R(\hat{Y})\big],$

where $P_\theta$ denotes the policy network of the text generation model. We adopt SCST (Rennie et al., 2017) to approximate the gradient of this loss:

$\nabla_\theta L_{rl}(\theta) \approx -\big(R(\hat{Y}^s) - R(\hat{Y}^g)\big)\,\nabla_\theta \log P_\theta(\hat{Y}^s),$

where $\hat{Y}^s$ is a sequence sampled with Monte Carlo sampling. We use the softmax function with temperature $\tau$ for sampling sequences. $R(\hat{Y}^g)$ is a baseline reward calculated from a greedily decoded sequence $\hat{Y}^g$.
Training the language model with RL using only CRS and ROUGE as rewards is insufficient; we therefore also use the cross-entropy loss to generate fluent sentences. We design the overall training loss as a combination of the RL loss and the cross-entropy loss $L_{xent}$:

$L = \lambda_{rl} L_{rl} + (1 - \lambda_{rl}) L_{xent},$

where $L_{xent}$ is the cross-entropy loss calculated between the gold reports and the generated reports, and $\lambda_{rl}$ is a hyperparameter.
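The following sketch summarizes the training signal described above: CRS from the reconstructor, the mixed reward, the SCST loss, and the overall objective. The interfaces (`reconstructor`, `rouge_l`) and the exact mixing forms are assumptions, not the released code.

```python
# A minimal sketch of RL with CRS under SCST. The reconstructor is assumed
# to return a probability vector over finding labels for a report.
from sklearn.metrics import f1_score

def crs(reconstructor, report, input_labels):
    """CRS: F-score of the finding labels reconstructed from the generated
    report against the input finding labels F (both binary vectors)."""
    pred = (reconstructor(report) > 0.5).astype(int)
    return f1_score(input_labels, pred, average="micro")

def reward(y_hat, y_gold, labels, reconstructor, rouge_l, lam_rouge):
    # R(Y^) = lam_rouge * ROUGE-L + (1 - lam_rouge) * CRS
    return (lam_rouge * rouge_l(y_hat, y_gold)
            + (1 - lam_rouge) * crs(reconstructor, y_hat, labels))

def scst_loss(log_prob_sampled, r_sampled, r_greedy):
    # Self-critical sequence training: the greedily decoded reward is the
    # baseline, so sequences better than greedy are reinforced.
    return -(r_sampled - r_greedy) * log_prob_sampled

def total_loss(l_rl, l_xent, lam_rl):
    # One plausible reading of the RL / cross-entropy combination above.
    return lam_rl * l_rl + (1 - lam_rl) * l_xent
```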

Reinforcement Learning with Data Augmentation (RL-DA)
We propose a novel method, RL with Data Augmentation (RL-DA), to encourage the model to focus on infrequent findings. We exploit the asymmetry between the augmentation cost of the input data and that of the target report sentences. The input data, which comprise a set of finding labels, can be augmented easily by automatically adding or removing a finding label. However, the augmentation cost is higher for the target reports because they are written in natural language. Therefore, we introduce a semi-supervised reinforcement learning method to train the model by augmenting the input data only. The data augmentation process of RL-DA consists of the following steps.
Step 1: List and Filter all Candidate Finding Labels. Given a set of finding labels $F = \{f_1, f_2, \ldots, f_T\}$, the objective of the data augmentation is to obtain a new set of finding labels $\tilde{F}$ in which an additional finding label $f_{T+1}$ is added to $F$. We list all finding labels that can be appended to $F$ and filter out those that are inappropriate according to the clinical relations between labels, because some pairs of finding labels are clinically contradictory. We filter the labels based on the following two rules.
a. Contradictory Relation. We exclude finding labels that contradict a label in $F$. For example, the abnormality "Pleural Effusion.Positive" and the normality "Pleural Effusion.Negative" must not be included in the same set $\tilde{F}$.
b. Supplementary Relation. We exclude finding labels that supplement a finding label absent from $F$. For example, "Pleural Effusion.Mild" is excluded if "Pleural Effusion.Positive" is not in $F$.
Step 2: Sample and Append a Finding Label. We sample an additional finding label $f_{T+1}$ to append to $F$; the label is extracted from the set of candidates by random sampling. The data imbalance is mitigated because the data augmentation process appends a new finding label irrespective of its frequency in the training data. We use the augmented set of finding labels $\tilde{F}$ for RL. The overall loss function is as follows:

$L = \lambda_{rl} L_{rl} + \lambda_{aug} L_{aug} + (1 - \lambda_{rl} - \lambda_{aug}) L_{xent}, \quad (9)$

where $\lambda_{rl}$ and $\lambda_{aug}$ are hyperparameters. $L_{aug}$ denotes the RL loss calculated using the augmented set $\tilde{F}$; it is calculated in the same way as $L_{rl}$ with the reward $R(\hat{Y})$ under the condition $\lambda_{rouge} = 0$, because no reference report is available for the augmented set $\tilde{F}$. Hence, the RL-DA method enables training of the model with more data at low cost.
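A minimal sketch of this augmentation step is given below; the rule tables are hypothetical examples, since the actual contradictory and supplementary relations are defined clinically.

```python
# A sketch of RL-DA's candidate filtering and sampling, with hypothetical
# rule tables standing in for the clinically defined relations.
import random

CONTRADICTS = {"Pleural Effusion.Positive": {"Pleural Effusion.Negative"}}
REQUIRES = {"Pleural Effusion.Mild": {"Pleural Effusion.Positive"}}

def augment(F: set, all_labels: set) -> set:
    candidates = []
    for f in all_labels - F:
        if any(c in F for c in CONTRADICTS.get(f, set())):
            continue  # rule (a): contradicts a label already in F
        if not REQUIRES.get(f, set()) <= F:
            continue  # rule (b): supplementary label without its base finding
        candidates.append(f)
    # Sample uniformly, irrespective of label frequency in the training data.
    return F | {random.choice(candidates)} if candidates else F
```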

Experiment
First, to evaluate the effects of our proposed CRS and RL-DA on the data-to-text module, we conduct an experiment with the Japanese Computed Tomography (JCT) dataset. Moreover, to evaluate the generality of CRS and RL-DA and the effects of the modification process on TS-MRGen, we conduct an experiment with the MIMIC-CXR dataset.

Evaluation on the JCT Dataset
Dataset and Experimental Settings. We evaluate the data-to-text module on the JCT dataset, which contains pairs of input finding label sets and target medical reports written in Japanese. The JCT dataset is used only to evaluate the data-to-text module; therefore, we did not prepare medical images for it.
We defined the system of finding labels in consultation with radiologists. Annotators with sufficient knowledge of radiology reports manually annotated the finding labels to the reports. Descriptions unrelated to any finding labels were removed from the reports during preprocessing for privacy reasons.
We chose all hyperparameters based on the CRS scores of the validation data. Details of our models, metrics, training, and dataset are included in the Supplementary section for reproducibility.
We compare the following text generation models: (1) Table-to-Text (Baseline): the table-to-text model without RL described in Section 3.3; (2) 1-NN: calculates the relevance of the input finding labels using TF-IDF and selects the most relevant report from the training data;
(3) Rule-Based: generates reports based on manually prepared templates. We prepared one template sentence per finding label, and the method concatenates template sentences to construct the entire report.
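As an illustration, the rule-based baseline reduces to a lookup and concatenation; the template sentences below are hypothetical.

```python
# An illustrative sketch of the rule-based baseline: one hypothetical
# template sentence per finding label, concatenated into a report.
TEMPLATES = {
    "Nodule.Positive": "There is a nodule in the lung.",
    "Pleural Effusion.Negative": "There is no pleural effusion.",
}

def rule_based_report(finding_labels):
    return " ".join(TEMPLATES[f] for f in finding_labels if f in TEMPLATES)
```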
Additionally, we compare four RL strategies for training the table-to-text generation model, including (7) RL R, which trains the model by RL using only ROUGE as a reward.

Table 1 presents automatic evaluation results for the text generation models. The rule-based method obtained the lowest ROUGE-L score because it generated considerably redundant reports. Training with CRS as a reward improves CRS scores, which indicates that RL improves the metric used as a reward. Our proposed RL-DA CRS+R achieved higher CRS and ROUGE scores than RL CRS+R. From the automatic evaluation results alone, we cannot conclude that our proposed CRS and RL-DA improved correctness, because we have no means of deciding which metric is more appropriate for evaluating the generated reports. To estimate the effects of our proposed method on the data-to-text model, we also conducted a manual evaluation. As in previous research (Zhang et al., 2020a), two specialists who are knowledgeable about radiology reports assessed 100 randomly selected samples for each experimental condition.
We defined the following metrics for the manual evaluation: • Grammaticality: the percentage of reports that contain no grammatical errors.
• Correctness: measures how well the reports describe clinically correct information.
We define the correctness of $\hat{Y}$ as an F-score with the following precision and recall:

$\mathrm{Precision} = \frac{N_{TP}}{N_{TP} + N_{FP}}, \qquad \mathrm{Recall} = \frac{N_{TP}}{N_{TP} + N_{FN}},$

where $N_{TP}$ indicates the number of findings correctly noted in $\hat{Y}$, $N_{FN}$ indicates the number of findings missing from $\hat{Y}$, and $N_{FP}$ indicates the number of findings mistakenly noted in $\hat{Y}$.

Table 2 presents the manual evaluation results. Compared with RL R, our proposed RL-DA CRS+R also improves correctness in the manual evaluation. This indicates that the proposed RL-DA CRS+R does not merely improve the CRS score; it improves the clinical correctness of the generated reports.

Evaluation on the MIMIC-CXR Dataset
Datasets. We evaluated the data-to-text module and the entire system on the MIMIC-CXR dataset, which includes chest X-ray images and their corresponding medical reports written in English. Notably, these reports include descriptions other than findings, such as indications and impressions. We omitted such descriptions because they cannot be generated from the input images. We used the CheXpert dataset (Irvin et al., 2019) to train the image diagnosis module.
We annotated finding labels for the MIMIC-CXR dataset with CheXpert Labeler (Irvin et al., 2019) and the image diagnosis module. CheXpert Labeler annotates finding labels for 14 categories with three types of labels: positive, negative, and uncertain. For the training data of the data-to-text module, we labeled the reports using CheXpert Labeler only.
In addition to BLEU metrics, we adopted CheXpert accuracy, precision, and F-score metrics to quantify the correctness of generated reports, because domain-agnostic metrics such as BLEU are of doubtful value for evaluating the quality of reports, whereas CheXpert-based metrics are more reliable, as reported in previous work. We chose all hyperparameters based on the F-scores of the validation data. Details of our models, metrics, training, and dataset are described in the Supplementary section for reproducibility.

Experimental Settings. For evaluation, we prepare the following four experimental conditions.
(a) Data-to-Text Evaluation. We provide only the gold finding labels as inputs to the data-to-text module and then evaluate the generated reports. This evaluation is intended to assess whether our proposed method is also applicable to the MIMIC-CXR dataset. Therefore, in this evaluation, we focus only on the data-to-text module. We compare our proposed model RL-DA CRS+R, which is trained by RL with CRS and ROUGE and applies RL-DA, with the baseline table-to-text model.

(b) End-to-End Evaluation. We compare our TS-MRGen with end-to-end models, such as CNN-RNN models and models with CCR applied (Liu et al., 2019a). As shown on the left side of Figure 1, an end-to-end model directly generates target reports from the input images. This evaluation setting does not use the finding labels in any way.

(c) Two-Stage Evaluation without Modification. We evaluate our TS-MRGen using the same inputs and outputs as the end-to-end models. As shown in Figure 1, TS-MRGen first predicts the finding labels that describe the findings in the input images; it then generates reports from the finding labels. We apply RL-DA CRS+R to the data-to-text module of TS-MRGen.

(d) Two-Stage Evaluation with Modification. In addition to (c) above, we apply the modification process to the finding labels predicted by the image diagnosis module. However, evaluating the model in this condition with actual radiologists is too expensive. Therefore, we imitate the modification flow using CheXpert Labeler through the following process; a sketch of this flow appears after the steps. (i) Obtain the output probability vector $p(f_t|X)$ of the finding labels predicted by the image diagnosis module.
(ii) Classify each predicted finding label as confident or untrustworthy according to the probability $p(f_t|X)$. If $p(f_t|X)$ is within the range $(p^{low}_{th}, p^{high}_{th})$, we regard the predicted result $\hat{f}_t$ as untrustworthy, and the result is discarded. (iii) Apply the modification process to the predicted finding labels: we obtain the finding labels using CheXpert Labeler and replace all untrustworthy labels classified in (ii). This replacement process imitates the modification flow of radiologists.
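A simplified sketch of this imitated modification flow is shown below; it treats each finding independently, uses the threshold values from the appendix, and stands `chexpert_labels` in for the CheXpert Labeler output.

```python
# A sketch of the imitated modification flow; threshold values follow the
# appendix, (p_th^low, p_th^high) = (0.1, 0.9).
P_LOW, P_HIGH = 0.1, 0.9

def modify(pred_probs, chexpert_labels):
    """pred_probs: {finding label: p(f_t | X)} from the image diagnosis
    module; chexpert_labels: {finding label: bool} from the text labeler,
    standing in for a radiologist's corrections."""
    modified = {}
    for label, p in pred_probs.items():
        if P_LOW < p < P_HIGH:
            # Untrustworthy prediction: discard it and fall back to the
            # labeler's annotation, imitating a radiologist's modification.
            modified[label] = chexpert_labels.get(label, False)
        else:
            modified[label] = p >= P_HIGH  # confident positive / negative
    return modified
```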
Table 3: Automatic evaluation of the data-to-text module using the MIMIC-CXR dataset. Acc, Prec, F (micro), and F (macro) indicate accuracy, precision, micro F-score, and macro F-score, respectively. CheXpert scores quantify the correctness of generated reports. For the data-to-text module, our proposed RL-DA CRS+R achieved the best result (bold) for all metrics. For the entire report generation system, TS-MRGen with the modification process improved the correctness of the generated reports. The CheXpert scores of the proposed model and of TS-MRGen with modification were statistically significant compared with the baseline model and TS-MRGen without modification (p < 0.05), respectively.

The upper part of Table 3 shows that our proposed method improves the clinical accuracy of the generated reports for the MIMIC-CXR dataset. The lower part of Table 3 presents a comparison of the entire report generation system. Compared with TS-MRGen without the modification process, TS-MRGen with the modification process achieved significantly better results for BLEU, CheXpert precision, and micro and macro F-scores. The CheXpert F-score quantifies clinical correctness more adequately; therefore, this result demonstrates that our TS-MRGen has an important advantage because the system enables radiologists to modify mistakenly predicted finding labels. Figure 4 presents an evaluation of the generated reports for each finding label evaluated using CheXpert Labeler. Both our proposed RL-DA CRS+R and the baseline method exhibit the same tendency: the more infrequent a finding label is in the training data, the lower the correctness of the generated reports. RL-DA CRS+R outperforms the baseline model, especially for the infrequent finding labels. This result demonstrates that our proposed RL-DA and CRS generate more accurate reports, especially for labels that are infrequent in the training data.

Qualitative Results
The upper part of Table 4 presents an example of a generated report for the JCT dataset:

Input Finding Labels: Nodule.Positive, Nodule.Solid, Pleural.Indentation.Negative, (truncated) Border.Well Defined
Report Generated by the Baseline Model: There is a 20 mm dilated nodule in the lung. (truncated) It is well-defined and is accompanied by a pleural indentation.
Report Generated by the Proposed (RL-DA CRS+R) Model: There is a 20 mm dilated nodule in the lung. (truncated) It is well-defined and there is no pleural indentation.
Report Generated by TS-MRGen with Modification: the lungs are clear without focal consolidation. no pleural effusion or pneumothorax is seen. the cardiac silhouette is mildly enlarged. the mediastinal and hilar contours are within normal limits. there is no pulmonary edema.

Table 4: (upper) Example of a report generated from the JCT dataset. The italic part represents the fault in the baseline model. The underlined part represents the correct description corresponding to the italic part. A Japanese-English translation is applied. (lower) Example of a report generated from the MIMIC-CXR dataset. The modification process compensates for the missing labels predicted by the image diagnosis module, thereby generating a report more faithful to the gold finding labels.

The baseline model generated a report with the incorrect description "is accompanied by a pleural indentation." The data imbalance causes such an error: "Pleural Indentation.Positive" is a more frequent finding label than "Pleural Indentation.Negative" in the training data. Therefore, the baseline model mistakenly outputted the more frequently occurring description. In contrast, our proposed RL-DA generated the correct description "there is no pleural indentation". This result demonstrates that our proposed RL-DA and CRS train the model more accurately on infrequent finding labels.
The lower part of Table 4 presents an example of a generated report for the MIMIC-CXR dataset. Without the modification process, the generated report includes only the description for "Cardiomegaly.Positive." The image diagnosis module tends to omit normalities because it cannot learn the radiologists' intentions about whether normalities should be omitted. With the modification process, the generated report includes the exact descriptions of the gold finding labels with no omissions. The modification process corrects the missing finding labels among the predicted labels, thereby generating more faithful reports.

Conclusion
We proposed a novel Clinical Reconstruction Score (CRS) and a Reinforcement Learning with Data Augmentation (RL-DA) method to train a data-to-text model on an imbalanced dataset. Additionally, we employed a Two-Stage Medical Report Generator (TS-MRGen) for controllable medical report generation from input medical images.
An evaluation of the data-to-text module revealed that our proposed CRS and RL-DA methods improved the clinical correctness of generated reports, especially for infrequent finding labels. An evaluation of the entire medical report generation system revealed that our TS-MRGen generated more correct reports than an end-to-end generation model.
In future work, we would like to explore whether our method is applicable to other domain tasks in data-to-text generation, such as sports summary generation and biography generation tasks.

A Supplementary Material
A.1 Dataset and Preprocessing

The JCT Dataset. We built the JCT dataset to train the data-to-text module of the medical report generation system. For the JCT dataset, we collected 4,454 medical reports regarding pulmonary nodules from a hospital. To train an accurate medical report generation system, we focused only on the findings in the reports and excluded sentences that violated patient privacy. In consultation with radiologists, we defined 57 types of finding labels. As preprocessing, all descriptions that were not related to any findings were truncated by annotators. We lexicalized phrases referring to the existence of nodules and phrases referring to the size of nodules to improve the stability of training the data-to-text generation model. We used MeCab and mecab-ipadic-NEologd (Sato et al., 2017) to tokenize the reports, and kept tokens with 2 or more occurrences.
To prevent data leakage into the validation/test datasets, we split the dataset so that the same sets of finding labels are not included in the training, validation, and test data. Additionally, to avoid the negative influence of the imbalanced frequency of sets of finding labels, we omitted samples with duplicated sets of finding labels from the validation/test datasets. These strategies for data splitting and duplicate handling caused differences in the average number of labels and report lengths, as shown in Table 5: because duplicated samples tended to contain shorter sentences and fewer input labels, the validation and test datasets tended to contain longer sentences and a greater number of input labels.

The MIMIC-CXR Dataset. Medical reports in the MIMIC-CXR dataset contain descriptions that are irrelevant to the findings in the input images. Hence, we extracted the findings sections of the reports using publicly provided scripts. In the training data, we truncated sentences in the reports that were not related to any findings using CheXpert Labeler and the NegBio parser (Peng et al., 2018) to improve the stability of training the model. We omitted reports that did not mention any findings or had no findings section from the training data. Note that the reports in the validation and test data may contain descriptions that do not mention any findings; we use this approach to align our experimental conditions with previous end-to-end research. We used the Natural Language Toolkit to tokenize the reports, and kept tokens with 10 or more occurrences. We split the dataset into training, validation, and test data based on the split distributed with the MIMIC-CXR-JPG dataset (Johnson et al., 2019). Table 5 presents the statistics of the MIMIC-CXR dataset.
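A minimal sketch of such a leakage-free split is shown below; the ratios and grouping key are our assumptions, and the removal of duplicated label sets from validation/test is omitted for brevity.

```python
# A sketch of a leakage-free split: samples sharing the same set of finding
# labels never cross the train/validation/test boundary.
import random
from collections import defaultdict

def split_by_label_set(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    groups = defaultdict(list)
    for s in samples:                      # s = (finding_labels, report)
        groups[frozenset(s[0])].append(s)
    keys = sorted(groups, key=lambda k: tuple(sorted(k)))
    random.Random(seed).shuffle(keys)
    n = len(keys)
    cut1, cut2 = int(ratios[0] * n), int((ratios[0] + ratios[1]) * n)
    pick = lambda ks: [s for k in ks for s in groups[k]]
    return pick(keys[:cut1]), pick(keys[cut1:cut2]), pick(keys[cut2:])
```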

A.2 Training Details
Image Diagnosis Module. All images were fed into the network at a size of 512 × 512 pixels. We set the loss as the sum of the multi-class cross-entropy over the observations and used the RAdam optimizer (Liu et al., 2019b) with a learning rate of $1.0 \times 10^{-4}$. We trained the model for 5 epochs on the CheXpert dataset (Irvin et al., 2019).
Subsequently, we evaluated the image diagnosis module on the CheXpert dataset. To evaluate the image classification accuracy correctly for the infrequent labels, we performed 5-fold cross-validation. Table 7 presents F-scores for each finding label evaluated in the 5-fold cross-validation. Although the F-scores of the no-mention labels are high, the F-scores of the positive, negative, and uncertain finding labels are relatively low. This is because the CheXpert dataset is significantly imbalanced, and almost all finding labels in the training data are in the no-mention category.

Data-to-Text Module. For the JCT and MIMIC-CXR datasets, we trained the data-to-text module for 50 and 20 epochs, respectively. We used the CRS score on the validation data as the stopping criterion and reported the evaluation scores of the checkpoint that achieved the highest validation CRS score. Table 6 presents the hyperparameters used to train our models. Before training the model with RL, we pretrained it with only the cross-entropy loss for one epoch. The number of parameters of the data-to-text module was 127k for the JCT dataset and 463k for the MIMIC-CXR dataset.

Reconstructor Module. To train the reconstructor for the JCT dataset, we used a pretrained Japanese BERT model. We split the training data of the data-to-text module at a ratio of 4:1 and used the former part as training data and the latter part as validation data for the reconstructor. For fine-tuning, we used the AdamW optimizer with a learning rate of $2.0 \times 10^{-5}$ for the BERT layer and $2.0 \times 10^{-3}$ for the fully connected layer. We used the binary cross-entropy loss to train the model and applied the Class-Balanced Loss (CBL) (Cui et al., 2019) with $\beta = 0.999$. The number of parameters of the reconstruction module is 110M. We fine-tuned the model for 10 epochs, and the F-score on the validation dataset was 90.3.
To train the reconstructor for the MIMIC-CXR dataset, we used the pretrained bert-base-uncased model. We also evaluated the BioBERT model (Lee et al., 2020), but the results showed no significant differences from the bert-base-uncased model. For fine-tuning, we used the AdamW optimizer with a learning rate of $2.0 \times 10^{-5}$ for the BERT layer and $2.0 \times 10^{-3}$ for the fully connected layer. As with the JCT dataset, we split the training data at a ratio of 4:1 and used the former part as training data and the latter part as validation data for the reconstructor. We used the binary cross-entropy loss to train the model and applied the Class-Balanced Loss (CBL) (Cui et al., 2019) with $\beta = 0.999$. The number of parameters of the reconstruction module was 109M. We fine-tuned the model for 10 epochs, and the F-score on the validation dataset was 97.9.

We used an Intel Core i7-6850K CPU and an NVIDIA GTX 1080Ti GPU for training on the JCT dataset; the training time was approximately 3 hours. We used an Intel Xeon Gold 6148 CPU and an NVIDIA Tesla V100 GPU for training on the MIMIC-CXR dataset, which required approximately 12 hours.
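A minimal sketch of the Class-Balanced weighting (Cui et al., 2019) applied to the reconstructor's binary cross-entropy is shown below, with $\beta = 0.999$ as in the appendix; the function signature is our assumption.

```python
# A sketch of Class-Balanced BCE: each label is weighted by the inverse of
# its "effective number" of samples, (1 - beta) / (1 - beta^n_c).
import torch
import torch.nn.functional as F

def class_balanced_bce(logits, targets, samples_per_class, beta=0.999):
    """logits, targets: (batch, num_labels); samples_per_class: (num_labels,)
    counts of positive examples per finding label in the training data."""
    effective_num = 1.0 - torch.pow(beta, samples_per_class.float())
    weights = (1.0 - beta) / effective_num
    weights = weights / weights.sum() * weights.numel()  # normalize to mean 1
    return F.binary_cross_entropy_with_logits(
        logits, targets.float(), weight=weights.expand_as(logits))
```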

A.3 Evaluation Settings.
We use an approximate randomization test (https://github.com/smartschat/art) to evaluate statistical significance.

Evaluation Metrics on the JCT Dataset. For automatic evaluation on the JCT dataset, we used BLEU (Papineni et al., 2002), the F-score of ROUGE-L (Lin, 2004), and CRS as metrics. We used the Natural Language Toolkit (https://www.nltk.org/) to calculate BLEU scores and the rouge Python library to calculate ROUGE-L scores.

Evaluation Metrics on the MIMIC-CXR Dataset. For comparison with previous image captioning approaches, we used the BLEU-1, BLEU-2, BLEU-3, and BLEU-4 metrics calculated with the nlg-eval library. However, word-overlap-based metrics such as BLEU fail to assess the factual correctness of generated reports. We therefore compared the labels assigned by CheXpert Labeler between the generated reports and the gold reports to calculate the CheXpert accuracy, precision, micro F-score, and macro F-score. The micro F-score was obtained from the overall numbers of true positives, false positives, and false negatives; the macro F-score was obtained by averaging the F-scores per class label. The micro F-score neglects infrequent labels because it is significantly biased by the imbalanced distribution of the test dataset.
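A small example contrasting the two aggregations with scikit-learn, on hypothetical binary label matrices:

```python
# Micro vs. macro F-score over CheXpert-style multi-label annotations;
# the label matrices here are hypothetical.
import numpy as np
from sklearn.metrics import f1_score

gold = np.array([[1, 0, 1], [0, 0, 1]])   # rows: reports, cols: labels
pred = np.array([[1, 0, 0], [0, 1, 1]])

# Micro: pooled TP/FP/FN counts; dominated by frequent labels.
print(f1_score(gold, pred, average="micro"))
# Macro: mean of per-label F-scores; treats infrequent labels equally.
print(f1_score(gold, pred, average="macro"))
```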
Note that precision and the F-score are preferred for evaluating the clinical correctness of the reports in CheXpert. In contrast, CheXpert accuracy does not quantify the clinical correctness of the generated reports adequately: the imbalanced dataset results in an excessive number of true negatives rather than true positives. Hence, CheXpert accuracy overestimates the clinical correctness of generated reports if the reports comprise many descriptions that are not related to the findings.

Modification Flow. We apply the modification process to the image diagnosis module results with the parameters $(p^{low}_{th}, p^{high}_{th}) = (0.1, 0.9)$ for the positive finding labels. However, we regard all negative and uncertain labels predicted by the image diagnosis module as unreliable, because negative or uncertain findings are highly dependent on the radiologist's judgment.