Generating Radiology Reports via Memory-driven Transformer

Medical imaging is frequently used in clinical practice and trials for diagnosis and treatment. Writing imaging reports is time-consuming and can be error-prone for inexperienced radiologists. Therefore, automatically generating radiology reports is highly desired to lighten the workload of radiologists and accordingly promote clinical automation, which is an essential task in applying artificial intelligence to the medical domain. In this paper, we propose to generate radiology reports with a memory-driven Transformer, where a relational memory is designed to record key information of the generation process and a memory-driven conditional layer normalization is applied to incorporate the memory into the decoder of Transformer. Experimental results on two prevailing radiology report datasets, IU X-Ray and MIMIC-CXR, show that our proposed approach outperforms previous models with respect to both language generation metrics and clinical evaluations. In particular, to the best of our knowledge, this is the first work reporting generation results on MIMIC-CXR. Further analyses also demonstrate that our approach is able to generate long reports with necessary medical terms as well as meaningful image-text attention mappings.


Introduction
Radiology report generation, which aims to automatically generate a free-text description for a clinical radiograph (e.g., chest X-ray), has emerged as a prominent research direction in both artificial intelligence and clinical medicine. It can greatly expedite the automation of workflows and improve the quality and standardization of health care. Recently, many methods have been proposed in this area (Johnson et al., 2019; Liu et al., 2019; Jing et al., 2019). Our code and the best performing models are released at https://github.com/zhjohnchan/R2Gen.
Practically, a significant challenge of radiology report generation is that radiology reports are long narratives consisting of multiple sentences. As illustrated by Figure 1, a radiology report generally consists of a findings section, which describes medical observations including both normal and abnormal features, and an impression or concluding remark summarizing the most prominent observations. Therefore, applying conventional image captioning approaches (Vinyals et al., 2015; Anderson et al., 2018) may be insufficient for radiology report generation, as such approaches are designed to briefly describe visual scenes with short sentences. In addition, the ability to provide accurate clinical descriptions for a radiograph is of the highest priority, which places a higher demand on the generation process. Nevertheless, despite the difficulties posed by these length and accuracy requirements, radiology reports have their own distinctive characteristics. An important one is their highly patternized nature, as illustrated by the sample report in Figure 1. On the basis of this patternization, many approaches have been proposed to address the challenges of radiology report generation. For example, Liu et al. (2019) found that a simple retrieval-based method could achieve comparable performance on this task. Other work combined retrieval-based and generation-based methods with manually extracted templates. Although retrieval-based approaches may obtain promising results, they are still limited by the need to prepare large databases or to explicitly construct template lists that capture the patterns embedded in various reports.
In this paper, we propose to generate radiology reports via a memory-driven Transformer. In detail, a relational memory (RM) is proposed to record the information from previous generation processes, and a novel memory-driven conditional layer normalization (MCLN) is designed to incorporate the relational memory into Transformer (Vaswani et al., 2017). As a result, similar patterns in different medical reports can be implicitly modeled and memorized during the generation process, which facilitates the decoding of Transformer and enables it to generate long reports with informative content. Experimental results on two benchmark datasets confirm the validity and effectiveness of our approach, where Transformer with RM and MCLN achieves state-of-the-art performance on both datasets. To summarize, the contributions of this paper are four-fold:
• We propose to generate radiology reports via a novel memory-driven Transformer model.
• We propose a relational memory to record the previous generation process and the MCLN to incorporate the relational memory into layers in the decoder of Transformer.
• Extensive experiments are performed and the results show that our proposed models outperform the baselines and existing models.
• We conduct analyses to investigate the effect of our model with respect to different memory sizes and show that our model is able to generate long reports with necessary medical terms and meaningful image-text attention mappings.

The Proposed Method
Generating radiology reports is essentially an image-to-text generation task, for which there exist several solutions (Vinyals et al., 2015; Xu et al., 2015; Anderson et al., 2018; Cornia et al., 2019). We follow the standard sequence-to-sequence paradigm for this task. In doing so, we treat the input from a radiology image as the source sequence $X = \{x_1, x_2, \ldots, x_S\}$, $x_s \in \mathbb{R}^d$, where $x_s$ are patch features extracted from visual extractors and $d$ is the size of the feature vector. The corresponding report is the target sequence $Y = \{y_1, y_2, \ldots, y_T\}$, $y_t \in \mathbb{V}$, where $y_t$ are the generated tokens, $T$ is the length of the generated token sequence and $\mathbb{V}$ is the vocabulary of all possible tokens. An overview of our proposed model is shown in Figure 2, and the details are illustrated in the following subsections.

The Model Structure
Our model can be partitioned into three major components, i.e., the visual extractor, the encoder and the decoder, where the proposed memory and the integration of the memory into Transformer are mainly performed in the decoder. The overall description of the three components and the training objective of the task is detailed below.
Visual Extractor Given a radiology image $Img$, its visual features $X$ are extracted by pre-trained convolutional neural networks (CNN), e.g., VGG (Simonyan and Zisserman, 2015) or ResNet (He et al., 2016), and the encoded results are used as the source sequence for all subsequent modules. The process is formulated as
$$\{x_1, x_2, \ldots, x_S\} = f_v(Img), \qquad (1)$$
where $f_v(\cdot)$ represents the visual extractor.
Encoder In our model, we use the standard encoder from Transformer, where the outputs are the hidden states $h_i$ encoded from the input features $x_i$ extracted by the visual extractor:
$$\{h_1, h_2, \ldots, h_S\} = f_e(x_1, x_2, \ldots, x_S), \qquad (2)$$
where $f_e(\cdot)$ refers to the encoder.
Decoder The backbone decoder in our model is the one from Transformer, to which we add an extra memory module by replacing the original layer normalization with MCLN in each decoding layer, as shown in Figure 2. The decoding process can therefore be formalized as
$$y_t = f_d(h_1, \ldots, h_S, \mathrm{MCLN}(\mathrm{RM}(y_1, \ldots, y_{t-1}))), \qquad (3)$$
where $f_d(\cdot)$ refers to the decoder, and the details of the memory (RM) and MCLN are presented in the following subsections.
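Objective Given the aforementioned structure, the entire generation process can be formalized as a recursive application of the chain rule
$$p(Y \mid Img) = \prod_{t=1}^{T} p(y_t \mid y_1, \ldots, y_{t-1}, Img), \qquad (4)$$
where $Y = \{y_1, y_2, \ldots, y_T\}$ is the target text sequence. The model is then trained to maximize $P(Y \mid Img)$, i.e., to minimize the negative conditional log-likelihood of $Y$ given $Img$:
$$\theta^{*} = \arg\max_{\theta} \sum_{t=1}^{T} \log p(y_t \mid y_1, \ldots, y_{t-1}, Img; \theta), \qquad (5)$$
where $\theta$ denotes the parameters of the model.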
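To make the objective concrete, the following minimal sketch evaluates the chain-rule factorization in log space for a single report. The per-step probabilities are hypothetical values standing in for the decoder's outputs for the reference tokens, and the function names are illustrative, not taken from the released code.

```python
import math

def sequence_log_prob(step_probs):
    """Sum of log p(y_t | y_<t, Img) over all steps (chain rule in log space)."""
    return sum(math.log(p) for p in step_probs)

def negative_log_likelihood(step_probs):
    """Training loss for one report: the negated sequence log-probability."""
    return -sequence_log_prob(step_probs)

# Hypothetical probabilities the decoder assigns to the three reference tokens.
step_probs = [0.5, 0.25, 0.8]

joint = math.exp(sequence_log_prob(step_probs))  # equals the product 0.5 * 0.25 * 0.8
nll = negative_log_likelihood(step_probs)
print(f"p(Y|Img) = {joint:.3f}, NLL = {nll:.3f}")
```

Minimizing this quantity over the training reports is equivalent to maximizing the likelihood of the reference reports given their images.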

Relational Memory
Radiology images that are relevant to each other may share similar patterns in their reports, so that they can serve as good references for each other in the generation process. As shown in Figure 1, patterns such as "The lungs are clear bilaterally" and "no evidence of focal consolidation, or pleural effusion" typically appear together in the reports of similar images. To exploit such characteristics, we propose to use an extra component, i.e., a relational memory, to enhance Transformer so that it learns from the patterns and models the interactions among the patterns and the generation process.
In doing so, the relational memory uses a matrix to transfer its states over generation steps, where the states record important pattern information, with each row (namely, memory slot) representing some pattern information. During generation, the matrix is updated step by step by incorporating the output from previous steps. At time step $t$, the matrix from the previous step, $M_{t-1}$, serves as the query, and its concatenation with the previous output serves as the key and value fed to the multi-head attention module. Given $H$ heads used in Transformer, there are $H$ sets of queries, keys and values obtained via three linear transformations, respectively. For each head, we obtain the query, key and value in the relational memory through
$$Q = M_{t-1} \cdot W_q, \quad K = [M_{t-1}; y_{t-1}] \cdot W_k, \quad V = [M_{t-1}; y_{t-1}] \cdot W_v,$$
where $y_{t-1}$ is the embedding of the previous output, $[M_{t-1}; y_{t-1}]$ is the row-wise concatenation of $M_{t-1}$ and $y_{t-1}$, and $W_q$, $W_k$ and $W_v$ are the trainable weights of the linear transformations of the query, key and value, respectively. Multi-head attention is used to model $Q$, $K$ and $V$ so as to depict the relations among different patterns. As a result,
$$Z = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) \cdot V,$$
where $d_k$ is the dimension of $K$ and $Z$ is the output of the multi-head attention module. Since the relational memory is applied in a recurrent manner along with the decoding process, it potentially suffers from vanishing and exploding gradients. We therefore introduce residual connections and a gate mechanism. The former is formulated as
$$\tilde{M}_t = f_{mlp}(Z + M_{t-1}) + Z + M_{t-1},$$
where $f_{mlp}(\cdot)$ refers to the multi-layer perceptron (MLP). The detailed structure of the gate mechanism in the relational memory is shown in Figure 3, where the forget and input gates are applied to balance the inputs from $M_{t-1}$ and $y_{t-1}$, respectively. To ensure that $y_{t-1}$ can be used for computation with $M_{t-1}$, it is extended to a matrix $Y_{t-1}$ by duplicating it to multiple rows. The forget and input gates are then formalized as
$$G^f_t = Y_{t-1} W_f + \tanh(M_{t-1}) \cdot U_f,$$
$$G^i_t = Y_{t-1} W_i + \tanh(M_{t-1}) \cdot U_i,$$
where $W_f$ and $W_i$ are the trainable weights for $Y_{t-1}$ in each gate; similarly, $U_f$ and $U_i$ are the trainable weights for $M_{t-1}$ in each gate. The final output of the gate mechanism is formalized as
$$M_t = \sigma(G^f_t) \odot M_{t-1} + \sigma(G^i_t) \odot \tanh(\tilde{M}_t),$$
where $\odot$ refers to the Hadamard product, $\sigma$ is the sigmoid function and $M_t$ is the output of the entire relational memory module at step $t$.
Figure 3: The illustration of the gate mechanism.
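The relational memory update described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions rather than the released implementation: it uses a single attention head instead of $H$ heads, a one-layer tanh network in place of $f_{mlp}$, square weight matrices, tanh applied to $M_{t-1}$ inside the gates, and small random weights; all function and parameter names are hypothetical.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def hadamard(A, B):
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def elementwise(fn, A):
    return [[fn(a) for a in row] for row in A]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax_rows(A):
    out = []
    for row in A:
        top = max(row)
        exps = [math.exp(a - top) for a in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def relational_memory_step(M_prev, y_prev, p):
    """One single-head update M_{t-1} -> M_t (assumed simplification)."""
    # Query from the memory; key/value from the memory rows plus the previous output.
    kv_input = M_prev + [list(y_prev)]           # row-wise concatenation [M_{t-1}; y_{t-1}]
    Q = matmul(M_prev, p["Wq"])
    K = matmul(kv_input, p["Wk"])
    V = matmul(kv_input, p["Wv"])
    d_k = len(K[0])
    scores = elementwise(lambda s: s / math.sqrt(d_k), matmul(Q, list(zip(*K))))
    Z = matmul(softmax_rows(scores), V)
    # Residual connections around a stand-in one-layer MLP.
    residual = add(Z, M_prev)
    M_tilde = add(elementwise(math.tanh, matmul(residual, p["Wmlp"])), residual)
    # Forget and input gates; y_{t-1} is duplicated to one row per memory slot.
    Y_prev = [list(y_prev) for _ in M_prev]
    tanh_M = elementwise(math.tanh, M_prev)
    G_f = add(matmul(Y_prev, p["Wf"]), matmul(tanh_M, p["Uf"]))
    G_i = add(matmul(Y_prev, p["Wi"]), matmul(tanh_M, p["Ui"]))
    return add(hadamard(elementwise(sigmoid, G_f), M_prev),
               hadamard(elementwise(sigmoid, G_i), elementwise(math.tanh, M_tilde)))

rng = random.Random(0)
slots, d = 3, 4                                   # toy sizes; the paper uses 3 slots of size 512
params = {name: rand_matrix(d, d, rng)
          for name in ["Wq", "Wk", "Wv", "Wmlp", "Wf", "Uf", "Wi", "Ui"]}
M = rand_matrix(slots, d, rng)
for _ in range(5):                                # five decoding steps with random "embeddings"
    y_prev = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    M = relational_memory_step(M, y_prev, params)
```

Because both gates pass through a sigmoid and the candidate memory through tanh, each entry of the memory can grow by less than one per step, which is one way to see how the gating keeps the recurrent state from exploding.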

Memory-driven Conditional Layer Normalization
Although memory shows its effectiveness in many NLP tasks (Sukhbaatar et   2019), it is by default applied to encoding with rather isolated designs. However, given that text generation is a dynamic process that is largely affected by the output at each decoding step, memory is expected to be closely integrated with the decoder. Therefore, we propose a novel MCLN and use it to incorporate the relational memory to enhance the decoding of Transformer. Recall that in the layer normalization of the conventional Transformer, $\gamma$ and $\beta$ are two crucial parameters for scaling and shifting the learned representations, respectively, which helps to improve generalization. We thus propose to incorporate the relational memory via MCLN by feeding its output $M_t$ to $\gamma$ and $\beta$. Consequently, this design benefits from the memory while preventing it from influencing too many parameters of Transformer, so that some core information for generation is not affected.
As shown in Figure 2, in each Transformer decoding layer, we use three MCLNs, where the output of the first MCLN serves as the query fed into the following multi-head attention module, together with the hidden states from the encoder as the key and value. To feed each MCLN, at step $t$, the output of the relational memory $M_t$ is expanded into a vector $m_t$ by simply concatenating all rows of $M_t$. Then, an MLP is used to predict a change $\Delta\gamma_t$ on $\gamma$ from $m_t$, and update it via
$$\Delta\gamma_t = f_{mlp}(m_t), \qquad \hat{\gamma}_t = \gamma + \Delta\gamma_t.$$
Similarly, $\Delta\beta_t$ and $\hat{\beta}_t$ are obtained by
$$\Delta\beta_t = f_{mlp}(m_t), \qquad \hat{\beta}_t = \beta + \Delta\beta_t.$$
Afterwards, the predicted $\hat{\beta}_t$ and $\hat{\gamma}_t$ are applied to the mean and variance results of the multi-head self-attention from the previously generated outputs:
$$f_{mcln}(r) = \hat{\gamma}_t \odot \frac{r - \mu}{\upsilon} + \hat{\beta}_t,$$
where $r$ refers to the output from the previous module, and $\mu$ and $\upsilon$ are the mean and standard deviation of $r$, respectively. The result $f_{mcln}(r)$ from MCLN is then fed to the next module (for the 1st and 2nd MCLN) or used as the final output for generation (for the 3rd MCLN).
Table 2: The performance of all baselines and our full model on the test sets of IU X-RAY and MIMIC-CXR datasets with respect to NLG and CE metrics. BL-n denotes BLEU score using up to n-grams; MTR and RG-L denote METEOR and ROUGE-L, respectively. The average improvement over all NLG metrics compared to BASE is also presented in the "AVG. ∆" column. The performance of all models is averaged from five runs.
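As an illustration of the MCLN computation described above, the sketch below normalizes a single hidden vector $r$ and rescales it with memory-dependent $\hat{\gamma}_t$ and $\hat{\beta}_t$. It is a minimal sketch under assumptions: a single linear map stands in for each MLP, a small `eps` is added for numerical stability, and the toy dimensions and all names are hypothetical.

```python
import math

def linear(W, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def mcln(r, gamma, beta, m_t, W_gamma, W_beta, eps=1e-6):
    """Layer normalization whose scale and shift are adjusted by the memory vector m_t."""
    delta_gamma = linear(W_gamma, m_t)   # stand-in for the MLP predicting the change on gamma
    delta_beta = linear(W_beta, m_t)     # separate stand-in MLP for beta (no shared weights)
    gamma_hat = [g + dg for g, dg in zip(gamma, delta_gamma)]
    beta_hat = [b + db for b, db in zip(beta, delta_beta)]
    mu = sum(r) / len(r)
    std = math.sqrt(sum((x - mu) ** 2 for x in r) / len(r))
    return [g * (x - mu) / (std + eps) + b for g, b, x in zip(gamma_hat, beta_hat, r)]

# Toy memory with 3 slots of size 2, flattened by concatenating its rows.
M_t = [[0.1, -0.2], [0.3, 0.0], [-0.1, 0.2]]
m_t = [v for row in M_t for v in row]

r = [1.0, 2.0, 3.0, 4.0]
gamma, beta = [1.0] * 4, [0.0] * 4

# With zero weights the memory has no effect and MCLN is ordinary layer normalization.
zero = [[0.0] * len(m_t) for _ in r]
plain = mcln(r, gamma, beta, m_t, zero, zero)

# With non-zero weights the memory rescales the normalized vector.
W_gamma = [[0.5] * len(m_t) for _ in r]
scaled = mcln(r, gamma, beta, m_t, W_gamma, zero)
```

The zero-weight case makes the design choice visible: the memory only perturbs the scale and shift parameters, so the rest of the decoder layer is left untouched.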

Datasets
We conduct our experiments on two datasets, IU X-RAY and MIMIC-CXR, and exclude the samples without reports. We then apply their conventional splits. Specifically, IU X-RAY is partitioned into train/validation/test sets by 7:1:2 of the entire dataset, and the official split of MIMIC-CXR is adopted. The statistics of the datasets are shown in Table 1, with the numbers of images, reports and patients, and the average length of reports.
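A 7:1:2 partition such as the one used for IU X-RAY can be produced as in the sketch below. The shuffling, the seed, the integer rounding and the choice of splitting unit are assumptions made for illustration; the text does not specify how the partition was drawn.

```python
import random

def split_7_1_2(items, seed=0):
    """Shuffle the items and cut them into train/validation/test parts by 7:1:2."""
    items = list(items)
    random.Random(seed).shuffle(items)   # fixed seed only for reproducibility of the sketch
    n_train = len(items) * 7 // 10
    n_val = len(items) // 10
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

# Hypothetical sample identifiers standing in for image-report pairs.
train_ids, val_ids, test_ids = split_7_1_2(range(1000))
print(len(train_ids), len(val_ids), len(test_ids))
```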

Baseline and Evaluation Metrics
To compare with our proposed model, the following ones are used as the main baselines: BASE, the vanilla Transformer, and BASE+RM, in which the relational memory is applied to the final output of the decoder instead of being integrated through MCLN. The performance of the aforementioned models is evaluated by conventional natural language generation (NLG) metrics and clinical efficacy (CE) metrics. The NLG metrics include BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2011) and ROUGE-L (Lin, 2004). For the clinical efficacy metrics, we use CheXpert (Irvin et al., 2019) to label the generated reports and compare the results with the ground truths in 14 different categories related to thoracic diseases and support devices. Precision, recall and F1 are used to evaluate model performance for these metrics.
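The clinical efficacy computation can be illustrated as follows, assuming that each report has already been converted into a binary vector with one entry per category by a labeler. The micro-averaging over all report-category pairs is an assumption, since the averaging scheme is not stated here, and the toy example uses three categories instead of 14.

```python
def micro_prf(pred_labels, true_labels):
    """Micro-averaged precision, recall and F1 over binary label vectors."""
    tp = fp = fn = 0
    for pred, true in zip(pred_labels, true_labels):
        for p, t in zip(pred, true):
            tp += int(p == 1 and t == 1)
            fp += int(p == 1 and t == 0)
            fn += int(p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Rows are reports, columns are (hypothetical) categories; 1 marks a positive mention.
generated = [[1, 0, 0], [0, 1, 0], [1, 1, 0]]
reference = [[1, 0, 1], [0, 0, 0], [1, 1, 0]]

precision, recall, f1 = micro_prf(generated, reference)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

In this toy case there are three true positives, one false positive and one false negative, so all three scores coincide.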

Implementation Details
We adopt ResNet101 (He et al., 2016) pretrained on ImageNet (Deng et al., 2009) as the visual extractor to extract patch features, with the dimension of each feature set to 2,048. Note that for IU X-RAY, we use two images of a patient as input to ensure consistency with the experimental settings of previous work. The Transformer in our proposed model and in all baselines is randomly initialized. For the relational memory, its dimension and the number of heads in multi-head attention are set to 512 and 8, respectively, and the number of memory slots is set to 3 by default. For MCLN, we use two MLPs to obtain ∆γ and ∆β, and they do not share parameters. The model is trained under the cross-entropy loss with the Adam optimizer (Kingma and Ba, 2015). We set the learning rate to 5e-5 and 1e-4 for the visual extractor and the other parameters, respectively. We decay the learning rate by a factor of 0.8 per epoch for each dataset and set the beam size to 3 to balance generation effectiveness and efficiency. Note that the aforementioned hyper-parameters are obtained by evaluating the models on the validation sets of the two datasets.
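The learning-rate schedule described above amounts to a per-epoch exponential decay applied to two parameter groups. The sketch below assumes that the decay starts after the first epoch (epoch index 0 uses the base rate); the group names are illustrative.

```python
def learning_rate(base_lr, epoch, decay=0.8):
    """Learning rate at a given epoch when decayed by a constant factor per epoch."""
    return base_lr * decay ** epoch

# Base rates from the text: 5e-5 for the visual extractor, 1e-4 for all other parameters.
BASE_LRS = {"visual_extractor": 5e-5, "other": 1e-4}

schedule = {name: [learning_rate(lr, epoch) for epoch in range(3)]
            for name, lr in BASE_LRS.items()}

for name, rates in schedule.items():
    print(name, ["%.2e" % rate for rate in rates])
```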

Effect of Relational Memory
To illustrate the effectiveness of our proposed method, we experiment with the aforementioned baselines on the two benchmark datasets. The results are reported in Table 2, with BASE+RM+MCLN representing our full model (same below).
There are several observations. First, on NLG metrics, both BASE+RM and BASE+RM+MCLN outperform the vanilla Transformer (BASE) on both datasets, which confirms the validity of incorporating memory into the decoding process of Transformer, because the highly patternized text in radiology reports is reasonably modeled to some extent. Second, our full model achieves the best performance over all baselines on different metrics, and it particularly outperforms BASE+RM with significant improvement, which indicates the usefulness of MCLN in incorporating memory compared with other ways of integration. Third, on NLG metrics, when comparing between the datasets, the performance gains of the two memory-driven models (i.e., BASE+RM and BASE+RM+MCLN) over BASE on IU X-RAY are larger than those on MIMIC-CXR. The reason might be that IU X-RAY is relatively small and the patterns among different reports in this dataset are more consistent, so that the proposed memory helps more. Fourth, on the CE metrics on MIMIC-CXR, our full model shows the same trend as on NLG metrics, where it outperforms all its baselines in terms of precision, recall and F1. This observation is important because higher NLG scores do not always result in higher clinical scores (e.g., the precision of BASE+RM on CE is lower than that of BASE), so the CE results further confirm the effectiveness of our method. Compared to BASE+RM, MCLN is able to leverage memory in a rather fine-grained way and thus better produces reasonable descriptions of clinical abnormalities.

Comparison with Previous Studies
We compare our full model (denoted as OURS) with existing models on the same datasets, with all results reported in Table 3 on both NLG and CE metrics. There are several observations drawn from different aspects. First, Transformer confirms its superiority to sequence-to-sequence structures in this task, which is illustrated by the comparison between our models (all baselines and our full model) and ST. Our full model also outperforms conventional image captioning models, e.g., ATT2IN, ADAATT and TOPDOWN, which are designed to generate a short piece of text for an image. This observation confirms that designing a specific model for long report generation is necessary for this task. Second, memory shows its effectiveness in this task when compared with more complicated models, e.g., HRGR, which uses manually extracted templates. Particularly, although reinforcement learning with a careful design of adaptive rewards (CMAS-RL) has proved to be the best solution on the two datasets, our model achieves the same goal with a simpler method. Third, some studies, e.g., HRGR, require extra information for this task, and our full model outperforms them without such requirements. This observation indicates that an appropriate end-to-end design for using memory in Transformer (such as RM and MCLN) can alleviate the need for extra resources to enhance this task.

Analysis
We analyze several aspects of our model regarding its hyper-parameters and generation results.
Memory Size To show the impact of the memory size, we train RM with different numbers of memory slots, i.e., |S| ∈ {1, 2, 3, 4}, and the results on MIMIC-CXR are shown in Table 4. In general, since the memory size controls how much information is preserved from past generation steps, enlarging the memory by adding slots results in better overall performance, with |S| = 3 achieving the best results. Still, the overall performance drops when |S| = 4, which indicates that too large a memory may introduce redundant and invalid information that negatively affects the generation process. Although enlarging the memory increases the number of parameters, only a small number of parameters (compared to the total number of parameters) is introduced whenever one slot is added to the memory. This observation suggests that the proposed model is effective and efficient in learning with memory for the radiology report generation task.
Report Length In addition to NLG and CE metrics, another important criterion for evaluating generation models is the length of the generated reports compared to the ground-truth. To examine this, we categorize all reports generated on the MIMIC-CXR test set into 10 groups (within [0, 100] with an interval of 10) according to their rounded-down lengths, and draw curves of the number of reports in each category for BASE, BASE+RM and BASE+RM+MCLN, as well as the ground-truth. The results are presented in Figure 4. Overall, reports generated by BASE+RM and BASE+RM+MCLN tend to be longer than those from BASE, and their length distributions are closer to that of the ground-truth reports, which leads to better evaluation results on NLG metrics. The reason might be that the memory provides more detailed information for the generation process, so that the decoder tends to produce more diversified outputs than the original Transformer. Particularly, when comparing BASE+RM+MCLN and BASE+RM, the length distribution of the reports generated by the former is closer to the ground-truth. A possible explanation is that, instead of applying memory to the final output, leveraging memory at each layer of Transformer is more helpful and thus controls the decoding process in a fine-grained way.
Figure 5: Illustrations of reports from ground-truth, BASE and BASE+RM+MCLN models for two X-ray chest images. To better distinguish the content in the reports, different colors highlight different medical terms.

The above observations show that both memory and the way of using it are two important factors to enhance radiology report generation.
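The length grouping used for the report length analysis can be sketched as below. Whitespace tokenization and placing reports of 100 or more tokens into the last group are assumptions made for illustration; the reports are synthetic.

```python
from collections import Counter

def length_histogram(reports, width=10, max_len=100):
    """Count reports per length group [0,10), [10,20), ..., [90,100]."""
    counts = Counter()
    for report in reports:
        n_tokens = len(report.split())                   # assumed whitespace tokenization
        start = min(n_tokens, max_len - 1) // width * width  # round down to the group start
        counts[start] += 1
    return [counts[start] for start in range(0, max_len, width)]

# Synthetic reports with 3, 12, 17 and 105 tokens.
reports = [" ".join(["token"] * n) for n in (3, 12, 17, 105)]
hist = length_histogram(reports)
print(hist)
```

Computing such a histogram for each model and for the ground-truth gives the curves that are compared in the analysis.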
Case Study To further investigate the effectiveness of our model, we perform a qualitative analysis on some cases with their ground-truth and generated reports from different models. Figure 5 shows two examples of frontal and lateral chest X-ray images from MIMIC-CXR and the corresponding reports, where different colors on the texts indicate different medical terms. It is observed in these cases that BASE+RM+MCLN is able to generate descriptions aligned with those written by radiologists, with a similar content flow. For example, in both cases, the generated reports follow a structure that starts by reporting abnormal findings (e.g., "cardiac silhouette" and "lung volumes") and then concludes with potential diseases (e.g., "pleural effusion" and "atelectasis"). In addition, BASE+RM+MCLN covers almost all of the necessary medical terms from the ground-truth reports in its generated reports, while the vanilla Transformer does much worse, e.g., the key terms "enlarged cardiac silhouette", "atelectasis" and "small pleural effusion" in the two examples are not generated.
To further investigate different models qualitatively, we randomly select a chest X-ray from the MIMIC-CXR test set and visualize the image-text attention mappings from BASE and BASE+RM+MCLN. Figure 6 shows the intermediate image-text correspondences for several words from the multi-head attentions in the first layer of the decoders. It is observed that BASE+RM+MCLN is better at aligning the locations with the indicated disease or body parts. This observation suggests that our model not only enhances the power of radiology report generation, but also improves the interaction between the images and the generated texts.
Error Analysis We analyze the errors of our model, especially those behind the low CE scores, and find that class imbalance is severe on the datasets and affects model training and inference, where majority voting is observed in the generation process. For example, on MIMIC-CXR, consolidation only accounts for 3.9% of the training set, so that the trained model produces this finding in only 2.9% of its results, compared with 6.3% in the ground truth. How to address this data bias problem is thus a possible direction for future work to improve the accuracy of the generated radiology reports.

Related Work
The most popular related task to ours is image captioning (Vinyals et al., 2015; Xu et al., 2015; Anderson et al., 2018), which aims to describe images with sentences. Different from image captioning, radiology report generation requires much longer generated outputs and possesses other features such as patterns, so that this task has its own characteristics requiring particular solutions. For example, one line of work proposed a co-attention mechanism and leveraged a hierarchical LSTM to generate reports. Li et al. ( , 2019) proposed to use a manually extracted template database to help generation, with a number of dedicated techniques to utilize the templates. Liu et al. (2019) proposed an approach with reinforcement learning to maintain the clinical accuracy of generated reports. Compared to these studies, our model offers an alternative solution to this task with an effective and efficient enhancement of Transformer via memory.
Figure 6: Visualizations of image-text attention mappings between a specific chest X-ray and generated reports from BASE and BASE+RM+MCLN, respectively. Colors from blue to red represent the weights from low (0.0) to high (1.0). Panels show the original image and the attention for the words "right", "heart", "pleural" and "lungs".
Extra knowledge, e.g., pre-trained embeddings (Song et al., 2017; Song and Shi, 2018) and pre-trained models (Devlin et al., 2019; Diao et al., 2019), can provide useful information and thus enhance model performance for many NLP tasks (Tian et al., 2020a,b,c). Specifically, memory and memory-augmented neural networks (Zeng et al., 2018; Santoro et al., 2018; Diao et al., 2020; Tian et al., 2020d) are another line of related research, which can be traced back to memory networks, proposed to leverage extra information for question answering; Sukhbaatar et al. (2015) then improved them with an end-to-end design so that the model can be trained with less supervision. There are also memory-based methods proposed particularly for Transformer. For example, Lample et al. (2019) proposed to solve the under-fitting problem of Transformer by introducing a product-key layer that is similar to a memory module. Banino et al. (2020) proposed MEMO, an adaptive memory to reason over long-distance texts. Compared to these studies, the approach proposed in this paper focuses on leveraging memory for decoding rather than encoding, and presents a relational memory to learn from previous generation processes as well as patterns for long text generation. To the best of our knowledge, this is the first study that incorporates memory into decoding with Transformer and applies it to a particular task, which may provide a reference for studies along this line of research.

Conclusion
In this paper, we propose to generate radiology reports with a memory-driven Transformer, where a relational memory is used to record the information from previous generation processes and a novel layer normalization mechanism is designed to incorporate the memory into Transformer. Experimental results on two benchmark datasets illustrate the effectiveness of the memory, used either by concatenating it with the output or by integrating it into different layers of the decoder via MCLN, with the latter obtaining state-of-the-art performance. Further analyses investigate how memory size affects model performance and show that our model is able to generate long reports with necessary medical terms and meaningful image-text attention mappings.