Show, Describe and Conclude: On Exploiting the Structure Information of Chest X-ray Reports

Chest X-Ray (CXR) images are commonly used for clinical screening and diagnosis. Automatically writing reports for these images can considerably lighten the workload of radiologists for summarizing descriptive findings and conclusive impressions. The complex structures between and within sections of the reports pose a great challenge to automatic report generation. Specifically, the section Impression is a diagnostic summarization over the section Findings, and descriptions of normality dominate each section over those of abnormality. Existing studies rarely explore or exploit this fundamental structure information. In this work, we propose a novel framework that exploits the structure information between and within report sections for generating CXR imaging reports. First, we propose a two-stage strategy that explicitly models the relationship between Findings and Impression. Second, we design a novel co-operative multi-agent system that implicitly captures the imbalanced distribution between abnormality and normality. Experiments on two CXR report datasets show that our method achieves state-of-the-art performance in terms of various evaluation metrics. Our results show that the proposed approach is able to generate high-quality medical reports by integrating the structure information.


Introduction
Chest X-Ray (CXR) image report generation aims to automatically generate detailed findings and diagnoses for given images, and has attracted growing attention in recent years (Wang et al., 2018a; Jing et al., 2018; Li et al., 2018). This technique can greatly reduce the workload of radiologists for interpreting CXR images and writing corresponding reports. In spite of the progress made in this area, it is still challenging for computers

Findings:
The cardiac silhouette is enlarged and has a globular appearance. Mild bibasilar dependent atelectasis. No pneumothorax or large pleural effusion. No acute bone abnormality.

Impression:
Cardiomegaly with globular appearance of the cardiac silhouette. Considerations would include pericardial effusion or dilated cardiomyopathy.

Figure 1: An example of a chest X-ray image along with its report. In the report, the Findings section records detailed descriptions for normal and abnormal findings; the Impression section provides a diagnostic conclusion. The underlined sentence is an abnormal finding.
to accurately write reports. Besides the difficulty of detecting lesions in images, the complex structure of textual reports also hinders automatic report generation. As shown in Figure 1, the report for a CXR image usually comprises two major sections: Findings and Impression. The Findings section records detailed descriptions of normal and abnormal findings, such as lesions (e.g., increased lung markings). The Impression section concludes diseases (e.g., pneumonia) from the Findings and forms a diagnostic conclusion, consisting of abnormal and normal conclusions.
Existing methods (Wang et al., 2018a; Jing et al., 2018; Li et al., 2018) ignored the relationship between Findings and Impression, as well as the different distributions of normal and abnormal findings/conclusions. To address this problem, we present a novel framework for automatic report generation that exploits the structure of the reports. First, considering the fact that Impression is a summarization of Findings, we propose a two-stage modeling strategy (Figure 3), where we borrow strength from the image captioning task and the text summarization task for generating Impression. Second, we decompose the generation process of both Findings and Impression into the following recurrent sub-tasks: 1) examine an area in the image (or a sentence in Findings) and decide if an abnormality appears; 2) write detailed (normal or abnormal) descriptions for the examined area. To model this generation process, we propose a novel Co-operative Multi-Agent System (CMAS), which consists of three agents: Planner (PL), Abnormality Writer (AW) and Normality Writer (NW). Given an image, the system runs several loops until PL decides to stop the process. Within each loop, the agents cooperate as follows: 1) PL examines an area of the input image (or a sentence of Findings) and decides whether the examined area contains lesions; 2) either AW or NW generates a sentence for the area based on the order given by PL. To train the system, the REINFORCE algorithm (Williams, 1992) is applied to optimize the reward (e.g., BLEU-4 (Papineni et al., 2002)). To the best of our knowledge, our work is the first effort to investigate the structure of CXR reports.
The major contributions of our work are summarized as follows. First, we propose a two-stage framework that exploits the structure of the reports. Second, we propose a novel Co-operative Multi-Agent System (CMAS) for modeling the sentence generation process of each section. Third, we perform extensive quantitative experiments to evaluate the overall quality of the generated reports, as well as the model's ability to detect medical abnormality terms. Finally, we perform substantial qualitative experiments to further understand the quality and properties of the generated reports.

Related Work
Visual Captioning The goal of visual captioning is to generate a textual description for a given image or video. For one-sentence caption generation, almost all deep learning methods (Mao et al., 2014; Vinyals et al., 2015; Donahue et al., 2015; Karpathy and Fei-Fei, 2015) were based on the Convolutional Neural Network (CNN) - Recurrent Neural Network (RNN) architecture. Inspired by the attention mechanism in human brains, attention-based models, such as visual attention (Xu et al., 2015) and semantic attention (You et al., 2016), were proposed to improve performance. Other efforts have built variants of the hierarchical Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997) to generate paragraphs (Krause et al., 2017; Yu et al., 2016; Liang et al., 2017). Recently, deep reinforcement learning has attracted growing attention in the field of visual captioning (Ren et al., 2017; Rennie et al., 2017; Liu et al., 2017; Wang et al., 2018b). Additionally, other tasks related to visual captioning (e.g., dense captioning (Johnson et al., 2016) and multi-task learning (Pasunuru and Bansal, 2017)) have also attracted considerable research attention.
Chest X-ray Image Report Generation Shin et al. (2016) first proposed a variant of the CNN-RNN framework to predict tags (location and severity) of chest X-ray images. Wang et al. (2018a) proposed a joint framework for generating reference reports and performing disease classification at the same time. However, this method was based on a single-sentence generation model (Xu et al., 2015) and obtained low BLEU scores. Jing et al. (2018) proposed a hierarchical language model equipped with co-attention to better model paragraphs, but it tended to produce normal findings. Although Li et al. (2018) enhanced language diversity and the model's ability to detect abnormalities through a hybrid of a template retrieval module and a text generation module, manually designing templates is costly, and they ignored templates' change over time.
Multi-Agent Reinforcement Learning The goal of multi-agent reinforcement learning is to solve complex problems by integrating multiple agents that focus on different sub-tasks. In general, there are two types of multi-agent systems: independent and cooperative systems (Tan, 1993). Powered by the development of deep learning, deep multi-agent reinforcement learning has gained increasing popularity. Tampuu et al. (2017) extended the Deep Q-Network (DQN) (Mnih et al., 2013) into a multi-agent DQN for the Pong game; Foerster et al. (2016) and Sukhbaatar et al. (2016) explored communication protocols among agents; Zhang et al. (2018) further studied fully decentralized multi-agent systems. Despite these attempts, multi-agent systems for long paragraph generation remain unexplored.

Overall Framework
As shown in Figure 3, the proposed framework comprises two modules: Findings and Impression. Given a CXR image, the Findings module examines different areas of the image and generates descriptions for them. Once the findings are generated, the Impression module gives a conclusion based on both the findings and the input CXR image. The proposed two-stage framework explicitly models the fact that Impression is a conclusive summarization of Findings.
Within each module, we propose a Co-operative Multi-Agent System (CMAS) (see Section 4) to model the text generation process for each section.

Overview
The proposed Co-operative Multi-Agent System (CMAS) consists of three agents: Planner (PL), Normality Writer (NW) and Abnormality Writer (AW). These agents work cooperatively to generate findings or impressions for given chest X-ray images. PL is responsible for determining whether an examined area contains abnormality, while NW and AW are responsible for describing normality or abnormality in detail (Figure 2).
The generation process consists of several loops, and each loop contains a sequence of actions taken by the agents. In the $n$-th loop, the writers first share their local states $LS_{n-1,T} = \{w_{n-1,t}\}_{t=1}^{T}$ (actions taken in the previous loop) to form a shared global state $GS_n = (I, \{s_i\}_{i=1}^{n-1})$, where $I$ is the input image, $s_i$ is the $i$-th generated sentence, and $w_{i,t}$ is the $t$-th word in the $i$-th sentence of length $T$. Based on the global state $GS_n$, PL decides whether to stop the generation process or to choose a writer (NW or AW) to produce the next sentence $s_n$. If a writer is selected, it refreshes its memory with $GS_n$ and generates a sequence of words $\{w_{n,t}\}_{t=1}^{T}$ based on the sequence of local states $LS_{n,t} = \{w_{n,1}, \dots, w_{n,t-1}\}$.
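The control flow above can be sketched in a few lines of Python; the `planner` and `writers` callables below are illustrative stand-ins for the trained agents, not the authors' implementation:

```python
# Sketch of the CMAS generation loop (a schematic, not the authors' code).
STOP, NW, AW = 0, 1, 2  # actions the Planner can take

def generate_paragraph(image, planner, writers, max_loops=10):
    """Run PL/NW/AW cooperatively until PL emits STOP.
    `planner(image, sentences)` returns 0 (STOP), 1 (NW) or 2 (AW);
    `writers[action](image, sentences)` decodes one sentence."""
    sentences = []  # together with the image, this forms the global state
    for _ in range(max_loops):
        action = planner(image, sentences)
        if action == STOP:
            break
        sentences.append(writers[action](image, sentences))
    return sentences
```

A deterministic stub planner is enough to exercise the loop: choose NW, then AW, then STOP, and the paragraph contains exactly the two writer outputs.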
Once the generation process is terminated, the reward module will compute a reward by comparing the generated report with the ground-truth report. Given the reward, the whole system is trained via REINFORCE algorithm (Williams, 1992).

Global State Encoder
During the generation process, each agent makes decisions based on the global state $GS_n$. Since $GS_n$ contains a list of sentences $\{s_i\}_{i=1}^{n-1}$, a common practice is to build a hierarchical LSTM as a Global State Encoder (GSE) for encoding it. However, equipping each agent in CMAS with such an encoder and its excessive number of parameters would be computation-consuming. We address this problem in two steps. First, we tie the weights of GSE across the three agents. Second, instead of encoding previous sentences from scratch, GSE dynamically encodes $GS_n$ based on $GS_{n-1}$. Specifically, we propose a single-layer LSTM with soft attention (Xu et al., 2015) as GSE. It takes a multi-modal context vector $ctx_n \in \mathbb{R}^H$ as input, obtained by jointly embedding sentence $s_{n-1}$ and image $I$ into a hidden space of dimension $H$, and generates the global hidden state vector $gs_n \in \mathbb{R}^H$ for the $n$-th loop by:

$$gs_n = \mathrm{LSTM}(ctx_n;\, gs_{n-1})$$

We adopt a visual attention module for producing the context vector $ctx_n$, given its capability of capturing the correlation between language and images (Xu et al., 2015). The inputs to the attention module are visual feature vectors $\{v_p\}_{p=1}^{P} \in \mathbb{R}^C$ and the local state vector $ls_{n-1}$ of sentence $s_{n-1}$. Here, $\{v_p\}_{p=1}^{P}$ are extracted from an intermediate layer of a CNN, $C$ is the number of channels, and $p$ is the position index of $v_p$; $ls_{n-1}$ is the final hidden state of a writer (defined in Section 4.2.3). Formally, the context vector $ctx_n$ is computed by:

$$\alpha_p = \mathrm{softmax}\big(W_{att}\,\tanh(W_h\,[v_p;\, ls_{n-1}])\big)$$
$$v_{att,n} = \sum_{p=1}^{P} \alpha_p\, v_p$$
$$ctx_n = W_{ctx}\,[v_{att,n};\, ls_{n-1}]$$

where $W_h$, $W_{att}$ and $W_{ctx}$ are parameter matrices; $\{\alpha_p\}_{p=1}^{P}$ are weights for the visual features; and $[;]$ denotes concatenation.
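A minimal NumPy sketch of this soft-attention context computation follows. The parameter shapes and the tanh scoring form are assumptions in the style of Xu et al. (2015), since the paper's exact equations are not reproduced in this text:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def context_vector(V, ls_prev, W_h, W_att, W_ctx):
    """Soft visual attention over P region vectors V (P x C), conditioned on
    the previous local state ls_prev (H,). Returns the multi-modal context
    vector ctx_n and the attention weights alpha (which sum to 1)."""
    P = V.shape[0]
    # score each region jointly with the local state, then normalise
    scores = np.array([W_att @ np.tanh(W_h @ np.concatenate([V[p], ls_prev]))
                       for p in range(P)])
    alpha = softmax(scores)
    v_att = (alpha[:, None] * V).sum(axis=0)        # attended visual vector (C,)
    ctx = W_ctx @ np.concatenate([v_att, ls_prev])  # multi-modal context (H,)
    return ctx, alpha
```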
At the beginning of the generation process, the global state is $GS_1 = (I)$. Let $\bar{v} = \frac{1}{P}\sum_{p=1}^{P} v_p$; the initial global state $gs_0$ and cell state $c_0$ are computed by two single-layer neural networks:

$$gs_0 = \tanh(W_{gs}\,\bar{v}), \qquad c_0 = \tanh(W_c\,\bar{v})$$

where $W_{gs}$ and $W_c$ are parameter matrices.

Planner
After examining an area, the Planner (PL) determines: 1) whether to terminate the generation process; and 2) which writer should generate the next sentence. Specifically, besides the shared Global State Encoder (GSE), the rest of PL is modeled by a two-layer feed-forward network:

$$p^{PL}_n = \mathrm{softmax}\big(W_3\,\tanh(W_2\,\tanh(W_1\, gs_n))\big), \qquad idx_n = \arg\max_i\, p^{PL}_n(i)$$

where $W_1$, $W_2$, and $W_3$ are parameter matrices, and the indicator $idx_n \in \{0, 1, 2\}$, with 0 standing for STOP, 1 for NW and 2 for AW. Namely, if $idx_n = 0$, the generation process is terminated; otherwise, NW ($idx_n = 1$) or AW ($idx_n = 2$) generates the next sentence $s_n$.
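The Planner head can be sketched as follows; the tanh non-linearities are an assumption, since the exact activation is not stated in this text:

```python
import numpy as np

def planner_decision(gs, W1, W2, W3):
    """Two-layer feed-forward Planner head over the global state gs (H,).
    Returns (idx, probs) where probs is a distribution over the three
    actions {0: STOP, 1: NW, 2: AW}."""
    h1 = np.tanh(W1 @ gs)
    h2 = np.tanh(W2 @ h1)
    logits = W3 @ h2                  # 3 logits: STOP / NW / AW
    e = np.exp(logits - logits.max())
    probs = e / e.sum()
    return int(probs.argmax()), probs
```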

Writers
The number of normal sentences is usually 4-12 times the number of abnormal sentences in each report. With such a highly unbalanced distribution, using only one decoder to model all of the sentences would make the generation of normal sentences dominant. To solve this problem, we design two writers, i.e., a Normality Writer (NW) and an Abnormality Writer (AW), to model normal and abnormal sentences respectively. In principle, the architectures of NW and AW can be different; in our practice, we adopt a single-layer LSTM for both, following the principle of parsimony. Given a global state vector $gs_n$, CMAS first chooses a writer based on $idx_n$. The chosen writer re-initializes its memory by taking $gs_n$ and a special token BOS (Begin of Sentence) as its first two inputs. The procedure for generating words is:

$$h_t = \mathrm{LSTM}(W_e\, y_{w_{t-1}};\, h_{t-1})$$
$$p_t = \mathrm{softmax}(W_{out}\, h_t)$$

where $y_{w_{t-1}}$ is the one-hot encoding vector of word $w_{t-1}$; $h_{t-1}, h_t \in \mathbb{R}^H$ are hidden states of the LSTM; $W_e$ is the word embedding matrix and $W_{out}$ is a parameter matrix; $p_t$ gives the output probability distribution over the vocabulary. Upon completion of the procedure (either the token EOS (End of Sentence) is produced or the maximum time step $T$ is reached), the last hidden state of the LSTM is used as the local state vector $ls_n$, which is fed into GSE for generating the next global state vector $gs_{n+1}$.
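The word-by-word decoding procedure can be sketched as a greedy loop; `step` below is a hypothetical stand-in for one LSTM step followed by the $W_{out}$ softmax:

```python
def write_sentence(step, vocab, bos="BOS", eos="EOS", max_len=20):
    """Greedy decoding loop for a writer.
    `step(prev_word, state) -> (probs, state)` produces the distribution
    p_t over the vocabulary at each time step. Decoding stops at EOS or
    after max_len words; the final state plays the role of ls_n."""
    words, state, prev = [], None, bos
    for _ in range(max_len):
        probs, state = step(prev, state)
        prev = vocab[max(range(len(probs)), key=probs.__getitem__)]
        if prev == eos:
            break
        words.append(prev)
    return words, state
```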

Reward Module
We use BLEU-4 (Papineni et al., 2002) to design rewards for all agents in CMAS. A generated paragraph is a collection $(s^{ab}, s^{nr})$ of normal sentences $s^{nr} = \{s^{nr}_1, \dots, s^{nr}_{N_{nr}}\}$ and abnormal sentences $s^{ab} = \{s^{ab}_1, \dots, s^{ab}_{N_{ab}}\}$, where $N_{ab}$ and $N_{nr}$ are the numbers of abnormal and normal sentences, respectively. Similarly, the ground-truth paragraph corresponding to the generated paragraph $(s^{ab}, s^{nr})$ is $(s^{*ab}, s^{*nr})$.
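The per-group scoring can be illustrated with a toy scorer. CMAS uses real BLEU-4, and the per-sentence reward shaping (Equation 15) is not reproduced in this text, so `unigram_precision` below is only a simplified stand-in for the idea of scoring normal and abnormal sentence groups against their own ground-truth groups:

```python
def unigram_precision(candidate, reference):
    """Toy stand-in for BLEU-4: fraction of candidate tokens found in the
    reference. Real CMAS uses corpus-level BLEU-4 instead."""
    cand, ref = candidate.split(), set(reference.split())
    return sum(w in ref for w in cand) / max(len(cand), 1)

def group_rewards(gen_nr, gen_ab, gt_nr, gt_ab):
    """Score the normal and abnormal sentence groups separately, mirroring
    the (s_nr, s_ab) split described above."""
    r_nr = unigram_precision(" ".join(gen_nr), " ".join(gt_nr))
    r_ab = unigram_precision(" ".join(gen_ab), " ".join(gt_ab))
    return r_nr, r_ab
```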

Reinforcement Learning
Given an input image $I$, the three agents (PL, NW and AW) in CMAS work cooperatively to generate a paragraph $s = \{s_1, s_2, \dots, s_N\}$ with the joint goal of maximizing the discounted reward $R(s_n)$ (Equation 15) for each sentence $s_n$. The loss of a paragraph $s$ is the negative expected reward:

$$L(\theta) = -\,\mathbb{E}_{n,\, s_n \sim \pi_\theta}\!\left[R(s_n)\right]$$

where $\pi_\theta$ denotes the entire policy network of CMAS. Following the standard REINFORCE algorithm (Williams, 1992), the gradient of the expectation $\mathbb{E}_{n,\, s_n \sim \pi_\theta}[R(s_n)]$ in Equation 16 can be written as:

$$\nabla_\theta\, \mathbb{E}_{n,\, s_n \sim \pi_\theta}[R(s_n)] = -\,\mathbb{E}_{n,\, s_n \sim \pi_\theta}\!\left[R(s_n)\, \nabla_\theta\big(-\log \pi_\theta(s_n, idx_n)\big)\right]$$

where $-\log \pi_\theta(s_n, idx_n)$ is the joint negative log-likelihood of sentence $s_n$ and its indicator $idx_n$, which can be decomposed as:

$$-\log \pi_\theta(s_n, idx_n) = \mathbb{1}(idx_n{=}2)\, L_{AW} + \mathbb{1}(idx_n{=}1)\, L_{NW} + L_{PL}$$

$$L_{AW} = -\sum_{t=1}^{T} \log p_{AW}(w_{n,t}), \quad L_{NW} = -\sum_{t=1}^{T} \log p_{NW}(w_{n,t}), \quad L_{PL} = -\log p_{PL}(idx_n)$$

where $L_{AW}$, $L_{NW}$ and $L_{PL}$ are negative log-likelihoods; $p_{AW}$, $p_{NW}$ and $p_{PL}$ are the probabilities of taking an action; and $\mathbb{1}$ denotes the indicator function. Therefore, Equation 17 can be re-written as:

$$\nabla_\theta\, \mathbb{E}_{n,\, s_n \sim \pi_\theta}[R(s_n)] = -\,\mathbb{E}_{n,\, s_n \sim \pi_\theta}\!\left[R(s_n)\big(\mathbb{1}(idx_n{=}2)\,\nabla_\theta L_{AW} + \mathbb{1}(idx_n{=}1)\,\nabla_\theta L_{NW} + \nabla_\theta L_{PL}\big)\right]$$
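The structure of the resulting policy-gradient loss (omitting baselines and the discounting of Equation 15, which is not reproduced in this text) can be sketched as:

```python
def reinforce_loss(rewards, logp_pl, logp_writer):
    """Schematic REINFORCE objective for CMAS: each sentence's reward R(s_n)
    scales the joint negative log-likelihood of the chosen indicator (PL)
    and the words of the selected writer (NW or AW).
    `rewards[n]`      : R(s_n)
    `logp_pl[n]`      : log p_PL(idx_n)
    `logp_writer[n]`  : sum over t of log p_writer(w_{n,t})"""
    loss = 0.0
    for n, R in enumerate(rewards):
        nll = -logp_pl[n] - logp_writer[n]  # -log pi(s_n, idx_n), decomposed
        loss += R * nll                     # minimising this follows R * grad NLL
    return loss / len(rewards)
```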

Imitation Learning
It is very hard to train agents using reinforcement learning from scratch; therefore, a good initialization of the policy network is usually required (Bahdanau et al., 2016; Silver et al., 2016; Wang et al., 2018b). We apply imitation learning with a cross-entropy loss to pre-train the policy network. Formally, the cross-entropy loss is defined as:

$$L_{CE} = \sum_{n=1}^{N}\Big(\lambda_{PL}\, L_{PL}(idx^*_n) + \mathbb{1}(idx^*_n{=}1)\,\lambda_{NW}\sum_{t=1}^{T} L_{NW}(w^*_{n,t}) + \mathbb{1}(idx^*_n{=}2)\,\lambda_{AW}\sum_{t=1}^{T} L_{AW}(w^*_{n,t})\Big)$$

where $w^*$ and $idx^*$ denote the ground-truth word and indicator, respectively; $\lambda_{PL}$, $\lambda_{NW}$ and $\lambda_{AW}$ are balancing coefficients among agents; $N$ and $T$ are the number of sentences and the number of words within a sentence, respectively.
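The balancing structure of this pre-training loss can be illustrated as follows (a sketch of the weighted teacher-forcing objective, not the exact Equation 20):

```python
import math

def nll(probs_of_truth):
    """Sum of negative log-probabilities the model assigns to the
    ground-truth words/indicators under teacher forcing."""
    return -sum(math.log(p) for p in probs_of_truth)

def imitation_loss(p_pl, p_nw, p_aw, lam=(1.0, 1.0, 1.0)):
    """Cross-entropy pre-training loss balancing the three agents with
    coefficients (lambda_PL, lambda_NW, lambda_AW); all 1.0 in the paper."""
    return lam[0] * nll(p_pl) + lam[1] * nll(p_nw) + lam[2] * nll(p_aw)
```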

CMAS for Impression
Different from the Findings module, the inputs of the Impression module contain not only the image $I$ but also the generated findings $f = \{f_1, f_2, \dots, f_{N_f}\}$, where $N_f$ is the total number of sentences. Thus, for the Impression module, the $n$-th global state becomes $GS_n = (I, f, \{s_i\}_{i=1}^{n-1})$. The rest of CMAS for the Impression module is exactly the same as for the Findings module. To encode $f$, we extend the definition of the multi-modal context vector $ctx_n$ (Equation 5) to:

$$ctx_n = W_{ctx}\,[v_{att,n};\, f_{att,n};\, ls_{n-1}]$$

where $f_{att}$ is the soft attention (Bahdanau et al., 2014; Xu et al., 2015) vector over the findings, obtained similarly to $v_{att}$ (Equations 3 and 4).

Datasets

IU-Xray The Indiana University Chest X-Ray Collection (IU-Xray) (Demner-Fushman et al., 2015) is a public dataset containing 3,955 fully de-identified radiology reports collected from the Indiana Network for Patient Care. Each report is associated with frontal and/or lateral chest X-ray images, and there are 7,470 chest X-ray images in total. Each report comprises several sections: Impression, Findings, Indication, etc. We preprocess the reports by tokenizing, converting tokens to lowercase and removing non-alpha tokens.
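The described preprocessing can be sketched as follows; the authors' exact tokenizer is not specified, so the splitting rule here is an assumption:

```python
import re

def preprocess_report(text):
    """Tokenise, lowercase, and drop non-alphabetic tokens, following the
    preprocessing described for IU-Xray. Punctuation and digits become
    separate tokens and are then filtered out by isalpha()."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return [t for t in tokens if t.isalpha()]
```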
CX-CHR CX-CHR (Li et al., 2018) is a proprietary Chinese chest X-ray report dataset collected from a professional medical examination institution. It contains examination records for 35,500 unique patients, each consisting of one or multiple chest X-ray images as well as a textual report written by professional radiologists. Each textual report has sections such as Complain, Findings and Impression. The textual reports are preprocessed by tokenizing with "jieba"1, a Chinese text segmentation tool, and filtering rare tokens.

Experimental Setup
Abnormality Term Extraction Human experts helped manually design patterns for the most frequent medical abnormality terms in the datasets. These patterns are used for labeling sentences as normal or abnormal, and for evaluating models' ability to detect abnormality terms. The abnormality terms in Findings and Impression differ to some degree, because many abnormality terms in Findings are descriptions rather than specific disease names. For example, "low lung volumes" and "thoracic degenerative" usually appear in Findings but not in Impression.
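Pattern-based labeling of this kind can be sketched as follows; the patterns below are hypothetical examples, not the expert-designed ones (which would also need to handle negations such as "no pleural effusion"):

```python
import re

# Hypothetical abnormality patterns for illustration only; the real patterns
# were hand-designed by clinical experts.
ABNORMAL_PATTERNS = [
    r"cardiomegaly", r"enlarged", r"atelectasis",
    r"low lung volumes?", r"thoracic degenerative",
]

def is_abnormal(sentence):
    """Label a sentence as abnormal if any pattern matches. Such labels
    route sentences to AW vs NW and drive term-level evaluation."""
    s = sentence.lower()
    return any(re.search(p, s) for p in ABNORMAL_PATTERNS)
```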

Evaluation Metrics
We evaluate our proposed method and the baseline methods on BLEU (Papineni et al., 2002), ROUGE (Lin, 2004) and CIDEr (Vedantam et al., 2015). The results for these metrics are obtained with the standard image captioning evaluation tool2. We also calculate precision and average False Positive Rate (FPR) for abnormality detection in the generated textual reports on both datasets.
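Term-level precision and FPR can be computed as follows, assuming term detection reduces to set membership over a fixed abnormality vocabulary (the paper does not spell out this computation, so treat it as one plausible reading):

```python
def term_precision_fpr(predicted, reference, all_terms):
    """Precision and false positive rate for abnormality-term detection.
    `predicted` / `reference`: sets of terms found in the generated and
    ground-truth reports; `all_terms`: the full abnormality vocabulary."""
    tp = len(predicted & reference)            # correctly detected terms
    fp = len(predicted - reference)            # hallucinated terms
    negatives = len(all_terms - reference)     # terms truly absent
    precision = tp / len(predicted) if predicted else 0.0
    fpr = fp / negatives if negatives else 0.0
    return precision, fpr
```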

Implementation Details
The dimensions of all hidden states in the Abnormality Writer, Normality Writer, Planner and shared Global State Encoder are set to 512. The dimension of the word embeddings is also set to 512.
We adopt ResNet-50 (He et al., 2016) as the image encoder; visual features are extracted from its last convolutional layer, which yields a 7 × 7 × 2048 feature map. The image encoder is pretrained on ImageNet (Deng et al., 2009). For the IU-Xray dataset, the image encoder is fine-tuned on the ChestX-ray14 dataset, since the IU-Xray dataset is too small. For the CX-CHR dataset, the image encoder is fine-tuned on its training set. The weights of the image encoder are then fixed for the rest of the training process.
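Flattening the feature map into the region vectors consumed by the attention module can be sketched as:

```python
import numpy as np

def flatten_features(feature_map):
    """Flatten an H' x W' x C CNN feature map into P = H' * W' region
    vectors {v_p} of dimension C; for ResNet-50's 7 x 7 x 2048 map this
    gives P = 49 and C = 2048."""
    return feature_map.reshape(-1, feature_map.shape[-1])
```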
During the imitation learning stage, the cross-entropy loss (Equation 20) is adopted for all of the agents, with λ_PL, λ_AW and λ_NW set to 1.0. We use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 5 × 10^-4 for both datasets. During the reinforcement learning stage, the gradients of the weights are calculated based on Equation 19. We again adopt the Adam optimizer for both datasets, with the learning rate fixed at 10^-6.
Comparison Methods For the Findings section, we compare our proposed method with state-of-the-art methods for CXR imaging report generation: CoAtt (Jing et al., 2018) and HGRG-Agent (Li et al., 2018), as well as several state-of-the-art image captioning models: CNN-RNN (Vinyals et al., 2015), LRCN (Donahue et al., 2015), AdaAtt, and Att2in (Rennie et al., 2017). In addition, we implement several ablated versions of the proposed CMAS to evaluate its components: CMAS_W is a single-agent system containing only one writer, trained on both normal and abnormal findings. CMAS_NW,AW is a simple concatenation of two single-agent systems, CMAS_NW and CMAS_AW, which are trained on only normal findings and only abnormal findings, respectively. Finally, we report CMAS's performance with imitation learning (CMAS-IL) and reinforcement learning (CMAS-RL).
For the Impression section, we compare our method with Xu et al. (2015): SoftAtt_vision and SoftAtt_text, which are trained with visual input only (no findings) and textual input only (no images), respectively. We also report CMAS trained only on visual or textual input: CMAS_vision and CMAS_text. Finally, we compare CMAS-IL with CMAS-RL.

Table 2: Main results for impression generation on the CX-CHR (upper) and IU-Xray (lower) datasets. BLEU-n denotes the BLEU score that uses up to n-grams.

Main Results
Comparison to State-of-the-art Table 1 shows results on the automatic metrics for the Findings module. On both datasets, CMAS outperforms all baseline methods on almost all metrics, which indicates its overall efficacy for generating reports that resemble those written by human experts. The methods can be divided into two groups: single-sentence models (CNN-RNN, LRCN, AdaAtt, Att2in) and hierarchical models (CoAtt, HGRG-Agent, CMAS). Hierarchical models consistently outperform single-sentence models on both datasets, suggesting that hierarchical models are better suited for modeling paragraphs. The leading performance of CMAS-IL and CMAS-RL over the rest of the hierarchical models demonstrates the validity of our practice of exploiting the structure information within sections.
Ablation Study CMAS_W has only one writer, which is trained on both normal and abnormal findings. In CMAS_NW,AW, by contrast, the two writers are trained separately and cannot communicate: NW might state that the "heart size is normal", while AW believes "the heart is enlarged". Such conflicts would negatively affect their joint performance. As evidently shown in Table 1, CMAS-IL achieves higher scores than CMAS_NW,AW, directly proving the importance of communication between agents and thus the importance of PL. Finally, it can be observed from Table 1 that CMAS-RL consistently outperforms CMAS-IL on all metrics, which demonstrates the effectiveness of reinforcement learning.
Impression Module As shown in Table 2, CMAS_vision and CMAS_text achieve higher scores than SoftAtt_vision and SoftAtt_text, indicating the effectiveness of CMAS. It can also be observed from Table 2 that images provide better information than text, since CMAS_vision and SoftAtt_vision exceed the scores of CMAS_text and SoftAtt_text by a large margin on most of the metrics. However, further comparison among CMAS-IL, CMAS_text and CMAS_vision shows that textual information can still improve the model's performance to some degree.

Abnormality Detection
The automatic evaluation metrics (e.g., BLEU) are based on n-gram similarity between the generated and ground-truth sentences. A model can easily obtain high scores on these metrics simply by generating normal findings (Jing et al., 2018). To better understand CMAS's ability to detect abnormalities, we report its precision and average False Positive Rate (FPR) for abnormality term detection in Tables 3 and 4. Table 3 shows that CMAS-RL obtains the highest precision and the lowest average FPR on both datasets, indicating its advantage in detecting abnormalities. Table 4 shows that CMAS-RL achieves the highest precision scores, but not the lowest FPR. However, FPR can be lowered simply by generating normal sentences, which is exactly the behavior of CMAS_text.

Qualitative Analysis
In this section, we evaluate the overall quality of the generated reports through several examples. Figure 4 presents five reports generated by CMAS-RL and CMAS_W, where the top four images contain abnormalities and the bottom image is a normal case. It can be observed from the top four examples that the reports generated by CMAS-RL successfully detect the major abnormalities, such as "cardiomegaly", "low lung volumes" and "calcified granulomas". However, CMAS-RL sometimes misses secondary abnormalities. For instance, in the third example, "right lower lobe" is wrongly written as "right upper lobe" by CMAS-RL. We find that both CMAS-RL and CMAS_W are capable of producing accurate normal findings, since the generated reports highly resemble those written by radiologists (as shown in the last example in Figure 4). Additionally, CMAS_W tends to produce normal findings, which results from the overwhelming number of normal findings in the dataset.

Template Learning
Radiologists tend to use reference templates when writing reports, especially for normal findings.
Manually designing a template database can be costly and time-consuming. By comparing the sentences most frequently generated by CMAS with the template sentences most used in the ground-truth reports, we show that the Normality Writer (NW) in the proposed CMAS is capable of learning these templates automatically. Several of the most frequently used template sentences (Li et al., 2018) in the IU-Xray dataset are shown in Table 5. The top 10 template sentences generated by NW are presented in Table 6. In general, the template sentences generated by NW are similar to the top templates in the ground-truth reports.

Conclusion
In this paper, we proposed a novel framework for accurately generating chest X-ray imaging reports by exploiting the structure information in the reports. We explicitly modeled the between-section structure with a two-stage framework, and implicitly captured the within-section structure with a novel Co-operative Multi-Agent System (CMAS) comprising three agents: Planner (PL), Abnormality Writer (AW) and Normality Writer (NW). The entire system was trained with the REINFORCE algorithm. Extensive quantitative and qualitative experiments demonstrated that the proposed CMAS not only generates meaningful and fluent reports, but also accurately describes the detected abnormalities.