On the Automatic Generation of Medical Imaging Reports

Medical imaging is widely used in clinical practice for diagnosis and treatment. Report-writing can be error-prone for inexperienced physicians, and time-consuming and tedious for experienced physicians. To address these issues, we study the automatic generation of medical imaging reports. This task presents several challenges. First, a complete report contains multiple heterogeneous forms of information, including findings and tags. Second, abnormal regions in medical images are difficult to identify. Third, the reports are typically long, containing multiple sentences. To cope with these challenges, we (1) build a multi-task learning framework which jointly performs the prediction of tags and the generation of paragraphs, (2) propose a co-attention mechanism to localize regions containing abnormalities and generate narrations for them, and (3) develop a hierarchical LSTM model to generate long paragraphs. We demonstrate the effectiveness of the proposed methods on two publicly available datasets.


Introduction
Medical images, such as radiology and pathology images, are widely used in hospitals and clinics for the diagnosis and treatment of many diseases, such as pneumonia, pneumothorax, interstitial lung disease, heart failure, bone fracture and hiatal hernia, to name a few. The reading and interpretation of medical images are usually conducted by specialized medical professionals. For example, radiology and pathology images are read by radiologists and pathologists, respectively. They write textual reports (Figure 1) to narrate the findings regarding each area of the body examined in the imaging study, specifically whether each area was found to be normal, abnormal or potentially abnormal.
Figure 1. An exemplar chest x-ray report, which consists of multiple sections of information. In the impression section, the radiologist combines the findings, patient clinical history and indication for the imaging study and provides a diagnosis. The findings section lists the radiology observations and findings regarding each area of the body examined in the imaging study. The tags section lists the keywords which represent the critical information in the findings. These keywords are identified using the Medical Text Indexer (MTI).
For less-experienced radiologists and pathologists, especially those working in rural areas where the quality of healthcare is relatively low, writing medical-imaging reports is demanding. For instance, to correctly read a chest x-ray image, the following skills are needed [2]: (1) thorough knowledge of the normal anatomy of the thorax and the basic physiology of chest diseases; (2) skill in analyzing the radiograph through a fixed pattern; (3) ability to evaluate the evolution over time; (4) knowledge of clinical presentation and history; (5) knowledge of the correlation with other diagnostic results (laboratory results, electrocardiogram, respiratory function tests). For experienced radiologists and pathologists, writing imaging reports is tedious and time-consuming. In nations with large populations such as China, a radiologist may need to read hundreds of radiology images per day. Typing the findings of each image into a computer takes about 5-10 minutes, which occupies most of their working time. In sum, for both inexperienced and experienced medical professionals, writing imaging reports is unpleasant.
This motivates us to investigate whether it is possible to automatically generate medical image reports. Several challenges need to be addressed. First, a complete diagnostic report is comprised of multiple heterogeneous forms of information. As shown in Figure 1, the report for a chest x-ray contains an impression, which is a sentence, findings, which are a paragraph, and tags, which are a list of keywords. Generating this heterogeneous information in a unified framework is technically demanding. We address this problem by building a multi-task framework, which treats the prediction of tags as a multi-label classification task, and treats the generation of long descriptions (e.g. impressions, findings) as a text generation task. In the framework, the two tasks share the same CNN used for learning visual features and are performed jointly. Second, an imaging report usually focuses more on narrating the abnormal findings since they directly indicate diseases and guide treatment. Localizing the image regions that contain abnormalities and attaching the right description to them is challenging. We solve this problem by introducing a co-attention mechanism, which simultaneously attends to images and predicted tags and explores the synergistic effects of visual and semantic information. Third, the descriptions in imaging reports are usually long, containing multiple sentences or even multiple paragraphs. Generating such long text is highly nontrivial. Rather than adopting a single-layer LSTM, which is less capable of modeling long word sequences, we leverage the compositional nature of the report and adopt a hierarchical LSTM to produce long texts. Combined with the co-attention mechanism, the hierarchical LSTM first generates high-level topics, and then produces fine-grained descriptions according to the topics.
Overall, the main contributions of our work are:
• We propose a multi-task learning framework which can simultaneously predict the tags and generate the text descriptions.
• We introduce a co-attention mechanism for localizing abnormal regions and generating the corresponding descriptions.
• We build a hierarchical LSTM to generate long sentences and paragraphs.
• We perform extensive quantitative and qualitative experiments to show the effectiveness of the proposed method.
The rest of the paper is organized as follows. Section 2 reviews related works. Section 3 introduces the method. Section 4 presents the experimental results and Section 5 concludes the paper.

Related works
Textual labeling of medical images. There have been several works aiming at attaching "texts" to medical images. In their settings, the target "texts" are either fully-structured or semi-structured (e.g. tags, attributes, templates), rather than natural texts. Kisilev et al. [10] build a pipeline to predict the attributes of medical images. Some of these attributes are textual tags. Given a medical image, they first perform image segmentation, then extract visual features of the segments and finally build a classifier to categorize the segments into predefined (textual) categories. Shin et al. [16] adopt a CNN-RNN based framework to predict tags of chest x-ray images. They use a CNN to detect a disease from an image and an RNN to describe the contexts (e.g., location, severity and the affected organs) of the detected disease. The work closest to medical report generation was recently contributed by Zhang et al. [22]. They aim at generating pathology reports containing 30-59 words. However, the reports studied in their work are semi-structured and are not fully natural. Most reports are generated by rephrasing a small number of "standard" reports, whose contents are restricted to 5 predefined topics.
To the best of our knowledge, our work is the first to generate truly natural reports of the kind written by physicians. These reports are long and cover diverse topics, which makes them much more challenging to generate than the tags and semi-structured paragraphs studied in previous works.
Image captioning with deep learning. Image captioning aims at automatically generating text descriptions for given images. Most recent image captioning models are based on a CNN-RNN framework [19,6,9,20,21,8,11]. Vinyals et al. [19] feed the image features extracted from the last hidden layer of a CNN into an LSTM to generate captions. Fang et al. [6] first use a CNN to detect the concepts in the image, and then use these detected concepts to guide the language model to generate a sentence. Karpathy et al. [9] propose to use a multi-modal recurrent neural network to learn the alignment between visual and semantic features, and then produce the descriptions for given images.
Recently, attention mechanisms have been shown to be useful for image captioning [20,21]. Xu et al. [20] introduce a spatial-visual attention mechanism over image features extracted from intermediate layers of a CNN. You et al. [21] propose a semantic attention mechanism over tags of given images. To better leverage both the visual features and the semantic tags, we propose a co-attention mechanism for report generation.
Instead of generating only a one-sentence caption for an image, Johnson et al. [8] introduce the dense captioning task, which requires the model to generate a description for each of the detected image regions. Krause et al. [11] and Liang et al. [12] generate paragraph captions for images through a hierarchical LSTM. Our method also adopts a hierarchical LSTM for paragraph generation but, unlike [11], we use a co-attention network to generate topics.

Methods
In this section, we introduce the methods, beginning with an overview followed by the detailed description of each component.

Overview
A complete diagnostic report for a medical image is comprised of both unstructured descriptions (in the form of sentences and paragraphs) and semi-structured tags (in the form of keyword lists), as shown in Figure 1. We propose a multi-task hierarchical model with co-attention for automatically predicting keywords and generating long paragraphs. Given an image divided into regions (patches), we use a CNN to learn visual features for these regions. These visual features are then fed into a multi-label classification (MLC) network to predict the relevant tags. In the tag vocabulary, each tag is represented by a word-embedding vector. Given the predicted tags for a specific image, their word-embedding vectors are retrieved to serve as the semantic features of this image. The visual features and semantic features are then fed into a co-attention model to generate a context vector that simultaneously captures the visual and semantic information of this image. At this point, the encoding process is complete. Next, starting from the context vector, the decoding process generates the text descriptions. The description of a medical image usually contains multiple sentences, and each sentence focuses on one specific topic. Our model leverages this compositional structure to generate reports in a hierarchical way: it first generates a sequence of high-level topic vectors representing sentences, then generates a sentence (a sequence of words) from each topic vector. Specifically, the context vector is fed into a sentence LSTM, which unrolls for a few steps, each producing a topic vector. A topic vector represents the semantics of a sentence to be generated. Given a topic vector, the word LSTM takes it as input and generates a sequence of words to form a sentence. The termination of the unrolling process is controlled by the sentence LSTM.
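The decoding loop described above can be sketched as follows. This is only an illustration of the control flow: `sentence_step` and `word_decoder` are hypothetical callables standing in for the sentence LSTM (with its co-attention, topic generator and stop control) and the word LSTM, and the toy stand-ins at the bottom are made up so that the sketch runs. Checking the stop probability before decoding the sentence is also an assumption.

```python
def generate_report(features, tag_embeddings, sentence_step, word_decoder,
                    max_sentences=8, stop_threshold=0.5):
    """Hierarchical decoding: one topic per sentence, one word sequence per topic.

    sentence_step and word_decoder are hypothetical placeholders, not the
    networks used in the paper.
    """
    hidden = None      # sentence-LSTM state; None before the first step
    sentences = []
    for _ in range(max_sentences):
        # One sentence-LSTM step yields a new state, a topic and a stop probability.
        hidden, topic, p_stop = sentence_step(features, tag_embeddings, hidden)
        if p_stop > stop_threshold:
            break      # the sentence LSTM ends the report
        sentences.append(word_decoder(topic))
    return sentences


# Toy stand-ins (assumptions for illustration only).
def toy_sentence_step(features, tag_embeddings, hidden):
    step = 0 if hidden is None else hidden + 1
    p_stop = 0.9 if step >= 2 else 0.1   # ask to stop at the third step
    return step, "topic-%d" % step, p_stop


def toy_word_decoder(topic):
    return ["words", "for", topic]


report = generate_report([], [], toy_sentence_step, toy_word_decoder)
```

With these stand-ins the loop produces two sentences and stops at the third step, when the stop probability first exceeds the threshold.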

Tag prediction
The first task of our model is predicting the tags of the given image. We treat the tag prediction task as a multi-label classification task. Specifically, given an image $I$, we first extract its features $\{v_n\}_{n=1}^{N} \in \mathbb{R}^{D}$ from an intermediate layer of a CNN, and then feed $\{v_n\}_{n=1}^{N}$ into a multi-label classification (MLC) network to generate a distribution over all of the $L$ tags:
$$p_{l,pred}\left(l_i = 1 \mid \{v_n\}_{n=1}^{N}\right) \propto \exp\left(\mathrm{MLC}_i\left(\{v_n\}_{n=1}^{N}\right)\right)$$
where $l \in \mathbb{R}^{L}$ is the tag vector, $l_i = 1$ and $l_i = 0$ denote the presence and absence of the $i$-th tag respectively, and $\mathrm{MLC}_i$ denotes the $i$-th output of the MLC network. For simplicity, we extract visual features from the last convolutional layer of VGG-19 [17] and adopt the last two fully connected layers of VGG-19 as the MLC network.
Finally, the embeddings of the top $M$ tags, $\{a_m\}_{m=1}^{M} \in \mathbb{R}^{E}$, are used as semantic features for topic generation.
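As a minimal sketch of this step, assume the MLC network has already produced one raw score per tag and that the distribution over tags is a softmax of those scores. The tag names and score values below are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw MLC outputs into a distribution over tags."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_m_tags(scores, m):
    """Return the indices of the m most probable tags and the full distribution."""
    probs = softmax(scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:m], probs


# Hypothetical tag vocabulary and MLC scores (illustrative values only).
tags = ["normal", "cardiomegaly", "opacity", "effusion"]
mlc_scores = [0.2, 2.1, 1.3, -0.5]
top_idx, probs = top_m_tags(mlc_scores, m=2)
top_tags = [tags[i] for i in top_idx]    # their embeddings would be looked up next
```

The embeddings of the selected tags would then be retrieved from the tag embedding matrix to serve as the semantic features.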

Co-Attention
Previous works have shown that visual attention alone can perform fairly well for localizing objects [1] and aiding caption generation [20]. However, visual attention does not provide sufficient high-level semantic information. For example, when looking only at the lower right region of the chest x-ray image (Figure 1) without considering other areas, we might not be able to recognize what we are looking at, let alone detect the abnormalities. This is problematic for topic generation. In contrast, the tags can always provide the needed high-level information about the whole image or parts of it. To this end, we propose a co-attention mechanism which can simultaneously attend to visual and semantic modalities.
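At time step $s$ of the sentence LSTM, the joint context vector $ctx^{(s)} \in \mathbb{R}^{C}$ is generated by a co-attention network $f_{coatt}\left(\{v_n\}_{n=1}^{N}, \{a_m\}_{m=1}^{M}, h_{sent}^{(s-1)}\right)$, where $h_{sent}^{(s-1)} \in \mathbb{R}^{H}$ is the hidden state of the sentence LSTM at time step $s-1$. The co-attention network $f_{coatt}$ first adopts a single-layer feedforward network to compute the soft visual and semantic attentions over the input image features and tags:
$$\alpha_{v,n} \propto \exp\left(W_{v_{att}} \tanh\left(W_v v_n + W_{v,h} h_{sent}^{(s-1)}\right)\right)$$
$$\alpha_{a,m} \propto \exp\left(W_{a_{att}} \tanh\left(W_a a_m + W_{a,h} h_{sent}^{(s-1)}\right)\right)$$
where $W_v$, $W_{v,h}$ and $W_{v_{att}}$ are parameter matrices of the visual attention network, and $W_a$, $W_{a,h}$ and $W_{a_{att}}$ are parameter matrices of the semantic attention network.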
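The visual and semantic context vectors are thus the attention-weighted sums of the image features and the tag embeddings:
$$v_{att}^{(s)} = \sum_{n=1}^{N} \alpha_{v,n} v_n, \qquad a_{att}^{(s)} = \sum_{m=1}^{M} \alpha_{a,m} a_m$$
where $\alpha_{v,n}$ and $\alpha_{a,m}$ are the soft visual and semantic attention weights. There are many ways to combine the visual and semantic context vectors, such as concatenation and element-wise operations. In this paper, we first concatenate the visual context vector $v_{att}^{(s)}$ and the semantic context vector $a_{att}^{(s)}$, and then map the concatenation to the joint context vector $ctx^{(s)}$.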
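The following sketch illustrates the co-attention computation in plain Python. It assumes a tanh nonlinearity inside the single-layer feedforward scorer, uses a single weight vector `w_att` in place of the attention parameter matrix, and uses tiny made-up dimensions and values. The final mapping from the concatenation to the joint context vector is left out because the text does not specify it.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(items, h_prev, W_item, W_h, w_att):
    """Score each item with w_att . tanh(W_item x + W_h h_prev), then average."""
    projected_h = matvec(W_h, h_prev)
    scores = []
    for x in items:
        hidden = [math.tanh(a + b) for a, b in zip(matvec(W_item, x), projected_h)]
        scores.append(sum(w * z for w, z in zip(w_att, hidden)))
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the items.
    context = [sum(w * x[d] for w, x in zip(weights, items))
               for d in range(len(items[0]))]
    return weights, context


def co_attention(visual, tag_embeddings, h_prev, visual_params, semantic_params):
    """Attend to both modalities and concatenate the two context vectors."""
    alpha_v, v_att = attend(visual, h_prev, *visual_params)
    alpha_a, a_att = attend(tag_embeddings, h_prev, *semantic_params)
    return alpha_v, alpha_a, v_att + a_att   # list concatenation


# Toy inputs (illustrative values only): N = 3 regions, M = 2 tags, all of size 2.
identity = [[1.0, 0.0], [0.0, 1.0]]
params = (identity, identity, [1.0, 1.0])
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tag_embeddings = [[0.5, -0.5], [1.0, 2.0]]
h_prev = [0.1, -0.2]
alpha_v, alpha_a, joint = co_attention(visual, tag_embeddings, h_prev, params, params)
```

Both attention distributions are conditioned on the same previous hidden state, which is what lets the visual and semantic streams focus on the same topic at a given sentence step.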

Sentence LSTM
The sentence LSTM is a single-layer LSTM that takes the joint context vector $ctx \in \mathbb{R}^{C}$ as its input. It generates the topic vector $t \in \mathbb{R}^{K}$ for the word LSTM through a topic generator, and decides whether to continue or stop generating captions through a stop control component.

Topic Generator
We use a deep output layer [15] to strengthen the context information in the topic vector $t^{(s)}$. The deep output layer takes as input the hidden state $h_{sent}^{(s)}$ and the joint context vector $ctx^{(s)}$ of the current step:
$$t^{(s)} = \tanh\left(W_{t,h_{sent}} h_{sent}^{(s)} + W_{t,ctx}\, ctx^{(s)}\right)$$
where $W_{t,h_{sent}}$ and $W_{t,ctx}$ are the parameter matrices.
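A minimal sketch of the topic generator follows, assuming the deep output layer sums the two linear projections and applies a tanh. The matrices and vectors are toy values chosen only so that the result is easy to check by hand.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def topic_vector(h_sent, ctx, W_t_h, W_t_ctx):
    """Combine the sentence-LSTM state and the joint context into a topic vector."""
    from_hidden = matvec(W_t_h, h_sent)
    from_context = matvec(W_t_ctx, ctx)
    return [math.tanh(a + b) for a, b in zip(from_hidden, from_context)]


# Toy example: H = 2, C = 3, K = 2 (illustrative values only).
W_t_h = [[1.0, 0.0], [0.0, 1.0]]
W_t_ctx = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]
t = topic_vector([0.2, -0.4], [0.6, 0.8, 1.0], W_t_h, W_t_ctx)
```

Here the pre-activations are 0.5 and 0.0, so the topic vector is approximately (tanh 0.5, 0).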

Stop Control
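We also apply a deep output layer to control the continuation of the sentence LSTM. The layer takes as input the previous and current hidden states $h_{sent}^{(s-1)}$ and $h_{sent}^{(s)}$ and produces a distribution over {STOP, CONTINUE}:
$$p\left(\mathrm{STOP} \mid h_{sent}^{(s-1)}, h_{sent}^{(s)}\right) \propto \exp\left(W_{Stop} \tanh\left(W_{Stop,s-1} h_{sent}^{(s-1)} + W_{Stop,s} h_{sent}^{(s)}\right)\right)$$
where $W_{Stop}$, $W_{Stop,s-1}$ and $W_{Stop,s}$ are parameter matrices. If $p$ is greater than a predefined threshold (e.g. 0.5), then the sentence LSTM stops producing new topic vectors and the word LSTM also stops producing words.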
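The stop decision can be sketched as a two-way softmax over CONTINUE and STOP. The tanh nonlinearity, the ordering of the two output rows and all numeric values below are assumptions made for illustration.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def stop_probability(h_prev, h_cur, W_prev, W_cur, W_stop):
    """Probability of STOP given the previous and current sentence-LSTM states.

    W_stop has two rows; row 0 is assumed to score CONTINUE and row 1 STOP.
    """
    hidden = [math.tanh(a + b)
              for a, b in zip(matvec(W_prev, h_prev), matvec(W_cur, h_cur))]
    continue_logit, stop_logit = matvec(W_stop, hidden)
    m = max(continue_logit, stop_logit)
    e_continue = math.exp(continue_logit - m)
    e_stop = math.exp(stop_logit - m)
    return e_stop / (e_continue + e_stop)


def should_stop(p_stop, threshold=0.5):
    """Stop only when the probability is strictly greater than the threshold."""
    return p_stop > threshold


identity = [[1.0, 0.0], [0.0, 1.0]]
W_stop = [[-1.0, -1.0], [1.0, 1.0]]
p_undecided = stop_probability([0.0, 0.0], [0.0, 0.0], identity, identity, W_stop)
p_confident = stop_probability([1.0, 0.0], [0.0, 1.0], identity, identity, W_stop)
```

With all-zero hidden states both logits are zero, so the probability is exactly 0.5 and generation continues, since the rule requires a value greater than the threshold.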

Word LSTM
The words of each sentence are generated by a word LSTM, which is a single layer LSTM initialized by topic vector t produced by sentence LSTM. Following [11], when generating words, topic vector t and the special START token are used as the first and the second input of the word LSTM, and the subsequent inputs are the word sequence.
The hidden state $h_{word} \in \mathbb{R}^{H}$ of the word LSTM is directly used to predict the distribution over words:
$$p\left(word \mid h_{word}\right) \propto \exp\left(W_{out} h_{word}\right)$$
where $W_{out}$ is the parameter matrix. After the word LSTM has generated the word sequence for each topic, the final report is the simple concatenation of all the generated sequences.
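A small sketch of the word prediction and of the final concatenation is shown below. The softmax output layer follows the description above, while greedy selection of the most probable word, the three-word vocabulary, the sentence separator and the weights are assumptions for illustration.

```python
import math


def word_distribution(h_word, W_out):
    """Softmax over the vocabulary computed from the word-LSTM hidden state."""
    logits = [sum(w * h for w, h in zip(row, h_word)) for row in W_out]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def assemble_report(sentences):
    """Concatenate the generated sentences (lists of words) into one report."""
    return " ".join(" ".join(words) + " ." for words in sentences)


# Toy vocabulary and output weights (illustrative values only).
vocab = ["<end>", "heart", "normal"]
W_out = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
probs = word_distribution([1.0, 0.5], W_out)
next_word = vocab[probs.index(max(probs))]      # greedy choice (an assumption)
report = assemble_report([["heart", "size", "is", "normal"], ["no", "effusion"]])
```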

Parameter learning
Each training example is a tuple $(I, l, w)$ where $I$ is an image, $l$ denotes the ground-truth tag vector and $w$ is the diagnostic paragraph, which is comprised of $S$ sentences, each of which consists of $T_s$ words.
Given a training example $(I, l, w)$, our model first performs multi-label classification for $I$ and produces a distribution $p_{l,pred}$ over all tags. Note that $l$ is a binary vector which encodes the presence and absence of tags. We can obtain the ground-truth tag distribution by normalizing $l$: $p_l = l / \|l\|_1$. Therefore, the training loss of this step is the cross-entropy loss $\ell_{tag}$ between $p_l$ and $p_{l,pred}$.
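The tag loss can be illustrated with a few lines of Python. The label vector and the predicted distribution are made-up values, and the sketch assumes at least one tag is present so that the normalization is well defined.

```python
import math


def tag_loss(label, predicted):
    """Cross-entropy between the L1-normalized label vector and the prediction."""
    total = sum(label)                       # L1 norm of a binary vector
    target = [x / total for x in label]      # ground-truth tag distribution
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


label = [1, 0, 1, 0]                         # tags 0 and 2 are present
predicted = [0.25, 0.25, 0.25, 0.25]         # uniform prediction over 4 tags
loss = tag_loss(label, predicted)            # equals log(4) for this input
```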
Next, the sentence LSTM is unrolled for $S$ steps to produce topic vectors and also distributions over {STOP, CONTINUE}: $p_{stop}^{s}$. Finally, the $S$ topic vectors are fed into the word LSTM to generate words $w_{s,t}$. The training loss of caption generation is the combination of two cross-entropy losses: $\ell_{sent}$ over the stop distributions $p_{stop}^{s}$ and $\ell_{word}$ over the word distributions $p_{s,t}$.
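Thus, the training loss is computed as:
$$\ell(I, l, w) = \lambda_{tag}\, \ell_{tag} + \lambda_{sent} \sum_{s=1}^{S} \ell_{sent}\left(p_{stop}^{s}, \mathbb{1}\{s = S\}\right) + \lambda_{word} \sum_{s=1}^{S} \sum_{t=1}^{T_s} \ell_{word}\left(p_{s,t}, w_{s,t}\right)$$
In addition to the above training loss, there is also a regularization loss for the visual and semantic attentions. Following [20], suppose $\alpha \in \mathbb{R}^{N \times S}$ and $\beta \in \mathbb{R}^{M \times S}$ are the matrices of visual and semantic attentions respectively; then the regularization loss over $\alpha$ and $\beta$ is:
$$\ell_{reg} = \lambda_{reg} \left[ \sum_{n=1}^{N} \left(1 - \sum_{s=1}^{S} \alpha_{n,s}\right)^{2} + \sum_{m=1}^{M} \left(1 - \sum_{s=1}^{S} \beta_{m,s}\right)^{2} \right]$$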
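As a hedged sketch of how these terms fit together, the code below assumes that the total loss is a weighted sum of the tag loss, the per-sentence stop losses and the per-word losses, and that the attention regularizer takes the doubly stochastic form of [20], penalizing each region or tag whose attention summed over sentence steps differs from 1. The weights and loss values are illustrative, not the paper's settings.

```python
def total_loss(l_tag, sent_losses, word_losses,
               lam_tag=1.0, lam_sent=1.0, lam_word=1.0):
    """Weighted sum of the tag, stop and word losses.

    sent_losses: one stop loss per sentence.
    word_losses: one list of per-word losses per sentence.
    """
    l_sent = sum(sent_losses)
    l_word = sum(sum(per_sentence) for per_sentence in word_losses)
    return lam_tag * l_tag + lam_sent * l_sent + lam_word * l_word


def attention_regularizer(alpha, beta, lam_reg=1.0):
    """Doubly stochastic penalty; alpha is N x S and beta is M x S."""
    def penalty(attention):
        # Each row holds the attention one region (or tag) receives over the S steps.
        return sum((1.0 - sum(row)) ** 2 for row in attention)
    return lam_reg * (penalty(alpha) + penalty(beta))


# Toy values (illustrative only): S = 2 sentences, N = 2 regions, M = 1 tag.
loss = total_loss(1.0, [0.5, 0.5], [[1.0, 2.0], [3.0]])
reg = attention_regularizer([[0.5, 0.5], [0.25, 0.25]], [[1.0, 0.5]])
```

In the toy input, the first region receives total attention 1 and is not penalized, while the second region and the single tag each contribute 0.25.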

Datasets
We use two publicly available medical image datasets to evaluate our proposed model.

IU X-Ray
The Indiana University Chest X-Ray Collection (IU X-Ray) [3] is a set of chest x-ray images paired with their corresponding diagnostic reports. The dataset contains 7,470 pairs of images and reports. Each report consists of the following sections: impression, findings, tags, comparison and indication. In this paper, we treat the contents of impression and findings as the target captions to be generated and the MTI-generated tags in the tags section as the tags of the reports (Figure 1 provides an example).
We preprocess the data by converting all tokens to lowercase and removing all non-alphabetic tokens, which results in 572 unique tags and 1,915 unique words remaining in the entire dataset. On average, each image is associated with 2.2 tags and 5.7 sentences, and each sentence contains 6.5 words. In addition, we find that the top 1,000 words cover 99.0% of the word occurrences in the dataset, so we use only the top 1,000 words to build the dictionary. Finally, we randomly select 500 images for validation and 500 images for testing.
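The preprocessing steps can be sketched as follows. The regular-expression tokenizer and the example sentences are assumptions, since the text only states that tokens are lowercased and non-alphabetic tokens are removed; the vocabulary size would be 1,000 in the setting described above.

```python
import re
from collections import Counter


def preprocess(text):
    """Lowercase, tokenize, and keep only purely alphabetic tokens."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return [tok for tok in tokens if tok.isalpha()]


def build_vocab(token_lists, size):
    """Keep the `size` most frequent words and report the share of occurrences covered."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    vocab = [word for word, _ in counts.most_common(size)]
    covered = sum(counts[word] for word in vocab)
    return vocab, covered / sum(counts.values())


# Made-up report snippets for illustration.
reports = ["The heart is normal.", "No pleural effusion. The lungs are clear 2 views."]
token_lists = [preprocess(r) for r in reports]
vocab, coverage = build_vocab(token_lists, size=1)
```

In this toy corpus "the" is the only word that appears twice, so a one-word vocabulary covers 2 of the 12 remaining tokens.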

PEIR Gross
The Pathology Education Informational Resource (PEIR) digital library is a public medical image library for medical education. We collect the images with their descriptions in the Gross sub-collection of the PEIR Pathology collection. The PEIR Gross dataset contains 7,442 image-caption pairs from 21 different sub-categories. Unlike the IU X-Ray dataset, each caption in the PEIR Gross dataset is a one-sentence description of its corresponding image. We use this dataset to evaluate our model's ability to generate single-sentence reports.
For the PEIR Gross dataset, we apply the same preprocessing as for IU X-Ray, which yields 4,452 unique words for the dataset and an average of 12.0 words per image. In addition, similar to [21], we treat the 1,000 most frequent words as tags.

Implementation details
We set the dimensions of all hidden states and embeddings to 512. For words and tags, we use different embedding matrices since a tag might contain multiple words. We use the full VGG-19 model [17] for tag prediction. We use the embeddings of the 10 tags with the highest probabilities as the semantic feature vectors $\{a_m\}_{m=1}^{M}$, with $M = 10$. We extract the visual features from the last convolutional layer of the same VGG-19 network, which yields a 14×14×512 feature map.

Baselines
We first compare our method with several state-of-the-art image captioning models: CNN-RNN [19], LRCN [5], Soft ATT [20] and ATT-RK [21]. We re-implemented all of these models and adopted VGG-19 [17] as the CNN encoder. However, these models are built for single-sentence captions. To better show the effectiveness of the hierarchical LSTM for paragraph generation, we also implement a hierarchical model without any attention: Ours-No-Attention. The input of Ours-No-Attention is the overall image feature of the VGG-19 network, which has a dimension of 4096. Ours-No-Attention can be viewed as CNN-RNN [19] equipped with a hierarchical LSTM decoder. In addition, to show the effectiveness of the proposed co-attention mechanism, we implement two ablated versions of our model, Ours-Semantic-Only and Ours-Visual-Only, which take only the semantic attention or only the visual attention context vector to produce topic vectors.

Quantitative results
We report the paragraph generation (upper part of Table 1) and one-sentence generation (lower part of Table 1) results using the following captioning evaluation metrics: BLEU [14], METEOR [4], ROUGE [13] and CIDEr [18].
For paragraph generation, as shown in the upper part of Table 1, models with a single LSTM decoder perform much worse than those with a hierarchical LSTM decoder. Note that the only difference between Ours-No-Attention and CNN-RNN [19] in Table 1 is that Ours-No-Attention adopts a hierarchical LSTM decoder while CNN-RNN [19] adopts only a single-layer LSTM. The comparison between these two models directly demonstrates the effectiveness of the hierarchical LSTM. This result is not surprising, since it is well known that a single-layer LSTM cannot effectively model long sequences. Additionally, employing semantic attention alone (Ours-Semantic-Only) or visual attention alone (Ours-Visual-Only) to generate topic vectors does not seem to help caption generation much. A potential reason is that visual attention can only capture the visual information of sub-regions of the image and is unable to describe it correctly, while semantic attention only knows the potential abnormalities and cannot confirm its findings by looking at the image. Finally, our full model (Ours-CoAttention) achieves the best results on all of the evaluation metrics, which demonstrates the effectiveness of the proposed co-attention mechanism.
For single-sentence generation (shown in the lower part of Table 1), the ablated versions of our model (Ours-Semantic-Only and Ours-Visual-Only) achieve competitive scores compared with the state-of-the-art models. Our full model (Ours-CoAttention) outperforms all of the baseline models, which indicates the effectiveness of the proposed co-attention mechanism.
Table 1. Main results for paragraph generation on the IU X-Ray dataset (upper part) and single-sentence generation on the PEIR Gross dataset (lower part). BLEU-n denotes the BLEU score that uses up to n-grams.

Paragraph Generation
An illustration of paragraph generation by three models (Ours-CoAttention, Ours-No-Attention and Soft Attention) is shown in Figure 3. Note that the underlined sentences are the descriptions of abnormalities. The first property we can observe is that the generated paragraphs are comprised of more sentences than the ground truth. Second, most of the generated sentences and ground-truth sentences are descriptions of normal areas, while only a few sentences are about abnormalities. This observation partly explains why Ours-No-Attention can achieve very high scores: it can generate only descriptions of normal findings and still obtain a high score on n-gram-based evaluation metrics.
As we dive into the content of the generated captions, it is surprising that different sentences have different topics. The first sentence is usually a high-level description of the image, while each of the following sentences is associated with one area of the image (e.g. "lung", "heart", etc.). It is also worth noting that the Soft Attention and Ours-No-Attention models detect only a few abnormalities in the images, and the abnormalities they detect are incorrect. In contrast, the Ours-CoAttention model is able to correctly describe many real abnormalities in the images (top three images). The comparison demonstrates that the hierarchical LSTM is better for paragraph generation and that a model with co-attention is better at generating topics than a model without attention.
For the third image, the Ours-CoAttention model successfully detects the area ("right lower lobe") which is abnormal ("eventration"); however, it fails to precisely describe this abnormality. In addition, the model also reports abnormalities about "interstitial opacities" and "atheroscalerotic calcification", which are not considered real abnormalities by human experts. A potential reason for this mis-description is that this x-ray image is darker than the images above, and our model might be very sensitive to this change. The image at the bottom is a failure case of Ours-CoAttention. Even though the model makes the wrong judgment about the major abnormalities in the image, it does find some unusual regions: "lateral lucency" and "left lower lobe". Additionally, it is also surprising to find that the model tries to reason about the findings by using "this may indicate".
To better understand the models' ability to detect real or potential abnormalities, we present the proportions of sentences which describe normalities and abnormalities in Table 2. Note that we consider sentences which contain "no", "normal", "clear" or "stable" to be sentences that describe normalities. It is clear that Ours-CoAttention best approximates the ground-truth distribution over normality and abnormality.
... of the image. For example, the third sentence of the first example is about "cardio", and the visual attention concentrates on regions near the heart. Similar behavior can also be found for semantic attention: for the last sentence in the first example, our model correctly concentrates on "degenerative change", which is the topic of the sentence. It is also surprising that the content of the first sentence in the second example contradicts the concentration of the semantic attention. This is unlikely to happen with a single attention mechanism. The contradiction implies that the co-attention mechanism has a certain fault tolerance, and thus co-attention might be more robust than single attention. Finally, the first sentence of the last example presents a mis-description caused by incorrect semantic attention over tags. We believe incorrect attentions can be reduced by building a better tag prediction module.

Tag prediction
A diagnostic report contains not only text descriptions, but also tags which highlight key concepts in the report. We treat tag prediction as a multi-label classification task. To evaluate the prediction performance of multi-task learning, we compare our model and vanilla VGG-19 [17] on recall@5, recall@10 and recall@20. Note that the loss function of both models is the softmax cross-entropy loss [7].
Table 3. Tag prediction on the IU X-Ray dataset (upper part) and the PEIR Gross dataset (lower part). R denotes recall.
Table 3 shows that Ours-CoAttention and the VGG-19 [17] network perform very similarly for tag prediction. Even though no improvement has been made by multi-task learning, our model is an end-to-end model which avoids managing a complex sequential pipeline. Figure 4 also provides some qualitative results of tag prediction. The results show that, in addition to tags which are relevant to the images, the model also produces many irrelevant tags. Even though the co-attention mechanism can filter out many noisy tags, the irrelevant tags can still mislead the model into producing many false alarms. We believe a better tag prediction module will help the model attend to the correct tags and thus improve the quality of the generated captions.
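For reference, recall@k for a single image can be computed as in the sketch below. The per-image definition (the fraction of ground-truth tags found among the k highest-scoring predictions) and the example scores are assumptions; the text does not state how the metric is aggregated over the dataset.

```python
def recall_at_k(scores, true_tags, k):
    """Fraction of ground-truth tag indices found among the k highest-scoring tags."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = len(set(ranked[:k]) & set(true_tags))
    return hits / len(true_tags)


# Made-up scores for five tags; tags 1 and 2 are the ground truth.
scores = [0.1, 0.9, 0.3, 0.8, 0.05]
r_at_2 = recall_at_k(scores, {1, 2}, k=2)   # top-2 is {1, 3}: one of two tags found
r_at_3 = recall_at_k(scores, {1, 2}, k=3)   # top-3 is {1, 3, 2}: both tags found
```

A dataset-level score could then be obtained by averaging this value over all test images, which is one common convention.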

Conclusion
In this paper, we study how to automatically generate textual reports for medical images, with the goal of helping medical professionals produce reports more accurately and efficiently. Our proposed methods address three major challenges: (1) how to generate multiple heterogeneous forms of information, (2) how to localize abnormal regions and produce accurate descriptions for them, and (3) how to generate long texts that contain multiple sentences or even paragraphs. To cope with these challenges, we propose a multi-task learning framework which jointly predicts tags and generates descriptions. We introduce a co-attention mechanism that can simultaneously explore visual and semantic information to accurately localize and describe abnormal regions. We develop a hierarchical LSTM network that can more effectively capture long-range semantics and produce high-quality long texts. On two medical datasets containing radiology and pathology images, we demonstrate the effectiveness of the proposed methods through quantitative and qualitative studies.