Coherent and Concise Radiology Report Generation via Context Specific Image Representations and Orthogonal Sentence States

Neural models for text generation are often designed in an end-to-end fashion, typically with zero control over intermediate computations, limiting their practical usability in downstream applications. In this work, we incorporate explicit means into neural models to ensure topical continuity, informativeness and content diversity of generated radiology reports. For the purpose we propose a method to compute image representations specific to each sentential context and eliminate redundant content by exploiting diverse sentence states. We conduct experiments to generate radiology reports from medical images of chest x-rays using MIMIC-CXR. Our model outperforms baselines by up to 18% and 29% respective in the evaluation for informativeness and content ordering respectively, relative on objective metrics and 16% on human evaluation.


Introduction
Presenting information in text format has been critical to the development of human civilizations. Thus text generation is an important field in artificial intelligence and natural language processing, where the input to such natural language generation models could take on the form of text, graphs, images or database records (Koncel-Kedziorski et al., 2019;See et al., 2017;Kinghorn et al., 2018).
Recent advancements in natural language generation has been propelled by end-to-end neural models (e.g. (Chopra et al., 2016)), which has strong capabilities to learn associations within large-scale datasets. However, since it is challenging to exert control over the neural generation process and the corresponding output, the usability of such models in practical scenarios are limited, as the generated content could be erroneous, incoherent, or even socially inappropriate (Liu et al., 2020;Wiseman et al., 2017). It is therefore ideal to include explicit provisions in neural text generation to better model characteristics such as informativeness and topical continuity. It has also been shown that informativeness and textual cohesion are important properties in clinical texts to make them more easily comprehensible (Smith et al., 2011;Liu and Rawl, 2012).
Image to text generation is a natural language generation task that has been popular in communities beyond NLP (e.g. computer vision, machine learning). A general approach is to construct the representation of the entire input image and decode the output text conditioned on the image representation (You et al., 2016). Such approaches work well for scenarios where only a short generated sentence is needed in the output (e.g. image captioning), as typically what is needed is to identify individual objects and fill in the most probable words to describe the overall situation. However, such approaches might not generalize to scenarios where complex semantics embodied in the input images need further inferencing or where the generated outputs need to articulate detailed or specific information, logical reasoning, or recommendationsall of these cases typically require at least multiple sentences (to form a report) (Jing et al., 2017). Medical reports are a classic example of such a scenario where each sentence in a report describes very precise clinical observations or inferences.
We present a neural approach for producing radiology reports from images in a sentence-bysentence order to pinpoint more targeted and precise medical information from the input images and at the same time minimize hallucination from neural text generation. The modeling components ensures the generated report is informative, coherent, and concise via gated mechanisms to model topical continuity, orthogonality criteria in sentence state selection to reduce redundancy, and a neural architecture that is pretrained to predict domain entities during each context of sentence generation in order to encourage induction bias.

Natural Language Generation
Quests for more efficient methods arising from machine translation using dense sentence representations resulted in the development of neural text-totext generation models Srivastava et al., 2014;Wiseman et al., 2018). Subsequently, neural approaches for text-totext generation for summarization tasks also started to gain traction (Cheng and Lapata, 2016;Nallapati et al., 2017;See et al., 2017;Paulus et al., 2017). A major interest in the medical NLP community focuses on information extraction (see Wang et al. (2018) for a review). There has been work in areas such as automatic ICD code assignment Scheurwegs et al., 2017;Mullenbach et al., 2018), risk prediction (Ma et al., 2018), and dialogue comprehension (Liu et al., 2019), and text generation (Buchanan et al., 1995;Moradi and Ghadiri, 2018;Pauws et al., 2019).

Image to Text
There has been much work in image to text generation, which typically constructs a representation of the input image using CNN and generates the output text using RNN (Fang et al., 2015;Krause et al., 2017;Vinyals et al., 2015). Such work has been improved further by incorporating the attention mechanism on input representations (Xu et al., 2015;You et al., 2016). Xu et al. (2015) used visual spatial attention for improving text generation while You et al. (2016) introduced semantic attention on concepts. All of the aforementioned work demonstrated effectiveness on single sentence generation such as captions. Image to text generation becomes more challenging when considering multi-sentence outputs. Some recent work generated multi-sentence outputs using hierarchical decoding (Krause et al., 2017;Liang et al., 2017). Jing et al. (2017) adapted this approach for radiology report generation by incorporating co-attention. Yuan et al. (2019) further improved the design by incorporating concept prediction and leveraging the predicted concept for guiding generation. In our work, the network is pre-trained to predict context entities so that each sentence generation is implicitly guided by domain entities. In addition, our system explicitly models informativeness and topical continuity to improve coherence while reducing redundancy to increase factual correctness and readability.

Method
In this section we delineate the following: (1) The proposed neural architecture and the corresponding network computations; (2) How we pre-train the network to predict the context entities from each sentence representation using a multi-label classifier; (3) How we further train the neural architecture to decode the corresponding sentences from each sentence representation to form the report.

Neural Architecture
Each input to our network is a set of images S I with different views of the chest from the same patient and an indication text Q, which is a short sentence or phrase describing the purpose of the radiology investigation (e.g. intense coughing) 1 . Figure 1 depicts the architecture of our neural model, consisting of components for image encoding, indication text encoding, image feature selection for informativeness, sentential content creation for topical continuity and redundancy reduction and for decoding individual sentences in the report. Before the network computations commence, content creation RNN is initialized with a zero vector hidden state. We elaborate each component in the following subsections.

Image Encoding and Sentential Content Creation
Our network is designed to generate the radiology report in a sentence-by-sentence manner from the input set of images, guided by the indication text. The sentence-by-sentence design allows the report generation to focus on specific and important details in the medical image and reduces possible pitfalls of hallunciation in neural text generation. The Image encoder is a ResNet152 network with pretrained weighs (He et al., 2016). Using the encoder, each of the image matrix i in the input image set is converted to I i R n , as depicted on the left hand side of Figure 1. The network updates the image representations during each context of sentence generation. The network employs gates for informative content selection and topical continuity weighted by a control gate.

Informative Content Selection:
The content selection gate is represented by the trapezium on the top of Figure 1

Repeated During the Generation of each Sentence
where W gc is the parameter matrix, H t−1 is the previous hidden state of content creation RNN and H Q is the indication text encoded using a transformer network. The presence of H t−1 ensures that features are selected in the context of previously generated sentences.

Content Selection for Topical Continuity:
The gate gcont selects the content for topical continuity at time-step t from the image representations computed for the previous time-step t − 1. In Figure 3.1, the continuity gate is represented by the trapezium at the bottom. The Gate gcont selects the content for topical continuity as follows: (1) W gcont is a parameter matrix and I i,t−1 is the representation of the i t h image in the input set computed at time-step t − 1.

Control Gate:
The control gate is represented by the first vertical rectangle in Figure 1. Control gate weighs and creates the representation of the i t h image for time step t as follows.
where W cont is a parameter matrix.

Sentence Content Creation
The content creation RNN is represented by the vertical rectangle, encompassing smaller rectangles corresponding to different states as depicted in the middle of Figure 3.1. Content Creation RNN computes the content for the sentence to be decoded at time step t by taking final representations for the images in the input set into account. The input I t at the current time step t of content creation RNN is computed as follows: where m is the number of images in the input set. The hidden state H t for content creation RNN at time step t is computed as below.

Reducing Redundancy via Orthogonal Sentence States
Avoiding redundant content generation is a problem to be explicitly addressed by text generation systems (Nema et al., 2017). Hidden states of content creation RNN represents the content corresponding to each sentence in the final report. Enforcing diversity among these hidden representations can reduce the redundant content in the resultant report. We ensure that each hidden state of content creation RNN used to initialize decoder to be orthogonal to the mean of previous hidden states. In the purview of this orthogonality H t of content creation RNN is updated as follows.
where H M t−1 is the mean of previous hidden states.

Pre-Training via Entity Prediction
For the purpose of pre-training we predict context entities from the constructed content H t using a multi-label classifier: N N is a two layered fully connected neural network where the individual layer computations are a linear transformation followed by a ReLU activation. Ent k t represents the set of top k ranking context entities, which are intended to contain the entities to be mentioned in the sentence to be generated at time-step t of content creation RNN. The pre-training is done using binary cross entropy loss.

Training the Sentence Decoder
We use a decoder with beam search decoding to generate sentences. The sentence decoder RNN is initialized with H t , which represents the content to be materialized at time step t. At each time step t of decoder RNN, a word in the sentence under construction is generated as follows: where h t is the hidden state of decoder RNN at time-step t . Negative log likelihood is used for training the network to generate sentences.

Data Setup
A subset of 19,800 entries were selected from the MIMIC-CXR Database 2 for generating radiology reports from medical images of chest X-rays (Johnson et al., 2019), where each entry is represented by a triplet (S I , Q, SEQ F ). S I is a set of m input radiology images where there are one or more images corresponding to different views of a patient's chest, Q is a short text span specifying the purpose of the radiology investigation, and SEQ F represents the sentences written by a radiologist in the context of S I and Q. SEQ F is a sequence of sentences f 1 , .., f n , each representing an individual finding.
We reformulate the dataset so that each entry record is (S I , Q, SEQ F , SEQ E ). SEQ E represents a sequence of entity sets ent 1 , ..., ent n , where ent i represents the set of entities mentioned in sentence f i . We extracted entities from individual sentences 3 and identified a frequently occurring set of 1,060 entity clusters 4 suitable for learning to predict context entities and subsequent sentence generation. Sentences that do not consist a single mention of any of these entities were removed because they were evaluated to be subject to information not included in the corresponding images. Our dataset consists of 18,000, 900 and 900 training, test and development records respectively.

Experimental Setup
• Img + RNN : The entire radiology report is decoded as a single sequence from the mean of image representations (Fang et al., 2015).
• Img + Attn : The decoder RNN attends over the input image representation to generate a single sequence that constitutes the report (You et al., 2016).  • Img+ Pred + Co-Attn : A multi-image variant of the co-attention based method (Jing et al., 2017), in which sentence context vectors by co-attending over input images and entities.
• Img + Ent+Attn : This setting is a variant of (Yuan et al., 2019), where the decoder attends over a predicted set of entities to generate sentences.
• Our Method: We experiment with different settings of our approach depicted in Figure  3.1 with different combinations of Informative Content selection (IC), Topical Continuity (TC), Orthogonal Sentence States (O) and Pre-Training (PT).
Encoded image size is 900 after linear transformation of ResNet output, H t R 900 and h t R 900 . Other parameters are adjusted accordingly. For all settings, a beam size of 9 is set for the decoder.
For all the settings and for each of the test record we generate five sentences as the average number of sentences in development set reports is approximately five. The set of parameters which gave maximum recall for entity prediction in the development set during pre-training is used initialize the network during training.

Text Generation and Content Ordering
We evaluated the quality of text generation using BLEU and ROUGE metrics as shown in Table 1. The setting Img + IC did not perform well with respect to other counterparts. This suggests that just informative content selection gate and hidden state of content ordering RNN alone is insufficient for defining the context of a sentence. However Img + IC + T C achieves an incremental accuracy by employing the efficient gated mechanism for sentence content creation. Img+IC +T C +O performs consistently well on all metrics, especially using BLEU-4, implying the approach of eliminating redundant content in long text generation via enforcing topic diversity with orthogonal sentence states is effective. The setting with pre-training on entity prediction (Img + IC + T C + O + P T ) achieved a slight incremental improvement in accuracy. We observe that for a large set of 1,060 domain entities, our training data is not dense enough for a significant improvement through pre-training. However the incremental improvement is encouraging. Coherent reading results from accurate content ordering. For evaluating content ordering we relied on the method used by Kurisinkel and Chen (2019). They utilize the bigrams constituted by words in preceding and succeeding sentences irrespective of their positions within text in order to measure ImgEnc + Ent + Attn 1) The lungs are clear without airspace consolidation.
2) Lungs are hyperinflated with no pleural effusion or pneumothorax is seen.
3) Lungs are hyperinflated with no pleural effusion or pneumothorax is seen.
4) The lungs are clear without focal consolidation.

Img + IC+ TC
1) The lungs are clear without airspace consolidation.
3) Lungs are hyperinflated with no pleural effusion or pneumothorax is seen.

4)
Lungs are hyperinflated with no pleural effusion or pneumothorax is seen.

5)
Degenerative changes of the thoracic spine with calcification of the anterior longitudinal ligament are present.
2) Lungs are hyperinflated with no pleural effusion, pulmonary edema or pneumothorax is seen.

4)
The cardiac and mediastinal silhouettes are stable.

5)
Degenerative changes of the thoracic spine with calcification of the anterior longitudinal ligament are present.
Radiologist Report Written by Physicians 1) PA and lateral views of the chest demonstrate the lungs are well expanded, with no evidence of pleural effusion, pulmonary edema, pneumothorax, or focal airspace consolidation.
3) Previously demonstrated bilateral fat-containing Bochdalek hernias are better assessed on prior CT of the chest.

4)
The heart is mildly enlarged, Otherwise, the cardiomediastinal silhouette is unremarkable.

5)
Multilevel degenerative changes are noted throughout the thoracic spine, with calcification. of the anterior longitudinal ligament

Human Evaluation
We resort to human evaluation for rating the factual accuracy of radiology reports with respect to the reference report in hand. Four human evaluators were asked to rate the reports generated by all settings in Table 2 for a set of 100 test records that were randomly chosen. Reports were presented to the evaluators in a random order to minimize potential bias. The rating of a sentence is the sum of individual ratings of all the sentences in a report. Sentences describing an abnormal condition is weighed more than a sentence explaining a normal condition as they are clinically more relevant. A non-redundant sentence explaining an accurate normal condition is given a rating of 1.5 while that explaining an abnormal condition is given a rating of 3. A factually incorrect or redundant sentence receives a score of 0. The mean and standard deviation for each experimental setting are shown in the rightmost column of Table 1. The Pearson Coefficient is 0.67, suggesting that the agreement among the human evaluators are reasonably consistent (Benesty et al., 2009). The settings with content selection and continuity gates and diverse state computation achieved a clear advantage over the other settings implying it is effective to generate specific content for each sentence while explicitly eliminating redundancy in our proposed approach.

Qualitative Comparisons
Examples of radiology reports generated by different settings for the same set of images are shown in Table 3 to give readers more qualitative context of the generation results. Settings which used the gated mechanism for sentence content creation and orthogonal state computation better emulate human written reports in terms of informativeness and content ordering. There is an adequate number of domain entities in the generated report. which are found to be clinically relevant when compared with the corresponding human written report. There are portions of text in the human written report which are subjective to the situation and are irrelevant in the objective scheme of text generation.

Conclusion
We presented a technical approach on radiology report generation which ensures global text properties such as informativeness, topical continuity for coherence while reducing redundant content. Both objective metrics and human evaluations showed significant performance over competitive baselines.