Optimizing the Factual Correctness of a Summary: A Study of Summarizing Radiology Reports

Neural abstractive summarization models are able to generate summaries which have high overlap with human references. However, existing models are not optimized for factual correctness, a critical metric in real-world applications. In this work, we develop a general framework where we evaluate the factual correctness of a generated summary by fact-checking it automatically against its reference using an information extraction module. We further propose a training strategy which optimizes a neural summarization model with a factual correctness reward via reinforcement learning. We apply the proposed method to the summarization of radiology reports, where factual correctness is a key requirement. On two separate datasets collected from hospitals, we show via both automatic and human evaluation that the proposed approach substantially improves the factual correctness and overall quality of outputs over a competitive neural summarization system, producing radiology summaries that approach the quality of human-authored ones.


Introduction
Neural abstractive summarization systems aim at generating sentences which compress a document while preserving the key facts in it (Nallapati et al., 2016b; See et al., 2017; Chen and Bansal, 2018). These systems have been shown to be useful in many real-world applications. For example, Zhang et al. (2018) have recently shown that customized neural abstractive summarization models are able to generate high-quality radiology summary statements by summarizing textual findings written by radiologists. This task has significant clinical value because its successful application has the potential to accelerate the radiology workflow, reduce repetitive human labor and improve clinical communications (Kahn Jr et al., 2009).
However, while existing abstractive summarization models are optimized to generate summaries that are relevant to the context and highly overlap with human references (Paulus et al., 2018), this does not guarantee factually correct summaries, as shown in Figure 1. Therefore, maintaining factual correctness of the generated summaries remains a critical yet unsolved problem. For example, Zhang et al. (2018) found that about 30% of the outputs from a radiology summarization model contain factual errors or inconsistencies. This has prevented the application of the system, as factual correctness is critically important in this domain to prevent medical errors.
Background: radiographic examination of the chest. clinical history: 80 years of age, male, post-op cv surgery. comparison: procedure...
Findings: frontal radiograph of the chest demonstrates repositioning of the right atrial lead possibly into the ivc. otherwise, there is unchanged life-support hardware. a right apical pneumothorax can be seen from the image. moderate right and small left pleural effusions continue. no pulmonary edema is observed. heart size is upper limits of normal.
Human Summary: pneumothorax is seen. bilateral pleural effusions continue.
Summary B (ROUGE-L = 0.44): pneumothorax is observed on radiograph. bilateral pleural effusions continue to be seen.
Figure 1: An example radiology report and summaries with their ROUGE-L scores. Compared to the human-written summary, Summary A has high textual overlap (i.e., ROUGE-L) but makes a factual error; Summary B has a lower ROUGE-L score but is factually correct.
Existing attempts at improving the factual correctness of abstractive summarization models have achieved very limited success. For example, Cao et al. (2017) proposed to augment the attention mechanism of neural models with factual triples extracted with open information extraction systems; Falke et al. (2019) studied using natural language inference systems to rerank generated summaries based on their factual consistencies; Kryściński et al. (2019b) proposed to verify factual consistency of generated summaries with a weakly-supervised model. Despite these efforts, even state-of-the-art systems trained with ample data still produce summaries with a substantial number of factual errors (Goodrich et al., 2019; Kryściński et al., 2019a).
In this work we aim to improve the factual correctness of existing neural summarization systems, with a focus on summarizing radiology reports. This task has several key properties that make it ideal for studying factual correctness in summarization models. First, the clinical facts or observations present in radiology reports have less ambiguity compared to open-domain text, which allows objective comparison of facts. Second, radiology reports involve a relatively limited space of facts, which makes automatic measurement of factual correctness in the generated text approachable. Lastly, as factual correctness is key to the success of the resulting system in this domain, improving factual correctness will directly lead to an ability to use the system.
To this end, we design a framework where an external information extraction system is used to extract information in the generated summary and produce a factual accuracy score by comparing it against the human reference summary. We further develop a training strategy where we combine a factual correctness objective, a textual overlap objective and a language model objective, and jointly optimize them via self-critical sequence training.
On two datasets of radiology reports collected from real hospitals, we show that our training strategy substantially improves the factual correctness of the summaries generated from a competitive neural summarization system. Interestingly, our experiments also show that even in the absence of a factual correctness objective, optimizing textual overlap substantially improves the factual correctness of the resulting system compared to traditional maximum likelihood training. We further show via human evaluation and analysis that our training strategy leads to summaries with higher overall quality and correctness and which are closer to the human-written ones.
Our main contributions are: (i) we propose a general framework and a training strategy for improving the factual correctness of summarization models via reinforcement learning (RL); (ii) we apply the proposed strategy to radiology reports, and empirically show that it improves the factual correctness of the generated summaries; and (iii) we demonstrate via radiologist evaluation that our system is able to generate summaries with clinical validity and quality close to human-written ones. To our knowledge our work represents the first attempt at directly optimizing a neural summarization system with a factual correctness objective via RL.

Related Work
Neural Summarization Systems. Neural models for text summarization can be broadly divided into extractive approaches (Cheng and Lapata, 2016; Nallapati et al., 2016a), where a system learns to select sentences from the context to form the summary; and abstractive approaches (Nallapati et al., 2016b; See et al., 2017), where a system can generate new words and sentences to form the summary. While these models are often trained in an end-to-end manner by maximizing the likelihood of the reference summaries, RL has been shown to be useful in recent work (Chen and Bansal, 2018; Dong et al., 2018). Specifically, Paulus et al. (2018) found that directly optimizing an abstractive summarization model on the ROUGE metric via RL can improve the summary quality. Our work extends the ROUGE rewards used in existing work with a factual correctness reward to further improve the correctness of the generated summaries.
Factual Correctness in Summarization. Our work is closely related to recent work that studies factual correctness in summarization. Cao et al. (2017) first proposed to improve the faithfulness of neural abstractive summarization models by attending to fact triples extracted from the context using open information extraction systems. Goodrich et al. (2019) compared different information extraction systems to evaluate the factual accuracy of generated text. Falke et al. (2019) studied whether existing natural language inference systems can be used to evaluate the factual correctness of generated summaries, and found models trained on existing datasets to be inadequate for this task. Kryściński et al. (2019b) proposed to evaluate factual consistencies in the generated summaries using a weakly-supervised fact verification model.
Summarization of Radiology Reports. Existing work on summarizing radiology reports has focused on the extraction of information from the reports (Hripcsak et al., 2002; Hassanpour and Langlotz, 2016). Zhang et al. (2018) first studied the problem of automatic generation of radiology impressions by summarizing radiology findings, and showed that an augmented pointer-generator model is able to generate summaries which have high overlap with human references. MacAvaney et al. (2019) extended this model with an ontology-aware pointer-generator and showed improved summarization quality. Jing et al. (2018) and Li et al. (2019) studied the problem of generating descriptions of radiology findings from medical images. While Zhang et al. (2018) found that about 30% of the radiology summaries generated from neural models contain factual errors, methods to improve factual correctness in radiology summarization remain unstudied.

Task & Baseline Pointer-Generator
We start by briefly introducing the task of summarizing radiology findings. Given a passage of radiology findings represented as a sequence of tokens $x = \{x_1, x_2, \ldots, x_N\}$, with $N$ being the length of the findings, the task involves finding a sequence of tokens $y = \{y_1, y_2, \ldots, y_L\}$ that best summarizes the salient and clinically significant findings in $x$. In routine radiology workflow, an output sequence $y$ is produced by the radiologist, which we treat as a reference summary sequence. To model the summarization process, we use the background-augmented pointer-generator network (Zhang et al., 2018) as the backbone of our method. This abstractive summarization model extends a pointer-generator model (See et al., 2017) with a separate background section encoder and is shown to be effective in summarizing radiology notes with multiple sections. Here we briefly describe this model and refer readers to the original papers for full details.
At a high level, this model follows the encoder-decoder architecture, and first encodes the input sequence $x$ into hidden states with a Bi-directional Long Short-Term Memory (Bi-LSTM) network:
$$h = \text{Bi-LSTM}(x). \tag{1}$$
Next, conditioned on $h$, the output sequence is decoded from an LSTM decoder. Formally, at the $t$-th step, given the previously generated token $y_{t-1}$ and the previous decoder state $s_{t-1}$, the decoder calculates the current state $s_t$ with:
$$s_t = \text{LSTM}(s_{t-1}, y_{t-1}). \tag{2}$$
To make the input information available at decoding time, an attention mechanism (Bahdanau et al., 2015) is added to the decoder. The attention output and $s_t$ are then used to predict the output word.
The baseline pointer-generator model by Zhang et al. (2018) adds two augmentations to this attentional encoder-decoder model to make it suitable for summarizing radiology findings:
Copy Mechanism. To enable the model to copy words from the input, a copy mechanism (Vinyals et al., 2015; See et al., 2017) is added to calculate a generation probability at each step of decoding. This generation probability is then used to blend the original output vocabulary distribution and a copy distribution to generate the next word.
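The blending step can be sketched as follows, assuming the pointer-generator formulation of See et al. (2017): the final probability of a word is the generation probability times its vocabulary probability, plus the remaining mass times the total attention weight on input positions holding that word. The function name, the toy distributions and the value 0.6 for the generation probability are illustrative assumptions, not values from the model.

```python
from collections import defaultdict

def blend_distributions(p_gen, vocab_dist, attention, source_tokens):
    """Mix a vocabulary distribution with a copy distribution.

    p_gen:         generation probability in [0, 1]
    vocab_dist:    dict mapping word -> probability under the output vocabulary
    attention:     attention weights over the input positions (sums to 1)
    source_tokens: input tokens aligned with `attention`
    """
    final = defaultdict(float)
    # Generation part: scale every vocabulary probability by p_gen.
    for word, prob in vocab_dist.items():
        final[word] += p_gen * prob
    # Copy part: credit each position's attention mass to the word found there.
    for token, weight in zip(source_tokens, attention):
        final[token] += (1.0 - p_gen) * weight
    return dict(final)

# Toy example (all numbers are made up for illustration).
source = ["small", "bilateral", "pleural", "effusions"]
attention = [0.1, 0.2, 0.3, 0.4]
vocab_dist = {"small": 0.5, "effusions": 0.3, "seen": 0.2}
final_dist = blend_distributions(0.6, vocab_dist, attention, source)
```

In this toy case "bilateral" and "pleural" are absent from the output vocabulary but still receive probability through copying, which is the purpose of the mechanism.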
Background-guided Decoding. As shown in Figure 1, radiology reports often consist of a background section which documents the crucial study background information (e.g., purpose of the study, patient conditions), and a findings section which documents clinical observations. While words can be copied from the findings section to form the summary, Zhang et al. (2018) found it worked better to separately encode the background section and inject the representation into the decoding process. Specifically, the background section is encoded into a vector $b$ with an attentional LSTM encoder. Then at each step of decoding, $b$ is concatenated with the input word $y_{t-1}$ to calculate the new state $s_t$ as in Eq. (2).

Fact Checking in Summarization
Summarization models such as the one described in Section 3 are commonly trained with the teacher-forcing algorithm (Williams and Zipser, 1989) by maximizing the likelihood of the reference, human-written summaries. However, this training strategy results in a significant discrepancy between what the model sees during training and test time, often referred to as the exposure bias issue (Ranzato et al., 2016), leading to degenerate output at test time.
An alternative training strategy is to directly optimize standard metrics such as ROUGE scores (Lin, 2004) with RL, which was shown to improve the quality of the generated summaries (Paulus et al., 2018). Nevertheless, this method still provides no guarantee that the generated summary is factually accurate and complete, since the ROUGE scores merely measure the superficial text overlap between two sequences and do not account for the factual alignment between them. To illustrate this, a reference sentence "pneumonia is seen" and a generated sentence "pneumonia is not seen" have substantial text overlap, and thus the generated sentence would achieve a high ROUGE score; however, the generated sentence conveys an entirely opposite fact. In this section we first introduce a method to verify the factual correctness of the generated summary against the reference summary, and then describe a training strategy to directly optimize a factual correctness objective to improve summary quality.
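The "pneumonia is seen" example can be checked numerically with a simplified ROUGE-L. The sketch below assumes whitespace tokenization and a balanced F-measure over the longest common subsequence (LCS); the official ROUGE toolkit applies further preprocessing, so its scores may differ slightly.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def rouge_l_f1(reference, candidate):
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f1("pneumonia is seen", "pneumonia is not seen")
print(round(score, 3))  # 0.857
```

The generated sentence negates the reference yet keeps an LCS of three tokens out of four, so its overlap score stays high.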

Evaluating Factual Correctness via Fact Extraction
A convenient way to explicitly measure the factual correctness of a generated summary against the reference is to first extract and represent the facts in a structured format. To this end, we define a fact extractor to be an information extraction (IE) module, noted as $f$, which takes in a summary sequence $y$ and returns a structured fact vector $v$:
$$v = f(y) = (v_1, \ldots, v_m), \tag{3}$$
where $v_i$ is a variable that we want to measure via fact checking and $m$ the total number of such variables. For example, in the case of summarizing radiology reports, $v_i$ can be a binary variable that describes whether an event or a disease such as pneumonia is present or not in a radiology study. Given a fact vector $v$ output by $f$ from a reference summary and $\hat{v}$ from a generated summary, we further define a factual accuracy score $s$ to be the ratio of variables in $\hat{v}$ which equal the corresponding variables in $v$, namely:
$$s(\hat{v}, v) = \frac{\sum_{i=1}^{m} \mathbb{1}[v_i = \hat{v}_i]}{m}, \tag{4}$$
where $s \in [0, 1]$. Note that this method requires a summary to be both precise and complete in order to achieve a high $s$ score: missing out a positive variable or falsely claiming a negative variable will be equally penalized.
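A minimal sketch of the factual accuracy score, assuming fact vectors are given as equal-length tuples of binary values; the function name and the example vectors are illustrative.

```python
def factual_accuracy(reference_facts, generated_facts):
    """Fraction of variables in the generated fact vector that match the reference."""
    if len(reference_facts) != len(generated_facts):
        raise ValueError("fact vectors must have the same length")
    if not reference_facts:
        raise ValueError("fact vectors must be non-empty")
    matches = sum(1 for v, v_hat in zip(reference_facts, generated_facts) if v == v_hat)
    return matches / len(reference_facts)

v_reference = (1, 0, 1, 0)
v_missing = (0, 0, 1, 0)      # misses the first positive variable
v_false_claim = (1, 1, 1, 0)  # falsely claims the second variable

print(factual_accuracy(v_reference, v_missing))      # 0.75
print(factual_accuracy(v_reference, v_false_claim))  # 0.75
```

Both kinds of error change exactly one variable, so they receive the same score, which is the symmetry noted above.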
Our general definition of the fact extractor module $f$ allows it to have different realizations for different domains. For our task of summarizing radiology findings, we make use of the open-source CheXpert radiology report labeler (Irvin et al., 2019). At its core, the CheXpert labeler parses the input sentences into dependency structures and runs a series of surface and syntactic rules to extract the presence status of 14 clinical observations seen in chest radiology reports. It was evaluated to have over 95% overall F$_1$ when compared against oracle annotations from multiple radiologists on a large-scale radiology report dataset.
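For illustration only, a toy stand-in for the fact extractor $f$ can be written with keyword matching and sentence-level negation cues. This is not the CheXpert labeler, which relies on dependency parses and a much larger rule set; the four observations, their keywords and the negation cues below are hypothetical.

```python
import re

# Hypothetical subset of observations and keywords (the real labeler covers 14).
OBSERVATIONS = [
    ("cardiomegaly", ("cardiomegaly", "enlarged heart")),
    ("edema", ("edema",)),
    ("pleural_effusion", ("pleural effusion",)),
    ("pneumothorax", ("pneumothorax",)),
]
# Hypothetical negation cues, matched as whole words anywhere in a sentence.
NEGATION = re.compile(r"\b(no|not|without)\b")

def toy_fact_extractor(summary):
    """Return a binary fact vector: 1 if an observation is affirmed, else 0."""
    sentences = [s.strip() for s in re.split(r"[.;]", summary.lower()) if s.strip()]
    facts = []
    for _name, keywords in OBSERVATIONS:
        present = 0
        for sentence in sentences:
            mentioned = any(keyword in sentence for keyword in keywords)
            if mentioned and not NEGATION.search(sentence):
                present = 1
        facts.append(present)
    return tuple(facts)

print(toy_fact_extractor("no pulmonary edema is observed. pneumothorax is seen."))
# (0, 0, 0, 1)
```

A rule this coarse treats any negation cue as negating the whole sentence, which is one reason a parser-based labeler is preferable in practice.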

Improving Factual Correctness via Policy Learning
The fact extractor module introduced above not only enables us to measure the factual accuracy of a generated summary, but also provides us with an opportunity to directly optimize the factual accuracy as an objective. This can be achieved by viewing our summarization model as an agent, the actions of which are to generate a sequence of words to form the summary $\hat{y}$, conditioned on the input $x$. The agent then receives rewards $r(\hat{y})$ for its actions, where the rewards can be designed to measure the quality of the generated summary. Our goal is to learn an optimal policy $P_\theta(y|x)$ for the summarization model, parameterized by the network parameters $\theta$, which achieves the highest expected reward under the training data. Formally, we train our summarization model to minimize loss $L$, the negative expectation of the reward $r(\hat{y})$ over the training data:
$$L(\theta) = -\mathbb{E}_{\hat{y} \sim P_\theta(y|x)}[r(\hat{y})]. \tag{5}$$
According to the REINFORCE algorithm (Williams, 1992), the gradient of this loss can be calculated as the following:
$$\nabla_\theta L(\theta) = -\mathbb{E}_{\hat{y} \sim P_\theta(y|x)}[\nabla_\theta \log P_\theta(\hat{y}|x)\, r(\hat{y})]. \tag{6}$$
In practice, we approximate this gradient over a training example with a single Monte Carlo sample and deduct a baseline reward to reduce the variance of the gradient estimation:
$$\nabla_\theta L(\theta) \approx -\nabla_\theta \log P_\theta(\hat{y}^s|x)\,(r(\hat{y}^s) - \bar{r}), \tag{7}$$
where $\hat{y}^s$ is a sampled sequence from the network and $\bar{r}$ a baseline reward. Practically, there are many strategies for generating the baseline reward, and here we adopt the self-critical training strategy (Rennie et al., 2017), where we obtain the baseline reward $\bar{r}$ by applying the same reward function $r$ to a greedily decoded sequence $\hat{y}^g$, i.e., $\bar{r} = r(\hat{y}^g)$. We empirically find that the use of this self-critical baseline reward is key to the successful training of our summarization model.
Figure 2: Our proposed training strategy. Compared to existing work which relies only on a ROUGE reward $r_R$, we add a factual correctness reward $r_C$ which is enabled by a fact extractor. The summarization model is updated via RL, using a combination of the NLL loss, a ROUGE-based loss and a factual correctness-based loss. For simplicity we only show a subset of the clinical variables in the fact vectors $v$ and $\hat{v}$.
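The self-critical estimate described above is commonly implemented as a surrogate loss: the log-probability of the sampled sequence, weighted by the reward difference between the sampled and the greedy sequence. The sketch below is framework-free and only computes the loss value; the function name and the numbers are illustrative assumptions. In an automatic differentiation framework, the reward difference would be treated as a constant and gradients would flow through the log-probability only.

```python
import math

def self_critical_loss(sample_token_probs, reward_sample, reward_greedy):
    """Single-sample surrogate loss with a greedy-decoding baseline.

    sample_token_probs: model probabilities of each token in the sampled sequence
    reward_sample:      reward of the sampled sequence
    reward_greedy:      reward of the greedily decoded sequence (the baseline)
    """
    log_prob = sum(math.log(p) for p in sample_token_probs)
    advantage = reward_sample - reward_greedy
    return -advantage * log_prob

# The sample beats the greedy output: the loss is positive, so minimizing it
# raises the log-probability of the sampled sequence.
better = self_critical_loss([0.5, 0.25], reward_sample=0.8, reward_greedy=0.6)

# The sample is worse than the greedy output: the sign flips, so minimizing
# the loss lowers the log-probability of the sampled sequence.
worse = self_critical_loss([0.5, 0.25], reward_sample=0.4, reward_greedy=0.6)
```

When the sampled and greedy sequences earn the same reward, the loss is zero and the example contributes no gradient.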

Reward Function
The policy learning strategy in Eq. (7) provides us with the flexibility to optimize arbitrary reward functions. Here we decompose our reward function into two parts:
$$r(\hat{y}) = \lambda_1 r_R + \lambda_2 r_C, \tag{8}$$
where $r_R \in [0, 1]$ is a ROUGE reward, namely the ROUGE-L score (Lin, 2004) of the predicted sequence $\hat{y}$ against the reference $y$; $r_C \in [0, 1]$ is a correctness reward, namely the factual accuracy $s$ of the predicted sequence against the reference sequence, as in Eq. (4); $\lambda_1, \lambda_2 \in [0, 1]$ are scalar weights that control the balance between the two.
Paulus et al. (2018) found that directly optimizing a reward function without the original negative log-likelihood (NLL) objective as used in teacher-forcing can hurt the readability of the generated summaries, and proposed to alleviate this problem by combining the NLL objective with the RL loss. Here we adopt the same strategy, and our final loss during training is:
$$L = \lambda_1 L_R + \lambda_2 L_C + \lambda_3 L_{NLL}, \tag{9}$$
where $\lambda_3 \in [0, 1]$ is an additional scalar weight on the NLL loss. Our overall training strategy is illustrated in Figure 2. Note that our final loss jointly optimizes three aspects of the summaries: $L_{NLL}$ serves as a conditional language model that optimizes the fluency and relevance of the generated summary, $L_R$ controls the brevity of the summary and encourages summaries which have high overlap with human references, and $L_C$ encourages summaries that are factually accurate when compared against human references.
... fewer than 10 words or the impression has fewer than 2 words. Lastly, we replaced all date and time mentions with special tokens (e.g., <DATE>).
To test the generalizability of the models, instead of using random stratification, we stratified each dataset over time into training, dev and test splits. We include statistics of both datasets in Table 1 and stratification details in Appendix B.

Models
As we use the augmented pointer-generator network described in Section 3 as the backbone of our method, we mainly compare against it as the baseline model (PG Baseline), and use the open implementation by Zhang et al. (2018).
For the proposed RL-based training, we compare three variants: training with only the ROUGE reward (RL$_R$), with only the factual correctness reward (RL$_C$), or with both (RL$_{R+C}$). All three variants have the NLL component in the training loss as in Eq. (9). For all variants, we initialize the model with the best baseline model trained with standard teacher-forcing, and then finetune it on the training data with the corresponding RL loss, until it reaches the best validation score.
To understand the difficulty of the task and evaluate the necessity of using abstractive summarization models, we additionally evaluate two extractive summarization methods: (1) LexRank (Erkan and Radev, 2004), a widely-used non-neural graph-based extractive summarization algorithm; and (2) BanditSum (Dong et al., 2018), an RL-based neural extractive summarization model which achieves state-of-the-art results on the CNN/Daily Mail dataset (Hermann et al., 2015). For both of them we use open implementations.
We include other model implementation and training details in Appendix C.

Evaluation
We use two sets of metrics to evaluate model performance at the corpus level. First, we use the standard ROUGE scores (Lin, 2004), and report the F$_1$ scores for ROUGE-1, ROUGE-2 and ROUGE-L.
The second metric is a factual F$_1$ score. While the factual accuracy score $s$ that we use in the reward function evaluates how factually accurate a specific summary is, comparing it at the corpus level can be misleading. To understand this, imagine the case where a clinical variable is rarely present in the corpus. A model which always generates a negative summary for it (i.e., the disease is not present) can have high accuracy, but is useless in practice. Instead, for each variable, we obtain a model's predictions over all test examples and calculate an F$_1$ score for this variable. We then macro-average the F$_1$ scores of all variables to obtain the overall factual F$_1$ score of the model.
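The rare-variable argument can be made concrete with a small sketch of the macro-averaged factual F$_1$. It assumes binary fact vectors and, as a convention chosen here, assigns an F$_1$ of 0 to a variable with no true positives; the function names and the toy data are illustrative.

```python
def f1_for_variable(gold, pred):
    """F1 of the positive class for one variable over all test examples."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0  # convention assumed here when nothing is correctly predicted
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_factual_f1(gold_vectors, pred_vectors):
    """Average the per-variable F1 scores with equal weight per variable."""
    num_vars = len(gold_vectors[0])
    scores = []
    for i in range(num_vars):
        gold = [v[i] for v in gold_vectors]
        pred = [v[i] for v in pred_vectors]
        scores.append(f1_for_variable(gold, pred))
    return sum(scores) / num_vars

# Variable 0 is always present; variable 1 is present in 1 of 10 examples.
gold = [(1, 0)] * 9 + [(1, 1)]
# A model that always predicts variable 1 as absent.
pred = [(1, 0)] * 10

rare_accuracy = sum(g[1] == p[1] for g, p in zip(gold, pred)) / len(gold)
print(rare_accuracy)                   # 0.9
print(macro_factual_f1(gold, pred))    # 0.5
```

The always-negative model is 90% accurate on the rare variable, yet its F$_1$ on that variable is 0, which pulls the macro average down to 0.5.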
Note that the CheXpert labeler that we use is specifically designed to run on radiology summaries, which usually have a different style of language compared to the radiology findings section of the reports (see further analysis in Section 7). As a result, we found the labeler to be less accurate when applied to the findings section. For this reason, we were not able to calculate the factual F$_1$ scores on the summaries generated by the two extractive summarization models.

Results
We first present our automatic evaluation results on the two collected datasets. We then present a human evaluation with board-certified radiologists where we compare the summaries generated by humans, the baseline and our proposed model.

Automatic Evaluation
Our main results on the Stanford dataset and the RIH dataset are shown in Table 2. We first notice that while the neural extractive summarization model, BanditSum, outperforms the non-neural extractive method on ROUGE scores, the pointer-generator baseline substantially outperforms both of them, suggesting that on both datasets abstractive summarization is necessary to generate summaries comparable to human-written ones.
Table 2: Main results on the Stanford and the RIH datasets. R-1, R-2, R-L represent the ROUGE scores and F$_1$ represents the factual F$_1$ score. PG Baseline represents our baseline augmented pointer-generator model; RL$_R$, RL$_C$ and RL$_{R+C}$ represent RL training with the ROUGE reward alone, with the factual correctness reward alone and with both, respectively. All the ROUGE scores have a 95% confidence interval of at most ±0.6. F$_1$ scores for extractive models were not evaluated for the reason discussed in Section 5.3.
... of 10% absolute, however with a consistent decline in the ROUGE scores compared to RL$_R$ training.
Combining the ROUGE and the factual correctness rewards (RL$_{R+C}$) achieves a balance between the two, leading to an overall improvement of 2.7 on ROUGE-L and 8.6% on factual F$_1$ compared to the baseline. This indicates that RL$_{R+C}$ training leads to both higher overlap with references and improved factual correctness. Surprisingly, while ROUGE has been criticized for its poor correlation with human judgment of quality and insufficiency for evaluating correctness of the generated text (Novikova et al., 2017; Chaganty et al., 2018), we find that optimizing the ROUGE reward jointly with NLL leads to substantially more factually correct summaries. This is shown by the notable gain of 7.3% factual F$_1$ from the RL$_R$ training.
All of our findings are consistent on the RIH dataset, with RL$_{R+C}$ achieving an overall improvement of 2.5 on ROUGE-L and 5.5% on factual F$_1$.
Fine-grained Correctness. To understand how improvements in individual variables contribute to the overall improvement, we show the fine-grained factual F$_1$ scores for all variables on the Stanford dataset in Table 3 and include results on the RIH dataset in Appendix D. We find that on both datasets, improvements in RL$_{R+C}$ can be observed on all variables tested. We further find that, as we change the initialization across different training runs, while the overall improvement on factual F$_1$ stays approximately unchanged, the distribution of the improvement on different variables can vary substantially. Developing a training strategy for fine-grained control over different variables is an interesting direction for future work.
Qualitative Results. We present two example reports along with the human reference summaries, the PG baseline outputs and RL$_{R+C}$ model outputs in Figure 3. In the first example, while the summary from the baseline model seems generic and does not include any meaningful observation, the summary from the RL$_{R+C}$ model aligns well with the human reference, and therefore achieves a higher factual accuracy score. In the second example, the baseline model wrongly copied an observation from the findings although the actual context is "no longer evident", while the RL$_{R+C}$ model correctly recognizes this and produces a better summary.

Human Evaluation
To study whether the improvements in the factual correctness scores lead to improvement in summarization quality under expert judgment, we run a comparative human evaluation following previous work (Chen and Bansal, 2018; Dong et al., 2018; Zhang et al., 2018). We sampled 50 test examples from the Stanford dataset, and for each example we presented to two board-certified radiologists the full radiology findings along with blinded summaries from (1) the human reference, (2) the PG baseline and (3) our RL$_{R+C}$ model. We shuffled the three summaries such that the correspondence cannot be guessed, and asked the radiologists to compare them based on the following three metrics: (1) fluency, (2) factual correctness and completeness, and (3) overall quality. For each metric we asked the radiologists to rank the three summaries, with ties allowed. After the evaluation, we converted each ranking into two binary comparisons: (1) our model versus the baseline model, and (2) our model versus the human reference.
The results are shown in Table 4. Comparing our model against the baseline model, we find that: (1) in terms of fluency our model is less preferred, although a majority of the results (60%) are ties; (2) our model wins more on factual correctness and overall quality. Comparing our model against human references, we find that: (1) the human reference wins more on fluency; (2) factual correctness results are close, with 72% of our model outputs being at least as good as the human reference; (3) surprisingly, in terms of overall quality our model was slightly preferred by the radiologists over the human references.

Stanford Dataset
Background: radiographic examination of the chest: <date> <time> am. clinical history: <age> years of age, female, wheezing, sob. comparison: <date> at <time>. procedure comments : two views of the chest...
Findings: continuous rhythm monitoring device again seen projecting over the left heart. persistent low lung volumes with unchanged cardiomegaly. again seen is a diffuse reticular pattern with interstitial prominence demonstrated represent underlying emphysematous changes with superimposed increasing moderate pulmonary edema. small bilateral pleural effusions. persistent bibasilar opacities left greater than right which may represent infection versus atelectasis.
Human: increased moderate pulmonary edema with small bilateral pleural effusions. left greater than right basilar opacities which may represent infection versus atelectasis.
RL$_{R+C}$ (s = 1.00): increasing moderate pulmonary edema. small bilateral pleural effusions. persistent bibasilar opacities left greater than right which may represent infection versus atelectasis.

Analysis & Discussion
Fluency and Style of Summaries. Our human evaluation results in Section 6.2 suggest that in terms of fluency our model output is less preferred than human reference and baseline model output. To further understand the fluency and style of generations from different models at a larger scale, we trained a neural language model (LM) for radiology summaries following previous work in summarization (Liu et al., 2018). Intuitively, radiology summaries which are more fluent and consistent with humans in style should be able to achieve a lower perplexity under this in-domain LM, and vice versa. To this end, we collected all human-written summaries from the training and dev sets of the Stanford dataset and the RIH dataset, which in total gives us about 222k summaries. We then trained a strong Mixture of Softmaxes LM (Yang et al., 2018) on this corpus, and evaluated the perplexity of test set outputs for all models.
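For reference, perplexity can be computed from per-token log-probabilities as in the sketch below. It assumes that the LM exposes a natural-log probability for every token of every summary and that perplexity is aggregated over all tokens in the corpus; the function name `corpus_perplexity` and the input layout are hypothetical, and the actual evaluation setup may differ.

```python
import math

def corpus_perplexity(token_logprobs):
    """Corpus-level perplexity from per-token natural-log probabilities.

    `token_logprobs` is a list of summaries, each given as a list of
    log p(token | preceding tokens) values under the language model.
    """
    total_logprob = sum(sum(summary) for summary in token_logprobs)
    total_tokens = sum(len(summary) for summary in token_logprobs)
    if total_tokens == 0:
        raise ValueError("cannot compute perplexity of an empty corpus")
    # Perplexity is the exponential of the average negative log-likelihood.
    return math.exp(-total_logprob / total_tokens)
```

A model that assigns probability 0.25 to every token has a perplexity of 4 under this definition; lower values mean the LM finds the text more predictable.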
The results show that while extractive models are able to generate summaries that have non-trivial overlap with human references, their perplexity scores tend to be much higher than those of human-written summaries. We conjecture that this is because radiologists are trained to write the summaries with more compressed language than when they are writing the findings, therefore sentences directly extracted from the findings tend to be more verbose than needed. We further observe that our baseline model trained with teacher-forcing achieves even lower perplexity than human, and the model trained with our proposed method has a perplexity score much closer to human references. We hypothesize that this is because models trained with teacher-forcing are prone to generic generations (therefore also leading to lower factual correctness), and training with the proposed rewards alleviates this issue, leading to summaries more consistent with humans in style. For example, we find that "no significant interval change" is a very frequent generation from the baseline model, regardless of the actual findings in the input. On the Stanford dev set, this sentence shows up in 34% of the summaries generated by the baseline, while the numbers for RL R+C and human are only 24% and 17%, respectively. This hypothesis is further confirmed when we plot the distribution of the top 10 most frequent trigrams from different models in Figure 4: while the output from the baseline model heavily reuses the few most frequent trigrams, our model RL R+C tends to have more diverse summaries which are closer to human references. The same trends are observed for 4-grams and 5-grams.
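The two diagnostics used in this analysis, the rate at which a fixed phrase appears and the share of trigram mass taken by the most frequent trigrams, can be sketched as below. Whitespace tokenization, substring matching and the function names are assumptions of the sketch, not a description of the analysis code.

```python
from collections import Counter

def phrase_rate(summaries, phrase):
    """Fraction of summaries that contain `phrase` as a substring."""
    if not summaries:
        return 0.0
    return sum(phrase in s for s in summaries) / len(summaries)

def trigram_counts(summaries):
    """Count word trigrams over whitespace-tokenized summaries."""
    counts = Counter()
    for summary in summaries:
        tokens = summary.split()
        counts.update(zip(tokens, tokens[1:], tokens[2:]))
    return counts

def top_k_share(summaries, k=10):
    """Share of all trigram occurrences covered by the k most frequent ones."""
    counts = trigram_counts(summaries)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(count for _, count in counts.most_common(k)) / total
```

A system that keeps reusing a few stock phrases yields a high `top_k_share`, whereas more varied output spreads the mass over many trigrams.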
Limitations. While we showed the success of our proposed method on improving the factual correctness of a radiology summarization model, we also recognize several limitations of our work. First, our proposed training strategy relies on an external IE module. While this IE module is relatively easy to implement for a domain with a limited space of facts, how to generalize this method to open-domain summarization remains unsolved. Second, our study was based on a rule-based IE system, and the use of a more robust statistical IE model can potentially improve the results. Third, we mainly focus on key factual errors which result in a flip of the binary outcome of an event (e.g., presence of disease), whereas factual errors in generated summaries can occur in other forms such as wrong adjectives or coreference errors (Kryściński et al., 2019a). We leave the study of these problems to future work.

Conclusion
In this work we presented a general framework and a training strategy to improve the factual correctness of neural abstractive summarization models. We applied this approach to the summarization of radiology reports, and showed its success via both automatic and human evaluation on two separate datasets collected from real hospitals. Our general takeaways include: (1) in a domain with a limited space of facts such as radiology reports, a carefully implemented IE system can be used to improve the factual correctness of neural summarization models via RL; (2) even in the absence of a reliable IE system, optimizing the ROUGE metrics via RL can substantially improve the factual correctness of the generated summaries.
We hope that our work draws the community's attention to the factual correctness issue of abstractive summarization models and inspires future work in this direction.

C Model Implementation and Training Details
For the baseline background-augmented pointer-generator model, we use its open implementation.5 We use a 2-layer LSTM as the findings encoder, 1-layer LSTM as the background encoder, and a …
5 https://github.com/yuhaozhang/summarize-radiology-findings
For the training and finetuning of the models, we use the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 1e-3. We use a batch size of 64 and clip the gradient with a norm of 5. During training we evaluate the model on the dev set every 500 steps and decay the learning rate by 0.5 whenever the validation score does not increase after 2500 steps. Since we want the model outputs to have both high overlap with the human references and high factual correctness, for training we always use the average of the dev ROUGE score and the dev factual F1 score as the stopping criterion. We tune the scalar weights in the loss function on the dev set, and use weights of 0.03, 0.97 and 0.97 for L_NLL, L_R and L_C, respectively.
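The schedule described above can be sketched in plain Python as follows. Several points are assumptions of the sketch rather than details confirmed by the text: the loss is taken to be a weighted sum of the three terms, the patience window is restarted after each decay, and the names `combined_loss`, `dev_score` and `PlateauDecay` are hypothetical. The Adam optimizer and gradient clipping are not covered here.

```python
EVAL_EVERY = 500       # dev evaluation interval, in training steps
PATIENCE_STEPS = 2500  # steps without improvement before decaying
DECAY = 0.5            # multiplicative learning-rate decay factor

def combined_loss(l_nll, l_r, l_c, weights=(0.03, 0.97, 0.97)):
    """Weighted sum of the three loss terms (the sum form is an assumption)."""
    w_nll, w_r, w_c = weights
    return w_nll * l_nll + w_r * l_r + w_c * l_c

def dev_score(rouge, factual_f1):
    """Validation score: average of dev ROUGE and dev factual F1."""
    return (rouge + factual_f1) / 2.0

class PlateauDecay:
    """Decay the learning rate when the dev score stops improving."""

    def __init__(self, lr=1e-3):
        self.lr = lr
        self.best = float("-inf")
        self.best_step = 0

    def update(self, step, score):
        """Record a dev evaluation at `step` and return the current rate."""
        if score > self.best:
            self.best, self.best_step = score, step
        elif step - self.best_step >= PATIENCE_STEPS:
            self.lr *= DECAY
            self.best_step = step  # assumption: restart the patience window
        return self.lr
```

With evaluations every 500 steps, a decay is triggered at the fifth consecutive evaluation that fails to improve on the best dev score.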
For the extractive LexRank and BanditSum models, we use their open implementations.6 For the BanditSum extractive summarization model, we use default values for all hyperparameters as in Dong et al. (2018). For both models we select the 3 highest-scoring sentences to form the summary, which yields the highest ROUGE-L scores on the dev sets.
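Selecting the highest-scoring sentences can be sketched as below. The sketch assumes that each extractive model returns one score per sentence and that the selected sentences are emitted in their original document order; the text does not specify the output order, and the function name `select_top_sentences` is hypothetical.

```python
def select_top_sentences(sentences, scores, k=3):
    """Return the k highest-scoring sentences, kept in document order."""
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    # Indices of the k best sentences, ranked by descending score.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Re-sort the chosen indices so the summary follows the source order.
    return [sentences[i] for i in sorted(ranked[:k])]
```

The value of `k` would be tuned on the dev sets, as was done here with ROUGE-L.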
For ROUGE evaluation, we use the Python