Factual Error Correction for Abstractive Summarization Models

Neural abstractive summarization systems have achieved promising progress, thanks to the availability of large-scale datasets and models pre-trained with self-supervised methods. However, ensuring the factual consistency of the generated summaries remains a challenge. We propose a post-editing corrector module that addresses this issue by identifying and correcting factual errors in generated summaries. The neural corrector model is pre-trained on artificial examples created by applying a series of heuristic transformations to reference summaries; these transformations are inspired by an error analysis of state-of-the-art summarization model outputs. Experimental results show that our model is able to correct factual errors in summaries generated by other neural summarization models and outperforms previous models on factual consistency evaluation on the CNN/DailyMail dataset. We also find that transferring from artificial error correction to the downstream setting remains very challenging.


Introduction
Self-supervised methods have achieved success in a wide range of NLP tasks, and automatic summarization is no exception (Liu and Lapata, 2019; Lewis et al., 2019; Zhang et al., 2019a; Shi et al., 2019; Fabbri et al., 2019). These state-of-the-art abstractive summarization models typically fine-tune pre-trained transformer-based models (Vaswani et al., 2017) on a summarization dataset. Despite significant improvements over previous methods in terms of automatic evaluation scores such as ROUGE (Lin, 2004), ensuring factual consistency of the generated summary with respect to the source remains challenging. For example, Cao et al. (2018) claim that about 30% of summaries generated by abstractive models contain factual errors, which greatly limits their practicality. Our data and code are available at https://github.com/mcao610/Factual-Error-Correction.

Source: Jerusalem (CNN) The flame of remembrance burns in Jerusalem, and a song of memory haunts Valerie Braham as it never has before. This year, Israel's Memorial Day commemoration is for bereaved family members such as Braham. "Now I truly understand everyone who has lost a loved one," Braham said. (...)
Original: France's memorial day commemoration is for bereaved family members as braham. (inconsistent)
After Correction: Israel's memorial day commemoration is for bereaved family members as braham. (consistent)
Table 1: An example of an inconsistent system-generated summary and the output summary from our correction model. In this case, "France" is successfully corrected to "Israel".
Different approaches have been proposed to detect or ensure the factual consistency of generated summaries, including using fact extraction or applying attention over fact triples (Cao et al., 2018; Zhang et al., 2019b; Goodrich et al., 2019), applying natural language inference or question answering models for consistency checking (Falke et al., 2019; Li et al., 2018; Wang et al., 2020), and training models on artificial datasets (Kryściński et al., 2019). Most of these approaches either require a high-quality fact extraction model or focus only on factual consistency evaluation. Correcting factual errors by editing the inconsistent parts of generated summaries is a direction that has not been explored much.
In this work, we propose a model to improve the factual consistency of system summaries with post-editing correction (Table 1). Our model takes a draft summary that is generated by an abstractive summarization model and produces a corrected final summary, conditioned on the source document. In addition, our trained corrector can be used as an evaluation model for factual consistency of abstractive summaries, with the assumption that a generated summary is inconsistent if our corrector decides to make edits. To teach the model to correct errors, we train it with artificial data that has factual errors introduced using heuristics proposed by Kryściński et al. (2019).
The empirical results based on automatic and human evaluations indicate that our model not only corrects factual errors in summaries, but is also a more reliable factuality evaluation model than FactCC (Kryściński et al., 2019), a recently proposed method. In a downstream setting where we apply the corrector to the output of an abstractive summarizer, we find that our corrector is able to accurately correct some errors in the generated summaries. However, the overall recall on correcting factual errors in real system summaries remains low, suggesting that the errors introduced by our heuristics have a different distribution than the errors made by abstractive summarization systems.

Background and Related Work
Previous work on factual consistency in abstractive summarization can be divided into two categories: abstractive summarization models tailored towards factual consistency (Cao et al., 2018; Zhang et al., 2019b; Li et al., 2018), and evaluation models for factual consistency in abstractive summarization (Goodrich et al., 2019; Falke et al., 2019; Kryściński et al., 2019; Wang et al., 2020). Cao et al. (2018) propose a dual attention module in an abstractive summarizer that attends to both the source document and to relation triples extracted from the document. Zhang et al. (2019b) propose to improve their abstractive summarization model by optimizing fact scores defined over radiology reports with reinforcement learning. Li et al. (2018) jointly train their model's encoder on summarization and NLI tasks. Guo et al. (2018) train an abstractive summarization system with the auxiliary tasks of question and entailment generation and show that their generated summaries are less likely to produce extraneous facts. Kumar and Cheung (2019) show that neural abstractive summarizers often assign higher posterior likelihood to perturbed contrastive summaries that are inconsistent with the source text than to human-written gold-standard ones. Concurrently with our work, Zhu et al. (2020) proposed a fact-aware summarization model that uses a knowledge graph, along with a pre-trained corrector module that modifies generated summaries. Also concurrently, Dong et al. (2020) propose factual correction models that leverage knowledge learned from question answering models via span selection; their models employ single- or multi-masking strategies to either iteratively or auto-regressively replace entities.
In terms of evaluating abstractive summarization models for factual consistency, Goodrich et al. (2019) propose a metric that checks the overlap of fact triples extracted from the source document and the generated text, evaluated on Wikidata. Falke et al. (2019) show that factual error detection is a difficult task on its own and that adapting entailment models for factual error detection does not offer the desired performance. Kryściński et al. (2019) fine-tune a BERT model on heuristically created data with six types of rule-based text transformations for factual consistency checking. Wang et al. (2020) propose a framework for measuring inconsistencies in abstractive summarization by answering questions based on both generated summaries and documents.

Proposed Approach
In this section, we describe our procedure for introducing artificial errors into the training data and propose our end-to-end error corrector model.

Dataset of Artificial Corruptions
Inspired by a recent study of the error types made by state-of-the-art summarization systems, we artificially create a weakly supervised training dataset based on the text transformations proposed by Kryściński et al. (2019).
Given a source text d and the reference summary s, we corrupt the reference summary into an inconsistent summary s′ with a randomly sampled corruption rule (described below) with probability α; otherwise, we keep s′ = s with probability 1 − α. We set α = 0.3 to match the factuality error rate of real abstractive summaries reported in a recent study (Cao et al., 2018). The training data consists of triplets (s′, s, d).
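The sampling procedure can be summarized with the following minimal sketch; the helper names (make_training_triplet, corruption_rules) are illustrative and do not correspond to our released implementation, and corruption_rules stands for the four swap rules described in the next paragraph.

import random

ALPHA = 0.3  # fraction of training summaries that receive an injected error

def make_training_triplet(reference_summary, document, corruption_rules):
    """Build one (s', s, d) triplet: corrupt s with probability ALPHA, else keep it clean."""
    if random.random() < ALPHA:
        corrupt = random.choice(corruption_rules)      # entity / number / date / pronoun swap
        corrupted_summary = corrupt(reference_summary, document)
    else:
        corrupted_summary = reference_summary          # clean sample: s' = s
    return corrupted_summary, reference_summary, document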
Error Corruptions Four types of errors are used to create the inconsistent summaries: Entity, Number, Date, and Pronoun swaps. Unlike Kryściński et al. (2019), we corrupt the reference summary rather than sentences sampled from the source document.
In all four types of error constructions, we utilize a swapping strategy to introduce errors. For Entity, Number, and Date swapping, one entity in the reference summary is selected and swapped with another random entity of the same type in the source document. For Pronoun swapping, one pronoun is extracted and swapped with another one of a matching syntactic case. Table 2 shows one example of a corruption.
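As an illustration, the entity swap can be implemented with an off-the-shelf NER tagger such as spaCy; the snippet below is a simplified sketch, not the exact transformation code used in our experiments. The same idea covers Number and Date swaps (via NER types such as CARDINAL and DATE), while Pronoun swapping would instead draw from a list of case-matched pronouns.

import random
import spacy

nlp = spacy.load("en_core_web_sm")

def entity_swap(summary, document):
    """Replace one named entity in the summary with a same-typed entity from the document."""
    summary_doc, source_doc = nlp(summary), nlp(document)
    if not summary_doc.ents:
        return summary
    target = random.choice(summary_doc.ents)
    # candidate replacements: source entities of the same type with a different surface form
    candidates = [e.text for e in source_doc.ents
                  if e.label_ == target.label_ and e.text != target.text]
    if not candidates:
        return summary
    replacement = random.choice(candidates)
    return summary[:target.start_char] + replacement + summary[target.end_char:]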

Training Objective and Models
With the artificial training data consisting of triplets (s′, s, d), the goal of the corrector is to generate the correct summary s based on the inconsistent summary s′ and the source d. This can be expressed as maximizing the likelihood P(s | s′, d) in an encoder-decoder model. We concatenate s′ and d as input to the encoder (s′ and d are separated by a separator token) and train the decoder to generate s.
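Concretely, each encoder input joins the draft summary and the document with a separator; this is a minimal sketch, and the literal "</s>" string is an assumption that depends on the tokenizer used.

def build_corrector_example(draft_summary, document, reference_summary, sep_token="</s>"):
    """Format one training example for the encoder-decoder corrector."""
    encoder_input = f"{draft_summary} {sep_token} {document}"   # s' [SEP] d
    decoder_target = reference_summary                          # s
    return encoder_input, decoder_target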
We use BART (Lewis et al., 2019) as the basis of our summary corrector because of its demonstrated performance on conditional text generation tasks. BART is a sequence-to-sequence auto-regressive transformer model that is pre-trained as a denoising auto-encoder: given an input sentence corrupted by text infilling, token deletion, and other text transformations, BART is trained to reconstruct the original sentence. This pre-training task closely resembles our summary correction task, in which the corrupted or generated summary can be regarded as the noisy input and the inconsistent content as the noise.

Evaluation Tasks and Measures
We evaluate our model on two tasks: factual consistency checking and error correction.
Factual consistency checking For this task, the model needs to classify each original input summary as consistent or inconsistent with respect to the source text. It is thus a binary classification task for which we report accuracy, as well as precision, recall, and F1. We interpret the output of our corrector model as a classification decision as follows. If the corrector makes any change to the original input summary, we consider this to be a prediction of the inconsistent class. Otherwise, the corrector makes no change and we consider this a prediction of the consistent class.
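In code, this decision rule amounts to a string comparison between the corrector's input and output; a minimal sketch:

def predict_consistency_label(original_summary, corrected_summary):
    """Label a summary as inconsistent whenever the corrector edits it."""
    if corrected_summary.strip() == original_summary.strip():
        return "consistent"
    return "inconsistent"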
Error correction For this task, the model must correct inconsistencies in the original summary (if any) with respect to the source text.
We define correction accuracy as the proportion of original summaries that are correctly changed by our corrector. On our artificial test set, an input summary is considered successfully corrected if the corrected summary matches the reference summary exactly. For the K2019 dataset, no reference corrections are available, so we instead conducted a human evaluation to check the consistency of the corrected output: we read the original and corrected summaries as well as the source document to determine whether a summary is successfully corrected by our model.
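The exact-match measure on the artificial test set can be computed as in the following sketch (function and variable names are illustrative):

def correction_accuracy(corrected_summaries, reference_summaries):
    """Exact-match correction accuracy over a list of corrector outputs."""
    matches = sum(c.strip() == r.strip()
                  for c, r in zip(corrected_summaries, reference_summaries))
    return matches / len(reference_summaries)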

Datasets
We use two datasets for our experiments. The first is the dataset of artificial corruptions described in Section 3.1, which we create from the CNN/DailyMail dataset. There are in total 287,227 samples in the training set, of which we corrupt 30% (85,583), resulting in 16,858 date, 35,113 entity, 13,408 number, and 20,204 pronoun corrupted samples. We refer to the other 201,644 training samples as clean samples. We also create artificial validation and test sets for model selection and evaluation. The test set contains 5,780 corrupted samples and 5,710 clean samples.
The second dataset we use is the K2019 test set of Kryściński et al. (2019). This dataset contains 503 summaries generated by different recent neural abstractive summarizers, which have been manually labeled for whether they contain an inconsistency.
We evaluate our model on both datasets. We do not compare against baselines on the artificial test set, since it simply serves as a check of our model's performance in the artificial setting. The more meaningful evaluations are consistency checking and error correction on K2019.

Corrector Training Details
We use the BART implementation from fairseq as the basis of our corrector. The pre-trained BART model is fine-tuned on our training dataset for 10 epochs as described in Section 3.2. The learning rate is set to 3e-5. All our experiments are run on 4 NVIDIA Tesla V100 GPUs, and training takes about 12 hours.
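For reference, a corrector fine-tuned this way can be loaded and run through fairseq's BART hub interface roughly as follows; the checkpoint and data paths, the example inputs, and the decoding hyperparameters are illustrative assumptions rather than our exact settings.

import torch
from fairseq.models.bart import BARTModel

# Placeholder paths for a locally fine-tuned corrector checkpoint and binarized data.
bart = BARTModel.from_pretrained(
    "checkpoints/",
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="cnn_dm-corruption-bin/",
)
bart.eval()
if torch.cuda.is_available():
    bart.cuda()

draft = "France's memorial day commemoration is for bereaved family members as braham."
document = "Jerusalem (CNN) The flame of remembrance burns in Jerusalem ..."
encoder_input = f"{draft} </s> {document}"

# Beam-search decoding; hyperparameters follow common BART summarization settings.
with torch.no_grad():
    corrected = bart.sample([encoder_input], beam=4, lenpen=2.0,
                            max_len_b=140, min_len=5, no_repeat_ngram_size=3)[0]
print(corrected)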

Artificial corruptions
For factual consistency checking, the model is able to identify these artificially injected errors. For error correction, among the 5,780 corrupted summaries in the test set, 62.13% are corrected by the model to exactly match the reference summary. For the 5,710 clean summaries, the model made changes to 26.27% of them, which corresponds to a correction accuracy of 73.73% on clean summaries. These results show that the model is able to correct the majority of the test samples even under our strict evaluation measure.
K2019 Table 4 shows the consistency checking results on the K2019 test set. Compared with FactCC, our model improves accuracy by more than 10 percentage points and macro F1 by 0.11. This result is interesting considering that our model is trained on a generation task rather than on classification.
As for correction performance, Table 6 shows the results of our human evaluation. Among the 62 inconsistent summaries in the test set, the corrector model made changes to 19 summaries, of which 11 were successfully corrected and 7 remained inconsistent. For the remaining 441 consistent summaries in the test set, changes were made to 39 summaries, and the model changed the meaning of 5 of them. In other words, our model successfully corrects an inconsistent summary 17.74% of the time and corrupts a consistent one 1.13% of the time. Compared with the correction rate of 62.13% on the artificial test set, the much lower correction rate on the real test set suggests that there is still a gap between the two settings: the error types in the training set do not fully represent the diverse errors made by summarization systems.

Output Analysis Table 5 shows several input and output summaries of our corrector model together with the source document fragments. In the second example, the model correctly replaced 147 with 19, but was not able to remove "including 142 students", which would require a larger modification to the original summary. More examples can be found in the Appendix.

Conclusions
In this paper, we proposed a novel approach to correcting inconsistent content in summaries generated by abstractive summarization models. We train an end-to-end correction model on artificial examples created by corrupting reference summaries. Our model achieves promising performance on our artificial test set and outperforms previous models on the manually annotated test set by wide margins. Our human evaluation indicates that our model is able to correct some factually inconsistent summaries generated by abstractive summarization models. However, low recall on inconsistent summaries and false positives remain challenges.