Improving Faithfulness in Abstractive Summarization with Contrast Candidate Generation and Selection

Despite significant progress in neural abstractive summarization, recent studies have shown that current models are prone to generating summaries that are unfaithful to the original context. To address this issue, we study contrast candidate generation and selection as a model-agnostic post-processing technique to correct extrinsic hallucinations (i.e. information not present in the source text) in unfaithful summaries. We learn a discriminative correction model by generating alternative candidate summaries in which named entities and quantities in the generated summary are replaced with ones of compatible semantic types from the source document. This model is then used to select the best candidate as the final output summary. Our experiments and analysis across a number of neural summarization systems show that our proposed method is effective in identifying and correcting extrinsic hallucinations. We also analyze the typical hallucination phenomena produced by different types of neural summarization systems, in the hope of providing insights for future work in this direction.


Introduction
Abstractive summarization is the task of producing a concise and fluent summary that is salient and faithful to the source document(s). Data-driven, neural methods (Rush et al., 2015; Nallapati et al., 2016; See et al., 2017), and the more recent pretrained transformer language models (Vaswani et al., 2017; Devlin et al., 2019; Liu and Lapata, 2019), have shown improvements in the fluency and salience of generated summaries.
However, less progress has been made on improving the faithfulness of the generated summaries, that is, producing a summary that is entailed by the information presented in the source document. Despite the increased level of performance under automatic metrics such as ROUGE (Lin, 2004) or BERTSCORE, current state-of-the-art models (Liu and Lapata, 2019; Lewis et al., 2020) produce summaries that suffer from intrinsic and extrinsic hallucinations: the fabrication of untruthful text spans containing information either present in or absent from the source (Maynez et al., 2020). Table 1 shows an example of such a summary, generated by BART (Lewis et al., 2020), an auto-regressive, transformer-based sequence-to-sequence model. The article describes an event where the former UN Secretary-General Ban Ki-moon was re-elected for a second term. The model hallucinates "2007", which never appears in the source document, leading to inconsistency with the correct date of the event presented.

Source: He was re-elected for a second term by the UN General Assembly, unopposed and unanimously, on 21 June 2011, with effect from 1 January 2012. Mr. Ban describes his priorities as mobilising world leaders to deal with climate change, economic upheaval, pandemics and increasing pressures involving food, energy and water...
Unfaithful Summary: The United Nations Secretary-General Ban Ki-moon was elected for a second term in 2007.
Our Summary: The United Nations Secretary-General Ban Ki-moon was elected for a second term in 21 June 2011.
Table 1: An example unfaithful summary. It suffers from extrinsic hallucination, where information not present in the source document was generated. Our method attempts to correct the unfaithful summary by replacing "2007" with an entity from the source with a compatible semantic type (i.e. DATE).

* Most of the work was done while the authors were at Google.
In this work, we focus on the problem of correcting such hallucinations as a post-processing step. A post-processing correction step allows us to rely on the fluency of the text generated by SOTA systems, which benefit from huge pretrained models and large fine-tuning datasets, and to correct it using small amounts of automatically generated training data.
Under the setting where a large fraction of ground truth summarization data is hallucinated, as we show in Table 2, we study the method of contrast candidate generation and selection. In the generation step, we replace named entities in a potentially hallucinated summary with entities of compatible semantic types that are present in the source, creating variants of candidate summaries. In the selection step, we rank the generated candidates with a discriminative model trained to distinguish between faithful summaries and synthetic negative candidates generated from the source. We experiment on a range of RNN- and transformer-based abstractive summarization models. Our preliminary results on the XSum corpus (Narayan et al., 2018a), which contains a substantial presence of hallucinated ground truth examples, show the effectiveness of our method in correcting unfaithful summaries with extrinsic hallucinations.
Our main contributions are as follows. First, our work is the first to study the effectiveness of contrast candidate generation and selection as a model-agnostic method for correcting hallucinations, under the setting where a large fraction of ground truth summarization data suffers from hallucinations. Second, we validate our method on various neural summarization systems trained on XSum, and provide detailed analysis on the typical types of hallucinations from each system.

Contrast Candidate Generation & Selection
Our proposed method is built on the observation that a large fraction of extrinsic hallucinations occur on named entities and quantities. Table 2 shows the human analysis by Maynez et al. (2020) of the hallucinations in 500 randomly sampled gold summaries from the XSum corpus. We break down each category and annotate the proportion of hallucinations that occur on entity and number/quantity spans. As Maynez et al. (2020) further show that the hallucinations in training data translate to similar issues in the generated outputs across different summarization models, we want to study a model-agnostic, post-processing method that can correct such entity and quantity hallucinations. We frame the problem as a correction task, making it conceptually less complex than summarization. Modeling correction as a standalone task would require less training data, which becomes crucial when a large proportion of ground truth summarization data suffers from hallucinations, and would inherit the fluency of data-intensive SOTA models.

Table 2: The "%" column shows the % of intrinsic and extrinsic hallucinations annotated by Maynez et al. (2020). We analyzed the % of hallucinations on entities and numbers/quantities, and show the % out of all 500 summaries in the right two columns.

Contrast Candidate Generation
From a model-generated summary, we first identify any potentially hallucinated entities or quantities by checking whether entities with similar surface forms appear in the source document. We use a neural Named Entity Recognition (NER) system from the Stanza NLP toolkit (Qi et al., 2020), trained on the OntoNotes corpus (Weischedel et al., 2013), to extract named entities of different semantic types from the source document and summary. Each named entity present in the summary is replaced with a different entity present in the document with the same NER label. This gives us different variants of the original summary with the same level of fluency, though not necessarily the same faithfulness.
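The replacement step above can be sketched as follows. This is a minimal illustration, not the paper's code: the entity lists are assumed to be precomputed (surface form, NER label) pairs, whereas the paper extracts them with the Stanza NER pipeline.

```python
def generate_candidates(summary, summary_entities, source_entities):
    """Replace potentially hallucinated summary entities with same-type
    entities from the source document, yielding contrast candidates.

    Entity lists are (surface_form, ner_label) pairs, assumed precomputed
    (the paper obtains them from Stanza NER) so the sketch stays
    dependency-free.
    """
    source_surfaces = {surface.lower() for surface, _ in source_entities}
    candidates = []
    for surface, label in summary_entities:
        # An entity whose surface form appears in the source is not
        # treated as a hallucination, so it is left untouched.
        if surface.lower() in source_surfaces:
            continue
        for src_surface, src_label in source_entities:
            # Swap in every source entity that shares the NER label.
            if src_label == label:
                candidates.append(summary.replace(surface, src_surface))
    return candidates
```

On the Table 1 example, the hallucinated DATE "2007" would be swapped with each DATE found in the source, producing one candidate per compatible source entity.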

Contrast Candidate Selection
For the candidate selection step, we want to identify the best candidate among the variants generated in the previous step as the final output summary.
As the contrast candidates vary in no more than a few tokens from the original summary, it requires a model with more delicate local decision boundaries (Gardner et al., 2020) to select the correct candidate. For example, we observe that MNLI models (Williams et al., 2018) fail to produce satisfactory results.
To create training data for that purpose, we sample examples from the XSum training set where all entities in the ground truth summary appear in the source document. We then follow the same procedure as in the generation step, and produce unfaithful variants of the ground truth summary by replacing entities with others that have the same semantic type but a different surface form in the source text. With the ground truth and synthetic negative summaries, we train a text classifier with a discriminative objective to score and rank the variants of the summaries.
We use BART (Lewis et al., 2020) plus a linear layer as our classification model. We adopt a learning objective similar to contrastive learning (Khosla et al., 2020). For each pair of positive and negative summary candidates, we use a cross-entropy loss L_XE for the correctness of the label predictions. We add a margin ranking loss term L_RANK to encourage the model to assign a higher probability to the positive than to the negative candidate. The margin γ is a tunable hyperparameter in training.
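A plausible formalization of this combined objective (the score function s and the candidate notation y+/y- are our own shorthand, not taken from the paper):

```latex
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{XE}} \;+\; \mathcal{L}_{\mathrm{RANK}},
\qquad
\mathcal{L}_{\mathrm{RANK}} \;=\; \max\!\bigl(0,\; \gamma \,-\, s(x, y^{+}) \,+\, s(x, y^{-})\bigr)
```

where s(x, y) denotes the classifier's score for candidate summary y given source x, y+ is the faithful (ground truth) candidate, y- is the synthetic negative, and γ is the tunable margin.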
During test time, we use the trained model to score the generated contrast candidate summaries, as well as the original version generated by the summarization model. We take the candidate with the highest score as the final summary. Our model decides to keep the original summary in the remaining 48.3% of cases.
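The selection step reduces to an argmax over the candidate pool. The sketch below uses a hypothetical `score_fn` interface standing in for the trained classifier's probability of the FAITHFUL label; it is not the paper's exact code.

```python
def select_summary(original_summary, contrast_candidates, score_fn):
    """Return the highest-scoring summary among the original model output
    and its contrast candidates. `score_fn` stands in for the trained
    classifier's FAITHFUL probability (hypothetical interface). When the
    original summary scores highest, it is kept unchanged."""
    pool = [original_summary] + list(contrast_candidates)
    return max(pool, key=score_fn)
```

Because the original summary is always in the pool, the method never forces a change: a correction happens only when some candidate out-scores the model's own output.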

ROUGE and BERTSCORE Evaluation
We first verify that our method does not hurt the fluency and salience of the generated summaries, for which we assume ROUGE (Lin, 2004) and BERTSCORE are suitable metrics. We report the results in Table 3. We observe that, although both the baseline and our method do well on ROUGE and BERTSCORE, our method trails slightly on both metrics. This is due to the existence of extrinsic hallucinations in the ground truth summaries: the baseline model manages to generate part of these hallucinations and is incorrectly rewarded for doing so.

Faithfulness Evaluation
To test whether our correction method improves the faithfulness of the summaries, we evaluate the summaries with FEQA (Durmus et al., 2020), a QA-based metric for summary faithfulness. Given a summary, FEQA automatically generates questions on noun phrase and named entity spans in the summary, and uses a pretrained QA model to verify whether the answer derived from the source document exact-matches the span in the summary.
We run FEQA and compute the macro-averaged percentage of questions answered correctly for each of the 1,510 summaries that our system made corrections to, and report the results in Table 3. The results suggest that the corrected summaries present statistically significant improvements over the original ones (p < 0.001, with a two-tailed, paired t-test). Table 4 shows the human evaluation results on a randomly sampled subset of 95 changed summaries. Two expert annotators assign each summary to one of three faithfulness categories and adjudicate the decisions. Additional annotations from a third expert are then used to calculate the inter-annotator agreement. As the results show, our model is able to improve the faithfulness of the summaries, but at the cost of incurring intrinsic hallucinations on mistakes, which we discuss in more detail in Section 4.2.

Our system achieves a consistently high level of precision across models, including pretrained ones such as BERTS2S (Rothe et al., 2020). The system achieves high relative recall with respect to the % of entity and quantity hallucinations among all hallucinations. As our method only targets entities and quantities, the overall recall varies with the typical type of hallucinations each summarization system makes. We also observe that while our method achieves high recall on models with lower ROUGE and BERTSCORE, the recall drops on pretrained models such as BERTS2S. This is potentially due to the decreased percentage of entity/quantity hallucinations in summaries generated by models with pretraining.
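The paired t-test used above can be sketched directly from its definition. This is an illustration of the significance test, not the paper's evaluation code; the per-summary scores are assumed inputs (e.g. FEQA question-answering accuracy before and after correction).

```python
import math

def paired_t_statistic(before, after):
    """Two-tailed paired t-test statistic over per-summary metric scores
    for the same summaries before and after correction. Returns (t, df).
    t = mean(d) / (std(d) / sqrt(n)), where d are per-summary score
    differences and std uses the n-1 (sample) denominator."""
    assert len(before) == len(after) and len(before) > 1
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n), n - 1
```

The resulting t value is then compared against the t distribution with n - 1 degrees of freedom to obtain the two-tailed p-value.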

Intrinsic vs. Extrinsic Hallucinations Trade-off
As our method detects and corrects extrinsically hallucinated entities, any entity replaced incorrectly naturally introduces an intrinsic hallucination into the changed summary, as indicated by the results in Table 4. To speculate on why these mistakes happen, we analyzed the typical mistakes made by the model and list a few representative examples in Table 5. For example, our method cannot find the correct replacement for a hallucinated entity when no such entity exists in the source text. We observe that models with pretraining, such as BERTS2S (Rothe et al., 2020) and BART, suffer from this issue the most, as they tend to be affected by artifacts/priors from the pretraining process.

Entity Faithfulness vs. Summary Faithfulness
From the observation that models often hallucinate entities with no correct replacement in the source, we suspect that solving entity faithfulness alone does not guarantee the faithfulness of the summary. In the last example from Table 5, the BERTS2S system correctly identifies that three fugitives are involved in the event described by the source text, even though the number "three" is never explicitly mentioned in the source in any surface form. Furthermore, statistics provided by Maynez et al. (2020) show that abstractive summarization models often produce factual statements, i.e. statements verifiable in the real world independent of the source text. Such findings imply that identifying hallucinations often requires more complex objectives such as commonsense reasoning and knowledge retrieval. The solution we propose here, which focuses only on entities and quantities, is likely insufficient to solve the entire problem.

Related Work
There has been growing interest in quantitatively measuring the faithfulness of text generation models. The most widely adopted evaluation metrics for text generation, such as ROUGE (Lin, 2004) and BERTScore, correlate poorly with human-perceived faithfulness of the generated text (Kryscinski et al., 2019; Durmus et al., 2020). Recent studies explore categorical, content-based analysis for measuring the faithfulness of summaries (Goyal and Durrett, 2020). Narayan et al. (2018b); Durmus et al. (2020) propose to use question answering to test the consistency of summary content against the information presented in the source text.
There have been efforts to study pre- or post-processing methods to improve the faithfulness of generated summaries. Falke et al. (2019) attempt to use textual entailment models to re-rank summary candidates generated from beam search or from different neural systems. As Maynez et al. (2020) highlight the existence of hallucinations in training data, truncating potentially unfaithful gold summaries during training is an effective strategy (Kang and Hashimoto, 2020; Filippova, 2020). Kryscinski et al. (2020) take a similar approach to this work to identify hallucinations in summaries. A concurrent study (Cao et al., 2020) uses strategies similar to ours on a dataset with a very small fraction of hallucinations present. Our study instead focuses on the more challenging setting (Goyal and Durrett, 2021) where a large part of the training data suffers from extrinsic and intrinsic hallucinations, and provides cross-system analysis of both hallucination categories.

Conclusion
We study contrast candidate generation and selection as a method to apply post-hoc fixes to extrinsically hallucinated entities and quantities in summaries, under the setting where the summarization dataset suffers from intrinsic and extrinsic hallucinations. We conduct our experiments on the XSum dataset and show that our method is able to correct extrinsic hallucinations, but incurs a small fraction of intrinsic hallucinations on mistakes. We also provide detailed analysis and discussion of the capabilities and limitations of our method. We hope our findings will provide insights for future work in this direction.

A Candidate Selection Model
For our contrast candidate selection model, we use a pretrained BART-base model. We add a linear layer over the max-pooled embedding, and the classification model outputs one of the labels ["FAITHFUL", "HALLUCINATED"]. For all our experiments, we use the following set of hyperparameters: learning rate = 1e-5, margin γ = 0, number of training epochs = 3.

C Estimating Confidence Interval for Human Evaluation
We use bootstrapping to estimate the confidence interval for the expert annotation presented in Table 4. For each faithfulness category on the two systems, we regard the adjudicated annotation as ground truth, and label each individual instance as a true positive (TP), false negative (FN), true negative (TN), or false positive (FP) according to the annotations from the third expert. We re-sample the 95 instances with replacement 1,000 times. We estimate the adjusted mean and 95% confidence interval from the mean and standard deviation of the sampled distribution of (TP + FN).
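The bootstrap procedure above can be sketched as follows. This is a minimal illustration under simplifying assumptions (per-instance 0/1 labels, a normal-approximation interval from the resampled mean and standard deviation), not the paper's exact code.

```python
import random
import statistics

def bootstrap_mean_ci(values, n_resamples=1000, seed=0):
    """Bootstrap estimate of a statistic's mean and 95% confidence
    interval: resample the instances with replacement n_resamples times,
    recompute the statistic (here, the mean of per-instance 0/1 labels),
    then report mean +/- 1.96 * std of the resampled distribution."""
    rng = random.Random(seed)
    resampled = []
    for _ in range(n_resamples):
        sample = [rng.choice(values) for _ in values]
        resampled.append(sum(sample) / len(sample))
    mu = statistics.mean(resampled)
    sd = statistics.stdev(resampled)
    return mu, (mu - 1.96 * sd, mu + 1.96 * sd)
```

With the 95 annotated instances encoded as indicator labels (e.g. membership in TP + FN for a given category), the returned interval corresponds to the adjusted mean and 95% confidence interval described above.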