FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization

Neural abstractive summarization models are prone to generate content inconsistent with the source document, i.e. unfaithful. Existing automatic metrics do not capture such mistakes effectively. We tackle the problem of evaluating faithfulness of a generated summary given its source document. We first collected human annotations of faithfulness for outputs from numerous models on two datasets. We find that current models exhibit a trade-off between abstractiveness and faithfulness: outputs with less word overlap with the source document are more likely to be unfaithful. Next, we propose an automatic question answering (QA) based metric for faithfulness, FEQA, which leverages recent advances in reading comprehension. Given question-answer pairs generated from the summary, a QA model extracts answers from the document; non-matched answers indicate unfaithful information in the summary. Among metrics based on word overlap, embedding similarity, and learned language understanding models, our QA-based metric has significantly higher correlation with human faithfulness scores, especially on highly abstractive summaries.


Introduction
Abstractive summarization models must aggregate salient content from the source document(s) and remain faithful, i.e. factually consistent with information in the source documents. Neural abstractive models are effective at identifying salient content and producing fluent summaries (See et al., 2017; Chen and Bansal, 2018; Gehrmann et al., 2018). However, the generated summary may not always contain faithful information, which is vital for real-world applications. (Most of this work was done while the authors were at Amazon Web Services AI.)
Source. The world's oldest person has died a few weeks after celebrating her 117th birthday. Born on March 5, 1898, the great-grandmother had lived through two world wars, the invention of the television and the first successful powered aeroplane flight by the Wright brothers...

Output sentence. The world's oldest person has died on March 5, 1898.

Table 1 shows an example of unfaithful generation. Recent studies have shown that around 30% of generated summaries contain unfaithful information (Cao et al., 2018; Falke et al., 2019a; Kryściński et al., 2019), especially when a sentence combines content from multiple source sentences (Lebanoff et al., 2019). In this paper, we address the problem of evaluating the faithfulness of generated summaries given their source documents. Our key insight is that current models are limited by a trade-off between abstractiveness and faithfulness (Section 2). On a wide range of systems and two datasets with varying levels of abstractiveness (CNN/DM and XSum), we show that the number of unfaithful sentences (annotated by humans) increases as the summary becomes more abstractive (i.e. has less overlap with the source document). Next, we investigate a diverse set of existing automatic evaluation metrics such as ROUGE, BERTScore (Zhang et al., 2019a), and learned entailment models. We find that their correlations with human scores of faithfulness drop significantly on highly abstractive summaries, where deeper text understanding beyond surface similarity is needed.
Recently, question answering (QA) based automatic metrics have been proposed for evaluating content selection in summarization (Eyal et al., 2019; Scialom et al., 2019; Chen et al., 2018). Specifically, cloze-style QA is used to evaluate whether important information in the source is recovered from the summary. Inspired by prior work, we use automatically generated QA pairs to represent information in the summary and validate it against the source. Concretely, we generate a set of "ground-truth" QA pairs from the summary, using a learned model that converts a declarative sentence and an answer span to a question (Section 3). Then, off-the-shelf reading comprehension models are evaluated on this set by extracting answer spans from the source documents. High accuracy means that the summary and the source document tend to produce the same answers, thus they are factually consistent with respect to the questions. Compared to prior approaches using cloze tests, our question generation approach enables evaluation with a broader range of QA models and answer types (e.g. extractive and generative), thus maximally taking advantage of progress in QA.
Among automatic metrics based on n-gram overlap, word embeddings, and language understanding models (relation extraction and entailment), FEQA has significantly higher correlation with human scores of faithfulness and is the only metric that correlates with human scores on highly abstractive summaries from XSum.

The Abstractiveness-Faithfulness Tradeoff
While extractive summarizers are largely faithful (since they copy sentences from the source document), current abstractive models struggle to produce faithful summaries without copying. Similar to Lebanoff et al. (2019), we observe that factual errors occur more frequently as models generate more abstractive summary sentences, i.e. less overlap with the source document. In this section, we analyze generated summaries along two dimensions: abstractiveness and faithfulness. Specifically, we aim to answer the following questions: (1) How to quantify abstractiveness of a summary? (2) Is abstractiveness encouraged more by the data or the model? (3) How does being abstractive affect faithfulness?

Characterizing Abstractiveness of a Summary
Abstractive summarization involves rephrasing important content into brief statements, ranging from minor editing of a source sentence to condensing multiple sentences in new words. Given a source document and a summary, we want to measure the level of abstractiveness of the summary. Prior work measures abstractiveness by overlapped text spans between the summary and the document (Grusky et al., 2018;Zhang et al., 2018), or indirectly by the effectiveness of extractive baselines such as LEAD-3 (Nallapati et al., 2016a). While metrics such as extractive fragment coverage and density (Grusky et al., 2018) provide a continuous measure of the level of abstractiveness, we define a more fine-grained categorization of abstractiveness by analyzing how each sentence in the summary is formed.
A more abstractive summary sentence aggregates content over a larger chunk of source text; consequently it must copy fewer words to maintain brevity. Therefore, we define the following abstractiveness types based on the amount of copying, e.g. copying a source sentence, one or more partial fragments from the source sentence, and individual words.
1. Sentence extraction: the summary sentence is exactly the same as one of the source sentences.
2. Span extraction: the summary sentence is a substring of one of the source sentences, e.g. "the plane was coming back from the NCAA final" is a span extracted from "the plane was coming back from the NCAA final, according to spokesman John Twork".
3. Word extraction: the summary sentence is formed by a subset of the tokens in a source sentence, e.g. "Capybara Joejoe has almost 60,000 followers" is a result of deleting words in "Capybara Joejoe who lives in Las Vegas has almost 60,000 followers on Instagram".
4. Perfect fusion-k: the summary sentence is constructed by piecing together substrings from k (k > 1) source sentences in their original order, e.g. "Capybara Joejoe has almost 60,000 followers" is a perfect fusion of the sentences "Capybara Joejoe lives in Las Vegas." and "He has almost 60,000 followers on Instagram."

To quantify the abstractiveness of a set of summaries, we label each sentence with the first qualifying type in the order above, if it fits one of these categories.
We then define the score of each type as the percentage of sentences labeled by that category. The types are ordered by increasing levels of abstractiveness. For example, a summary with higher fusion scores and lower extraction scores is considered more abstractive. In addition, we compute the percentage of novel n-grams that do not appear in the source document as another metric for abstractiveness.
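The abstractiveness typing above can be sketched programmatically. The following is a simplified illustration, not the authors' implementation: sentences are whitespace-tokenized, fusion is only checked for k = 2 (a prefix span from one sentence followed by a suffix span from another), and the document order of the fused sentences is ignored.

```python
# Simplified sketch of the abstractiveness categorization and the
# novel n-gram metric. Assumes whitespace tokenization.

def is_subsequence(sub, seq):
    """True if all tokens of `sub` appear in `seq`, in order."""
    it = iter(seq)
    return all(tok in it for tok in sub)

def abstractiveness_type(summary_sent, source_sents):
    s = summary_sent.split()
    if any(summary_sent == src for src in source_sents):
        return "sentence_extraction"
    if any(summary_sent in src for src in source_sents):
        return "span_extraction"
    if any(is_subsequence(s, src.split()) for src in source_sents):
        return "word_extraction"
    # perfect fusion, k = 2 only: a prefix span from one source sentence
    # followed by a suffix span from a different source sentence
    for i, a in enumerate(source_sents):
        for j, b in enumerate(source_sents):
            if i == j:
                continue
            for cut in range(1, len(s)):
                left, right = " ".join(s[:cut]), " ".join(s[cut:])
                if left in a and right in b:
                    return "perfect_fusion_2"
    return "abstractive"

def novel_ngram_pct(summary_sent, source_text, n=3):
    """Percentage of summary n-grams absent from the source."""
    def ngrams(tokens, n):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    summ = ngrams(summary_sent.split(), n)
    src = ngrams(source_text.split(), n)
    return 100.0 * len(summ - src) / max(len(summ), 1)
```

Per the labeling rule above, the types are tried from least to most abstractive, and a sentence receives the first one that applies.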

Is abstractiveness from the model or the data?
Equipped with the metrics for abstractiveness above, we want to further understand how abstractive the generated summaries are, and whether the amount of abstractiveness is a result of the training data or the model. Therefore, we compute abstractiveness scores for both the reference summaries and summaries generated from a diverse set of models on two datasets.

Models. Most neural abstractive summarization models are based on sequence-to-sequence models. They differ in how summarization-specific operations such as copying/extraction are instantiated. We consider 5 prominent models. PGC (See et al., 2017) uses the copy mechanism during decoding to allow extraction. FASTRL (Chen and Bansal, 2018) and BOTTOMUP (Gehrmann et al., 2018) decouple extraction and abstractive generation by learning to select sentences and words, respectively, in a first step; these models have been shown to generate more abstractive summaries compared to PGC. TCONV (Narayan et al., 2018) was initially designed for XSum, thus it does not include any explicit copying/extraction components and focuses on long text representation using convolutional neural networks. BERTSUM (Liu and Lapata, 2019) consists of a BERT-based encoder and a 6-layer Transformer decoder. It incorporates extraction implicitly by first fine-tuning the encoder on the extractive summarization task.

Results. Our goal is to understand the level of abstractiveness of summaries generated by different models, and the influence of the training data on abstractiveness. Therefore, we analyzed summaries generated by the above models on CNN/DM and XSum. We computed the metrics described in Section 2.1 for both the generated summaries and the reference summaries on the test sets. The results are shown in Table 3.
First, CNN/DM is more extractive than XSum. Extraction scores of the reference summaries in CNN/DM show that almost half of the sentences are formed by deleting words from one of the source sentences. This shows that sentence compression (Knight and Marcu, 2002) is the main technique used for this dataset. In contrast, none of the summary sentences in XSum are formed by copying from a single source sentence. They are generated mostly by paraphrasing the input content, as indicated by the large fraction of novel n-grams.

Table 3: Abstractiveness measures of the models on CNN/DM and XSum datasets. The numbers for Extraction and Perfect fusion indicate the % of sentences generated with these strategies. Numbers for novel n-grams indicate the % of n-grams that are present in the output sentence but not in the source.
Second, training data has a larger influence on the abstractiveness of model outputs. Similar to Zhang et al. (2018), we find that models trained on CNN/DM are near-extractive. However, the same models trained on XSum are significantly more abstractive. In fact, none of the models produced any sentence that copies words/phrases from a single source sentence, which is consistent with characteristics of the reference summaries in XSum. The content is more often rephrased in novel words/phrases. However, on both datasets, current models struggle to achieve the same level of abstractiveness as the reference summaries, indicating that additional inductive bias is needed to condense multiple sentences by rephrasing.
Third, different models have different ways of doing extraction. When trained on CNN/DM, PGC generates the majority of sentences by copying complete source sentences, whereas FASTRL, BOTTOMUP and BERTSUM do simple compression by deletion more often. In addition, BOTTOMUP does more fusion compared to PGC, FASTRL and BERTSUM.

Annotating Summary Faithfulness
To understand faithfulness of current systems and its relation to abstractiveness, we crowd-sourced human annotations on the output of each modeldataset pair described in Section 2.2. Since a nearextractive sentence is very likely to be grammatical and faithful, we focus on more abstractive cases by excluding output sentences that are either an exact copy or a substring of one of the source sentences.
A key challenge to reliable human annotation is that the inter-annotator agreement on faithfulness is relatively low (Lebanoff et al., 2019). Our pilot study shows that workers often do not agree on incoherent sentences, e.g. whether "Chelsea beat Chelsea 5-3 in the Premier League on Saturday." is faithful or not. To standardize the annotation process, we design hierarchical questions to distinguish among failed generations that render a sentence meaningless, low-level grammatical errors that hardly affect semantic understanding, and faithfulness errors that convey incorrect (yet meaningful) information. (We make our data and code available for reproducibility at: https://github.com/esdurmus/summary-faithfulness.) Figure 1 shows the decision tree of our human annotation steps. We first evaluate the grammaticality of generated sentences (independently from the source document). We show annotators a summary sentence and ask them to choose whether the given sentence is meaningful or nonsensical, to determine whether it is structurally and semantically sound. If the annotator can make sense of the sentence, we then ask whether it is grammatical or has minor grammaticality problems which a person can easily correct.
Next, for sentences labeled as meaningful in the first step, we ask workers whether they are faithful to the provided source document. In case the worker labels a sentence as unfaithful, we conduct a simple error analysis by asking them to indicate if the sentence contains information that is absent from or conflicting with the source document, which corresponds to hallucination and contradiction errors, respectively. More details about the annotation schema and guidelines are included in the Appendix C. Next, we describe our human evaluation results.
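The hierarchical scheme can be summarized as a small decision function. The labels and argument names below are illustrative, not the exact crowd-sourcing interface:

```python
# Sketch of the annotation decision tree: grammaticality is judged first
# (independently of the source); faithfulness is only annotated for
# meaningful sentences, and unfaithful ones are subtyped as containing
# conflicting (contradiction) or absent (hallucination) information.

def annotate(meaningful, grammatical=True, faithful=None, conflicts=None):
    """Map one annotator's answers to a list of outcome labels."""
    if not meaningful:
        return ["nonsensical"]  # excluded from faithfulness annotation
    labels = ["grammatical" if grammatical else "minor grammatical errors"]
    if faithful is None:        # faithfulness not (yet) annotated
        return labels
    if faithful:
        labels.append("faithful")
    else:
        labels.append("contradiction" if conflicts else "hallucination")
    return labels
```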

Human Annotation Results
For each dataset-model pair described in Section 2.2, we randomly sampled 1000 sentence-source pairs, eliminating output sentences that are either an exact copy or a substring of a source sentence. We collected grammaticality annotations for these sentences from 5 annotators. We consider a sentence meaningful if at least 4 out of 5 annotators label it as meaningful in the first stage. We then randomly sampled 200 meaningful sentences to collect annotations for faithfulness. Table 4 shows the results of the grammaticality and faithfulness human evaluations.

S1: A man has died after his car left the road and hit a tree in Surrey, police said.
S2: Chelsea and Manchester City are interested in signing Chelsea.
Source for S1: The man, in his 20s, was the only person in the BMW convertible when the accident happened on the Aldershot road in Guildford. He was traveling east when his car left the road. Police closed the road while investigators were at the scene.

Figure 1: The decision diagram of our human annotation process. Decision nodes are rectangular and outcome nodes are circular. We show the annotation path of two summary sentences, S1 (green arrows) and S2 (red arrows). S2 is annotated as nonsensical and thus is not considered for faithfulness. S1 is annotated as unfaithful due to hallucinated content.

Table 4: Grammaticality and faithfulness results of human annotations. Score is computed by taking the percentage of annotators that selected "meaningful" and "faithful" for the grammaticality and faithfulness annotation tasks, respectively, and then averaging these values across all the examples for the given annotation task. Agreement is computed by taking the percentage of the workers that annotate the majority class for the given example. Abstractiveness is measured by the percentage of novel trigrams in a given sentence.
Grammaticality. Overall, outputs from all models score high on grammaticality with high inter-annotator agreement. However, on more abstractive summaries (i.e. when trained on XSum), the grammaticality scores drop significantly. One exception is BERTSUM, which maintains good performance on XSum and achieves the highest grammaticality score on both datasets.

Faithfulness. Near-extractive summaries generated by models trained on CNN/DM have significantly higher faithfulness scores than highly abstractive summaries from models trained on XSum. We find that PGC and TCONV have faithfulness errors in more than half of the sentences they generate when trained on XSum. Although BERTSUM generates fewer unfaithful sentences, it still suffers a performance drop on XSum. Interestingly, human agreement on faithfulness is also lower for abstractive summaries from XSum. This suggests that faithfulness errors are harder to catch, even for humans, in more abstractive settings. We further observe that conflicting information is more common among models trained on CNN/DM, while hallucination is more common among models trained on XSum. Table 5 shows examples of meaningful but unfaithful sentences.

FEQA: Faithfulness Evaluation with Question Answering
Our analysis above shows that the number of unfaithful sentences increases significantly as more abstractive summaries are generated. Thus the key challenge to faithfulness evaluation is to verify highly abstractive sentences against the source document, where surface similarity matching would fail. If we have a good semantic representation of the sentence abstracting away its surface form (e.g. a list of facts about who did what to whom), we can simply compare the sentence representation to the document representation (e.g. check whether the fact list from the summary is a subset of the list from the document). Ideally, the representation should be domain-general and interpretable for easy error analysis.

Summary sentence: The home was built for inspection.
Masked summary sentences: The home was built for [MASK]. / [MASK] was built for inspection.

Figure 2: Overview of FEQA. Given a summary sentence and its corresponding source document, we first mask important text spans (e.g. noun phrases, entities) in the summary. Then, we consider each span as the "gold" answer and generate its corresponding question using a learned model. Lastly, a QA model finds answers to these questions in the documents; its performance (e.g. F1 score) against the "gold" answers from the summary is taken as the faithfulness score.
Motivated by the fast progress in reading comprehension (Chen, 2018; Gao et al., 2018), we propose to use QA pairs as a generic meaning representation of sentences for faithfulness evaluation. Given a summary sentence, we produce a list of questions asking about key information in the sentence and their corresponding answers. To verify this information against the source, we use a QA model to predict answers from the document. The questions and the QA model thus extract comparable information from two pieces of text. More matched answers from the document imply a more faithful summary, since the information addressing these questions is consistent between the summary and the source document. Figure 2 shows the workflow of FEQA.
Question generation. Prior work (Eyal et al., 2019; Scialom et al., 2019) uses cloze tests as questions by masking entities. To go beyond cloze-style QA and leverage more recent extractive (Rajpurkar et al., 2016) or even generative (Alec et al., 2019) QA models, we generate natural language questions from the summary sentence automatically. Specifically, we mask important text spans in a sentence, including noun phrases extracted by a constituency parser (Kitaev and Klein, 2018) and named entities extracted by the Stanford CoreNLP NER model (Finkel et al., 2005; Manning et al., 2014). We consider each span as the gold answer and generate its corresponding question by fine-tuning a pretrained BART language model (Lewis et al., 2019). To train the question generator, we adapt the QA2D dataset of Demszky et al. (2018). The input is a declarative sentence with a masked answer and the output is a question. A training example might look like:

Input: Sally was born in <m> 1958 </m>
Output: When was Sally born?
Since the transformation from declarative sentences to questions is almost rule-based, without much paraphrasing, we expect the model to generalize to various domains.
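The masked-input format used to fine-tune the question generator can be illustrated with simple string manipulation. `make_qg_input` and the character-offset spans below are hypothetical helpers for illustration, not the authors' code:

```python
# Build question-generation training inputs by wrapping a gold answer
# span (character offsets) in <m> ... </m> markers, following the
# "Sally was born in <m> 1958 </m>" example in the text.

def make_qg_input(sentence, answer_start, answer_end):
    """Wrap the gold answer span in <m> ... </m> markers."""
    return (sentence[:answer_start]
            + "<m> " + sentence[answer_start:answer_end] + " </m>"
            + sentence[answer_end:])

def qg_examples(sentence, spans):
    """One masked input per candidate answer span (e.g. NPs, entities)."""
    return [make_qg_input(sentence, start, end) for start, end in spans]
```

In FEQA, the candidate spans would come from a constituency parser and an NER model rather than being given by hand.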
Answer verification. Given the QA pairs generated from a summary sentence, we run off-the-shelf QA models to get answers to these questions from the source document. We then measure the average F1 score against the "gold" answers from the summary, which is our faithfulness score for the given sentence. This step does not place any constraint on the QA model. We experiment with the pretrained BERT-base model (Devlin et al., 2019) fine-tuned on SQuAD-1.1 (Rajpurkar et al., 2016) and SQuAD-2.0 (Rajpurkar et al., 2018). Note that in the case of SQuAD-2.0, the model may hypothesize that a question is unanswerable; this case is equivalent to getting an answer incorrect (i.e. unfaithful).
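The verification step can be sketched as follows, assuming a SQuAD-style token-level F1. This is a minimal sketch: the official SQuAD evaluation additionally lowercases and strips articles and punctuation, which is omitted here.

```python
import collections

# Token-level F1 between the QA model's answer extracted from the
# document and the "gold" answer taken from the summary; the faithfulness
# score of a sentence is the average F1 over its generated QA pairs.

def token_f1(predicted, gold):
    pred_toks, gold_toks = predicted.split(), gold.split()
    if not pred_toks or not gold_toks:
        return float(pred_toks == gold_toks)
    common = collections.Counter(pred_toks) & collections.Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def feqa_score(doc_answers, gold_answers):
    """Average F1 over all QA pairs; unanswerable (None) counts as 0."""
    scores = [token_f1(a, g) if a is not None else 0.0
              for a, g in zip(doc_answers, gold_answers)]
    return sum(scores) / len(scores)
```

Partial token overlap (e.g. "the President Donald Trump" vs. "Donald Trump") receives partial credit, which relates to the F1 limitation discussed in the error analysis below.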

Experiments
We aim to understand to what extent the proposed QA-based metric and existing metrics capture faithfulness of a summary. Given pairs of documents and summary sentences without reference summaries, we measure correlations between human-annotated faithfulness scores (Section 2.3) and scores computed using each metric described below.

Automated Metrics for Faithfulness
Word overlap-based metrics. A straightforward metric for faithfulness is the word overlap between the summary sentence and the document. We compute ROUGE (R) and BLEU (B) between the output sentence and each of the source sentences (i.e. taking the source sentence as the reference). We then take the average and the maximum score across all the source sentences. Since, according to our analysis, the average score consistently has higher correlation, we report only the correlation for the average.
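The per-source-sentence aggregation can be sketched as follows, with unigram F1 standing in for ROUGE/BLEU purely for brevity (an illustrative assumption, not the metric actually used):

```python
# Score the output sentence against every source sentence, then
# aggregate by average or maximum across source sentences.

def unigram_f1(candidate, reference):
    cand, ref = set(candidate.split()), set(reference.split())
    if not cand or not ref:
        return 0.0
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    p, r = overlap / len(cand), overlap / len(ref)
    return 2 * p * r / (p + r)

def overlap_scores(output_sent, source_sents):
    scores = [unigram_f1(output_sent, s) for s in source_sents]
    return sum(scores) / len(scores), max(scores)  # (average, maximum)
```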
Embedding-based metrics. Word embeddings extend word overlap-based metrics beyond exact match. Recently, BERTScore (Zhang et al., 2019b) was proposed to compute the similarity between two sentences using contextual word embeddings from BERT. It has higher correlation with human judgements on image captioning and machine translation than word overlap-based metrics. We compute BERTScore (BERTSc) between each source sentence and the summary sentence. To get the final score, we experiment with both the average and the maximum of the scores computed from each source sentence and the summary sentence. We report results using the maximum score since it performs better.
Model-based metrics. In addition to QA, recent work has used relation extraction and textual entailment models for faithfulness evaluation.

Results
Metric Comparison. We first compute scores for each metric on document and output sentence pairs on both the CNN/DM and XSum datasets (748 and 286 pairs, respectively). We then compute Pearson and Spearman correlation coefficients between the scores given by each metric and the human-annotated scores. Table 7 includes correlation coefficients for the examples from CNN/DM and XSum, respectively. We observe that for both CNN/DM and XSum, the QA-based evaluation score has a higher correlation with faithfulness than other metrics. Although word overlap-based metrics correlate with faithfulness in more extractive settings (i.e. on CNN/DM), these metrics have no correlation with faithfulness in more abstractive settings (i.e. on XSum). We further notice that all the metrics have significantly lower correlation with human scores on XSum, suggesting that evaluating faithfulness is more difficult in highly abstractive settings; deeper understanding of the source and the summary sentence is necessary here.
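The correlation computation can be sketched in pure Python (in practice scipy.stats.pearsonr and spearmanr would be used; the rank function below does not average ties, a simplification):

```python
# Pearson correlation between metric scores and human faithfulness
# scores; Spearman is Pearson computed on the ranks of the values.

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))
```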
Consistent with the findings of Falke et al. (2019b), the entailment metric does not have a significant correlation with faithfulness in most cases. These models fail to distinguish entailed (faithful) and non-entailed (unfaithful) summary sentences when both overlap largely with the source document, because models trained on current entailment datasets may rely on simple heuristics such as lexical overlap (McCoy et al., 2019). Similarly, BERTScore tends to give higher scores when there are overlapping concepts between the sentences even though the content is not the same. See Table 6 for examples.

Table 6.
Source Sentence: Health Inspectorate Wales said Wrexham Maelor Hospital staff were under "considerable pressure" for long periods as ambulances waited outside.
Output Sentence: A hospital ward in Wrexham has been rated "inadequate" by inspectors after inspectors found patients at risk of harm.
Score: 72.83%

Source Sentence: The Black Poplar is one of the rarest native trees in the UK, with only 2,500 thought to be left.
Output Sentence: Northern Ireland's first trees are among those recognised in the Welsh Architecture Trust's list of the year's best trees.
Metric: BERTScore, Score: 83.06%
Content selection and faithfulness. Current evaluation metrics for summarization produce a single measure of the overall quality of the summary. Typically, the output summary is compared against the reference summary in terms of n-gram overlap. These metrics mainly evaluate content selection, i.e. whether the content of the output is similar to the content of the reference. In contrast, to evaluate faithfulness, we compare the output summary against the source document. One natural question that follows is whether high content matching is sufficient for faithfulness. We compute the correlation coefficients between human-annotated faithfulness scores and ROUGE scores computed from the reference and the output sentence. As shown in Table 8, while there is correlation between ROUGE scores of content selection and faithfulness on CNN/DM, the correlation is significantly lower than for ROUGE scores of faithfulness (i.e. computed between the source and the output sentence). For XSum, there is no significant correlation between the content selection metrics and faithfulness. We provide unfaithful examples with high content selection scores in Appendix D.3. This suggests that content selection and faithfulness should be measured separately, as opposed to using a unified score.

Table 8: Pearson (P) and Spearman (S) correlation between human-annotated faithfulness scores and ROUGE scores of content selection (computed between the reference and the output sentence). High content selection scores (typical ROUGE scores for summarization) do not necessarily imply faithfulness of the summary.
Analysis and limitations of QA-based evaluation. Table 9 shows examples of a faithful and an unfaithful output sentence and the corresponding QA pairs. Note that the QA system is able to capture common errors such as conflicting information in the output sentence. To measure the reliability of FEQA, we further perform a manual error analysis using 100 randomly sampled QA pairs. We observe that around 94% of generated questions are mostly grammatical and correct given the mask. For 78% of the questions, the QA system has the correct behaviour: it answers the question correctly if the sentence is faithful to the article; otherwise it produces "unanswerable" or an incorrect answer. The majority of the errors of the QA system occur because it either fails to detect unanswerable questions or produces "unanswerable" when there exists an answer (14%). Moreover, when the article is long, the QA system tends to make more mistakes. Especially in more abstractive settings, the F1 score penalizes correct answers when the answer from the article does not exactly match the gold answer (e.g. "Donald Trump" vs. "the President of the United States Donald Trump") (16%).

Table 9: Example detection results from FEQA. OA: Output Answer, SA: Source Answer. The output sentence in the first example is unfaithful, whereas the one in the second example is faithful. Bold text indicates the span that was masked to generate the question.

Related Work
Problems in current neural generation models. Since the beginning of neural text generation, problems with repetition and generic responses have received lots of attention (Sordoni et al., 2015;Holtzman et al., 2019).
Recently, more work has focused on semantic errors in model outputs, such as adequacy in machine translation (Tu et al., 2017), faithfulness in summarization (Cao et al., 2018), and consistency in dialogue (Li et al., 2019). Our analysis of the abstractiveness-faithfulness tradeoff reveals an additional limitation of current models, and suggests that we need new inductive biases on how to summarize beyond copying.
QA as a proxy. Question answering is a broad format that subsumes many tasks (Gardner et al., 2019). To the best of our knowledge, Mani et al. (1999) were the first to use QA to evaluate summaries; our work is the first to apply automated question generation. While we focus on faithfulness, our QA-based metric is applicable to semantic comparison between any two pieces of text.
Automated evaluation for NLG. Automated NLG evaluation is challenging as it often requires deep understanding of the text. Although metrics based on word overlap with the reference text are commonly used, it is widely known that they do not correlate well with human judgments (Novikova et al., 2017). Recently, more work has focused on model-based evaluation using discriminators.

Conclusion
We investigate the faithfulness problem in neural abstractive summarization and propose a QA-based metric for evaluating summary faithfulness. We show that current models suffer from an inherent trade-off between abstractiveness and faithfulness. They are good at copying important source content, but tend to concatenate unrelated spans and hallucinate details when generating more abstractive sentences. A new inductive bias or additional supervision is needed for learning reliable models. While our QA-based metric correlates better with human judgment and is useful for model development, it is limited by the quality of the QA model. The final evaluation should still rely on human annotation or human-in-the-loop methods (Chaganty et al., 2018).

B Summarization Models
The characteristics of each model used in our experiments are detailed below.
Pointer Generator Model with Coverage (PGC) (See et al., 2017) uses the copy mechanism (Vinyals et al., 2015) to allow copying words from the source. The adapted coverage mechanism (Tu et al., 2016) is incorporated to alleviate repetition by keeping track of source words that have been summarized. This copy mechanism is widely adopted by subsequent models.
Fast Abstractive Summarization with Reinforce (FASTRL) (Chen and Bansal, 2018) first uses an extractor agent to select salient sentences from the document, then condenses the extracted sentences using the Pointer-Generator summarizer.
Bottom-up Summarization Model (BOTTOMUP) (Gehrmann et al., 2018) first selects words from the source document that are likely to appear in the summary, then generates using the Pointer-Generator model, where the copying mechanism is constrained to the previously selected words. It improves upon PGC by explicitly learning the selector to avoid copying long text spans.
Topic-aware Convolutional Sequence-to-Sequence model (TCONVS2S) (Narayan et al., 2018) is a convolutional neural network-based model conditioned on the topics of the article. It is shown to be effective in capturing long-range dependencies in the documents.
BERT-based model (BERTSUM) (Liu and Lapata, 2019) is a two-stage fine-tuning approach where the BERT-based encoder is first fine-tuned on the extractive summarization task and then on the abstractive summarization task with the decoder (denoted as BERTSUMEXTABS in the original paper).

C.1 Grammaticality Annotation Guidelines
For grammaticality annotation, we present only the output sentence to the workers and collect annotations from 5 workers per task. Given the output sentence, we provide workers with the following guidelines:

Source: ...Although her due date has not officially been confirmed, the duchess of Cambridge told wellwishers at a charity event last month: I am due mid-April, to the end of April...
Output sentence: The duchess of Cambridge told wellwishers at a charity event last month: "The duke's intention is to be at the commemorations". Carragher posted a picture on his son play in the famous youth tournament.
Category: IC

Source: A body was found by a member of the public on private land near Leighton, about 10 miles (16.09km) away from the centre of Shrewsbury, on Monday. Mr Bebbington's family has been informed, West Mercia Police confirmed.
Output sentence: The death of a man whose body was found in a river in Cumbria has been identified as murder.
Category: H

Source: The incident happened near Dr. Gray's hospital shortly after 10:00. The man was taken to the hospital with what police said were serious but not life-threatening injuries. The A96 was closed in the area for several hours, but it has since reopened.
Output sentence: A man has been taken to hospital after he was hit by a lorry in Dumfries.

Reference: ...University of Nebraska researcher has revealed why stress is bad for you. Limited periods of stress are good, as they release cortisol...
Output sentence: University of Nebraska researcher has revealed why stress is bad for you, stimulating your body to produce an important hormone called cortisol.

Reference: ...Indian air force and Nepalese army medical team launch rescue mission to bring injured people to hospitals in Kathmandu. Forshani Tamang's family carried her for four hours to reach help after she was wounded when their home was destroyed...
Output sentence: Indian air crew and Nepalese army medical team were killed in Nepal's Sindhupalchok quake.

• A judge in Japan has ordered a judge to order a woman who has absconded from Japan to Japan. (generated by PGC for XSum)
• Stoke City moved up to third in the Premier League with victory over Stoke City at Stoke. (generated by TCONV for XSum)
• Johnny Depp's management group is suing his management group over his "lavish lifestyle". (generated by BERTSUM for XSum)

D.2 Examples for meaningful but unfaithful sentences
Table 11 includes examples that are annotated as meaningful but unfaithful. The first three examples are from models trained on CNN/DM, and the last three are from models trained on XSum.
We observe that the majority of sentences with faithfulness errors on the CNN/DM dataset are generated by incorrect concatenation (IC): the models fuse two sentences from the source into a new sentence that is inconsistent with the context of the source. Within this category, however, the models make a wide range of mistakes, such as copying the wrong entity, date, or quote. For XSum, the faithfulness mistakes are mostly hallucinations (H): models tend to hallucinate information (e.g. entities, events, dates) that is not present in the source.

D.3 Examples for sentences with high content overlap with reference that are unfaithful
Although current summarization models are evaluated with respect to the content overlap between the reference and the output, these metrics do not necessarily provide any guarantee of the faithfulness of the output. Table 12 includes examples with content overlap scores similar to those of the faithful examples, but that are unfaithful. Although the output sentences include similar words and refer to similar topics, they contain hallucinations and inaccurate information.
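This phenomenon is easy to reproduce with a unigram-overlap score (a rough stand-in for ROUGE-1). The sentences below are invented for illustration, not drawn from the datasets: a hallucinated output scores nearly as highly as a faithful one because overlap is insensitive to word order and relations.

```python
from collections import Counter

def rouge1_f(candidate, reference):
    """Unigram-overlap F1, a rough stand-in for ROUGE-1."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

# Invented sentences for illustration only.
reference  = "indian medical teams rescue injured people after the quake"
faithful   = "medical teams rescue injured people after the quake"
unfaithful = "indian medical teams were injured in the quake rescue"
# Both candidates share most unigrams with the reference, so both
# receive a high score, although the second contradicts it.
```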

D.4 Limitations of the datasets
Since the CNN/DM and XSum datasets are automatically crawled, the data contains noise. For example, source documents can include phrases such as "click here for the latest news". We further observe that the reference can carry information that is not in the source document, since some of these one-sentence highlights are written using additional world knowledge. Table 13 shows an example where the reference is unfaithful because it includes information that is not in the source (i.e., that Ms. Wood's first name is Leanne and that she is the Plaid Cymru leader).
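Such crawl artifacts can be partially removed with simple pattern matching before training or evaluation. The sketch below is a heuristic, not part of our pipeline: "click here for the latest news" is observed in the data, while the other pattern is an illustrative assumption.

```python
import re

# "click here for the latest news" is observed in the data; the other
# pattern is an illustrative assumption about similar crawl artifacts.
BOILERPLATE_PATTERNS = [
    r"click here for the latest news",
    r"scroll down for video",
]

def strip_boilerplate(document):
    """Heuristically remove crawl artifacts, then collapse leftover
    whitespace."""
    for pattern in BOILERPLATE_PATTERNS:
        document = re.sub(pattern, "", document, flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", document).strip()
```

Note that this only addresses noise in the source side; references that rely on external world knowledge cannot be repaired this way.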