Multi-Fact Correction in Abstractive Text Summarization

Pre-trained neural abstractive summarization systems have dominated extractive strategies on news summarization performance, at least in terms of ROUGE. However, system-generated abstractive summaries often face the pitfall of factual inconsistency: generating incorrect facts with respect to the source text. To address this challenge, we propose Span-Fact, a suite of two factual correction models that leverages knowledge learned from question answering models to make corrections in system-generated summaries via span selection. Our models employ single or multi-masking strategies to either iteratively or auto-regressively replace entities in order to ensure semantic consistency w.r.t. the source text, while retaining the syntactic structure of summaries generated by abstractive summarization models. Experiments show that our models significantly boost the factual consistency of system-generated summaries without sacrificing summary quality in terms of both automatic metrics and human evaluation.


Introduction
Informative text summarization aims to shorten a long piece of text while preserving its main message. Existing systems can be divided into two main types: extractive and abstractive. Extractive strategies directly copy text snippets from the source to form summaries, while abstractive strategies generate summaries containing novel sentences not found in the source. Although extractive strategies are simpler and less expensive, and can generate summaries that are more grammatically and semantically correct, abstractive strategies are becoming increasingly popular thanks to their flexibility, coherency and vocabulary diversity (Zhang et al., 2020a).
* Most of this work was done when the first author was an intern at Microsoft.

CNNDM Source
(CNN) About a quarter of a million Australian homes and businesses have no power after a "once in a decade" storm battered Sydney and nearby areas. About 4,500 people have been isolated by flood waters as "the roads are cut off and we won't be able to reach them for a few days,"...
Bottom-up Summary
a quarter of a million australian homes and businesses have no power after a decade.
Corrected by SpanFact
about a quarter of a million australian homes and businesses have no power after a "once in a decade" storm.

Gigaword Source
all the 12 victims including 8 killed and 4 injured have been identified as senior high school students of the second senior high school of ruzhou city, central china's henan province, local police said friday.
Pointer-Generator Summary
12 killed, 4 injured in central china school shooting.

XSum Source
st clare's catholic primary school in birmingham has met with equality leaders at the city council to discuss a complaint from the pupil's family. the council is supporting the school to ensure its policies are appropriate...
BertAbs Summary
a muslim school has been accused of breaching the equality act by refusing to wear headscarves.
Corrected by SpanFact
a catholic school has been accused of breaching the equality act by refusing to wear headscarves.

Recently, with the advent of Transformer-based models (Vaswani et al., 2017) pre-trained using self-supervised objectives on large text corpora (Devlin et al., 2019; Radford et al., 2018; Raffel et al., 2020), abstractive summarization models are surpassing extractive ones on automatic evaluation metrics such as ROUGE (Lin, 2004). However, several studies (Falke et al., 2019; Goodrich et al., 2019; Kryściński et al., 2019; Wang et al., 2020; Durmus et al., 2020; Maynez et al., 2020) observe that despite high ROUGE scores, system-generated abstractive summaries are often factually inconsistent with respect to the source text. Factual inconsistency is a well-known problem for conditional text generation, which requires models to generate readable text that is faithful to the input document. Consequently, sequence-to-sequence generation models need to learn to balance signals between the source for faithfulness and the learned language modeling prior for fluency (Kryściński et al., 2019). The dual objectives render abstractive summarization models highly prone to hallucinating content that is factually inconsistent with the source documents (Maynez et al., 2020).
Prior work has pushed the frontier of guaranteeing factual consistency in abstractive summarization systems. Most focus on proposing evaluation metrics that are specific to factual consistency, as multiple human evaluations have shown that ROUGE and BERTScore (Zhang et al., 2020b) correlate poorly with faithfulness (Kryściński et al., 2019; Maynez et al., 2020). These evaluation models range from fact triples (Goodrich et al., 2019), textual entailment predictions (Falke et al., 2019) and adversarially pre-trained classifiers (Kryściński et al., 2019) to question answering (QA) systems (Wang et al., 2020; Durmus et al., 2020). It is worth noting that QA-based evaluation metrics show surprisingly high correlations with human judgment on factuality (Wang et al., 2020), indicating that QA models are robust in capturing facts that can benefit summarization tasks.
On the other hand, some work focuses on model design to incorporate factual triples (Cao et al., 2018; Zhu et al., 2020) or textual entailment (Falke et al., 2019) to boost factual consistency in generated summaries. Such models are effective in boosting factual scores, but often at the expense of significantly lowering ROUGE scores of the generated summaries. This happens because the models struggle to balance generating pivotal content with retaining true facts, and often end up sacrificing informativeness for the sake of correctness of the summary. In addition, these models inherit the backbone of generative models that suffer from hallucination despite the regularization from complex knowledge graphs or text entailment signals.
In this work, we propose SpanFact, a suite of two neural-based factual correctors that improve summary factual correctness without sacrificing informativeness. To ensure the retention of semantic meaning in the original documents while keeping the syntactic structures generated by advanced summarization models, we focus on factual edits on entities only, a major source of hallucinated errors in abstractive summarization systems in practice (Kryściński et al., 2019; Maynez et al., 2020). The proposed model is inspired by the observation that fact-checking QA models are a reliable medium for assessing whether an entity should be included in a summary as a fact (Wang et al., 2020; Durmus et al., 2020). To our knowledge, we are the first to adapt QA knowledge to enhance abstractive summarization. Compared to sequential generation models that incorporate complex knowledge graph and NLI mechanisms to boost factuality, our approach is lightweight and can be readily applied to any system-generated summaries without retraining the model. Empirical results on multiple summarization datasets show that the proposed approach significantly improves summarization quality over multiple factuality measures without sacrificing ROUGE scores.
Our contributions are summarized as follows. (i) We propose SpanFact, a new factual correction framework that focuses on correcting erroneous facts in generated summaries, generalizable to any summarization system. (ii) We propose two methods to solve the multi-fact correction problem with single or multi-span selection in an iterative or auto-regressive manner, respectively. (iii) Experimental results on multiple summarization benchmarks demonstrate that our approach can significantly improve multiple factuality measurements without a huge drop in ROUGE scores.

Related Work
The general neural-based encoder-decoder structure for abstractive summarization was first proposed by Rush et al. (2015). Later work improves this structure with better encoders, such as LSTMs (Chopra et al., 2016) and GRUs (Nallapati et al., 2016), that are able to capture long-range dependencies, as well as with reinforcement learning methods that directly optimize summarization evaluation scores (Paulus et al., 2018). One drawback of the earlier neural-based summarization models is the inability to produce out-of-vocabulary words, as the model can only generate whole words based on a fixed vocabulary. See et al. (2017) propose a pointer-generator framework that can copy words directly from the source through a pointer network, in addition to the traditional sequence-to-sequence generation model.
Abstractive summarization starts to shine with the advent of self-supervised algorithms, which allow deeper and more complicated neural networks such as Transformers (Vaswani et al., 2017) to learn diverse language priors from large-scale corpora. Models such as BERT (Devlin et al., 2019), GPT (Radford et al., 2018) and BART have achieved new state-of-the-art performances on abstractive summarization (Liu and Lapata, 2019; Zhang et al., 2020a; Shi et al., 2019; Fabbri et al., 2019). These models often fine-tune pre-trained Transformers with supervised summarization datasets that contain pairs of source and summary.
However, encoder-decoder architectures widely used in abstractive summarization systems are inherently difficult to control and prone to hallucination (Vinyals and Le, 2015; Koehn and Knowles, 2017; Lee et al., 2018), and often lead to factual inconsistency: the system-generated summary is fluent but unfaithful to the source (Cao et al., 2018). Studies have shown that 8% to 30% of system-generated abstractive summaries have factual errors (Falke et al., 2019; Kryściński et al., 2019). Our fact correction models are inherently different from these models, as we focus on post-correcting summaries generated by any model. Our models are trained with the objective of predicting masked entities identified for fact correction (Figure 1), and learn to fill in the entity masks of any system-generated summaries with a single or multi-span selection mechanism (Figure 2). The most similar work to ours is proposed concurrently by Meng et al. (2020), where they fine-tune a BART

Multi-Fact Correction Models
In this section, we describe two models proposed for factual error correction: (i) QA-span Fact Correction model, and (ii) Auto-regressive Fact Correction model. As both methods rely on span selection with different masking and prediction strategies, we call them SpanFact collectively.

Problem Formulation
Let (x, y) be a document-summary pair, where $x = (x_1, \dots, x_M)$ is the source sequence with M tokens, and $y = (y_1, \dots, y_N)$ is the target sequence with N tokens. An abstractive summarization model aims to model the conditional likelihood $p(y \mid x)$, which can be factorized into a product $p(y \mid x) = \prod_{t=1}^{N} p(y_t \mid y_{1,\dots,t-1}, x)$, where $y_{1,\dots,t-1}$ denotes the tokens preceding position t. The conditional maximum-likelihood objective ideally requires summarization models to optimize not only for informativeness but also for correctness. However, in reality this often fails, as the models have a high propensity for leaning towards informativeness rather than correctness. Suppose a summarization system generates a sequence of tokens $y = (y_1, \dots, y_N)$ to form a summary. Our factual correction models aim to edit an informative-yet-incorrect summary into $y' = (y'_1, \dots, y'_K)$ such that
$$f(x, y') > f(x, y), \tag{1}$$
where f is a metric measuring factual consistency between the source and system summary.

Span Selection Dataset
Our fact correction models are inspired by the span selection task, which is often used in reading comprehension tasks such as question answering. Figure 1 shows examples of the span selection datasets we created for training our QA-span and auto-regressive fact correction models, respectively. The query is a reference summary masked with one or all entities, and the passage is the corresponding source document to be summarized.
If an entity appears multiple times in the source document, we rank the occurrences based on the fuzzy string-matching scores (a variation of Levenshtein distance) between the query sentence and the source sentence containing the entity. Our models explicitly learn to predict the span of the masked entity rather than pointing to a specific token as in a Pointer Network, because the original tokens and replaced tokens often have different lengths.
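The construction of one span-selection training instance can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `build_example` and `split_sentences` are hypothetical helpers, the entity is given rather than detected by a tagger, the whole summary is used as the query sentence, and `difflib.SequenceMatcher` stands in for the Levenshtein-style fuzzy matcher mentioned above.

```python
from difflib import SequenceMatcher

MASK = "[MASK]"


def split_sentences(text):
    """Naive sentence splitter on '. ' (for illustration only)."""
    return [s.strip() for s in text.split(". ") if s.strip()]


def build_example(summary, entity, source):
    """Mask `entity` in `summary` and locate its best-matching occurrence in `source`.

    Returns (query, start_char, end_char). The span is (-1, -1) when the
    entity does not occur in the source.
    """
    query = summary.replace(entity, MASK, 1)
    best_score, best_start = -1.0, -1
    offset = 0
    for sent in split_sentences(source):
        sent_start = source.find(sent, offset)
        offset = sent_start + len(sent)
        pos = sent.find(entity)
        if pos == -1:
            continue  # this sentence does not mention the entity
        # Fuzzy similarity between the summary and the candidate source sentence.
        score = SequenceMatcher(None, summary.lower(), sent.lower()).ratio()
        if score > best_score:
            best_score, best_start = score, sent_start + pos
    if best_start == -1:
        return query, -1, -1
    return query, best_start, best_start + len(entity)
```

When the entity occurs in several source sentences, the occurrence inside the sentence most similar to the summary is chosen as the answer span.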
Our QA-span fact correction model iteratively masks and replaces one entity at a time, while the auto-regressive model masks all the entities simultaneously and replaces them in an auto-regressive fashion from left to right. Figure 2 shows an overview of our models. Comparing the two models, the QA-span fact correction model works better when only a few errors exist in the draft summary, as the predictions for the individual masks are relatively independent of each other. On the other hand, the auto-regressive fact correction model starts with a skeleton summary that has all the entities masked, which is often more robust when summaries contain many factual errors.

QA-Span Fact Correction Model
In the iterative setting, our model aims to conduct entity correction by answering a query that contains only one mask at a time. Suppose a system summary has T entities. At time step i, we mask the i-th entity and use this masked sequence as the query to our QA-span model. The prediction is placed into the masked slot in the query to generate an updated system summary to be used in the next step.
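The iterative procedure described above can be sketched as a simple loop. This is an illustrative sketch only: `iterative_correct` is a hypothetical name, entity spans are assumed to be given as token index pairs, and `predict_span` is a placeholder for the trained QA-span model.

```python
MASK = "[MASK]"


def iterative_correct(summary_tokens, entity_spans, source, predict_span):
    """Correct entities one at a time, left to right.

    summary_tokens: tokens of the draft summary.
    entity_spans:   (start, end) token indices of each entity, end exclusive,
                    sorted from left to right.
    predict_span:   callable(query_tokens, source) returning the replacement
                    tokens, or None to keep the original entity.
    """
    tokens = list(summary_tokens)
    shift = 0  # length change caused by earlier replacements
    for start, end in entity_spans:
        start, end = start + shift, end + shift
        original = tokens[start:end]
        # The query contains exactly one mask; earlier corrections are kept.
        query = tokens[:start] + [MASK] + tokens[end:]
        predicted = predict_span(query, source)
        replacement = original if predicted is None else list(predicted)
        tokens = tokens[:start] + replacement + tokens[end:]
        shift += len(replacement) - len(original)
    return tokens
```

Because each prediction is written back before the next entity is masked, later queries see the already corrected context.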
Given the source text x and a masked query $q = (y_1, \dots, \text{[MASK]}, \dots, y_m)$, our iterative correction model aims to predict the answer span via modeling p(i = start) and p(i = end). For span selection, we use the BertForQuestionAnswering model, which adds two separate non-linear layers on top of Transformers as pointers to the start and end token position for the answer. We initialize the fact-correction model from a pre-trained BERT model (Devlin et al., 2019), and perform fine-tuning with the span selection datasets we created from the summarization datasets (Figure 1).
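The input to the BERT model is a concatenation of two segments: the masked query q and the source x, separated by special delimiter markers as ([CLS], q, [SEP], x). Each token in the sequence is assigned three embeddings: token embedding, position embedding, and segmentation embedding. These embeddings are summed into a single vector and fed to the multi-layer Transformer model:
$$\tilde{h}^{l} = \mathrm{LN}\big(h^{l-1} + \mathrm{MHAtt}(h^{l-1})\big) \tag{2}$$
$$h^{l} = \mathrm{LN}\big(\tilde{h}^{l} + \mathrm{FFN}(\tilde{h}^{l})\big) \tag{3}$$
where $h^0$ are the input vectors, and l represents the depth of stacked layers. LN and MHAtt are layer normalization and multi-head attention operations (Vaswani et al., 2017), and FFN is the position-wise feed-forward network. The top layer provides the hidden states $h_i$ for the input tokens with rich contextual information. The start (s) and end (e) of the answer span are predicted as:
$$a^{start}_i = \frac{\exp(q^{s}_i)}{\sum_{j=1}^{H} \exp(q^{s}_j)}, \quad q^{s}_i = \mathrm{ReLU}(w_s^\top h_i + b_s) \tag{4}$$
$$a^{end}_i = \frac{\exp(q^{e}_i)}{\sum_{j=1}^{H} \exp(q^{e}_j)}, \quad q^{e}_i = \mathrm{ReLU}(w_e^\top h_i + b_e) \tag{5}$$
where H is the number of encoder's hidden states, and $w_s, w_e \in \mathbb{R}^d$ and $b_s, b_e \in \mathbb{R}$ are trainable parameters. The final span is selected based on the argmax of Eqn. (4) and (5) with the constraint of $p^{start} < p^{end}$ and $p^{end} - p^{start} < k$.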
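A constrained span selection of this kind might be implemented as below. This is a sketch under assumptions: the text only states that the argmax is taken subject to the two constraints, so scoring each valid (start, end) pair by the sum of its log-probabilities is one common choice rather than the confirmed procedure, and `select_span` is a hypothetical name.

```python
import math


def select_span(start_probs, end_probs, k):
    """Return the (start, end) pair with the highest joint log-probability
    such that start < end and end - start < k, or None if no pair is valid."""
    best, best_score = None, -math.inf
    for s, p_s in enumerate(start_probs):
        if p_s <= 0.0:
            continue
        # end - start < k means end can be at most start + k - 1.
        for e in range(s + 1, min(s + k, len(end_probs))):
            p_e = end_probs[e]
            if p_e <= 0.0:
                continue
            score = math.log(p_s) + math.log(p_e)
            if score > best_score:
                best, best_score = (s, e), score
    return best
```

Taking the two argmaxes independently can yield an end position before the start position; searching over valid pairs avoids that case.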

Auto-regressive Fact Correction Model
One disadvantage of the QA-style span-prediction strategy is that if the sequence contains too many factual errors, masking out one entity at a time may leave a highly erroneous skeleton summary to start with. The model might be making predictions on top of wrong entities from later in the sequence. Masking one entity at a time is essentially a greedy local method that is prone to error accumulation. To alleviate this issue, we propose a new sequential fact correction model to handle errors in a more global manner with beam search. Specifically, we mask out all the entities simultaneously, and use a novel auto-regressive span-selection decoder to predict fillers for the multiple masks sequentially. By doing this, we assume dependency between the masks: the earlier predicted entities will be used as corrected context for better predictions in the later steps. Given a source text $x = (x_1, \dots, x_n)$ and a draft summary $(y_1, \dots, y_m)$, our model first masks out all the entities (with T masks), and leaves a skeleton summary as the query $q = (y_1, \dots, \text{[MASK]}_1, \dots, \text{[MASK]}_T, \dots, y_m)$. Then, we concatenate the query q with the source x (similar to Section 3.3) as inputs to the encoder. The inputs are fed into BERT to obtain contextual hidden representations.
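Building the skeleton query with all entities masked can be sketched as follows. The function name `mask_all_entities` and the representation of entities as token index pairs are assumptions made for illustration; the source does not specify how entity spans are stored.

```python
MASK = "[MASK]"


def mask_all_entities(tokens, entity_spans):
    """Replace every entity span (start, end), end exclusive, with one [MASK].

    Returns the skeleton query and the list of original entities, both in
    left-to-right order.
    """
    query, entities, cursor = [], [], 0
    for start, end in sorted(entity_spans):
        query.extend(tokens[cursor:start])  # copy the text between entities
        query.append(MASK)
        entities.append(tokens[start:end])
        cursor = end
    query.extend(tokens[cursor:])
    return query, entities
```

The original entities are returned alongside the query so that a mask can be restored when no replacement is found in the source.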
We then select the encoder's hidden states for the T masks, $h_{\text{[MASK]}_1}, \dots, h_{\text{[MASK]}_T}$, as partial input to an auto-regressive Transformer-based decoder. Unlike generation tasks that require an [EOS] token to indicate the end of decoding, our decoder runs T steps to predict the answer spans for these T masks. At step t, we first fuse the hidden representation $h_{\text{[MASK]}_t} \in \mathbb{R}^d$ of the t-th [MASK] token and the previously predicted entity representation $s^{ent}_{t-1} \in \mathbb{R}^d$:
$$z_t = W^\top \big[h_{\text{[MASK]}_t}; s^{ent}_{t-1}\big] \tag{6}$$
where $W \in \mathbb{R}^{2d \times d}$, $s^{ent}_0 = h_{\text{[CLS]}}$ (the representation of the [CLS] token), and [; ] denotes vector concatenation.
The input $z_t$ is then fed to the Transformer decoder (as in Eqn. (2) and (3)) to generate the decoder's hidden state $h_t$ at time step t. Based on $h_t$, we use a two-pointer network to predict the start and end positions of the answer entity in the source (encoder's hidden states). This is achieved with cross-attention of $h_t$ w.r.t. the encoder's hidden states, similar to Eqn. (4) and (5). This operation results in two distributions over the encoder's hidden states for the start and end span positions. The final prediction of the start and end positions for mask t is obtained by taking the argmax over the pointer position distributions:
$$p^{start} = \arg\max(a^{start}_1, \dots, a^{start}_M) \tag{9}$$
$$p^{end} = \arg\max(a^{end}_1, \dots, a^{end}_M) \tag{10}$$
under the constraint that $p^{start} < p^{end}$ and $p^{end} - p^{start} < k$. Based on the start and end positions for the predicted entity, we can obtain the predicted entity representation at time step t as the mean over the in-span encoder's hidden states:
$$s^{ent}_t = \mathrm{mean}\big(\{h^{enc}_i\}_{i=p^{start}}^{p^{end}}\big) \tag{11}$$
where $h^{enc}_i$ denotes the i-th encoder hidden state. This representation is used as the input for the next step of decoding. It is worth noting that although the argmax operations in Eqn. (9) and (10) are non-differentiable, the model is trained based on the start and end positions of the ground-truth answer w.r.t. the start and end logits in Eqn. (4) and (5), which allows the gradient to back-propagate to the encoder. Meanwhile, the encoder's hidden states used to compose $s^{ent}_t$ in Eqn. (11) also carry the gradients. During inference, beam search is used to find the best sequence of predicted spans in the source to replace the masks.
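The step-by-step dependency between masks can be illustrated with a small sketch. Several simplifications are assumed here and are not taken from the paper: greedy decoding replaces beam search, the span is treated as inclusive of both end points, hidden states are plain Python lists, and `step_fn` is a hypothetical stand-in for the fusion layer, Transformer decoder and two-pointer head combined.

```python
def mean_pool(hidden_states, start, end):
    """Average the encoder hidden states over the inclusive span [start, end]."""
    span = hidden_states[start:end + 1]
    dim = len(span[0])
    return [sum(vec[j] for vec in span) / len(span) for j in range(dim)]


def greedy_decode(mask_states, encoder_states, cls_state, step_fn):
    """Fill T masks from left to right, one span per decoding step.

    step_fn(mask_state, prev_entity_state) returns a (start, end) pair of
    positions in the encoder states.
    """
    prev = cls_state  # the initial entity representation is the [CLS] state
    spans = []
    for mask_state in mask_states:
        start, end = step_fn(mask_state, prev)
        spans.append((start, end))
        # The pooled span representation conditions the next prediction.
        prev = mean_pool(encoder_states, start, end)
    return spans
```

The loop runs exactly T steps, one per mask, so no end-of-sequence token is needed.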
Compared to the conventional Pointer Network (See et al., 2017) that only points to one token at a time, our sequential span selection decoder has the flexibility to replace a mask by any number of entity tokens, which is often required in summary factual correction.

Experiment
In this section, we present our results on using SpanFact for multiple summarization datasets.

Experimental Setup
Training data for our fact correction models are generated as described in Section 3.2 on CNN/DailyMail (Hermann et al., 2015), XSum (Narayan et al., 2018) and Gigaword (Graff et al., 2003; Rush et al., 2015). The statistics of these three datasets are provided in Table 2. During training, if an entity does not have a corresponding span in the source, we point the answer span to the [CLS] token. During inference, if the predicted answer span is the [CLS] token, we restore the original masked entity.
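The [CLS] fallback at inference time can be sketched as below. The helper name `fill_mask`, the assumption that [CLS] sits at position 0 of the concatenated input, and the inclusive span convention are illustrative choices rather than details given in the text.

```python
CLS_INDEX = 0  # assumed position of [CLS] in the concatenated input


def fill_mask(original_entity, input_tokens, start, end):
    """Return the tokens that should replace one mask.

    start and end index into the concatenated input ([CLS], q, [SEP], x) and
    are treated as inclusive. A prediction on [CLS] means that no answer was
    found in the source, so the original entity is kept.
    """
    if start == CLS_INDEX or end == CLS_INDEX:
        return list(original_entity)
    return input_tokens[start:end + 1]
```

With this convention the corrector leaves an entity untouched whenever the model abstains.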
[The argmax is used] for the answer span, and the softmax is used for computing the loss for back-propagation.
Our fact correction models are implemented via the Huggingface Transformers library (Wolf et al., 2019) in PyTorch (Paszke et al., 2017). We initialize all encoder models with the checkpoint of an uncased, large BERT model pre-trained on English data and SQuAD for all experiments. Both source and target texts were tokenized with BERT's sub-word tokenizer. The max sequence length is set to 512 for the encoder. We use a shallow Transformer decoder (L=2) for the auto-regressive span selection decoder, as the pre-trained BERT-large encoder is already robust in selecting the right spans in the single-span selection task with only two pointers (Section 3.3). The Transformer decoder has 1,024 hidden units and the feed-forward intermediate size for all layers is 4,096.
All models were fine-tuned on our span prediction data for 2 epochs with batch size 12. The AdamW optimizer (Loshchilov and Hutter, 2017) with ε = 1e-8 and an initial learning rate of 3e-5 is used for training. Our learning rate schedule follows a linear decay scheduler with 10,000 warmup steps. During inference, we use beam search with b = 5 and k = 10 (the constraint on the distance between the start and end pointers). The best model checkpoints are chosen based on performance on the validation set. Experiments are conducted using 4 Quadro RTX 8000 GPUs with 48GB of memory.
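A linear warmup followed by linear decay can be written as a small function. This is a generic sketch of such a schedule, not the authors' training code; `linear_warmup_decay` is a hypothetical name, and the total number of training steps is not reported in the text, so the value used in the usage lines is purely an assumption.

```python
def linear_warmup_decay(step, warmup_steps, total_steps, base_lr):
    """Learning rate at `step`: linear ramp from 0 to base_lr over the warmup,
    then linear decay from base_lr to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Usage with the reported base rate and warmup; total_steps=50_000 is assumed.
for step in (0, 5_000, 10_000, 30_000, 50_000):
    print(step, linear_warmup_decay(step, 10_000, 50_000, 3e-5))
```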

Evaluation Metrics
We use three automatic evaluation metrics to evaluate our models. The first is ROUGE (Lin, 2004), the standard summarization quality metric, which has high correlation with summary informativeness in the news domain (Kryściński et al., 2019).
Since ROUGE has been criticized for its poor correlation with factual consistency (Kryściński et al., 2019; Wang et al., 2020), we use two additional automatic metrics that specifically focus on factual consistency: FactCC (Kryściński et al., 2019) and QAGS (Wang et al., 2020). FactCC is a pre-trained binary classifier that evaluates the factuality of a system-generated summary by predicting whether it is consistent or inconsistent w.r.t. the source. This classifier was trained on adversarial examples obtained by heuristically injecting noise into reference summaries. In addition, very recent work proposed QA-based models for factuality evaluation (Wang et al., 2020; Durmus et al., 2020; Maynez et al., 2020), and Wang et al. (2020) showed that their evaluation models have higher correlation with human judgements on factuality when compared with FactCC (Kryściński et al., 2019). We thus include our re-implementation of a question generation and question answering model (QGQA) following Wang et al. (2020) as an evaluation metric for factuality. This model generates a set of questions based on the system-generated summary, and then answers these questions using either the source or the summary to obtain two sets of answers. The answers are compared against each other using an answer-similarity metric (token-level F1), and the averaged similarity metric over all questions is used as the QGQA score. Answers generated from a highly faithful system summary should be similar to those generated from the source.
We were not able to obtain any of the QA evaluation models or code from Wang et al. (2020), Durmus et al. (2020) or Maynez et al. (2020), as the authors are still in the stage of making the code public. We used a pre-trained UniLM model for question generation (QG) and a BertForQuestionAnswering model for question answering (QA). The QG model is fine-tuned on NewsQA (Trischler et al., 2017) with the entity-answer conditional task (Wang et al., 2020), and the QA model is pre-trained on SQuAD 2.0 (Rajpurkar et al., 2018).
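The token-level F1 answer similarity and its average over questions can be sketched as follows. This is a minimal illustration in the style of SQuAD evaluation, assuming whitespace tokenization and lowercasing only; the function names `token_f1` and `qgqa_score` are hypothetical and the normalization used in the actual re-implementation is not specified in the text.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-level F1 between two answer strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers match; one empty answer does not.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def qgqa_score(answer_pairs):
    """Average token F1 over (answer_from_source, answer_from_summary) pairs."""
    return sum(token_f1(a, b) for a, b in answer_pairs) / len(answer_pairs)
```

A summary whose answers all agree with the answers obtained from the source receives a score of 1, while complete disagreement yields 0.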

Baselines
We compare against the following abstractive summarization baselines. On CNNDM and XSum, we use BertSumAbs, BertSumExtAbs and TransformerAbs (Liu and Lapata, 2019). In addition, we also compare with Bottom-up (Gehrmann et al., 2018). On Gigaword, we use the pointer-generator (See et al., 2017) and the base and full GenParse models (Song et al., 2020) for comparison. For the factual correction baseline, we compare with the Two-encoder Pointer Generator (Split Encoder) (Shah et al., 2020), which employs a similar setting to ours for masking entities w.r.t. the source, and uses dual encoders to copy and generate from both the source and the masked query for fact update. Whereas our span selection models can fill in the mask with any number of tokens, their models aim to regenerate the masked query based on the source. In other words, their decoder regenerates the whole sequence token by token with a pointer-generator, which inherits the backbone of generative models that suffer from hallucination.

Experimental Results
Tables 3, 4, and 5 summarize the results on the CNN/DailyMail, XSum and Gigaword datasets, respectively. Each block in the tables compares the original summarization model's output with the corrected outputs obtained by our baseline and proposed models. On CNN/DailyMail (Table 3), our correction models significantly boost factual consistency measures (QGQA and FactCC) by large margins, with only small drops in ROUGE. This shows our models have the ability to improve the correctness of system-generated summaries without sacrificing informativeness. When comparing our two proposed models, we observe that the QA-span model performs better than the auto-regressive model. This is expected as CNN/DailyMail's reference summaries tend to be more extractive (See et al., 2017), and summarization models tend to make few errors per summary (Narayan et al., 2018). Thus, the iterative procedure of the QA-span model is more robust with high precision as it has more correct context from the query, with only minimal negative influence from other concurrent errors. This is also reflected in the high scores of QGQA and FactCC across all the models we tested. Since QGQA and FactCC are based on comparing the system-generated summary w.r.t. the source text, a high score indicates high semantic similarity between the system summary and the source.
On XSum (Table 4) and Gigaword (Table 5), both of our correction models boost factual consistency measures by large margins with a slight drop in ROUGE (-0.5 to -1.5) on average. This is still encouraging, as abstractive summarization models that use complex factual controlling components for generation often have drops of 5-10 ROUGE points (Zhu et al., 2020).
We also notice that the QGQA and FactCC scores of all summarization models are lower than those on CNN/DailyMail. The scores are especially low on XSum. This is likely due to the data construction protocol of XSum, where the first sentence of a source document is used as the summary and the remainder of the article is used as the source. As a result, many entities that appear in the reference summary never appear in the source, which may cause abstractive summarization models to hallucinate severely with many factual errors (Maynez et al., 2020). As the system summaries often contain many errors, our QA-span model that relies on answering a single-mask query often has the wrong context to condition on at each step, which negatively affects the performance of this model. In contrast, the strategy of masking all the entities provides the auto-regressive model with a better query for entity replacement. We can observe in Table 4 that the auto-regressive model performs better than the QA-span model on XSum.

Human Evaluation
To provide qualitative analysis of the proposed models, we conduct human evaluation on pairwise comparison of CNN/DailyMail summaries enhanced by different correction strategies. We select three state-of-the-art abstractive summarization models as the backbones, and collect three sets of pairwise summaries for each setting: ( [...] source document. As shown in Table 6, summaries from our two models are chosen more frequently as the factually correct one compared to the original. Between the two correction models, the preferences are comparable. In addition, we also test our fact correction models on the FactCC test set provided by Kryściński et al. (2019) and manually check the outputs. Table 7 shows the results of the original summaries and the summaries corrected by our models in terms of automatic fact evaluation and our manual evaluation. Among 508 system-generated summary sentences, 62 were incorrect. The QA-span model correctly fixed 18 of these 62 sentences, and the auto-regressive model correctly fixed 16. Among the 446 sentences that are labeled as correct by the annotators in Kryściński et al. (2019), our two models made 3 and 4 wrong changes in the entities, respectively, while keeping most of the entities unchanged or replacing them with equivalent entities.
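The counts reported above can be turned into rates with a few lines of arithmetic. This sketch only restates the numbers given in the text (62 incorrect and 446 correct sentences out of 508); the helper name `rate` and the dictionary layout are illustrative.

```python
def rate(count, total):
    """Percentage of `count` out of `total`, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


incorrect, correct = 62, 446  # sentence counts reported in the text (508 in total)
fixed = {"QA-span": 18, "auto-regressive": 16}   # incorrect sentences that were fixed
broken = {"QA-span": 3, "auto-regressive": 4}    # correct sentences changed wrongly

for name in fixed:
    print(name, "fix rate:", rate(fixed[name], incorrect),
          "wrong-change rate:", rate(broken[name], correct))
```

Read this way, each model repairs roughly a quarter or more of the erroneous sentences while wrongly altering fewer than 1% of the correct ones.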

Conclusion
We present SpanFact, a suite of two factual correction models that use span selection mechanisms to replace one or multiple entity masks at a time. SpanFact can be used for fact correction on any abstractive summaries. Empirical results show that our models improve the factuality of summaries generated by state-of-the-art abstractive summarization systems without a huge drop in ROUGE scores. For future work, we plan to apply our method to other types of spans, such as noun phrases, verbs, and clauses.
[We thank] the reviewers for their valuable comments and special thanks to Yuwei Fang and other members of the Microsoft Dynamics 365 AI Research team for the feedback and suggestions.

CNNDM Source
Jerusalem (CNN) The flame of remembrance burns in Jerusalem, and a song of memory haunts Valerie Braham as it never has before. This year, Israel's Memorial Day commemoration is for bereaved family members such as Braham. "Now I truly understand everyone who has lost a loved one," Braham said. Her husband, Philippe Braham, was one of 17 people killed in January's terror attacks in Paris. He was in a kosher supermarket when a gunman stormed in, killing four people, all of them Jewish.
System Summary
france's memorial day commemoration is for bereaved family members as braham. valerie braham was one of 17 people killed in january's terror attacks in paris.
Corrected by SpanFact
israel's memorial day commemoration is for bereaved family members as braham. philippe braham was one of 17 people killed in january's terror attacks in paris.

CNNDM Source
(CNN) If I had to describe the U.S.-Iranian relationship in one word it would be "overmatched." ... America is alienating some of our closest allies because of the Iran deal, and Iran is picking up new ones and bolstering relations with old ones who are growing more dependent because they see Iran's power rising...
System Summary
iran is alienating some of our closest allies because of the iran deal, and iran is picking up new ones.
Corrected by SpanFact
america is alienating some of our closest allies because of the iran deal, and iran is picking up new ones.

CNNDM Source
(CNN) A North Pacific gray whale has earned a spot in the record books after completing the longest migration of a mammal ever recorded. The whale, named Varvara, swam nearly 14,000 miles (22,500 kilometers), according to a release from Oregon State University, whose scientists helped conduct the whale-tracking study. Varvara, which is Russian for "Barbara," left her primary feeding ground off Russia's Sakhalin Island to cross the Pacific Ocean and down the West Coast of the United States to Baja, Mexico...
System Summary
a north pacific gray whale swam nearly 14,000 miles from oregon state university.
Corrected by SpanFact
a north pacific gray whale swam nearly 14,000 miles from russia's sakhalin island.

CNNDM Source
Sanaa, Yemen (CNN) Saudi airstrikes over Yemen have resumed once again, two days after Saudi Arabia announced the end of its air campaign. The airstrikes Thursday targeted rebel Houthi militant positions in three parts of Sanaa, two Yemeni Defense Ministry officials said. The attacks lasted four hours. ... The Saudi-led coalition said a new initiative was underway, Operation Renewal of Hope, focused on the political process. But less than 24 hours later, after rebel forces attacked a Yemeni military brigade, the airstrikes resumed, security sources in Taiz said.
System Summary
the attacks lasted four hours, two days after rebel forces attacked yemeni military troops.
Corrected by SpanFact
the attacks lasted four hours, less than 24 hours after rebel forces attacked yemeni military troops.

CNNDM Source
Boston (CNN) When the bomb went off, Steve Woolfenden thought he was still standing. That was because, as he lay on the ground, he was still holding the handles of his son's stroller. He pulled back the stroller's cover and saw that his son, Leo, 3, was conscious but bleeding from the left side of his head. Woolfenden checked Leo for other injuries and thought, "Let's get out of here." ...
System Summary
steve woolfenden, 3, was conscious but bleeding from the left side of his head.
Corrected by SpanFact
leo, 3, was conscious but bleeding from the left side of his head.

[...] Kryściński et al. (2019)). Factual errors by abstractive summarization system are marked in red. Corrections made by the proposed SpanFact models are marked in orange.