Improving Truthfulness of Headline Generation

Most studies on abstractive summarization report ROUGE scores between system and reference summaries. However, we have a concern about the truthfulness of generated summaries: whether all facts of a generated summary are mentioned in the source text. This paper explores improving the truthfulness in headline generation on two popular datasets. Analyzing headlines generated by the state-of-the-art encoder-decoder model, we show that the model sometimes generates untruthful headlines. We conjecture that one of the reasons lies in untruthful supervision data used for training the model. In order to quantify the truthfulness of article-headline pairs, we consider the textual entailment of whether an article entails its headline. After confirming quite a few untruthful instances in the datasets, this study hypothesizes that removing untruthful instances from the supervision data may remedy the problem of the untruthful behaviors of the model. Building a binary classifier that predicts an entailment relation between an article and its headline, we filter out untruthful instances from the supervision data. Experimental results demonstrate that the headline generation model trained on filtered supervision data shows no clear difference in ROUGE scores but remarkable improvements in automatic and manual evaluations of the generated headlines.


Introduction
Automatic text summarization aims at condensing a text into a shorter version while maintaining the essential information (Mani, 2001). Methods for summarization are broadly categorized into two approaches: extractive and abstractive. The former extracts important words, phrases, or sentences from a source text to compile a summary (Goldstein et al., 2000; Erkan and Radev, 2004; Mihalcea, 2004; Lin and Bilmes, 2011). In contrast, the latter involves more complex linguistic operations (e.g., abstraction, paraphrasing, and compression) to generate a new text (Knight and Marcu, 2000; Clarke and Lapata, 2008). Until 2014, abstractive summarization had been less popular than the extractive approach because of the difficulty of generating a natural text. However, research on abstractive summarization has attracted a lot of attention recently with the advances in encoder-decoder models (Rush et al., 2015; Takase et al., 2016; Zhou et al., 2017; Cao et al., 2018a; Song et al., 2019). English Gigaword (Graff and Cieri, 2003; Napoles et al., 2012) is a representative dataset for abstractive summarization. Rush et al. (2015) regarded Gigaword as a corpus containing a large number of article-headline pairs for training an encoder-decoder model. Their work assumed a task setting where the first sentence of an article is the source text and its corresponding headline is the target text (summary). Since then, it has been common practice to use the Gigaword dataset with this task setting and to measure the quality of generated headlines with ROUGE scores (Lin and Hovy, 2003) between system-generated and reference headlines.
Ideally, a summarization method would achieve a ROUGE score of 100, i.e., produce a system output identical to the reference. However, this is an unrealistic goal for the task setting on the Gigaword dataset. The summarization task is underconstrained in that the importance of a piece of information highly depends on the expectations and prior knowledge of a reader (Kryściński et al., 2019). In addition, the Gigaword dataset (as well as other widely-used datasets) is noisy for summarization research because it was created not for research purposes but for other professional activities (e.g., news production and distribution). Thus, the state-of-the-art methods have only reached ROUGE-1 scores of less than 40 on this dataset. While a number of methods compete with each other on this underconstrained task over noisy data, we have another concern about the truthfulness of generated summaries: whether all facts of a generated summary are mentioned in the source text. Unlike extractive summarization, abstractive summarization has no guarantee of truthfulness. This raises a serious concern for practical applications of abstractive summarization: a generated summary may include fake facts that are not mentioned in the source document.
In this paper, we explore improving the truthfulness in abstractive summarization on two datasets, English Gigaword and JApanese MUlti-Length Headline Corpus (JAMUL) (Hitomi et al., 2019).
In Section 2, we analyze headlines generated by the state-of-the-art encoder-decoder model and show that the model sometimes generates unexpected words. In order to estimate the truthfulness of a headline to the original text, we measure the recall-oriented ROUGE-1 score between the source text and the generated headline. This analysis reveals that a high ROUGE score between a reference and a generated headline does not necessarily mean high truthfulness to the source, and that there is only a weak correlation between the two.
In Section 3, we conjecture that one of the reasons why the model sometimes exhibits such an untruthful behavior lies in untruthful article-headline pairs, which are used for training the model. In order to quantify the truthfulness of article-headline pairs, we consider the textual entailment of whether an article (source document) entails its headline. We will show that about 30-40% of source documents do not entail their headlines under the widely-used experimental settings. In other words, the current task setting is inappropriate for abstractive summarization. We release the annotations of textual entailment for both English Gigaword and JAMUL 1 .
After confirming the untruthfulness of article-headline pairs in the datasets, we hypothesize that removing untruthful instances from the training data may remedy the problem of the untruthful behavior of the model. In Section 4, we build a binary classifier that predicts the entailment relation between an article and its headline and use the classifier to filter out untruthful instances from the training data. We train a model on the filtered supervision data in Section 5. Experimental results demonstrate that the filtering procedure shows no clear difference in ROUGE scores but remarkable improvements when we manually and automatically evaluate the truthfulness of the generated headlines. These results suggest the importance of evaluating truthfulness in addition to relevance.

Examples of unexpected outputs
Although the current state-of-the-art method for abstractive summarization achieves a ROUGE-1 score of less than 40 on the Gigaword dataset, the generated headlines actually look very fluent. This is probably because the encoder-decoder model acquires a strong language model from the vast amount of supervision data. However, some studies have reported that generated headlines often deviate from the content of the original document (Cao et al., 2018b; Kryściński et al., 2019). They addressed the problem of an abstractive model making mistakes in facts (e.g., tuples of subjects, predicates, and objects).
However, we also regularly see examples where the abstractive model generates unexpected words. This is true even for the state-of-the-art model. Table 1 shows examples of unexpected outputs from UniLM (Dong et al., 2019), which shows the highest ROUGE scores 2 on English Gigaword. In the first example, the output includes "in November" whereas the input did not mention the exact month. In fact, this article was published in August 2009; however, the model probably guessed the month from the expression "this fall". The second example also exhibits a similar problem where the model incorrectly supplemented the news source "the Detroit News". The third and fourth examples are more problematic in that the generated headlines do not summarize the input sentences at all.

Estimating truthfulness
In order to quantify the problem of outputs that are untruthful to source documents, we measure the word overlap between the input and output of the UniLM model on the test set of English Gigaword (Rush et al., 2015). Here, we calculate the recall-oriented ROUGE-1 score 3, regarding an output (generated headline) as a gold standard and an input (source document) as a target to be evaluated 4. Although this use of the ROUGE metric is unconventional, the intention here is to measure how many words in a generated headline originate from the input document. In other words, if all words in a generated headline are covered by its source document (truthful), the score is 100; if none of the words in a generated headline originate from its source document (untruthful), the score is 0. We call this ROUGE score the support score hereafter to avoid naming conflicts with conventional ROUGE scores between system and reference summaries. We note that similar methods to the support score can be found in several studies; for example, prior work measured the abstractiveness of an output. Our support score is roughly a reverse version of abstractiveness because abstractiveness counts the words in an output that do not appear in the input.

Table 1: Examples of unexpected outputs from UniLM on English Gigaword.
1. Input (lead sentence): "u.s. home resales posted the largest monthly increase in at least ## years last month as first-time buyers rushed to take advantage of a tax credit that expires this fall ." Output (generated headline): "home sales rise #.# percent in november"
2. Input: "seattle - for years , the standard treatment for patients with blood clots in veins deep in a limb has been blood thinners that stop the clots from getting bigger ." Output: "UNK drug may help treat UNK clots the detroit news"
3. Input: "wigan moved to consolidate their premiership status tuesday by tying down one of the brightest stars of last season 's maiden top flight campaign ." Output: "english football league tables"
4. Input: "never mind that she has dark blond hair and light blue eyes and the fairest of skin ." Output: "african-american girl is a UNK"

Figure 2: Scatter plot of ROUGE scores and support scores: the X-axis presents the ROUGE-1 score (F-value) between system and reference headlines; the Y-axis presents the support score (the same as Figure 1).

Figure 1 reports the histogram of the support scores. A certain number of instances receive relatively high support scores: 50.10% of the instances obtain scores larger than 80. At the same time, a non-negligible fraction (9.14%) of the instances have support scores of less than 40. Note that the support scores give only rough estimates of the truthfulness of the model; a lower score may mean that a headline contains paraphrased or shortened words from its source document. Having said that, Figure 1 indicates that the state-of-the-art model sometimes generates untruthful headlines.
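The support score defined above amounts to clipped unigram recall with the generated headline as the "reference". A minimal sketch (the actual experiments used a standard ROUGE implementation rather than this toy code):

```python
from collections import Counter

def support_score(source_tokens, headline_tokens):
    """Recall-oriented ROUGE-1 with the generated headline as the
    'reference': the fraction of headline unigrams covered by the
    source document, with clipped counts as in ROUGE."""
    src = Counter(source_tokens)
    head = Counter(headline_tokens)
    overlap = sum(min(count, src[tok]) for tok, count in head.items())
    return 100.0 * overlap / max(sum(head.values()), 1)

# A fully supported headline scores 100; unsupported words lower it.
print(support_score("home sales rose sharply last month".split(),
                    "home sales rose".split()))              # 100.0
print(support_score("home sales rose sharply".split(),
                    "home sales fall in november".split()))  # 40.0
```

In the second call, only 2 of the 5 headline tokens appear in the source, mirroring how hallucinated words such as "in november" in Table 1 depress the score.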
Here, another interesting question comes to mind: how well do the widely-used benchmarking values (measured by ROUGE scores between system and reference headlines) reflect the truthfulness (measured by the support scores)? Figure 2 depicts the relation between the two: the X-axis presents the ROUGE-1 score between system and reference headlines, and the Y-axis presents the support score. Unfortunately, we cannot observe a strong correlation between the two scores: Pearson's correlation coefficient between them is 0.189, which indicates only a weak correlation. This result supports the claim that conventional ROUGE scores tell us little about the truthfulness of generated summaries.
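The reported association can be checked with plain Pearson's r; a minimal self-contained implementation:

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly correlated toy data gives r = 1.0; the observed value
# between ROUGE-1 and support scores was only 0.189.
print(round(pearson([10, 20, 30], [1, 2, 3]), 3))
```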
Are the task settings truthful?

Background of the datasets and settings
Why does a headline generation model exhibit untruthful behavior as we saw in the previous section? Before discussing the reason behind this, we need to understand how the datasets and task settings were established.
The Annotated English Gigaword corpus 5 is one of the most popular corpora in abstractive summarization research. Rush et al. (2015) converted this corpus into a dataset for abstractive summarization. They regarded the lead (first) sentence of an article as a source document and its corresponding headline as a target output. They did not explain why they used only a lead sentence rather than a full-length article as the source document for headline generation. We infer the reasons to be: a lead sentence provides a strong baseline for extractive summarization; their intention was to explore the capability of abstractive summarization from a lead sentence to a headline; and using full text was time-consuming for encoder-decoder models.
Moreover, Rush et al. (2015) introduced several heuristics to remove noisy instances. They discarded an instance if: (1) the source and target documents have no non-stop word in common; (2) the headline contains a byline or other extraneous editing marks; or (3) the headline includes a question mark or colon.
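The three heuristics can be sketched as a simple filter. The stop-word list and byline markers below are illustrative placeholders, not the original filters:

```python
def keep_pair(article, headline,
              stopwords=frozenset({"the", "a", "of", "to", "in"})):
    """Sketch of the three filtering heuristics of Rush et al. (2015);
    stop-word list and byline markers are simplified stand-ins."""
    art = set(article.lower().split()) - stopwords
    head = set(headline.lower().split()) - stopwords
    if not art & head:
        return False                      # (1) no non-stop word in common
    if any(m in headline.lower() for m in ("(afp)", "(reuters)", "--")):
        return False                      # (2) byline / editing marks
    if "?" in headline or ":" in headline:
        return False                      # (3) question mark or colon
    return True

print(keep_pair("home sales rose last month", "home sales rise"))   # True
print(keep_pair("talks resume tuesday", "analysis: talks resume"))  # False
```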
JApanese MUlti-Length Headline Corpus (JAMUL) 6 is a dataset specifically designed for evaluating summarization methods. JAMUL consists of 1,524 Japanese full-text articles and their print headlines (used in newspapers). Although JAMUL is distributed free of charge, it alone is insufficient for training an encoder-decoder model. Hitomi et al. (2019) also released the Japanese News Corpus (JNC), a large-scale dataset consisting of 1,831,812 pairs of newspaper articles and their print headlines. JNC includes only the first three sentences of each article 7.
Table 2 summarizes the datasets and task settings. As we can see from the rows of Rush et al. (2015) and JNC, these task settings do not use full-text articles but only the lead sentence (6.6% of the words in full articles; Gigaword) or the lead three sentences (25.9% of the words in full articles; JNC) as source documents for abstractive summarization. Hence, we hypothesize that the source documents under these task settings contain insufficient information for generating headlines. In other words, headline generation models might be faced with supervision data where headlines cannot be generated from the source documents, and might thus learn to be untruthful, i.e., to produce pieces of information that are not mentioned in the source documents.

Truthfulness of the datasets and settings measured by textual entailment
This section explores the hypothesis: do source documents include sufficient information to produce headlines? We examine this hypothesis by considering textual entailment between a source document and its headline. More specifically, we would like to know whether a source document entails its headline, i.e., whether we can infer that a headline is true based on the information in the source document. We asked three human subjects to judge entailment relations for 1,000 pairs of source documents and headlines from each dataset: 1,000 pairs randomly selected from the test set of the English Gigaword dataset and 1,000 pairs from JAMUL. The labels include entail, non-entail, and other (see Appendix for the definition of the labels and their treatment). Table 4 reports the ratio of document-headline pairs for which two or three human subjects voted 'yes' for the entailment relation (entail). Only 70.3% of lead-headline pairs in the Gigaword dataset hold the entailment relation. For reference, we performed the same analysis using full-text articles as source documents and found that the ratio of non-entailment pairs drops substantially (7.2% for English Gigaword and 5.8% for JAMUL).

Table 2: The statistics of the datasets and task settings. The column "# words" presents two values for each row: the top value is the total number of words in the headlines, and the bottom value is the total number of words in the articles. The second row of each group (Rush et al. (2015) and JNC) corresponds to the setting of the training data. The columns "# sent / doc", "# words / doc", and "# words / headline" denote the average number of sentences per source document, words per source document, and words per headline, respectively.
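The aggregation of annotator votes described above (keeping a label only when enough annotators agree) can be sketched as follows; the threshold is 2 for the three in-house annotators and, as described in Section 4, 4 for the five crowd workers:

```python
from collections import Counter

def aggregate(labels, threshold):
    """Aggregate annotator labels for one document-headline pair:
    return 'entail' or 'non-entail' when at least `threshold`
    annotators agree on it; otherwise set the pair aside ('other')."""
    counts = Counter(labels)
    for label in ("entail", "non-entail"):
        if counts[label] >= threshold:
            return label
    return "other"

print(aggregate(["entail", "entail", "non-entail"], threshold=2))  # entail
```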

Improving the truthfulness of data
Based on the analysis in the previous section, we can consider two strategies to improve the task setting: using full-text articles as source documents instead of lead sentences, or removing non-entailment instances from the dataset. Although the former strategy reduces the ratio of non-entailment pairs to 7.2% (English Gigaword) and 5.8% (JAMUL), we must consider the trade-off: the use of full-text articles increases the cost of training and may decrease the quality of headlines because of the longer inputs to encoder-decoder models. Furthermore, JNC does not provide full-text articles but only the lead three sentences. Therefore, we take the latter strategy: removing non-entailment pairs from the supervision data for headline generation.

Recognizing textual entailment
In order to find non-entailment pairs in the dataset, we build a binary classifier that judges whether a source document entails its headline. Recently, pretrained language models such as BERT (Devlin et al., 2019) have shown remarkable advances on the task of recognizing textual entailment (RTE) 8. Thus, we fine-tune pretrained models on supervision data for the entailment relation between source documents and their headlines. For the English Gigaword dataset, we use a pretrained RoBERTa-large model fine-tuned on Multi-Genre Natural Language Inference (MultiNLI) (Williams et al., 2018). We further fine-tuned the model on the supervision data of lead-headline pairs with entailment labels (acquired in Section 3). Here, the supervision data include lead-headline pairs where two or three human subjects labeled either entail or non-entail; other pairs were excluded. In this way, we obtained a binary entailment classifier with 91.7% accuracy on a hold-out evaluation (761 training and 179 test instances) after running 10 epochs of fine-tuning on the RoBERTa model.
For JNC, we use a pretrained BERT model for Japanese text (Kikuta, 2019). However, no large-scale Japanese corpus for semantic inference (a counterpart to MultiNLI) is available. Thus, we created supervision data for the entailment relation between the lead three sentences and the headline (lead3-headline, hereafter) on JNC. We extracted 12,000 lead3-headline pairs from JNC and collected entailment labels via crowdsourcing. Each pair received five entailment labels from five crowd workers. We used lead3-headline pairs where four or five crowd workers labeled either entail or non-entail; other pairs were unused. The entailment classifier fine-tuned on this supervision data achieved 83.9% accuracy on a hold-out evaluation with 5,033 training and 1,678 test instances.
Applying the entailment classifiers to the training and development sets of the English Gigaword dataset and JNC, we removed the instances judged as non-entailment pairs by the classifiers. Eventually, we obtained 2,695,325 instances (71% of the original training instances) on the English Gigaword dataset and 841,640 instances (49% of the original training instances) on JNC.
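The filtering step amounts to applying the classifier to every pair. A minimal sketch with a pluggable classifier; the toy lexical stand-in below is only for illustration, not the learned RoBERTa/BERT model:

```python
def filter_pairs(pairs, entails):
    """Keep only (source, headline) pairs judged as entailed.
    `entails` is any (source, headline) -> bool callable; in the paper
    it is the fine-tuned RoBERTa (Gigaword) or BERT (JNC) classifier."""
    return [(s, h) for s, h in pairs if entails(s, h)]

# Toy stand-in classifier: entailed iff every headline token appears
# in the source (the real classifier is learned, not lexical).
def toy_entails(source, headline):
    return set(headline.split()) <= set(source.split())

pairs = [("home sales rose last month", "home sales rose"),
         ("talks resume tuesday", "talks collapse")]
print(filter_pairs(pairs, toy_entails))
# [('home sales rose last month', 'home sales rose')]
```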

Improving the truthfulness of models
In this section, we examine whether the supervision data built in the previous section reduces untruthful headlines.

Headline generation models
We use fairseq (https://github.com/pytorch/fairseq) as the implementation of the Transformer architecture (Vaswani et al., 2017) throughout the experiments. The hyperparameter configuration is: 6 layers in both the encoder and the decoder; 8 attention heads; hidden states of dimension 512; feed-forward hidden states of dimension 2048; a dropout rate and label smoothing of 0.1; and the Adam optimizer with β = 0.98, a learning rate of 0.0005, and 4,000 warm-up steps.
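A fairseq-train invocation matching these hyperparameters might look as follows; the data path, the β1 value (0.9, a common default), and the scheduler choice are assumptions rather than the authors' exact command:

```shell
# Sketch of a fairseq-train call for the stated configuration;
# data-bin path and a few bookkeeping flags are placeholders.
fairseq-train data-bin/gigaword \
    --arch transformer --encoder-layers 6 --decoder-layers 6 \
    --encoder-attention-heads 8 --decoder-attention-heads 8 \
    --encoder-embed-dim 512 --decoder-embed-dim 512 \
    --encoder-ffn-embed-dim 2048 --decoder-ffn-embed-dim 2048 \
    --dropout 0.1 --label-smoothing 0.1 \
    --criterion label_smoothed_cross_entropy \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000
```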
We train Transformer models on the supervision data with and without the non-entailment instances. Because removing non-entailment instances decreases the number of training instances, we also apply a self-training strategy (Murao et al., 2019) to obtain the same amount of training data as the full supervision data. More specifically, we generated headlines for the source documents discarded in Section 4.1 and added the pairs of source documents and generated headlines as pseudo supervision data. The experiments compare models trained on the full supervision data (full), the data filtered by the entailment classifier (filtered), and the filtered data augmented by self-training (filtered+pseudo).
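The self-training augmentation can be sketched as follows; `generate` stands for the decoding function of the Transformer trained on the filtered data:

```python
def build_pseudo_pairs(discarded_sources, generate):
    """Self-training sketch: label the discarded source documents with
    model-generated headlines and reuse them as pseudo supervision.
    `generate` is any source -> headline function."""
    return [(src, generate(src)) for src in discarded_sources]

# Toy generator standing in for the trained Transformer.
def toy_generate(src):
    return " ".join(src.split()[:3])

print(build_pseudo_pairs(["home sales rose sharply last month"],
                         toy_generate))
# [('home sales rose sharply last month', 'home sales rose')]
```

Because the pseudo headlines are produced by a model trained only on entailed pairs, they are expected to stay closer to their source documents than the discarded reference headlines.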

Data preparation
The experiments use the same split into training (3.8M instances), development (390k instances), and test (380k instances) sets as Rush et al. (2015). In this study, we used 10,000 instances for evaluation, sampled from the test set and unused in the analysis in Section 3. We do not apply any replacement operations to the English Gigaword dataset (digit masking, mapping rare words to UNK, or lower-casing). The dataset is tokenized by WordPiece (Wu et al., 2016) with the same vocabulary used in UniLM.
Splitting JNC into 1.7M training and 3k development instances, we evaluate the model on the JAMUL dataset. We use SentencePiece 10 (Kudo and Richardson, 2018) for tokenization.

Evaluation protocol
We evaluate the quality of generated headlines by using full-length F1 ROUGE scores 11, following previous work. However, Kryściński et al. (2019) reported that ROUGE scores between system and reference summaries have only a weak correlation with human judgments. Furthermore, we would like to confirm whether the filtering strategy can improve the truthfulness of the model. Therefore, we also report the support score, the ratio of entailment relations between source documents and generated headlines measured by the entailment classifiers (explained in Section 4.1), and a human evaluation of truthfulness.

Results

Table 5 shows the main results. The baseline model with the full training data obtained a 35.80 ROUGE-1 score on the English Gigaword dataset and a 48.08 ROUGE-1 score on JAMUL. The entailment filter lowered ROUGE scores on both datasets, probably because of the smaller number of training instances, but the self-training strategy improved ROUGE scores on the Gigaword dataset, outperforming the baseline model. In contrast, the self-training strategy did not improve ROUGE scores on JAMUL. Although it is difficult to pinpoint the exact cause, we suspect that the filtering step reduced the training data too much (0.8M instances) for the self-training method to be effective. Another possibility is that the writing style of the articles in non-entailment pairs in JNC/JAMUL is so distant that the self-training method generated headlines too different from the reference ones.
The column "Sup" presents the support score, computed as the recall-oriented ROUGE-1 between source documents and generated headlines (explained in Section 2.2). The table indicates that the filtering and self-training strategies obtain higher support scores than the baseline. Figures 3 and 4 depict histograms of the support scores for the baseline and filtered+pseudo settings on Gigaword and JAMUL, respectively. We confirmed that the filtered+pseudo strategy increased the number of headlines with high support scores.
The column "Entail" shows the entailment ratio measured by the entailment classifier. Again, the filtered+pseudo strategy obtained the highest entailment ratio on both the Gigaword dataset and JAMUL. Although this result may seem natural because we selected training instances with the same entailment classifier, it is interesting that we can control the entailment ratio without changing the model. In order to examine whether the filtering strategy delivers noticeable improvements for human readers, we asked a human subject to judge the truthfulness of the headlines generated under the baseline and filtered+pseudo settings. Presented with both a source document and a headline generated by the model, the human subject judged whether the headline was truthful, untruthful, or incomprehensible. We conducted this evaluation on 109 instances randomly sampled from the test sets of Gigaword and JAMUL.
The "Truthful" column in Table 5 reports the ratio of truthful headlines. Consistently with the entailment ratio, we confirmed that the filtered+pseudo strategy generated more truthful headlines than the baseline setting on both datasets. During the human evaluation, one instance in each of the full and filtered+pseudo settings from the Gigaword dataset was judged as incomprehensible.

Discussion
To sum up the results, improving the truthfulness of the supervision data does help improve the truthfulness of generated headlines. We confirmed the improvements through the support scores, entailment ratios, and human judgments. However, the ROUGE scores between system and reference headlines did not indicate a clear difference.
The ROUGE metric was proposed to measure the relevance of a summary when extractive summarization was the central approach (in the early 2000s). Obviously, the truthfulness of summaries is out of the scope of ROUGE. The experimental results in this paper suggest that we should consider both relevance and truthfulness when evaluating the quality of abstractive summarization.

Table 5: Results on the test set. We used F1 full-length ROUGE scores: R-1 (ROUGE-1), R-2 (ROUGE-2), and R-L (ROUGE-L). "Sup" denotes the support score. "Entail" presents the percentage of outputs for which the entailment classifier (built in Section 4.1) predicts the entailment relation. "Truthful" shows the percentage of outputs that a human subject judged as truthful headlines.
Related Work

Rush et al. (2015) first applied the neural sequence-to-sequence (seq2seq) architecture (Sutskever et al., 2014; Bahdanau et al., 2015) to abstractive summarization. They obtained a dataset for abstractive summarization from English Gigaword (Graff and Cieri, 2003; Napoles et al., 2012). After this work, a large number of studies followed the task setting (Takase et al., 2016; Zhou et al., 2017; Cao et al., 2018a; Song et al., 2019). Some researchers pointed out that abstractive summarization models based on seq2seq sometimes generate summaries with inaccurate facts. Cao et al. (2018b) reported that 30% of the summaries generated by a seq2seq model include facts different from the source articles. In addition, Kryściński et al. (2019) reported that ROUGE scores have only a weak correlation with human judgments in abstractive summarization and that the current evaluation protocol is inappropriate for factual consistency.
Several studies approach the problem of inconsistency between input and output by improving the model architecture or the learning method. Cao et al. (2018b) applied an information extraction tool to extract tuples of subject, predicate, and object from source documents and utilized them as additional input to the model. Another study incorporated an entailment classifier as a reward in reinforcement learning. Guo et al. (2018) presented a multi-task learning method combining summarization and entailment generation, in which hypotheses entailed by a given document (as a premise) are generated. Li et al. (2018) introduced an entailment-aware encoder-decoder model to ensure the correctness of the summary. Kiyono et al. (2018) reduced incorrect generations by modeling token-wise correspondences between input and output. Falke et al. (2019) proposed a re-ranking method for beam search based on factual correctness predicted by a textual entailment classifier.
As another direction, Kryscinski et al. (2019) evaluated the factual consistency of a source document and the generated summary with a weaklysupervised model.
A few studies raised concerns about the dataset and task setting. Tan et al. (2017) argued that lead sentences do not provide an adequate source for the headline generation task. They reported that making use of multiple summaries as well as the lead sentence of an article improved the performance of headline generation on the New York Times corpus. In contrast, our paper is the first to analyze the truthfulness of existing datasets and generated headlines, provide a remedy for the supervision data, and demonstrate the importance of truthfulness in headline generation.

Conclusion and future work
In this paper, we showed that current headline generation models yield unexpected words. We conjectured that one of the reasons lies in a defect of the task setting and dataset, where generating a headline from the source document is impossible because of the insufficiency of the source information. We presented an approach for removing from the supervision data the headlines that are not entailed by their source documents. Experimental results demonstrated that the headline generation model trained on the filtered supervision data showed no clear difference in ROUGE scores but remarkable improvements in automatic and manual evaluations of the truthfulness of the generated headlines. We also demonstrated the importance of evaluating truthfulness in abstractive summarization.
In the future, we will explore more sophisticated methods to improve the relevance and truthfulness of generated headlines, for example, removing only the deviating spans in untruthful headlines rather than removing the untruthful headlines entirely from the supervision data. Other directions include an extensive evaluation of the relevance and truthfulness of abstractive summarization and the establishment of an automatic evaluation metric for truthfulness.
Moreover, it will also be interesting to see whether the same issue occurs in other related tasks such as data-to-text generation. We believe that the concerns raised in this paper are beneficial to other tasks.

Entail
• All facts of the headline are covered by those of the article.
• If the headline includes an expression that does not appear in the article, but the fact mentioned by the expression can be derived from the article, judge the pair as "Entail".

Non-entail
• The statement of the headline conflicts with the article.
• The headline mentions facts that cannot be confirmed by the article.

Incomprehensible
• Impossible to judge because the article or headline is unreadable. If the headline is not grammatically complete but is correct as headline style, please try to judge either entail or non-entail.
• Other problems such as garbled characters.

B Examples

Figure 6 shows some examples of headlines generated by the models described in Section 5. In the first example, the baseline model added "in Kashmir" to the headline, but this is incorrect: the correct location is in Southern Egypt, which was mentioned in the reference headline. The filtered+pseudo model generates a safe headline. The second headline generated by the baseline includes the verb 'begin', although the report was written two years earlier. In another example, the baseline model added "dollar lower against yen" to the headline. There is indeed a correlation that the dollar is lower against the yen when Tokyo stocks rise, but we cannot confirm the fact from the source document.