Fill in the BLANC: Human-free quality estimation of document summaries

We present BLANC, a new approach to the automatic estimation of document summary quality. Our goal is to measure the functional performance of a summary with an objective, reproducible, and fully automated method. Our approach achieves this by measuring the performance boost gained by a pre-trained language model with access to a document summary while carrying out its language understanding task on the document’s text. We present evidence that BLANC scores have as good correlation with human evaluations as do the ROUGE family of summary quality measurements. And unlike ROUGE, the BLANC method does not require human-written reference summaries, allowing for fully human-free summary quality estimation.


Introduction
Summaries have a real-world job to do: Save the reader from having to read a document by giving her its gist. A good summary does this by fluently, concisely, and accurately conveying the most important information. The abstract above is an example. You need not read further.
But since we have your attention: Consider the two most widely used methods for measuring the quality of a summary, ROUGE [1] and human evaluation [2].
The ROUGE family of methods is well-defined and reproducible. However, these methods typically require a human-written reference summary or summaries for comparison, completely disregarding the original document text. Even if one assumes that a reference summary is available and of optimal quality, the ROUGE method is limited to measuring a mechanical overlap of text tokens with little regard to semantics. This deficiency may be partially addressable through measurement of the similarity not of text tokens but of named entities or other preprocessed features [3,4,5,6,7], as well as by measuring the overlap between summary and document text [8].
Human evaluation of summary quality is far more meaningful and powerful than ROUGE, but it is far less reproducible. Summary quality estimation is a cognitively demanding and highly subjective task. Humans are also vulnerable to biases, such as the preference for phrases and sentences copied directly from the document text into summaries, perhaps due to the greater convenience of finding the relevant section of the document during evaluation [9]. Improving human evaluation may require prompting labelers to pay closer attention [10], as well as splitting quality scores into multiple dimensions such as fluency, informativeness, and factual correctness [2,11,12]. Even if humans can be trained to be more reliable, reproducible estimators of summary quality, they will forever remain a slow, expensive, limiting resource.
One possible route to a better automatic method for summary quality estimation is to train a model on document summaries annotated with human quality scores [13,14,15]. Such a model could be used to evaluate summaries without further human involvement. But even if such a model could achieve high agreement with human labelers, its performance would only be as objective and reproducible as the summary quality scores generated by one particular group of humans on a particular group of documents. Such a model may not generalize beyond the domain and style of the training samples unless they are a massive, representative sample of all documents of interest.

arXiv:2002.09836v1 [cs.CL] 23 Feb 2020
A more fundamental approach to the problem is to estimate how "helpful" a summary is for the task of understanding a text. For example, this might be achieved through a series of questions and answers [16,17,18]. However, this approach faces a limitation similar to ROUGE in that one must choose from a vast set of questions one might ask of a text, presupposing knowledge of the document itself and seriously limiting its reproducibility.
In the following section we suggest a new approach that is fundamentally justifiable as an estimator of summary quality, as well as being conceptually simple and reproducible.

Introducing BLANC
Consider these ingredients for an ideal summary quality estimator.
First, it should measure the functional qualities of a summary as directly as possible. The ideal estimator would directly test how helpful a summary is to its readers. Thus in some sense, the estimator itself should be a reader.
Second, the estimator should be intellectually flexible and omnivorous like our human reader. It should reliably estimate quality across a broad range of document domains and styles. And yet it should achieve this without requiring ornate preconditions and presuppositions about the text being summarized. If this estimator relies upon an existing base model, that model should be well-documented, well-understood, widely used, and open source.
We propose BLANC (see footnote 1) as a replacement for the ROUGE family of summary quality estimators.
We define BLANC, in broad terms, as a measure of how well a summary helps an independent pre-trained language model while it performs its language understanding task on a document. We begin by focusing on the masked token task, also known as the Cloze task [19], in which a model is challenged to reconstruct obscured spans of text.
For the first version of BLANC we use the well-known BERT language model [20], which has been pre-trained to predict masked text tokens. The BERT tokenizer represents the majority of the most frequently used words as single tokens, while splitting less common words into two or more. For this first version of BLANC, we must choose whether to restrict the kinds of tokens used in the measurement. We must also specify exactly how to measure the "helpfulness" of the summary. Thus, like ROUGE, the BLANC method is extendable to a family of measures. By varying the base language model, tokenization rules, and other relevant parameters, varying sensitivity may be achieved.
BLANC scores should strongly correlate with the summary quality scores of human judges, even when split into multiple quality dimensions. These are the key intuitions:
1. A more informative summary should help a language model with tasks such as sentence entailment prediction and masked token reconstruction.
2. A less factually correct summary should be less helpful to a language model performing inference on the document's text.
3. A more fluent summary should be better understood by a language model because it better matches the distribution of text on which it was pre-trained.
We explore two versions of BLANC, which we dub BLANC-help and BLANC-tune. These measures are described in detail in the following sections. The essential difference between them:
1. BLANC-help uses the summary text by directly concatenating it to each document sentence during inference.
2. BLANC-tune uses the summary text to fine-tune the language model, and then processes the entire document.
Thus with BLANC-help, the language model refers to the summary each time it attempts to understand a part of the document text. With BLANC-tune, by contrast, the model learns from the summary first, and then uses its gained skill to help it understand the entire document.

BLANC-help
The algorithm for obtaining BLANC-help scores is illustrated in Figure 1.

Footnote 1: According to ancient tradition, we should adorn our newly created jargon term with a bacronymic justification. The term BLANC is a nod to its proud lineage of French color words that began with the BLEU method for evaluating machine translation and ROUGE for summarization. BLANC is also a reference to the method's core task of "filling in the blanks" in the masked token task. But to honor tradition we offer this: Bacronymic Language model Approach for summary quality estimatioN. Cool?

Figure 1: BLANC-help of summary quality is defined by the difference in accuracy of two reconstructions of masked tokens: with summary vs. filler concatenated in front of the sentence with masked tokens. The model input is a summary (or filler) + sentence with masked (grey) tokens. The model output is the unmasked tokens.

There are many possible choices for how to mask the tokens. Our aim is to mask approximately 15% of tokens in a sentence, and evenly cover all tokens.
The unmasking is done twice for each sentence of the text and for each allowed choice of masked tokens in the sentence. First, the unmasking is done for input composed of the summary concatenated with the sentence. Second, the unmasking is done for input composed of a "filler" concatenated with the sentence. The filler has exactly the same length as the summary, but each summary token is replaced by a period symbol ("."). After iterating over all sentences and over all the allowed choices of masking, we end up with four total counts of successful and unsuccessful unmasking, $S_{ij}$, $i = 0, 1$; $j = 0, 1$. Here the index $i$ equals 0 or 1, for unsuccessful (0) or successful (1) unmasking for the filler-input version. The index $j$ is defined the same way for the summary-input version. For example, $S_{01}$ is the total number of cases where the filler-input version was unsuccessful and the summary-input version was successful.
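As an illustration only, the two inputs compared for each masked sentence might be built as in the following Python sketch. It assumes whitespace-separated tokens and a literal "[MASK]" placeholder; the function names `make_filler` and `make_inputs` are hypothetical, and a real BERT pipeline would use its own tokenizer and special tokens.

```python
def make_filler(summary_tokens):
    """Return a filler with one period per summary token (same length)."""
    return ["."] * len(summary_tokens)


def make_inputs(summary_tokens, masked_sentence_tokens):
    """Build the summary-input and filler-input versions of one sentence."""
    input_help = summary_tokens + masked_sentence_tokens
    input_base = make_filler(summary_tokens) + masked_sentence_tokens
    return input_help, input_base


# Toy example (assumed tokens, not taken from the paper's data).
summary = "the cat sat".split()
sentence = ["the", "[MASK]", "sat", "on", "the", "mat"]
input_help, input_base = make_inputs(summary, sentence)
```

Because both inputs have the same number of tokens, any difference in unmasking accuracy cannot be attributed to input length.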
We define two simple versions of BLANC-help for summary quality estimation. The first one is the most obvious:

$$\mathrm{Measure}_{\mathrm{relative}} = \frac{S_{01} - S_{10}}{S_{00} + S_{11} + S_{01} + S_{10}}$$

It is defined as the total improvement in accuracy of unmasking, normalized by the total number of predictions.
The second version ignores the "abnormal" cases where the summary is worse than the filler, and considers only the improvement:

$$\mathrm{Measure}_{\mathrm{improve}} = \frac{S_{01}}{S_{00} + S_{11} + S_{01}}$$

This second version probably gives less weight to possible factual errors and hallucinations within the summary, but as such may better correlate with one-dimensional human quality scores assigned to summaries.
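The two measures can be computed from paired success flags as in the minimal Python sketch below. The helper names and the toy flags are assumptions made for illustration; returning 0.0 for an empty denominator is also a choice of this sketch, not something the method specifies.

```python
from collections import Counter


def blanc_counts(base_correct, help_correct):
    """Count S_ij: i = filler-input success, j = summary-input success."""
    counts = Counter()
    for base_ok, help_ok in zip(base_correct, help_correct):
        counts[(int(base_ok), int(help_ok))] += 1
    return counts


def measure_relative(counts):
    """(S01 - S10) / (S00 + S11 + S01 + S10)."""
    total = sum(counts.values())
    return (counts[(0, 1)] - counts[(1, 0)]) / total if total else 0.0


def measure_improve(counts):
    """S01 / (S00 + S11 + S01)."""
    denom = counts[(0, 0)] + counts[(1, 1)] + counts[(0, 1)]
    return counts[(0, 1)] / denom if denom else 0.0


# Toy flags for five masked words (illustrative only).
base = [0, 0, 1, 1, 0]
helped = [1, 0, 1, 0, 1]
counts = blanc_counts(base, helped)
relative = measure_relative(counts)  # (2 - 1) / 5
improve = measure_improve(counts)    # 2 / (1 + 1 + 2)
```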
The algorithm for BLANC-help is shown in more detail in Figure 2.
We explored several variations in calculating this measure, finding no significant difference in results. First, since the BERT model deals with tokens rather than words, we can choose to mask tokens rather than words. In typical news documents, for example, only about 10% of words are split by the BERT tokenizer into two or more tokens. Such "composite" words (those not existing in the BERT vocabulary) should be particularly valuable in estimating the helpfulness of a summary. In a version dealing with tokens rather than words it is natural to always allow masking of composite words regardless of their length.
Secondly, the setting $L_{min} = 4$ allows the masking only of sufficiently long words (4 or more characters), because shorter (more common) words are typically easier to predict, with or without the help of a summary. When a word is too short to be masked, it can simply be skipped; the next possible word is then $M$ positions to the right, as follows from Figure 2. Another option is to compensate for non-masked words by picking up whichever next word is long enough.
The value $M = 6$ in Figure 2 is a natural choice because the standard BERT model is trained by masking 15% of tokens, which makes about one-sixth or one-seventh of tokens eligible to be masked. Changing $M$ between 5 and 8 does not strongly affect the measures. We found that altering the filler has a negligible effect on the measures. The reason we use the filler is to avoid any effect of the length of input on the action of the model.
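The masking schedule with parameters M and L_min can be sketched in Python as follows. This is a word-level illustration under stated assumptions: offsets are zero-based (Figure 2 counts from 1 to M), too-short words are simply skipped, and offsets that yield no maskable word are dropped, which is a choice of this sketch. The function name `masking_choices` is hypothetical.

```python
def masking_choices(words, M=6, L_min=4):
    """Yield, for each offset, the word positions to mask in one sentence."""
    for i_start in range(M):
        positions = [
            i for i, word in enumerate(words)
            if (i - i_start) % M == 0 and len(word) >= L_min
        ]
        if positions:  # skip offsets where every candidate word is too short
            yield positions


# Toy sentence (illustrative only).
words = "Summaries have a real world job to do for every reader".split()
choices = list(masking_choices(words))
```

Each eligible word is masked in exactly one of the choices, so iterating over all offsets covers the sentence evenly while masking roughly one word in M at a time.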

BLANC-tune
The algorithm for obtaining BLANC-tune is illustrated in Figure 3. For calculating this measure, the model first learns from the summary, and then we observe how helpful this learning was in reconstructing masked tokens in text sentences.
As in the case of BLANC-help, we define BLANC-tune by comparing the accuracy of two reconstructions: one that does use the summary, and another that does not. In the case of BLANC-help, this was the difference between placing the summary vs. placing the filler in front of a sentence. Now, in the case of BLANC-tune, we compare the performance of a model fine-tuned on the summary text vs. a model that has never seen the summary.
The task, using a model to unmask tokens, is performed the same way as for BLANC-help, except that the input is simply a document sentence with masked tokens.
The tuning of the model is done on an extremely small dataset (derived from the summary text), in which each sample is the very same summary but with different tokens masked. The masking in the summary is done according to the original BERT pre-training strategy. Unmasking must be performed for 15% of tokens, selected at random, of which 80% are masked, 10% are replaced by random tokens, and 10% are left unchanged. To ensure coverage of tokens, we select and shuffle all eligible tokens, and then go through them to generate samples for the BLANC-tune dataset.
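A possible way to generate such a tuning dataset is sketched below in Python. It works on whitespace-separated words rather than BERT tokens, and the function name `tuning_samples`, the toy vocabulary, and the lower bound of one masked position per sample are all assumptions of this sketch rather than details given in the paper.

```python
import random


def tuning_samples(words, vocab, p_mask=0.15, L_min=4, n_passes=2, seed=0):
    """Build (inputs, labels) samples that together cover all eligible words."""
    rng = random.Random(seed)
    # Assumption: mask at least one position even for very short summaries.
    n_mask = max(1, int(len(words) * p_mask))
    samples = []
    for _ in range(n_passes):
        positions = [i for i, w in enumerate(words) if len(w) >= L_min]
        rng.shuffle(positions)
        for start in range(0, len(positions), n_mask):
            chosen = positions[start:start + n_mask]
            inputs = list(words)
            for pos in chosen:
                r = rng.random()
                if r < 0.8:
                    inputs[pos] = "[MASK]"           # 80%: mask
                elif r < 0.9:
                    inputs[pos] = rng.choice(vocab)  # 10%: random token
                # remaining 10%: leave the word unchanged
            labels = {pos: words[pos] for pos in chosen}
            samples.append((inputs, labels))
    return samples


# Toy summary (illustrative only).
words = "summaries should convey the most important information in a document".split()
samples = tuning_samples(words, vocab=sorted(set(words)))
```

With a fixed seed the generated dataset is reproducible, and every eligible word is a prediction target exactly once per pass.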
And as is the case for BLANC-help, we present two simple versions of BLANC-tune: $\mathrm{Measure}_{\mathrm{relative}}$ and $\mathrm{Measure}_{\mathrm{improve}}$.
The algorithm for BLANC-tune is shown in more detail in Figure 4.

Given: summary; text; model; probability p_mask = 0.15 of masking a word at tuning; minimum allowed length of a word to be masked L_min = 4; number N of passes through tokens masking for tuning.
# Tune the model on the summary:

Figure 4: BLANC-tune for quality of summary.

Similar to BLANC-help, there can be several variations of the measure. The details described in the previous section for BLANC-help are now applicable here in two parts of the algorithm where we must select masked tokens: for the tuning dataset, and for the inference. Any fixed version of the measure is reproducible, given a fixed seed for randomness at the tuning. In our tuning we used the same optimizer and learning rate as was used by the open source huggingface repository [21] for training, and we found that dependency on the seed is very weak.
While BLANC-tune appears more complicated than BLANC-help, and does run several times slower (depending on parameters), it is a promising method in that learning from a summary is separated completely from the task of understanding the document, with no messy concatenation required. While we mostly use BLANC-help for the presentation of our approach in this paper, in future work we will systematically explore BLANC-tune. It is interesting to note that, regardless of the version of BLANC, the values of BLANC-help and BLANC-tune have a Pearson correlation of more than 90%, with negligible p-values. We made these observations for summaries generated by three different models on typical news texts, the same models and texts that we used in the human evaluations described below.

Extractive summaries: no-copy-pair guard
In the case of purely extractive summaries, the process of calculating BLANC scores may pair a summary with sentences from the text that have been copied into the summary. This exact sentence copying should be unfairly helpful in unmasking words in the original sentence. This effect may be reduced or completely eliminated by using a stronger underlying language model, especially for BLANC-tune. But a simpler solution is to include a simple guard rule into the measure: We may exclude any pairing of exact copy sentences from the calculation of the measure. This guard against pairing exact copies of sentences can be added as one of two versions:
1. In the process of iterating over text sentences, whenever a sentence has an exact copy in the summary, it is skipped.
2. In the process of iterating over text sentences, whenever a sentence has an exact copy in the summary, the copy of the sentence is removed from the summary (only for this specific step in the process).
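The two versions of the guard can be sketched in Python as follows. The sketch assumes the text and summary are already split into sentences and that an "exact copy" means string equality; the function names `guard_skip` and `guard_remove` are hypothetical.

```python
def guard_skip(text_sentences, summary_sentences):
    """Version 1: skip text sentences that are copied verbatim in the summary."""
    copies = set(summary_sentences)
    return [(s, list(summary_sentences)) for s in text_sentences if s not in copies]


def guard_remove(text_sentences, summary_sentences):
    """Version 2: keep every text sentence, but drop its copy from the summary
    for that sentence only."""
    return [
        (s, [t for t in summary_sentences if t != s])
        for s in text_sentences
    ]


# Toy sentences (illustrative only).
text = ["A b c.", "D e f.", "G h i."]
summary = ["D e f.", "X y z."]
pairs_v1 = guard_skip(text, summary)
pairs_v2 = guard_remove(text, summary)
```

Each returned pair is a text sentence together with the summary sentences it may be scored against. Version 2 retains more text sentences, at the cost of a summary that is shorter for the affected steps.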
Throughout this paper we do not use the "no-copy-pair" guard we suggested here, except in the corner case consideration of copying random sentences from the text, as described in the next section.

Basic validation of BLANC measurement
As part of the validation of these new measures we performed experiments to determine how a substitution of an obviously bad summary affects the measure. One example is a summary generated by selecting random words from the text. The random words summary is generated with the same length as the original summary. Our original summaries are generated for randomly selected daily news by three different methods: by Microsoft's abstractive UniML model [22], by Primer's semi-abstractive bullet summarization model (based on [23]), and by Primer's extractive LexRank model (based on [24]). The summaries generated by these models are far from flawless and vary widely in overall quality when evaluated by human labelers. Nonetheless, we should be concerned if BLANC does not assign lower quality to equal-length summaries consisting of randomly chosen words from the document.
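A random-words baseline of this kind might be generated as in the Python sketch below. It assumes that "same length" is measured in whitespace-separated words and that words are drawn with replacement; both are assumptions of this sketch, and the function name `random_words_summary` is hypothetical.

```python
import random


def random_words_summary(text, summary, seed=0):
    """Draw as many random words from the text as the summary contains."""
    rng = random.Random(seed)
    text_words = text.split()
    n_words = len(summary.split())
    return " ".join(rng.choice(text_words) for _ in range(n_words))


# Toy document and summary (illustrative only).
text = "The council met on Monday to discuss the new budget for city schools"
summary = "Council discusses school budget"
fake = random_words_summary(text, summary)
```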
In another validation experiment, we generate a "random sentences summary", which is constructed from the sentences of a document. For this example, we apply BLANC-help with the "no-copy-pair" guard introduced above. But we use the second version of the guard rule, because it is less exclusive of text sentences overall, and we also compensate for the length of the summary by replacing the copy-sentence of the summary with another sentence, rather than simply removing the copy-sentence. Moreover, we repeat the selection of a random-sentences "summary" 10 times for each text sentence, and take the average of the results after the entire run.
BLANC-help results for both examples (in comparison to the measure of the original summaries) are shown in Figure 5.
We can see that the measure value for the real generated summary is always higher than the value for the random-words summary or for the random-sentences summary. This confirms that the measures take into account the context as well as the informativeness of the summary to assess the quality.
Selecting only summaries with exactly three sentences, we can observe how BLANC-help deteriorates if we spoil some of the sentences of the summary. We replace one, two or all three sentences with random words, keeping the same length of the resulting randomized summary as the original summary. We also take care to run on each possible choice of replacement sentences twice, and average the resulting BLANC-help. The result is shown in Figure 6.

Comparison with human evaluation scores
The BLANC measures do not require any "gold-labeled" data: No human-written summaries nor human-annotated quality scores are needed. Theoretically, the measures should reflect how fluent, informative, and factually correct a summary is, simply because only fluent, informative, correct summaries are helpful to the underlying language model. We now turn to the question of whether the BLANC measures correlate with summary quality scores assigned by human readers.
Human scoring is fallible; a correlation with human scores should not be considered as a full validation of our measures, but rather as an independent confirmation that the measures are sensible.
For purposes unrelated to this study, we have undertaken a series of human evaluations of many generated summaries of approximately similar length. As mentioned in the previous section, the summaries were generated by Microsoft's abstractive UniML model [22], by Primer's semi-abstractive model [23], and by Primer's extractive LexRank model [24]. The summaries from the latter two sources were "equalized" in length to the UniML summaries, so that at least on average the summaries from all three generation sources would be equal in length, and also so that most summaries would not differ significantly in length. Altogether, we assembled 555 summary-text pairs for human scoring, with the texts taken from the CNN / Daily Mail dataset [25].

Figure 5: BLANC-help of a generated summary vs. random-words summary (left) and BLANC-help of a generated summary vs. random-sentences "summary" (right). The random-words summary is produced from random words of the same text, filled to the same length as the generated summary. The random-sentences summary is calculated with the no-copy-pair guard rule (version 2), but compensating for the summary length by adding more random sentences to the summary whenever needed.

The values of the correlations are illustrated in Figure 7. The green step-function shows the correlation of each annotator's score (with Id ranging from 1 to 10) with the averaged score of the 9 other annotators. The number of samples used for the plot is 555: the summaries generated by the three models. The red and blue lines show the correlations of BLANC-help and ROUGE, respectively, with the averaged score of all 10 annotators. ROUGE here is calculated using the google-research package [28] as the F1 value of "rougeL" (lower blue line on the plot) and the F1 value of "rougeLsum" (upper blue line).
The difference between the two versions of BLANC-help is negligible here (of order 0.001).
The yellow line in the figure shows how the simplest combination of BLANC-help and ROUGE correlates with the annotators. The "BLANC-help + ROUGE-Lsum" is literally a simple sum of BLANC-help and ROUGE-Lsum. As usual, a blending of two different models produces better results, though it is not our purpose here to fit human scores, and we do not fit the weights in the sum. (For example, using $\mathrm{score} = 3 \cdot \mathrm{measure}_{\mathrm{help}} + \mathrm{rouge}_{\mathrm{Lsum}}$, with the weight 3 for BLANC-help, would increase the correlation with human scores by 1%.)
All shown correlations have negligible p-values, of order $10^{-6}$ and lower.
We observe that both BLANC-help and ROUGE correlate with annotators as well as or better than about 30% of annotators.
A direct comparison of our measure with ROUGE for human-created reference summaries of CNN / Daily Mail would be more dramatic: ROUGE has zero correlation with human scores simply because it equals 1 for all the reference summaries, and there are no alternative human summaries available for this dataset. BLANC-help correlates with the average annotator score for human-created summaries at 0.28.
In Figure 8 we present correlations with human scores on summaries generated for 100 typical daily news documents. The summaries were generated by the same three models; there were 300 summary-text pairs for scoring, again by 10 annotators. Since there are no "gold-labeled" summaries for these news documents, there is no ROUGE score in the figure. For the BLANC measure here, again the difference between the "relative" and "improve" versions is negligible.
As we see in all these examples, the human-human agreement is not impressive. We have observed from yet another evaluation dataset that if the texts and the generated summaries are challenging with very low inter-annotator agreement, the correlation of our measure with human scores is similarly diminished (with borderline p-values).
The values assigned to human scores (0, 1, 2, 3, 4) are not a perfect translation of the human perception of the corresponding labels ("very bad", "bad", "OK", "good", "very good"). From multiple evaluations unrelated to this study we know that when an evaluation is repeated, human annotators are far likelier to substitute "OK" and "good" with each other than other tags. When we obtain an averaged human score, a weighting of the values (0, 1, 2, 3, 4) with weights = (3.0, 3.0, 1.0, 1.0, 2.0) may be more in line with human perception, but the resulting changes to the results presented here are not essential, of order 1%.
A few simple observations may serve as evidence that our measure deals with the length of a summary more reasonably than either humans or ROUGE. In Table 1 we show the correlations with the summary length and with the compression factor, which is defined as the ratio of summary length to document text length. The length here is the number of characters. The table is based on the same data as Figure 7. We see from the table that, similarly to humans, our measure is helped by longer summaries in general. But unlike humans, it is much more sensitive to a summary's compression factor. A disregard for the compression factor by humans may be caused by the anchoring effect.
Table 2 gives similar insight for a very different kind of document: random daily news, the same as were used for Figure 8.
Simple correlation with a consensus score of annotators is not an easy criterion for judging the usefulness of the measure. When annotators are tasked with scoring several different qualities of a summary, their final score for the overall quality should be more grounded, because more attention has been spent on the summary and the text. In Figure 9 we show values of correlations obtained from such an evaluation (undertaken for purposes unrelated to this study). The data used here are the same as the data for Figure 8: summaries generated on randomly selected daily news documents. For this illustration, however, we split our 10 annotators into a small group of 3 and an "others" group of the remaining 7. There are 120 ways to choose the split, hence there are 120 groups of annotators on the X-axis. The circle markers show human-human correlation, i.e. the correlation between the average score of the small group and the average score of the "others" group. The plus markers show a measure-human correlation, i.e. a correlation of the measure with the "others" group of annotators. Hence we see how well the measure performs against the small group of 3 annotators in correlating with the "others" group. For simplicity of the presentation, each type of correlation was sorted independently. If a correlation is completely unreliable (p-value > 0.05) then the marker is not shown.
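The split procedure can be sketched in Python as below. The scores here are randomly generated toy values, not data from this study, and the sketch computes Pearson correlations only, without the p-value filter used for the figure. The function names `pearson` and `split_correlations` are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean
import random


def pearson(x, y):
    """Plain Pearson correlation; returns nan if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


def split_correlations(scores, measure, small_size=3):
    """For every small group, return (human-human, measure-human) correlations
    against the average score of the remaining annotators."""
    n_annotators, n_items = len(scores), len(measure)
    results = []
    for small in combinations(range(n_annotators), small_size):
        others = [a for a in range(n_annotators) if a not in small]
        small_avg = [mean(scores[a][k] for a in small) for k in range(n_items)]
        others_avg = [mean(scores[a][k] for a in others) for k in range(n_items)]
        results.append((pearson(small_avg, others_avg),
                        pearson(measure, others_avg)))
    return results


# Toy, randomly generated scores (NOT data from the paper).
rng = random.Random(0)
toy_scores = [[rng.randint(0, 4) for _ in range(20)] for _ in range(10)]
toy_measure = [rng.random() for _ in range(20)]
results = split_correlations(toy_scores, toy_measure)
```

With 10 annotators and groups of 3, the loop visits the 120 splits mentioned above; sorting the two lists of correlations independently would reproduce the layout of the figure.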
We see that human annotators are in good agreement (compared to the measure-human agreement) on how informative or how understandable the summary is. But the measure achieves better agreement on the fluency and overall quality of the summary. We are careful not to draw any bold conclusion from this. Both disagreement and agreement between annotators may be due to very different reasons, and also the values of measure-human correlations for different qualities are closer to each other than the values of human-human correlations.

Conclusion
In this paper we present BLANC, a new family of objective and reproducible measures of document summary quality estimation. We introduce a working version of the method that uses BERT, a widely used pre-trained language model. BLANC does not require prior knowledge about a summarized document, nor human-written reference summaries, nor "gold-labeled" summary quality scores. In our approach the quality measurement of a summary is based on how helpful it is for understanding the document.
By comparison, it is difficult to suspend disbelief when considering a method like ROUGE that does not include the document itself when estimating the quality of a summary.It is notable that ROUGE scores are often cited even as applied to automated headline generation [29,30,31,32] where it is hard to imagine that any single headline could be confidently regarded as the best possible headline for a document.
We note that an intriguing byproduct of BLANC calculation is a heat map across the document text showing the locations where the summary was helpful (contributing to the count $S_{01}$ and thus increasing the measure value), and also the places in the text that may help in understanding shortcomings of the summary (contributing to the count $S_{10}$). This may provide the basis of an entirely novel method for semantically mapping the relation between paired texts.
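One simple way to assemble such a map from per-word outcomes is sketched below in Python. The record format (word position, filler-input success, summary-input success) and the +1 / -1 scoring are assumptions of this sketch; the paper does not prescribe how the heat map is built.

```python
def helpfulness_map(num_words, records):
    """Accumulate +1 where only the summary input succeeded (S01)
    and -1 where only the filler input succeeded (S10)."""
    heat = [0] * num_words
    for position, base_ok, help_ok in records:
        if help_ok and not base_ok:
            heat[position] += 1
        elif base_ok and not help_ok:
            heat[position] -= 1
    return heat


# Toy outcomes for a four-word document (illustrative only).
records = [(0, False, True), (2, True, False), (3, True, True), (0, False, True)]
heat = helpfulness_map(4, records)
```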
In future work we will explore alternative language models and pre-training tasks for improving BLANC. We will also delve deeper into the correlations between BLANC and the various dimensions of summary quality scores. Based on early results, as illustrated in the previous section, we observe that our first version of BLANC-help already correlates well with human evaluations along several dimensions: fluent, understandable, informative, compact, and overall. It will be valuable to explore whether these correlations can be altered by changes to the filtering rules for token masking. For example, higher weights for masked named entities may better reflect the informativeness of a summary. And finally, an exploration of BLANC scores and summary factual accuracy is underway.
The measure of summary quality we introduce here is reasonable and reproducible. It does demand more processing power than ROUGE. But further refinement may yield a faster, lighter model. It may also be possible to train a fast classification or regression model on summary-document pairs labeled with BLANC scores.

Figure 2: BLANC-help for quality of summary.

Figure 3: BLANC-tune of summary quality is defined by the difference in accuracy of two reconstructions of masked tokens: with model tuned on the summary vs. with the original model. Both models are given the same input: a sentence with masked (grey) tokens. Each model outputs the unmasked tokens.

Figure 6: BLANC-help for 3-sentence summaries with one or more sentences replaced by random words from the text. Left: the summaries in each case are sorted by their measures. Right: the summaries are sorted by measure of the original summary.

Figure 7: Spearman (left) and Pearson (right) correlations with human annotators. The green step-line is the level of correlations of one of the annotators with all other annotators. The correlation of BLANC-help with an average over all annotators is shown by the red line. The blue lines correspond to ROUGE-L and ROUGE-Lsum, and the yellow line to a simple sum "BLANC-help + ROUGE-Lsum". The summaries were generated on the CNN / Daily Mail texts.

Figure 8: Spearman (left) and Pearson (right) correlations with human annotators. The green step-line is the level of correlations of one of the annotators with all other annotators. The correlation of BLANC-help with an average over all annotators is shown by the red line. The summaries were generated on regular news documents: There are no reference summaries, and hence no ROUGE score.

Figure 9: Spearman (left) and Pearson (right) correlations with a group of 7 annotators. The x-axis depicts the 120 ways to choose 3 annotators out of 10. The circle-markers show the correlation of the average score of 3 annotators with the average score of the 7 other annotators. The plus-markers show the correlation of BLANC-help with the 7 annotators. Each type of correlation was sorted independently, left-to-right. Markers with p-values > 0.05 are not shown.
Algorithm of Figure 2 (BLANC-help):

Given: summary; text; model; parameters M = 6, L_min = 4
Initialise filler as a string of the same length as summary, filled by periods
Initialise S_00, S_01, S_10, S_11 to zero
for sentence in text:
    for i_start in range from 1 to M:
        In sentence, mask each i-th word if (i − i_start) % M == 0 and if the word's length >= L_min
        Make input_help = concatenate(summary, sentence)
        Make input_base = concatenate(filler, sentence)
        Apply the model to predict the masked words for both inputs.
        for each masked word:
            S_00 += 1 if both predictions are wrong.
            S_11 += 1 if both predictions are correct.
            S_01 += 1 if prediction for input_base is wrong and for input_help is correct.
            S_10 += 1 if prediction for input_base is correct and for input_help is wrong.
Measure_relative = (S_01 − S_10) / (S_00 + S_11 + S_01 + S_10)
Measure_improve = S_01 / (S_00 + S_11 + S_01)

Algorithm of Figure 4 (BLANC-tune), continuing from "# Tune the model on the summary:":

N_mask = number of words in summary multiplied by p_mask and cast to integer
Initialize empty dataset dataset_tune for tuning the model
for i in range from 1 to N:
    Create list positions_mask of positions of all summary words longer than L_min.
    Random shuffle positions_mask
    until all positions are used:
        Take next N_mask positions from positions_mask and mask the words at these positions.
        Add the summary with masked words to dataset_tune
Tune the model on the dataset_tune. Result: model_tuned
# Compare inference using the model vs. model_tuned:
Initialise S_00, S_01, S_10, S_11 to zero
M = integer(1 / p_mask)
for sentence in text:
    for i_start in range from 1 to M:
        In sentence, mask each i-th word if (i − i_start) % M == 0 and if the word's length >= L_min
        Apply the model to predict the masked words in sentence.
        Apply the model_tuned to predict the masked words in sentence.
        for each masked word:
            S_00 += 1 if both predictions are wrong.
            S_11 += 1 if both predictions are correct.
            S_01 += 1 if prediction by model is wrong and by model_tuned is correct.
            S_10 += 1 if prediction by model is correct and by model_tuned is wrong.
Measure_relative = (S_01 − S_10) / (S_00 + S_11 + S_01 + S_10)
Measure_improve = S_01 / (S_00 + S_11 + S_01)

Table 1: Correlation of different quality estimators with length of summary and with compression. The compression is defined as length of summary divided by length of text, in characters. The no-correlation cases (p-value > 0.05) are left empty. Based on CNN / Daily Mail news.

Table 2: Correlation of different quality estimators with length of summary and with compression. The compression is defined as length of summary divided by length of text, in characters. Based on randomly selected daily news documents.