HOLMS: Alternative Summary Evaluation with Large Language Models

Efficient document summarization requires evaluation measures that can not only rank a set of systems based on an average score, but also highlight which individual summary is better than another. However, despite very active research on summarization approaches, few works have proposed new evaluation measures in recent years. The standard measures relied upon for the development of summarization systems are most often ROUGE and BLEU which, despite being efficient for overall system ranking, remain lexical in nature and have limited potential when it comes to training neural networks. In this paper, we present a new hybrid evaluation measure for summarization, called HOLMS, that combines language models pre-trained on large corpora with lexical similarity measures. Through several experiments, we show that HOLMS substantially outperforms ROUGE and BLEU in its correlation with human judgments on several extractive summarization datasets, for both linguistic quality and pyramid scores.

Generating human-readable summaries of textual documents is poised to remain a key technology in our information era. Whether summarizing news, scientific articles, encyclopedias, or social media posts, the demand for faster consumption of the most relevant information is expected to grow hand in hand with the amount of information available online.
A current bottleneck to a wider adoption of automatic summarizers is the lack of efficient solutions addressing both the relevance of the generated summaries and their linguistic quality. One component of this current limitation is the lack of efficient evaluation measures that address both aspects.
The two most cited and widely adopted measures in various summarization datasets and summarization challenges are ROUGE (Lin and Hovy, 2003; Lin, 2004) and BLEU (Papineni et al., 2002). While both measures have been shown to be highly efficient baselines in ranking summarization systems based on their average score (Dang and Vanderwende, 2007; Dang and Owczarzak, 2008; Dang and Owczarzak, 2009; Owczarzak, 2010; Owczarzak and Dang, 2011), fewer studies have examined their relevance for individual summary ranking and their adequacy with linguistic quality and fluency.
With the new levels of performance achieved by neural language models in a variety of natural language processing tasks, several insights point towards the high potential of their contextual language encoding for language representation. Most of the state-of-the-art models such as T5 (Raffel et al., 2019), BERT (Devlin et al., 2019), GPT (Radford et al., 2018;Radford et al., 2019) and the Universal Sentence Encoder (Cer et al., 2018), are built from very large corpora, which reduces substantially the potential bias from using them to evaluate summaries in a restricted document set or benchmark.
On the other hand, the shallow lexical features of the original texts play a key role in extractive summarization and in pointer-generator approaches. It can also be argued that lexical similarities with gold summaries are implicitly capturing an important portion of the relevant semantics, especially at high similarity values.
In this paper, we propose a new evaluation measure for summarization, called HOLMS, relying on both contextual neural representations and lexical similarities. To capture the salient indicators from each side more efficiently, HOLMS relies on a multi-dimensional Gaussian function combining ROUGE with a sequential similarity measure based on neural embeddings. We motivate and present each component in more detail in section 3.
We evaluate HOLMS on five different summarization datasets by computing its correlation with human judgments for content relevance and linguistic quality, and show that HOLMS has higher Pearson, Spearman, and Kendall correlations than state-of-the-art measures on both aspects.
In the next section, we discuss related work on summarization evaluation. We present HOLMS in detail in section 3 and our experiments in section 4. Finally, we discuss the results in section 5 before concluding.

Related Work
Summary evaluation has been studied from several perspectives, including the similarity of candidate summaries with reference human summaries, their intrinsic linguistic quality and coherence, and their relevance w.r.t. the original document (Cabrera-Diego and Torres-Moreno, 2018; Lloret et al., 2018). In this paper, we focus on extrinsic evaluations when human-generated summaries are available, as they offer both more specific parameters for the task and available benchmarks with human-generated scores for automatically generated summaries. In Table 1, we present several summarization evaluation measures and their main characteristics. In terms of usage, ROUGE and BLEU stand out as the most cited measures, although a large portion of BLEU citations are likely from language translation papers rather than from research works on summarization.
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation (Lin, 2004). It allows evaluating system-generated summaries by comparing them with (ideal) summaries created by humans. It relies on computing the ratio of overlapping units between the two summaries and has several variants according to the unit type: e.g., unigrams, n-grams, or skip-grams.
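As a concrete illustration, a minimal sketch of ROUGE-1 recall (whitespace tokenization and lowercasing are simplifying assumptions, and the function name is illustrative; the official ROUGE toolkit additionally supports stemming and stopword-removal options):

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """ROUGE-1 recall: fraction of reference unigrams covered by the
    candidate, with repeated words clipped to their reference counts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], c) for w, c in ref.items())
    return overlap / sum(ref.values()) if ref else 0.0
```

For example, a candidate covering 3 of the 6 reference unigrams gets a recall of 0.5.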
BLEU was designed for the evaluation of language translation systems (Papineni et al., 2002). It is a precision metric that counts a unigram as correct if it occurs in a reference translation, and tackles redundancies by clipping the number of times a candidate word can be counted to its maximum number of occurrences in the gold/reference translation.
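The clipping mechanism can be sketched for unigrams as follows (illustrative names; full BLEU also combines precisions over several n-gram orders and applies a brevity penalty):

```python
from collections import Counter

def clipped_unigram_precision(candidate, reference):
    """BLEU-style clipped unigram precision: each candidate word counts
    at most as many times as it occurs in the reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    clipped = sum(min(c, ref[w]) for w, c in cand.items())
    return clipped / sum(cand.values()) if cand else 0.0
```

The degenerate candidate "the the the the" against the reference "the cat" thus scores 1/4 rather than 4/4, which is exactly what clipping is designed to prevent.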
Other metrics have been proposed, mostly in the first decade of the twenty first century, in the context of the DUC and TAC challenges, with a few recent exceptions. For instance, ROUGE 2.0 was proposed in 2018 as an extension of ROUGE that relies on Wordnet synonyms (Pedersen et al., 2004) and main topics in addition to the original tokens. AutoSummENG is a lexical method relying on computing similarities between the n-gram graphs of candidate summaries and gold summaries. Edges represent proximity between two n-grams and are weighted by the number of co-occurrences of their vertices (connected n-grams) in a specified window of text.
One of the closest approaches to our work considered applying the ROUGE metric using word embeddings (Ng and Abrecht, 2015). Instead of computing word matching in a binary fashion as in ROUGE, the authors consider a word or n-gram similarity to be either 0 if the candidate word is out of vocabulary, or equal to the dot product of their embeddings otherwise. While this approach made use of word embeddings, it still required the words to be present in both the gold summary and the candidate summary, which makes it subject to the same limitation as ROUGE, i.e., the lack of semantic generalization that would allow matching synonyms and paraphrases. Results-wise, their approach was tested on only one dataset, and substantially underperformed ROUGE-SU4 and other ROUGE variants in terms of Pearson's correlation.
In contrast, HOLMS does not restrict the coverage of neural representations with lexical constraints, but explores new associative ways to get the best of both worlds. The embeddings component of HOLMS also relies on a sequential similarity function that grants increasingly more weight to (matching) sub-sequences with higher embeddings similarity. As far as we know, HOLMS is the first summarization evaluation measure that (a) takes into account embeddings similarity with a sequential perspective, and (b) uses a Gaussian function to combine lexical and embeddings-based similarities.

HOLMS
HOLMS stands for Hybrid Lexical and MOdel-based evaluation of Summaries. It relies on both deep contextual embeddings and lexical similarities.
The Embeddings-based Similarity (ES) used in HOLMS is computed sequentially. The intuition behind ES is that the embeddings similarity between a word in a system summary and a word in a reference summary should be more pronounced if the contiguous words also have high embeddings similarity. Such a perspective can also be extended to consecutive n-grams to capture conceptual similarity between sequences.
Concretely, to compute ES, a first step is to transform the input texts from both the gold and system summaries into two sets of consecutive n-grams: G = {g_1, ..., g_l} for the gold summary, and A = {a_1, ..., a_m} for a system summary. An embedding vector V_L(x) is then generated for each n-gram x from a given language model L.
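This first step can be sketched as follows (whitespace tokenization and lowercasing are simplifying assumptions, the function name is illustrative, and the embedding function V_L is left abstract):

```python
def consecutive_ngrams(text, n=4):
    """Split a summary into its overlapping consecutive n-grams.
    Texts shorter than n yield a single (shorter) tuple."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return [tuple(tokens)] if tokens else []
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

Each n-gram would then be mapped to a vector by the chosen language model before the matching step below.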
A second step is to compute the best distance value Dist_A(g_i) for each gold n-gram g_i, with a filtering method where each system n-gram a ∈ A can only be used once as the best match for any given gold n-gram g ∈ G:

Dist_A(g_i) = min_{a ∈ A_k} euc(V_L(g_i), V_L(a))

with euc being the Euclidean distance, and A_k the dynamic set computed by removing the previously matched n-grams from A.
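A minimal sketch of this filtering step on toy vectors (function names are illustrative; in HOLMS the vectors would be the n-gram embeddings produced by the language model L):

```python
import math

def euc(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def best_distances(gold_vecs, system_vecs):
    """For each gold n-gram embedding (in order), find the closest
    remaining system n-gram, then remove it from the candidate pool so
    each system n-gram is matched at most once."""
    remaining = list(system_vecs)
    dists = []
    for g in gold_vecs:
        if not remaining:
            break  # leftover gold n-grams are left unmatched in this sketch
        j = min(range(len(remaining)), key=lambda k: euc(g, remaining[k]))
        dists.append(euc(g, remaining[j]))
        remaining.pop(j)
    return dists
```

The second assertion below shows the effect of the filtering: once the perfect match is consumed by the first gold n-gram, the second is forced onto a farther candidate.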
The last step is inspired by TextFlow (Mrabet et al., 2017), and consists of using the (consecutive) distance values as coordinates on a curve and computing the area under the curve as the overall distance. This is best shown with an example, as in figure 1. The two example sentences, S and G, are from a news article 1 , and were slightly modified for ease of presentation:
• G: "The man who ate the $120K banana art on the wall said that he was not sorry and that he was performing art by eating it."
• S: "An artist claimed he was performing art by eating a banana used as a center piece in the art work of a colleague."
Drawing the curve connecting the distance values as coordinates adds more weight to highly matching sub-sequences and less weight to weakly matched sub-sequences. The final value of ES is then computed as the complement of the area under the curve, normalized by |G| × Max_euc to obtain values in the [0, 1] range.
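Under one reading of this step (a piecewise-linear curve through the consecutive distance values with unit steps on the x-axis, and a trapezoidal area; the exact curve construction is described in the TextFlow paper), the final ES value can be sketched as:

```python
def embedding_similarity(dists, max_euc):
    """ES sketch: complement of the trapezoidal area under the curve of
    consecutive distance values, normalized by |G| * Max_euc so that the
    result lands in [0, 1]. `dists` is the output of the matching step."""
    if not dists:
        return 0.0
    # Area under the piecewise-linear curve through the distances,
    # treating consecutive gold n-grams as unit steps on the x-axis.
    area = sum((dists[i] + dists[i + 1]) / 2.0 for i in range(len(dists) - 1))
    return 1.0 - area / (len(dists) * max_euc)
```

A perfectly matched summary (all distances 0) yields ES = 1, and larger areas, i.e., longer stretches of poorly matched n-grams, push ES toward 0.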
The Lexical Similarity used in HOLMS is the ROUGE-1 recall. There is no restriction on the lexical similarity measure, or ensemble of measures, that could be used for the lexical component of HOLMS. We picked ROUGE-1 for our first proof of concept as it has shown good potential in ranking systems based on their average performance and is also widely used in the community. For the remainder of the paper, we will refer to ROUGE-1, ROUGE-2, and ROUGE-SU4 as R_1, R_2, and R_SU4.
The Combination of both aspects (lexical and neural) needs to take into account the implicit and relative links existing between shallow lexical similarities and embeddings-based similarity. To this effect, we build HOLMS using a bounded three-dimensional Gaussian function that further highlights the summary pairs on which both measures agree, by exponentially promoting strong agreements and downgrading strong disagreements. The function has its peak at 1 for perfect agreement on the quality of a summary and a low at 0 for total disagreement between the two measures (cf. figure 2).
HOLMS(x, y) = exp( -( (x - x_0)^2 / (2σ_x^2) + (y - y_0)^2 / (2σ_y^2) ) )

with x_0 and y_0 the coordinates of the ES and R_1 peaks, and σ_x^2 and σ_y^2 the respective spreads. In practice, the values on the X and Y axes (representing the values of the evaluation measures) will never exceed the x and y coordinates of the peak, hence the bounded nature of the function.
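The combination can be sketched directly (illustrative function name; peaks and spreads default to 1, matching the setting used in section 4):

```python
import math

def holms(es, r1, x0=1.0, y0=1.0, sx=1.0, sy=1.0):
    """Bounded Gaussian combination of the embeddings-based similarity
    (es) and ROUGE-1 recall (r1): peaks at 1 when both measures reach
    their peak coordinates (x0, y0), decays exponentially otherwise."""
    return math.exp(-(((es - x0) ** 2) / (2 * sx ** 2)
                      + ((r1 - y0) ** 2) / (2 * sy ** 2)))
```

With both inputs in [0, 1] and peaks at (1, 1), the exponent is always non-positive, so the score stays in (0, 1].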

Evaluation
We used 4-grams as our units to compute the ES and HOLMS values. The peaks and spreads are set to 1. To compare HOLMS with a hybrid baseline, we also compute the correlation results for linear, a simple equal-weighted linear combination of ES and R 1 .
We tested several variations of ES using three different sources for neural embeddings:
• The BERT large uncased model, based on transformers and trained on large corpora such as BooksCorpus and Wikipedia using word masking and next-sentence prediction.
• The Universal Sentence Encoder (Cer et al., 2018), based on auto-encoders. We used the transformer-based version in our experiments instead of the slightly less accurate deep averaging version.
• GloVe embeddings (Pennington et al., 2014), based on both word co-occurrence and local context windows.
In our preliminary experiment on the TAC 2011 dataset (Owczarzak and Dang, 2011), we observed that both the BERT embeddings and the GloVe embeddings substantially underperformed the embeddings from the Universal Sentence Encoder (USE), by more than 51%. We experimented with both the [CLS] token embedding and the sum of the token embeddings from BERT, and with different vector dimensions for GloVe. Subsequently, we used only the Universal Sentence Encoder for our extended experiments on the 5 datasets. In future work, we plan to test more recent language models such as T5 (Raffel et al., 2019) and SciBERT (Beltagy et al., 2019).
We compute the correlations of HOLMS with human judgments of automatic summaries on 5 datasets from the summarization benchmarks introduced in the DUC and TAC challenges from 2007 to 2011 (Dang and Vanderwende, 2007; Dang and Owczarzak, 2008; Dang and Owczarzak, 2009; Owczarzak, 2010; Owczarzak and Dang, 2011). We use the three standard correlation measures: Pearson correlation (P), Spearman rank coefficient (S), and Kendall's rank coefficient (K).
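These three measures can be sketched in pure Python (illustrative names; in practice one would use a statistics library; this sketch breaks rank ties arbitrarily and uses Kendall's tau-a):

```python
import math

def pearson(x, y):
    """Pearson correlation (P) between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank coefficient (S): Pearson over the ranks.
    Ties are broken arbitrarily in this sketch."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    return pearson(ranks(x), ranks(y))

def kendall(x, y):
    """Kendall's tau-a (K): (concordant - discordant) / all pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            d = (x[i] - x[j]) * (y[i] - y[j])
            s += (d > 0) - (d < 0)
    return s / (n * (n - 1) / 2)
```

Here `x` would hold a metric's scores (e.g., HOLMS) and `y` the human judgments over the same summaries or systems.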
Each dataset consists of a set of source documents, a set of human-generated summaries for each document (called models in the challenges), and manual annotations of the relevance score for each candidate summary generated by the participating systems. The score for each candidate summary is computed as the average similarity between the system summary and the set of reference summaries. For content relevance, human assessors first selected the important summary content units (SCUs), then used the pyramid method (Passonneau et al., 2005) to automatically score the system-generated summaries based on the SCUs. In addition to content relevance scoring, the assessors also annotated the system summaries with linguistic quality scores ranging from 1 to 5 for the DUC 2007, TAC 2008, and TAC 2011 datasets. Scores ranging from 1 to 5 were first assigned to assess the system summaries according to five questions or aspects: grammaticality, non-redundancy, referential clarity, focus, and structure and coherence. The final linguistic quality score is then obtained by averaging the answers/scores given to all sub-questions. Further details about the data collection and annotation methods are described in the challenges' overview papers and websites (Dang and Vanderwende, 2007; Dang and Owczarzak, 2008; Dang and Owczarzak, 2009; Owczarzak, 2010; Owczarzak and Dang, 2011).
In the DUC and TAC editions, the ROUGE measure achieved state-of-the-art performance on its correlation with human judgments for both content relevance and linguistic quality. However, most evaluations of the TAC and DUC tracks relied only on average system scores and did not focus on ranking individual summaries.
Tables 2 and 3 present the correlations of BLEU, ROUGE, and HOLMS with the linguistic quality scores computed manually by the NIST assessors 2 . Table 2 presents the Pearson, Spearman, and Kendall correlations on the linguistic quality of individual summaries. Table 3 presents the correlations on the average linguistic quality scores of each participating system.
Tables 4 and 5 present the correlations with Pyramid scores computed manually by NIST assessors to measure the content relevance of the system summaries. Table 4 presents the Pearson's (P), Spearman's (S), and Kendall's (K) correlations on individual summaries.

Discussion
Common observations. BLEU substantially underperformed all other measures in our experiments, except for the correlation on the average system linguistic scores in DUC 2007, which is likely due to shorter summaries. This is most likely caused by the design of BLEU, which was aimed at sentence-level machine translation. To capture only the relative performance of each summary and each system, we normalized the BLEU scores by the maximum score obtained by a system/summary for a given document. This improved the results (presented in section 4) compared with the correlation of the raw BLEU scores, but was not enough to close the gap with the other evaluation measures. Another common observation is the relatively low correlation of all measures on linguistic quality when compared to content relevance correlations (pyramid scores). This can be explained in part by a higher level of subjectivity in linguistic quality judgments, inducing a higher assessor bias.
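The normalization described above can be sketched as follows (illustrative names; `scores_by_doc` maps each document to the per-system BLEU scores obtained on it):

```python
def normalize_per_document(scores_by_doc):
    """Divide each summary's BLEU score by the maximum score obtained
    for the same source document, keeping only relative performance."""
    out = {}
    for doc, scores in scores_by_doc.items():
        m = max(scores.values())
        out[doc] = {sys_id: (s / m if m > 0 else 0.0)
                    for sys_id, s in scores.items()}
    return out
```

After this step, the best system per document always scores 1, so correlations reflect relative rather than absolute BLEU values.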
Evaluation of the linguistic quality of individual summaries. HOLMS outperformed the ROUGE variants, its embeddings-based component ES and the linear baseline in terms of Pearson correlation with a relative improvement ranging from 3% to 10.1%. The fact that HOLMS led to an improvement over both of its components (ES and R 1 ) and over the linear baseline suggests that: • The embeddings-based sequential similarity ES and ROUGE-1 brought different but complementary perspectives on the linguistic quality of a given summary.
• HOLMS was better suited to take advantage of those different perspectives than the linear equalweighted baseline.
In terms of Spearman's and Kendall's correlation factors, the picture was slightly more nuanced, with ES performing better on TAC 2011, HOLMS performing better on TAC 2008, and ROUGE-2 performing better on DUC 2007. The macro-average, which is less prone to dataset-specific bias, showed that HOLMS performed better than all other measures.
Evaluation of the average linguistic quality of summarization systems. When analyzing the correlation values for average system scores, the benefits brought by HOLMS are even more substantial with a relative improvement of +25.2% on average (cf. table 3). The ablation study led to the interesting observation of the ROUGE variants outperforming the embeddings-based similarity and the linear combination. This suggests that: (1) lexical similarity measures can have a more relevant coarse-grained picture on system-level linguistic quality, and that neural language models are not necessarily better suited to rank extractive systems based on linguistic quality when gold summaries are available, and (2) neural language models still have a distinct perspective as shown by the better results obtained by HOLMS, and can make relevant hybrid methods more efficient at system ranking than lexical measures alone.
Evaluation of content relevance for individual summaries. The hybrid methods outperformed the ROUGE variants and the embeddings-based similarity (ES) on content relevance, with HOLMS performing better than the linear baseline on average in terms of Spearman and Kendall correlation factors, while maintaining comparable Pearson values (cf. table 4). In terms of components, the picture was clearer, with ROUGE-1 performing slightly better on average on the three correlation measures. This shows that when gold summaries are available, performing better than lexical evaluation measures on content relevance is not as straightforward as computing embeddings similarities or taking into account sub-sequence similarities with measures like ES. Going to the embeddings space seemed to result in a relatively small loss of information when compared to ROUGE-1 for individual summary ranking. One potential explanation is that at the scale of one (small) document, the precision of a restricted terminology is greater than the precision of large (contextual) language models. This finding, together with the higher performance of HOLMS on the evaluation of content relevance for individual summaries, supports the theoretical effectiveness of pointer-generator approaches that combine abstractive and extractive functions.
Evaluation of content relevance for summarization systems. The improvement provided by HOLMS on the evaluation of average system performance for content relevance was noticeable over all baselines and datasets, with an average increase in correlation of 3.2%, 3.2%, and 4.2% respectively for Pearson's, Spearman's and Kendall's correlation factors. The lexical and neural components had comparable correlation results, with the linear baseline consistently under-performing HOLMS. This validates further the observation that the methods used to combine the lexical and neural language model spaces for summary evaluation can play a key role in improving systems evaluation when designed appropriately.
On a more general note, the proximity of the correlation results obtained by ES and the ROUGE variants raises an interesting question about the codification of language (or meaning) by neural embeddings, and the extent to which their underlying representations provide an actual semantic generalization or rather a symbolic compression that remains tuned to data patterns that are both complex and shallow. What we can observe from our experiments is that the Gaussian combination through HOLMS outperforms both lexical and neural measures. This shows that embeddings do provide distinct and complementary information to the discrete/lexical information considered by ROUGE, but that the differences might not be as wide as could be expected. Moving forward, we are likely to need either substantially different language representation spaces, or the integration of a different source of semantics such as knowledge bases and their associated neural graph models.
Limitations. In this paper, we did not address abstractive summarization due to the lack of sufficiently large abstractive summarization datasets with human judgments of content quality and relevance. We expect the neural-embedding component of HOLMS to have a higher impact in generative summarization approaches, and acquiring or building such datasets is one of our short-term objectives.

Conclusions
We presented a new summarization evaluation measure, called HOLMS, based on a sequential n-gram embeddings similarity and ROUGE. In our experiments on 5 summarization evaluation benchmarks, HOLMS performed consistently better on average than its individual components, BLEU, and a linear combination baseline. HOLMS can also be used as a framework, as many more variations can be tested, including the use of consecutive skip-grams as input instead of n-grams, and the combination of sentence-level similarity and n/skip-gram similarity. Moving forward, we think that summarization systems should be evaluated on a combination of discrete and contextual or semantic evaluation measures. Such an extension fits naturally in HOLMS through the insertion of additional dimensions in its Gaussian function. Our results suggest that such a combination is likely to bring a higher level of correlation with human assessments of both linguistic quality and content relevance.