An Anchor-Based Automatic Evaluation Metric for Document Summarization

The widespread adoption of reference-based automatic evaluation metrics such as ROUGE has promoted the development of document summarization. In this paper, we consider a new protocol for designing reference-based metrics that requires the endorsement of the source document(s). Following the protocol, we propose an anchored ROUGE metric that fixes each summary particle on the source document, basing the computation on more solid ground. Empirical results on benchmark datasets validate that the source document helps induce a higher correlation with human judgments for the ROUGE metric. Being self-explanatory and easy to implement, the protocol can naturally foster various effective designs of reference-based metrics besides the anchored ROUGE introduced here.


Figure 1: The transition from the old-fashioned protocol (left: passive evaluation metrics, which compare the peer summary against the reference summary only) to the newly-introduced protocol (right: active evaluation metrics, which also refer to the source document) for designing reference-based automatic evaluation metrics in the document summarization task. The curved arrows on the right show that both summaries are derived from the source document.
Metrics designed under the new protocol are called "active metrics" since they are able to refer to the source. In a word, the new protocol introduces a key dimension that can nurture reference-based summarization metrics. For verification purposes, we propose an anchored version of the ROUGE metric under the new protocol. The anchors here are a set of lexical items (called particles) in the source document corresponding to a certain particle in the summary (peer or reference). Utilizing the anchor set in the computation of ROUGE introduces a weighting scheme that focuses more on the link to the source document, as will be detailed in the next section.

A Specific Implementation: Anchored ROUGE
Following the new protocol, the ROUGE metric can be revised by introducing an anchor set for each particle (i.e. lexical item such as an n-gram or skip-bigram) in both the peer and reference summaries. The anchor set for a summary particle comprises k particles in the source document, each of which is a good match for the summary particle. In other words, anchor sets serve as the grounding of summary particles.
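As a hypothetical illustration (not the authors' released code), the particle types mentioned above can be extracted as follows; the function names and the `max_gap` default for skip-bigrams are assumptions for this sketch:

```python
from typing import List, Tuple

def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Contiguous n-gram particles of a tokenized summary."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skip_bigrams(tokens: List[str], max_gap: int = 4) -> List[Tuple[str, str]]:
    """Skip-bigram particles: ordered word pairs at most max_gap positions apart."""
    pairs = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 1 + max_gap, len(tokens))):
            pairs.append((tokens[i], tokens[j]))
    return pairs
```

Any of these particle lists can then be fed into the anchor-set construction described next.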
We build the anchor set A_s for a summary particle s in two steps: (1) compute the cosine similarity between the embedding vectors of s and d, where d is an arbitrary document particle of the same lexical form as s (e.g. both bigrams); (2) extract the top-k document particles by similarity to form the anchor set, i.e. A_s = {d_s1, d_s2, ..., d_sk}. We also record the similarity as the strength of the anchor and denote the strength between s and d_si as q_si (1 ≤ i ≤ k). The embedding vector of a particle is obtained by averaging the contextualized embeddings of all tokens occurring in the particle. Specifically, in the following experiments, we sum the last four hidden layers of the pretrained uncased BERT-Base model (Devlin et al., 2019) to get the embedding of each token (the embedding dimension is 768). An example of an anchor set can be found in Figure 2.

Once all the anchor sets for summary particles (both peer and reference) have been built, the anchored version of ROUGE can be defined as follows. We compute the union of the anchor sets over all particles in the reference summaries and denote it C_ref. Eqn. 1 gives the formula of anchored ROUGE, and the function T is defined by Eqn. 2. Note that "RefSumm" denotes the collection of reference summaries, w_s is the count of particle s (with stemming) occurring in either the peer or the reference summary, and δ is the Kronecker delta function, i.e. it is 1 only when its two arguments are equal and 0 otherwise.

Table 1: Summary-level correlation results between reference-based automatic metrics and human judgments (k = 5 and n = 4). Best correlations are in bold; our proposed metrics are AncR-1 and AncR-2.
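The two anchor-building steps can be sketched in Python. The particle embeddings are assumed to be precomputed upstream (e.g. averaged BERT token vectors); all names here are illustrative assumptions, not the authors' code:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def build_anchor_set(s_vec, doc_particles, doc_vecs, k=5):
    """Return the top-k (document particle, strength q_si) pairs for a
    summary particle whose embedding is s_vec."""
    sims = [(cosine(s_vec, d_vec), d) for d, d_vec in zip(doc_particles, doc_vecs)]
    sims.sort(key=lambda x: -x[0])          # rank document particles by similarity
    return [(d, q) for q, d in sims[:k]]    # keep strengths alongside anchors
```

In practice this would be run once per summary particle, over document particles of the same lexical form.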

\[
\text{ROUGE-anchored} = \frac{\sum_{d \in C_{\text{ref}}} \min\!\big(T(d, \text{Peer}),\; T(d, \text{RefSumm})\big)}{\sum_{d \in C_{\text{ref}}} T(d, \text{RefSumm})} \tag{1}
\]

\[
T(d, \text{Summ}) = \sum_{s \in \text{Summ}} w_s \sum_{i=1}^{k} q_{si}\, \delta(d, d_{si}) \tag{2}
\]

The anchored metric above bases its computation on anchor sets residing in the source document. For comparison, consider the original ROUGE metric, in particular the equivalent definition of ROUGE-N given by Lin and Bilmes (2011) (Theorem 3 in that paper). The function T replaces the count of summary particles adopted in original ROUGE: for a specific document particle, it sums the weighted contributions from different summary particles, with the anchor strength q_si as the weight coefficient (Eqn. 2). The factor w_s accounts for multiple occurrences of the same summary particle s. In Eqn. 1, the min function computes the weighted matching degree based on the document particle d (so the overall metric is at most one), which revises the exact count of matching summary particles in original ROUGE. Through these manipulations, anchored ROUGE is endorsed by the source document and freed from the "hard matching" pattern of original ROUGE; its evaluation efficacy is tested in the next section.
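Given precomputed anchor sets with strengths, the computation described above can be sketched as follows. This is a minimal reconstruction from the paper's prose, not the authors' implementation; the data structures (anchor dictionaries mapping each summary particle to its (anchor, strength) pairs) are assumptions:

```python
from collections import Counter

def weighted_contrib(summ_particles, anchors):
    """T(d, Summ): for each document particle d, sum w_s * q_si over the
    summary particles s anchored on d."""
    w = Counter(summ_particles)          # w_s: occurrence count of each particle
    T = Counter()
    for s, count in w.items():
        for d, q in anchors.get(s, []):  # anchors[s] = [(d_s1, q_s1), ...]
            T[d] += count * q
    return T

def anchored_rouge(peer_particles, ref_particles, peer_anchors, ref_anchors):
    """Weighted matching degree over C_ref, the union of reference anchor sets."""
    T_peer = weighted_contrib(peer_particles, peer_anchors)
    T_ref = weighted_contrib(ref_particles, ref_anchors)
    C_ref = set(T_ref)
    num = sum(min(T_peer[d], T_ref[d]) for d in C_ref)
    den = sum(T_ref[d] for d in C_ref)
    return num / den if den else 0.0
```

Because the numerator takes the minimum against the reference contribution for every document particle, the score cannot exceed one, mirroring the recall-style normalization of ROUGE-N.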

Evaluation Efficacy of Anchored ROUGE
Datasets. We select two datasets of topic-focused multi-document summarization (MDS), TAC 2008 and TAC 2009, for two main reasons: (1) MDS is more challenging than single-document summarization, and summarizers tend to behave more differently, which fits the purpose of examining various metrics; (2) multiple reference summaries are offered, which makes it possible to perform a robustness test (see Table 2). The two datasets consist of 48 and 44 topics, respectively, each of which has 10 source documents and 4 reference summaries, i.e. n is 4. We only use document set A of the official datasets, in line with Louis and Nenkova (2013) and Gao et al. (2020). Additionally, TAC 2008 has 57 peer summaries for each topic while TAC 2009 has 55. All summaries are at most 100 words, and each peer summary is associated with a Pyramid score (Nenkova and Passonneau, 2004), which serves as the human judgment. For tuning the anchor set size (i.e. k in Section 2), another dataset (DUC 2007) is used.
Comparing metrics. The following reference-based metrics are involved in the experiment. (1) ROUGE (Lin, 2004): the traditional metric counting lexical-level overlap; two variants are considered, based on either unigrams (R-1) or bigrams (R-2). (2) ROUGE-WE (Ng and Abrecht, 2015): a metric based on word2vec embeddings (Mikolov et al., 2013) to compute semantic similarity; ROUGE-WE with unigrams (R-1-WE) and bigrams (R-2-WE) is computed. (3) S³ (Peyrard et al., 2017): a supervised metric; we report its best version S³_best. (4) Mover (Zhao et al., 2019): a metric based on Word Mover's Distance (Kusner et al., 2015); we report its best version with BERT embeddings and the fine-tuning and embedding-aggregation methods of the original paper. (5) ROUGE-anchored: our metric proposed under the new protocol as formulated in Section 2. Similar to ROUGE, we consider two variants with different particle granularities, i.e. unigram (AncR-1) and bigram (AncR-2). Tuning on DUC 2007 sets the anchor set size to 5.
Following convention, we compute the average summary-level correlation with human judgments for each metric in terms of three correlation coefficients: Pearson r, Spearman ρ, and Kendall τ.
Main results. As shown in Table 1, the overall correlation results prove the superiority of our anchored ROUGE metric. On both datasets, anchored ROUGE achieves the highest correlations according to all three correlation coefficients. More specifically, both AncR-1 and AncR-2 correlate better than their original counterparts (R-1 and R-2), with gaps over 2.5 and 1.3 percentage points, respectively. Even the most recent metric based on advanced contextualized embeddings, Mover, falls behind our metric (by over one point compared with AncR-1 on TAC 2008 and AncR-2 on TAC 2009). For a more convincing comparison, we conducted the pairwise Williams significance test recommended by Graham (2015) between our metric (specifically AncR-1 on TAC 2008 and AncR-2 on TAC 2009) and the other competitors; the results show that the improvements of our metric over all others except the supervised metric S³_best are statistically significant (p-value < 0.05).
Hyperparameter effect & robustness. Two extra tests further analyze our metric. The effect of the anchor set size k on Pearson correlations is illustrated in Figure 3, indicating that an anchor set of proper size is needed to establish the efficacy of our metric. The correlations deteriorate when k is less than three, and we see no substantial improvement with an extremely large k, which only incurs more intensive computation. The effect of the number of reference summaries is shown in Table 2. We use all available references to compute the metrics when n equals four, and n randomly selected references for smaller n (the average over all $\binom{4}{n}$ reference subsets is reported).
The observation is that our metric is relatively robust to n, which demonstrates that it is less prone to the reference noise observed by Kryscinski et al. (2019) or the reference bias introduced when very few reference summaries are available (Hermann et al., 2015; Grusky et al., 2018).
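For concreteness, the summary-level evaluation protocol used above (correlate metric scores with Pyramid scores within each topic, then average across topics) might be sketched as below. The data layout is an assumption, and the Kendall helper assumes no tied scores:

```python
def pearson(x, y):
    """Pearson r between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def kendall(x, y):
    """Kendall tau (concordant minus discordant pairs; assumes no ties)."""
    n = len(x)
    score = sum(
        1 if (x[i] - x[j]) * (y[i] - y[j]) > 0 else -1
        for i in range(n) for j in range(i + 1, n)
    )
    return score / (n * (n - 1) / 2)

def avg_summary_level(topics):
    """topics: list of (metric_scores, human_scores) pairs, one per topic.
    Returns per-topic correlations averaged across topics."""
    rs = [pearson(m, h) for m, h in topics]
    taus = [kendall(m, h) for m, h in topics]
    return sum(rs) / len(rs), sum(taus) / len(taus)
```

In a real evaluation one would typically use library routines (e.g. SciPy's correlation functions, which also handle ties and report Spearman ρ).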

Related Work
There are various reference-based automatic evaluation metrics for the task of document summarization. The most widely accepted metric is ROUGE (Lin, 2004), which focuses primarily on n-gram co-occurrence statistics. Several strategies have been proposed to replace the "hard matching" of ROUGE, such as the adoption of WordNet (ShafieiBavani et al., 2018) and the fusion of ROUGE and word2vec (Ng and Abrecht, 2015). Another promising line of designing metrics is to directly compute the semantic similarity of the peer and reference summaries, including metrics utilizing various word embeddings such as ELMo (Sun and Nenkova, 2019) and BERT (Zhang et al., 2019; Zhao et al., 2019). Furthermore, Zhang et al. (2020) propose a metric computing factual correctness based on information extraction. However, none of the above metrics fall under the newly-introduced protocol. Our anchored ROUGE is a refined metric that follows the new protocol and enjoys higher correlations with human judgments.

Conclusion
We propose a new protocol to foster the development of reference-based automatic metrics for evaluating document summarization. The protocol features the endorsement of the source document and can be implemented as an anchored version of the ROUGE metric that fixes each summary particle on the ground of the source document. Experiments demonstrate that anchored ROUGE correlates with human judgments more highly than competing metrics. Our metric is also robust to the number of reference summaries and can thus be applied in challenging low-resource settings. Future work includes extending the new protocol to derive further workable evaluation metrics besides anchored ROUGE.