A Graph-theoretic Summary Evaluation for ROUGE

ROUGE is one of the first and most widely used evaluation metrics for text summarization. However, its assessment relies solely on surface similarities between peer and model summaries, so ROUGE cannot fairly evaluate summaries that involve lexical variation and paraphrasing. We propose a graph-based approach, incorporated into ROUGE, that evaluates summaries based on both lexical and semantic similarities. Experimental results over the TAC AESOP datasets show that exploiting the lexico-semantic similarity of the words used in summaries helps ROUGE correlate significantly better with human judgments.


Introduction
Quantifying the quality of summaries is an important and necessary task in the field of automatic text summarization. Among the metrics proposed for this task (Hovy et al., 2006; Tratz and Hovy, 2008; Giannakopoulos et al., 2008), ROUGE (Lin, 2004) is the first and still the most widely used one (Graham et al., 2015). This metric measures the concordance between system-generated (peer) summaries and human-generated reference (model) summaries by counting matching n-grams, word sequences, and word pairs. ROUGE assumes that a peer summary is of high quality if it shares many words or phrases with a model summary. However, different terminology may be used to refer to the same concepts, so relying only on lexical overlap may underrate content quality scores. For clarity, consider the following two sentences: (i) They strolled around the city.
(ii) They took a walk to explore the town.
These sentences are semantically similar but lexically different. If one of them appears in a model summary while the peer summary contains the other, ROUGE and other surface-based evaluation metrics cannot capture their similarity owing to the minimal lexical overlap. We aim to help ROUGE identify the semantic similarities of linguistic items, and thereby tackle its main problem: a bias towards lexical similarity.
Considering senses instead of words, we use the Personalized PageRank (PPR) algorithm (Haveliwala, 2002) to perform repeated random walks on WordNet 3.0 (Fellbaum, 1998) as a semantic network. We disambiguate each word into its intended sense, and obtain a probability distribution over all senses in the network; the weights in this distribution denote the relevance of the corresponding senses. At each iteration, we measure semantic similarity by looking at the path taken by the random walker, and by weighting the overlaps between a pair of ranked PPR vectors. Our graph-based approach (ROUGE-G) computes semantic similarity scores between n-grams, along with their match counts, to perform both semantic and lexical comparisons of peer and model summaries. The experimental results indicate that ROUGE-G variants significantly outperform their corresponding ROUGE variants. Beyond enhancing the evaluation prowess of ROUGE, we believe that ROUGE-G, owing to its lexico-semantic analysis of summaries, has the potential to extend the applicability of ROUGE to abstractive summarization.

Background
In the summarization literature, a couple of ROUGE variants (i.e., ROUGE-1, ROUGE-2, ROUGE-SU4) are reported to have a strong correlation with human assessments, and are frequently used to evaluate summaries (Lin and Och, 2004; Owczarzak and Dang, 2011; Over and Yen, 2004). Although ROUGE is a popular evaluation metric, improving the current evaluation metrics is still an open research area. Many of these efforts are gathered and analyzed in the survey by Steinberger and Ježek (2012). Here, we briefly review the most significant ones. Introduced at DUC 2005, the Pyramid metric (Passonneau et al., 2005) became one of the principal metrics for evaluating summaries at the TAC conference. However, this metric is semi-automated and requires manual identification of summary content units (SCUs). Soon after, Hovy et al. (2006) proposed a metric based on the comparison of basic syntactic units (Basic Elements) between peer and model summaries. This metric, BE-HM, served as one of the baselines in the TAC AESOP task. Among the systems that participated in this task from 2009 to 2011, AutoSummENG (DemokritosGR) (Giannakopoulos et al., 2008), which compares graph representations of peer and model summaries, is one of the top performers. Recently, some evaluation metrics have studied the effectiveness of word semantic similarity for evaluating summaries that contain terminology variations and paraphrasing (Baroni et al., 2014; ShafieiBavani et al., 2017, 2018). For instance, an automated variant of the Pyramid metric uses distributional semantics to map text content within peer summaries to SCUs (Passonneau et al., 2013). A more recent metric, ROUGE-WE (Ng and Abrecht, 2015), enhances ROUGE by incorporating a variant of word embeddings, namely word2vec (Mikolov et al., 2013).

Graph-Theoretic Summary Evaluation
Given a pair of peer and model summaries, we compute PPR vectors at the following levels: (i) sense level, to disambiguate each word (having a set of senses); and (ii) n-gram level, to measure the semantic similarity. We compare the PPR vectors of each pair of n-grams using the following measures: (i) Path-based: considering the path that the random walker takes at each iteration to get to a particular node; (ii) Rank and Weight: weighting the overlaps between a pair of ranked PPR vectors.

Vector Representation
The WordNet graph has edges of various types, the main ones being hypernymy and meronymy, which connect nodes containing senses. However, we do not use these types, and instead treat each edge as an undirected semantic or lexical relation between two synsets. We utilize the WordNet graph enriched by connecting each sense, irrespective of its part-of-speech (POS), with all the other senses that appear in its disambiguated gloss (Pilehvar and Navigli, 2015). The dimension of the vector representation is the number of connected nodes in the graph. For clarity, we consider the adjacency matrix A of our semantic graph, and perform iterative random walks beginning at a set of senses S on WordNet with probability mass p^(0)(S), which is uniformly distributed across the senses s_i ∈ S, with the mass for all s_i ∉ S set to zero. This yields a multinomial distribution over all senses in WordNet, with higher probability assigned to senses that are frequently visited. The PPR vector of S is given by:

p^(x)(S) = d · A · p^(x−1)(S) + (1 − d) · p^(0)(S)    (1)

At each iteration, the random walker may follow one of the edges with probability d, or jump back to a node s_i ∈ S with probability (1 − d)/|S|. Following the standard convention, the damping factor d is set to 0.85. The number of iterations k is set to 20, which is sufficient for the distribution to converge.
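The PPR iteration described above can be sketched as follows. This is a minimal dense-matrix sketch over a toy graph; the column normalization of A, the function name, and all variable names are our own illustrative choices, not the actual WordNet implementation.

```python
import numpy as np

def ppr_vector(A, seed, d=0.85, k=20):
    """Personalized PageRank over an undirected graph.

    A    : (n, n) symmetric 0/1 adjacency matrix.
    seed : indices of the seed senses S.
    d    : damping factor; with probability d follow an edge,
           with probability (1 - d) jump back to a seed node.
    k    : number of iterations (assumed enough to converge).
    """
    n = A.shape[0]
    # Column-normalize A so each column is a transition distribution.
    M = A / np.maximum(A.sum(axis=0, keepdims=True), 1)
    # p0 is uniform over the seed set S, zero elsewhere.
    p0 = np.zeros(n)
    p0[list(seed)] = 1.0 / len(seed)
    p = p0.copy()
    for _ in range(k):
        p = d * (M @ p) + (1 - d) * p0
    return p
```

Because the transition matrix is column-stochastic and the teleport vector sums to one, the resulting vector remains a probability distribution, with mass concentrated around the seed senses.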

Comparing Vectors
Conventional measures for comparing PPR vectors calculate the probability that a random walker reaches a particular node after a specific number of iterations, which is potentially problematic (Rothe and Schütze, 2014). For example, consider the following chain of connected nodes:

law - suit - tailor - dress

The PPR vectors of suit and dress both place some weight on tailor, which is desirable. However, the PPR vector of law will also have a non-zero weight for tailor. Consequently, law and dress appear spuriously similar because of the node tailor.
To prevent this type of false similarity, the random walker needs to take into account the path walked to reach a particular node (Rothe and Schütze, 2014). We formalize this by defining the semantic similarity of two sets of nodes I and J as:

Sim_sem(I, J) = Σ_{x=1..k} c^x · W(p^(x)(I), p^(x)(J))    (2)

where the damping factor c, optimized on the TAC 2010 (Owczarzak and Dang, 2010) AESOP dataset and set to 0.7, ensures that early meetings are more valuable than later meetings. At each iteration x, we compare the PPR vectors by ranking their dimensions (senses) according to their values, and weighting the overlaps between them (Equation 3). Hence, we weight the similarity such that differences in the highest ranks (the most important senses in a vector) are penalized more than differences in lower ranks. This measure has proven superior to cosine similarity, Jensen-Shannon divergence, and Rank-Biased Overlap for comparing such vectors (Pilehvar et al., 2013).
W(Y, Z) = Σ_{h ∈ H} (r_h(Y) + r_h(Z))^(−1) / Σ_{i=1..|H|} (2i)^(−1)    (3)

where H is the set of senses with non-zero probability in both vectors Y and Z, and r_h(Y) denotes the rank of sense h in vector Y, with rank 1 being the highest. The denominator is a normalization factor that guarantees a maximum value of one. The minimum value, zero, occurs when there is no overlap, i.e., |H| = 0.
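The rank-and-weight comparison of Equation 3 can be sketched as follows; vectors are represented as dictionaries from senses to probabilities, and the function name is our own choice, not the original implementation.

```python
def weighted_overlap(y, z):
    """Rank-and-weight comparison of two PPR vectors (dicts sense -> prob).

    Overlapping senses are ranked within each vector (rank 1 = highest
    probability); agreement in the top ranks is rewarded most, and the
    score is normalized to [0, 1].
    """
    h = set(y) & set(z)  # senses with non-zero mass in both vectors
    if not h:
        return 0.0
    rank_y = {s: r for r, s in enumerate(
        sorted(y, key=y.get, reverse=True), start=1)}
    rank_z = {s: r for r, s in enumerate(
        sorted(z, key=z.get, reverse=True), start=1)}
    num = sum(1.0 / (rank_y[s] + rank_z[s]) for s in h)
    den = sum(1.0 / (2 * i) for i in range(1, len(h) + 1))
    return num / den
```

Two identical vectors score exactly one (every overlapping sense has matching ranks, so the numerator equals the normalizer), and disjoint vectors score zero.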

Calculating ROUGE-G
We combine lexical and semantic similarities to compute ROUGE-G-N:

ROUGE-G-N = Σ_{S ∈ Models} Σ_{n-gram ∈ S} Sim_LS(n-gram_m, n-gram_p) / Σ_{S ∈ Models} Σ_{n-gram ∈ S} Count(n-gram)    (4)

where Sim_LS is the lexico-semantic similarity score between a pair of n-grams from the model summary (n-gram_m) and the peer summary (n-gram_p):

Sim_LS(n-gram_m, n-gram_p) = β × Count_match(n-gram_m, n-gram_p) + (1 − β) × Sim_sem(n-gram_m, n-gram_p)    (5)

The scaling factor β was optimized on the TAC 2010 AESOP dataset and set to 0.5, which yields the best correlation with the manual metrics. Count_match(n-gram_m, n-gram_p) is the maximum number of n-grams co-occurring in a peer summary and a set of model summaries.
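The Sim_LS interpolation above can be sketched per n-gram pair as follows. Note that, for illustration only, we reduce the corpus-level Count_match to a 0/1 exact-match indicator, which is a simplification of the definition given in the paper; the semantic similarity is passed in as a callable.

```python
def sim_ls(ngram_m, ngram_p, sem_sim, beta=0.5):
    """Lexico-semantic similarity between a model and a peer n-gram.

    beta interpolates an exact-match indicator with the semantic
    similarity sem_sim(ngram_m, ngram_p) in [0, 1]; beta = 0.5 is
    the value tuned on TAC 2010.
    """
    exact = 1.0 if ngram_m == ngram_p else 0.0  # simplified Count_match
    return beta * exact + (1 - beta) * sem_sim(ngram_m, ngram_p)
```

With beta = 0.5, an exact lexical match with full semantic similarity scores 1.0, while a lexically different but semantically related pair still receives half of its semantic score.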

Disambiguation of n-grams
Prior to measuring semantic similarities, each word in an n-gram has to be analyzed and disambiguated into its intended sense. However, conventional word sense disambiguation techniques are not applicable here due to the lack of contextual information.
Hence, we seek the semantic alignment that maximizes the similarity between the senses of the compared words. As an example (Pilehvar et al., 2013), consider the two sentences "a1. Officers fired." and "a2. Several policemen terminated in corruption probe." The semantic alignment procedure yields "P_a1: officer^3_n, fire^4_v" and "P_a2: policeman^1_n, terminate^4_v, corruption^6_n, probe^1_n", where t^i_p denotes the i-th sense of a word t in WordNet with POS p. During alignment, among all possible pairings of all senses of fire_v with all senses of all words in a2, the sense fire^4_v (employment termination) obtains the maximal similarity value, Sim_sem(fire^4_v, terminate^4_v) = 1.
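The alignment step described above can be sketched as a maximization over candidate senses; the function and the toy similarity in the usage below are our own illustration, not the paper's implementation.

```python
def align_sense(word_senses, other_senses, sim):
    """Pick the sense of a word that best aligns with any sense of any
    word in the compared sentence (no surrounding context required).

    word_senses  : candidate senses of the target word.
    other_senses : all senses of all words in the other sentence.
    sim          : callable giving the similarity of two senses.
    Returns (chosen sense, its best similarity).
    """
    return max(((s, max(sim(s, o) for o in other_senses))
                for s in word_senses), key=lambda t: t[1])
```

For instance, with a toy similarity that scores the pair ("fire_4", "terminate_4") highest, the procedure selects fire_4 as the intended sense, mirroring the example in the text.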

OOV Handling
Out-of-vocabulary (OOV) words are words that are not defined in the underlying lexical resource; they would otherwise be ignored when generating PPR vectors, since they have no associated node in the WordNet graph from which the random walk can be initialized. To take them into consideration, we add an extra dimension for each OOV term to the resulting PPR vector. Following Pilehvar and Navigli (2015), we set the weight of each new dimension to 0.5, which guarantees its placement among the top dimensions of its vector.
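A minimal sketch of this OOV extension, again with vectors as dictionaries from senses to probabilities; the ("OOV", term) key is our own labeling convention.

```python
def add_oov_dimensions(ppr, oov_terms, weight=0.5):
    """Extend a PPR vector (dict sense -> prob) with one extra dimension
    per OOV term, weighted at 0.5 (following Pilehvar and Navigli, 2015)
    so that each lands among the vector's top-ranked dimensions.
    """
    extended = dict(ppr)
    for term in oov_terms:
        extended[("OOV", term)] = weight
    return extended
```

Because individual PPR probabilities are typically far below 0.5, the OOV dimensions dominate the top ranks and therefore contribute strongly to the rank-and-weight overlap whenever both compared summaries contain the same OOV term.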

Data and Metrics
The only available datasets for the task of summarization evaluation are the three AESOP datasets provided by TAC 2009, 2010, and 2011. Among them, we optimize the scaling factors on the TAC 2010 AESOP dataset, and evaluate ROUGE-G on the TAC 2011 (Owczarzak and Dang, 2011) AESOP dataset, for two main reasons: (i) it is the only dataset on which evaluation metrics can be assessed for their ability to measure summary Readability; and (ii) it keeps our evaluation in line with the most recent related work (ROUGE-WE), which was likewise evaluated only on this dataset for Readability. The dataset consists of 44 topics, each with a set of 10 documents. There are four human-crafted model summaries for each document set, and a summary for each topic is generated by each of the 51 summarizers that participated in the main TAC summarization task. The outputs of the participating automatic metrics are compared against human judgments expressed by three manual metrics, Pyramid, Readability, and Responsiveness, which score summaries on content, linguistic quality, and a combination of both, respectively. Prior to computing the correlation of the ROUGE-G variants with the manual metrics, ROUGE-G scores were reliably computed (95% confidence intervals) under ROUGE bootstrap resampling with the default number of sampling points (1000). The correlation of the ROUGE-G evaluation scores with human judgments is then assessed with three correlation metrics: Pearson r, Spearman ρ, and Kendall τ. We compute scores using the default NIST settings for the baselines in the TAC 2011 AESOP task (with stemming, and keeping stopwords).
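The bootstrap confidence intervals mentioned above can be sketched as follows; the resampling unit, function name, and parameters are our own illustrative choices, mirroring ROUGE's default of 1000 resamples.

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """95% confidence interval for a mean metric score via bootstrap
    resampling (sampling topics with replacement, 1000 resamples by
    default, as in ROUGE's standard setting)."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples))
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```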

Results
We evaluate ROUGE-G against the top metrics (C_S_IIITH3, DemokritosGR1, Catolicasc1) among the 23 metrics that participated in TAC AESOP 2011, against ROUGE, and against the most recent related work, ROUGE-WE (Table 1). The overall results support our proposal to consider semantics alongside surface similarity in ROUGE. Since large or small differences in competing correlations with human assessment are not by themselves acceptable proof of the superiority or inferiority of one metric over another, significance tests should be applied. To better establish the effectiveness of ROUGE-G, we use the pairwise Williams significance test recommended by Graham et al. (2015) for summarization evaluation. Accordingly, the evaluation of a given summarization metric, M_new, takes the form of quantifying three correlations: r(M_new, H), between the evaluation metric's scores for summarization systems and the corresponding human assessment scores; r(M_base, H), the correlation of the baseline metrics with human judgments; and r(M_base, M_new), between the evaluation metrics' scores themselves. For a pair of competing metrics whose scores correlate strongly with each other, a small difference in their correlations with human assessment can be significant, while for a different pair with a larger difference in correlation, the difference may not be significant (Graham et al., 2015). Using this significance test, we find that all increases in correlation of ROUGE-G over the ROUGE and ROUGE-WE variants are statistically significant (p < 0.05). We analyze the correlation results reported in Table 1 in the following.
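The Williams test statistic for two dependent correlations sharing one variable, as commonly formulated, can be sketched as follows. This is our own implementation sketch, with r12 = r(M_new, H), r13 = r(M_base, H), r23 = r(M_base, M_new), over n summarization systems.

```python
from math import sqrt

def williams_t(r12, r13, r23, n):
    """Williams test statistic for the difference between two dependent
    correlations that share one variable (here, each metric's correlation
    with human scores over n systems). The result is compared against a
    Student-t distribution with n - 3 degrees of freedom."""
    K = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    num = (r12 - r13) * sqrt((n - 1) * (1 + r23))
    den = sqrt(2 * K * (n - 1) / (n - 3)
               + ((r12 + r13) / 2) ** 2 * (1 - r23) ** 3)
    return num / den
```

Note how the statistic grows as r23 approaches one: the more strongly two metrics correlate with each other, the smaller the difference in their human correlations needs to be in order to reach significance, which is exactly the behavior discussed above.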
ROUGE-G-2 achieves the best correlation with Pyramid under all correlation metrics. Moreover, every ROUGE-G variant outperforms its corresponding ROUGE and ROUGE-WE variants, regardless of the correlation metric used. The only exception is ROUGE-SU4, which correlates slightly better with Pyramid when measured with Pearson correlation. One possible reason is that Pyramid measures content similarity between peer and model summaries, while the ROUGE-G variants favor the semantics behind the content when measuring similarity. Since some of the semantics attached to the skipped words are lost in the construction of skip-bigrams, ROUGE-SU4 shows a better correlation than ROUGE-G-SU4 in this case.
For Responsiveness, ROUGE-G-SU4 achieves the best correlation when measured with Pearson, while ROUGE-G-2 obtains the best correlation when measured with the Spearman and Kendall rank correlations. The reason is that the semantic interpretation of bigrams is easier, and that of contiguous bigrams is much more precise. Again, every ROUGE-G variant outperforms its corresponding ROUGE and ROUGE-WE variants.
The Readability score is based on grammaticality, structure, and coherence. Although our main goal is not to improve readability, ROUGE-G-SU4 and ROUGE-G-2 correlate very well with this metric when measured with Pearson and with the Spearman/Kendall rank correlations, respectively. Moreover, every ROUGE-G variant achieves better correlation than its corresponding ROUGE and ROUGE-WE variants for all correlation metrics. This is likely due to the consideration of word types and POS tags while aligning and disambiguating n-grams; POS features were shown by Feng et al. (2010) to be helpful in predicting linguistic quality.
We optimize the scaling factor β (Equation 5) on the TAC 2010 AESOP dataset. Figure 1 shows the Pearson correlations of the ROUGE-G variants with the Pyramid (Pyr) and Responsiveness (Rsp) metrics. The best results are observed at β = 0.5. Performance deteriorates as β approaches 1.0, which corresponds to ROUGE scores without any semantic similarity. Decreasing β to zero excludes the lexical match counts, and consequently degrades the outcomes. This shows the importance of using both lexical and semantic similarities to fairly judge the quality of summaries. It is noteworthy that we evaluated our approach with the following settings for computing and comparing PPR vectors: (i) path-based with the rank-and-weight measure (the current setting); (ii) path-based with cosine similarity; and (iii) the rank-and-weight measure alone, without the path-based measure. The results showed that the current setting outperforms the other two.

Conclusion
This paper presents ROUGE-G to overcome the limitation of high lexical dependency in ROUGE.
Our approach leverages a sense-based representation to compute PPR vectors for n-grams. The semantic similarity of n-grams is then computed using a formalization of the path-based and rank-and-weight measures. We finally improve on ROUGE by performing both semantic and lexical analysis of summaries. Experiments over the TAC AESOP datasets demonstrate that ROUGE-G achieves higher correlations with manual judgments than ROUGE.
To demonstrate the effectiveness of ROUGE-G in fairly evaluating abstractive summaries, we would need to conduct experiments on a dataset composed of abstractive summaries. However, no such dataset exists at the time of writing, and we therefore evaluated our approach on the TAC 2011 AESOP dataset, whose summaries were generated mostly by extractive systems. We can continue building on this work by using the model summaries, which are abstractive in nature, as a proxy. It would then be possible to incorporate a jackknifing procedure into the scoring process, in order to see whether our metric can differentiate between peer summaries (naturally extractive) and model summaries (naturally abstractive).