The Feasibility of Embedding Based Automatic Evaluation for Single Document Summarization

ROUGE is widely used to automatically evaluate summarization systems. However, ROUGE measures semantic overlap between a system summary and a human reference at the word-string level, much at odds with the contemporary treatment of semantic meaning. Here we present a suite of experiments on using distributed representations for evaluating summarizers, in both reference-based and reference-free settings. Our experimental results show that the max value over each dimension of the summary ELMo word embeddings is a good representation that results in high correlation with human ratings. Averaging the cosine similarity of all encoders we tested yields high correlation with manual scores in the reference-free setting. The distributed representations outperform ROUGE on recent corpora for abstractive news summarization but perform worse on test data used in past evaluations.


Introduction
The widely used ROUGE (Lin, 2004) automatic evaluation for summarization relies on token overlap between reference and system summary. This limited view of meaning has motivated numerous studies on summarization evaluation (Zhou et al., 2006; Ganesan, 2018; ShafieiBavani et al., 2018), and in the related areas of translation and dialog, to explore more compelling semantic matching (Kauchak and Barzilay, 2006; Lavie and Denkowski, 2009; Lo and Wu, 2011; Chen and Guo, 2015; Liu et al., 2016; Tao et al., 2018). Most recently, incorporating word embeddings into ROUGE's pairwise comparison of n-grams has proven beneficial (Ng and Abrecht, 2015), as has representing sentences with universal sentence representations to predict the quality of translation (Shimanaka et al., 2018).
We build upon this line of work and show that cosine similarity between the reference and summary embeddings works well, and better than ROUGE on recent datasets, for comparing single document summarization systems. Unlike prior work (Ng and Abrecht, 2015), we entirely abandon ROUGE and n-gram co-occurrences in the computation of semantic similarity. To give a sense of the generalizability of our findings, we validate the method on three different test sets with human evaluation. We compare several popular representations, including sentence embeddings, un-contextualized word embeddings and contextualized word embeddings. Finally, we present experiments on evaluating single document summaries without reference summaries, an approach that was originally proposed for multi-document summarization (Louis and Nenkova, 2013), where a variety of word-string similarity techniques were explored. Here we study reference-free evaluation via embedding similarity between the full document to be summarized and the system summaries.

Embeddings
To get a dense low-dimensional representation of texts, we test seven representations covering sentence embedding, variants of un-contextualized word embedding and variants of contextualized word embedding. Specifically: (i) Two Google universal sentence encoders (Cer et al., 2018): an encoder (enc-2) based on a deep averaging network (Iyyer et al., 2015) and an encoder (enc-3) based on the transformer (Vaswani et al., 2017). Both encoders encode input text into a 512-dimensional vector. (ii) Average (ELMo-a) and max (ELMo-m) over each dimension of all ELMo (Peters et al., 2018) word embeddings of an input text. For each token in the input, three layers of 1,024-dimensional vectors were concatenated to form a 3,072-dimensional vector. (iii) Average (avg) and max (max) over GoogleNews 300-d word2vec embeddings. (iv) InferSent (Conneau et al., 2017) (InferSent), a BiLSTM encoder producing a representation of 4,096 dimensions. We compute the cosine similarity between the summary and reference embeddings to capture semantic similarity. To test the robustness of this evaluation approach, we check correlations both on old single document summarization evaluations of somewhat obsolete systems and on modern summarization corpora with a mix of extractive and neural abstractive systems.
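The pooling and similarity computation described above can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the code used for the experiments: the toy 3-dimensional token vectors stand in for real ELMo or word2vec embeddings, and the names pool_embeddings and cosine_similarity are hypothetical.

```python
import math


def pool_embeddings(word_vectors, mode="max"):
    """Collapse a list of equal-length token vectors into one text vector."""
    if not word_vectors:
        raise ValueError("need at least one token vector")
    columns = list(zip(*word_vectors))  # one tuple per embedding dimension
    if mode == "max":
        return [max(col) for col in columns]
    if mode == "avg":
        return [sum(col) / len(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode!r}")


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical per-token embeddings (dimension 3) for a summary and a reference.
summary_tokens = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]]
reference_tokens = [[1.0, 1.0, 2.0], [0.0, 3.0, 0.0], [0.5, 0.0, 1.0]]

# Max pooling corresponds to the ELMo-m / max variants, average pooling to ELMo-a / avg.
score_max = cosine_similarity(
    pool_embeddings(summary_tokens, "max"), pool_embeddings(reference_tokens, "max")
)
score_avg = cosine_similarity(
    pool_embeddings(summary_tokens, "avg"), pool_embeddings(reference_tokens, "avg")
)
print(score_max, score_avg)
```

In this toy case the two texts have identical per-dimension maxima, so the max-pooled similarity is (up to rounding) 1 while the averaged vectors differ, which shows how the two pooling strategies can disagree on the same pair of texts.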

Evaluation on DUC2001/2002
Document Understanding Conferences (DUC) 2001/2002 provide benchmark datasets along with human evaluation over multiple submitted systems. Human evaluation (coverage score) reflects the degree to which semantic units, roughly clauses, in the reference summary are expressed in the system summary (Lin and Hovy, 2003).
To evaluate the newly proposed automatic evaluations, we follow the conventional methodology of computing correlation between the automatic metric and human evaluations of summary content. The results are shown in Table 1 (see footnote 1). R1-F correlates better with human ratings on DUC'01, while R2-R works extremely well on DUC'02. Both unigram and bigram ROUGE F-measure also correlate well with human evaluation, which is an important finding given that ROUGE-F has become the de facto standard for evaluation of neural summarization systems (Nallapati et al., 2016; See et al., 2017; Gehrmann et al., 2018; Zhang et al., 2018; Celikyilmaz et al., 2018).
Footnote 1: There are 14 systems in total in DUC'02. We discard two poorly performing systems, 17 and 30. Including them in the analysis results in high correlation (> 0.9) for both ROUGE and embedding similarity, but the results we present are more convincing without the presence of clearly inferior systems.
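The correlation methodology described above can be sketched as follows. This is a minimal illustration, assuming that per-summary scores are first averaged into one score per system and that Spearman correlation is then computed over systems; the system names, the scores and the helper names (average_ranks, spearman, system_level_scores) are all hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def system_level_scores(per_summary_scores):
    """Map {system: [score per article]} to {system: mean score}."""
    return {name: sum(s) / len(s) for name, s in per_summary_scores.items()}


# Hypothetical per-article scores for four systems (not data from the paper).
metric = {"A": [0.80, 0.90], "B": [0.60, 0.70], "C": [0.75, 0.65], "D": [0.40, 0.50]}
human = {"A": [0.5, 0.7], "B": [0.4, 0.4], "C": [0.3, 0.4], "D": [0.1, 0.2]}

metric_means = system_level_scores(metric)
human_means = system_level_scores(human)
systems = sorted(metric_means)
rho = spearman([metric_means[s] for s in systems], [human_means[s] for s in systems])
print(rho)
```

With only four systems, swapping the order of two neighbouring systems (B and C here) already lowers the correlation from 1 to 0.8, which is why results on small system pools should be read with care.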
We find that there is no single optimal representation that gives the best correlation on both data sets. There is a clear increase in performance from DUC'01 to DUC'02. On DUC'02, embedding similarity achieves the same level of correlation with human evaluation as the ROUGE F-measure, or even higher, but it performs worse on DUC'01. The correlations using avg, max and InferSent in the reference-free setting on DUC'01 are lower. To understand what causes the low performance on DUC'01, we examined statistics for both datasets, shown in Table 2. Systems in DUC'01 are, on average, inferior to DUC'02 systems (lines 5 and 6) and systems in DUC'01 are more similar to each other than in DUC'02 (line 7), which leaves less room to rank the systems and thus achieve high correlation. Another difference between DUC'01 and DUC'02 is that the number of evaluated articles for each system in DUC'02 is considerably larger than in DUC'01. One might ask whether enough articles are provided for each system in the DUC'01 data. We show in Figure 1 that 140 is a large enough number for eliciting stable system-level correlation. Another possible problem is that the number of systems is too small, so a minor change in either the human or the embedding similarity score can lead to a large oscillation of the correlation. Scatter plots of these systems are shown in Figure 3.
Figure 1: For each number of articles, we sample and compute the correlation 50 times and plot the average as well as the standard deviation. The decreasing size of the error bars shows that enough articles are provided for each system and that this is not the reason for the performance discrepancy between DUC2001 and DUC2002.
As we can see, the similarities of all kinds of embeddings indeed correlate with the coverage score; however, they also generate more extreme values when pairs of systems are examined. For example, two systems may be close in terms of R1-F, but can be relatively distant when comparing the embedding similarities. This problem, possibly due to the different architectures and the different data each encoder is trained on, may be alleviated by averaging the cosine similarities computed from all the representations. Overall, given the stable results on DUC'02, embedding similarity is a good metric which does not depend on lexical overlap and can be computed quickly, immediately after inference, without the need to run ROUGE.
DUC'01, where reference-free results are weaker, contains longer articles. To check whether the representations are sensitive to input length, we truncated the articles to the first 400 or 750 tokens when the length exceeds that limit. We plot the correlations in this setting for the three encoders which have worse results on DUC'01. Figure 2 shows a clear and consistent improvement of correlation when the document size is smaller. In fact, when only the lead 400 tokens are included, the average word embedding is only slightly worse than R1-F. This finding suggests that the embedding of the document lead sentences serves as a better reference than the full document and that the three representations are sensitive to input length.
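The reference-free scoring with truncated documents, together with the averaging of cosine similarities over several representations, can be sketched as below. This is an illustration only: the two toy encoders and the 2-dimensional word vectors are hypothetical stand-ins for the real encoders, and the token limit of 3 mimics the 400 and 750 token limits on a tiny example.

```python
import math

# Hypothetical 2-d word vectors standing in for a real embedding table.
TOY_VECTORS = {
    "stocks": [1.0, 0.0],
    "fell": [0.8, 0.2],
    "today": [0.5, 0.5],
    "rain": [0.0, 1.0],
}


def avg_encoder(tokens):
    """Toy encoder: per-dimension mean of the token vectors."""
    columns = zip(*(TOY_VECTORS[t] for t in tokens))
    return [sum(col) / len(col) for col in columns]


def max_encoder(tokens):
    """Toy encoder: per-dimension max of the token vectors."""
    columns = zip(*(TOY_VECTORS[t] for t in tokens))
    return [max(col) for col in columns]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def truncate_tokens(tokens, limit):
    """Keep only the first `limit` tokens; shorter inputs are unchanged."""
    return tokens[:limit]


def reference_free_score(document_tokens, summary_tokens, encoders, limit=None):
    """Mean over encoders of cosine(document embedding, summary embedding)."""
    if not encoders:
        raise ValueError("need at least one encoder")
    if limit is not None:
        document_tokens = truncate_tokens(document_tokens, limit)
    sims = [
        cosine_similarity(enc(document_tokens), enc(summary_tokens))
        for enc in encoders
    ]
    return sum(sims) / len(sims)


document = ["stocks", "fell", "today", "rain", "rain", "rain"]
summary = ["stocks", "fell"]
encoders = [avg_encoder, max_encoder]

full_score = reference_free_score(document, summary, encoders)
lead_score = reference_free_score(document, summary, encoders, limit=3)
print(full_score, lead_score)
```

The toy document was built so that its tail drifts away from the topic of the summary, which makes the lead-only score higher than the full-document score. It demonstrates the mechanics only and does not reproduce the correlations reported in Figure 2.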
Another noticeable difference between DUC'01 and DUC'02 is the performance of ELMo-a word embeddings. Both ELMo embedding variants are capable of dealing with long texts. For the four settings we tested, ELMo-m has better Spearman correlation than averaging ELMo embeddings. On DUC'02, ELMo-a leads to lower correlations than other encoders. Unlike max and avg, where word embeddings are fixed, contextualized word embeddings are more flexible, so the average ELMo embeddings for the reference and the summary can be far from each other; in this case ELMo-m reflects more salient information about the input than ELMo-a.

Evaluation on Newsroom 60
In this section, we use contemporary data and systems and explore other factors that embedding similarity could potentially capture. We employ the human evaluation set from the Newsroom data introduced by Grusky et al. (2018). The evaluation data includes 7 systems, each producing summaries for 60 articles. The 7 systems are: (1) the first three (lead-3) sentences of the article; (2) TextRank with a word limit of 50; (3) the extractive oracle 'fragments' system, representing the best possible performance of an extractive system; (4) an abstractive model (Rush et al., 2015) trained on Newsroom training data; (5)-(7) Pointer-Generator (See et al., 2017) trained on the CNN/DailyMail data set (Nallapati et al., 2016), on the complete Newsroom training set and on a subset of the Newsroom training set, respectively. We collected crowdsourced evaluations of relevance (RL) and informativeness (IN) as introduced in the original paper, closely reproducing the earlier findings. We introduce four more dimensions: verbosity (VE), unnecessary content (UC), perfect surrogate (SR) and continue reading (CN). A higher rating corresponds to the assessment that the summary is not unnecessarily verbose, that it has no unnecessary content, that it is a good surrogate for the input and that much additional information can be obtained from the article after reading the summary. The exact questions are presented in the supplementary material. We asked workers to rate in the range of 1 to 7 instead of the range of 1 to 5 used in the original paper. We excluded from the analysis the 'fragments' oracle system, which maximizes ROUGE by selecting word n-grams but receives very low human ratings because the resulting summary is incomprehensible. Each summary is scored by three crowdworkers whose scores we average. Table 3 shows Spearman correlations; Pearson correlations are in the Appendix. The table shows that embedding similarity correlates with human ratings on informativeness, relevance and surrogate as well as or better than ROUGE.
However, ROUGE precision is a more suitable metric for evaluating the extent of unnecessary content and verbosity. This implies that the examined representations capture the meaning of the input well but store repetitive information without enough penalty. For the dimension CN, which indicates that the input contains considerably more important details than the summary, neither embedding similarity nor ROUGE yields a correlation with a small p-value.
To establish which representation correlates best with human ratings, we sort the representations by correlation in descending order on each of the three data sets we examined and compute the average rank for each representation. We used informativeness ratings on the Newsroom data. A smaller value means better overall performance across the three data sets.
The results are shown in Table 4. The max value over each dimension of the ELMo word embeddings of the input text performs best when reference summaries are given. When we compare the embeddings of the system summary and the document, the averaged cosine similarity of all seven representations also gives good results. Most importantly, the ELMo-m evaluation ranks consistently better than ROUGE-F for all evaluation settings.
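The average-rank comparison described above can be sketched as follows. The representation names, data set names and correlation values are hypothetical placeholders rather than numbers from Table 4, and ties are broken by input order, which is a simplifying assumption.

```python
def ranks_descending(correlations):
    """Map {name: correlation} to {name: rank}; rank 1 is the highest correlation."""
    ordered = sorted(correlations, key=correlations.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def average_rank(per_dataset_correlations):
    """Average each representation's rank over all data sets."""
    collected = {}
    for correlations in per_dataset_correlations.values():
        for name, rank in ranks_descending(correlations).items():
            collected.setdefault(name, []).append(rank)
    return {name: sum(ranks) / len(ranks) for name, ranks in collected.items()}


# Hypothetical correlations with human ratings on three data sets.
correlations = {
    "set_1": {"rep_a": 0.70, "rep_b": 0.60, "rep_c": 0.80},
    "set_2": {"rep_a": 0.90, "rep_b": 0.85, "rep_c": 0.80},
    "set_3": {"rep_a": 0.75, "rep_b": 0.70, "rep_c": 0.50},
}

avg_ranks = average_rank(correlations)
best = min(avg_ranks, key=avg_ranks.get)  # smaller average rank is better
print(avg_ranks, best)
```

Ranking within each data set before averaging keeps a single data set with unusually high or low correlations from dominating the comparison, at the cost of discarding the size of the differences.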

Conclusion
In this paper we systematically study embedding cosine similarity as a measure of the quality of summarizers on three data sets. We verify the feasibility of embedding similarity for system comparison on DUC'01, DUC'02 and Newsroom human evaluation data. The worse results on DUC'01 can be explained by the fact that the systems being evaluated are too similar and do not perform well. On DUC'02 and Newsroom data, embedding similarity achieves the same level of correlation with human ratings as ROUGE, or even higher. Overall, when references are given, max ELMo word embeddings have the highest correlation, and the averaged cosine similarity of the examined representations gives high correlation in the reference-free setting.