Pruning Basic Elements for Better Automatic Evaluation of Summaries

We propose a simple but highly effective automatic evaluation measure of summarization, pruned Basic Elements (pBE). Although the BE concept is widely used for the automated evaluation of summaries, its weakness is that it redundantly matches basic elements. To avoid this redundancy, pBE prunes basic elements by (1) disregarding the frequency count of basic elements and (2) reducing semantically overlapped basic elements based on word similarity. Even though it is simple, pBE outperforms ROUGE on the DUC datasets in most cases and achieves the highest rank correlation coefficient in the TAC 2011 AESOP task.


Introduction
Automatic evaluation measures have a significant impact on the research on summarization. Since there is no other practical way to quickly evaluate the quality of system summaries, summarization studies work on raising the scores that are given by automatic evaluation measures.
Among the automatic evaluation measures, the most popular ones are ROUGE (Lin, 2004) and BE (Hovy et al., 2006). ROUGE and BE count the number of ngrams and basic elements, respectively, that match those in manual reference summaries. ROUGE normally employs unigrams or bigrams, while BE uses dependency triples (head|modifier|relation) as its units. It is known that both ROUGE and BE are well correlated with human judgment.
Their evaluation approach, however, is quite different from humans' in two ways: they score low-information units higher and ignore the semantic overlap of units. The first problem is caused by scoring units according to their frequencies. We found that the units that occur multiple times in a summary are highly likely to be function-word bigrams (e.g., "of the") or basic elements that represent only single nouns (e.g., (house|the|det)); such units are less informative than units connected with verbs (e.g., "John went" and (went|John|nsubj)). The second problem is that ROUGE/BE sometimes gives scores twice or more to the units that are semantically overlapped but spelled differently. This is due to the fact that ROUGE/BE only considers the surface level of unit matching, which also yields inaccurate scoring of paraphrased units.
Our method aims to solve these problems by cutting back redundant units. We use BE, but with Universal Dependencies (UD) (Nivre et al., 2016), a more ideal form of annotation that is available for multiple languages, and introduce two steps to prune basic elements. The first step is to disregard the frequency count of basic elements, and the second is to reduce semantically overlapped basic elements using word embeddings. We call this new measure pruned BE (pBE). Our experiments show that pBE outperforms ROUGE on most DUC datasets and achieves the highest rank correlation coefficient in the TAC 2011 AESOP task.

Related Work
ROUGE-WE (Ng and Abrecht, 2015) and BEwT-E (Tratz and Hovy, 2008) are closely related to our method in that they aim to improve unit matching. ROUGE-WE exploits word embeddings to softly match ngrams based on their cosine similarities. Although this also takes semantic correspondence into consideration, it is different from pBE because it does not judge word similarity within one summary, but only between a target summary and its reference summaries. Furthermore, ROUGE-WE does not remove the frequency count of ngrams as pBE does. As a result, ROUGE-WE does not achieve our goal of reducing redundant units.
BEwT-E transforms basic elements to help in matching. However, it requires complex transformation rules, which are difficult to apply to languages other than English. pBE, on the other hand, needs no resources other than word embeddings and UD parsers, and so can be implemented in many other languages. The BEwT-E study also examined whether the frequency count of basic elements affected performance. Its focus, however, was not on pruning basic elements, and it gave no clear explanation of why disregarding the frequency count was effective. Our contribution is that we have identified why disregarding the frequency count is effective: it prunes low-information basic elements, and thus works well in combination with reducing semantic overlaps.
Syntactically and semantically richer structures are free from low-information units. In this sense, PEAK (Yang et al., 2016) is related to our method in that it tries to employ predicate-argument structures as primitive units for matching. However, the predicate-argument structures are more difficult to extract than dependency triples. It is reported that PEAK scored only about 0.7 in Pearson coefficient for the DUC 2006 dataset (Yang et al., 2016), whereas ROUGE achieved around 0.83.

Pruned BE (pBE)
In this section, we describe our implementation of BE and the two steps of pruning basic elements.

Our Implementation of BE
BE was proposed to compensate for some of the shortcomings of ngrams (Hovy et al., 2006). ROUGE usually uses short ngrams such as unigrams and bigrams, but these can carry little information because they are simply extracted without considering the syntactic relations of the words. For example, the sentence "John went to the store on foot" is decomposed into the bigrams ["John went", "went to", "to the", "the store", "store on", "on foot"]. The function-word pair "to the" bears almost no meaning but is frequently found since function words appear in sentences quite often. On the other hand, a dependency triple holds the syntactic information that "to" depends not on "the" but on "store". Although BE requires applying parsers to summaries, syntactic dependencies enable BE to avoid making low-information units.
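The bigram decomposition above can be reproduced with a few lines of Python. This is only an illustrative sketch: the `FUNCTION_WORDS` set is a hypothetical, hand-picked list used to flag function-word pairs, not a resource used in our experiments.

```python
def bigrams(tokens):
    """Return all adjacent word pairs of a token sequence."""
    return list(zip(tokens, tokens[1:]))

# Hypothetical, hand-picked function-word list (illustration only).
FUNCTION_WORDS = {"to", "the", "on", "of", "a", "an", "in"}

def is_function_word_bigram(bigram):
    """True when both words of the bigram are function words."""
    return all(word.lower() in FUNCTION_WORDS for word in bigram)

sentence = "John went to the store on foot"
pairs = bigrams(sentence.split())
low_information = [bg for bg in pairs if is_function_word_bigram(bg)]

print(pairs)
print(low_information)
```

Under this toy word list, the only bigram flagged as low-information is ("to", "the"), the pair discussed in the text.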
Accordingly, while we use BE, the annotation is UD based, an approach not employed in previous studies.
Since UD focuses on the relations between content words, UD triples are able to represent key components of sentences more directly. For example, the sentence above can be decomposed in UD as [(went|John|nsubj), (store|to|case), (store|the|det), (went|store|nmod:to), (foot|on|case), (went|foot|nmod:on)], while it is [(went|John|nsubj), (went|to|prep), (store|the|det), (to|store|pobj), (went|on|prep), (on|foot|pobj)] in Stanford Dependencies (de Marneffe et al., 2006). In UD, the predicate-object relation is directly expressed as (went|store|nmod:to), instead of having intermediate triples (went|to|prep) and (to|store|pobj). Moreover, UD has another key advantage: it is available in many languages. This makes our method applicable to languages other than English.
We use (head|modifier|relation) triples of UD v1 relations, which correspond to the narrow-sense dependencies and multiword expression (MWE) dependencies of UD v2. Note that we excluded the auxpass and mwe relations, because their information is mostly contained in other relations such as nsubjpass, nmod and advcl. Auxpass is a special case of aux that indicates that a verb is passive. Aux indicates a verb's modality or tense, which is not conveyed by the nsubj relation alone. Auxpass likewise indicates an important property of a verb, its voice; however, voice is already expressed by the nsubjpass relation. Mwe is used for multiword expressions of function words that behave like a single function word. The information carried by mwe, however, is generally contained in nmod or advcl relations in the enhanced++ UD representation (Schuster and Manning, 2016) (e.g., (fruits|apple|nmod:such as) and (bought|fixing|advcl:instead of)). Counting these relations can therefore lead to redundant unit matching. In fact, the performance was better when we excluded them.
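As a minimal sketch of this filtering, triples can be held as (head, modifier, relation) tuples and the two excluded relations dropped before matching. The triples below are hand-written for illustration and are not actual parser output; the function name `prune_relations` is hypothetical.

```python
# Relations excluded from the basic elements, as described in the text.
EXCLUDED_RELATIONS = {"auxpass", "mwe"}

def prune_relations(triples):
    """Drop (head, modifier, relation) triples whose relation is excluded."""
    return [t for t in triples if t[2] not in EXCLUDED_RELATIONS]

# Hand-written triples for "The window was broken" (illustration only;
# a real UD parser may produce different output).
triples = [
    ("broken", "window", "nsubjpass"),
    ("broken", "was", "auxpass"),
    ("window", "The", "det"),
]

kept = prune_relations(triples)
print(kept)
```

The passive voice remains recoverable from the nsubjpass triple after the auxpass triple is removed, which is the redundancy argument made above.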

Step 1: Disregard Frequency Count
The ROUGE/BE score is defined as follows. Given K reference summaries R = {R_1, ..., R_K}, a target summary S, and the set of units that appear in R_k as F^k = {f^k_1, ..., f^k_{M_k}},

$$\mathrm{ROUGE/BE}(R, S) = \frac{\sum_{k=1}^{K} \sum_{m=1}^{M_k} \min\{N(f^k_m, R_k),\, N(f^k_m, S)\}}{\sum_{k=1}^{K} \sum_{m=1}^{M_k} N(f^k_m, R_k)} \qquad (1)$$

where N(f, X) is the number of times unit f occurs in X. A unit f thus contributes to the ROUGE/BE score according to its frequency. (In BE, it is optional to consider or disregard this frequency count (Tratz and Hovy, 2008). We describe below why dispensing with the frequency count affects the results.)
The problem is that the units found multiple times tend to be low-information units. ROUGE-2 often finds function-word bigrams, which leads to their overweighting. While BE is free from function-word bigrams, it still contains improperly weighted basic elements: compound and det. For example, in DUC 2003, min{N(f^k_m, R_k), N(f^k_m, S)} was greater than 1 for 302 basic elements, of which 139 were compound and 96 were det; together they account for about 78% of the total. This is because these relations represent only single nouns. Since they are not associated with verbs, which are key components of sentences, they appear in many sentences even within one summary. (For example, "Donald Trump" can be used in various sentences like "Donald Trump won the election." and "Donald Trump will visit China next week." But "Trump won" can only occur in the specific situation where Trump won something, which is unlikely to be described in a summary more than once.) It is not that compound and det are meaningless units, but that they should not be weighted more than other relations such as nsubj, dobj and iobj, which are associated with verbs.
Therefore, we simply get rid of the frequency count. We define our scoring function as follows:

$$\mathrm{Score}(R, S) = \frac{\sum_{k=1}^{K} \sum_{m=1}^{M_k} \min\{O(f^k_m, R_k),\, O(f^k_m, S)\}}{\sum_{k=1}^{K} \sum_{m=1}^{M_k} O(f^k_m, R_k)} \qquad (2)$$

Here O(f^k_m, R_k) and O(f^k_m, S) are functions that return 1 if f^k_m is in R_k and S, respectively, and otherwise return 0. This way, we can simplify equation (1) and avoid undue weighting.
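The effect of dropping the frequency count can be seen in a small sketch. The function below is an illustrative reading of the recall-style score described above, not our released implementation: with `use_frequency=True` each unit is clipped by its occurrence count N, and with `use_frequency=False` each unit counts at most once, as with the indicator function O. The function name, its parameter and the toy units are all assumptions made for this example.

```python
from collections import Counter

def unit_recall(references, summary, use_frequency=True):
    """Recall of reference units in a summary.

    references: list of unit lists, one per reference summary.
    summary: list of units of the target summary.
    use_frequency: clip by occurrence counts (True) or count each
        distinct unit at most once (False).
    """
    summary_counts = Counter(summary)
    matched = total = 0
    for reference in references:
        for unit, n_ref in Counter(reference).items():
            n_sum = summary_counts[unit]  # 0 if the unit is absent
            if not use_frequency:
                n_ref, n_sum = 1, min(n_sum, 1)
            matched += min(n_ref, n_sum)
            total += n_ref
    return matched / total if total else 0.0

# Toy units (hypothetical): one det triple repeated three times and
# two verb-related triples in the reference.
det = ("house", "the", "det")
reference = [det, det, det, ("went", "John", "nsubj"), ("bought", "car", "dobj")]
summary = [det, det, det, ("went", "John", "nsubj")]

with_freq = unit_recall([reference], summary, use_frequency=True)
without_freq = unit_recall([reference], summary, use_frequency=False)
print(with_freq, without_freq)
```

In this toy case the repeated det triple accounts for three of the four matches under the frequency-based score, whereas without the frequency count it weighs the same as a single verb-related triple.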

Step 2: Cluster Basic Elements Using Word Embeddings
Humans are able to detect semantic correspondence. If we are given key points to be included in the summary, we can judge whether the key points are in the summary or not on the semantic level. ROUGE/BE, however, judges the correspondence of key points only on the surface level. Since the same content can be expressed in various surface forms, ROUGE/BE sometimes scores semantically overlapped units multiple times or does not score units that semantically correspond to each other but are significantly different on the surface level. To deal with this problem, we put semantically identical words into one cluster based on word similarity. Our method only requires word embeddings trained with word2vec (Mikolov et al., 2013), and so offers multilingual capability.
Given K reference summaries R = {R_1, ..., R_K}, a target summary S, the set of all unigrams in R and S as U = {u_1, ..., u_P}, and a set of Q word embeddings for the unigrams as V = {v_1, ..., v_Q} (Q ≤ P), we map U onto the set of cluster IDs C = {c_1, ..., c_N} by hierarchical clustering using word similarities. The number of clusters, N, is a hyperparameter. Next, we convert the unigrams of R and S into their cluster IDs c. If unigram u_i has no word embedding, we leave it in its surface form. Let the converted reference summaries and target summary be R' and S', respectively. We define the set of basic elements in R'_k as F'^k = {f'^k_1, ..., f'^k_{M'_k}}. Combined with step 1, fully pruned BE is defined as follows:

$$\mathrm{pBE}(R', S') = \frac{\sum_{k=1}^{K} \sum_{m=1}^{M'_k} \min\{O(f'^k_m, R'_k),\, O(f'^k_m, S')\}}{\sum_{k=1}^{K} \sum_{m=1}^{M'_k} O(f'^k_m, R'_k)}$$
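The clustering and conversion step can be sketched in pure Python as follows. This is a simplified illustration under stated assumptions: it uses complete-linkage (maximum distance) agglomerative clustering over cosine distance, the two-dimensional vectors are invented toy embeddings rather than word2vec vectors, and the function names and the "c0"-style cluster IDs are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def cluster_words(embeddings, n_clusters):
    """Complete-linkage agglomerative clustering; returns word -> cluster ID."""
    clusters = [[word] for word in embeddings]

    def linkage(c1, c2):
        # Complete linkage: distance of the farthest pair of members.
        return max(cosine_distance(embeddings[a], embeddings[b])
                   for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        index_pairs = [(i, j) for i in range(len(clusters))
                       for j in range(i + 1, len(clusters))]
        i, j = min(index_pairs,
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return {word: f"c{idx}" for idx, members in enumerate(clusters)
            for word in members}

def convert(tokens, word_to_cluster):
    """Replace each token by its cluster ID; keep tokens without embeddings."""
    return [word_to_cluster.get(token, token) for token in tokens]

# Invented 2-d toy embeddings (illustration only).
toy_embeddings = {
    "car": [1.0, 0.1],
    "automobile": [0.9, 0.2],
    "bought": [0.1, 1.0],
    "dog": [-1.0, 0.3],
}
word_to_cluster = cluster_words(toy_embeddings, n_clusters=3)
converted = convert(["Tesla", "bought", "automobile"], word_to_cluster)
print(word_to_cluster)
print(converted)
```

Here "car" and "automobile" end up with the same cluster ID, so triples containing either word match each other after conversion, while "Tesla", which has no embedding in the toy table, keeps its surface form.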

Experimental Setup
To assess the effectiveness of pBE, we computed the correlation coefficients between pBE scores and human judgments and, for comparison, between the scores of other automatic evaluation measures and human judgments. We used the multi-document summarization datasets DUC 2003-2007 and TAC 2011. The correlation was computed over all system summaries, excluding reference summaries. Our first experiment compared the performance of pBE and ROUGE on the DUC datasets. Since a dependency triple is a type of bigram/skip-bigram, we chose ROUGE-2 and ROUGE-S4 for comparison. We also examined ROUGE-SU4 because it is known as a strong baseline that outperforms most other measures in the TAC 2011 AESOP task (Owczarzak and Dang, 2011). All three ROUGE variants were run with stemming but without stopword removal.
The second experiment was designed to see how well pBE worked compared with related measures, including ROUGE-WE (Ng and Abrecht, 2015), on TAC 2011. The details of our experimental setup are given in Table 3 and below.
Parser: We used the neural-network dependency parser of Stanford CoreNLP (Manning et al., 2014). Dependencies were set to enhanced++ Universal Dependencies (Schuster and Manning, 2016).
Clustering: We employed hierarchical clustering with the maximum distance (complete linkage) method. The number of clusters, N, was set to 0.975 * Q.
Word Embeddings: We used a set of pre-trained Google-News word embeddings. It contains 3 million words, each of which has a word embedding of 300 dimensions.
Tables 1 and 2 show the evaluation results on the DUC and TAC datasets, respectively.

Results and Discussion
Regardless of the diversity of datasets, pBE outperformed ROUGE in most cases (Table 1). Interestingly, although step 2 by itself sometimes did not work well, the combination of both steps generally achieved the best performance. This is because clustering enhanced not only the matching of informative basic elements but also that of low-information basic elements. Table 4 shows how the number of compound and det triples increased, compared with that of subj (nsubj, nsubjpass, csubj and csubjpass) and obj (iobj and dobj) triples. In all datasets, the number of compound and det triples for which min{N(f^k_m, R_k), N(f^k_m, S)} was greater than 1 increased much more than that of subj and obj after converting unigrams into cluster IDs. Although clustering reduced semantic mismatches, it worsened the problem of redundant counting. Nonetheless, this problem can be easily solved by applying step 1. This is why the combination of steps 1 and 2 was so synergistic.
Table 4: The number of basic elements for which min{N(f^k_m, R_k), N(f^k_m, S)} was greater than 1, before clustering (BE) and after clustering (BE+cls), and the difference between the numbers, BE+cls − BE (Increased). The relation "subj & obj" includes nsubj, nsubjpass, csubj, csubjpass, iobj and dobj.
Another problem with step 2 is that it sometimes makes inappropriate clusters. For example, numbers tend to be put in the same clusters since our word embeddings place them close to each other. In summaries, however, confusing quantitative information such as "two apples" and "five apples" must be avoided. It will be our future work to specify where clustering fails to work and to get rid of inappropriate clusters.
Table 2 shows that pBE achieved the best rank correlation among the competitors in TAC 2011 and ROUGE-WE. Although its score was lower in Pearson coefficient, it should be noted that the Pearson correlation is based on some strict assumptions: the samples are normally distributed and are linearly related to each other. Since the Spearman and Kendall correlations are free from these assumptions, the best rank correlation is good evidence of pBE's performance.
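The difference between the Pearson and rank correlations can be illustrated with a small pure-Python sketch. The scores below are invented for illustration and are not taken from our experiments; the metric is constructed to be a monotonic but non-linear function of the human scores.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Invented scores: the metric ranks systems exactly like the human scores
# but is not linearly related to them.
human_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
metric_scores = [math.exp(h) for h in human_scores]

print(pearson(human_scores, metric_scores))
print(spearman(human_scores, metric_scores))
```

A measure that orders systems perfectly obtains a Spearman coefficient of 1 here, while its Pearson coefficient falls short of 1 because the relationship is not linear.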

Conclusion
We proposed an automatic evaluation measure of summarization, pBE. It is designed to prune redundant basic elements in two steps: (1) disregarding the frequency count of basic elements and (2) using word similarity to reduce semantically overlapped basic elements. Our experiments show that pBE outperforms ROUGE in most cases and achieves the highest rank correlation coefficient in the TAC 2011 AESOP task.