Automatic Pyramid Evaluation Exploiting EDU-based Extractive Reference Summaries

This paper tackles automation of the pyramid method, a reliable manual evaluation framework. To construct a pyramid, we transform human-made reference summaries into extractive reference summaries that consist of Elementary Discourse Units (EDUs) obtained from source documents and then weight every EDU by counting the number of extractive reference summaries that contain the EDU. A summary is scored by the correspondences between EDUs in the summary and those in the pyramid. Experiments on DUC and TAC data sets show that our methods strongly correlate with various manual evaluations.


Introduction
To develop high quality summarization systems, we need accurate automatic content evaluation. Although various evaluation measures have been proposed, ROUGE-N (Lin, 2004) and Basic Elements (BE) (Hovy et al., 2006) remain the de facto standard measures since they strongly correlate with various manual evaluations and are easy to use. However, the evaluation scores computed by these automatic measures are of limited use for improving system performance because they merely check whether the summary contains small textual fragments and do not address semantic correctness.
The pyramid method was proposed as a manual evaluation that well supports the improvement of summarization systems (Nenkova and Passonneau, 2004; Nenkova et al., 2007). First, the method identifies conceptual contents, Summary Content Units (SCUs), in reference summaries and then constructs a pyramid by collecting semantically equivalent SCUs. The weight of an SCU in the pyramid is defined as the number of reference summaries that contain the SCU. Thus, an SCU shared by many reference summaries is given a higher weight. Second, a system summary is scored by the correspondences between SCUs in the summary and the pyramid. The results are very useful for system improvement, i.e., they show which important SCUs the system did or did not include in the summary. Although the pyramid method is reliable, it requires considerable cost and effort.
To address this weakness, an automatic pyramid evaluation method, Pyramid Evaluation via Automated Knowledge Extraction (PEAK), was proposed (Yang et al., 2016). Since SCUs are the conceptual contents of the text, it is difficult to extract them from reference summaries automatically. Thus, PEAK regards subject-predicate-object triples as alternatives to SCUs and constructs a pyramid by clustering semantically equivalent triples. However, the performance of subject-predicate-object triple extraction does not meet practical demands, and the semantic similarity utilized for clustering the triples does not correlate well with human judgment (see Section 2). As a result, the resultant pyramid is unreliable. In fact, PEAK is significantly inferior to ROUGE and BE in terms of correlation (see Section 4.3).
To cope with the above problems, this paper proposes yet another automatic pyramid evaluation method. Its key feature is constructing a pyramid that consists of Elementary Discourse Units (EDUs), clause-like text units introduced in Rhetorical Structure Theory (Mann and Thompson, 1988), in the source documents. In other words, we regard EDUs as alternatives to SCUs. To construct the pyramid, we transform human-made reference summaries into EDU-based extractive reference summaries and then weight every EDU by counting the number of extractive reference summaries that contain the EDU. The reasons why we derive extractive reference summaries whose SCUs are EDUs are as follows. First, Li et al. (2016) reported that EDUs are very similar to SCUs. Second, the performance of EDU segmenters is sufficient to satisfy practical requirements (see Section 2). Third, we do not need to measure any semantic similarity to identify EDUs common to the extractive reference summaries. We also examine two types of extractive reference summary. One is based on the alignment between EDUs in the reference summary and the source documents. The other is based on the extractive oracle summary (Hirao et al., 2017). We conducted experiments on the Document Understanding Conference (DUC) 2003 to 2007 data sets and the Text Analysis Conference (TAC) 2008 to 2011 data sets. The results showed that our methods exhibit strong correlation with manual evaluations.

Background and Related Work
The pyramid method (Nenkova and Passonneau, 2004; Nenkova et al., 2007), a manual evaluation framework, was developed to measure the content coverage of summaries. The pyramid method consists of two steps: (1) pyramid construction, and (2) summary scoring based on the pyramid. First, human annotators identify Summary Content Units (SCUs), conceptual content units in the reference summaries. They then construct a pyramid by clustering and weighting SCUs. The weight of an SCU is defined as the number of reference summaries that contain the SCU. As a result, if there are K reference summaries, the upper bound weight of an SCU in the pyramid is K and the lower bound is 1. Second, the score for a summary is determined by the correspondences between SCUs in the summary and those in the pyramid. Thus, the score is defined as the sum of the weights of the SCUs in the summary that correspond to those in the pyramid, divided by the maximum sum of SCU weights possible for an average-length reference summary. The pyramid method has two advantages over conventional manual evaluations: (1) the score is not intuitive but is systematically computed, i.e., the score can be explained as the sum of weights of SCUs in the pyramid, and (2) the correspondences between the SCUs in a summary and the pyramid tell us whether the summary contains important SCUs or not. Thus, the results explicitly tell us why a summary was given a good or bad score.
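The scoring rule described in this paragraph can be written compactly as follows. This is only a sketch: the symbols $D(S)$, $w_i$, $X$ and $\mathrm{Max}_X$ are introduced here for illustration, and reading "average-length reference summary" as "the average number of SCUs per reference summary" is an assumption.

```latex
% D(S): pyramid SCUs found in summary S;  w_i in {1, ..., K}: weight of SCU i
% X: average number of SCUs in a reference summary (assumed interpretation)
\mathrm{Score}(S) = \frac{\sum_{i \in D(S)} w_i}{\mathrm{Max}_X},
\qquad
\mathrm{Max}_X = \max_{T \subseteq \mathrm{Pyramid},\; |T| = X} \; \sum_{i \in T} w_i
```

For instance, with K = 4 reference summaries, pyramid weights {4, 3, 3, 2, 1, 1} and X = 3, the denominator is 4 + 3 + 3 = 10, so a summary that expresses one SCU of weight 4 and one of weight 1 would score 5/10 = 0.5.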
During the past few years, studies have focused on the automatic scoring of summaries based on manually generated pyramids. Harnly et al. (2005) proposed a scoring method that matches SCUs in the pyramid with possible textual fragments in the summary. They enumerate all possible textual fragments within a sentence in the summary and compute similarity scores between the fragments and the SCUs in the pyramid based on unigram overlap. Then, they find the optimal correspondences between SCUs and the fragments that maximize the sum of similarity scores. Passonneau et al. (2013) extended the method by introducing distributional semantics to compute the similarity scores between SCUs and the fragments.
Recently, Yang et al. (2016) proposed the first automatic pyramid method, Pyramid Evaluation via Automated Knowledge Extraction (PEAK). PEAK employs subject-predicate-object triples extracted by ClausIE (Del Corro and Gemulla, 2013) as SCUs, and constructs pyramids by cutting a graph whose vertices represent the triples and whose edges represent semantic similarity scores between the triples computed by Align, Disambiguate and Walk (ADW) (Pilehvar et al., 2013). When evaluating a summary, PEAK constructs a weighted bipartite graph whose vertices represent subject-predicate-object triples extracted from the pyramid and the summary, respectively; the edges represent the similarity scores between the triples as computed by ADW. It scores the summary by solving the Linear Assignment Problem which involves maximizing the sum of the similarity scores on the bipartite graph.
The major difference between PEAK and our method is that the former regards the reference summary as a set of subject-predicate-object triples while the latter regards a reference summary as a set of EDUs obtained from the source documents. Thus, to construct high quality pyramids, PEAK is required not only to extract the triples accurately but also to measure the semantic similarity between them accurately. However, in general, both extracting the triples and measuring the semantic similarity are still challenging NLP tasks, and their performance is not always sufficient for practical use. In fact, the F-measure of ClausIE is around 0.6 (Del Corro and Gemulla, 2013) and the correlation coefficients between the semantic similarity obtained from ADW and human judgment lie in the range of 0.55 to 0.88 (Pilehvar et al., 2013). As a result, the resultant pyramids have insufficient quality to be practical. Clearly, further improvement is necessary.
While our method is required to decompose a document into EDUs accurately, EDU segmenters offer accurate decomposition performance; existing EDU boundary detection methods have F-measures over 0.9 (Fisher and Roark, 2007; Feng and Hirst, 2014). Moreover, since extractive reference summaries are sets of EDUs from the source documents, we do not need semantic similarity to identify EDUs that have the same meaning. Thus, we can easily construct a pyramid by simply counting the number of extractive reference summaries that contain each EDU.

Automatic Pyramid Evaluation
First, we transform human-made reference summaries into extractive reference summaries; the EDUs in the source documents are used as the atomic units. Second, we construct a pyramid by weighting EDUs in the extractive reference summaries. EDU weights are defined as the number of reference summaries that contain each EDU (see Figure 1). In addition, we propose two techniques for deriving the extractive reference summaries.

Extractive Reference Summaries based on Alignment between EDUs
When similarity scores between EDUs in a reference summary and those in the source documents are available, we can regard extractive reference summary derivation as an optimal alignment problem with a length constraint, an extension of the Linear Assignment Problem. We assume a bipartite graph in which the vertices represent EDUs in the reference summary and the source documents, and the edges represent similarity scores between the EDUs. The optimal alignment is obtained by solving an ILP problem, in which $E$ is the set of all EDUs in the source documents and $M$ is the set of all EDUs in the reference summary. $\ell(\cdot)$ returns the length (the number of words) of a textual unit. $\phi(e_j, m_k)$ returns the similarity score between the $j$-th EDU in the source documents and the $k$-th EDU in the reference summary; it is based on $\mathrm{LCS}(\cdot, \cdot)$, which returns the Longest Common Subsequence between $e_j$ and $m_k$. $a_{j,k}$ is a binary indicator, and $a_{j,k} = 1$ denotes that the $j$-th EDU $e_j$ in the source documents is aligned to the $k$-th EDU in the reference summary, i.e., $e_j$ is included in the extractive reference summary. Equation (2) ensures that the length of the extractive reference summary is less than $L_{max}$, the length of the human-made reference summary. After solving the ILP problem, we obtain the extractive reference summaries by collecting the EDUs for which $a_{j,k} = 1$.
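The alignment described above can be illustrated with a small sketch. It is not the authors' implementation: the similarity is assumed to be the LCS length normalized by the length of the reference EDU, the length constraint is treated as "at most `l_max` words", each EDU is assumed to be aligned at most once, and an exhaustive search stands in for an ILP solver, so it is only practical for toy inputs. All function names are hypothetical.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def phi(e, m):
    """Assumed similarity: LCS length normalized by the reference EDU length."""
    return lcs_len(e, m) / len(m) if m else 0.0


def align(source_edus, ref_edus, l_max):
    """Exhaustively search one-to-one alignments under a word budget.

    Returns (best total similarity, list of (j, k) pairs) where source EDU j
    is aligned to reference EDU k.
    """
    best = (-1.0, [])

    def search(k, used, length, score, pairs):
        nonlocal best
        if k == len(ref_edus):
            if score > best[0]:
                best = (score, list(pairs))
            return
        # Option 1: leave reference EDU k unaligned.
        search(k + 1, used, length, score, pairs)
        # Option 2: align it to an unused source EDU that still fits.
        for j, e in enumerate(source_edus):
            if j in used or length + len(e) > l_max:
                continue
            used.add(j)
            pairs.append((j, k))
            search(k + 1, used, length + len(e),
                   score + phi(e, ref_edus[k]), pairs)
            pairs.pop()
            used.remove(j)

    search(0, set(), 0, 0.0, [])
    return best
```

The source EDUs that appear in the returned pairs form the extractive reference summary for that human-made reference.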

Extractive Reference Summaries based on Extractive Oracle Summaries
As another type of extractive reference summary, we can utilize the extractive oracle summary (Hirao et al., 2017). The extractive oracle summary is defined as the set of consequential textual fragments within a sentence obtained from the source documents that has the maximum automatic evaluation score. Since we regard EDUs as SCUs and employ ROUGE/BE as the automatic evaluation measure, an extractive reference summary is a summary that consists of EDUs in the source documents and has the maximum ROUGE/BE score. For a given reference summary $R$, the extractive oracle summary is the summary that maximizes $f(\cdot)$ (Equation (7)), where $f(\cdot)$ denotes an automatic evaluation measure (ROUGE/BE) (Equation (8)).
$S$ is a system summary and $U$ is the set of all atomic units in the reference summary. N-grams are utilized as the units in computing ROUGE, and head-modifier-relation triples are utilized in computing BE. $N(u_i, R)$ and $N(u_i, S)$ return the number of occurrences of the unit $u_i$ in the reference and system summary, respectively. Since the extractive oracle summaries in Hirao et al. (2017) are based on sentences, we extend the method to obtain EDU-based extractive oracle summaries. In the ILP formulation that returns an extractive oracle summary, $z_i$ is the count of the $i$-th unit in the oracle summary. $x_j$ is a binary indicator; $x_j = 1$ denotes that the $j$-th EDU, $e_j$, is included in the oracle summary. $s_m$ is a binary indicator; $s_m = 1$ denotes that one or more EDUs in the $m$-th sentence are included in the oracle summary. The value of $\sum_{m=1}^{|S|} s_m$ is equal to the number of sentences whose EDUs are used in the oracle summary. Thus, an oracle summary that consists of fewer sentences tends to obtain a higher objective value. Therefore, we can avoid generating fragmented oracle summaries with low readability. This objective function is inspired by the compressive summarization method of Morita et al. (2013). $V_i$ is the set of indices indicating the positions of the $i$-th unit, and $d_n$ is a binary indicator indicating whether the $n$-th unit is contained in the oracle summary or not. Functions $\mathrm{left}(\cdot)$ and $\mathrm{right}(\cdot)$ return the index of the EDU that contains the left word of the unit and the index of the EDU that contains the right word of the unit, respectively. Function $c(\cdot)$ returns the index of the sentence that contains the $j$-th EDU. Figure 2 shows examples. Suppose that the 10th triple in $U$ is "<has,computer,rcmod>". From the figure, the indices of the triples corresponding to "<has,computer,rcmod>" are 6 and 21. Thus, $V_{10} = \{6, 21\}$. The word on the left in the triples is "computer" and the word on the right is "has". For the first triple, the index of the EDU that contains "computer" is 1 and the index of the EDU that contains "has" is 2. For the second triple, the index of the EDU that contains "computer" is 4, while that of "has" is 5. Thus, $\mathrm{left}(6) = 1$, $\mathrm{right}(6) = 2$, $\mathrm{left}(21) = 4$, and $\mathrm{right}(21) = 5$.
After solving the ILP problem, we construct the extractive oracle summary by collecting the EDUs for which $x_j = 1$.
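To make the idea of an EDU-based oracle concrete, here is a brute-force sketch for the ROUGE-2 case. Several things are assumptions rather than the paper's formulation: $f$ is taken to be the usual clipped n-gram recall, a bigram from a source sentence is counted only when every EDU it touches is selected (mirroring the role of left(·) and right(·)), a simple preference for fewer sentences on ties stands in for the sentence-count term of the objective, and subset enumeration replaces the ILP. The function names and the input layout are hypothetical.

```python
from collections import Counter
from itertools import combinations


def ngram_counts(tokens, n=2):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_recall(ref_counts, sys_counts):
    """Clipped n-gram recall: sum of min counts over the reference total."""
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(c, sys_counts[u]) for u, c in ref_counts.items())
    return matched / total


def selected_ngrams(sentences, chosen, n=2):
    """Count n-grams whose words all lie in selected EDUs.

    sentences: list of sentences, each a list of (edu_id, tokens) pairs.
    """
    counts = Counter()
    for sent in sentences:
        words = [(w, edu_id) for edu_id, toks in sent for w in toks]
        for i in range(len(words) - n + 1):
            window = words[i:i + n]
            if all(edu_id in chosen for _, edu_id in window):
                counts[tuple(w for w, _ in window)] += 1
    return counts


def edu_oracle(sentences, reference_tokens, l_max, n=2):
    """Return (score, EDU ids) of the best subset within l_max words."""
    ref_counts = ngram_counts(reference_tokens, n)
    length = {e: len(t) for sent in sentences for e, t in sent}
    sent_of = {e: m for m, sent in enumerate(sentences) for e, _ in sent}
    edu_ids = sorted(length)
    best_key, best_subset = (0.0, 0), ()
    for r in range(1, len(edu_ids) + 1):
        for subset in combinations(edu_ids, r):
            if sum(length[e] for e in subset) > l_max:
                continue
            score = rouge_recall(
                ref_counts, selected_ngrams(sentences, set(subset), n))
            # Assumed tie-break: prefer subsets drawn from fewer sentences.
            key = (score, -len({sent_of[e] for e in subset}))
            if key > best_key:
                best_key, best_subset = key, subset
    return best_key[0], best_subset
```

The enumeration is exponential in the number of EDUs, which is why an ILP solver is the practical choice for real document sets.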

Pyramid Construction: EDU Weighting
By deriving extractive references, we can easily construct a pyramid. The weight of an EDU is defined as the number of extractive references that contain the EDU. Here, $P$ is the complete set of all EDUs in the $K$ extractive reference summaries. The weight of the $j$-th EDU $p_j$ in $P$ is defined as $w(p_j) = C(p_j)$, where $C(\cdot)$ returns the number of extractive reference summaries that contain $p_j$, i.e., the maximum value of $C(p_j)$ is $K$ and its minimum value is 1. Since all EDUs in the source documents are assigned an integer score in the range of 0 to $K$ (see footnote 1), the scoring can be regarded as a variant of the relative utility score (Radev and Tam, 2003).
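The weighting step amounts to a count over the extractive references. The sketch below assumes each extractive reference is given as a collection of EDU identifiers; the identifiers and the function name are hypothetical.

```python
from collections import Counter


def edu_weights(extractive_refs):
    """Weight each EDU by the number of extractive references containing it.

    extractive_refs: list of K collections of EDU ids (hypothetical ids).
    EDUs that appear in no reference get weight 0 on lookup.
    """
    weights = Counter()
    for ref in extractive_refs:
        # set() so that an EDU counts once per reference.
        weights.update(set(ref))
    return weights
```

With K references, every stored weight lies between 1 and K, and looking up an EDU outside the pyramid returns 0.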

Automatic Scoring of Summaries
Based on the pyramid, we compute a score for a summary by aligning EDUs in the pyramid and EDUs in the system summary. Following PEAK, we find the optimal alignment by solving the Linear Assignment Problem. That is, we compute all similarity scores between EDUs in the summary and the pyramid and then find the maximal score such that each EDU in the system summary is matched to at most one EDU in the pyramid. The problem is formulated as an ILP problem.

Footnote 1: EDUs that are not included in the pyramid have a score of zero.
$C$ is the set of EDUs in the system summary. Function $g(\cdot)$ is a binary function based on the similarity score between EDUs and a threshold $t$; we set $t = 0.55$ in our experiments (Section 4). $\alpha_{i,j}$ is a binary indicator; $\alpha_{i,j} = 1$ denotes that the $i$-th EDU in the system summary is aligned to the $j$-th EDU in the pyramid.
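A small sketch of this matching step follows. It assumes a precomputed similarity matrix, that $g$ fires when the similarity is at least $t$, and that each pyramid EDU is also matched at most once; the memoized exhaustive search stands in for an ILP or assignment solver and only suits small inputs. The names `match_score`, `sim` and `weights` are hypothetical.

```python
from functools import lru_cache


def match_score(sim, weights, t=0.55):
    """Maximum total pyramid weight collected by a one-to-one alignment.

    sim[i][j]: similarity between system EDU i and pyramid EDU j.
    weights[j]: weight of pyramid EDU j.
    """
    n_sys, n_pyr = len(sim), len(weights)

    @lru_cache(maxsize=None)
    def best(i, used):
        # `used` is a bitmask of pyramid EDUs that are already matched.
        if i == n_sys:
            return 0
        top = best(i + 1, used)  # leave system EDU i unaligned
        for j in range(n_pyr):
            if not (used >> j) & 1 and sim[i][j] >= t:
                top = max(top, weights[j] + best(i + 1, used | (1 << j)))
        return top

    return best(0, 0)
```

The example in the test shows why a global optimum matters: matching the first system EDU greedily to the heaviest pyramid EDU would block the second system EDU, whereas the optimal alignment collects both weights.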
The optimal value of the objective function in the ILP problem (19)-(22) is not normalized. Since the unnormalized score is not suitable for comparing systems, we propose a normalization method. To normalize the score into the range of 0 to 1, we divide it by the maximum attainable sum of EDU weights. Since every EDU in the pyramid has both a length (the number of words) and a weight, the maximum score is derived by solving the knapsack problem, in which $x_j$ is a binary indicator and $x_j = 1$ denotes that the $j$-th EDU is included in the knapsack. Finally, the score is defined as $\mathrm{Pyramid}(P, S) = \mathrm{OPT}_{LAP} / \mathrm{OPT}_{KP}$, where $\mathrm{OPT}_{LAP}$ and $\mathrm{OPT}_{KP}$ denote the maximum value of Equation (20) and the maximum value of Equation (25), respectively.
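The normalization can be sketched with a standard 0/1 knapsack dynamic program. The knapsack capacity is assumed here to be the summary length limit in words, which the text above does not state explicitly; the function names are hypothetical.

```python
def knapsack_max(lengths, weights, capacity):
    """Largest total weight of EDUs whose total length fits in `capacity`."""
    best = [0] * (capacity + 1)
    for length, weight in zip(lengths, weights):
        # Iterate capacities downwards so each EDU is used at most once.
        for cap in range(capacity, length - 1, -1):
            best[cap] = max(best[cap], best[cap - length] + weight)
    return best[capacity]


def normalized_score(opt_lap, lengths, weights, capacity):
    """Divide the alignment score by the knapsack upper bound."""
    opt_kp = knapsack_max(lengths, weights, capacity)
    return opt_lap / opt_kp if opt_kp else 0.0
```

For example, with pyramid EDU lengths [5, 4, 3, 6], weights [4, 2, 1, 4] and a capacity of 11 words, the best attainable weight is 8 (the first and last EDUs), so an alignment score of 6 normalizes to 0.75.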

Experiments
To investigate the effectiveness of our automatic evaluation method, we compare the correlation coefficients yielded by our methods with those obtained from strong baselines: ROUGE-2, ROUGE-SU4 and BE. We employ the ROUGE toolkit version 1.5.5 to compute ROUGE/BE scores and the Stanford Parser (de Marneffe et al., 2006) to obtain head-modifier-relation triples. In addition, we examine two types of oracle summaries for our method. One is ROUGE-2-based, the other is BE-based.
We evaluate automatic evaluation measures by Pearson's correlation r, Spearman's rank correlation ρ and Kendall's rank correlation τ. Correlation coefficients are computed between the average automatic scores and the average manual evaluation scores over all topics.
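As an illustration of this protocol, the sketch below averages per-topic scores for each system and then computes the three coefficients in plain Python. The Kendall variant shown is tau-a without tie correction, which is an assumption since the variant is not specified here; constant inputs (zero variance) are not handled. All names are hypothetical.

```python
from math import sqrt


def system_means(per_topic):
    """per_topic: one list of topic-level scores per system."""
    return [sum(v) / len(v) for v in per_topic]


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho as Pearson's r on the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau_a(x, y):
    """(concordant - discordant) pairs over all pairs; no tie correction."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)
```

In use, `system_means` would be applied once to the automatic scores and once to the manual scores, and the two resulting lists (one value per system) passed to each coefficient.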

Data Sets
We conducted experiments on the data sets developed for multi-document summarization tasks in DUC-2003 to 2007 and TAC-2008 to 2011. Table 1 and Table 2 show the statistics of the data sets. DUC-2003 and 2004 were used for a generic summarization task with a 100-word limit; mean coverage was used in the manual evaluation. DUC-2005 to 2007 were used for a query-focused summarization task with a 250-word limit; responsiveness was used in the manual evaluation. The number of topics varied from 30 to 50 and the number of participating systems from 16 to 35. Note that the pyramid method was applied to small sets of topics in DUC-2006. TAC-2008 and 2009 were used for an update summarization task while TAC-2010 and 2011 were employed for a guided summarization task. For both tasks, the participating systems were required to produce two types of summaries, an initial summary and an update summary, with a 100-word limit. Both the pyramid method and responsiveness were used in the manual evaluations. In particular, TAC-2008 to 2011 have large numbers of participating systems, from 44 to 48.

EDU Segmenter
We regard decomposing a sentence into EDUs as a sequential tagging problem and implement a neural EDU segmenter, based on a 3-layer bi-LSTM (Wang et al., 2015), that classifies each word in a sentence as an EDU boundary or not. The sizes of the word embeddings and the hidden layers of the LSTM were set to 100 and 256, respectively. To handle low-frequency words, all words are encoded into 40-dimensional hidden states by a character-based bi-LSTM (Lample et al., 2016). To utilize all the words in the corpus, we integrated word dropout (Iyyer et al., 2015) into our models with a smoothing rate of 1.0. Moreover, to avoid overfitting the training data, a dropout layer with a ratio of 0.3 was applied to the input of the LSTMs.
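Once a tagger of this kind has labelled every word, turning the labels into EDUs is a simple pass over the sentence. The sketch below assumes the convention that a positive label marks the last word of an EDU; the actual tag scheme of the segmenter is not specified here, and the function name is hypothetical.

```python
def tags_to_edus(tokens, is_boundary):
    """Split a tokenized sentence into EDUs from per-word boundary labels.

    is_boundary[i] is True when token i is the last word of an EDU
    (assumed convention). A trailing unlabelled span forms a final EDU.
    """
    edus, current = [], []
    for token, boundary in zip(tokens, is_boundary):
        current.append(token)
        if boundary:
            edus.append(current)
            current = []
    if current:
        edus.append(current)
    return edus
```

For example, labelling only "computer" as a boundary in "the computer that has a camera" yields the two EDUs "the computer" and "that has a camera".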
The segmenter was trained on the training data of the RST Discourse Treebank corpus (Carlson et al., 2001). The macro-averaged F-measure of boundary detection on the test data of the corpus is 0.917. The source documents, system summaries and reference summaries utilized in our experiments were decomposed into EDUs by the segmenter.

Results and Discussion
Tables 3 and 4 list the correlation coefficients between manual evaluation and automatic evaluation for DUC-2003 to 2007 and TAC-2008 to 2011. In the tables, the coefficients are written in the order "Pearson's r / Spearman's ρ / Kendall's τ". The rows of Prop(BE), Prop(ROUGE) and Prop(AL) denote our method with BE-based oracle summaries as extractive reference summaries, with ROUGE-2-based oracle summaries, and with extractive reference summaries based on alignment, respectively. "Cov.", "Resp." and "Pyr." denote mean coverage, responsiveness and manual pyramid, respectively. With regard to mean coverage on DUC-2003 to 2004, Prop(BE) achieved the best correlation coefficients. The correlation coefficients indicate very strong correlation with the manual evaluation. Prop(ROUGE) and Prop(AL) attained correlation coefficients comparable to those of the baseline methods. These correlation coefficients still indicate strong correlation.
With regard to responsiveness, our methods achieved lower correlation coefficients on DUC-2005 to 2007 than on DUC-2003 to 2004. Although our methods are outperformed by the baseline methods, both r and ρ of Prop(BE) still exceed 0.8 except for responsiveness on DUC-2006. Since our methods mimic manual pyramid evaluation, the correlation coefficients against manual pyramid on DUC-2006 to 2007 are better than those against responsiveness, and the scores are comparable to those of the baselines.
Moreover, we compare our methods with PEAK on the DUC-2006 data set. For manual pyramid, r and ρ are 0.508 and 0.538, respectively, while for responsiveness they are 0.617 and 0.640, respectively. These scores are significantly lower than those attained by our methods and the baselines. Note that these results were obtained by running the code from the author's web page, http://www.larayang.com/peak/. The results demonstrate that our methods are superior to PEAK.
For manual pyramid on TAC-2008 to 2011, all methods attained quite strong correlation. The scores achieved were around 0.9 and better than those on DUC-2003 to 2007. In particular, Prop(BE) achieved the best scores in some cases. Although responsiveness yielded lower correlation coefficients than manual pyramid, Prop(BE) still retains strong correlation, e.g., ρ is in the range of 0.857 to 0.932 against manual pyramid and 0.810 to 0.913 against responsiveness. The average correlation coefficients across all data sets on TAC are shown in Table 5. The average correlation coefficients of Prop(BE) are slightly lower than those of ROUGE-2 and BE against manual pyramid. On the other hand, Prop(BE) achieved the best correlation coefficients against responsiveness. The results imply that Prop(BE) achieves performance comparable to the baseline methods.
In a comparison of our methods, Prop(BE) attained the best results while Prop(ROUGE) showed better results than Prop(AL) in many cases. These results imply that extractive oracle summaries are helpful as extractive reference summaries and that BE is a better objective function for generating them.
In addition, we show SCUs and corresponding EDUs obtained from a human-made pyramid and Prop(BE) in Figure 3. They are obtained from the topic "Earthquake Sichuan (ID:D1110B)" from the TAC-2011 Guided Summarization Task; the topic type is categorized as "Accidents and Natural Disasters". Summarizers are required to generate a summary that includes the following aspects: (1)   (3) WHERE: physical location, (4) WHY: reasons for accident/disaster, (5) WHO AFFECTED: casualties (death, injury), or individuals otherwise negatively affected by the accident/disaster, (6) DAMAGES: damages caused by the accident/disaster, (7) COUNTERMEASURES: countermeasures, rescue efforts, prevention efforts, other reactions to the accident/disaster. From the figure, we can see that the EDUs are not always identical to human-generated SCUs at the word level but are identical at the concept level.
In short, these results imply that our methods have at least comparable performance to the baselines. Although our methods are outperformed by the baselines in some cases, the correlation coefficients against manual evaluation are high enough. Moreover, our methods have a significant advantage over the baseline methods because they clearly indicate whether the output of the text summarization system failed to include important SCUs. Thus, our automatic pyramid method enhanced with extractive oracle summaries is helpful for further improvement of summarization systems.

Conclusion
This paper proposed an automatic pyramid evaluation method that supports failure analysis of systems. To construct a pyramid, we transform human-made reference summaries into extractive reference summaries whose atomic units are EDUs obtained from the source documents. Then, we weight every EDU by counting the number of extractive reference summaries that contain the EDU. When evaluating a summary, we determine the correspondences between EDUs in the pyramid and those in the summary by solving the Linear Assignment Problem and give a score to the summary based on the correspondences. We also proposed two types of extractive reference summaries. The first is the alignment-based extractive reference summary. The second is the extractive oracle summary.

Figure 3: Examples of SCUs obtained from pyramids.
SCUs obtained from human-made pyramid:
w = 4: The 7.8-magnitude earthquake struck
w = 4: Sichuan Province of China
w = 4: No warning signs detected
w = 4: Over 8,500 killed
w = 4: China allocated 200 million yuan ($29 Million) disaster relief
w = 1: Rain is forecast, could hamper relief efforts
w = 1: Quake also affected Gansu, Shaanxi provinces, and Chongqing municipality
EDUs obtained from pyramid of Prop(BE):
w = 1: The 7.8-magnitude earthquake struck Sichuan province shortly before 2:30 pm
w = 2: in aid for earthquake victims in Sichuan Province of China
w = 4: Chinese authorities did not detect any warning signs ahead of Monday's earthquake
w = 1: leaving at least 12,000 people died
w = 2: China has allocated 200 million yuan
w = 1: Rain in the coming days in Sichuan is expected to hamper earthquake relief efforts, as well as increase risks of landslides
w = 1: 50 in the municipality of Chongqing, 61 in Shaanxi province, and one in southwestern Yunnan

To demonstrate the effectiveness of our methods, we conducted experiments on the DUC-2003 to 2007 and TAC-2008 to 2011 data sets. The results demonstrated that our methods correlate well with various manual evaluations. The correlation coefficients are at least comparable to those obtained from strong baselines (ROUGE-2, ROUGE-SU4 and BE) and significantly higher than those obtained from the previous automatic pyramid evaluation method, PEAK.