Enumeration of Extractive Oracle Summaries

To analyze the limitations and future directions of the extractive summarization paradigm, this paper proposes an Integer Linear Programming (ILP) formulation for obtaining extractive oracle summaries in terms of ROUGE-n. We also propose an algorithm that enumerates all of the oracle summaries for a set of reference summaries, which lets us compute F-measures that evaluate how many sentences in a system summary also appear in an oracle summary. Our experimental results obtained from Document Understanding Conference (DUC) corpora demonstrated the following: (1) room still exists to improve the performance of extractive summarization; (2) the F-measures derived from the enumerated oracle summaries have significantly stronger correlations with human judgment than those derived from single oracle summaries.

The summarization research community is experiencing a paradigm shift from extractive to compressive or abstractive summarization. The current question is: "Is extractive summarization still a useful line of research?" To answer it, the ultimate limitations of the extractive summarization paradigm must be understood; that is, we have to determine its upper bound and compare that bound with the performance of state-of-the-art summarization methods. Since ROUGE-n is the de facto automatic evaluation method and is employed in many text summarization studies, an oracle summary is defined as a set of sentences that has a maximum ROUGE-n score. If the ROUGE-n score of an oracle summary outperforms that of a system that employs another summarization approach, the extractive summarization paradigm remains worth investing research resources in.
As another benefit, identifying an oracle summary for a set of reference summaries allows us to utilize yet another evaluation measure. Since both oracle and extractive summaries are sets of sentences, it is easy to check whether a system summary contains sentences in the oracle summary. As a result, F-measures are available to evaluate a system summary and are useful for evaluating classification-based extractive summarization (Mani and Bloedorn, 1998; Osborne, 2002; Hirao et al., 2002). Since ROUGE-n evaluation does not identify which sentences are important, an F-measure conveys useful information in terms of "important sentence extraction." Thus, combining ROUGE-n and an F-measure allows finer-grained failure analysis of systems.
Note that more than one oracle summary might exist for a set of reference summaries because ROUGE-n scores are based on the unweighted counting of n-grams. As a result, the F-measure might differ among multiple oracle summaries. Thus, we need to enumerate the oracle summaries for a set of reference summaries and compute the F-measures based on all of them.
In this paper, we first derive an Integer Linear Programming (ILP) problem to extract an oracle summary from a set of reference summaries and a source document(s). To the best of our knowledge, this is the first ILP formulation that extracts oracle summaries. Second, since it is difficult to enumerate oracle summaries for a set of reference summaries using ILP solvers, we propose an algorithm that efficiently enumerates all oracle summaries by exploiting the branch and bound technique. Our experimental results on the Document Understanding Conference (DUC) corpora showed the following:
1. Room still exists for the further improvement of extractive summarization, i.e., the ROUGE-n scores of the oracle summaries are significantly higher than those of the state-of-the-art summarization systems.
2. The F-measures derived from multiple oracle summaries obtain significantly stronger correlations with human judgment than those derived from single oracle summaries.

Definition of Extractive Oracle Summaries
We first briefly describe ROUGE-n. Given a set of reference summaries R and a system summary S, ROUGE-n is defined as follows:

$$\mathrm{ROUGE}_n(R, S) = \frac{\sum_{k}\sum_{g^n_j \in U(R_k)} \min\{N(g^n_j, R_k),\, N(g^n_j, S)\}}{\sum_{k}\sum_{g^n_j \in U(R_k)} N(g^n_j, R_k)} \quad (1)$$

Here, the k-th reference summary R_k and the system-generated summary S (a set of sentences) are identified with the multisets of n-grams they contain. N(g^n_j, R_k) and N(g^n_j, S) return the number of occurrences of n-gram g^n_j in the k-th reference summary and the system summary, respectively. Function U(·) transforms a multiset into a normal set, i.e., U(R_k) is the set of distinct n-grams in R_k. ROUGE-n takes values in the range [0, 1]; the value is 1 when the n-gram occurrences of the system summary agree with those of the reference summaries.
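As a concrete reference for the clipped n-gram matching behind Eq. (1), the score can be sketched in a few lines of Python (the function names and the token-list representation of summaries are ours, not the paper's):

```python
from collections import Counter

def ngrams(words, n):
    """Multiset of word n-grams in a token sequence."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def rouge_n(references, system, n=1):
    """ROUGE-n of a system summary against reference summaries R_1..R_K.

    `references` is a list of token lists (one per reference summary);
    `system` is the token list of the system summary.
    """
    sys_counts = ngrams(system, n)
    num = den = 0
    for ref in references:
        for gram, count in ngrams(ref, n).items():  # U(R_k): distinct n-grams
            num += min(count, sys_counts[gram])     # clipped match count
            den += count
    return num / den if den else 0.0
```

For instance, `rouge_n([["a", "b", "c"]], ["a", "b", "d"], 1)` matches two of the three reference unigrams.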
In this paper, we focus on extractive summarization, employ ROUGE-n as an evaluation measure, and define the oracle summaries as follows:

$$S^{*} = \operatorname*{arg\,max}_{S \subseteq D}\ \mathrm{ROUGE}_n(R, S) \quad \text{s.t.}\ \ell(S) \le L_{\max} \qquad (2)$$

D is the set of all the sentences contained in the input document(s), and L_max is the length limit of the oracle summary. ℓ(S) denotes the number of words in the summary S. Eq. (2) is an NP-hard combinatorial optimization problem, and no polynomial-time algorithm is known that can attain an optimal solution.
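The optimization in Eq. (2) can be made concrete with a brute-force sketch that scores every length-feasible subset of sentences and keeps all maximizers. This is only viable for toy inputs, which is exactly why the paper develops ILP and branch-and-bound machinery; the helper names here are ours:

```python
from collections import Counter
from itertools import combinations

def ngrams(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def rouge_n(references, tokens, n):
    counts = ngrams(tokens, n)
    num = den = 0
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            num += min(count, counts[gram])
            den += count
    return num / den if den else 0.0

def oracle_summaries(references, sentences, l_max, n=1, tol=1e-12):
    """Score every subset of sentences within the word budget (Eq. 2)
    and keep all subsets attaining the maximum ROUGE-n score."""
    best, oracles = -1.0, []
    for r in range(len(sentences) + 1):
        for subset in combinations(range(len(sentences)), r):
            tokens = [w for i in subset for w in sentences[i]]
            if len(tokens) > l_max:
                continue
            score = rouge_n(references, tokens, n)
            if score > best + tol:
                best, oracles = score, [subset]
            elif abs(score - best) <= tol:
                oracles.append(subset)
    return best, oracles
```

Even on a toy input the maximum can be attained by several distinct subsets, which is the multiplicity of oracle summaries the paper's enumeration targets.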
Related Work

Lin and Hovy (2003) utilized a naive exhaustive search method to obtain oracle summaries in terms of ROUGE-n and exploited them to understand the limitations of extractive summarization systems. Ceylan et al. (2010) proposed another naive exhaustive search method to derive a probability density function from the ROUGE-n scores of oracle summaries for the domains to which source documents belong. The computational complexity of naive exhaustive methods is exponential in the size of the sentence set. Thus, it may be possible to apply them to single-document summarization tasks involving a dozen sentences, but it is infeasible to apply them to multiple-document summarization tasks that involve several hundred sentences.
To quantify the difference between the ROUGE-n scores of oracle and system summaries in multiple-document summarization tasks, Riedhammer et al. (2008) proposed an approximate method based on a genetic algorithm (GA) to find oracle summaries. Moen et al. (2014) utilized a greedy algorithm for the same purpose. Although GAs and greedy algorithms are widely used to solve NP-hard combinatorial optimization problems, their solutions are not always optimal. Thus, the resulting summary does not always have a maximum ROUGE-n score for the set of reference summaries. Both works called the summary found by their methods the oracle, but it differs from the definition in our paper.
Since summarization systems cannot reproduce human-made reference summaries in most cases, oracle summaries, which can be reproduced by summarization systems, have been used as training data to tune the parameters of summarization systems. For example, Kulesza and Tasker (2011) and Sipos et al. (2012) trained their summarizers with oracle summaries found by a greedy algorithm. Peyrard and Eckle-Kohler (2016) proposed a method to find a summary that approximates a ROUGE score based on the ROUGE scores of individual sentences and exploited the framework to train their summarizer. As mentioned above, such summaries do not always agree with the oracle summaries defined in our paper. Thus, the quality of the training data is suspect. Moreover, since these studies fail to consider that a set of reference summaries has multiple oracle summaries, the score of the loss function defined between their oracle and system summaries is not appropriate in most cases.
As mentioned above, no known efficient algorithm can extract "exact" oracle summaries as defined in Eq. (2); only a naive exhaustive search is available. Thus, approximate algorithms such as greedy algorithms are mainly employed to obtain them.

Oracle Summary Extraction as an Integer Linear Programming (ILP) Problem
To extract an oracle summary from source document(s) and a given set of reference summaries, we start by deriving an Integer Linear Programming (ILP) problem. Since the denominator of Eq. (1) is constant for a given set of reference summaries, we can find an oracle summary by maximizing the numerator of Eq. (1). Thus, the ILP formulation is defined as follows:

$$\begin{aligned}
\max\ & \sum_{k}\sum_{j} z_{kj} & (3)\\
\text{s.t.}\ & \sum_{i} \ell(s_i)\, x_i \le L_{\max} & (4)\\
& \forall k, j:\ z_{kj} \le \sum_{i} N(g^n_j, s_i)\, x_i & (5)\\
& \forall k, j:\ z_{kj} \le N(g^n_j, R_k) & (6)\\
& \forall i:\ x_i \in \{0, 1\} & (7)\\
& \forall k, j:\ z_{kj} \in \mathbb{Z}_{+} & (8)
\end{aligned}$$
Here, z_kj is the count of the j-th n-gram of the k-th reference summary in the oracle summary, i.e., z_kj = min{N(g^n_j, R_k), N(g^n_j, S)}. ℓ(·) returns the number of words in a sentence, x_i is a binary indicator, and x_i = 1 denotes that the i-th sentence s_i is included in the oracle summary. N(g^n_j, s_i) returns the number of occurrences of n-gram g^n_j in the i-th sentence. Constraints (5) and (6) ensure that z_kj = min{N(g^n_j, R_k), N(g^n_j, S)}.

[Figure 1: Example of a search tree]

Branch and Bound Technique for Enumerating Oracle Summaries
Since enumerating oracle summaries with an ILP solver is difficult, we extend the exhaustive search approach by introducing a search and prune technique to enumerate the oracle summaries. The search pruning decision is made by comparing the current upper bound of the ROUGE n score with the maximum ROUGE n score in the search history.

ROUGE-n Score for Two Distinct Sets of Sentences
The enumeration of oracle summaries can be regarded as a depth-first search on a tree whose nodes represent sentences. Fig. 1 shows an example of a search tree created in a naive exhaustive search. The nodes represent sentences, and the path from the root node to an arbitrary node represents a summary. For example, the red path in Fig. 1 from the root node to node s_2 represents a summary consisting of sentences s_1, s_2. By utilizing the tree, we can enumerate oracle summaries by depth-first search while excluding the summaries that violate the length constraint. However, this naive exhaustive search is impractical for large data sets because the number of nodes in the tree is 2^|D|.
If we prune unwarranted subtrees in each step of the depth-first search, we can make the search more efficient. The decision to search or prune is made by comparing the current upper bound of the ROUGE-n score with the maximum ROUGE-n score in the search history. For instance, in Fig. 1, we reach node s_2 by following the path "Root → s_1 → s_2". If we estimate the maximum ROUGE-n score (upper bound) obtainable by searching the descendants of s_2 (the subtree in the blue rectangle), we can decide whether the depth-first search should be continued. When the upper bound of the ROUGE-n score exceeds the current maximum ROUGE-n score in the search history, we have to continue. When the upper bound is smaller than the current maximum ROUGE-n score, no summary that contains s_1, s_2 is optimal, so we can skip subsequent search activity on the subtree and proceed to check the next branch: "Root → s_1 → s_3".
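A minimal sketch of the naive search-tree walk just described, pruning only by the length constraint (representing sentences by their word counts is our simplification):

```python
def enumerate_feasible(sent_lengths, l_max):
    """Walk the search tree depth-first: each path from the root is a
    candidate summary; a branch is cut as soon as it exceeds l_max."""
    results = []

    def dfs(start, path, length):
        if path:
            results.append(tuple(path))
        for i in range(start, len(sent_lengths)):
            if length + sent_lengths[i] <= l_max:
                dfs(i + 1, path + [i], length + sent_lengths[i])

    dfs(0, [], 0)
    return results
```

With sentence lengths [2, 2, 3] and a budget of 4, only four of the seven non-empty subsets survive the length check; without score-based pruning, the walk still visits every feasible node.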
To estimate the upper bound of the ROUGE-n score, we re-define it for two distinct sets of sentences V and W, i.e., V ∩ W = ∅, as follows:

$$\mathrm{ROUGE}_n(R, V, W) = \frac{\sum_{k}\sum_{g^n_j \in U(R_k)} \min\{\max\{N(g^n_j, R_k) - N(g^n_j, V),\, 0\},\, N(g^n_j, W)\}}{\sum_{k}\sum_{g^n_j \in U(R_k)} N(g^n_j, R_k)} \quad (9)$$

Here, V and W in N(·, ·) are identified with the multisets of n-grams found in the sets of sentences V and W, respectively, so that ROUGE_n(R, V, W) is the gain in clipped n-gram matches that W adds on top of V. The following decomposition then holds:

Theorem 1. $\mathrm{ROUGE}_n(R, V \cup W) = \mathrm{ROUGE}_n(R, V) + \mathrm{ROUGE}_n(R, V, W)$. (10)
We omit the proof of Theorem 1 due to space limitations.
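Assuming the marginal-gain reading of ROUGE_n(R, V, W) described above, the decomposition of Theorem 1 can be checked numerically; the following sketch (helper names are ours) verifies it on a toy example:

```python
from collections import Counter

def ngrams(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def rouge(references, tokens, n):
    counts = ngrams(tokens, n)
    num = den = 0
    for ref in references:
        for g, c in ngrams(ref, n).items():
            num += min(c, counts[g])
            den += c
    return num / den

def rouge_gain(references, v_tokens, w_tokens, n):
    """ROUGE_n(R, V, W): the clipped n-gram matches W adds on top of V."""
    v_counts, w_counts = ngrams(v_tokens, n), ngrams(w_tokens, n)
    num = den = 0
    for ref in references:
        for g, c in ngrams(ref, n).items():
            num += min(max(c - v_counts[g], 0), w_counts[g])
            den += c
    return num / den

# Theorem 1: ROUGE_n(R, V ∪ W) = ROUGE_n(R, V) + ROUGE_n(R, V, W)
refs = [["a", "b", "c", "a"]]
v, w = ["a", "b"], ["c", "a", "a"]
assert abs(rouge(refs, v + w, 1)
           - (rouge(refs, v, 1) + rouge_gain(refs, v, w, 1))) < 1e-12
```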

Upper Bound of ROUGE n
Let V be the set of sentences on the path from the current node to the root node in the search tree, and let W be the set of sentences that are the descendants of the current node. In Fig. 1, V = {s_1, s_2} and W = {s_3, s_4, s_5, s_6}. According to Theorem 1, the upper bound of the ROUGE-n score is defined as:

$$\mathrm{ROUGE}_n(R, V) + \max_{W' \subseteq W,\ \ell(V) + \ell(W') \le L_{\max}} \mathrm{ROUGE}_n(R, V, W') \quad (11)$$

Since the second term on the right side of Eq. (11) is an NP-hard problem, we turn to the following relation by introducing an inequality:

$$\mathrm{ROUGE}_n(R, V, W') \le \sum_{i=1}^{|W|} x_i\, \mathrm{ROUGE}_n(R, V, \{w_i\}) \quad (12)$$

Here, x = (x_1, ..., x_{|W|}) and x_i ∈ {0, 1}. Maximizing the right side of Eq. (12) subject to the remaining length budget is a knapsack problem, i.e., a 0-1 ILP problem. Although we can obtain its optimal solution using dynamic programming or ILP solvers, we solve its linear programming relaxation with a greedy algorithm for greater computational efficiency. The solution output by the greedy algorithm is optimal for the relaxed problem. Since the optimal solution of the relaxed problem is always at least as large as that of the original problem, the relaxed solution can be utilized as the upper bound.

Algorithm 1 shows the pseudocode that attains the upper bound of ROUGE-n. In the algorithm, U indicates the upper-bound score of ROUGE-n. We first set the initial upper bound U to ROUGE_n(R, V) (line 3). Then we compute the density of the ROUGE-n gains (ROUGE_n(R, V, {w}) / ℓ(w)) for each sentence w in W and sort the sentences in descending order of density (lines 4 to 6). While we have room to add w to the summary, we update U by adding ROUGE_n(R, V, {w}) (line 10) and update the remaining length budget L_max (line 11). When we no longer have room to add w, we update U by adding the score obtained by multiplying the density of w by the remaining length L_max (line 13) and exit the while loop.

[Algorithm 2: Greedy algorithm to obtain the initial score, GREEDY(R, D, L_max)]
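The fractional-knapsack step of Algorithm 1 can be sketched as follows; `base_score` plays the role of ROUGE_n(R, V) and `gains` holds the per-sentence gains ROUGE_n(R, V, {w}) (the names are ours, and positive sentence lengths are assumed):

```python
def upper_bound(base_score, gains, lengths, budget):
    """LP-relaxed knapsack bound in the spirit of Algorithm 1: start from
    ROUGE_n(R, V), add whole sentences in order of gain density, then fill
    the remaining room with a fraction of the next sentence."""
    items = sorted(zip(gains, lengths),
                   key=lambda gl: gl[0] / gl[1], reverse=True)
    u = base_score
    for gain, length in items:
        if length <= budget:
            u += gain                     # whole sentence fits
            budget -= length
        else:
            u += gain * budget / length   # density times remaining room
            break
    return u
```

Because the fractional optimum never falls below the 0-1 optimum, the returned value is a valid upper bound on any completion of the current partial summary.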

Initial Score for Search
Since the branch and bound technique prunes the search by comparing the best solution found so far with the upper bounds, obtaining a good solution in the early stage is critical for raising search efficiency.
Since ROUGE-n is a monotone submodular function (Lin and Bilmes, 2011), we can obtain a good approximate solution with a greedy algorithm (Khuller et al., 1999). It is guaranteed that the score of the obtained approximate solution is at least $\frac{1}{2}(1 - \frac{1}{e})\,\mathrm{OPT}$, where OPT is the score of the optimal solution. We employ this solution as the initial ROUGE-n score of the candidate oracle summary.
Algorithm 2 shows the greedy algorithm. In it, S denotes a summary and D denotes a set of sentences. The algorithm iteratively adds the sentence s* that yields the largest gain in the ROUGE-n score to the current summary S, provided the length of the summary does not violate the length constraint L_max (line 4). After the while loop, the algorithm compares the ROUGE-n score of S with the maximum ROUGE-n score of a single sentence and outputs the larger of the two scores (lines 11 to 13).
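A sketch of the greedy procedure just described, under our own helper names and token-list representation (not the paper's exact Algorithm 2):

```python
from collections import Counter

def ngrams(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def rouge(references, tokens, n):
    counts = ngrams(tokens, n)
    num = den = 0
    for ref in references:
        for g, c in ngrams(ref, n).items():
            num += min(c, counts[g])
            den += c
    return num / den if den else 0.0

def greedy_initial(references, sentences, l_max, n=1):
    """Largest-gain greedy plus a comparison with the best single
    sentence, following the description of Algorithm 2."""
    chosen, length = [], 0
    remaining = list(range(len(sentences)))
    while True:
        current = rouge(references, [w for i in chosen for w in sentences[i]], n)
        best_i, best_score = None, current
        for i in remaining:
            if length + len(sentences[i]) > l_max:
                continue  # sentence would violate the length constraint
            cand = [w for j in chosen + [i] for w in sentences[j]]
            score = rouge(references, cand, n)
            if score > best_score:
                best_i, best_score = i, score
        if best_i is None:           # no sentence improves the score
            break
        chosen.append(best_i)
        length += len(sentences[best_i])
        remaining.remove(best_i)
    greedy_score = rouge(references, [w for i in chosen for w in sentences[i]], n)
    single_best = max((rouge(references, s, n)
                       for s in sentences if len(s) <= l_max), default=0.0)
    return max(greedy_score, single_best)
```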

Enumeration of Oracle Summaries
By introducing threshold τ as the best ROUGE-n score in the search history, pruning decisions involve the following three conditions:

Case 1: τ ≤ ROUGE_n(R, V);
Case 2: ROUGE_n(R, V) < τ and the upper bound of the ROUGE-n score is smaller than τ;
Case 3: ROUGE_n(R, V) < τ ≤ the upper bound of the ROUGE-n score.

With case 1, we update the oracle summary as V and continue the search. With case 2, because both ROUGE_n(R, V) and the upper bound are smaller than τ, the subtree whose root node is the current node (the last visited node) is pruned from the search space, and we continue the depth-first search from the neighbor node. With case 3, we do not update the oracle summary as V because ROUGE_n(R, V) is less than τ. However, we might obtain a better oracle summary by continuing the depth-first search because the upper bound of the ROUGE-n score exceeds τ. Thus, we continue to search the descendants of the current node.

Algorithm 3 Branch and bound technique to enumerate oracle summaries
 1: Read R, D, L_max
 2: τ ← GREEDY(R, D, L_max), O_τ ← ∅
 3: for each s ∈ D do
 4:   append(S, ⟨ROUGE_n(R, {s}), s⟩)
 5: end for
 6: sort(S, 'descend')
 7: call FINDORACLE(S, C)
 8: output O_τ
 9: Procedure: FINDORACLE(Q, V)
10: while Q ≠ ∅ do
11:   s ← shift(Q)
12:   append(V, s)
13:   if L_max − ℓ(V) ≥ 0 then
14:     if ROUGE_n(R, V) ≥ τ then
15:       τ ← ROUGE_n(R, V)
16:       append(O_τ, V)
17:       call FINDORACLE(copy(Q), V)
18:     else if upper bound of V and Q ≥ τ then
19:       call FINDORACLE(copy(Q), V)
20:     end if
21:   end if
22:   remove(V, s)
23: end while
24: end
Algorithm 3 shows the pseudocode that enumerates the oracle summaries. The algorithm reads a set of reference summaries R, length limit L_max, and set of sentences D (line 1) and initializes threshold τ with the ROUGE-n score obtained by the greedy algorithm (Algorithm 2). It also initializes O_τ, which stores the oracle summaries whose ROUGE-n score is τ, and priority queue C, which stores the history of the depth-first search (line 2). Next, the algorithm computes the ROUGE-n score for each sentence and stores the sentences in S after sorting them in descending order of the scores. After that, we start a depth-first search by recursively calling procedure FINDORACLE. In the procedure, we extract the top sentence from priority queue Q and append it to priority queue V (lines 11 to 12). When the length of V does not exceed L_max, if ROUGE_n(R, V) is larger than threshold τ (case 1), we update τ with the score and append the current V to O_τ. Then we continue the depth-first search by calling procedure FINDORACLE (lines 15 to 17). If the upper bound of the ROUGE-n score is larger than τ (case 3), we do not update τ and O_τ but re-enter the depth-first search by calling the procedure again (lines 18 to 19). If neither case 1 nor case 3 holds, we delete the last visited sentence from V and return to the top of the recurrence.
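Putting the pieces together, a compact rendering of the enumeration with bound-based pruning might look as follows. It follows the three cases above but is our own simplified sketch, not the paper's exact Algorithm 3 (in particular, we seed τ with 0 instead of the greedy score, and non-empty reference summaries are assumed):

```python
from collections import Counter

def ngrams(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def enumerate_oracles(references, sentences, l_max, n=1, tol=1e-9):
    """Depth-first search with a fractional-knapsack bound; returns the
    best ROUGE-n score and every sentence subset attaining it."""
    ref_counts = [ngrams(r, n) for r in references]
    den = sum(sum(rc.values()) for rc in ref_counts)

    def rouge(tokens):
        counts = ngrams(tokens, n)
        return sum(min(c, counts[g]) for rc in ref_counts
                   for g, c in rc.items()) / den

    def bound(tokens, start, budget):
        # relax the best completion to a fractional knapsack over
        # per-sentence marginal gains (a valid upper bound)
        base = rouge(tokens)
        items = [(rouge(tokens + sentences[i]) - base, len(sentences[i]))
                 for i in range(start, len(sentences))]
        u = base
        for gain, length in sorted((it for it in items if it[0] > 0),
                                   key=lambda gl: gl[0] / gl[1], reverse=True):
            if length <= budget:
                u, budget = u + gain, budget - length
            else:
                u += gain * budget / length
                break
        return u

    tau, oracles = 0.0, []

    def dfs(start, subset, tokens, length):
        nonlocal tau, oracles
        for i in range(start, len(sentences)):
            new_len = length + len(sentences[i])
            if new_len > l_max:
                continue
            cand = tokens + sentences[i]
            score = rouge(cand)
            if score > tau + tol:            # case 1: strictly better score
                tau, oracles = score, []
            if score >= tau - tol:           # tie with the best so far
                oracles.append(tuple(subset + [i]))
            if bound(cand, i + 1, l_max - new_len) >= tau - tol:  # case 3
                dfs(i + 1, subset + [i], cand, new_len)
            # otherwise (case 2) the whole subtree under i is pruned

    dfs(0, [], [], 0)
    return tau, oracles
```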

Experimental Setting
We conducted experiments on the corpora developed for the multiple-document summarization tasks of DUC 2001 to 2007. Table 1 shows the statistics of the data. In particular, the DUC-2005 to -2007 data sets not only have very large numbers of sentences and words but also a long target length (the reference summary length) of 250 words. All the words in the documents were stemmed by Porter's stemmer (Porter, 1980). We computed ROUGE-1 scores excluding stopwords and computed ROUGE-2 scores keeping them. Owczarzak et al. (2012) suggested using ROUGE-1 with stopwords kept. However, as Takamura and Okumura (2009) argued, summaries optimized including non-content words do not reflect actual summary quality. Thus, we excluded stopwords when computing the ROUGE-1 scores.
We enumerated the following two types of oracle summaries: those for a set of references for a given topic and those for each reference in the set of references.
Results and Discussion

Impact of Oracle ROUGE-n Scores

Table 2 shows the average ROUGE-1,2 scores of the oracle summaries obtained from both a set of references and each reference in the set ("multi" and "single"), those of the best conventional system (Peer), and those obtained from summaries produced by the greedy algorithm (Algorithm 2).
Oracle (single) obtained better ROUGE-1,2 scores than Oracle (multi). This implies that it is easier to optimize a summary toward a single reference than toward a set of references. Moreover, the ROUGE-1,2 scores of these oracle summaries are significantly higher than those of the best systems: the best systems attained only 60% to 70% ("multi") and 50% to 60% ("single") of the oracle summaries' ROUGE-1 scores, and only 40% to 55% ("multi") and 30% to 40% ("single") of their ROUGE-2 scores.
Since the systems in Table 2 were developed over many years, we also compared the ROUGE-n scores of the oracle summaries with those of current state-of-the-art systems on the DUC-2004 corpus, using summaries generated by different systems from a public repository. The repository includes summaries produced by the following seven state-of-the-art summarization systems: CLASSY04 (Conroy et al., 2004), CLASSY11 (Conroy et al., 2011), Submodular (Lin and Bilmes, 2012), DPP (Kulesza and Tasker, 2011), RegSum, OCCAMS_V (Davis et al., 2012; Conroy et al., 2013), and ICSISumm. Table 3 shows the results.
Based on the results, RegSum achieved the best ROUGE-1 score (0.331), while ICSISumm (a compressive summarizer) achieved the best ROUGE-2 score (0.098). These systems outperformed the best DUC systems (Peers 65 and 67 in Table 2), but the differences in ROUGE-n scores between these systems and the oracle summaries are still large. More recently, Hong et al. (2015) demonstrated that their system-combination approach achieved the current best ROUGE-2 score, 0.105, for the DUC-2004 corpus. However, a large difference remains between the ROUGE-2 scores of the oracle summaries and the systems.

Table 2: Average ROUGE-1/ROUGE-2 scores for DUC 2001 to 2007
                  01         02         03         04         05         06         07
Oracle (multi)   .400/.164  .452/.186  .434/.185  .427/.162  .445/.177  .491/.211  .506/.236
Oracle (single)  .500/.226  .515/.225  .525/.258  .519/.228  .574/.279  .607/.303  .622/.330
Greedy           .387/.161  .438/.184  .424/.182  .412/.157  .430/.173  .473/.206  .495

In short, the ROUGE-n scores of the oracle summaries are significantly higher than those of the current state-of-the-art summarization systems, both extractive and compressive. These results imply that further improvement of the performance of extractive summarization is possible.
On the other hand, the ROUGE-n scores of the oracle summaries are far from ROUGE-n = 1. We believe this is related to the summaries' compression rate: the data sets' compression rate is only 1 to 2%. Under such tight length constraints, extractive summarization inevitably fails to cover large numbers of the n-grams in the reference summaries. This reveals a limitation of the extractive summarization paradigm and suggests that another direction, compressive or abstractive summarization, is needed to overcome it.

Comparing the ROUGE-n scores of the oracle summaries and the greedy summaries, the greedy summaries achieved near-optimal scores, i.e., their approximation ratios are close to 0.9. These results are surprising since the algorithm's theoretical lower bound is $\frac{1}{2}(1 - \frac{1}{e})(\approx 0.32)\,\mathrm{OPT}$. On the other hand, the results do not mean that the differences between them are small at the sentence level. Table 4 shows the average Jaccard Index between the oracle summaries and the corresponding greedy summaries for the DUC-2004 corpus. The results demonstrate that the oracle summaries are much less similar to the greedy summaries at the sentence level. Thus, it might not be appropriate to use greedy summaries as training data for learning-based extractive summarization systems.

[Table 4: Jaccard Index between oracle and greedy summaries]

Table 5 shows the median number of oracle summaries and the rates of the reference summaries and topics that have multiple oracle summaries for each data set. Over 80% of the reference summaries and about 60% to 90% of the topics have multiple oracle summaries. Since ROUGE-n scores are based on the unweighted counting of n-grams, when many sentences have similar meanings, i.e., many redundant sentences exist, the number of oracle summaries that share the same ROUGE-n score increases. The source documents of multiple-document summarization tasks are prone to contain many such redundant sentences, so the number of oracle summaries is large.

[Table 5: Median number of oracle summaries and rates of reference summaries and topics with multiple oracle summaries for each data set]

Impact of Enumeration
The oracle summaries offer a significant benefit with respect to evaluating the extracted sentences. Since both the oracle and system summaries are sets of sentences, it is easy to check whether each sentence in the system summary is contained in one of the oracle summaries. Thus, we can exploit F-measures, which are useful for evaluating classification-based extractive summarization (Mani and Bloedorn, 1998; Osborne, 2002; Hirao et al., 2002). Here, we have to consider that the oracle summaries obtained from a reference summary or a set of reference summaries are not identical at the sentence level (e.g., the average Jaccard Index between the oracle summaries for the DUC-2004 corpus is around 0.5). The F-measures therefore vary with the oracle summary used for the computation. For example, assume that we have system summary S = {s_1, s_2, s_3, s_4} and oracle summaries O_1 = {s_1, s_2, s_5, s_6} and O_2 = {s_1, s_2, s_3}. The precision for O_1 is 0.5, while that for O_2 is 0.75; the recall for O_1 is 0.5, while that for O_2 is 1; the F-measure for O_1 is 0.5, while that for O_2 is 0.86.
Thus, we employ the scores obtained by averaging over all of the oracle summaries as evaluation measures. Precision, recall, and F-measure are defined as follows:

$$P = \frac{1}{|O_{\text{all}}|}\sum_{O \in O_{\text{all}}}\frac{|O \cap S|}{|S|}, \qquad R = \frac{1}{|O_{\text{all}}|}\sum_{O \in O_{\text{all}}}\frac{|O \cap S|}{|O|}, \qquad \text{F-measure} = \frac{2PR}{P + R}.$$
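The averaged measures can be computed directly; the sketch below (function name is ours) reproduces the worked example from the text:

```python
def averaged_prf(system, oracles):
    """Precision, recall, and F-measure averaged over all enumerated
    oracle summaries O_all."""
    s = set(system)
    p = sum(len(set(o) & s) / len(s) for o in oracles) / len(oracles)
    r = sum(len(set(o) & s) / len(o) for o in oracles) / len(oracles)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# The worked example from the text:
# S = {s1, s2, s3, s4}, O1 = {s1, s2, s5, s6}, O2 = {s1, s2, s3}
p, r, f = averaged_prf({"s1", "s2", "s3", "s4"},
                       [{"s1", "s2", "s5", "s6"}, {"s1", "s2", "s3"}])
```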
To demonstrate the F-measure's effectiveness, we investigated the correlation between the F-measure and human judgment based on the evaluation results obtained from the DUC-2004 corpus. The results include summaries generated by 17 systems, each of which has a mean coverage score assigned by a human subject. We computed the correlation coefficients between the average F-measure and the average mean coverage score for 50 topics. Table 6 shows Pearson's r and Spearman's ρ. In the table, "F-measure (R_1)" and "F-measure (R_2)" indicate the F-measures calculated using oracle summaries optimized for ROUGE-1 and ROUGE-2, respectively. "M" indicates F-measures calculated using multiple oracle summaries, and "S" indicates F-measures calculated using randomly selected single oracle summaries. "multi" indicates oracle summaries obtained from a set of references, and "single" indicates oracle summaries obtained from a single reference summary in the set. For "S," we randomly selected a single oracle summary and calculated the F-measure, repeated this 100 times, and took the average value, with a 95% confidence interval estimated by bootstrap resampling.
The results demonstrate that the F-measures are strongly correlated with human judgment. Their correlation values are comparable with those of ROUGE-1,2. In particular, F-measure (R_1) (single-M) achieved the best Spearman's ρ. When comparing "single" with "multi," Pearson's r of "multi" was slightly lower than that of "single," and Spearman's ρ of "multi" was almost the same as that of "single." "M" performed significantly better than "S." These results imply that F-measures based on oracle summaries are a good evaluation measure and that oracle summaries have the potential to be an alternative to human-made reference summaries for automatic evaluation. Moreover, the enumeration of the oracle summaries for a given reference summary or a set of reference summaries is essential for such automatic evaluation.
To demonstrate the efficiency of our search algorithm against the naive exhaustive search method, we compared the number of feasible solutions (sets of sentences that satisfy the length constraint) with the number of summaries that were actually checked by our search algorithm. Table 7 shows the median number of feasible solutions and checked summaries yielded by our method for each data set (in the "single" setting). The differences in the number of feasible solutions between ROUGE-1 and ROUGE-2 are very large, because the input set (|D|) for ROUGE-1 is much larger than that for ROUGE-2. On the other hand, the differences between ROUGE-1 and ROUGE-2 in our method are of the order of 10 to 10^2. When comparing our method with the naive exhaustive search, its search space is significantly smaller: the differences are of the order of 10^7 to 10^30 for ROUGE-1 and 10^4 to 10^17 for ROUGE-2. These results demonstrate the efficiency of our branch and bound technique.

Conclusions
To analyze the limitations and the future directions of extractive summarization, this paper proposed (1) an Integer Linear Programming (ILP) formulation to obtain extractive oracle summaries in terms of ROUGE-n scores and (2) an algorithm that enumerates all oracle summaries, which enables F-measures that evaluate the sentences extracted by systems.
The evaluation results obtained from the corpora of DUCs 2001 to 2007 identified the following: (1) room still exists to improve the ROUGE-n scores of extractive summarization systems, even though the ROUGE-n scores of the oracle summaries fall below the theoretical upper bound of ROUGE-n = 1.
(2) Over 80% of the reference summaries and from 60% to 90% of the sets of reference summaries have multiple oracle summaries, and the F-measures computed by utilizing the enumerated oracle summaries showed stronger correlations with human judgment than those computed from single oracle summaries.