Which Scores to Predict in Sentence Regression for Text Summarization?

The task of automatic text summarization is to generate a short text that summarizes the most important information in a given set of documents. Sentence regression is an emerging branch in automatic text summarization. Its key idea is to estimate the importance of information via learned utility scores for individual sentences. These scores are then used for selecting sentences from the source documents, typically according to a greedy selection strategy. Recently proposed state-of-the-art models learn to predict ROUGE recall scores of individual sentences, which seems reasonable since the final summaries are evaluated according to ROUGE recall. In this paper, we show in extensive experiments that following this intuition leads to suboptimal results and that learning to predict ROUGE precision scores leads to better results. The crucial difference is to aim not at covering as much information as possible but at wasting as little space as possible in every greedy step.


Introduction
More and more data is generated in textual form in newspapers, social media platforms, and microblogging services and it has become impossible for humans to read, comprehend, and filter all the available data. Automatic summarization aims at mitigating these problems by "taking an information source, extracting content from it, and presenting the most important content to the user in a condensed form and in a manner sensitive to the users or applications needs" (Mani, 2001).
Very prominent in automatic text summarization is the idea of extractive summarization. In extractive summarization, summaries are not generated from scratch. Instead, sentences in the source documents, which are supposed to be summarized, are extracted and concatenated to form a summary. To be able to select sentences in a meaningful manner, it is crucial for the extractive systems to be able to estimate the utility of individual sentences.
Supervised extractive methods are usually modeled in a regression framework. Hence, this subfield of automatic summarization is called sentence regression. The predicted scores are used to generate a ranking of the sentences, and a greedy strategy is often used in combination with additional redundancy avoidance to select sentences which will be added to the iteratively generated summary (Carbonell and Goldstein, 1998). Another method for the selection is solving an integer linear programming (ILP) problem (Gillick et al., 2008; Hong and Nenkova, 2014) which is, however, an NP-hard problem (Filatova and Hatzivassiloglou, 2004). Even though it can be argued that the complexity is not an issue since there are good solvers for ILPs, it remains a problem when large document collections with many sentences have to be summarized or the system should be used on a large scale for many users. The greedy approach is very appealing due to its simplicity and efficiency.
Crucial for building sentence regression models is the choice of the regressand which has to be predicted by the models. Most of the recent works try to predict ROUGE recall scores of individual sentences, which seems to be an obvious choice since the final summaries are also evaluated with ROUGE recall metrics (Lin, 2004; Owczarzak et al., 2012). We show in this paper that following this intuition leads to suboptimal results. In extensive experiments, we investigate sentence regression models with perfect and noisy prediction of different regressand candidates with and without redundancy avoidance. In all experiments, we observe the very same result: learning to predict ROUGE precision scores of sentences leads to better results than learning to predict ROUGE recall scores if the sentences are selected with a greedy algorithm afterwards. Our findings are particularly important for automatic summarization research since the best models currently available are sentence regression models trained to predict ROUGE recall scores. We expect that simply replacing ROUGE recall scores as regressand with ROUGE precision scores can potentially improve these state-of-the-art models further.
We note in passing that the problem is reminiscent of defining heuristics in inductive rule learning: Individual rules are typically evaluated according to their consistency (minimizing the number of false positives) and completeness (maximizing the number of true positives), which loosely correspond to precision and recall (Fürnkranz and Flach, 2005). Heuristics such as weighted relative accuracy, which give equal importance to both dimensions, are successfully used for evaluating single rules in subgroup discovery (Lavrač et al., 2004), but tend to over-generalize when being used for selecting rules for inclusion into a predictive rule set. The reason for this is that a lack of completeness can be repaired by adding more rules, whereas a lack of consistency cannot, so that consistency or precision of individual rules should receive a higher weight in the selection task. Transferred to summarization, this means that space wasted by recall-oriented selection cannot be used anymore whereas a low recall in a partial summary can be repaired by adding more sentences.
In the following, we will first formalize the problem of extractive summarization and outline the greedy selection strategy (Section 2). Previous extractive summarization systems, in particular sentence regression models, are summarized in Section 3. We then present an intuition why predicting ROUGE precision scores can potentially give better results in Section 4. In extensive experiments (Section 5), we then substantiate the previously stated hypothesis that selecting sentences according to ROUGE precision instead of ROUGE recall leads to better results if sentences are selected greedily.

Extractive Summarization
In this section, we will first formally define the problem of extractive summarization and then describe the greedy sentence selection strategy which is used by many prior works.

Problem Definition
The task in extractive summarization is to generate a list of sentences S (the summary) from a given list of input sentences I (the text to summarize). The generated summary S must not be longer than a predefined length l (usually measured in words or characters).
In order to select sentences, both supervised and unsupervised models are used to predict utility scores of sentences in a first phase. In a second phase, sentences are selected and concatenated to build a summary.
For evaluation, the generated summary is typically compared to human-written summaries by automatic means, in many cases by computing so-called ROUGE scores (Lin, 2004).

Greedy Selection Strategy
A popular strategy to select sentences based on the previously predicted utility scores is the greedy sentence selection strategy which is described in Algorithm 1.
Algorithm 1 Greedy Sentence Selection with Redundancy Avoidance in Extractive Summarization
Input: list of all input sentences I = s_1, ..., s_n; utility function u; desired summary length l
1: π = permutation of I s.t. u(s_π(1)) ≥ · · · ≥ u(s_π(n))
2: S ← ∅, i ← 1
3: while |S| < l and i ≤ n do
4:     if sim(s_π(i), S) < θ then
5:         S ← S + s_π(i)
6:     end if
7:     i ← i + 1
8: end while
9: return S
According to the greedy strategy, the sentence with the highest utility score is selected first. After the best sentence has been selected, it is removed from the input list of available sentences, and the former second best sentence is considered next. Redundancy avoidance strategies are used to ensure that sentences with similar contents are not added multiple times to the summary. A simple strategy computes the similarity of the currently best sentence and all already selected sentences. If the maximum similarity exceeds a predefined threshold θ, the summarizer removes the sentence from the input list without adding it to the summary. The selection process is repeated until the desired summary length is reached. Once a decision is made, it is never revised.
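The greedy strategy of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: summary length is assumed to be measured in whitespace-separated words, and a simple word-set Jaccard overlap stands in for the similarity measure sim. The names greedy_select, summary_length, and jaccard are hypothetical. As in the pseudocode, the last added sentence may exceed the length limit, so truncation is left to the evaluation.

```python
from typing import Callable, List, Sequence


def summary_length(summary: Sequence[str]) -> int:
    """Length of a summary measured in whitespace-separated words (assumption)."""
    return sum(len(sentence.split()) for sentence in summary)


def greedy_select(
    sentences: Sequence[str],
    utility: Callable[[str], float],
    similarity: Callable[[str, str], float],
    max_length: int,
    theta: float,
) -> List[str]:
    """Greedy sentence selection with redundancy avoidance (cf. Algorithm 1)."""
    # Rank all input sentences by decreasing utility score.
    ranked = sorted(sentences, key=utility, reverse=True)
    summary: List[str] = []
    for candidate in ranked:
        if summary_length(summary) >= max_length:
            break  # desired summary length reached
        # Maximum similarity to any already selected sentence (0 if none yet).
        max_sim = max((similarity(candidate, s) for s in summary), default=0.0)
        if max_sim < theta:
            summary.append(candidate)  # a decision is never revised
    return summary


def jaccard(a: str, b: str) -> float:
    """Hypothetical stand-in for sim: Jaccard overlap of the word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0
```

With a threshold of θ = 1.0, only sentences whose similarity to an already selected sentence reaches 1.0 under the chosen measure are skipped; lower thresholds filter more aggressively.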

Sentence Regression for Extractive Summarization
After the field of automatic summarization had been dominated by unsupervised extractive summarization models for some time (Carbonell and Goldstein, 1998; Erkan and Radev, 2004; Mihalcea and Tarau, 2004; Li et al., 2006), supervised regression models have become more common in recent years. The crucial difference is that supervised models learn to predict regressands based on training examples in a training phase whereas unsupervised models do not predict regressands. We focus on supervised extractive regression systems in this paper. Comprehensive overviews of automatic summarization (Nenkova and McKeown, 2011; Gambhir and Gupta, 2016; Yao et al., 2017) also cover unsupervised methods in more detail and include abstractive summarization methods which are out of scope for this paper. Extractive sentence regression can be described as the task of learning regressands for individual sentences from examples.
The general learning problem can be formulated as $y_i = u(x_i) + e_i$ where $y_i$ denotes the regressand (also called dependent variable or target variable) of sentence $x_i$ (the regressor, also called independent variable or features), and $e_i$ denotes the $i$th residuum (also called error). Sentence regression aims at learning the utility function u from observed sentence-utility pairs in order to minimize the errors for unseen sentence-utility pairs. Kupiec et al. (1995) proposed one of the first supervised summarization systems, which trains a Bayesian model to predict the probability that a sentence will be included in the summary. They criticized that although a large number of different features had been used in previous unsupervised models, no principled method to select or weight the features had been proposed at that time. Instead of generating summaries, the performance of the model was evaluated based on the classification output of the model for individual sentences. Similarly, Conroy and O'leary (2001) use a Hidden Markov Model to predict the probability that a sentence is included in a reference summary.
The model proposed by Li et al. (2006) already predicts utility scores for individual sentences. The model weights are, however, not learned in a supervised training but assigned by humans. Li et al. (2007) extend this previously proposed unsupervised model and use a support vector regression (SVR) model in the DUC 2007 shared task (Over et al., 2007). Both Li et al. (2006) and Li et al. (2007) use a greedy selection strategy. Instead of learning to predict the probability of appearance of a sentence in a summary (Kupiec et al., 1995; Conroy and O'leary, 2001), Li et al. (2007) use the average and maximum text similarity of candidate sentences and reference summaries as regressands. Ouyang et al. (2011) also applied SVR but used the sum of word probabilities as regressand. Their system therefore also tends to select longer sentences, similarly to systems which use ROUGE recall.
PriorSum (Cao et al., 2015b) follows Li et al. (2007) and presents a linear regression framework which uses prior and document dependent features. As regressand, ROUGE-2 recall is used. Cao et al. (2015a) propose a hierarchical regression process which predicts the importance of sentences based on their constituents. ROUGE-1 recall and ROUGE-2 recall are used as regressands for sentences. For sentence selection, they implement both a greedy selection and a selection based on integer linear programming.
The Redundancy-Aware Sentence Regression (Ren et al., 2016) framework models both importance and redundancy jointly. They train a multi-layer perceptron which then predicts relative importance utilities based on ROUGE-2 recall scores.
REGSUM (Hong and Nenkova, 2014) predicts sentence importance based on word importance and additional features. They use a greedy selection strategy with additional redundancy avoidance which only appends sentences to the summary if the maximum cosine similarity to already selected sentences is lower than a fixed threshold.
We summarize that ROUGE recall is often used in the field of sentence regression in combination with a greedy selection and an additional redundancy avoidance strategy. In the following, we first describe the underlying intuition of using ROUGE recall. Second, we describe why using ROUGE precision instead can be potentially better. Later, we show in the experiments that using ROUGE precision is not only theoretically appealing but also works better in practice than ROUGE recall.

ROUGE Recall vs. ROUGE Precision
The ROUGE metric (Lin, 2004) is the method of choice for the evaluation of generated summaries in the field of automatic summarization. Its idea is to compute the similarity between automatically generated summaries and reference summaries, which are typically provided by humans.
ROUGE can be viewed as an evaluation measure for an information retrieval task in which precision and recall can be measured. Let E be a set of elements, R ⊂ E the multiset of desired elements in the reference output, G ⊂ E the generated output multiset, and |.| the size of a multiset. Then, the recall is defined as

$$\mathrm{Recall} = \frac{|R \cap G|}{|R|} \tag{1}$$

and measures how much of the desired content was returned by the system. On the other hand, precision is defined as

$$\mathrm{Precision} = \frac{|R \cap G|}{|G|} \tag{2}$$

and measures how much of the returned content was actually desirable. We define the intersection ∩ of two multisets as the smallest multiset S with $\sigma_S(e) = \min(\sigma_G(e), \sigma_R(e)) \; \forall e \in G, R$, where $\sigma_S(e)$ indicates the number of appearances of element e in multiset S. In ROUGE-n, the multiset E is defined as the set of all n-grams, the desired reference multiset R contains all n-grams in a reference summary, and the multiset G contains all n-grams in the system summary. We use multisets and not sets since the same n-gram can be contained multiple times in a text.
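The multiset view of precision and recall can be illustrated with a short Python sketch. It is a simplification under stated assumptions: a single reference, pre-tokenized input, and no stemming or stop word handling as performed by the official ROUGE toolkit. The function names ngrams and rouge_n are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of all n-grams contained in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(generated: List[str], reference: List[str], n: int) -> Tuple[float, float]:
    """Return (precision, recall) of the generated tokens w.r.t. one reference."""
    g, r = ngrams(generated, n), ngrams(reference, n)
    # Counter '&' is the multiset intersection: the minimum count per element.
    overlap = sum((g & r).values())
    precision = overlap / sum(g.values()) if g else 0.0
    recall = overlap / sum(r.values()) if r else 0.0
    return precision, recall


if __name__ == "__main__":
    sentence = "the cat sat".split()
    reference = "the cat sat on the mat".split()
    print(rouge_n(sentence, reference, 1))  # (1.0, 0.5)
    print(rouge_n(sentence, reference, 2))  # (1.0, 0.4)
```

The short sentence in the usage example is fully covered by the reference (precision 1.0) although it covers only part of it (recall 0.5 and 0.4), which is exactly the distinction the two measures are meant to capture.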
When ROUGE was first introduced as the evaluation metric for the DUC 2003 shared task (Over et al., 2007), Lin and Hovy (2003) reported that metrics based on ROUGE recall scores have a good agreement with human judgments. A summary with a high ROUGE recall will contain many n-grams which also appear in the reference summaries. Owczarzak et al. (2012) showed that ROUGE-2 recall is the best variant (highest agreement with human judgments) of ROUGE recall if automatically generated summaries have to be evaluated. ROUGE-2 recall is therefore often used to evaluate automatic summarization systems.
Crucial for the use of ROUGE recall is the length limitation of the generated summaries. Usually, the generated summaries are limited to a fixed number of words or characters. Without such a length restriction, systems would be able to generate arbitrarily long texts to increase the recall.
Figure 1: Exemplary illustration of selecting sentences according to precision and recall. The target summary has 5 slots. Sentence A will be selected according to recall since it has a recall score of 0.6 whereas sentences B and C only have a recall score of 0.4. Sentence A, however, already occupies all available slots in the summary. No more sentences can be selected. Sentence B will be selected first according to precision due to a precision score of 1.0. After the selection of sentence B, 3 slots are still available in the summary, which can be used to fit sentence C to improve the overall summary recall to 0.8.
Summarization systems aim at maximizing ROUGE recall scores of the generated summaries, since the final summaries are evaluated with ROUGE recall. Greedy extractive summarization approaches try to maximize the overall ROUGE recall of a summary by incrementally adding sentences with a high ROUGE recall to the summary. The idea of this strategy is to pack as much important content as possible into the summary in every step in order to increase the ROUGE recall of the resulting summary. What is usually not considered is the fact that this strategy tends to select longer sentences, since longer sentences tend to have a higher recall. They, however, can contain proportionally more unimportant information, for example in subordinate clauses. As a result, fewer sentences can be selected since the maximum length of the summary is reached earlier.
An alternative strategy, which has not been discussed in the literature so far, is to select sentences according to their ROUGE precision scores. The idea behind this approach is not to cover as much information as possible but to waste as little space as possible. Selecting sentences according to precision will not have a bias for longer sentences but for short and dense sentences. Since this strategy tends to select shorter sentences, more sentences can be included in the summary, which can, in turn, result in a higher ROUGE recall of the resulting summary. Figure 1 shows an example in which selecting sentences according to ROUGE precision leads to a higher ROUGE recall score of the resulting summary than selecting sentences according to ROUGE recall. In the following section, we will show that the intuition described in this section is not only appealing in theory, but can also be substantiated in empirical experiments.
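The scenario of Figure 1 can be replayed with a small Python sketch. The concrete tokens are assumptions chosen only to reproduce the numbers in the caption: the reference has 5 unigram slots, sentence A has 5 tokens of which 3 match, sentence B has 2 matching tokens, and sentence C is assumed to have 3 tokens of which 2 match. Redundancy avoidance is omitted, and all names are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List

# Toy data (assumed): 'r*' tokens appear in the reference, 'x*' tokens do not.
REFERENCE: List[str] = ["r1", "r2", "r3", "r4", "r5"]
SENTENCES: Dict[str, List[str]] = {
    "A": ["r1", "r2", "r3", "x1", "x2"],  # recall 0.6, precision 0.6
    "B": ["r1", "r2"],                    # recall 0.4, precision 1.0
    "C": ["r3", "r4", "x3"],              # recall 0.4, precision 2/3
}
SLOTS = 5  # length limit of the target summary


def overlap(tokens: List[str]) -> int:
    """Size of the multiset intersection with the reference."""
    return sum((Counter(tokens) & Counter(REFERENCE)).values())


def recall(tokens: List[str]) -> float:
    return overlap(tokens) / len(REFERENCE)


def precision(tokens: List[str]) -> float:
    return overlap(tokens) / len(tokens)


def greedy(score: Callable[[List[str]], float]) -> List[str]:
    """Select sentence names greedily by the given score until the slots are used."""
    chosen: List[str] = []
    used = 0
    for name in sorted(SENTENCES, key=lambda k: score(SENTENCES[k]), reverse=True):
        if used >= SLOTS:
            break
        chosen.append(name)
        used += len(SENTENCES[name])
    return chosen


def summary_recall(names: List[str]) -> float:
    """Recall of the concatenated summary, cut to the slot limit."""
    tokens = [t for name in names for t in SENTENCES[name]][:SLOTS]
    return recall(tokens)


if __name__ == "__main__":
    for label, score in (("recall", recall), ("precision", precision)):
        picked = greedy(score)
        print(label, picked, summary_recall(picked))
```

Under these assumptions, ranking by recall picks only sentence A (summary recall 0.6), whereas ranking by precision picks B and then C (summary recall 0.8).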
We summarize that selecting sentences according to ROUGE precision scores can, intuitively, be better than selecting sentences according to ROUGE recall scores even though the final summaries are always evaluated with ROUGE recall metrics.

Experimental Setup
We now present the experimental setups in which we test different regressand candidates for sentence regression on three different, well-known multi-document summarization (MDS) corpora. We used the MDS corpora from the DUC 2004, TAC 2008, and TAC 2009 summarization shared tasks. All corpora contain 10 input documents and 4 reference summaries for each topic. The numbers of topics are 50, 46, and 44, respectively. In the experiments, we simulate the outcomes of regression models which use different regressands. This will provide us with theoretical insights on which regressand candidates should be considered in regression models and will answer the main question of this paper: Which scores to predict in sentence regression for text summarization? For our experiments, we produce summaries containing 665 characters for DUC2004 and summaries containing 100 words for TAC2008 and TAC2009.

Regressand Candidates
The key ingredient of greedy extractive summarization is the utility function u(.), which is used for sorting the sentences in the first step of Algorithm 1. In this paper, we examine 7 different regressand candidates (in boldface) which can be used as regressands when the utility function u is learned via supervised regression.
ROUGE-1 recall (R1 Rec) and ROUGE-2 recall (R2 Rec) are computed according to Equation 1 for all sentences in the input documents. ROUGE-n recall counts the n-gram overlap of the input sentence and the reference summaries. The more n-grams in the reference documents are covered by a sentence, the higher the score is. These regressands are usually used by prior sentence regression works.
We also compute the ROUGE-1 precision (R1 Prec) and ROUGE-2 precision (R2 Prec) for all sentences according to Equation 2. A sentence has a high ROUGE-n precision if a high rate of n-grams in the sentence match with n-grams in the reference documents. Sentences with a high density of matching n-grams are therefore preferred by ROUGE precision. The main claim of this paper is that ROUGE precision scores should be primarily considered in sentence regression works instead of ROUGE recall scores. We therefore expect that R1 Prec and R2 Prec will perform better than R1 Rec and R2 Rec.
As a reference point, we compute for each sentence the maximum similarity (maxADW) and the average similarity (avgADW) with all sentences in the reference summaries (denoted by list S) according to a state-of-the-art ADW similarity measure (Pilehvar et al., 2013). ADW computes the semantic similarity of two sentences by finding an optimal alignment of word senses contained in the two sentences.
Computing the maximum similarity aligns with the idea that a good sentence in the input documents matches well with one sentence in the reference summary. A sentence is representative for the whole summary if it has a high average similarity with all the reference summary sentences. For each sentence, we also randomly generated (random) sentence scores which are used as regressand.
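The two aggregation schemes and the random baseline can be sketched in Python as follows. ADW itself is not reproduced here; a word-set Jaccard overlap is used purely as a hypothetical stand-in for the similarity measure, and all function names are assumptions made for illustration.

```python
import random
from typing import Callable, List

Similarity = Callable[[str, str], float]


def word_overlap(a: str, b: str) -> float:
    """Stand-in similarity (NOT ADW): Jaccard overlap of the word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def max_similarity(sentence: str, references: List[str], sim: Similarity) -> float:
    """High if the sentence matches well with one reference sentence."""
    return max(sim(sentence, r) for r in references)


def avg_similarity(sentence: str, references: List[str], sim: Similarity) -> float:
    """High if the sentence is representative for all reference sentences."""
    return sum(sim(sentence, r) for r in references) / len(references)


def random_score(rng: random.Random) -> float:
    """Random regressand in [0, 1), used as a baseline."""
    return rng.random()
```

A sentence that is identical to one of two reference sentences and unrelated to the other receives a maximum similarity of 1.0 but an average similarity of only 0.5 under this stand-in measure.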

Optimal Prediction without Redundancy Avoidance
In the first experiment, we investigate how helpful the predicted scores are under the assumption that the regressand candidates can be predicted perfectly. The experiment therefore shows how a system will perform in the optimal case. We do not consider redundancy avoidance strategies in this experiment so that observed performance differences are solely due to differences in the used regressand candidates. The results of the experiment are shown in Table 1. It can be seen that in all corpora the use of ROUGE-1 precision regressands of the sentences leads to better results than using ROUGE-1 recall regressands if ROUGE-1 recall is used as evaluation metric for the final summary. Analogous results can be observed for ROUGE-2 scores. This indicates that using ROUGE recall as regressand in a sentence regression framework is not very promising. Thus, the results are a first confirmation of the previously described intuition that predicting precision scores can be better than predicting recall scores. Table 2 provides details about the lengths of the produced summaries according to number of stems and number of sentences. The hypothesis that an algorithm that selects sentences according to recall tends to select longer sentences (stated in Section 4) is confirmed. The results therefore also confirm that longer sentences tend to have a higher recall.
In addition to the standard DUC and TAC corpora, we also report results for 2 German datasets, namely the DBS corpus (Benikova et al., 2016) and a subset of the German part of the auto-hMDS
Table 2: Averaged lengths of resulting summaries measured in number of stems (avg. stems) and number of sentences (avg. sentences). D04 refers to DUC2004 and T08 and T09 refer to TAC2008 and TAC2009, respectively. We also count partially contained sentences which have been cut by the ROUGE length limitation.

Optimal Prediction of F-Scores
The previous experiment clearly showed that selecting sentences according to ROUGE precision outperforms a selection according to ROUGE recall. In this experiment, we will evaluate if a tradeoff between recall and precision can lead to even better results. It is, e.g., known that in inductive rule learning, parametrized measures such as the m-estimate, which may be viewed as a trade-off between precision and weighted relative accuracy, can be tuned to outperform its constituent heuristics (Janssen and Fürnkranz, 2010). In retrieval tasks, the F-measure provides a more commonly used trade-off between precision and recall, so we chose to use this measure for our experiments.
We compute for all sentences the F-measure with 0 ≤ α ≤ 1 as

$$F_\alpha = \frac{1}{\alpha \cdot \frac{1}{P} + (1 - \alpha) \cdot \frac{1}{R}}$$

where P and R denote the ROUGE precision and recall of the sentence, α = 0 is equivalent to recall, and α = 1 equals precision. The results of the experiment, which are displayed in Figure 2, show that precision (α = 1.0) is already close to the optimum but that also incorporating a small fraction of recall (α ≈ 0.9) leads to the best results, which indicates that a slight bias towards longer sentences can improve the result even further. A possible explanation is that there are short sentences in the input documents which are considerably redundant to other high precision sentences. However, the overall trend in the results (increasing evaluation scores with increasing α, which means increasing impact of ROUGE precision) substantiates the general hypothesis of this paper, namely that sentence selection measures should target precision instead of recall.
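The trade-off can be sketched in Python, assuming the F-measure takes the common weighted harmonic mean form 1 / (α/P + (1 − α)/R), which reproduces the endpoints stated above (recall for α = 0, precision for α = 1). The function name f_alpha and the handling of zero scores are assumptions made for this illustration.

```python
def f_alpha(precision: float, recall: float, alpha: float) -> float:
    """Weighted harmonic mean of precision and recall (assumed form).

    alpha = 0 yields recall, alpha = 1 yields precision.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if alpha == 1.0:
        return precision
    if alpha == 0.0:
        return recall
    if precision == 0.0 or recall == 0.0:
        return 0.0  # convention: no overlap means a score of zero
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


if __name__ == "__main__":
    # Sweep alpha from recall (0.0) to precision (1.0) for one example sentence.
    for step in range(11):
        alpha = step / 10
        print(alpha, round(f_alpha(0.8, 0.3, alpha), 3))
```

Sweeping α in this way for every sentence and then selecting greedily by the resulting score mirrors the setup of the experiment, with α controlling how strongly short and dense sentences are preferred.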

Optimal Prediction with Redundancy Avoidance
Summarization systems usually apply a redundancy avoidance strategy in order to avoid including the same information multiple times in the summary. In this experiment, we investigate whether incorporating a simple redundancy avoidance strategy will lead to different results.
During the greedy selection process, we compute the similarity of the currently highest scoring sentence and all already selected sentences (see Algorithm 1, line 4). The highest scoring sentence will be skipped if the maximum similarity of the sentence and the already selected sentences is higher than a predefined threshold θ. We use the state-of-the-art ADW similarity measure to compute the similarities and test the quality of the generated summaries as in the previous experiments with ROUGE-1 and ROUGE-2 recall. The results of the experiment for the thresholds θ = 0.4, 0.5, . . . , 1.0 are displayed in Figure 3.
We see that sentence selection using ROUGE-1/2 precision scores (red and blue solid lines) consistently leads to better results than with ROUGE-1/2 recall scores (red and blue dashed lines) for all chosen redundancy thresholds. Selecting according to maximum ADW similarity leads to consistently better results than selecting according to the average ADW similarity. This indicates that it is better to search for sentences which align well with a part of the summary than selecting sentences which align relatively well with the whole summary. The best results are achieved with thresholds of θ = 0.5 and θ = 0.6 which worked well for both ROUGE-1 and ROUGE-2 recall in both datasets.

Noisy Predictions
In the previous experiments, we showed the results of a greedy summarizer which selects sentences according to perfectly predicted scores. Summarization systems are, however, not capable of predicting the scores perfectly. In the next experiment, we will therefore investigate whether imperfect predictions have an influence on our results. This will also give insights about the robustness of a greedy summarizer in the presence of imprecise predictions.
In order to get model-independent results, we simulate imperfect predictions by adding two different kinds of noise, namely additive uniformly distributed continuous noise U(a, b) and additive Gaussian noise N(µ, σ²). For the uniform noise U(a, b), we test boundaries from a = −0.2, b = 0.2 to a = −0.4, b = 0.4. For Gaussian noise, we use mean µ = 0 and variance σ² ∈ {0.05, 0.1, 0.2}. Based on the results in the previous section, we fix the redundancy threshold to 0.6 in this experiment. Due to the random noise, the experiments are no longer deterministic. We therefore run each experiment 10 times and report averaged results.
The results of these experiments (see Table 4) confirm that predicting ROUGE precision is always better than predicting ROUGE recall in the presence of different kinds of noise and different noise intensities. In case strong Gaussian noise is applied (  summaries decreases more strongly if ROUGE-2 precision scores are predicted, which means that predicting ROUGE-1 precision might be better than predicting ROUGE-2 precision in the case of low prediction quality.
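The noise simulation can be sketched with Python's standard random module. This is an illustrative sketch only: the helper names and the run_once callback are hypothetical, and the callback is assumed to perturb the scores, build a summary, and return its evaluation score.

```python
import random
from typing import Callable, List


def add_uniform_noise(scores: List[float], a: float, b: float,
                      rng: random.Random) -> List[float]:
    """Add independent U(a, b) noise to every perfectly predicted score."""
    return [s + rng.uniform(a, b) for s in scores]


def add_gaussian_noise(scores: List[float], variance: float,
                       rng: random.Random, mean: float = 0.0) -> List[float]:
    """Add independent N(mean, variance) noise to every score."""
    sigma = variance ** 0.5  # random.gauss expects the standard deviation
    return [s + rng.gauss(mean, sigma) for s in scores]


def average_over_runs(run_once: Callable[[random.Random], float],
                      runs: int = 10, seed: int = 0) -> float:
    """Repeat a non-deterministic experiment and report the averaged result."""
    rng = random.Random(seed)
    return sum(run_once(rng) for _ in range(runs)) / runs
```

Note that random.gauss takes the standard deviation rather than the variance, so variances such as 0.05, 0.1, and 0.2 have to be converted with a square root before sampling.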

Conclusions
Current state-of-the-art sentence regression systems for automatic summarization learn to predict ROUGE recall scores of individual sentences and apply a greedy sentence selection strategy in order to generate summaries. We show in a wide range of experiments that this design choice leads to suboptimal results. In all experiments, we observed the same pattern. The resulting summaries will have a lower quality if ROUGE recall scores for sentences are used instead of ROUGE precision, no matter whether or not redundancy avoidance is considered and whether or not the scores can be predicted perfectly.
In an experiment where we combined both ROUGE recall and ROUGE precision with an F-score computation, we confirmed the previously described observation that the quality of summaries tends to improve with a growing ratio of ROUGE precision vs. ROUGE recall, with a maximum performance for a ratio of α ≈ 0.9. Biasing the sentence selection slightly to longer sentences is therefore promising. This is in line with an often applied pre-processing step in which very short sentences are discarded without further analysis (Erkan and Radev, 2004; Cao et al., 2015b).
We also presented an intuition why a selection according to ROUGE precision leads to better results. A system which selects according to ROUGE recall will tend to select longer sentences, since longer sentences tend to have a higher recall. We conclude that, instead of iteratively fitting as much as possible into a summary, systems should rather aim at wasting as little space as possible in every step.
For future work, it is very simple to incorporate the findings presented in this paper. Instead of learning to predict ROUGE recall scores, the regressand can simply be exchanged and the ROUGE precision can be used instead. Based on the findings in this paper, we expect that the models will benefit from this modification. We furthermore conclude that comparisons between ILP and greedy methods (Cao et al., 2015a) are biased in favor of ILP. A better comparison is possible if precision scores are used as input for greedy systems instead of recall scores.