Objective Function Learning to Match Human Judgements for Optimization-Based Summarization

Supervised summarization systems usually rely on supervision at the sentence or n-gram level provided by automatic metrics like ROUGE, which act as noisy proxies for human judgments. In this work, we learn a summary-level scoring function \theta including human judgments as supervision and automatically generated data as regularization. We extract summaries with a genetic algorithm using \theta as a fitness function. We observe strong and promising performances across datasets in both automatic and manual evaluation.


Introduction
The task of extractive summarization can naturally be cast as a discrete optimization problem where the text source is considered as a set of sentences and the summary is created by selecting an optimal subset of the sentences under a length constraint (McDonald, 2007). This view entails defining an objective function which is to be maximized by some optimization technique. In the ideal case, this objective function would encode all the relevant quality aspects of a summary, such that by maximizing all these quality aspects we would obtain the best possible summary.
However, we find several issues with the objective function in previous work on optimizationbased summarization. First, the choice of the objective function is based on ad-hoc assumptions about which quality aspects of a summary are relevant (Kupiec et al., 1995). This bias can be mitigated via supervised techniques guided by data. In practice, these approaches use signals at the sentence (Conroy and O'leary, 2001;Cao et al., 2015) or n-gram Li et al., 2013) level and then define a combination function to estimate the quality of the whole summary (Carbonell and Goldstein, 1998;Ren et al., 2016). This combination θ determines the trade-off between conflicting quality aspects (importance vs redundancy) encoded in the objective function by making simplistic assumptions to ensure convenient mathematical properties of θ like linearity or submodularity (Lin and Bilmes, 2011). This restriction comes from computational considerations without conceptual justifications. More importantly, the supervision signal comes from automatic metrics like ROUGE (Lin, 2004) which are convenient but noisy approximations for human judgment.
In this work, we propose to learn the objective function θ at the summary-level from a pool of manually annotated system summaries to ensure the extraction of summaries considered good by humans. This explicitly targets the extraction of high-quality summaries as measured by humans and limits undesired gaming of the target evaluation metric. However, the number of data points is relatively low and the learned θ might not be well-behaved (high θ scores for bad summaries) pushing the optimizer to explore regions of the feature space unseen during training where θ wrongly assumes high scores. To prevent this scenario, we rely on a large amount of noisy but automatic training data providing supervision on a larger span of the feature space. Intuitively, it can be viewed as a kind of regularization.
By defining θ directly at the summary-level, one has access to features like redundancy or global information content without the need to define a combination function from individual sentence scores. Any feature available at the sentence or n-gram level can be transferred to the summarylevel (by summation), while the summary-level perspective provides access to new features capturing the interactions between sentences. Furthermore, recent works have demonstrated that global optimization using genetic algorithms without im-posing any mathematical restrictions on θ is feasible (Peyrard and Eckle-Kohler, 2016).
In summary, our contributions are: (1) We propose to learn a summary-level scoring function θ and use human judgments as supervision. (2) We demonstrate a simple regularization strategy based on automatic data generation to improve the behavior of θ under optimization. (3) We perform both automatic and manual evaluation of the extracted summaries, which indicate competitive performances.

Learning setup
Let θ * be the observed human judgments. θ * can be manual Pyramid (Nenkova et al., 2007) or overall responsiveness on a 0 to 5 LIKERT scale. We learn a function θ w with parameters w approximating θ * based on a feature set Φ. Φ(S) ∈ R d is the feature representation of a summary S.
Let T be the set of topics in the training set, and S T the set of scored summaries for the topic T . The learning problem consists in minimizing the following loss function: While any regression algorithm could be applied, we observed strong performances for the simple linear regression. It is particularly simple and not prone to overfitting.

Automatic data generation
Few annotated summaries are available (50 per topic) and they cover a small region of the feature space (low variability). θ may wrongly assume high scores in some parts of the feature space despite lack of evidence. The optimizer will explore these regions and output low-quality summaries.
To address this issue, we generate summaries distributed across the feature space. For each feature x, we sample a set of k = 100 summaries covering the range of possible values of x. For sampling, we use the genetic algorithm recently introduced by Peyrard and Eckle-Kohler (2016). 1 Their solver implements a Genetic Algorithm (GA) to create and iteratively optimize summaries over time. We use default values for the reproduction and mutation rate and set the population size to 50. With x as fitness function, the resulting population is a set of summaries ranging from random to (close to) maximal value. After both maximization and minimization, we obtain 100 summaries covering the full range of x.
In total, we sample m · k summaries per topic, where m is the number of features. We score these summaries with ROUGE-2 recall (R2), which is a noisy approximation of human judgments but provides indications preventing bad regions from getting high scores.

Summary Extraction
We trained 3 different scoring functions: θ pyr with manual pyramid annotations; θ resp with responsiveness annotations; and θ R2 with our automatically generated data. 2 The final scoring function is a linear combination: Therefore θ R2 acts as a regularizer for the θ's learned with human judgments. 3 It is a simple form of model averaging which combine the different information of the 3 different models.
We didn't constrain θ to have specific properties like linearity with respect to sentence scores, thus extracting high scoring summaries cannot be done with Integer Linear Programming. Instead, we search an approximate solution by employing the same meta-heuristic solver we used for sampling with θ as the fitness function.

Features
Learning a scoring function at the summary-level gives us access to both n-gram/sentence-level features and summary-level features. Sentence-level features can be transferred to the summary-level, while new features capturing the interactions between sentences in the summary become available.
As sentence-level features, we used the standard: TF*IDF, n-gram frequency and overlap with the title. As new summary-level features, we used: number of sentences, summary-level redundancy and summary-level n-gram distributions: Jensen-Shannon (JS) divergence with n-gram distribution in the source (Louis and Nenkova, 2013).
N-gram Coverage. Each n-gram g i in the documents has a frequency tf (g i ), the summary S is scored by: Here S n is the multiset of n-grams (with repetitions) composing S. Also, the frequency can be computed either by counting the number of occurrence of the n-gram or by counting the number of documents in which the n-gram appears. For both frequency computations, we extract features for unigrams, bigrams and trigrams.
TF*IDF. Each n-gram g i is also associated its Inverse Document Frequence: idf (g i ). The summary S is scored by: Here S n is the multiset of n-grams (with repetitions) composing the summary S. We also extract features for both frequency computations for unigrams, bigrams and trigrams.
Overlap with title. We measure the proportion of n-grams from the title that appear in the summary: Overlap n (S) = |T n ∩ S n | T n Where T n is the multiset of n-grams in the title, and S n is the multiset of n-grams in the summary. We compute it for unigrams, bigrams and trigrams.
Number of sentences. We also use the number of sentences in S as a feature because summaries with a lot of sentences tend to have very short and meaningless sentences.
Redundancy. Previous features were at the sentence-level, we obtained features for the whole summary by summation over sentences. However, the redundancy of S cannot be computed at the sentence-level. This is an example of features available at the summary-level but not available at the sentence-level. We define it as the number of unique n-gram types (|U n |) in the summary divided by the total number of n-gram tokens (the length of S) Red n (S) = |U n | |S n | Where U n is the set of n-grams (without repetitions) composing S and S n is the multiset of ngrams (with repetitions).
Divergences. This is another feature that can only be computed at the summary-level inspired by Haghighi and Vanderwende (2009) and Peyrard and Eckle-Kohler (2016). We compute the KL divergence and JS divergence between n-gram probability distributions of the summaries and of the documents. The probability distributions are built from the two kinds of frequency distributions and for unigrams, bigrams and trigrams.

Experiments
Dataset We use two multi-document summarization datasets from the Text Analysis Conference (TAC) shared tasks: TAC-2008 andTAC-2009. 4 TAC-2008 and TAC-2009 contain 48 and 44 topics, respectively. Each topic consists of 10 news articles to be summarized in a maximum of 100 words. We use only the so-called initial summaries (A summaries), but not the update part. We used these datasets because all system summaries and the 4 reference summaries were manually evaluated by NIST assessors for content selection (with Pyramid) and overall responsiveness. At the time of the shared tasks, 57 systems were submitted to TAC-2008 and 55 to TAC-2009. For our experiments, we use the Pyramid and the responsiveness annotations.
With our notations, for example with TAC-2009, we have n = 55 scored system summaries, m = 44 topics, D i contains 10 documents and θ i contains 4 reference summaries.
We also use the recently created German dataset DBS (Benikova et al., 2016) which contains 10 heterogeneous topics. For each topic, 5 summaries were evaluated by trained human annotators but only for content selection with Pyramid. The summaries have variable sizes and are about 500 words long.
Baselines (1) ICSI (Gillick and Favre, 2009) is a global linear optimization approach that extracts a summary by solving a maximum coverage problem considering the most frequent bigrams in the source documents. ICSI has been among the best systems in a standard ROUGE evaluation . (2) (4) Finally, SFOUR is a supervised structured prediction approach that trains an end-to-end on a convex relaxation of ROUGE (Sipos et al., 2012).
Objective function learning In this section, we measure how well our models can predict human judgments. We train each θ in a leave-one-out cross-validation setup for each dataset and compare their performance to the summary scoring function of baselines like it was done previously (Peyrard and Eckle-Kohler, 2017). Each individual feature is also included in the baselines. Correlations are measured with two complementary metrics: Spearman's ρ and Normalized Discounted Cumulative Gain (NDCG). Spearman's ρ is a rank correlation metric, which compares the ordering of systems induced by θ and the ordering of systems induced by human judgments. NDCG is a metric that compares ranked lists and puts more emphasis on the top elements with logarithmic decay weighting. Intuitively, it captures how well θ can recognize the best summaries. The optimization scenario benefits from high NDCG scores because only summaries with high θ scores are extracted.
The results are presented in Table 1. For simplicity, we report the average over the 3 datasets. Each θ is compared against the best performing baseline for the data annotation type it was trained on (R2, responsiveness or pyramid). 5 The trained models perform substantially and consistently bet-5 Best baseline for R2 and Responsiveness is: KL divergence on bigrams; for Pyramid: KL divergence on unigrams ter than the best baselines. They have a high correlation with human judgments and are capable of identifying good summaries.
However, we need to test whether the combination of the three θ's is well behaved under optimization. For this, we perform an evaluation of the summaries extracted by the genetic optimizer.
Summaries Evaluation Now, we evaluate the summaries extracted by the genetic optimizer with θ as fitness function (noted (θ, Gen)). We still train θ with leave-one-out cross-validation.
To evaluate summaries, we report the ROUGE variant identified by Owczarzak et al. (2012) as strongly correlating with human evaluation methods: ROUGE-2 (R2) recall with stemming and stopwords not removed. We also report JS2, the Jensen-Shannon divergence between bigrams in the reference summaries and the candidate system summary (Lin et al., 2006). The last metric is S3 , a combination of several existing metrics trained explicitly to maximize its correlation with human judgments.
Finally, our approach aims at improving summarization systems based on human judgments, therefore we also set up a manual evaluation for the two English datasets. Two annotators were given the summaries of every system for 10 randomly selected topic of both TAC-2008 andTAC-2009. They annotated (with a Cohen's kappa of 0.73) summaries on a LIKERT scale following the responsiveness guidelines.
The results are reported in Table 2. We perform significance testing with Approximate Random Testing to compare differences between two means in cross-validation 6 .
While θ's trained on human judgments have a high correlation with human judgments, they behave badly under optimization. This effect is much less visible for θ R2 because the data points have been sampled to cover the feature space. We observe the effectiveness of the regularization because each θ R2/pyr/resp performs much worse individually than the combined θ. We also note that (θ R2 , Gen) performs on par with the other supervised baseline SFOUR but both are outperformed by exploiting human judgments. (θ, Gen) is consistently and often significantly better than baselines across datasets and metrics. In particular, humans tend to prefer the summaries extracted by TAC-2008  (θ, Gen). Manual inspection of summaries reveals that (θ, Gen) has lower redundancy than previous baselines thanks to summary-level features.

Important Features
Since we used a linear regression, we can estimate the contribution of a feature by the amplitude of its associated weight. The two best features (n-gram distributions and redundancy) are summary-level features, which confirms the advantage of using a summary-level scoring function.

Related Work and Discussion
Supervised summarization started with Kupiec et al. (1995) who observed that there is no principled method to select and weight relevant features. Previous work focused on predicting sentence (Conroy and O'leary, 2001;Cao et al., 2015) or n-gram Li et al., 2013) scores and then defining a composition function to get a score for the summary. This combination usually accounts for redundancy or coherence (Nishikawa et al., 2014) in an ad-hoc fashion (Carbonell and Goldstein, 1998;Ren et al., 2016). Structure prediction has been investigated to learn the composition function as well (Sipos et al., 2012;Takamura and Okumura, 2010). The supervision is always provided by automatic metrics, whereas we incorporate human judgments as supervision and learn from it directly at the summary-level. We note that He et al. (2006) and Peyrard and Eckle-Kohler (2016) have used a scoring function at the summary-level but these approaches are unsupervised.
One of the challenges we face is the lack of data with human judgments. We hope that this work will encourage efforts to create new and large datasets as they will be decisive for the progress of summarization. Indeed, systems trained only with automatic metrics can only be as good as the metrics are as a proxy for humans.
We used simple features but using more complex and semantic features is promising. Indeed, two syntactically similar but semantically different summaries cannot be distinguished by ROUGE, which diminishes the usefulness of semantic features. However, humans can distinguish them, thus inducing better usage of such features.
Another promising direction is to investigate more sophisticated ways of combining the human judgments with the automatically generated data. For example, by exploiting techniques from semisupervised learning (Zhu et al., 2009) or by dynamically sampling unseen regions of the feature space with active learning (Settles, 2009).

Conclusion
We proposed an approach to learn a summarylevel scoring function θ with human judgments as supervision and automatically generated data as regularization. The summaries subsequently extracted with a genetic algorithm are of high quality according to both automatic and manual evaluation. We hope this work will encourage more research directed towards the generation and usage of human judgment datasets.