Learning to Score System Summaries for Better Content Selection Evaluation.

The evaluation of summaries is a challenging but crucial task of the summarization field. In this work, we propose to learn an automatic scoring metric based on the human judgments available as part of classical summarization datasets like TAC-2008 and TAC-2009. Any existing automatic scoring metric can be included as a feature; the model learns the combination exhibiting the best correlation with human judgments. The reliability of the new metric is tested in a further manual evaluation where we ask humans to evaluate summaries covering the whole scoring spectrum of the metric. We release the trained metric as an open-source tool.


Introduction
The task of automatic multi-document summarization is to convert source documents into a condensed text containing the most important information. In particular, the question of evaluation is notably difficult due to the inherent lack of a gold standard.
The evaluation can be done manually by involving humans in the process of scoring a given system summary. For example, with the Responsiveness metric, human annotators score summaries on a Likert scale ranging from 1 to 5. Later, the Pyramid scheme was introduced to evaluate content selection with high inter-annotator agreement (Nenkova et al., 2007).
Manual evaluations are meaningful and reliable but are also expensive and not reproducible. This makes them unfit for systematic comparison.
Due to the necessity of having cheap and reproducible metrics, a significant body of research was dedicated to the study of automatic evaluation metrics. Automatic metrics aim to produce a semantic similarity score between the candidate summary and a pool of reference summaries previously written by human annotators (Lin, 2004; Yang et al., 2016; Ng and Abrecht, 2015). Some variants rely only on the source documents and the candidate summary, ignoring the reference summaries (Louis and Nenkova, 2013; Steinberger and Ježek, 2012).
In order to select the best automatic metric, we typically consider manual evaluation metrics as our gold standard: a good automatic metric should reliably predict how well a summarizer would perform if human evaluation was conducted (Owczarzak et al., 2012; Lin, 2004; Rankel et al., 2013).
In practice, we use human judgment datasets like the ones constructed during the manual evaluation of the Text Analysis Conference (TAC). The system summaries submitted to the shared tasks were manually scored by trained human annotators following the Responsiveness and/or the Pyramid schemes. An automatic metric is considered good if it ranks the system summaries similarly to how humans did.
Currently, ROUGE (Lin, 2004) is the accepted standard for automatic evaluation of content selection because of its simplicity and its good correlation with human judgments. However, previous works on evaluation metrics comparison averaged scores of summaries over topics for each system and then computed the correlation with averaged scores given by humans. ROUGE works well in this scenario which compares only systems after aggregating their scores for many summaries. We call this scenario system-level correlation analysis.
A more natural analysis, which we use in this work, is to compute the correlation between the candidate metric and human judgments for each topic individually and then average these correlations over topics. In this scenario, which we call summary-level correlation analysis, the performance of ROUGE drops significantly, meaning that on average ROUGE does not really identify summary quality; it can only rank systems after aggregation over many topics.
In order to advance the field of summarization we need to have more consistent metrics correlating well with humans on every topic and capable of estimating the quality of individual summaries (not just systems).
We propose to rely on human judgment datasets to learn an automatic scoring metric. The learned metric presents the advantage of being explicitly trained to exhibit high correlation with the "gold-standard" human judgments at the summary level (and not just at the system level). The setup is also convenient because any already existing automatic metric can be incorporated as a feature and the model learns the best combination of features matching human judgments.
We should worry whether the learned metric is reliable. Indeed, typical human judgment datasets (like the ones from TAC-2008 or TAC-2009) contain manual scores only for a limited number of system summaries, which cover a limited range of quality. We conduct a manual evaluation specifically designed to test the metric across its whole scoring spectrum.
To summarize our contributions: We performed a summary-level correlation analysis to compare a large set of existing evaluation metrics. We learned a new evaluation metric as a combination of existing ones to maximize the summary-level correlation with human judgments. We conducted a manual evaluation to test whether learning from available human judgment datasets yields a reliable metric across its whole scoring spectrum.

Related Work
Automatic evaluation of content has been the subject of a lot of research. Many automatic metrics have been developed and we present here some of the most important ones.
ROUGE (Lin, 2004) simply computes the n-gram overlap between a system summary and a pool of reference summaries. It has become a de facto standard metric because of its simplicity and high correlation with human judgments at the system level. Afterwards, Ng and Abrecht (2015) extended ROUGE with word embeddings. Instead of hard lexical matching of n-grams, ROUGE-WE uses soft matching based on the cosine similarity of word embeddings.
Recently, a line of research aimed at creating strong automatic metrics by automating the Pyramid scoring scheme (Harnly et al., 2005). Yang et al. (2016) proposed PEAK, a metric where the components requiring human input in the original Pyramid annotation scheme are replaced by state-of-the-art NLP tools. It is more semantically motivated than ROUGE and approximates the manual Pyramid scores well, but it is computationally expensive, making it difficult to use in practice.
Some other metrics do not make use of the reference summaries; they compute a score based only on the candidate summary and the source documents (Lin et al., 2006; Louis and Nenkova, 2013). One representative of this class is the Jensen-Shannon (JS) divergence, an information-theoretic measure comparing system summaries and source documents through their underlying probability distributions of n-grams. JS divergence is simply the symmetric version of the well-known Kullback-Leibler (KL) divergence (Haghighi and Vanderwende, 2009).
Little work has been done on the topic of learning an evaluation metric. Conroy and Dang (2008) previously investigated the performance of ROUGE metrics in comparison with human judgments and proposed ROSE (ROUGE Optimal Summarization Evaluation), a linear combination of ROUGE metrics to maximize correlation with human responsiveness. We also look for a combination of features which correlates well with human judgments but, in contrast to Conroy and Dang (2008), we include a wider set of metrics: ROUGE scores, other evaluation metrics (like Jensen-Shannon divergence) and features typically used by summarization systems. Hirao et al. (2007) also proposed a related approach. They used a voting-based regression to score summaries with human judgments as gold standard. Our setup is different because we train and evaluate our metric with the summary-level correlation analysis instead of the system-level one. Our experiments are done on multi-document datasets whereas they use single-document datasets. Finally, we also perform a further manual evaluation to test the metric outside of its training domain.

Approach
Let a dataset $D$ contain $m$ topics. A given topic $t_i$ consists of a set of documents $D_i$, a set of reference summaries $\theta_i$, a set of $n$ system summaries $S_i$ and the scores given by humans to the $n$ summaries of $S_i$, noted $R_i$. We note $s_{i,j}$ the $j$-th summary of the $i$-th topic and $r^h_{i,j}$ the score it received from manual evaluation:

$$S_i = [s_{i,1}, \dots, s_{i,n}], \qquad R_i = [r^h_{i,1}, \dots, r^h_{i,n}] \tag{1}$$

An automatic evaluation metric is a function taking as input a document set $D_i$, a set of reference summaries $\theta_i$ and a candidate system summary $s$, and outputting a score. For simplicity, we note $\sigma(D_i, \theta_i, s) = \sigma_i(s)$ the score of $s$ as a summary of the $i$-th topic according to some scoring metric $\sigma$.
We search for an automatic scoring function $\sigma$ such that $\sigma_i(s_{i,j})$ correlates well with the manual scores $r^h_{i,j}$. The final score can be computed at the system level, by first aggregating scores over topics and then computing the correlation, or at the summary level, by computing the correlation for each topic and then averaging over topics. We briefly present the difference between the two in the following paragraphs.
System-level correlation Let $K$ be any correlation metric operating on two lists of scored elements, then the system-level correlation is computed by the following formula:

$$K^{sys}_{avg} = K\left(\left[\sum_{i} \sigma_i(s_{i,1}), \dots, \sum_{i} \sigma_i(s_{i,n})\right], \left[\sum_{i} r^h_{i,1}, \dots, \sum_{i} r^h_{i,n}\right]\right) \tag{2}$$

Both terms in $K$ are lists of size $n$. The scores for the summaries of the $l$-th summarizer are aggregated to form the $l$-th element of the lists. The correlation is computed on the two aggregated lists. Therefore, $K^{sys}_{avg}$ only indicates whether the evaluation metrics can rank systems correctly after aggregation of many summary scores, but it ignores individual summaries. It has been used before because evaluation metrics were initially tasked to compare systems.
Summary-level correlation Instead, we advocate for the summary-level correlation which is computed by the following formula:

$$K^{sum}_{avg} = \frac{1}{m} \sum_{i=1}^{m} K\left(\left[\sigma_i(s_{i,1}), \dots, \sigma_i(s_{i,n})\right], \left[r^h_{i,1}, \dots, r^h_{i,n}\right]\right) \tag{3}$$

Here, we compute the correlation between human judgments and automatic scores for each topic and then average the correlation scores over topics. This measures how well evaluation metrics correlate with human judgments for summaries and not only for systems, which is important in order to have a finer-grained understanding. From now on, when we refer to correlation with human judgments we will refer to the summary-level correlation.
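The difference between the two analyses can be illustrated with a minimal Python sketch. It assumes scores are stored as nested lists indexed by topic and then by system, uses Pearson's r as the correlation K, and aggregates by summing over topics; the function names and the toy data layout are illustrative assumptions, not part of the released tool.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's r between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def system_level(metric, human, corr=pearson):
    """Aggregate each system's scores over topics, then correlate once."""
    # metric[i][j] and human[i][j]: score of system j on topic i (assumed layout)
    agg_metric = [sum(column) for column in zip(*metric)]
    agg_human = [sum(column) for column in zip(*human)]
    return corr(agg_metric, agg_human)


def summary_level(metric, human, corr=pearson):
    """Correlate within each topic, then average the correlations over topics."""
    per_topic = [corr(m, h) for m, h in zip(metric, human)]
    return sum(per_topic) / len(per_topic)
```

On toy data where a metric swaps the two weaker systems within every topic, the aggregated lists can still line up perfectly while the per-topic correlations do not, which is the effect discussed above.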
Correlation metrics There exist many possible choices for K. As different correlation metrics measure different properties, we use three complementary metrics: Pearson's r, Spearman's ρ and Normalized Discounted Cumulative Gain (Ndcg).
Pearson's r is a value correlation metric which depicts linear relationships between the scores produced by the automatic metric and the human judgments.
Spearman's ρ is a rank correlation metric which compares the ordering of systems induced by the automatic metric and the ordering of systems induced by human judgments.
Ndcg is a metric that compares ranked lists and puts more emphasis on the top elements by logarithmic decay weighting. Intuitively, it captures how well the automatic metric can recognize the best summaries.
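As an illustration of the two rank-oriented measures, the following Python sketch implements Spearman's ρ (Pearson's r on average ranks) and one common formulation of Ndcg. The exact Ndcg variant is an assumption here: summaries are ranked by the automatic metric, human scores serve as gains, and the discount is log2(position + 1).

```python
from math import log2, sqrt


def _pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            result[k] = mean_rank
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return _pearson(ranks(x), ranks(y))


def ndcg(metric_scores, human_scores):
    """DCG of the human scores in metric order, divided by the ideal DCG."""
    def dcg(gains):
        return sum(g / log2(pos + 2) for pos, g in enumerate(gains))

    pairs = sorted(zip(metric_scores, human_scores), key=lambda p: -p[0])
    ideal = dcg(sorted(human_scores, reverse=True))
    return dcg([h for _, h in pairs]) / ideal if ideal else 0.0
```

Because the discount decays with position, a metric that places the best human-rated summaries first is rewarded even if the tail of its ranking is noisy.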

Features
The choice of features is a crucial part of every learning setup. Here, we can benefit from the large amount of previous works studying signals of summary quality. We can classify these signals in three categories.
First, any existing automatic scoring metric can be a feature. These metrics use the candidate summary and the reference summary to output a score.
The second category contains the previous summarization systems having an explicit formulation of summary quality. These systems can implicitly score any summary; they then extract the summary with maximal score via optimization techniques (Gillick and Favre, 2009; Haghighi and Vanderwende, 2009). Optimization-based systems have recently become popular (McDonald, 2007). Such features score the candidate summary based only on the document sources and the summary itself.
The last category contains the metrics producing a score based only on the summary. Examples of such metrics include readability or redundancy.
Clearly, features using reference summaries (existing automatic metrics) are expected to be more useful for our task. However, it has been shown that some metrics of the second category (like JS divergence) also contain useful signal to approximate human judgments (Louis and Nenkova, 2013). Therefore, we use features coming from all three categories expecting that they are sensitive to different properties of a good summary.
We considered only features cheap to compute in order to deliver a simple and efficient tool. We now briefly present the selected features.
Features using reference summaries ROUGE-N (Lin, 2004) computes the n-gram overlap between the candidate summary and the pool of reference summaries. We include as features the variants identified by Owczarzak et al. (2012) as strongly correlating with humans: ROUGE-2 recall with stemming and stopwords not removed (giving the best agreement with human evaluation), and ROUGE-1 recall (the measure with the highest ability to identify the better summary in a pair of system summaries).
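A simplified sketch of ROUGE-N recall is given below. It assumes whitespace tokenization and lowercasing, and omits the stemming and the jackknifing over references performed by the official ROUGE toolkit, so its values will not match the toolkit exactly.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[k:k + n]) for k in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, references, n=2):
    """Clipped n-gram matches divided by the number of reference n-grams."""
    cand = ngrams(candidate.lower().split(), n)
    matched = total = 0
    for ref in references:
        ref_counts = ngrams(ref.lower().split(), n)
        # Each reference n-gram is matched at most as often as the candidate has it.
        matched += sum(min(count, cand[gram]) for gram, count in ref_counts.items())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0
```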
ROUGE-L (Lin, 2004) considers each sentence of the candidate and reference summaries as sequences of words (after stemming). It interprets the longest common subsequence between sentences as a similarity measure. An overall score for the candidate summary is given by combining the scores of individual sentences. One advantage of using ROUGE-L is that it does not require consecutive matches but in-sequence matches reflecting sentence-level word order.
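The core of ROUGE-L is the longest common subsequence (LCS) between two token sequences. The sketch below shows only this sentence-pair building block with a recall normalization; the combination of sentence scores into a summary-level score, and the stemming step, are left out and would follow Lin (2004).

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_recall(candidate_sentence, reference_sentence):
    """LCS length divided by the reference length (whitespace tokens)."""
    cand = candidate_sentence.lower().split()
    ref = reference_sentence.lower().split()
    return lcs_length(cand, ref) / len(ref) if ref else 0.0
```

Unlike bigram matching, the LCS rewards "police kill the gunman" over "the gunman kill police" against the reference "police killed the gunman", because only the former keeps the reference word order.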
JS divergence measures the dissimilarity between two probability distributions. In summarization, it was also used to compare the n-gram probability distribution of a summary and the source documents (Louis and Nenkova, 2013), but here we employ it for comparing the n-gram probability distribution of the candidate summary with that of the reference summaries. Thus, it yields an information-theoretic measure of the dissimilarity between the candidate summary and the reference summaries.
If $\theta_i$ is the set of reference summaries for the $i$-th topic, then we compute the following score:

ROUGE-WE (Ng and Abrecht, 2015) is the variant of ROUGE-N replacing the hard lexical matching by a soft matching based on the cosine similarity of word embeddings. We use ROUGE-WE-1 and ROUGE-WE-2 as part of our features.
FrameNet-based metrics ROUGE-WE proposes a statistical approach (word embeddings) to alleviate the hard lexical matching of ROUGE. We also include a linguistically motivated one. We replace all nouns and verbs of the reference and candidate summaries with their FrameNet (Baker et al., 1998) frames. This frame annotation is done with the best-performing system configuration from Hartmann et al. (2017) pre-trained on all FrameNet data. It assigns a frame to a word based on the word itself and the surrounding context in the sentence.
Frames are more abstract than words, thus different but related words might be associated with the same frames depending on the meaning of the words in the respective context. ROUGE-N can now match related words through their frames. We also use the unigram and bigram variants (Frame-N).
Semantic Vector Space Similarities In general, automatic evaluation metrics comparing system summaries with reference summaries propose a kind of semantic similarity between summaries. Finding a good automatic evaluation metric is hard because the task of textual semantic similarity is challenging. With the development of word embeddings (Mikolov et al., 2013), several semantic similarities have arisen exploiting the inherent similarities built into vector space models. We include one such metric: $AVG_{SIM}$, the cosine similarity between the average word embeddings of the system summary and the reference summaries. To reduce noise, we exclude stopwords.
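A minimal sketch of an averaged-embedding similarity follows. The embedding table is a plain dictionary from token to vector, and the tokens of all reference summaries are assumed to be pooled into one list; both choices, like the function names, are illustrative assumptions rather than the configuration used in the experiments.

```python
from math import sqrt


def average_vector(tokens, embeddings, stopwords=frozenset()):
    """Mean embedding of the known, non-stopword tokens (None if there are none)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings and t not in stopwords]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def avg_sim(candidate_tokens, reference_tokens, embeddings, stopwords=frozenset()):
    """Cosine similarity between the two averaged embeddings."""
    u = average_vector(candidate_tokens, embeddings, stopwords)
    v = average_vector(reference_tokens, embeddings, stopwords)
    if u is None or v is None:
        return 0.0
    return cosine(u, v)
```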
Features using document sources are inspired by existing summarization systems: TF*IDF comes from the seminal work of Luhn (1958). Each sentence in the summary is scored according to the TF*IDF of its terms. The score of the summary is the sum of the scores of its sentences. We computed the versions based on unigrams and bigrams (TF*IDF-N).
N-gram Coverage is inspired by the strong summarizer ICSI (Gillick and Favre, 2009). Each n-gram in the summary is scored with the frequency it has in the source documents. The final score of the system summary is the sum of the scores of its n-grams. We also use the variants based on unigrams and bigrams (Cov-N).
KL and JS measure the KL or JS divergence between the word distributions in the summary and source documents. We use as features both KL and JS based on unigram and bigram distributions (KL-N and JS-N).
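The two divergences can be sketched as follows, using maximum-likelihood n-gram distributions and base-2 logarithms (so that JS lies between 0 and 1). No smoothing is applied, which is an assumption of this sketch: `kl` requires the second distribution to cover the support of the first, as is the case for an extractive summary compared with its source documents.

```python
from collections import Counter
from math import log2


def distribution(tokens, n=1):
    """Maximum-likelihood n-gram distribution of a token list."""
    counts = Counter(tuple(tokens[k:k + n]) for k in range(len(tokens) - n + 1))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def kl(p, q):
    """KL(p || q); assumes q[g] > 0 wherever p[g] > 0 (no smoothing)."""
    return sum(pv * log2(pv / q[g]) for g, pv in p.items() if pv > 0)


def js(p, q):
    """Jensen-Shannon divergence: symmetrized KL against the mixture of p and q."""
    support = set(p) | set(q)
    mix = {g: 0.5 * (p.get(g, 0.0) + q.get(g, 0.0)) for g in support}
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)
```

Since the mixture is positive wherever either input is, JS is always defined, which is one practical reason to prefer it over raw KL when the two texts do not share their vocabulary.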
Features using the candidate summary only Finally, we also include a redundancy metric based on n-gram repetition in the summary. It is the number of unique n-grams divided by the total number of n-grams in the summary. We also use unigrams and bigrams (Red-N).
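The redundancy feature can be written in a few lines. The sketch assumes the summary is already tokenized; the function name is illustrative.

```python
def red_n(summary_tokens, n=1):
    """Number of unique n-grams divided by the total number of n-grams."""
    grams = [tuple(summary_tokens[k:k + n]) for k in range(len(summary_tokens) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0
```

A value of 1 means no n-gram is repeated; lower values indicate more repetition in the summary.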

Model
For a given topic $t_i$, let $\phi$ be the function taking as input a document set $D_i$, a set of reference summaries $\theta_i$ and a system summary $s$, and outputting the set of features described earlier. We note $\phi(D_i, \theta_i, s) = \phi_i(s)$ the feature set representing $s$ as a summary of topic $i$.
We aim to learn a function $\sigma_\omega$ with parameters $\omega$ scoring summaries similarly to how humans would. If $\sigma_\omega(\phi_i(s))$ is the score given by the learned metric to the summary $s$, we look for the set of parameters $\omega$ which maximizes the summary-level correlation defined by equation 3. It means we are trying to solve the following problem:

$$\arg\max_{\omega} \frac{1}{m} \sum_{i=1}^{m} K\left(\left[\sigma_\omega(\phi_i(s_{i,1})), \dots, \sigma_\omega(\phi_i(s_{i,n}))\right], \left[r^h_{i,1}, \dots, r^h_{i,n}\right]\right) \tag{5}$$

We can approach this problem either with a learning-to-rank or with a regression framework. Learning-to-rank seems well suited because it captures the fact that we are interested in ranking summaries; however, we selected the regression approach in order to keep the model simple. It solves a different but closely related problem: the regression finds the parameters predicting the scores closest to the ones given by humans. We use an off-the-shelf implementation of Support Vector Regression (SVR) from scikit-learn (Pedregosa et al., 2011).
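To make the regression objective concrete without external dependencies, the sketch below fits a plain least-squares linear scorer by batch gradient descent. This is explicitly a stand-in: the experiments use SVR from scikit-learn, not this model, and the learning rate, number of epochs and feature layout are assumptions chosen for the toy example.

```python
def fit_linear(features, targets, lr=0.5, epochs=5000):
    """Least-squares linear regression by batch gradient descent (stand-in for SVR)."""
    dim = len(features[0])
    n = len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, targets):
            error = sum(w * xi for w, xi in zip(weights, x)) + bias - y
            for d in range(dim):
                grad_w[d] += error * x[d] / n
            grad_b += error / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


def score(model, x):
    """Score a feature vector with a fitted (weights, bias) model."""
    weights, bias = model
    return sum(w * xi for w, xi in zip(weights, x)) + bias
```

Each row of `features` plays the role of a feature vector for one summary and each target the role of its human score; once fitted, `score` ranks unseen summaries, which is what the correlation analysis then evaluates.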

Experiments
We conducted both automatic and manual testing of the learned metric. We present here the datasets and results of the experiments.

Datasets
We use two multi-document summarization datasets from the Text Analysis Conference (TAC) shared tasks: TAC-2008 and TAC-2009. TAC-2008 and TAC-2009 contain 48 and 44 topics, respectively. Each topic consists of 10 news articles to be summarized in a maximum of 100 words. We use only the so-called initial summaries (A summaries), but not the update part.
For each topic, there are 4 human reference summaries. In both editions, all system summaries and the 4 reference summaries were manually evaluated by NIST assessors for readability, content selection (with Pyramid) and overall responsiveness. At the time of the shared tasks, 57 systems were submitted to TAC-2008 and 55 to TAC-2009. For our experiments, we use the Pyramid and the responsiveness annotations.
With our notations, for example with TAC-2009, we have $n = 55$ scored system summaries, $m = 44$ topics, $D_i$ contains 10 documents and $\theta_i$ contains 4 reference summaries.
We also use the recently created German dataset DBS-corpus (Benikova et al., 2016). It contains 10 topics consisting of 4 to 14 documents each. The summaries have variable sizes and are about 500 words long. For each topic, 5 summaries were evaluated by trained human annotators but only for content selection with Pyramid.
We experiment with this dataset because it contains heterogeneous sources (different text types) in German about the educational domain. This contrasts with the English homogeneous news documents from TAC-2008 and TAC-2009. Thus, we can test our technique in a different summarization setup.

Correlation Analysis
Baselines Each feature presented earlier is evaluated individually. Indeed, they all produce scores for summaries, meaning we can measure their correlation with human judgments. Classical evaluation metrics, like ROUGE-N variants, are therefore also included in this analysis and serve as baselines. Identifying which metrics have high correlation with human judgments constitutes an initial feature analysis.
Most of the features do not need language dependent information, except those requiring word embeddings or frame identification based on a frame inventory. We do not include the frame identification features when experimenting with the German DBS-corpus. However, for the other language dependent features, we used the German word embeddings developed by Reimers et al. (2014). For the English datasets, we use dependency-based word embeddings (Levy and Goldberg, 2014).
The performances of the baselines on TAC-2008 and TAC-2009 are displayed in Table 1, and Table 2 depicts scores for the DBS-corpus. In order to have an insightful view, we report the scores for the three correlation metrics presented in the previous section: Pearson's r, Spearman's ρ and Ndcg.
Feature Analysis There are fewer scored summaries per topic in the DBS-corpus (5 compared to 55 in TAC-2009). Shorter ranked lists generally have higher scores, which explains the overall higher correlation scores in the DBS-corpus. It also contains longer summaries (500 words compared to 100 words for TAC), which provides a reason behind the better performances of JS features. Indeed, word frequency distributions are more representative for longer texts.
First, we see that classical evaluation metrics like ROUGE-N have lower correlation when computed at the summary level. Here the correlations are around 0.60 Spearman's ρ while they often surpass 0.90 in the system-level scenario (Lin, 2004).
However, the experiments confirm that ROUGE-N, especially ROUGE-2, is strong when compared to other available metrics. Even the more semantically motivated metrics like ROUGE-N-WE or Frame-N (ROUGE-N enriched with frame annotations) cannot outperform the simple ROUGE-N. The added semantic information might be too noisy to really give improvements. Simple lexical comparison still seems to be better for evaluation of summaries.
Interestingly, it is the other simple evaluation metric, $JS_{ref}$-N, which competes with ROUGE-N. This metric only compares the distribution of n-grams in the reference summaries with the distribution of n-grams in the candidate summary, and it outperforms ROUGE-N for Pearson's r. However, ROUGE-N still outperforms $JS_{ref}$-N for Ndcg. This indicates that $JS_{ref}$-N can be complementary to ROUGE-N even though it was rarely used for evaluation before.
Finally, we observe that the features not using the reference summaries perform poorly. This is troubling because these are the strategies used by classical summarization systems in order to decide which summary to extract. Still, they have Ndcg scores higher than 0.5, meaning they can decently identify some of the best summaries, which explains why these systems can produce good summaries.
Our Models For each dataset, we trained two models. The first model ($S^3_{full}$, for Supervised Summarization Scorer) uses all the available features for training. However, the previous feature analysis revealed that some features are poor. We hypothesized that they might harm the learning process. Therefore, we trained a second model, $S^3_{best}$, using only 6 of the best features. We normalize human scores so that every topic has the same mean.
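One simple way to give every topic the same mean is to center the human scores within each topic, as in the sketch below. The exact normalization is not spelled out in the text, so mean-centering (every topic ends up with mean zero) is an assumption made for illustration.

```python
def center_per_topic(human_scores):
    """Subtract each topic's mean so that all topics share the same mean (zero)."""
    centered = []
    for topic_scores in human_scores:
        mean = sum(topic_scores) / len(topic_scores)
        centered.append([score - mean for score in topic_scores])
    return centered
```

Centering leaves the ranking of summaries within a topic untouched, so it does not change any per-topic rank correlation; it only removes topic-specific offsets from the regression targets.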
Both models are trained and tested in a leave-one-out cross-validation scenario, ensuring proper testing of the approach. The results for TAC-2008 and TAC-2009 are presented in Table 1 while the results for the DBS-corpus are in Table 2. For comparison, we also added the correlation between Pyramid and responsiveness when both annotations are available.
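The cross-validation loop can be sketched as a generator of train/test splits. Holding out one whole topic at a time is an assumption of this sketch (it matches the per-topic correlation analysis, but the text only says leave-one-out), and `topics` is a hypothetical list with one entry per topic.

```python
def leave_one_topic_out(topics):
    """Yield (training topics, held-out topic) pairs, one per topic."""
    for held_out in range(len(topics)):
        train = [topic for k, topic in enumerate(topics) if k != held_out]
        yield train, topics[held_out]
```

A model is fitted on `train` for each split and its summary-level correlation is measured on the held-out topic; averaging over splits gives the reported score.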
Model analysis As expected, we observe that using the restricted set of non-noisy features gives stronger results. $S^3_{best}$ is the best metric and outperforms the classical ROUGE-N. Thanks to the combination of ROUGE-N and $JS_{ref}$-N, it gets the best of both worlds and has consistent performances across datasets and correlation measures. It especially benefits from the complementarity of ROUGE and $JS_{ref}$.
While the improvements are sometimes good, they are not dramatic. Bigger and more diverse training data should give further improvements. With a better training set, it might not even be necessary to manually remove the noisy features, as the model will learn when to ignore which features.

Percentage of failure
By analysing the average correlation between the different metrics and human judgments over all topics, we only get an average overview. It would be useful to estimate the number of topics on which a metric fails or works. One could plot cumulative distribution graphs where the x-axis is the correlation range (from 0 to 1 in absolute values) and the y-axis indicates the number of topics on which the metric's correlation with humans was above the given x point. However, this would require 360 plots (3 datasets * 20 metrics * 6 correlation measures) which would not be readable.
Instead, we define a threshold for each correlation measure and count the percentage of topics for which the metric's correlation with humans was below the threshold. The threshold value is an indicator of when the metric fails to correctly model human judgments on a given topic. We chose: 0.65 for Pearson's r, 0.55 for Spearman's ρ and 0.85 for Ndcg. The values are chosen arbitrarily but in order to get a meaningful picture: if the threshold is too low, all metrics are always above it; if it is too high, all metrics are always below it. We report the scores for the set of best features and our best metric $S^3_{best}$ on TAC datasets in Table 3.

Table 3: Percentage of topics for which the correlation between the metric and human judgments is below the chosen thresholds for TAC-2008 and TAC-2009.
-1-WE .7083 .7708 .1042 .6875 .6875 .4583 .5455 .7500 .2500 .4318 .5682 .2955
ROUGE-2-WE .6667 .8333 .1667 .6667 .8333 .6458 .5455 .7727 .2500
We observe that our metric performs well and has a low percentage of failure. It again exhibits its robustness across different correlation measures. We also observe the strong performances of $JS_{ref}$, especially the unigram version; however, it fails completely for Ndcg, which indicates that it consistently has problems identifying the very best summaries even though its overall correlation is good. Again, this confirms that our metric benefits from the complementarity of $JS_{ref}$ and ROUGE because ROUGE performs well with Ndcg.
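The failure count described in this section reduces to a short function. The sketch takes the per-topic correlations of one metric under one correlation measure and returns the fraction of topics below the threshold; the dictionary of thresholds repeats the values chosen above, while the names are illustrative.

```python
# Thresholds chosen in the text for each correlation measure.
THRESHOLDS = {"pearson": 0.65, "spearman": 0.55, "ndcg": 0.85}


def failure_rate(per_topic_correlations, threshold):
    """Fraction of topics whose correlation with humans is below the threshold."""
    below = sum(1 for c in per_topic_correlations if c < threshold)
    return below / len(per_topic_correlations)
```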

Manual annotation
Our models are trained with human judgment datasets constructed during the shared tasks, meaning that only some system summaries and the 4 reference summaries have been evaluated by humans. Systems have a limited range of quality as they rarely propose excellent summaries, and bad summaries are usually due to unrelated errors (like empty summaries). This is a concern because our learned metric will certainly perform well in this quality range, but it should also perform well outside of this range. It has to be capable of correctly recognizing the new and better summaries that will be proposed by future systems.
As the learning is constrained to a specific quality range, we need to check that the whole scoring spectrum of the metric correlates well with humans. We check that what is considered upper-bound (resp. random) by the metric is also considered as excellent (resp. bad) by humans.
Annotation setup We collect summaries by employing a meta-heuristic solver introduced recently for extractive MDS by Peyrard and Eckle-Kohler (2016). Specifically, we use the tool published with their paper. Their meta-heuristic solver implements a Genetic Algorithm to create and iteratively optimize summaries over time. In this implementation, the individuals of the population are the candidate solutions, which are valid extractive summaries. Each summary is represented by a binary vector indicating for each sentence in the source document whether it is included in the summary or not. The size of the population is a hyper-parameter that we set to 100. Two evolutionary operators are applied: the mutation and the reproduction. A mutation is applied to several randomly chosen summaries: it randomly removes one sentence of the summary and adds a new one that does not violate the length constraint. The reproduction is performed by randomly extracting a valid summary from the union of sentences of randomly selected parent summaries. Both operators are controlled by hyper-parameters which we set to their default values.
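The mutation operator on the binary-vector representation can be sketched as follows. This is an illustrative reading of the description above, not the code of the published tool: sentence lengths in words, the `max_words` budget and the function names are all assumptions.

```python
import random


def summary_length(mask, sentence_lengths):
    """Total number of words of the sentences selected by the binary mask."""
    return sum(length for bit, length in zip(mask, sentence_lengths) if bit)


def mutate(mask, sentence_lengths, max_words, rng=random):
    """Drop one selected sentence, then add a different one that fits the budget."""
    child = list(mask)  # the parent mask is left unchanged
    included = [k for k, bit in enumerate(child) if bit]
    removed = None
    if included:
        removed = rng.choice(included)
        child[removed] = 0
    budget = max_words - summary_length(child, sentence_lengths)
    candidates = [
        k for k, bit in enumerate(child)
        if not bit and k != removed and sentence_lengths[k] <= budget
    ]
    if candidates:
        child[rng.choice(candidates)] = 1
    return child
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible; the reproduction operator would be written in the same style by sampling sentences from the union of two parent masks until the length budget is reached.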
We use our metric $S^3_{best}$ as the fitness function and, after the algorithm converges, the final population is a set of summaries ranging from almost random to almost upper-bound. For 15 topics of TAC-2009, we automatically selected 10 summaries of various quality from the final population and asked two humans to score them following the guidelines used during DUC and TAC for assessing responsiveness. To select the summaries, we ranked them according to their $S^3_{best}$ scores and for a population of 100 we picked 10 evenly spaced summaries (the first, the tenth and so on). We observe an inter-annotator agreement of 0.74 Cohen's κ. The results are displayed in Table 4 where $S^3_{best}$ is compared to the best baseline (ROUGE-2) and $S^3_{full}$. The $S^3_{best}$ metric gets correlation scores with human judgments consistent with those it had with responsiveness in the previous experiments (on TAC-2009, for responsiveness, $S^3_{best}$ has 0.7386 Pearson's r, 0.5952 Spearman's ρ and 0.9015 Ndcg). It is a strong indicator that the metric is reliable even outside of its training domain. It also outperforms ROUGE-2 in this experiment.

Table 4: Correlation of automatic metrics with human judgments across the whole scoring spectrum of $S^3_{best}$.
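The selection of evenly spaced summaries can be sketched as below. Ranking from best to worst and taking every (population size / k)-th element starting from the top is one reading of the procedure; the exact offsets used for the annotation are an assumption.

```python
def evenly_spaced(population, scores, k=10):
    """Rank summaries by score (best first) and keep k evenly spaced ones."""
    ranked = [
        summary
        for _, summary in sorted(zip(scores, population), key=lambda p: p[0], reverse=True)
    ]
    step = max(1, len(ranked) // k)
    return ranked[::step][:k]
```

With a population of 100 and k = 10, this keeps the summaries at ranks 1, 11, 21 and so on, so the annotated sample spans the whole range from near upper-bound to near random.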

Discussion
The experiments showed that even semantically motivated metrics struggle to outperform ROUGE-N. Instead, the simple $JS_{ref}$ and ROUGE-N, which use only n-grams, are the best baselines. Reporting these two metrics together might be more insightful than simply reporting ROUGE-N because they are complementary. Our learned metric benefits from this complementarity to achieve its scores.
However, finding a good evaluation metric for summarization is a challenging task which is still not solved. We proposed to tackle this problem by learning the metric to approximate human judgments with a regression framework. A learning-to-rank approach could give stronger results because it might be easier to rank summaries. Even after normalization, human scores are noisy and topic-dependent. We expect ranking to be more transferable from one topic to another. Here, we constrained ourselves to a simple approach in order to provide a user-friendly tool, and the regression offered a simple and effective solution.
Our experiments revealed that the available human judgment datasets are somewhat limited. While it is possible to learn a reliable combination of existing metrics, one would need better and bigger human judgment datasets to really get strong improvements. In particular, it is important to extend the coverage of these datasets because we rely on them to compare evaluation metrics. These annotations are the key to understanding what humans consider to be good summaries. Statistical analysis on such datasets will likely be beneficial to develop both evaluation metrics and summarization systems (Peyrard and Eckle-Kohler, 2017). The metric was evaluated on English news datasets and on a German dataset of heterogeneous sources, but a wider study might be needed in order to measure the generalization of the learned metric to other datasets and domains. Such generalization capabilities would be interesting because one would not need to re-train a new metric for every domain.
We believe it is important to develop evaluation metrics correlating well with human judgments at the summary-level. This gives a more insightful and reliable metric. If the metric is reliable enough, one can use it as a target to train supervised summarization systems (Takamura and Okumura, 2010;Sipos et al., 2012) and approach summarization as a principled machine learning task.

Conclusion
We presented an approach to learn an automatic evaluation metric correlating well with human judgments at the summary level. The metric is a combination of existing automatic scoring strategies learned via regression. We release the metric as an open-source tool. We hope this study will encourage more work on learning evaluation metrics and improving the human judgment datasets. Better human judgment datasets will be greatly beneficial for improving both evaluation metrics and summarization systems.