SUM-QE: a BERT-based Summary Quality Estimation Model

We propose SUM-QE, a novel Quality Estimation model for summarization based on BERT. The model addresses linguistic quality aspects that are only indirectly captured by content-based approaches to summary evaluation, without involving comparison with human references. SUM-QE achieves very high correlations with human ratings, outperforming simpler models addressing these linguistic aspects. Predictions of the SUM-QE model can be used for system development, and to inform users of the quality of automatically produced summaries and other types of generated text.


Introduction
Quality Estimation (QE) is a term used in machine translation (MT) to refer to methods that measure the quality of automatically translated text without relying on human references (Bojar et al., 2016, 2017). In this study, we address QE for summarization. Our proposed model, SUM-QE, successfully predicts linguistic qualities of summaries that traditional evaluation metrics fail to capture (Lin, 2004; Lin and Hovy, 2003; Papineni et al., 2002; Nenkova and Passonneau, 2004). SUM-QE predictions can be used for system development, to inform users of the quality of automatically produced summaries and other types of generated text, and to select the best among summaries output by multiple systems.
SUM-QE relies on the BERT language representation model (Devlin et al., 2019). We use a pre-trained BERT model adding just a task-specific layer, and fine-tune the entire model on the task of predicting linguistic quality scores manually assigned to summaries. The five criteria addressed are given in Figure 1. We provide a thorough evaluation on three publicly available summarization datasets from NIST shared tasks, and compare the performance of our model to a wide variety of baseline methods capturing different aspects of linguistic quality. SUM-QE achieves very high correlations with human ratings, showing the ability of BERT to model linguistic qualities that relate to both text content and form.
Q1 - Grammaticality: The summary should have no datelines, system-internal formatting, capitalization errors or obviously ungrammatical sentences (e.g., fragments, missing components) that make the text difficult to read.
Q2 - Non redundancy: There should be no unnecessary repetition in the summary.
Q3 - Referential Clarity: It should be easy to identify who or what the pronouns and noun phrases in the summary are referring to.
Q4 - Focus: The summary should have a focus; sentences should only contain information that is related to the rest of the summary.
Q5 - Structure & Coherence: The summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to sentence to a coherent body of information about a topic.
Figure 1: SUM-QE rates summaries with respect to five linguistic qualities (Dang, 2006a). The datasets we use for tuning and evaluation contain human assigned scores (from 1 to 5) for each of these categories.

Related Work
Summarization evaluation metrics like Pyramid (Nenkova and Passonneau, 2004) and ROUGE (Lin and Hovy, 2003; Lin, 2004) are recall-oriented; they essentially measure the content from a model (reference) summary that is preserved in peer (system generated) summaries. Pyramid requires substantial human effort, even in its more recent versions that involve the use of word embeddings (Passonneau et al., 2013) and a lightweight crowdsourcing scheme (Shapira et al., 2019). ROUGE is the most commonly used evaluation metric (Nenkova and McKeown, 2012; Allahyari et al., 2017; Gambhir and Gupta, 2017). Inspired by BLEU (Papineni et al., 2002), it relies on common n-grams or subsequences between peer and model summaries. Many ROUGE versions are available, but it remains hard to decide which one to use (Graham, 2015). Being recall-based, ROUGE correlates well with Pyramid but poorly with linguistic qualities of summaries. Louis and Nenkova (2013) proposed a regression model for measuring summary quality without references. The scores of their model correlate well with Pyramid and Responsiveness, but text quality is only addressed indirectly (footnote 2).
Quality Estimation is well established in MT (Callison-Burch et al., 2012; Bojar et al., 2016, 2017; Martins et al., 2017; Specia et al., 2018). QE methods provide a quality indicator for translation output at run-time without relying on human references, typically needed by MT evaluation metrics (Papineni et al., 2002; Denkowski and Lavie, 2014). QE models for MT make use of large post-edited datasets, and apply machine learning methods to predict post-editing effort scores and quality (good/bad) labels.
We apply QE to summarization, focusing on linguistic qualities that reflect the readability and fluency of the generated texts. Since no post-edited datasets (like the ones used in MT) are available for summarization, we use instead the ratings assigned by human annotators with respect to a set of linguistic quality criteria. Our proposed models achieve high correlation with human judgments, showing that it is possible to estimate summary quality without human references.

Datasets
We use datasets from the NIST DUC-05, DUC-06 and DUC-07 shared tasks (Dang, 2006a,b; Over et al., 2007). Given a question and a cluster of newswire documents, the contestants were asked to generate a 250-word summary answering the question. DUC-05 contains 1,600 summaries (50 questions x 32 systems); DUC-06 contains 1,750 summaries (50 questions x 35 systems); and DUC-07 contains 1,440 summaries (45 questions x 32 systems).
Footnote 2: In the Responsiveness annotation instructions, annotators were asked to assess the linguistic quality of the summary only if it interfered with the expression of information and reduced the amount of conveyed information. See https://duc.nist.gov/duc2005/responsiveness.assessment.instructions
Figure 2: Illustration of different flavors of the investigated neural QE methods. An encoder ($E$) converts the summary to a dense vector representation $h$. A regressor $R_i$ predicts a quality score $S_{Q_i}$ using $h$. $E$ is either a BiGRU with attention (BiGRU-ATT) or BERT (SUM-QE). $R$ has three flavors, one single-task (a) and two multi-task (b, c).
The submitted summaries were manually evaluated in terms of content preservation using the Pyramid score, and according to five linguistic quality criteria (Q1, ..., Q5), described in Figure 1, that do not involve comparison with a model summary. Annotators assigned scores on a five-point scale, with 1 and 5 indicating that the summary is bad or good, respectively, with respect to a specific Q. The overall score for a contestant with respect to a specific Q is the average of the manual scores assigned to the summaries generated by the contestant. Note that the DUC-04 shared task involved seven Qs, but some of them were found to be highly overlapping and were grouped into five in subsequent years (Over et al., 2007). We address these five criteria and use DUC data from 2005 onwards in our experiments.

The SUM-QE Model
In SUM-QE, each peer summary is converted into a sequence of token embeddings, consumed by an encoder $E$ to produce a (dense vector) summary representation $h$. Then, a regressor $R$ predicts a quality score $S_Q$ as an affine transformation of $h$:
$$S_Q = R(E(\text{peer summary}))$$
Non-linear regression could also be used, but a linear (affine) $R$ already performs well. We use BERT as our main encoder and fine-tune it in three ways, which leads to three versions of SUM-QE.
Single-task (BERT-FT-S-1): The first version of SUM-QE uses five separate estimators, one per quality score, each having its own encoder $E_i$ (a separate BERT instance generating $h_i$) and regressor $R_i$ (a separate linear regression layer on top of the corresponding BERT instance):
$$S_{Q_i} = R_i(E_i(\text{peer summary})), \quad i = 1, \dots, 5$$
Multi-task with one regressor (BERT-FT-M-1): The second version of SUM-QE uses one estimator to predict all five quality scores at once, from a single encoding $h$ of the summary, produced by a single BERT instance. The intuition is that $E$ will learn to create richer representations so that $R$ (an affine transformation of $h$ with 5 outputs) will be able to predict all quality scores:
$$S_{Q_i} = R(E(\text{peer summary}))[i]$$
where $R(E(\text{peer summary}))[i]$ is the $i$-th element of the vector returned by $R$.

Multi-task with 5 regressors (BERT-FT-M-5): The third version of SUM-QE is similar to BERT-FT-M-1, but we now use five different linear (affine) regressors, one per quality score:
$$S_{Q_i} = R_i(E(\text{peer summary}))$$
Although BERT-FT-M-5 is mathematically equivalent to BERT-FT-M-1, in practice these two versions of SUM-QE produce different results because of implementation details related to how the losses of the regressors (five or one) are combined.
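The relationship between the two multi-task regressor layouts can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the encoder output is replaced by a random toy vector, and the names `affine`, `W`, `b`, `scores_m1` and `scores_m5` are hypothetical. It only shows that one affine map with five outputs and five affine maps with one output each compute the same scores when they share parameters.

```python
import random

def affine(weights, bias, h):
    """Return W h + b for a weight matrix (list of rows) and a bias vector."""
    return [sum(w * x for w, x in zip(row, h)) + b_i
            for row, b_i in zip(weights, bias)]

random.seed(0)
dim, num_qualities = 8, 5  # toy sizes; a real BERT encoding is much larger

# Stand-in for the shared encoder output h = E(peer summary).
h = [random.uniform(-1, 1) for _ in range(dim)]

# BERT-FT-M-1 layout: one regressor R with five outputs.
W = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(num_qualities)]
b = [random.uniform(-1, 1) for _ in range(num_qualities)]
scores_m1 = affine(W, b, h)  # scores_m1[i] plays the role of R(h)[i]

# BERT-FT-M-5 layout: five regressors R_i with one output each.
# They reuse the rows of W here purely so the two layouts can be compared.
scores_m5 = [affine([W[i]], [b[i]], h)[0] for i in range(num_qualities)]

# The single-task layout (BERT-FT-S-1) would instead use five separate
# encoders, i.e. five different vectors h_i rather than the shared h.
same = all(abs(x - y) < 1e-12 for x, y in zip(scores_m1, scores_m5))
print(same)
```

With shared parameters the forward pass is identical, so any difference between the two layouts has to come from training, for example from how the per-quality losses are combined.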

Baselines
BiGRUs with attention: This is very similar to SUM-QE but now $E$ is a stack of BiGRUs with self-attention (Xu et al., 2015), instead of a BERT instance. The final summary representation ($h$) is the sum of the resulting context-aware token embeddings ($h = \sum_i a_i h_i$) weighted by their self-attention scores ($a_i$). We again have three flavors: one single-task (BiGRU-ATT-S-1) and two multi-task (BiGRU-ATT-M-1 and BiGRU-ATT-M-5).
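As a rough illustration of the attention-weighted sum $h = \sum_i a_i h_i$, the sketch below pools a few toy token states in plain Python. The scoring function (a dot product with a `query` vector followed by a softmax) is an assumption made for the example; the text does not specify how the attention scores are computed, and `attention_pool` is a hypothetical helper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(token_states, query):
    """Return (h, a) with h = sum_i a_i * h_i.

    The weights a_i come from a softmax over dot(query, h_i); this
    scoring function is an assumption made for illustration only.
    """
    raw = [sum(q * x for q, x in zip(query, h_i)) for h_i in token_states]
    a = softmax(raw)
    dim = len(token_states[0])
    h = [sum(a_i * h_i[d] for a_i, h_i in zip(a, token_states))
         for d in range(dim)]
    return h, a

# Three toy context-aware token embeddings of dimension 2.
token_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]  # hypothetical attention parameters

h, a = attention_pool(token_states, query)
print(a, h)
```

Tokens whose states align with the query receive larger weights, so they contribute more to the pooled summary representation.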
ROUGE: This baseline is the ROUGE version that performs best on each dataset, among the versions considered by Graham (2015). Although ROUGE focuses on surface similarities between peer and reference summaries, we would expect properties like grammaticality, referential clarity and coherence to be captured to some extent by ROUGE versions based on long n-grams or longest common subsequences.
Language model (LM): For a peer summary, a reasonable estimate of Q1 (Grammaticality) is the perplexity returned by a pre-trained language model. We experiment with the pre-trained GPT-2 model (Radford et al., 2019), and with the probability estimates that BERT can produce for each token when the token is treated as masked (BERT-FR-LM). Given that the grammaticality of a summary can be corrupted by just a few bad tokens, we compute the perplexity by considering only the k worst (lowest LM probability) tokens of the peer summary, where k is a tuned hyper-parameter.
Next sentence prediction: BERT training relies on two tasks: predicting masked tokens and next sentence prediction. The latter seems to be aligned with the definitions of Q3 (Referential Clarity), Q4 (Focus) and Q5 (Structure & Coherence). Intuitively, when a sentence follows another with high probability, it should involve clear referential expressions and preserve the focus and local coherence of the text. We, therefore, use a pre-trained BERT model (BERT-FR-NS) to calculate the sentence-level perplexity of each summary:
$$\text{perplexity} = \exp\left(-\frac{1}{n} \sum_{i=2}^{n} \log p(s_i \mid s_{i-1})\right)$$
where $p(s_i \mid s_{i-1})$ is the probability that BERT assigns to the sequence of sentences $\langle s_{i-1}, s_i \rangle$, and $n$ is the number of sentences in the peer summary.
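The two perplexity-based baselines can be sketched as follows, assuming the usual definition of perplexity as the exponential of the negative mean log-probability. The probabilities below are made-up numbers rather than real model outputs, and the function names `k_worst_perplexity` and `sentence_level_perplexity` are hypothetical. In practice the token probabilities would come from GPT-2 or from BERT's masked-token predictions, and the sentence-pair probabilities from BERT's next sentence prediction head.

```python
import math

def k_worst_perplexity(token_probs, k):
    """Perplexity computed over the k lowest-probability tokens only."""
    if k < 1:
        raise ValueError("k must be at least 1")
    worst = sorted(token_probs)[:k]
    return math.exp(-sum(math.log(p) for p in worst) / len(worst))

def sentence_level_perplexity(next_sentence_probs, n):
    """Perplexity from the probabilities p(s_i | s_{i-1}) of consecutive
    sentence pairs, normalised by n, the number of sentences (assumed form)."""
    return math.exp(-sum(math.log(p) for p in next_sentence_probs) / n)

# Made-up per-token probabilities for one summary; two tokens look "bad".
token_probs = [0.9, 0.8, 0.01, 0.7, 0.02]
lm_score = k_worst_perplexity(token_probs, k=2)

# Made-up next-sentence probabilities for a three-sentence summary:
# one value for (s1, s2) and one for (s2, s3).
ns_score = sentence_level_perplexity([0.5, 0.5], n=3)

print(round(lm_score, 2), round(ns_score, 3))
```

Restricting the computation to the k worst tokens keeps a few very unlikely tokens from being averaged away by the many well-formed tokens around them.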

Experiments
To evaluate our methods for a particular Q, we calculate the average of the predicted scores for the summaries of each particular contestant, and the average of the corresponding manual scores assigned to the contestant's summaries. We measure the correlation between the two (predicted vs. manual) across all contestants using Spearman's ρ, Kendall's τ and Pearson's r.
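The system-level evaluation described above can be sketched in plain Python as follows. The scores are toy values, not DUC data, and all function names are hypothetical. Kendall's τ is computed here in its simple tau-a form, which is an assumption, since the exact variant is not stated in the text. In practice a statistics library would normally be used instead of these hand-written functions.

```python
from collections import defaultdict
import math

def system_means(rows):
    """rows: iterable of (system_id, score) -> {system_id: mean score}."""
    totals = defaultdict(lambda: [0.0, 0])
    for system, score in rows:
        totals[system][0] += score
        totals[system][1] += 1
    return {s: t / c for s, (t, c) in totals.items()}

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            r[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

def kendall_tau_a(x, y):
    """(concordant - discordant) / number of pairs; ties count as neither."""
    n, s = len(x), 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

# Toy per-summary scores for three systems (two summaries each).
manual = [("A", 4), ("A", 5), ("B", 3), ("B", 3), ("C", 2), ("C", 1)]
predicted = [("A", 4.0), ("A", 4.4), ("B", 3.5), ("B", 3.1), ("C", 2.0), ("C", 2.2)]

gold, pred = system_means(manual), system_means(predicted)
systems = sorted(gold)
x = [gold[s] for s in systems]
y = [pred[s] for s in systems]
rho, tau, r = spearman(x, y), kendall_tau_a(x, y), pearson(x, y)
print(rho, tau, r)
```

Because the correlations are computed over per-system averages, they measure how well a method ranks and scores systems, not individual summaries.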
We train and test the SUM-QE and BiGRU-ATT versions using a 3-fold procedure. In each fold, we train on two datasets (e.g., DUC-05, DUC-06) and test on the third (e.g., DUC-07). We follow the same procedure with the three BiGRU-based models. Hyper-parameters are tuned on a held-out subset from the training set of each fold.
Table 1 shows Spearman's ρ, Kendall's τ and Pearson's r for all datasets and models. The three fine-tuned BERT versions clearly outperform all other methods. Multi-task versions seem to perform better than single-task ones in most cases. Especially for Q4 and Q5, which are highly correlated, the multi-task BERT versions achieve the best overall results. BiGRU-ATT also benefits from multi-task learning.
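The 3-fold procedure amounts to leaving one dataset out at a time. A minimal sketch, using the summary counts reported in the Datasets section, is given below. The size of the held-out subset (`dev_fraction=0.1`), the random seed and the function name `make_folds` are assumptions, since the text does not say how the subset is chosen.

```python
import random

# Number of summaries per dataset, as reported in the Datasets section.
SUMMARY_COUNTS = {"DUC-05": 1600, "DUC-06": 1750, "DUC-07": 1440}

def make_folds(counts, dev_fraction=0.1, seed=0):
    """Build leave-one-dataset-out folds.

    Each summary is identified by a (dataset, index) pair. The dev split
    fraction and the seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    folds = []
    for test_name in counts:
        pool = [(name, i)
                for name, n in counts.items() if name != test_name
                for i in range(n)]
        rng.shuffle(pool)
        n_dev = int(len(pool) * dev_fraction)
        folds.append({"test": test_name,
                      "dev": pool[:n_dev],      # held out for tuning
                      "train": pool[n_dev:]})
    return folds

folds = make_folds(SUMMARY_COUNTS)
for fold in folds:
    print(fold["test"], len(fold["train"]), len(fold["dev"]))
```

Testing on a dataset from a different year than the training data means that no test question or document cluster is seen during training or tuning.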

Results
The correlation of SUM-QE with human judgments is high or very high (Hinkle et al., 2003) for all Qs in all datasets, apart from Q2 in DUC-05 where it is only moderate. Manual scores for Q2 in DUC-05 are the highest among all Qs and years (between 4 and 5) and with the smallest standard deviation, as shown in Table 2. Differences among systems are thus small in this respect, and although SUM-QE predicts scores in this range, it struggles to put them in the correct order, as illustrated in Figure 3.
BEST-ROUGE has a negative correlation with the ground-truth scores for Q2 since it does not account for repetitions. The BiGRU-based models also reach their lowest performance on Q2 in DUC-05. A possible reason for the higher relative performance of the BERT-based models, which achieve a moderate positive correlation, is that BiGRU captures long-distance relations less effectively than BERT, which utilizes Transformers (Vaswani et al., 2017) and has a larger receptive field. A possible improvement would be a stacked BiGRU, since the states of higher stack layers have a larger receptive field as well.
The BERT multi-task versions perform better with highly correlated qualities like Q4 and Q5 (as illustrated in Figures 2 to 4 in the supplementary material). However, there is not a clear winner among them. Mathematical equivalence does not lead to deterministic results, especially when random initialization and stochastic learning algorithms are involved. An in-depth exploration of this point would involve further investigation, which will be part of future work.

Conclusion and Future Work
We propose a novel Quality Estimation model for summarization which does not require human references to estimate the quality of automatically produced summaries. SUM-QE successfully predicts qualitative aspects of summaries that recall-oriented evaluation metrics fail to approximate. Leveraging powerful BERT representations, it achieves high correlations with human scores for most linguistic qualities rated, on three different datasets. Future work involves extending the SUM-QE model to capture content-related aspects, either in combination with existing evaluation metrics (like Pyramid and ROUGE) or, preferably, by identifying important information in the original text and modelling its preservation in the proposed summaries. This would preserve SUM-QE's independence from human references, a property of central importance in real-life usage scenarios and system development settings.
Figure 3: Comparison of the mean gold scores assigned for Q2 and Q3 to each of the 32 systems in the DUC-05 dataset, and the corresponding scores predicted by SUM-QE. Scores range from 1 to 5. The systems are sorted in descending order according to the gold scores. SUM-QE makes more accurate predictions for Q2 than for Q3, but struggles to put the systems in the correct order.
The datasets used in our experiments come from the NIST DUC shared tasks, which comprise newswire articles. We believe that SUM-QE could be easily applied to other domains. A small amount of annotated data would be needed for fine-tuning, especially in domains with specialized vocabulary (e.g., biomedical), but the model could also be used out of the box. A concrete estimation of performance in this setting will be part of future work. Also, the model could serve to estimate linguistic qualities other than the ones in the DUC dataset with minimum effort.
Finally, SUM-QE could serve to assess the quality of other types of texts, not only summaries. It could thus be applied to other text generation tasks, such as natural language generation and sentence compression.