Best Practices for Crowd-based Evaluation of German Summarization: Comparing Crowd, Expert and Automatic Evaluation

One of the main challenges in the development of summarization tools is summarization quality evaluation. On the one hand, human assessment of summarization quality conducted by linguistic experts is slow, expensive, and still not a standardized procedure. On the other hand, automatic assessment metrics are reported not to correlate sufficiently with human quality ratings. As a solution, we propose crowdsourcing as a fast, scalable, and cost-effective alternative to expert evaluations for assessing the intrinsic and extrinsic quality of summarization, and we compare crowd ratings with expert ratings and automatic metrics such as ROUGE, BLEU, or BertScore on a German summarization data set. Our results provide a basis for best practices for crowd-based summarization evaluation regarding major influential factors such as the best annotation aggregation method, the influence of readability and reading effort on summarization evaluation, and the optimal number of crowd workers needed to achieve results comparable to experts, especially when determining factors such as overall quality, grammaticality, referential clarity, focus, structure & coherence, summary usefulness, and summary informativeness.


Introduction
Even though there has been an enormous increase in automatic summarization research, human evaluation of summarization is still an understudied aspect. On the one hand, there is no standard procedure for conducting human evaluation, which leads to a high degree of variation and differing results (Van Der Lee et al., 2019); on the other hand, human evaluation is usually carried out in a traditional laboratory environment by linguistic experts, which is costly and time-consuming to run and prone to subjective biases (Celikyilmaz et al., 2020). Therefore, automatic evaluation metrics such as BLEU and ROUGE have been used as substitutes for human evaluation (Papineni et al., 2002; Lin, 2004). However, they require expert summaries as references and are often reported not to correlate with human evaluations regarding readability, grammaticality, and content-related factors (Novikova et al., 2017).
In other NLP domains, crowdsourcing has been proposed as an alternative to overcome these challenges, showing that crowd workers' aggregated responses can approach the quality produced by experts (Snow et al., 2008; Callison-Burch, 2009; Nowak and Rüger, 2010). In summarization evaluation, very few researchers have investigated crowdsourcing as an alternative, eventually concluding that the chosen crowd-based evaluation methods are not reliable enough to produce consistent scores (Gillick and Liu, 2010; Fabbri et al., 2020). However, the authors did not apply any pre-qualification test, did not provide information about the number of crowd workers, did not apply annotation aggregation methods, or did not analyze the effect of reading effort and of the readability of the source texts caused by the texts' structural and formal composition. Additionally, they used the TAC and CNN/Daily Mail data sets derived from high-quality English texts. There is thus a research gap regarding best practices for crowd-based evaluation of summarization, especially for languages other than English and for noisy internet data.
We address this gap in the following ways: 1) We use a German summarization data set derived from an online question-answering forum; 2) We apply pre-qualification tests and set a threshold for the minimum task completion duration in crowdsourcing; 3) We collect intrinsic and extrinsic quality ratings from 24 different crowd workers per summary in order to analyze consistency; 4) We use different annotation aggregation methods on the crowdsourced data; 5) We analyze the effect of annotation aggregation methods, reading effort, and the number of crowd workers per item on robustness, comparing results from a) expert assessment, b) crowd assessment, and c) state-of-the-art automatic assessment metrics. Languages other than English can especially benefit from our results, since they lack easy-to-use automatic evaluation metrics in the form of simplified toolkits, and a well-executed evaluation can accelerate research on automatic summarization (Fabbri et al., 2020).

Automatic Summarization Evaluation
The automatic evaluation of summarization falls into two categories: untrained automatic metrics, which do not require machine learning but are based on string or content overlap between machine-generated and expert-generated summaries (ground truth), and machine-learned metrics, which are based on machine-learned models (Celikyilmaz et al., 2020).

Untrained Automatic Metrics
The most common untrained automatic metrics for summarization evaluation are BLEU, METEOR, and ROUGE, which rely on counting n-grams and calculating precision, recall, and F-measure by comparing one or several system summaries to reference summaries generated by experts (Papineni et al., 2002; Denkowski and Lavie, 2014; Lin, 2004). ROUGE is the most popular method to assess summarization quality; at least one ROUGE variant is used in 87% of the summarization papers at ACL conferences between 2013 and 2018. In recent years, many variations of ROUGE and other measures have been introduced in the literature (Zhou et al., 2006; Ng and Abrecht, 2015; Ganesan, 2018). However, these metrics have been criticized because the correlations with human assessment reported in the summarization literature range from weak to strong and because they are not suitable for capturing important quality aspects (Reiter and Belz, 2009; Graham, 2015; Novikova et al., 2017; Peyrard and Eckle-Kohler, 2017). Therefore, more and more researchers refrain from using automatic metrics as a primary evaluation method (Reiter, 2018). Still, Van Der Lee et al. (2019) report that 80% of the empirical papers presented at the ACL track on NLG or at the INLG conference in 2018 used automatic metrics due to the lack of alternatives and their fast and cost-effective nature.
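To illustrate how these overlap-based metrics work, the following minimal sketch computes a ROUGE-N F-score from clipped n-gram counts for a single candidate/reference pair; it is a simplified stand-alone re-implementation for illustration, not the official ROUGE toolkit, and the example sentences are made up.

```python
from collections import Counter

def rouge_n_f1(candidate_tokens, reference_tokens, n=1):
    """ROUGE-N F1 from clipped n-gram overlap between one candidate and one reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate_tokens), ngrams(reference_tokens)
    overlap = sum((cand & ref).values())  # clipped counts: each n-gram matched at most once
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge_n_f1("der kunde hat den router neu gestartet".split(),
                 "der kunde startet den router neu".split(), n=1))
```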

Trained Automatic Metrics
Over the last few years, NLP researchers have proposed new machine-learned automatic metrics trained using BERT contextual embeddings, such as BertScore, BLEURT, and BLANC, to evaluate natural language generation (NLG) quality; these can also be applied to summarization evaluation (Devlin et al., 2019; Zhang et al., 2019; Sellam et al., 2020; Vasilyev et al., 2020). BertScore and BLEURT still require expert-generated summaries as ground truth; BertScore computes the similarity of two summaries as a sum of cosine similarities between their tokens' embeddings. Zhang et al. (2019) reported that BertScore correlates better than other state-of-the-art metrics on machine translation and image captioning tasks, while Sellam et al. (2020) showed that BLEURT correlates better than BertScore with human judgments on the WMT17 Metrics Shared Task. Unlike these metrics, the BLANC score is designed not to require any reference summaries, aiming for fully human-free summary quality estimation (Vasilyev et al., 2020). BLANC was shown to correlate as well as ROUGE on the CNN/Daily Mail data set.
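The similarity computation behind BertScore can be sketched as greedy cosine matching over token embeddings. The toy example below uses random vectors in place of real BERT outputs and omits IDF weighting and baseline rescaling, so it only illustrates the matching step, not the full metric.

```python
import numpy as np

def greedy_bertscore(cand_emb, ref_emb):
    """Precision/recall/F1 from greedy cosine matching of token embeddings."""
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T                  # pairwise cosine similarities
    precision = sim.max(axis=1).mean()  # each candidate token matched to its best reference token
    recall = sim.max(axis=0).mean()     # each reference token matched to its best candidate token
    return precision, recall, 2 * precision * recall / (precision + recall)

rng = np.random.default_rng(0)          # random stand-ins for contextual embeddings
print(greedy_bertscore(rng.normal(size=(12, 768)), rng.normal(size=(15, 768))))
```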

Human Evaluation
Human evaluation can be conducted as a pair comparison (against expert summaries) or on absolute scales without a reference. One of the common human evaluation methods using pair comparison is the PYRAMID method (Nenkova and Passonneau, 2004). In the PYRAMID method, the sentences of both system and reference summaries are split into Summary Content Units and compared with each other based on content. It thus measures only the summaries' relative quality and does not give a sense of a summary's absolute quality. In this paper, we focus on absolute quality measurement, in which the generated summaries are shown to the evaluators one at a time and judged individually by rating their quality along a Likert or sliding scale. Therefore, we do not use the PYRAMID method in our human evaluation and instead collect human ratings in two categories: intrinsic (linguistic) and extrinsic (content) evaluation (Jones and Galliers, 1995; Steinberger and Ježek, 2012).

Intrinsic (Linguistic) Evaluation
In intrinsic evaluation, domain experts are usually asked to evaluate the quality of a given summary, either as overall quality or along some specific dimension, without reading the source document (Celikyilmaz et al., 2020). To determine the intrinsic quality of summarization, the following five text readability (linguistic quality) scores are most commonly used: grammaticality, non-redundancy, referential clarity, focus, and structure & coherence. In Section 3, we determine these scores based on the definitions in Dang (2005).

Extrinsic (Content) Evaluation
In extrinsic evaluation, domain experts evaluate a system's performance on the task for which it was designed, so the evaluation of summary quality is based on the source document (Lloret et al., 2018). The most common extrinsic quality measures are: 1) "summary usefulness" - also called content responsiveness - which determines how useful the extracted summary is for satisfying the given goal; 2) "source text usefulness" - also called relevance assessment - which examines how useful the source document is for satisfying the given goal; 3) "summary informativeness", which measures how much information from the source document is preserved in the extracted summary (Mani, 2001; Conroy and Dang, 2008; Shapira et al., 2019).

Crowdsourcing for Summarization Evaluation
Crowdsourcing has been used as a fast and cost-effective alternative to traditional subjective evaluation with experts in summarization evaluation; however, it has not been explored as thoroughly as for other NLG tasks, such as evaluating machine translation (Lloret et al., 2018). In the few papers in which crowdsourcing has been used for summarization evaluation, the quality of the crowdsourced data has been repeatedly questioned because of crowd workers' inaccuracy and the complexity of summarization evaluation. For example, Gillick and Liu (2010) found that ratings from non-expert crowd workers do not correlate with expert ratings on the TAC summarization data set, which contains 100-word summaries of sets of 10 newswire articles about a particular topic. A similar conclusion was reached by Lloret et al. (2013), who created a corpus for abstractive image summarization with five crowd workers per item. However, besides the fact that these results were obtained in domains other than the telecommunication domain presented in this work, in both works the authors did not apply any pre-qualification test and did not provide information about crowdsourcing task details, which can also have a rather large influencing effect. Subsequently, Gao et al. (2018), Falke et al. (2017), and Fan et al. (2018) used crowdsourcing as the source of human evaluation to rate their automatic summarization systems. Nevertheless, they did not question the robustness of crowdsourcing for this task or compare the crowd with expert data. In our own previous work, we showed that crowdsourcing achieves almost the same results as laboratory studies using 7-9 crowd workers, but we did not compare the crowd with experts (Iskender et al., 2020). Fabbri et al. (2020) compared the crowd with expert evaluation on the CNN/Daily Mail data set using only five crowd workers per summary. They also found that crowd and expert ratings do not correlate and emphasized the need for protocols to improve the human evaluation of summarization.
To improve the quality of crowdsourcing, researchers have developed several methods, such as filtering and aggregation (Kairam and Heer, 2016). For filtering crowd workers, the first approach focuses on pre-qualification tasks designed based on the task characteristics (Mitra et al., 2015). For aggregating crowd judgments, the majority vote is the most common technique (Chatterjee et al., 2019). Much more complex annotation aggregation methods, such as probabilistic models of annotation, models accounting for item-level effects, or clustering methods, have been introduced in recent years (Passonneau and Carpenter, 2014; Whitehill et al., 2009; Luther et al., 2015).
To provide best practices for crowd-based summarization evaluation, we apply pre-qualification and focus on the following aggregation methods in this paper: 1) MOS: the Mean Opinion Score (MOS) takes the mean of all judgments for a given item and is one of the most popular metrics for subjective quality evaluation (Streijl et al., 2016; Chatterjee et al., 2019); 2) Majority Vote: the answer with the most votes is selected as the final aggregated value; it is the most popular method for subjective quality evaluation with crowdsourcing (Hovy et al., 2013; Hung et al., 2013); 3) CrowdTruth: it represents the crowdsourcing system in its three main components - input media units, workers, and annotations. It is designed to capture inter-annotator disagreement in crowdsourcing and aims to collect gold standard data for training and evaluating cognitive computing systems using crowdsourcing (Dumitrache et al., 2018a). Dumitrache et al. (2018b) have shown that CrowdTruth performs better than the majority vote in different domains; 4) MACE: Multi-Annotator Competence Estimation (MACE) is a probabilistic model that computes competence estimates for the individual annotators and the most likely answer to each item (Hovy et al., 2013). Paun et al. (2018) have shown that MACE performs better than other annotation aggregation methods in evaluations against a gold standard, and the model is possibly the most widely applied to linguistic data (Plank et al., 2014; Sabou et al., 2014; Habernal and Gurevych, 2016).
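As a minimal illustration of the two simplest aggregation schemes, the sketch below computes MOS and a majority vote for one summary's crowd ratings; the ratings are made-up values, and CrowdTruth and MACE are not re-implemented here since we use the crowdtruth-core and MACE libraries for those.

```python
from collections import Counter
from statistics import mean

# made-up crowd ratings (1-5 scale) for one summary and one measure
ratings = [4, 5, 4, 3, 4, 4, 5, 4]

def mos(values):
    """Mean Opinion Score: the plain mean of all judgments for an item."""
    return mean(values)

def majority_vote(values):
    """Label with the most votes; ties fall to the first most common label here."""
    return Counter(values).most_common(1)[0][0]

print("MOS:", mos(ratings), "Majority vote:", majority_vote(ratings))
```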

Data Set
In our experiments, we used the same German summary data set with 50 summaries as described in Iskender et al. (2020). The corpus contains queries with an average word count of 7.78, the shortest one with four words, and the longest with 17 words; posts from a customer forum of Deutsche Telekom with an average word count of 555, the shortest one with 155 words, and the longest with 1005 words; and corresponding query-based extractive summaries with an average word count of 63.32, the shortest one with 24 words, and the longest one with 147 words.

Crowdsourcing Study
We collected crowd annotations using the Crowdee platform. Crowd workers were only allowed to perform the summary evaluation task after passing two qualification tests in the following order: 1) a German language proficiency test provided by the Crowdee platform, requiring a score of 0.9 or above (scale [0, 1]); 2) a summarization evaluation test containing deliberately designed bad and good example summaries to be recognized by the crowd.
In this test, a maximum of 20 points could be reached, and we kept only crowd workers exceeding 12 points. In addition, based on our expert pre-testing, we set 90 seconds as the threshold for the minimum task completion duration and eliminated all crowd answers below this threshold.
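The sketch below applies these two qualification thresholds and the minimum completion time to a set of crowd answers; the record layout and field names are illustrative assumptions, not the Crowdee API.

```python
# Hypothetical answer records; field names are illustrative only.
answers = [
    {"worker": "w1", "language_score": 0.93, "qual_points": 15, "duration_s": 214},
    {"worker": "w2", "language_score": 0.95, "qual_points": 11, "duration_s": 310},
    {"worker": "w3", "language_score": 0.91, "qual_points": 17, "duration_s": 62},
]

MIN_LANGUAGE_SCORE = 0.9  # German proficiency test, scale [0, 1]
MIN_QUAL_POINTS = 12      # summarization qualification test, max. 20 points (kept if exceeded)
MIN_DURATION_S = 90       # minimum task completion duration from expert pre-testing

kept = [a for a in answers
        if a["language_score"] >= MIN_LANGUAGE_SCORE
        and a["qual_points"] > MIN_QUAL_POINTS
        and a["duration_s"] >= MIN_DURATION_S]
print(len(kept), "of", len(answers), "answers kept")   # -> 1 of 3
```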
In the main task, a brief explanation of the summary creation process was shown first, together with an example of a query, forum posts, and a summary, to provide background information. After reading all instructions, crowd workers evaluated nine quality factors of a single summary using a 5-point scale with the labels very good, good, moderate, bad, very bad in the following order: 1) overall quality, 2) grammaticality, 3) non-redundancy, 4) referential clarity, 5) focus, 6) structure & coherence, 7) summary usefulness, 8) post usefulness, and 9) summary informativeness. For the first six questions, the corresponding forum posts and the query were not shown to the crowd workers (intrinsic quality); for question 7, we showed the original query; for questions 8 and 9, the original query and the corresponding forum posts. In total, 24 repetitions per item for each of these nine questions were collected, resulting in 10,800 labels (50 summaries x 9 questions x 24 repetitions). Compensation was carefully calculated to ensure the minimum wage of €9.35 per hour in Germany. Overall, 46 crowd workers (19 female, 27 male, mean age 43) completed the individual sets of tasks within 20 days, spending 249,884 seconds, ca. 69.4 hours, in total.

Expert Evaluation
We used an approach similar to the Delphi method to obtain a consensus among experts in an iterative procedure (Linstone et al., 1975; Sanchan et al., 2017). In the first evaluation round, two experts, both Master's students in linguistics, separately evaluated the same summarization data set using the same task design as the crowd workers on the Crowdee platform to avoid any user interface biases. After the first evaluation round, the inter-rater agreement calculated by Cohen's κ showed that the experts often diverged in their assessments. In order to reach an acceptable inter-rater agreement score, physical follow-up meetings with the experts were arranged. In these meetings, the experts discussed the causes and backgrounds of their ratings for each item on which they disagreed, simultaneously creating a more detailed definition and evaluation criteria catalog for each score for future experiments. After the meetings, acceptable inter-rater agreement scores were achieved (see Section 4). In total, 900 ratings (50 summaries x 9 questions x 2 experts) were collected.

Automatic Evaluation
We calculated the BLEU and ROUGE scores using the sumeval library for German, and the BertScore and BLEURT scores using the bert-base-german-cased configuration. All four of these metrics require gold standard summaries, which were created by the two linguistic experts. The gold standard summaries have an average word count of 58.18, the shortest one with 14 words and the longest with 112 words. In addition, we calculated the human-free summary quality estimation metric BLANC using the bert-base-german-cased configuration. We selected these five metrics because they are either the baseline automatic summarization evaluation metrics (BLEU and ROUGE) or the latest AI-based metrics (BertScore, BLEURT, BLANC), which have not previously been applied to a German summarization data set.
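A minimal sketch of this reference-based computation is shown below. It assumes that sumeval's RougeCalculator and BLEUCalculator can be configured for German (the library ships English and Japanese support out of the box, so a German setup may need an additional tokenizer) and uses the bert_score package; the example texts and the BertScore layer choice are illustrative assumptions, not our exact setup.

```python
from sumeval.metrics.rouge import RougeCalculator
from sumeval.metrics.bleu import BLEUCalculator
from bert_score import score as bert_score

system = "Der Kunde sollte den Router neu starten und die Zugangsdaten prüfen."
references = ["Router neu starten und Zugangsdaten prüfen.",   # expert summary 1 (made up)
              "Der Kunde soll den Router neu starten."]         # expert summary 2 (made up)

rouge = RougeCalculator(stopwords=True, lang="de")  # German configuration is an assumption
bleu = BLEUCalculator(lang="de")
print("ROUGE-1 F:", rouge.rouge_n(summary=system, references=references, n=1))
print("ROUGE-2 F:", rouge.rouge_n(summary=system, references=references, n=2))
print("ROUGE-L F:", rouge.rouge_l(summary=system, references=references))
print("BLEU:", bleu.bleu(summary=system, references=references))

# BertScore against both expert summaries; the layer choice is an assumption.
P, R, F = bert_score([system], [references], model_type="bert-base-german-cased", num_layers=9)
print("BertScore F:", F.mean().item())
```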

Results
Results are presented for the overall quality score (OQ), the five intrinsic quality scores (grammaticality (GR), non-redundancy (NR), referential clarity (RC), focus (FO), structure & coherence (SC)), and the three extrinsic quality scores (summary usefulness (SU), post usefulness (PU), and summary informativeness (SI)). We refer to these labels by their abbreviations in this section. For our human-based evaluation, we analyzed 10,800 ratings from the crowdsourcing study and 900 ratings from the expert evaluation. For the automatic evaluation, we analyzed BLEU, ROUGE-1, ROUGE-2, ROUGE-L, BertScore (we use F-scores for these metrics), and BLEURT by taking the mean of the scores calculated with the two expert summaries, as well as the BLANC scores, resulting in 350 scores (50 summaries x 7 automatic metrics).

Comparing Crowd with Expert
Before comparing expert ratings with the crowd, we calculated Cohen's κ and Krippendorff's α scores to measure the inter-rater agreement between the two experts, as well as the raw agreement scores, as recommended in Van Der Lee et al. (2019) (see Table 1). Looking at the raw agreement, we see that the experts gave the same ratings for at least 70% of the data for all nine measures after the second evaluation round. Further, the Cohen's κ scores indicate substantial agreement according to the scale of Landis and Koch (1977). We also calculated Krippendorff's α, which is technically a measure of evaluator disagreement rather than agreement and the most common measure in the set of NLG papers surveyed in Amidei et al. (2019). The Krippendorff's α scores are good [0.8-1.0] for all measures except for the PO and SI measures, which are tentative [0.67-0.8), and the PU measure, which should be discarded because it is 0.04 below the threshold of 0.67 (Krippendorff, 1980). Because of this minimal difference of 0.04, we decided to still use the PU measure in our further analysis. With these results, we achieved a better agreement level than the average expert agreement for summarization evaluation reported in other papers (Van Der Lee et al., 2019).
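The agreement statistics can be reproduced roughly as follows, using scikit-learn for Cohen's κ and the krippendorff package for α; the ratings are toy values, and treating the 5-point scale as ordinal for α is our assumption.

```python
import krippendorff                          # pip install krippendorff
from sklearn.metrics import cohen_kappa_score

# toy ratings of the two experts for one measure, one value per summary (1-5 scale)
expert_a = [4, 3, 5, 2, 4, 4, 3, 5, 1, 4]
expert_b = [4, 3, 4, 2, 4, 5, 3, 5, 1, 4]

raw_agreement = sum(a == b for a, b in zip(expert_a, expert_b)) / len(expert_a)
kappa = cohen_kappa_score(expert_a, expert_b)
alpha = krippendorff.alpha(reliability_data=[expert_a, expert_b],
                           level_of_measurement="ordinal")
print(f"raw={raw_agreement:.2f}, kappa={kappa:.2f}, alpha={alpha:.2f}")
```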
We use the mean of the expert ratings for all quality measures as the ground truth for our further analysis. To test the normality of the expert ratings, we carried out Anderson-Darling tests, which showed that the measures OQ, NR, FO, and SI were not normally distributed (p < 0.05). Therefore, we apply non-parametric statistics in the following sections.
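A sketch of the normality check with SciPy is shown below; scipy.stats.anderson reports critical values rather than a p-value, so the statistic is compared against the 5% critical value, and the expert means are toy values.

```python
from scipy import stats

expert_oq_means = [4.0, 3.5, 4.5, 2.0, 4.0, 3.0, 5.0, 4.5, 3.5, 4.0]  # toy expert means

result = stats.anderson(expert_oq_means, dist="norm")
crit_5 = result.critical_values[list(result.significance_level).index(5.0)]
print("reject normality at the 5% level:", result.statistic > crit_5)
```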

Annotation Aggregation Methods
To investigate the effect of the annotation aggregation methods on the correlation coefficients between the crowd and expert ratings, we compared MOS with the baseline Majority Vote and the two weighted-rank metrics CrowdTruth and MACE, using the crowdtruth-core and MACE libraries. Table 2 shows the Spearman's ρ correlation coefficients between crowd and expert ratings for these four aggregation methods. To determine whether the differences between them are statistically significant, we applied Zou's confidence intervals test for dependent and overlapping variables and found that the differences between the correlation coefficients were not statistically significant for any of the nine measures (Zou, 2007). Based on this correlation analysis, we recommend using MOS as the aggregation method for crowd-based summarization evaluation, since aggregation with MOS delivers the aggregates most comparable to the experts and is easy to apply.
Analyzing the Spearman's ρ correlation coefficients between the crowd and expert ratings aggregated by MOS, we see that all correlation coefficients were statistically significant, ranging from moderate (NR, PU) to strong (OQ, GR, RC, FO, SU, SI) and very strong (SC), where SC had the highest correlation coefficient of 0.828 and PU the lowest of 0.464. This result suggests that crowdsourcing can be used instead of experts when determining the structure & coherence of a summary. For determining OQ, GR, RC, FO, SU, and SI, crowdsourcing can be preferred since the correlation coefficients are strong, but the results should be interpreted with some degree of caution. However, when evaluating non-redundancy and post usefulness, experts should be used for more robust results.
To investigate the differences between the crowd and expert judgments, we conducted a Mann-Whitney U test for the crowd-expert pair of each of the nine quality measures. We observed no significant difference between the median ratings of the OQ, SC, SU, and SI measures. This result suggests that crowdsourcing can be used instead of experts when determining these four measures without significant deviation in the absolute score value. Note that the rating distributions allow for equality of the estimated central values (here the median) even for measures where the correlations were only strong rather than very strong.
However, there were statistically significant differences for the remaining measures, for example between the crowd ratings for GR (M = 3.667) and the corresponding expert ratings, showing that the crowd workers rated these factors statistically lower than the experts. This observation might be explained by the fact that the nature of extractive summarization and its inherent text quality losses - compared to a naturally composed text flow - are more familiar to experts than to non-experts, so experts can distinguish between unnaturalness and linguistic quality in more robust ways.
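The comparison of the two rating distributions per measure can be sketched with SciPy's Mann-Whitney U test; the per-summary aggregates below are toy values.

```python
from scipy.stats import mannwhitneyu

# toy per-summary aggregates for one measure
crowd_gr = [3.7, 3.5, 4.0, 3.2, 3.9, 3.6, 3.8, 3.4]
expert_gr = [4.5, 4.0, 4.5, 3.5, 4.0, 4.5, 4.0, 4.0]

stat, p = mannwhitneyu(crowd_gr, expert_gr, alternative="two-sided")
print(f"U={stat:.1f}, p={p:.4f}")   # p < 0.05 indicates a significant difference in medians
```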

Effect of Reading Effort
In this section, we analyze the seven measures that achieve a correlation coefficient above 0.6 with the experts: OQ, GR, RC, FO, SC, SU, and SI. Because a text's structural and formal composition, among many other factors, can make summarization evaluation difficult, we analyzed the quality assessment performance of crowd workers regarding two distinct factors: a) readability of the text and b) reading effort in terms of overall stimulus length, by dividing our data into six groups.
As our first reading effort criterion, we used the automated readability index (ARI), a readability test designed to assess a text's understandability, where a low ARI score indicates higher readability (Feng et al., 2010). We split the data into two groups by the median ARI score of the source texts (ARI-Low, ARI-High), calculated using the textstat library. Because the amount of information to be read and understood by a crowd worker also depends on text length, we additionally split the data by the median summary length (Summary-Short, Summary-Long) and the median length of the forum posts (Posts-Short, Posts-Long). Figure 1 displays all the correlation coefficients for the six groups. Here, we recognized a consistent pattern across all group pairs: the correlation coefficients between the crowd and expert ratings were higher in the groups "ARI-Low", "Summary-Short", and "Posts-Short" than in the groups "ARI-High", "Summary-Long", and "Posts-Long", except for SI. The reason for the opposite trend of SI in the groups divided by summary length might be that long summaries naturally contain more information, so it is easier for crowd workers to identify the summary informativeness. Apart from this opposite trend of SI, we can derive the intuitive conclusion that text understandability and reading effort have a noticeable effect on the robustness of crowd judgments. Crowd workers may be used instead of experts for the evaluation of rather short summaries derived from documents with high readability.
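The readability grouping can be sketched as follows, assuming textstat's automated_readability_index and its language setting work for German text (the exact language code may differ by version); the source texts are placeholders.

```python
import statistics
import textstat

textstat.set_lang("de")   # language setting; assumed to be supported for German

posts = {"s01": "Langer Forenbeitrag ...", "s02": "Kurzer Beitrag ..."}   # placeholder texts
ari = {sid: textstat.automated_readability_index(text) for sid, text in posts.items()}

median_ari = statistics.median(ari.values())
groups = {sid: ("ARI-Low" if score <= median_ari else "ARI-High")
          for sid, score in ari.items()}
print(groups)
```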

Optimal Crowd Worker Number
To find the optimal number of required crowd worker assessments per item, we plot in Figure 2 the change of the correlation coefficients between the crowd and expert ratings for all nine measures, where the x-axis shows the number of crowd workers per item in measured order and the y-axis displays the Spearman's ρ correlation coefficients between the crowd and expert ratings.
Looking at Figure 2, three or fewer crowd workers as annotators are not sufficient, and a study with a low number of crowd workers would not deliver reliable results, since the correlation coefficients increase with the number of crowd workers. However, this increase reaches a saturation point between the number of repetitions and the resulting correlation coefficient. To determine the optimal number of repetitions accurately, we applied the method described in our paper Iskender et al. (2020), where multiple randomized runs are simulated in order to robustly determine a "knee point", after which additional repetitions no longer cause an adequate increase in the overall correlation coefficients between the crowd and expert ratings. Our findings are directly in line with those in Iskender et al. (2020), where we applied this method to compare crowd ratings with laboratory ratings and found that 7-9 crowd workers are generally sufficient to achieve almost the same results as a laboratory study.
We found that the knee point is 5 for RC; 7 for OQ, GR, NR, FO, SC, SU, and SI; and 8 for PU. This result shows that, generally, after collecting data from 5-8 different crowd workers, depending on the measure, collecting one additional crowd judgment was no longer worth the increase in the correlation coefficient between the crowd and expert ratings.
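A minimal sketch of this kind of simulation is given below: for each candidate worker count k, it repeatedly samples k of the available ratings per summary, aggregates them with MOS, and correlates the aggregates with the expert means; the data, number of runs, and knee-point read-off are toy assumptions, not our exact procedure.

```python
import random
from statistics import mean
from scipy.stats import spearmanr

def correlation_by_workers(crowd, expert_means, max_k=24, runs=100, seed=0):
    """Average Spearman's rho between MOS of k sampled crowd ratings and expert means."""
    rng = random.Random(seed)
    summaries = sorted(crowd)
    curve = []
    for k in range(1, max_k + 1):
        rhos = []
        for _ in range(runs):
            mos = [mean(rng.sample(crowd[s], k)) for s in summaries]
            rhos.append(spearmanr(mos, [expert_means[s] for s in summaries]).correlation)
        curve.append(mean(rhos))
    return curve   # the knee point is where additional workers stop improving the curve

def toy_ratings(seed, n=24):
    r = random.Random(seed)
    return [r.randint(1, 5) for _ in range(n)]

crowd = {f"s{i:02d}": toy_ratings(i) for i in range(10)}    # 24 toy ratings per summary
expert_means = {s: mean(r) for s, r in crowd.items()}       # toy expert ground truth
print(correlation_by_workers(crowd, expert_means, max_k=5, runs=20))
```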
Analyzing the Spearman's ρ correlation coefficients between the automatic scores and the crowd ratings, we observed that only the ROUGE and BertScore scores correlated with OQ, RC, FO, SU, and SI of the crowd judgments (see Table 3). Looking at the correlation coefficients between the expert ratings and the automatic metrics (see Table 4), we also found a significant correlation only between the ROUGE and BertScore scores and OQ, GR, RC, and SI of the expert ratings. Generally, judging by the magnitude of any significant correlation found, the overall correlations were weak. Even though we used recent BERT-based metrics in addition to ROUGE, such as BertScore (Van Der Lee et al., 2019), our findings confirm that automatic metrics do not correlate with linguistic quality measures in the summarization domain.
Although Papineni et al. (2002), Lin (2004), and Zhang et al. (2019) reported high correlations with humans on content-related quality assessment in the corresponding original papers, we showed that these metrics correlate poorly with any human rating, from crowd or expert, verifying the findings of Van Der Lee et al. (2019) for our data set. One reason for this difference is that the BLEU score was developed for measuring machine translation quality and was tested on a translation data set. BertScore was likewise not evaluated on a summarization data set in its original paper.
Only the ROUGE metric was tested on summarization data sets. However, in the human evaluation part of the original paper, the evaluators assigned content coverage scores to a candidate summary compared to a manual summary, which is very similar to how ROUGE works, calculating the n-gram match of a candidate summary against a manual summary. In our human evaluation, we did not apply pair comparison, and the ratings were given on an absolute scale, which might be the reason for the low correlation coefficients between the automatic metrics and human ratings in our study. We also calculated BLEURT and BLANC scores, but we treat them as preliminary results since we did not apply any special pre-training to these metrics. We found that BLEURT does not correlate significantly with any of the crowd or expert ratings. Similarly, BLANC does not correlate with any of the crowd ratings except for NR (ρ = −0.342), and surprisingly it correlates significantly and negatively with the expert ratings for NR (ρ = −0.473), RC (ρ = −0.308), and SC (ρ = −0.347). We cannot explain the reasons for the negative correlations and speculate that they might be due to the lack of pre-training.

Conclusion and Future Work
In this paper, we provide a basis for best practices for crowd-based summarization evaluation by comparing different annotation aggregation methods, analyzing the effect of reading effort and readability, and estimating the optimal number of required crowd workers per item in order to resemble experts' assessment quality as closely as possible through crowdsourcing.
When determining structure & coherence, we suggest that crowdsourcing can be used as a direct substitute for experts, as evidenced by the very strong correlation coefficient. For determining overall quality, grammaticality, referential clarity, focus, summary usefulness, and summary informativeness, crowdsourcing can be preferred, as the overall correlations are still strong, but the results should be interpreted carefully. However, when evaluating non-redundancy and post usefulness, experts should be used for more robust results, as the correlations are only moderate.
Our experiments further suggest the following best practices when using crowdsourcing instead of experts: 1) in general, 5-8 crowd workers should annotate a given summary; 2) MOS should be used as the aggregation method to achieve results optimally comparable to experts; 3) crowdsourcing works best when the source text is easy to read and the task requires a rather low reading effort. We also confirm the findings of Dumitrache et al. (2018b) that CrowdTruth performs better than MACE. Further, we confirm that the automatic evaluation metrics BLEU, ROUGE, and BertScore cannot be used to evaluate linguistic quality, and we show that automatic evaluation metrics correlate poorly with any content-related absolute human rating, from crowd or expert, verifying the findings of Van Der Lee et al. (2019) for our domain. Therefore, crowdsourcing should generally be the preferred evaluation method over automatic scores in summarization evaluation.
Since the vast majority of research on summarization is based on the TAC or CNN/Daily Mail data sets, there is a lack of work on other languages or domains. We address this gap by using a German forum summarization data set derived from an online forum in the telecommunication domain. Contrary to the findings of Gillick and Liu (2010) and Fabbri et al. (2020), we achieve significant correlations between the crowd and expert ratings, ranging from moderate to very strong magnitude, as well as no significant difference in the absolute mean rating between the crowd and expert assessments for overall quality, structure & coherence, summary usefulness, and summary informativeness. The other scales show a slight but still significant bias towards lower crowd ratings of less than 0.3 points absolute. These are important findings for the development of NLG tools for summarization. In particular, summarization tools developed for languages other than English, for which it is harder to conduct expert evaluations and to find easy-to-use automatic metrics, could benefit highly from our findings.
However, this study has some limitations, since we conduct our analysis using only a single data set derived from an online forum in the telecommunication domain. The level of domain knowledge of crowd workers and experts about the telecommunication service might play a role when determining content-related quality measures such as post usefulness, so the effect of domain knowledge should be investigated in detail in future work. Another shortcoming of this paper is that our summarization data set is derived from noisy internet data and the summary lengths do not differ much. As shown in Section 4.1.2, the readability of the source document and varying summary lengths might affect the results; therefore, the same analysis should be conducted on at least one more data set. Additionally, our data set was monolingual, so exploring language-based effects will also be part of future work.
A further limitation is that we did not investigate the effect of the crowdsourcing task design and of learning effects on the correlation coefficients between the crowd and expert ratings. Questions regarding limiting the number of assignments taken on by an evaluator (both crowd and expert) and evaluators' behavior (becoming more lenient or strict over time) should also be analyzed in future work. Also, we did not use pairwise comparison in our task design and focused only on absolute quality rating. For that reason, investigating pairwise comparison using crowdsourcing and comparing it to absolute rating should be considered an essential aspect of crowdsourcing task design in future work.
Despite its limitations, this paper is the first in the summarization evaluation literature to provide clear evidence supporting the use of crowdsourcing to evaluate summarization quality, and it adds to a growing body of research on summarization evaluation.