GRUEN for Evaluating Linguistic Quality of Generated Text

Automatic evaluation metrics are indispensable for evaluating generated text. To date, these metrics have focused almost exclusively on the content selection aspect of the system output, ignoring the linguistic quality aspect altogether. We bridge this gap by proposing GRUEN for evaluating Grammaticality, non-Redundancy, focUs, structure and coherENce of generated text. GRUEN utilizes a BERT-based model and a class of syntactic, semantic, and contextual features to examine the system output. Unlike most existing evaluation metrics, which require human references as input, GRUEN is reference-less and requires only the system output. Moreover, it has the advantage of being unsupervised, deterministic, and adaptable to various tasks. Experiments on seven datasets over four language generation tasks show that the proposed metric correlates highly with human judgments.


Introduction
Automatic evaluation metrics for Natural Language Generation (NLG) tasks reduce the need for human evaluations, which can be expensive and time-consuming to collect. Fully automatic metrics allow faster measures of progress when training and testing models, and therefore accelerate the development of NLG systems (Chaganty et al., 2018; Zhang et al., 2020; Clark et al., 2019).
To date, most automatic metrics have focused on measuring the content selection between the human references and the model output, leaving linguistic quality to be only indirectly captured (e.g., n-gram and longest common subsequence overlap in ROUGE-N and ROUGE-L, respectively (Lin and Hovy, 2003; Lin, 2004), and alignment in METEOR (Banerjee and Lavie, 2005)). Even though the need for an explicit measure of linguistic quality has long been pointed out in Dang (2006); Conroy and Dang (2008), this aspect has remained under-explored, barring a few studies that focused on measuring the linguistic quality of a generated piece of text (Pitler et al., 2010; Kate et al., 2010; Xenouleas et al., 2019).

Q1: Grammaticality. The summary should have no datelines, system-internal formatting, capitalization errors or obviously ungrammatical sentences (e.g., fragments, missing components) that make the text difficult to read.
Q2: Non-redundancy. There should be no unnecessary repetition in the summary.
Q3: Focus. The summary should have a focus; sentences should only contain information that is related to the rest of the summary.
Q4: Structure and Coherence. The summary should be well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to sentence to a coherent body of information about a topic.

Table 1: Dimensions of linguistic quality as proposed in Dang (2006).
In this paper, we bridge this gap by proposing a novel metric for evaluating the linguistic quality of system output. Taking into consideration the guidelines put forth for the Document Understanding Conference (DUC) in Table 1, we evaluate: 1) Grammaticality by computing the sentence likelihood and the grammatical acceptability with a BERT-based language representation model (Devlin et al., 2019), 2) Non-redundancy by identifying repeated components with inter-sentence syntactic features, 3) Focus by examining semantic relatedness between adjacent sentences using Word Mover's Distance (WMD) (Kusner et al., 2015), and 4) Structure and Coherence by measuring the Sentence-Order Prediction (SOP) loss with A Lite BERT (Lan et al., 2019).
Compared with existing metrics, GRUEN is advantageous in that it is:
• Most correlated with human judgments: It achieves the highest correlation with human judgments among metrics of linguistic quality, as demonstrated on seven datasets over four NLG tasks.
• Reference-less: Most existing evaluation metrics (e.g., ROUGE, METEOR, MoverScore (Zhao et al., 2019)) require human references for comparison. However, it is only logical to assume that the linguistic quality of a system output should be measurable from the output alone. To that end, GRUEN is designed to be reference-less and requires only the system output as its input.
• Unsupervised: Available supervised metrics (e.g., SUM-QE (Xenouleas et al., 2019)) not only require costly human judgments as supervision for each dataset, but also risk poor generalization to new datasets. In addition, they are non-deterministic due to the randomness in the training process. In contrast, GRUEN is unsupervised, training-free, and deterministic.
• General: Almost all existing metrics for evaluating the linguistic quality are task-specific (e.g., Pitler et al. (2010) and SUM-QE (Xenouleas et al., 2019) are for text summarization), whereas GRUEN is more generally applicable and performs well in various NLG task settings as we demonstrate empirically.

Related Work
The growing interest in NLG has given rise to better automatic evaluation metrics to measure the output quality. We first review the widely used metrics for NLG tasks and then discuss available metrics for evaluating linguistic quality.
Embedding-based metrics: These metrics utilize neural models to learn dense representations of words (Mikolov et al., 2013; Pennington et al., 2014) and sentences (Ng and Abrecht, 2015; Pagliardini et al., 2018; Clark et al., 2019). Task-specific metrics: Other metrics target a single aspect of a single task; for example, aspects of summarization quality are studied in Gao et al. (2020) and Mao et al. (2020), and in dialogue systems, diversity and coherence are assessed in Li et al. (2016a,b) and Dziri et al. (2019). However, these proposed metrics are not generally applicable to the evaluation of other aspects or tasks.

Evaluating Linguistic Quality
Existing metrics have focused mostly on evaluating content selection in the system output while ignoring linguistic quality. There has thus been a long-standing need for automatic measures of the linguistic quality of NLG output, despite repeated calls for further study in this direction. For instance, the Text Analysis Conference (TAC) and the Document Understanding Conference (DUC) (Dang, 2006) have motivated the need to automatically evaluate the linguistic quality of summarization since 2006. As another example, Conroy and Dang (2008) highlighted the downsides of ignoring linguistic quality while focusing on summary content during system evaluation. Additionally, the need for linguistic quality evaluation has been underscored in Dorr et al. (2011); Graham et al. (2013); Novikova et al. (2017); Way (2018); Specia and Shah (2018). The uniqueness of our study is that it addresses the need for an automatic evaluation metric of linguistic quality, enabling a more holistic evaluation of language generation systems.
Among the few existing metrics of linguistic quality available in prior studies, the early ones (Pitler et al., 2010; Kate et al., 2010) rely only on shallow syntactic linguistic features, such as part-of-speech tags, n-grams, and named entities. To better represent the generated output, the recent SUM-QE model (Xenouleas et al., 2019) encodes the system output with a BERT encoder and then adopts a linear regression model to predict the linguistic quality. It achieves state-of-the-art results and is the work most relevant to ours. However, SUM-QE is a supervised metric, which not only requires costly human judgments as input for each dataset, but also yields non-deterministic results due to the intrinsic randomness in the training process. Moreover, SUM-QE has been shown to work well only with the DUC datasets of the summarization task (Xenouleas et al., 2019), calling into question its effectiveness for other datasets and tasks. GRUEN, as an unsupervised metric, requires no additional human judgments for new datasets and is shown to be effective on seven datasets over four NLG tasks.

Proposed Metric
In this section, we describe the proposed linguistic quality metric in detail. We define the problem as follows: given a system output S with n sentences $[s_1, s_2, \ldots, s_n]$, we aim to output a holistic score $Y_S$ of its linguistic quality. We explicitly assess the system output for the four aspects in Table 1: Grammaticality, Non-redundancy, Focus, and Structure and Coherence. We leave Referential Clarity, also suggested in Dang (2006), for future work.

Grammaticality: A system output with a high grammaticality score $y_g$ is expected to be readable, fluent, and grammatically correct. Most existing works measure the sentence likelihood (or perplexity) with a language model. We, in addition, explicitly capture whether the sentence is grammatically "acceptable" or not.
We measure $y_g$ using two features: sentence likelihood and grammatical acceptability. For a system output S, we first use the Punkt sentence tokenizer (Kiss and Strunk, 2006) to extract its component sentences $s_1, s_2, \ldots, s_n$. Then, for each sentence $s_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,k})$, a sequence of words $w_{i,j}$, we measure its sentence likelihood score $l_i$ and grammatical acceptability score $g_i$ with a BERT model (Devlin et al., 2019). We choose BERT to leverage its contextual features and masked language model (MLM) objective, which can best examine word choice. However, BERT cannot be directly applied to obtain the likelihood of a sentence, as it is designed to predict the probability of a single masked word. Inspired by Wang and Cho (2019); Wang et al. (2019), we estimate $l_i$ by a unigram approximation over the words in the sentence: $l_i = \sum_{j} \log p(w_{i,j} \mid w_{i,1}, \ldots, w_{i,j-1}, w_{i,j+1}, \ldots, w_{i,k})$. With this approximation, $l_i$ can be estimated by computing the masked probability of each word. To obtain the grammatical acceptability score $g_i$, we fine-tune the BERT model on the Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2018), a dataset of 10,657 English sentences from linguistics publications labeled as grammatical or ungrammatical. Finally, the two scores $l_i$ and $g_i$ are linearly combined to examine the grammaticality of sentence $s_i$, and the final grammaticality score is obtained by averaging over all n component sentences: $y_g = \sum_i (l_i + g_i)/n$.

Non-redundancy: As shown in Dang (2006), non-redundancy refers to having no unnecessary repetition, which may take the form of whole sentences, sentence fragments, or noun phrases (e.g., "Bill Clinton") repeated across sentences when a pronoun ("he") would suffice. To calculate the non-redundancy score $y_r$, we capture repeated components using four inter-sentence syntactic features: 1) character length of the longest common substring, 2) word count of the longest common word sequence, 3) edit distance, and 4) number of common words. We compute the four features for each pair of component sentences, of which there are $\binom{n}{2}$ in total. For each sentence pair $(s_i, s_j)$, we count the number of times $m_{i,j}$ that a feature exceeds its non-redundancy penalty threshold. A penalty is triggered when, respectively: the longest common substring is longer than 80% of the shorter sentence's character length, the longest common word sequence covers more than 80% of the shorter sentence's word count, the edit distance is smaller than 60% of the longer sentence's character length, or the sentences share more than 80% of the shorter sentence's words. Finally, we set $y_r = -0.1 \times \sum_{i,j} m_{i,j}$. The non-redundancy penalty thresholds and penalty weight are learned empirically from a held-out validation set. We discuss the effectiveness of each feature in detail in Appendix B.1.
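To illustrate the likelihood feature, the sketch below estimates $l_i$ by masking one token at a time and summing the masked-token log-probabilities. This is a minimal sketch assuming the HuggingFace transformers library; the checkpoint name is our illustrative choice, and the linear combination with $g_i$ and any score normalization are omitted.

```python
# Minimal sketch of the masked-LM sentence likelihood l_i (unigram
# approximation): mask each token in turn and sum the log-probabilities
# BERT assigns to the original token at the masked position.
# Assumes torch and transformers are installed; the checkpoint and the
# absence of length normalization are illustrative, not the paper's code.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased").eval()

def sentence_log_likelihood(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for pos in range(1, len(ids) - 1):          # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[pos] = tokenizer.mask_token_id   # mask one token
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, pos]
        log_probs = torch.log_softmax(logits, dim=-1)
        total += log_probs[ids[pos]].item()     # log p(w_j | context)
    return total

print(sentence_log_likelihood("The cat sat on the mat."))
```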
Focus: Discourse focus has been widely studied, and a focused output is expected to have semantically related adjacent sentences (Walker, 1998; Knott et al., 2001; Pitler et al., 2010). We compute the focus score $y_f$ from the semantic relatedness of each pair of adjacent sentences $(s_i, s_{i+1})$. Specifically, we calculate the Word Mover Similarity $wms(s_i, s_{i+1})$ (Kusner et al., 2015). If the similarity score falls below the similarity threshold of 0.05, we impose a penalty of $-0.1$ on the focus score $y_f$. For a focused output, we expect $y_f = 0$.
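A minimal sketch of the focus penalty follows, using gensim's Word Mover's Distance. Converting the distance to a similarity via exp(-distance) is our illustrative assumption; the paper specifies the Word Mover Similarity of Kusner et al. (2015) without detailing the conversion it uses.

```python
# Minimal sketch of the adjacent-sentence focus check using Word Mover's
# Distance via gensim. The distance-to-similarity conversion below is an
# illustrative assumption, as is the choice of pretrained vectors.
import math
import gensim.downloader as api

word_vectors = api.load("glove-wiki-gigaword-100")  # any pretrained vectors

def focus_score(sentences: list[str]) -> float:
    y_f = 0.0
    for s_a, s_b in zip(sentences, sentences[1:]):   # adjacent pairs
        dist = word_vectors.wmdistance(s_a.lower().split(),
                                       s_b.lower().split())
        wms = math.exp(-dist)                        # assumed conversion
        if wms < 0.05:                               # similarity threshold
            y_f -= 0.1                               # focus penalty
    return y_f
```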
Structure and coherence: A well-structured and coherent output should contain well-organized sentences, where the sentence order is natural and easy to follow. We compute the inter-sentence coherence score $y_c$ with a self-supervised loss that focuses on modeling inter-sentence coherence, namely the Sentence-Order Prediction (SOP) loss. The SOP loss, proposed by Lan et al. (2019), has been shown to be more effective than the Next Sentence Prediction (NSP) loss in the original BERT (Devlin et al., 2019). We formulate the SOP loss calculation as follows. First, for a system output S, we extract all possible consecutive pairs of segments, i.e., $([s_1, \ldots, s_i], [s_{i+1}, \ldots, s_n])$ for $i \in \{1, 2, \ldots, n-1\}$. Then, we take two consecutive segments as a positive example, and the same two segments with their order swapped as a negative example. Finally, the SOP loss is calculated as the average of the logistic loss over all segment pairs, and the coherence score $y_c$ is the negative of the SOP loss.

Final score: The final linguistic quality score $Y_S$ is a linear combination of the above four scores: $Y_S = y_g + y_r + y_f + y_c$. The final score $Y_S$ is on a scale of 0 to 1, and all hyper-parameters are learned by maximizing the Spearman's correlation with human judgments on a held-out validation set.
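To make the coherence module above concrete, the sketch below scores a system output with ALBERT's SOP head via HuggingFace transformers. The checkpoint name and the use of AlbertForPreTraining's sop_logits are our assumptions about one workable setup, not the authors' released code; the final GRUEN score would then add this $y_c$ to $y_g$, $y_r$, and $y_f$.

```python
# Minimal sketch of scoring inter-sentence coherence with ALBERT's
# sentence-order-prediction (SOP) head: average the logistic loss over
# all consecutive segment splits and negate it to obtain y_c.
import torch
from transformers import AlbertTokenizer, AlbertForPreTraining

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model = AlbertForPreTraining.from_pretrained("albert-base-v2").eval()

def coherence_score(sentences: list[str]) -> float:
    losses = []
    for i in range(1, len(sentences)):
        seg_a = " ".join(sentences[:i])          # [s_1, ..., s_i]
        seg_b = " ".join(sentences[i:])          # [s_{i+1}, ..., s_n]
        inputs = tokenizer(seg_a, seg_b, return_tensors="pt",
                           truncation=True)
        with torch.no_grad():
            sop_logits = model(**inputs).sop_logits
        # Probability that the segments are in their original order
        # (label 0 in the ALBERT pretraining setup).
        p_ordered = torch.softmax(sop_logits, dim=-1)[0, 0]
        losses.append(-torch.log(p_ordered).item())   # logistic loss
    return -sum(losses) / max(len(losses), 1)         # y_c = -SOP loss
```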

Empirical Evaluation
In this section, we evaluate the quality of different metrics on four NLG tasks: 1) abstractive text summarization, 2) dialogue systems, 3) text simplification, and 4) text compression.

Evaluating the metrics: We assess the performance of an evaluation metric by analyzing how well it correlates with human judgments. Following existing literature, we report Spearman's correlation ρ, Kendall's correlation τ, and Pearson's correlation r. In addition, to address the correlation non-independence issue (two dependent correlations sharing one variable) (Graham and Baldwin, 2014), we report Williams' significance test (Williams, 1959), which can reveal whether one metric significantly outperforms another.

Correlation type: Existing automatic metrics tend to correlate poorly with human judgments at the instance level, although several metrics have been found to have high system-level correlations (Chaganty et al., 2018; Novikova et al., 2017; Liu et al., 2016). Instance-level correlation is critical in the sense that error analysis can be done more constructively and effectively. In this paper, we primarily analyze instance-level correlations and briefly discuss system-level correlations.

Baselines: We compare GRUEN with widely used reference-based metrics (BLEU, ROUGE, METEOR, WMD, and MoverScore) and the supervised SUM-QE.

Abstractive Text Summarization

Dataset: The CNN/Daily Mail dataset contains online news articles paired with multi-sentence summaries (3.75 sentences or 56 tokens on average). We obtain the human-annotated linguistic quality scores from Chaganty et al. (2018) and use the 2,086 system outputs from 4 neural models. Each system output has human judgments on a scale of 1-3 for the Grammar, Non-redundancy, and Overall linguistic quality of the summary, following the guidelines of the DUC summarization challenge (Dang, 2006). In addition, the dataset records the number of Post-edits needed to improve the summary quality. For all human judgments except Post-edits, higher scores indicate better quality.
The TAC-2011 dataset, from the Text Analysis Conference (TAC), contains 4,488 data instances (4.43 sentences or 94 tokens on average). It has 88 document sets, and each document set includes 4 human reference summaries and 51 summarizers. We report correlation results on the Readability score, which measures linguistic quality according to the guidelines in Dang (2006).

Results: Instance-level correlation scores are summarized in Table 2. As expected, all baseline approaches except SUM-QE perform poorly because they do not aim to measure linguistic quality explicitly. We note that most of the baselines are highly unstable (and not robust) across the different datasets. For instance, BLEU performs relatively well on TAC-2011 but poorly on CNN/Daily Mail, while WMD performs relatively well on CNN/Daily Mail but poorly on TAC-2011. GRUEN outperforms SUM-QE on all aspects except the Grammar score of CNN/Daily Mail, where the two have comparable performance. We performed a set of Williams' tests for the significance of the differences in performance between GRUEN and SUM-QE for each linguistic score and each correlation type, and found the differences significant (p < 0.01) in all cases except the Grammar score of CNN/Daily Mail, as shown in Table 8 in the Appendix.
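For reference, the three instance-level correlations used throughout can be computed directly with scipy; the scores below are illustrative placeholders, not data from the paper.

```python
# Minimal sketch of the instance-level correlations used throughout:
# Spearman's rho, Kendall's tau, and Pearson's r between a metric's
# scores and human judgments.
from scipy.stats import spearmanr, kendalltau, pearsonr

metric_scores = [0.91, 0.73, 0.55, 0.88, 0.42]   # e.g., GRUEN per instance
human_scores  = [3, 2, 1, 3, 2]                  # e.g., Overall (1-3 scale)

rho, _ = spearmanr(metric_scores, human_scores)
tau, _ = kendalltau(metric_scores, human_scores)
r, _   = pearsonr(metric_scores, human_scores)
print(f"rho={rho:.3f} tau={tau:.3f} r={r:.3f}")
```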

Dialogue System
Dataset: We use three task-oriented dialogue system datasets: BAGEL (Mairesse et al., 2010), SFHOTEL (Wen et al., 2015), and SFREST (Wen et al., 2015), which contain 404, 875, and 1,181 instances, respectively. Each system output receives Naturalness and Quality scores (Novikova et al., 2017). Naturalness measures how likely it is that a system utterance was produced by a native speaker. Quality measures how well a system utterance captures fluency and grammar.

Results (Table 3): GRUEN outperforms all other metrics by a significant margin. Interestingly, no metric except GRUEN achieves even a moderate correlation with human judgments, regardless of the dataset or the aspect of human judgment.

Text Simplification
Dataset: We use a benchmark text simplification dataset with 350 data instances, where each instance has one system output and eight human references (Xu et al., 2016).

Text Compression
Dataset: We use the text compression dataset collected in Toutanova et al. (2016). It has 2,955 instances generated by four machine learning systems, and each system output receives a human-assigned Grammar score.
Results (Table 5): We notice that GRUEN outperforms all the other metrics by a significant margin.

Discussion
We conduct the discussion primarily on the text summarization task, since its human judgments allow GRUEN to be assessed against multiple of the linguistic quality dimensions in Table 1.

Ablation study
The results of the ablation analysis (Figure 1) show the effectiveness of G (the Grammaticality module alone), GU (the Grammaticality+focUs modules), and GRU (the Grammaticality+non-Redundancy+focUs modules) on the summarization output of the CNN/Daily Mail dataset. We make the following three observations: 1) The Grammar score is largely accounted for by our grammaticality module, and only marginally by the others; 2) The focus and non-redundancy modules of GRUEN more directly target the Post-edits and Non-redundancy aspects of linguistic quality; 3) The structure and coherence module does not yield significant improvement on any of the linguistic quality dimensions. One possible reason is that structure and coherence is a high-level feature that is difficult to capture, not only for models but also for human annotators. Please refer to Table 6 for an example of a system output with poor structure and coherence.

(Figure caption fragment: the right panel shows the scattered Post-edits score distribution, which is negatively correlated with output quality; the dotted line is a regression line reflecting the Pearson's correlation r.)

Alignment with Rating Scale
We compared the scores of ROUGE-2, MoverScore, SUM-QE, and GRUEN with human judgments on outputs of different quality, as shown in Figure 2. We observe that existing automatic metrics correlate better with human ratings at the lower end of the rating scale than in the middle or high end. In contrast, GRUEN is particularly good at distinguishing high-end cases, i.e., system outputs that are rated as good by the human judges.

Figure 3 shows how the Spearman's correlation of each metric varies with the number of human references in the text simplification dataset. It is clear that existing reference-based metrics show better performance with more human references. One possible reason is that the system outputs are compared against more allowable grammatical and semantic variations. These allowable variations could make the reference-based metrics better at distinguishing high-end cases, alleviating the shortcoming discussed in Section 5.2, and thus allowing the metrics to perform well. However, in most cases, it is expensive to collect multiple human references for each instance. In Tables 10-11 in the Appendix, we further analyze how non-redundancy is captured by each of the inter-sentence syntactic features, and also present a comparative study for each linguistic dimension.

(Excerpts from Table 6: examples of system outputs with poor linguistic quality.)

(c) Non-redundancy: The brutal murder of Farkhunda, a young woman in Afghanistan, was burnt and callously chucked into a river in Kabul. The brutal murder of Farkhunda, a young woman in Afghanistan became pallbearers.
Unnecessary repetition (underlined), which can be avoided by using a pronoun (i.e., she). The large overlap between the two sentences is captured by the inter-sentence syntactic features.

(d) Focus: The FDA's Nonprescription Drugs Advisory Committee will meet Oct. Infant cough-and-cold products were approved decades ago without adequate testing in children because experts assumed that children were simply small adults, and that drugs approved for adults must also work in children. Ian Paul, an assistant professor of pediatrics at Penn State College of Medicine who has studied the medicines.
The component sentences are scattered, of different themes, or even irrelevant to each other. The sentence embedding similarity of each pair of adjacent sentences is low, resulting in a low Focus score.

(e) Structure and Coherence: Firefighters worked with police and ambulance staff to free the boy, whose leg was trapped for more than half an hour down the hole. It is believed the rubber drain cover had been kicked out of position and within hours, the accident occurred. A 12-year-old schoolboy needed to be rescued after falling down an eight-foot drain in Peterborough.
The output is only a heap of related information, where the component sentences are in an unorganized, wrong, or incomprehensible order. Its structure and readability would be much improved if the three component sentences were in the order 3, 1, 2.

System-level Correlation
Our results have shown that GRUEN improves the instance-level correlation from poor to moderate. At the system level, too, we observe significant improvements in correlation. Table 7 shows the system-level linguistic quality correlation scores for Readability on the TAC-2011 dataset, which consists of 51 systems (i.e., summarizers). At the system level, most baseline metrics have moderate correlations, which aligns with the findings in Chaganty et al. (2018), while GRUEN achieves a high correlation. We do not further study system-level correlations on the other datasets, since each has no more than four systems, making the correlations not meaningful for comparison.

Limitations and Future Work
GRUEN evaluates non-redundancy by looking for lexical overlap across sentences. However, semantically redundant components expressed in different surface forms remain unexamined. Moreover, GRUEN does not handle intra-sentence redundancy, such as "In 2012, Spain won the European Championships for a second time in 2012." Another challenging problem is evaluating referential clarity, as proposed in Dang (2006), which is particularly important for long sentences and multi-sentence outputs. Future work should aim for a more comprehensive evaluation of redundancy and tackle the referential clarity challenge.

Conclusion
We proposed GRUEN to evaluate the Grammaticality, non-Redundancy, focUs, and structure and coherENce of generated text. GRUEN is reference-less, unsupervised, and deterministic, and experiments on seven datasets over four NLG tasks show that it correlates highly with human judgments.

A Quantitative Analysis
A.1 Williams' Significance Test

In Table 8, we perform Williams' significance tests on GRUEN against the best baseline for each linguistic score and each correlation measure (e.g., SUM-QE for ρ on the Overall score of the CNN/Daily Mail dataset, METEOR for r on the Grammar score of the dataset in Xu et al. (2016)). We found that the differences are significant (p < 0.0001) in 24 out of 39 cases.
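For completeness, one standard formulation of the Williams test for two dependent correlations $r_{12}$ (metric A vs. human) and $r_{13}$ (metric B vs. human) sharing the human variable, with inter-metric correlation $r_{23}$ and sample size $n$, is the following (our addition from standard statistical references, not reproduced from the paper):

$$t(n-3) = \frac{(r_{12} - r_{13})\sqrt{(n-1)(1+r_{23})}}{\sqrt{2K\frac{n-1}{n-3} + \frac{(r_{12}+r_{13})^2}{4}(1-r_{23})^3}}, \qquad K = 1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2\,r_{12}\,r_{13}\,r_{23}.$$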

A.2 Performance on Reliable Instances
In the human annotation process, each instance receives a score that aggregates multiple annotators' ratings. Given the subjective nature of annotating linguistic quality, there are instances where annotators disagree. To analyze how we perform on reliably coded instances, we show in Table 9 the correlation scores on the instances where all annotators agreed perfectly on the Overall score for the CNN/Daily Mail dataset (N = 1323). We observe that GRUEN consistently outperforms the baselines on these reliable data instances. Importantly, GRUEN and SUM-QE correlate better with human judgments on the reliable instances than on all instances.

Table 3 showed an extremely poor correlation with human ratings for the baseline metrics on the BAGEL, SFHOTEL, and SFREST datasets. Novikova et al. (2017) hypothesize that the reason is the unbalanced label distribution: the majority of system outputs are good for Naturalness (64%) and Quality (58%), whereas bad examples amount to only 7% in total. Our discussion in Section 5.2 further explains the reason: existing metrics are bad at assigning high scores to good outputs and thus correlate very poorly on such datasets with mostly good examples. In contrast, GRUEN is capable of assigning high scores to good outputs and thus achieves decent correlation results.

While our correlation results may appear to differ slightly from Table 3 in Novikova et al. (2017), they are in fact the same; the only difference is the presentation format. Novikova et al. (2017) present only the best correlation results for each dataset (i.e., BAGEL, SFHOTEL, and SFREST) and each NLG system (i.e., TGEN, LOLS, and RNNLG), while we present the average correlation score for each dataset. Therefore, in Table 3 of Novikova et al. (2017), a correlation metric that performs well on one NLG system need not perform equally well on another. As an example of measuring Informativeness, BLEU-1 performs well on the TGEN system for the BAGEL dataset but poorly on the LOLS system for the same dataset; consequently, BLEU-1 has only a mediocre correlation score for Informativeness on the BAGEL dataset, as presented in our results. The analysis in Novikova et al. (2017) is more focused in that it analyzes different metrics in a more restricted manner, whereas our analysis is more general in that we compare correlation scores regardless of which NLG system produced the output.

B.1 Analysis on Non-redundancy
To evaluate the non-redundancy score $y_r$ of a system output, we capture repeated components of a pair of sentences with four empirical inter-sentence syntactic features: (A) length of the longest common substring, (B) length of the longest common word sequence, (C) edit distance, and (D) number of common words. Features (A) and (B) focus on continuous word overlap between a pair of sentences. Intuitively, when most characters of a sentence already appear in the other sentence, the system output should probably receive a poor non-redundancy score. However, features (A) and (B) fail to make a quality evaluation when the repeated components are in an inflected form (e.g., requiring stemming or lemmatization to match) or are not contiguous. To account for this limitation, we introduce features (C) and (D), which measure the edit distance and the number of common words, respectively.

(Fragment of Table 11: (g) Coherence and Structure, a poorly ordered output with $y_c = -0.1$: Firefighters worked with police and ambulance staff to free the boy, whose leg was trapped for more than half an hour down the hole. It is believed the rubber drain cover had been kicked out of position and within hours, the accident occurred. A 12-year-old schoolboy needed to be rescued after falling down an eight-foot drain in Peterborough. (h) Coherence and Structure, the same sentences in their natural order: A 12-year-old schoolboy needed to be rescued after falling down an eight-foot drain in Peterborough. Firefighters worked with police and ambulance staff to free the boy, whose leg was trapped for more than half an hour down the hole. It is believed the rubber drain cover had been kicked out of position and within hours, the accident occurred.)
To gain more intuition, we present a few examples of poor non-redundancy in Table 10. The features that contribute to the non-redundancy penalty are labeled on the right. Case (1) has two almost identical sentences and is therefore captured by all four features. However, when the word lengths of the two sentences differ greatly, feature (C) is no longer effective, as shown in case (2). In case (3), where the word overlap is not contiguous (i.e., "The monkey took a" and "bunch of bananas on the"), the redundancy can only be detected by features (C) and (D). In case (4), the repeated words are in an inflected form and thus cannot be captured by the exact word-matching features (i.e., features (A), (B), and (D)). As such, the four features complement each other and together aim to capture non-redundancy well.
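A minimal sketch of these four features and the per-pair penalty count is given below. The thresholds follow the paper (80%/80%/60%/80%), while the helper implementations are our own illustrative choices, not the released GRUEN code.

```python
# Minimal sketch of the four inter-sentence redundancy features (A)-(D)
# and the penalty count m for one sentence pair.
from difflib import SequenceMatcher

def edit_distance(a: str, b: str) -> int:
    # Standard Levenshtein distance via dynamic programming.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def redundancy_penalties(s_i: str, s_j: str) -> int:
    w_i, w_j = s_i.split(), s_j.split()
    shorter_chars = min(len(s_i), len(s_j))
    longer_chars = max(len(s_i), len(s_j))
    shorter_words = min(len(w_i), len(w_j))
    m = 0
    # (A) longest common substring vs. shorter sentence's char length
    lcs = SequenceMatcher(None, s_i, s_j).find_longest_match(
        0, len(s_i), 0, len(s_j)).size
    m += lcs > 0.8 * shorter_chars
    # (B) longest contiguous common word sequence vs. shorter word count
    lcw = SequenceMatcher(None, w_i, w_j).find_longest_match(
        0, len(w_i), 0, len(w_j)).size
    m += lcw > 0.8 * shorter_words
    # (C) edit distance vs. longer sentence's length (small = redundant)
    m += edit_distance(s_i, s_j) < 0.6 * longer_chars
    # (D) number of common words vs. shorter word count
    m += len(set(w_i) & set(w_j)) > 0.8 * shorter_words
    return m

# Example: two near-identical sentences trigger all four penalties.
print(redundancy_penalties("The cat sat on the mat .",
                           "The cat sat on the mat today ."))
```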

C Complete Results
We present the complete results of BLEU, ROUGE, and WMD for all tasks in Tables 12-15.