Sentence-Level Fluency Evaluation: References Help, But Can Be Spared!

Motivated by recent findings on the probabilistic modeling of acceptability judgments, we propose syntactic log-odds ratio (SLOR), a normalized language model score, as a metric for referenceless fluency evaluation of natural language generation output at the sentence level. We further introduce WPSLOR, a novel WordPiece-based version, which harnesses a more compact language model. Even though word-overlap metrics like ROUGE are computed with the help of hand-written references, our referenceless methods obtain a significantly higher correlation with human fluency scores on a benchmark dataset of compressed sentences. Finally, we present ROUGE-LM, a reference-based metric which is a natural extension of WPSLOR to the case of available references. We show that ROUGE-LM yields a significantly higher correlation with human judgments than all baseline metrics, including WPSLOR on its own.


Introduction
Producing sentences which are perceived as natural by a human addressee, a property which we will denote as fluency throughout this paper (alternative names include naturalness, grammaticality, or readability; the exact definitions of these terms vary slightly throughout the literature), is a crucial goal of all natural language generation (NLG) systems: it makes interactions more natural, avoids misunderstandings and, overall, leads to higher user satisfaction and user trust (Martindale and Carpuat, 2018). Thus, fluency evaluation is important, e.g., during system development, or for filtering unacceptable generations at application time. However, fluency evaluation of NLG systems constitutes a hard challenge: systems are often not limited to reusing words from the input, but can generate in an abstractive way. Hence, it is not guaranteed that a correct output will match any of a finite number of given references. This results in difficulties for current reference-based evaluation, especially of fluency, causing word-overlap metrics like ROUGE (Lin and Och, 2004) to correlate only weakly with human judgments (Toutanova et al., 2016). As a result, fluency evaluation of NLG is often done manually, which is costly and time-consuming.
* This research was carried out while the first author was interning at Google.
Table 1: Example compressions with their fluency scores.
Fluency  Compression
3        If access to a synonym dictionary is likely to be of use, then this package may be of service.
1.6      Participants are invited to submit a set pair do domain name that is already taken along with alternative.
1        Even $15 was The HSUS.
Evaluating sentences on their fluency, on the other hand, is a linguistic ability of humans which has been the subject of a decade-long debate in cognitive science. In particular, the question has been raised whether the grammatical knowledge that underlies this ability is probabilistic or categorical in nature (Chomsky, 1957; Manning, 2003; Sprouse, 2007). Within this context, Lau et al. (2017) have recently shown that neural language models (LMs) can be used for modeling human ratings of acceptability. Namely, they found SLOR (Pauls and Klein, 2012), i.e., sentence log-probability normalized by unigram log-probability and sentence length, to correlate well with acceptability judgments at the sentence level.
However, to the best of our knowledge, these insights have so far been disregarded by the natural language processing (NLP) community. In this paper, we investigate the practical implications of Lau et al. (2017)'s findings for fluency evaluation of NLG, using the task of automatic compression (Knight and Marcu, 2000; McDonald, 2006) as an example (cf. Table 1). Specifically, we test our hypothesis that SLOR should be a suitable metric for evaluation of compression fluency which (i) does not rely on references; (ii) can naturally be applied at the sentence level (in contrast to the system level); and (iii) does not need human fluency annotations of any kind. In particular the first aspect, i.e., SLOR not needing references, makes it a promising candidate for automatic evaluation. Getting rid of human references has practical importance in a variety of settings, e.g., if references are unavailable due to a lack of resources for annotation, or if obtaining references is impracticable. The latter would be the case, for instance, when filtering system outputs at application time.
We further introduce WPSLOR, a novel, WordPiece (Wu et al., 2016)-based version of SLOR, which drastically reduces model size and training time. Our experiments show that both approaches correlate better with human judgments than traditional word-overlap metrics, even though the latter do rely on reference compressions. Finally, investigating the case of available references and how to incorporate them, we combine WPSLOR and ROUGE to ROUGE-LM, a novel reference-based metric, and increase the correlation with human fluency ratings even further.
Contributions. To summarize, we make the following contributions:
1. We empirically show that SLOR is a good referenceless metric for the evaluation of NLG fluency at the sentence level.
2. We introduce WPSLOR, a WordPiece-based version of SLOR, which relies on a more compact LM without a significant loss of performance.
3. We propose ROUGE-LM, a reference-based metric, which achieves a significantly higher correlation with human fluency judgments than all other metrics in our experiments.

On Acceptability
Acceptability judgments, i.e., speakers' judgments of the well-formedness of sentences, have been the basis of much linguistics research (Chomsky, 1964; Schütze, 1996): a speaker's intuition about a sentence is used to draw conclusions about a language's rules. Commonly, "acceptability" is used synonymously with "grammaticality", and speakers are in practice asked for grammaticality judgments or acceptability judgments interchangeably. Strictly speaking, however, a sentence can be unacceptable even though it is grammatical; a popular example is Chomsky's phrase "Colorless green ideas sleep furiously." (Chomsky, 1957). In turn, acceptable sentences can be ungrammatical, e.g., in an informal context or in poems (Newmeyer, 1983).
Scientists (linguists, cognitive scientists, psychologists, and NLP researchers alike) disagree about how to represent human linguistic abilities. One subject of debate is acceptability judgments: while, for many, acceptability is a binary condition on membership in a set of well-formed sentences (Chomsky, 1957), others assume that it is gradient in nature (Heilman et al., 2014; Toutanova et al., 2016). Tackling this research question, Lau et al. (2017) aimed at modeling human acceptability judgments automatically, with the goal of gaining insight into the nature of human perception of acceptability. In particular, they tried to answer the question: Do humans judge acceptability on a gradient scale? Their experiments showed a strong correlation between human judgments and normalized sentence log-probabilities under a variety of LMs for artificial data they had created by translating and back-translating sentences with neural models. While they tried different types of LMs, the best results were obtained for neural models, namely recurrent neural networks (RNNs).
In this work, we investigate if approaches which have proven successful for modeling acceptability can be applied to the NLP problem of automatic fluency evaluation.

Method
In this section, we first describe SLOR and the intuition behind this score. Then, we introduce WordPieces, before explaining how we combine the two.

SLOR
SLOR assigns to a sentence S a score which consists of its log-probability under a given LM, normalized by unigram log-probability and length:
$$\mathrm{SLOR}(S) = \frac{1}{|S|} \left( \ln p_M(S) - \ln p_u(S) \right),$$
where p_M(S) is the probability assigned to the sentence under the LM and |S| is the length of the sentence in tokens. The unigram probability p_u(S) of the sentence is calculated as
$$p_u(S) = \prod_{t \in S} p(t),$$
with p(t) being the unconditional probability of a token t, i.e., given no context. The intuition behind subtracting unigram log-probabilities is that a token which is rare on its own (in contrast to being rare at a given position in the sentence) should not bring down the sentence's rating. The normalization by sentence length is necessary in order to not prefer shorter sentences over equally fluent longer ones. Consider, for instance, the following pair of sentences:
(i) He is a citizen of France.
(ii) He is a citizen of Tuvalu.
Given that both sentences are of equal length and assuming that France appears more often in a given LM training set than Tuvalu, the length-normalized log-probability of sentence (i) under the LM would most likely be higher than that of sentence (ii). However, since both sentences are equally fluent, we expect taking each token's unigram probability into account to lead to a more suitable score for our purposes.
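To make the France/Tuvalu intuition concrete, the following minimal Python sketch computes SLOR as the difference between the LM log-probability and the unigram log-probability of a sentence, divided by its length in tokens. All numbers are made up for illustration and the function name slor is hypothetical; in practice, the per-token log-probabilities would come from a trained LM and from unigram counts over the same corpus.

```python
def slor(token_logprobs, unigram_logprobs):
    """SLOR(S) = (ln p_M(S) - ln p_u(S)) / |S|, from per-token natural-log probabilities."""
    if not token_logprobs or len(token_logprobs) != len(unigram_logprobs):
        raise ValueError("need one LM and one unigram log-probability per token")
    lm_logprob = sum(token_logprobs)         # ln p_M(S)
    unigram_logprob = sum(unigram_logprobs)  # ln p_u(S) = sum over tokens of ln p(t)
    return (lm_logprob - unigram_logprob) / len(token_logprobs)


# Hypothetical log-probabilities for the shared prefix "He is a citizen of".
prefix_lm = [-2.0, -1.0, -1.5, -3.0, -0.5]
prefix_unigram = [-4.0, -3.0, -2.5, -9.0, -2.0]

# Assumption: "Tuvalu" is rarer than "France" both in context and as a unigram.
france_lm, france_unigram = -4.0, -8.0
tuvalu_lm, tuvalu_unigram = -9.0, -13.0

slor_france = slor(prefix_lm + [france_lm], prefix_unigram + [france_unigram])
slor_tuvalu = slor(prefix_lm + [tuvalu_lm], prefix_unigram + [tuvalu_unigram])

# Length-normalized log-probability alone penalizes the rare word.
norm_logprob_france = sum(prefix_lm + [france_lm]) / 6
norm_logprob_tuvalu = sum(prefix_lm + [tuvalu_lm]) / 6
```

With these illustrative numbers, the length-normalized log-probability of the Tuvalu sentence is lower than that of the France sentence, while both sentences receive the same SLOR score, because the lower contextual probability of the rare token is offset by its lower unigram probability.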
We calculate the probability of a sentence with a long short-term memory (LSTM, Hochreiter and Schmidhuber (1997)) LM, i.e., a special type of RNN LM, which has been trained on a large corpus. More details on LSTM LMs can be found, e.g., in Sundermeyer et al. (2012). The unigram probabilities for SLOR are estimated using the same corpus.
Table 2: Average fluency ratings for each compression system in the dataset by Toutanova et al. (2016).
          ILP    NAMAS  SEQ2SEQ  T3
fluency   2.22   1.30   1.51     1.40

WordPieces
Sub-word units like WordPieces (Wu et al., 2016) are getting increasingly important in NLP. They constitute a compromise between characters and words: On the one hand, they yield a smaller vocabulary, which reduces model size and training time, and improve handling of rare words, since those are partitioned into more frequent segments.
On the other hand, they contain more information than characters. WordPiece models are estimated using a data-driven approach which maximizes the LM likelihood of the training corpus as described in Wu et al. (2016) and Schuster and Nakajima (2012).

WPSLOR
We propose a novel version of SLOR by incorporating an LM which is trained on a corpus that has been split by a WordPiece model. This leads to a smaller vocabulary, resulting in an LM with fewer parameters, which is faster to train (around 12h compared to roughly 5 days for the word-based version in our experiments). We will refer to the word-based SLOR as WordSLOR and to our newly proposed WordPiece-based version as WPSLOR.

Experiment
Now, we present our main experiment, in which we assess the performances of WordSLOR and WPSLOR as fluency evaluation metrics.

Dataset
We experiment on the compression dataset by Toutanova et al. (2016). It contains single sentences and two-sentence paragraphs from the Open American National Corpus (OANC), which belong to 4 genres: newswire, letters, journal, and non-fiction. Gold references are manually created and the outputs of 4 compression systems (ILP (extractive), NAMAS (abstractive), SEQ2SEQ (extractive), and T3 (abstractive); cf. Toutanova et al. (2016) for details) for the test data are provided. Each example has 3 to 5 independent human ratings for content and fluency. We are interested in the latter, which is rated on an ordinal scale from 1 (disfluent) through 3 (fluent). We experiment on the 2955 system outputs for the test split.
Average fluency scores per system are shown in Table 2. As can be seen, ILP produces the best output. In contrast, NAMAS is the worst system for fluency. In order to be able to judge the reliability of the human annotations, we follow the procedure suggested by Pavlick and Tetreault (2016) and used by Toutanova et al. (2016), and compute the quadratic weighted κ (Cohen, 1968) for the human fluency scores of the system-generated compressions as 0.337.
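As a rough illustration of the agreement statistic mentioned above, the sketch below computes Cohen's quadratic weighted κ for two raters on the 1 to 3 fluency scale. It is not the procedure of Pavlick and Tetreault (2016) for examples with 3 to 5 raters, which is not spelled out here; the function name and the two-rater restriction are assumptions made for illustration.

```python
def quadratic_weighted_kappa(ratings_a, ratings_b, categories=(1, 2, 3)):
    """Cohen's weighted kappa with quadratic weights for two raters (illustrative sketch)."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(ratings_a)

    # Observed confusion matrix between the two raters.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1

    row_totals = [sum(observed[i]) for i in range(k)]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement weight
            weighted_observed += weight * observed[i][j]
            # Expected count under independence of the two raters.
            weighted_expected += weight * row_totals[i] * col_totals[j] / n

    if weighted_expected == 0:
        return float("nan")  # undefined if both raters always use one and the same category
    return 1.0 - weighted_observed / weighted_expected
```

A value of 1 corresponds to perfect agreement, 0 to chance-level agreement, and negative values to systematic disagreement.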

LM Hyperparameters and Training
We train our LSTM LMs on the English Gigaword corpus (Parker et al., 2011), which consists of news data.
The hyperparameters of all LMs are tuned using perplexity on a held-out part of Gigaword, since we expect LM perplexity and final evaluation performance of WordSLOR and, respectively, WPSLOR to correlate. Our best networks consist of two layers with 512 hidden units each, and are trained for 2,000,000 steps with a minibatch size of 128. For optimization, we employ ADAM (Kingma and Ba, 2014).

Baseline Metrics
ROUGE-L. Our first baseline is ROUGE-L (Lin and Och, 2004), since it is the most commonly used metric for compression tasks. ROUGE-L measures the similarity of two sentences based on their longest common subsequence. Generated and reference compressions are tokenized and lowercased. For multiple references, we only make use of the one with the highest score for each example.
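The longest-common-subsequence idea behind ROUGE-L can be sketched as follows. This is a simplified illustration, not the official ROUGE implementation: whitespace tokenization, the balanced F-measure, and the function names are assumptions, and the exact ROUGE configuration used in the experiments is not specified here.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """LCS-based F-measure between a generated and a reference compression."""
    cand = candidate.lower().split()  # assumption: lowercase + whitespace tokenization
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def rouge_l_multi(candidate, references):
    """With multiple references, keep only the highest-scoring one."""
    return max(rouge_l_f1(candidate, ref) for ref in references)
```

For example, rouge_l_multi("The cat sat on mat", ["A dog barked", "The cat is on the mat"]) scores the candidate against both references and returns the higher of the two values.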
N-gram-overlap metrics. We compare to the best n-gram-overlap metrics from Toutanova et al. (2016): combinations of linguistic units (bi-grams (LR2) and tri-grams (LR3)) and scoring measures (recall (R) and F-score (F)). With multiple references, we consider the union of the sets of n-grams. Again, generated and reference compressions are tokenized and lowercased.
Negative cross-entropy. We further compare to the negative LM cross-entropy, i.e., the log-probability which is only normalized by sentence length. The score of a sentence S is calculated as
$$\mathrm{NCE}(S) = \frac{1}{|S|} \ln p_M(S),$$
with p_M(S) being the probability assigned to the sentence by an LM. We employ the same LMs as for SLOR, i.e., LMs trained on words (WordNCE) and WordPieces (WPNCE).
Perplexity. Our next baseline is perplexity, which corresponds to the exponentiated cross-entropy:
$$\mathrm{PPL}(S) = \exp\left(-\mathrm{NCE}(S)\right).$$
About BLEU. Due to its popularity, we also performed initial experiments with BLEU (Papineni et al., 2002). Its correlation with human scores was so low that we do not consider it in our final experiments.
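The two LM-only baselines differ from SLOR only in that no unigram term is subtracted. A minimal sketch, assuming per-token natural-log probabilities from an LM are already available (the function names are hypothetical):

```python
import math


def negative_cross_entropy(token_logprobs):
    """NCE(S): sentence log-probability normalized by sentence length only."""
    if not token_logprobs:
        raise ValueError("sentence must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def perplexity(token_logprobs):
    """PPL(S): exponentiated cross-entropy, i.e., exp(-NCE(S))."""
    return math.exp(-negative_cross_entropy(token_logprobs))


# Illustrative input: four tokens, each with (hypothetical) probability 0.5.
example_logprobs = [math.log(0.5)] * 4
example_nce = negative_cross_entropy(example_logprobs)
example_ppl = perplexity(example_logprobs)
```

Higher NCE indicates a more probable sentence, whereas higher perplexity indicates a less probable one, so the two scores rank sentences in opposite order.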

Correlation and Evaluation Scores
Pearson correlation. Following earlier work (Toutanova et al., 2016), we evaluate our metrics using Pearson correlation with human judgments. It is defined as the covariance divided by the product of the standard deviations:
$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}.$$
Mean squared error. Pearson cannot accurately judge a metric's performance for sentences of very similar quality, i.e., in the extreme case of rating outputs of identical quality, the correlation is either not defined or 0, caused by noise of the evaluation model. Thus, we additionally evaluate using mean squared error (MSE), which is defined as the squares of residuals after a linear transformation, divided by the sample size:
$$\mathrm{MSE}_{X,Y} = \min_{f} \frac{1}{|X|} \sum_{i=1}^{|X|} \left( f(x_i) - y_i \right)^2,$$
with f being a linear function. Note that, since MSE is invariant to linear transformations of X but not of Y, it is a non-symmetric quasi-metric. We apply it with Y being the human ratings. An additional advantage as compared to Pearson is that it has an interpretable meaning: the expected error made by a given metric as compared to the human rating.
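Both evaluation scores can be sketched in a few lines of plain Python. The sketch assumes that the linear function used for MSE is the ordinary least-squares fit of the human ratings Y on the metric scores X; the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / math.sqrt(var_x * var_y)


def mse_after_linear_fit(x, y):
    """Mean squared residual of y after the best linear transformation of x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = cov / var_x if var_x > 0 else 0.0  # least-squares slope
    intercept = mean_y - slope * mean_x
    return sum((slope * a + intercept - b) ** 2 for a, b in zip(x, y)) / n


metric_scores = [1.0, 2.0, 3.0, 4.0]   # hypothetical metric outputs (X)
human_ratings = [1.0, 3.0, 2.0, 4.0]   # hypothetical human fluency ratings (Y)
r = pearson(metric_scores, human_ratings)
err = mse_after_linear_fit(metric_scores, human_ratings)
```

Rescaling the metric scores, e.g., to 10x + 3, leaves the MSE unchanged, whereas rescaling the human ratings changes it, which illustrates the asymmetry noted above.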

Results and Discussion
As shown in Table 3, WordSLOR and WPSLOR correlate best with human judgments: WordSLOR (respectively WPSLOR) has a 0.025 (respectively 0.008) higher Pearson correlation than the best word-overlap metric ROUGE-L-mult, even though the latter requires multiple reference compressions. Furthermore, if we consider a setting with a single given reference (ROUGE-L-single), the distance to WordSLOR increases to 0.048 for Pearson correlation. Note that, since having a single reference is very common, this result is highly relevant for practical applications. Considering MSE, the top two metrics are still WordSLOR and WPSLOR, with a 0.008 and, respectively, 0.002 lower error than the third best metric, ROUGE-L-mult. Comparing WordSLOR and WPSLOR, we find no significant differences: 0.017 for Pearson and 0.006 for MSE. However, WPSLOR uses a more compact LM and, hence, has a shorter training time, since the vocabulary is smaller (16,000 vs. 128,000 tokens).
Next, we find that WordNCE and WPNCE perform roughly on par with word-overlap metrics. This is interesting, since they, in contrast to traditional metrics, do not require reference compressions. However, their correlation with human fluency judgments is strictly lower than that of their respective SLOR counterparts. The difference between WordSLOR and WordNCE is bigger than that between WPSLOR and WPNCE. This might be due to accounting for differences in frequencies being more important for words than for WordPieces. Both WordPPL and WPPPL clearly underperform as compared to all other metrics in our experiments.
Table note: *Significantly worse than best (bold) result with p < 0.05; one-tailed; Fisher-Z-transformation for Pearson, two-sample t-test for MSE.
The traditional word-overlap metrics all perform similarly. ROUGE-L-mult and LR2-F-mult are best and worst, respectively.

Analysis I: Fluency Evaluation per Compression System
The results per compression system (cf. Table 4) look different from the correlations in Table 3: Pearson and MSE are both lower. This is due to the outputs of each given system being of comparable quality. Therefore, the datapoints are similar and, thus, easier to fit for the linear function used for MSE. Pearson, in contrast, is lower due to its invariance to linear transformations of both variables. Note that this effect is smallest for ILP, which has uniformly distributed targets (Var(Y ) = 0.35 vs. Var(Y ) = 0.17 for SEQ2SEQ).
Comparing the metrics, the two SLOR approaches perform best for SEQ2SEQ and T3. In particular, they outperform the best word-overlap metric baseline by 0.244 and 0.097 Pearson correlation as well as 0.012 and 0.012 MSE, respectively. Since T3 is an abstractive system, we can conclude that WordSLOR and WPSLOR are applicable even for systems that are not limited to making use of a fixed repertoire of words.
For ILP and NAMAS, word-overlap metrics obtain the best results. The differences in performance, however, are much smaller than for SEQ2SEQ, with a maximum difference of 0.072 in Pearson correlation, obtained for ILP. Thus, while the differences are significant, word-overlap metrics do not outperform our SLOR approaches by a wide margin. Recall, additionally, that word-overlap metrics rely on references being available, while our proposed approaches do not require this.

Analysis II: Fluency Evaluation per Domain
Looking next at the correlations for all models but different domains (cf.  Next, we focus on an important question: How much does the performance of our SLOR-based metrics depend on the domain, given that the respective LMs are trained on Gigaword, which consists of news data?
Comparing the evaluation performance for individual metrics, we observe that, except for letters, WordSLOR and WPSLOR perform best across all domains: they outperform the best word-overlap metric by at least 0.019 and at most 0.051 Pearson correlation, and at least 0.004 and at most 0.014 MSE. The biggest difference in correlation is achieved for the journal domain. Thus, clearly even LMs which have been trained on out-of-domain data obtain competitive performance for fluency evaluation. However, a domain-specific LM might additionally improve the metrics' correlation with human judgments. We leave a more detailed analysis of the importance of the training data's domain for future work.

Incorporation of Given References
ROUGE was shown to correlate well with ratings of a generated text's content or meaning at the sentence level (Toutanova et al., 2016). We further expect content and fluency ratings to be correlated. In fact, sometimes it is difficult to distinguish which one is problematic: to illustrate this, we show some extreme examples in Table 6, namely compressions which got the highest fluency rating and the lowest possible content rating by at least one rater, but the lowest fluency score and the highest content score by another. We, thus, hypothesize that ROUGE should contain information about fluency which is complementary to SLOR, and want to make use of references for fluency evaluation, if available. In this section, we experiment with two reference-based metrics: one that is trainable, and one that can be used without fluency annotations, i.e., in the same settings as pure word-overlap metrics.

Experimental Setup
First, we assume a setting in which we have the following available: (i) system outputs whose fluency is to be evaluated, (ii) reference generations for evaluating system outputs, (iii) a small set of system outputs with references, which has been annotated for fluency by human raters, and (iv) a large unlabeled corpus for training an LM. Note that fluency annotations are often unavailable in real-world scenarios; the reason we use them is that they allow for a proof of concept. In this setting, we train scikit-learn's (Pedregosa et al., 2011) support vector regression model (SVR) with the default parameters on predicting fluency, given WPSLOR and ROUGE-L-mult. We use 500 of our total 2955 examples for each of training and development, and the remaining 1955 for testing.
Second, we simulate a setting in which we have only access to (i) system outputs which should be evaluated on fluency, (ii) reference compressions, and (iii) large amounts of unlabeled text. In particular, we assume to not have fluency ratings for system outputs, which makes training a regression model impossible. Note that this is the standard setting in which word-overlap metrics are applied. Under these conditions, we propose to normalize both given scores by mean and variance, and to simply add them up. We call this new reference-based metric ROUGE-LM.
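The untrained combination of WPSLOR and ROUGE scores described for the second setting can be sketched as follows. The sketch interprets "normalize by mean and variance" as standardization (subtracting the mean and dividing by the standard deviation), computed over the set of system outputs being scored; this interpretation, the use of the population standard deviation, and the function names are assumptions.

```python
import statistics


def standardize(values):
    """Shift to zero mean and scale to unit variance (population statistics)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0] * len(values)  # constant scores carry no ranking information
    return [(v - mean) / std for v in values]


def rouge_lm(wpslor_scores, rouge_scores):
    """Sum of standardized WPSLOR and ROUGE scores, one value per system output."""
    if len(wpslor_scores) != len(rouge_scores):
        raise ValueError("need one WPSLOR and one ROUGE score per system output")
    return [a + b for a, b in zip(standardize(wpslor_scores), standardize(rouge_scores))]


# Hypothetical scores for three system outputs.
combined = rouge_lm([1.0, 2.0, 3.0], [0.2, 0.4, 0.6])
```

Standardizing first puts both scores on a comparable scale, so that neither dominates the sum merely because of its numeric range; no fluency annotations are needed at any point.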

Results and Discussion
Results are shown in Table 7. First, we can see that using SVR (line 1) to combine ROUGE-L-mult and WPSLOR outperforms both individual scores (lines 3-4) by a large margin. This serves as a proof of concept: the information contained in the two approaches is indeed complementary. Next, we consider the setting where only references and no annotated examples are available. In contrast to SVR (line 1), ROUGE-LM (line 2) has only the same requirements as conventional word-overlap metrics (besides a large corpus for training the LM, which is easy to obtain for most languages). Thus, it can be used in the same settings as other word-overlap metrics. Since ROUGE-LM, an uninformed combination, performs significantly better than both ROUGE-L-mult and WPSLOR on their own, it should be the metric of choice for evaluating fluency with given references.

Related Work

Fluency Evaluation
Fluency evaluation is related to grammatical error detection (Atwell, 1987; Wagner et al., 2007; Schmaltz et al., 2016; Liu and Liu, 2017) and grammatical error correction (Islam and Inkpen, 2011; Ng et al., 2013, 2014; Bryant and Ng, 2015; Yuan and Briscoe, 2016). However, it differs from those in several aspects; most importantly, it is concerned with the degree to which errors matter to humans.
Work on automatic fluency evaluation in NLP has been rare. Heilman et al. (2014) predicted the fluency (which they called grammaticality) of sentences written by English language learners. In contrast to ours, their approach is supervised. Stent et al. (2005) and Cahill (2009) found only low correlation between automatic metrics and fluency ratings for system-generated English paraphrases and the output of a German surface realiser, respectively. Explicit fluency evaluation of NLG, including compression and the related task of summarization, has mostly been performed manually. Vadlapudi and Katragadda (2010) used LMs for the evaluation of summarization fluency, but their models were based on part-of-speech tags, which we do not require, and they were non-neural. Further, they evaluated longer texts, not single sentences like we do. Toutanova et al. (2016) compared 80 word-overlap metrics for evaluating the content and fluency of compressions, finding only low correlation with the latter. However, they did not propose an alternative evaluation. We aim at closing this gap.

Compression Evaluation
Automatic compression evaluation has mostly had a strong focus on content. Hence, word-overlap metrics like ROUGE (Lin and Och, 2004) have been widely used for compression evaluation. However, they have certain shortcomings, e.g., they correlate best for extractive compression, while we, in contrast, are interested in an approach which generalizes to abstractive systems. Alternatives include success rate (Jing, 2000), simple accuracy (Bangalore et al., 2000), which is based on the edit distance between the generation and the reference, or word accuracy (Hori and Furui, 2004), the equivalent for multiple references.

Criticism of Common Metrics for NLG
In the sense that we promote an explicit evaluation of fluency, our work is in line with previous criticism of evaluating NLG tasks with a single score produced by word-overlap metrics. The need for better evaluation for machine translation (MT) was expressed, e.g., by Callison-Burch et al. (2006), who doubted the meaningfulness of BLEU, and claimed that a higher BLEU score was neither a necessary precondition nor a proof of improved translation quality. Similarly, Song et al. (2013) discussed BLEU being unreliable at the sentence or sub-sentence level (in contrast to the system level), or for only one single reference. This was supported by Isabelle et al. (2017), who proposed a so-called challenge set approach as an alternative. Graham et al. (2016) performed a large-scale evaluation of human-targeted metrics for machine translation, which can be seen as a compromise between human evaluation and fully automatic metrics. They also found fully automatic metrics to correlate only weakly or moderately with human judgments. Bojar et al. (2016a) further confirmed that automatic MT evaluation methods do not perform well with a single reference. The need for better metrics for MT has been addressed since 2008 in the WMT metrics shared task (Bojar et al., 2016b, 2017). For unsupervised dialogue generation, Liu et al. (2016) obtained close to no correlation with human judgments for BLEU, ROUGE and METEOR. They attributed this in large part to the unrestrictedness of dialogue answers, which makes it hard to match given references. They emphasized that the community should move away from these metrics for dialogue generation tasks, and develop metrics that correlate more strongly with human judgments. Elliott and Keller (2014) reported the same for BLEU and image caption generation. Dušek et al. (2017) suggested an RNN to evaluate NLG at the utterance level, given only the input meaning representation.

Future Work
The work presented in this paper brings up multiple interesting next steps for future research.
First, in Subsection 4.7, we investigated the performances of WordSLOR and WPSLOR depending on the domain of the considered text. We concluded that an application was possible even for unrelated domains. However, we did not experiment with alternative LMs, which leaves the following questions unresolved: (i) Would training LMs on specific domains improve WordSLOR's and WPSLOR's correlation with human fluency judgments, i.e., to what degree are the properties of the training data important? (ii) How does the size of the training corpus influence performance? Ultimately, this research could lead to answering the following question: Is it better to train an LM on a small, in-domain corpus or on a large corpus from another domain?
Second, we showed that, in certain settings, Pearson correlation does not give reliable insight into a metric's performance. Since in general evaluation of evaluation metrics is hard, an important topic for future research would be the investigation of better ways to do so, or to study under which conditions a metric's performance can be assessed best.
Last but not least, a straight-forward continuation of our research would encompass the investigation of SLOR as a fluency metric for other NLG tasks or languages. While the results for compression strongly suggest a general applicability to a wider range of NLP tasks, this has yet to be confirmed empirically. As far as other languages are concerned, the question what influence a given language's grammar has would be an interesting research topic.

Conclusion
We empirically confirmed the effectiveness of SLOR, an LM score which accounts for the effects of sentence length and individual unigram probabilities, as a metric for fluency evaluation of the NLG task of automatic compression at the sentence level. We further introduced WPSLOR, an adaptation of SLOR to WordPieces, which reduced both model size and training time at a similar evaluation performance. Our experiments showed that our proposed referenceless metrics correlate significantly better with fluency ratings for the outputs of compression systems than traditional word-overlap metrics on a benchmark dataset. Additionally, they can be applied even in settings where no references are available, or would be costly to obtain. Finally, for given references, we proposed the reference-based metric ROUGE-LM, which consists of a combination of WPSLOR and ROUGE. Thus, we were able to obtain an even more accurate fluency evaluation.