BLEU Neighbors: A Reference-less Approach to Automatic Evaluation

Evaluation is a bottleneck in the development of natural language generation (NLG) models. Automatic metrics such as BLEU rely on references, but for tasks such as open-ended generation, there are no references to draw upon. Although language diversity can be estimated using statistical measures such as perplexity, measuring language quality requires human evaluation. However, because human evaluation at scale is slow and expensive, it is used sparingly; it cannot be used to rapidly iterate on NLG models, in the way BLEU is used for machine translation. To this end, we propose BLEU Neighbors, a nearest neighbors model for estimating language quality by using the BLEU score as a kernel function. On existing datasets for chitchat dialogue and open-ended sentence generation, we find that – on average – the quality estimation from a BLEU Neighbors model has a lower mean squared error and higher Spearman correlation with the ground truth than individual human annotators. Despite its simplicity, BLEU Neighbors even outperforms state-of-the-art models on automatically grading essays, including models that have access to a gold-standard reference essay.


Introduction
Despite recent advances on many natural language generation (NLG) tasks - including open-ended generation, chitchat dialogue, and abstractive summarization - evaluation remains a challenge. Automatic metrics such as BLEU rely on references, but for many NLG tasks, there is no single correct answer. In dialogue, the space of acceptable responses to a given prompt is often very large, yet most datasets only provide a few gold-standard references (Serban et al., 2015). In open-ended generation, where text is generated freely by a language model, there are no references at all; statistical measures such as perplexity capture language diversity but not language quality (Hashimoto et al., 2019). These limitations necessitate human evaluation. However, because human evaluation at scale is slow and expensive, it is used sparingly; it cannot be used to rapidly iterate on NLG models, in the way BLEU is used for machine translation.
Prior work on automating reference-less evaluation has largely been limited in scope. Heuristic-based evaluation was found to be effective for grammatical error correction, but the methods used were problem-specific and cannot be extended to other tasks (Napoles et al., 2016; Choshen and Abend, 2018; Asano et al., 2017). Using the log-odds from a language model, Kann et al. (2018) made automatic judgments of sentence-level fluency that correlated moderately well with human judgment, but this captured only one facet of language quality. Approaches that were broader in scope found less success: although ADEM, an RNN trained to score dialogue responses, was initially thought to correlate well with human judgment (Lowe et al., 2017), it was later found to generalize poorly, placing outsized influence on factors such as response length (Lowe, 2019).
Can we come up with a fast and simple method for reference-less evaluation of language quality, analogous to BLEU for machine translation? Note that our goal here is not to supplant human evaluation, but to complement it: as long as the method's predictions correlate moderately well with the ground-truth quality scores, it can be used to speed up NLG model development. Our desiderata are then as follows: simplicity, speed, and a moderately strong correlation with the ground truth. To this end, we propose BLEU Neighbors, a new approach to reference-less automatic evaluation.
Our approach is a nearest neighbors model that predicts the quality of natural language by using BLEU as a kernel function. We start with training examples S, where each sentence s ∈ S has a ground-truth quality score q(s). Note that these examples are not references - we do not expect the NLG model being evaluated to generate any sentence in S. In fact, S contains sentences of varying quality, including incoherent sentences with low quality scores. Given a test sentence x, we use the BLEU score to identify its neighbors in the training data: {s | BLEU*(x, s) > τ, s ∈ S}, where τ is a similarity threshold. Then we simply take the mean of the neighbors' known quality scores to estimate q(x), the quality of x. Consider the test sentence x = 'The fox is quick'. As seen in Figure 1, it overlaps with s_1 = 'The dog was quick.' and s_2 = 'It is the fox.' but not with s_3 = 'Dogs are lazy.'. Therefore, we estimate q(x) as the mean of q(s_1) and q(s_2) but not q(s_3).
We test BLEU Neighbors on the datasets from HUSE (Hashimoto et al., 2019), where each sentence's ground-truth quality score is the average over 20 human judgments. On the dialogue and open-ended generation datasets, we find that - on average - the BLEU Neighbors model has a lower mean squared error (MSE) and higher Spearman correlation with the ground truth than individual annotators! The premise of our method is that past approaches to reference-less evaluation fell short because they were too ambitious - if a given test sentence is not sufficiently similar to any training example, no prediction should be made at all. Although we sacrifice some coverage in order to make more accurate estimates, this sacrifice is modest: BLEU Neighbors makes predictions for 41% to 99% of sentences from the HUSE datasets.
Our method is weakest on evaluating summaries; this is unsurprising, given that summary quality is conditioned on the source text, which the method ignores. In contrast, BLEU Neighbors is surprisingly effective at automatically grading essays, achieving a new state-of-the-art and even beating out models that have access to a gold-standard reference essay. These findings suggest that despite its simplicity, our approach has broad applicability. Although BLEU Neighbors does not measure language diversity, it is sufficient for it to measure quality alone: the former is easier to estimate (e.g., with perplexity) and can be combined with the BLEU Neighbors score in a hybrid metric (Hashimoto et al., 2019). We conclude by providing some practical advice, such as how to prevent NLG models from explicitly optimizing for a high BLEU Neighbors score without generating high-quality output.

Reference-based Evaluation
BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and METEOR (Banerjee and Lavie, 2005) are the de facto canonical metrics of reference-based automatic evaluation. Given a candidate sentence x and a reference sentence s, each metric assigns a score q(x, s) ∈ [0, 1] based on how well x overlaps with s. Where the metrics differ is in how they define this overlap. Letting G_n(·) denote the list of n-grams, the n-gram precision P_n and recall R_n can be defined as follows:

P_n(x, s) = |G_n(x) ∩ G_n(s)| / |G_n(x)|,  R_n(x, s) = |G_n(x) ∩ G_n(s)| / |G_n(s)|    (1)

BLEU The BLEU score for (x, s) is the geometric mean of the n-gram precision P_n up to a chosen n (typically, n = 4). BLEU also implements clipping, such that each n-gram g ∈ G_n(x) can be matched at most once. It also includes a brevity penalty to penalize shorter candidates.
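To make equation (1) concrete, here is a minimal Python sketch of clipped n-gram precision and recall. The whitespace tokenization and the guard against empty n-gram lists are our own simplifications, not details specified above.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def precision_recall(x_tokens, s_tokens, n):
    """Clipped n-gram precision P_n and recall R_n for a pair (x, s).

    Each n-gram of x is matched at most as many times as it occurs
    in s, implementing the clipping described above."""
    x_counts = Counter(ngrams(x_tokens, n))
    s_counts = Counter(ngrams(s_tokens, n))
    matches = sum(min(count, s_counts[g]) for g, count in x_counts.items())
    p_n = matches / max(sum(x_counts.values()), 1)
    r_n = matches / max(sum(s_counts.values()), 1)
    return p_n, r_n
```

For example, with x = "the fox is quick" and s = "the dog was quick", precision_recall(x.split(), s.split(), 1) gives P_1 = R_1 = 0.5, since two of the four unigrams ("the" and "quick") match.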
METEOR The METEOR score takes the harmonic mean of P_1 and R_1, with greater weight placed on R_1. It is laxer than BLEU, allowing words in x and s to match, for example, if they are synonyms or share the same stem (Banerjee and Lavie, 2005). Instead of looking at higher-order n-grams, METEOR tries to align the tokens in x and s and penalizes alignments that are not contiguous.

ROUGE-L ROUGE-L, the variant of ROUGE we discuss in this work, measures the overlap between x and s as the length of their longest common subsequence LCS(x, s). Specifically, it calculates the LCS-based precision LCS(x, s)/|x| and recall LCS(x, s)/|s| and takes their harmonic mean.
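A corresponding sketch of ROUGE-L under the same assumptions (whitespace tokens); note that the official ROUGE toolkit also supports a recall-weighted F-measure, which we omit here.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists,
    computed with standard dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l(x_tokens, s_tokens):
    """ROUGE-L: harmonic mean of LCS-based precision and recall."""
    lcs = lcs_length(x_tokens, s_tokens)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(x_tokens), lcs / len(s_tokens)
    return 2 * precision * recall / (precision + recall)
```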
Although there have been advances in reference-based automatic evaluation - such as BEER (Stanojević and Sima'an, 2014) and CIDEr (Vedantam et al., 2015), among others (Shimanaka et al., 2018; Ma et al., 2017; Lo et al., 2018; Zhao et al., 2019) - BLEU and METEOR are still widely used for machine translation; ROUGE, for summarization (Liu et al., 2016). This is partially because some of the newer methods are learned metrics that do not generalize well to new domains (Chaganty et al., 2018). Moreover, most do not enjoy the incumbent status that BLEU, ROUGE, and METEOR have. To our knowledge, the current state-of-the-art in reference-based evaluation metrics is BERTScore (Zhang et al., 2019), which uses BERT embeddings (Devlin et al., 2019) to compute similarity at the token level before aggregating the similarities using importance weighting. As it is state-of-the-art for reference-based evaluation, it is the only non-canonical metric we consider as a kernel function.

Reference-less Evaluation
Compared to reference-based evaluation, little work has been done on automating reference-less evaluation. The most successful approaches have been task-specific: heuristic-based evaluation was found to be effective for grammatical error correction (Napoles et al., 2016; Choshen and Abend, 2018; Asano et al., 2017). However, those heuristics cannot be extended to other tasks. Kann et al. (2018) proposed two metrics for judging the fluency of a sentence: sentence-level log-odds ratio (SLOR) and a WordPiece-based variant named WP-SLOR. Although the latter correlates moderately well (Pearson's r > 0.40) with human judgment, it should be noted that sentence-level fluency is only one facet of language quality - a sentence may be probable according to a language model while making little sense to a human.
Approaches that were broader in scope were less successful. ADEM, an RNN trained to score dialogue responses, was initially thought to correlate well with human judgment (Lowe et al., 2017). However, the authors later found that it generalized poorly (Lowe, 2019), placing outsized influence on factors such as response length. It was also found to be vulnerable to adversarial examples (Sai et al., 2019). In any case, ADEM was not a purely reference-less method - it still required a gold-standard reference as input. Rather, its key insight was that the space of acceptable responses is much larger than the handful of gold-standard references provided in dialogue datasets, and that this should be considered when estimating quality.

BLEU Neighbors
Given a candidate sentence x, training examples S, and ground-truth quality scores {q(s) | s ∈ S}, we want to estimate q(x), the language quality of x. How can we do so in a fast and simple manner such that our predictions correlate well with the ground truth? We propose a nearest neighbors model that uses a variant of the BLEU score called BLEU* as the kernel function. Once the neighbors of x have been identified, we take the mean of their known quality scores as our estimate q̂(x).
Definition 3.1. The non-unigram BLEU-4 score is a variant of the BLEU-4 score that ignores unigram precision. Where β = exp(min(0, 1 − |s|/|x|)) is the brevity penalty and P_i is defined in (1), we define this score as

BLEU*(x, s) = β · (P_2(x, s) · P_3(x, s) · P_4(x, s))^(1/3)    (2)

BLEU Neighbors uses this variant of the BLEU score as the kernel function. We ignore the unigram precision P_1 because we are not comparing candidates and their direct references, but rather candidates and training examples. It is not uncommon for two random sentences to have stopwords in common, in which case a non-zero P_1 is unexceptional. We validated this empirically as well, finding that ignoring P_1 improves correlation with the ground-truth quality.

Definition 3.2. Given a candidate sentence x, training examples S, and a similarity threshold τ ∈ [0, 1], the BLEU neighbors of x are

N(x) = {s ∈ S | BLEU*(x, s) > τ}

To ensure that the quality estimate is stable, we require that N have a minimum size of a ∈ Z+.
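A minimal sketch of BLEU* per equation (2), with the clipped precision repeated from the earlier sketch for self-containedness. The paper does not specify how zero precisions are handled; returning 0 (no smoothing) is our own choice, and is harmless here since such pairs fall below any threshold τ > 0 anyway.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(x_tokens, s_tokens, n):
    """P_n from equation (1), with clipping."""
    x_counts = Counter(ngrams(x_tokens, n))
    s_counts = Counter(ngrams(s_tokens, n))
    total = sum(x_counts.values())
    if total == 0:  # candidate shorter than n tokens
        return 0.0
    return sum(min(c, s_counts[g]) for g, c in x_counts.items()) / total

def bleu_star(x_tokens, s_tokens):
    """Non-unigram BLEU-4 (equation 2): brevity penalty times the
    geometric mean of P_2, P_3, P_4."""
    precisions = [clipped_precision(x_tokens, s_tokens, n) for n in (2, 3, 4)]
    if min(precisions) == 0.0:
        return 0.0  # assumption: no smoothing for zero precisions
    brevity = math.exp(min(0.0, 1.0 - len(s_tokens) / len(x_tokens)))
    return brevity * math.prod(precisions) ** (1 / 3)
```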
At the other extreme, a candidate sentence that overlaps with many training examples in S likely does so because it contains many common n-grams. This complicates evaluation: since BLEU* does not weigh n-grams by their frequency, an abundance of common n-grams - such as "on the" or "it is", for example - can exaggerate the similarity between the candidate and a training example. In this scenario, it is best that no prediction be made at all. Since N ⊆ S, let b ∈ [0, 1] denote the largest fraction of S that N can contain. We express b as a fraction of the training set size |S| because if S is very large, it would not be uncommon for even sentences with rare n-grams to have matches in S.
When N meets the aforementioned size constraints, the BLEU Neighbors estimate of x's quality is the average of its neighbors' quality scores:

q̂(x) = (1/|N(x)|) Σ_{s ∈ N(x)} q(s)

In other words, N ⊆ S comprises all the training examples that are sufficiently similar to the candidate with respect to BLEU*. If there are fewer than a examples or more than b|S| examples in N, then no prediction is made; otherwise, the estimate q̂(x) is the average quality of the examples in N.
Although τ, a, b are parameters to be set, we find that τ = 0.08, a = 5, b = 0.66 are near-optimal for all tasks (see section 5.2). This universality allows BLEU Neighbors to be used out-of-the-box, without hyperparameter tuning. Note that S should only be used to train the evaluator (i.e., BLEU Neighbors). The generator (i.e., the NLG model being evaluated) should not have access to S; otherwise, it could optimize for a high BLEU Neighbors score by including n-grams that only belong to examples with a high ground-truth quality, thus artificially inflating the quality estimates.

Definition 3.3. Given a set of candidates X to be evaluated, the coverage of X is the proportion of candidates for which q̂(x) is defined.

This is a key distinction between BLEU Neighbors and prior approaches to reference-less evaluation: our approach does not necessarily make a prediction for all candidates. This is by design - as mentioned earlier, we surmise that past approaches fell short because they were too ambitious, trying to score sentences that simply could not be scored. There is a trade-off between coverage and prediction error, with greater coverage generally coming at the cost of greater prediction error.
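Putting Definitions 3.1-3.3 together, here is a sketch of the full predictor. It reuses the bleu_star sketch above, though any kernel with the same pairwise signature can be passed in; the function and parameter names are our own.

```python
def bleu_neighbors_score(x_tokens, train, tau=0.08, a=5, b=0.66,
                         kernel=bleu_star):
    """Estimate q(x) given `train`, a list of (s_tokens, q_s) pairs.

    Returns None (no prediction) if x has fewer than `a` neighbors
    or more than b * |S| of them, per the evidence thresholds."""
    neighbor_scores = [q_s for s_tokens, q_s in train
                       if kernel(x_tokens, s_tokens) > tau]
    if len(neighbor_scores) < a or len(neighbor_scores) > b * len(train):
        return None
    return sum(neighbor_scores) / len(neighbor_scores)
```

Abstaining by returning None, rather than falling back to some default score, is what makes coverage (Definition 3.3) directly measurable: it is simply the fraction of candidates with a non-None estimate.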

Experiments
NLG Tasks We test BLEU Neighbors on evaluating sentences from the following NLG tasks: chitchat dialogue, open-ended sentence generation (from a language model), and abstractive summarization. Hashimoto et al. (2019) provided a dataset for each of these tasks, which we collectively refer to as the HUSE datasets. We ignore the story generation dataset in that work because the machine-generated examples are far from human quality and can thus be trivially assigned a low quality score.
Each dataset contains a mixture of machine- and human-generated sentences, in roughly equal proportion. Each sentence in the HUSE datasets was judged by 20 human annotators, who assigned it a label based on its typicality. These labels map to an integer score from 0 to 5. We divide the raw judgment by 5 to bound it in [0, 1] and then take the mean across all 20 annotators, which we treat as the ground-truth language quality q(s) for each sentence s. Because these datasets are small, we use leave-one-out prediction. That is, given a candidate sentence from a particular HUSE dataset, we treat the remaining n − 1 sentences as S.
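A sketch of this leave-one-out protocol, including the normalization of raw annotator labels; scipy's spearmanr computes the correlation, and bleu_neighbors_score is the sketch from Section 3.

```python
from scipy.stats import spearmanr

def ground_truth(judgments):
    """Map raw typicality labels (integers 0-5) to q(s) in [0, 1]."""
    return sum(label / 5 for label in judgments) / len(judgments)

def leave_one_out(dataset, **kwargs):
    """dataset: list of (tokens, q) pairs from one HUSE dataset.
    Returns Spearman's rho over the covered candidates, plus coverage."""
    truths, preds = [], []
    for i, (x_tokens, q_x) in enumerate(dataset):
        train = dataset[:i] + dataset[i + 1:]  # remaining n - 1 sentences
        estimate = bleu_neighbors_score(x_tokens, train, **kwargs)
        if estimate is not None:  # covered candidate
            truths.append(q_x)
            preds.append(estimate)
    if not preds:
        return float("nan"), 0.0
    rho, _ = spearmanr(truths, preds)
    return rho, len(preds) / len(dataset)
```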

Grading Essays
We also test our model on automatically grading essays from the ASAP-SAS dataset. Although each essay is a multi-sentence paragraph, we did not adapt our model in any way. Each essay's quality score is an integer from 0 to 3, which we divide by 3 to bound in [0, 1]. This normalization is done for the sake of consistency. Because there are distinct training and test sets, we draw the training examples from the training data and the candidates to be evaluated from the test data. The ASAP-SAS data is also broken down by topic. The current state-of-the-art model only evaluates on Topic #3 - specifically, on essays from Topic #3 that contain 5 to 15 sentences (Clark et al., 2019). Therefore, to allow for a fair comparison, we also draw test sentences from this subset.
Threshold Settings Unless otherwise stated, for all HUSE datasets, we use τ = 0.08, a = 5, b = 0.66. These settings were chosen to maximize the Spearman correlation with the ground-truth quality.

Table 1: The mean squared error (MSE) and Spearman's ρ of language quality predictions q̂(·) with respect to the ground truth q(·). The lowest MSE and highest ρ across all models is in bold and * signifies p < 0.01. For all tasks, BLEU Neighbors achieves a higher Spearman's ρ than its ROUGE, METEOR, and BERTScore counterparts. For dialogue and open-ended generation, it even has a lower MSE and higher ρ than human annotators on average!

Other Kernel Functions In addition to using BLEU* as the kernel function, we try other automatic metrics, including ROUGE, METEOR, and BERTScore (Zhang et al., 2019). As with BLEU*, a single value of τ for each metric works universally well: 0.06 for ROUGE, 0.18 for METEOR, and 0.10 for BERTScore. The evidence thresholds a, b are fixed across all kernels.
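As a usage sketch, swapping kernels only changes the function handle and its τ; bleu_star and rouge_l are the sketches above, while METEOR and BERTScore wrappers (not shown) would need to expose the same pairwise signature mapping a sentence pair to [0, 1].

```python
# Per-kernel thresholds reported above; a and b stay at 5 and 0.66.
# (METEOR would use tau = 0.18; BERTScore, tau = 0.10.)
KERNELS = [(bleu_star, 0.08), (rouge_l, 0.06)]

def score_with_all_kernels(x_tokens, train):
    """Return each kernel's quality estimate (or None) for one candidate."""
    return {kernel.__name__: bleu_neighbors_score(x_tokens, train,
                                                  tau=tau, kernel=kernel)
            for kernel, tau in KERNELS}
```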

BLEU Neighbors vs. Humans
In Table 1, using mean squared error (MSE) and the Spearman correlation, we compare the language quality predictions q̂(·) made by our various models with the ground-truth quality q(·). Because the ground-truth quality is the mean over 20 annotator judgments, we provide the performance of the best human annotator (across all examples) and the average performance across all individual annotators. Note that not all annotators scored all the examples: the average MSE and ρ we report in Table 1 is the average over what each annotator obtained on their respective subset of the data. We find that there is a significant gap between the best and average case, both in terms of MSE and Spearman's ρ. For example, on the summarization task, the MSE and Spearman's ρ of the best human annotator are 4x and 2x better than those of annotators on average.

Spearman Correlation
As shown in the second section of Table 1, for all tasks, we find that BLEU Neighbors has a higher Spearman correlation with the ground truth than its ROUGE, METEOR, and BERTScore counterparts. For open-ended generation and dialogue, it even outperforms the average-case human annotator! Only on evaluating summaries does the average-case annotator beat all nearest neighbors models with respect to Spearman's ρ; this is unsurprising, given that summary quality is strongly conditioned on the source text, which these models ignore. However, even for summarization, the difference between the average-case annotator and BLEU Neighbors is not statistically significant at p < 0.01 (per a Williams test), owing to the small dataset size; the same holds for dialogue and open-ended generation. Despite the impressive performance of BLEU Neighbors, it should be noted that it is still well behind the best human annotator for each task.
Mean Squared Error Although BLEU Neighbors achieves a much higher Spearman correlation with the ground-truth quality than its counterparts, the model that achieves the lowest MSE varies across tasks. How can we reconcile these observations? We find that the variance of the ground-truth quality is quite small for all datasets. By just predicting the mean of q(·) for all candidates - a constant predictor whose MSE is exactly the variance of q(·) - we can get an MSE for each task that is only slightly higher than the best annotator's. Models that obtain the lowest MSE while also having a low Spearman's ρ are thus making low-variance estimates close to the mean that do not correlate well with the ground truth. It is also worth noting that the annotators of the HUSE datasets assigned discrete values to each sentence (Hashimoto et al., 2019), while q(·), being an average over those judgments, is a continuous value. This discrepancy is conducive to human annotators having a higher MSE than the models, whose predictions are continuous.

Varying the Evidence Thresholds
Recall that BLEU Neighbors has two evidence thresholds: a, the minimum number of neighbors needed to make a prediction, and b, the maximum number of neighbors allowed (as a fraction of the training set S). In Figure 2, we plot the Spearman correlation between predictions q̂(·) and the ground truth q(·) as each threshold changes, while the other is held constant at the default setting (a = 5, b = 0.66). In Figure 3, we plot the change in coverage as the thresholds change.

Spearman Correlation
The correlation for all tasks is sensitive to changes in a, with the correlation peaking at a = 30 or a = 35 before declining. This is intuitive: increasing the amount of evidence required yields more robust predictions, but sentences that meet the stringent requirement of having more than 35 neighbors likely have many common n-grams, making them harder to score. While performance on all tasks is sensitive to a, only performance on open-ended generation is sensitive to b, with the correlation decreasing as b increases (i.e., as we loosen the upper bound on the number of neighbors). This suggests that sentences in the dialogue and summarization datasets do not have many neighbors to begin with, which is why tightening the upper bound has little effect.
Sentences in the open-ended generation data, on the other hand, seem to have many more neighbors on average, resulting in ρ being inversely related to b. The two sudden drops in Spearman's ρ for open-ended generation - at approximately b = 0.2 and b = 0.7 - suggest that the distribution of |N|, the number of neighbors, is multi-modal.

Coverage within a Model The higher a is and the lower b is, the more candidates we reject for having too few or too many neighbors. This can be seen in Figure 3, where the coverage falls almost linearly as a increases. Conversely, the coverage rises linearly before plateauing as b increases. The plateau indicates that loosening the upper bound further will have no effect because no candidate sentence has that many neighbors to begin with.
Coverage across Models Holding constant the evidence thresholds a and b, we see in Table 1 that coverage across different models is unrelated to MSE and Spearman's ρ. For all models, τ is set to minimize the MSE and maximize Spearman's ρ. However, models with the lowest MSE or highest ρ on a given task are not necessarily the most selective (i.e., those with the lowest coverage). BLEU Neighbors, which has the highest correlation on all tasks, has a coverage of 41%, 76%, and 99% on open-ended generation, dialogue, and summarization respectively. In other words, the trade-off between coverage and prediction error exists within a model - as a function of parameters a and b - but not across different types of models.
Performance vs. Coverage As seen in Figures 2 and 3, there is a trade-off when choosing a and b. Higher a and lower b result in better performance (i.e., greater correlation with the ground truth), but they also decrease coverage. Recall the default settings: a = 5, b = 0.66. Even though correlation on most tasks peaks at a = 30 or a = 35, we choose a = 5 as the default because we want to keep the coverage as high as possible. By choosing a > 1, however, we still see some benefit from requiring a minimum number of neighbors. We choose b = 0.66 because it is near the end of a plateau past which performance on open-ended generation data drops precipitously. In other words, the default settings of a, b are near Pareto-optimal, maximizing coverage while outperforming human annotators on average. Depending on one's preferences, however, it is possible to trade off some performance for additional coverage - or vice versa - by picking a different point (a, b) on the Pareto frontier.

Low-Hanging or High-Hanging Fruit?
Does BLEU Neighbors only make predictions for sentences that humans consider easy to score (i.e., low-hanging fruit)? Let A_i denote the set of all sentences for task i and B_i ⊆ A_i denote the subset of those sentences for which BLEU Neighbors makes predictions. We can answer this question by comparing the average MSE of human annotators on A_i with their average MSE on B_i, which we will denote as MSE(A_i) and MSE(B_i) respectively. We cannot use the Spearman correlation for comparison because not every annotator scored every sentence; recall that the average MSE and Spearman's ρ reported in Table 1 are computed over each annotator's performance on the subset of the data that they annotated. If our model were only scoring the easy-to-score sentences, then we would expect MSE(A_i) to be significantly larger than MSE(B_i). However, for both summarization and open-ended generation, we find that there is no statistically significant difference between these means at any level. Only on the dialogue dataset could this theory partially explain the success of our model: MSE(B_dialogue) is 15.6% lower than MSE(A_dialogue) and this difference is significant at p < 0.01. However, the average-case annotator MSE on the subset of the dialogue data scored by ROUGE Neighbors is only 2 × 10⁻⁴ higher than MSE(B_dialogue), yet ROUGE Neighbors performs far worse than its BLEU counterpart (see Table 1). This implies that the success of BLEU Neighbors is much more than simply picking the right sentences to score.

Cross-Task Performance
In Table 2, we report the Spearman's ρ for BLEU Neighbors when the training and test examples are drawn from different tasks. Of all the tasks, performance on dialogue is the most robust: regardless of which task is used to source the training data, it is possible to achieve a moderately strong correlation (ρ > 0.27), albeit with lower coverage. Performance on summarization drops to near zero in this setup - this is unsurprising, given that summary quality is strongly conditioned on the source text, which is ignored here. For open-ended generation, it is still possible to achieve a weak correlation (ρ > 0.09) with this setup. Curiously, we find that the coverage for open-ended generation actually improves when the training data is sourced from a different task, suggesting that it may be possible to adjust parameters a, b to trade off some coverage for a higher correlation.

How much Training Data is Needed?
In Figure 4, we plot the performance of BLEU Neighbors on the HUSE datasets for different amounts of training data. This is simulated by drawing a random subset of the data with n examples, doing leave-one-out prediction with n − 1 examples, and then taking the mean performance over 20 such runs. We find that BLEU Neighbors is surprisingly robust on all tasks, with 75 training examples being sufficient to achieve a Spearman's ρ > 0.30 on dialogue and open-ended generation while retaining above 60% coverage. As more data is used, both coverage and the Spearman correlation improve, though there are diminishing returns. Unsurprisingly, the variation in performance across random subsets also drops as more data is used.

Automated Essay Grading
In Table 3, we report the performance on the essay grading task described in Section 4, where the goal is to score essays from Topic #3 of the ASAP-SAS dataset. Unlike the NLG tasks, every test example here is a multi-sentence paragraph, which makes scoring more difficult: ten random sentences may be high-quality on their own while making little sense when put together. The difficulty of this task is compounded by the fact that the ground-truth quality of each essay is based on a gold-standard reference for Topic #3. Since BLEU Neighbors does not use references, it is at a disadvantage compared to approaches that do, such as ROUGE-L. Excluding ROUGE-L, all the models we list in Table 3 are optimal transport methods that leverage text embeddings (Clark et al., 2019).

Table 3: Spearman's ρ between predicted essay quality and the ground truth, where the test essays are from Topic #3 and * denotes p < 0.01. When using essays from Topic #8 as the training data, BLEU Neighbors is state-of-the-art, even beating out models with access to a gold-standard reference essay for Topic #3.
Despite not being given the gold-standard reference essay, when BLEU Neighbors is trained with sample essays from Topic #8, it achieves a new state-of-the-art: a Spearman's ρ of 0.500 between its predicted scores and the ground-truth quality judgments. Due to the small amount of test data, however, this improvement over the state-of-the-art is not statistically significant at p < 0.01 when using a Williams test. As seen in the second half of Table 3, the performance of the model depends strongly on which topic the training data is sourced from. This is unsurprising, given that some topics are more related to #3 than others. Some topics (e.g., #4) are so different from the test topic that their training examples are not useful at all. When we use essays from all topics but Topic #3 as the training data, we still outperform most of the past approaches.

Limitations and Future Work
Quality + Diversity Although the BLEU Neighbors model does not measure language diversity, this is by design. Consider that if an NLG model were ideal, even the optimal discriminator could not tell whether its outputs were human- or machine-generated. Hashimoto et al. (2019) proved that such an optimal discriminator would only need two statistics: a measure of language diversity (e.g., perplexity) and a measure of language quality. The former is trivial to compute; it is the latter that is cost- and time-intensive, and which we thus try to automate using BLEU Neighbors. These two measures can be combined using a metric such as HUSE (Hashimoto et al., 2019), meaning that it is sufficient for our model to predict quality alone. The next step would be to use such a hybrid metric in rapidly evaluating NLG models during development.
Preventing "Hacks" How can we prevent the NLG model being evaluated from "hacking" a BLEU Neighbors model so as to receive inflated quality estimates?More formally, how can we prevent an NLG model generating language that minimizes human quality judgments while maximizing the BLEU Neighbors score?As mentioned in section 3, one way to prevent this is to use disjoint training sets for the NLG model and BLEU Neighbors, so that the former has no idea what the latter considers a high-quality candidate.Additionally, it would help to have a large set of training examples for BLEU Neighbors and then subsample it during each evaluation instance to create S, as that would discourage NLG models from generating n-grams that just so happen to occur in one high-quality example in S. In any case, NLG models that do "hack" the evaluation in this way will sacrifice language diversity; by aiming to improve (or at least, preserve) language diversity during model development, such pernicious behavior can be avoided.Moreover, as mentioned earlier, BLEU Neighbors is not intended to supplant humans, so any attempts to inflate quality estimates during development would have poor long-term outcomes.

Metric Learning
The success of BLEU Neighbors can largely be ascribed to its use of BLEU*, a variant of the BLEU-4 score, as the kernel function in sentence space. Despite its simplicity, BLEU* works surprisingly well. There is likely a more elaborate variant of BLEU-4 that works even better for this purpose - one that excludes stopwords, or one that places greater weight on rarer n-grams, for example. We experimented with a variant of the BLEU-4 score that transformed its sentence inputs into sequences of part-of-speech tags before calculating n-gram overlap on those sequences, but found that it led to poorer quality estimates. This suggests that looking at only a single dimension of grammaticality to estimate the similarity between two sentences is insufficient. However, this is only one of many possible modifications that can be made. Instead of representing each sentence as a sequence of words, another possibility is to transform it into a sentence embedding and then learn a kernel function as a metric in the embedding space. It is worth noting that a metric that works well for reference-based evaluation will not necessarily work well as a metric for reference-less evaluation, as we found in our experiments with BERTScore (see Table 1).

Conclusion
The absence of a reference-less evaluation metric for language quality has been an impediment to developing NLG models. To address this problem, we proposed BLEU Neighbors, a nearest neighbors model that leverages the BLEU score as a kernel function in sentence space. Our simple approach worked surprisingly well: it outperformed human annotators - on average - in predicting the quality of dialogue and open-ended generation data. We also found BLEU Neighbors to be state-of-the-art on automatically grading essays, even beating out models that had access to a gold-standard reference essay. Moreover, our model is fast, data-efficient, and easy to use; it has only two hyperparameters and those have settings that work universally well across various tasks. Still, BLEU Neighbors is intended to complement, not supplant, human evaluation - its speed, simplicity, and ease of use make it ideal for rapidly iterating on NLG models long before any human evaluation is done.

Figure 1: We want to score a sentence x given training examples S = {s_1, s_2, s_3} with known quality scores {q(s_1), q(s_2), q(s_3)}. BLEU Neighbors works as follows: calculate BLEU*(x, ·), a variant of the BLEU-4 score, for each s; ignore those below τ = 0.08; take the average score of those that remain to predict q(x).

Figure 2: Spearman's ρ between BLEU Neighbors estimates q̂(·) and the ground-truth quality q(·) as each evidence threshold changes, while the other is held constant at a = 5 or b = 0.66. a is the minimum number of neighbors needed; b is the maximum allowed (as a fraction of the training set). For all tasks, increasing a improves correlation, up to a point. Only open-ended generation is sensitive to changes in b, with the correlation decreasing as b increases. The shaded area for each task indicates above-human performance (on average).

Figure 3: The coverage (i.e., the fraction of sentences for which BLEU Neighbors makes predictions) as each evidence threshold changes while the other is held constant at a = 5 or b = 0.66. a is the minimum number of neighbors needed; b is the maximum number allowed (as a fraction of the training set). For all tasks, coverage falls as a increases and b decreases (i.e., as the range for the acceptable number of neighbors gets smaller).

Figure 4: BLEU Neighbors performance when only a random subset of the training data is used. With more data, both coverage and the Spearman correlation with the ground truth improve, albeit with diminishing returns. The shaded area denotes one standard deviation (i.e., variation in performance across random samples).

Table 2: BLEU Neighbors performance when the training and test examples are sourced from different tasks. For example, the intersection of row G→ and column →D means that training examples from open-ended generation are used to score dialogue data. In this setup, a moderate Spearman's ρ can still be achieved on the dialogue data; for summarization, the correlation drops to near zero.