Mark-Evaluate: Assessing Language Generation using Population Estimation Methods

We propose a family of metrics to assess language generation derived from population estimation methods widely used in ecology. More specifically, we use mark-recapture and maximum-likelihood methods that have been applied over the past several decades to estimate the size of closed populations in the wild. We propose three novel metrics: ME Petersen and ME CAPTURE, which return a single-valued assessment, and ME Schnabel, which returns a double-valued metric to assess the evaluation set in terms of quality and diversity, separately. In synthetic experiments, our family of methods is sensitive to drops in quality and diversity. Moreover, our methods show a higher correlation to human evaluation than existing metrics on several challenging tasks, namely unconditional language generation, machine translation, and text summarization.


Introduction
Population estimation methods have been widely used in ecology to study the development of species over the last several decades (Krebs and others, 1989). Existing population estimation methods focus either on open populations, where births, deaths, and migrations are taken into account, or on closed populations, where the population is assumed to remain static over the course of the estimation study. In this work, we focus on closed population methods and study how their respective population estimates can be used to assess an evaluation set of generated samples, given a reference set of real samples. Samples are either contextualized word or sentence embeddings. We study two mark-recapture methods, namely the Petersen (Ricker, 1975) and the Schnabel (Schnabel, 1938) estimators, where samples are captured and marked, released, and recaptured. The number of such different samples is then used to estimate the population size. We further use CAPTURE (Otis et al., 1978), a maximum-likelihood method, which uses the number of marked samples over multiple captures for such estimation.
Accurate evaluation of generated data is essential to correctly measure to what degree we can improve the overall generation process. Depending on the use case, single-valued metrics may suffice to assess specific conditional language generation tasks, such as machine translation and text summarization, where we are interested in evaluating the similarity of a generated translation or summary to a specific reference translation or summary, respectively. On the other hand, in unconditional language generation, for example, it may be useful to have separate measures for the diversity and quality of the generated set, which makes it possible to identify shortcomings of the generation system and fix them accordingly. This has been an active area of generative models research (Goodfellow et al., 2014), with several works focusing on stimulating diversity while maintaining the overall sample quality (Srivastava et al., 2017; Lin et al., 2018; Mordido et al., 2018; Sauder et al., 2020).
Mark-Evaluate (ME) is a family of 3 novel language evaluation methods based on the above population estimation methods: ME Petersen and ME CAPTURE retrieve a single-valued metric to assess an evaluation set, while ME Schnabel returns a double-valued metric, separately measuring the quality and diversity of the evaluation set. Our main contributions can be listed as follows: (i) Proposal of 3 novel language metrics (Section 3) that are sensitive to mode collapse (Section 4.1) and quality detriment (Section 4.2) and show a high correlation to human evaluation on challenging text generation tasks, such as unconditional language generation (Section 5), machine translation (Section 6.1) and text summarization (Section 6.2). (ii) In-depth study of the language assessment capability of popular existing metrics, i.e. FID (Heusel et al., 2017), PRD (Sajjadi et al., 2018) and IMPAR (Kynkäänniemi et al., 2019), primarily used to evaluate image generation in the past. (iii) Usage of contextual information and different levels of granularity to assess language, by using either contextualized word (Sections 6.1 and 6.2) or sentence embeddings (Sections 4, 5) derived from BERT (Devlin et al., 2019). (iv) Code for the reproducibility of the results will be publicly available.

Related work
While acknowledging the importance of traditional evaluation metrics, such as BLEU (Papineni et al., 2002), ROUGE (Lin, 2004) and METEOR (Banerjee and Lavie, 2005), we focus on the new trend of unsupervised methods that use embedding representations of pre-trained models to assess a set of evaluation samples. Our family of methods analyzes the data manifold to assess the evaluation set by using k-nearest neighbors to determine the capture volume. Several methods have been recently proposed to assess data generation using topological information; however, they were primarily intended to assess image generation (Sajjadi et al., 2018; Khrulkov and Oseledets, 2018; Kynkäänniemi et al., 2019; Niedermeier et al., 2020). In this work, we investigate the performance of such methods in the text domain, analyzing their behavior on synthetic experiments and their correlation with human evaluation.
Precision and recall for distributions (PRD) was proposed by Sajjadi et al. (2018) and uses k-means (MacQueen and others, 1967) to build histograms of the discrete reference and evaluation distributions over the clusters' centers. The evaluation distribution is then assessed in terms of relative probability densities. Precision is obtained by calculating the probability of an evaluation sample falling within the reference distribution's support. On the other hand, recall is retrieved by calculating the probability of a reference sample falling within the evaluation distribution's support. Kynkäänniemi et al. (2019) suggested several improvements to the above method, which we call improved precision and recall (IMPAR). First, instead of k-means, they proposed to use k-nearest neighbors to approximate the reference and evaluation manifolds by building a hypersphere around each sample that reaches its k-th nearest neighbor. Second, they simplified PRD's notions of precision and recall by calculating the probability of an evaluation sample falling within at least one reference sample's hypersphere, and vice-versa, respectively. The proposed manifold approximations present a simple yet effective way of representing the reference and evaluation manifolds in an explicit, nonparametric way. We build upon this idea and use identical hyperspheres to determine the capture volume used by our different estimators to estimate the population size.
Fréchet Inception Distance or FID (Heusel et al., 2017) is a widely used single-valued metric that assesses data similarity by calculating the distance between the reference and evaluation distributions. Even though originally proposed for the image domain, Semeniuta et al. (2018) adapted FID to evaluate text generation by getting vector representations from InferSent (Conneau et al., 2017), instead of Inception-V3 (Szegedy et al., 2016). Even though this metric precedes PRD and IMPAR, FID is still commonly used to assess generative models in the image and text domain.
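For reference, the distance that FID computes is the Fréchet distance between two Gaussians fitted to the reference and evaluation embeddings. Its well-known closed form is given below; the symbols \mu_r, \Sigma_r, \mu_e and \Sigma_e (means and covariances of the reference and evaluation embeddings) are notation introduced here for illustration.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_e \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_e - 2 \left( \Sigma_r \Sigma_e \right)^{1/2} \right)
```

A value of 0 is reached when both fitted Gaussians coincide, which is why lower FID indicates a closer match.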
As previously mentioned, we study both the usage of sentence embeddings, derived from SBERT (Reimers and Gurevych, 2019), as well as contextualized word embeddings from BERT (Devlin et al., 2019), which have been recently shown to improve language assessment, both in a supervised (Mathur et al., 2019; Sellam et al., 2020) and unsupervised manner (Zhao et al., 2019; Zhang et al., 2020). More specifically, BERTScore (Zhang et al., 2020) measures precision and recall from a reference and evaluation text by calculating the required transport of each word of a given text to the most semantically similar word in the other text. On the other hand, MoverScore (Zhao et al., 2019) measures the semantic distance between two texts by calculating the minimum transport required between the reference and evaluation texts. These metrics are ideal for conditional text generation, such as machine translation and text summarization, where an evaluation text should match a given reference text.
Figure 1: Each sample s ∈ S and s' ∈ S' is represented as a blue and red circle, respectively. Marked samples are represented by filled circles. In this illustration, hyperspheres reach to each sample's nearest neighbor of the same set, i.e. K = 1. For ME Petersen and ME Schnabel, we first capture and mark all samples inside any hypersphere of s. Then, for ME Petersen, we count the number of marked samples or recaptures inside any hypersphere of s'. For ME Schnabel, we perform a similar process iteratively, marking and recapturing samples inside the hypersphere of each s', resulting in all samples being marked in the end. On the other hand, ME CAPTURE captures and marks samples inside each hypersphere of s and s'. Panels: ME Petersen (single marking, single recapture), ME Schnabel (first marking, multiple markings and recaptures), ME CAPTURE (multiple markings and captures).
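The greedy matching behind BERTScore, described earlier in this section, can be sketched as follows. This is a minimal illustration that assumes plain lists of token embedding vectors and omits the optional importance weighting of the original metric; the function names are hypothetical and not taken from any library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_precision_recall(reference, candidate):
    """Greedy-matching precision, recall and F1 over token embeddings.

    Each candidate token is matched to its most similar reference token
    (precision) and each reference token to its most similar candidate
    token (recall). Assumes precision + recall > 0.
    """
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With a two-token reference and a one-token candidate that matches only the first reference token, precision is 1 while recall drops to 0.5, which shows how the two directions of matching separate the two notions.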

Mark-Evaluate
In this work, we consider population estimation methods for closed populations, where the true population size remains constant throughout the estimation study. In our use case, our population consists of two sets, namely a reference set S_r and an evaluation set S_e. The true population size (P) is then known a priori and represents the total number of samples in the two sets: P = |S_r| + |S_e|. Given an estimated population size (P̂) from one of the used estimators, we measure the accuracy loss A(P, P̂), with a low accuracy loss, i.e. A(P, P̂) ≈ 0, representing a good population estimate, and a high accuracy loss, i.e. A(P, P̂) ≈ 1, representing a poor one. Our population estimation methods assume all samples to have an equal chance of capture, which is influenced by our capture volumes: hyperspheres that reach from each reference or evaluation sample to its k-th nearest reference or evaluation neighbor, respectively. Hence, if evaluation samples tend not to be inside any reference sample's hypersphere and vice-versa, the population estimate will likely be poor due to the lack of captured samples in the estimation study.
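To make the role of the accuracy loss concrete, the sketch below uses one plausible form: the relative error of the estimate, clipped to [0, 1]. This exact form is an assumption; the text only states that A is close to 0 for good estimates, close to 1 for poor ones, and that the resulting scores peak at 1. The function names are hypothetical.

```python
def accuracy_loss(true_size, estimated_size):
    """Relative error of a population estimate, clipped to [0, 1].

    Assumed form: the source only fixes the behavior at the two extremes.
    """
    return min(1.0, abs(true_size - estimated_size) / true_size)


def me_score(true_size, estimated_size):
    """Score in [0, 1] that is highest when the estimate is exact (assumed 1 - A)."""
    return 1.0 - accuracy_loss(true_size, estimated_size)
```

For example, with a true population of 20 samples, an estimate of 15 would give a loss of 0.25 and a score of 0.75 under this assumed form.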
Our methods can be separated into three categories: single marking and recapture (ME Petersen), multiple markings and recaptures (ME Schnabel) and multiple markings and captures (ME CAPTURE). Let us consider two sets of samples S and S', where each sample is a contextualized word embedding or a sentence embedding derived from BERT, depending on the task. Figure 1 illustrates our family of methods.
Adapting Kynkäänniemi et al. (2019)'s formulations, we define a binary function f that returns whether a sample s' ∈ S' lies inside any capture volume or hypersphere of a sample s ∈ S: f(s', S) = 1 if ‖s' − s‖₂ ≤ ‖s − NN_k(s, S)[−1]‖₂ for at least one s ∈ S, and f(s', S) = 0 otherwise, where NN_k(s, S) returns an ordered set containing s and its k-nearest neighbors in the set S, in ascending order of Euclidean distance to s. Hence, NN_k(s, S)[−1] represents the k-th nearest neighbor of s. We may refer to individual samples in S and S' as {s_1, ..., s_|S|} and {s'_1, ..., s'_|S'|}, respectively.
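The membership test described above can be sketched in a few lines. This is a brute-force illustration under the assumptions that samples are tuples of floats, that each hypersphere center belongs to the set used to compute its radius, and that the set holds more than k samples; the function names are hypothetical.

```python
import math


def kth_neighbor_radius(sample, samples, k):
    """Distance from `sample` to its k-th nearest neighbor within `samples`.

    `sample` is assumed to be a member of `samples`, so its zero distance
    to itself sorts to index 0 and index k holds the k-th neighbor.
    """
    distances = sorted(math.dist(sample, other) for other in samples)
    return distances[k]


def inside_any_hypersphere(query, samples, k):
    """Return 1 if `query` lies inside the hypersphere of any sample, else 0."""
    for center in samples:
        if math.dist(query, center) <= kth_neighbor_radius(center, samples, k):
            return 1
    return 0
```

In practice the radii would be computed once per set and cached, since recomputing them for every query is quadratic in the set size.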
The Petersen estimator (Ricker, 1975) relies on a single marking step and a single recapture step. It merely assumes that the ratio of marked samples (M) in the marking step to the population size (P) is equivalent to the ratio of recaptured samples (R) to captured samples (C) in the recapture step, i.e. M / P = R / C. The population size estimate is then P̂_Petersen = (C · M) / R.
The Petersen estimator was extended by Schnabel (1938) to incorporate multiple markings and recaptures. The population size estimate (P̂_Schnabel) is calculated from T consecutive Petersen estimates. The set of marked samples at each iteration t ∈ {1, ..., T} is defined recursively (Equation 5). ME Schnabel's first marking step is identical to ME Petersen's single marking step (M(1, S, S') = M(S, S')), with all samples in S as well as samples in S' that are inside at least one hypersphere of s being marked. By the final marking step, all samples will be marked since we iterated through all of them: M_T(S, S') = |M(T, S, S')|. For the other iterations, 1 < t < T, samples in S' that are captured, i.e. are k-nearest neighbors of the s' being iterated, but are not yet marked, are added to the marked set.
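As a small numerical illustration of the two mark-recapture estimators, the sketch below takes the counts as given. The Petersen form follows directly from the ratio assumption stated above; the Schnabel form shown is the classic textbook estimator, and whether the method here uses exactly that form is an assumption. Function and argument names are hypothetical.

```python
def petersen_estimate(marked, captured, recaptured):
    """Petersen estimate: M / P = R / C, hence P = M * C / R."""
    if recaptured == 0:
        raise ValueError("no recaptures: the estimate is undefined")
    return marked * captured / recaptured


def schnabel_estimate(captured_per_step, marked_before_step, recaptured_per_step):
    """Classic Schnabel estimate: sum_t(C_t * M_t) / sum_t(R_t)."""
    total_recaptured = sum(recaptured_per_step)
    if total_recaptured == 0:
        raise ValueError("no recaptures: the estimate is undefined")
    weighted = sum(c * m for c, m in zip(captured_per_step, marked_before_step))
    return weighted / total_recaptured
```

For instance, marking 10 samples and then recapturing 5 marked ones among 15 captured gives an estimate of 30; with a single step, the Schnabel form reduces to the Petersen one.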
After all recapture steps, which exclude the first marking step, the number of captured samples is the number of samples in S' and their respective k-nearest neighbors, plus the samples in S that are inside the hypersphere of each s'. Since all samples in S have been marked in the first marking step, the total number of recaptures is the number of samples in S inside the hypersphere of each s' plus the number of k-nearest neighbors of the iterated s' that have already been marked.
Both ME Petersen and ME Schnabel are mark-recapture methods since they rely on marking and recapturing information to estimate the population size. We further used a maximum log-likelihood method: the null model from Program CAPTURE (Otis et al., 1978). By considering the total number of marked samples (M_T) and the total number of captures (C_total) over T iterations, with T = |S ∪ S'|, we iterate through several provisional population estimates (P_CAPTURE ∈ ℕ with P_CAPTURE ≥ M_T) and compute their log-likelihood (Equation 6). The total number of captures corresponds to the number of samples in S and S' and their respective neighbors, as well as the number of samples in S inside the hypersphere of a given s' and vice-versa. The final population estimate (P̂_CAPTURE) is then the estimate that maximizes Equation 6.
Our family of methods computes its scores from the accuracy loss of each estimator. Note that, due to its iterative nature, ME Schnabel may be used to separately assess the quality and diversity of an evaluation set S_e given a reference set S_r. More specifically, quality may be calculated by ME Schnabel(S_r, S_e), whereas diversity may be measured by ME Schnabel(S_e, S_r). On the other hand, ME Petersen and ME CAPTURE are single-valued metrics, since ME Petersen(S_r, S_e) = ME Petersen(S_e, S_r) and ME CAPTURE(S_r, S_e) = ME CAPTURE(S_e, S_r). We refer to the Appendix for theoretical discussions.
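The search over provisional population sizes can be sketched as below. The log-likelihood used is the profile log-likelihood of the null model (equal capture probability for every sample and occasion) as commonly stated for Program CAPTURE; that it matches Equation 6 term by term is an assumption, as are the function names and the `max_population` search bound.

```python
import math


def null_model_log_likelihood(population, marked, captures, occasions):
    """Log-likelihood of a provisional population size under the null model.

    The capture probability is replaced by its maximum-likelihood value
    captures / (occasions * population).
    """
    slots = occasions * population  # capture opportunities, T * P
    misses = slots - captures       # opportunities without a capture
    value = math.lgamma(population + 1) - math.lgamma(population - marked + 1)
    value += captures * math.log(captures)
    if misses > 0:                  # 0 * log(0) is taken as 0
        value += misses * math.log(misses)
    value -= slots * math.log(slots)
    return value


def capture_estimate(marked, captures, occasions, max_population):
    """Grid search over provisional sizes P >= M_T for the most likely one."""
    lowest = max(marked, math.ceil(captures / occasions))
    candidates = range(lowest, max_population + 1)
    return max(
        candidates,
        key=lambda p: null_model_log_likelihood(p, marked, captures, occasions),
    )
```

When every marked sample is captured on every occasion, the most likely population size is the number of marked samples itself; fewer captures for the same number of marked samples push the estimate upward.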
To study the effects of different capture volumes, determined by different K, we used SBERT to get the sentence embeddings of 10k training sentences from MNLI (Williams et al., 2017) as the reference set, and 10k validation sentences as the evaluation set. Results are shown in Figure 2, with K ∈ {1, . . . , 40}. We observe that as K increases, the population size estimated by all estimators converges to the true population size. In turn, the scores of our family of methods also converge to their maximum value of 1.

Synthetic experiments
To simulate drops in quality and diversity, we used the MNLI dataset (Williams et al., 2017), which consists of 433k sentence pairs annotated with one out of 5 possible topics. Since we were not interested in the sentence-pair information for these experiments, we treated each sentence independently. Our reference set consists of sentences from the training set, whereas our evaluation set has sentences from the validation set. We kept the size of the reference and evaluation sets equal throughout our experiments to reduce possible method instabilities regarding sample size. We follow the experiments in Semeniuta et al. (2018) and simulate diversity loss by dropping sentences from certain topics (Section 4.1), whereas quality detriment is induced by swapping the words of each sentence (Section 4.2). For both experiments, we use SBERT embeddings from BERT-base pre-trained on the SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2017) datasets ('bert-base-nli-mean-tokens').

Mode collapse
To evaluate mode collapse, we dropped the sentences of specific topics from the evaluation set, which initially contains sentences from all the available topics. Thus, the evaluation set only contains sentences from a subset of topics. What differs at each step is the number of topics included in the evaluation set: for example, dropping one topic means that the evaluation set only contains samples from the remaining four topics. The reference set remained unaltered throughout this process. We used 4k reference and 4k evaluation samples throughout this experiment.
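The construction of a mode-dropped evaluation set can be sketched as follows. The topic labels, the fixed seed, and the function name are illustrative assumptions; the only properties taken from the description above are that whole topics are removed and that the evaluation set keeps a fixed size.

```python
import random


def mode_dropped_eval_set(sentences, topics, dropped_topics, size, seed=0):
    """Sample `size` evaluation sentences after removing the dropped topics."""
    kept = [s for s, t in zip(sentences, topics) if t not in dropped_topics]
    if len(kept) < size:
        raise ValueError("not enough sentences left after dropping topics")
    # A seeded generator keeps the sampled evaluation set reproducible.
    return random.Random(seed).sample(kept, size)
```

Calling this with a growing `dropped_topics` set while leaving the reference set untouched reproduces the step-by-step loss of diversity.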
We expect quality assessments to remain constant and diversity assessments to drop as fewer topics are represented in the evaluation set. For single-valued metrics, we expect the overall score to deteriorate throughout the mode dropping process. Results are presented in Figure 3, where we observe that our family of methods displays the expected behavior. IMPAR and FID also show the expected behavior (note that higher FID is worse since it represents a distance between the reference and evaluation distributions). On the other hand, PRD's quality assessment or precision drops significantly as mode collapse aggravates, which is not expected since the quality of the evaluation set is not affected in this experiment. We further observe that all methods show high sensitivity when only one topic is represented in the evaluation set. For example, when 4 topics are dropped, IMPAR's quality assessment shifts by ≈ 0.4, while ME Schnabel shifts by ≈ 0.5. Hence, both methods show similar variance, despite the visual contrast that originates from the different y-scales.

Word swap
To evaluate quality detriment, we swapped the words of each sentence in the evaluation set with a certain swap probability. Similarly to the mode collapse experiment, the reference set remains constant throughout this study. We used 10k reference and 10k evaluation samples. Semeniuta et al. (2018) showed that sentence embeddings derived from InferSent (Conneau et al., 2017) and Transformer (Vaswani et al., 2017) models were unable to detect similar quality perturbations. However, we observe that BERT-based sentence embeddings are able to detect them.
For this experiment, precision is expected to drop, while recall should remain constant. The overall score of single-valued metrics should deteriorate as the swap probability increases. Results are presented in Figure 4. Both the quality and diversity assessments of ME Schnabel show the expected behavior, similarly to our single-valued metrics and FID. On the other hand, the recall or diversity assessment of both PRD and IMPAR drops unexpectedly. Moreover, IMPAR's quality or precision does not drop as significantly at higher swap probabilities, which is not desirable.
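One way to implement such a perturbation is sketched below. It assumes whitespace tokenization and swaps of adjacent words, which is one reading of the word-swap procedure rather than a confirmed detail; the function name is hypothetical.

```python
import random


def swap_words(sentence, swap_probability, rng):
    """Swap adjacent words of a sentence, each pair with a given probability."""
    words = sentence.split()
    i = 0
    while i < len(words) - 1:
        if rng.random() < swap_probability:
            words[i], words[i + 1] = words[i + 1], words[i]
            i += 2  # skip the pair that was just swapped
        else:
            i += 1
    return " ".join(words)
```

A swap probability of 0 leaves the sentence untouched, while a probability of 1 swaps every consecutive pair, so the parameter gives a direct handle on the amount of induced quality loss.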

Language generation
We further assessed the text generated by ten different language generation models presented in Cífka et al. (2018). The models include a traditional language model and several types of autoencoders, namely variational, adversarial, adversarially regularized, and plain autoencoders. We used the human ratings assigned to each model's fluency presented in their work to study the correlation of our family of methods and other tested metrics to human evaluation. Reverse and forward cross-entropy, i.e. Reverse CE and Forward CE, have been commonly used to assess text generation in the past (Cífka et al., 2018;Semeniuta et al., 2018;Zhao et al., 2018). The reported Reverse CE and Forward CE results were taken from Cífka et al. (2018), obtained by training a language model on English Gigaword (Napoles et al., 2012). We refer to Cífka et al. (2018) for additional details.
We used SBERT embeddings from BERT-large trained on the SNLI and MNLI datasets ('bert-large-nli-mean-tokens') since they achieved the best reported performance in Reimers and Gurevych (2019). Note that, since human evaluation is only related to each model's fluency, we only report the quality assessment scores for ME Schnabel, PRD, and IMPAR. For our family of methods, as well as PRD and IMPAR, we iterate through K values until correlation drops and present the results with the best K of each method. Table 1 shows the Pearson (r), Kendall (τ), and Spearman (ρ) correlations to human evaluation. Overall, our family of methods achieves the highest correlations to human evaluation. Note that despite being outperformed by FID, ME Petersen still outperforms PRD and IMPAR across all correlations. Additional results with default K for our family of methods, PRD, and IMPAR, as well as a comparison with InferSent and different SBERT embeddings, are provided in the Appendix.
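For readers who want to recompute such correlations from per-model metric scores and human ratings, the three coefficients can be sketched in plain Python as below. This is an illustrative implementation (Kendall's coefficient is the tau-a variant, which ignores ties), not necessarily the routine used to produce the reported tables.

```python
import math


def pearson(x, y):
    """Pearson correlation: linear association between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Average 1-based ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            result[order[position]] = shared
        start = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    balance = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            balance += (product > 0) - (product < 0)
    return balance / (n * (n - 1) / 2)
```

Pearson rewards a linear relationship, whereas Spearman and Kendall only depend on the ordering of the systems, which is why a metric can rank systems well and still show a lower Pearson value.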

Contextualized word embeddings
We will now shift our focus to conditional language generation under finer-grained representations, i.e. contextualized word embeddings. We used embeddings from BERT-base fine-tuned on MNLI, identically to Zhao et al. (2019). For a fair comparison, we used the same embedding representations for all the methods in the following experiments. Due to the likely imbalance of reference and evaluation samples, we only report the quality assessment or precision of double-valued metrics. Similarly to Section 5, we report the results with the best K. Additional results with default K can be found in the Appendix.
Using the information of the last layers of BERT has been shown to help in several downstream tasks (Liu et al., 2019a). This has also been shown for language assessment, as seen, for example, in the fact that the best performing layers of BERTScore are often later layers (Zhang et al., 2020). MoverScore extends this thinking and aggregates the representations of the last five layers of BERT with p-means. For our methods, instead of aggregating or routing this information, we use the vector representation from the last five layers for each specific word. Thus, each word has five representations, defined as five samples, in our scheme. See Figure 5 for an illustration of this process. This also allows us to produce a better population estimate in the end due to the increase of the sample size.
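The expansion of each word into five samples can be sketched as below. The input layout `hidden_states[layer][token]`, ordered from the first to the last encoder layer, and the function name are assumptions made for illustration.

```python
def words_to_samples(hidden_states, num_layers=5):
    """Turn per-layer token vectors into one sample per (layer, token) pair.

    Only the last `num_layers` layers are kept, so a text with n tokens
    yields num_layers * n samples.
    """
    last_layers = hidden_states[-num_layers:]
    return [vector for layer in last_layers for vector in layer]
```

A three-token text encoded by a twelve-layer model would thus contribute fifteen samples instead of three, which is the increase in sample size mentioned above.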

Machine translation
We start by assessing system-level machine translations from the WMT17 metrics task (Bojar et al., 2017). We evaluated the different methods on the five language pairs provided by Zhao et al. (2019)'s implementation. Namely, we assess translations from Czech (cs), German (de), Russian (ru), Turkish (tr), and Chinese (zh) to English (en). Each language pair has around 3k reference translations with the respective evaluation translations from multiple systems (the number of systems varies per language pair).
Pearson (r) correlations with human evaluation are presented in Table 2. Our family of metrics outperforms all the rest in several language pair translations. Moreover, our metrics show the highest correlation to human evaluation when considering the average correlation across all language pairs. BERTScore results were calculated using the embeddings from the last fifth layer of the aforementioned BERT model.

Text summarization
We further assessed text summarization with the TAC-2009 dataset, consisting of news articles from ten different topics, with four reference summaries and fifty-five evaluation summaries from summarization systems per article. We evaluate each evaluation summary independently, performing a summary-level evaluation. Two scores were assigned to each evaluation summary: the pyramid score, which evaluates the semantic similarity between the reference and evaluation summaries, and the responsiveness score, which measures the overall quality of the evaluation summary in terms of grammar and content. Table 3 shows the Kendall (τ), Pearson (r), and Spearman (ρ) correlations to human evaluation for each score. Considering Kendall and Spearman correlations, our family of methods outperforms all the rest on the responsiveness score. Moreover, ME Petersen and ME CAPTURE outperform all methods on the above correlations on the pyramid score. Considering Pearson correlation, our family of methods outperforms PRD, and at least one of our metrics consistently outperforms IMPAR on both scores. We hypothesize that the lower Pearson correlations of our metrics could be explained by the instability of the population estimation process due to the small number of samples, i.e. reference and evaluation words.

Conclusion
In this work, we present a family of methods derived from popular population size estimators that have been widely used in ecology in the past several decades. We show that our family of methods is able to assess language systems under different representations effectively, i.e. using contextualized word and sentence embeddings. Our methods show a high correlation to human evaluation on challenging language generation tasks as well as the desired sensitivity to detect mode collapse and quality detriment.
In the future, we would like to evaluate our family of metrics on image generation tasks, reinforcing the general applicability of our methods. Moreover, we plan to extend our family of methods to also cover popular open-population estimation methods, where the population size may vary over time. In the end, we hope that combining the information from closed and open population methods will improve the overall assessment of language systems, further fostering the adoption of ecology methods in NLP.

A Theoretical discussions
We will briefly study the validity of our methods when assessing two equal sets, i.e. when the reference set is identical to the evaluation set. Formally, we define that:
Definition A.1. Two sets S and S' are equal if S ⊆ S' and S' ⊆ S.
Theorem A.1. Considering two equal sets S and S', ME Petersen(S, S') returns its maximum score of 1.
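A short sketch of why this holds is given below. It relies on two assumptions that are read off the description of the method rather than quoted from its equations: M counts all samples of S plus the samples of S' inside some hypersphere of S, while C and R count the samples captured and recaptured in the recapture step, and the score is one minus the accuracy loss. When S = S', every sample lies at distance 0 from a hypersphere center, so every sample is marked, captured, and recaptured.

```latex
\begin{aligned}
M &= |S| + |S'|, \qquad C = |S'| + |S|, \qquad R = |S'| + |S| \\
\hat{P}_{\text{Petersen}} &= \frac{C \cdot M}{R} = |S| + |S'| = P \\
A(P, \hat{P}_{\text{Petersen}}) &= 0 \;\Longrightarrow\; \mathrm{ME}_{\text{Petersen}}(S, S') = 1
\end{aligned}
```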
Theorem A.2. Considering two equal sets S and S', ME Schnabel(S, S') returns its maximum score of 1.
Finally, adopting Equation 8, we conclude the proof.
Theorem A.3. Considering two equal sets S and S', ME CAPTURE(S, S') returns its maximum score of 1.

B Additional experiments on dialogue generation
Correlations to human evaluation when assessing language generation with default K, as well as a comparison with InferSent and SBERT embeddings from BERT-base and BERT-large, are provided in Table 4. We observe that the relative performance between all methods does not change when compared to using the best K, with our family of methods showing the overall best performance among the compared methods and embeddings. Furthermore, SBERT-based embeddings tend to show a higher correlation than InferSent embeddings across all correlations and methods, with the exception of IMPAR's r and τ. This is in accordance with several recent works that show that contextualized embeddings from BERT seem to help across a wide variety of tasks (Liu et al., 2019b; Li et al., 2019; Gabriel et al., 2019; Mathur et al., 2019; Yoshimura et al., 2019).

C Additional experiments on machine translation
We further experimented with assessing machine translation systems using contextualized sentence embeddings. To achieve this, we use all the reference translations as reference samples and the translations of each translation system as evaluation samples. We perform this assessment individually for each translation system available for each language pair. Pearson (r) correlations are presented in Table 5. Considering the average across all language pairs, our family of methods outperforms PRD and IMPAR. Note that, as expected, using contextualized sentence embeddings shows lower performance than contextualized word embeddings (Table 6) due to the finer granularity of the assessment in the latter case.
Table 4: Correlations to human evaluation regarding the fluency of 10 different models with default K. ISENT refers to InferSent embeddings, sSBERT refers to sentence embeddings from BERT-base ('bert-base-nli-mean-tokens'), and SBERT refers to sentence embeddings from BERT-large ('bert-large-nli-mean-tokens'). For each embedding type, best scores of each correlation are underlined, while bold values represent the correlations where our methods outperform or match all of the other methods' performance. Absolute correlation values are presented for FID.
Table 6: Pearson correlations for the WMT17 metrics task using contextualized word embeddings with default K. The best correlation of each language pair is underlined. Correlations where our methods outperform or match all of the other methods are highlighted in bold.