MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics

Posing reading comprehension as a generation problem provides a great deal of flexibility, allowing for open-ended questions with few restrictions on possible answers. However, progress is impeded by existing generation metrics, which rely on token overlap and are agnostic to the nuances of reading comprehension. To address this, we introduce a benchmark for training and evaluating generative reading comprehension metrics: MOdeling Correctness with Human Annotations. MOCHA contains 40K human judgement scores on model outputs from 6 diverse question answering datasets and an additional set of minimal pairs for evaluation. Using MOCHA, we train a Learned Evaluation metric for Reading Comprehension, LERC, to mimic human judgement scores. LERC outperforms baseline metrics by 10 to 36 absolute Pearson points on held-out annotations. When we evaluate robustness on minimal pairs, LERC achieves 80% accuracy, outperforming baselines by 14 to 26 absolute percentage points while leaving significant room for improvement. MOCHA presents a challenging problem for developing accurate and robust generative reading comprehension metrics.


Introduction
Reading comprehension (RC) has seen significant progress in the last few years, with a number of question answering (QA) datasets being created (Rajpurkar et al., 2016; Lai et al., 2017; Talmor et al., 2018). However, the majority of these datasets use a span-selection or multiple-choice (MC) format. Both formats are easy to evaluate but, in return, place restrictions on the questions that can be asked and the answers that can be returned. Furthermore, both formats hinge on distractor spans or choices for learning to be effective. Ensuring high-quality distractors is a challenging task in and of itself, and poor distractors can lead to models that exploit spurious correlations (Jia and Liang, 2017; Min et al., 2019; Geva et al., 2019). Posing RC as a generation task addresses these issues: generative RC does not require distractors, circumventing the biases they can introduce, and allows arbitrary questions and answers.

Figure 1: Generative reading comprehension example. Properly scoring the candidate requires access to the passage. Current metrics, such as BLEU, ROUGE, and METEOR, are agnostic to the end-task, while LERC is trained with the passage and question as input. As a result, LERC assigns a score that better reflects human judgement.
Unfortunately, existing metrics for evaluating text generation come with significant shortcomings. Many metrics score n-gram overlap, and it is well established that using token overlap as a measure of similarity has drawbacks (Edunov et al., 2019; Wang et al., 2020). Current metrics also only consider the reference and are agnostic to the end-task being evaluated. Fig. 1 demonstrates that this is problematic for generative RC, because scoring a candidate may require a metric to also consider the passage and the question. Without cheap and reliable evaluation, progress in generative reading comprehension has been extremely slow.
To address the need for better evaluation metrics tailored to reading comprehension, we present a dataset called MOCHA, aimed at developing learned metrics that MOdel the Correctness of candidates using Human Annotation scores. MOCHA contains human judgement scores on 40K candidates, an order of magnitude larger than prior work. The candidates come from six diverse QA datasets that test a wide range of RC phenomena, such as commonsense reasoning and understanding narratives over movie scripts. After collecting all annotations, we follow work on creating more robust evaluation sets (Kaushik et al., 2020; Gardner et al., 2020) and augment the test set of MOCHA by manually writing a small set of minimal pairs (Table 3). This set of minimal pairs serves as a harder evaluation set for probing metric robustness.
Using MOCHA, we train a Learned Evaluation metric for Reading Comprehension, which we abbreviate as LERC. We compare LERC against two sets of baselines: (1) existing metrics such as METEOR (Banerjee and Lavie, 2005) and BERTScore (Zhang et al., 2019); and (2) a sentence similarity model trained on STS-B (Cer et al., 2017). To ensure fair comparison, we evaluate LERC in an out-of-dataset setting: LERC is trained on all datasets except the one it is being evaluated on. On the test set, LERC outperforms baselines by as much as 36 Pearson correlation points, and on the minimal pairs set, by as much as 26 accuracy points. Error analysis and minimal pair results indicate that there is substantial room to improve the robustness of LERC and its sensitivity to different linguistic phenomena. We hope that MOCHA and LERC enable a continual cycle of generative RC model and dataset development that will make it easier to collect more diverse and useful candidates, in turn allowing better learned metrics to be trained.

Table 1: Example instances with human judgement scores from MOCHA, highlighting the diverse phenomena that an evaluation metric needs to handle. These phenomena include resolving coreference, dealing with factual correctness, understanding paraphrases, and understanding semantic roles.

A Description of MOCHA
Reading comprehension is the task of probing how well systems can understand passages of text. Framing reading comprehension as a generation problem provides a great deal of flexibility, but introduces the challenging problem of evaluation: the passage and the question both add to the complexity of judging whether a candidate answer is correct (Table 1). To handle this challenge, we propose to train a generative reading comprehension metric, which first requires gathering a large set of human judgement scores.
In this section, we present MOCHA, a dataset that pairs reading comprehension instances, each consisting of a passage, question, and reference, with candidates and human judgement scores. We describe the process of gathering candidates, collecting human judgement scores, and creating minimal pairs for evaluation.

Figure 2: A compressed version of the Mechanical Turk interface for evaluating answer correctness. Workers were asked to score (1 to 5) how similar a candidate is to a reference using the passage and the question.

Datasets
Candidates in MOCHA come from 6 constituent QA datasets that are diverse in their domains and answer types. This ensures that training and evaluation with MOCHA does not overfit to the characteristics of any constituent dataset.
NarrativeQA (Kociský et al., 2017) tests reasoning about events, entities, and their relations on movie scripts and book summaries.
MCScript (Ostermann et al., 2018) tests reasoning on stories written for a child-level reader.
CosmosQA (Huang et al., 2019) tests commonsense reasoning over everyday narratives.
SocialIQA (Sap et al., 2019) tests social reasoning with passages constructed from a knowledge base.
DROP (Dua et al., 2019) tests predicate argument structure and numerical reasoning on Wikipedia articles concerning American football games, census results, and history.
Quoref (Dasigi et al., 2019) tests coreferential reasoning on Wikipedia articles.
NarrativeQA was created as a generative RC dataset. CosmosQA, MCScript, and SocialIQA were created as MC datasets, which we re-purpose as generative datasets by using the correct choice as the reference. Our motivation for doing this is that the number of generative QA datasets is quite small, which we attribute in part to the shortcomings of existing evaluation metrics.
The main focus of this work is in developing and evaluating metrics for generative RC. However, we wanted to see whether a learned metric could do well on span-selection datasets. We collected candidates on two span-based datasets, DROP and Quoref, to test this.

Collecting Candidates
Candidates for all four generative datasets are generated using backtranslation (Sennrich et al., 2016) and a fine-tuned GPT-2 model (Radford et al., 2019). We also generate candidates for NarrativeQA and MCScript using a trained MHPG model (Bauer et al., 2018); we tried MHPG for CosmosQA and SocialIQA as well, but the candidates were of poor quality. Unique to NarrativeQA, each question has two references; we treat the second reference as a candidate to be annotated if it has low n-gram overlap with the first. We use a span-selection BERT-based model to generate candidates for Quoref, and NAQANet (Dua et al., 2019) and NABERT models for DROP.
Models are trained on the training sets of each constituent dataset, and candidates are produced on instances from the validation set (and test set, if available). We filtered out candidates that exactly matched the reference, as well as instances in DROP where the reference and the candidate are both numbers. In total, MOCHA contains 40K candidates, large enough for training a learned metric as well as for evaluating current and future metrics.
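The filtering described above can be sketched as follows; the field names and the loose numeric check are illustrative rather than our exact preprocessing code.

```python
# Illustrative sketch of the candidate filtering described above.
# Each instance is assumed to be a dict with "reference", "candidate",
# and "dataset" fields; these names are hypothetical.

def is_number(text: str) -> bool:
    """Loose numeric check used only for the DROP filter below."""
    try:
        float(text.replace(",", ""))
        return True
    except ValueError:
        return False

def keep_candidate(instance: dict) -> bool:
    ref = instance["reference"].strip().lower()
    cand = instance["candidate"].strip().lower()
    # Drop candidates that exactly match the reference.
    if ref == cand:
        return False
    # For DROP, drop instances where reference and candidate are both numbers.
    if instance["dataset"] == "drop" and is_number(ref) and is_number(cand):
        return False
    return True

candidates = [
    {"dataset": "drop", "reference": "24", "candidate": "17"},
    {"dataset": "quoref", "reference": "Anna", "candidate": "anna"},
    {"dataset": "narrativeqa", "reference": "On the beach.", "candidate": "By the sea."},
]
filtered = [inst for inst in candidates if keep_candidate(inst)]
print(filtered)  # only the NarrativeQA instance survives
```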

Annotation Procedure
Annotations are collected with Mechanical Turk using the interface in Fig. 2. Workers are asked to score candidate answers on an ordinal scale from 1 to 5. We start by collecting a single annotation per candidate. Following this, candidates are split into training, validation, and test sets such that all candidates from a passage are contained within a dataset split. For instances in our validation and  test sets, we collect one additional human judgement score per candidate for span-based datasets, and two additional human judgement scores per candidate for generative datasets. Multiple annotations for a given candidate are averaged to form a gold annotation. More details such as payout and qualification testing are provided in Appendix D.
We calculated inter-annotator agreement using Krippendorff's alpha (Krippendorff, 2011) on the validation sets of all 6 constituent datasets. We choose this measure because it applies to our setting, where there are multiple annotators per instance and the annotators vary between instances. Agreement on our 6 datasets ranges from 0.71 to 0.92 (average = 0.82), indicating strong agreement.
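For illustration, this agreement measure can be computed with the open-source `krippendorff` package; the toy matrix below (annotators as rows, candidates as columns, NaN for candidates an annotator did not score) and the ordinal measurement level are a sketch of the setup, not our collection scripts.

```python
# Sketch: Krippendorff's alpha over a small toy annotation matrix.
# Rows are annotators, columns are candidates; np.nan marks candidates
# an annotator did not score (annotators vary between instances).
import numpy as np
import krippendorff  # pip install krippendorff

scores = np.array([
    [5.0,    2.0,    np.nan, 4.0],
    [4.0,    1.0,    3.0,    np.nan],
    [np.nan, 2.0,    3.0,    5.0],
])
alpha = krippendorff.alpha(reliability_data=scores,
                           level_of_measurement="ordinal")
print(f"alpha = {alpha:.2f}")
```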

Statistics for MOCHA
Statistics of instances and dataset splits in MOCHA are provided in Table 2. The number of unique passages varies considerably across datasets: NarrativeQA, which has the longest passages, has few unique passages, while SocialIQA has a unique passage for each question/reference pair. The number of candidates also varies across datasets. The most pronounced outlier is DROP, where we collected a tenth of the candidates compared to the other datasets; this is because we filtered out instances where both the candidate and reference were numbers, leaving far fewer candidates to annotate. The number of candidates exceeds the number of question/reference pairs because, for each pair, we generated multiple candidates using different generation sources (e.g., backtranslation, different model outputs). Fig. 3 shows the annotation score distribution on the training set of MOCHA. Score distributions are right-skewed because we did not collect annotations when the reference exactly matched the candidate. The right skew is most pronounced for Quoref, because the number of ways a candidate can receive a perfect score while not matching the reference is limited in a span-extraction format.

Limitations and Robust Evaluation with Minimal Pairs
Candidates in MOCHA come from existing models so that a metric learned on this data will be most applicable to current research. However, as research on generative reading comprehension models is presently limited, the strength of these models can be low. Fig. 4 shows that generative QA models struggle to produce quality answers when asked about commonsense scenarios: the majority of 5's in CosmosQA and SocialIQA are produced via backtranslation, while GPT-2 struggles to produce "correct" candidates. This raises an issue with evaluation: a metric can look strong when evaluated on current model outputs, but may in fact struggle in the future when QA systems produce better answers. Thus, using only these candidates for evaluation could lead to overconfidence in a learned metric's capabilities.
We take inspiration from recent work on creating more robust evaluations (Kaushik et al., 2020; Gardner et al., 2020) and augment the test set of MOCHA with a small number of minimal pairs created by the authors. Given a passage, question, and reference from the test set, we manually create two new candidates, c_1 and c_2, which form a minimal pair. Accompanying c_1 and c_2 are human judgement scores, s_1 and s_2, collected using the same interface as in Fig. 2. Each minimal pair is created so that c_1 has a higher score (i.e., is a better answer) than c_2, and is designed to capture a particular linguistic phenomenon (see Table 3). Using this set of minimal pairs, we can study how often a metric prefers the better candidate. We create 200 minimal pairs (50 for each generative QA dataset), which we use for evaluation separately from the original test set.

A Learned Metric
We provide details on LERC, our learned metric. LERC is initialized using BERT-base (Devlin et al., 2019). We define as input a tuple consisting of a passage p, a question q, a reference answer a, and a candidate answer â. The input to BERT is structured as

[CLS] p [SEP] q [SEP] a [SEP] â [SEP]

BERT returns a hidden state for each input token, and we use the first hidden state, h_[CLS], as the pooled representation of the input.

Figure 5: LERC is a BERT model that has been fine-tuned on human judgement scores. LERC takes as input a passage, question, reference, and candidate, and returns a score rating the "correctness" of the candidate.
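For illustration, the sketch below shows how such an input could be assembled and pooled with HuggingFace Transformers; the exact segment layout and tokenization details in LERC may differ, and the example texts are invented.

```python
# Sketch: encode (passage, question, reference, candidate) for BERT and
# take the first hidden state (the [CLS] position) as the pooled input.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

passage = "The train was slow, so we missed our connection in Montreal."
question = "What happened because the train was slow?"
reference = "They missed their connection."
candidate = "The travellers missed their connecting train."

# One way to realize [CLS] p [SEP] q [SEP] a [SEP] â [SEP]: put the passage
# in the first segment and join the remaining pieces with extra [SEP]s.
text_b = f"{question} [SEP] {reference} [SEP] {candidate}"
inputs = tokenizer(passage, text_b, return_tensors="pt", truncation=True)

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, 768)
h_cls = hidden[:, 0]                              # pooled representation
```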

Fine-Tuning with Human Judgements
Our goal is to train BERT to mimic human judgements. Given a set of input tuples {(p, q, a, â)}_{i=1}^{n} and a set of human judgement scores {y}_{i=1}^{n}, we apply a regression layer on top of the pooled representation (Fig. 5) to produce a predicted score ŷ, and train with an MSE loss between ŷ and y.
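Continuing the encoding sketch above, a regression head and MSE objective might look as follows; the layer shape, learning rate, and optimizer are illustrative choices, not necessarily those used for LERC.

```python
# Sketch: linear regression head over the pooled [CLS] state, trained
# with MSE against the averaged human judgement score y.
# `encoder` and `inputs` come from the encoding sketch above.
import torch
import torch.nn as nn

class CorrectnessRegressor(nn.Module):
    def __init__(self, encoder, hidden_size: int = 768):
        super().__init__()
        self.encoder = encoder                  # e.g. a BertModel
        self.score = nn.Linear(hidden_size, 1)  # y_hat = w . h_cls + b

    def forward(self, **batch):
        h_cls = self.encoder(**batch).last_hidden_state[:, 0]
        return self.score(h_cls).squeeze(-1)    # predicted correctness score

model = CorrectnessRegressor(encoder)
loss_fn = nn.MSELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

y = torch.tensor([4.5])          # averaged human judgement for this candidate
y_hat = model(**inputs)          # forward pass over the encoded tuple
loss = loss_fn(y_hat, y)
loss.backward()
optimizer.step()
```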

Pre-Training the Learned Metric
Learning the interactions between the input components can be difficult with only human judgement fine-tuning. To overcome this, we pre-train on four multiple-choice QA datasets: BoolQ (Clark et al., 2019a), MCTest (Richardson et al., 2013), RACE (Lai et al., 2017), and MultiRC (Khashabi et al., 2018). We use the same input structure as in fine-tuning, but the reference and candidate are replaced by two answer choices, a_1 and a_2:

[CLS] p [SEP] q [SEP] a_1 [SEP] a_2 [SEP]

We pre-train BERT via 3-way classification to predict whether a_1 is the correct answer, a_2 is the correct answer, or a_1 and a_2 are both correct. MultiRC has multiple correct answers per question; for the other three datasets, we create additional instances where both a_1 and a_2 are correct by duplicating the correct answer.
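A sketch of this pre-training objective is shown below; the label convention (0: a_1 correct, 1: a_2 correct, 2: both correct), the use of BertForSequenceClassification, and the example texts are illustrative choices rather than our exact implementation.

```python
# Sketch: 3-way classification pre-training on multiple-choice QA.
# Labels: 0 -> a1 is correct, 1 -> a2 is correct, 2 -> both are correct.
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)

passage = "Tom studied all week and passed the exam easily."
question = "Why did Tom pass the exam?"
a1, a2 = "Because he studied all week.", "Because the exam was cancelled."

# Same layout as fine-tuning, with the reference/candidate slots
# replaced by the two answer choices.
text_b = f"{question} [SEP] {a1} [SEP] {a2}"
batch = tokenizer(passage, text_b, return_tensors="pt", truncation=True)
labels = torch.tensor([0])        # a1 is the correct answer

out = model(**batch, labels=labels)
out.loss.backward()               # cross-entropy over the 3 classes
```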

Experiments
Training LERC: We use the PyTorch (Paszke et al., 2019), HuggingFace Transformers (Wolf et al., 2019), and AllenNLP (Gardner et al., 2017) libraries to implement LERC. We pre-train LERC before fine-tuning on MOCHA. We evaluate LERC in two settings, an out-of-dataset (OOD) setting and an all-datasets (AD) setting. In the OOD setting, we train and tune LERC on all datasets in MOCHA except the dataset we are evaluating on. This reflects the use case where we want to apply LERC to evaluate a new dataset where we do not have human judgement scores. In the AD setting, we train on all datasets in MOCHA and evaluate on all datasets. All results reported for LERC are the average of three runs using the best set of hyperparameters found on the validation set of MOCHA.
Baselines: We compare LERC against BLEU-1 (Papineni et al., 2001), ROUGE-L (Lin, 2004), METEOR (Banerjee and Lavie, 2005), and BERTScore (Zhang et al., 2019). We also compare LERC against a BERT-base model fine-tuned on the sentence similarity task STS-B (Cer et al., 2017). Results for BERT STS-B are the average of three runs using the best set of hyperparameters found on the validation set of STS-B. All baselines are agnostic to the passage and the question.
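As a point of reference, several of these baselines can be computed with common open-source packages; the snippet below is a sketch that assumes the nltk, rouge-score, and bert-score libraries with their default settings, not necessarily the configurations used in our experiments.

```python
# Sketch: reference-only baseline scores for a single (reference, candidate) pair.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "They wouldn't have missed their connection"
candidate = "They would not have missed the train connection"

# BLEU-1: all weight on unigram precision.
bleu1 = sentence_bleu([reference.split()], candidate.split(),
                      weights=(1.0, 0, 0, 0),
                      smoothing_function=SmoothingFunction().method1)

# ROUGE-L F-measure.
rougeL = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)["rougeL"].fmeasure

# BERTScore F1 (expects lists of candidates and references).
_, _, f1 = bert_score([candidate], [reference], lang="en")

print(bleu1, rougeL, f1.item())
```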

Correlation Results
We evaluate the baselines and OOD LERC in Table 4 using Pearson correlation. LERC outperforms the baseline metrics despite being trained in an out-of-dataset setting. METEOR does surprisingly well despite relying on n-gram overlap. Interestingly, the sentence similarity model does better than the baseline metrics while still falling behind LERC.
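Concretely, the correlation for a dataset is the Pearson r between a metric's scores and the (averaged) human judgements; a sketch with toy numbers:

```python
# Sketch: Pearson correlation between a metric's scores and human judgements.
from scipy.stats import pearsonr

human_scores  = [5.0, 1.0, 3.5, 4.0, 2.0]   # toy gold annotations
metric_scores = [4.6, 1.3, 2.9, 4.2, 2.5]   # toy metric outputs

r, _ = pearsonr(metric_scores, human_scores)
print(f"Pearson r = {r:.2f}")
```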
We also study whether having human judgements for a particular dataset helps. We present results in Table 5 on the validation set of MOCHA when LERC is trained in an AD setting. Having human judgements for the target dataset is always helpful.

Error Analysis of LERC
We gather the 10 validation instances per generative dataset (40 instances total) with the highest absolute difference between the human judgement score and the LERC score, and categorize the errors made by LERC in Table 7. A large source of error is the inability to leverage the passage correctly, as well as difficulty handling large lexical gaps between references and correctly paraphrased candidates. The "Other" category includes understanding semantic roles and misspellings of the reference.

Example error:
Passage: The train was slow and ambling, so much so that we were 2 hours late when we arrived in Montreal, missing our connection.
Q: What might be true if the freight trains didn't cause a delay?
Ref: They wouldn't have missed their connection
Cand: they couldn't help noticing their connection
Human Score: 1; LERC Score: 4.2

Table 7: Error analysis of LERC. We take the 10 validation instances per generative dataset (40 total) with the largest difference between the score assigned by LERC and the score assigned by humans, and group the highest-error instances by the source of the error.

Ablation Results
We study five ablations of OOD LERC, with results in Table 6. None of the ablations involve any pre-training. Looking at the ablations, several interesting phenomena emerge.
Pre-training is important with such a complex input structure: removing pre-training while still using the passage and question as input hurts performance. Ablations of LERC that do not use the passage but still have the reference and candidate as input fall only slightly behind the complete metric.
One explanation is that current generative QA models may not generate many candidates that would require the metric to use the passage; therefore, even the complete version of LERC may have learned to ignore the passage. We explored this above when conducting an error analysis of LERC. As sanity checks for dataset biases, we also evaluate impoverished ablations that should not perform well: when the model has access only to the reference or only to the candidate. These ablations correlate quite poorly with human judgements. The correlation is slightly positive for both, however, perhaps measuring the grammaticality of a candidate or the difficulty of matching long references.

Minimal Pair Results
We now present results on the set of minimal pairs. We use these minimal pairs to evaluate preference: given a minimal pair of candidates (c_1, c_2), what percentage of the time does a metric prefer the better candidate? For cases where a metric assigns the same score to both candidates, we award a half point.
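The preference computation can be sketched as follows; the example scores are illustrative.

```python
# Sketch: preference accuracy over minimal pairs, with a half point for ties.
# Each pair holds a metric's scores for c1 (the better answer) and c2.
def preference_accuracy(pairs):
    points = 0.0
    for score_c1, score_c2 in pairs:
        if score_c1 > score_c2:
            points += 1.0   # metric prefers the better candidate
        elif score_c1 == score_c2:
            points += 0.5   # tie: half a point
    return points / len(pairs)

pairs = [(4.1, 2.0), (3.0, 3.0), (1.5, 4.0)]   # toy metric scores
print(preference_accuracy(pairs))               # 0.5
```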
Results are reported in terms of accuracy in Table 8. N-gram based metrics are close to random, which aligns with intuition because the minimal pairs were created such that both candidates have similar token overlap with the reference. The sentence similarity model does much better, likely because it generalizes beyond token overlap. Finally, LERC (OOD setting) does best, suggesting that while there is still room for improvement, the phenomena targeted by the minimal pairs are captured when evaluated using preference.

Table 9: Analysis of LERC vs. BLEU-1. We take the 10 validation instances per generative dataset (40 total) with the largest difference between the score assigned by LERC and the score assigned by BLEU-1, and group these instances by the source of the difference.

LERC vs BLEU
To understand the differences in behavior between LERC and the popular BLEU metric, we collect the 10 validation instances per generative dataset with the highest absolute difference between the BLEU-1 and LERC scores. We categorize the sources of these differences in Table 9. In about 90% of the cases, the gap is due to BLEU scoring candidates too low (e.g., not capturing paraphrases). In the remaining cases, the gap is due to LERC over-scoring the candidate, usually because the reference and candidate are superficially similar (e.g., both are numbers).
Related Work

There is a long history of developing evaluation metrics for text generation, which generally fall into one of three categories. The first consists of metrics that use some variant of n-gram matching (Papineni et al., 2001; Lin, 2004; Banerjee and Lavie, 2005). They are easy to implement, but lack flexibility by focusing only on token overlap. The second category of metrics sidesteps some of these issues by computing a softer similarity score over token embeddings (Clark et al., 2019b; Zhang et al., 2019). However, it is unclear how to tailor them to question answering, where the passage and question should be taken into account. The final category consists of metrics learned end-to-end from human judgements (Cui et al., 2018; Sellam et al., 2020). These metrics are flexible in that they can be tuned to the specific evaluation setting, but they depend on a large corpus of human judgement scores for training. We hope that the release of MOCHA pushes the development of QA metrics that fall into this category.
MOCHA is directly inspired by the annual WMT Metrics Shared Task (Macháček and Bojar, 2014; Stanojević et al., 2015; Bojar et al., 2016, 2017; Ma et al., 2018, 2019). Participants submit automatic translations, and human judgement scores are collected for the submitted translations. The annotations collected as part of the WMT Metrics Shared Task have made it easy to evaluate and create new translation metrics (Popović, 2015; Ma et al., 2017; Shimanaka et al., 2018). In a similar vein, SummEval is a recently released dataset that evaluates a number of evaluation metrics for summarization (Fabbri et al., 2020).

Conclusion
We present MOCHA, a dataset of human judgement scores for training and evaluating generative reading comprehension metrics. Using MOCHA, we train a learned metric, LERC, that outperforms all existing metrics and is much more robust when evaluated on a set of minimal pairs.
While we have demonstrated that LERC is a better metric for evaluating generative reading comprehension than any existing metric, considerable work remains. Error analysis reveals gaps in LERC's ability to handle certain phenomena, such as correctly leveraging the passage. Future work involves collecting data to address the weaknesses of LERC. We also anticipate a continual cycle of generative RC model and dataset development that will enable easier collection of more diverse and useful candidates. This in turn will allow better learned metrics, which can be used to evaluate ever more complex models.

[...] to pass the test. After qualification testing, we run a small trial: we release 200 candidates and gather 5 human judgements per candidate to get a sense of annotation agreement and to see whether our instructions and examples need to be revised. Finally, during the full dataset collection process, we solicit human judgements on all candidates. Here, each HIT is an aggregate of 10 candidates that all share the same passage, to amortize the cost of reading the passage, and workers are paid 40 cents per HIT. During dataset collection, we randomly sample annotations to check for quality and remove workers that consistently do a poor job.
Workers are paid for working on any of the three stages. The total cost of collecting MOCHA is about $6,000.

E Correlation Results based on Generation Source
We supplement Table 4 by calculating correlation results per generation source for the generative datasets in Table 10. We find that LERC handles candidates from different generation sources with roughly the same performance.