R4C: A Benchmark for Evaluating RC Systems to Get the Right Answer for the Right Reason

Recent studies have revealed that reading comprehension (RC) systems learn to exploit annotation artifacts and other biases in current datasets. This prevents the community from reliably measuring the progress of RC systems. To address this issue, we introduce R4C, a new task for evaluating RC systems' internal reasoning. R4C requires giving not only answers but also derivations: explanations that justify predicted answers. We present a reliable, crowdsourced framework for scalably annotating RC datasets with derivations. We create and publicly release the R4C dataset, the first quality-assured dataset of this kind, consisting of 4.6k questions, each of which is annotated with 3 reference derivations (i.e. 13.8k derivations). Experiments show that our automatic evaluation metrics using multiple reference derivations are reliable, and that R4C assesses different skills from an existing benchmark.


Introduction
Reading comprehension (RC) has become a key benchmark for natural language understanding (NLU) systems, and a large number of datasets are now available (Welbl et al., 2018; Kočiskỳ et al., 2018; Yang et al., 2018, i.a.). However, it has been established that these datasets suffer from annotation artifacts and other biases, which may allow systems to "cheat": instead of learning to read and comprehend texts in their entirety, systems learn to exploit these biases and find answers via simple heuristics, such as looking for an entity with a particular semantic type (Sugawara et al., 2018; Mudrakarta et al., 2018) (e.g. given a question starting with Who, a system simply picks a person entity found in the document).
To address this issue, the community has introduced increasingly more difficult question answering (QA) problems, for example, designing tasks so that answer-related information is scattered across several articles (Welbl et al., 2018; Yang et al., 2018), i.e. multi-hop QA. However, recent studies show that such multi-hop QA also has weaknesses (Chen and Durrett, 2019; Min et al., 2019; Jiang et al., 2019), e.g. combining multiple sources of information is not always necessary to find answers. Another direction, which we follow, is evaluating a system's reasoning (Jansen, 2018; Yang et al., 2018; Thorne and Vlachos, 2018; Camburu et al., 2018; Fan et al., 2019; Rajani et al., 2019). In the context of RC, Yang et al. (2018) propose HotpotQA, which requires systems not only to give an answer but also to identify supporting facts (SFs): sentences containing information that supports the answer (see "Supporting facts" in Fig. 1 for an example).
As shown in SFs [1], [2], and [7], however, only part of the information in an SF may contribute to the necessary reasoning. For example, [1] states two facts: (a) Return to Olympus is an album by Malfunkshun; and (b) Malfunkshun is a rock band. Among these, only (b) is related to the necessary reasoning. Thus, achieving a high accuracy in the SF detection task does not fully demonstrate an RC system's reasoning ability.
This paper proposes R4C, a new RC task that requires systems to provide an answer and a derivation: a minimal explanation that justifies the predicted answer in a semi-structured natural language form (see "Derivation" in Fig. 1 for an example). R4C is short for "Right for the Right Reasons RC". Our main contributions can be summarized as follows:
• We propose R4C, which enables us to quantitatively evaluate a system's internal reasoning in a finer-grained manner than the SF detection task. We show that R4C assesses different skills from the SF detection task.
• We create and publicly release the first dataset of R4C, consisting of 4,588 questions, each of which is annotated with 3 high-quality derivations (i.e. 13,764 derivations), available at https://naoya-i.github.io/r4c/.
• We present and publicly release a reliable, crowdsourced framework for scalably annotating existing RC datasets with derivations in order to facilitate large-scale dataset construction of derivations in the RC community.
Task description

Task definition
We build R4C on top of the standard RC task. Given a question q and articles R, the task is (i) to find the answer a from R and (ii) to generate a derivation D that justifies why a is believed to be the answer to q.
There are several design choices for derivations, including whether derivations should be structured, whether the vocabulary should be closed, etc. This leads to a trade-off between the expressivity of reasoning and the interpretability of an evaluation metric. To maintain a reasonable trade-off, we choose to represent derivations in a semi-structured natural language form. Specifically, a derivation is defined as a set of derivation steps. Each derivation step d_i ∈ D is defined as a relational fact, i.e. d_i = (d_i^h, d_i^r, d_i^t), where d_i^h and d_i^t are entities (noun phrases), and d_i^r is a verb phrase representing a relationship between d_i^h and d_i^t (see Fig. 1 for an example), similar to the Open Information Extraction paradigm (Etzioni et al., 2008). d_i^h, d_i^r, and d_i^t may be phrases not contained in R (e.g. "is lead singer of" in Fig. 1).
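For illustration only, such a derivation can be written down as a small set of string triples; the triples below paraphrase the Fig. 1 example and are not the exact reference annotation released with the dataset.

```python
# One derivation step per relational fact; heads and tails are noun phrases,
# relations are (possibly paraphrased) verb phrases.
derivation = [
    ("Malfunkshun", "is", "a rock band"),
    ("Andrew Wood", "is lead singer of", "Malfunkshun"),
    ("Andrew Wood", "is a member of", "Mother Love Bone"),
]
```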

Evaluation metrics
While the output derivations are semi-structured, the linguistic diversity of entities and relations still prevents straightforward automatic evaluation. One typical solution is crowdsourced judgement, but it is costly both in terms of time and budget. We thus resort to a reference-based similarity metric.
Specifically, for an output derivation D, we assume n sets of golden derivations G_1, G_2, ..., G_n. For evaluation, we would like to assess how well the derivation steps in D can be aligned with those in G_i in the best case. For each golden derivation G_i, we calculate c(D; G_i), an alignment score of D with respect to G_i, i.e. a soft version of the number of correct derivation steps in D (so that 0 ≤ c(D; G_i) ≤ min(|D|, |G_i|)). We then find the golden derivation G* that gives the highest c(D; G*) and define precision, recall, and f1 as pr = c(D; G*) / |D|, rc = c(D; G*) / |G*|, and f1 = 2 · pr · rc / (pr + rc). An official evaluation script is available at https://naoya-i.github.io/r4c/.

Alignment score. To calculate c(D; G_i), we would like to find the best alignment between the derivation steps in D and those in G_i. See Fig. 2 for an example, where two possible alignments A_1 and A_2 are shown. As the derivation steps in D agree with those in G_i better under A_2 than under A_1, we would like to use A_2 for evaluation. We first define c(D; G_i, A_j), the correctness of D given a specific alignment A_j, as the sum of step-level similarities over the aligned pairs, c(D; G_i, A_j) = Σ_{(d, g) ∈ A_j} a(d, g), and then pick the best alignment, c(D; G_i) = max_{A_j ∈ A(D, G_i)} c(D; G_i, A_j), where a(d, g) is a similarity in [0, 1] between two derivation steps d and g, and A(D, G_i) denotes the set of all possible one-to-one alignments between the derivation steps in D and those in G_i.

For a(d, g), we consider three variants, depending on the granularity of evaluation. We first introduce two fine-grained scorers, taking only entities or only relations into account (henceforth, the entity scorer and the relation scorer): the entity scorer compares the head and tail entities of d and g with a phrase similarity measure s(·, ·) in [0, 1], and the relation scorer compares their relation phrases with the same measure. In this study, we employ a normalized Levenshtein distance for s. Finally, as a rough indication of overall performance, we also provide a full scorer that takes both the entities and the relation into account.
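As an illustration, the following Python sketch implements this metric under a few assumptions that are not spelled out above: the entity scorer is taken to be the average of the head and tail similarities, and alignments are searched exhaustively, which is feasible because derivations contain only a handful of steps. The official evaluation script at https://naoya-i.github.io/r4c/ is the authoritative implementation.

```python
from itertools import permutations


def levenshtein(a, b):
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def s(p, q):
    """Normalized Levenshtein similarity in [0, 1] between two phrases."""
    if not p and not q:
        return 1.0
    return 1.0 - levenshtein(p, q) / max(len(p), len(q))


# Derivation steps are (head, relation, tail) string triples.
def a_entity(d, g):
    # Assumption: the entity scorer averages the head and tail similarities.
    return 0.5 * (s(d[0], g[0]) + s(d[2], g[2]))


def a_relation(d, g):
    return s(d[1], g[1])


def alignment_score(D, G, a=a_entity):
    """c(D; G): score of the best one-to-one alignment between steps of D and G.

    Exhaustive search over alignments; R4C derivations have only a few steps."""
    if len(D) > len(G):
        D, G = G, D  # the scorers above are symmetric, so swapping is safe
    best = 0.0
    for assignment in permutations(range(len(G)), len(D)):
        best = max(best, sum(a(D[i], G[j]) for i, j in enumerate(assignment)))
    return best


def precision_recall_f1(D, references, a=a_entity):
    """Evaluate D against the best-matching reference derivation G*."""
    c, g_star = max(((alignment_score(D, G, a), G) for G in references),
                    key=lambda t: t[0])
    p = c / len(D) if D else 0.0
    r = c / len(g_star) if g_star else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```
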
Data collection

The main purpose of R4C is to benchmark an RC system's internal reasoning. We thus assume a semi-supervised learning scenario in which RC systems are trained to answer a given question on a large-scale RC dataset and then fine-tuned to give a correct reasoning on a smaller, reasoning-annotated dataset. To acquire a dataset of derivations, we use crowdsourcing (CS).

Crowdsourcing interface
We design our interface to annotate existing RC datasets with derivations, as a wide variety of high-quality RC datasets are already available (Welbl et al., 2018; Yang et al., 2018, etc.). We assume that RC datasets provide (i) a question, (ii) the answer, and (iii) supporting articles, i.e. articles that support the answer (optionally with SFs). Initially, in order to encourage crowdworkers (henceforth, workers) to read the supporting articles carefully, we ask workers to answer the question based on the supporting articles (see Appendix A). To reduce the workload, four candidate answers are provided. We also allow a "neither" option, as RC datasets may contain erroneous instances.
Second, we ask workers to write derivations for their answer (see Fig. 3). They click on a sentence (either an SF or a non-SF) in a supporting article (left) and then input their derivation in the form of triplets (right). They are asked to input entities and relations through free-form textboxes. To reduce the workload and encourage annotation consistency, we also provide suggestions, which include predefined prepositions as well as noun phrases and verb phrases automatically extracted from the supporting articles. We also highlight SFs if they are available for the given RC dataset.

Workflow
To discourage noisy annotations, we first deploy a qualification test. The test presents the same task described in §3.1, and we use it to manually identify workers who are competent at our task. The final annotation is carried out solely by these qualified workers.
We deploy the task on Amazon Mechanical Turk (AMT). We allow workers with ≥ 5,000 Human Intelligence Tasks of experience and an approval rate of ≥ 95.0% to take the qualification test. For the test, we pay ¢15 as a reward per instance. For the final annotation task, we assign 3 workers per instance and pay ¢30 to each worker.

Dataset
There are a large number of RC datasets that meet the criteria described in §3.1, including SQuAD (Rajpurkar et al., 2016) and WikiHop (Welbl et al., 2018). Our study uses HotpotQA (Yang et al., 2018), one of the most actively used multi-hop QA datasets. The multi-hop QA setting ensures that derivation steps are spread across documents, thereby posing an interesting unsolved research problem.
For annotation, we sampled 3,000 instances from the 90,564 training instances and 3,000 instances from the 7,405 development instances. For the qualification test and interface development, we sampled another 300 instances from the training set. We used the annotations of SFs provided by HotpotQA. We assume that the training set is used for fine-tuning RC systems' internal reasoning, and the development set is used for evaluation.

Statistics
In the qualification test, we identified 45 competent workers (out of 256 workers). To avoid noisy annotations, we filter out submissions (i) with a wrong answer and (ii) with a "neither" answer. After the filtering, we retain only instances with exactly three annotated derivations. Finally, we obtained 7,137 derivations for 2,379 instances in the training set and 7,623 derivations for 2,541 instances in the dev set. See Appendix B for annotation examples.
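As a rough illustration of this filtering step (field names such as question_id, answer, gold_answer, and derivation are hypothetical and not the released data format):

```python
from collections import defaultdict


def filter_submissions(submissions):
    """Keep derivations from submissions with the correct, non-"neither" answer,
    then keep only questions that end up with exactly three derivations."""
    by_question = defaultdict(list)
    for sub in submissions:
        if sub["answer"] != "neither" and sub["answer"] == sub["gold_answer"]:
            by_question[sub["question_id"]].append(sub["derivation"])
    return {qid: ds for qid, ds in by_question.items() if len(ds) == 3}
```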

Evaluation

Methodology
To check whether annotated derivations help humans recover answers, we set up another CS task on AMT (answerability judgement). Given a HotpotQA question and an annotated derivation, 3 workers are asked whether or not they can answer the question solely based on the derivation, choosing among three levels (yes, likely, no). We evaluate all 7,623 derivations from the dev set. For reliability, we targeted only qualified workers and paid ¢15 as a reward per instance.
To see whether each derivation step can actually be derived from its source SF, we asked two expert annotators (who are not co-authors) to check 50 derivation steps from the dev set (derivability judgement).

Results
For the answerability judgement, we obtained a Krippendorff's α of 0.263 (a fair agreement). With majority voting, we obtained the following results: YES: 95.2%, LIKELY: 2.2%, and NO: 1.3% (split: 1.3%). For the derivability judgement, 96.0% of the sampled derivation steps (48/50) were judged as derivable from their corresponding SFs by both expert annotators. Despite the complexity of the annotation task, these results indicate that the proposed annotation pipeline can identify competent workers and produce high-quality derivation annotations. For the final dev set, we retain only instances with a YES answerability judgement.
The final R4C dataset includes 4,588 questions from HotpotQA (see Table 1), each of which is annotated with 3 reference derivations (i.e. 13,764 derivations). This is the first RC dataset annotated with semi-structured, multiple reference derivations. The closest work to our dataset is the WorldTree corpus (Jansen et al., 2018), which contains 1,680 questions. Jansen et al. (2018) use experts for annotation, and the annotated explanations are grounded in a predefined, structured knowledge base. In contrast, our work proposes a non-expert annotation framework and grounds explanations in unstructured texts.

Analysis
Effect of multiple references. Do multiple crowdsourced golden derivations help us to evaluate output derivations more accurately? To verify this, we evaluated oracle derivations using one, two, or all three references. The oracle derivations were written by qualified workers for 100 dev instances.
Table 2 shows that having more references increases the performance, which indicates that references provided by different workers are indeed diverse enough to capture oracle derivations. The peak performance with #rf = 3 establishes the upper-bound performance on this dataset.
The larger improvement of the relation-level performance (+14.5) compared to that of the entity-level performance (+8.0) also suggests that relations are linguistically more diverse than entities, as we expected (e.g. is in, is a town in, and is located in are all annotated for a locational relation).

Baseline models
To analyze the nature of R4C, we evaluate the following heuristic models. IE: extracting all entity relations from SFs. CORE: extracting the core information of SFs; based on the dependency structure of an SF (with article title t), it extracts the root verb v and the first child c_r to the right of v, and outputs (t, v, c_r) as a derivation step.
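For concreteness, the snippet below sketches one way to realize the CORE heuristic with spaCy; how the released baseline actually parses sentences, and whether it outputs the child token itself or its full subtree, are assumptions here rather than details taken from the paper.

```python
import spacy

nlp = spacy.load("en_core_web_sm")


def core_step(title, supporting_fact):
    """Return (article title, sentence root, first right child of the root)."""
    sent = next(nlp(supporting_fact).sents)
    root = sent.root                 # typically the main verb of the sentence
    rights = list(root.rights)       # syntactic children to the right of the root
    if not rights:
        return None                  # the heuristic produces nothing here
    return (title, root.text, rights[0].text)
```
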
Table 3 shows a large performance gap to the human upper bound, indicating that R4C is different from HotpotQA's SF detection task: it does not simply require systems to exhaustively extract information from SFs, nor to extract only their core information. The errors of these baseline models include generating entity relations irrelevant to the reasoning (e.g. Return to Olympus is an album in Fig. 2) and missing implicit entity relations (e.g. Andrew Wood is a member of Mother Love Bone in Fig. 1). R4C introduces a new research problem: developing RC systems that can explain their answers.

Conclusions
Towards evaluating RC systems' internal reasoning, we have proposed R4C, which requires systems not only to output answers but also to give their derivations. For scalability, we have carefully developed a crowdsourced framework for annotating existing RC datasets with derivations. Our experiments have demonstrated that our framework produces high-quality derivations, and that automatic evaluation metrics using multiple reference derivations can reliably capture oracle derivations. The experiments using two simple baseline models highlight the nature of R4C, namely that the derivation generation task is not simply the SF detection task. We make the dataset, automatic evaluation script, and baseline systems publicly available at https://naoya-i.github.io/r4c/.
One immediate direction for future work is to evaluate state-of-the-art RC systems' internal reasoning on our dataset. For modeling, we plan to explore recent advances in conditional language models for jointly modeling QA and the generation of derivations.

Figure 1:
R4C, a new RC task extending the standard RC setting, requiring systems to provide not only an answer but also a derivation. The example is taken from HotpotQA (Yang et al., 2018); the supporting article shown is titled "Return to Olympus" and begins: "[1] Return to Olympus is the only album by the alternative rock band Malfunkshun. [2] It was released after the band had broken up and after lead singer Andrew Wood (later of Mother Love Bone) had died...". Sentences [1-2, 4, 6-7] are supporting facts, and [3, 5] are not.

Figure 3:
Crowdsourcing interface for derivation annotation. Workers click on sentences and create derivation steps in the form of entity-relation triplets.

Table 1:
Statistics of the R4C corpus. "st." denotes the number of derivation steps. Each instance is annotated with 3 golden derivations.

Table 2:
Performance of oracle annotators on R4C as a function of the number of reference derivations.