HoVer: A Dataset for Many-Hop Fact Extraction And Claim Verification

We introduce HoVer (HOppy VERification), a dataset for many-hop evidence extraction and fact verification. It challenges models to extract facts from several Wikipedia articles that are relevant to a claim and classify whether the claim is supported or not-supported by the facts. In HoVer, the claims require evidence to be extracted from as many as four English Wikipedia articles and embody reasoning graphs of diverse shapes. Moreover, most of the 3/4-hop claims are written in multiple sentences, which adds to the complexity of understanding long-range dependency relations such as coreference. We show that the performance of an existing state-of-the-art semantic-matching model degrades significantly on our dataset as the number of reasoning hops increases, hence demonstrating the necessity of many-hop reasoning to achieve strong results. We hope that the introduction of this challenging dataset and the accompanying evaluation task will encourage research in many-hop fact retrieval and information verification.


Introduction
The proliferation of social media platforms and digital content has been accompanied by a rise in deliberate disinformation and hoaxes, leading to polarized opinions among the masses. With the increasing number of inexact statements, there is growing interest in fact-checking systems that can verify claims based on automatically retrieved facts and evidence. FEVER (Thorne et al., 2018) is an open-domain fact extraction and verification dataset closely related to this real-world application. However, more than 87% of the claims in FEVER require information from a single Wikipedia article, while real-world claims might refer to information from multiple sources. QA datasets like HOTPOTQA (Yang et al., 2018) and QAngaroo (Welbl et al., 2018) represent the first efforts to challenge models to reason with information from at most three documents. However, Chen and Durrett (2019) and Min et al. (2019) show that single-hop models can achieve good results on these multi-hop datasets. Moreover, most models were also shown to degrade in adversarial evaluation (Perez et al., 2020), where word-matching reasoning shortcuts are suppressed by extra adversarial documents (Jiang and Bansal, 2019). In the HOTPOTQA open-domain setting, the two supporting documents can be accurately retrieved by a neural model exploiting a single hyperlink (Nie et al., 2019b; Asai et al., 2020).
Hence, while providing very useful starting points for the community, FEVER is mostly restricted to a single-hop setting, and existing multi-hop QA datasets are limited by the number of reasoning steps and the word overlap between the question and all evidence. An ideal multi-hop example should have at least one piece of evidence (supporting document) that cannot be retrieved with high precision by shallowly performing direct semantic matching with only the claim. Instead, uncovering this document requires information from previously retrieved documents. In this paper, we try to address these issues by creating HOVER (i.e., HOppy VERification), whose claims (1) require evidence from as many as four English Wikipedia articles and (2) contain significantly less semantic overlap between the claims and some supporting documents to avoid reasoning shortcuts. We create HOVER with 26k claims in three stages. In stage 1 (left box in Fig. 1), we ask a group of trained and evaluated crowd-workers to rewrite the question-answer pairs from HOTPOTQA (Yang et al., 2018) into claims that mention facts from two English Wikipedia articles. We then introduce extra hops to a subset of these 2-hop claims by asking crowd-workers to substitute an entity in the claim with information from another English Wikipedia article that describes the original entity. We then repeat this process on these 3-hop claims to further create 4-hop claims. To make many-hop claims more natural and readable, we encourage crowd-workers to write the 3/4-hop claims in multiple sentences and connect them using coreference. An entire evolution history from 2-hop claims to 3/4-hop claims is presented in the leftmost box in Fig. 1 and in Table 1, where the latter further presents the reasoning graphs of various shapes embodied by the many-hop claims.
In stage 2 (the central box in Fig. 1), we create claims that are not supported by the evidence by mutating the claims collected in stage 1 with a combination of automatic word/entity substitution and human editing. Specifically, we ask the trained crowd-workers to rewrite a claim by making it either more specific/general than the original claim or by negating it. We ensure the quality of the machine-generated claims using the human validation detailed in Sec. 2.2. In stage 3, we follow Thorne et al. (2018) to label the claims as SUPPORTED, REFUTED, or NOTENOUGHINFO. However, we find that the decision between REFUTED and NOTENOUGHINFO can be ambiguous for many-hop claims (the number of hops of a claim is the same as the number of supporting documents for the claim), and even the high-quality, trained annotators from Appen, instead of MTurk, cannot consistently choose the correct label between these two classes. Recent works (Pavlick and Kwiatkowski, 2019; Chen et al., 2020a) have raised concerns over the uncertainty of NLI tasks with categorical labels and proposed to shift to a probabilistic scale. Since this work mainly targets many-hop retrieval, we combine REFUTED and NOTENOUGHINFO into a single class, NOT-SUPPORTED. This binary classification task is still challenging for models given incomplete retrieved evidence, as we will explain later.
Next, we introduce the baseline system and demonstrate its limited ability to address many-hop claims. Following a state-of-the-art system (Nie et al., 2019a) for FEVER, we build the baseline with a TF-IDF document retrieval stage and three BERT models fine-tuned to conduct document retrieval, sentence selection, and claim verification respectively. We show that the top-100 documents retrieved by the bi-gram TF-IDF (Chen et al., 2017) can only recover all supporting documents for 80% of 2-hop claims, 39% of 3-hop claims, and 15% of 4-hop claims. The performance of downstream neural document and sentence retrieval models also degrades significantly as the number of supporting documents increases. These results suggest that the possibility of a word-matching shortcut is reduced significantly in 3/4-hop claims. Because the complete set of evidence cannot be retrieved for most claims, the claim verification model only achieves 73.7% accuracy in classifying the claims as SUPPORTED or NOT-SUPPORTED, while the model given all evidence predicts 81.2% of the claims correctly under this oracle setting. We further provide a sanity check to show that the model can only correctly predict the labels for 63.7% of claims without any evidence. This suggests that the claims contain limited clues that can be exploited independently of the evidence during verification, and that a strong retrieval method capable of many-hop reasoning can improve the claim verification accuracy.

Figure 1: Data collection flow chart for HOVER. In the first stage, we create claims from HOTPOTQA, validate them, and extend them to more hops. In the second stage, we apply a variety of mutations to the claims, performed by crowd-workers and automatic methods. In the final stage, we ask crowd-workers to label the resulting claims.
In terms of HOVER as an integrated task, the best pipeline can only retrieve the complete set of evidence and correctly verify the claim for 14.9% of dev-set examples, falling significantly behind the 81% human performance.
Overall, we provide the community with a novel, challenging, and large many-hop fact extraction and claim verification dataset with over 26k claims that can be composed of multiple sentences connected by coreference and that require evidence from as many as four Wikipedia articles. We verify that the claims are challenging, especially in the 3/4-hop cases, by showing the limited performance of a state-of-the-art system for both retrieval and verification. We hope that the introduction of HOVER and the accompanying evaluation task will encourage research in complex many-hop reasoning for fact extraction and claim verification.

Data Collection
The many-hop fact verification dataset, HOVER, is a collection of human-written claims about facts in English Wikipedia articles, created in three main stages (shown in Fig. 1). In the Claim Creation stage (Sec. 2.1), we ask trained annotators on Appen to create claims by rewriting question-answer pairs (Sec. 2.1.1) from the HOTPOTQA dataset (Yang et al., 2018). The validated 2-hop claims are then extended (Sec. 2.1.2) to include facts from more Wikipedia articles. In the Claim Mutation stage (Sec. 2.2), claims generated from the above two processes are mutated with human editing and automatic word substitution. Finally, in the Claim Labeling stage (Sec. 2.3), trained crowd-workers classify the original and mutated claims as SUPPORTED, REFUTED, or NOTENOUGHINFO. We merge the latter two labels into a single NOT-SUPPORTED class, owing to the ambiguity explained in Sec. 2.3. The guidelines and design for every task are shown in the appendix.

Claim Creation
The goal is to create claims by rewriting question-answer pairs from HOTPOTQA (Yang et al., 2018) and to extend these claims to include facts from more documents (shown in the left box of Fig. 1).

Creating 2-Hop Claims from HOTPOTQA
To begin with, crowd-workers are asked to combine question-answer pairs to write claims. These claims require information from two Wikipedia articles. Based on the guidelines, the annotators can neither exclude any information from the original QA pairs nor introduce any new information.
Validating Created Claims. We then train another group of crowd-workers to validate the claims created from Sec. 2.1.1. To ensure the quality of the claims, we only keep those where at least two out of three annotators agree that it is a valid statement and covers the same information from the original question-answer pair. These validated 2-hop claims are automatically labeled as SUPPORTED.

Extending to 3-Hop and 4-Hop Claims
Consider a valid 2-hop claim c from Sec. 2.1.1 that includes facts from two supporting documents A = {a_1, a_2}. We extend c to a new, 3-hop claim ĉ by substituting a named entity e in c with information from another English Wikipedia article a_3 that describes e. The resulting 3-hop claim ĉ hence has three supporting documents {a_1, a_2, a_3}. We then repeat this process to extend the 3-hop claims to include facts from a fourth document. We use two methods to substitute different entities e, leading to 4-hop claims with various reasoning graphs.
Method 1. We consider the entity e to be the title of a document a_k ∈ A. We search for English Wikipedia articles â ∉ A whose text body contains a hyperlink to e. We exclude any â whose title is mentioned in the text body of one of the documents in A. We then ask crowd-workers to select a_3 from a candidate group of â and write the 3-hop claim ĉ by replacing e in c with a relative clause or phrase using information from a sentence s ∈ a_3.
Method 2. In this method, we consider e to be any other entity in the claim that is not the title of a document a_k ∈ A but exists as a Wiki hyperlink in the text body of one document in A. The last 4-hop claim in Table 1 is created via this method, with the entity e being "NASCAR". The remaining steps are the same as in Method 1: we search for English Wikipedia articles â ∉ A whose text body contains a hyperlink to e and ask crowd-workers to replace e with information from a_3.
Task Setup. We employ Method 1 to extend the collected 2-hop claims for which we can find at least one â. Then we use both Method 1 and Method 2 to extend the 3-hop claims to 4-hop claims with various reasoning graphs. In a 3-document reasoning graph (a chain), the title of the middle document was substituted out during the extension from the 2-hop claim and thus does not appear in the 3-hop claim. Therefore, Method 1, which replaces the title of one of the three documents in the claim, can only be applied to either the leftmost or the rightmost document. In order to append the fourth document to the middle document of the 3-hop reasoning chain, we have to substitute a non-title entity in the 3-hop claim, which is achieved by Method 2. In Table 1, the last 4-hop claim with a star-shaped reasoning graph is the result of applying Method 1 for the 3-hop extension and Method 2 for the 4-hop extension, while the first two 4-hop claims are created by applying Method 1 twice. We ask the crowd-workers to submit the index of the selected sentence, and we add this sentence to the supporting facts of the 2-hop claim to form the supporting facts of the new, 3-hop claim.

Claim Mutation
We mutate the claims created in Sec. 2.1 to collect new claims that are not necessarily supported by the facts. We employ four types of mutation methods (shown in the middle column of Fig. 1) that are explained in the following sections.
Making a Claim More Specific or General. A more specific claim contains information that is not in the original claim; a more general claim contains less information than the original one. We design guidelines (shown in the appendix) and quizzes to train the annotators to use natural logic. We forbid the annotators from replacing the supporting document titles in a claim, to ensure that verifying the mutated claim requires the same set of evidence as the original claim. We also forbid mutating location entities (e.g., Manhattan → New York), as this may introduce external evidence ("Manhattan is in New York") that is not in the original set of evidence.
Automatic Word Substitution. In this mutation process, we first sample a word from the claim that is neither a named entity nor a stopword. We then use a BERT-large model (Devlin et al., 2019) to predict this masked token, as we found that human annotators usually fall back on a small, fixed vocabulary when thinking of a new word. We ask three annotators to validate whether each claim mutated by BERT is logical and grammatical, and keep the claims where at least two workers decide it satisfies our criteria. 500 BERT-mutated claims passed the validation and labeling.
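The masking and filtering logic of this mutation can be sketched as follows. The BERT fill-mask call itself is deliberately left out; the acceptance rule mirrors the criteria described in the appendix (no shared lemma, embedding cosine similarity in [0.7, 0.8]), and the function names, stopword list, and stand-in `lemma`/`cosine` callables are illustrative assumptions.

```python
# Sketch of the automatic word-substitution filter (not the authors' code).

STOPWORDS = {"the", "a", "an", "of", "in", "is", "was"}

def maskable_positions(tokens, named_entities):
    """Indices of tokens that are neither named entities nor stopwords."""
    return [i for i, t in enumerate(tokens)
            if t.lower() not in STOPWORDS and t not in named_entities]

def accept_substitution(masked, predicted, lemma, cosine):
    """Keep a model-proposed substitution only if it is a genuinely
    different word (no common lemma) whose embedding similarity lies
    in a moderate band, so the new claim is plausible but not a paraphrase."""
    if lemma(masked) == lemma(predicted):
        return False
    return 0.7 <= cosine(masked, predicted) <= 0.8

tokens = ["Patrick", "Carpentier", "competed", "in", "the", "race"]
print(maskable_positions(tokens, {"Patrick", "Carpentier"}))  # [2, 5]
```

In the full pipeline, a masked-LM prediction for a chosen position would be passed through `accept_substitution` before human validation.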
Automatic Entity Substitution. We design a separate mutation process to substitute named entities in the claims. First, we perform Named Entity Recognition on the claims. We then randomly select a named entity that is not the title of any supporting document, and replace it with an entity of the same type sampled from the context.
Claim Negation. Understanding negation cues and their scope is of significant importance to NLP models. Hence, we ask crowd-workers to negate the claims by removing or adding negation words (e.g., not), or by substituting a phrase with its antonym. However, Schuster et al. (2019) show that models can exploit the bias whereby most claims containing a negation word have the label REFUTED. To mitigate this bias, we only include a subset of negated 2-hop claims, 60% of which do not include any explicit negation word.

Claim Labeling
In this stage (the right column in Fig. 1), we ask annotators to assign one of three labels (SUPPORTED, REFUTED, or NOTENOUGHINFO) to all 3/4-hop claims (original and mutated) as well as the 2-hop mutated claims. The workers are asked to make judgments based solely on the given supporting facts, without using any external knowledge. Each claim is annotated by five crowd-workers, and we only keep those claims where at least three agree on the same label, resulting in a Fleiss' kappa inter-annotator agreement score of 0.63.

NOT-SUPPORTED Claims. The demarcation between NOTENOUGHINFO and REFUTED is subjective, and the threshold can vary with the world knowledge and perspective of annotators. Consider the claim "Christian Bale starred in a 2010 movie directed by an American director" and the fact "English director Christopher Nolan directed The Dark Knight in 2010". Although "American" in the claim directly contradicts "English" in the fact, the claim should still be classified as NOTENOUGHINFO, as Bale could have starred in another 2010 film by an American director. More such examples are provided in the appendix. In this case, a piece of evidence contradicts a relative clause in the claim but does not refute the entire claim. Similar problems regarding the uncertainty of NLI tasks have been pointed out in previous works (Zaenen et al., 2005; Pavlick and Kwiatkowski, 2019; Chen et al., 2020a).
We design an exhaustive list of rules with abundant examples, trying to standardize the decision process for the labeling task. We acknowledge the difficulty and cognitive load it sometimes places on well-informed annotators to think of corner cases like the example shown above. The final annotated data revealed the ambiguity between the NOTENOUGHINFO and REFUTED labels: in a 100-sample human validation, only 63% of the labels assigned by another annotator match the majority labels collected. Hence we combine REFUTED and NOTENOUGHINFO into a single class, NOT-SUPPORTED. 90% of the validation labels match the annotated labels under this binary classification setting.

Annotator Details
Most annotators are native English speakers from the UK, US, and Canada. For all tasks, we first launch small-scale pilots to train annotators and incorporate their feedback for at least two rounds. For the claim creation and extension tasks, we manually evaluate the claims they created and only keep those workers who can write claims of high quality. For the claim validation (Sec. 2.1.1) and labeling (Sec. 2.3) tasks, we additionally launch quizzes, and annotators scoring 80% accuracy in the quiz are then admitted to the job. During the job, we use test questions to ensure their consistent performance. Crowd-workers whose test-question accuracy drops below 82% are removed from the tasks, and all of their annotations are re-annotated by other qualified workers. As suggested in Ramírez et al.
Diverse Many-Hop Reasoning Graphs. As questions from HOTPOTQA (Yang et al., 2018) require two supporting documents, our 2-hop claims created from HOTPOTQA question-answer pairs inherit the same 2-node reasoning graph, as shown in the first row of Table 1. However, as we extend the original 2-hop claims to more hops using the approaches described in Sec. 2.1.2, we obtain many-hop claims with diverse reasoning graphs. Every node in a reasoning graph is a unique document that contains evidence, and an edge connecting two nodes represents a hyperlink from the original Wikipedia document or a comparison between two titles. As shown in Table 1, we have three unique 4-hop reasoning graphs that are derived from the 3-hop reasoning graph by appending the 4th node to one of the existing nodes in the graph.
Qualitative Analysis. The process of removing a bridge entity and replacing it with a relative clause or phrase adds a lot of information to a single hypothesis. Therefore, some of the 3/4-hop claims are relatively long and have complex syntactic and reasoning structures. In systematic aptitude tests as well, humans are assessed on synthetically designed complex logical puzzles; such tests require critical problem-solving abilities and are effective in evaluating the logical reasoning capabilities of humans and AI models. Overly complicated claims are discarded in our labeling stage if they are reported as ungrammatical or incomprehensible by the annotators. The resulting examples form a challenging task of evidence retrieval and multi-hop reasoning.

Baseline System
Following a state-of-the-art system (Nie et al., 2019a) on FEVER (Thorne et al., 2018), we build a pipeline system for fact extraction and claim verification. This provides an initial baseline for future work, and its performance indicates the many-hop challenge posed by HOVER.
Rule-based Document Retrieval. We use the document retrieval component from Chen et al. (2017), which returns the k closest Wikipedia documents for a query using cosine similarity between binned uni-gram and bi-gram TF-IDF vectors. This step outputs a set P_r of k_r documents that are processed by the downstream neural models.
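A minimal, dependency-free sketch of this retrieval step over a toy three-document corpus; the real system hashes and bins TF-IDF vectors over all of Wikipedia (Chen et al., 2017), which we omit here, and the function names are our own.

```python
# Toy uni-/bi-gram TF-IDF retrieval with cosine similarity.
import math
from collections import Counter

def ngrams(text):
    """Lower-cased unigrams plus bigrams of a whitespace-tokenized text."""
    toks = text.lower().split()
    return toks + [" ".join(toks[i:i + 2]) for i in range(len(toks) - 1)]

def tfidf_vectors(docs):
    tfs = [Counter(ngrams(d)) for d in docs]
    df = Counter(g for tf in tfs for g in tf)
    idf = {g: math.log(len(docs) / df[g]) for g in df}
    return [{g: c * idf[g] for g, c in tf.items()} for tf in tfs], idf

def cosine(u, v):
    dot = sum(u[g] * v.get(g, 0.0) for g in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(claim, docs, k=2):
    """Indices of the k documents closest to the claim."""
    vecs, idf = tfidf_vectors(docs)
    qv = {g: c * idf.get(g, 0.0) for g, c in Counter(ngrams(claim)).items()}
    scores = [(cosine(qv, v), i) for i, v in enumerate(vecs)]
    return [i for s, i in sorted(scores, reverse=True)[:k]]

docs = ["Patrick Carpentier is a racing driver .",
        "The Rookie of the Year award in motorsport .",
        "A cooking show aired on television ."]
print(sorted(retrieve("Patrick Carpentier won Rookie of the Year", docs, k=2)))  # [0, 1]
```

The many-hop failure mode discussed in Sec. 4.2 is visible even here: a document sharing no n-grams with the claim can never be ranked highly, regardless of its relevance through intermediate documents.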
Neural-based Document Retrieval. Similar to the retrieval model in Nie et al. (2019a), a BERT-base model (Devlin et al., 2019) takes a single document p ∈ P_r and the claim c as input, and outputs a score that reflects the relatedness between p and c. We select a set P_n of the top k_p documents whose relatedness scores are higher than a threshold κ_p.
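The selection rule used here (and again for sentence selection below), keeping at most k candidates above a threshold κ, can be sketched as follows; the scores are toy values standing in for BERT relatedness outputs, and the helper name is ours.

```python
# Thresholded top-k selection applied after neural re-ranking.

def select_top(scored, k, kappa):
    """scored: list of (item, score) pairs. Return up to k items whose
    score exceeds kappa, ordered by descending score."""
    kept = [(s, it) for it, s in scored if s > kappa]
    kept.sort(reverse=True)
    return [it for s, it in kept[:k]]

scores = [("doc_a", 0.91), ("doc_b", 0.42), ("doc_c", 0.77), ("doc_d", 0.55)]
print(select_top(scores, k=5, kappa=0.5))  # ['doc_a', 'doc_c', 'doc_d']
```

With the paper's settings (k_p = 5, κ_p = 0.5), fewer than k_p documents may survive when the re-ranker is unconfident, which directly caps downstream evidence recall.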
Neural-based Sentence Selection. We fine-tune another BERT-base model that encodes the claim c and all sentences from a single document p ∈ P_n, and predicts sentence relatedness scores using the first token of every sentence. We select a set S_n of the top sentences from the entire P_n whose relatedness scores are higher than a threshold κ_s.
Claim Verification Model. We fine-tune a BERT-base model for recognizing textual entailment between the claim c and the retrieved evidence S n . We feed the claim and retrieved evidence, separated by a [SEP] token, as the input to the model and perform a binary classification based on the output representation of the [CLS] token at the first position.
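The input construction and binary read-out can be sketched as below. The logits are placeholders rather than real BERT outputs, and `build_input`/`verify` are hypothetical helper names; a real system would take the logits from a classification head over the [CLS] representation.

```python
# Sketch of verification input assembly and the binary decision.
import math

def build_input(claim, evidence_sentences):
    """[CLS] claim [SEP] evidence [SEP] -- WordPiece tokenization omitted."""
    return "[CLS] " + claim + " [SEP] " + " ".join(evidence_sentences) + " [SEP]"

def verify(logits):
    """logits: (supported, not_supported). Softmax, then pick the argmax."""
    exps = [math.exp(x) for x in logits]
    probs = [e / sum(exps) for e in exps]
    return "SUPPORTED" if probs[0] >= probs[1] else "NOT-SUPPORTED"

x = build_input("Patrick Carpentier drove for Forsythe Racing.",
                ["Patrick Carpentier raced for Forsythe Racing."])
print(verify((2.3, -1.1)))  # SUPPORTED
```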

Experiments and Results
We explain the evaluation metrics we use and report the results of the baseline in three evaluation tasks.

Evaluation Metrics
We evaluate the final accuracy of the claim verification task, i.e., predicting a claim as SUPPORTED or NOT-SUPPORTED. The document and sentence retrieval are evaluated by the exact-match and F1 scores between the predicted document-/sentence-level evidence and the ground-truth evidence for the claim. We refer to the appendix for the detailed experimental setups and hyper-parameters.
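The evidence metrics can be sketched in a few lines of plain Python. This is a simplified illustration, not the official scorer; the evidence representation (sets of (title, sentence-index) pairs) and edge-case handling are assumptions.

```python
# Set-based exact match and F1 over predicted vs. gold evidence.

def evidence_em(pred, gold):
    """1.0 iff the predicted evidence set equals the gold set."""
    return float(set(pred) == set(gold))

def evidence_f1(pred, gold):
    """Harmonic mean of set precision and recall."""
    pred, gold = set(pred), set(gold)
    if not pred or not gold:
        return float(pred == gold)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

gold = {("Patrick Carpentier", 0), ("Forsythe Racing", 2)}
pred = {("Patrick Carpentier", 0), ("NASCAR", 1)}
print(evidence_em(pred, gold), round(evidence_f1(pred, gold), 2))  # 0.0 0.5
```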

Document Retrieval Results
The results in Table 3 show that the task becomes significantly harder for the bi-gram TF-IDF as the number of supporting documents increases. This decline in single-hop word-matching retrieval rate suggests that our method of extending the reasoning hops (Sec. 2.1.2) is effective in promoting multi-hop document retrieval and minimizing word-matching reasoning shortcuts. We then use a BERT-base model (1st row in Table 4) to re-rank the top-20 documents returned by the TF-IDF. The "BERT" model in the 2nd row is trained with an oracle training set containing all golden documents. Overall, the performance of the neural models is limited by the low recall of the 20 input documents, and the F1 scores degrade as the number of hops increases. The oracle model (3rd row) is the same as the 2nd-row "BERT" but evaluated on the oracle data; it indicates an upper bound for the BERT retrieval model given a perfect rule-based retrieval method. These findings again demonstrate the high quality of the many-hop claims we collected, for which reasoning shortcuts are significantly reduced by the approach described in Sec. 2.1.2.

Sentence Selection Results
We evaluate the neural-based sentence selection models by re-ranking the sentences within the top-5 documents returned by the best neural document retrieval method. For the "BERT" model (2nd row in Table 5), we again ensure that all golden documents are contained within the 5 input documents during training. We then measure the oracle result by evaluating this model on the dev set with all golden documents present, which suggests an upper bound for the sentence retrieval model given a perfect document retrieval method. The same trend holds: the F1 scores decrease significantly as the number of hops increases.

Claim Verification Results
In the oracle setting (1st row in Table 6), where the complete set of evidence is provided, the model achieves 81.2% accuracy in verifying the claims. We also conduct a sanity check in a claim-only setting (2nd row), where the model can only exploit biases in the claims without any evidence; here the model achieves 63.7% accuracy. Although the model can exploit limited biases within the claims to achieve higher-than-random accuracy without any evidence, it is still 17.5 points worse than the model given the complete evidence. This suggests the NLI model can benefit significantly from an accurate evidence retrieval model.

Full Pipeline Results
The full pipeline ("BERT+Retr" in Table 7) uses the sentence-level evidence retrieved by the best document/sentence retrieval models as input to the NLI model, while "BERT+Gold" is the oracle model from Table 6 evaluated with retrieved evidence instead. We further propose the HOVER score: the percentage of examples for which the model retrieves at least one supporting fact from every supporting document and predicts the correct label. We show the performance of the best model (BERT+Gold in Table 7) on the test set in Table 8. Overall, the best pipeline can only retrieve the complete set of evidence and predict the correct label for 14.9% of examples on the dev set and 15.32% on the test set, suggesting that our task is indeed more challenging than previous work of this kind.
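The HOVER score just described can be sketched as follows, assuming a simple per-example record format; the field names and data structures are hypothetical, not the official evaluation format.

```python
# An example counts toward the HOVER score only if the predicted label is
# correct AND at least one retrieved sentence comes from every gold document.

def hover_score(examples):
    hits = 0
    for ex in examples:
        label_ok = ex["pred_label"] == ex["gold_label"]
        covered_docs = {doc for doc, _ in ex["pred_sents"]}
        docs_ok = ex["gold_docs"] <= covered_docs  # every gold doc covered
        hits += label_ok and docs_ok
    return hits / len(examples)

examples = [
    {"pred_label": "SUPPORTED", "gold_label": "SUPPORTED",
     "pred_sents": {("A", 0), ("B", 3)}, "gold_docs": {"A", "B"}},
    {"pred_label": "SUPPORTED", "gold_label": "SUPPORTED",
     "pred_sents": {("A", 0)}, "gold_docs": {"A", "B"}},  # misses doc B
]
print(hover_score(examples))  # 0.5
```

Because both conditions must hold jointly, the HOVER score is bounded above by both the label accuracy and the document-coverage rate, which is why it is so much lower than either.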

Human Performance
We measure the human performance on 100 sampled claims. In the document (Table 4) and sentence retrieval (Table 5) tasks, the human F1 scores are 37.9% and 33.1% higher than the best baseline, respectively. In the oracle claim verification setting (Table 6), the human accuracy is 90%, i.e., 8.8% higher than BERT's accuracy. On the full pipeline (Table 7), the human accuracy and human HOVER score are 88% and 81%, while the best BERT model only obtains 67.6% accuracy and a 14.9% HOVER score on the dev set. The human evaluation setup is explained in the appendix.

Related Work
Natural Language Inference and Fact Verification. Textual entailment and natural language inference (NLI) datasets like RTE (Dagan et al., 2010), SNLI (Bowman et al., 2015), or MNLI (Williams et al., 2018) consist of single-sentence premises. In these tasks, every premise-hypothesis pair is labeled as ENTAILMENT, CONTRADICTION, or NEUTRAL. Another related task is fact verification, where claims (hypotheses) are checked against facts (premises). Vlachos and Riedel (2014) and Ferreira and Vlachos (2016) collected statements from PolitiFact, a Pulitzer Prize-winning fact-checking website that covers political topics. The veracity of these facts is crowd-sourced from journalists, public figures, and ordinary citizens. However, developing machine-learning-based assessments on datasets with fewer than five hundred data points is not feasible. Wang (2017) introduced LIAR, which includes 12,832 labeled claims from PolitiFact, together with speaker metadata and veracity judgments. However, the evidence supporting the statements is not provided. A recent work on table-based fact verification (Chen et al., 2020b) points out the difficulty of collecting accurate neutral labels and leaves out neutral claims at the claim creation phase. We instead merge neutral (NOTENOUGHINFO) claims with REFUTED claims into a single class.
Fact Extraction and Verification. Thorne et al. (2018) introduced FEVER, a fact extraction and verification dataset. It consists of single-sentence claims that are verified against pieces of evidence retrieved from at most two documents. In our dataset, the claims vary in length from one sentence to one paragraph, and the pieces of evidence are derived from one to four documents. More recently, Thorne et al. (2019) introduced the FEVER2.0 shared task, which challenges participants to fact-check claims using evidence from Wikipedia and to attack other participants' systems with adversarial models. In HOVER, the claim needs verification from multiple documents; prior to verification, the relevant documents and the context inside these documents must also be retrieved accurately. More recent work enriched the claim with multiple perspectives that support or oppose the claim to different degrees, where each perspective can also be verified by existing facts. MultiFC (Augenstein et al., 2019) is a dataset of naturally occurring claims from multiple domains. The contributions of these two fact-checking datasets are orthogonal to ours. Synthetic datasets such as RuleTaker (Clark et al., 2020) are created to challenge models' ability to understand complex reasoning in natural language. With the same motive, HOVER is created by humans following guidelines and rules designed to enforce a multi-hop structure within the claim. Compared to synthetic datasets like RuleTaker, HOVER's examples are more natural, as they are created and verified by humans and cover a wider range of vocabulary and linguistic variations. This is extremely important because models usually get close-to-perfect performance (e.g., 99% on RuleTaker) on these synthetic datasets.

Conclusion
We present HOVER, a fact extraction and verification dataset requiring evidence retrieval from as many as four Wikipedia articles that form reasoning graphs of diverse shapes. We show that the performance of existing state-of-the-art models degrades significantly on our dataset as the number of reasoning hops increases, hence demonstrating the necessity of robust many-hop reasoning for achieving strong results. We hope that HOVER will encourage the development of models capable of performing complex many-hop reasoning in the tasks of information retrieval and verification.

Experimental Setup

We fine-tune BERT-base models for document retrieval, sentence selection, and claim verification. The fine-tuning is done with a batch size of 16 and the default learning rate of 5e-5 without warmup. We set k_r = 20, k_p = 5, κ_p = 0.5, and κ_s = 0.3 based on the memory limit and the dev-set performance. We select our system with the best dev-set verification accuracy and report its scores on the hidden test set. The entire pipeline is visualized in Fig. 2. For the document retrieval and sentence selection tasks, we fine-tune BERT on 4 Nvidia V100 GPUs for 3 epochs; the training of both tasks takes around 1 hour. For the claim verification task, we fine-tune BERT on a single Nvidia V100 for 3 epochs; the training finishes in 30 minutes.

Human Evaluation
We measure the human performance in all three evaluation tasks on 100 sampled claims.
To perform the open-domain document retrieval task, the testee is given a claim and a Python program that can retrieve Wikipedia documents from the database by title. The testee is additionally allowed to search the official Wikipedia web pages, as retrieving some documents requires matching the claim against the document content. To select the sentence-level evidence from the retrieved documents, the testee uses the documents, tokenized by sentence, returned by the Python program. To verify the claim in the oracle setting, the testee is given all golden supporting documents. The testee is given unlimited time for each example. Only 2 out of 100 claims were labeled as not grammatical/logical during the human evaluation.

B.1 Claim Creation Guidelines
Claim. A claim is written in one or more sentences and contains information (true or mutated) about one or more entities.

B.1.1 Simple Claim Creation
The objective of this task is to generate single-sentence claims using QA pairs from the HOTPOTQA dataset, as shown in Fig. 4.

Instructions

• Given the question and answer pair, rate the clarity of the question on a scale of 1 (very confusing) to 3 (very clear)
• Extract as much information as possible from the Question and Answer and rewrite them as sentences to create claims
• Avoid including any extra information or uncommon words that are not part of the original Question and Answer
• Claims must not exclude any information or uncommon words from the original Question and Answer
• Claims must not include any information beyond the question and answer
• Claims should be grammatically correct and in formal English
• Correct capitalization and spelling of entities should be followed
• Claims must not contain speculative language (e.g., probably, might be, maybe, etc.)
• Some claims might not be true
• Claims should be single-sentence statements and must not contain a question mark

B.1.2 Claim Validation
The objective of this task is to validate whether the claims generated in Simple Claim Creation meet the requirements.

Instructions
• Indicate whether the claim meets the criteria mentioned in Sec. B.1.1.
• Rate the clarity of the question-answer pair on a scale of 1 to 5.

We collect three judgments per claim and keep those claims that at least two annotators judge to be valid.
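The two-of-three aggregation rule can be sketched as follows (hypothetical helper functions, not from the paper's codebase):

```python
def is_validated(judgments, min_agree=2):
    """Keep a claim if at least `min_agree` of the collected boolean
    judgments mark it as valid (three judgments are collected per claim)."""
    return sum(bool(j) for j in judgments) >= min_agree

def filter_claims(claims_with_judgments):
    """Return only the claims whose judgments reach majority approval."""
    return [claim for claim, judgments in claims_with_judgments
            if is_validated(judgments)]
```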

B.1.3 Extending to 3-hop and 4-hop
The objective of this task is to substitute an entity in the claim with the information provided in the given English Wikipedia article.

Overview
• Review the original claim and the given entity.
• Select a paragraph from 1 to 5 candidate paragraphs (every paragraph mentions the entity at least once).
• Replace the entity with the information from your selected paragraph that describes the entity, and rewrite the claim.

Instructions
• The rewritten claim must contain the title of the selected paragraph (unless the title contains the entity to be replaced).
• Do not fact-check the information or use any external knowledge for this task.
• The claim should be broken into multiple sentences to form a coherent paragraph.
• To write coherent sentences, use proper pronouns/coreference in later sentences to refer to the entities mentioned in previous sentences.
• The claim must not contain the entity that needs to be replaced.
• The claim should preserve all other information from the original claim except for the entity to be replaced.
• Write concise claims. Use the shortest chunk of words from one selected sentence that accurately describes the entity to be replaced.
• When necessary, rephrase the claim to make it fluent and grammatically correct.

In this mutation process, we first sample a word from the claim that is neither a named entity nor a stopword. We then use a pre-trained BERT-large model (Devlin et al., 2019) to predict this masked token. We only keep the claims where (1) the word predicted by BERT and the masked word do not share a common lemma, and (2) the cosine similarity of the BERT encodings of the masked word and the predicted word lies between 0.7 and 0.8. The entire procedure is visualized in Fig. 5.
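The two filtering rules of this mutation process can be sketched as a standalone check. This is a sketch under assumptions: the lemmas are assumed to come from an external lemmatizer (e.g. spaCy) and the vectors are assumed to be the BERT encodings of the masked and predicted words.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two word-encoding vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def keep_mutation(masked_lemma, predicted_lemma,
                  masked_vec, predicted_vec, lo=0.7, hi=0.8):
    """Apply the two filtering rules for BERT-mutated claims:
    (1) the predicted word must not share a lemma with the masked word;
    (2) the cosine similarity of their encodings must lie in [lo, hi],
        i.e. the substitute is related but not a near-paraphrase."""
    if masked_lemma == predicted_lemma:
        return False
    return lo <= cosine_similarity(masked_vec, predicted_vec) <= hi
```

The similarity band is the interesting design choice: values near 1.0 would yield synonyms (a SUPPORTED claim rather than a mutation), while very low values would produce obviously incoherent substitutions.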

Examples of Negated Claims
Original: The scientific name of the true creature featured in "Creature from the Black Lagoon" is Eucritta melanolimnetes.

Negated:
The scientific name of the imaginary creature featured in "Creature from the Black Lagoon" is Eucritta melanolimnetes.

B.2.3 Specifically Implied Claims
The objective of this task is to create specifically implied claims from the claims created in Sec. B.1 such that the mutated claim implies the original claim.

B.2.4 Instructions
• Make the claim more specific by adding information about target entities so that the mutated claim implies the original claim.
• The added information must be directly related to the target entities.
• Annotators are discouraged from verifying the added information using Wikipedia or other external sources.
• The target entity must not be added to the mutated claim if it was not originally in the claim, as this would decrease the number of hops in the claim.
• An entity name that is explained in a relative clause or phrase in the original claim must not be added, as this would decrease the number of hops in the claim.
Examples of specifically implied claims
Claim: Skagen Painter Peder Severin Krøyer favored naturalism along with Theodor Esbern Philipsen and the artist Ossian Elgström studied with in 1907.
Specifically Implied Claim: Skagen Painter Peder Severin Krøyer favored naturalism along with Theodor Esbern Philipsen and the muralist Ossian Elgström studied with in 1907.

B.2.5 Generally Implied Claims
The objective of this task is to create generally implied claims from the claims created in Sec. B.1 such that the original claim implies the mutated claim.

Instructions
• Make the claim more general by deleting information about target entities so that the original claim implies the mutated claim.
• Pick an entity and consider the less specific/more generic term.

Figure 5: BERT mutation procedure. We first randomly select 1-2 non-entity words from a range of choices and mask them. The BERT model then predicts the masked tokens and produces the mutated claim. (Mutated claim from the figure: "This Maroon 5 song is one of the tracks that Zaedan is best known for remixing. He is a Swedish producer who worked with Taylor Swift.")

Examples of generally implied claims
Claim: Skagen Painter Peder Severin Krøyer favored naturalism along with Theodor Esbern Philipsen and the artist Ossian Elgström studied with in 1907.
Generally Implied Claim: Skagen Painter Peder Severin Krøyer favored naturalism along with Theodor Esbern Philipsen and the artist Ossian Elgström studied with in the early 1900s.

B.3 Claim Labeling
The objective of this task is to label claims as SUPPORTED, REFUTED, or NOTENOUGHINFO given the supporting facts.
Supported You have strong reasons, from the supporting documents or based on your linguistic knowledge, to justify that the claim is true.
Refuted Based on the supporting documents, it is impossible for the claim to be true. REFUTED claims contain information that contradicts the supporting documents.
NotEnoughInfo Any claim that does not fall into one of the two categories above should be labeled as NOTENOUGHINFO. This usually means you need ADDITIONAL information to decide whether the claim is TRUE or FALSE after reviewing the paragraphs. Whenever you are unsure whether a claim is REFUTED or NOTENOUGHINFO, ask yourself: "Is it possible for this claim to be true based on the information in the paragraphs?" If yes, select NOTENOUGHINFO.
External Knowledge. The concept of external knowledge is ambiguous and hard to define precisely, and failing to address this issue could confuse workers about what information they are allowed to use when making their judgments. To address this, we distinguish linguistic knowledge and commonsense, which workers are allowed to use in this task, from external, encyclopedic knowledge, which they are not.

Linguistic knowledge can be defined as the vocabulary and syntax of an English speaker. It is largely invariant across English speakers and can play a crucial role in this task. For example, given the supporting fact "Messi is the captain of the Argentina national team.", suppose a claim was generated by substituting captain with leader. From our linguistic knowledge, captain and leader are synonyms, so the mutated claim conveys the same idea as the provided supporting fact and should therefore be annotated as SUPPORTED. On the other hand, if captain is replaced by goalkeeper, an English speaker can easily tell that the two words have different meanings. Hence, additional information, such as Messi's position, would be needed to justify the claim. This type of information is beyond the supporting facts and should be considered external information, so the mutated claim should be annotated as NOTENOUGHINFO.

In addition to linguistic knowledge, commonsense should also be taken into account. A few examples of commonsense are: a person can have only one birthplace; a person cannot perform actions after their death. Hence, claims found to violate commonsense are labeled as REFUTED.

Instructions
• Review the claim. Then review the supporting documents, especially the highlighted sentences.
• Extract information from the supporting documents to justify that the given claim is SUPPORTED or REFUTED. If you are not certain and need additional information, please select NOTENOUGHINFO.
• Avoid using any external information that is not part of the supporting documents.
• If the information from the claim and the supporting documents is mutually exclusive and cannot both be true, the claim should be labeled as REFUTED.
• If the information from the claim and the supporting documents is not mutually exclusive and both can be true, the claim should be labeled as NOTENOUGHINFO.
Examples of labeled claims. See Table 9 for original claims, claim mutations, and labels.
Refuted vs. NotEnoughInfo. See Table 10 for ambiguous examples.