Fact or Fiction: Verifying Scientific Claims

We introduce the task of scientific fact-checking. Given a corpus of scientific articles and a claim about a scientific finding, a fact-checking model must identify abstracts that support or refute the claim. In addition, it must provide rationales for its predictions in the form of evidentiary sentences from the retrieved abstracts. For this task, we introduce SciFact, a dataset of 1.4K expert-written scientific claims paired with evidence-containing abstracts, and annotated with labels and rationales. We present a baseline model and assess its performance on SciFact. We observe that, while fact-checking models trained on Wikipedia articles or political news have difficulty generalizing to our task, simple domain adaptation techniques represent a promising avenue for improvement. Finally, we provide initial results showing how our model can be used to verify claims relevant to COVID-19 on the CORD-19 corpus. Our dataset will be made publicly available at https://github.com/allenai/scifact.


Introduction
Fact-checking, a task in which the veracity of an input claim is verified against a corpus of documents that support or refute the claim, has seen increased attention as an important research area. This attention is motivated by the proliferation of misinformation in political news, social media, and on the web. In turn, interest in fact-checking has spurred the creation of many datasets across different domains to support research and development of automated fact-checking systems. Yet, to our knowledge, no such dataset exists to facilitate research on another important domain for fact-checking: scientific literature. The ability to verify claims about scientific concepts, especially those related to biomedicine, is an important application area for fact-checking. Furthermore, this line of research also offers a unique opportunity to explore the capabilities of modern neural models, since successfully verifying most scientific claims requires expert background knowledge, complex language understanding, and reasoning capability, as demonstrated in Figure 1.
Claim: Taking anti-depressants is associated with an increase in the Aβ level in the brain of experimental animals.
Corpus: Aβ production in CSF was slowed by 37% in the citalopram group compared to placebo.
Figure 1: A SCIFACT claim refuted by evidence. To refute this claim, the system must recognize that (1) "CSF" is an acronym for "cerebral spinal fluid", found in the brain, (2) "Citalopram" is a type of antidepressant, but "placebo" is not, and (3) "Slowing by 37%" indicates a reversal in effect relative to the claim.
In this paper, we introduce the task of scientific fact-checking. To facilitate research on this task, we construct SCIFACT, a dataset of 1,409 scientific claims accompanied by scientific abstracts that support or refute each claim, and annotated with rationales justifying each support / refute decision. To curate this dataset, we use a novel annotation protocol that takes advantage of a plentiful source of naturally-occurring claims in the scientific literature: citation sentences, or "citances" (Nakov et al.).
arXiv:2004.14974v2 [cs.CL] 1 May 2020
To establish performance baselines on this new task, we develop a pipeline model following the "BERT-to-BERT" approach from DeYoung et al. (2019), which achieves strong performance on FEVER. Our model, which we call VERISCI, retrieves abstracts related to a given claim, uses a BERT-based (Devlin et al., 2019) sentence selector to identify rationale sentences, and then labels each abstract as SUPPORTS, REFUTES, or NOTENOUGHINFO with respect to the claim. Our system is able to identify correctly-labeled and rationalized evidence abstracts with performance of 46.5 F1, indicating that the task is doable but leaving ample room for improvement. Despite its small size, training VERISCI on SCIFACT leads to better performance than training on fact-checking datasets constructed from Wikipedia articles (Thorne et al., 2018) and political news (Hanselowski et al., 2019). The strongest performance is achieved using a simple domain adaptation strategy, pretraining on FEVER and then finetuning on SCIFACT.
To evaluate the real-world applicability of our dataset and approach, we showcase the ability of our model to verify expert-written claims concerning the novel coronavirus COVID-19 against the newly-released CORD-19 corpus (Wang et al., 2020). Medical student reviewers judge the retrieved evidence to be plausible in 23 of the 36 claims. Our data and models will be released publicly at https://github.com/allenai/scifact.

Related work
We discuss SCIFACT in relation to existing fact-checking datasets and other related scientific NLP tasks.

Natural vs synthetic claims
We distinguish between synthetic and natural claims. FEVER uses synthetic claims created by annotators by mutating Wikipedia sentences selected as related evidence. Most other prior work uses natural claims curated from fact-checking sites, Twitter, debates, or news articles. The claims in SCIFACT are natural, since they are derived from citation sentences that occur naturally in scientific articles, and annotators do not see the evidence at the time of claim writing. We discuss this claim-writing process further in §3.2.

Labeling claims vs claim-document pairs
In fact-checking, a claim is a statement of actuality whose veracity is a fixed target for investigation. Therefore, claims can be assigned a global supported or refuted label. For example, in FEVER, the claim "Barack Obama was the 44th President of the United States" can be verified as globally supported given sufficient evidence.
While SCIFACT claims are indeed factual assertions, we do not attempt to assign them global labels because the asserted "fact" may still be under active scientific research. Instead of labeling claims, we label claim-document pairs with support or refute relations. This is similar to the task in Perspectrum (Chen et al., 2019), which identifies evidence-backed "perspective" statements as agreeing or disagreeing with an opinion-based claim, such as "Animals should have lawful rights." We discuss this claim-document labeling process further in §3.3.

Related scientific NLP tasks
The SCIFACT task is closely related to two other scientific NLP tasks: citation contextualization and evidence inference. The goal of citation contextualization is to identify all spans in a cited document that are relevant to a particular citation in a citing document (Cohan et al., 2015). A dataset of 20 biomedical articles annotated with contextualized citations was released at TAC 2014 for this task. While the dataset was annotated by domain experts, the average inter-annotator agreement rate on annotated spans was only 21.7%. More recently, the SciSummNet dataset (Yasunaga et al., 2019) was released, focusing on NLP papers rather than biomedicine. Similar to these datasets, the annotation in SCIFACT involves contextualizing citances in the cited document, but in SCIFACT, citances are first converted into claims, and evidence is restricted to the abstracts of the cited documents.
The evidence inference task (Lehman et al., 2019) involves predicting the effect of a medical intervention on a specified outcome. Like SCIFACT, the evidence inference task requires the model to identify evidence justifying its label predictions. Unlike the full-sentence claims given as input to SCIFACT, the inputs for evidence inference are individual text spans specifying an intervention, comparator, and treatment outcome.

The SCIFACT dataset
For this task, we introduce SCIFACT, a dataset of 1,409 scientific claims fact-checked against a corpus of 5,183 abstracts. Abstracts that support or refute a claim are additionally annotated with rationales. We describe our corpus creation and annotation protocol.

Data source
To construct SCIFACT, we use S2ORC (Lo et al., 2020), a publicly-available corpus of millions of scientific articles. We restrict articles to those with at least 10 citations and with full text freely available. To ensure that documents in our dataset are of high quality, we randomly sample articles from a manually curated collection of well-regarded journals spanning domains from basic science (e.g., Cell, Nature) to clinical medicine (e.g., JAMA, BMJ). The full list is in Appendix B. We refer to the resulting collection of articles as our seed set.
We use the S2ORC citation graph to sample citances (from citing articles) that cite these seed articles. If a citance cites other articles not in the seed set, we refer to these as co-cited articles.

Claim writing
Definition. In SCIFACT, a scientific claim is an atomic factual statement expressing a finding about one aspect of a scientific entity or process, which can be verified from a single source. For instance, "The R0 of the novel coronavirus is 2.5" is considered a valid scientific claim. Opinion-based statements like "The government should require people to stand six feet apart to slow the spread of coronavirus" are not considered scientific claims. Compound claims like "Aerosolized coronavirus droplets can travel at least 6 feet and can remain in the air for 3 hours" should be split into two atomic claims.
Citance: "Future studies are also warranted to evaluate the potential association between WNT5A/PCP signaling in adipose tissue and atherosclerotic CVD, given the major role that IL-6 signaling plays in this condition as revealed by large Mendelian randomization studies 44, 45."
Claim: IL-6 signaling plays a major role in atherosclerotic cardiovascular disease.
Figure 2: A claim written based on a citance. Material unrelated to the citation is removed. The acronym "CVD" is expanded to "cardiovascular disease".
Annotation. Citances (Nakov et al.) are an ideal source for claims since they contain expert-written assertions about important findings reported in related research articles, and, unlike claims found on the web, they specify the documents where supporting evidence can be found.
Annotators are shown a citance (the source citance) in the context of its source article, and are asked to write up to three claims based on the content of the citance while ensuring the produced claims conform to our claim definition. This results in natural claims because the annotator does not see the cited article's abstract (the cited abstract) at the time of claim writing. Figure 2 shows an example. See Appendix C for screenshots of the claim and evidence interfaces.
Annotators. The annotators include four experts with background in scientific NLP, fifteen undergraduates studying life sciences, and four graduate students (doctoral or medical) in the life sciences. Student claim writers attend an in-person training session where they are introduced to the task and receive feedback from the four experts. Following training, student annotators continue writing claims remotely. The expert annotators monitor these claims for quality and provide feedback when necessary. As a final check, all submitted claims are proofread by an undergraduate whose claims are deemed especially high-quality by the expert annotators.
Claim negation. Unless the authors of the source citance were mistaken, cited articles should provide supporting evidence for the claims made in a citance. To obtain examples where an abstract REFUTES a claim, we create claim negations. Performing this task improperly can introduce biases into the dataset; for instance, a model could learn to associate the word "not" with a REFUTED label (Schuster et al., 2019). To mitigate these effects, a scientific NLP expert performed the negations, skipping claims that could not be negated without introducing obvious dataset artifacts. The majority of claim negations involved a reversal of effect direction; for instance, "A high microerythrocyte count protects against severe anemia" can be negated as "A high microerythrocyte count raises vulnerability to severe anemia".

Claim verification
Annotation. Annotators are shown a claim, together with one of the claim's cited abstracts, and asked to label the claim-abstract pair as SUPPORTS, REFUTES, or NOTENOUGHINFO. If the abstract is not relevant to the claim, they are instructed to label it NOTENOUGHINFO. If the annotator assigns a SUPPORTS or REFUTES label, they must also identify all valid rationales justifying the label. A rationale is a minimal collection of sentences sufficient to justify the label. An abstract may have multiple rationales, as in Figure 3, but they must be mutually exclusive, i.e., they may not share any sentences.
Annotators. The annotators include three NLP experts, five undergraduates studying life sciences, and five graduate students studying life sciences. Annotations are performed remotely through a web interface. Annotators are required to pass a 10-question "quiz" before annotating their own claims. After passing the quiz, subsequent submissions are reviewed by an NLP expert until that expert deems the annotator reliable. Approved annotators are then assigned to review each other's submissions. In general, graduate students are assigned to review annotations from undergraduates.
[...] 2019). To measure rationale agreement, we treat each sentence as either classified as "part of a rationale" or "not part of a rationale" and compute sentence-level agreement on abstracts where annotators agreed on the entailment label. The resulting Cohen's κ is 0.71. Additional statistics on the dataset can be found in Appendix B.
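The sentence-level agreement statistic mentioned above can be illustrated with a minimal sketch of Cohen's κ for two annotators. The function name `cohens_kappa` and the toy label lists are hypothetical and are not taken from the SCIFACT annotation tooling; each sentence is encoded as 1 ("part of a rationale") or 0 ("not part of a rationale").

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy data (hypothetical): one entry per abstract sentence.
annotator_1 = [1, 1, 0, 0, 0, 1, 0, 0]
annotator_2 = [1, 0, 0, 0, 0, 1, 0, 1]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

In this toy case the annotators agree on 6 of 8 sentences, but much of that agreement is expected by chance, so κ is well below the raw agreement rate.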

Adding distractors to the corpus
Our initial corpus is defined as the union of the seed and co-cited abstract sets from §3.1. To simulate a more realistic corpus for retrieval, we introduce additional distractor abstracts. In doing so, we observe a tradeoff. Adding too many distractors (e.g., all biomedical papers in S2ORC) increases the likelihood of false negatives, that is, cases in which a distractor actually contains evidence relevant to a written claim but may have been unknown to the authors who wrote the source citance. However, adding a small number of uniformly-sampled distractors does not pose a retrieval challenge, since these documents may not share much lexical overlap with the claims. We address this problem as follows: for each citance, we sample articles that are cited in the same document as the citance, but in a different paragraph (see Figure 4). These articles should cover topics related to the evidence articles. At the same time, the citance authors were clearly aware of these articles, and presumably would have mentioned them in the citance if they were relevant. We add five distractor articles per citance.
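The distractor sampling strategy described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function `sample_distractors`, its arguments, and the per-paragraph list representation of a citing article are hypothetical. The sketch also assumes that any article cited in the citance's own paragraph is excluded, which is one possible reading of "in a different paragraph".

```python
import random


def sample_distractors(paragraph_citations, citance_paragraph,
                       citance_cited_ids, num_distractors=5, seed=0):
    """Sample distractors cited in the citing article outside the citance's paragraph.

    paragraph_citations: list of lists; entry i holds the ids of the articles
        cited in paragraph i of the citing article (hypothetical format).
    citance_paragraph: index of the paragraph that contains the citance.
    citance_cited_ids: ids cited by the citance itself (seed and co-cited).
    """
    # Assumption: exclude everything cited in the citance's own paragraph.
    excluded = set(citance_cited_ids) | set(paragraph_citations[citance_paragraph])
    candidates = set()
    for idx, cited_ids in enumerate(paragraph_citations):
        if idx == citance_paragraph:
            continue
        candidates.update(cid for cid in cited_ids if cid not in excluded)
    pool = sorted(candidates)  # sort so that sampling is reproducible
    rng = random.Random(seed)
    return rng.sample(pool, min(num_distractors, len(pool)))


# Toy citing article with three paragraphs (hypothetical ids).
citations = [["a", "b"], ["c", "d"], ["b", "e", "f"]]
distractors = sample_distractors(citations, citance_paragraph=0,
                                 citance_cited_ids=["a"])
print(sorted(distractors))
```

When fewer than five candidates exist, the sketch simply returns all of them; how the original pipeline handles that case is not stated in the text.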

The SCIFACT task
We formalize our definition of the SCIFACT task and define how we perform evaluation.

Task Formulation
The inputs to our fact-checking task are a scientific claim $c$ and a corpus of abstracts $A$. All abstracts $a \in A$ are labeled as $y(c, a) \in \{\text{SUPPORTS}, \text{REFUTES}, \text{NOTENOUGHINFO}\}$ with respect to a claim $c$. The abstracts that either SUPPORT or REFUTE $c$ are referred to as evidence abstracts for $c$. We denote the set of evidence abstracts $E(c)$. Each evidence abstract $a \in E(c)$ is annotated with rationales. A single rationale $R$ is a collection of sentences $\{r_1(c, a), \ldots, r_m(c, a)\}$ sufficient to justify the label $y(c, a)$, where $m$ is the number of sentences in rationale $R$. We denote the set of all rationales as $\mathcal{R}(c, a) = \{R_1(c, a), \ldots, R_n(c, a)\}$, where $n$ is the number of rationales.
Given a claim $c$ and a corpus $A$, the system must predict a set of evidence abstracts $\hat{E}(c)$. For each abstract $a \in \hat{E}(c)$, it must predict a label $\hat{y}(c, a)$ and a collection of rationale sentences $\hat{S}(c, a) = \{\hat{s}_1(c, a), \ldots, \hat{s}_m(c, a)\}$. Note that although the gold annotations may contain multiple separate rationales, to simplify the prediction task we simply require the model to predict a single collection of rationale sentences; these sentences may come from multiple gold rationales.

Task Evaluation
Abstract-level evaluation is inspired by the FEVER Score and measures the system's ability to correctly identify evidence abstracts. A predicted abstract $a \in \hat{E}(c)$ is correctly identified if (1) $a$ is a gold evidence abstract for $c$, (2) the predicted label is correct: $\hat{y}(c, a) = y(c, a)$, and (3) the predicted rationale sentences contain a gold rationale, i.e., there exists some gold rationale $R(c, a) \subseteq \hat{S}(c, a)$. Like FEVER, which limits the maximum number of predicted rationale sentences to five, SCIFACT limits predictions to three rationale sentences. Overall performance is measured by the F1 of the precision and recall of correctly-identified evidence abstracts, which we refer to as $\text{F1}_{\text{abstract}}$.
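The three abstract-level conditions above can be sketched in code. This is a minimal illustration rather than the official scorer: the function `abstract_level_prf` and the dictionary layout are hypothetical, `pred` is assumed to contain only abstracts predicted as SUPPORTS or REFUTES, and the three-sentence limit is assumed to be enforced by keeping the first three predicted sentences.

```python
def abstract_level_prf(gold, pred, max_sentences=3):
    """Precision, recall, and F1 over correctly identified evidence abstracts.

    gold: {(claim_id, abstract_id): {"label": str, "rationales": [set of int]}}
          with one entry per gold evidence abstract.
    pred: {(claim_id, abstract_id): {"label": str, "sentences": [int]}}
          with one entry per predicted evidence abstract.
    """
    correct = 0
    for key, p in pred.items():
        g = gold.get(key)
        if g is None:                     # (1) not a gold evidence abstract
            continue
        if p["label"] != g["label"]:      # (2) wrong label
            continue
        # Assumption: truncate to the first `max_sentences` predictions.
        kept = set(p["sentences"][:max_sentences])
        if any(r <= kept for r in g["rationales"]):   # (3) contains a gold rationale
            correct += 1
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example (hypothetical ids and sentence indices).
gold = {
    ("c1", "a1"): {"label": "SUPPORTS", "rationales": [{0, 1}, {4}]},
    ("c1", "a2"): {"label": "REFUTES", "rationales": [{2}]},
}
pred = {
    ("c1", "a1"): {"label": "SUPPORTS", "sentences": [4, 5]},
    ("c1", "a3"): {"label": "SUPPORTS", "sentences": [1]},
}
scores = abstract_level_prf(gold, pred)
print(scores)
```

Here one of two predicted abstracts is correct and one of two gold evidence abstracts is recovered; the extra sentence 5 is not penalized at this level, which is what motivates the sentence-level metric.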
Sentence-level evaluation measures the system performance at identifying individual rationale sentences. We consider this evaluation in addition to the abstract-level evaluation because the abstract-level evaluation does not penalize the prediction of extra rationale sentences. To address this, we define an additional evaluation criterion at the level of individual rationale sentences. When the model correctly identifies all the sentences in a gold rationale, it is rewarded for each sentence in that rationale, but it is also penalized for all other sentences it predicts. More formally, a rationale sentence $\hat{s}(c, a) \in \hat{S}(c, a)$ is correctly identified if (1) the abstract $a$ is correctly labeled, (2) $\hat{s}(c, a)$ is a member of a gold rationale $R(c, a)$, and (3) all other members of $R(c, a)$ are among the predicted $\hat{S}(c, a)$.
Denote the set of correctly predicted rationale sentences for claim $c$ and abstract $a$ as $S^*(c, a)$. We compute rationale sentence precision and recall as
$$\text{precision} = \frac{\sum_{c} \sum_{a} |S^*(c, a)|}{\sum_{c} \sum_{a} |\hat{S}(c, a)|}, \qquad \text{recall} = \frac{\sum_{c} \sum_{a} |S^*(c, a)|}{\sum_{c} \sum_{a} \sum_{R \in \mathcal{R}(c, a)} |R|}.$$
Overall performance is measured as the F1 of the precision and recall, denoted as $\text{F1}_{\text{sentence}}$. For sentence-level evaluation, we do not limit the number of predicted rationale sentences, since the evaluation penalizes models that over-predict.
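A minimal sketch of the sentence-level metric follows, under stated assumptions: precision is taken as the fraction of all predicted sentences that are correctly identified, and recall as the fraction of all gold rationale sentences that are correctly identified. The function `sentence_level_prf` and the dictionary layout are hypothetical and this is not the official scorer.

```python
def sentence_level_prf(gold, pred):
    """Precision, recall, and F1 over correctly identified rationale sentences.

    gold: {(claim_id, abstract_id): {"label": str, "rationales": [set of int]}}
    pred: {(claim_id, abstract_id): {"label": str, "sentences": [int]}}
    """
    # Gold rationales are mutually exclusive, so their sizes can be summed.
    n_gold = sum(len(r) for g in gold.values() for r in g["rationales"])
    n_pred = 0
    n_correct = 0
    for key, p in pred.items():
        predicted = set(p["sentences"])
        n_pred += len(predicted)          # every predicted sentence counts
        g = gold.get(key)
        if g is None or g["label"] != p["label"]:
            continue                      # condition (1): abstract correctly labeled
        for rationale in g["rationales"]:
            # Conditions (2) and (3): credit a sentence only if its whole
            # gold rationale is among the predicted sentences.
            if rationale <= predicted:
                n_correct += len(rationale)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example (hypothetical ids and sentence indices).
gold = {
    ("c1", "a1"): {"label": "SUPPORTS", "rationales": [{0, 1}, {4}]},
    ("c1", "a2"): {"label": "REFUTES", "rationales": [{2}]},
}
pred = {
    ("c1", "a1"): {"label": "SUPPORTS", "sentences": [0, 1, 3]},
    ("c1", "a2"): {"label": "REFUTES", "sentences": [2]},
}
scores = sentence_level_prf(gold, pred)
print(scores)
```

In the toy example, sentence 3 is an over-prediction that lowers precision, and the unrecovered rationale {4} lowers recall.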

VERISCI: Baseline model
We develop a baseline for scientific fact checking by adapting the "BERT-to-BERT" model for "hard" rationale selection presented in DeYoung et al. (2019) for a number of rationalized NLP tasks including FEVER; this approach is also similar to the fact-checking model presented in Soleimani et al. (2019). Our baseline (called VERISCI) takes a claim $c$ and corpus $A$ as input, identifies evidence abstracts $\hat{E}(c)$, and predicts a label $\hat{y}(c, a)$ and rationale sentences $\hat{S}(c, a)$ for each $a \in \hat{E}(c)$. VERISCI is a pipeline of three components:
1. ABSTRACTRETRIEVAL, which retrieves the $k$ abstracts with highest TF-IDF similarity to the input claim.
2. RATIONALESELECTION, which identifies rationales $\hat{S}(c, a)$ for each candidate abstract (§5.1).
3. LABELPREDICTION, which makes the final label prediction $\hat{y}(c, a)$ (§5.2).
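The first pipeline stage, retrieving the k abstracts most similar to the claim under TF-IDF, can be sketched with the standard library alone. The text does not specify the tokenization or the TF-IDF weighting, so the regex tokenizer, the smoothed IDF formula, and cosine similarity used here are assumptions, and `retrieve_top_k` is a hypothetical name.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve_top_k(claim, abstracts, k=3):
    """Return indices of the k abstracts with highest TF-IDF cosine similarity."""
    docs = [Counter(tokenize(a)) for a in abstracts]
    n_docs = len(docs)
    # Document frequency: number of abstracts containing each term.
    df = Counter(term for d in docs for term in d)
    # Smoothed inverse document frequency (an assumed weighting scheme).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

    def vectorize(counts):
        # Terms unseen in the corpus are dropped from the vector.
        return {t: tf * idf[t] for t, tf in counts.items() if t in idf}

    def cosine(u, v):
        dot = sum(w * v[t] for t, w in u.items() if t in v)
        norm = (math.sqrt(sum(w * w for w in u.values()))
                * math.sqrt(sum(w * w for w in v.values())))
        return dot / norm if norm else 0.0

    query = vectorize(Counter(tokenize(claim)))
    scored = [(cosine(query, vectorize(d)), i) for i, d in enumerate(docs)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by index
    return [i for _, i in scored[:k]]


# Toy corpus (hypothetical one-sentence "abstracts").
abstracts = [
    "Citalopram slowed amyloid beta production in cerebrospinal fluid.",
    "Lopinavir and ritonavir reduced coronavirus viral loads.",
    "Antibiotic treatment reduces free fatty acid levels in mice.",
]
top = retrieve_top_k("Lopinavir reduces coronavirus viral loads", abstracts, k=2)
print(top)
```

The retrieved indices would then be passed as candidate abstracts to the rationale selection stage.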

Rationale selection
Given a claim $c$ and candidate abstract $a$, we train a model to predict $z_i$ for each abstract sentence $a_i$, where $z_i = \mathbb{1}[a_i \text{ is a rationale sentence}]$. For each sentence, we encode the concatenated sequence $w_i = [a_i, \text{SEP}, c]$ using BERT and predict a score $\tilde{z}_i = \sigma[f(\text{CLS}(w_i))]$, where $\sigma$ is the sigmoid function, $f$ is a linear layer, and $\text{CLS}(w_i)$ is the CLS token from the BERT encoding of $w_i$. We minimize cross-entropy loss between $z_i$ and $\tilde{z}_i$ during training. We train the model on pairs of claims and their cited abstracts from our corpus. For each claim, we use cited abstracts labeled NOTENOUGHINFO, as well as non-rationale sentences from abstracts labeled SUPPORTS and REFUTES, as negative examples. We threshold the sigmoid values when performing selection.
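The data side of rationale selection can be sketched without the BERT encoder itself. The helper names `build_training_examples` and `select_rationales` are hypothetical, the per-sentence logits stand in for the output of the linear layer over the CLS encoding, and the threshold of 0.5 is an assumption, since the text only says that the sigmoid values are thresholded.

```python
import math


def build_training_examples(claim, abstract_sentences, label, rationale_indices):
    """Build (sentence, claim, target) triples for one claim-abstract pair.

    Sentences in a gold rationale of a SUPPORTS / REFUTES abstract get target 1.
    All other sentences, including every sentence of a NOTENOUGHINFO abstract,
    are negative examples with target 0.
    """
    positives = set(rationale_indices) if label in ("SUPPORTS", "REFUTES") else set()
    return [(sentence, claim, 1 if i in positives else 0)
            for i, sentence in enumerate(abstract_sentences)]


def select_rationales(logits, threshold=0.5):
    """Apply the sigmoid to per-sentence logits and keep indices above threshold."""
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    return [i for i, s in enumerate(scores) if s >= threshold]


# Toy example (hypothetical sentences and logits).
sentences = ["Background sentence.", "Key finding.", "Conclusion restating finding."]
examples = build_training_examples("A claim.", sentences, "SUPPORTS", [1])
selected = select_rationales([2.0, -1.0, 0.3])
print([target for _, _, target in examples], selected)
```

In a full system, each (sentence, claim) pair would be encoded by BERT to produce the logit; only the labeling and thresholding logic is shown here.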

Label prediction
Sentences identified by the rationale selector are passed to a separate BERT model to make the final labeling decision. Given a claim $c$ and abstract $a$, we concatenate the claim and the rationale sentences $u = [s_1(c, a), \ldots, s_m(c, a), \text{SEP}, c]$, and predict $\tilde{y}(c, a) = \phi[f(\text{CLS}(u))]$, where $\phi$ is the softmax function, and $f$ is a linear layer with three outputs representing the $\{\text{SUPPORTS}, \text{REFUTES}, \text{NOTENOUGHINFO}\}$ labels. We minimize the cross-entropy loss between $\tilde{y}(c, a)$ and the true label $y(c, a)$.
We train the model on pairs of claims and their cited abstracts using gold rationales as input. For abstracts labeled NOTENOUGHINFO, we randomly choose $k$ sentences from the cited abstract as input rationales. When making predictions, we use the predicted rationale sentences $\hat{S}(c, a)$ as input and predict $\hat{y}(c, a) = \operatorname{argmax} \tilde{y}(c, a)$. The system predicts NOTENOUGHINFO when given an abstract with no rationale sentences.

Experiments
In our experiments, we (1) establish a performance baseline on SCIFACT using VERISCI, (2) analyze the performance of the three components of VERISCI, (3) demonstrate the importance of in-domain training data, and (4) present promising qualitative results on verifying claims about COVID-19 using VERISCI.

Results
Table 1 shows the full-pipeline performance of VERISCI on the SCIFACT test set, evaluated using the abstract-level and sentence-level metrics defined in §4. The $\text{F1}_{\text{abstract}}$ value of 46.5 indicates that, for roughly half of the claim-abstract pairs, VERISCI correctly identifies the SUPPORTS or REFUTES label and provides reasonable evidence to justify the decision. Given the difficulty of the task and limited in-domain training data, we consider this a promising result, while leaving plenty of room for improvement.
Oracle experiments. To examine the performance of each system component, we run the VERISCI pipeline, replacing some components with "oracles" that always make correct predictions when given correct inputs. The first three rows in Table 1 are double-oracle; in these experiments, we isolate the performance of a single model component together with two oracles. The next three rows are single-oracle, and examine performance using two of the three model components combined with one oracle. Interestingly, the three pipeline components share similar levels of responsibility for model errors as measured by $\text{F1}_{\text{sentence}}$. The double-oracle models all have $\text{F1}_{\text{sentence}}$ values around 80. The single-oracle models have values around 60, and the final system $\text{F1}_{\text{sentence}}$ is roughly 40. Thus, replacing a single oracle with the corresponding model component introduces a loss of roughly 20 $\text{F1}_{\text{sentence}}$. These results suggest that no single module is serving as a performance "bottleneck"; improvements at each stage of the pipeline are likely to improve overall performance.
Training datasets. During model development, we train the RATIONALESELECTION and LABELPREDICTION modules on four different datasets: FEVER, UKP Snopes, SCIFACT, and FEVER pretraining followed by SCIFACT fine-tuning. The RATIONALESELECTION module is evaluated on its ability to identify rationale sentences given gold abstracts [12]. The LABELPREDICTION module is evaluated on its classification accuracy given gold rationales from evidence abstracts (including evidence documents labeled NOTENOUGHINFO). The results of these experiments are shown in Table 2. For RATIONALESELECTION, training on SCIFACT alone produces good results, perhaps because domain-specific lexical cues are sufficient in most cases for identifying rationale sentences. For the more complex reasoning involved in LABELPREDICTION, domain adaptation was the most effective approach, training first on the large FEVER dataset and then the smaller in-domain SCIFACT training set. Based on these results, we use the RATIONALESELECTION module trained on SCIFACT only, and the LABELPREDICTION module trained on FEVER + SCIFACT for our final end-to-end system VERISCI. Additional implementation details can be found in Appendix A.

Verifying claims about COVID-19
We conduct exploratory experiments using our system to fact-check claims concerning COVID-19.
We ask a medical student to write 36 COVID-related claims. For each claim $c$, we use VERISCI to predict evidence abstracts $\hat{E}(c)$. The same medical student annotator assigns a label to each $(c, \hat{E}(c))$ pair. A pair is labeled plausible if at least half of the evidence abstracts in $\hat{E}(c)$ are judged to have reasonable rationales and labels. It is labeled missed if $\hat{E}(c) = \emptyset$. Finally, it is labeled implausible if the majority of the abstracts in $\hat{E}(c)$ have irrelevant rationales or incorrect labels. Table 3 shows two example claims, both with supporting and refuting evidence identified by VERISCI. For the majority of these COVID-related claims (23 out of 36), the rationales produced by VERISCI were deemed plausible by our annotator, demonstrating that VERISCI is able to successfully retrieve and classify evidence in many cases. An examination of errors reveals that the system can be confused by context, where abstracts are labeled SUPPORTS or REFUTES even though the rationale sentences reference a different disease or drug from the claim. An example of this is also provided in Table 3.
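The three-way labeling rule used in this exploratory study can be written down directly. The function `judge_claim` and its boolean input format are hypothetical; each entry records whether the annotator judged one retrieved abstract to have a reasonable rationale and label.

```python
def judge_claim(abstract_judgments):
    """Label one claim from per-abstract judgments.

    abstract_judgments: list of bool, one per predicted evidence abstract;
        True means the rationale and label were judged reasonable.
    """
    if not abstract_judgments:
        return "missed"        # no evidence abstracts were predicted
    reasonable = sum(abstract_judgments)
    if 2 * reasonable >= len(abstract_judgments):
        return "plausible"     # at least half are reasonable
    return "implausible"       # the majority are irrelevant or mislabeled


labels = [judge_claim(j) for j in ([], [True, False], [True, False, False])]
print(labels)
```

Note that an even split counts as plausible under the "at least half" wording.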

Discussion and Future Directions
Though SCIFACT represents progress in scientific fact-checking, we look forward to making further improvements. In several cases described below, we attempt to collect more fine-grained data for certain subtasks, but are impeded by annotation challenges. We also discuss how the task of scientific fact-checking can be naturally extended to involve evidence synthesis.

Partial evidence
During pilot annotations for entailment labeling, annotators are instructed to label abstracts as one of SUPPORTS, PARTIALLYSUPPORTS, NOTENOUGHINFO, PARTIALLYREFUTES, or REFUTES. The Perspectrum dataset (Chen et al., 2019) features a similar annotation scheme for annotating evidence in online debates. The PARTIAL categorization is useful in cases like the one shown in Figure 5, where the abstract contains relevant evidence, but the context is different (mouse vs. human). When an annotator selects a PARTIAL label, they are also instructed to edit the claim being verified, making as few changes as possible, such that the evidence would provide full support / contradiction for the edited claim.
Unfortunately, inter-annotator label agreement is only 0.48 Cohen's κ on this more granular annotation task, largely due to disagreement over the PARTIAL label. This is unsurprising given the subjectivity of the task, and is consistent with the findings from Chen et al. (2019). Based on this low agreement, we completely remove partially-supported claims from the task dataset, though we make these claims and their edits available as a supplement to the dataset. Improving agreement on partial labels is part of ongoing work.
Content of Figure 5:
Claim: Treating the gut microbiome with antibiotics reduces levels of free fatty acids in patients with high-fat diets.
Edited claim: Treating the gut microbiome with antibiotics reduces levels of free fatty acids in mice.
Rationale: Antibiotic treatment reduces free fatty acid levels in the gut microbial community of mice susceptible to C. difficile infection.

Modeling contextual information
Similarly, for claim verification, we initially instruct annotators to identify primary and supplemental rationale sentences for each rationale. Primary sentences are those that are needed to verify the claim, while supplemental sentences provide important context that is missing from primary sentences but is still necessary for appropriately selecting the SUPPORTS or REFUTES label. For example, in Figure 1, the claim specifies "in experimental animals," yet no part of the rationale sentence indicates that its content applies to experimental animals. In this case, another sentence in the rationale abstract supplying information that the experiment was conducted in mice would qualify as a supplemental sentence for this rationale.
We provide some guidance on when and how to select supplemental sentences, such as defining context to be aspects of the claim such as country or population, or instructing annotators to select the first sentence in an abstract that provides the supplementary information. However, agreement on supplemental rationale sentences is low among annotators (Cohen's κ = 0.45). Consequently, we remove supplemental rationale sentences from the task dataset, though we continue to work with annotators on improving agreement.

Evidence synthesis
Evidence synthesis (Marshall et al., 2017) is the task of combining relevant information across different sources to inform decision making. Evaluating the veracity of a scientific statement is challenging, even for human experts. It requires assessing the strength of conflicting evidence from documents of varying degrees of support, credibility, and recency, and synthesizing the results in a meaningful and actionable way.
Evidence synthesis is not a current part of our task definition. Though we do not ask our system to make corpus-level decisions about a claim's veracity, the extracted evidence and entailment labels produced by VERISCI can naturally be extended for evidence synthesis. However, because performance degrades with each additional pipeline component, further understanding of the scientific fact-checking task and its subtasks is necessary before such a system could be useful in practice. Accurate representations of partial evidence and contextual knowledge are necessary steps towards this goal.
Claim: Lopinavir / ritonavir have exhibited favorable clinical responses when used as a treatment for coronavirus.
Supports: The 54-year old male is the third patient diagnosed with COVID-19 in Korea . . . Interestingly, after lopinavir/ritonavir (Kaletra, AbbVie) was administered, β-coronavirus viral loads significantly decreased and no or little coronavirus titers were observed.
Refutes: The focused drug repurposing of known approved drugs (such as lopinavir/ritonavir) has been reported failed for curing SARS-CoV-2 infected patients. It is urgent to generate new chemical entities against this virus . . .
Wrong context: There are no approved treatments for MERS-CoV infection although a combination of lopinavir, ritonavir and interferon beta . . . In mice, both prophylactic and therapeutic RDV improve pulmonary function and reduce lung viral loads and severe lung pathology.
Claim: The coronavirus cannot thrive in warmer climates.
Supports: ...most outbreaks display a pattern of clustering in relatively cool and dry areas...This is because the environment can mediate human-to-human transmission of SARS-CoV-2, and unsuitable climates can cause the virus to destabilize quickly...
Refutes: ...significant cases in the coming months are likely to occur in more humid (warmer) climates, irrespective of the climate-dependence of transmission and that summer temperatures will not substantially limit pandemic growth.
Table 3: Results of our system on several claims concerning COVID-19. In some cases, the label is predicted given the wrong context, e.g., the third evidence sentence for the first claim is a finding about Lopinavir, but for the wrong disease (MERS-CoV).

Conclusion
Fact checking is important in the scientific domain because it allows us to trace the sources and measure the veracity of scientific claims. These abilities have emerged as particularly important in the context of the reproducibility crisis in science and the rise of disinformation in society. In this article, we formalize the definition of scientific fact checking, and release a dataset (SCIFACT) and models (VERISCI) to support work on this task.
Scientific fact checking poses a set of unique challenges, pushing the limits of neural models on complex language understanding and reasoning. Domain-adaptation techniques show promise, but our findings suggest that additional work is necessary to improve the performance of end-to-end fact-checking systems. We also demonstrate how fact checking might work in practice, by applying our system to the real-world problem of verifying claims related to COVID-19. We hope that these resources encourage others to pursue and expand upon our work, and to further shed light on the broader and more challenging goal of scientific document understanding.

B Detailed corpus statistics
We compute statistics separately for structured abstracts (abstracts that are organized into well-defined sections) and for unstructured abstracts.
Table 4 provides statistics summarizing the lengths of abstracts and rationales. Table 5 shows the counts for each claim-abstract label category in the train, dev, and test sets. Table 6 shows the number of evidence documents supporting each claim. The majority of claims are supported by a single document set.
Figure 6a shows the distribution of the number of rationales in structured and unstructured abstracts. Structured abstracts are more likely to have two evidence sets, for instance, one in the "results" section and one in the "conclusions" section. Figure 6b shows the distribution of sentences per rationale.
Figure 7 shows the fraction of sentences in each abstract that are part of a rationale. Unstructured abstracts have a heavier "right tail", representing cases where the abstract is short and the entire abstract supports the claim.
MeSH terms for evidence documents appear in Figure 8. Terms like Human, Risk factors, and Treatment outcome are common to randomized control trial reports. Terms like DNA, RNA, and Cell differentiation indicate molecular biology research.

Figure 3: A claim supported by two rationales from the same abstract. The text of each rationale on its own provides sufficient evidence to verify the claim.

Figure 4: Citance and abstract selection. Citing abstracts are identified for each seed document. A claim is written based on the citation in the citing abstract. Co-cited and distractor abstracts are added to the corpus.

Figure 5: An abstract that partially supports a claim. The edited claim is fully supported.

Figure 9: The claim-writing interface. The citation sentence is highlighted in blue on the top left. Additional context is provided on bottom left. The right side shows two claims that could be written based on this citation sentence.

Figure 10: The evidence collection interface.

Table 1: Test set performance of our VERISCI system on SCIFACT, as measured by the sentence-level and abstract-level performance evaluations defined in §4.2. For the Oracle experiments, the first three columns in the table indicate whether each module in the pipeline has been replaced by an oracle, or uses a VERISCI system component. In Final system, we present the results of the full VERISCI pipeline.

Table 2: Comparison of different training datasets for RATIONALESELECTION and LABELPREDICTION, evaluated on the SCIFACT dev set.
[12] Our FEVER-trained RATIONALESELECTION module achieves 79.9 sentence-level F1 on the FEVER test set, virtually identical to the value of 79.6 reported in DeYoung et al. (2019).

Table 4: Summary statistics on the abstracts in the corpus. The Abstract length is measured in number of sentences. The Rationale fraction is the fraction of sentences in each abstract that are rationales.