Constrained Fact Verification for FEVER

Fact-verification systems are well explored in the NLP literature, with growing attention owing to shared tasks like FEVER. Though the task requires reasoning over extracted evidence to verify a claim's factuality, there is little work on understanding this reasoning process. In this work, we propose a new methodology for fact verification, specifically for FEVER, that enforces a closed-world reliance on the extracted evidence. We present an extensive evaluation of state-of-the-art verification models under these constraints.


Introduction
A rapid increase in the spread of misinformation on the Internet has necessitated automated solutions to determine the validity of a given piece of information. To this end, the Fact Extraction and VERification (FEVER) shared task (Thorne et al., 2018a) introduced a dataset for evidence-based fact verification. Given a claim, the task involves extracting relevant evidence sentences from a given Wikipedia dump and assigning a label to the claim by reasoning over the extracted evidence (SUPPORTS / REFUTES / NOTENOUGHINFO).
Several recent works (Liu et al., 2020; Soleimani et al., 2020; Zhao et al., 2020) leverage representations from large pre-trained language models (LMs) like BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) to achieve state-of-the-art results on FEVER. However, it is unclear how the factual knowledge encoded in these LMs influences the verification process.
More recently, Lee et al. (2020) developed a fact verification system based solely on large pre-trained LMs and demonstrated its superior zero-shot performance on FEVER compared to a random baseline. This result clearly shows the influence of the factual knowledge embedded inside these LMs, but relying entirely on such knowledge directly contrasts with the evidence-based paradigm of fact verification. Such reliance can be problematic, especially with evolving evidence (Wikipedia pages are constantly updated to reflect the latest events). Schuster et al. (2019) illustrate this phenomenon with the example fact "Halep failed to ever win a Wimbledon title", which was valid until July 2019 but not thereafter.
In this work, we propose methods to train fact-verification models that explicitly reason over the available evidence instead of relying on the factual knowledge in pre-trained LMs, thereby emulating a closed-world setting. This is particularly important in the context of the FEVER dataset because the source corpus used to compile FEVER (Wikipedia) overlaps with the corpora commonly used to pre-train LMs.
We build upon the work of Clark et al. (2020), which demonstrated the ability of transformers (BERT, RoBERTa) to function as soft theorem provers. They induce a closed-world reasoning process by fine-tuning on a carefully curated synthetic natural-language rulebase. In this work, we transfer this ability to FEVER and gauge the feasibility of such closed-world reasoning. Additionally, we construct an entity-anonymized version of FEVER, following Hermann et al. (2015), for evaluating our proposed models. We build the anonymized version by masking prominent named entities in the claim-evidence pairs, thereby reducing any reliance on pre-trained factual knowledge.
Our experiments adopt the popular three-stage FEVER pipeline, comprising document selection, evidence sentence extraction, and claim verification (Thorne et al., 2018b). We focus primarily on the claim verification stage, while using the state-of-the-art document selection and evidence sentence extraction from Liu et al. (2020). This focus is motivated by the fact that only the claim verification step involves joint (often complicated) reasoning over the extracted evidence. Our main contributions are:
• We propose pre-training strategies for large pre-trained LMs to induce a closed-world setting during fact verification on FEVER.
• We adapt an existing synthetic natural-language rulebase to FEVER by incorporating the NOTENOUGHINFO label.
• We create an anonymized version of the FEVER dataset to facilitate investigation into factual knowledge through named entities.
Our datasets and code are publicly available at https://github.com/adithya7/constrained-fever.

Constrained Verification
Traditionally, most FEVER systems rely on large pre-trained language models (LMs) to encode the claim and the extracted evidence sentences. Schuster et al. (2019) studied various reasons for the surprisingly good performance of claim-only classifiers on FEVER and reported dataset idiosyncrasies, rather than world knowledge in word embeddings, as the primary cause. However, they present only a preliminary analysis of the impact of world knowledge from GloVe embeddings (Pennington et al., 2014). In this work, we present an in-depth analysis, because the issue is particularly relevant in the context of large pre-trained LMs. To the best of our knowledge, no other work examines the impact of an embedding's world knowledge on FEVER.
In a nutshell, we model the task under a closed-world setting, with the extracted evidence as the only factual information available to the model. We believe the methods proposed in this paper are general enough to apply to any fact-verification task; however, we present a case study only on FEVER due to its widespread popularity.
To this end, we first present an entity-anonymized version of the FEVER dataset and then propose pre-training strategies to enforce the above-described closed-world setting on FEVER models.

Anonymization
A straightforward way to discourage the use of prior factual knowledge in fact-verification systems is to anonymize the named entities. An intuitive way to achieve this is to replace them with abstract entity markers drawn from a custom list. We adapt a related technique from the reading comprehension literature (Hermann et al., 2015) to our task. Given a pair of claim and extracted evidence sentences, we first identify the set of named entities from the Wiki-titles of the evidence sentences. We then replace all occurrences of these named entities with abstract markers sampled randomly from a predefined list. We present an anonymized FEVER instance in Table 1. We use the resulting anonymized FEVER dataset to evaluate our proposed methods.

[Table 1: An anonymized FEVER instance. The Wiki-title "Kung Fu (TV series)" maps to the marker ent1; e.g., "Kung Fu is an American action adventure martial arts western drama television series starring David Carradine" becomes "ent1 is an American action adventure martial arts western drama television series starring ent0".]
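As a concrete illustration, the following is a minimal sketch of this regex-based anonymization; the marker vocabulary, helper names, and disambiguator handling are our own simplifications, not the released implementation.

```python
import random
import re

# Hypothetical marker vocabulary; markers are sampled from a predefined list.
ENTITY_MARKERS = [f"ent{i}" for i in range(100)]

def strip_disambiguator(title: str) -> str:
    """'Kung Fu (TV series)' -> 'Kung Fu'."""
    return re.sub(r"\s*\(.*?\)\s*$", "", title)

def anonymize(claim: str, evidence: list[tuple[str, str]]):
    """Mask named entities, drawn from the Wiki-titles of the evidence
    sentences, in both the claim and the evidence.

    `evidence` is a list of (wiki_title, sentence) pairs.
    """
    entities = {strip_disambiguator(title) for title, _ in evidence}
    mapping = dict(zip(entities, random.sample(ENTITY_MARKERS, len(entities))))

    def mask(text: str) -> str:
        # Replace longer entity strings first to avoid partial overlaps.
        for ent in sorted(mapping, key=len, reverse=True):
            text = re.sub(re.escape(ent), mapping[ent], text)
        return text

    return mask(claim), [(mask(strip_disambiguator(t)), mask(s))
                         for t, s in evidence]
```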

Adapting the RuleTaker Dataset

Clark et al. (2020) analyze the logical reasoning capabilities of transformer-based models on a variety of question-answering and reading comprehension tasks. Given a question and a context comprising a set of simple facts and rules in natural language, models are expected to reason only over the provided context, thereby emulating closed-world reasoning. They propose a synthetic training dataset (henceforth referred to as the RuleTaker dataset) to fine-tune pre-trained models like RoBERTa. They observe high performance (≥95% accuracy) on the synthetic test set, motivating us to adapt a similar training methodology for FEVER. Table 2 shows an example context from the RuleTaker dataset. Each question-context pair in this dataset belongs to one of the following types: Type-A, provable/disprovable statements that can be labeled by reasoning directly over the context; and Type-B, unprovable statements, for which reasoning over the context is not sufficient to reach a conclusion.

[Table 2: An example context from the RuleTaker dataset. Facts: F1: Bob is blue. F2: Fiona is kind. Rules: R1: All white people are red. R2: Blue people are white. R3: If someone is red then they are kind. The accompanying questions Q1-Q8 are referenced in the text.]

The RuleTaker dataset assigns a TRUE or FALSE label to each question-context pair. Type-A pairs are labeled by reasoning over the context, whereas Type-B pairs are labeled by invoking the closed-world assumption (CWA) (Q4, Q5 in Table 2). The provided context (facts and rules) constitutes the closed-world setup. Moreover, Type-A pairs are additionally annotated with a proof, a reasoning chain over a subset of the facts and rules.
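To make the labeling scheme concrete, here is a toy sketch (our construction, not the RuleTaker code) of forward chaining under the CWA over the Table 2 context; disprovable (negated) Type-A statements are omitted for brevity.

```python
# Toy illustration of labeling under the closed-world assumption via
# forward chaining over the Table 2 facts and rules.
Fact = tuple[str, str]   # (entity, attribute), e.g. ("Bob", "blue")
Rule = tuple[str, str]   # premise attribute -> conclusion attribute

FACTS: set[Fact] = {("Bob", "blue"), ("Fiona", "kind")}
RULES: list[Rule] = [
    ("white", "red"),    # R1: all white people are red
    ("blue", "white"),   # R2: blue people are white
    ("red", "kind"),     # R3: if someone is red then they are kind
]

def closure(facts: set[Fact], rules: list[Rule]) -> set[Fact]:
    """Forward-chain until no new facts can be derived."""
    derived, changed = set(facts), True
    while changed:
        changed = False
        for premise, conclusion in rules:
            for entity, attr in list(derived):
                if attr == premise and (entity, conclusion) not in derived:
                    derived.add((entity, conclusion))
                    changed = True
    return derived

def label(question: Fact) -> str:
    """Provable -> TRUE; anything unprovable is FALSE under the CWA."""
    return "TRUE" if question in closure(FACTS, RULES) else "FALSE (by CWA)"

print(label(("Bob", "kind")))    # TRUE: F1 -> R2 -> R1 -> R3 (Type-A, with proof)
print(label(("Fiona", "red")))   # FALSE (by CWA): not derivable (Type-B)
```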

We adapt the RuleTaker dataset to FEVER by introducing a new NOTENOUGHINFO label for unprovable question-context pairs. In particular, we construct two FEVER-style RuleTaker datasets, namely RuleTaker-CWA and RuleTaker-Skip-Fact (example in Table 2).

RuleTaker-CWA:
We convert the labels of all Type-B pairs to NOTENOUGHINFO (Q4, Q5 in Table 2) and relabel TRUE and FALSE from Type-A as SUPPORTS and REFUTES, respectively (Q1, Q2, Q3 in Table 2).
RuleTaker-Skip-Fact: For each Type-A question, we create a contrastive setting by removing a necessary fact (i.e., one required in the proof) from the original context. The label for the modified question-context pair becomes NOTENOUGHINFO, because the question can no longer be answered under the modified context (Q6, Q7, Q8 in Table 2). We also retain the original Type-A pairs, converting all TRUE and FALSE labels to SUPPORTS and REFUTES, respectively (Q1, Q2, Q3 in Table 2). To maintain a balanced dataset, we randomly sample a fraction of the newly created NOTENOUGHINFO pairs. Note that we only work with Type-A pairs in this variant. Occasionally, there can be multiple valid proofs for the same question-context pair; we currently discard such questions to avoid inconsistencies arising from other valid reasoning chains over the modified context (see the sketch below). Table 3 presents the statistics for the train, dev, and test splits of the proposed RuleTaker-CWA and RuleTaker-Skip-Fact datasets.
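The construction can be summarized in a few lines; in the sketch below, the field names ("proofs", "context", etc.) are hypothetical placeholders rather than the released RuleTaker schema.

```python
import random

def make_skip_fact(example: dict, keep_fraction: float = 0.33) -> list[dict]:
    """Build RuleTaker-Skip-Fact instances from one Type-A example."""
    label_map = {"TRUE": "SUPPORTS", "FALSE": "REFUTES"}
    out = [{**example, "label": label_map[example["label"]]}]  # retained pair

    # Discard questions with multiple valid proofs to avoid inconsistencies
    # from alternative reasoning chains over the modified context.
    if len(example["proofs"]) != 1:
        return out

    for fact in example["proofs"][0]:                 # each necessary fact
        if random.random() < keep_fraction:           # subsample for balance
            context = [s for s in example["context"] if s != fact]
            out.append({**example, "context": context,
                        "label": "NOTENOUGHINFO"})
    return out

# Example usage with the Table 2 context (F1, R2, R1, R3 prove "Bob is kind."):
instances = make_skip_fact({
    "question": "Bob is kind.",
    "context": ["Bob is blue.", "Fiona is kind.", "All white people are red.",
                "Blue people are white.", "If someone is red then they are kind."],
    "label": "TRUE",
    "proofs": [["Bob is blue."]],   # the single fact used by the proof
}, keep_fraction=1.0)
```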
As a natural adaptation, we also considered creating a similar Skip-Fact variant of the FEVER dataset itself. Each claim in FEVER is annotated with potentially many evidence sets, and each evidence set can consist of multiple evidence sentences. Ideally, all sentences within a single evidence set should be needed to validate the claim, i.e., verification should require multi-hop reasoning. Unfortunately, we noticed cases where a proper subset of an evidence set is enough to prove/disprove the claim (see Table 4).

Methodology
We now present the methodology to train constrained fact-verification models for the FEVER shared task. Many state-of-the-art FEVER models use the standard BERT encoder (Devlin et al., 2019) to encode a concatenation of the claim and evidence sentences. To enforce closed-world reasoning over the available evidence, we first pre-train the BERT encoder on the proposed variants of the RuleTaker datasets, following Clark et al. (2020).

[Table 4: A FEVER claim whose evidence set need not be used in full. Claim: "Roman Atwood is a content creator." Evidence set: 1. (Roman Atwood) Roman Bernard Atwood (born ...) [truncated]; 2. (Comedian) A popular saying, variously quoted but generally attributed to Ed Wynn, is, "A comic says funny things; a comedian says things funny", which draws a distinction between how much of the comedy can be attributed to verbal content and how much to acting and persona.]

The reasoning models in Clark et al. (2020) were first trained on the RACE multiple-choice question answering dataset (Lai et al., 2017) and then fine-tuned on the RuleTaker dataset. In our experiments, we follow the same pipeline (including hyper-parameters), except that we replace the original RuleTaker dataset with our adaptations, RuleTaker-CWA and RuleTaker-Skip-Fact. In Table 5, we present the results of the models pre-trained on RuleTaker-CWA and RuleTaker-Skip-Fact on their respective test sets. In general, we observe high accuracy on these synthetic test sets, indicating the models' ability to rely only on the available evidence.
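For concreteness, the following condensed sketch shows what this fine-tuning stage can look like with huggingface transformers; the toy two-example dataset, the hyper-parameters, and the output path are purely illustrative, and the intermediate RACE stage is omitted.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = {"SUPPORTS": 0, "REFUTES": 1, "NOTENOUGHINFO": 2}

# Two toy RuleTaker-CWA style pairs; the real training set is far larger.
raw = Dataset.from_dict({
    "question": ["Bob is kind.", "Fiona is red."],
    "context": ["Bob is blue. Blue people are white. All white people are red. "
                "If someone is red then they are kind."] * 2,
    "label": [LABELS["SUPPORTS"], LABELS["NOTENOUGHINFO"]],
})

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoded = raw.map(lambda ex: tokenizer(ex["question"], ex["context"],
                                       truncation=True, padding="max_length",
                                       max_length=128))

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=3)  # SUPPORTS / REFUTES / NOTENOUGHINFO

args = TrainingArguments(output_dir="ruletaker-cwa-bert", num_train_epochs=1,
                         per_device_train_batch_size=2)
Trainer(model=model, args=args, train_dataset=encoded).train()

# The fine-tuned encoder weights are then used to initialize the FEVER
# verification models (BERT-concat, KGAT, Transformer-XH).
model.save_pretrained("ruletaker-cwa-bert")
```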
We now combine the above fine-tuned BERT encoders (CWA, Skip-Fact) with two state-of-the-art graph-based reasoning networks for claim verification, KGAT (Liu et al., 2020) and Transformer-XH (Zhao et al., 2020), as well as a strong BERT-based classifier.

BERT-concat: The evidence sentences retrieved before claim verification are concatenated with the claim, along with their Wiki-titles, and encoded using a pre-trained BERT encoder. The [CLS] representation from the encoder is then directly used for classification.
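Below is a minimal sketch of such a classifier; the model class, the anonymized example strings, and the segment layout are our illustrative choices rather than the exact released implementation.

```python
import torch
from transformers import BertModel, BertTokenizer

class BertConcat(torch.nn.Module):
    """Minimal BERT-concat sketch: one sequence holding the claim and all
    retrieved (Wiki-titled) evidence sentences; [CLS] feeds a classifier."""

    def __init__(self, encoder_path: str = "bert-base-cased", num_labels: int = 3):
        super().__init__()
        self.encoder = BertModel.from_pretrained(encoder_path)
        self.classifier = torch.nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask,
                           token_type_ids=token_type_ids)
        return self.classifier(out.last_hidden_state[:, 0])  # [CLS] -> 3 labels

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
claim = "ent1 is an American television series."          # illustrative claim
evidence = ("ent1 : ent1 is an American action adventure martial arts "
            "western drama television series starring ent0 .")
batch = tokenizer(claim, evidence, truncation=True, return_tensors="pt")
logits = BertConcat()(**batch)                            # (1, 3) class scores
```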
KGAT (Liu et al., 2020): A kernel-based graph attention network over the evidence graph. Each node in the graph encodes the concatenation of an individual evidence sentence (along with its Wiki-title) and the claim. Knowledge propagation between the nodes of this graph is achieved using a Gaussian edge kernel over a word-word similarity matrix, while individual node importance is measured using a separate node kernel. The initial node representations are refined using the above kernels and a single graph attention layer.

Transformer-XH (Zhao et al., 2020): The evidence graph is constructed and initialized in the same way as for KGAT, but knowledge propagation between the nodes is achieved using a special eXtra-Hop attention mechanism. For each node, the [CLS] token embedding from BERT serves as an attention hub and is revised using a combination of the eXtra-Hop attention and the traditional in-sequence attention (a schematic sketch follows below).

We compare the above-proposed curricula (CWA, Skip-Fact) against a baseline curriculum (Original), where we initialize the verification models with standard pre-trained BERT weights (bert-base-cased). We use huggingface transformers (Wolf et al., 2019) in all of our experiments.
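The following is a schematic, simplified single eXtra-Hop step, reflecting our reading of the mechanism rather than the authors' code: each node's [CLS] hub attends over the hubs of its neighbours in the evidence graph.

```python
import torch

class ExtraHopLayer(torch.nn.Module):
    """Schematic single eXtra-Hop step (a simplification): per-node [CLS]
    hubs exchange information across the evidence graph."""

    def __init__(self, hidden: int = 768):
        super().__init__()
        self.q = torch.nn.Linear(hidden, hidden)
        self.k = torch.nn.Linear(hidden, hidden)
        self.v = torch.nn.Linear(hidden, hidden)

    def forward(self, hubs: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # hubs: (num_nodes, hidden) per-node [CLS] embeddings from BERT
        # adj:  (num_nodes, num_nodes) 0/1 adjacency; assumed to include
        #       self-loops so every row has at least one neighbour
        scores = self.q(hubs) @ self.k(hubs).T / hubs.size(-1) ** 0.5
        scores = scores.masked_fill(adj == 0, float("-inf"))
        hop = torch.softmax(scores, dim=-1) @ self.v(hubs)  # extra-hop messages
        return hubs + hop  # revised hubs, later combined with in-sequence attention

# FEVER systems typically use a fully connected evidence graph:
hubs = torch.randn(5, 768)                     # 5 evidence nodes
adj = torch.ones(5, 5)
revised = ExtraHopLayer()(hubs, adj)           # shape: (5, 768)
```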

Experiments
For each of the three models, BERT-concat, Transformer-XH, and KGAT, we show results for the three training curricula, Original, CWA, and Skip-Fact, in Table 6. We evaluate all our trained models on three datasets: the official dev set of the FEVER task (Std.), symmetric FEVER v0.2 (Sym.) (Schuster et al., 2019), and our anonymized FEVER (Anon.). On most evaluation sets, the models trained with the Original curriculum performed better than those trained with our proposed curricula (CWA, Skip-Fact), except on symmetric FEVER, where Transformer-XH with Skip-Fact does slightly better. Across models, we notice a considerable drop in performance on the Anon. set, validating our hypothesis about the existing reliance on factual knowledge. To isolate the impact of entity anonymization, we also train the BERT-concat model on the train split of the Anon. FEVER dataset. We observe improvements across the three curricula, with Original still outperforming the proposed curricula (Table 7).
Through our constrained verification setup, we expect the models to reason using only the extracted evidence. The evidence retrieval from Liu et al. (2020) achieves a recall of 94%, indicating the feasibility of reasoning only on extracted evidence in FEVER. With Original outperforming the proposed strategies on both the standard and anonymized FEVER, we find that world knowledge is helpful for FEVER.
Limitations

First, our anonymization is a regex-based method that relies only on the entities in Wiki-titles, which may be insufficient for handling ambiguous titles. Second, the RuleTaker dataset's domain is significantly different from that of the FEVER dataset, presenting a challenge in re-using the pre-trained encoder. Additionally, it is not entirely clear what constitutes the world (or factual) knowledge for a given task, and, as highlighted by Clark et al. (2020), effectively combining implicit pre-trained knowledge (from encoders) with explicitly stated knowledge (from evidence) remains a challenge.

Related Work
We adopt the widely used document selection method from Hanselowski et al. (2018). Many recent state-of-the-art FEVER systems involve reasoning over evidence graphs (Zhong et al., 2019; Liu et al., 2020; Zhao et al., 2020), along with competitive LM-based models (Soleimani et al., 2020). Dataset-specific idiosyncrasies have been identified in FEVER (Thorne et al., 2019; Schuster et al., 2019) as well as in NLI (Gururangan et al., 2018; Poliak et al., 2018; Naik et al., 2018; McCoy et al., 2019), but they are not the focus of this work.

Conclusion
We identify a critical issue with existing claim verification systems, especially recent models that utilize large pre-trained LMs. We propose performing fact verification under a closed-world setting and present our results on FEVER. While it is hard to quantify the reliance on implicit pre-trained knowledge, our initial results indicate that such reliance is helpful for FEVER.