Distilling the Evidence to Augment Fact Verification Models

The alarming spread of fake news on social media, together with the impossibility of scaling manual fact verification, has motivated the development of natural language processing techniques to automatically verify the veracity of claims. Most approaches perform claim-evidence classification without providing any insight into why the claim is trustworthy or not. We propose, instead, a model-agnostic framework that consists of two modules: (1) a span extractor, which identifies the crucial information connecting claim and evidence; and (2) a classifier that combines claim, evidence, and the extracted spans to predict the veracity of the claim. We show that the spans are informative for the classifier, improving performance and robustness. Tested with several state-of-the-art models on the FEVER dataset, the enhanced classifiers consistently achieve higher accuracy while also showing reduced sensitivity to artifacts in the claims.


Introduction
The increasing quantity of information that circulates on social media and the Web every day, together with the high cost of assessing its veracity, has demanded the application of natural language processing (NLP) techniques to the task of fact verification. In recent years, the NLP community has proposed a large number of datasets and approaches for addressing this task, facing complex challenges that are still far from being solved.
The task of fact verification can be split into two subtasks: (i) retrieving one or more candidate pieces of evidence; (ii) assessing whether they support or refute a claim, or whether they contain insufficient information to state either of the above. In this paper, we mostly focus on the reasoning between the claim and the evidence.
To generate models that work on real-world data, fact verification solutions are expected to: (i) perform well not only on synthetic datasets but also in realistic scenarios, where both text form and text content are highly unpredictable; (ii) produce transparent decisions, providing an explanation for their verdict, so that readers may judge whether or not to trust them. To address these two requirements, we propose a model-agnostic framework that includes two modules: (i) a span extractor, which aims to identify in the evidence the pieces of information that are relevant with respect to the claim; (ii) a classifier, which uses the claim, the evidence and the extracted spans to predict whether the evidence supports the claim, refutes it, or contains insufficient information. The spans extracted by the first module are useful both to enhance the classifier and to inform the user: humans can in fact exploit the spans to understand why a claim is true or false.
We evaluate our pipeline with three highly performing neural models on the FEVER dataset (Thorne et al., 2018), comparing the uninformed to the informed setting. While this dataset includes ground truth for both evidence retrieval and evidence classification, in this paper we exploit only the latter annotations. Our experiments show that the models informed with the extracted spans consistently achieve higher performance than their uninformed counterparts, demonstrating the usefulness of the spans. We also evaluate our models on the challenging SYMMETRIC FEVER dataset (Schuster et al., 2019), which tests a system's robustness in the absence of FEVER's artifacts. We find that the models trained with our pipeline achieve higher accuracy.
Finally, we assess the quality of the extracted spans as decision rationales to be shown to end users. Manually examining a subset of outputs shows that 67% of the support spans and 88% of the refute spans are good explanations of the decision, leading to an aggregate score of 75%.

Related Work
Fake news detection has recently gained interest in the NLP community. Most of the initial works focused on stylistic (Feng et al., 2012) and linguistic approaches (Pérez-Rosas and Mihalcea, 2015). Despite good performance on synthetic datasets, these methods failed when applied to real-world data. New approaches based on fact verification over retrieved evidence have therefore taken the stage in the literature.
Datasets. Several fact verification datasets were developed over the last decade. Vlachos and Riedel (2014) created a dataset consisting of 221 statements and hyperlinks to pieces of evidence of various formats. Many datasets were created in the following years, with collections of claims of increasing size and various kinds of additional information. Among them are Ferreira and Vlachos (2016)'s debunking dataset (300 rumoured claims and 2,595 associated news articles) and Wang (2017)'s LIAR dataset (12,836 short statements labeled for veracity, topic and various metadata on the speaker). In recent years, most systems have been developed on FEVER (Thorne et al., 2018), a large-scale dataset for Fact Extraction and VERification that consists of 185,445 claims and their related evidence, labeled as supporting, refuting or not containing enough information.
Approaches. The field has developed considerably since the first approaches to fact verification (Ferreira and Vlachos, 2016; Wang, 2017; Long et al., 2017). To provide a strong baseline for FEVER, Thorne et al. (2018) proposed a pipeline consisting of document and sentence retrieval followed by a multi-layer perceptron as textual entailment recognizer. More sophisticated models followed. Among them, the Bi-Directional Attention Flow (BiDAF) network (Seo et al., 2016a), originally introduced for machine comprehension, has recently been adapted to the task of fact verification (Tokala et al., 2019). BiDAF combines LSTMs with both context-to-query and query-to-context attention to produce a query-aware context representation at multiple hierarchical levels. Nie et al. (2019) introduced the Neural Semantic Matching Network (NSMN), which aligns two encoded texts and computes the semantic matching between the aligned representations with LSTMs; it earned first place in the first shared task organized on the FEVER dataset.

Method
Given a claim C = {c_1, ..., c_n} and a piece of evidence E = {e_1, ..., e_m}, two word sequences of length n and m respectively, the fact verification problem requires predicting the relation rel(E, C) ∈ {(S)upports, (R)efutes, (I)nsufficient} between E and C.
Framework. We propose a pipeline of two modules: a span extractor M_span and a classifier M_classifier. The goal of M_span(C, E) is to identify the polarizing pieces of information {e_{i_1}, ..., e_{i_N}} in E without which rel(E, C) would be neutral (i.e., C would neither be entailed nor contradicted by E). The identified pieces of information are passed to M_classifier, together with C and E, to perform a three-label classification aimed at predicting rel(E, C): M_classifier(C, E, {e_{i_1}, ..., e_{i_N}}) = l ∈ {S, R, I}.
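The two-module pipeline can be sketched as follows; the function and argument names are our own illustration, not the authors' implementation.

```python
def verify(claim, evidence, span_extractor, classifier):
    """Model-agnostic two-module pipeline (illustrative sketch).

    `span_extractor` plays the role of M_span: it returns the polarizing
    pieces of the evidence.  `classifier` plays the role of M_classifier:
    it maps (claim, evidence, spans) to a label in {"S", "R", "I"}.
    """
    spans = span_extractor(claim, evidence)    # M_span(C, E)
    return classifier(claim, evidence, spans)  # M_classifier(C, E, spans)
```

Because the two modules interact only through this interface, any classifier with the right signature can be plugged in, which is what makes the framework model-agnostic.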

Span Extractor
We utilize the TokenMasker architecture from Shah et al. (2020) for M_span. This masker was developed to identify the minimal group of tokens without which E would be neutral with respect to C. We replace its original neutrality classifier with a RoBERTa model, pretrained instead on an entailment task over a multi-genre corpus (i.e., three-label classification: entailment/neutral/contradiction on the MULTINLI dataset (Williams et al., 2018)).
The choice of a rationale-style extractor (Shah et al., 2020) is due to its ability to provide informative spans that can be used as explanations of the relation between the evidence and the claim. This approach was shown to perform better than simply relying on the internal attention weights of a classifier (Lei et al., 2016; Jain and Wallace, 2019).
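The TokenMasker of Shah et al. (2020) learns its mask end-to-end; purely as an intuition for the objective it is trained toward, the behaviour can be approximated by a greedy search that deletes evidence tokens until a neutrality classifier is satisfied. The function below is our toy illustration (the `is_neutral` oracle stands in for the pretrained entailment model), not the actual architecture.

```python
def extract_polarizing_tokens(claim, evidence_tokens, is_neutral):
    """Greedy toy version of a minimal-span extractor.

    Repeatedly removes a token from the evidence until what remains is
    neutral with respect to the claim; the removed tokens form the span.
    `is_neutral(claim, tokens)` is a stand-in for a pretrained
    entailment/neutrality classifier (an assumption of this sketch).
    """
    removed, remaining = [], list(evidence_tokens)
    while remaining and not is_neutral(claim, remaining):
        for i in range(len(remaining)):
            # prefer a token whose removal immediately restores neutrality
            if is_neutral(claim, remaining[:i] + remaining[i + 1:]):
                removed.append(remaining.pop(i))
                break
        else:
            # no single token suffices: drop the first and keep searching
            removed.append(remaining.pop(0))
    return removed
```

The real model replaces this brute-force search with a learned masking policy, which is what makes it practical at scale.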

Classifiers
To test our assumption, we consider three neural network architectures that have recently achieved the best performance on the first FEVER shared task: BiDAF (Seo et al., 2016b), NSMN (Nie et al., 2019) and BERT (Devlin et al., 2019). Note that the architecture of M_classifier is independent of M_span. The spans extracted by M_span are forwarded to the classifier by concatenating them to the original evidence, after a separator token.

BiDAF consists of four layers: (i) the embedding layer, which encodes two raw text sequences (i.e., C and E) into two vector sequences Ĉ and Ê; (ii) the attention layer, which computes the attention scores between the two sequences and returns two attended sequences C_A and E_A; (iii) the modeling layer, which takes C_A and E_A as input and outputs two fixed-size vectors, Ĉ_A and Ê_A, that capture the semantic similarity between the original sequences; and (iv) the output layer, which takes Ĉ_A and Ê_A and returns the output labels.

NSMN encodes C and E into vector sequences Ĉ and Ê, similarly to BiDAF. It then applies an alignment layer, which computes the alignment matrix A = Ĉ^T Ê and the aligned representations C_A and E_A using Ĉ, Ê and A. This is followed by a matching layer, which performs semantic matching with an LSTM between C_A and Ĉ, as well as between E_A and Ê, to output the matching matrices M_C and M_E, which are finally pooled by the output layer and mapped to the output labels.

BERT (we use the base-uncased version) consists of 12 encoder layers with self-attention (enc_1, ..., enc_12) and one classification layer. Each encoder enc_i takes an input sequence I_{i-1} and outputs I_i, a sequence of the same length where each token is replaced with an embedding capturing its relationship with the other words in I_{i-1}. The output of enc_i becomes the input of enc_{i+1}. I_0 is set as the concatenation of C and E, preceded by the special [CLS] token. The output of the last encoder, enc_12, is therefore a deeply contextualized representation of C and E. It is passed to the classification layer, which maps the representation of the [CLS] token to the output labels.
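Under one plausible reading of the input construction described above (the exact separator and tokenization handling is our assumption, not spelled out in the text), the informed input for a BERT-style classifier would be assembled as:

```python
def build_informed_input(claim, evidence, spans, cls="[CLS]", sep="[SEP]"):
    """Concatenate the extracted spans to the original evidence after a
    separator token, then prepend the claim and the [CLS] token to form
    the classifier input (I_0 for BERT).  String-level handling here is
    a simplification of what a real tokenizer would do."""
    informed_evidence = f"{evidence} {sep} {' '.join(spans)}"
    return f"{cls} {claim} {sep} {informed_evidence}"
```

The same informed-evidence string can be fed to BiDAF and NSMN in place of the original evidence, which is why the span extractor remains classifier-agnostic.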

Experiments
We evaluate the three classifiers described in section 3 in two conditions: uninformed (W/O) and informed (With), where the latter refers to the utilization of the information extracted by M span .

Data
We use the FEVER dataset to train all of our classifiers, and evaluate them both on FEVER and on SYMMETRIC FEVER. FEVER (Thorne et al., 2018) is the largest currently available Wikipedia-based dataset, consisting of 185,445 claims. Each claim is matched with supporting or refuting evidence from Wikipedia, or with a "not enough information" label.
We use the development set from FEVER's shared-task as our test set (containing 19,998 samples). We randomly split FEVER's training set into our training and validation sets. Following this process, we have 125,451 samples in our training set (73,369 support, 23,109 refute, and 28,973 insufficient information).
While evidence sentences for supporting and refuting examples are provided in the ground truth, those for the "insufficient information" examples were obtained by us. We use the document retrieval module of the best performing system in the first FEVER shared task (Nie et al., 2019). Given a claim and the Wikipedia dump provided with the FEVER dataset, this module returns a list of Wikipedia articles possibly related to the claim, ranked by a score calculated by comparing the claim, the title of the article and its first sentence. We keep the highest scoring document, and then pick the sentence with the highest TF-IDF similarity to the claim. Also, to disambiguate pronouns, we extend all evidence sentences by appending the title of their Wikipedia page. SYMMETRIC FEVER (Schuster et al., 2019) is a smaller, unbiased extension of FEVER, consisting of 712 claim-evidence pairs which were synthetically generated from FEVER to remove strong cues in the claims that could allow predicting the label without looking at the evidence (give-away phrases).
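The sentence-selection step for the "insufficient information" examples can be sketched with a from-scratch TF-IDF similarity. The exact weighting scheme and the format used to append the page title are not specified in the text, so the details below are illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Plain TF-IDF (tf * log(N/df)) computed from scratch; a library
    implementation would normally be used instead."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))
    n = len(docs)
    return [{t: tf[t] * math.log(n / df[t]) for t in tf}
            for tf in (Counter(doc) for doc in tokenized)]

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def select_evidence(claim, sentences, page_title):
    """Pick the article sentence most similar to the claim under TF-IDF,
    and append the page title to disambiguate pronouns."""
    vecs = tfidf_vectors([claim] + sentences)
    claim_v, sent_vs = vecs[0], vecs[1:]
    scores = [cosine(claim_v, v) for v in sent_vs]
    best = sentences[max(range(len(scores)), key=scores.__getitem__)]
    return f"{best} {page_title}"
```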

Hyperparameters
TokenMasker is trained with the same dataset and configuration as Shah et al. (2020); however, we replace their neutrality classifier with a RoBERTa classifier pretrained on MNLI. This model is trained once and used in inference mode for all subsequent experiments. BiDAF is trained for 12 epochs using cross-entropy loss and the Adam optimizer with initial learning rate 1e-3; we use a dropout probability of 0.2 and a batch size of 8. NSMN is trained for 12 epochs using cross-entropy loss and the Adam optimizer with initial learning rate 1e-4; we use a dropout probability of 0.5 and a batch size of 8. BERT is fine-tuned for 8 epochs using cross-entropy loss and the Adam optimizer with initial learning rate 2e-5; we use a dropout probability of 0.1 and a batch size of 16.
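The three training configurations can be summarized in one structure (values taken directly from the text; the variable names are ours):

```python
# Hyperparameters reported above: all models use cross-entropy loss and Adam.
TRAIN_CONFIGS = {
    "BiDAF": {"epochs": 12, "lr": 1e-3, "dropout": 0.2, "batch_size": 8},
    "NSMN":  {"epochs": 12, "lr": 1e-4, "dropout": 0.5, "batch_size": 8},
    "BERT":  {"epochs": 8,  "lr": 2e-5, "dropout": 0.1, "batch_size": 16},
}
```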
These hyperparameters were found to achieve the highest accuracy on our validation set. For our final classifiers, we fix these settings and retrain them using the full FEVER training set.

Results

Table 1 shows the results obtained in our experiments on both FEVER and the SYMMETRIC dataset. Scores are much higher on the first dataset, as the systems can rely on give-away phrases: words in the claims that have a high correlation with the correct output label regardless of the evidence. This situation does not occur in the SYMMETRIC dataset, where the give-away phrases have been eliminated. As expected, all systems perform worse on this dataset, but the drop in performance is more significant for the uninformed (W/O) models than for the informed (With) ones. In fact, the informed models consistently perform better than the uninformed ones, often with statistical significance. While the difference in performance between W/O and With is particularly relevant for BiDAF and NSMN, it narrows for BERT, which is already a strong classifier leveraging robust pretraining.
Output Explainability. We also manually evaluated the spans for 100 randomly extracted claim-output pairs, to assess whether they represented an understandable explanation for the verdict. The spans were deemed explanatory in 88% of the cases for refute claims and 67% for support claims, leading to an aggregate score of 75%. The extracted spans are therefore not only informative for the classifier, but can also be used to produce human-readable justifications for a positive or negative relation.

Conclusions
This paper has introduced a classifier-agnostic framework that allows fact verification models to improve their performance and robustness by utilizing concise spans of the available evidence sentences. The experiments have shown that the extracted spans are indeed informative for the final classifier, supporting the usefulness of the framework. Furthermore, this work opens the possibility of providing human users with a justification for the model's predictions.