Improving Evidence Retrieval for Automated Explainable Fact-Checking

Automated fact-checking at a large scale is a challenging task that has not been studied systematically until recently. Large, noisy document collections such as the web or news articles make the task more difficult. We describe a three-stage automated fact-checking system, named Quin+, that combines evidence retrieval and selection methods. We demonstrate that using dense passage representations leads to much higher evidence recall in a noisy setting. We also propose two sentence selection approaches: an embedding-based selection using a dense retrieval model, and a sequence labeling approach for context-aware selection. Quin+ is able to verify open-domain claims using results from web search engines.


Introduction
With the emergence of social media and many individual news sources online, the spread of misinformation has become a major problem with potentially harmful social consequences. Fake news can manipulate public opinion, create conflicts, and elicit unreasonable fear and suspicion. The vast amount of unverified online content has led to the establishment of external post-hoc fact-checking organizations, such as PolitiFact, FactCheck.org, and Snopes, with dedicated resources to verify online claims. However, manual fact-checking is time-consuming and intractable on a large scale. The ability to perform fact-checking automatically is critical to minimizing negative social impact.
Automated fact checking is a complex task involving evidence extraction followed by evidence reasoning and entailment. To retrieve relevant evidence from a corpus of documents, existing systems typically utilize traditional sparse retrieval, which may have poor recall, especially when the relevant passages have few overlapping words with the claims to be verified. Dense retrieval models have proven effective in question answering, as these models can better capture the latent semantic content of text. The work in (Samarinas et al., 2020) was the first to use dense retrieval for fact checking. The authors constructed a new dataset called Factual-NLI comprising claim-evidence pairs from the FEVER dataset (Thorne et al., 2018) as well as synthetic examples generated from benchmark question answering datasets (Kwiatkowski et al., 2019; Nguyen et al., 2016). They demonstrated that using Factual-NLI to train a dense retriever can improve evidence retrieval significantly.
While the FEVER dataset has enabled the systematic evaluation of automated fact-checking systems, it does not reflect well the noisy nature of real-world data. Motivated by this, we introduce the Factual-NLI+ dataset, an extension of the FEVER dataset with synthetic examples from question answering datasets and noise passages from web search results. We examine how dense representations can improve the first-stage retrieval recall of passages for fact-checking in a noisy setting, and make the retrieval of relevant evidence more tractable on a large scale.
However, the selection of relevant evidence sentences for accurate fact-checking and explainability remains a challenge. Figure 1 shows an example of a claim and a retrieved passage with three sentences, of which only the last provides the critical evidence to refute the claim. We propose two ways to select the relevant sentences: an embedding-based selection using a dense retrieval model, and a sequence labeling approach for context-aware selection. We show that the former generalizes better with high recall, while the latter has higher precision, making both suitable for identifying relevant evidence sentences. Our fact-checking system Quin+ is able to verify open-domain claims using a large corpus or web search results.

Related Work
Automated claim verification using a large corpus has not been studied systematically until the availability of the Fact Extraction and VERification (FEVER) dataset (Thorne et al., 2018). This dataset contains claims that are supported or refuted by specific evidence from Wikipedia articles. Prior to the work in (Samarinas et al., 2020), fact-checking solutions relied on sparse passage retrieval followed by a claim verification (entailment classification) model (Nie et al., 2019). Other approaches used the mentions of entities in a claim and/or basic entity linking to retrieve documents, and a machine learning model such as logistic regression or an enhanced sequential inference model to decide whether an article most likely contains the evidence (Yoneda et al.; Chen et al., 2017; Hanselowski et al., 2018).
However, retrieval based on sparse representations and exact keyword matching can be rather restrictive for many queries. This restriction can be mitigated by dense representations from BERT-based language models. The works in (Karpukhin et al., 2020; Xiong et al., 2020; Chang et al., 2020) have successfully used such models and their variants for passage retrieval in open-domain question answering. The results can be further improved with passage re-ranking using cross-attention BERT-based models (Nogueira et al., 2019). The work in (Samarinas et al., 2020) was the first to propose a dense model to retrieve passages for fact-checking.
Apart from passage retrieval, sentence selection is also a critical task in fact-checking: the selected evidence sentences provide an explanation of why a claim has been assessed as credible or not. Recent work has proposed a BERT-based model for extracting relevant evidence sentences from multi-sentence passages (Atanasova et al., 2020). The authors observe that joint training on veracity prediction and explanation generation performs better than training separate models. The work in (Stammbach and Ash, 2020) investigates how the few-shot learning capabilities of the GPT-3 model (Brown et al., 2020) can be used to generate fact-checking explanations.

The Quin+ System
The automated claim verification task can be defined as follows: given a textual claim c and a corpus D = {d_1, d_2, ..., d_n}, where every passage d_i consists of sentences s_1, ..., s_k, a system returns a set of evidence sentences Ŝ drawn from the passages of D and a label ŷ ∈ {probably true, probably false, inconclusive}.
We have developed an automated fact-checking system, called Quin+, that verifies a given claim in three stages: passage retrieval from a corpus, sentence selection, and entailment classification, as shown in Figure 2. The label is determined as follows. We first perform entailment classification on the set of retrieved evidence sentences. When the number of evidence sentences that entail or contradict the claim is low, we label the claim as "inconclusive". If the number of evidence sentences that support the claim exceeds the number that refute it, we assign the label "probably true". Otherwise, we assign the label "probably false".
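The label aggregation step described above can be sketched as follows. The minimum-evidence cutoff `min_evidence` is a hypothetical parameter for illustration; the paper does not specify the exact threshold it uses for "inconclusive".

```python
# Sketch of Quin+'s label aggregation, assuming each retrieved evidence
# sentence has already been classified by the entailment model as
# "entail", "contradict", or "neutral". `min_evidence` is illustrative.

def aggregate_label(sentence_labels, min_evidence=3):
    support = sum(1 for label in sentence_labels if label == "entail")
    refute = sum(1 for label in sentence_labels if label == "contradict")
    if support + refute < min_evidence:
        return "inconclusive"          # too little non-neutral evidence
    if support > refute:
        return "probably true"
    return "probably false"
```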

Passage Retrieval
The passage retrieval model in Quin+ is based on a dense retrieval model called QR-BERT (Samarinas et al., 2020). This model is based on BERT and represents a passage (or claim) by the average of its token embeddings, denoted φ(·). The relevance of a passage d to a claim c is then given by the dot product

r(c, d) = φ(c) · φ(d).

Dot product search can run efficiently using an approximate nearest neighbors index implemented with the FAISS library (Johnson et al., 2019). QR-BERT maximizes the sampled softmax loss

L(θ) = Σ_{(c,d) ∈ D_b^+} log [ exp(r(c, d)) / Σ_{d' ∈ D_b} exp(r(c, d')) ],

where D_b is the set of passages in a training batch b, D_b^+ is the set of positive claim-passage pairs in the batch, and θ represents the parameters of the BERT model.
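The scoring scheme can be sketched in a few lines. The toy vectors below stand in for BERT token embeddings, and the brute-force dot-product search stands in for the FAISS approximate nearest-neighbor index used in the real system:

```python
import numpy as np

# Minimal sketch of dense passage scoring as described for QR-BERT:
# a passage or claim vector is the mean of its token embeddings, and
# relevance is the dot product between the two vectors.

def mean_pool(token_embeddings):
    # token_embeddings: (num_tokens, dim) array of per-token vectors
    return np.mean(token_embeddings, axis=0)

def relevance(claim_vec, passage_vec):
    return float(np.dot(claim_vec, passage_vec))

def rank_passages(claim_vec, passage_vecs, k=2):
    scores = passage_vecs @ claim_vec     # brute-force dot-product search
    top = np.argsort(-scores)[:k]         # a FAISS index approximates this
    return top.tolist()
```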
The work in (Samarinas et al., 2020) introduced the Factual-NLI dataset, which extends the FEVER dataset (Thorne et al., 2018) with more diverse synthetic examples derived from question answering datasets: 359,190 new entailed claims with evidence, plus additional contradicted claims generated with a rule-based approach. To ensure robustness, we compile a new large-scale noisy version of Factual-NLI called Factual-NLI+. This dataset includes all 5 million Wikipedia passages in the FEVER dataset. We add 'noise' passages as follows. For every claim c in the FEVER dataset, we retrieve the top 30 web results from the Bing search engine and keep the passages with the highest BM25 scores that are classified as neutral by the entailment model. For claims generated from MSMARCO queries (Nguyen et al., 2016), we include the irrelevant passages found in the MSMARCO dataset for those queries. This results in 418,650 additional passages. The new dataset better reflects the nature of a large-scale corpus that would be used by a real-world fact-checking system. We trained a dense retrieval model using this extended dataset.
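The noise-passage filtering step can be sketched as follows. `bm25_score` and `nli_label` are hypothetical stand-ins for the BM25 scorer and the trained entailment classifier, which are passed in as callables:

```python
# Illustrative sketch of the noise-passage construction: for each claim,
# keep the highest-scoring BM25 web passages that the entailment model
# labels "neutral" (i.e., passages that neither support nor refute it).

def select_noise_passages(claim, web_passages, bm25_score, nli_label, top_n=1):
    # Keep only passages the entailment model considers neutral w.r.t. the claim.
    neutral = [p for p in web_passages if nli_label(claim, p) == "neutral"]
    # Prefer the most lexically relevant ones (highest BM25 score first).
    neutral.sort(key=lambda p: bm25_score(claim, p), reverse=True)
    return neutral[:top_n]
```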
The Quin+ system utilizes a hybrid model that combines the results from the dense retrieval model described above with BM25 sparse retrieval to obtain the final list of retrieved passages. For efficient sparse retrieval, we used the Rust-based Tantivy full-text search engine.
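The paper does not spell out how the dense and sparse result lists are fused. One common hybrid scheme is reciprocal rank fusion (RRF), sketched below as an illustrative possibility rather than Quin+'s actual method:

```python
# Reciprocal rank fusion (RRF) of two ranked lists of passage ids.
# This is a generic hybrid-retrieval technique, shown here only as an
# example of how dense and BM25 rankings could be combined.

def rrf_merge(dense_ranked, sparse_ranked, k=60, top_n=10):
    scores = {}
    for ranked in (dense_ranked, sparse_ranked):
        for rank, doc_id in enumerate(ranked):
            # Each list contributes 1 / (k + rank + 1) to the fused score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```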

Sentence Selection
The embedding-based selection method relies on the dense representations learned by the dense passage retrieval model QR-BERT. For a given claim c, we select the sentences s_i from a given passage d = {s_1, s_2, ..., s_k} whose relevance score r(c, s_i) is greater than a threshold λ, which is set experimentally.
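In code, this thresholded selection is straightforward; `relevance` below is a hypothetical stand-in for the dense model's scoring function r(c, s):

```python
# Embedding-based sentence selection: score every sentence of a passage
# against the claim with the dense retriever's relevance function and
# keep those whose score exceeds the threshold lambda.

def select_sentences(claim, sentences, relevance, lam=0.5):
    return [s for s in sentences if relevance(claim, s) > lam]
```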
The context-aware sentence selection method uses a BERT-based sequence labeling model. The input of the model is the concatenation of the tokenized claim C = {C_1, C_2, ..., C_k}, the special [SEP] token, and the tokenized evidence passage E = {E_1, E_2, ..., E_m} (see Figure 3). For the output of the model, we adopt the BIO tagging format, so that all irrelevant tokens are classified as O, the first token of an evidence sentence as B-evidence, and the remaining tokens of an evidence sentence as I-evidence. We trained a model based on RoBERTa-large (Liu et al., 2019), minimizing the cross-entropy loss

L(θ) = −(1/N) Σ_{i=1}^{N} (1/l_i) Σ_{j=1}^{l_i} log p_θ(y_ij),

where N is the number of examples in the training batch, l_i is the number of non-padding tokens of the i-th example, and p_θ(y_ij) is the estimated softmax probability of the correct label for the j-th token of the i-th example. We trained this model on Factual-NLI with batch size 64, the Adam optimizer, and an initial learning rate of 5 × 10^-5 until convergence.
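Once the model has tagged each token, the evidence spans can be recovered by a simple BIO decoding pass. The sketch below operates on whole-word tokens for readability; the real model works on RoBERTa subword tokens:

```python
# Decode BIO tags into evidence spans: a span starts at a "B" token and
# extends through the following "I" tokens; "O" tokens close any open span.

def decode_bio(tokens, tags):
    spans, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B":                      # start of a new evidence span
            if current:
                spans.append(" ".join(current))
            current = [tok]
        elif tag == "I" and current:        # continuation of the open span
            current.append(tok)
        else:                               # "O" (or stray "I"): close span
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans
```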

Entailment Classification
Natural Language Inference (NLI), also known as textual entailment classification, is the task of detecting whether a hypothesis statement is entailed by a premise passage. It is essentially a text classification problem, where the input is a premise-hypothesis pair and the output is one of three labels: entailment, contradiction, or neutral. Even though pre-trained NLI models seem to perform well on the two popular NLI datasets (SNLI and Multi-NLI), they are not as effective in a real-world setting. This is possibly due to the bias in these two datasets, which has a negative effect on the generalization ability of the trained models (Poliak et al., 2018). Further, these datasets consist of short, single-sentence premises. As a result, models trained on them usually do not perform well on noisy real-world data involving multiple sentences. These issues have led to the development of additional, more challenging datasets such as Adversarial NLI (Nie et al., 2020).
Our Quin+ system utilizes an NLI model based on RoBERTa-large with a linear transformation of the [CLS] token embedding:

ŷ = softmax(W h_[CLS](P; H) + a),

where P; H is the concatenation of the premise with the hypothesis, h_[CLS] is the [CLS] token embedding of the input, W (3×1024) is a linear transformation matrix, and a (3×1) is the bias. We trained the entailment model by minimizing the cross-entropy loss on the concatenation of three popular NLI datasets (SNLI, Multi-NLI, and Adversarial-NLI) with batch size 64, the Adam optimizer, and an initial learning rate of 5 × 10^-5 until convergence.
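The classification head amounts to a single linear map of the [CLS] embedding followed by a softmax over three classes. The sketch below uses a fixed toy weight matrix in place of the trained RoBERTa-large parameters:

```python
import numpy as np

# Minimal sketch of the entailment head: project the 1024-dim [CLS]
# embedding to three logits (entailment / contradiction / neutral) with
# W (3 x 1024) and bias a (3,), then take the softmax argmax.

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

def classify(cls_embedding, W, a):
    logits = W @ cls_embedding + a
    return int(np.argmax(softmax(logits)))
```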

Performance of Quin+
We evaluate the three individual components of Quin+ (retrieval, sentence selection, and entailment classification) and finally perform an end-to-end evaluation using various configurations. Table 1 gives the recall@k and Mean Reciprocal Rank (MRR@100) of the passage retrieval models on FEVER and Factual-NLI+. We also compare the performance on a noisy extension of the FEVER dataset where additional passages from the Bing search engine are included as 'noise' passages. We see that when noise passages are added to the FEVER dataset, the gap between the hybrid passage retrieval model in Quin+ and sparse retrieval widens. This demonstrates the limitations of sparse retrieval, and why a dense retrieval model is crucial for surfacing relevant passages from a noisy corpus. Overall, the hybrid passage retrieval model in Quin+ gives the best performance compared to BM25 and the dense retrieval model.

Table 2 shows the token-level precision, recall, and F1 score of the proposed sentence selection methods on the Factual-NLI dataset and a domain-specific (medical) claim verification dataset, SciFact (Wadden et al., 2020). We also compare the performance to a baseline sentence-level NLI approach, where we perform entailment classification (using the model described in Section 3.3) on each sentence of a passage and select the non-neutral sentences as evidence. We observe that the sequence labeling model gives the highest precision, recall, and F1 score when tested on the Factual-NLI dataset. Further, its precision is significantly higher than that of the other methods.
On the other hand, for the SciFact dataset, we see that the sequence labeling method remains the top performer in terms of precision and F1 score after fine-tuning, although its recall is lower than that of the embedding-based method. This shows that the sequence labeling model is able to mitigate the high false positive rate observed with the embedding-based selection method by taking the surrounding context into account.
The Factual-NLI+ dataset contains claims with passages that either support or refute them, with some sentences highlighted as ground-truth evidence. Table 3 shows the performance of the entailment model in classifying the input evidence as supporting or refuting the claims. The input evidence can be in the form of the whole passage, the ground-truth evidence sentences, or the sentences selected by our sequence labeling model. We observe that the entailment classification model performs poorly when whole passages are passed as input evidence. However, when the specific sentences are passed as input, the precision, recall, and F1 measures improve. The reason is that our entailment classification model is trained mostly on short premises; as a result, it performs better when given concise evidence sentences rather than long passages.

Finally, we carry out an end-to-end evaluation of our fact-checking system on Factual-NLI+ using various configurations of top-k passage retrieval (BM25, dense, hybrid, for various values of k ∈ [5, 100]) and evidence selection approaches (embedding-based and sequence labeling). Table 4 shows the macro-average F1 score for the three classes (supporting, refuting, neutral) for some of the tested configurations. We see that dense or hybrid retrieval with evidence selection using the proposed sequence labeling model gives the best results. Even though hybrid retrieval seems to lead to slightly worse performance, it requires far fewer passages (6 instead of 50) and makes the system more efficient.

System Demonstration
We have created a demo for verifying open-domain claims using the top 20 results from a web search engine. For a given claim, Quin+ returns relevant text passages with highlighted sentences. The passages are grouped into two sets, supporting and refuting. The system computes a veracity rating based on the amount of supporting and refuting evidence: it returns "probably true" if there is more supporting than refuting evidence, and "probably false" otherwise. When the number of retrieved evidence sentences is low, it returns "inconclusive". Figure 4 shows a screenshot of the system with a claim that has been assessed as probably false based on the overwhelming number of refuting evidence sentences (21 refute versus 0 support). Quin+ can also be used on a large-scale corpus.

Conclusion & Future Work
In this work, we have presented a three-stage fact-checking system. We have demonstrated how a dense retrieval model can lead to higher recall when retrieving passages for fact-checking. We have also proposed two schemes to select relevant sentences, an embedding-based approach and a sequence labeling model, to improve claim verification accuracy. Quin+ gave promising results on our extended Factual-NLI+ corpus, and is also able to verify open-domain claims using web search results. The source code of our system is publicly available. Even though our system is able to verify many open-domain claims successfully, it has some limitations. Quin+ is not able to effectively verify multi-hop claims that require the retrieval of multiple pieces of evidence. For the verification of such claims, methodologies inspired by multi-hop question answering could be utilized.
For the future development of large-scale fact-checking systems, we believe that a new benchmark needs to be introduced. The currently available datasets, including Factual-NLI+, are not suitable for evaluating the verification of claims using multiple sources.