Towards Few-shot Fact-Checking via Perplexity

Few-shot learning has drawn researchers’ attention as a way to overcome the problem of data scarcity. Recently, large pre-trained language models have shown great performance in few-shot learning for various downstream tasks, such as question answering and machine translation. Nevertheless, little exploration has been made to achieve few-shot learning for the fact-checking task, even though fact-checking is an important problem, especially when the amount of information online is growing exponentially every day. In this paper, we propose a new way of utilizing the powerful transfer learning ability of a language model via a perplexity score. The most notable strength of our methodology lies in its capability in few-shot learning. With only two training samples, our methodology can already outperform the Major Class baseline by more than an absolute 10% on the F1-Macro metric across multiple datasets. Through experiments, we empirically verify the plausibility of the rather surprising usage of the perplexity score in the context of fact-checking and highlight the strength of our few-shot methodology by comparing it to strong fine-tuning-based baseline models. Moreover, we construct and publicly release two new fact-checking datasets related to COVID-19.


Introduction
Few-shot learning is being actively explored to overcome the heavy dependence on large-scale labeled data that serves as a crucial bottleneck to machine learning models. Recently, researchers have explored few-shot learning that leverages the powerful transfer learning ability of pre-trained large language models (LMs) in various NLP tasks. Petroni et al. demonstrated that an LM serves as a good zero-shot learner on the question-answering task due to its encoded commonsense knowledge. Going further, Brown et al. illustrated the impressive potential of LMs as strong zero-shot and few-shot learners across translation, commonsense reasoning and natural language inference (NLI). However, little or no exploration has been made on few-shot learning in the fact-checking domain, which is a timely and important task in which data scarcity is particularly problematic.
Figure 1: Illustration of our simple yet effective perplexity-based approach. Few-shot data samples are used to find the optimal perplexity threshold th that separates Unsupported claims from Supported claims.
Previous works have proposed different ways of leveraging LMs to conduct zero- or few-shot learning. One common approach is to query the LM for the missing token (i.e., "answer") for the zero-shot question-answering task (Petroni et al., 2019; Brown et al., 2020) by transforming questions into a form of statement. Another approach is to adopt an in-context learning approach where the input context of the LM is carefully crafted to control the output. For example, a natural language task instruction (e.g., "Translate English to French:") or training sample (e.g., "sea otter => loutre de mer") is provided as the context for zero-shot/few-shot translation (Brown et al., 2020).
In this work, we explore a new way of leveraging LMs for few-shot learning in the fact-checking task. This is done by leveraging a perplexity score from evidence-conditioned LMs. Fact-checking is the task of verifying a claim based on its corresponding evidence, and one of its most important objectives is to correctly model the relationship between the given claim and evidence. We hypothesize that a perplexity score from evidence-conditioned LMs is helpful for this purpose since perplexity measures the likelihood of a given sentence with reference to previously encountered text (i.e., given the evidence prefix and the LM's training corpus). Therefore, this paper attempts to investigate this hypothesis and proposes a novel perplexity-based few-shot learning methodology for fact-checking. Through experimental analysis, we empirically demonstrate the effectiveness of our proposed methodology in few-shot learning, and we compare it to strong fine-tuning-based baselines. Moreover, we compare different LMs (BERT and GPT2) in different sizes, from small to XL, to unveil interesting insights on which model is more suitable for this task. Finally, we discuss the potential application of evidence-conditioned perplexity for ranking candidate claims in priority order of the most urgent to be fact-checked to the least.
Table 1: Relations between veracity of claim and perplexity. Unsupported claims have higher perplexity compared to Supported claims. Note that the perplexity score listed here is using GPT2-base on each of the claims.
Our contribution is three-fold: First, we propose an effective way of leveraging the perplexity score in the context of fact-checking. We would like to emphasize that our approach is a simple yet effective way of leveraging large pre-trained LMs. Second, we demonstrate the effectiveness of the perplexity-based approach in the few-shot setting by outperforming strong fine-tuned baselines, such as BERT (Devlin et al., 2019), RoBERTa, and XLNet, by an absolute 10 ∼ 20% in F1-Macro score in the 2-, 10-, and 50-shot settings. Third, we construct two new fact-checking datasets related to COVID-19, which has caused the problem of an "infodemic".

Related Work
Fact-checking is a complex task that is split into many sub-tasks. First, credible sources of evidence need to be identified. Second, a set of relevant evidence needs to be retrieved from the identified credible sources. Last, veracity classification of claims can be made based on the retrieved evidence.
Some works have focused on full-pipeline systems that handle all sub-tasks and provide real working web prototypes (Karadzhov et al., 2017; Popat et al., 2017, 2018a; Hasanain et al., 2019; Tokala et al., 2019). These works use the entire Web as a knowledge source to confirm or reject a claim, taking the credibility or reliability of the Web source into account. Another common setting for fact-checking is to assume a credible evidence source is given (e.g., Wikipedia), and to focus on the evidence retrieval and veracity verification steps only. FEVER (Thorne et al., 2018) and TabFact are two large datasets for this setting, and there are many follow-up studies working on them (Yoneda et al., 2018a; Nie et al., 2019; Zhong et al., 2020; Herzig et al., 2020; Hidey et al., 2020).
Our work follows the latter group of works and uses the following setting: given a tuple consisting of claims and relevant evidence, we classify the final fact-checking veracity label of the given claim (Popat et al., 2018b; Ma et al., 2019; Wu et al., 2020). By doing this, we focus on the methodology for the veracity classification task without worrying about the propagated errors from earlier modules, such as source credibility profiling and evidence retrieval.
Leveraging LMs as a knowledge base, a zero-shot learner or a few-shot learner has been gaining popularity within the NLP field. It was discovered that large pre-trained LMs can store factual knowledge in their parameters (Petroni et al., 2019; Roberts et al., 2020; Madotto et al., 2020), and that this stored knowledge helps LMs perform well at zero-shot and few-shot learning in various NLP tasks, such as question answering, summarization, textual entailment, translation and commonsense reasoning (Brown et al., 2020). For the task of fact-checking, Lewis et al. and Lee et al. attempted to leverage such LMs. However, they mainly use the model to replace the evidence retriever of the fact-checking pipeline, and they still require training of the final veracity classifier. Our work, in contrast, focuses on the few-shot ability of LMs for veracity classification.

Preliminary Exploration of Hypothesis
In this section, we conduct a preliminary investigation to validate the potential of our hypothesis that the perplexity score from an evidence-conditioned LM can provide a signal for claims unsupported by evidence.
For our exploration, we first collect a small set of Supported and Unsupported claims that can be verified based on the training corpus of the target LM (namely, Wikipedia, which is used in the training of many pre-trained LMs). Then, we compare the perplexity scores of the two groups.
To recap, perplexity is a commonly used metric for measuring the performance of LMs. It is defined as the inverse of the probability of the test set normalized by the number of words:
$$PPL(X) = \sqrt[n]{\prod_{i=1}^{n} \frac{1}{p(x_i \mid x_0, \ldots, x_{i-1})}} \qquad (1)$$
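As a small worked illustration of this definition, the sketch below computes perplexity from a list of per-token conditional probabilities. The probability values are made up for illustration and are not outputs of any LM; the function name `perplexity` is our own.

```python
import math


def perplexity(token_probs):
    """Perplexity of a sequence from each token's conditional probability."""
    n = len(token_probs)
    # Sum log-probabilities rather than multiplying raw probabilities,
    # which avoids numerical underflow on long sequences.
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / n)


# Toy conditional probabilities (illustrative only, not model outputs).
likely_sentence = [0.5, 0.25, 0.5, 0.25]
unlikely_sentence = [0.1, 0.05, 0.1, 0.05]

print(perplexity(likely_sentence))    # lower perplexity
print(perplexity(unlikely_sentence))  # higher perplexity
```

A sequence whose tokens the model finds less probable receives a higher perplexity, which is the signal the rest of the paper builds on.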
Another way of interpreting perplexity is as a measure of the likelihood of a given test sentence with reference to the training corpus. From Table 1, we can observe that Unsupported claims on average have higher perplexity than Supported claims. For example, the Supported claim "Washing hands prevents the spread of diseases" has a perplexity value of 96.74, whereas the Unsupported claim "All dogs speak English fluently" has a much higher perplexity value of 328.23. We believe these observations support our hypothesis. Thus, we proceed to build our approach based on this hypothesis (Section 4), and conduct experiments (Section 5) and analysis (Section 6) to verify the validity of our perplexity-based fact-checking approach.

Task definition
In this work, we define our task to be: Given a {claim, evidence} pair, determine the veracity of a claim against the evidence, i.e., Supported vs. Unsupported claims. The label Supported is assigned when there exists relevant evidence that supports the claim, while Unsupported is assigned when there does not exist any supporting evidence. Note that the existence of refuting evidence also places a claim into this latter category.

Evidence Conditioned Perplexity
Although previous works have shown that an LM can encode knowledge from its training corpus, there are a few limitations to solely relying on the pre-trained weights. First, we cannot easily check and guarantee whether the LM has already seen the evidence that is required for verification, and the LM would definitely not have seen the evidence related to newly emerging events after the LM pre-training. For instance, the event of COVID-19 emerged after the release of the GPT2 pre-trained model. Second, although LMs have shown surprising ability in memorizing some knowledge, they are not perfect, as pointed out by previous works (Poerner et al., 2019; Lee et al., 2020). Therefore, we propose to incorporate evidence into the perplexity calculation by using it as a prefix of the claim.
There are two popular kinds of LMs: i) unidirectional LMs that are trained with the conventional next token prediction task, and ii) masked LMs that are trained with the masked token prediction task, resulting in a bidirectional LM. We briefly describe how to obtain the evidence-conditioned perplexity for both types of LM:

Unidirectional Language Model Perplexity
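For a unidirectional LM, first we concatenate the evidence and claim to obtain the input to the LM:
$$X = \{x_{e_0}, \ldots, x_{e_E}, x_{c_0}, \ldots, x_{c_C}\},$$
where E and C denote the number of evidence tokens and claim tokens, respectively. Then, we obtain the evidence-conditioned perplexity by
$$PPL(X) = \sqrt[C]{\prod_{i=1}^{C} \frac{1}{p(x_{c_i} \mid x_{e_0}, \ldots, x_{e_E}, x_{c_0}, \ldots, x_{c_{i-1}})}} \qquad (2)$$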
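The following is a minimal sketch of this computation, assuming the per-token log-probabilities of the concatenated evidence + claim sequence have already been obtained from some unidirectional LM. The function name and the input format are our own assumptions, and the numbers are toy values rather than GPT2 outputs.

```python
import math


def evidence_conditioned_ppl(token_log_probs, num_evidence_tokens):
    """Perplexity over the claim tokens only; the evidence acts as a prefix.

    token_log_probs[i] is assumed to be log p(x_i | x_0, ..., x_{i-1}) for the
    concatenated evidence + claim sequence (hypothetical input format).
    """
    # Evidence tokens condition the claim but are excluded from the score.
    claim_log_probs = token_log_probs[num_evidence_tokens:]
    if not claim_log_probs:
        raise ValueError("the claim must contain at least one token")
    return math.exp(-sum(claim_log_probs) / len(claim_log_probs))


# Toy example: three evidence tokens followed by two claim tokens.
log_probs = [math.log(0.01)] * 3 + [math.log(0.5), math.log(0.125)]
print(evidence_conditioned_ppl(log_probs, num_evidence_tokens=3))
```

In the toy example the very unlikely evidence tokens leave the score untouched, because only the claim positions enter the average.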
Note that the evidence tokens are used to condition the perplexity, yet their conditional probabilities $p(x_{e_i} \mid x_{e_0}, \ldots, x_{e_{i-1}})$ do not contribute to $PPL(X)$, which is the main difference from Eq. (1).

Masked Language Model Perplexity
A masked LM (MLM), such as BERT (Devlin et al.), is trained with the masked token prediction task instead of the next token prediction task. The "perplexity" score from the MLM does not mean the same as the conventional perplexity score. Therefore, we use the "pseudo perplexity" score proposed by Salazar et al., which is computed by summing all the log probabilities obtained by sequentially masking each token in the input sentence.
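The masking loop behind the pseudo-perplexity can be sketched as follows. This is an illustration under stated assumptions: `log_prob_fn` is a hypothetical callable standing in for a real masked LM, `toy_log_prob` is a made-up scorer used only so the example runs, and the `start` argument (scoring only the positions after an evidence prefix) is our own assumption mirroring the unidirectional case rather than a detail given in the text.

```python
import math
from typing import Callable, Sequence

MASK = "[MASK]"


def pseudo_perplexity(
    tokens: Sequence[str],
    log_prob_fn: Callable[[Sequence[str], int, str], float],
    start: int = 0,
) -> float:
    """Pseudo-perplexity over tokens[start:], masking one position at a time."""
    positions = range(start, len(tokens))
    if len(positions) == 0:
        raise ValueError("nothing to score")
    total = 0.0
    for pos in positions:
        masked = list(tokens)
        masked[pos] = MASK  # hide exactly one token
        # Log-probability assigned to the original token at the masked position.
        total += log_prob_fn(masked, pos, tokens[pos])
    # Exponentiated negative mean of the summed log-probabilities.
    return math.exp(-total / len(positions))


def toy_log_prob(masked_tokens, pos, original):
    """Stand-in scorer (NOT a real MLM): favors tokens repeated in the context."""
    return math.log(0.5 if original in masked_tokens else 0.1)


print(pseudo_perplexity(["a", "b", "a"], toy_log_prob))
```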

Leveraging Perplexity
Once we obtain the evidence-conditioned perplexity scores for each claim, we find the best threshold th that separates Supported claims from Unsupported claims. We would like to emphasize that our approach does not involve any parameter update of the LM. We only do inference with the LM, and leverage the few-shot samples as the "validation set" to find the optimal single threshold parameter, th. Throughout our paper, we refer to our methodology as the "perplexity-based classifier". Given a claim and its evidence, if the evidence-conditioned perplexity score is less than the threshold (i.e., < th), the claim is Supported by the evidence; otherwise it is Unsupported.
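A minimal sketch of the threshold search and the resulting classifier is given below. The integer candidate range 0 to 1000 follows the range reported in the training details; selecting by accuracy on the few-shot samples and breaking ties in favor of the smallest threshold are our assumptions, since the selection criterion is not spelled out. The two scores reuse the example perplexity values quoted earlier for a Supported and an Unsupported claim.

```python
def find_threshold(scores, labels, candidates=range(0, 1001)):
    """Pick the threshold th with the best accuracy on the few-shot samples.

    labels[i] is True for Supported and False for Unsupported.
    Ties are broken in favor of the smallest candidate (an assumption).
    """
    best_th, best_correct = None, -1
    for th in candidates:
        # A claim is predicted Supported when its perplexity is below th.
        correct = sum((score < th) == label for score, label in zip(scores, labels))
        if correct > best_correct:
            best_th, best_correct = th, correct
    return best_th


def classify(score, th):
    """Map an evidence-conditioned perplexity score to a veracity label."""
    return "Supported" if score < th else "Unsupported"


# 2-shot example: one Supported and one Unsupported claim.
few_shot_scores = [96.74, 328.23]
few_shot_labels = [True, False]
th = find_threshold(few_shot_scores, few_shot_labels)
print(th, classify(120.0, th), classify(80.0, th))
```

Because the only fitted quantity is the scalar th, the search is a plain scan over candidates and involves no gradient update.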

Few-shot Experiment

Dataset
All datasets used in the experiment are in English, and we report the data statistics in Table 2.
Covid19-Scientific A new test set is constructed by collecting COVID-19-related myths and scientific truths labelled by reliable sources like MedicalNewsToday, the Centers for Disease Control and Prevention (CDC), and the World Health Organization (WHO). It consists of the most common scientific or medical myths about COVID-19, which must be debunked correctly to ensure the safety of the public (e.g., "drinking a bleach solution will prevent you from getting COVID-19"). The set contains 172 claims with labels (Supported, Unsupported) obtained from the aforementioned reliable sources. Note that myths that are unverifiable from current findings are also assigned the Unsupported label. The gold evidence is obtained from the winning system of the Kaggle Covid-19 challenge (Su et al., 2020). This system retrieves the evidence from 59,000 scholarly articles about COVID-19, SARS-CoV-2, and other related coronaviruses.
Covid19-Social Another test set is constructed by crawling 340 COVID-19-related claims fact-checked by journalists from a website called Politifact.com. Unlike the Covid19-Scientific dataset, it contains non-scientific and socially-related claims, such as "For the coronavirus, the death rate in Texas, per capita of 29 million people, we're one of the lowest in the country." Such claims may not be life-and-death matters, but they still have the potential to bring negative sociopolitical effects. Originally, these claims are labelled into six classes {pants-fire, false, barely-true, half-true, mostly-true, true}. However, we use it in a binary setup for consistency with the Covid19-Scientific setup by assigning the first three classes to Unsupported and the rest to Supported.
For the evidence of each claim, we follow Alhindi et al. to obtain the human-written evidence/justification available on the Politifact.com website, from which the claims are crawled.
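The six-to-two label mapping described for Covid19-Social can be written down directly. The sketch below assumes the raw labels appear exactly as the six strings listed above; the function name is our own.

```python
# First three Politifact classes map to Unsupported, the rest to Supported.
UNSUPPORTED_CLASSES = {"pants-fire", "false", "barely-true"}
SUPPORTED_CLASSES = {"half-true", "mostly-true", "true"}


def politifact_to_binary(label):
    """Collapse a six-way Politifact rating into the binary task labels."""
    if label in UNSUPPORTED_CLASSES:
        return "Unsupported"
    if label in SUPPORTED_CLASSES:
        return "Supported"
    raise ValueError(f"unknown Politifact label: {label!r}")


print([politifact_to_binary(l) for l in ["pants-fire", "half-true", "true"]])
```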
FEVER (Thorne et al., 2018) Fact Extraction and Verification (FEVER) is a publicly released large-scale dataset generated by altering sentences extracted from Wikipedia to promote research on fact-checking systems. Since our few-shot experiment requires little data, we only leverage the "Paper Test Dataset" from the FEVER workshop (https://fever.ai/) resource page to speed up our experiments.
This dataset originally has three classes, {Support, Refute, Not Enough Info}. "Support" is similar to our Supported label, where a claim can be supported by given evidence. "Refute" is where a claim is "refuted" by given evidence, whereas "Not Enough Info" means not enough evidence is available for verification. For our FEVER experiment, we treat "Refute" and "Not Enough Info" as one class. This is because we believe that in a real scenario both cases are Unsupported claims that need attention.
To provide further detail, the "Support" class is mapped into Supported, and "Refute"/"Not Enough Info" is mapped into Unsupported to match our task setting. Note that to balance the dataset, we obtain half the data from "Refute" and the other half from "Not Enough Info". Note that the gold evidence is included in the dataset released by Thorne et al.
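One way to realize this mapping and balancing is sketched below. It is an illustration under assumptions: the record format (dicts with a "label" key), the function name, the use of uniform random sampling, and the seed are all hypothetical, and the exact sampling procedure used for the experiments is not specified in the text.

```python
import random

FEVER_TO_BINARY = {
    "Support": "Supported",
    "Refute": "Unsupported",
    "Not Enough Info": "Unsupported",
}


def build_binary_fever(examples, n_unsupported, seed=0):
    """Keep all Support examples and draw the Unsupported class half from
    Refute and half from Not Enough Info (sampling scheme is an assumption)."""
    rng = random.Random(seed)
    by_label = {"Support": [], "Refute": [], "Not Enough Info": []}
    for example in examples:
        by_label[example["label"]].append(example)
    half = n_unsupported // 2
    picked = (
        by_label["Support"]
        + rng.sample(by_label["Refute"], half)
        + rng.sample(by_label["Not Enough Info"], n_unsupported - half)
    )
    # Rewrite the three-way label into the binary task label.
    return [{**ex, "label": FEVER_TO_BINARY[ex["label"]]} for ex in picked]


# Toy records standing in for FEVER entries.
toy = (
    [{"claim": f"s{i}", "label": "Support"} for i in range(4)]
    + [{"claim": f"r{i}", "label": "Refute"} for i in range(5)]
    + [{"claim": f"n{i}", "label": "Not Enough Info"} for i in range(5)]
)
binary = build_binary_fever(toy, n_unsupported=4)
print(sum(ex["label"] == "Unsupported" for ex in binary), len(binary))
```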

Models
Ours We consider one unidirectional LM (GPT2-base, denoted PPL GPT2-B) and one masked LM (BERT-base, denoted PPL BERT-B) for our proposed perplexity-based methodology.
Baselines We fine-tune various pre-trained Transformer-based (Vaswani et al., 2017) models to build our baseline classifiers, which is a common approach used to achieve many state-of-the-art results in the literature.
• Major Class - A simple majority classifier which always assigns the majority class of the training set to all samples. We provide this for reference because some of our dataset classes are imbalanced.
• BERT-B ft - A fine-tuned BERT-base model with a feed-forward classifier trained on top.
• BERT-L ft - A fine-tuned BERT-large model with a feed-forward classifier trained on top.
• RoBERTa ft - A fine-tuned RoBERTa-base model with a feed-forward classifier trained on top.
• XLNet ft - A fine-tuned XLNet-base model with a feed-forward classifier trained on top.

Experimental Setup
Few-Shot Data Setup Given N_D as the size of the dataset D, we do an n-shot experiment with n samples from D as a "validation set" for our perplexity-based approach or as a "training set" for the fine-tuning approach, and the remainder (N_D − n) as a test set. To give a concrete example, in the 2-shot experiment using the Covid19-Social dataset (340 samples), we have two training samples and 338 test samples. We use three seeds to split the datasets and train the models. For a fair comparison, all the seeds and splits are kept the same across the models.
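The split described above can be sketched as a seeded shuffle. The function name, the specific seed values, and the choice of uniform (non-stratified) sampling are assumptions made for illustration.

```python
import random


def n_shot_split(dataset, n, seed):
    """Return (few_shot, test): n seeded-random samples and the remainder."""
    rng = random.Random(seed)
    indices = list(range(len(dataset)))
    rng.shuffle(indices)
    few_shot = [dataset[i] for i in indices[:n]]
    test = [dataset[i] for i in indices[n:]]
    return few_shot, test


# A stand-in dataset with the size of Covid19-Social (340 samples).
dataset = list(range(340))
for seed in (0, 1, 2):  # three seeds; the values themselves are hypothetical
    few_shot, test = n_shot_split(dataset, n=2, seed=seed)
    print(seed, len(few_shot), len(test))
```

Reusing the same seed for every model reproduces the same split, which is what keeps the comparison across models fair.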
Evaluation We mainly evaluate our experiments using accuracy and the Macro-F1 metric. Since some of our datasets are imbalanced (see the ratio of Supported to Unsupported in Table 2), we prioritize the overall Macro-F1 score over accuracy.
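To illustrate why Macro-F1 is the more informative metric under class imbalance, the sketch below scores a majority-class predictor on a toy set with an 80/20 label ratio. The ratio is invented for illustration and is not taken from Table 2; the helper names are our own, and the per-class F1 is set to 0 when a class has no true or predicted instances.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, classes=("Supported", "Unsupported")):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy imbalanced test set and a predictor that always outputs the majority class.
y_true = ["Unsupported"] * 80 + ["Supported"] * 20
y_pred = ["Unsupported"] * 100

print(accuracy(y_true, y_pred))  # looks strong
print(macro_f1(y_true, y_pred))  # exposes that one class is never predicted
```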
Training Details In our methodology, no gradient update is required. Thus, there are no training details such as learning rate, batch size or max-epoch to report. We simply use a small validation set (of size 2, 10, or 50) to find the best-performing hyper-parameter value for the threshold th from the range of {0 ∼ 1000}. None of the samples from the test set were seen in threshold searching. For baseline fine-tuned classifiers, we do a grid search to find the best-performing parameters, as follows: We use a learning rate of 5e−6 for training the BERT-B ft, RoBERTa ft, and XLNet ft models, while BERT-L ft is trained with a rate of 2e−5. All models share the same batch size of 32 and maximum input sequence length of 128. We also use early stopping with patience 3 and a maximum of 10 training epochs. Each experiment is run on an Nvidia GTX 1080 Ti, and each epoch takes 2 ∼ 15 seconds depending on the number of training samples n. Note that for reproducibility, we will also publicly release the code.
Table 3 reports the few-shot performance of the fine-tuning-based baselines and our perplexity-based classifiers.
Table 3: Results comparison among perplexity-based classifiers and fine-tuned classifiers in 2-shot, 10-shot and 50-shot settings across three different tasks. Models whose names start with PPL are our proposed perplexity-based classifiers. Major Class is a reference to evaluate classifier performance. All test results reported are mean values of three trials with randomly selected n-shot training samples from the dataset, where n = {2, 10, 50}.

Usage of Perplexity
for the Covid-Scientific dataset in the 50-shot setting. This supports our hypothesis that evidence-conditioned perplexity scores are capable of providing signals regarding the veracity of the given claim. Intuitively, we can consider the perplexity score to be mimicking the role of the "logits" from a classifier, and we are trying to find the best threshold to map this pseudo-logit-like perplexity score into a veracity label. The classification performance of our perplexity-based approach increases as the shot size increases. As the shot size increases from 2 to 50, PPL GPT2-B shows an average gain of 8.19 ± 2.74% and 7.64 ± 1.61% in accuracy and Macro-F1 score, respectively, across all tasks. This is because a greater number of data samples means more anchor perplexity points for threshold searching, and thus, a better threshold to determine the veracity of claims.

Few-shot Comparison to Fine-tuned Baselines
Except for the Covid-Social accuracy in the 50-shot setting, both of our proposed classifiers (PPL GPT2-B, PPL BERT-B) outperform the fine-tuned baseline classifiers across all tasks in all of the 2-, 10- and 50-shot settings. For the 2-shot and 10-shot settings, many of the baseline classifiers underperform the Major Class baseline regardless of the task. This implies their failure to learn anything from the fine-tuning step with a limited number of samples. Only after 50-shot do these baselines start to learn and outperform the Major Class baselines. This is not surprising, since the pre-trained models are known to perform well in a full-shot scenario, but they do not guarantee good performance when they are shown few samples.
In contrast, our perplexity-based classifiers manage to perform fairly well, even in the 2-shot setting, because our "classifier" is a single parameter (i.e., threshold value), which requires no complex learning or optimization. We would like to emphasize that ours consistently outperform the strong Transformer-based baselines across all datasets on the F1-Macro metric by an absolute 10 ∼ 20%. We argue that these results demonstrate the strength of our approach in low-resource few-shot settings.
BERT vs. GPT2 for Perplexity Scores Most of the time, PPL GPT2-B outperforms PPL BERT-B. For instance, in the 50-shot setting for the FEVER dataset, performance differences are 10.04% and 7.76% for accuracy and F1-Macro scores, respectively. Based on this observation, we can speculate that the perplexity from a unidirectional LM is more suitable for our proposed method than that from a masked LM. This is most likely because the BERT perplexity score is only an estimation based on the "pseudo-perplexity" proposed by Salazar et al.

Analysis and Discussion
In this section, we conduct multiple analyses to further evaluate and understand aspects of our perplexity-based approach.

Scaling the Language Model Size
Generally, scaling the model size also helps to improve the model performance, because more parameters mean a stronger learning capability during fine-tuning or training. Also, Roberts et al. have demonstrated that increasing the parameter size allows for more knowledge to be packed into the LM's parameters. Therefore, we experiment with the model size to see if such findings also extend to our proposed methodology. The following model sizes of GPT2 are investigated: base (PPL GPT2-B), medium (PPL GPT2-M), large (PPL GPT2-L) and xl (PPL GPT2-XL). Results are reported in Table 4. As expected, we can observe the trend that the performance increases with parameter size. For instance, PPL GPT2-XL is the best performing compared to the other, smaller, models for Covid-Scientific and FEVER, achieving new state-of-the-art few-shot results by gaining an absolute ∼ 4% on Covid-Scientific and ∼ 2% on FEVER for accuracy/F1-Macro.

Ablation Study
We carry out an ablation study on the effect of evidence-conditioning with respect to the final perplexity scores and the corresponding final classification performance. In Table 5, we can observe that the performance drops when evidence-conditioning is ablated; the biggest drop is ∼ 15% on F1-Macro for the FEVER task in the 50-shot setting. This implies that the perplexity score is assigned in relation to the context of the provided evidence.

Negation Analysis
In fact-checking, negation is one of the most difficult challenges, and many state-of-the-art models are brittle against it. Thorne and Vlachos show that the winning fact-checking systems from the FEVER workshop are brittle against negations, experiencing a huge performance drop when given negated test sets, up to absolute −29% in accuracy. Therefore, we also conduct analysis regarding the negation handling of our proposed methods by augmenting our dataset with negated examples.
Figure 2: Blue-colored marks indicate precision at each of k when claims are assigned with perplexity scores from GPT2-base to rank claims in reverse order (i.e., higher to lower score). Orange-colored marks indicate mean precision value of 10 trials when random scores are assigned to each claim to rank.
Template-based Data Negation We create our negated dataset by replacing all the auxiliary verbs (e.g., is, can) with their corresponding negated forms (e.g., is not, can not), and vice versa. We apply this approach to the Covid-Scientific dataset and obtain a new version that contains {original-sentence (S original), negated-sentence (S negated)} pairs. Note that the evidence is kept the same, but the veracity label of S original is negated (i.e., Supported is negated to Unsupported and vice versa). To illustrate with an example, S original = {"claim": "5g helps covid-19 spread.", "evidence": evidence 1, "label": Unsupported} is negated into S negated = {"claim": "5g does not help covid-19 spread.", "evidence": evidence 1, "label": Supported}.
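A rough sketch of such a template-based negation is shown below. The auxiliary-verb list and helper names are our own assumptions, and the sketch only toggles "not" after an auxiliary verb that is already present; it does not handle main verbs that need do-support (as in "helps" becoming "does not help"), so it should be read as an approximation of the procedure rather than the exact templates used.

```python
import re

# Hypothetical, non-exhaustive list of auxiliary verbs.
AUXILIARIES = ("is", "are", "was", "were", "can", "will", "does", "do", "should")
_AUX_PATTERN = re.compile(
    r"\b(" + "|".join(AUXILIARIES) + r")\b( not\b)?", re.IGNORECASE
)
_FLIP_LABEL = {"Supported": "Unsupported", "Unsupported": "Supported"}


def negate_claim(claim):
    """Toggle negation on every auxiliary verb: 'is' <-> 'is not'."""
    def flip(match):
        aux, negation = match.group(1), match.group(2)
        return aux if negation else aux + " not"
    return _AUX_PATTERN.sub(flip, claim)


def negate_example(example):
    """Negate the claim and flip the label; the evidence is left unchanged."""
    return {
        **example,
        "claim": negate_claim(example["claim"]),
        "label": _FLIP_LABEL[example["label"]],
    }


original = {"claim": "5g can help covid-19 spread.",
            "evidence": "evidence_1", "label": "Unsupported"}
print(negate_example(original))
```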
Q1: Can the LM distinguish negation? We use the new augmented Covid-Scientific dataset to investigate whether the LM manages to differentiate between the original-sentence S original and negated-sentence S negated . The average of the absolute difference between the perplexities assigned to S original and S negated is 122 and the maximum absolute difference value is 2800.
Q2: Performance on negation-augmented dataset? We evaluate the performance of the perplexity-based classifier (PPL GPT2-B) on the "negation-augmented" Covid-Scientific dataset in reference to its original. Unsurprisingly, PPL GPT2-B does experience a drop in performance of 13.77% and 13.40% in accuracy and F1-Macro, respectively. However, it still outperforms the fine-tuned RoBERTa ft baseline, the best performing baseline in the 2-shot setting, as shown in Table 6.

Comparison with existing FEVER System in Few-shot Setting
For all three tasks, we compare our perplexity models against different fine-tuned baselines in Section 5.4. Unlike the two newly proposed COVID-19-related tasks, FEVER is a well-established task studied by many existing works. In order to understand how our perplexity-based method compares against the literature, we conduct an additional experiment with the publicly available system from the runner-up team of the FEVER workshop, HexaF (Yoneda et al., 2018b). We fine-tune HexaF's veracity classification modules in few-shot settings. In the 2-shot setting, HexaF shows accuracy of 49.99% and F1-Macro score of 33.33%, values consistent with predicting a single class on a balanced binary test set. In the 50-shot setting, it shows accuracy of 53.53% and F1-Macro score of 49.27%. In general, machine learning models require sufficient amounts of training data, and this "sufficient amount" normally differs depending on the model being used. As with the baselines in our main experimental results (Section 5.4), 2 ∼ 50 samples are insufficient data to properly train one of the winning fact-checking systems.

Potential Application: Ranking of Candidate Claims for Fact-Checking
Here, we discuss another way of leveraging the evidence-conditioned perplexity score. It can be used for prioritizing false-claim candidates for human fact-checkers, instead of doing hard prediction on the veracity of the given claims. By ranking the claims-to-be-fact-checked in descending order of perplexity, we can increase the chance that the first k claims checked by a human fact-checker are Unsupported false claims. This will be beneficial since fact-checkers can efficiently allocate their time and resources on fact-checking claims that are more likely to be false and harmful to society.
In Figure 2, we compare the precision at the top-k (P@k) between the perplexity-based ranking and random-score-based ranking. P@k measures the fraction of Unsupported claims among the first k of the ranked claims. Across all datasets, perplexity-based ranking (blue marks) exhibits higher precision scores than random-score-based ranking (orange marks). Moreover, for both Covid-Scientific and Covid-Social, our P@k is over 80% for all k values.
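The P@k computation for this ranking can be sketched as follows. The function name is our own and the four scores are toy values (two of them reuse the example perplexities quoted earlier), not results from the experiments.

```python
def precision_at_k(scores, labels, k):
    """Fraction of Unsupported claims among the k highest-perplexity claims."""
    # Rank claims from highest to lowest perplexity.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_k = ranked[:k]
    return sum(label == "Unsupported" for _, label in top_k) / k


# Toy perplexity scores and gold labels.
scores = [328.23, 96.74, 210.0, 50.0]
labels = ["Unsupported", "Supported", "Unsupported", "Supported"]

for k in (1, 2, 3, 4):
    print(k, precision_at_k(scores, labels, k))
```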

Future Research Directions
In this work, we conduct the FEVER experiments in a binary set-up to keep all the experimental settings consistent across all three datasets. However, the original FEVER task has three classes: Support, Refute, and Not Enough Info (NEI). Since the distinction between NEI and Refute cases is also an important problem, it would be important future work to extend our binary-class setting to the three-class setting.
Moreover, we believe our method can easily be integrated into other existing approaches, for instance, by leveraging the perplexity score in the final step of the FEVER fact-checkers as additional input. It would be a useful future direction to explore and discover the most effective way of incorporating the perplexity-based approach into other existing fact-checking systems.

Conclusion
In this paper, we propose a novel way of leveraging the perplexity score from LMs for the few-shot fact-checking task. Through experimental analysis from an ablation study to the discussion of potential applications, we further explore and evaluate the capability of the perplexity score to act as an indicator of unsupported claims. We hope our proposed approach encourages future research to continue developing LM-based methodologies as well as the few-shot approach for fact-checking. By doing so, our community can move towards a data-efficient approach that is not constrained by the requirement of a large labeled dataset.