Generalizing Open Domain Fact Extraction and Verification to COVID-FACT through In-Domain Language Modeling

With the COVID-19 epidemic, verifying scientifically false online information, such as fake news and maliciously fabricated statements, has become crucial. However, the lack of training data in the scientific domain limits the performance of fact verification models. This paper proposes an in-domain language modeling method for fact extraction and verification systems. We propose SciKGAT, which combines the advantages of open-domain literature search, state-of-the-art fact verification systems and in-domain medical knowledge through language modeling. Our experiments on SCIFACT, a dataset of expert-written scientific claims, show that SciKGAT achieves a 30% absolute improvement in verification precision. Our analyses show that this improvement comes from our in-domain language model, which picks up more related evidence pieces and verifies facts more accurately. Our code and data are released on GitHub.


Introduction
Online content with false information, such as lies, rumors and conspiracy theories, has been growing significantly and spreading widely during the COVID-19 epidemic. An automatic fact-checking system is urgently needed to check these scientific claims and avoid undesired consequences. Automatic fact-checking has drawn lots of attention from the NLP community. Researchers mainly focus on stopping misinformation transmission through videos and texts (Cinelli et al., 2020; Hossain et al., 2020; Li et al., 2020; Serrano et al., 2020). The scientific fact verification task (Wadden et al., 2020) was proposed to deal with COVID-FACT using high-quality articles spanning domains from basic science to clinical medicine. Nevertheless, the small-scale training data of SCIFACT may limit the performance of COVID-FACT checking. The state-of-the-art model (Wadden et al., 2020) achieves only 46.6% precision on fact verification, which is hard for users to trust.

This paper presents the Scientific KGAT (SciKGAT) to deal with low-resource COVID-FACT verification. SciKGAT employs an in-domain language model in the fact extraction and verification pipeline (Thorne et al., 2018; Wadden et al., 2020) to adapt fact-checking to the COVID domain. The in-domain language model transfers COVID domain knowledge into pre-trained language models with continuous training and learns medical token semantics towards COVID with mask language model based training. The state-of-the-art fact verification model KGAT (Ye et al., 2020) is also used in SciKGAT for multi-evidence reasoning in the fact verification module. Our code and data are available at https://github.com/thunlp/KernelGAT.
Our experiments show that in-domain language modeling improves various components of the whole fact extraction and verification pipeline by achieving more accurate evidence selection and fact verification. Our in-domain language modeling improves fact verification performance by more than 10% absolute F1 and 30% absolute precision (from 46.6% to 76%) over the previous state-of-the-art on SCIFACT. Such improvement shows that our model provides a set of solutions for low-resource fact verification tasks, such as COVID-19 fact-checking.

Related Work
Existing fact extraction and verification models usually employ a three-step pipeline (Chen et al., 2017): document retrieval (abstract retrieval), sentence selection (rationale selection) and fact verification (Thorne et al., 2018; Wadden et al., 2020).
Preliminary fact verification methods concatenate all evidence pieces for verification (Nie et al., 2019; Wadden et al., 2020). KGAT conducts fine-grained reasoning over multiple evidence pieces with a graph and achieves the state-of-the-art for fact verification (Ye et al., 2020).
The reasoning ability of the pre-trained language model is crucial and helps improve fact verification performance (Devlin et al., 2019; Zhou et al., 2019; Soleimani et al., 2019). Some work (Beltagy et al., 2019; Lee et al., 2020) transfers medical domain knowledge into pre-trained language models for better medical semantic understanding, which provides a potential way to deal with the COVID-FACT checking problem.

Methodology
This section describes our SciKGAT for fact extraction and verification. We first introduce the fact extraction and verification pipeline (Sec. 3.1) and then describe how we continuously train the BERT based models (Sec. 3.2) for the whole pipeline.

Preliminary
Given a claim $c$, we aim to predict the claim label $y$. We implement the fact extraction and verification pipeline with three steps: abstract retrieval, rationale selection and fact verification.
Abstract Retrieval. Given the claim $c$ and the abstract collection $D = \{a_1, \ldots, a_l\}$, we aim to retrieve the top three abstracts for the following steps.
We first retrieve the top-100 abstracts from the abstract collection $D$ with TF-IDF, following previous work (Wadden et al., 2020). For the claim $c$ and an abstract $a = \{e_1, \ldots, e_k\}$ with $k$ evidence sentences and title $t$, we concatenate the claim, title and abstract to get the representation $H$ of the pair $\langle c, a \rangle$ with BERT (Devlin et al., 2019):

$H = \mathrm{BERT}(c \circ t \circ a),$ (1)

where $\circ$ is the concatenation operation. The representation $H$ of $\langle c, a \rangle$ consists of token representations from both the claim and the abstract, and the 0-th representation $H_0$ denotes the [CLS] representation. The relevance label $y_a$ between claim $c$ and abstract $a$ is predicted as:

$p(y_a \mid c, a) = \mathrm{softmax}_{y_a}(\mathrm{MLP}(H_0)).$ (2)

We rerank the abstracts according to the probability $p(y_a = 1 \mid c, a)$ and keep the top-3 abstracts.

Rationale Selection. Given a retrieved abstract $a$, rationale selection focuses on selecting the sentences relevant to the claim for fact verification.
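To make the retrieval step concrete, here is a minimal sketch of the reranking scored by Eq. (1)-(2). It is an illustration rather than the released SciKGAT code: the checkpoint name is SciBERT's public huggingface identifier, and the fine-tuned relevance head, helper names and data layout are assumptions.

```python
# Sketch of abstract reranking with a BERT-style cross-encoder, assuming a
# checkpoint already fine-tuned for claim-abstract relevance (Eq. 1-2).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=2)  # MLP head on H_0
model.eval()

def rerank_abstracts(claim, candidates, top_k=3):
    """Rerank TF-IDF candidates, given as (title, abstract) pairs; keep top_k."""
    scored = []
    for title, abstract in candidates:  # candidates: top-100 TF-IDF hits
        inputs = tokenizer(claim, title + " " + abstract, truncation=True,
                           max_length=256, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        p_rel = torch.softmax(logits, dim=-1)[0, 1].item()  # p(y_a=1|c,a)
        scored.append((p_rel, title, abstract))
    scored.sort(key=lambda x: x[0], reverse=True)
    return scored[:top_k]
```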
Similarly, for an evidence sentence $e$ in the retrieved abstract $a$, we get the representation $H$ of the claim-evidence pair $\langle c, e \rangle$:

$H = \mathrm{BERT}(c \circ e).$ (3)

Then we predict the relevance label $y_r$ of claim $c$ and evidence $e$:

$p(y_r \mid c, e) = \mathrm{softmax}_{y_r}(\mathrm{MLP}(H_0)).$ (4)

The related evidence pieces, those with $p(y_r = 0 \mid c, e) < p(y_r = 1 \mid c, e)$, are kept to form the retrieved evidence set $E = \{e_1, \ldots, e_q\}$ of each abstract $a$.

Fact Verification. Given the claim $c$ and the retrieved evidence set $E$, the fact verification model predicts the claim label $y$. We employ the state-of-the-art model KGAT as our fact verification module. For the $i$-th evidence piece $e_i$ in the evidence set $E$, we get the sentence pair representation $H^i$ of the pair $\langle c, e_i \rangle$ through BERT; KGAT then reasons over these evidence representations with kernel graph attention to calculate the probability of the claim label, $p(y \mid c, E)$.
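Rationale selection follows the same pair-scoring pattern at the sentence level. The sketch below, which reuses the hypothetical tokenizer and model from the previous snippet, keeps a sentence whenever the relevant class wins the softmax in Eq. (4).

```python
def select_rationales(claim, abstract_sentences):
    """Keep sentences with p(y_r=1|c,e) > p(y_r=0|c,e) to form E."""
    evidence_set = []
    for sentence in abstract_sentences:
        inputs = tokenizer(claim, sentence, truncation=True,
                           max_length=256, return_tensors="pt")
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
        if probs[1] > probs[0]:  # relevant class wins the softmax
            evidence_set.append(sentence)
    return evidence_set  # E = {e_1, ..., e_q}, passed to KGAT
```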

Continuous In-Domain Training
To deal with low-resource COVID-FACT checking, we propose continuous training methods that transfer domain knowledge into pre-trained language models. For COVID-FACT checking, medical domain knowledge helps models understand medical words (Beltagy et al., 2019). However, such medical domain pre-trained language models become out-of-date as medicine develops or a new virus, such as COVID-19, emerges.
Continuous in-domain training provides a potential way to deal with this problem using the latest medical corpus. Hence, we come up with two in-domain language models for the fact extraction and verification pipeline through continuous training.
Mask language model based training. We continuously train the language model with the mask language model objective on the latest COVID related papers, which teaches it medical token semantics towards COVID without requiring annotated data. This yields an unsupervised in-domain language model, e.g. SciBERT-MLM, used throughout the pipeline.

Rationale prediction based training. We also come up with rationale prediction style training to continuously train BERT for better reasoning ability towards COVID-FACT. For the claim-evidence pair $\langle c, e \rangle$, we optimize the BERT model with supervision from SCIFACT:

$\mathcal{L} = \mathrm{CrossEntropy}(y_r^*, p(y_r \mid c, e)),$ (5)

where $y_r^*$ denotes the ground-truth rationale prediction label of the pair $\langle c, e \rangle$. We thereby obtain a supervised in-domain language model, BERT-RP, for the fact verification module.
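For concreteness, a minimal sketch of the MLM-style continuous training with huggingface's Trainer is shown below. The corpus file, checkpoint and hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of continuous in-domain MLM training on a plain-text
# corpus of abstracts (one abstract per line); covid_abstracts.txt is a
# hypothetical corpus file.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForMaskedLM.from_pretrained("allenai/scibert_scivocab_uncased")

dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="covid_abstracts.txt",
                                block_size=256)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)  # mask 15%

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="scibert-mlm",
                           num_train_epochs=1,
                           per_device_train_batch_size=8),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()  # the checkpoint then replaces vanilla SciBERT in the pipeline
```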

Experimental Methodology
This section describes the dataset, evaluation metrics, baselines, and implementation details.
Dataset. The recently released SCIFACT dataset (Wadden et al., 2020) is used in our experiments. It consists of 1,409 annotated claims with 5,183 scientific articles. All claims are labeled SUPPORT, CONTRADICT or NOT ENOUGH INFO. The training, development and testing sets contain 809, 300 and 300 claims, respectively. FEVER (Thorne et al., 2018), which consists of 185,455 annotated claims with 5,416,537 Wikipedia documents, is also used to train the fact verification modules of the official baselines and our models.
Evaluation Metrics. Precision, Recall and F1 are used to evaluate model performance, following SCIFACT (Wadden et al., 2020). These metrics are inspired by the FEVER score (Thorne et al., 2018) and consider whether the evidence is selected correctly at the abstract level and at the sentence level.
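As a simplified illustration (the official SCIFACT scorer additionally conditions on correct label prediction), sentence-level precision, recall and F1 over predicted versus gold evidence can be computed as follows:

```python
def evidence_prf1(predicted, gold):
    """Sentence-level precision/recall/F1 over evidence IDs (simplified)."""
    tp = len(set(predicted) & set(gold))  # correctly selected sentences
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```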
Baselines. Since the scientific fact verification task was only recently released, our baselines mainly come from Wadden et al. (2020). They first use TF-IDF for abstract retrieval and then use RoBERTa (Large) and SciBERT for rationale selection. KGAT and RoBERTa (Large) are leveraged for fact verification. The rationale selection module is trained with SCIFACT and the fact verification module is trained with data from FEVER and SCIFACT (Wadden et al., 2020).
Implementation Details. In all experiments, we use SciBERT (Beltagy et al., 2019), RoBERTa (Base) and RoBERTa (Large), and inherit huggingface's PyTorch implementation. Adam is used for parameter optimization. For rationale selection, we keep the same settings as Wadden et al. (2020). For abstract retrieval and fact verification, we set the max sequence length to 256, the learning rate to 2e-5, the batch size to 8 and the gradient accumulation steps to 4 during training. The other parameters are kept the same as KGAT.
For the abstract retrieval module, we follow previous work (MacAvaney et al., 2020) and fine-tune our in-domain language model on the medical corpus from MS MARCO (Bajaj et al., 2016) to adapt the abstract retrieval module to open-domain COVID related literature search.

Evaluation Result
This section first tests the overall performance of SciKGAT, then studies the impact of our in-domain language modeling techniques on knowledge transfer, and finally provides case studies.

Overall Performance
The overall performance of SciKGAT is shown in Table 1. The official baseline model, the previous state-of-the-art, uses TF-IDF for abstract retrieval and RoBERTa (Large) for rationale selection and fact verification. We add the modules of SciKGAT step by step to evaluate their effectiveness. SciKGAT (w. A) and SciKGAT (w. AR) show significant improvements over the baselines, which demonstrates that our literature search with an in-domain language model is effective in selecting related evidence at the abstract and sentence levels. For fact verification, SciKGAT improves pipeline performance by achieving a 30% improvement in label prediction precision. The high precision of fact verification demonstrates that our model can provide high-quality and convincing COVID-FACT verification results.

In-Domain Effectiveness
In this experiment, we evaluate the impact of the in-domain language model on the individual fact extraction and verification components of SciKGAT.
As shown in Table 2, we first compare SciBERT and SciBERT-MLM on the abstract retrieval and rationale selection tasks. Then we fix the selected evidence and evaluate the reasoning ability of the fact verification module using two kinds of in-domain language models, the MLM model (mask language model training) and the RP model (rationale prediction training), with three BERT variants.
For abstract retrieval and rationale selection, SciBERT-MLM shows better ranking accuracy than SciBERT, which in turn leads to better fact verification results. This demonstrates that mask language model training learns specific medical domain knowledge from the latest COVID related papers, and that our evidence selection components benefit from such continuous training.
Then we evaluate the effectiveness of in-domain language models on fact verification with various BERT based models. Our in-domain language models significantly improve fact verification performance, illustrating their stronger reasoning ability compared to vanilla pre-trained language models. Compared to the RP model, the MLM model usually achieves better performance. Importantly, the MLM model does not rely on annotated data, providing a general solution for COVID related tasks. The consistent improvement across all BERT variants further demonstrates the robustness of our model.

Case Study
As shown in Table 3, two examples from the development set are used to illustrate SciKGAT's effectiveness for fact verification.
In the first example, both evidence 1 and evidence 2 indicate that basophils can lead to systemic lupus erythematosus, which contradicts the claim. The concatenation based model, RoBERTa, fails to verify the claim, while SciKGAT makes the right prediction, demonstrating the effectiveness of KGAT's fine-grained reasoning over multiple evidence pieces. In the second example, the evidence piece indicates that memory T cells make up the largest share of T cells in adults. SciKGAT predicts the claim label correctly, showing its effectiveness in recognizing and comprehending these medical phrases thanks to in-domain language modeling.

Conclusion
This paper presents in-domain language modeling methods for open domain fact extraction and verification, which transfer domain knowledge for the COVID-FACT checking task. Our experiments show that our pipeline significantly improves the fact-checking performance of the state-of-the-art model with more than 30% absolute prediction precision. Our analyses illustrate that our model has stronger reasoning ability with continuous training and benefits from COVID related knowledge.