GEAR: Graph-based Evidence Aggregating and Reasoning for Fact Verification

Fact verification (FV) is a challenging task that requires retrieving relevant evidence from plain text and using that evidence to verify given claims. Many claims can be verified only by integrating and reasoning over several pieces of evidence simultaneously. However, previous work employs simple models to extract information from evidence without letting the pieces of evidence communicate with each other, e.g., by merely concatenating the evidence for processing. These methods therefore cannot capture sufficient relational and logical information among the evidence. To alleviate this issue, we propose a graph-based evidence aggregating and reasoning (GEAR) framework, which enables information to transfer on a fully-connected evidence graph and then utilizes different aggregators to collect multi-evidence information. We further employ BERT, an effective pre-trained language representation model, to improve performance. Experimental results on the large-scale benchmark dataset FEVER demonstrate that GEAR can leverage multi-evidence information for FV and achieves a promising result, with a test FEVER score of 67.10%. Our code is available at https://github.com/thunlp/GEAR.


Introduction
Due to the rapid development of information extraction (IE), huge volumes of data have been extracted. How to automatically verify these data has become a vital problem for various data-driven applications, e.g., knowledge graph completion (Wang et al., 2017) and open domain question answering (Chen et al., 2017a). Hence, many recent research efforts have been devoted to fact verification (FV), which aims to verify given claims with the evidence retrieved from plain text. More specifically, given a claim, an FV system is asked to label it as "SUPPORTED", "REFUTED", or "NOT ENOUGH INFO", which indicate that the evidence can support, refute, or is not sufficient for the claim.

† Corresponding author: Z. Liu (liuzy@tsinghua.edu.cn)

Table 1:

"SUPPORTED" Example
Claim: The Rodney King riots took place in the most populous county in the USA.
Evidence:
(1) The 1992 Los Angeles riots, also known as the Rodney King riots were a series of riots, lootings, arsons, and civil disturbances that occurred in Los Angeles County, California in April and May 1992.
(2) Los Angeles County, officially the County of Los Angeles, is the most populous county in the USA.

"REFUTED" Example
Claim: Giada at Home was only available on DVD.
Evidence:
(1) Giada at Home is a television show and first aired on October 18, 2008, on the Food Network.
(2) Food Network is an American basic cable and satellite television channel.

Existing FV methods formulate FV as a natural language inference (NLI) (Angeli and Manning, 2014) task. However, they utilize simple evidence combination methods such as concatenating the evidence or just dealing with each evidence-claim pair. These methods are unable to capture sufficient relational and logical information among the evidence. In fact, many claims can be verified only by integrating and reasoning over several pieces of evidence simultaneously. As shown in Table 1, for both the "SUPPORTED" example and the "REFUTED" example, we cannot verify the given claim by checking any piece of evidence in isolation. The claims can be verified only by understanding and reasoning over multiple pieces of evidence.
To integrate and reason over information from multiple pieces of evidence, we propose a graph-based evidence aggregating and reasoning (GEAR) framework. Specifically, we first build a fully-connected evidence graph and encourage information propagation among the evidence. Then, we aggregate the pieces of evidence and adopt a classifier to decide whether the evidence can support, refute, or is not sufficient for the claim. Intuitively, by sufficiently exchanging and reasoning over evidence information on the evidence graph, the proposed model can make the best of the information for verifying claims. For example, by delivering the information "Los Angeles County is the most populous county in the USA" to "the Rodney King riots occurred in Los Angeles County" through the evidence graph, the synthetic information can support "The Rodney King riots took place in the most populous county in the USA". Furthermore, we adopt an effective pretrained language representation model BERT (Devlin et al., 2019) to better grasp both evidence and claim semantics.
We conduct experiments on the large-scale benchmark dataset for Fact Extraction and VERification (FEVER) (Thorne et al., 2018a). Experimental results show that the proposed framework outperforms recent state-of-the-art baseline systems. The further case study indicates that our framework could better leverage multi-evidence information and reason over the evidence for FV.

FEVER Shared Task
The FEVER shared task (Thorne et al., 2018b) challenges participants to develop automatic fact verification systems to check the veracity of human-generated claims by extracting evidence from Wikipedia. The shared task is hosted as a competition on Codalab with a blind test set. Nie et al. (2019), Yoneda et al. (2018) and Hanselowski et al. (2018) have achieved the top three results among 23 teams.
Existing methods mainly formulate FV as an NLI task. Thorne et al. (2018a) simply concatenate all evidence together, and then feed the concatenated evidence and the given claim into the NLI model. Luken et al. (2018) adopt the decomposable attention model (DAM) (Parikh et al., 2016) to generate NLI predictions for each claim-evidence pair individually and then aggregate all NLI predictions for final verification. Then, Hanselowski et al. (2018), Yoneda et al. (2018) and Hidey and Diab (2018) adopt the enhanced sequential inference model (ESIM) (Chen et al., 2017b), a more effective NLI model, instead of DAM to infer the relevance between evidence and claims. As pre-trained language models have achieved great results on various NLP applications, Malon (2018) fine-tunes the generative pre-training transformer (GPT) (Radford et al., 2018) for FV. Based on the methods mentioned above, Nie et al. (2019) specially design the neural semantic matching network (NSMN), which is a modification of ESIM and achieves the best results in the competition. Unlike these methods, Yin and Roth (2018) propose the TwoWingOS system, which trains the evidence identification and claim verification modules jointly.

Natural Language Inference
The natural language inference (NLI) task requires a system to label the relationship between a premise and a hypothesis as entailment, contradiction or neutral. Several large-scale datasets have been proposed to promote research in this direction, such as SNLI (Bowman et al., 2015) and Multi-NLI (Williams et al., 2018). These datasets have made it feasible to train complicated neural models which have achieved state-of-the-art results (Bowman et al., 2015; Parikh et al., 2016; Sha et al., 2016; Chen et al., 2017b,c; Munkhdalai and Yu, 2017; Nie and Bansal, 2017; Conneau et al., 2017; Gong et al., 2018; Tay et al., 2018; Ghaeini et al., 2018). It is intuitive to transfer NLI models into the claim verification stage of the FEVER task, and several teams from the shared task have achieved promising results in this way.

Pre-trained Language Models
Pre-trained language representation models such as ELMo and OpenAI GPT (Radford et al., 2018) have proven effective on many NLP tasks. BERT (Devlin et al., 2019) employs a bidirectional transformer and well-designed pre-training tasks to fuse bidirectional context information, and obtains state-of-the-art results on the NLI task. In our experiments, we find that the fine-tuned BERT model outperforms other NLI-based models on the claim verification subtask of FEVER. Hence, we use BERT as the sentence encoder in our framework to better encode the semantic information of evidence and claims.

Method
We employ a three-step pipeline with components for document retrieval, sentence selection and claim verification to solve the task. In the document retrieval and sentence selection stages, we simply follow the method from Hanselowski et al. (2018), since their method has the highest evidence recall score in the former FEVER shared task. In the final claim verification stage, we propose our Graph-based Evidence Aggregating and Reasoning (GEAR) framework. The full pipeline of our method is illustrated in Figure 1.

Document Retrieval and Sentence Selection
In this section, we describe our document retrieval and sentence selection components. Additionally, we add a threshold filter after the sentence selection component to filter out noisy evidence.
In the document retrieval step, we adopt the entity linking approach from Hanselowski et al. (2018). Given a claim, the method first utilizes the constituency parser from AllenNLP to extract potential entities from the claim. Then it uses the entities as search queries and finds relevant Wikipedia documents via the online MediaWiki API. The seven highest-ranked results for each query are stored to form a candidate article set. Finally, the method drops the articles which are not in the offline Wikipedia dump and filters the articles by the word overlap between their titles and the claim.
The sentence selection component selects the most relevant evidence for the claim from all sentences in the retrieved documents. Hanselowski et al. (2018) modify the ESIM model to compute the relevance score between the evidence and the claim. In the training phase, the model uses the hinge loss function $\max(0, 1 + s_n - s_p)$ with the negative sampling strategy, where $s_p$ and $s_n$ denote the relevance scores of positive and negative samples. In the test phase, the final model ensembles the results from 10 models with different random seeds. In the original method, the sentences with the top-5 relevance scores are selected to form the final evidence set.
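As a minimal illustration of the hinge loss above, the following sketch computes the loss for individual (positive, negative) score pairs and its mean over a batch. The function names and the example scores are hypothetical and are not taken from the implementation of Hanselowski et al. (2018); only the formula itself comes from the text.

```python
def hinge_ranking_loss(s_pos, s_neg, margin=1.0):
    """Pairwise hinge loss max(0, margin + s_neg - s_pos)."""
    return max(0.0, margin + s_neg - s_pos)


def batch_hinge_loss(pairs, margin=1.0):
    """Mean hinge loss over a list of (s_pos, s_neg) relevance score pairs."""
    return sum(hinge_ranking_loss(p, n, margin) for p, n in pairs) / len(pairs)


# Hypothetical scores: in the first pair the positive sentence beats the
# negative one by more than the margin, in the second pair it does not.
pairs = [(2.5, 0.5), (1.0, 2.0)]
for s_pos, s_neg in pairs:
    print(s_pos, s_neg, hinge_ranking_loss(s_pos, s_neg))
print("mean loss:", batch_hinge_loss(pairs))
```

The loss is zero only when the positive sentence outscores the sampled negative sentence by at least the margin of 1, which is what pushes relevant sentences towards the top of the ranking.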
In addition to the original model (Hanselowski et al., 2018), we add a relevance score filter with a threshold $\tau$. Sentences with relevance scores lower than $\tau$ are filtered out to reduce noise. Thus the final size of the retrieved evidence set is at most 5. We try different values of $\tau$ and select the value based on the dev set result. The evaluation results of the document retrieval and sentence selection components are shown in Section 5.1.
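The selection rule described above (keep the five highest-scoring sentences, then drop those scoring below the threshold) can be sketched as follows. This is an illustrative sketch only: the function name, the data layout and the example scores are assumptions, and the default threshold is simply the value chosen for the final model later in the paper.

```python
def select_evidence(scored_sentences, tau=1e-3, top_k=5):
    """Keep the top_k highest-scoring sentences, then drop any whose
    relevance score is lower than the threshold tau.

    scored_sentences: list of (sentence_id, relevance_score) pairs.
    """
    ranked = sorted(scored_sentences, key=lambda pair: pair[1], reverse=True)
    return [(sent, score) for sent, score in ranked[:top_k] if score >= tau]


# Hypothetical relevance scores for seven candidate sentences.
candidates = [("a", 0.9), ("b", 0.0005), ("c", 0.4), ("d", 0.002),
              ("e", 0.0), ("f", 0.3), ("g", 0.1)]

for tau in (0.0, 1e-3, 0.2):
    kept = [sent for sent, _ in select_evidence(candidates, tau=tau)]
    print(tau, kept)
```

Because the filter is applied after the top-5 cut, the evidence set can only shrink as the threshold grows, which matches the statement that its size is at most 5.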

Claim Verification with GEAR
In this section, we describe our GEAR framework for claim verification. As shown in Figure 1, given a claim and the retrieved evidence, we first utilize a sentence encoder to obtain representations for the claim and the evidence. Then we build a fully-connected evidence graph and propose an evidence reasoning network (ERNet) to propagate information among evidence and reason over the graph. Finally, we utilize an evidence aggregator to infer the final results.

Sentence Encoder
Given an input sentence, we employ BERT (Devlin et al., 2019) as our sentence encoder. Specifically, given a claim $c$ and $N$ pieces of retrieved evidence $\{e_1, e_2, ..., e_N\}$, we feed each evidence-claim pair $(e_i, c)$ into BERT to obtain the evidence representation $\mathbf{e}_i$. We also feed the claim into BERT alone to obtain the claim representation $\mathbf{c}$. That is,

$$\mathbf{e}_i = \mathrm{BERT}(e_i, c), \quad \mathbf{c} = \mathrm{BERT}(c). \tag{1}$$

Note that we concatenate the evidence and the claim to extract the evidence representation because the evidence nodes in the reasoning graph need the information from the claim to guide the message passing process among them.

Evidence Reasoning Network
To encourage information propagation among evidence, we build a fully-connected evidence graph where each node indicates a piece of evidence. We also add a self-loop to every node because each node needs the information from itself in the message propagation process. We use $h^t = \{h^t_1, h^t_2, ..., h^t_N\}$ to represent the hidden states of nodes at layer $t$, where $h^t_i \in \mathbb{R}^{F \times 1}$ and $F$ is the number of features in each node. The initial hidden state of each evidence node $h^0_i$ is initialized by the evidence representation: $h^0_i = \mathbf{e}_i$.
Inspired by recent work on semi-supervised graph learning and relational reasoning (Kipf and Welling, 2017; Velickovic et al., 2018; Palm et al., 2018), we propose an evidence reasoning network (ERNet) to propagate information among the evidence nodes. We first use an MLP to compute the attention coefficients between a node $i$ and its neighbor $j$ ($j \in \mathcal{N}_i$),

$$p_{ij} = W^{t-1}_1 \left( \mathrm{ReLU} \left( W^{t-1}_0 \left( h^{t-1}_i \,\|\, h^{t-1}_j \right) \right) \right),$$

where $\mathcal{N}_i$ denotes the set of neighbors of node $i$, $W^{t-1}_0 \in \mathbb{R}^{H \times 2F}$ and $W^{t-1}_1 \in \mathbb{R}^{1 \times H}$ are weight matrices, and $\cdot \| \cdot$ denotes the concatenation operation.
Then, we normalize the coefficients using the softmax function,

$$\alpha_{ij} = \mathrm{softmax}_j(p_{ij}) = \frac{\exp(p_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(p_{ik})}.$$

Finally, the normalized attention coefficients are used to compute a linear combination of the neighbor features, and thus we obtain the features for node $i$ at layer $t$,

$$h^t_i = \sum_{j \in \mathcal{N}_i} \alpha_{ij} h^{t-1}_j.$$

By stacking $T$ layers of ERNet, we assume that each piece of evidence can grasp enough information by communicating with the other pieces of evidence. We feed the final hidden states of the evidence nodes $\{h^T_1, h^T_2, ..., h^T_N\}$ into our evidence aggregator to make the final inference.
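The three steps of an ERNet layer (MLP attention coefficients, softmax normalization, weighted sum of neighbor states) can be sketched in plain Python as below. This is a toy sketch rather than the released implementation: the helper names, the random initialization and the small dimensions are all assumptions, and the ReLU placed between the two weight matrices is an assumed choice of non-linearity for the MLP.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ernet_layer(h, W0, W1):
    """One propagation step on a fully-connected graph with self-loops.

    h  : list of N node vectors, each of length F
    W0 : H x 2F weight matrix, W1 : 1 x H weight matrix
    """
    num_features = len(h[0])
    new_h = []
    for h_i in h:
        # Attention coefficient p_ij from an MLP over the concatenation [h_i ; h_j].
        # The ReLU between the two weight matrices is an assumption of this sketch.
        p = [matvec(W1, relu(matvec(W0, h_i + h_j)))[0] for h_j in h]
        alpha = softmax(p)  # normalize over all neighbors j, including i itself
        # New state of node i: attention-weighted sum of the neighbor states.
        new_h.append([sum(a * h_j[f] for a, h_j in zip(alpha, h))
                      for f in range(num_features)])
    return new_h


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
N, F, H = 5, 8, 4  # toy sizes; the experiments use F = 768 and H = 64
h = [[rng.uniform(-1, 1) for _ in range(F)] for _ in range(N)]
for _ in range(2):  # stack T = 2 layers, each with its own weights
    h = ernet_layer(h, random_matrix(H, 2 * F, rng), random_matrix(1, H, rng))
print(len(h), len(h[0]))
```

Since every new state is a convex combination of the previous states, a layer never changes the feature dimension, so layers can be stacked freely before the aggregator.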

Evidence Aggregator
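We employ an evidence aggregator to gather information from different evidence nodes and obtain the final hidden state $o \in \mathbb{R}^{F \times 1}$. The aggregator may utilize different aggregating strategies, and we suggest three aggregators in our framework.
Attention Aggregator. Here we use the representation of the claim $\mathbf{c}$ to attend the hidden states of evidence and get the final aggregated state $o$,

$$p_j = W_1 \left( \mathrm{ReLU} \left( W_0 \left( \mathbf{c} \,\|\, h^T_j \right) \right) \right),$$
$$\alpha_j = \mathrm{softmax}(p_j) = \frac{\exp(p_j)}{\sum_{k=1}^{N} \exp(p_k)},$$
$$o = \sum_{k=1}^{N} \alpha_k h^T_k,$$

where $W_0 \in \mathbb{R}^{H \times 2F}$ and $W_1 \in \mathbb{R}^{1 \times H}$.
Max Aggregator. The max aggregator performs the element-wise Max operation among hidden states,

$$o = \mathrm{Max}(h^T_1, h^T_2, ..., h^T_N).$$

Mean Aggregator. The mean aggregator performs the element-wise Mean operation among hidden states,

$$o = \mathrm{Mean}(h^T_1, h^T_2, ..., h^T_N).$$

Once the final state $o$ is obtained, we employ a one-layer MLP to get the final prediction $l$,

$$l = \mathrm{softmax}(\mathrm{ReLU}(W o + b)),$$

where $W \in \mathbb{R}^{C \times F}$ and $b \in \mathbb{R}^{C \times 1}$ are parameters, and $C$ is the number of prediction labels.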
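The three aggregators and the final one-layer classifier can be sketched as follows. This is a toy sketch under stated assumptions: the helper names, dimensions and example values are hypothetical, the ReLU inside the attention MLP and before the output softmax is an assumed non-linearity, and the softmax output is an assumed way of turning the MLP output into label probabilities.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_aggregate(c, h, W0, W1):
    """Attend over evidence states h with the claim vector c (W0: H x 2F, W1: 1 x H)."""
    p = [matvec(W1, relu(matvec(W0, c + h_j)))[0] for h_j in h]  # ReLU is assumed
    alpha = softmax(p)
    return [sum(a * h_j[f] for a, h_j in zip(alpha, h)) for f in range(len(c))]


def max_aggregate(h):
    """Element-wise maximum over the evidence states."""
    return [max(column) for column in zip(*h)]


def mean_aggregate(h):
    """Element-wise mean over the evidence states."""
    return [sum(column) / len(column) for column in zip(*h)]


def classify(o, W, b):
    """One-layer MLP over the aggregated state o (W: C x F, b: length C)."""
    logits = relu([z + b_k for z, b_k in zip(matvec(W, o), b)])  # ReLU is assumed
    return softmax(logits)


# Toy example with N = 2 evidence nodes, F = 2 features and C = 3 labels.
h = [[1.0, 5.0], [3.0, 2.0]]
c = [0.5, -0.5]
W0 = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.1, -0.1, 0.2]]  # H = 2, 2F = 4
W1 = [[0.5, -0.5]]
W = [[0.2, 0.1], [-0.1, 0.3], [0.0, 0.0]]
b = [0.0, 0.1, 0.0]

for name, o in [("attention", attention_aggregate(c, h, W0, W1)),
                ("max", max_aggregate(h)),
                ("mean", mean_aggregate(h))]:
    print(name, o, classify(o, W, b))
```

All three aggregators map N vectors of length F to a single vector of length F, so the classifier is independent of the aggregator chosen and of the number of evidence pieces.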

Dataset
We conduct our experiments on the large-scale dataset FEVER (Thorne et al., 2018a). The dataset consists of 185,455 annotated claims with a set of 5,416,537 Wikipedia documents from the June 2017 Wikipedia dump. We follow the dataset partition from the FEVER Shared Task (Thorne et al., 2018b). Table 2 shows the statistics of the dataset.

Baselines
In this section, we describe the baseline systems in our experiments. We first introduce the top-3 systems from the FEVER shared task. As BERT (Devlin et al., 2019) has achieved promising performance on several NLP tasks, we also implement two baseline systems via fine-tuning BERT in the claim verification task.

Shared Task Systems
We choose the top-3 models from the FEVER shared task as our baselines. The Athene UKP TU Darmstadt team (Athene) (Hanselowski et al., 2018) combines five inference vectors from the ESIM model via attention mechanism to make the final prediction.
The UCL Machine Reading Group (UCL MRG) (Yoneda et al., 2018) predicts the label of each evidence-claim pair and aggregates the results via a label aggregation component.
The UNC NLP team (Nie et al., 2019) proposes the neural semantic matching network and uses the model jointly to solve all three subtasks. They also incorporate additional information such as pageview frequency and WordNet features. They achieved the best results in the competition.

BERT Fine-tuning Systems
We implement two BERT fine-tuning systems with different evidence combination approaches. The BERT-Concat system concatenates all evidence into a single string while the BERT-Pair system encodes each evidence-claim pair independently and then aggregates the results. Both systems share the same document retrieval and sentence selection components proposed by us.
BERT-Concat. In the BERT-Concat system, we simply concatenate all evidence into a single sentence and utilize BERT to predict the relation between the concatenated evidence and the claim. In the training phase, we add the ground truth evidence into the retrieved evidence set with relevance score 1 and select five pieces of evidence with the highest scores. In the test phase, we concatenate the retrieved evidence for predicting.
BERT-Pair. In the BERT-Pair system, we utilize BERT to predict the label for each evidence-claim pair. Concretely, we use each evidence-claim pair as the input and the label of the claim as the prediction target. In the training phase, we select the ground truth evidence for SUPPORTED and REFUTED claims and the retrieved evidence for NEI claims. In the test phase, we predict labels for all retrieved evidence-claim pairs. Because different evidence-claim pairs may have inconsistent predicted labels, we then utilize an aggregator to obtain the final claim label. We find that the aggregator which simply returns the predicted label from the most relevant evidence has the best performance.
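The best-performing BERT-Pair aggregator described above, which returns the predicted label of the most relevant evidence, can be sketched in a few lines. The function name, the input layout and the fallback for an empty evidence set are assumptions made for illustration.

```python
def aggregate_pair_predictions(predictions, default="NOT ENOUGH INFO"):
    """Return the label predicted for the most relevant evidence-claim pair.

    predictions: list of (relevance_score, predicted_label) pairs, one per
    retrieved evidence sentence. Falling back to `default` when no evidence
    was retrieved is an assumption of this sketch.
    """
    if not predictions:
        return default
    _, label = max(predictions, key=lambda pair: pair[0])
    return label


# Hypothetical per-pair predictions that disagree with each other.
predictions = [(0.91, "SUPPORTED"), (0.40, "NOT ENOUGH INFO"), (0.12, "REFUTED")]
print(aggregate_pair_predictions(predictions))
print(aggregate_pair_predictions([]))
```

Note that this aggregator ignores every pair except the top-ranked one, so it cannot combine information across evidence sentences.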

Hyperparameter Settings
We utilize BERT$_{\mathrm{BASE}}$ (Devlin et al., 2019) in all of the BERT fine-tuning baselines and our GEAR framework. The learning rate is 2e-5.
For BERT-Concat, the maximum sequence length is 256 and the batch size is 16. We limit the max length for concatenated evidence to 240 and the max length for claims to 16. We train this model for two epochs based on dev results. For BERT-Pair, we set the maximum sequence length to 128 and batch size to 32. We train this model for one epoch. As for the GEAR framework, we use the fine-tuned BERT-Pair model to extract features and the batch size is 512.
In our ERNet, we set the batch size to 256, the number of features F to 768 and the dimension of weight matrices H to 64. The model is trained to minimize the negative log likelihood loss on the predicted label using the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 5e-3 and L2 weight decay of 5e-4. We use an early stopping strategy on the label accuracy of the validation set, with a patience of 20 epochs. We attempt to stack 0-3 ERNet layers and analyze the effect of different layer numbers.
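The early stopping strategy mentioned above can be sketched as a small helper. It assumes the common reading of patience, namely that training stops once the given number of epochs has passed without an improvement in dev label accuracy; the function name and the example accuracies are hypothetical.

```python
def early_stopping_epoch(dev_accuracies, patience=20):
    """Return (best_epoch, best_accuracy) under early stopping.

    dev_accuracies: iterable yielding the dev label accuracy after each epoch,
    for example a generator that trains for one epoch and then evaluates.
    Iteration stops once `patience` epochs pass without improvement.
    """
    best_accuracy, best_epoch = float("-inf"), -1
    for epoch, accuracy in enumerate(dev_accuracies):
        if accuracy > best_accuracy:
            best_accuracy, best_epoch = accuracy, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, best_accuracy


# Hypothetical accuracy curve; a small patience is used to keep the demo short.
curve = [0.50, 0.60, 0.55, 0.58, 0.59, 0.70]
print(early_stopping_epoch(curve, patience=3))
print(early_stopping_epoch(curve, patience=20))
```

With a short patience the late improvement in the toy curve is never reached, which illustrates why a relatively generous patience such as 20 epochs can matter.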

Evaluation Metrics
Besides traditional evaluation metrics such as label accuracy and F1, we use two other metrics to evaluate our model.
FEVER score. The FEVER score is the label accuracy conditioned on providing at least one complete set of evidence. Claims labeled as "NEI" do not need the evidence.
OFEVER score. The document retrieval and sentence selection components are usually evaluated by the oracle FEVER (OFEVER) score, which is the upper bound of the FEVER score obtained by assuming perfect downstream systems.
For all of the experiments with GEAR, the scores (label accuracy, FEVER score) we report on the dev set are mean values over 10 runs initialized with different random seeds.
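The relation between label accuracy, FEVER score and OFEVER score can be illustrated with the sketch below. It is a simplified reading of the definitions above, not the official scorer: the data layout, the function names and the (page, line) evidence identifiers in the example are assumptions.

```python
NEI = "NOT ENOUGH INFO"


def has_complete_evidence(pred_evidence, gold_label, gold_evidence_sets):
    """True if the claim is NEI or at least one full gold evidence set was retrieved."""
    if gold_label == NEI:
        return True
    predicted = set(pred_evidence)
    return any(set(group) <= predicted for group in gold_evidence_sets)


def label_accuracy(instances):
    return sum(i["pred_label"] == i["gold_label"] for i in instances) / len(instances)


def fever_score(instances):
    """Label must be correct and the evidence complete (not required for NEI)."""
    correct = sum(
        i["pred_label"] == i["gold_label"]
        and has_complete_evidence(i["pred_evidence"], i["gold_label"], i["gold_evidence_sets"])
        for i in instances)
    return correct / len(instances)


def oracle_fever_score(instances):
    """Upper bound on the FEVER score: assume every label is predicted correctly."""
    complete = sum(
        has_complete_evidence(i["pred_evidence"], i["gold_label"], i["gold_evidence_sets"])
        for i in instances)
    return complete / len(instances)


# Hypothetical instances; evidence is identified by (page, line) pairs.
riots, county = ("1992_Los_Angeles_riots", 0), ("Los_Angeles_County", 0)
giada, food = ("Giada_at_Home", 0), ("Food_Network", 0)
instances = [
    {"gold_label": "SUPPORTED", "pred_label": "SUPPORTED",
     "pred_evidence": [riots, county], "gold_evidence_sets": [[riots, county]]},
    {"gold_label": "REFUTED", "pred_label": "REFUTED",
     "pred_evidence": [giada], "gold_evidence_sets": [[giada, food]]},
    {"gold_label": NEI, "pred_label": NEI,
     "pred_evidence": [], "gold_evidence_sets": []},
    {"gold_label": "SUPPORTED", "pred_label": "REFUTED",
     "pred_evidence": [riots, county], "gold_evidence_sets": [[riots, county]]},
]
print(label_accuracy(instances), fever_score(instances), oracle_fever_score(instances))
```

The second instance shows the gap between the metrics: its label is correct, but the incomplete evidence set means it counts towards label accuracy and not towards the FEVER score.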

Experimental Results and Analysis
In this section, we first present the evaluations of the document retrieval and sentence selection components. Then we evaluate our GEAR framework in several different aspects. Finally, we present a case study to demonstrate the effectiveness of our framework.

Document Retrieval and Sentence Selection
We use the OFEVER metric to evaluate the document retrieval component. Table 3 shows the OFEVER scores of our model and models from other teams. After running the same model proposed by Hanselowski et al. (2018), we find that our OFEVER score is slightly lower, which may be due to random factors.
Then we compare our sentence selection component with different thresholds, as shown in Table 4. We find the model with threshold 0 achieves the highest recall and OFEVER score. When the threshold increases, the recall value and the OFEVER score drop gradually while the precision and F1 score increase. The results are consistent with our intuition. If we do not filter out evidence, more claims could be provided with the full evidence set. If we increase the value of the threshold, more pieces of noisy evidence are filtered out, which contributes to the increase of precision and F1.

Claim Verification with GEAR
In this section, we evaluate our GEAR framework in different aspects. We first compare the label accuracy scores between our framework and baseline systems. Then we explore the effect of different thresholds from the upstream sentence filter. We also conduct additional experiments to check the effect of sentence embedding. As nearly 39% of claims require reasoning over multiple pieces of evidence, we construct a difficult dev subset and check the effectiveness of our ERNet for evidence reasoning. Finally, we make an error analysis and provide the theoretical upper-bound label accuracy of our framework.

Model Evaluation
We use the label accuracy metric to evaluate the effectiveness of different claim verification models. The second column of Table 7 shows the label accuracy of different models on the dev set. We find that the BERT fine-tuning models outperform all of the models from the shared task, which shows the strong capacity of BERT in representation learning and semantic understanding. The BERT-Concat model achieves a slight improvement of 0.37% over BERT-Pair.
Our final model outperforms the best BERT-Concat baseline by 1.17%. As our framework provides a better way for evidence aggregating and reasoning, the improvement demonstrates that our framework has a better ability to integrate features from different evidence by propagating, analyzing and aggregating the features.

Effect of Sentence Thresholds
The rightmost column of Table 4 shows the results of our GEAR framework with different sentence selection thresholds. We choose the model with threshold $\tau = 10^{-3}$, which has the highest label accuracy, as our final model. When the threshold increases from 0 to $10^{-3}$, the label accuracy increases due to less noisy information. However, when the threshold increases from $10^{-3}$ to $10^{-1}$, the label accuracy decreases because informative evidence is filtered out, and the model cannot obtain sufficient evidence to make the right inference.

Effect of Sentence Embedding
The BERT model we use in the sentence encoding step is fine-tuned on the FEVER dataset for one epoch. We need to find out whether the fine-tuning process or simply incorporating the sentence embeddings from BERT makes the major contribution to the final result. We conduct an experiment using a BERT model without the fine-tuning process, and we find that the final dev label accuracy is close to the result of a random guess. Therefore, the fine-tuning process rather than the sentence embeddings plays an important role in this task. We need the fine-tuning process to capture the semantic and logical relations between evidence and the claim. Sentence embeddings are more general and cannot perform well in this specific task. Hence, we cannot simply use sentence embeddings from other methods (e.g., ELMo, CNN) to replace the sentence embeddings we use here.

Effectiveness of ERNet
In our observation, more than half of the claims in the dev dataset need only one piece of evidence to make the right inference. To verify the effectiveness of our framework on reasoning over multiple pieces of evidence, we build a difficult dev subset. We test our final model on the difficult subset and present the results in Table 5. We find that our models with ERNet perform better than models without ERNet, and the minimal improvement between them is 1.27%. We can also see from the table that models with 2 ERNet layers achieve the best results, which indicates that claims from the difficult subset require multi-step evidence propagation. This result demonstrates the ability of our framework to deal with claims which need multiple pieces of evidence.

Error Analysis
In this section, we examine the effect of errors propagating from upstream components. We utilize an evidence-enhanced dev subset, which assumes all pieces of ground truth evidence are retrieved, to test the theoretical upper-bound score of our GEAR framework.
In our analysis, the main errors of our framework come from the upstream document retrieval and sentence selection components, which cannot extract sufficient evidence for inference. For example, to verify the claim "Giada at Home was only available on DVD", we need the evidence "Giada at Home is a television show and first aired on October 18, 2008, on the Food Network." and "Food Network is an American basic cable and satellite television channel.". However, the entity linking method used in our document retrieval component could not retrieve the "Food Network" document only from parsing the content of the claim. Thus the claim verification component cannot make the right inference with insufficient evidence.

Table 8: A case of the claim that requires integrating multiple evidence to verify. The representation for evidence "{DocName, LineNum}" means the evidence is extracted from line LineNum of the document "DocName".

Claim: Al Jardine is an American rhythm guitarist.
(1) He is best known as the band's rhythm guitarist, and for occasionally singing lead vocals on singles such as "Help Me, Rhonda" (1965), "Then I Kissed Her" (1965) and "Come Go with Me" (1978).
(2) Alan Charles Jardine (born September 3, 1942) is an American musician, singer and songwriter who co-founded the Beach Boys.
(3) In 2010, Jardine released his debut solo studio album, A Postcard from California.
(4) In 1988, Jardine was inducted into the Rock and Roll Hall of Fame as a member of the Beach Boys.
(5) Ray Jardine American rock climber, lightweight backpacker, inventor, author and global adventurer.
Label: SUPPORTED

To explore the effect of this issue, we test our models on an evidence-enhanced dev set, in which we add the ground truth evidence with relevance score 1 into the evidence set before the sentence threshold filter. It ensures that each claim in the evidence-enhanced set is provided with the ground truth evidence as well as the retrieved evidence.
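The construction of the evidence-enhanced set can be sketched as follows: ground truth sentences are inserted with relevance score 1 and the usual threshold filter is applied afterwards. The function name, the data layout, the example scores and the (page, line) identifiers are assumptions made for illustration; no cap on the number of sentences is applied here because the text does not state one for this setting.

```python
def enhance_evidence(retrieved, gold_sentences, tau=1e-3):
    """Add ground truth evidence with score 1.0, then apply the threshold filter.

    retrieved: list of (sentence_id, relevance_score) pairs from sentence selection.
    gold_sentences: list of ground truth sentence ids for the claim.
    """
    scores = dict(retrieved)
    for sentence_id in gold_sentences:
        scores[sentence_id] = 1.0  # gold evidence always gets the top score
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(sentence_id, score) for sentence_id, score in ranked if score >= tau]


# Hypothetical retrieval result for the "Giada at Home" claim, which misses
# the "Food Network" sentence needed to refute it.
retrieved = [(("Giada_at_Home", 0), 0.95), (("Giada_De_Laurentiis", 3), 0.0004)]
gold = [("Giada_at_Home", 0), ("Food_Network", 0)]
for sentence_id, score in enhance_evidence(retrieved, gold):
    print(sentence_id, score)
```

In this toy case the missing "Food Network" sentence is restored and the low-scoring noisy sentence is still removed by the filter, which mirrors the oracle-retrieval assumption of the experiment.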
The experimental results are shown in Table 6. We can find that all scores in the table increase by more than 1.4% compared to the original dev set label accuracy in Table 7 because of the addition of the ground truth evidence. Because of the assumption of oracle upstream components, the results in Table 6 indicate the theoretical upper bound label accuracy of our framework.
The results show the challenges in the previous evidence retrieval task, which could not be solved by current models. Nie et al. (2019) propose a two-hop evidence enhancement method which improves their final FEVER score by 0.08%. As the addition of the ground truth evidence leads to an increase of more than 1.4% in our experiment, it is worthwhile to design a better evidence retrieval pipeline, which remains our future research.

Figure 2: Attention map for the case in Table 8. The first five rows indicate the attention weights from nodes 1 to 5 in the first ERNet layer and the last row shows the attention weights from the attention aggregator.

Full Pipeline
We present the evaluation of our full pipeline in this section. Note that there is a gap between the label accuracy and the final FEVER score due to the completeness of the evidence set. We find that a model which is good at predicting NEI instances tends to obtain a higher FEVER score. We therefore choose our final model based on the dev FEVER score among all of our experiments. This model contains one layer of ERNet and uses the attention aggregator. The threshold of the sentence filter is $10^{-3}$.
Table 7 presents the evaluations of the full pipeline. We find that the test FEVER scores of the BERT fine-tuning systems exceed those of the other shared task models by nearly 1%. Furthermore, our full pipeline outperforms the BERT-Concat baseline by 1.46% and achieves significant improvements.

Case Study
Table 8 shows an example in our experiments which needs multiple pieces of evidence to make the right inference. The ground truth evidence set contains the sentences from the article "Al Jardine" with line numbers 0 and 1. These two pieces of evidence are also ranked in the top two of our retrieved evidence set. To verify whether "Al Jardine is an American rhythm guitarist", our model needs the evidence "He is best known as the band's rhythm guitarist" as well as the evidence "Alan Charles Jardine ... is an American musician". We plot the attention map from our final model with one layer of ERNet and the attention aggregator in Figure 2. We find that all evidence nodes tend to attend to the first and the second evidence nodes, which provide the most useful information in this case. The attention weights on the other evidence nodes are low, which indicates that our model has the ability to select useful information from multiple pieces of evidence.

Conclusion
We propose a novel Graph-based Evidence Aggregating and Reasoning (GEAR) framework for the claim verification subtask of FEVER. The framework utilizes the BERT sentence encoder, the evidence reasoning network (ERNet) and an evidence aggregator to encode, propagate and aggregate information from multiple pieces of evidence. Our experiments show that the framework is effective, and our final pipeline achieves significant improvements. In the future, we would like to design a multi-step evidence extractor and incorporate external knowledge into our framework.