Reasoning Over Semantic-Level Graph for Fact Checking

Fact checking is a challenging task because verifying the truthfulness of a claim requires reasoning about multiple pieces of retrievable evidence. In this work, we present a method suitable for reasoning about the semantic-level structure of evidence. Unlike most previous works, which typically represent evidence sentences either by string concatenation or by fusing the features of isolated evidence sentences, our approach operates on rich semantic structures of evidence obtained by semantic role labeling. We propose two mechanisms to exploit the structure of evidence while leveraging the advances of pre-trained models like BERT, GPT or XLNet. Specifically, using XLNet as the backbone, we first utilize the graph structure to re-define the relative distances of words, with the intuition that semantically related words should have short distances. Then, we adopt a graph convolutional network and a graph attention network to propagate and aggregate information from neighboring nodes on the graph. We evaluate our system on FEVER, a benchmark dataset for fact checking, and find that rich structural information is helpful and that both of our graph-based mechanisms improve the accuracy. Our model is the state-of-the-art system in terms of both official evaluation metrics, namely claim verification accuracy and FEVER score.


Introduction
The Internet provides an efficient way for individuals and organizations to quickly spread information to massive audiences. However, some of this information is true while some is false. Malicious people spread false news, which may have significant influence on public opinion, stock prices, and even presidential elections (Faris et al. 2017). Some research shows that false news reaches more people than the truth (Vosoughi, Roy, and Aral 2018). In this paper, we study fact checking, a fundamental task in natural language processing whose goal is to automatically assess the truthfulness of a textual claim by looking for textual evidence.
* Work is done during internship at Microsoft Research Asia.
Existing approaches are dominated by natural language inference models (Dagan et al. 2013; Angeli and Manning 2014) because the task requires reasoning between the claim and the evidence. In most cases, a claim is a sentence while the evidence contains multiple sentences. Existing studies typically concatenate evidence sentences into a single string, which is the approach used in the top-ranked system during the official FEVER challenge (Thorne et al. 2018b), or add a feature fusion layer on top of evidence features to further aggregate information from evidence sentences (Zhou et al. 2019). However, both methods ignore the important relational structure of evidence sentences at the semantic level, including the participants, location, and temporality of events. Let us take the example in Figure 1. Making the correct prediction for this claim requires a model to understand that the "Rodney King riots" occurred in "Los Angeles County" from the first evidence, and that "Los Angeles County" is "the most populous county in the USA" from the second evidence. Simply concatenating evidence sentences as a single string would put a large distance between relevant pieces of information from different evidence sentences. Feature fusion aggregates all the information in an implicit way, which makes it hard to reason over structural information. Therefore, mining adequate and concise information from pieces of evidence is useful for verifying the claim.
Figure 1
Claim: The Rodney King riots took place in the most populous county in the USA.
Evidence #1: The 1992 Los Angeles riots, also known as the Rodney King riots were a series of riots, lootings, arsons, and civil disturbances that occurred in Los Angeles County, California in April and May 1992.
Evidence #2: Los Angeles County, officially the County of Los Angeles, is the most populous county in the USA.

To address the aforementioned issues, we present a graph-based reasoning approach for fact checking. We represent evidence sentences as a graph, where nodes are extracted by SRL (Semantic Role Labelling) (Shi and Lin 2019). Nodes belonging to the same predicate-argument structure are fully connected. We further use a string-similarity-based measurement to connect nodes of certain types (e.g. arguments, location and temporal) which are extracted from different sentences. After obtaining the constructed graph, we present a graph-based model with XLNet as the backbone. We present a graph-based contextual word representation learning module and a graph-based reasoning module to leverage the graph information. In the first module, we use graph information to redefine the distance between words and produce a contextual word embedding for each word in both claim and evidence sentences. In the graph-based reasoning module, we take the contextual word representations as input, use a graph neural network to calculate graph-level representations, and finally make the prediction over two graphs.
We conduct experiments on FEVER (Thorne et al. 2018a), short for Fact Extraction and VERification, which is one of the most influential benchmark datasets for fact checking. In FEVER, evidence comes from Wikipedia. Given a claim as the input, the goal is to look for textual evidence and predict whether the evidence supports or refutes the claim, or whether there is not enough information to make the decision. Experiments show that both graph-based modules improve the performance. At the time of paper submission, our system (DREAM on the official leaderboard 1) achieves state-of-the-art claim verification accuracy and FEVER score. This paper makes the following contributions:
• We propose a graph-based reasoning approach for fact checking. We use SRL to construct graphs and propose two novel graph-based modules for graph-based representation learning and graph-based reasoning.
• Results verify that both proposed modules bring improvements, and our final system achieves state-of-the-art performance.

Task Definition
FEVER (Fact Extraction and VERification) is a shared task proposed by Thorne et al. (2018b), in which systems are required to assess the veracity of given claims with integrated information from multiple pieces of evidence. Evidence in this task needs to be retrieved from all the documents from Wikipedia. Specifically, with a given claim, the system is asked to search potential sentence-level evidence and state the claim as "SUPPORTED", "REFUTED" or "NOT ENOUGH INFO (NEI)", which indicate that the claim is supported or refuted by the given evidence or is not verifiable. As the example in Figure 1 shows, verification of a claim requires the ability to aggregate pieces of information from multiple pieces of evidence and to reason over them. In FEVER, there are two official evaluation metrics. The first one is accuracy for the three-way classification (SUPPORTED/REFUTED/NEI), which is also the main focus of this work because it directly shows the verification performance of our graph-based reasoning approach. For comparison with existing studies, we also report results in terms of the second metric, i.e. FEVER score, which additionally measures whether the retrieved evidence is correct for the "SUPPORTED" and "REFUTED" categories.
1 https://competitions.codalab.org/competitions/18814#results

Pipeline
In this section, we present an overview of our pipeline. At a high level, our pipeline consists of three main components: a document retrieval model, a sentence-level evidence selection model, and a claim verification model. Figure 2 gives an overview of our pipeline, called the Dynamic REAsoning Machine (DREAM). Given a claim as input, the document retrieval model retrieves the top k related documents from a dump of Wikipedia. With the retrieved documents, the sentence-level evidence selection model aims to select the top 5 relevant sentences as the predicted evidence. Finally, the claim verification model performs reasoning over the claim and the predicted evidence, and states the veracity of the claim. We propose our reasoning framework in the claim verification model. In this section, we briefly introduce our strategies for the first two models. The main contribution of this work is the graph-based reasoning approach we propose in the claim verification model, which we detail in the next section.

Document Retrieval Model
The document retrieval model takes a claim and a dump of Wikipedia as input, and returns k most relevant documents. We mainly follow the UNC-NLP (Nie, Chen, and Bansal 2019), which is the top-performing system on the competition hosted for FEVER shared task (Thorne et al. 2018b).
The document retrieval model first adopts a keyword matching mechanism (Nie, Chen, and Bansal 2019) to filter candidate documents from the large-scale Wikipedia. Since a notable proportion (10%) of document titles carry disambiguation information (e.g. "Vedam (film)"), which is hard to identify with literal matching, we further apply the NSMN model (Nie, Chen, and Bansal 2019) to perform semantic matching between claims and candidate documents with disambiguation titles. For a document with a disambiguation title, the normalized matching score $m^+$ for claim $c_i$ and document $d_j$ is computed by NSMN from the claim and the concatenation of the title and the first sentence of document $d_j$, where $\langle m^-, m^+ \rangle$ denotes the output normalized probability. Documents without a disambiguation title are assigned the highest matching score, and documents with a disambiguation title are assigned the calculated matching score $m^+$. These documents are then ranked and added to the resulting list. Finally, our system selects the top 10 documents from the resulting list as the searched documents.

Sentence-Level Evidence Selection Model
The evidence selection model selects the top 5 potential evidence sentences by ranking all the candidate sentences from the documents retrieved by the document retrieval model.
The evidence selector is required to conduct semantic matching between a claim and each evidence candidate. We employ pre-trained models like RoBERTa (Liu et al. 2019) and XLNet as the sentence encoder. In our experiments, we use RoBERTa because it performs better. The input of our sentence encoder is
$ce_i = [\mathrm{Claim}, \mathrm{SEP}, \mathrm{Evidence}_i, \mathrm{SEP}, \mathrm{CLS}]$
where Claim and Evidence$_i$ indicate the tokenized word-pieces of the original claim and the $i$-th evidence candidate. SEP and CLS are symbols indicating the ending of a sentence and the ending of the whole input, respectively. The final representation $h_{ce_i} \in \mathbb{R}^d$ is obtained by extracting the hidden vector of the [CLS] token, where $d$ denotes the dimension of the hidden vector.
Then we employ an MLP layer and a softmax layer to compute the score $s^+_{ce_i}$ for each evidence candidate:
$\langle s^-_{ce_i}, s^+_{ce_i} \rangle = \mathrm{softmax}(W h_{ce_i} + b)$
where $W \in \mathbb{R}^{2 \times d}$ is a weight matrix and $b$ denotes the bias vector. Afterwards, we rank all the evidence sentences by the score $s^+_{ce_i}$ and select the top 5 potential evidence sentences.
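The scoring and ranking step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the [CLS] vectors are small hand-written stand-ins rather than real encoder output, the function names `score_candidates` and `select_top_k` are hypothetical, and treating index 1 of the softmax output as the positive score is an assumed convention.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def score_candidates(cls_vectors, W, b):
    """Return the positive score softmax(W h + b)[1] for every candidate vector h."""
    scores = []
    for h in cls_vectors:
        logits = [sum(w * x for w, x in zip(row, h)) + bias
                  for row, bias in zip(W, b)]
        scores.append(softmax(logits)[1])  # index 1 = "relevant" (assumed convention)
    return scores

def select_top_k(scores, k=5):
    """Indices of the k highest-scoring candidates, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Toy stand-ins for [CLS] vectors with d = 3 (not real encoder output).
cls_vectors = [
    [0.1, 0.2, 0.3], [1.0, -1.0, 0.5], [0.0, 0.0, 0.0],
    [2.0, 0.1, -0.3], [-1.0, 0.4, 0.2], [0.5, 0.5, 0.5],
]
W = [[0.2, -0.1, 0.0], [0.5, 0.3, -0.2]]  # shape (2, d)
b = [0.0, 0.1]

scores = score_candidates(cls_vectors, W, b)
top5 = select_top_k(scores)
```

In a real system the vectors would come from the fine-tuned sentence encoder, and `W` and `b` would be learned jointly with it.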

Claim Verification Model
In this section, we introduce our claim verification model, which is the main contribution of this work. The task requires the ability to aggregate pieces of information from multiple pieces of evidence and to reason over them to reach a conclusion. Such information across multiple evidence sentences has intrinsic structure, including both intra-sentence structure, such as the argument, location, and temporality of an event, and inter-sentence structure, such as the same mention of an argument in two sentences. Instead of simply concatenating evidence sentences into a single string, we propose to reason over a semantic-level graph for claim classification.
Our approach contains three modules: (1) a graph construction module, which constructs two semantic graphs for evidence and claim separately, (2) a graph-based contextual word representation learning module, which takes the constructed graph as the input to learn a graph-enhanced contextual representation for each word in the input, and (3) a graph reasoning module, which takes the outputs from the previous two modules to conduct graph-level representation learning and reasoning, and makes the prediction. Details of each module are described below.

Graph Construction
We first introduce the common notation about graph networks that will be used throughout the paper. Then we will introduce the details of graph construction.
A graph network framework is a relational learning framework built on a graph structure. A graph is denoted as G and consists of a set of nodes and a set of edges. In our setting, Evidence refers to the concatenation of the top 5 evidence sentences. We use the same method to construct graphs for the evidence sentences and the claim. Below we take the evidence as the example to describe the graph construction procedure. Given the evidence or the claim, our graph construction module operates in the following steps.
• Tuples (sets of argument nodes) are extracted via SRL toolkits. SRL is performed to identify arguments and their roles in a sentence. The nodes of the sub-graph formed by the same tuple are fully connected by inner-tuple edges.
• We add an inter-tuple edge between each pair of nodes from different tuples if they potentially mention the same entity. We first employ NER (Named Entity Recognition) (Peters et al. 2018) toolkits to extract the entities mentioned in the content of nodes. Assuming entity A and entity B come from different tuples, we add one inter-tuple edge if one of the following rules is satisfied: (1) A is equal to B; (2) A contains B; (3) the number of overlapping words between A and B is larger than half of the minimum number of words in A and B.
Figure 3 shows an example of the constructed graph.
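The two construction steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the SRL tuples and NER entities are hard-coded as hypothetical inputs instead of being produced by real toolkits, matching is lower-cased and word-based, and rule (2) is checked in both directions because node pairs are unordered; all three choices are assumptions.

```python
from itertools import combinations

def _contains(a, b):
    """True if the words of b occur as a contiguous phrase in a."""
    return f" {b} " in f" {a} "

def same_entity(a, b):
    """Apply the three string rules to two entity mentions."""
    a, b = a.lower(), b.lower()
    if a == b:                                   # rule (1): equal
        return True
    if _contains(a, b) or _contains(b, a):       # rule (2): containment
        return True
    words_a, words_b = a.split(), b.split()
    overlap = len(set(words_a) & set(words_b))   # rule (3): word overlap
    return overlap > min(len(words_a), len(words_b)) / 2

def build_graph(tuples):
    """tuples: list of SRL tuples, each a list of {"text", "entities"} nodes.

    Returns (nodes, edges); edges is a set of (i, j, kind) with i < j and
    kind in {"inner", "inter"}.
    """
    nodes, owner = [], []
    for tuple_id, srl_tuple in enumerate(tuples):
        for node in srl_tuple:
            nodes.append(node)
            owner.append(tuple_id)
    edges = set()
    for i, j in combinations(range(len(nodes)), 2):
        if owner[i] == owner[j]:
            edges.add((i, j, "inner"))           # fully connect each tuple
        elif any(same_entity(a, b)
                 for a in nodes[i]["entities"]
                 for b in nodes[j]["entities"]):
            edges.add((i, j, "inter"))           # shared entity across tuples
    return nodes, edges

# Hand-written stand-in for SRL + NER output on the Figure 1 evidence.
tuples = [
    [{"text": "the Rodney King riots", "entities": ["Rodney King riots"]},
     {"text": "occurred", "entities": []},
     {"text": "in Los Angeles County, California",
      "entities": ["Los Angeles County", "California"]}],
    [{"text": "Los Angeles County", "entities": ["Los Angeles County"]},
     {"text": "is", "entities": []},
     {"text": "the most populous county in the USA", "entities": ["USA"]}],
]
nodes, edges = build_graph(tuples)
inter_edges = {e for e in edges if e[2] == "inter"}
```

Here the only inter-tuple edge links the two "Los Angeles County" nodes, which is the connection the claim in Figure 1 depends on.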

Contextual Word Representation with Graph-Based Distance
Traditional reasoning approaches usually concatenate the pieces of evidence in a sequential way and feed them into a pre-trained model (e.g., XLNet) to learn the contextual word representation. Since pre-trained models adopt the absolute distance of two words in the input sequence, some closely linked nodes in the constructed graph are far away from each other in that sequence. To better model the structural information in the extracted graph, we present an algorithm to re-calculate the distance between each pair of nodes in the text by introducing the distance of two nodes in the constructed graph.
However, the whole distance matrix would take huge memory space and calculation time, considering that each word in the extracted graph has a distinct distance vector and each element in the vector is mapped into an embedding vector. Assuming the length of the input is 512 and the dimension of each distance embedding is 1,024, the distance tensor contains roughly 268 million elements (512 × 512 × 1,024) for one sample, making it impractical to implement the whole distance matrix. To address this problem, we present a trade-off approach that uses a topological sort algorithm to sort words in the extracted graph. First, we use topological sort to order the nodes in the constructed graph so as to shorten the distance between two closely linked nodes. Second, we feed the sorted sequence into XLNet to get the relative position of words. Furthermore, topological sort can ensure that the nodes preceding a given node are either its parent nodes or its sibling nodes. This characteristic helps the model to learn the dependencies in the extracted graph.
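The memory estimate above follows from a one-line calculation. The short sketch below reproduces it; the 4 bytes per element (32-bit floats) used for the byte count is an assumption, not a figure from the text.

```python
seq_len = 512          # input length
dist_dim = 1024        # dimension of each distance embedding

# One embedding vector per ordered pair of positions.
elements = seq_len * seq_len * dist_dim

bytes_per_float = 4    # assumption: 32-bit floats
gib = elements * bytes_per_float / 2**30

print(elements)        # 268435456
print(gib)             # 1.0
```

Under that assumption, the full distance tensor would need about 1 GiB per sample before any gradients or activations are stored.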
The details of the topological sort algorithm are shown in Algorithm 1. The algorithm begins from nodes without incident relations. For each such node, we recursively visit its child nodes in a depth-first manner.
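As a rough illustration of the ordering described above, the sketch below performs a pre-order depth-first traversal that starts from nodes with no incoming relation. It is not a transcription of Algorithm 1: relations are assumed to be directed (parent, child) pairs, the function name `graph_order` is hypothetical, and on graphs where a node has several parents a pre-order traversal does not guarantee a strict topological order.

```python
def graph_order(nodes, relations):
    """Order nodes by depth-first traversal from nodes with no incoming relation.

    nodes: list of hashable node ids, in original text order
    relations: list of (parent, child) pairs (directed edges; an assumption)
    """
    children = {n: [] for n in nodes}
    has_parent = set()
    for parent, child in relations:
        children[parent].append(child)
        has_parent.add(child)

    visited, ordered = set(), []

    def dfs(node):
        if node in visited:
            return
        visited.add(node)
        ordered.append(node)          # parent is emitted before its children
        for child in children[node]:
            dfs(child)

    for node in nodes:                # start from nodes without a parent
        if node not in has_parent:
            dfs(node)
    for node in nodes:                # leftovers, reachable only through cycles
        dfs(node)
    return ordered

# Toy forest: a -> b, a -> c, c -> d, and e -> f, listed in scrambled text order.
nodes = ["d", "b", "a", "f", "c", "e"]
relations = [("a", "b"), ("a", "c"), ("c", "d"), ("e", "f")]
ordered = graph_order(nodes, relations)
position = {n: i for i, n in enumerate(ordered)}
```

In this toy input the linked nodes "c" and "d" are four positions apart in the original text order but adjacent in the new sequence, which is the effect the reordering aims for.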
In this way, we obtain the graph-guided distances between words, which will be used as the input to the XLNet model. Then, XLNet maps the input $x$ of length $T$ into a sequence of hidden vectors $h(x) = [h(x)_1, h(x)_2, \dots, h(x)_T]$.

Graph-Based Reasoning Network
Taking the graphs and graph-based distance matrices as input, we first initialize node representation based on the contextual word representation. Afterwards, we update the graph by propagating the information from neighboring nodes. Finally, after obtaining graph-level representations for claim-based and evidence-based graphs, we make the alignment between two graphs and make the final prediction.

Algorithm 1 Graph-based Distance Calculation Algorithm.
Require: A sequence of nodes S = {s1, s2, · · · , sn}; A set of relations R = {r1, r2, · · · , rm}
1: function DFS(node, visited, sorted sequence)
2: for each child sc in node's children

Each node in the graph is a word span in the input text. The initial representation of each node is the average of the hidden vectors at the corresponding positions. Afterwards, the representation is updated by the graph learning module.
Graph Representation Learning
In this part, we present the graph learning module, which is designed to update the representation of nodes by aggregating information from their neighbors. To capture multi-hop relational information, we employ multi-layer graph convolutional networks (GCNs). Our intuition for using GCNs is to utilize their ability to automatically aggregate information through edges.
Here we describe the GCNs. Formally, we denote $G$ as the graph constructed by the previous graph construction method and let $H \in \mathbb{R}^{N_v \times d}$ be a matrix containing the representations of all nodes, where $N_v$ and $d$ denote the number of nodes and the dimension of the node representation, respectively. Each row $H_i \in \mathbb{R}^d$ is the representation of node $i$. We introduce an adjacency matrix $A$ of graph $G$ and its degree matrix $D$, where we add self-loops to matrix $A$ and $D_{ii} = \sum_j A_{ij}$.
Specifically, a one-layer GCN aggregates information through one-hop edges. We describe it as follows:
$H^{(1)}_i = \rho\Big(\sum_{m=1}^{N_v} \tilde{A}_{im} W_0 H^{(0)}_m\Big)$
where $H^{(1)}_i \in \mathbb{R}^d$ is the new $d$-dimensional representation of node $i$, $\tilde{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ is the normalized symmetric adjacency matrix, $W_0$ is a weight matrix, and $\rho$ is an activation function. To exploit information from multi-hop neighboring nodes, we stack multiple GCN layers:
$H^{(j+1)}_i = \rho\Big(\sum_{m=1}^{N_v} \tilde{A}_{im} W_j H^{(j)}_m\Big)$
where $j$ denotes the layer number and $H^{(0)}_i$ is the initial representation of node $i$, initialized from the contextual representation. We simplify $H^{(k)}$ as $H$ for later use, where $H$ indicates the representation of all nodes updated by the $k$-layer GCNs. The graph learning mechanism is performed separately for the claim-based and evidence-based graphs. Therefore, we denote $H_c$ and $H_e$ as the representations of all nodes in the claim-based graph and the evidence-based graph, respectively. Afterwards, we utilize the graph reasoning module to align the graph-level node representations learned for the two graphs and make the final prediction.
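To make the propagation rule concrete, here is a small stdlib-only sketch of symmetric normalization and a GCN layer. It is a sketch under assumptions: ReLU stands in for the unspecified activation ρ, node vectors are stored as rows so the weight matrix is applied on the right, the input adjacency matrix is assumed to have no self-loops yet, and the helper names are hypothetical.

```python
import math

def normalize_adjacency(A):
    """Add self-loops to A, then return D^{-1/2} A D^{-1/2}."""
    n = len(A)
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in A_hat]            # D_ii = sum_j A_ij
    return [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(X, Y):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def relu(X):
    return [[max(0.0, v) for v in row] for row in X]

def gcn_layer(A_norm, H, W):
    """One layer: relu(A_norm @ H @ W), rows of H are node vectors."""
    return relu(matmul(matmul(A_norm, H), W))

# Path graph 0 - 1 - 2 with one-hot node features and identity weights,
# so each output row shows exactly which nodes contributed to it.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

A_norm = normalize_adjacency(A)
H1 = gcn_layer(A_norm, identity, identity)   # one hop
H2 = gcn_layer(A_norm, H1, identity)         # two hops
```

After one layer, node 0 carries nothing from node 2 (`H1[0][2]` is zero); after the second layer it does, which is why stacking layers captures multi-hop information.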
Graph Reasoning
We need to explore the related information between the two graphs and make a semantic alignment for the final prediction.
Formally, let $H_e \in \mathbb{R}^{N_v^e \times d}$ and $H_c \in \mathbb{R}^{N_v^c \times d}$ denote matrices containing the representations of all nodes in the evidence-based and claim-based graphs respectively, where $N_v^e$ and $N_v^c$ denote the number of nodes in the corresponding graph.
We first employ a graph attention mechanism (Veličković et al. 2017) to generate a claim-specific evidence representation for each node in the claim-based graph. Specifically, we first take each $h_c^i \in H_c$ as the query, and take all node representations $h_e^j \in H_e$ as keys. We then perform graph attention on the nodes, using an attention mechanism $a : \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$ to compute the attention coefficient as follows:
$e_{ij} = a(W_c h_c^i, W_e h_e^j)$
which indicates the importance of evidence node $j$ to claim node $i$. $W_c \in \mathbb{R}^{F \times d}$ and $W_e \in \mathbb{R}^{F \times d}$ are weight matrices and $F$ is the dimension of the attention feature. We use the dot-product function as $a$ here. We then normalize $e_{ij}$ using the softmax function:
$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k=1}^{N_v^e} \exp(e_{ik})}$
Afterwards, we calculate the claim-centric evidence representation $X = [x_1, \dots, x_{N_v^c}]$ using the weighted sum over $H_e$:
$x_i = \sum_{j=1}^{N_v^e} \alpha_{ij} h_e^j$
We then perform node-to-node alignment and calculate the aligned vectors $A = [a_1, \dots, a_{N_v^c}]$ from the claim node representation $H_c$ and the claim-centric evidence representation $X$:
$a_i = f_{align}(h_c^i, x_i)$
where $f_{align}()$ denotes the alignment function. Inspired by Shen et al. (2018), we design our alignment function as:
$f_{align}(x, y) = W_a [x, y, x - y, x \odot y]$
where $W_a \in \mathbb{R}^{d \times 4d}$ is a weight matrix and $\odot$ is the element-wise Hadamard product. The final output $g$ is obtained by mean pooling over $A$. We then feed the concatenation of $g$ and the final hidden vector $h(x)_T$ from XLNet through an MLP layer for the final prediction.
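The attention, alignment and pooling steps can be sketched with plain Python lists. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the four-way feature vector (both inputs, their difference and their element-wise product) is an assumed form chosen to match the stated 4d input width of the alignment weight matrix, and the toy weights are picked by hand so the result is easy to inspect.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, v):
    return [dot(row, v) for row in W]

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def claim_centric_evidence(H_c, H_e, W_c, W_e):
    """For each claim node, attend over all evidence nodes (dot-product
    attention on projected vectors) and return the weighted sum of H_e rows."""
    keys = [matvec(W_e, h_e) for h_e in H_e]
    X = []
    for h_c in H_c:
        query = matvec(W_c, h_c)
        alpha = softmax([dot(query, k) for k in keys])
        X.append([sum(a * h_e[t] for a, h_e in zip(alpha, H_e))
                  for t in range(len(H_e[0]))])
    return X

def align(h, x, W_a):
    """W_a applied to [h, x, h - x, h * x]; W_a has shape (d, 4d)."""
    feats = (h + x
             + [a - b for a, b in zip(h, x)]
             + [a * b for a, b in zip(h, x)])
    return matvec(W_a, feats)

def mean_pool(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

# Toy graphs with d = F = 2: two claim nodes, three evidence nodes.
H_c = [[10.0, 0.0], [0.0, 1.0]]
H_e = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]
W_a = [[0, 0, 0, 0, 1, 0, 0, 0],    # picks (h - x)[0]
       [0, 0, 0, 0, 0, 0, 0, 1]]    # picks (h * x)[1]

X = claim_centric_evidence(H_c, H_e, eye, eye)
aligned = [align(h, x, W_a) for h, x in zip(H_c, X)]
g = mean_pool(aligned)
```

The first claim node puts almost all of its attention on the first evidence node, so `X[0]` is close to that node's vector; the second claim node spreads its attention more evenly.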

Experiments
We conduct experiments on FEVER (Thorne et al. 2018a), a benchmark dataset for fact extraction and verification. The two official evaluation metrics of FEVER are label accuracy and FEVER score. Label accuracy is the primary evaluation metric we apply for our experiments because it directly represents the performance of the claim verification model. We also report FEVER score, which measures whether both the predicted label and the retrieved evidence are correct. FEVER score is calculated with equation 13, where $y$ is the ground truth label, $\hat{y}$ is the predicted label, $E = [E_1, \cdots, E_k]$ is a set of ground-truth evidence, and $\hat{E} = [\hat{E}_1, \cdots, \hat{E}_5]$ is a set of predicted evidence.
No evidence is required if the predicted label is NEI.
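For readers unfamiliar with the metric, the sketch below shows one way to compute a FEVER-style score. It assumes the usual reading of the metric: an instance counts as correct when the label matches and, unless the label is NEI, at least one complete ground-truth evidence set is contained in the (at most 5) predicted evidence sentences. The function names and the (page, sentence index) encoding of evidence are hypothetical, and this is not the official scorer.

```python
NEI = "NOT ENOUGH INFO"

def instance_correct(y, y_hat, gold_sets, predicted, max_evidence=5):
    """Label must match; for non-NEI labels, some complete gold evidence set
    must be covered by the first max_evidence predicted sentences."""
    if y != y_hat:
        return False
    if y == NEI:
        return True                       # no evidence required for NEI
    top = set(predicted[:max_evidence])
    return any(set(gold) <= top for gold in gold_sets)

def fever_score(instances):
    """Fraction of instances that are fully correct."""
    return sum(instance_correct(*inst) for inst in instances) / len(instances)

def label_accuracy(instances):
    """Fraction of instances whose predicted label matches the gold label."""
    return sum(y == y_hat for y, y_hat, _, _ in instances) / len(instances)

# Each instance: (gold label, predicted label, gold evidence sets, predicted evidence).
instances = [
    ("SUPPORTED", "SUPPORTED",
     [[("Los_Angeles_riots", 0), ("Los_Angeles_County", 0)]],
     [("Los_Angeles_County", 0), ("Los_Angeles_riots", 0), ("Other_page", 3)]),
    ("REFUTED", "REFUTED",
     [[("Page_A", 1)]],
     [("Page_B", 2)]),                    # right label, wrong evidence
    (NEI, NEI, [], []),
]

score = fever_score(instances)
accuracy = label_accuracy(instances)
```

On this toy input all three labels are right, but the second instance lacks its gold evidence, so label accuracy is 1.0 while the FEVER-style score is 2/3.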

Baselines
We first select three top-performing systems on the FEVER shared task as the baselines.
• UNC-NLP (Nie, Chen, and Bansal 2019) employs a semantic matching neural network for both evidence selection and claim verification. It also employs additional features (e.g., WordNet features) and symbolic rules (e.g., keyword matching).
• Zhou et al. (2019) use BERT (Devlin et al. 2018) to generate claim-specific evidence representation and apply a graph network to compute evidence-wise node representation for the final prediction.
Table 2 reports the overall performance of our model on the blind test set, with the scores shown on the public leaderboard 2. Compared with the XLNet baseline, incorporating both graph-based modules brings a 3.76% improvement in label accuracy. Removing the graph-based distance drops label accuracy by 0.81%. The graph-based distance mechanism can shorten the distance between two closely linked nodes and help the model to learn their dependency. Removing the graph-based reasoning module drops label accuracy by 2.04%, because the graph reasoning module captures the structural information and performs deep reasoning over it.

Document Retrieval Results
We evaluate the performance of our document retrieval module using the recall metric, which is defined as the proportion of ground-truth documents that are successfully retrieved. Table 4 reports the results of an efficient system (first row), which is built purely on keywords from the claim and the titles of all documents with Elastic Search 3, and of its combination with a neural network based model. The recall of the symbolic system is good, yet it can be improved by the neural model. In real applications, this is a trade-off between efficiency and performance.

Evidence Selection Results
In this part, we present the performance of the sentence-level evidence selection module that we develop with different backbones. We take the concatenation of the claim and each evidence sentence as input, and take the last hidden vector to calculate the score for evidence ranking. Results from Table 5

Error Analysis
We randomly select 200 incorrectly predicted instances and summarize the primary types of errors.
The first type of error is caused by failing to match the semantic meaning between phrases that describe the same event. For example, the claim states "Winter's Tale is a book." while the evidence states "Winter's Tale is a 1983 novel by Mark Helprin.". The model fails to realize that "novel" belongs to "book" and states that the claim is refuted. Solving this type of error requires external knowledge (e.g. ConceptNet (Speer, Chin, and Havasi 2017)) that can indicate logical relationships between different events.
Misleading information in the retrieved evidence causes the second type of error. For example, the claim states "The Gifted is a movie", and the ground-truth evidence states "The Gifted is an upcoming American television series". However, the retrieved evidence also contains "The Gifted is a 2014 Filipino dark comedy-drama movie.", which misleads the model into making the wrong judgement.

Related Work
In general, fact checking involves assessing the truthfulness of a claim. In the literature, a claim can be a text or a subject-predicate-object triple (Nakashole and Mitchell 2014). In this work, we only consider textual claims. Existing datasets differ in the data source and the type of supporting evidence for verifying the claim. An early work by Vlachos and Riedel (2014) constructs 221 labeled claims in the political domain from POLITIFACT.COM and CHANNEL4.COM, giving meta-data of the speaker as the evidence. POLITIFACT is further investigated by following works, including Ferreira and Vlachos (2016), who build Emergent with 300 labeled rumors and about 2.6K news articles, Wang (2017), who builds LIAR with 12.8K annotated short statements and six fine-grained labels, and Rashkin et al. (2017), who collect claims without meta-data while providing 74K news articles. We study FEVER (Thorne et al. 2018a), which requires aggregating information from multiple pieces of evidence from Wikipedia to reach the conclusion. FEVER contains 185,445 annotated instances, which to the best of our knowledge makes it the largest benchmark dataset in this area. We plan to study fact checking with adversarial attacks (Thorne and Vlachos 2019; Schuster et al. 2019) in the future.
The majority of participating teams in the FEVER challenge (Thorne et al. 2018b) use the same pipeline consisting of three components, namely document selection, evidence sentence selection, and claim verification. In the document selection phase, participants typically extract named entities from a claim as the query and use the Wikipedia search API. In the evidence selection phase, participants measure the similarity between the claim and an evidence sentence candidate by training a classification model like Enhanced LSTM (Chen et al. 2016) in a supervised setting or by using a string similarity function like TFIDF without trainable parameters. In this work, our focus is the claim classification phase. The three top-ranked systems aggregate pieces of evidence by concatenating evidence sentences into a single string (Nie, Chen, and Bansal 2019), by classifying each evidence-claim pair separately and merging the results (Yoneda et al. 2018), and by encoding each evidence-claim pair followed by a pooling operation (Hanselowski et al. 2018). A recent work by Zhou et al. (2019) is the first to use BERT to calculate claim-specific evidence sentence representations, and then develops a graph network to aggregate the information on top of BERT, regarding each evidence as a node in the graph. Our work differs from Zhou et al. (2019) in (1) that the construction of our graph requires understanding the syntax of each sentence, which could be viewed as a more fine-grained graph, and (2) that both the contextual representation learning module and the reasoning module have model innovations that take the graph information into consideration. Instead of training each component separately, Yin and Roth (2018) show that joint learning could improve both claim verification and evidence identification.

Conclusion
In this work, we present a graph-based approach for fact checking. When assessing the veracity of a claim given multiple evidence sentences, our approach does not conduct text-based reasoning at the word or sentence level. Instead, our approach is built upon an automatically constructed graph, which is derived from semantic role labeling. To better exploit the graph information, we propose two graph-based modules, one for calculating contextual word embeddings using graph-based distance in XLNet, and another for learning representations of graph components and reasoning over the graph. Experiments show that both graph-based modules bring improvements and that our final system is the state of the art on the public leaderboard at the time of paper submission. In the future, we plan to leverage external background knowledge about the claim and evidence to improve the model's reasoning ability.