DeFactoNLP: Fact Verification using Entity Recognition, TFIDF Vector Comparison and Decomposable Attention

In this paper, we describe DeFactoNLP, the system we designed for the FEVER 2018 Shared Task. The aim of this task was to conceive a system that can not only automatically assess the veracity of a claim but also retrieve evidence supporting this assessment from Wikipedia. In our approach, the Wikipedia documents whose Term Frequency-Inverse Document Frequency (TFIDF) vectors are most similar to the vector of the claim and those documents whose names are similar to those of the named entities (NEs) mentioned in the claim are identified as the documents which might contain evidence. The sentences in these documents are then supplied to a textual entailment recognition module. This module calculates the probability of each sentence supporting the claim, contradicting the claim or not providing any relevant information to assess the veracity of the claim. Various features computed using these probabilities are finally used by a Random Forest classifier to determine the overall truthfulness of the claim. The sentences which support this classification are returned as evidence. Our approach achieved a 0.4277 evidence F1-score, a 0.5136 label accuracy and a 0.3833 FEVER score.


Introduction
Given the current trend of massive fake news propagation on social media, the world is desperately in need of automated fact-checking systems. Automatically determining the authenticity of a fact is a challenging task that requires the collection and assimilation of a large amount of information. To perform the task, a system must find relevant documents, detect and label evidence, and finally output a score which represents the truthfulness of the given claim. The numerous design challenges associated with such systems are discussed by Thorne and Vlachos (2018) and Esteves et al. (2018).
The Fact Extraction and Verification (FEVER) dataset (Thorne et al., 2018) is the first publicly available large-scale dataset designed to facilitate the training and testing of automated fact verification systems. The FEVER 2018 Shared Task required us to design such systems using this dataset. The organizers provided a preprocessed version of the June 2017 Wikipedia dump in which each page was reduced to its introductory section. Given a claim, we were asked to build systems which could determine whether the dump contains sentences supporting the claim (labelled "SUPPORTS") or sentences refuting it (labelled "REFUTES"). If conclusive evidence either supporting or refuting the claim could not be found in the dump, the system should report this (labelled "NOT ENOUGH INFO"). If conclusive evidence was found, the system should also retrieve the sentences which support or refute the claim.

System Architecture
Our approach has four main steps: Relevant Document Retrieval, Relevant Sentence Retrieval, Textual Entailment Recognition, and Final Scoring and Classification. Given a claim, Named Entity Recognition (NER) and TFIDF vector comparison are first used to retrieve the relevant documents and sentences, as delineated in Section 2.1. The relevant sentences are then supplied to the textual entailment recognition module (Section 2.2), which returns a set of probabilities. Finally, a Random Forest classifier (Breiman, 2001) assigns a label to the claim using features derived from these probabilities, as detailed in Section 2.3. The proposed architecture is depicted in Figure 1.

Retrieval of Relevant Documents and Sentences
We used two methods to identify which Wikipedia documents may contain relevant evidence. Information about the NEs mentioned in a claim can be helpful in determining the claim's veracity. In order to obtain the Wikipedia documents which describe them, the first method initially uses the Conditional Random Fields-based Stanford NER software (Finkel et al., 2005) to recognize the NEs mentioned in the claim. Then, for every recognized NE, it finds the document whose name has the least Levenshtein distance (Levenshtein, 1966) to that of the NE. Hence, we obtain a set of documents which contain information about the NEs mentioned in the claim. Since any of the sentences in such documents might aid the verification, they are all returned as possible evidence.
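As an illustration of this entity-based retrieval step, the following sketch (not the authors' code) matches each recognized NE against a list of Wikipedia page titles by edit distance; the names `page_titles` and `documents_for_entities` are purely illustrative assumptions.

```python
# Illustrative sketch: match each recognized named entity to the Wikipedia
# page whose title has the smallest Levenshtein distance, and treat all
# sentences of the matched pages as candidate evidence.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def documents_for_entities(named_entities, page_titles):
    """For every NE (e.g. from the Stanford NER tagger), pick the page whose
    title has the least edit distance to the entity mention."""
    matched = set()
    for entity in named_entities:
        best = min(page_titles, key=lambda t: levenshtein(entity.lower(), t.lower()))
        matched.add(best)
    return matched
```

In practice the page titles would be indexed rather than scanned linearly, but a linear scan suffices to convey the idea.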
The second method used to retrieve candidate evidence is identical to that used in the baseline system (Thorne et al., 2018) and is based on the rationale that sentences which contain terms similar to those present in the claim are likely to help the verification process. Directly evaluating all of the sentences in the dump is computationally expensive. Hence, the system first retrieves the five most similar documents based on the cosine similarity between binned unigram and bigram TFIDF vectors of the documents and the claim, using the DrQA system (Chen et al., 2017). Of all the sentences present in these documents, the five most similar sentences, based on the cosine similarity between the binned bigram TFIDF vectors of the sentences and the claim, are finally chosen as possible sources of evidence. The number of documents and sentences chosen is based on the analysis presented in the aforementioned work by Thorne et al. (2018).
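The sketch below approximates this TFIDF retrieval step with scikit-learn instead of DrQA's hashed bigram index, so it illustrates the idea rather than the system's actual retrieval code; `top_k_by_tfidf` and its parameters are assumed names.

```python
# Approximation of the TFIDF retrieval step with scikit-learn; the real system
# uses DrQA's hashed unigram/bigram TFIDF index, so this is illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_k_by_tfidf(claim, texts, k=5, ngram_range=(1, 2)):
    """Return the k texts whose TFIDF vectors are most similar to the claim."""
    vectorizer = TfidfVectorizer(ngram_range=ngram_range)
    matrix = vectorizer.fit_transform(texts + [claim])
    similarities = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    top = similarities.argsort()[::-1][:k]
    return [texts[i] for i in top]

# Document level first (unigrams + bigrams), then sentence level (bigrams):
# top_documents = top_k_by_tfidf(claim, all_documents, k=5)
# candidate_sentences = top_k_by_tfidf(claim, sentences_of(top_documents), k=5,
#                                      ngram_range=(2, 2))
```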
The sets of sentences returned by the two methods are combined and fed to the textual entailment recognition module described in Section 2.2.

Textual Entailment Recognition Module
Recognizing Textual Entailment (RTE) is the process of determining whether a text fragment (the Hypothesis H) can be inferred from another fragment (the Text T) (Sammons et al., 2012). The RTE module receives the claim and the set of possible evidential sentences from the previous step. Let there be n possible sources of evidence for verifying a claim. For the i-th possible evidence, let s_i denote the probability of it entailing the claim, r_i the probability of it contradicting the claim, and u_i the probability of it being uninformative. The RTE module calculates each of these probabilities.
The SNLI corpus (Bowman et al., 2015) is used for training the RTE model. This corpus is composed of sentence pairs (T, H), where T corresponds to the literal description of an image and H is a manually created sentence. If H can be inferred from T, the "Entailment" label is assigned to the pair. If H contradicts the information in T, the pair is labelled as "Contradiction". Otherwise, the label "Neutral" is assigned.
We chose to employ the state-of-the-art RTE model proposed by Peters et al. (2018), which is a re-implementation of the widely used decomposable attention model developed by Parikh et al. (2016). The model achieves an accuracy of 86.4% on the SNLI test set. We selected it because, at the time of development of this work, it was one of the best performing systems on the task with publicly available code. Additionally, it does not require any preprocessing or parsing tools and is faster to train than the other approaches we tried.
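A hypothetical usage sketch of querying such a pretrained model through AllenNLP's Predictor interface is shown below; the model archive path is a placeholder, and the output key "label_probs" and the label ordering are assumptions that may differ between AllenNLP versions.

```python
# Hypothetical sketch of obtaining (s_i, r_i, u_i) for a claim-sentence pair
# with an AllenNLP textual entailment predictor.  The archive path is a
# placeholder; the output key and the assumed label ordering
# (entailment, contradiction, neutral) may vary between AllenNLP versions.
from allennlp.predictors.predictor import Predictor

predictor = Predictor.from_path("path/to/decomposable-attention-model.tar.gz")

def entailment_probabilities(evidence_sentence, claim):
    output = predictor.predict(premise=evidence_sentence, hypothesis=claim)
    s_i, r_i, u_i = output["label_probs"]   # support, refute, uninformative
    return s_i, r_i, u_i
```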
Although the model achieved good scores on the SNLI dataset, we noticed that it does not generalize well when employed to predict the relationships between the candidate claim-evidence pairs present in the FEVER data. In order to improve the generalization capabilities of the RTE model, we decided to fine-tune it (Pratt and Jennings, 1996) using a newly synthesized FEVER SNLI-style dataset. This was accomplished in two steps: the RTE model was initially trained using the SNLI dataset and then re-trained using the FEVER SNLI-style dataset.
The FEVER SNLI-style dataset was created using the information present in the FEVER dataset while retaining the format of the SNLI dataset. Let each learning instance in the FEVER dataset be of the form (c, l, E), where c is the claim, l ∈ {SUPPORTS, REFUTES, NOT ENOUGH INFO} is the label and E is the set of evidence. While constructing the FEVER SNLI-style dataset, we only considered the learning instances labeled as "SUPPORTS" or "REFUTES" because these were the instances that provided us with evidence. Given such an instance, we proceeded as follows: for each evidence e ∈ E, we created an SNLI-style example (c, e) labeled as "Entailment" if l = "SUPPORTS" or "Contradiction" if l = "REFUTES". If e contained more than one sentence, we made a simplifying assumption and only considered the first sentence of e. For each "Entailment" or "Contradiction" example added to this dataset, a "Neutral" learning instance of the form (c, n) was also created, where n is a randomly selected sentence from the same document from which e was retrieved. We also ensured that n was not included in any of the other evidence in E. Following this procedure, we obtain examples that are similar (retrieved from the same document) but should be labeled differently.
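The sketch below illustrates this conversion procedure under a few assumptions about the data layout (each evidence is given as a list of (page id, sentence) pairs and `sentences_of_document` maps a page id to its sentences); it is not the authors' preprocessing code.

```python
import random

LABEL_MAP = {"SUPPORTS": "entailment", "REFUTES": "contradiction"}

def to_snli_style(fever_instances, sentences_of_document, seed=0):
    """Convert FEVER instances (c, l, E) into SNLI-style premise/hypothesis
    pairs as described above.  `fever_instances` is assumed to yield
    (claim, label, evidence_sets) triples, where each evidence is a list of
    (page_id, sentence) pairs."""
    random.seed(seed)
    examples = []
    for claim, label, evidence_sets in fever_instances:
        if label not in LABEL_MAP:                      # skip NOT ENOUGH INFO
            continue
        evidence_text = {s for ev in evidence_sets for _, s in ev}
        for evidence in evidence_sets:
            page_id, sentence = evidence[0]             # first sentence only
            examples.append({"premise": sentence, "hypothesis": claim,
                             "label": LABEL_MAP[label]})
            # Neutral counterpart: a sentence from the same page that is not
            # part of any evidence for this claim.
            candidates = [s for s in sentences_of_document(page_id)
                          if s not in evidence_text]
            if candidates:
                examples.append({"premise": random.choice(candidates),
                                 "hypothesis": claim, "label": "neutral"})
    return examples
```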

Thus, we obtained a dataset with the characteristics depicted in Table 1. To correct the unbalanced nature of the dataset, we performed random undersampling (He and Garcia, 2009). The fine-tuning had a huge positive impact on the generalization capabilities of the model, as shown in Table 2. Using the fine-tuned model, the aforementioned set of probabilities is finally computed.
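A minimal sketch of the random undersampling step is given below, assuming the SNLI-style examples are dictionaries with a "label" field; the exact balancing procedure used by the authors is not specified beyond the citation.

```python
import random
from collections import defaultdict

def undersample(examples, label_key="label", seed=0):
    """Randomly drop examples from the majority classes so that every class
    is reduced to the size of the smallest class."""
    random.seed(seed)
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[label_key]].append(example)
    target = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(random.sample(group, target))
    random.shuffle(balanced)
    return balanced
```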

Final Classification
Twelve features were derived from the probabilities computed by the RTE module. Each of the possible evidential sentences supports one of the three labels more strongly than the others; this can be determined from the computed probabilities and is captured by the indicator variables cs_i, cr_i and cu_i. The most obvious way to label a claim would be to assign it the label with the highest support. Hence, we use the features f_1, f_2 and f_3, which represent the number of possible evidential sentences supporting each label. The amount of support lent to a certain label by its supporting sentences could also be useful in performing the labelling. This motivated the features f_4, f_5 and f_6, which quantify the total amount of support for each label. If a single sentence strongly supports a label, it might be prudent to assign that label to the claim. Hence, we use the features f_7, f_8 and f_9, which capture how strongly any single sentence supports each label. Finally, we use the features f_10, f_11 and f_12 because the average strength of the support lent by supporting sentences to a given label could also help the classifier.
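The exact formulas for cs_i, cr_i, cu_i and f_1 to f_12 are not reproduced here, so the following sketch implements one consistent reading of the description above (per-label counts, sums, maxima and averages of the probabilities, gated by which label each sentence supports most); treat it as an assumption rather than the authors' definitions.

```python
def claim_features(probs):
    """probs: list of (s_i, r_i, u_i) triples, one per candidate sentence.
    Returns twelve features: per-label counts, total support, strongest
    single-sentence support, and average support per supporting sentence."""
    cs = [1 if s >= max(r, u) else 0 for s, r, u in probs]
    cr = [1 if r >= max(s, u) else 0 for s, r, u in probs]
    cu = [1 if u >= max(s, r) else 0 for s, r, u in probs]
    s_sup = [s * c for (s, _, _), c in zip(probs, cs)]
    r_sup = [r * c for (_, r, _), c in zip(probs, cr)]
    u_sup = [u * c for (_, _, u), c in zip(probs, cu)]

    f1, f2, f3 = sum(cs), sum(cr), sum(cu)                    # sentence counts
    f4, f5, f6 = sum(s_sup), sum(r_sup), sum(u_sup)           # total support
    f7 = max(s_sup, default=0.0)                              # strongest support
    f8 = max(r_sup, default=0.0)
    f9 = max(u_sup, default=0.0)
    f10 = f4 / f1 if f1 else 0.0                              # average support
    f11 = f5 / f2 if f2 else 0.0
    f12 = f6 / f3 if f3 else 0.0
    return [f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11, f12]
```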
These features were used by a Random Forest classifier (Breiman, 2001) to determine the label to be assigned to the claim. The classifier was composed of 50 decision trees and the maximum depth of each tree was limited to 3. Information gain was used to measure the quality of a split. To train the classifier, 3,000 claims labelled as "SUPPORTS", 3,000 claims labelled as "REFUTES" and 4,000 claims labelled as "NOT ENOUGH INFO" were randomly sampled from the training set. Relevant sentences were then retrieved as detailed in Section 2.1 and supplied to the RTE module (Section 2.2). The probabilities calculated by this module were used to generate the aforementioned features. The classifier was then trained using these features and the actual labels of the claims.
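A configuration matching this description could look as follows in scikit-learn; the library choice and variable names are assumptions, and only the hyperparameters (50 trees, maximum depth 3, information gain) come from the text.

```python
from sklearn.ensemble import RandomForestClassifier

# Configuration matching the description above; scikit-learn is an
# assumption, as the authors do not name their implementation here.
classifier = RandomForestClassifier(
    n_estimators=50,      # 50 decision trees
    max_depth=3,          # maximum tree depth of 3
    criterion="entropy",  # information gain as the split quality measure
)

# X_train: one 12-dimensional feature vector per sampled training claim,
# y_train: the corresponding gold labels.
# classifier.fit(X_train, y_train)
# predicted_labels = classifier.predict(X_test)
```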
We used the trained classifier to label the claims in the test set. If the "SUPPORTS" label was assigned to a claim, the five sentences with the highest s_i × cs_i products were returned as evidence. However, if cs_i = 0 for all i, the label was changed to "NOT ENOUGH INFO" and an empty set was returned as evidence. A similar process was employed when the "REFUTES" label was assigned to a claim. If the "NOT ENOUGH INFO" label was assigned, an empty set was returned as evidence.
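The sketch below captures this evidence-selection rule under the same assumed definition of cs_i and cr_i as above; names and data structures are illustrative.

```python
def select_evidence(label, probs, sentences, k=5):
    """Apply the rule described above: keep the k sentences with the highest
    gated probabilities, or fall back to NOT ENOUGH INFO if no sentence
    actually backs the predicted label."""
    cs = [1 if s >= max(r, u) else 0 for s, r, u in probs]
    cr = [1 if r >= max(s, u) else 0 for s, r, u in probs]

    if label == "SUPPORTS":
        scores, gates = [s for s, _, _ in probs], cs
    elif label == "REFUTES":
        scores, gates = [r for _, r, _ in probs], cr
    else:
        return "NOT ENOUGH INFO", []

    if sum(gates) == 0:                      # no sentence backs the label
        return "NOT ENOUGH INFO", []
    ranked = sorted(range(len(sentences)),
                    key=lambda i: scores[i] * gates[i], reverse=True)
    return label, [sentences[i] for i in ranked[:k]]
```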

Results and Discussion
Our system was evaluated using a blind test set which contained 19,998 claims. Table 3 compares the performance of our system with that of the baseline system and also lists the best performance achieved for each metric. The evidence precision of our system was 0.5191 and its evidence recall was 0.3636. All of these results were obtained upon submitting our predictions to an online evaluator. DeFactoNLP had the 5th best evidence F1 score, the 11th best label accuracy and the 12th best FEVER score out of the 24 participating systems.
The results show that the evidence F1 score of our system is much better than that of the baseline system. However, the label accuracy of our system is only marginally better than that of the baseline, suggesting that our final classifier is not very reliable. The low label accuracy may have negatively affected the other scores. Our system's low evidence recall can be attributed to the primitive methods employed to retrieve the candidate documents and sentences. Additionally, the RTE module can only detect entailment within a single claim-sentence pair. Hence, claims which require more than one sentence to verify cannot be easily labelled by our system. This is another reason behind our low evidence recall, FEVER score and label accuracy. We aim to study more sophisticated ways to combine the information obtained from the RTE module in the near future.
To better assess the performance of the system, we performed a manual analysis of its predictions. We observed that for some simple claims (e.g., "Tilda Swinton is a vegan") which were labeled as "NOT ENOUGH INFO" in the gold standard, the sentence retrieval module found many sentences related to the NEs in the claim, but none of them had any useful information regarding the claim object (e.g., "vegan"). In some of these cases, the RTE module would label certain sentences as either supporting or refuting the claim even though they were not relevant to it. In the future, we aim to address this shortcoming by exploring triple extraction-based methods to weed out such sentences (Gerber et al., 2015).
We also noticed that the use of coreference in the Wikipedia articles was responsible for the system missing some evidence, as the RTE module could not accurately assess sentences which used coreference. Employing a coreference resolution system at the article level is a promising direction to address this problem. The incorporation of named entity disambiguation into the sentence and document retrieval modules could also boost performance. This is because we noticed that, in some cases, the system used information from unrelated Wikipedia pages whose names were similar to those of the NEs mentioned in a claim to incorrectly label it (e.g., a claim was related to the movie "Soul Food" but some of the retrieved evidence came from the Wikipedia page of the soundtrack "Soul Food").

Conclusion
In this work, we described our fact verification system, DeFactoNLP, which was designed for the FEVER 2018 Shared Task. When supplied with a claim, it makes use of NER and TFIDF vector comparison to retrieve candidate Wikipedia sentences which might help in the verification process. An RTE module and a Random Forest classifier are then used to determine the veracity of the claim based on the information present in these sentences. The proposed system achieved a 0.4277 evidence F1-score, a 0.5136 label accuracy and a 0.3833 FEVER score. After analyzing our results, we have identified several ways of improving the system. For instance, triple extraction-based methods can be used to improve the sentence retrieval component as well as the identification of evidential sentences. We also wish to explore more sophisticated methods of combining the information obtained from the RTE module and to employ entity linking methods to perform named entity disambiguation.

Figure 1: The main steps of our approach

Table 1: FEVER SNLI-style dataset split sizes for the ENTAILMENT, CONTRADICTION and NEUTRAL classes

Table 3: System performance