Generating Fact Checking Summaries for Web Claims

We present SUMO, a neural attention-based approach that learns to establish the correctness of textual claims based on evidence in the form of text documents (e.g., news articles or web documents). SUMO further generates an extractive summary by presenting a diversified set of sentences from the documents that explain its decision on the correctness of the textual claim. Prior approaches to fact checking and evidence extraction have relied on simple concatenation of claim and document word embeddings as input to claim-driven attention weight computation. This is done to extract salient words and sentences from the documents that help establish the correctness of the claim. However, this design of claim-driven attention fails to capture the contextual information in documents properly. We improve on the prior art by using improved claim- and title-guided hierarchical attention to model effective contextual cues. We show the efficacy of our approach on political, healthcare, and environmental datasets.


Introduction
Most of the information consumed by the world is in the form of digital news, blogs, and social media posts available on the Web. However, most of this information is written without supporting facts or evidence. Our ever-increasing reliance on information from the Web is becoming a severe problem as we base our personal decisions relating to politics, environment, and health on unverified information available online. For example, consider the following unverified claim on the Web: "Smoking may protect against COVID-19." A user attempting to verify the correctness of this claim will often take the following steps: issue keyword queries to search engines for the claim; go through the top reliable news articles; and finally make an informed decision based on the gathered information. Clearly, this approach is laborious, time-consuming, and error-prone. In this work, we present SUMO, a neural approach that assists the user in establishing the correctness of claims by automatically generating explainable summaries for fact checking. Example summaries generated by SUMO for a couple of Web claims are given in Figure 1.
Prior approaches to automatic fact checking rely on predicting the credibility of facts (Popat et al., 2017), instance detection (Ma et al., 2018; Xu et al., 2018), and fact entailment in supporting documents (Parikh et al., 2016). The majority of these methods rely on linguistic features (Popat et al., 2017; Potthast et al., 2018; Qazvinian et al., 2011), social contexts, or user responses (Ma et al., 2015) and comments. However, these approaches do not help explain the decisions generated by the machine learning models. Recent works such as (Atanasova et al., 2019; Mishra and Setty, 2019; Popat et al., 2018) overcome the explainability gap by extracting snippets from text documents that support or refute the claim. (Mishra and Setty, 2019; Popat et al., 2018) apply claim-based and latent aspect-based attention to model the context of text documents. (Mishra and Setty, 2019) model latent aspects such as the speaker or author of the claim, topic of the claim, and domains of retrieved Web documents for the claim. We observe in our experiments that the design of claim guided attention in these prior works (Mishra and Setty, 2019; Popat et al., 2018) is not effective and that latent aspects such as the topic and speaker of claims are not always available. The snippets extracted by such models are not comprehensive or topically diverse. To overcome these limitations, we propose a novel design of claim and document title driven attention, which better captures the contextual cues in relation to the claim. In addition to this, we propose an approach for generating summaries for fact-checking that are non-redundant and topically diverse.

Figure 1: Example summaries generated by SUMO for two Web claims.

Claim: Smoking may protect against COVID-19
The current evidence suggests that the severity of COVID is higher among smokers, prevent the health risk linked to the excessive consumption or misuse" of nicotine products by people hoping to protect themselves from COVID-19. Evidence from China, where COVID-19 originated, shows that people who have cardiovascular and respiratory conditions caused by tobacco use, or otherwise, are at higher risk of developing severe COVID-19 symptoms. WHO urges researchers, scientists and the media to be cautious about amplifying unproven claims that tobacco or nicotine could reduce the risk of COVID-19. Smoking is also associated with increased development of acute respiratory distress syndrome, a key complication for severe cases of COVID-19.

Claim: Deforestation has made humans more vulnerable to pandemics
Deforestation can directly increase the likelihood that a pathogen will be transferred from wildlife species to humans through the creation of suitable habitats for vector species. Climate change, including deforestation which drives it, is a key driver of cross-species transmission which is where zoonotic emerging diseases come from. There is a correlation between deforestation and the rise in the spread of infectious diseases affecting humans. Deforestation forces various species into smaller, shared habitats and increases encounters between wildlife and humans. Habitat destruction and fragmentation due to deforestation can also increase the frequency of contact between humans, wildlife species, and the pathogens they carry. This can occur through direct transfer of pathogens from animals to humans or indirectly through cross-species transfer of pathogens from wildlife to domesticated species. Deforestation could be to blame for the rise of infectious diseases like the novel coronavirus.
Contributions. Contributions made in this work are as follows. First, we introduce SUMO, a method that improves upon the previously used claim guided attention to model effective contextual representation. Second, we propose a novel attention on top of attention (Atop) method to improve the overall attention effectiveness. Third, we present an approach to generate topically diverse multi-document summaries, which help in explaining the decision SUMO makes for establishing the correctness of claims. Fourth, we provide a novel testbed for the task of fact checking in the domain of climate change and health care.
Outline. The outline for the rest of the article is as follows. In Section 2, we describe prior work in relation to our problem setting. In Section 3, we formalize the problem definition and describe our approach, SUMO, to generate explainable summaries for fact checking of textual claims. In Sections 4 and 5, we describe the experimental setup that includes a description of the novel datasets that we make available to the research community and an analysis of the results we have obtained. In Section 6, we present the concluding remarks of our study.

Related work
We now describe prior work related to our problem setting. First, we describe works that rely only on features derived from documents that support the input textual claim. Second, we describe works that additionally include features derived from social media posts in connection to the claim. Third and finally, we describe works that rely on extracting textual snippets from text documents to explain a model's decision on the claim's correctness.

Content Based Approaches
Prior approaches for fact checking vary from simple machine learning methods such as SVM and decision trees to highly sophisticated deep learning methods. These works largely utilize features that model the linguistic and stylistic content of the facts to learn a classifier (Castillo et al., 2011;Ma et al., 2016;Qazvinian et al., 2011;Rashkin et al., 2017). The key shortcomings of these approaches are as follows. First, classifiers trained on linguistic and stylistic features perform poorly as they can be misguided by the writing style of the false claims, which are deliberately made to look similar to true claims but are factually false. Second, these methods lack in terms of user response and social context pertaining to the claims, which is very helpful in establishing the correctness of facts.

Social Media Based Approaches
Works such as (Qian et al., 2018;Shu et al., 2019;Yang et al., 2019) overcome the issue of user feedback by using a combination of content-based and context-based features derived from related social media posts. Specifically, the features derived from social media include propagation patterns of claim related posts on social media and user responses in the form of replies, likes, sentiments, and shares. These methods outperform content-based methods significantly. In (Yang et al., 2019), the authors propose a probabilistic graphical model for causal mappings among the post's credibility, user's opinions, and user's credibility. In (Qian et al., 2018), the authors introduce a user response generator based on a deep neural network that leverages the user's past actions such as comments, replies, and posts to generate a synthetic response for new social media posts.

Model Explainability
Explaining a machine learning model's decision is becoming an important problem. This is because modern neural network based methods are increasingly being used as black-boxes. There exist few machine learning models for fact checking that explain this decision via summaries. Related works (Mishra and Setty, 2019;Popat et al., 2018) achieve significant improvement in establishing the credibility of textual claims by using external evidences from the Web. They additionally extract snippets from evidences that explain their model's decision. However, we find that the claim-driven attention design used in these methods is inadequate, and does not capture sufficient context of the documents in relation to the input claim. The snippets extracted by these methods are often redundant and lack topical diversity offered by Web evidences. In contrast, our method enhances the claim-driven attention mechanism and generates a topically diverse, coherent multi-document summary for explaining the correctness of claims.

SUMO
We now formally describe the task of fact checking and explain SUMO in detail. SUMO works in two stages. In the first stage, it predicts the correctness of the claim. In the second stage, it generates a topically diverse summary for the claim. As input, we are provided with a Web claim $c \in C$, where $C$ is a collection of Web claims, and a pseudo-relevant set of documents $D = \{d_1, d_2, \ldots, d_m\}$, where $m$ is the number of results retrieved for claim $c$. The documents $d \in D$ are retrieved from the Web as potential evidence, using claim $c$ as a query. Each retrieved document $d$ is accompanied by its title $t$ and text body $b_d$, i.e., $d = (t, b_d)$. We represent each document's body as a collection of $k$ sentences, $b_d = \{s_1, s_2, \ldots, s_k\}$, and each sentence as a collection of $l$ words $\{w_1, w_2, \ldots, w_l\}$ with $w_i \in W$, where $W$ is the overall word vocabulary of the corpus. By $k$ and $l$, we denote the maximum number of sentences in a document and the maximum number of words in a sentence, respectively. We use both word2vec and pre-trained GloVe embeddings to obtain the vector representations for each claim, title, and document body. The objective is to classify the claim as either true or false and automatically generate a topically diverse summary pieced together from $D$ for establishing the correctness of the claim.

Predicting Claim Correctness by Neural Attention
We now describe SUMO's neural architecture (see Figure 2) that helps in predicting the correctness of the input claim along with its pseudo-relevant set of documents. The model additionally learns weights for the words and sentences in the document's body that help ascertain the claim's correctness. First, we need to encode the pseudo-relevant documents that support a claim. To this end, as a sequence encoder, we use a Gated Recurrent Unit (GRU) to encode the document's body content. The claim and the document's title are not encoded using the sequence encoder; we explain the method to represent them in detail in the upcoming sections.
Claim-driven Hierarchical Attention. At the word level, claim-driven attention aims to attend to salient words that are significant and relevant to the content of the claim. Similarly, at the sentence level, we aim to attend to the salient sentences. Recent works have used claim guided attention to model the contextual representation of the documents retrieved from the Web. These approaches provide claim-guided attention by first concatenating the claim word embeddings with the document word embeddings and then applying a dense softmax layer, with weight matrix $W_a$ and bias $b_a$, to learn the attention weights $\alpha$. However, during experiments, we observe that applying claim-based attention in this way provides an inferior overall document representation. Therefore, we do not concatenate the claim and document embeddings before attention weight computation.
Each claim $c_i$ consists of at most $l$ words $\{w_1, w_2, \ldots, w_l\}$. We represent each claim $c_i$ as the summation of the embeddings of all the words contained in it:

$$Cl_i = \sum_{j=1}^{l} f(w_j),$$

where $f(w_j)$ is the word embedding of the $j$-th word of claim $c_i$. The claim representation $Cl_i$ and the hidden states $h_j$ from the GRU are used to compute the word-level claim-driven attention weight vector $\alpha^C_{j,i}$, where $W_{j,i}$ and $b_{j,i}$ are the weight matrix and bias, and $h_j = (h_{j,1}, h_{j,2}, \ldots, h_{j,l})$ represents the tuple of all the hidden states of the words contained in the $j$-th sentence. To compute the sentence-level claim-driven attention weights, we use the claim representation $Cl_i$ and the hidden states $h^S_j$ from the sentence-level GRU units, taken as concatenations of both forward and backward hidden states. Here $W_j$ and $b_j$ are the weight matrix and bias, $h^S_j$ is the combination of all hidden states from sentences, and $\alpha^C_j = (\alpha_{j,1}, \alpha_{j,2}, \ldots, \alpha_{j,k})$ is the sentence-level claim-driven attention weight vector for the $j$-th document.
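To make the word-level step concrete, the following is a minimal Python sketch rather than SUMO's actual layer. The summed claim representation follows the description above, but the scoring function is an assumption: here each hidden state is scored by its dot product with the claim representation, whereas the paper's layer uses a learned weight matrix and bias whose exact form is not shown in this text. The helper names (`sum_embeddings`, `claim_driven_weights`) and the toy two-dimensional vectors are hypothetical.

```python
import math

def sum_embeddings(word_vectors):
    """Claim (or title) representation: element-wise sum of its word embeddings."""
    dim = len(word_vectors[0])
    return [sum(vec[d] for vec in word_vectors) for d in range(dim)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def claim_driven_weights(claim_repr, hidden_states):
    """ASSUMPTION: score each word's hidden state by its dot product with the
    claim representation, then normalise the scores with a softmax."""
    scores = [sum(c * h for c, h in zip(claim_repr, state)) for state in hidden_states]
    return softmax(scores)

# Toy example: a two-word claim and a three-word sentence in a 2-d space.
claim_repr = sum_embeddings([[1.0, 0.0], [0.5, 0.5]])
hidden_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights = claim_driven_weights(claim_repr, hidden_states)
```

In this toy setting the claim and the hidden states share one dimensionality; a trained model would need a projection wherever the embedding size and the GRU output size differ.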
Title-driven Hierarchical Attention. The objective of using the document title is to guide the attention in capturing sections in the document that are more critical and relevant for the title. Articles convey multiple perspectives, often reflected in their titles. By title-driven attention, we attend to those words and sentences that are not covered by claim-driven attention. Title-driven attention at both the word and sentence levels is computed in a similar fashion to claim-driven attention. Each title $t_i$ consists of at most $l$ words $\{w_1, w_2, \ldots, w_l\}$. We represent each title $t_i$ as the summation of the embeddings of all the words contained in it:

$$T_i = \sum_{j=1}^{l} f(w_j).$$

The title-driven attention weights at the word and sentence levels are then computed in the same way as the claim-driven weights, with the title representation $T_i$ in place of the claim representation $Cl_i$.
Hierarchical Self-Attention. Self-attention is a simplistic form of attention. It tries to attend to salient words in a sequence of words and salient sentences in a collection of sentences based on the self context of that sequence or collection. In addition to claim-driven and title-driven attention, we apply self-attention to capture the unattended words and sentences that are not related to the claim or title directly but are very useful for classification and summarization. We denote the self-attention weight vectors at the word and sentence levels by $\alpha^{Sl}_{j,i}$ and $\alpha^{Sl}_j$, respectively.
Fusion of Attention Weights. We combine the attention weights from the three kinds of attention mechanisms, claim-driven, title-driven, and self-attention, at both the word and sentence levels. At the word level, we average the attention weight vectors $\alpha^C_{j,i}$, $\alpha^T_{j,i}$, and $\alpha^{Sl}_{j,i}$ from claim, title, and self-attention, and use the averaged weights to form $S_j$, the sentence representation after overall attention for the $j$-th sentence. At the sentence level, we likewise average the attention weight vectors $\alpha^C_j$, $\alpha^T_j$, and $\alpha^{Sl}_j$ from claim, title, and self-attention, and use them to form doc, the document representation after overall attention.
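A minimal Python sketch of fusing the three attention vectors by a simple average, the fusion that the paper contrasts with Atop. The step that turns fused weights into a sentence representation is an assumption here: it uses an attention-weighted sum of hidden states, as is common in hierarchical attention networks. The function names and toy numbers are hypothetical.

```python
def fuse_by_average(claim_w, title_w, self_w):
    """Average the claim-, title-, and self-attention weight of each position."""
    return [(c + t + s) / 3.0 for c, t, s in zip(claim_w, title_w, self_w)]

def weighted_sum(weights, hidden_states):
    """ASSUMPTION: the representation is the attention-weighted sum of hidden states."""
    dim = len(hidden_states[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]

# Toy example: three words, each with a 2-d hidden state.
claim_w = [0.6, 0.3, 0.1]
title_w = [0.2, 0.5, 0.3]
self_w = [0.1, 0.1, 0.8]
fused = fuse_by_average(claim_w, title_w, self_w)
sentence_repr = weighted_sum(fused, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Because each input vector sums to one, the averaged vector does too, so it can be used directly as a set of attention weights.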
Attention on top of Attention (Atop). Although fusing the three kinds of attention weights by averaging works well, we realize that we lose some context by averaging. To deal with this issue, we use a novel attention on top of attention (Atop) method. We concatenate the three kinds of attention weights into $\alpha_{con}$ at the word level and $\alpha^S_{con}$ at the sentence level. We then apply a tanh-activated dense layer as a scoring function, followed by a softmax layer, to compute an attention weight for each of the three kinds of attention. Here $\beta_w$ and $\beta_s$ are the learned attention weight vectors over the three kinds of attention at the word and sentence levels, respectively, and doc is the document representation formed after Atop attention.
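The Atop idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's exact formulation: the dense layer is taken to map the concatenated attention vectors to three scores, and the resulting softmax weights are used to mix the three attention vectors linearly. The parameters `W` and `b` stand in for learned weights and are hypothetical, as are the function names.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def atop_fuse(claim_w, title_w, self_w, W, b):
    """Score the three attention types with a tanh dense layer over their
    concatenation, softmax the scores into beta, and mix the vectors by beta.
    W is a 3 x (3 * length) matrix and b a list of 3 biases (hypothetical)."""
    concat = claim_w + title_w + self_w
    scores = [math.tanh(sum(w * x for w, x in zip(row, concat)) + bias)
              for row, bias in zip(W, b)]
    beta = softmax(scores)
    # ASSUMPTION: the fused weight is the beta-weighted mix of the three vectors.
    fused = [beta[0] * c + beta[1] * t + beta[2] * s
             for c, t, s in zip(claim_w, title_w, self_w)]
    return beta, fused

claim_w, title_w, self_w = [0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8]
zero_W = [[0.0] * 9 for _ in range(3)]
beta_uniform, fused_uniform = atop_fuse(claim_w, title_w, self_w, zero_W, [0.0, 0.0, 0.0])
beta_claim, fused_claim = atop_fuse(claim_w, title_w, self_w, zero_W, [2.0, 0.0, 0.0])
```

With all-zero parameters the sketch reduces to the simple average, which makes the averaging fusion a special case that training can move away from.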
Prediction and Optimization. We use the overall document representation doc in a softmax layer for classification, and compute the predicted label $\hat{y}$ from the output of this layer. To train the model, we use standard softmax cross-entropy with logits as the loss function.
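For reference, softmax cross-entropy computed directly from logits can be sketched in plain Python as below. The two-class logits are made-up values, and the function names are hypothetical; an implementation would normally rely on the deep learning framework's built-in loss.

```python
import math

def softmax_cross_entropy(logits, true_index):
    """Cross-entropy of softmax(logits) against a one-hot label,
    computed with the log-sum-exp trick for numerical stability."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[true_index]

def predict(logits):
    """Predicted label: the index of the largest logit (same argmax as the softmax)."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Index 0 = 'false', index 1 = 'true' (an arbitrary convention for this sketch).
loss_uninformed = softmax_cross_entropy([0.0, 0.0], 1)
loss_confident = softmax_cross_entropy([-2.0, 3.0], 1)
label = predict([-2.0, 3.0])
```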

Generating Explainable Summary
Recent works retrieve documents from the Web as external evidence to support or refute the claims and thereafter extract snippets as explanations of the model's decision (Mishra and Setty, 2019; Popat et al., 2018). However, the snippets extracted by these methods are often redundant and lack topical diversity. The objective of our summarization algorithm is to provide a ranked list of sentences that are novel, non-redundant, and diverse across the topics identified from the text of the documents. In this section, we outline the method we utilize for achieving this objective.
Multi-topic Sentence Model. Each sentence in a document that is retrieved for the claim is modeled as a collection of topics: $s = \{a^{(1)}, a^{(2)}, \ldots, a^{(k)}\}$. Let $A$ be the set of topics $a_i \in A$ across all candidate sentences from the pseudo-relevant set of documents $D$ for the claim.
Objective. We formulate the summarization task as a diversification objective. Given a set of relevant sentences $R$, which are attended by Atop attention in SUMO while establishing the claim's correctness, we have to find the smallest subset of sentences $S \subseteq R$ such that all topics $a_i \in A$ are covered. This is a variation of the Set Cover problem (Agrawal et al., 2009; Korte and Vygen, 2002; Vazirani, 2001; Williamson and Shmoys, 2011; Johnson, 1974; Lovász, 1975; Chvátal, 1979). However, unlike IA-Select (Agrawal et al., 2009), we do not utilize the Max Coverage variation of the Set Cover problem. Instead, we formulate it as Set Cover itself (Korte and Vygen, 2002; Vazirani, 2001). That is, given a set of topics $A$, find a minimal set of sentences $S \subseteq R$ that covers those topics (Vazirani, 2001). Additionally, the inclusion of each sentence $s$ in the subset $S$ has a cost associated with it. This cost is computed from $\theta_s$, the topic distribution score for sentence $s$ obtained using a topic model (e.g., Latent Dirichlet Allocation (Blei et al., 2003)); $W_{wa} = \frac{1}{l}\sum_{i=1}^{l} W_{wa}(i)$, the average of the attention weights of the words contained in sentence $s$; $W_{sa}$, the attention weight of sentence $s$; and $\lambda$, a parameter to be tuned. We briefly describe our adaptation of the Greedy algorithm, which provides an approximate solution to the Set Cover problem, based on the discussion in (Korte and Vygen, 2002; Vazirani, 2001; Williamson and Shmoys, 2011; Johnson, 1974; Lovász, 1975; Chvátal, 1979).
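The textbook greedy approximation for weighted Set Cover, applied to sentences and topics, can be sketched in Python as follows. This is the standard algorithm under stated assumptions, not necessarily SUMO's exact adaptation: each sentence is given a positive cost where lower is better, and the rule picks the sentence with the lowest cost per newly covered topic. How SUMO combines the topic score, the attention weights, and λ into that cost is not reproduced here, so costs are supplied directly. The sentence identifiers, topics, and costs are hypothetical.

```python
def greedy_topic_cover(topics, sentence_topics, sentence_cost):
    """Greedy weighted Set Cover: repeatedly pick the sentence with the
    lowest cost per newly covered topic until every topic is covered."""
    covered, selected = set(), []
    while covered != topics:
        best, best_ratio = None, None
        for sentence in sorted(sentence_topics):  # sorted for deterministic ties
            newly = (sentence_topics[sentence] & topics) - covered
            if not newly:
                continue  # this sentence adds no uncovered topic
            ratio = sentence_cost[sentence] / len(newly)
            if best_ratio is None or ratio < best_ratio:
                best, best_ratio = sentence, ratio
        if best is None:
            break  # the remaining topics are not covered by any sentence
        selected.append(best)
        covered |= sentence_topics[best] & topics
    return selected

topics = {"health", "policy", "science"}
sentence_topics = {
    "s1": {"health", "science"},
    "s2": {"policy"},
    "s3": {"health"},
    "s4": {"health", "policy", "science"},
}
sentence_cost = {"s1": 0.6, "s2": 0.5, "s3": 0.4, "s4": 3.0}
summary = greedy_topic_cover(topics, sentence_topics, sentence_cost)
```

In this toy input the expensive sentence `s4` covers every topic on its own, yet the greedy rule prefers the two cheaper sentences that jointly cover the same topics.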

Evaluation
Datasets. We use two publicly available datasets, namely the PolitiFact political claims dataset and the Snopes political claims dataset (Popat et al., 2018), for evaluating SUMO's capability for fact checking. Dataset statistics for both datasets are shown in Table 1. In the case of PolitiFact, claims have one of the following labels: 'true', 'mostly true', 'half true', 'mostly false', 'false', and 'pants-on-fire'. We convert the 'true', 'mostly true', and 'half true' labels to 'true' and the rest of them to 'false'. We test SUMO only on the PolitiFact and Snopes datasets for the task of fact checking, as they are magnitudes larger than the new datasets that we release. The climate change dataset contains claims broadly related to climate change and global warming from climatefeedback.org. We use each claim as a query to the Google API to search the Web and retrieve external evidence in the form of search results. Similarly, we create a dataset related to health care that additionally contains claims pertaining to the current global COVID-19 pandemic from healthfeedback.org. Examples of claims from these two datasets are shown in Figure 3. We make the new datasets publicly available to the research community at the following URL: https://github.com/rahulOmishra/SUMO/.

Algorithm 1: Adaptation of the approximate Greedy algorithm for the Set Cover problem from (Korte and Vygen, 2002; Vazirani, 2001; Williamson and Shmoys, 2011; Johnson, 1974; Lovász, 1975; Chvátal, 1979) to our topical diversification problem setting. At each iteration, a sentence is chosen that covers the largest number of topics, as reflected by the topic distribution score, and has the highest attention weights. As output, we are assured a non-redundant, novel, and diversified set of sentences.

Input: A: set of topics learned from the topic model for diversification.
       R: set of sentences attended by Atop.
Output: S ⊆ R: diversified set of sentences over A.
S ← ∅    // S contains diversified sentences
A′ ← ∅   // A′ contains topics covered by S
while A′ ≠ A do
    /* identify the sentence that covers the most topics and is highly relevant for fact-checking */
    s* ← arg min …

Figure 3: Examples of claims from the climate change and health care datasets.
• Global warming slowing down? 'Ironic' study finds more CO2 has slightly cooled the planet.
• The ozone layer is healing.
• Deforestation has made humans more vulnerable to pandemics.
• Historical data of temperature in the U.S. destroys global warming myth.
• New evidence shows wearing face mask can help coronavirus enter the brain and pose more health risk, warn expert.
• Boil weed and ginger for Covid-19 victims, the virus will vanish.
• Wearing face masks can cause carbon dioxide toxicity; can weaken immune system.

SUMO Implementation. We use TensorFlow to implement SUMO. We use per-class accuracy and macro F1 scores as performance metrics for evaluation. We use a bi-directional Gated Recurrent Unit (GRU) with a hidden size of 200, word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) embeddings with an embedding size of 200, and softmax cross-entropy with logits as the loss function. We set the learning rate to 0.001, the batch size to 64, and gradient clipping to 5. All the parameters are tuned using a grid search. We train each model for 50 epochs and apply early stopping if the validation loss does not change for more than 5 epochs. We set the maximum sentence length to 45 and the maximum number of sentences in a document to 35. For the task of summarization, we use Latent Dirichlet Allocation (LDA) (Blei et al., 2003) as a topic model to compute topic distribution scores and the dominant topic for each candidate sentence.
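The PolitiFact label conversion described under Datasets can be sketched in Python as follows. The sketch assumes that every label outside the three 'true'-like labels maps to 'false', in line with the binary true/false classification objective; the function name and the normalisation step are hypothetical.

```python
TRUE_LABELS = {"true", "mostly true", "half true"}
FALSE_LABELS = {"mostly false", "false", "pants-on-fire"}

def binarize_politifact_label(label):
    """Map a six-way PolitiFact rating to the binary label used for training."""
    normalized = label.strip().lower()
    if normalized in TRUE_LABELS:
        return "true"
    if normalized in FALSE_LABELS:
        return "false"
    raise ValueError(f"unexpected PolitiFact label: {label!r}")

binary_labels = [binarize_politifact_label(x)
                 for x in ["Half True", "pants-on-fire", "mostly false", "true"]]
```

Raising on unknown labels, instead of silently defaulting to 'false', keeps mislabelled rows from leaking into the training data.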

Setup for the Task of Claim Correctness
We experiment with five variants of our proposed SUMO model and compare them with six state-of-the-art methods. The six state-of-the-art methods are as follows. First, we have the basic Long Short Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) unit, which is used with claim and document contents for classification. Second, we have a convolutional neural network (CNN) (Kim, 2014) for document classification. Third, we compare against the model proposed in (Tang et al., 2015) that uses a hierarchical representation of the documents using hierarchical LSTM units (Hi-LSTM). Fourth, we compare against the model proposed in (Yang et al., 2016) that uses hierarchical neural attention on top of hierarchical LSTMs (HAN) to learn better representations of documents for classification. Fifth, we compare against the model proposed in (Popat et al., 2018) that uses a claim guided attention method (DeClarE) for correctness prediction of claims in the presence of external evidence. Sixth and finally, we compare against the recent work (Mishra and Setty, 2019) that improves on the DeClarE method by using attention based on latent aspects (speaker, topic, or domain).
The five proposed variants of our method SUMO are as follows. First, we have the variant that corresponds to the basic SUMO model with word2vec embeddings. Second, we have the SUMO-AtopW2V variant, which consists of the SUMO model with word2vec embeddings and uses the Atop method of attention fusion rather than a simple average. Third, we have the SUMO-AGlove variant, which is the basic SUMO model that uses GloVe embeddings. Fourth, we have the SUMO-AtopGlove variant, which consists of the SUMO model with GloVe embeddings and uses the Atop method of attention fusion rather than a simple average. Fifth and finally, we have the SUMO-AtopGlove+source-Emb variant, which is similar to SUMO-AtopGlove but with additional source embeddings (domains of retrieved documents).

Claim Correctness Task Results
The results for establishing claim correctness are shown in Table 2. We observe that the basic LSTM based model achieves 57.89% and 69.89% in terms of macro F1 accuracy in prediction of claim correctness for POLITIFACT and SNOPES, respectively. SADHAN (Mishra and Setty, 2019) outperforms DeClarE with macro F1 accuracy of 75.69% and 80.09%, respectively. Interestingly, we observe that the basic SUMO model with word2vec embeddings performs better than DeClarE with source embeddings. This observation is a clear indication of the superiority of our claim- and title-driven attention design. SUMO with Atop attention fusion is more effective than a simple average fusion of attention weights, which becomes apparent from the gain in macro F1 accuracy on both datasets. SUMO with pretrained GloVe embeddings outperforms the word2vec versions of SUMO, as the GloVe embeddings are trained on a large corpus and therefore capture better context for the words. SUMO-AtopGlove+source-Emb outperforms all the other models, and the improvement is statistically significant with a p-value of 2.79 × 10^−3 for POLITIFACT and 3.09 × 10^−4 for SNOPES. The statistical significance values were computed using a two-sample Student's t-test. We notice that SUMO could not outperform SADHAN without source embeddings, as SADHAN uses a very complex structure with three parallel models and hierarchical latent aspect guided attention. However, SADHAN has several drawbacks. First, it is challenging to train and requires more hardware resources and time. Second, the latent aspects are not available for all Web claims; therefore, it is not generalizable. Third, it fails to accommodate new values of latent variables at test time.

Setup for the Task of Summarization
For the evaluation of the summarization capability of SUMO, we create gold reference summaries for claims. For creating the gold reference summaries, we include all the facts related to the claim that are important for the claim correctness prediction, non-redundant, and topically diverse. We find that the descriptions provided for a claim on fact-checking websites such as snopes.com and politifact.com are suitable for this purpose. We use a cosine similarity threshold of 0.4 between claims and sentences of the description to filter out irrelevant or noisy sentences. As evaluation metrics, we use ROUGE-1, ROUGE-2, and ROUGE-L scores. The ROUGE-1 score represents the overlap of unigrams, while the ROUGE-2 score represents the overlap of bigrams between the summaries generated by SUMO and the gold reference summaries. The ROUGE-L score measures the longest matching sequence of words using the Longest Common Subsequence algorithm.
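To illustrate what the three metrics measure, here is a small Python sketch of the recall variants of ROUGE-N and ROUGE-L. It assumes lowercased whitespace tokenization and reports recall only; standard ROUGE toolkits also report precision and F-scores and apply their own preprocessing, so this is not the evaluation script used for the reported numbers. The example sentences are made up.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    """Fraction of reference n-grams that also occur in the candidate."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    return sum((cand & ref).values()) / total  # clipped n-gram overlap

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_recall(candidate, reference):
    """LCS length divided by the number of reference tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return lcs_length(cand, ref) / len(ref) if ref else 0.0

reference = "smoking raises the risk of severe covid"
candidate = "smoking raises risk of severe illness"
r1 = rouge_n_recall(candidate, reference, 1)
r2 = rouge_n_recall(candidate, reference, 2)
rl = rouge_l_recall(candidate, reference)
```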
Standard summarization techniques are not useful in such a scenario as the objective of summarization with standard techniques is usually not fact-checking. Hence, we compare the SUMO results with an information retrieval (BM25) and a natural language processing based method (Query-Sum). BM25 is a ranking function, which uses a probabilistic retrieval framework and ranks the documents based on their relevance to a given search query. We use Web claims as a query and apply BM25 to get the most relevant sentences from all the documents retrieved for the claim. We also compare the results with the query-driven attention based abstractive summarization method Query-Sum (Nema et al., 2017), which also uses a diversity objective to create a diverse summary. We use ROUGE metrics with a gold reference summary to evaluate the generated summaries.
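The BM25 baseline, with the claim as the query and candidate sentences as the documents, can be sketched in Python as below. The parameter values `k1=1.5` and `b=0.75` are common defaults and an assumption here, as are the whitespace tokenization and the particular IDF smoothing; the paper does not state its settings. The sketch expects a non-empty list of sentences.

```python
import math
from collections import Counter

def bm25_scores(query, sentences, k1=1.5, b=0.75):
    """Okapi BM25 score of each sentence for the query (assumed parameters)."""
    docs = [s.lower().split() for s in sentences]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(query.lower().split())
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF that stays positive even for very frequent terms.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

sentences = [
    "smoking increases covid severity",
    "the ozone layer is healing",
    "masks reduce transmission",
]
scores = bm25_scores("smoking covid", sentences)
ranking = sorted(range(len(sentences)), key=lambda i: -scores[i])
```

Sentences sharing no term with the claim score zero, which is why a purely lexical baseline like this can miss evidence phrased differently from the claim.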

Comparison of Summarization Results
Results for the task of summarization are shown in Table 3. The QuerySum method performs significantly better than BM25, with a ROUGE-L score of 30.16, as it uses query-driven attention and a diversity objective, which results in a diverse and query-oriented summary. The proposed model SUMO outperforms QuerySum with a ROUGE-L score of 35.92. We attribute this gain to the use of word- and sentence-level weights, which are trained using back-propagation with the correctness label. We also notice that in QuerySum some sentences are related to the claim but are not useful for fact checking; therefore, they are absent from the gold reference summary. The results for SUMO are statistically significant (p-value = 1.39 × 10^−4) using a pairwise Student's t-test.

Conclusion
We presented SUMO, a neural network based approach to generate explainable and topically diverse summaries for verifying Web claims. SUMO uses an improved version of hierarchical claim-driven attention along with title-driven and self-attention to learn an effective representation of the external evidence retrieved from the Web. Learning this effective representation in turn assists us in establishing the correctness of textual claims. Using the overall attention weights from the novel Atop attention method and the topical distributions of the sentences, we generate extractive summaries for the claims. In addition to this, we release two important datasets pertaining to climate change and healthcare claims. In the future, we plan to investigate BERT (Devlin et al., 2019) and other Transformer (Vaswani et al., 2017) architecture based embedding methods in place of GloVe (Pennington et al., 2014) embeddings for better contextual representation of words.