Integrating Stance Detection and Fact Checking in a Unified Corpus

A reasonable approach for fact checking a claim involves retrieving potentially relevant documents from different sources (e.g., news websites, social media, etc.), determining the stance of each document with respect to the claim, and finally making a prediction about the claim’s factuality by aggregating the strength of the stances, while taking the reliability of the source into account. Moreover, a fact checking system should be able to explain its decision by providing relevant extracts (rationales) from the documents. Yet, this setup is not directly supported by existing datasets, which treat fact checking, document retrieval, source credibility, stance detection and rationale extraction as independent tasks. In this paper, we support these interdependencies by providing annotations for all of these tasks in the same corpus. We implement this setup in an Arabic fact-checking corpus, the first of its kind.


Introduction
Fact checking has recently emerged as an important research topic due to the unprecedented amount of fake news and rumors flooding the Internet in order to manipulate people's opinions (Darwish et al., 2017a; Mihaylov et al., 2015a,b; Mihaylov and Nakov, 2016) or to influence the outcome of major events such as political elections (Lazer et al., 2018; Vosoughi et al., 2018). While the number of organizations performing fact checking is growing, these efforts cannot keep up with the pace at which false claims are being produced, including clickbait (Karadzhov et al., 2017a), hoaxes (Rashkin et al., 2017), and satire (Hardalov et al., 2016). Hence, there is a need for automation.
While most previous research has focused on English, here we target Arabic. Moreover, we propose some guidelines, which we believe should be taken into account when designing fact-checking corpora, irrespective of the target language.
Automatic fact checking typically involves retrieving potentially relevant documents (news articles, tweets, etc.), determining the stance of each document with respect to the claim, and finally predicting the claim's factuality by aggregating the strength of the different stances, taking into consideration the reliability of the documents' sources (news medium, Twitter account, etc.). Despite the interdependency between fact checking and stance detection, research on these two problems has not been previously supported by an integrated corpus. This is a gap we aim to bridge by retrieving documents for each claim and annotating them for stance, thus ensuring a natural distribution of the stance labels.
Moreover, in order to be trusted by users, a fact-checking system should be able to explain the reasoning that led to its decisions. This is best supported by showing extracts (such as sentences or phrases) from the retrieved documents that illustrate the detected stance (Lei et al., 2016). Unfortunately, existing datasets do not offer manual annotation of sentence- or phrase-level supporting evidence. While deep neural networks with attention mechanisms can infer and extract such evidence automatically in an unsupervised way (Parikh et al., 2016), potentially better results can be achieved when the target sentence is provided in advance, which enables supervised or semi-supervised training of the attention. This would allow not only more reliable evidence extraction, but also better stance prediction, and ultimately better factuality prediction. Following this idea, our corpus also identifies the most relevant stance-marking sentences.
The connection between fact checking and stance has been argued for by Vlachos and Riedel (2014), who envisioned a system that (i) identifies factual statements (Hassan et al., 2015; Gencheva et al., 2017; Jaradat et al., 2018), (ii) generates questions or queries (Karadzhov et al., 2017b), (iii) creates a knowledge base using information extraction and question answering (Ba et al., 2016; Shiralkar et al., 2017), and (iv) infers the statements' veracity using text analysis (Banerjee and Han, 2009; Castillo et al., 2011; Rashkin et al., 2017) or information from external sources (Popat et al., 2016; Karadzhov et al., 2017b; Popat et al., 2017). This connection has also been used in practice, e.g., by Popat et al. (2017); however, different datasets had to be used for stance detection vs. fact checking, as no dataset so far has targeted both.
Fact checking is very time-consuming, and thus most datasets focus on claims that have already been checked by experts on specialized sites such as Snopes (Ma et al., 2016; Popat et al., 2016, 2017), PolitiFact (Wang, 2017), or Wikipedia hoaxes (Popat et al., 2016). 1 As fact checking is mainly done for English, non-English datasets are rare and often unnatural, e.g., translated from English and focusing on US politics. 2 In contrast, we start with claims that are not only relevant to the Arab world, but were also originally made in Arabic, thus producing the first publicly available Arabic fact-checking dataset.
Stance detection has so far been studied disjointly from fact checking. While there exist some datasets for Arabic (Darwish et al., 2017b), the most popular ones are for English, e.g., from SemEval-2016 Task 6 (Mohammad et al., 2016) and from the Fake News Challenge (FNC). 3 Despite its name, the latter has no annotations for factuality, but consists of article-claim pairs labeled for stance: agree, disagree, discuss, and unrelated. In contrast, we retrieve documents for each claim, which yields an arguably more natural distribution of stance labels compared to FNC.
Evidence extraction. Finally, an important characteristic of our dataset is that it provides evidence, in terms of text fragments, for the agree and disagree labels. Having such supporting evidence annotated enables not only better learning for supervised systems performing stance detection or fact checking, but also the ability for such systems to learn to explain their decisions to users. The importance of this latter ability has been recognized in previous work on rationalizing neural predictions (Lei et al., 2016). It is also at the core of recent research on machine comprehension, e.g., using the SQuAD dataset (Rajpurkar et al., 2016). However, such annotations have not been done for stance detection or fact checking before.
Finally, while preparing the camera-ready version of the present paper, we learned about a new dataset for Fact Extraction and VERification, or FEVER (Thorne et al., 2018), which is somewhat similar to ours as it is about both factuality and stance, and it has annotations for evidence. Yet, it is also different as (i) the claims are artificially generated by manually altering Wikipedia text, (ii) the knowledge base is restricted to Wikipedia articles, and (iii) the stance and the factuality labels are identical, assuming that Wikipedia articles are reliable enough to decide a claim's veracity. In contrast, we use real claims from news outlets, we retrieve articles from the entire Web, and we keep stance and factuality as separate labels.

The Corpus
Our corpus contains claims labeled for factuality (true vs. false). We associate each claim with several documents, where each claim-document pair is labeled for stance (agree, disagree, discuss, or unrelated), similar to the Fake News Challenge (FNC) dataset. Overall, the process of corpus creation went through several stages: claim extraction, evidence extraction, and stance annotation, which we describe below.
Claim Extraction We consider two websites as the source of our claims. VERIFY 4 is a project that was established to expose false claims made about the war in Syria and other related Middle Eastern issues. It is an independent platform that debunks claims made by all parties to the conflict. To the best of our knowledge, this is the only platform that publishes fact-checked claims in Arabic.
It is worth noting that the VERIFY website only shows claims that were debunked as false and misleading, and hence we used it to extract only the false claims for our corpus (we extracted the true claims from a different source; see below).
We thoroughly preprocessed the original claims. First, we manually identified and excluded all claims discussing falsified multimedia (images or videos), which cannot be verified using textual information and NLP techniques alone, e.g.:
(1) Pro-regime pages have circulated pictures of fighters fleeing an explosion.
Note that the claims in VERIFY were written in a form that presents the corrected information after debunking the original false claim. For instance, the original false claim in example 2a is corrected and published in VERIFY as shown in example 2b. We manually reverted these corrected claims to their original false form, which we used for our corpus.
(2a) (original false claim) FIFA intends to investigate the game between Syria and Australia.
(2b) (corrected claim in VERIFY) FIFA does not intend to investigate the game between Syria and Australia, as pro-regime pages claim.
After extracting the false claims from VERIFY, we collected the true claims of our corpus from REUTERS 5 by extracting headlines of news documents. We used a list of manually selected keywords to extract claims on the same topics as those from VERIFY.
Then, we manually excluded claims that contained political rhetoric (see example 3 below), multiple facts, accusations, or denials, and ultimately we only kept claims that discuss factual events, i.e., that can be verified.
(3) Presidents Vladimir Putin and Recep Tayyip Erdogan hope that Astana talks will lead to peace.
Overall, starting with 1,381 claims, we ended up with 422 check-worthy claims: 219 false claims from VERIFY, and 203 true claims from REUTERS.
Evidence Extraction Following the assumption that identifying the stance towards a claim can help predict its veracity, we want to associate each claim with supporting and opposing pieces of textual evidence. We used the Google custom search API for document retrieval, and we performed the following steps to increase the likelihood of retrieving relevant documents. First, as in (Karadzhov et al., 2017b), we transformed each claim into sub-queries by selecting the named entities, adjectives, nouns, and verbs with the highest TF.IDF scores, calculated on a collection of documents from the claims' sources. Then, we used these sub-queries, together with the claim itself, as input to the search API and retrieved the first 20 returned links, from which we excluded those pointing to VERIFY and REUTERS, as well as to social media websites, which are mostly opinionated. Finally, we calculated two similarity measures between the linked documents and the claims: the tri-gram containment (Lyon et al., 2001) and the cosine similarity between the average word embeddings of the two texts. 6 We only kept documents with non-zero values for both measures, yielding 3,042 documents: 1,239 for false claims and 1,803 for true claims.
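The two filtering measures can be sketched as follows. This is a minimal illustration: the English tokens and the toy three-dimensional embeddings merely stand in for the segmented Arabic text and the actual word vectors described in the footnote.

```python
# Sketch of the document filtering step: word tri-gram containment
# (Lyon et al., 2001) and cosine similarity over averaged word vectors.
import math

def trigrams(tokens):
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def containment(claim_tokens, doc_tokens):
    # Fraction of the claim's tri-grams that also occur in the document.
    c = trigrams(claim_tokens)
    return len(c & trigrams(doc_tokens)) / len(c) if c else 0.0

def avg_vector(tokens, emb, dim):
    # Average the vectors of the in-vocabulary tokens.
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy embeddings standing in for the actual word vectors.
emb = {"fifa": [1.0, 0.0, 0.0], "investigate": [0.0, 1.0, 0.0],
       "game": [0.0, 0.0, 1.0]}
claim = "fifa will investigate the game".split()
doc = "fifa will investigate the game between syria and australia".split()

# A document is kept only if both measures are non-zero.
keep = (containment(claim, doc) > 0 and
        cosine(avg_vector(claim, emb, 3), avg_vector(doc, emb, 3)) > 0)
```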
Stance Annotation We used CrowdFlower to recruit Arabic speakers to annotate the claim-document pairs for stance. Each pair was assigned to 3-5 annotators, who were asked to assign one of the following standard labels (also used at FNC): agree, disagree, discuss, and unrelated. First, we conducted small-scale pilot tasks to fine-tune the guidelines and to ensure their clarity. The annotators were asked to focus on the stance of the document towards the claim, regardless of the factuality of either text. This ensures that stance is captured without bias, so that it can be used later together with other information (e.g., time, website credibility, author reliability) to predict factuality. Finally, the annotators were asked to mark the segments in the documents representing the rationales that made them choose agree or disagree as labels. For quality control purposes, we further created a small hidden test set by annotating 50 pairs ourselves, and we used it to monitor the annotators' performance, keeping only those who maintained an accuracy of over 75%.
Ultimately, we used majority voting to aggregate the stance labels for each pair, using the annotators' performance scores to break ties. On average, 77% of the annotators for each claim-document pair agreed on its label, thus allowing proper majority aggregation for most pairs. A total of 133 pairs with significant annotation disagreement required us to manually check and correct the proposed annotations. We further automatically refined the documents by (i) excluding sentences with more than 200 words, and (ii) limiting the size of a document to 100 sentences. Such extra-long documents tend to originate from crawling ill-structured websites, or from parsing some specific types of websites such as web forums. Table 1 shows the distribution over the stance labels, 7 which turns out to be very similar to that for the FNC dataset. We can see that there are very few documents disagreeing with true claims (about 0.5%), which suggests that stance is positively correlated with factuality. However, the number of documents agreeing with false claims is larger than the number of documents disagreeing with them, which illustrates one of the main challenges when trying to predict the factuality of news based on stance.
7 The corpus is available at http://groups.csail.mit.edu/sls/downloads/ and also at http://alt.qcri.org/resources/
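The aggregation step can be sketched as follows. The tie-breaking rule shown, summing the hidden-test accuracy of the annotators backing each label, is one plausible instantiation of using performance scores to break ties, and is meant as an illustration rather than the exact rule.

```python
# Sketch of majority-vote aggregation with performance-based tie-breaking.
from collections import defaultdict

def aggregate_stance(annotations):
    """annotations: list of (label, annotator_accuracy) pairs for one
    claim-document pair, where annotator_accuracy is the annotator's
    score on the hidden test set."""
    votes = defaultdict(lambda: [0, 0.0])  # label -> [count, total accuracy]
    for label, acc in annotations:
        votes[label][0] += 1
        votes[label][1] += acc
    # Prefer the label with most votes; ties go to the label backed by
    # the more accurate annotators.
    return max(votes.items(), key=lambda kv: (kv[1][0], kv[1][1]))[0]

# Two annotators say "agree" and two say "discuss"; the tie is broken
# in favor of the label supported by the stronger annotators.
example = [("agree", 0.80), ("agree", 0.78),
           ("discuss", 0.92), ("discuss", 0.95)]
```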

Experiments and Evaluation
We experimented with our Arabic corpus, after preprocessing it with ATB-style segmentation using MADAMIRA (Pasha et al., 2014), using the following systems:

• FNC BASELINE SYSTEM. This is the FNC organizers' system, which trains a gradient boosting classifier using hand-crafted features reflecting polarity, refute, similarity, and overlap between the document and the claim.

• ATHENE. It was second at FNC (Hanselowski et al., 2017), and was based on a multi-layer perceptron with the baseline system's features, word n-grams, and features generated using latent semantic analysis and other factorization techniques.

• UCL. It was third at FNC (Riedel et al., 2017), training a softmax layer using similarity features.

• MEMORY NETWORK. We also experimented with an end-to-end memory network that showed state-of-the-art results on the FNC data.
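For illustration, two of the baseline's hand-crafted feature families can be sketched as below. The tiny refute lexicon and the English tokens are hypothetical placeholders; the actual features operate on the segmented Arabic text with larger lexicons.

```python
# Illustrative sketch of overlap and refute features in the spirit of
# the FNC baseline; REFUTE_CUES is a hypothetical placeholder lexicon.
REFUTE_CUES = {"fake", "hoax", "deny", "denies", "not", "fraud"}

def overlap_feature(claim_tokens, doc_tokens):
    # Jaccard overlap between the claim's and the document's vocabularies.
    c, d = set(claim_tokens), set(doc_tokens)
    return len(c & d) / len(c | d) if c | d else 0.0

def refute_feature(doc_tokens):
    # 1.0 if the document contains any refuting cue word, else 0.0.
    return 1.0 if REFUTE_CUES & set(doc_tokens) else 0.0
```

Such features are concatenated into a single vector per claim-document pair and fed to the gradient boosting classifier.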
The evaluation results are shown in Table 2. We use 5-fold cross-validation, where all claim-document pairs for the same claim are assigned to the same fold. We report accuracy, macro-average F1-score, and weighted accuracy, which is the official evaluation metric of FNC.
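The weighted accuracy metric follows the published FNC scoring scheme: a correct unrelated prediction earns 0.25, a correctly labeled related pair earns 1.0, and a related pair confused with a different related label earns 0.25, all normalized by the best achievable score. A minimal sketch:

```python
# Sketch of FNC-style weighted accuracy.
RELATED = {"agree", "disagree", "discuss"}

def weighted_accuracy(gold, pred):
    score, best = 0.0, 0.0
    for g, p in zip(gold, pred):
        # Best achievable credit: 0.25 for unrelated pairs, 1.0 for related.
        best += 0.25 if g == "unrelated" else 1.0
        if g == p:
            score += 0.25
            if g != "unrelated":
                score += 0.50
        if g in RELATED and p in RELATED:
            # Partial credit for correctly detecting relatedness.
            score += 0.25
    return score / best if best else 0.0
```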
Overall, our corpus appears to be much harder than FNC. For instance, the FNC baseline system achieves a weighted accuracy of 75.2 on FNC vs. 55.6 (up to 64.8) on our corpus. We believe that this is because we used a realistic information retrieval approach (see Section 3), whereas the FNC corpus contains a significant number of totally unrelated document-claim pairs, e.g., about 40% of the unrelated examples have no word overlap with the claim (even after stemming!), which makes it much easier to correctly predict the unrelated class (and this class is also by far the largest).

Next, we study the utility of the gold rationales. First, we show the results when using the full document along with the claim, which is the default representation. Then, we use the best sentence from the document, i.e., the one that is most similar to the claim as measured by the cosine of their average word embeddings. This performs worse, which can be attributed to sometimes selecting the wrong sentence. Next, we experiment with using the rationale instead of the best sentence when applicable (i.e., for agree and disagree), while still using the best sentence for discuss and unrelated. This yields sizable improvements on all evaluation metrics compared to using the best sentence (5-12 points absolute) or the full document (3-9 points absolute). Finally, we evaluate the impact of using the rationales when applicable, but the full document otherwise. This setting performed best (80.2% accuracy with ATHENE, and 3-8 points of improvement over best+rationale), as it has access to the most information: full document + rationale.
Overall, the above experiments demonstrate that having a gold rationale enables better learning. However, the results should be considered an upper bound on the expected performance improvement, since here we used gold rationales at test time, which would not be available in a real-world scenario. Still, we believe that sizable improvements would be possible when using the gold rationales for training only.
Finally, we built a simple fact-checker, where the factuality of a claim is determined based on aggregating the predicted stances (using FNC's baseline system) of the documents we retrieved for it. This yielded an accuracy of 56.2 when using the full documents, and 59.7 when using the best sentence + rationale (majority baseline of 50.5), thus confirming once again the utility of having a rationale, this time for a downstream task.
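This aggregation step can be sketched as follows. The simple majority rule shown here, predicting false when disagreement outweighs agreement among the retrieved documents, is an illustrative choice rather than the exact aggregation scheme.

```python
# Minimal sketch of a stance-aggregation fact checker: the predicted
# stances of the documents retrieved for a claim are aggregated into
# a factuality verdict. Ties default to "true".
def predict_factuality(stances):
    """stances: predicted stance labels (agree / disagree / discuss /
    unrelated) of the documents retrieved for one claim."""
    support = sum(1 for s in stances if s == "agree")
    oppose = sum(1 for s in stances if s == "disagree")
    return "false" if oppose > support else "true"
```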

Conclusion and Future Work
We have described a novel corpus that unifies stance detection, stance rationale, relevant document retrieval, and fact checking. This is the first corpus to offer such a combination, not only for Arabic but in general. We further demonstrated experimentally that these unified annotations, and the gold rationales in particular, are beneficial both for stance detection and for fact checking.
In future work, we plan to cover other important aspects of fact checking such as source reliability, language style, and temporal information, which have been shown useful in previous research (Castillo et al., 2011; Lukasik et al., 2015; Ma et al., 2016; Mukherjee and Weikum, 2015; Popat et al., 2017).