AraStance: A Multi-Country and Multi-Domain Dataset of Arabic Stance Detection for Fact Checking

With the continuing spread of misinformation and disinformation online, it is of increasing importance to develop combating mechanisms at scale in the form of automated systems that support multiple languages. One task of interest is claim veracity prediction, which can be addressed using stance detection with respect to relevant documents retrieved online. To this end, we present our new Arabic Stance Detection dataset (AraStance) of 4,063 claim–article pairs from a diverse set of sources comprising three fact-checking websites and one news website. AraStance covers false and true claims from multiple domains (e.g., politics, sports, health) and several Arab countries, and it is well-balanced between related and unrelated documents with respect to the claims. We benchmark AraStance, along with two other stance detection datasets, using a number of BERT-based models. Our best model achieves an accuracy of 85% and a macro F1 score of 78%, which leaves room for improvement and reflects the challenging nature of AraStance and the task of stance detection in general.


Introduction
The proliferation of social media has made it possible for individuals and groups to share information quickly. While this is useful in many situations such as emergencies, where disaster management efforts can make use of shared information to allocate resources, this evolution can also be dangerous, e.g., when the news shared is not precise or is even intentionally misleading. Polarization in different communities further aggravates the problem, causing individuals and groups to believe and to disseminate information without necessarily verifying its veracity (misinformation) or even making up stories that support their world views (disinformation). These circumstances motivate a need to develop tools for detecting fake news online, including for a region with opposing forces and ongoing conflicts such as the Arab world.
Our work here contributes to these efforts a new dataset and baseline results on it. In particular, we create a new dataset for stance detection of claims collected from a number of websites covering different domains such as politics, health, and economics. The websites cover several Arab countries, which enables wider applicability of our dataset. This compares favorably to previous work on Arabic stance detection, such as that of Baly et al. (2018), who focused on a single country. We use the websites as our source to collect true and false claims, and we carefully crawl web articles related to these claims. Using the claim–article pairs, we then manually assign stance labels to the articles. By stance we mean whether an article agrees with, disagrees with, or discusses a claim, or is simply unrelated to it. This allows us to use the resulting dataset to build models that automatically identify the stance of an article with respect to a given claim, which is an important component of fact-checking and fake news detection systems. To develop these models, we resort to transfer learning by fine-tuning pre-trained language models on our labeled dataset. We also benchmark our models on two existing datasets for Arabic stance detection. Finally, we make our dataset publicly available. 1 Our contributions can be summarized as follows: 1. We release a new multi-domain, multi-country dataset labeled for both stance and veracity.
2. We introduce a multi-query related document retrieval approach for claims from diverse topics in Arabic, resulting in a dataset with balanced label distributions across classes.
3. We compare our dataset to two other Arabic stance detection datasets using four BERT-based (Devlin et al., 2019) models.

Related Work
Stance detection started as a standalone task, unrelated to fact-checking (Küçük and Can, 2020). One formulation models the relation (e.g., for, against, neutral) of a text segment towards a topic, usually a controversial one such as abortion or gun control (Mohammad et al., 2016; Abbott et al., 2016). Another formulation models the relation (e.g., agree, disagree, discuss, unrelated) between two pieces of text (Hardalov et al., 2021b; Ferreira and Vlachos, 2016). The latter definition is used in automatic fact-checking, fake news detection, and rumour verification (Vlachos and Riedel, 2014). There are several English datasets that model fact-checking as a stance detection task on text from multiple genres such as Wikipedia (Thorne et al., 2018), news articles (Pomerleau and Rao, 2017; Ferreira and Vlachos, 2016), and social media (Gorrell et al., 2019; Derczynski et al., 2017). Most related to our work is the Fake News Challenge, or FNC (Pomerleau and Rao, 2017), which was built by randomly matching claims and articles from the Emergent dataset (Ferreira and Vlachos, 2016), which itself pairs 300 claims with 2,500 articles. Because the pairing in FNC was done at random, it yielded a large number of unrelated claim–article pairs. Several approaches attempt to predict the stance on the FNC dataset using LSTMs, memory networks, and transformers (Hanselowski et al., 2018; Conforti et al., 2018; Zhang et al., 2019; Schiller et al., 2021; Schütz et al., 2021).
There are two datasets for Arabic stance detection with respect to claims. The first one (Baly et al., 2018) collected its false claims from a single political source, while we cover three sources from multiple countries and topics. They retrieved relevant documents and annotated the claim–article pairs using the four labels listed earlier (i.e., agree, disagree, discuss, unrelated). They also annotated "rationales," which are segments in the articles where the stance is most strongly expressed. The other Arabic dataset, by Khouja (2020), uses headlines from news sources and generates true and false claims by modifying the headlines. They used a three-class labeling scheme for stance by merging the discuss and the unrelated classes into one class called other.
Our work is also related to detecting machine-generated and manipulated text (Jawahar et al., 2020; Nagoudi et al., 2020).

AraStance Construction
We constructed our AraStance dataset similarly to the way this was done for the English Fake News Challenge (FNC) dataset (Pomerleau and Rao, 2017) and for the Arabic dataset of Baly et al. (2018). Our dataset contains true and false claims, where each claim is paired with one or more documents. Each claim–article pair has a stance label: agree, disagree, discuss, or unrelated. Below, we describe the three steps of building AraStance: (i) claim collection and preprocessing, (ii) relevant document retrieval, and (iii) stance annotation.

Claim Collection and Preprocessing
We collected false claims from three fact-checking websites: ARAANEWS 2, DABEGAD 3, and NORUMORS 4, based in the UAE, Egypt, and Saudi Arabia, respectively. The claims were from 2012 to 2018 and covered multiple domains such as politics, sports, and health. As the three fact-checking websites only debunk false claims, we looked for another source for true claims: following Baly et al. (2018), we collected true claims from the Arabic website of REUTERS 5, assuming that their content was trustworthy. We added topic and date restrictions when collecting the true claims in order to make sure they were similar to the false claims. Moreover, in order to ensure the true claims were from the same topics as the false ones, we used a subset of the false claims as seeds to retrieve true claims that were within three months of the seed false claims, and we ranked them by TF.IDF similarity to the seeds. We kept a maximum of ten true claims per seed false claim. For all claims, we removed the ones that contained no text and/or were multimedia-centric. Moreover, we manually modified the false claims by removing phrases like "It is not true that", "A debunked rumor about", or "The reality of", which are often used by fact-checking websites. This sometimes required us to add a noun at the beginning of the claim based on the text of the target articles, or to make some grammatical edits. We show examples of two false claims before and after preprocessing in Table 1. Note that the headlines we retrieved from REUTERS were already phrased as claims, and thus we did not have to edit them in any way.

Original claim: What is being circulated about recording residents' calls and messages is not true
Preprocessed claim: The government is recording calls and messages of residents

Original claim: Cosmic rays coming from Mars is a false rumor from 2008
Preprocessed claim: NASA warns of dangerous cosmic rays coming from Mars tonight from 12.30-3.30

Table 1: Examples of false claims before and after preprocessing.
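To illustrate the ranking step described above, here is a minimal pure-Python sketch of TF.IDF cosine ranking of candidate true claims against a seed false claim. The whitespace tokenization and function names are our own simplifications, not the exact pipeline we used:

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Build simple TF-IDF vectors for whitespace-tokenized texts."""
    docs = [t.split() for t in texts]
    df = Counter()
    for d in docs:
        df.update(set(d))  # document frequency per term
    n = len(docs)
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append({w: (c / len(d)) * math.log(n / df[w]) for w, c in tf.items()})
    return vecs

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_candidates(seed_claim, candidates, k=10):
    """Rank candidate true claims by TF-IDF similarity to a seed false claim,
    keeping at most k of them (we kept a maximum of ten per seed)."""
    vecs = tfidf_vectors([seed_claim] + candidates)
    seed_vec, cand_vecs = vecs[0], vecs[1:]
    scored = sorted(zip(candidates, cand_vecs),
                    key=lambda p: cosine(seed_vec, p[1]), reverse=True)
    return [c for c, _ in scored[:k]]
```

In the real pipeline, candidates were additionally restricted to REUTERS headlines within three months of the seed claim.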

Document Retrieval
For each claim, we retrieved relevant documents using multiple queries and the Google Search API. It was harder to find relevant documents for the false claims by passing their preprocessed version as queries because of their nature, locality, and diversity. For some false claims, there were extra clauses and modifiers that restricted the search results significantly as shown in the examples below:

1. A female child with half a human body, and the other half is a snake

2. Lungs of smokers are cleaned by smelling the steam of milk and water for ten days

To remedy this, we boosted the quality of the retrieved documents by restricting the date range to two months before and after the date of the claim, prepending named entities, and removing extra clauses using parse trees. In order to emphasize the presence of the main entity (or entities) in the claim, we extracted named entities using the Arabic NER corpus by Benajiba et al. (2007) and Stanford's CoreNLP Arabic NER tagger (Manning et al., 2014). We further used Stanford's CoreNLP Arabic parser to extract the first verb phrase (VP) and all its preceding tokens in the claim, as this has been shown to improve document retrieval results for claim verification, especially for lengthy claims (Chakrabarty et al., 2018). For the two examples above, we would keep the claim up to the comma for the first example and up to the word "and" for the second one, and we would use those as the queries.
For each false claim, we searched for relevant documents using the following five queries: (i) the manually preprocessed claim as is, (ii) the preprocessed claim with date restriction, (iii) the preprocessed claim with named entities and date restriction, (iv) the first VP and all preceding tokens with date restriction, and lastly (v) the first VP and all preceding tokens with named entities and date restriction. For the true claims, which had wider coverage and were thus easier to retrieve, we only ran two queries: the claim with and without date restriction.
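The query-construction scheme above can be sketched as follows. The function and parameter names are illustrative, and we assume the named entities and the VP prefix have already been extracted upstream:

```python
from datetime import date, timedelta

def build_queries(claim, named_entities, vp_prefix, claim_date, is_false):
    """Assemble the search queries described in the text.

    named_entities: entities extracted from the claim (passed in directly here).
    vp_prefix: the claim truncated after its first verb phrase.
    The date restriction spans roughly two months before/after the claim date.
    Returns (query_string, date_window_or_None) pairs.
    """
    window = (claim_date - timedelta(days=60), claim_date + timedelta(days=60))
    entities = " ".join(named_entities)
    if not is_false:
        # True claims were easier to retrieve: claim with and without a date range.
        return [(claim, None), (claim, window)]
    return [
        (claim, None),                    # (i) preprocessed claim as is
        (claim, window),                  # (ii) + date restriction
        (f"{entities} {claim}", window),  # (iii) + named entities and date
        (vp_prefix, window),              # (iv) first VP and preceding tokens
        (f"{entities} {vp_prefix}", window),  # (v) VP prefix + named entities
    ]
```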
We combined the results from all queries, and we kept a maximum of ten documents per claim. If the retrieved documents exceeded this limit, we only kept documents from news sources 6, or from sources used in previous work on Arabic stance detection (Baly et al., 2018; Khouja, 2020). If we still had more than ten documents after filtering by source, we ranked the documents by their TF.IDF similarity with the claim, and we kept the top ten documents. We limited the number of documents to ten per claim in order to avoid having claims with very high numbers of documents and others with only one or two documents. Ultimately, this helped us keep the dataset balanced in terms of both sources and topics.
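The filter-then-rank capping logic can be sketched as follows. This is a simplified illustration: the `similarity` argument stands in for TF.IDF cosine similarity, and all names are ours:

```python
def select_documents(claim, docs, trusted_sources, similarity, cap=10):
    """Keep at most `cap` documents per claim.

    docs: list of (source, text) pairs from all queries combined.
    trusted_sources: allowlist of news sources (and sources from prior work).
    similarity: a claim-text scoring function, e.g., TF-IDF cosine.
    """
    # Deduplicate while preserving retrieval order.
    seen, unique = set(), []
    for src, text in docs:
        if text not in seen:
            seen.add(text)
            unique.append((src, text))
    # Over the cap: first filter by trusted source, if any survive.
    if len(unique) > cap:
        trusted = [d for d in unique if d[0] in trusted_sources]
        if trusted:
            unique = trusted
    # Still over the cap: rank by similarity to the claim, keep the top ones.
    if len(unique) > cap:
        unique.sort(key=lambda d: similarity(claim, d[1]), reverse=True)
        unique = unique[:cap]
    return unique
```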

Stance Annotation
We set up the annotation task as follows: given a claim–article pair, what is the stance of the document towards the claim? The stance was to be annotated using one of the following labels: agree, disagree, discuss, or unrelated, which were also used in previous work (Pomerleau and Rao, 2017; Baly et al., 2018).

Table 2: Disagreement between the annotators on the discuss (D) label with the agree (A) label (first example) and the unrelated (U) label (second example).
We explained the labels to the annotators as follows:
• agree: the document agrees with the main claim in the statement clearly and explicitly;
• disagree: the document disagrees with the main claim in the statement clearly and explicitly;
• discuss: the document discusses the same event without taking a position towards its validity;
• unrelated: the document talks about a different event, regardless of how similar the two events might be.
Our annotators were three graduate students in computer science and linguistics, all native speakers of Arabic. We adopted guidelines similar to the ones introduced by Baly et al. (2018). First, we conducted a pilot annotation round on 315 claim–article pairs, where each pair was annotated by all annotators. The annotators agreed on the same label for 220 out of the 315 pairs (70% of the pairs), for 89 pairs (28%) two annotators agreed on the label, and for the remaining 6 pairs (2%) there was a three-way disagreement. The main disagreements between the annotators concerned the discuss label, which was confused with either agree or unrelated.
We show two examples in Table 2. The annotators labeled the example at the top of the table as discuss and agree. The two annotators who labeled this example as discuss justified their choice by arguing that the document only mentioned the claims without agreeing or disagreeing, and mainly analyzed the impact of rotten meat on Brazil's economy in great detail. The example at the bottom of the table was labeled by one annotator as discuss and by two annotators as unrelated. The annotators who labeled it as unrelated argued that there was no mention of Egypt's involvement in the rescue efforts, while the annotator who labeled the pair as discuss maintained that the document discussed the same event of children trapped in the cave.
These disagreements were resolved through discussions between the annotators, which involved refining the guidelines to label a pair as discuss if it only talks about the exact same event of the claim without taking any clear position. The annotators were also asked not to take into consideration any other factors, e.g., the date of article, its publisher, or its veracity.
For the rest of the data, each claim–article pair was annotated by two annotators, and differences were resolved by the third annotator. This is very similar to labeling all pairs by three annotators with majority voting, but with lower labor requirements. We measured the inter-annotator agreement (IAA) using Fleiss' kappa, which accounts for multiple annotators (Fleiss and Cohen, 1973), obtaining an IAA of 0.67, which corresponds to substantial agreement. Table 3 shows the number of claims and articles for each website with their veracity label (by publisher) and final stance annotations. The distribution of the four stance classes in training, development, and test is shown in Table 4. After selecting the gold annotations, we discarded all claims that had all of their retrieved documents labeled as unrelated, aiming to reduce the imbalance with respect to the unrelated class, and we only focused on claims with related documents, which can be seen as a proxy for check-worthiness. We ended up with a total of 4,063 claim–article pairs based on 910 claims: 606 false and 304 true. The dataset is imbalanced towards the false claims, but as our main task is stance detection rather than claim veracity, we aimed at having a balanced distribution for the four stance labels. As shown in Table 4, around half of the labels are from the unrelated class, but it is common for stance detection datasets to have a higher proportion of this class (Pomerleau and Rao, 2017; Baly et al., 2018). There are various approaches that can mitigate the impact of the class imbalance caused by the unrelated class. These are related to (i) task setup, (ii) modeling, and (iii) evaluation.
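The Fleiss' kappa computation we used for agreement can be reproduced with a short pure-Python sketch. The function name is ours, and it assumes a fixed number of annotators per item:

```python
def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for multiple annotators.

    ratings: list of per-item label lists, one label per annotator,
             with the same number of annotators for every item.
    categories: the full label set (e.g., the four stance labels).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # Count matrix: items x categories.
    counts = [[labels.count(c) for c in categories] for labels in ratings]
    # Mean per-item observed agreement.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Chance agreement from marginal label proportions.
    totals = [sum(row[j] for row in counts) for j in range(len(categories))]
    p_j = [t / (n_items * n_raters) for t in totals]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)
```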

Statistics About the Final Dataset
First, the task can be approached differently by only doing stance detection on the three related classes (Conforti et al., 2018), or by merging the discuss and the unrelated classes into one class, e.g., called neutral or other (Khouja, 2020).
Second, it is possible to keep all classes, but to train a two-step model: first to predict related vs. unrelated, and then, if the example is judged to be related, to predict the stance for the three related classes only (Zhang et al., 2019).
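A minimal sketch of such a two-step cascade, with the two classifiers abstracted as pretrained callables (names are illustrative, not an implementation from the cited work):

```python
def two_step_predict(claim, article, relatedness_clf, stance_clf):
    """Two-stage stance prediction: first related vs. unrelated, then a
    three-way stance model (agree / disagree / discuss) on related pairs only.

    relatedness_clf(claim, article) -> "related" or "unrelated"
    stance_clf(claim, article)      -> "agree", "disagree", or "discuss"
    """
    if relatedness_clf(claim, article) == "unrelated":
        return "unrelated"
    return stance_clf(claim, article)
```

This setup lets the first stage absorb the dominant unrelated class, so the second stage trains on a more balanced three-class problem.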
Third, one could adopt an evaluation measure that rewards models that make correct predictions for the related classes more than for the unrelated class. Such a measure was adopted by the Fake News Challenge (Pomerleau and Rao, 2017). However, such measures have to be used very carefully, as they might be exploited. For example, it was shown that the FNC measure can be exploited by random prediction from the related classes and never from the unrelated class, which has a lower reward under the FNC evaluation measure (Hanselowski et al., 2018). We leave such considerations about the impact of class imbalance to future work.
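For reference, the hierarchical FNC measure (Pomerleau and Rao, 2017) can be sketched as follows. This mirrors our reading of the official scorer; under it, a related pair is worth up to 1.0 while an unrelated pair is worth at most 0.25, which is what makes the always-guess-related exploit possible:

```python
RELATED = {"agree", "disagree", "discuss"}

def fnc_score(gold, pred):
    """Hierarchical FNC-style score: an exact label match earns 0.25
    (plus a further 0.50 when the gold label is related), and correctly
    placing a related pair in any related class earns another 0.25."""
    score = 0.0
    for g, p in zip(gold, pred):
        if g == p:
            score += 0.25
            if g in RELATED:
                score += 0.50
        if g in RELATED and p in RELATED:
            score += 0.25
    return score
```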

External Datasets
We experimented with a number of BERT-based models, pre-trained on Arabic or on multilingual data, which we fine-tuned and applied to our dataset, as well as to the following two Arabic stance detection datasets for comparison purposes:
• Baly et al. (2018) Dataset. This dataset has 1,842 claim–article pairs for training (278 agree, 37 disagree, 266 discuss, and 1,261 unrelated), 587 for development (86 agree, 25 disagree, 73 discuss, and 403 unrelated), and 613 for testing (110 agree, 25 disagree, 70 discuss, and 408 unrelated).
The dataset by Baly et al. (2018) has 203 true claims from REUTERS and 219 false claims from the Syrian fact-checking website VERIFY-SY, 7 which focuses on debunking claims about the Syrian civil war. Thus, the dataset contains claims that focus primarily on war and politics. They retrieved the articles and performed manual annotation of claim–article pairs for stance, following a procedure that is very close to the one we used for AraStance. Moreover, their dataset has annotations of rationales, which give the reason for selecting an agree or a disagree label. The dataset has a total of about 3,000 claim–article pairs, 2,000 of which are from the unrelated class. The dataset comes with a split into five folds of roughly equal sizes. We use folds 1-3 for training, fold 4 for development, and fold 5 for testing.
The dataset by Khouja (2020) is based on sampling a subset of news titles from the Arabic News Text (ANT) corpus (Chouigui et al., 2017), and then making true and false alterations of these titles using crowd-sourcing. The stance detection task is then defined between pairs of original news titles and their respective true/false alterations. This essentially maps to detecting paraphrases for true alterations (stance labeled as agree) and contradictions for false ones (stance labeled as disagree). They further have a third stance label, other, which is introduced by pairing the alterations with other news titles that have high TF.IDF similarity with the news title originally paired with the alteration. Overall, Khouja (2020)'s dataset is based on synthetic statements that are paired with news titles. This is quite different from AraStance and the dataset of Baly et al. (2018), which have naturally occurring claims that are paired with full news articles. Moreover, as both AraStance and Baly et al. (2018)'s datasets have naturally occurring data from the web, they both exhibit a certain level of noise and irregularities, e.g., some very long documents, words/characters in other languages such as English, etc. Such noise is minimal in Khouja (2020)'s dataset, which is a third differentiating factor compared to the other two datasets. Nevertheless, we include Khouja (2020)'s dataset in our experiments in order to empirically test the impact of these differences.

Models
We fine-tuned the following four models for each of the three Arabic datasets: 1. Multilingual BERT (mBERT), base size, which is trained on the Wikipedias of 100 different languages, including Arabic (Devlin et al., 2019).
2. ArabicBERT, base size, which is trained on 8.2 billion tokens from the OSCAR corpus 8 as well as on the Arabic Wikipedia (Safaya et al., 2020).
3. ARBERT, base size, which is trained primarily on Modern Standard Arabic (MSA) text, including books, Gigaword, and Common Crawl data (Abdul-Mageed et al., 2021).
4. MARBERT, base size, which is trained on a large collection of Arabic tweets containing both MSA and dialectal Arabic (Abdul-Mageed et al., 2021).
The four models are comparable in size, all having a base architecture, but with varying vocabulary sizes. More information about the different models can be found in the original publications about them. We fine-tuned each of them for a maximum of 25 epochs with an early stopping patience value of 5, a maximum sequence length of 512, a batch size of 16, and a learning rate of 2e-5.
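The early-stopping schedule above can be sketched in a model-agnostic way; the `evaluate_epoch` callable, which trains one epoch and returns the dev metric, is a placeholder for the actual fine-tuning loop:

```python
def train_with_early_stopping(evaluate_epoch, max_epochs=25, patience=5):
    """Run up to max_epochs of fine-tuning, stopping after `patience`
    consecutive epochs without improvement on the dev metric.

    evaluate_epoch(epoch) trains one epoch and returns the dev metric
    (e.g., macro F1). Returns (best_metric, best_epoch).
    """
    best, best_epoch, stale = float("-inf"), -1, 0
    for epoch in range(max_epochs):
        metric = evaluate_epoch(epoch)
        if metric > best:
            best, best_epoch, stale = metric, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best, best_epoch
```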

Results
The evaluation results are shown in Tables 5 and 6 for the development and the test sets, respectively. We use accuracy and macro-F1 to account for the different class distributions; we also report per-class F1 scores. Note that Khouja (2020) uses three labels rather than four, merging discuss and unrelated into other. Their label distribution has a majority of disagree, followed by agree, and very few instances of other, which is different from our dataset and from Baly et al. (2018)'s.
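Macro-F1 averages per-class F1 scores with equal weight, so rare classes such as disagree count as much as the frequent unrelated class; a minimal sketch:

```python
def macro_f1(gold, pred, labels):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores."""
    f1s = []
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```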
We can see that ARBERT yields the best overall and per-class performance on the development sets for the Khouja (2020) dataset and AraStance. It also generalizes very well to the test sets, where it even achieves a higher macro-F1 score for the Khouja (2020) dataset. The performance of the other three models (mBERT, ArabicBERT, and MARBERT) drops slightly on the test set compared to the development set for both AraStance and the Khouja (2020) dataset. ARBERT's advantage might be due to it being pre-trained on more suitable data, which includes books, Gigaword, and Common Crawl data primarily from MSA, but also a small amount of Egyptian Arabic. Since half of our data comes from an Egyptian website (DABEGAD), this could be helpful. Indeed, while ArabicBERT is pre-trained on slightly more data than ARBERT, it was almost exclusively pre-trained on MSA, without dialectal data, and it performs worse on AraStance.
Regarding the other models: the corpora on which ArabicBERT was trained contain duplicates, which could explain why it is outperformed. MARBERT, in turn, is pre-trained on tweets that contain both MSA and dialectal Arabic; its pre-training data come from social media, which differs from the news articles and titles from which all three datasets in our experiments are derived. It also seems that ARBERT and MARBERT are better than the other two models at predicting the stance between a pair of sentences, as is the case for the Khouja (2020) dataset. This could be due to the diversity of their pre-training data, which improves the models' ability to capture inter-sentence relations such as paraphrases and contradictions. Another factor that could explain ARBERT's better performance compared to MARBERT is that the latter is trained with a masking objective only, while ARBERT is trained with both a masking objective and a next sentence prediction objective. The use of the latter objective could explain ARBERT's ability to capture information in our claim–article pairs, although these pairs are different from other types of pairs, such as in question answering, where the two elements of the pair occur within an extended piece of text.
On the other hand, there is no consistently best model for the Baly et al. (2018) dataset. This could be due to a number of reasons. First, that dataset has a severe class imbalance, as we have explained in Section 4. Second, the dataset (especially its false claims) is derived from one particular domain, i.e., the Syrian war, which might not be well represented in the pre-training data. Therefore, additional modeling considerations, such as adaptive pre-training on a relevant unlabelled corpus before fine-tuning on the target labeled data, could help.
Surprisingly, ArabicBERT and ARBERT perform much better on the test set than on the development set of the Baly et al. (2018) dataset for the disagree class, which has the lowest frequency: from 0.14 F1 to 0.29-0.35 F1.
Since the number of disagree instances is very low (25 documents for 10-12 unique claims), it is possible that the claims in the test set happen to be more similar to the ones in the training data than those in the development set. This is plausible because we made our train-dev-test split based on the five folds prepared by the authors, as explained in Section 4. It is worth noting that the multilingual model (mBERT) has the highest overall accuracy and F1 score for the unrelated class of the Baly et al. (2018) dataset. Multilingual text representations such as mBERT might over-predict the majority class, and thus perform poorly on the two low-frequency classes; indeed, mBERT has an F1 score of 0 for disagree, and no more than 0.12 for discuss on development and testing.
Finally, we observe very high performance for all models on the unrelated class of AraStance. This could be an indication of strong signals that differentiate the related and the unrelated classes, whereas the discuss class is the most challenging one in AraStance, due to its strong resemblance to agree in some examples, such as the one shown in Table 2. Overall, there remains room for improvement: no single classifier yet excels on both frequent and infrequent classes for stance detection within and across datasets. We leave further experimentation, including with models developed for FNC and for the Baly et al. (2018) dataset, for future work.

Conclusion and Future Work
We presented AraStance, a new multi-topic Arabic stance detection dataset with claims extracted from multiple fact-checking sources across three countries and one news source. We discussed the process of data collection and approaches to overcome challenges in related document retrieval for claims with low online presence, e.g., due to topic or country specificity. We further experimented with four BERT-based models and two additional Arabic stance detection datasets.
In future work, we want to further investigate the differences between the three Arabic stance detection datasets and to attempt to mitigate the impact of class imbalance, e.g., by training with a weighted loss, or by upsampling or downsampling the classes. We further want to examine the discuss class across datasets and to compare the choice of annotation scheme (three-way vs. four-way) for this task. Moreover, we plan to enrich AraStance by collecting more true claims from other websites, thus creating a dataset that would be more evenly distributed across the claim veracity labels. Furthermore, we would like to investigate approaches for improving stance detection by extracting the parts of the documents that contain the main stance rather than truncating the documents after the first 512 tokens. Finally, we plan to experiment with cross-domain (Hardalov et al., 2021a) and cross-language approaches (Mohtarami et al., 2019).