Predicting Factuality of Reporting and Bias of News Media Sources

We present a study on predicting the factuality of reporting and bias of news media. While previous work has focused on studying the veracity of claims or documents, here we are interested in characterizing entire news media. This is an under-studied, but arguably important research problem, both in its own right and as a prior for fact-checking systems. We experiment with a large list of news websites and with a rich set of features derived from (i) a sample of articles from the target news media, (ii) its Wikipedia page, (iii) its Twitter account, (iv) the structure of its URL, and (v) information about the Web traffic it attracts. The experimental results show sizable performance gains over the baseline, and reveal the importance of each feature type.


Introduction
The rise of social media has democratized content creation and has made it easy for everybody to share and spread information online. On the positive side, this has given rise to citizen journalism, thus enabling much faster dissemination of information compared to what was possible with newspapers, radio, and TV. On the negative side, stripping traditional media of their gate-keeping role has left the public unprotected against the spread of misinformation, which can now travel at breaking-news speed over the same democratic channel. This has led to the proliferation of false information that is typically created either (a) to attract network traffic and gain financially from showing online advertisements, e.g., as in the case of clickbait, or (b) to affect individual people's beliefs, and ultimately to influence major events such as political elections (Vosoughi et al., 2018). There are strong indications that false information was weaponized at an unprecedented scale during the 2016 U.S. presidential campaign.
"Fake news", which can be defined as "fabricated information that mimics news media content in form but not in organizational process or intent" (Lazer et al., 2018), became the word of the year in 2017, according to Collins Dictionary. "Fake news" thrive on social media thanks to the mechanism of sharing, which amplifies their effect. Moreover, it has been shown that "fake news" spread faster than real news (Vosoughi et al., 2018). As they reach the same user several times, they come to be perceived as more credible, unlike old-fashioned spam, which typically dies the moment it reaches its recipients. Naturally, limiting the sharing of "fake news" is a major focus for social media such as Facebook and Twitter.
Additional efforts to combat "fake news" have been led by fact-checking organizations such as Snopes, FactCheck and Politifact, which manually verify claims. Unfortunately, this is inefficient for several reasons. First, manual fact-checking is slow and debunking false information comes too late to have any significant impact. At the same time, automatic fact-checking lags behind in terms of accuracy, and it is generally not trusted by human users. In fact, even when done by reputable fact-checking organizations, debunking does little to convince those who already believe in false information.
A third, and arguably more promising, way to fight "fake news" is to focus on their source. While "fake news" are spreading primarily on social media, they still need a "home", i.e., a website where they would be posted. Thus, if a website is known to have published non-factual information in the past, it is likely to do so in the future. Verifying the reliability of the source of information is one of the basic tools that journalists in traditional media use to verify information. It is also arguably an important prior for fact-checking systems (Popat et al., 2017; Nguyen et al., 2018).
Fact-checking organizations have been producing lists of unreliable online news sources, but these are incomplete and get outdated quickly. Therefore, there is a need to predict the factuality of reporting for a given online medium automatically, which is the focus of the present work. We further study the bias of the source (left vs. right), as the two problems are inter-connected, e.g., extreme-left and extreme-right websites tend to score low in terms of factual reporting. Our contributions can be summarized as follows:
• We focus on an under-explored but arguably very important problem: predicting the factuality of reporting of a news medium. We further study bias, which is also under-explored.
• We create a new dataset of news media sources, which has annotations for both tasks, and is 1-2 orders of magnitude larger than what was used in previous work. We release the dataset and our code, which should facilitate future research.
• We use a variety of sources such as (i) a sample of articles from the target website, (ii) its Wikipedia page, (iii) its Twitter account, (iv) the structure of its URL, and (v) information about the Web traffic it has attracted. This combination, as well as some of the sources, is novel for these problems.
• We further perform an ablation study of the impact of the individual (groups of) features.
The remainder of this paper is organized as follows: Section 2 provides an overview of related work. Section 3 describes our method and features. Section 4 presents the data, the experiments, and the evaluation results. Finally, Section 5 concludes with some directions for future work.

Related Work
Journalists, online users, and researchers are well aware of the proliferation of false information, and thus topics such as credibility and fact-checking are becoming increasingly important. For example, in 2016 the ACM Transactions on Information Systems journal dedicated a special issue to Trust and Veracity of Information in Social Media (Papadopoulos et al., 2016).
There have also been some related shared tasks such as the SemEval-2017 task 8 on Rumor Detection (Derczynski et al., 2017), the CLEF-2018 lab on Automatic Identification and Verification of Claims in Political Debates, and the FEVER-2018 task on Fact Extraction and VERification.
The interested reader can learn more about "fake news" from the overview by Shu et al. (2017), which adopted a data mining perspective and focused on social media. Another recent survey took a fact-checking perspective on "fake news" and related problems. Yet another survey was performed by Li et al. (2016), covering truth discovery in general. Moreover, there were two recent articles in Science: Lazer et al. (2018) offered a general overview and discussion on the science of "fake news", while Vosoughi et al. (2018) focused on the process of proliferation of true and false news online. In particular, they analyzed 126K stories tweeted by 3M people more than 4.5M times, and confirmed that "fake news" spread much more widely than true news.

Fact-Checking
At the claim level, fact-checking and rumor detection have been primarily addressed using information extracted from social media, i.e., based on how users comment on the target claim (Canini et al., 2011; Castillo et al., 2011; Ma et al., 2015, 2016; Zubiaga et al., 2016; Ma et al., 2017; Dungs et al., 2018). The Web has also been used as a source of information (Mukherjee and Weikum, 2015; Popat et al., 2016, 2017; Karadzhov et al., 2017b). In both cases, the most important information sources are stance (does a tweet or a news article agree or disagree with the claim?), and source reliability (do we trust the user who posted the tweet or the medium that published the news article?). Other important sources are linguistic expression, meta information, and temporal dynamics.

Stance Detection
Stance detection has been addressed as a task in its own right, where models have been developed based on data from the Fake News Challenge (Riedel et al., 2017; Thorne et al., 2017; Hanselowski et al., 2018), or from SemEval-2017 Task 8 (Derczynski et al., 2017; Dungs et al., 2018). It has also been studied for other languages such as Arabic (Darwish et al., 2017b).

Source Reliability Estimation
Unlike stance detection, the problem of source reliability remains largely under-explored. In the case of social media, it concerns modeling the user who posted a particular message/tweet, while in the case of the Web, it is about the trustworthiness of the source (the URL domain, the medium). The latter is our focus in this paper.
In previous work, the source reliability of news media has often been estimated automatically based on the general stance of the target medium with respect to known manually fact-checked claims, without access to gold labels about the overall medium-level factuality of reporting (Mukherjee and Weikum, 2015; Popat et al., 2016, 2017, 2018). The assumption is that reliable media agree with true claims and disagree with false ones, while for unreliable media it is mostly the other way around. The trustworthiness of Web sources has also been studied from a Data Analytics perspective. For instance, Dong et al. (2015) proposed that a trustworthy source is one that contains very few false facts. In this paper, we follow a different approach by studying the source reliability as a task in its own right, using manual gold annotations specific for the task.

"Fake News" Detection
Most work on "fake news" detection has relied on medium-level labels, which were then assumed to hold for all articles from that source.
Horne and Adali (2017) analyzed three small datasets ranging from a couple of hundred to a few thousand articles from a couple of dozen sources, comparing (i) real news vs. (ii) "fake news" vs. (iii) satire, and found that the latter two have a lot in common across a number of dimensions. They designed a rich set of features that analyze the text of a news article, modeling its complexity, style, and psychological characteristics. They found that "fake news" pack a lot of information in the title (as the focus is on users who do not read beyond the title), and use shorter, simpler, and repetitive content in the body (as writing fake information takes a lot of effort). Thus, they argued that the title and the body should be analyzed separately.
In follow-up work, Horne et al. (2018b) created a large-scale dataset covering 136K articles from 92 sources from opensources.co, which they characterize based on 130 features from seven categories: structural, sentiment, engagement, topic-dependent, complexity, bias, and morality. We use this set of features when analyzing news articles.
In yet another follow-up work, Horne et al. (2018a) trained a classifier to predict whether a given news article is coming from a reliable or from an unreliable ("fake news" or conspiracy) source. Note that they assumed that all news from a given website would share the same reliability class. Such an assumption is fine for training (distant supervision), but we find it problematic for testing, where we believe manual document-level labels are needed. Potthast et al. (2018) used 1,627 articles from nine sources, whose factuality has been manually verified by professional journalists from BuzzFeed. They applied stylometric analysis, which was originally designed for authorship verification, to predict factuality (fake vs. real).
Rashkin et al. (2017) focused on the language used by "fake news" and compared the prevalence of several features in articles coming from trusted sources vs. hoaxes vs. satire vs. propaganda. However, their linguistic analysis and their automatic classification were at the article level and they only covered eight news media sources.
Unlike the above work, (i) we perform classification at the news medium level rather than focusing on an individual article. Thus, (ii) we use reliable manually-annotated labels as opposed to noisy labels resulting from projecting the category of a news medium to all news articles published by this medium (as most of the work above did). Moreover, (iii) we use a much larger set of news sources, namely 1,066, which is 1-2 orders of magnitude larger than what was used in previous work. Furthermore, (iv) we use a larger number of features and a wider variety of feature types compared to the above work, including features extracted from knowledge sources that have been largely neglected in the literature so far such as information from Wikipedia and the structure of the medium's URL.

Media Bias Detection
As we mentioned above, bias was used as a feature for "fake news" detection (Horne et al., 2018b). It has also been the target of classification, e.g., Horne et al. (2018a) predicted whether an article is biased (political or bias) vs. unbiased. Similarly, Potthast et al. (2018) classified the bias in a target article as (i) left vs. right vs. mainstream, or as (ii) hyper-partisan vs. mainstream. Finally, Rashkin et al. (2017) studied propaganda, which can be seen as extreme bias. See also a recent position paper (Pitoura et al., 2018) and an overview of bias on the Web (Baeza-Yates, 2018).
Unlike the above work, we focus on bias at the medium level rather than at the article level. Moreover, we work with fine-grained labels on an ordinal scale rather than having a binary setup (some work above had three degrees of bias, while we have seven).

Method
In order to predict the factuality of reporting and the bias for a given news medium, we collect information from multiple relevant sources, which we use to train a classifier. In particular, we collect a rich set of features derived from (i) a sample of articles from the target news medium, (ii) its Wikipedia page if it exists, (iii) its Twitter account if it exists, (iv) the structure of its URL, and (v) information about the Web traffic it has attracted. We describe each of these sources below.
Articles We argue that analysis (textual, syntactic and semantic) of the content of the news articles published by a given target medium should be critical for assessing the factuality of its reporting, as well as its potential bias. Towards this goal, we borrow a set of 141 features that were previously proposed for detecting "fake news" articles (Horne et al., 2018b), as we have described above. These features are used to analyze the following article characteristics:
• Structure: POS tags, linguistic features based on the use of specific words (function words, pronouns, etc.), and features for clickbait title classification from (Chakraborty et al., 2016);
• Sentiment: sentiment scores using lexicons (Recasens et al., 2013; Mitchell et al., 2013) and full systems (Hutto and Gilbert, 2014);
• Engagement: number of shares, reactions, and comments on Facebook;
• Topic: lexicon features to differentiate between science topics and personal concerns;
• Complexity: type-token ratio, readability, number of cognitive process words (identifying discrepancy, insight, certainty, etc.);
• Bias: features modeling bias using lexicons (Recasens et al., 2013; Mukherjee and Weikum, 2015) and subjectivity as calculated using pre-trained classifiers;
• Morality: features based on the Moral Foundation Theory (Graham et al., 2009) and lexicons (Lin et al., 2017).
Further details are available in (Horne et al., 2018b). For each target medium, we retrieve some articles, then we calculate these features separately for the title and for the body of each article, and finally we average the values of the 141 features over the set of retrieved articles.
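The aggregation step described above (per-article features, computed separately for title and body, then averaged per medium) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the "title"/"body" dictionary keys, the toy two-dimensional vectors, and the choice to concatenate the two averaged vectors are ours and are not taken from the feature extraction code of Horne et al. (2018b).

```python
from typing import Dict, List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized feature vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def average_article_features(articles: List[Dict[str, List[float]]]) -> List[float]:
    """Average title and body features over the articles of one medium.

    Each article is a dict with the hypothetical keys "title" and "body",
    holding the feature vector computed on that part of the article.
    Concatenating the two averages is an assumption of this sketch.
    """
    title_avg = mean_vector([a["title"] for a in articles])
    body_avg = mean_vector([a["body"] for a in articles])
    return title_avg + body_avg


# Toy example: two articles, two features per article part.
articles = [
    {"title": [1.0, 0.0], "body": [0.5, 2.0]},
    {"title": [3.0, 1.0], "body": [1.5, 4.0]},
]
medium_vector = average_article_features(articles)
```

Averaging yields a fixed-length representation per medium regardless of how many articles were retrieved for it.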
Wikipedia We further leverage Wikipedia as an additional source of information that can help predict the factuality of reporting and the bias of a target medium. For example, the absence of a Wikipedia page may indicate that a website is not credible. Also, the content of the page might explicitly mention that a certain website is satirical, left-wing, or has some property related to our task. Accordingly, we extract the following features:
• Has Page: indicates whether the target medium has a Wikipedia page;
• Vector representation for each of the following segments of the Wikipedia page, whenever applicable: Content, Infobox, Summary, Categories, and Table of Contents. We generate these representations by averaging the word embeddings (pretrained word2vec embeddings) of the corresponding words.
Twitter Given the proliferation of social media, most news media have Twitter accounts, which they use to reach out to more users online. The information that can be extracted from a news medium's Twitter profile can be valuable for our tasks. In particular, we use the following features:
• Has Account: Whether the medium has a Twitter account. We check this based on the top results for a search against Google, restricting the domain to twitter.com. The idea is that media that publish unreliable information might have no Twitter accounts.
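To make the segment representation concrete, the following Python sketch averages word vectors over the words of one page segment. It is a simplified illustration: the tiny `toy_embeddings` table stands in for pretrained word2vec vectors, and the whitespace tokenization, the lowercasing, the skipping of out-of-vocabulary words, and the zero vector for empty segments are all assumptions of the sketch rather than documented choices of our system.

```python
from typing import Dict, List


def average_embedding(text: str, embeddings: Dict[str, List[float]], dim: int) -> List[float]:
    """Average the embeddings of the in-vocabulary words of `text`."""
    tokens = text.lower().split()
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # Assumption: an empty or fully out-of-vocabulary segment maps to zeros.
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy stand-in for pretrained word2vec vectors (2 dimensions instead of hundreds).
toy_embeddings = {
    "satirical": [1.0, 0.0],
    "newspaper": [0.0, 1.0],
}

# Hypothetical segments of a medium's Wikipedia page.
segments = {
    "summary": "A satirical newspaper",
    "categories": "",
}
wiki_vectors = {
    name: average_embedding(text, toy_embeddings, dim=2)
    for name, text in segments.items()
}
```

The same averaging scheme applies to any short text field for which a fixed-length vector is needed.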
• Verified: Whether the account is verified by Twitter. The assumption is that "fake news" media would be less likely to have their Twitter account verified. They might be interested in pushing their content to users via Twitter, but they would also be cautious about revealing who they are (which is required by Twitter to get them verified).
• Created: The year the account was created. The idea is that accounts that have been active over a longer period of time are more likely to belong to established media.
• Has Location: Whether the account provides information about its location. The idea is that established media are likely to have this public, while "fake news" media may want to hide it.
• URL Match: Whether the account includes a URL to the medium, and whether it matches the URL we started the search with. Established media are interested in attracting traffic to their website, while fake media might not. Moreover, some fake accounts mimic genuine media, but have a slightly different domain, e.g., .com.co instead of .com.
• Counts: Statistics about the number of friends, statuses, and favorites. Established media might have higher values for these.
• Description: A vector representation generated by averaging the Google News embeddings (Mikolov et al., 2013) of all words of the profile description paragraph. These short descriptions might contain an open declaration of partisanship, i.e., left or right political ideology (bias). This could also help predict factuality as extreme partisanship often implies low factuality. In contrast, "fake news" media might just leave this description empty, while high-quality media would want to give some information about who they are.
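As an illustration of the URL Match feature from the list above, the following Python sketch compares the host name found in a Twitter profile with the host name of the medium. The helper names and the normalization (lowercasing and dropping a leading "www.") are assumptions made for this example; the sketch does not describe how our extractor obtains the profile URL.

```python
from urllib.parse import urlparse


def normalized_host(url: str) -> str:
    """Return the lowercased host name of `url` without a leading 'www.'."""
    if "://" not in url:
        url = "http://" + url  # urlparse only finds the host if a scheme is present
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def url_match(profile_url: str, medium_url: str) -> bool:
    """True if the URL listed in the Twitter profile points to the medium's host."""
    if not profile_url:
        return False
    return normalized_host(profile_url) == normalized_host(medium_url)
```

With this comparison, a profile that links to ABCnews.com would not match a medium hosted at ABCnews.com.co, which is the kind of mismatch the feature is meant to expose.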
URL We also collect additional information from the website's URL using character-based modeling and hand-crafted features. URL features are commonly used in phishing website detection systems to identify malicious URLs that aim to mislead users (Ma et al., 2009). As we want to predict a website's factuality, using URL features is justified by the fact that low-quality websites sometimes try to mimic popular news media by using a URL that looks similar to the credible source. We use the following URL-related features:
• Character-based: Used to model the URL by representing it in the form of a one-hot vector of character n-grams, where n ∈ [2, 5]. Note that these features are not used in the final system as they could not outperform the baseline (when used in isolation).
• Orthographic: These features are very effective for detecting phishing websites, as malicious URLs tend to make excessive use of special characters and sections, and ultimately end up being longer. For this work, we use the length of the URL, the number of sections and the excessive use of special characters such as digits, hyphens and dashes. In particular, we identify whether the URL contains digits, dashes or underscores as individual symbols, which were found to be useful as features for detecting phishing URLs (Basnet et al., 2014). We also check whether the URL contains short (less than three symbols) or long sections (more than ten symbols), as a high number of such sections could indicate an irregular URL.
• Credibility: Model the website's URL credibility by analyzing whether it (i) uses https://, (ii) resides on a blog-hosting platform such as blogger.com, and (iii) uses a special top-level domain, e.g., .gov is for governmental websites, which are generally credible and unbiased, whereas .co is often used to mimic .com.
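The URL features listed above can be approximated with a few lines of Python. The sketch below is illustrative only: treating the dot-separated parts of the host name as "sections", the particular `BLOG_HOSTS` list, and the feature names in the returned dictionary are assumptions of this example, not a specification of our implementation.

```python
from typing import Dict, List
from urllib.parse import urlparse

# Illustrative list of blog-hosting platforms (an assumption of this sketch).
BLOG_HOSTS = ("blogger.com", "blogspot.com", "wordpress.com")


def char_ngrams(text: str, n_min: int = 2, n_max: int = 5) -> List[str]:
    """All character n-grams of `text` for n_min <= n <= n_max."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def url_features(url: str) -> Dict[str, object]:
    """Orthographic and credibility features of a URL that includes its scheme."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Assumption: a "section" is a dot-separated part of the host name.
    sections = [s for s in host.split(".") if s]
    return {
        "length": len(url),
        "num_sections": len(sections),
        "has_digit": any(c.isdigit() for c in host),
        "has_dash": "-" in host,
        "has_underscore": "_" in host,
        "num_short_sections": sum(len(s) < 3 for s in sections),
        "num_long_sections": sum(len(s) > 10 for s in sections),
        "uses_https": parsed.scheme == "https",
        "on_blog_platform": any(host == b or host.endswith("." + b) for b in BLOG_HOSTS),
        "tld": sections[-1] if sections else "",
    }


fake = url_features("http://abcnews.com.co")
blog = url_features("https://example.blogspot.com")
```

For the character-based representation, each distinct n-gram returned by `char_ngrams` would correspond to one dimension of the one-hot vector.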
Web Traffic Analyzing the web traffic to the website of the medium might be useful for detecting phishing-like websites that appear and disappear in certain patterns. Here, we only use the reciprocal value of the website's Alexa Rank, which is a global ranking for over 30 million websites in terms of the traffic they receive.
We evaluate the above features in Section 4, both individually and as groups, in order to determine which ones are important to predict factuality and bias, and also to identify the ones that are worth further investigation in future work.
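For completeness, the traffic feature described above (the reciprocal of the Alexa Rank) amounts to a one-line transformation, sketched here in Python. The function name and the fallback value of 0.0 for websites without a rank are assumptions of this sketch.

```python
from typing import Optional


def traffic_feature(alexa_rank: Optional[int]) -> float:
    """Reciprocal of the Alexa Rank, so that popular sites get larger values."""
    if alexa_rank is None or alexa_rank <= 0:
        # Assumption: unranked websites are mapped to 0.0.
        return 0.0
    return 1.0 / alexa_rank
```

Taking the reciprocal maps the top-ranked website to 1.0 and pushes websites with very large ranks towards 0.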

Data
We use information about news media listed on the Media Bias/Fact Check (MBFC) website, which contains manual annotations and analysis of the factuality of reporting and/or bias for over 2,000 news websites. Our dataset includes 1,066 websites for which both bias and factuality labels were explicitly provided, or could be easily inferred (e.g., satire is of low factuality). Some examples from our dataset are presented in Table 1 for factuality of reporting, and in Table 2 for bias. In both tables, we show the names of the media, as well as their corresponding Twitter handles and Wikipedia pages, which we found automatically. Overall, 64% of the websites in our dataset have Wikipedia pages, and 94% have Twitter accounts. In cases of "fake news" sites that try to mimic real ones, e.g., ABCnews.com.co is a fake version of ABCnews.com, it is possible that our Twitter extractor returns the handle for the real medium. This is where the URL Match feature comes in handy (see above). Table 3 provides detailed statistics about the dataset. Note that we have 1-2 orders of magnitude more media sources than what has been used in previous studies, as we already mentioned in Section 2 above.
In order to compute the article-related features, we did the following: (i) we crawled 10-100 articles per website (a total of 94,814), (ii) we computed a feature vector for each article, and (iii) we averaged the feature vectors for the articles from the same website to obtain the final vector of article-related features.

Experimental Setup
We used the above features in a Support Vector Machine (SVM) classifier, training a separate model for factuality and for bias. We report results for 5-fold cross-validation. We tuned the SVM hyper-parameters, i.e., the cost C, the kernel type, and the kernel width γ, using an internal cross-validation on the training set and optimizing macro-averaged F1. Generally, the RBF kernel performed better than the linear kernel. We report accuracy and macro-averaged F1 score. We also report Mean Absolute Error (MAE), which is relevant given the ordinal nature of both the factuality and the bias classes, and also MAE^M, which is a macro-averaged variant of MAE that is more robust to class imbalance. See (Baccianella et al., 2009; Rosenthal et al., 2017) for more details about MAE^M vs. MAE.
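The difference between the two error measures can be illustrated with a short Python sketch. It assumes that the ordinal classes are encoded as consecutive integers (a hypothetical encoding 0 < 1 < 2 is used in the toy example) and that the macro-averaged variant computes the error separately for each gold class before averaging over classes, following Baccianella et al. (2009); the function names are ours.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def mae(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean absolute error over all examples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_mae(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """MAE computed per gold class, then averaged over the classes."""
    errors: Dict[int, List[int]] = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        errors[t].append(abs(t - p))
    return sum(sum(e) / len(e) for e in errors.values()) / len(errors)


# Toy imbalanced example: the majority class 0 is always predicted correctly,
# while the single examples of classes 1 and 2 are both misclassified.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 2, 0]
overall = mae(y_true, y_pred)
per_class = macro_mae(y_true, y_pred)
```

In this toy case the plain MAE is dominated by the well-predicted majority class, whereas the macro-averaged variant gives each class equal weight and therefore reports a larger error.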

Results and Discussion
We present in Table 4 the results of using features from the different sources proposed in Section 3. We start by describing the contribution of each feature type towards factuality and bias.
We can see that the textual features extracted from the ARTICLES yielded the best performance on factuality. They also perform well on bias, being the only type that beats the baseline on MAE. These results indicate the importance of analyzing the contents of the target website. They also show that using the titles only is not enough, and that the article bodies contain important information that should not be ignored.
Overall, the WIKIPEDIA features are less useful for factuality, and perform reasonably well for bias. The best features from this family are those about the page content, which includes a general description of the medium, its history, ideology and other information that can be potentially helpful. Interestingly, the has page feature alone yields sizable improvement over the baseline, especially for factuality. This makes sense given that trustworthy websites are more likely to have Wikipedia pages; yet, this feature does not help much for predicting political bias.

The TWITTER features perform moderately for factuality and poorly for bias. This is not surprising, as we normally may not be able to tell much about the political ideology of a website just by looking at its Twitter profile (not its tweets!) unless something is mentioned in its description, which turns out to perform better than the rest of the features from this family. We can see that the has twitter feature is less effective than has wiki for factuality, which makes sense given that Twitter is less regulated than Wikipedia. Note that the counts features yield reasonable performance, indicating that information about activity (e.g., number of statuses) and social connectivity (e.g., number of followers) is useful. Overall, the TWITTER features seem to complement each other, as their union yields the best performance on factuality.
The URL features are better used for factuality rather than bias prediction. This is mainly due to the nature of these features, which are aimed at detecting phishing websites, as we mentioned in Section 3. Overall, this feature family yields slight improvements, suggesting that it can be useful when used together with other features.
Finally, the Alexa rank does not improve over the baseline, which suggests that more sophisticated TRAFFIC-related features might be needed.

Ablation Study
Finally, we performed an ablation study in order to evaluate the impact of removing one family of features at a time, as compared to the FULL system, which uses all the features. We can see in Tables 5 and 6 that the FULL system achieved the best results for factuality, and the best macro-F1 for bias, suggesting that the different types of features are largely complementary and capture different aspects that are all important for making a good classification decision.
For factuality, excluding the WIKIPEDIA features yielded the biggest drop in performance. This suggests that they provide information that may not be available in other sources, including the ARTICLES, which achieved better results alone. On the other hand, excluding the TRAFFIC feature had no effect on the model's performance.
For bias, we experimented with classification on both a 7-point and a 3-point scale. Similarly to factuality, the results in Table 6 indicate that WIKIPEDIA offers complementary information that is critical for bias prediction, while TRAFFIC makes virtually no difference.
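The ablation protocol described in this section (the FULL system versus variants with one feature family removed) can be sketched as a simple loop in Python. The `toy_evaluate` function below is a placeholder that merely counts the families kept; in an actual experiment it would be replaced by training and cross-validating a classifier on the retained features. All names in the sketch are hypothetical.

```python
from typing import Callable, Dict, List

GROUPS = ["articles", "wikipedia", "twitter", "url", "traffic"]


def ablation(groups: List[str], evaluate: Callable[[List[str]], float]) -> Dict[str, float]:
    """Score the full system and each variant with one feature family removed."""
    results = {"FULL": evaluate(groups)}
    for removed in groups:
        kept = [g for g in groups if g != removed]
        results["FULL - " + removed] = evaluate(kept)
    return results


def toy_evaluate(kept: List[str]) -> float:
    """Stand-in scorer for illustration only: counts the families kept."""
    return float(len(kept))


scores = ablation(GROUPS, toy_evaluate)
```

Comparing each "FULL - family" score against the "FULL" score indicates how much that family contributes beyond the others.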

Conclusion and Future Work
We have presented a study on predicting factuality of reporting and bias of news media, focusing on characterizing them as a whole. These are under-studied, but arguably important research problems, both in their own right and as a prior for fact-checking systems.
We have created a new dataset of news media sources that has annotations for both tasks and is 1-2 orders of magnitude larger than what was used in previous work. We are releasing the dataset and our code, which should facilitate future research.
We have experimented with a rich set of features derived from the contents of (i) a sample of articles from the target news medium, (ii) its Wikipedia page, (iii) its Twitter account, (iv) the structure of its URL, and (v) information about the Web traffic it has attracted. This combination, as well as some of the types of features, is novel for this problem.
Our evaluation results have shown that most of these features have a notable impact on performance, with the articles from the target website, its Wikipedia page, and its Twitter account being the most important (in this order). We further performed an ablation study of the impact of the individual types of features for both tasks, which could give general directions for future research.
In future work, we plan to address the task as ordinal regression, and further to model the inter-dependencies between factuality and bias in a joint model. We are also interested in characterizing the factuality of reporting for media in other languages. Finally, we want to go beyond the left vs. right bias that is typical of the Western world and to model other kinds of biases that are more relevant for other regions, e.g., Islamist vs. secular in the Muslim world.