Belittling the Source: Trustworthiness Indicators to Obfuscate Fake News on the Web

With the growth of the internet, the amount of fake news online has been growing every year. The consequences of this phenomenon are manifold, ranging from poor decision-making to episodes of bullying and violence. Therefore, fact-checking algorithms have become a valuable asset. To this aim, an important step in detecting fake news is to have access to a credibility score for a given information source. However, most of the widely used Web indicators have either been shut down to the public (e.g., Google PageRank) or are not free to use (e.g., Alexa Rank). Further existing databases are short, manually curated lists of online sources, which do not scale. Finally, most of the research on the topic is theoretical or explores confidential data in a restricted simulation environment. In this paper, we explore current research, highlight the challenges and propose solutions to tackle the problem of classifying websites on a credibility scale. The proposed model automatically extracts source reputation cues and computes a credibility factor, providing valuable insights which can help in belittling dubious and confirming trustful unknown websites. Experimental results outperform the state of the art in the 2-classes and 5-classes settings.


Introduction
With the enormous daily growth of the Web, the number of fake-news sources has also been increasing considerably (Li et al., 2012). The social network era has provoked a communication revolution that boosted the spread of misinformation, hoaxes, lies and questionable claims. The proliferation of unregulated sources of information allows any person to become an opinion provider with no restrictions. For instance, websites spreading manipulative political content or hoaxes can be persuasive. To tackle this problem, different fact-checking tools and frameworks have been proposed (Zubiaga et al., 2017), mainly divided into two categories: fact-checking over natural language claims (Thorne and Vlachos, 2018) and fact-checking over knowledge bases, i.e., triple-based approaches (Esteves et al., 2018). Overall, fact-checking algorithms aim at determining the veracity of claims, which is considered a very challenging task due to the nature of the underlying steps, from natural language understanding (e.g., argumentation mining) to common-sense verification (i.e., humans have prior knowledge that makes it far easier to judge which arguments are plausible and which are not). Yet an important underlying fact-checking step relies upon computing the credibility of sources of information, i.e., indicators that allow answering the question: "How reliable is a given provider of information?".
Due to the obvious importance of the Web and the negative impact that misinformation can cause, methods to demote the importance of websites also become a valuable asset. In this sense, the high number of new websites appearing every day (Netcraft, 2016) makes straightforward approaches, such as blacklists and whitelists, impractical. Moreover, such approaches are not designed to compute credibility scores for a given website but rather to assign it a binary label. Thus, they aim at detecting mostly "fake" (threatening) websites, e.g., phishing detection, which is out of the scope of this work. Open credibility models are therefore of great importance, especially given the increase in fake news being propagated.
There is much research into credibility factors. However, it mostly falls into two groups: (1) theoretical research on psychological aspects of credibility and (2) experiments performed over private and confidential user information, mostly from web browser activities (strongly supported by private companies). Therefore, while (1) lacks practical results, (2) reports findings which are not very appealing to the broad open-source community, given the non-open character of the conducted experiments and data privacy. Finally, recent research on credibility has also pointed out important drawbacks, as follows:
1. Manual (human) annotation of credibility indicators for a set of websites is costly (Haas and Unkel, 2017).
2. Search engine results pages (SERPs) do not provide more than a few information cues (URL, title and snippet), and the dominant heuristic happens to be the search engine (SE) rank itself (Haas and Unkel, 2017).
3. Only around 42.67% of the websites are covered by the credibility evaluation knowledge base, where most domains have a low credibility confidence (Liu et al., 2015).
Therefore, automated credibility models play an important role in the community, although they have not yet been broadly explored in practice. In this paper, we focus on designing computational models to predict the credibility of a given website rather than performing sociological experiments or experiments with end users (simulations). In this scenario, we expect that a website from a domain such as bbc.com gets a higher trustworthiness score than one from wordpress.com, for instance.
+ Work was completed while the author was a student at the Birla Institute of Technology and Science, India and was interning at SDA Research.
* These two authors contributed equally to this work.

Related Work
Credibility is an important research subject in several different communities and has been studied over the past decades. Most of the research, however, focuses on theoretical aspects of credibility and its persuasive effect on different fundamental problems, such as economic theories (Sobel, 1985).

Fundamental Research
A thorough examination of psychological aspects of evaluating document credibility has been carried out (Fogg and Tseng, 1999; Fogg et al., 2001, 2003), reporting numerous challenges. Apart from sociological experiments, Web Credibility, from a more practical perspective, has a different research focus, described as follows:
Rating Systems, Simulations: these are mostly platform-based solutions to conduct experiments (mostly using private data) in order to detect credibility factors. Nakamura et al. (2007) surveyed internet users from all age groups to understand how they identified trustworthy websites. Based on the results of this survey, they built a graph-based ranking method which helped users in gauging the trustworthiness of search results retrieved by a search engine when issued a query Q. A study by Stanford University revealed important factors that people notice when assessing website credibility (Fogg et al., 2003), mostly visual aspects (website design, look and information design). The writing style and bias of information play a small role in defining the level of credibility (selected by approximately 10% of the comments). However, this process of evaluating the credibility of web pages by users is impacted only by the number of heuristics they are aware of (Fogg, 2003), biasing the human evaluation w.r.t. a limited and specific set of features. An important factor considered by humans when judging credibility is the search engine results page (SERP): the higher a website is ranked compared to other retrieved websites, the more credible people judge it to be (Schwarz and Morris, 2011). Popularity is yet another major credibility factor (Giudice, 2010). Liu et al. (2015) proposed to integrate recommendation functionality into a Web Credibility Evaluation System (WCES), focusing on the user's feedback. Shah et al. (2015) propose a full list of important features for credibility aspects, such as 1) the quality of the design of the website and 2) how well the information is structured. In particular, the perceived accuracy of the information was ranked only in 6th place. Thus, superficial website characteristics used as heuristics play a key role in credibility evaluation. Dong et al. (2015) propose a different method (KBT) to estimate the trustworthiness of a web source based on the information given by the source (i.e., it applies fact-checking to infer credibility). This information is represented in the form of triples extracted from the web source. The trustworthiness of the source is determined by the correctness of the extracted triples. Thus, the score is computed based on endogenous signals (e.g., correctness of facts) rather than exogenous signals (e.g., links).
Unfortunately, this research from Google does not provide open data. It is worth mentioning that, surprisingly, their hypothesis (content is more important than visual aspects) contradicts previous research findings (Fogg et al., 2003; Shah et al., 2015). While this might be due to the dynamic character of the Web, this contradiction highlights the need for more research into the real use of web credibility factors w.r.t. automated web credibility models. Similar to Nakamura et al. (2007), Singal and Kohli (2016) propose a tool (dubbed TNM) to re-rank URLs extracted from the Google search engine according to the trust maintained by the actual users. Apart from the search engine API, their tool uses several other APIs to collect website usage information (e.g., traffic and engagement info). Kakol et al. (2017) perform extensive crowdsourcing experiments that contain credibility evaluations, textual comments, and labels for these comments.
SPAM/phishing detection: Abbasi et al. (2010) propose a set of design guidelines which advocate the development of SLT-based classification systems for fraudulent website detection, i.e., for websites that, despite seeming credible, try to obtain private information and defraud visitors. PhishZoo (Afroz and Greenstadt, 2011) is a phishing detection system which helps users identify phishing websites that look similar to a given set of protected websites through the creation of profiles.

Automated Web Credibility
Automated Web Credibility models for website classification have not been broadly explored in practice. The aim is to produce a predictive model given training data (annotated website ranks) regardless of an input query Q. Existing gold standard data is generated from surveys and simulations (see the Rating Systems, Simulations related work). Currently, state of the art (SOTA) experiments rely on the Microsoft Credibility dataset (Schwarz and Morris, 2011). Recent research uses the website labels (Likert scale) released in the Microsoft dataset as a gold standard to train automated web credibility models, as follows: Olteanu et al. (2013) propose a number of properties (37 linguistic and textual features) and apply machine learning methods to recognize trust levels, obtaining 22 relevant features for the task. Wawer et al. (2014) improve on this work using psychosocial and psycholinguistic features (through the General Inquirer (GI) Lexical Database (Stone and Hunt, 1963)), achieving state of the art results.
Finally, another resource is the Content Credibility Corpus (C3) (Kakol et al., 2017), the largest Web credibility corpus publicly available so far. However, in that work the authors did not perform experiments w.r.t. automated credibility models using a standard measure (i.e., Likert scale), as in (Olteanu et al., 2013; Wawer et al., 2014). Instead, they focused on evaluating the theories of web credibility in order to produce a much larger and richer corpus. According to (Olteanu et al., 2013), 22 features (out of 37) were selected as the most significant (10 content-based features and all social-based features). Surprisingly (but in line with (Dong et al., 2015)), none came from the Appearance subgroup, although studies have systematically shown the opposite, i.e., that visual aspects are among the most important features (Fogg et al., 2003; Shah et al., 2015; Haas and Unkel, 2017).
In this picture, we claim that the most negative aspect is the reliance on Social-based features. This dependency not only affects the final performance of the credibility model but also implies financial costs; moreover, these features have a high discriminative capacity, adding a strong bias to the performance of the model. The computation of these features relies heavily on external (e.g., Facebook API and AdBlock) and commercial libraries (Alchemy, PageRank, Alexa Rank). Thus, engineering and financial costs are unavoidable. Furthermore, popularity on Facebook or Twitter can be measured only by data owners. Additionally, vendors may change the underlying algorithms without further explanation. Therefore, also following Wawer et al. (2014), in this paper we have excluded Social-based features from our experimental setup.
On top of that, Wawer et al. (2014) extended the model, adding features extracted from the General Inquirer (GI) Lexical Database, resulting in a vector of 183 extra categories in addition to the 22 selected base features, i.e., a total of 205 features (however, this is subject to contradictions; please see Section 4.1 for more information).

Website credibility evaluation
Microsoft Dataset (Schwarz and Morris, 2011) consists of thousands of URLs and their credibility ratings (five-point Likert scale), ranging from 1 ("very non-credible") to 5 ("very credible"). In this study, participants were asked to rate the websites as credible following the definition: "A credible webpage is one whose information one can accept as the truth without needing to look elsewhere". Studies by (Olteanu et al., 2013; Wawer et al., 2014) use this dataset for evaluation.
Content Credibility Corpus (C3) is the most recent and the largest credibility dataset currently publicly available for research (Kakol et al., 2017). It contains 15,750 evaluations of 5,543 URLs from 2,041 participants, with some additional information about website characteristics and basic demographic features of users. Among the metadata available in the dataset, in this work we are only interested in the URLs and their respective five-point Likert scale ratings, so that we obtain the same information available in the Microsoft dataset.

Fact-checking influence
In order to verify the impact of our web credibility model in a real use-case scenario, we ran a fact-checking framework to verify a set of input claims. Then we collected the sources (URLs) containing proofs to support a given claim. We used this as a dataset to evaluate our web credibility model.
The primary objective is to verify whether our model is able, on average, to assign lower scores to the websites that contain proofs supporting claims which are labeled as false in the FactBench dataset (i.e., the source is providing false information and thus should have a lower credibility score). Similarly, we expect that websites that support positive claims are assigned higher scores (i.e., the source is supporting an accurate claim and thus should have a higher credibility score).
The (gold standard) input claims were obtained from the FactBench dataset, a multilingual benchmark for the evaluation of fact validation algorithms. It contains a set of RDF models (10 different relations), where each model contains a single fact expressed as a subject-predicate-object triple. The data was automatically extracted from the DBpedia and Freebase KBs and manually curated in order to generate true and false examples.
The website list extraction was carried out by DeFacto (Gerber et al., 2015), a fact-checking framework designed for RDF KBs. DeFacto returns a set of websites as pieces of evidence to support its prediction (true or false) for a given input claim.
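As a rough illustration of the check described above, the following sketch groups predicted credibility scores by the label of the claim each source supports and compares the two averages. The record layout, the URLs and the scores are hypothetical placeholders chosen for the example; they are neither DeFacto output nor results from this work.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_claim_label(records):
    """Average the credibility score of evidence sources per claim label.

    `records` is an iterable of (url, claim_is_true, credibility_score)
    tuples; this layout is an assumption made for the example.
    """
    grouped = defaultdict(list)
    for _url, claim_is_true, score in records:
        grouped[claim_is_true].append(score)
    return {label: mean(scores) for label, scores in grouped.items()}


# Hypothetical evidence sources with made-up scores on a 1-5 scale.
records = [
    ("http://example.org/a", True, 4.2),
    ("http://example.org/b", True, 3.8),
    ("http://example.com/c", False, 2.1),
    ("http://example.com/d", False, 2.9),
]

averages = mean_score_by_claim_label(records)
# The expectation stated in the text: sources backing true claims score higher.
expectation_holds = averages[True] > averages[False]
```

With real data, the same comparison would be run over the sources returned for each FactBench claim.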

Final Features
We implemented a set of Content-based features (Section 3.1), adding more lexical and textual features. Social-based features were not considered due to the financial costs associated with paid APIs. The final set of features for each website $w$ is defined as follows:
1. Web Archive: the temporal information w.r.t. cache and freshness. $\Delta_b$ and $\Delta_e$ correspond to the temporal differences of the first and last 2 updates, respectively. $\Delta_a$ represents the age of $w$ and, finally, $\Delta_u$ represents the temporal difference between the last update and today. $\gamma$ is a penalization factor applied when the information is obtained from the domain of $w$ ($w_d$) instead of $w$.
2. Domain: refers to the (encoded) domain of $w$ (e.g., org).
3. Authority: searches for authoritative keywords within the page HTML content $w_c$ (e.g., contact email, business address, etc.).
4. Outbound Links: searches the number of different outbound links in $w \wedge w_d \in d$, i.e., $\sum_{n=1}^{P} \phi(w_c)$, where $P$ is the number of web-based protocols.
5. Text Category: returns a vector containing the probabilities $P$ for each pre-trained category $c$ of $w$ w.r.t. the sentences of the website $w_s$ and the page title $w_t$: $\sum_{s=1}^{w_s} \gamma(s)\, \gamma(w_t)$. We trained a set of binary multinomial Naive Bayes (NB) classifiers, one per class, as follows: business, entertainment, politics, religion, sports and tech.
6. Text Category - LexRank: reduces the noise of $w_b$ by classifying only the top $N$ sentences generated by applying LexRank (Erkan and Radev, 2004), a graph-based text summarization technique, over $w_b$ ($S = \Gamma(w_b, N)$): $\sum_{s=1}^{S} \gamma(s)\, \gamma(w_t)$.
7. Text Category - LSA: similarly, we apply Latent Semantic Analysis (LSA) (Steinberger and Ježek, 2004) to detect semantically important sentences in $w_b$ ($S = \Omega(w_b, N)$): $\sum_{s=1}^{S} \gamma(s)\, \gamma(w_t)$.
8. Readability Metrics: returns a vector resulting from the concatenation of $R$ readability metrics (Si and Callan, 2001).
9. SPAM:
12. PageRankCC: PageRank information computed through the CommonCrawl Corpus.
13. General Inquirer (Stone and Hunt, 1963): a 182-length vector containing several lexicons.
14. Vader Lexicon: lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments.
15. HTML2Seq: we introduce the concept of bag-of-tags, where, similarly to bag-of-words, we group the HTML tag occurrences in each website. We additionally explore this concept as a sequence problem, i.e., we encode the tags and evaluate them considering a window size (offset) from the header of the page.
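To make the bag-of-tags idea from the HTML2Seq item more concrete, here is a minimal sketch that counts opening and closing tag occurrences in a page using only the Python standard library. The class and function names, the "/" prefix for closing tags and the sample page are assumptions made for the example; the original implementation may differ.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects tag names in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        # Closing tags are kept as separate tokens, prefixed with "/".
        self.tags.append("/" + tag)


def bag_of_tags(html):
    """Return tag -> occurrence count, analogous to a bag-of-words."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return Counter(collector.tags)


# Made-up page used only to exercise the function.
page = (
    "<html><head><title>News</title></head>"
    "<body><p>Hi</p><p>Bye</p></body></html>"
)
bag = bag_of_tags(page)
```

The resulting counts can be turned into a fixed-length feature vector by fixing a tag vocabulary, in the same way a bag-of-words vector is built.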

Experiments
Previous research proposes two application settings w.r.t. the classification itself, as follows: (A.1) casting the credibility problem as a classification problem and (A.2) evaluating the credibility on a five-point Likert scale (regression). In the classification scenario, the models are evaluated both w.r.t. the 2-classes as well as the 3-classes setting. In the 2-classes scenario, websites rated from 1 to 3 are labeled as "low", whereas 4 and 5 are labeled as "high" (credibility). Analogously, in the 3-classes scenario, websites labeled as 1 and 2 are converted to "low", 3 remains as "medium", while 4 and 5 are grouped into the "high" class.
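The label conversion described above can be written down directly. The sketch below assumes integer ratings between 1 and 5; the function names are illustrative only.

```python
def _check(rating: int) -> None:
    # Assumption: ratings are integers on the five-point Likert scale.
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError(f"expected a rating in 1..5, got {rating!r}")


def to_two_classes(rating: int) -> str:
    """2-classes setting: 1-3 -> "low", 4-5 -> "high"."""
    _check(rating)
    return "low" if rating <= 3 else "high"


def to_three_classes(rating: int) -> str:
    """3-classes setting: 1-2 -> "low", 3 -> "medium", 4-5 -> "high"."""
    _check(rating)
    if rating <= 2:
        return "low"
    if rating == 3:
        return "medium"
    return "high"


two = [to_two_classes(r) for r in range(1, 6)]
three = [to_three_classes(r) for r in range(1, 6)]
```

Note that a rating of 3 counts as "low" in the 2-classes setting but becomes its own "medium" class in the 3-classes setting.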
We first explore the impact of the bag-of-tags strategy. We encode and convert the tags into a sequence of tags, similar to a sequence of sentences (looking for opening and closing tags, e.g., <a> and </a>). We then perform document classification over the resulting vectors. Figures 1 to 4 show the results of this strategy for both the 2-classes and 3-classes scenarios. The x-axis is the log scale of the paddings (i.e., the offset of HTML tags we retrieved from w, ranging from 25 to 10,000). The charts reveal an interesting pattern in both gold-standard datasets (Microsoft Dataset and C3 Corpus): the first tags are the most relevant to predict the credibility class. Although this strategy does not achieve state of the art performance (F1 = 0.690 and 0.571 for the 2-classes and 3-classes configurations, respectively, compared to the state of the art: F1 = 0.745 and 0.652), it presents reasonable performance by just inspecting website metadata. However, it is worth mentioning that the main advantage of this approach lies in the fact that it is language agnostic (while current research focuses on English) as well as less susceptible to overfitting.
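The padding described above can be illustrated with a small sketch: the first `pad` tags of a page are mapped to integer ids, and shorter pages are filled with a reserved padding id. The vocabulary handling, the padding id of 0 and the helper names are assumptions for the example, not details confirmed by the experiments reported here.

```python
from html.parser import HTMLParser

PAD_ID = 0  # assumption: id 0 is reserved for padding


class TagSequenceParser(HTMLParser):
    """Records opening and closing tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def encode_tags(html, vocab, pad):
    """Encode the first `pad` tags as ids; right-pad shorter sequences."""
    parser = TagSequenceParser()
    parser.feed(html)
    parser.close()
    ids = []
    for tag in parser.tags[:pad]:
        if tag not in vocab:
            vocab[tag] = len(vocab) + 1  # ids start at 1, 0 is padding
        ids.append(vocab[tag])
    return ids + [PAD_ID] * (pad - len(ids))


vocab = {}
short_page = "<html><body><a href='#'>x</a></body></html>"
long_page = "<ul>" + "<li>x</li>" * 20 + "</ul>"
short_seq = encode_tags(short_page, vocab, pad=8)  # padded at the end
long_seq = encode_tags(long_page, vocab, pad=5)    # truncated to 5 tags
```

Every page then yields a vector of the same length, so a standard document classifier can be trained on the sequences for each padding value.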
We then evaluate the performance of the textual features (Section 3.3) in isolation. Results for the 2-classes scenario are presented as follows: Figure 5 highlights the performance of the best models using textual features only. While this single feature does not outperform the lexical features, combining it with the bag-of-tags approach (predictions of probabilities for each class) boosts the performance (F1 from 0.738 to 0.772) and outperforms the state of the art (0.745), as shown in Figure 6. Tables 1 to 3 show detailed results for both datasets (2-classes, 3-classes and 5-classes configurations, respectively). For the 5-class regression, we found that the best padding was pad = 100 for the Microsoft dataset and pad = 175 for the C3 Corpus. We preceded the computation of both classification and regression models with feature selection according to a percentile of the highest scoring features (SelectKBest). We tested percentiles of 3, 5, 10, 25, 50, 75 and 100 (the latter meaning no selection) and did not find a single value of K that was best in every case. It is worth noting that, in general, it is easy to detect highly credible sources (F1 for the "high" class is around 0.80 in all experiments and on both datasets), but the recall of "low" credibility sources is still an issue.
For 1,500 claims, the fact-checking algorithm collected pieces of evidence for over 27,000 websites. Table 5 depicts the impact of the credibility model in the fact-checking context. We collected a small subset of 186 URLs from the FactBench dataset and manually annotated the credibility of each URL (following the Likert scale). The model correctly labeled around 80% of the URLs associated with a positive claim and, more importantly, 70% of the non-credible websites linked to false claims were correctly identified. This helps to minimize the number of non-credible information providers that contain information supporting a false claim.

Discussion
Reproducibility is still one of the cornerstones of science and scientific projects (Baker, 2016). In the following, we list some relevant issues encountered while performing our experiments:
Experimental results: this gap is also observed w.r.t. the results reported by (Olteanu et al., 2013), which is acknowledged by (Wawer et al., 2014), despite numerous attempts to replicate the experiments. The authors (Wawer et al., 2014) believe this is due to the lack of parameters and hyperparameters explicitly cited in the previous research (Olteanu et al., 2013).
Microsoft dataset: presents inconsistencies. Although all the web pages are cached (in theory) in order to guarantee a deterministic environment, the dataset, in its original form, has a number of problems, as follows: (a) web pages not physically cached, (b) URLs not matching (dataset links versus cached files) and (c) invalid file formats (e.g., PDF). Even though these issues have also been previously identified by related research (Olteanu et al., 2013), it is not clear what the URLs for the final dataset (i.e., the support) are nor where this new version is available.
Contradictions: the divergence regarding the importance of visual features between (Dong et al., 2015) and (Fogg, 2003; Shah et al., 2015) has drawn our attention, which corroborates the need for more methods to solve the web credibility problem in practice. The main hypothesis that supports this contradiction relies on the fact that feature-based credibility evaluation eventually ignites a cat-and-mouse play between scientists and people interested in manipulating the models. In this case, reinforcement learning methods pose as a good alternative. [...] (Wawer et al., 2014) that "solutions based purely on external APIs are difficult to use beyond scientific application and are prone for manipulation", confirming the need to exclude social features from the research of (Olteanu et al., 2013), contradicts itself. In the course of the experiments, the authors admit the usage of all features proposed by (Olteanu et al., 2013): "Table 1 presents regression results for the dataset described in [13] in its original version (37 features) and extended with 183 variables from the General Inquirer (to 221 features)".
Therefore, due to the number of relevant issues presented w.r.t. reproducibility and contradiction of arguments, the comparison to recent research becomes more difficult. In this work, we solved the technical issues in the Microsoft dataset and released a new fixed version. Also, since we need to perform evaluations in a deterministic environment, we cached and released the websites for the C3 corpus. After scraping, 2,977 URLs were used (out of 5,543). The others were left out due to processing errors (e.g., 404). The algorithms, their hyperparameters and further relevant metadata are available through the MEX Interchange Format (Esteves et al., 2015). By doing this, we provide a computational environment to perform safer comparisons, engaging in recent discussions about mechanisms to measure and enhance the reproducibility of scientific projects (Wilkinson et al., 2016).
In this work, we discuss existing alternatives, gaps and current challenges to tackle the problem of web credibility. More specifically, we focused on automated models to compute a credibility factor for a given website. This research follows the former studies presented by (Olteanu et al., 2013; Wawer et al., 2014) and presents several contributions. First, we propose different features to avoid the financial cost imposed by external APIs in order to access website credibility indicators. This issue has become even more relevant in the light of the challenges that have emerged after the shutdown of Google PageRank, for instance. To bridge this gap, we have proposed the concept of bag-of-tags. Similar to (Wawer et al., 2014), we conduct experiments in a high-dimensional feature space, but also consider web page metadata, which outperforms state of the art results in the 2-classes and 5-classes settings. Second, we identified and fixed several problems in a gold standard dataset for web credibility (Microsoft), as well as indexed several web pages for the C3 Corpus. Finally, we evaluate the impact of the model in a real fact-checking use-case. We show that the proposed model can help in belittling and supporting different websites that contain evidence of true and false claims, which helps the very challenging fact verification task. As future work, we plan to explore deep learning methods over the HTML2Seq module.

State-of-the-art (SOTA) Features
Recent research on credibility factors for websites (Olteanu et al., 2013) initially divided the features into the following logical groups:
1. Content-based (25 features): number of special characters in the text, spelling errors, website category, etc.
Table 4 shows statistics on the data generated by the fact-checking algorithm.

Table 4: FactBench: Web sites collected from claims