Truth of Varying Shades: Analyzing Language in Fake News and Political Fact-Checking

We present an analytic study on the language of news media in the context of political fact-checking and fake news detection. We compare the language of real news with that of satire, hoaxes, and propaganda to find linguistic characteristics of untrustworthy text. To probe the feasibility of automatic political fact-checking, we also present a case study based on PolitiFact.com using their factuality judgments on a 6-point scale. Experiments show that while media fact-checking remains an open research question, stylistic cues can help determine the truthfulness of text.


Introduction
Words in news media and political discourse have considerable power in shaping people's beliefs and opinions. As a result, their truthfulness is often compromised to maximize impact. Recently, fake news has captured worldwide interest, and the number of organized efforts dedicated solely to fact-checking has almost tripled since 2014. Organizations such as PolitiFact.com actively investigate and rate the veracity of comments made by public figures, journalists, and organizations. Figure 1 shows example quotes rated for truthfulness by PolitiFact. Per their analysis, one component of the two statements' ratings is the misleading phrasing (bolded in green in the figure). For instance, in the first example, the statement is true as stated, though only because the speaker hedged their meaning with the quantifier "just". In the second example, two correlated events (Brexit and Google search trends) are presented ambiguously as if they were directly linked.
Importantly, like the examples above, most fact-checked statements on PolitiFact are rated as neither entirely true nor entirely false. Analysis indicates that falsehoods often arise from subtle differences in phrasing rather than outright fabrication (Rubin et al., 2015). Compared to most prior work in the deception literature, which focused on binary categorization of truth and deception, political fact-checking poses a new challenge because it involves a graded notion of truthfulness.
While political fact-checking generally focuses on examining the accuracy of a single quoted statement by a public figure, the reliability of general news stories is also a concern (Connolly et al., 2016; Perrott, 2016). Figure 2 illustrates news types categorized along two dimensions: the intent of the authors (desire to deceive) and the content of the articles (true, mixed, false).
Figure 2: Types of news articles categorized based on their intent and information quality.
In this paper, we present an analytic study characterizing the language of political quotes and news media written with varying intents and degrees of truth. We also investigate graded deception detection, determining the truthfulness on a 6-point scale using the political fact-checking database available at PolitiFact.

Fake News Analysis
News Corpus with Varying Reliability
To analyze linguistic patterns across different types of articles, we sampled standard trusted news articles from the English Gigaword corpus and crawled articles from seven different unreliable news sites of differing types. Table 1 displays sources identified under each type according to US News & World Report. These news types include:
• Satire: mimics real news but still cues the reader that it is not meant to be taken seriously
• Hoax: convinces readers of the validity of a paranoia-fueled story
• Propaganda: misleads readers so that they believe a particular political/social agenda
Unlike hoaxes and propaganda, satire is intended to be notably different from real news so that audiences will recognize the humorous intent. Hoaxes and satire are more likely to invent stories, while propaganda frequently combines truths, falsehoods, and ambiguities to confound readers.
To characterize differences between news types, we applied various lexical resources to trusted and fake news articles. We draw lexical resources from prior work in communication theory and stylistic analysis in computational linguistics. We tokenize the text with NLTK (Bird et al., 2009), compute per-document counts for each lexicon, and report averages per article of each type. First among these lexicons is the Linguistic Inquiry and Word Count (LIWC), a lexicon widely used in social science studies (Pennebaker et al., 2015). In addition, we estimate the use of strongly and weakly subjective words with a sentiment lexicon (Wilson et al., 2005). Subjective words can be used to dramatize or sensationalize a news story. We also use the hedging lexicons from Hyland (2015) because hedging can indicate vague, obscuring language. Lastly, we introduce intensifying lexicons that we crawled from Wiktionary, based on the hypothesis that fake news articles try to enliven stories to attract readers. We compiled five lists from Wiktionary of words that imply a degree of dramatization (comparatives, superlatives, action adverbs, manner adverbs, and modal adverbs) and measured their presence.
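The per-document lexicon counting and per-type averaging described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `LEXICONS` entries are hypothetical toy word lists (the real LIWC, subjectivity, hedging, and Wiktionary lists are not reproduced here), and a simple regular-expression tokenizer stands in for the NLTK tokenizer so that the example needs only the standard library.

```python
import re
from collections import Counter

# Hypothetical toy lexicons, for illustration only.
LEXICONS = {
    "hedge": {"possibly", "perhaps", "apparently"},
    "superlative": {"best", "worst", "most"},
}

def tokenize(text):
    """Lowercase word tokenizer (an assumed stand-in for NLTK)."""
    return re.findall(r"[a-z']+", text.lower())

def lexicon_counts(text, lexicons=LEXICONS):
    """Count how many tokens of one document fall in each lexicon."""
    tokens = Counter(tokenize(text))
    return {name: sum(tokens[w] for w in words)
            for name, words in lexicons.items()}

def average_counts(articles, lexicons=LEXICONS):
    """Average the per-document counts over all articles of one news type."""
    totals = Counter()
    for text in articles:
        totals.update(lexicon_counts(text, lexicons))
    return {name: totals[name] / len(articles) for name in lexicons}
```

Calling `average_counts` once per news type (trusted, satire, hoax, propaganda) would give the per-type averages that the ratio analysis compares.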
Discussion
Table 2 summarizes the ratio of averages between unreliable news and truthful news for a handful of the measured features. Ratios greater than one denote features more prominent in fake news, and ratios less than one denote features more prominent in truthful news. The reported ratios between unreliable and reliable news are statistically significant (p < 0.01) under Welch's t-test after Bonferroni correction.
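As a rough sketch of this analysis, the snippet below computes the ratio of averages for one feature together with Welch's t statistic and a Bonferroni-corrected significance decision. It rests on stated assumptions: the function names and the `n_features` parameter are hypothetical, and the two-sided p-value uses a normal approximation to the t distribution (reasonable only for large samples, such as thousands of articles) so that the example stays within the standard library.

```python
import math
from statistics import NormalDist, mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def compare_feature(fake, trusted, n_features, alpha=0.01):
    """Ratio of averages plus a Bonferroni-corrected significance decision.

    Assumption: the p-value comes from a normal approximation to the
    t distribution, which ignores the degrees of freedom.
    """
    t, _ = welch_t(fake, trusted)
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return {
        "ratio": mean(fake) / mean(trusted),
        "p_value": p,
        # Bonferroni: divide the significance level by the number of tests.
        "significant": p < alpha / n_features,
    }
```

Here `fake` and `trusted` are the per-document counts of one lexicon in each group, and `n_features` is the number of lexicons tested at once.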
Our results show that first-person and second-person pronouns are used more in less reliable or deceptive news types. This contrasts with studies in other domains (Newman et al., 2003), which found fewer self-references in people telling lies about their personal opinions. Unlike in that domain, news writers are trying to appear indifferent. Editors at trustworthy sources are possibly more rigorous about removing language that seems too personal, which is one reason why this result differs from other lie detection domains. This finding instead corroborates previous work in written domains by Ott et al. (2011) and Rayson et al. (2001), who found that such pronouns were indicative of imaginative writing. Perhaps imaginative storytelling domains are a closer match to detecting unreliable news than lie detection on opinions.
Table 2: Linguistic features and their relationship with fake news. The ratio refers to how frequently a feature appears in fake news articles compared to the trusted ones. We list linguistic phenomena more pronounced in the fake news first, and then those that appear less in the fake news. Examples illustrate sample texts from news articles containing the lexicon words. All reported ratios are statistically significant. The last column (MAX) lists which type of fake news most prominently used words from that lexicon (P = propaganda, S = satire, H = hoax).
Our results also show that words that can be used to exaggerate (subjectives, superlatives, and modal adverbs) are all used more by fake news. Words used to offer concrete figures (comparatives, money, and numbers) appear more in truthful news. This also builds on previous findings by Ott et al. (2011) on the difference between superlative and comparative usage.
Trusted sources are more likely to use assertive words and less likely to use hedging words, indicating that they are also less vague when describing events. This relates to theories in psychology (Buller and Burgoon, 1996) holding that deceivers show more "uncertainty and vagueness" and "indirect forms of expression". Similarly, the trusted sources use words from the "hear" category more often, possibly indicating that they cite primary sources more often.
The last column in Table 2 shows the fake news type that uses the corresponding lexicon most prominently. We found that one distinctive feature of satire compared to other types of untrusted news is its prominent use of adverbs. Hoax stories tend to use fewer superlatives and comparatives. In contrast, compared to other types of fake news, propaganda uses relatively more assertive verbs and superlatives.

News Reliability Prediction
We study the feasibility of classifying a news article into one of four reliability categories: trusted, satire, hoax, or propaganda. We split our collected articles into balanced training (20k total articles from the Onion, American News, The Activist, and the Gigaword news excluding 'APW', 'WPB' sources) and test sets (3k articles from the remaining sources). Because articles in the training and test sets come from different sources, the models must classify articles without relying on author-specific cues. We also use 20% of the training articles as an in-domain development set. We trained a Maximum Entropy (MaxEnt) classifier with L2 regularization on n-gram tf-idf feature vectors (up to trigrams). The model achieves F1 scores of 65% on the out-of-domain test set (Table 3). This is a promising result, as it is much higher than random, but it still leaves room for improvement compared to the performance on the development set consisting of articles from in-domain sources. We examined the 50 highest weighted n-gram features in the MaxEnt classifier for each class. The highest weighted n-grams for trusted news were often specific places (e.g., "washington") or times ("on monday"). Many of the highest weighted n-grams for satire were vaguely facetious hearsay ("reportedly", "confirmed"). For hoax articles, heavily weighted features included divisive topics ("liberals", "trump") and dramatic cues ("breaking"). Heavily weighted features for propaganda tend towards abstract generalities ("truth", "freedom") as well as specific issues ("vaccines", "syria"). Interestingly, "youtube" and "video" are highly weighted for the propaganda and hoax classes, respectively, indicating that these articles often rely on video clips as sources.
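To make the classifier setup concrete, here is a small standard-library sketch of n-gram tf-idf features (up to trigrams) feeding a MaxEnt classifier, i.e. multinomial logistic regression, with an L2 penalty. It is an illustration under assumptions and not the authors' implementation: whitespace tokenization, the smoothed idf formula, batch gradient descent, and the values of `l2`, `lr`, and `epochs` are all hypothetical choices.

```python
import math
from collections import Counter, defaultdict

def ngrams(tokens, max_n=3):
    """All word n-grams up to max_n (unigrams through trigrams)."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def fit_idf(docs):
    """Smoothed inverse document frequency for each training n-gram."""
    df = Counter(g for d in docs for g in set(ngrams(d.lower().split())))
    return {g: math.log((1 + len(docs)) / (1 + c)) + 1 for g, c in df.items()}

def tfidf(doc, idf):
    """L2-normalised sparse tf-idf vector; unseen n-grams are dropped."""
    tf = Counter(g for g in ngrams(doc.lower().split()) if g in idf)
    vec = {g: c * idf[g] for g, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {g: v / norm for g, v in vec.items()}

def predict_proba(x, weights):
    """Softmax over per-class linear scores."""
    scores = {c: sum(w[g] * v for g, v in x.items() if g in w)
              for c, w in weights.items()}
    m = max(scores.values())
    exps = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

def train_maxent(vectors, labels, l2=0.01, lr=1.0, epochs=200):
    """Batch gradient descent on cross-entropy loss plus an L2 penalty."""
    classes = sorted(set(labels))
    weights = {c: defaultdict(float) for c in classes}
    for _ in range(epochs):
        grads = {c: defaultdict(float) for c in classes}
        for x, y in zip(vectors, labels):
            probs = predict_proba(x, weights)
            for c in classes:
                err = probs[c] - (1.0 if c == y else 0.0)
                for g, v in x.items():
                    grads[c][g] += err * v / len(vectors)
        for c in classes:
            for g in set(grads[c]) | set(weights[c]):
                weights[c][g] -= lr * (grads[c][g] + l2 * weights[c][g])
    return weights

def predict(doc, idf, weights):
    """Most probable class for a raw document string."""
    probs = predict_proba(tfidf(doc, idf), weights)
    return max(probs, key=probs.get)
```

Sorting `weights[c]` by value after training gives the highest weighted n-grams per class, which is the kind of inspection described in the paragraph above.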

Predicting Truthfulness
PolitiFact Data
Related to the issue of identifying the truthfulness of a news article is the fact-checking of individual statements made by public figures. Misleading statements can also have a variety of intents and levels of reliability depending on who is making the statement.
PolitiFact is a site led by Tampa Bay Times journalists who actively fact-check suspicious statements. One unique quality of PolitiFact is that each quote is evaluated on a 6-point scale of truthfulness ranging from "True" (factual) to "Pants-on-Fire False" (absurdly false). This scale allows for distinction between categories like mostly true (the facts are correct but presented in an incomplete manner) or mostly false (the facts are not correct but are connected to a small kernel of truth).
We collected labelled statements from PolitiFact and its spin-off sites (PunditFact, etc.), 10,483 statements in total. We analyze a subset of 4,366 statements that are direct quotes by the original speaker. The distribution of ratings is shown in Table 4. Most statements are labeled as neither completely true nor false. We formulate a fine-grained truthfulness prediction task with PolitiFact data. We split quotes into training/development/test sets of {2575, 712, 1074} statements, respectively, so that all of each speaker's quotes are in a single set. Given a statement, the model returns a rating for how reliable the statement is (PolitiFact ratings are used as gold labels). We ran the experiment in two settings, one considering all 6 classes and the other considering only 2 (treating the top three truthful ratings as true and the lower three as false).
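The two data-preparation steps described above, a split that keeps each speaker's quotes in a single set and the collapse of the 6-point scale into 2 classes, might look like the sketch below. The label strings in `RATINGS`, the `(speaker, text, rating)` tuple layout, and the split fractions are assumptions made for illustration; they are not the paper's exact procedure or split sizes.

```python
import random
from collections import defaultdict

# The six PolitiFact ratings, ordered from most to least truthful.
# The exact label strings are an assumption.
RATINGS = ["true", "mostly-true", "half-true",
           "mostly-false", "false", "pants-on-fire"]

def to_binary(rating):
    """Top three ratings map to 'true', bottom three to 'false'."""
    return "true" if RATINGS.index(rating) < 3 else "false"

def split_by_speaker(statements, fractions=(0.6, 0.15, 0.25), seed=0):
    """Assign whole speakers to train/dev/test so no speaker spans two sets.

    `statements` is a list of (speaker, text, rating) tuples. The fractions
    are hypothetical targets over speakers, not over statements.
    """
    by_speaker = defaultdict(list)
    for item in statements:
        by_speaker[item[0]].append(item)
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    cut1 = int(fractions[0] * len(speakers))
    cut2 = cut1 + int(fractions[1] * len(speakers))
    groups = (speakers[:cut1], speakers[cut1:cut2], speakers[cut2:])
    return [[s for sp in g for s in by_speaker[sp]] for g in groups]
```

Splitting over speakers rather than statements means a model cannot score well on held-out data simply by memorizing how a particular speaker talks.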

Model
We trained an LSTM model (Hochreiter and Schmidhuber, 1997) that takes the sequence of words as the input and predicts the Politifact rating. We also compared this model with Maximum Entropy (MaxEnt) and Naive Bayes models, frequently used for text categorization.
For input to the MaxEnt and Naive Bayes models, we tried two variants: one with the word tf-idf vectors as input, and one with the LIWC measurements concatenated to the tf-idf vectors. For the LSTM model, we used word sequences as input and also a version where the LSTM output is concatenated with LIWC feature vectors before the activation layer. The LSTM word embeddings are initialized with 100-dim embeddings from GloVe (Pennington et al., 2014) and fine-tuned during training. The LSTM was implemented with Theano and Keras with a 300-dim hidden state and a batch size of 64. Training was done with Adam to minimize categorical cross-entropy loss over 10 epochs. Table 5 summarizes the performance on the development set. We report macro-averaged F1 scores in all tables. The LSTM outperforms the other models when only using text as input; however, the other two models improve substantially when LIWC features are added. We report results on the test set in Table 6. We again find that LIWC features improve the MaxEnt and NB models so that they perform similarly to the LSTM model. As in the development set results, the LIWC features do not improve the LSTM's performance, and even seem to hurt it slightly.
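Since all tables report macro-averaged F1, a short sketch of that metric may help. Macro averaging computes F1 separately for each class and takes the unweighted mean, so each of the six ratings counts equally regardless of how many statements carry it. The function name is hypothetical, and averaging over the classes that appear in the gold labels is one assumed convention among several.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the classes seen in gold."""
    scores = []
    for c in sorted(set(gold)):
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class never predicted scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For example, with gold labels `["a", "a", "b", "b"]` and predictions `["a", "b", "b", "b"]`, class `a` has F1 = 2/3 and class `b` has F1 = 4/5, giving a macro F1 of 11/15.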

Related Work
Deception Detection
Psycholinguistic work in interpersonal deception theory (Buller and Burgoon, 1996) has postulated that certain speech patterns can be signs of a speaker trying to purposefully obscure the truth. Hedge words and other vague qualifiers (Choi et al., 2012; Recasens et al., 2013), for example, may add indirectness to a statement that obscures its meaning.
Linguistic aspects of deception detection have been well studied in a variety of NLP applications (Ott et al., 2011; Mihalcea and Strapparava, 2009; Jindal and Liu, 2008; Girlea et al., 2016; Zhou et al., 2004). In these applications, people purposefully tell lies to receive an extrinsic payoff. In our study, we compare varying types of unreliable news sources, created with differing intents and levels of veracity.

Fact-Checking and Fake News
There is research in political science exploring how effective fact-checking is at improving people's awareness (Lord et al., 1979; Thorson, 2016; Nyhan and Reifler, 2015). Prior computational work (Vlachos and Riedel, 2014; Ciampaglia et al., 2015) has proposed fact-checking through entailment from knowledge bases. Our work takes a more linguistic approach, performing lexical analysis over varying types of falsehood. Biyani et al. (2016) examine the unique linguistic styles found in clickbait articles, and Kumar et al. (2016) characterize hoax documents on Wikipedia. The differentiation between these fake news types was also proposed in previous work (Rubin et al., 2015). Our paper extends this work by offering a quantitative study of linguistic differences found in articles of different types of fake news, and by building predictive models for graded deception across multiple domains (PolitiFact and news articles). More recent work (Wang, 2017) has also investigated PolitiFact data, though it investigated metadata features for prediction, whereas our investigation focuses on linguistic analysis through stylistic lexicons.

Conclusion
We examine truthfulness and its contributing linguistic attributes across multiple domains, e.g., online news sources and public statements. We perform multiple prediction tasks on fact-checked statements of varying levels of truth (graded deception) as well as a deeper linguistic comparison of differing types of fake news, e.g., propaganda, satire, and hoaxes. We have shown that fact-checking is indeed a challenging task but that various lexical features can contribute to our understanding of the differences between more reliable and less reliable digital news sources.