Team QCRI-MIT at SemEval-2019 Task 4: Propaganda Analysis Meets Hyperpartisan News Detection

We describe our submission to SemEval-2019 Task 4 on Hyperpartisan News Detection. We rely on a variety of engineered features originally used to detect propaganda. This is based on the assumption that biased messages are propagandistic and promote a particular political cause or viewpoint. In particular, we trained a logistic regression model with features ranging from simple bag of words to vocabulary richness and text readability. Our system achieved 72.9% accuracy on the manually annotated test set, and 60.8% on the test data that was obtained with distant supervision. Additional experiments showed that significant performance gains can be achieved with better feature pre-processing.


Introduction
The rise of social media has enabled people to easily share information with a large audience without regulations or quality control. This has allowed malicious users to spread disinformation and misinformation (a.k.a. "fake news") at an unprecedented rate. Fake news is typically characterized as being hyperpartisan (one-sided), emotional and riddled with lies (Potthast et al., 2017a). The SemEval-2019 Task 4 on Hyperpartisan News Detection (Kiesel et al., 2019) focused on the challenge of automatically identifying whether a text is hyperpartisan or not. While hyperpartisanship is defined as "exhibiting one or more of blind, prejudiced, or unreasoning allegiance to one party, faction, cause, or person", we model this task as a binary document classification problem.
Scholars have argued that all biased messages can be considered propagandistic, regardless of whether the bias was intentional or not (Ellul, 1965, p. XV). As a result, we approached the task departing from an existing model for propaganda identification. Our hypothesis is that propaganda is inherent in hyperpartisanship: the two problems are two sides of the same coin, and solving one of them would help solve the other. Our system consists of a logistic regression model that is trained with a variety of engineered features that range from word and character TFiDF n-grams and lexicon-based features to more sophisticated features that represent different aspects of the article's text, such as the richness of its vocabulary and the complexity of its language. Our system is available at https://github.com/AbdulSaleh/QCRI-MIT-SemEval2019-Task4.
Our official submission achieved an accuracy of 72.9% (while the winning system achieved 82.2%). This was achieved using word and character n-grams. Additional, post-submission experiments show that further performance improvements can be achieved by careful pre-processing of the engineered features.

Related Work
The analysis of bias and disinformation has attracted significant attention, especially after the 2016 US presidential election (Brill, 2001; Finberg et al., 2002; Castillo et al., 2011; Baly et al., 2018a; Kulkarni et al., 2018; Mihaylov et al., 2018). Most of the proposed approaches have focused on predicting credibility, bias or stance. Popat et al. (2017) assessed the credibility of claims based on the occurrence of assertive and factive verbs, hedges, implicative words, report verbs and discourse markers, which were extracted using manually crafted gazetteers (referred to as stylistic features).
Stance detection was considered as an intermediate step for detecting fake claims, where the veracity of a claim is checked by aggregating the stances of retrieved relevant articles (Baly et al., 2018b). Several stance detection models have been proposed as part of the Fake News Challenge (FNC), including deep convolutional neural networks (Baird et al., 2017), multi-layer perceptrons (Hanselowski et al., 2018), and end-to-end memory networks (Mohtarami et al., 2018).
The stylometric analysis model of Koppel et al. (2007) was used by Potthast et al. (2017b) when looking for hyperpartisanship. They used articles from nine news sources whose factuality has been manually verified by professional journalists. Writing style and complexity were also considered by Horne and Adalı (2017) to differentiate real news from fake news and satire. They used features such as the number of occurrences of different part-of-speech tags, swearing and slang words, stop words, punctuation, and negation as stylistic markers. They also used a number of readability measures. Rashkin et al. (2017) focused on a multi-class setting: real news, satire, hoax, or propaganda. Their supervised model relied on word n-grams.
Similarly to Potthast et al. (2017b), we believe that there is an inherent style in propaganda, regardless of the source publishing it. Many stylistic features were proposed for authorship identification, i.e., the task of predicting whether a piece of text has been written by a particular author. Character-level n-grams are among the most successful representations for this task (Stamatatos, 2009), and they turn out to be some of our most important stylistic features.
More details about research on fact-checking and the spread of fake news online can be found in Lazer et al. (2018), Vosoughi et al. (2018), and Thorne and Vlachos (2018).

System Description
We developed our system for detecting hyperpartisanship in news articles by training a logistic regression classifier using a set of engineered features that included the following: character and word n-grams, lexicon-based indicators, and readability and vocabulary richness measures. Below, we describe these features in detail.
Character 3-grams. Stamatatos (2009) argued that, for tasks where the topic is irrelevant, character-level representations are more sensitive than token-level ones. We hypothesize that this applies to hyperpartisan news detection, since articles on both sides of the political spectrum may be discussing the same topics. Stamatatos (2009) found that "the most frequent character n-grams are the most important features for stylistic purposes". These features capture different style markers, such as prefixes, suffixes and punctuation marks. Following the analysis in Barrón-Cedeño et al. (2019), we include TFiDF-weighted character 3-grams in our feature set.
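As an illustration of the character 3-gram representation described above, the following is a minimal sketch in plain Python. The function names `char_ngrams` and `tfidf_char_ngrams` are hypothetical, and the smoothed idf variant and length-normalized term frequency are assumptions; the exact TFiDF weighting used in the system is not specified here.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_char_ngrams(docs, n=3):
    """Return one {n-gram: TF-IDF weight} dictionary per document."""
    counts = [Counter(char_ngrams(doc, n)) for doc in docs]
    # Document frequency: number of documents in which each n-gram occurs.
    doc_freq = Counter(gram for c in counts for gram in c)
    n_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values())
        # Smoothed idf (an assumption; other variants are equally valid).
        vectors.append({
            gram: (tf / total) * (math.log((1 + n_docs) / (1 + doc_freq[gram])) + 1)
            for gram, tf in c.items()
        })
    return vectors


vectors = tfidf_char_ngrams(["the news", "the view"])
```

In this toy example, the 3-gram "the" occurs in both documents and therefore receives a lower weight than "new", which occurs in only one of them.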
Word n-grams. Bag-of-words (BoW) features are widely used for text classification. We extracted the k most frequent [1, 2]-grams, and we represented them using their TFiDF scores. We ignored n-grams that appeared in more than 90% of the documents, most of which contained stopwords and were irrelevant with respect to hyperpartisanship. Furthermore, we incorporated Naive Bayes by weighing the n-grams based on their importance for classification, as proposed by Wang and Manning (2012). We define $x_i \in \mathbb{R}^{|V|}$ as a row vector in the TFiDF feature matrix, representing the $i$-th training sample with a target label $y_i \in \{0, 1\}$, where $|V|$ is the vocabulary size. We also define vectors $p = \alpha + \sum_{i: y_i = 1} x_i$ and $q = \alpha + \sum_{i: y_i = 0} x_i$, and we set the smoothing parameter $\alpha$ to 1. Finally, we calculate the vector:
$$r = \log\left(\frac{p / \|p\|_1}{q / \|q\|_1}\right) \qquad (1)$$
which is used to scale the TFiDF features to create the NB-TFiDF features as follows:
$$x_i^{NB} = r \circ x_i \qquad (2)$$
where $\circ$ denotes the element-wise product.
Bias Analysis. We analyze the bias in the language used in the documents by (i) creating bias lexicons that contain left and right bias cues, and (ii) using these lexicons to compute two scores for each document, indicating the intensity of bias towards each ideology. To generate the list of cues that signal biased language, we use Semantic Orientation (SO) (Turney, 2002) to identify the words that are strongly associated with each of the left and right documents in the training dataset. The SO values can be either positive or negative, indicating association with right or left biases, respectively. Then, we select words whose absolute SO value is ≥ 0.4 to create two bias lexicons: $BL_{left}$ and $BL_{right}$. Finally, we use these lexicons to compute two bias scores per document according to Equation (3), where for each document $D_j$, the frequency of cues in the lexicon $BL_i$ that are present in $D_j$ is normalized by the total number of words in $D_j$:
$$bias_i(D_j) = \frac{\sum_{cue \in BL_i} count(cue, D_j)}{\sum_{w \in D_j} count(w, D_j)} \qquad (3)$$
Lexicon-based Features. Rashkin et al. (2017) studied the occurrence of specific types of words in different kinds of articles, and showed that words from certain lexicons (e.g., negation and swear words) appear more frequently in propaganda, satire, and hoax articles than in trustworthy articles. We capture this by extracting features that reflect the frequency of words from particular lexicons. We use 18 lexicons from the Wiktionary, Linguistic Inquiry and Word Count (LIWC) (Pennebaker et al., 2001), Wilson's subjectives (Wilson et al., 2005), Hyland's hedges (Hyland, 2015), and Hooper's assertives (Hooper, 1975). For each lexicon, we count the total number of words in the article that appear in the lexicon. This resulted in 18 features, one for each lexicon.
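Returning to the Naive Bayes weighting of the word n-grams, the log-count ratio of Wang and Manning (2012) and the subsequent element-wise scaling can be sketched as follows. This is a minimal illustration on dense Python lists, assuming non-negative TFiDF values (so that the L1 norm is a plain sum); the function names `nb_log_count_ratio` and `nb_scale` are hypothetical and not taken from the released system.

```python
import math


def nb_log_count_ratio(X, y, alpha=1.0):
    """Compute r = log((p / ||p||_1) / (q / ||q||_1)) from rows X and labels y."""
    n_features = len(X[0])
    p = [alpha] * n_features  # smoothed feature sums over the positive class
    q = [alpha] * n_features  # smoothed feature sums over the negative class
    for row, label in zip(X, y):
        target = p if label == 1 else q
        for j, value in enumerate(row):
            target[j] += value
    # All entries are positive, so the L1 norm equals the sum.
    p_norm, q_norm = sum(p), sum(q)
    return [math.log((pj / p_norm) / (qj / q_norm)) for pj, qj in zip(p, q)]


def nb_scale(X, r):
    """Scale every feature row element-wise by the ratio vector r."""
    return [[rj * value for rj, value in zip(r, row)] for row in X]


X = [[1.0, 0.0], [0.0, 1.0]]  # toy TF-IDF rows
y = [1, 0]                    # 1 = hyperpartisan, 0 = not hyperpartisan
r = nb_log_count_ratio(X, y)
X_nb = nb_scale(X, r)
```

Features that are more typical of the positive class receive a positive weight in r, and features that are more typical of the negative class receive a negative one, so the scaled matrix already encodes a class-dependent notion of importance before the classifier is trained.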
Vocabulary Richness. Potthast et al. (2017b) showed that hyperpartisan outlets tend to use a writing style that is different from mainstream outlets. Different topic-independent features have been proposed to characterize the vocabulary richness, style and complexity of a text. For this task, we used the following vocabulary richness features: (i) type-token ratio (TTR): the ratio of types to tokens in a text, (ii) Hapax Legomena: the number of types appearing once in a text, (iii) Hapax Dislegomena: the number of types appearing twice in a text, (iv) Honore's R: a combination of types, tokens and hapax legomena (Honore, 1979):
$$R = \frac{100 \cdot \log(|tokens|)}{1 - |hapax\ legomena| / |types|}$$
and (v) Yule's characteristic K: the chance of a word occurring in a text following a Poisson distribution (Yule, 1944):
$$K = 10^4 \cdot \frac{\sum_i i^2 \cdot |types_i| - |tokens|}{|tokens|^2}$$
where tokens refer to all words in a text (including repetitions), types refer to distinct words, $i$ ranges over the observed frequencies (1 being the least frequent), and $|types_i|$ is the number of types that occur with the $i$-th frequency.
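The five vocabulary richness measures can be computed from a token list as in the sketch below. It assumes the standard formulations of Honore's R and Yule's K and a pre-tokenized input; the function name `vocabulary_richness` and the handling of the degenerate case in which every type occurs only once are illustrative choices.

```python
import math
from collections import Counter


def vocabulary_richness(tokens):
    """Compute TTR, hapax counts, Honore's R and Yule's K for a token list."""
    if not tokens:
        raise ValueError("at least one token is required")
    freq = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(freq)
    hapax = sum(1 for c in freq.values() if c == 1)
    dislegomena = sum(1 for c in freq.values() if c == 2)
    # Frequency spectrum: how many types occur exactly i times.
    spectrum = Counter(freq.values())
    if hapax < n_types:
        honore_r = 100 * math.log(n_tokens) / (1 - hapax / n_types)
    else:
        # Every type occurs once, so R is undefined (division by zero).
        honore_r = float("inf")
    yule_k = 1e4 * (sum(i * i * v for i, v in spectrum.items()) - n_tokens) / n_tokens ** 2
    return {
        "ttr": n_types / n_tokens,
        "hapax_legomena": hapax,
        "hapax_dislegomena": dislegomena,
        "honore_r": honore_r,
        "yule_k": yule_k,
    }


features = vocabulary_richness("a a b c".split())
```

For the toy input, there are 4 tokens and 3 types, of which 2 occur once and 1 occurs twice, so the type-token ratio is 0.75.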
Readability. We also used the following readability features, which were originally designed to estimate the level of text complexity: 1) Flesch-Kincaid grade level: the US grade level necessary to understand a text (Kincaid et al., 1975), 2) Flesch reading ease: a score that measures how difficult a text is to read (Kincaid et al., 1975), and 3) Gunning fog index: an estimate of the years of formal education necessary to understand a text (Gunning, 1968).
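The three readability scores can be sketched with their commonly cited formulas, as shown below. The sentence splitter, the word pattern, and in particular the vowel-group syllable counter are rough heuristics introduced only for illustration; they are assumptions and not necessarily what the system uses. Words with three or more (heuristic) syllables are treated as complex words for the Gunning fog index.

```python
import re


def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels (at least one)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def readability(text):
    """Return Flesch-Kincaid grade, Flesch reading ease and Gunning fog."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = [count_syllables(w) for w in words]
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(syllables) / len(words)
    # Share of words with three or more syllables ("complex" words).
    complex_ratio = sum(1 for s in syllables if s >= 3) / len(words)
    return {
        "flesch_kincaid_grade": 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
        "flesch_reading_ease": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        "gunning_fog": 0.4 * (words_per_sentence + 100 * complex_ratio),
    }


scores = readability("The cat sat. The dog ran.")
```

Short sentences made of one-syllable words, as in the toy input, yield a very low grade level and a very high reading ease, which is the expected direction for simple text.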

Dataset
We trained our models on the Hyperpartisan News Dataset from SemEval-2019 Task 4 (Kiesel et al., 2019), which is split by the task organizers into: 1) Labeled by-Publisher: This set contains 750K articles labeled via distant supervision, i.e., using the labels of their publishers. Labels are evenly distributed across the "hyperpartisan" and "non-hyperpartisan" classes. This set is further split into 600K articles for training and 150K for validation. 2) Labeled by-Article: This set contains 645 articles labeled through crowd-sourcing (37% are hyperpartisan and 63% are not). Only articles with a consensus among annotators were included.

Experimental Setting
We train a logistic regression (LR) model with a Stochastic Average Gradient solver (Schmidt et al., 2017) due to the large size of the dataset. In order to reduce overfitting, we use $L_2$ regularization (with C = 1 as the regularization parameter). Feature normalization was needed since the different features represent different aspects of the text and hence have very different scales. We first tried to normalize each feature set by subtracting the mean and scaling to unit variance. However, we found that multiplying the features by constant scaling factors resulted in better performance. The scaling factor for each family of features was a hyperparameter that was tuned during the fine-tuning experiments. We tuned the number of most frequent word n-grams, k ∈ {50, 200, 700} × 10^3, as well as the scaling parameters of the different features except for the n-grams. The best fine-tuning results suggested using the 200K most frequent word [1, 2]-grams.
We assessed the different feature sets, described in Section 3, by incrementally adding each set, one at a time, to the mix of all features. Table 1 illustrates the results obtained on both the by-Article set (which we used to fine-tune the model's hyperparameters) and the by-Publisher set (which we used for evaluation). Our results suggest that scaling the TFiDF values through Naive Bayes is better than using raw TFiDF scores. Hence, these features were used for all subsequent experiments. It can also be observed that adding each group of features introduces a consistent improvement in accuracy on the by-Article data. However, we observed the opposite behaviour on the by-Publisher data. We believe this is due to the significant amount of noisy labels introduced by the distant supervision labeling strategy. Therefore, we based our decisions on the results obtained on the by-Article data since its labels are more accurate.
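The constant-factor normalization can be sketched as follows: each feature family is multiplied by its own scaling factor and the families are then concatenated into one row per document. The function name `scale_and_concatenate`, the family names, and the factor values are hypothetical placeholders; the tuned factors themselves are not reported in this section.

```python
def scale_and_concatenate(feature_families, scaling_factors):
    """Scale each feature family by a constant and concatenate per document.

    feature_families maps a family name to a list of per-document rows.
    Families without an entry in scaling_factors keep a factor of 1.0.
    """
    names = sorted(feature_families)  # fixed column order
    n_docs = len(feature_families[names[0]])
    rows = [[] for _ in range(n_docs)]
    for name in names:
        factor = scaling_factors.get(name, 1.0)
        for row, values in zip(rows, feature_families[name]):
            row.extend(factor * value for value in values)
    return rows


# Toy example with two documents and hypothetical values.
families = {
    "ngrams": [[0.1, 0.0], [0.0, 0.2]],
    "readability": [[10.0, 60.0], [12.0, 50.0]],
}
factors = {"readability": 0.5}  # illustrative factor, not a tuned value
X = scale_and_concatenate(families, factors)
```

The resulting matrix could then be passed to any logistic regression implementation with a Stochastic Average Gradient solver and L2 regularization, for example scikit-learn's `LogisticRegression(solver="sag", C=1.0)`; whether the system uses that particular library is not stated here.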

Results
The normalization strategy, i.e., scaling the features using calibrated scaling parameters, introduced significant performance improvements. Unfortunately, we were not able to perform these calibration experiments by the competition's deadline, hence we submitted the system that was available at that time, which is based on the BoW (NB-TFiDF) and character 3-gram features, as shown in row 3 in Table 1. Our system achieved 72.9% accuracy on the by-Article test data, ranking 20th out of 42. It also achieved 60.8% accuracy on the by-Publisher test data, ranking 15th out of 42. All subsequent, and superior, results (rows 4-7) were obtained after the deadline.

Conclusion
In this paper, we presented our submission to SemEval-2019 Task 4 on Hyperpartisan News Detection. We trained a logistic regression model with a feature set that included word and character n-grams, represented with TFiDF. This system achieved accuracies of 72.9% and 60.8% on the test data labeled by-Article and by-Publisher, respectively.
We also evaluated additional features that represent different aspects of the article's text such as its vocabulary richness, the kind of language it uses according to different lexicons, and its level of complexity. Initial experiments showed that these features hurt the model. However, with proper preprocessing and scaling we were able to achieve significant performance improvements of up to 2% in absolute accuracy. These results were obtained after the competition's deadline, hence were not considered as part of our submission.