Team Fernando-Pessa at SemEval-2019 Task 4: Back to Basics in Hyperpartisan News Detection

This paper describes our submission to the SemEval 2019 Hyperpartisan News Detection task. Our system aims for linguistics-based document classification from a minimal set of interpretable features, while maintaining good performance. To this end, we follow a feature-based approach and perform several experiments with different machine learning classifiers. Additionally, we explore feature importances and feature distributions across the two classes. On the main task, our model achieved an accuracy of 71.7%, which we improved to 72.9% after the task's end. We also participated in the meta-learning sub-task, which classifies documents using the binary classifications of all submitted systems as input, achieving an accuracy of 89.9%.


Introduction
Hyperpartisan news detection consists of identifying news articles that exhibit extreme bias towards a single side (Potthast et al., 2018). The shift in news consumption behavior from traditional outlets to social media platforms has been accompanied by a surge of fake and/or hyperpartisan news articles in recent years (Gottfried and Shearer, 2017), raising concerns among both researchers and the general public. As people prefer to believe ideologically aligned news (Allcott and Gentzkow, 2017), such articles tend to be shared more often and thus spread at a fast and unchecked pace. Moreover, there is a large intersection of 'fake' and 'hyperpartisan' news, as 97% of fake news articles in BuzzFeed's Facebook fact-check dataset are also hyperpartisan (Silverman et al., 2016).
However, the detection/classification and consequent regulation of online content must be done with careful consideration, as any automatic system risks unintended censorship (Akdeniz, 2010). As such, we aim for a linguistically-guided model from a set of interpretable features, together with classifiers that facilitate inspection of what the model has learned, such as Random Forests (Ho, 1995), Support Vector Machines (Cortes and Vapnik, 1995) and Gradient Boosted Trees (Drucker and Cortes, 1996). Neural network models are left out for their typically less self-explanatory nature.
1 https://github.com/AndreFCruz/semeval2019-hyperpartisan-news
The SemEval 2019 Task 4 (Kiesel et al., 2019) challenged participants to build a system for hyperpartisan news detection. The provided dataset consists of 645 manually annotated articles (by-article dataset), as well as 750,000 articles automatically annotated publisher-wise (by-publisher dataset, split 80% for training and 20% for validation). Systems are ranked by accuracy on a set of unpublished test articles (from the by-article dataset), which has no publishers in common with the provided training dataset, preventing accuracy gains from profiling publishers. All experiments in this paper are performed on the gold-standard (by-article) corpus, as this was the official dataset.
The rest of the paper is organized as follows. Section 2 describes our pre-processing, feature selection, and the system's architecture. Section 3 analyzes our model's performance, evaluates each feature's importance, and examines some classification examples in depth. Finally, Section 5 draws conclusions and sketches future work.

System Description
We propose a feature-based approach and experiment with several machine learning algorithms, namely support vector machines with linear kernels (SVM), random forests (RF), and gradient boosted trees (GBT). Our submission to the task was an RF classifier, as this was the best-performing model at the time. However, after the task's end we found a combination of hyperparameters that made GBT the best performer. We detail all results in the following sub-sections.
All classifiers were implemented using scikit-learn (Pedregosa et al., 2011) for the Python programming language, and all were trained on the same dataset of featurized documents. In this section we describe the data pre-processing, our selection of features, and the classifiers' grid-searched hyperparameters.

Feature Selection
The statistical analysis of natural language has been widely used for stylometric purposes, in particular to define linguistic features that measure author style. These include, among others: document length; sentence and word length; use of punctuation; use of capital letters; type-token ratio (Johnson, 1944); and frequency of word n-grams (see e.g. Stamatatos (2009) for a thorough survey of authorship attribution methods). Although these features have been successfully used in authorship attribution to establish the most likely writer of a target text among a range of possible writers (Sousa-Silva et al., 2010), research on how these features can be used to analyze group authorship, and subsequently identify an ideological slant, is less demonstrated. Therefore, we build upon previous research on Computer-Mediated Discourse Analysis (Herring, 2004) to test the use of these features to detect hyperpartisan news.
We compute a minimal set of style and complexity features, partially inspired by Horne and Adali (2017), as well as a bag of word n-grams. For tokenization we use the Python Natural Language Toolkit (Bird et al., 2009).
Our features are as follows: num_sentences (number of sentences in the document); avg_sent_word_len (average word-length of sentences); avg_sent_char_len (average character-length of sentences); var_sent_char_len (variance of character-length of sentences); avg_word_len (average character-length of words); var_word_len (variance of character-length of words); punct_freq (relative frequency of punctuation characters); capital_freq (relative frequency of capital letters); types_atoms_ratio (type-token ratio, a measure of vocabulary diversity and richness); and frequency of the k most frequent word n-grams.
Regarding word n-grams, we use k = 50 and n ∈ [1, 2], as we empirically found these values to perform well while maintaining a small feature set. Moreover, we ignore n-grams whose document frequency is greater than 95%, as well as 1-grams from a set of known English stopwords (from scikit-learn's stop-word list), whose frequency we assume to be too high to be distinctive. Text tokens and stop words are stemmed using the Porter stemming algorithm (Porter, 1980).
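The feature set described above can be sketched in plain Python. This is a minimal illustration, not our actual pipeline: it replaces the NLTK tokenizers with naive regular expressions, skips Porter stemming, uses population variance, normalizes character frequencies by total document length, and uses a tiny hypothetical stop-word list (STOPWORDS) in place of scikit-learn's. The dictionary keys mirror the feature names listed above, written with underscores, and the function assumes a non-empty document.

```python
import re
import string
from collections import Counter
from statistics import mean, pvariance

# Hypothetical, tiny stop-word list used only for this illustration.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is"}


def style_features(text):
    """Style and complexity features for one non-empty document."""
    # Naive sentence split after ., ! or ? (an assumption; not NLTK).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z']+", text)
    sent_word_lens = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    sent_char_lens = [len(s) for s in sentences]
    word_lens = [len(w) for w in words]
    n_chars = len(text)
    return {
        "num_sentences": len(sentences),
        "avg_sent_word_len": mean(sent_word_lens),
        "avg_sent_char_len": mean(sent_char_lens),
        "var_sent_char_len": pvariance(sent_char_lens),
        "avg_word_len": mean(word_lens),
        "var_word_len": pvariance(word_lens),
        "punct_freq": sum(c in string.punctuation for c in text) / n_chars,
        "capital_freq": sum(c.isupper() for c in text) / n_chars,
        "types_atoms_ratio": len({w.lower() for w in words}) / len(words),
    }


def ngrams(tokens, n):
    """Word n-grams of a token list, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_k_ngrams(docs, k=50, max_df=0.95):
    """The k most frequent 1- and 2-grams across a corpus.

    Stop-word 1-grams are dropped, as are n-grams whose document
    frequency exceeds max_df. Ties are broken alphabetically.
    """
    total, doc_freq = Counter(), Counter()
    for doc in docs:
        tokens = re.findall(r"[a-z']+", doc.lower())
        grams = [t for t in tokens if t not in STOPWORDS] + ngrams(tokens, 2)
        total.update(grams)
        doc_freq.update(set(grams))
    kept = {g: c for g, c in total.items() if doc_freq[g] / len(docs) <= max_df}
    ranked = sorted(kept.items(), key=lambda item: (-item[1], item[0]))
    return [g for g, _ in ranked[:k]]
```

A document's final feature vector would then concatenate the values returned by style_features with the per-document frequencies of the n-grams selected by top_k_ngrams.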
For the SVM model, we use the following hyperparameter values: penalty parameter C = 0.9; penalty = l2; loss function = squared hinge.
These hyperparameter values are the result of extensive grid searching for each model, selecting the best-performing models on 10-fold cross-validated results. Table 1 shows the results of the models over 10-fold cross-validation (top rows), and on the official test set (bottom rows). Besides our models, we show the performance of the provided baseline as well as the best-performing submission to the task (last row). As results on the official test set were hidden for the duration of the task, we used cross-validated results to guide our decision-making in improving the models.
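The selection procedure described above (grid search scored by 10-fold cross-validated accuracy) can be sketched as follows. This is a library-free illustration under stated assumptions, not our scikit-learn code: the function names, the plain (non-stratified) shuffled split, and the fit_with/predict callables are all hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_val_accuracy(fit, predict, X, y, k=10, seed=0):
    """Mean accuracy over k folds; fit(X, y) returns a model."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        hits = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)


def grid_search(fit_with, predict, X, y, grid, k=10):
    """Return the parameter dict from grid with the best CV accuracy."""
    def score(params):
        return cross_val_accuracy(
            lambda X_tr, y_tr: fit_with(X_tr, y_tr, **params), predict, X, y, k
        )
    return max(grid, key=score)
```

Here grid is a list of keyword-argument dictionaries, one per candidate configuration; with 645 articles and k = 10, each held-out fold contains 64 or 65 documents.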

Feature Analysis
Making use of our choice of classifiers, we are able to interpret and analyze the most important features, as well as trace back the decision path for every document along each of the ensemble's estimators (RF and GBT). Figures 1 and 2 show measures of feature importance for the RF and GBT models. RF refers to our task submission, while GBT is our best-performing model, submitted after the task's closing. Figure 1 shows the top features by mean impurity decrease on a feature's nodes, averaged across the ensemble's estimators/trees and weighted by the proportion of samples reaching those nodes (Breiman, 2001). Similarly, Figure 2 shows the top features by relative accuracy decrease (averaged across the ensemble's estimators) as the values of each feature are randomly permuted (Breiman, 2001). Interesting properties emerge from analyzing feature importance, notably that the number of sentences and the frequency of capital letters are the most important features on both measures. Moreover, the RF model tends to have a longer-tailed distribution of feature importances, while the GBT model tends to focus on a smaller subset of features for classification.
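The permutation-based measure can be illustrated with a short, model-agnostic sketch. It is an assumption-laden simplification: it reports the absolute (not relative) accuracy drop of a single predict function on one evaluation set, rather than averaging over each tree of the ensemble as in Breiman (2001). The function names and the n_repeats parameter are hypothetical.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which predict(row) equals the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean accuracy drop per feature when that feature's column is shuffled.

    X is a list of rows, each row a list of feature values.
    """
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild each row with only this column replaced.
            X_perm = [row[:col] + [v] + row[col + 1:]
                      for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances
```

A feature the model never consults yields an importance of exactly zero, since shuffling it cannot change any prediction; a feature the model relies on heavily yields a large drop.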
Interestingly, two 1-grams make it into the top-10 features by impurity decrease: 'trump' and 'polit'. Heavier reliance on n-grams could present a problem, as these may refer to entities whose media attention varies widely over time. For instance, words like 'Hillary' or 'Obama' (which appear in the top-20 features) are probably not seen as often nowadays as they were back in 2016. Nonetheless, we are confident in the generalization capacity of the models, as the most discriminative features are mostly style and language-complexity features, which do not suffer from this time-dependent bias of n-grams.

Analysis of Predictions
In order to better understand our model's decision making, we analyze differences in distributions of document features for each predicted class, and compare them with the gold-standard values.
As seen in Table 2, articles predicted as hyperpartisan have a higher number of sentences than mainstream articles, but these sentences are shorter and the vocabulary is less diverse (smaller type-token ratio). The frequency of the word 'trump' is also noticeably higher in hyperpartisan articles. Predicted and gold articles align well when projected onto this feature space.

Meta-learning Task
After the main task's end, organizers challenged participants to compete on a meta-learning task. This task's dataset consisted of the predictions made by each of the 42 submitted systems on the same test-set articles. Notably, a simple majority vote classifier (with the predictions of all 42 systems as input) achieved an accuracy of 88.5%, substantially better than the best-performing system's accuracy of 82.2%. While a voting classifier performed considerably well, we intuitively postulated that the votes of the best-n classifiers (accuracy-wise) would perform better. Figure 3 shows the accuracy of majority vote classifiers built from the top-n systems, for n from 42 down to 1. The best performance is achieved using the top-12 classifiers. However, in Figure 3, we can observe fluctuations in performance as the worst classifiers are removed. This means that including worse-performing classifiers in the vote, as we do in this task, can yield performance improvements. We conclude that there is no discernible correlation between ensemble accuracy and a smaller n. We leave as future work further investigation of which characteristics of the classifiers contribute to the fluctuations in overall performance.
Our final submission for this sub-task consisted of a Random Forest model, whose features were the predictions of all 42 submitted systems, as well as an extra column with the average vote of all systems. See Table 3 for the final performance on the official by-article-meta-test dataset.
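The top-n voting experiment and the meta-features of our final model can be sketched as below. This is an illustrative reconstruction under assumptions: predictions are encoded as 0/1 lists keyed by hypothetical system names, ties are resolved towards the positive class, and systems are ranked by accuracy against the same gold labels used for scoring.

```python
def accuracy(preds, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def majority_vote(votes):
    """Majority label of a list of 0/1 votes; ties go to 1 (assumption)."""
    return int(2 * sum(votes) >= len(votes))


def top_n_vote_accuracy(predictions, gold, n):
    """Accuracy of a majority vote over the n most accurate systems.

    predictions maps a system name to its list of 0/1 predictions.
    """
    ranked = sorted(predictions,
                    key=lambda name: accuracy(predictions[name], gold),
                    reverse=True)[:n]
    ensemble = [majority_vote([predictions[name][i] for name in ranked])
                for i in range(len(gold))]
    return accuracy(ensemble, gold)


def meta_features(predictions):
    """One row per article: every system's vote plus the average vote."""
    systems = list(predictions)
    n_items = len(predictions[systems[0]])
    rows = []
    for i in range(n_items):
        votes = [predictions[name][i] for name in systems]
        rows.append(votes + [sum(votes) / len(votes)])
    return rows
```

Evaluating top_n_vote_accuracy for every n from 42 down to 1 gives the kind of curve plotted in Figure 3, while the rows returned by meta_features correspond to the input layout of the Random Forest meta-classifier.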