Clark Kent at SemEval-2019 Task 4: Stylometric Insights into Hyperpartisan News Detection

In this paper, we present a news bias prediction system developed for SemEval-2019 Task 4. The system is based on XGBoost and uses character and word level n-gram features, represented using TF-IDF and a count vector based correlation matrix, to predict whether an input news article is hyperpartisan. Our model achieved a precision of 68.3% on the test set provided by the contest organizers. We also ran our model on the BuzzFeed corpus and found that XGBoost with simple character level n-gram embeddings performs well, with an accuracy of around 96%.


Introduction
The problem of hyperpartisan news detection (Potthast et al., 2018) is to predict whether a news article is biased towards a specific political wing. It is a classification problem: the task is to classify an article as being extremely one-sided or not. A closely related problem is fake news detection, where the task is to analyze the veracity of an article and classify it according to predefined degrees of truthfulness.
Our problem has a high societal relevance, since one-sided news poses a great threat to democracy, particularly in the context of conducting fair elections. In this paper, we discuss our approach to solving this problem used during the contest Hyper Partisan News Detection, a competition task at SemEval 2019 (Kiesel et al., 2019).
More formally, our problem definition is:
Definition 1 (Hyperpartisan News Detection): We are given a set of news articles $A$, where each article $a_i$ is marked with two labels: a Boolean hyperpartisan label $h_i$, which indicates whether article $a_i$ is biased towards a political wing, and a bias label $b_i \in \{\text{left}, \text{right}, \text{left-center}, \text{right-center}, \text{least}\}$, which indicates which wing the article is biased towards. If $h_i = \text{True}$, then $b_i \in \{\text{left}, \text{right}\}$; if $h_i = \text{False}$, then $b_i \in \{\text{least}, \text{left-center}, \text{right-center}\}$. The objective is to learn a classifier $C$ which predicts the hyperpartisan label $h_j$ for an unknown news article $a_j$.
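The label constraint in Definition 1 can be made concrete with a small sketch. The names `Article`, `ALLOWED_BIAS` and `is_consistent` are illustrative assumptions and are not part of the task's data format.

```python
from dataclasses import dataclass

# Bias labels permitted for each value of the hyperpartisan flag (Definition 1).
ALLOWED_BIAS = {
    True: {"left", "right"},
    False: {"least", "left-center", "right-center"},
}


@dataclass
class Article:
    """Hypothetical container for one labelled article a_i."""
    text: str
    hyperpartisan: bool  # h_i
    bias: str            # b_i

    def is_consistent(self) -> bool:
        """Return True if (h_i, b_i) satisfies the constraint in Definition 1."""
        return self.bias in ALLOWED_BIAS[self.hyperpartisan]


articles = [
    Article("...", hyperpartisan=True, bias="left"),
    Article("...", hyperpartisan=False, bias="least"),
    Article("...", hyperpartisan=True, bias="left-center"),  # violates the constraint
]
consistency = [a.is_consistent() for a in articles]
```

A classifier for this task only needs to predict the Boolean flag; the bias label is shown here solely to illustrate how the two labels relate.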
In this work, we identify the role of various traditional NLP features in determining the degree of partisanship. We utilise standard term-frequency and inverse document frequency vector features computed for unigrams, bigrams and trigrams obtained from the corpus. We perform this feature extraction at both the character and the word level, and then train a gradient boosted decision tree as a classifier for identifying partisanship. We also compare other classification methods, namely SVM, KNN, Naive Bayes and Logistic Regression, using the same vector features. Furthermore, we performed experiments that exploit metadata information (see the scalar features in Section 3.2).
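The pipeline described above (character-level tf-idf n-grams fed to a gradient boosted tree classifier) can be sketched as follows. This is a minimal illustration, not the submitted system: it swaps in scikit-learn's `GradientBoostingClassifier` for XGBoost, and the six texts, their labels and all hyperparameters are made up for the example.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus; label 1 marks the hyperpartisan class, 0 the rest.
texts = [
    "They are DESTROYING our country!!!",
    "Wake up!!! The elites LIE to you!!!",
    "SHAME on them!!! Total disgrace!!!",
    "The committee approved the budget on Tuesday.",
    "Officials said the report will be published next month.",
    "The council met to discuss road repairs.",
]
labels = [1, 1, 1, 0, 0, 0]

# Character 1- to 3-gram tf-idf features followed by gradient boosted trees.
# GradientBoostingClassifier stands in for XGBoost here (an assumption).
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    GradientBoostingClassifier(n_estimators=20, random_state=0),
)
model.fit(texts, labels)
train_predictions = model.predict(texts).tolist()
```

Predictions on the training texts only confirm that the pipeline runs end to end; they say nothing about generalisation.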
The experiments were performed on two corpora: the BuzzFeed corpus, created by Potthast et al. (2018), and the training corpus released by the task organisers (the SemEval corpus). We also discuss the results obtained on the test corpus released for the final evaluation of the task in Section 4.1. Because computing vector features over the larger training corpus was computationally infeasible, we do not compute them for the SemEval corpus.
While the knowledge-based and context-based features may take some time to detect hyperpartisanship (after the news starts spreading on social media), the style-based features can be used to detect partisanship of a news article well in time before the damage happens (Potthast et al., 2018).
To exploit style-based features, Long et al. (2017b) use deep learning based methods, and Shu et al. (2017) perform fake news detection on social media data using a data mining oriented approach.

Baseline
We take as our baseline the work of Potthast et al. (2018), which uses the author's writing style as features to detect hyperpartisanship. The stylometric features used in their work include POS-unigrams, POS-bigrams, POS-trigrams, char-unigrams, char-bigrams, char-trigrams, stopword-unigrams, stopword-bigrams, stopword-trigrams, General Inquirer categories, readability scores, quotation ratio, link amount and average paragraph length. A random forest classifier was used to make predictions.
We use their classifier as the baseline for the BuzzFeed corpus. For the SemEval corpus, we use the random baseline provided in the task. The baseline results for the two datasets are reported in Tables 1 and 3.

Methodology
In this section, we describe the dataset, the features that we selected and the models we trained using the selected features. A visual overview is shown in Figure 2.

Corpus
We used two corpora, which we name the BuzzFeed corpus and the SemEval corpus.
BuzzFeed corpus: This corpus was produced by the baseline work. The dataset comprises 1,627 articles that were manually checked by four BuzzFeed journalists. Of these, 826 articles belong to the mainstream category of publishers, 256 belong to the left-wing category of publishers, and the remaining 545 to the right-wing category of publishers.
SemEval corpus: This corpus has been released for the SemEval 2019 Task 4 on Hyperpartisan News Detection. It comprises 800,000 training articles and 200,000 test articles. These articles are annotated based on the publisher of the articles.

Feature Selection
Prior to feature selection, we pre-processed the datasets to clean the article text: we handled encoding errors, normalised the text and removed stop words. The features we selected fall into two categories, viz. scalar features and vector features. We train two sets of models, one for each category of features.
Scalar features: Here, we select four features, all used at the same time since they encode different information:
• Article length: This feature denotes the length of the article in terms of the number of characters.
• Title length: The title length feature denotes the length of the title of an article in terms of the number of characters.
• Article polarity: The article polarity denotes the sentiment score of the article text in the range [−1, 1]. A score value less than zero implies a negative sentiment, and a positive sentiment otherwise.
• Title polarity: Similar to the article polarity, the title polarity feature denotes the sentiment score of the article title in the range [−1, 1].
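The four scalar features listed above can be sketched as a single function. The polarity scorer here is a toy stand-in: `TOY_LEXICON` and its scores are assumptions made for illustration, whereas the paper computes polarity with SentiWordNet.

```python
import re

# Toy polarity lexicon: an assumption for illustration only.
# The paper derives polarity from SentiWordNet instead.
TOY_LEXICON = {"good": 1.0, "great": 1.0, "bad": -1.0, "terrible": -1.0}


def polarity(text):
    """Mean lexicon score of the matched words; 0.0 if no word matches.

    Every lexicon score lies in [-1, 1], so the mean does as well.
    """
    words = re.findall(r"[a-z']+", text.lower())
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


def scalar_features(title, body):
    """Return [article length, title length, article polarity, title polarity]."""
    return [float(len(body)), float(len(title)), polarity(body), polarity(title)]


features = scalar_features("A great day", "The plan was bad. The outcome was terrible.")
```

Lengths are counted in characters, as in the feature definitions; the resulting four-element vector is what a scalar-feature model would receive for each article.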
Vector features: These include three kinds of features (considered separately since they encode the same information):
• Word count vectors: The count vector for a document denotes the vector of counts of words in the document from the set of possible words in a corpus/vocabulary.
• Word level n-gram vectors: The word level vector for a document denotes the vector of tf-idf values of word level n-grams in the document. We used unigrams, bigrams and trigrams for this feature.
• Character level n-gram vectors: The character vector for a document denotes the vector of counts of character level n-grams. For this feature too, we use unigrams, bigrams and trigrams.
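The three vector feature types listed above can be sketched with scikit-learn's text vectorizers. The two documents are made up, and all settings other than the analyzer and the n-gram range are scikit-learn defaults, which may differ from the configuration used in the experiments.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Two made-up documents standing in for pre-processed articles.
docs = [
    "the senate passed the bill",
    "the bill failed in the senate",
]

# Word count vectors: one column per vocabulary word.
word_counts = CountVectorizer(analyzer="word")
X_counts = word_counts.fit_transform(docs)

# Word level tf-idf over unigrams, bigrams and trigrams.
word_tfidf = TfidfVectorizer(analyzer="word", ngram_range=(1, 3))
X_word = word_tfidf.fit_transform(docs)

# Character level counts over 1- to 3-grams.
char_counts = CountVectorizer(analyzer="char", ngram_range=(1, 3))
X_char = char_counts.fit_transform(docs)

print(X_counts.shape, X_word.shape, X_char.shape)
```

Each vectorizer returns a sparse matrix with one row per document, so the three feature sets can be passed to a classifier separately, matching the "considered separately" setup described above.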
Visual inspection of the data: In Figure 1 we provide a visual insight into the corpus based on the features selected. The figure depicts scatter plots showing variation in feature values w.r.t. time for both true and false hyperpartisan articles.
Table 3: Results for the submitted model.
Figure 2: System Overview.

Models Used
We use the following learning models for our scalar features of the BuzzFeed corpus: K Nearest Neighbours (

Experiments
We divide this section into three parts: experimental setup, results on the BuzzFeed corpus, and results on the SemEval corpus.

Experimental Setup
The article polarity and title polarity features were computed using SentiWordNet (Baccianella et al., 2010). All the vector features were computed using the scikit-learn package. To split the data into training and testing sets, we used 5-fold cross-validation.
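A 5-fold cross-validation of the kind described above can be sketched with scikit-learn. The data is synthetic (four columns standing in for the four scalar features), and the choice of KNN with five neighbours, stratified folds, shuffling and the random seeds are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the four scalar features; not the paper's data.
X, y = make_classification(
    n_samples=200,
    n_features=4,
    n_informative=3,
    n_redundant=1,
    random_state=0,
)

# Five folds: each fold serves once as the test set while the other four train.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=5), X, y, cv=cv, scoring="accuracy"
)
mean_accuracy = scores.mean()
```

`scores` holds one accuracy value per fold, and their mean is the usual summary figure reported for a cross-validated model.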

Results on the BuzzFeed Corpus
The results for the scalar features for models trained on the BuzzFeed corpus are shown in

Results on the SemEval Corpus
Results on the SemEval corpus are shown in Table 2(b). Among all models, KNN performs best, followed by RF, SVM, and GNB (in that order).
Since computing count vector and tf-idf features was computationally infeasible on this corpus, we did not train models on the vector features. However, based on our observations from the BuzzFeed corpus (i.e., the character level vectors outperforming all others), we trained a supervised classifier using FastText. The accuracy achieved by this model is 65%.
The results of our model using all the scalar features on the final evaluations of this competition (the by-article and by-publisher test corpora) are shown in Table 3. These results show that our model suffered from low recall: it failed to retrieve a large share of the hyperpartisan articles.
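The gap between precision and recall mentioned above can be illustrated with a small worked example. The labels below are made up and are not the system's actual predictions; they describe a cautious classifier that rarely predicts the hyperpartisan class.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (hyperpartisan) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)        # correctly flagged
    fp = sum(1 for t, p in pairs if not t and p)    # wrongly flagged
    fn = sum(1 for t, p in pairs if t and not p)    # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels: four hyperpartisan articles, only one of which is flagged.
y_true = [True, True, True, True, False, False, False, False]
y_pred = [True, False, False, False, False, False, False, False]
precision, recall = precision_recall(y_true, y_pred)
```

In this toy case every flagged article is truly hyperpartisan, so precision is high, yet three of the four hyperpartisan articles are missed, so recall is low.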

Conclusion
In this work, we have explored traditional sets of features and models for the hyperpartisan news detection problem. We worked on two corpora, one of which has been used in the state-of-the-art literature. For this corpus, we beat the baseline and achieve an accuracy of around 96%. For the other corpus, we achieve an accuracy of 65% (with a FastText character level embedding based model).
In the contest results (Table 3), our system beat the baseline. Although it did not achieve as high an accuracy as other systems, we observe that this is due to low recall, i.e., even though the selected features help the model produce relevant results, the model fails to capture some of the hyperpartisan articles.

Code and Reproducibility
We provide all our code for both the BuzzFeed and SemEval corpora as a GitHub repository located at https://github.com/virresh/hyperpartisan-semeval19-task4 . The same code was uploaded to TIRA and run for submission to the contest.