Steve Martin at SemEval-2019 Task 4: Ensemble Learning Model for Detecting Hyperpartisan News

This paper describes our submission to Task 4 of SemEval 2019, hyperpartisan news detection. Our model detects hyperpartisan news by combining style-based and content-based features. We extract a broad range of feature sets and use GBDT and an n-gram CNN model as our learning algorithms. Finally, we take a weighted average of the two models' outputs to combine them effectively. Our model achieves an accuracy of 0.745 on the test set in subtask A.


Introduction
The proliferation of misleading information in the media has made it challenging to identify trustworthy news sources, thus increasing the need for fake news detection tools that can provide insight into the reliability of news content. Since the spread of fake news causes irreversible damage, near-real-time fake news detection is crucial. However, knowledge-based and context-based approaches to fake news detection can only be applied after publication, so they may not be fast enough (Potthast et al., 2017).
As a practical alternative, style-based approaches try to detect fake news by capturing signs of manipulation in the writing style of news content. This approach captures style signals, such as hyperpartisan style, that can indicate decreased objectivity of news content and thus the potential to mislead consumers. Hyperpartisan style represents extreme behavior in favor of a particular political party, which often correlates with a strong motivation to create fake news. Linguistic-based features can be applied to detect hyperpartisan articles (Potthast et al., 2017). Deep network models, such as convolutional neural networks (CNN), have also been applied to fake news detection (Wang, 2017). In this paper, we employ a stylometry-based approach and an N-gram CNN model for detecting hyperpartisan news.

System Overview
For this task, we extract a broad range of features from the training data and then apply classifier models to make predictions. Our system employs a gradient boosting decision tree (GBDT) model and an N-gram CNN model. In the subsequent sections, we describe data preprocessing, feature engineering, and learning algorithms.

Data Preprocessing
Before applying the models, we transform the article texts (i.e., XML parsing, text tokenizing, stemming, lemmatization, and stopword removal) and extract the internal and external links of each article. In addition, we construct a bias domain dictionary from the mediabiasfactcheck site 1 to check the bias of the externally linked domains in an article. To this end, we crawled the top-level domain information of the sites listed under each of the five categories associated with hyperpartisanship (e.g., Left, Center, Least Biased, Right-center Bias, and Right Bias).

Feature Engineering
Since hyperpartisan news is intentionally created for political gain rather than to report objective claims, it often contains opinionated and inflammatory language. Thus, it is reasonable to exploit linguistic features that capture different writing styles to detect hyperpartisan news. Linguistic features are extracted from the text content at different levels of document organization, such as characters, words, and sentences. The feature sets we use are as follows.
Basic count features: Previous work on fake news detection (Rubin et al., 2016) as well as on opinion spam (Ott et al., 2011) suggests that the use of punctuation helps differentiate deceptive from truthful texts. We construct a basic count feature set that includes counts of various punctuation characters and other features.
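As an illustration of basic count features, the following is a minimal sketch. The exact features in our set are not enumerated here, so the punctuation list, the feature names, and the function name `basic_count_features` are assumptions made for the example.

```python
from collections import Counter

# Hypothetical punctuation set; the actual feature set may differ.
PUNCTUATION = "!?,.;:\"'"


def basic_count_features(text: str) -> dict:
    """Return simple character-, word-, and sentence-level counts for one article."""
    char_counts = Counter(text)
    words = text.split()
    # Crude sentence estimate: number of terminal punctuation marks, at least 1.
    n_sentences = max(1, sum(char_counts[c] for c in ".!?"))

    features = {f"count_{c}": char_counts[c] for c in PUNCTUATION}
    features["n_chars"] = len(text)
    features["n_words"] = len(words)
    features["n_sentences"] = n_sentences
    # Words written entirely in capitals (longer than one character).
    features["n_uppercase_words"] = sum(1 for w in words if w.isupper() and len(w) > 1)
    features["avg_word_length"] = (
        sum(len(w) for w in words) / len(words) if words else 0.0
    )
    return features


example = basic_count_features("WOW! Is this real? Yes, it is.")
```

Each article yields one such dictionary, which can be used as a row of the feature matrix given to the classifier.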
External link bias: We extract bias counts for the externally linked domains in each article based on the bias domain dictionary (i.e., hyperpartisan links count, non-hyperpartisan links count, and unknown links count). To determine the bias of an external link, we use the bias domain dictionary crawled from the mediabiasfactcheck site, which assigns top-level domains to five categories (i.e., left, right, left-center, center, right-center). An external link is counted as hyperpartisan when the linked site belongs to the left or right category.
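The counting rule above can be sketched as follows. The dictionary entries are placeholder domains rather than real crawled data, and the names `BIAS_BY_DOMAIN` and `link_bias_counts` are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical excerpt of a bias domain dictionary (placeholder domains).
BIAS_BY_DOMAIN = {
    "left-example.com": "left",
    "right-example.com": "right",
    "center-example.org": "center",
}
# Categories counted as hyperpartisan, following the rule in the text.
HYPERPARTISAN = {"left", "right"}


def link_bias_counts(urls, bias_by_domain=BIAS_BY_DOMAIN):
    """Count hyperpartisan, non-hyperpartisan, and unknown external links."""
    counts = Counter(hyperpartisan=0, non_hyperpartisan=0, unknown=0)
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]  # normalise "www.example.com" to "example.com"
        bias = bias_by_domain.get(host)
        if bias is None:
            counts["unknown"] += 1
        elif bias in HYPERPARTISAN:
            counts["hyperpartisan"] += 1
        else:
            counts["non_hyperpartisan"] += 1
    return dict(counts)
```

For example, an article linking to one placeholder left site, one center site, and one domain missing from the dictionary would yield a count of one in each of the three buckets.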
Sentiment features: Our system uses the VADER sentiment analysis tool 2 to generate sentiment features for the title and body of each article. VADER reports not only positivity and negativity scores but also how positive or negative a sentiment is.
Vocabulary richness and readability features: We also extract features indicating how understandable an article is. These features include several vocabulary richness and readability scores, including Brunet's Measure W, Hapax Dislegomena, Hapax Legomena, Honoré's R Measure, Sichel's Measure, Yule's Characteristic K, the Dale-Chall Readability Formula, Flesch Reading Ease, the Gunning Fog Index, Shannon Entropy, and Simpson's Index 3 . Among these indices, Simpson's index stems from the concept of biodiversity. We apply this index to measure the diversity of a text.
Simpson's Index is computed as

\[ D = \left(\frac{n}{N}\right)^{2} \]

where N is the total number of words in a text and n is the total number of unique tokens.
Term features: Hyperpartisan news uses language strategically, and certain patterns surface despite the authors' attempts to control what they say. These patterns appear in certain verbal aspects and in the usage of pronouns, conjunctions, and negative emotional words. Based on this assumption, we extract term count features, which count synonyms of several terms (e.g., to obtain the ORDER term feature, we calculate the frequency of words such as command, demand, instruction, prescription, and order in each article).
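A term count feature of this kind can be sketched as below. Only the ORDER group comes from the text; the function name, the tokenisation by a simple regular expression, and the exact-match rule (no stemming, so "orders" is not counted) are assumptions of the sketch.

```python
import re
from collections import Counter

# ORDER follows the example in the text; other groups would be defined the same way.
TERM_GROUPS = {
    "ORDER": {"command", "demand", "instruction", "prescription", "order"},
}


def term_count_features(text, term_groups=TERM_GROUPS):
    """Count how often any word of each term group occurs in the text."""
    # Lowercase and keep alphabetic tokens only (a simplifying assumption).
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    return {
        name: sum(tokens[word] for word in words)
        for name, words in term_groups.items()
    }


counts = term_count_features("They demand order. The order was a command, not orders.")
```

Applying stemming or lemmatization before counting would also match inflected forms.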
Grammar transformation: Analyzing content alone is often not enough to predict hyperpartisan news. Thus, we also use language structure (syntax) for this task. We use the spaCy tool 4 to transform news articles into a set of parse trees describing their syntactic structure.
Psycholinguistic features: For psycholinguistic features, we use the 2015 Linguistic Inquiry and Word Count (LIWC 5 ) lexicon to extract the proportions of words that belong to psycholinguistic categories. LIWC has two types of categories. The first captures the writing style of the author by considering features such as POS frequency or the length of the words used. The second captures content information by counting the frequency of words related to thematic categories such as affective processes (e.g., positive emotion, negative emotion, anxiety, anger, sadness) and social processes (e.g., family, friends, female references, male references). In using this tool, we focus on the content information and therefore ignore the style categories.
Part-of-Speech (POS) tags: Syntactic features consist of function words and part-of-speech tags. Syntactic patterns vary significantly from one author to another. These features are extracted using accurate and robust text analysis tools (i.e., part-of-speech taggers and lemmatizers). In our system, we extend word-level analysis with features such as POS frequency. For the extraction of syntactic features, we used the NLTK POS tagger 1 .
Word2Vec features: Recently, word representation models based on neural networks (e.g., word2vec, GloVe), which represent a word as a real-valued vector, have gained popularity (Mikolov et al., 2013). These approaches have proved advantageous in many NLP tasks, such as machine translation, question answering, and document classification. We adopted pre-trained 300-dimensional word vectors 6 to create a vector representation of each article by averaging its word2vec vectors. In addition, we use the word2vec features to compute the cosine similarity between the news title and the text.
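The averaging and title-body similarity can be sketched as follows. The three-dimensional toy embeddings stand in for the pre-trained 300-dimensional vectors, and the function names are hypothetical; out-of-vocabulary tokens are simply skipped, which is an assumption of the sketch.

```python
import math


def average_vector(tokens, embeddings, dim):
    """Mean of the embeddings of in-vocabulary tokens (zero vector if none are known)."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy embeddings (assumed values) in place of the pre-trained vectors.
toy = {
    "senate": [1.0, 0.0, 0.0],
    "vote": [0.0, 1.0, 0.0],
    "bill": [1.0, 1.0, 0.0],
}
title_vec = average_vector(["senate", "vote"], toy, dim=3)
body_vec = average_vector(["the", "bill", "vote"], toy, dim=3)  # "the" is skipped
title_body_similarity = cosine_similarity(title_vec, body_vec)
```

The article vector and the similarity value can then be appended to the other features of the article.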
TF-IDF features: Finally, we extract unigrams, bigrams, and trigrams derived from the bag-of-words representation of each news article. To account for occasional differences in content length between the training and test datasets, these features are encoded as tf-idf values. We limit the number of features that the vectorizer learns to 10,000.
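To make the encoding concrete, the sketch below computes tf-idf values over word 1- to 3-grams with a capped vocabulary. It is a simplified stand-in for a library vectorizer, not the one used in our system: the weighting variant (relative term frequency times ln(N/df) + 1), the whitespace tokenisation, and the function names are all assumptions.

```python
import math
from collections import Counter


def word_ngrams(tokens, n_min=1, n_max=3):
    """All word n-grams of length n_min..n_max, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf_features(documents, max_features=10000):
    """Return (vocabulary, rows); each row maps an n-gram to its tf-idf value."""
    doc_ngrams = [Counter(word_ngrams(doc.lower().split())) for doc in documents]

    corpus_counts = Counter()  # total occurrences of each n-gram
    doc_freq = Counter()       # number of documents containing each n-gram
    for counts in doc_ngrams:
        corpus_counts.update(counts)
        doc_freq.update(counts.keys())

    # Keep only the most frequent n-grams, up to max_features.
    vocab = [term for term, _ in corpus_counts.most_common(max_features)]
    n_docs = len(documents)
    idf = {t: math.log(n_docs / doc_freq[t]) + 1.0 for t in vocab}

    rows = []
    for counts in doc_ngrams:
        total = sum(counts.values()) or 1  # relative frequency offsets document length
        rows.append({t: (counts[t] / total) * idf[t] for t in vocab if counts[t]})
    return vocab, rows


vocab, rows = tfidf_features(["tax cut now", "tax plan"], max_features=5)
```

Dividing by the number of n-grams in the document is what makes articles of different lengths comparable in this sketch.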

Learning Algorithms
Based on the above features, we explore several learning algorithms to build classification models. We take a weighted average of the outputs of two models: a GBDT trained on the style-based and content-based features, and the N-gram CNN model (see Figure 2). For the deep learning model, we adopt the N-gram CNN model proposed by Shrestha et al. (2017). As shown in Figure 2 (right), the model receives a sequence of character n-grams as input. These n-grams are then processed by four layers: (1) an embedding layer, (2) a convolution layer, (3) a max-pooling layer, and (4) a softmax layer. We briefly sketch the processing procedure.
1 https://www.nltk.org/
6 https://code.google.com/archieve/p/word2vec/
The network takes a sequence of character bigrams $x = \langle x_1, \ldots, x_l \rangle$ as input and outputs a multinomial distribution over class labels as a prediction. The model first looks up the embedding matrix to generate the embedding sequence for x (i.e., the matrix C), and then pushes the embedding sequence through convolutional filters of three bigram-window sizes w = 3, 4, 5, each yielding m feature maps. We then apply max-pooling to the feature maps of each filter and concatenate the resulting vectors to obtain a single vector y, which is passed through the softmax layer to generate a prediction.
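The forward pass described above can be sketched in plain Python as follows. This is an illustration with random, untrained weights and tiny dimensions, not our trained network: the ReLU activation, the omission of bias terms, the unknown-bigram id 0, and all names and sizes are assumptions.

```python
import math
import random


def char_bigrams(text):
    """Overlapping character bigrams of a string."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(text, vocab, emb, filters, out_weights):
    # (1) Embedding layer: one vector per bigram; unknown bigrams map to id 0.
    seq = [emb[vocab.get(bg, 0)] for bg in char_bigrams(text)]
    pooled = []
    for w, feature_maps in filters.items():   # window sizes, e.g. 3, 4, 5
        for weights in feature_maps:          # m feature maps per window size
            best = 0.0                        # ReLU outputs are never below 0
            # (2) Convolution: slide a window of w bigram embeddings over the sequence.
            for start in range(len(seq) - w + 1):
                window = [x for vec in seq[start:start + w] for x in vec]
                act = max(0.0, sum(a * b for a, b in zip(weights, window)))
                # (3) Max-pooling over all positions of this feature map.
                best = max(best, act)
            pooled.append(best)               # concatenation into the vector y
    # (4) Softmax layer over class scores.
    scores = [sum(a * b for a, b in zip(row, pooled)) for row in out_weights]
    return softmax(scores)


random.seed(0)
d, m, n_classes = 4, 2, 2                     # toy sizes, not the paper's settings
vocab = {"<unk>": 0, "th": 1, "he": 2, "e ": 3}
emb = [[random.uniform(-1, 1) for _ in range(d)] for _ in vocab]
filters = {
    w: [[random.uniform(-1, 1) for _ in range(w * d)] for _ in range(m)]
    for w in (3, 4, 5)
}
out_weights = [[random.uniform(-1, 1) for _ in range(3 * m)] for _ in range(n_classes)]
probs = forward("the news", vocab, emb, filters, out_weights)
```

With three window sizes and m feature maps each, the pooled vector y has 3m entries, which is the input size of the softmax layer.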
Based on this model, we modified the network by adding a dense layer, which helps detect hyperpartisan news features. In our experiments, the character bigram CNN model outperforms the unigram CNN model. Table 2 summarizes the sizes of the parameters of the N-gram CNN model. The official evaluation measure for subtask A is accuracy.

Experiments and Results

Datasets
The statistics of the datasets provided by SemEval 2019 Task 4 (Kiesel et al., 2019) are shown in Table 3.

Experiments on the Train Dataset
We conduct experiments on each feature set separately to explore its predictive power. In these experiments, we use GBDT (i.e., XGBoost) for the above feature sets. For comparison with the N-gram model, we used the char-level CNN model (Kim et al., 2016). The objective function was minimized through stochastic gradient descent over shuffled mini-batches with Adam (Kingma and Ba, 2014). The performance is evaluated using 5-fold cross-validation with accuracy and F-score. Table 4 lists the experimental results for each feature set on the training dataset. The prediction model that incorporates all of the features showed higher accuracy than the prediction models for individual feature sets.

Experiments on the Test Dataset
Our submission to subtask A on TIRA, the web service platform that facilitates software submissions into virtual machines, achieves an accuracy of 0.745 (precision: 0.853, recall: 0.592, F1: 0.6999). We ranked 14th for subtask A in terms of accuracy. The results on the test data are lower than those on the training set, and there is a particularly large gap between the precision and recall scores.
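For reference, the four reported measures relate to the confusion counts of a binary classifier as in the sketch below. The counts in the usage line are invented to illustrate a high-precision, low-recall classifier and are not our test results; the function name is hypothetical.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from binary confusion counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented counts: few false positives, many false negatives.
metrics = binary_metrics(tp=6, fp=1, tn=9, fn=4)
```

A gap between precision and recall of this kind means that articles predicted as hyperpartisan are usually correct, while many hyperpartisan articles are missed.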

Conclusion
Using a combination of style-based approaches, content-based approaches, and the N-gram CNN model, we construct a model for detecting hyperpartisan news. To this end, we extract a broad range of linguistic features and employ a GBDT model to make predictions. Finally, we adopt a weighted average to combine the two models.
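The weighted average of the two models can be sketched as follows, assuming each model outputs a probability that an article is hyperpartisan. The weight `w_gbdt`, the decision threshold of 0.5, and the function names are hypothetical; the weight actually used is not stated here.

```python
def weighted_average(p_gbdt, p_cnn, w_gbdt=0.5):
    """Blend the two models' hyperpartisan probabilities with weight w_gbdt in [0, 1]."""
    if not 0.0 <= w_gbdt <= 1.0:
        raise ValueError("w_gbdt must lie in [0, 1]")
    return w_gbdt * p_gbdt + (1.0 - w_gbdt) * p_cnn


def predict(p_gbdt, p_cnn, w_gbdt=0.5, threshold=0.5):
    """Label an article hyperpartisan when the blended probability reaches the threshold."""
    return weighted_average(p_gbdt, p_cnn, w_gbdt) >= threshold


blended = weighted_average(0.8, 0.4, w_gbdt=0.75)  # 0.75 * 0.8 + 0.25 * 0.4
```

In practice the weight would be chosen on held-out data, for example by comparing cross-validation accuracy across candidate values.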