The Sally Smedley Hyperpartisan News Detector at SemEval-2019 Task 4

This paper describes our system submitted to the formal run of SemEval-2019 Task 4: Hyperpartisan news detection. Our system is based on a linear classifier using several types of features: 1) embedding features based on the pre-trained BERT embeddings, 2) article length features, and 3) embedding features of informative phrases extracted from the by-publisher dataset. Our system achieved 80.9% accuracy on the test set for the formal run and placed 3rd out of 42 teams.


Introduction
Hyperpartisan news detection (Kiesel et al., 2019; Potthast et al., 2018) is a binary classification task: given a news article text, systems have to decide whether or not it follows a hyperpartisan argumentation, i.e., "whether it exhibits blind, prejudiced, or unreasoning allegiance to one party, faction, cause, or person" (Kiesel et al., 2019). As resources for building such a system, the by-publisher and by-article datasets are provided by the organizers. The by-publisher dataset is a collection of news articles labeled with the overall bias of the publisher, as provided by BuzzFeed journalists or MediaBiasFactCheck.com. The by-article dataset is a collection labeled through crowdsourcing on a per-article basis; it contains only the articles whose labels all the crowd-workers agreed on. The performance measure is accuracy on a balanced set of articles.
Our system is based on a linear classifier using several types of features, mainly consisting of 1) embedding features based on the pre-trained BERT embeddings (Devlin et al., 2018), 2) article length features, and 3) embedding features of informative phrases extracted from the by-publisher dataset. Our system achieved 80.9% accuracy on the test set and placed 3rd out of 42 teams in the formal run.

System Description
This section first presents an overview of our system and then elaborates on the feature set.

Overview of Our System
Our system is based on a linear classifier that models the conditional probability distribution over the two labels (positive or negative) given features. Let f ∈ R^D be a feature vector, W ∈ R^(D×2) a trainable weight matrix, and b ∈ R^2 a trainable bias vector. Then, we compute the conditional probability as follows:

y = softmax(W⊤f + b),    (1)

where softmax(·) represents the softmax function that receives an N-dimensional vector x = (x_1, . . . , x_N) and returns another N-dimensional vector, namely:

softmax(x)_i = exp(x_i) / Σ_{j=1}^{N} exp(x_j).    (2)

After the softmax computation, we obtain the two-dimensional vector y ∈ R^2. We assume that the first dimension of this vector represents the probability of the positive label, and the second one represents that of the negative label.
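The classifier above can be sketched as follows. This is a minimal NumPy sketch; the dimensions and variable names are illustrative, not the actual implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax: subtract the max before exponentiating.
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def predict_proba(f, W, b):
    # y = softmax(W^T f + b); y[0] = P(positive), y[1] = P(negative).
    return softmax(W.T @ f + b)

# Toy dimensions: D = 4 features, 2 labels.
rng = np.random.default_rng(0)
f = rng.normal(size=4)
W = rng.normal(size=(4, 2))
b = np.zeros(2)
y = predict_proba(f, W, b)
```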
To boost the performance, we concatenate three types of features, f_1 ∈ R^(D_1), f_2 ∈ R^(D_2), and f_3 ∈ R^(D_3), into the single feature vector f ∈ R^D, where D = D_1 + D_2 + D_3.
As f_1, f_2, and f_3, we design the following features:

• f_1: BERT feature (Section 2.2)
• f_2: Article length feature (Section 2.3)
• f_3: Informative phrase feature (Section 2.4)

For training our classifiers, we used only the by-article dataset, not the by-publisher dataset. This is because the labels of the by-publisher dataset turned out to be rather noisy: in our preliminary experiments, we found that the performance drops when training the classifiers on the by-publisher dataset.
Furthermore, we apply the following three techniques.
1. Word dropout: We adopted word dropout (Iyyer et al., 2015) for regularization. The dropout rate was set to 0.3.

2. Oversampling: As mentioned above, the gold label distribution of the training set is unbalanced, while that of the test set is balanced. We deal with this imbalance by sampling 169 (407 − 238) extra examples from the hyperpartisan data.

3. Ensemble: We trained 100 models with different random seeds. In addition, for each seed, we trained models for 40, 50, 60, and 70 epochs. Consequently, we average the outputs of these 400 models.
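The first two techniques can be sketched as follows. This is a toy sketch under our reading of the paper: function names are illustrative, and the oversampling here duplicates random minority examples until the class counts match.

```python
import random

def word_dropout(tokens, rate=0.3, rng=None):
    # Randomly drop each token with probability `rate` (Iyyer et al., 2015).
    # Applied only at training time, as a regularizer.
    rng = rng or random.Random(0)
    kept = [t for t in tokens if rng.random() >= rate]
    # Keep at least one token so an article is never emptied entirely.
    return kept if kept else [tokens[0]]

def oversample(majority, minority, target_size):
    # Pad the minority class with resampled examples until it reaches
    # `target_size` (in the paper: 169 extras, i.e., 407 - 238).
    rng = random.Random(0)
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return majority, minority + extra
```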

BERT Feature
Our system uses BERT (Devlin et al., 2018). As a strategy for applying BERT to downstream tasks, Devlin et al. (2018) recommend a fine-tuning approach, which fine-tunes the parameters of BERT on a target task. In contrast, we adopt a feature-based approach, which uses the hidden states of the pre-trained BERT model as input representations for a task-specific model. The main advantage of this approach is computational efficiency: we do not have to update any parameters of BERT, and once we calculate a fixed feature vector for an article, we can reuse it across all the models in the ensemble.
In our system, we used BERT to compute a feature vector f 1 for an input article. Specifically, we first fed an article to the pre-trained BERT model as input and extracted the representations of all the words from the top four hidden layers. Then, to summarize these representations into a single feature vector f 1 , we tried the following three methods.
1. Average: Averaging the representations of all the words in an article.
2. BiLSTM: Using the representations as input to a BiLSTM. This is the same method as the best-performing one reported by Devlin et al. (2018).

3. CNN: Using the representations as input to a CNN, in the same way as Kim (2014).
As we describe in Section 3.2, we finally adopted the averaged BERT vectors as f 1 .
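The averaging summarization we adopted can be sketched as follows. This sketch assumes the per-layer hidden states have already been extracted (e.g., with `output_hidden_states=True` in the transformers library); the function name and shapes are illustrative.

```python
import numpy as np

def bert_feature(hidden_states, top_k=4):
    """Summarize per-token BERT hidden states into one fixed article vector.

    hidden_states: list of (num_tokens, hidden_dim) arrays, one per layer,
    ordered bottom-to-top. We average the top `top_k` layers, then average
    over all tokens in the article.
    """
    top = np.stack(hidden_states[-top_k:])  # (top_k, num_tokens, hidden_dim)
    per_token = top.mean(axis=0)            # average across the top 4 layers
    return per_token.mean(axis=0)           # average across all tokens
```

Because this vector is fixed, it is computed once per article and reused by every model in the ensemble.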

Article Length Feature
As f_2, we design a feature vector representing the length of an input article. In our preliminary experiments, we found a length bias between hyperpartisan and non-hyperpartisan articles; thus, a vector representing this bias is expected to be useful for discriminating between the two types of articles. Specifically, we define a one-hot feature vector f_2 representing the length of an article (the number of words it contains) by means of histogram bins of width 100: the first bin covers lengths 1 to 100, the second covers 101 to 200, and so on. For example, if the length of an article is 255, the third bin (201 to 300) fires, i.e., the third element of the length vector takes 1 and the others 0: f_2 = [0, 0, 1, 0, 0, · · · ]. In our system, we set the dimension of the vector to D_2 = 11, whose last (11th) element covers all lengths longer than 1000.
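The binning above can be sketched as follows (a minimal sketch; the function name is illustrative):

```python
import numpy as np

def length_feature(num_words, bin_width=100, dim=11):
    # One-hot histogram-bin vector for article length.
    # Bin i (0-based) covers lengths i*100+1 .. (i+1)*100;
    # the last bin (index dim-1) covers everything longer than 1000.
    f2 = np.zeros(dim)
    idx = min((num_words - 1) // bin_width, dim - 1)
    f2[idx] = 1.0
    return f2
```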

Informative Phrase Feature
In the development set, we found some phrases informative and useful for discriminating whether or not a given article is hyperpartisan. We extracted such informative phrases and mapped them to a feature vector f 3 . In this section, we first explain the procedure of extracting informative phrases, and then we describe how to map them to a feature vector.

Phrase Set Creation
To create an informative phrase set, we exploit the by-publisher articles. Basically, we take advantage of chi-squared statistics of N -grams (N = 1, 2, 3).
Creation of S_h. First, we calculate the chi-squared value χ_{x_i} of each N-gram x_i appearing in the by-publisher articles as follows:

χ_{x_i} = Σ_{c ∈ {true, false}} (O_c − E_c)^2 / E_c,    (3)

where the observed and expected counts O_c and E_c are defined as follows:

O_c = f_c(x_i),    (4)
E_c = T_c · (f_true(x_i) + f_false(x_i)) / (T_true + T_false),    (5)

where f_true(·) and f_false(·) are functions that calculate the frequency of x_i in hyperpartisan articles and non-hyperpartisan articles, respectively, and T_true and T_false are the summed frequencies of all N-grams in hyperpartisan articles and non-hyperpartisan articles, respectively. Then, based on the chi-squared values χ_{x_i}, we sort the N-grams and select the top-M. We thereby obtain a typical N-gram set (hereinafter referred to as S_h) that is informative for judging whether or not an article is hyperpartisan. In our system, we use M = 200,000.
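The scoring and selection can be sketched as follows. This is a sketch of a standard two-cell chi-squared ranking consistent with the definitions above; the function names are illustrative.

```python
from collections import Counter

def chi_squared(f_true, f_false, T_true, T_false):
    # Two-cell chi-squared statistic for one N-gram: observed class counts
    # vs. counts expected if the N-gram were distributed across the two
    # article classes in proportion to the class totals.
    total = f_true + f_false
    chi = 0.0
    for obs, T in ((f_true, T_true), (f_false, T_false)):
        exp = T * total / (T_true + T_false)
        if exp > 0:
            chi += (obs - exp) ** 2 / exp
    return chi

def top_ngrams(true_counts, false_counts, m):
    # Rank all N-grams by chi-squared and keep the top-M
    # (M = 200,000 in the paper). Counts are Counters over N-gram strings.
    T_true = sum(true_counts.values())
    T_false = sum(false_counts.values())
    scores = {x: chi_squared(true_counts[x], false_counts[x], T_true, T_false)
              for x in set(true_counts) | set(false_counts)}
    return sorted(scores, key=scores.get, reverse=True)[:m]
```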
S_h can mostly capture the characteristics of hyperpartisan articles. However, S_h may include some N-grams that are typical of a certain publisher, because the by-publisher dataset is labeled by the overall bias of the publisher as provided by BuzzFeed journalists or MediaBiasFactCheck.com.
Creation of S_p. To remedy this problem, we create another phrase set S_p, consisting of N-grams that are typical of a publisher, and exclude them from S_h.
Here we consider a certain publisher p_l. First, we calculate the chi-squared value of each N-gram x_i in the same way as Eqs. 3-5 for S_h, but with true and false now meaning "appearing in articles of publisher p_l" and "appearing in articles of other publishers", respectively, instead of appearing in hyperpartisan and non-hyperpartisan articles. Then, we keep only the N-grams x_i where f_true(x_i) is less than f_false(x_i) and sort them by their chi-squared values. This is because we aim to exclude only N-grams that are typical of a certain publisher.
Next, we try four ways to create S_{p_l} from the sorted χ_x statistics. Here, j denotes the index into the N-gram list sorted by χ_x values in descending order, i.e., χ_{x_1} is the highest of all calculated χ_x values.
1. Top-T_o: The first setting selects the top-T_o N-grams. Concretely, S_{p_l} is defined as follows:

S_{p_l} = { x_j | j ≤ T_o }.    (6)

2. χ-based: The second setting selects N-grams based on their χ values. Concretely, S_{p_l} is defined as follows:

S_{p_l} = { x_j | χ_{x_j} ≥ T_c, j ≤ T_m }.    (7)

3. f_true-based: The third setting selects N-grams based on their f_true(x_j) values. Concretely, S_{p_l} is defined as follows:

S_{p_l} = { x_j | f_true(x_j) ≥ T_f, j ≤ T_m }.    (8)

4. f_true-f_false ratio-based: The fourth setting selects N-grams based on the ratio between f_true and f_false. Concretely, S_{p_l} is defined as follows:

S_{p_l} = { x_j | f_true(x_j) / f_false(x_j) ≥ T_r, j ≤ T_m }.    (9)

T_o, T_c, T_f, T_r, and T_m are hyper-parameters. Next, we obtain S_p as the union over all publishers:

S_p = ∪_l S_{p_l}.    (10)

At last, we obtain the filtered N-gram set S as follows:

S = S_h \ S_p.    (11)

Phrase Embedding
We map the obtained N-gram phrase set S to a feature vector f_3. We exploit GloVe vectors (Pennington et al., 2014) instead of one-hot vectors in order to facilitate generalization. First, we enumerate the N-grams of S included in an article and compute a vector for each N-gram as the average of the GloVe vectors of its words. For example, the vector for the phrase "news article" is computed as follows:

v("news article") = (GloVe(news) + GloVe(article)) / 2,

where GloVe(w) denotes the GloVe (glove.840B) vector of the word w. Then, we compute f_3 as the average of all such N-gram vectors included in the article.
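The two averaging steps can be sketched as follows, with a toy embedding dict standing in for the real glove.840B vectors; the function names are illustrative.

```python
import numpy as np

def phrase_vector(phrase, emb):
    # Vector of an N-gram = average of the GloVe vectors of its words.
    dim = next(iter(emb.values())).shape
    vecs = [emb[w] for w in phrase.split() if w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def f3_feature(article_ngrams, phrase_set, emb):
    # f3 = average over the vectors of all informative N-grams (members of S)
    # that occur in the article; zero vector if none occur.
    dim = next(iter(emb.values())).shape
    hits = [phrase_vector(g, emb) for g in article_ngrams if g in phrase_set]
    return np.mean(hits, axis=0) if hits else np.zeros(dim)
```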

Settings
We trained linear classifiers on the by-article dataset (not on the by-publisher dataset). To estimate the test-set performance of each setting, we conducted 5-fold cross-validation on the by-article dataset. For optimization of the classifiers, we used Adam with a learning rate of 0.001, β_1 = 0.9, and β_2 = 0.999, and set the mini-batch size to 32. Note that, for efficiency, we did not take the ensemble approach in the experiments reported in Sections 3.2 and 3.3.

Operation on BERT Vectors
We conducted experiments with each of the three methods for summarizing BERT vectors (BERT-Base, Uncased; https://github.com/google-research/bert) mentioned in Section 2.2. In this experiment, we used only f_1 as the feature vector f, i.e., without f_2 and f_3. Table 1 shows the performance of each setting. The averaging method performed best, so we adopted averaged BERT vectors as f_1 for the evaluation of the formal run. We also used averaged BERT vectors as f_1 in the following experiments.

Method to Create N -gram Set
As mentioned in Section 2.4, we examined which method is best for creating the informative N-gram set S (and the f_3 derived from it). In this experiment, we used f_1 and f_2 together with f_3 as features. Table 2 shows the performance of each setting. The performance was best when we adopted Top-T_o (T_o = 100) for the creation of S_p; we therefore used the f_3 created in this setting.

Ablation
To verify the contribution of each of the three types of features, we conducted feature ablation experiments. In addition, we investigated to what extent the ensemble approach improves the performance. In these experiments, we used only 10 (not 100) different random seeds for the ensemble due to time constraints.

Table 2 (excerpt): Accuracy of hyperpartisan classification for each method to create the N-gram set.
f_true-f_false ratio-based (T_r = 0.5)  0.754
f_true-f_false ratio-based (T_r = 0.8)  0.771
f_true-f_false ratio-based (T_r = 1.0)  0.756

Table 3: Feature ablation and ensemble results.
Features        Ensemble  Accuracy
f_1, f_2, f_3   true      0.788
f_1, f_2, f_3   false     0.777
f_1, f_2        false     0.769
f_1             false     0.760

Table 3 shows the performance of each setting. We found that f_2 and f_3 each improved the accuracy by about 0.01. Additionally, the ensemble method increased the accuracy by about another 0.01.

Conclusion
We described our system submitted to the formal run of SemEval-2019 Task 4: Hyperpartisan news detection. We trained a linear classifier using several types of features, mainly consisting of 1) BERT embedding features, 2) article length features representing the distribution of article lengths, and 3) embedding features derived from filtered N-grams that typically appear in hyperpartisan articles. Our system achieved 80.9% accuracy on the test set for the formal run and placed 3rd out of 42 teams.