Team yeon-zi at SemEval-2019 Task 4: Hyperpartisan News Detection by De-noising Weakly-labeled Data

This paper describes our system submitted to SemEval-2019 Task 4: Hyperpartisan News Detection. We focus on removing the noise inherent in the hyperpartisanship dataset at both the data level and the model level by leveraging semi-supervised pseudo-labels and the state-of-the-art BERT model. Our model achieves 75.8% accuracy on the final by-article dataset without ensemble learning.


Introduction
With the ever-growing use of the internet, the problem of fake news, which spreads at a destructive speed, has attracted much attention. Fake news is typically inflammatory, extremely one-sided (hyper-partisan), or untruthful, and is intended to mislead the public into holding distorted beliefs.
Previous works attempted to solve the fake news problem from various aspects, ranging from knowledge-based (Wu et al., 2014; Shi and Weninger, 2016) to style-based (Wang, 2017; Potthast et al., 2018). Some fake news datasets are publicly available, but they are often too small to be suitable for neural approaches (Horne and Adali, 2017; Pérez-Rosas et al., 2017). Recently, the organizers of SemEval-2019 Task 4 (Kiesel et al., 2019) released a large-scale dataset that addresses fake news detection as a hyper-partisan news detection problem. The task is to determine whether a given article is hyper-partisan (extremely right-wing or left-wing) or not (mainstream). Such a task allows for pre-screening in semi-automatic fake news detection and, more importantly, brings us one step closer to fully automated fake news detection.
Initially, we focused on learning and utilizing useful features such as topic and sentiment information. Considering the purpose of hyper-partisan news, we believed that the stance on politically sensitive topics would be crucial in determining hyperpartisanship. However, experiments showed that the dataset contains inherent noise that acts as a big barrier to learning a good classifier: 1) noisy text inputs, i.e. articles that contain domain-specific (political) words, slang and spelling mistakes, which are likely to be out of vocabulary (OOV); 2) noisy labels that mainly result from using publisher-level information for labeling articles (i.e. all articles from left/right-wing publishers are labeled as "hyper-partisan"; for more detail, refer to Section 2).
* These two authors contributed equally.
Nevertheless, creating a large-scale human-labeled dataset is very expensive and time-consuming, so it is crucial to find a better way to utilize this weakly-labeled dataset. Therefore, we experimented with reducing noise to help models learn better. In our work, we apply semi-supervised pseudo-labeling to de-noise the dataset (Figure 1) and leverage the state-of-the-art pre-trained BERT (Devlin et al., 2018) to obtain a better representation of the noisy input.
We use the publicly available dataset "SemEval 2019 Task 4 - Hyperpartisan News Detection", which is labeled in two different ways: at the publisher level and at the article level.
• Publisher-level (by-publisher): A total of 750K articles are labeled based on the political orientation of the publisher, without considering the content. The data has an equal ratio (375K/375K) of hyperpartisan to non-hyperpartisan articles. Among the hyperpartisan samples, there is an equal ratio (187.5K/187.5K) of right to left political orientation.
• Article-level (by-article): A total of 645 articles are labeled at the article level by checking the actual content. The data contains only articles for which a consensus among the crowdsourcing workers existed. Of these, 238 (37%) are hyperpartisan and 407 (63%) are not.

Discussion on the Inherent Noise
By manually inspecting samples, we discovered that some articles did not have the correct labels. Since the political orientation of the publisher was used as the sole criterion for the labels, such labeling noise is not surprising: it cannot be guaranteed that all articles from a hyperpartisan publisher are hyper-partisan. Another possible source of noise is the lack of enough non-hyper-partisan publishers (i.e. the percentage of "least bias" label items in Table 1 is not 50%), so that news from "right-center" and "left-center" publishers is also treated as non-hyper-partisan.

Methodology
In this section, we describe the de-noising performed by our system (Figure 2). Our system consists of two steps: 1) obtaining a de-noised by-publisher dataset by leveraging the clean by-article dataset; 2) leveraging the de-noised by-publisher dataset and pre-trained BERT to train our final model. Note that our code is publicly available for reproducibility.

Step 1: Filter Noise by Leveraging Pseudo-labeling
To deal with the noise in the labels, we use pseudo-labels to filter out noisy labels at the data level (Figure 1). Pseudo-labeling is a semi-supervised learning method that approximates the labels of unlabeled data by using a model (M) trained on the labeled dataset. In its original form, pseudo-labeling directly takes the prediction of the model M as the label. As a result, the final model trained on both human-labeled and pseudo-labeled data can be bounded by the accuracy of the model M.
To avoid this problem: 1) We use the original by-publisher label as a constraint. We filter out data points whose by-publisher label and pseudo-label do not match, to obtain a cleaner by-publisher dataset. 2) To be robust to the errors made by the model M, we set thresholds so that only pseudo-labels with relatively high confidence are used. We only consider prediction scores that are larger or smaller than the mid-value (0.5) by a margin of 0.2. By doing so, we can filter out noisy labels with the guarantee that the noise level is, at worst, kept the same. The size of our de-noised dataset is approximately 32K articles for each label, which together amount to about 8.5% of the original data. Note that in our system, the model M is a binary classifier trained on top of the fine-tuned BERT (refer to Step 2) using the clean by-article dataset.
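The two filtering rules above can be sketched in a few lines. This is a minimal illustration, not the released implementation: the function names, the `(text, label)` tuple layout, the toy scores, and the use of strict inequalities at the margin boundaries are all assumptions made for the example. Only the mid-value 0.5 and the margin 0.2 come from the text.

```python
from typing import List, Optional, Tuple

MID = 0.5     # mid-value of the prediction score (from the text)
MARGIN = 0.2  # confidence margin (from the text)


def pseudo_label(score: float, mid: float = MID, margin: float = MARGIN) -> Optional[int]:
    """Return 1 or 0 for a confident score, or None if it lies within the margin.

    Strict inequalities are an assumption; the text does not specify the boundary case.
    """
    if score > mid + margin:
        return 1  # confidently hyperpartisan
    if score < mid - margin:
        return 0  # confidently non-hyperpartisan
    return None   # too close to the mid-value: not trusted


def denoise(examples: List[Tuple[str, int]], scores: List[float]) -> List[Tuple[str, int]]:
    """Keep only examples whose confident pseudo-label agrees with the by-publisher label."""
    kept = []
    for (text, publisher_label), score in zip(examples, scores):
        # None never equals 0 or 1, so low-confidence examples are dropped here too.
        if pseudo_label(score) == publisher_label:
            kept.append((text, publisher_label))
    return kept


# Hypothetical by-publisher examples and scores from the model M.
examples = [("a", 1), ("b", 1), ("c", 0), ("d", 0), ("e", 1)]
scores = [0.95, 0.55, 0.10, 0.85, 0.20]
cleaned = denoise(examples, scores)
```

In this toy run, "b" is dropped for low confidence, while "d" and "e" are dropped because the confident pseudo-label contradicts the by-publisher label. An example is never relabeled, only kept or discarded, which is why the noise level of the retained subset cannot exceed that of the original labels on the same examples.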

Step 2: Obtain Better Input Representation using BERT
The article texts are noisy, containing many political words, slang, and even spelling mistakes, many of which are out of vocabulary (OOV) and harmful to sentence-level and article-level representation learning. We leverage the state-of-the-art pre-trained language representation model BERT both to eliminate the OOV problem, since it uses a subword (WordPiece) vocabulary, and to obtain a better input representation.
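How a subword vocabulary avoids OOV tokens can be illustrated with greedy longest-match-first segmentation, the scheme used by WordPiece-style tokenizers such as BERT's. The sketch below is a simplified illustration: the function name and the tiny vocabulary are assumptions for the example, whereas the actual BERT vocabulary has roughly 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first lookup.

    Pieces that continue a word carry the "##" prefix. If some position cannot
    be matched by any vocabulary entry, the whole word maps to the unknown token.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (hypothetical, for illustration only).
toy_vocab = {"hyper", "##partisan", "##ship", "un", "##truth", "##ful"}

rare_word = wordpiece_tokenize("hyperpartisanship", toy_vocab)
other_word = wordpiece_tokenize("untruthful", toy_vocab)
```

Here a word that a word-level vocabulary would likely miss is represented as `hyper`, `##partisan`, `##ship`, so the model still receives meaningful units instead of a single unknown token.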
Since the pre-trained BERT model is trained on BooksCorpus and Wikipedia, which are not directly relevant to news, we fine-tune BERT, as in the original paper, using our by-publisher news dataset to learn a better representation for our data domain. We build our proposed model by adding a title LSTM and an article LSTM on top of the fine-tuned BERT model; the features they extract are concatenated and fed into the final binary classifier. We train our final classifier using the filtered dataset from Step 1.

Experimental Setup
We use the BERT-Base model from Devlin et al. (2018), which has 12 layers (i.e., Transformer blocks) with a hidden size of 768 and 12 self-attention heads. In Step 1, the parameters of the BERT model were fixed after fine-tuning on the by-publisher dataset, and we then trained the classifier on the by-article dataset with a batch size of 16. We used 10-fold cross-validation to choose the parameters of the classifier, since the by-article dataset is small. In Step 2, we used a batch size of 16 to train the article LSTM with a hidden size of 300 and the title LSTM with a hidden size of 100. The classifiers in Steps 1 and 2 both consist of two linear layers with ReLU and batch normalization in between.
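As an illustration of the 10-fold cross-validation setup, the following sketch splits the 645 by-article examples into ten folds and yields one train/validation split per fold. The function name, the shuffling seed, and the round-robin fold assignment are assumptions for the example; the text does not specify how the folds were drawn (e.g. whether they were stratified by label).

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Round-robin assignment keeps fold sizes within one example of each other.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val


# 645 is the size of the by-article dataset stated in the text.
splits = list(k_fold_indices(645, k=10))
fold_sizes = [len(val) for _, val in splits]
```

Each candidate classifier configuration would be trained on every `train` split and scored on the matching `val` split, with the mean validation score used to compare configurations. With 645 examples, each validation fold holds 64 or 65 articles.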
For evaluation, we considered accuracy and F1 score as the main indicators of performance. For analysis purposes, we also report precision and recall. In the competition, there were two types of test sets (i.e. the by-publisher test set and the by-article test set). However, all of the reported results are obtained from the by-article test set for a fair and correct comparison.
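For reference, the four reported metrics can be computed from binary predictions as sketched below. Treating the hyperpartisan class (label 1) as the positive class is an assumption for the example, as are the function name and the toy labels.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 with label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels (hypothetical): 3 true positives, 3 true negatives,
# 1 false positive and 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

Reporting precision and recall alongside F1 shows whether a gain comes from finding more hyperpartisan articles (recall) or from making fewer false alarms (precision).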

Results
We ran experiments on three baseline models, both for comparison and as a simple ablation study of our approach. The results are presented in Table 2.
• 2 LSTM + Attention + Fine-tuned Classifier (LSTM_ft): A baseline model consisting of two LSTM models (one for the title and another for the article) with attention layers and a multi-layer perceptron (MLP) as a classifier on top. It was trained directly on the by-publisher dataset, then fine-tuned using the by-article dataset.
• Pre-trained BERT+Classifier (BERT_pt): This model uses the original pre-trained BERT model to encode both the article and the title, which are fed into a multi-layer perceptron (MLP) to predict the hyper-partisanship of the given article. The parameters of the BERT model were fixed when training the MLP classifier on the by-article data.
• Fine-tuned BERT+Classifier (BERT_ft): For this model, everything is kept the same as in BERT_pt, except that the pre-trained BERT was fine-tuned using the by-publisher dataset.
Table 2: Results of our model and other baseline models on the final by-article test set.
Firstly, we can observe that simply using pre-trained BERT (BERT_pt) to represent the input cannot outperform an LSTM model trained entirely on the hyperpartisan dataset. However, by fine-tuning BERT using our dataset (BERT_ft), we gain an improvement of approximately 10% in accuracy, outperforming LSTM_ft by ≈ 3%. Hence, we can infer that by injecting some domain-specific data into the original BERT, we can obtain an improved text representation for solving our task. Note that the model sizes of Pre-trained BERT+Classifier and Fine-tuned BERT+Classifier are the same.
Secondly, by training the same fine-tuned BERT model on the de-noised dataset obtained in Step 1, we observed a big improvement in accuracy, F1 and recall of ≈ 10%, ≈ 23% and ≈ 40% respectively. This clearly illustrates the power of de-noising the dataset using pseudo-labels as auxiliary reference labels. We would also like to emphasize that we did not use any ensemble learning or tricks, which normally give an extra 1-2% gain in the final performance. Our system ranked 11 out of 43 teams that participated.
Lastly, we note that our LSTM_ft model is a strong baseline: it achieved a high score on the by-publisher test set, obtaining 0.663 and 0.694 for accuracy and F1 respectively (rank 5/28).

Interesting Analysis
Although our current system does not make direct use of topic information, we present an interesting result obtained while experimenting with topic modeling for hyper-partisanship detection. We used Latent Dirichlet Allocation (LDA) for topic modeling, and the results empirically showed interesting relationships between topics and hyper-partisanship. Sensitive topics such as war and political parties tend to have more hyperpartisan news than neutral topics such as school and sports games. We believe that leveraging such information would be helpful in future work.

Related Works
In this part, we briefly review prior work on language representation as well as the semi-supervised learning method we used. Kiros et al. (2015) tried to learn sentence embeddings by reconstructing the surrounding sentences of an encoded passage. Peters et al. (2018) proposed to extract context-sensitive features from a language model. Devlin et al. (2018) jointly conditioned on both left and right context and obtained state-of-the-art results on eleven natural language processing tasks. Triguero et al. (2015) provided a survey of self-labeled methods for semi-supervised classification. Zhu and Goldberg (2009) showed that self-labeled techniques are typically divided into self-training and co-training. Lin et al. (2018) proposed semi-supervised learning that leverages a small amount of user-comment data to train a model and then expands the dataset using that trained model.

Conclusion
To conclude, we successfully removed noise at the data level and the model level by utilizing pseudo-labels and the state-of-the-art BERT. Our de-noised model outperformed all of the other baselines and achieved rank 11 from 42 teams. Since manually labeling fake news data is expensive, our approach of obtaining a cleaner and larger dataset by leveraging a smaller but clean dataset is meaningful.