Detecting Media Bias in News Articles using Gaussian Bias Distributions

Media plays an important role in shaping public opinion. Biased media can influence people in undesirable directions and hence should be unmasked as such. We observe that feature-based and neural text classification approaches which rely only on the distribution of low-level lexical information fail to detect media bias. This weakness becomes most noticeable for articles on new events, where words appear in new contexts and hence their “bias predictiveness” is unclear. In this paper, we therefore study how second-order information about biased statements in an article helps to improve detection effectiveness. In particular, we utilize the probability distributions of the frequency, positions, and sequential order of lexical and informational sentence-level bias in a Gaussian Mixture Model. On an existing media bias dataset, we find that the frequency and positions of biased statements strongly impact article-level bias, whereas their exact sequential order is secondary. Using a standard model for sentence-level bias detection, we provide empirical evidence that article-level bias detectors that use second-order information clearly outperform those without.


Introduction
Media bias is discussed and analyzed in journalism research (Groseclose and Milyo, 2005; DellaVigna and Kaplan, 2007; Iyengar and Hahn, 2009) and NLP research (Gerrish and Blei, 2011; Iyyer et al., 2014; Chen et al., 2018). According to the study of Groseclose and Milyo (2005), bias "has nothing to do with the honesty or accuracy", but it means "taste or preference". In fact, journalists may (1) report facts only in favor of one particular political side and thus (2) conclude with their own opinion. As an example, the following sentences from allsides.com reporting on the event "Trump asks if disinfectant, sunlight can treat coronavirus" demonstrate media bias on the sentence level:
The activists falsely claimed that Trump "urged Americans to inject themselves with disinfectant" and "told people to drink bleach." - The Daily Wire, right-oriented
Lysol maker issues warning against injections of disinfectant after Trump comments - The Hill, center-oriented
"This notion of injecting or ingesting any type of cleansing product into the body is irresponsible and it's dangerous," said Gupta. - NBC News, left-oriented
From an NLP perspective, bias in the example sentences could be detected by capturing sentiment words, such as "falsely" or "irresponsible". Without the background knowledge of the political side of Trump or the event itself, however, predicting which side these sentences are slanted to is difficult.
Bias detection becomes even harder at the article level. For illustration, Figure 1 shows two articles from the dataset used, along with their sentence-level bias. It becomes clear that the actual words in the biased sentences do not always suffice to distinguish biased from neutral articles, nor does the count of the biased sentences: Bias assessments on the sentence level do not "add up". In this regard, the position of biased sentences appears to be a better feature.
The existing approaches to bias detection are transferred from other, less intricate text classification tasks. They largely model low-level lexical information, either explicitly, e.g. by using bag-of-words (Gerrish and Blei, 2011), or implicitly via neural networks (Gangula et al., 2019). Such approaches tend to fail at the article level, particularly for articles on events not covered in the training data. The reason is that bias clues are subtle and rare in articles, especially event-independent clues. Altogether, modeling low-level information at the article level is insufficient to detect article-level bias, as we will later stress in experiments.
[Figure 1: Excerpts of two articles from the dataset, both reporting on Hillary Clinton's Libya testimony, shown with their sentence-level bias.]
We study article-level bias detection both with and without allowing the classifiers to learn event-specific information. The latter scenario is more challenging, but it is closer to the real world, because we cannot expect that the information in future articles always relates to past events. Inspired by ideas from modeling local and global polarities in sentiment analysis (Wachsmuth et al., 2015), we hypothesize that using second-order bias information in terms of lexical and informational bias at the sentence level is key to detecting article-level bias. To the best of our knowledge, no bias detection approach so far uses such information. We investigate this hypothesis in light of three research questions:
Q1. How effective are standard classification approaches in article-level bias detection, with and without exploiting event information?
Q2. How does sentence-level bias impact article-level bias in general?
Q3. To what extent can sentence-level bias detection be utilized for article-level bias detection?
To study Q1-Q3, we employ the BASIL dataset, which includes manually annotated bias labels at article level as well as lexical and informational bias labels at sentence level (Fan et al., 2019). While the dataset contains only 300 articles, it provides the best basis for understanding the interaction of bias at both levels available so far.
For Q1, we evaluate an n-gram-based SVM and a BERT-based neural network in article-level bias detection. To assess the impact of event-related information, we split the dataset in two ways, once with event overlap in the training set and test set, and once without. As expected, we observe that the effectiveness of both approaches is generally low, especially when event information cannot be exploited. The results indicate that the concept of sentence-level bias is too subtle and rare to be utilized by these approaches.
For Q2, we study multiple types of correlations between sentence-level and article-level bias on the ground-truth annotations, covering (a) the frequency of biased sentences, (b) their position in an article, and (c) their sequential order. For each type, we model the bias distribution in a new way through a Gaussian Mixture Model (GMM), in order to then exploit it as features of an SVM (for frequency), Naïve Bayes (for positions), and a first-order Markov model (for sequential order). The results show strong correlations between the two levels for frequency and position information, whereas sequential order seems less correlated.
For Q3, finally, we propose a new approach applicable in realistic settings. In particular, we retrain the bias detectors from the Q1 experiments on the sentence level and then exploit the GMM as above to predict article-level bias. In our evaluation, the approach significantly outperforms the article-level approaches analyzed for Q1. Counter to intuition, it even achieves higher effectiveness than what we observed on the ground truth for Q2. We explain this result by the fact that the sentence-level detector creates more deterministic sentence bias features, allowing our approach to learn from them in a more robust way.
Altogether, the contribution of this paper is threefold: (1) We provide evidence that standard approaches fail in detecting article-level bias. (2) We develop a new approach utilizing second-order bias information, i.e., sentence-level bias. (3) We show that second-order bias information is an effective means to build better article-level bias classifiers.

Related Work
Media bias detection has been studied computationally since the work of Lin et al. (2006). Since then, media bias has been investigated in slight variations under different names, including perspective (Lin et al., 2006), ideology (Iyyer et al., 2014), truthfulness (Rashkin et al., 2017), and hyperpartisanship (Kiesel et al., 2019). To detect bias, early approaches relied on low-level lexical information. For example, Greene and Resnik (2009) used kill verbs and domain-relevant verbs to detect whether articles take a pro-Israeli or a pro-Palestinian perspective. Recasens et al. (2013) relied on linguistic cues, such as factive verbs and implicatives, in order to assess whether a Wikipedia sentence conveys a neutral point of view or not. Besides the NLP community, researchers in journalism have also approached the measurement of media bias. For example, Gentzkow and Shapiro (2010) used the phrases preferred by each side (such as "war on terror" for Republicans but "war in Iraq" for Democrats). Groseclose and Milyo (2005) used the counts of think-tank citations to estimate the bias.
With the rise of deep learning, NLP researchers have also used neural-based approaches for bias detection. Iyyer et al. (2014) used RNNs to aggregate the polarity of each word to predict sentence-level bias based on parse trees. Gangula et al. (2019) made use of headline attention to classify article bias. Li and Goldwasser (2019) encoded social information in their Graph-CNN. While deep learning is believed to capture deeper relations among its inputs, we show that extending a neural network from sentence-level to article-level bias detection does not "just work".
One point of variation in media bias detection is the level of text being analyzed, which varies from tokens (Fan et al., 2019) and sentences (Bhatia and Deepak, 2018) to articles (Kulkarni et al., 2018), sources (Baly et al., 2019), and users (Preoţiuc-Pietro et al., 2017). While the effectiveness of machine learning models on different levels helps to understand how media bias becomes manifest at different levels, Lin et al. (2006) are to our knowledge the only ones to discuss the difference between sentence-level and article-level bias detection.
Source-level and user-level bias can be seen as directly emerging from summing up bias in the associated texts. For example, Baly et al. (2019) averaged the feature vectors of articles as the feature vectors of a source. The relation between sentence-level and article-level bias remains unstudied so far. The goal of this paper is not to discuss the difference between these levels. Rather, we examine how to aggregate the sentence-level bias to generate second-order features, and then use these features to predict article-level bias. The use of low-level information to generate second-order features was studied in the context of product reviews by modeling patterns in the reviews' sentiment flow (Wachsmuth et al., 2015), by tuning a neural network to capture important sentences (Xu et al., 2016), and by using routing to aggregate sentence embeddings into a document embedding (Gong et al., 2018). In particular, our usage of low-level information is inspired by Wachsmuth et al. (2015), in that we hypothesize that such flows exist in media bias as well. However, we do not limit our approach to entire sequences of sentence-level information, but we also consider frequency, position, or only two to three consecutive sentences.

Standard Bias Detection Approaches
Standard approaches for bias detection, on both article and sentence level, mainly exploit low-level lexical features to classify texts as biased or not, neglecting bias-specific features. The two main low-level lexical feature types that are employed in such approaches are: (1) n-gram features, where n is typically one to three (i.e., unigram, bigram, or trigram), and (2) word embeddings, especially within pre-trained language models (i.e., transformers) such as BERT.
We propose two classification settings to answer research question Q1, which addresses the importance of event information: In the first setting, called event overlapping, we form the training and test sets by randomly assigning examples to them, more specifically, without looking at event information. The setting allows texts of the same event to occur in both the training and the test set. The second setting is called event non-overlapping since the texts to be classified are first categorized according to the main event that they address. During the splitting in training set and test set we then ensure for each event that all its related texts are in exactly one of these sets.
The difference in the effectiveness of the standard approaches on the two settings indicates whether and to what extent standard bias detection approaches rely on event information.
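The two settings can be illustrated with a minimal sketch. The article records, the `event` field, and both function names below are hypothetical and only serve the illustration; the split sizes of the actual experiments are not reproduced here.

```python
import random
from collections import defaultdict

def split_event_overlapping(articles, n_test, seed=0):
    """Random split that ignores events, so one event may occur in both sets."""
    shuffled = articles[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]

def split_event_non_overlapping(articles, n_test_events, seed=0):
    """Split by event, so all articles of an event end up in exactly one set."""
    by_event = defaultdict(list)
    for article in articles:
        by_event[article["event"]].append(article)
    events = sorted(by_event)
    random.Random(seed).shuffle(events)
    test_events = set(events[:n_test_events])
    train = [a for a in articles if a["event"] not in test_events]
    test = [a for a in articles if a["event"] in test_events]
    return train, test

# Toy corpus (assumption): 10 events with three articles each.
articles = [{"id": f"e{e}-a{i}", "event": e} for e in range(10) for i in range(3)]

train_o, test_o = split_event_overlapping(articles, n_test=6)
train_n, test_n = split_event_non_overlapping(articles, n_test_events=2)

# In the non-overlapping setting, no event is shared between the two sets.
shared_events = {a["event"] for a in train_n} & {a["event"] for a in test_n}
```

Splitting on the level of events rather than articles is what prevents a classifier from benefiting from event-specific vocabulary seen during training.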

Second-Order Bias Information
For research question Q2, we study the correlation between sentence-level and article-level bias. Specifically, we examine whether article-level bias correlates with (a) the frequency of biased sentences, (b) their position in an article, and (c) their sequential order. For each correlation, we extract features and then train a respective machine learning model. The code is available at https://github.com/webis-de/EMNLP-20.

Bias Frequency
A straightforward way of leveraging sentence-level bias information is counting. Let an article with sentence-level bias labels $\{b_1, b_2, \ldots, b_n\}$ be given, where $n$ is the number of sentences in the article and $b_i$ is the label of the $i$-th sentence. Assuming that $b_i$ is binary with $b_i = 1$ denoting bias, the absolute bias frequency, $f_{abs}$, is defined as:
$$f_{abs} = \sum_{i=1}^{n} b_i \tag{1}$$
Accordingly, the relative bias frequency, $f_{rel}$, is defined based on the length of the article as:
$$f_{rel} = \frac{f_{abs}}{n} \tag{2}$$
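As a small illustration of the two frequency features, the following sketch computes them for a made-up label sequence. The function name and the example labels are assumptions for illustration only.

```python
def bias_frequencies(labels):
    """Return (f_abs, f_rel) for a list of binary sentence-level bias labels."""
    if not labels:
        raise ValueError("an article needs at least one sentence")
    f_abs = sum(labels)            # number of biased sentences
    f_rel = f_abs / len(labels)    # share of biased sentences in the article
    return f_abs, f_rel

# Hypothetical article with eight sentences, two of which are biased.
example_labels = [0, 1, 0, 0, 1, 0, 0, 0]
f_abs, f_rel = bias_frequencies(example_labels)
```

The relative frequency makes articles of different lengths comparable, whereas the absolute frequency does not.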

Bias Position
We consider the positions of biased sentences as second-order features. Given a target number of positions, $k$, we first normalize the sentence-level bias labels $\{b_1, b_2, \ldots, b_n\}$ to a sequence of fixed length, $\{\tilde{b}_1, \tilde{b}_2, \ldots, \tilde{b}_k\}$. The higher $\tilde{b}_i$, the more likely position $i$ is biased. In detail, we first normalize $\{b_1, b_2, \ldots, b_n\}$ to $\{b'_1, b'_2, \ldots, b'_m\}$ by linear interpolation, where $m$ (here set to 100) is larger than the largest $n$ (and also larger than $k$). After the interpolation, each $b'_j$ is in the range $[0, 1]$. Second, we "sample" from the $b'_j$ so that the final sentence-level bias sequence has length $k$. We explore three "sampling" methods: (1) average (take the average of the data points in the range), (2) maximum (take the maximum value in the range), and (3) last (take the last data point in the range). We treat the method as a hyperparameter and select the best one on the validation set. We use this two-step normalization (upsampling and then downsampling) to avoid instability during sampling when $n/k$ is not an integer.
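The two-step normalization can be sketched as follows. This is a minimal sketch under stated assumptions: sentence labels are placed at evenly spaced points before interpolation, and the $m$ interpolated points are cut into $k$ contiguous ranges with integer boundaries. Neither detail is specified in the text, and the function names are hypothetical.

```python
def interpolate(labels, m=100):
    """Upsample n sentence labels to m evenly spaced points (linear interpolation)."""
    n = len(labels)
    if n == 0:
        raise ValueError("an article needs at least one sentence")
    if n == 1:
        return [float(labels[0])] * m
    points = []
    for j in range(m):
        x = j * (n - 1) / (m - 1)   # fractional index into the original labels
        lo = min(int(x), n - 2)     # left neighbour of x
        frac = x - lo
        points.append(labels[lo] * (1 - frac) + labels[lo + 1] * frac)
    return points

def downsample(points, k, method="average"):
    """Reduce the points to k values, one per contiguous range."""
    m = len(points)
    result = []
    for i in range(k):
        chunk = points[i * m // k:(i + 1) * m // k]
        if method == "average":
            result.append(sum(chunk) / len(chunk))
        elif method == "maximum":
            result.append(max(chunk))
        elif method == "last":
            result.append(chunk[-1])
        else:
            raise ValueError(f"unknown sampling method: {method}")
    return result

def normalize_bias(labels, k, m=100, method="average"):
    """Two-step normalization: upsample to m points, then downsample to k."""
    return downsample(interpolate(labels, m), k, method)

# Hypothetical article with four sentences, the second one being biased.
positions_max = normalize_bias([0, 1, 0, 0], k=2, method="maximum")
positions_avg = normalize_bias([0, 1, 0, 0], k=4, method="average")
```

Because every article is first mapped to the same number of points, the ranges used for downsampling have the same boundaries for all articles, irrespective of their length.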
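Our goal is to predict the most likely article-level bias label, $a^*$, given the sentence-level bias. Formally, assuming that an article can be seen as a combination of its sentences, we have
$$a^* = \arg\max_{a} \; p(a \mid \tilde{b}_1, \tilde{b}_2, \ldots, \tilde{b}_k) \tag{3}$$
where $a$ is any possible bias label (0 for neutral and 1 for bias), and $p(a \mid \cdot)$ is the conditional probability of $a$, given a sentence-level bias sequence. According to Bayes' rule and given that $p(\tilde{b}_1, \tilde{b}_2, \ldots, \tilde{b}_k)$ is irrelevant to the arg max, we can rewrite it as:
$$a^* = \arg\max_{a} \; p(\tilde{b}_1, \tilde{b}_2, \ldots, \tilde{b}_k \mid a) \cdot p(a) \tag{4}$$
Assuming that each $\tilde{b}_i$ is independent from other positions, we further simplify this as
$$a^* = \arg\max_{a} \; p(a) \prod_{i=1}^{k} p(\tilde{b}_i \mid a) \tag{5}$$
which is a Naïve Bayes classifier, and each $p(\tilde{b}_i \mid a)$ is the bias position feature we are interested in.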
In the remainder, we simplify the notation $p(\tilde{b}_i \mid a)$ to $p(\tilde{b} \mid a)$. Estimating $p(\tilde{b} \mid a)$ in each position for each $a$ directly is difficult, since $\tilde{b} \in [0, 1]$ and we cannot observe enough data points in that range on realistic text corpora. Instead, we therefore estimate $p(a \mid \tilde{b}) / p(a)$, which by Bayes' rule equals $p(\tilde{b} \mid a)$ up to the factor $p(\tilde{b})$ that does not depend on $a$. Here, $p(a)$ can be properly estimated from the distribution of the labels, and $p(a \mid \tilde{b})$ can be estimated well using a Gaussian mixture model (GMM) (Reynolds, 2009). The assumption behind GMMs is that a distribution can be seen as a combination of Gaussian distributions, where each distribution is represented by its mean $\mu$, its variance $\sigma^2$, and a weight $w$, the sum of all weights being 1. Fitting a GMM is unsupervised; we only need to set the number of mixtures we would like to have.
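After applying the GMM on $\tilde{b}_{i,j}$, $1 \le j \le m$, the distribution of a bias position $i$ is represented by a set of Gaussian mixtures, $N_l(\mu_l, \sigma_l^2, w_l)$, where $l$ is the index of the mixtures. For each mixture, we can then learn its bias distribution by:
$$p(a = 1 \mid N_l) = \frac{\#(a = 1, N_l)}{\#(N_l)} \tag{6}$$
where $\#(N_l)$ counts the observations assigned to mixture $N_l$, and $\#(a = 1, N_l)$ counts those among them that stem from biased articles. To avoid zero probability in some mixtures, we also apply add-one smoothing. Then, the bias probability $p(\tilde{b} \mid a = 1)$ in one position is:
$$p(\tilde{b} \mid a = 1) \propto \frac{p(a = 1 \mid N_{\tilde{b}})}{p(a = 1)} \tag{7}$$
where $N_{\tilde{b}}$ is the mixture most likely generating $\tilde{b}$.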
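The estimation for a single position can be sketched as follows. This is one possible reading rather than the released implementation: the mixture parameters are hypothetical and assumed to be already fitted (for instance with a GMM library), an observation is assigned to the mixture with the highest weighted density, and add-one smoothing is applied over the two labels. All function names are made up for the illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def most_likely_mixture(x, mixtures):
    """Index of the mixture (mean, var, weight) with the highest weighted density at x."""
    return max(range(len(mixtures)),
               key=lambda l: mixtures[l][2] * gaussian_pdf(x, *mixtures[l][:2]))

def mixture_bias_probs(values, labels, mixtures):
    """Estimate p(a=1 | N_l) for each mixture from training observations."""
    biased = [0] * len(mixtures)
    total = [0] * len(mixtures)
    for x, a in zip(values, labels):
        l = most_likely_mixture(x, mixtures)
        total[l] += 1
        biased[l] += a
    # Add-one smoothing over the two labels (assumption: +1 / +2).
    return [(biased[l] + 1) / (total[l] + 2) for l in range(len(mixtures))]

def position_feature(x, mixtures, probs, prior_biased):
    """Value proportional to p(b | a=1): p(a=1 | N_b) / p(a=1)."""
    return probs[most_likely_mixture(x, mixtures)] / prior_biased

# Hypothetical fitted mixtures for one position: (mean, variance, weight).
mixtures = [(0.1, 0.01, 0.6), (0.9, 0.01, 0.4)]
# Hypothetical normalized bias values at this position and article labels.
values = [0.0, 0.1, 0.2, 0.8, 0.9, 1.0]
labels = [0, 0, 1, 1, 1, 1]

probs = mixture_bias_probs(values, labels, mixtures)
prior_biased = sum(labels) / len(labels)
feature = position_feature(0.95, mixtures, probs, prior_biased)
```

In this toy example, the mixture around 0.9 mostly covers biased articles, so a value of 0.95 receives a feature above 1, meaning that it is more indicative of bias than the label prior alone.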

Bias Sequence
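The Naïve Bayes classifier in Equation 5 assumes that each position is independent from other positions. We can also consider a position to depend on the previous positions. For example, under the assumption that each position depends on the one before, we can rewrite Equation 5 as:
$$a^* = \arg\max_{a} \; p(a) \cdot p(\tilde{b}_1 \mid a) \prod_{i=2}^{k} p(\tilde{b}_i \mid \tilde{b}_{i-1}, a) \tag{8}$$
Then, we can further rewrite $p(\tilde{b}_i \mid \tilde{b}_{i-1}, a)$ as:
$$p(\tilde{b}_i \mid \tilde{b}_{i-1}, a) \propto \frac{p(a \mid \tilde{b}_i, \tilde{b}_{i-1})}{p(\tilde{b}_{i-1} \mid a) \cdot p(a)} \tag{9}$$
where the omitted factor $p(\tilde{b}_i, \tilde{b}_{i-1})$ does not depend on $a$. In this equation, $p(\tilde{b}_{i-1} \mid a)$ can be approached by the GMM as described, and the numerator of the equation can be seen as the transition probability in a Markov process. In particular, after finding the mixtures most likely generating $\tilde{b}_i$ and $\tilde{b}_{i-1}$, we estimate the transition probability $p(a \mid \tilde{b}_i, \tilde{b}_{i-1})$ as:
$$p(a \mid \tilde{b}_i, \tilde{b}_{i-1}) = \frac{\#(a, N_i, N_{i-1})}{\#(N_i, N_{i-1})} \tag{10}$$
where $N_i$ and $N_{i-1}$ are the mixtures most likely generating $\tilde{b}_i$ and $\tilde{b}_{i-1}$ respectively, and $\#(\cdot)$ counts the respective observations. Again, we apply add-one smoothing when estimating the transition probabilities. The previous equations can be easily extended to the case that each position is dependent on more than one position. However, longer dependencies imply fewer observations of each possible transition. As a result, we only test the first and the second-order Markov process below (i.e., dependence on the previous one or two positions).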
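The estimation of the transition probabilities can be sketched in a simplified form. As assumptions for this illustration, every article is already given as a sequence of mixture indices (one per position), transitions are pooled over all positions instead of being estimated per position, and add-one smoothing is applied over the two labels. The function name and the toy data are hypothetical.

```python
from collections import Counter

def transition_probs(sequences, labels, n_mixtures):
    """Estimate p(a=1 | previous mixture, current mixture) with add-one smoothing."""
    biased, total = Counter(), Counter()
    for seq, a in zip(sequences, labels):
        # Consecutive pairs of mixture indices form the first-order transitions.
        for prev, cur in zip(seq, seq[1:]):
            total[(prev, cur)] += 1
            biased[(prev, cur)] += a
    return {(p, c): (biased[(p, c)] + 1) / (total[(p, c)] + 2)
            for p in range(n_mixtures) for c in range(n_mixtures)}

# Hypothetical mixture-index sequences of three articles with k = 3 positions,
# where index 0 denotes a "low bias" and index 1 a "high bias" mixture.
sequences = [[0, 1, 1], [0, 0, 1], [1, 1, 1]]
labels = [1, 0, 1]

probs = transition_probs(sequences, labels, n_mixtures=2)
```

With add-one smoothing, a transition that never occurs in the training data, such as (1, 0) in the toy example, receives a probability of 0.5 rather than an undefined value. The number of possible transitions grows exponentially with the order of the Markov process, which illustrates why longer dependencies quickly become hard to estimate.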

Experiments
This section presents the experiments that we designed to study research questions Q1-Q3 based on the media bias dataset BASIL.

Dataset
To test the hypothesis that sentence-level bias is an important feature for article-level bias detection, we need data that is annotated for both bias levels. Recently, Fan et al. (2019) released a dataset on media bias, Bias Annotation Spans on the Informational Level (BASIL). The dataset contains 300 news articles on 100 events, three each per event. These three articles were taken from Fox News, New York Times, and Huffington Post, which have been selected as a representative of right-oriented, neutral, and left-oriented portals respectively.
On the article level, the dataset comes with manually annotated media bias labels (right, center, or left). While we noticed that more Fox News articles (50) than Huffington Post articles (10) are labeled right, the labels do not only rely on the source of the articles. Since we target bias in general rather than a specific orientation, we merged right and left to the label bias, and see center as neutral. Because both biased and neutral articles stem from all three portals, we can be confident that the task is not detecting the source, but detecting the bias. On the sentence level, each sentence has been manually labeled as having lexical bias, informational bias, or none. According to Fan et al. (2019), lexical bias refers to "how things are said", i.e., the author used polarized or otherwise sentimental words showing bias. On the other hand, sentences with informational bias "convey information tangential or speculative". In our experiments, we consider both settings where we separate the two types of bias and settings where we merge them.

Experiment Settings
In light of our three research questions, we consider the following experiments:
Q1.
To study Q1, we compare two experiment settings of article-level bias detection: (1) with event information being available, and (2) with event information not being available. In both settings, the sizes of the training set (180 articles), validation set (60 articles), and test set (60 articles) are identical. The distribution of labels in each set and setting can be found in Table 1. As can be seen, the article-level labels are almost balanced, with some more biased than neutral articles. According to the distribution in the training set, we choose all-bias as the majority baseline in the later experiments.
As standard feature-based approaches, we employ an SVM and a logistic regression classifier based on word n-grams with n ∈ {1, 2, 3}. The considered n-grams are learned on the training set and lowercased. Hyperparameters such as cost and class balance are optimized on the validation set.
As a standard neural approach, we employ a pretrained uncased BERT model using word embeddings as "features". We fine-tune the model and optimize the number of epochs for fine-tuning on the training and validation set. Only the first 256 and the last 256 words of an article are used for bias prediction, because the maximum sequence length of the BERT model is 512 tokens.
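The trimming of long articles can be sketched as follows. The function name is hypothetical, and the sketch operates on a plain token list. In practice, subword tokenization and the special tokens of BERT also count toward the limit of 512, so the exact cut-off values may have to be somewhat smaller.

```python
def head_tail_truncate(tokens, head=256, tail=256):
    """Keep the first `head` and the last `tail` tokens of a long sequence."""
    if len(tokens) <= head + tail:
        return list(tokens)          # short articles are kept unchanged
    return list(tokens[:head]) + list(tokens[len(tokens) - tail:])

# Hypothetical article of 1,000 tokens, represented by their indices.
article_tokens = list(range(1000))
trimmed = head_tail_truncate(article_tokens)
```

Keeping both ends of an article retains its lead and its conclusion, at the cost of dropping the middle part.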

Q2.
To study Q2, we use the same splitting of articles as used for the w/o event setting above. In the experiments of this research question, we use the ground-truth sentence-level bias from the dataset. Thereby, we investigate the ideal case where the sentence-level bias can be detected perfectly (assuming the manual annotations are correct). The different types of sentence-level bias are also tested to understand if article-level bias is more correlated to a certain type.
We prepare three types of sentence-level bias features, according to the descriptions in Section 4: For bias frequency, we consider a single-feature SVM. We use a linear kernel and optimize its cost hyperparameter on the validation set. For bias positions, we compute the bias probability in each position and then apply either Naïve Bayes, in line with Equation 5, or an SVM. For bias sequences, we use the Markov process from Equation 8 to predict an article-level bias label. Besides, we use the probabilities $p(\tilde{b}_i \mid \tilde{b}_{i-1}, a)$ as features for an SVM. Finally, we also test stacking models. To test the effectiveness of each feature, we stack all three SVMs of each bias feature, as well as any two of the three SVMs as an ablation test.

Q3.
To study Q3, we test our approach in a real-world scenario. We first employ the same features and models as in Q1 for sentence-level bias classification. The only difference between the article-level and the sentence-level setting is that we do not trim sentences for the BERT model. The best classifier is later used in subsequent experiments. The splitting of sentences follows the w/o event splitting in the article-level bias detection, i.e., the sentences of the articles in the training set are used for training, and accordingly for validation and test. The distribution of the different types of sentence-level bias in each set can be found in Table 2.
Given the predicted sentence-level bias from Q1, we test our approaches as in Q2. Also, we test a scenario where the event information is available. Similar to the setting in Q1, we randomly split the articles and then split the sentences according to their article-level splitting. We then train the sentence-level bias classifiers and use the best one for our approach.

Results and Discussion
To answer the three research questions of this paper, we report and discuss the results of the experiments described in Section 5.

Standard Approaches to Bias Detection
Tables 3 and 4 show the results of the experiments for Q1, which address the effectiveness of standard classification approaches in article-level bias detection. With a maximum of 0.55, the accuracy of all classifiers is generally low for a two-class classification task. When event information is available, accuracy improves at least up to 10 percentage points over the baseline, though. When not available, the classifiers seem to learn almost nothing: In the absence of event features, the classifiers are more forced to learn style or structural features. Yet, they turn out not to be able to do so without a proper design of such features. These results suggest that standard approaches are insufficient for article-level bias detection.

Impact of Sentence-Level Bias in General
As regards Q2, the column Acc(GT) of Table 5 shows the accuracy of employing ground-truth sentence-level bias features in predicting article-level bias. The SVM stacking classifier with bias frequency and sequence (F+S) performs best with an accuracy of 0.67. Stacking all features (F+P+S) achieves the same accuracy. In general, all feature and classifier combinations outperform all approaches found in Table 4.
Table 5: Accuracy of all evaluated combinations of features and classifiers in article-level bias detection based on ground-truth (GT) and predicted (Pr) sentence-level bias. F combines absolute ($f_{abs}$) and relative ($f_{rel}$) bias frequency, P stands for bias position, and S for bias sequence. The best value for each bias type is marked bold. The best values overall are marked with *.
Among the features for sentence-level bias, bias frequency and bias position can be exploited best by the SVM. While bias sequence does not perform as well as the others, the stacking classifier using it yields the highest effectiveness. The bias sequence appears to be weakest and sometimes brings negative impact to the performance. However, there may be several reasons behind it. For example, the sequential features may be too subtle, such that our models (SVM and Markov process) are too sensitive to the tiny changes in the features. But, it may also be that a smarter combination strategy for the three different types of feature is required; to keep the models simple, we tested only stacking. On the single features, the results show that an SVM is not always the best choice to utilize the features. In particular, Naïve Bayes and Markov process work better when dealing with informational bias and any bias.
Next, we take a closer look at the stacking part of Table 5, to analyze the features' effectiveness. When using lexically biased sentences as features, the frequency features contribute more (combinations in stacking with F achieve the best results). On the other hand, when using informationally biased sentences as features, the sequential features are more important. In other words, to detect article bias, it is important to know the number of lexically biased sentences as well as the order of informationally biased sentences. Our interpretation is that the existence of lexical bias is already a strong clue for presenting bias, whereas informational bias has to be conveyed in a certain order or writing strategy (and thus is more difficult to capture).
Regarding the two types of sentence-level bias, the best results are observed for any bias. Using only informational bias leads to the lowest effectiveness. While there is more informational than lexical bias, as shown in Table 2, the classifiers seem to rely more on lexical bias. The reason could be that lexical bias is easier to capture (by the word usage), while informational bias clues, if any, are subtle. Still, including both types of bias (but not distinguishing them) works best.

Impact of Predicted Sentence-Level Bias
Regarding Q3, we first present the results of applying the standard approaches to sentence-level bias detection in Table 6. Besides accuracy, we also show precision, since a high precision boosts the confidence in predicting sentence-level bias. We expect precision to be more important than recall, since we use the predicted bias for computing the article-level bias features. We find that fine-tuned BERT is strongest in effectiveness. Matching intuition, predicting lexical bias seems much easier than predicting informational bias. Since Fan et al. (2019) provide their results of using BERT on sentence-level bias classification, we also used BERT for comparison. To this end, we split the dataset into sets of the same size as Fan et al. (randomly with 6819 training, 758 validation, and 400 test instances). However, the actual distribution of labels is not provided by the authors. As shown in Table 7, the results of our reimplementation for predicting informational bias are comparable to their results (in terms of $F_1$-score), but they are much worse for predicting lexical bias. Note that lexical bias in the dataset is rather rare (478/7984 ≈ 6%). We thus assume that the difference between our and the original test set caused the difference.
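Why accuracy alone says little for a rare class can be illustrated with a small sketch. The label distribution below only mimics the share of about 6% reported for lexical bias; the predictions and function names are made up and do not reflect any result from the tables.

```python
def accuracy(gold, pred):
    """Share of instances whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def precision(gold, pred, positive=1):
    """Share of instances predicted as positive that are truly positive."""
    gold_of_predicted = [g for g, p in zip(gold, pred) if p == positive]
    if not gold_of_predicted:
        return 0.0   # convention used here when nothing is predicted as positive
    return sum(g == positive for g in gold_of_predicted) / len(gold_of_predicted)

# Hypothetical test set: 94 neutral sentences followed by 6 biased ones.
gold = [0] * 94 + [1] * 6
# Classifier A never predicts bias.
pred_a = [0] * 100
# Classifier B finds 3 of the 6 biased sentences but raises 4 false alarms.
pred_b = [0] * 90 + [1] * 4 + [1] * 3 + [0] * 3

acc_a, prec_a = accuracy(gold, pred_a), precision(gold, pred_a)
acc_b, prec_b = accuracy(gold, pred_b), precision(gold, pred_b)
```

In this toy setting, classifier A reaches the higher accuracy although it provides no bias predictions at all, whereas precision reveals how reliable the predicted bias labels of classifier B are as input for the article-level features.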
We used the predictions of the best sentence-level bias classifier (i.e., BERT) to compute the bias features. The resulting effectiveness in article-level bias detection can be found in column Acc(Pr) of Table 5. Comparing these results to those obtained for Q2, we see a clear drop in effectiveness when using only lexical bias or only informational bias. Interestingly, however, the best configuration (absolute bias frequency, $f_{abs}$, with an SVM on any bias) is as good as the best one for Q2. This means that using the predicted bias can sometimes be as good as or better than using ground-truth bias. We explain this by the fact that sentence-level bias classifiers are deterministic while human annotators may not be, which can help our approaches to learn more stable patterns in the features.
Overall, our approaches with sentence-level bias information clearly outperform the standard approaches, underlining the impact of our approach. With an accuracy of 0.67, we outperform the standard approaches (0.53) by 14 points and the all-bias baseline (0.52) by 15 points. Regarding the different types of bias, the bias frequency is still the best feature, while the bias position and the bias sequence are weaker. The stacking model is the most effective in general.
Finally, we also considered the case where event information is available, as in Table 3. We followed the same process: we selected the best sentence-level bias classifier, which is again BERT with 0.83 accuracy and 0.58 precision, and used it to generate the article-level bias features. Similar to the results in Table 5, the best classifier is an SVM on absolute bias frequency. We achieve an accuracy of 0.60, outperforming the baseline (0.45) again by 15 points. These results demonstrate that our approach can achieve high effectiveness robustly, regardless of whether it can exploit event information or not.

Hyperparameters
To deepen insights and to simplify reproducibility, this section discusses important hyperparameters used in the experiments.
Bias Normalization
In the bias position and bias sequence features, the first step is to normalize the length of the bias annotations. Interestingly, the best sampling method varies across settings. Specifically, last is best for bias position with Naïve Bayes, average for bias position with SVM, maximum for bias sequence with Markov process, and last for bias sequence with Naïve Bayes.
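The length normalization with the three sampling methods could be sketched as follows. This is an illustration under assumptions, not the implementation used in the experiments: the function name `normalize_bias`, the splitting of an article into contiguous segments of nearly equal size, and the reuse of sentences when an article has fewer sentences than positions are all choices made here for the example.

```python
def normalize_bias(labels, num_positions, method="last"):
    """Map per-sentence bias labels (0/1) to a fixed number of positions.

    Assumption: the article is split into `num_positions` contiguous
    segments, and each segment is summarized by the chosen sampling method.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    n = len(labels)
    normalized = []
    for i in range(num_positions):
        start = i * n // num_positions
        # Guarantee a non-empty segment, also when n < num_positions.
        end = max(start + 1, (i + 1) * n // num_positions)
        segment = labels[start:end]
        if method == "last":
            normalized.append(segment[-1])
        elif method == "average":
            normalized.append(sum(segment) / len(segment))
        elif method == "maximum":
            normalized.append(max(segment))
        else:
            raise ValueError(f"unknown sampling method: {method}")
    return normalized


# Hypothetical article with six sentences, normalized to three positions.
sentence_bias = [1, 0, 0, 0, 1, 1]
for name in ("last", "average", "maximum"):
    print(name, normalize_bias(sentence_bias, 3, name))
```

In this toy article, the first segment contains one biased and one unbiased sentence, so the three methods disagree on it. This kind of disagreement is a plausible reason why the best sampling method differs between classifiers.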

Number of Normalized Positions
We tested the number of positions needed for the bias position and bias sequence features. This number roughly corresponds to how many bias clues an article contains. We find that the best value on the validation set differs between settings. In summary, we determine 10 for bias position with Naïve Bayes, 3 for bias position with SVM, 10 for bias sequence with Markov process, and 8 for bias sequence with Naïve Bayes.

Number of Gaussian Mixtures
The number of Gaussian mixtures indicates the variability of the bias distribution at a single position. We find that the best number of mixtures is 3 for bias position with SVM, and 5 for the other settings. While this value also depends on the number of datapoints, it suggests that 5 mixtures is a reasonable setting in general.
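To make the role of the number of mixtures more concrete, the sketch below evaluates a one-dimensional Gaussian mixture density and compares the mean log-likelihood of a single Gaussian with that of a two-component mixture. All values, parameters, and function names (`gaussian_pdf`, `mixture_pdf`, `mean_log_likelihood`) are hypothetical and hand-picked for illustration; nothing is fitted here, and the numbers are not results from the experiments.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, weights, means, stds):
    """Density of a Gaussian mixture: weighted sum of component densities."""
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("mixture weights must sum to 1")
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds))


def mean_log_likelihood(values, weights, means, stds):
    """Average log-density of the values under the mixture."""
    return sum(math.log(mixture_pdf(v, weights, means, stds)) for v in values) / len(values)


# Hypothetical bias values observed at one normalized position across articles:
# some articles are rarely biased there, others almost always.
values = [0.0, 0.05, 0.1, 0.9, 0.95, 1.0]

# One component cannot capture both clusters; two components can.
single_ll = mean_log_likelihood(values, [1.0], [0.5], [0.45])
mixture_ll = mean_log_likelihood(values, [0.5, 0.5], [0.05, 0.95], [0.05, 0.05])
print(single_ll, mixture_ll)
```

With more components, a mixture can describe more varied bias behavior at a position, but each component adds parameters that have to be estimated from the same limited number of articles, which is consistent with selecting the number of mixtures on the validation set.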

Order of the Markov Process
We tested the order of the Markov process in Equation 8. We find that a first-order Markov process (a position depends on the previous position only) is best. As discussed, longer dependencies require more datapoints to estimate the transition probabilities reliably. Given the size of our dataset (300 articles, 180 of which form the training set), a second-order or higher-order Markov process is not sensible in our case.
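The data requirements of higher orders can be illustrated with a small sketch that estimates transition probabilities from binary bias sequences. It is not the model of Equation 8: binary labels, add-alpha smoothing, and the function name `fit_markov` are assumptions made for this example.

```python
from collections import defaultdict
from itertools import product


def fit_markov(sequences, order=1, alpha=1.0):
    """Estimate P(next = 1 | previous `order` labels) from binary sequences.

    Assumption: add-alpha smoothing, so unseen contexts get probability 0.5.
    """
    counts = defaultdict(lambda: [0, 0])  # context -> [count of next=0, count of next=1]
    for seq in sequences:
        for i in range(order, len(seq)):
            context = tuple(seq[i - order:i])
            counts[context][seq[i]] += 1
    model = {}
    # Enumerate every possible context, including those never observed.
    for context in product((0, 1), repeat=order):
        n0, n1 = counts[context]
        model[context] = (n1 + alpha) / (n0 + n1 + 2 * alpha)
    return model


# Two hypothetical normalized bias sequences.
train = [[0, 0, 1], [0, 1, 1]]
first_order = fit_markov(train, order=1)
print(first_order)

# The number of contexts to estimate doubles with every additional order.
for order in (1, 2, 3):
    print(order, len(fit_markov(train, order=order)))
```

Since the number of contexts grows as 2 to the power of the order while the number of training articles stays fixed at 180, each context is observed less often as the order increases, so its estimate becomes less reliable.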

Conclusion
In this paper, we have given evidence that exploiting low-level lexical information is insufficient to detect article-level bias, especially if the dataset is small. To provide a complete picture, we have formulated three research questions related to article-level bias detection, in order (1) to assess the state of the art of event-dependent and event-independent bias prediction, (2) to learn about the relation between sentence-level and article-level bias, and (3) to study whether sentence-level bias can be leveraged to predict article-level bias.
To tackle the detection of article-level bias, we have proposed and analyzed derived (second-order) bias features, including bias frequency, bias position, and bias sequence. As a main result of our research, we have shown that this new approach clearly outperforms the best approaches existing so far.
If bias detection can be done sufficiently robustly at the article level, we envisage, as a line of future research, the development of "reformulation" strategies and algorithms for the task of neutralizing biased articles (Pryzant et al., 2020).