Context in Informational Bias Detection

Informational bias is bias conveyed through sentences or clauses that provide tangential, speculative or background information that can sway readers' opinions towards entities. By nature, informational bias is context-dependent, but previous work on informational bias detection has not explored the role of context beyond the sentence. In this paper, we explore four kinds of context for informational bias in English news articles: neighboring sentences, the full article, articles on the same event from other news publishers, and articles from the same domain (but potentially different events). We find that integrating event context improves classification performance over a very strong baseline. In addition, we perform the first error analysis of models on this task. We find that the best-performing context-inclusive model outperforms the baseline on longer sentences, and sentences from politically centrist articles.


Introduction
Informational bias is conveyed through sentences or clauses that provide tangential, speculative or background information that can sway readers' opinions towards entities (Fan et al., 2019). A natural place to look for informational bias is in news texts, where journalists use background information to place newsworthy events in a broader context. Examples of informational bias include quotations of opinions from third parties about the target entity, allusions to what may have motivated the target entity to act as they did, and mentions of previous statements and actions of the same entity.
What separates informational bias from other kinds of bias is that it can be expressed in a completely factual and neutral way. While some instances of bias are recognisable outside of their context (e.g. quotations from third parties that contain opinions), others are mere statements of fact that do not raise suspicions of bias outside of their context. Consider example instance 1.3 in Table 1. This sentence contains no subjective language. Seen on its own, it is simply stating a fact. However, human annotators judged that it is a case of informational bias (Fan et al., 2019). This is because this particular fact reflects positively on the target entity Mike Huckabee in the context of an announcement that he is running for president. Note that one can also imagine contexts where the implication is negative. An example would be a discussion of a disconnect between older Republican candidates and a new generation of more progressive voters. The fact that instances of informational bias can be superficially neutral and are context-dependent makes informational bias detection an exceptionally challenging task, which furthermore has a short research history and few available relevant resources.
While previous work has performed informational bias detection on a token level and a sentence level, we are the first to involve context beyond the sentence. We integrate four kinds of context: neighboring sentences (direct textual context), the full article (article context), articles on the same event (event context) and articles from the same domain (domain context).[1] Our model for leveraging event context significantly improves over the baseline. In an error analysis, we find that the best-performing context-inclusive model outperforms the baseline on long sentences and sentences from politically centrist news articles.

This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.

[1] Code and results available at: github.com/vdenberg/context-in-informational-bias-detection.

Table 1: Example sentences from Fan et al. (2019) with their informational bias label (Inf), news source (Src) and BASIL ID. Note that examples 2.3 and 3.2 look superficially as if they contain explicit sentiment, but are labeled as informational bias as they are only introduced as quotes into the news article.
Our contributions are:
1. The first systematic study of the impact of including context in informational bias detection.
2. The best-performing informational bias detection system for English thus far, tested on the BASIL corpus (Fan et al., 2019).
3. The first error analysis of informational bias detection systems, which identifies strengths and weaknesses of models with and without awareness of context.

Related work
Framing. Informational bias can be considered a type of framing with a focus on entities. The framing of entities has been studied for the construction of the English BASIL corpus of lexical and informational bias (Fan et al., 2019), which is the corpus our models are tested on. Fan et al. (2019) emphasize that, unlike more commonly studied kinds of bias, informational bias label assignments depend very heavily on context. The sentence-level and span-level BASIL annotations were thus provided by human annotators who saw sentences in their article context. However, the computational models in Fan et al. (2019), which are based on BERT (Devlin et al., 2018), only treat sentences in isolation. Framing of entities is also studied by Card et al. (2016), who examined how English-language news frames events through casts of characters, and van den Berg et al. (2020), who studied the effect of naming and titling on the perception of entities in English and German. Most framing research focuses not on entities but on the framing of topics and events. The study of topic framing in news has a long history in social science (Entman, 1993; Berinsky and Kinder, 2006; Baumgartner et al., 2008; Gentzkow and Shapiro, 2010) and has begun to attract attention from the natural language processing community (Tsur et al., 2015; Fulgoni et al., 2016; Field et al., 2018; Baumer et al., 2015). For topic framing research there exist the Media Frames Corpus of news annotated for the framing of same-sex marriage, smoking, and immigration (Card et al., 2015) and the Gun Violence Frame Corpus (Liu et al., 2019a), annotated for framing in news on gun violence. Computational analysis and classification experiments have been done on framing in Russian news (Field et al., 2018), on detecting frames in English headlines (Liu et al., 2019a; Chen et al., 2018), and on detecting frames in a multi-label, multi-lingual setting (Akyürek et al., 2020).
Subjectivity. Related work also includes work on implicit sentiment through syntactic structures (Greene and Resnik, 2009) and partisan phrases (Yano et al., 2010), work on explicit stance and subjective language (Recasens et al., 2013; Pang et al., 2008; Wiebe et al., 2004; Hube and Fetahu, 2019), and work on the classification of documents or news outlets into leanings or ideologies (Iyyer et al., 2014). The difference between these various kinds of subjective language and informational bias is that the former exclude neutral and objective language. In framing research, any text that could lead an impartial third party to recognise a non-neutral viewpoint towards a topic or entity can be said to contain framing, even if it is objective and neutral in tone.
Approaches. Sentence-level classification tasks have seen great increases in performance through the use of pre-trained language models (PLMs) such as BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019b). By further pre-training RoBERTa on domain-specific and task-specific datasets, Gururangan et al. (2020) made it possible to perform sentence-level classification using models that have been exposed to domain context. In another line of work, several methods have been developed to allow PLMs to take larger sequences than sentences as their input (Pappagari et al., 2019; Adhikari et al., 2019). Of these, only one specifically performs sequential sentence classification, i.e. the task of providing labels for each of the sentences in a multi-sentence input (Cohan et al., 2019). There exist non-PLM approaches to sequential sentence classification as well. These consist of hierarchical sequence encoders with a final CRF layer (Dernoncourt and Lee, 2017; Jin and Szolovits, 2018) and a BiLSTM-based approach that contextualises Universal Sentence Encoder embeddings and also integrates information that is specific to the domain of movie plot synopses (Papalampidi et al., 2019). None of these techniques have previously been applied to informational bias detection.

Method
We experiment with different kinds of context to assess which of them are helpful for informational bias detection. We define four types of context: direct textual context, article context, event context and domain context.
Direct textual context. Direct textual context consists of the directly neighboring sentences around the target sentence. These may be helpful for disambiguating sentences with multiple possible interpretations, for noticing patterns in the type of content preceding and following instances of informational bias, and for noticing when a target sentence is part of a multi-sentence quote.
Article context. Article context consists of the full news article that the target sentence appears in. The article may be helpful for e.g. establishing the topic of the target sentence, what type of article it is from, and whether the target sentence is an outlier compared to the rest of the article.
Event context. Event context consists of news articles that cover the same newsworthy event or topic as the article the target sentence appears in. They might appear in the same or different news outlets. Access to event context may help a model notice when an article takes a unique stance on a topic or mentions information that is absent from other articles.
Domain context. Domain context consists of the domain of news articles that the target sentence is a part of. Domain refers to a group of texts with shared lexical and structural properties, including register, topic and platform of publication. Domain context is therefore more general than event context. Following Gururangan et al. (2020), we consider the population of articles from which a corpus has been sampled a domain in its own right. Possible benefits of domain awareness when detecting informational bias include the ability to notice typical journalistic strategies for framing entities without attracting accusations of bias, the ability to distinguish between news outlets' differing styles and ideologies, and increased awareness of domain-specific connotations of words and phrases (e.g. "leading in the polls" or "declined to comment").

Figure 1: Diagram of the Context-Inclusive Model. In Stage 1, sentence embeddings are obtained by encoding word sequences using a pre-trained language model. In Stage 2, sequences of sentences from a pre-determined context (article or event) are encoded by as many BiLSTMs as there are documents (only one shown in the diagram). The resulting context representations are concatenated and classified together with the target sentence representation to obtain a sentence-level prediction.

Data & Task
We use the only existing dataset with annotations of informational bias: the aforementioned BASIL corpus. This corpus contains 100 triples of news articles by Fox News (FOX), the New York Times (NYT) and the Huffington Post (HPO), each triple covering the same event. The dataset consists of 7,984 sentences. During pre-processing, we remove 7 empty instances (sentences of length zero) from the corpus, leaving 7,977 instances. The corpus provides span-level annotations, which can be used for token classification, or for binary sentence classification by converting them into sentence labels. A sentence is labeled as biased if it contains at least one span with bias. The total number of sentences with informational bias amounts to 1,221. Some example sentences from this corpus are given in Table 1.
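The conversion from span annotations to binary sentence labels can be sketched as follows (a minimal illustration using character offsets; the actual BASIL annotation format may differ):

```python
def sentence_labels(sentences, bias_spans):
    """Convert span-level bias annotations to binary sentence labels.

    sentences:  list of (start, end) character offsets, one per sentence.
    bias_spans: list of (start, end) character offsets of annotated
                informational-bias spans.
    A sentence is labeled 1 (biased) if it overlaps at least one span.
    """
    labels = []
    for s_start, s_end in sentences:
        biased = any(b_start < s_end and b_end > s_start
                     for b_start, b_end in bias_spans)
        labels.append(int(biased))
    return labels
```

For example, `sentence_labels([(0, 10), (10, 25)], [(12, 18)])` labels only the second sentence as biased.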
The context-inclusion experiments in this work are only performed on sentence classification, as several of the proposed models are more suited for this task. For completeness, we also provide baseline results for token classification.

Approaches
Baseline. We compare our context-inclusive models to BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019b) for binary sentence classification. These are both powerful transformer language models pre-trained on large amounts of data and proven to be effective for low-resource tasks. They take as input single sentences and output labels. They thus do not consider the context which sentences appear in, but are optimised to make excellent use of any cues contained within the sentence.
Direct textual context. To involve direct textual context, we use a Windowed Sequential Sentence Classification method (WinSSC). Like Cohan et al. (2019)'s method of using pre-trained language models for sequential sentence classification, WinSSC takes multiple sentences as its input sequence, generates embeddings for the separator tokens in the sequence, and classifies these embeddings with a linear layer that outputs as many labels as there are sentences in the input sequence. Prior to embedding, sequences are book-ended with the last sentence from the previous sequence and the first sentence of the next sequence. These book-ends, which are ignored during evaluation, ensure that each sentence in the sequence has context at both ends, thus mitigating loss of information along the edges of sequences. This is important when segmenting news articles, as they tend to be long enough to require segmentation into several sections to avoid memory problems. We experiment with sections of 5 and 10 sentences to assess the effect of changing section sizes, and we compare our WinSSC method to the non-windowed SSC method from Cohan et al. (2019).
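The windowing scheme can be sketched as follows (a simplified illustration; the handling of separator tokens in the actual model may differ):

```python
def windowed_sections(sentences, max_len=5):
    """Split an article into sections of at most `max_len` sentences,
    book-ended with the last sentence of the previous section and the
    first sentence of the next one.

    Returns a list of (section, evaluation_mask) pairs, where the mask
    is True only for the core sentences, whose labels count during
    evaluation; book-end positions are masked out.
    """
    cores = [sentences[i:i + max_len]
             for i in range(0, len(sentences), max_len)]
    sections = []
    for idx, core in enumerate(cores):
        left = [cores[idx - 1][-1]] if idx > 0 else []
        right = [cores[idx + 1][0]] if idx < len(cores) - 1 else []
        section = left + core + right
        mask = [False] * len(left) + [True] * len(core) + [False] * len(right)
        sections.append((section, mask))
    return sections
```

A seven-sentence article with `max_len=5` thus yields two sections: the first extended with the first sentence of the second section, and the second extended with the last sentence of the first.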
Article context and event context. We integrate article context and event context by means of an Article Context-Inclusive Model (ArtCIM) and Event Context-Inclusive Model (EvCIM). Inspired by the Context Aware Model in Papalampidi et al. (2019), the Context-Inclusive Model uses a Bidirectional Long Short-Term Memory (BiLSTM; Hochreiter and Schmidhuber 1997) to encode news documents. In the case of ArtCIM, a single BiLSTM encodes the article. In the case of EvCIM, three BiLSTMs encode each document in the triple of Fox News, New York Times and Huffington Post articles on the same event. Sentence representations for the target sentence as well as for the input to the BiLSTMs are obtained by taking the average of the last four layers of fine-tuned base RoBERTa (Figure 1, Stage 1). We found this to be more effective than other kinds of pooling, and also more effective than sentence USE embeddings (Cer et al., 2018) or Sentence-Bert embeddings (Reimers and Gurevych, 2019). BiLSTMs then encode a context representation of the article the target sentence appears in (ArtCIM), or of each of the three articles on the same event (EvCIM). At the final stage, the encodings of the target sentence and the context documents are concatenated and passed to a linear classifier (Figure 1, Stage 2). Classification is thus based both on the content of the target sentence, which the baseline captures very well, and the article or event context, which the baseline has no access to.
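Stage 2 of the model can be sketched roughly as follows in PyTorch. Hidden sizes and the mean-pooling of BiLSTM outputs into a single context vector are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class EventCIM(nn.Module):
    """Sketch of the Event Context-Inclusive Model (Stage 2).

    Inputs are sentence embeddings from Stage 1 (e.g. the average of
    RoBERTa's last four layers).
    """
    def __init__(self, emb_dim=768, hidden=128, n_docs=3):
        super().__init__()
        # One BiLSTM per document in the event triple (FOX, NYT, HPO).
        self.encoders = nn.ModuleList(
            nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
            for _ in range(n_docs))
        # Target sentence plus one 2*hidden context vector per document.
        self.classifier = nn.Linear(emb_dim + n_docs * 2 * hidden, 2)

    def forward(self, target_emb, docs):
        # target_emb: (batch, emb_dim)
        # docs: list of n_docs tensors, each (batch, n_sentences, emb_dim)
        contexts = []
        for encoder, doc in zip(self.encoders, docs):
            out, _ = encoder(doc)              # (batch, n_sentences, 2*hidden)
            contexts.append(out.mean(dim=1))   # pool over sentences
        features = torch.cat([target_emb] + contexts, dim=-1)
        return self.classifier(features)       # binary bias logits
```

A forward pass with a batch of two target sentences and three four-sentence context documents yields logits of shape (2, 2).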
Domain context. To integrate domain context, we apply domain-adapted RoBERTa from Gururangan et al. (2020) that has been trained on news data (DAPT), a task-adapted version of RoBERTa that has been trained on the BASIL data (TAPT), or both (DAPT-TAPT). Additionally, we experiment with including domain context by concatenating an embedding representing the source of an article (FOX, NYT or HPO) as a feature in the CIM setting (at Stage 2 in Figure 1) (ArtCIM* and EvCIM*).

Set-Up
Previous work on the BASIL corpus has split data by dividing sentences across a training, development and test set (Fan et al., 2019). This type of split isolates target sentences from other sentences in the same article, and from other articles covering the same event. Distributing sentences across set types in this way is contrary to the goal of this work, which is to consider sentences within their context. In addition, distributing sentences from the same article across training and test data can be considered a type of leakage, as knowing of some sentences in an article that they are biased might help recognise similar sentences from the same article or another article on the same topic. Our setting, in which triples of articles appear either during training or during testing but not both, resembles a more realistic setting, in which the hypothetical user of an informational bias annotation system wants to identify bias in new articles on new events. We use the split with sentences distributed across train and non-train sections (the Sentence split) only to report baseline results, for consistency and comparability with Fan et al. (2019). This split consists of 7,123 training instances, 408 development instances and 404 test instances. To test context-inclusive methods, we use a 10-fold cross-validation setting where stories (triples of articles) never appear in both a train and a non-train section. Sizes of folds in this Story split vary slightly because of variation in the length of articles. Each fold consists of around 6,400 sentences designated for training, 780 for development and 790 for testing. All methods are tested 5 times with different random seeds. We report precision, recall and balanced F-measure (with standard deviation across seeds) for the positive (biased) class, and test significance of differences in performance with an independent t-test. Further training details are provided in Appendix A.

Table 3: Results of integrating direct textual context with a Sequential Sentence Classifier without (SSC; Cohan et al., 2019) or with a window (WinSSC), and a maximum sequence length of 5 or 10.
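A story-level split of this kind can be sketched with scikit-learn's GroupKFold, using story IDs as groups (a simplified illustration that ignores the development sections):

```python
from sklearn.model_selection import GroupKFold

def story_split_folds(sentences, story_ids, n_folds=10):
    """Yield (train_idx, test_idx) pairs such that the sentences of one
    story (a triple of articles on the same event) fall entirely in
    either the train or the test section, never both.
    """
    gkf = GroupKFold(n_splits=n_folds)
    for train_idx, test_idx in gkf.split(sentences, groups=story_ids):
        # Sanity check: no story straddles the train/test boundary.
        train_stories = {story_ids[i] for i in train_idx}
        test_stories = {story_ids[i] for i in test_idx}
        assert not train_stories & test_stories
        yield train_idx, test_idx
```

With 100 stories and 10 folds, each fold holds out the sentences of 10 complete stories for testing.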

Baseline
To establish a baseline that classifies sentences in isolation, we fine-tune the language models BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019b) on the BASIL corpus. For comparison with Fan et al. (2019), we fine-tune both to perform token classification and sentence classification, on both the Sentence split and the Story split.
In line with the prediction that the Sentence split introduces leakage from test into training data, performance is several F1-score points higher on the Sentence split than on the Story split in all settings ( Table 2). The difference is largest for sentence classification with RoBERTa (F1=49.89 on the Sentence split and F1=42.16 on the Story split).
In line with observations in Fan et al. (2019), we also observe that performance is lower for token classification than for sentence classification (F1=29.86 versus F1=42.16 for RoBERTa on the Story split).
We observe large improvements in performance of RoBERTa over BERT in all settings. The best previously reported sentence classification performance is F1=43.27, achieved by BERT on a Sentence split (Fan et al., 2019). In our set-up, using our seeds, BERT performance stands at F1=38.26 on the Sentence split, whereas RoBERTa reaches F1=49.89. On the Story split the difference is also large: from F1=37 for BERT to F1=42.16 for RoBERTa.

Direct Textual Context
We experiment with integrating direct textual context by comparing two methods of sequential sentence classification (SSC and the novel WinSSC) to the best-performing baseline sentence classifier. We find that performance decreases when direct context is introduced in this manner (Table 3). Increasing the length of the sequence from 5 to 10 does not aid performance of either the non-windowed SSC model (F1=38.19 to F1=38.22) or the WinSSC model (F1=38.67 to F1=37.44). It is likely that data sparsity is at fault here. When performing 10-fold cross-validation with the maximum sequence length set to 5, the number of sequences for training averages around 1,654 per iteration. With the maximum sequence length set to 10, this drops further to 856 sequences. This is likely too small a number of sequences for the models to generalize.

Table 5: Results of domain-adapted, task-adapted and domain-and-task-adapted RoBERTa.

Article Context & Event Context
We use the Context-Inclusive Model to perform classification based on encodings of the target sentence and either only the news article the target sentence appears in (ArtCIM) or each member of the triple of articles covering the same event (EvCIM). As an additional experiment, we provide ArtCIM and EvCIM with a representation of the news source (ArtCIM* and EvCIM*). Each of these models performs at least as well as the baseline. While RoBERTa achieves the best precision, all CIM models achieve much higher recall. In terms of F1-score, EvCIM performs significantly better than the baseline (p < .001), as does EvCIM* (p = .004).

Domain Context
We test the performance of three adaptations of RoBERTa: domain-adapted to news (DAPT), task-adapted for informational bias detection on the BASIL corpus (TAPT), and domain-and-task-adapted (DAPT-TAPT). The DAPT and DAPT-TAPT models do not outperform the baseline, but the task-adapted model does (Table 5), although not significantly. These results echo the finding in Gururangan et al. (2020) that domain-adaptation and domain-and-task-adaptation are not as helpful in the news domain as in other domains. RoBERTa has likely already seen a sufficient amount of news-domain training data during pre-training (Liu et al., 2019b), making further domain-specific pre-training only marginally helpful as long as the domain is kept as general as news.

Error Analysis and Discussion
To investigate whether the best-performing context-inclusive model (EvCIM) improves over the baseline by, in fact, leveraging context, we analyze dependence of performance on factors that we suspect influence the need for context when detecting informational bias. The factors that we consider are sentence length, the presence of quotes, the political leaning of the source article and the presence of subjective language.
Concretely, we expect that more context is needed, and therefore higher gains of EvCIM over the baseline can be expected, for sentences with the following characteristics:
1. They are long, and consequently contain a more complex message that requires more knowledge to interpret (e.g. 2.2 in Table 1, in contrast to 2.3).
2. They are not part of a quote. Quoting patterns have been shown to introduce bias in news (Niculae et al., 2015) and informational bias in particular (Fan et al., 2019), as they maintain an air of neutrality on the part of the journalist. A simple model without context might just pick up on quotation marks as a clue for informational bias.
3. They are from an article with a centrist political leaning (e.g. 1.2 and 3.3 in Table 1).
4. They do not contain any subjective language (e.g. 1.3 in Table 1, in contrast to 2.3).

Sentence Length
We examine whether EvCIM outperforms the baseline on longer sentences by partitioning data into bins corresponding to quartiles of sentence length (number of tokens). We then compare the F1-score computed on predictions by RoBERTa and EvCIM. We observe that, as suspected, EvCIM does not outperform the baseline significantly on the shorter sentences, i.e. on the first and second quartile (Table 6). On the longer sentences in the third and fourth quartile, however, EvCIM outperforms the baseline significantly (from F1=39.61 to F1=42.14 and from F1=44.10 to F1=47.79).
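The binning analysis can be sketched as follows (function and variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import f1_score

def f1_by_length_quartile(lengths, y_true, pred_a, pred_b):
    """Compare two systems' positive-class F1 within bins that
    correspond to quartiles of sentence length (in tokens)."""
    lengths = np.asarray(lengths)
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    edges = np.quantile(lengths, [0.25, 0.5, 0.75])
    bins = np.digitize(lengths, edges)  # 0..3 = quartile index
    results = []
    for q in range(4):
        mask = bins == q
        results.append(
            (f1_score(y_true[mask], pred_a[mask], zero_division=0),
             f1_score(y_true[mask], pred_b[mask], zero_division=0)))
    return results
```

The returned list pairs the two systems' F1-scores per quartile, so per-bin differences can be read off directly.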

Quotes
Quoting patterns have been shown to introduce bias in news (Niculae et al., 2015), and informational bias in particular (Fan et al., 2019), by introducing opinions through a third-party proxy. We predict that neural approaches notice this relationship and rely on it to some degree to make their predictions. The BASIL corpus contains annotations that specify for each instance of informational bias whether it is part of a quote or not. We can therefore analyze differences in recall of informational bias inside and outside of quotes. Table 7 shows that both models have considerably better recall of informational bias in quotes. We predicted that the baseline would have an easier time with quotes, and that the gains of EvCIM with respect to the baseline would be higher on non-quotes. EvCIM outperforms the baseline in terms of recall by a higher margin on non-quotes than on quotes, but it outperforms it significantly on both.

Political Leaning of the Source
To examine whether bias is easier to detect in newspapers and articles with a more pronounced political leaning, we first compare performance on the different news publishers represented in the BASIL corpus. According to Budak et al. (2016), FOX is strongly right-leaning, NYT slightly left-leaning, and HPO strongly left-leaning, meaning NYT is the most centrist of the three. We observe that EvCIM significantly outperforms the baseline for all publishers, and that both models perform much better on Fox News articles compared to the other publishers (Table 8). Fan et al. (2019) have stated that there are differences in the polarity and target of biased sentences in the three news sources included in the BASIL corpus.
The RoBERTa and EvCIM systems may be capitalizing on these and other differences to make better predictions for Fox News articles. In addition to providing the news source for each article, the BASIL corpus also provides article-level annotations of the political leaning of the article, as determined by a human annotator. These annotations show that although publishers lean towards a certain side of the political spectrum, they also each publish a large number of centrist articles (Table 10). When using these human annotations of leaning to compare performance on centrist and non-centrist articles, we find that EvCIM does not outperform the baseline significantly on right-leaning articles, but does outperform it significantly on left-leaning articles (from F1=40.68 to F1=42.58), and especially on centrist articles (from F1=42.66 to F1=45.50), supporting the notion that classification of these articles benefits more from access to event context.

Subjective language
We suspect that EvCIM will make larger gains on sentences without subjective language. The BASIL corpus contains annotations of lexical bias, i.e. bias through word choice, that can be used to investigate whether informational bias detection is helped by the presence of lexical bias. According to the BASIL annotation protocol, sentences contain lexical bias if the annotator found their opinion to be swayed by the choice of words. According to the annotators, this was the case for only 448 sentences, and informational bias was less likely to occur in sentences with lexical bias (9.82%) than in sentences without lexical bias (15.63%). As an alternative measure of subjectivity, we count the sentences that contain at least one strongly subjective clue from the MPQA Subjectivity Lexicon (Wilson et al., 2005). We find that this number is higher: 2,415 instances, and informational bias was more likely to occur in sentences with subjectivity (18.92%) than in sentences without subjectivity (13.74%) (Table 12). We suspect that the latter numbers are a more realistic assessment of the amount of subjective language in the BASIL corpus.
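The lexicon-based check can be sketched as follows (a simplified illustration: the real MPQA lexicon also encodes part-of-speech and stemming information, which this sketch ignores):

```python
def has_strong_subjective_clue(tokens, strong_subjective):
    """Return True if the sentence contains at least one strongly
    subjective clue. `strong_subjective` stands in for the strong
    subjectivity entries of a lexicon such as MPQA, reduced here to
    a set of lowercased words.
    """
    return any(token.lower() in strong_subjective for token in tokens)
```

Applying this flag to every sentence in the corpus partitions it into the subjective and non-subjective subsets compared below.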
When comparing performance on instances with and without subjective language using the subjectivity lexicon, we observe only a small difference in improvement: EvCIM outperforms the baseline both on items without subjective language (from F1=40.62 to F1=41.97) and on items with subjective language (from F1=43.01 to F1=45.27).

Conclusion
We explore the impact of including four kinds of context in informational bias detection. We integrate direct textual context, article context, context from other articles on the same event (event context), and domain context into sentence classification methods and test performance on the BASIL corpus of informational bias. We find that direct textual context and domain context are difficult to integrate in a way that boosts performance beyond the strong RoBERTa baseline. Our proposed Context-Inclusive Model, however, outperforms RoBERTa significantly when using event context (EvCIM). Error analysis shows that EvCIM performs better than the baseline on longer sentences, and on sentences from politically centrist articles. Furthermore, both models perform better on BASIL's Fox News instances than on its New York Times or Huffington Post instances, and both models are better at recognising bias in quotes.
Future work could explore domain-adaptation to unlabeled data from the same population of articles that the BASIL corpus was drawn from. Given the differences in performance on Fox News articles compared to other sources, domain-adaptation to specific sources is also a promising avenue. In addition, future datasets may need to ensure a balance of sources that represent different layers and sections of society. Future work could also extend context-inclusion experiments to token classification.