Media Bias, the Social Sciences, and NLP: Automating Frame Analyses to Identify Bias by Word Choice and Labeling

Media bias can strongly impact the public perception of topics reported in the news. A difficult-to-detect yet powerful form of slanted news coverage is called bias by word choice and labeling (WCL). WCL bias can occur, for example, when journalists refer to the same semantic concept by using different terms that frame the concept differently and consequently may lead to different assessments by readers, such as the terms “freedom fighters” and “terrorists,” or “gun rights” and “gun control.” In this research project, I aim to devise methods that identify instances of WCL bias and estimate the frames they induce, e.g., “terrorists” is not only of negative polarity but also connotes aggression and fear. To achieve this, I plan to research methods using natural language processing and deep learning while employing models and analysis concepts from the social sciences, where researchers have studied media bias for decades. The first results indicate the effectiveness of this interdisciplinary research approach. My vision is to devise a system that helps news readers to become aware of the differences in media coverage caused by bias.


Introduction
Media bias describes differences in the content or presentation of news (Hamborg et al., 2018). It is a ubiquitous phenomenon in news coverage that can have severely negative effects on individuals and society, e.g., when slanted news coverage influences voters and, in turn, also election outcomes (Alsem et al., 2008; DellaVigna and Kaplan, 2007). Potential issues of biased coverage, whether through the selection of topics or how they are covered, are compounded by the fact that in many countries only a few corporations control large parts of the media landscape; in the US, for example, six corporations control 90% of the media (Business Insider, 2014).
Even subtle changes in the words used in a news text can strongly impact readers' opinions (Papacharissi and de Fatima Oliveira, 2008; Price et al., 2005; Rugg, 1941; Schuldt et al., 2011). When referring to a semantic concept, such as a politician or other named entities (NEs), authors can label the concept, e.g., "illegal aliens," and choose from various words to refer to it, e.g., "immigrants" or "aliens." Instances of bias by word choice and labeling (WCL) frame the referred concept differently (Entman, 1993, 2007), whereby a broad spectrum of effects occurs. For example, the frame may change the polarity of the concept, i.e., make it appear more positive or more negative, or the frame may emphasize specific parts of an issue, such as the economic or cultural effects of immigration (Entman, 1993).
In the social sciences, research over the past decades has resulted in comprehensive models to describe media bias as well as effective methods for the analysis of media bias, such as content analysis (McCarthy et al., 2008) and frame analysis (Entman, 1993). Because researchers need to conduct these analyses mostly manually, the analyses do not scale with the vast amount of news that is published nowadays (Hamborg et al., 2019a). Consequently, such studies are always conducted for topics in the past and do not deliver insights for the current day (McCarthy et al., 2008; Oliver and Maney, 2000), although such insights would be of primary interest to people reading the news. Revealing media bias to news consumers would also help to mitigate bias effects and, for example, support them in making more informed choices (Baumer et al., 2017).
In contrast, in computational linguistics and computer science, fewer approaches systematically analyze media bias (Hamborg et al., 2019a). The models used to analyze media bias tend to be simpler (Hamborg et al., 2018; Park et al., 2009) compared to previously mentioned models. Many approaches analyze media bias from the perspective of news consumers while neglecting both the established approaches and the comprehensive models that have already been developed in the social sciences (Evans et al., 2004; Mehler et al., 2006; Munson et al., 2013, 2009; Oelke et al., 2012; Park et al., 2009; Smith et al., 2014). Correspondingly, their results are often inconclusive or superficial, despite the approaches being technically superior.

Research Question, Tasks, and Contributions
To address the issues described in Section 1, I define the following research question for my Ph.D. research: How can an automated approach identify instances of bias by word choice and labeling in a set of English news articles reporting on the same event? To address this research question, I derive the following research tasks:
T1. Identify the strengths and weaknesses of manual and automated methods used to identify media bias.
T2. Research NLP techniques and required datasets to address these weaknesses. To do so, use established bias models and (semi-)automate currently manual analysis methods.
T3. Implement a prototype of a media bias identification system that employs the developed methods to demonstrate the applicability of the approach in real-world news article collections. The prototype targets non-expert users.
T4. Evaluate the effectiveness of the bias identification methods with a test corpus and evaluate the effectiveness of using the prototype in a user study.
Combining the expertise of the social sciences and computational linguistics appears beneficial for research on media bias. Thus, the main contribution of my Ph.D. research will be an approach that combines models and methods from multiple disciplines. On the one hand, it will leverage established models from the social sciences to describe media bias and will follow currently manual methods to analyze media bias. On the other hand, it will take advantage of scalable methods for text analysis developed and used in computational linguistics. I need to employ and extend the state-of-the-art in two closely related NLP fields (cf. Section 4): (1) cross-document coreference resolution (CDCR) as well as (2) target-dependent sentiment classification (TSC) including "sentiment shift" and identification of framing effects and causes (see Section 4.2). I plan to embed both techniques into an approach that is inspired by the procedure of manually conducted, inductive frame analyses (cf. Section 3.1).
For the first technical contribution, I have already devised a sieve-based CDCR approach that addresses characteristics of coreferences as they often occur in bias by WCL. The examples in the Abstract show that even phrases that are usually considered contrary may be coreferential in a set of articles reporting on a specific event. For the second technical contribution, i.e., to estimate how a semantic concept may be perceived by people when reading a news article, I primarily plan to devise and test neural models that I will design specifically for the task. I also plan to implement a prototype that includes visualizations to reveal the identified instances of bias by WCL to users of the system.
In the remainder of this document, I will give a brief overview of manual techniques for the analysis of bias by WCL and exemplary results from the social sciences as well as related, automated approaches (Section 3). Section 3 concludes with the current research gap, which motivates my Ph.D. research. Section 4 describes the tasks that I have already conducted as well as current and future tasks to complete my Ph.D. research. Section 5 describes a preliminary evaluation, which I already completed, as well as remaining tasks.

Related Work
The following summarizes an interdisciplinary literature review that I conducted as part of my Ph.D. research (T1) (Hamborg et al., 2019a).

Manual Approaches
In the social sciences, the news production process is an established model that defines nine forms of media bias and describes where these forms originate from (Baker et al., 1994; Hamborg et al., 2019a, 2018; Park et al., 2009). For example, journalists select events, sources, and from these sources the information they want to publish in a news article. While these initial selections are necessary due to the multitude of real-world events, they may also introduce bias to the resulting story. While writing an article, authors can affect readers' perception of a topic through word choice (cf. Section 1; Baker et al., 1994; Gentzkow and Shapiro, 2006; Oelke et al., 2012). Lastly, the placement and size of an article on a website, for example, determine how much attention the article will receive.
Researchers in the social sciences primarily conduct frame analyses or, more generally, content analyses to identify instances of bias by WCL (McCarthy et al., 2008; Oliver and Maney, 2000). In content analysis, researchers first define one or more analysis questions or hypotheses. Then, they gather the relevant news data, and coders systematically read the texts, annotating parts of the texts that indicate instances of bias relevant to the analysis question, e.g., phrases that change readers' perception of a specific person or topic. In inductive content analysis, coders read and annotate the texts without prior knowledge other than the analysis question. In deductive content analysis, coders must adhere to a set of coding rules defined in a codebook, which is usually created using the findings from an earlier inductive content analysis. After the coding, researchers quantify the annotated instances to finally accept or reject their hypotheses.
Content analyses conducted for WCL bias are typically either topic-oriented or person-oriented. Annotations range from basic forms, e.g., targeted sentiment (Niven, 2002), to fine-grained "perception" categories, causes thereof, or other features. For example, Papacharissi and de Fatima Oliveira (2008) investigated WCL in the coverage of different news outlets on topics related to terrorism. One high-level finding was that the New York Times used more dramatic tones than the Washington Post, e.g., news articles dehumanized terrorists by not ascribing any motive to terrorist attacks or by using metaphors, such as "David and Goliath." Both the Financial Times and the Guardian focused their articles on factual reporting.

(Semi-)Automated Approaches
Many automated approaches treat media bias vaguely and view it only as "differences of [news] coverage" (Park et al., 2011b), "diverse opinions" (Munson and Resnick, 2010), "different perspectives" (Hamborg et al., 2018), or "topic diversity" (Munson et al., 2009), resulting in inconclusive or superficial findings (Hamborg et al., 2019a). Only a few approaches use comprehensive bias models or focus on a specific form of media bias (cf. Section 3.1). Likewise, few approaches aim to specifically identify instances of WCL bias. For example, Lim et al. (2018) and Spinde et al. (2020b) propose to investigate words with a low document frequency in a set of news articles reporting on the same event, to find potentially biasing words that are characteristic of a single article. NewsCube 2.0 employs crowdsourcing to estimate the bias of articles reporting on a topic. The system allows users to annotate WCL in news articles collaboratively (Park et al., 2011a).
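The document-frequency idea mentioned above can be illustrated with a minimal sketch. It is not the implementation of Lim et al. (2018) or Spinde et al. (2020b); the tokenizer, the function name `low_df_words`, the `max_df` threshold, and the three example sentences are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphabetic tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def low_df_words(articles, max_df=1):
    """For each article, return the words that occur in at most `max_df` articles.

    `max_df` is a hypothetical threshold chosen for this sketch.
    """
    token_sets = [set(tokenize(article)) for article in articles]
    # Document frequency: in how many articles does each word occur?
    df = Counter(word for tokens in token_sets for word in tokens)
    return [sorted(w for w in tokens if df[w] <= max_df) for tokens in token_sets]


# Three invented sentences standing in for articles on the same event.
articles = [
    "Coalition forces entered the city.",
    "Invading forces entered the city.",
    "Troops entered the city.",
]

for article, words in zip(articles, low_df_words(articles)):
    print(words, "<-", article)
```

In this toy input, the words shared by all articles drop out and only the article-specific terms remain as candidates. A realistic setting would additionally need stop-word handling, lemmatization, and a way to separate biasing words from merely rare ones.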
The most related, fully automated field of methods is TSC, which aims to find the connotation of a phrase regarding a given target. On news texts, however, current TSC methods perform poorly for at least three reasons. First, news texts have rather subtle connotations due to the expected journalistic objectivity (Gauthier, 1993; Hamborg et al., 2018). Second, to my knowledge, no news-tailored TSC approaches, dictionaries, or annotated datasets exist; generic approaches tend to perform poorly on news texts (Balahur et al., 2010; Kaya et al., 2012; Oelke et al., 2012). Third, the one-dimensional polarity scale used by all mature TSC methods may fall short of representing complex news frames (cf. Section 1). To avoid the difficulties of highly context-dependent connotations in news texts, researchers have proposed to perform TSC only on quotes (Balahur et al., 2010) or on the readers' comments (Park et al., 2011b), which more likely contain explicit connotations. Researchers also suggested investigating emotions induced by headlines, but they achieved mixed results (Strapparava and Mihalcea, 2007).

Research Gap
To my knowledge, there are currently no automated approaches that identify or compare instances of WCL bias, despite reliable analysis concepts used in the social sciences and automated text analysis methods in related fields, such as CDCR and TSC.
To address the difficulties due to the expected objectivity of news texts and other previously mentioned factors, I plan to follow two main ideas. First, I will use knowledge and models from disciplines that have long studied media bias. Second, I expect the recent advent of word embeddings and deep learning, including neural language models, such as BERT (Devlin et al., 2018), to be strongly beneficial to the outcome of this project. The advancements in these fields have led to a performance leap in many NLP disciplines, including coreference resolution and TSC. In the latter, for example, macro F1 on the Twitter set increased from 63.3 (Kiritchenko et al., 2014) to 75.8 (Zeng et al., 2019).

Methodology
Research task T2 will be the main contribution of my Ph.D. research; hence, this section focuses on completed and future tasks related to T2. Technically, addressing the research question poses two main challenges. The first is resolving coreferences of semantic concepts across a set of news articles. In bias by WCL, journalists often use coreferences in a broader, sometimes even contradictory, sense than state-of-the-art coreference resolution and CDCR can handle (Balahur et al., 2010; Baumer et al., 2017; Hamborg et al., 2019b). The second is classifying how actors and other semantic concepts are framed by their mentions and their mentions' contexts, for which I will use TSC.
I plan to integrate the two tasks into the analysis shown in Figure 1 (T3). Given a set of news articles reporting on the same event, the analysis will find subsets of articles and in-text phrases that similarly frame the concepts involved in the event. Lastly, the system will visualize the results for news consumers. Because T3 is not directly related to NLP, it is described only briefly in Section 4.3.

Broad Cross-doc. Coreference Resolution
After the system has completed state-of-the-art preprocessing (Manning et al., 2014), the second phase in the analysis is broad CDCR, which aims to resolve coreferences as they occur in WCL bias (Hamborg et al., 2019b). The first task within this phase is candidate extraction. Relevant phrases containing bias by WCL are commonly noun phrases (NPs), e.g., NEs such as politicians, or verb phrases (VPs), which describe an action, such as "cross the border" or "invade the country." The approach currently focuses only on NPs and extracts mentions from two sources: first, mentions from coreference chains identified by coreference resolution, and second, NPs identified by parsing.
The second task, candidate merging, addresses the main difficulty of broad CDCR. Journalists often use divergent terms to refer to the same semantic concept (Hamborg et al., 2019a), sometimes even terms that typically have opposing meanings, such as "intervene" vs. "invade," "coalition forces" vs. "invading forces." Such coreferences are highly context-dependent and may only be valid in a single news article or across related articles (Hamborg et al., 2019b,c). Related state-of-the-art techniques for coreference resolution capably resolve generally valid synonyms, nominal and pronominal coreferences, such as "Donald Trump," "US president," and "he." However, they cannot reliably resolve the previously mentioned, broader examples of coreferences, which often occur in bias by WCL (Hamborg et al., 2019a).
The candidate merging task uses a series of sieves, where each analyzes specific characteristics of two candidates to determine whether they should be merged (see Figure 1). For example, the first sieve merges candidates if they have similar core meanings, specifically, if the head of each candidate's representative phrase is identical (Hamborg et al., 2019b). For a given coreference chain, the representative phrase is defined as the mention that best represents the chain's meaning (Manning et al., 2014). This way, the first sieve merges cases such as "Donald Trump" and "President Trump." The second sieve merges candidates if most of their mentions are semantically similar. The sieve currently uses non-contextualized word embeddings, specifically word2vec (Mikolov et al., 2013), to vectorize each mention. Then, it calculates the unweighted mean of all vectorized mentions of a candidate. Lastly, the sieve will merge two candidates if their mean vectors are similar by cosine similarity. Analogously, the remaining sieves address specific characteristics, e.g., using word embeddings (Le and Mikolov, 2014) and clustering methods, such as affinity propagation (Frey and Dueck, 2007). More information on the approach is described by Hamborg et al. (2019b).
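The first two sieves described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of Hamborg et al. (2019b): the three-dimensional `TOY_VECTORS` stand in for word2vec embeddings, the head of a phrase is approximated by its last token, the similarity `threshold` of 0.9 is hypothetical, and combining the sieves with a logical "or" is an assumption of this sketch.

```python
import numpy as np

# Toy 3-d vectors standing in for word2vec embeddings (hypothetical values).
TOY_VECTORS = {
    "donald": np.array([0.9, 0.1, 0.0]),
    "president": np.array([0.8, 0.2, 0.1]),
    "trump": np.array([1.0, 0.0, 0.0]),
    "coalition": np.array([0.1, 0.9, 0.2]),
    "invading": np.array([0.0, 0.8, 0.4]),
    "forces": np.array([0.1, 1.0, 0.1]),
    "troops": np.array([0.2, 0.9, 0.1]),
}


def head(phrase):
    """Simplification: treat the last token as the head of a noun phrase."""
    return phrase.lower().split()[-1]


def mention_vector(mention):
    """Vectorize one mention as the mean of its token vectors."""
    return np.mean([TOY_VECTORS[t] for t in mention.lower().split()], axis=0)


def candidate_vector(candidate):
    """Unweighted mean of all vectorized mentions of a candidate."""
    return np.mean([mention_vector(m) for m in candidate["mentions"]], axis=0)


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def sieve_same_head(a, b):
    """Sieve 1: the heads of the representative phrases are identical."""
    return head(a["representative"]) == head(b["representative"])


def sieve_similar_mentions(a, b, threshold=0.9):
    """Sieve 2: the mean mention vectors are similar by cosine similarity."""
    return cosine(candidate_vector(a), candidate_vector(b)) >= threshold


def should_merge(a, b):
    """Merge if any sieve fires (the combination rule is an assumption)."""
    return sieve_same_head(a, b) or sieve_similar_mentions(a, b)


trump_a = {"representative": "Donald Trump", "mentions": ["Donald Trump"]}
trump_b = {"representative": "President Trump", "mentions": ["President Trump"]}
forces = {"representative": "coalition forces",
          "mentions": ["coalition forces", "invading forces"]}
troops = {"representative": "troops", "mentions": ["troops"]}

print(should_merge(trump_a, trump_b))  # merged by the head sieve
print(should_merge(forces, troops))    # merged by the embedding sieve
print(should_merge(trump_a, troops))   # no sieve fires
```

With real embeddings, the threshold would have to be tuned on annotated data, and out-of-vocabulary tokens would need explicit handling, which this sketch omits.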
Future research directions for the CDCR task most importantly include extending the capabilities of the approach and improving its performance. For the former, we want to investigate how coreferential mentions of activities (VPs) can be resolved. To improve the CDCR performance, we plan to devise a method that uses a language model to resolve coreferential mentions. For example, BERT increased the performance on single-document coreference resolution from F1=73.0 to F1=77.1. Using SpanBERT, a pre-training method focused on spans rather than tokens, the performance is increased to F1=79.6 (Joshi et al., 2019). We expect that using a language model can yield similar improvements for CDCR.
Figure 1: The planned three-phase analysis pipeline, which preprocesses news articles reporting on the same event, resolves coreferential mentions of semantic concepts across documents, and groups articles that frame these concepts similarly. Source: Hamborg et al. (2019b).

Frame Identification
Approaches aiming to estimate how semantic concepts are perceived, e.g., in the closely related field of TSC by classifying the concepts' polarity, or, more broadly, approaches to identify bias, traditionally employ manually created dictionaries or manually engineered features for machine learning (ML). Such approaches can achieve high performances in various domains, e.g., Recasens et al. (2013) propose an approach that capably identifies single bias-words in Wikipedia articles by using dictionaries and further, non-complex features.
In news texts, however, such approaches fall short. Since neutral language is expected (cf. Section 3), token-based and ML methods fail to catch the "meaning between the lines" (Hamborg et al., 2019a,b; Balahur et al., 2010; Godbole et al., 2007). Yet, recent NLP advancements, most importantly language models, have proven to be very effective in the news domain as in various other domains and tasks (see Section 3.3).
I plan to devise a neural model that will, in part, be inspired by state-of-the-art TSC approaches such as LCF-BERT (Zeng et al., 2019) and domain-adapted SPC-BERT (Rietzler et al., 2019), with three main differences. First, the model will need to consider characteristics specific to news articles. For example, in news articles, sentiment may more strongly depend on global context compared to TSC prime domains, e.g., because the latter are typically shorter texts (Adhikari et al., 2019).
Second, besides "absolute" sentiment polarity, the model needs to consider the "sentiment shift" induced by the context of a target mention. For example, while TSC traditionally focuses on the event's or text's sentiment regarding a target (cf. "text-level" as defined by Balahur et al. (2010)), bias by WCL is concerned explicitly with the language, e.g., words, used in the sentence. So, given a target mention, I am interested in whether the mention or its context sways the perception more positively or negatively, also in relation to the sentiment at event- or text-level (Balahur et al., 2010).
Third, for an identified non-neutral polarity, the approach should be able to find in-text causes and potential effects thereof. Causes include the use of emotional words, loaded language, or aggressive repetition of specific facts. Effects include particularly how the target is framed (cf. "frame properties" as defined by Hamborg et al. (2019b) or "frames types" by Card et al. (2015)). Resolving the dependencies of a target and its context is the subject of current TSC research (Zeng et al., 2019; Rietzler et al., 2019), and I expect it to be important in the proposed project as well.

System and Visualization
A system will integrate the previously described analysis workflow and will visualize the results for non-expert users (T3). I devised visualizations that are similar to UIs of popular news aggregators, such as Google News, and bias-aware aggregators, such as AllSides. In contrast to these, the system will be able to identify in-text instances of bias (Hamborg et al., 2017; Spinde et al., 2020a). Hence, the system will not only give a bias-aware overview of current topics but will also have a visualization for single articles, which will highlight identified instances of WCL bias.
For research and evaluation of the previously described system and its analysis methods, I currently use the datasets AllSides (Chen et al., 2018), NewsWCL50 (Hamborg et al., 2019c), and POLUSA (Gebhard and Hamborg, 2020), which have high diversity concerning outlets' political slant.
I plan to publish the code of the system and methods. Due to the system's modularity, researchers can extend it to support further forms of bias, e.g., commission and omission of information or picture selection (Torres, 2018; Hamborg et al., 2019a).

Evaluation
I conducted preliminary evaluations of the two main methods described in Section 4 (T4). To measure the CDCR performance on broad coreferences as they occur in WCL bias (Section 4.1), I created a test dataset named NewsWCL50. The dataset was created by manually annotating coreferential mentions of persons, actions, and also vaguely defined, abstract concepts across 50 news articles (Hamborg et al., 2019b). The evaluation seems to confirm the research direction for this task. The approach currently achieves F1 = 45.7, or 84.4 if evaluated only on technically feasible annotations, compared to 29.8, or 42.1, respectively, achieved by the best baseline. "Technically feasible" means that the evaluation considers only annotations that the approach should theoretically be able to resolve, e.g., currently only NPs while excluding VPs.
A future evaluation will include a comparison to state-of-the-art CDCR methods (Barhom et al., 2019; Intel AI Lab, 2018). For improved soundness, we plan to create a second dataset similar to the NewsWCL50 dataset but with more coders and more articles. To do so, we will crowdsource the annotations of concept mentions on MTurk and use an improved codebook. The improvements will address issues of NewsWCL50's codebook, e.g., by making annotation types less ambiguous (Hamborg et al., 2019b). Further, we plan to use two additional datasets: ECB+ (Cybulska and Vossen, 2014) and NIdent (Recasens et al., 2012). Both datasets are commonly used to evaluate CDCR approaches and contain cross-document coreferences.
To evaluate the second task, frame identification, I plan to create a comprehensive training and test set for the TSC method described in Section 4.2. I already created a preliminary dataset of 3000 sentences, each including a target mention and a sentiment label agreed on by three coders. The dataset was created analogously to established TSC datasets (Dong et al., 2014; Pontiki et al., 2014; Nakov et al., 2016; Rosenthal et al., 2017).
Preliminary results seem to indicate that TSC on the news domain is in part more difficult than on TSC prime domains, such as product reviews, where authors often express their opinion explicitly. State-of-the-art TSC achieves average recall AvgRec = 70.0 on news articles, whereas performances on common TSC test datasets range from AvgRec = 75.6 (Twitter dataset) to 82.2 (Restaurant). Other baselines, e.g., using dictionaries and semantic networks, such as ConceptNet, perform very poorly (F1 < 15.0), which seems to confirm that token-based approaches fail to catch the subtlety common to WCL bias.
Finally, we plan to evaluate the system's effectiveness regarding visualization of the identified biases to non-expert users. An already conducted pre-study confirmed the study design (Spinde et al., 2020a). I will revisit this task once the classification methods described in Section 4 can be used within the study.
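For readers unfamiliar with the average recall (AvgRec) measure used in the TSC results above, the following minimal sketch computes it as the unweighted mean of the per-class recalls, expressed as a percentage. The three-class label set, the function name `average_recall`, and the toy labels are assumptions for illustration; they are not taken from the preliminary dataset.

```python
def average_recall(gold, predicted, labels=("positive", "neutral", "negative")):
    """Unweighted mean of per-class recall, as a percentage.

    Classes that never occur in `gold` are skipped, since their recall is undefined.
    """
    recalls = []
    for label in labels:
        gold_idx = [i for i, g in enumerate(gold) if g == label]
        if not gold_idx:
            continue
        hits = sum(1 for i in gold_idx if predicted[i] == label)
        recalls.append(hits / len(gold_idx))
    return 100.0 * sum(recalls) / len(recalls)


# Invented gold labels and predictions for eight target mentions.
gold = ["positive", "positive", "neutral", "neutral",
        "neutral", "neutral", "negative", "negative"]
pred = ["positive", "neutral", "neutral", "neutral",
        "neutral", "negative", "negative", "negative"]

# Per-class recall: positive 1/2, neutral 3/4, negative 2/2.
print(average_recall(gold, pred))
```

Because every class contributes equally, the measure is not dominated by a majority class, which matters when one class, such as neutral, is much more frequent than the others.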

Conclusion and Implications
In summary, both everyday news consumers and researchers in the social sciences could benefit strongly from the automated identification of bias by word choice and labeling (WCL) in news articles. Devising suitable methods to resolve broad coreferences across news articles reporting on the same event and estimating the frames of the found instances of WCL bias are at the heart of this research project. One primary result of the project will be the first automated approach capable of identifying instances of bias by WCL in a set of news articles reporting on the same event or topic.
My vision is that at a later point in time, such methods might be integrated into popular news aggregators, such as Google News, helping news readers to explore and understand media bias through their daily news consumption. Also, I think that these methods could be integrated into the workflow of content analyses and frame analyses, helping to further automate these currently mostly manual and thus time-consuming analysis concepts prevalent in the social sciences.