Viable Threat on News Reading: Generating Biased News Using Natural Language Models

Recent advancements in natural language generation have raised serious concerns. High-performance language models are widely used for language generation tasks because they are able to produce fluent and meaningful sentences. These models are already being used to create fake news. They can also be exploited to generate biased news, which can then be used to attack news aggregators to change their readers' behavior and influence their bias. In this paper, we use a threat model to demonstrate that publicly available language models can reliably generate biased news content based on an original news article given as input. We also show that a large number of high-quality biased news articles can be generated using controllable text generation. A subjective evaluation with 80 participants demonstrated that the generated biased news is generally fluent, and a bias evaluation with 24 participants demonstrated that the bias (left or right) is usually evident in the generated articles and can be easily identified.


Introduction
Natural language generation is defined as the creation of understandable text using a language model (LM) trained on a large collection of texts. An LM is a probability distribution over a sequence of words. Given a set of training text sequences, we can train an LM to produce texts similar to the training data. Researchers have used deep learning algorithms to generate more fluent and semantically meaningful texts than those generated using conventional methods like n-grams (Lu et al., 2018). Such LMs are being used to generate image captions (Vinyals et al., 2015), perform machine translation (Bahdanau et al., 2015), and paraphrase and summarize text (Zhang et al., 2017). High-performance LMs can also generate fake news, fake reviews, and fake comments (Adelani et al., 2020; Zellers et al., 2019).
Recent studies have revealed various types of bias in top U.S. news sources, which often report political news in a biased way: for example, attention can be drawn to particular events and entities while others are ignored (Ribeiro et al., 2018; Groseclose and Milyo, 2005; Kulshrestha et al., 2017). The selection of what to report about an entity (positive or negative) produces bias. There are two major political sides in the U.S.: Democrats on the left and Republicans on the right.
News aggregating platforms like Google News and Yahoo News are the most viewed news websites in the U.S., with 150 and 175 million unique visitors every month, respectively (Watson, 2019). They offer content relevant to a wide range of global audiences, and therefore they have a responsibility to maintain the same sentiment and bias. However, they can utilize language models to generate biased content (news headlines and articles) to model the behavior of their readers. Exposure to biased news is very harmful as it can increase or flip the political bias of a reader (Bail et al., 2018). For example, Wong (2019) found that exposure to biased news can alter the political inclinations of people, and Wanta and Hu (1994) found that false representation of news from a news source can lead to broken trust between the reader and the news source.
Previous works on media bias mostly focused on detecting bias either by using cues from the social media presence of the news sources (Kulshrestha et al., 2017; Ribeiro et al., 2018), or by analyzing how bias is manifested within each news article. Other work focused on flipping the bias of news headlines, which are short one-line texts. Bail et al. (2018) showed that exposure to opposing views can increase political polarization. To the best of our knowledge, ours is the first attempt at generating full-length biased news articles using high-performance language models.
Figure 1: Proposed threat model. Original news is used as seed by the "Biased News Generator" (explained in Section 4) to generate left or right biased news. Readers are then exposed to the generated biased news to change their original bias (either flip or increase).
Our Contribution. In this paper, we use a threat model (Figure 1) to demonstrate that publicly available language models can reliably generate biased news content based on original news. In an ideal scenario, a user consumes original news from an aggregator and develops a confirmation bias (Nickerson, 1998) about entities mentioned in the news. If the news complements their bias, they are likely to jump to the original source to continue reading (Swire et al., 2017). In our threat model, we assume that the attacker is able to access the original news and has control over what a user will see when visiting the aggregator's platform. In this scenario, the attacker can rework the original news, either by shifting its bias farther than it originally was (Levendusky, 2013) or by flipping its original bias (Bail et al., 2018). The attacker is also assumed to be able to access a large collection of news articles labeled with their bias (left or right) to use for training an LM. The attacker feeds the original news to the LM as context for generating biased news. Finally, the attacker exposes readers to the generated biased news.
To generate biased news, we fine-tuned the GPT-2 language model to create two different LMs, each trained on a specific type of biased news. We used an API built on a RoBERTa-based model (Liu et al., 2019) (explained in a later section) to classify the generated news as left or right biased. However, generating only the text for news is not enough. Therefore, we then fine-tuned another generative model, known as GROVER (Zellers et al., 2019), which enables controllable generation of an entire news article: the body, title, news source, publication date, and author list. Finally, we performed a subjective evaluation with 80 participants (32 native and 48 non-native English speakers). The results show that the news articles generated by the models (machine-generated news) had almost the same fluency as those written by people (human-written news). The participants tended to select at random when asked to choose the human-written excerpt from two options: an excerpt from machine-generated news and one from human-written news. We then chose 24 of the 80 participants to evaluate the bias in the machine-generated news articles. They were able to identify a bias 92% of the time, and assigned a correct bias rating 62.91% of the time.

Related Work
In this section, we discuss related work on political bias datasets, bias analysis, bias generation and detection in news articles.

Political Bias Datasets
Among works that study bias, Arapakis et al. (2016) collected a dataset of 561 news articles, each labeled with 14 qualitative aspects along with the article's subjectivity. Another dataset, the multi-perspective question answering (MPQA) corpus (Wiebe et al., 2005), contains 692 news articles, each with a label of its subjectivity. These two corpora were carefully developed with labels at the article and sentence levels. However, the labeling technique is costly to scale, and the corpora are not large enough (< 1000 samples), so a corpus of 2,781 events from the AllSides website was developed to characterize and flip bias in news headlines. The corpus contains news headlines and articles presented by a left-leaning and a right-leaning news source, paired together with an unbiased summary of the event. However, the labeling is specific to the news source, so there is no information about the bias at the article level. Moreover, the corpus is not large enough to be used to generate news articles. Therefore, for this study, we used the "All-The-News" dataset (https://www.kaggle.com/snapcrack/all-the-news).

Bias Analysis
Media bias has been under study for decades (Groseclose and Milyo, 2005; Fang et al., 2012; Arapakis et al., 2016), and various aspects of political bias have been studied from different perspectives. For example, Groseclose and Milyo (2005) quantified bias for a sample of 20 news sources in the U.S. on the basis of the number of citations used by think tanks and policy groups. Their work is among the first to provide clear evidence of bias in media. Lin et al. (2011) proposed categorizing bias on the basis of variables like mentions of political parties, legislators, and ideology. Another study focused on liberal and conservative bias and, using manual annotation, found that bias indicators usually include named entities. A more recent study explored the idea with right and left bias, and experimentally showed that named entities are indeed important, and that bias is more evident in longer texts, i.e., in full-length news articles, than in shorter texts like sentences and paragraphs. We performed the same analyses to evaluate the reliability of our dataset.

Biased Headline Generation
Advances in natural language processing have led to rapid development of several language generation techniques. With the release of transformer-based model architectures and text representations (Vaswani et al., 2017; Devlin et al., 2018), machines are now able to generate high-quality text outputs, which may or may not preserve the context. To generate text that better preserves context, researchers have studied controllable text generation, i.e., how to rewrite a text so that it has certain attributes (Keskar et al., 2019; Zellers et al., 2019). Several of these studies demonstrated that the text style can be transferred by simply changing the relevant words in an unsupervised manner (Li et al., 2018; Adelani et al., 2020; Shen et al., 2017). Bias flipping in text has also been demonstrated, but only for the headlines of news articles. To the best of our knowledge, ours is the first study on generating full-length biased news articles.

Identification of Bias in News Articles
There have been several attempts to identify bias as left or right at the article level (Baly et al., 2018; Wang, 2019) and at the source level (Ribeiro et al., 2018; Kulshrestha et al., 2017). Classifying a media source as left-leaning or right-leaning is flawed if one starts to look at each article to identify its bias. We are more interested in the text and style of bias in news articles, and therefore we focused on bias at the article level. At the article level, Zhao et al. and Baly et al. (2018) used a smaller dataset and shallow models to classify bias using three labels. Using recent advancements in natural language processing, Wang (2019) created a state-of-the-art regression model to quantify bias in news articles by using a RoBERTa-based model (Liu et al., 2019) and trained it on several datasets, such as Adfontes Media's list of articles and webhose.io, for generalizability. We used the RoBERTa-based model to generate automatic bias ratings and to evaluate bias in generated text.

All The News Dataset and Automatic Bias Ratings
The dataset we used is a collection of 139,668 full-length news articles curated using the Internet Archive from 15 major news sources in the U.S. It is available on the Kaggle website under the name "All the news". For each source, the Internet Archive was used to grab the past year and a half of either homepage headlines or RSS feeds, and their links were parsed through a scraper. The data obtained were not the product of scraping an entire site, but rather of scraping the more prominently placed articles. For example, CNN's articles from 5 June 2016 were what appeared on the homepage of CNN at the time of data collection. Similarly, Vox's articles from that time were everything that appeared in the Vox RSS reader, and so on. Therefore, each news article came with its headline, publication source, publication date, and full-length body. The collection did not include bias ratings at the article level. To create bias ratings, we used a RoBERTa-based regression model made available to us upon request by "The Bipartisan Press". "The Bipartisan Press" annotated the data using Adfontes Media's methodology (Otero, 2019), which involves an initial screening and training to hire experts to annotate news articles with their bias on a scale of -42 to +42. A negative sign indicates a left-leaning bias and a positive sign indicates a right-leaning bias.
We used the regression model to calculate the bias in each news article and treated these bias ratings as the ground truth. We further used the same model to evaluate the bias of the generated news articles. Table 1 lists some statistics about the "All the news" dataset.
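As a minimal sketch of how ratings on this scale can be read as coarse labels, assuming only the sign convention stated above (negative is left-leaning, positive is right-leaning). The function name `bias_label` and the handling of a score of exactly 0 are illustrative assumptions, not part of the described pipeline.

```python
def bias_label(score: float) -> str:
    """Map a bias rating on the -42..+42 scale to a coarse label."""
    if not -42 <= score <= 42:
        raise ValueError("score outside the -42..+42 scale")
    if score < 0:
        return "left"
    if score > 0:
        return "right"
    # Assumption: the text does not say how a rating of exactly 0 is treated.
    return "neutral"
```

For instance, `bias_label(-12.5)` yields `"left"` and `bias_label(7)` yields `"right"`.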

Discriminativeness Ratio
Bias can be found in a text if it expresses sentiment towards a specific entity (a person, a place, or a policy). Previous work proposed a discriminativeness ratio to capture the fundamental difference between biased and sentimental text based on word frequency. The ratio is given as:
$$\mathrm{ratio}(w) = \frac{\mathrm{occ}(w, D_t)}{\mathrm{occ}(w, D_{t'})}$$
where occ(w, D) is the frequency of w in text D, and t and t' are types of text. In biased text, t and t' correspond to right and left, while in sentimental text they represent positive and negative sentiments, respectively. With the discriminativeness ratio, type-unrelated words have values close to 1, as they appear almost equally in both types of text. On the other hand, words that appear often in one type and rarely in the other will have higher values (type t) or lower values (type t'), respectively.
Table 2: Three words with highest and lowest discriminativeness ratio, and words with ratio very close to one.
Table 2 lists the words having the highest and the words having the lowest discriminativeness ratio for sentimental text and biased text. We show the results for sentimental text to simplify the explanation. The top three words in the sentimental text are positive, the bottom three are negative, and sentiment-unrelated words have a value close to one. In the biased text, the three type-unrelated words (ratio of 1.0) included both positive ("aired" and "recused") and negative ("suspicion") sentiment words. This is because both left- and right-biased texts use sentiment words to support and oppose entities. In addition, the top three and the bottom three biased-text words are named entities, indicating that articles with either bias tend to criticize or support different named entities, using the same words to convey sentiments. In line with this, a bias analysis by Yano et al. (2010) revealed that named entities are often bias indicators.
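The ratio can be illustrated with a small word-count sketch. It assumes the ratio is the frequency of a word in text of type t divided by its frequency in text of type t'; the whitespace tokenization, the `smoothing` term (added to avoid division by zero), and the placeholder tokens in the toy corpora are all assumptions made for illustration.

```python
from collections import Counter


def discriminativeness_ratio(docs_t, docs_t_prime, smoothing=1.0):
    """Return {word: ratio} for two collections of documents.

    Ratios near 1 indicate type-unrelated words; large or small ratios
    indicate words typical of type t or type t', respectively.
    """
    occ_t = Counter(w for doc in docs_t for w in doc.lower().split())
    occ_tp = Counter(w for doc in docs_t_prime for w in doc.lower().split())
    vocab = set(occ_t) | set(occ_tp)
    # Assumption: additive smoothing keeps the ratio defined for unseen words.
    return {w: (occ_t[w] + smoothing) / (occ_tp[w] + smoothing) for w in vocab}


# Toy corpora with placeholder entity names (not real data).
docs_right = ["entity_a failed again", "the senate met"]
docs_left = ["entity_b failed again", "the senate met"]
ratios = discriminativeness_ratio(docs_right, docs_left)
```

In this toy example the shared words ("the", "failed") get a ratio of 1.0, while the two entity tokens fall on opposite sides of 1, mirroring the observation that named entities tend to be the most discriminative words in biased text.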

Biased News Generation
The most important parts of the proposed method for generating biased news are the GPT-2 text generation model and the controllable text generation model (Zellers et al., 2019). As shown in Figure 2, we used a two-step approach to generate biased news: generation and validation. In the generation step, an attacker provides an original news article x as the seed input to a generation model. The model then generates a modified article x' based on x. In the validation step, the generated articles are classified on the basis of bias. The attacker is assumed to have access to such a classifier and uses it to segregate left- and right-biased news. The details of our proposed method are discussed below.
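The two-step flow can be sketched as follows. This is only an outline under stated assumptions: `generate` and `rate_bias` are hypothetical callables standing in for a fine-tuned generator and a bias-rating model, and the sign convention (negative means left-leaning) is taken from the dataset description.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def generate_and_validate(
    seeds: Iterable[str],
    generate: Callable[[str], str],
    rate_bias: Callable[[str], float],
) -> Dict[str, List[Tuple[str, float]]]:
    """Generate x' from each seed x, then bucket x' by its bias rating."""
    buckets: Dict[str, List[Tuple[str, float]]] = {"left": [], "right": []}
    for x in seeds:
        x_prime = generate(x)       # generation step
        score = rate_bias(x_prime)  # validation step
        # Assumed convention: negative scores are left-leaning, others right.
        buckets["left" if score < 0 else "right"].append((x_prime, score))
    return buckets


# Toy stand-ins so the sketch runs without any model (hypothetical).
def toy_generate(seed: str) -> str:
    return seed + " (rewritten)"


def toy_rate_bias(text: str) -> float:
    return -5.0 if text.startswith("alpha") else 5.0


result = generate_and_validate(["alpha story", "beta story"], toy_generate, toy_rate_bias)
```

Passing the generator and the rater in as callables keeps the sketch independent of any particular model or API.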

GPT-2 Model
The task of a language model is to learn the probability distribution of a text corpus to enable the next word to be predicted on the basis of contextual words. Given a sequence of words, $w = (w_1, w_2, \ldots, w_T)$, the probability of the sequence is given as:
$$P(w) = \prod_{t=1}^{T} P(w_t \mid w_{t-k}, \ldots, w_{t-1}) \qquad (1)$$
Probability P(w) is calculated by learning the conditional probability of each word given a fixed number k of context words. Many neural network architectures have been used to estimate P(w), including feed-forward neural networks (Bengio et al., 2003), recurrent neural networks (Mikolov et al., 2010; Sundermeyer et al., 2012), and transformer architectures (Radford et al., 2018). A GPT-2 model based on a transformer architecture has been shown to have a lower perplexity for language modeling datasets and to generate high-quality fluent texts. Therefore, we used a GPT-2 model and fine-tuned it on a dataset of left- and right-biased news.
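To make the factorization concrete, the following sketch sums log conditional probabilities over a sequence using at most k preceding words as context. The callable `cond_prob` and the uniform toy model are hypothetical stand-ins for a trained LM, and the vocabulary size of 1000 is an arbitrary assumption.

```python
import math
from typing import Callable, Sequence, Tuple


def sequence_log_prob(
    words: Sequence[str],
    cond_prob: Callable[[Tuple[str, ...], str], float],
    k: int,
) -> float:
    """Return log P(w) as a sum of log P(w_t | previous k words)."""
    total = 0.0
    for t, word in enumerate(words):
        context = tuple(words[max(0, t - k):t])  # at most k preceding words
        total += math.log(cond_prob(context, word))
    return total


def uniform_model(context: Tuple[str, ...], word: str) -> float:
    """Toy model: every word is equally likely in a 1000-word vocabulary."""
    return 1.0 / 1000


log_p = sequence_log_prob(["the", "senate", "met"], uniform_model, k=2)
```

Working in log space avoids numerical underflow when multiplying many small probabilities.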
For fine-tuning, the model was first initialized using pre-trained weights instead of random initial weights. Fine-tuning is faster than training an LM with a large number of parameters from scratch. It has been shown that fine-tuning using labeled data after initializing the model with pre-trained parameters improves the accuracy of downstream tasks (Devlin et al., 2018). Therefore, we fine-tuned the GPT-2 LM using left-biased and right-biased news.
Using techniques from Zhang et al. (2015), we divided the news articles from each set into training and test sets. We used a reliable implementation of the GPT-2 model available on GitHub to fine-tune the pre-trained model on the "All the news" dataset. We used the default values for all hyperparameters. The numbers of training samples for left- and right-biased media were unbalanced, but since we trained a separate model for each, we had enough data for fine-tuning two good models. We fine-tuned two 117M GPT-2 models, one for each type of bias. We used 85,664 and 44,004 news articles, respectively, to train the two models and 5,000 each to test them for perplexity. The perplexity on the test set for the two trained models was 17.43 and 18.30, respectively, which is quite good (we look for a value below 20).
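For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below computes it from a list of per-token natural-log probabilities; how those log probabilities are obtained from a model is outside its scope, and the function name is illustrative.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from natural-log probabilities of each token in a test set."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_neg_log_likelihood = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_likelihood)


# A model that assigns probability 0.25 to every token has perplexity 4.
toy_ppl = perplexity([math.log(0.25)] * 8)
```

Intuitively, a perplexity below 20 means the model is, on average, about as uncertain as if it were choosing uniformly among fewer than 20 words at each step.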
Finally, we generated 5,000 samples for each bias type. We loaded the corresponding model and used prompts from the original articles to generate biased ones. Table 3 shows a sample for each type of bias. The generated articles are fluent and meaningful. However, the generated news is "fake" in that it reports factually incorrect information. For example, in the first sample, the original news has entities like {U.S. farm industry, White House}, while the generated one completely changed them to {U.S. Economy, San Jose}.

Right Biased
Seed Text: President Donald Trump said he would seek to keep his tough immigration enforcement policies from harming the U. S. farm industry and its largely immigrant workforce, according to farmers and officials who met with him. At a roundtable on farm labor at the White House last month, Trump said he did not want to create labor problems for farmers and would look into improving a program that brings in temporary agricultural workers on legal visas.
Generated Text: President Donald Trump said he would seek to keep his tough immigration enforcement policies from harming the U. S. economy, and that the 25 countries with "green cards" had jobs to farm industry and its largely immigrant workforce according to farmers and officials who met with him last week in San Jose.
Table 3: Example biased news generated using fine-tuned GPT-2 LM. For the sake of brevity, only the first three sentences of original and generated articles are presented (Grusky et al., 2018). Generated text is shown in italics.
Table 4: Example biased news generated using fine-tuned GROVER LM. For the sake of brevity, only the first three sentences of original and generated articles are presented (Grusky et al., 2018). Generated text is shown in italics.

GROVER model
The news articles generated by the GPT-2 model contain unstructured text, beginning with a start token and ending with an end token. The end token is particularly important as it indicates when to stop generating. However, in addition to the unstructured running text, i.e., the body text, a news article has additional elements, including the publication domain, the publication date, the authors, and the headline. Generating a realistic and controlled news article requires producing all of these components. Therefore, a news article can be modeled as a joint distribution:
$$P(\text{domain}, \text{date}, \text{authors}, \text{headline}, \text{body}) \qquad (2)$$
Zellers et al. (2019) used the language modeling framework from equation 1 in a way that enables flexible decomposition of equation 2. GROVER starts with a set of fields F as context, with each field containing specific start and end tokens. To generate a target field τ, we append the field-specific <start-τ> token to the given context tokens and sample from the model until the <end-τ> token is reached. For biased news generation, we fix the body of the article as the target field τ and use the other fields (domain, date, authors, headline) as context. We loaded pre-trained model weights and fine-tuned the GROVER LM to generate biased news.
We used the same training-test distribution as for the GPT-2 model. We defined context F as the set {headline, date, author[s], domain} and target τ as the body of the article to be generated using F as context. Note that GROVER does not need seed phrases for generation. Instead, it uses the headline, date, author[s], and domain for generating the body. Table 4 shows a sample for each type of bias. The generated articles are fluent and appear consistent as they are presented with a domain, date, headline, and author names. Figure 3 shows the bias distributions for all 5,000 generated articles, reflecting the bias of each source. As can be clearly seen, the distributions are shifted towards the extremes for both the left- and right-biased samples, shown by the bumps being closer to the left extreme (-20) or the right extreme (+20).
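The field-based conditioning can be illustrated by assembling a context string from the metadata fields and ending it with the start token of the target field. The token spellings (`<start-domain>`, `<end-domain>`, and so on), the field order, and the helper name `build_context` are illustrative assumptions; they are not claimed to match the tokens used by the actual GROVER implementation.

```python
from typing import Dict

# Assumed field order for the context F (illustrative only).
FIELD_ORDER = ("domain", "date", "authors", "headline")


def build_context(fields: Dict[str, str], target: str = "body") -> str:
    """Wrap each available field in start/end tokens, then open the target."""
    parts = []
    for name in FIELD_ORDER:
        if name in fields:
            parts.append(f"<start-{name}> {fields[name]} <end-{name}>")
    # Generation would continue from here until <end-target> is sampled.
    parts.append(f"<start-{target}>")
    return " ".join(parts)


context = build_context({"domain": "example.com", "headline": "Senate meets"})
```

A model conditioned on such a string would then sample tokens for the body until it emits the matching end token.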

Subjective Evaluation
To subjectively evaluate our proposed method, we asked a pool of native and non-native English speakers (annotators) to evaluate the generated biased news articles on the basis of fluency and the bias of the text. We explicitly instructed them to ignore factuality because we wanted to evaluate and validate the quality and bias of the generated articles, not their correctness.
(a) Bias Distribution in Human Written News (b) Bias Distribution in Machine Generated News
Figure 3: Difference in bias ratings between human-written and machine generated news (using human-written news as seed for each generation). The machine-generated news is more extreme (biased) due to being generated by fine tuned models.
For evaluating quality, we considered two categories of articles: human-written ones from news sources, and machine-generated ones produced by the GPT-2 or GROVER models. The participants were asked to identify whether an excerpt was taken from a human-written or a machine-generated article. They were shown two options to choose from, one from each class, human-written and machine-generated. Each annotator was shown ten pairs of excerpts (one human-written and one machine-generated) and asked to identify which was the human-written one. The average selection rate was used as the metric. Further, to facilitate the evaluation, the excerpts were shortened to only three or four sentences. The evaluations were performed on a web interface with the two types of excerpts chosen randomly from two pools of samples. Of the 80 participants, 32 were native speakers and the remaining 48 were non-native speakers. As shown in Table 5, the non-native English speakers tended to mark the machine-generated excerpts as human-written ones. Since the outputs from the GPT-2 and GROVER models were very similar, the ratio of participants who failed to identify the human-written news correctly was about the same for the GPT-2- and GROVER-generated samples. The lowest ratio (43%) was for native speakers and the GROVER samples, and the highest ratio (50%) was for non-native speakers and the GPT-2 samples. Most of the values are close to 50%, which indicates that the participants tended to make a random selection between the two categories of articles.
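The per-group ratios discussed above can be computed with a simple tally. In this sketch each trial is recorded as a (speaker group, model, fooled) triple, where `fooled` is true when the participant picked the machine-generated excerpt as the human-written one; the record layout and the toy trials are assumptions for illustration, not the study's data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def misidentification_rates(
    trials: Iterable[Tuple[str, str, bool]]
) -> Dict[Tuple[str, str], float]:
    """Fraction of trials, per (group, model), where the participant was fooled."""
    counts = defaultdict(lambda: [0, 0])  # key -> [fooled, total]
    for group, model, fooled in trials:
        counts[(group, model)][0] += int(fooled)
        counts[(group, model)][1] += 1
    return {key: wrong / total for key, (wrong, total) in counts.items()}


# Made-up toy trials, only to show the shape of the computation.
toy_trials = [
    ("native", "GROVER", True),
    ("native", "GROVER", False),
    ("non-native", "GPT-2", True),
    ("non-native", "GPT-2", True),
]
rates = misidentification_rates(toy_trials)
```

A rate near 0.5 in a two-way forced choice is what random guessing would produce.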

[Table 5: ratio of participants who mistook machine-generated excerpts for human-written ones, by model and by native/non-native speaker group.]
For evaluating bias, we selected 24 of the 80 participants, each of whom had at least a college degree or was enrolled in college at the time of annotation. We trained them to understand media bias using various resources. Since the training was not rigorous, we made the problem simpler by treating bias as a binary variable having two values, i.e., left and right. For cases in which the participant was not sure, we asked them to mark the question with "can't say". Each participant was shown ten excerpts at random from the generated text and was asked to mark a bias rating. As in the quality evaluation, only three or four sentences were shown for the sake of simplicity.
The participants were able to identify a clear bias 92% of the time. They marked the option "can't say" only 8% of the time. To determine the percentage of times the participants were able to identify the bias correctly, we needed to define "correctly", which is subjective. We judged that a bias rating was correct if the participant's choice (left or right) matched that of the automatic bias evaluation. We used the API built on a RoBERTa-based model to automatically generate bias ratings for the sample excerpts shown to the participants. We found that the participants were able to identify the bias correctly 63% of the time. The percentage might have been higher with more training and a better understanding of bias.
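The two percentages can be computed as in the sketch below. It assumes each response is a pair of the participant's choice ("left", "right", or "can't say") and the automatic rating for the same excerpt, that a negative rating means left, and that both rates use all responses as the denominator; the text does not state the denominator for the second rate, so that choice is an assumption. The toy responses are made up.

```python
from typing import Iterable, Tuple


def bias_evaluation_rates(responses: Iterable[Tuple[str, float]]) -> Tuple[float, float]:
    """Return (share with a clear choice, share matching the automatic label)."""
    responses = list(responses)
    if not responses:
        raise ValueError("no responses")
    decided = 0
    correct = 0
    for choice, score in responses:
        if choice == "can't say":
            continue
        decided += 1
        automatic = "left" if score < 0 else "right"
        if choice == automatic:
            correct += 1
    # Assumption: both rates are taken over all responses.
    return decided / len(responses), correct / len(responses)


toy_responses = [("left", -10.0), ("right", -3.0), ("can't say", 8.0), ("right", 12.0)]
identified_rate, correct_rate = bias_evaluation_rates(toy_responses)
```

Note that "correct" here means agreement with the automatic rating, which is itself a model output rather than an independent ground truth.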

Discussion
Our use of the API made available to us by "The Bipartisan Press" to evaluate bias is a major limitation of this study. Evaluating text for bias is a very complex problem. The API was built on a RoBERTa-based model trained on a dataset curated by Adfontes Media. The dataset was annotated by 20 expert annotators with at least a college degree after an extensive screening and training process. Hiring and training such annotators is expensive, and relying on non-expert annotators to calculate media bias in generated news is not promising. Since our findings conform to the results reported in the relevant literature on media bias, it is reasonable to assume that the results obtained using the RoBERTa-based model (with a 4% error rate) are reliable in terms of segregating left-biased media from right-biased media.

Conclusion and Future Work
We have presented a threat model and discussed how news aggregators (attackers) can manipulate readers' opinions by flipping or increasing their bias. We described two language models for generating biased news: the high-performance GPT-2 LM and the GROVER LM for controllable text generation. We used a large news article dataset to fine-tune them. We used a RoBERTa-based regression model to create automatic bias ratings and to evaluate bias in generated news. Subjective evaluation of generated news articles by 80 participants suggests that they made random selections between the machine-generated and human-written news excerpts, indicating that the machine-generated news is fluent and looks similar to human-written news. Out of the 80 participants, 24 were chosen for a bias evaluation. The participants were able to see a clear bias most of the time, and marked the correct bias 63% of the time.
For future work, techniques for more granular control over text generation can be explored, where one can adversarially inject bias to generate twisted versions of news stories. Techniques to introduce bias during machine translation of a news article from one language to another can also be explored, and evaluated by comparing the machine-translated news with news produced by non-native speakers when translating news from other languages. Apart from named entities and sentence length, there are more intrinsic patterns representing the presence of bias in text. Exploratory studies to find such patterns can be done in the future to better understand the distribution of bias in text. Another future direction is to quantify the impact of delivering biased news to real-world users through a social media platform.

A.1 Granularity Analysis
Sometimes biased text segments can be identified just by looking at the title (i.e., only one sentence); as we read further, the bias may or may not increase. Intuitively, as we increase the length of the text tested for the presence of bias, the measured bias should also increase.
We took an equal number of samples, i.e., 5,000, from both sides of bias. To test this hypothesis, we divided the news into four parts: sentence-1, which is just the title; sentence-3, the first three sentences of the news article (Lede-3; Grusky et al., 2018); sentence-10, the first 10 sentences of the news article; and full-length, which represents the complete news article. Figure 4 shows that bias ratings increase as we increase the length of the news being tested for bias.
Figure 4: Granularity Analysis. The bias ratings increase as we increase the length of text to test for bias infestation.
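The four granularity levels can be produced with a short helper. The regular-expression sentence splitter used here is a naive assumption (it breaks after ".", "!" or "?" followed by whitespace and will mis-split abbreviations such as "U. S."); a proper sentence tokenizer would be preferable in practice, and the function name is illustrative.

```python
import re
from typing import Dict


def granularity_views(title: str, body: str) -> Dict[str, str]:
    """Return the four text spans used to test bias at increasing lengths."""
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", body.strip())
    return {
        "sentence-1": title,
        "sentence-3": " ".join(sentences[:3]),
        "sentence-10": " ".join(sentences[:10]),
        "full-length": body,
    }


views = granularity_views("A title", "A one. B two. C three. D four.")
```

Each span can then be passed to the bias-rating model to compare ratings across lengths.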

A.2 Automatic Detection
We evaluated three automatic detection models, GLTR (Gehrmann et al., 2019), GROVER (Zellers et al., 2019), and GPT-2 PD (Solaiman et al., 2019) using 80 samples (news excerpts) each from human written, GPT-2 generated, and GROVER generated news. GLTR gives different probabilities of words being in top10, top100, and so on, and the other models give a probability score. We have used regression models as fusion functions while predicting with combined models. Table 6 shows detection results.

Detector, GPT-2 Generated, GROVER Generated, Overall
Table 6: Equal error rate in differentiating between human written and machine generated news. We have used three approaches independently as well as a combination of them. "+" indicates score fusion.
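For readers unfamiliar with the metric, the equal error rate (EER) is the error level at which the false-acceptance and false-rejection rates coincide. The sketch below approximates it by sweeping thresholds over the observed detector scores; it assumes higher scores mean "more likely machine-generated", and the function name and toy scores are illustrative rather than taken from the experiments.

```python
from typing import Sequence


def equal_error_rate(human_scores: Sequence[float], machine_scores: Sequence[float]) -> float:
    """Approximate EER by sweeping thresholds over the observed scores."""
    thresholds = sorted(set(human_scores) | set(machine_scores)) + [float("inf")]
    best_gap, best_eer = None, None
    for thr in thresholds:
        # Human text wrongly flagged as machine-generated.
        far = sum(s >= thr for s in human_scores) / len(human_scores)
        # Machine text that slips through as human-written.
        frr = sum(s < thr for s in machine_scores) / len(machine_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Toy scores: perfectly separated classes, then overlapping classes.
eer_separated = equal_error_rate([0.1, 0.2], [0.8, 0.9])
eer_overlap = equal_error_rate([0.1, 0.6], [0.4, 0.9])
```

With few samples the two error rates rarely cross exactly, so the sketch reports their average at the threshold where they are closest.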