RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models

Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment. We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration. We create and release RealToxicityPrompts, a dataset of 100K naturally occurring, sentence-level prompts derived from a large corpus of English web text, paired with toxicity scores from a widely-used toxicity classifier. Using RealToxicityPrompts, we find that pretrained LMs can degenerate into toxic text even from seemingly innocuous prompts. We empirically assess several controllable generation methods, and find that while data- or compute-intensive methods (e.g., adaptive pretraining on non-toxic data) are more effective at steering away from toxicity than simpler solutions (e.g., banning “bad” words), no current method is failsafe against neural toxic degeneration. To pinpoint the potential cause of such persistent toxic degeneration, we analyze two web text corpora used to pretrain several LMs (including GPT-2; Radford et. al, 2019), and find a significant amount of offensive, factually unreliable, and otherwise toxic content. Our work provides a test bed for evaluating toxic generations by LMs and stresses the need for better data selection processes for pretraining.


Introduction
Although they are the backbone of many modern NLP systems (Devlin et al., 2019;Radford et al., 2019;Raffel et al., 2019), language models (LMs) pretrained on large web text corpora suffer from degenerate and biased behavior (Sheng et al., 2019;Wallace et al., 2019). As illustrated in Figure 1, they can easily degenerate into toxicity, even without explicitly toxic prompts, which hinders their Despite not containing any toxic language as measured by PERSPECTIVE API, these prompts cause several pretrained LMs to systematically generate highly toxic text (shown in Table 17 in Appendix §E). safe deployment (McGuffie and Newhouse, 2020).
We first introduce a framework to systematically measure the risk of toxic degeneration by pretrained LMs. We release REALTOXICI-TYPROMPTS ( §4), a set of 100K naturally occurring prompts (i.e., sentence prefixes; Figure  1) extracted from a large corpus of English web text and paired with toxicity scores from a widely used and commercially deployed toxicity detector (PERSPECTIVE API). We show that popular LMs produce toxic generations when conditioned on our prompts, even those that are non-toxic ( §4.2).
Then, as a possible mitigation strategy, we evaluate controllable generation methods and quantify their ability to steer away from toxic content using REALTOXICITYPROMPTS ( §5). We find that certain controllable methods (e.g., toxicity control tokens, swearword filters) are less successful than more computationally or data-intensive methods (e.g., finetuning on non-toxic corpora). However, we show that even our best steering methods can still generate highly toxic content.
Finally, to further investigate the potential cause of these phenomena, we present the first largescale analysis of toxicity in GPT-2's training corpus, OpenAI WebText, (OPENAI-WT; Radford et al., 2019), as well as an in-depth analysis of its open-source replica, OPENWEBTEXT CORPUS (OWTC; Gokaslan and Cohen, 2019, §6). We find non-negligible amounts of toxic, harmful, and abusive text in these corpora, which were used in pretraining of several language models (including RoBERTa, CTRL, and GPT-2; Keskar et al., 2019, §6.1). We identify additional issues with the data and its provenance, including large numbers of news articles shared on banned Internet communities or from factually unreliable sources ( §6.2).
Our findings highlight the difficulty of avoiding toxicity in natural language generation (NLG) and illustrate a need to actively reconsider the content used in LM pretraining. We release our code and data for tracking the progress towards combating the critical issue of neural toxic degeneration. 1,2

Operationalizing Toxicity
Characterizing the toxicity of large corpora of naturally occurring or machine generated text is crucial to understanding toxic degeneration by language models. Unfortunately, such large scale prevents human annotations of toxicity (e.g., we score at least 80 GB of text in §6). Therefore, we rely on PERSPECTIVE API 3 , an automated tool for toxic language and hate speech detection. We acknowledge, however, that such tools are imperfect and subject to a variety of biases, as discussed in §2.2 and §7.

PERSPECTIVE API TOXICITY
We use the TOXICITY 4 score from PERSPECTIVE API, a widely used, commercially deployed toxic-1 Due to their prevalence, we focus our study only on neural language models, and therefore use the term "neural toxic degeneration." Future work could examine whether non-neural language models exhibit similar behavior.
2 http://toxicdegeneration.allenai.org/ 3 https://github.com/conversationai/ perspectiveapi 4 PERSPECTIVE API defines TOXICITY as a "rude, disrespectful, or unreasonable comment; likely to make people leave a discussion." ity detection tool. Accessed through an API, TOX-ICITY corresponds to the prediction output of a CNN (Lecun et al., 1998) trained on a proprietary corpus of comments from Wikipedia , New York Times, and other news sites with an AUC of 0.97. Since the model is calibrated using isotonic regression (Zadrozny and Elkan, 2002), 5 we can meaningfully interpret the score as a probability of toxicity. In our analyses, we label a prompt as toxic if it has TOXICITY 0.5, and non-toxic otherwise. 6

Biases in Toxic Language Detection
Although widely used, the PERSPECTIVE API and other hate speech detection systems and corpora exhibit biases against minorities and suffer from low agreement in annotations (Waseem, 2016;Ross et al., 2017), partially due to annotator identity influencing their perception of hate speech (Cowan and Khatchadourian, 2003) and differences in annotation task setup (Sap et al., 2019). Notably, recent work has found that systems are overestimating the prevalence of toxicity in text that contains a minority identity mention (e.g., "I'm a gay man"; Hutchinson et al., 2020) or text by racial minorities (e.g., text in African American English; Sap et al., 2019;Davidson et al., 2019). This is partially due to detectors' over-reliance on lexical cues of toxicity (including swearwords, slurs, and other "bad" words . We further discuss and examine the effect of these biases in the Appendix, by assessing that the racial bias in toxicity is invariant with respect to model choice (Appendix §C.1) and analyzing the presence of profanity and swearwords separately from toxicity (Appendix §C.2).

Out-of-the-Box Generation Toxicity
We focus our investigation of toxic degeneration in five popular autoregressive Transformer-based (Vaswani et al., 2017) language models: GPT-1, 5 https://github.com/conversationai/ perspectiveapi/blob/master/3-concepts/ score-normalization.md 6 To assess PERSPECTIVE API on human-generated text, the first three authors performed manual judgments of toxicity of a sample of 100 documents from OWTC, and found an 88% pairwise agreement (Pearson ⇢=0.83) with TOXICITY scores. To assess the API on machine-generated text, among 100 generations from GPT-2, our judgments had 80% pairwise agreement and Pearson ⇢=0.65 with TOXICITY. For further model information, we refer the reader to the model card for TOX- ICITY: https://github.com/conversationai/ perspectiveapi/blob/master/2-api/modelcards/English/toxicity.md GPT-2, GPT-3, CTRL, and CTRL-WIKI. GPT-1 (Radford et al., 2018) is a 117M-parameter model pretrained on a large corpus of English books (Zhu et al., 2015). GPT-2 (specifically, GPT-2-small; Radford et al., 2019), is a similarly sized model pretrained on OPENAI-WT, which contains 40GB of English web text and is described in §6. 7 GPT-3 (Brown et al., 2020) is pretrained on a mix of Common Crawl, an expanded version of OPENAI-WT, books corpora, and Wikipedia. 8 In all experiments, we use the 175B parameter GPT-3 model, also known as DA VINCI in the OpenAI API.
CTRL (Keskar et al., 2019) is a 1.63B parameter model that uses domain-specific control tokens for conditional language modelling. We analyze generations in two domains: web text (CTRL, Links control token), and English Wikipedia (CTRL-WIKI, Wiki control token).
Generating from Models Unless otherwise noted, we use nucleus sampling (Holtzman et al., 2020) with p = 0.9 to generate up to 20 tokens (see Appendix §B.4 for additional details). All experiments are carried out with the Hugging Face Transformers library (Wolf et al., 2019).

Unprompted Toxicity in Neural Models
To quantify the risk associated with using pretrained language models for generation, we first measure their propensity to generate toxic output conditioned only on their respective start-ofsentence tokens. 9 For each model, we first generate a pool of 10K spans, and then perform bootstrap estimation of the expected maximum toxicity for n  10K generations, by sampling (with replacement) n generations from the pool 1K times each.
Our results ( Figure 2) show that all five language models can degenerate into toxicity of over 0.5 within 100 generations, and most only require 1K generations to exceed a maximum toxicity of 0.9 (see Table 15 and 16 in Appendix §E for examples). We find similar patterns of expected maximum toxicity for GPT-2 and CTRL, which have significantly more overlap in pretraining data than with GPT-1. Though trained on a much larger corpus, GPT-3's unprompted toxicity also mirrors Figure 2: Neural models generate toxicity, even with no prompting. Here we display bootstrap estimates of the expected maximum toxicity for N generations, with variance bounds as shades. For example, we observe that GPT-2 generates an expected maximum toxicity of 0.65 with just 100 unprompted generations. that of GPT-2, which may be due to the fact that GPT-3's training data was designed to be similar to GPT-2's training data (Brown et al., 2020). On the other hand, GPT-1 generates higher levels of expected toxicity with fewer generations. This may be explained by the correspondingly high levels of toxicity in GPT-1's pretraining corpus (see Appendix §D.3 for details). We also observe that CTRL-WIKI has a significantly lower expected maximum toxicity than the other models. These results suggest that models acquire toxicity from their pretraining data, which we analyze further in §6.

REALTOXICITYPROMPTS
To systematically evaluate and compare the generations from language models, we create REAL-TOXICITYPROMPTS as a testbed for toxicity in conditional language generation that mirrors real world applications (e.g., autocomplete systems; Chen et al., 2019; King, 2019). With this dataset, we quantify the effect of prompt toxicity on the toxicity of generation from our five language models.

Prompt Creation and Selection
We select our prompts from sentences in the OPEN-WEBTEXT CORPUS (Gokaslan and   outbound URLs from Reddit, for which we extract TOXICITY scores with PERSPECTIVE API. To obtain a stratified range of prompt toxicity, 10 we sample 25K sentences from four equal-width toxicity ranges ([0,.25), ..., [.75,1]), for a total of 100K sentences. We then split sentences in half, yielding a prompt and a continuation, both of which we also score for toxicity. We include further preprocessing details in Appendix §A. Our final dataset includes 100K naturally occurring prompts, which average 11.7 ± 4.2 tokens in length (Table 1). REALTOXICITYPROMPTS contains 22K prompts with TOXICITY 0.5 (i.e., toxic prompts). We find that prompt and continuation toxicity are slightly anti-correlated (r = -0.08, p  0.001), indicating that, in our documents, toxicity as measured by PERSPECTIVE API is usually confined to one half of the sentence.

Prompted Toxicity in Neural Models
Using REALTOXICITYPROMPTS and the same generation procedures outlined in §3, we measure toxic degeneration in out-of-the-box neural language models. We characterize toxicity in prompted generations with two metrics: 1) the expected maxi-mum toxicity over k = 25 generations, which we estimate with a mean and standard deviation; and 2) the empirical probability of generating a span with TOXICITY 0.5 at least once over k = 25 generations. These metrics characterize toxic generations along two axes: the higher the expected maximum toxicity, the more toxic we expect the worst-case generations to be, and the higher the toxicity probability, the more frequently the model generates toxicity.
Our results show that while toxic prompts unsurprisingly yield higher toxicity in generations, nontoxic prompts still can still cause toxic generations at non-trivial rates (Table 2). Specifically, all five models have a toxicity probability near or above 0.5 for non-toxic prompts. This shows that even in innocuous contexts these models can still generate toxic content (as illustrated in Table 17 and 18 in Appendix §E), suggesting the need for models to "unlearn" toxicity. Surprisingly, even CTRL-WIKI has similar generation toxicity to other models in prompted settings, even though it was trained on just Wikipedia. These results suggest that like the provenance of pretraining data ( §3.1), prompt context can heavily influence generation toxicity, and that steering generations after pretraining is crucial to prevent toxic behavior in language models. In the following section, we explore the effectiveness of a variety of such methods to avoid toxicity.

Detoxifying Generations
We investigate the effectiveness of recent controllable generation methods at steering away from toxicity using REALTOXICITYPROMPTS. Specifically, we focus on GPT-2 as a base model for two detoxification techniques: data-based, where we pretrain the language model further, and decoding-based where we only change the generation strategy without changing model parameters. 11 As described in §4.2, we sample 25 generations per prompt for each model. We describe hyperparameters and training details for all methods in Appendix §B.

Data-Based Detoxification
We consider two types of data-based detoxification in which we continue pretraining on approximately 150K documents from OWTC. 12  Table 3: Left: Average maximum toxicity (with standard deviations as subscripts) over 25 generations. Right: The empirical probability of generating toxic text at least once over 25 generations. The best performing detoxification method yielding the lowest toxicity per-category, is bolded. We display DAPT (Toxic) as a reference for the effectiveness of DAPT as a method of controlling LM behavior. All models are evaluated on a full dataset of 100K prompts, except PPLM, which is evaluated on a dataset of 10K prompts, due to computational budget.
Domain-Adaptive Pretraining (DAPT) Using the framework outlined in Gururangan et al. (2020), we perform an additional phase of pretraining on the non-toxic subset of a balanced corpus with GPT-2. For comparison, we also perform the experiment using the toxic subset.

Decoding-Based Detoxification
Noting the additional cost of training language models further, we explore three detoxifying strategies that only rely on altering the decoding algorithm and are therefore more readily usable by many practitioners.
Vocabulary Shifting (VOCAB-SHIFT) Inspired by Eisenstein et al. (2011) and Ghosh et al. (2017), we learn a 2-dimensional representation of toxicity and non-toxicity for every token in GPT-2's vocabulary, which we then use to boost the likelihood of non-toxic tokens. Given the language model's unnormalized probability (logits) over the vocabulary, we add the term W · t, where t 2 R 2 encodes (non-)toxicity, and W 2 R V represents the associations between each token and (non-)toxicity, and is the boosting strength. We set = 3 for all experiments. We learn this representation using the toxicity labels on the balanced corpus described in §5.1 (See Appendix §B.3 for more details).
Word Filtering (WORD FILTER) We also implement a language model blocklist, disallowing a set of words from being generated by GPT-2. We set the probability of generating any word from a list 13 of profanity, slurs, and swearwords to zero.
PPLM We use the recently released PPLM (Dathathri et al., 2020). This decoding method operates on GPT-2 by altering the past and present hidden representations to better reflect the desired attributes, using gradients from a discriminator (see Dathathri et al., 2020, for further details). In our experiments, we steer generations using the toxicity classifier released by the authors and the Hugging Face implementation. For PPLM, we only sample 10 generations per prompt, and evaluate with 10K prompts total, due to this decoding strategy being extremely computationally intensive (14 sec/generation, vs. 0.2 sec for GPT-2).

Effect of Controllable Solutions on Generation Toxicity
We investigate the effectiveness of our detoxification methods under REALTOXICITYPROMPTS, following the same generation procedures and experimental setups outlined in §4. Listed in Table 3, our results show that steering does not completely solve neural toxic degeneration, though all proposed techniques do reduce toxic behavior in GPT-2. Of all methods, DAPT (Non-Toxic), vocabulary shifting, and PPLM yield the lowest toxicity in generation. Despite its simplicity, DAPT (Non-Toxic) is one of the most effective methods for steering away from 13 List of Dirty, Naughty, Obscene, and Otherwise Bad Words, downloaded from https://github.com/ LDNOOBW/List-of-Dirty-Naughty-Obsceneand-Otherwise-Bad-Words. toxicity, highlighting the importance of pretraining data in neural toxic degeneration.
Prompts That Challenge All Models We find that certain prompts consistently cause all models to generate toxicity (e.g., the four prompts in Figure  1). Specifically, there are 327 prompts that yielded at least one generation with 0.9 TOXICITY from all models, and 1,225 prompts when considering only the out-of-the-box language models (i.e., GPT-1, GPT-2, GPT-3, CTRL, CTRL-WIKI). 14 From qualitative investigations, these prompts tended to either be toxic themselves, or if innocuous, they contain opening quotes or prefixes of multiword expressions such as "full of-" (Figure 1). Additionally, we find that at least 10% of those 1.2K come from factually unreliable news sources or appear in banned or quarantined subreddits.

Analyzing Toxicity in Web Text
To further investigate the phenomenon of neural toxic degeneration, and partially motivated by the surprising effectiveness of domain-adaptive pretraining on non-toxic data, we turn our focus to two corpora used to pretrain several language models. Specifically, we quantify the toxicity in OPENAI-WT (GPT-2's training data; Radford et al., 2019) and its open-source replica OWTC (Gokaslan and Cohen, 2019), inspired by previous work in analyzing social biases in large text corpora (Fast et al., 2016). Then, we investigate the provenance of the data in these corpora, quantifying how many documents come from factually unreliable news sites or were shared on quarantined or banned subreddits.
OWTC is a large corpus of English web text scraped from outbound URLs in submissions on Reddit communities (subreddits). In the creation of OWTC, only links included in posts with a "karma" (i.e., popularity) score of 3 or more were considered. Following the links, only English documents longer than 128 tokens are included in this corpus, amounting to 38 GB of text from about 8M documents. To allow for further analyses, we parse the URLs given with OWTC documents to extract the domain (often a news website, Figure 5 in Appendix §D; Sharoff, 2020), which we crossreference with news factuality ratings by Baly et al. (2018). We additionally cross-reference publicly Figure 3: TOXICITY scores of documents in OWTC (top) and OPENAI-WT (bottom). y-axis is in log-scale, and color gradient follows magnitude in x-axis. We consider a document toxic if its TOXICITY is 0.5. We additionally display the estimated total % of toxic documents in each corpus above each subplot.
available Reddit dumps 15 to identify which subreddits the URLs were submitted to. We include further details on OWTC and metadata linking in Appendix §D.
OPENAI-WT is the pretraining corpus for GPT-2 (Radford et al., 2019), also containing about 8M documents. Following OWTC, authors gathered URLs from Reddit, though from a different (but overlapping) timespan. Additionally, authors filtered content using a blocklist of sexually-explicit and otherwise offensive subreddits. 16 This corpus does not come paired with URL metadata.
Overlap We find about 29% overlap between the two corpora, using a large-scale similarity search with locality-sensitive hashing (Rajaraman and Ullman, 2011, see Appendix D for details). We find that at least 2.3M documents in OPENAI-WT also appear in OWTC. Figure 3, we find that both corpora contain non-negligible amounts of toxicity, with 2.1% of OWTC having TOXICITY 0.5, and 4.3% of OPENAI-WT. These rates are in line with Founta et al. (2018), who find that the prevalence of abusive or toxic content online roughly ranges between 0.1% and 3%, and suggest that these corpora merely reflect the "natural" rates of toxicity. We note that, despite Radford et al. (2019) employing a blocklist of subreddits and "bad" words, the toxicity in OPENAI-WT is twice the amount in OWTC. We show similar rates of toxicity using alternative PERSPECTIVE API labels on these corpora in Table 12 in Appendix §D.

Sources of Toxic Content in Web Text
Since Reddit is known to have hosted communities that endorse hateful norms and conspiracy theories (Romano, 2017), we investigate the provenance of data in our web text corpora. Specifically, we quantify the variation of a document's toxicity with respect to the reliability of its host news site and  Table 4: Examples of (purposefully uncensored) toxic documents that appear in GPT-2's training corpus, that were also submitted to quarantined or banned subreddits. We highlight spans that contribute to the overall toxicity of the document, which we identify manually. the nature of the subreddits to which it was posted.
Toxicity from Unreliable News Sites Gathering all documents in OWTC associated with a news site, and cross-referencing reliability ratings from Baly et al. (2018), we find that news reliability correlates negatively with the proportion of documents that are toxic (Spearman ⇢ = -0.35). As shown in Figure 4, while low reliability news sites are less prevalent in OWTC, they contain more toxic documents compared to higher reliability news sites. Additionally, we find that at least 12% (272K) of the overlapping OPENAI-WT and OWTC documents with news reliability ratings come from low or mixed reliability news sites.
Toxicity from Quarantined or Banned Subreddits Our analyses show that a non-trivial portion of OWTC documents (at least 3%, 212K) come from links shared on banned or quarantined subreddits. 17 Unsurprisingly, documents shared on those subreddits contain substantially more toxicity than those from standard subreddits (see Figure 10 in Appendix §D), confirming Reddit users' propensity to share oppressive and abusive content (Massa-nari, 2017;Mohan et al., 2017;Rajadesingan et al., 2020;Aran et al., 2020). From the overlapping OPENAI-WT and OWTC documents, we find that at least 63K documents were shared on banned or quarantined subreddits. With two example documents shown in Table 4, GPT-2 was pretrained on at least 40K documents from the quarantined /r/The Donald, and 4K documents from the banned /r/WhiteRights.

Discussion and Recommendations
Overall, our investigations demonstrate that toxicity is a prevalent issue in both neural language generation and web text corpora. Although they show some reduction in toxicity, steering methods do not fully protect neural models from toxic degeneration ( §5). Additionally, the corpora that language models are pretrained on contain non-negligible amounts of toxic, abusive, and untrustworthy content ( §6). Some implications of our findings are discussed below.
Effectiveness of "Forgetting" Toxicity Our findings on data-based steering methods show that adaptive pretraining lowers a model's propensity to unpromptedly generate toxic language, but that its prompted generations can still be toxic. This raises the question: can language models ever fully "forget" toxic pretraining data through further adaptation (Kirkpatrick et al., 2017;Gururangan et al., 2020)? The non-trivial amounts of toxicity generated by DAPT suggest that perhaps language models may be "memorizing" the toxicity in pretraining data (Carlini et al., 2019) or that toxic examples may be more salient for the model and hence harder to unlearn (Koh and Liang, 2017). Future work could explore whether some variants of toxicity are harder to forget than others, or whether the biases of models used to select training data for steering introduce unwanted side effects in language model behavior after adaptation.
Decoding with a Purpose Our analyses also highlight the promise of certain decoding methods, such as PPLM (Dathathri et al., 2020), which is among the most effective methods we tested at avoiding toxicity with toxic prompts. In addition to automated toxicity classifiers, future work could explore the use of handpicked toxic documents as "negative examples" to avoid toxicity in generation. Future work could also investigate infusing models with more sophisticated or nuanced representations of social biases (Ma et al., 2020).

Choice of Pretraining Data
As pretrained language models grow in size (Brown et al., 2020), so does their need for larger corpora, often drawn from easily accessible and abundant web text. However, our analyses reveal toxicity in web text data that likely enable language models to generate even unprompted toxicity ( §3.1). Our findings raise several practical and ethical concerns.
First, analysis of pretraining data is a crucial first step towards understanding toxic, biased, or otherwise degenerate behavior of language models. Therefore, echoing calls for transparency in NLP research (Bender and Friedman, 2018;Mitchell et al., 2019;Dodge et al., 2019), we recommend researchers publicly release all relevant information during data collection (e.g., original text, source URLs, timestamps, platform-specific metadata) when building pretraining corpora.
Second, using Reddit popularity as a curation heuristic introduces representational harm (Barocas et al., 2017) by biasing the populations whose language and perspectives are included in pretraining (e.g., Reddit users skew male; Barthel et al., 2016). This raises the question of who decides whose voices are going to be learned by the language model, and whose voices are excluded. Following Blodgett et al. (2020), we recommend a reexamination of the relationship between NLP systems and their end users, using methods from humancentered design, such as value-sensitive (Friedman et al., 2008) or participatory design (Sanders, 2002;DiSalvo et al., 2012;Denton et al., 2020), and archival data collection (Jo and Gebru, 2020). Given the potential for misuse and harm, we also echo calls for improving policy around public release of large language models (Zellers et al., 2019;McGuffie and Newhouse, 2020).
In general, the potential mismatch between the intent of curating pretraining data and its operationalization (e.g., karma thresholding, filtering out specific slurs and swearwords) biases the language model's pretraining data and behavior (Jacobs and Wallach, 2019). For example, filtering data based on PERSPECTIVE API could lead to a decrease in text by African American authors in pretraining data due to well-documented racial bias (Sap et al., 2019), which could lead to decreased performance on text written by non-White users. To avoid harm, researchers should be mindful and explicit about these decisions and engage with the end users of the technology during these design phases.
Improving Toxicity Detection With the release of REALTOXICITYPROMPTS, we hope to encourage large-scale, systematic evaluations of detoxification techniques for language models. However, the conclusions one can make about the effectiveness of a detoxification method are limited by the biases of the model used to detect toxicity ( §2.2). To combat these issues, we encourage further work on detecting and controlling different types of toxicity and undesirable social biases in generation, e.g., rudeness (Danescu-Niculescu-Mizil et al., 2013), hate speech (Golbeck et al., 2017), or microaggressions (Breitfeller et al., 2019). Additionally, measures of bias could be multi-dimensional (e.g., Dinan et al., 2020), include explanations (e.g., , or be evolving over time (e.g., using similarity to toxic online content).
Limitations We describe several limitations of our study. First, as noted in §2.2, we use an imperfect measure of toxicity that could bias the toxicity towards lexical cues, failing to detect more subtle biases and incorrectly flagging non-toxic content. Second, our analyses are limited to the five language models considered (and their steered variants). Further work could extend our analyses to toxicity to masked language models (Wang and Cho, 2019), among others. Lastly, because OPENAI-WT does not have available metadata, and due to the imperfect coverage of our subreddit and news reliability data, we only provide lower bound estimates of toxicity in web text corpora.

Related Work
A wealth of work has shown that toxicity and social biases in training data are acquired by large pretrained sentence encoders (e.g., gender bias in BERT; May et al., 2019;Zhao et al., 2019;Basta et al., 2019;Kurita et al., 2019). However, fewer studies have investigated toxicity in autoregressive language models, whose generations also suffer from incoherence, blandness, and repetitiveness (Holtzman et al., 2020;Welleck et al., 2019).
Similar in spirit to REALTOXICITYPROMPTS, Wallace et al. (2019) find universal adversarial triggers, nonsensical prompts that trigger toxic generations in GPT-2. In this work, we find and release naturally occurring prompts from web text that trigger toxicity, and compare toxic output in several language models.
Most closely related to this work, Sheng et al. (2019) use a set of 60 templated prompts that mention majority or minority identities to study the social biases in generations by out-of-the-box pretrained language models. In our work, we study toxic degeneration by both out-of-the-box and controlled models using 100K naturally occurring prompts, including some that do not contain identity mentions (see Figure 1). Additionally, our work focuses on the broad phenomenon of toxicity in generations, whereas Sheng et al. (2019) study the sentiment and regard expressed by a model's generation towards demographic identities.
The creation of REALTOXICITYPROMPTS was partly inspired by work in detecting conversational patterns that can cause derailment into antisocial behavior in online conversations (Zhang et al., 2018;Stoop et al., 2019;Karan andŠnajder, 2019). Our work also draws from a strong line of research into controlling the outputs of language models (Dathathri et al., 2020;Sudhakar et al., 2019;Ziegler et al.;Keskar et al., 2019, inter alia).

Conclusion
We introduce REALTOXICITYPROMPTS, a testbed of 100K prompts for evaluating the toxic degeneration in pretrained language models. Under this framework, we quantify the toxicity of multiple pretrained language models and the effectiveness of methods for detoxifying generations. We then analyze toxicity in two large web text corpora, including the GPT-2 pretraining corpus, to better understand the root cause of toxic generations. Finally, we provide recommendations for gathering pretraining data. The data, code, and interactive visualizations for this paper can be found at https://toxicdegeneration.allenai.org/.