Detoxifying Language Models Risks Marginalizing Minority Voices

Language models (LMs) must be both safe and equitable to be responsibly deployed in practice. With safety in mind, numerous detoxification techniques (e.g., Dathathri et al. 2020; Krause et al. 2020) have been proposed to mitigate toxic LM generations. In this work, we show that these detoxification techniques hurt equity: they decrease the utility of LMs on language used by marginalized groups (e.g., African-American English and minority identity mentions). In particular, we perform automatic and human evaluations of text generation quality when LMs are conditioned on inputs with different dialects and group identifiers. We find that detoxification makes LMs more brittle to distribution shift, especially on language used by marginalized groups. We identify that these failures stem from detoxification methods exploiting spurious correlations in toxicity datasets. Overall, our results highlight the tension between the controllability and distributional robustness of LMs.


Introduction
Recent neural language models (LMs) have shown enormous improvements in text generation abilities. A key factor behind these improvements is large training corpora that are collected from online sources (Radford et al., 2019). Unfortunately, because such corpora are too large to filter granularly (Roller et al., 2020), they inevitably contain so-called toxic examples: undesirable language such as expletives, slurs, or other offensive and threatening speech. When trained on such data, LMs inevitably learn to generate toxic text (Henderson et al., 2018; Wallace et al., 2019).
To address this issue, recent work has turned towards detoxifying LMs: reducing toxic generations without affecting perplexity or generation quality on nontoxic inputs. Existing detoxification strategies involve techniques such as finetuning LMs on nontoxic data (Gehman et al., 2020) or incorporating a toxicity discriminator during decoding (Dathathri et al., 2020). Our evaluation of these techniques shows that they are indeed effective at mitigating toxicity, but at what cost?
We demonstrate that detoxification can hurt LM utility on language used by minority groups. Concretely, we evaluate detoxified LMs on text with minority identity mentions (e.g., words such as "gay" or "Muslim") and surface markers of African-American English (Green, 2002, AAE). We first show that, compared to text containing White-Aligned English (WAE), detoxification causes a disproportionately large increase in LM perplexity on text with AAE and minority identity mentions. Moreover, increasing the strength of detoxification amplifies this bias.
The same trends hold when evaluating the text generation quality of LMs using crowdworkers. When conditioned on WAE text, detoxified LMs can roughly maintain the topic, fluency, and style of an input prompt. However, generation quality deteriorates when models are conditioned on AAE text, i.e., detoxification hurts an LM's ability to understand and complete AAE text.
We identify that these failures are due to the use of biased toxic classification data. In particular, toxicity datasets often contain spurious correlations between the toxic label and the presence of AAE and minority identity mentions (Sap et al., 2019). These correlations cause detoxification techniques to steer generations away from AAE and minority identity mentions because they often consider these aspects of language to be toxic.
We conclude by outlining concrete harms and possible solutions to these biases. With regard to harms, we argue that biased systems force marginalized users to code-switch or hide their identity and that these systems can contribute to social stigmas. For solutions, we discuss improved procedures for data annotation and model training that may help debias detoxification techniques.

Figure 2: Stronger detoxification leads to increased bias against AAE text. We vary a hyperparameter (ω in GeDi) that increases the detoxification strength and report the ratio of AAE perplexity to WAE perplexity. The baseline model (ω = 0) is approximately three times worse on AAE; when strongly detoxified, it performs almost 400 times worse on AAE.

Methods and Experimental Setup
The goal of detoxification is to mitigate the frequency of toxic generations (also called hate speech or offensive language) without affecting an LM's utility or generation quality on nontoxic inputs. We detoxify models using controllable generation techniques that steer outputs away from toxicity. Following past work (Gehman et al., 2020; Xu et al., 2020), we use four techniques that provide state-of-the-art levels of detoxification.
DAPT We consider domain-adaptive pretraining (Gururangan et al., 2020, DAPT): we further finetune the LM on the nontoxic portion of the data (Gehman et al., 2020).
PPLM We consider plug and play language models (Dathathri et al., 2020, PPLM). Here, we first train a toxicity classifier using the hidden states of the LM as features. At generation time, the LM's hidden states are iteratively updated using a gradient from the toxicity classifier.
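To make the hidden-state update concrete, here is a minimal sketch (not the authors' implementation) of one PPLM-style step. It assumes a linear toxicity probe over the hidden state, so the gradient of the probe's score with respect to the hidden state is simply the probe's weight vector:

```python
def toxicity_score(hidden, probe_weights):
    """Linear toxicity probe: higher score = more toxic continuation."""
    return sum(h * w for h, w in zip(hidden, probe_weights))

def pplm_step(hidden, probe_weights, step_size=0.02):
    """One PPLM-style update: nudge the LM hidden state down the
    gradient of the toxicity probe's score. For a linear probe the
    gradient w.r.t. the hidden state is the weight vector itself,
    so descent is a scaled subtraction."""
    return [h - step_size * w for h, w in zip(hidden, probe_weights)]
```

In the actual method this step is applied iteratively at each decoding position, with the classifier gradient computed by backpropagation rather than in closed form.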
GeDi We consider GeDi (Krause et al., 2020), which combines the probabilities from the LM with the probabilities from a second, smaller LM that is trained on nontoxic data. We finetune GPT-2 small (Radford et al., 2019) for the second LM.
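As a rough illustration (a simplified sketch, not the exact GeDi computation), the decoding-time combination can be viewed as Bayes-rule reweighting of the base LM's next-token distribution by each candidate token's probability of being nontoxic, with a strength parameter ω:

```python
import math

def gedi_combine(lm_logprobs, p_nontoxic, omega=1.0):
    """Reweight base-LM next-token log-probs by omega * log P(nontoxic | token),
    then renormalize into a proper distribution. At omega=0 this recovers
    the base LM; larger omega pushes mass away from tokens judged toxic."""
    scores = {t: lp + omega * math.log(p_nontoxic[t])
              for t, lp in lm_logprobs.items()}
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return {t: math.exp(s - log_z) for t, s in scores.items()}
```

This view also previews the failure mode studied later: if the discriminator assigns low P(nontoxic) to tokens associated with AAE or identity mentions, reweighting suppresses them regardless of actual toxicity.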
Filtering Finally, we consider output filtering, where we generate a fixed number of times (we use 10) from the LM and return the least toxic generation according to a toxicity classifier. We reuse the same toxicity classifier from PPLM.
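A minimal sketch of output filtering, with `sample_fn` and `toxicity_fn` as stand-ins for the LM sampler and the toxicity classifier:

```python
def filtered_generate(sample_fn, toxicity_fn, prompt, n_samples=10):
    """Output filtering: draw n_samples continuations from the LM and
    return the one the toxicity classifier scores as least toxic."""
    candidates = [sample_fn(prompt) for _ in range(n_samples)]
    return min(candidates, key=toxicity_fn)
```

Note that this is purely a decoding-time intervention: if the LM rarely produces any nontoxic continuation for a given prompt, every candidate is toxic and filtering cannot help.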

Hyperparameters and Training Data
We use GPT-2 medium (Radford et al., 2019) as the base LM for all detoxification techniques. We use the hyperparameters from the original papers for each technique, except we generate using top-k sampling (Fan et al., 2018) with k = 50 for all methods to enable a fair comparison.
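Top-k sampling truncates the next-token distribution to the k most probable tokens before renormalizing and sampling; a minimal sketch:

```python
import random

def top_k_sample(probs, k=50, rng=random):
    """Keep the k highest-probability tokens, renormalize, and sample one."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    r = rng.random() * total
    acc = 0.0
    for tok, p in top:
        acc += p
        if r <= acc:
            return tok
    return top[-1][0]  # guard against floating-point rounding
```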
For training data, we use the commonly-studied English Jigsaw Civil Comments dataset. 1 We remove examples where between 10% and 50% of the annotations are the toxic label (i.e., examples with low inter-annotator agreement). We publicly release our code. 2
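The agreement filter described above can be sketched as follows (the field names are hypothetical, not the Jigsaw schema):

```python
def filter_low_agreement(examples, low=0.10, high=0.50):
    """Keep only examples with high inter-annotator agreement: drop
    those where the fraction of annotators choosing the toxic label
    falls between low and high (10%-50% in our setup)."""
    kept = []
    for ex in examples:
        toxic_frac = ex["toxic_votes"] / ex["total_votes"]
        if not (low <= toxic_frac <= high):
            kept.append(ex)
    return kept
```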

Detoxifying LMs Introduces Biases
In this section, we evaluate the detoxification methods and show that they introduce biases into LMs that may harm marginalized groups.

Figure 3: We use the detoxified LMs to generate completions of WAE or AAE prompts. We ask crowdworkers to compare the generations to those from a baseline GPT-2 model. Detoxification methods cause a degradation in generation quality (topicality, fluency, and style) when models are conditioned on WAE texts. Worse yet, generation quality is noticeably worse when conditioned on AAE texts, demonstrating unwanted biases. See Table 1 for qualitative examples.

Automatic Evaluation Using Perplexity
We first perform intrinsic evaluations of each detoxification technique by computing the perplexity of detoxified models on various datasets. Note that we are not generating from the LM in this evaluation. 3

White-Aligned English Perplexity We first evaluate the perplexity on White-Aligned English (WAE) text that is either toxic or nontoxic. We use WAE tweets from Groenwold et al. (2020). 4 The detoxification techniques are effective at removing toxicity: the perplexity on toxic data increases substantially (Figure 1, toxic evaluation set). All techniques also cause a (smaller) increase in the perplexity on nontoxic WAE tweets, which shows that detoxification comes at some cost to the LM's utility. Part of this increase likely results from distribution shift: the detoxification methods are trained on comments data, but our evaluation sets come from Twitter.

Identity Mentions and AAE Perplexity We next evaluate the perplexity of the detoxified LMs on nontoxic language that may be used by marginalized groups. Concretely, we use text that contains minority identity mentions (e.g., words such as "gay" or "Muslim") or surface markers of African-American English (Green, 2002, AAE). We form two evaluation sets using tweets. First, we collect tweets from the Twitter API that contain specific identity mentions. 5 Second, we use the nontoxic data from Groenwold et al. (2020), which are the AAE equivalents of the nontoxic WAE tweets we used for the previous evaluation.
We find that there is a disproportionately large increase in LM perplexity on the AAE and minority identity mention tweets (Figure 1, AAE and identity mentions). For example, when using PPLM, the perplexity increases by a factor of 2.1 on nontoxic WAE data and a factor of 4.3 on minority identity mention data.
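For reference, a minimal sketch of the metric behind these comparisons: perplexity is the exponentiated average negative log-likelihood per token, and the reported bias is the ratio of perplexities between evaluation sets.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over a token sequence."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

def perplexity_ratio(aae_logprobs, wae_logprobs):
    """How many times worse the model is on AAE text than on WAE text."""
    return perplexity(aae_logprobs) / perplexity(wae_logprobs)
```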

Stronger Detoxification Amplifies Biases
We also find that stronger detoxification amplifies the gap in perplexity between text with WAE and text with AAE or minority identity mentions. This occurs for all detoxification techniques, for example, in Figure 2 we vary a parameter in GeDi that increases the degree of detoxification (ω). As more detoxification is applied, the ratio of AAE perplexity to WAE perplexity increases dramatically, reaching upwards of 400.

Human Evaluation of Generation Quality
As an extrinsic evaluation, we measure the generation quality of each detoxification method using crowdworkers on Amazon Mechanical Turk. We provide a short prompt as input to the detoxified LMs and then generate 30 additional tokens. For the prompts, we tokenize the aforementioned AAE and WAE tweets and extract the first half of each tweet. We sample 50 prompts from each set of tweets, producing 100 total prompts. Annotators are shown the prompt and asked to select the better of two model-generated continuations: one from the baseline GPT-2 model and one from a randomly selected detoxification technique. They evaluate the model continuations based on toxicity and three measures of generation quality: topicality, fluency, and style. See Appendix B for screenshots of the setup (including concrete definitions of topicality, fluency, and style). Each example is evaluated by three different crowdworkers.

Figure 3 shows the results split by WAE and AAE prompts, and Table 1 shows examples of generations. All detoxification methods generate less toxicity than the baseline GPT-2 model. 6 However, this detoxification typically comes at a degradation in generation quality. For example, more than 80% of annotators found GeDi less topical than the GPT-2 baseline, and all of the techniques except DAPT were rated as less fluent. 7 Worse yet, when models are conditioned on AAE texts (hatched bars in Figure 3), the generation quality is consistently lower across all metrics. The drop is most significant in topicality, where all detoxified models prefer to change the topic when asked to generate text conditioned on AAE prompts (e.g., GeDi was preferred only half as often for topicality on AAE prompts than on WAE prompts).

PPLM: in heaven and he was like, "Oh you're from Michigan" I . . .
Filtering: in heaven and wanted to be with him. Then after being asked "Why . . .
GeDi: Mine uh unlimited number of vistors u . . .
Table 1: Detoxification techniques are effective at mitigating toxic completions for most prompts; however, they often generate low-quality or nonsensical completions for AAE prompts. Above, we provide an input prompt that is the beginning of a WAE or AAE tweet and generate from the LM with top-k sampling. See Figure 3 for quantitative results from crowdworker evaluations. We censor vulgar and offensive words.

6 Filtering performs poorly because GPT-2 rarely generates nontoxic continuations of toxic prompts.
7 As mentioned in Section 3.1, some of the quality issues can be attributed to domain shift.

Why Detoxification Introduces Biases
In this section, we explain why detoxification causes the utility of LMs to degrade on text that contains AAE and minority identity mentions. First, note that all detoxification techniques make use of labeled toxic/nontoxic data. For example, DAPT uses this data directly: it finetunes the LM on nontoxic examples. PPLM, GeDi, and Filtering use this data indirectly: they train a classifier or LM on the toxicity data and then incorporate this model into the LM's decoding strategy.
Unfortunately, there are spurious correlations between the toxic label and the presence of AAE and minority identity mentions (Sap et al., 2019; Dixon et al., 2018). These correlations arise from annotation and sampling biases. Annotation bias occurs because crowdworkers are often unfamiliar with AAE and consequently misjudge it as toxic (Sap et al., 2019). Sampling bias occurs because many toxic comments are directed towards marginalized groups (RWJF, 2017). The result of these two biases is that text which contains AAE and minority identity mentions is labeled as toxic at disproportionately high rates (Sap et al., 2019).
Detoxification techniques inherit these undesirable biases. For example, DAPT will train LMs to not only forget toxicity but also forget AAE and minority identity mentions. Similarly, the discriminators used by PPLM, GeDi, and Filtering will guide the generated text away from AAE and identity mentions because the discriminators typically consider such text as toxic (Dixon et al., 2018; Sap et al., 2019; Oliva et al., 2020). Also note that in all of the above cases, increasing the detoxification strength (e.g., longer finetuning for DAPT or higher ω for GeDi) exacerbates these problems.
In our experiments, we test multiple detoxification methods to show that this bias is not linked to a specific technique but instead to the process of detoxification in the presence of biased supervised data. In fact, other controllable generation techniques, including prompts (Wallace et al., 2019; Sheng et al., 2020; Shin et al., 2020) or conditional LMs (Keskar et al., 2019), will likely exhibit the same type of biases.

Harms of Detoxification
Our results demonstrate that the current state of detoxification poses representational harms (Blodgett et al., 2020) to minority groups. We discuss the concrete impacts of these harms below.
In-group Harms Detoxified LMs are deployed in downstream NLP systems in which they directly engage with end users. In addition to LMs not being able to generate minority identity mentions and minority dialects, our results suggest that detoxified LMs also struggle to understand these aspects of language. This could lead to scenarios where end users who are AAE speakers must code-switch to WAE to ensure that NLP systems work effectively for them. Aside from being an annoyance, this is also a microaggression that poses psychological harms and may discourage AAE speakers from engaging with NLP systems at all.
Stigmatization of Language Detoxified models also have a propensity to avoid certain topics, e.g., mentioning a minority identity term. As a practical example, the (detoxified) Microsoft Zo chatbot was capable of discussing Christianity but could not discuss Islam (Stuart-Ulin, 2018). Failures like these further two types of stigma. First, having one's identity silenced by an NLP system can lead to self-stigmatization and long-term health consequences. Second, a lack of informed, conscious discussion on topics of identity or dialect can magnify existing societal stigmas. For example, aligning an LM solely with WAE stigmatizes AAE as incorrect or "bad" English (Flores and Rosa, 2015). In the technology industry, this can perpetuate a dangerous expectation that AAE users are not consumers who matter, stymieing progress on equitable NLP systems.
Biases Are Not Limited to Detoxification Although we have focused on problems with detoxification in this paper, similar failures will occur whenever controllable generation methods are used. For example, a common goal is to control the sentiment of generated text (Dathathri et al., 2020;Krause et al., 2020). Unfortunately, since sentiment datasets are often biased against certain racial groups (Kiritchenko and Mohammad, 2018), controlling the sentiment of text will also affect which races are discussed.

Future Work: Towards Bias-Free Detoxification
The harms that we have identified occur largely due to spurious correlations in toxicity datasets. A natural direction for future work is thus to improve datasets, for example, by changing the annotation procedure (Sap et al., 2019) or labeling scheme (Kennedy et al., 2020). Unfortunately, this can also make collecting annotations more expensive. As an alternative to, or in addition to, higher-quality data, there is growing interest in training accurate models in the presence of biased data (Oren et al., 2019; Clark et al., 2019). Unfortunately, state-of-the-art debiasing methods are still far from perfect (Zhou et al., 2021). We plan to explore new methods for debiasing both datasets and models in future work.

B Amazon Mechanical Turk Details
Figures 4 and 5 show the instructions and examples given to the crowdworkers on Amazon Mechanical Turk. Figure 6 shows an example of the test interface.