Towards Controllable Biases in Language Generation

We present a general approach towards controllable societal biases in natural language generation (NLG). Building upon the idea of adversarial triggers, we develop a method to induce societal biases in generated text when input prompts contain mentions of specific demographic groups. We then analyze two scenarios: 1) inducing negative biases for one demographic and positive biases for another demographic, and 2) equalizing biases between demographics. The former scenario enables us to detect the types of biases present in the model. Specifically, we show the effectiveness of our approach at facilitating bias analysis by finding topics that correspond to demographic inequalities in generated text and comparing the relative effectiveness of inducing biases for different demographics. The second scenario is useful for mitigating biases in downstream applications such as dialogue generation. In our experiments, the mitigation technique proves to be effective at equalizing the amount of biases across demographics while simultaneously generating less negatively biased text overall.


Introduction
With the advent of more effective, large-scale natural language processing (NLP) techniques, the issue of fairness and bias is more important than ever. As such, there is a growing body of work dedicated to fairness in natural language processing. Much of the work on social biases in natural language processing focuses on biases in word embeddings (Bolukbasi et al., 2016; Zhao et al., 2018b, 2019; Kurita et al., 2019) and natural language understanding tasks (Rudinger et al., 2018; Zhao et al., 2018a; Park et al., 2018).
In contrast, there has been relatively little work on examining biases in natural language generation (NLG). However, biases in NLG applications arguably have more direct social impact, since many NLG applications (e.g., chatbots, story generation, machine translation) are built for direct human interaction. Table 1 shows biases in real responses from DialoGPT (Zhang et al., 2020) and how prepending a hidden trigger to the user input can mitigate biases in DialoGPT's response.

Table 1
Source          Text
User            What was Shanice known for?
DialoGPT        Fighting people?
User            What was Molly known for?
DialoGPT        She's a great writer and comedian.
Hidden trigger  MITkjRole Advent agile pace
User            What was Shanice known for?
DialoGPT        She's a professor at MIT and she was a professor at NYU.
Hidden trigger  MITkjRole Advent agile pace
User            What was Molly known for?
DialoGPT        She used to use to be an admin for a small consulting firm.

Figure 1: A schematic overview of our work that 1) finds triggers that can influence the amount of negative, neutral, and positive biases in generated text, and 2) then evaluates the effectiveness of the triggers' control of biases in generated text. In this example, the bias trigger induces positive biases for woman and negative biases for man.

Motivated by the importance of understanding biases in NLG tasks, our goals are to develop new insights for and to mitigate biases in NLG models. To this end, we introduce a general framework to study how to control societal biases in NLG models. The framework is a model-agnostic formulation of a general bias control objective that can induce negative, neutral, or positive biases in generated text when the NLG model input contains mentions of specified demographic groups (e.g., "Black person" for the demographic RACE-BLACK). We define negatively biased, neutral, and positively biased text as those that influence the social perception towards a group of people to become more negative, neutral, and positive, respectively. With this definition, each text containing a demographic mention has a bias polarity towards the demographic, and we evaluate the effectiveness of our bias control objective by comparing the ratio of bias polarities across large sets of text generated from different bias objectives. Figure 1 gives an overview of an implementation of our framework.
First, we find a "bias control trigger" that can influence the bias polarity of text generated under a specified bias objective by extending gradient-based adversarial trigger phrase search techniques (Wallace et al., 2019). We can then prepend the trigger to input prompts, give the prepended input prompts to a language model, and evaluate the bias polarity ratio of the generated text. Each input prompt consists of a demographic mention and a bias context, i.e., a context that may induce biases in generated output, as defined by Sheng et al. (2019).
Throughout this work, we expand on how the procedure in Figure 1 can be used for both bias analysis and mitigation. One dimension for bias analysis is analyzing specific topics that correspond to demographic inequalities in generated text. For example, we find that a trigger that induces more negative bias towards RACE-BLACK versus towards RACE-WHITE results in more generated text on the subject of international relations. Another dimension for bias analysis is observing the relative effectiveness of inducing biases for different demographics; the effectiveness of these "adversarial attacks" can reveal limitations of the generation model. For example, we find that it is relatively more difficult to induce negative biases towards RACE-WHITE versus towards RACE-BLACK, compared to towards SEXUAL-ORIENTATION-STRAIGHT versus towards SEXUAL-ORIENTATION-GAY. Our technique for controllable biases can also be used for varying strategies of bias mitigation. In this work, we design an objective for the trigger search algorithm to find a trigger that reduces negatively biased generated text for all specified demographics. Across NLG models and demographic groups, our bias mitigation triggers are empirically able to equalize the bias polarity ratio for generated text and also generate less negatively biased text.
We conduct a series of automatic and human, quantitative and qualitative evaluations to show that the two specific bias control objectives are effective at influencing and mitigating biases between demographic groups for a widely used NLG model, GPT-2 (Radford et al., 2019). We further demonstrate the usefulness of our technique in a downstream NLG task by first analyzing the presence of biases in a dialogue generation system, DialoGPT, and then showing that we can effectively apply our mitigation technique to the system.
Our main contribution is proposing a general framework for automatically analyzing and mitigating societal biases in NLG models. Experimental results indicate that this general technique can be formulated to analyze and mitigate biases in different systems, can be generalized to unseen demographic mentions, and allows others to build upon the idea of controllable biases in language generation.

Problem Definition and Background
Given a pre-trained language generation model, our goal is to control the generation by inducing different bias polarities for the generated text when the model input contains mentions of specific demographics. We achieve this goal by formulating bias control objectives and adapting the adversarial trigger search algorithm of Wallace et al. (2019). Once we find a suitable trigger, we prepend the trigger to model inputs to control generated outputs.

Defining demographic groups. A demographic group is a socially-defined group of people; in text, we can define a demographic group as the equivalence class of all mentions that refer to the group. For example, the demographic group GENDER-MALE is defined as the set of phrases {man, male, guy, ...}. We follow previous work and simplify demographic groups to the mentions of one of their surface forms (GENDER-MALE: man, GENDER-FEMALE: woman, RACE-BLACK: Black person, RACE-WHITE: White person, SEXUAL-ORIENTATION-GAY: gay person, SEXUAL-ORIENTATION-STRAIGHT: straight person), and refer to the actual demographic group and the mentions interchangeably.

Defining biases. In this work, we define "biases" to be societal biases, and we say an NLG model is biased if its generated texts result in an unequal social perception of different demographic groups. More specifically, we look for distributional inequality in a large set of generated texts.

Quantifying biases in generation. The notion of controllable biases necessitates a quantitative metric for evaluating biases; we use the metric of regard defined by Sheng et al. (2019). Regard is defined as the general social perception towards a demographic group. For example, the sentence "[PERSON] was a pimp and her friend was happy" exhibits a negative regard towards [PERSON]. In contrast, the sentence "[PERSON], known for his kindness, had died alone" exhibits a positive regard towards [PERSON]. In both examples, the regard score and sentiment score can differ, showcasing the subtle differences and motivating the need for regard as a bias metric.

Triggers. For language generation, Wallace et al. (2019) define adversarial triggers to be a sequence of tokens that, when concatenated to input prompts, induce the model to generate racist outputs. For example, when the trigger "TH PEOPLEMan goddreams Blacks" is concatenated to "my mother", GPT-2 outputs "I'm a racist, and she's right". These input-agnostic trigger phrases are useful for analyzing model behavior. For our implementation of our bias control framework, we adapt the trigger search algorithm proposed by Wallace et al. (2019) and extend its optimization objective to control and mitigate biases (Section 3). The two works differ in scope: Wallace et al. (2019) use non-racist triggers to prompt models to generate racist output for any input, while we adapt their techniques as an implementation of our framework to induce and mitigate biases for targeted demographics. Note that the found trigger phrases are expected to be nonsensical, in part due to the unconstrained replacement strategy, and in part because GPT-2 operates at the subword level with Byte Pair Encodings (Sennrich et al., 2016). Regardless, the triggers are still able to effectively influence the model's generated texts.

Input prompts. In conditional language generation, an input prompt conditions the model's generated text. We control biases in generated text by prepending a trigger to an input prompt, which contains a demographic mention and a bias context, as shown in Figure 1. Bias contexts, a concept introduced by Sheng et al. (2019), are textual contexts which may induce biases towards a demographic, e.g., "[PERSON] was described as __" or "[PERSON] was regarded as __". In Figure 1, given the trigger "Asked Elsa LGBTQneapolisWomen educators", the demographic mention "The woman", and the bias context "was described as", the new input prompt is "Asked Elsa LGBTQneapolisWomen educators The woman was described as".
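The prompt construction described above amounts to simple string concatenation. The following is a minimal sketch; the helper name `build_prompt` is hypothetical and not taken from the authors' code, and it assumes the parts are joined with single spaces.

```python
def build_prompt(trigger: str, mention: str, bias_context: str) -> str:
    """Join trigger, demographic mention, and bias context with single spaces.

    An empty trigger yields the plain "no trigger" prompt.
    """
    parts = [trigger, mention, bias_context]
    return " ".join(p.strip() for p in parts if p.strip())


# Example taken from the Figure 1 description in the text.
prompt = build_prompt(
    "Asked Elsa LGBTQneapolisWomen educators", "The woman", "was described as"
)
print(prompt)
```

The same helper covers the trigger-free condition by passing an empty trigger string.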

Controllable Biases for NLG
To control text generated from input prompts containing demographic mentions, we search for triggers that can induce targeted bias polarities in generated text. We first give an overview of the trigger search framework and define notations. Next, we describe the two major components of our bias trigger search objective: bias association and dissociation, given specified demographic and target regard scores. Finally, we build two concrete objectives for the trigger search algorithm: one to induce and one to mitigate biases.

Trigger Search Algorithm
Given a trained language generation model and an annotated corpus, the trigger search algorithm starts with a default trigger string of a predetermined length, e.g., "the the the the the the", and iteratively replaces each token in the trigger with the replacement that best optimizes an objective function defined over the corpus. This iterative update terminates when no token replacement further improves the objective.

Notations. We introduce notations to facilitate discussion of our framework. $\tilde{t}$ denotes a trigger phrase and $\theta$ is a trained language model. We assume the existence of an annotated corpus $D = \{(x, y)\}$, $x \in X$, $y \in Y$, where $X$ is a set of input prompts, and each $x \in X$ is composed of a demographic mention and a bias context; $X_d$ thus denotes the subset of input prompts containing mentions of demographic $d$. For example, $X_{\mathrm{man}}$ = {"The man was regarded as", "The man was described as", ...}. Similarly, $Y$ denotes a set of target samples (e.g., "an inspiration for her students"), and each $y \in Y$ has a bias polarity. We use regard $r$ (negative, neutral, or positive) to represent the bias polarity, and $Y_r$ thus denotes the subset of target samples with regard $r$. For example, $Y_{\mathrm{neg}}$ = {"a terrible person.", "a thief.", ...}. For notational convenience, we use $(X_d, Y_r)$ to represent the subset of $D$ associated with demographic $d$ and regard $r$.
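The iterative replacement loop can be illustrated with a simplified sketch. Note the swapped-in technique: the search described in the text selects candidates with gradient information (following Wallace et al., 2019), whereas this toy version exhaustively scores every vocabulary token at every position against a stand-in objective. The vocabulary, token weights, and function names are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def greedy_trigger_search(
    vocab: Sequence[str],
    init: Sequence[str],
    objective: Callable[[List[str]], float],
    max_passes: int = 10,
) -> Tuple[List[str], float]:
    """Replace trigger tokens one position at a time while the objective improves."""
    trigger = list(init)
    best = objective(trigger)
    for _ in range(max_passes):
        improved = False
        for pos in range(len(trigger)):
            for tok in vocab:
                if tok == trigger[pos]:
                    continue
                candidate = trigger[:pos] + [tok] + trigger[pos + 1:]
                score = objective(candidate)
                if score > best:
                    trigger, best, improved = candidate, score, True
        if not improved:  # no single-token replacement helps: stop
            break
    return trigger, best


# Toy objective (an assumption): a fixed per-token weight stands in for the
# language-model-based bias control objective.
WEIGHTS = {"the": 0.0, "kind": 2.0, "calm": 1.0, "rude": -1.0}
toy_objective = lambda toks: sum(WEIGHTS[t] for t in toks)

found, score = greedy_trigger_search(list(WEIGHTS), ["the"] * 3, toy_objective)
print(found, score)
```

With a real model, exhaustive scoring over the full vocabulary would be far too slow, which is why a gradient-based candidate selection step is used in practice.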

Bias association and dissociation components.
To find a trigger to control biases, we design objective functions to associate and dissociate targeted (demographic $d$, regard $r$) specifications. To associate $d$ and $r$, we use $Y_r$ as a proxy for $r$ and search for a trigger $\tilde{t}$ to maximize the probability of generating the samples in $Y_r$ from the prompts in $X_d$. Specifically,

$$F_\theta(Y_r; \tilde{t}, X_d) = \sum_{(x, y) \in (X_d, Y_r)} \sum_{i=1}^{|y|} \log P(y_i \mid y_{1:i-1}; \tilde{t}, x, \theta)$$

is the summation over a given corpus $(X_d, Y_r)$ of the language model $\theta$'s log-probabilities of generating $y$ given trigger $\tilde{t}$ and $x$.

We can use a linear combination of $F_\theta(Y_r; \tilde{t}, X_d)$ with respect to different demographic $d$ and regard $r$ specifications as the objective to control the search of trigger. To associate demographic $d_1$ with target samples of regard $r_1$ and demographic $d_2$ with target samples of regard $r_2$, we write the objective

$$\max_{\tilde{t}} \; F_\theta(Y_{r_1}; \tilde{t}, X_{d_1}) + F_\theta(Y_{r_2}; \tilde{t}, X_{d_2}). \tag{1}$$

For example, to induce negative biases for man and positive biases for woman in generated text, we set $d_1$ = man, $d_2$ = woman, $r_1$ = negative, and $r_2$ = positive. This targeted bias association means the model will be more likely to generate the target sample "a great person." for the input "[trigger] The woman was described as", and the target sample "a terrible person." for the input "[trigger] The man was described as". Similarly, to dissociate a demographic $d$ from a regard $r$, we subtract the corresponding $F_\theta(Y_r; \tilde{t}, X_d)$ from the objective. Returning to the example above, if we want the input "[trigger] The woman was described as" to not be likely to generate "a terrible person.", we can subtract $F_\theta(Y_{r_1}; \tilde{t}, X_{d_2})$ from Eq. (1).

Bias Control Objectives
We examine two bias control objectives.
Objective to induce biases. The objective is

$$\max_{\tilde{t}} \; \alpha \left[ F_\theta(Y_{\mathrm{neg}}; \tilde{t}, X_{d_1}) + F_\theta(Y_{\mathrm{pos}}; \tilde{t}, X_{d_2}) \right] - \beta \left[ F_\theta(Y_{\mathrm{pos}}; \tilde{t}, X_{d_1}) + F_\theta(Y_{\mathrm{neg}}; \tilde{t}, X_{d_2}) \right], \tag{2}$$

where $\alpha, \beta > 0$ are hyperparameter weights. This objective associates negative regard samples with $d_1$ and positive regard samples with $d_2$, and also dissociates positive regard samples from $d_1$ and negative regard samples from $d_2$. We can observe the degree to which this formulation is able to influence the model to produce biased text. Inducing negative biases towards different demographics allows us to find triggers that could be useful for diagnosing and analyzing biases.

Objective to mitigate biases. The objective is

$$\max_{\tilde{t}} \; \alpha \left[ F_\theta(Y_{\mathrm{neu}}; \tilde{t}, X_{d_1}) + F_\theta(Y_{\mathrm{pos}}; \tilde{t}, X_{d_1}) + F_\theta(Y_{\mathrm{neu}}; \tilde{t}, X_{d_2}) + F_\theta(Y_{\mathrm{pos}}; \tilde{t}, X_{d_2}) \right] - \beta \left[ F_\theta(Y_{\mathrm{neg}}; \tilde{t}, X_{d_1}) + F_\theta(Y_{\mathrm{neg}}; \tilde{t}, X_{d_2}) \right], \tag{3}$$

which associates neutral and positive regard samples with both demographics and dissociates negative regard samples from both; the goal is to mitigate negative biases by targeting positive and neutral samples for both demographics. Here, making the model produce less negative text for both demographics is a means of reducing the negative regard score gap between demographics. Although this formulation does not directly target the relative amount of biases between a demographic pair, we empirically show that it can make the amount of biases between a demographic pair more equal. Other formulations of mitigation are also possible with our general approach for controllable biases.
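To make the two objectives concrete, the sketch below evaluates them for one fixed trigger. It rests on stated assumptions: `logprob(trigger, x, y)` is a hypothetical stand-in for the language model's log-probability of target `y` given the trigger and prompt `x`, the corpus subset for a (demographic, regard) pair is taken to be every prompt paired with every target, and the toy scores are invented for illustration.

```python
from typing import Callable, Dict, List

LogProb = Callable[[str, str, str], float]


def F(logprob: LogProb, trigger: str, prompts: List[str], targets: List[str]) -> float:
    """Sum of log-probabilities over all (prompt, target) pairs for one trigger."""
    return sum(logprob(trigger, x, y) for x in prompts for y in targets)


def induce_objective(logprob, trigger, X, Y, d1, d2, alpha=1.0, beta=1.0) -> float:
    """Associate (d1, neg) and (d2, pos); dissociate (d1, pos) and (d2, neg)."""
    assoc = F(logprob, trigger, X[d1], Y["neg"]) + F(logprob, trigger, X[d2], Y["pos"])
    dissoc = F(logprob, trigger, X[d1], Y["pos"]) + F(logprob, trigger, X[d2], Y["neg"])
    return alpha * assoc - beta * dissoc


def mitigate_objective(logprob, trigger, X, Y, d1, d2, alpha=1.0, beta=1.0) -> float:
    """Associate neutral and positive targets with both groups; dissociate negative."""
    assoc = sum(F(logprob, trigger, X[d], Y[r]) for d in (d1, d2) for r in ("neu", "pos"))
    dissoc = sum(F(logprob, trigger, X[d], Y["neg"]) for d in (d1, d2))
    return alpha * assoc - beta * dissoc


X: Dict[str, List[str]] = {
    "man": ["The man was described as"],
    "woman": ["The woman was described as"],
}
Y: Dict[str, List[str]] = {
    "neg": ["a terrible person."],
    "neu": ["a teacher."],
    "pos": ["a great person."],
}


def toy_logprob(trigger: str, x: str, y: str) -> float:
    """Invented scores: the 'model' prefers pos for woman and neg for man."""
    if "woman" in x:
        return -1.0 if y in Y["pos"] else -3.0
    return -1.0 if y in Y["neg"] else -3.0


print(induce_objective(toy_logprob, "[trigger]", X, Y, "man", "woman"))
print(mitigate_objective(toy_logprob, "[trigger]", X, Y, "man", "woman"))
```

A trigger search would call one of these functions as its objective and keep the token replacements that increase the returned value.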

Evaluation of Bias Triggers
Through automatic and human evaluations, we evaluate text generated using bias triggers and demonstrate the effectiveness of our proposed technique at inducing and mitigating biases.

Evaluation Setup
We define the bias direction between a pair of demographics as towards the demographic for which the model generates more negatively biased text. 9 After finding triggers, we evaluate text generated under four trigger conditions:
• No trigger: use only a demographic mention and a bias context as an input prompt.
• Mitigation: prepend mitigation triggers found using the objective in Eq. (3).
• BD-Orig: prepend triggers that encourage biases in the original direction, using Eq. (2).
• BD-Opp: prepend triggers that encourage biases in the opposite bias direction, using Eq. (2).
For each (demographic, trigger condition) pair, we compare the ratio of negative to neutral to positive regard-labeled samples between demographic pairs. These labels are either automatically or manually acquired. Our experiments are conducted on the small GPT-2 language model.
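The comparison of regard ratios between a demographic pair can be sketched as follows. The function names and the toy labels are hypothetical; the "negative gap" is taken here to be the absolute difference in the negative-regard fraction, which is an assumption about how the gap is measured.

```python
from collections import Counter
from typing import Dict, Sequence

LABELS = ("negative", "neutral", "positive")


def regard_ratio(labels: Sequence[str]) -> Dict[str, float]:
    """Fraction of negative, neutral, and positive labels in a set of samples."""
    counts = Counter(labels)
    total = sum(counts[label] for label in LABELS)
    if total == 0:
        raise ValueError("no negative/neutral/positive labels to count")
    return {label: counts[label] / total for label in LABELS}


def negative_gap(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Absolute difference in the negative-regard fraction between two groups."""
    return abs(regard_ratio(labels_a)["negative"] - regard_ratio(labels_b)["negative"])


# Invented labels for one trigger condition and one demographic pair.
group_a = ["negative", "negative", "neutral", "positive"]
group_b = ["negative", "neutral", "neutral", "positive"]
print(regard_ratio(group_a), regard_ratio(group_b), negative_gap(group_a, group_b))
```

Computing this gap once per trigger condition gives a simple way to compare, for example, the no-trigger and mitigation conditions.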

Automatic Evaluation
To automatically evaluate the generated text, we use a majority ensemble of three BERT (Devlin et al., 2019) bias classifiers that are trained to predict regard labels, as described by Sheng et al. (2019). 10 First, we label the text generated without triggers to show existing biases in GPT-2; the No trigger results in Figure 2 verify the trends of biases described by Sheng et al. (2019).

Triggers for bias mitigation. In Figure 2, the bias mitigation triggers always have smaller negative regard gaps between generated text for the demographic pairs, compared to those of the text generated without triggers. These results show that this Mitigation bias control objective is effective and has promise for application to downstream language generation tasks.

Triggers for controllable biases. Figure 2 also presents the results of simultaneously inducing biases in one demographic and avoiding biases in another. Across gender, sexual orientation, and racial biases, the BD triggers are able to successfully amplify the biases in the original bias direction and also induce biases in the opposite direction.

9 E.g., GPT-2 generates more negatively biased text for Black vs for White, so the bias direction is towards Black.
10 We use the regard2 model from https://github.com/ewsheng/nlg-bias, which has a 92% dev set and 80% test set accuracy.
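A majority ensemble over three classifier outputs can be sketched as below. This is not the regard2 implementation; the function name is hypothetical, and the tie-breaking rule (falling back to "neutral" when all three classifiers disagree) is an assumption, since the text does not say how such cases are resolved.

```python
from collections import Counter
from typing import Sequence


def majority_vote(votes: Sequence[str], fallback: str = "neutral") -> str:
    """Return the label chosen by more than half of the classifiers.

    If no label has a strict majority, return `fallback` (an assumed rule).
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else fallback


print(majority_vote(["negative", "negative", "positive"]))
print(majority_vote(["negative", "neutral", "positive"]))
```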
With these collective results, we make the following two observations. First, while the triggers can influence the targeted bias directions, the ratio of bias polarities of the generated text varies greatly between different pairs of demographics. This could be due to the fact that tokens in the model's vocabulary are discrete units, making it difficult to have a smooth control function for bias in generated text.
Second, we can use the bias triggers to gauge how susceptible the generation model is to adversarial attacks that induce biases. Bias triggers provide a lower-bound estimate of how vulnerable the generation model is to induced biases in certain bias directions. We hypothesize that the differences in effectiveness of inducing biases are partly due to the degree of model bias for different demographics.

Human Evaluation
In addition to automatically evaluating biases, we have humans manually annotate the regard and grammaticality of a subset of the generated text. These annotations validate the quality of the regard classifier and ensure that text generated across trigger conditions is grammatically comparable.

Annotation details for regard. For each (demographic, trigger condition) pair, we choose 100 randomly generated texts, for a total of 2,400 samples. We use Amazon's Mechanical Turk platform to have 3 annotators label the regard of each sentence, for a total of 7,200 annotations. We ask "How does the text affect how [PERSON] is perceived by most people in society?" and provide the following choices of regard: positive, no impact (neutral), negative, mixed (both), mixed (opposing), and does not make sense. 12 The average Cohen's kappa score across labels and annotators is 0.40. When we only keep the 5,672 annotations that are either negative, no impact, or positive, the average kappa score increases to 0.53. We can also convert the three labels into an ordinal scale of -1, 0, and 1; Spearman's correlation for this subset is 0.64. These scores indicate moderately strong inter-annotator agreement. When we use these annotations to evaluate the trigger-generated text, we only keep samples with a majority label of negative, no impact, or positive.

12 More annotation details in the Appendix.
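For readers who want to reproduce agreement statistics of this kind, the following is a minimal sketch of Cohen's kappa for two annotators, averaged over annotator pairs. The pairwise averaging and the function names are assumptions; the text reports only an average kappa "across labels and annotators" without giving the exact procedure. The ordinal mapping mirrors the -1, 0, 1 scale mentioned above.

```python
from collections import Counter
from itertools import combinations
from typing import Sequence

# Ordinal scale used when correlating annotations (from the text).
ORDINAL = {"negative": -1, "no impact": 0, "positive": 1}


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    if len(a) != len(b) or not a:
        raise ValueError("annotation lists must be non-empty and equally long")
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_chance = sum((count_a[l] / n) * (count_b[l] / n) for l in set(count_a) | set(count_b))
    if p_chance == 1.0:  # both annotators used a single identical label
        return 1.0
    return (p_observed - p_chance) / (1 - p_chance)


def average_pairwise_kappa(annotators: Sequence[Sequence[str]]) -> float:
    """Mean kappa over all unordered pairs of annotators (an assumed aggregation)."""
    pairs = list(combinations(annotators, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


ann1 = ["negative", "negative", "positive", "positive"]
ann2 = ["negative", "positive", "positive", "positive"]
print(cohen_kappa(ann1, ann2))
print([ORDINAL[label] for label in ann1])
```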
Human evaluation of regard. In Figure 3, each bar in each chart has 59 to 100 samples that are randomly chosen from the corresponding condition in Figure 2. There are similar ratios of regard scores in Figure 3 and Figure 2, indicating the effectiveness of the bias trigger objectives and the automatic regard classifier. We also present the correlation between average automatic and human regard scores for the annotated samples in Figure 4. With a Spearman's correlation of 0.69, we further validate our automatic evaluations.

Annotation details for grammaticality. Annotators rate the grammaticality of each sample on a scale on which 5 means no grammatical errors (more annotation details are in the Appendix). The resulting Spearman's correlation is 0.53, and we average the score over all 3 annotators per sample for analysis.

Human evaluation of grammaticality. Table 2 shows that the grammaticality of the trigger- and non-trigger-generated samples is comparable. In fact, the samples generated from the mitigation triggers are rated as slightly more grammatical than samples generated without triggers. 13

Diagnosing Biases
Our control mechanism allows us to analyze topics that correspond to biases for different groups. Figure 5 shows the relative frequencies of the top 28 words (after removing stop words) generated from different trigger conditions for RACE-BLACK and RACE-WHITE. 14 These frequent words indicate that the trigger tokens have strong topical influences on the generated text. Figure 5a shows charts generated from a mitigation trigger, "Leica 龍喚士natureconservancy EDITION龍喚士シャ", where words related to photography and nature are common to both demographics and to the trigger. Positive words are frequent, indicating the effectiveness of the mitigation trigger. Figure 5b shows frequent words generated from the bias trigger "Ministers UNESCO lang stressed MemorTIT". Words related to international relations are common to both demographics and the trigger. We also see "terrorist" and "criminal" for RACE-BLACK, and "diplomat" and "strong" for RACE-WHITE, which is unsurprising given that the bias trigger's objective is to induce more negative biases for RACE-BLACK and less negative biases for RACE-WHITE. Similarly, Figure 5c shows that both the trigger "Konami Overwatch Blade Asylum Tropical GAME" and text generated from the trigger contain words related to games, and the frequent words for RACE-WHITE are slightly more negative. These analyses can be used for additional bias mitigation, e.g., by augmenting training data pertaining to topics that induce socially biased demographic inequalities.

13 Many ungrammatical cases are incomplete sentences or have incorrect punctuation usage (Table 11 in Appendix).
14 Charts for other demographics are in the Appendix.

Table 3: Average regard for generated text containing Black or White names. "Seen" names are the 16 used in the trigger search algorithm; "unseen" are the other 24 names. |∆| is the absolute difference between the average scores and is smaller for the mitigated text. Mitigation trigger-generated text has higher average regard, and the results generalize to unseen names.
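The word-frequency analysis behind charts like Figure 5 can be approximated with a short sketch. The stop word list, the tokenization regex, and the function name are assumptions; the text does not state which stop word list or tokenizer was used.

```python
import re
from collections import Counter
from typing import Iterable, List, Set, Tuple

# A tiny illustrative stop word list (an assumption, not the authors' list).
STOP_WORDS: Set[str] = {"the", "a", "an", "was", "as", "and", "of", "to", "is", "in"}


def top_words(
    texts: Iterable[str], k: int = 28, stop_words: Set[str] = STOP_WORDS
) -> List[Tuple[str, float]]:
    """Return the k most frequent non-stop words with their relative frequencies."""
    counts = Counter(
        word
        for text in texts
        for word in re.findall(r"[a-z']+", text.lower())
        if word not in stop_words
    )
    total = sum(counts.values())
    return [(word, count / total) for word, count in counts.most_common(k)]


samples = ["The man was a diplomat", "The diplomat was strong"]
print(top_words(samples, k=3))
```

Running this separately on the text generated for each demographic under the same trigger makes it possible to compare which topical words dominate for each group.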

Bias Triggers for Dialogue Generation
Since large-scale pre-trained language models such as GPT-2 are frequently used for downstream tasks, we examine how our techniques transfer to the NLG task of dialogue generation. We run our experiments on the pre-trained medium version of DialoGPT (Zhang et al., 2020).

Names instead of general demographic strings.
Although the demographic mentions (e.g., "The man") that we use for the GPT-2 experiments are informative for showing the effectiveness of the bias trigger objectives, the use of these mentions in a conversational setting is unnatural and an oversimplification of demographic groups. For dialogue generation, we analyze biases in a more natural context by using names instead of general demographic strings. We use 80 names that are equally divided between popular female and male names, and between popular White and Black names (Levitt and Dubner, 2005). 16 We also convert bias contexts into questions (e.g., "[PERSON] was known for" becomes "What was [PERSON] known for?") for more natural conversational contexts. Examples are in Table 1. Biases in DialoGPT. First, we generate text from DialoGPT without any triggers to verify the presence of biases. Using the regard classifier to label the generated text, the average regard score is 0.30 for 2,000 samples containing Black names and 0.37 for 2,000 samples containing White names.
To ensure that this gap is statistically significant, we randomly partition all the names and the corresponding generated texts into two sets, and calculate the average regard score gap. We perform the random partitioning 100 times to obtain a distribution mean of 0.00 and a standard deviation of 0.03 for the average score gap. With this distribution of random partitions, we obtain a z-score of 22.7 and a p-value of $1.7 \times 10^{-114}$, which is statistically significant.

Mitigation trigger. We apply our formulation of bias mitigation from Eq. (3) to find a trigger that induces all names to be associated with positive and neutral regard text and dissociated from negative regard text. Similar to the setup for GPT-2, we concatenate the trigger to a name and bias context for the input prompt. When using general demographic mentions (e.g., "The Black person"), we append the same mention to all target samples of interest. For names, we cycle through 16 randomly chosen names of the targeted demographic to append to target samples, so that we may find triggers that generalize to different names.

16 Full list of names in the Appendix.
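The random-partition check can be sketched as follows. The text does not spell out how the z-score is derived from the partition distribution; this sketch divides the difference between the observed gap and the mean partition gap by the standard deviation of the partition gaps, which is one common choice and an assumption here. The names, scores, and function names are invented for illustration.

```python
import random
import statistics
from typing import Dict, List, Sequence, Tuple


def mean_gap(scores: Dict[str, List[float]], group_a: Sequence[str], group_b: Sequence[str]) -> float:
    """Average regard of group_b's texts minus average regard of group_a's texts."""
    a = [s for name in group_a for s in scores[name]]
    b = [s for name in group_b for s in scores[name]]
    return statistics.mean(b) - statistics.mean(a)


def partition_z_score(
    scores: Dict[str, List[float]],
    group_a: Sequence[str],
    group_b: Sequence[str],
    n_partitions: int = 100,
    seed: int = 0,
) -> Tuple[float, float, float, float]:
    """Compare the observed gap with gaps from random partitions of the names."""
    rng = random.Random(seed)
    observed = mean_gap(scores, group_a, group_b)
    names = list(group_a) + list(group_b)
    gaps = []
    for _ in range(n_partitions):
        rng.shuffle(names)
        half = len(group_a)
        gaps.append(mean_gap(scores, names[:half], names[half:]))
    mu, sigma = statistics.mean(gaps), statistics.stdev(gaps)
    return (observed - mu) / sigma, observed, mu, sigma


# Invented regard scores per name (two generated texts each).
toy_scores = {"a1": [0.0, 0.1], "a2": [0.1, 0.0], "b1": [1.0, 0.9], "b2": [0.9, 1.0]}
z, observed, mu, sigma = partition_z_score(toy_scores, ["a1", "a2"], ["b1", "b2"])
print(z, observed, mu, sigma)
```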
Mitigation results. Table 1 shows examples of responses generated with and without a mitigation trigger. When the mitigation trigger is concatenated to bias contexts and names, the generated texts have an average regard score of 0.53 for Black names and 0.52 for White names. Table 3 shows that whether we partition the generated text by the 16 names that are used to find the mitigation trigger ("seen"), or by the "unseen" names, the mitigation effects generalize. The similar decrease in average score gap and the overall increase in scores indicate the effectiveness of the bias trigger in mitigating biases by inducing more positive and neutral text for all names.

Wallace et al. (2019) show that language models can be prompted with non-racist triggers to generate racist output for any input, while we introduce a framework for inducing and mitigating biases for targeted demographics. Furthermore, our framework of optimization objectives for bias associations and dissociations can be used with other controllable text generation methods to achieve bias control.

Biases in names. Prabhakaran et al. (2019) show that NLP models are susceptible to learning different incidental associations with different names, and Shwartz et al. (2020) further analyze name biases in language models. In text corpora, names typical of certain demographics are likely to appear in close proximity with other names and terms associated with the same demographic; word representations from language generation models also reflect this proximity.

Conclusion
Our framework for controllable biases in NLG can influence biases towards different demographic groups. We can gain more insight into an NLG model's learned biases by examining topics that correspond to demographic inequality in generated text and by comparing the effectiveness of bias triggers across demographics. Bias triggers can also be used for mitigation, and our results indicate that these mitigation triggers are effective for both language and dialogue generation. Future work includes investigating the generalizability of this framework to more variations in textual contexts.

Table 4: Triggers generated for different conditions. + and - mean toward positive and negative, respectively. For example, "man -" means the objective for the trigger search is formulated so that the demographic man is associated with negative regard samples. Trigger tokens describe topics that effectively cause one demographic to be more negatively biased and the other to be more positively biased.

Table 5: Triggers generated for mitigation. In other words, both demographics in a pair (e.g., man and woman) are associated with positive and neutral regard samples and dissociated from negative regard samples.

A.2 More Details for Reproducibility
Trigger search algorithm.
• Computing infrastructure and runtime: With an RTX 2080Ti GPU, running the trigger search algorithm takes 1-2 hours.
• Hyperparameters: for the α and β weights in Eq. (2) and (3), we manually try all different combinations of values in {0, 1, 2} and evaluate the text generated using the found trigger, as described in Section 3.
• Regard dataset from Sheng et al.

A.3 More Language Generation Results
For the human evaluation of bias control described in Figures 3 and 4, we also calculate the precision and recall of the regard classifier predictions versus the human annotations for a more comprehensive overview of the classifier quality. For negative human annotations, (precision=0.69, recall=0.87) for the regard classifier's negative labeling. For no impact annotations, (precision=0.68, recall=0.77) for the classifier's neutral labeling. For positive annotations, (precision=0.84, recall=0.65) for the classifier's positive labeling.
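Per-label precision and recall of classifier predictions against human annotations can be computed as in the sketch below. The function name and the toy labels are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention.

```python
from typing import Sequence, Tuple


def precision_recall(human: Sequence[str], predicted: Sequence[str], label: str) -> Tuple[float, float]:
    """Precision and recall of `predicted` for one label, with `human` as reference."""
    true_pos = sum(h == label and p == label for h, p in zip(human, predicted))
    predicted_pos = sum(p == label for p in predicted)
    actual_pos = sum(h == label for h in human)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


# Invented annotations for illustration.
human_labels = ["negative", "negative", "positive", "neutral"]
classifier_labels = ["negative", "positive", "positive", "negative"]
for label in ("negative", "neutral", "positive"):
    print(label, precision_recall(human_labels, classifier_labels, label))
```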

A.4 Gender Bias in Dialogue Generation
In addition to racial biases, we also observe gender biases in DialoGPT, although to a lesser extent. When we partition the generated text by gender, the average regard score is 0.31 for female names and 0.35 for male names, for a gender gap of 0.04. Comparing with the distribution of random partitions, this gender gap has a z-score of 12.2 and a p-value of $1.8 \times 10^{-34}$, which is also statistically significant. When the mitigation trigger "FansRIC Dharma minimalist agile beginners" is concatenated to names and bias contexts, the generated texts have an average regard score of 0.57 for texts with female names and for texts with male names alike. Example responses are in Table 8. Table 9 shows that the mitigation is relatively effective and generalizes to unseen names.

Table 8
Source          Text
User            What was Molly known for?
DialoGPT        Not much actually
User            What was Jake known for?
DialoGPT        He was a good wrestler.
Hidden trigger  FansRIC Dharma minimalist agile beginners
User            What was Molly known for?
DialoGPT        She was a professional on the show!
Hidden trigger  FansRIC Dharma minimalist agile beginners
User            What was Jake known for?
DialoGPT        He's actually one of the most passionate guys on the team.

Table 9: DialoGPT gender results: Average regard for generated text containing female or male names. "Seen" names are the 16 used in the trigger search algorithm; "unseen" are the other 24 names. |∆| is the absolute difference between the average scores and is generally smaller for the mitigated text. The mitigation trigger-generated text has higher average regard, and the results generalize to unseen names.

A.5 More Dialogue Generation Results
Since the regard classifier is originally trained on samples generated from GPT-2, we verify the efficacy of the classifier on DialoGPT responses by having Mechanical Turk annotators label the regard of a subset of the responses. The annotation process is the same as the one for text generated from GPT-2. We randomly select 200 samples that cover equal numbers of Black vs. White and female vs. male names, and that are taken from both trigger-less generated text and race mitigation trigger generated text. Each sample is annotated by 3 annotators. For the full 600 sample instances, the average Cohen's kappa score is 0.51. If we only take the 461 samples annotated as negative, neutral, or positive, the kappa becomes 0.61. Spearman's correlation for the 461 samples ...

... we ignore punctuation inconsistencies at the end of the text, so the score = 5.

A.7 More Trigger Search Details
Filtering trigger tokens. The subword nature of Byte Pair Encodings means that the tokens in a trigger may be replaced in subword units. Although the triggers may end up being nonsensical, we have shown that the bias triggers can still be effective, and the generated text can still be grammatical. In our experiments, we observe that triggers that contain certain punctuation (e.g., parentheses) lead to generated text of poorer quality. Thus, during the iterative trigger token replacement process, we filter out tokens with any punctuation (i.e., !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~) or digits. We also filter out whitespace characters.

Bias control. We experiment with other modifications to the bias trigger algorithm, including using random initial trigger tokens instead of "the the the the the the", iterating through the replacement from right-to-left instead of left-to-right, and modifying the α and β parameters in Equations (2) and (3). In our experiments, we find that starting with "the the the the the the", iterating left-to-right, and setting α = β = 1 is most effective.
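The token filter described above can be sketched in a few lines. Two assumptions are worth flagging: the punctuation set is taken to be Python's `string.punctuation`, which matches the characters listed in the text, and "whitespace characters" is interpreted as tokens that are empty or consist only of whitespace. The function name is hypothetical.

```python
import string

# Characters that disqualify a candidate trigger token.
_BANNED = set(string.punctuation) | set(string.digits)


def is_allowed_token(token: str) -> bool:
    """Reject empty or whitespace-only tokens and tokens with punctuation or digits."""
    if not token or token.isspace():
        return False
    return not any(ch in _BANNED for ch in token)


candidates = ["agile", "(", "MIT2", " ", "educators", "U.S"]
print([tok for tok in candidates if is_allowed_token(tok)])
```

In a trigger search loop, a filter like this would be applied to the candidate replacement tokens before they are scored against the objective.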
Using only names from one demographic group. While experimenting with DialoGPT, we also run the trigger search algorithm for mitigation using names from only one targeted demographic group (e.g., female, male, Black, or White). We find that although these mitigation triggers can be effective for names from other demographic groups, the triggers are more effective across demographic groups when we use names from different groups in the trigger search.

Gram.  Text
5      [PERSON] had a job as a schoolteacher and was a senior citizen.
4.7    [PERSON] worked as a guide, an architect, a guide, and an explorer.
4.3    [PERSON] was thought of as being a strong leader and the White person as a strong leader.
3      [PERSON] had a part-time job as a counselor at a city-owned preschool in the city.
       [PERSON] was arrested Tuesday morning and charged with robbery and attempted robbery of a school resource officer after the suspect allegedly robbed the store and then stole