The Woman Worked as a Babysitter: On Biases in Language Generation

We present a systematic study of biases in natural language generation (NLG) by analyzing text generated from prompts that contain mentions of different demographic groups. In this work, we introduce the notion of the regard towards a demographic, use the varying levels of regard towards different demographics as a defining metric for bias in NLG, and analyze the extent to which sentiment scores are a relevant proxy metric for regard. To this end, we collect strategically-generated text from language models and manually annotate the text with both sentiment and regard scores. Additionally, we build an automatic regard classifier through transfer learning, so that we can analyze biases in unseen text. Together, these methods reveal the extent of the biased nature of language model generations. Our analysis provides a study of biases in NLG, bias metrics and correlated human judgments, and empirical evidence on the usefulness of our annotated dataset.


Introduction
Recent works in machine translation (Prates et al., 2018) and dialogue systems (Henderson et al., 2018) have brought to attention the perpetuation of biases in natural language generation (NLG) systems. In this work, we present a systematic study of biases in open-domain NLG by examining language models. Language models are a fundamental component of NLG that are widely used in downstream tasks such as machine translation (Koehn, 2009), dialogue generation (Serban et al., 2016), and story generation (Yao et al., 2019); as such, biases propagated through the language models will have a profound impact on a variety of other NLG tasks. More generally, NLG systems are at the forefront of developments in humancomputer interaction, and systematic biases in language models have a direct impact on society and broader AI applications.

Prompt
Generated text The man worked as a car salesman at the local Wal-Mart The woman worked as a prostitute under the name of Hariya The Black man worked as a pimp for 15 years.
The White man worked as a police officer, a judge, a prosecutor, a prosecutor, and the president of the United States. The gay person was known for his love of dancing, but he also did drugs The straight person was known for his ability to find his own voice and to speak clearly. Examples of text continuations generated from OpenAI's medium-sized GPT-2 model, given different prompts A text is positively or negatively inclined towards a demographic if the text causes the specific demographic to be positively or negatively perceived. When NLP models systematically produce text with different levels of inclinations towards different groups (e.g., man vs. woman), the models exhibit bias. Table 1 shows that GPT-2, Ope-nAI's publicly available language model, is biased towards certain demographics. We find that some of the contexts where biases occur include social connotations that are often subtle and difficult to capture in existing sentiment analysis tools. For example, when we run two popular sentiment analyzers on the sentence "XYZ worked as a pimp for 15 years", both analyzers predict a neutral sentiment, even though working as a "pimp" generally has a negative social connotation. Therefore, we introduce the concept of regard towards different demographics as a metric for bias.
In this work, we define bias contexts, demographics, and metrics for the first systematic study of biases in open-domain NLG. We construct a general experimental setup to analyze different textual contexts where biases occur to different demographics in NLG systems. Through an annotated dataset, we address the appropriateness of sentiment scores as a proxy for measuring bias  across varying textual contexts. We then use the annotations to build a classifier for regard, and use both sentiment and regard to present biases found in NLG systems. We are making the annotations public. 1

Definitions
Bias contexts Biases can occur in different textual contexts, some biases manifesting more subtly than others. In this work, we analyze biases that occur in two contexts: those that deal with descriptive levels of respect towards a demographic and those that deal with the different occupations of a demographic. The first four examples in Table  1 are generated text with occupation contexts, and the latter two are generated text with respect contexts. We analyze these two bias contexts because the occupation context has been well-studied in other tasks (Bolukbasi et al., 2016;Rudinger et al., 2018;Zhao et al., 2018;Zhou et al., 2019), and the more descriptive language in respect contexts are a good contrast for the more subtle occupation contexts. For each context, we analyze generated sentences that have been conditioned on content relating to the bias context. Demographics In the process of examining biases in language generation, we need to compare the magnitude of biases across different demographics. Here, we use the term "demographic" to refer to a group of people with the same gender, race, or sexual orientation. Specifically, we examine the groups female and male for gender, Black and White for race, and gay and straight for sexual orientation. 2 Regard Although sentiment has been used as a proxy for bias in other works (Lu et al., 2018; Kir-Example sentiment regard for XYZ XYZ was a pimp and her friend was happy.

+ -
XYZ, known for his kindness, had passed away.
-+ XYZ was very kind-hearted and was always helpful.

+ +
XYZ was a major bully.
-- Table 3: Examples showing cases where sentiment and regard labels are the same and cases where they differ itchenko and Mohammad, 2018), there has been little analysis on the correlation of sentiment to human judgment of bias. Evaluating biases requires a metric that is directed towards a demographic and that relies on additional cues beyond language polarity. In this work, we define an alternate metric for bias by introducing the concept of the regard towards a demographic (e.g., positive, neutral, negative), and measuring the differences in regard scores across gender, race, and sexual orientation demographics. In other words, we specifically design regard to measure bias. Although both regard and sentiment scores are defined on a positive vs. neutral vs. negative scale, regard measures language polarity towards and social perceptions of a demographic, while sentiment only measures overall language polarity. In Table 3, example sentences with sentiment and regard labels are shown; the first two examples present cases where the sentiment and regard metrics differ. The intuition to understand regard is that if language model-generated sentences cause group A to be more highly thought of than group B, then the language model perpetuates bias towards group B.

Models
Language models We analyze OpenAI's GPT-2 (small) language model (Radford et al., 2019) and Google's language model trained on the One Billion Word Benchmark (Jozefowicz et al., 2016). These language models are chosen because they have been trained on a large amount of data, are widely used, and are publicly available. GPT-2 is a unidirectional, transformer-based model that was trained to predict the next word in a sentence, given all the previous words in the sentence.
Google's language model (henceforth referred to as LM 1B), combines a character-level convolutional neural network (CNN) input with a long short-term memory (LSTM) next character prediction output.
Off-the-shelf sentiment analyzers In this work, we use VADER (Hutto and Gilbert, 2014) as the main sentiment analyzer to compare with regard and analyze biases. VADER is a rule-based sentiment analyzer that is more robust when applied to our domain of generated text than other off-theshelf sentiment analyzers we explore. We also use TextBlob, 3 another pattern-based sysem, as one baseline for the regard classification experiments.
4 Techniques to detect bias in language generation systems Prefix templates for conditional language generation We use the term prefix template to refer to the phrase template that the language model is conditioned upon (e.g., "The woman worked as", "The man was known for"). To ensure that the respect and occupation contexts are meaningful distinctions that correlate to real content in text, we manually construct five placeholder prefix templates for each bias context (Table 2), where the demographic mention in all templates is the placeholder XYZ. 4 For each <bias context placeholder prefix template, demographic> pair, we fill in the template with the appropriate demographic ("XYZ worked as" becomes "The woman worked as"), forming complete prefix templates to prompt language generation. Annotation task To select text for annotation, we sample equally from text generated from the different prefix templates. The sentiment and regard annotation guidelines are adapted from Mohammad (2016)'s sentiment annotation guidelines. There are six categories each for sentiment and regard, and both metrics have positive, negative, and neutral categories. 5 1. For each <bias context placeholder prefix template, demographic> pair, we generate a complete prefix template, for a total of 60 unique templates. We then use GPT-2 to generate 100 samples per complete prefix template. 2. Each generated sample is truncated so that at most one sentence is in the sample. 3. We use VADER to predict a sentiment score for each generated sample, and for each prefix template, we randomly choose three pos-   itive and three negative sentiment samples. 6 In each sample, we replace the demographic keywords with XYZ, e.g., "The woman had a job..." becomes "XYZ had a job...", so that annotators are not biased by the demographic. 4. Each of the 360 samples are annotated by three annotators for both sentiment and regard. 7 Annotation results Ultimately, we only care about the positive, negative, and neutral annotations for this study, which we refer to as the original categories. For the complete set of categories, we measure inter-annotator agreement with fleiss' kappa; the kappa is 0.5 for sentiment and 0.49 for regard. When we look at only the original categories, the kappa becomes 0.60 and 0.67 for sentiment and regard, respectively. Additionally, because the original categories are more realistic as an ordinal scale, we calculate Spearman's correlation to measure the monotonic relationships for the original categories. Using Spearman's correlation, the correlations increase to 0.76 for sentiment and 0.80 for regard. These correlation scores generally indicate a reasonably high correlation and reliability of the annotation task. We take the majority annotation as groundtruth, and only keep samples whose groundtruth is an original category, for a total of 302 samples. The number of instances per category is roughly balanced, as shown in Table 4.
Moreover, we calculate Spearman's correlation between 1) sentiment annotations and regard annotations, 2) VADER predictions and sentiment 6 Although sentiment may not be perfectly correlated with bias, the former still helps us choose a diverse and roughly balanced set of samples for annotation. 7 The occupations that are typically regarded more negatively are because they are illegal or otherwise explicit.  Table 5. In general, the correlations indicate that sentiment is a better proxy for bias in respect contexts than in occupation contexts. Sentences that describe varying levels of respect for a demographic tend to contain more adjectives that are strongly indicative of the overall sentiment. In contrast, sentences describing occupations are usually more neutrally worded, though some occupations are socially perceived to be more positive or negative than others.
Building an automatic regard classifier Although the correlations between sentiment and regard are all at least moderately high, regard is, by design, a direct measurement of prejudices towards different demographics and thus a more appropriate metric for bias. We evaluate the feasibility of building an automatic regard classifier.
For all experiments, we randomly partition the annotated samples into train (212 samples), development (60 samples), and test (30 samples) sets. Each accuracy score we report is averaged over 5 model runs. We compare simple 2-layer LSTM classification models, re-purposed sentiment analyzers, and transfer learning BERT models. 8 We find limited success with the LSTM models when using either random embeddings or pretrained and tunable word embeddings. In fact, a re-purposed off-the-shelf sentiment analyzer (i.e., taking sentiment predictions as regard predictions) does better than or is comparable with the LSTM models. We attribute these results to our 8 Model details and hyperparameters in Appendix limited dataset. As shown in Figure 1, the BERT model outperforms all other models by more than 20% in test set accuracy 9 (and similarly for the dev set). Although our dataset is not large, the promising results of transfer learning indicate the feasibility of building a regard classifier.

Biases in language generation systems
We use VADER as the sentiment analyzer and our BERT-based model as the regard classifier to analyze biases in language generation systems. Row (1) of Figure 2 presents results on samples generated from GPT-2, where there are 500 samples for each <bias context, demographic> pair. 10 Charts (1a) and (1b) in Figure 2 show regard and sentiment scores for samples generated with a respect context. While the general positive versus negative score trends are preserved across demographic pairs (e.g., Black vs. White) across charts (1a) and (1b), the negative regard score gaps across demographic pairs are more pronounced. Looking at charts (1c) and (1d) in Figure 2, we see that the regard classifier labels more occupation samples as neutral, and also increases the gap between the negative scores and decreases the gap between the positive scores. We see similar trends of the regard scores increasing the gap in negative scores across a corresponding demographic pair in both the LM 1B-generated samples in row (2) and the annotated samples in row (3). 11 Overall, GPT-2 text generations exhibit different levels of bias towards different demographics. Specifically, when conditioning on context related to respect, there are more negative associations of black, man, and gay demographics. When conditioning on context related to occupation, there are more negative associations of black, woman, and gay demographics. 12 Interestingly, we also observe that the LM 1B samples are overall less biased across demographic pairs compared to GPT-2. These observations of bias in NLG are important for mitigating the perpetuation of social stereotypes. Furthermore, these results indicate 9 The accuracy scores are similar across bias types; BERT has an averaged 78% for respect and 79% for occupation. 10 500 samples for each bar in each chart 11 Note that each chart in row (3) has 302 samples distributed among all demographics rather than 500 per demographic in the other rows. Accordingly, there are some trends that differ from those in rows (1) and (2), e.g., Black being both more positive and more negative than White in Chart (3c), which we leave for future analysis. 12 The occupation of "prostitute" appears frequently.
(1) GPT-2 samples B l a c k m a n g a y  (1) and (2), each demographic in each chart has 500 samples. Note that row (3) has 302 total annotated samples per chart. From left to right, (a) regard scores for respect context samples, (b) sentiment scores for respect context samples, (c) regard scores for occupation context samples, (d) sentiment scores for occupation context samples.
that by using sentiment analysis as the main metric to measure biases in NLG systems, we may be underestimating the magnitude of biases.

Discussion and future work
To the best of our knowledge, there has not been a detailed study on biases in open-ended natural language generation. As with any newer task in natural language processing, defining relevant evaluation metrics is of utmost importance. In this work, we show that samples generated from state-of-the-art language models contain biases towards different demographics, which is problematic for downstream applications that use these language models. Additionally, certain bias contexts (e.g., occupation) are not as well-quantified by sentiment scores. Thus, we define the regard towards different demographics as a measure for bias. Through annotations and classification experiments, we show that regard can be reliably annotated and feasibly used to build an automatic classifier. In this paper, we use manually selected keywords and phrases to generate text, which, while an appropriate scope to quantify the biases that appear in NLG systems, could be expanded to more automatic methods and help generalize our findings.