Multi-Dimensional Gender Bias Classification

Machine learning models are trained to find patterns in data. NLP models can inadvertently learn socially undesirable patterns when training on gender-biased text. In this work, we propose a general framework that decomposes gender bias in text along several pragmatic and semantic dimensions: bias from the gender of the person being spoken about, bias from the gender of the person being spoken to, and bias from the gender of the speaker. Using this fine-grained framework, we automatically annotate eight large-scale datasets with gender information. In addition, we collect a novel, crowdsourced evaluation benchmark of utterance-level gender rewrites. Distinguishing between gender bias along multiple dimensions is important, as it enables us to train finer-grained gender bias classifiers. We show our classifiers prove valuable for a variety of important applications, such as controlling for gender bias in generative models, detecting gender bias in arbitrary text, and shedding light on the genderedness of offensive language.


Introduction
Language is a primary means by which people communicate, express their identities, and categorize themselves and others both explicitly and implicitly. Such social information is present in the words we write and, consequently, in the text we use to train our NLP models. Models can unwittingly learn negative associations about protected groups present in their training data and propagate them. In particular, NLP models often learn to replicate unwanted gender biases present in society (Bolukbasi et al., 2016; Hovy and Spruit, 2016; Caliskan et al., 2017; Rudinger et al., 2017; Garg et al., 2018; Gonen and Goldberg, 2019; Dinan et al., 2020). Since unwanted gender biases can affect downstream applications, sometimes even leading to poor user experiences, understanding and mitigating gender bias is an important step towards making NLP tools and models safer and more equitable. In this paper, we provide a fine-grained framework for that purpose, analyze the presence of gender bias in models and data, and empower others by releasing tools that can be further applied to numerous text-based use-cases. While many works have explored methods for removing gender bias from text (Emami et al., 2019; Hall Maudslay et al., 2019; Ravfogel et al., 2020), no extant work on classifying gender or removing gender bias has foregrounded the fact that we use language, at least in part, to collaboratively and socially construct our gender identities. We propose a pragmatic and semantic framework for measuring bias along three dimensions that is sensitive to conversational and performative aspects of gender, as illustrated in Figure 1.

* Joint first authors.

Figure 1: Framework for Gender Bias in Dialogue. We propose a framework separating gendered language based on who you are speaking ABOUT, speaking TO, and speaking AS.
Recognizing these dimensions is important, because gender along each dimension can affect text differently, for example, by modifying word choice or imposing different preferences on sentence structure.
Decomposing gender into separate dimensions also allows for better identification of gender bias, which subsequently enables us to train a suite of classifiers for detecting different kinds of gender bias in text. We train several classifiers on publicly available data that we annotate with gender information along our dimensions. We also collect a new crowdsourced dataset (MDGENDER) for better fine-grained evaluation of gender classifier performance. The classifiers we train have a wide variety of potential applications. We evaluate them on three: controlling the genderedness of generated text, detecting gender biased text, and examining the relationship between gender bias and offensive language. In addition, we expect these classifiers to be useful for future text applications such as detecting gender imbalance in newly created training corpora or model-generated text. This paper makes four novel contributions: (i) we propose a multi-dimensional framework (ABOUT, AS, TO) for measuring and mitigating gender bias in language and NLP models, (ii) we introduce an evaluation dataset for performing gender identification that contains utterances re-written from the perspective of a specific gender along all three dimensions, (iii) we build a suite of classifiers capable of labeling gender in both single-task and multitask setups, and finally (iv) we illustrate our classifiers' utility for several downstream applications. All datasets, annotations, and classifiers will be released publicly to facilitate further research into the important problem of gender bias in text.

Table 1: Bias in Wikipedia. We compare the most over-represented adjectives in Wikipedia biographies of men and women to those in gender-neutral pages. We use a part-of-speech tagger (Honnibal and Montani, 2017) and compute P(word | gender)/P(word) for words that appear more than 500 times.
  M: akin, vain, descriptive, bench, sicilian
  F: feminist, lesbian, uneven, transgender, feminine
  N: optional, tropical, volcanic, glacial, abundant
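The over-representation statistic behind Table 1, P(word | gender)/P(word), can be sketched in a few lines. This is an illustrative reimplementation, not the paper's code: the paper additionally restricts to adjectives via a part-of-speech tagger, which is omitted here, and the names `overrepresented_words`, `docs_by_gender`, `min_count`, and `top_k` are ours.

```python
from collections import Counter

def overrepresented_words(docs_by_gender, min_count=500, top_k=5):
    """Rank words by P(word | gender) / P(word), the statistic used to
    surface over-represented words per gender class (M/F/N)."""
    per_gender = {g: Counter() for g in docs_by_gender}
    overall = Counter()
    for gender, docs in docs_by_gender.items():
        for doc in docs:
            tokens = doc.lower().split()
            per_gender[gender].update(tokens)
            overall.update(tokens)
    total = sum(overall.values())
    results = {}
    for gender, counts in per_gender.items():
        g_total = sum(counts.values())
        # P(word | gender) / P(word), restricted to frequent words
        scores = {
            w: (c / g_total) / (overall[w] / total)
            for w, c in counts.items()
            if overall[w] >= min_count
        }
        results[gender] = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return results
```

On a real corpus one would pass the tokenized biographies per gender class and keep `min_count=500` as in the paper.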
For dialogue, gender biases in training corpora have been found to be amplified in machine learning models (Lee et al., 2019; Dinan et al., 2020; Liu et al., 2019). While many of the works cited above proposed methods of mitigating the unwanted effects of gender on text, Hall Maudslay et al. (2019), Liu et al. (2019), Zmigrod et al. (2019), and Dinan et al. (2020) in particular relied on counterfactual data to alter the training distribution to offset gender-based statistical imbalances (see §4.2 for more discussion of training set imbalances). Also relevant is Kang et al. (2019, PASTEL), which introduced a parallel style corpus and showed gains on style transfer across binary genders.
Most relevant to this work, Sap et al. (2020) proposed a framework for modeling pragmatic aspects of many social biases in text. Our work and theirs focus on complementary aspects of a larger goal, namely, making NLP safe and inclusive for everyone, but the two approaches differ in several ways. We treat statistical gender bias in human- or model-generated text specifically, and in detail. Sap et al. (2020) proposed a different but compatible perspective, and aimed to situate gender bias within the broader landscape of negative stereotypes in social media text, an approach that can make parallels apparent across different kinds of harmful text. Moreover, they considered different pragmatic dimensions than we do: they targeted negatively stereotyped commonsense implications in arguably innocuous statements, whereas we investigate pragmatic dimensions that straightforwardly map to conversational roles (i.e., topics, addressees, and creators of text).
Finally, when investigating gender biases, one cannot ignore the intersectionality of gender identities, i.e., the ways gender non-additively interacts with other identity characteristics. Negative gender stereotyping is known to be alternatively weakened or reinforced by the presence of social attributes like dialect (Tatman, 2017), class (Degaetano-Ortlieb, 2018), and race (Davis, 1981; Crenshaw, 1989). These differences have been found to affect gender classification in images (Buolamwini and Gebru, 2018) and in sentence encoders (May et al., 2019). We acknowledge that these are crucial considerations, and intend to incorporate them in future work. For a thorough survey and a critical discussion of best practices for researching social "biases" in NLP, including and beyond gender, see Blodgett et al. (2020).

Dimensions of Gender Bias
Gender permeates language differently depending on the conversational role played by the people using that language (see Figure 1). We decompose gender bias along multiple dimensions: bias when speaking ABOUT someone, bias when speaking TO someone, and bias from speaking AS someone. This framework enables both finer-grained understanding of bias and better classification of gender's effects on text from multiple domains.

Definition of Gender
We annotate gender with four potential values: masculine, feminine, neutral, and unknown. We take neutral to cover characters with either a non-binary gender identity or an identity that is unspecified for gender by definition (e.g., a talking tree). We include an unknown category for when the gender is genuinely not known.
Speaking About: Gender of the Topic. It's well known that we change how we speak about others depending on who they are (Hymes, 1974; Rickford and McNair-Knox, 1994) and what their gender identity is (Lakoff, 1973; Eckert and McConnell-Ginet, 1992). For example, adjectives which describe women have been shown to differ from those used to describe men in numerous situations (Trix and Psenka, 2003; Gaucher et al., 2011; Moon, 2014; Hoyle et al., 2019), as do verbs that take nouns referring to men as opposed to women (Guerin, 1994; Hoyle et al., 2019).
Speaking To: Gender of the Addressee. People often adjust their speech based on who they are speaking with, their addressee(s), to show solidarity with their audience or to express social distance (Wish et al., 1976; Bell, 1984; Hovy, 1987; Rickford and McNair-Knox, 1994; Bell and Johnson, 1997; Eckert and Rickford, 2001). We expect the addressee's gender to affect, for example, how a man might communicate with another man about hair styles or beard hygiene. This exchange would probably differ if the man were communicating instead with a woman about the same topic.
Speaking As: Gender of the Speaker. People react to content differently depending on who created it. Like race, gender is often described as a "fundamental" category for self-identification and self-description (Banaji and Prentice, 1994, 315), with men, women, and non-binary people differing in how they actively create their own gender identities (West and Zimmerman, 1987). Who someone is speaking as strongly affects what they may say and how they say it, down to the level of their choices of adjectives and verbs in self-descriptions (Charyton and Snelbecker, 2007; Wetzel et al., 2012).

Creating Gender Classifiers
In an ideal world, we would expect little difference between texts describing men, women, and people with other gender identities. A machine learning model, then, would be unable to pick up on statistical differences in gendered language (i.e., statistical gender bias), because such differences would not exist. However, gender-based distributional differences do exist in current-day text (Table 1), and current-day gender bias classifiers can achieve much better than random performance ( §5). Thus, we believe the aim of research like ours should be to work towards training the best and most sensitive gender classifier imaginable. If we had such an idealized classifier, it should eventually achieve random performance on future datasets, thereby signalling that we managed to create a dataset that is not gender biased. We take the classifiers we introduce here to be first steps towards this goal.
Previous work on gender bias classification has been predominantly single-task, often supervised on the task of analogy, and relied mainly on word lists that are binarily gendered (Bolukbasi et al., 2016; Zhao et al., 2018b; Gonen and Goldberg, 2019). While word-list-based approaches provided a solid start, they ultimately prove insufficient. First, they conflate different conversational dimensions of gender bias, and are therefore unable to detect the subtle, but very well-described, pragmatic differences of interest here. Second, most existing gendered word lists are limited to explicitly binarily gendered words (e.g., mom vs. dad). Not only is binary gender completely inadequate for the task, but excluding statistically gendered words is problematic, because they are also strong anchors of gender stereotypes (Bolukbasi et al., 2016; Ethayarajh et al., 2019, i.a.). Instead, we develop classifiers that decompose gender bias over sentences into semantic and/or pragmatic dimensions (about/to/as), including gender information that (i) falls outside the male-female binary, (ii) can be contextually determined, and (iii) is statistically as opposed to explicitly gendered. In the subsequent sections, we provide details regarding the annotation of data and the training of these classifiers.

Data
In this section, we describe how we annotated our training data, including both the 8 existing datasets and our novel evaluation dataset, MDGENDER.
Annotation of Existing Datasets. We select a variety of existing datasets for training. Since one of our main contributions is a suite of open-source general-purpose gender bias classifiers, we selected datasets for training based on three criteria: inclusion of inferrable information about one or more of our dimensions, diversity in textual domain, and high quality, open-source data.
Some of the datasets contain gender annotations provided by existing work. For example, classifiers trained for style transfer algorithms have previously annotated the gender of Yelp reviewers (Subramanian et al., 2018). In other datasets, we infer the gender labels. For example, in datasets where users are first assigned a persona to represent before chatting, the gender of the persona is often predetermined. Where gender annotations are not provided, we sometimes impute the label if we are able to do so with high confidence. More details regarding how this was done can be found in §A.1.
Evaluation Dataset: MDGENDER. To make our classifiers reliable on all dimensions across multiple domains, we train on a variety of datasets. However, none of the existing data covers all three dimensions at the same time, and furthermore, many of the gender labels are noisy. To enable reliable evaluation, we collect a specialized corpus, MDGENDER, which acts as a gold-labeled dataset for the masculine and feminine classes.
First, we collect conversations between two speakers. Each speaker is provided with a persona description containing gender information, then tasked with adopting that persona and having a conversation. They are also provided with small sections of a biography from Wikipedia as the conversation topic. Using personas and biographies to frame the conversation encourages crowdworkers to discuss ABOUT/TO/AS gender information.
To ensure there is ABOUT/TO/AS gender information contained in each utterance, we perform a second pass over the dataset. In this next phase, we ask a second set of annotators to rewrite each utterance to make it very clear that they are speaking ABOUT a man or a woman, speaking AS a man or a woman, and speaking TO a man or a woman. For example, given the utterance Hey, how are you today? I just got off work, a valid rewrite to make the utterance ABOUT a woman could be: Hey, what's up? I went for a coffee with my friend and her dog after work, as the her indicates a woman. Annotators are additionally asked to label how confident they are that someone else could predict the given gender label, allowing for flexibility between explicit genderedness (like the use of he or she) and statistical genderedness. An example instance of the task is shown in Table 10 and the interface is shown in §A.2 Figure 2. Additionally, we provide demographic information about the annotators for this task in §A.2.

Table 3: Accuracy on the novel evaluation dataset MDGENDER comparing single-task classifiers to our multitask classifiers. We report accuracy on the masculine and the feminine classes, as well as the average of these two metrics. Finally, we report the average (of the M-F averages) across the three dimensions. MDGENDER was collected to enable evaluation on the masculine and feminine classes, for which much of the training data is noisy.

Table 4: Performance of the multitask model. We evaluate the multitask model on the test sets from our training data. We report accuracy on each gold label (masculine, feminine, and neutral) and the average of the three. We do not report accuracy on imputed labels.

Models
We outline how these classifiers are trained to predict gender bias along the three dimensions, providing details of the classifier architectures as well as how the data labels are used. In the single-task setting, we predict masculine, feminine, or neutral for each dimension, allowing the classifier to predict any of the three labels for the unknown category.
To obtain a classifier capable of multitasking across the about/to/as dimensions, we train a model to score and rank a set of possible classes given textual input. For example, if given Hey, John, I'm Jane!, the model is trained to rank elements of both the sets {TO:masculine, TO:feminine, TO:neutral} and {AS:masculine, AS:feminine, AS:neutral} and produce appropriate labels TO:masculine and AS:feminine. Models are trained and evaluated on the annotated datasets as well as MDGENDER.
Model Architectures. For single-task and multitask models, we use a pretrained Transformer-based (Vaswani et al., 2017) architecture. The model takes a text sequence (such as "John Doe was a man") and a set of labels corresponding to the gender along a given dimension (such as {ABOUT:masculine, ABOUT:feminine, ABOUT:neutral}) as input; the model then ranks this set according to the textual input and outputs the top element from the set (such as ABOUT:masculine). More specifically, the Transformer provides representations for the textual input and the set of classes. Classes are then scored (and ranked) by taking a dot product between the representations of the textual input and a given class, following the bi-encoder architecture (Humeau et al., 2019) trained with cross entropy. We use the same architecture and pre-training as BERT-base (Devlin et al., 2019): specifically, a 12-layer Transformer encoder with 12 attention heads and an embedding size of 768. We use ParlAI for model training (Miller et al., 2017). We will release all data and models.

Table 5: Ablation of gender classifiers on the Wikipedia test set. We report the model accuracy on the masculine, feminine, and neutral classes, as well as the average accuracy across them. We train classifiers (1) on the entire text, (2) after removing explicitly gendered words using a word list, and (3) after removing gendered words and names. While removing gendered words and names makes classification more challenging, the model still obtains high accuracy.
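The dot-product scoring and ranking step of the bi-encoder can be illustrated without the Transformer itself. In this sketch, plain vectors stand in for the encoder's representations of the text and the candidate classes, and `rank_classes` and `cross_entropy_loss` are our own illustrative names, not the paper's code.

```python
import numpy as np

def rank_classes(text_vec, class_vecs, class_names):
    """Score each candidate class by its dot product with the text
    representation and return the classes sorted best-first."""
    scores = class_vecs @ text_vec          # one score per candidate class
    order = np.argsort(-scores)             # descending by score
    return [class_names[i] for i in order], scores

def cross_entropy_loss(scores, target_idx):
    """Softmax cross entropy over the candidate scores, the training
    objective used to fit the ranking."""
    shifted = scores - scores.max()         # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target_idx]
```

In the real model, `text_vec` and each row of `class_vecs` would come from the shared BERT-base encoder rather than being fixed vectors.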

About/To/As Gender Classification
Quality of Classification Models. We compare models that classify along a single dimension to one that multitasks across all three, using MDGENDER to evaluate. We measure percentage accuracy for the masculine and feminine classes. (Recall that MDGENDER does not contain unknown or neutral classes.) More details on this new dataset can be found in Section 4.1. Classifier results on MDGENDER are shown in Table 3.
We find that the multitask classifier has the best average performance across all dimensions, with a small hit to single-task performance in the ABOUT and AS dimensions. As expected, the single task models are unable to transfer to other dimensions: this clearly shows that gender information manifests differently along each dimension. Further, it demonstrates that solely using existing datasets is inadequate, as they do not contain labeled data along all three dimensions. Training for a single task allows models to specialize to detect and understand the nuances of text that indicate bias along one of the dimensions. However, in a multitask setting, models can learn to generalize to understand what language characterizes bias across multiple dimensions: we particularly see this manifest in the TO dimension.
Performance by Dataset. The gender classifiers along the TO, AS and ABOUT dimensions are trained on a variety of different existing datasets across multiple domains. We analyze which datasets are the most difficult to classify correctly in Table 4. We find that ABOUT is the easiest dimension, particularly data from Wikipedia or based on Wikipedia, such as Funpedia and Wizard of Wikipedia, achieving almost 80% accuracy. This could be driven by the number of discussions about named entities, so classifying text ABOUT someone may be easier if a name is present.
The TO and AS directions are both more difficult, likely as they involve more context clues rather than relying on textual attributes and surface forms such as she and he to predict correctly. We find that generally the datasets have similar performance, except Yelp restaurant reviews, which has a higher accuracy (70%) on predicting AS.
Analysis of Classifier Performance. We break down choices made during training by comparing different models on Wikipedia (ABOUT dimension). We train with the variations of masking out gendered words and names. As gendered words and names are strongly correlated with gender, masking can force models into a more challenging but nuanced setting where they must learn to detect bias from the remaining text. We present the results in Table 5: masking out gendered words and names makes classification more challenging, but the model is still able to obtain high accuracy, indicating that gender bias is deeply embedded in the language used.

Applications
We demonstrate the broad utility of our multitask classifier by applying it to three different downstream applications. First, we show that we can use the classifier to control the genderedness of generated text. Next, we demonstrate its utility in biased text detection by applying it to Wikipedia to find the most gendered biographies. Finally, we evaluate our classifier on offensive text to explore the interplay between offensive text and gender.

Controllable Generation
By learning to associate control variables with textual properties, generative models can be controlled at inference time to adjust the generated text based on the desired properties of the user. This has been applied to a variety of different cases, including generating text of different lengths (Fan et al., 2018a), generating questions in chit-chat (See et al., 2019), and reducing bias (Dinan et al., 2020).
Previous work in gender bias used word lists to control bias, but found that word lists were limited in coverage and applicability to a variety of domains (Dinan et al., 2020). However, by decomposing bias along the TO, AS, and ABOUT dimensions, fine-grained control models can be trained to control these dimensions separately. This is important in various applications: for example, one may want to train a chatbot with a specific personality, leaving the AS dimension untouched, but want the bot to speak to and about everyone in a similar way. In this application, we train three different generative models, each of which controls generation for gender along one of the TO, AS, and ABOUT dimensions.

Table 6: Word statistics measured on text generated from 1000 different seed utterances from ConvAI2 for each control token. We measure the number of gendered words (from a word list) that appear in the generated text, and the percentage of masculine-gendered words among all gendered words.
Methods. We generate training data by taking the multitask classifier and using it to classify 250,000 textual utterances from Reddit, using a previously existing dataset extracted and obtained by a third party and made available on pushshift.io. This dataset was chosen as it is conversational in nature, but not one of the datasets the classifier was trained on. We then use the labels from the classifier to prepend each utterance with a token that indicates its gender label along the dimension. For example, for the ABOUT dimension, we prepend utterances with tokens ABOUT:<gender label>.
At inference time, we choose control tokens to manipulate the text generated by the model. We also compare to a baseline for which the control tokens are determined by a word list: if an utterance contains more masculine-gendered words than feminine-gendered words from the word list, it is labeled as masculine (and vice versa for feminine); if it contains no gendered words or an equal number of masculine- and feminine-gendered words, it is labeled as neutral. Following Dinan et al. (2020), we combine several existing word lists (Zhao et al., 2018b; Hoyle et al., 2019).
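The word-list baseline rule above can be sketched directly. `wordlist_control_token` and its arguments are our hypothetical names; the actual masculine/feminine word lists come from the cited prior work.

```python
def wordlist_control_token(utterance, masc_words, fem_words, dim="ABOUT"):
    """Baseline labeling: masculine if more masculine-list words than
    feminine-list words appear, feminine for the reverse, and neutral
    on a tie or when no gendered words are present."""
    tokens = utterance.lower().split()
    n_masc = sum(t in masc_words for t in tokens)
    n_fem = sum(t in fem_words for t in tokens)
    if n_masc > n_fem:
        label = "masculine"
    elif n_fem > n_masc:
        label = "feminine"
    else:
        label = "neutral"
    return f"{dim}:{label}"
```

The returned string matches the control-token format (e.g. `ABOUT:masculine`) prepended to utterances during training.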
For training, we fine-tune a large Transformer sequence-to-sequence model pretrained on a Reddit dump made freely available by pushshift.io. At inference time, we generate text via top-k sampling (Fan et al., 2018b) with k = 10, a beam size of 10, and 3-gram blocking. We force the model to generate a minimum of 20 BPE tokens.
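Top-k sampling restricts each decoding step to the k highest-scoring tokens before sampling. A minimal numpy sketch of a single sampling step (beam search, n-gram blocking, and the minimum-length constraint are omitted; the function name is ours):

```python
import numpy as np

def top_k_sample(logits, k=10, rng=None):
    """Sample a token id from the k highest-scoring logits only,
    as in top-k sampling (Fan et al., 2018b)."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=float)
    if k < len(logits):
        top = np.argpartition(-logits, k - 1)[:k]   # indices of the k largest
    else:
        top = np.arange(len(logits))
    shifted = logits[top] - logits[top].max()        # stable softmax
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return int(rng.choice(top, p=probs))
```

In practice this is applied at every decoding step over the model's vocabulary-sized logit vector.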
Qualitative Results. Example generations from various methods are shown in Appendix Table 11.
In these examples we see that controlling for gender along different dimensions yields highly varied responses, even for identical input. This illustrates why limiting control to word lists is not enough to capture different aspects of gender. For example, adjusting AS to 'feminine' causes the model to write No, I've been working. I don't think I can make friendships online, whereas setting ABOUT to 'feminine' for the same exact input causes the model to write I think the problem is she's a girl, so there's not a lot of opportunity to make friends.
We can also see from these examples how the genderedness of text differs along each axis when we switch between conditioning on masculine and feminine. For example, adjusting AS to 'feminine' causes the model to write Awwww, that sounds wonderful, whereas setting AS to masculine generates You can do it bro!

Quantitative Results. Quantitatively, we evaluate by generating 1000 utterances seeded from ConvAI2 using both masculine and feminine control tokens and counting the number of gendered words from a gendered word list that also appear in the generated text. Results are shown in Table 6.
Utterances generated using ABOUT control tokens contain many gendered words. One might expect this, as when one speaks ABOUT another person, one may refer to them using gendered pronouns. We observe that for the control tokens TO:feminine and AS:feminine, the utterances contain a roughly equal number of masculine-gendered and feminine-gendered words. This is a reflection of the distribution in the training data: the ConvAI2 and Opensubtitles data show similar trends. On the ConvAI2 data, fewer than half of the gendered words in SELF:feminine utterances are feminine-gendered, and on the Opensubtitles data, the ratio drops to one-third.

Table 7: Masculine genderedness scores of Wikipedia bios. We calculate a masculine genderedness score for a Wikipedia page by taking the median p_x = P(x ∈ ABOUT:masculine) among all paragraphs x in the page, where P is the probability distribution given by the classifier. We report the average and median scores for all biographies, as well as for biographies of men and women respectively.

Table 8: Genderedness of offensive content. We measure the percentage of utterances in both the "safe" and "offensive" classes that are classified as masculine-gendered, among utterances that are classified as either masculine- or feminine-gendered. We test the hypothesis that the safe and offensive classes' distributions of masculine-gendered utterances differ using a t-test and report the p-value for each dimension.

Bias Detection
Classifiers along our different dimensions can be used to detect gender bias in any form of text, beyond dialogue itself. Such methods are useful in applications such as detecting, removing, and rewriting biased writing. We investigate using the trained classifiers to detect the most gendered biographies in Wikipedia.

Methods.
We apply the multitask model to detect the most masculine- and feminine-gendered biographies in Wikipedia. For each paragraph among 65,000 biographies, we calculate the probability of being masculine in the ABOUT dimension. We then calculate a masculine genderedness score for the page by taking the median among all its paragraphs.
Quantitative Results. We report the average and median masculine genderedness scores for all biographies in the set of 65,000 (Table 7). We observe that while on average the biographies skew largely toward masculine (the average score is 0.74), the classifier is more confident in the femininity of pages about women than it is in the masculinity of pages about men: the average feminine genderedness score for pages about women is 1 − 0.042 = 0.958, while the average masculine genderedness score for pages about men is 0.90. This might suggest that biographies about women contain more gendered text on average.
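The page-level score is just the median of the per-paragraph classifier probabilities, with corpus-level averages over pages. A minimal sketch, assuming `pages` holds one list of P(ABOUT:masculine) values per biography (the function names are ours):

```python
from statistics import mean, median

def page_score(paragraph_probs):
    """Median of per-paragraph P(ABOUT:masculine) for one biography."""
    return median(paragraph_probs)

def corpus_stats(pages):
    """Average and median page-level scores across many biographies,
    as reported in Table 7."""
    scores = [page_score(p) for p in pages]
    return mean(scores), median(scores)
```

Sorting pages by `page_score` then yields the most masculine- and most feminine-gendered biographies used in the qualitative analysis.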
Qualitative Results. We show the pages with the minimum score (most feminine-gendered biographies) and the maximum score (most masculine-gendered biographies) in Table 12 in the Appendix. The most masculine-gendered biographies are mostly composers and conductors, likely due to the historical imbalance in these occupations. Amongst the most feminine-gendered biographies, there are many actresses from the mid-20th century. Examining the most feminine-gendered paragraphs, we find anecdotally that these often describe the subject's life after retirement. For example, Linda Darnell's biography contains the line Because of her then-husband, Philip Liebmann, Darnell put her career on a hiatus, which clearly reflects negative societal stereotypes about the importance of women's careers (Hiller and Philliber, 1982; Duxbury and Higgins, 1991; Pavalko and Elder Jr., 1993; Byrne and Barling, 2017; Reid, 2018).

Offensive Content
Finally, the interplay and correlation between gendered text and offensive text is an important area for study, as many examples of explicitly or contextually gendered text are disparaging or have negative connotations (e.g., "cat fight" and "doll"). Particularly for dialogue, neither form is desirable in the output of any chatbot system. There is a growing body of research on detecting offensive language in text. In particular, there has been recent work aimed at improving the detection of offensive language in the context of dialogue (Dinan et al., 2019a). We investigate this relationship by examining the distribution of labels output by our gender classifier on data that is labeled for offensiveness.
Methods. We use the Standard training and evaluation dataset from Dinan et al. (2019a). We examine the relationship between genderedness and offensiveness by labeling the gender of utterances (along the three dimensions) in the "safe" and "offensive" classes using our multitask classifier. We then measure the ratio of utterances labeled as masculine-gendered among utterances labeled as either masculine- or feminine-gendered.
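The ratio and the significance test can be sketched as follows. Welch's t statistic over per-utterance masculine indicators is one plausible instantiation of the paper's t-test (the paper does not specify the variant), and the function names are ours.

```python
import numpy as np

def masculine_ratio(labels):
    """Share of masculine among utterances labeled masculine- or
    feminine-gendered (neutral/unknown labels are excluded)."""
    masc = labels.count("masculine")
    fem = labels.count("feminine")
    return masc / (masc + fem)

def welch_t(x, y):
    """Welch's t statistic for a two-sample comparison with unequal
    variances; the paper reports the corresponding p-values."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    vx, vy = x.var(ddof=1), y.var(ddof=1)
    return (x.mean() - y.mean()) / np.sqrt(vx / len(x) + vy / len(y))
```

Here `x` and `y` would be 0/1 masculine indicators for the safe and offensive classes respectively.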
Quantitative Results. Results are shown in Table 8. We observe that, on the ABOUT dimension, both the safe data and offensive data are more likely to be masculine than feminine; however, the offensive data is relatively less likely to be masculine.
On the other hand, on the AS and TO dimensions, the safe data is more likely to be labeled as feminine and the offensive data is more likely to be labeled as masculine. We test the hypothesis that these distributions are unequal using a t-test, and find that these results are significant.
Qualitative Results. To explore how offensive content differs when it is ABOUT women and ABOUT men, we identified utterances for which the model had high confidence (probability > 0.70) that the utterance was feminine or masculine along the ABOUT dimension. After excluding stop words and words shorter than three characters, we hand-annotated the top 20 most frequent words as being explicitly gendered, a swear word, and/or bearing sexual connotation. For words classified as masculine, 25% of words fell into these categories, whereas for words classified as feminine, 75% of the words fell into these categories.
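The frequent-word extraction step can be sketched as follows; `STOPWORDS` is an illustrative subset (the paper does not specify its stop-word list) and the function name is ours.

```python
from collections import Counter

STOPWORDS = {"the", "and", "you", "that", "for", "with"}  # illustrative subset

def top_words(utterances, n=20, min_len=3):
    """Most frequent words after dropping stop words and words shorter
    than three characters, as in the qualitative analysis above."""
    counts = Counter()
    for utt in utterances:
        for tok in utt.lower().split():
            tok = tok.strip(".,!?\"'")
            if len(tok) >= min_len and tok not in STOPWORDS:
                counts[tok] += 1
    return [w for w, _ in counts.most_common(n)]
```

The resulting top-n lists (one for masculine-classified utterances, one for feminine-classified) are what the annotators then hand-labeled.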

Conclusion
We propose a general framework for analyzing gender bias along three dimensions: (1) gender of the person being spoken about (ABOUT), (2) gender of the addressee (TO), and (3) gender of the speaker (AS). We annotate eight large existing datasets and create an evaluation dataset for the task of detecting bias along each of these dimensions. We train classifiers (single-task and multitask) that demonstrate their broad utility through strong performance in controlling bias in dialogue, detecting genderedness in text such as Wikipedia, and highlighting gender differences in offensive text.

A.1 Existing Data Annotation
Many of our annotated datasets contain cases where the ABOUT, TO, and AS labels are not provided (i.e., unknown). For example, we often do not know the gender of the content creator of a Wikipedia page (i.e., the AS dimension is unknown). To retain such examples for training, we either impute the gender label or assign one at random. We apply the imputation strategy when the ABOUT label is unknown, using a classifier trained only on other Wikipedia data for which this label is provided. Data without a TO or AS label is assigned one at random, choosing between masculine and feminine with equal probability. From epoch to epoch, we switch these arbitrarily assigned labels so that the model learns to label unknown examples as masculine or feminine with roughly equal probability. This label flipping allows us to retain greater quantities of data by preserving unknown samples. During training, we balance the data across the masculine, feminine, and neutral classes by up-sampling classes with fewer examples.

We describe in more detail how each of the eight training datasets is annotated:

1. Wikipedia - to annotate ABOUT, we use a Wikipedia dump and extract biography pages. We identify biographies using named entity recognition applied to the title of the page (Honnibal and Montani, 2017). We label pages with a gender based on the number of gendered pronouns (he vs. she vs. they) and label each paragraph in the page with this label for the ABOUT dimension. Wikipedia is well known to have gender bias in equity of biographical coverage and lexical bias in noun references to women (Reagle and Rhue, 2011; Graells-Garrido et al., 2015; Wagner et al., 2015; Klein and Konieczny, 2015; Klein et al., 2016; Wagner et al., 2016), making it an interesting test bed for our investigation.
2. Funpedia -Funpedia (Miller et al., 2017) contains rephrased Wikipedia sentences in a more conversational way. We retain only biography related sentences and annotate similar to Wikipedia, to give ABOUT labels.  Table 9: Self-reported gender identities of annotators used to collect the new evaluation dataset MDGEN-DER. Annotators were given the option to not answer this question or to select "prefer not to say." labeled the gender of each persona (Dinan et al., 2020), giving us labels for the speaker (AS) and speaking partner (TO). We impute ABOUT labels on this dataset using a classifier trained on the datasets 1-4.

A.2 New Evaluation Dataset
The interface for our new evaluation dataset MDGENDER can be seen in Figure 2. Examples from the new dataset can be found in Table 10. This dataset was collected using crowdworkers from Amazon's Mechanical Turk. All workers are English-speaking and located in the United States. During the "re-write phase" (described in §4.1), crowdworkers were asked to provide their own gender identity if they were willing. Workers were given the option to not answer this question or to select "prefer not to say." Results from this survey are shown in Table 9. For privacy reasons, we do not associate the self-reported gender of the annotator with the labeled examples in the dataset and only report these statistics in aggregate. Over two-thirds of annotators identified as men, which may introduce its own biases into the dataset.

A.3 Applications
Example generations for various control tokens, as well as for our word list baseline, are shown in Table 11. See §6.1 on Controllable Generation in the main paper for more details.
The top 10 most gendered Wikipedia biographies are shown in Table 12. See §6.2 on Detecting Bias in the main paper for more details.

Figure 2: Annotation interface for collecting MDGENDER. Annotators were shown an utterance from a conversation and asked to re-write it such that it is clear they would be speaking about/to/as a man or a woman. They were then asked for their confidence level.

Table 10: Examples from MDGENDER. Crowdworkers were asked to re-write dialogue utterances such that most people would guess that the utterance was either said to, said by, or about a man or a woman. Afterwards, they were asked to give a confidence level in their re-write, meant to capture the difference between statistical bias (more men play football than women) and fact (you do not have to be a man to play football).