Unsupervised Discovery of Implicit Gender Bias

Despite their prevalence in society, social biases are difficult to define and identify, primarily because human judgements in this domain can be unreliable. Therefore, we take an unsupervised approach to identifying gender bias at a comment or sentence level, and present a model that can surface text likely to contain bias. The main challenge in this approach is forcing the model to focus on signs of implicit bias, rather than other artifacts in the data. Thus, the core of our methodology relies on reducing the influence of confounds through propensity score matching and adversarial learning. Our analysis shows how biased comments directed towards female politicians contain mixed criticisms and references to their spouses, while comments directed towards other female public figures focus on appearance and sexualization. Ultimately, our work offers a way to capture subtle biases in various domains without relying on subjective human judgements.


Introduction
Despite widespread documentation of the negative impacts of bias, stereotypes, and prejudice (Krieger, 1990; Goldin, 1990; Steele and Aronson, 1995; Logel et al., 2009; Schluter, 2018), these concepts remain difficult to define and identify, especially for non-experts. Social biases appear to be a natural component of human cognition that allows people to make judgments efficiently (Kahneman et al., 1982; Blair, 2002). As a result, they are often implicit: people are unaware of their own biases (Blair, 2002; Bargh, 1999), and they can manifest in subtle ways, such as through microaggressions, condescension, or even positive endorsements (Huckin, 2002; Sue, 2010).
Much literature in NLP has examined biases in data, algorithms, or model performance, and the negative pipeline between them: models often absorb and amplify biases in data, which impacts their performance (Sun et al., 2019). However, little work has looked further up the pipeline and relied on the assumption that biases in data originate in human cognition.
In contrast, this assumption motivates our work: an unsupervised approach to detecting and analyzing implicit gender bias in text. Text provides an ideal avenue for studying bias, because human cognition is closely tied to natural language. Social psychology studies often examine human perceptions through word associations, which can reveal implicit biases (Greenwald et al., 1998). However, the implicit nature of bias suggests that human annotations for bias detection may not be reliable, which motivates an unsupervised approach.
The goals of our work align with prior work in NLP that has examined or detected biases in real-world data. However, prior work examines bias at a broad corpus level or relies on supervised models. While corpus-level analyses, e.g., associations between gendered words and stereotypes, can be insightful (Bolukbasi et al., 2016; Fast et al., 2016; Caliskan et al., 2017; Garg et al., 2018; Friedman et al., 2019; Chaloner and Maldonado, 2019), they are difficult to interpret over short text spans. They also often rely on human-defined "known" stereotypes, such as lists of traditionally male and female occupations obtained through crowd-sourcing, which restricts analysis to a narrow surface-level domain.
Similarly, supervised approaches can provide insight into carefully defined types of bias (Wang and Potts, 2019; Breitfeller et al., 2019; Sap et al., 2020), but they rely on human annotations. The implicit nature of bias makes annotation tasks difficult to design or generalize to other domains, especially because social concepts differ across contexts and cultures (Dong et al., 2019).
In contrast, our work offers a new approach to surfacing gender bias that does not require direct supervision and is meaningful at a sentence or paragraph level. Our primary methodology involves training a model, without any bias annotations, to identify differences in text addressed towards men and women, thereby surfacing gender-biased comments. Machine learning models excel at learning patterns from data, so much so that removing stereotypes and bias from models has become a substantial component of NLP research (Bolukbasi et al., 2016; Webster et al., 2018; Rudinger et al., 2018; Zhao et al., 2018; Stanovsky et al., 2019). Rather than trying to mitigate this property of machine learning models, we instead leverage it in order to surface bias.
More specifically, we create a model that takes text in the 2nd-person perspective as input and predicts the gender of the person the text is addressed to. Then, if the classifier predicts the gender of the addressee with a high degree of confidence based only on the text directed to them, we hypothesize that the text is likely to contain bias. The main challenge is encouraging the model to focus on text features that are indicative of bias, rather than artifacts in data that correlate with the gender of the addressee but occur because of confounding variables and are not indicative of bias (henceforth, confounds). Thus, the core of our methodology focuses on reducing the influence of confounds in the data. Our goal is not to improve accuracy of the gender-prediction task, but rather to validate that this methodology successfully demotes confounds and surfaces comments that are likely to contain gender bias.
In §2, we define the problem and intuition behind our approach more clearly. We describe our methods for confound demotion in §3, and we evaluate them in §5. Our evaluation involves examining how controlling for confounds affects performance on in-domain and out-of-domain classification tasks, including detection of gender-based microaggressions. In §6, we analyze the patterns our model learns. Our results suggest that our model successfully identifies text likely to contain bias, allowing us to analyze how gender bias differs in different domains. To the best of our knowledge, this is the first work that aims to analyze bias in short text spans by learning implicit associations from data sets, rather than relying on human annotations.

Problem Formulation
Our primary task is to detect gender bias in a communicative domain, specifically in texts targeting an addressee (i.e., 2nd-person perspective), without relying on explicit bias annotations. Our goals align with a causality framework in that we seek to identify content in the text that occurs because of the gender of the addressee rather than because of other factors. We can define a counterfactual: Would the addressee have received different text if the addressee's gender were different? While our framework is broadly applicable, in order to define consistent notation, we consider a setup where our primary text is a comment written in reply to text written by someone else. This includes domains like replies on social media posts, comments on newspaper articles, and book reviews, and can be further generalized to videos and images, e.g., comments on YouTube videos. In this setup, we identify the following variables:
• OW: "Original Writer", the person who wrote the original text, i.e., the addressee
• O TEXT: the content of the original text
• W GENDER: the gender (M or F) of the original writer
• W TRAITS: any traits of the original writer other than gender, including social role, political affiliation, age, nationality, etc. We specifically avoid enumerating these traits.
• COMMENT TEXT: the text of comments replying to O TEXT
Our goal is to detect bias in COMMENT TEXT values that occurs because of W GENDER. A naive approach would involve training a classifier to predict W GENDER from COMMENT TEXT and assuming that any COMMENT TEXT values for which the classifier correctly predicts W GENDER with high confidence contain bias. However, COMMENT TEXT may contain features that are predictive of W GENDER but are not indicative of bias.
For example, in Figure 1, when the comment "You're so pretty!" (COMMENT TEXT) is addressed to someone who said "I love tennis!" (O TEXT) , it is an objectification and unsolicited reference to the person's appearance, which could be a sign of bias. However, when it is addressed to someone who said "Do I look ok?", it is likely not indicative of bias. Then, if women ask "Do I look ok?" more frequently than men, this naive classifier would learn to predict "You're so pretty!" is likely addressed towards a woman and identify these types of comment as biased. However, we only want the model to learn that references to appearance are indicative of gender if they occur in unsolicited contexts. Thus our model needs to account for the effects of O TEXT: Because of correlations between W GENDER and O TEXT, COMMENT TEXT values may contain features that are predictive of W GENDER, but are caused by O TEXT, rather than by W GENDER.
We face a similar problem with W TRAITS, which consist of any traits of the original writer, other than gender. Figure 1 shows a synthetic example: suppose our data set contains more men from Canada than women. The model might learn that references to Canada are predictive of W GENDER = M, but this pattern is reflective of a spurious correlation in the data, rather than a sign of bias. We provide empirical examples in §4.
We refer to these factors, which might influence COMMENT TEXT as confounding variables and the artifacts that they produce in COMMENT TEXT as confounds. We distinguish two types: observed and latent or unobserved. Latent confounding variables cannot be controlled if they are entirely unknown; instead, we assume there are observed signals that can be used to infer them, but the values themselves are difficult or impossible to explicitly enumerate. In addition to confounds introduced by O TEXT and W TRAITS, COMMENT TEXT may also contain overt signals, e.g. titles like "Ma'am" or "Sir", that are predictive of gender, but not indicative of bias. We thus identify 3 factors that need to be controlled in order to detect bias: O TEXT, W TRAITS, and overt signals.

Methodology
Our overall methodology centers on creating a classifier that predicts gender of the addressee while controlling for the effects of observed confounding variables (O TEXT), latent confounding variables (W TRAITS), and overt signals. The input to the prediction model is COMMENT TEXT, while the output is W GENDER, and we aim to identify bias in COMMENT TEXT.

Controlling Observed Confounding Variables through Propensity Matching
Our primary method for controlling for O TEXT is propensity matching. Propensity matching originates in causal inference studies and was developed to replicate the conditions of randomized trials (Rosenbaum and Rubin, 1983, 1985). In this preprocessing step, we discard any COMMENT TEXT training samples whose associated O TEXT is heavily affiliated with only one gender. In the example in Figure 1, if we assume that only women post "Do I look ok?", we would discard all comments posted in reply to the O TEXT "Do I look ok?". We ultimately seek to balance our data set, so that the set of all COMMENT TEXT where W GENDER = M has similar associated O TEXT as the set of all COMMENT TEXT where W GENDER = F. Thus, we match each O TEXT where W GENDER = F with a similar O TEXT where W GENDER = M and discard all unmatched data. Ideally, we would match O TEXT values written by men with identical O TEXT values written by women, but this is infeasible in practice. Instead, the key insight behind propensity matching is that it is sufficient to match data points based on the probability of the target variable, e.g., the probability that W GENDER = F (Rosenbaum and Rubin, 1983, 1985; Stuart, 2010).
More specifically, the propensity score e_i for an individual COMMENT TEXT_i is defined as the probability that W GENDER = F, given the confounding variable, O TEXT_i:

$$e_i = P(\text{W GENDER} = F \mid \text{O TEXT}_i)$$

To balance our data set, we need to ensure that the set of COMMENT TEXT where W GENDER = M has a similar propensity score distribution as the set of COMMENT TEXT where W GENDER = F. Because propensity scores depend only on O TEXT, all COMMENT TEXT values written in reply to the same O TEXT have the same propensity score. We can then equate e_i(COMMENT TEXT_i) = e_i(O TEXT_i), and focus on estimating scores for O TEXT values.
Propensity scores can be estimated by using a classification model, such as a neural network or logistic regression classifier, that is trained to predict the target attribute W GENDER = F from the observed confounding variable O TEXT_i (Westreich et al., 2010; Lee et al., 2010). We use a bidirectional LSTM encoder followed by two feedforward layers with a tanh activation function and a softmax in the final layer. Then, we use greedy matching to match each O TEXT_i where W GENDER = F with an O TEXT_j where W GENDER = M whose propensity score e_j is close to e_i. Thus, for example, we would match a post written by a woman that is "stereotypically female" (i.e., e_i is large) with a post written by a man that is also "stereotypically female" (i.e., e_j is also large). In the synthetic example in Figure 1, we match the post "Tennis is great" with the post "I love tennis", and we discard the post "Do I look ok?" as unable to be matched. However, using propensity matching rather than direct matching allows us to match O TEXT values that are about different topics, as long as they are equally likely to have been written by a woman.
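The matching step can be illustrated with a minimal sketch. It assumes propensity scores have already been estimated, and it is not our exact implementation: the function name `greedy_match`, the `caliper` threshold, and the toy scores are hypothetical choices made for illustration only.

```python
from typing import List, Tuple


def greedy_match(
    female_scores: List[float],
    male_scores: List[float],
    caliper: float = 0.1,
) -> List[Tuple[int, int]]:
    """Greedily pair each female-authored post with the closest unused male-authored post.

    Scores are propensity estimates e = P(W GENDER = F | O TEXT).
    `caliper` (hypothetical) is the largest score gap accepted for a match.
    Returns (female_index, male_index) pairs; unmatched posts are discarded.
    """
    unused = set(range(len(male_scores)))
    pairs: List[Tuple[int, int]] = []
    for i, e_i in enumerate(female_scores):
        if not unused:
            break
        # Closest remaining male-authored post by propensity score (ties: lowest index).
        j = min(unused, key=lambda m: (abs(male_scores[m] - e_i), m))
        if abs(male_scores[j] - e_i) <= caliper:
            pairs.append((i, j))
            unused.remove(j)
    return pairs


# Toy scores mirroring Figure 1: "I love tennis" (F) and "Do I look ok?" (F)
# against three male-authored posts, one of which is "Tennis is great".
female = [0.55, 0.98]
male = [0.50, 0.60, 0.52]
matches = greedy_match(female, male)
```

In this toy run the first female-authored post is paired with the male-authored post scored 0.52, while the post scored 0.98 has no sufficiently close counterpart and is dropped, which is the behavior described for "Do I look ok?".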

Controlling Latent Confounding Variables through Adversarial Training
While propensity matching is a desirable method for controlling for confounding variables because of established literature and theoretical grounding, matching is only intended to reduce biases in observed confounding variables. When confounding variables cannot be directly measured, they must be addressed in other ways (Gu and Rosenbaum, 1993; Rosenbaum, 1988). In our data, while O TEXT is observed, W TRAITS is not possible to enumerate and match on (we provide further discussion of this in §4).
Instead of matching, we use an adversarial objective to encourage the classification model to ignore W TRAITS. Our method, which we present next, is drawn from Kumar et al. (2019).
Confound representation While we cannot explicitly enumerate W TRAITS, we know that they are associated with the identity of OW, and we can infer them based on COMMENT TEXT addressed to OW. We use associations between OW and COMMENT TEXT to derive a feature vector for each COMMENT TEXT_i that is reflective of W TRAITS_i. Specifically, the latent confounds to demote are represented as multinomial probability distributions, derived from log-odds scores with Dirichlet priors (Monroe et al., 2008).
For each label OW = k and each word type w in all COMMENT TEXT, we calculate the log-odds score lo(w, k) ∈ R, where higher scores indicate stronger associations between OW and the word. In the example in Figure 1, lo(Canada, Person 1) would be high, as COMMENT TEXT values addressed to Person 1 often contain the word Canada. Then, following Kumar et al. (2019), we define a distribution for all k ∈ OW and an input COMMENT TEXT_i = w_1, ..., w_n:

$$p(k \mid \text{COMMENT TEXT}_i) \propto p(k) \prod_{j=1}^{n} p(w_j \mid k)$$

Here p(k) is estimated from the distribution of k in the training data, i.e., the proportion of COMMENT TEXT values addressed to OW = k, and p(w_j | k) is proportional to σ(lo(w_j, k)), where we first use the sigmoid function (σ) to map log-odds scores to the range [0,1] and then normalize them over the vocabulary to obtain valid probabilities. For each input COMMENT TEXT_i, we then obtain a vector whose elements are p(k | COMMENT TEXT_i) and whose dimensionality is the number of OW individuals in the training set. We normalize these vectors to obtain multinomial probability distributions which reflect the association of COMMENT TEXT_i with each OW individual. Thus, when we demote this vector during training, we force the classifier to learn features that are indicative of the group W GENDER and not to learn features that are indicative of individual members of this group (e.g., some group members are from Canada). We refer to the confound vector as t_i. Justification for the log-odds representation as opposed to alternatives is presented in Kumar et al. (2019).
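The construction of the confound vector can be sketched as follows. This is an illustration under stated assumptions rather than our exact pipeline: it uses a symmetric Dirichlet prior (`prior`, a hypothetical value), compares each writer against all other writers, skips out-of-vocabulary tokens, and the names `log_odds` and `confound_vector` as well as the toy counts are invented for the example.

```python
import math
from collections import Counter
from typing import Dict, List


def log_odds(counts: Dict[str, Counter], prior: float = 0.01) -> Dict[str, Dict[str, float]]:
    """z-scored log-odds lo(w, k) of each word for writer k versus all other writers."""
    vocab = sorted({w for c in counts.values() for w in c})
    total: Counter = Counter()
    for c in counts.values():
        total.update(c)
    a0 = prior * len(vocab)  # total prior mass
    n_all = sum(total.values())
    scores: Dict[str, Dict[str, float]] = {}
    for k, c in counts.items():
        n_k = sum(c.values())
        n_rest = n_all - n_k
        scores[k] = {}
        for w in vocab:
            y_k = c[w] + prior
            y_rest = total[w] - c[w] + prior
            delta = (math.log(y_k / (n_k + a0 - y_k))
                     - math.log(y_rest / (n_rest + a0 - y_rest)))
            scores[k][w] = delta / math.sqrt(1.0 / y_k + 1.0 / y_rest)
    return scores


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def confound_vector(tokens: List[str],
                    scores: Dict[str, Dict[str, float]],
                    p_k: Dict[str, float]) -> Dict[str, float]:
    """Normalized p(k | comment), proportional to p(k) * prod_j p(w_j | k)."""
    log_post = {}
    for k, lo in scores.items():
        norm = sum(sigmoid(v) for v in lo.values())  # normalize over the vocabulary
        lp = math.log(p_k[k])
        for w in tokens:
            if w in lo:  # assumption: out-of-vocabulary tokens are skipped
                lp += math.log(sigmoid(lo[w]) / norm)
        log_post[k] = lp
    top = max(log_post.values())
    unnorm = {k: math.exp(v - top) for k, v in log_post.items()}
    z = sum(unnorm.values())
    return {k: v / z for k, v in unnorm.items()}


# Toy word counts over comments addressed to two writers.
counts = {
    "person1": Counter({"canada": 5, "great": 2}),
    "person2": Counter({"vote": 4, "great": 3}),
}
scores = log_odds(counts)
t = confound_vector(["greetings", "from", "canada"], scores,
                    {"person1": 0.5, "person2": 0.5})
```

In this toy setting, a comment mentioning "canada" receives more probability mass for `person1`, so the resulting vector encodes exactly the kind of individual-level association that the adversary is later asked to predict.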
Training Procedure Our overall goal is to obtain a model that can predict the target attribute W GENDER, but that cannot predict the latent confounds represented by t_i. To achieve this, the model is trained in an alternating GAN-like procedure (Goodfellow et al., 2014).
First, the input x ∈ COMMENT TEXT is encoded using an encoder neural network h(x; θ_h) to obtain a hidden representation h_x. This representation is then passed through two feedforward networks: (1) c(h(x); θ_c) to predict the label y ∈ {M, F}; and (2) an adversary network adv(h(x); θ_a) to predict the vector representation of the latent confounds.
We train the encoder so that the encoded representation h_x does not contain any information predictive of the confound vector, but does contain information predictive of the target attribute. We assess whether the encoder representation contains predictive information by training the adversary network to predict the confound vector t_i from the encoded input h(x_i). Our two training objectives follow Kumar et al. (2019): in the first (Eq. 1), the adversary is trained to predict t_i from h(x_i); in the second (Eq. 2), the encoder and classifier are trained to predict y_i while keeping the adversary's output uninformative. In these objectives, U represents a uniform distribution, CE represents cross-entropy loss, and KL represents KL-divergence. Thus, Eq. 2 seeks a representation h(x_i) which is maximally predictive of the target attribute y_i but not of confound vector t_i. We refer to Kumar et al. (2019) for the training procedure that alternates minimizing each objective.

Table 1: Data Statistics. "Matched train size" refers to the size of the training set after propensity matching, and "dem. dim." refers to the size of the latent confound vector that is demoted during training.
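For orientation, the two alternating objectives can plausibly be written as below. This is a sketch assembled from the definitions of adv, c, h, U, CE, and KL given in this section; the argument order inside KL, the averaging over N training examples, and the equal weighting of the two terms in Eq. 2 are assumptions, and Kumar et al. (2019) should be consulted for the exact statement.

```latex
% Eq. 1 (assumed form): the adversary learns to predict the confound vector t_i
\min_{\theta_a} \; \frac{1}{N} \sum_{i=1}^{N}
  \mathrm{KL}\big( t_i \,\|\, adv(h(x_i); \theta_a) \big)

% Eq. 2 (assumed form): encoder and classifier predict y_i while pushing
% the adversary's output towards the uniform distribution U
\min_{\theta_h, \theta_c} \; \frac{1}{N} \sum_{i=1}^{N}
  \Big[ \mathrm{CE}\big( c(h(x_i); \theta_c),\, y_i \big)
      + \mathrm{KL}\big( adv(h(x_i); \theta_a) \,\|\, U \big) \Big]
```

Under this reading, the encoder is never asked to maximize the adversary's loss directly; it is only asked to make the adversary's prediction indistinguishable from a uniform guess over OW individuals.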

Overt Signals
Finally, we are interested in identifying subtle indicators of bias, rather than overtly gendered language. We control for overt signals using word substitutions that replace gendered terms with more neutral language, for example congresswoman → congressperson and congressman → congressperson. We manually create the list of substitution words from existing resources (Zhao et al., 2018; Bolukbasi et al., 2016) as well as our observations of the data. We ultimately use 66 substitutions for replacing overtly gendered terms. We additionally use word substitutions to remove the name of the addressee from each comment. For all data where the name of OW is "Firstname Lastname", we replace "Firstname" and "Lastname" with " name " in COMMENT TEXT. We do not attempt to identify nicknames, as the confound demotion method described in §3.2 should already mitigate the influence of individual names, and we perform the substitution merely as an extra precaution.
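The substitution step can be sketched with a single regular-expression pass. The two-entry dictionary below is only an illustrative subset (the full 66-entry list is not reproduced here), and the function name `neutralize` and the placeholder token `<name>` are hypothetical choices for this example.

```python
import re
from typing import Dict

# Illustrative subset only; not the full substitution list.
SUBSTITUTIONS: Dict[str, str] = {
    "congresswoman": "congressperson",
    "congressman": "congressperson",
}

NAME_TOKEN = "<name>"  # hypothetical placeholder token


def neutralize(comment: str, writer_name: str,
               subs: Dict[str, str] = SUBSTITUTIONS) -> str:
    """Lowercase a comment, replace gendered terms, and mask the addressee's name."""
    text = comment.lower()
    mapping = dict(subs)
    # Mask each part of "Firstname Lastname" separately; nicknames are not handled.
    for part in writer_name.lower().split():
        mapping[part] = NAME_TOKEN
    if not mapping:
        return text
    # Longest keys first so that longer terms win over their prefixes.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)


cleaned = neutralize("Thank you Congresswoman Warren, Elizabeth rocks",
                     "Elizabeth Warren")
```

Using word boundaries keeps the substitution from firing inside longer words, and doing all replacements in one pass avoids a replacement being rewritten by a later rule.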

Experimental Setup
Our primary data is the Facebook subsection of the RtGender corpus (Voigt et al., 2018). This data set consists of Facebook posts written by well-known people and comments written as replies to those posts. The data is further divided into two subsections: Politicians, which contains posts and comments from then-current U.S. members of Congress, and Public Figures, which contains posts and comments from people such as actresses, novelists, and tennis players.
Table 2 suggests that female politicians post more frequently about sexual assault than male politicians. If we do not control for this difference, the model may predict that comments using sexual language are more likely to be addressed towards female politicians. However, increased sexual language in COMMENT TEXT with W GENDER = F may occur because of the content in O TEXT, rather than indicating gender bias.
A similar problem occurs with W TRAITS. For instance, the corpus has a much larger number of comments addressed to female tennis players (9 players; 184K comments) than male players (1 player; 29K comments). Then, the model can obtain high accuracy by predicting that all COMMENT TEXT with the word "tennis" have W GENDER = F. Unlike O TEXT, which is observable from the data, we have no way of enumerating every possible value in W TRAITS.
Furthermore, even if we could enumerate traits for all of the people in the data set, we do not expect propensity matching over W TRAITS to sufficiently balance the data set, because we cannot find reasonable matches. For example, there is only one current senior senator from Massachusetts. Additionally, W TRAITS values can also be as fine-grained as individual names: we cannot find a matching male senator whom commenters address as "Elizabeth Warren".
We divide each data set into train, dev, and test sets, ensuring that there is no OW overlap between subsets. All data is lowercased and tokenized, and we discard data points with fewer than 4 tokens. Table 1 reports data statistics.
When using propensity matching, we perform matching only over the training data. Similarly, we derive the confound vectors to demote using only the training data, leaving the test set untouched. However, we do apply word substitutions to the test set. We use the same model architectures as Kumar et al. (2019), including training multiple adversaries.

Evaluation
We train our model to predict W GENDER from COMMENT TEXT, employing propensity matching over O TEXT, word substitutions over COMMENT TEXT, and latent confound demotion during training. We primarily focus on evaluating how well our model controls for confounds and whether or not it captures gendered language. Successful demotion of confounds would suggest that our model learns to identify text indicative of gender bias.

Observed Confounding Variable Demotion
In Figure 2, we show log-odds scores, measuring association between O TEXT and W GENDER, in the training set and after applying propensity matching. For comparison, we also show log-odds scores for a randomly matched data set, in which we balance O TEXT to have an equal proportion of F and M labels by random sampling. We construct the random set to be the same size as the propensity matched set.
In both the Politicians and Public Figures data sets, propensity matching reduces the magnitude of the most polar words, in that the log-odds scores for the matched data are closer to zero than for the non-matched data or the randomly matched data. These polarities were reduced without producing new ones: in the Politicians data, the scores of the 2 most polar words shrank from -34.0 and 17.9 to -7.68 and 8.52, and in the Public Figures data, they shrank from -45.5 and 39.3 to -5.29 and 9.43. Further, propensity matching can even cause the polarity to change direction: words that are female-associated in the original data (e.g., "her") are slightly male-associated in the matched data. These figures suggest that propensity matching effectively reduces the influence of the confounding variable O TEXT.
Table 3 reports results on the gender-prediction task, where W GENDER = F is considered the positive class. As expected, models with demotion perform best on all metrics, with the exception of recall in the Politicians data. The discrepancies between F1 and Accuracy are explained by the imbalance in the data set, particularly in the Politicians data set, which is imbalanced in favor of M while we report metrics assuming F is the positive class.

Detection of Sexist Comments
Finally, we evaluate to what extent our model captures gender-biased language in text by using it to make predictions over an out-of-domain gendered language task. Specifically, we use it to identify gender-based microaggressions, which are subtle manifestations of bias, such as "you're too pretty to be a computer scientist!". This task is notoriously difficult because words like "pretty" often register as positive content, rather than as indicative of bias (Breitfeller et al., 2019). Our goal is not to maximize accuracy over microaggression classification, but rather to assess whether or not our model has encoded any indicators of gender bias from the RtGender data set, which would be indicated by better-than-random performance.
We use a corpus of self-reported microaggressions, in which posters describe a microaggression using quotes, transcripts, or narrative text, and these posts are tagged with the type of bias expressed, such as "gender", "ableism", "race", etc. We discard all posts that contain only narrative text, since it is not in the 2nd-person perspective and is thus very different from our training data. In the absence of negative examples that contain no microaggressions, we instead focus on distinguishing gender-tagged microaggressions (704 posts) from other forms of microaggressions (900 posts). We train our model on either the Politicians or Public Figures training data sets, and then we test our model on the microaggressions data set. Because most gender-related microaggressions target women, if our model predicts that the reported microaggression was addressed to a woman (i.e., W GENDER = F), we assume that the post is a gender-tagged microaggression. Thus, our models are not trained at all for the task of identifying gender-tagged microaggressions.
Table 4 shows results. For comparison, we also show results from two random baselines. In "Random", we guess gender-tagged or not with equal probability. In "Class Random", we guess gender-tagged or not according to the true test distribution (43.89% gender-tagged and 56.11% other). All models outperform the class random baseline, and all models with demotion also outperform the true random baseline.
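As a worked illustration of the two baselines, the sketch below computes their expected scores from the class counts reported above, treating gender-tagged posts as the positive class. The function name `expected_scores` is hypothetical, and the values are expectation-level approximations (ratios of expected counts), not the figures reported in Table 4.

```python
from typing import Dict


def expected_scores(p_pos: float, q_guess: float) -> Dict[str, float]:
    """Expected metrics when guessing 'positive' with probability q_guess,
    independently of the true label, on data with positive rate p_pos."""
    tp = p_pos * q_guess
    fp = (1.0 - p_pos) * q_guess
    fn = p_pos * (1.0 - q_guess)
    tn = (1.0 - p_pos) * (1.0 - q_guess)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2.0 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": tp + tn}


gender_tagged, other = 704, 900
p_gender = gender_tagged / (gender_tagged + other)

random_baseline = expected_scores(p_gender, 0.5)             # "Random"
class_random_baseline = expected_scores(p_gender, p_gender)  # "Class Random"
```

Under these approximations, "Class Random" has slightly higher expected accuracy but lower expected F1 than "Random", because it guesses the positive class less often; this may be one reason the uniform baseline is the harder one to beat on F1-style metrics.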
Propensity matching improves performance when training on the Politicians data, but not the Public Figures data. There are several data differences that could account for this. First, the Public Figures set is smaller, so propensity matching causes a more substantial reduction in size. Additionally, the Politicians data is more heavily imbalanced, though notably, it is imbalanced in the same direction as the microaggressions data, while the Public Figures data is imbalanced oppositely. Finally, many of the microaggressions contain references to appearance, which are also common in the Public Figures data. Many comments to people like actresses focus on their looks, especially because they often post photos. However, by controlling for O TEXT, propensity matching substantially reduces the prevalence of these comments. Thus, by demoting a confounding variable, we make the prediction task more difficult, since the effects of this variable are correlated with the target attribute. In general, the goal of confound demotion is not to improve accuracy, but rather to increase confidence in model predictions.
Nevertheless, the general better-than-random performance of all models is striking, as it suggests strong bias in the underlying training data, which is encoded by our models. Additionally it suggests a strong female component to the self-reported microaggressions: without any explicit training data, we can identify microaggressions simply by assuming they are comments addressed to women.

Analysis of Encoded Bias
Finally, we analyze what type of bias our model learns through several analysis methods. First, we identify words that most impact the model's prediction score; second, we compare posts surfaced by our model with prior work on stereotype detection; and third, we show example posts surfaced by our model. We focus on analyzing posts addressed towards women. Throughout this section, we use prediction score to refer to the output of the final softmax layer of the prediction model, and we take this score as an estimate of model confidence in the predicted gender. We generally focus on the subsection of data for which our model predicts W GENDER = F with a high prediction score. These posts are the ones our model identifies as most likely to contain bias against women: despite the matching and demotion methods, the model still predicts W GENDER = F with high confidence.

Table 4: Evaluation over the microaggressions data set. Despite not being trained for this task, our models achieve better-than-random performance.
Influential words We first identify words that strongly influence the model's decisions by masking out words from comments in the test set and examining the impact on prediction score. For each data set, we take the 500 comments from the test set for which the model predicts W GENDER = F with the highest confidence, meaning posts with maximal prediction scores. We then generate masked versions of each post: for every word w in the post, we generate a version of the post that omits w. We run these masked posts through our gender-prediction model and compare the prediction scores where w is omitted and where w is not omitted, averaging across all occurrences of w in the 500 posts. We then examine the set of words w with the highest differential in prediction score: these are the words that, when omitted, cause the model to associate W GENDER with F less strongly. In the Public Figures data, the influential word list is dominated by appearance-driven and sexualized language: "beautiful", "bellissima", "amore", "amo", "love", "linda", "sexo". In contrast, influential words in the Politicians data are more mixed. Words include references to strength and competence, such as "force" and "situation", as well as traditionally domestic terms, such as " spouse ", "family", "love". When we repeat this process using the 500 highest-confidence posts from the training set instead of the test set, we find similar results. Influential words in the Public Figures training data primarily refer to appearance, while influential words in the Politicians training data focus more on political issues, such as "dino" ("Democrat in Name Only"), an insult used to accuse politicians of not being liberal enough.

Table 5 (excerpt). O TEXT: Bob and I join Bill Hemmer on America's Newsroom to discuss whether or not... COMMENT TEXT: I like Bob, but you're hot, so kick theirs butt.
The influential word set from the training data also includes some correlative terms, like names of counties and states, that we would expect the latent confound demotion part of our model to deemphasize. While the results presented in §5 suggest that our model successfully reduces the influence of confounding variables, more work is needed to eliminate them completely.
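The masking analysis for influential words can be sketched as a leave-one-word-out loop. The sketch assumes a scoring function that returns the model's prediction score for W GENDER = F; the function names and the toy scorer standing in for the trained model are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def word_influence(comments: List[List[str]],
                   score: Callable[[List[str]], float]) -> Dict[str, float]:
    """Average drop in prediction score when each occurrence of a word is omitted."""
    drops = defaultdict(list)
    for tokens in comments:
        full = score(tokens)
        for i, w in enumerate(tokens):
            masked = tokens[:i] + tokens[i + 1:]  # omit this occurrence of w
            drops[w].append(full - score(masked))
    return {w: sum(d) / len(d) for w, d in drops.items()}


def toy_score(tokens: List[str]) -> float:
    # Stand-in for the trained classifier's softmax output for W GENDER = F.
    return 0.9 if "beautiful" in tokens else 0.5


influence = word_influence([["you", "are", "beautiful"]], toy_score)
ranked = sorted(influence, key=influence.get, reverse=True)
```

Words at the top of `ranked` are those whose removal most reduces the score for W GENDER = F, which corresponds to the influential word lists discussed above.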
Comparison to stereotype lexicons In order to better understand the trends in the influential word list, we draw from prior work on stereotype detection and align our model's predictions with existing stereotypes (Fast et al., 2016). For each data set, we take the set of test comments for which our model predicts W GENDER = F with a high prediction score (0.99 for Public Figures and 0.95 for Politicians). Then, we compare the frequency of words from a stereotype lexicon (Fast et al., 2016) in this high-confidence prediction set with their frequency in a random sample of the same number of comments for which the true value of W GENDER = F. Figure 3 reports results. This figure generally reflects the same trends observed in the influential word lists. In the Public Figures data, the lexicons that overlap the most with the high-bias posts are "beautiful", "arrogant", and "sexual". These lexicons suggest that the type of bias in comments directed towards public figures like actresses and tennis players focuses on appearance and sexualization. In contrast, bias in comments directed towards politicians is less focused, and the overall differences between the high-confidence prediction posts and the random sample are smaller. The two most prominent lexicons are "arrogant" (primarily driven by lexicon words "special" and "proud") and "strong". Notably, in examining lexicon words, we do not account for negation. In manually examining the Politicians data, a narrative of power is reflected in surfaced comments, like "you & Nikki Haley lost my vote on the flag issue your both weak". We provide more examples surfaced by our model in Table 5.
Because the stereotype lexicons are relatively small, and scores can be dominated by a few words, we also compare LIWC scores (Pennebaker et al., 2001). While most LIWC categories are too broad to align with well-known stereotypes, results are consistent with Figure 3; specifically the high-bias data scores higher than the random sample in the Public Figures data, for the "sexual" (0.32 vs. 0.10) and "body" (0.70 vs. 0.56) categories, but this trend does not exist in the Politicians data. In the Politicians data, the high-bias comments score lower than the random sample in the "drives" dimension (8.76 vs. 9.71), which encompasses Affiliation, Achievement, Power, Reward focus, and Risk focus.
The difficulty in evaluating our model against existing lexicons, as well as the differences between the two data sets, motivates our goal of learning to detect bias automatically. Bias can differ in different contexts, making it difficult to crowdsource through annotations or define through lexicons. In Table 5, we show examples of comments surfaced by our model. We identify these comments by selecting posts where O TEXT is not strongly gendered; we use a model trained to predict W GENDER from O TEXT (the same model used in §3 for propensity matching) and choose O TEXT values where the model outputs a prediction score of < 0.6. We additionally discard all O TEXT posts that contained photo or video attachments, leaving text-only posts. We then take all COMMENT TEXT posted in reply to this subset for which our primary model predicts W GENDER with a > 0.9 prediction score. Thus, we identify highly-gendered comments posted in reply to low-gendered posts. Table 5 shows selected examples from the training and test sets. While posts from the Politicians data are more diverse, posts from the Public Figures data are primarily about appearance. These comments serve as examples of the broader trends shown in the influential word lists and in Figure 3.
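The selection procedure for surfaced comments can be sketched as a two-stage filter. This is an illustrative sketch rather than our exact pipeline: the `Post` and `Comment` records, their field names, and the function name are hypothetical, and we assume each prediction score is the corresponding model's confidence in its predicted W GENDER label.

```python
from dataclasses import dataclass


@dataclass
class Post:
    post_id: str
    otext_score: float  # confidence of the O TEXT -> W GENDER model (assumed field)
    has_media: bool     # True if the post had a photo or video attachment


@dataclass
class Comment:
    post_id: str        # the post this comment replies to
    text: str           # COMMENT TEXT
    gender_score: float # confidence of the primary model's W GENDER prediction


def surface_comments(posts, comments, post_max=0.6, comment_min=0.9):
    """Return highly-gendered comments written in reply to low-gendered posts."""
    # Stage 1: keep text-only posts whose O TEXT is not strongly gendered.
    low_gendered = {
        p.post_id
        for p in posts
        if p.otext_score < post_max and not p.has_media
    }
    # Stage 2: keep replies to those posts that the primary model scores highly.
    return [
        c
        for c in comments
        if c.post_id in low_gendered and c.gender_score > comment_min
    ]
```

Filtering on the post first means that any strong gender signal in the retained comments is less likely to be explained by the content of the post being replied to.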

Related Work
Our work differs from prior work on bias detection in NLP in that we infer bias from data in an unsupervised way, whereas prior work relies on crowd-sourced annotations (Fast et al., 2016; Bolukbasi et al., 2016; Wang and Potts, 2019; Sap et al., 2020). This prior work typically focuses on specific types of bias, such as condescension (Wang and Potts, 2019) or microaggressions (Breitfeller et al., 2019), and involves carefully constructed annotation schemes that are difficult to generalize to other data sets or types of bias. In contrast, our unsupervised approach is not limited to any particular domain and does not rely on human annotations, which can be subjective.
Less-supervised approaches focus on corpus-level analyses, such as associations between gendered terms and occupational stereotypes (Wagner et al., 2015; Bolukbasi et al., 2016; Fu et al., 2016; Joseph et al., 2017; Nakandala et al., 2017; Friedman et al., 2019; Chaloner and Maldonado, 2019; Hoyle et al., 2019). Methodologies for identifying gender-related differences in text have varied, including word-embedding similarity (Bolukbasi et al., 2016), language model perplexity (Fu et al., 2016), and predictive words identified by logistic regression (Nakandala et al., 2017). These metrics are meaningful at the corpus level, but are often difficult to interpret over short text spans. Additionally, none of these methods focuses on controlling for confounds.
While matching is a well-established method for controlling for confounding variables in the causality literature (Rosenbaum and Rubin, 1983, 1985; Stuart, 2010), considerably less work has drawn this methodology into NLP. Most work takes one of two approaches. In the first scenario, text may be a confounding variable that needs to be controlled in order to measure the effect of a non-text variable (Roberts et al., Forthcoming; Veitch et al., 2019). For example, Roberts et al. (Forthcoming) examine whether or not papers written by male authors are cited more than ones by female authors, while controlling for the content of the paper. Roberts et al. (Forthcoming) also offer a specific method for matching text, which relies on the output of a topic model. In this work, we use the output of an LSTM, which is generally more appropriate for short text, does not make the simplifying BOW assumption, and scales well to large data sets.
In the second scenario, it may be desirable to control for non-text confounds before analyzing text. Chandrasekharan et al. (2017) use matching to identify similar users on Reddit before comparing the content that they post. Our work requires both of these perspectives, as the variable we control for (O TEXT) and the outcome we analyze (COMMENT TEXT) are both text. Egami et al. (2018) do consider a similar setting where text is both an outcome and a confound. While their goals differ greatly from ours, our framework is generally consistent with their recommendations.

Limitations and Future Work
While our work serves as an initial approach toward unsupervised detection of comment-level gender bias, we identify several limitations and areas for future work. We begin with limitations within our proposed framework. First, while our results in §5 suggest that adversarial training does help reduce the influence of latent confounding variables, the analysis in §6 suggests that there is scope for improvement. Furthermore, while we focus on some confounds in the data, there may be additional ones that our model does not account for, such as the impact of videos, photos or links shared with O TEXT. Similarly, while our model uses O TEXT for propensity matching in the training data, thus encouraging the model to encode indicators of bias, a model to classify comments as biased or unbiased should also incorporate O TEXT when assessing test data. Additionally, we assume that all comments are directly addressed to OW, but some comments may be addressed to other commenters. Finally, our assumption that human judgements are not reliable for this task makes evaluation difficult, and this work would benefit from additional evaluation metrics.
There are additional avenues for future work beyond our proposed framework. Notably, we focus on the perspective of OW and examine what bias social media users may be exposed to, i.e. what comments men and women might expect to receive in response to their posts. We do not examine why comments addressed toward men and women may differ, whether because the same commenters write different comments to men and women, or because men and women attract comments from different types of people. This perspective would require controlling for traits of the commenter, such as gender, age, and occupation. Nevertheless, our work stands without this perspective: biased comments are harmful to the recipient, regardless of who wrote them.

Conclusions
Our results both demonstrate the usefulness of our approach and motivate unsupervised approaches to bias detection. Bias detection is useful for fostering civil communication on social media, as it can allow recipients to screen out biased comments and avoid reading them. Furthermore, our intention is to detect implicit bias that people may not know they have: identifying these biases and revealing them to social media users could prevent them from posting unintentionally biased comments. More generally, detecting and analyzing bias is a first step towards mitigating it, and we hope our work will encourage future work in this area.