It’s All in the Name: Mitigating Gender Bias with Name-Based Counterfactual Data Substitution

This paper treats gender bias latent in word embeddings. Previous mitigation attempts rely on the operationalisation of gender bias as a projection over a linear subspace. An alternative approach is Counterfactual Data Augmentation (CDA), in which a corpus is duplicated and augmented to remove bias, e.g. by swapping all inherently-gendered words in the copy. We perform an empirical comparison of these approaches on the English Gigaword and Wikipedia, and find that whilst both successfully reduce direct bias and perform well in tasks which quantify embedding quality, CDA variants outperform projection-based methods at the task of drawing non-biased gender analogies by an average of 19% across both corpora. We propose two improvements to CDA: Counterfactual Data Substitution (CDS), a variant of CDA in which potentially biased text is randomly substituted to avoid duplication, and the Names Intervention, a novel name-pairing technique that vastly increases the number of words being treated. CDA/S with the Names Intervention is the only approach which is able to mitigate indirect gender bias: following debiasing, previously biased words are significantly less clustered according to gender (cluster purity is reduced by 49%), thus improving on the state-of-the-art for bias mitigation.


Introduction
Gender bias describes an inherent prejudice against a gender, captured both by individuals and larger social systems. Word embeddings, a popular machine-learnt semantic space, have been shown to retain gender bias present in corpora used to train them (Caliskan et al., 2017). This results in gender-stereotypical vector analogies à la Mikolov et al. (2013), such as man:computer programmer :: woman:homemaker (Bolukbasi et al., 2016), and such bias has been shown to materialise in a variety of downstream tasks, e.g. coreference resolution (Rudinger et al., 2018; Zhao et al., 2018).
By operationalising gender bias in word embeddings as a linear subspace, Bolukbasi et al. (2016) are able to debias with simple techniques from linear algebra. Their method, which we refer to as Word Embedding Debiasing (WED), successfully mitigates direct bias: man is no longer more similar to computer programmer in vector space than woman. However, the structure of gender bias in vector space remains largely intact, and the new vectors still evince indirect bias: associations between words that are not explicitly gendered which nonetheless result from gender bias, for example a possible association between football and business resulting from their mutual association with explicitly masculine words (Gonen and Goldberg, 2019). In this paper we continue the work of Gonen and Goldberg, and show that another paradigm for gender bias mitigation proposed by Lu et al. (2018), Counterfactual Data Augmentation (CDA), is also unable to mitigate indirect bias. We also show, using a new test we describe (non-biased gender analogies), that WED might be removing too much gender information, casting further doubt on its operationalisation of gender bias as a linear subspace.
To improve CDA we make two proposals. The first, Counterfactual Data Substitution (CDS), is designed to avoid text duplication in favour of substitution. The second, the Names Intervention, is a method which can be applied to either CDA or CDS, and treats bias inherent in first names. It does so using a novel name pairing strategy that accounts for both name frequency and gender-specificity. Using our improvements, the clusters of the most biased words exhibit a reduction of cluster purity by an average of 49% across both corpora following treatment, thereby offering a partial solution to the problem of indirect bias as formalised by Gonen and Goldberg (2019). Additionally, although one could expect that the debiased embeddings might suffer performance losses in computational linguistic tasks, our embeddings remain useful for at least two such tasks, word similarity and sentiment classification (Le and Mikolov, 2014).

Related Work
The measurement and mitigation of gender bias rely on the chosen operationalisation of gender bias. As a direct consequence, how researchers choose to operationalise bias determines both the techniques at their disposal to mitigate the bias and the yardstick by which success is determined.

Word Embedding Debiasing
One popular method for the mitigation of gender bias, introduced by Bolukbasi et al. (2016), measures the genderedness of words by the extent to which they point in a gender direction. Suppose we embed our words into $\mathbb{R}^d$. The fundamental assumption is that there exists a linear subspace $B \subset \mathbb{R}^d$ that contains (most of) the gender bias in the space of word embeddings. (Note that $B$ is a direction when it is spanned by a single vector.) We term this assumption the gender subspace hypothesis. Thus, by basic linear algebra, we may decompose any word vector $v \in \mathbb{R}^d$ as the sum of the projections onto the bias subspace and its complement: $v = v_B + v_{\perp B}$. The (implicit) operationalisation of gender bias under this hypothesis is, then, the magnitude of the bias vector $\|v_B\|_2$.
To capture $B$, Bolukbasi et al. (2016) first construct two sets, $D_{\text{male}}$ and $D_{\text{female}}$, containing the male- and female-oriented words, using a set of gender-definitional pairs, e.g., man-woman and husband-wife. They then define $D = D_{\text{male}} \cup D_{\text{female}}$ as the union of the two sets. They compute the empirical covariance matrix $C = \sum_{w \in D} (w - \mu)(w - \mu)^{\top}$, where $\mu$ is the mean embedding of the words in $D$; $B$ is then taken to be the $k$ eigenvectors of $C$ associated with the largest eigenvalues. Bolukbasi et al. set $k = 1$, and thus define a gender direction. Using this operationalisation of gender bias, Bolukbasi et al. go on to provide a linear-algebraic method (Word Embedding Debiasing, WED, originally "hard debiasing") to remove gender bias in two phases: first, for non-gendered words, the gender direction is removed ("neutralised"). Second, pairs of gendered words such as mother and father are made equidistant to all non-gendered words ("equalised"). Crucially, under the gender subspace hypothesis, it is only necessary to identify the subspace $B$, as it is possible to perfectly remove the bias under this operationalisation using tools from numerical linear algebra.
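To make the construction concrete, the following is a minimal NumPy sketch of the subspace estimation and the neutralise step as described above. It is an illustration rather than Bolukbasi et al.'s implementation: the function names, the toy three-dimensional vectors, and the omission of a normalising constant in the covariance matrix (which does not change its eigenvectors) are assumptions made here, and the equalise step is not shown.

```python
import numpy as np

def gender_subspace(definitional_vectors, k=1):
    """Return the k leading eigenvectors (as rows) of the covariance of D."""
    D = np.asarray(definitional_vectors, dtype=float)
    centred = D - D.mean(axis=0)            # subtract the mean embedding mu
    C = centred.T @ centred                 # unnormalised empirical covariance
    _, eigvecs = np.linalg.eigh(C)          # eigh returns ascending eigenvalues
    return eigvecs[:, ::-1][:, :k].T        # largest eigenvalues first

def decompose(v, B):
    """Split v into its projection onto B and the orthogonal remainder."""
    v_B = B.T @ (B @ v)                     # rows of B are orthonormal
    return v_B, v - v_B

def neutralise(v, B):
    """Drop the component of v inside B and rescale to unit length."""
    _, v_perp = decompose(v, B)
    return v_perp / np.linalg.norm(v_perp)

# Toy vectors (invented for illustration): man, woman, husband, wife.
D = [[1.0, 0.2, 0.0], [-1.0, 0.2, 0.0], [0.9, 0.0, 0.3], [-0.9, 0.0, 0.3]]
B = gender_subspace(D, k=1)
v = np.array([0.5, 0.5, 0.5])               # some non-gendered word
v_B, v_perp = decompose(v, B)
bias_magnitude = np.linalg.norm(v_B)        # the quantity ||v_B||_2
```

In this toy example the gender direction coincides with the first coordinate axis, so the bias magnitude of the example word is simply the absolute value of its first coordinate.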
The method uses three sets of words or word pairs: 10 definitional pairs (used to define the gender direction), 218 gender-specific seed words (expanded to a larger set using a linear classifier, the complement of which is neutralised in the first step), and 52 equalise pairs (equalised in the second step). The relationships among these sets are illustrated in Figure 1; for instance, gender-neutral words are defined as all words in an embedding that are not gender-specific.
Bolukbasi et al. find that this method results in a 68% reduction of stereotypical analogies as identified by human judges. However, bias is removed only insofar as the operationalisation allows. In a comprehensive analysis, Gonen and Goldberg (2019) show that the original structure of bias in the WED embedding space remains intact.

Counterfactual Data Augmentation
As an alternative to WED, Lu et al. (2018) propose Counterfactual Data Augmentation (CDA), in which a text transformation designed to invert bias is performed on a text corpus, the result of which is then appended to the original, to form a new bias-mitigated corpus used for training embeddings. Several interventions are proposed: in the simplest, occurrences of words in 124 gendered word pairs are swapped. For example, 'the woman cleaned the kitchen' would (counterfactually) become 'the man cleaned the kitchen' as man-woman is on the list. Both versions would then together be used in embedding training, in effect neutralising the man-woman bias.
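The simplest intervention can be sketched in a few lines of Python. This is a hedged illustration, not Lu et al.'s code: the three pairs below are a stand-in for the full list of gendered word pairs (which is not reproduced here), documents are assumed to be pre-tokenised and lowercased, and all function names are hypothetical.

```python
# Illustrative subset only; the full list of gendered word pairs is not reproduced here.
GENDER_PAIRS = [("man", "woman"), ("he", "she"), ("king", "queen")]

def build_swap_table(pairs):
    """Map each word of a pair to its counterpart, in both directions."""
    table = {}
    for a, b in pairs:
        table[a] = b
        table[b] = a
    return table

def swap_gendered(tokens, table):
    """Replace every listed word by its counterpart; leave other tokens as they are."""
    return [table.get(tok, tok) for tok in tokens]

def cda(corpus, pairs=GENDER_PAIRS):
    """Return the original documents followed by their counterfactual copies."""
    table = build_swap_table(pairs)
    return corpus + [swap_gendered(doc, table) for doc in corpus]

corpus = [["the", "woman", "cleaned", "the", "kitchen"]]
augmented = cda(corpus)  # original sentence plus its swapped copy
```

Note that the output is twice the size of the input, since every document appears once in its original and once in its swapped form.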
The grammar intervention, Lu et al.'s improved intervention, uses coreference information to veto swapping gender words when they corefer to a proper noun. This avoids Elizabeth ... she ... queen being changed to, for instance, Elizabeth ... he ... king. It also uses POS information to avoid ungrammaticality related to the ambiguity of her between personal pronoun and possessive determiner. In the context 'her teacher was proud of her', this results in the correct sentence 'his teacher was proud of him'.
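The role of POS information in resolving the ambiguity of her can be illustrated as follows. This is a sketch under stated assumptions: the input is assumed to arrive already tagged with Penn Treebank tags (PRP for personal pronouns, PRP$ for possessives) by some upstream tagger, the swap table covers only a handful of pronouns, and the coreference veto is not modelled.

```python
# Keys are (lowercased word, Penn Treebank tag); values are the swapped form.
# The table is a hypothetical, deliberately tiny subset.
PRONOUN_SWAPS = {
    ("her", "PRP$"): "his",   # possessive: "her teacher" -> "his teacher"
    ("her", "PRP"): "him",    # object pronoun: "proud of her" -> "proud of him"
    ("his", "PRP$"): "her",
    ("him", "PRP"): "her",
    ("she", "PRP"): "he",
    ("he", "PRP"): "she",
}

def swap_tagged(tagged_tokens):
    """Swap pronouns using the tag to pick the grammatical counterpart."""
    return [PRONOUN_SWAPS.get((word.lower(), tag), word)
            for word, tag in tagged_tokens]

sentence = [("her", "PRP$"), ("teacher", "NN"), ("was", "VBD"),
            ("proud", "JJ"), ("of", "IN"), ("her", "PRP")]
swapped = swap_tagged(sentence)
```

A lookup keyed on the word alone could not distinguish the two readings of her, which is why the tag is part of the key.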

Improvements to CDA
We prefer the philosophy of CDA over WED as it makes fewer assumptions about the operationalisation of the bias it is meant to mitigate.

Counterfactual Data Substitution
The duplication of text which lies at the heart of CDA will produce debiased corpora with peculiar statistical properties unlike those of naturally occurring text. Almost all observed word frequencies will be even, with a notable jump from 2 directly to 0, and a type-token ratio far lower than predicted by Heaps' Law for text of this length. The precise effect this will have on the resulting embedding space is hard to predict, but we assume that it is preferable not to violate the fundamental assumptions of the algorithms used to create embeddings.
As such, we propose to apply substitutions probabilistically (with 0.5 probability), which results in a non-duplicated counterfactual training corpus, a method we call Counterfactual Data Substitution (CDS). Substitutions are performed on a per-document basis in order to maintain grammaticality and discourse coherence. This simple change should have advantages in terms of naturalness of text and processing efficiency, as well as theoretical foundation.
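The difference from CDA can be seen in a short sketch. It assumes that "per-document" means a single random decision for each document, after which either the whole document is swapped or it is left untouched; this reading, the function names, and the seed parameter are assumptions made for illustration.

```python
import random

def substitute(doc, swap_table):
    """Swap every listed word in one tokenised document."""
    return [swap_table.get(tok, tok) for tok in doc]

def cds(corpus, swap_table, p=0.5, seed=0):
    """Swap each document as a whole with probability p; never duplicate it."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    return [substitute(doc, swap_table) if rng.random() < p else list(doc)
            for doc in corpus]

table = {"he": "she", "she": "he", "man": "woman", "woman": "man"}
corpus = [["he", "ran"], ["she", "sat"], ["the", "man", "slept"]]
treated = cds(corpus, table)  # same number of documents as the input
```

Because each document appears exactly once in the output, the treated corpus has the same size as the original, and repeated interventions would not cause it to grow.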

The Names Intervention
Our main technical contribution in this paper is to provide a method for better counterfactual augmentation, which is based on bipartite-graph matching of names. Instead of Lu et al.'s (2018) solution of not treating words which corefer to proper nouns in order to maintain grammaticality, we propose an explicit treatment of first names. This is because we note that as a result of not swapping the gender [...] If only frequency were considered, a less gender-specific name (e.g. Taylor) could be paired with a very gender-specific name (e.g. John), which would negate the gender intervention in many cases (namely whenever a male occurrence of Taylor is transformed into John, which would also result in incorrect pronouns, if present). If, on the other hand, only the degree of gender-specificity were considered, we would see frequent names (like James) being paired with far less frequent names (like Sybil), which would distort the overall frequency distribution of names. This might also result in the retention of a gender signal: for instance, swapping a highly frequent male name with a rare female name might simply make the rare female name behave as a new link between masculine contexts (instead of the original male name), as it rarely appears in female contexts. Figure 3 shows a plot of various names' number of primary gender 4 occurrences against their secondary gender occurrences, with red dots for primary-male and blue crosses for primary-female names. 5 The problem of finding name-pairs thus decomposes into a Euclidean-distance bipartite matching problem, which can be solved using the Hungarian method (Kuhn, 1955). We compute pairs for the most frequent 2500 names of each gender in the SSA dataset. There is also the problem that many names are also common nouns (e.g. Amber, Rose, or Mark), which we solve using Named Entity Recognition.
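The matching step can be sketched with SciPy's linear_sum_assignment, which solves the same assignment problem that the Hungarian method addresses. Several details are assumptions made for illustration only: each name is represented as a point (primary-gender count, secondary-gender count), the counts below are invented, and the counts are log-scaled before distances are taken so that very frequent names do not dominate.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pair_names(male, female):
    """Pair names so that the total Euclidean distance between pairs is minimal.

    male, female: dicts mapping name -> (primary_count, secondary_count).
    """
    m_names, f_names = list(male), list(female)
    # Log scaling is an assumption of this sketch, not taken from the text.
    M = np.log1p(np.array([male[n] for n in m_names], dtype=float))
    F = np.log1p(np.array([female[n] for n in f_names], dtype=float))
    # cost[i, j] = distance between male name i and female name j
    cost = np.linalg.norm(M[:, None, :] - F[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    return {m_names[r]: f_names[c] for r, c in zip(rows, cols)}

# Invented counts: (occurrences as primary gender, occurrences as secondary gender).
male = {"john": (10000, 10), "taylor": (500, 400)}
female = {"mary": (9000, 12), "jordan": (450, 380)}
pairs = pair_names(male, female)
```

With these toy counts the frequent, highly gender-specific names are paired with each other, and likewise the two less frequent, less gender-specific names, which is the behaviour that matching on both dimensions is meant to produce.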

Experimental Setup
We compare eight variations of the mitigation methods. CDA is our reimplementation of Lu et al.'s (2018) naïve intervention, gCDA uses their grammar intervention, and nCDA uses our new Names Intervention. gCDS and nCDS are variants of the grammar and Names Intervention using CDS. WED40 is our reimplementation of Bolukbasi et al.'s (2016) method, which (like the original) uses a single component to define the gender subspace, accounting for > 40% of variance. As this is much lower than in the original paper (where it was 60%, reproduced in Figure 4), we define a second space, WED70, which uses a 2D subspace accounting for > 70% of variance. To test whether WED profits from additional names, we use the 5000 paired names in the names gazetteer as additional equalise pairs (nWED70). 6 As control, we also evaluate the unmitigated space (none).
4 Defined as its most frequently occurring gender.
5 The hatched area demarcates an area of the graph where no names can exist: if any name did then its primary and secondary gender would be reversed and it would belong to the alternate set.
Figure 3: Bipartite matching of names by frequency and gender-specificity
We perform an empirical comparison of these bias mitigation techniques on two corpora, the Annotated English Gigaword (Napoles et al., 2012) and Wikipedia. Wikipedia is of particular interest, since although its Neutral Point of View (NPOV) policy 7 stipulates that all content should be presented without bias, women are nonetheless less likely to be deemed "notable" than men of equal stature (Reagle and Rhue, 2011), and there are differences in the choice of language used to describe them (Bamman and Smith, 2014; Graells-Garrido et al., 2015). We use the annotation native to the Annotated English Gigaword, and process Wikipedia with CoreNLP (statistical coreference; bidirectional tagger). Embeddings are created using Word2Vec. 8 We use the original complex lexical input (gender-word pairs and the like) for each algorithm as we assume that this benefits each algorithm most. Expanding the set of gender-specific words for WED (following Bolukbasi et al., using a linear classifier) resulted in 2141 such words on Gigaword and 7146 on Wikipedia. 9
6 We use the 70% variant as preliminary experimentation showed that it was superior to WED40.
7 https://en.wikipedia.org/wiki/Wikipedia:Neutral_point_of_view
8 A CBOW model was trained over five epochs to produce 300 dimensional embeddings. Words were lowercased, punctuation other than underscores and hyphens removed, and tokens with fewer than ten occurrences were discarded.
9 We modify or remove some phrases from the training data not included in the vocabulary of our embeddings.
In our experiments, we test the degree to which the spaces are successful at mitigating direct and indirect bias, as well as the degree to which they can still be used in two NLP tasks standardly performed with embeddings, word similarity and sentiment classification. We also introduce one further, novel task, which is designed to quantify how well the embedding spaces capture an understanding of gender using non-biased analogies.
Our evaluation matrix and methodology are described below.

Direct bias
Caliskan et al. (2017) introduce the Word Embedding Association Test (WEAT), which provides results analogous to earlier psychological work by Greenwald et al. (1998) by measuring the difference in relative similarity between two sets of target words X and Y and two sets of attribute words A and B. We compute Cohen's d (a measure of the difference in relative similarity of the word sets within each embedding; higher is more biased), and a one-sided p-value which indicates whether the bias detected by WEAT within each embedding is significant (the best outcome being that no such bias is detectable). We do this for three tests proposed by Nosek et al. (2002) which measure the strength of various gender stereotypes: arts-maths, arts-sciences, and careers-family.

Indirect bias
To demonstrate indirect gender bias we adapt a pair of methods proposed by Gonen and Goldberg (2019). First, we test whether the most-biased words prior to bias mitigation remain clustered following bias mitigation. To do this, we define a new subspace, $b_{\text{test}}$, using the 23 word pairs used in the Google Analogy family test subset (Mikolov et al., 2013) following Bolukbasi et al.'s (2016) method, and determine the 1000 most biased words in each corpus (the 500 words most similar to each of $b_{\text{test}}$ and $-b_{\text{test}}$) in the unmitigated embedding. For each debiased embedding we then project these words into 2D space with tSNE (van der Maaten and Hinton, 2008), compute clusters with k-means, and calculate the clusters' V-measure (Rosenberg and Hirschberg, 2007). Low values of cluster purity indicate that biased words are less clustered following bias mitigation.
Second, we test whether a classifier can be trained to reclassify the gender of debiased words. If it succeeds, this would indicate that bias information still remains in the embedding. We trained an RBF-kernel SVM classifier on a random sample of 1000 out of the 5000 most biased words from each corpus using $b_{\text{test}}$ (500 from each gender), then report the classifier's accuracy when reclassifying the remaining 4000 words.
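For reference, the WEAT effect size used in the direct bias test can be sketched as follows, based on the definition of Caliskan et al. (2017): each target word receives an association score (its mean cosine similarity to A minus its mean cosine similarity to B), and d is the difference of the mean scores of X and Y divided by the standard deviation of the scores over both target sets. The use of the sample standard deviation and the toy two-dimensional vectors are assumptions of this sketch; the permutation-based p-value is not shown.

```python
import math
from statistics import mean, stdev

def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def association(w, A, B):
    """How much more similar w is to attribute set A than to attribute set B."""
    return mean(cosine(w, a) for a in A) - mean(cosine(w, b) for b in B)

def weat_effect_size(X, Y, A, B):
    """Effect size d: difference of mean associations, scaled by their spread."""
    s_X = [association(x, A, B) for x in X]
    s_Y = [association(y, A, B) for y in Y]
    return (mean(s_X) - mean(s_Y)) / stdev(s_X + s_Y)

# Toy vectors (invented): targets X lean towards A, targets Y towards B.
A, B = [(1.0, 0.0)], [(-1.0, 0.0)]
X = [(1.0, 0.1), (1.0, -0.1)]
Y = [(-1.0, 0.1), (-1.0, -0.1)]
d = weat_effect_size(X, Y, A, B)
```

A positive d means that X is more strongly associated with A (and Y with B); swapping the two target sets flips the sign.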
Word similarity
The quality of a space is traditionally measured by how well it replicates human judgements of word similarity. The SimLex-999 dataset (Hill et al., 2015) provides a ground-truth measure of similarity produced by 500 native English speakers. Similarity scores in an embedding are computed as the cosine of the angle between word-vector pairs, and the Spearman correlation between embedding and human judgements is reported. We measure correlative significance at α = 0.01.

Sentiment classification
Following Le and Mikolov (2014), we use a standard sentiment classification task to quantify the downstream performance of the embedding spaces when they are used as a pretrained word embedding input (Lau and Baldwin, 2016) to Doc2Vec on the Stanford Large Movie Review dataset. The classification is performed by an SVM classifier using the document embeddings as features, trained on 40,000 labelled reviews and tested on the remaining 10,000 documents, reported as error percentage.

Non-biased gender analogies
When proposing WED, Bolukbasi et al. (2016) use human raters to class gender-analogies as either biased (woman:housewife :: man:shopkeeper) or appropriate (woman:grandmother :: man:grandfather), and postulate that whilst biased analogies are undesirable, appropriate ones should remain. Our new analogy test uses the 506 analogies in the family analogy subset of the Google Analogy Test set (Mikolov et al., 2013) to define many such appropriate analogies that should hold even in a debiased environment, such as boy:girl :: nephew:niece. 12 We use a proportional pair-based analogy test, which measures each embedding's performance when drawing a fourth word to complete each analogy, and report error percentage.
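An analogy test of this kind can be sketched as below. The exact scoring rule is not spelled out above, so this sketch assumes the common vector-offset formulation (the answer is the vocabulary word closest in cosine similarity to b - a + c, excluding the three query words); the toy embedding and all names are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def complete_analogy(a, b, c, emb):
    """Return the word d that best completes a:b :: c:d under the offset rule."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

def analogy_error(analogies, emb):
    """Percentage of analogies whose predicted fourth word is wrong."""
    wrong = sum(complete_analogy(a, b, c, emb) != d for a, b, c, d in analogies)
    return 100.0 * wrong / len(analogies)

# Toy embedding (invented): first coordinate loosely encodes gender.
emb = {
    "boy": (1.0, 1.0, 0.0), "girl": (-1.0, 1.0, 0.0),
    "nephew": (1.0, 0.0, 1.0), "niece": (-1.0, 0.0, 1.0),
    "table": (0.0, 1.0, 1.0),
}
error = analogy_error([("boy", "girl", "nephew", "niece")], emb)
```

If a mitigation method removed the gender coordinate entirely, nephew and niece would collapse onto the same point and this kind of analogy could no longer be completed reliably.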

Results
Direct bias
Table 1 presents the d scores and WEAT one-tailed p-values, which indicate whether the difference in sample means between targets X and Y and attributes A and B is significant. We also compute a two-tailed p-value to determine whether the difference between the various sets is significant. 13 On Wikipedia, nWED70 outperforms every other method (p < 0.01), and even at α = 0.1 bias was undetectable. In all CDA/S variants, the Names Intervention performs significantly better than other intervention strategies (average d for nCDS across all tests 0.95 vs. 1.39 for the best non-names CDA/S variants). Excluding the Wikipedia careers-family test (in which the CDA and CDS variants are indistinguishable at α = 0.01), the CDS variants are numerically better than their CDA counterparts in 80% of the test cases, although many of these differences are not significant.
12 The entire Google Analogy Test set contains 19,544 analogies, which are usually reported as a single result or as a pair of semantic and syntactic results.
13 Throughout this paper, we test significance in the differences between the embeddings with a two-tailed Monte Carlo permutation test at significance level α = 0.01 with r = 10,000 permutations.
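A two-tailed Monte Carlo permutation test of the kind mentioned here can be sketched generically. The statistic actually permuted in the experiments is not specified above, so this sketch assumes the simplest case, a difference in means between two samples of scores; the add-one correction in the p-value and the function name are likewise choices made for illustration.

```python
import random
from statistics import mean

def permutation_test(xs, ys, r=10_000, seed=0):
    """Two-tailed Monte Carlo p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(xs) - mean(ys))
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(r):
        rng.shuffle(pooled)                 # random relabelling of the scores
        diff = abs(mean(pooled[:len(xs)]) - mean(pooled[len(xs):]))
        if diff >= observed:                # at least as extreme, either direction
            hits += 1
    return (hits + 1) / (r + 1)             # add-one correction avoids p = 0

# Toy scores (invented): two clearly separated samples.
low = [float(i) for i in range(10)]
high = [100.0 + i for i in range(10)]
p_separated = permutation_test(low, high, r=2000)
p_identical = permutation_test(low, low, r=2000)
```

Comparing the returned p-value with α = 0.01 then gives the significance decision; identical samples yield p = 1, while well-separated samples yield a p-value close to the smallest value the number of permutations allows.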
Generally, we notice a trend of WED reducing direct gender bias slightly better than CDA/S. Impressively, WED even successfully reduces bias in the careers-family test, where gender information is captured by names, which were not in WED's gender-equalise word-pair list for treatment.

Indirect bias
Figure 5 shows the V-measures of the clusters of the most biased words in Wikipedia for each embedding. Gigaword patterns similarly (see appendix). Figure 6 shows example tSNE projections for the Gigaword embeddings ("V" refers to their V-measures; these examples were chosen as they represent the best results achieved by Bolukbasi et al.'s (2016) method, Lu et al.'s (2018) method, and our new names variant). On both corpora, the new nCDA and nCDS techniques have significantly lower purity of biased-word clusters than all other evaluated mitigation techniques (0.420 for nCDS on Gigaword, which corresponds to a reduction of purity by 58% compared to the unmitigated embedding, and 0.609 (39%) on Wikipedia). nWED70's V-measure is significantly higher than either of the other Names variants (reduction of 11% on Gigaword, only 1% on Wikipedia), suggesting that the success of nCDS and nCDA is not merely due to their larger list of gender-words. Figure 7 shows the results of the second test of indirect bias, and reports the accuracy of a classifier trained to reclassify previously gender biased words on the Wikipedia embeddings (Gigaword patterns similarly). These results reinforce the finding of the clustering experiment: once again, nCDS outperforms all other methods significantly on both corpora (p < 0.01), although it should be noted that the successful reclassification rate remains relatively high (e.g. 88.9% on Wikipedia).
We note that nullifying indirect bias associations entirely is not necessarily the goal of debiasing, since some of these may result from causal links in the domain. For example, whilst associations between man and engineer and between man and car are each stereotypic (and thus could be considered examples of direct bias), an association between engineer and car might well have little to do with gender bias, and so should not be mitigated.

Word similarity
It should be noted that since SimLex-999 was produced by human raters, it will reflect the human biases these methods were designed to remove, so worse performance might result from successful bias mitigation.

Sentiment classification
Figure 8 shows the sentiment classification error rates for Wikipedia (Gigaword patterns similarly). Results are somewhat inconclusive. While WED70 significantly improves the performance of the sentiment classifier from the unmitigated embedding on both corpora (p < 0.05), the improvement is small (never more than 1.1%). On both corpora, nothing outperforms WED70 or the Names Intervention variants.

Non-biased gender analogies
Figure 9 shows the error rates for non-biased gender analogies for Wikipedia. CDA and CDS are numerically better than the unmitigated embeddings (an effect which is always significant on Gigaword, shown in the appendices, but sometimes insignificant on Wikipedia). The WED variants, on the other hand, perform significantly worse than the unmitigated sets on both corpora (27.1% vs. 9.3% for the best WED variant on Gigaword; 18.8% vs. 8.7% on Wikipedia). WED thus seems to remove too much gender information, whilst CDA and CDS create an improved space, perhaps because they reduce the effect of stereotypical associations which were previously used incorrectly when drawing analogies.

Conclusion
We have replicated two state-of-the-art bias mitigation techniques, WED and CDA, on two large corpora, Wikipedia and the English Gigaword. In our empirical comparison, we found that although both methods mitigate direct gender bias and maintain the interpretability of the space, WED failed to maintain a robust representation of gender (the best variants had an average error rate of 23% when drawing non-biased analogies, suggesting that too much gender information was removed). A new variant of CDA we propose (the Names Intervention) is the only one to successfully mitigate indirect gender bias: following its application, previously biased words are significantly less clustered according to gender, with an average of 49% reduction in cluster purity when clustering the most biased words. We also proposed Counterfactual Data Substitution, which generally performed better than the CDA equivalents, was notably quicker to compute (as Word2Vec is linear in corpus size), and in theory allows for multiple intervention layers without a corpus becoming exponentially large.
A fundamental limitation of all the methods compared is their reliance on predefined lists of gender words, in particular of pairs. Lu et al.'s pairs of manager::manageress and murderer::murderess may be counterproductive, as their augmentation method perpetuates a male reading of manager, which has become gender-neutral over time. Other issues arise from differences in spelling (e.g. mum vs. mom) and morphology (e.g. his vs. her and hers). Biologically-rooted terms like breastfeed or uterus do not lend themselves to pairing either. The strict use of pairings also imposes a gender binary, and as a result non-binary identities are all but ignored in the bias mitigation literature.
Future work could extend the Names Intervention to names from other languages beyond the US-based gazetteer used here. Our method only allows for there to be an equal number of male and female names, but if this were not the case one ought to explore the possibility of a many-to-one mapping, or perhaps a probabilistic approach (though difficulties would be encountered sampling simultaneously from two distributions, frequency and gender-specificity). A mapping between nicknames (not covered by administrative sources) and formal names could be learned from a corpus for even wider coverage, possibly via the intermediary of coreference chains. Finally, given that names have been used in psychological literature as a proxy for race (e.g. Greenwald et al.), the Names Intervention could also be used to mitigate racial biases (something which, to the authors' best knowledge, has never been attempted), but finding pairings could prove problematic. It is important that other work looks into operationalising bias beyond the subspace definition proposed by Bolukbasi et al. (2016), as it is becoming increasingly evident that gender bias is not linear in embedding space.