Conceptor Debiasing of Word Representations Evaluated on WEAT

Bias in word representations, such as Word2Vec, has been widely reported and investigated, and efforts have been made to debias them. We apply the debiasing conceptor to post-process both traditional and contextualized word embeddings. Our method can simultaneously remove racial and gender biases from word representations. Unlike standard debiasing methods, the debiasing conceptor can utilize heterogeneous lists of biased words without loss in performance. Finally, our empirical experiments show that the debiasing conceptor diminishes racial and gender bias of word representations as measured using the Word Embedding Association Test (WEAT) of Caliskan et al. (2017).


Introduction
Word embeddings capture distributional similarities and thus inherit demographic stereotypes (Bolukbasi et al., 2016). Such embedding biases tend to track statistical regularities such as the percentage of people with a given occupation (Nikhil Garg and Zou, 2018) but sometimes deviate from them (Bhatia, 2017). Recent work has shown that gender bias also exists in contextualized embeddings (May et al., 2019).
Here, we provide a quantitative analysis of bias in traditional and contextual word embeddings and introduce a method of mitigating bias (i.e., debiasing) using the debiasing conceptor, a clean mathematical representation of subspaces that can be operated on and composed by logic-based manipulations (Jaeger, 2014). Specifically, conceptor negation is a soft damping of the principal components of the target subspace (e.g., the subset of words being debiased) (Liu et al., 2019b) (see Figure 1). Key to our method is how it treats word-association lists (sometimes called target lists), which define the bias subspace. These lists include pre-chosen words associated with a target demographic group (often referred to as a "protected class"). For example, he / she or Mary / John have been used for gender (Bolukbasi et al., 2016). More generally, conceptors can combine multiple subspaces defined by word lists. Unlike most current methods, conceptor debiasing uses a soft, rather than a hard, projection.

Figure 1: BERT word representations of the union of the set of contextualized word representations of relatives, executive, wedding, salary projected onto the first two principal components of the WEAT gender first names, which capture the primary component of gender. (a) The original space. (b) After applying the debiasing conceptor. Note how the debiasing conceptor collapses relatives and wedding, and executive and salary once the bias is removed.
We test the debiasing conceptor on a range of traditional and contextualized word embeddings and examine whether it removes stereotypical demographic biases. All tests have been performed on English word embeddings. This paper contributes the following:
• Introduces debiasing conceptors along with a formal definition and mathematical relation to the Word Embedding Association Test.
• Demonstrates the effectiveness of the debiasing conceptor on both traditional and contextualized word embeddings.

Related Work
NLP has begun tackling the problems that inhibit the achievement of fair and ethical AI (Hovy and Spruit, 2016; Friedler et al., 2016), in part by developing techniques for mitigating demographic biases in models. In brief, a demographic bias is a difference in model output based on gender (either of the data author or of the content itself) or another selected demographic dimension ("protected class") such as race. Demographic biases manifest in many ways, ranging from disparities in tagging and classification accuracy depending on author age and gender (Hovy, 2015; Dixon et al., 2018), to over-amplification of demographic differences in language generation (Yatskar et al., 2016; Zhao et al., 2017), to diverging implicit associations between words or concepts within embeddings or language models (Bolukbasi et al., 2016; Rudinger et al., 2018).
Here, we are concerned with the societal bias towards protected classes that manifests in prejudice and stereotypes (Bhatia, 2017). Greenwald and Banaji (1995) define implicit attitudes as "introspectively unidentified (or inaccurately identified) traces of past experience that mediate favorable or unfavorable feeling, thought, or action toward social objects." Bias is often quantified in people using the Implicit Association Test (IAT) (Greenwald et al., 1998). The IAT records subjects' response times when asked to pair two concepts. Response times are shorter for pairs of concepts that subjects perceive to be similar than for pairs they perceive to be different. In a well-known example, subjects were asked to associate black and white names with "pleasant" and "unpleasant" words. A significant racial bias has been found in many populations. Later, Caliskan et al. (2017) formalized the Word Embedding Association Test (WEAT), which replaces reaction time with word similarity to give a bias measure that does not require the use of human subjects. May et al. (2019) extended WEAT to the Sentence Embedding Association Test (SEAT); however, in this paper we instead use token-averaged representations over a corpus.
Debiasing Embeddings. The simplest way to remove bias is to project out a bias direction. For example, Bolukbasi et al. (2016) identify a "gender subspace" using lists of gendered words and then remove the first principal component of this subspace. Subsequent work used both data augmentation and the debiasing of Bolukbasi et al. (2016) to mitigate bias found in ELMo and showed improved performance on coreference resolution. Our work is complementary, as debiasing conceptors can be used in place of hard debiasing. Bolukbasi et al. (2016) also examine a soft debiasing method, but find that it does not perform well. In contrast, our debiasing conceptor performs a successful soft damping of the relevant principal components. To understand why, we first introduce the conceptor method for capturing the "bias subspaces", next formalize bias, and then show WEAT in matrix notation.

Conceptors
As in Bolukbasi et al. (2016), our aim is to identify the "bias subspace" using a set of target words, $\mathcal{Z}$, where $Z$ is the matrix whose columns are their corresponding word embeddings. A conceptor matrix, $C$, is a regularized identity map (in our case, from the original word embeddings to their biased versions) that minimizes

$$\frac{1}{n}\,\lVert Z - CZ \rVert_F^2 + \alpha^{-2}\,\lVert C \rVert_F^2 \qquad (1)$$

where $n$ is the number of target words and $\alpha^{-2}$ is a scalar parameter. To describe matrix conceptors, we draw heavily on (Jaeger, 2014; He and Jaeger, 2018; Liu et al., 2019b,a). $C$ has a closed form solution:

$$C = \tfrac{1}{n} Z Z^\top \left( \tfrac{1}{n} Z Z^\top + \alpha^{-2} I \right)^{-1} \qquad (2)$$

Intuitively, $C$ is a soft projection matrix on the linear subspace where the word embeddings $Z$ have the highest variance. Once $C$ has been learned, it can be 'negated' by subtracting it from the identity matrix and then applied to any word embeddings to shrink the bias directions. Conceptors can represent laws of Boolean logic, such as NOT ($\neg$), AND ($\wedge$) and OR ($\vee$). For two conceptors $C$ and $B$, we define the following operations:

$$\neg C := I - C \qquad (3)$$
$$C \wedge B := \left( C^{-1} + B^{-1} - I \right)^{-1} \qquad (4)$$
$$C \vee B := \neg\left( \neg C \wedge \neg B \right) \qquad (5)$$

Among these Boolean operations, two are critical for this paper: the NOT operator for debiasing, and the OR operator for multi-list (or multi-category) debiasing. It can be shown that if $C$ and $B$ are of equal sizes, then $C \vee B$ is the conceptor computed from the union of the two sets of sample points from which $C$ and $B$ are computed (Jaeger, 2014); this is not true if they are of different sizes.
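A minimal NumPy sketch may help make the conceptor and its Boolean operations concrete. It assumes the standard forms from Jaeger (2014): C = R (R + α⁻² I)⁻¹ with R = Z Zᵀ / n, NOT C = I − C, C AND B = (C⁻¹ + B⁻¹ − I)⁻¹, and OR obtained through de Morgan's law. The function names, the one-column-per-word layout, and the default α = 2 are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def conceptor(Z, alpha=2.0):
    """Conceptor of the columns of Z (k x n): C = R (R + alpha^-2 I)^-1, R = Z Z^T / n."""
    k, n = Z.shape
    R = Z @ Z.T / n  # uncentered covariance of the target embeddings (assumed normalization)
    return R @ np.linalg.inv(R + alpha ** -2 * np.eye(k))


def NOT(C):
    """Negated conceptor: I - C."""
    return np.eye(C.shape[0]) - C


def AND(C, B):
    """C AND B = (C^-1 + B^-1 - I)^-1; this simple form assumes C and B are invertible."""
    I = np.eye(C.shape[0])
    return np.linalg.inv(np.linalg.inv(C) + np.linalg.inv(B) - I)


def OR(C, B):
    """C OR B = NOT(NOT C AND NOT B).

    NOT C always has strictly positive eigenvalues, so the inverses
    needed here exist even when C itself is rank deficient.
    """
    return NOT(AND(NOT(C), NOT(B)))
```

With these definitions, `NOT(OR(C_pronouns, C_names))` would be a debiasing conceptor built from two word lists (the variable names are hypothetical).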
Negated Conceptor. Given that the conceptor, $C$, represents the subspace of maximum bias, we want to apply the negated conceptor, NOT $C$ (see Equation 3), to an embedding space and remove its bias. We call NOT $C$ the debiasing conceptor. More generally, if we have $K$ conceptors, $C_i$, derived from $K$ different word lists, we call NOT $(C_1 \vee \dots \vee C_K)$ a debiasing conceptor. The negated conceptor matrix has been used in the past on a complete vocabulary to increase the semantic richness of its word embeddings; Liu et al. (2018) showed that the negated conceptor gave better performance on semantic similarity and downstream tasks than the hard debiasing method of Mu and Viswanath (2018). As shown in Liu et al. (2018), the negated conceptor approach does a soft debiasing by shrinking each principal component of the covariance matrix of the target word embeddings, $ZZ^\top$. The shrinkage is a function of the conceptor hyper-parameter $\alpha$ and the singular values $\sigma_i$ of $ZZ^\top$: $\frac{\alpha^{-2}}{\sigma_i + \alpha^{-2}}$.
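The shrinkage factor can be checked with a short derivation. The sketch below assumes the closed form C = R (R + α⁻² I)⁻¹ of Jaeger (2014), where R stands for the covariance matrix of the target embeddings (under whichever normalization is used) and U, σ_i are its eigenvectors and eigenvalues; the symbols R and U are introduced here only for illustration.

```latex
\[
\begin{aligned}
R &= U \,\mathrm{diag}(\sigma_1, \dots, \sigma_k)\, U^\top, \\
C &= R \left( R + \alpha^{-2} I \right)^{-1}
   = U \,\mathrm{diag}\!\left( \frac{\sigma_i}{\sigma_i + \alpha^{-2}} \right) U^\top, \\
\neg C &= I - C
   = U \,\mathrm{diag}\!\left( \frac{\alpha^{-2}}{\sigma_i + \alpha^{-2}} \right) U^\top .
\end{aligned}
\]
```

Directions with σ_i much larger than α⁻² are scaled by a factor near 0, while directions with σ_i much smaller than α⁻² are left almost unchanged (factor near 1). A hard projection would instead use a factor of exactly 0 on the removed components and exactly 1 on the rest.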

Formalizing Bias
We follow the formal definition of Lu et al. (2018), where given a class of word sets $D$ and a scoring function $s$, the bias of $s$ under the concept(s) tested by $D$, written $B_s(D)$, is the expected difference in scores assigned to expected absolute bias across class members. This naturally gives rise to a large set of concepts and scoring functions.

Word Embedding Association Test
The Word Embedding Association Test (WEAT), as proposed by Caliskan et al. (2017), is a statistical test analogous to the Implicit Association Test (IAT) (Greenwald et al., 1998) which helps quantify human biases in textual data. WEAT uses the cosine similarity between word embeddings, which is analogous to the reaction time when subjects are asked to pair two concepts they find similar in the IAT. WEAT considers two sets of target words and two sets of attribute words of equal size. The null hypothesis is that there is no difference between the two sets of target words and the sets of attribute words in terms of their relative similarities, measured as the cosine similarity between the embeddings. For example, consider the target sets as words representing Career and Family and let the two sets of attribute words be Male and Female, in that order. The null hypothesis states that Career and Family are equally similar (mathematically, in terms of the mean cosine similarity between the word representations) to each of the words in the Male and Female word lists. The WEAT test statistic measures the differential association of the two sets of target words with the attribute. The "effect size" is a normalized measure of how separated the two distributions are.
To ground this, we cast WEAT in our formulation, where $\mathcal{X}$ and $\mathcal{Y}$ are two sets of target words (concretely, $\mathcal{X}$ might be Career words and $\mathcal{Y}$ Family words) and $\mathcal{A}$, $\mathcal{B}$ are two sets of attribute words ($\mathcal{A}$ might be female names and $\mathcal{B}$ male names) assumed to associate with the bias concept(s). WEAT is computed from the pairwise similarities $s(x, y) = \cos(\mathrm{vec}(x), \mathrm{vec}(y))$, where $\mathrm{vec}(x) \in \mathbb{R}^k$ is the $k$-dimensional word embedding for word $x$. Note that for this definition of WEAT, the cardinality of the sets must be equal, so $|\mathcal{A}| = |\mathcal{B}|$ and $|\mathcal{X}| = |\mathcal{Y}|$. Our conceptor formulation given below relaxes this assumption.

To motivate our conceptor formulation, we further generalize WEAT to capture the covariance between the target word and the attribute word embeddings. First, let $X$, $Y$, $A$ and $B$ be matrices whose columns are the word embeddings corresponding to the words in the sets $\mathcal{X}$, $\mathcal{Y}$, $\mathcal{A}$, $\mathcal{B}$, respectively (i.e., the two sets of target words and the two sets of attribute words). Formally, taking $\mathcal{X}$ without loss of generality, let $X = [x_i]_{i \in I}$, where $I$ is an index set with cardinality $|\mathcal{X}|$ and $x_i = \mathrm{vec}(x)$ for the word $x$ indexed at the $i$th value of the index set. 4 WEAT can then be written in a generalized matrix form, GWEAT, in terms of the Frobenius norm $\lVert \cdot \rVert_F$. If the embeddings are unit length, then GWEAT is the same as $|\mathcal{X}|$ times WEAT. 5

Suppose we want to mitigate bias by applying the $k \times k$ bias mitigating matrix, $G = \neg C$, which optimally removes bias from any matrix of word embeddings. We select $G$ to minimize this generalized measure after debiasing. Since the conceptor, $C$, is calculated using the word embeddings of $\mathcal{Z} = \mathcal{X} \cup \mathcal{Y}$, the negated conceptor will mitigate the variance from the target sets, which hopefully identifies the most important bias directions.
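As a point of reference, the WEAT test statistic and effect size can be sketched as follows. This follows the definitions in Caliskan et al. (2017) rather than the generalized matrix form discussed above: each word w gets an association score s(w, A, B), the mean cosine similarity to A minus the mean cosine similarity to B. The function names, the one-column-per-word layout, and the use of the sample standard deviation (`ddof=1`) are assumptions made for this sketch.

```python
import numpy as np


def _unit_columns(M):
    """Scale every column of M to unit Euclidean length."""
    return M / np.linalg.norm(M, axis=0, keepdims=True)


def association(W, A, B):
    """s(w, A, B) for each column w of W: mean cosine to A minus mean cosine to B."""
    W, A, B = _unit_columns(W), _unit_columns(A), _unit_columns(B)
    return (W.T @ A).mean(axis=1) - (W.T @ B).mean(axis=1)


def weat_statistic(X, Y, A, B):
    """Test statistic: sum of s(x, A, B) over X minus sum of s(y, A, B) over Y."""
    return association(X, A, B).sum() - association(Y, A, B).sum()


def weat_effect_size(X, Y, A, B):
    """Effect size d: difference of mean associations over the pooled standard deviation."""
    s_x, s_y = association(X, A, B), association(Y, A, B)
    pooled = np.concatenate([s_x, s_y])
    return (s_x.mean() - s_y.mean()) / pooled.std(ddof=1)  # ddof=1 is an assumption
```

An effect size near 0 indicates that X and Y are about equally associated with A and B; swapping X and Y flips its sign.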

Embeddings
For context-independent embeddings, we used off-the-shelf Fasttext subword embeddings 6 , which were trained with subword information on the Common Crawl (600B tokens), the GloVe embeddings 7 trained on Wikipedia and Gigaword, and word2vec trained on roughly 100 billion words from a Google News dataset. The embeddings used are not centered and normalized to unit length as in Bolukbasi et al. (2016).

4 To clarify, in our notation $x_i \in \mathbb{R}^k$ and $x \in \mathcal{X}$.
5 Our generalization of WEAT is different from Swinger et al. (2018).
6 https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip
7 https://nlp.stanford.edu/projects/glove/
For contextualized embeddings, we used ELMo small, which was trained on the 1 Billion Word Benchmark, approximately 800M tokens of news crawl data from WMT 2011. We also experimented with the state-of-the-art contextual model "BERT-Large, Uncased", which has 24 layers, 1024 hidden units, 16 attention heads, and 340M parameters. BERT is trained on the BooksCorpus (0.8B words) and Wikipedia (2.5B words). We used the last four hidden layers of BERT. We used the Brown Corpus for the word contexts to create instances of the ELMo and BERT embeddings. Only embeddings of English words were used in all tests.
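The token-averaged representations mentioned in Related Work can be illustrated with a small sketch. It assumes that a contextual model has already produced one vector per token occurrence in the corpus; how those vectors are extracted, and how the last four BERT layers are combined, is not specified here and is left out. The function name and input format are hypothetical.

```python
import numpy as np


def average_token_vectors(tokens, vectors):
    """Average the contextual vectors of every occurrence of each word type.

    tokens:  iterable of word strings, one per token occurrence
    vectors: iterable of 1-D arrays, aligned with tokens
    Returns a dict mapping each word type to its mean vector.
    """
    sums, counts = {}, {}
    for token, vector in zip(tokens, vectors):
        vector = np.asarray(vector, dtype=float)
        if token in sums:
            sums[token] = sums[token] + vector
            counts[token] += 1
        else:
            sums[token] = vector.copy()
            counts[token] = 1
    return {token: sums[token] / counts[token] for token in sums}
```

For example, two occurrences of "he" with vectors [1, 0] and [3, 2] would yield the single representation [2, 1].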

WEAT Debiasing Experiments
As described in the Word Embedding Association Test section, WEAT assumes as its null hypothesis that there is no relative bias between the pair of concepts defined as the target words and attribute words. In our experiments, we measure the effect size d (the WEAT score normalized by the standard deviation of differences of attribute words w.r.t. target words) and the one-sided p-value of the permutation test. A higher absolute value of the effect size indicates larger bias between words in the target set with respect to the words in the attribute set. We would like the absolute value of the effect size to be close to zero. Since the p-value measures the likelihood that a random permutation of the attribute words would produce at least the observed test statistic, it should be high (at least 0.05) to indicate lack of bias in the positive direction.
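The one-sided permutation test can be sketched as below. Following Caliskan et al. (2017), this sketch resamples equal-size partitions of the pooled sets X ∪ Y (the sets whose association with A and B is being tested) and reports the fraction of partitions whose test statistic is at least the observed one. The Monte Carlo approximation, the function names, the default number of samples, and the column-per-word layout are assumptions of this illustration.

```python
import numpy as np


def association(W, A, B):
    """s(w, A, B) for each column w of W: mean cosine to A minus mean cosine to B."""
    unit = lambda M: M / np.linalg.norm(M, axis=0, keepdims=True)
    W, A, B = unit(W), unit(A), unit(B)
    return (W.T @ A).mean(axis=1) - (W.T @ B).mean(axis=1)


def weat_p_value(X, Y, A, B, n_samples=10000, seed=0):
    """Approximate one-sided p-value of the WEAT permutation test."""
    rng = np.random.default_rng(seed)
    scores = association(np.hstack([X, Y]), A, B)
    n_x = X.shape[1]
    observed = scores[:n_x].sum() - scores[n_x:].sum()
    total = scores.sum()
    hits = 0
    for _ in range(n_samples):
        chosen = rng.permutation(scores.size)[:n_x]  # random stand-in for X
        # sum over chosen minus sum over the rest = 2 * sum(chosen) - total
        statistic = 2.0 * scores[chosen].sum() - total
        if statistic >= observed - 1e-12:  # small tolerance for rounding
            hits += 1
    return hits / n_samples
```

A small p-value means few random partitions are as extreme as the observed one, which is evidence of bias in the positive direction; after successful debiasing the p-value should rise above 0.05.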
Conceptually, the conceptor should be a soft projection matrix on the linear subspace representing the bias direction. For instance, the subspace representing gender must consist of words which are specific to or in some sense related to gender.
A gender word list might be a set of pronouns which are specific to a particular gender, such as he / she or himself / herself, and gender-specific words representing relationships, like brother / sister or uncle / aunt. We test conceptor debiasing both using the list of such pronouns used by Caliskan et al. (2017) and using a more comprehensive list of gender-specific words that includes gender-specific terms related to occupations, relationships and other commonly used words such as prince / princess and host / hostess. We further tested conceptor debiasing using male and female names such as Aaron / Alice or Chris / Clary. We also tested our method with the combination of all lists. The subspaces were combined in two ways: either by taking the union of all word lists or by applying the OR operator on the three conceptor matrices computed independently. The subspace for racial bias was determined using lists of European American and African American names.
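The two ways of combining word lists can be contrasted in a short sketch: one conceptor computed from the union of all lists, versus the OR of one conceptor per list, followed in both cases by negation. It assumes the standard closed form C = R (R + α⁻² I)⁻¹ with R = Z Zᵀ / n and the de Morgan definition of OR from Jaeger (2014); the function names, the `mode` argument, and the default α = 2 are hypothetical.

```python
import numpy as np


def conceptor(Z, alpha=2.0):
    """C = R (R + alpha^-2 I)^-1 with R = Z Z^T / n, columns of Z being embeddings."""
    R = Z @ Z.T / Z.shape[1]
    return R @ np.linalg.inv(R + alpha ** -2 * np.eye(Z.shape[0]))


def conceptor_or(conceptors):
    """OR of several conceptors: NOT of the AND of their negations."""
    I = np.eye(conceptors[0].shape[0])
    # AND of the negations is (sum_i (I - C_i)^-1 - (K - 1) I)^-1
    acc = I + sum(np.linalg.inv(I - C) - I for C in conceptors)
    return I - np.linalg.inv(acc)


def debias(E, word_lists, alpha=2.0, mode="or"):
    """Apply the debiasing conceptor NOT C to the columns of E.

    word_lists: list of (k x n_i) matrices, one per word list
    mode="union": one conceptor from all lists stacked together
    mode="or":    OR of one conceptor per list
    """
    if mode == "union":
        C = conceptor(np.hstack(word_lists), alpha)
    else:
        C = conceptor_or([conceptor(Z, alpha) for Z in word_lists])
    return (np.eye(E.shape[0]) - C) @ E
```

Under this normalization the two modes differ: the union averages the covariance over all words, so long lists dominate, whereas OR adds the per-list covariances so each list contributes on its own scale.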
We tested target pairs of Science vs. Arts, Math vs. Arts, and Career vs. Family word lists with the attribute of the male vs. female names to test gender debiasing. Similarly, we examined European American names vs. African American names as target pairs with the attribute of pleasant vs. unpleasant to test racial debiasing.
Our findings indicate that expanded lists give better debiasing for word embeddings; however, the results are not as clear for contextualized embeddings. The OR operator on two conceptors describing subspaces of pronouns/nouns and names generally outperforms a union of these words. This further motivates the use of the debiasing conceptor.

Table 7 summarizes the effect size (d) and the one-sided p-value (p) we obtained by running WEAT on each of the word embeddings for racial debiasing; we want d to be close to 0 and p to be larger than 0.05. In this experiment we used the same setup as Caliskan et al. (2017) and compared attribute words of European American / African American names with target words "pleasant" and "unpleasant". In Table 7 we see that racial bias is mitigated in all cases aside from GloVe. Furthermore, for word2vec the associational bias is not significant. We also found that the conceptor nearly always outperforms the hard debiasing methods of Mu and Viswanath (2018) and Bolukbasi et al. (2016).

Gender Debiasing Results
Tables 1, 3 and 5 show the results obtained on gender debiasing between attribute words of "Family" and "Career", "Math" and "Arts", and "Science" and "Arts" with the target words "Male" and "Female", respectively, for the traditional word embeddings. We show the results for all the word representations; however, the method of Bolukbasi et al. (2016) can only be applied to standard word embeddings. We show the results when embeddings are debiased using conceptors computed using different subspaces. It can be seen in the tables that the bias for the conceptor-negated embeddings is significantly less than that of the original embeddings. In the tables, the conceptor debiasing method is compared with the hard debiasing technique proposed by Mu and Viswanath (2018), where the first principal component of the subspace from the embeddings is completely projected off. The debiasing conceptor outperforms the hard debiasing technique in almost all cases. Note that the OR operator cannot be used with the hard debiasing technique and thus is not reported.
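For contrast with the soft damping of the conceptor, the hard debiasing baseline described above can be sketched as an orthogonal projection that removes the first principal direction of the bias word embeddings. Taking the top left singular vector of the uncentered matrix Z as that direction is an assumption of this sketch, as are the function name and the `n_components` parameter.

```python
import numpy as np


def hard_debias(E, Z, n_components=1):
    """Project the top principal direction(s) of Z out of every column of E.

    E: (k x m) embeddings to debias, one column per word
    Z: (k x n) embeddings of the bias word list, one column per word
    """
    U, _, _ = np.linalg.svd(Z, full_matrices=False)
    P = U[:, :n_components]       # orthonormal basis of the removed directions
    return E - P @ (P.T @ E)      # scale factor 0 on P, 1 on everything else
```

The removed directions are eliminated entirely, whereas the debiasing conceptor scales every principal direction by a factor strictly between 0 and 1.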
Similarly, Tables 2, 4 and 6 show a comparison of the effect size and p-value using the hard debiasing technique and conceptor debiasing on contextualized embeddings. It can be seen that conceptor debiasing generally outperforms other methods in mitigating bias (the effect size has a smaller absolute value) with the ELMo embeddings for all the subspaces. The results are less clear for BERT, as observed in Table 6, which we discuss in the following section. Note that combining all subspaces gives a significant reduction in the effect size.

Discussion of BERT Results
One of our most surprising findings is that, unlike in ELMo, the bias in BERT according to WEAT is less consistent than in other word representations; WEAT effect sizes in BERT vary widely across different layers. Furthermore, the debiasing conceptor occasionally creates reverse bias in BERT, suggesting that tuning of the hyper-parameter α may be required. Another possibility is that BERT is capturing multiple concepts, and the presumption that the target lists adequately capture gender or racial attributes is incorrect. This suggests that further study into word lists is called for, along with visualization and end-task evaluation. It should also be noted that our results are in line with those from May et al. (2019).

Retaining Semantic Similarity
In order to understand whether the debiasing conceptor harms the semantic content of the word embeddings, we evaluated conceptor-debiased embeddings on semantic similarity tasks. As done in Liu et al. (2018), we used the seven standard word similarity test sets and report Pearson's correlation. The word similarity sets are: the RG65 (Rubenstein and Goodenough, 1965), the WordSim-353 (WS) (Finkelstein et al., 2002), the rare-words (RW) (Luong et al., 2013), the MEN dataset (Bruni et al., 2014), the MTurk (Radinsky et al., 2011), the SimLex-999 (SimLex) (Hill et al., 2015), and the SimVerb-3500 (Gerz et al., 2016). Table 8 shows that conceptors help in preserving and at times increasing the semantic information in the embeddings. It should be noted that these tasks cannot be applied to contextualized embeddings such as ELMo and BERT, so we do not report these results.

Table 8: Word similarity comparison with conceptor-debiased embeddings using all gender words as the conceptor subspace.
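The word similarity evaluation can be sketched as follows: for each word pair in a test set, compute the cosine similarity of the two embeddings, then report Pearson's correlation between these similarities and the human ratings. The function name, the dictionary-based input format, and the choice to skip pairs with out-of-vocabulary words are assumptions made for this illustration.

```python
import numpy as np


def similarity_correlation(pairs, human_scores, embeddings):
    """Pearson correlation between cosine similarities and human ratings.

    pairs:        list of (word1, word2) tuples
    human_scores: list of human similarity ratings, aligned with pairs
    embeddings:   dict mapping words to 1-D NumPy arrays
    """
    model, gold = [], []
    for (w1, w2), score in zip(pairs, human_scores):
        if w1 not in embeddings or w2 not in embeddings:
            continue  # assumption: out-of-vocabulary pairs are skipped
        u, v = embeddings[w1], embeddings[w2]
        model.append(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
        gold.append(score)
    return np.corrcoef(model, gold)[0, 1]
```

Running this once on the original embeddings and once on the conceptor-debiased embeddings gives the kind of before and after comparison summarized in Table 8.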

Conclusion
We have shown that the debiasing conceptor can successfully debias word embeddings, outperforming previous state-of-the-art 'hard' debiasing methods. Best results are obtained when lists are broken up into subsets of 'similar' words (pronouns, professions, names, etc.), and separate conceptors are learned for each subset and then OR'd. Conceptors for different protected subclasses such as gender and race can be similarly OR'd to jointly debias.
Contextual embeddings such as ELMo and BERT, which give a different vector for each word token, work particularly well with conceptors, since they produce a large number of embeddings; however, further research on tuning conceptors for BERT needs to be done. Finally, we note that embedding debiasing may leave bias which is undetected by measures such as WEAT (Gonen and Goldberg, 2019); thus, all debiasing methods should be tested on end-tasks such as emotion classification and co-reference resolution.