Exploring the Linear Subspace Hypothesis in Gender Bias Mitigation

Bolukbasi et al. (2016) present one of the first gender bias mitigation techniques for word representations. Their method takes pre-trained word representations as input and attempts to isolate a linear subspace that captures most of the gender bias in the representations. As judged by an analogical evaluation task, their method virtually eliminates gender bias in the representations. However, an implicit and untested assumption of their method is that the bias subspace is actually linear. In this work, we generalize their method to a kernelized, nonlinear version. We take inspiration from kernel principal component analysis and derive a nonlinear bias isolation technique. We discuss some of the practical drawbacks of our method for non-linear gender bias mitigation in word representations, show how to overcome them, and analyze empirically whether the bias subspace is actually linear. Our analysis shows that gender bias is in fact well captured by a linear subspace, justifying the assumption of Bolukbasi et al. (2016).


Introduction
Pre-trained word representations are a necessity for strong performance on modern NLP tasks. Such representations now serve as input to neural methods (Goldberg, 2017), which recently have become the standard models in the field. However, because pre-trained representations are constructed from large, human-created corpora, they naturally contain societal biases encoded in that data; gender bias is among the most well-studied of these biases (Caliskan et al., 2017). Both a-contextual word representations (Mikolov et al., 2013; Pennington et al., 2014) and contextual word representations (Peters et al., 2018; Devlin et al., 2019) have been shown to encode gender bias (Bolukbasi et al., 2016; Caliskan et al., 2017; Zhao et al., 2019; May et al., 2019; Karve et al., 2019). More importantly, bias in pre-trained representations has been shown to influence models for downstream tasks where they are used as input, e.g., coreference resolution (Rudinger et al., 2018; Zhao et al., 2018). Bolukbasi et al. (2016) present one of the first methods for detecting and mitigating gender bias in word representations. They provide a novel linear-algebraic approach that post-processes word representations in order to partially remove gender bias. Under their evaluation, they find they can nearly perfectly remove bias in an analogical reasoning task. However, subsequent work (Gonen and Goldberg, 2019; Hall Maudslay et al., 2019) has indicated that gender bias still lingers in the representations, despite Bolukbasi et al.'s (2016) strong experimental results. In the development of their method, Bolukbasi et al. (2016) make a critical and unstated assumption: gender bias forms a linear subspace of word representation space. In mathematics, linearity is a strong assumption, and there is no reason a priori to expect that a complex and nuanced societal phenomenon, such as gender bias, is represented by a linear subspace.
In this work, we present the first non-linear gender bias mitigation technique for a-contextual word representations. In doing so, we directly test the linearity assumption made by Bolukbasi et al. (2016). Our method is based on the insight that Bolukbasi et al.'s (2016) method bears a close resemblance to principal component analysis (PCA). Just as one can kernelize PCA (Schölkopf et al., 1997), we show that one can kernelize the method of Bolukbasi et al. (2016). Due to the kernelization, the bias is removed in the feature space, rather than in the word representation space. Thus, we also explore pre-image techniques (Mika et al., 1999) to project the bias-mitigated vectors back into R^d.
As previously noted, there are now multiple bias removal methodologies (Zhao et al., 2018, 2019; May et al., 2019) that have succeeded the method of Bolukbasi et al. (2016). Furthermore, Gonen and Goldberg (2019) point out multiple flaws in Bolukbasi et al.'s (2016) bias mitigation technique and the aforementioned methods. Nonetheless, we believe that this method has received sufficient attention from the community that research into its properties is both interesting and useful.
We test our non-linear gender bias mitigation technique in several experiments. First, we consider the Word Embedding Association Test (WEAT; Caliskan et al., 2017); we notice that across five non-linear kernels and convex combinations thereof, there is seemingly no significant difference between the non-linear bias mitigation technique and the linear one. Second, we consider the professions task (Bolukbasi et al., 2016; Gonen and Goldberg, 2019), which measures how word representations for different professions are potentially gender-stereotyped. Again, as with the WEAT evaluation, we find that our non-linear bias mitigation technique performs on par with the linear method. We also consider whether the non-linear gender bias mitigation technique removes indirect bias from the vectors (Gonen and Goldberg, 2019); yet again, we find the non-linear method performs on par with the linear methods. As a final evaluation, we test whether non-linear bias mitigation hurts semantic performance. On SimLex-999 (Hill et al., 2015), we show that similarity estimates between the vectors remain on par with the linear methods. We conclude that much of the gender bias in word representations is indeed captured by a linear subspace, answering this paper's titular question.

Bias as a Linear Subspace
The first step of Bolukbasi et al.'s (2016) technique is the discovery of a subspace B ⊂ R^d that captures most of the gender bias. Specifically, they stipulate that this space is linear. Given word representations that live in R^d, they provide a spectral method for isolating the bias subspace. In this section, we review their approach and show how it is equivalent to principal component analysis (PCA) on a specific design (input) matrix. Then, we introduce and discuss the implicit assumption made by their work; we term this assumption the linear subspace hypothesis and test it in §4.
Hypothesis 1. Gender bias in word representations may be represented as a linear subspace.

Construction of a Bias Subspace
We will assume the existence of a fixed and finite vocabulary V, each element of which is a word w_i. The hard-debiasing approach takes a set of N sets D = {D_n}_{n=1}^N as input. Each set D_n contains words that are considered roughly semantically equivalent modulo their gender; Bolukbasi et al. (2016) call the D_n defining sets. For example, {man, woman} and {he, she} form two such defining sets. We identify each word with a unique integer i for the sake of our indexing notation; indeed, we exclusively reserve the index i for words. We additionally introduce the function f : [|V|] → [N] that maps an individual word to the index of its defining set. In general, the defining sets are not limited to a cardinality of two, but in practice Bolukbasi et al. (2016) exclusively employ defining sets with a cardinality of two in their experiments. Using the sets D_n, Bolukbasi et al. (2016) construct the matrix

C = Σ_{n=1}^{N} Σ_{w_i ∈ D_n} (w_i − µ_n)(w_i − µ_n)^⊤ / |D_n|,   (1)

where we write w_i for the i-th word's representation and the empirical mean vector µ_n is defined as

µ_n = (1/|D_n|) Σ_{w_i ∈ D_n} w_i.   (2)

Bolukbasi et al. (2016) then extract a bias subspace B using the singular value decomposition (SVD). Specifically, they define the bias subspace to be the space spanned by the first k columns of V, where

C = V Λ V^⊤.   (3)

As C is symmetric and positive semi-definite, the SVD is equivalent to an eigendecomposition, as our notation in Eq. (3) shows. We assume the columns of V, the eigenvectors of C, are ordered by the magnitude of their corresponding eigenvalues.
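For concreteness, the construction above can be sketched in NumPy (a minimal illustration, not the authors' implementation; the toy word vectors and the function name are hypothetical):

```python
import numpy as np

def bias_subspace(word_vecs, defining_sets, k=1):
    """Sketch of the bias-subspace construction of Bolukbasi et al. (2016).

    word_vecs: dict mapping word -> vector in R^d
    defining_sets: list of word tuples, e.g. [("he", "she"), ("man", "woman")]
    Returns the top-k eigenvectors of the matrix C as columns of a (d, k) array.
    """
    rows = []
    for D in defining_sets:
        mu = np.mean([word_vecs[w] for w in D], axis=0)   # empirical mean of the set
        for w in D:
            # scale by 1/sqrt(|D_n|) so that C = W^T W matches the 1/|D_n| factor
            rows.append((word_vecs[w] - mu) / np.sqrt(len(D)))
    W = np.stack(rows)                                    # design matrix
    C = W.T @ W                                           # C from Eq. (1)
    eigvals, eigvecs = np.linalg.eigh(C)                  # symmetric PSD -> eigh
    order = np.argsort(eigvals)[::-1]                     # descending eigenvalues
    return eigvecs[:, order[:k]]
```

With the defining sets {he, she} and {man, woman} and a gender direction along the first axis, the recovered subspace is (up to sign) the first coordinate axis.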

Bias Subspace Construction as PCA
As briefly noted by Bolukbasi et al. (2016), their construction can thus be cast as performing principal component analysis (PCA) on a recentered input matrix. We prove this assertion more formally. We first prove that the matrix C may be written as an empirical covariance matrix.
Proposition 1. Suppose |D_n| = 2 for all n. Then we have C = ½ W^⊤ W, where we define the design matrix W ∈ R^{2N×d} whose i-th row is W_{i,•} = w_i − µ_{f(i)}.

Proof. With |D_n| = 2, each term in the definition of C contributes (w_i − µ_{f(i)})(w_i − µ_{f(i)})^⊤ / 2, so

C = ½ Σ_{i=1}^{2N} (w_i − µ_{f(i)})(w_i − µ_{f(i)})^⊤ = ½ W^⊤ W,

where W ∈ R^{2N×d} is defined as above. ■

Next, we show that the matrix is centered, which is a requirement for PCA.
Proposition 2. The matrix W is row-wise centered.

Proof. For |D_n| = 2, the two rows contributed by a defining set D_n = {w_i, w_j} are w_i − µ_n and w_j − µ_n = −(w_i − µ_n), so the rows of W sum to the zero vector. ■
Remark 3. The method of Bolukbasi et al. (2016) may be considered principal component analysis performed on the matrix 2C.
Proof. As the algebra in Proposition 1 and Proposition 2 shows, we may formulate the problem as an SVD of a mean-centered covariance matrix; one view of PCA is a matrix factorization of exactly such a matrix. ■


Bolukbasi et al. (2016)

In this section, we review the bias mitigation technique introduced by Bolukbasi et al. (2016). When possible, we take care to reformulate their method in terms of matrix notation. They introduce a two-step process that neutralizes and equalizes the vectors to mitigate gender bias in the representations. The underlying assumption of their method is that there exists a linear subspace B ⊂ R^d that captures most of the gender bias.

Neutralize
After finding the linear bias subspace B, the gist of Bolukbasi et al.'s (2016) approach is elementary linear algebra. We may decompose any word vector w as the sum of its orthogonal projection onto the bias subspace (the range of the projection) and its orthogonal projection onto the complement of the bias subspace (the null space of the projection), i.e.,

w = w_B + w_⊥B.

We may then re-embed every vector as

w ↦ w − w_B.

We may re-write this in terms of matrix notation in the following manner. Let {v_k}_{k=1}^K be an orthonormal basis for the linear bias subspace B; it may be found by taking the eigenvectors of C that correspond to the top-K eigenvalues with largest magnitude. Then, we define the projection matrix onto the bias subspace as

P_K = Σ_{k=1}^{K} v_k v_k^⊤.

Since P_K P_K = P_K, it follows that the matrix (I − P_K) is a projection matrix onto the complement of B. We can then write the neutralize step using matrices:

W ↦ W (I − P_K).

The matrix formulation of the neutralize step offers a cleaner presentation of what it does: it projects the vectors onto the orthogonal complement of the bias subspace.
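The matrix form of the neutralize step can be sketched in a few lines of NumPy (a minimal illustration under the assumption that B holds an orthonormal basis as columns; the function name is ours):

```python
import numpy as np

def neutralize(W, B):
    """Project word vectors onto the complement of the bias subspace.

    W: (n, d) matrix of word vectors; B: (d, K) orthonormal basis of the bias
    subspace. Implements W' = W (I - P_K) with P_K = sum_k v_k v_k^T.
    """
    d = W.shape[1]
    P = B @ B.T                       # projection onto the bias subspace
    return W @ (np.eye(d) - P)        # remove the bias component from every row
```

For a bias subspace spanned by the first coordinate axis, neutralizing (3, 4) zeroes out the first coordinate and leaves (0, 4).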

Equalize
Bolukbasi et al. (2016) decompose words into two classes: the neutral words, which undergo neutralization as explained above, and the gendered words, some of which receive the equalizing treatment. Given a set of equality sets E = {E_1, . . . , E_L}, which we may view as an extension of the defining sets D, e.g., E_l = {guy, gal}, Bolukbasi et al. (2016) then decompose each word w ∈ E_l into its gendered and neutral components, setting the neutral component to a constant (the projection of the equality set's mean onto the complement of B) and the gendered component to its mean-centered projection onto the bias subspace:

w ↦ ν + (w_B − µ_B) / Z,

where we define the empirical mean of the equality set µ = (1/|E_l|) Σ_{w ∈ E_l} w, its neutral component ν = µ − µ_B, and the "normalizer" Z, which ensures the resulting vector is of unit length. This fact is left unexplained in the original work, but Hall Maudslay et al. (2019) provide a proof in their appendix.
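The equalize step for a single equality set can be sketched as follows (a hypothetical NumPy illustration of the formula above, not the authors' code; it assumes ||ν|| < 1 and a non-zero bias component):

```python
import numpy as np

def equalize(E, B):
    """Sketch of the equalize step for one equality set.

    E: (m, d) array of the set's word vectors; B: (d, K) orthonormal bias basis.
    Each word keeps a shared neutral component nu (the set mean with its bias
    removed) plus a rescaled bias component, so every output has unit norm.
    """
    P = B @ B.T                                    # projection onto the bias subspace
    mu = E.mean(axis=0)                            # empirical mean of the equality set
    nu = mu - mu @ P                               # shared neutral component
    out = []
    for w in E:
        w_b = (w - mu) @ P                         # mean-centered bias component
        # the "normalizer" Z rescales w_b so that ||nu + w_b / Z|| = 1
        Z = np.linalg.norm(w_b) / np.sqrt(1.0 - nu @ nu)
        out.append(nu + w_b / Z)
    return np.stack(out)
```

Because ν lies in the complement of B and w_B − µ_B lies inside B, the two components are orthogonal, which is why the unit-norm guarantee holds.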
Bias as a Non-Linear Subspace

We generalize the framework presented in Bolukbasi et al. (2016) and cast it to a non-linear setting by exploiting its relationship to PCA. The natural extension of Bolukbasi et al. (2016) is thus to kernelize it analogously to Schölkopf et al. (1997), who present the kernelized generalization of PCA. Our approach preserves all the desirable formal properties of the linear method of Bolukbasi et al. (2016).

Adapting the Design Matrix
The idea behind our non-linear bias mitigation technique is based on kernel PCA (Schölkopf et al., 1998). In short, the idea is to map the original word representations w_i ∈ R^d to a higher-dimensional space H via a function Φ : R^d → H. We will consider cases where H is a reproducing kernel Hilbert space (RKHS) with reproducing kernel κ(w_i, w_j) = ⟨Φ(w_i), Φ(w_j)⟩, where ⟨·, ·⟩ denotes the inner product in H. We start the development of the bias mitigation technique in feature space with a modification of the design matrix presented in Eq. (5). In the RKHS setting, the non-linear analogue has rows

W^Φ_{i,•} = Φ(w_i) − µ^Φ_{f(i)},

where we define

µ^Φ_n = (1/|D_n|) Σ_{w_i ∈ D_n} Φ(w_i).

As one can see, this is a relatively straightforward mapping from the set of linear operations to non-linear ones.

Kernel PCA
Using our modified design matrix, we can cast our non-linear bias mitigation technique as a form of kernel PCA. Specifically, we form the matrix

C^Φ = (1/2N) (W^Φ)^⊤ W^Φ.

Our goal is to find the eigenvalues λ_k and their corresponding eigenfunctions V^Φ_k ∈ H by solving the eigenvalue problem

λ V^Φ = C^Φ V^Φ.   (16)

Computing these directly from Eq. (16) is impossible since H's dimension may be prohibitively large or even infinite. However, Schölkopf et al. note that V^Φ_k lies in the span of {Φ(w_i)}_{i=1}^{2N}. This allows us to rewrite Eq. (16) as

V^Φ_k = Σ_{i=1}^{2N} α^k_i Φ(w_i),   (17)

where the coefficients α^k_i ∈ R. Now, by substituting Eq. (17) and Eq. (16) into the respective terms of λ⟨Φ(w_i), V^Φ⟩ = ⟨Φ(w_i), C^Φ V^Φ⟩, Schölkopf et al. (1997) derive a computationally feasible eigendecomposition problem. Specifically, they consider

2N λ α = K α,   (18)

where K_{ij} = ⟨Φ(w_i), Φ(w_j)⟩. Once all the α^k vectors have been estimated, the inner product between an eigenfunction V^Φ_k and a point w can be computed as

⟨V^Φ_k, Φ(w)⟩ = Σ_{i=1}^{2N} α^k_i κ(w_i, w).   (19)

A projection onto the basis {V^Φ_k}_{k=1}^K can then be carried out by applying the projection operator P_K : H → H as follows:

P_K Φ(w) = Σ_{k=1}^{K} ⟨V^Φ_k, Φ(w)⟩ V^Φ_k,   (20)

where K is the number of principal components. The projection operator P_K is analogous to the linear projection matrix P_K introduced in §3.1.
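The eigenproblem and projection above can be sketched generically for any Gram matrix (a minimal NumPy illustration, not the authors' code; it assumes the top-k eigenvalues are strictly positive):

```python
import numpy as np

def kpca_components(K, k=1):
    """Solve the kernel-PCA eigenproblem on an (N, N) Gram matrix K.

    Returns alpha coefficients scaled so that each eigenfunction
    V_k = sum_i alpha_i^k Phi(w_i) has unit norm in the RKHS,
    i.e. alpha^T K alpha = 1.
    """
    eigvals, eigvecs = np.linalg.eigh(K)          # K is symmetric PSD
    order = np.argsort(eigvals)[::-1][:k]         # top-k eigenvalues
    alphas = eigvecs[:, order] / np.sqrt(eigvals[order])
    return alphas                                 # (N, k)

def project(K_train_x, alphas):
    """Inner products <V_k, Phi(x)> = sum_i alpha_i^k kappa(w_i, x).

    K_train_x: (N, m) kernel evaluations between training points and queries.
    """
    return K_train_x.T @ alphas                   # (m, k)
```

With a linear kernel on centered 2D data whose dominant direction is the x-axis, the top eigenfunction's projections recover (up to sign) the x-coordinates of the points.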

Centering Kernel Matrix
We can perform the required mean-centering operations on the design matrix by centering the kernel matrix, in a similar fashion to Schölkopf et al. (1998). For the case of defining sets of size 2, which is what Bolukbasi et al. use in practice, the centered design matrix reduces to pairwise differences:

W^Φ_{i,•} = ½ (Φ(W^(1)_{i,•}) − Φ(W^(2)_{i,•})),

which leads to a very simple re-centering in terms of the Gram matrices:

K = ¼ (K^(11) − K^(12) − K^(21) + K^(22)),   (23)

where K^(ab)_{ij} = κ(W^(a)_{i,•}, W^(b)_{j,•}). In simpler terms, Eq. (23) involves two matrices: W^(1), which is constructed by looping over the defining sets and placing the pairs within the same defining set as adjacent rows, and W^(2), which is constructed in the same way but with the order of each adjacent pair swapped relative to W^(1). Once we have this pairwise-centered Gram matrix K, we can apply the eigendecomposition procedure described in Eq. (18) directly to K. We note that carrying out this procedure with a linear kernel recovers the linear bias subspace of Bolukbasi et al. (2016).

Inner Product Correction (Neutralize)
We now focus on neutralizing and equalizing the inner products in the RKHS, rather than correcting the word representations directly. Just as in the linear case, we can decompose the representation of a word in the RKHS into biased and neutral components, which provides a non-linear equivalent of Eq. (10):

Φ(w) = P_K Φ(w) + (Φ(w) − P_K Φ(w)),

where we write Φ_ntr(w) = Φ(w) − P_K Φ(w) for the neutralized component.

Proposition 4. The corrected inner product in the feature space for two neutralized words z, w is given by

⟨Φ_ntr(z), Φ_ntr(w)⟩ = κ(z, w) − Σ_{k=1}^{K} ⟨V^Φ_k, Φ(z)⟩⟨V^Φ_k, Φ(w)⟩.   (26)

Proof.
Expanding both arguments and applying Eq. (19) and Eq. (20), together with the fact that P_K is self-adjoint and idempotent, yields the result, as derived by Schölkopf et al. (1998). ■

An advantage of this approach is that it does not incur errors due to approximating the pre-image. However, it does not give us back a set of bias-mitigated representations. Instead, it returns a bias-mitigated metric, thus limiting the classifiers and regressors we can use. Eq. (26) provides us with an O(KNd) approach to compute the inner product between words in feature space.
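The corrected inner product of Proposition 4 can be sketched end-to-end (a minimal NumPy illustration under the size-2 defining-set setting, not the authors' code; the kernel, toy vectors, and function name are ours, and we assume the top-k eigenvalues are positive):

```python
import numpy as np

def neutralized_kernel(A, B_, kappa, k=1):
    """Build the corrected inner product <Phi_ntr(w), Phi_ntr(z)>.

    A, B_: (N, d) arrays with the paired words of each defining set;
    kappa: kernel on batches. Returns a function computing
    kappa(w, z) - sum_k <V_k, Phi(w)> <V_k, Phi(z)>.
    """
    # Gram matrix of the centered rows psi_i = (Phi(a_i) - Phi(b_i)) / 2
    Kc = 0.25 * (kappa(A, A) - kappa(A, B_) - kappa(B_, A) + kappa(B_, B_))
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:k]
    alphas = eigvecs[:, order] / np.sqrt(eigvals[order])   # unit-norm eigenfunctions

    def inner(w, z):
        def beta(x):                                       # <V_k, Phi(x)> for all k
            psi_dot = 0.5 * (kappa(A, x[None]) - kappa(B_, x[None]))[:, 0]
            return alphas.T @ psi_dot
        return kappa(w[None], z[None])[0, 0] - beta(w) @ beta(z)

    return inner
```

With a linear kernel and a bias direction along the x-axis, the corrected inner product of (3, 4) and (1, 2) equals the dot product of their y-components, 8, matching the linear neutralize step.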

Inner Product Correction (Equalize)
To equalize, we may naturally convert Eq. (11) to its feature-space equivalent. We define an equalizing function

θ_eq(w) = ν^Φ_{g(w)} + (P_K Φ(w) − P_K µ^Φ_{g(w)}) / Z,

where we define the feature-space set mean µ^Φ_l = (1/|E_l|) Σ_{w ∈ E_l} Φ(w), its neutral component ν^Φ_l = µ^Φ_l − P_K µ^Φ_l, and where g : [|V|] → [L] maps an individual word index to its corresponding equality set index.
Since inner products in the RKHS obey the same geometric properties as dot products in the linear case, the fact that θ_eq(w) has unit norm follows directly from the proof of Proposition 1 in Hall Maudslay et al. (2019), which can be found in their Appendix A.

Proposition 5. For any two vectors w, z in the observed space and their corresponding representations Φ(w), Φ(z) in feature space, the inner product ⟨Φ(w) − P_K Φ(w), P_K Φ(z)⟩ is 0.

Proof.
■

Proposition 6. For a given neutral word w and a word e in an equality set E_l, the inner product ⟨Φ_ntr(w), θ_eq(e)⟩ is invariant across the members of the equality set E_l.

Proof.
Expanding ⟨Φ_ntr(w), θ_eq(e)⟩ yields the claim, where step (i) follows from Proposition 5 and step (ii) follows from Proposition 4. ■

At this point, we have completely kernelized the approach of Bolukbasi et al. (2016). Note that a linear kernel reduces to the method described in Bolukbasi et al. (2016), as we would expect. An initial disadvantage of equalizing via inner product correction, in comparison to Bolukbasi et al. (2016), is that we now require switching between three different inner products at test time depending on whether the words are neutral or not. To overcome this in practice, we neutralize all words and do not use the equalize correction; we nonetheless present it for completeness.

Computing the Pre-Image
As mentioned in the previous section, a drawback of the metric-correction approach is that it does not provide representations we can use directly in downstream tasks: the bias-mitigated representations only exist in feature space. Thus, when it comes to transfer tasks such as classification, we are limited to kernel methods, e.g., support vector machines. One way to resolve this problem is to obtain the pre-image of the corrected representations in the feature space.
Finding the pre-image is a well-studied problem for kernel PCA (Kwok and Tsang, 2004). The goal is to find pre-image mappings Γ, Γ^⊥, Γ^∥ : H → R^d that compute (or approximate) the pre-images of Φ(w_i), Φ^⊥_{P_K}(w_i), and Φ_{P_K}(w_i), respectively. In our case, with the pre-image mappings, the neutralize step from Bolukbasi et al. (2016) becomes

w ↦ Γ^⊥(Φ(w) − P_K Φ(w)).

In general, we will not have access to Γ^⊥, so we fall back on the following approximation scheme.
Additive Decomposition Approach. Following Kandasamy and Yu (2016), we can construct an approximation to Γ that additively decomposes over the direct sum H_K ⊕ H^⊥_K. That is, we assume that the pre-image mappings take the form

Γ(Φ(w)) = Γ^∥(P_K Φ(w)) + Γ^⊥(Φ(w) − P_K Φ(w)).

Given that we always know the pre-image of Φ(w) exists and is w, we can select Γ to return w, resulting in

Γ^⊥(Φ(w) − P_K Φ(w)) = w − Γ^∥(P_K Φ(w)).

We then learn an analytic approximation of Γ^∥ using the method of Bakır et al. (2004). Note that most pre-imaging methods, e.g., Mika et al. (1999) and Bakır et al. (2004), are designed to approximate the pre-image of points in the principal subspace H_K and do not generalize to approximating the mappings Γ and Γ^⊥. This is because such methods explicitly optimize for pre-imaging representations in H_K using points in the training set as examples of their pre-image; for the null space Γ^⊥, we have no such examples.
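As a concrete point of reference for pre-image computation, the classic fixed-point iteration of Mika et al. (1999) for the RBF kernel can be sketched as follows (a minimal NumPy illustration for expansions Ψ = Σ_i γ_i Φ(x_i); the initialization strategy and function name are our own choices):

```python
import numpy as np

def mika_preimage(gammas, X, gamma_rbf=1.0, iters=100):
    """Fixed-point pre-image iteration for the RBF kernel (Mika et al., 1999).

    Approximates a point z in R^d whose image Phi(z) is close to the feature-space
    expansion Psi = sum_i gammas[i] * Phi(x_i). A sketch for illustration only.
    """
    z = X[np.argmax(np.abs(gammas))].copy()       # initialize at the heaviest point
    for _ in range(iters):
        k = np.exp(-gamma_rbf * ((X - z) ** 2).sum(-1))   # RBF evaluations at z
        w = gammas * k
        denom = w.sum()
        if abs(denom) < 1e-12:                    # iteration undefined, stop early
            break
        z = (w[:, None] * X).sum(0) / denom       # weighted-mean update
    return z
```

When the expansion is dominated by a single training point, the iteration converges to that point, which is a useful sanity check before applying it to projected representations.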

Experiments and Results
We carry out experiments across a range of benchmarks and statistical tests designed to quantify the underlying bias in word representations (Gonen and Goldberg, 2019). Our experiments focus on quantifying both direct and indirect bias as defined in Gonen and Goldberg (2019) and Hall Maudslay et al. (2019). Furthermore, we also carry out word similarity experiments using the Hill et al. (2015) benchmark in order to verify that our new bias-mitigated spaces still preserve the original properties of word representations (Mikolov et al., 2013).

Experimental Setup
Across all experiments, we apply the neutralize metric correction to all word representations, in contrast to Bolukbasi et al. (2016), where the equalize step is applied to the equality sets E and the neutralize step to a set of neutral words as determined in Bolukbasi et al. (2016). We show in Tab. 3 that applying the equalize step brings no improvement over neutralizing all words. We varied kernel hyper-parameters using a grid search and found that they had little effect on performance; as a result, we used the default initialization strategies suggested in Schölkopf et al. (1998). Unless mentioned otherwise, all experiments use the inner product correction approach introduced in §4.4.

Kernel Variations
The main kernels used throughout our experiments are specified in Tab. 1. We also explored the following compound kernels: (i) convex combinations of the Laplace, radial basis function (RBF), cosine, and sigmoid kernels; (ii) convex combinations of the cosine, RBF, and sigmoid kernels; (iii) convex combinations of the RBF and sigmoid kernels; and (iv) polynomial kernels up to the 4th degree. We report results only for the most fundamental of the explored kernels.

Direct Bias: WEAT
The Word Embedding Association Test (WEAT; Caliskan et al., 2017) is a statistical test, analogous to the implicit association test (IAT) for quantifying human biases in textual data (Greenwald and Banaji, 1995). WEAT computes the difference in relative cosine similarity between two sets of target words X and Y (e.g., careers and family) and two sets of attribute words A and B (e.g., male names and female names). Formally, this quantity is Cohen's d-measure (Cohen, 1992), also known as the effect size:

d = (mean_{x ∈ X} s(x, A, B) − mean_{y ∈ Y} s(y, A, B)) / std_{w ∈ X ∪ Y} s(w, A, B),

where s(w, A, B) = mean_{a ∈ A} cos(w, a) − mean_{b ∈ B} cos(w, b). The higher the measure, the more biased the representations. To quantify the significance of the estimated d, Caliskan et al. (2017) define the null hypothesis that there is no difference between the two sets of target words and the sets of attribute words in terms of their relative similarities (i.e., d = 0). Using this null hypothesis, Caliskan et al. (2017) then carry out a one-sided hypothesis test where failure to reject the null hypothesis (p > 0.05) means that the degree of bias measured by d is not significant. We obtain WEAT scores across different kernels (Tab. 2). We observe that the differences between the linear and the non-linear kernels are small and that, in most cases, the linear kernel has a smaller effect size, indicating a lesser degree of bias in the corrected space. Overall, we conclude that the non-linear kernels do not reduce the bias as measured by WEAT further than the linear kernel. We also experiment with polynomial kernels and obtain similar results, which can be found in Tab. 7 of App. A.

Direct Bias: Professions (Gonen and Goldberg, 2019)

We consider the professions dataset introduced by Bolukbasi et al. (2016) and apply the benchmark defined in Gonen and Goldberg (2019). We find the 100 nearest neighbors of each profession word using the corrected cosine similarity and count the number of male neighbors. We then report the Pearson correlation coefficient between the number of male neighbors for each word and the original bias of that word. The original bias of a word vector w is given by the cosine similarity cos(w, he − she) in the original word representation space. We observe from the results in Tab. 4 that the non-linear kernels yield only marginally different results, which in most cases are slightly worse, i.e., their induced space exhibits marginally higher correlations with the original biased vector space.
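The professions benchmark above can be sketched as follows (a simplified NumPy illustration, not the authors' code; for brevity it labels a neighbor "male" when its original he−she bias is positive, which is a simplification of the original protocol):

```python
import numpy as np

def neighbor_bias_correlation(vecs, prof_idx, he_she, n_neighbors=100):
    """Correlate each profession's original bias cos(w, he - she) with its
    number of male-leaning nearest neighbors.

    vecs: (V, d) array of word vectors; prof_idx: indices of profession words;
    he_she: the he - she difference vector.
    """
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    g = he_she / np.linalg.norm(he_she)
    bias = unit @ g                                   # cos(w, he - she) per word
    male_counts, orig_bias = [], []
    for i in prof_idx:
        sims = unit @ unit[i]                         # cosine similarity to word i
        sims[i] = -np.inf                             # exclude the word itself
        nbrs = np.argsort(sims)[::-1][:n_neighbors]
        male_counts.append((bias[nbrs] > 0).sum())    # "male" = positive bias
        orig_bias.append(bias[i])
    mc = np.array(male_counts, float) - np.mean(male_counts)
    ob = np.array(orig_bias) - np.mean(orig_bias)
    return float(mc @ ob / np.sqrt((mc @ mc) * (ob @ ob)))   # Pearson correlation
```

On a synthetic space with a clean gender axis, a male-leaning and a female-leaning profession yield a perfect positive correlation.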

Indirect Bias
Following Gonen and Goldberg (2019), we build a balanced training set of male and female words from the 5,000 most biased words according to the bias in the original representations, as described in §6.4. We then train an RBF-kernel support vector machine (SVM) classifier (Pedregosa et al., 2011) on a random sample of 1,000 of them (the training set) to predict gender, and evaluate its generalization on the remaining 4,000 (the test set). We can perform classification in our corrected RKHS with any SVM kernel that can be written in the form κ_svm(⟨w, z⟩) or κ_svm(||w − z||²), since we can use the kernel trick in our corrected RKHS, writing κ̃(w, z) = ⟨Φ_ntr(w), Φ_ntr(z)⟩, to compute the inputs to the SVM kernel. In particular,

||Φ_ntr(w) − Φ_ntr(z)||² = κ̃(w, w) − 2 κ̃(w, z) + κ̃(z, z).   (35)

It is clear that the RBF kernel is an example of a kernel that follows Eq. (35).
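The corrected RBF SVM kernel described above can be sketched in a few lines (a minimal NumPy illustration; `kappa_ntr` stands in for any function returning the corrected inner product ⟨Φ_ntr(w), Φ_ntr(z)⟩):

```python
import numpy as np

def corrected_rbf(kappa_ntr, w, z, gamma=1.0):
    """RBF SVM kernel evaluated on neutralized feature-space points.

    kappa_ntr(w, z) must return <Phi_ntr(w), Phi_ntr(z)>; the squared distance
    between neutralized points expands into three corrected inner products.
    """
    d2 = kappa_ntr(w, w) - 2.0 * kappa_ntr(w, z) + kappa_ntr(z, z)
    return np.exp(-gamma * d2)
```

With the identity correction (a plain dot product), this reduces to the usual RBF kernel, which is a convenient sanity check.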
We see that the bias removal induced by non-linear kernels results in a slightly higher classification accuracy (Tab. 5) of gendered words for GoogleNews word2vec representations (Mikolov et al., 2013) and a slightly lower classification accuracy for GloVe representations (Pennington et al., 2014), with the exception of the Laplace kernel, which retains a very high classification accuracy. Overall, for the RBF and sigmoid kernels, there is no improvement over the linear kernel (PCA); the Laplace kernel yields notably worse results than the others, still classifying gendered words at a high accuracy of 91.4% for GloVe representations.

Word Similarity: SimLex-999
The quality of a word vector space is traditionally measured by how well it replicates human judgments of word similarity. We use the SimLex-999 benchmark of Hill et al. (2015), which provides ground-truth similarity judgments produced by 500 native English speakers. We report the Spearman correlation between the similarity scores computed from our corrected representations and the human judgments. We observe that the metric corrections only slightly change the Spearman correlation on SimLex-999 (Tab. 6) relative to the original representation space. We thus conclude that the representation quality is largely preserved.
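The evaluation protocol is straightforward to sketch (a minimal NumPy illustration; the rank correlation below omits tie handling, which suffices for a sketch):

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation between two score arrays (no tie handling)."""
    ra = np.argsort(np.argsort(a)).astype(float)   # ranks of a
    rb = np.argsort(np.argsort(b)).astype(float)   # ranks of b
    ra -= ra.mean()
    rb -= rb.mean()
    return float(ra @ rb / np.sqrt((ra @ ra) * (rb @ rb)))

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

In the evaluation, the model-side scores would be cosine (or corrected inner-product) similarities per word pair, and the other array would hold the SimLex-999 human ratings.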

Conclusion
We offer a non-linear extension of the method presented in Bolukbasi et al. (2016) by connecting its bias subspace construction to PCA and subsequently applying kernel PCA. We contend our extension is natural in the sense that it reduces to the method of Bolukbasi et al. (2016) in the special case of a linear kernel, and in the non-linear case it preserves all the desired linear properties in the feature space. This allows us to provide equivalent constructions of the neutralize and equalize steps. We compare the linear bias mitigation technique to our new kernelized, non-linear version across a suite of tasks and datasets. We observe that our non-linear extensions of Bolukbasi et al. (2016) show no notable performance differences across a set of benchmarks designed to quantify gender bias in word representations. Furthermore, the results in Tab. 7 (App. A) show that gradually increasing the degree of non-linearity again yields no significant change in performance on the WEAT (Caliskan et al., 2017) benchmark. Thus, we provide empirical evidence for the linear subspace hypothesis; our results suggest that representing gender bias as a linear subspace is a suitable assumption. We would like to highlight that our results are specific to our proposed kernelized extensions and do not imply that all non-linear variants of Bolukbasi et al. (2016) will yield similar results. There may very well exist a non-linear technique that works better, but we were unable to find one in this work.

A Polynomial Kernel Results
For experimental completeness, we provide direct bias experiments on WEAT using a range of polynomial kernels.The results are displayed in Tab. 7. The results for the polynomial kernels suggest the same conclusions we arrived at in the main text, i.e., a linear kernel is generally enough.
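The WEAT effect size used throughout these experiments can be sketched as follows (a minimal NumPy illustration of the Caliskan et al. (2017) formula, not the authors' code; rows are assumed to be unit-normalized so dot products equal cosine similarities):

```python
import numpy as np

def weat_effect_size(X, Y, A, B):
    """Cohen's d effect size used by WEAT.

    X, Y: target word matrices; A, B: attribute word matrices (unit-norm rows).
    s(w) = mean_a cos(w, a) - mean_b cos(w, b);
    d = (mean_X s - mean_Y s) / std_{X u Y} s.
    """
    def s(W):
        return (W @ A.T).mean(1) - (W @ B.T).mean(1)   # relative similarity scores
    sx, sy = s(X), s(Y)
    pooled = np.concatenate([sx, sy])
    return float((sx.mean() - sy.mean()) / pooled.std(ddof=1))
```

A positive d indicates that the X targets lean toward the A attributes relative to the Y targets.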
maps a defining set index to a tuple containing the word indices in the corresponding defining set, and π_1, π_2 : [|V|] × [|V|] → [|V|] are projection operators which return the first and second element of a tuple, respectively.

Figure 1 :
Figure 1: Pre-image problem illustration for the neutralized representations (null-space).The plane represents the bias subspace in the RKHS.

Figure 2 :
Figure 2: 2D toy example of non-linear component removal using Kernel PCA and the pre-image (neutralize step) described in §5.

Table 1 :
Different kernels used throughout experiments.

Table 2 :
WEAT results using GloVe and Google News word representations.

Table 3 :
Effect of the equalize step.

Table 5 :
Classification accuracy results on male versus female terms.